Albuquerque Real Estate Multiple Regression Model

Brandon Rozek

July 12, 2017

Introduction

Albuquerque, New Mexico is a city of thriving culture. Its population has grown drastically within the past several years, and it is a hotspot for New Mexico. Albuquerque’s one-of-a-kind southwestern culture is visible all around the city, from the quaint shops, Pueblo- and Spanish-inspired architecture, and world-famous cuisine to the music and art of the city. As the largest city in the state, it has a wide variety of homes, whether within the city proper or on the outskirts. Many people flock to the city for the weather. It has a mildly dry climate with roughly 310 days of sunshine each year, making it a literal hotspot for people who enjoy the outdoors and warm weather. Few natural disasters threaten everyday life in the city, making it a safe place to settle. Albuquerque has a wide variety of diverse living communities, and its housing market is vast. Because there is little potential for natural disasters, homes carry less risk of weather damage, which helps the market grow. Albuquerque also has one of the lowest costs of living in the country, which makes it a great place to invest in the housing market.

The goal of this project is to model home prices in the city of Albuquerque, New Mexico. In the dataset given, the predictors available are square footage, age of the home, number of features, amount in taxes, and several binary variables. The binary variables indicate whether the home has a custom design, whether it is in a corner location, and whether it is located in the northeast quadrant of the city.

Descriptive Statistics

Our dataset features a representative sample of 117 homes in Albuquerque, New Mexico. From this dataset, we can describe the key patterns that arise and build an intuition for what is to come in our inferential statistics section. It is important to build this intuition since it helps when generalizing to a population.

Distribution of Data

To understand the boxplot of interactions shown in Figure 1, we first need to establish what each of the indices under the x-axis represents. The first index indicates whether or not the home is located in the northeast quadrant of Albuquerque, New Mexico. The second index tells us whether or not the home has a custom design. Finally, the third index indicates whether or not the home is located on the corner of an intersection. From the boxplot of interactions, we can see that homes with a custom design tend to have a higher price than homes without one. No other patterns are readily apparent from this figure in terms of how the different combinations of factors affect a home’s price.

Figure 2 consists of multiple boxplots, each comparing the price distribution across the levels of one categorical variable. From this we can reconfirm that having a custom design is associated with a higher price distribution. We also see that the northeast quadrant has a larger variation in price. There does not appear to be any notable difference in the price of homes depending on whether they are located on a corner. Finally, the more features a home has, the larger the variation in price becomes.

Figure 3 shows a scatterplot matrix of our data. From the matrix we can see the possibility of a linear correlation between price and square footage, as well as between price and tax. There is no evidence, however, of a linear correlation between price and the age of the home. This suggests that age may not be a significant factor in our model.

Figure 4 contains histograms of the different quantitative variables. As shown in the figure, the histograms are all right skewed. This shows that the majority of homes are in the lower price range, while a few homes in the upper price range might skew our model. Figure 5 shows the distribution of the data in terms of the categorical variables. It is interesting to note that the number of features follows a Gaussian-like curve that is unimodal and roughly symmetric around 4 features. There are more homes that do not have a custom design and do not lie on a corner than ones that do. In this dataset, there are also more homes in the northeast quadrant than outside it. Figure 6 shows the distribution of taxes to be right skewed as well.

Effect/Interaction Plots

In the following section we consider the effect of the different categorical variables on the price of homes in Albuquerque, both individually and in combination. The effect plots are used primarily to build an intuition for an individual variable’s effect on the price of a home. Interaction plots, in contrast, show how one variable influences another variable’s effect on the price of a home. If the lines in an interaction plot intersect, it suggests that one categorical variable has a notable influence on another’s effect on the price of a home.

The main effect plots in Figure 7 show the mean price of a home at the different levels of each factor in the dataset. Similar to what was discovered in the boxplot figures, having a custom design increases the price of a home drastically compared to the increases associated with the other factors. Having a home in a non-corner location slightly decreases the value of a home on average. Finally, the Feature Effect plot shows that the price of a home increases with the number of features. An interesting exception is the case of 8 features, where the average price of a home decreases.

Figures 8, 9, and 10 show the interaction plots for different combinations of categorical variables. In Figures 8 and 9, the lines do not intersect, so we can conclude that being in the northeast has no apparent influence on how custom design affects the price of a home, nor on how corner location affects the price. Figure 10, however, shows an interaction between corner location and custom design. This suggests that corner location has an influence on how custom design affects the price.
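The idea behind an interaction plot can also be sketched numerically as a difference of differences between cell means. The snippet below is a minimal illustration in Python, not the code used for the figures; the `homes` records are invented values rather than rows from the Albuquerque dataset, and `cell_mean` and `interaction_contrast` are hypothetical helper names.

```python
from statistics import mean

# Hypothetical records: (price, custom design flag, corner flag).
# These values are invented for illustration, not taken from the dataset.
homes = [
    (100.0, 0, 0), (110.0, 0, 0),
    (90.0, 0, 1), (100.0, 0, 1),
    (150.0, 1, 0), (170.0, 1, 0),
    (120.0, 1, 1), (130.0, 1, 1),
]


def cell_mean(records, cust, cor):
    """Mean price of homes with the given custom and corner flags."""
    return mean(price for price, c, k in records if c == cust and k == cor)


def interaction_contrast(records):
    """Effect of custom design off a corner minus its effect on a corner.

    A value near zero corresponds to roughly parallel lines in an
    interaction plot; a value far from zero suggests an interaction.
    """
    effect_non_corner = cell_mean(records, 1, 0) - cell_mean(records, 0, 0)
    effect_corner = cell_mean(records, 1, 1) - cell_mean(records, 0, 1)
    return effect_non_corner - effect_corner


contrast = interaction_contrast(homes)
print(contrast)
```

In this made-up example the custom design premium is larger for non-corner homes than for corner homes, which is the kind of pattern an interaction plot displays as lines with different slopes.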

Missing Data

In the dataset provided, 44% of the homes are missing some data. Missing data, like outliers, should not be ignored. To deal with missing data, one needs to understand its implications for the analysis. From there, we can consider several techniques for handling it. One of the most important things to check is whether the data is missing at random. If the data is not missing at random, the sample becomes less representative, making it harder to generalize to a population.

In the dataset provided, two variables contain missing data. Figure 11 shows that 42% of the homes are missing the age variable and 9% are missing the tax variable. The right side of Figure 11 shows the pattern of missing data in our set: 35% of the homes are missing only the age variable, 7% are missing both the age and tax variables, and only 2% are missing the tax variable alone.

Figure 12 shows the missing data with respect to quantitative variables such as price, square footage, and tax. As seen in the figure, the missing data is spread evenly throughout the distribution. Figure 13 shows the missing data with respect to the categorical variables in our dataset. As seen in the figure, the proportion of missing data in the binary variables is about the same for each level. For the number of features, however, a higher proportion of data is missing among homes with fewer features than among homes with more features. Looking at the scatterplot matrix of missing values in Figure 14, we can see that the missing values in red are spread evenly among the observed values in blue. Missing data with respect to age was not considered, since the majority of the missing data comes from the age variable.
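The missingness pattern summary can be reproduced with a simple tally. The following is a minimal Python sketch under stated assumptions: the `rows` list is invented for illustration (it is not the Albuquerque data), missing values are assumed to be encoded as `None`, and `missing_pattern` and `pattern_shares` are hypothetical helper names.

```python
from collections import Counter

# Hypothetical rows; None marks a missing value. Invented for illustration.
rows = [
    {"AGE": 12, "TAX": 800},
    {"AGE": None, "TAX": 650},
    {"AGE": None, "TAX": None},
    {"AGE": 30, "TAX": None},
    {"AGE": 5, "TAX": 900},
]


def missing_pattern(row, columns):
    """Return the tuple of column names that are missing in this row."""
    return tuple(col for col in columns if row[col] is None)


def pattern_shares(rows, columns):
    """Share of rows falling into each missingness pattern."""
    counts = Counter(missing_pattern(row, columns) for row in rows)
    return {pattern: count / len(rows) for pattern, count in counts.items()}


shares = pattern_shares(rows, ["AGE", "TAX"])
# Share of rows missing AGE at all = AGE-only pattern + AGE-and-TAX pattern.
age_missing = sum(share for pattern, share in shares.items() if "AGE" in pattern)
print(shares, age_missing)
```

The same bookkeeping ties the reported percentages together: 35% + 7% gives the 42% of homes missing age, 7% + 2% gives the 9% missing tax, and 35% + 7% + 2% gives the 44% of homes missing some data.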

Inferential Statistics

Now that we have described the sample, we can turn to generalizing our observations to the population. In this section, we define the hypotheses of interest in our model and then consider the different techniques that lead up to the final model and analysis.

Hypotheses

The main hypothesis of interest is whether the model is significant.
➢ H0: There is no linear association between the price of homes in Albuquerque, New Mexico and any of the following: square footage, age of the home, tax, number of features, corner location, northeast location, or custom design.
➢ HA: There is a linear association between the price of homes in Albuquerque, New Mexico and at least one of the following predictors: square footage, age of the home, tax, number of features, corner location, northeast location, or custom design.

Afterwards, we consider whether square footage, age, tax, number of features, corner location, northeast location, and custom design are individually significant predictors in the model.
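In terms of regression coefficients, these hypotheses can be written as follows. This is a sketch that assumes each of the seven predictors enters the model with a single coefficient; the symbols \beta_j, p, n, SSR (regression sum of squares), and SSE (error sum of squares) are notation introduced here, not taken from the original analysis.

```latex
\begin{align*}
H_0 &: \beta_1 = \beta_2 = \cdots = \beta_7 = 0 \\
H_A &: \beta_j \neq 0 \text{ for at least one } j \in \{1, \dots, 7\} \\
F &= \frac{\mathrm{SSR}/p}{\mathrm{SSE}/(n - p - 1)}
\end{align*}
```

The individual tests then take the form H_0: \beta_j = 0 against H_A: \beta_j \neq 0 for a single predictor j, typically assessed with the statistic t = \hat{\beta}_j / \mathrm{SE}(\hat{\beta}_j).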

Ignore Missing Data Technique

One of the first regression models we considered simply ignores all observations with missing data. Since 44% of our observations have missing values, this drops our degrees of freedom substantially, to 61. This technique also fails to use all of the available data; using all of it would be preferred, since discarding observations can affect the coefficients of our model. Running the stepwise algorithm to minimize the AIC, we obtain the following model
PRICE = 97.92321 + 0.33638(SQFT) + 0.52308(TAX) + 177.18519(CUST1) − 77.82234(COR1)
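To make the fitted equation concrete, the sketch below evaluates it in Python for a made-up home. It assumes that CUST1 and COR1 are 0/1 indicator variables for custom design and corner location; the example inputs (1500 square feet, 700 in taxes) are hypothetical and `predict_price` is a name chosen here for illustration.

```python
# Coefficients copied from the complete-case model reported above.
COEFFICIENTS = {
    "intercept": 97.92321,
    "SQFT": 0.33638,
    "TAX": 0.52308,
    "CUST1": 177.18519,
    "COR1": -77.82234,
}


def predict_price(sqft, tax, custom, corner):
    """Evaluate the fitted equation; custom and corner are 0/1 indicators."""
    return (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["SQFT"] * sqft
        + COEFFICIENTS["TAX"] * tax
        + COEFFICIENTS["CUST1"] * custom
        + COEFFICIENTS["COR1"] * corner
    )


# Hypothetical home: 1500 square feet, 700 in taxes, not custom, not on a corner.
baseline = predict_price(1500, 700, custom=0, corner=0)
# Same hypothetical home, but with a custom design.
custom_home = predict_price(1500, 700, custom=1, corner=0)
print(baseline, custom_home - baseline)
```

Holding square footage and tax fixed, switching the custom indicator from 0 to 1 shifts the predicted price by exactly the CUST1 coefficient, which is how each indicator coefficient in the model is read.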

Multiple Imputation without Outliers

Removing the outliers from the model and running the stepwise algorithm to minimize the AIC, we obtain
PRICE = 76.47917 + 0.64130(TAX) + 0.27290(SQFT) + 77.58816(CUST2)
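Stepwise selection by AIC compares candidate models and keeps the one with the lowest score. The sketch below illustrates that comparison in Python under stated assumptions: it uses the common least-squares form n ln(RSS/n) + 2k, which may differ by a constant from what the original software reports, and the residual sums of squares, coefficient counts, and sample size are invented values, not results from this analysis.

```python
import math


def linear_model_aic(rss, n_obs, n_params):
    """AIC of a least-squares fit, up to an additive constant.

    Uses n * ln(RSS / n) + 2 * k, where k counts estimated coefficients.
    """
    return n_obs * math.log(rss / n_obs) + 2 * n_params


def pick_lowest_aic(candidates, n_obs):
    """Return the name of the candidate with the smallest AIC and all scores."""
    scores = {
        name: linear_model_aic(rss, n_obs, k)
        for name, (rss, k) in candidates.items()
    }
    return min(scores, key=scores.get), scores


# Hypothetical (RSS, number of coefficients) pairs for three candidate models.
candidates = {
    "all predictors": (1000.0, 8),
    "drop AGE": (1005.0, 7),
    "drop AGE and NE": (1300.0, 6),
}
best, scores = pick_lowest_aic(candidates, n_obs=100)
print(best)
```

In this invented example, dropping one predictor barely increases the residual sum of squares, so the penalty saved outweighs the loss of fit and the smaller model scores lower; dropping a second predictor costs too much fit and scores higher.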