We will illustrate some of the methods outlined during the lectures. Note that we will update the tasks or data as necessary for new methods introduced during upcoming lectures. In some cases, we may need to present an additional dataset to carry out some examples.
Generally, a number of example datasets, along with their definitions, can be found on the Principles of Econometrics book website, or among the R datasets, which are available in both R and Python.
We will examine the dataset br, which contains 1080 home sales in Baton Rouge, LA during mid-2005. The dataset contains the following variables:
price: sale price, in dollars
sqft: total square feet
bedrooms: number of bedrooms
baths: number of full baths
age: age in years
occupancy: Owner = 1, Vacant = 2, Tenant = 3
pool: Yes = 1, No = 0
style: Traditional = 1, Townhouse = 2, Ranch = 3, New Orleans = 4, Mobile Home = 5, Garden = 6, French = 7, Cottage = 8, Contemporary = 9, Colonial = 10, Acadian = 11
fireplace: Yes = 1, No = 0
waterfront: Yes = 1, No = 0
dom: days on the market
# Read the Baton Rouge housing dataset directly from the POE5 website
br <- read.csv(file = "http://www.principlesofeconometrics.com/poe5/data/csv/br.csv", header = TRUE)
head(br)
## price sqft bedrooms baths age occupancy pool style fireplace waterfront dom
## 1 66500 741 1 1 18 1 1 1 1 0 6
## 2 66000 741 1 1 18 2 1 1 0 0 23
## 3 68500 790 1 1 18 1 0 1 1 0 8
## 4 102000 2783 2 2 18 1 0 1 1 0 50
## 5 54000 1165 2 1 35 2 0 1 0 0 190
## 6 143000 2331 2 2 25 1 0 1 1 0 86
import pandas as pd
# Read the same dataset with pandas
br = pd.read_csv("http://www.principlesofeconometrics.com/poe5/data/csv/br.csv")
print(br.head())
## price sqft bedrooms baths ... style fireplace waterfront dom
## 0 66500 741 1 1 ... 1 1 0 6
## 1 66000 741 1 1 ... 1 0 0 23
## 2 68500 790 1 1 ... 1 1 0 8
## 3 102000 2783 2 2 ... 1 1 0 50
## 4 54000 1165 2 1 ... 1 0 0 190
##
## [5 rows x 11 columns]
Let’s say that a contractor is thinking about building housing on a large plot, as well as selling their existing housing stock. The contractor wishes to maximize their profits and is therefore interested in knowing which variables affect the price. For example, should they focus on smaller houses, or are larger houses more profitable?
Below are the tasks that we wish to accomplish (this list will be updated as the lectures progress). We want to determine how the price of a sold home depends on the total square footage, sqft, of the house. To do this, we will begin by examining the scatter plot of our dependent and independent variables. The main idea behind the univariate regression tasks is to understand how and which underlying formulas are applied inside the single-line functions that immediately estimate the regression model parameters and output them alongside other values and test statistics. In other words, the end goal of the Univariate Regression chapters is not only to be able to apply the following functions to estimate the models:
# Estimate the linear model price = beta_0 + beta_1 * sqft + epsilon via OLS
mdl_ols <- lm(price ~ 1 + sqft, data = br)
# Print the full estimation summary
print(summary(mdl_ols))
##
## Call:
## lm(formula = price ~ 1 + sqft, data = br)
##
## Residuals:
## Min 1Q Median 3Q Max
## -366641 -31399 -1535 25601 932272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -60861.462 6110.187 -9.961 <2e-16 ***
## sqft 92.747 2.411 38.476 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79820 on 1078 degrees of freedom
## Multiple R-squared: 0.5786, Adjusted R-squared: 0.5783
## F-statistic: 1480 on 1 and 1078 DF, p-value: < 2.2e-16
import statsmodels.api as sm
# Estimate the same linear model with statsmodels
mdl = sm.OLS(br["price"], sm.add_constant(br["sqft"]))
mdl_ols = mdl.fit()
# Print the full estimation summary
print(mdl_ols.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: price R-squared: 0.579
## Model: OLS Adj. R-squared: 0.578
## Method: Least Squares F-statistic: 1480.
## Date: Thu, 20 Sep 2018 Prob (F-statistic): 1.54e-204
## Time: 07:54:27 Log-Likelihood: -13722.
## No. Observations: 1080 AIC: 2.745e+04
## Df Residuals: 1078 BIC: 2.746e+04
## Df Model: 1
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const -6.086e+04 6110.187 -9.961 0.000 -7.29e+04 -4.89e+04
## sqft 92.7474 2.411 38.476 0.000 88.018 97.477
## ==============================================================================
## Omnibus: 1185.147 Durbin-Watson: 1.886
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 139602.251
## Skew: 5.135 Prob(JB): 0.00
## Kurtosis: 57.743 Cond. No. 6.38e+03
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 6.38e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
but also to understand what (most of) the presented values say about our model. As such, in the following parts of our example we will work through the steps behind the results of these functions, calculating the values manually and explaining their role in our modelling process.
We will begin by looking at the scatter plot of \(Y\) on \(X\) to examine whether there may be any relationship between the variables:
# Scatter plot of price against sqft
plot(x = br[, "sqft"], y = br[, "price"],
xlab = "sqft", ylab = "price", col = "blue")
import matplotlib.pyplot as plt
# Scatter plot of price against sqft
plt.figure(num = 0, figsize = (10, 8))
plt.plot(br["sqft"], br["price"], linestyle = "None", marker = "o",
markerfacecolor = 'None', markeredgecolor = "blue")
plt.xlabel("sqft")
plt.ylabel("price")
plt.show()
It appears that for very large houses the price is also very large, with most values concentrated around 2000 sqft. We can also look at the histograms and boxplots of these variables:
# Boxplots (with and without outliers) and histograms of sqft and price
par(mfrow = c(2, 3))
boxplot(br[, "sqft"])
boxplot(br[, "sqft"], outline = FALSE)
hist(br[, "sqft"], breaks = 20, col = "cornflowerblue")
boxplot(br[, "price"])
boxplot(br[, "price"], outline = FALSE)
hist(br[, "price"], breaks = 20, col = "cornflowerblue")
fig = plt.figure(num = 1, figsize = (10, 8))
fig.add_subplot(2, 3, 1).boxplot(br["sqft"])
fig.add_subplot(2, 3, 2).boxplot(br["sqft"], showfliers = False)
fig.add_subplot(2, 3, 3).hist(br["sqft"], bins = 20, ec = 'black')
plt.title("Histogram of sqft")
fig.add_subplot(2, 3, 4).boxplot(br["price"])
fig.add_subplot(2, 3, 5).boxplot(br["price"], showfliers = False)
fig.add_subplot(2, 3, 6).hist(br["price"], bins = 20, ec = 'black')
plt.title("Histogram of price")
plt.tight_layout()
plt.show()
The box spans from the first quartile (the bottom edge) through the second (the line inside the box) to the third quartile (the top edge), so the box itself represents the interquartile range, \(IQR = Q_3 - Q_1\), i.e. the “middle fifty” of the data. The “whiskers” above and below the box extend to the most extreme observations that are no more than \(1.5 \cdot IQR\) away from the box. The points beyond the whiskers are the outliers of the data. The second boxplot in each row omits these outliers from the plot.
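To make the whisker rule concrete, here is a minimal R sketch computing the whisker bounds for price by hand (note that quantile() uses a slightly different quartile convention than the hinges drawn by boxplot(), so the values may differ marginally):
# Quartiles and IQR of price
q <- quantile(br[, "price"], probs = c(0.25, 0.75))
iqr_price <- q[2] - q[1]
# Whiskers: the most extreme observations within 1.5 * IQR of the box
lower_whisker <- min(br[br[, "price"] >= q[1] - 1.5 * iqr_price, "price"])
upper_whisker <- max(br[br[, "price"] <= q[2] + 1.5 * iqr_price, "price"])
# Points outside [lower_whisker, upper_whisker] are drawn as outliers
print(c(lower_whisker, upper_whisker))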
The correlation between the independent variable sqft and the dependent variable price, as well as their summary statistics, can also be calculated:
# Sample correlation between price and sqft
print(cor(br[, "price"], br[, "sqft"]))
## [1] 0.7606892
summary(br[, "sqft"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 662 1604 2186 2326 2800 7897
summary(br[, "price"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22000 99000 130000 154863 170163 1580000
import numpy as np
# Sample correlation between price and sqft
print(np.corrcoef(br["price"], br["sqft"])[0][1])
## 0.7606891740822463
print(br["sqft"].describe())
## count 1080.000000
## mean 2325.937963
## std 1008.097991
## min 662.000000
## 25% 1604.500000
## 50% 2186.500000
## 75% 2800.000000
## max 7897.000000
## Name: sqft, dtype: float64
print(br["price"].describe())
## count 1.080000e+03
## mean 1.548632e+05
## std 1.229128e+05
## min 2.200000e+04
## 25% 9.900000e+04
## 50% 1.300000e+05
## 75% 1.701625e+05
## max 1.580000e+06
## Name: price, dtype: float64
The correlation between the variables, roughly 0.76, is quite strong.
We can also take a closer look at the data, when \(sqft \leq 4000\):
# Subset of observations with sqft <= 4000
a <- br[, "sqft"] <= 4000
# Scatter plot of the subset
plot(x = br[a, "sqft"], y = br[a, "price"], main = "Scatter plot of data subset",
xlab = "sqft", ylab = "price", col = "blue")
# Subset of observations with sqft <= 4000 (matching the R code above)
a = br["sqft"] <= 4000
plt.figure(num = 2, figsize = (10, 8))
plt.plot(br["sqft"][a], br["price"][a], linestyle = "None", marker = "o",
markerfacecolor = 'None', markeredgecolor = "blue")
plt.xlabel("sqft")
plt.ylabel("price")
plt.title("Scatterplot of data subset")
plt.show()
We can see a bit more clearly that there is a relationship between the square footage of the house and the price for which it was sold.
Initially, we want to estimate a linear regression model: \[
\text{price} = \beta_0 + \beta_1 \cdot \text{sqft} + \epsilon
\] Looking at the scatter plots and thinking about the nature of the data, a larger house should be more expensive. Therefore, we expect that \(\beta_1 > 0\), since sqft measures the size of the house in square feet.
# OLS estimates via the normal equations: beta = (X'X)^(-1) X'y
x_mat <- cbind(1, br[, "sqft"])
beta_mat <- solve(t(x_mat) %*% x_mat) %*% t(x_mat) %*% br[, "price"]
print(t(beta_mat))
## [,1] [,2]
## [1,] -60861.46 92.74737
# The same calculation with numpy
x_mat = np.column_stack((np.ones(len(br["sqft"])), br["sqft"]))
beta_mat = np.dot(np.linalg.inv(np.dot(np.transpose(x_mat), x_mat)),
np.dot(np.transpose(x_mat), br["price"]))
print(beta_mat)
## [-60861.4622215 92.7473744]
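As an aside, explicitly inverting \(\mathbf{X}^\top \mathbf{X}\) works fine here, but solving the normal equations directly is generally preferred numerically; a minimal equivalent sketch in R:
# Solve (X'X) beta = X'y without forming the explicit inverse
beta_alt <- solve(crossprod(x_mat), crossprod(x_mat, br[, "price"]))
This returns the same estimates as above.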
Compared to the initial output from the OLS regression functions, the values are the same:
print(coef(mdl_ols))
## (Intercept) sqft
## -60861.46222 92.74737
print(mdl_ols.params)
## const -60861.462221
## sqft 92.747374
## dtype: float64
We can calculate the standard errors from the variance formula of the OLS coefficient estimates: \[ \begin{aligned} \mathbb{V}{\rm ar} (\widehat{\boldsymbol{\beta}}) = \begin{bmatrix} \mathbb{V}{\rm ar} (\widehat{\beta}_0) & \mathbb{C}{\rm ov} (\widehat{\beta}_0, \widehat{\beta}_1) \\ \mathbb{C}{\rm ov} (\widehat{\beta}_1, \widehat{\beta}_0) & \mathbb{V}{\rm ar} (\widehat{\beta}_1) \end{bmatrix} = \sigma^2 \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \end{aligned} \] and substituting the sample residual variance: \[ \widehat{\sigma}^2 = \dfrac{1}{N-2} \sum_{i = 1}^N \widehat{\epsilon}_i^2 \] to get: \[ \begin{aligned} \widehat{\mathbb{V}{\rm ar}} (\widehat{\boldsymbol{\beta}}) &= \widehat{\sigma}^2 \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \end{aligned} \]
We can then calculate the standard errors as: \[ \text{se}(\widehat{\beta}_i) = \sqrt{\widehat{\mathbb{V}{\rm ar}} (\widehat{\beta}_i)},\quad i = 0, 1 \]
# Fitted values and residuals
y_fit <- beta_mat[1] + beta_mat[2] * br[, "sqft"]
resid <- br[, "price"] - y_fit
# Estimated residual variance and coefficient covariance matrix
sigma_est <- sum(resid^2) / (length(br[, "sqft"]) - 2)
var_beta <- sigma_est * solve(t(x_mat) %*% x_mat)
# Standard errors: square roots of the diagonal elements
print(sqrt(diag(var_beta)))
## [1] 6110.187009 2.410502
# The same calculation in Python
y_fit = beta_mat[0] + beta_mat[1] * br["sqft"]
resid = br["price"] - y_fit
sigma_est = sum(resid**2) / (len(br["sqft"]) - 2)
var_beta = sigma_est * np.linalg.inv(np.dot(np.transpose(x_mat), x_mat))
print(np.sqrt(np.diag(var_beta)))
## [6.11018701e+03 2.41050218e+00]
The initial output from the OLS regression functions has calculated the same values:
print(summary(mdl_ols)$coefficients[, 2, drop = FALSE])
## Std. Error
## (Intercept) 6110.187009
## sqft 2.410502
print(mdl_ols.bse)
## const 6110.187009
## sqft 2.410502
## dtype: float64
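Two more quantities in the summary output can be verified with the objects we have already computed: the "Residual standard error" is just \(\widehat{\sigma} = \sqrt{\widehat{\sigma}^2}\), and each "t value" is the coefficient estimate divided by its standard error. A quick R sketch:
# Residual standard error (reported as 79820 in the R summary above)
print(sqrt(sigma_est))
# t-statistics: estimate divided by standard error
print(beta_mat / sqrt(diag(var_beta)))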
Having estimated the parameters and the standard errors, we can write down the estimated regression as: \[ \widehat{\text{price}}_i = \underset{(6110.187)}{-60861.462} + \underset{(2.411)}{92.747} \cdot \text{sqft}_i \] where the standard errors of the coefficients are provided in parentheses below the estimated coefficients.
# Fitted values, calculated manually from the estimated coefficients
y_fit <- beta_mat[1] + beta_mat[2] * br[, "sqft"]
print(y_fit[1:10])
## [1] 7864.342 7864.342 12408.964 197254.481 47189.229 155332.667 53032.314 53032.314 53032.314 53032.314
y_fit = beta_mat[0] + beta_mat[1] * br["sqft"]
print(y_fit[:10])
## 0 7864.342206
## 1 7864.342206
## 2 12408.963552
## 3 197254.480723
## 4 47189.228950
## 5 155332.667496
## 6 53032.313537
## 7 53032.313537
## 8 53032.313537
## 9 53032.313537
## Name: sqft, dtype: float64
print(mdl_ols$fitted.values[1:10])
## 1 2 3 4 5 6 7 8 9 10
## 7864.342 7864.342 12408.964 197254.481 47189.229 155332.667 53032.314 53032.314 53032.314 53032.314
print(mdl_ols.fittedvalues[:10])
## 0 7864.342206
## 1 7864.342206
## 2 12408.963552
## 3 197254.480723
## 4 47189.228950
## 5 155332.667496
## 6 53032.313537
## 7 53032.313537
## 8 53032.313537
## 9 53032.313537
## dtype: float64
Note: we may sometimes run into cases where the built-in functions are cumbersome to use for calculating the fitted (or, more generally, the forecasted) values. As such, it is very useful to first try to evaluate \(\widehat{Y}\) by hand.
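For example, a fitted value for a new house can be computed directly from the coefficients and compared with R's built-in predict(); a minimal sketch (the 2000 sqft value is an arbitrary illustration):
# Hypothetical new observation: a 2000 sqft house
x_new <- 2000
# Manual fitted value from the estimated coefficients
print(beta_mat[1] + beta_mat[2] * x_new)
# Built-in equivalent for comparison
print(predict(mdl_ols, newdata = data.frame(sqft = x_new)))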
We can now plot the estimated regression:
# Scatter plot with the fitted regression line overlaid
plot(x = br[, "sqft"], y = br[, "price"],
xlab = "sqft", ylab = "price", col = "blue")
# Sort by sqft so the line is drawn from left to right
lines(x = br[, "sqft"][order(br[, "sqft"])], y = y_fit[order(br[, "sqft"])],
col = "red")
plt.figure(num = 3, figsize = (10, 8))
plt.plot(br["sqft"], br["price"], linestyle = "None", marker = "o",
markerfacecolor = 'None', markeredgecolor = "blue")
plt.plot(br["sqft"][np.argsort(br["sqft"])], y_fit[np.argsort(br["sqft"])],
linestyle = "-", color = "red")
plt.xlabel("sqft")
plt.ylabel("price")
plt.show()
Note that we do need to sort the values by the x-axis variable. Otherwise, the line segments would be drawn in the order the points appear in the (unsorted) data array, connecting the first element to the second, the second to the third, and so on.
What can we say about the plot? We see that when sqft is small and when sqft is very large, the fitted regression underestimates the actual price, i.e. \(\widehat{Y}\) tends to fall below \(Y\) in those regions.
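One rough way to check this claim is to average the residuals in the tails of sqft: a positive average residual means the actual prices tend to lie above the fitted line in that region. A sketch in R (the 1000 and 4000 sqft cutoffs are arbitrary illustrative choices):
# Average residual for very small and very large houses;
# positive values indicate the fitted line lies below the data on average
print(mean(resid[br[, "sqft"] < 1000]))
print(mean(resid[br[, "sqft"] > 4000]))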
In this simple linear regression model our estimated slope coefficient \(\widehat{\beta}_1 = 92.747\) shows that for an additional square foot the estimated average (or, expected) selling price of a house increases by \(92.747\) dollars.
In this case we cannot interpret \(\beta_0\), as it is not possible for sqft to be zero. Furthermore, based on the negative sign of \(\widehat{\beta}_0\), we strongly suspect that our intercept coefficient accounts for a bias in our specified model. In the linear specification the expected price per square foot is constant for all values of sqft. However, it would be reasonable to assume that, generally, a larger house, which is already expensive, may command a higher price for an additional square foot than a smaller house.
We will estimate two models: a quadratic model, \[ \text{price} = \beta_0 + \beta_1 \cdot \text{sqft}^2 + \epsilon \] in which sqft has an increasing effect on \(Y\), and a log-linear model, \[ \log(\text{price}) = \beta_0 + \beta_1 \cdot \text{sqft} + \epsilon \] We begin with the quadratic model, estimated manually:
# Quadratic model: regress price on sqft squared
x_mat <- cbind(1, br[, "sqft"]^2)
beta_mat <- solve(t(x_mat) %*% x_mat) %*% t(x_mat) %*% br[, "price"]
print(t(beta_mat))
## [,1] [,2]
## [1,] 55776.57 0.0154213
x_mat = np.column_stack((np.ones(len(br["sqft"])), br["sqft"]**2))
beta_mat = np.dot(np.linalg.inv(np.dot(np.transpose(x_mat), x_mat)),
np.dot(np.transpose(x_mat), br["price"]))
print(beta_mat)
## [5.57765656e+04 1.54213014e-02]
# Log-linear model: regress log(price) on sqft
x_mat <- cbind(1, br[, "sqft"])
beta_mat <- solve(t(x_mat) %*% x_mat) %*% t(x_mat) %*% log(br[, "price"])
print(t(beta_mat))
## [,1] [,2]
## [1,] 10.8386 0.0004112689
# The same log-linear model with numpy
x_mat = np.column_stack((np.ones(len(br["sqft"])), br["sqft"]))
beta_mat = np.dot(np.linalg.inv(np.dot(np.transpose(x_mat), x_mat)),
np.dot(np.transpose(x_mat), np.log(br["price"])))
print(beta_mat)
## [1.08385963e+01 4.11268853e-04]
The built-in functions produce the same estimates:
mdl_sq <- lm(price ~ 1 + I(sqft^2), data = br)
print(summary(mdl_sq))
##
## Call:
## lm(formula = price ~ 1 + I(sqft^2), data = br)
##
## Residuals:
## Min 1Q Median 3Q Max
## -696604 -23366 779 21869 713159
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.578e+04 2.890e+03 19.30 <2e-16 ***
## I(sqft^2) 1.542e-02 3.131e-04 49.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68210 on 1078 degrees of freedom
## Multiple R-squared: 0.6923, Adjusted R-squared: 0.6921
## F-statistic: 2426 on 1 and 1078 DF, p-value: < 2.2e-16
mdl_log <- lm(log(price) ~ 1 + sqft, data = br)
print(summary(mdl_log))
##
## Call:
## lm(formula = log(price) ~ 1 + sqft, data = br)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.48912 -0.13653 0.02876 0.18500 0.98066
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.084e+01 2.461e-02 440.46 <2e-16 ***
## sqft 4.113e-04 9.708e-06 42.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3215 on 1078 degrees of freedom
## Multiple R-squared: 0.6248, Adjusted R-squared: 0.6244
## F-statistic: 1795 on 1 and 1078 DF, p-value: < 2.2e-16
Note that specifying a transformation of \(X\) without I() would yield incorrect results:
mdl_bad <- lm(price ~ 1 + sqft^2, data = br)
print(coef(summary(mdl_bad)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -60861.46222 6110.187009 -9.960655 2.035728e-22
## sqft 92.74737 2.410502 38.476370 1.540660e-204
The function I() is used to bracket those portions of a model formula where operators should be used in their arithmetic sense.
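To see this concretely, we can compare the design matrices generated by the two formulas; without I(), the ^ operator is treated as a formula (crossing) operator, so sqft^2 collapses back to sqft:
# Without I(): ^ is a formula operator, so only sqft enters the design matrix
print(head(model.matrix(price ~ 1 + sqft^2, data = br), 2))
# With I(): sqft is genuinely squared
print(head(model.matrix(price ~ 1 + I(sqft^2), data = br), 2))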
# The same quadratic and log-linear models via statsmodels
mdl_sq = sm.OLS(br["price"], sm.add_constant(br["sqft"]**2)).fit()
print(mdl_sq.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: price R-squared: 0.692
## Model: OLS Adj. R-squared: 0.692
## Method: Least Squares F-statistic: 2426.
## Date: Thu, 20 Sep 2018 Prob (F-statistic): 3.37e-278
## Time: 07:54:31 Log-Likelihood: -13552.
## No. Observations: 1080 AIC: 2.711e+04
## Df Residuals: 1078 BIC: 2.712e+04
## Df Model: 1
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 5.578e+04 2890.441 19.297 0.000 5.01e+04 6.14e+04
## sqft 0.0154 0.000 49.254 0.000 0.015 0.016
## ==============================================================================
## Omnibus: 534.314 Durbin-Watson: 1.865
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 61385.029
## Skew: 1.276 Prob(JB): 0.00
## Kurtosis: 39.846 Cond. No. 1.29e+07
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 1.29e+07. This might indicate that there are
## strong multicollinearity or other numerical problems.
mdl_log = sm.OLS(np.log(br["price"]), sm.add_constant(br["sqft"])).fit()
print(mdl_log.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: price R-squared: 0.625
## Model: OLS Adj. R-squared: 0.624
## Method: Least Squares F-statistic: 1795.
## Date: Thu, 20 Sep 2018 Prob (F-statistic): 1.11e-231
## Time: 07:54:31 Log-Likelihood: -305.80
## No. Observations: 1080 AIC: 615.6
## Df Residuals: 1078 BIC: 625.6
## Df Model: 1
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 10.8386 0.025 440.459 0.000 10.790 10.887
## sqft 0.0004 9.71e-06 42.365 0.000 0.000 0.000
## ==============================================================================
## Omnibus: 227.328 Durbin-Watson: 1.354
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 599.271
## Skew: -1.091 Prob(JB): 7.41e-131
## Kurtosis: 5.926 Cond. No. 6.38e+03
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 6.38e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
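Although we will examine these specifications in more detail in upcoming tasks, a quick visual comparison of the three fitted curves can be sketched as follows (for the log-linear model we simply exponentiate the fitted log-prices, ignoring for now the correction that an unbiased back-transformation would require):
# Scatter plot with the linear, quadratic and log-linear fits overlaid
idx <- order(br[, "sqft"])
plot(x = br[, "sqft"], y = br[, "price"], xlab = "sqft", ylab = "price", col = "blue")
lines(x = br[idx, "sqft"], y = fitted(mdl_ols)[idx], col = "red")
lines(x = br[idx, "sqft"], y = fitted(mdl_sq)[idx], col = "darkgreen")
lines(x = br[idx, "sqft"], y = exp(fitted(mdl_log))[idx], col = "orange")
legend("topleft", legend = c("linear", "quadratic", "log-linear"),
col = c("red", "darkgreen", "orange"), lty = 1)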
Task 8 and Task 9 will be updated after the 2018-09-20 lecture.
For now, we note the very large standard error of the intercept. We also note that the intercept estimate is negative. What would the price be in such a regression if sqft = 0, and is that even a possibility given the nature of our data?