We will illustrate some of the methods outlined during the lectures. Note that we will update the tasks or data as necessary for new methods introduced during upcoming lectures. In some cases, we may need to present an additional dataset to carry out some examples.
Generally, a number of example datasets, along with their definitions, can be found on the Principles of Econometrics book website, or among the R datasets, which are available in both R and Python.
We will examine the dataset br, which contains 1080 home sales in Baton Rouge, LA during mid-2005. The dataset contains the following variables:
price: sale price, in dollars
sqft: total square feet
bedrooms: number of bedrooms
baths: number of full baths
age: age in years
occupancy: Owner = 1, Vacant = 2, Tenant = 3
pool: Yes = 1, No = 0
style: Traditional = 1, Townhouse = 2, Ranch = 3, New Orleans = 4, Mobile Home = 5, Garden = 6, French = 7, Cottage = 8, Contemporary = 9, Colonial = 10, Acadian = 11
fireplace: Yes = 1, No = 0
waterfront: Yes = 1, No = 0
dom: days on the market
# Read the Baton Rouge housing dataset directly from the POE5 website
br <- read.csv(file = "http://www.principlesofeconometrics.com/poe5/data/csv/br.csv", header = TRUE)
head(br)
## price sqft bedrooms baths age occupancy pool style fireplace waterfront dom
## 1 66500 741 1 1 18 1 1 1 1 0 6
## 2 66000 741 1 1 18 2 1 1 0 0 23
## 3 68500 790 1 1 18 1 0 1 1 0 8
## 4 102000 2783 2 2 18 1 0 1 1 0 50
## 5 54000 1165 2 1 35 2 0 1 0 0 190
## 6 143000 2331 2 2 25 1 0 1 1 0 86
import pandas as pd
# Read the same dataset with pandas
br = pd.read_csv("http://www.principlesofeconometrics.com/poe5/data/csv/br.csv")
print(br.head())
## price sqft bedrooms baths ... style fireplace waterfront dom
## 0 66500 741 1 1 ... 1 1 0 6
## 1 66000 741 1 1 ... 1 0 0 23
## 2 68500 790 1 1 ... 1 1 0 8
## 3 102000 2783 2 2 ... 1 1 0 50
## 4 54000 1165 2 1 ... 1 0 0 190
##
## [5 rows x 11 columns]
Let’s say that a contractor is thinking about building housing on a large plot, as well as selling their existing housing stock. The contractor wishes to maximize their profits and is therefore interested in knowing which variables affect the price. For example, should they focus on smaller houses, or are larger houses more profitable?
Below are the tasks that we wish to accomplish (this list will be updated as the lectures progress). We want to determine how the price of a sold home depends on the total square footage, sqft, of the house. To do this, we will begin by examining the scatter plot of our dependent and independent variables. The main idea behind the univariate regression tasks is to understand how and which underlying formulas are applied inside the single-line functions that immediately estimate the regression model parameters and output them alongside other values and test statistics. In other words, the end goal of the Univariate Regression chapters is not only to be able to apply the following functions to estimate the models:
# Estimate the linear model price = beta_0 + beta_1 * sqft + epsilon via OLS
mdl_ols <- lm(price ~ 1 + sqft, data = br)
# Print the full estimation summary
print(summary(mdl_ols))
##
## Call:
## lm(formula = price ~ 1 + sqft, data = br)
##
## Residuals:
## Min 1Q Median 3Q Max
## -366641 -31399 -1535 25601 932272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -60861.462 6110.187 -9.961 <2e-16 ***
## sqft 92.747 2.411 38.476 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79820 on 1078 degrees of freedom
## Multiple R-squared: 0.5786, Adjusted R-squared: 0.5783
## F-statistic: 1480 on 1 and 1078 DF, p-value: < 2.2e-16
import statsmodels.api as sm
# Estimate the same linear model with statsmodels
mdl = sm.OLS(br["price"], sm.add_constant(br["sqft"]))
mdl_ols = mdl.fit()
# Print the full estimation summary
print(mdl_ols.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: price R-squared: 0.579
## Model: OLS Adj. R-squared: 0.578
## Method: Least Squares F-statistic: 1480.
## Date: Thu, 20 Sep 2018 Prob (F-statistic): 1.54e-204
## Time: 07:54:27 Log-Likelihood: -13722.
## No. Observations: 1080 AIC: 2.745e+04
## Df Residuals: 1078 BIC: 2.746e+04
## Df Model: 1
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const -6.086e+04 6110.187 -9.961 0.000 -7.29e+04 -4.89e+04
## sqft 92.7474 2.411 38.476 0.000 88.018 97.477
## ==============================================================================
## Omnibus: 1185.147 Durbin-Watson: 1.886
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 139602.251
## Skew: 5.135 Prob(JB): 0.00
## Kurtosis: 57.743 Cond. No. 6.38e+03
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 6.38e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
but also to understand what (most of) the presented values say about our model. As such, in the following parts of our example we will work through the steps behind the results of these functions, calculating the values manually and explaining their role in our modelling process.
We will begin by looking at the scatter plot of \(Y\) on \(X\) to examine whether there may be any relationship between the variables:
# Scatter plot of price against sqft
plot(x = br[, "sqft"], y = br[, "price"],
xlab = "sqft", ylab = "price", col = "blue")
import matplotlib.pyplot as plt
# Scatter plot of price against sqft
plt.figure(num = 0, figsize = (10, 8))
plt.plot(br["sqft"], br["price"], linestyle = "None", marker = "o",
markerfacecolor = 'None', markeredgecolor = "blue")
plt.xlabel("sqft")
plt.ylabel("price")
plt.show()
It appears that for very large houses the price is also very large, with most values concentrated around 2000 sqft. We can also look at the histograms and boxplots of these variables:
# Boxplots (with and without outliers) and histograms of sqft and price
par(mfrow = c(2, 3))
boxplot(br[, "sqft"])
boxplot(br[, "sqft"], outline = FALSE)
hist(br[, "sqft"], breaks = 20, col = "cornflowerblue")
boxplot(br[, "price"])
boxplot(br[, "price"], outline = FALSE)
hist(br[, "price"], breaks = 20, col = "cornflowerblue")
fig = plt.figure(num = 1, figsize = (10, 8))
fig.add_subplot(2, 3, 1).boxplot(br["sqft"])
fig.add_subplot(2, 3, 2).boxplot(br["sqft"], showfliers = False)
fig.add_subplot(2, 3, 3).hist(br["sqft"], bins = 20, ec = 'black')
plt.title("Histogram of sqft")
fig.add_subplot(2, 3, 4).boxplot(br["price"])
fig.add_subplot(2, 3, 5).boxplot(br["price"], showfliers = False)
fig.add_subplot(2, 3, 6).hist(br["price"], bins = 20, ec = 'black')
plt.title("Histogram of price")
plt.tight_layout()
plt.show()
The box spans from the first quartile (the bottom edge) through the second (the line inside the box) to the third quartile (the top edge), so the box itself represents the interquartile range, \(IQR = Q_3 - Q_1\), i.e. the “middle fifty” of the data. The “whiskers” above and below the box extend to the most extreme observations that are no more than \(1.5 \cdot IQR\) away from the box. The points beyond the whiskers are the outliers of the data. The second boxplot in each row omits these outliers from the plot.
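To make the whisker rule concrete, here is a minimal R sketch computing the whisker bounds for price by hand (note that quantile() uses a slightly different quartile convention than the hinges drawn by boxplot(), so the values may differ marginally):
# Quartiles and IQR of price
q <- quantile(br[, "price"], probs = c(0.25, 0.75))
iqr_price <- q[2] - q[1]
# Whiskers: the most extreme observations within 1.5 * IQR of the box
lower_whisker <- min(br[br[, "price"] >= q[1] - 1.5 * iqr_price, "price"])
upper_whisker <- max(br[br[, "price"] <= q[2] + 1.5 * iqr_price, "price"])
# Points outside [lower_whisker, upper_whisker] are drawn as outliers
print(c(lower_whisker, upper_whisker))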
The correlation between the independent variable sqft and the dependent variable price, as well as their summary statistics, can also be calculated:
# Sample correlation between price and sqft
print(cor(br[, "price"], br[, "sqft"]))
## [1] 0.7606892
summary(br[, "sqft"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 662 1604 2186 2326 2800 7897
summary(br[, "price"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22000 99000 130000 154863 170163 1580000
import numpy as np
# Sample correlation between price and sqft
print(np.corrcoef(br["price"], br["sqft"])[0][1])
## 0.7606891740822463
print(br["sqft"].describe())
## count 1080.000000
## mean 2325.937963
## std 1008.097991
## min 662.000000
## 25% 1604.500000
## 50% 2186.500000
## 75% 2800.000000
## max 7897.000000
## Name: sqft, dtype: float64
print(br["price"].describe())
## count 1.080000e+03
## mean 1.548632e+05
## std 1.229128e+05
## min 2.200000e+04
## 25% 9.900000e+04
## 50% 1.300000e+05
## 75% 1.701625e+05
## max 1.580000e+06
## Name: price, dtype: float64
The correlation between the variables, roughly 0.76, is quite strong.
We can also take a closer look at the data, when \(sqft \leq 4000\):
# Subset of observations with sqft <= 4000
a <- br[, "sqft"] <= 4000
# Scatter plot of the subset
plot(x = br[a, "sqft"], y = br[a, "price"], main = "Scatter plot of data subset",
xlab = "sqft", ylab = "price", col = "blue")
# Subset of observations with sqft <= 4000 (matching the R code above)
a = br["sqft"] <= 4000
plt.figure(num = 2, figsize = (10, 8))
plt.plot(br["sqft"][a], br["price"][a], linestyle = "None", marker = "o",
markerfacecolor = 'None', markeredgecolor = "blue")
plt.xlabel("sqft")
plt.ylabel("price")
plt.title("Scatterplot of data subset")
plt.show()
We can see a bit more clearly that there is a relationship between the square footage of the house and the price for which it was sold.
Initially, we want to estimate a linear regression model: \[
\text{price} = \beta_0 + \beta_1 \cdot \text{sqft} + \epsilon
\] Looking at the scatter plots and thinking about the nature of the data, a larger house should be more expensive. Therefore, we expect that \(\beta_1 > 0\), since sqft measures the size of the house in square feet.
# OLS estimates via the normal equations: beta = (X'X)^(-1) X'y
x_mat <- cbind(1, br[, "sqft"])
beta_mat <- solve(t(x_mat) %*% x_mat) %*% t(x_mat) %*% br[, "price"]
print(t(beta_mat))
## [,1] [,2]
## [1,] -60861.46 92.74737
# The same calculation with numpy
x_mat = np.column_stack((np.ones(len(br["sqft"])), br["sqft"]))
beta_mat = np.dot(np.linalg.inv(np.dot(np.transpose(x_mat), x_mat)),
np.dot(np.transpose(x_mat), br["price"]))
print(beta_mat)
## [-60861.4622215 92.7473744]
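As an aside, explicitly inverting \(\mathbf{X}^\top \mathbf{X}\) works fine here, but solving the normal equations directly is generally preferred numerically; a minimal equivalent sketch in R:
# Solve (X'X) beta = X'y without forming the explicit inverse
beta_alt <- solve(crossprod(x_mat), crossprod(x_mat, br[, "price"]))
This returns the same estimates as above.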
Compared to the initial output from the OLS regression functions, the values are the same:
print(coef(mdl_ols))
## (Intercept) sqft
## -60861.46222 92.74737
print(mdl_ols.params)
## const -60861.462221
## sqft 92.747374
## dtype: float64
We can calculate the standard errors from the variance formula of the OLS coefficient estimates: \[ \begin{aligned} \mathbb{V}{\rm ar} (\widehat{\boldsymbol{\beta}}) = \begin{bmatrix} \mathbb{V}{\rm ar} (\widehat{\beta}_0) & \mathbb{C}{\rm ov} (\widehat{\beta}_0, \widehat{\beta}_1) \\ \mathbb{C}{\rm ov} (\widehat{\beta}_1, \widehat{\beta}_0) & \mathbb{V}{\rm ar} (\widehat{\beta}_1) \end{bmatrix} = \sigma^2 \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \end{aligned} \] and substituting the sample residual variance: \[ \widehat{\sigma}^2 = \dfrac{1}{N-2} \sum_{i = 1}^N \widehat{\epsilon}_i^2 \] to get: \[ \begin{aligned} \widehat{\mathbb{V}{\rm ar}} (\widehat{\boldsymbol{\beta}}) &= \widehat{\sigma}^2 \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \end{aligned} \]
We can then calculate the standard errors as: \[ \text{se}(\widehat{\beta}_i) = \sqrt{\widehat{\mathbb{V}{\rm ar}} (\widehat{\beta}_i)},\quad i = 0, 1 \]
# Fitted values and residuals
y_fit <- beta_mat[1] + beta_mat[2] * br[, "sqft"]
resid <- br[, "price"] - y_fit
# Estimated residual variance and coefficient covariance matrix
sigma_est <- sum(resid^2) / (length(br[, "sqft"]) - 2)
var_beta <- sigma_est * solve(t(x_mat) %*% x_mat)
# Standard errors: square roots of the diagonal elements
print(sqrt(diag(var_beta)))
## [1] 6110.187009 2.410502
# The same calculation in Python
y_fit = beta_mat[0] + beta_mat[1] * br["sqft"]
resid = br["price"] - y_fit
sigma_est = sum(resid**2) / (len(br["sqft"]) - 2)
var_beta = sigma_est * np.linalg.inv(np.dot(np.transpose(x_mat), x_mat))
print(np.sqrt(np.diag(var_beta)))
## [6.11018701e+03 2.41050218e+00]
The initial output from the OLS regression functions has calculated the same values:
print(summary(mdl_ols)$coefficients[, 2, drop = FALSE])
## Std. Error
## (Intercept) 6110.187009
## sqft 2.410502
print(mdl_ols.bse)
## const 6110.187009
## sqft 2.410502
## dtype: float64
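Two more quantities in the summary output can be verified with the objects we have already computed: the "Residual standard error" is just \(\widehat{\sigma} = \sqrt{\widehat{\sigma}^2}\), and each "t value" is the coefficient estimate divided by its standard error. A quick R sketch:
# Residual standard error (reported as 79820 in the R summary above)
print(sqrt(sigma_est))
# t-statistics: estimate divided by standard error
print(beta_mat / sqrt(diag(var_beta)))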
Having estimated the parameters and the standard errors, we can write down the estimated regression as: \[ \widehat{\text{price}}_i = \underset{(6110.187)}{-60861.462} + \underset{(2.411)}{92.747} \cdot \text{sqft}_i \] where the standard errors of the coefficients are provided in parentheses below the estimated coefficients.
# Fitted values, calculated manually from the estimated coefficients
y_fit <- beta_mat[1] + beta_mat[2] * br[, "sqft"]
print(y_fit[1:10])
## [1] 7864.342 7864.342 12408.964 197254.481 47189.229 155332.667 53032.314 53032.314 53032.314 53032.314
y_fit = beta_mat[0] + beta_mat[1] * br["sqft"]
print(y_fit[:10])
## 0 7864.342206
## 1 7864.342206
## 2 12408.963552
## 3 197254.480723
## 4 47189.228950
## 5 155332.667496
## 6 53032.313537
## 7 53032.313537
## 8 53032.313537
## 9 53032.313537
## Name: sqft, dtype: float64
print(mdl_ols$fitted.values[1:10])
## 1 2 3 4 5 6 7 8 9 10
## 7864.342 7864.342 12408.964 197254.481 47189.229 155332.667 53032.314 53032.314 53032.314 53032.314
print(mdl_ols.fittedvalues[:10])
## 0 7864.342206
## 1 7864.342206
## 2 12408.963552
## 3 197254.480723
## 4 47189.228950
## 5 155332.667496
## 6 53032.313537
## 7 53032.313537
## 8 53032.313537
## 9 53032.313537
## dtype: float64
Note: we may sometimes run into cases where the built-in functions are cumbersome to use for calculating the fitted (or, more generally, the forecasted) values. As such, it is very useful to first try to evaluate \(\widehat{Y}\) by hand.
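For example, a fitted value for a new house can be computed directly from the coefficients and compared with R's built-in predict(); a minimal sketch (the 2000 sqft value is an arbitrary illustration):
# Hypothetical new observation: a 2000 sqft house
x_new <- 2000
# Manual fitted value from the estimated coefficients
print(beta_mat[1] + beta_mat[2] * x_new)
# Built-in equivalent for comparison
print(predict(mdl_ols, newdata = data.frame(sqft = x_new)))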
We can now plot the estimated regression:
# Scatter plot with the fitted regression line overlaid
plot(x = br[, "sqft"], y = br[, "price"],
xlab = "sqft", ylab = "price", col = "blue")
# Sort by sqft so the line is drawn from left to right
lines(x = br[, "sqft"][order(br[, "sqft"])], y = y_fit[order(br[, "sqft"])],
col = "red")
plt.figure(num = 3, figsize = (10, 8))
plt.plot(br["sqft"], br["price"], linestyle = "None", marker = "o",
markerfacecolor = 'None', markeredgecolor = "blue")
plt.plot(br["sqft"][np.argsort(br["sqft"])], y_fit[np.argsort(br["sqft"])],
linestyle = "-", color = "red")
plt.xlabel("sqft")
plt.ylabel("price")
plt.show()
Note that we do need to sort the values by the x-axis variable. Otherwise, the line segments would be drawn in the order the points appear in the (unsorted) data array, connecting the first element to the second, the second to the third, and so on.
What can we say about the plot? We see that when sqft is small and when sqft is very large, the fitted regression underestimates the actual price, i.e. \(\widehat{Y}\) tends to fall below \(Y\) in those regions.
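One rough way to check this claim is to average the residuals in the tails of sqft: a positive average residual means the actual prices tend to lie above the fitted line in that region. A sketch in R (the 1000 and 4000 sqft cutoffs are arbitrary illustrative choices):
# Average residual for very small and very large houses;
# positive values indicate the fitted line lies below the data on average
print(mean(resid[br[, "sqft"] < 1000]))
print(mean(resid[br[, "sqft"] > 4000]))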
In this simple linear regression model our estimated slope coefficient \(\widehat{\beta}_1 = 92.747\) shows that for an additional square foot the estimated average (or, expected) selling price of a house increases by \(92.747\) dollars.
In this case we cannot interpret \(\beta_0\), as it is not possible for sqft to be zero. Furthermore, based on the negative sign of \(\widehat{\beta}_0\), we strongly suspect that our intercept coefficient accounts for a bias in our specified model. In the linear specification the expected price per square foot is constant for all values of sqft. However, it would be reasonable to assume that, generally, a larger house, which is already expensive, may command a higher price for an additional square foot than a smaller house.
We will estimate two models: a quadratic model, \[ \text{price} = \beta_0 + \beta_1 \cdot \text{sqft}^2 + \epsilon \] in which sqft has an increasing effect on \(Y\), and a log-linear model, \[ \log(\text{price}) = \beta_0 + \beta_1 \cdot \text{sqft} + \epsilon \] We begin with the quadratic model, estimated manually:
# Quadratic model: regress price on sqft squared
x_mat <- cbind(1, br[, "sqft"]^2)
beta_mat <- solve(t(x_mat) %*% x_mat) %*% t(x_mat) %*% br[, "price"]
print(t(beta_mat))
## [,1] [,2]
## [1,] 55776.57 0.0154213
x_mat = np.column_stack((np.ones(len(br["sqft"])), br["sqft"]**2))
beta_mat = np.dot(np.linalg.inv(np.dot(np.transpose(x_mat), x_mat)),
np.dot(np.transpose(x_mat), br["price"]))
print(beta_mat)
## [5.57765656e+04 1.54213014e-02]
# Log-linear model: regress log(price) on sqft
x_mat <- cbind(1, br[, "sqft"])
beta_mat <- solve(t(x_mat) %*% x_mat) %*% t(x_mat) %*% log(br[, "price"])
print(t(beta_mat))
## [,1] [,2]
## [1,] 10.8386 0.0004112689
# The same log-linear model with numpy
x_mat = np.column_stack((np.ones(len(br["sqft"])), br["sqft"]))
beta_mat = np.dot(np.linalg.inv(np.dot(np.transpose(x_mat), x_mat)),
np.dot(np.transpose(x_mat), np.log(br["price"])))
print(beta_mat)
## [1.08385963e+01 4.11268853e-04]
The built-in functions produce the same estimates:
mdl_sq <- lm(price ~ 1 + I(sqft^2), data = br)
print(summary(mdl_sq))
##
## Call:
## lm(formula = price ~ 1 + I(sqft^2), data = br)
##
## Residuals:
## Min 1Q Median 3Q Max
## -696604 -23366 779 21869 713159
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.578e+04 2.890e+03 19.30 <2e-16 ***
## I(sqft^2) 1.542e-02 3.131e-04 49.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68210 on 1078 degrees of freedom
## Multiple R-squared: 0.6923, Adjusted R-squared: 0.6921
## F-statistic: 2426 on 1 and 1078 DF, p-value: < 2.2e-16
mdl_log <- lm(log(price) ~ 1 + sqft, data = br)
print(summary(mdl_log))
##
## Call:
## lm(formula = log(price) ~ 1 + sqft, data = br)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.48912 -0.13653 0.02876 0.18500 0.98066
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.084e+01 2.461e-02 440.46 <2e-16 ***
## sqft 4.113e-04 9.708e-06 42.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3215 on 1078 degrees of freedom
## Multiple R-squared: 0.6248, Adjusted R-squared: 0.6244
## F-statistic: 1795 on 1 and 1078 DF, p-value: < 2.2e-16
Note that specifying a transformation of \(X\) without I() would yield incorrect results:
mdl_bad <- lm(price ~ 1 + sqft^2, data = br)
print(coef(summary(mdl_bad)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -60861.46222 6110.187009 -9.960655 2.035728e-22
## sqft 92.74737 2.410502 38.476370 1.540660e-204
The function I() is used to bracket those portions of a model formula where operators should be used in their arithmetic sense.
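To see this concretely, we can compare the design matrices generated by the two formulas; without I(), the ^ operator is treated as a formula (crossing) operator, so sqft^2 collapses back to sqft:
# Without I(): ^ is a formula operator, so only sqft enters the design matrix
print(head(model.matrix(price ~ 1 + sqft^2, data = br), 2))
# With I(): sqft is genuinely squared
print(head(model.matrix(price ~ 1 + I(sqft^2), data = br), 2))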
# The same quadratic and log-linear models via statsmodels
mdl_sq = sm.OLS(br["price"], sm.add_constant(br["sqft"]**2)).fit()
print(mdl_sq.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: price R-squared: 0.692
## Model: OLS Adj. R-squared: 0.692
## Method: Least Squares F-statistic: 2426.
## Date: Thu, 20 Sep 2018 Prob (F-statistic): 3.37e-278
## Time: 07:54:31 Log-Likelihood: -13552.
## No. Observations: 1080 AIC: 2.711e+04
## Df Residuals: 1078 BIC: 2.712e+04
## Df Model: 1
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 5.578e+04 2890.441 19.297 0.000 5.01e+04 6.14e+04
## sqft 0.0154 0.000 49.254 0.000 0.015 0.016
## ==============================================================================
## Omnibus: 534.314 Durbin-Watson: 1.865
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 61385.029
## Skew: 1.276 Prob(JB): 0.00
## Kurtosis: 39.846 Cond. No. 1.29e+07
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 1.29e+07. This might indicate that there are
## strong multicollinearity or other numerical problems.
mdl_log = sm.OLS(np.log(br["price"]), sm.add_constant(br["sqft"])).fit()
print(mdl_log.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: price R-squared: 0.625
## Model: OLS Adj. R-squared: 0.624
## Method: Least Squares F-statistic: 1795.
## Date: Thu, 20 Sep 2018 Prob (F-statistic): 1.11e-231
## Time: 07:54:31 Log-Likelihood: -305.80
## No. Observations: 1080 AIC: 615.6
## Df Residuals: 1078 BIC: 625.6
## Df Model: 1
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 10.8386 0.025 440.459 0.000 10.790 10.887
## sqft 0.0004 9.71e-06 42.365 0.000 0.000 0.000
## ==============================================================================
## Omnibus: 227.328 Durbin-Watson: 1.354
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 599.271
## Skew: -1.091 Prob(JB): 7.41e-131
## Kurtosis: 5.926 Cond. No. 6.38e+03
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 6.38e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
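Although we will examine these specifications in more detail in upcoming tasks, a quick visual comparison of the three fitted curves can be sketched as follows (for the log-linear model we simply exponentiate the fitted log-prices, ignoring for now the correction that an unbiased back-transformation would require):
# Scatter plot with the linear, quadratic and log-linear fits overlaid
idx <- order(br[, "sqft"])
plot(x = br[, "sqft"], y = br[, "price"], xlab = "sqft", ylab = "price", col = "blue")
lines(x = br[idx, "sqft"], y = fitted(mdl_ols)[idx], col = "red")
lines(x = br[idx, "sqft"], y = fitted(mdl_sq)[idx], col = "darkgreen")
lines(x = br[idx, "sqft"], y = exp(fitted(mdl_log))[idx], col = "orange")
legend("topleft", legend = c("linear", "quadratic", "log-linear"),
col = c("red", "darkgreen", "orange"), lty = 1)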
Task 8 and Task 9 will be updated after the 2018-09-20 lecture.
For now, we note the very large standard error of the intercept. We also note that the intercept estimate is negative. What would the price be in such a regression if sqft = 0, and is that even a possibility given the nature of our data?