3.8 Goodness-of-Fit
Having the following model: \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i,\quad i = 1,...,N \] allows us to:
- Explain how the dependent variable \(Y_i\) changes if the independent variable \(X_i\) changes.
- Predict the value of \(\widetilde{Y}\), given a value of \(\widetilde{X}\).
In order to have an accurate prediction of \(Y\), we hope that the independent variable \(X\) helps us explain as much of the variation in \(Y\) as possible (hence why \(X\) is usually referred to as an explanatory variable). Ideally, the variance of \(X\) will help explain the variance in \(Y\). Having said that, we would like a way to measure just how good our model is, i.e. how much of the variation in \(Y\) can be explained by the variation in \(X\) using our model. In other words, we need a goodness-of-fit measure.
Another way to look at it: a goodness-of-fit measure aims to quantify how well the estimated model fits the data. Fortunately, there are many ways to measure the goodness-of-fit of the estimated model.
3.8.1 Model Residuals: RSS, ESS and TSS
We can separate our univariate regression into two components: \[ Y_i = \mathbb{E}(Y_i | X_i) + \epsilon_i \] where \(\mathbb{E}(Y_i | X_i) = \beta_0 + \beta_1X_i\) is the explainable, systematic, component of our model and \(\epsilon_i\) is the random, unsystematic, unexplainable component of our model.
In practical applications, we do not observe the true systematic and the true random components, but we can use OLS to estimate the unknown parameters. Then our regression can be written as: \[ Y_i = \widehat{Y}_i + \left( Y_i - \widehat{Y}_i \right) = \widehat{Y}_i + \widehat{\epsilon}_i \] where \(\widehat{Y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 X_i\) and \(\widehat{\epsilon}_i = Y_i - \widehat{Y}_i\).
Because the least squares fitted regression passes through the sample mean \((\overline{X}, \overline{Y})\), if we subtract the sample mean of \(Y\), \(\overline{Y} = \dfrac{1}{N} \sum_{i = 1}^N Y_i\), from both sides of the equation, we rewrite our model in terms of differences (i.e. variation) from the process mean: \[ Y_i - \overline{Y} = (\widehat{Y}_i - \overline{Y}) + \widehat{\epsilon}_i \] This expression states that the difference between \(Y_i\) and its sample mean, \(\overline{Y}\), consists of an explained part, \(\widehat{Y}_i - \overline{Y}\), and an unexplained part, \(\widehat{\epsilon}_i\). Taking the squares of both sides and summing across \(i = 1,...,N\) yields: \[ \begin{aligned} \sum_{i = 1}^N \left(Y_i - \overline{Y} \right)^2 &= \sum_{i = 1}^N \left( (\widehat{Y}_i - \overline{Y}) + \widehat{\epsilon}_i \right)^2 \\ &=\sum_{i = 1}^N \left( \widehat{Y}_i - \overline{Y}\right)^2 + 2 \sum_{i = 1}^N \left( \widehat{Y}_i - \overline{Y}\right)\widehat{\epsilon}_i + \sum_{i = 1}^N \widehat{\epsilon}^2_i \end{aligned} \] Using the fact that the OLS first-order conditions give \(\sum_{i = 1}^N \widehat{\epsilon}_i = 0\) and \(\sum_{i = 1}^N X_i \widehat{\epsilon}_i = 0\): \[ \begin{aligned} \sum_{i = 1}^N \left( \widehat{Y}_i - \overline{Y}\right)\widehat{\epsilon}_i &= \sum_{i = 1}^N \left( \widehat{\beta}_0 + \widehat{\beta}_1 X_i\right)\widehat{\epsilon}_i - \overline{Y} \sum_{i = 1}^N \widehat{\epsilon}_i \\ &= \widehat{\beta}_0 \sum_{i = 1}^N \widehat{\epsilon}_i + \widehat{\beta}_1 \sum_{i = 1}^N X_i \widehat{\epsilon}_i - \overline{Y} \sum_{i = 1}^N \widehat{\epsilon}_i \\ &= 0 \end{aligned} \]
we can rewrite the equality as: \[\begin{equation} \sum_{i = 1}^N \left(Y_i - \overline{Y} \right)^2 = \sum_{i = 1}^N \left( \widehat{Y}_i - \overline{Y}\right)^2 + \sum_{i = 1}^N \widehat{\epsilon}^2_i \tag{3.11} \end{equation}\]
This equation gives us a decomposition of the total sample variation into explained and unexplained components.
Define the following:
- Total Sum of Squares (SST or TSS) as: \[\text{TSS} = \sum_{i = 1}^N (Y_i - \overline{Y})^2\] It is a measure of total variation in \(Y\) around the sample mean.
- Explained Sum of Squares (ESS) as: \[\text{ESS} = \sum_{i = 1}^N (\widehat{Y}_i - \overline{Y})^2\] It is the part of the total variation in \(Y\) around the sample mean that is explained by our regression. This is sometimes called the model sum of squares or sum of squares due to regression (which is, confusingly, also abbreviated as “SSR”).
- Residual Sum of Squares (SSR or RSS) as: \[\text{RSS} = \sum_{i = 1}^N \widehat{\epsilon}_i^2\] It is the part of the total variation in \(Y\) around the sample mean that is not explained by our regression. This is sometimes called the unexplained sum of squares or the sum of squared estimate of errors (SSE).
Then, (3.11) can be written simply as: \[ \text{TSS} = \text{ESS} + \text{RSS} \]
3.8.2 R-squared, \(R^2\)
It is often useful to compute a number that summarizes how well the OLS regression fits the data. This measure is called the coefficient of determination, \(R^2\), which is the ratio of explained variation, compared to the total variation, i.e. the proportion of variation in \(Y\) that is explained by \(X\) in our regression model: \[ R^2 = \dfrac{\text{ESS}}{\text{TSS}} = 1 - \dfrac{\text{RSS}}{\text{TSS}} \]
- The closer \(R^2\) is to \(1\), the closer the sample values of \(Y_i\) are to the fitted values \(\widehat{Y}_i\) of our regression. If \(R^2 = 1\), then all the sample data fall exactly on the fitted regression line. In such a case our model would be a perfect fit for our data.
- If the sample data of \(Y\) and \(X\) do not have a linear relationship, then \(R^2 = 0\) for a univariate regression.
- For values \(0 < R^2 < 1\), \(R^2\) is interpreted as the proportion of the variation in \(Y\) around its mean that is explained by the regression model. For example, \(R^2 = 0.17\) means that \(17\%\) of the variation in \(Y\) is explained by \(X\).
When comparing the \(\text{RSS}\) of different models, we want to choose the model which better fits our data. If we want to choose a model based on its \(R^2\) value, we should note a couple of things:

\(R^2\) comparison is not valid for models that do not have the same transformation of the dependent variable. For example, two models, one with \(Y\) and the other with \(\log(Y)\) as the dependent variable, cannot be compared via \(R^2\).

\(R^2\) does not measure the predictive power of the model. For example, a linear model may be a good fit for the data, but its forecasts may not make economic sense (e.g. forecasting a negative wage for low values of years in education via a simple linear model).

\(R^2\) is based on the sample data, so it says nothing about whether our model is close to the true population DGP.

\(R^2\) may be low if the error variance, \(\sigma^2\), is large, or if the variance of \(X\) is small.

\(R^2\) may be large even if the model is wrong. For example, even if the true relationship is nonlinear, a linear model may have a larger \(R^2\) than the quadratic, or even the log-linear, model.

On the other hand, the goodness-of-fit of the model does not depend on the unit of measurement of our variables (e.g. dollars vs thousands of dollars). Furthermore, comparisons of \(R^2\) are valid if we compare a simple linear model to a linear-log model, as they both have the same dependent variable, \(Y\).
In any case, a model should not be chosen only on the basis of model fit with \(R^2\) as the criterion.
Next, we will estimate the coefficients. We will use the built-in functions, as we have already, plentifully, shown how the coefficients, standard errors, fitted values and residuals can be calculated manually:
## [1] 1.9322145 0.4248597
## [2.02615049 0.40280677]
Next, we will use the residuals to calculate the \(\text{TSS}\), \(\text{RSS}\) and \(R^2\):
## [1] 0.6518439
## 0.5200895667266314
Which we can also conveniently extract from the estimated model objects:
## [1] 0.6518439
## 0.5200895667266314
Finally, we may look at the full summary output of our models:
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.9071 -1.1047  0.0692  1.2970  4.1897
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.93221    0.36309   5.322 6.52e-07 ***
## x            0.42486    0.03137  13.546  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.829 on 98 degrees of freedom
## Multiple R-squared: 0.6518, Adjusted R-squared: 0.6483
## F-statistic: 183.5 on 1 and 98 DF, p-value: < 2.2e-16
## OLS Regression Results
## ==============================================================================
## Dep. Variable:                      y   R-squared:                       0.520
## Model:                            OLS   Adj. R-squared:                  0.515
## Method:                 Least Squares   F-statistic:                     106.2
## Date:                Tue, 13 Oct 2020   Prob (F-statistic):           2.63e-17
## Time:                        21:39:35   Log-Likelihood:                -223.27
## No. Observations:                 100   AIC:                             450.5
## Df Residuals:                      98   BIC:                             455.8
## Df Model:                           1
## Covariance Type:            nonrobust
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          2.0262      0.452      4.478      0.000       1.128       2.924
## x1             0.4028      0.039     10.306      0.000       0.325       0.480
## ==============================================================================
## Omnibus:                        2.753   Durbin-Watson:                   1.975
## Prob(Omnibus):                  0.252   Jarque-Bera (JB):                1.746
## Skew:                           0.035   Prob(JB):                        0.418
## Kurtosis:                       2.356   Cond. No.                         23.1
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
and see a variety of familiar statistics.
Furthermore, if we decide to scale the variables by, say, dividing by \(10\), then the \(R^2\) would be unchanged:
## [1] 0.19322145 0.04248597
## [1] 0.6518439
## [1] 1.932214 4.248597
## [1] 0.6518439
## [1] 0.1932214 0.4248597
## [1] 0.6518439
## [0.20261505 0.04028068]
## 0.5200895667266313
## [2.02615049 4.02806766]
## 0.5200895667266314
## [0.20261505 0.40280677]
## 0.5200895667266314
Finally, we will plot \(Y_i  \overline{Y}\), \(\widehat{Y}_i  \overline{Y}\) and \(\widehat{\epsilon}_i\) for a better visual understanding of what \(\text{TSS}\), \(\text{ESS}\) and \(\text{RSS}\) measure and their impact when calculating \(R^2\):
plot(x, y - mean(y), pch = 19)
points(x, lm_fit$fitted.values - mean(y), col = "blue", pch = 19)
points(x, lm_fit$residuals, col = "red", pch = 19)
legend(x = 0, y = 7,
       legend = c(expression(Y[i] - bar(Y)[i]),
                  expression(widehat(Y)[i] - bar(Y)[i]),
                  expression(widehat(epsilon)[i])),
       pch = c(19, 19, 19), col = c("black", "blue", "red"))
import matplotlib.pyplot as plt

_ = plt.figure(num = 0, figsize = (10, 8))
_ = plt.plot(x, y - np.mean(y), linestyle = "None", marker = "o",
             color = "black", label = "$Y_i - \\overline{Y}_i$")
_ = plt.plot(x, lm_fit.fittedvalues - np.mean(y), linestyle = "None", marker = "o",
             color = "blue", label = "$\\widehat{Y}_i - \\overline{Y}_i$")
_ = plt.plot(x, lm_fit.resid, linestyle = "None", marker = "o",
             color = "red", label = "$\\widehat{\\epsilon}_i$")
_ = plt.legend()
plt.show()
3.8.2.1 Cases When \(R^2\) is Negative
A negative \(R^2\) can arise when:
- The predictions being compared to the corresponding outcomes have not been derived from a model-fitting procedure using those data, e.g. if we try to guess the coefficient values by assuming that the coefficients of models estimated on similar data, or in similar countries, would be the same for our data;
- We do not include an intercept, \(\beta_0\), in our linear regression model;
- A nonlinear function is used to fit the data.
In cases where negative \(R^2\) values arise, the mean of the data provides a better fit to the outcomes than the fitted model values, according to this criterion, \(R^2\). We will later see that there is a variety of alternative criteria for evaluating the accuracy of a model.
We will look at each case separately.
3.8.2.1.1 Fitted values are not derived from the data which is being analysed
Let’s say that we use a model, which was fitted on the following dataset:
The estimated model of such a dataset is:
## (Intercept) x0
## 1.954413 1.971793
## [2.00889979 2.01142461]
Now, assume that we are analyzing a different data sample. Let’s say that our data sample comes from the following underlying DGP:
However, we make the incorrect assumption that our data sample comes from the same population as the previous data. This leads us to calculating the fitted values, residuals and \(R^2\) using the pre-estimated coefficients:
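The mechanics can be sketched as follows; the alternative DGP and the borrowed coefficient values below are assumptions for illustration, not the chapter's numbers:

```python
import numpy as np

np.random.seed(1)
# a sample from one DGP...
x_other = np.random.uniform(0, 2, size=100)
y_other = 0.5 - 1.0 * x_other + np.random.normal(0, 0.5, size=100)

# ...but fitted values computed with coefficients borrowed from a model
# estimated on *different* data (values assumed for illustration)
b0, b1 = 2.0, 2.0
y_fit_other = b0 + b1 * x_other

RSS = np.sum((y_other - y_fit_other)**2)
TSS = np.sum((y_other - np.mean(y_other))**2)
R_sq = 1 - RSS / TSS
print(R_sq < 0)  # the borrowed model fits worse than the sample mean
```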
Visual inspection reveals that our assumption, that an existing model of one dataset is good enough for our dataset, was incorrect: it is clear that our dataset comes from a different DGP. For comparison, we also plot the process mean.
plot(x_other, y_other)
lines(x_other[order(x_other)], y_fit_other[order(x_other)], col = "red")
abline(h = mean(y_other), col = "blue", lty = 2)
legend(x = 1.5, y = 4,
legend = c("Data Sample",
expression(widehat(Y)[i]),
expression(bar(Y))),
pch = c(1, NA, NA), lty = c(NA, 1, 2),
col = c("black", "red", "blue"))
_ = plt.figure(num = 1, figsize = (10, 8))
_ = plt.plot(x_other, y_other, linestyle = "None", marker = "o", markerfacecolor = "None",
color = "black", label = "Data Sample")
_ = plt.plot(x_other[np.argsort(x_other)], y_fit_other[np.argsort(x_other)], linestyle = "-",
             color = "red", label = "$\\widehat{Y}_i$")
_ = plt.axhline(y = np.mean(y_other), linestyle = "--", color = "blue",
                label = "$\\overline{Y}_i$")
_ = plt.legend()
plt.show()
If we compare models from datasets of different countries or different firms, we would run into such problems. For example, if one firm is very large, while another is relatively new and small, assuming that a model on the data of one firm can be applied to the data of the new firm would be incorrect: some variables may have similar effects, but they would most likely not be the same in magnitude.
3.8.2.1.2 Regression without an intercept
We will generate data from a linear model with an intercept.
But we will estimate the parameters of a regression model without an intercept. The estimated coefficient, fitted values and residuals are calculated as follows (take note that we do not include a constant in the independent variable matrix):
## x
## 4.248594
## [4.24107257]
Which results in the following negative \(R^2\):
## [1] -0.6507254
RSS = np.sum(np.array(lm_fit.resid)**2)
TSS = np.sum((y - np.mean(y))**2)
R_sq = 1 - RSS / TSS
print(R_sq)
## -0.6718501471478293
For cases when a model does not have an intercept, \(R^2\) is usually computed as: \[ R^2 = 1 - \dfrac{RSS}{\sum_{i = 1}^N Y_i^2} \] where the denominator acts as if we assume that \(\mathbb{E}(Y) = 0\) (and hence we assume that \(\overline{Y} \approx 0\)).
Applying this expression for the \(R^2\) yields:
## [1] 0.9136212
## 0.9129369975970213
Furthermore, this value of \(R^2\) is (silently) applied in the built-in OLS estimation functions:
## [1] 0.9136212
## 0.9129369975970213
Unfortunately, if \(R^2\) is calculated in this way, it ignores a very important fact about our model: a negative \(R^2\) indicates that the regression is actually worse than a simple average of the process. In fact, the modified \(R^2\) shows a very high value, the complete opposite of what we would expect to see in such a situation.
Visually, we can see that our model provides quite a poor fit. For comparison, we also plot the process mean:
plot(x, y, ylim = c(0, max(lm_fit$fitted.values)))
lines(x, lm_fit$fitted.values, col = "red")
abline(h = mean(y), lty = 2, col = "blue")
legend(x = 0, y = 80,
legend = c("Data Sample",
expression(widehat(Y) == widehat(beta)[1]*X),
expression(bar(Y))),
lty = c(NA, 1, 2), pch = c(1, NA, NA),
col = c("black", "red", "blue"))
_ = plt.figure(num = 2, figsize = (10, 8))
_ = plt.plot(x, y, linestyle = "None", marker = "o", markerfacecolor = "None",
color = "black", label = "Data Sample")
_ = plt.plot(x[np.argsort(x)], lm_fit.fittedvalues[np.argsort(x)], linestyle = "-",
             color = "red", label = "$\\widehat{Y} = \\widehat{\\beta}_1 X$")
_ = plt.axhline(y = np.mean(y), linestyle = "--",
                color = "blue", label = "$\\overline{Y}$")
_ = plt.legend()
plt.show()
So, while the modified \(R^2\) seems high, in reality the model provides a poor fit for the data sample.
3.8.2.1.3 A Nonlinear function is used to fit the data with large error variance
As an example, we will simulate data from the following log-linear model: \(\log(Y) = \beta_0 + \beta_1 X + \epsilon\), where \(\beta_0 = 0.2\), \(\beta_1 = 2\), \(N = 100\), \(\epsilon \sim \mathcal{N}(0, 1)\), and \(X\) is a random sample with replacement from an interval from \(0\) to \(0.5\), equally spaced into \(N\) elements.
This data has a small variation in \(X\) and a (relative to the variance in \(\log(Y)\) and in \(\epsilon\)) large error variance:
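A sketch of this simulation in Python (the seed is an arbitrary assumption; the outputs below come from the chapter's own draw):

```python
import numpy as np

np.random.seed(123)
N = 100
beta_0, beta_1 = 0.2, 2
# X: sampled with replacement from an equally spaced grid on [0, 0.5]
x = np.random.choice(np.linspace(0, 0.5, num=N), size=N, replace=True)
e = np.random.normal(loc=0, scale=1, size=N)
log_y = beta_0 + beta_1 * x + e

print(np.var(x))  # small variance of X
print(np.var(e))  # error variance close to 1, large relative to the variance of X
```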
## 0.023800418324660753
## 1.1624704319571753
## 1.0245932457731062
If we estimate the correct model and look at the coefficients and \(R^2\):
## (Intercept) x
## 0.3305187 1.5632999
## [1] 0.06068536
## [0.0878827 2.44826432]
## 0.12272111156714938
We see that the \(R^2\) is very small. Furthermore, if we were to back-transform our fitted values and calculate the \(R^2\):
We see that it is even worse. The plot of the fitted values:
_ = plt.figure(num = 3, figsize = (10, 8))
_ = plt.plot(x, y, linestyle = "None", marker = "o", markerfacecolor = "None", color = "black")
_ = plt.plot(x[np.argsort(x)], y_fit[np.argsort(x)], linestyle = "-", color = "red")
_ = plt.axhline(y = np.mean(y), linestyle = "--", color = "blue")
plt.show()
So, a large variance of the error term, or a small variance of the independent variable, will result in a lower \(R^2\) value overall. Furthermore, back-transforming would likely result in an even lower \(R^2\) value.
Finally, plotting the residuals against the fitted values indicates that the residuals of \(Y - \exp(\widehat{\log(Y)})\) do not have the same properties as the log-linear model residuals:
fig = plt.figure(num = 4, figsize = (10, 8))
_ = fig.add_subplot('211').plot(lm_fit.fittedvalues, lm_fit.resid, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("log-linear model residuals")
_ = fig.add_subplot('212').plot(y_fit, resid, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Dependent variable and the back-transformed fitted value residuals")
plt.show()
3.8.2.2 Correlation Analysis
The correlation coefficient between \(X\) and \(Y\) is defined as: \[ \rho_{X,Y} = \dfrac{\mathbb{C}{\rm ov} (X, Y)}{\sqrt{\mathbb{V}{\rm ar} (X)} \sqrt{\mathbb{V}{\rm ar} (Y)}} = \dfrac{\sigma_{X,Y}}{\sigma_X \sigma_Y} \] The sample correlation is calculated as: \[ r_{X, Y} = \dfrac{\widehat{\sigma}_{X,Y}}{\widehat{\sigma}_X \widehat{\sigma}_Y} = \dfrac{\dfrac{1}{N-1} \sum_{i = 1}^N (X_i - \overline{X})(Y_i - \overline{Y})}{\sqrt{\dfrac{1}{N-1} \sum_{i = 1}^N (X_i - \overline{X})^2}\sqrt{\dfrac{1}{N-1} \sum_{i = 1}^N (Y_i - \overline{Y})^2}} \] The sample correlation \(-1 \leq r_{X,Y} \leq 1\) measures the strength of the linear association between the sample values of \(X\) and \(Y\).
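A quick numerical check of the sample correlation formula; note that the \(1/(N-1)\) factors cancel, so plain sums suffice (the data is simulated for illustration):

```python
import numpy as np

np.random.seed(42)
x = np.random.normal(size=200)
y = 0.5 * x + np.random.normal(size=200)

# sample correlation from its definition (the 1/(N-1) factors cancel)
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / (
    np.sqrt(np.sum((x - x.mean())**2)) * np.sqrt(np.sum((y - y.mean())**2)))
print(np.isclose(r_manual, np.corrcoef(x, y)[0, 1]))
```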
3.8.2.3 Correlation Analysis and \(R^2\)
There is a relationship between \(R^2\) and \(r_{X,Y}\):
- \(R^2 = r_{X,Y}^2\). So, \(R^2\) can be computed as the square of the sample correlation between \(Y\) and \(X\).
- \(R^2 = r_{Y, \widehat{Y}}^2\). So, \(R^2\) can also be computed as the square of the sample correlation between \(Y\) and its fitted values \(\widehat{Y}\). As such, \(R^2\) measures the linear association, the goodness-of-fit, between the sample data \(Y\) and its predicted values \(\widehat{Y}\). Because of this, \(R^2\) is sometimes called a measure of goodness-of-fit.
## [1] 0.06068536
## 0.12272111156714938
## [1] 0.06068536
## 0.12272111156714918
## [1] 0.06068536
## 0.12272111156714925
3.8.2.4 A General (pseudo) \(R\)-squared Measure, \(R^2_g\)
As we have seen, we may need to back-transform our dependent variable. Then, we can calculate a general (pseudo) measure of \(R^2\): \[ R^2_g = r_{Y, \widehat{Y}}^2 = \mathbb{C}{\rm orr}(Y, \widehat{Y})^2 \]
In our previous example, we can calculate \(R^2_g\) for both the log and the back-transformed values:
## 0.12272111156714925
## 0.05975262825731168
One way to look at it is that \(R^2\) measures the variation explained by our model, whereas \(R^2_g\) measures the variance explained by our model. In a linear regression, the two definitions are the same, as long as the intercept coefficient is included in the model.
3.8.3 Regression Diagnostics
In many cases, while carrying out statistical/econometric analysis, we are not sure whether we have correctly specified our model. As we have seen, the \(R^2\) can be artificially small (or large), regardless of the specified model. As such, there are a number of regression diagnostics and specification tests.
For the univariate regression, the most crucial assumptions come from (UR.3) and (UR.4), namely:
- \(\mathbb{V}{\rm ar} (\epsilon_i | \mathbf{X} ) = \sigma^2_\epsilon,\ \forall i = 1,..,N\)
- \(\mathbb{C}{\rm ov} (\epsilon_i, \epsilon_j) = 0,\ i \neq j\)
- \(\boldsymbol{\varepsilon} | \mathbf{X} \sim \mathcal{N} \left( \mathbf{0}, \sigma^2_\epsilon \mathbf{I} \right)\)
We note that the residuals are defined as: \[ \begin{aligned} \widehat{\boldsymbol{\varepsilon}} &= \mathbf{Y} - \widehat{\mathbf{Y}} \\ &= \mathbf{Y} - \mathbf{X} \widehat{\boldsymbol{\beta}} \\ &= \mathbf{Y} - \mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{Y} \\ &= \left[ \mathbf{I} - \mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \right]\mathbf{Y} \end{aligned} \] Hence, for the OLS residuals (i.e. not the true unobserved errors) the expected value of the residuals is still zero: \[ \begin{aligned} \mathbb{E} \left( \widehat{\boldsymbol{\varepsilon}} \mid \mathbf{X} \right) &= \mathbb{E} \left( \left[ \mathbf{I} - \mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \right]\mathbf{Y} \mid \mathbf{X} \right)\\ &= \mathbb{E} \left( \left[ \mathbf{I} - \mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \right] \left( \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} \right) \mid \mathbf{X} \right) \\ &= \mathbf{X} \boldsymbol{\beta} + \mathbb{E} (\boldsymbol{\varepsilon}) - \mathbf{X} \boldsymbol{\beta} - \mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbb{E} (\boldsymbol{\varepsilon}) \\ &= \mathbf{0} \end{aligned} \] For simplicity, let \(\widehat{\boldsymbol{\varepsilon}} = \left[ \mathbf{I} - \mathbf{H}\right]\mathbf{Y}\), where \(\mathbf{H} = \mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top\).
Consequently, the variancecovariance matrix of the residuals is:
\[\begin{equation} \begin{aligned} \mathbb{V}{\rm ar} \left( \widehat{\boldsymbol{\varepsilon}} \mid \mathbf{X}\right) &= \mathbb{V}{\rm ar} \left( \left[ \mathbf{I} - \mathbf{H}\right]\mathbf{Y} \mid \mathbf{X}\right) \\ &= \left[ \mathbf{I} - \mathbf{H}\right]\mathbb{V}{\rm ar} \left( \mathbf{Y} \mid \mathbf{X}\right) \left[ \mathbf{I} - \mathbf{H}\right]^\top \\ &= \left[ \mathbf{I} - \mathbf{H}\right] \sigma^2 \left[ \mathbf{I} - \mathbf{H}\right]^\top \\ &= \sigma^2 \left[ \mathbf{I} - \mathbf{H}^\top - \mathbf{H} + \mathbf{H} \mathbf{H}^\top\right] \\ &= \sigma^2 \left[ \mathbf{I} - \mathbf{H}^\top - \mathbf{H} + \mathbf{H}^\top\right] \\ &= \sigma^2 \left[ \mathbf{I} - \mathbf{H}\right] \end{aligned} \tag{3.12} \end{equation}\]
This result shows an important distinction between the residuals and the errors: the residuals may have different variances (which are the diagonal elements of \(\mathbb{V}{\rm ar} \left( \widehat{\boldsymbol{\varepsilon}} \mid \mathbf{X}\right)\)), even if the true errors (which affect the process \(\mathbf{Y}\)) all have the same variance \(\sigma^2\).
The variance of the fitted values is smallest for observations near the mean, and largest for values which deviate the most from the process mean.
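A sketch of (3.12) in Python, computing the hat matrix and the implied residual variances (with \(\sigma^2 = 1\) assumed and simulated \(X\)):

```python
import numpy as np

np.random.seed(0)
N = 50
x = np.random.uniform(0, 10, size=N)
X = np.column_stack([np.ones(N), x])

# hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
sigma2 = 1.0  # assumed error variance
resid_var = sigma2 * np.diag(np.eye(N) - H)  # Var(e_hat_i | X), equation (3.12)

# the residual variance is smallest where the leverage h_ii is largest,
# i.e. for observations far from the mean of x
i_far = np.argmax(np.abs(x - x.mean()))
i_near = np.argmin(np.abs(x - x.mean()))
print(resid_var[i_far] < resid_var[i_near])
```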
3.8.3.1 Residual Diagnostic Plots
One way to examine the adequacy of the model is to visualize the residuals. There are a number of ways to do this:
- Plotting the residuals \(\widehat{\epsilon}_i\) against the fitted values \(\widehat{Y}_i\);
- Plotting the residuals \(\widehat{\epsilon}_i\) against \(X_i\);
- Plotting the residual QQ plot, histogram or boxplot.
In all cases, if there are no violations of our (UR.2) or (UR.3) assumptions, the plots should reveal no patterns. The residual histogram and QQ plot should be approximately normal, so that our assumption (UR.4) holds.
As we are not guaranteed to specify the correct functional form, residual plots offer great insight into what possible functional form we may have missed.
We should note that when having multiple models, it is only meaningful to compare the residuals of models with the same dependent variable. For example, comparing the residuals of a linear-linear model (with \(Y\)) and of a log-linear model (with \(\log(Y)\)) is not meaningful, as they have different value scales.
Transforming the dependent or the independent variables may help to alleviate some of the problems of the residuals:
- If nonlinearities are present in the residual plots, we must first account for them, and only then can we check whether the errors have a constant variance.
- Transforming \(\mathbf{Y}\) primarily aims to help with problems with the error terms (and may help with nonlinearity);
- Transforming \(\mathbf{X}\) primarily aims to help with correcting for nonlinearity;
- Sometimes transforming \(\mathbf{X}\) is enough to account for nonlinearity and have normally distributed errors, while transforming \(\mathbf{Y}\) may account for nonlinearity but might make the errors non-normally distributed.
- Other times, transforming \(\mathbf{X}\) does not help account for the nonlinear relationship at all.
Remember that the QQ plot plots quantiles of the data versus quantiles of a distribution. If the observations come from a normal distribution we would expect the observed order statistics plotted against the expected (theoretical) order statistics to form an approximately straight line.
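The construction behind the QQ plot can be sketched directly; the plotting positions used below are one common convention (scipy's `probplot` uses a similar, slightly different one), and the data is simulated for illustration:

```python
import numpy as np
import scipy.stats as stats

np.random.seed(42)
resid = np.random.normal(size=200)  # stand-in for model residuals

observed = np.sort(resid)             # observed order statistics
pp = (np.arange(1, 201) - 0.5) / 200  # plotting positions
theoretical = stats.norm.ppf(pp)      # expected order statistics under normality

# for normally distributed residuals the points form an approximately straight line,
# with slope close to the residual standard deviation and intercept close to the mean
slope, intercept = np.polyfit(theoretical, observed, 1)
print(slope, intercept)
```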
We will simulate data from four different DGPs:
- a simple linear model: \(Y = \beta_0 + \beta_1 X + \epsilon\);
- a log-linear model: \(\log(Y) = \beta_0 + \beta_1 X + \epsilon\);
- a linear-log model: \(Y = \beta_0 + \beta_1 \log(X) + \epsilon\);
- a log-log model: \(\log(Y) = \beta_0 + \beta_1 \log(X) + \epsilon\).
For each case, we will estimate a simple linear model on the data and examine the residual plots. For simplicity, we will use the same \(X_i\) and the same \(\epsilon_i \sim \mathcal{N}(0, 0.2^2)\), \(i = 1,...,N\), with \(N = 200\), \(\beta_0 = 1\) and \(\beta_1 = 2\).
set.seed(123)
# Sample size and coefficients
N = 200
beta_0 <- 1
beta_1 <- 2
# Variables which will be the same for each model:
x <- seq(from = 0.1, to = 2, length.out = N)
e <- rnorm(mean = 0, sd = 0.2, n = N)
# Simple linear model
y <- beta_0 + beta_1 * x + e
data_lin <- data.frame(y, x, e)
# Linear-log model:
y <- beta_0 + beta_1 * log(x) + e
data_linlog <- data.frame(y, x, e)
# Log-linear model:
y <- exp(beta_0 + beta_1 * x + e)
data_loglin <- data.frame(y, x, e)
# Log-log model:
y <- exp(beta_0 + beta_1 * log(x) + e)
data_loglog <- data.frame(y, x, e)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import scipy.stats as stats
#
np.random.seed(123)
# Sample size and coefficients
N = 200
beta_0 = 1
beta_1 = 2
# Variables which will be the same for each model:
x = np.linspace(start = 0.1, stop = 2, num = N)
e = np.random.normal(loc = 0, scale = 0.2, size = N)
# Simple linear model
y = beta_0 + beta_1 * x + e
data_lin = pd.DataFrame([y, x, e], index = ["y", "x", "e"]).T
# LinearLog model:
y = beta_0 + beta_1 * np.log(x) + e
data_linlog = pd.DataFrame([y, x, e], index = ["y", "x", "e"]).T
# Loglinear model:
y = np.exp(beta_0 + beta_1 * x + e)
data_loglin = pd.DataFrame([y, x, e], index = ["y", "x", "e"]).T
# LogLog model:
y = np.exp(beta_0 + beta_1 * np.log(x) + e)
data_loglog = pd.DataFrame([y, x, e], index = ["y", "x", "e"]).T
# Plot the data  remember: X is the same for all models
fig = plt.figure(num = 5, figsize = (10, 8))
_ = fig.add_subplot("221").plot(x, data_lin["y"], linestyle = "None",
marker = "o", markerfacecolor = "None", color = "black")
_ = plt.title("Simple linear DGP")
_ = fig.add_subplot("222").plot(x, data_linlog["y"], linestyle = "None",
marker = "o", markerfacecolor = "None", color = "black")
_ = plt.title("linearlog DGP")
_ = fig.add_subplot("223").plot(x, data_loglin["y"], linestyle = "None",
marker = "o", markerfacecolor = "None", color = "black")
_ = plt.title("loglinear DGP")
_ = fig.add_subplot("224").plot(x, data_loglog["y"], linestyle = "None",
marker = "o", markerfacecolor = "None", color = "black")
_ = plt.title("loglog DGP")
plt.tight_layout()
plt.show()
Next, we will estimate the simple linear regression for each dataset:
We can plot the fitted values alongside the actual data:
#
#
par(mfrow = c(2, 2))
#
#
#
plot(x, data_lin$y, main = "Simple linear DGP")
lines(x[order(x)], mdl1$fitted.values[order(x)], col = "red", lwd = 2)
#
#
#
plot(x, data_linlog$y, main = "linearlog DGP")
lines(x[order(x)], mdl2$fitted.values[order(x)], col = "red", lwd = 2)
#
#
#
plot(x, data_loglin$y, main = "loglinear DGP")
lines(x[order(x)], mdl3$fitted.values[order(x)], col = "red", lwd = 2)
#
#
#
plot(x, data_loglog$y, main = "loglog DGP")
lines(x[order(x)], mdl4$fitted.values[order(x)], col = "red", lwd = 2)
fig = plt.figure(num = 6, figsize = (10, 8))
_ = fig.add_subplot("221").plot(x, data_lin["y"], linestyle = "None",
marker = "o", markerfacecolor = "None", color = "black")
_ = plt.plot(x[np.argsort(x)], mdl1.fittedvalues[np.argsort(x)],
color = "red", linewidth = 2)
_ = plt.title("Simple linear DGP")
_ = fig.add_subplot("222").plot(x, data_linlog["y"], linestyle = "None",
marker = "o", markerfacecolor = "None", color = "black")
_ = plt.plot(x[np.argsort(x)], mdl2.fittedvalues[np.argsort(x)],
color = "red", linewidth = 2)
_ = plt.title("linearlog DGP")
_ = fig.add_subplot("223").plot(x, data_loglin["y"], linestyle = "None",
marker = "o", markerfacecolor = "None", color = "black")
_ = plt.plot(x[np.argsort(x)], mdl3.fittedvalues[np.argsort(x)],
color = "red", linewidth = 2)
_ = plt.title("loglinear DGP")
_ = fig.add_subplot("224").plot(x, data_loglog["y"], linestyle = "None",
marker = "o", markerfacecolor = "None", color = "black")
_ = plt.plot(x[np.argsort(x)], mdl4.fittedvalues[np.argsort(x)],
color = "red", linewidth = 2)
_ = plt.title("loglog DGP")
plt.tight_layout()
plt.show()
Then, the different residual plots are as follows:
plot_resid <- function(resid, y_fit, x, plt_title){
plot(c(0, 1), c(0, 1), ann = F, bty = 'n', type = 'n', xaxt = 'n', yaxt = 'n')
text(x = 0.5, y = 0.5, plt_title, cex = 1.6, col = "black")
#
qqnorm(resid, main = "QQ plot of residuals")
qqline(resid, col = "red", lwd = 2)
#
hist(resid, main = "Histogram of residuals", col = "cornflowerblue", breaks = 25)
#
plot(y_fit, resid, main = "Residuals vs Fitted values")
#
plot(x, resid, main = "Residuals vs X")
}
#
#
#
#
#
#
#
#
#
def plot_resid(resid, y_fit, x, plt_title, plt_row, plt_col, plt_pos, fig):
ax = fig.add_subplot(plt_row, plt_col, plt_pos)
ax.set_yticklabels([])
ax.set_xticklabels([])
ax.tick_params(right = False, top = False, left = False, bottom= False)
ax.text(0.5, 0.5, plt_title, horizontalalignment='center', verticalalignment='center', transform = ax.transAxes)
#
ax = fig.add_subplot(plt_row, plt_col, plt_pos + 1)
stats.probplot(resid, dist = "norm", plot = ax)
#
ax = fig.add_subplot(plt_row, plt_col, plt_pos + 2)
ax.hist(resid, color = "cornflowerblue", bins = 25, ec = 'black')
plt.title("Histogram of residuals")
#
ax = fig.add_subplot(plt_row, plt_col, plt_pos + 3)
ax.plot(y_fit, resid, linestyle = "None", marker = "o", markerfacecolor = "None", color = "black")
plt.title("Residuals vs Fitted values")
#
ax = fig.add_subplot(plt_row, plt_col, plt_pos + 4)
ax.plot(x, resid, linestyle = "None", marker = "o", markerfacecolor = "None", color = "black")
plt.title("Residuals vs X")
#
par(mfrow = c(4, 5))
#
plot_resid(mdl1$residuals, mdl1$fitted.values, x, "simple linear DGP")
#
plot_resid(mdl2$residuals, mdl2$fitted.values, x, "linearlog DGP")
#
plot_resid(mdl3$residuals, mdl3$fitted.values, x, "loglinear DGP")
#
plot_resid(mdl4$residuals, mdl4$fitted.values, x, "loglog DGP")
fig = plt.figure(num = 7, figsize = (12, 10))
#
plot_resid(mdl1.resid, mdl1.fittedvalues, x, "simple linear DGP", 4, 5, 1, fig)
plot_resid(mdl2.resid, mdl2.fittedvalues, x, "linear-log DGP", 4, 5, 6, fig)
plot_resid(mdl3.resid, mdl3.fittedvalues, x, "log-linear DGP", 4, 5, 11, fig)
plot_resid(mdl4.resid, mdl4.fittedvalues, x, "log-log DGP", 4, 5, 16, fig)
#
plt.tight_layout()
## <string>:1: UserWarning: Tight layout not applied. tight_layout cannot make axes width small enough to accommodate all axes decorations
We see that the linear model for the dataset generated from a simple linear DGP has residuals which appear to be random: we do not see any non-random patterns in the scatter plots. Furthermore, the histogram and Q-Q plot indicate that the residuals may come from a normal distribution. On the other hand, a simple linear model does not fit the data well if the data are sampled from a non-linear DGP: we see clear patterns in the residual scatter plots, as well as non-normality.
For comparison, if we were to fit the correct models, we would have the following plots:
mdl2_correct <- lm(y ~ 1 + log(x), data = data_linlog)
mdl3_correct <- lm(log(y) ~ 1 + x, data = data_loglin)
mdl4_correct <- lm(log(y) ~ 1 + log(x), data = data_loglog)
#
#
par(mfrow = c(4, 5))
#
plot_resid(mdl1$residuals, mdl1$fitted.values, x, "simple linear DGP")
#
plot_resid(mdl2_correct$residuals, mdl2_correct$fitted.values, x, "linear-log DGP")
#
plot_resid(mdl3_correct$residuals, mdl3_correct$fitted.values, x, "log-linear DGP")
#
plot_resid(mdl4_correct$residuals, mdl4_correct$fitted.values, x, "log-log DGP")
mdl2_correct = sm.OLS(data_linlog["y"], sm.add_constant(np.log(data_linlog["x"]))).fit()
mdl3_correct = sm.OLS(np.log(data_loglin["y"]), sm.add_constant(data_loglin["x"])).fit()
mdl4_correct = sm.OLS(np.log(data_loglog["y"]), sm.add_constant(np.log(data_loglog["x"]))).fit()
#
#
fig = plt.figure(num = 8, figsize = (12, 10))
#
plot_resid(mdl1.resid, mdl1.fittedvalues, x, "simple linear DGP", 4, 5, 1, fig)
plot_resid(mdl2_correct.resid, mdl2_correct.fittedvalues, x, "linear-log DGP", 4, 5, 6, fig)
plot_resid(mdl3_correct.resid, mdl3_correct.fittedvalues, x, "log-linear DGP", 4, 5, 11, fig)
plot_resid(mdl4_correct.resid, mdl4_correct.fittedvalues, x, "log-log DGP", 4, 5, 16, fig)
#
plt.tight_layout()
plt.show()
Then the residuals are normally distributed and do not have any patterns or change in variance.
3.8.3.2 Residual Heteroskedasticity
If \(\mathbb{V}{\rm ar} (\epsilon_i | \mathbf{X} ) = \sigma^2_\epsilon,\ \forall i = 1,...,N\), we say that the residuals are homoskedastic. If this assumption is violated, we say that the residuals are heteroskedastic: their variance is not constant across observations.
The consequences of heteroskedasticity are as follows:
OLS parameter estimates remain unbiased;
OLS estimates are no longer efficient (i.e. they no longer have the smallest variance). The reason is that OLS gives equal weight to all observations in the data when, in fact, observations with a larger error variance contain less information than observations with a smaller error variance;
The variance estimate of the residuals is biased, and hence the standard errors are biased. This in turn leads to biased test statistics and confidence intervals.
Because of the standard error bias, we may fail to reject the null hypothesis that \(\beta_i = 0\) in our estimated model when the null hypothesis is actually false (i.e. we make a Type II error).
There are a few possible corrections to account for heteroskedasticity:
Take logarithms of the data; this may help linearize the data and, in turn, the residuals;
Apply a different estimation method. We will examine this later on, but one possibility is a Weighted Least Squares estimation method, which gives different observations different weights and allows us to account for a non-constant variance;
It is possible to correct the biased standard errors for heteroskedasticity, which leaves the OLS estimates unchanged. White’s heteroskedasticity-consistent standard errors (or, robust standard errors) give a consistent variance estimator.
\[ \begin{aligned} Y_i &= \beta_0 + \beta_1 X_i + u_i\\ u_i &= i \cdot \epsilon_i,\text{ where } \epsilon_i \sim \mathcal{N}(0, \sigma^2) \end{aligned} \]
set.seed(123)
#
N <- 100
beta_0 <- 8
beta_1 <- 10
#
x <- seq(from = 0, to = 5, length.out = N)
e <- rnorm(mean = 0, sd = 0.8, n = N)
u <- (1:N) * e
#
y <- beta_0 + beta_1 * x + u
#
mdl <- lm(y ~ 1 + x)
summary(mdl)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.398462 8.475098 0.1650084 8.692773e-01
## x 14.771130 2.928474 5.0439679 2.095328e-06
np.random.seed(123)
#
N = 100
beta_0 = 8
beta_1 = 10
#
x = np.linspace(start = 0, stop = 5, num = N)
e = np.random.normal(loc = 0, scale = 0.8, size = N)
u = np.array(list(range(1, N + 1))) * e
#
y = beta_0 + beta_1 * x + u
#
mdl = sm.OLS(y, sm.add_constant(x)).fit()
print(mdl.summary().tables[1])
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 10.8945 9.978 1.092 0.278 -8.907 30.696
## x1 9.3559 3.448 2.714 0.008 2.514 16.198
## ==============================================================================
The residuals appear to be non-normal and have non-linearities remaining:
fig = plt.figure(num = 9, figsize = (10, 8))
ax = fig.add_subplot(2, 3, 1)
_ = ax.plot(x, y, linestyle = "None", marker = "o", markerfacecolor = "None", color = "black")
_ = ax.plot(x[np.argsort(x)], mdl.fittedvalues[np.argsort(x)], color = "red")
plot_resid(mdl.resid, mdl.fittedvalues, x, "heteroskedastic shock DGP", 2, 3, 2, fig)
plt.tight_layout()
plt.show()
There are a number of methods to test for the presence of heteroskedasticity. Some of the tests are:
Goldfeld–Quandt Test. It divides the dataset into two subsets: the observations for which the explanatory variable takes the lowest values go into one subset, and those with the highest values into the other. The subsets are not necessarily of equal size, nor do they need to contain all the observations between them. The test statistic used is the ratio of the mean squared residuals from the regressions on the two subsets. This test statistic corresponds to an F-test of equality of variances. The Goldfeld–Quandt test requires that the data be ordered along a known explanatory variable, from lowest to highest.
If the error structure depends on an unknown or unobserved variable, the Goldfeld–Quandt test provides little guidance. Also, the error variance must be a monotonic function of the specified explanatory variable: for example, when faced with a quadratic function mapping the explanatory variable to the error variance, the Goldfeld–Quandt test may improperly accept the null hypothesis of homoskedastic errors.
Unfortunately, the Goldfeld–Quandt test is not very robust to specification errors. It detects non-homoskedastic errors but cannot distinguish between a heteroskedastic error structure and an underlying specification problem, such as an incorrect functional form or an omitted variable.
Breusch–Pagan Test. After estimating the linear regression \(Y = \beta_0 + \beta_1 X + \epsilon\), calculate the model residuals \(\widehat{\epsilon}_i\). The OLS assumptions state that the residual variance does not depend on the independent variables: \(\mathbb{V}{\rm ar} (\epsilon_i | \mathbf{X} ) = \sigma^2_\epsilon\). If this assumption does not hold, there may be a linear relationship between \(\widehat{\epsilon}_i^2\) and \(X_i\). So, the Breusch–Pagan test is based on the following regression: \[ \widehat{\epsilon}_i^2 = \gamma_0 + \gamma_1 X_i + v_i \] The hypothesis test is: \[ \begin{aligned} H_0&: \gamma_1 = 0 \text{ (residuals are homoskedastic)}\\ H_1&: \gamma_1 \neq 0 \text{ (residuals are heteroskedastic)} \end{aligned} \] It is a chi-squared test, where the test statistic: \[ LM = N \cdot R^2_{\widehat{\epsilon}} \] is distributed as \(\chi^2_1\) under the null. Here \(R^2_{\widehat{\epsilon}}\) is the R-squared of the squared-residual regression. One weakness of the BP test is that it assumes the heteroskedasticity is a linear function of the independent variables. If we fail to reject the null hypothesis, we still do not rule out the possibility of a non-linear relationship between the independent variables and the error variance.
The White Test is more generic than the BP test, as it allows the independent variables to have a non-linear effect on the error variance, for example a combination of linear, quadratic and cross-product terms of the independent variables. It is a more commonly used test for homoskedasticity. The test statistic is calculated the same way as in the BP test: \[ LM = N \cdot R^2_{\widehat{\epsilon}} \] the difference from BP being that the squared-residual model, from which we calculate \(R^2_{\widehat{\epsilon}}\), may be non-linear. A shortcoming of the White test is that it can lose power if the model has many exogenous variables.
We can carry out these tests in R and Python:
##
## Goldfeld-Quandt test
##
## data: mdl
## GQ = 6.9269, df1 = 48, df2 = 48, p-value = 4.407e-10
## alternative hypothesis: variance changes from segment 1 to 2
import statsmodels.stats.diagnostic as sm_diagnostic
# Goldfeld–Quandt Test
print(sm_diagnostic.het_goldfeldquandt(y = y, x = sm.add_constant(x), alternative = "two-sided"))
## (5.330019801721165, 4.3458162211165886e-08, 'two-sided')
##
## studentized Breusch-Pagan test
##
## data: mdl
## BP = 18.864, df = 1, p-value = 1.404e-05
# Breusch–Pagan Test
print(sm_diagnostic.het_breuschpagan(resid = mdl.resid, exog_het = sm.add_constant(x)))
## (27.15440005289671, 1.8783720306329005e-07, 36.53111796891305, 2.7232777548974932e-08)
##
## studentized Breusch-Pagan test
##
## data: mdl
## BP = 21.665, df = 2, p-value = 1.975e-05
The White test is equivalent to the Breusch-Pagan test with an auxiliary model containing all regressors, their squares and their cross-products. As such, there is only the bptest() function in R, which can be leveraged to carry out the White test.
## (27.63658737233078, 9.972207411097485e-07, 18.522820288405395, 1.5369984549894296e-07)
In general, the LM test exaggerates the significance of results in small or moderately large samples, in which case the F-statistic is preferable. As such, both the LM and F statistics, with their \(p\)-values, are provided for the BP test. For the BP and White tests, the first value is the \(LM\) statistic, the second value is the \(p\)-value of the LM statistic, the third value is the \(F\)-test statistic, and the last value is the \(p\)-value of the \(F\)-test. For the GQ test, the first value is the \(F\)-statistic and the second value is the associated \(p\)-value.
We see that in all cases the \(p\)-value is less than \(0.05\), so we reject the null hypothesis and conclude that the residuals are heteroskedastic.
On the other hand, if we were to carry out these tests for a correctly specified model, like the one for the simple linear regression:
##
## Goldfeld-Quandt test
##
## data: mdl1
## GQ = 1.119, df1 = 98, df2 = 98, p-value = 0.579
## alternative hypothesis: variance changes from segment 1 to 2
# Goldfeld–Quandt Test
print(sm_diagnostic.het_goldfeldquandt(y = data_lin["y"], x = sm.add_constant(data_lin["x"]), alternative = "two-sided"))
## (0.7299402182948976, 0.12090366054870887, 'two-sided')
##
## studentized Breusch-Pagan test
##
## data: mdl1
## BP = 0.41974, df = 1, p-value = 0.5171
# Breusch–Pagan Test
print(sm_diagnostic.het_breuschpagan(resid = mdl1.resid, exog_het = sm.add_constant(data_lin["x"])))
## (3.987557828302002, 0.04583745596968394, 4.027991495112321, 0.04611120607782332)
##
## studentized Breusch-Pagan test
##
## data: mdl1
## BP = 0.42016, df = 2, p-value = 0.8105
# White Test
print(sm_diagnostic.het_white(resid = mdl1.resid, exog = sm.add_constant(data_lin["x"])))
## (4.0785282350399354, 0.13012443194407053, 2.050490063862835, 0.13140921798003924)
We see that we do not reject the null hypothesis of homoskedastic residuals (except for the BP test in Python, where the \(p\)-value is close to 0.05; the remaining two tests, however, do not reject the null).
There are also a number of additional heteroskedasticity tests. A discussion of their quality can be found here.
3.8.3.3 Residual Autocorrelation
If \(\mathbb{C}{\rm ov} (\epsilon_i, \epsilon_j) \neq 0\) for some \(i \neq j\), then the errors are correlated. Autocorrelation is frequently encountered in time-series models.
\[ \begin{aligned} Y_t &= \beta_0 + \beta_1 X_t + \epsilon_t \\ \epsilon_t &= \rho \epsilon_{t-1} + u_t,\ |\rho| < 1,\ u_t \sim \mathcal{N}(0, \sigma^2) \end{aligned} \] Then we say that the model has autocorrelated, or serially correlated, errors.
In this case, we have that: \[ \mathbb{C}{\rm ov}(\epsilon_t, \epsilon_{t-1}) = \mathbb{C}{\rm ov}(\rho \epsilon_{t-1} + u_t, \epsilon_{t-1}) = \rho \mathbb{C}{\rm ov}(\epsilon_{t-1},\epsilon_{t-1}) = \rho \sigma^2 \neq 0 \] Estimating the coefficients via OLS and ignoring the violation will still result in unbiased and consistent OLS estimators. However, the estimators are inefficient and the variances of the regression coefficients will be biased.
On the other hand, autocorrelation in errors may be a result of a misspecified model.
fig = plt.figure(num = 10, figsize = (10, 8))
ax = fig.add_subplot(1, 2, 1)
_ = ax.plot(x, y, linestyle = "None", marker = "o", markerfacecolor = "None", color = "black")
_ = ax.plot(x, lm_fit.fittedvalues, color = "red")
_ = plt.title("linear regression")
ax = fig.add_subplot(1, 2, 2)
_ = ax.plot(x[np.argsort(x)], lm_fit.resid[np.argsort(x)], linestyle = "", marker = "o", markerfacecolor = "None", color = "black")
_ = plt.title("residual plot against X")
plt.tight_layout()
plt.show()
There are a number of tests for the presence of autocorrelation:
Durbin–Watson Test for the hypothesis: \[ \begin{aligned} H_0&:\text{the errors are serially uncorrelated}\\ H_1&:\text{the errors follow a first order autoregressive process (i.e. autocorrelation at lag 1)} \end{aligned} \] The test statistic: \[ d = \dfrac{\sum_{i = 2}^N (\widehat{\epsilon}_i - \widehat{\epsilon}_{i-1})^2}{\sum_{i = 1}^N \widehat{\epsilon}_i^2} \] The value of \(d\) always lies between 0 and 4; \(d = 2\) indicates no autocorrelation. If the Durbin–Watson statistic is not close to 2, there is evidence of serial correlation.
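A quick sketch of computing \(d\) directly from its definition, checked against the statsmodels implementation (the residual series is simulated, so \(d\) should be close to 2):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Simulated uncorrelated residuals
np.random.seed(123)
resid = np.random.normal(size = 100)

# d = sum of squared first differences over sum of squares
d = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(d, durbin_watson(resid))
```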
Breusch–Godfrey Test is a more flexible test, covering autocorrelation of higher orders and applicable whether or not the regressors include lags of the dependent variable. Consider the following linear regression: \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \] We then estimate the model via OLS and fit the following model on the residuals \(\widehat{\epsilon}_i\): \[ \widehat{\epsilon}_i = \alpha_0 + \alpha_1 X_i + \rho_1 \widehat{\epsilon}_{i-1} + \rho_2 \widehat{\epsilon}_{i-2} + ... + \rho_p \widehat{\epsilon}_{i-p} + u_i \] and calculate its \(R^2\) (R-squared), then test the hypothesis: \[ \begin{aligned} H_0&:\rho_1 = \rho_2 = ... = \rho_p = 0\\ H_1&:\rho_j \neq 0 \text{ for some } j \end{aligned} \] Under the null hypothesis the test statistic: \[ LM = (N-p)R^2 \sim \chi^2_p \]
We can carry out these tests in R and Python:
##
## Durbin-Watson test
##
## data: lm_fit
## DW = 0.26274, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is not 0
##
## Breusch-Godfrey test for serial correlation of order up to 2
##
## data: lm_fit
## LM test = 75.941, df = 2, p-value < 2.2e-16
## (93.98337188826778, 3.9063405254180396e-21, 749.7890457680423, 2.5642011470769976e-59)
For the Durbin–Watson Test, the DW statistic is returned. The test statistic equals 2 when there is no serial correlation. If it is closer to zero, we have evidence of positive serial correlation; if it is closer to 4, we have more evidence of negative serial correlation. The Breusch–Godfrey Test returns the LM statistic with its corresponding \(p\)-value, as well as the alternative version of the test with the \(F\)-statistic and its corresponding \(p\)-value.
In all test cases (because the \(p\)-values are less than 0.05 and the Durbin-Watson test statistic is far from 2), we reject the null hypothesis of no serial correlation.
On the other hand, if we were to carry out these tests for a correctly specified model, like the one for the simple linear regression:
##
## Durbin-Watson test
##
## data: mdl1
## DW = 2.1222, p-value = 0.4255
## alternative hypothesis: true autocorrelation is not 0
## 1.9377965734765383
##
## Breusch-Godfrey test for serial correlation of order up to 2
##
## data: mdl1
## LM test = 2.4867, df = 2, p-value = 0.2884
## (0.20396667467699192, 0.9030445986699857, 0.10004570053602482, 0.9048422424493952)
In this case the DW statistic is close to 2, and the test \(p\)-values are greater than 0.05, so we fail to reject the null hypothesis of no serial correlation.
Note: there is also the Ljung–Box Test for testing the null hypothesis of no autocorrelation of the residuals.
3.8.3.4 Residual Normality Testing
The normality requirement is necessary if we want to obtain the correct \(p\)-values and critical \(t\)-values when testing the hypothesis that \(H_0: \beta_j = c\), especially for significance testing, with \(c = 0\). Assume that we want to test whether our residuals \(z_1,...,z_N\) come from a normal distribution. The hypothesis can be stated as: \[ \begin{aligned} H_0&:\text{residuals follow a normal distribution}\\ H_1&:\text{residuals do not follow a normal distribution} \end{aligned} \]
There are a number of normality tests, like:
Anderson–Darling Test. The test statistic is calculated as: \[ A^2 = -N - \sum_{i = 1}^N \dfrac{2i-1}{N}\left[ \log F(z_{(i)}) + \log\left(1 - F(z_{(N+1-i)}) \right)\right] \] where \(z_{(i)}\) are the ordered data and \(F(\cdot)\) is the cumulative distribution function (cdf) of the distribution being tested (for the univariate regression residuals, we are usually interested in testing for the normal distribution). The test statistic is compared against the critical values for the normal distribution. Empirical testing indicates that the Anderson–Darling test is not quite as good as Shapiro–Wilk, but is better than other tests.
Shapiro–Wilk Test. The test statistic is: \[ W = \dfrac{\left(\sum_{i = 1}^N a_i z_{(i)} \right)^2}{\sum_{i = 1}^N (z_i - \overline{z})^2} \] where \(z_{(i)}\) is the \(i\)-th smallest value in the sample (i.e. the data are ordered). The \(a_i\) values are calculated using the means, variances and covariances of the \(z_{(i)}\). \(W\) is compared against tabulated values of this statistic's distribution. Small values of \(W\) lead to the rejection of the null hypothesis. Monte Carlo simulations have found that Shapiro–Wilk has the best power for a given significance level, followed closely by Anderson–Darling, when comparing the Shapiro–Wilk, Kolmogorov–Smirnov, Lilliefors and Anderson–Darling tests.
Kolmogorov–Smirnov Test. The test statistic is given by: \[ D = \max\{ D^+; D^-\} \] where: \[ \begin{aligned} D^+ &= \max_i \left( \dfrac{i}{N} - F(z_{(i)})\right)\\ D^- &= \max_i \left( F(z_{(i)}) - \dfrac{i-1}{N} \right) \end{aligned} \] where \(F(\cdot)\) is the theoretical cdf of the distribution being tested (for the univariate regression residuals, we are usually interested in testing for the normal distribution). The Lilliefors Test is based on the Kolmogorov–Smirnov Test, as a special case of it for the normal distribution. For the normal distribution case, the test statistic is compared against the critical values from a normal distribution in order to determine the \(p\)-value.
Cramer–von Mises Test is an alternative to the Kolmogorov–Smirnov test. The test statistic: \[ W = N\omega^2 = \dfrac{1}{12N} + \sum_{i = 1}^N \left[ \dfrac{2i-1}{2N} - F(z_{(i)}) \right]^2 \] If this value is larger than the tabulated value, then the hypothesis that the data came from the distribution \(F(\cdot)\) can be rejected.
Jarque–Bera Test (valid for large samples). The statistic is calculated as: \[ JB = \dfrac{N-k+1}{6} \left(S^2 + \dfrac{(C - 3)^2}{4}\right) \] where \[ \begin{aligned} S &= \dfrac{\dfrac{1}{N}\sum_{i = 1}^N (z_i - \overline{z})^3}{\left( \dfrac{1}{N}\sum_{i = 1}^N (z_i - \overline{z})^2 \right)^{3/2}}= \dfrac{\widehat{\mu}_3}{\widehat{\sigma}^3}\\ C &= \dfrac{\dfrac{1}{N}\sum_{i = 1}^N (z_i - \overline{z})^4}{\left( \dfrac{1}{N}\sum_{i = 1}^N (z_i - \overline{z})^2 \right)^{2}} = \dfrac{\widehat{\mu}_4}{\widehat{\sigma}^4}\\ \end{aligned} \] \(N\) is the sample size, \(S\) is the skewness, \(C\) is the kurtosis and \(k\) is the number of regressors (i.e. the number of different independent variables \(X\), with \(k = 1\) outside a regression context).
If the data come from a normal distribution, the \(JB\) statistic has a chi-squared distribution with two degrees of freedom, \(\chi^2_2\).
Chi-squared (Goodness-of-Fit) Test. The chi-squared test is an alternative to the Anderson–Darling and Kolmogorov–Smirnov goodness-of-fit tests. The chi-squared goodness-of-fit test can be applied to discrete distributions, such as the binomial and the Poisson, while the Kolmogorov–Smirnov and Anderson–Darling tests are restricted to continuous distributions. This is not a restriction per se, since for non-binned data you can simply calculate a histogram before carrying out the chi-squared test. However, the value of the chi-squared test statistic depends on how the data is binned. Another disadvantage of the chi-squared test is that it requires a sufficient sample size in order for the chi-squared approximation to be valid. The test statistic: \[ \chi^2 = \sum_{i = 1}^k \dfrac{(O_i - E_i)^2}{E_i} \] where \(k\) is the number of (non-empty) bins, \(O_i\) is the observed frequency for bin \(i\) (i.e. the number of observations in bin \(i\)) and \(E_i\) is the expected (theoretical) frequency for bin \(i\), which is calculated as: \[ E_i = N\left(F(z_u) - F(z_l)\right) \] where \(N\) is the total sample size, \(F(\cdot)\) is the cdf of the distribution being tested, and \(z_u\) and \(z_l\) are the upper and lower limits of bin \(i\). The chi-squared statistic can then be used to calculate a \(p\)-value by comparing the value of the statistic to a chi-squared distribution with \(k - c\) degrees of freedom, where \(c\) is the number of estimated parameters for the distribution plus one (for the normal distribution, with mean and standard deviation parameters, \(c = 2 + 1 = 3\)).
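With \(k = 1\) (i.e. outside a regression context), the Jarque–Bera statistic can be computed directly from the sample moments and checked against `scipy` (the data here are simulated):

```python
import numpy as np
from scipy import stats

np.random.seed(123)
z = np.random.normal(size = 500)

# Sample skewness S and kurtosis C, using biased (1/N) moment estimates
m2 = np.mean((z - z.mean()) ** 2)
S = np.mean((z - z.mean()) ** 3) / m2 ** 1.5
C = np.mean((z - z.mean()) ** 4) / m2 ** 2

# With k = 1: JB = N / 6 * (S^2 + (C - 3)^2 / 4) ~ chi^2_2 under normality
JB = len(z) / 6 * (S ** 2 + (C - 3) ** 2 / 4)
p_value = stats.chi2.sf(JB, df = 2)
print(JB, p_value)
```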
We will carry out the normality tests on the log-linear DGP, with an incorrectly specified linear model:
#
# AndersonDarling Test
#print(goftest::ad.test(mdl3$residuals, null = "pnorm"))
print(nortest::ad.test(mdl3$residuals))
##
## Anderson-Darling normality test
##
## data: mdl3$residuals
## A = 2.787, p-value = 4.883e-07
# May need to install through terminal: pip install scikitgof
import skgof as skgof
# AndersonDarling Test
print(sm_diagnostic.normal_ad(x = mdl3.resid))
## (2.742775345091559, 6.263145991939627e-07)
##
## Shapiro-Wilk normality test
##
## data: mdl3$residuals
## W = 0.90394, p-value = 4.514e-10
## (0.9430360198020935, 4.2506789554863644e-07)
##
## One-sample Kolmogorov-Smirnov test
##
## data: mdl3$residuals
## D = 0.50971, p-value < 2.2e-16
## alternative hypothesis: two-sided
# KolmogorovSmirnov Test
print(sm_diagnostic.kstest_normal(x = mdl3.resid, dist = "norm")) # statistic and p-value
## (0.08586356720817684, 0.0010846493406071833)
# Cramer–von Mises Test
#print(goftest::cvm.test(mdl3$residuals, null = "pnorm"))
print(nortest::cvm.test(mdl3$residuals))
##
## Cramer-von Mises normality test
##
## data: mdl3$residuals
## W = 0.37085, p-value = 5.297e-05
# Cramer–von Mises test
print(skgof.cvm_test(data = mdl3.resid,
dist = stats.norm(0, np.sqrt(np.var(mdl3.resid)))))
## GofResult(statistic=0.39458042141994304, pvalue=0.07458108247027562)
##
## Jarque Bera Test
##
## data: mdl3$residuals
## X-squared = 268.92, df = 2, p-value < 2.2e-16
# Jarque–Bera Test
print(sm_tools.jarque_bera(mdl3.resid)) # JB statistic, p-value, skew and kurtosis
## (27.00245322122904, 1.3692784843495163e-06, 0.8659632801063863, 3.490637112922614)
Note that the Jarque–Bera tests in these R and Python packages do not allow us to control for the fact that we are carrying out the tests on regression residuals. In other words, they assume that \(k = 1\).
In this case the \(p\)-value is less than 0.05 for most tests, so we reject the null hypothesis and conclude that the residuals are not normally distributed.
For a correctly specified loglinear model:
# AndersonDarling Test
#print(goftest::ad.test(mdl3_correct$residuals, null = "pnorm"))
print(nortest::ad.test(mdl3_correct$residuals))
##
## Anderson-Darling normality test
##
## data: mdl3_correct$residuals
## A = 0.39645, p-value = 0.3665
## (0.1576307896975777, 0.9518802524386675)
##
## Shapiro-Wilk normality test
##
## data: mdl3_correct$residuals
## W = 0.99066, p-value = 0.2225
## (0.9960342049598694, 0.8864037990570068)
# Kolmogorov-Smirnov Test
# (note: ks.test compares against a standard normal; the residuals are not standardized)
print(ks.test(mdl3_correct$residuals, y = "pnorm", alternative = "two.sided"))
##
## One-sample Kolmogorov-Smirnov test
##
## data: mdl3_correct$residuals
## D = 0.35043, p-value < 2.2e-16
## alternative hypothesis: two-sided
# KolmogorovSmirnov Test
print(sm_diagnostic.kstest_normal(x = mdl3_correct.resid, dist = "norm")) # statistic and p-value
## (0.031206131899122358, 0.9168591544429977)
# Cramer–von Mises Test
#print(goftest::cvm.test(mdl3_correct$residuals, null = "pnorm"))
print(nortest::cvm.test(mdl3_correct$residuals))
##
## Cramer-von Mises normality test
##
## data: mdl3_correct$residuals
## W = 0.058543, p-value = 0.3937
# Cramer–von Mises test
print(skgof.cvm_test(data = mdl3_correct.resid,
dist = stats.norm(0, np.sqrt(np.var(mdl3_correct.resid)))))
## GofResult(statistic=0.02066788690314537, pvalue=0.996414601640607)
##
## Jarque Bera Test
##
## data: mdl3_correct$residuals
## X-squared = 4.795, df = 2, p-value = 0.09095
# Jarque–Bera Test
print(sm_tools.jarque_bera(mdl3_correct.resid)) # JB statistic, p-value, skew and kurtosis
## (0.4627520589444065, 0.793441052712381, 0.11645049039934306, 2.9641199189474325)
In this case, the \(p\)-values for most of the tests are greater than 0.05, so we do not reject the null hypothesis of normality.
The more tests we carry out, the more certain we can be about whether the residuals are (or are not) from a normal distribution.
For now, focus on at least one test from each category: namely, the Breusch–Pagan Test for homoskedasticity, the Durbin–Watson Test for autocorrelation and the Shapiro–Wilk Test for normality.
3.8.3.5 Standardized Residuals
When we compare residuals for different observations, we want to take into account that their variances may differ (as we have shown in eq. (3.12)). One way to account for this is to divide the residuals by an estimate of their standard deviation. This results in the standardized residuals: \[ s_i = \dfrac{\widehat{\epsilon}_i}{\widehat{\sigma}\sqrt{1 - h_{ii}}} \] where \(h_{ii}\) is the \(i\)-th diagonal element of \(\mathbf{H}\). Standardized residuals are useful for detecting outliers. Generally, any observation with a standardized residual greater than 2 in absolute value should be examined more closely.