3.8 Goodness-Of-Fit

Having the following model: \[ Y_i = \beta_0+ \beta_1 X_i + \epsilon_i,\quad i = 1,...,N \] allows us to:

  1. Explain how the dependent variable \(Y_i\) changes if the independent variable \(X_i\) changes.
  2. Predict the value of \(\widetilde{Y}\), given a value of \(\widetilde{X}\).

In order to have an accurate prediction of \(Y\), we hope that the independent variable \(X\) helps us explain as much variation in \(Y\) as possible (hence why \(X\) is usually referred to as an explanatory variable). Ideally, the variation in \(X\) will help explain the variation in \(Y\). Consequently, we would like a way to measure just how good our model is - how much of the variation in \(Y\) can be explained by the variation in \(X\) using our model. In other words, we need a goodness-of-fit measure.

Another way to look at it is - a goodness-of-fit measure aims to quantify how well the estimated model fits the data. Fortunately, there are many ways to measure the goodness-of-fit of the estimated model.

3.8.1 Model Residuals: RSS, ESS and TSS

We can separate our univariate regression into two components: \[ Y_i = \mathbb{E}(Y_i|X_i) + \epsilon_i \] where \(\mathbb{E}(Y_i|X_i) = \beta_0 + \beta_1X_i\) is the explainable, systematic, component of our model and \(\epsilon_i\) is the random, unsystematic, unexplainable component of our model.

In practical applications, we do not observe the true systematic and the true random components, but we can use OLS to estimate the unknown parameters. Then our regression can be written as: \[ Y_i = \widehat{Y}_i + (Y_i - \widehat{Y}_i) = \widehat{Y}_i + \widehat{\epsilon}_i \] where \(\widehat{Y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 X_i\) and \(\widehat{\epsilon}_i = Y_i - \widehat{Y}_i\).

Because the least squares fitted regression passes through the point of sample means \((\overline{X}, \overline{Y})\), if we subtract the sample mean of \(Y\), \(\overline{Y} = \dfrac{1}{N} \sum_{i = 1}^N Y_i\), from both sides of the equation, we can rewrite our model in terms of differences (i.e. variation) from the process mean: \[ Y_i - \overline{Y} = (\widehat{Y}_i -\overline{Y}) + \widehat{\epsilon}_i \] This expression states that the difference between \(Y_i\) and its sample mean, \(\overline{Y}\), consists of an explained part, \((\widehat{Y}_i -\overline{Y})\), and an unexplained part, \(\widehat{\epsilon}_i\). Taking the squares of both sides and summing across \(i = 1,...,N\) yields: \[ \begin{aligned} \sum_{i = 1}^N \left(Y_i - \overline{Y} \right)^2 &= \sum_{i = 1}^N \left( (\widehat{Y}_i -\overline{Y}) + \widehat{\epsilon}_i \right)^2 \\ &=\sum_{i = 1}^N \left( \widehat{Y}_i -\overline{Y}\right)^2 + 2 \sum_{i = 1}^N \left( \widehat{Y}_i -\overline{Y}\right)\widehat{\epsilon}_i + \sum_{i = 1}^N \widehat{\epsilon}^2_i \end{aligned} \] Using the fact that: \[ \begin{aligned} \sum_{i = 1}^N \left( \widehat{Y}_i - \overline{Y}\right)\widehat{\epsilon}_i &= \sum_{i = 1}^N \left( \widehat{\beta}_0 + \widehat{\beta}_1 X_i\right)\widehat{\epsilon}_i - \overline{Y} \sum_{i = 1}^N \widehat{\epsilon}_i \\ &= \widehat{\beta}_0 \sum_{i = 1}^N \widehat{\epsilon}_i + \widehat{\beta}_1 \sum_{i = 1}^N X_i \widehat{\epsilon}_i - \overline{Y} \sum_{i = 1}^N \widehat{\epsilon}_i \\ &= 0 \end{aligned} \] (which follows from the OLS normal equations: \(\sum_{i = 1}^N \widehat{\epsilon}_i = 0\) and \(\sum_{i = 1}^N X_i \widehat{\epsilon}_i = 0\)),

we can rewrite the equality as: \[\begin{equation} \sum_{i = 1}^N \left(Y_i - \overline{Y} \right)^2 = \sum_{i = 1}^N \left( \widehat{Y}_i -\overline{Y}\right)^2 + \sum_{i = 1}^N \widehat{\epsilon}^2_i \tag{3.11} \end{equation}\]

This equation gives us a decomposition of the total sample variation, into explained and unexplained components.

Define the following:

  • Total Sum of Squares (SST or TSS) as: \[\text{TSS} = \sum_{i = 1}^N (Y_i - \overline{Y})^2\] It is a measure of total variation in \(Y\) around the sample mean.
  • Explained Sum of Squares (ESS) as: \[\text{ESS} = \sum_{i = 1}^N (\widehat{Y}_i - \overline{Y})^2\] It is the part of the total variation in \(Y\) around the sample mean, that is explained by our regression. This is sometimes called the model sum of squares or sum of squares due to regression (which is confusingly also abbreviated as “SSR”).
  • Residual Sum of Squares (SSR or RSS) as: \[\text{RSS} = \sum_{i = 1}^N \widehat{\epsilon}_i^2\] It is the part of the total variation in \(Y\) around the sample mean that is not explained by our regression. This is sometimes called the unexplained sum of squares or the sum of squared estimate of errors (SSE).

Then, (3.11) can be written simply as: \[ \text{TSS} = \text{ESS} + \text{RSS} \]

3.8.2 R-squared, \(R^2\)

It is often useful to compute a number that summarizes how well the OLS regression fits the data. This measure is called the coefficient of determination, \(R^2\), which is the ratio of explained variation, compared to the total variation, i.e. the proportion of variation in \(Y\) that is explained by \(X\) in our regression model: \[ R^2 = \dfrac{\text{ESS}}{\text{TSS}} = 1 - \dfrac{\text{RSS}}{\text{TSS}} \]

  • The closer \(R^2\) is to \(1\), the closer the sample values of \(Y_i\) are to the fitted values \(\widehat{Y}_i\) of our regression. If \(R^2 = 1\), then all the sample data fall exactly on the fitted regression line. In such a case our model would be a perfect fit for our data.
  • If the sample data of \(Y\) and \(X\) do not have a linear relationship, then \(R^2 = 0\) for a univariate regression.
  • For values \(0 < R^2 < 1\), \(R^2\) is interpreted as the proportion of the variation in \(Y\) around its mean that is explained by the regression model. For example, \(R^2 = 0.17\) means that \(17\%\) of the variation in \(Y\) is explained by \(X\).

When comparing the \(\text{RSS}\) of different models, we want to choose the model that better fits our data. If we want to choose a model based on its \(R^2\) value, we should note a couple of things:

  • \(R^2\) comparison is not valid for comparing models that do not have the same transformation of the dependent variable. For example, two models - one with \(Y\) and the other with \(\log(Y)\) as the dependent variable - cannot be compared via \(R^2\).

  • \(R^2\) does not measure the predictive power of the model. For example, a linear model may be a good fit for the data, but its forecasts may not make economic sense (e.g. forecasting a negative wage for low values of years in education via a simple linear model).

  • \(R^2\) is based on the sample data, so it says nothing about whether our model is close to the true population DGP.

  • \(R^2\) may be low if: the error variance, \(\sigma^2\), is large; or if the variance of \(X\) is small.

  • \(R^2\) may be large even if the model is wrong. For example, even if the true relationship is non-linear, a linear model may have a larger \(R^2\), compared to the quadratic, or even the log-linear model.

  • On the other hand, the goodness-of-fit of the model does not depend on the unit of measurement of our variables (e.g. dollars vs thousands of dollars). Furthermore, comparisons of \(R^2\) are valid, if we compare a simple linear model to a linear-log model, as they both have the same dependent variable, \(Y\).

In any case, a model should not be chosen only on the basis of model fit with \(R^2\) as the criterion.

Example 3.30 We will generate a univariate linear regression with \(\beta_0 = 2\), \(\beta_1 = 0.4\), \(N = 100\) and \(X\) - an equally spaced sequence on the interval \(\left[0, 20 \right]\).

Next, we will estimate the coefficients. We will use the built-in functions, as we have already plentifully shown how the coefficients, standard errors, fitted values and residuals can be calculated manually:

Next, we will use the residuals to calculate the \(\text{TSS}\), \(\text{RSS}\) and \(R^2\):

Which we can also conveniently extract from the estimated model objects:

## 0.5200895667266314

Finally, we may look at the full summary output of our models:

## Call:
## lm(formula = y ~ x)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9071 -1.1047 -0.0692  1.2970  4.1897 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.93221    0.36309   5.322 6.52e-07 ***
## x            0.42486    0.03137  13.546  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.829 on 98 degrees of freedom
## Multiple R-squared:  0.6518, Adjusted R-squared:  0.6483 
## F-statistic: 183.5 on 1 and 98 DF,  p-value: < 2.2e-16
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                      y   R-squared:                       0.520
## Model:                            OLS   Adj. R-squared:                  0.515
## Method:                 Least Squares   F-statistic:                     106.2
## Date:                Tue, 13 Oct 2020   Prob (F-statistic):           2.63e-17
## Time:                        21:39:35   Log-Likelihood:                -223.27
## No. Observations:                 100   AIC:                             450.5
## Df Residuals:                      98   BIC:                             455.8
## Df Model:                           1                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          2.0262      0.452      4.478      0.000       1.128       2.924
## x1             0.4028      0.039     10.306      0.000       0.325       0.480
## ==============================================================================
## Omnibus:                        2.753   Durbin-Watson:                   1.975
## Prob(Omnibus):                  0.252   Jarque-Bera (JB):                1.746
## Skew:                           0.035   Prob(JB):                        0.418
## Kurtosis:                       2.356   Cond. No.                         23.1
## ==============================================================================
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

and see a variety of familiar statistics.

Furthermore, if we decide to scale the variables, by say, dividing by \(10\), then the \(R^2\) would be unchanged:

Finally, we will plot \(Y_i - \overline{Y}\), \(\widehat{Y}_i - \overline{Y}\) and \(\widehat{\epsilon}_i\) for a better visual understanding of what \(\text{TSS}\), \(\text{ESS}\) and \(\text{RSS}\) measure and their impact when calculating \(R^2\):

Example 3.31 We will now present an example of when a linear model may prove to be a better fit in terms of its \(R^2\), even when the data has a nonlinear relationship.

Cases When \(R^2\) is Negative

A case of a negative \(R^2\) can arise when:

  1. The predictions that are being compared to the corresponding outcomes have not been derived from a model-fitting procedure using those data. For example, if we try to guess the coefficient values, or assume that the coefficients of models estimated on similar data, or in similar countries, would be the same for our data;

  2. We do not include an intercept, \(\beta_0\) in our linear regression model;

  3. When a non-linear function is used to fit the data;

In cases where negative \(R^2\) values arise, the mean of the data provides a better fit to the outcomes than the fitted model values, according to this criterion, \(R^2\). We will later see that there is a variety of alternative criteria for evaluating the accuracy of a model.

We will look at each case separately.

Fitted values are not derived from the data, which is being analysed

Let’s say that we use a model, which was fitted on the following dataset:

The estimated model of such a dataset is:

Now, assume that we are analyzing a different data sample. Let’s say that our data sample comes from the following underlying DGP:

However, we make the incorrect assumption that our data sample comes from the same population as the previous data. This leads us to calculating the fitted values, residuals and \(R^2\) using pre-estimated coefficients:

Visual inspection reveals that our assumption that an existing model of one dataset is good enough for our dataset was incorrect - it is clear that our dataset is from a different DGP. For comparison, we also plot the process mean.

If we compare models from datasets of different countries, different firms, we would run into such problems. For example, if one firm is very large, while another is relatively new and small - making an assumption that a model on the data of one firm can be applied to the data of this new firm would be incorrect - some variables may have similar effects, but they would most likely not be the same in magnitude.

Regression without an intercept

We will generate data from a model with an intercept.

But we will estimate the parameters of a regression model without an intercept. The estimated coefficient, fitted values and residuals are calculated as follows (take note that we do not include a constant in the independent variable matrix):

Which results in the following negative \(R^2\):

For cases when a model does not have an intercept, \(R^2\) is usually computed as: \[ R^2 = 1 - \dfrac{RSS}{\sum_{i = 1}^N Y_i^2} \] where the denominator acts as if we assume that \(\mathbb{E}(Y) = 0\) (and hence we assume that \(\overline{Y} \approx 0\)).

Applying this expression for the \(R^2\) yields:

Furthermore, this value of \(R^2\) is (silently) applied in the built-in OLS estimation functions:

## 0.9129369975970213

Unfortunately, if \(R^2\) is calculated in this way, it ignores a very important fact about our model - a negative \(R^2\) indicates that the regression is actually worse than a simple average of the process. In fact, the modified \(R^2\) shows a very high value - the complete opposite of what we would expect to see in such a situation.

Visually, we can see that our model provides quite a poor fit. For comparison, we also plot the process mean:

So, while the modified \(R^2\) seems high, in reality the model provides a poor fit for the data sample.

A Nonlinear function is used to fit the data with large error variance

As an example we will simulate data from the following log-linear model: \(\log(Y) = \beta_0 + \beta_1 X + \epsilon\), where \(\beta_0 = 0.2\), \(\beta_1 = 2\), \(N = 100\), \(\epsilon \sim \mathcal{N}(0, 1)\), and \(X\) is a random sample with replacement from an interval from \(0\) to \(0.5\), equally spaced into \(N\) elements.

This data has a small variation in \(X\) and a (relative to the variance in \(\log(Y)\) and in \(\epsilon\)) large error variance:
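A sketch of this simulation (our own seed, so the printed variances differ from the outputs below, which come from separate R and Python runs):

```python
import numpy as np

np.random.seed(123)
N = 100
beta_0, beta_1 = 0.2, 2
# X: a random sample with replacement from an equally spaced grid on [0, 0.5]
x = np.random.choice(np.linspace(0, 0.5, num=N), size=N, replace=True)
e = np.random.normal(loc=0, scale=1, size=N)
log_y = beta_0 + beta_1 * x + e
y = np.exp(log_y)

# Small Var(X) relative to the variance of log(Y) and of the error term:
print(np.var(x), np.var(log_y), np.var(e))
```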

## [1] 0.02231239
## [1] 0.8985592
## [1] 0.8482849
## 0.023800418324660753
## 1.1624704319571753
## 1.0245932457731062

If we estimate the correct model and look at the coefficients and \(R^2\):

We see that the \(R^2\) is very small. Furthermore, if we were to back-transform our fitted values and calculate \(R^2\):

We see that it is even worse. The plot of the fitted values:

So, a large variance of the error term, or a small variance of the independent variable, will result in a lower \(R^2\) value overall. Furthermore, back-transforming would likely result in a lower \(R^2\) value.

Finally, plotting the residuals against the fitted values indicates that the residuals of \(Y - \exp(\widehat{\log(Y)})\) do not have the same properties as the log-linear model residuals.

Correlation Analysis

The correlation coefficient between \(X\) and \(Y\) is defined as: \[ \rho_{X,Y} = \dfrac{\mathbb{C}{\rm ov} (X, Y)}{\sqrt{\mathbb{V}{\rm ar} (X)} \sqrt{\mathbb{V}{\rm ar} (Y)}} = \dfrac{\sigma_{X,Y}}{\sigma_X \sigma_Y} \] The sample correlation is calculated as: \[ r_{X, Y} = \dfrac{\widehat{\sigma}_{X,Y}}{\widehat{\sigma}_X \widehat{\sigma}_Y} = \dfrac{\dfrac{1}{N-1} \sum_{i = 1}^N (X_i - \overline{X})(Y_i - \overline{Y})}{\sqrt{\dfrac{1}{N-1} \sum_{i = 1}^N (X_i - \overline{X})^2}\sqrt{\dfrac{1}{N-1} \sum_{i = 1}^N (Y_i - \overline{Y})^2}} \] The sample correlation \(-1 \leq r_{X,Y} \leq 1\) measures the strength of the linear association between the sample values of \(X\) and \(Y\).

Correlation Analysis and \(R^2\)

There is a relationship between \(R^2\) and \(r_{X,Y}\):

  1. \(R^2 = r_{X,Y}^2\). So, \(R^2\) can be computed as the square of the sample correlation between \(Y\) and \(X\).
  2. \(R^2 = r_{Y, \widehat{Y}}^2\). So, \(R^2\) can be computed as the square of the sample correlation between \(Y\) and its fitted values \(\widehat{Y}\). As such, \(R^2\) measures the linear association, the goodness-of-fit, between the sample data \(Y\), and its predicted values \(\widehat{Y}\). Because of this, \(R^2\) is sometimes called a measure of goodness-of-fit.

## 0.12272111156714938
## [1] 0.06068536

A General (pseudo) \(R\)-squared Measure, \(R^2_g\)

As we have seen, we may need to back-transform the fitted values of our dependent variable. Then, we can calculate a general (pseudo) measure of \(R^2\): \[ R^2_g = r_{Y, \widehat{Y}}^2 = \mathbb{C}{\rm orr}(Y, \widehat{Y})^2 \]

In our previous example we can calculate \(R^2_g\) for both the log and the back-transformed values:

A way to look at it is that \(R^2\) measures the variation explained by our model, whereas \(R^2_g\) measures the variance explained by our model. In a linear regression, the two definitions are the same, as long as the intercept coefficient is included in the model.

3.8.3 Regression Diagnostics

In many cases while carrying out statistical/econometric analysis, we are not sure whether we have correctly specified our model. As we have seen, the \(R^2\) can be artificially small (or large), regardless of the specified model. As such, there are a number of regression diagnostics and specification tests.

For the univariate regression, the most crucial assumptions come from (UR.3) and (UR.4), namely:

  • \(\mathbb{V}{\rm ar} (\epsilon_i | \mathbf{X} ) = \sigma^2_\epsilon,\ \forall i = 1,..,N\)
  • \(\mathbb{C}{\rm ov} (\epsilon_i, \epsilon_j) = 0,\ i \neq j\)
  • \(\boldsymbol{\varepsilon} | \mathbf{X} \sim \mathcal{N} \left( \mathbf{0}, \sigma^2_\epsilon \mathbf{I} \right)\)

We note that the residuals are defined as: \[ \begin{aligned} \widehat{\boldsymbol{\varepsilon}} &= \mathbf{Y} - \widehat{\mathbf{Y}} \\ &= \mathbf{Y} - \mathbf{X} \widehat{\boldsymbol{\beta}} \\ &= \mathbf{Y} - \mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{Y} \\ &= \left[ \mathbf{I} - \mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \right]\mathbf{Y} \end{aligned} \] Hence, for the OLS residuals (i.e. not the true unobserved errors) the expected value of the residuals is still zero: \[ \begin{aligned} \mathbb{E} \left( \widehat{\boldsymbol{\varepsilon}}| \mathbf{X} \right) &= \mathbb{E} \left( \left[ \mathbf{I} - \mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \right]\mathbf{Y} | \mathbf{X} \right)\\ &= \mathbb{E} \left( \left[ \mathbf{I} - \mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \right] \left( \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} \right) | \mathbf{X} \right) \\ &= \mathbf{X} \boldsymbol{\beta} + \mathbb{E} (\boldsymbol{\varepsilon} | \mathbf{X}) - \mathbf{X} \boldsymbol{\beta} - \mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbb{E} (\boldsymbol{\varepsilon} | \mathbf{X}) \\ &= \mathbf{0} \end{aligned} \] For simplicity, let \(\widehat{\boldsymbol{\varepsilon}} = \left[ \mathbf{I} - \mathbf{H}\right]\mathbf{Y}\), where \(\mathbf{H} = \mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top\).

Consequently, the variance-covariance matrix of the residuals is:

\[\begin{equation} \begin{aligned} \mathbb{V}{\rm ar} \left( \widehat{\boldsymbol{\varepsilon}}| \mathbf{X}\right) &= \mathbb{V}{\rm ar} \left( \left[ \mathbf{I} - \mathbf{H}\right]\mathbf{Y}|\mathbf{X}\right) \\ &= \left[ \mathbf{I} - \mathbf{H}\right]\mathbb{V}{\rm ar} \left( \mathbf{Y} | \mathbf{X}\right) \left[ \mathbf{I} - \mathbf{H}\right]^\top \\ &= \left[ \mathbf{I} - \mathbf{H}\right] \sigma^2 \left[ \mathbf{I} - \mathbf{H}\right]^\top \\ &= \sigma^2 \left[ \mathbf{I} - \mathbf{H}^\top - \mathbf{H} + \mathbf{H} \mathbf{H}^\top\right] \\ &= \sigma^2 \left[ \mathbf{I} - \mathbf{H}^\top - \mathbf{H} + \mathbf{H}^\top\right] \\ &= \sigma^2 \left[ \mathbf{I} - \mathbf{H}\right] \end{aligned} \tag{3.12} \end{equation}\]

This result shows an important distinction of the residuals from the errors - the residuals may have different variances (which are the diagonal elements of \(\mathbb{V}{\rm ar} \left( \widehat{\boldsymbol{\varepsilon}}| \mathbf{X}\right)\)), even if the true errors (which affect the process \(\mathbf{Y}\)) all have the same variance \(\sigma^2\).
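We can see this by computing the diagonal of \(\sigma^2 \left[ \mathbf{I} - \mathbf{H}\right]\) directly; a sketch with a hypothetical univariate design matrix:

```python
import numpy as np

np.random.seed(123)
N, sigma = 50, 1.0
x = np.linspace(0, 10, num=N)
X = np.column_stack((np.ones(N), x))  # design matrix with an intercept column

# Hat matrix H = X(X'X)^{-1}X'
H = X @ np.linalg.inv(X.T @ X) @ X.T

# Var(e_hat | X) = sigma^2 (I - H); the residual variances are its diagonal
resid_var = sigma**2 * np.diag(np.eye(N) - H)

# The variances differ across observations, despite a constant sigma^2:
print(resid_var.min(), resid_var.max())
```

Here `np.trace(H)` equals the number of estimated parameters (two), a standard property of the hat matrix.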

The variance for the fitted values is smallest for observations near the mean and the largest for values, which deviate the most from the process mean.

Residual Diagnostic Plots

One way to examine the adequacy of the model is to visualize the residuals. There are a number of ways to do this:

  • Plotting the residuals \(\widehat{\epsilon}_i\) against the fitted values \(\widehat{Y}_i\);
  • Plotting the residuals \(\widehat{\epsilon}_i\) against \(X_i\)
  • Plotting the residual Q-Q plot, histogram or boxplot.

In all cases, if there are no violations of our (UR.2) or (UR.3) assumptions - the plots should reveal no patterns. The residual histogram, Q-Q plot should be approximately normal so that our assumption (UR.4) holds.

As we are not guaranteed to specify a correct functional form, residual plots offer a great insight on what possible functional form we may have missed.

We should note that when we have multiple models, it is only meaningful to compare the residuals of models with the same dependent variable. For example, comparing the residuals of a linear-linear model (with \(Y\)) and of a log-linear model (with \(\log(Y)\)) is not meaningful, as they are on different value scales.

Transforming the dependent or the independent variables may help to alleviate some of the problems of the residuals:

  • If nonlinearities are present in the residual plots - we must first account for them, and only afterwards check whether the errors have a constant variance.
  • Transforming \(\mathbf{Y}\) primarily aims to help with problems with the error terms (and may help with non-linearity);
  • Transforming \(\mathbf{X}\) primarily aims to help with correcting for non-linearity;
  • Sometimes transforming \(\mathbf{X}\) is enough to account for non-linearity and have normally distributed errors, while transforming \(\mathbf{Y}\) may account for non-linearity but might make the errors non-normally distributed.
  • Other times, transforming \(\mathbf{X}\) does not help account for the nonlinear relationship at all;

Remember that the Q-Q plot plots quantiles of the data versus quantiles of a distribution. If the observations come from a normal distribution we would expect the observed order statistics plotted against the expected (theoretical) order statistics to form an approximately straight line.

Example 3.32 We will generate four different models:
  • a simple linear model: \(Y = \beta_0 + \beta_1 X + \epsilon\);
  • a log-linear model: \(\log(Y) = \beta_0 + \beta_1 X + \epsilon\);
  • a linear-log model: \(Y = \beta_0 + \beta_1 \log(X) + \epsilon\);
  • a log-log model: \(\log(Y) = \beta_0 + \beta_1 \log(X) + \epsilon\);

For each case, we will estimate a simple linear model on the data and examine the residual plots. For simplicity, we will use the same \(X_i\) and the same \(\epsilon_i \sim \mathcal{N}(0, 0.2^2)\), \(i = 1,...,N\) with \(N = 200\), \(\beta_0 = 1\) and \(\beta_1 = 2\).

Next, we will estimate the simple linear regression for each dataset:

We can plot the fitted values alongside the actual data:

Then, the different residual plots are as follows:

We see that the linear model for the dataset, which is generated from a simple linear DGP, has residuals which appear to be random - we do not see any non-random patterns in the scatterplots. Furthermore, the histogram and Q-Q plot indicate that the residuals may be from a normal distribution. On the other hand, a simple linear model does not fit the data well, if the data is sampled from a non-linear DGP - we see clear patterns in the residual scatter plots, as well as non-normality.

For comparison, if we were to fit the correct models, we would have the following plots:

Then the residuals are normally distributed and do not have any patterns or change in variance.

Residual Heteroskedasticity

If \(\mathbb{V}{\rm ar} (\epsilon_i | \mathbf{X} ) = \sigma^2_\epsilon,\ \forall i = 1,..,N\), we say that the residuals are homoskedastic. If this assumption is violated, we say that the residuals are heteroskedastic - that is, their variance is not constant throughout observations.

The consequences of heteroskedasticity are as follows:

  • OLS parameters remain unbiased;
  • OLS estimates are no longer efficient (i.e. they no longer have the smallest variance). The reason for this is that OLS gives equal weight to all observations in the data, when in fact, observations with a larger error variance contain less information than observations with a smaller error variance;
  • The variance estimate of the residuals is biased, and hence the standard errors are biased. This in turn leads to a bias in test statistics and confidence intervals.
  • Because of the standard error bias, we may fail to reject the null hypothesis that \(\beta_i = 0\) in our estimated model, when the null hypothesis is actually false (i.e. we may make a Type II error).

There are a few possible corrections to account for heteroskedasticity:

  • Take logarithms of the data; this may help to linearize the data and, in turn, the residuals;
  • Apply a different estimation method. We will examine this later on, but one possibility is to use a Weighted Least Squares estimation method, which gives different observations different weights and allows to account for a non-constant variance;
  • It is possible to correct the biased standard errors for heteroskedasticity. This would leave the OLS estimates unchanged. White’s heteroskedasticity-consistent standard errors (or, robust standard errors) give a consistent variance estimator.

Example 3.33 We are going to simulate the following model:

\[ \begin{aligned} Y_i &= \beta_0 + \beta_1 X_i + u_i\\ u_i &= \sqrt{i} \cdot \epsilon_i,\text{ where } \epsilon_i \sim \mathcal{N}(0, \sigma^2) \end{aligned} \]

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  1.398462   8.475098 0.1650084 8.692773e-01
## x           14.771130   2.928474 5.0439679 2.095328e-06
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const         10.8945      9.978      1.092      0.278      -8.907      30.696
## x1             9.3559      3.448      2.714      0.008       2.514      16.198
## ==============================================================================

The residuals appear to be non-normal and have nonlinearities remaining:

There are a number of methods to test for the presence of heteroskedasticity. Some of the tests are:

  • Goldfeld–Quandt Test. It divides the dataset into two subsets. The subsets are specified so that the observations for which the explanatory variable takes the lowest values are in one subset, and the highest values - in the other subset. The subsets are not necessarily of equal size, nor do they contain all the observations between them. The test statistic used is the ratio of the mean square residual errors for the regressions on the two subsets. This test statistic corresponds to an F-test of equality of variances. The Goldfeld–Quandt test requires that data be ordered along a known explanatory variable, from lowest to highest.

If the error structure depends on an unknown variable or an unobserved variable the Goldfeld–Quandt test provides little guidance. Also, error variance must be a monotonic function of the specified explanatory variable. For example, when faced with a quadratic function mapping the explanatory variable to error variance the Goldfeld–Quandt test may improperly accept the null hypothesis of homoskedastic errors.

Unfortunately the Goldfeld–Quandt test is not very robust to specification errors. The Goldfeld–Quandt test detects non-homoskedastic errors but cannot distinguish between heteroskedastic error structure and an underlying specification problem such as an incorrect functional form or an omitted variable.

  • Breusch–Pagan Test. After estimating the linear regression \(Y = \beta_0 + \beta_1 X + \epsilon\), calculate the model residuals \(\widehat{\epsilon}_i\). The OLS assumptions state that the residual variance does not depend on the independent variables: \(\mathbb{V}{\rm ar} (\epsilon_i | \mathbf{X} ) = \sigma^2_\epsilon\). If this assumption is not true, then there may be a linear relationship between \(\widehat{\epsilon}_i^2\) and \(X_i\). So, the Breusch–Pagan test is based on the following regression: \[ \widehat{\epsilon}_i^2 = \gamma_0 + \gamma_1 X_i + v_i \] The hypothesis test is: \[ \begin{aligned} H_0&: \gamma_1 = 0 \text{ (residuals are homoskedastic)}\\ H_1&: \gamma_1 \neq 0 \text{ (residuals are heteroskedastic)} \end{aligned} \] It is a chi-squared test, where the test statistic: \[ LM = N \cdot R^2_{\widehat{\epsilon}} \] is distributed as \(\chi^2_1\) under the null. Here \(R^2_{\widehat{\epsilon}}\) is the \(R^2\) of the squared residual regression. One weakness of the BP test is that it assumes that the heteroskedasticity is a linear function of the independent variables. If we fail to reject the null hypothesis, we still do not rule out the possibility of a non-linear relationship between the independent variables and the error variance.

  • White Test is more generic than the BP test as it allows the independent variables to have a nonlinear effect on the error variance. For example, a combination of linear, quadratic and cross-products of the independent variables. It is a more commonly used test for homoskedasticity. The test statistic is calculated the same way as in BP test: \[ LM = N \cdot R^2_{\widehat{\epsilon}} \] the difference from BP is that the squared residual model, from which we calculate \(R^2_{\widehat{\epsilon}}\), may be nonlinear. A shortcoming of the White test is that it can lose its power if the model has many exogenous variables.

We can carry out these tests via R and Python:

##  Goldfeld-Quandt test
## data:  mdl
## GQ = 6.9269, df1 = 48, df2 = 48, p-value = 4.407e-10
## alternative hypothesis: variance changes from segment 1 to 2
##  studentized Breusch-Pagan test
## data:  mdl
## BP = 18.864, df = 1, p-value = 1.404e-05
## (27.15440005289671, 1.8783720306329005e-07, 36.53111796891305, 2.7232777548974932e-08)
##  studentized Breusch-Pagan test
## data:  mdl
## BP = 21.665, df = 2, p-value = 1.975e-05

The White test is equivalent to the Breusch-Pagan test with an auxiliary model containing all regressors, their squares and their cross-products. As such there is only the bptest() function, which can be leveraged to carry out the White test.

## (27.63658737233078, 9.972207411097485e-07, 18.522820288405395, 1.5369984549894296e-07)

The LM test is known to exaggerate the significance of results in small or moderately large samples; in such cases the \(F\)-statistic is preferable. As such, both the LM and the \(F\) statistics, with their \(p\)-values, are provided for the BP test. For the BP and White tests, the first value is the \(LM\) statistic, the second value is the \(p\)-value of the LM statistic, the third value is the \(F\)-test statistic, and the last value is the \(p\)-value of the \(F\) test. For the GQ test, the first value is the \(F\)-statistic and the second value is the associated \(p\)-value.

We see that in all cases the \(p\)-value is less than \(0.05\), so we reject the null hypothesis and conclude that the residuals are heteroskedastic.

On the other hand, if we were to carry out these tests for a correctly specified model, like the one for the simple linear regression:

##  Goldfeld-Quandt test
## data:  mdl1
## GQ = 1.119, df1 = 98, df2 = 98, p-value = 0.579
## alternative hypothesis: variance changes from segment 1 to 2
##  studentized Breusch-Pagan test
## data:  mdl1
## BP = 0.41974, df = 1, p-value = 0.5171
## (3.987557828302002, 0.04583745596968394, 4.027991495112321, 0.04611120607782332)
##  studentized Breusch-Pagan test
## data:  mdl1
## BP = 0.42016, df = 2, p-value = 0.8105
## (4.0785282350399354, 0.13012443194407053, 2.050490063862835, 0.13140921798003924)

We see that we do not reject the null hypothesis of homoskedastic residuals (the exception is the BP test in Python, where the \(p\)-value is slightly below 0.05; the remaining two tests, however, do not reject the null).

There are also a number of additional heteroskedasticity tests; a discussion of their quality can be found in the literature.

Residual Autocorrelation

If \(\mathbb{C}{\rm ov} (\epsilon_i, \epsilon_j) \neq 0\) for some \(i \neq j\), then the errors are correlated. Autocorrelation is frequently encountered in time-series models.

Example 3.34 Assume that our model is defined as follows:

\[ \begin{aligned} Y_t &= \beta_0 + \beta_1 X_t + \epsilon_t \\ \epsilon_t &= \rho \epsilon_{t-1} + u_t,\ |\rho| < 1,\ u_t \sim \mathcal{N}(0, \sigma^2) \end{aligned} \] Then we say that the model has autocorrelated, or serially correlated errors.

In this case, since \(u_t\) is independent of \(\epsilon_{t-1}\), we have that: \[ \mathbb{C}{\rm ov}(\epsilon_t, \epsilon_{t-1}) = \mathbb{C}{\rm ov}(\rho \epsilon_{t-1} + u_t, \epsilon_{t-1}) = \rho \mathbb{V}{\rm ar}(\epsilon_{t-1}) = \dfrac{\rho \sigma^2}{1 - \rho^2} \neq 0 \] where \(\mathbb{V}{\rm ar}(\epsilon_{t-1}) = \sigma^2 / (1 - \rho^2)\) is the stationary variance of the error process. Estimating the coefficients via OLS and ignoring the violation will still result in unbiased and consistent OLS estimators. However, the estimators are inefficient and the usual estimators of the variance of the regression coefficients will be biased.
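This covariance can be checked numerically by simulating a long AR(1) error series; the sketch below uses arbitrary illustrative parameter values, relying on the stationary variance \(\mathbb{V}{\rm ar}(\epsilon_t) = \sigma^2/(1 - \rho^2)\):

```python
import numpy as np

# Simulate AR(1) errors: eps[t] = rho * eps[t-1] + u[t], u[t] ~ N(0, sigma_u^2)
np.random.seed(42)
rho, sigma_u, T = 0.7, 1.0, 200_000
u = np.random.normal(0, sigma_u, size=T)
eps = np.zeros(T)
for t in range(1, T):
    eps[t] = rho * eps[t - 1] + u[t]

# Under stationarity: Var(eps) = sigma_u^2 / (1 - rho^2),
# and Cov(eps_t, eps_{t-1}) = rho * Var(eps)
var_theory = sigma_u**2 / (1 - rho**2)
cov_lag1 = np.cov(eps[1:], eps[:-1])[0, 1]
print(cov_lag1, rho * var_theory)  # the two values should be close
```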

On the other hand, autocorrelation in errors may be a result of a misspecified model.

Example 3.35 If we were to fit a linear model to data generated by a quadratic relationship, we may get residuals which appear to be correlated.

There are a number of tests for the presence of autocorrelation:

  • Durbin–Watson Test for the hypothesis: \[ \begin{aligned} H_0&:\text{the errors are serially uncorrelated}\\ H_1&:\text{the errors follow a first order autoregressive process (i.e. autocorrelation at lag 1)} \end{aligned} \] The test statistic: \[ d = \dfrac{\sum_{i = 2}^N (\widehat{\epsilon}_i - \widehat{\epsilon}_{i-1})^2}{\sum_{i = 1}^N \widehat{\epsilon}_i^2} \] The value of \(d\) always lies between 0 and 4, and \(d = 2\) indicates no autocorrelation. If the Durbin–Watson statistic is not close to 2, there is evidence of serial correlation.

  • Breusch-Godfrey Test is a more flexible test, covering autocorrelation of higher orders and applicable whether or not the regressors include lags of the dependent variable. Consider the following linear regression: \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \] We then estimate the model via OLS and fit the following model on the residuals \(\widehat{\epsilon}_i\): \[ \widehat{\epsilon}_i = \alpha_0 + \alpha_1 X_i + \rho_1 \widehat{\epsilon}_{i - 1} + \rho_2 \widehat{\epsilon}_{i - 2} + ... + \rho_p \widehat{\epsilon}_{i - p} + u_i \] and calculate its \(R^2\), then test the hypothesis: \[ \begin{aligned} H_0&:\rho_1 = \rho_2 = ... = \rho_p = 0\\ H_1&:\rho_j \neq 0 \text{ for some } j \end{aligned} \] Under the null hypothesis the test statistic: \[ LM = (N-p)R^2 \sim \chi^2_p \]

We can carry out these tests via R and Python:

##  Durbin-Watson test
## data:  lm_fit
## DW = 0.26274, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is not 0
##  Breusch-Godfrey test for serial correlation of order up to 2
## data:  lm_fit
## LM test = 75.941, df = 2, p-value < 2.2e-16
## (93.98337188826778, 3.9063405254180396e-21, 749.7890457680423, 2.5642011470769976e-59)

For the Durbin–Watson Test, only the DW statistic is returned. A value of 2 indicates no serial correlation; values closer to 0 are evidence of positive serial correlation, while values closer to 4 are evidence of negative serial correlation. The Breusch-Godfrey Test returns the \(LM\) statistic with its corresponding \(p\)-value, as well as the alternative version of the test with the \(F\)-statistic and its corresponding \(p\)-value.

In all test cases we reject the null hypothesis of no serial correlation (the \(p\)-values are less than 0.05 and the Durbin–Watson test statistic is far from 2).

On the other hand, if we were to carry out these tests for a correctly specified model, like the one for the simple linear regression:

##  Durbin-Watson test
## data:  mdl1
## DW = 2.1222, p-value = 0.4255
## alternative hypothesis: true autocorrelation is not 0
##  Breusch-Godfrey test for serial correlation of order up to 2
## data:  mdl1
## LM test = 2.4867, df = 2, p-value = 0.2884
## (0.20396667467699192, 0.9030445986699857, 0.10004570053602482, 0.9048422424493952)

In this case the DW statistic is close to 2, and the test \(p\)-values are greater than 0.05, so we fail to reject the null hypothesis of no serial correlation.

Note: there is also the Ljung-Box Test for testing the null hypothesis of no autocorrelation of residuals.

Residual Normality Testing

The normality requirement is necessary if we want to obtain the correct \(p\)-values and critical \(t\)-values when testing the hypothesis that \(H_0: \beta_j = c\), especially for significance testing, with \(c = 0\). Assume that we want to test whether our residuals \(z_1,...,z_N\) come from a normal distribution. The hypothesis can be stated as: \[ \begin{aligned} H_0&:\text{residuals follow a normal distribution}\\ H_1&:\text{residuals do not follow a normal distribution} \end{aligned} \]

There are a number of normality tests, like:

  • Anderson-Darling Test. The test statistic is calculated as: \[ A^2 = -N - \sum_{i = 1}^N \dfrac{2i-1}{N}\left[ \log\left(F(z_{(i)})\right) + \log\left(1 - F(z_{(N+1-i)}) \right)\right] \] where \(z_{(i)}\) are the ordered data and \(F(\cdot)\) is the cumulative distribution function (cdf) of the distribution being tested (for the univariate regression residuals - we are usually interested in testing for the normal distribution). The test statistic is compared against the tabulated critical values for the distribution being tested. Empirical testing indicates that the Anderson–Darling test is not quite as good as the Shapiro–Wilk test, but is better than other tests.

  • Shapiro-Wilk Test. The test statistic is: \[ W = \dfrac{\left(\sum_{i = 1}^N a_i z_{(i)} \right)^2}{\sum_{i = 1}^N (z_i - \overline{z})^2} \] where \(z_{(i)}\) is the \(i\)-th smallest value in the sample (i.e. the data are ordered). \(a_i\) values are calculated using means, variances and covariances of \(z_{(i)}\). \(W\) is compared against tabulated values of this statistic’s distribution. Small values of \(W\) will lead to the rejection of the null hypothesis. Monte Carlo simulation has found that Shapiro–Wilk has the best power for a given significance, followed closely by Anderson–Darling when comparing the Shapiro–Wilk, Kolmogorov–Smirnov, Lilliefors and Anderson–Darling tests.

  • Kolmogorov-Smirnov Test. The test statistic is given by: \[ D = \max\{ D^+; D^-\} \] where: \[ \begin{aligned} D^+ &= \max_i \left( \dfrac{i}{N} - F(z_{(i)})\right)\\ D^- &= \max_i \left( F(z_{(i)}) - \dfrac{i - 1}{N} \right) \end{aligned} \] where \(F(\cdot)\) is the theoretical cdf of the distribution being tested (for the univariate regression residuals - we are usually interested in testing for the normal distribution). The Lilliefors Test is based on the Kolmogorov-Smirnov Test - it is the special case for a normal distribution with estimated parameters. For the normal distribution case, the test statistic is compared against the appropriate tabulated critical values in order to determine the \(p\)-value.

  • Cramer–von Mises Test is an alternative to the Kolmogorov–Smirnov test. The test statistic: \[ W = N\omega^2 = \dfrac{1}{12N} + \sum_{i = 1}^N \left[ \dfrac{2i-1}{2N} - F(z_{(i)}) \right]^2 \] If this value is larger than the tabulated value, then the hypothesis that the data came from the distribution \(F\) can be rejected.

  • Jarque–Bera Test (valid for large samples). The statistic is calculated as: \[ JB = \dfrac{N-k+1}{6} \left(S^2 + \dfrac{(C - 3)^2}{4}\right) \] where \[ \begin{aligned} S &= \dfrac{\dfrac{1}{N}\sum_{i = 1}^N (z_i - \overline{z})^3}{\left( \dfrac{1}{N}\sum_{i = 1}^N (z_i - \overline{z})^2 \right)^{3/2}}= \dfrac{\widehat{\mu}_3}{\widehat{\sigma}^3}\\ C &= \dfrac{\dfrac{1}{N}\sum_{i = 1}^N (z_i - \overline{z})^4}{\left( \dfrac{1}{N}\sum_{i = 1}^N (z_i - \overline{z})^2 \right)^{2}} = \dfrac{\widehat{\mu}_4}{\widehat{\sigma}^4}\\ \end{aligned} \] \(N\) is the sample size, \(S\) is the skewness, \(C\) is the kurtosis and \(k\) is the number of regressors (i.e. the number of different independent variables \(X\), with \(k = 1\) outside a regression context).
    If the data comes from a normal distribution, then the \(JB\) statistic has a chi-squared distribution with two degrees of freedom, \(\chi^2_2\).

  • Chi-squared (Goodness-Of-Fit) Test. The chi-square test is an alternative to the Anderson-Darling and Kolmogorov-Smirnov goodness-of-fit tests. The chi-square goodness-of-fit test can be applied to discrete distributions such as the binomial and the Poisson, while the Kolmogorov-Smirnov and Anderson-Darling tests are restricted to continuous distributions. This is not a restriction per se, since for non-binned data you can simply calculate a histogram before applying the chi-square test. However, the value of the chi-square test statistic is dependent on how the data is binned. Another disadvantage of the chi-square test is that it requires a sufficient sample size in order for the chi-square approximation to be valid. The test statistic: \[ \chi^2 = \sum_{i = 1}^k \dfrac{(O_i - E_i)^2}{E_i} \] where \(k\) is the number of (non-empty) bins, \(O_i\) is the observed frequency for bin \(i\) (i.e. the number of observations in bin \(i\)) and \(E_i\) is the expected (theoretical) frequency for bin \(i\), which is calculated as: \[ E_i = N(F(z_u) - F(z_l)) \] where \(N\) is the total sample size, \(F(\cdot)\) is the cdf for the distribution being tested, \(z_u\) is the upper limit for bin \(i\) and \(z_l\) is the lower limit for bin \(i\). The chi-squared statistic can then be used to calculate a \(p\)-value by comparing the value of the statistic to a chi-squared distribution with \(k - c\) degrees of freedom (where \(c\) is the number of estimated parameters for the distribution plus one - for the normal distribution with mean and standard deviation parameters \(c = 2 + 1 = 3\)).
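As an illustration, the chi-squared goodness-of-fit calculation can be carried out by hand in Python. The sketch below uses our own illustrative choices: simulated data, equiprobable bins under a fitted normal distribution, and outer bin edges widened to finite values that cover the sample:

```python
import numpy as np
from scipy import stats

# Sample to be tested for normality (here: genuinely normal data)
np.random.seed(7)
z = np.random.normal(2.0, 1.5, size=500)
N = len(z)

# Fit the normal distribution: 2 estimated parameters, so c = 2 + 1 = 3
mu, sigma = z.mean(), z.std(ddof=1)

# k equiprobable bins under the fitted normal, so each E_i = N / k
k = 10
edges = stats.norm.ppf(np.linspace(0, 1, k + 1), loc=mu, scale=sigma)
edges[0], edges[-1] = z.min() - 1.0, z.max() + 1.0  # finite outer edges
observed, _ = np.histogram(z, bins=edges)
expected = np.full(k, N / k)

chi2_stat = np.sum((observed - expected) ** 2 / expected)
p_value = stats.chi2.sf(chi2_stat, df=k - 3)  # df = k - c
print(chi2_stat, p_value)
```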

We will carry out the normality tests on the log-linear DGP, with an incorrectly specified linear model:

##  Shapiro-Wilk normality test
## data:  mdl3$residuals
## W = 0.90394, p-value = 4.514e-10
## (0.9430360198020935, 4.2506789554863644e-07)
##  One-sample Kolmogorov-Smirnov test
## data:  mdl3$residuals
## D = 0.50971, p-value < 2.2e-16
## alternative hypothesis: two-sided
##  Cramer-von Mises normality test
## data:  mdl3$residuals
## W = 0.37085, p-value = 5.297e-05
##  Jarque Bera Test
## data:  mdl3$residuals
## X-squared = 268.92, df = 2, p-value < 2.2e-16
## (27.00245322122904, 1.3692784843495163e-06, 0.8659632801063863, 3.490637112922614)

Note that the Jarque-Bera tests in these R and Python packages do not allow us to account for the fact that we are carrying out the tests on regression residuals. In other words, they assume that \(k = 1\).

In this case the \(p\)-value is less than 0.05 for most tests, so we reject the null hypothesis and conclude that the residuals are not normally distributed.
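For reference, several of these tests are available in Python's scipy and statsmodels; below is a minimal sketch on two simulated residual-like samples, one normal and one deliberately skewed (the seeds and variable names are our own):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.stattools import jarque_bera

np.random.seed(11)
resid_ok = np.random.normal(0, 1, size=300)           # plausibly normal
resid_bad = np.random.exponential(1.0, size=300) - 1  # skewed, mean ~0

# Shapiro-Wilk: returns (W, p-value)
W_ok, p_ok = stats.shapiro(resid_ok)
W_bad, p_bad = stats.shapiro(resid_bad)

# Kolmogorov-Smirnov against a normal with the sample mean and sd
ks_bad = stats.kstest(resid_bad, "norm",
                      args=(resid_bad.mean(), resid_bad.std(ddof=1)))

# Jarque-Bera: returns (JB stat, p-value, skewness, kurtosis); note that
# this implementation assumes k = 1
jb_bad = jarque_bera(resid_bad)

print(p_ok, p_bad, ks_bad.pvalue, jb_bad[1])
```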

For a correctly specified log-linear model:

##  Anderson-Darling normality test
## data:  mdl3_correct$residuals
## A = 0.39645, p-value = 0.3665
##  Shapiro-Wilk normality test
## data:  mdl3_correct$residuals
## W = 0.99066, p-value = 0.2225
## (0.9960342049598694, 0.8864037990570068)
##  One-sample Kolmogorov-Smirnov test
## data:  mdl3_correct$residuals
## D = 0.35043, p-value < 2.2e-16
## alternative hypothesis: two-sided
##  Cramer-von Mises normality test
## data:  mdl3_correct$residuals
## W = 0.058543, p-value = 0.3937
##  Jarque Bera Test
## data:  mdl3_correct$residuals
## X-squared = 4.795, df = 2, p-value = 0.09095
## (0.4627520589444065, 0.793441052712381, -0.11645049039934306, 2.9641199189474325)

In this case, the \(p\)-values for most of the tests are greater than 0.05, so we do not reject the null hypothesis of normality.

The more tests we carry out, the more confident we can be about whether or not the residuals come from a normal distribution.

For now, focus on at least one test from each category - namely, the Breusch–Pagan Test for homoskedasticity, the Durbin–Watson Test for autocorrelation and the Shapiro–Wilk Test for normality.

Standardized Residuals

When we compare residuals for different observations, we want to take into account that their variances may be different (as we have shown in eq. (3.12)). One way to account for this is to divide each residual by an estimate of its standard deviation. This results in the standardized residuals: \[ s_i = \dfrac{\widehat{\epsilon}_i}{\widehat{\sigma}\sqrt{1 - h_{ii}}} \] where \(h_{ii}\) is the \(i\)-th diagonal element of \(\mathbf{H}\). Standardized residuals are useful in detecting outliers. Generally, any observation with a standardized residual greater than 2 in absolute value should be examined more closely.