4.7 Heteroskedastic Errors

Consider the case where assumption (MR.3) does not hold, but assumption (MR.4) (as well as the remaining assumptions (MR.1), (MR.2), (MR.5) and, optionally, (MR.6)) is still valid. Then we can write the model as: \[ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \mathbb{E} \left( \boldsymbol{\varepsilon} | \mathbf{X}\right) = \boldsymbol{0},\quad \mathbb{V}{\rm ar}\left( \boldsymbol{\varepsilon} | \mathbf{X} \right) = \mathbb{E} \left( \boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^\top | \mathbf{X} \right)= \mathbf{\Sigma} \neq \sigma_\epsilon^2 \mathbf{I} \] The case when the error variance-covariance matrix is diagonal, but not equal to \(\sigma_\epsilon^2 \mathbf{I}\): \[ \mathbf{\Sigma}= \begin{bmatrix} \sigma_1^2 & 0 & 0 & ... & 0\\ 0 & \sigma^2_2 & 0 & ... & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & 0 & ... & \sigma^2_N \end{bmatrix} \] is referred to as heteroskedasticity. As mentioned in 4.6:

  • the OLS estimators will remain unbiased;
  • the OLS variance estimator is biased and inconsistent;
  • the usual \(t\)-statistics of the OLS estimates do not have Student’s \(t\) distribution, even in large samples.

The strategy is as follows:

  1. Assume that \(\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}\) is the true model.
  2. Test the null hypothesis that the residuals are homoskedastic: \[ H_0: \mathbb{V}{\rm ar}\left( \boldsymbol{\varepsilon} | \mathbf{X} \right) = \sigma_\epsilon^2 \mathbf{I} \]
  3. If we fail to reject the null hypothesis - we can use the usual OLS estimators.
  4. If we reject the null hypothesis, there are a few ways we can go:
    • Use the OLS estimators, but correct their variance estimators (i.e. make them consistent);
    • Instead of OLS, use the weighted least squares (WLS) to estimate the parameters;
    • Attempt to specify a different model, which would, hopefully, be able to account for heteroskedasticity (this is the least preferred method - our initial model should already be the one we want in terms of variables, signs, interpretation, etc.).
Example 4.28 We will simulate the following model:

\[ \begin{aligned} Y_{i} &= \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + u_i\\ u_i &= {i} \cdot \epsilon_i,\quad \epsilon_i \sim \mathcal{N}(0, \sigma^2) \quad \left[\text{ or } u_i \sim \mathcal{N}(0, i^2 \cdot \sigma^2) \right] \end{aligned} \]
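
A minimal sketch of how such a sample could be simulated in R (the seed, sample size and coefficient values below are assumptions for illustration; the output further on may be based on different values):

set.seed(123)                                   # assumed seed
N  <- 200                                       # assumed sample size
x1 <- seq(from = 0, to = 5, length.out = N)
x2 <- sample(seq(from = 3, to = 17, length.out = 80), size = N, replace = TRUE)
u  <- (1:N) * rnorm(N, mean = 0, sd = 1)        # u_i = i * eps_i, i.e. Var(u_i) = i^2 * sigma^2
y  <- 10 + 5 * x1 - 3 * x2 + u                  # assumed beta_0 = 10, beta_1 = 5, beta_2 = -3
dt <- data.frame(y, x1, x2)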

4.7.1 Testing For Heteroskedasticity

We can examine the presence of heteroskedasticity from the residual plots, as well as by conducting a number of formal tests.

We will begin by estimating our model via OLS, as we usually would.
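
A sketch of the OLS fit and its coefficient table (assuming the data frame dt from the sketch above):

mdl_1 <- lm(y ~ x1 + x2, data = dt)
round(coef(summary(mdl_1)), 5)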

##             Estimate Std. Error  t value Pr(>|t|)
## (Intercept) -0.89066   28.42373 -0.03134  0.97503
## x1           7.21929    5.93429  1.21654  0.22524
## x2          -1.83213    2.18885 -0.83703  0.40359
##               Coef.  Std.Err.        t    P>|t|    [0.025    0.975]
## Intercept  26.68842  28.71633  0.92938  0.35383 -29.94245  83.31928
## x1          2.88314   6.31740  0.45638  0.64862  -9.57527  15.34156
## x2         -3.95521   2.14688 -1.84231  0.06693  -8.18902   0.27860

4.7.1.2 Heteroskedasticity Tests

Most of the tests are identical to the ones described in 3.8.3.2, but with a bit more mathematical detail.

4.7.1.2.1 The Goldfeld–Quandt Test

This test is designed for the case where we have two sub-samples with possibly different variances. The sub-samples can be created in a number of ways:

  • create data sub-samples based on an indicator variable;
  • sort the data along a known explanatory variable, from lowest to highest;

Let the sub-samples \(j = 1,2\) contain \(N_j\) observations and let the regression on sub-sample \(j\) have \(k_j + 1\) parameters (including the intercept). Let \(\widehat{\boldsymbol{\varepsilon}}_j\) be the residuals of the regression on the \(j\)-th sub-sample and let the true variance of the sub-sample errors be \(\sigma^2_j\). Then the sub-sample variance can be estimated by: \[ \widehat{\sigma}^2_j = \dfrac{\widehat{\boldsymbol{\varepsilon}}_j^\top\widehat{\boldsymbol{\varepsilon}}_j}{N_j - (k_j + 1)} \] We want to test the null hypothesis that the two error variances are equal: \[ \begin{cases} H_0: \sigma_1^2 / \sigma_2^2 = 1\\ H_1: \sigma_1^2 / \sigma_2^2 \neq 1 \end{cases} \]

Then, the GQ statistic can be defined as: \[ GQ = \dfrac{\widehat{\sigma}^2_1}{\widehat{\sigma}^2_2} \sim F_{(N_1 - (k_1 + 1), N_2 - (k_2 + 1))} \] If we reject the null hypothesis, for a chosen significance level \(\alpha\), then the variances in the sub-samples are different - i.e. we reject homoskedasticity and conclude that the errors are heteroskedastic.

In our example model, we see that the residuals can be ordered by either the fitted values, or by \(X_1\). If we specify to order by \(X_1\):
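
Assuming the fitted model mdl_1 and the data frame dt from before (names assumed), the test could be carried out with the lmtest package (a sketch):

# GQ test: sort the observations by x1, split the sample in half and compare the two error variances
lmtest::gqtest(mdl_1, order.by = ~ x1, data = dt, alternative = "two.sided")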

## 
##  Goldfeld-Quandt test
## 
## data:  mdl_1
## GQ = 10.217, df1 = 97, df2 = 97, p-value < 2.2e-16
## alternative hypothesis: variance changes from segment 1 to 2

since the \(p\)-value is less than 0.05, we reject the null hypothesis and conclude that the residuals are heteroskedastic.

Furthermore, we can order by the fitted values:

## 
##  Goldfeld-Quandt test
## 
## data:  mdl_1
## GQ = 5.7826, df1 = 97, df2 = 97, p-value = 3.736e-16
## alternative hypothesis: variance changes from segment 1 to 2

and we would arrive at the same conclusions.

On the other hand, if we had specified to order by \(X_2\):

## 
##  Goldfeld-Quandt test
## 
## data:  mdl_1
## GQ = 1.1747, df1 = 97, df2 = 97, p-value = 0.4292
## alternative hypothesis: variance changes from segment 1 to 2

By default, if order.by is not specified, then the data is taken as ordered - i.e. it would be the equivalent of the residual run-sequence plot.

We would have no grounds to reject the null hypothesis, and would have concluded that the errors are homoskedastic. This is why the residual plots are important. Without them, we may have blindly selected one explanatory variable at random to order by, and would have arrived at a completely different conclusion!

4.7.1.2.2 A General Heteroskedasticity Test

The next test is used for conditional heteroskedasticity, which is related to explanatory variables. This test is a generalization of the Breusch–Pagan Test.

Consider the following regression: \[ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} \] We assume that the general expression for the conditional variance can be expressed as: \[ \mathbb{V}{\rm ar}\left( \boldsymbol{\varepsilon} | \mathbf{Z} \right) = \mathbb{E} \left( \boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^\top | \mathbf{Z}\right) = \text{diag}\left( h\left( \mathbf{Z} \boldsymbol{\alpha}\right) \right) \] where \(\mathbf{Z}\) is some explanatory variable matrix, which may include some (or all) of the explanatory variables from \(\mathbf{X}\). \(h(\cdot)\) is some smooth function, and \(\boldsymbol{\alpha} = \left[ \alpha_0, ..., \alpha_m \right]^\top\).

We want to test the null hypothesis: \[ \begin{cases} H_0: \alpha_1 = ... = \alpha_m = 0\\ H_1: \text{ at least one } \alpha_j \neq 0,\ j \in \{1,...,m\} \end{cases} \] Then the associated BP test specifies the following linear model for the squared residuals: \[ \widehat{\boldsymbol{\varepsilon}} \widehat{\boldsymbol{\varepsilon}}^\top = \text{diag}\left( \mathbf{Z} \boldsymbol{\alpha} + \boldsymbol{v} \right) \] Estimating the squared residuals model via OLS yields the parameter estimates. Then, using the \(F\)-test for the joint significance of \(\alpha_1,...,\alpha_m\) would be equivalent to testing for homoskedasticity. Alternatively, and more conveniently, there is a test based on the \(R^2_{\widehat{\epsilon}}\): \[ LM = N \cdot R^2_{\widehat{\epsilon}} \sim \chi^2_{m} \] If we reject the null hypothesis, for a chosen significance level \(\alpha\), then the error variance is heteroskedastic.
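
A sketch of the manual calculation of the LM statistic (assuming mdl_1 and dt as before, with \(\mathbf{Z}\) containing the model's own regressors x1 and x2, so \(m = 2\)):

# Auxiliary regression of the squared OLS residuals on Z:
dt$e2 <- resid(mdl_1)^2
aux   <- lm(e2 ~ x1 + x2, data = dt)
# LM statistic and its chi-squared p-value:
LM    <- nobs(mdl_1) * summary(aux)$r.squared
p_val <- pchisq(LM, df = 2, lower.tail = FALSE)
c(LM = LM, p.value = p_val)
# The (studentized) built-in version of the same test:
lmtest::bptest(mdl_1)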

For our sample data the test results are:

## 
##  studentized Breusch-Pagan test
## 
## data:  mdl_1
## BP = 35.896, df = 2, p-value = 1.604e-08
##               0             1
## 0  LM statistic  2.948140e+01
## 1       p-value  3.964566e-07
## 2       F-value  1.702992e+01
## 3    F: p-value  1.506829e-07

The \(p\)-value is less than the chosen 0.05 significance level, so the residuals are heteroskedastic.

4.7.1.2.3 The White Test

The White test is an even more general test, which includes not only the exogenous variables, but also their polynomial and interaction terms.

The associated OLS squared residual regression is similar to the general case: \[ \widehat{\boldsymbol{\varepsilon}} \widehat{\boldsymbol{\varepsilon}}^\top = \text{diag}\left( \mathbf{Z} \boldsymbol{\alpha} + \boldsymbol{v} \right) \] except now, the matrix \(\mathbf{Z}\) also includes \(s\) additional polynomial and interaction terms of the explanatory variables. The parameter vector is \(\alpha_1,...,\alpha_{m+s}\). We want to test the null hypothesis: \[ \begin{cases} H_0: \alpha_1 = ... = \alpha_{m+s} = 0\\ H_1: \text{ at least one } \alpha_j \neq 0,\ j \in \{1,...,m+s\} \end{cases} \]

Then the test statistic is calculated in the same way as before: \[ LM = N \cdot R^2_{\widehat{\epsilon}} \sim \chi^2_{m+s} \] One difficulty with the White test is that it can detect problems other than heteroskedasticity.

Thus, while it is a useful diagnostic, be careful about interpreting the test result - instead of heteroskedasticity, it may be that you have an incorrect functional form, or an omitted variable.

The test is also performed similarly to how it was done for the univariate regression with one explanatory variable:
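
A sketch of the White test via lmtest::bptest(), where the auxiliary regression also includes the squares and the interaction of the regressors (so \(m + s = 5\), matching the degrees of freedom below):

lmtest::bptest(mdl_1, ~ x1 * x2 + I(x1^2) + I(x2^2), data = dt)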

## 
##  studentized Breusch-Pagan test
## 
## data:  mdl_1
## BP = 43.195, df = 5, p-value = 3.373e-08

we again reject the null hypothesis of homoskedasticity and conclude that the residuals are heteroskedastic.

4.7.2 Heteroskedasticity-Consistent Standard Errors (HCE)

Once we have determined that the errors are heteroskedastic, we want to have a way to account for that. One alternative is to stick with the OLS estimates, but correct their variance estimator. This is known as the White correction - it will not change \(\widehat{\boldsymbol{\beta}}_{OLS}\), which remain unbiased (though inefficient), but it will correct \(\widehat{\mathbb{V}{\rm ar}}\left( \widehat{\boldsymbol{\beta}}_{OLS} \right)\).

Suppose the errors in the error vector \(\boldsymbol{\varepsilon}\) are independent, but have distinct variances, so that \(\mathbb{V}{\rm ar}\left( \boldsymbol{\varepsilon} | \mathbf{X} \right) = \mathbf{\Sigma} = \text{diag}(\sigma_1^2,...,\sigma_N^2)\).

Then the true underlying variance-covariance matrix of the OLS estimates would be: \[ \mathbb{V} \left( \widehat{\boldsymbol{\beta}}_{OLS} \right)= \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{\Sigma}\mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \]

Since \(\mathbb{E} (\epsilon_i) = 0\), it follows that \(\mathbb{V}{\rm ar}(\epsilon_i) = \mathbb{E} (\epsilon_i^2) = \sigma_i^2\), which means that we can estimate the diagonal variance elements as: \[ \widehat{\sigma}^2_i = \widehat{\epsilon}_i^2 \] Let \(\widehat{\mathbf{\Sigma}} = \text{diag}(\widehat{\epsilon}_1^2,...,\widehat{\epsilon}_N^2)\). Then White's estimator, also known as the Heteroskedasticity-Consistent Estimator (HCE), can be specified as: \[ \mathbb{V}_{HCE} \left( \widehat{\boldsymbol{\beta}}_{OLS} \right) = \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \widehat{\mathbf{\Sigma}}\mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \]

Furthermore, alternative specifications of \(\widehat{\mathbf{\Sigma}}\) are also possible.

In our simulated case, the White correction can be estimated either manually:
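
A sketch of the manual calculation (assuming mdl_1 as before):

X         <- model.matrix(mdl_1)                  # design matrix
XtX_inv   <- solve(t(X) %*% X)
Sigma_hat <- diag(resid(mdl_1)^2)                 # Sigma-hat = diag(e_1^2, ..., e_N^2)
V_HC0     <- XtX_inv %*% t(X) %*% Sigma_hat %*% X %*% XtX_inv   # sandwich form
V_HC0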

##                      x1         x2
##    693.25117 -91.740409 -52.244062
## x1 -91.74041  46.570134   2.341761
## x2 -52.24406   2.341761   4.749053
## [[595.42764356 -89.89928301 -43.54326469]
##  [-89.89928301  45.34276137   2.52223861]
##  [-43.54326469   2.52223861   3.9731209 ]]

or, via the built-in functions:
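
For example, via the sandwich package (a sketch):

sandwich::vcovHC(mdl_1, type = "HC0")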

##             (Intercept)         x1         x2
## (Intercept)   693.25117 -91.740409 -52.244062
## x1            -91.74041  46.570134   2.341761
## x2            -52.24406   2.341761   4.749053
## [[595.42764356 -89.89928301 -43.54326469]
##  [-89.89928301  45.34276137   2.52223861]
##  [-43.54326469   2.52223861   3.9731209 ]]

Having estimated the standard errors is nice, but we would also like to get the associated \(p\)-values.
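
These can be obtained by passing the corrected covariance matrix to lmtest::coeftest() (a sketch):

lmtest::coeftest(mdl_1, vcov. = sandwich::vcovHC(mdl_1, type = "HC0"))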

## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.89066   26.32966 -0.0338   0.9730
## x1           7.21929    6.82423  1.0579   0.2914
## x2          -1.83213    2.17923 -0.8407   0.4015
##               Coef.  Std.Err.        t    P>|t|    [0.025    0.975]
## Intercept  26.68842  24.40139  1.09373  0.27541 -21.43305  74.80988
## x1          2.88314   6.73370  0.42817  0.66900 -10.39625  16.16254
## x2         -3.95521   1.99327 -1.98428  0.04861  -7.88609  -0.02433

compared to the biased standard errors:

##               Estimate Std. Error     t value  Pr(>|t|)
## (Intercept) -0.8906595  28.423734 -0.03133506 0.9750341
## x1           7.2192910   5.934290  1.21653830 0.2252355
## x2          -1.8321252   2.188846 -0.83702804 0.4035910
##               Coef.  Std.Err.        t    P>|t|    [0.025    0.975]
## Intercept  26.68842  28.71633  0.92938  0.35383 -29.94245  83.31928
## x1          2.88314   6.31740  0.45638  0.64862  -9.57527  15.34156
## x2         -3.95521   2.14688 -1.84231  0.06693  -8.18902   0.27860

We see that after correcting the standard errors, the \(p\)-values are also slightly different. The extent of the difference depends on the severity of heteroskedasticity and the method used to correct for it.

While the corrected variance estimator helps us avoid incorrect \(t\)-statistics (as well as incorrect confidence intervals) in case of heteroskedasticity, it does not address the problem that the OLS estimates are no longer the best (in terms of variance).

If we have a large number of observations (i.e. thousands upon thousands), then the robust corrected standard errors are enough. On the other hand, if we have a smaller sample size, then we would like to have a more efficient estimator - this is where the previously presented GLS and FGLS come in handy.

4.7.3 HCE Alternative Specifications

A nice summary of alternative HCE specifications is presented in (Hayes and Cai 2007). Several alternative \(\widehat{\mathbf{\Sigma}}\) specifications have been proposed in the literature. Here we will briefly summarize them.

Taking \(\widehat{\mathbf{\Sigma}} = \text{diag}(\widehat{\epsilon}_1^2,...,\widehat{\epsilon}_N^2)\), where \(\epsilon_i\) are the OLS residuals, leads to the following White’s HCE: \[ \text{HC0}_{\widehat{\boldsymbol{\beta}}_{OLS}} = \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \text{diag}(\widehat{\epsilon}_1^2,...,\widehat{\epsilon}_N^2)\mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \]

  • for small sample sizes, the standard errors from \(\text{HC0}_{\widehat{\boldsymbol{\beta}}_{OLS}}\) are quite biased. This results in overly liberal inferences in regression models.
  • \(\text{HC0}_{\widehat{\boldsymbol{\beta}}_{OLS}}\) is consistent when the errors are heteroskedastic - the bias shrinks as the sample size increases.

Estimators in this family have come to be known as sandwich estimators - \(\mathbf{X}^\top \widehat{\mathbf{\Sigma}} \mathbf{X}\) is the filling between the two \(\left( \mathbf{X}^\top \mathbf{X}\right)^{-1}\) matrices.

Consequently, alternative estimators were proposed, which are asymptotically equivalent to \(\text{HC0}_{\widehat{\boldsymbol{\beta}}_{OLS}}\), but have superior small-sample properties.

  • An alternative HC estimator adjusts the degrees of freedom of \(\text{HC0}_{\widehat{\boldsymbol{\beta}}_{OLS}}\): \[ \text{HC1}_{\widehat{\boldsymbol{\beta}}_{OLS}} = \dfrac{N}{N-(k+1)}\left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \text{diag}(\widehat{\epsilon}_1^2,...,\widehat{\epsilon}_N^2)\mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \]

Another HC estimator is based on the finding that the finite-sample bias is a result of points of high leverage. It uses the weight matrix, defined as: \[ \mathbf{H} = \mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \]

  • Then the HC estimator is constructed by weighting the \(i\)-th squared OLS residual using the \(i\)-th diagonal element of the weight matrix \(\mathbf{H}\): \[ \text{HC2}_{\widehat{\boldsymbol{\beta}}_{OLS}} = \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \text{diag}\left(\dfrac{\widehat{\epsilon}_1^2}{1 - h_{11}},...,\dfrac{\widehat{\epsilon}_N^2}{1 - h_{NN}} \right)\mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \]
  • Similarly, using \(1/(1 - h_{ii})^2\), instead of \(1/(1 - h_{ii})\), for the weights leads to yet another HCE specification: \[ \text{HC3}_{\widehat{\boldsymbol{\beta}}_{OLS}} = \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \text{diag}\left(\dfrac{\widehat{\epsilon}_1^2}{(1 - h_{11})^2},...,\dfrac{\widehat{\epsilon}_N^2}{(1 - h_{NN})^2} \right)\mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \]

Evaluations of the empirical power of the four methods - \(\text{HC0, HC1, HC2 and HC3}\) - suggest that \(\text{HC3}\) is a superior estimator, regardless of the presence, or absence, of heteroskedasticity.

Nevertheless, the performance of \(\text{HC3}\) depends on the presence, or absence, of points of high leverage in \(\mathbf{X}\) and it may fail for certain forms of heteroskedasticity (for example, when the predictors are from heavy-tailed distributions, and the errors are from light-tailed distributions).

  • To account for large leverage values, another HC estimator was proposed: \[ \text{HC4}_{\widehat{\boldsymbol{\beta}}_{OLS}} = \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \text{diag}\left(\dfrac{\widehat{\epsilon}_1^2}{(1 - h_{11})^{\delta_1}},...,\dfrac{\widehat{\epsilon}_N^2}{(1 - h_{NN})^{\delta_N}} \right)\mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \] where: \[ \delta_i = \min\left \{4, \dfrac{N \cdot h_{ii}}{k+1} \right \} \] Simulation tests indicated that \(\text{HC4}\) can outperform \(\text{HC3}\) when there are high leverage points and non-normal errors.

Nevertheless, even with these alternative specifications, the variability of HCE is often larger than that of model-based estimators, like WLS, GLS or FGLS, provided the residual covariance matrix is correctly specified. On the other hand, HCE are derived under a minimal set of assumptions about the errors. As such, they are useful when heteroskedasticity is of an unknown form and cannot be adequately evaluated from the data.

Therefore, when using HCE (instead of some other model-based estimation methods), we are trading efficiency for consistency.

Furthermore, there is also another specification, \(\text{HC5}\).

In our example dataset different HCE can be easily specified with the built-in functions:
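
For example, in R the different corrections can be obtained by looping over the type argument of sandwich::vcovHC() (a sketch):

for (ht in c("HC0", "HC1", "HC2", "HC3", "HC4")) {
  print(lmtest::coeftest(mdl_1, vcov. = sandwich::vcovHC(mdl_1, type = ht)))
}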

## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.89066   26.32966 -0.0338   0.9730
## x1           7.21929    6.82423  1.0579   0.2914
## x2          -1.83213    2.17923 -0.8407   0.4015
##               Coef.  Std.Err.        t    P>|t|    [0.025    0.975]
## Intercept  26.68842  24.40139  1.09373  0.27541 -21.43305  74.80988
## x1          2.88314   6.73370  0.42817  0.66900 -10.39625  16.16254
## x2         -3.95521   1.99327 -1.98428  0.04861  -7.88609  -0.02433
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.89066   26.52939 -0.0336   0.9733
## x1           7.21929    6.87600  1.0499   0.2950
## x2          -1.83213    2.19576 -0.8344   0.4051
##               Coef.  Std.Err.        t    P>|t|    [0.025    0.975]
## Intercept  26.68842  24.58648  1.08549  0.27903 -21.79807  75.17490
## x1          2.88314   6.78478  0.42494  0.67134 -10.49698  16.26327
## x2         -3.95521   2.00839 -1.96934  0.05032  -7.91591   0.00549
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.89066   26.59922 -0.0335   0.9733
## x1           7.21929    6.89407  1.0472   0.2963
## x2          -1.83213    2.20143 -0.8322   0.4063
##               Coef.  Std.Err.        t    P>|t|    [0.025    0.975]
## Intercept  26.68842  24.61773  1.08411  0.27964 -21.85969  75.23653
## x1          2.88314   6.79336  0.42441  0.67173 -10.51391  16.28020
## x2         -3.95521   2.01119 -1.96661  0.05063  -7.92143   0.01101
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.89066   26.87208 -0.0331   0.9736
## x1           7.21929    6.96475  1.0365   0.3012
## x2          -1.83213    2.22390 -0.8238   0.4110
##               Coef.  Std.Err.        t    P>|t|    [0.025    0.975]
## Intercept  26.68842  24.83627  1.07457  0.28388 -22.29067  75.66750
## x1          2.88314   6.85362  0.42067  0.67445 -10.63273  16.39902
## x2         -3.95521   2.02929 -1.94906  0.05271  -7.95713   0.04671
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.89066   26.72908 -0.0333   0.9735
## x1           7.21929    6.92585  1.0424   0.2985
## x2          -1.83213    2.21168 -0.8284   0.4085

Python's statsmodels currently does not have an HC4 specification; see the statsmodels documentation for the available covariance types.

4.7.4 Weighted Least Squares (WLS)

After detecting heteroskedasticity in the errors, we may want to impose a structure on the residual covariance matrix and estimate the coefficients via GLS. If we know that there is no serial correlation in the errors, then the covariance is diagonal. This leads to a specific case of GLS - weighted least squares (WLS).

In reality, we do not know the true form of heteroskedasticity. So, even knowing that the covariance matrix is diagonal still does not tell us anything about the diagonal elements - they could be, for example, any of the following: \[ \begin{aligned} \mathbf{\Sigma} &= \sigma^2 \cdot \text{diag}\left(X_{j,1}, X_{j,2},..., X_{j,N} \right) \quad \text{i.e., variance is proportional to } X_j\\ \mathbf{\Sigma} &= \sigma^2 \cdot \text{diag}\left(X_{j,1}^2, X_{j,2}^2,..., X_{j,N}^2 \right)\\ \mathbf{\Sigma} &= \sigma^2 \cdot \text{diag}\left(\sqrt{X_{j,1}}, \sqrt{X_{j,2}},..., \sqrt{X_{j,N}} \right), \text{ if } X_{j,i} \geq 0,\ \forall i = 1,...,N \end{aligned} \] Furthermore, in a multiple regression setting, the heteroskedasticity pattern may depend on more than one explanatory variable - it could even be related to variables not included in the model. So, how do we select the most likely form of heteroskedasticity?

In general, one specification of heteroskedasticity, which works quite well, is: \[ \begin{aligned} \sigma^2_i &= \exp\left( \alpha_0 + \alpha_1 Z_{1,i} + ... + \alpha_m Z_{m,i} \right)\\ &= \sigma^2 \exp\left(\alpha_1 Z_{1,i} + ... + \alpha_m Z_{m,i} \right),\ \text{ where } \exp\left( \alpha_0 \right) = \sigma^2 \end{aligned} \] taking logarithms of both sides and adding/removing the OLS residual yields: \[ \log(\sigma^2_i) = \alpha_0 + \alpha_1 Z_{1,i} + ... + \alpha_m Z_{m,i} \pm \log(\widehat{\epsilon}^2_i) \] which simplifies to: \[ \begin{aligned} \log(\widehat{\epsilon}^2_i) &= \alpha_0 + \alpha_1 Z_{1,i} + ... + \alpha_m Z_{m,i} + \log \left( \dfrac{\widehat{\epsilon}^2_i}{\sigma^2_i} \right)\\ &= \alpha_0 + \alpha_1 Z_{1,i} + ... + \alpha_m Z_{m,i} + v_i \end{aligned} \] Using this model we can estimate \(\alpha_0, ..., \alpha_m\) via OLS. The properties of this model depend on the introduced error term \(v_i\) - whether it is homoskedastic with zero mean. In smaller samples it is not, but in larger samples it is closer to what we expect from an error term.

Feasible GLS Procedure:

  1. Estimate the regression \(\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}\) via OLS.
  2. Use the residuals \(\widehat{\boldsymbol{\varepsilon}}_{OLS}\) and create \(\log(\widehat{\epsilon}_i^2)\).
  3. Estimate the regression \(\log(\widehat{\epsilon}_i^2) = \alpha_0 + \alpha_1 Z_{1,i} + ... + \alpha_m Z_{m,i} + v_i\) and calculate the fitted values \(\widehat{\log(\epsilon_i^2)}\). In practice we use the same variables as in \(\mathbf{X}\), unless we know for sure that there are additional explanatory variables which may determine heteroskedasticity.
  4. Take the exponent of the fitted values: \(\widehat{h}_i = \exp\left(\widehat{\log(\epsilon_i^2)}\right)\).
  5. Estimate the regression \(\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}\) via WLS using weights \(\omega_i^{-1} = 1/\sqrt{\widehat{h}_i}\), i.e. use FGLS with \(\mathbf{\Psi}^\top = \mathbf{\Psi} = \text{diag}\left( 1/\sqrt{\widehat{h}_1}, ..., 1/\sqrt{\widehat{h}_N}\right)\).
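
A minimal sketch of this procedure in R (assuming the OLS fit mdl_1 and the data frame dt with columns y, x1 and x2 from before):

# Steps (2)-(3): regress the log of the squared OLS residuals on the Z variables:
dt$log_e2 <- log(resid(mdl_1)^2)
aux_mdl   <- lm(log_e2 ~ x1 + x2, data = dt)
# Step (4): the estimated variance function h-hat:
h_est <- exp(fitted(aux_mdl))
# Step (5): WLS - note that lm() expects weights that are inversely proportional to the variances:
mdl_wls <- lm(y ~ x1 + x2, data = dt, weights = 1 / h_est)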

As noted in 4.6.4, applying WLS is equivalent to dividing each observation by \(\sqrt{\widehat{h}_i}\) and estimating OLS on the transformed data: \[ Y_i / \sqrt{\widehat{h}_i} = \beta_0 \cdot \left(1/\sqrt{\widehat{h}_i} \right) + \beta_1 \cdot \left(X_{1,i} / \sqrt{\widehat{h}_i} \right) + ... + \beta_k \cdot \left(X_{k,i} / \sqrt{\widehat{h}_i} \right) + \epsilon_i/\sqrt{\widehat{h}_i} \] The downside is that our model no longer contains a constant - \(\beta_0\) is now used for the new (non-constant) variable \(1/\sqrt{\widehat{h}_i} \neq 1\).

We begin by manually transforming the data and estimating OLS on the transformed variables:
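
A sketch of the manual transformation (every variable, including the constant, is divided by \(\sqrt{\widehat{h}_i}\)):

y_s     <- dt$y  / sqrt(h_est)
const_s <- 1     / sqrt(h_est)      # the transformed "constant"
x1_s    <- dt$x1 / sqrt(h_est)
x2_s    <- dt$x2 / sqrt(h_est)
# OLS on the transformed data; the usual intercept is removed and replaced by const_s:
mdl_transformed <- lm(y_s ~ 0 + const_s + x1_s + x2_s)
round(coef(summary(mdl_transformed)), 5)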

Next, we carry out WLS (where \(\mathbf{\Omega}^{-1} = \text{diag} \left(\widehat{h}_1^{-1},...,\widehat{h}_N^{-1} \right)\)).

Manually:

Note: in most econometric software, you need to pass weights, which are inversely proportional to the variances. In other words you need to supply \(\text{weights}= 1 / \sigma_i^2 = 1 / \omega_i^2\) which are the diagonal elements of \(\mathbf{\Omega}^{-1}\), not \(\mathbf{\Psi}\) (regardless of your specification of \(\mathbf{\Sigma}\) - refer to the end of 4.6.3.3). Then, the software automatically takes the square root of the specified values when calculating.

Using built-in functions:

##             Estimate Std. Error  t value Pr(>|t|)
## (Intercept) -1.39237   10.19482 -0.13658  0.89151
## x1           7.97645    4.00325  1.99249  0.04770
## x2          -2.02459    0.88157 -2.29656  0.02270
##               Coef.  Std.Err.        t    P>|t|    [0.025    0.975]
## Intercept  13.07847  14.04532  0.93116  0.35291 -14.62000  40.77694
## x1          6.50810   4.68251  1.38987  0.16614  -2.72618  15.74238
## x2         -3.25579   1.10827 -2.93772  0.00370  -5.44139  -1.07019

We see that we get identical results in all cases.

Finally, we would like to compare the residuals from the OLS and GLS procedures. If we have indeed accounted for heteroskedasticity, then the residual plots should indicate so as well. We note that the GLS estimates are for the model: \[ \begin{aligned} \mathbf{Y}^* &= \mathbf{X}^* \boldsymbol{\beta} + \boldsymbol{\varepsilon}^*,\quad \text{ where } \mathbf{Y}^* = \mathbf{\Psi}^\top \mathbf{Y}, \quad \mathbf{X}^* = \mathbf{\Psi}^\top\mathbf{X}, \quad \boldsymbol{\varepsilon}^* = \mathbf{\Psi}^\top\boldsymbol{\varepsilon} \end{aligned} \] In other words we need to calculate: \(\widehat{\boldsymbol{\varepsilon}}^*_{WLS} = \mathbf{Y}^* - \mathbf{X}^* \widehat{\boldsymbol{\beta}}_{WLS}\).

In most software, using built-in WLS estimation, the residuals are calculated as: \(\widehat{\boldsymbol{\varepsilon}}_{WLS} = \mathbf{Y} - \mathbf{X} \widehat{\boldsymbol{\beta}}_{WLS}\).

Consequently, since we have specified \(\mathbf{\Psi}^\top = \mathbf{\Psi} = \text{diag}\left( 1/\sqrt{\widehat{h}_1}, ..., 1/\sqrt{\widehat{h}_N}\right)\), we can calculate the residuals for the transformed model as: \[ \begin{aligned} \widehat{\boldsymbol{\varepsilon}}^*_{WLS} &= \mathbf{\Psi}^\top \widehat{\boldsymbol{\varepsilon}}_{WLS} \\ &= \text{diag}\left( 1/\sqrt{\widehat{h}_1}, ..., 1/\sqrt{\widehat{h}_N}\right)\widehat{\boldsymbol{\varepsilon}}_{WLS} \end{aligned} \]

Now we can compare the plots:

We see that:

  • the magnitude of the residuals is smaller;
  • there still appears to be some (albeit possibly insignificant) heteroskedasticity.

What would have happened if we had plotted the WLS residuals of the non-transformed data?

Looking at the original, untransformed data, it would appear that we did not account for heteroskedasticity, but this is not the case. As mentioned, we have estimated a model on the transformed data, hence we should analyse the residuals of the transformed data.

As mentioned before, we have only approximately calculated the weights, using the logarithms of the squared residuals. Therefore, we may not always be able to capture all of the heteroskedasticity. Nevertheless, the situation is much better than it was initially.

On the other hand, if we were to use \(1 / i\) as the weights, instead of \(1 / \sqrt{\widehat{h}_{i}}\):
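
In terms of the software weight convention discussed above, this corresponds to supplying weights \(1/i^2\) to lm() (a sketch):

mdl_wls_i <- lm(y ~ x1 + x2, data = dt, weights = 1 / seq_len(nrow(dt))^2)
round(coef(summary(mdl_wls_i)), 5)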

##             Estimate Std. Error   t value Pr(>|t|)
## (Intercept) 15.34254    2.37257   6.46663  0.00000
## x1           5.33401    2.90696   1.83491  0.06803
## x2          -3.32553    0.22286 -14.92183  0.00000

Note that if we wanted to test the residuals of a WLS regression, we should probably use car::ncvTest() instead of lmtest::bptest(). Also note that ncvTest() supports either a weighted or unweighted linear model produced by lm().

The main difference between the two tests:

  • for bptest() - the error variance is a function of a linear combination of the predictors in the model;
  • for ncvTest() - the error variance is a function of the expectation of Y. For the weighted model, ncvTest() uses the Pearson residuals and hence takes the weights into account.

The resulting errors of the transformed model are no longer heteroskedastic.

Unfortunately, if we had used \(1 / \sqrt{X_{1,i}}\), or \(1 / X_{1,i}\), then the residuals would likely have remained heteroskedastic.

This is an important conclusion regarding WLS (and FGLS in general): if we only approximate the weights, then, no matter the approximation, there is a chance that the WLS residuals will still contain some kind of (weak) heteroskedasticity. As such, we should examine different residual plots to determine what kind of weights are appropriate.

In this example the true (unknown) variance of the error term is \(\mathbb{V}{\rm ar}\left( \boldsymbol{\varepsilon} | \mathbf{X} \right) = \mathbf{\Sigma} = \sigma_\epsilon^2 \cdot \text{diag}(1^2, 2^2,...,N^2)\). As mentioned at the end of 4.6.4, this would be equivalent to the case where the \(i\)-th response is an aggregated total of \(N_i = i^2\) observations.

Regarding \(R^2\):

In most econometric software, the coefficient of determination for WLS uses a weighted mean when calculating \(\text{TSS}\). As a result, the reported \(R^2\) now measures the proportion of total variation in the weighted \(Y\), \(Y^*\), explained by the weighted \(X\), \(X^*\).

To get a more conventional expression of \(R^2\), use the general (or pseudo-) expression: \[ R^2_g = \mathbb{C}{\rm orr}(Y, \widehat{Y})^2 \] where \(\widehat{Y}\) are the fitted values of the original (i.e. non-weighted) dependent variable.

Note that the weighted mean is calculated as: \[ \overline{Y}^{(w)} = \dfrac{\sum_{i = 1}^N Y_i/ h_i}{\sum_{i = 1}^N 1/ h_i} \] We can verify this by comparing a manual calculation with the software output:
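
A sketch of this verification, assuming mdl_wls and h_est from before:

w      <- 1 / h_est
y_wavg <- sum(w * dt$y) / sum(w)                  # weighted mean of Y
RSS_w  <- sum(w * resid(mdl_wls)^2)
TSS_w  <- sum(w * (dt$y - y_wavg)^2)
c(manual = 1 - RSS_w / TSS_w, reported = summary(mdl_wls)$r.squared)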

The general (pseudo) \(R^2\):
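
A sketch of its calculation (the fitted values of a weighted lm() fit are already on the original scale of \(Y\)):

cor(dt$y, fitted(mdl_wls))^2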

As was mentioned before regarding \(R^2\) - it is not always a good measure of the overall model adequacy. Even if the OLS \(R^2\) is higher, if the residuals do not conform to (MR.2) - (MR.6) assumptions, then the model and its hypothesis test results are not valid.

4.7.5 Choosing between HCE and WLS

The generalized least squares estimator require that we know the underlying form of the variance-covariance matrix.

Regarding HCE:

  • The variance estimator is quite robust because it is valid whether heteroskedasticity is present or not, but only in a manner that is appropriate asymptotically.
  • In other words, if we are not sure whether the random errors are heteroskedastic or homoskedastic, then we can use a robust variance estimator and be confident that our standard errors, \(t\)-tests, and interval estimates are valid in large samples.
  • In small samples, whether we modify the covariance estimator or not, the usual statistics will still be unreliable.
  • This estimator needs to be modified, if we suspect that the errors may exhibit autocorrelation (of some unknown form).

Regarding WLS:

  • If we know the underlying form of the residual variance-covariance matrix, then FGLS would result in more efficient estimators;
  • If we specify the covariance structure incorrectly, i.e. we do not completely remove heteroskedasticity, then the FGLS estimator will be unbiased, but not the best and the standard errors will still be biased (as in the OLS case).

So, both methods have their benefits and their drawbacks. However, we can combine them both:

  • Attempt to correct for heteroskedasticity via the WLS;
  • Test the residuals from WLS for heteroskedasticity;
  • If the WLS residuals are homoskedastic - the WLS has improved the precision of our model;
  • If the WLS residuals are heteroskedastic - we may use HCE on the WLS residuals.
  • This way we protect ourselves from a possible misspecification of the unknown variance-covariance structure.

Finally, one more thing to note:

Take care of heteroskedasticity only in severe cases; otherwise, the usual OLS procedures give satisfactory and reliable results most of the time. Furthermore:

  • it is usually possible to account for heteroskedasticity by transforming the dependent variable \(Y\), for example \(\log(Y)\), or \(\sqrt{Y}\);
  • using an incorrect functional form, or omitting important variables may also cause heteroskedasticity;

It is important to note that the OLS estimates are unbiased, even if the underlying (true) data generating process actually follows the GLS model. If GLS is unbiased then so is OLS (and vice versa).

The estimates of both OLS and GLS should be close to one another. If they are not, then it would most likely indicate additional problems, like function form misspecification, omitted variables, etc.

4.7.6 Monte Carlo Simulation: OLS vs FGLS

To illustrate the effects of heteroskedasticity on the standard errors of the estimates, and efficiency between OLS, GLS and FGLS, we will carry out a Monte Carlo simulation. We will simulate the following model: \[ \begin{aligned} Y_{i}^{(m)} &= \beta_0 + \beta_1 X_{1,i}^{(m)} + \beta_2 X_{2,i}^{(m)} + \epsilon_i^{(m)} ,\quad \epsilon_i^{(m)} \sim \mathcal{N}(0, i^2 \cdot \sigma^2), \quad i = 1, ..., N,\quad m = 1, ..., MC \end{aligned} \] We will simulate a total of \(MC = 1000\) samples from this model with specific coefficients and estimate the parameters via OLS, WLS, as well as correct the standard errors of OLS via HCE. We will do so with the following code:

set.seed(123)
# Number of simulations:
MC <- 1000
# Fixed parameters
N <- 100
beta_vec <- c(10, 5, -3)
# matrix of parameter estimates for each sample:
beta_est_ols  <- NULL
beta_pval_ols <- NULL
beta_pval_hce <- NULL
#
beta_est_wls  <- NULL
beta_pval_wls <- NULL
#
beta_est_gls  <- NULL
beta_pval_gls <- NULL
#
for(i in 1:MC){
  # simulate the data:
  x1 <- seq(from = 0, to = 5, length.out = N)
  x2 <- sample(seq(from = 3, to = 17, length.out = 80), size = N, replace = TRUE)
  e  <- rnorm(mean = 0, sd = 1:N, n = N)
  y  <- cbind(1, x1, x2) %*% beta_vec + e
  data_mat <- data.frame(y, x1, x2)
  # estimate via OLS
  mdl_0 <- lm(y ~ x1 + x2, data = data_mat)
  # correct OLS se's:
  mdl_hce <- lmtest::coeftest(mdl_0, vcov. = sandwich::vcovHC(mdl_0, type = "HC0"))
  # estimate via WLS
  resid_data <- data.frame(log_e2 = log(mdl_0$residuals^2), x1, x2)
  h_est   <- exp(lm(log_e2 ~ x1 + x2, data = resid_data)$fitted.values)
  mdl_wls <- lm(y ~ x1 + x2, data = data_mat, weights =  1 / h_est)
  # estimate via GLS (by knowing the true covariance matrix)
  mdl_gls <- lm(y ~ x1 + x2, data = data_mat, weights =  1 / (1:N)^2)
  # Save the estimates from each sample:
  beta_est_ols  <- rbind(beta_est_ols, coef(summary(mdl_0))[, 1])
  beta_est_wls  <- rbind(beta_est_wls, coef(summary(mdl_wls))[, 1])
  beta_est_gls  <- rbind(beta_est_gls, coef(summary(mdl_gls))[, 1])
  # Save the coefficient p-values from each sample:
  beta_pval_ols <- rbind(beta_pval_ols, coef(summary(mdl_0))[, 4])
  beta_pval_hce <- rbind(beta_pval_hce, mdl_hce[, 4])
  beta_pval_wls <- rbind(beta_pval_wls, coef(summary(mdl_wls))[, 4])
  beta_pval_gls <- rbind(beta_pval_gls, coef(summary(mdl_gls))[, 4])
}
np.random.seed(123)
# Number of simulations:
MC = 1000
# Fixed parameters
N = 200
beta_vec = np.array([10, 5, -3])
# matrix of parameter estimates for each sample:
beta_est_ols  = np.empty((0, len(beta_vec)), int)
beta_pval_ols = np.empty((0, len(beta_vec)), int)
beta_pval_hce = np.empty((0, len(beta_vec)), int)
#
beta_est_wls  = np.empty((0, len(beta_vec)), int)
beta_pval_wls = np.empty((0, len(beta_vec)), int)
#
beta_est_gls  = np.empty((0, len(beta_vec)), int)
beta_pval_gls = np.empty((0, len(beta_vec)), int)
#
for i in range(0, MC):
    # simulate the data:
    x1 = np.linspace(start = 0, stop = 5, num = N)
    x2 = np.random.choice(np.linspace(start = 3, stop = 17, num = 80), size = N, replace = True)
    e  = np.random.normal(loc = 0, scale = list(range(1, N + 1)), size = N)
    y  = np.dot(sm.add_constant(np.column_stack((x1, x2))), beta_vec) + e
    data_mat = pd.DataFrame(np.column_stack([y, x1, x2]), columns = ["y", "x1", "x2"])
    # estimate via OLS
    mdl_1 = smf.ols(formula = "y ~ x1 + x2", data = data_mat)
    # correct OLS se's:
    mdl_hce = mdl_1.fit().get_robustcov_results(cov_type = "HC0")#.summary2().tables[1]
    # estimate via WLS
    resid_data = pd.DataFrame(np.column_stack([np.log(mdl_1.fit().resid**2), x1, x2]), columns = ["log_e2", "x1", "x2"])
    h_est = np.exp(smf.ols(formula = "log_e2 ~ x1 + x2", data = resid_data).fit().fittedvalues)
    mdl_wls = smf.wls("y ~ x1 + x2", data = data_mat, weights = 1.0 / h_est)
    # estimate via GLS (by knowing the true covariance matrix)
    mdl_gls = smf.wls("y ~ x1 + x2", data = data_mat, weights = 1.0 / np.array(list(range(1, N+1)))**2)
    # Save the estimates from each sample:    
    beta_est_ols = np.vstack([beta_est_ols, mdl_1.fit().params])
    beta_est_wls = np.vstack([beta_est_wls, mdl_wls.fit().params])
    beta_est_gls = np.vstack([beta_est_gls, mdl_gls.fit().params])
    # Save the coefficient p-values from each sample:
    beta_pval_ols = np.vstack([beta_pval_ols, mdl_1.fit().pvalues])
    beta_pval_hce = np.vstack([beta_pval_hce, mdl_hce.pvalues])
    beta_pval_wls = np.vstack([beta_pval_wls, mdl_wls.fit().pvalues])
    beta_pval_gls = np.vstack([beta_pval_gls, mdl_gls.fit().pvalues])
#

Firstly, it is interesting to see how many times we would have rejected the null hypothesis that a parameter is insignificant at the \(\alpha = 0.05\) significance level. We will divide the number of times that we would have rejected the null hypothesis by the total number of samples \(MC\) to get the rejection rate:
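
A sketch of this calculation, using the p-value matrices saved in the R code above:

alpha <- 0.05
rbind(OLS = colMeans(beta_pval_ols < alpha),
      HCE = colMeans(beta_pval_hce < alpha),
      WLS = colMeans(beta_pval_wls < alpha),
      GLS = colMeans(beta_pval_gls < alpha))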

We will comment on the R results. We see that we would have rejected the null hypothesis that \(X_2\) is insignificant in around \(55.5\%\) of the simulated samples with OLS. If we were to correct for heteroskedasticity, this would have increased to \(57.4\%\). On the other hand, if we were to re-estimate the model with WLS, then we would have rejected the null hypothesis that \(X_2\) is insignificant in around \(97.6\%\) of the simulated samples. The results in Python are similar.

One possible explanation for the relatively poor performance of HCE is the fact that it uses \(\widehat{\mathbf{\Sigma}} = \text{diag}(\widehat{\epsilon}_1^2,...,\widehat{\epsilon}_N^2)\) for the covariance matrix. In this case it is clearly inferior to the covariance matrix specification used in WLS. On the other hand, as we have already mentioned - HCE are only useful in large samples.

Because of the large variance of \(\epsilon_i\), we see that we would often not reject the null hypothesis that \(X_1\) is insignificant. On the other hand, as we can see, using WLS reduces this risk significantly.

Finally, if we were to know the true covariance structure, we could use GLS - in that case we would almost always reject the null hypothesis that the coefficient of \(X_2\) is insignificant! Furthermore, the rejection rates of the null hypothesis for \(\beta_0\) and \(\beta_1\) are also significantly larger. Unfortunately, in empirical applications we will never know the true covariance structure, so the GLS results are only presented as the best case scenario, which we would hope to achieve with FGLS.

We can look at the coefficient estimate histograms:

which give a better view of the efficiency of the estimators - the OLS estimates are less efficient than the WLS ones in terms of estimate variance. On the other hand, both estimators are unbiased - their averages are equal to the true parameter values. Consequently, because the OLS estimators are less efficient, we would need a larger sample to achieve a level of precision similar to that of the WLS.

As mentioned before, GLS is the best case scenario, when we exactly know the true covariance structure and do not need to estimate it.

References

Hayes, Andrew F., and Li Cai. 2007. “Using Heteroskedasticity-Consistent Standard Error Estimators in OLS Regression: An Introduction and Software Implementation.” Behavior Research Methods 39 (4): 709–22. https://doi.org/10.3758/BF03192961.