4.9 General Modelling Difficulties

Up until now, we have assumed that the multiple regression equation we are estimating includes all the relevant explanatory variables from the underlying DGP. In practice, this is rarely the case. Sometimes relevant variables are left out because of oversight, ignorance, or even a lack of observations. At other times, irrelevant variables are included in the model.

To further complicate things, even if we somehow include all the relevant variables, we may have specified an incorrect functional form, or even overfitted our model. Even worse, the data itself may have measurement errors, outliers and influential observations.

Finally, if we attempt to take all of this into account, we may end up with more than one acceptable model, yet we still need to select a single model in order to interpret coefficients, predict values, test hypotheses, and so on.

4.9.1 Causality vs. Prediction

When specifying a model, an important consideration is the purpose for which the model is being created. Do we want to:

  1. Use the model for prediction?
  2. Use the model for causal analysis?

With causal inference we are interested in the effect of a one unit (or one percent, depending on the variable) change in a regressor on the conditional mean of the dependent variable, other factors held constant (i.e. ceteris paribus). This includes:

  • whether the effect is statistically significant;
  • the direction (positive or negative) of the effect;
  • the magnitude of the effect;

This type of analysis is important for measuring the effects of policies, laws, taxes, firm expenditure, and so on, on some performance measure, like income, education quality, revenue, etc. In order to measure these effects, we need to be able to separate them from all other factors that may also affect the dependent variable - we want to be able to hold all these other factors constant in order to isolate and measure the effects that we are interested in.

When analysing causality, we may be tempted to rely primarily on the correlation between variables; however, this leads to the cum hoc ergo propter hoc (“with this, therefore because of this”) fallacy of inferring a causal link simply because two events occur together. This is also usually stated as correlation does not imply causation.

Figure 4.2: https://xkcd.com/552/

On the other hand:

If the purpose of our model is to predict the value of the dependent variable, then we want to choose regressors that are highly correlated with the dependent variable (and consequently lead to a high value of \(R^2\), a small value of AIC/BIC, or, overall, a small variance of the residuals). Whether or not these variables have a direct effect on the dependent variable, or even whether we have omitted some relevant variables, is less important.

Predictive analysis is important in the field of big data, where variables are chosen for their predictive ability rather than their causal relationship with the dependent variable.
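For concreteness, a minimal sketch of comparing two candidate predictive specifications by information criteria (the dataset and the particular models below are assumptions chosen purely for illustration):

# A minimal sketch (the data and candidate models are assumptions): comparing
# two predictive specifications by AIC/BIC - smaller values are preferred.
mod_a <- lm(mpg ~ wt, data = mtcars)
mod_b <- lm(mpg ~ wt + hp + qsec, data = mtcars)
AIC(mod_a, mod_b)
BIC(mod_a, mod_b)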

Figure 4.3: https://xkcd.com/605/

For the purpose of the next three subsections, we will deal with the following example:

Example 4.30 Assume that the DGP (i.e. the true model) is of the following form:

\[ Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \epsilon_i \]

When carrying out regression analysis it is assumed that only the correct and important explanatory variables are included in the model and the correct functional form of the relationship is specified. If any of these requirements are violated, then we are making a specification error.

In practical applications, we seldom know the true functional form of the model, so we have to infer it from the data and the relationships between variables, as well as from economic theory. Furthermore, we usually have a limited pool of variables to choose from, which often influences not only the subset of variables included in our model, but also the functional form itself.

We will look at how commonly encountered problems, such as:

  • correlation between explanatory variables;
  • omitting a relevant variable;
  • including an irrelevant variable;

affect our estimated model and our results.

4.9.2 Correlation Between Explanatory Variables

Assuming \(\mathbb{E}(\epsilon_i | X_{1,i}, X_{2,i}) = 0\), looking at the conditional expectation: \[ \mathbb{E}\left( Y_i | X_{1,i}, X_{2,i} \right) = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} \] allows us to have causal interpretations for the coefficients: \[ \beta_j = \dfrac{\partial \mathbb{E}\left( Y_i | X_{1,i}, X_{2,i} \right)}{\partial X_{j,i}}, \quad j = 1,2 \] where \(\beta_j\) represents the change in the mean of \(Y\), resulting from a unit change in \(X_j\), other factors held constant.

Assumption \(\mathbb{E}(\epsilon_i | X_{1,i}, X_{2,i}) = 0\) means that changes in \(X_1\) and \(X_2\) have no effects on the error term.

Now, suppose that the explanatory regressors are correlated: \[ \mathbb{C}{\rm orr}(X_{1,i}, X_{2,i}) \neq 0 \] This is a common occurrence among explanatory variables in practical applications.

For simplicity, assume that the correlation can be described as: \[ \mathbb{E} \left( X_{2,i} |X_{1,i} \right) = \gamma_0 + \gamma_1 X_{1,i} \] If we plug this expression into our conditional expectation of the dependent variable, but this time only condition on \(X_{1,i}\), we get: \[ \begin{aligned} \mathbb{E}\left( Y_i | X_{1,i} \right) &= \beta_0 + \beta_1 X_{1,i} + \beta_2 \mathbb{E} \left( X_{2,i} |X_{1,i} \right) + \mathbb{E}\left( \epsilon_i | X_{1,i} \right) \\ &= \beta_0 + \beta_1 X_{1,i} + \beta_2 \left[ \gamma_0 + \gamma_1 X_{1,i} \right] \\ &= (\beta_0 + \beta_2 \gamma_0) + (\beta_1 + \beta_2 \gamma_1) X_{1,i} \\ &= \alpha_0 + \alpha_1 X_{1,i} \end{aligned} \]

From this specification, we may believe that we only need to estimate \[ Y_i = \alpha_0 + \alpha_1 X_{1,i} + u_i \] via OLS. In such a case, the OLS estimates of \(\alpha_0\) and \(\alpha_1\) would be unbiased. But this may not be as straightforward as it seems.

If our goal is to use \(X_{1,i}\) to predict \(Y_i\), we can use the model with only \(X_{1,i}\) and not worry about excluding \(X_{2,i}\).

On the other hand, since \(X_{2,i}\) is not held constant, \(\alpha_1 = \beta_1 + \beta_2 \gamma_1\) does not measure the causal effect of \(X_{1,i}\) on \(Y_i\): it also includes an indirect effect, consisting of the association between \(X_{1,i}\) and \(X_{2,i}\) (captured by \(\gamma_1\)) combined with the effect of a unit change in \(X_{2,i}\) (captured by \(\beta_2\)). The true causal effect of a unit change in \(X_{1,i}\), given by \(\beta_1\), is not directly estimated, unless \(\beta_2 = 0\), i.e. unless \(X_{2,i}\) does not affect \(Y_i\).
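A minimal simulation sketch of this result (all parameter values below are assumptions chosen purely for illustration): regressing \(Y\) on \(X_1\) alone recovers \(\alpha_1 = \beta_1 + \beta_2 \gamma_1\) rather than \(\beta_1\).

# A minimal sketch (assumed parameter values): omitting the correlated X2
# changes the estimated slope on X1 from beta_1 to beta_1 + beta_2 * gamma_1.
set.seed(123)
N  <- 10000
b0 <- 1; b1 <- 2; b2 <- 3      # assumed coefficients of the DGP
g0 <- 0.5; g1 <- 0.8           # assumed relation E(X2 | X1) = g0 + g1 * X1
x1 <- rnorm(N)
x2 <- g0 + g1 * x1 + rnorm(N)
y  <- b0 + b1 * x1 + b2 * x2 + rnorm(N)
coef(lm(y ~ x1))               # slope approx. b1 + b2 * g1 = 2 + 3 * 0.8 = 4.4
coef(lm(y ~ x1 + x2))          # slopes approx. b1 = 2 and b2 = 3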

Consequently, to estimate the causal effect of \(X_1\) using OLS, we would need to include ALL explanatory variables, which:

  • are correlated with \(X_1\)
  • are correlated with \(Y\)

On the other hand, we need to be mindful of the possibility of multicollinearity.

Because of this, we may need to replace the correlated variables with control variables, which act as a proxy for the omitted correlated explanatory variable.

4.9.3 Omitting Relevant Variables

In order to keep our model easy to interpret, we may sometimes choose to exclude some explanatory variables (including interaction and polynomial terms) which may be significant in the true (unobserved) model. Another reason for omitting variables is that we may not be able to quantify them (e.g. taste, responsibility, etc.), or we may simply not have them in our dataset (e.g. it may be very expensive to create/acquire a large dataset, compared to a simplified one with fewer variables). Furthermore, omission of relevant variables is always a possibility when we have a large selection of explanatory variables to choose from. Finally, we may sometimes simply overlook a particular variable.

Assume that our initial model with \(k\) regressors and \(N\) observations can be written as: \[ \underset{N \times 1}{\mathbf{Y}} = \underset{N \times k}{\mathbf{X}} \ \underset{k \times 1}{\boldsymbol{\beta} }+ \underset{N \times 1}{\boldsymbol{\varepsilon}}, \quad \mathbb{E} \left( \boldsymbol{\varepsilon} | \mathbf{X}\right) = \boldsymbol{0},\quad \mathbb{V}{\rm ar}\left( \boldsymbol{\varepsilon} | \mathbf{X} \right) = \sigma_\epsilon^2 \mathbf{I} \] For simplicity, assume that of the \(k\) variables, we decide to include only \(r\) variables. Then the matrix and parameter vector can be partitioned as follows: \[ \underset{N \times k}{\mathbf{X}} = \begin{bmatrix} \underset{N \times r}{\mathbf{X}_1} & \underset{N \times (k-r)}{\mathbf{X}_2} \end{bmatrix},\quad \underset{k \times 1}{\boldsymbol{\beta}} = \begin{bmatrix} \underset{r \times 1}{\boldsymbol{\beta}_1} & \underset{(k-r) \times 1}{\boldsymbol{\beta}_2} \end{bmatrix} \] So, the full (or true) model can be written as: \[ \underset{N \times 1}{\mathbf{Y}} = \underset{N \times r}{\mathbf{X}_1} \ \underset{r \times 1}{\boldsymbol{\beta}_1} + \underset{N \times (k-r)}{\mathbf{X}_2} \ \underset{(k-r) \times 1}{\boldsymbol{\beta}_2}+ \underset{N \times 1}{\boldsymbol{\varepsilon}} \] But we decided to specify a model with \(k-r\) excluded variables: \[ \underset{N \times 1}{\mathbf{Y}} = \underset{N \times r}{\mathbf{X}_1} \ \underset{r \times 1}{\boldsymbol{\beta}_1} + \underset{N \times 1}{\boldsymbol{\zeta}}, \text{ where } \boldsymbol{\zeta} = \mathbf{X}_2 \boldsymbol{\beta}_2 + \boldsymbol{\varepsilon} \] If we were to estimate the parameters of this misspecified model via OLS and use the linear model expression of the true \(\mathbf{Y}\): \[ \begin{aligned} \widehat{\boldsymbol{\beta}}_1 &= \left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top \mathbf{Y} \\ &=\left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top \left( \mathbf{X}_1 \boldsymbol{\beta}_1 + \mathbf{X}_2 \boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}\right) \\ &= \boldsymbol{\beta}_1 + \left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top \boldsymbol{\varepsilon} + \left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top \mathbf{X}_2 \boldsymbol{\beta}_2 \end{aligned} \] Then, the expected value is calculated similarly to 3.2.4.3: \[ \begin{aligned} \mathbb{E}\left[ \widehat{\boldsymbol{\beta}}_1 \right] &= \boldsymbol{\beta}_1 + \mathbb{E}\left[ \left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top \boldsymbol{\varepsilon} \right] + \mathbb{E}\left[ \left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top \mathbf{X}_2 \boldsymbol{\beta}_2 \right] \\ &= \boldsymbol{\beta}_1 + \mathbb{E}\left[ \mathbb{E} \left(\left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top \boldsymbol{\varepsilon} \biggr\rvert \mathbf{X}_1\right) \right] + \mathbb{E}\left[ \mathbb{E} \left( \left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top \mathbf{X}_2 \boldsymbol{\beta}_2 \biggr\rvert \mathbf{X}_1\right)\right] \\ &= \boldsymbol{\beta}_1 + \mathbb{E} \left[ \left( \mathbf{X}_1 ^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1 ^\top \mathbb{E} \left( \boldsymbol{\varepsilon} | \mathbf{X}_1 \right)\right] + \mathbb{E} \left[ \left( \mathbf{X}_1 ^\top \mathbf{X}_1 \right)^{-1} \mathbb{E} \left( \mathbf{X}_1^\top \mathbf{X}_2 | \mathbf{X}_1 \right) \boldsymbol{\beta}_2 \right] \\ &= \boldsymbol{\beta}_1 + \mathbb{E} \left[ \left( \mathbf{X}_1 ^\top \mathbf{X}_1 \right)^{-1} \mathbb{E} 
\left( \mathbf{X}_1 ^\top \mathbf{X}_2 | \mathbf{X}_1 \right) \boldsymbol{\beta}_2 \right] \end{aligned} \] From this we see a very important result:

  • \(\widehat{\boldsymbol{\beta}}_1\) is generally biased, since \(\mathbb{E}\left[ \widehat{\boldsymbol{\beta}}_1 \right] \neq \boldsymbol{\beta}_1\).
  • The bias vanishes if \(\mathbf{X}_1^\top \mathbf{X}_2 = \boldsymbol{0}\) - every element of \(\mathbf{X}_1^\top \mathbf{X}_2\) is a sum of cross-products of an included and an omitted regressor, and all of these sums are zero - that is, if the omitted variables in \(\mathbf{X}_2\) are orthogonal to (uncorrelated with) the included variables in \(\mathbf{X}_1\).

Similarly, the variance can be expressed as: \[ \begin{aligned} \mathbb{V}{\rm ar} (\widehat{\boldsymbol{\beta}}_1 \rvert \mathbf{X}_1) &= \mathbb{E} \left[(\widehat{\boldsymbol{\beta}}_1 - \mathbb{E}(\widehat{\boldsymbol{\beta}}_1))(\widehat{\boldsymbol{\beta}}_1 - \mathbb{E}(\widehat{\boldsymbol{\beta}}_1))^\top \rvert \mathbf{X}_1 \right] \\ &= \mathbb{E} \left[ \left(\left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top \boldsymbol{\varepsilon} + \left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top \mathbf{X}_2 \boldsymbol{\beta}_2 \right) \left(\left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top \boldsymbol{\varepsilon} + \left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top \mathbf{X}_2 \boldsymbol{\beta}_2 \right)^\top \biggr\rvert \mathbf{X}_1 \right] \\ &= \left(\mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top \mathbb{E} \left[ \boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^\top \right]\mathbf{X}_1 \left(\mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} + \left(\mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top \mathbb{E} \left[ \boldsymbol{\varepsilon}\right]\boldsymbol{\beta}_2^\top \mathbb{E} \left[\mathbf{X}_2^\top \biggr\rvert \mathbf{X}_1 \right]\mathbf{X}_1\left(\mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \\ &+ \left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top \mathbb{E} \left[\mathbf{X}_2 \biggr\rvert \mathbf{X}_1 \right]\boldsymbol{\beta}_2 \mathbb{E} \left[\boldsymbol{\varepsilon}^\top \right]\mathbf{X}_1 \left(\mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} + \left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top \mathbb{E} \left[\mathbf{X}_2 \boldsymbol{\beta}_2 \boldsymbol{\beta}_2^\top \mathbf{X}_2^\top \biggr\rvert \mathbf{X}_1 \right] \mathbf{X}_1\left(\mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \\ &= \sigma^2 \left( \mathbf{X}_1^\top \mathbf{X}_1\right)^{-1} + \left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top \mathbb{E} \left[\mathbf{X}_2 \boldsymbol{\beta}_2 \boldsymbol{\beta}_2^\top \mathbf{X}_2^\top \biggr\rvert \mathbf{X}_1 \right] \mathbf{X}_1\left(\mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \geq \sigma^2 \left( \mathbf{X}_1^\top \mathbf{X}_1\right)^{-1} \end{aligned} \] So:

The variance of the OLS estimates is not the smallest in the misspecified model. This means that the OLS estimates are inefficient.

Finally, we know that the variance of the OLS residuals can be calculated as: \[ \widehat{\sigma}^2 = \dfrac{\mathbb{\widehat{\epsilon}}^\top \mathbb{\widehat{\epsilon}}}{N - r} \] where we can write the residuals as: \[ \begin{aligned} \mathbb{\widehat{\epsilon}} &= \mathbf{Y} - \mathbf{X}_1 \widehat{\boldsymbol{\beta}}_1 \\ &= \mathbf{Y} - \mathbf{X}_1 \left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top \mathbf{Y} \\ &= \left[ \mathbf{I} - \mathbf{X}_1 \left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top\right] \mathbf{Y} \\ &= \left[ \mathbf{I} - \mathbf{X}_1 \left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top\right] \left[ \mathbf{X}_1 \boldsymbol{\beta}_1 + \mathbf{X}_2 \boldsymbol{\beta}_2 + \boldsymbol{\varepsilon} \right] \\ &= \left[ \mathbf{I} - \mathbf{X}_1 \left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top\right] \left[ \mathbf{X}_2 \boldsymbol{\beta}_2 + \boldsymbol{\varepsilon} \right] \\ &= \mathbf{H} \left[ \mathbf{X}_2 \boldsymbol{\beta}_2 + \boldsymbol{\varepsilon} \right], \end{aligned} \] where\(\mathbf{H} = \mathbf{I} - \mathbf{X}_1 \left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top\). We also have that \(\mathbf{H}\) and \(\mathbf{X}_1 \left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1}\) are symmetric and idempotent, i.e. \(\mathbf{H}^\top \mathbf{H} = \mathbf{H} \mathbf{H}^\top = \mathbf{H}\). Furthermore, note that a trace of a matrix can be defined as the sum of the elements on the main diagonal: \[ \begin{aligned} \text{tr}(\mathbf{X}) &= \sum_{i = 1}^N x_{ii}\\ \text{tr}(\mathbf{A}\mathbf{B}\mathbf{C}) &= \text{tr}(\mathbf{B}\mathbf{C}\mathbf{A}) = \text{tr}(\mathbf{C}\mathbf{A}\mathbf{B}) \\ \text{tr}(\mathbf{A}\mathbf{B}\mathbf{C}) &\neq \text{tr}(\mathbf{C}\mathbf{B}\mathbf{A})\\ \text{tr}(\mathbf{X}^\top \mathbf{Y}) &= \text{tr}(\mathbf{X} \mathbf{Y}^\top) = \text{tr}(\mathbf{Y}^\top \mathbf{X}) = \text{tr}(\mathbf{Y} \mathbf{X}^\top) =\sum_{i,j = 1}^N X_{i,j} Y_{i,j} \end{aligned} \] then: \[ \text{tr} \left( \mathbf{H} \right) = \text{tr} \left( \mathbf{I} - \mathbf{X}_1 \left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top \right) = \text{tr} \left( \mathbf{I} \right) - \text{tr} \left(\mathbf{X}_1 \left( \mathbf{X}_1^\top \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^\top \right) = N - r \] Furthermore, since \(\boldsymbol{\varepsilon}^\top \mathbf{H} \boldsymbol{\varepsilon}\) is a scalar (i.e. 
a number), it holds that: \[ \boldsymbol{\varepsilon}^\top \mathbf{H} \boldsymbol{\varepsilon} = \text{tr} \left(\boldsymbol{\varepsilon}^\top \mathbf{H} \boldsymbol{\varepsilon} \right) = \text{tr} \left(\mathbf{H} \boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^\top \right) \] and: \[ \mathbb{E}(\boldsymbol{\varepsilon}^\top \mathbf{H} \boldsymbol{\varepsilon}) = \text{tr} \left(\mathbf{H} \mathbb{E}(\boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^\top) \right) = \sigma^2 \text{tr} \left(\mathbf{H}\right) \] Which means that the expected value of the residual variance estimate is: \[ \begin{aligned} \mathbb{E} \left[ \widehat{\sigma}^2 \right] &= \dfrac{1}{N - r} \mathbb{E} \left[ \mathbb{\widehat{\epsilon}}^\top \mathbb{\widehat{\epsilon}} \right] \\ &= \dfrac{1}{N - r} \mathbb{E} \left[ \left( \mathbf{H} \mathbf{X}_2 \boldsymbol{\beta}_2 + \mathbf{H} \boldsymbol{\varepsilon}\right)^\top \left( \mathbf{H} \mathbf{X}_2 \boldsymbol{\beta}_2 + \mathbf{H} \boldsymbol{\varepsilon}\right)\right] \\ &= \dfrac{1}{N - r} \mathbb{E} \left[ \left( \mathbf{H} \mathbf{X}_2 \boldsymbol{\beta}_2\right)^\top \left( \mathbf{H} \mathbf{X}_2 \boldsymbol{\beta}_2\right) + \left( \mathbf{H} \mathbf{X}_2 \boldsymbol{\beta}_2\right)^\top \boldsymbol{\varepsilon} + \boldsymbol{\varepsilon}^\top \mathbf{H}^\top \left( \mathbf{H} \mathbf{X}_2 \boldsymbol{\beta}_2\right) + \boldsymbol{\varepsilon}^\top \mathbf{H}^\top \mathbf{H} \boldsymbol{\varepsilon}\right] \\ &= \dfrac{1}{N - r} \mathbb{E} \left[ \left( \mathbf{H} \mathbf{X}_2 \boldsymbol{\beta}_2\right)^\top \left( \mathbf{H} \mathbf{X}_2 \boldsymbol{\beta}_2\right)+ \boldsymbol{\varepsilon}^\top \mathbf{H} \boldsymbol{\varepsilon}\right] \\ &= \dfrac{1}{N - r} \mathbb{E} \left[ \left( \mathbf{H} \mathbf{X}_2 \boldsymbol{\beta}_2\right)^\top \left( \mathbf{H} \mathbf{X}_2 \boldsymbol{\beta}_2\right)\right] + \dfrac{1}{N - r} \text{tr} \left(\mathbf{H} \mathbb{E} \left[ \boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^\top \right]\right) \\ &= \dfrac{1}{N - r} \sigma^2 \text{tr} \left(\mathbf{H}\right) + \dfrac{1}{N - r} \mathbb{E} \left[ \left( \mathbf{H} \mathbf{X}_2 \boldsymbol{\beta}_2\right)^\top \left( \mathbf{H} \mathbf{X}_2 \boldsymbol{\beta}_2\right)\right] \\ &= \sigma^2 + \dfrac{1}{N - r} \mathbb{E} \left[ \left( \mathbf{H} \mathbf{X}_2 \boldsymbol{\beta}_2\right)^\top \left( \mathbf{H} \mathbf{X}_2 \boldsymbol{\beta}_2\right)\right] > \sigma^2 \end{aligned} \] Thus:

If we omit relevant variables from our regression model:

  • The OLS estimator is biased and inefficient;
  • The residual OLS variance estimator is biased;
  • \(\sigma^2\) is, on average, overestimated, since \(\mathbb{E} \left[ \widehat{\sigma}^2 \right] \gt \sigma^2\);
  • This overestimation holds true even if \(\mathbf{X}_1^\top \mathbf{X}_2 = \boldsymbol{0}\);

As a result, statistical inferences on the coefficients based on \(t\)-tests and confidence intervals are invalid.

All in all, we want to avoid excluding important variables from our model.

An alternative (and possibly more intuitive) way of thinking is that omitting a relevant variable \(X_{j}\) is equivalent to using restricted least squares (RLS), where the restriction \(\beta_j = 0\) is incorrect. As was mentioned for RLS, this results in biased and inconsistent parameter estimators (however, they are still efficient). However, if the excluded explanatory variable is not correlated with the included explanatory variables - then there will be no omitted variable bias.

4.9.3.1 Monte Carlo Simulation Example

TBA
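Until the full example is added, a minimal Monte Carlo sketch of the omitted-variable bias (the DGP and all parameter values below are assumptions for illustration only):

# A minimal sketch (assumed DGP): omitting x2, which is correlated with x1,
# biases the OLS estimate of the coefficient on x1.
set.seed(123)
n_sim <- 1000; N <- 100
b0 <- 1; b1 <- 2; b2 <- 3
est_full <- est_omit <- numeric(n_sim)
for (r in 1:n_sim) {
  x1 <- rnorm(N)
  x2 <- 0.5 * x1 + rnorm(N)            # omitted variable, correlated with x1
  y  <- b0 + b1 * x1 + b2 * x2 + rnorm(N)
  est_full[r] <- coef(lm(y ~ x1 + x2))["x1"]
  est_omit[r] <- coef(lm(y ~ x1))["x1"]
}
c(full = mean(est_full), omitted = mean(est_omit))   # approx. 2 vs. 2 + 3 * 0.5 = 3.5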

4.9.4 Including Irrelevant Variables

In an attempt to avoid omitting an important variable, or simply because we are too enthusiastic, we may include explanatory variables which are not relevant to the model. These variables contribute very little to the explanatory power of the model and reduce the degrees of freedom \(N-k\). This may affect any inference we draw from the model, for example, by artificially increasing the \(R^2\) of the model.

Now assume that our true model includes \(k\) explanatory variables: \[ \underset{N \times 1}{\mathbf{Y}} = \underset{N \times k}{\mathbf{X}}\underset{k \times 1}{\boldsymbol{\beta}} + \underset{N \times 1}{\boldsymbol{\varepsilon}},\quad \mathbb{E} \left( \boldsymbol{\varepsilon} | \mathbf{X}\right) = \boldsymbol{0},\quad \mathbb{V}{\rm ar}\left( \boldsymbol{\varepsilon} | \mathbf{X} \right) = \sigma_\epsilon^2 \mathbf{I} \] However, we have included \(r\) additional variables, which we will collect in a separate matrix \(\mathbf{Z}\). Then our misspecified model is: \[ \underset{N \times 1}{\mathbf{Y}} = \underset{N \times k}{\mathbf{X}} \ \underset{k \times 1}{\boldsymbol{\beta}} + \underset{N \times r}{\mathbf{Z}} \ \underset{r \times 1}{\boldsymbol{\gamma}}+ \underset{N \times 1}{\boldsymbol{u}} \] or, more compactly: \[ \mathbf{Y} = \begin{bmatrix} \mathbf{X} & \mathbf{Z} \end{bmatrix} \begin{bmatrix} \boldsymbol{\beta} \\ \boldsymbol{\gamma} \end{bmatrix} + \boldsymbol{u} \] If we were to estimate our misspecified model via OLS, we would get: \[ \begin{bmatrix} \widehat{\boldsymbol{\beta}} \\ \widehat{\boldsymbol{\gamma}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}^\top \mathbf{X} & \mathbf{X}^\top \mathbf{Z}\\ \mathbf{Z}^\top \mathbf{X} & \mathbf{Z}^\top \mathbf{Z} \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{X}^\top \mathbf{Y}\\ \mathbf{Z}^\top \mathbf{Y} \end{bmatrix} \] which we can also write as: \[ \begin{bmatrix} \mathbf{X}^\top \mathbf{X} & \mathbf{X}^\top \mathbf{Z}\\ \mathbf{Z}^\top \mathbf{X} & \mathbf{Z}^\top \mathbf{Z} \end{bmatrix} \begin{bmatrix} \widehat{\boldsymbol{\beta}} \\ \widehat{\boldsymbol{\gamma}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}^\top \mathbf{Y}\\ \mathbf{Z}^\top \mathbf{Y} \end{bmatrix} \] We are interested how well OLS estimates \(\boldsymbol{\beta}\). 
Notice that the first and second rows of the equality can be written as separate equalities: \[ \begin{aligned} \mathbf{X}^\top \mathbf{X} \widehat{\boldsymbol{\beta}} + \mathbf{X}^\top \mathbf{Z} \widehat{\boldsymbol{\gamma}} &= \mathbf{X}^\top \mathbf{Y}\\ \mathbf{Z}^\top \mathbf{X} \widehat{\boldsymbol{\beta}} + \mathbf{Z}^\top \mathbf{Z} \widehat{\boldsymbol{\gamma}} &= \mathbf{Z}^\top \mathbf{Y} \end{aligned} \] Multiplying both sides of the second row by \(\mathbf{X}^\top \mathbf{Z} \left(\mathbf{Z}^\top \mathbf{Z} \right)^{-1}\) yields: \[ \begin{aligned} \mathbf{X}^\top \mathbf{Z} \left(\mathbf{Z}^\top \mathbf{Z} \right)^{-1} \mathbf{Z}^\top \mathbf{X} \widehat{\boldsymbol{\beta}} + \mathbf{X}^\top \mathbf{Z} \left(\mathbf{Z}^\top \mathbf{Z} \right)^{-1} \mathbf{Z}^\top \mathbf{Z}\widehat{\boldsymbol{\gamma}} &= \mathbf{X}^\top \mathbf{Z} \left(\mathbf{Z}^\top \mathbf{Z} \right)^{-1} \mathbf{Z}^\top \mathbf{Y}\\ \Downarrow& \\ \mathbf{X}^\top \mathbf{Z} \left(\mathbf{Z}^\top \mathbf{Z} \right)^{-1} \mathbf{Z}^\top \mathbf{X} \widehat{\boldsymbol{\beta}} + \mathbf{X}^\top \mathbf{Z} \widehat{\boldsymbol{\gamma}} &= \mathbf{X}^\top \mathbf{Z} \left(\mathbf{Z}^\top \mathbf{Z} \right)^{-1} \mathbf{Z}^\top \mathbf{Y} \end{aligned} \] Then, subtracting it from the first row gives us: \[ \begin{aligned} \left( \mathbf{X}^\top \mathbf{X} - \mathbf{X}^\top \mathbf{Z} \left(\mathbf{Z}^\top \mathbf{Z} \right)^{-1} \mathbf{Z}^\top \mathbf{X} \right) \widehat{\boldsymbol{\beta}} &= \mathbf{X}^\top \mathbf{Y} - \mathbf{X}^\top \mathbf{Z} \left(\mathbf{Z}^\top \mathbf{Z} \right)^{-1} \mathbf{Z}^\top \mathbf{Y}\\ \Downarrow& \\ \mathbf{X}^\top \left( \mathbf{I} - \mathbf{Z} \left(\mathbf{Z}^\top \mathbf{Z} \right)^{-1} \mathbf{Z}^\top \right) \mathbf{X} \widehat{\boldsymbol{\beta}} &= \mathbf{X}^\top \left(\mathbf{I} - \mathbf{Z} \left(\mathbf{Z}^\top \mathbf{Z} \right)^{-1} \mathbf{Z}^\top \right)\mathbf{Y} \end{aligned} \] Setting \(\mathbf{H} = \mathbf{I} - \mathbf{Z} \left(\mathbf{Z}^\top \mathbf{Z} \right)^{-1}\) allows to to get the OLS estimate of \(\boldsymbol{\beta}\): \[ \begin{aligned} \widehat{\boldsymbol{\beta}} &= \left( \mathbf{X}^\top \mathbf{H} \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{H}\mathbf{Y} \\ &= \left( \mathbf{X}^\top \mathbf{H} \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{H} ( \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) \\ &= \boldsymbol{\beta} + \left( \mathbf{X}^\top \mathbf{H} \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{H}\boldsymbol{\varepsilon} \end{aligned} \] Then, the expected value: \[ \begin{aligned} \mathbb{E}\left[ \widehat{\boldsymbol{\beta}}\right] &= \boldsymbol{\beta} + \mathbb{E}\left[ \left( \mathbf{X}^\top \mathbf{H} \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{H} \boldsymbol{\varepsilon}\right] \\ &= \boldsymbol{\beta} + \mathbb{E}\left[ \mathbb{E} \left(\left( \mathbf{X}^\top \mathbf{H} \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{H} \boldsymbol{\varepsilon} \biggr \rvert \mathbf{X}\right)\right] \\ &= \boldsymbol{\beta} + \mathbb{E}\left[ \left( \mathbf{X}^\top \mathbf{H} \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{H} \mathbb{E}(\boldsymbol{\varepsilon})\right] \\ &= \boldsymbol{\beta} \end{aligned} \]

\(\widehat{\boldsymbol{\beta}}\) is an unbiased estimate, even when some irrelevant variables are included in the model.

The variance of the OLS estimate is: \[ \begin{aligned} \mathbb{V}{\rm ar} (\widehat{\boldsymbol{\beta}} \rvert \mathbf{X}) &= \mathbb{E} \left[(\widehat{\boldsymbol{\beta}} - \mathbb{E}(\widehat{\boldsymbol{\beta}}))(\widehat{\boldsymbol{\beta}} - \mathbb{E}(\widehat{\boldsymbol{\beta}}))^\top \rvert \mathbf{X} \right] \\ &=\mathbb{E}\left[ \left( \mathbf{X}^\top \mathbf{H} \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{H}\boldsymbol{\varepsilon} \left(\left( \mathbf{X}^\top \mathbf{H} \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{H}\boldsymbol{\varepsilon}\right)^\top\biggr\rvert \mathbf{X}\right]\\ &=\mathbb{E}\left[ \left( \mathbf{X}^\top \mathbf{H} \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{H}\boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^\top \mathbf{H}^\top \mathbf{X} \left( \mathbf{X}^\top \mathbf{H} \mathbf{X} \right)^{-1}\biggr\rvert \mathbf{X}\right]\\ &= \left( \mathbf{X}^\top \mathbf{H} \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{H} \left(\sigma^2 \mathbf{I} \right) \mathbf{H}^\top \mathbf{X} \left( \mathbf{X}^\top \mathbf{H} \mathbf{X} \right)^{-1} \\ &= \sigma^2 \left( \mathbf{X}^\top \mathbf{H} \mathbf{X} \right)^{-1} \geq \sigma^2 \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \end{aligned} \]

If \(\mathbf{X}\) and \(\mathbf{Z}\) are orthogonal (i.e. uncorrelated), then the OLS estimates are efficient, despite inclusion of irrelevant variables. Otherwise, they are inefficient.

Finally, it can be shown, that: \[ \widehat{\sigma}^2 = \dfrac{\mathbb{\widehat{\epsilon}}^\top \mathbb{\widehat{\epsilon}}}{N - k-r} \] is an unbiased estimator of \(\sigma^2\).

If we include irrelevant variables in our regression model:

  • The OLS estimator is still unbiased.
  • The standard errors of the estimators are larger, compared to the model with only relevant variables.
  • Consequently OLS estimators are no longer BLUE, as they are no longer efficient.

The loss in efficiency is less harmful than biased and inconsistent estimates. Therefore, when one is unsure about a model specification, one is better off including too many variables than too few (though including too many variables leads to its own set of problems concerning collinearity, efficiency, etc.). This is sometimes called a kitchen-sink regression.

4.9.4.1 Monte Carlo Simulation Example

TBA
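Until the full example is added, a minimal Monte Carlo sketch (the DGP, the parameter values and the irrelevant regressor below are assumptions for illustration only):

# A minimal sketch (assumed DGP): including an irrelevant regressor z leaves
# the estimate of the coefficient on x unbiased, but increases its variance.
set.seed(123)
n_sim <- 1000; N <- 100
b0 <- 1; b1 <- 2
est_true <- est_irrel <- numeric(n_sim)
for (r in 1:n_sim) {
  x <- rnorm(N)
  z <- 0.7 * x + rnorm(N)              # irrelevant, but correlated with x
  y <- b0 + b1 * x + rnorm(N)          # z does not enter the DGP
  est_true[r]  <- coef(lm(y ~ x))["x"]
  est_irrel[r] <- coef(lm(y ~ x + z))["x"]
}
c(mean(est_true), mean(est_irrel))     # both close to b1 = 2 (unbiased)
c(sd(est_true), sd(est_irrel))         # larger spread when z is included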

4.9.5 Incorrect Functional Form

Specifying an incorrect functional form may result in the following:

  • Heteroskedastic residuals;
  • Autocorrelated residuals;

Hence, do not assume that significant heteroskedasticity or autocorrelation can always be adequately fixed by using robust standard errors - first examine the model for any specification problems.

Many economic applications use either a log-log, or log-linear model, since logarithmic transformations stabilize the variance of the transformed variable.

Example 4.31 Assume that we specify a linear, instead of a log-linear model.

Specifically, assume that the true model is: \[ \log(Y_i) = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \epsilon_i \] But assume that we fit the following model: \[ Y_i = \alpha_0 + \alpha_1 X_{1,i} + \alpha_2 X_{2,i} + \zeta_i \]
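A minimal sketch of how such data could be simulated and the two specifications estimated (the parameter values, error variance and sample size are assumptions; the code that generated the output below is not shown in the text):

# A sketch (assumed parameter values): simulate a log-linear DGP and fit both
# the misspecified linear model and the correctly specified log-linear model.
set.seed(123)
N  <- 250
x1 <- runif(N, 0, 0.3)
x2 <- runif(N, 0, 10)
y  <- exp(0.2 + 0.4 * x1 - 0.5 * x2 + rnorm(N, sd = 0.3))
lm_1_false <- lm(y ~ x1 + x2)          # misspecified: linear in y
lm_1_true  <- lm(log(y) ~ x1 + x2)     # correctly specified: log-linear
summary(lm_1_false)$coefficients
summary(lm_1_true)$coefficients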

##             Estimate Std. Error   t value Pr(>|t|)
## (Intercept)  1.28436    0.80315   1.59915  0.11107
## x1           9.62872    5.25331   1.83289  0.06802
## x2          -0.66323    0.04810 -13.78934  0.00000
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Intercept      3.0074      1.300      2.313      0.022       0.446       5.569
## x1             0.0914      8.490      0.011      0.991     -16.630      16.813
## x2            -0.8737      0.084    -10.461      0.000      -1.038      -0.709
## ==============================================================================

So, we would reject the null hypothesis of homoskedasticity, which means that the residuals would appear to be heteroskedastic. However, if instead of opting to use HCE, we were to specify a log-linear model:

##             Estimate Std. Error    t value Pr(>|t|)
## (Intercept)  0.14242    0.06768    2.10452  0.03634
## x1           0.26047    0.44266    0.58844  0.55678
## x2          -0.50738    0.00405 -125.19452  0.00000
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Intercept      0.1086      0.069      1.566      0.119      -0.028       0.245
## x1             0.6438      0.453      1.422      0.156      -0.248       1.535
## x2            -0.5095      0.004   -114.405      0.000      -0.518      -0.501
## ==============================================================================
## 
##  Goldfeld-Quandt test
## 
## data:  lm_1_true
## GQ = 1, df1 = 122, df2 = 122, p-value = 1
## alternative hypothesis: variance changes from segment 1 to 2
## 
##  Goldfeld-Quandt test
## 
## data:  lm_1_true
## GQ = 0.97116, df1 = 122, df2 = 122, p-value = 0.8719
## alternative hypothesis: variance changes from segment 1 to 2
## 
##  studentized Breusch-Pagan test
## 
## data:  lm_1_true
## BP = 8.5609, df = 5, p-value = 0.1279

We would not reject the null hypothesis that the residuals are homoskedastic.
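For reference, the heteroskedasticity tests reported above can be obtained with the lmtest package; a sketch of possible calls, continuing the simulated example above (the two-sided alternative and the auxiliary regressors for the Breusch-Pagan test are assumptions inferred from the printed output):

# A sketch of possible calls behind the output above; the exact arguments
# used by the author are not shown in the text.
library(lmtest)
gqtest(lm_1_true, alternative = "two.sided")                                  # Goldfeld-Quandt
bptest(lm_1_true, ~ x1 * x2 + I(x1^2) + I(x2^2), data = data.frame(x1, x2))   # studentized Breusch-Pagan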

Example 4.32 Assume that we specify a correct form for the dependent variable, but an incorrect form for the independent variable:
##              Estimate Std. Error  t value Pr(>|t|)
## (Intercept) 393.97792    6.32438 62.29509  0.00000
## x1            0.54056    0.01485 36.39325  0.00000
## x2            2.79541    2.24355  1.24597  0.21395
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Intercept    387.9252      6.082     63.777      0.000     375.945     399.905
## x1             0.5424      0.015     36.879      0.000       0.513       0.571
## x2             5.3592      2.053      2.610      0.010       1.316       9.403
## ==============================================================================

If we test for autocorrelation:
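In R, these tests are available in the lmtest package; a sketch of the calls (lm_2_false denotes the misspecified fit of y on x1 and x2 from this example, estimated by code not shown in the text):

# A sketch of the autocorrelation tests; argument choices mirror the output.
library(lmtest)
dwtest(lm_2_false, alternative = "two.sided")   # Durbin-Watson test
bgtest(lm_2_false, order = 2)                   # Breusch-Godfrey test, order up to 2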

## 
##  Durbin-Watson test
## 
## data:  lm_2_false
## DW = 0.053879, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is not 0
## 
##  Breusch-Godfrey test for serial correlation of order up to 2
## 
## data:  lm_2_false
## LM test = 213.53, df = 2, p-value < 2.2e-16
##              0              1
## 0      LM-stat   2.223670e+02
## 1  LM: p-value   5.171543e-49
## 2      F-value   9.857762e+02
## 3  F:  p-value  6.718765e-118

We would reject the null hypothesis of no residual autocorrelation.

If we were to test for heteroskedasticity:

## 
##  Goldfeld-Quandt test
## 
## data:  lm_2_false
## GQ = 0.035159, df1 = 122, df2 = 122, p-value < 2.2e-16
## alternative hypothesis: variance changes from segment 1 to 2
## 
##  studentized Breusch-Pagan test
## 
## data:  lm_2_false
## BP = 70.538, df = 5, p-value = 7.92e-14

We would also reject the null hypothesis of homoskedasticity.

Specifying the true model:

##             Estimate Std. Error   t value Pr(>|t|)
## (Intercept)  2.23478    2.16937   1.03015  0.30395
## log(x1)     99.88577    0.38009 262.79397  0.00000
## x2           3.38075    0.33788  10.00584  0.00000
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Intercept      1.6708      0.849      1.968      0.050      -0.001       3.343
## np.log(x1)    99.9908      0.150    665.583      0.000      99.695     100.287
## x2             3.1337      0.123     25.394      0.000       2.891       3.377
## ==============================================================================

Now regarding the autocorrelation:

## 
##  Durbin-Watson test
## 
## data:  lm_2_true
## DW = 2.0926, p-value = 0.5
## alternative hypothesis: true autocorrelation is not 0
## 
##  Breusch-Godfrey test for serial correlation of order up to 2
## 
## data:  lm_2_true
## LM test = 0.61856, df = 2, p-value = 0.734
##              0         1
## 0      LM-stat  0.410973
## 1  LM: p-value  0.814251
## 2      F-value  0.201708
## 3  F:  p-value  0.817469

We would not reject the null hypothesis of no serial correlation.

If we were to test for heteroskedasticity:

## 
##  Goldfeld-Quandt test
## 
## data:  lm_2_true
## GQ = 0.89493, df1 = 122, df2 = 122, p-value = 0.5408
## alternative hypothesis: variance changes from segment 1 to 2

In this case, we would not reject the null hypothesis of homoskedasticity.

So, choosing a correct functional form for both the dependent and independent variables is crucial to guarantee that any heteroskedasticity or autocorrelation in the residuals is due to the underlying process in the data and not because of an incorrectly specified functional form.

Fortunately, we can test the functional form specification (and inclusion of variables) somewhat more conveniently with a number of tests.

4.9.6 Polynomial vs Orthogonal Polynomial Explanatory Variables

To be updated!

Orthogonal polynomials first set up the model matrix with the raw coding \(X\), \(X^2\), \(X^3\), … and then scale the columns so that each column is orthogonal to the previous ones. This does not change the fitted values, but has the advantage that you can see whether a certain order in the polynomial significantly improves the regression over the lower orders.

In R, the poly() function is equivalent to the QR decomposition of the matrix whose columns are powers of \(X\) (after centering).
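For example, a small sketch producing the two codings printed below for \(X = 1, \dots, 10\):

# Orthogonal versus raw polynomial coding of x = 1, ..., 10.
x <- 1:10
poly(x, 4)              # orthogonal coding (columns orthogonal to each other)
poly(x, 4, raw = TRUE)  # raw coding: x, x^2, x^3, x^4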

##                 1           2          3           4
##  [1,] -0.49543369  0.52223297 -0.4534252  0.33658092
##  [2,] -0.38533732  0.17407766  0.1511417 -0.41137668
##  [3,] -0.27524094 -0.08703883  0.3778543 -0.31788198
##  [4,] -0.16514456 -0.26111648  0.3346710  0.05609682
##  [5,] -0.05504819 -0.34815531  0.1295501  0.33658092
##  [6,]  0.05504819 -0.34815531 -0.1295501  0.33658092
##  [7,]  0.16514456 -0.26111648 -0.3346710  0.05609682
##  [8,]  0.27524094 -0.08703883 -0.3778543 -0.31788198
##  [9,]  0.38533732  0.17407766 -0.1511417 -0.41137668
## [10,]  0.49543369  0.52223297  0.4534252  0.33658092
## [[-0.49543369  0.52223297  0.45342519 -0.33658092]
##  [-0.38533732  0.17407766 -0.15114173  0.41137668]
##  [-0.27524094 -0.08703883 -0.37785433  0.31788198]
##  [-0.16514456 -0.26111648 -0.33467098 -0.05609682]
##  [-0.05504819 -0.34815531 -0.12955006 -0.33658092]
##  [ 0.05504819 -0.34815531  0.12955006 -0.33658092]
##  [ 0.16514456 -0.26111648  0.33467098 -0.05609682]
##  [ 0.27524094 -0.08703883  0.37785433  0.31788198]
##  [ 0.38533732  0.17407766  0.15114173  0.41137668]
##  [ 0.49543369  0.52223297 -0.45342519 -0.33658092]]
##        1   2    3     4
##  [1,]  1   1    1     1
##  [2,]  2   4    8    16
##  [3,]  3   9   27    81
##  [4,]  4  16   64   256
##  [5,]  5  25  125   625
##  [6,]  6  36  216  1296
##  [7,]  7  49  343  2401
##  [8,]  8  64  512  4096
##  [9,]  9  81  729  6561
## [10,] 10 100 1000 10000
## [[    1     1     1     1]
##  [    2     4     8    16]
##  [    3     9    27    81]
##  [    4    16    64   256]
##  [    5    25   125   625]
##  [    6    36   216  1296]
##  [    7    49   343  2401]
##  [    8    64   512  4096]
##  [    9    81   729  6561]
##  [   10   100  1000 10000]]
Example 4.33 Consider the datasets::cars dataset and assume that we want to fit a polynomial model for dist using speed as an explanatory variable.
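A sketch of the three fits whose coefficient tables are printed below (a simple linear model, a raw-coded quadratic and an orthogonally-coded quadratic):

# The cars dataset is built into R (datasets::cars).
coef(summary(lm(dist ~ speed, data = cars)))                        # linear
coef(summary(lm(dist ~ speed + I(speed^2), data = cars)))           # raw quadratic
coef(summary(lm(dist ~ poly(speed, 2, raw = FALSE), data = cars)))  # orthogonal quadratic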
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584 -2.6011   0.0123
## speed         3.9324     0.4155  9.4640   0.0000
##              Coef.  Std.Err.       t   P>|t|
## Intercept -17.5791    6.7584 -2.6011  0.0123
## speed       3.9324    0.4155  9.4640  0.0000
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.4701    14.8172  0.1667   0.8683
## speed         0.9133     2.0342  0.4490   0.6555
## I(speed^2)    0.1000     0.0660  1.5153   0.1364
##                               Coef.  Std.Err.       t   P>|t|
## Intercept                    2.4701   14.8172  0.1667  0.8683
## poly(speed, 2, raw=True)[0]  0.9133    2.0342  0.4490  0.6555
## poly(speed, 2, raw=True)[1]  0.1000    0.0660  1.5153  0.1364
##                              Estimate Std. Error t value Pr(>|t|)
## (Intercept)                   42.9800     2.1462 20.0259   0.0000
## poly(speed, 2, raw = FALSE)1 145.5523    15.1761  9.5909   0.0000
## poly(speed, 2, raw = FALSE)2  22.9958    15.1761  1.5153   0.1364
##                                  Coef.  Std.Err.        t   P>|t|
## Intercept                      42.9800    2.1462  20.0259  0.0000
## poly(speed, 2, raw=False)[0]  145.5523   15.1761   9.5909  0.0000
## poly(speed, 2, raw=False)[1]  -22.9958   15.1761  -1.5153  0.1364

Thus:

  • The \(p\)-value for speed^2 is the same in both models. However, the significance of speed is different in the two models. This is because the two regressors are correlated (i.e. not orthogonal) in the original data (this is to be expected, since we are examining polynomials of the same variable).
  • Thus, in the raw coding you can only interpret the \(p\)-value of speed if speed^2 remains in the model, as both regressors are highly correlated. This is why it is considered good practice to drop one variable at a time, instead of simultaneously dropping multiple regressors. On the other hand, since both terms are insignificant in this case, we would usually drop the higher order polynomial term, rather than the lower order term with the larger \(p\)-value.
  • However, in the orthogonal coding speed^2 captures only the quadratic part that has not been captured by the linear term. It then becomes clear that the linear part is significant, while the quadratic part adds no significant explanatory power.
Note that the exact units in polynomial regressions are generally difficult to interpret, but this is true for either polynomial specification. In the orthogonal case, the quadratic term gives the deviations from the purely linear polynomial, the cubic term the deviations from the purely quadratic polynomial, and so on.

Unfortunately, there is another undesirable aspect of ordinary (raw) polynomials in regression. For example, if we fit a quadratic and then a cubic model, the lower order coefficients in the cubic model are all different from those in the quadratic model:

##                             Estimate Std. Error t value Pr(>|t|)
## (Intercept)                   2.4701    14.8172  0.1667   0.8683
## poly(speed, 2, raw = TRUE)1   0.9133     2.0342  0.4490   0.6555
## poly(speed, 2, raw = TRUE)2   0.1000     0.0660  1.5153   0.1364
##                             Estimate Std. Error t value Pr(>|t|)
## (Intercept)                 -19.5050    28.4053 -0.6867   0.4957
## poly(speed, 3, raw = TRUE)1   6.8011     6.8011  1.0000   0.3225
## poly(speed, 3, raw = TRUE)2  -0.3497     0.4999 -0.6995   0.4878
## poly(speed, 3, raw = TRUE)3   0.0103     0.0113  0.9074   0.3689
##                               Coef.  Std.Err.       t   P>|t|
## Intercept                    2.4701   14.8172  0.1667  0.8683
## poly(speed, 2, raw=True)[0]  0.9133    2.0342  0.4490  0.6555
## poly(speed, 2, raw=True)[1]  0.1000    0.0660  1.5153  0.1364
##                                Coef.  Std.Err.       t   P>|t|
## Intercept                   -19.5050   28.4053 -0.6867  0.4957
## poly(speed, 3, raw=True)[0]   6.8011    6.8011  1.0000  0.3225
## poly(speed, 3, raw=True)[1]  -0.3497    0.4999 -0.6995  0.4878
## poly(speed, 3, raw=True)[2]   0.0103    0.0113  0.9074  0.3689

What we would really like is to add the cubic term in such a way that the lower order coefficients that were fit using the quadratic stay the same after refitting with a cubic fit.

To do this, take linear combinations of the columns of poly(speed, 2, raw = TRUE) such that the columns in the quadratic model are orthogonal to each other (and similarly for poly(speed, 3, raw = TRUE) in the cubic model). That is sufficient to guarantee that the lower order coefficients won't change when we add higher order terms.

##                              Estimate Std. Error t value Pr(>|t|)
## (Intercept)                   42.9800     2.1462 20.0259   0.0000
## poly(speed, 2, raw = FALSE)1 145.5523    15.1761  9.5909   0.0000
## poly(speed, 2, raw = FALSE)2  22.9958    15.1761  1.5153   0.1364
##                              Estimate Std. Error t value Pr(>|t|)
## (Intercept)                   42.9800     2.1503 19.9882   0.0000
## poly(speed, 3, raw = FALSE)1 145.5523    15.2047  9.5729   0.0000
## poly(speed, 3, raw = FALSE)2  22.9958    15.2047  1.5124   0.1373
## poly(speed, 3, raw = FALSE)3  13.7969    15.2047  0.9074   0.3689
##                                  Coef.  Std.Err.        t   P>|t|
## Intercept                      42.9800    2.1462  20.0259  0.0000
## poly(speed, 2, raw=False)[0]  145.5523   15.1761   9.5909  0.0000
## poly(speed, 2, raw=False)[1]  -22.9958   15.1761  -1.5153  0.1364
##                                  Coef.  Std.Err.        t   P>|t|
## Intercept                      42.9800    2.1503  19.9882  0.0000
## poly(speed, 3, raw=False)[0]  145.5523   15.2047   9.5729  0.0000
## poly(speed, 3, raw=False)[1]  -22.9958   15.2047  -1.5124  0.1373
## poly(speed, 3, raw=False)[2]  -13.7969   15.2047  -0.9074  0.3689

Note how the first three coefficients are now the same in the two orthogonal fits above (whereas they differ between the raw-coded quadratic and cubic fits).

Importantly, since the columns of poly(speed, 2) are just linear combinations of the columns of poly(speed, 2, raw = TRUE), the two quadratic models (orthogonal and raw) represent the same model (i.e. they give the same predictions) and differ only in their parameterization. For example, the fitted values are the same:
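A sketch of this comparison in R:

# The fitted values of the raw and orthogonal quadratic fits agree
# (up to floating-point precision).
mod_raw  <- lm(dist ~ speed + I(speed^2), data = cars)
mod_orth <- lm(dist ~ poly(speed, 2), data = cars)
all.equal(unname(fitted(mod_raw)), unname(fitted(mod_orth)))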

## True

In Python, the equality holds if we round to, say, the first 12 digits. Comparing the raw floating-point values directly, most of them are not exactly equal due to machine precision:

##          quad  quad_orth  compare
## 0    7.722637   7.722637    False
## 1    7.722637   7.722637    False
## 2   13.761157  13.761157    False
## 3   13.761157  13.761157    False
## 4   16.173834  16.173834    False
## 5   18.786430  18.786430    False
## 6   21.598944  21.598944    False
## 7   21.598944  21.598944    False
## 8   21.598944  21.598944    False
## 9   24.611377  24.611377    False
## 10  24.611377  24.611377    False
## 11  27.823729  27.823729    False
## 12  27.823729  27.823729    False
## 13  27.823729  27.823729    False
## 14  27.823729  27.823729    False
## 15  31.235999  31.235999    False
## 16  31.235999  31.235999    False
## 17  31.235999  31.235999    False
## 18  31.235999  31.235999    False
## 19  34.848188  34.848188    False
## 20  34.848188  34.848188    False
## 21  34.848188  34.848188    False
## 22  34.848188  34.848188    False
## 23  38.660295  38.660295    False
## 24  38.660295  38.660295    False
## 25  38.660295  38.660295    False
## 26  42.672321  42.672321    False
## 27  42.672321  42.672321    False
## 28  46.884266  46.884266    False
## 29  46.884266  46.884266    False
## 30  46.884266  46.884266    False
## 31  51.296129  51.296129    False
## 32  51.296129  51.296129    False
## 33  51.296129  51.296129    False
## 34  51.296129  51.296129    False
## 35  55.907911  55.907911     True
## 36  55.907911  55.907911     True
## 37  55.907911  55.907911     True
## 38  60.719611  60.719611    False
## 39  60.719611  60.719611    False

For a discussion of these methods, see stackoverflow 1, stackoverflow 2, stackoverflow 3, stackoverflow 4 and stackoverflow 5.

4.9.7 Model Selection

When carrying out data analysis, model specification and evaluating model adequacy, it is worthwhile to remember a number of quotes:

“All models are wrong, but some are useful.”

— George E. P. Box, 1987

The general idea is that all models are approximations. As such, you should not strive to build a model which would perfectly represent reality. Instead, you should try to build a simplified (but not oversimplified) version of reality, which would be useful and provide insights into the task at hand. One way of achieving this is by examining how various economic and sociological effects would be represented in your model, e.g. What variable signs would make sense? What functional form is adequate? Which variables must be included in your model? And so on.

“Keep it sophisticatedly simple.”

— Arnold Zellner, 1999

This is also known as the KISS principle, adapted to econometric (and statistical) modelling methodology. The idea is that increasing the complexity of a model often results in only a small increase in model accuracy. Of course, an econometric model can be too simple, in which case a more complex model (e.g. one containing more explanatory variables, a log-linear model, etc.) may give better results in terms of model (out-of-sample) fit. On the other hand, a more common occurrence is that a complex econometric model performs just as well (or, more pessimistically, just as badly) as its simplified version (e.g. one with fewer explanatory variables, or one that is easier to interpret).

“One surprising result is that, for prediction purposes, knowledge of the true structure of the model generating the data is not particularly useful unless parameter values are also known.”

— Paul D. Gilbert, 1995

For example, think about how many factors influence your purchases, from the smallest, seemingly insignificant ones in the past (maybe even childhood), to more recent ones (including seen ads, discussions with other people, geographical location and so on) - how many different variables would you need in order to account for all possible factors, what should the functional form be and how many observations would you need? Would you even be able to acquire all the required data?

Consequently, since the best prediction of the dependent variable (in the mean squared error sense) is its conditional mean, we are usually modelling the average purchase: we combine most of the smaller effects into the noise, while the most impactful ones are represented by a number of variables (e.g. geographical location, average wage, unemployment rate, advertisement rate, etc.).

As mentioned before, we may sometimes be overwhelmed with the number of variables to choose from. Hence, we would like to have some way to:

  • Test whether our current specification is adequate;
  • Have an automated procedure for selecting the best model (in terms of some information criterion) out of a number of possible specifications (a minimal sketch of such a procedure is shown below).
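As an illustration of the second point, a minimal sketch (an assumption, not a procedure from the text) of automated selection by an information criterion, using backward stepwise selection via R's step() function:

# Backward stepwise selection by AIC on a hypothetical candidate model.
full_mod <- lm(mpg ~ disp + hp + drat + wt + qsec, data = mtcars)
step(full_mod, direction = "backward", trace = 0)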

4.9.7.1 Variable Correlation Matrix Heatmap Plot

The linear correlation between two variables is known as Pearson’s correlation coefficient. While we have examined variable correlation coefficients and their scatterplots before, it may make sense to visualize the correlation matrix between many different pairs of variables as a whole, in order to quickly identify the variables which have the strongest linear relationships.

We can do this via a heatmap - a graphical representation of data, where the individual values contained in a matrix are represented as colors.

To do this, we begin by calculating the correlation matrix between all of the variables of interest, as we would before:
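A sketch of the computation for the variables used below:

# Pairwise Pearson correlations of the selected mtcars variables.
vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")
head(mtcars[, vars], 5)
cor(mtcars[, vars])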

##                    mpg disp  hp drat    wt  qsec
## Mazda RX4         21.0  160 110 3.90 2.620 16.46
## Mazda RX4 Wag     21.0  160 110 3.90 2.875 17.02
## Datsun 710        22.8  108  93 3.85 2.320 18.61
## Hornet 4 Drive    21.4  258 110 3.08 3.215 19.44
## Hornet Sportabout 18.7  360 175 3.15 3.440 17.02
##                     mpg   disp   hp  drat     wt   qsec
## Mazda RX4          21.0  160.0  110  3.90  2.620  16.46
## Mazda RX4 Wag      21.0  160.0  110  3.90  2.875  17.02
## Datsun 710         22.8  108.0   93  3.85  2.320  18.61
## Hornet 4 Drive     21.4  258.0  110  3.08  3.215  19.44
## Hornet Sportabout  18.7  360.0  175  3.15  3.440  17.02
##             mpg       disp         hp        drat         wt        qsec
## mpg   1.0000000 -0.8475514 -0.7761684  0.68117191 -0.8676594  0.41868403
## disp -0.8475514  1.0000000  0.7909486 -0.71021393  0.8879799 -0.43369788
## hp   -0.7761684  0.7909486  1.0000000 -0.44875912  0.6587479 -0.70822339
## drat  0.6811719 -0.7102139 -0.4487591  1.00000000 -0.7124406  0.09120476
## wt   -0.8676594  0.8879799  0.6587479 -0.71244065  1.0000000 -0.17471588
## qsec  0.4186840 -0.4336979 -0.7082234  0.09120476 -0.1747159  1.00000000
##            mpg      disp        hp      drat        wt      qsec
## mpg   1.000000 -0.847551 -0.776168  0.681172 -0.867659  0.418684
## disp -0.847551  1.000000  0.790949 -0.710214  0.887980 -0.433698
## hp   -0.776168  0.790949  1.000000 -0.448759  0.658748 -0.708223
## drat  0.681172 -0.710214 -0.448759  1.000000 -0.712441  0.091205
## wt   -0.867659  0.887980  0.658748 -0.712441  1.000000 -0.174716
## qsec  0.418684 -0.433698 -0.708223  0.091205 -0.174716  1.000000

Next, we move on to plotting the heatmap of this matrix:
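The original plotting code is not shown; one possible way to draw such a heatmap (a sketch assuming the ggplot2 package):

# Reshape the correlation matrix into long format and plot it with geom_tile().
library(ggplot2)
corr_mat  <- cor(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])
corr_long <- data.frame(
  Var1  = rep(rownames(corr_mat), times = ncol(corr_mat)),
  Var2  = rep(colnames(corr_mat), each  = nrow(corr_mat)),
  value = as.vector(corr_mat)
)
ggplot(corr_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", limits = c(-1, 1))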

We can quickly identify, which variables have a strong positive/negative correlation. Likewise, we can see variables, which are weakly correlated.

The correlation matrix is symmetric, so the values above the diagonal mirror those below it. To make the plot easier to read, we can therefore keep only the lower triangle:

This makes it easier to examine the unique linear relationships between variables. On the other hand, if we have a very large number of variables, even this correlation heatmap may not be a convenient way to determine which variables should be included in the model.

For more examples, see here and here.

4.9.7.2 Model Specification Tests

Model specification tests are used to determine whether the selected functional form and the included variables are appropriate.

In Python, a number of tests by type can be found here.

4.9.7.2.1 Harvey–Collier Test for Linearity

The Harvey–Collier test performs a \(t\)-test on the recursive residuals. If the true relationship is non-linear (i.e. either convex or concave), then the mean of the recursive residuals should differ significantly from zero.

In more detail, the null hypothesis is: \[ H_0:\text{The regression is correctly modeled as linear} \] The recursive residuals are: \[ \tilde{u}_j = \dfrac{Y_j - \boldsymbol{x}_j \widehat{\boldsymbol{\beta}}^{(j-1)}}{\left(1 + \boldsymbol{x}_j (\boldsymbol{X}_{j-1}^\top \boldsymbol{X}_{j-1})^{-1} \boldsymbol{x}_j^\top\right)^{1/2}},\quad j = k+1,...,N, \] where:

  • \(\boldsymbol{x}_j = \begin{bmatrix}1 & X_{1j} & ... & X_{kj}\end{bmatrix}\) is the \(j\)-th row of the regressor matrix,
  • \(\boldsymbol{X}_{j-1} = \begin{bmatrix} \boldsymbol{x}_1^\top & ... & \boldsymbol{x}_{j-1}^\top \end{bmatrix}^\top\) is the full rank matrix consisting of the first \(j-1\) sets of observations on the independent variables,
  • \(\widehat{\boldsymbol{\beta}}^{(j-1)} = \begin{bmatrix} \widehat{\beta}_{0}^{(j-1)} & \widehat{\beta}_{1}^{(j-1)} & ... & \widehat{\beta}_{k}^{(j-1)}\end{bmatrix}^\top\) are the OLS estimates based on the first \(j-1\) observations.

The statistic: \[ \psi = \left[ \dfrac{1}{N - k - 1} \sum_{j = k+1}^N (\tilde{u}_j -\overline{\tilde{u}}_j)^2 \right] ^{-1/2} (N-k)^{-1/2} \sum_{j = k+1}^N \tilde{u}_j \]

Under the null hypothesis \(\psi\) follows a \(t\)-distribution with \(N-k-1\) degrees of freedom, since under a correct linear specification the recursive residuals have the same properties as the true disturbances - hence their arithmetic mean should be zero.

The test procedure consists of arranging the observations in ascending (or descending) order according to the variable which is to be tested for functional misspecification. The least squares coefficients are calculated recursively and the \(\psi\) statistic is formed from the resulting recursive residuals (Source).

Example 4.34 If we look at the residuals versus explanatory variables plots of lm_1_false, we see that the residuals appear to be randomly scattered when plotted against x1, but they appear to have some nonlinear structure when plotted against x2. We can see this from the test results as well:
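A sketch of how the test can be run with lmtest::harvtest() (the model object is the misspecified fit from Example 4.31; the exact calls behind the output are not shown in the text):

# Harvey-Collier test, ordering the observations by each regressor in turn.
library(lmtest)
harvtest(lm_1_false, order.by = ~ x1)
harvtest(lm_1_false, order.by = ~ x2)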
## 
##  Harvey-Collier test
## 
## data:  y ~ 1 + x1 + x2
## HC = 0.62833, df = 246, p-value = 0.5304
## 
##  Harvey-Collier test
## 
## data:  y ~ 1 + x1 + x2
## HC = 53.397, df = 246, p-value < 2.2e-16
## Ttest_1sampResult(statistic=63.765318728571444, pvalue=5.4703602867489115e-155)

Note that currently, statsmodels does not support a sort_by option (but it is marked as a possible additional feature in a future update). Nevertheless, if we know what the test does, we can always carry it out by reordering the data before estimating the model.

When we order the observations by the appropriate variable, we reject the null hypothesis and conclude that our model fails to capture some kind of nonlinear relationship.

Note that our true model is log-linear, yet it appears that the nonlinearity is due to x2. Our first instinct might be to transform x2 (for example, by assuming an exponential relationship), but it would in fact be easier to interpret a log-linear model (and in this example it is the true model).

If we carry out the test on a correctly specified model:

## 
##  Harvey-Collier test
## 
## data:  log(y) ~ 1 + x1 + x2
## HC = 0.31069, df = 246, p-value = 0.7563
## 
##  Harvey-Collier test
## 
## data:  log(y) ~ 1 + x1 + x2
## HC = 0.54014, df = 246, p-value = 0.5896

then, regardless of the variable by which we order the data, we do not reject the null hypothesis and conclude that the regression is correctly modeled as linear.

Note that by transforming the dependent variable we have linearized the regression. The alternative would have been to model an exponential (i.e. nonlinear) regression model, which would be more difficult.

Example 4.35 If we look at the residuals versus explanatory variables plots of lm_2_false, we see a similar result. Carrying out the HC test gives similar results to the previous example:
## 
##  Harvey-Collier test
## 
## data:  y ~ 1 + x1 + x2
## HC = 51.487, df = 246, p-value < 2.2e-16
## 
##  Harvey-Collier test
## 
## data:  y ~ 1 + x1 + x2
## HC = 0.75443, df = 246, p-value = 0.4513

In this example, we reject the null hypothesis when ordering by x1 and have no grounds to reject it if we order by x2.

On the other hand, if we tried to estimate a log-linear model as before:

## 
##  Harvey-Collier test
## 
## data:  log(y) ~ 1 + x1 + x2
## HC = 75.594, df = 246, p-value < 2.2e-16
## 
##  Harvey-Collier test
## 
## data:  log(y) ~ 1 + x1 + x2
## HC = 0.73061, df = 246, p-value = 0.4657
## Ttest_1sampResult(statistic=-0.23741312830449907, pvalue=0.8125339086971091)

Looking at the results when the data is ordered by x2 (as well as the Python output above), we would have no grounds to reject the null hypothesis.

However, when ordering by x1, we would still end up rejecting the null hypothesis and conclude that this specification is also incorrect.

If we correctly transform the explanatory variable x1:

## 
##  Harvey-Collier test
## 
## data:  y ~ 1 + log(x1) + x2
## HC = 0.22183, df = 246, p-value = 0.8246
## Ttest_1sampResult(statistic=2.3248629426992644, pvalue=0.020893793305525547)

In the Python-generated sample we would still reject the null hypothesis, though the p-value (0.0209) is much larger than before, while in the R-generated sample we would have no grounds to reject it. Because of the nature of random value generation, with a larger sample we would most likely have no grounds to reject the null hypothesis, especially since this specification matches the true DGP.

## 
##  Harvey-Collier test
## 
## data:  y ~ 1 + log(x1) + x2
## HC = 0.27328, df = 246, p-value = 0.7849
4.9.7.2.2 Rainbow Test for Linearity

The paper describing this test (Utts 1982) is available (online).

Assume that we have data consisting of an \(N \times 1\) vector of observations \(\boldsymbol{Y}\) and the corresponding \(N \times k\) matrix \(\boldsymbol{X}\) and we fit the following model: \[ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} \] Now, assume that the true model includes \(r\) additional terms, so that it can be expressed as: \[ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \boldsymbol{\theta} + \boldsymbol{\varepsilon} \] where \(\mathbf{Z}\) is an \(N \times r\) matrix with the corresponding \(r \times 1\) parameter vector \(\boldsymbol{\theta}\).

We wish to test the following null hypothesis: \[ \begin{cases} H_0&: \boldsymbol{\theta} = \boldsymbol{0} \text{ (i.e. that the model fit is adequate)}\\ H_1&: \boldsymbol{\theta} \neq \boldsymbol{0} \end{cases} \] This test covers a wide spectrum of alternatives, since we do not need to specify \(\mathbf{Z}\), nor \(\boldsymbol{\theta}\).

Example 4.36 To motivate the upcoming test procedure, consider the following example - assume that the true model is quadratic:

\[ Y_i = \beta_0 + \beta_1 X_i^2 + \epsilon_i,\text{ where } X_i \sim\mathcal{N}(1,\ 1),\ \epsilon_i \sim \mathcal{N}(0,\ 0.5^2) \]

If we fit a linear model on the full dataset, we get:

Obviously, this is not a good fit. On the other hand, if we were to examine only a subset of the observations in the central region of the independent variable \(X\) (in this case, \(X \in \left[ \overline{X} - \widehat{\sigma}_X; \overline{X} + \widehat{\sigma}_X \right]\)) and fit a model to that subset only:

Then the (incorrect) linear fit isn’t as far from the true model as it is when we fit over the entire range.

If the true model were linear, we would expect some improvement on the subset. However, we would not expect the fit to get much better, since a truly linear model should not fit significantly better on any specific region.

Hence, if the true model is not linear, then the improvement will be larger than expected.

The idea of the Rainbow test is that even if the true relationship is non-linear, a good linear fit can be achieved on a subsample in the “middle” of the data. The null hypothesis is rejected whenever the overall fit is significantly worse than the fit for the subsample. The test statistic: \[ F = \dfrac{SSE_{full} - SSE_{middle}}{N - N_{middle}}\bigg{/}\dfrac{SSE_{middle}}{N_{middle}-k}, \] where the denominator is based on the estimate using the points in the middle and the numerator is based on the difference between that and the usual variance estimate, derived from the full dataset. Under \(H_0\) the aforementioned statistic follows an \(F\) distribution with \(N - N_{middle}\) and \(N_{middle} - k\) degrees of freedom.

Remember that the data needs to be ordered by some variable, otherwise one might get unexpected results.
## 
##  Rainbow test
## 
## data:  y ~ x
## Rain = 46.665, df1 = 50, df2 = 48, p-value < 2.2e-16
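
For comparison, a minimal sketch of how the same test could be carried out in Python with statsmodels (the arrays x and y are placeholders for the quadratic-example data):

import numpy as np
import statsmodels.api as sm
import statsmodels.stats.diagnostic as smd

# Order the data by the regressor, so that the subsample is taken from the central region of x:
idx = np.argsort(x)
mdl = sm.OLS(y[idx], sm.add_constant(x[idx])).fit()
# frac = 0.5 means that the middle half of the (ordered) sample is used for the subsample fit:
fstat, pval = smd.linear_rainbow(mdl, frac=0.5)
print("Rainbow test: F = {:.4}, p-value = {:.4}".format(fstat, pval))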

We can carry out the tests in the same manner on our previous example data:

## 
##  Rainbow test
## 
## data:  y ~ 1 + x1 + x2
## Rain = 2.9501, df1 = 125, df2 = 122, p-value = 2.474e-09
## 
##  Rainbow test
## 
## data:  y ~ 1 + x1 + x2
## Rain = 507.19, df1 = 125, df2 = 122, p-value < 2.2e-16

The variable by which we choose to order the data affects the calculated p-value and may even give us misleading results.

## 
##  Rainbow test
## 
## data:  y ~ 1 + x1 + x2
## Rain = 58.187, df1 = 125, df2 = 122, p-value < 2.2e-16
## 
##  Rainbow test
## 
## data:  y ~ 1 + x1 + x2
## Rain = 0.69204, df1 = 125, df2 = 122, p-value = 0.9792

Again, the data ordering affects the calculated p-value and may give us unexpected results.

Re-examine the residuals versus explanatory variables plots to verify that ordering by x2 for data_mat_2 (and x1 for data_mat_1) leads to the rejection of the null hypothesis:

4.9.7.2.3 MacKinnon-White-Davidson PE Test For Linear vs Log-Linear Specification

The MacKinnon-White-Davidson PE Test For Linear vs Log-Linear Specification is available in R, but currently not available in the statsmodels package of Python.

The description of this test is taken from the lmtest package for R:

The PE test compares two non-nested models where one has a linear specification of type y ~ x1 + x2 and the other has a log-linear specification of type log(y) ~ z1 + z2. Typically, the regressors in the latter model are logs of the regressors in the former, i.e., z1 is log(x1) etc.

The idea of the PE test is the following: If the linear specification is correct then adding an auxiliary regressor with the difference of the log-fitted values from both models should be non-significant. Conversely, if the log-linear specification is correct then adding an auxiliary regressor with the difference of fitted values in levels should be non-significant. The PE test statistic is simply the marginal test of the auxiliary variable(s) in the augmented model(s).

To be updated!

This section will be updated sometime in the future - either when statsmodels (or a similar package) will have this test for Python, or a manual calculation will be provided for Python.

On the other hand, since linear, log-linear, log-log and other models were already covered in the previous chapters, along with their residual, coefficient sign and other tests - automating the selection between linear and log-linear specifications may not be the best choice, as it side-steps some model building steps, which may help us better understand the empirical data at hand.
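
Nevertheless, a rough manual sketch of the PE test idea quoted above can be written directly from its description (this is an illustrative approximation, not the lmtest::petest() implementation; the DataFrame dt with columns y, x1 and x2 is a placeholder):

import numpy as np
import statsmodels.formula.api as smf

# Fit the two competing specifications:
mdl_lin = smf.ols("y ~ x1 + x2", data=dt).fit()
mdl_log = smf.ols("np.log(y) ~ x1 + x2", data=dt).fit()
# Auxiliary regressor for testing the linear specification:
# the difference between the log of the linear fitted values and the log-linear fitted values:
dt["aux_lin"] = np.log(mdl_lin.fittedvalues) - mdl_log.fittedvalues
# Auxiliary regressor for testing the log-linear specification:
# the difference between the linear fitted values and the exponentiated log-linear fitted values:
dt["aux_log"] = mdl_lin.fittedvalues - np.exp(mdl_log.fittedvalues)
# The PE test is the marginal significance test of the auxiliary regressor in each augmented model:
print(smf.ols("y ~ x1 + x2 + aux_lin", data=dt).fit().pvalues["aux_lin"])
print(smf.ols("np.log(y) ~ x1 + x2 + aux_log", data=dt).fit().pvalues["aux_log"])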

4.9.7.2.4 Ramsey Regression Specification Error Test (RESET)

The Regression Specification Error Test is useful for detecting omitted variables, or an incorrect functional form.

Assume that the DGP is described as: \[ Y_i = \beta_0 + \beta_1 X_{1,i} + ... + \beta_k X_{k,i} + \epsilon_i \] In such a case, any nonlinear functions of the explanatory variables, as well as their interaction terms, should be insignificant when added to the model. If we were to include all of these variables, then our \(F\)-test for significance would lose many degrees of freedom (and thus accuracy).

As an alternative:

Let the OLS fit of the initial regression be: \[ \widehat{Y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 X_{1,i} + ... + \widehat{\beta}_k X_{k,i} \] Then, consider the expanded equation: \[ Y_i = \beta_0 + \beta_1 X_{1,i} + ... + \beta_k X_{k,i} + \delta_1 \widehat{Y}_i^2 + \delta_2 \widehat{Y}_i^3 + u_i \] The null hypothesis is that the original equation is correctly specified.

RESET is an \(F\)-test for testing the hypothesis: \[ \begin{cases} H_0: \delta_1 = \delta_2 = 0 \\ H_1: \delta_1 \neq 0 \text{ and/or } \delta_2 \neq 0 \end{cases} \] If we reject the null hypothesis, then we can conclude that the original model is inadequate and can be improved.

If we fail to reject the null hypothesis, then we can conclude that we did not detect any misspecifications.

Note that we can rewrite the squared term as: \[ \begin{aligned} \widehat{Y}_i^2 &= (\widehat{\beta}_0 + \widehat{\beta}_1 X_{1,i} + ... + \widehat{\beta}_k X_{k,i})^2 \\ &= \widehat{\beta}_0^2 + \sum_{j = 1}^k \widehat{\beta}_j^2 X_{j,i}^2 +2 \widehat{\beta}_0 \sum_{j = 1}^k \widehat{\beta}_j X_{j,i} +2 \sum_{j_1 < j_2} \widehat{\beta}_{j_1} \widehat{\beta}_{j_2} X_{j_1,i} X_{j_2,i} \end{aligned} \] and similarly with the cubed fitted value \(\widehat{Y}_i^3\). In other words, we are including different kinds of polynomials and interaction terms.

The general philosophy of the RESET is: if we can significantly improve the model by artificially including powers of the predictions of the model, then the original model must have been inadequate.

## 
##  RESET test
## 
## data:  y ~ 1 + x1 + x2
## RESET = 508.76, df1 = 1, df2 = 246, p-value < 2.2e-16
## 
##  RESET test
## 
## data:  y ~ 1 + x1 + x2
## RESET = 212.99, df1 = 2, df2 = 245, p-value < 2.2e-16

Note: by specifying a type, we can include powers of either the fitted values, the regressor variables, or princomp - the first principal component of \(\mathbf{X}\).

## <F test: F=array([[274.19122606]]), p=6.926198819148281e-42, df_denom=246, df_num=1>

In Python the RESET test in statsmodels is implemented for powers of the fitted values only.
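
For reference, output of this form can be produced along the following lines (a sketch; mdl_fit stands for the already-estimated statsmodels OLS results object of the y ~ 1 + x1 + x2 model):

from statsmodels.stats.outliers_influence import reset_ramsey

# degree = 2 augments the model with the squared fitted values (one restriction, hence df_num = 1),
# while degree = 3 would additionally include the cubed fitted values:
print(reset_ramsey(mdl_fit, degree=2))
print(reset_ramsey(mdl_fit, degree=3))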

Since \(p\)-value < 0.05, we reject the null hypothesis and conclude that the original model is inadequate.

Note that our true model is log-linear and we rejected the null hypothesis, i.e. we concluded that the initial model should be improved. However, going by the test results alone would suggest that we need to include additional polynomial explanatory variables in our model - we know that this is not the best course of action, since simply taking the logarithm of our dependent variable \(Y\) would give us a model close to the true one.

If we run RESET on the correctly specified model:

## 
##  RESET test
## 
## data:  log(y) ~ 1 + x1 + x2
## RESET = 2.1018, df1 = 1, df2 = 246, p-value = 0.1484
## 
##  RESET test
## 
## data:  log(y) ~ 1 + x1 + x2
## RESET = 1.0993, df1 = 2, df2 = 245, p-value = 0.3347
## <F test: F=array([[1.31214037]]), p=0.25312063343113816, df_denom=246, df_num=1>

If we correctly specify the model, then we have no grounds to reject the null hypothesis and conclude that we did not detect any misspecifications.

We get similar results when carrying out the test on the dataset where the true model includes log(x1) instead of x1:

## 
##  RESET test
## 
## data:  y ~ 1 + x1 + x2
## RESET = 560.44, df1 = 1, df2 = 246, p-value < 2.2e-16
## 
##  RESET test
## 
## data:  y ~ 1 + x1 + x2
## RESET = 286.17, df1 = 2, df2 = 245, p-value < 2.2e-16
## <F test: F=array([[639.99757981]]), p=2.122286455628377e-70, df_denom=246, df_num=1>

If we tried to specify a log-linear model in this case:

## 
##  RESET test
## 
## data:  log(y) ~ 1 + x1 + x2
## RESET = 400.96, df1 = 1, df2 = 246, p-value < 2.2e-16
## 
##  RESET test
## 
## data:  log(y) ~ 1 + x1 + x2
## RESET = 202.04, df1 = 2, df2 = 245, p-value < 2.2e-16
## <F test: F=array([[433.77074312]]), p=3.2138765177123256e-56, df_denom=246, df_num=1>

We would still reject the null hypothesis and conclude that the model can be further improved.

Specifying the correct functional form:

## 
##  RESET test
## 
## data:  y ~ 1 + log(x1) + x2
## RESET = 0.041635, df1 = 1, df2 = 246, p-value = 0.8385
## 
##  RESET test
## 
## data:  y ~ 1 + log(x1) + x2
## RESET = 0.082745, df1 = 2, df2 = 245, p-value = 0.9206
## <F test: F=array([[4.52924959]]), p=0.034314592461041515, df_denom=246, df_num=1>
  • On one hand, we would have no grounds to reject the null hypothesis in the sample generated in R (as we would expect, since this specification is the same as the true model).

  • On the other hand, as is usually the case when dealing with randomness, we would reject the null hypothesis for the sample generated in Python. However, the \(p\)-value is much closer to \(0.05\) than in the previous test results. In Python's case, we could argue that we need more data to decide whether our model should be further improved.

4.9.7.3 Variable Selection Algorithms: Stepwise Regression (or Stepwise Selection)

Assume that you begin by selecting a number of variables and including them in your model. You then look at the \(F\)-test for goodness of fit (see section 4.3) in order to determine whether at least one explanatory variable is statistically significant. If we conclude that at least one variable is significant - we are naturally interested in determining which one(-s). Generally, either all of the included variables, or only a subset of them, will be statistically significant.

Ideally, we would perform variable selection by comparing models with different combinations of predictors. If we have \(k\) explanatory variables, which we could include in our model, then there are \(2^k\) possible model combinations (since each variable can either be included in the model or not). If we have \(k = 20\) explanatory variables, which is possible in most empirical datasets nowadays, then we might need to compare and evaluate \(2^{20} = 1\,048\,576\) different models. In practical applications this is not feasible. Therefore, there are a number of automated approaches to variable selection.

Careful!

The use of stepwise regression is a bit controversial for a number of reasons:

  • The coefficient significance tests (\(t\)-test, \(F\)-test) are biased, since they are carried out on the same data sample that was used to select the variables. After all, until we have included all of the significant variables, we are dealing with an omitted variable bias;

  • As noted before, more variables artificially increase the \(R^2\). A possible solution would be to avoid using \(R^2\), or opt to use the adjusted \(R^2\);

  • The method is prone to overfitting the data. A possible solution is to use out-of-sample data to perform model validation.

  • It has severe problems in the presence of collinearity. We cannot know this beforehand - we would need to calculate the variable VIFs, or group variables based on their definitions and economic theory.

  • Increasing the sample size does not help very much. See Derksen, S. and H. J. Keselman, Backward, forward and stepwise automated subset selection algorithms: frequency of obtaining authentic and noise variables, 1992.

  • Automating the variable selection procedure allows us to not think about the problem.

    • This includes variables, which may have to be included in the model (from an economic sense). For example, age may usually be a significant variable in most datasets based on consumer purchases, however it may be strongly correlated with other variables. While statistically it may make sense to include the other variables, their definitions may not be helpful in model interpretation. Consider that work experience may be strongly correlated with a customer's age. However, not including a customer's age but including their work experience when modelling the number of purchases may make little economic sense, especially if the age variable is available in the dataset.
    • As mentioned, variable collinearity is also a problem which requires manual intervention.
    • Furthermore, these variable selection techniques do not address variable signs, which may again be opposite of what we would expect.
  • When examining data samples, we are in fact analyzing a realization of random variables. Consequently, some relationships between variables may seem significant, when in fact they may only seem so because of randomness, or measurement errors. This problem can be somewhat addressed by splitting your dataset into training and test sets multiple times and running the same models each time in order to verify that your model is stable (in the sense that the coefficients and model accuracy do not depend much on which observations are selected for the training set).

“Long-standing problems often gain notoriety because solution of them is of wide interest and at the same time illusive. Automatic model building in linear regression is one such problem. My main point is that neither LARS [Least Angle Regression] nor, as near as I can tell, any other automatic method has any hope of solving this problem because automatic procedures by their very nature do not consider the context of the problem at hand.”

— Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004), “Least Angle Regression”

On the other hand, we can leverage the stepwise algorithms in order to get some initial models, which we could hopefully refine in order to make economic sense, as long as we keep in mind the previously highlighted drawbacks of such methods.

Also, see David Hendry - Automatic Methods For Empirical Model Discovery:

4.9.7.3.1 Forward Selection

Forward selection consists of the following algorithm (a minimal Python sketch is given after the list):

  1. Begin with a null model: \(Y_i = \beta_0 + \epsilon_i\). This model contains the intercept only;

  2. Fit \(k\) separate simple linear regressions: \(Y_i = \beta_0 + \beta_j X_{j, i} + \epsilon_i\), \(j = 1,...,k\). Then, for each model calculate its corresponding \(RSS\) (or \(R^2\), or \(AIC\), or \(BIC\)) and add to the null model the variable \(X_{j_{1}}\), which results in the lowest RSS (or highest \(R^2\), or smallest \(AIC / BIC\)). The updated model becomes \(Y_i = \beta_0 + \beta_1 X_{j_{1},i} + \epsilon_i\);

  3. Continue by fitting \(k-1\) linear regressions: \(Y_i = \beta_0 + \beta_1 X_{j_{1},i} + \beta_j X_{j,i} + \epsilon_i\), \(j = 1, ..., j_{1} - 1,\ j_{1} + 1,...,k\) and update the previous model with another variable \(X_{j_2}\), which results in the lowest RSS (or highest \(R^2\), or smallest \(AIC / BIC\)). The updated model becomes \(Y_i = \beta_0 + \beta_1 X_{j_{1},i} + \beta_2 X_{j_{2},i} + \epsilon_i\);

  4. Continue this approach until some stopping rule is satisfied. For example:

    • Only add those variables, which are statistically significant, in addition to improving the model.
    • Perform model cross-validation on a training set and check whether the prediction accuracy increased (by comparing the model out-of-sample MAPE);
    • Check whether the AIC or BIC decreased, or whether the adjusted \(R^2\) increased.
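
A minimal Python sketch of this forward selection idea, written in the same style as the selection functions implemented further below (the function itself is illustrative and not taken from any package):

import statsmodels.api as sm

def forward_selection(y, X):
    # Start with the null (intercept-only) model:
    model = sm.OLS(y, sm.add_constant(X).iloc[:, 0]).fit()
    while True:
        included_vars = model.params.index.to_list()
        # Candidate variables, which are not yet included in the model:
        to_include = list(set(X.columns) - set(included_vars))
        if len(to_include) == 0:
            break
        prev_aic = model.aic
        # Add the candidate variable, which results in the largest AIC reduction (if any):
        for j in to_include:
            temp_model = sm.OLS(y, sm.add_constant(X)[included_vars + [j]]).fit()
            if temp_model.aic < prev_aic:
                model = temp_model
                prev_aic = temp_model.aic
        if included_vars == model.params.index.to_list():
            # No candidate variable improved the AIC - stop the algorithm:
            break
    # Return the selected variable list (excluding the constant) and the estimated model object:
    return {"variables": model.params.index.to_list()[1:], "model": model}
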
4.9.7.3.2 Backward Selection (or Backward Elimination)

Backward selection consists of the following algorithm:

  1. Begin with all \(k\) explanatory variables included in the model: \(Y_i = \beta_0 + \beta_1 X_{1,i} + ... + \beta_k X_{k,i} + \epsilon_i\);

  2. Remove the variable \(X_{j_1}\) with the largest \(p-value \geq 0.05\) - i.e. remove the variable which is the least statistically significant. The updated model is \(Y_i = \beta_0 + \beta_1 X_{1,i} + ... + \beta_{j_1-1} X_{j_1 -1,i} + \beta_{j_1+1} X_{j_1+1,i} +... + \beta_k X_{k,i} + \epsilon_i\).

  3. Continue the procedure by removing the insignificant variable with the largest \(p-value \geq 0.05\) from the updated model.

  4. Continue this approach until some stopping rule is reached. For example, stop when all the remaining variables have \(p-values < 0.05\).

Note: instead of \(0.05\), some other threshold for \(p-value\) can be selected.

Alternatively, instead of the \(p-value\), remove the variables one by one and select to remove the variable which results in the largest decrease in \(AIC\), or \(BIC\). If the \(AIC\), or \(BIC\), does not decrease - stop the algorithm.

4.9.7.3.3 Mixed Selection (or Stepwise Selection, or Sequential Replacement)

This is a combination of forward and backward selection.

  1. We start with no variables in the model, and as with forward selection, we add the variable that provides the best fit (in terms of RSS, \(R^2\), etc.). We continue to add variables one-by-one as before.

  2. If at any point the \(p-value\) for any one of the included variables in the model rises above a certain threshold, then we remove that variable from the model, similarly to backward selection.

  3. Continue performing these forward and backward selection steps until all variables in the model have sufficiently low \(p-value\) and all of the non-included variables have a large \(p-value\).

Alternatively, similarly to backward selection, remove the variables according to the largest reduction in \(AIC/BIC\).

It is important to note that:

  • Backward selection cannot be used if \(k \gt N\), i.e. if we have more explanatory variables than observations, we will not be able to evaluate the model coefficients.
  • Forward selection can be used, even if \(k \gt N\).
  • Forward selection is a greedy approach, which might initially include variables, which later become redundant (i.e. statistically insignificant). Mixed selection can remedy this by removing the insignificant variables.
  • Note that these selection methods do not consider including interaction terms. In order to account for this, these existing algorithms need to be expanded.
Example 4.37 We will use a readily available dataset in order to show the functionality of the stepwise algorithms themselves, rather than compare the results with some manually created models. The examined dataset is the standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland at about 1888.
##              Fertility Agriculture Examination Education Catholic Infant.Mortality
## Courtelary        80.2        17.0          15        12     9.96             22.2
## Delemont          83.1        45.1           6         9    84.84             22.2
## Franches-Mnt      92.5        39.7           5         5    93.40             20.2
## Moutier           85.8        36.5          12         7    33.77             20.3
## Neuveville        76.9        43.5          17        15     5.16             20.6
##               Fertility  Agriculture  Examination  Education  Catholic  Infant.Mortality
## Courtelary         80.2         17.0           15         12      9.96              22.2
## Delemont           83.1         45.1            6          9     84.84              22.2
## Franches-Mnt       92.5         39.7            5          5     93.40              20.2
## Moutier            85.8         36.5           12          7     33.77              20.3
## Neuveville         76.9         43.5           17         15      5.16              20.6

There are a number of functions available in R. These include stats::step(), MASS::stepAIC() and leaps::regsubsets().

Unfortunately, Python does not have built-in stepwise regression functions in statsmodels. On the other hand, there is an implementation available in the stepwise-regression package. Alternatively, we can define our own functions:

import statsmodels.api as sm

def backward_selection(y, X):
    # Start with the full model (a constant plus all columns of X):
    model = sm.OLS(y, sm.add_constant(X)).fit()
    exhausted = False
    while not exhausted:
        # Get list of variables included in the current model:
        included_vars = model.params.index.to_list()
        # Exclude the constant to get a list of variables to exclude from the current model
        # Note: we are excluding variables only, so we do not need to compare with the original design matrix X.columns
        to_exclude = list(set(included_vars) - set(["const"]))
        if len(to_exclude) == 0:
            # If there is only a constant remaining in the model - there is nothing left to exclude
            break;
        print('\nModel: {} ~ {}'.format('Y', ' + '.join(map(str, included_vars))))
        print("   {:30}:  AIC = {:.6}".format("<none>", model.aic))
        prev_aic = model.aic
        # Backward selection:
        for j in to_exclude:
            vars_to_include = [x for x in included_vars if x not in set([j])]
            temp_model = sm.OLS(y, sm.add_constant(X)[vars_to_include]).fit()
            print(" - {:30}:  AIC = {:.6}".format(j, temp_model.aic))
            if temp_model.aic <= prev_aic:
                # If a better model is found - set it as the current model, which will be updated in the next loop
                model = temp_model
                prev_aic = temp_model.aic
        if included_vars == model.params.index.to_list():
            # If no more variables were removed - the current model is the optimal one
            break;                
    final_eq = '{} ~ {}'.format('Y', ' + '.join(map(str, included_vars)))
    print("\nBest model with AIC = {:6.2f}; {}".format(model.aic, final_eq))
    # Return variable list and the estimated model object
    return {"variables": included_vars[1:], "model": model}
def mixed_selection(y, X):
    model = sm.OLS(y, sm.add_constant(X).iloc[:, 0]).fit()
    exhausted = False
    while not exhausted:
        included_vars = model.params.index.to_list()
        to_include = list(set(X.columns) - set(included_vars))
        if len(to_include) == 0:
            break # No more variables
        print('\nModel: {} ~ {}'.format('Y', ' + '.join(map(str, included_vars))))
        print("   {:30}:  AIC = {:.6}".format("<none>", model.aic))
        prev_aic = model.aic
        # Forward selection:
        for j in to_include:
            temp_model = sm.OLS(y, sm.add_constant(X)[included_vars + [j]]).fit()
            print(" + {:30}:  AIC = {:.6}".format(j, temp_model.aic))
            if temp_model.aic <= prev_aic:
                model = temp_model
                prev_aic = temp_model.aic
        if included_vars == model.params.index.to_list():
            # If no variable addition improved the model - the previous iteration already included the dropping of existing terms,
            # so this is the optimal model
            break;
        # Backward selection:
        included_vars = model.params.index.to_list()
        to_exclude = list(set(included_vars) - set(["const"]))
        if len(to_exclude) != 0:
            for j in to_exclude:
                vars_to_include = [x for x in included_vars if x not in set([j])]
                temp_model = sm.OLS(y, sm.add_constant(X)[vars_to_include]).fit()
                print(" - {:30}:  AIC = {:.6}".format(j, temp_model.aic))
                if temp_model.aic <= prev_aic:
                    model = temp_model
                    prev_aic = temp_model.aic
    final_eq = '{} ~ {}'.format('Y', ' + '.join(map(str, included_vars)))
    print("\nBest model with AIC = {:6.2f}; {}".format(model.aic, final_eq))
    # Return variable list and the estimated model object
    return {"variables": included_vars[1:], "model": model}
## Start:  AIC=238.35
## Fertility ~ 1
## 
##                    Df Sum of Sq    RSS    AIC
## + Education         1    3162.7 4015.2 213.04
## + Examination       1    2994.4 4183.6 214.97
## + Catholic          1    1543.3 5634.7 228.97
## + Infant.Mortality  1    1245.5 5932.4 231.39
## + Agriculture       1     894.8 6283.1 234.09
## <none>                          7178.0 238.34
## 
## Step:  AIC=213.04
## Fertility ~ Education
## 
##                    Df Sum of Sq    RSS    AIC
## + Catholic          1    961.07 3054.2 202.18
## + Infant.Mortality  1    891.25 3124.0 203.25
## + Examination       1    465.63 3549.6 209.25
## <none>                          4015.2 213.04
## + Agriculture       1     61.97 3953.3 214.31
## 
## Step:  AIC=202.18
## Fertility ~ Education + Catholic
## 
##                    Df Sum of Sq    RSS    AIC
## + Infant.Mortality  1    631.92 2422.2 193.29
## + Agriculture       1    486.28 2567.9 196.03
## <none>                          3054.2 202.18
## + Examination       1      2.46 3051.7 204.15
## 
## Step:  AIC=193.29
## Fertility ~ Education + Catholic + Infant.Mortality
## 
##               Df Sum of Sq    RSS    AIC
## + Agriculture  1   264.176 2158.1 189.86
## <none>                     2422.2 193.29
## + Examination  1     9.486 2412.8 195.10
## 
## Step:  AIC=189.86
## Fertility ~ Education + Catholic + Infant.Mortality + Agriculture
## 
##               Df Sum of Sq    RSS    AIC
## <none>                     2158.1 189.86
## + Examination  1    53.027 2105.0 190.69
## 
## Call:
## lm(formula = Fertility ~ Education + Catholic + Infant.Mortality + 
##     Agriculture, data = swiss)
## 
## Coefficients:
##      (Intercept)         Education          Catholic  Infant.Mortality       Agriculture  
##          62.1013           -0.9803            0.1247            1.0784           -0.1546
## 
## Model: Y ~ const
##    <none>                        :  AIC = 371.725
##  + Education                     :  AIC = 346.422
##  + Examination                   :  AIC = 348.353
##  + Agriculture                   :  AIC = 367.467
##  + Catholic                      :  AIC = 362.348
##  + Infant.Mortality              :  AIC = 364.768
## 
## Model: Y ~ const + Education
##    <none>                        :  AIC = 346.422
##  + Catholic                      :  AIC = 335.564
##  + Agriculture                   :  AIC = 347.691
##  + Infant.Mortality              :  AIC = 336.626
##  + Examination                   :  AIC = 342.629
## 
## Model: Y ~ const + Education + Catholic
##    <none>                        :  AIC = 335.564
##  + Agriculture                   :  AIC = 329.413
##  + Infant.Mortality              :  AIC = 326.668
##  + Examination                   :  AIC = 337.526
## 
## Model: Y ~ const + Education + Catholic + Infant.Mortality
##    <none>                        :  AIC = 326.668
##  + Agriculture                   :  AIC = 323.241
##  + Examination                   :  AIC = 328.484
## 
## Model: Y ~ const + Education + Catholic + Infant.Mortality + Agriculture
##    <none>                        :  AIC = 323.241
##  + Examination                   :  AIC = 324.072
## 
## Best model with AIC = 323.24; Y ~ const + Education + Catholic + Infant.Mortality + Agriculture
## {'variables': ['Education', 'Catholic', 'Infant.Mortality', 'Agriculture'], 'model': <statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x000000000E36DE80>}
## Start:  AIC=190.69
## Fertility ~ Agriculture + Examination + Education + Catholic + 
##     Infant.Mortality
## 
##                    Df Sum of Sq    RSS    AIC
## - Examination       1     53.03 2158.1 189.86
## <none>                          2105.0 190.69
## - Agriculture       1    307.72 2412.8 195.10
## - Infant.Mortality  1    408.75 2513.8 197.03
## - Catholic          1    447.71 2552.8 197.75
## - Education         1   1162.56 3267.6 209.36
## 
## Step:  AIC=189.86
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
## 
##                    Df Sum of Sq    RSS    AIC
## <none>                          2158.1 189.86
## - Agriculture       1    264.18 2422.2 193.29
## - Infant.Mortality  1    409.81 2567.9 196.03
## - Catholic          1    956.57 3114.6 205.10
## - Education         1   2249.97 4408.0 221.43
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Coefficients:
##      (Intercept)       Agriculture         Education          Catholic  Infant.Mortality  
##          62.1013           -0.1546           -0.9803            0.1247            1.0784

Note that the stats::step() procedure uses the same AIC-based methodology as MASS::stepAIC(). So, they should give equivalent results.

## 
## Model: Y ~ const + Agriculture + Examination + Education + Catholic + Infant.Mortality
##    <none>                        :  AIC = 324.072
##  - Education                     :  AIC = 342.738
##  - Examination                   :  AIC = 323.241
##  - Agriculture                   :  AIC = 328.484
##  - Catholic                      :  AIC = 331.135
##  - Infant.Mortality              :  AIC = 330.412
## 
## Model: Y ~ const + Agriculture + Education + Catholic + Infant.Mortality
##    <none>                        :  AIC = 323.241
##  - Catholic                      :  AIC = 338.485
##  - Education                     :  AIC = 354.809
##  - Agriculture                   :  AIC = 326.668
##  - Infant.Mortality              :  AIC = 329.413
## 
## Best model with AIC = 323.24; Y ~ const + Agriculture + Education + Catholic + Infant.Mortality
## {'variables': ['Agriculture', 'Education', 'Catholic', 'Infant.Mortality'], 'model': <statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x000000000E317FD0>}

With direction = "both" the procedure starts with an empty model and builds it up sequentially, similarly to forward selection. However, every time a new variable is added, the already included variables are checked to see if any of them should be dropped.

## Start:  AIC=238.35
## Fertility ~ 1
## 
##                    Df Sum of Sq    RSS    AIC
## + Education         1    3162.7 4015.2 213.04
## + Examination       1    2994.4 4183.6 214.97
## + Catholic          1    1543.3 5634.7 228.97
## + Infant.Mortality  1    1245.5 5932.4 231.39
## + Agriculture       1     894.8 6283.1 234.09
## <none>                          7178.0 238.34
## 
## Step:  AIC=213.04
## Fertility ~ Education
## 
##                    Df Sum of Sq    RSS    AIC
## + Catholic          1     961.1 3054.2 202.18
## + Infant.Mortality  1     891.2 3124.0 203.25
## + Examination       1     465.6 3549.6 209.25
## <none>                          4015.2 213.04
## + Agriculture       1      62.0 3953.3 214.31
## - Education         1    3162.7 7178.0 238.34
## 
## Step:  AIC=202.18
## Fertility ~ Education + Catholic
## 
##                    Df Sum of Sq    RSS    AIC
## + Infant.Mortality  1    631.92 2422.2 193.29
## + Agriculture       1    486.28 2567.9 196.03
## <none>                          3054.2 202.18
## + Examination       1      2.46 3051.7 204.15
## - Catholic          1    961.07 4015.2 213.04
## - Education         1   2580.50 5634.7 228.97
## 
## Step:  AIC=193.29
## Fertility ~ Education + Catholic + Infant.Mortality
## 
##                    Df Sum of Sq    RSS    AIC
## + Agriculture       1    264.18 2158.1 189.86
## <none>                          2422.2 193.29
## + Examination       1      9.49 2412.8 195.10
## - Infant.Mortality  1    631.92 3054.2 202.18
## - Catholic          1    701.74 3124.0 203.25
## - Education         1   2380.38 4802.6 223.46
## 
## Step:  AIC=189.86
## Fertility ~ Education + Catholic + Infant.Mortality + Agriculture
## 
##                    Df Sum of Sq    RSS    AIC
## <none>                          2158.1 189.86
## + Examination       1     53.03 2105.0 190.69
## - Agriculture       1    264.18 2422.2 193.29
## - Infant.Mortality  1    409.81 2567.9 196.03
## - Catholic          1    956.57 3114.6 205.10
## - Education         1   2249.97 4408.0 221.43
## 
## Call:
## lm(formula = Fertility ~ Education + Catholic + Infant.Mortality + 
##     Agriculture, data = swiss)
## 
## Coefficients:
##      (Intercept)         Education          Catholic  Infant.Mortality       Agriculture  
##          62.1013           -0.9803            0.1247            1.0784           -0.1546
## 
## Model: Y ~ const
##    <none>                        :  AIC = 371.725
##  + Education                     :  AIC = 346.422
##  + Examination                   :  AIC = 348.353
##  + Agriculture                   :  AIC = 367.467
##  + Catholic                      :  AIC = 362.348
##  + Infant.Mortality              :  AIC = 364.768
##  - Education                     :  AIC = 371.725
## 
## Model: Y ~ const + Education
##    <none>                        :  AIC = 346.422
##  + Catholic                      :  AIC = 335.564
##  + Agriculture                   :  AIC = 347.691
##  + Infant.Mortality              :  AIC = 336.626
##  + Examination                   :  AIC = 342.629
##  - Catholic                      :  AIC = 346.422
##  - Education                     :  AIC = 362.348
## 
## Model: Y ~ const + Education + Catholic
##    <none>                        :  AIC = 335.564
##  + Agriculture                   :  AIC = 329.413
##  + Infant.Mortality              :  AIC = 326.668
##  + Examination                   :  AIC = 337.526
##  - Catholic                      :  AIC = 336.626
##  - Education                     :  AIC = 356.838
##  - Infant.Mortality              :  AIC = 335.564
## 
## Model: Y ~ const + Education + Catholic + Infant.Mortality
##    <none>                        :  AIC = 326.668
##  + Agriculture                   :  AIC = 323.241
##  + Examination                   :  AIC = 328.484
##  - Catholic                      :  AIC = 338.485
##  - Education                     :  AIC = 354.809
##  - Agriculture                   :  AIC = 326.668
##  - Infant.Mortality              :  AIC = 329.413
## 
## Model: Y ~ const + Education + Catholic + Infant.Mortality + Agriculture
##    <none>                        :  AIC = 323.241
##  + Examination                   :  AIC = 324.072
## 
## Best model with AIC = 323.24; Y ~ const + Education + Catholic + Infant.Mortality + Agriculture
## {'variables': ['Education', 'Catholic', 'Infant.Mortality', 'Agriculture'], 'model': <statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x000000000E36D3D0>}

If we wanted some kind of graphical representation of our variable selection - we can use the leaps::regsubsets() function in R:

## Subset selection object
## Call: regsubsets.formula(Fertility ~ ., data = swiss, nbest = 2)
## 5 Variables  (and intercept)
##                  Forced in Forced out
## Agriculture          FALSE      FALSE
## Examination          FALSE      FALSE
## Education            FALSE      FALSE
## Catholic             FALSE      FALSE
## Infant.Mortality     FALSE      FALSE
## 2 subsets of each size up to 5
## Selection Algorithm: exhaustive
##          Agriculture Examination Education Catholic Infant.Mortality
## 1  ( 1 ) " "         " "         "*"       " "      " "             
## 1  ( 2 ) " "         "*"         " "       " "      " "             
## 2  ( 1 ) " "         " "         "*"       "*"      " "             
## 2  ( 2 ) " "         " "         "*"       " "      "*"             
## 3  ( 1 ) " "         " "         "*"       "*"      "*"             
## 3  ( 2 ) "*"         " "         "*"       "*"      " "             
## 4  ( 1 ) "*"         " "         "*"       "*"      "*"             
## 4  ( 2 ) " "         "*"         "*"       "*"      "*"             
## 5  ( 1 ) "*"         "*"         "*"       "*"      "*"

With nbest = 2 we want to see the two best models for each number of included variables - from 1 to 5. We can also make the matrix easier to read using TRUE/FALSE values:

##   (Intercept) Agriculture Examination Education Catholic Infant.Mortality
## 1        TRUE       FALSE       FALSE      TRUE    FALSE            FALSE
## 1        TRUE       FALSE        TRUE     FALSE    FALSE            FALSE
## 2        TRUE       FALSE       FALSE      TRUE     TRUE            FALSE
## 2        TRUE       FALSE       FALSE      TRUE    FALSE             TRUE
## 3        TRUE       FALSE       FALSE      TRUE     TRUE             TRUE
## 3        TRUE        TRUE       FALSE      TRUE     TRUE            FALSE
## 4        TRUE        TRUE       FALSE      TRUE     TRUE             TRUE
## 4        TRUE       FALSE        TRUE      TRUE     TRUE             TRUE
## 5        TRUE        TRUE        TRUE      TRUE     TRUE             TRUE

The last matrix indicates which variables are included in each of the models:

  • The included variables are indicated by asterisks in quotes: "*"; variables not in a model have empty quotes: " "; Alternatively, they can be represented as TRUE/FALSE values.
  • The first column indicates the models, which are listed in order of variable size. In this example the models include between 1 and 5 variables;
  • The number in brackets ( ) indicates the model ranking for each variable size. In this example:
    • 1 (1) is the best model with 1 explanatory variable, while 1 (2) is the \(2^{nd}\) best model with 1 explanatory variable;
    • 4 (1) is the best model with 4 explanatory variables, while 4 (2) is the \(2^{nd}\) best model with 4 explanatory variables;
    • 5 (1) is the best model with 5 explanatory variables. Since we only have 5 explanatory variables in our dataset - this is the only model for this number of variables.

Now we can visualize these models and their GoF in terms of \(BIC\), or \(R^2_{adj}\):

Here:

  • Each row in this graph represents a model;
  • The shaded rectangles in the columns indicate the variables included in the given model (i.e. in each row);
  • The numbers on the left margin are the values of the BIC;
  • The x-axis is not quantitative but is ordered.
  • The darkness of the shading simply represents the ordering of the BIC values.

In this example, we see that:

  • The model with the lowest \(BIC\) is the one with 4 variables (not counting the intercept) included in the model (the top row of the plot). The only variable not included in the model is Examination.
  • The \(2^{nd}\) best model in terms of \(BIC\) is the one with 3 explanatory variables included (not counting the intercept).
4.9.7.3.4 Other Methods for High-Dimensional Data

Assume that our dataset has such a large number of exogenous variables \(k\) that \(k > N\). This is also known as high-dimensional data. Alternatively, data may sometimes also be referred to as high-dimensional when the number of explanatory variables \(k\) is so large that calculations become extremely difficult.

In such cases, we aim to reduce the number of explanatory variables to use in the model. This is known as dimensionality reduction. When specifying our model, we have a number of methods to choose from:

  • Feature Selection (i.e. variable selection) via Stepwise Regression;
  • Feature Selection via Penalized Regression (Ridge and Lasso regression);
  • Feature Extraction via Partial Least Squares Regression;
  • Feature Extraction via Principal Component Analysis;
  • Model averaging may also be an alternative, where we develop a set of plausible models and then average the coefficient values (if we are estimating a linear model). Instead of a simple average, we can also calculate weighted averages, where the weights are based on some criteria (e.g. AIC, or BIC).

We have already examined Stepwise Regression. For a general overview of the other options (and more) - see wikipedia.
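
As a brief illustration of the penalized regression option, a LASSO-type penalty can be imposed directly in statsmodels via fit_regularized() (a sketch; the penalty strength alpha is chosen arbitrarily and swiss is assumed to be the DataFrame used in the previous example):

import statsmodels.api as sm

y = swiss["Fertility"]
X = sm.add_constant(swiss.drop(columns="Fertility"))
# L1_wt = 1 corresponds to a pure LASSO penalty, which can shrink some coefficients exactly to zero;
# alpha controls the overall penalty strength:
lasso_fit = sm.OLS(y, X).fit_regularized(method="elastic_net", alpha=0.5, L1_wt=1.0)
print(lasso_fit.params)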

4.9.8 Overfitting

“The production of an analysis which corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.”

— OxfordDictionaries.com definition for statistics

An overfitted model has (unknowingly) extracted some of the residual variation (i.e. the noise) as if that variation were represented by the underlying model structure, rather than by the random noise. If the number of parameters is the same as, or greater than, the number of observations, then a model can perfectly predict the training data.

The reason for this is that the model fitting problem does not have a unique solution, as there may be multiple parameter combinations that fit the data equally well. To think about this in another way - the design matrix \(\boldsymbol{X}\) would have \(N\) rows (observations) and \(k > N\) columns (predictors \(X_{i,j}\), \(i = 1,...,k\), for each observation \(j = 1,...,N\)). In such cases, the maximum possible rank (i.e. the full rank) of the matrix is \(\text{rank}(\boldsymbol{X}) = \min\{ k, N\} = N\). Then, the remaining \(k-N\) predictors (columns) can be expressed as arbitrary linear combinations of the first \(N < k\) predictors.

On the other hand, an underfitted model cannot adequately capture the underlying structure of the data and suffers from omitted variable bias. Underfitting would also occur, when fitting a linear model to non-linear data. Such a model will tend to have poor predictive performance.

In machine learning, these problems are sometimes called overtraining and undertraining, respectively.

Example 4.38 Again assume that our dependent variable \(\boldsymbol{Y}\) is generated by an (unobserved) process: \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\), \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\). Furthermore, for this example assume that we have only \(N = 10\) observations.

Next, assume that we fit two different models:

  • A simple linear regression model using \(X\) as the explanatory variable;
  • A multiple linear regression model using \(X\), \(X^2\), …, \(X^8\) as explanatory variables;
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 0.5373128  0.1544445  3.479002 8.331829e-03
## x           1.9077824  0.1209840 15.768877 2.613907e-07
##           Coef.  Std.Err.          t     P>|t|
## const  0.365242  0.218549   1.671216  0.133223
## x1     1.998361  0.171200  11.672672  0.000003
## 
## C:\Users\Andrius\AppData\Local\Programs\Python\Python38\lib\site-packages\scipy\stats\stats.py:1534: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
##   warnings.warn("kurtosistest only valid for n>=20 ... continuing "
##                            Estimate Std. Error   t value   Pr(>|t|)
## (Intercept)              1.03037543 0.13865778  7.431068 0.08515843
## poly(x, 8, raw = TRUE)1  3.56103104 0.32284305 11.030223 0.05755859
## poly(x, 8, raw = TRUE)2 -1.48065588 0.84369028 -1.754976 0.32972064
## poly(x, 8, raw = TRUE)3 -3.79718884 0.61065862 -6.218186 0.10151114
## poly(x, 8, raw = TRUE)4  1.60468265 1.16252403  1.380344 0.39913027
## poly(x, 8, raw = TRUE)5  1.96181514 0.31543339  6.219428 0.10149122
## poly(x, 8, raw = TRUE)6 -0.68294137 0.51449516 -1.327401 0.41102901
## poly(x, 8, raw = TRUE)7 -0.27730194 0.04591093 -6.039998 0.10445315
## poly(x, 8, raw = TRUE)8  0.09052359 0.06873176  1.317056 0.41342532
##           Coef.  Std.Err.         t     P>|t|
## const  1.079481  0.658528  1.639233  0.348721
## x1     2.662934  1.533281  1.736756  0.332586
## x2    -6.723107  4.006945 -1.677864  0.342164
## x3    -1.687290  2.900206 -0.581783  0.664554
## x4     8.672001  5.521184  1.570678  0.360929
## x5     0.920680  1.498090  0.614569  0.649182
## x6    -3.465707  2.443496 -1.418340  0.390953
## x7    -0.134672  0.218045 -0.617632  0.647769
## x8     0.425305  0.326428  1.302905  0.416742

Let’s say that, instead of examining whether the coefficients are statistically significant or not, we immediately decide to plot the actual and fitted values of our models:

At a glance, it would seem that the polynomial model provides a much better fit for the data. However, if we were to examine the predictions for new values of \(X\), which were not used in the design matrix (i.e. values which we did not observe when fitting our model):

We see that the polynomial model performs far worse than the simple linear model.

In practical applications it may very well be the case that we will have many more observations (than in this example) and many more explanatory variables (instead of the same variable only squared, cubed, etc.). Furthermore, if we do not have any theoretical knowledge to guide our analysis (in terms of variable selection, this would usually be some kind of underlying economic theory), then we may be prone to overfitting.

In those cases it is always a good idea to set aside some data and perform an out-of-sample validation to determine how well the model predicts new data, and to make sure that it does not overfit.

For further discussion, see here and here.

“Given a data set, you can fit thousands of models at the push of a button, but how do you choose the best? With so many candidate models, overfitting is a real danger.”

— Claeskens, G.; Hjort, N.L.(2008), Model Selection and Model Averaging

To avoid overfitting, we should adhere to the “Principle of Parsimony”, a.k.a, Occam’s razor.

4.9.9 Occam’s razor

Occam’s razor states that of two competing theories, the simpler, more parsimonious one is preferred. Parsimonious models are simple models with the optimal number of predictors needed to achieve strong explanatory and predictive power. While it is generally not easy to identify what a parsimonious model should look like, the \(BIC\) is a well-known criterion for model selection that favors more parsimonious models over more complex ones. Thus it can be leveraged when comparing multiple models to determine whether an additional predictor (and an increase in model complexity) results in an improvement of the model.

“Overfitted models <…> are often free of bias in the parameter estimators, but have estimated (and actual) sampling variances that are needlessly large (the precision of the estimators is poor, relative to what could have been accomplished with a more parsimonious model). False treatment effects tend to be identified, and false variables are included with overfitted models. … A best approximating model is achieved by properly balancing the errors of underfitting and overfitting.”

— Burnham, K. P., Anderson, D. R. (2002), Model Selection and Multimodel Inference (2nd ed.)

The bias–variance tradeoff is a framework that incorporates the Occam’s razor principle in its balance between overfitting (i.e. variance minimization) and underfitting (i.e. bias minimization).

4.9.10 Bias-Variance Tradeoff

The bias–variance tradeoff is a key problem in econometrics, statistics and machine learning. Ideally, one wants to choose a model which can both accurately capture the training data and generalize well to unseen test data. Unfortunately, it is typically impossible to do both simultaneously.

This problem can be described as a set of predictive models where models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples, and vice versa. This tradeoff consists of two elements:

  • The bias error - an error which arises from incorrect assumptions in the model (see the previous sections on what causes the model parameter estimates to be biased). High bias can cause a model to miss the relevant relations between the explanatory variables and the target variable, which leads to underfitting. Thus such models may not be able to adequately capture important explanatory variable effects. Models with higher bias tend to be relatively simple.

  • The variance - an error from sensitivity to small fluctuations in the training set. High variance can cause a model to treat some of the random noise as if it were part of the variation of the endogenous variable, rather than noise. While such a model may be able to represent its training set quite well, this usually leads to overfitting to noisy or unrepresentative training data. Models with high variance are usually more complex.

4.9.10.1 Bias-Variance Decomposition

Assume that our data sample can be described by some functional form: \[ Y_i = f(X_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2) \] Our goal is to find a function \(\widehat{f}(X_i)\) such that it approximates the true function \(f(X_i)\) as well as possible. We measure how well our model performs in terms of the squared error \(\left( Y_i - \widehat{f}(X_i) \right)^2\) for \(i = 1,...,N\), and for the points outside of our sample. Since \(\epsilon_i\) is a random noise, which we cannot account for - we will always encounter some irreducible error - that is, the error, which is due to the random component \(\epsilon_i\).

Regardless of the functional form we select for \(\widehat{f}\), we can decompose its expected error at some (potentially unseen) sample value \(X\) as follows: \[ \begin{aligned} \mathbb{E} \left[ \left( Y - \widehat{f}(X) \right)^2\right] &= \mathbb{E} \left[ Y^2 + \widehat{f}^2(X) - 2 Y \widehat{f}(X) \right]\\ &= \mathbb{E}\left[ Y^2\right] + \mathbb{E}\left[ \widehat{f}^2(X)\right] - 2 \mathbb{E}\left[ Y \widehat{f}(X)\right] \\ &= \mathbb{V}{\rm ar}\left[Y \right] + (\mathbb{E}[Y])^2 + \mathbb{V}{\rm ar}\left[\widehat{f}(X) \right] + \left(\mathbb{E}\left[\widehat{f}(X)\right] \right)^2 -2 \mathbb{E}\left[ Y\right] \mathbb{E}\left[\widehat{f}(X)\right] \\ &= \mathbb{V}{\rm ar}\left[Y \right] + \mathbb{V}{\rm ar}\left[\widehat{f}(X) \right] + \left( \mathbb{E}[Y] - \mathbb{E}\left[\widehat{f}(X)\right]\right)^2 \\ &= \sigma^2 + \mathbb{V}{\rm ar}\left[\widehat{f}(X) \right] + \left( \mathbb{E}\left[\widehat{f}(X)\right] - f(X)\right)^2 \end{aligned} \]

Where we used the fact that:

  • \(\mathbb{E}[\epsilon] = 0\), \(\mathbb{V}{\rm ar}[\epsilon] = \sigma^2\);
  • \(\mathbb{E}[f(X)] = f(X)\), since we assume that \(X\) is deterministic and \(f\) is some functional transformation. Thus, \(\mathbb{E}[Y] = \mathbb{E}[f(X) + \epsilon] = f(X)\);
  • \(\epsilon\) is independent of \(\widehat{f}(X)\), hence \(\mathbb{E}\left[ Y \widehat{f}(X)\right] = \mathbb{E}[Y]\ \mathbb{E}\left[\widehat{f}(X)\right]\);
  • \(\mathbb{V}{\rm ar}[Y] = \mathbb{E}\left[ (Y - \mathbb{E}[Y])^2\right] = \mathbb{E}\left[ (f(X) + \epsilon - f(X))^2\right] = \mathbb{E}[\epsilon^2] = \sigma^2\);
  • \(\mathbb{V}{\rm ar}[Y] = \mathbb{E} [Y^2] - (\mathbb{E}[Y])^2 \iff \mathbb{E} [Y^2] = \mathbb{V}{\rm ar}[Y] + (\mathbb{E}[Y])^2\).

This gives the bias-variance decomposition: \[ \mathbb{E} \left[ \left( Y - \widehat{f}(X) \right)^2\right] = \underbrace{\left( \mathbb{E}\left[\widehat{f}(X)\right] - f(X) \right)^2}_{\text{Bias}^2\left[ \widehat{f}(X)\right]} + \underbrace{\mathbb{V}{\rm ar}\left[\widehat{f}(X) \right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\substack{\text{Irreducible}\\ \text{Error}}} \] where:

  • the square of the bias of the model, which can be thought of as the error caused by simplifying assumptions, e.g. when approximating a non-linear function, making general assumptions based on economic theory.
  • the variance of the model (how much the model estimates will be scattered around its mean);
  • the irreducible error, \(\sigma^2\), resulting from noise in the data itself.
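
This decomposition can be illustrated with a small simulation: repeatedly generate samples from a known DGP, re-estimate the model on each sample, and compare the squared bias and the variance of the fitted value at a fixed point (a rough sketch; the DGP parameters and the evaluation point are chosen arbitrarily):

import numpy as np

np.random.seed(123)
beta_0, beta_1, sigma = 1.0, 2.0, 0.5
N, R = 50, 5000        # sample size and number of simulated samples
x = np.linspace(0, 3, N)
x0 = 1.5               # fixed point at which the fitted value is evaluated
f_hat_x0 = np.empty(R)
for r in range(R):
    y = beta_0 + beta_1 * x + np.random.normal(0, sigma, size=N)
    # OLS fit of a simple linear model (np.polyfit returns the slope first, then the intercept):
    b1, b0 = np.polyfit(x, y, deg=1)
    f_hat_x0[r] = b0 + b1 * x0
true_f_x0 = beta_0 + beta_1 * x0
bias_sq = (f_hat_x0.mean() - true_f_x0) ** 2
variance = f_hat_x0.var()
# For this correctly specified model the squared bias is (close to) zero;
# fitting, e.g., an intercept-only model instead would inflate the bias term:
print("Bias^2 = {:.5f}, Variance = {:.5f}, Irreducible error = {:.5f}".format(bias_sq, variance, sigma**2))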

4.9.10.2 Variance Reduction Approaches

  • The bias–variance decomposition forms the basis for regression regularization methods, such as Lasso and ridge regression. Regularization methods introduce bias into the regression solution, which can reduce the variance considerably relative to ordinary least squares (OLS), which provides unbiased estimates.
  • Variable selection can decrease the variance by simplifying models;
  • Dimensionality reduction also simplifies models, by obtaining a set of principal variables via linear data transformations (e.g. PCA).

4.9.11 Data Dredging & Spurious correlation

Data dredging is sometimes referred to as \(p\)-hacking, which can be thought of as a conscious (or subconscious) manipulation of data in a way that produces a desired \(p\)-value, which understates the risk of a false positive (i.e. Type I errors).

This is done by performing many statistical tests on the data and only reporting those that come back with significant results.

https://xkcd.com/882/

Figure 4.6: https://xkcd.com/882/

This usually involves using a single data set and searching by brute-force for various explanatory variable combinations in the hope of finding some significant relationships. This is a serious problem because:

  • Every dataset contains some randomness, meaning that some patterns could have happened simply by chance - having only one dataset increases the risk of treating those patterns as valid for the whole population;
  • While we select a significance level for the test, there is no guarantee that for a specific dataset the test wouldn’t produce false positive results;
  • The aforementioned stepwise regression is a great example of using all of the available data to exhaustively search for any kind of significant explanatory variable combination, which could help explain the variation of some dependent variable;
  • If the dataset is not representative of the whole population, there is even a higher chance of incorrect conclusions for the whole population - the dataset might strongly support some variable correlation, however it is likely to be a spurious relationship.

Spurious correlation is a relationship between two or more variables, which are associated but not causally related. This relationship is due to either coincidence, or a presence of another, unseen factor.

https://xkcd.com/925/

Figure 4.7: https://xkcd.com/925/

This is also known as the post hoc ergo propter hoc (“after this, therefore because of this”), or simply post hoc fallacy, referring to the mistake that because two events happen in chronological order, the former event must have caused the latter event (e.g. if \(Y\) changes happened after \(X\) changes, then \(X\) changes must have caused \(Y\) changes). The fallacy lies in the conclusion, which is based solely on the order of the events, rather than taking into account other factors potentially responsible for the result, which might rule out the connection.

This is slightly different from the cum hoc ergo propter hoc in 4.9.1, where the event order does not matter.

Some additional examples of spurious correlation can be found at tylervigen.com.

Fortunately, there are a number of solutions to reduce (or eliminate a large part of) the effects of data dredging. These include:

  • Perform an out-of-sample validation (or, cross-validation) - partition the dataset into two subsets. The first subset is used for the model estimation and hypothesis testing. The second subset can be used to either compare the fitted/predicted values of the model, or even re-estimate the model and compare the coefficient values/significance.

  • Record the number of all significance tests conducted during the analysis and divide the significance level by it. This is known as the Bonferroni criterion.

    Definition 4.1 Let \(H_1, ..., H_m\) be a family of hypotheses and \(p_1,...,p_m\) be their corresponding \(p\)-values. Let \(m\) be the total number of null hypotheses and \(m_0\) be the number of true null hypotheses. Let the familywise error rate (\(FWER\)) be the probability of rejecting at least one true \(H_i\) (i.e. making a Type I error). The Bonferroni correction rejects the null hypothesis for each \(p_i \leq \dfrac{\alpha}{m}\), thereby controlling the \(FWER\) at \(\alpha\): \[ FWER = \mathbb{P} \left( \bigcup\limits_{i = 1}^{m_0} \left( p_i \leq \dfrac{\alpha}{m} \right) \right) \leq \sum_{i = 1}^{m_0} \mathbb{P} \left( p_i \leq \dfrac{\alpha}{m} \right) = m_0 \dfrac{\alpha}{m} \leq m \dfrac{\alpha}{m} = \alpha \] This control does not require any assumptions about dependence among the \(p\)-values or about how many of the null hypotheses are true.

    This is related to the multiple testing problem, which states that the more statistical inferences (parameter estimate calculation, hypothesis testing) are made, the more likely erroneous inferences are to occur.

The Bonferroni criterion is usually too conservative, hence we may be interested in examining the false discovery rate (FDR) via the Benjamini–Hochberg (BH) procedure:

Definition 4.2 Assume that the \(m\) hypotheses and their associated \(p\)-values are as defined before. Further assume that we can sort the \(p\)-values in ascending order: \(p_{(1)}, ..., p_{(m)}\). A procedure that goes from a small \(p\)-value to a large one is called a step-up procedure. Consequently, the BH step-up procedure controls the FDR at significance level \(\alpha\) in two steps:

  1. For a given \(\alpha\), find the largest \(k \in \{ 1, ..., m \}\), such that \(p_{(k)} \leq \dfrac{k}{m} \alpha\).
  2. Reject the null hypothesis \(H_{(i)}\), \(\forall i = 1,...,k\).
The BH procedure is valid when the \(m\) tests are independent.

Note: in practical terms, the first step in the BH procedure means that we need to find which of the inequalities \(p_{(1)} \leq \dfrac{1}{m} \alpha, ..., p_{(m)} \leq \dfrac{m}{m} \alpha\) are valid and select the one with the largest index, which is equivalent to the valid inequality with the largest \(p\)-value, since the \(p\)-values are sorted in ascending order.
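As an illustration, a minimal sketch of both corrections using R’s built-in p.adjust() function; the vector of \(p\)-values is hypothetical:

pvals <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205)  # hypothetical p-values
alpha <- 0.05
# Bonferroni: reject H_(i) when the adjusted p-value (m * p_i, capped at 1) is <= alpha
p.adjust(pvals, method = "bonferroni") <= alpha
# Benjamini-Hochberg: reject H_(i) when the BH-adjusted p-value is <= alpha
p.adjust(pvals, method = "BH") <= alpha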

4.9.12 Outliers

An outlier is an observation which is significantly different from other values in a random sample from a population.

If we collect all of the various problems that can arise, we can rank them in terms of severity: \[ \text{outliers} > \text{non-linearity} > \text{heteroscedasticity} > \text{non-normality} \]

4.9.12.1 Outlier Causes

Outliers can be caused by:

  • measurement errors;
  • being from a different process, compared to the rest of the data;
  • not having a representative sample (e.g. measuring a single observation from a different city, when the remaining observations are all from one city);

4.9.12.2 Outlier Consequences

https://xkcd.com/539/

Figure 4.8: https://xkcd.com/539/

Outliers can lead to misleading results in parameter estimation and hypothesis testing. This means that a single outlier can make it seem like:

  • a non-linear model may be better suited to the data sample, as opposed to a linear model;
  • the residuals are heteroskedastic, when in fact only a single residual has a larger variance than the rest;
  • the distribution is skewed (i.e. non-normal), because of a single observation/residual which is significantly different from the rest.

4.9.12.3 Outlier Detection

The broad definition of outliers means that the decision whether an observation should be considered an outlier is left to the econometrician/statistician/data scientist.

Nevertheless, there are a number of different methods, which can be used to identify abnormal observations.

Specifically, for regression models, outliers are also detected by comparing the true and fitted values. Assume that our true model is the linear regression: \[\begin{equation} \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} \end{equation}\] Then, assume that we estimate \(\widehat{\boldsymbol{\beta}}\) via OLS. Consequently, we can write the fitted values as: \[ \widehat{\mathbf{Y}} = \mathbf{X}\widehat{\boldsymbol{\beta}} = \mathbf{X} \left(\mathbf{X}^\top \mathbf{X}\right)^{-1}\mathbf{X}^\top \mathbf{Y} = \mathbf{H} \mathbf{Y} \] where \(\mathbf{H} = \mathbf{X} \left(\mathbf{X}^\top \mathbf{X}\right)^{-1}\mathbf{X}^\top\) is called the hat matrix (or the projection matrix), which is the orthogonal projection that maps the vector of the response values, \(\mathbf{Y}\), to the vector of fitted/predicted values, \(\widehat{\mathbf{Y}}\). It describes the influence that each response value has on each fitted value, which is why \(\mathbf{H}\) is sometimes also referred to as the influence matrix.

To understand the projection matrix a bit better, do not treat the fitted values as something separate from the true values.

Instead assume that you have two sets of values: \(\mathbf{Y}\) and \(\widehat{\mathbf{Y}}\). Ideally, we would want \(\widehat{\mathbf{Y}} = \mathbf{Y}\). Assuming that the linear relationship, \(\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}\), holds, this will generally not be possible because of the random shocks \(\boldsymbol{\varepsilon}\). However, the closest approximation would be the conditional expectation of \(\mathbf{Y}\), given a design matrix \(\mathbf{X}\), since we know that the conditional expectation is the best predictor from the proof in 3.7.1.1.

Thus, using the OLS definition of \(\widehat{\boldsymbol{\beta}}\), the best predictor (i.e. the conditional expectation) maps the values of \(\mathbf{Y}\) to the values of \(\widehat{\mathbf{Y}}\) via the projection matrix \(\mathbf{H}\).

The projection matrix can be utilized when calculating leverage scores and Cook’s distance, which are used to identify influential observations.

4.9.12.3.1 Box plots

Using the lower quartile \(Q_1\) and the upper quartile \(Q_3\), we can define the interquartile range as \[ IQR = Q_3 - Q_1 \] This is also known as the midspread and defines the box of the boxplot. Based on the \(IQR\), Tukey defined an outlier to be any observation outside the interval: \[ \left[ Q_1 - k \times IQR,\ Q_3 + k \times IQR \right] \] where we usually take \(k = 1.5\). The lower and upper bounds are also marked in the boxplot via the whiskers. Points clearly marked outside the whiskers can be considered outliers.

Unfortunately, there is no statistical rationale for choosing \(k = 1.5\) - this was the value proposed by Tukey. Nevertheless, it is usually taken as a default value in order to create the whiskers.
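A minimal sketch of Tukey’s fences in R (assuming x is a numeric vector):

q1  <- quantile(x, 0.25)
q3  <- quantile(x, 0.75)
iqr <- q3 - q1                               # equivalently: IQR(x)
k   <- 1.5
which(x < q1 - k * iqr | x > q3 + k * iqr)   # indices of potential outliers
boxplot.stats(x)$out                         # points beyond the whiskers (computed from hinges, so they may differ slightly)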

4.9.12.3.2 Leverage Score of Observations

Leverage measures how far away an observation of a predictor variable, \(\mathbf{X}\), is from the mean of the predictor variable. For the linear regression model, the leverage score for the \(i\)-th observation is defined as the \(i\)-th diagonal element of the projection matrix \(\mathbf{H} = \mathbf{X} \left(\mathbf{X}^\top \mathbf{X}\right)^{-1}\mathbf{X}^\top\), which is equivalent to taking a partial derivative of \(\widehat{Y}_i\) with respect to \(Y_i\):

\[ h_{ii} = \dfrac{\partial \widehat{Y}_i}{\partial Y_i} = \left( \mathbf{H}\right)_{ii} \]

Defining the leverage score via the partial derivative allows us to interpret the leverage score as the observation self-influence, which describes how the actual value, \(Y_i\), influences the fitted value, \(\widehat{Y}_i\).

The leverage score \(h_{ii}\) is bounded: \[ 0 \leq h_{ii} \leq 1 \]

Proof. Noting that \(\mathbf{H}\) is symmetric and the fact that it is an idempotent matrix: \[ \mathbf{H}^2 = \mathbf{H} \mathbf{H} = \mathbf{X} \left(\mathbf{X}^\top \mathbf{X}\right)^{-1}\mathbf{X}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X}\right)^{-1}\mathbf{X}^\top = \mathbf{X} \mathbf{I} \left(\mathbf{X}^\top \mathbf{X}\right)^{-1}\mathbf{X}^\top = \mathbf{H} \] we can examine the diagonal elements of the equality \(\mathbf{H}^2 = \mathbf{H}\) to get the following bounds of \(h_{ii}\): \[ \begin{aligned} h_{ii} &= h_{ii}^2 + \sum_{j \neq i} h_{ij}^2 \geq 0 \\ h_{ii} &\geq h_{ii}^2 \Longrightarrow h_{ii}(1 - h_{ii}) \geq 0 \Longrightarrow h_{ii} \leq 1,\ \text{since } h_{ii} \geq 0 \end{aligned} \]

We can also relate the residuals to the leverage score: \[ \widehat{\boldsymbol\varepsilon} = \mathbf{Y} - \widehat{\mathbf{Y}} = \left( \mathbf{I} - \mathbf{H}\right)\mathbf{Y} \] Examining the variance-covariance matrix of the residuals we see that: \[ \mathbb{V}{\rm ar}(\widehat{\boldsymbol\varepsilon}) = \mathbb{V}{\rm ar}\left(\left( \mathbf{I} - \mathbf{H}\right)\mathbf{Y}\right) = \left( \mathbf{I} - \mathbf{H}\right) \mathbb{V}{\rm ar}(\mathbf{Y}) \left( \mathbf{I} - \mathbf{H}\right)^\top = \sigma^2 \left( \mathbf{I} - \mathbf{H}\right) \left( \mathbf{I} - \mathbf{H}\right)^\top = \sigma^2 \left( \mathbf{I} - \mathbf{H}\right), \] where we have used the fact that \(\left( \mathbf{I} - \mathbf{H}\right)\) is symmetric and idempotent and \(\mathbb{V}{\rm ar}(\mathbf{Y}) = \sigma^2 \mathbf{I}\).

Since the diagonal elements of the variance-covariance matrix are the variances of each residual, we have that \(\mathbb{V}{\rm ar}(\widehat{\epsilon}_i) = (1 - h_{ii}) \sigma^2\).

Thus, a leverage score of \(h_{ii} \approx 0\) indicates that the \(i\)-th observation has little self-influence, so that the variance of its residual is close to the true (unobserved) error variance \(\sigma^2\).

Observations with leverage score values larger than \(2(k+1)/N\) are considered to be potentially highly influential.

Assume that we estimate the model via OLS:
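The following is a minimal sketch of such a setup; the sample size, coefficients and the way the single outlier is generated are assumptions made purely for illustration:

set.seed(123)
N <- 100
x <- rnorm(N, mean = 5, sd = 2)
y <- 2 + 3 * x + rnorm(N)
y[N] <- y[N] + 30                    # turn the last observation into an outlier
dt  <- data.frame(x = x, y = y)
mdl <- lm(y ~ x, data = dt)
# leverage scores and the 2(k+1)/N rule of thumb (here k + 1 = 2 estimated parameters):
h <- hatvalues(mdl)
which(h > 2 * length(coef(mdl)) / N)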

4.9.12.3.3 Studentized Residuals

The studentized residuals are related to the standardized residuals in chapter 3.8.3.5, as they are defined as before: \[ t_i = \dfrac{\widehat{\epsilon_i}}{\widehat{\sigma}\sqrt{1 - h_{ii}}} \] The main distinction comes from the calculation of \(\widehat{\sigma}\), which can be calculated in two ways:

  • Standardized residuals use the internally studentized residual variance estimate: \[ \widehat{\sigma}^2 = \dfrac{1}{N-(k+1)} \sum_{j = 1}^N \widehat{\epsilon}^2_j \]
  • If we suspect the \(i\)-th residual of being improbably large (i.e. that it cannot come from the same normal distribution as the remaining residuals) - we exclude it from the variance estimation by calculating the externally studentized residual variance estimate: \[ \widehat{\sigma}^2_{(i)} = \dfrac{1}{N-(k+1)-1} \sum_{\substack{j = 1 \\ j \neq i}}^N \widehat{\epsilon}^2_j \]

If the residuals are independent and \(\boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma^2 \mathbf{I})\), then the distribution of the studentized residuals depends on the calculation of the variance estimate:

  • If the residuals are internally studentized - they have a tau distribution: \[ t_i \sim \dfrac{\sqrt{v} \cdot t_{v-1}}{\sqrt{t^2_{v-1} + v - 1}},\ \text{where } v = N-(k+1) \]
  • If the residuals are externally studentized - they have a Student’s \(t\)-distribution (we will also refer to them as \(t_{i(i)}\)): \[ t_i = t_{i(i)} \sim t_{(N-(k+1)-1)} \]

For a more complete proof of this distribution, see (here) and (here).

Observations with studentized residual values larger than 3 in absolute value could be considered outliers.

We can plot the studentized and standardized residuals:
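A minimal sketch of such plots (assuming the fitted model object mdl from the earlier sketch):

std_res  <- rstandard(mdl)   # internally studentized (standardized) residuals
stud_res <- rstudent(mdl)    # externally studentized residuals
par(mfrow = c(1, 2))
plot(std_res,  main = "Standardized residuals");  abline(h = c(-3, 3), lty = 2)
plot(stud_res, main = "Studentized residuals");   abline(h = c(-3, 3), lty = 2)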

We can examine the same plots on the model, with the outlier observation removed from the data:

While the studentized residuals appear to have no outliers, the standardized residuals indicate that a few observations may be influential. Since we have simulated the data, we know that our data contained only one outlier. Consequently, we should not treat all observations outside the threshold as definite outliers.

We may also be interested in plotting the studentized residuals against the leverage points:
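A minimal sketch of such a plot (again assuming mdl):

plot(hatvalues(mdl), rstudent(mdl),
     xlab = "Leverage score", ylab = "Externally studentized residual")
abline(v = 2 * length(coef(mdl)) / nobs(mdl), h = c(-3, 3), lty = 2)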

This plot combines the leverage scores, which highlight influential explanatory variable observations, with the studentized residuals, which highlight outliers in the difference between the actual and fitted values of the dependent variable.

4.9.12.3.4 Influential observations

Influential observations are defined as observations, which have a large effect on the results of a regression.

4.9.12.3.4.1 DFBETAS

The \(DFBETAS_i\) vector measures how much an observation \(i\) has affected the estimate of the regression coefficient vector. It measures the (scaled) difference between the regression coefficients, calculated from all of the data, and the regression coefficients, calculated with observation \(i\) deleted: \[ DFBETAS_i = \dfrac{\widehat{\boldsymbol{\beta}} - \widehat{\boldsymbol{\beta}}_{(i)}}{\sqrt{\widehat{\sigma}^2_{(i)} \text{diag}\left( (\mathbf{X}^\top \mathbf{X})^{-1}\right)}} \]

We can calculate the appropriate \(DFBETAS\) for the last 10 observations as follows:
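A minimal sketch of such a manual calculation (assuming mdl and dt from the earlier sketch; the printed output below comes from the original data, so a sketch along these lines would produce similar, but not identical, numbers):

X <- model.matrix(mdl)
N <- nobs(mdl)
dfbetas_manual <- t(sapply(1:N, function(i) {
  mdl_i   <- lm(y ~ x, data = dt[-i, ])   # re-estimate the model without observation i
  sigma_i <- summary(mdl_i)$sigma         # leave-one-out residual standard error
  (coef(mdl) - coef(mdl_i)) / (sigma_i * sqrt(diag(solve(t(X) %*% X))))
}))
tail(dfbetas_manual, 10)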

##         (Intercept)             x
##  [1,]  2.799175e-04 -0.0003691735
##  [2,] -7.813149e-04  0.0014300274
##  [3,]  5.192259e-04  0.0014789711
##  [4,]  1.550342e-02 -0.0123997509
##  [5,]  1.514765e-02 -0.0184449214
##  [6,]  2.874382e-02 -0.0227895538
##  [7,]  3.074469e-02 -0.0348445590
##  [8,]  2.040379e-02 -0.0242984293
##  [9,]  6.702931e-03 -0.0042425475
## [10,] -2.923078e+01 25.3628767691

While these calculations are a bit more involved, we can use the built-in functions as well:
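For example, the built-in dfbetas() function returns the same scaled values (the output below is from the original data):

tail(dfbetas(mdl), 10)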

##       (Intercept)             x
## 91   2.799175e-04 -0.0003691735
## 92  -7.813149e-04  0.0014300274
## 93   5.192259e-04  0.0014789711
## 94   1.550342e-02 -0.0123997509
## 95   1.514765e-02 -0.0184449214
## 96   2.874382e-02 -0.0227895538
## 97   3.074469e-02 -0.0348445590
## 98   2.040379e-02 -0.0242984293
## 99   6.702931e-03 -0.0042425475
## 100 -2.923078e+01 25.3628767691

Observations with a \(DFBETAS\) value larger than \(2/\sqrt{N}\) in absolute value (a size-adjusted cutoff) should be carefully inspected; a general cutoff of \(2\) in absolute value is also sometimes recommended.

If we wanted, we could also plot these values:

If we were to remove the last observation and examine the \(DFBETAS\) plot:

We see that there are some observations which may be worth examining. In this case, we know that there are no more outliers, because we have simulated the data ourselves. This is a good example of why you should not blindly trust the above charts - influential observations are not necessarily outliers.

4.9.12.3.4.2 DFFITS

\(DFFITS\) measures how much an observation \(i\) has affected the fitted value of a regression. It is defined as a studentized difference between the fitted values from a regression, estimated on all of the data, and the fitted values from a regression, estimated on the data with observation \(i\) deleted: \[ DFFITS_i = \dfrac{\widehat{Y}_i - \widehat{Y}_{i(i)}}{\widehat{\sigma}_{(i)} \sqrt{h_{ii}}}= t_{i(i)} \sqrt{\dfrac{h_{ii}}{1-h_{ii}}} \] where \(t_{i(i)}\) is the externally studentized residual.
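A minimal sketch using both the built-in dffits() function and the leverage/studentized-residual expression above (assuming mdl from the earlier sketch; the output below is from the original data):

tail(dffits(mdl), 10)
# equivalent manual calculation via the externally studentized residuals:
tail(rstudent(mdl) * sqrt(hatvalues(mdl) / (1 - hatvalues(mdl))), 10)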

##     tmp_val         
## 91  " -0.0005235787"
## 92  "  0.0031760359"
## 93  "  0.0091761236"
## 94  "  0.0199891169"
## 95  " -0.0226748095"
## 96  "  0.0376495768"
## 97  " -0.0379725909"
## 98  " -0.0287153104"
## 99  "  0.0125545090"
## 100 "-32.6910440413"
Observations with a \(DFFITS\) value larger than \(2\sqrt{(k+1)/N}\) in absolute value are considered influential, as suggested by (Belsley, Kuh, and Welsch 2005).

We can also plot the \(DFFITS\) values with the last observation removed:

Similarly to what we have observed with \(DFBETAS\) - we should not blindly trust that each value outside the cutoff region is an outlier. Instead, we should treat them as influential observations, which need additional analysis to determine whether they are acceptable.

4.9.12.3.4.3 Cook’s distance

Cook’s D measures the aggregate impact of each observation on the group of regression coefficients, as well as the group of fitted values. It can be used to:

  • indicate influential data points (i.e. potential outliers);
  • indicate regions, where more observations would be needed;

Cook’s distance for observation \(i\) is defined as: \[ D_i = \dfrac{\sum_{j = 1}^N (\widehat{Y}_j - \widehat{Y}_{j(i)})^2}{(k+1) \widehat{\sigma}^2} = \dfrac{\widehat{\epsilon}^2_i}{(k+1)\widehat{\sigma}^2} \left[ \dfrac{h_{ii}}{(1-h_{ii})^2}\right] \] where:

  • \(\widehat{Y}_{j(i)}\) is the fitted value of \(Y_j\), obtained by excluding the \(i\)-th observation and re-estimating the same model via OLS.
  • \(\widehat{\sigma}^2 = \dfrac{\widehat{\boldsymbol{\varepsilon}}^\top \widehat{\boldsymbol{\varepsilon}}}{N-(k+1)}\) is the mean squared error of the error term.

Note: in practical terms, it may be easier to use the leverage score expression of \(D_i\) instead of re-estimating the model for each observation case.
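A minimal sketch using the built-in cooks.distance() function and the leverage-score expression above (assuming mdl; the output below is from the original data):

tail(cooks.distance(mdl), 10)
# equivalent calculation via the leverage scores and residuals:
e  <- residuals(mdl)
h  <- hatvalues(mdl)
s2 <- summary(mdl)$sigma^2     # the MSE estimate of sigma^2
p  <- length(coef(mdl))        # k + 1
tail(e^2 / (p * s2) * h / (1 - h)^2, 10)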

##     tmp_val          
## 91  "0.0000001384804"
## 92  "0.0000050955563"
## 93  "0.0000425310902"
## 94  "0.0002017917033"
## 95  "0.0002596785491"
## 96  "0.0007154000322"
## 97  "0.0007282312180"
## 98  "0.0004164378945"
## 99  "0.0000796089714"
## 100 "1.2596831219103"

Cook’s distance values, which are:

  • larger than \(4/N\) (the traditional cut-off);
  • larger than \(3 \times \dfrac{1}{N}\sum_{i = 1}^N D_i\)

could be considered highly influential.

We can plot the \(D_i\) points:

As well as plot the \(D_i\) on the data without the outlier observation:

We again see a similar result, as with \(DFBETAS\) and \(DFFITS\).

Also note that R has a lot of different plots for the default lm model output:
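For example (a minimal sketch, assuming mdl):

par(mfrow = c(2, 3))
plot(mdl, which = 1:6)   # residual, Q-Q, scale-location, Cook's distance and leverage plots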


4.9.12.4 Addressing Outliers

After determining that a specific observation is indeed an outlier, we want to address it in some way.

4.9.12.4.1 Capping the Outliers

If we find that the explanatory variables \(X_{1,i},..., X_{k,i}\) of an outlier observation \(Y_i\) are similar to those of other, non-outlier observations, we may cap the value of the outlier so that it matches the non-outlier values more closely.
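A minimal sketch of capping (winsorizing) the dependent variable at its 1st and 99th percentiles; the variable names and the percentile choice are assumptions made for illustration:

lo <- quantile(dt$y, 0.01)
hi <- quantile(dt$y, 0.99)
dt$y_capped <- pmin(pmax(dt$y, lo), hi)   # values outside [lo, hi] are pulled back to the bounds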

4.9.12.4.2 Replacing Outliers with Imputed Values

If we are certain that the outlier is due to some error in the data itself - we could try to impute the observations by treating them as missing values and substituting them with some average value of \(Y\).

The Expectation-maximization (EM) algorithm could be utilized for missing data imputation.

4.9.12.4.3 Deleting Outliers

In some cases, if we are absolutely sure that the observation is an outlier, which is either completely unlikely, or impossible to encounter again, we could drop it.

In our example, we know that the last observation was generated differently, and is thus an outlier, which we can delete.

We can compare how our model would look with the whole dataset versus with the outlier observation dropped:
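A minimal sketch of such a comparison (assuming dt from the earlier sketch, with the outlier in the last row):

mdl_full    <- lm(y ~ x, data = dt)
mdl_trimmed <- lm(y ~ x, data = dt[-nrow(dt), ])
cbind(full = coef(mdl_full), trimmed = coef(mdl_trimmed))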

4.9.12.4.4 Robust Regression

In addition to the methods mentioned before, we could also estimate a robust regression, which is less sensitive to outliers than OLS.
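One possibility is M-estimation via MASS::rlm(); a minimal sketch (assuming dt from before):

library(MASS)
mdl_robust <- rlm(y ~ x, data = dt)   # iteratively re-weighted least squares with Huber weights by default
summary(mdl_robust)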

4.9.13 Misinterpretation of the \(p\)-value

See (Murdoch, Tsai, and Adcock 2008), (Greenland et al. 2016) and the YouTube video ‘Dance of the p Values’. A review of hypothesis testing can be found in 3.6.

Question: What is a test statistic, \(t_i\)?

Answer: A test statistic is computed from the data under a specific null hypothesis and tested against pre-determined critical values for a selected significance level.

Question: What is the significance level, \(\alpha\)?

Answer: \(\alpha\) can be thought of as the probability of making a Type I error - i.e. that we will reject the null hypothesis \((100\cdot \alpha) \%\) of the time, when it is in fact true. It defines the sensitivity of the test.

The choice of \(\alpha\) is somewhat arbitrary, although in practice the values are usually selected quite small: \(\alpha \in \{0.01,\ 0.05,\ 0.1 \}\).

Question: What is the critical value, \(t_c\)?

Answer: Critical values are cut-off values that define regions, where the test statistic is unlikely to lie. The null hypothesis is rejected if the test statistic lies within this region (also known as the rejection region), which is an unlikely event to occur under the null.

In general, for every hypothesis test there is an equivalent statement about whether the hypothesized parameter value is included in a confidence interval.

Question: What is the \(p\)-value?

Answer: The standard definition of a \(p\)-value is that it is the probability, computed assuming that \(H_0\) is true, that the test statistic would take a value as extreme, or more extreme, than that actually observed.

The test statistic is constructed in order to quantify departures from \(H_0\). Hence, we reject \(H_0\) when \(p\)-value is small. However, defining the \(p\)-value as the probability should be taken with care, since the \(p\)-value is not the probability of a Type I error (the Type I error - rejecting the null hypothesis, when it is true - is given by a chosen value of significance level \(\alpha\)):

  • If you collect data many times, under the assumption that the null is true, a proportion of \(\alpha\) of those times you would reject the null (i.e. a Type I error);
  • The \(p\)-value is a random variable computed from the actual observed data that can be compared to \(\alpha\) as one way of performing the test (instead of comparing the test-statistics).

Below we will highlight that \(p\)-values are random variables and have a distribution, which can be studied.

Example 4.39 Consider a \(t\)-test for a single sample \(X_1,...,X_N\) from a \(\mathcal{N}(\mu, \sigma^2)\) distribution.

Assume that we want to test the hypothesis: \[ \begin{cases} H_0&: \mu = 0 \\ H_1&: \mu \neq 0 \end{cases} \] The test statistic is calculated using the sample mean, \(\overline{X}\), and the sample standard deviation, \(s\), as: \[ T = \dfrac{\overline{X}}{s/ \sqrt{N}} \] Under the null hypothesis \(T \sim t_{(N-1)}\).

We will carry out a Monte Carlo simulation 10000 times with \(N = 4\) (i.e. a very small sample size) and data generated from \(\mathcal{N}(\mu , 1)\). We will repeat these simulation experiments for different values of \(\mu \in \{0, 1\}\).
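A minimal sketch of such a simulation (the exact setup used to produce the histograms below may differ):

set.seed(123)
MC <- 10000
N  <- 4
pvals_mu0 <- replicate(MC, t.test(rnorm(N, mean = 0, sd = 1))$p.value)  # null is true
pvals_mu1 <- replicate(MC, t.test(rnorm(N, mean = 1, sd = 1))$p.value)  # null is false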

Next, we will examine the histograms of the \(p\)-values. Remember that for \(\mu \neq 0\) we hope to reject the null more frequently at the \(\alpha = 0.05\) significance level.
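Continuing the sketch above, the histograms and empirical rejection frequencies can be obtained as:

par(mfrow = c(1, 2))
hist(pvals_mu0, breaks = 20, main = expression(mu == 0), xlab = "p-value")
hist(pvals_mu1, breaks = 20, main = expression(mu == 1), xlab = "p-value")
mean(pvals_mu0 <= 0.05)   # should be close to alpha = 0.05
mean(pvals_mu1 <= 0.05)   # empirical power against mu = 1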

We note that when \(\mu \neq 0\) we get the expected outcome - the \(p\)-value calculated under the null hypothesis \(H_0: \mu = 0\) is more frequently observed to be less than \(0.05\).

We also notice something of interest - if the true \(\mu = 0\), then the \(p\)-value histogram is uniform. This result can be verified via the probability integral transformation. Under the null hypothesis our \(T\) statistic has the continuous distribution function \(F_T(t)\) (in our case this is the \(t\)-distribution). Then the \(p\)-value \(P = F_T(T)\) has the following probability distribution \(F_P(p)\): \[ \begin{aligned} F_P(p) = \mathbb{P}(P \leq p) &= \mathbb{P}(F_T(T) \leq p) = \mathbb{P}(T \leq F_T^{-1}(p)) = F_T(F_T^{-1}(p)) = p \end{aligned} \] which is exactly the cumulative distribution function of a \(Uniform(0,1)\) distribution. In other words, when the null hypothesis is true, the \(p\)-value has a uniform distribution for a simple null hypothesis (for composite hypothesis tests - e.g. when testing \(\mu_1 < \mu_2\) - as well as the intuition for the uniform distribution, see the discussion here).

A more intuitive explanation is related to the definition of the significance level \(\alpha\). In general, we reject the null hypothesis when \(p\)-value < \(\alpha\), but the only way this happens for any value of \(\alpha\), when the null hypothesis is true (i.e. and we make a Type I error), is when the \(p\)-value comes from a uniform distribution. Hence, the main motivation for specifying the correct distribution for the test statistic is so that we could use the probability integral transformation to transform the test statistic to a uniform \(p\)-value. If the null hypothesis is false, then the distribution of the \(p\)-value should be more weighted towards 0.


This is also a nice indication of the power of a specific (one-tail) hypothesis test, which is defined as \(power = \mathbb{P}(\text{reject } H_0 | H_1 \text{ is true})\) (usually test power in small samples is of much interest).

As mentioned in (Greenland et al. 2016), many authors agree that confidence intervals are superior to tests and \(p\)-values because they allow one to shift focus away from the null hypothesis, toward the full range of effect sizes (an example of an effect size is the magnitude of an estimated coefficient in a linear regression model) compatible with the data.

Confidence intervals are examples of interval estimates. As mentioned in 3.5 and (Greenland et al. 2016), if one calculates \(95\%\) confidence intervals repeatedly in valid applications, then \(95\%\) of those intervals would, on average, contain the true effect (e.g. parameter) size. This is a property of a long sequence of confidence intervals computed from valid models, rather than a property of any single confidence interval.

4.9.14 Representative Sample Importance: Random Sampling & Stratified Sampling

TBA

4.9.15 Linear Model Residuals are Non-Normally Distributed

TBA

References

Belsley, D. A., E. Kuh, and R. E. Welsch. 2005. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley Series in Probability and Statistics. Wiley. https://books.google.lt/books?id=GECBEUJVNe0C.

Greenland, Sander, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman. 2016. “Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations.” European Journal of Epidemiology 31 (4): 337–50. https://doi.org/10.1007/s10654-016-0149-3.

Murdoch, Duncan J, Yu-Ling Tsai, and James Adcock. 2008. “P-Values Are Random Variables.” The American Statistician 62 (3): 242–45. https://doi.org/10.1198/000313008X332421.

Utts, Jessica M. 1982. “The Rainbow Test for Lack of Fit in Regression.” Communications in Statistics - Theory and Methods 11 (24): 2801–15. https://doi.org/10.1080/03610928208828423.