4.5 Multicollinearity
4.5.1 Definition of Multicollinearity
Sometimes explanatory variables are tightly connected and it is impossible to distinguish their individual influences on the dependent variable. Many economic variables may move together in some systematic way. Such variables are said to be collinear and cause the collinearity problem.
Similarly, multicollinearity refers to a situation in which two or more explanatory variables are highly linearly related. Consequently:
A set of variables is perfectly multicollinear if there exists one or more exact linear relationships among some of the variables. This means that if, for the multiple regression model: \[ Y_i = \beta_0 + \beta_1 X_{1,i} + ... + \beta_k X_{k,i} + \epsilon_i \] the following relationship holds: \[ \gamma_0 + \gamma_{1} X_{1,i} + ... + \gamma_{k}X_{k,i} = 0, \quad \forall i = 1,...,N \] with \(\gamma_j \neq 0\) for \(m\) of the indices \(j = 1,...,k\), then these \(m\) explanatory variables exhibit perfect multicollinearity and violate assumption (MR.5).
For example, consider the following model: \[ Y_{i} = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \epsilon_i \] Now, assume that \(X_{2,i} = \gamma_0 + \gamma_1 X_{1,i}\). Then our initial expression becomes: \[ \begin{aligned} Y_{i} &= \beta_0 + \beta_1 X_{1,i} + \beta_2 (\gamma_0 + \gamma_1 X_{1,i}) + \epsilon_i\\ &= \left[ \beta_0 + \beta_2 \gamma_0 \right] + \left[ \beta_1 + \beta_2 \gamma_1 \right] \cdot X_{1,i} + \epsilon_i \\ &= \alpha_0 + \alpha_1 X_{1,i} + \epsilon_i \end{aligned} \] Thus, while we may be able to estimate \(\alpha_0\) and \(\alpha_1\), we would not be able to recover the original \(\beta_0\), \(\beta_1\), \(\beta_2\).
To verify that we cannot estimate the original coefficients, consider the following example.
\[ \begin{aligned} Y_{1,i} &= \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \epsilon_i\\ Y_{2,i} &= \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{3,i} + \epsilon_i \end{aligned} \] where \(X_{1,i}\) and \(X_{2,i}\) are independent, but \(X_{3,i} = \gamma_0 + \gamma_1 X_{1,i}\).
import numpy as np
import statsmodels.api as sm
# Fix the seed for reproducibility
np.random.seed(123)
# Sample size and true coefficient vector
N = 200
beta_vec = np.array([2, 4, -2])
# x1 and x2 are independent, while x3 is an exact linear function of x1
x1 = np.random.normal(loc = 8, scale = 2, size = N)
x2 = np.random.normal(loc = 2, scale = 1, size = N)
x3 = 2 - 2 * x1
e = np.random.normal(loc = 0, scale = 1, size = N)
# Generate the dependent variables of the two models
y1 = np.dot(sm.add_constant(np.column_stack((x1, x2))), beta_vec) + e
y2 = np.dot(sm.add_constant(np.column_stack((x1, x3))), beta_vec) + e
If we manually examine the design matrices of our models, we can verify that assumption (MR.5) does not hold for the design matrix of the second model - its rank is smaller than the number of its columns.
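The chunks that construct the design matrices and compute their ranks are not shown above; here is a minimal Python sketch of how the matrices x_mat_1 and x_mat_2 (used further below) could be built and their ranks checked - the two values printed below are the ranks of the first and second design matrix:
# Design matrices: a column of ones plus the regressors of each model
x_mat_1 = sm.add_constant(np.column_stack((x1, x2)))
x_mat_2 = sm.add_constant(np.column_stack((x1, x3)))
# x_mat_2 has 3 columns but rank 2, since x3 is a linear combination of the constant and x1
print(np.linalg.matrix_rank(x_mat_1))
print(np.linalg.matrix_rank(x_mat_2))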
## [1] 3
## [1] 2
Consequently, calculating the inverse of \(\mathbf{X}^\top \mathbf{X}\) yields an error in R (first output below), while numpy returns a result (second output):
## Error in solve.default(t(x_mat_2) %*% x_mat_2): system is computationally singular: reciprocal condition number = 1.23755e-16
## [[ 7.33007752e+11 -7.33007752e+11 -3.66503876e+11]
## [-7.33007752e+11 7.33007752e+11 3.66503876e+11]
## [-3.66503876e+11 3.66503876e+11 1.83251938e+11]]
Strangely, the Python calculation works. It turns out that numpy uses LU decomposition to calculate the inverse, which yields results even in cases when the matrix is singular (i.e. non-invertible). We can check this not only by examining the matrix rank, but also from the determinant.
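The determinant calculation itself is not shown; a minimal sketch, assuming x_mat_2 is the collinear design matrix constructed earlier:
# The determinant of X'X is not exactly zero numerically, but it is very close to it
print(np.linalg.det(np.dot(np.transpose(x_mat_2), x_mat_2)))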
## 9.669827929485215e-07
In theory, if we have calculated the inverse matrix, then the following relation should hold: \[ \left(\mathbf{X}^\top \mathbf{X} \right)^{-1} \left(\mathbf{X}^\top \mathbf{X} \right) = \mathbf{I} \] For the invertible matrix this holds true (as it should):
xtx_1 = np.dot(np.transpose(x_mat_1), x_mat_1)
xtx_1_inv = np.linalg.inv(xtx_1)
print(np.round(np.dot(xtx_1_inv, xtx_1), 10))
## [[ 1. -0. -0.]
## [-0. 1. -0.]
## [-0. 0. 1.]]
On the other hand, for the non-invertible matrix, this does not hold (because the inverse should not exist):
xtx_2 = np.dot(np.transpose(x_mat_2), x_mat_2)
xtx_2_inv = np.linalg.inv(xtx_2)
print(np.round(np.dot(xtx_2_inv, xtx_2), 10))
## [[ 1.51922246 5.13709534 -10.90241243]
## [ 0.01652794 0.32784124 1.79404007]
## [ 0.034764 0.03262621 1.21260891]]
It should be pointed out that this behaviour is extremely dependent on your package version, as discussed here. As of writing these notes, this remains a "problem" (in the sense that you need to be mindful when doing these calculations in Python) with the specific package versions provided in section 2. We note again that, under perfect collinearity, the design matrix should not be invertible, as is clearly demonstrated in R.
On the other hand, dropping the variable \(X_{3,i}\) (and thus, losing some potentially valuable information on our dependent variable) yields the following coefficients:
x_mat_2_small <- cbind(1, x1)
print(solve(t(x_mat_2_small) %*% x_mat_2_small) %*% t(x_mat_2_small) %*% y2)
## [,1]
## -1.844784
## x1 7.984537
Consequently, this is exactly what is (implicitly) done in R. If we estimate the models separately, we get:
# Combine the simulated data into data.frames
dt1 <- data.frame(y1, x1, x2)
dt2 <- data.frame(y2, x1, x3)
# Estimate the first model (no collinearity between x1 and x2)
mdl_1 <- lm(y1 ~ x1 + x2, data = dt1)
print(coef(summary(mdl_1)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.264582 0.33337750 6.792846 1.269838e-10
## x1 3.983796 0.03638855 109.479387 2.073561e-178
## x2 -2.050659 0.06892157 -29.753510 8.319799e-75
import statsmodels.formula.api as smf
import pandas as pd
# Combine the simulated data into DataFrames
dt1 = pd.DataFrame(np.column_stack([y1, x1, x2]), columns=["y1", "x1", "x2"])
dt2 = pd.DataFrame(np.column_stack([y2, x1, x3]), columns=["y2", "x1", "x3"])
# Estimate the first model (no collinearity between x1 and x2)
mdl_1 = smf.ols("y1 ~ x1 + x2", data = dt1)
print(mdl_1.fit().summary().tables[1])
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 2.0215 0.320 6.321 0.000 1.391 2.652
## x1 3.9985 0.034 116.274 0.000 3.931 4.066
## x2 -1.9689 0.078 -25.185 0.000 -2.123 -1.815
## ==============================================================================
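The chunk estimating the second model is not shown above; presumably it mirrors the first one (the R output below comes from lm(y2 ~ x1 + x3, data = dt2), as the later summary confirms). A minimal Python sketch:
# Estimate the second model, whose regressors x1 and x3 are perfectly collinear
mdl_2 = smf.ols("y2 ~ x1 + x3", data = dt2)
print(mdl_2.fit().summary().tables[1])
The R and Python outputs of this second model are shown below: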
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.844784 0.29798365 -6.19089 3.384508e-09
## x1 7.984537 0.03633232 219.76400 1.739763e-238
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 0.8601 0.294 2.926 0.004 0.280 1.440
## x1 1.4866 0.222 6.707 0.000 1.050 1.924
## x3 -1.2531 0.145 -8.652 0.000 -1.539 -0.967
## ==============================================================================
Furthermore, if we print the summary of our model with collinear variables:
##
## Call:
## lm(formula = y2 ~ x1 + x3, data = dt2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.83465 -0.59224 0.06078 0.64298 2.46366
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.84478 0.29798 -6.191 3.38e-09 ***
## x1 7.98454 0.03633 219.764 < 2e-16 ***
## x3 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9668 on 198 degrees of freedom
## Multiple R-squared: 0.9959, Adjusted R-squared: 0.9959
## F-statistic: 4.83e+04 on 1 and 198 DF, p-value: < 2.2e-16
We see that the variable \(X_3\) has NA values - in other words, it is not included in our lm() estimated regression, which is why coef() does not return a row for the x3 variable.
Python does not adjust for multicollinearity by removing collinear variables. Instead, it calculates a generalized inverse in order to estimate the OLS parameters.
Fortunately, if we look at the output of summary(), there is a warning regarding possible multicollinearity, which we can extract separately:
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The smallest eigenvalue is 1.25e-30. This might indicate that there are
## strong multicollinearity problems or that the design matrix is singular.
There is no warning about multicollinearity for the \(Y_1\) model (which is what we would expect), but there is a warning about strong multicollinearity in the model for \(Y_2\) (which we know to be true).
This is in line with what we have seen. However, if we were to simply look at our model output, we would be puzzled as to why one variable was excluded in R. In Python the model is estimated despite the multicollinearity, which would cause further confusion if we were to compare it to the model in R. In this case Python uses a formula based on the generalized inverse, which can be computed even when a matrix is non-invertible. A very good discussion on this topic can be found on the statsmodels github page and a comparison of R and Python output on stackexchange. Fortunately, Python does print a warning regarding multicollinearity, as shown above.
Always be mindful of the possible consequences of collinearity (which results in a non-invertible matrix in the OLS calculation), as well as of other possible problems like autocorrelation and heteroskedasticity (which are discussed further on).
Econometric software is not always explicit in its methodology when some of these problems arise - it is usually assumed that the user is knowledgeable enough to identify these issues from the output, or by reading the software documentation. We may sometimes be content with an approximate solution, while at other times we want to be fully aware of any and all possible problems with the data - especially if we are automating some model estimation process, since extracting only the coefficient (or any other) values may not reveal the problems detected by the software.
4.5.2 Consequences of Multicollinearity
While we may be able to estimate \(\alpha_0\) and \(\alpha_1\), we would not be able to obtain estimates of the original \(\beta_0\), \(\beta_1\), \(\beta_2\).
On one hand, perfect multicollinearity virtually never arises in practice and can be disregarded.
On the other hand, in real-world data the relationships between explanatory variables are usually only approximate.
In other words, the relationship may be approximately linear: \[ X_{3,i} \approx \gamma_0 + \gamma_1 f(X_{1,i}) + \eta_i,\quad \eta_i \sim D(\mu_\eta, \sigma^2_\eta) \] where \(\eta\) is some kind of (not necessarily Gaussian) random variable. If the variation of \(\eta\) is small, then \(\text{det}\left( \mathbf{X}^\top \mathbf{X} \right)\) is close to zero, our data are close to multicollinear, and the elements of \(\left( \mathbf{X}^\top \mathbf{X} \right)^{-1}\) are large, which results in imprecise estimators.
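As a quick standalone numerical illustration (not part of the example that follows; the variables z1 and z3 are made up for this sketch): as the noise in the approximate linear relationship shrinks, the determinant of \(\mathbf{X}^\top \mathbf{X}\) approaches zero and the elements of its inverse blow up.
import numpy as np
import statsmodels.api as sm
np.random.seed(42)
z1 = np.random.normal(loc = 8, scale = 2, size = 200)
for noise_sd in [1.0, 0.1, 0.001]:
    # z3 is 'almost' a linear function of z1; smaller noise -> closer to collinear
    z3 = 2 - 2 * z1 + np.random.normal(scale = noise_sd, size = 200)
    X = sm.add_constant(np.column_stack((z1, z3)))
    xtx = np.transpose(X) @ X
    # Print the noise level, det(X'X) and the largest element of the inverse
    print(noise_sd, np.linalg.det(xtx), np.abs(np.linalg.inv(xtx)).max())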
Consequently, multicollinearity results in multiple difficulties:
- The standard errors of the affected coefficients tend to be large. This results in a failure to reject the null hypothesis that a coefficient is equal to zero (i.e. insignificant).
- Small changes to the input data can lead to large changes in the model, even resulting in changes of sign of parameter estimates. This is easier to verify if we estimate the same model on different subsets of the data and examine its coefficients.
- \(R^2\) may be suspiciously large, which should be concerning when the model simultaneously contains insignificant variables and variables with (economically) incorrect signs.
- The usual interpretation of a unit increase in \(X_{j,i}\) effect on \(Y\), holding all else constant (ceteris paribus) does not hold if the explanatory variables are correlated.
- In some sense, the collinear variables contain, at least partially, the same information about the dependent variable. A consequence of such data redundancy is overfitting.
- Depending on the software used, an approximate inverse of \(\mathbf{X}^\top \mathbf{X}\) may be computed, although the resulting approximated inverse may be highly sensitive to slight variations in the data and so may be very inaccurate or very sample-dependent.
Modifying our previous example: \[ X_{3,i} = 2 - 2 \cdot \left(X_{1,i} \right)^{0.9} + \eta_i,\quad \eta_i \sim \mathcal{N}(-2, 1) \]
#
set.seed(123)
#
x3_new = 2 - 2 * x1^(0.9) + rnorm(mean = -2, sd = 1, n = N)
y2_new = cbind(1, x1, x3_new) %*% beta_vec + e
#
plot(1:N, x3, type = "l", col = "blue")
#
lines(1:N, x3_new, col = "red")
#
legend("bottomleft", legend = c("x3", "x3_new"), col = c("blue", "red"), lty = 1)
import matplotlib.pyplot as plt
np.random.seed(123)
#
x3_new = 2 - 2 * np.power(x1, 0.9) + np.random.normal(loc = -2, scale = 1, size = N)
y2_new = sm.add_constant(np.column_stack([x1, x3_new])).dot(beta_vec) + e
#
_ = plt.figure(num = 0, figsize = (10, 8))
_ = plt.plot(x3, label = "x3", color = "blue")
_ = plt.plot(x3_new, label = "x3_new", color = "red")
_ = plt.legend(loc = "lower left")
plt.show()
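The two variables are almost perfectly correlated. The correlation calculation itself is not shown above; a minimal Python sketch (the labelled 2x2 matrix below is the corresponding R output):
# Correlation between the exactly collinear x3 and the approximately collinear x3_new
print(np.corrcoef(x3, x3_new))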
## x3 x3_new
## x3 1.0000000 0.9996801
## x3_new 0.9996801 1.0000000
## [[1. 0.99944538]
## [0.99944538 1. ]]
yields the following model coefficient estimates:
dt3 <- data.frame(y2_new, x1, x3_new)
#
mdl_3 <- lm(y2_new ~ x1 + x3_new, data = dt3)
#
print(round(coef(summary(mdl_3)), 4))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.2029 7.9048 0.9112 0.3633
## x1 4.9036 1.4387 3.4084 0.0008
## x3_new -1.0435 1.4969 -0.6971 0.4866
## [1] 0.9926
dt3 = pd.DataFrame(np.column_stack([y2_new, x1, x3_new]),
columns = ["y2_new", "x1", "x3_new"])
mdl_3 = smf.ols("y2_new ~ x1 + x3_new", data = dt3)
#
print(np.round(mdl_3.fit().summary2().tables[1], 4))
## Coef. Std.Err. t P>|t| [0.025 0.975]
## Intercept 2.2531 5.5268 0.4077 0.6840 -8.6463 13.1525
## x1 4.0311 1.0331 3.9021 0.0001 1.9938 6.0683
## x3_new -1.9665 1.0638 -1.8486 0.0660 -4.0644 0.1313
## 0.9934
We see that the parameters are estimated close to their true values, even when \(X_1\) and \(X_3\) are related (though in this case only approximately linearly). On the other hand, the coefficient of \(X_3\) is not statistically significantly different from zero.
On the other hand, if we additionally include the square of the (approximately collinear) variable \(X_3\) and estimate the following model:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -41.4846 65.7724 -0.6307 0.5290
## x1 -8.1334 17.5430 -0.4636 0.6434
## x3_new -11.0723 13.5328 -0.8182 0.4142
## I(x3_new^2) 0.1341 0.1798 0.7457 0.4568
mdl_4 = smf.ols("y2_new ~ x1 + x3_new + np.power(x3_new, 2)", data = dt3)
print(np.round(mdl_4.fit().summary2().tables[1], 4))
## Coef. Std.Err. t P>|t| [0.025 0.975]
## Intercept 7.1324 32.1346 0.2220 0.8246 -56.2416 70.5064
## x1 5.4225 9.0857 0.5968 0.5513 -12.4958 23.3407
## x3_new -0.9443 6.7169 -0.1406 0.8883 -14.1910 12.3025
## np.power(x3_new, 2) -0.0161 0.1045 -0.1541 0.8777 -0.2221 0.1899
Suddenly, not only did all of our coefficients become insignificant - some even changed sign, while others decreased substantially in magnitude. Furthermore, from this model alone we cannot say which variables may be collinear, or even whether there is a collinearity problem at all.
What makes matters worse is that the \(R^2\) and the \(R^2_{adj}\) of this model are very large.
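The extraction code is not shown; in Python these values are available directly as attributes of the fitted model (a minimal sketch; the first two values below are the corresponding R output, the last two the Python output):
fit_4 = mdl_4.fit()
# R-squared and adjusted R-squared of the model with the collinear regressors
print(np.round(fit_4.rsquared, 4))
print(np.round(fit_4.rsquared_adj, 4))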
## [1] 0.9926
## [1] 0.9925
## 0.9934
## 0.9933
That is, this model, in which every individual coefficient is insignificant, explains \(99.3\%\) of the variation in \(Y\).
If we were to do the same on data without multicollinearity:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2646 0.3334 6.7928 0
## x1 3.9838 0.0364 109.4794 0
## x2 -2.0507 0.0689 -29.7535 0
## Coef. Std.Err. t P>|t| [0.025 0.975]
## Intercept 2.0215 0.3198 6.3212 0.0 1.3909 2.6522
## x1 3.9985 0.0344 116.2739 0.0 3.9307 4.0663
## x2 -1.9689 0.0782 -25.1852 0.0 -2.1231 -1.8147
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2197 0.3800 5.8406 0.0000
## x1 3.9838 0.0365 109.2182 0.0000
## x2 -1.9952 0.2341 -8.5237 0.0000
## I(x2^2) -0.0132 0.0534 -0.2479 0.8045
## Coef. Std.Err. t P>|t| [0.025 0.975]
## Intercept 2.0687 0.3746 5.5222 0.0000 1.3299 2.8075
## x1 3.9983 0.0345 115.9761 0.0000 3.9304 4.0663
## x2 -2.0282 0.2558 -7.9289 0.0000 -2.5326 -1.5237
## I(x2 ** 2) 0.0149 0.0612 0.2435 0.8079 -0.1057 0.1355
We see that only the additional variable is insignificant. Furthermore, adding it did not noticeably change the values of the remaining coefficients.
If the underlying model is correctly specified, multicollinearity still produces unbiased estimates - only the standard errors of the collinear explanatory variables are inflated.
4.5.3 Testing for Multicollinearity
As we have mentioned, if \(\text{det}\left( \mathbf{X}^\top \mathbf{X} \right)\) is close to zero, then our data exhibit collinearity. There are, however, a number of alternative ways to detect its presence without examining the design matrix directly:
- Large changes in the estimated regression coefficients when a predictor variable is added/removed are indicative of multicollinearity;
- If the individual coefficients of the multiple regression are insignificant, then the joint \(F\)-test of their significance should also fail to reject the null hypothesis. If instead the \(F\)-test rejects the null hypothesis, this is an indication that the low \(t\)-values are due to multicollinearity.
- A popular measure of multicollinearity is the variance inflation factor (VIF).
4.5.3.1 Variance Inflation Factor (VIF)
Consider the following multiple regression model: \[ Y_i = \beta_0 + \beta_1 X_{1,i} + ... + \beta_k X_{k,i} + \epsilon_i \] Then, the variance inflation factor of \(\widehat{\beta}_j\) is evaluated in the following steps:
1. Run the OLS auxiliary regression (i.e. a regression whose coefficients are not of direct interest) of \(X_j\) on all the other explanatory variables: \[ X_{j,i} = \alpha_0 + \alpha_1 X_{1,i} + ... + \alpha_{j-1} X_{j-1,i} + \alpha_{j+1} X_{j+1,i}+ ... + \alpha_{k} X_{k,i} + e_i \]
2. Then, the VIF of \(\widehat{\beta}_j\) is: \[ \text{VIF}(\widehat{\beta}_j) = \dfrac{1}{1 - R_j^2} \] where \(R_j^2\) is the coefficient of determination of the regression in step \((1)\).
3. Assess the magnitude of multicollinearity from the size of \(\text{VIF}(\widehat{\beta}_j)\). As a rule of thumb:
- if \(\text{VIF}(\widehat{\beta}_j) \gt 10\), then multicollinearity is high;
- if \(\text{VIF}(\widehat{\beta}_j) \lt 5\), then the variable is not collinear with the remaining explanatory variables.
Finally: the square root of the variance inflation factor indicates how much larger the standard error is, compared with what it would be if that variable were uncorrelated with the other predictor variables in the model.
As we have already seen, including the squared value of \(X_3\) changed the sign and significance of other explanatory variables (which under assumptions (MR.1) - (MR.5) should be independent of \(X_3\)).
Looking at the \(F\)-test in our model summary:
##
## Call:
## lm(formula = y2_new ~ x1 + x3_new + I(x3_new^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.78937 -0.58760 0.05885 0.68430 2.44656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -41.4846 65.7724 -0.631 0.529
## x1 -8.1334 17.5430 -0.464 0.643
## x3_new -11.0723 13.5328 -0.818 0.414
## I(x3_new^2) 0.1341 0.1798 0.746 0.457
##
## Residual standard error: 0.9693 on 196 degrees of freedom
## Multiple R-squared: 0.9926, Adjusted R-squared: 0.9925
## F-statistic: 8763 on 3 and 196 DF, p-value: < 2.2e-16
## Results: Ordinary least squares
## ====================================================================
## Model: OLS Adj. R-squared: 0.993
## Dependent Variable: y2_new AIC: 582.0119
## Date: 2020-10-13 21:40 BIC: 595.2052
## No. Observations: 200 Log-Likelihood: -287.01
## Df Model: 3 F-statistic: 9889.
## Df Residuals: 196 Prob (F-statistic): 1.34e-213
## R-squared: 0.993 Scale: 1.0538
## --------------------------------------------------------------------
## Coef. Std.Err. t P>|t| [0.025 0.975]
## --------------------------------------------------------------------
## Intercept 7.1324 32.1346 0.2220 0.8246 -56.2416 70.5064
## x1 5.4225 9.0857 0.5968 0.5513 -12.4958 23.3407
## x3_new -0.9443 6.7169 -0.1406 0.8883 -14.1910 12.3025
## np.power(x3_new, 2) -0.0161 0.1045 -0.1541 0.8777 -0.2221 0.1899
## --------------------------------------------------------------------
## Omnibus: 2.572 Durbin-Watson: 1.839
## Prob(Omnibus): 0.276 Jarque-Bera (JB): 2.633
## Skew: -0.265 Prob(JB): 0.268
## Kurtosis: 2.812 Condition No.: 84740
## ====================================================================
## * The condition number is large (8e+04). This might indicate
## strong multicollinearity or other numerical problems.
We see that the \(p\)-value of the \(F\)-statistic is less than the 0.05 significance level, meaning that we reject the null hypothesis that \(\beta_1 = \beta_2 = \beta_3 = 0\) in the model \(\widehat{Y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 X_{1,i} + \widehat{\beta}_2 X_{3,i} + \widehat{\beta}_3 X_{3,i}^2\). On the other hand, the summary statistics show that we do not reject any of the individual hypotheses \(H_0: \beta_j = 0\), \(j = 1,2,3\). So, clearly something isn't right with the specified model.
We can calculate the VIF manually.
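The manual calculation itself is not shown above; a minimal Python sketch of the auxiliary-regression approach (the [1] 1563.274 lines below are the corresponding R results, which differ from the Python values because the simulated data differ between the two languages):
# Auxiliary regression of x1 on the remaining regressor of mdl_3
aux_fit = smf.ols("x1 ~ x3_new", data = dt3).fit()
# VIF = 1 / (1 - R^2) of the auxiliary regression
print(1 / (1 - aux_fit.rsquared))
# With only two regressors the auxiliary R^2 is symmetric, so VIF(x3_new) is identical
aux_fit = smf.ols("x3_new ~ x1", data = dt3).fit()
print(1 / (1 - aux_fit.rsquared))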
## [1] 1563.274
## 901.7658385490258
## [1] 1563.274
## 901.7658385490258
We can also use the built-in functions to verify that the variables are clearly collinear:
## x1 x3_new
## 1563.274 1563.274
from statsmodels.stats.outliers_influence import variance_inflation_factor
#
vif = pd.DataFrame()
#
vif_out = np.array([])
for i in range(1, mdl_3.exog.shape[1]):
tmp_val = variance_inflation_factor(mdl_3.exog, i)
vif_out = np.append(vif_out, [tmp_val])
#
vif["VIF Factor"] = vif_out
vif["Variable"] = mdl_3.exog_names[1:]
#
print(vif)
## VIF Factor Variable
## 0 901.765839 x1
## 1 901.765839 x3_new
or, in a more compact form:
vif = pd.DataFrame()
#
vif["VIF Factor"] = [variance_inflation_factor(mdl_3.exog, i) for i in range(1, mdl_3.exog.shape[1])]
vif["Variable"] = mdl_3.exog_names[1:]
#
print(vif)
## VIF Factor Variable
## 0 901.765839 x1
## 1 901.765839 x3_new
while for the model without collinearity the VIF is around 1 for each variable:
## x1 x2
## 1.000768 1.000768
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(mdl_1.exog, i) for i in range(1, mdl_1.exog.shape[1])]
vif["Variable"] = mdl_1.exog_names[1:]
#
print(vif)
## VIF Factor Variable
## 0 1.000042 x1
## 1 1.000042 x2
Note that this does not say anything about causality - whether \(X_1\) influences \(X_3\), or whether \(X_3\) influences \(X_1\) - we do not know.
4.5.3.2 Generalized Variance Inflation Factor (GVIF)
Variance inflation factors are not fully applicable to models which include sets of related regressors, such as indicator regressors for the same categorical variable, or polynomial terms of the same variable. This is because the correlations among these variables are induced by the model structure and are therefore artificial. Usually, we are not concerned with artificial correlations of this type - we want to identify the effects of different explanatory variables.
Consequently, (Fox and Monette 1992) introduced the Generalized VIF, denoted GVIF.
Assume that our regression model \[ \underset{N \times 1}{\mathbf{Y}} = \underset{N \times (k+1)}{\mathbf{X}} \ \underset{(k+1) \times 1}{\boldsymbol{\beta} }+ \underset{N \times 1}{\boldsymbol{\varepsilon}} \] can be written as: \[ \underset{N \times 1}{\mathbf{Y}} = \beta_0 + \underset{N \times r}{\mathbf{X}_1} \ \underset{r \times 1}{\boldsymbol{\beta}_1} + \underset{N \times (k-r)}{\mathbf{X}_2} \ \underset{(k-r) \times 1}{\boldsymbol{\beta}_2}+ \underset{N \times 1}{\boldsymbol{\varepsilon}} \] where:
- \(\mathbf{X}_1\) contains the related \(r\) indicator variables (or, a specific variable and its polynomial terms).
- \(\mathbf{X}_2\) contains the remaining regressors, excluding the constant;
Then the Generalized VIF is defined as: \[ \text{GVIF}_1 = \dfrac{\det(\mathbf{R}_{11}) \det(\mathbf{R}_{22})}{\det(\mathbf{R})} \] where:
- \(\mathbf{R}_{11}\) is the correlation matrix for \(\mathbf{X}_1\);
- \(\mathbf{R}_{22}\) is the correlation matrix for \(\mathbf{X}_2\);
- \(\mathbf{R}\) is the correlation matrix for all variables in the whole design matrix \(\mathbf{X}\), excluding the constant;
The GVIF is calculated for sets of related regressors, such as a for a set of indicator regressors for some kind of categorical variable, or for polynomial variables:
- For the continuous variables GVIF is the same as the VIF values before;
- For the categorical variables, we now get one GVIF value for each separate category type (e.g. one value for all age groups, another value for all regional indicator variables and so on);
So, variables which require more than 1 coefficient and thus more than 1 degree of freedom are typically evaluated using the GVIF.
To make GVIFs comparable across dimensions, (Fox and Monette 1992) also suggested using GVIF^(1/(2*DF)), where DF (degrees of freedom) is the number of coefficients in the subset. This reduces the GVIF to a linear measure and is analogous to taking the square root of the usual VIF. In other words:
We can apply the usual VIF rule of thumb if we square the GVIF^(1/(2*Df)) value.
For example, requiring \(\left(\text{GVIF}^{1/(2\cdot\text{DF})}\right)^2 \lt 5\) is equivalent to a requirement that \(\text{VIF} \lt 5\) for the continuous (i.e. non-categorical) variables.
As an example, let us add a randomly generated categorical variable, age, to our dataset:
set.seed(123)
#
dt3$age <- sample(c("10_18", "18_26", "26_40", "other"), size = length(x1), replace = TRUE)
print(head(dt3))
## y2_new x1 x3_new age
## 1 53.25364 6.879049 -11.90550 26_40
## 2 56.09227 7.539645 -12.55117 26_40
## 3 77.66881 11.117417 -15.91695 26_40
## 4 60.79827 8.141017 -13.13152 18_26
## 5 62.19339 8.258575 -13.24420 26_40
## 6 78.47473 11.430130 -16.20238 18_26
np.random.seed(123)
#
dt3["age"] = np.random.choice(["10_18", "18_26", "26_40", "other"], size = N, replace = True)
print(dt3.head())
## y2_new x1 x3_new age
## 0 48.567172 5.828739 -10.859063 26_40
## 1 71.212106 9.994691 -14.881628 18_26
## 2 62.848188 8.565957 -13.537666 26_40
## 3 40.641276 4.987411 -10.000399 26_40
## 4 53.102155 6.842799 -11.869809 10_18
Now assume that we simply want to add it to our model with collinear variables:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.7840 7.9300 0.7294 0.4666
## x1 4.6422 1.4438 3.2154 0.0015
## x3_new -1.3184 1.5024 -0.8775 0.3813
## age18_26 0.1004 0.1943 0.5168 0.6059
## age26_40 -0.2383 0.1929 -1.2358 0.2180
## ageother -0.0594 0.2080 -0.2854 0.7757
mdl_5 = smf.ols("y2_new ~ x1 + x3_new + age", data = dt3)
print(np.round(mdl_5.fit().summary2().tables[1], 4))
## Coef. Std.Err. t P>|t| [0.025 0.975]
## Intercept 2.2623 5.5151 0.4102 0.6821 -8.6150 13.1396
## age[T.18_26] -0.2813 0.2066 -1.3612 0.1750 -0.6888 0.1263
## age[T.26_40] -0.4044 0.2032 -1.9903 0.0480 -0.8051 -0.0037
## age[T.other] -0.2008 0.2027 -0.9904 0.3232 -0.6005 0.1990
## x1 4.0034 1.0308 3.8836 0.0001 1.9703 6.0364
## x3_new -1.9999 1.0617 -1.8838 0.0611 -4.0938 0.0939
Here x1 and x3_new are continuous predictors, while age is a categorical variable represented by the indicator/dummy variables age18_26, age26_40 and ageother. In this case 10_18 is the baseline, or base group.
## [,1]
## x1 1579.3000
## x3_new 1579.7000
## age18_26 1.6115
## age26_40 1.6389
## ageother 1.5362
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(mdl_5.exog, i) for i in range(1, mdl_5.exog.shape[1])]
vif["Variable"] = mdl_5.exog_names[1:]
print(vif)
## VIF Factor Variable
## 0 1.473014 age[T.18_26]
## 1 1.504754 age[T.26_40]
## 2 1.497897 age[T.other]
## 3 903.350945 x1
## 4 903.629974 x3_new
The problem is that the VIF values are affected by the choice of the base category of the categorical variable. In order to be sure of not having a VIF value above an acceptable level, it would be necessary to redo this analysis with every level of the categorical variable as the base group - i.e. we would also need to use age18_26 as the base group and calculate the VIFs, then do the same with age26_40 and finally with ageother. This gets more complicated if we have more than one categorical variable.
As such, it would be easier to calculate the GVIFs. We begin by separating the indicator variables of the categorical age variable into a matrix \(\mathbf{X}_1\), and the remaining explanatory variables (excluding the constant) into \(\mathbf{X}_2\):
import patsy as patsy
#
age = pd.DataFrame(dt3["age"], columns= ["age"])
X1 = patsy.dmatrix("0 + C(age)", age, return_type = "dataframe")
X2 = dt3[["x1", "x3_new"]].copy()
Note that in pandas, changing a subset may change the initial DataFrame, so we create a copy of it using the .copy() method.
We can manually transform the categorical variable in Python using the patsy package. It provides the C() function, which marks data as categorical. This is especially useful if our data are not automatically treated as categorical, for example a column of integer-valued codes. It also allows us to optionally set the preferred coding scheme and level ordering. In general, if our categorical data is not numeric, patsy automatically converts the categorical variable into a matrix of indicator variables.
Here is what the matrices look like:
## age10_18 age18_26 age26_40 ageother
## 1 0 0 1 0
## 2 0 0 1 0
## 3 0 0 1 0
## 4 0 1 0 0
## 5 0 0 1 0
## 6 0 1 0 0
## C(age)[10_18] C(age)[18_26] C(age)[26_40] C(age)[other]
## 0 0.0 0.0 1.0 0.0
## 1 0.0 1.0 0.0 0.0
## 2 0.0 0.0 1.0 0.0
## 3 0.0 0.0 1.0 0.0
## 4 1.0 0.0 0.0 0.0
Note that, to avoid the dummy variable trap, we do not include all of the indicator variables in the model, so we need to drop one of them. For simplicity, we drop 10_18, since it was the base group in the previous model.
## age18_26 age26_40 ageother
## 1 0 1 0
## 2 0 1 0
## 3 0 1 0
## 4 1 0 0
## 5 0 1 0
## 6 1 0 0
## C(age)[18_26] C(age)[26_40] C(age)[other]
## 0 0.0 1.0 0.0
## 1 1.0 0.0 0.0
## 2 0.0 1.0 0.0
## 3 0.0 1.0 0.0
## 4 0.0 0.0 0.0
## [,1] [,2]
## [1,] 6.879049 -11.90550
## [2,] 7.539645 -12.55117
## [3,] 11.117417 -15.91695
## [4,] 8.141017 -13.13152
## [5,] 8.258575 -13.24420
## [6,] 11.430130 -16.20238
## x1 x3_new
## 0 5.828739 -10.859063
## 1 9.994691 -14.881628
## 2 8.565957 -13.537666
## 3 4.987411 -10.000399
## 4 6.842799 -11.869809
Now, we can calculate the GVIF for age:
tmp_gvif <- det(cor(X1)) * det(cor(X2)) / det(cor(cbind(X1, X2)))
tmp_gvif <- data.frame(GVIF = tmp_gvif)
tmp_gvif$"GVIF^(1/2Df)" <- tmp_gvif$GVIF^(1/(2 * (ncol(X1))))
print(tmp_gvif)
## GVIF GVIF^(1/2Df)
## 1 1.015084 1.002498
tmp_gvif = np.linalg.det(X1.corr()) * np.linalg.det(X2.corr()) / np.linalg.det(pd.concat([X1.reset_index(drop = True), X2], axis = 1).corr())
tmp_gvif = pd.DataFrame([tmp_gvif], columns = ["GVIF"])
tmp_gvif["GVIF^(1/2Df)"] = np.power(tmp_gvif["GVIF"], 1 / (2 * len(X1.columns)))
print(tmp_gvif)
## GVIF GVIF^(1/2Df)
## 0 1.025511 1.004207
We can also do the same for the continuous variables, say x1:
X1 <- cbind(x1)
X2 <- cbind(model.matrix(~ age + 0))[, -1]
X2 <- cbind(X2, x3_new)
#
tmp_gvif <- det(cor(X1)) * det(cor(X2)) / det(cor(cbind(X1, X2)))
tmp_gvif <- data.frame(GVIF = tmp_gvif)
tmp_gvif$"GVIF^(1/2Df)" <- tmp_gvif$GVIF^(1/(2 * (ncol(X1))))
print(tmp_gvif)
## GVIF GVIF^(1/2Df)
## 1 1579.27 39.74003
X1 = dt3[["x1"]].copy()
X2 = patsy.dmatrix("0 + C(age)", age, return_type = "dataframe").iloc[:, 1:]
X2 = pd.concat([X2.reset_index(drop = True), dt3[["x3_new"]]], axis = 1)
#
tmp_gvif = np.linalg.det(X1.corr()) * np.linalg.det(X2.corr()) / np.linalg.det(pd.concat([X1.reset_index(drop = True), X2], axis = 1).corr())
tmp_gvif = pd.DataFrame([tmp_gvif], columns = ["GVIF"])
tmp_gvif["GVIF^(1/2Df)"] = np.power(tmp_gvif["GVIF"], 1 / (2 * len(X1.columns)))
print(tmp_gvif)
## GVIF GVIF^(1/2Df)
## 0 903.350945 30.055797
Alternatively, we can do the same using the built-in functions:
## GVIF Df GVIF^(1/(2*Df))
## x1 1579.270044 1 39.740031
## x3_new 1579.727037 1 39.745780
## age 1.015084 3 1.002498
Note: we are still using the same car::vif() function - since we have included age, the GVIF is calculated automatically instead of the standard VIF.
Unfortunately, statsmodels does not have a GVIF implementation as of yet, so we can only calculate it manually.
The built-in calculation gives results identical to our manual ones.
Next, if we want to compare the values in the same sense as for the VIF, we need to square the GVIF^(1/(2*Df)) values:
## GVIF^(1/(2*Df))
## x1 1579.270044
## x3_new 1579.727037
## age 1.005003
We make a mental note to update this in the future, in case a new statsmodels package version allows calculation of GVIFs.
We can then compare the values with the usual rule of thumb values of \(5\) and \(10\), as we did for \(\text{VIF}\).
4.5.4 Methods to Address Multicollinearity
The core problem of multicollinearity is that the collinear variables do not contain enough independent information to identify the individual effect of each explanatory variable.
There are a number of ways we can try to deal with multicollinearity:
- Make sure you have not fallen into the dummy variable trap - including a dummy variable for every category and including a constant term in the regression together guarantees perfect multicollinearity.
- Do nothing - leave the model as is, despite multicollinearity. If our purpose is to predict \(Y\), the presence of multicollinearity doesn’t affect the efficiency of extrapolating the fitted model to new data as long as the predictor variables follow the same pattern of multicollinearity in the new data as in the data on which the regression model is based. As mentioned before, if our sample is not representative, then we may get coefficients with incorrect signs, or our model may be overfitted.
- Try to drop some of the collinear variables. If our purpose is to estimate the individual influence of each explanatory variable, multicollinearity prevents us from doing so. However, we lose additional information on \(Y\) from the dropped variables. Furthermore, omission of relevant variables results in biased coefficient estimates of the remaining coefficients.
- Obtain more data. This is the ideal solution. As we have seen from the OLS formulas, a larger sample size produces more precise parameter estimates (i.e. with smaller standard errors).
- Using polynomial models, or models with interaction terms may sometimes lead to multicollinearity, especially if those variables have a very limited attainable value range. A way to solve this would be to mean-center the predictor variables. On the other hand, if the variables do not have a limited range, then this transformation does not help.
- Alternatively, we may combine all (or some) of the multicollinear variables into groups and use the method of principal components (alternatively, ridge regression or partial least squares regression can also be used).
- If the variables with high VIFs are indicator variables that represent a categorical variable with three or more categories, then a high VIF may arise because the base category has only a small fraction of the overall observations. To address this, select a base category with a larger fraction of the observations.
- Generally, if we are analyzing time series data - the correlated variables may in fact be the same variable, but at different lagged times. Consequently, including the relevant lags of the explanatory variables would account for such a problem. For cross-sectional data however, this is not applicable.
The simulated data is for the following model: \[ Y_{i} = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{1,i}^2 + \beta_3 X_{2,i} + \beta_4 \left( X_{1,i} \times X_{2,i} \right) + \beta_5 X_{3,i} + \epsilon_i \]
set.seed(123)
#
N <- 1000
beta_vec <- c(5, 2, 0.002, 3, -0.05, 4)
#
x1 <- rnorm(mean = 15, sd = 5, n = N)
x2 <- sample(1:20, size = N, replace = TRUE)
x3 <- sample(seq(from = 0, to = 12, length.out = 50), size = N, replace = TRUE)
e <- rnorm(mean = 0, sd = 3, n = N)
#
x_mat <- cbind(1, x1, x1^2, x2, x1 * x2, x3)
#
y <- x_mat %*% beta_vec + e
#
dt_sq <- data.frame(y, x1, x2, x3)
np.random.seed(123)
#
N = 1000
beta_vec = np.array([5, 2, 0.002, 3, -0.05, 4])
#
x1 = np.random.normal(loc = 15, scale = 5, size = N)
x2 = np.random.choice(list(range(1, 21)), size = N, replace = True)
x3 = np.random.choice(np.linspace(start = 0, stop = 12, num = 50), size = N, replace = True)
e = np.random.normal(loc = 0, scale = 3, size = N)
#
y = np.dot(sm.add_constant(np.column_stack((x1, x1**2, x2, x1 * x2, x3))), beta_vec) + e
#
dt_sq = pd.DataFrame(np.column_stack([y, x1, x2, x3]), columns=["y", "x1", "x2", "x3"])
Our estimated model is:
mdl_sq <- lm(y ~ x1 + I(x1^2) + x2 + x1:x2 + x3, data = dt_sq)
print(round(coef(summary(mdl_sq)), 5))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.84627 0.87003 5.57024 0.00000
## x1 2.08893 0.09254 22.57284 0.00000
## I(x1^2) -0.00147 0.00267 -0.55039 0.58217
## x2 2.93555 0.05116 57.37500 0.00000
## x3 3.96010 0.02542 155.80122 0.00000
## x1:x2 -0.04594 0.00328 -13.98404 0.00000
mdl_sq = smf.ols("y ~ x1 + np.power(x1, 2) + x2 + x1:x2 + x3", data = dt_sq)
print(mdl_sq.fit().summary().tables[1])
## ===================================================================================
## coef std err t P>|t| [0.025 0.975]
## -----------------------------------------------------------------------------------
## Intercept 5.1728 0.807 6.414 0.000 3.590 6.756
## x1 1.9426 0.087 22.396 0.000 1.772 2.113
## np.power(x1, 2) 0.0055 0.003 2.046 0.041 0.000 0.011
## x2 3.0335 0.051 59.916 0.000 2.934 3.133
## x1:x2 -0.0522 0.003 -16.131 0.000 -0.059 -0.046
## x3 3.9746 0.027 147.969 0.000 3.922 4.027
## ===================================================================================
We see that the squared term may be insignificant. If we examine the VIF:
## x1 I(x1^2) x2 x3 x1:x2
## 25.641296 20.849557 9.806743 1.003584 12.685321
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(mdl_sq.exog, i) for i in range(1, mdl_sq.exog.shape[1])]
vif.index = mdl_sq.exog_names[1:]
print(vif)
## VIF Factor
## x1 21.164949
## np.power(x1, 2) 18.623614
## x2 9.541997
## x1:x2 13.193461
## x3 1.001149
We see that the VIFs of several variables are very large, despite the fact that only one coefficient appears to be insignificant and the remaining coefficient estimates are close to their true values.
One of the strategies for dealing with multicollinearity in polynomial and interaction terms is to first center the variables by subtracting their means, and only then create the polynomial and interaction terms. The sample means of the variables are shown below.
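The chunk that computes them is not shown; a minimal Python sketch (the first block of output below is the corresponding R result, the second the pandas result):
# Sample means of all columns (y, x1, x2, x3)
print(dt_sq.mean())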
## y x1 x2 x3
## 83.299219 15.080639 10.493000 5.990694
## y 83.234105
## x1 14.802179
## x2 10.711000
## x3 5.981878
## dtype: float64
One efficient way of subtracting the means is by applying a function to the data:
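Before that, note that dt_sq_centered, used below, is never created in the shown chunks; presumably it is simply a copy of the original data (in R, dt_sq_centered <- dt_sq would be the analogue), e.g.:
# dt_sq_centered is assumed to start as an exact copy of the simulated data
dt_sq_centered = dt_sq.copy()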
## y x1 x2 x3
## 1 80.27734 12.19762 1 12.0000000
## 2 111.49284 13.84911 17 9.3061224
## 3 108.27019 22.79354 18 6.6122449
## 4 58.86745 15.35254 10 0.9795918
## 5 102.96509 15.64644 9 11.2653061
## 6 100.32617 23.57532 11 6.8571429
dt_sq_centered[, c("x1", "x2")] <- apply(dt_sq_centered[, c("x1", "x2")], MARGIN = 2, function(x){x - mean(x)})
print(head(dt_sq_centered))
## y x1 x2 x3
## 1 80.27734 -2.8830176 -9.493 12.0000000
## 2 111.49284 -1.2315268 6.507 9.3061224
## 3 108.27019 7.7129022 7.507 6.6122449
## 4 58.86745 0.2719026 -0.493 0.9795918
## 5 102.96509 0.5657993 -1.493 11.2653061
## 6 100.32617 8.4946856 0.507 6.8571429
## y x1 x2 x3
## 0 58.634044 9.571847 1.0 7.591837
## 1 89.582073 19.986727 7.0 7.102041
## 2 100.017263 16.414892 20.0 3.918367
## 3 77.421422 7.468526 9.0 9.061224
## 4 119.866161 12.106999 20.0 9.795918
dt_sq_centered[["x1", "x2"]] = dt_sq_centered[["x1", "x2"]].apply(lambda x: x - x.mean())
print(dt_sq_centered.head())
## y x1 x2 x3
## 0 58.634044 -5.230332 -9.711 7.591837
## 1 89.582073 5.184548 -3.711 7.102041
## 2 100.017263 1.612713 9.289 3.918367
## 3 77.421422 -7.333653 -1.711 9.061224
## 4 119.866161 -2.695181 9.289 9.795918
Finally, if we re-estimate the model with the centered variables:
mdl_sq_centered <- lm(formula(mdl_sq), data = dt_sq_centered)
print(round(coef(summary(mdl_sq_centered)), 5))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.54801 0.18804 316.68340 0.00000
## x1 1.56258 0.01833 85.23459 0.00000
## I(x1^2) -0.00147 0.00267 -0.55039 0.58217
## x2 2.24278 0.01642 136.62322 0.00000
## x3 3.96010 0.02542 155.80122 0.00000
## x1:x2 -0.04594 0.00328 -13.98404 0.00000
mdl_sq_centered = smf.ols("y ~ x1 + np.power(x1, 2) + x2 + x1:x2 + x3", data = dt_sq_centered)
print(mdl_sq_centered.fit().summary().tables[1])
## ===================================================================================
## coef std err t P>|t| [0.025 0.975]
## -----------------------------------------------------------------------------------
## Intercept 59.3513 0.198 300.336 0.000 58.964 59.739
## x1 1.5465 0.019 81.956 0.000 1.509 1.584
## np.power(x1, 2) 0.0055 0.003 2.046 0.041 0.000 0.011
## x2 2.2612 0.016 137.904 0.000 2.229 2.293
## x1:x2 -0.0522 0.003 -16.131 0.000 -0.059 -0.046
## x3 3.9746 0.027 147.969 0.000 3.922 4.027
## ===================================================================================
Note: the estimates, standard errors, \(t\)-statistics and \(p\)-values of the squared term, the interaction term and x3 remain exactly the same. The coefficients of x1 and x2 (and the intercept) change, since after centering they measure the effect of each variable at the mean value of the other. However, if we were to check the collinearity:
## x1 I(x1^2) x2 x3 x1:x2
## 1.006275 1.009893 1.009524 1.003584 1.012698
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(mdl_sq_centered.exog, i) for i in range(1, mdl_sq_centered.exog.shape[1])]
vif.index = mdl_sq_centered.exog_names[1:]
print(vif)
## VIF Factor
## x1 1.001712
## np.power(x1, 2) 1.003848
## x2 1.000838
## x1:x2 1.003530
## x3 1.001149
We see that it has disappeared.
So, in the case of polynomial and interaction terms, this kind of (structural) multicollinearity has no adverse consequences.
On the other hand, if there are many other variables, then it may be difficult to distinguish which variables are affected by multicollinearity. Consequently, it is better to determine the collinearity between the different variables first, and only then add polynomial and interaction terms.
Furthermore, it is worth noting that when we center an independent variable, the interpretation of its coefficient remains the same. On the other hand, the intercept changes - \(\beta_0\) can now be interpreted as the average value of \(Y\) when each \(X_j\) is equal to its sample mean (when \(X_j\) equals its sample mean, the centered variable is zero, therefore the associated term \(\beta_j \cdot (X_{j} - \bar{X}_j)\) also becomes zero).
Finally:
Among a set of multiple linear regression models, defined as: \[ Y_{i} = \beta_0 + \beta_1 X_{1,i} + ... + \beta_k X_{k,i} + \epsilon_i \] the best regression models are those where the predictor variables \(X_j\):
- each highly correlate with the dependent variable \(Y\);
- minimally correlate with each other (preferably - no correlation);
Weak collinearity is not a violation of OLS assumptions. If the estimated coefficients:
- have the expected signs and magnitudes;
- are not sensitive to adding or removing a few observations;
- are not sensitive to adding or removing insignificant variables;
then there is no reason to try to identify and mitigate collinearity.
References
Fox, John, and Georges Monette. 1992. “Generalized Collinearity Diagnostics.” Journal of the American Statistical Association 87 (417): 178–83. http://www.jstor.org/stable/2290467.