Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.2840 0.0972 13.2102 0.0000
educ 0.1070 0.0053 20.1863 0.0000
exper 0.0086 0.0012 7.4405 0.0000
metro 0.1479 0.0392 3.7758 0.0002
south -0.0998 0.0443 -2.2515 0.0246
west -0.0463 0.0465 -0.9957 0.3196
midwest -0.0487 0.0463 -1.0530 0.2926
female -0.1560 0.0304 -5.1363 0.0000
black -0.0472 0.0566 -0.8350 0.4039
Hypothesis testing
This section continues on from the end of the previous section on model specification.
We want to test the null hypothesis that a specific coefficient is equal to zero (i.e. it is statistically insignificant) against the alternative that it is not equal to zero (i.e. it is statistically significant)1:
\[ \begin{cases} H_0: \beta_j &= 0\\ H_1: \beta_j &\neq 0 \end{cases} \] For our tests, we select a significance level of \(\alpha=0.05\).
Insignificant variables are those whose \(p\)-value is greater than the \(0.05\) significance level:
print(mdl_0.summary().tables[1])
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.2840 0.097 13.210 0.000 1.093 1.475
educ 0.1070 0.005 20.186 0.000 0.097 0.117
exper 0.0086 0.001 7.440 0.000 0.006 0.011
metro 0.1479 0.039 3.776 0.000 0.071 0.225
south -0.0998 0.044 -2.252 0.025 -0.187 -0.013
west -0.0463 0.046 -0.996 0.320 -0.137 0.045
midwest -0.0487 0.046 -1.053 0.293 -0.140 0.042
female -0.1560 0.030 -5.136 0.000 -0.216 -0.096
black -0.0472 0.057 -0.835 0.404 -0.158 0.064
==============================================================================
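As a sanity check, the t value and two-sided \(p\)-value of any coefficient can be reproduced from its estimate and standard error. A minimal sketch for educ, assuming mdl_0 is the fitted statsmodels results object from the previous section:
import scipy.stats as stats

t_educ = mdl_0.params["educ"] / mdl_0.bse["educ"]                 # t value = estimate / std. error
p_educ = 2 * (1 - stats.t.cdf(abs(t_educ), df = mdl_0.df_resid))  # two-sided p-value
print(round(t_educ, 4), round(p_educ, 4))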
We see that:

- educ, exper, metro, south and female are statistically significant;
- west and midwest are insignificant;
- black is insignificant.
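These conclusions can also be read off programmatically by filtering the \(p\)-values of the fitted model (a sketch, again assuming mdl_0 is the fitted statsmodels results object):
print(mdl_0.pvalues[mdl_0.pvalues > 0.05].round(4))   # variables insignificant at the 0.05 level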
If we include many insignificant variables in our model, the estimated coefficients are less accurate. This also means that there could be cases where a statistically significant coefficient becomes insignificant once we include insignificant variables in our model.
To account for this, it is recommended to remove insignificant variables one by one, instead of removing them all at once.
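The one-by-one procedure can be sketched as a simple loop that refits the model after dropping the variable with the largest \(p\)-value, until every remaining variable is significant (an illustrative sketch only - it ignores the caveat about categorical variables discussed below):
import numpy as np
import statsmodels.formula.api as smf

candidates = ["educ", "exper", "metro", "south", "west", "midwest", "female", "black"]
while True:
    fit = smf.ols("np.log(wage) ~ " + " + ".join(candidates), data = dt_train).fit()
    pvals = fit.pvalues.drop("Intercept")     # ignore the intercept
    if pvals.max() <= 0.05:                   # stop once everything is significant
        break
    candidates.remove(pvals.idxmax())         # drop the least significant variable
print(candidates)
Note that such an automated loop may keep borderline variables that we would still choose to drop manually, as is done below.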
On the other hand, indicator variables from the same category are rarely removed separately: if many indicator variables from the same category are insignificant, we should either remove the categorical variable from our model (i.e. remove all indicator variables of that specific category), or change the categorical variable into a binary one. E.g. if we have monday to saturday indicator variables, but only monday is statistically significant, then we should replace these indicator variables with a single indicator variable is_monday, which would allow us to compare the effect of monday vs. all other days of the week.
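For illustration only, such a binary indicator could be created from a hypothetical day_of_week column as follows:
# 'df' and 'day_of_week' are hypothetical names, used only for illustration
df["is_monday"] = (df["day_of_week"] == "monday").astype(int)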
We will begin by removing the insignificant variable with the largest \(p\)-value - in this case we have to choose between removing black and removing the regional indicator variables, since only one of the regional indicators is statistically significant. Let’s compare the results of both actions:
mdl_2a <- lm(formula = "log(wage) ~ educ + exper + metro + geography + female", data = dt_train)
#
mdl_2a %>% summary() %>% coef() %>% round(4) %>% print()
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.2784 0.0970 13.1861 0.0000
educ 0.1073 0.0053 20.3059 0.0000
exper 0.0087 0.0012 7.4669 0.0000
metro 0.1467 0.0391 3.7490 0.0002
geographysouth -0.1044 0.0440 -2.3744 0.0178
geographywest -0.0459 0.0465 -0.9870 0.3239
geographymidwest -0.0484 0.0463 -1.0454 0.2961
female -0.1585 0.0302 -5.2444 0.0000
mdl_2a_spec = smf.ols(formula = "np.log(wage) ~ educ + exper + metro + geography + female", data = dt_train)
mdl_2a = mdl_2a_spec.fit()
print(mdl_2a.summary().tables[1])
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
Intercept 1.2784 0.097 13.186 0.000 1.088 1.469
geography[T.south] -0.1044 0.044 -2.374 0.018 -0.191 -0.018
geography[T.west] -0.0459 0.046 -0.987 0.324 -0.137 0.045
geography[T.midwest] -0.0484 0.046 -1.045 0.296 -0.139 0.042
educ 0.1073 0.005 20.306 0.000 0.097 0.118
exper 0.0087 0.001 7.467 0.000 0.006 0.011
metro 0.1467 0.039 3.749 0.000 0.070 0.224
female -0.1585 0.030 -5.244 0.000 -0.218 -0.099
========================================================================================
mdl_2b <- lm(formula = "log(wage) ~ educ + exper + metro + female + south + black", data = dt_train)
#
mdl_2b %>% summary() %>% coef() %>% round(4) %>% print()
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.2432 0.0905 13.7388 0.0000
educ 0.1072 0.0053 20.2625 0.0000
exper 0.0087 0.0012 7.5030 0.0000
metro 0.1486 0.0387 3.8398 0.0001
female -0.1564 0.0304 -5.1524 0.0000
south -0.0647 0.0322 -2.0103 0.0447
black -0.0465 0.0566 -0.8225 0.4110
mdl_2b_spec = smf.ols(formula = "np.log(wage) ~ educ + exper + metro + female + south + black", data = dt_train)
mdl_2b = mdl_2b_spec.fit()
print(mdl_2b.summary().tables[1])
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.2432 0.090 13.739 0.000 1.066 1.421
educ 0.1072 0.005 20.263 0.000 0.097 0.118
exper 0.0087 0.001 7.503 0.000 0.006 0.011
metro 0.1486 0.039 3.840 0.000 0.073 0.225
female -0.1564 0.030 -5.152 0.000 -0.216 -0.097
south -0.0647 0.032 -2.010 0.045 -0.128 -0.002
black -0.0465 0.057 -0.823 0.411 -0.158 0.064
==============================================================================
Note that the \(p\)-value of south is very close to the significance level \(\alpha = 0.05\), so we choose to treat it as insignificant. Both cases show that neither black nor the regional indicator variables are statistically significant. Therefore, we remove them both:
mdl_2 <- lm(formula = "log(wage) ~ educ + exper + metro + female", data = dt_train)
#
mdl_2 %>% summary() %>% coef() %>% round(4) %>% print()
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.2168 0.0899 13.5327 0e+00
educ 0.1076 0.0053 20.3473 0e+00
exper 0.0088 0.0012 7.5954 0e+00
metro 0.1433 0.0387 3.7030 2e-04
female -0.1605 0.0302 -5.3053 0e+00
mdl_2_spec = smf.ols(formula = "np.log(wage) ~ educ + exper + metro + female", data = dt_train)
mdl_2 = mdl_2_spec.fit()
print(mdl_2.summary().tables[1])
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.2168 0.090 13.533 0.000 1.040 1.393
educ 0.1076 0.005 20.347 0.000 0.097 0.118
exper 0.0088 0.001 7.595 0.000 0.007 0.011
metro 0.1433 0.039 3.703 0.000 0.067 0.219
female -0.1605 0.030 -5.305 0.000 -0.220 -0.101
==============================================================================
We can verify that all of the \(p\)-values are less than \(0.05\):
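For instance, using the fitted Python results object (a sketch):
print((mdl_2.pvalues < 0.05).all())   # True if every coefficient is significant at the 0.05 level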
Next, we check whether some variables are collinear:
Sometimes explanatory variables are tightly connected and it is impossible to distinguish their individual influences on the dependent variable. Many economic variables may move together in some systematic way. Such variables are said to be collinear and cause the collinearity problem.
In the same way, multicollinearity refers to a situation in which two or more explanatory variables are highly linearly related.
For example, consider the following model: \[ Y_{i} = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \epsilon_i \] Now, assume that \(X_{2,i} = \gamma_0 + \gamma_1 X_{1,i}\). Then our initial expression becomes: \[ \begin{aligned} Y_{i} &= \beta_0 + \beta_1 X_{1,i} + \beta_2 (\gamma_0 + \gamma_1 X_{1,i}) + \epsilon_i\\ &= \left[ \beta_0 + \beta_2 \gamma_0 \right] + \left[ \beta_1 + \beta_2 \gamma_1 \right] \cdot X_{1,i} + \epsilon_i \\ &= \alpha_0 + \alpha_1 X_{1,i} + \epsilon_i \end{aligned} \] Thus, while we may be able to estimate \(\alpha_0\) and \(\alpha_1\), we would not be able to obtain estimates of the original \(\beta_0, \beta_1, \beta_2\).
To verify that we cannot estimate the original coefficients, note that the \(\rm OLS\) formula requires us to calculate \(\left(\mathbf{X}^\top \mathbf{X} \right)^{-1}\). So, if our design matrix is: \[ \begin{bmatrix} 1 & 1.1 \\ 1 & 1.2 \\ 1 & 1.3 \\ \vdots & \vdots \\ 1 & 2 \\ \end{bmatrix} \] then we can calculate the following:
x_mat = np.full((10, 2), fill_value = np.nan)
x_mat[:, 0] = 1
x_mat[:, 1] = np.arange(start = 1.1, stop = 2 + 0.1, step = 0.1)
print(np.linalg.inv(x_mat.T @ x_mat))
[[ 3.01212121 -1.87878788]
[-1.87878788 1.21212121]]
However, if we have: \[ \begin{bmatrix} 1 & 1.1 & 2.2\\ 1 & 1.2 & 2.4\\ 1 & 1.3 & 2.6\\ \vdots & \vdots & \vdots \\ 1 & 2 & 4\\ \end{bmatrix} \] then:
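The corresponding calculation is not shown above; a minimal sketch of what it would look like, defining x_mat_2 with a third column equal to twice the second (so that \(\mathbf{X}^\top \mathbf{X}\) has no inverse):
x_mat_2 = np.column_stack((x_mat, 2 * x_mat[:, 1]))   # third column = 2 * second column
try:
    print(np.linalg.inv(x_mat_2.T @ x_mat_2))
except np.linalg.LinAlgError as err:
    # the cross-product matrix is (numerically) singular, so the inverse does not exist
    print("LinAlgError:", err)
Depending on rounding, np.linalg.inv either raises a LinAlgError or returns extremely large, meaningless values.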
This also relates to the dummy variable trap, since the indicator of the base group always equals \(1\) when all the other indicator variables of the same categorical variable are zero, and vice versa.
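A tiny numeric sketch of the dummy variable trap (illustrative values only):
d1 = np.array([1, 0, 1, 0, 1])               # indicator of one group
d2 = 1 - d1                                  # indicator of the base group
X = np.column_stack((np.ones(5), d1, d2))    # intercept + indicators of *all* groups
print(np.linalg.matrix_rank(X))              # 2: the three columns are linearly dependent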
Note that some libraries calculate the generalized inverse instead:
print(np.linalg.pinv(x_mat_2.T @ x_mat_2))
[[ 3.01212121 -0.37575758 -0.75151515]
[-0.37575758 0.04848485 0.0969697 ]
[-0.75151515 0.0969697 0.19393939]]
However, in the presence of multicollinearity, using generalized inverse methods produces unreliable \(\rm OLS\) estimates.
While exact collinearity prevents us from estimating parameters using \(\rm OLS\), in actual empirical data exact collinearity is rare. Data usually exhibits approximate collinearity, which does allow carrying out \(\rm OLS\). Large changes in the estimated regression coefficients when a predictor variable is added, or removed, are indicative of multicollinearity in such cases.
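A small simulated illustration of this effect (synthetic data, not the wage dataset; variable names are illustrative):
rng = np.random.default_rng(42)
n  = 100
x1 = rng.normal(size = n)
x2 = x1 + rng.normal(scale = 0.01, size = n)         # approximately collinear with x1
y  = 1 + 2 * x1 + rng.normal(size = n)
X_small = np.column_stack((np.ones(n), x1))
X_large = np.column_stack((np.ones(n), x1, x2))
print(np.linalg.lstsq(X_small, y, rcond = None)[0])  # x1 coefficient estimated precisely
print(np.linalg.lstsq(X_large, y, rcond = None)[0])  # x1, x2 coefficients become very imprecise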
A popular measure of multicollinearity is the variance inflation factor (VIF), which is based on estimating two models:
- The specified model.
- An auxiliary regression, where a specific explanatory variable is used as the dependent variable, while all of the other variables from the specified model are used as explanatory variables: \[ X_{j,i} = \alpha_0 + \alpha_1 X_{1,i} + ... + \alpha_{j-1} X_{j-1,i} + \alpha_{j+1} X_{j+1,i}+ ... + \alpha_{k} X_{k,i} + e_i \] If such a model is accurate, variable \(X_{j}\) is linearly related to (some of) the other explanatory variables.
The \(\rm VIF\) is calculated as: \[ \text{VIF}(\widehat{\beta}_j) = \dfrac{1}{1 - R_j^2} \] where \(R_j^2\) is the coefficient of determination of the auxiliary regression. As a rule of thumb:
- if \(\text{VIF}(\widehat{\beta}_j)>10\), then multicollinearity is high.
- if \(\text{VIF}(\widehat{\beta}_j) < 5\), then the variable is not collinear with the remaining explanatory variables.
In our case:
car::vif(mdl_2)
educ exper metro female
1.079126 1.056398 1.014557 1.024313
# VIF dataframe
vif_data = pd.DataFrame()
vif_data["variable"] = mdl_2_spec.exog_names
# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(mdl_2_spec.exog, i) for i in range(len(vif_data["variable"]))]
# drop intercept:
vif_data = vif_data[vif_data['variable'] != 'Intercept']
print(vif_data.T)
1 2 3 4
variable educ exper metro female
VIF 1.079126 1.056398 1.014557 1.024313
There is no multicollinearity, as the \(\text{VIF}\) of each explanatory variable is less than \(5\).
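As a cross-check, the \(\text{VIF}\) of a single variable, e.g. educ, can be reproduced manually from its auxiliary regression (a sketch, reusing dt_train and the formula API from above):
aux_educ = smf.ols("educ ~ exper + metro + female", data = dt_train).fit()
print(1 / (1 - aux_educ.rsquared))   # should match the VIF of educ reported above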
Since we can see the coefficient sign in the model summary, in this case it isn’t important whether the sign is positive or negative - only whether the coefficient is statistically significant or not.↩︎