Hypothesis testing

Note

This section continues on from the end of the previous section on model specification.

We want to test the null hypothesis that a specific coefficient is equal to zero (i.e. the variable is insignificant) against the alternative that it is different from zero (i.e. the variable is statistically significant)1:

\[ \begin{cases} H_0: \beta_j &= 0\\ H_1: \beta_j &\neq 0 \end{cases} \] For our tests, we select a significance level of \(\alpha = 0.05\).
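As a small illustration of how the reported \(t\)-statistics and \(p\)-values arise, we can compute them by hand for a single coefficient. This is a minimal sketch in R, assuming the mdl_0 object estimated in the previous section is still available:

coefs  <- mdl_0 %>% summary() %>% coef()                                   # estimates, std. errors, t values, p-values
t_stat <- coefs["educ", "Estimate"] / coefs["educ", "Std. Error"]          # t = estimate / standard error
p_val  <- 2 * pt(abs(t_stat), df = mdl_0$df.residual, lower.tail = FALSE)  # two-sided p-value
print(c(t_statistic = t_stat, p_value = p_val))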

Insignificant variables are those whose \(p\)-value is greater than the \(0.05\) significance level:

mdl_0 %>% summary() %>% coef() %>% round(4) %>% print()
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.2840     0.0972 13.2102   0.0000
educ          0.1070     0.0053 20.1863   0.0000
exper         0.0086     0.0012  7.4405   0.0000
metro         0.1479     0.0392  3.7758   0.0002
south        -0.0998     0.0443 -2.2515   0.0246
west         -0.0463     0.0465 -0.9957   0.3196
midwest      -0.0487     0.0463 -1.0530   0.2926
female       -0.1560     0.0304 -5.1363   0.0000
black        -0.0472     0.0566 -0.8350   0.4039
print(mdl_0.summary().tables[1])
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.2840      0.097     13.210      0.000       1.093       1.475
educ           0.1070      0.005     20.186      0.000       0.097       0.117
exper          0.0086      0.001      7.440      0.000       0.006       0.011
metro          0.1479      0.039      3.776      0.000       0.071       0.225
south         -0.0998      0.044     -2.252      0.025      -0.187      -0.013
west          -0.0463      0.046     -0.996      0.320      -0.137       0.045
midwest       -0.0487      0.046     -1.053      0.293      -0.140       0.042
female        -0.1560      0.030     -5.136      0.000      -0.216      -0.096
black         -0.0472      0.057     -0.835      0.404      -0.158       0.064
==============================================================================

We see that the coefficients of west, midwest and black have \(p\)-values greater than \(0.05\), so these variables are insignificant at our chosen significance level, while the remaining coefficients are statistically significant.

Tip

If we include many insignificant variables in our model, the estimated coefficients are less accurate. This also means that there could be cases where a statistically significant coefficient becomes insignificant once we include insignificant variables in our model.

To account for this, it is recommended to remove insignificant variables one at a time, instead of removing them all at once.

On the other hand, indicator variables from the same categorical variable are rarely removed separately: if many indicator variables from the same category are insignificant, we should either remove the categorical variable from our model (i.e. remove all of its indicator variables), or change the categorical variable into a binary one. E.g. if we have monday to saturday indicator variables, but only monday is statistically significant, then we should replace them with a single indicator variable is_monday, which would allow us to compare the effect of monday vs. all other days of the week (see the sketch below).
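As a small sketch of the second option: the day_of_week column and the data frame below are made up for illustration and are not part of dt_train.

# collapse a hypothetical categorical variable into a single binary indicator
dt_days <- data.frame(day_of_week = c("monday", "tuesday", "friday", "monday"))
dt_days$is_monday <- as.integer(dt_days$day_of_week == "monday")   # 1 for monday, 0 otherwise
print(dt_days)

The single is_monday indicator would then measure the effect of monday relative to all other days combined.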

We will begin by removing the insignificant variable with the largest \(p\)-value. In this case we have to choose between removing black and removing the regional indicator variables, since only one of the latter is statistically significant. Let’s compare the results of both actions:

mdl_2a <- lm(formula = "log(wage) ~ educ + exper + metro + geography + female", data = dt_train)
#
mdl_2a %>% summary() %>% coef() %>% round(4) %>% print()
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        1.2784     0.0970 13.1861   0.0000
educ               0.1073     0.0053 20.3059   0.0000
exper              0.0087     0.0012  7.4669   0.0000
metro              0.1467     0.0391  3.7490   0.0002
geographysouth    -0.1044     0.0440 -2.3744   0.0178
geographywest     -0.0459     0.0465 -0.9870   0.3239
geographymidwest  -0.0484     0.0463 -1.0454   0.2961
female            -0.1585     0.0302 -5.2444   0.0000
mdl_2a_spec = smf.ols(formula = "np.log(wage) ~ educ + exper + metro + geography + female", data = dt_train)
mdl_2a = mdl_2a_spec.fit()
print(mdl_2a.summary().tables[1])
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept                1.2784      0.097     13.186      0.000       1.088       1.469
geography[T.south]      -0.1044      0.044     -2.374      0.018      -0.191      -0.018
geography[T.west]       -0.0459      0.046     -0.987      0.324      -0.137       0.045
geography[T.midwest]    -0.0484      0.046     -1.045      0.296      -0.139       0.042
educ                     0.1073      0.005     20.306      0.000       0.097       0.118
exper                    0.0087      0.001      7.467      0.000       0.006       0.011
metro                    0.1467      0.039      3.749      0.000       0.070       0.224
female                  -0.1585      0.030     -5.244      0.000      -0.218      -0.099
========================================================================================
mdl_2b <- lm(formula = "log(wage) ~ educ + exper + metro + female + south + black", data = dt_train)
#
mdl_2b %>% summary() %>% coef() %>% round(4) %>% print()
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.2432     0.0905 13.7388   0.0000
educ          0.1072     0.0053 20.2625   0.0000
exper         0.0087     0.0012  7.5030   0.0000
metro         0.1486     0.0387  3.8398   0.0001
female       -0.1564     0.0304 -5.1524   0.0000
south        -0.0647     0.0322 -2.0103   0.0447
black        -0.0465     0.0566 -0.8225   0.4110
mdl_2b_spec = smf.ols(formula = "np.log(wage) ~ educ + exper + metro + female + south + black", data = dt_train)
mdl_2b = mdl_2b_spec.fit()
print(mdl_2b.summary().tables[1])
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.2432      0.090     13.739      0.000       1.066       1.421
educ           0.1072      0.005     20.263      0.000       0.097       0.118
exper          0.0087      0.001      7.503      0.000       0.006       0.011
metro          0.1486      0.039      3.840      0.000       0.073       0.225
female        -0.1564      0.030     -5.152      0.000      -0.216      -0.097
south         -0.0647      0.032     -2.010      0.045      -0.128      -0.002
black         -0.0465      0.057     -0.823      0.411      -0.158       0.064
==============================================================================

Note that the \(p\)-value of south is very close to the significance level \(\alpha = 0.05\), so we choose to treat it as insignificant. Both cases then show that neither black nor the regional indicator variables are statistically significant. Therefore, we remove them both:

mdl_2 <- lm(formula = "log(wage) ~ educ + exper + metro + female", data = dt_train)
#
mdl_2 %>% summary() %>% coef() %>% round(4) %>% print()
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.2168     0.0899 13.5327    0e+00
educ          0.1076     0.0053 20.3473    0e+00
exper         0.0088     0.0012  7.5954    0e+00
metro         0.1433     0.0387  3.7030    2e-04
female       -0.1605     0.0302 -5.3053    0e+00
mdl_2_spec = smf.ols(formula = "np.log(wage) ~ educ + exper + metro + female", data = dt_train)
mdl_2 = mdl_2_spec.fit()
print(mdl_2.summary().tables[1])
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.2168      0.090     13.533      0.000       1.040       1.393
educ           0.1076      0.005     20.347      0.000       0.097       0.118
exper          0.0088      0.001      7.595      0.000       0.007       0.011
metro          0.1433      0.039      3.703      0.000       0.067       0.219
female        -0.1605      0.030     -5.305      0.000      -0.220      -0.101
==============================================================================

We can verify that all of the \(p\)-values are less than \(0.05\):

mdl_2 %>% summary() %>% coef() %>% 
  as_tibble() %>% 
  dplyr::filter(`Pr(>|t|)` >= 0.05) %>%
  nrow()
[1] 0
#
#
#
np.sum(mdl_2.pvalues >= 0.05)
0

Next, we check whether some variables are collinear:

Sometimes explanatory variables are tightly connected and it is impossible to distinguish their individual influences on the dependent variable. Many economic variables may move together in some systematic way. Such variables are said to be collinear and cause the collinearity problem.

More generally, multicollinearity refers to a situation in which two or more explanatory variables are highly linearly related.

For example, consider the following model: \[ Y_{i} = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \epsilon_i \] Now, assume that \(X_{2,i} = \gamma_0 + \gamma_1 X_{1,i}\). Then our initial expression becomes: \[ \begin{aligned} Y_{i} &= \beta_0 + \beta_1 X_{1,i} + \beta_2 (\gamma_0 + \gamma_1 X_{1,i}) + \epsilon_i\\ &= \left[ \beta_0 + \beta_2 \gamma_0 \right] + \left[ \beta_1 + \beta_2 \gamma_1 \right] \cdot X_{1,i} + \epsilon_i \\ &= \alpha_0 + \alpha_1 X_{1,i} + \epsilon_i \end{aligned} \] Thus, while we may be able to estimate \(\alpha_0\) and \(\alpha_1\), we would not be able to obtain estimates of the original \(\beta_0, \beta_1, \beta_2\).
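A quick simulation in R illustrates this point; the variable names and coefficient values below are made up for illustration. Since x2 is an exact linear function of x1, lm() cannot separate their effects and reports the redundant coefficient as NA:

set.seed(42)
x1 <- rnorm(n = 100)
x2 <- 1 + 2 * x1                                  # gamma_0 = 1, gamma_1 = 2
y  <- 3 + 0.5 * x1 + 0.2 * x2 + rnorm(n = 100)    # beta_0 = 3, beta_1 = 0.5, beta_2 = 0.2
lm(y ~ x1 + x2) %>% coef() %>% print()            # the coefficient of x2 is returned as NA

Only the combined coefficients \(\alpha_0 = \beta_0 + \beta_2 \gamma_0\) and \(\alpha_1 = \beta_1 + \beta_2 \gamma_1\) are estimable.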

To verify that we cannot estimate the original coefficients, note that the \(\rm OLS\) formula requires us to calculate \(\left(\mathbf{X}^\top \mathbf{X} \right)^{-1}\). So, if our design matrix is: \[ \begin{bmatrix} 1 & 1.1 \\ 1 & 1.2 \\ 1 & 1.3 \\ \vdots & \vdots \\ 1 & 2 \\ \end{bmatrix} \] then we can calculate the following:

x_mat <- cbind(1, seq(from = 1.1, to = 2, by = 0.1))
#
#
solve(t(x_mat) %*% x_mat)
          [,1]      [,2]
[1,]  3.012121 -1.878788
[2,] -1.878788  1.212121
x_mat = np.full((10, 2), fill_value = np.nan)
x_mat[:, 0] = 1
x_mat[:, 1] = np.arange(start = 1.1, stop = 2 + 0.1, step = 0.1)
print(np.linalg.inv(x_mat.T @ x_mat))
[[ 3.01212121 -1.87878788]
 [-1.87878788  1.21212121]]

However, if we have: \[ \begin{bmatrix} 1 & 1.1 & 2.2\\ 1 & 1.2 & 2.4\\ 1 & 1.3 & 2.6\\ \vdots & \vdots & \vdots \\ 1 & 2 & 4\\ \end{bmatrix} \] then:

x_mat_2 <- cbind(x_mat, x_mat[, 2] * 2)
solve(t(x_mat_2) %*% x_mat_2)
Error in solve.default(t(x_mat_2) %*% x_mat_2): Lapack routine dgesv: system is exactly singular: U[3,3] = 0
x_mat_2 = np.column_stack((x_mat,  x_mat[:, 1] * 2))
print(np.linalg.inv(x_mat_2.T @ x_mat_2))
Singular matrix

This also relates to the dummy variable trap: the indicator variable of the base group is equal to \(1\) exactly when all the other indicator variables of the same categorical variable are zero, and vice versa. Hence, including indicator variables for every category alongside the intercept makes the columns of the design matrix linearly dependent, as the small example below illustrates.
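A small numerical example in R (the category indicators are made up): once we include an intercept together with indicator variables for every category, \(\mathbf{X}^\top \mathbf{X}\) cannot be inverted.

d_A <- c(1, 0, 0, 1, 0)          # indicator of a hypothetical category A
d_B <- 1 - d_A                   # indicator of the base category B
X   <- cbind(1, d_A, d_B)        # the intercept column is equal to d_A + d_B
qr(X)$rank                       # rank 2 < 3 columns, so t(X) %*% X is singular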

Note that some libraries calculate the generalized inverse instead:

MASS::ginv(t(x_mat_2) %*% x_mat_2) %>% print()
           [,1]        [,2]       [,3]
[1,]  3.0121212 -0.37575758 -0.7515152
[2,] -0.3757576  0.04848485  0.0969697
[3,] -0.7515152  0.09696970  0.1939394
print(np.linalg.pinv(x_mat_2.T @ x_mat_2))
[[ 3.01212121 -0.37575758 -0.75151515]
 [-0.37575758  0.04848485  0.0969697 ]
 [-0.75151515  0.0969697   0.19393939]]

However, in the presence of multicollinearity, using generalized inverse methods produces unreliable \(\rm OLS\) estimates.
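A short sketch of why such estimates are unreliable, reusing x_mat_2 from above; the response y and its coefficients are made up for illustration:

set.seed(123)
y <- 2 + 3 * x_mat_2[, 2] + rnorm(n = 10, sd = 0.1)   # the "true" slope of the second column is 3
beta_ginv <- MASS::ginv(x_mat_2) %*% y                # minimum-norm solution via the pseudo-inverse
print(beta_ginv)

Neither of the two collinear columns recovers the coefficient \(3\): the pseudo-inverse splits it between them according to the collinearity structure rather than the data-generating process.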

While exact collinearity prevents us from estimating parameters using \(\rm OLS\), in actual empirical data exact collinearity is rare. Data usually exhibits *approximate* collinearity, which still allows carrying out \(\rm OLS\). In such cases, large changes in the estimated regression coefficients when a predictor variable is added or removed are indicative of multicollinearity.

A popular measure of multicollinearity is the variance inflation factor (VIF), which is based on estimating two models:

1. the original model, in which \(Y_i\) is regressed on all of the explanatory variables;
2. an auxiliary regression, in which the explanatory variable \(X_{j,i}\) is regressed on all of the remaining explanatory variables. Its coefficient of determination is denoted \(R_j^2\).

The \(\rm VIF\) is calculated as: \[ \text{VIF}(\widehat{\beta}_j) = \dfrac{1}{1 - R_j^2} \] where \(R_j^2\) is the coefficient of determination of the auxiliary regression. As a rule of thumb, a \(\text{VIF}\) greater than \(5\) (some sources use \(10\)) indicates problematic multicollinearity of \(X_j\) with the remaining explanatory variables, while values close to \(1\) indicate no collinearity.
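To make the formula concrete, the \(\text{VIF}\) of educ in mdl_2 can also be computed by hand from its auxiliary regression. This is a sketch using the same dt_train data; the result should match the value reported by car::vif below:

aux_reg <- lm(educ ~ exper + metro + female, data = dt_train)   # auxiliary regression of educ on the remaining regressors
r2_educ <- summary(aux_reg)$r.squared                           # R_j^2 of the auxiliary regression
print(1 / (1 - r2_educ))                                        # VIF of educ, approximately 1.08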

In our case:

#
#
#
#
#
#
#
car::vif(mdl_2)
    educ    exper    metro   female 
1.079126 1.056398 1.014557 1.024313 
# VIF dataframe 
vif_data = pd.DataFrame() 
vif_data["variable"] = mdl_2_spec.exog_names
# calculating VIF for each feature 
vif_data["VIF"] = [variance_inflation_factor(mdl_2_spec.exog, i) for i in range(len(vif_data["variable"]))]
# drop intercept:
vif_data = vif_data[vif_data['variable'] != 'Intercept']
print(vif_data.T)
                 1         2         3         4
variable      educ     exper     metro    female
VIF       1.079126  1.056398  1.014557  1.024313

There is no multicollinearity, as the \(\text{VIF}\) of each explanatory variable is less than \(5\).


  1. Since we can see the coefficient sign in the model summary, in this case it is not important whether the sign is positive or negative, only whether the coefficient is statistically significant or not.↩︎