Updating and re-fitting the model

Note

This section continues on from the end of the previous section on hypothesis testing.

Taking everything that we have seen so far, we will initially specify our model without any interaction terms: \[ \begin{aligned} \log(wage_i) &= \beta_0 + \beta_1 educ_i + \beta_2 educ_i^2 + \beta_3 exper_i + \beta_4 exper_i^2 \\ &+ \beta_5 metro_i + \beta_6 female_i + \beta_{7} black_i + \beta_{8} female_i \times black_i + \epsilon_i,\quad i = 1,...,N \end{aligned} \]

We expect the following signs for the non-intercept coefficients:

We now estimate the model:

mdl_3 <- lm(formula = "log(wage) ~ educ + I(educ^2) + exper + I(exper^2) + metro + female*black", data = dt_train)
mdl_3 %>% summary() %>% coef() %>% round(4) %>% print()
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.5599     0.1841  8.4727   0.0000
educ           0.0367     0.0253  1.4527   0.1466
I(educ^2)      0.0025     0.0009  2.7116   0.0068
exper          0.0277     0.0040  6.9058   0.0000
I(exper^2)    -0.0004     0.0001 -4.9790   0.0000
metro          0.1356     0.0383  3.5450   0.0004
female        -0.1514     0.0312 -4.8543   0.0000
black         -0.0346     0.0853 -0.4061   0.6848
female:black  -0.0548     0.1112 -0.4927   0.6223
mdl_3_spec = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + metro + female*black", data = dt_train)
mdl_3 = mdl_3_spec.fit()
print(mdl_3.summary().tables[1])
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              1.5599      0.184      8.473      0.000       1.199       1.921
educ                   0.0367      0.025      1.453      0.147      -0.013       0.086
np.power(educ, 2)      0.0025      0.001      2.712      0.007       0.001       0.004
exper                  0.0277      0.004      6.906      0.000       0.020       0.036
np.power(exper, 2)    -0.0004   8.05e-05     -4.979      0.000      -0.001      -0.000
metro                  0.1356      0.038      3.545      0.000       0.061       0.211
female                -0.1514      0.031     -4.854      0.000      -0.213      -0.090
black                 -0.0346      0.085     -0.406      0.685      -0.202       0.133
female:black          -0.0548      0.111     -0.493      0.622      -0.273       0.163
======================================================================================

The squared education term is significant. We do not remove the lower-order educ variable, even though it is individually insignificant, since a polynomial term should always be accompanied by its lower-order terms. On the other hand, the interaction term between race and gender is insignificant, so we remove it.

mdl_3 <- lm(formula = "log(wage) ~ educ + I(educ^2) + exper + I(exper^2) + metro + female + black", data = dt_train)
mdl_3 %>% summary() %>% coef() %>% round(4) %>% print()
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.5607     0.1840  8.4809   0.0000
educ          0.0369     0.0252  1.4618   0.1441
I(educ^2)     0.0025     0.0009  2.7033   0.0070
exper         0.0276     0.0040  6.9032   0.0000
I(exper^2)   -0.0004     0.0001 -4.9732   0.0000
metro         0.1355     0.0382  3.5435   0.0004
female       -0.1556     0.0300 -5.1856   0.0000
black        -0.0668     0.0549 -1.2164   0.2241
mdl_3_spec = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + metro + female + black", data = dt_train)
mdl_3 = mdl_3_spec.fit()
print(mdl_3.summary().tables[1])
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              1.5607      0.184      8.481      0.000       1.200       1.922
educ                   0.0369      0.025      1.462      0.144      -0.013       0.086
np.power(educ, 2)      0.0025      0.001      2.703      0.007       0.001       0.004
exper                  0.0276      0.004      6.903      0.000       0.020       0.035
np.power(exper, 2)    -0.0004   8.04e-05     -4.973      0.000      -0.001      -0.000
metro                  0.1355      0.038      3.544      0.000       0.060       0.211
female                -0.1556      0.030     -5.186      0.000      -0.214      -0.097
black                 -0.0668      0.055     -1.216      0.224      -0.175       0.041
======================================================================================

Just like in our previous model specification, the race indicator variable is insignificant, so we remove it from the model:

mdl_4 <- lm(formula = "log(wage) ~ educ + I(educ^2) + exper + I(exper^2) + metro + female", data = dt_train)
mdl_4 %>% summary() %>% coef() %>% round(4) %>% print()
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.5556     0.1840  8.4534   0.0000
educ          0.0367     0.0253  1.4545   0.1461
I(educ^2)     0.0025     0.0009  2.7293   0.0065
exper         0.0276     0.0040  6.9020   0.0000
I(exper^2)   -0.0004     0.0001 -4.9615   0.0000
metro         0.1334     0.0382  3.4912   0.0005
female       -0.1592     0.0299 -5.3320   0.0000
mdl_4_spec = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + metro + female", data = dt_train)
mdl_4 = mdl_4_spec.fit()
print(mdl_4.summary().tables[1])
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              1.5556      0.184      8.453      0.000       1.194       1.917
educ                   0.0367      0.025      1.455      0.146      -0.013       0.086
np.power(educ, 2)      0.0025      0.001      2.729      0.006       0.001       0.004
exper                  0.0276      0.004      6.902      0.000       0.020       0.035
np.power(exper, 2)    -0.0004   8.05e-05     -4.962      0.000      -0.001      -0.000
metro                  0.1334      0.038      3.491      0.001       0.058       0.208
female                -0.1592      0.030     -5.332      0.000      -0.218      -0.101
======================================================================================

Note that the relationship between educ and educ^2 is non-linear:

p5 <- dt_train %>% 
  mutate(educ2 = educ^2) %>% 
  dplyr::select(educ, educ2) %>%
  ggplot(aes(x = educ, y = educ2)) +
  geom_line(color = "cornflowerblue") +  
  geom_point(color = "red") +
  theme_bw()
print(p5)

dt_train['educ2'] = dt_train['educ'] ** 2
p5 = (pl.ggplot(data = dt_train[['educ', 'educ2']], mapping = pl.aes(x = 'educ', y = 'educ2')) +
      pl.geom_line(color = "cornflowerblue") +
      pl.geom_point(color = "red") +
      pl.theme_bw())
#
dt_train = dt_train.drop(columns= ['educ2'])
print(p5)

Nevertheless, the correlation between them (a linear measure of dependence) is high:

cor(dt_train['educ'], dt_train['educ']^2)
          educ
educ 0.9793619
np.corrcoef(dt_train['educ'], dt_train['educ']**2)
array([[1.        , 0.97936189],
       [0.97936189, 1.        ]])
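A common remedy (a side note, not something we apply to the model above) is to center the variable before squaring: the correlation between a centered variable and its square is typically much lower. A minimal sketch with synthetic data standing in for `dt_train['educ']`:

```python
import numpy as np

rng = np.random.default_rng(1)
educ = rng.integers(8, 22, size=500).astype(float)  # synthetic stand-in for dt_train['educ']

# Raw variable vs. its square: nearly collinear
r_raw = np.corrcoef(educ, educ**2)[0, 1]

# Centered variable vs. its square: the linear correlation drops sharply
educ_c = educ - educ.mean()
r_centered = np.corrcoef(educ_c, educ_c**2)[0, 1]

print(round(r_raw, 4), round(r_centered, 4))
```

Centering changes the interpretation of the lower-order coefficient, so it is a modeling choice rather than a free fix.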

As a result, if we were to check for multicollinearity, we would see that the \(\rm VIF\) is quite high:

car::vif(mdl_4) %>% print()
      educ  I(educ^2)      exper I(exper^2)      metro     female 
 25.328928  24.925901  12.961817  13.212006   1.018394   1.027602 
# VIF dataframe 
vif_data = pd.DataFrame() 
vif_data["variable"] = mdl_4_spec.exog_names
# calculating VIF for each feature 
vif_data["VIF"] = [variance_inflation_factor(mdl_4_spec.exog, i) for i in range(len(vif_data["variable"]))]
# drop intercept:
vif_data = vif_data[vif_data['variable'] != 'Intercept']
print(vif_data.T)
                  1                  2          3                   4         5         6
variable       educ  np.power(educ, 2)      exper  np.power(exper, 2)     metro    female
VIF       25.328928          24.925901  12.961817           13.212006  1.018394  1.027602

When polynomial or interaction terms are included, a high \(\rm VIF\) is deceptive: the \(\rm VIF\) is large here because \(educ^2\) is constructed from \(educ\) itself, yet their true relationship is non-linear, so this is not the harmful kind of multicollinearity that the \(\rm VIF\) is designed to detect.

This is why it is better to check for multicollinearity BEFORE including interaction and polynomial terms!
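To see how the squared term drives the \(\rm VIF\), recall that each regressor's \(\rm VIF\) equals the corresponding diagonal element of the inverse of the regressors' correlation matrix. A self-contained sketch with synthetic data (numpy only; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
educ = rng.integers(8, 22, size=500).astype(float)  # synthetic education variable
X = np.column_stack([educ, educ**2])

# VIF_j is the j-th diagonal element of the inverse correlation matrix
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))
print(vif)  # both values are large, purely because educ^2 is built from educ
```

With two regressors this reduces to \(1/(1-r^2)\) for each, so a correlation near \(0.98\) already implies a \(\rm VIF\) of around \(25\).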

So, we can write down our final model as follows (coefficient standard errors are also provided in parenthesis): \[ \begin{aligned} \underset{(s.e.)}{\widehat{\log(wage_i)}} &= \underset{(0.1840)}{1.5556} + \underset{(0.0253)}{0.0367} \cdot educ_i + \underset{(0.0009)}{0.0025} \cdot educ_i^2 + \underset{(0.0040)}{0.0276} \cdot exper_i - \underset{(0.0001)}{0.0004} exper_i^2 \\ &+ \underset{(0.0382)}{0.1334} \cdot metro_i - \underset{(0.0299)}{0.1592} \cdot female_i \end{aligned} \]
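For quick point predictions, the fitted equation above can be wrapped in a small helper function (a sketch using the rounded coefficient estimates, so results differ slightly from calling `predict()` on the full-precision model):

```python
import numpy as np

def predict_log_wage(educ, exper, metro, female):
    """Point prediction of log(wage) from the final model's rounded coefficients."""
    return (1.5556 + 0.0367 * educ + 0.0025 * educ**2
            + 0.0276 * exper - 0.0004 * exper**2
            + 0.1334 * metro - 0.1592 * female)

log_w = predict_log_wage(educ=12, exper=10, metro=1, female=0)
print(round(log_w, 4), round(float(np.exp(log_w)), 2))  # log-wage and implied wage level
```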

We can interpret the coefficients as follows:

Another way to look at it is to take two cases:

Then the difference between the two predictions is: \[ \begin{aligned} \widehat{\log(wage_i)}^{(1)} - \widehat{\log(wage_i)}^{(0)} &= 0.0367 \times (educ_i+1) + 0.0025 \times (educ_i+1)^2 \\ &- (0.0367 \times educ_i + 0.0025 \times educ_i^2) \\ &= 0.0367 + 0.0025 \times (2 \cdot educ_i + 1) \\ &= 0.0392 + 0.005 \times educ_i \end{aligned} \] In other words, if education increases by one year, then wage increases by approximately \(100 \cdot \left[ 0.0367 + 0.0025 \cdot (2 \cdot educ + 1) \right] \%\). The effect is more pronounced at higher levels of education.

For example, if \(educ_1 = 10\) and \(educ_2 = 20\), then an additional year of education increases the predicted wage by more for the individual going from \(20\) to \(21\) years of education than for the one going from \(10\) to \(11\) years.
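The calculation above is easy to verify numerically (plain Python; the coefficients are taken from the estimated model):

```python
def educ_effect_pct(educ):
    """Approximate % change in wage from one additional year of education,
    evaluated at the current level of educ (coefficients from mdl_4)."""
    return 100 * (0.0367 + 0.0025 * (2 * educ + 1))

print(round(educ_effect_pct(10), 2))  # going from 10 to 11 years: about 8.92%
print(round(educ_effect_pct(20), 2))  # going from 20 to 21 years: about 13.92%
```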

Assume that our model has interaction effects - how would we interpret the interaction coefficient?

For example, let’s say that our estimated regression model is:

mdl_x <- lm(formula = "log(wage) ~ educ + I(educ^2) + exper + I(exper^2) + metro + south + female*educ", data = dt_train)
mdl_x %>% summary() %>% coef() %>% round(4) %>% print()
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.6205     0.1836  8.8272   0.0000
educ          0.0406     0.0251  1.6160   0.1064
I(educ^2)     0.0019     0.0009  2.0835   0.0375
exper         0.0286     0.0040  7.1727   0.0000
I(exper^2)   -0.0004     0.0001 -5.1857   0.0000
metro         0.1423     0.0380  3.7431   0.0002
south        -0.0809     0.0311 -2.6023   0.0094
female       -0.6067     0.1586 -3.8253   0.0001
educ:female   0.0312     0.0108  2.8839   0.0040
mdl_x_spec = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + metro + south + female*educ", data = dt_train)
mdl_x = mdl_x_spec.fit()
print(mdl_x.summary().tables[1])
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              1.6205      0.184      8.827      0.000       1.260       1.981
educ                   0.0406      0.025      1.616      0.106      -0.009       0.090
np.power(educ, 2)      0.0019      0.001      2.083      0.037       0.000       0.004
exper                  0.0286      0.004      7.173      0.000       0.021       0.036
np.power(exper, 2)    -0.0004   8.01e-05     -5.186      0.000      -0.001      -0.000
metro                  0.1423      0.038      3.743      0.000       0.068       0.217
south                 -0.0809      0.031     -2.602      0.009      -0.142      -0.020
female                -0.6067      0.159     -3.825      0.000      -0.918      -0.295
female:educ            0.0312      0.011      2.884      0.004       0.010       0.052
======================================================================================

\[ \begin{aligned} \widehat{\log(wage)} &= {1.6205} + {0.0406} \cdot educ + {0.0019} \cdot educ^2 \\ &+ {0.0286} \cdot exper - {0.0004} \cdot exper^2 \\ &+ {0.1423} \cdot metro - {0.0809} \cdot south \\ &- {0.6067} \cdot female + {0.0312} \cdot \left(female \times educ\right) \end{aligned} \] To highlight the effect of gender, we can rearrange the above model as: \[ \begin{aligned} \widehat{\log(wage)} &= {1.6205} + {0.0406} \cdot educ + {0.0019} \cdot educ^2 \\ &+ {0.0286} \cdot exper - {0.0004} \cdot exper^2 \\ &+ {0.1423} \cdot metro - {0.0809} \cdot south \\ &+ \left({0.0312} \cdot educ - {0.6067}\right) \cdot female \end{aligned} \] A possible interpretation is the following: if the person is female, then their \(\log(wage)\) differs by \(0.0312 \cdot educ - 0.6067\), compared to the base non-female group, ceteris paribus.

By specifying this type of model we can see how much education offsets gender-based discrimination: solving \({0.0312} \cdot educ - {0.6067} = 0\) gives \(educ = 0.6067 / 0.0312 \approx 19.45\), so roughly \(19.45\) years of education would offset the estimated gender gap.
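The break-even education level can be verified directly:

```python
# Education level at which the estimated gender effect (0.0312*educ - 0.6067) is zero
offset_educ = 0.6067 / 0.0312
print(round(offset_educ, 2))  # about 19.45 years
```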