Updating and re-fiting the model

Note

This section continues on from the end of the previous section on hypothesis testing.

Taking everything that we have seen so far, we will initially specify our model without any interaction terms: \[ \begin{aligned} \log(wage_i) &= \beta_0 + \beta_1 educ_i + \beta_2 educ_i^2 + \beta_3 exper_i + \beta_4 exper_i^2 \\ &+ \beta_5 metro_i + \beta_6 female_i + \beta_{7} black_i + \beta_{8} female_i \times black_i + \epsilon_i,\quad i = 1,...,N \end{aligned} \]

We expect the following signs for the non-intercept coefficients:

\(\beta_2 > 0\) - generally it may be hard to say - with each additional year of education, the effect on wage may be lessened, however, if the additional year is for PhD-level education, then the additional year of education may have an increased effect on wage. For now, we will assume that is the case, i.e. \(\beta_2 > 0\).
\(\beta_4 < 0\) - generally, experience is correlated with age. Furthermore, the level of knowledge tends to even-out for each additional year of experience the more years of experience you have. In other words, if you already have a lot of experience, then an additional year of experience would have a lessened (but not necessarily negative) effect on wage. Note: exper^2 is alongside exper, so we do not know (as of right now), if there is a number of years of experience that results in a decrease in wage. We are assuming that it would result in an increase, but at a lower rate (\(\beta_4 < 0\)), compared to someone with less initial years of experience;
\(\beta_{8}\) - while we have already determined that race does not have a significant effect on \(wage\), this does not take into account that race of a specific gender may face discrimination, rather than race as a whole. To account for this, we include the gender and race interaction term. We also include the separate terms for gender and race, in order to have a more concise coefficient interpretation, if the interaction term is statistically significant.

We now estimate the model:

R
Python

mdl_3 <- lm(formula = "log(wage) ~ educ + I(educ^2) + exper + I(exper^2) + metro + female*black", data = dt_train)
mdl_3 %>% summary() %>% coef() %>% round(4) %>% print()

             Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.5599     0.1841  8.4727   0.0000
educ           0.0367     0.0253  1.4527   0.1466
I(educ^2)      0.0025     0.0009  2.7116   0.0068
exper          0.0277     0.0040  6.9058   0.0000
I(exper^2)    -0.0004     0.0001 -4.9790   0.0000
metro          0.1356     0.0383  3.5450   0.0004
female        -0.1514     0.0312 -4.8543   0.0000
black         -0.0346     0.0853 -0.4061   0.6848
female:black  -0.0548     0.1112 -0.4927   0.6223

mdl_3_spec = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + metro + female*black", data = dt_train)
mdl_3 = mdl_3_spec.fit()
print(mdl_3.summary().tables[1])

======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              1.5599      0.184      8.473      0.000       1.199       1.921
educ                   0.0367      0.025      1.453      0.147      -0.013       0.086
np.power(educ, 2)      0.0025      0.001      2.712      0.007       0.001       0.004
exper                  0.0277      0.004      6.906      0.000       0.020       0.036
np.power(exper, 2)    -0.0004   8.05e-05     -4.979      0.000      -0.001      -0.000
metro                  0.1356      0.038      3.545      0.000       0.061       0.211
female                -0.1514      0.031     -4.854      0.000      -0.213      -0.090
black                 -0.0346      0.085     -0.406      0.685      -0.202       0.133
female:black          -0.0548      0.111     -0.493      0.622      -0.273       0.163
======================================================================================

The square of education is significant - we do not remove the lower order educ variable, even though it is insignificant. On the other hand - we see that the interaction term between race and gender is insignificant, so we remove it.

R
Python

mdl_3 <- lm(formula = "log(wage) ~ educ + I(educ^2) + exper + I(exper^2) + metro + female + black", data = dt_train)
mdl_3 %>% summary() %>% coef() %>% round(4) %>% print()

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.5607     0.1840  8.4809   0.0000
educ          0.0369     0.0252  1.4618   0.1441
I(educ^2)     0.0025     0.0009  2.7033   0.0070
exper         0.0276     0.0040  6.9032   0.0000
I(exper^2)   -0.0004     0.0001 -4.9732   0.0000
metro         0.1355     0.0382  3.5435   0.0004
female       -0.1556     0.0300 -5.1856   0.0000
black        -0.0668     0.0549 -1.2164   0.2241

mdl_3_spec = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + metro + female + black", data = dt_train)
mdl_3 = mdl_3_spec.fit()
print(mdl_3.summary().tables[1])

======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              1.5607      0.184      8.481      0.000       1.200       1.922
educ                   0.0369      0.025      1.462      0.144      -0.013       0.086
np.power(educ, 2)      0.0025      0.001      2.703      0.007       0.001       0.004
exper                  0.0276      0.004      6.903      0.000       0.020       0.035
np.power(exper, 2)    -0.0004   8.04e-05     -4.973      0.000      -0.001      -0.000
metro                  0.1355      0.038      3.544      0.000       0.060       0.211
female                -0.1556      0.030     -5.186      0.000      -0.214      -0.097
black                 -0.0668      0.055     -1.216      0.224      -0.175       0.041
======================================================================================

Just like in our previous model specification - the race indicator variable does not affect wage, so we remove it from the model:

R
Python

mdl_4 <- lm(formula = "log(wage) ~ educ + I(educ^2) + exper + I(exper^2) + metro + female", data = dt_train)
mdl_4 %>% summary() %>% coef() %>% round(4) %>% print()

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.5556     0.1840  8.4534   0.0000
educ          0.0367     0.0253  1.4545   0.1461
I(educ^2)     0.0025     0.0009  2.7293   0.0065
exper         0.0276     0.0040  6.9020   0.0000
I(exper^2)   -0.0004     0.0001 -4.9615   0.0000
metro         0.1334     0.0382  3.4912   0.0005
female       -0.1592     0.0299 -5.3320   0.0000

mdl_4_spec = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + metro + female", data = dt_train)
mdl_4 = mdl_4_spec.fit()
print(mdl_4.summary().tables[1])

======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              1.5556      0.184      8.453      0.000       1.194       1.917
educ                   0.0367      0.025      1.455      0.146      -0.013       0.086
np.power(educ, 2)      0.0025      0.001      2.729      0.006       0.001       0.004
exper                  0.0276      0.004      6.902      0.000       0.020       0.035
np.power(exper, 2)    -0.0004   8.05e-05     -4.962      0.000      -0.001      -0.000
metro                  0.1334      0.038      3.491      0.001       0.058       0.208
female                -0.1592      0.030     -5.332      0.000      -0.218      -0.101
======================================================================================

What would happen, if we check for collinearity in a model with interaction and polynomial terms? (click to expand)

Note that the relationship between educ and educ^2 is non-linear:

R
Python

p5 <- dt_train %>% 
  mutate(educ2 = educ^2) %>% 
  dplyr::select(educ, educ2) %>%
  ggplot(aes(x = educ, y = educ2)) +
  geom_line(color = "cornflowerblue") +  
  geom_point(color = "red") +
  theme_bw()
print(p5)

dt_train['educ2'] = dt_train['educ'] ** 2
p5 = (pl.ggplot(data = dt_train[['educ', 'educ2']], mapping = pl.aes(x = 'educ', y = 'educ2')) +\
      pl.geom_line(color = "cornflowerblue") +\
      pl.geom_point(color = "red") +\
      pl.theme_bw())
#
dt_train = dt_train.drop(columns= ['educ2'])
print(p5)

Nevertheless, correlation (i.e. a linear measure) is high:

R
Python

cor(dt_train['educ'], dt_train['educ']^2)

          educ
educ 0.9793619

np.corrcoef(dt_train['educ'], dt_train['educ']**2)

array([[1.        , 0.97936189],
       [0.97936189, 1.        ]])

As a result, if we were to check for multicollinearity - we would see that the \(\rm VIF\) is quite high:

R
Python

#
#
#
#
#
#
#
car::vif(mdl_4) %>% print()

      educ  I(educ^2)      exper I(exper^2)      metro     female 
 25.328928  24.925901  12.961817  13.212006   1.018394   1.027602

# VIF dataframe 
vif_data = pd.DataFrame() 
vif_data["variable"] = mdl_4_spec.exog_names
# calculating VIF for each feature 
vif_data["VIF"] = [variance_inflation_factor(mdl_4_spec.exog, i) for i in range(len(vif_data["variable"]))]
# drop intercept:
vif_data = vif_data[vif_data['variable'] != 'Intercept']
print(vif_data.T)

                  1                  2          3                   4         5         6
variable       educ  np.power(educ, 2)      exper  np.power(exper, 2)     metro    female
VIF       25.328928          24.925901  12.961817           13.212006  1.018394  1.027602

When including interaction and polynomial terms, the high VIF is deceptive, since the reason for such a high \(\rm VIF\) is the square of the same variable, but it is NOT a linear relationship!

This is why it is better to check for multicollinearity BEFORE including interaction and polynomial terms!

So, we can write down our final model as follows (coefficient standard errors are also provided in parenthesis): \[ \begin{aligned} \underset{(s.e.)}{\widehat{\log(wage_i)}} &= \underset{(0.1840)}{1.5556} + \underset{(0.0253)}{0.0367} \cdot educ_i + \underset{(0.0009)}{0.0025} \cdot educ_i^2 + \underset{(0.0040)}{0.0276} \cdot exper_i - \underset{(0.0001)}{0.0004} exper_i^2 \\ &+ \underset{(0.0382)}{0.1334} \cdot metro_i - \underset{(0.0299)}{0.1592} \cdot female_i \end{aligned} \]

We can interpret the coefficients as follows:

The \(metro\) coefficient shows that if a person is from a metropolitan area, their wage will be, on average, \(0.1334 \times 100 = 13.34\%\) higher, compared to people, who are not from a metropolitan area, ceteris paribus;
The \(female\) coefficient is negative, which indicates that there might be a pay gap between genders - the coefficient shows that if someone is female, their wage is around \(0.1592 \times 100 = 15.9+2\%\) lower, compared to someone who is not female, ceteris paribus.
Since education and exper have polynomial terms - their interpretation is a bit different. For example, the partial derivative with respect to educ is: \[ \dfrac{\partial \widehat{\log(wage_i)}}{\partial educ_i} = 0.0367 + 2 \times 0.0025 \times educ_i = 0.0367 + 0.005 \times educ_i \] This means that the effect of education now depends on the initial value of education.

Another way to look at it is to take two cases:

In the first case we have \(educ_i\), and predict \(\widehat{\log(wage_i)}^{(0)}\).
In the second case, we increase it by one (\(educ_i + 1\)) and calculate \(\widehat{\log(wage_i)}^{(1)}\). We assume that every other explanatory variable is the same (i.e. ceteris paribus).

Then the difference between the two predictions is: \[ \begin{aligned} \widehat{\log(wage_i)}^{(1)} - \widehat{\log(wage_i)}^{(0)} &= 0.0367 \times (educ_i+1) + 0.0025 \times (educ_i+1)^2 \\ &- (0.0367 \times educ_i + 0.0025 \times educ_i^2) \\ &= 0.0367 + + 0.0025 \times (2 \cdot educ_i + 1) \\ &= 0.0392 + 0.005 \times educ_i \end{aligned} \] In other words, if education increases by one year, then wage increases by \(100 \cdot \left[ 0.0367 + 0.0025 \cdot (2 \cdot educ + 1) \right] \%\). The effect is more pronounced with higher values of education.

For example if \(educ_1 = 10\) and \(educ_2 = 20\), then wage will increase by a larger amount for the individual, who has \(20\) years of education and obtains an additional year, compared to someone who went from \(10\) to \(11\) years of education.

Interpreting interaction effects (click to expand)

Assume that our model has interaction effects - how would we interpret the interaction coefficient?

For example, let’s say that our estimated regression model is:

R
Python

mdl_x <- lm(formula = "log(wage) ~ educ + I(educ^2) + exper + I(exper^2) + metro + south + female*educ", data = dt_train)
mdl_x %>% summary() %>% coef() %>% round(4) %>% print()

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.6205     0.1836  8.8272   0.0000
educ          0.0406     0.0251  1.6160   0.1064
I(educ^2)     0.0019     0.0009  2.0835   0.0375
exper         0.0286     0.0040  7.1727   0.0000
I(exper^2)   -0.0004     0.0001 -5.1857   0.0000
metro         0.1423     0.0380  3.7431   0.0002
south        -0.0809     0.0311 -2.6023   0.0094
female       -0.6067     0.1586 -3.8253   0.0001
educ:female   0.0312     0.0108  2.8839   0.0040

mdl_x_spec = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + metro + south + female*educ", data = dt_train)
mdl_x = mdl_x_spec.fit()
print(mdl_x.summary().tables[1])

======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              1.6205      0.184      8.827      0.000       1.260       1.981
educ                   0.0406      0.025      1.616      0.106      -0.009       0.090
np.power(educ, 2)      0.0019      0.001      2.083      0.037       0.000       0.004
exper                  0.0286      0.004      7.173      0.000       0.021       0.036
np.power(exper, 2)    -0.0004   8.01e-05     -5.186      0.000      -0.001      -0.000
metro                  0.1423      0.038      3.743      0.000       0.068       0.217
south                 -0.0809      0.031     -2.602      0.009      -0.142      -0.020
female                -0.6067      0.159     -3.825      0.000      -0.918      -0.295
female:educ            0.0312      0.011      2.884      0.004       0.010       0.052
======================================================================================

\[ \begin{aligned} \widehat{\log(wage)} &= {1.6205} + {0.0406} \cdot educ + {0.0019} \cdot educ^2 \\ &+ {0.0286} \cdot exper - {0.0004} \cdot exper^2 \\ &+ {0.1423} \cdot metro - {0.0809} \cdot south \\ &- {0.6067} \cdot female + {0.0312} \cdot \left(female \times educ\right) \end{aligned} \] To highlight the effect of gender, we can rearrange the above model as: \[ \begin{aligned} \widehat{\log(wage)} &= {1.6205} + {0.0406} \cdot educ + {0.0019} \cdot educ^2 \\ &+ {0.0286} \cdot exper - {0.0004} \cdot exper^2 \\ &+ {0.1423} \cdot metro - {0.0809} \cdot south \\ &+ \left({0.0312} \cdot educ - {0.6067}\right) \cdot female \end{aligned} \] A possible interpretation could be as follows: if the person is female then their \(\log(wage)\) differs by \((0.0312\cdot educ−0.6067\)), compared to the base non-female group, ceteris paribus.

By specifying this type model we can see how much education offsets discrimination based on gender - if \({0.0312} \cdot educ - {0.6067} = 0\), then having \(educ = 0.6067 / 0.0312 \approx 19.45\) years of education will offset gender-based discrimination.