We begin this notebook by loading the required libraries.
import numpy as np
import matplotlib.pyplot as plt
#
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import summary_table
import statsmodels.formula.api as smf
#
import scipy.stats as stats
import scipy.optimize as optimize
#
import pandas as pd
from datetime import date
#
today = str(date.today())
print("Last updated: " + today)
Dataset
As in the univariate case, we begin by loading the dataset:
dt4 = pd.read_csv("http://www.principlesofeconometrics.com/poe5/data/csv/cps5_small.csv")
It is always a good idea to get a general look at the data - to make sure that everything loaded correctly:
dt4.head()
Make sure that the data types assigned to each column are correct:
dt4.dtypes
We can also get some summary statistics:
dt4.describe()
Everything appears to be in order - we can move on to modelling.
In this example data, we have the following variables: `wage`, `black`, `educ`, `exper`, `faminc`, `female`, `metro`, `south`, `midwest` and `west`.
We will begin by plotting pairwise scatter-plots for the non-indicator variables:
pd.plotting.scatter_matrix(dt4[['educ','exper', 'faminc', 'wage']],
alpha = 0.2, figsize = (20, 15),
marker = "o",
hist_kwds = dict(edgecolor = "black", linewidth = 1, bins = 30),
edgecolor = "black")
plt.tight_layout()
plt.show()
Note that the diagonal elements are the histograms of the variables, while the upper and lower triangles of the plot matrix are the scatter-plots of the same variables. So, we will examine the diagonal plots and the plots in either the upper, or the lower, triangle.
From the plots we can say that:
- the `wage` and `faminc` data appear to be more scattered for larger values of `wage`;
- there appears to be a relationship between:
  - `educ` and `exper`;
  - `educ` and `faminc`;
  - `educ` and `wage`;
- the relationship between `exper` and `faminc` is not as clear from the plots.

We also see that the correlation between the explanatory variables themselves is weaker, compared to the correlation between `educ` and the remaining variables:
print(dt4[['educ', 'exper', 'faminc', 'wage']].corr())
We can also plot the scatter-matrix of the whole dataset:
pd.plotting.scatter_matrix(dt4, alpha = 0.2, figsize = (20, 15), marker = "o",
hist_kwds = dict(edgecolor = "black", linewidth = 1, bins = 30),
edgecolor = "black")
plt.tight_layout()
plt.show()
Though for indicator variables these plots do not show much.
We will quickly check whether the regional indicator variables provided cover all of the regions:
dt4[["south", "west", "midwest", "metro"]].head()
np.min(dt4["south"] + dt4["west"] + dt4["midwest"])
Since they do not always sum to one, we can include all of these variables in our model without falling into the dummy variable trap (otherwise we would need to exclude one regional indicator variable from the model and treat it as the base region).
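As a quick sketch (not in the original notebook), we can explicitly construct the implied base-region indicator - here called `other` - and check how many observations fall into it:
# Construct the implied base region ("other") and count the observations in it:
other = 1 - (dt4["south"] + dt4["west"] + dt4["midwest"])
print(other.value_counts())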
We can also look at their frequency table:
pd.crosstab(index = dt4["south"] + dt4["west"] + dt4["midwest"], columns="count")
Note that the maximum value is 1. If the maximum were 2, this would show that some of the variables indicate something other than the region.
For example, if we include the `metro` indicator variable:
pd.crosstab(index = dt4["south"] + dt4["west"] + dt4["midwest"] + dt4["metro"], columns="count")
We see that there are a number of rows that have a sum of 2, which means that `metro` indicates something other than the region.
In other words, the `south`, `west` and `midwest` regions will be compared to a base `OTHER` region.
We will begin by specifying the following model: $$ \begin{aligned} \log(wage) &= \beta_0 + \beta_1 educ + \beta_2 educ^2 \\ &+ \beta_3 exper + \beta_4 exper^2 + \beta_5 metro + \beta_6 south + \beta_7 west + \beta_8 midwest + \beta_9 female + \beta_{10} black \end{aligned} $$
We expect the following signs for the non-intercept coefficients:
- $\beta_1 > 0$: an additional year of education should increase `wage`;
- $\beta_2$: the effect of an additional year of education on `wage` may be lessened at higher education levels; however, if the additional year is for PhD-level education, then the additional year of education may have an increased effect on `wage`. For now, we will assume that is the case, i.e. $\beta_2 > 0$;
- $\beta_3 > 0$: an additional year of experience should increase `wage`;
- $\beta_4 < 0$: an additional year of experience should increase `wage`, but at a lower rate for someone with more initial years of experience. Note: `exper^2` enters alongside `exper`, so we do not know (as of right now) if there is a number of years of experience that results in a decrease in `wage`.

We assume that other family income, `faminc`, should not affect the wage of a person, i.e. we treat it as an insignificant variable, whose correlation may be spurious, and as such we do not include it in the model.
Note: it is possible to also include interaction variables in the model. As we already have 10 variables - we will skip this for now, but later on, we will examine some of these variables.
mdl_0 = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + metro + south + west + midwest + female + black", data = dt4)
mdl_0_fit = mdl_0.fit()
print(mdl_0_fit.summary())
P.S. If we specify our own power function, we can estimate the same model:
def poly_var(x, p):
return(np.array(x)**p)
#
print(smf.ols(formula = "np.log(wage) ~ educ + poly_var(educ, 2) + exper + poly_var(exper, 2) + metro + south + west + midwest + female + black", data = dt4).fit().summary())
Going back to our case, we see that:
- the coefficients of `educ`, `exper` and their squared values are as we would expect;
- the coefficient of `metro` is also as we would expect;
- people in the `south` region earn significantly less than people in all of the remaining regions (other + west + midwest); however, we can only check this if we remove the remaining regional variables and only leave the `south` variable;
- the coefficient of `female` is negative and significant, indicating possible discrimination in the work force (again, this is only the initial model, so we cannot be sure yet);
- the coefficient of `black` is negative but insignificant, indicating that there is no racial discrimination.

We want to separately test the hypothesis that a coefficient is significant: $$ H_0: \beta_j = 0\\ H_1: \beta_j \neq 0 $$ The test is automatically carried out and the $t$-statistic and $p$-values are presented in the model summary output.
Insignificant variables are those, whose $p$-value is greater than the 0.05 significance level:
print(mdl_0_fit.summary().tables[1])
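As a small illustrative sketch (not part of the original output), the reported $t$-statistic and $p$-value of a single coefficient, e.g. `educ`, can be reproduced manually from its estimate and standard error:
# Manually reproduce the t-test for H0: beta_educ = 0:
b_hat = mdl_0_fit.params["educ"]
se_b = mdl_0_fit.bse["educ"]
t_stat = b_hat / se_b
p_val = 2 * (1 - stats.t.cdf(np.abs(t_stat), df = mdl_0_fit.df_resid))
print("t-statistic = " + str(t_stat) + "; p-value = " + str(p_val))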
We will begin by removing the insignificant variable (i.e. where the $p$-value is greater than 0.05) with the largest $p$-value - in this case it is the indicator variable `black`.
mdl_1 = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + metro + south + west + midwest + female", data = dt4)
mdl_1_fit = mdl_1.fit()
print(mdl_1_fit.summary().tables[1])
Next up, we will remove the indicator variable `west`, and then the indicator variable `midwest`. We note that after doing so, the base group becomes `other` + `west` + `midwest`.
mdl_2 = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + metro + south + midwest + female", data = dt4)
mdl_2_fit = mdl_2.fit()
print(mdl_2_fit.summary().tables[1])
mdl_3 = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + metro + south + female", data = dt4)
mdl_3_fit = mdl_3.fit()
print(mdl_3_fit.summary().tables[1])
The insignificant indicator variables (`black`, `west` and `midwest`) have now been removed from the model.
From the model output we can write down the estimated model as:
$$ \begin{aligned} \underset{(se)}{\widehat{\log(wage)}} &= \underset{(0.167)}{1.5288} + \underset{(0.023)}{0.0452} \cdot educ + \underset{(0.001)}{0.0022} \cdot educ^2 \\ &+ \underset{(0.004)}{0.0287} \cdot exper - \underset{(0.0001)}{0.0004} \cdot exper^2 \\ &+ \underset{(0.034)}{0.1219} \cdot metro - \underset{(0.028)}{0.0640} \cdot south - \underset{(0.027)}{0.1788} \cdot female \end{aligned} $$
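As a side note (a small sketch, not in the original notebook), the estimates and standard errors quoted in the equation above can be printed side by side directly from the fitted model:
# Print the coefficient estimates and their standard errors used in the equation above:
print(pd.DataFrame({"coef": mdl_3_fit.params, "se": mdl_3_fit.bse}).round(4))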
Furthermore, we can interpret the variables in the following way:
- An increase in `educ` (education) by 1 year results in an increasing effect on `wage` (i.e. the positive effect on `wage` is further increased depending on the initial level of `educ`), ceteris paribus:
  - if initially `educ = 0` and it then increases by $1$ year, then `wage` increases by approximately $100 \cdot 0.0452 = 4.52\%$;
  - if initially `educ > 0` and it then increases by $1$ year, then the difference in $\log(wage)$ is:
$$
\begin{aligned}
\underset{(se)}{\widehat{\log(wage)}} \bigg|_{educ + 1} - \underset{(se)}{\widehat{\log(wage)}} \bigg|_{educ} &= 0.0452 + 0.0022 \cdot (educ + 1)^2 - 0.0022 \cdot educ^2\\
&= 0.0452 + 0.0022 \cdot (2 \cdot educ + 1)
\end{aligned}
$$
  then `wage` increases by approximately $100 \cdot \left[ 0.0452 + 0.0022 \cdot (2 \cdot educ + 1) \right] \%$.
- An increase in `exper` (years of experience) by 1 year results in a decreasing effect on `wage` (i.e. the positive effect on `wage` is decreased depending on the initial level of `exper`), ceteris paribus:
  - if initially `exper = 0` and it then increases by $1$ year, then `wage` increases by approximately $100 \cdot 0.0287 = 2.87\%$;
  - if initially `exper > 0` and it then increases by $1$ year, then the difference in $\log(wage)$ is:
$$
\begin{aligned}
\underset{(se)}{\widehat{\log(wage)}} \bigg|_{exper + 1} - \underset{(se)}{\widehat{\log(wage)}} \bigg|_{exper} &= 0.0287 - 0.0004 \cdot (exper + 1)^2 + 0.0004 \cdot exper^2\\
&= 0.0287 - 0.0004 \cdot (2 \cdot exper + 1)
\end{aligned}
$$
  then `wage` changes by approximately $100 \cdot \left[ 0.0287 - 0.0004 \cdot (2 \cdot exper + 1) \right] \%$.
- We may be interested in the value of `exper` that results in a maximum (or minimum) value of `wage`, ceteris paribus. Taking the partial derivative (i.e. calculating the marginal effect) and equating it to zero yields:
$$
\begin{aligned}
\dfrac{\partial {\widehat{\log(wage)}}}{\partial exper} &= 0.0287 - 0.0004 \cdot 2 \cdot exper = 0.0287 - 0.0008 \cdot exper = 0
\end{aligned}
$$
print(mdl_3_fit.params)
print("Maximum Wage when exper = " + str(mdl_3_fit.params[3] / (-mdl_3_fit.params[4] * 2)))
So, when $exper = 33.95$, `wage` will be at its maximum, since $\dfrac{\partial^2 {\widehat{\log(wage)}}}{\partial exper^2} < 0$.
Note that we used the exact estimates instead of the rounded values from the formulas, since the rounding error would give us a different value:
0.0287 / 0.0008
which would not be the same once we try to verify the results with the actual data.
We may also want to find the initial value of `exper` for which an additional unit increase in `exper` results in no change in `wage`:
$$
\begin{aligned}
0.0287 - 0.0004 \cdot (2 \cdot exper + 1) &= 0
\end{aligned}
$$
zero_inc = (mdl_3_fit.params[3] / (-mdl_3_fit.params[4]) - 1) / 2
print(zero_inc)
So, if the initial value of $exper = 33.45$, then `wage` would not change with an additional unit increase in `exper`. Remember that `exper` can only be integer-valued.
Furthermore, for $exper > 33.45$, `wage` would decrease from a unit increase in `exper`. We can verify this by taking the initial value `exper = 36`:
# Repeat the first row twice:
tst_dt = dt4.iloc[[0,0], :]
# Reset row index numbering to avoid duplicate row index numbers
tst_dt = tst_dt.reset_index(drop = True)
# Set `exper` column values:
tst_dt.loc[:, "exper"] = [36, 37]
# Print the sample data:
print(tst_dt)
Note the ceteris paribus condition - there is only a unit change in `exper`, while the remaining values do not change.
Now, we can calculate and compare the predicted values:
tst_pred = mdl_3_fit.predict(tst_dt)
print(tst_pred)
np.diff(tst_pred)
which would be a $0.215\%$ decrease in wage for someone with 36 years of experience earning an additional year of experience.
This can be verified manually, by taking the exponents of the predicted values (since these predicted values are from the log-linear model) and calculating their percentage change: $$ \dfrac{WAGE_{NEW} - WAGE_{OLD}}{WAGE_{OLD}} \cdot 100 $$
np.diff(np.exp(tst_pred)) / np.exp(tst_pred[0]) * 100
We note that this is an approximation only, and not a true equality between logarithm differences and percentage change.
It is very close when the percentage change is small, but for larger percentage changes, it may differ greatly.
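To illustrate this point (a small sketch, not from the original notebook), we can compare the log-difference approximation with the exact percentage change for several magnitudes:
# Compare the approximation 100 * dlog with the exact percentage change 100 * (exp(dlog) - 1):
for dlog in [0.01, 0.1, 0.5]:
    print("log-diff: " + str(dlog) + "; approx: " + str(round(100 * dlog, 3)) + "%; exact: " + str(round((np.exp(dlog) - 1) * 100, 3)) + "%")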
We can also verify that `wage` increases from a unit increase in `exper` when the initial value is below the turning point (here we take `exper = 33`):
tst_dt.loc[:, "exper"] = [33, 34]
print(tst_dt)
tst_pred = mdl_3_fit.predict(tst_dt)
print(tst_pred)
np.diff(tst_pred)
If we ignore the fact that `exper` is only integer-valued, then we can also verify the point where `wage` does not change from a unit increase in `exper`:
tst_dt.loc[:, "exper"] = [zero_inc, zero_inc + 1]
print(tst_dt)
tst_pred = mdl_3_fit.predict(tst_dt)
print(tst_pred)
np.diff(tst_pred)
Regarding the indicator variables:
- someone from a metro area (`metro = 1`) earns around $100 \cdot 0.1219 = 12.19\%$ more than someone not from a metro area, ceteris paribus;
- someone from the south (`south = 1`) earns around $100 \cdot 0.0640 = 6.4\%$ less (since the coefficient is $-0.0640 < 0$) than someone not from the south, ceteris paribus;
- if the person is female (`female = 1`), then they earn around $100 \cdot 0.1788 = 17.88\%$ less than someone who is not female, ceteris paribus.

Next, we examine the residual plots of our model:
fig = plt.figure(num = 2, figsize = (10, 8))
# Plot fitted vs residual plots:
ax = fig.add_subplot(2, 2, 1)
ax.plot(mdl_3_fit.fittedvalues, mdl_3_fit.resid, linestyle = "None", marker = "o", markeredgecolor = "black")
# Plot the residual histogram
ax = fig.add_subplot(2, 2, 2)
ax.hist(mdl_3_fit.resid, bins = 30, edgecolor = "black")
# Plot the residual Q-Q plot:
ax = fig.add_subplot(2, 1, 2)
stats.probplot(mdl_3_fit.resid, dist = "norm", plot = ax)
# Fix layout in case the labels do overlap:
plt.tight_layout()
plt.show()
From these plots we can get a first impression of the residual behaviour. To be sure, we now move on to formally testing a few hypotheses.
The hypothesis that we want to test is: $$ \begin{cases} H_0&: \gamma_1 = 0 \text{ (residuals are homoskedastic)}\\ H_1&: \gamma_1 \neq 0 \text{ (residuals are heteroskedastic)} \end{cases} $$
We will begin with the Breusch-Pagan Test:
import statsmodels.stats.diagnostic as sm_diagnostic
#
bp_test = sm_diagnostic.het_breuschpagan(resid = mdl_3_fit.resid,
exog_het = pd.DataFrame(mdl_3.exog, columns = mdl_3.exog_names))
print(bp_test)
The BP test in `Python` returns the values in the following order:
- `lm` - the Lagrange multiplier statistic;
- `lm_pvalue` - the p-value of the Lagrange multiplier test;
- `fvalue` - the F-statistic of the hypothesis that the error variance does not depend on x;
- `f_pvalue` - the p-value for the F-statistic.

We are interested in the $LM$ statistic and its associated p-value, so we need the second element in the array. We have that the $p$-value < 0.05, so we reject the null hypothesis that the residuals are homoskedastic, which means that the residuals are heteroskedastic.
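For readability (a small sketch, assuming `bp_test` from the code above), the returned tuple can be unpacked into named values:
# Unpack the BP test output into named values:
bp_lm, bp_lm_pvalue, bp_fvalue, bp_f_pvalue = bp_test
print("LM statistic = " + str(bp_lm) + "; LM p-value = " + str(bp_lm_pvalue))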
Next, we look at the Goldfeld-Quandt Test results:
# Goldfeld–Quandt Test
print(sm_diagnostic.het_goldfeldquandt(y = mdl_3_fit.model.endog, x = mdl_3_fit.model.exog, alternative = "two-sided"))
The $p$-value > 0.05, so we have no grounds to reject the null hypothesis and conclude that the residuals are homoskedastic.
Finally, we look at the White Test results:
# White Test
print(sm_diagnostic.het_white(resid = mdl_3_fit.resid, exog = mdl_3_fit.model.exog))
The White test returns the results in the same order as the BP test. The $p$-value = 0.00026269798 < 0.05, so we reject the null hypothesis and conclude that the residuals are heteroskedastic.
The hypothesis that we want to test is: $$ \begin{cases} H_0&:\text{the errors are serially uncorrelated}\\ H_1&:\text{the errors are autocorrelated (the exact order of the autocorrelation depends on the test carried out)} \end{cases} $$
We will begin with the Durbin-Watson Test, where the alternative hypothesis is that the autocorrelation is of order 1:
import statsmodels.stats.stattools as sm_tools
# Durbin–Watson Test
print(sm_tools.durbin_watson(mdl_3_fit.resid))
The DW statistic is close to 2, so we do not reject the null hypothesis that there is no serial correlation.
Next up is the Breusch-Godfrey Test, where we can select the autocorrelation order ourselves. We have selected a 2nd order autocorrelation:
print(sm_diagnostic.acorr_breusch_godfrey(mdl_3_fit, nlags = 2))
The BG test returns the values in the same order as the BP test. The $p$-value = 0.901018 > 0.05, so we have no grounds to reject the null hypothesis of no autocorrelation.
We could also test with higher-order autocorrelation and examine the results; let's try with orders up to 20:
for i in range(2, 21):
print("BG Test for autocorrelation order = "+str(i)+"; p-value = " + str(np.round(sm_diagnostic.acorr_breusch_godfrey(mdl_3_fit, nlags = i)[1], 5)))
As we can see, we have no grounds to reject the null hypothesis of no autocorrelation in any of the cases.
The hypothesis that we want to test is: $$ \begin{cases} H_0&:\text{residuals follow a normal distribution}\\ H_1&:\text{residuals do not follow a normal distribution} \end{cases} $$
We will carry out the following tests and combine their $p$-values into a single output:
norm_tests = ["Anderson-Darling",
"Shapiro-Wilk",
"Kolmogorov-Smirnov",
"Cramer–von Mises",
"Jarque–Bera"]
import skgof as skgof
#
norm_test = pd.DataFrame()
norm_test["p_value"] = [
sm_diagnostic.normal_ad(x = mdl_3_fit.resid)[1],
stats.shapiro(x = mdl_3_fit.resid)[1],
sm_diagnostic.kstest_normal(x = mdl_3_fit.resid, dist = "norm")[1],
skgof.cvm_test(data = mdl_3_fit.resid, dist = stats.norm(0, np.sqrt(np.var(mdl_3_fit.resid))))[1],
sm_tools.jarque_bera(mdl_3_fit.resid)[1]
]
norm_test["Test"] = norm_tests
print(norm_test)
We see that the $p$-value is less than the $5\%$ significance level for the Anderson-Darling, Shapiro-Wilk and Kolmogorov-Smirnov tests, where we would reject the null hypothesis of normality. On the other hand the $p$-value is greater than 0.05 for Cramer-von Mises and Jarque-Bera tests, where we would not reject the null hypothesis of normality.
As indicated in the lecture notes, the Shapiro-Wilk test has the best power for a given significance level; furthermore, 3 out of 5 tests suggest non-normal residuals, so we will go with their results.
OVERALL CONCLUSIONS:
Assumption (MR.5) is related to multicollinearity and will be examined in a later TASK. But from what we have seen so far, almost all of the coefficients are statistically significant, with correct signs. Furthermore, since we were able to estimate the model via OLS, there is no exact collinearity (i.e. no exact linear dependence) between the regressors. So, there may be no collinear variables in our model.
If we look back at our final univariate regression model - the log-linear model: $$ \underset{(se)}{\widehat{\log(\text{wage})}} = \underset{(0.0702)}{1.5968} + \underset{(0.0048)}{0.0988} \cdot \text{educ} $$ We can estimate it here as well, and re-examine its residuals:
lm_univar = smf.ols(formula = "np.log(wage) ~ educ", data = dt4)
lm_univar_fit = lm_univar.fit()
print(lm_univar_fit.summary2().tables[1])
fig = plt.figure(num = 3, figsize = (10, 8))
# Plot fitted vs residual plots:
ax = fig.add_subplot(2, 2, 1)
ax.plot(lm_univar_fit.fittedvalues, lm_univar_fit.resid, linestyle = "None", marker = "o", markeredgecolor = "black")
# Plot the residual histogram
ax = fig.add_subplot(2, 2, 2)
ax.hist(lm_univar_fit.resid, bins = 30, edgecolor = "black")
# Plot the residual Q-Q plot:
ax = fig.add_subplot(2, 1, 2)
stats.probplot(lm_univar_fit.resid, dist = "norm", plot = ax)
# Fix layout in case the labels do overlap:
plt.tight_layout()
plt.show()
Compared to the univariate model:
Again, note that the fitted values are on the horizontal axis, which also highlights another interesting point regarding the range of attainable fitted values in these models.
Looking at the residual vs fitted value plot, the number of fitted values greater than 3.2 but less than 3.35 is:
len(lm_univar_fit.fittedvalues[np.logical_and(lm_univar_fit.fittedvalues > 3.2, lm_univar_fit.fittedvalues < 3.35)])
len(mdl_3_fit.fittedvalues[np.logical_and(mdl_3_fit.fittedvalues > 3.2, mdl_3_fit.fittedvalues < 3.35)])
By using the multivariate regression model specification, we now get fitted values which are more evenly scattered across their interval, whereas in the univariate case we had fitted values clustered along a limited range.
To make everything easier to follow, we will examine the interaction terms one-by-one, so as not to overwhelm the model with too many variables.
mdl_4 = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + metro + south + female*black", data = dt4)
mdl_4_fit = mdl_4.fit()
print(mdl_4_fit.summary2())
Both `black` and `female * black` are insignificant, so we can remove them from the regression.
Alternatively, we may want to carry out an $F$-test to test the joint hypothesis that: $$ \begin{cases} H_0&: \beta_{female} = 0, \beta_{black} = 0, \beta_{female \times black} = 0\\ H_1&: \text{at least one of the tested parameters is not zero} \end{cases} $$
If we fail to reject the null hypothesis, then both race and gender have no significant effect on the model.
print(mdl_4_fit.f_test("female=0, black=0, female:black=0"))
Since the $p$-value < 0.05, we reject the null hypothesis and conclude that at least one of the variables is statistically significant.
If we only look at the joint hypothesis for `black` and `female:black`:
print(mdl_4_fit.f_test("black=0, female:black=0"))
Then we do not reject the null hypothesis that both `black` and `female:black` are not statistically significant, and thus we can remove them both from our model.
We can also do this with an ANOVA test, by specifying the restricted model under the null:
mdl_4_restricted = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + metro + south", data = dt4)
mdl_4_restricted_fit = mdl_4_restricted.fit()
print(sm.stats.anova_lm(mdl_4_restricted_fit, mdl_4_fit))
mdl_4_restricted = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + metro + south + female", data = dt4)
mdl_4_restricted_fit = mdl_4_restricted.fit()
print(sm.stats.anova_lm(mdl_4_restricted_fit, mdl_4_fit))
We see that we get the exact same $F$-statistic and the exact same $p$-value. So, we can use either method to carry out the $F$-test for multiple coefficient significance (i.e. multiple restrictions).
Note: in case of a `RuntimeWarning` - these specific `RuntimeWarnings` come from `scipy.stats.distributions` and are "by design"; in `statsmodels` these "invalid" `RuntimeWarnings` should not cause problems.
Since the coefficient of `female` is negative, it would be interesting to see whether a higher education has a different effect on `wage` based on a person's gender:
mdl_4 = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + metro + south + female*educ", data = dt4)
mdl_4_fit = mdl_4.fit()
print(mdl_4_fit.summary2())
We note that the $p$-value of `educ` is close to 0.05. On the other hand, the interaction variable between gender and education is significant (as well as the squared education, $educ^2$), so we will leave the variables included. We will further motivate this decision via the $F$-test.
Looking at the $F$-test for the hypothesis: $$ \begin{cases} H_0&: \beta_{educ} = 0, \beta_{female \times educ} = 0\\ H_1&: \text{at least one of the tested parameters is not zero} \end{cases} $$
print(mdl_4_fit.f_test("educ=0, female:educ=0"))
The $p$-value is less than 0.05, so we reject the null hypothesis and conclude that at least one variable is statistically significant.
However, removing only `educ` but leaving the interaction term would further complicate interpretation, especially since its $p$-value is so close to the $5\%$ significance level. If we relax the significance level, then all the variables are statistically significant at the 0.1 ($10\%$) significance level.
INTERPRETATION:
Looking at the model coefficients: $$ \begin{aligned} \underset{(se)}{\widehat{\log(wage)}} &= \underset{(0.1688)}{1.5876} + \underset{(0.0227)}{0.0446} \cdot educ + \underset{(0.0008)}{0.0019} \cdot educ^2 \\ &+ \underset{(0.0036)}{0.0289} \cdot exper - \underset{(0.0001)}{0.0004} \cdot exper^2 \\ &+ \underset{(0.0345)}{0.1254} \cdot metro - \underset{(0.0280)}{0.0653} \cdot south \\ &- \underset{(0.1391)}{0.4903} \cdot female + \underset{(0.0095)}{0.0217} \cdot \left(female \times educ\right) \end{aligned} $$
or, with a little bit of rearranging, to highlight the effect of gender, we get: $$ \begin{aligned} \underset{(se)}{\widehat{\log(wage)}} &= \underset{(0.1688)}{1.5876} + \underset{(0.0227)}{0.0446} \cdot educ + \underset{(0.0008)}{0.0019} \cdot educ^2 \\ &+ \underset{(0.0036)}{0.0289} \cdot exper - \underset{(0.0001)}{0.0004} \cdot exper^2 \\ &+ \underset{(0.0345)}{0.1254} \cdot metro - \underset{(0.0280)}{0.0653} \cdot south \\ &+ \left(\underset{(0.0095)}{0.0217} \cdot educ - \underset{(0.1391)}{0.4903}\right) \cdot female \end{aligned} $$
a possible interpretation could be as follows: if the person is female, then their $\log(wage)$ differs by $\left(\underset{(0.0095)}{0.0217} \cdot educ - \underset{(0.1391)}{0.4903}\right)$, compared to males (or the base non-female group), ceteris paribus.
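To make this concrete (a small sketch using the rounded coefficients quoted above; the exact values are available in `mdl_4_fit.params`), we can evaluate the estimated wage difference for females at several education levels:
# Approximate % difference in wage for females at different education levels (rounded coefficients):
for educ_val in [0, 10, 16, 20]:
    gap = 0.0217 * educ_val - 0.4903
    print("educ = " + str(educ_val) + "; approx. wage difference for females: " + str(round(100 * gap, 2)) + "%")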
By specifying this model we can see how much education offsets discrimination based on gender. Notice that in this case, if `educ = 0`, then there is a large difference in wage - the wage is lower by around $100 \cdot 0.4903 = 49.03 \%$ for females.
HOWEVER, if we look at the sample data:
dt4.loc[dt4["educ"] == 0]
We only have two cases when `educ = 0` - ONE FOR FEMALES and ONE FOR MALES. Looking at the difference:
(12.50 - 9.19)/9.19
it is around $36\%$; however, other factors, like `metro`, `south` and `exper`, are different, while the coefficient in our model holds these values constant (i.e. the same), with only gender being different (this explains the $49.03\%$ value in our model).
Having so few data points does not adequately reflect the case when `educ = 0`, hence we should be careful when interpreting this coefficient. Next, we check whether education has a different effect depending on the region, by including an interaction between `south` and `educ`:
mdl_4 = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + metro + south*educ + female*educ", data = dt4)
mdl_4_fit = mdl_4.fit()
print(mdl_4_fit.summary().tables[1])
We see that the interaction variable between `south` and `educ` is insignificant, so we will not include it in our model. Instead, we will examine the interaction between `metro`, `female` and `educ`:
mdl_4 = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + south + metro*female*educ", data = dt4)
mdl_4_fit = mdl_4.fit()
print(mdl_4_fit.summary().tables[1])
The $F$-test for the joint significance for education significance: $$ \begin{cases} H_0&: \beta_{educ} = \beta_{educ^2} = \beta_{female \times educ} = \beta_{metro \times educ} = \beta_{metro \times female \times educ}= 0\\ H_1&: \text{at least one of the tested parameters is not zero} \end{cases} $$
print(mdl_4_fit.params.index.format())
print(mdl_4_fit.f_test("educ=0, np.power(educ, 2) = 0, female:educ=0, metro:educ=0, metro:female:educ=0"))
With $p$-value < 0.05, we reject the null hypothesis and conclude that `educ` is statistically significant in our model.
On the other hand, we could remove the squared value of `educ`; we will examine this in more detail in the collinearity task.
Furthermore, testing the significance of only the $educ$ and its polynomial $educ^2$: $$ \begin{cases} H_0&: \beta_{educ} = \beta_{educ^2} = 0\\ H_1&: \text{at least one of the tested parameters is not zero} \end{cases} $$ yields:
print(mdl_4_fit.f_test("educ=0, np.power(educ, 2) = 0"))
a $p$-value < 0.05, which means that we still reject the null hypothesis and conclude that education has a significant effect on `wage`.
Finally, the $R^2_{adj}$ is:
print(mdl_4_fit.rsquared_adj)
Interaction terms are not restricted to indicator variables - we can include interactions where BOTH variables are non-indicators.
Consequently, let us look at yet another interaction variable, this time between `educ` and `exper`.
The motivation for including this interaction variable can be formulated as a question: does an additional year of experience have a different effect on wage depending on the level of education (and vice versa)?
In other words, we want to include an additional variable, $educ \times exper$, in our model:
mdl_4 = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + south + metro*female*educ + educ:exper", data = dt4)
mdl_4_fit = mdl_4.fit()
print(mdl_4_fit.summary().tables[1])
The coefficient of the interaction term `educ:exper` is statistically significant ($p$-value < 0.05).
INTERPRETATION:
This means that we can write our model as (note, we will keep a general notation to make it easier to see what we want to explain): $$ \begin{aligned} \log(wage) &= \beta_0 + \beta_1 educ + \beta_2 educ^2 + \beta_3 exper + \beta_4 exper^2 \\ &+ \beta_5 metro + \beta_6 south + \beta_7 west + \beta_8 midwest + \beta_9 female + \beta_{10} black \\ &+ \beta_{11} \left( educ \times exper \right) + \epsilon \end{aligned} $$
We can re-write this equation as: $$ \begin{aligned} \log(wage) &= \beta_0 + \left(\beta_1 + \beta_{11} exper \right)educ + \beta_2 educ^2 + \beta_3 exper + \beta_4 exper^2 \\ &+ \beta_5 metro + \beta_6 south + \beta_7 west + \beta_8 midwest + \beta_9 female + \beta_{10} black \\ &+ \epsilon \end{aligned} $$
or, taking the partial derivative with respect to $educ$, as:
$$
\dfrac{\partial \log(wage)}{\partial educ} = \beta_1 + 2 \beta_2 \cdot educ + \beta_{11} \cdot exper
$$
So, the coefficient $\beta_{11}$ can be interpreted as the change in effectiveness of education for a one unit increase in experience.
Alternatively, rewriting the equation as: $$ \begin{aligned} \log(wage) &= \beta_0 + \beta_1 educ + \beta_2 educ^2 + \left( \beta_3 + \beta_{11} educ \right) exper + \beta_4 exper^2 \\ &+ \beta_5 metro + \beta_6 south + \beta_7 west + \beta_8 midwest + \beta_9 female + \beta_{10} black \\ &+ \epsilon \end{aligned} $$
In this case, the coefficient $\beta_{11}$ can be interpreted as the change in effectiveness of experience for a one unit increase in education.
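As a sketch (not in the original notebook; the term names follow the `patsy` names in `mdl_4_fit.params`), we can evaluate the marginal effect of education for the base group (`metro = 0`, `female = 0`) at a fixed education level and several experience levels:
# Marginal effect of educ for the base group (metro = 0, female = 0), evaluated at educ = 12:
b = mdl_4_fit.params
for exper_val in [0, 10, 20, 30]:
    me = b["educ"] + 2 * b["np.power(educ, 2)"] * 12 + b["educ:exper"] * exper_val
    print("exper = " + str(exper_val) + "; d log(wage) / d educ = " + str(round(me, 4)))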
We would also like to point out that the $R^2_{adj}$ of the new model is slightly larger than that of the previous model:
print(mdl_4_fit.rsquared_adj)
We do note one more result: the square of `educ` is now insignificant - `np.power(educ, 2)` has a $p$-value of 0.635, in which case we would not reject the null hypothesis that its coefficient is zero.
Let us drop this squared variable and compare the $R_{adj}^2$, AIC and BIC values.
The unrestricted model:
print(mdl_4_fit.summary2())
The restricted model:
mdl_4_R = smf.ols(formula = "np.log(wage) ~ educ + exper + np.power(exper, 2) + south + metro*female*educ + educ:exper", data = dt4)
mdl_4_R_fit = mdl_4_R.fit()
print(mdl_4_R_fit.summary2())
While the coefficient of `educ` is now significant, we see that the adjusted $R^2$ is unchanged, and the AIC and BIC are slightly lower (indicating a slightly better model).
All in all, dropping the variable does not appear to yield any noticeable improvement.
In such a case it is useful to compare how the remaining coefficient estimates change between the unrestricted and restricted models.
The relevant coefficients, which we want to compare, are:
coef_mat = pd.DataFrame()
coef_mat["COEFS"] = mdl_4.exog_names
coef_mat["UNRESTRICTED"] = np.array(mdl_4_fit.params)
coef_mat["RESTRICTED"] = np.insert(np.array(mdl_4_R_fit.params), 2, np.nan)
coef_mat["CHANGE (%)"] = (coef_mat["RESTRICTED"].values / coef_mat["UNRESTRICTED"].values - 1) * 100
#
print(coef_mat)
We see that the `educ` coefficient value is affected the most - increasing by around $17\%$ - while the remaining parameters (excluding the intercept) increased by between $0.17\%$ and $6.5\%$.
Generally, we may want to remove the insignificant variables. However, before deciding on the removal of this variable, let us examine whether any linear restrictions can be applied.
Maybe re-estimating the coefficients via RLS would improve the significance of the squared `educ` variable in our model?
On the other hand, looking at the coefficient signs and magnitudes of `educ` and `exper`, we may want to verify the following hypothesis:
$$ \text{Each additional year of education has the same effect as each additional year of experience on }\log(wage) $$
Note that this concerns not only `educ` and `exper`, but their polynomial terms as well!
This restriction can be formulated as the following hypothesis: $$ \begin{cases} H_0&: \beta_{educ} = \beta_{exper},\text{ and } \beta_{educ^2} = \beta_{exper^2}\\\\ H_1&: \beta_{educ} \neq \beta_{exper}\text{ or } \beta_{educ^2} \neq \beta_{exper^2} \text{ or both} \end{cases} $$
Note that in TASK 8 we have already carried out a number of multiple restriction tests, but we simply tested whether multiple parameters are significant or not; we did not test whether some parameters are statistically significantly identical to one another.
print(mdl_4_fit.f_test("educ-exper=0, np.power(educ, 2) - np.power(exper, 2)=0"))
So, we reject the null hypothesis and conclude that education and experience have different effects on $\log(wage)$.
Nevertheless, we may still be interested to test if the non-squared coefficients are equal, that is: $$ \begin{cases} H_0&: \beta_{educ} = \beta_{exper}\\\\ H_1&: \beta_{educ} \neq \beta_{exper} \end{cases} $$
Note that in this case there is less economic reasoning for this restriction, since we are ignoring their polynomial terms.
print(mdl_4_fit.f_test("educ-exper=0"))
In this case we do not reject the null hypothesis that the coefficients are equal.
This conclusion allows us to re-estimate the regression via restricted least squares (RLS).
In order to re-estimate the model via RLS in `Python`, we need to specify our model as a Generalized Linear Model (GLM). This is pretty straightforward:
mdl_4_rls = sm.GLM.from_formula(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + south + metro*female*educ + educ:exper", data = dt4)
mdl_4_rls_fit = mdl_4_rls.fit()
print(mdl_4_rls_fit.summary().tables[1])
We see that the output table is pretty much identical. GLM maximizes the likelihood function in order to estimate the model, rather than using the exact OLS expression.
Now, we can apply the linear restriction as follows:
mdl_4_rls_fit = mdl_4_rls.fit_constrained("educ - exper = 0")
print(mdl_4_rls_fit.summary().tables[1])
Or alternatively, via the restriction matrices, as defined in the lecture notes:
L = [[0, 1, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
r = [0]
print(mdl_4_rls.fit_constrained((L, r)).summary().tables[1])
We can see from the output that the coefficients are now equal.
Furthermore, a consequence of RLS is that the associated standard errors are smaller. Consequently, np.power(educ, 2) is now significant.
We will calculate the Variance Inflation Factor (VIF) for each parameter (note that we do not calculate the VIF for the intercept).
import statsmodels.stats as smstats
#
vif = pd.DataFrame()
vif_out = np.array([])
for i in range(1, mdl_4.exog.shape[1]):
tmp_val = smstats.outliers_influence.variance_inflation_factor(mdl_4.exog, i)
vif_out = np.append(vif_out, [tmp_val])
#
vif["VIF Factor"] = vif_out
vif["Variable"] = mdl_4.exog_names[1:]
#
print(vif)
Alternatively, a more compact way of using a `for` loop:
[smstats.outliers_influence.variance_inflation_factor(mdl_4.exog, i) for i in range(1, mdl_4.exog.shape[1])]
A couple of points regarding high VIF values for polynomial and indicator variables:
- high VIF values for a variable, its powers and its interaction terms are to be expected, since they are functions of one another (this is sometimes called structural multicollinearity);
- this kind of multicollinearity can be reduced by centering the variables (i.e., subtracting their means) before creating the powers or the products.

So, in our case, we see that the interaction terms and indicator variables are taken for all variable combinations. Nevertheless, we may be interested in checking whether `educ` and `exper` are collinear.
To do this, we can either define the regression model without any interaction or polynomial variables, or specify the auxiliary regressions manually. We will define a new regression to save some space, but you are encouraged to verify the VIF values by calculating them manually (i.e. without the built-in VIF functions) - see the sketch after the output below.
mdl_small = smf.ols(formula = "np.log(wage) ~ educ + exper + south + metro + female", data = dt4)
#
vif_small = pd.DataFrame()
vif_small["VIF Factor"] = [smstats.outliers_influence.variance_inflation_factor(mdl_small.exog, i) for i in range(1, mdl_small.exog.shape[1])]
vif_small["Variable"] = mdl_small.exog_names[1:]
#
print(vif_small)
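For completeness, here is a sketch of the manual VIF calculation mentioned above (an illustration, not part of the original notebook): regress one of the regressors, e.g. `educ`, on the remaining exogenous regressors and compute $VIF = 1 / (1 - R_j^2)$ from the auxiliary regression.
# Auxiliary regression of educ on the remaining exogenous regressors:
aux_fit = smf.ols(formula = "educ ~ exper + south + metro + female", data = dt4).fit()
print("Manual VIF for educ: " + str(1.0 / (1.0 - aux_fit.rsquared)))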
Note that from the definition of $VIF$, the regression for `wage` itself does not matter - we are using the design matrix to estimate a model for the exogenous regressors, but we only want to use those exogenous regressors which we want to include in our final model.
From the results we see that these variables are NOT collinear. The collinearity only appears from the inclusion of polynomial and interaction variables and is not a cause for concern.
In this specific case, the "no collinearity" result initially appears strange, since from the variable definitions for this dataset we have that:
- `educ` - years of education;
- `exper` - potential experience = age - educ - 6.

So, we would expect that `educ` and `exper` are collinear. We will examine this in more detail right now.
We will begin by creating a new dataset, containing an additional `age` variable, since we do not want to modify the original dataset:
dt4_new = dt4
dt4_new = dt4_new.assign(age = dt4_new["exper"] + dt4_new["educ"] + 6)
dt4_new.head()
If we were to calculate the correlation between these variables:
dt4_new[["educ", "exper", "age"]].corr()
We would find that:
- the correlation between `educ` and `age` is very small;
- the correlation between `educ` and `exper` is around -0.2 - while small, it may be somewhat significant;
- the correlation between `exper` and `age` is very large.

So, it may very well be that `age` and `exper` are collinear, but not `exper` and `educ`. In other words:
- `exper` - the potential experience (from the definition: years spent not in education, presumably spent working) - is primarily driven by one's age;
- since `exper` and `age` are highly correlated - and from the definition of `exper` - we should be able to use `age` as a proxy variable (i.e. a substitute, or, as we will later see, an instrumental variable) for `exper`;
- consequently, we should include either the `exper` or the `age` variable in our model, but not both.

We can very easily verify this by replacing `exper` with `age`:
mdl_4_age = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + age + np.power(age, 2) + south + metro*female*educ + educ:age", data = dt4_new)
mdl_4_age_fit = mdl_4_age.fit()
print(mdl_4_age_fit.summary().tables[1])
Comparing the coefficients with the previous model:
print(mdl_4_fit.summary().tables[1])
We see that, because `exper` and `age` are highly correlated, the coefficients of `age` and `age^2` are very similar to those of `exper` and `exper^2` in terms of value, sign and significance ($t$ and $p$ values).
On the other hand, because `educ` and `age` have a very small correlation, the coefficient of the interaction term `educ:age` is insignificant.
Furthermore, if we were to replace `educ` with `exper`, then, since `exper` and `age` are highly correlated, we should run into a collinearity problem:
mdl_4_collin = smf.ols(formula = "np.log(wage) ~ exper + np.power(exper, 2) + age + np.power(age, 2) + south + metro*female*exper + exper:age", data = dt4_new)
mdl_4_collin_fit = mdl_4_collin.fit()
print(mdl_4_collin_fit.summary().tables[1])
What we immediately notice (compared with `mdl_4_age_fit`) is that, by replacing this one variable in the model:
- the coefficient of `exper` is negative (more experience leads to a smaller wage, which is questionable);
- the coefficients (and significance) of `metro`, `age^2` and `metro:female` changed.

Furthermore, if we were to carry out an $F$-test to check the overall significance (it is immediately available in one of the model output tables):
print(mdl_4_collin_fit.summary().tables[0])
With $p$-value = 4.44e-106 < 0.05, we reject the null hypothesis that all of the coefficients (except the intercept) are insignificant, while the individual $t$-statistics and their associated $p$-values indicate that almost all of the coefficients are insignificant.
If we were to examine the VIF of the parameters from this model:
mdl_small = smf.ols(formula = "np.log(wage) ~ exper + age + south + metro + female", data = dt4_new)
#
vif_small = pd.DataFrame()
vif_small["VIF Factor"] = [smstats.outliers_influence.variance_inflation_factor(mdl_small.exog, i) for i in range(1, mdl_small.exog.shape[1])]
vif_small["Variable"] = mdl_small.exog_names[1:]
#
print(vif_small)
We see that `exper` and `age` are highly collinear. If we were to further include `educ`, then we would have perfect multicollinearity, which would result in a warning:
mdl_small = smf.ols(formula = "np.log(wage) ~ educ + exper + age + south + metro + female", data = dt4_new)
#
vif_small = pd.DataFrame()
vif_small["VIF Factor"] = [smstats.outliers_influence.variance_inflation_factor(mdl_small.exog, i) for i in range(1, mdl_small.exog.shape[1])]
vif_small["Variable"] = mdl_small.exog_names[1:]
#
print(vif_small)
Since $R_j^2$ would be 1, for $j \in \{educ,\ exper,\ age\}$, then $VIF = \dfrac{1}{1 - R_j^2} = \dfrac{1}{0} = \infty$
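We can illustrate this with a small sketch (not in the original notebook): the auxiliary regression of `age` on `educ` and `exper` has an $R_j^2$ of (numerically) one, so the corresponding VIF is effectively infinite:
# Auxiliary regression for age: R^2 is (numerically) 1, so VIF = 1 / (1 - R^2) blows up:
aux_r2 = smf.ols(formula = "age ~ educ + exper", data = dt4_new).fit().rsquared
print("Auxiliary R^2 for age: " + str(aux_r2))
print("VIF for age: " + str(np.inf if aux_r2 >= 1 else 1.0 / (1.0 - aux_r2)))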
So, we have determined the following:
- `educ` and `exper` are correlated, but the correlation is not high enough to warrant collinearity - `educ` has additional information, which is not included in `exper`;
- `exper` and `age` are highly correlated, which results in collinearity between them - when estimating a regression model, we need to choose one of them to include in our model.

Possible explanations for the fact that the correlation between `educ` and `exper` is smaller, even though `educ` directly enters the formula used to calculate `exper`:
- `exper` increases with `age`, while `educ` tends to level off (i.e. stop changing) after a certain number of years gained. For example, once someone gains a Master's degree or a PhD, it may be very likely that they stop pursuing additional degrees. As a result, their 'years spent in education' stops increasing, while they continue to age and gain additional years of potential experience;
- `educ` is more like a categorical variable, with categories corresponding to years in education; these range from 0 (the minimum) to 21 (the maximum), but since the numerical values assigned usually coincide with the number of years, it is treated like a non-categorical variable.

For comparison, we will also load the full dataset:
dt4_full = pd.read_csv("http://www.principlesofeconometrics.com/poe5/data/csv/cps5.csv")
dt4_full.head()
print("Sample size: N = " + str(len(dt4_full.index)))
Not only does the full dataset contain more observations, but it also contains additional variables. The full variable list is as follows:
- `age` - age;
- `asian` - =1 if asian;
- `black` - =1 if black;
- `divorced` - =1 if divorced;
- `educ` - years of education;
- `exper` - potential experience = age - educ - 6;
- `faminc` - other family income, $\$$;
- `female` - =1 if female;
- `hrswork` - hours worked last week;
- `insure` - covered by private health insurance;
- `married` - =1 if married;
- `mcaid` - =1 if covered by Medicaid last year;
- `mcare` - =1 if covered by Medicare last year;
- `metro` - =1 if in metropolitan area;
- `midwest` - =1 if midwest region;
- `nchild` - number of own children in household;
- `northeast` - =1 if northeast region;
- `single` - =1 if single;
- `south` - =1 if south region;
- `union` - =1 if a union member;
- `wage` - earnings per hour, $\$$;
- `west` - =1 if west region;
- `white` - =1 if white.

In fact, if we look at the regional indicator variables:
pd.crosstab(index = dt4_full["south"] + dt4_full["west"] + dt4_full["midwest"] + dt4_full["northeast"], columns="count")
We see that the four regional indicator variables always sum up to one:
$$
\text{south}_i + \text{west}_i + \text{midwest}_i + \text{northeast}_i = 1,\quad \forall i = 1,...,N
$$
In other words, including all four of the regional variables would result in a dummy variable trap, as they are perfectly collinear. So, the $other$ region in our smaller dataset is actually the excluded `northeast` column of the full dataset.
On the other hand, if we were to also examine the `metro` variable instead of `northeast`:
pd.crosstab(index = dt4_full["south"] + dt4_full["west"] + dt4_full["midwest"] + dt4_full["metro"], columns="count")
We see that not only do some rows sum to zero - some even sum up to $2$. This clearly shows that the `metro` variable indicates something completely different from the regional variables.
If we were to look back at our initial model:
mdl_fulldt = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + metro + south + west + midwest + female + black", data = dt4_full)
mdl_fulldt_fit = mdl_fulldt.fit()
print(mdl_fulldt_fit.summary().tables[1])
We see a completely different result regarding race. Furthermore, the regional indicator variables are also significant in most cases, except for the `west` indicator.
As was mentioned during lectures, a larger sample leads to smaller standard errors and more precise estimates. If we want to account for complex interaction effects and a large amount of variables - we need a large dataset, which would cover many possible combinations of these values (i.e. a larger variable value range).
Further looking at the interaction terms:
mdl_fulldt = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + metro + south + west + midwest + female*black", data = dt4_full)
mdl_fulldt_fit = mdl_fulldt.fit()
print(mdl_fulldt_fit.summary().tables[1])
We now see that the interaction term $female \times black$ is statistically significant.
We can further create even more complex models by including even more interaction terms.
mdl_fulldt = smf.ols(formula = "np.log(wage) ~ educ + np.power(educ, 2) + exper + np.power(exper, 2) + metro + south + west + midwest + female*black + metro*female*educ + educ:exper", data = dt4_full)
mdl_fulldt_fit = mdl_fulldt.fit()
print(mdl_fulldt_fit.summary().tables[1])
Our finalized model is the following:
print(mdl_4_fit.summary().tables[1])
We begin by testing the model residuals for autocorrelation via Breusch-Godfrey test: $$ \begin{cases} H_0&:\text{the errors are serially uncorrelated}\\ H_1&:\text{the errors are autocorrelated at lag order 2} \end{cases} $$
from statsmodels.compat import lzip
#
name = ['LM-stat', 'LM: p-value', 'F-value', 'F: p-value']
bg_t = sm_diagnostic.acorr_breusch_godfrey(mdl_4_fit, nlags = 2)
print(pd.DataFrame(lzip(name, bg_t)))
Since the $p$-value of the $LM$ statistic is greater than the $5\%$ significance level, we have no grounds to reject the null hypothesis and conclude that the residuals are not serially correlated.
Next up, we will test for heteroskedasticity in the errors: $$ \begin{cases} H_0&: \gamma_1 = 0 \text{ (residuals are homoskedastic)}\\ H_1&: \gamma_1 \neq 0 \text{ (residuals are heteroskedastic)} \end{cases} $$ For simplicity, we will carry out the Breusch-Pagan Test:
BP_t = sm_diagnostic.het_breuschpagan(resid = mdl_4_fit.resid, exog_het = mdl_1.exog)
print(pd.DataFrame(lzip(['LM statistic', 'p-value', 'F-value', 'F: p-value'], BP_t)))
Because the $p$-value < 0.05, we reject the null hypothesis and conclude that the residuals are heteroskedastic.
Our test results indicated that:
As a result, we need to correct the OLS standard errors for heteroskedasticity - we can use the $HC0$, $HC1$, $HC2$ or $HC3$ estimators to consistently estimate the coefficient variance.
We have no need to correct for autocorrelation, as the residuals are not serially correlated, so HAC is not required; nevertheless, since it is a robust method that also accounts for heteroskedasticity, we will use it as well, as an example.
For comparison, our current model and its coefficient standard errors:
print(mdl_4_fit.summary().tables[1])
Then, the standard errors corrected via the different HCE methods, as well as the biased OLS s.e.'s (biased because the errors are heteroskedastic), can be summarised as follows:
pd.DataFrame([mdl_4_fit.HC0_se, mdl_4_fit.HC1_se, mdl_4_fit.HC2_se, mdl_4_fit.HC3_se, mdl_4_fit.bse], index = ["HC0", "HC1", "HC2", "HC3", "OLS"]).T
We see that the difference between the four HCE methods is not incredibly large; nevertheless, we will select `HC3` and examine the coefficient summary output:
print(mdl_4_fit.get_robustcov_results(cov_type = "HC3").summary())
Note the results - $\text{educ}^2$ is still insignificant, the $p$-value of `metro` decreased, and the $p$-value of `south` increased slightly. All in all, no significant changes.
If we wanted to also extract the HAC correction standard errors:
import statsmodels.stats as sm_stats
#
V_HAC = sm_stats.sandwich_covariance.cov_hac_simple(mdl_4_fit, nlags = 2)
print(pd.DataFrame(np.sqrt(np.diag(V_HAC)), index = mdl_4.exog_names, columns = ["HAC"]))
And the full model output:
#mdl_4.fit(cov_type = 'HAC', cov_kwds = {'maxlags':1})
print(mdl_4_fit.get_robustcov_results(cov_type = 'HAC', maxlags = 2).summary())
While the $p$-values slightly decreased, there are still no significant changes.
Since we have determined that the residuals are heteroskedastic, but not autocorrelated, we can use WLS with a generic weight function $\widehat{h}_i = \exp\left(\widehat{\log(\epsilon_i^2)}\right)$, where $\widehat{\log(\epsilon_i^2)}$ are the fitted values from the following residual regression: $\log(\epsilon_i^2) = \alpha_0 + \alpha_1 Z_{1,i} + ... + \alpha_m Z_{m,i} + v_i$
log_resid_sq_ols = sm.OLS(np.log(mdl_4_fit.resid**2), mdl_4.exog)
h_est = np.exp(log_resid_sq_ols.fit().fittedvalues)
Next, we can use the diagonal elements of $\widehat{\mathbf{\Omega}}^{-1} = \text{diag} \left(\widehat{h}_1^{-1},...,\widehat{h}_N^{-1} \right)$ as the weights:
mdl_4_wls = smf.wls(formula = mdl_4.formula, data = dt4, weights = 1.0 / h_est)
mdl_4_wls_fit = mdl_4_wls.fit()
#
print(mdl_4_wls_fit.summary().tables[1])
Compared to our OLS results:
print(mdl_4_fit.summary().tables[1])
we see that all of the WLS parameter estimates are statistically significant.
Regarding the $R^2$ - in the WLS it is larger:
mdl_4_wls_fit.rsquared_adj
But do note that it is calculated on the weighted (i.e. transformed) data, so it is not directly comparable to the OLS $R^2$.
In general, the coefficients themselves are not very different, which would indicate that our model is likely correctly specified.
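One informal way to compare the two fits on the same scale (a sketch, not from the original notebook) is the squared correlation between $\log(wage)$ and the fitted values of each model:
# Squared correlation between log(wage) and the fitted values (comparable across OLS and WLS):
r2_ols = np.corrcoef(np.log(dt4["wage"]), mdl_4_fit.fittedvalues)[0, 1]**2
r2_wls = np.corrcoef(np.log(dt4["wage"]), mdl_4_wls_fit.fittedvalues)[0, 1]**2
print("OLS: " + str(r2_ols) + "; WLS: " + str(r2_wls))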
On the other hand, we may be more interested in comparing the model residuals of the WLS. It would make sense to compare the WLS residuals which are from the transformed data, since the model was fitted on the transformed values:
e_star = 1.0 / np.sqrt(h_est) * mdl_4_wls_fit.resid
fig = plt.figure(num = 9, figsize = (10, 8))
# Plot fitted vs residual plots:
ax = fig.add_subplot(2, 2, 1)
ax.plot(mdl_4_wls_fit.fittedvalues, e_star, linestyle = "None", marker = "o", markeredgecolor = "black", label = "WLS")
ax.plot(mdl_4_fit.fittedvalues, mdl_4_fit.resid, linestyle = "None", marker = "o", color = "red", markeredgecolor = "black", label = "OLS")
ax.legend()
# Plot the residual histogram
ax = fig.add_subplot(2, 2, 2)
ax.hist(e_star, bins = 30, edgecolor = "black", label = "WLS")
ax.hist(mdl_4_fit.resid, bins = 30, edgecolor = "black", color = "red", label = "OLS")
ax.legend()
# Fix layout in case the labels do overlap:
plt.tight_layout()
plt.show()
We do note that the residual variance is larger in the transformed data. Generally, we would hope that WLS (and (F)GLS) would reduce the variance of the residuals. This may indicate that we need different weights. Nevertheless, for now, we will use the WLS model.
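As a quick numerical check of this claim (a sketch, not in the original notebook), we can compare the sample variances of the transformed WLS residuals and the OLS residuals:
# Compare the sample variances of the transformed WLS residuals and the OLS residuals:
print("Var(WLS, transformed): " + str(np.var(e_star)))
print("Var(OLS):              " + str(np.var(mdl_4_fit.resid)))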
Looking at it in a bit more detail:
fig = plt.figure(num = 10, figsize = (10, 8))
# Plot fitted vs residual plots:
ax = fig.add_subplot(2, 2, 1)
ax.plot(mdl_4_wls_fit.fittedvalues, e_star, linestyle = "None", marker = "o", markeredgecolor = "black", label = "WLS")
ax.legend()
# Plot the residual histogram
ax = fig.add_subplot(2, 2, 2)
ax.hist(e_star, bins = 30, edgecolor = "black", label = "WLS")
ax.legend()
# Plot the residual Q-Q plot:
ax = fig.add_subplot(2, 1, 2)
stats.probplot(e_star, dist = "norm", plot = ax)
# Fix layout in case the labels do overlap:
plt.tight_layout()
plt.show()
Visually, the scatter plot of the residuals may look better, but we cannot be sure. Thankfully, we know some tests which can help us out.
name = ['LM-stat', 'LM: p-value', 'F-value', 'F: p-value']
bg_t = sm_diagnostic.acorr_breusch_godfrey(mdl_4_fit, nlags = 2)
print(pd.DataFrame(lzip(name, bg_t)))
BP_t = sm_diagnostic.het_breuschpagan(resid = e_star, exog_het = mdl_1.exog)
print(pd.DataFrame(lzip(['LM statistic', 'p-value', 'F-value', 'F: p-value'], BP_t)))
While the $p$-value is larger - we would still reject the null hypothesis that the residuals are homoskedastic.
So, our WLS procedure did not account for all of the heteroskedasticity. Since we calculated the weights using the same exogenous variables as in the main model, it may very well be that the residual variance depends on some additional exogenous variables, which we did not include in our main model.
Since there is still some heteroskedasticity, we need to correct our WLS standard errors. We can do this quite easily:
# Combine WLS and HAC:
#print(mdl_4_wls_fit.get_robustcov_results(cov_type = "HC0").summary())
#print(mdl_4_wls_fit.get_robustcov_results(cov_type = "HC1").summary())
#print(mdl_4_wls_fit.get_robustcov_results(cov_type = "HC2").summary())
print(mdl_4_wls_fit.get_robustcov_results(cov_type = "HC3").summary())
We would again return to the conclusion that we should remove $educ^2$ as it is insignificant (though we would get different results with `HC0`, `HC1` and `HC2`).
We can conclude the following:
While we have carried out all of these tests and different estimation methods, we would still like to account for the remaining heteroskedasticity. To do this, we could look at:
- a different specification for the weights in the WLS procedure;
- additional explanatory variables (e.g. `black` is significant in the full dataset).

For interest's sake, if we were to compare the residuals for the original data, they would have only minor differences:
fig = plt.figure(num = 11, figsize = (10, 8))
# Plot fitted vs residual plots:
ax = fig.add_subplot(2, 2, 1)
ax.plot(mdl_4_wls_fit.fittedvalues, mdl_4_wls_fit.resid, linestyle = "None", marker = "o", markeredgecolor = "black", label = "WLS")
ax.plot(mdl_4_fit.fittedvalues, mdl_4_fit.resid, linestyle = "None", marker = "o", color = "red", markeredgecolor = "black", label = "OLS")
ax.legend()
# Plot the residual histogram
ax = fig.add_subplot(2, 2, 2)
ax.hist(mdl_4_fit.resid, bins = 30, edgecolor = "black", color = "red")
ax.hist(mdl_4_wls_fit.resid, bins = 30, edgecolor = "black")
# Plot the residual Q-Q plot:
ax = fig.add_subplot(2, 1, 2)
stats.probplot(mdl_4_wls_fit.resid, dist = "norm", plot = ax)
# Fix layout in case the labels do overlap:
plt.tight_layout()
plt.show()
Again, since WLS fits a model on the transformed data, we are interested in whether the residuals from the fitted transformed data adhere to our (MR.3) - (MR.6) assumptions.