3.11 Example
Considerations
The following libraries will be useful for this exercise.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats.diagnostic as sm_diagnostic
import statsmodels.stats.stattools as sm_tools
from statsmodels.stats.outliers_influence import summary_table
from statsmodels.sandbox.regression.predstd import wls_prediction_std
# Additional methods for output formatting
from statsmodels.compat import lzip
# Additional method for custom plot legend creation
from matplotlib.lines import Line2D
We will show an example of how to carry out the tasks specified in section 3.10 on one of the datasets, namely Dataset 4.
#
#
data_source <- "http://www.principlesofeconometrics.com/poe5/data/csv/cps5_small.csv"
dt4 <- read.csv(file = data_source, sep = ",", dec = ".", header = TRUE)
print(head(dt4, 10))
## black educ exper faminc female metro midwest south wage west
## 1 0 13 45 0 1 1 0 0 44.44 1
## 2 0 14 25 45351 1 1 1 0 16.00 0
## 3 0 18 27 91946 1 1 0 0 15.38 0
## 4 0 13 42 48370 0 1 1 0 13.54 0
## 5 0 13 41 10000 1 1 0 0 25.00 1
## 6 0 16 26 151308 1 1 0 0 24.05 0
## 7 0 16 11 110461 1 1 0 1 40.95 0
## 8 0 18 15 0 1 1 1 0 26.45 0
## 9 0 21 32 67084 0 1 1 0 30.76 0
## 10 0 14 12 14000 0 0 0 0 34.57 1
import pandas as pd
#
data_source = "http://www.principlesofeconometrics.com/poe5/data/csv/cps5_small.csv"
dt4 = pd.read_csv(data_source)
print(dt4.head(10))
## black educ exper faminc female metro midwest south wage west
## 0 0 13 45 0 1 1 0 0 44.44 1
## 1 0 14 25 45351 1 1 1 0 16.00 0
## 2 0 18 27 91946 1 1 0 0 15.38 0
## 3 0 13 42 48370 0 1 1 0 13.54 0
## 4 0 13 41 10000 1 1 0 0 25.00 1
## 5 0 16 26 151308 1 1 0 0 24.05 0
## 6 0 16 11 110461 1 1 0 1 40.95 0
## 7 0 18 15 0 1 1 1 0 26.45 0
## 8 0 21 32 67084 0 1 1 0 30.76 0
## 9 0 14 12 14000 0 0 0 0 34.57 1
We want to analyse whether education (the educ column) affects the wage (the wage column) of a person from the data sample.
3.11.1 Exercise Set 1
To make it easier to follow, we will take only the columns that we need from the dataset.
3.11.1.1 Task 1
We will begin by plotting the scatterplot of \(Y\) on \(X\) (we will plot the histograms of \(Y\) and \(X\) in a separate task):
It appears that larger values of educ are associated with larger values of wage. So \(Y\) and \(X\) appear to be correlated. The correlation between \(Y\) and \(X\) is:
## [1] 0.4553321
## 0.45533206358451567
We also note that there are fewer observations for people whose education is less than 10 years:
This could mean that people with less than 10 years of education are underrepresented in our dataset.
Considerations
In some cases this may not be underrepresentation at all - for example, if we were examining a sample of firms which primarily employ industry veterans, it is very likely that there would be fewer people with less than 10 years of experience.
3.11.1.2 Task 2
We begin by assuming that the relationship between wage and education can be defined by a linear regression:
\[
\text{wage} = \beta_0 + \beta_1 \cdot \text{educ} + \epsilon
\]
We expect that the more years one spends in education, the more qualified they are for higher-level (i.e. higher-paying) jobs. As such, the hourly wage should increase with more years spent studying. So, we expect \(\beta_1 > 0\).
3.11.1.3 Task 3
We will estimate the parameters “by-hand” by using the OLS formula: \[ \widehat{\boldsymbol{\beta}} = \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{Y} \] where:
- \(\mathbf{Y} = [Y_1,...,Y_N]^\top\),
- \(\mathbf{X} = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_N \end{bmatrix}\)
- \(\boldsymbol{\varepsilon} = [\epsilon_1,...,\epsilon_N]^\top\),
- \(\boldsymbol{\beta} =[ \beta_0, \beta_1]^\top\).
Let’s begin by estimating separate components of our OLS formula.
- \(\mathbf{X}\):
## educ
## [1,] 1 13
## [2,] 1 14
## [3,] 1 18
## [4,] 1 13
## [5,] 1 13
## [6,] 1 16
## [7,] 1 16
## [8,] 1 18
## [9,] 1 21
## [10,] 1 14
## [[ 1. 13.]
## [ 1. 14.]
## [ 1. 18.]
## [ 1. 13.]
## [ 1. 13.]
## [ 1. 16.]
## [ 1. 16.]
## [ 1. 18.]
## [ 1. 21.]
## [ 1. 14.]]
- \(\mathbf{X}^\top \mathbf{X}\):
## educ
## 1200 17043
## educ 17043 252073
## [[ 1200. 17043.]
## [ 17043. 252073.]]
We can verify that we have correctly multiplied the matrices by looking at the elements of the resulting matrix - the diagonal elements should be \(\sum_{i = 1}^N 1 = N\) and \(\sum_{i = 1}^N \text{educ}_i^2\), while the off-diagonal elements should be \(\sum_{i = 1}^N \text{educ}_i\):
- \(\left( \mathbf{X}^\top \mathbf{X}\right)^{-1}\):
- \(\mathbf{X}^\top \mathbf{Y}\):
Finally, we can multiply both terms to estimate the parameters:
## [,1]
## -10.399959
## educ 2.396761
## [-10.39995925 2.3967612 ]
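For reference, this manual calculation can be sketched in Python as follows (the names x_mat and beta_est_linear match those used later in this section; educ and wage are assumed to hold the corresponding columns of dt4, as in the code above, and the helper name xtx_inv is ours):
x_mat = np.column_stack((np.ones(len(educ)), educ))        # design matrix with a constant and educ
xtx_inv = np.linalg.inv(np.dot(x_mat.T, x_mat))            # (X'X)^(-1)
beta_est_linear = np.dot(xtx_inv, np.dot(x_mat.T, wage))   # (X'X)^(-1) X'Y
print(beta_est_linear)                                     # should match the estimates printed above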
We see that the sign of the educ coefficient is positive, which is in line with our assumptions (which we base on economic theory).
3.11.1.4 Task 4
In order to estimate the standard errors, we first need to estimate the variance: \[ \widehat{\mathbb{V}{\rm ar}} (\widehat{\boldsymbol{\beta}}) = \begin{bmatrix} \widehat{\mathbb{V}{\rm ar}} (\widehat{\beta}_0) & \widehat{\mathbb{C}{\rm ov}} (\widehat{\beta}_0, \widehat{\beta}_1) \\ \widehat{\mathbb{C}{\rm ov}} (\widehat{\beta}_1, \widehat{\beta}_0) & \widehat{\mathbb{V}{\rm ar}} (\widehat{\beta}_1) \end{bmatrix} = \widehat{\sigma}^2 \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \] where
- the estimated residual variance \(\widehat{\sigma}^2 = \dfrac{1}{N-2} \sum_{i = 1}^N \widehat{\epsilon}_i^2\),
- the residuals \(\widehat{\epsilon}_i = Y_i - \widehat{Y}_i\),
- the fitted values \(\widehat{Y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 X_i\).
Then, the standard errors of the corresponding OLS estimators can be estimated as: \[\text{se}(\widehat{\beta}_i) = \sqrt{\widehat{\mathbb{V}{\rm ar}} (\mathbf{\widehat{\beta}_i})}\] for \(i = 0, 1\).
We begin by estimating the residual variance:
Now, we can estimate the OLS estimator variance-covariance matrix:
The standard errors are the square root of the diagonal elements of the estimated variance-covariance matrix of the parameters:
## educ
## 1.9624004 0.1353989
## [1.96240038 0.13539888]
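These standard errors can be reproduced in Python along the following lines (a sketch; the names y_fit_linear, resid_linear, sigma2_est_linear and beta_se_linear match those used later in this section, while beta_est_linear_var mirrors the naming used for the log-linear model below):
y_fit_linear = np.dot(x_mat, beta_est_linear)                   # fitted values
resid_linear = wage - y_fit_linear                              # residuals
sigma2_est_linear = np.sum(resid_linear**2) / (len(educ) - 2)   # estimated residual variance
beta_est_linear_var = sigma2_est_linear * np.linalg.inv(np.dot(x_mat.T, x_mat))
beta_se_linear = np.sqrt(np.diag(beta_est_linear_var))          # standard errors
print(beta_se_linear)                                           # should match the values printed above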
3.11.1.5 Task 5
We first will print the estimated parameters and their standard errors in one table:
## estimate se
## -10.4000 1.9624
## educ 2.3968 0.1354
output1 = pd.DataFrame([beta_est_linear, beta_se_linear],
index = ["estimate", "se"], columns = ["intercept", "educ"]).T
print(np.round(output1, 4))
## estimate se
## intercept -10.4000 1.9624
## educ 2.3968 0.1354
Now we can write down the equation as: \[ \underset{(se)}{\widehat{\text{wage}}} = \underset{(1.9624)}{-10.4} + \underset{(0.1354)}{2.3968} \cdot \text{educ} \]
3.11.1.6 Task 6
We have already calculated the fitted values in Task 4 when calculating the residuals, but we will restate the code here again:
Now, we will plot the fitted values alongside the true values from the data sample.
Considerations
We will also need to order the data by \(X\) - the reason is that when plotting a line, R and Python will connect the values as they are given in the dataset:
## [1] 13 14 18 13 13 16 16 18 21 14
## 0 13
## 1 14
## 2 18
## 3 13
## 4 13
## 5 16
## 6 16
## 7 18
## 8 21
## 9 14
## Name: educ, dtype: int64
We see that the first point of the line will start at \(X = 13\), then move to \(14\), \(18\), but then go back to \(13\). If we have a log-linear model, or any other model where we transform \(Y\) so that the fitted and back-transformed values do not form a straight line, then the line will be incorrectly visualized.
So, we need to order both the \(X\) and \(Y\) vectors by the value of \(X\) when plotting.
#
#
#
#
plot(educ, wage, ylim = c(-10, 220))
lines(educ[order(educ)], y_fit_linear[order(educ)], col = "red")
abline(h = 0, col = "darkgreen", lty = 2)
fig = plt.figure(num = 1)
_ = plt.plot(educ, wage, linestyle = "None", marker = "o", markeredgecolor = "black")
_ = plt.plot(educ[np.argsort(educ)], y_fit_linear[np.argsort(educ)], linestyle = "-", color = "red")
_ = plt.axhline(y = 0, linestyle = "--", color = "darkgreen")
_ = plt.xlabel("educ")
_ = plt.ylabel("wage")
plt.show()
Considerations
We note that for educ values less than 5, our fitted values for wage are negative. Since wage cannot be negative, and an education value of less than 5 years is realistic (and because we have these values in our dataset), this is an indication that our model may not be adequate for this dataset.
3.11.1.7 Task 7
We will use the built-in functions to automatically estimate the coefficients and standard errors:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.399959 1.9624004 -5.299611 1.380373e-07
## educ 2.396761 0.1353989 17.701484 1.820416e-62
import statsmodels.api as sm
#
lm_fit_linear = sm.OLS(wage, sm.add_constant(educ)).fit()
print(lm_fit_linear.summary2().tables[1])
## Coef. Std.Err. t P>|t| [0.025 0.975]
## const -10.399959 1.962400 -5.299611 1.380373e-07 -14.250083 -6.549835
## educ 2.396761 0.135399 17.701484 1.820416e-62 2.131116 2.662407
We see that the Estimate (Coef.) and Std. Error (Std.Err.) columns correspond to our manually calculated values.
3.11.2 Exercise Set 2
3.11.2.1 Task 8
The run-sequence plot can be used to plot the values of a variable against their observation index (i.e. equally spaced on the horizontal axis):
#
#
#
#
plot(wage)
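A Python counterpart of this run-sequence plot could look along these lines (a sketch in the same matplotlib style used elsewhere in this section; the run-sequence plot of educ is produced analogously):
fig = plt.figure()
_ = plt.plot(wage, linestyle = "None", marker = "o", markerfacecolor = 'None', markeredgecolor = "black")
_ = plt.xlabel("Index")
_ = plt.ylabel("wage")
_ = plt.title("Run-sequence plot of wage")
plt.show()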
This allows us to get a better overview of the data. In this case, from the run-sequence plots we see that:
- One value is a possible outlier - its value is much higher than any other. In practice we may need to take a deeper look at the dataset - is that value due to an error, or is it perhaps from someone who is underrepresented in our dataset (e.g. the salary of a CEO, some management positions, etc.)?
- The values of wage are fluctuating (i.e. they are not constant) - it appears that the mean and variance of \(Y\) in this data sample are constant. But this would only be true if we ignore the fact that we have an additional variable, educ.
- If we look at the scatterplot of wage and educ from Task 1, we see that this is not necessarily the case, since wage appears to increase with educ (and we see a similar pattern for the variance of wage). One reason for this is that we are examining cross-sectional data, where the order of the observations does not matter, so it is difficult to visually determine whether the mean (and variance) is constant or not using only the run-sequence plot of a single variable.
One more thing we do notice in our data - the distribution of \(Y\) is likely to be non-normal. The reason is that while most values appear to be "clustered" around 25, some are much higher. Since wage cannot be negative, this skews the distribution of the data.
Considerations
Since we mentioned that the observation order is not important in cross-sectional data - a natural question is what happens to the run-sequence plot if we do order the \(Y\) variable?
At first glance, this appears to be a completely different result! However, the difference is mostly visual - we do not clearly see how many observations have low wage values, while we do see that there are fewer observations with larger wage values.
In fact, the above ordered value plot looks similar to (but not exactly the same as) the Q-Q plot of \(Y\):
The difference is that in the Q-Q plot the values of wage are plotted against quantiles calculated from a theoretical normal distribution (the number of quantiles is selected to match the size of the sample data).
On the other hand, as with the unordered values, we clearly see that there is an outlier in our data - one wage observation is much higher than the rest.
One way to examine exactly how many observations have low and high wage values is by using histograms, which we will do in the next task.
3.11.2.2 Task 9
fig = plt.figure(num = 4)
ax = fig.add_subplot(121)
_ = ax.hist(wage, histtype = 'bar', color = "darkgreen", ec = 'black')
_ = plt.xlabel("wage")
_ = plt.title("Histogram of wage")
ax = fig.add_subplot(122)
_ = ax.hist(educ, histtype = 'bar', color = "orange", ec = 'black')
_ = plt.title("Histogram of educ")
_ = plt.xlabel("educ")
plt.show()
We see that the data for wage and educ do not appear to be normal - the distributions of both wage and educ are skewed. For wage - there are many more people with wages below 50. For educ - there are many more people with more than 10 years of education.
3.11.2.3 Task 10
We begin by again plotting the scatter plot of wage and educ:
From the plots, as well as the previous Task results, we highlight the following points:
- The variation of wage is very small for educ <= 10, compared to the wage values when educ > 10.
- As we have seen before, our simple linear model is not precise when the value of educ is small.
- Furthermore, from the histogram plots we have noted that wage does not have a normal distribution.
As we know, we can also specify nonlinearities with the linear model in order to try to account for this. We may initially think about the following possible models:
- \(\log(\text{wage}) = \beta_0 + \beta_1 \cdot \text{educ} + \epsilon\), where \(\text{wage} > 0\);
- \(\text{wage} = \beta_0 + \beta_1 \cdot \log(\text{educ}) + \epsilon\), where \(\text{educ} > 0\);
- \(\log(\text{wage}) = \beta_0 + \beta_1 \cdot \log(\text{educ}) + \epsilon\), where \(\text{wage} > 0\), \(\text{educ} > 0\);
- \(\text{wage} = \beta_0 + \beta_1 \cdot \text{educ}^2 + \epsilon\)
Unfortunately, if we look at the minimum value of educ:
## [1] 0
## 0
We see that it is zero - so we cannot take logarithms of educ.
This leaves us with two possible alternative specifications of a linear model:
- \(\log(\text{wage}) = \beta_0 + \beta_1 \cdot \text{educ} + \epsilon\)
- \(\text{wage} = \beta_0 + \beta_1 \cdot \text{educ}^2 + \epsilon\)
We will begin by estimating a log-linear model:
## [,1]
## 1.59683536
## educ 0.09875341
beta_est_loglin = np.dot(np.linalg.inv(np.dot(x_mat.T, x_mat)).dot(x_mat.T), np.log(wage))
print(beta_est_loglin)
## [1.59683536 0.09875341]
Next, we will calculate the fitted values \(\widehat{\log(\text{wage})}\) and the residuals of the log-linear model, as well as the standard errors of the estimated parameters:
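A sketch of the first step in Python (the names y_loglin_fit and resid_loglin match those used in the code below):
y_loglin_fit = np.dot(x_mat, beta_est_loglin)     # fitted values of log(wage)
resid_loglin = np.log(wage) - y_loglin_fit        # residuals on the log scale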
sigma2_est_loglin <- (length(educ) - 2)^(-1) * sum((resid_loglin)^2)
beta_est_loglin_var <- sigma2_est_loglin * solve(t(x_mat) %*% x_mat)
beta_se_loglin <- sqrt(diag(beta_est_loglin_var))
#
output2 <- data.frame(estimate = beta_est_loglin,
se = beta_se_loglin)
print(round(output2, 4))
## estimate se
## 1.5968 0.0702
## educ 0.0988 0.0048
sigma2_est_loglin = (len(educ) - 2)**(-1) * np.sum((resid_loglin)**2)
beta_est_loglin_var = sigma2_est_loglin * np.linalg.inv(x_mat.T.dot(x_mat))
beta_se_loglin = np.sqrt(np.diag(beta_est_loglin_var))
#
output2 = pd.DataFrame([beta_est_loglin, beta_se_loglin],
index = ["estimate", "se"], columns = ["intercept", "educ"]).T
print(np.round(output2, 4))
## estimate se
## intercept 1.5968 0.0702
## educ 0.0988 0.0048
We can also do this with the built-in OLS estimation function:
## (Intercept) educ
## 1.59683536 0.09875341
## const 1.596835
## educ 0.098753
## dtype: float64
Note that we have to remember that we are using log(wage) instead of wage in this model.
So, the estimated log-linear model is: \[ \underset{(se)}{\widehat{\log(\text{wage})}} = \underset{(0.0702)}{1.5968} + \underset{(0.0048)}{0.0988} \cdot \text{educ} \]
Next, we will estimate a quadratic model:
We may make the assumption that for people who have already spent more years studying, an additional year of education might have a larger effect than it would for people who did not spend as many years studying.
We will estimate the following model: \[ \text{wage} = \beta_0 + \beta_1 \cdot \text{educ}^2 + \epsilon \]
x_mat_2 <- cbind(1, educ^2)
beta_est_quad <- solve(t(x_mat_2) %*% x_mat_2) %*% t(x_mat_2) %*% wage
print(t(beta_est_quad))
## [,1] [,2]
## [1,] 4.916477 0.08913401
We can also do this with the built-in OLS estimation function:
## (Intercept) I(educ^2)
## 4.91647729 0.08913401
## const 4.916477
## educ 0.089134
## dtype: float64
Note that we have used educ**2, but the variable name is still educ. To avoid confusion, we can rename the variable as follows:
x_mat_sq = sm.add_constant(educ**2)
x_mat_sq.columns = ['const', 'educ^2']
#
lm_fit_quad = sm.OLS(wage, x_mat_sq).fit()
print(lm_fit_quad.params)
## const 4.916477
## educ^2 0.089134
## dtype: float64
Now that we have addressed the naming issue, we will continue the tasks.
Along with the estimates, we can also immediately test the hypothesis: \[ \begin{aligned} H_0&: \beta_1 = 0\\ H_1&: \beta_1 \neq 0 \end{aligned} \]
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.91647729 1.091864113 4.502829 7.359564e-06
## I(educ^2) 0.08913401 0.004858102 18.347497 1.862455e-66
## Coef. Std.Err. t P>|t| [0.025 0.975]
## const 4.916477 1.091864 4.502829 7.359564e-06 2.774299 7.058656
## educ^2 0.089134 0.004858 18.347497 1.862455e-66 0.079603 0.098665
Because the \(p\)-value column (Pr(>|t|) in R and P>|t| in Python) for educ is less than 0.05, we reject the null hypothesis and are left with the alternative that the coefficient \(\widehat{\beta}_1\) is statistically significantly different from zero.
As before, we can calculate the fitted values and the residuals for the quadratic model:
Considerations
Furthermore, we can plot all of our current models and compare them visually (to make it easier, we will not show all of the sample data, which we will limit with ylim):
#
plot(educ, wage, ylim = c(-10, 80))
lines(educ[order(educ)], y_fit_linear[order(educ)],
col = "red", lwd = 2)
lines(educ[order(educ)], exp(y_loglin_fit[order(educ)]),
col = "orange", lwd = 2)
lines(educ[order(educ)], y_quad_fit[order(educ)],
col = "blue", lwd = 2)
abline(h = 0, col = "darkgreen", lty = 2)
#
# Add a legend:
legend("topleft",
legend = c("Sample data", "simple linear model", "log-linear model", "quadratic model"),
lty = c(NA, 1, 1, 1), lwd = c(NA, 2, 2, 2), pch = c(1, NA, NA, NA),
col = c("black", "red", "orange", "blue"))
fig = plt.figure(num = 6)
_ = plt.plot(educ, wage, linestyle = "None", marker = "o",
markeredgecolor = "black", label = "Sample data")
_ = plt.plot(educ[np.argsort(educ)], y_fit_linear[np.argsort(educ)],
linestyle = "-", color = "red", label = "simple linear model")
_ = plt.plot(educ[np.argsort(educ)], np.exp(y_loglin_fit[np.argsort(educ)]),
linestyle = "-", color = "orange", label = "log-linear model")
_ = plt.plot(educ[np.argsort(educ)], y_quad_fit[np.argsort(educ)],
linestyle = "-", color = "blue", label = "quadratic model")
_ = plt.axhline(y = 0, linestyle = "--", color = "darkgreen")
_ = plt.xlabel("educ")
_ = plt.ylabel("wage")
_ = plt.ylim(bottom = -12, top = 82)
_ = plt.legend(loc = "upper left")
plt.show()
We see that the quadratic model is somewhere in the middle between the linear and the log-linear model - the predicted values are greater than zero (which is better than the linear model and closer to the log-linear model specification), and the predicted values are closer to the mean for educ between 15 and 20 (which is better than the log-linear model and closer to the linear model specification).
We could also say that the log-linear model underestimates the expected wage compared to the quadratic model when educ is greater than 10, but is similar to the quadratic model for educ < 10.
Our estimated quadratic model is: \[ \underset{(se)}{\widehat{\text{wage}}} = \underset{(1.09186)}{4.9165} + \underset{(0.00486)}{0.08913} \cdot \text{educ}^2 \]
We can interpret the coefficient of educ as follows: education has a positive effect on wage, which is more pronounced for larger initial (or base, or starting) values of education.
Consider the following equations where some base education increases by one: \[ \begin{aligned} {\widehat{\text{wage}}^{(1)}} &= {4.9165} + {0.08913} \cdot \text{educ}^2\\ {\widehat{\text{wage}}^{(2)}} &= {4.9165} + {0.08913} \cdot (\text{educ} + 1)^2 \\ &= {4.9165} + {0.08913}\cdot \text{educ}^2 + 0.08913 \cdot (2 \cdot\text{educ} + 1) \end{aligned} \]
Then the change in wage from a one-year increase in education is:
\[
\widehat{\text{wage}}^{(2)} - \widehat{\text{wage}}^{(1)} = 0.08913 \cdot (2 \cdot\text{educ} + 1)
\]
If education increases from 0 to 1, then wage increases by:
## I(educ^2)
## 0.08913401
## 0.08913400981358574
around 0.09 dollars per hour.
On the other hand, if education increases from 15 to 16, then wage increases by:
## I(educ^2)
## 2.763154
## 2.763154304221158
roughly 2.76 dollars per hour.
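These marginal effects can also be computed directly from the formula above; a minimal Python sketch, using the coefficient stored in lm_fit_quad (the helper name b1 is ours):
b1 = lm_fit_quad.params['educ^2']     # estimated coefficient of educ^2
print(b1 * (2 * 0 + 1))               # educ from 0 to 1: roughly 0.0891
print(b1 * (2 * 15 + 1))              # educ from 15 to 16: roughly 2.7631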
We can verify this with the built-in functions as well:
## 2
## 0.08913401
## 0.08913400981358599
## 2
## 2.763154
x_new = sm.add_constant(np.array([15**2, 16**2]))
tst = lm_fit_quad.predict(x_new)
print(tst[1] - tst[0])
## 2.763154304221157
Note that we need to pass the transformed data to the .predict() function in Python, because we used the transformed data when estimating the model (as opposed to R, where we transformed the data using the formula).
Fortunately, we can also use a similar formula notation in Python:
import statsmodels.formula.api as smf
#
data = {"educ" : educ, "wage" : wage}
temp_mdl_fit = smf.ols("wage ~ I(educ**2)", data = data).fit()
print(temp_mdl_fit.summary().tables[1])
## ================================================================================
## coef std err t P>|t| [0.025 0.975]
## --------------------------------------------------------------------------------
## Intercept 4.9165 1.092 4.503 0.000 2.774 7.059
## I(educ ** 2) 0.0891 0.005 18.347 0.000 0.080 0.099
## ================================================================================
Then we can pass the original exogenous variable values for educ, which will be automatically transformed to I(educ ** 2):
## 2.763154304221157
We will use this formula notation more often in the next chapters. For now, we are focusing on a less-convenient specification in order to highlight how the transformations are applied when estimating the model and how they are incorporated in the fitted value calculation.
3.11.2.4 Task 11
We have already calculated the residuals in the previous Tasks, so now we plot the run-sequence plots for the respective models:
#
#
par(mfrow = c(2, 3))
#
plot(resid_linear,
main = "level-level model\n Run-sequence plot")
#
plot(resid_loglin,
main = "log-linear model\n Run-sequence plot")
#
plot(resid_quad,
main = "quadratic model\n Run-sequence plot")
#
hist(resid_linear,
main = "level-level model\n Histogram", breaks = 25)
#
hist(resid_loglin,
main = "log-level model\n Histogram", breaks = 25)
#
hist(resid_quad,
main = "quadratic model\n Histogram", breaks = 25)
fig = plt.figure(num = 7, figsize = (10, 8))
_ = fig.add_subplot(231).plot(resid_linear, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("level-level model\n Run-sequence plot")
_ = fig.add_subplot(232).plot(resid_loglin, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("log-linear model\n Run-sequence plot")
_ = fig.add_subplot(233).plot(resid_quad, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("quadratic model\n Run-sequence plot")
_ = fig.add_subplot(234).hist(resid_linear, bins = 25,
histtype = 'bar', color = "cornflowerblue", ec = 'black')
_ = plt.title("level-level model\n Histogram")
_ = fig.add_subplot(235).hist(resid_loglin, bins = 25,
histtype = 'bar', color = "cornflowerblue", ec = 'black')
_ = plt.title("log-linear model\n Histogram")
_ = fig.add_subplot(236).hist(resid_quad, bins = 25,
histtype = 'bar', color = "cornflowerblue", ec = 'black')
_ = plt.title("quadratic model\n Histogram")
_ = plt.tight_layout()
plt.show()
We will also plot the scatter plots of the residuals against educ to see how well each model performs:
#
#
par(mfrow = c(1, 3))
#
#
plot(educ, resid_linear,
main = "level-level model\n Scatter plot of residuals and educ")
#
#
plot(educ, resid_loglin,
main = "log-linear model\n Scatter plot of residuals and educ")
#
#
plot(educ, resid_quad,
main = "quadratic model\n Scatter plot of residuals and educ")
fig = plt.figure(num = 8, figsize = (10, 8))
_ = fig.add_subplot(131).plot(educ, resid_linear, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("level-level model\n Scatter plot of residuals and educ")
_ = fig.add_subplot(132).plot(educ, resid_loglin, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("log-linear model\n Scatter plot of residuals and educ")
_ = fig.add_subplot(133).plot(educ, resid_quad, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("quadratic model\n Scatter plot of residuals and educ")
_ = plt.tight_layout()
## <string>:1: UserWarning: Tight layout not applied. tight_layout cannot make axes width small enough to accommodate all axes decorations
for i, ax in enumerate(fig.axes):
_ = ax.set_ylabel("residuals")
_ = ax.set_xlabel("educ")
plt.show()
Ideally, we would want the residuals to be randomly scattered around zero on the vertical axis, which would suggest that our model captures any non-linearities in the data. They should also have the same variance and no outliers.
In our case:
- In the linear (level-level) model, the residuals do not appear to be normally distributed, while in the log-linear model they seem to have a normal distribution (we see a similar situation with the quadratic model).
- In the simple linear (and in the quadratic) model, there is one residual which stands out, which may be a sign of an outlier in the data.
- In the log-linear model, as well as in the simple linear and the quadratic models, the variance of the residuals is smaller for smaller values of education. However, as we have mentioned, we have much fewer wage observations for education < 10.
Note that we are examining the residuals of the model. Since we are using the logarithm of wage in the log-linear model, the residuals are the difference between the actual and fitted values of the log of wage.
Based on our (UR.1) - (UR.4) assumptions, we would expect our residuals to be independent of \(X\), have the same mean and variance across observations and be normally distributed. The residuals of the log-linear model appear to be normally distributed. Because of a lack of data for lower values of educ, we cannot say whether the variance of the residuals is the same.
3.11.2.5 Task 12
Consequently, the log-linear model appears to be better suited for our data, compared to the simple linear and the quadratic models: \[ \underset{(se)}{\widehat{\log(\text{wage})}} = \underset{(0.0702)}{1.5968} + \underset{(0.0048)}{0.0988} \cdot \text{educ} \] (Though the fitted value plot for the quadratic model seems close to the log-linear model.)
3.11.2.6 Task 13
Looking back at the simple linear model: \[ \underset{(se)}{\widehat{\text{wage}}} = \underset{(1.9624)}{-10.4000} + \underset{(0.1354)}{2.3968} \cdot \text{educ} \] The interpretation of \(\widehat{\beta}_1\) is that: \[ \text{an additional year of education increases the expected hourly wage by around } 2.4 \text{ dollars.} \]
For the log-linear model: \[ \underset{(se)}{\widehat{\log(\text{wage})}} = \underset{(0.0702)}{1.5968} + \underset{(0.0048)}{0.0988} \cdot \text{educ} \] The interpretation for \(\widehat{\beta}_1\) is that:
\[ \text{an additional year of education increases the expected hourly wage by around } 9.88\%. \]
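Since the dependent variable is in logarithms, \(100 \cdot \widehat{\beta}_1\) is only an approximation of the percentage change; the exact change is \(100 \cdot (\exp(\widehat{\beta}_1) - 1)\). A quick check in Python (a sketch; the helper name b1 is ours):
b1 = beta_est_loglin[1]            # estimated coefficient of educ in the log-linear model
print(100 * b1)                    # approximate percentage change, around 9.88
print(100 * (np.exp(b1) - 1))      # exact percentage change, slightly above 10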
Considerations
We will also plot the fitted model for both the log of wage and the exponentiated fitted value of the log-linear model.
#
#
par(mfrow = c(1, 2))
#
plot(educ, log(wage))
#
lines(educ[order(educ)], y_loglin_fit[order(educ)],
col = "red", lwd = 2)
#
#
plot(educ, wage)
#
lines(educ[order(educ)], exp(y_loglin_fit[order(educ)]),
col = "red", lwd = 2)
#
abline(h = 0, col = "darkgreen", lty = 2)
fig = plt.figure(num = 9, figsize = (10, 8))
_ = fig.add_subplot(121).plot(educ, np.log(wage), color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.plot(educ[np.argsort(educ)], y_loglin_fit[np.argsort(educ)],
linestyle = "-", color = "red")
_ = plt.ylabel("log(wage)")
_ = fig.add_subplot(122).plot(educ, wage, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.plot(educ[np.argsort(educ)], np.exp(y_loglin_fit[np.argsort(educ)]),
linestyle = "-", color = "red")
_ = plt.axhline(y = 0, linestyle = "--", color = "darkgreen")
_ = plt.ylabel("wage")
_ = plt.tight_layout()
for i, ax in enumerate(fig.axes):
_ = ax.set_xlabel("educ")
plt.show()
3.11.3 Exercise Set 3
3.11.3.1 Task 14
We would like to test whether education has a significant effect on wage. We will examine the coefficients of:
- the simple linear: \(\text{wage} = \beta_0 + \beta_1 \cdot \text{educ} + \epsilon\);
- the log-linear model: \(\log(\text{wage}) = \beta_0 + \beta_1 \cdot \text{educ} + \epsilon\).
3.11.3.1.1 TASK 14.a
We will begin by testing the null hypothesis against the alternative that \(\beta_1\) is negative:
\[ \begin{cases} H_0 &: \beta_1 = 0\\ H_1 &: \beta_1 < 0 \end{cases} \]
The critical value \(t_c = t_{(\alpha, N-2)}\) is:
## [1] -1.64613
We begin by calculating the \(t\)-ratio for the simple linear model:
## educ
## 17.70148
We reject \(H_0\) if \(t\text{-ratio} \leq t_{(\alpha, N-2)}\).
In our case:
## educ
## FALSE
## False
This means that we have no grounds to reject the null hypothesis.
We can calculate the p-value to verify this:
Since the p-value is greater than 0.05, we have no grounds to reject the null hypothesis at the \(5\%\) significance level.
Next, we will calculate the \(t\)-ratio and the \(p\)-value for the log-linear model:
## educ
## 20.39437
## educ
## FALSE
So, we have no grounds to reject the null hypothesis at the \(5\%\) significance level for the log-linear model.
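The left-tail test can be carried out in Python along the following lines (a sketch using scipy.stats for the simple linear model; the log-linear model is treated in exactly the same way, and the helper names t_crit and t_stat are ours):
N = len(educ)
t_crit = stats.t.ppf(0.05, df = N - 2)                             # critical value, around -1.646
t_stat = lm_fit_linear.params['educ'] / lm_fit_linear.bse['educ']  # t-ratio, around 17.70
print(t_stat <= t_crit)                                            # False - no grounds to reject H0
print(stats.t.cdf(t_stat, df = N - 2))                             # left-tail p-value, close to 1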
Considerations
In this case the null hypothesis can be interpreted as: \[ H_0: \text{ Education has no effect on wage (i.e. education has no significant effect on wage).} \] against the alternative: \[ H_1: \text{ Education has a negative effect on wage.} \]
Considering that we expect higher education to lead to a larger wage, would it make sense to test the null hypothesis against such an alternative?
3.11.3.1.2 TASK 14.b
Next, we will test the null hypothesis against the alternative that \(\beta_1\) is positive:
\[ \begin{cases} H_0 &: \beta_1 = 0\\ H_1 &: \beta_1 > 0 \end{cases} \]
The critical value \(t_c = t_{(1 - \alpha, N-2)}\) is:
## [1] 1.64613
The t-ratio is unchanged, since it depends only on the null hypothesis (which is the same as before), so we will not recalculate it.
We reject \(H_0\) if \(t\text{-ratio} \geq t_{(1 - \alpha, N-2)}\).
In our case, for the simple linear model:
## educ
## TRUE
## True
We reject the null hypothesis and conclude that \(\beta_1 > 0\). We can verify this by calculating the p-value:
## educ
## 9.102079e-63
## 9.102079371964216e-63
We did not round the p-value to highlight that it is very close to zero (note that the E-notation e-63 stands for \(\times 10^{-63}\)). Since the \(p\)-value < 0.05, we reject the null hypothesis at the \(5\%\) significance level.
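The right-tail critical value and p-value can be calculated with the survival function (a sketch, reusing t_stat from the sketch above):
t_crit = stats.t.ppf(1 - 0.05, df = len(educ) - 2)   # critical value, around 1.646
print(t_stat >= t_crit)                              # True - we reject H0
print(stats.t.sf(t_stat, df = len(educ) - 2))        # right-tail p-value, around 9.1e-63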
Next, we will calculate the \(t\)-ratio and the \(p\)-value for the log-linear model:
## educ
## TRUE
## True
## educ
## 6.724479e-80
## 6.724478717272381e-80
For the log-linear model we come to the same conclusion - we reject the null hypothesis at the \(5\%\) significance level. If we look at the coefficient of educ for the log-linear model, we see that it is \(0.0988\), which seems small. However, since we have log-transformed our dependent variable wage, the scale of the dependent variable has changed, so this seemingly small value has a statistically significant (i.e. significantly different from zero) effect on log(wage).
Considerations
In this case the null hypothesis can be interpreted as: \[ H_0:\text{ Education has no effect on wage (i.e. education has no significant effect on wage).} \] against the alternative: \[ H_1:\text{ Education has a positive effect on wage.} \] We specify this as the alternative hypothesis, since we expect that higher education leads to larger wage.
3.11.3.1.3 TASK 14.c
Finally, we will test the null hypothesis against the alternative that \(\beta_1\) is not zero:
\[ \begin{cases} H_0 &: \beta_1 = 0\\ H_1 &: \beta_1 \neq 0 \end{cases} \]
As before, we do not need to recalculate the t-ratio.
We reject \(H_0\) if \(t\text{-ratio} \leq t_{(\alpha/2, N-2)}\) or \(t\text{-ratio} \geq t_{(1 - \alpha/2, N-2)}\).
In our case, for the simple linear model this is:
## educ
## TRUE
So, we reject the null hypothesis at the \(5\%\) significance level.
The \(p\)-value:
## educ
## 1.820416e-62
## 1.8204158743928433e-62
is less than 0.05, so we reject the null hypothesis.
Finally, we will calculate the \(t\)-ratio and the \(p\)-value for the log-linear model:
## educ
## TRUE
## educ
## 1.344896e-79
## 1.3448957434544762e-79
In the log-linear model case, we reject the null hypothesis at the \(5\%\) significance level.
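The two-tail test for the simple linear model follows the same pattern (a sketch; the log-linear model is analogous):
t_crit = stats.t.ppf(1 - 0.05 / 2, df = len(educ) - 2)
print(np.abs(t_stat) >= t_crit)                            # True - we reject H0
print(2 * stats.t.sf(np.abs(t_stat), df = len(educ) - 2))  # two-tail p-value, around 1.82e-62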
From these tests, we conclude that:
- The coefficient of educ is statistically significantly different from zero in the simple linear model. In other words, educ has a significant effect on wage.
- The coefficient of educ is statistically significantly different from zero in the log-linear model. In other words, educ has a significant effect on log(wage) (and, by extension, on wage itself).
Considerations
In this case the null hypothesis can be interpreted as: \[ H_0:\text{ Education has no effect on wage (i.e. education has no significant effect on wage).} \] against the alternative: \[ H_1:\text{ Education has a significant effect on wage.} \] Note that for the two-tailed hypothesis test, we do not make any assumptions about the sign of the effect - i.e. we do not care whether it is positive or negative - we only care to test the overall significance of the effect.
Furthermore, the two-tailed hypothesis test, where we test \(H_0\) against \(H_1: \beta_i \neq 0\), is also automatically calculated in the summary output of our fitted models:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.399959 1.9624004 -5.299611 1.380373e-07
## educ 2.396761 0.1353989 17.701484 1.820416e-62
## Coef. Std.Err. t P>|t| [0.025 0.975]
## const -10.399959 1.962400 -5.299611 1.380373e-07 -14.250083 -6.549835
## educ 2.396761 0.135399 17.701484 1.820416e-62 2.131116 2.662407
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.59683536 0.07018017 22.75337 1.530466e-95
## educ 0.09875341 0.00484219 20.39437 1.344896e-79
## Coef. Std.Err. t P>|t| [0.025 0.975]
## const 1.596835 0.070180 22.753370 1.530466e-95 1.459146 1.734525
## educ 0.098753 0.004842 20.394367 1.344896e-79 0.089253 0.108254
The \(t\)-ratio is in the appropriately named t value column and the \(p\)-value is in the Pr(>|t|) column in R or the P>|t| column in Python. We see that these values are the same as our t_stat and the calculated \(p\)-value of the two-tail hypothesis test.
Consequently, we do not need to manually carry out the two-tail hypothesis test as it is done automatically in the model summary. On the other hand, a one-tailed hypothesis test is not provided in the output, so it is always useful to see how we can do this ourselves.
3.11.3.2 Task 15
Let \(\widehat{Y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 \tilde{X}_i\) be the predicted value of \(Y_i\) given values \(\tilde{X}_i\), \(i = 1,...\). We can also express this in a matrix notation as: \(\widehat{\mathbf{Y}} = \widetilde{\mathbf{X}} \widehat{\boldsymbol{\beta}}\)
The variance-covariance matrix of the estimated mean response is: \[ \widehat{\mathbb{V}{\rm ar}} (\widehat{\mathbf{Y}}) = \widetilde{\mathbf{X}} \widehat{\sigma}^2 \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \widetilde{\mathbf{X}}^\top = \widetilde{\mathbf{X}} \widehat{\mathbb{V}{\rm ar}} (\widehat{\boldsymbol{\beta}}) \widetilde{\mathbf{X}}^\top \] which means that we need to use the OLS estimator variance-covariance matrix when calculating the confidence intervals for the mean response.
The confidence intervals for the mean response are constructed from two endpoints, which are calculated as follows: \[ \widehat{Y}_i \pm t_{(1 - \alpha/2, N-2)} \times \text{se}(\widehat{Y}_i), \] where the standard error of the mean response is the square root of the corresponding diagonal elements of the mean response variance-covariance matrix: \(\text{se}(\widehat{Y}_i) = \sqrt{\widehat{\mathbb{V}{\rm ar}} (\widehat{Y}_i)}\).
We will begin by calculating the standard error of the mean response for the simple linear model:
Next, we will calculate the \(95\%\) confidence intervals for the mean response:
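A sketch of these two steps in Python (the names wage_mean_fit, wage_mean_lower and wage_mean_upper match those used in the plotting code below; beta_est_linear_var is the OLS variance-covariance matrix from Task 4, and var_mean, se_mean and t_crit are our helper names):
wage_mean_fit = np.dot(x_mat, beta_est_linear)                   # estimated mean response
var_mean = np.dot(np.dot(x_mat, beta_est_linear_var), x_mat.T)   # its variance-covariance matrix
se_mean = np.sqrt(np.diag(var_mean))                             # standard errors of the mean response
t_crit = stats.t.ppf(1 - 0.05 / 2, df = len(educ) - 2)
wage_mean_lower = wage_mean_fit - t_crit * se_mean               # lower confidence bound
wage_mean_upper = wage_mean_fit + t_crit * se_mean               # upper confidence bound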
Finally, we will plot the estimated mean response along with the lower and upper confidence interval bounds:
plot(educ, wage, ylim = c(min(wage_mean_fit), max(wage)))
#
lines(educ[order(educ)], wage_mean_fit[order(educ)])
#
abline(h = 0, col = "darkgreen", lwd = 2, lty = 2)
#
lines(educ[order(educ)], wage_mean_lower[order(educ)], lty = 2, col = "red")
#
lines(educ[order(educ)], wage_mean_upper[order(educ)], lty = 2, col = "red")
fig = plt.figure(num = 10, figsize = (10, 8))
_ = plt.plot(educ, wage, linestyle = "None", marker = "o", markeredgecolor = "black")
_ = plt.plot(educ[np.argsort(educ)], wage_mean_fit[np.argsort(educ)], linestyle = "-", color = "black")
_ = plt.axhline(y = 0, linestyle = "--", color = "darkgreen")
_ = plt.plot(educ[np.argsort(educ)], wage_mean_lower[np.argsort(educ)], linestyle = "--", color = "red")
_ = plt.plot(educ[np.argsort(educ)], wage_mean_upper[np.argsort(educ)], linestyle = "--", color = "red")
_ = plt.xlabel("educ")
_ = plt.ylabel("wage")
plt.show()
Considerations
Note that we can also calculate the mean response, its standard errors and confidence intervals automatically:
lm_predict <- predict(lm_fit_linear, newdata = data.frame(educ = educ),
interval = c("confidence"), level = 0.95)
print(t(head(lm_predict, 6)))
## 1 2 3 4 5 6
## fit 20.75794 23.15470 32.74174 20.75794 20.75794 27.94822
## lwr 19.92651 22.38520 31.47411 19.92651 19.92651 27.04421
## upr 21.58936 23.92419 34.00937 21.58936 21.58936 28.85223
lm_predict = lm_fit_linear.get_prediction(sm.add_constant(pd.DataFrame({'educ': educ})))
lm_predict = lm_predict.summary_frame(alpha = 0.05)
print(lm_predict.head(6).T)
## 0 1 2 3 4 5
## mean 20.757936 23.154698 32.741742 20.757936 20.757936 27.948220
## mean_se 0.423775 0.392209 0.646107 0.423775 0.423775 0.460771
## mean_ci_lower 19.926512 22.385204 31.474115 19.926512 19.926512 27.044212
## mean_ci_upper 21.589361 23.924191 34.009370 21.589361 21.589361 28.852227
## obs_ci_lower -5.845866 -3.447241 6.120737 -5.845866 -5.845866 1.342050
## obs_ci_upper 47.361739 49.756636 59.362747 47.361739 47.361739 54.554390
Which we can then use to plot the same charts:
plot(educ, wage, ylim = c(min(wage_mean_fit), max(wage)))
#
lines(educ[order(educ)], lm_predict[order(educ), "fit"])
#
abline(h = 0, col = "darkgreen", lwd = 2, lty = 2)
#
lines(educ[order(educ)], lm_predict[order(educ), "lwr"], lty = 2, col = "red")
#
lines(educ[order(educ)], lm_predict[order(educ), "upr"], lty = 2, col = "red")
fig = plt.figure(num = 11, figsize = (10, 8))
_ = plt.plot(educ, wage, linestyle = "None", marker = "o", markeredgecolor = "black")
_ = plt.plot(educ[np.argsort(educ)], lm_predict['mean'][np.argsort(educ)], linestyle = "-", color = "black")
_ = plt.axhline(y = 0, linestyle = "--", color = "darkgreen")
_ = plt.plot(educ[np.argsort(educ)], lm_predict.loc[np.argsort(educ), 'mean_ci_lower'], linestyle = "--", color = "red")
_ = plt.plot(educ[np.argsort(educ)], lm_predict.loc[np.argsort(educ), 'mean_ci_upper'], linestyle = "--", color = "red")
_ = plt.xlabel("educ")
_ = plt.ylabel("wage")
plt.show()
Note: we provided two examples of how to access a pd.DataFrame object in order to print specific columns and rows: either by first specifying the column name and then the rows, or by using the .loc[] indexer, not to be confused with .iloc[], which works with integer positions, as opposed to labels (names):
## [-10.399959246494694, -10.399959246494694, -3.2096756527975208, -3.2096756527975208, -3.2096756527975208]
## [-10.399959246494694, -10.399959246494694, -3.2096756527975208, -3.2096756527975208, -3.2096756527975208]
# If we know that `mean` is the first column:
print(lm_predict.iloc[np.argsort(educ), 0].head().tolist())
## [-10.399959246494694, -10.399959246494694, -3.2096756527975208, -3.2096756527975208, -3.2096756527975208]
Again, we can see from the numerical values, that the built-in functions for confidence interval calculation give identical results to the manual calculation.
When learning new methods, it is always a good idea to try to manually implement any new formulas you come across and compare them with the built-in functions in order to verify that they work as you would expect - not only will this provide a better intuition for the formulas themselves, but sometimes there are additional settings, or alternative formulas, used in the built-in function, which may give different results!
Next, we will calculate the standard error of the mean response for the log-linear model.
We will use the built-in functions to do so. First, we will calculate the mean response of \(\widehat{\log(\text{wage})}\) along with its confidence interval:
lm_mean_loglin <- predict(lm_fit_loglin, newdata = data.frame(educ = educ),
interval = c("confidence"), level = 0.95)
print(t(head(lm_mean_loglin, 6)))
## 1 2 3 4 5 6
## fit 2.880630 2.979383 3.374397 2.880630 2.880630 3.176890
## lwr 2.850896 2.951864 3.329063 2.850896 2.850896 3.144560
## upr 2.910363 3.006902 3.419730 2.910363 2.910363 3.209219
lm_mean_loglin = lm_fit_loglin.get_prediction(sm.add_constant(pd.DataFrame({'educ': educ})))
lm_mean_loglin = lm_mean_loglin.summary_frame(alpha = 0.05)
print(lm_mean_loglin.head(6).T)
## 0 1 2 3 4 5
## mean 2.880630 2.979383 3.374397 2.880630 2.880630 3.176890
## mean_se 0.015155 0.014026 0.023106 0.015155 0.015155 0.016478
## mean_ci_lower 2.850896 2.951864 3.329063 2.850896 2.850896 3.144560
## mean_ci_upper 2.910363 3.006902 3.419730 2.910363 2.910363 3.209219
## obs_ci_lower 1.929214 2.028034 2.422365 1.929214 1.929214 2.225389
## obs_ci_upper 3.832046 3.930733 4.326428 3.832046 3.832046 4.128391
Considerations
For the log-linear model we can immediately plot the mean response and confidence intervals of \(\widehat{\log(\text{wage})}\) (not to be confused with \(\widehat{\text{wage}}\)):
plot(educ, log(wage), main = expression("Mean Response of"~ widehat(log(wage))))
#
lines(educ[order(educ)], lm_mean_loglin[order(educ), "fit"])
#
lines(educ[order(educ)], lm_mean_loglin[order(educ), "lwr"], lty = 2, col = "red")
#
lines(educ[order(educ)], lm_mean_loglin[order(educ), "upr"], lty = 2, col = "red")
#
abline(h = 0, col = "darkgreen", lty = 2, lwd = 2)
fig = plt.figure(num = 12, figsize = (10, 8))
_ = plt.plot(educ, np.log(wage), linestyle = "None", marker = "o", markeredgecolor = "black")
_ = plt.plot(educ[np.argsort(educ)], lm_mean_loglin['mean'][np.argsort(educ)], linestyle = "-", color = "black")
_ = plt.plot(educ[np.argsort(educ)], lm_mean_loglin.loc[np.argsort(educ), 'mean_ci_lower'], linestyle = "--", color = "red")
_ = plt.plot(educ[np.argsort(educ)], lm_mean_loglin.loc[np.argsort(educ), 'mean_ci_upper'], linestyle = "--", color = "red")
_ = plt.xlabel("educ")
_ = plt.ylabel("$\\log(wage)$")
_ = plt.title("Mean Response of $\\widehat{\\log(wage)}$")
plt.show()
As mentioned, this is not the original value of wage, but rather its logarithmic transformation. Usually, we are interested in the original values, so it is important that we are able to transform the predicted values of the dependent variable back to their original scale.
Next, we will take the exponent of the mean response and confidence intervals of the log-linear model to get the values for wage, i.e. \(\widehat{\text{wage}} = \exp\{\widehat{\log(\text{wage})}\}\):
#
#
plot(educ, wage, ylim = c(0, 100), main = "Mean Response of wage")
#
lines(educ[order(educ)], exp(lm_mean_loglin[order(educ), "fit"]))
#
lines(educ[order(educ)], exp(lm_mean_loglin[order(educ), "lwr"]), lty = 2, col = "red")
#
lines(educ[order(educ)], exp(lm_mean_loglin[order(educ), "upr"]), lty = 2, col = "red")
#
abline(h = 0, col = "darkgreen", lty = 2, lwd = 2)
fig = plt.figure(num = 13, figsize = (10, 8))
_ = plt.plot(educ, wage, linestyle = "None", marker = "o", markeredgecolor = "black")
_ = plt.ylim(bottom = -5, top = 100)
_ = plt.title("Mean Response of wage")
_ = plt.plot(educ[np.argsort(educ)], np.exp(lm_mean_loglin['mean'][np.argsort(educ)]), linestyle = "-", color = "black")
_ = plt.plot(educ[np.argsort(educ)], np.exp(lm_mean_loglin.loc[np.argsort(educ), 'mean_ci_lower']), linestyle = "--", color = "red")
_ = plt.plot(educ[np.argsort(educ)], np.exp(lm_mean_loglin.loc[np.argsort(educ), 'mean_ci_upper']), linestyle = "--", color = "red")
_ = plt.axhline(y = 0, linestyle = "--", color = "darkgreen")
_ = plt.xlabel("educ")
_ = plt.ylabel("wage")
plt.show()
Considerations
When plotting the fitted values, we have always ordered/sorted the observation pairs \((X_i, Y_i)\) by \(X\). What happens if we do not do this? Simply put - the fitted value points will be connected in the order that they appear:
#
#
par(mfrow = c(1, 2))
#
#
plot(educ, wage, ylim = c(0, 100),
main = "Fitted value and X pairs are not ordered \n(i.e. as they are in the data)")
lines(educ, exp(lm_mean_loglin[, "fit"]), lwd = 2, col = "blue")
#
#
plot(educ, wage, ylim = c(0, 100),
main = "Fitted value and X pairs are ordered by X")
lines(educ[order(educ)], exp(lm_mean_loglin[order(educ), "fit"]), lwd = 2, col = "blue")
fig = plt.figure(num = 99, figsize = (10, 8))
_ = fig.add_subplot(121).plot(educ, wage, linestyle = "None", marker = "o", markeredgecolor = "black", markerfacecolor = 'None')
_ = plt.plot(educ, np.exp(lm_mean_loglin['mean']), linestyle = "-", color = "blue")
_ = plt.title("Fitted value and X pairs are not ordered \n(i.e. as they are in the data)")
_ = fig.add_subplot(122).plot(educ, wage, linestyle = "None", marker = "o", markeredgecolor = "black", markerfacecolor = 'None')
_ = plt.plot(educ[np.argsort(educ)], np.exp(lm_mean_loglin['mean'][np.argsort(educ)]), linestyle = "-", color = "blue")
_ = plt.title("Fitted value and X pairs are ordered by X")
for i, ax in enumerate(fig.axes):
_ = ax.set_ylabel("wage")
_ = ax.set_xlabel("educ")
_ = ax.set_ylim(bottom = -5, top = 100)
_ = plt.tight_layout()
plt.show()
As such, we need to do one of the following:
- order the data in the original dataset, unless the order of the observations is important! For example, if the data is autocorrelated, then the order must be kept;
- create a variable with the indexes of the ordered \(X\) values and use it to order the data for the plot only, as we have done throughout this task.
3.11.3.3 Task 16
The variance-covariance matrix of the forecast error \(\widetilde{\boldsymbol{e}} = \widetilde{\mathbf{Y}} - \widehat{\mathbf{Y}}\) is: \[ \begin{aligned} \mathbb{V}{\rm ar}\left( \widetilde{\boldsymbol{e}} \right) &= \widehat{\sigma}^2 \left( \mathbf{I} + \widetilde{\mathbf{X}} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \widetilde{\mathbf{X}}^\top\right) = \widehat{\sigma}^2 \mathbf{I} + \widehat{\mathbb{V}{\rm ar}} (\widehat{\mathbf{Y}}) \end{aligned} \] The prediction interval endpoints for \(\widehat{\mathbf{Y}}\) are: \[ \widehat{Y}_i \pm t_{(1 - \alpha/2, N-2)} \times \text{se}(\widetilde{e}_i), \] where the standard errors of the forecast are the square roots of the diagonal elements of the variance-covariance matrix of the forecast error: \(\text{se}(\widetilde{e}_i) = \sqrt{\widehat{\mathbb{V}{\rm ar}} (\widetilde{e}_i)}\).
The prediction variance-covariance matrix is:
Then the \(95\%\) prediction intervals are:
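A sketch of the manual calculation in Python (the names wage_pred_lower and wage_pred_upper match those used in the plotting code below; sigma2_est_linear, beta_est_linear_var and wage_mean_fit are the quantities computed earlier, and var_forecast, se_forecast and t_crit are our helper names):
# Diagonal of the forecast error variance-covariance matrix: sigma^2 + Var(mean response)
var_forecast = sigma2_est_linear + np.diag(np.dot(np.dot(x_mat, beta_est_linear_var), x_mat.T))
se_forecast = np.sqrt(var_forecast)                        # standard errors of the forecast
t_crit = stats.t.ppf(1 - 0.05 / 2, df = len(educ) - 2)
wage_pred_lower = wage_mean_fit - t_crit * se_forecast     # lower prediction bound
wage_pred_upper = wage_mean_fit + t_crit * se_forecast     # upper prediction bound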
Finally, we can plot the prediction intervals. We will plot them alongside the confidence intervals of the mean response:
#
plot(educ, wage, ylim = c(min(wage_pred_lower), max(wage)))
lines(educ[order(educ)], lm_predict[order(educ), "fit"])
abline(h = 0, col = "darkgreen", lwd = 2, lty = 2)
#
# CI
lines(educ[order(educ)], wage_mean_lower[order(educ)],
lty = 2, col = "red")
lines(educ[order(educ)], wage_mean_upper[order(educ)],
lty = 2, col = "red")
# PI
lines(educ[order(educ)], wage_pred_lower[order(educ)],
col = "blue", lty = 2)
lines(educ[order(educ)], wage_pred_upper[order(educ)],
col = "blue", lty = 2)
#
legend("topleft", legend = c("Simple Linear Model (Mean Response and Prediction)",
"Confidence Intervals for the Mean Response",
"Prediction Intervals"),
lty = c(1, 2, 2, 2), col = c("black", "red", "blue"))
fig = plt.figure(num = 14, figsize = (10, 8))
_ = plt.plot(educ, wage, linestyle = "None", marker = "o", markeredgecolor = "black")
_ = plt.plot(educ[np.argsort(educ)], lm_predict['mean'][np.argsort(educ)], linestyle = "-", color = "black",
label = "Simple Linear Model (Mean Response and Prediction)")
_ = plt.axhline(y = 0, linestyle = "--", color = "darkgreen")
# CI
_ = plt.plot(educ[np.argsort(educ)], wage_mean_lower[np.argsort(educ)], linestyle = "--", color = "red",
label = "Confidence Intervals for the Mean Response")
_ = plt.plot(educ[np.argsort(educ)], wage_mean_upper[np.argsort(educ)], linestyle = "--", color = "red")
# No need to repeat the label for the upper bound
# PI
_ = plt.plot(educ[np.argsort(educ)], wage_pred_lower[np.argsort(educ)], linestyle = "--", color = "blue",
label = "Prediction Intervals")
_ = plt.plot(educ[np.argsort(educ)], wage_pred_upper[np.argsort(educ)], linestyle = "--", color = "blue")
# No need to repeat the label for the upper bound
#
_ = plt.xlabel("educ")
_ = plt.ylabel("wage")
_ = plt.legend(loc = "upper left")
plt.show()
Considerations
As you might have guessed - there are also built-in functions to calculate the prediction intervals:
#
#
lm_predict <- predict(lm_fit_linear, newdata = data.frame(educ = educ),
interval = c("prediction"), level = 0.95)
print(t(head(lm_predict, 5)))
## 1 2 3 4 5
## fit 20.757936 23.154698 32.741742 20.757936 20.757936
## lwr -5.845866 -3.447241 6.120737 -5.845866 -5.845866
## upr 47.361739 49.756636 59.362747 47.361739 47.361739
from statsmodels.stats.outliers_influence import summary_table
#
lm_predict = summary_table(lm_fit_linear, alpha = 0.05)
lm_predict = pd.DataFrame(lm_predict[1], columns = lm_predict[2])
print(lm_predict.head(5).T)
## 0 1 2 3 4
## Obs 1.000000 2.000000 3.000000 4.000000 5.000000
## Dep Var\nPopulation 44.440000 16.000000 15.380000 13.540000 25.000000
## Predicted\nValue 20.757936 23.154698 32.741742 20.757936 20.757936
## Std Error\nMean Predict 0.423775 0.392209 0.646107 0.423775 0.423775
## Mean ci\n95% low 19.926512 22.385204 31.474115 19.926512 19.926512
## Mean ci\n95% upp 21.589361 23.924191 34.009370 21.589361 21.589361
## Predict ci\n95% low -5.845866 -3.447241 6.120737 -5.845866 -5.845866
## Predict ci\n95% upp 47.361739 49.756636 59.362747 47.361739 47.361739
## Residual 23.682064 -7.154698 -17.361742 -7.217936 4.242064
## Std Error\nResidual 13.546654 13.547605 13.537872 13.546654 13.546654
## Student\nResidual 1.748185 -0.528115 -1.282457 -0.532821 0.313145
## Cook's\nD 0.001495 0.000117 0.001873 0.000139 0.000048
Note that we can get both the mean and prediction intervals from this table. For the prediction intervals, we need the values in the Predict ci\n95% low and Predict ci\n95% upp columns (we printed the first few values of the transposed output to make it easier to see all the columns).
lm_predict = lm_predict.loc[:, ['Predicted\nValue', 'Predict ci\n95% low', 'Predict ci\n95% upp']]
lm_predict.columns = ["pred", "pred_lower", "pred_upper"]
print(lm_predict.head(5).T)
## 0 1 2 3 4
## pred 20.757936 23.154698 32.741742 20.757936 20.757936
## pred_lower -5.845866 -3.447241 6.120737 -5.845866 -5.845866
## pred_upper 47.361739 49.756636 59.362747 47.361739 47.361739
Unfortunately, we cannot get prediction intervals for new values of \(X\) using summary_table(). Fortunately, we can use another function, like wls_prediction_std():
from statsmodels.sandbox.regression.predstd import wls_prediction_std
#
lm_predict_ci = wls_prediction_std(lm_fit_linear, exog = sm.add_constant(pd.DataFrame({'educ': educ})), alpha = 0.05)
# first element is the standard deviation;
# second is the lower and last is the upper prediction interval
lm_predict_ci = pd.DataFrame(lm_predict_ci, index = ["sdev", "pred_lower", "pred_upper"]).T
print(lm_predict_ci.head(5).T)
## 0 1 2 3 4
## sdev 13.559904 13.558955 13.568673 13.559904 13.559904
## pred_lower -5.845866 -3.447241 6.120737 -5.845866 -5.845866
## pred_upper 47.361739 49.756636 59.362747 47.361739 47.361739
We can also use the previously introduced .get_prediction() method:
## 0 1 2 3 4
## mean 20.757936 23.154698 32.741742 20.757936 20.757936
## mean_se 0.423775 0.392209 0.646107 0.423775 0.423775
## mean_ci_lower 19.926512 22.385204 31.474115 19.926512 19.926512
## mean_ci_upper 21.589361 23.924191 34.009370 21.589361 21.589361
## obs_ci_lower -5.845866 -3.447241 6.120737 -5.845866 -5.845866
## obs_ci_upper 47.361739 49.756636 59.362747 47.361739 47.361739
Note that if we use wls_prediction_std() we also get the forecast standard error, while with get_prediction() we get the standard error of the mean response.
Finally, we can plot the predicted values alongside their prediction intervals:
#
plot(educ, wage,
ylim = c(min(lm_predict[order(educ), "lwr"]), max(wage)))
#
lines(educ[order(educ)], lm_predict[order(educ), "fit"])
#
lines(educ[order(educ)], lm_predict[order(educ), "lwr"], lty = 2, col = "blue")
#
lines(educ[order(educ)], lm_predict[order(educ), "upr"], lty = 2, col = "blue")
fig = plt.figure(num = 15, figsize = (10, 8))
_ = plt.plot(educ, wage, linestyle = "None", marker = "o", markeredgecolor = "black")
_ = plt.ylim(bottom = np.min(lm_predict['pred_lower']), top = np.max(wage))
_ = plt.plot(educ[np.argsort(educ)], lm_predict['pred'][np.argsort(educ)], linestyle = "-", color = "black")
_ = plt.plot(educ[np.argsort(educ)], lm_predict.loc[np.argsort(educ), 'pred_lower'], linestyle = "--", color = "blue")
_ = plt.plot(educ[np.argsort(educ)], lm_predict.loc[np.argsort(educ), 'pred_upper'], linestyle = "--", color = "blue")
_ = plt.xlabel("educ")
_ = plt.ylabel("wage")
plt.show()
Next, we will calculate the standard error of the forecast (and consequently, the prediction intervals) for the log-linear model.
We will use the built-in functions to do so. First, we will calculate \(\widehat{\log(\text{wage})}\) along with its prediction interval:
#
#
#
lm_predict_loglin <- predict(lm_fit_loglin,
newdata = data.frame(educ = educ),
interval = c("predict"), level = 0.95)
print(t(head(lm_predict_loglin, 5)))
## 1 2 3 4 5
## fit 2.880630 2.979383 3.374397 2.880630 2.880630
## lwr 1.929214 2.028034 2.422365 1.929214 1.929214
## upr 3.832046 3.930733 4.326428 3.832046 3.832046
from statsmodels.stats.outliers_influence import summary_table
#
lm_predict_loglin = summary_table(lm_fit_loglin, alpha = 0.05)
lm_predict_loglin = pd.DataFrame(lm_predict_loglin[1], columns = lm_predict_loglin[2])
lm_predict_loglin = lm_predict_loglin.loc[:, ['Predicted\nValue', 'Predict ci\n95% low', 'Predict ci\n95% upp']]
lm_predict_loglin.columns = ["pred", "pred_lower", "pred_upper"]
print(lm_predict_loglin.head(5).T)
## 0 1 2 3 4
## pred 2.880630 2.979383 3.374397 2.880630 2.880630
## pred_lower 1.929214 2.028034 2.422365 1.929214 1.929214
## pred_upper 3.832046 3.930733 4.326428 3.832046 3.832046
Next, we will take the exponent of the predicted value in the log-linear model to get the predicted value for wage: \(\widehat{\text{wage}} = \exp(\widehat{\log(\text{wage})})\). We will do the same for the prediction interval.
#
#
plot(educ, wage, ylim = c(0, max(wage)))
#
lines(educ[order(educ)], exp(lm_predict_loglin[order(educ), "fit"]))
#
lines(educ[order(educ)], exp(lm_predict_loglin[order(educ), "lwr"]),
lty = 2, col = "blue")
#
lines(educ[order(educ)], exp(lm_predict_loglin[order(educ), "upr"]),
lty = 2, col = "blue")
#
abline(h = 0, col = "darkgreen", lty = 2, lwd = 2)
fig = plt.figure(num = 16, figsize = (10, 8))
_ = plt.plot(educ, wage, linestyle = "None", marker = "o", markeredgecolor = "black")
_ = plt.ylim(bottom = -5, top = np.max(wage))
_ = plt.plot(educ[np.argsort(educ)], np.exp(lm_predict_loglin['pred'][np.argsort(educ)]),
linestyle = "-", color = "black")
_ = plt.plot(educ[np.argsort(educ)], np.exp(lm_predict_loglin.loc[np.argsort(educ), 'pred_lower']),
linestyle = "--", color = "blue")
_ = plt.plot(educ[np.argsort(educ)], np.exp(lm_predict_loglin.loc[np.argsort(educ), 'pred_upper']),
linestyle = "--", color = "blue")
_ = plt.axhline(y = 0, linestyle = "--", color = "darkgreen")
_ = plt.xlabel("educ")
_ = plt.ylabel("wage")
plt.show()
Considerations
Looking at the prediction intervals for the simple linear model - would it make sense that the intervals cover negative values? Think about the aim of prediction intervals (i.e. what do they show) by re-examining section 3.7.3.
In such cases, where we are certain that some values are unattainable, forecasts and their prediction intervals can be cut off - here, it may make sense to cut off the forecasts and their intervals at zero by setting \(\widetilde{\widehat{Y}} = \max\{0, \widehat{Y}\}\) for the simple linear model.
On the other hand, we see that the forecasts and the prediction intervals are non-negative for the log-linear model.
Finally, since we used the same exogeneous variable observations that we used in model estimation, the forecasts are identical to the mean prediction. However, we may be interested to examine the forecasts with new, unknown values of educ
, which is what we will do in the next Task
.
3.11.3.4 Task 17
We will begin by calculating the new values of \(X\):
## new_x_1 new_x_2
## [1,] 0 23.1
new_x_1 = 0.8 * np.min(educ)
new_x_2 = 1.1 * np.max(educ)
pd.DataFrame([new_x_1, new_x_2], index = ['new_x_1', 'new_x_2']).T
## new_x_1 new_x_2
## 0 0.0 23.1
In this data sample the smallest value of educ is zero, so one of the calculated educ values already exists in our historical data. On the other hand, we cannot use a value of educ that is less than zero, so we will continue with these values as per the task requirements.
Prediction is pretty straightforward with the built-in functions in R, but requires a bit more work in Python:
## educ
## 1 0.0
## 2 23.1
## const educ
## 0 1.0 0.0
## 1 1.0 23.1
Note that we need to pass the constant.
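The exact code used to construct x_new is not shown here; a minimal Python sketch (an assumption, using the new_x_1 and new_x_2 values calculated above) that would produce such a data frame is:
# Hypothetical construction of the new exogenous data (not shown in the original text)
x_new = pd.DataFrame({'educ': [new_x_1, new_x_2]})
# Add the constant column, since the model was estimated with an intercept
x_new = sm.add_constant(x_new)
print(x_new)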
In Python, the prediction intervals can be calculated either with the wls_prediction_std() function, which is separate from the one used to calculate the predicted value, or with the get_prediction() method, which returns both the confidence interval of the mean response and the prediction interval.
from statsmodels.sandbox.regression.predstd import wls_prediction_std
#
temp_1 = lm_fit_linear.get_prediction(x_new).summary_frame(alpha = 0.05)
print(temp_1.T)
## 0 1
## mean -10.399959 44.965224
## mean_se 1.962400 1.266652
## mean_ci_lower -14.250083 42.480122
## mean_ci_upper -6.549835 47.450327
## obs_ci_lower -37.268053 18.258544
## obs_ci_upper 16.468135 71.671905
temp_2 = wls_prediction_std(lm_fit_linear, exog = x_new, alpha = 0.05)
temp_2 = pd.DataFrame(temp_2, index = ["sdev", "pred_lower", "pred_upper"]).T
print(temp_2.T)
## 0 1
## sdev 13.694614 13.612341
## pred_lower -37.268053 18.258544
## pred_upper 16.468135 71.671905
The wls_prediction_std() function provides the forecast standard error, whereas get_prediction() provides the standard error of the mean response.
All the required values are available in the predict()
function output:
lin_predict_new_y <- predict(lm_fit_linear, newdata = x_new,
interval = c("predict"), level = 0.95)
print(lin_predict_new_y)
## fit lwr upr
## 1 -10.39996 -37.26805 16.46813
## 2 44.96522 18.25854 71.67190
If we wanted, we could combine data from the two outputs:
lin_predict_new_y = pd.concat([temp_1[['mean']].reset_index(drop = True),
temp_2[["pred_lower", "pred_upper"]].reset_index(drop = True)], axis = 1)
print(lin_predict_new_y)
## mean pred_lower pred_upper
## 0 -10.399959 -37.268053 16.468135
## 1 44.965224 18.258544 71.671905
Or use the results from the get_prediction()
method:
## mean obs_ci_lower obs_ci_upper
## 0 -10.399959 -37.268053 16.468135
## 1 44.965224 18.258544 71.671905
#
#
#
#
#
#
#
loglin_predict_new_y <- predict(lm_fit_loglin,
newdata = x_new,
interval = c("predict"), level = 0.95)
print(loglin_predict_new_y)
## fit lwr upr
## 1 1.596835 0.6359675 2.557703
## 2 3.878039 2.9229438 4.833134
# Old way:
#temp_1 = lm_fit_loglin.get_prediction(x_new).summary_frame(alpha = 0.05).loc[:, ['mean']]
#temp_2 = wls_prediction_std(lm_fit_loglin, exog = x_new, alpha = 0.05)[1:3]
#temp_2 = pd.DataFrame(temp_2, index = ["pred_lower", "pred_upper"]).T
#loglin_predict_new_y = pd.concat([temp_1.reset_index(drop = True), temp_2.reset_index(drop = True)], axis = 1)
#print(loglin_predict_new_y)
# New way
temp_1 = lm_fit_loglin.get_prediction(x_new).summary_frame(alpha = 0.05)
loglin_predict_new_y = temp_1[['mean', 'obs_ci_lower', 'obs_ci_upper']]
loglin_predict_new_y.columns = ['mean', 'pred_lower', 'pred_upper']
print(loglin_predict_new_y)
## mean pred_lower pred_upper
## 0 1.596835 0.635968 2.557703
## 1 3.878039 2.922944 4.833134
quad_predict_new_y <- predict(lm_fit_quad,
newdata = x_new,
interval = c("predict"), level = 0.95)
print(quad_predict_new_y)
## fit lwr upr
## 1 4.916477 -21.55861 31.39156
## 2 52.479276 25.90049 79.05806
temp_1 = lm_fit_quad.get_prediction(x_new**2).summary_frame(alpha = 0.05)
quad_predict_new_y = temp_1[['mean', 'obs_ci_lower', 'obs_ci_upper']]
quad_predict_new_y.columns = ['mean', 'pred_lower', 'pred_upper']
print(quad_predict_new_y)
## mean pred_lower pred_upper
## 0 4.916477 -21.558608 31.391562
## 1 52.479276 25.900494 79.058058
Note that for the quadratic model, we need to pass the transformed data, in this case, x_new**2
.
On the other hand, if we had used the model with the formula already supplied:
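The formula-based fit itself is not shown above; a minimal sketch of such a model (an assumption, using the smf interface from the setup and the temp_mdl_fit name referenced further below) could be:
# Hypothetical formula-based quadratic fit - the transformation is handled inside the formula
temp_mdl_fit = smf.ols("wage ~ I(educ**2)", data = dt4).fit()
print(temp_mdl_fit.summary().tables[1])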
## ================================================================================
## coef std err t P>|t| [0.025 0.975]
## --------------------------------------------------------------------------------
## Intercept 4.9165 1.092 4.503 0.000 2.774 7.059
## I(educ ** 2) 0.0891 0.005 18.347 0.000 0.080 0.099
## ================================================================================
Then if we use the original data for prediction we get what we expect:
## mean
## 0 4.916477
## 1 52.479276
However, if we choose to pass the original data to the prediction interval calculation:
## obs_ci_lower obs_ci_upper
## 0 -21.558608 31.391562
## 1 25.900494 79.058058
print(pd.DataFrame(wls_prediction_std(temp_mdl_fit, exog = x_new, alpha = 0.05)[1:3],
index = ["pred_lower", "pred_upper"]).T)
## pred_lower pred_upper
## 0 -21.558608 31.391562
## 1 -19.483872 33.434818
We see that the result is incorrect for the wls_prediction_std() function. To fix this, we would need to pass the transformed values, regardless of whether or not we used formula notation:
print(pd.DataFrame(wls_prediction_std(temp_mdl_fit, exog = x_new**2, alpha = 0.05)[1:3],
index = ["pred_lower", "pred_upper"]).T)
## pred_lower pred_upper
## 0 -21.558608 31.391562
## 1 25.900494 79.058058
Again, note that this currently affects wls_prediction_std() and not the .get_prediction() method.
Considerations
Furthermore, we can plot these data points along with the true values:
#
#
#
plot(educ, wage, ylim = c(-40, 125), xlim = c(0, 25))
#
points(c(new_x_1, new_x_2), lin_predict_new_y[, "fit"],
col = "red", pch = 19)
points(c(new_x_1, new_x_2), exp(loglin_predict_new_y[, "fit"]),
col = "orange", pch = 19)
points(c(new_x_1, new_x_2), quad_predict_new_y[, "fit"],
col = "blue", pch = 20)
abline(h = 0, col = "darkgreen", lty = 2)
# Add prediction intervals:
points(c(new_x_1, new_x_2), lin_predict_new_y[, "lwr"],
col = "red", pch = 4)
points(c(new_x_1, new_x_2), lin_predict_new_y[, "upr"],
col = "red", pch = 4)
#
points(c(new_x_1, new_x_2), exp(loglin_predict_new_y[, "lwr"]),
col = "darkorange", pch = 4)
points(c(new_x_1, new_x_2), exp(loglin_predict_new_y[, "upr"]),
col = "darkorange", pch = 4)
#
points(c(new_x_1, new_x_2), quad_predict_new_y[, "lwr"],
col = "blue", pch = 4)
points(c(new_x_1, new_x_2), quad_predict_new_y[, "upr"],
col = "blue", pch = 4)
# Add legend
#
#
#
#
legend("topleft", cex = 0.75,
legend = c("Sample data", "simple linear model",
"log-linear model", "quadratic model",
"prediction interval, color indicates model"),
pch = c(1, 19, 19, 20, 4),
col = c("black", "red", "orange", "blue", "black"))
from matplotlib.lines import Line2D
#
fig = plt.figure(num = 17, figsize = (10, 8))
_ = plt.plot(educ, wage, linestyle = "None", marker = "o", markeredgecolor = "black", color = "none")
_ = plt.ylim(bottom = -40, top = 125)
_ = plt.plot([new_x_1, new_x_2], lin_predict_new_y['mean'],
linestyle = "None", marker = 'o', color = "red", markersize = 8)
_ = plt.plot([new_x_1, new_x_2], np.exp(loglin_predict_new_y['mean']),
linestyle = "None", marker = 'o', color = "orange", markersize = 8)
_ = plt.plot([new_x_1, new_x_2], quad_predict_new_y['mean'],
linestyle = "None", marker = 'o', color = "blue", markersize = 5)
_ = plt.axhline(y = 0, linestyle = "--", color = "darkgreen")
# Add prediction intervals:
_ = plt.plot([new_x_1, new_x_2], lin_predict_new_y['pred_lower'],
linestyle = "None", marker = 'x', color = "red", markersize = 8)
_ = plt.plot([new_x_1, new_x_2], lin_predict_new_y['pred_upper'],
linestyle = "None", marker = 'x', color = "red", markersize = 8)
#
_ = plt.plot([new_x_1, new_x_2], np.exp(loglin_predict_new_y['pred_lower']),
linestyle = "None", marker = 'x', color = "orange", markersize = 9)
_ = plt.plot([new_x_1, new_x_2], np.exp(loglin_predict_new_y['pred_upper']),
linestyle = "None", marker = 'x', color = "orange", markersize = 9)
#
_ = plt.plot([new_x_1, new_x_2], quad_predict_new_y['pred_lower'],
linestyle = "None", marker = 'x', color = "blue", markersize = 8)
_ = plt.plot([new_x_1, new_x_2], quad_predict_new_y['pred_upper'],
linestyle = "None", marker = 'x', color = "blue", markersize = 8)
# Add legend
colors = ["red", "orange", "blue", "black"]
markers = ["o"] * 3 +["x"]
lines = [Line2D([0], [0], color = "none", marker = "o", linestyle = "None", markeredgecolor = "black",)]
lines = lines + [Line2D([0], [0], color = c, marker = m, linestyle = "None") for c, m in zip(colors, markers)]
labels = ["Sample data", "simple linear model", "log-linear model", "quadratic model", "prediction interval, color indicates model"]
_ = plt.legend(lines, labels)
#
_ = plt.xlabel("educ")
_ = plt.ylabel("wage")
plt.show()
- We see that the log-linear and the quadratic model predicted mean response values are close, when \(\widetilde{\text{educ}} = 0.8 \cdot \min(\text{educ}_1, ..., \text{educ}_N) = 0\).
- When \(\widetilde{\text{educ}} = 1.1 \cdot \max(\text{educ}_1, ..., \text{educ}_N)\), we see that the quadratic model has the largest predicted value.
From the prediction intervals we see that those of the quadratic model contain negative values (not realistic for wage), but for the larger educ value they are narrower than the prediction intervals of the log-linear model.
While the negative prediction interval is not quite realistic (for the quadratic model we could always manually cap the interval at zero, since \(\text{educ} \geq 0\) and \(\widehat{\beta}_0 = 4.91647 > 0\)), the quadratic model's narrower prediction interval for the larger educ value may be more plausible - from our data sample it appears that for \(\text{educ} > 20\) the variance in wage is smaller than the variance when educ is between 15 and 20.
3.11.4 Exercise Set 4
3.11.4.1 Task 18
We will begin by manually calculating \(R^2\) for the simple linear model.
We will need to calculate the following: \[ \text{RSS} = \sum_{i = 1}^N \widehat{\epsilon}_i^2 \] and \[ \text{TSS} = \sum_{i = 1}^N (Y_i - \overline{Y})^2 \]
This will allow us to calculate \(R^2\) as: \[ R^2 = 1 - \dfrac{\text{RSS}}{\text{TSS}} \]
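A short Python sketch of this manual calculation (assuming the lm_fit_linear results object and the wage variable from the earlier tasks):
# Manual R^2: residual sum of squares and total sum of squares
RSS = np.sum(lm_fit_linear.resid**2)
TSS = np.sum((wage - np.mean(wage))**2)
print(np.round(1 - RSS / TSS, 5))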
## [1] 0.20733
## 0.20733
This value means that our simple linear regression explains roughly \(20.7\%\) of the variation in wage
.
Since we have a regression with a single explanatory variable, we could say that \(20.7\%\) of the variation in wage is explained by education (while the rest is attributed to the unobservable random shocks \(\epsilon\)).
This can also be done automatically from the model results (see the short sketch after the list below):
- For the simple linear model:
## [1] 0.20733
## 0.20733
- For the log-linear model:
## [1] 0.25771
## 0.25771
We have that \(25.77\%\) of the variation in \(\log(\text{wage})\) is explained by our regression model.
- For the quadratic model:
## [1] 0.21936
## 0.21936
We have that \(21.94\%\) of the variation in wage is explained by our quadratic model.
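As mentioned above, the built-in \(R^2\) values can be read directly from the fitted results objects - a short sketch (assuming the model objects estimated earlier):
# The R-squared of each model is stored on the fitted statsmodels results object
print(np.round(lm_fit_linear.rsquared, 5))
print(np.round(lm_fit_loglin.rsquared, 5))
print(np.round(lm_fit_quad.rsquared, 5))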
Furthermore:
- We can directly compare \(R^2\) of the simple linear and quadratic models - our quadratic model is slightly better, as it explains around \(1\%\) more variation than the linear model.
- The \(R^2\) in the log-linear model is not comparable to the \(R^2\) in the simple linear and quadratic models, since the dependent variable is transformed (and in this case, its scale is logarithmic).
- If the dependent variable were simply scaled by some constant - our \(R^2\) would be unchanged and could be comparable (see the relevant section on \(R^2\) in section 3.8).
3.11.4.2 Task 19
From the lecture notes, \(R^2_g\) can be calculated as: \[ R^2_g = \mathbb{C}{\rm orr}(Y, \widehat{Y})^2 \]
- For the simple linear model:
## [1] 0.20733
- For the log-linear model:
- For the quadratic model:
## [1] 0.21936
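These values can be reproduced as the squared correlation between the observed and fitted values - a sketch for the simple linear model (for the log-linear model, the fitted values would first have to be transformed back to the wage scale before comparing):
# General R^2 for the simple linear model: squared correlation of Y and Y-hat
print(np.round(np.corrcoef(wage, lm_fit_linear.fittedvalues)[0, 1]**2, 5))
# For the log-linear model one would compare wage with np.exp(lm_fit_loglin.fittedvalues)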
\(R^2_g\) coincides with the standard \(R^2\) for the simple linear model, and likewise for the quadratic model. It is different (and slightly lower) for the log-linear model.
The general \(R^2\) (i.e. \(R^2_g\)) is the largest for the quadratic model, though not by much. On the other hand, the difference in \(R^2_g\) between the log-linear and quadratic models is smaller than their difference from the simple linear model.
3.11.4.3 Task 20
The largest \(R^2\) is for the log-linear model but, as mentioned before, it is not comparable to the linear and quadratic models, because the dependent variables do not have the same transformations. For the log-linear model, \(R^2 = 0.25771\) indicates that \(25.77\%\) of the variation in \(\log(\text{wage})\) is explained by our (log-linear) regression model. In this case, the logarithmic scale partially contributes to an increase in \(R^2\). From the calculated \(R^2_g\) we see that the log-linear model is indeed better than the simple linear model, though slightly worse than the quadratic model.
Considerations
Overall, \(R^2\) did not help us identify a model that is noticeably better than the rest. This is a great example of why we should treat \(R^2\) only as a reference, and pay closer attention to coefficient significance and signs, as well as to the predicted values and prediction intervals, to make sure they make economic sense.
3.11.5 Exercise Set 5
While we have analysed coefficient signs, significance, predicted values and \(R^2\) - we have only briefly examined the model residuals. Remember that our model assumptions place certain requirements on the residual normality, autocorrelation and homoskedasticity.
3.11.5.1 Task 21
We will begin by examining the residual vs fitted scatterplot:
#
#
par(mfrow = c(1, 3))
#
#
#Linear Model
plot(lm_fit_linear$fitted.values, lm_fit_linear$residuals,
main = "Simple linear: residual vs fitted")
#
#
#Log-linear
plot(lm_fit_loglin$fitted.values, lm_fit_loglin$residuals,
main = "Log-linear: residual vs fitted")
#
#
#Quadratic Model
plot(lm_fit_quad$fitted.values, lm_fit_quad$residuals,
main = "Quadratic: residual vs fitted")
fig = plt.figure(num = 18, figsize = (10, 8))
#Linear Model
_ = fig.add_subplot(131).plot(lm_fit_linear.fittedvalues, lm_fit_linear.resid,
linestyle = "None", marker = "o", markeredgecolor = "black", color = "none")
_ = plt.title("Simple linear: residual vs fitted")
#Log-linear
_ = fig.add_subplot(132).plot(lm_fit_loglin.fittedvalues, lm_fit_loglin.resid, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Log-linear: residual vs fitted")
#Quadratic Model
_ = fig.add_subplot(133).plot(lm_fit_quad.fittedvalues, lm_fit_quad.resid, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Quadratic: residual vs fitted")
_ = plt.tight_layout()
for i, ax in enumerate(fig.axes):
_ = ax.set_ylabel("residuals")
_ = ax.set_xlabel("fitted")
plt.show()
We see that, of the three plots, the log-linear model residuals appear to be free of any non-linear relationships. Since we have fewer observations for lower values of wage, it may appear that the variance is not constant. On the other hand, it is also possible that for lower wage values the residuals genuinely have a smaller variance than for higher wage values.
If we check the residual vs independent variable scatter plot:
#
#
#
#
par(mfrow = c(1, 3))
#
#
#Linear Model
plot(lm_fit_linear$model$educ, lm_fit_linear$residuals,
main = "Simple linear: residual vs X")
#
#
#Log-linear
plot(lm_fit_loglin$model$educ, lm_fit_loglin$residuals,
main = "Log-linear: residual vs X")
#
#
#Quadratic Model
plot(lm_fit_quad$model$`I(educ^2)`, lm_fit_quad$residuals,
main = "Quadratic: residual vs X")
fig = plt.figure(num = 19, figsize = (10, 8))
#Linear Model
_ = fig.add_subplot(131).plot(educ, lm_fit_linear.resid,
linestyle = "None", marker = "o", markeredgecolor = "black", color = "none")
_ = plt.title("Simple linear: residual vs X")
_ = plt.xlabel("educ")
#Log-linear
_ = fig.add_subplot(132).plot(educ, lm_fit_loglin.resid, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Log-linear: residual vs X")
_ = plt.xlabel("educ")
#Quadratic Model
_ = fig.add_subplot(133).plot(educ**2, lm_fit_quad.resid, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Quadratic: residual vs X")
_ = plt.xlabel("educ^2")
for i, ax in enumerate(fig.axes):
_ = ax.set_ylabel("residuals")
_ = plt.tight_layout()
plt.show()
we arrive at the same conclusions as we did for the residual vs fitted scatter plots.
Considerations
Note that if we wanted to access the data used in model estimation, we would need to examine the model object before fitting it (more info here):
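For instance, for the quadratic model, the relevant attributes of an unfitted statsmodels model could be inspected along these lines (a sketch - the attribute names are standard statsmodels ones, but the exact code behind the output below is an assumption):
# Build (but do not fit) the formula-based model and inspect its data attributes
temp_mdl = smf.ols("wage ~ I(educ**2)", data = dt4)
print(temp_mdl.formula)                 # the model formula
print(temp_mdl.endog[:5])               # endogenous variable as a numpy array
print(temp_mdl.data.orig_endog.head())  # ... and as the original pandas object
print(temp_mdl.exog[:5])                # design matrix as a numpy array
print(temp_mdl.data.orig_exog.head())   # ... and as the original pandas object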
## wage ~ I(educ**2)
## [44.44 16. 15.38 13.54 25. ]
## wage
## 0 44.44
## 1 16.00
## 2 15.38
## 3 13.54
## 4 25.00
## [[ 1. 169.]
## [ 1. 196.]
## [ 1. 324.]
## [ 1. 169.]
## [ 1. 169.]]
## Intercept I(educ ** 2)
## 0 1.0 169.0
## 1 1.0 196.0
## 2 1.0 324.0
## 3 1.0 169.0
## 4 1.0 169.0
once we overwrite the model object with the output of the .fit() method (as we do in these examples), these attributes are no longer directly available on that variable - though they can still be reached through the results object's .model attribute.
Next, we will examine the residual Q-Q plot and histogram (we will also plot a boxplot (excluding outliers) for comparison):
par(mfrow = c(3, 3))
#Linear Model
qqnorm(lm_fit_linear$residuals, main = "Simple linear model residuals:\nQ-Q plot")
qqline(lm_fit_linear$residuals, col = "red", lwd = 2)
hist(lm_fit_linear$residuals, col = "cornflowerblue", breaks = 25,
main = "Simple linear model residuals:\nHistogram")
boxplot(lm_fit_linear$residuals, horizontal = TRUE, outline = FALSE,
main = "Simple linear model residuals:\nBoxplot (hidden outliers)")
#Log-linear
qqnorm(lm_fit_loglin$residuals, main = "Log-linear model residuals:\nQ-Q plot")
qqline(lm_fit_loglin$residuals, col = "red", lwd = 2)
hist(lm_fit_loglin$residuals, col = "cornflowerblue", breaks = 25,
main = "Log-linear model residuals:\nHistogram")
boxplot(lm_fit_loglin$residuals, horizontal = TRUE, outline = FALSE,
main = "Log-linear model residuals:\nBoxplot (hidden outliers)")
#Quadratic Model
qqnorm(lm_fit_quad$residuals, main = "Quadratic model residuals: Q-Q plot")
qqline(lm_fit_quad$residuals, col = "red", lwd = 2)
#
hist(lm_fit_quad$residuals, col = "cornflowerblue", breaks = 25,
main = "Quadratic model residuals:\nHistogram")
#
boxplot(lm_fit_quad$residuals, horizontal = TRUE, outline = FALSE,
main = "Quadratic model residuals:\nBoxplot (hidden outliers)")
fig = plt.figure(num = 20, figsize = (10, 8))
#Linear Model
_ = stats.probplot(lm_fit_linear.resid, dist = "norm", plot = fig.add_subplot(331))
_ = plt.title("Simple linear model residuals:\nQ-Q plot")
_ = fig.add_subplot(332).hist(lm_fit_linear.resid, bins = 25, histtype = 'bar', color = "cornflowerblue", ec = 'black')
_ = plt.title("Simple linear model residuals:\nHistogram")
_ = fig.add_subplot(333).boxplot(lm_fit_linear.resid, vert = False, showfliers = False)
_ = plt.title("Simple linear model residuals:\nBoxplot (hidden outliers)")
#Log-linear
_ = stats.probplot(lm_fit_loglin.resid, dist = "norm", plot = fig.add_subplot(334))
_ = plt.title("Log-linear model residuals:\nQ-Q plot")
_ = fig.add_subplot(335).hist(lm_fit_loglin.resid, bins = 25, histtype = 'bar', color = "cornflowerblue", ec = 'black')
_ = plt.title("Log-linear model residuals:\nHistogram")
_ = fig.add_subplot(336).boxplot(lm_fit_loglin.resid, vert = False, showfliers = False)
_ = plt.title("Log-linear model residuals:\nBoxplot (hidden outliers)")
#Quadratic Model
_ = stats.probplot(lm_fit_quad.resid, dist = "norm", plot = fig.add_subplot(337))
_ = plt.title("Quadratic model residuals:\nQ-Q plot")
_ = fig.add_subplot(338).hist(lm_fit_quad.resid, bins = 25, histtype = 'bar', color = "cornflowerblue", ec = 'black')
_ = plt.title("Quadratic model residuals:\nHistogram")
_ = fig.add_subplot(339).boxplot(lm_fit_quad.resid, vert = False, showfliers = False)
_ = plt.title("Quadratic model residuals:\nBoxplot (hidden outliers)")
_ = plt.tight_layout()
plt.show()
- For the simple linear model - it appears that the residuals are skewed and come from a non-normal distribution - the residuals on the Q-Q plot do not fall along a straight line with the theoretical quantiles of a normal distribution. Furthermore, the histogram indicates that the residual distribution is skewed to the right (we can also see that the median is closer to the first quartile than to the third quartile).
- For the log-linear model - the residuals appear to closely resemble a normal distribution - the histogram has the bell shape familiar from a normal distribution, while the Q-Q plot falls along a straight line.
- For the quadratic model - unfortunately, the residuals of a quadratic model do not appear to be from a normal distribution - they are quite similar to the residuals from a simple linear model.
In this case, we see that, while the quadratic model appears to be similar to the log-linear model in terms of predicted values and \(R^2\), its residuals do not satisfy the normality assumption, which we require in order to test hypotheses about coefficient significance and their distribution.
3.11.5.2 Task 22
We will carry out a number of tests on the model residuals.
3.11.5.2.1 Testing for Residual Homoskedasticity
We will begin by testing the following hypothesis: \[ \begin{aligned} H_0&: \text{ (residuals are homoskedastic)}\\ H_1&: \text{ (residuals are heteroskedastic)} \end{aligned} \] via the Breusch-Pagan test:
##
## studentized Breusch-Pagan test
##
## data: lm_fit_linear
## BP = 7.4587, df = 1, p-value = 0.006313
from statsmodels.compat import lzip
import statsmodels.stats.diagnostic as sm_diagnostic
# Breusch–Pagan Test
test = sm_diagnostic.het_breuschpagan(resid = lm_fit_linear.resid, exog_het = sm.add_constant(educ))
name = ['Lagrange multiplier statistic', 'p-value', 'F-value', 'F p-value']
for a, b in lzip(name, np.round(test, 6)): print("{}: {}".format(a, b))
## Lagrange multiplier statistic: 7.458728
## p-value: 0.006313
## F-value: 7.492869
## F p-value: 0.006286
Since p-value = 0.0063 < 0.05
, we reject the null hypothesis and are left with the alternative that the simple linear regression residuals are heteroskedastic.
##
## studentized Breusch-Pagan test
##
## data: lm_fit_loglin
## BP = 6.9769, df = 1, p-value = 0.008257
# Breusch–Pagan Test
test = sm_diagnostic.het_breuschpagan(resid = lm_fit_loglin.resid, exog_het = sm.add_constant(educ))
name = ['Lagrange multiplier statistic', 'p-value', 'F-value', 'F p-value']
for a, b in lzip(name, np.round(test, 6)): print("{}: {}".format(a, b))
## Lagrange multiplier statistic: 6.976939
## p-value: 0.008257
## F-value: 7.006045
## F p-value: 0.00823
Since p-value = 0.008257 < 0.05, we reject the null hypothesis and accept the alternative that the log-linear regression residuals are heteroskedastic.
##
## studentized Breusch-Pagan test
##
## data: lm_fit_quad
## BP = 9.7331, df = 1, p-value = 0.00181
# Breusch–Pagan Test
test = sm_diagnostic.het_breuschpagan(resid = lm_fit_quad.resid, exog_het = sm.add_constant(educ))
name = ['Lagrange multiplier statistic', 'p-value', 'F-value', 'F p-value']
for a, b in lzip(name, np.round(test, 6)): print("{}: {}".format(a, b))
## Lagrange multiplier statistic: 9.472714
## p-value: 0.002086
## F-value: 9.532172
## F p-value: 0.002065
Since p-value < 0.05, we reject the null hypothesis and accept the alternative that the quadratic regression residuals are heteroskedastic.
3.11.5.2.2 Testing for Residual Autocorrelation
Next, we will test the hypothesis: \[ \begin{aligned} H_0&:\text{the residuals are serially uncorrelated}\\ H_1&:\text{the residuals are an autoregressive process} \end{aligned} \] via the Durbin-Watson test (in this case the alternative hypothesis is equivalent to \(H_1:\text{significant autocorrelation at lag 1}\)):
##
## Durbin-Watson test
##
## data: lm_fit_linear
## DW = 1.9595, p-value = 0.4828
## alternative hypothesis: true autocorrelation is not 0
##
## Durbin-Watson test
##
## data: lm_fit_loglin
## DW = 1.9703, p-value = 0.6064
## alternative hypothesis: true autocorrelation is not 0
##
## Durbin-Watson test
##
## data: lm_fit_quad
## DW = 1.9587, p-value = 0.4742
## alternative hypothesis: true autocorrelation is not 0
Since p-value > 0.05, we have no grounds to reject the null hypothesis that the residuals are not serially correlated in any of the three models - linear, log-linear or quadratic. Unfortunately, in Python we only get the DW statistic and not the associated p-value.
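For reference, the statistic itself can be obtained via statsmodels (a sketch, using the sm_tools alias imported in the setup; as noted, no \(p\)-value is returned):
# Durbin-Watson statistics - values close to 2 indicate no first-order autocorrelation
for lbl, mdl in zip(["linear", "log-linear", "quadratic"],
                    [lm_fit_linear, lm_fit_loglin, lm_fit_quad]):
    print(lbl + ":", np.round(sm_tools.durbin_watson(mdl.resid), 4))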
We can also test for autocorrelation via the Breusch-Godfrey test (in this case the alternative hypothesis is equivalent to \(H_1:\text{significant autocorrelation at lag k}\), where we have to specify \(k\)):
##
## Breusch-Godfrey test for serial correlation of order up to 2
##
## data: lm_fit_linear
## LM test = 0.44312, df = 2, p-value = 0.8013
# Breusch-Godfrey Test
test = sm_diagnostic.acorr_breusch_godfrey(lm_fit_linear, nlags = 2)
name = ['Lagrange multiplier statistic', 'p-value', 'F-value', 'F p-value']
for a, b in lzip(name, np.round(test, 6)): print("{}: {}".format(a, b))
## Lagrange multiplier statistic: 0.443116
## p-value: 0.801269
## F-value: 0.220901
## F p-value: 0.801829
Since p-value = 0.8013 > 0.05
, we have no grounds to reject the null hypothesis and conclude that the linear model residuals are not autocorrelated.
##
## Breusch-Godfrey test for serial correlation of order up to 2
##
## data: lm_fit_loglin
## LM test = 0.85029, df = 2, p-value = 0.6537
# Breusch-Godfrey Test
test = sm_diagnostic.acorr_breusch_godfrey(lm_fit_loglin, nlags = 2)
name = ['Lagrange multiplier statistic', 'p-value', 'F-value', 'F p-value']
for a, b in lzip(name, np.round(test, 6)): print("{}: {}".format(a, b))
## Lagrange multiplier statistic: 0.85029
## p-value: 0.653675
## F-value: 0.424028
## F p-value: 0.654504
Since p-value = 0.6537 > 0.05
, we have no grounds to reject the null hypothesis and conclude that the log-linear model residuals are not autocorrelated.
##
## Breusch-Godfrey test for serial correlation of order up to 2
##
## data: lm_fit_quad
## LM test = 0.4735, df = 2, p-value = 0.7892
# Breusch-Godfrey Test
test = sm_diagnostic.acorr_breusch_godfrey(lm_fit_quad, nlags = 2)
name = ['Lagrange multiplier statistic', 'p-value', 'F-value', 'F p-value']
for a, b in lzip(name, np.round(test, 6)): print("{}: {}".format(a, b))
## Lagrange multiplier statistic: 0.473501
## p-value: 0.789188
## F-value: 0.236054
## F p-value: 0.789775
Since p-value = 0.7892 > 0.05
, we have no grounds to reject the null hypothesis and conclude that the quadratic model residuals are not autocorrelated.
3.11.5.2.3 Testing for Residual Normality
Finally, we will test the normality hypothesis: \[ \begin{aligned} H_0&:\text{residuals follow a normal distribution}\\ H_1&:\text{residuals do not follow a normal distribution} \end{aligned} \] via Shapiro-Wilk test:
##
## Shapiro-Wilk normality test
##
## data: lm_fit_linear$residuals
## W = 0.82704, p-value < 2.2e-16
# Shapiro-Wilk Test
test = stats.shapiro(x = lm_fit_linear.resid)
name = ['W-stat', 'p-value']
for a, b in lzip(name, test): print("{}: {}".format(a, b))
## W-stat: 0.8270390629768372
## p-value: 4.15151498014067e-34
##
## Shapiro-Wilk normality test
##
## data: lm_fit_loglin$residuals
## W = 0.99546, p-value = 0.001196
# Shapiro-Wilk Test
test = stats.shapiro(x = lm_fit_loglin.resid)
name = ['W-stat', 'p-value']
for a, b in lzip(name, test): print("{}: {}".format(a, b))
## W-stat: 0.9954646825790405
## p-value: 0.0011977104004472494
##
## Shapiro-Wilk normality test
##
## data: lm_fit_quad$residuals
## W = 0.82939, p-value < 2.2e-16
# Shapiro-Wilk Test
test = stats.shapiro(x = lm_fit_quad.resid)
name = ['W-stat', 'p-value']
for a, b in lzip(name, test): print("{}: {}".format(a, b))
## W-stat: 0.8293871283531189
## p-value: 6.2899302221184195e-34
In all three models, we reject the null hypothesis and conclude that the residuals are not normally distributed.
On the other hand, if we look at alternative normality tests for the log-linear model residuals (since their histogram and Q-Q plot most closely resembled the normal distribution case), for example the Jarque-Bera test:
##
## Jarque Bera Test
##
## data: lm_fit_linear$residuals
## X-squared = 63871, df = 2, p-value < 2.2e-16
# Jarque-Bera Test
test = sm_tools.jarque_bera(lm_fit_linear.resid)
name = ['JB-stat', 'p-value', 'Skew', 'Kurtosis']
for a, b in lzip(name, test): print("{}: {}".format(a, b))
## JB-stat: 63871.05629016829
## p-value: 0.0
## Skew: 3.328227726658413
## Kurtosis: 38.115704842724924
##
## Jarque Bera Test
##
## data: lm_fit_loglin$residuals
## X-squared = 1.5331, df = 2, p-value = 0.4646
# Jarque-Bera Test
test = sm_tools.jarque_bera(lm_fit_loglin.resid)
name = ['JB-stat', 'p-value', 'Skew', 'Kurtosis']
for a, b in lzip(name, test): print("{}: {}".format(a, b))
## JB-stat: 1.5330842008737402
## p-value: 0.4646168923309677
## Skew: 0.07673785662665146
## Kurtosis: 3.084302369236367
##
## Jarque Bera Test
##
## data: lm_fit_quad$residuals
## X-squared = 68599, df = 2, p-value < 2.2e-16
# Jarque-Bera Test
test = sm_tools.jarque_bera(lm_fit_quad.resid)
name = ['JB-stat', 'p-value', 'Skew', 'Kurtosis']
for a, b in lzip(name, test): print("{}: {}".format(a, b))
## JB-stat: 68598.56231270307
## p-value: 0.0
## Skew: 3.359877596181706
## Kurtosis: 39.425487456021315
- For the log-linear model: since p-value > 0.05, we have no grounds to reject the null hypothesis of normality for the log-linear model residuals.
- For the remaining models: we reject the null hypothesis for the simple linear and the quadratic models, since p-value < 0.05 for those models' residuals.
Considerations
As we have seen, it is a good idea not to rely solely on a single test for residual normality. Similarly, we should not rely on only one test when examining other residual properties, as we have done in this Task.
3.11.5.3 Task 23
From Task 22
we have that:
- none of the model residuals are autocorrelated;
- all of the model residuals are heteroskedastic;
- according to the Shapiro-Wilk test, none of the residuals are normally distributed;
- the Jarque-Bera test suggests that the log-linear model residuals may be normally distributed;
From Task 21
we have that:
- residual scatter plots (vs fitted and vs educ) indicate that the log-linear model residuals exhibit the least non-linear dependence of the three models' residuals;
- the log-linear model residuals look approximately normally distributed;
From Task 19
: according to \(R^2_g\), log-linear and quadratic models are similar in terms of percentage of variation in wage
explained by the specified model and are better than the simple linear model.
From Tasks 10, 16 & 17
:
- The log-linear and quadratic models provide fitted values which coincide with economic theory.
- The simple linear regression predicts negative values of
wage
for small values ofeduc
- this is not possible for hourlywage
and as such the linear model is not a good fit for our data.
We also note that the \(\beta_1\) coefficients have the expected (from economic theory) positive signs and are statistically significant.
From all of this, we can conclude that the log-linear model may be the best fit of the three models, because its residuals are also (unlike those of the quadratic model) plausibly normally distributed (which is what we want from our model).
Considerations
However, none of the model residuals are homoskedastic. Part of this could be explained by the fact that we do not have enough observations for smaller values of educ
(and for smaller values of wage
), so the variation may be different for lower values of wage
(for up to 10 years of educ
), compared to larger values of wage
(when educ
is 10+ years).
Because the residuals are not homoskedastic, the estimated OLS coefficients are not efficient (so they are not BLUE), and the standard errors are biased (so we may draw incorrect conclusions about significance). To account for this, we would need to either: correct the standard errors (e.g. use heteroskedasticity-robust standard errors), use a different estimation method (weighted least squares), attempt to account for the heteroskedasticity by including some additional variables alongside educ, or try some other transformation of the variables (maybe a cubic model?).
3.11.5.4 Task 24
We will begin by taking a subset of the data. To do this, we first get the index of the data, \(i = 1,...,N\):
## [1] 1 2 3 4 5
## [0, 1, 2, 3, 4]
Then, \(80\%\) of the data is equal to:
## [1] 960
where we use floor()
to take the integer value, in case we get a value with a fraction.
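A minimal Python sketch of these two steps (an assumption; dt_index is the name used in the sampling code below):
# Observation index and the integer size of an 80% subsample
dt_index = list(dt4.index)
print(dt_index[:5])
print(int(np.floor(0.8 * len(dt4))))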
We want to take a sample of the data. This is equivalent to taking a subset of the index vector, which we can use to select the data sample:
set.seed(123)
#
smpl_index <- sample(dt_index, size = floor(0.8 * nrow(dt4)), replace = FALSE)
print(head(smpl_index, 5))
## [1] 415 463 179 526 195
## [1] 2 3 5 6 8
np.random.seed(123)
#
smpl_index = np.random.choice(dt_index, size = int(np.floor(0.8 * len(dt4))), replace = False)
print(smpl_index[:5])
## [156 920 971 897 35]
## [0 1 4 5 6]
Note that we are taking the index values without replacement - each element from the index can be taken only once.
In comparison, the remaining index values, which we do not take, can be selected using the setdiff() function, which returns the values in dt_index that are not in smpl_index:
## [1] 1 4 7 12 14
## [2, 3, 515, 516, 519]
The dataset used for modelling is called the training set. The remaining data, which is used to evaluate the predictive accuracy of the model, is called the test set.
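In Python, the complementary index and the two data sets could be built along these lines (a sketch; test_index is a hypothetical name, np.setdiff1d() plays the role of R's setdiff(), and dt4_train / dt4_test are the names used below):
# Indices that were not drawn into the training sample
test_index = np.setdiff1d(dt_index, smpl_index)
# Split the data into a training set and a test set
dt4_train = dt4.iloc[smpl_index]
dt4_test = dt4.iloc[test_index]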
For comparison, the training and test data look like this:
#
#
par(mfrow = c(1, 2))
#
plot(dt4_train$educ, dt4_train$wage,
col = "red", main = "Training set")
#
plot(dt4_test$educ, dt4_test$wage,
col = "blue", main = "Test set")
fig = plt.figure(num = 21, figsize = (10, 8))
_ = fig.add_subplot(121).plot(dt4_train[['educ']], dt4_train[['wage']],
color = "red", linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Training set")
_ = fig.add_subplot(122).plot(dt4_test[['educ']], dt4_test[['wage']],
color = "blue", linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Test set")
plt.tight_layout()
plt.show()
As we have mentioned in Task 22, we chose the log-linear model as the best univariate regression model of those that we estimated for this dataset.
#
dt4_smpl_mdl <- lm(log(wage) ~ 1 + educ, data = dt4_train)
print(summary(dt4_smpl_mdl)$coefficients)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.5774693 0.078801222 20.01833 9.504523e-75
## educ 0.1000511 0.005435415 18.40726 5.068383e-65
dt4_smpl_mdl = smf.ols("np.log(wage) ~ 1 + educ", data = dt4_train)
dt4_smpl_mdl = dt4_smpl_mdl.fit()
print(dt4_smpl_mdl.summary().tables[1])
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 1.6018 0.081 19.855 0.000 1.443 1.760
## educ 0.0988 0.006 17.681 0.000 0.088 0.110
## ==============================================================================
The coefficients are similar to those of the model estimated on the full dataset:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.59683536 0.07018017 22.75337 1.530466e-95
## educ 0.09875341 0.00484219 20.39437 1.344896e-79
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 1.5968 0.070 22.753 0.000 1.459 1.735
## educ 0.0988 0.005 20.394 0.000 0.089 0.108
## ==============================================================================
The tests can be carried out as before.
- If we do not reject the null hypothesis that the residuals are homoskedastic, this would indicate that our model on the training set has homoskedastic errors (i.e. with a constant variance) - something that we did not have previously.
##
## studentized Breusch-Pagan test
##
## data: dt4_smpl_mdl
## BP = 4.2338, df = 1, p-value = 0.03963
In this case the \(p\)-value is closer to 0.05. This may suggest that if we did not have observations on smaller values of wage
, the residuals would be homoskedastic.
# Test for homoskedasticity
test = sm_diagnostic.het_breuschpagan(resid = dt4_smpl_mdl.resid, exog_het = sm.add_constant(dt4_train[["educ"]]))
name = ['Lagrange multiplier statistic', 'p-value', 'F-value', 'F p-value']
for a, b in lzip(name, np.round(test, 6)): print("{}: {}".format(a, b))
## Lagrange multiplier statistic: 5.706901
## p-value: 0.016898
## F-value: 5.729069
## F p-value: 0.016878
In this case, we reject the null hypothesis that the residuals are homoskedastic.
Note that the difference between R
and Python
versions is because of different data samples.
##
## Durbin-Watson test
##
## data: dt4_smpl_mdl
## DW = 2.0239, p-value = 0.7118
## alternative hypothesis: true autocorrelation is not 0
Since p-value > 0.05
(and we see that the DW
statistic is quite close to 2), we do not reject the null hypothesis of no serial correlation.
##
## Shapiro-Wilk normality test
##
## data: dt4_smpl_mdl$residuals
## W = 0.99366, p-value = 0.0004296
# Test for Normality:
test = stats.shapiro(x = dt4_smpl_mdl.resid)
name = ['W-stat', 'p-value']
for a, b in lzip(name, test): print("{}: {}".format(a, b))
## W-stat: 0.995557427406311
## p-value: 0.007039172574877739
##
## Jarque Bera Test
##
## data: dt4_smpl_mdl$residuals
## X-squared = 2.7612, df = 2, p-value = 0.2514
# Test for Normality:
test = sm_tools.jarque_bera(dt4_smpl_mdl.resid)
name = ['JB-stat', 'p-value', 'Skew', 'Kurtosis']
for a, b in lzip(name, test): print("{}: {}".format(a, b))
## JB-stat: 1.604276500225152
## p-value: 0.4483692128688961
## Skew: 0.08649250749274705
## Kurtosis: 3.1009113328427347
Regarding the normality tests - we still reject the null hypothesis of normally distributed residuals via the Shapiro-Wilk test, but do not reject it via the Jarque-Bera test.
Overall, the only notable difference is in the homoskedasticity test (in R) - the residuals may be homoskedastic, since the \(p\)-value is quite close to the \(5\%\) significance level.
Visually, though:
#
#
par(mfrow = c(1, 2))
#
plot(dt4_smpl_mdl$fitted.values, dt4_smpl_mdl$residuals,
main = "Log-linear: residual vs fitted")
#
plot(dt4_smpl_mdl$model$educ, dt4_smpl_mdl$residuals,
main = "Log-linear: residual vs X")
fig = plt.figure(num = 22, figsize = (10, 8))
_ = fig.add_subplot(121).plot(dt4_smpl_mdl.fittedvalues, dt4_smpl_mdl.resid,
color = "black", linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Log-linear: residual vs fitted")
_ = fig.add_subplot(122).plot(dt4_train[["educ"]], dt4_smpl_mdl.resid,
color = "black", linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Log-linear: residual vs X")
plt.tight_layout()
plt.show()
there is not a huge difference from before.
Consequently, the fitted values plotted alongside the training dataset look similar to before:
#
#
plot(dt4_train$educ, dt4_train$wage, col = "red", ylim = c(0, 100))
#
lines(dt4_train$educ[order(dt4_train$educ)], exp(dt4_smpl_mdl$fitted.values[order(dt4_train$educ)]), lwd = 2)
#
abline(h = 0, col = "darkgreen", lty = 2, lwd = 2)
fig = plt.figure(num = 23, figsize = (10, 8))
_ = plt.plot(dt4_train["educ"], dt4_train["wage"],
color = "red", linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.plot(dt4_train["educ"][np.argsort(dt4_train["educ"])],
np.exp(dt4_smpl_mdl.fittedvalues[np.argsort(dt4_train["educ"])]), linestyle = "-", color = "black")
_ = plt.axhline(y = 0, linestyle = "--", color = "darkgreen")
plt.show()
Next, we can calculate the predicted values for the test set:
test_set_predict <- predict(dt4_smpl_mdl,
newdata = data.frame(educ = dt4_test$educ),
interval = c("predict"), level = 0.95)
print(head(test_set_predict, 6))
## fit lwr upr
## 1 2.878134 1.933492 3.822775
## 2 2.878134 1.933492 3.822775
## 3 3.178287 2.233542 4.123032
## 4 2.778083 1.833235 3.722930
## 5 2.778083 1.833235 3.722930
## 6 3.178287 2.233542 4.123032
temp_1 = dt4_smpl_mdl.get_prediction(dt4_test).summary_frame(alpha = 0.05)
test_set_predict = temp_1[['mean', 'obs_ci_lower', 'obs_ci_upper']]
test_set_predict.columns = ['mean', 'pred_lower', 'pred_upper']
print(test_set_predict.head(6))
## mean pred_lower pred_upper
## 0 3.380522 2.410259 4.350784
## 1 2.886435 1.917009 3.855861
## 2 2.886435 1.917009 3.855861
## 3 3.380522 2.410259 4.350784
## 4 3.182887 2.213331 4.152443
## 5 3.676974 2.704724 4.649224
Next, we will plot the test data as well as the predicted values and prediction intervals:
#
plot(dt4_test$educ, dt4_test$wage, col = "blue")
lines(dt4_test$educ[order(dt4_test$educ)],
exp(test_set_predict[order(dt4_test$educ), "fit"]), lwd = 2)
lines(dt4_test$educ[order(dt4_test$educ)],
exp(test_set_predict[order(dt4_test$educ), "lwr"]), lwd = 2,
lty = 2, col = "orange")
lines(dt4_test$educ[order(dt4_test$educ)],
exp(test_set_predict[order(dt4_test$educ), "upr"]), lwd = 2,
lty = 2, col = "orange")
abline(h = 0, col = "darkgreen", lty = 2, lwd = 2)
fig = plt.figure(num = 24, figsize = (10, 8))
_ = plt.plot(dt4_test["educ"], dt4_test["wage"],
color = "blue", linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.plot(dt4_test["educ"][np.argsort(dt4_test["educ"])],
np.exp(test_set_predict['mean'][np.argsort(dt4_test["educ"])]), linestyle = "-", color = "black")
_ = plt.plot(dt4_test["educ"][np.argsort(dt4_test["educ"])],
np.exp(test_set_predict['pred_lower'][np.argsort(dt4_test["educ"])]), linestyle = "--", color = "orange")
_ = plt.plot(dt4_test["educ"][np.argsort(dt4_test["educ"])],
np.exp(test_set_predict['pred_upper'][np.argsort(dt4_test["educ"])]), linestyle = "--", color = "orange")
_ = plt.axhline(y = 0, linestyle = "--", color = "darkgreen")
plt.show()
We see that most, but not all, of the values fall within the prediction intervals.
Considerations
We can calculate which values fall in the interval (remember that we are predicting the logarithm of wage in our log-linear model, so we need to take the exponent):
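A sketch of how such a comparison table might be assembled (the tmp_dt name is used in the code below; the exact construction is not shown in the text):
# Back-transform the log-scale predictions and attach the observed wages from the test set
tmp_dt = np.exp(test_set_predict)
tmp_dt['actual'] = dt4_test['wage'].values
print(tmp_dt.head(6))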
## fit lwr upr actual
## 1 17.78106 6.913612 45.73094 44.44
## 2 17.78106 6.913612 45.73094 13.54
## 3 24.00560 9.332865 61.74617 40.95
## 4 16.08814 6.254084 41.38549 21.50
## 5 16.08814 6.254084 41.38549 24.35
## 6 24.00560 9.332865 61.74617 55.00
## mean pred_lower pred_upper actual
## 0 29.386103 11.136849 77.539262 15.38
## 1 17.929271 6.800585 47.269279 13.54
## 2 17.929271 6.800585 47.269279 25.00
## 3 29.386103 11.136849 77.539262 23.86
## 4 24.116276 9.146133 63.589148 57.70
## 5 39.526615 14.950194 104.503880 30.76
Next, we will check which rows are greater than the lower bound and less than the upper bound:
tmp_dt$in_interval <- (tmp_dt$actual >= tmp_dt$lwr) & (tmp_dt$actual <= tmp_dt$upr)
print(head(tmp_dt, 5))
## fit lwr upr actual in_interval
## 1 17.78106 6.913612 45.73094 44.44 TRUE
## 2 17.78106 6.913612 45.73094 13.54 TRUE
## 3 24.00560 9.332865 61.74617 40.95 TRUE
## 4 16.08814 6.254084 41.38549 21.50 TRUE
## 5 16.08814 6.254084 41.38549 24.35 TRUE
tmp_dt['in_interval'] = np.logical_and(tmp_dt["actual"] >= tmp_dt["pred_lower"], tmp_dt["actual"] <= tmp_dt["pred_upper"])
print(tmp_dt.head(6))
## mean pred_lower pred_upper actual in_interval
## 0 29.386103 11.136849 77.539262 15.38 True
## 1 17.929271 6.800585 47.269279 13.54 True
## 2 17.929271 6.800585 47.269279 25.00 True
## 3 29.386103 11.136849 77.539262 23.86 True
## 4 24.116276 9.146133 63.589148 57.70 True
## 5 39.526615 14.950194 104.503880 30.76 True
Now, the percentage of values which fall in the prediction interval is:
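A short sketch of this calculation (using the in_interval indicator created above):
# Percentage of test-set observations covered by the 95% prediction interval
print(np.round(np.mean(tmp_dt['in_interval']) * 100, 2))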
Since we constructed a \(95\%\) prediction interval, this is quite close to the nominal coverage and a good indication of our model's predictive strength. Nevertheless, the residuals may still be heteroskedastic, so this model may need further improvement.