## 3.11 Example

We will show an example of how to carry out the tasks specified in section 3.10 on one of the datasets, namely Dataset 4.

##    black educ exper faminc female metro midwest south  wage west
## 1      0   13    45      0      1     1       0     0 44.44    1
## 2      0   14    25  45351      1     1       1     0 16.00    0
## 3      0   18    27  91946      1     1       0     0 15.38    0
## 4      0   13    42  48370      0     1       1     0 13.54    0
## 5      0   13    41  10000      1     1       0     0 25.00    1
## 6      0   16    26 151308      1     1       0     0 24.05    0
## 7      0   16    11 110461      1     1       0     1 40.95    0
## 8      0   18    15      0      1     1       1     0 26.45    0
## 9      0   21    32  67084      0     1       1     0 30.76    0
## 10     0   14    12  14000      0     0       0     0 34.57    1
##    black  educ  exper  faminc  female  metro  midwest  south   wage  west
## 0      0    13     45       0       1      1        0      0  44.44     1
## 1      0    14     25   45351       1      1        1      0  16.00     0
## 2      0    18     27   91946       1      1        0      0  15.38     0
## 3      0    13     42   48370       0      1        1      0  13.54     0
## 4      0    13     41   10000       1      1        0      0  25.00     1
## 5      0    16     26  151308       1      1        0      0  24.05     0
## 6      0    16     11  110461       1      1        0      1  40.95     0
## 7      0    18     15       0       1      1        1      0  26.45     0
## 8      0    21     32   67084       0      1        1      0  30.76     0
## 9      0    14     12   14000       0      0        0      0  34.57     1

We want to analyse, whether education (educ column) affects the wage (wage column) of a person from the data sample.

### 3.11.1 Exercise Set 1

To make it easier to follow, we will take only the columns that we need from the dataset.

We will begin by plotting the scatterplot of $$Y$$ on $$X$$ (we will plot the histograms of $$Y$$ and $$X$$ in a separate task):

It appears that for larger values of educ there are larger values of wage. So $$Y$$ and $$X$$ appear to be correlated. the correlation between $$Y$$ and $$X$$ is:

We also note that there are fewer observations for people, whose education is less than 10 years:

This could mean that people with less than 10 years of education are underrepresented in our dataset.

Considerations

In some cases this may not be underrepresentation at all - for example, if we were examining a sample of firms, which primarily employ only industry veterans, it is very likely that there would be less people who have less than 10 years of experience.

We begin by assuming that the relationship between wage and education can be defined by a linear regression: $\text{wage} = \beta_0 + \beta_1 \cdot \text{educ} + \epsilon$ We expect that the more years one spends in education, the more qualified they are for higher position (i.e. higher-paying) jobs. As such, the hourly wage should increase with more years spent on studying. So, we expect $$\beta_1 > 0$$.

We will estimate the parameters “by-hand” by using the OLS formula: $\widehat{\boldsymbol{\beta}} = \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{Y}$ where:

• $$\mathbf{Y} = [Y_1,...,Y_N]^\top$$,
• $$\mathbf{X} = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_N \end{bmatrix}$$
• $$\boldsymbol{\varepsilon} = [\epsilon_1,...,\epsilon_N]^\top$$,
• $$\boldsymbol{\beta} =[ \beta_0, \beta_1]^\top$$.

Let’s begin by estimating separate components of our OLS formula.

• $$\mathbf{X}$$:
##         educ
##  [1,] 1   13
##  [2,] 1   14
##  [3,] 1   18
##  [4,] 1   13
##  [5,] 1   13
##  [6,] 1   16
##  [7,] 1   16
##  [8,] 1   18
##  [9,] 1   21
## [10,] 1   14
## [[ 1. 13.]
##  [ 1. 14.]
##  [ 1. 18.]
##  [ 1. 13.]
##  [ 1. 13.]
##  [ 1. 16.]
##  [ 1. 16.]
##  [ 1. 18.]
##  [ 1. 21.]
##  [ 1. 14.]]
• $$\mathbf{X}^\top \mathbf{X}$$:
##              educ
##       1200  17043
## educ 17043 252073
## [[  1200.  17043.]
##  [ 17043. 252073.]]

We can verify that we have correctly multiplied the matrices by looking at the diagonal elements in the matrix - they should be $$\sum_{i = 1}^N 1$$ and $$\sum_{i = 1}^N \text{educ}_i$$:

• $$\left( \mathbf{X}^\top \mathbf{X}\right)^{-1}$$:
• $$\mathbf{X}^\top \mathbf{Y}$$:

Finally, we can multiply both terms to estimate the parameters:

We see that the sign of educ is positive, which is in line with our assumptions (which we base on economic theory).

In order to estimate the standard errors, we first need to estimate the variance: $\widehat{\mathbb{V}{\rm ar}} (\widehat{\boldsymbol{\beta}}) = \begin{bmatrix} \widehat{\mathbb{V}{\rm ar}} (\widehat{\beta}_0) & \widehat{\mathbb{C}{\rm ov}} (\widehat{\beta}_0, \widehat{\beta}_1) \\ \widehat{\mathbb{C}{\rm ov}} (\widehat{\beta}_1, \widehat{\beta}_0) & \widehat{\mathbb{V}{\rm ar}} (\widehat{\beta}_1) \end{bmatrix} = \widehat{\sigma}^2 \left( \mathbf{X}^\top \mathbf{X}\right)^{-1}$ where

• the estimated residual variance $$\widehat{\sigma}^2 = \dfrac{1}{N-2} \sum_{i = 1}^N \widehat{\epsilon}_i^2$$,
• the residuals $$\widehat{\epsilon}_i = Y - \widehat{Y}_i$$
• the fitted values $$\widehat{Y}_i = \beta_0 + \beta_1 X_i$$

Then, the standard errors of the corresponding OLS estimators can be estimated as: $\text{se}(\widehat{\beta}_i) = \sqrt{\widehat{\mathbb{V}{\rm ar}} (\mathbf{\widehat{\beta}_i})}$ for $$i = 0, 1$$.

We begin by estimating the residual variance:

Now, we can estimate the OLS estimator variance-covariance matrix:

The standard errors are the square root of the diagonal elements of the estimated variance-covariance matrix of the parameters:

We first will print the estimated parameters and their standard errors in one table:

Now we can write down the equation as: $\underset{(se)}{\widehat{\text{wage}}} = \underset{(1.9624)}{-10.4} + \underset{(0.1354 )}{2.3968} \cdot \text{educ}$

We have already calculated the fitted values in Task 4 when calculating the residuals, but we will restate the code here again:

Now, we will plot the fitted values alongside the true values from the data sample.

Considerations

We will also need to order the data by $$X$$ - the reason is that when plotting a line, R and Python will connect the values as they are given in the dataset:

##  [1] 13 14 18 13 13 16 16 18 21 14
## 0    13
## 1    14
## 2    18
## 3    13
## 4    13
## 5    16
## 6    16
## 7    18
## 8    21
## 9    14
## Name: educ, dtype: int64

We see that the first point of the line will start at $$X = 13$$, then $$14$$, $$18$$ but then go back to $$13$$. If we have a log-linear or any other model, where we transform $$Y$$, where our fitted and back-transformed values are not a straight line, then the values will be incorrectly visualized.

So, we need to order both the $$X$$ and $$Y$$ vectors by the value of $$X$$ when plotting.

Considerations

We note that for educ values less than 5, our fitted values for wage are negative. Since wage cannot be negative, and an education value of less than 5 years is realistic (and because we have these values in our dataset), this is an indication that our model may not be adequate for this dataset.

We will use the built-in functions to automatically estimate the coefficients and standard errors:

##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -10.399959  1.9624004 -5.299611 1.380373e-07
## educ          2.396761  0.1353989 17.701484 1.820416e-62
##            Coef.  Std.Err.          t         P>|t|     [0.025    0.975]
## const -10.399959  1.962400  -5.299611  1.380373e-07 -14.250083 -6.549835
## educ    2.396761  0.135399  17.701484  1.820416e-62   2.131116  2.662407

We see that the Estimate (Coef.) and Std. Error columns correspond to our manually calculated values.

### 3.11.2 Exercise Set 2

The run sequence plot can be used to plot values, equally spaced:

This allows us to get a better overview of the data. In this case, from the run-sequence plots we see that:

• One value is a possible outlier - it’s value is much higher than any other. In practice we may need to get a deeper look at the dataset - is that value due to an error, or maybe it is from someone who is underprepresented in our dataset (e.g. a salary of a CEO; some management positions; etc.);
• The values of wage are fluctuating (i.e. they are not constant) - it appears that the mean and variance of $$Y$$ in this data sample are the constant. But this would only be true if we ignore the fact that we have an additional variable, educ.
• If we look at the scatterplot of wage and educ from Task 1, we see that this is not necessarily the case, since wage appears to increase with educ (and we see a similar case for the variance of wage). One reason for this is that we are examining cross-sectional data, where the order of the data does not matter, so it is difficult to visually determine whether the mean (and variance) is constant or not using only the run-sequence plot of a single variable.

One more thing we do notice in our data - the distribution of $$Y$$ is likely to be non-normal. The reason is that while most values appear to be “clustered” around 25, but some are much higher. Since wage cannot be negative, this skews the distribution of the data.

Considerations

Since we mentioned that the observation order is not important in cross-sectional data - a natural question is what happens to the run-sequence plot if we do order the $$Y$$ variable?

At first glance, this appears as a completely different result! However, the difference is mostly visual as we do not clearly see how many observations have low wage, while we do see less observations with larger wage values.

In fact, the above ordered value plot looks similar to (but not exactly the same as) the Q-Q plot of $$Y$$:

The difference is that in the Q-Q plot the values of wage are ploted against quantiles, which are calculated from a theoretical normal distribution (the number of quantiles is selected to match the size of the sample data).

On the other hand, as with the un-ordered values, we do clearly see that there is an outlier in our data - one wage observation is much higher than the rest.

One way to examine exactly how many observations have low and large wage values is by using histograms, which we will do in the next task.

We see that the data for wage and education does not appear to be normal - the distributions of both wage and educ are skewed. For wage - there are more people with wages with values less than 50.

For educ - there are more people with more than 10 years of education.

We begin by again plotting the scatter plot of wage and educ:

From the plots, as well as the previous Task results, we highlight the following points:

• The variation of wage is very small for educ <= 10, compared to other wage values when educ > 10.
• As we have seen before, our simple linear model is not precise when the value of educ is small.
• Furthermore, from the histograms plots we have noted that wage does not have a normal distribution.

As we know, we can also specify nonlinearities with the linear model in order to try to account for this. We may initially think about the following possible models:

• $$\log(\text{wage}) = \beta_0 + \beta_1 \cdot \text{educ} + \epsilon$$, where $$\text{wage} > 0$$;
• $$\text{wage} = \beta_0 + \beta_1 \cdot \log(\text{educ}) + \epsilon$$, where $$\text{educ} > 0$$;
• $$\log(\text{wage}) = \beta_0 + \beta_1 \cdot \log(\text{educ}) + \epsilon$$, where $$\text{wage} > 0$$, $$\text{educ} > 0$$;
• $$\text{wage} = \beta_0 + \beta_1 \cdot \text{educ}^2 + \epsilon$$

Unfortunately, if we look at the minimum value of educ:

## [1] 0

We see that it is zero - so we cannot take logarithms of educ.

This leaves us with two possible alternative specification of a linear model:

• $$\log(\text{wage}) = \beta_0 + \beta_1 \cdot \text{educ} + \epsilon$$
• $$\text{wage} = \beta_0 + \beta_1 \cdot \text{educ}^2 + \epsilon$$

We will begin by estimating a log-linear model:

Next, we will calculate the fitted values $$\widehat{\log(\text{wage})}$$ and residuals .

and the standard errors of the estimated parameters:

We can also do this with the built-in OLS estimation function:

## const    1.596835
## educ     0.098753
## dtype: float64

Note that we have to remember that we are using log(wage) instead of wage in this model.

So, the estimated log-linear model is: $\underset{(se)}{\widehat{\log(\text{wage})}} = \underset{(0.0702)}{1.5968} + \underset{(0.0048)}{0.0988} \cdot \text{educ}$

Next, we will estimate a quadratic model:

We may make the assumption that for people, who already have spent more years studying, an additional year in education might have a larger effect, compared to an additional year of education for people, who did not spend as many years studying.

We will estimate the following model: $\text{wage} = \beta_0 + \beta_1 \cdot \text{educ}^2 + \epsilon$

We can also do this with the built-in OLS estimation function:

## (Intercept)   I(educ^2)
##  4.91647729  0.08913401
## const    4.916477
## educ     0.089134
## dtype: float64

Note that we have used educ**2, but the variable name is still educ. To avoid confusion, we can rename the variable as follows:

## const     4.916477
## educ^2    0.089134
## dtype: float64

Now that we have addressed the naming issue, we will continue the tasks.

Along with the estimates, we can also immediately test the hypothesis: \begin{aligned} H_0&: \beta_1 = 0\\ H_1&: \beta_1 \neq 0 \end{aligned}

##               Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) 4.91647729 1.091864113  4.502829 7.359564e-06
## I(educ^2)   0.08913401 0.004858102 18.347497 1.862455e-66
##            Coef.  Std.Err.          t         P>|t|    [0.025    0.975]
## const   4.916477  1.091864   4.502829  7.359564e-06  2.774299  7.058656
## educ^2  0.089134  0.004858  18.347497  1.862455e-66  0.079603  0.098665

Because the $$p$$-value column (Pr(>|t|) in R and P>|t| in Python) for educ is less than 0.05, we reject the null hypothesis and are left with the alternative, that the coefficient $$\widehat{\beta}_1$$ is statistically significantly different from zero.

As before, we can calculate the fitted values and the residuals for the quadratic model:

Considerations

Furthermore, we can plot all of our current models and compare them visually (to make it easier, we will not show all of the sample data, which we will limit with ylim):

We see that the quadratic model is somewhere in the middle between the linear and the log-linear model - the predicted values are greater than zero (which is better than the linear model and closer to the log-linear model specification), and the predicted values are closer to the mean for educ between 15 and 20 (which is better than the log-linear model and closer to the linear model specification).

We could also say that the log-linear model underestimates the expected wage compared to the quadratic model when educ is greater than 10, but is similar to the log-linear model for educ < 10.

Our estimated quadratic model is: $\underset{(se)}{\widehat{\text{wage}}} = \underset{(1.09186)}{4.9165} + \underset{(0.00486)}{0.08913} \cdot \text{educ}^2$

We can interpret the coefficient of educ as follows: education has a positive effect on wage, which is more pronounced for larger initial (or, base, or starting) values of education.

Consider the following equations where some base education increases by one: \begin{aligned} {\widehat{\text{wage}}^{(1)}} &= {4.9165} + {0.08913} \cdot \text{educ}^2\\ {\widehat{\text{wage}}^{(2)}} &= {4.9165} + {0.08913} \cdot (\text{educ} + 1)^2 \\ &= {4.9165} + {0.08913}\cdot \text{educ}^2 + 0.08913 \cdot (2 \cdot\text{educ} + 1) \end{aligned}

Then the change in wage from a one year increase in education is: $\widehat{\text{wage}}^{(2)} - \widehat{\text{wage}}^{(1)} = 0.08913 \cdot (2 \cdot\text{educ} + 1)$

If education increased from 0 to 1 then wage increases by:

around 0.09 dollars per hour.

On the other hand, if education increases from 15 to 16, then wage increases by:

roughly 2.76 dollars per hour.

We can verify this with the built-in functions as well:

## 2.763154304221157

Note that we need to pass the transformed data to the .predict() function in python, because we used the transformed data when estimating the model (as opposed to R, where we transformed the data using the formula).

Fortunately, we can also use a similar formula notation in Python:

## ================================================================================
##                    coef    std err          t      P>|t|      [0.025      0.975]
## --------------------------------------------------------------------------------
## Intercept        4.9165      1.092      4.503      0.000       2.774       7.059
## I(educ ** 2)     0.0891      0.005     18.347      0.000       0.080       0.099
## ================================================================================

Then we can pass the original exogeneous variable values for educ, which will be automatically transformed to I(educ ** 2):

## 2.763154304221157

We will use this formula notation more often in the next chapters. For now, we are focusing on a less-convenient specification in order to highlight how the transformations are applied when estimating the model and how they are incorporated in the fitted value calculation.

We have already calculated the residuals in the previous Tasks, so now we plot the run-sequence plots for the respective models:

We will also plot the run-sequence plot of the residual on educ to see if the model performed:

Ideally, we would want the residuals to be randomly scattered around zero on the vertical axis, which would suggest that our model captures any non-linearities in the data. They should also need to have the same variance, and to not have outliers

In our case: - In the linear (level-level) model, the residuals do not appear to be normally distributed, while in the log-linear model - they seem to have a normal distribution. (We see a similar situation with the quadratic model). - In the simple linear (an in the quadratic) model, there is one residual which stands out, which may be a sign of an outlier in the data. - In the log-linear model, as well as the simple linear and the quadratic model, the variance of residuals is smaller for smaller values of education. However, as we have mentioned, we have much fewer samples of wage data for education < 10.

Note that we are examining the residuals of the model. Since we are using the logarithm of wage in the log-linear model, the residuals are the difference between the actual and fitted values of the log of wage.

Based on our (UR.1) - (UR.4) assumptions, we would expect our residuals to be independent of $$X$$, have the same mean and variance across observations and be normally distributed. The residuals of the log-linear model appear to be normally distributed. Because of a lack of data for lower values of educ, we cannot say whether the variance of the residuals is the same.

Consequently, the log-linear model appears to be better suited for our data, compared to the simple linear and the quadratic models: $\underset{(se)}{\widehat{\log(\text{wage})}} = \underset{(0.0702)}{1.5968} + \underset{(0.0048)}{0.0988} \cdot \text{educ}$ (Though the fitted value plot for the quadratic model seems close to the log-linear model)

Looking back at the simple linear model: $\underset{(se)}{\widehat{\text{wage}}} = \underset{(1.9624)}{-10.4000} + \underset{(0.1354 )}{2.3968} \cdot \text{educ}$ The interpretation of $$\widehat{\beta}_1$$ is that: $\text{an additional year of education increases the expected hourly wage by around } 2.3 \text{ dollars.}$

For the log-linear model: $\underset{(se)}{\widehat{\log(\text{wage})}} = \underset{(0.0702)}{1.5968} + \underset{(0.0048)}{0.0988} \cdot \text{educ}$ The interpretation for $$\widehat{\beta}_1$$ is that:

$\text{an additional year of education increases the expected hourly wage by around } 9.88\%.$

### 3.11.3 Exercise Set 3

We would like to test the null hypothesis whether education has a significant effect on wage. We will examine the coefficients of:

• the simple linear: $$\text{wage} = \beta_0 + \beta_1 \cdot \text{educ} + \epsilon$$;
• the log-linear model: $$\log(\text{wage}) = \beta_0 + \beta_1 \cdot \text{educ} + \epsilon$$.

We will begin by testing the null hypothesis against the alternative, that $$\beta_1$$ is negative:

$\begin{cases} H_0 &: \beta_1 = 0\\ H_1 &: \beta_1 < 0 \end{cases}$

The critical value $$t_c = t_{(\alpha, N-2)}$$ is:

We begin by calculating the $$t$$-ratio for the simple linear model is:

We reject $$H_0$$, if $$t-\text{ratio} \leq t_{(\alpha, N-2)}$$.

In our case:

##  educ
## FALSE

Which means that we have no grounds to reject the null hypothesis.

We can calculate the p-value to verify this:

Since p-value is greater than 0.05, so we have no grounds to reject the null hypothesis at the $$5\%$$ significance level.

Next, we will calculate the $$t$$-ratio and the $$p$$-value for the log-linear model:

So, we have no grounds to reject the null hypothesis at the $$5\%$$ significance level for the log-linear model.

Considerations

In this case the null hypothesis can be interpreted as: $H_0: \text{ Education has no effect on wage (i.e. education has no significant effect on wage).}$ against the alternative: $H_1: \text{ Education has a negative effect on wage.}$

Considering that we expect that higher education leads to larger wage, would it make sense to test the null hypothesis against such an alternative?

Next, we will test the null hypothesis against the alternative, that $$\beta_1$$ is positive:

$\begin{cases} H_0 &: \beta_1 = 0\\ H_1 &: \beta_1 > 0 \end{cases}$

The critical value $$t_c = t_{(1 - \alpha, N-2)}$$ is:

The t-ratio is unchanged since it depends on the null hypothesis, so we will not recalculate it.

We reject $$H_0$$, if $$t-\text{ratio} \geq t_{(1 - \alpha, N-2)}$$.

In our case, for the simple linear model:

We reject the null hypothesis, and conclude that $$\beta_1 > 0$$. We can verify this by calculating the p-value:

We did not round the p-value to highlight that it is very close to zero (Note that the E-notation, e-63 is the notation for $$10^{-63}$$). Since $$p$$-value < 0.05 we reject the null hypothesis at the $$5\%$$ significance level.

Next, we will calculate the $$t$$-ratio and the $$p$$-value for the log-linear model:

For the log-linear model we come to the same conclusion - we reject the null hypothesis at the $$5\%$$ significance level for the log-linear model. If we look at the coefficient of educ for the log-linear model, we see that it is $$0.0988$$, which, seems small, however, since we have log-transformed our dependent variable wage - the scale of the dependent variable changed, so this seemingly small value has a statistically significant ((-ly) different from zero) effect on log(wage).

Considerations

In this case the null hypothesis can be interpreted as: $H_0:\text{ Education has no effect on wage (i.e. education has no significant effect on wage).}$ against the alternative: $H_1:\text{ Education has a positive effect on wage.}$ We specify this as the alternative hypothesis, since we expect that higher education leads to larger wage.

Finally, we will test the null hypothesis against the alternative, that $$\beta_1$$ is not zero:

$\begin{cases} H_0 &: \beta_1 = 0\\ H_1 &: \beta_1 \neq 0 \end{cases}$

As before, we do not need to recalculate the t-ratio.

We reject $$H_0$$, if $$t-\text{ratio} \leq t_{(\alpha/2, N-2)}$$ or $$t-\text{ratio} \geq t_{(1 - \alpha/2, N-2)}$$.

In our case, for the simple linear model this is:

So, we reject the null hypothesis at the $$5\%$$ significance level.

The $$p$$-value:

is less than 0.05, so we reject the null hypothesis.

Finally, we will calculate the $$t$$-ratio and the $$p$$-value for the log-linear model:

In the log-linear model case, we reject the null hypothesis at the $$5\%$$ significance level.

From these tests, we conclude that:

• educ is statistically significantly different from zero in the simple linear model. In other words, educ has a significant effect on wage.
• educ is statistically significantly different from zero in the log-linear model. In other words, educ has a significant effect on log(wage) (and, by extension, on wage itself).

Considerations

In this case the null hypothesis can be interpreted as: $H_0:\text{ Education has no effect on wage (i.e. education has no significant effect on wage).}$ against the alternative: $H_1:\text{ Education has a signification effect on wage.}$ Note that for the two-tailed hypothesis test, we do not make any assumptions about the sign of the effect - i.e. we do not care whether it is positive, or negative - we only care to test the overall significance of the effect.

Furthermore, the two-tailed hypothesis test, where we test $$H_0$$ against $$H_1: \beta_i \neq 0$$, is also automatically calculated in the summary output of our fitted models:

##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -10.399959  1.9624004 -5.299611 1.380373e-07
## educ          2.396761  0.1353989 17.701484 1.820416e-62
##            Coef.  Std.Err.          t         P>|t|     [0.025    0.975]
## const -10.399959  1.962400  -5.299611  1.380373e-07 -14.250083 -6.549835
## educ    2.396761  0.135399  17.701484  1.820416e-62   2.131116  2.662407
##               Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 1.59683536 0.07018017 22.75337 1.530466e-95
## educ        0.09875341 0.00484219 20.39437 1.344896e-79
##           Coef.  Std.Err.          t         P>|t|    [0.025    0.975]
## const  1.596835  0.070180  22.753370  1.530466e-95  1.459146  1.734525
## educ   0.098753  0.004842  20.394367  1.344896e-79  0.089253  0.108254

The $$t$$-ratio is in the appropriately named t value column and the $$p$$-value is in the Pr(>|t|) in R or P>|t| in Python. We see that these values are the same as our t_stat and the calculated $$p$$-value of the two-tail hypothesis test.

Consequently, we do not need to manually carry out the two-tail hypothesis test as it is done automatically in the model summary. On the other hand, a one-tailed hypothesis test is not provided in the output, so it is always useful to see how we can do this ourselves.

Let $$\widehat{Y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 \tilde{X}_i$$ be the predicted value of $$Y_i$$ given values $$\tilde{X}_i$$, $$i = 1,...$$. We can also express this in a matrix notation as: $$\widehat{\mathbf{Y}} = \widetilde{\mathbf{X}} \widehat{\boldsymbol{\beta}}$$

The variance-covariance matrix of the estimated mean response is: $\widehat{\mathbb{V}{\rm ar}} (\widehat{\mathbf{Y}}) = \widetilde{\mathbf{X}} \widehat{\sigma}^2 \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \widetilde{\mathbf{X}}^\top = \widetilde{\mathbf{X}} \widehat{\mathbb{V}{\rm ar}} (\widehat{\boldsymbol{\beta}}) \widetilde{\mathbf{X}}^\top$ which means that we need to use the OLS estimator variance-covariance matrix when calculating the confidence intervals for the mean response.

The confidence intervals for the mean response are constructed from two endpoints, which are calculated as follows: $\widehat{Y}_i \pm t_{(1 - \alpha/2, N-2)} \times \text{se}(\widehat{Y}_i),$ where the standard error of the mean response is the square root of the corresponding diagonal elements of the mean response variance-covariance matrix: $$\text{se}(\widehat{Y}_i) = \sqrt{\widehat{\mathbb{V}{\rm ar}} (\widehat{Y}_i)}$$.

We will begin by calculating the standard error of the mean response for the simple linear model:

Next, we will calculate the $$95\%$$ confidence intervals for the mean response:

Finally, we will plot the estimated mean response along with the lower an upper confidence intervals:

Considerations

Note that we can also calculate standard errors of the mean response automatically:

##            1        2        3        4        5        6
## fit 20.75794 23.15470 32.74174 20.75794 20.75794 27.94822
## lwr 19.92651 22.38520 31.47411 19.92651 19.92651 27.04421
## upr 21.58936 23.92419 34.00937 21.58936 21.58936 28.85223
##                        0          1          2          3          4          5
## mean           20.757936  23.154698  32.741742  20.757936  20.757936  27.948220
## mean_se         0.423775   0.392209   0.646107   0.423775   0.423775   0.460771
## mean_ci_lower  19.926512  22.385204  31.474115  19.926512  19.926512  27.044212
## mean_ci_upper  21.589361  23.924191  34.009370  21.589361  21.589361  28.852227
## obs_ci_lower   -5.845866  -3.447241   6.120737  -5.845866  -5.845866   1.342050
## obs_ci_upper   47.361739  49.756636  59.362747  47.361739  47.361739  54.554390

Which we can then use to plot the same charts:

Note: we provided two examples of how to access a pd.DataFrame object in order to print specific columns and rows: either by first specifying the column name and then the rows, or by using .loc() method, not to be confused with .iloc(), which only works with column numbers, as opposed to their names:

## [-10.399959246494694, -10.399959246494694, -3.2096756527975208, -3.2096756527975208, -3.2096756527975208]
## [-10.399959246494694, -10.399959246494694, -3.2096756527975208, -3.2096756527975208, -3.2096756527975208]
## [-10.399959246494694, -10.399959246494694, -3.2096756527975208, -3.2096756527975208, -3.2096756527975208]

Again, we can see from the numerical values, that the built-in functions for confidence interval calculation give identical results to the manual calculation.

When learning new methods, it is always a good idea to try to manually implement any new formulas you come across and compare with the built-in functions in order to verify that they work as you would expect - not only will it provide a better intuition for the formulas themselves but sometimes there additional settings, or alternative formulas used in the built-int function, which may give different results!

Next, we will calculate the standard error of the mean response for the log-linear model.

We will use the built-in functions to do so. First, we will calculate the mean response of $$\widehat{\log(\text{wage})}$$ along with its confidence interval:

##            1        2        3        4        5        6
## fit 2.880630 2.979383 3.374397 2.880630 2.880630 3.176890
## lwr 2.850896 2.951864 3.329063 2.850896 2.850896 3.144560
## upr 2.910363 3.006902 3.419730 2.910363 2.910363 3.209219
##                       0         1         2         3         4         5
## mean           2.880630  2.979383  3.374397  2.880630  2.880630  3.176890
## mean_se        0.015155  0.014026  0.023106  0.015155  0.015155  0.016478
## mean_ci_lower  2.850896  2.951864  3.329063  2.850896  2.850896  3.144560
## mean_ci_upper  2.910363  3.006902  3.419730  2.910363  2.910363  3.209219
## obs_ci_lower   1.929214  2.028034  2.422365  1.929214  1.929214  2.225389
## obs_ci_upper   3.832046  3.930733  4.326428  3.832046  3.832046  4.128391

Considerations

For the log-linear model we can immediately plot the mean response and confidence intervals of $$\widehat{\log(\text{wage})}$$ (not to be confused with $$\widehat{\text{wage}}$$):

As mentioned, this is not the original value of wage, but rather its logarithmic transformation. Usually, we are interested in the original values, so it is important that we are able to transform the dependent variable predicted values back to their original scale.

Next, we will take an exponent of the mean response and confidence intervals of the log-linear model to get the values for wage, i.e. $$\widehat{\text{wage}} = \exp\{\widehat{\log(\text{wage})}\}$$:

Considerations

When plotting the fitted values, we have always ordered/sorted the observation pairs $$(X_i, Y_i)$$ by $$X$$. What happens if we do not do this? Simply put - the fitted value points will be connected in the order that they appear:

As such, we need to do one of the following: - order the data in the original dataset unless the order of the observations is important! For example if the data is autocorrelated, then the order must be kept; - create a variable with the index list ordered $$X$$ values and use it to order the data for the scatterplot only, as we have done throughout this task.

The variance-covariance matrix of the prediction $$\widehat{\mathbf{Y}}$$ is: \begin{aligned} \mathbb{V}{\rm ar}\left( \widetilde{\boldsymbol{e}} \right) &= \widehat{\sigma}^2 \left( \mathbf{I} + \widetilde{\mathbf{X}} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \widetilde{\mathbf{X}}^\top\right) = \widehat{\sigma}^2 \mathbf{I} + \widehat{\mathbb{V}{\rm ar}} (\widehat{\mathbf{Y}}) \end{aligned} The prediction interval endpoints for $$\widehat{\mathbf{Y}}$$ are: $\widehat{Y}_i \pm t_{(1 - \alpha/2, N-2)} \times \text{se}(\widetilde{e}_i),$ where the standard errors of the forecast are the diagonal elements of the variance-covariance matrix of the prediction: $$\text{se}(\widetilde{e}_i) = \sqrt{\widehat{\mathbb{V}{\rm ar}} (\widetilde{e}_i)}$$.

The prediction variance-covariance matrix is:

Then the $$95\%$$ prediction intervals are:

Finally, we can plot the prediction intervals. We will plot them alongside the confidence intervals of the mean response:

Considerations

As you might have guessed - there are also a built-in functions to calculate the prediction intervals:

##             1         2         3         4         5
## fit 20.757936 23.154698 32.741742 20.757936 20.757936
## lwr -5.845866 -3.447241  6.120737 -5.845866 -5.845866
## upr 47.361739 49.756636 59.362747 47.361739 47.361739
##                                  0          1          2          3          4
## Obs                       1.000000   2.000000   3.000000   4.000000   5.000000
## Dep Var\nPopulation      44.440000  16.000000  15.380000  13.540000  25.000000
## Predicted\nValue         20.757936  23.154698  32.741742  20.757936  20.757936
## Std Error\nMean Predict   0.423775   0.392209   0.646107   0.423775   0.423775
## Mean ci\n95% low         19.926512  22.385204  31.474115  19.926512  19.926512
## Mean ci\n95% upp         21.589361  23.924191  34.009370  21.589361  21.589361
## Predict ci\n95% low      -5.845866  -3.447241   6.120737  -5.845866  -5.845866
## Predict ci\n95% upp      47.361739  49.756636  59.362747  47.361739  47.361739
## Residual                 23.682064  -7.154698 -17.361742  -7.217936   4.242064
## Std Error\nResidual      13.546654  13.547605  13.537872  13.546654  13.546654
## Student\nResidual         1.748185  -0.528115  -1.282457  -0.532821   0.313145
## Cook's\nD                 0.001495   0.000117   0.001873   0.000139   0.000048

Note that we can get both the mean and prediction intervals from this table. For the prediction intervals, we need the values in the Predict ci\n95% low and Predict ci\n95% upp columns (we printed the first few values of the transposed output to make it easier to see all the columns).

##                     0          1          2          3          4
## pred        20.757936  23.154698  32.741742  20.757936  20.757936
## pred_lower  -5.845866  -3.447241   6.120737  -5.845866  -5.845866
## pred_upper  47.361739  49.756636  59.362747  47.361739  47.361739

Unfortunately, we cannot get prediction intervals for new values of $$X$$ using summary_table().

Fortunately, we can use another function, like wls_prediction_std():

##                     0          1          2          3          4
## sdev        13.559904  13.558955  13.568673  13.559904  13.559904
## pred_lower  -5.845866  -3.447241   6.120737  -5.845866  -5.845866
## pred_upper  47.361739  49.756636  59.362747  47.361739  47.361739

We can also use the previously introduced .get_prediction() method:

##                        0          1          2          3          4
## mean           20.757936  23.154698  32.741742  20.757936  20.757936
## mean_se         0.423775   0.392209   0.646107   0.423775   0.423775
## mean_ci_lower  19.926512  22.385204  31.474115  19.926512  19.926512
## mean_ci_upper  21.589361  23.924191  34.009370  21.589361  21.589361
## obs_ci_lower   -5.845866  -3.447241   6.120737  -5.845866  -5.845866
## obs_ci_upper   47.361739  49.756636  59.362747  47.361739  47.361739

Though if we decide to use wls_prediction_std() - we also get the forecast standard error, while using get_prediction() we get the standard error of the mean response.

Finally, we can plot the predicted values alongside their prediction intervals:

Next, we will calculate the standard error of the forecast (and consequently, the prediction intervals) for the log-linear model.

We will use the built-in functions to do so. First, we will calculate $$\widehat{\log(\text{wage})}$$ along with its prediction interval:

Next, we will take the exponent of the predicted value in the log-linear model, to get the predicted value for wage: $$\widehat{\text{wage}} = \exp(\widehat{\log(\text{wage})})$$. We will do the same for the prediction interval.

Considerations

Looking at the prediction intervals for the simple linear model - would it make sense that the intervals cover negative values? Think about the aim of prediction intervals (i.e. what do they show) by re-examining section 3.7.3.

In such cases, where we are certain that some values are unattainable, forecasts and their prediction intervals can be cut-off - in this case, it may make sense to cutoff the forecasts and their intervals at zero by setting $$\widetilde{\widehat{Y}} = \max\{0, \widehat{Y}\}$$ for the simple linear model.

On the other hand, we see that the forecasts and the prediction intervals are non-negative for the log-linear model.

Finally, since we used the same exogeneous variable observations that we used in model estimation, the forecasts are identical to the mean prediction. However, we may be interested to examine the forecasts with new, unknown values of educ, which is what we will do in the next Task.

We will begin by calculating the new values of $$X$$:

In this data sample the smallest value of educ is zero. So, one of the calculated educ values already exists in our historical data. On the other hand, we cannot use a value of educ, that is less than zero, so we will continue with these values as per the task requirements.

Prediction is pretty straightforward with the built-int functions in R, but a bit more work in Python:

##    const  educ
## 0    1.0   0.0
## 1    1.0  23.1

Note that we need to pass the constant.

For Python, calculating the prediction intervals can be done in a separate function from the one used to calculate the predicted value, which calculates the mean response confidence and the prediction intervals.

##                        0          1
## mean          -10.399959  44.965224
## mean_se         1.962400   1.266652
## mean_ci_lower -14.250083  42.480122
## mean_ci_upper  -6.549835  47.450327
## obs_ci_lower  -37.268053  18.258544
## obs_ci_upper   16.468135  71.671905
##                     0          1
## sdev        13.694614  13.612341
## pred_lower -37.268053  18.258544
## pred_upper  16.468135  71.671905

wls_prediction_std() function provides the forecast standard error, whereas get_prediction() provides the standard errors of the mean response.

All the required values are available in the predict() function output:

##         fit       lwr      upr
## 1 -10.39996 -37.26805 16.46813
## 2  44.96522  18.25854 71.67190

If we wanted, we could combine data from the two outputs:

##         mean  pred_lower  pred_upper
## 0 -10.399959  -37.268053   16.468135
## 1  44.965224   18.258544   71.671905

Or use the results from the get_prediction() method:

##         mean  obs_ci_lower  obs_ci_upper
## 0 -10.399959    -37.268053     16.468135
## 1  44.965224     18.258544     71.671905
##         fit       lwr      upr
## 1  4.916477 -21.55861 31.39156
## 2 52.479276  25.90049 79.05806
##         mean  pred_lower  pred_upper
## 0   4.916477  -21.558608   31.391562
## 1  52.479276   25.900494   79.058058

Note that for the quadratic model, we need to pass the transformed data, in this case, x_new**2.

On the other hand, if we had used the model with the formula already supplied:

## ================================================================================
##                    coef    std err          t      P>|t|      [0.025      0.975]
## --------------------------------------------------------------------------------
## Intercept        4.9165      1.092      4.503      0.000       2.774       7.059
## I(educ ** 2)     0.0891      0.005     18.347      0.000       0.080       0.099
## ================================================================================

Then if we use the original data for prediction we get what we expect:

##         mean
## 0   4.916477
## 1  52.479276

However, if we choose to pass the original data to the prediction interval calculation:

##    obs_ci_lower  obs_ci_upper
## 0    -21.558608     31.391562
## 1     25.900494     79.058058
##    pred_lower  pred_upper
## 0  -21.558608   31.391562
## 1  -19.483872   33.434818

We see that it is incorrect for the wls_prediction_std() function . To fix this, we would need to pass the transformed values, regardless of whether we used a formula notation, or not:

##    pred_lower  pred_upper
## 0  -21.558608   31.391562
## 1   25.900494   79.058058

Again note, that this currently affects wls_prediction_std() and not the .get_prediction() method.

Considerations

Furthermore, we can plot these data points along with the true values:

• We see that the log-linear and the quadratic model predicted mean response values are close, when $$\widetilde{\text{educ}} = 0.8 \cdot \min(\text{educ}_1, ..., \text{educ}_N) = 0$$.
• When $$\widetilde{\text{educ}} = 1.1 \cdot \max(\text{educ}_1, ..., \text{educ}_N)$$, we see that the quadratic model has the largest predicted value.

From the prediction intervals we see that the prediction intervals for the quadratic model contain negative values (not realistic for wage), but for larger values, they are lower than the prediction intervals of the log-linear model.

While the negative confidence interval is not quite realistic (for the quadratic model we could always manually cap the interval at zero, since $$\text{educ} \geq 0$$ and $$\widehat{\beta}_0 = 4.91647 > 0$$), for the quadratic model the smaller confidence interval for the larger educ value may be more plausible - from our data sample it appears that for $$\text{educ} > 20$$ the variance in wage is smaller, compared to the variance, when educ is between 15 and 20.

### 3.11.4 Exercise Set 4

We will begin by manually calculating $$R^2$$ for the simple linear model.

We will need to calculate the following: $\text{RSS} = \sum_{i = 1}^N \widehat{\epsilon}_i^2$ and $\text{TSS} = \sum_{i = 1}^N (Y_i - \overline{Y})^2$

This will allow us to calculate $$R^2$$ as: $R^2 = 1 - \dfrac{\text{RSS}}{\text{TSS}}$

This value means that our simple linear regression explains roughly $$20.7\%$$ of the variation in wage.

Since we have a univariate regression with one variable, we could say that $$20.7\%$$ of the variation in wage is explained by education (while the rest is explained by the unobservable random shocks $$\epsilon$$).

This can also be done automatically from the model results:

• For the simple linear model:
• For the log-linear model:

We have that $$25.77\%$$ of the variation in $$\log(\text{wage})$$ is explained by our regression model.

We have that $$21.94\%$$ of the variation in wage is explained by our quadratic model.

Furthermore:

• We can directly compare $$R^2$$ of the simple linear and quadratic models - our quadratic model is slightly better, as it explains around $$1\%$$ more variation than the linear model.
• The $$R^2$$ in the log-linear model is not comparable to the $$R^2$$ in the simple linear and quadratic models, since the dependent variable is transformed (and it this case, its scale is logarithmic).
• If the dependent variable were simply scaled by some constant - our $$R^2$$ would be unchanged and could be comparable (see the relevant section on $$R^2$$ in section 3.8).

From the lecture notes, $$R^2_g$$ can be calculated as: $R^2_g = \mathbb{C}{\rm orr}(Y, \widehat{Y})^2$

• For the simple linear model:
• For the log-linear model:

$$R^2_g$$ is the same for the simple linear model. It is also the same for the quadratic model. It is different (and slightly lower) for the log-linear model.

The general $$R^2$$ (i.e. $$R^2_g$$) is the largest for the quadratic model, though not by much. On the other hand, the difference in $$R^2_g$$ between between the log-linear and quadratic models is less than from the simple linear model.

The largest $$R^2$$ is for the log-linear model, but as mentioned before, it is not comparable to the linear and quadratic models, because the dependent variables do not have the same transformations. For the log linear model $$R^2 = 0.25771$$ indicates that $$25.77\%$$ of the variation in $$\log(\text{wage})$$ is explained by our (log-linear) regression model. In this case, the logarithmic scale partially contributes to an increase in $$R^2$$. From the calculated $$R^2_g$$ we see that the log-linear model is indeed better than the simple linear, but less so than the quadratic model.

Considerations

Overall $$R^2$$ did not help us identify any model, which would be marginally better than the rest. This is a great example of why we should only look at $$R^2$$ as a reference, and pay closer attention to coefficient significance, signs as well as the prediction values and prediction intervals to make sure they make economic sense.

### 3.11.5 Exercise Set 5

While we have analysed coefficient signs, significance, predicted values and $$R^2$$ - we have only briefly examined the model residuals. Remember that our model assumptions place certain requirements on the residual normality, autocorrelation and homoskedasticity.

We will begin by examining the residual vs fitted scatterplot:

We see that of the three plots the log-linear model residuals appear to have residuals without any non-linear relationships. Since we have fewer observations for lower values of wage, it may appear that the variance is not constant. On the other hand, it may be possible that for lower wage values the residuals have a smaller variance than for higher wage values.

If we check the residual vs independent variable scatter plot:

we arrive at the same conclusions as we did for the residual vs fitted scatter plots.

Considerations

Note that if we wanted of access the data used in model estimation, we would need to examine the model object, before fitting it (more info here):

## wage ~ I(educ**2)
## [44.44 16.   15.38 13.54 25.  ]
##     wage
## 0  44.44
## 1  16.00
## 2  15.38
## 3  13.54
## 4  25.00
## [[  1. 169.]
##  [  1. 196.]
##  [  1. 324.]
##  [  1. 169.]
##  [  1. 169.]]
##    Intercept  I(educ ** 2)
## 0        1.0         169.0
## 1        1.0         196.0
## 2        1.0         324.0
## 3        1.0         169.0
## 4        1.0         169.0

once we use the .fit() method, we lose access to these variables.

Next, we will examine the residual Q-Q plot and histogram (we will also plot a boxplot (excluding outliers) for comparison):

• For the simple linear model - it appears that the residuals are skewed and from a non-normal distribution - the residuals on the Q-Q plot do not fall along a straight line with the theoretical quantiles of a normal distribution. Furthermore, the histogram indicates that the residual distribution is skewed to the right hand side of the histogram (we can also see that the median is closer to the first quartile than the third quartile).
• For the log-linear model - the residuals of a log-linear model appear to closely resemble a normal distribution - the histogram resembles a bell shape that is familiar of a normal distribution, while the Q-Q plot falls along a straight line.
• For the quadratic model - unfortunately, the residuals of a quadratic model do not appear to be from a normal distribution - they are quite similar to the residuals from a simple linear model.

In this case, we see that, while the quadratic model appears to be similar to a log-linear model in terms of predicted values and $$R^2$$, its residuals do not follow the normality assumption which we require in order to test hypothesis on coefficient significance and distribution.

We will carry out a number of tests on the model residuals.

##### 3.11.5.2.1 Testing for Residual Homoskedasticity

We will begin by testing the following hypothesis: \begin{aligned} H_0&: \text{ (residuals are homoskedastic)}\\ H_1&: \text{ (residuals are heteroskedastic)} \end{aligned} via the Breusch-Pagan test:

Since p-value = 0.0063 < 0.05, we reject the null hypothesis and are left with the alternative that the simple linear regression residuals are heteroskedastic.

##
##  studentized Breusch-Pagan test
##
## data:  lm_fit_loglin
## BP = 6.9769, df = 1, p-value = 0.008257

Since p-value = 0.0068257 < 0.05, we reject the null hypothesis and accept the alternative that the log-linear regression residuals are heteroskedastic.

##
##  studentized Breusch-Pagan test
##
## BP = 9.7331, df = 1, p-value = 0.00181

Since p-value < 0.05, so we reject the null hypothesis and accept the alternative that the quadratic regression residuals are heteroskedastic.

##### 3.11.5.2.2 Testing for Residual Autocorrelation

Next, we will test the hypothesis: \begin{aligned} H_0&:\text{the residuals are serially uncorrelated}\\ H_1&:\text{the residuals are an autoregressive process} \end{aligned} via Durbin-Watson test (in this case the alternative hypothesis is equivalent to $$H_0:\text{significant autocorrelation at lag 1}$$):

##
##  Durbin-Watson test
##
## data:  lm_fit_linear
## DW = 1.9595, p-value = 0.4828
## alternative hypothesis: true autocorrelation is not 0
##
##  Durbin-Watson test
##
## data:  lm_fit_loglin
## DW = 1.9703, p-value = 0.6064
## alternative hypothesis: true autocorrelation is not 0
##
##  Durbin-Watson test
##
## DW = 1.9587, p-value = 0.4742
## alternative hypothesis: true autocorrelation is not 0

Since p-value > 0.05, we have no grounds to reject the null hypothesis that the residuals are not serially correlated in any of the three - linear, log-linear or quadratic - models. Unfortunately, we only get the DW statistic in Python and not the associated p-value.

We can also test for autocorrelation the via the Breusch-Godfrey test (in this case the alternative hypothesis is equivalent to $$H_0:\text{significant autocorrelation at lag k}$$, where we have to specify $$k$$):

##
##  Breusch-Godfrey test for serial correlation of order up to 2
##
## data:  lm_fit_linear
## LM test = 0.44312, df = 2, p-value = 0.8013

Since p-value = 0.8013 > 0.05, we have no grounds to reject the null hypothesis and conclude that the linear model residuals are not autocorrelated.

##
##  Breusch-Godfrey test for serial correlation of order up to 2
##
## data:  lm_fit_loglin
## LM test = 0.85029, df = 2, p-value = 0.6537

Since p-value = 0.6537 > 0.05, we have no grounds to reject the null hypothesis and conclude that the log-linear model residuals are not autocorrelated.

##
##  Breusch-Godfrey test for serial correlation of order up to 2
##
## LM test = 0.4735, df = 2, p-value = 0.7892

Since p-value = 0.7892 > 0.05, we have no grounds to reject the null hypothesis and conclude that the quadratic model residuals are not autocorrelated.

##### 3.11.5.2.3 Testing for Residual Normality

Finally, we will test the normality hypothesis: \begin{aligned} H_0&:\text{residuals follow a normal distribution}\\ H_1&:\text{residuals do not follow a normal distribution} \end{aligned} via Shapiro-Wilk test:

##
##  Shapiro-Wilk normality test
##
## data:  lm_fit_linear$residuals ## W = 0.82704, p-value < 2.2e-16 ## ## Shapiro-Wilk normality test ## ## data: lm_fit_loglin$residuals
## W = 0.99546, p-value = 0.001196
##
##  Shapiro-Wilk normality test
##
## data:  lm_fit_quad$residuals ## W = 0.82939, p-value < 2.2e-16 In all three models, we reject the null hypothesis and conclude that the residuals are not normally distributed. On the other hand, if we look at alternative normality tests for the log-linear model residuals (since their histogram and Q-Q plot most closely resembled the normal distribution case), for example the Jarque-Bera test: ## ## Jarque Bera Test ## ## data: lm_fit_linear$residuals
## X-squared = 63871, df = 2, p-value < 2.2e-16
##
##  Jarque Bera Test
##
## data:  lm_fit_loglin$residuals ## X-squared = 1.5331, df = 2, p-value = 0.4646 ## JB-stat: 1.5330842008737402 ## p-value: 0.4646168923309677 ## Skew: 0.07673785662665146 ## Kurtosis: 3.084302369236367 ## ## Jarque Bera Test ## ## data: lm_fit_quad$residuals
## X-squared = 68599, df = 2, p-value < 2.2e-16
• For the log-linear model: since p-value > 0.05, we have no grounds to reject the null hypothesis of normality for the log-linear model residuals.
• For the remaining models: we reject the null hypothesis for the simple linear and the quadratic model since p-value < 0.05 for those models’ residuals.

Considerations

As we have seen, it is a good idea to not rely solely on a single test for residual normality. Similarly, we should not rely on only one test, when testing other residual properties, as we have done in this Task.

From Task 22 we have that:

• none of the model residuals are autocorrelated;
• all of the model residuals are heteroskedastic;
• according to the Durbin-Watson test, none of the residuals are normally distributed;
• the Jarque-Bera test suggests that the log-linear model residuals may be normally distributed;

From Task 21 we have that:

• residual scatter plots (vs fitted; and vs educ) indicate that the log-linear model residuals exhibit the least amount of a non-linear dependence than the other model residuals.
• the log-linear model residuals look normally distributed;

From Task 19: according to $$R^2_g$$, log-linear and quadratic models are similar in terms of percentage of variation in wage explained by the specified model and are better than the simple linear model.

From Tasks 10, 16 & 17:

• The log-linear and quadratic models provide fitted values which coincide with economic theory.
• The simple linear regression predicts negative values of wage for small values of educ - this is not possible for hourly wage and as such the linear model is not a good fit for our data.

We also note that the $$\beta_1$$ coefficients have the expected (from economic theory) positive signs and are statistically significant.

From all of this, we can conclude that the log-linear model may be the best fit of the three models, because the residuals are also (unlike the quadratic model) normally distributed (which is what we want from our model).

Considerations

However, none of the model residuals are homoskedastic. Part of this could be explained by the fact that we do not have enough observations for smaller values of educ (and for smaller values of wage), so the variation may be different for lower values of wage (for up to 10 years of educ), compared to larger values of wage (when educ is 10+ years).

Because the residuals are not homoskedastic, the estimated OLS coefficients are not efficient (so they are not BLUE), and the standard errors are biased (so we may not make the correct assumptions about significance). To account for this, we would need to either: correct the residuals, use a different estimation method (weighted least squares), or attempt to account for the heteroskedasticity by including some additional variables alongside educ or try some other transformation of the variables (maybe a cubic model?).

We will begin by taking a subset of the data. To do this, firstly, get the index of the data $$i = 1,...,N$$:

Then, $$80\%$$ of the data is equal to:

where we use floor() to take the integer value, in case we get a value with a fraction.

We want to take a sample of the data. This is equivalent to taking a subset of the index vector, which we can use to select the data sample:

Note that we are taking the index values without replacement - each element from the index can be taken only once.

In comparison, the remaining index values, which we do not take, can be selected by using setdiff() function, which returns the values in dt_index, which are not in smpl_index:

The dataset used for modelling is called the training set. The remaining data, which is used to evaluate the predictive accuracy of the model, is called the test set.

For comparison, the training and test data looks like:

As we have mentioned in Task 22, we chose a log-linear as the best univariate regression model of the models that we used for this dataset.

##              Estimate  Std. Error  t value     Pr(>|t|)
## (Intercept) 1.5774693 0.078801222 20.01833 9.504523e-75
## educ        0.1000511 0.005435415 18.40726 5.068383e-65
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Intercept      1.6018      0.081     19.855      0.000       1.443       1.760
## educ           0.0988      0.006     17.681      0.000       0.088       0.110
## ==============================================================================

The coefficients are similar compared to the coefficients of the model on the full dataset:

##               Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 1.59683536 0.07018017 22.75337 1.530466e-95
## educ        0.09875341 0.00484219 20.39437 1.344896e-79
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          1.5968      0.070     22.753      0.000       1.459       1.735
## educ           0.0988      0.005     20.394      0.000       0.089       0.108
## ==============================================================================

The tests can be carried out as before.

• If we do not reject the null hypothesis that the residuals are homoskedastic, this would indicate that our model on the training set has homoskedastic errors (i.e. with a constant variance) - something that we did not have previously.
##
##  studentized Breusch-Pagan test
##
## data:  dt4_smpl_mdl
## BP = 4.2338, df = 1, p-value = 0.03963

In this case the $$p$$-value is closer to 0.05. This may suggest that if we did not have observations on smaller values of wage, the residuals would be homoskedastic.

## Lagrange multiplier statistic: 5.706901
## p-value: 0.016898
## F-value: 5.729069
## F p-value: 0.016878

In this case, we reject the null hypothesis that the residuals are homoskedastic.

Note that the difference between R and Python versions is because of different data samples.

##
##  Durbin-Watson test
##
## data:  dt4_smpl_mdl
## DW = 2.0239, p-value = 0.7118
## alternative hypothesis: true autocorrelation is not 0

Since p-value > 0.05 (and we see that the DW statistic is quite close to 2), we do not reject the null hypothesis of no serial correlation.

##
##  Shapiro-Wilk normality test
##
## data:  dt4_smpl_mdl$residuals ## W = 0.99366, p-value = 0.0004296 ## ## Jarque Bera Test ## ## data: dt4_smpl_mdl$residuals
## X-squared = 2.7612, df = 2, p-value = 0.2514
## JB-stat: 1.604276500225152
## p-value: 0.4483692128688961
## Skew: 0.08649250749274705
## Kurtosis: 3.1009113328427347

For normality test - we still reject the null hypothesis of normally distributed residuals via the Shapiro-Wilk test, and do not reject the null hypothesis via Jarque-Bera Test.

Overall, the only difference is in the homoskedasticity test (in R) - the residuals may be homoskedastic, since the $$p$$-value is quite close to the $$5\%$$ critical level.

Visually, though:

there is not a huge difference from before.

Consequently, the fitted values alongside the training dataset are similar to as before:

Next, we can calculate the predicted values for the test set:

##        fit      lwr      upr
## 1 2.878134 1.933492 3.822775
## 2 2.878134 1.933492 3.822775
## 3 3.178287 2.233542 4.123032
## 4 2.778083 1.833235 3.722930
## 5 2.778083 1.833235 3.722930
## 6 3.178287 2.233542 4.123032
##        mean  pred_lower  pred_upper
## 0  3.380522    2.410259    4.350784
## 1  2.886435    1.917009    3.855861
## 2  2.886435    1.917009    3.855861
## 3  3.380522    2.410259    4.350784
## 4  3.182887    2.213331    4.152443
## 5  3.676974    2.704724    4.649224

Next, we will plot the test data as well as the predicted values and confidence intervals:

We see that most, but not all of the values are in the prediction intervals.

Considerations

We can calculate which values fall in the interval (note, remember, that we are predicting the logarithms of wage in our log-linear model, so we need to take the exponent):

##        fit      lwr      upr actual
## 1 17.78106 6.913612 45.73094  44.44
## 2 17.78106 6.913612 45.73094  13.54
## 3 24.00560 9.332865 61.74617  40.95
## 4 16.08814 6.254084 41.38549  21.50
## 5 16.08814 6.254084 41.38549  24.35
## 6 24.00560 9.332865 61.74617  55.00
##         mean  pred_lower  pred_upper  actual
## 0  29.386103   11.136849   77.539262   15.38
## 1  17.929271    6.800585   47.269279   13.54
## 2  17.929271    6.800585   47.269279   25.00
## 3  29.386103   11.136849   77.539262   23.86
## 4  24.116276    9.146133   63.589148   57.70
## 5  39.526615   14.950194  104.503880   30.76

Next, we will check which rows are greater than the lower bound and less than the upper bound:

##        fit      lwr      upr actual in_interval
## 1 17.78106 6.913612 45.73094  44.44        TRUE
## 2 17.78106 6.913612 45.73094  13.54        TRUE
## 3 24.00560 9.332865 61.74617  40.95        TRUE
## 4 16.08814 6.254084 41.38549  21.50        TRUE
## 5 16.08814 6.254084 41.38549  24.35        TRUE
##         mean  pred_lower  pred_upper  actual  in_interval
## 0  29.386103   11.136849   77.539262   15.38         True
## 1  17.929271    6.800585   47.269279   13.54         True
## 2  17.929271    6.800585   47.269279   25.00         True
## 3  29.386103   11.136849   77.539262   23.86         True
## 4  24.116276    9.146133   63.589148   57.70         True
## 5  39.526615   14.950194  104.503880   30.76         True

Now, the percentage of values, which fall in the prediction interval is:

Since we have constructed a $$95\%$$ prediction interval, this quite close and a good indication of our models predictive strength. Nevertheless, the residuals may still be heteroskedastic, so this model may need further improvement.