3.10 Chapter Exercises

A number of example datasets can be found on the Principles of Econometrics book website, together with their dataset definitions, or among the R datasets, which are available in both R and Python.

Of course, you are encouraged to find any other data that may be interesting to you. The data used in this section was chosen for its ease of access.

3.10.1 Datasets

Below we list some select datasets, which you can analyse.

Dataset 1

The dataset: food (data definition).

Let’s say that we are interested in estimating how the expenditure on food, food_exp (\(Y\)), depends on the income (\(X\)).

The dataset can be loaded in both R and Python.

Dataset 2

The dataset: nuclear.

Let’s say that we are interested in estimating how the cost (\(Y\)) depends on the power capacity, cap (\(X\)).

The dataset can be loaded in both R and Python.

Dataset 3

The dataset: stockton5_small (data definition) contains data on houses sold in Stockton, California in 1996-1998.

Assume that we are interested in how the sale price, sprice (\(Y\)), is affected by the house living area, livarea (\(X\)).

The dataset can be loaded in both R and Python.

Dataset 4

The dataset: cps5_small (data definition) contains data on hourly wage rates, education, etc. from the 2013 Current Population Survey.

Suppose we are interested in examining how education, educ (\(X\)), affects wage (\(Y\)).

The dataset can be loaded in both R and Python.

Dataset 5

The dataset: tuna (data definition) contains weekly data (we will ignore the time dimension for now) on the number of cans sold of \(brand\ 1\) tuna (sal1).

Consider examining how the ratio of \(brand\ 1\) tuna prices, apr1, to \(brand\ 3\) tuna prices, apr3, affects sal1 in thousands of units. In order to do this you will need to:

  • Firstly, scale sal1, so that it measures sales in thousands (instead of single units).
  • Secondly, calculate the ratio as \(\text{price_ratio} = 100 \cdot (\text{apr1} / \text{apr3})\). This ratio expresses the price of \(brand\ 1\) tuna as a percentage of the price of \(brand\ 3\) tuna. When \(\text{price_ratio} > 100\), \(brand\ 1\) tuna is more expensive than \(brand\ 3\) tuna, and when \(\text{price_ratio} < 100\) it is less expensive. For example:
    • if the ratio equals \(100\), then the price of both brands is the same;
    • if it is equal to \(90\), then \(brand\ 1\) is \(10\%\) cheaper than \(brand\ 3\);
    • if it is equal to \(110\), then \(brand\ 1\) is \(10\%\) more expensive than \(brand\ 3\).
  • Finally, estimate how the price ratio affects the sales numbers of \(brand\ 1\).
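The two transformation steps above can be sketched in Python as follows. This is only a minimal illustration: the three observations are made-up values (an assumption, not taken from the tuna dataset), chosen so that the ratio reproduces the \(90\), \(100\) and \(110\) cases; with the real data, the arrays would be replaced by the sal1, apr1 and apr3 columns.

```python
import numpy as np

# Made-up values for illustration only (NOT taken from the tuna dataset).
sal1 = np.array([12000.0, 8500.0, 5100.0])  # unit sales of brand 1
apr1 = np.array([0.90, 1.00, 1.10])         # price of brand 1
apr3 = np.array([1.00, 1.00, 1.00])         # price of brand 3

# Step 1: rescale sales from single units to thousands of units.
sal1_thousands = sal1 / 1000

# Step 2: price of brand 1 as a percentage of the price of brand 3.
price_ratio = 100 * (apr1 / apr3)

print(sal1_thousands)
print(np.round(price_ratio, 2))
```

The regression in the final step would then use sal1_thousands as \(Y\) and price_ratio as \(X\).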

The dataset can be loaded in both R and Python.

3.10.2 Tasks

The following tasks are universal for all datasets. This highlights that in practical applications you will usually need to carry out similar steps and ask (yourself) similar general questions when working with any kind of data.

These general questions will then lead you to examine the data and variable definitions in more detail and think about specific geographical (e.g. Eastern vs Western Europe), social (e.g. lower vs middle class), psychological (e.g. cultural shifts between generations vs tradition), financial (e.g. does the dataset cover the period before, during or after the financial crisis, are the individuals in this dataset prone to save money) and economic (e.g. trade agreements, trade wars, tax laws, etc.) factors that are present in your data.

An example with one of the datasets is provided in section 3.11. Some comments are provided within the example to highlight additional insights, further questions and possible difficulties that can be identified from the dataset.

Below are the tasks that you should carry out for the datasets:

Exercise Set 1

  1. Plot the scatter plot of the dependent variable \(Y\) and the independent variable \(X\). Do the variables look correlated? Calculate the correlation between \(X\) and \(Y\) to verify.
  2. Specify the regression in a mathematical formula notation. What coefficient sign do you expect \(\beta_1\) to have? Explain your answer.
  3. Estimate the regression via OLS without using the built-in OLS estimation functions. Is the sign on \(\beta_1\) the same as you expected?
  4. Calculate the standard errors of the estimated coefficients.
  5. Write down the estimated regression formula.
  6. Calculate the fitted values and plot the estimated regression alongside the data.
  7. Finally, use the built-in functions for OLS estimation and compare with your own results.

Exercise Set 2

  8. Examine the run-sequence plot of the dependent variable, \(Y\). Do you notice anything about your \(Y\) data - are there any outliers, are there more large (or small) values, or do the observations appear equally likely (uniformly distributed) across the range of \(Y\) values? Look back at Task (1) and compare the run-sequence plot of \(Y\) with the scatter plot of \(Y\) and \(X\) - does the mean appear to be the same across observations for different values of \(X\)? What about the variance? Would you have been able to come to similar conclusions about the mean and variance from the run-sequence plot alone?
  9. Examine the histograms of your dependent (\(Y\)) and independent (\(X\)) variables. Are there any variables that appear to have a non-normal distribution?
  10. Take another look at the scatter plot from Task (1) - could you specify at least one more linear regression, but this time with transformed variable(-s)? (Note: consider transformations either by scaling with a constant, by taking logarithms, or by taking the square of the independent variable.)
  11. Examine the residual run-sequence plots and histograms from the regressions in Task (3) and Task (10) - which regressions appear to have residuals that are random? Are there any regressions where the residuals do not appear to be random (i.e. they are not randomly dispersed around the horizontal axis)? What can you say about the models in such cases (with regard to the linear regression model assumptions (UR.1)-(UR.4) and the sign of the coefficient \(\beta_1\))?
  12. Select one model, which you believe to be best suited for your data from the conclusions in Tasks (7) through (10), and write down the equation.
  13. Provide an interpretation of \(\beta_1\) for your selected model.
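Since a model with transformed variables is still linear in its parameters, the same closed-form OLS expressions can be applied to the transformed columns. The sketch below is one possible way to do this; the simulated data and the helper ols_manual() are assumptions made for illustration and do not come from the datasets above. It computes \(\widehat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}\) and the coefficient standard errors for a linear, a log-linear and a quadratic specification.

```python
import numpy as np

def ols_manual(x, y):
    """Closed-form OLS for y = b0 + b1 * x + e (hypothetical helper).

    Returns the coefficient estimates, their standard errors and the residuals.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])   # design matrix with an intercept
    XtX = X.T @ X
    beta = np.linalg.solve(XtX, X.T @ y)        # (X'X)^{-1} X'y
    resid = y - X @ beta
    n, k = X.shape
    sigma2 = resid @ resid / (n - k)            # estimate of the error variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(XtX)))
    return beta, se, resid

# Simulated data (an assumption): log(Y) is linear in X.
rng = np.random.default_rng(123)
x = rng.uniform(1, 20, size=200)
y = np.exp(0.5 + 0.1 * x + rng.normal(0, 0.2, size=200))

# The transformation is applied to the data; the estimator stays the same.
beta_lin, se_lin, res_lin = ols_manual(x, y)             # Y on X
beta_log, se_log, res_log = ols_manual(x, np.log(y))     # log(Y) on X
beta_sq, se_sq, res_sq = ols_manual(x ** 2, y)           # Y on X^2

for name, b, s in [("linear", beta_lin, se_lin),
                   ("log-linear", beta_log, se_log),
                   ("quadratic", beta_sq, se_sq)]:
    print(name, np.round(b, 4), np.round(s, 4))
```

The residuals returned for each specification can be passed directly to the run-sequence plots and histograms of Task (11).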

Note: Initially try to estimate at least one model with transformed variables without using the built-in OLS estimation functions, in order to make sure you understand how the transformations are applied and how they are incorporated in the same OLS formula as for the simple linear regression case.

Exercise Set 3

  14. Select two models - one model from Task (12) and any other one model from either Task (3) or Task (10) - and test (by calculating the \(p\)-value) the null hypothesis that the coefficient of your explanatory variable \(X\) is not significant, with the alternative that:
    1. \(\beta_1\) is negative;
    2. \(\beta_1\) is positive;
    3. \(\beta_1\) is not zero;
  15. Plot the confidence intervals for the mean response for each model and plot them for \(Y\) (i.e. not \(\log(Y)\) or any other transformation of \(Y\)).
  16. Plot the prediction intervals of \(Y\) for existing values of \(X\).
  17. Let’s say our new \(X\) is:
    1. \(\widetilde{X} = 0.8 \cdot \min(X_1, ..., X_N)\)
    2. \(\widetilde{X} = 1.1 \cdot \max(X_1, ..., X_N)\)
    Calculate the predicted value along with the prediction intervals for each case.

Exercise Set 4

  18. Calculate (either manually or using the built-in OLS estimation functions) \(R^2\) of your selected models from Task (14). Can you directly compare the \(R^2\) values of your selected models (and why)?
  19. Calculate \(R^2_g\) (the general \(R^2\)) for the back-transformed variables (i.e. for non-transformed values of the dependent variable - \(Y\)). Is it larger than the standard \(R^2\), which is reported by either lm() in R, or sm.OLS() in Python?
  20. Which model has the largest \(R^2\) (is it the same model if our selection is based on \(R^2_g\))? Is the model the same as in Task (12)? For the model with the largest \(R^2\), provide an interpretation of the calculated \(R^2\) value.

Exercise Set 5

  21. Once again look at the residual plots:
    • scatter plot of residuals and fitted values, scatter plot of the residuals and \(X\) - are there any non-linearities in the residuals?
    • residual Q-Q plot, their histogram - are the residuals normal?
  22. Carry out the Breusch-Pagan Test for homoskedasticity, Durbin-Watson Test for autocorrelation and Shapiro-Wilk Test for normality. What are the null and alternative hypotheses of these tests? What do these test results say about the residuals?
  23. Looking at all of the results thus far - which model would you consider the “best” and why?
  24. Take a subset of the data - around \(80\%\) of the sample - and estimate the best model, which you selected in Task (23), on the subset:
    • Are the model results (signs, significance, residual GoF tests) similar to the model on the full data?
    • Plot the subset data along with the predicted values of the model;
    • Calculate the predicted values and confidence intervals for the remaining \(20\%\);
    • Plot the predicted values, their confidence intervals and the actual values - do the true values fall in the confidence intervals?
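The last task can be sketched as follows. The data are simulated and the \(80\%\)/\(20\%\) split is random (both are assumptions made for illustration); the fitted model is a simple linear regression estimated with the closed-form OLS expressions. Alongside the confidence interval for the mean response, the sketch also computes the wider prediction interval, since individual observations are usually compared against the latter.

```python
import numpy as np
from scipy import stats

# Simulated data (an assumption): Y = 2 + 1.5 * X + e.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=n)

# Random split: about 80% for estimation, the remaining 20% for evaluation.
idx = rng.permutation(n)
n_train = int(0.8 * n)
train, test = idx[:n_train], idx[n_train:]

# Estimate the model on the training subset only.
X_tr = np.column_stack([np.ones(n_train), x[train]])
XtX_inv = np.linalg.inv(X_tr.T @ X_tr)
beta = XtX_inv @ X_tr.T @ y[train]
resid = y[train] - X_tr @ beta
sigma2 = resid @ resid / (n_train - 2)

# Predict the held-out observations.
X_te = np.column_stack([np.ones(test.size), x[test]])
y_hat = X_te @ beta

# x0' (X'X)^{-1} x0 for every held-out row x0.
h = np.einsum("ij,jk,ik->i", X_te, XtX_inv, X_te)
se_mean = np.sqrt(sigma2 * h)          # standard error of the mean response
se_pred = np.sqrt(sigma2 * (1 + h))    # standard error of a new observation
t_crit = stats.t.ppf(0.975, df=n_train - 2)

ci_lower, ci_upper = y_hat - t_crit * se_mean, y_hat + t_crit * se_mean
pi_lower, pi_upper = y_hat - t_crit * se_pred, y_hat + t_crit * se_pred

# Share of held-out values that fall inside each 95% interval.
in_ci = np.mean((y[test] >= ci_lower) & (y[test] <= ci_upper))
in_pi = np.mean((y[test] >= pi_lower) & (y[test] <= pi_upper))
print("inside confidence intervals:", in_ci)
print("inside prediction intervals:", in_pi)
```

For a model with a transformed dependent variable, the interval bounds would additionally need to be back-transformed to the scale of \(Y\) before comparing them with the actual values.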