4.10 Chapter Exercises

A number of example datasets, along with their dataset definitions, can be found on the Principles of Econometrics book website, or among the built-in R datasets, which are accessible from both R and Python.

Of course, you are encouraged to find any other data that interests you. The data used in this section was chosen for its ease of access.

4.10.1 Datasets

Below we list a selection of datasets which you can analyse.

Dataset 1

The dataset: food (data definition).

Let’s say that we are interested in estimating how the expenditure on food, food_exp (\(Y\)), depends on the income variable and its polynomial transformations. The dataset can be loaded in both R and Python:
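As a sketch of the loading pattern in Python (the inline CSV below is a made-up stand-in for the actual food file; in practice you would pass the downloaded file's path or URL to `pd.read_csv`):

```python
import io

import pandas as pd

# Made-up rows standing in for the real food dataset -- replace the buffer
# with the actual file, e.g. pd.read_csv("food.csv"), once downloaded:
csv_buffer = io.StringIO(
    "food_exp,income\n"
    "120.5,4.0\n"
    "150.3,6.5\n"
    "180.9,9.0\n"
    "210.0,12.5\n"
)
food = pd.read_csv(csv_buffer)

# With a single regressor, its polynomial transformations can serve as
# additional explanatory variables:
food["income_sq"] = food["income"] ** 2
food["income_cub"] = food["income"] ** 3
print(food.columns.tolist())
```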

Note: even though we only have one explanatory variable, do not forget that you can also include its square, cube, etc. as additional explanatory variables.

Dataset 2

The dataset: nuclear.

Let’s say that we are interested in estimating how the cost (\(Y\)) depends on (some of) the remaining explanatory variables.

The dataset can be loaded in both R and Python:

Dataset 3

The dataset: stockton5_small (data definition) contains data on houses sold in Stockton, California in 1996-1998.

Assume that we are interested in how the sale price, sprice (\(Y\)), is affected by (some of) the remaining explanatory variables.

The dataset can be loaded in both R and Python:

Dataset 4

The dataset: cps5_small (data definition) contains data on hourly wage rates, education, etc. from the 2013 Current Population Survey.

Suppose we are interested in examining which of the various explanatory variables affect wage (\(Y\)).

The dataset can be loaded in both R and Python:

Dataset 5

The dataset: tuna (data definition) contains weekly data (we will ignore the time dimension for now) on the number of cans of \(brand\ 1\) tuna sold (sal1).

Consider examining how the ratio of \(brand\ 1\) tuna prices, apr1, to \(brand\ 3\) tuna prices, apr3 (for the creation of such ratio variables, see the equivalent Dataset 5 in 3.10), as well as the store display and advertisement variables, affect sal1 (in thousands of units).

The dataset can be loaded in both R and Python:
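A sketch of creating the price-ratio regressor in Python (the rows below are made-up stand-ins for the real tuna data; in practice, load the actual file, e.g. with `pd.read_csv("tuna.csv")`):

```python
import pandas as pd

# Made-up illustrative values standing in for the real tuna dataset:
tuna = pd.DataFrame({
    "sal1": [15.0, 22.1, 18.3],
    "apr1": [0.85, 0.79, 0.90],
    "apr2": [0.80, 0.82, 0.78],
    "apr3": [0.75, 0.81, 0.70],
})

# Price ratio of brand 1 to brand 3, as in Dataset 5 of section 3.10:
tuna["ratio13"] = tuna["apr1"] / tuna["apr3"]
print(tuna["ratio13"].round(3).tolist())
```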

4.10.2 Tasks

The following tasks are the same for all datasets. This is to highlight that in practical applications you will usually need to carry out similar steps and ask (yourself) similar general questions, regardless of the kind of data you are working with.

The exercises generally build upon the initial ones in section 3.10 by asking you to consider the inclusion of additional variables, further variable testing and model specification tests. The general questions still remain - their purpose, as before, is to lead you to examine the data and variable definitions in more detail and to think about the data that you are analysing.

An example with one of the datasets is provided in section 4.11 - it is the continuation of the dataset example from section 3.11. As before, some comments are provided within the example to highlight additional possible insights, further questions and possible difficulties that can be identified from the dataset.

We stress that the tasks and questions are there to give you the general steps of the modelling process as a whole, mostly in the logical order that you would follow in real-world applications (though some are arranged to follow the chapter ordering) - it is not a rule that you must always examine the scatter plots of every variable, or always run specific tests, when carrying out your analysis. However, they may help you support any arguments/insights that you discover during the modelling process (see ?? for the variety of problems that may arise - not all will be present when you analyse new data).

Below are the tasks that you should carry out for the datasets:

Note: Take \(80\%\) of the data as the training set and fit your model on this data. Hold out the remaining \(20\%\) as the test set, which you can then use to check the out-of-sample characteristics of your model.

Exercise Set 1

(Reminder: use the training set)
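The 80/20 split described in the note above can be sketched as follows (simulated data stands in for whichever dataset you chose):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Simulated toy dataset standing in for any of the datasets above:
df = pd.DataFrame({
    "y": rng.normal(size=100),
    "x1": rng.normal(size=100),
    "x2": rng.normal(size=100),
})

# Shuffle the row indices, then take the first 80% as the training set
# and hold out the remaining 20% as the test set:
idx = rng.permutation(df.index)
cut = int(0.8 * len(df))
train = df.loc[idx[:cut]]
test = df.loc[idx[cut:]]
print(len(train), len(test))
```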

  1. Visually examine the data:

    • Plot the scatter plots of the dependent variable \(Y\) and the independent variables \(X_1,...,X_k\). Which variables \(X_j\) visually appear to be related to \(Y\)? Are there any variables \((X_i, X_j)\) that seem to have a linear dependence between one another?
    • Examine the histograms of the dependent and independent variables and provide some insights.
    • Plot a correlation matrix heatmap for a visual representation of the variable correlations - are there any independent variable pairs \((X_{i}, X_{j}), i \neq j\), which seem to be strongly correlated?
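A numeric counterpart to the heatmap: pandas can compute the correlation matrix directly, and you can flag strongly correlated pairs programmatically (simulated data below; a seaborn heatmap of `df.corr()` would give the visual version):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)  # deliberately collinear with x1
x3 = rng.normal(size=n)
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

corr = df.corr()
# Flag pairs with |correlation| above 0.8 -- an informal cutoff:
pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.8
]
print(pairs)
```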
  2. Specify one regression model in mathematical formula notation, based on economic theory. What signs do you expect the coefficients \(\beta_1,\beta_2,...\) to have? Explain. Note: this is not necessarily the best regression model - it is simply one that you think makes economic sense.

  3. Estimate the regression via OLS. Are the signs of \(\beta_1,\beta_2,...\) the same as you expected? If not - can you think of any situations that might cause the different signs (consider various societal/economic/tradition/education/market competition/aging population/etc. factors, which may have an opposite effect)?

  4. Test which variables are statistically significant. Remove the insignificant variables (but keep the initially estimated model as a separate variable, as a backup).
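A sketch of this significance check with statsmodels (simulated data in which x1 truly matters and x2 does not):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1.0 + 2.0 * df["x1"] + rng.normal(size=n)  # x2 has no true effect

full_model = smf.ols("y ~ x1 + x2", data=df).fit()  # kept as a backup
print(full_model.pvalues.round(4))

# Keep only the regressors that are significant at the 5% level:
keep = [v for v in ["x1", "x2"] if full_model.pvalues[v] < 0.05]
final_model = smf.ols("y ~ " + " + ".join(keep), data=df).fit()
print(final_model.params.round(3))
```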

  5. Write down the final estimated regression formula (including only the significant variables).

Exercise Set 2

(Reminder: use the training set)

  1. Examine the residual plots. Test for normality, autocorrelation, heteroskedasticity. Do the residuals violate our (MR.3) - (MR.6) model assumptions?
  2. Add interaction variables (or polynomial - usually quadratic - terms, if your dataset has only one explanatory variable) to your model. If you are adding an interaction term involving a variable which you previously removed as insignificant, add both the interaction and the discarded variable back into the model. Provide an interpretation of the interaction signs you expect. Then estimate the model and check whether the new terms are significant. If they are - re-examine the residuals.
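A sketch of adding and testing an interaction term with statsmodels formulas (simulated data with a genuine interaction effect; `x1:x2` adds only the interaction, while `x1*x2` would add both main effects and the interaction in one step):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 300
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
# Simulated outcome with a genuine interaction effect:
df["y"] = (1 + 2 * df["x1"] - df["x2"]
           + 1.5 * df["x1"] * df["x2"] + rng.normal(size=n))

res = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()
print(res.params["x1:x2"], res.pvalues["x1:x2"])
```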
  3. Are there any economic restrictions that you could evaluate on the estimated model? If so, test them; otherwise, think of some arbitrary ones from the model output and test those.

Exercise Set 3

(Reminder: use the training set)

  1. If you do not reject the null hypothesis of your specified linear restrictions, try to re-estimate the model via RLS. What changes (if any) do you notice about your model coefficients and their significance?
  2. Using the model with OLS estimates, check if any variables are collinear in your model. If so, try to account for multicollinearity in some way.
  3. Use the residuals of your finalized model, with OLS estimates, and test them for autocorrelation and heteroskedasticity.
  4. If there is a presence of autocorrelation or heteroskedasticity in the residuals (of the model with OLS estimates), do the following (based on the test results):
    • use a consistent error variance estimator to re-estimate the standard errors;
    • specify the variance-covariance matrix form for the residuals and use a FGLS estimator to re-estimate the parameters.
  5. Compare the parameter estimates - if there are any differences between the FGLS estimates and the OLS estimates with consistent standard errors, are they cause for concern?

Exercise Set 4

  1. Check your model specification via the Rainbow Test for Linearity. What can you conclude about your model?

  2. Check your model specification via the Ramsey RESET test. What can you conclude about your model?

  3. Carry out an automatic model selection procedure. If the model is different, examine:

    • The AIC and BIC values of the model.
    • The \(R^2_{adj}\).
    • The coefficient signs and significance.
    • Based on the above, which model would you consider superior and why?
  4. Use the test set and check how your model (selected based on Task 16 conclusions) performs on new data.
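The out-of-sample check can be sketched as follows (simulated data here; in practice reuse the train/test split made at the start and your selected model):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n = 100
df = pd.DataFrame({"x1": rng.normal(size=n)})
df["y"] = 1 + 2 * df["x1"] + rng.normal(size=n)

# 80/20 split, fit on the training rows only:
train, test = df.iloc[:80], df.iloc[80:]
res = smf.ols("y ~ x1", data=train).fit()

# Predict on the held-out rows and compute the out-of-sample RMSE:
pred = res.predict(test)
rmse = float(np.sqrt(np.mean((test["y"] - pred) ** 2)))
print(rmse)
```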

  5. TBA - Endogeneity, IV;