Tasks

Exercise Set 1: Exploratory Data Analysis (EDA)

Visually examine the data:

Draw scatter plots of the dependent variable \(Y\) against each explanatory variable \(X_1, \dots, X_k\). Which variables appear to be related to \(Y\)? Are there any pairs of explanatory variables \((X_i, X_j)\) that seem to be linearly dependent on one another?
Look at the descriptive statistics of the dependent and explanatory variables - are there any missing data? Are there any categorical or indicator variables? If there are categorical variables - which group would be the base/reference group for each of them?
Think a bit more about the data itself - what are the variable definitions? What effect would you expect each variable to have on the dependent variable \(Y\)? (In your opinion, is the effect likely to be positive or negative, or is it difficult to determine?)
Examine the histograms of your dependent (\(Y\)) and independent (\(X\)) variables. Are there any variables that appear to have a non-normal distribution? If there are - can a variable transformation (e.g. taking logarithms)¹ be applied to those variables?

Look at the scatterplots once more - are there any non-linear relationships between different variables? If there are, think of some ways to account for them - not just variable transformation, but including interaction and polynomial terms as well.
Split your dataset into a training set (\(\sim 80\%\) of the data) and a testing set (the remaining \(\sim 20\%\) of the data).

Unless stated otherwise, use the training set for Exercise Sets 1 through 4. The test set is used for out-of-sample validation and to compare model predictions with actual outcomes on a new set of data.
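The split and a first look at the explanatory variables could be sketched as follows. This is only an illustration under stated assumptions: the data are simulated, and the helper `train_test_split_indices`, the seed values and the variable names are hypothetical, not part of the exercise.

```python
import numpy as np

def train_test_split_indices(n_obs, test_share=0.2, seed=42):
    """Return (train_idx, test_idx) after shuffling the row positions."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_obs)
    n_test = int(round(n_obs * test_share))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data: 200 observations, two explanatory variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200)

# Count missing values per explanatory variable before doing anything else.
n_missing = np.isnan(X).sum(axis=0)

train_idx, test_idx = train_test_split_indices(len(y))
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Pairwise correlations hint at linear dependence between explanatory variables.
corr = np.corrcoef(X_train, rowvar=False)
print("missing per column:", n_missing)
print("train/test sizes:", len(train_idx), len(test_idx))
print(corr.round(2))
```

Shuffling before splitting matters when the rows are stored in some meaningful order (for example, sorted by date or by group).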

Exercise Set 2: Regression specification (Modelling)

Specify one regression model in mathematical formula notation, based on economic theory. Do not include polynomial or interaction terms. What signs do you expect the coefficients \(\beta_1,\ \beta_2,\ \cdots\) to have? Explain.²

Estimate the regression via OLS. Are the signs of \(\beta_1,\ \beta_2,\ \cdots\) the same as you expected? If not - can you think of any situations that might cause the different signs?³
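A minimal sketch of the estimation step, assuming simulated data and NumPy only (a statistics package would normally be used in practice). The data-generating process, the variable names and the `expected_signs` dictionary are hypothetical and serve only to show how expected and estimated signs can be compared.

```python
import numpy as np

# Hypothetical data-generating process (not from any real dataset):
# y = 1 + 2*x1 - 0.5*x2 + noise
rng = np.random.default_rng(1)
n = 160
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), x1, x2])

# OLS estimate: the coefficient vector that minimises the sum of squared residuals.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Prior expectations, written down before looking at the estimates.
expected_signs = {"x1": 1, "x2": -1}
for name, b in zip(["x1", "x2"], beta_hat[1:]):
    agrees = np.sign(b) == expected_signs[name]
    print(f"{name}: estimate={b:.3f}, sign as expected: {agrees}")
```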

Exercise Set 3: Hypothesis testing

Test which variables are statistically significant. Remove the insignificant variables (keep the initially estimated model in a separate variable as a backup).
Using the model with OLS estimates, check if any variables are collinear in your model. If so, try to account for multicollinearity in some way.
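Both checks can be sketched with NumPy alone. The functions `ols_t_statistics` and `vif` below are hypothetical helpers written for this illustration, and the data are simulated so that one regressor is irrelevant and two are nearly collinear. As rough rules of thumb (not strict cut-offs), \(|t|\) above about 1.96 indicates significance at the 5% level in large samples, and a variance inflation factor above 5 or 10 is often taken as a sign of problematic collinearity.

```python
import numpy as np

def ols_t_statistics(X, y):
    """Return (beta_hat, standard errors, t-statistics) for y = X beta + e."""
    n, k = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    sigma2 = resid @ resid / (n - k)            # unbiased error-variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)       # Var(beta_hat) under homoskedasticity
    se = np.sqrt(np.diag(cov))
    return beta_hat, se, beta_hat / se

def vif(X_no_const):
    """VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j on the others."""
    n, k = X_no_const.shape
    out = np.empty(k)
    for j in range(k):
        target = X_no_const[:, j]
        others = np.column_stack([np.ones(n), np.delete(X_no_const, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Hypothetical data: x3 is irrelevant for y, and x2 is nearly collinear with x1.
rng = np.random.default_rng(2)
n = 160
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

regressors = np.column_stack([x1, x2, x3])
X = np.column_stack([np.ones(n), regressors])
beta_hat, se, t_stats = ols_t_statistics(X, y)
vifs = vif(regressors)
print("t-statistics:", t_stats.round(2))
print("VIF:", vifs.round(1))
```

Note that collinearity inflates the standard errors of the affected coefficients, so a variable can look insignificant only because it is collinear with another one. It is therefore worth checking the VIFs before dropping variables.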

Exercise Set 4: Updating and re-fitting the model

Add polynomial and interaction variables⁴ to your model. Explain what signs you expect the interaction terms to have. Then, estimate the model and check if the coefficients are significant. Remove any insignificant interaction/polynomial terms.

If you are adding an interaction term with a variable that you have previously removed as insignificant - add both the interaction and that discarded variable to the model.

Write down the estimated regression equation of your final model from the previous task.
Provide an interpretation of a few of the explanatory variables from your model - at least one explanatory variable, which has polynomial or interaction effects, at least one continuous explanatory variable and at least one indicator variable (if there are any).
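As a generic illustration of such interpretations (the specification and symbols below are hypothetical and not tied to any particular dataset), consider a fitted model with one continuous variable \(X_1\), its square, an indicator variable \(D\) and their interaction:

\[
\widehat{Y} = \widehat{\beta}_0 + \widehat{\beta}_1 X_1 + \widehat{\beta}_2 X_1^2 + \widehat{\beta}_3 D + \widehat{\beta}_4 (X_1 \times D)
\]

The effect of a unit increase in \(X_1\) is then no longer a single coefficient. It depends on the current level of \(X_1\) and on the group:

\[
\frac{\partial \widehat{Y}}{\partial X_1} = \widehat{\beta}_1 + 2 \widehat{\beta}_2 X_1 + \widehat{\beta}_4 D
\]

Similarly, the difference between the two groups at a given value of \(X_1\) is

\[
\widehat{Y}\big|_{D = 1} - \widehat{Y}\big|_{D = 0} = \widehat{\beta}_3 + \widehat{\beta}_4 X_1
\]

so \(\widehat{\beta}_3\) alone describes the group difference only at \(X_1 = 0\).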

Exercise Set 5: Prediction

Calculate the predicted values for the testing set.
Using only the explanatory variables that were included in your final model - take the mean of the continuous explanatory variables, and the median of indicator variables. Calculate the predicted value based on these explanatory variables.
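Both predictions can be sketched as follows, assuming simulated data with one continuous variable and one indicator. The names and the choice to compute the mean and median on the training set are assumptions made for this illustration, since the task does not say which sample to use.

```python
import numpy as np

# Hypothetical data: one continuous variable (x) and one 0/1 indicator (d).
rng = np.random.default_rng(3)
n = 200
x = rng.normal(loc=10.0, scale=2.0, size=n)
d = rng.integers(0, 2, size=n)
y = 3.0 + 0.8 * x + 1.5 * d + rng.normal(size=n)

# Simple 80/20 split by position (the simulated rows are already in random order).
n_train = n * 4 // 5
X = np.column_stack([np.ones(n), x, d])
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

# Fit on the training set only.
beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# 1) Predicted values for every observation in the testing set.
y_test_pred = X_test @ beta_hat

# 2) Prediction for a "representative" observation:
#    mean of the continuous variable, median of the indicator.
#    (The median can be 0.5 if the two groups are exactly the same size.)
x_typical = np.array([1.0, X_train[:, 1].mean(), np.median(X_train[:, 2])])
y_typical_pred = float(x_typical @ beta_hat)

print(y_test_pred[:5].round(2))
print(round(y_typical_pred, 2))
```

Using the median rather than the mean for an indicator keeps the representative observation in one of the actual groups in most samples, instead of describing a fractional group membership.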

Exercise Set 6: Model adequacy

Take a look at residual diagnostics:
- are the residuals autocorrelated? Write down the null hypothesis and carry out the appropriate test.
- are the residuals heteroskedastic? Write down the null hypothesis and carry out the appropriate test.
- are the residuals normally distributed? Write down the null hypothesis and carry out the appropriate test.
If the residuals exhibit autocorrelation or heteroskedasticity - correct the standard errors of the model coefficients. Does the statistical significance of any coefficient change after this correction?
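The diagnostics and the standard-error correction can be sketched with NumPy alone. The particular statistics are assumptions of this illustration, since the tasks above do not name specific tests: Durbin-Watson for autocorrelation, the studentised (Koenker) form of the Breusch-Pagan LM statistic for heteroskedasticity, Jarque-Bera for normality, and White (HC0) standard errors as the correction. The data are simulated with an error variance that grows with `x`. HC0 standard errors address heteroskedasticity only; autocorrelated residuals would call for a HAC (e.g. Newey-West) correction instead.

```python
import numpy as np

# Hypothetical data: the error standard deviation is proportional to x.
rng = np.random.default_rng(4)
n = 500
x = rng.uniform(1.0, 5.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n) * 0.5 * x

X = np.column_stack([np.ones(n), x])
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat                     # OLS residuals

# Durbin-Watson statistic (H0: no first-order autocorrelation).
# Values near 2 are consistent with H0.
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Breusch-Pagan LM statistic, studentised form (H0: homoskedasticity):
# n * R^2 from regressing the squared residuals on X.
# Under H0 it is approximately chi-square with k - 1 degrees of freedom
# (5% critical value for 1 degree of freedom: about 3.84).
e2 = e ** 2
g = np.linalg.lstsq(X, e2, rcond=None)[0]
r2_aux = 1.0 - np.sum((e2 - X @ g) ** 2) / np.sum((e2 - e2.mean()) ** 2)
bp_lm = n * r2_aux

# Jarque-Bera statistic (H0: normally distributed residuals).
# Under H0 it is approximately chi-square with 2 degrees of freedom
# (5% critical value: about 5.99).
s = e / np.sqrt(np.mean(e ** 2))
skew, kurt = np.mean(s ** 3), np.mean(s ** 4)
jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

# Classical vs. heteroskedasticity-robust (White, HC0) standard errors.
se_classic = np.sqrt(np.diag(e @ e / (n - k) * XtX_inv))
meat = X.T @ (X * e2[:, None])           # sum of e_i^2 * x_i x_i'
se_hc0 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(f"DW={dw:.2f}  BP LM={bp_lm:.1f}  JB={jb:.1f}")
print("classical SE:", se_classic.round(4))
print("HC0 SE:      ", se_hc0.round(4))
```

The coefficient estimates themselves are unchanged by the correction. Only the standard errors, and therefore the t-statistics and p-values, differ.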
Compare your final model with the initial model from Exercise Set 3 in terms of \(\rm R^2\), \(\rm R^2_{adj}\), \(\rm AIC\) and \(\rm BIC\).
Examine the \(\rm RMSE\) and \(\rm MAPE\) of your model by comparing the in-sample and out-of-sample predictions. How well does your estimated model predict on new data?
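The two comparisons can be sketched as follows, assuming simulated data in which the "final" model adds a quadratic term to a linear "initial" model. The helpers `rmse`, `mape` and `fit_summary` are hypothetical names for this illustration. \(\rm AIC\) and \(\rm BIC\) are computed from the Gaussian log-likelihood with the parameter count equal to the number of coefficients. Software packages differ in additive constants and in whether the error variance is counted as a parameter, so only differences between models fitted on the same data are meaningful.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mape(actual, predicted):
    """Mean absolute percentage error in percent; undefined if any actual value is 0."""
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100.0)

def fit_summary(X, y):
    """OLS fit plus R^2, adjusted R^2, AIC and BIC under Gaussian errors."""
    n, k = X.shape                       # k counts the intercept column
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    ssr = float(np.sum((y - X @ beta) ** 2))
    sst = float(np.sum((y - y.mean()) ** 2))
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1.0)
    return {
        "beta": beta,
        "r2": 1.0 - ssr / sst,
        "r2_adj": 1.0 - (ssr / (n - k)) / (sst / (n - 1)),
        "aic": -2.0 * loglik + 2.0 * k,
        "bic": -2.0 * loglik + k * np.log(n),
    }

# Hypothetical data with a quadratic effect; y stays well away from zero,
# so that MAPE is well defined.
rng = np.random.default_rng(5)
n = 200
x = rng.uniform(0.0, 4.0, size=n)
y = 20.0 + 1.0 * x + 1.5 * x ** 2 + rng.normal(size=n)
train, test = slice(0, 160), slice(160, n)

X_initial = np.column_stack([np.ones(n), x])          # linear-only model
X_final = np.column_stack([np.ones(n), x, x ** 2])    # adds the quadratic term

for label, X in [("initial", X_initial), ("final", X_final)]:
    fit = fit_summary(X[train], y[train])
    pred_in, pred_out = X[train] @ fit["beta"], X[test] @ fit["beta"]
    print(label,
          f"R2_adj={fit['r2_adj']:.3f} AIC={fit['aic']:.1f} BIC={fit['bic']:.1f}",
          f"RMSE in/out={rmse(y[train], pred_in):.2f}/{rmse(y[test], pred_out):.2f}",
          f"MAPE out={mape(y[test], pred_out):.2f}%")
```

An out-of-sample \(\rm RMSE\) that is much larger than the in-sample one suggests overfitting, while similar values suggest that the model generalises to new data about as well as it fits the training data.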

¹ Remember that \(\log(X)\) is defined only for \(X>0\).
² This is not necessarily the best regression model - it is simply one you think makes (economic) sense.
³ Think of societal, economic, traditional, educational, market-competition, demographic (e.g. aging population) or similar circumstances that may have caused an opposite effect.
⁴ Or polynomial (usually quadratic) terms, if your dataset does not have more than one explanatory variable.