## 5.3 Chapter Exercises

A number of example datasets can be found on the Principles of Econometrics book website (along with their dataset definitions), or in the R datasets collection, which is available in both R and Python.

Of course, you are encouraged to find any other data that may be interesting to you. The data used in this section was chosen for its ease of access.

### 5.3.1 Datasets

Below we list a selection of datasets which you can analyse.

#### 5.3.1.1 Dataset 1

The dataset: transport (data definition).

We want to evaluate whether it is better to commute by automobile or by public transportation.

The dataset can be loaded in both R and Python:

```r
data_source <- "http://www.principlesofeconometrics.com/poe5/data/csv/transport.csv"
dt1 <- read.csv(file = data_source, sep = ",", dec = ".", header = TRUE)
```

```python
import pandas as pd

data_source = "http://www.principlesofeconometrics.com/poe5/data/csv/transport.csv"
dt1 = pd.read_csv(data_source)
```

#### 5.3.1.2 Dataset 2

The dataset: new titanic data (a previous, much smaller version: titanic).

Let’s say that we are interested in estimating whether a passenger will survive based on their age, gender, economic status, and other factors.

The dataset can be loaded in both R and Python:

```r
# Alternative: the base R dataset (a contingency table only):
#dt2 <- datasets::Titanic

# Alternative: via the titanic package:
#install.packages("titanic")
#dt2 <- titanic::titanic_train

data_source <- "https://raw.githubusercontent.com/paulhendricks/titanic/master/inst/data-raw/train.csv"
dt2 <- read.csv(file = data_source, sep = ",", dec = ".", header = TRUE)
```

```python
# Alternative: via statsmodels (the contingency-table version):
#import statsmodels.api as sm
#dt2 = sm.datasets.get_rdataset("Titanic", "datasets")
#dt2 = dt2.data

import pandas as pd

data_source = "https://raw.githubusercontent.com/paulhendricks/titanic/master/inst/data-raw/train.csv"
dt2 = pd.read_csv(data_source)
```

#### 5.3.1.3 Dataset 3

The dataset: coke (data definition).

We want to evaluate whether a customer will choose Coke or Pepsi.

The dataset can be loaded in both R and Python:

```r
data_source <- "http://www.principlesofeconometrics.com/poe5/data/csv/coke.csv"
dt3 <- read.csv(file = data_source, sep = ",", dec = ".", header = TRUE)
```

```python
import pandas as pd

data_source = "http://www.principlesofeconometrics.com/poe5/data/csv/coke.csv"
dt3 = pd.read_csv(data_source)
```

#### 5.3.1.4 Dataset 4

The dataset: default.

The aim is to predict which customers will default on their credit card debt.

The dataset can be loaded in both R and Python:

```r
dt4 <- ISLR::Default
```

```python
import statsmodels.api as sm

dt4 = sm.datasets.get_rdataset("Default", "ISLR")
dt4 = dt4.data
```

Note: it appears that this dataset is artificial, which may lead to unexpected coefficient signs. Nevertheless, it is always interesting to analyse data which challenges your assumptions about variable signs and significance.

#### 5.3.1.5 Dataset 5

The dataset: MROZ (definition) (more data at ).

The aim is to predict whether a woman will decide to return to the labor force (inlf).

The dataset can be loaded in both R and Python:

```r
dt5 <- foreign::read.dta("http://fmwww.bc.edu/ec-p/data/wooldridge/mroz.dta")
dt5 <- data.frame(dt5)
```

```python
import pandas as pd

dt5 = pd.read_stata("http://fmwww.bc.edu/ec-p/data/wooldridge/mroz.dta")
```

The following tasks are universal for all datasets, in order to highlight that in practical applications you will usually need to carry out similar steps and ask (yourself) similar general questions when working with any kind of data.

An example with one of the datasets is provided in section 5.4. As before, some comments are provided within the example to highlight additional possible insights, further questions, and potential difficulties that can be identified from the dataset.

We stress that the tasks and questions are there to give you a set of general steps for the modelling process as a whole (mostly in the logical order you would follow in real-world applications, though some are arranged to follow the chapter ordering). It is not a rule that you must always examine scatterplots of every variable, or always run specific tests when carrying out your analysis. However, these steps may help you support any arguments or insights that you discover during the modelling process.

Below are the tasks that you should carry out for the datasets:

Note: Take $$80\%$$ of the data as the training set and fit your model on this data. Hold the remaining $$20\%$$ of the data as the test set, which you can then use to check the out-of-sample characteristics of your model.
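As a sketch, the split can be carried out along these lines in Python (here `dt` is a small synthetic data frame standing in for whichever dataset you chose; `dt`, `train` and `test` are assumed names):

```python
import numpy as np
import pandas as pd

# A small synthetic data frame standing in for any of dt1..dt5
rng = np.random.default_rng(42)
dt = pd.DataFrame({"x": rng.normal(size=100),
                   "y": rng.integers(0, 2, size=100)})

# Shuffle the row labels, then take the first 80% for training
# and hold out the remaining 20% as the test set
n_train = int(0.8 * len(dt))
idx = rng.permutation(dt.index)
train = dt.loc[idx[:n_train]]
test = dt.loc[idx[n_train:]]
```

Shuffling before splitting matters if the rows are ordered (e.g. by date or by group); for a data frame that is already in random order, a plain head/tail split would also do.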

#### 5.3.2.1 Exercise Set 1

(Reminder: use the training set)

1. Postulate what kind of model(s) you would need to specify to model the dependent variable (there may be more than one):

• A linear regression?
• A logistic regression?
• A probit regression?
• A multinomial logit regression?
• A regression for count data (e.g. Poisson regression)?
2. Examine:

• How do the independent variables relate to the dependent variable and to one another? Do you notice any relationships?
• What variables would you include in your model? (Note: do not include any polynomial or interaction terms just yet)
• What signs do you expect them to have?
3. Estimate one or more models based on your answers to the previous tasks. Are there any insignificant variables? Are the signs as you would expect?

4. Are there any collinear variables? If so, remove the multicollinearity if it is meaningful to do so.

#### 5.3.2.2 Exercise Set 2

(Reminder: use the training set)

1. Include polynomial and/or interaction terms in your model. Explain your motivation for selecting these variables and their signs.

2. Calculate the predicted values for various combinations of values:

• For a continuous variable $$X_{j}$$ - create new, equally spaced values $$\tilde{X}_{1, j} < \tilde{X}_{2, j} < \dots < \tilde{X}_{M, j}$$, with $$\tilde{X}_{1, j} = \min_i (X_{i,j})$$ and $$\tilde{X}_{M, j} = \max_i (X_{i,j})$$. Select an arbitrary value of $$M$$ so that the predicted probability plot is readable.
• For a discrete/categorical variable - select some cases to compare the probabilities. For example two curves: when $$X_{j_1} = 1$$ vs when $$X_{j_1} = 0$$.
• Plot the $$95\%$$ confidence bounds for the predictions of the two cases.

Note:

• select one continuous and one discrete variable and treat the remaining variable values as fixed. In other words, calculate the predicted values with the new value of a single variable, ceteris paribus. You can do this by first selecting a random observation and duplicating it in a new dataset. Then replace the continuous or discrete variable with its new values. Repeat this for the continuous and discrete variable cases.

• remember that confidence intervals are not the same as prediction intervals!
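The steps above can be sketched in Python for a logistic regression. The data, the model `mdl`, and the variable names `x` (continuous) and `d` (discrete) are synthetic stand-ins; the $$95\%$$ confidence bounds are built on the linear predictor and then transformed through the logistic function, which keeps them inside $$(0, 1)$$:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for a model fitted on the training set
rng = np.random.default_rng(2)
n = 400
x = rng.normal(size=n)
d = rng.integers(0, 2, size=n)
p = 1 / (1 + np.exp(-(-0.2 + 1.2 * x + 0.7 * d)))
train = pd.DataFrame({"y": rng.binomial(1, p), "x": x, "d": d})
mdl = smf.logit("y ~ x + d", data=train).fit(disp=0)

# Equally spaced grid from min(x) to max(x); M chosen for readability
M = 50
x_grid = np.linspace(train["x"].min(), train["x"].max(), M)

# Duplicate one (random) observation M times, then overwrite the
# variable of interest, ceteris paribus; two curves: d = 0 vs d = 1
base = train.sample(1, random_state=0)
curves = {}
for d_val in (0, 1):
    new = pd.concat([base] * M, ignore_index=True)
    new["x"] = x_grid
    new["d"] = d_val
    # 95% CI on the linear predictor, transformed to probabilities
    X = np.column_stack([np.ones(M), new["x"], new["d"]])
    eta = X @ mdl.params.values
    se = np.sqrt(np.einsum("ij,jk,ik->i", X, mdl.cov_params().values, X))
    curves[d_val] = pd.DataFrame({
        "prob":  1 / (1 + np.exp(-eta)),
        "lower": 1 / (1 + np.exp(-(eta - 1.96 * se))),
        "upper": 1 / (1 + np.exp(-(eta + 1.96 * se))),
    })
```

Each entry of `curves` can then be plotted against `x_grid`, with the `lower`/`upper` columns as the confidence bounds around the predicted probability curve.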

3. If you are estimating the probability, select the cutoff prediction probability depending on the confusion matrix results:
• use the default $$0.5$$ cutoff value;
• try to select an alternative (hopefully optimal) cutoff value;
• do the default and the optimal cutoff probability values differ?
4. Examine the ROC curve.
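A sketch of the cutoff selection and the ROC ingredients (again on synthetic data, with `mdl`, `train`, `probs` as assumed names; here the "optimal" cutoff simply maximizes in-sample accuracy, though other criteria, e.g. Youden's $$J = TPR - FPR$$, are also common):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the fitted model and training set
rng = np.random.default_rng(3)
n = 600
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.3 + 1.5 * x)))
train = pd.DataFrame({"y": rng.binomial(1, p), "x": x})
mdl = smf.logit("y ~ x", data=train).fit(disp=0)
probs = mdl.predict(train)

def confusion(y, probs, cutoff):
    """Confusion matrix for a given classification cutoff."""
    pred = (probs >= cutoff).astype(int)
    return pd.crosstab(y, pred, rownames=["actual"], colnames=["predicted"])

print(confusion(train["y"], probs, 0.5))

# Scan a grid of cutoffs; keep the one maximizing accuracy
cutoffs = np.linspace(0.01, 0.99, 99)
acc = [((probs >= c).astype(int) == train["y"]).mean() for c in cutoffs]
best_cutoff = cutoffs[int(np.argmax(acc))]

# ROC curve ingredients: TPR vs FPR across all cutoffs
pos = (train["y"] == 1).sum()
neg = (train["y"] == 0).sum()
tpr = [((probs >= c) & (train["y"] == 1)).sum() / pos for c in cutoffs]
fpr = [((probs >= c) & (train["y"] == 0)).sum() / neg for c in cutoffs]
```

Plotting `fpr` against `tpr` gives the ROC curve; a model no better than chance traces the diagonal.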

#### 5.3.2.3 Exercise Set 3

1. Provide an interpretation for a few (not necessarily all) variables included in your model. Hint: it may be worthwhile to examine the interpretation of the parameters themselves, as well as the partial effects.

2. Write down the fitted model.

3. Can you check some linear restrictions?

4. Compare the model results between the training and test sets - is your model adequate for this new data?
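Tasks 3 and 4 can be sketched with a Wald test of a linear restriction and a train/test accuracy comparison (synthetic data; `mdl`, `train`, `test`, and the restriction $$\beta_{x_1} = \beta_{x_2}$$ are illustrative assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the fitted model and the 80/20 split
rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.4 + 1.0 * x1 + 1.0 * x2)))
dt = pd.DataFrame({"y": rng.binomial(1, p), "x1": x1, "x2": x2})
train, test = dt.iloc[:400], dt.iloc[400:]
mdl = smf.logit("y ~ x1 + x2", data=train).fit(disp=0)

# Wald test of the linear restriction H0: beta_x1 = beta_x2
wald = mdl.wald_test("x1 = x2")
print(wald.pvalue)

# Out-of-sample adequacy check: accuracy at the 0.5 cutoff,
# training set vs the held-out test set
acc_train = ((mdl.predict(train) >= 0.5).astype(int) == train["y"]).mean()
acc_test = ((mdl.predict(test) >= 0.5).astype(int) == test["y"]).mean()
print(acc_train, acc_test)
```

A large gap between `acc_train` and `acc_test` suggests the model does not generalize well to the new data; comparable values support its adequacy.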