We will illustrate some of the methods outlined during the lectures. Note that we will update the tasks or data as necessary for new methods introduced during upcoming lectures. In some cases, we may need to present an additional dataset to carry out some examples.
A number of example datasets, together with their definitions, can be found on the Principles of Econometrics book website, or among the R datasets, which are available in both R and Python.
Data Overview
Use the following datasets:
- transport (data definition). We want to evaluate whether it is better to travel by automobile (auto) or via public transportation. The dataset can be loaded in both R and Python:
# R
dt1 <- read.csv(file = "http://www.principlesofeconometrics.com/poe5/data/csv/transport.csv", sep = ",", dec = ".", header = TRUE)
# Python
import pandas as pd
dt1 = pd.read_csv("http://www.principlesofeconometrics.com/poe5/data/csv/transport.csv")
- new titanic data (previous, much smaller version: titanic). Let’s say that we are interested in estimating whether a passenger will Survive based on their age, gender, economic status and other factors. The dataset can be loaded in both R and Python:
# R
# Alternative sources (commented out):
#dt2 <- datasets::Titanic
#install.packages("titanic")
#dt2 <- titanic::titanic_train
dt2 <- read.csv(file = "https://raw.githubusercontent.com/paulhendricks/titanic/master/inst/data-raw/train.csv", sep = ",", dec = ".", header = TRUE)
# Python
# Alternative source (commented out):
#import statsmodels.api as sm
#dt2 = sm.datasets.get_rdataset("Titanic", "datasets")
#print(dt2.__doc__) #documentation about the data
#dt2 = dt2.data
import pandas as pd
dt2 = pd.read_csv("https://raw.githubusercontent.com/paulhendricks/titanic/master/inst/data-raw/train.csv")
- coke (data definition). We want to evaluate whether a customer will choose coke or pepsi.
# R
dt3 <- read.csv(file = "http://www.principlesofeconometrics.com/poe5/data/csv/coke.csv", sep = ",", dec = ".", header = TRUE)
# Python
import pandas as pd
dt3 = pd.read_csv("http://www.principlesofeconometrics.com/poe5/data/csv/coke.csv")
- default. The aim is to predict which customers will default on their credit card debt.
# Python
import statsmodels.api as sm
dt4 = sm.datasets.get_rdataset("Default", "ISLR")
#print(dt4.__doc__) #documentation about the data
dt4 = dt4.data
- MROZ (definition) (more data at ). The aim is to predict whether a woman will decide to return to the labor force, inlf.
# R
dt5 <- foreign::read.dta("http://fmwww.bc.edu/ec-p/data/wooldridge/mroz.dta")
dt5 <- data.frame(dt5)
# Python
import pandas as pd
dt5 = pd.read_stata("http://fmwww.bc.edu/ec-p/data/wooldridge/mroz.dta")
#print(dt5.head())
Tasks
Below are the tasks that you should carry out for the datasets:
(2018-12-13)
- Split the data into \(80\%\) training and \(20\%\) test subsets.
- Postulate what kind of model(s) you would need to specify to model the dependent variable (there may be more than one):
  - A linear regression;
  - A logistic regression;
  - A probit regression;
  - A multinomial logit regression;
  - A regression for count data (e.g. Poisson regression).
- Examine how the independent variables relate to the dependent variable and to one another. Do you notice any relationships? Which variables would you include in your model? What signs do you expect them to have? (Note: do not include any polynomial or interaction terms just yet.)
- Estimate one or more models based on your answers to the two preceding tasks. Are there any insignificant variables?
- Are there any collinear variables? If so, remove the multicollinearity if it is meaningful to do so.
- Include polynomial and/or interaction terms in your model. Explain your motivation for selecting these variables and their signs.
- Calculate the predicted values for various combinations of values:
  - For continuous data \(X_{i}\), create new values \(\tilde{X}_{i, j} < \tilde{X}_{i, j+1}\), \(j = 1,...,M-1\), where \(\tilde{X}_{i, 1} = \min (X_{i})\) and \(\tilde{X}_{i, M} = \max (X_{i})\). Select an arbitrary value \(M\) so that the predicted probability plot is readable.
  - For discrete data, select some cases to compare the probabilities, for example two curves: when \(X_{j_1} = 1\) vs. when \(X_{j_1} = 0\).
- Plot the \(95\%\) confidence bounds for the predictions for two cases.
- If estimating the probability, select the cutoff prediction probability depending on the confusion matrix results:
  - use the default 0.5 cutoff value;
  - try to select an alternative (hopefully optimal) cutoff value.
- Examine the ROC curve.
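The splitting, estimation and prediction tasks above can be sketched on simulated data. This is only an illustration under stated assumptions: the data, the variable names (`x_cont`, `x_bin`) and the helper functions `fit_logit` and `predict_with_bounds` are hypothetical, and the small Newton-Raphson routine stands in for whatever library routine you would normally use to estimate a logit model. The confidence bounds are built on the linear predictor scale and then transformed, which is one common choice among several.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated stand-in data (assumption, not one of the course datasets):
# one continuous regressor, one 0/1 regressor and a binary outcome.
n = 1000
x_cont = rng.uniform(20, 60, size=n)
x_bin = rng.integers(0, 2, size=n)
eta_true = -4.0 + 0.08 * x_cont + 1.0 * x_bin
y = rng.binomial(1, 1 / (1 + np.exp(-eta_true)))
X = np.column_stack([np.ones(n), x_cont, x_bin])

# 80% training / 20% test split on shuffled row indices
idx = rng.permutation(n)
n_train = int(0.8 * n)
train, test = idx[:n_train], idx[n_train:]


def logistic(z):
    return 1 / (1 + np.exp(-z))


def fit_logit(X, y, n_iter=25):
    """Logit maximum likelihood via Newton-Raphson; returns (beta, cov)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = logistic(X @ beta)
        info = X.T @ (X * (p * (1 - p))[:, None])  # information matrix
        beta = beta + np.linalg.solve(info, X.T @ (y - p))
    p = logistic(X @ beta)
    info = X.T @ (X * (p * (1 - p))[:, None])
    return beta, np.linalg.inv(info)


beta, cov = fit_logit(X[train], y[train])

# Grid of M increasing values from the minimum to the maximum of x_cont
M = 50
grid = np.linspace(x_cont[train].min(), x_cont[train].max(), M)


def predict_with_bounds(x_bin_value):
    """Predicted probability and 95% bounds along the grid, x_bin held fixed."""
    X_new = np.column_stack([np.ones(M), grid, np.full(M, x_bin_value)])
    eta = X_new @ beta
    se = np.sqrt(np.einsum("ij,jk,ik->i", X_new, cov, X_new))
    return logistic(eta), logistic(eta - 1.96 * se), logistic(eta + 1.96 * se)


p0, lo0, hi0 = predict_with_bounds(0)  # curve for x_bin = 0
p1, lo1, hi1 = predict_with_bounds(1)  # curve for x_bin = 1
```

Plotting `p0` and `p1` against `grid`, each with its lower and upper bound, gives the two curves asked for in the discrete-data case.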
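For the cutoff and ROC tasks, the following is a minimal sketch that assumes you already have the observed 0/1 outcomes and the predicted probabilities from some model. Here both are simulated, and the names `y_true`, `p_hat`, `confusion_matrix` and `rates` are hypothetical. The alternative cutoff is chosen by maximising Youden's J (true positive rate minus false positive rate), which is only one of several possible criteria.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data (assumption): observed outcomes and noisy predicted probabilities
n = 500
y_true = rng.integers(0, 2, size=n)
p_hat = np.clip(0.35 + 0.3 * y_true + rng.normal(0, 0.2, size=n), 0, 1)


def confusion_matrix(y, p, cutoff):
    """2x2 table: rows = actual (0, 1), columns = predicted (0, 1)."""
    y_pred = (p >= cutoff).astype(int)
    return np.array([[np.sum((y == a) & (y_pred == b)) for b in (0, 1)]
                     for a in (0, 1)])


def rates(y, p, cutoff):
    """True positive rate and false positive rate at a given cutoff."""
    (tn, fp), (fn, tp) = confusion_matrix(y, p, cutoff)
    return tp / (tp + fn), fp / (fp + tn)


# Confusion matrix at the default cutoff
cm_default = confusion_matrix(y_true, p_hat, 0.5)

# Alternative cutoff: the one that maximises Youden's J = TPR - FPR
cutoffs = np.linspace(0.01, 0.99, 99)
j_stat = np.array([tpr - fpr for tpr, fpr in
                   (rates(y_true, p_hat, c) for c in cutoffs)])
best_cutoff = cutoffs[np.argmax(j_stat)]

# ROC curve: sweep the cutoff from "predict nobody" down to "predict everybody"
roc_cutoffs = np.concatenate([[np.inf], np.unique(p_hat)[::-1]])
roc = np.array([rates(y_true, p_hat, c) for c in roc_cutoffs])
tpr, fpr = roc[:, 0], roc[:, 1]

# Area under the ROC curve by the trapezoidal rule
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
```

Plotting `tpr` against `fpr` gives the ROC curve. An `auc` close to 0.5 would indicate predictions no better than chance, while values near 1 indicate good separation of the two outcomes.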
(2018-12-21)
- Provide an interpretation for a few variables included in your model.
- Write down the fitted model.
- Check some arbitrary linear restrictions.
- Examine the model residuals.
- Compare the model results between the training and test sets - is your model adequate for this data?
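Checking a single linear restriction can be sketched as a Wald-type test. The coefficient vector `beta_hat` and covariance matrix `cov_hat` below are hypothetical numbers, not estimates from any of the datasets above, and the restriction itself (ten times the second coefficient equals the third) is arbitrary. In practice, you would take the estimates and their covariance matrix from your fitted model.

```python
import math
import numpy as np

# Hypothetical estimates (assumption): intercept, continuous and binary regressors
beta_hat = np.array([-4.0, 0.08, 1.0])
cov_hat = np.array([[0.25, -0.004, -0.02],
                    [-0.004, 0.0001, 0.0],
                    [-0.02, 0.0, 0.04]])

# Restriction H0: R @ beta = r, here 10 * beta_1 - beta_2 = 0
R = np.array([0.0, 10.0, -1.0])
r = 0.0

estimate = R @ beta_hat - r          # how far the estimates are from H0
std_err = math.sqrt(R @ cov_hat @ R)  # standard error of R @ beta_hat
z_stat = estimate / std_err

# Two-sided p-value from the standard normal distribution:
# 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z_stat) / math.sqrt(2))

print(f"z = {z_stat:.3f}, p-value = {p_value:.3f}")
```

A large p-value means the data do not contradict the restriction at the usual significance levels. Several restrictions tested jointly would need the chi-squared form of the Wald statistic instead.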