Datasets

A number of example datasets can be found on the Principles of Econometrics book website, along with their dataset definitions, or among the R datasets, which are available in both R and Python.

Of course, you are encouraged to find any other data that interests you. The data used in this section was chosen for its ease of access.

The datasets can be loaded in both R and Python.

Dataset 1: Choice of transportation

The dataset: transport (data definition).

We want to evaluate whether someone will choose to travel by bus or by car.

R

data_source <- "http://www.principlesofeconometrics.com/poe5/data/csv/transport.csv"
dt1 <- read.csv(file = data_source, sep = ",", dec = ".", header = TRUE)

Python

import pandas as pd
data_source = "http://www.principlesofeconometrics.com/poe5/data/csv/transport.csv"
dt1 = pd.read_csv(data_source)
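
After loading a file, it is worth checking that it was parsed as expected. The following is a minimal sketch of such a check. It uses a small in-memory CSV as a stand-in for a downloaded file; the column names `choice` and `travel_time` and all of the values are invented for illustration and are not taken from transport.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for a downloaded CSV file; the column names
# and values are invented for illustration only.
sample_csv = io.StringIO(
    "choice,travel_time\n"
    "1,35.5\n"
    "0,52.0\n"
    "1,20.0\n"
    "0,\n"
)

# pd.read_csv accepts a URL, a local path or a file-like object.
df = pd.read_csv(sample_csv)

n_rows, n_cols = df.shape      # dimensions of the data
missing = df.isna().sum()      # number of missing values per column
share = df["choice"].mean()    # share of observations with choice == 1

print(df.dtypes)
print(missing)
print(f"rows: {n_rows}, columns: {n_cols}, share of ones: {share:.2f}")
```

The same checks can be applied to `dt1` once it has been loaded, replacing `choice` with the name of the outcome column given in the data definition.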

Dataset 2: Titanic survivor data

The dataset: new titanic data (previous, much smaller version: titanic).

Suppose that we are interested in estimating whether a passenger will survive¹, based on their age, gender, economic status and other factors.

R

data_source <- "https://raw.githubusercontent.com/paulhendricks/titanic/master/inst/data-raw/train.csv"
dt2 <- read.csv(file = data_source, sep = ",", dec = ".", header = TRUE)

Python

import pandas as pd
data_source = "https://raw.githubusercontent.com/paulhendricks/titanic/master/inst/data-raw/train.csv"
dt2 = pd.read_csv(data_source)
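
Before estimating anything, it can help to look at how the outcome varies across groups. The sketch below uses a tiny hand-made data frame rather than the real file. The column names `Survived` and `Sex` are assumed to match those in train.csv, and the values are invented.

```python
import pandas as pd

# Toy data for illustration only; these are NOT rows from the Titanic file.
# The column names are an assumption about how train.csv is laid out.
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1],
    "Sex": ["female", "male", "female", "male", "male", "male"],
})

# The mean of a 0/1 outcome within each group is the share of ones
# in that group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

print(rate_by_sex)
```

If the assumed column names hold, replacing `toy` with `dt2` gives the same summary for the actual data.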

Dataset 3: Coke or Pepsi?

The dataset: coke (data definition).

We want to evaluate whether a customer will choose Coke or Pepsi.

R

data_source <- "http://www.principlesofeconometrics.com/poe5/data/csv/coke.csv"
dt3 <- read.csv(file = data_source, sep = ",", dec = ".", header = TRUE)

Python

import pandas as pd
data_source = "http://www.principlesofeconometrics.com/poe5/data/csv/coke.csv"
dt3 = pd.read_csv(data_source)
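
The CSV files above are downloaded again every time the code is run. One possible way to avoid this is to keep a local copy. The following is a minimal sketch using only the Python standard library; the function name `fetch_cached` and the default directory `data_cache` are hypothetical choices, not part of any library.

```python
import urllib.request
from pathlib import Path


def fetch_cached(url: str, cache_dir: str = "data_cache") -> Path:
    """Download `url` into `cache_dir` once and return the local path.

    Hypothetical helper: later calls with the same URL reuse the saved
    file instead of downloading it again.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)

    # Use the last part of the URL as the local file name.
    local_path = cache / url.rsplit("/", 1)[-1]

    if not local_path.exists():
        with urllib.request.urlopen(url) as response:
            local_path.write_bytes(response.read())

    return local_path
```

With this helper, the data could be loaded as `dt3 = pd.read_csv(fetch_cached(data_source))`. Note that the cached copy is never refreshed, so the file has to be deleted by hand if the source data changes.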

Dataset 4: Defaulting on debt

The dataset: credit card default.

The aim is to predict which customers will default on their credit card debt.

R

dt4 <- ISLR::Default

Python

import statsmodels.api as sm
dt4 = sm.datasets.get_rdataset("Default", "ISLR")
# print(dt4.__doc__)  # documentation about the data
dt4 = dt4.data

Dataset 5: U.S. Women’s Labor-Force Participation

The dataset: MROZ (definition) (more data).

R

dt5 <- foreign::read.dta("http://fmwww.bc.edu/ec-p/data/wooldridge/mroz.dta")
dt5 <- data.frame(dt5)

Python

import pandas as pd
dt5 = pd.read_stata("http://fmwww.bc.edu/ec-p/data/wooldridge/mroz.dta")

¹ Or, in this case, could have survived.