Datasets

A number of example datasets can be found on the Principles of Econometrics book website, along with their data definitions, or among the R datasets, which are available in both R and Python.

Of course, you are encouraged to look for any other data that interests you. The datasets used in this section were chosen for their ease of access.

The datasets can be loaded in both R and Python.

Dataset 1: Food & income

The dataset: food (data definition).

Let’s say that we are interested in estimating how the expenditure on food, food_exp (\(Y\)), depends on the income variable and its polynomial transformations.

R:

data_source <- "http://www.principlesofeconometrics.com/poe5/data/csv/food.csv"
dt1 <- read.csv(file = data_source, sep = ",", dec = ".", header = TRUE)

Python:

import pandas as pd

data_source = "http://www.principlesofeconometrics.com/poe5/data/csv/food.csv"
dt1 = pd.read_csv(data_source)
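
The polynomial transformations of income mentioned above can be created as extra columns before any model is fitted. The following is only a minimal sketch: the helper `add_polynomial_terms`, the `_pow` column suffix, and the column name `income` are assumptions made for illustration, and the toy values are made up rather than taken from food.csv.

```python
import pandas as pd


def add_polynomial_terms(df, column, max_degree=2):
    """Return a copy of df with column^2, ..., column^max_degree appended.

    Hypothetical helper, not part of pandas or of the original text.
    """
    out = df.copy()
    for degree in range(2, max_degree + 1):
        out[f"{column}_pow{degree}"] = out[column] ** degree
    return out


# Made-up illustrative values, NOT rows from food.csv
toy = pd.DataFrame({
    "food_exp": [100.0, 150.0, 210.0],
    "income": [2.0, 3.0, 4.0],
})

toy_poly = add_polynomial_terms(toy, "income", max_degree=3)
print(toy_poly)
```

With the real data, the same call would be applied to dt1 instead of the toy data frame, provided that the income column is indeed named income.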

Dataset 2: Nuclear Power Station Construction Data

Let’s say that we are interested in estimating how the cost (\(Y\)) depends on (some of) the remaining explanatory variables.

R:

dt2 <- boot::nuclear

Python:

import statsmodels.api as sm

dt2 = sm.datasets.get_rdataset("nuclear", "boot")
# print(dt2.__doc__)  # documentation about the data
dt2 = dt2.data

Dataset 3: Home sales

The dataset: stockton5_small (data definition) contains data on houses sold in Stockton, California in 1996-1998.

Assume that we are interested in how the sale price, sprice (\(Y\)), is affected by (some of) the remaining explanatory variables.

R:

data_source <- "http://www.principlesofeconometrics.com/poe5/data/csv/stockton5_small.csv"
dt3 <- read.csv(file = data_source, sep = ",", dec = ".", header = TRUE)

Python:

import pandas as pd

data_source = "http://www.principlesofeconometrics.com/poe5/data/csv/stockton5_small.csv"
dt3 = pd.read_csv(data_source)

Dataset 4: 2013 Current Population Survey data

The dataset: cps5_small (data definition) contains data on hourly wage rates, education, etc. from the 2013 Current Population Survey.

Suppose we are interested in examining which of the various explanatory variables affect wage (\(Y\)).

R:

data_source <- "http://www.principlesofeconometrics.com/poe5/data/csv/cps5_small.csv"
dt4 <- read.csv(file = data_source, sep = ",", dec = ".", header = TRUE)

Python:

import pandas as pd

data_source = "http://www.principlesofeconometrics.com/poe5/data/csv/cps5_small.csv"
dt4 = pd.read_csv(data_source)

Dataset 5: Canned tuna sales

The dataset: tuna (data definition) contains weekly data (we will ignore the time dimension for now) on the number of cans of brand 1 tuna sold (sal1).

Consider examining how the ratio of brand 1 tuna prices, apr1, to brand 3 tuna prices, apr3, affects sal1, measured in thousands of units. In order to do this, you will need to:

Firstly, scale sal1 so that it measures sales in thousands (instead of single units).
Secondly, calculate the ratio as \(\rm price\_ratio=100\cdot(apr1/apr3)\). This ratio expresses the price of brand 1 tuna as a percentage of the price of brand 3 tuna. When \(\rm price\_ratio>100\), brand 1 tuna is more expensive than brand 3; when \(\rm price\_ratio<100\), it is less expensive. For example:
- if the ratio equals \(100\), then the price of both brands is the same;
- if it is equal to \(90\), then brand 1 is cheaper by \(10\%\) than brand 3;
- if it is equal to \(110\), then brand 1 is \(10\%\) more expensive than brand 3.
Finally, estimate how the price ratio affects the sales numbers of brand 1.

R:

data_source <- "http://www.principlesofeconometrics.com/poe5/data/csv/tuna.csv"
dt5 <- read.csv(file = data_source, sep = ",", dec = ".", header = TRUE)

Python:

import pandas as pd

data_source = "http://www.principlesofeconometrics.com/poe5/data/csv/tuna.csv"
dt5 = pd.read_csv(data_source)
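
The first two preparation steps listed above (rescaling sal1 and computing the price ratio) could be sketched in Python as follows. The helper name `prepare_tuna` is hypothetical, and the toy data frame contains made-up values chosen to reproduce the ratios 90, 100 and 110 from the example; it is not an excerpt from tuna.csv. The estimation step itself is not shown.

```python
import pandas as pd


def prepare_tuna(df):
    """Return a copy of df with sal1 in thousands and a price_ratio column.

    Hypothetical helper; assumes the columns sal1, apr1 and apr3 exist.
    """
    out = df.copy()
    # Step 1: express sales in thousands of cans instead of single cans
    out["sal1"] = out["sal1"] / 1000
    # Step 2: price of brand 1 as a percentage of the price of brand 3
    out["price_ratio"] = 100 * out["apr1"] / out["apr3"]
    return out


# Made-up illustrative values, NOT rows from tuna.csv
toy = pd.DataFrame({
    "sal1": [12000, 8000, 15000],
    "apr1": [0.90, 1.00, 1.10],
    "apr3": [1.00, 1.00, 1.00],
})

toy_prepared = prepare_tuna(toy)
print(toy_prepared)
```

With the real data, `prepare_tuna(dt5)` would return the transformed data frame while leaving dt5 itself unchanged, since the helper works on a copy.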