We will illustrate some of the methods outlined during the lectures. Note that we will update the tasks or data as necessary for new methods introduced during upcoming lectures. In some cases, we may need to present an additional dataset to carry out some examples.
Generally, a number of example datasets can be found at the Principles of Econometrics book website (along with their dataset definitions), or in the R datasets package, which is accessible from both R and Python.
Use the following datasets:
Food expenditure, food_exp, depends on income. The dataset can be loaded in both R and Python:
dt1 <- read.csv(file = "http://www.principlesofeconometrics.com/poe5/data/csv/food.csv", sep = ",", dec = ".", header = TRUE)
import pandas as pd
dt1 = pd.read_csv("http://www.principlesofeconometrics.com/poe5/data/csv/food.csv")
The construction cost of a nuclear power plant, cost, depends on its power capacity, cap. The dataset can be loaded in both R and Python:
dt2 <- boot::nuclear
import statsmodels.api as sm
dt2 = sm.datasets.get_rdataset("nuclear", "boot")
#print(dt2.__doc__) # documentation about the data
dt2 = dt2.data
The house sale price, sprice, is affected by the house living area, livarea. The dataset can be loaded in both R and Python:
dt3 <- read.csv(file = "http://www.principlesofeconometrics.com/poe5/data/csv/stockton5_small.csv", sep = ",", dec = ".", header = TRUE)
dt3 = pd.read_csv("http://www.principlesofeconometrics.com/poe5/data/csv/stockton5_small.csv")
Education, educ, affects wage. The dataset can be loaded in both R and Python:
dt4 <- read.csv(file = "http://www.principlesofeconometrics.com/poe5/data/csv/cps5_small.csv", sep = ",", dec = ".", header = TRUE)
dt4 = pd.read_csv("http://www.principlesofeconometrics.com/poe5/data/csv/cps5_small.csv")
Tuna sales of brand \(\#1\), sal1. Consider examining how the ratio of brand \(\#1\) tuna prices, apr1, to brand \(\#3\) tuna prices, apr3, affects sal1, measured in thousands of units. (You may need to rescale sal1 so that it measures sales in thousands instead of single units.) The dataset can be loaded in both R and Python:
dt5 <- read.csv(file = "http://www.principlesofeconometrics.com/poe5/data/csv/tuna.csv", sep = ",", dec = ".", header = TRUE)
dt5 = pd.read_csv("http://www.principlesofeconometrics.com/poe5/data/csv/tuna.csv")
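A minimal R sketch of that rescaling (assuming sal1 in the raw file is recorded in single units):
# A minimal sketch (R): convert sal1 from single units to thousands of units.
dt5$sal1 <- dt5$sal1 / 1000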
Note: either dt3 or dt4 will be selected and analysed during lectures.
Below are the tasks that you should carry out for the datasets:
(2018-09-13)
Finally: Use the built-in functions for OLS estimation and compare with your own results.
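For this task, a minimal R sketch of estimating the simple regression "by hand" (using the food data dt1 loaded above as an example) could look like:
# A minimal sketch (R): simple-regression OLS estimates computed by hand.
x <- dt1$income
y <- dt1$food_exp
b1 <- cov(x, y) / var(x)       # slope estimate
b0 <- mean(y) - b1 * mean(x)   # intercept estimate
print(c(b0, b1))
print(coef(lm(y ~ x)))         # compare with the built-in estimator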
(2018-09-20)
Note: initially, try to estimate at least one model with transformed variables without using the built-in OLS estimation functions, to make sure you understand how the transformations are applied and how they enter the same OLS formula as in the simple linear regression case.
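As a starting point, a minimal R sketch of manual OLS with a log-transformed regressor (again using the food data dt1 as an example) might look like:
# A minimal sketch (R): a transformed regressor enters the usual OLS
# formula b = (X'X)^(-1) X'y exactly like an untransformed one.
X <- cbind(1, log(dt1$income))   # design matrix: intercept + transformed regressor
y <- dt1$food_exp
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
print(beta_hat)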
(2018-09-27)
Finally: Use the built-in functions for OLS estimation and compare with your own results.
(2018-10-04)
Do your results match those of lm() in R, or sm.OLS() in Python?
(2018-10-11)
Generate data from the following model: \[ Y = \beta_0 + \beta_1 X + \epsilon \] where \(\beta_0 = 2.5\), \(\beta_1 = -1.2\) and \(\epsilon \sim \mathcal{N}(0, \sigma^2)\) (these values are set in the code below).
Generate 1000 replications (i.e. different samples) for different sample sizes \(N \in \{10, 100, 500, 800\}\) and different error variance parameters \(\sigma^2 \in \{1, 2, 5 \}\). In other words: take \(\sigma^2 = 1\) and generate 1000 different samples for \(N = 10\), 1000 different samples for \(N = 100\) and 1000 different samples for \(N = 500\). Then take \(\sigma^2 = 2\) and do the same, etc.
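In R, one convenient way to enumerate all the required \((N, \sigma^2)\) combinations up front is expand.grid; a minimal sketch:
# A minimal sketch (R): all combinations of sample size and error variance.
params <- expand.grid(N = c(10, 100, 500, 800), sigma2 = c(1, 2, 5))
print(params)   # 12 rows, one per (N, sigma2) combination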
Do the following:
We will give an outline of how the process of generating multiple samples for different parameters could be implemented. Note that this is not the only way: there are packages and functions which would increase the speed of data generation and estimation, but for this exercise we will focus on code readability instead of calculation speed.
We will approach this task in parts and then combine these parts into a single code block.
So, we have some fixed parameters:
beta_0 <- 2.5
beta_1 <- -1.2
beta_0 = 2.5
beta_1 = -1.2
As well as the parameters whose combinations we will vary:
N_vec <- c(10, 100, 500)
sigma2_vec <- c(1, 2, 5)
N_vec = [10, 100, 500]
sigma2_vec = [1, 2, 5]
Note: we did not use all of the values of \(N\) (here we omit \(N = 800\)).
We will begin by taking one combination of parameters \(N = 10\) and \(\sigma^2 = 1\):
N <- N_vec[1]
s2 <- sigma2_vec[1]
N = N_vec[0]
s2 = sigma2_vec[0]
And generate a single sample of our data for this one combination of parameters:
set.seed(123)
# Generate the regressor values, the random errors and the dependent variable:
x <- sample(1:50, replace = TRUE, size = N)
e <- rnorm(mean = 0, sd = sqrt(s2), n = N)
y <- beta_0 + beta_1 * x + e
# Plot the generated sample:
plot(x, y)
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(123)
# Generate the regressor values, the random errors and the dependent variable:
x = np.random.choice(list(range(1, 51)), replace = True, size = N)
e = np.random.normal(loc = 0, scale = np.sqrt(s2), size = N)
y = beta_0 + beta_1 * x + e
# Plot the generated sample:
plt.figure(num = 0, figsize = (10, 8))
plt.plot(x, y, linestyle = "None", marker = "o")
plt.show()
Next, we want to estimate \(\beta_0\) and \(\beta_1\) via OLS:
# Estimate the coefficients with the built-in lm() function:
beta_est <- coef(lm(y ~ 1 + x))
print(beta_est)
## (Intercept) x
## 3.408751 -1.226338
import statsmodels.api as sm
# Estimate the coefficients with the built-in sm.OLS() function:
beta_est = sm.OLS(y, sm.add_constant(x)).fit().params
print(beta_est)
## [ 3.99830368 -1.25405325]
We will leave the estimation of the standard errors for you to calculate. You can append those results to the output shown in this outline.
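As a hint, a minimal R sketch of the classical OLS standard error formula (applied to the single x, y sample generated above) could look like:
# A minimal sketch (R): classical OLS standard errors computed by hand.
X <- cbind(1, x)                                  # design matrix with an intercept
bhat <- solve(t(X) %*% X) %*% t(X) %*% y          # OLS coefficient estimates
resid <- y - X %*% bhat                           # residuals
s2_hat <- sum(resid^2) / (length(y) - ncol(X))    # unbiased error variance estimate
se <- sqrt(diag(s2_hat * solve(t(X) %*% X)))      # standard errors of the estimates
print(se)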
Now, we want to generate multiple samples for one combination of parameters.
Before that comes a very important question: do we need the simulated data only for parameter estimation, or will we need to use it at some point later on? Generating and saving data takes computation time and uses your computer's RAM, so it is important to know whether the data will need to be reused or not.
For now, we will show you how to save the data. We will simulate 1000 random samples with the specified parameters. Notice that we can also estimate the parameters immediately after generating the data. In practice it might be safer to first generate and save the data, and only then estimate the parameters, so as to avoid losing everything to a PC crash or other disaster (BSOD, electrical failure, etc.).
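For contrast, if the simulated data itself will not be reused, a minimal R sketch that keeps only the coefficient estimates (reusing N, s2 and the coefficients defined above) might look like:
# A minimal sketch (R): keep only the estimates, not the simulated samples.
MC <- 1000
beta_only <- matrix(NA, nrow = 2, ncol = MC)
for(i in 1:MC){
  x <- sample(1:50, replace = TRUE, size = N)
  y <- beta_0 + beta_1 * x + rnorm(mean = 0, sd = sqrt(s2), n = N)
  beta_only[, i] <- coef(lm(y ~ 1 + x))
}
Below, however, we save the full samples as well.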
set.seed(123)
#
MC <- 1000
x_samples <- NULL
y_samples <- NULL
beta_smpl <- NULL
for(i in 1:MC){
  # Generate the data:
  x <- sample(1:50, replace = TRUE, size = N)
  e <- rnorm(mean = 0, sd = sqrt(s2), n = N)
  y <- beta_0 + beta_1 * x + e
  # Estimate the coefficients
  beta_est <- coef(lm(y ~ 1 + x))
  # Save the results
  x_samples <- cbind(x_samples, x)
  y_samples <- cbind(y_samples, y)
  beta_smpl <- cbind(beta_smpl, beta_est)
}
np.random.seed(123)
#
MC = 1000
x_samples = np.array([])
y_samples = np.array([])
beta_smpl = np.array([])
for i in range(0, MC):
    # Generate the data:
    x = np.random.choice(list(range(1, 51)), replace = True, size = N)
    e = np.random.normal(loc = 0, scale = np.sqrt(s2), size = N)
    y = beta_0 + beta_1 * x + e
    # Estimate the coefficients
    beta_est = sm.OLS(y, sm.add_constant(x)).fit().params
    # Save the results
    if len(x_samples) == 0:
        # Add the first element to the array, if it is empty
        x_samples = x
        y_samples = y
        beta_smpl = beta_est
    else:
        x_samples = np.vstack((x_samples, x))
        y_samples = np.vstack((y_samples, y))
        beta_smpl = np.vstack((beta_smpl, beta_est))
Side-note: np.vstack is not the optimal way to save the results in Python speed-wise, but it is the easiest to read, so we will use it.
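The same readability-versus-speed trade-off applies to cbind() in the R code above; a minimal R sketch of the faster preallocation approach (reusing N, MC, s2 and the coefficients defined above):
# A minimal sketch (R): preallocate the result matrices instead of growing
# them with cbind() on every iteration, which avoids repeated copying.
x_samples <- matrix(NA, nrow = N, ncol = MC)
beta_smpl <- matrix(NA, nrow = 2, ncol = MC)
for(i in 1:MC){
  x_samples[, i] <- sample(1:50, replace = TRUE, size = N)
  y <- beta_0 + beta_1 * x_samples[, i] + rnorm(mean = 0, sd = sqrt(s2), n = N)
  beta_smpl[, i] <- coef(lm(y ~ 1 + x_samples[, i]))
}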
The data is saved as follows: in R, the 1st column of x_samples, y_samples and beta_smpl corresponds to the first generated sample, the 2nd column to the second sample, and so on (in the Python code, each sample is stacked as a row instead). We can verify this by re-estimating the coefficients from the first sample and comparing them with the saved coefficients in the first column (first row in Python):
print(coef(lm(y_samples[, 1] ~ 1 + x_samples[, 1])))
## (Intercept) x_samples[, 1]
## 3.408751 -1.226338
print(beta_smpl[, 1])
## (Intercept) x
## 3.408751 -1.226338
print(sm.OLS(y_samples[0], sm.add_constant(x_samples[0])).fit().params)
## [ 3.99830368 -1.25405325]
print(beta_smpl[0])
## [ 3.99830368 -1.25405325]
Now comes the tricky part: we want to iterate through different combinations of \(N\) and \(\sigma^2\). We are faced with two challenges: how to label the output of each parameter combination, and how to store samples of different sizes in a single variable.
To answer the first question, we note that we can generate names in the following way:
nms <- c()
for(N in N_vec){
  for(s2 in sigma2_vec){
    nms <- c(nms, paste0("N=", N, ";s2=", s2))
  }
}
print(nms)
## [1] "N=10;s2=1" "N=10;s2=2" "N=10;s2=5" "N=100;s2=1" "N=100;s2=2"
## [6] "N=100;s2=5" "N=500;s2=1" "N=500;s2=2" "N=500;s2=5"
nms = np.array([])
for N in N_vec:
    for s2 in sigma2_vec:
        nms = np.append(nms, "N=" + str(N) + ";s2=" + str(s2))
print(nms)
## ['N=10;s2=1' 'N=10;s2=2' 'N=10;s2=5' 'N=100;s2=1' 'N=100;s2=2'
##  'N=100;s2=5' 'N=500;s2=1' 'N=500;s2=2' 'N=500;s2=5']
We can use this to apply names to our generated data for each combination of \(N\) and \(\sigma^2\), which leads to the next question: for different sample sizes, how should we save the data? The easiest (though not the most optimal) way is to use list() in R and dict in Python, because they allow us to save different types (or, in our case, lengths) of data and give them different names, all in one variable.
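As a quick illustration, a minimal R sketch showing that a named list can hold differently-sized elements:
# A minimal sketch (R): one list holding matrices of different sizes by name.
out <- list()
out[["N=10;s2=1"]]  <- matrix(0, nrow = 10,  ncol = 2)
out[["N=100;s2=1"]] <- matrix(0, nrow = 100, ncol = 2)
str(out)  # two elements of different dimensions, stored under their names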
We will edit our previous data generation code to account for different combinations of \(N\) and \(\sigma^2\):
set.seed(123)
#
simulation_output <- list()
nms <- c()
for(N in N_vec){
  for(s2 in sigma2_vec){
    # Print which combination we are currently simulating:
    print(paste0("Simulating for N = ", N, " and s2 = ", s2))
    # Save the names
    nms <- c(nms, paste0("N=", N, ";s2=", s2))
    # Generate data and estimate coefficients for the current combination of N and s2:
    x_samples <- NULL
    y_samples <- NULL
    beta_smpl <- NULL
    for(i in 1:MC){
      # Generate the data:
      x <- sample(1:50, replace = TRUE, size = N)
      e <- rnorm(mean = 0, sd = sqrt(s2), n = N)
      y <- beta_0 + beta_1 * x + e
      # Estimate the coefficients
      beta_est <- coef(lm(y ~ 1 + x))
      # Save the results
      x_samples <- cbind(x_samples, x)
      y_samples <- cbind(y_samples, y)
      beta_smpl <- cbind(beta_smpl, beta_est)
    }
    # Save the current parameter combination results:
    simulation_output <- c(simulation_output,
                           list(list(x = x_samples,
                                     y = y_samples,
                                     b = beta_smpl)))
  }
}
## [1] "Simualting for N = 10 and s2 = 1"
## [1] "Simualting for N = 10 and s2 = 2"
## [1] "Simualting for N = 10 and s2 = 5"
## [1] "Simualting for N = 100 and s2 = 1"
## [1] "Simualting for N = 100 and s2 = 2"
## [1] "Simualting for N = 100 and s2 = 5"
## [1] "Simualting for N = 500 and s2 = 1"
## [1] "Simualting for N = 500 and s2 = 2"
## [1] "Simualting for N = 500 and s2 = 5"
# Assign names so that we can identify which parameter combination a specific simulation belongs to
names(simulation_output) <- nms
np.random.seed(123)
#
simulation_output = dict()
nms = np.array([])
for ii in range(0, len(N_vec)):
    for jj in range(0, len(sigma2_vec)):
        N = N_vec[ii]
        s2 = sigma2_vec[jj]
        # Print which combination we are currently simulating:
        print("Simulating for N = " + str(N) + " and s2 = " + str(s2))
        # Save the names
        nms = np.append(nms, "N=" + str(N) + ";s2=" + str(s2))
        # Generate data and estimate coefficients for the current combination of N and s2:
        x_samples = np.array([])
        y_samples = np.array([])
        beta_smpl = np.array([])
        for i in range(0, MC):
            # Generate the data:
            x = np.random.choice(list(range(1, 51)), replace = True, size = N)
            e = np.random.normal(loc = 0, scale = np.sqrt(s2), size = N)
            y = beta_0 + beta_1 * x + e
            # Estimate the coefficients
            beta_est = sm.OLS(y, sm.add_constant(x)).fit().params
            # Save the results
            if len(x_samples) == 0:
                # Add the first element to the array, if it is empty
                x_samples = x
                y_samples = y
                beta_smpl = beta_est
            else:
                x_samples = np.vstack((x_samples, x))
                y_samples = np.vstack((y_samples, y))
                beta_smpl = np.vstack((beta_smpl, beta_est))
        # Save the current parameter combination results:
        simulation_output["N=" + str(N) + ";s2=" + str(s2)] = {"x": x_samples,
                                                               "y": y_samples,
                                                               "b": beta_smpl}
## Simulating for N = 10 and s2 = 1
## Simulating for N = 10 and s2 = 2
## Simulating for N = 10 and s2 = 5
## Simulating for N = 100 and s2 = 1
## Simulating for N = 100 and s2 = 2
## Simulating for N = 100 and s2 = 5
## Simulating for N = 500 and s2 = 1
## Simulating for N = 500 and s2 = 2
## Simulating for N = 500 and s2 = 5
We can then examine what we created:
#str(simulation_output)
#print(type(simulation_output))
#print(simulation_output.items())
As an example, we will examine how the mean and variance of \(\widehat{\beta}_0\) change as the sample size increases for \(\sigma^2 = 1\). For simplicity, we first manually find the positions of the values that we need, and then calculate and plot the results:
print(nms)
## [1] "N=10;s2=1" "N=10;s2=2" "N=10;s2=5" "N=100;s2=1" "N=100;s2=2"
## [6] "N=100;s2=5" "N=500;s2=1" "N=500;s2=2" "N=500;s2=5"
print(nms[c(1, 4, 7)])
## [1] "N=10;s2=1" "N=100;s2=1" "N=500;s2=1"
# Select the names corresponding to sigma^2 = 1:
tmp_nms <- nms[c(1, 4, 7)]
beta0_mean <- c()
beta0_var <- c()
for(i in 1:3){
  temp_output <- simulation_output[[tmp_nms[i]]]
  beta0_mean <- c(beta0_mean, mean(temp_output$b[1, ]))
  beta0_var <- c(beta0_var, var(temp_output$b[1, ]))
}
# Plot the mean and variance of the intercept estimates against the sample size:
par(mfrow = c(1, 2))
plot(N_vec, beta0_mean, type = "o", pch = 19,
     main = expression("Mean of "~widehat(beta)[0]))
abline(h = beta_0, col = "red")
plot(N_vec, beta0_var, type = "o",
     main = expression("Variance of "~widehat(beta)[0]))
abline(h = 0, col = "red")
print(nms)
## ['N=10;s2=1' 'N=10;s2=2' 'N=10;s2=5' 'N=100;s2=1' 'N=100;s2=2'
## 'N=100;s2=5' 'N=500;s2=1' 'N=500;s2=2' 'N=500;s2=5']
print(nms[[0, 3, 6]])
## ['N=10;s2=1' 'N=100;s2=1' 'N=500;s2=1']
# Select the names corresponding to sigma^2 = 1:
tmp_nms = nms[[0, 3, 6]]
beta0_mean = np.array([])
beta0_var = np.array([])
for i in range(0, 3):
    temp_output = simulation_output[tmp_nms[i]]
    beta0_mean = np.append(beta0_mean, np.mean(np.transpose(temp_output["b"])[0]))
    beta0_var = np.append(beta0_var, np.var(np.transpose(temp_output["b"])[0]))
# Plot the mean and variance of the intercept estimates against the sample size:
fig = plt.figure(num = 1, figsize = (10, 8))
fig.add_subplot(1, 2, 1).plot(N_vec, beta0_mean, linestyle = "-", marker = "o")
plt.axhline(y = beta_0, color = "red")
plt.title("Mean of $\\widehat{\\beta}_0$")
fig.add_subplot(1, 2, 2).plot(N_vec, beta0_var, linestyle = "-", marker = "o")
plt.title("Variance of $\\widehat{\\beta}_0$")
plt.axhline(y = 0, color = "red")
plt.show()
Note that we could have also calculated the means of \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) and saved them as additional variables while simulating the data. Try it; a minimal sketch follows.
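A minimal R sketch of that idea, computed after the fact from the saved output (inside the simulation loop you would instead store rowMeans(beta_smpl) next to x, y and b):
# A minimal sketch (R): mean of each coefficient across the MC samples,
# for every saved parameter combination.
b_means <- sapply(simulation_output, function(el) rowMeans(el$b))
print(b_means)  # one column per N/sigma^2 combination; rows: intercept, slope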