We will illustrate some of the methods outlined during the lectures. Note that we will update the tasks or data as necessary for new methods introduced during upcoming lectures. In some cases, we may need to present an additional dataset to carry out some examples.

Generally, a number of example datasets can be found at Principles of Econometrics book website with their dataset definitions or from R datasets which are available in both R and Python.

Data Overview

Use the following datasets:

food (data definition). Let’s say that we are interested in estimating how the expenditure on food, food_exp, depends on the income. The dataset can be loaded in both R and Python:

dt1 <- read.csv(file = "http://www.principlesofeconometrics.com/poe5/data/csv/food.csv", sep = ",", dec = ".", header = TRUE)

import pandas as pd
#
dt1 = pd.read_csv("http://www.principlesofeconometrics.com/poe5/data/csv/food.csv")

nuclear. Let’s say that we are interested in estimating how the cost depends on the power capacity cap. The dataset can be loaded in both R and Python:

dt2 <- boot::nuclear

import statsmodels.api as sm
#
dt2 = sm.datasets.get_rdataset("nuclear", "boot")
#print(dt.__doc__) #documentation about the data
dt2 = dt2.data

stockton5_small (data definition) contains data on houses sold in Stockton, California in 1996-1998. Assume that we are interested how is the sale price, sprice, affected by the house living area, livarea.

dt3 <- read.csv(file = "http://www.principlesofeconometrics.com/poe5/data/csv/stockton5_small.csv", sep = ",", dec = ".", header = TRUE)

dt3 = pd.read_csv("http://www.principlesofeconometrics.com/poe5/data/csv/stockton5_small.csv")

cps5_small (data definition) contains data on hourly wage rates, education, etc. from the 2013 Current Population Survey. Suppose we are interested in examining how does education, educ, affect wage.

dt4 <- read.csv(file = "http://www.principlesofeconometrics.com/poe5/data/csv/cps5_small.csv", sep = ",", dec = ".", header = TRUE)

dt4 = pd.read_csv("http://www.principlesofeconometrics.com/poe5/data/csv/cps5_small.csv")

tuna, (data definition) contains weekly data (we will ignore the time dimension for now) on the number of cans sold of brand \(\#1\) tuna (sal1). Consider examining how the ratio of brand \(\#1\) tuna prices, apr1, to brand \(\#3\) tuna prices, apr3, affects sal1 in thousands of units:

Firstly, scale sal1, so that it would measure sales in thousands (instead of single units).
Secondly, calculate the ratio as \(\text{price_ratio} = 100 \cdot (\text{apr1} / \text{apr3})\). This ratio indicates the percentage price of brand \(\#1\) tuna, relative to brand \(\#3\) tuna. Brand \(\#1\) tuna is more expensive when \(\text{price_ratio} > 100\), and less expensive when \(\text{price_ratio} < 100\). For example:
- if the ratio equals 100, then the price of both brands is the same;
- if it is equal to 90, then brand \(\#1\) is cheaper by \(10\%\) than brand \(\#3\);
- if it is equal to \(110\), then brand \(\#1\) is \(10\%\) more expensive than brand \(\#3\).
Estimate how the price ratio affects the sales numbers of brand \(\#1\).

dt5 <- read.csv(file = "http://www.principlesofeconometrics.com/poe5/data/csv/tuna.csv", sep = ",", dec = ".", header = TRUE)

dt5 = pd.read_csv("http://www.principlesofeconometrics.com/poe5/data/csv/tuna.csv")

Note: either dt3 or dt4 will be selected and analysed during lectures.

Tasks

Below are the tasks that you should carry out for the datasets:

(2018-09-13)

Plot the scatter plot of the dependent variable \(Y\) and the independent variable \(X\). Do the variables look correlated? Calculate the correlation between \(X\) and \(Y\) to verify.
Specify the regression in a mathematical formula notation. What coefficient sign do you expect \(\beta_1\) to have? Explain.
Estimate the regression via OLS without using the built-in OLS estimation functions. Is the sign on \(\beta_1\) the same as you expected?
Calculate the standard errors of the estimated coefficients.
Write Down the estimated regression formula.
Calculate the fitted values and plot the estimated regression alongside the data.

Finally: Use the built-in functions for OLS estimation and compare with your own results.

(2018-09-20)

Examine the run-sequence plot of the dependent variable, \(Y\). Does the data appear random? Is the variance (and mean) the same throughout observations?
Examine the histogram of your dependent (\(Y\)) and independent (\(X\)) variables. Are there any variables that appear to have a non-normal distribution?
Take another look at the scatter plot from task (1) - could you specify at least one more linear regression but this time with transformed variable(-s) (Note: consider transformations either by scaling with a constant, by taking logarithms, or by taking the square of the independent variable)?
Examine the residual run-sequence plots and histograms from regressions in task (3) and task (9) - which regressions appear to have residuals that are random? Are there any regressions, where the residuals do not appear to be random (i.e. randomly dispersed around the horizontal axis)? What can you say about the models in such cases (in regards to some of the linear regression model assumptions (UR.1)-(UR.4) and the signs of the coefficient \(\beta_1\)).
Select one model, which you believe to be best suited for your data from the conclusions in task (7) through (10) and write down the equation.
Provide an interpretation of \(\beta_1\) for your selected model.

Note: Initially try to estimate at least one model with transformed variables without using the built-in OLS estimation functions in order to make sure you understand how the transformations are applied and how they are incorporated in the same OLS formula as for the simple linear regression case.

(2018-09-27)

Select two models - one model from Task (11) and any other one model from either Task (3) or Task (9) - and test (by calculating the \(p\)-value) the null hypothesis that the coefficient of your explanatory variable \(X\) is not significant, with the alternative that:
1. \(\beta_1\) is negative;
2. \(\beta_1\) is positive;
3. \(\beta_1\) is not zero;
Plot the confidence intervals for the mean response for each model and plot them for \(Y\) (i.e. not \(\log(Y)\) or any other transformation of \(Y\)).
Plot the prediction intervals of \(Y\) for existing values of \(X\).
Let’s say our new \(X\) is:
1. \(\widetilde{X} = 0.8 \cdot \min(X_1, ..., X_N)\)
2. \(\widetilde{X} = 1.1 \cdot \max(X_1, ..., X_N)\)
Calculate the predicted value along with the prediction intervals.

Finally: Use the built-in functions for OLS estimation and compare with your own results.

(2018-10-04)

Calculate (either manually or using the built-in OLS estimation functions) \(R^2\) of your selected models from Task (13). Can you directly compare the \(R^2\) values of your selected models (and why)?
Calculate \(R^2_g\) (the general \(R^2\)) for the back-transformed variables (i.e. for non-transformed values of the dependent variable - \(Y\)). Is it larger than the standard \(R^2\), which is reported by either lm() in R, or sm.OLS() in Python?
Which model has the largest \(R^2\) (is it the same model if we compare with \(R^2_g\)) ? Is the model the same as in Task (11)? For the model with the largest \(R^2\), provide an interpretation of the calculated \(R^2\) value.

(2018-10-11)

Once again look at the residuals plots:
- scatter plot of residuals and fitted values, scatter plot of the residuals and \(X\) - are there any non-linearities in the residuals?
- residual Q-Q plot, their histogram - are the residuals normal?
Carry out the Breusch-Pagan Test for homoskedasticity, Durbin-Watson Test for autocorrelation and Shapiro-Wilk Test for normality. What are the null and alternative hypothesis of these tests? What do these test results say about the residuals?
Looking at all of the results thus far - which model would you consider the “best” and why?
Take a subset of the data - around \(80\%\) of the sample and estimate the best model, which yo selected in Task (21), on the subset:
- Are the model results (signs, significance, residual GoF tests) similar to the model on the full data?
- Plot the subset data along with the predicted values of the model;
- Calculate the predicted values and confidence intervals for the remaining \(20\%\);
- Plot the predicted values, their confidence intervals and the actual values - do the true values fall in the confidence intervals?

Data Generation Task

Generate the following model: \[ Y = \beta_0 + \beta_1 X + \epsilon \] where:

\(\beta_0 = 2.5\), \(\beta_1 = - 1.2\);
\(X\) is a random sample with replacement from \(\{ 1, ..., 50 \}\)
\(\epsilon \sim \mathcal{N}(0, \sigma^2)\)

Generate 1000 replications (i.e. different samples) for different sample sizes \(N \in \{10, 100, 500, 800\}\) and different error variance parameters \(\sigma^2 \in \{1, 2, 5 \}\). In other words: take \(\sigma^2 = 1\) and generate 1000 different samples for \(N = 10\), 1000 different samples for \(N = 100\) and 1000 different samples for \(N = 500\). Then take \(\sigma^2 = 2\) and do the same, etc.

Do the following:

For each generated sample and each combination of \(N\) and \(\sigma^2\), estimate the unknown coefficients \(\beta_0\), \(\beta_1\) and \(\sigma^2\) and calculate the standard errors.
Plot the histograms of the estimated parameters and standard errors for different combinations of \(N\) and \(\sigma^2\).
How do the estimates change (their mean and variance of the estimate parameter values throughout the replications) when the sample size increases for a fixed error variance? For example: When \(\sigma^2 = 1\), examine what happens to the mean and variance of the estimated parameters as \(N\) increases from \(10\) to \(100\) and to \(1000\). Then, do the same for \(\sigma^2 = 2\) and for \(\sigma^2 = 5\).
What happens to the OLS estimates when the error variance increases but the sample size is fixed? For example: if \(N = 10\), what happens to the estimated parameter mean, variance if the error variance is increased from \(\sigma^2 = 1\) to \(\sigma^2 = 2\) and to \(\sigma^2 = 5\), etc.
From the previous two tasks, what can you conclude about the effects from an increase in the sample size on the OLS estimates (if all the remaining parameters are unchanged)?
From the previous two tasks, what can you conclude about the effects of error variance on the OLS estimates (if all other parameters, including the sample size are unchanged). For example, for sample of size \(N = 10\), but different error variance \(\sigma^2\), is the accuracy of the OLS estimates (in terms of their standard error) the same?

Outline on Data Simulation

We will give an outline on how the process of generating multiple samples for different parameters could be applied. Note that this is not the only way - there are multiple packages and functions, which would increase the speed of data generation and estimation, but for this exercise we will focus on code readability, instead of calculation speed.

We will show how to approach this task by specific parts and then combine these parts into a single code block.

So, we have some fixed parameters:

beta_0 <- 2.5
beta_1 <- -1.2

beta_0 = 2.5
beta_1 = -1.2

As well as parameters, which we can have different combinations of:

N_vec <- c(10, 100, 500)
sigma2_vec <- c(1, 2, 5)

N_vec = [10, 100, 500]
sigma2_vec = [1, 2, 5]

Note: We did not use all the values for N.

We will begin by taking one combination of parameters \(N = 10\) and \(\sigma^2 = 1\):

N <- N_vec[1]
s2 <- sigma2_vec[1]

N = N_vec[0]
s2 = sigma2_vec[0]

And try to generate a single sample of our data for this one combination of parameters:

#
#
#
set.seed(123)
#
x <- sample(1:50, replace = TRUE, size = N)
e <- rnorm(mean = 0, sd = sqrt(s2), n = N)
y <- beta_0 + beta_1 * x + e
#
#
#
plot(x, y)

import matplotlib.pyplot as plt
import numpy as np
#
np.random.seed(123)
#
x = np.random.choice(list(range(1, 51)), replace = True, size = N)
e = np.random.normal(loc = 0, scale = np.sqrt(s2), size = N)
y = beta_0 + beta_1 * x + e
#
plt.figure(num = 0, figsize = (10, 8))
plt.plot(x, y, linestyle = "None", marker = "o")
plt.show()

Next, we want to estimate \(\beta_0\) and \(\beta_1\) via OLS:

#
#
beta_est <- coef(lm(y ~ 1 + x))
print(beta_est)

## (Intercept)           x 
##    3.408751   -1.226338

import statsmodels.api as sm
#
beta_est = sm.OLS(y, sm.add_constant(x)).fit().params
print(beta_est)

## [ 3.99830368 -1.25405325]

We will leave the estimation of the standard errors for you to calcualte. The results can be appended to the output, which will be shown in this outline.

Now, we want to generate multiple samples for one combination of parameters.

Before that, comes a very important question - do we only need the simulated data for parameter estimation, or will we need to use the simulated data at some point later on?. Generating and saving data usually takes computational time and uses your computers RAM, so it is important to know, whether the data will need to be used, or not.

For now, we will show you how to save the data. We will simulate 1000 random samples with the specified parameters. Notice that we can also estimate the parameters immediately after generating the data. In practice it might be safer to firstly generate and save the data, and estimate the parameters after so as to avoid any PC crashes or other disasters (BSOD, electrical failures, etc.).

set.seed(123)
#
MC = 1000
x_samples <- NULL
y_samples <- NULL
beta_smpl <- NULL
for(i in 1:MC){
  # Generate the data:
  x <- sample(1:50, replace = TRUE, size = N)
  e <- rnorm(mean = 0, sd = sqrt(s2), n = N)
  y <- beta_0 + beta_1 * x + e
  # Estimate the coefficients
  beta_est <- coef(lm(y ~ 1 + x))
  # Save the results
  x_samples <- cbind(x_samples, x)
  y_samples <- cbind(y_samples, y)
  beta_smpl <- cbind(beta_smpl, beta_est)
}

np.random.seed(123)
#
MC = 1000
x_samples = np.array([])
y_samples = np.array([])
beta_smlp = np.array([])
for i in range(0, MC):
    # Generate the data:
    x = np.random.choice(list(range(1, 51)), replace = True, size = N)
    e = np.random.normal(loc = 0, scale = np.sqrt(s2), size = N)
    y = beta_0 + beta_1 * x + e
    # Estimate the coefficients
    beta_est <- sm.OLS(y, sm.add_constant(x)).fit().params
    # Save the results
    if len(x_samples) == 0:
        #Add first element to the array, if it is empty
        x_samples = x
        y_samples = y
        beta_smpl = beta_est
    else:
        x_samples = np.vstack((x_samples, x))
        y_samples = np.vstack((y_samples, y))
        beta_smpl = np.vstack((beta_smpl, beta_est))

Side-note - np.vstack is not an optimal way to save the results in Python speed-wise, but it is the easiest to read, so we will use it.

The way we have saved the data is as follows: the 1st column of x_sampes, y_samples and beta_smpl corresponds to the model generated in the first sample, the 2nd model - with the second, etc. we can verify this, by estimating the coefficients of the samples from the first column and comparing with the coefficients in the first column:

print(coef(lm(y_samples[, 1] ~ 1 + x_samples[, 1])))

##    (Intercept) x_samples[, 1] 
##       3.408751      -1.226338

print(beta_smpl[, 1])

## (Intercept)           x 
##    3.408751   -1.226338

print(sm.OLS(y_samples[0], sm.add_constant(x_samples[0])).fit().params)

## [ 3.99830368 -1.25405325]

print(beta_smpl[0])

## [ 3.99830368 -1.25405325]

Now comes the tricky part - we want to iterate through different combinations of \(N\) and \(\sigma^2\). We are faced with two challenges:

How to keep track, which data corresponds to which combination of values?
How to append data of different lengths?

To answer the first question, we note that we can generate names in the following way:

nms <- c()
for(N in N_vec){
  for(s2 in sigma2_vec){
    nms <- c(nms, paste0("N=", N, ";s2=", s2))
  }
}
print(nms)

## [1] "N=10;s2=1"  "N=10;s2=2"  "N=10;s2=5"  "N=100;s2=1" "N=100;s2=2"
## [6] "N=100;s2=5" "N=500;s2=1" "N=500;s2=2" "N=500;s2=5"

nms = np.array([])
for N in range(0, len(N_vec)):
    for s2 in range(0, len(sigma2_vec)):
        nms = np.append(nms, "N="+str(N)+";s2="+str(s2))
#
#
print(nms)

## ['N=0;s2=0' 'N=0;s2=1' 'N=0;s2=2' 'N=1;s2=0' 'N=1;s2=1' 'N=1;s2=2'
##  'N=2;s2=0' 'N=2;s2=1' 'N=2;s2=2']

We can use this to apply names to our generated data for each combination of \(N\) and \(\sigma^2\), which leads to the next question - for different sample sizes, how should we save the data? The easiest (though not the most optimal) way is to use list() in R and dict in Python, because they allow us to save different types (or in our case, lengths) of data and give them different names, all in one variable.

We will edit our previous data generation code to account for different combinations of \(N\) and \(\sigma^2\):

set.seed(123)
#
simulation_output <- list()
nms <- c()
for(N in N_vec){
  for(s2 in sigma2_vec){
    #Print which combination we are currently simulating:
    print(paste0("Simualting for N = ", N, " and s2 = ", s2))
    # Save the names
    nms <- c(nms, paste0("N=", N, ";s2=", s2))
    # Generate data and estimate coefficients for the current combination of N and s2:
    x_samples <- NULL
    y_samples <- NULL
    beta_smpl <- NULL
    for(i in 1:MC){
      # Generate the data:
      x <- sample(1:50, replace = TRUE, size = N)
      e <- rnorm(mean = 0, sd = sqrt(s2), n = N)
      y <- beta_0 + beta_1 * x + e
      # Estimate the coefficients
      beta_est <- coef(lm(y ~ 1 + x))
      # Save the results
      x_samples <- cbind(x_samples, x)
      y_samples <- cbind(y_samples, y)
      beta_smpl <- cbind(beta_smpl, beta_est)
    }
    # Save the curent parameter combination results:
    simulation_output <- c(simulation_output,
                           list(list(x = x_samples,
                                     y = y_samples,
                                     b = beta_smpl)
                                )
                           )
  }
}

## [1] "Simualting for N = 10 and s2 = 1"
## [1] "Simualting for N = 10 and s2 = 2"
## [1] "Simualting for N = 10 and s2 = 5"
## [1] "Simualting for N = 100 and s2 = 1"
## [1] "Simualting for N = 100 and s2 = 2"
## [1] "Simualting for N = 100 and s2 = 5"
## [1] "Simualting for N = 500 and s2 = 1"
## [1] "Simualting for N = 500 and s2 = 2"
## [1] "Simualting for N = 500 and s2 = 5"

# Assign names so we can identify to which combination of parameters belongs a specific data simulation
names(simulation_output) <- nms

np.random.seed(123)
#
simulation_output = dict()
nms = np.array([])
for ii in range(0, len(N_vec)):
    for jj in range(0, len(sigma2_vec)):
        N = N_vec[ii]
        s2 = sigma2_vec[jj]
        #Print which combination we are currently simulating:
        print("Simulating for N = " + str(N) + " and s2 = " + str(s2))
        # Save the names
        nms = np.append(nms, "N="+str(N)+";s2="+str(s2))
        # Generate data and estimate coefficients for the current combination of N and s2:
        x_samples = np.array([])
        y_samples = np.array([])
        beta_smlp = np.array([])
        for i in range(0, MC):
            # Generate the data:
            x = np.random.choice(list(range(1, 51)), replace = True, size = N)
            e = np.random.normal(loc = 0, scale = np.sqrt(s2), size = N)
            y = beta_0 + beta_1 * x + e
            # Estimate the coefficients
            beta_est = sm.OLS(y, sm.add_constant(x)).fit().params
            # Save the results
            if len(x_samples) == 0:
                #Add first element to the array, if it is empty
                x_samples = x
                y_samples = y
                beta_smpl = beta_est
            else:
                x_samples = np.vstack((x_samples, x))
                y_samples = np.vstack((y_samples, y))
                beta_smpl = np.vstack((beta_smpl, beta_est))
        # Save the curent parameter combination results:
        simulation_output["N="+str(N)+";s2="+str(s2)] = {"x": x_samples,
                                                         "y": y_samples,
                                                         "b": beta_smpl}
#
#

## Simulating for N = 10 and s2 = 1
## Simulating for N = 10 and s2 = 2
## Simulating for N = 10 and s2 = 5
## Simulating for N = 100 and s2 = 1
## Simulating for N = 100 and s2 = 2
## Simulating for N = 100 and s2 = 5
## Simulating for N = 500 and s2 = 1
## Simulating for N = 500 and s2 = 2
## Simulating for N = 500 and s2 = 5

We can then examine what we created:

#str(simulation_output)

#print(type(simulation_output))
#print(simulation_output.items())

As an example, we will examine how the mean and variance of \(\widehat{\beta}_0\) changed, as the sample size increased for \(\sigma^2 = 1\). For simplicity, we first find the positions for the values that we need manually and then calculate and plot the data:

print(nms)

## [1] "N=10;s2=1"  "N=10;s2=2"  "N=10;s2=5"  "N=100;s2=1" "N=100;s2=2"
## [6] "N=100;s2=5" "N=500;s2=1" "N=500;s2=2" "N=500;s2=5"

print(nms[c(1, 4, 7)])

## [1] "N=10;s2=1"  "N=100;s2=1" "N=500;s2=1"

#
tmp_nms <- nms[c(1, 4, 7)]
beta0_mean <- c()
beta0_var  <- c()
for(i in 1:3){
  temp_output <- simulation_output[[tmp_nms[i]]]
  beta0_mean <- c(beta0_mean, mean(temp_output$b[1, ]))
  beta0_var  <- c(beta0_var, var(temp_output$b[1, ]))
}
#
par(mfrow = c(1, 2))
plot(N_vec, beta0_mean, type = "o", pch = 19,
     main = expression("Mean of "~widehat(beta)[0]))
abline(h = beta_0, col = "red")
plot(N_vec, beta0_var, type = "o",
     main = expression("Variance of "~widehat(beta)[0]))
abline(h = 0, col = "red")

print(nms)

## ['N=10;s2=1' 'N=10;s2=2' 'N=10;s2=5' 'N=100;s2=1' 'N=100;s2=2'
##  'N=100;s2=5' 'N=500;s2=1' 'N=500;s2=2' 'N=500;s2=5']

print(nms[[0, 3, 6]])
#

## ['N=10;s2=1' 'N=100;s2=1' 'N=500;s2=1']

tmp_nms = nms[[0, 3, 6]]
beta0_mean = np.array([])
beta0_var = np.array([])
for i in range(0, 3):
    temp_output = simulation_output[tmp_nms[i]]
    beta0_mean = np.append(beta0_mean, np.mean(np.transpose(temp_output["b"])[0]))
    beta0_var = np.append(beta0_var, np.var(np.transpose(temp_output["b"])[0]))
#
fig = plt.figure(num = 1, figsize = (10, 8))
fig.add_subplot('121').plot(N_vec, beta0_mean, linestyle = "-", marker = "o")
plt.axhline(y=2.5, color = "red")
plt.title("Mean of $\\widehat{\\beta}_0$")
fig.add_subplot('122').plot(N_vec, beta0_var, linestyle = "-", marker = "o")
plt.title("Variance of $\\widehat{\\beta}_0$")
plt.axhline(y=0, color = "red")
plt.show()

Note that we could have also calculated the mean of \(\widehat{\beta}_0\), \(\widehat{\beta}_1\) and saved them as additional variables when simulating the data - try it.

Univariate Regression Tasks

Andrius Buteikis

Last updated: 2018-10-11

Data Overview

Tasks

Data Generation Task

Outline on Data Simulation