3.2 Ordinary Least Squares (OLS)

In this chapter we will present the general method of ordinary least squares (OLS), which, under certain conditions, produces estimates of the unknown parameters with a number of desirable properties. Let our (random) samples of \(\epsilon\), \(X\) and \(Y\) be: \(\boldsymbol{\varepsilon} = (\epsilon_1, ..., \epsilon_N)^\top\), \(\mathbf{X} = (X_1, ..., X_N)^\top\), and \(\mathbf{Y} = (Y_1, ..., Y_N)^\top\).

3.2.1 Key assumptions in Regression Analysis

The required conditions are:

(UR.1) The Data Generating Process (DGP), or in other words, the population, is described by a linear (in terms of the coefficients) model: \[Y = \beta_0 + \beta_1 X + \epsilon\]

Another example of a linear model is: \[\log (Y) = \beta_0 + \beta_1 \dfrac{1}{X} + \epsilon \iff U = \beta_0 + \beta_1 V + \epsilon,\ \text{where}\ U = \log(Y),\ V = \dfrac{1}{X}\] In a linear model \(\epsilon\) and \(Y\) are always dependent, thus \(\mathbb{C}{\rm ov} (Y, \epsilon) \neq 0\). However, \(\epsilon\) may (or may not) depend on \(X\). This leads us to further requirements for \(\epsilon\).

(UR.2) The error term \(\epsilon\) has an expected value of zero, given any value of the explanatory variable: \[\mathbb{E}(\epsilon_i| X_j) = 0,\ \forall i,j = 1,...,N\]

This means that, whatever the observed values \(X_j\), the errors \(\epsilon_i\) are on average zero. This also implies that \(\mathbb{E}(\epsilon_i X_i) = 0\) and \(\mathbb{E}(\epsilon_i) = \mathbb{E}\left( \mathbb{E}\left( \epsilon_i | X_i\right) \right)=0\). An example would be the case where \(X_j\) and \(\epsilon_i\) are independent r.v.'s and \(\mathbb{E}(\epsilon_i) = 0\), \(\forall i,j = 1,...,N\). Furthermore, this property implies that: \[\mathbb{E}(Y_i|X_i) = \beta_0 + \beta_1 X_i\] On the other hand, if \(\mathbb{C}{\rm ov}(X_i, \epsilon_i) \neq 0\), then writing the covariance as \(\mathbb{C}{\rm ov}(X_i, \epsilon_i) = \mathbb{E}\left[ (X_i - \mathbb{E}(X_i))(\epsilon_i - \mathbb{E}(\epsilon_i))\right] = \mathbb{E}\left[ (X_i - \mathbb{E}(X_i))\epsilon_i\right] = \mathbb{E} \left[ (X_i - \mathbb{E}(X_i))\mathbb{E}(\epsilon_i | X_i) \right] \neq 0\) shows that \(\mathbb{E}(\epsilon_i| X_i)\) cannot be identically zero, i.e. assumption (UR.2) does not hold.

(UR.3) The error term \(\epsilon\) has the same variance given any value of the explanatory variable (i.e. homoskedasticity): \[\mathbb{V}{\rm ar} (\epsilon_i | \mathbf{X} ) = \sigma^2_\epsilon,\ \forall i = 1,...,N\] and the error terms are not correlated across observations (i.e. no autocorrelation): \[\mathbb{C}{\rm ov} (\epsilon_i, \epsilon_j) = 0,\ i \neq j\]

This implies that the conditional variance-covariance matrix of a vector of disturbances \(\boldsymbol{\varepsilon}\) is a unit (or identity) matrix, times a constant, \(\sigma^2_\epsilon\): \[ \mathbb{V}{\rm ar}\left( \boldsymbol{\varepsilon} | \mathbf{X} \right) = \begin{bmatrix} \mathbb{V}{\rm ar} (\epsilon_1) & \mathbb{C}{\rm ov} (\epsilon_1, \epsilon_2) & ... & \mathbb{C}{\rm ov} (\epsilon_1, \epsilon_N) \\ \mathbb{C}{\rm ov} (\epsilon_2, \epsilon_1) & \mathbb{V}{\rm ar} (\epsilon_2) & ... & \mathbb{C}{\rm ov} (\epsilon_2, \epsilon_N) \\ \vdots & \vdots & \ddots & \vdots \\ \mathbb{C}{\rm ov} (\epsilon_N, \epsilon_1) & \mathbb{C}{\rm ov} (\epsilon_N, \epsilon_2) & ... & \mathbb{V}{\rm ar} (\epsilon_N) \end{bmatrix} = \sigma^2_\epsilon \mathbf{I} \] This means that the disturbances \(\epsilon_i\) and \(\epsilon_j\) are independent \(\forall i \neq j\) and independent of \(\mathbf{X}\), thus: \[\mathbb{C}{\rm ov} (\epsilon_i, \epsilon_j | \mathbf{X} ) = \mathbb{E} (\epsilon_i \epsilon_j | \mathbf{X} ) = 0,\ \forall i \neq j.\]

(UR.4) (optional) The error terms are normally distributed: \[\boldsymbol{\varepsilon} | \mathbf{X} \sim \mathcal{N} \left( \mathbf{0}, \sigma^2_\epsilon \mathbf{I} \right)\]

This optional condition simplifies some statistical properties of the parameter estimators.

We can combine the requirements and restate them as the following:

The Data Generating Process \(Y = \beta_0 + \beta_1 X + \epsilon\) satisfies (UR.2) and (UR.3) if, conditionally on all \(\mathbf{X}\)'s: \(\mathbb{E} (\epsilon_i) = 0\) and \(\mathbb{V}{\rm ar}(\epsilon_i) = \sigma^2_\epsilon\), \(\forall i\); \(\mathbb{C}{\rm ov} (\epsilon_i, \epsilon_j) = 0\), \(\forall i \neq j\); and \(\mathbb{C}{\rm ov} (\epsilon_i, X_j) = \mathbb{E} (\epsilon_i X_j) = 0\), \(\forall i,j\).

The linear relationship \(Y = \beta_0 + \beta_1 X + \epsilon\) is also referred to as the regression line with an intercept \(\beta_0\) and a slope \(\beta_1\). From (UR.2) we have that the regression line coincides with the expected value of \(Y_i\), given \(X_i\).

In general, we do not know the true coefficient \(\beta_0\) and \(\beta_1\) values but we would like to estimate them from our sample data, which consists of points \((X_i, Y_i)\), \(i = 1,...,N\). We would also like to use the data in the best way possible to obtain an estimate of the regression: \[\widehat{Y} = \widehat{\beta}_0 + \widehat{\beta}_1 X\]

3.2.2 Derivation of the Ordinary Least Squares Estimator

Our random sample \((X_1, Y_1), ..., (X_N, Y_N)\) comes from the data generating process \(Y = \beta_0 + \beta_1 X + \epsilon\), which we can rewrite as: \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i,\ i = 1,...,N \] We want to use the data to obtain estimates of the intercept \(\beta_0\) and the slope \(\beta_1\). To achieve this, we will present a number of methods for deriving these estimates. We will assume that the process satisfies conditions (UR.1) - (UR.3).

Furthermore, we will use the following generated data sample for the estimation methods outlined in this section:

Example 3.1 Below we will assume that our data generating process satisfies (UR.1) - (UR.4) assumptions with the following parameters:
  • \(\beta_0 = 1\), \(\beta_1 = 0.5\)
  • \(X_i = i-1\), \(\epsilon_i \sim \mathcal{N}(0, 1^2), i = 1,...,51\)
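One way to generate such a sample is sketched below in Python (the variable names x, y, e and the random seed are illustrative choices, not fixed by the text):

import numpy as np

np.random.seed(123)  # arbitrary seed, for reproducibility of the sketch
# True parameters of the DGP
beta_0 = 1
beta_1 = 0.5
N = 51
# X_i = i - 1 for i = 1, ..., 51 and epsilon_i ~ N(0, 1)
x = np.arange(N)
e = np.random.normal(loc=0, scale=1, size=N)
# The DGP: Y = beta_0 + beta_1 * X + epsilon
y = beta_0 + beta_1 * x + e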

3.2.2.1 The Method of Moments (MM)

As we have seen before, a consequence of the (UR.2) condition is that \(\mathbb{E}(\epsilon) = 0\), as well as \(\mathbb{C}{\rm ov} (\epsilon, X) = \mathbb{E} \left(\epsilon X \right) - \mathbb{E}(\epsilon)\mathbb{E}(X) = \mathbb{E} \left(\epsilon X \right) = 0\). Using these properties and the linear relationship between \(Y\) and \(X\), the following relations: \[ \begin{cases} \mathbb{E}(\epsilon) &= 0 \\ \mathbb{E}(\epsilon X) &= 0 \\ \epsilon &= Y - \beta_0 - \beta_1 X \end{cases} \] can be written as: \[ \begin{cases} \mathbb{E}\left[Y - \beta_0 - \beta_1 X \right] &= 0 \\\\ \mathbb{E}\left[X \left(Y - \beta_0 - \beta_1 X \right) \right] &= 0 \end{cases} \] We have two unknown parameters and two equations, which we can use to estimate the unknown parameters.

Going back to our random sample, we can estimate the unknown parameters by replacing the expectation with its sample counterpart. Then, we want to choose the estimates \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) such that: \[\begin{equation} \begin{cases} \dfrac{1}{N} \sum_{i = 1}^N \left[Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 X_i \right] &= 0 \\\\ \dfrac{1}{N} \sum_{i = 1}^N \left[X_i \left(Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 X_i \right) \right] &= 0 \end{cases} \tag{3.3} \end{equation}\]

This is an example of the method of moments estimation approach.

By setting: \[ \begin{aligned} \overline{Y} = \dfrac{1}{N} \sum_{i = 1}^N Y_i,\quad \overline{X} = \dfrac{1}{N} \sum_{i = 1}^N X_i \end{aligned} \] We can rewrite the first equation of (3.3) as: \[\begin{equation} \widehat{\beta}_0 = \overline{Y} - \widehat{\beta}_1 \overline{X} \tag{3.4} \end{equation}\] And plug it into the second equation of (3.3) to get: \[ \sum_{i = 1}^N X_i\left( Y_i - \overline{Y} + \widehat{\beta}_1 \overline{X}- \widehat{\beta}_1 X_i \right) = 0 \Longrightarrow \sum_{i = 1}^N X_i\left( Y_i - \overline{Y} \right) = \widehat{\beta}_1 \sum_{i = 1}^N X_i\left( X_i - \overline{X} \right) \] which gives us: \[ \widehat{\beta}_1 = \dfrac{\sum_{i = 1}^N X_i\left( Y_i - \overline{Y} \right)}{\sum_{i = 1}^N X_i\left( X_i - \overline{X} \right)} \]

Using the following properties: \[ \begin{aligned} \sum_{i = 1}^N X_i\left( X_i - \overline{X} \right) &= \sum_{i = 1}^N \left( X_i - \overline{X} \right)^2 \\ \sum_{i = 1}^N X_i\left( Y_i - \overline{Y} \right) &= \sum_{i = 1}^N \left( X_i - \overline{X} \right) \left( Y_i - \overline{Y} \right) \end{aligned} \] yields the following estimate of \(\beta_1\): \[\begin{equation} \widehat{\beta}_1 = \dfrac{\sum_{i = 1}^N \left( X_i - \overline{X} \right) \left( Y_i - \overline{Y} \right)}{\sum_{i = 1}^N \left( X_i - \overline{X} \right)^2} = \dfrac{\widehat{\mathbb{C}{\rm ov}}(X, Y)}{\widehat{\mathbb{V}{\rm ar}}(X)} \tag{3.5} \end{equation}\]

It is important to note that by this construction:

  • The sample mean of the OLS residuals is always zero.
  • The residuals are uncorrelated with \(X\): their sample covariance with \(X\) is zero.

The expectation is also referred to as the first moment, which we then estimate from a sample, hence the name - Method of Moments.

There is an alternative way of approaching our task: minimizing the sum of the squared residuals.

3.2.2.2 OLS - System of Partial Derivatives Method

Suppose that we choose \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) to minimize the sum of squared residuals: \[\text{RSS} = \sum_{i = 1}^N \widehat{\epsilon}_i^2 = \sum_{i = 1}^N \left( Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 X_i \right)^2\]

The term Ordinary Least Squares (OLS) comes from the fact that these estimates minimize the sum of squared residuals.

In order to minimize \(\text{RSS}\), we have to differentiate it with respect to \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) and equate the derivatives to zero in order to solve the following system of equations: \[\begin{equation} \begin{cases} \dfrac{\partial \text{RSS}}{\partial \widehat{\beta}_0} &= -2\sum_{i = 1}^N \left( Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 X_i \right) = 0\\ \dfrac{\partial \text{RSS}}{\partial \widehat{\beta}_1} &= -2\sum_{i = 1}^N X_i\left( Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 X_i \right) = 0 \end{cases} \tag{3.6} \end{equation}\]

If we multiply both equations by \(-\dfrac{1}{2N}\), we get the same equation system as in (3.3). So the solution here has the same expression as for the Method of Moments estimators, namely:

\[ \begin{cases} \widehat{\beta}_1 = \dfrac{\sum_{i = 1}^N \left( X_i - \overline{X} \right) \left( Y_i - \overline{Y} \right)}{\sum_{i = 1}^N \left( X_i - \overline{X} \right)^2} = \dfrac{\widehat{\mathbb{C}{\rm ov}}(X, Y)}{\widehat{\mathbb{V}{\rm ar}}(X)} \\\\ \widehat{\beta}_0 = \overline{Y} - \widehat{\beta}_1 \overline{X} \end{cases} \]

\(\widehat{\beta}_1\) and \(\widehat{\beta}_0\) derived in this way are called the OLS estimates of the linear regression parameters.

Below we continue our example and estimate the parameters from our generated sample data:
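A minimal Python sketch of one way these estimates can be computed, assuming the numpy arrays x and y from the sketch in Example 3.1 (the printed numbers match the output below only for the particular sample that was drawn):

# beta_1 estimate, eq. (3.5): covariance of (X, Y) divided by the variance of X;
# bias = True makes np.cov() use the same 1/N normalization as np.var()
beta_1_est = np.cov(x, y, bias=True)[0, 1] / np.var(x)
# beta_0 estimate, eq. (3.4)
beta_0_est = np.mean(y) - beta_1_est * np.mean(x)
print("Estimated beta_0 = {}. True beta_0 = 1".format(beta_0_est))
print("Estimated beta_1 = {}. True beta_1 = 0.5".format(beta_1_est))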

## Estimated beta_0 = 0.8385362094098401. True beta_0 = 1
## Estimated beta_1 = 0.5099221954865624. True beta_1 = 0.5

Note that bias = True calculates the population covariance (with division by \(N\)), whereas bias = False (the default) calculates the sample covariance (with \(N-1\) instead of \(N\)), and the variance function var() calculates the population variance by default. Since we are dividing the covariance by the variance, both need to be calculated with the same normalization - either both for the population or both for the sample - otherwise the ratio is distorted:

## With bias = False (default) the estimate of beta_1 is biased: 
##  0.5201206393962936
## With sample variance the estimate of beta_1 is unbiased: 
##  0.5099221954865624

We can further re-write the estimates in matrix notation.

3.2.2.3 OLS - The Matrix Method

It is convenient to use matrices when solving equation systems. Looking at our random sample equations: \[ \begin{cases} Y_1 &= \beta_0 + \beta_1 X_1 + \epsilon_1 \\ Y_2 &= \beta_0 + \beta_1 X_2 + \epsilon_2 \\ \vdots \\ Y_N &= \beta_0 + \beta_1 X_N + \epsilon_N \end{cases} \] which we can re-write in the following matrix notation: \[ \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{bmatrix} = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_N \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{bmatrix} \] or, in a more compact form: \[ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} \] where \(\mathbf{Y} = [Y_1,...,Y_N]^\top\), \(\boldsymbol{\varepsilon} = [\epsilon_1,...,\epsilon_N]^\top\), \(\boldsymbol{\beta} =[ \beta_0, \beta_1]^\top\) and \(\mathbf{X} = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_N \end{bmatrix}\).

As in the previous case, we want to minimize the sum of squared residuals: \[ \begin{aligned} RSS(\boldsymbol{\beta}) &= \boldsymbol{\varepsilon}^\top \boldsymbol{\varepsilon} \\ &= \left( \mathbf{Y} - \mathbf{X} \boldsymbol{\beta} \right)^\top \left( \mathbf{Y} - \mathbf{X} \boldsymbol{\beta} \right) \\ &= \mathbf{Y} ^\top \mathbf{Y} - \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{Y} - \mathbf{Y}^\top \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} \rightarrow \min_{\beta_0, \beta_1} \end{aligned} \] After using some matrix calculus and equating the partial derivative to zero: \[ \dfrac{\partial RSS(\widehat{\boldsymbol{\beta}})}{\partial \widehat{\boldsymbol{\beta}}} = -2 \mathbf{X}^\top \mathbf{Y} + 2 \mathbf{X}^\top \mathbf{X} \widehat{\boldsymbol{\beta}} = 0 \] gives us the OLS estimator:

\[\begin{equation} \widehat{\boldsymbol{\beta}} = \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{Y} \tag{3.7} \end{equation}\]

Note that with the matrix notation we estimate both parameters at the same time, whereas with the Method of Moments, or by taking partial derivatives from the RSS and solving the equation system, we needed to first estimate \(\widehat{\beta}_1\) via (3.5) and then plug it in (3.4) in order to estimate \(\widehat{\beta}_0\).

In the univariate regression case, we have that: \[ \begin{aligned} \mathbf{X}^\top \mathbf{X} &= \begin{bmatrix} 1 & 1 & ... & 1\\ X_1 & X_2 & ... & X_N \end{bmatrix} \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots \\ 1 & X_N \end{bmatrix} = \begin{bmatrix} N & \sum_{i = 1}^N X_i \\ \sum_{i = 1}^N X_i & \sum_{i = 1}^N X_i^2 \end{bmatrix} \\ \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} &= \dfrac{1}{N\sum_{i = 1}^N X_i^2 - \left( \sum_{i = 1}^N X_i \right)^2} \begin{bmatrix} \sum_{i = 1}^N X_i^2 & - \sum_{i = 1}^N X_i\\ -\sum_{i = 1}^N X_i & N \end{bmatrix}\\ \mathbf{X}^\top \mathbf{Y} &= \begin{bmatrix} \sum_{i = 1}^N Y_i \\ \sum_{i = 1}^N X_i Y_i \end{bmatrix} \end{aligned} \] and: \[ \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{Y} = \begin{bmatrix} \dfrac{N \left( \dfrac{1}{N}\sum_{i = 1}^N X_i^2 \right) \cdot N \left( \dfrac{1}{N} \sum_{i = 1}^N Y_i \right) - N \left( \dfrac{1}{N}\sum_{i = 1}^N X_i\right) \cdot N \left( \dfrac{1}{N} \sum_{i = 1}^N X_i Y_i \right)}{N^2\left( \dfrac{1}{N}\sum_{i = 1}^N X_i^2 \right) - N^2\left( \dfrac{1}{N}\sum_{i = 1}^N X_i \right)^2} \\ \dfrac{N^2 \dfrac{1}{N}\sum_{i = 1}^N X_i Y_i - N \left( \dfrac{1}{N} \sum_{i = 1}^N X_i \right) \cdot N \left( \dfrac{1}{N} \sum_{i = 1}^N Y_i \right)}{N^2\left( \dfrac{1}{N}\sum_{i = 1}^N X_i^2 \right) - N^2\left( \dfrac{1}{N}\sum_{i = 1}^N X_i \right)^2} \end{bmatrix} \] Which gives us the following expression for the OLS estimators for the univariate regression: \[ \widehat{\boldsymbol{\beta}} = \begin{bmatrix} \widehat{\beta_0} \\ \widehat{\beta_1} \end{bmatrix} = \begin{bmatrix} \dfrac{\overline{X^2} \cdot\overline{Y} - \overline{X} \cdot\overline{XY}}{\widehat{\mathbb{V}{\rm ar}}(X)} \\ \dfrac{\widehat{\mathbb{C}{\rm ov}}(X, Y)}{\widehat{\mathbb{V}{\rm ar}}(X)} \end{bmatrix} \]

Note that in this case, we have an exact expression for \(\widehat{\beta}_0\), which does not require estimating \(\widehat{\beta}_1\) beforehand.

Defining the estimates in the matrix notation is not only convenient (as they can be generalized to the multivariate case) but, in some cases, even faster to compute via software.

Below we will continue our example and show not only how to calculate the parameters manually using vectorization, but also via the built-in OLS estimation methods.
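For instance, a hedged Python sketch of eq. (3.7), reusing the x and y arrays from before:

# Design matrix: a column of ones (for the intercept) next to the column of X values
X_mat = np.column_stack((np.ones(len(x)), x))
# OLS estimator, eq. (3.7): (X'X)^{-1} X'Y
beta_est = np.dot(np.dot(np.linalg.inv(np.dot(X_mat.T, X_mat)), X_mat.T), y)
print(beta_est)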

Alternatively, Python 3.5 introduced the @ symbol (see PEP465, or Python 3.5 release notes) for matrix multiplication. This allows us to rewrite the above expression as:
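A sketch of the same calculation with @, reusing X_mat and y from the previous block:

beta_est = np.linalg.inv(X_mat.T @ X_mat) @ X_mat.T @ y
print(beta_est)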

## [0.83853621 0.5099222 ]

Note that this is not compatible with earlier Python versions. On the other hand, using the @ symbol is faster on larger (and higher-dimensional) matrices.

To further complicate things, according to the official documentation:

The @ (at) operator is intended to be used for matrix multiplication. No builtin Python types implement this operator.

In other words, although @ is a built-in operator in Python itself, it only works with types from external packages that have implemented it (like NumPy). This is in part because Python does not let you define new operators; it comes with a predefined set of operators, which can be overloaded.

Both R and Python have packages with these estimation methods already defined - we only need to specify the data. We will explore these functions in more detail at the end of the chapter, but for now we will show only the relevant output for the currently discussed topic below:
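As an illustration, in Python the statsmodels package provides such an estimator (a hedged sketch; the equivalent R workflow would use lm(y ~ x) with coef() and summary()):

import statsmodels.api as sm

# add_constant() appends the column of ones needed for the intercept
lm_fit = sm.OLS(y, sm.add_constant(x)).fit()
print(lm_fit.params)  # estimated coefficients: the intercept first, then the slope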

3.2.3 Relationship between estimates, residuals, fitted and actual values

After obtaining the estimates \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\), we may want to examine the following values:

  • The fitted values of \(Y\), which are defined as the following OLS regression line (or more generally, the estimated regression line): \[\widehat{Y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 X_i\] where \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) are estimated via OLS. By definition, each fitted value of \(\widehat{Y}_i\) is on the estimated OLS regression line.
  • The residuals, which are defined as the difference between the actual and fitted values of \(Y\): \[\widehat{\epsilon}_i = \widehat{e}_i = Y_i - \widehat{Y}_i = Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 X_i\] which are hopefully close to the errors \(\epsilon_i\).
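In code, these two quantities can be obtained directly from the estimated coefficients (a minimal Python sketch, assuming x, y, beta_0_est and beta_1_est from the earlier sketches):

# Fitted values: the points on the estimated regression line
y_fit = beta_0_est + beta_1_est * x
# Residuals: actual values minus fitted values
resid = y - y_fit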

Below we present a graphical representation of some of these terms:

We see that the estimated regression is very close to the true underlying population regression (i.e. the true \(\mathbb{E} (Y | X)\)).

Looking closer at the fitted values in a subset of the data, we can see where the residuals originate and how the fitted values compare to the sample data:

plot(x = x, y = y, pch = 19, col = "black", 
     xlim = c(5, 15),
     ylim = c(0, 10), yaxs = "i", #Y axis starts at 0, ends at max(Y)
     xaxt = "n", yaxt = "n", # Do not plot initial ticks in axis
     xlab = "X", ylab = "Y") # Label the axis
title("Scatter diagram of (X,Y) sample data and the regression line")
lines(x = x, y = y_fit, col = "darkgreen", lty = 2)
# Add Axis labels and ticks at specific positions:
axis(side = 1, at = c(x[8], x[12]), labels = FALSE)
text(x = c(x[8], x[12]), y = -0.2, pos = 1, xpd = TRUE,
     labels = c(expression(x[8]), expression(x[12])))
#
#
# Add vertical lines:
segments(x0 = x[8], y0 = 0, x1 = x[8], y1 = y_fit[8], lty = 2)
segments(x0 = x[8], y0 = 0, x1 = x[8], y1 = y[8])
segments(x0 = x[12], y0 = 0, x1 = x[12], y1 = y[12])
#
#
# Add some curly brackets:
pBrackets::brackets(x1 = x[8], y1 = 0, x2 = x[8], y2 = y[8],
                    lwd = 1, h = 0.33, col = "red")
pBrackets::brackets(x1 = x[8], y1 = y_fit[8], x2 = x[8], y2 = 0,
                    lwd = 1, h = 0.33, col = "red")
pBrackets::brackets(x1 = x[12], y1 = y_fit[12], x2 = x[12], y2 = 0,
                    lwd = 1, h = 0.33, col = "red")
pBrackets::brackets(x1 = x[12], y1 = y[12], x2 = x[12], y2 = y_fit[12],
                    lwd = 1, h = 0.33, ticks = 0.25, col = "red")
#
#
# Add Actual, Fitted and Residual indicator text:
text(x = x[8] * 0.94, y = y[8] / 2, labels = expression(Y[8]))
text(x = x[8] * 1.07, y = y_fit[8] / 2, labels = expression(hat(Y)[8]))
text(x = x[12] * 1.1, y = y_fit[12] / 2,
     labels = expression(hat(Y)[12] == "fitted value"))
text(x = x[12] * 1.08, y = y_fit[12] + (y[12] - y_fit[12]) * 0.8,
     labels = expression(hat(epsilon)[12] == "residual"))
#
#
# Add Regression line
text(x = x[9], y = y[11],
     labels = expression(hat(Y) == hat(beta)[0] + hat(beta)[1] * X))
arrows(x0 = x[9], y0 = y_fit[10]*1.05, x1 = x[10], y1 = y_fit[10], length = 0.1, col = "red")

Other mathematical notations when plotting in R are available here.

_ = plt.figure(num = 1, figsize=(10, 8))
_ = plt.plot(x, y, linestyle = "None", marker = "o", markerfacecolor = 'black')
_ = plt.ylim(ymin = 0, ymax = 12)
_ = plt.xlim(xmin = 10, xmax = 20)
_ = plt.title("Scatter diagram of (X,Y) sample data and the regression line")
_ = plt.plot(x, y_dgp, linestyle = "--", color = "green")
# Add Axis labels and ticks at specific positions:
_ = plt.xticks([x[12], x[16]], ["$X_{13}$", "$X_{17}$"])
# Add vertical lines:
_ = plt.plot([x[12], x[12]], [0, y_fit[12]], '--', color = "black")
_ = plt.plot([x[12], x[12]], [0, y[12]], '-', color = "black")
_ = plt.plot([x[16], x[16]], [0, y[16]], '-', color = "black")
# Add some brackets:
_ = plt.annotate("", xy = (x[12]*0.955, y[12] / 2), 
             xytext = (x[12]*0.975, y[12] / 2),
             arrowprops = dict(arrowstyle = "]-, widthA=10.3,lengthA=1",
                               connectionstyle = "arc", color='red'))
_ = plt.annotate("", xy = (x[12]*1.015, y_fit[12] / 2),
             xytext = (x[12]*1.035, y_fit[12] / 2),
             arrowprops = dict(arrowstyle = "-[, widthB=12.5,lengthB=1",
                               connectionstyle = "arc", color='red'))
_ = plt.annotate("", xy = (x[16]*1.015, y_fit[16] / 2), 
             xytext = (x[16]*1.035, y_fit[16] / 2),
             arrowprops = dict(arrowstyle = "-[, widthB=16.2,lengthB=1.2",
                               connectionstyle = "arc", color='red'))
_ = plt.annotate("", xy = (x[16]*1.015, (y[16] + y_fit[16]) / 2),
             xytext = (x[16]*1.035, (y[16] + y_fit[16]) / 2),
             arrowprops = dict(arrowstyle = "-[, widthB=2.5,lengthB=1.2",
                               connectionstyle = "arc", color='red'))
# Add Actual, Fitted and Residual indicator text:
_ = plt.text(x[12]*0.93, y[12] / 2.1, r'$Y_{13}$', fontsize = 10)
_ = plt.text(x[12]*1.04, y_fit[12] / 2.1, r'$\widehat{Y}_{13}$', fontsize = 10)
_ = plt.text(x[16]*1.04, y_fit[16] / 2.1, 
  r'$\widehat{Y}_{17}$ = fitted value', fontsize = 10)
_ = plt.text(x[16]*1.04, (y[16] + y_fit[16]) / 2.02, 
  r'$\widehat{\epsilon}_{17}$ = residual', fontsize = 10)
# Add Regression line
_ = plt.text(x[17] - 0.2, y[20], 
  r'$\widehat{Y} = \widehat{\beta}_0 + \widehat{\beta}_1 X$', fontsize = 10)
_ = plt.annotate("", xy = (x[19], y_fit[19]), xytext = (x[18]*0.99, y[20]),
            arrowprops = dict(arrowstyle = "->", 
                              connectionstyle = "arc", color='red'))
plt.show()

3.2.4 Properties of the OLS estimator

From the construction of the OLS estimators the following properties apply to the sample:

  1. The sum (and by extension, the sample average) of the OLS residuals is zero: \[\begin{equation} \sum_{i = 1}^N \widehat{\epsilon}_i = 0 \tag{3.8} \end{equation}\] This follows from the first equation of (3.3). The OLS estimates \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) are chosen such that the residuals sum up to zero, regardless of the underlying sample data.
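A one-line Python check of this property (using resid from the earlier sketch; the analogous R check sums the residuals of the fitted model):

print("Sum of the residuals: {}".format(np.sum(resid)))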
## [1] "Sum of the residuals: -1.05582209641852e-13"
## Sum of the residuals: 2.042810365310288e-14

We see that the sum is very close to zero.

  2. The sample covariance between the regressors and the OLS residuals is zero: \[\begin{equation} \sum_{i = 1}^N X_i \widehat{\epsilon}_i = 0 \tag{3.9} \end{equation}\] This follows from the second equation of (3.3). Because the sample average of the OLS residuals is zero, \(\sum_{i = 1}^N X_i \widehat{\epsilon}_i\) is proportional to the sample covariance between \(X_i\) and \(\widehat{\epsilon}_i\).
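Again, a short Python check (np.cov() returns the covariance matrix, whose off-diagonal element is the sample covariance):

print("Sum of X*resid: {}".format(np.sum(x * resid)))
print("Sample covariance of X and residuals: {}".format(np.cov(x, resid)[0, 1]))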
## [1] "Sum of X*resid: -5.92859095149834e-13"
## [1] "Sample covariance of X and residuals: 4.1182578180976e-14"
## Sum of X*resid: -3.765876499528531e-13
## Sample covariance of X and residuals: -1.7815748876159887e-14

We see that both the sum and the sample covariance are very close to zero.

  3. The point \((\overline{X}, \overline{Y})\) is always on the OLS regression line - if we calculate \(\widehat{\beta}_0 + \widehat{\beta}_1 \overline{X}\), the resulting value is equal to \(\overline{Y}\).
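A short Python check of this property:

print("Predicted value with mean(X): {}".format(beta_0_est + beta_1_est * np.mean(x)))
print("Sample mean of Y: {}".format(np.mean(y)))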
## [1] "Predicted value with mean(X): 13.4534765045994"
## [1] "Sample mean of Y: 13.4534765045994"
## Predicted value with mean(X): 13.586591096573901
## Sample mean of Y: 13.5865910965739

We see that the predicted value is identical to the sample mean of \(Y\).

However, these properties alone do not justify using the OLS method instead of some other competing estimator. The main advantage of the OLS estimators can be summarized by the following Gauss-Markov theorem:

Gauss-Markov theorem

Under the assumption that the conditions (UR.1) - (UR.3) hold true, the OLS estimators \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) are BLUE (Best Linear Unbiased Estimator) and Consistent.

3.2.4.1 What is an Estimator?

We will restate what we have said at the end of section 3.1 - an estimator is a rule that can be applied to any sample of data to produce an estimate. In other words the estimator is the rule and the estimate is the result.

So, eq. (3.7) is the rule and therefore an estimator.

The remaining components of the acronym BLUE are provided below.

3.2.4.2 OLS estimators are Linear

From the specification of the relationship between \(\mathbf{Y}\) and \(\mathbf{X}\) (using the matrix notation for generality): \[ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} \] and from eq. (3.7), we see that \(\widehat{\boldsymbol{\beta}} = \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{Y}\) is a linear function of \(\mathbf{Y}\), i.e. the OLS estimators are linear estimators.

3.2.4.3 OLS estimators are Unbiased

Using the matrix notation for the sample linear equations (\(\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}\)) and plugging it into eq. (3.7) gives us the following:

\[ \begin{aligned} \widehat{\boldsymbol{\beta}} &= \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{Y} \\ &= \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \left( \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} \right) \\ &= \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} + \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \boldsymbol{\varepsilon} \\ &= \boldsymbol{\beta} + \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \boldsymbol{\varepsilon} \end{aligned} \]

If we take the expectation of both sides and use the law of total expectation:

\[ \begin{aligned} \mathbb{E} \left[ \widehat{\boldsymbol{\beta}} \right] &= \boldsymbol{\beta} + \mathbb{E} \left[ \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \boldsymbol{\varepsilon} \right] \\ &= \boldsymbol{\beta} + \mathbb{E} \left[ \mathbb{E} \left( \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \boldsymbol{\varepsilon} \biggr\rvert \mathbf{X}\right)\right] \\ &= \boldsymbol{\beta} + \mathbb{E} \left[ \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbb{E} \left( \boldsymbol{\varepsilon} | \mathbf{X}\right)\right] \\ &= \boldsymbol{\beta} \end{aligned} \] since \(\mathbb{E} \left( \boldsymbol{\varepsilon} | \mathbf{X}\right) = \mathbf{0}\) from (UR.2).

We have shown that \(\mathbb{E} \left[ \widehat{\boldsymbol{\beta}} \right] = \boldsymbol{\beta}\) - i.e., the OLS estimator \(\widehat{\boldsymbol{\beta}}\) is an unbiased estimator of \(\boldsymbol{\beta}\).

Unbiasedness does not guarantee that the estimate we get with any particular sample is equal (or even very close) to \(\boldsymbol{\beta}\). It means that if we could repeatedly draw random samples from the population and compute the estimate each time, then the average of these estimates would be (very close to) \(\boldsymbol{\beta}\). However, in most applications we have just one random sample to work with, though as we will see later on, there are methods for creating additional samples from the available data (usually by creating and analysing different subsamples of the original data).

3.2.4.4 OLS estimators are Best (Efficient)

When there is more than one unbiased method of estimation to choose from, that estimator which has the lowest variance is the best. In other words, we want to show that OLS estimators are best in the sense that \(\widehat{\boldsymbol{\beta}}\) are efficient estimators of \(\boldsymbol{\beta}\) (i.e. they have the smallest variance).

To do so, we will calculate the variance - the average squared deviation from the mean - as follows (remember that for the OLS estimators, condition (UR.3) holds true):

From the proof of unbiasedness of the OLS we have that: \[ \begin{aligned} \widehat{\boldsymbol{\beta}} = \boldsymbol{\beta} + \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \boldsymbol{\varepsilon} \Longrightarrow \widehat{\boldsymbol{\beta}} - \boldsymbol{\beta} = \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \boldsymbol{\varepsilon} \end{aligned} \] Which we can then use this expression for calculating the variance-covariance matrix of the OLS estimator: \[ \begin{aligned} \mathbb{V}{\rm ar} (\widehat{\boldsymbol{\beta}}) &= \mathbb{E} \left[(\widehat{\boldsymbol{\beta}} - \mathbb{E}(\widehat{\boldsymbol{\beta}}))(\widehat{\boldsymbol{\beta}} - \mathbb{E}(\widehat{\boldsymbol{\beta}}))^\top \right] \\ &= \mathbb{E} \left[(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta})(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta})^\top \right] \\ &= \mathbb{E} \left[ \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \boldsymbol{\varepsilon} \left( \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \boldsymbol{\varepsilon} \right)^\top \right] \\ &= \mathbb{E} \left[ \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^\top \mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \right] \\ &= \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbb{E} \left[ \boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^\top\right] \mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \\ &= \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \left(\sigma^2 \mathbf{I} \right) \mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \\ &= \sigma^2 \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{X} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \\ &= \sigma^2 \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \end{aligned} \] For the univariate case, we have already calculated \(\left( \mathbf{X}^\top \mathbf{X}\right)^{-1}\), which leads to: \[ \begin{cases} \mathbb{V}{\rm ar} (\widehat{\beta}_0) &= \sigma^2\cdot \dfrac{\sum_{i = 1}^N X_i^2}{N\sum_{i = 1}^N \left( X_i - \overline{X} \right)^2} \\\\ \mathbb{V}{\rm ar} (\widehat{\beta}_1) &= \sigma^2\cdot \dfrac{1}{\sum_{i = 1}^N \left( X_i - \overline{X} \right)^2} \end{cases} \] Which correspond to the diagonal elements of the variance-covariance matrix: \[\mathbb{V}{\rm ar} (\widehat{\boldsymbol{\beta}}) = \begin{bmatrix} \mathbb{V}{\rm ar} (\widehat{\beta}_0) & \mathbb{C}{\rm ov} (\widehat{\beta}_0, \widehat{\beta}_1) \\ \mathbb{C}{\rm ov} (\widehat{\beta}_1, \widehat{\beta}_0) & \mathbb{V}{\rm ar} (\widehat{\beta}_1) \end{bmatrix}\] Note that we applied the following relationship: \(N\sum_{i = 1}^N X_i^2 - \left( \sum_{i = 1}^N X_i \right)^2 = N\sum_{i = 1}^N \left( X_i - \overline{X} \right)^2\).

Next, assume that we have some other estimator of \(\boldsymbol{\beta}\), which is also unbiased and can be expressed as: \[ \widetilde{\boldsymbol{\beta}} = \left[ \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top + \mathbf{D}\right] \mathbf{Y} =\mathbf{C}\mathbf{Y} \] Then, since \(\mathbb{E}\left[ \boldsymbol{\varepsilon} \right] = \boldsymbol{0}\): \[ \begin{aligned} \mathbb{E}\left[ \widetilde{\boldsymbol{\beta}} \right] &= \mathbb{E}\left[ \left( \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top + \mathbf{D}\right) \left( \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}\right) \right] \\ &= \left( \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top + \mathbf{D}\right) \mathbb{E}\left[ \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} \right] \\ &= \left( \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top + \mathbf{D}\right)\mathbf{X} \boldsymbol{\beta} \\ &= \left( \mathbf{I} + \mathbf{D}\mathbf{X}\right) \boldsymbol{\beta} \\ &= \boldsymbol{\beta} \iff \mathbf{D}\mathbf{X} = \boldsymbol{0} \end{aligned} \] In other words, \(\widetilde{\boldsymbol{\beta}}\) is unbiased if and only if \(\mathbf{D}\mathbf{X} = \boldsymbol{0}\). Using this fact, we can calculate its variance as: \[ \begin{aligned} \mathbb{V}{\rm ar} (\widetilde{\boldsymbol{\beta}}) &= \mathbb{V}{\rm ar} (\mathbf{C}\mathbf{Y}) \\ &= \mathbf{C}\mathbb{V}{\rm ar} (\mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon})\mathbf{C}^\top \\ &= \sigma^2 \mathbf{C}\mathbf{C}^\top \\ &= \sigma^2 \left( \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top + \mathbf{D}\right)\left( \mathbf{X}\left( \mathbf{X}^\top \mathbf{X}\right)^{-1} + \mathbf{D}^\top\right) \\ &= \sigma^2 \left[\left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{X}\left( \mathbf{X}^\top \mathbf{X}\right)^{-1} + \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} (\mathbf{D}\mathbf{X})^\top + \mathbf{D} \mathbf{X}\left( \mathbf{X}^\top \mathbf{X}\right)^{-1} + \mathbf{D}\mathbf{D}^\top\right] \\ &= \sigma^2 \left[ \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} + \mathbf{D}\mathbf{D}^\top\right] \\ &= \mathbb{V}{\rm ar} (\widehat{\boldsymbol{\beta}}) + \sigma^2 \mathbf{D}\mathbf{D}^\top \geq \mathbb{V}{\rm ar} (\widehat{\boldsymbol{\beta}}) \end{aligned} \] since \(\mathbf{D}\mathbf{D}^\top\) is a positive semidefinite matrix (and \(\sigma^2 > 0\)).

This means that \(\widehat{\boldsymbol{\beta}}\) has the smallest variance.

3.2.4.5 Estimating the variance parameter of the error term

We see an immediate problem from the OLS estimator variance formulas - we do not know the true error variance \(\sigma^2\). However, we can estimate it by calculating the sample residual variance: \[ \widehat{\sigma}^2 = s^2 = \dfrac{\widehat{\boldsymbol{\varepsilon}}^\top \widehat{\boldsymbol{\varepsilon}}}{N - 2} = \dfrac{1}{N-2} \sum_{i = 1}^N \widehat{\epsilon}_i^2 \]

Note that if we take \(N\) instead of \(N-2\) for the univariate regression case in the denominator, then the variance estimate would be biased. This is because the variance estimator would not account for two restrictions that must be satisfied by the OLS residuals, namely (3.8) and (3.9): \[ \sum_{i = 1}^N \widehat{\epsilon}_i = 0,\quad \sum_{i = 1}^N \widehat{\epsilon}_i X_i = 0 \] So, we take \(N - 2\) instead of \(N\), because of the number of restrictions on the residuals. Using \(\widehat{\sigma}^2\) allows us to calculate the estimate of \(\mathbb{V}{\rm ar} (\widehat{\boldsymbol{\beta}})\), i.e. we can calculate \(\widehat{\mathbb{V}{\rm ar}} (\widehat{\boldsymbol{\beta}})\). One way to view these restrictions is this: if we know \(N-2\) of the residuals, we can always get the other two residuals by using the restrictions implied by (3.8) and (3.9).

Thus, there are only \(N-2\) degrees of freedom in the OLS residuals, as opposed to \(N\) degrees of freedom in the errors.

Note that this is an estimated variance. Nevertheless, it is a key component in assessing the accuracy of the parameter estimates (when calculating test statistics and confidence intervals).

Since we estimate \(\widehat{\boldsymbol{\beta}}\) from a random sample, the estimator \(\widehat{\boldsymbol{\beta}}\) is a random variable as well. We can measure the uncertainty of \(\widehat{\boldsymbol{\beta}}\) via its standard deviation. This is the standard error of our estimate of \(\boldsymbol{\beta}\):

The square roots of the diagonal elements of the variance-covariance matrix \(\widehat{\mathbb{V}{\rm ar}} (\widehat{\boldsymbol{\beta}})\) are called the standard errors (se) of the corresponding OLS estimators; they estimate the standard deviation of \(\widehat{\beta}_i\) around \(\beta_i\): \[ \text{se}(\widehat{\beta}_i) = \sqrt{\widehat{\mathbb{V}{\rm ar}} (\widehat{\beta}_i)} \]

The standard errors describe the accuracy of an estimator (the smaller, the better). The standard errors are measures of the sampling variability of the least squares estimates \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) in repeated samples - if we collect a number of different data samples, the OLS estimates will be different for each sample. As such, the OLS estimators are random variables and have their own distribution.

See section 3.2.4.7 for an example of generating many different random samples, estimating the parameters, and visualizing the parameter estimates and their distribution.

Now is also a good time to highlight the difference between the errors and the residuals.

  • The random sample, taken from a Data Generating Process (i.e. the population), is described via \[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\] where \(\epsilon_i\) is the error for observation \(i\).
  • After estimating the unknown parameters \(\beta_0\), \(\beta_1\), we can re-write the equation as: \[ Y_i = \widehat{\beta}_0 + \widehat{\beta}_1 X_i + \widehat{\epsilon}_i\] where \(\widehat{\epsilon}_i\) is the residual for observation \(i\).

The errors show up in the underlying (i.e. true) DGP equation, while the residuals show up in the estimated equation. The errors are never observed, while the residuals are calculated from the data.

We can also re-write the residuals in terms of the error term and the difference between the true and estimated parameters: \[ \begin{aligned} \widehat{\epsilon}_i &= Y_i - \widehat{Y}_i\\ &= \beta_0 + \beta_1 X_i + \epsilon_i - (\widehat{\beta}_0 + \widehat{\beta}_1 X_i) \\ &= \epsilon_i - \left( \widehat{\beta}_0 - \beta_0\right) - \left( \widehat{\beta}_1 - \beta_1\right) X_i \end{aligned} \]

Going back to our example, we can estimate the standard errors of our coefficients following the formulas in this section:
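A hedged Python sketch of these calculations, reusing X_mat, y and y_fit from the earlier sketches:

# Estimate of the error variance, dividing by N - 2 degrees of freedom
resid = y - y_fit
sigma2_est = np.sum(resid**2) / (len(y) - 2)
# Estimated variance-covariance matrix of the OLS estimator
beta_vcov_est = sigma2_est * np.linalg.inv(X_mat.T @ X_mat)
# Standard errors: square roots of the diagonal elements
print(np.sqrt(np.diag(beta_vcov_est)))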

We can also use the built-in functions, just like we did with the coefficients:

##              Std. Error
## (Intercept) 0.266578700
## x           0.009188728

Note: use names(out) and str(out) to examine the summary() object and see what data can be accessed. Using drop = FALSE allows us to keep the matrix structure of the output.

## [0.26999713 0.00930656]

Note: the b in bse stands for the parameter vector \(\boldsymbol{\beta}\), and se for the standard errors.

This highlights a potential problem, which will be addressed in later chapters concerning model adequacy/goodness of fit: if the residuals are large (since their mean is always zero, this means that their estimated variance is large), then the standard errors of the coefficients are large as well.

3.2.4.6 OLS estimators are Consistent

A consistent estimator has the property that, as the number of data points used to estimate the parameters increases (i.e. \(N \rightarrow \infty\)), the estimates converge in probability to the true parameters, i.e.:

Definition 3.2 Let \(W_N\) be an estimator of a parameter \(\theta\) based on a sample \(Y_1,...,Y_N\). Then we say that \(W_N\) is a consistent estimator of \(\theta\) if \(\forall \epsilon > 0\):

\[ \mathbb{P} \left( |W_N - \theta| > \epsilon\right) \rightarrow 0,\text{ as } N \rightarrow \infty \] We can denote this as \(W_N \xrightarrow{P} \theta\) or \(\text{plim} (W_N) = \theta\). If \(W_N\) is not consistent, then we say that \(W_N\) is inconsistent.

Unlike unbiasedness, consistency involves the behavior of the sampling distribution of the estimator as the sample size \(N\) gets large and the distribution of \(W_N\) becomes more and more concentrated about \(\theta\). In other words, for larger sample sizes, \(W_N\) is less and less likely to be very far from \(\theta\). An inconsistent estimator does not help us learn about \(\theta\), regardless of the size of the data sample. For this reason, consistency is a minimal requirement of an estimator used in statistics or econometrics.

Unbiased estimators are not necessarily consistent, but those whose variances shrink to zero as the sample size grows are consistent. In other words:

If \(W_N\) is an unbiased estimator of \(\theta\) and \(\mathbb{V}{\rm ar}(W_N) \rightarrow 0\) as \(N \rightarrow \infty\), then \(\text{plim}(W_N) = \theta\)

Example 3.2 We will show that the sample average of a random sample drawn from a population with mean \(\mu\) and variance \(\sigma^2\) is a consistent estimator.

First, we will show that \(\overline{Y} = \dfrac{1}{N} \sum_{i = 1}^N Y_i\) is an unbiased estimator of the population mean \(\mu\): \[ \begin{aligned} \mathbb{E} \left( \overline{Y} \right) &= \dfrac{1}{N} \mathbb{E} \left( \sum_{i = 1}^N Y_i\right) = \dfrac{1}{N} \sum_{i = 1}^N \mathbb{E} \left( Y_i\right) \\ &= \dfrac{1}{N} \sum_{i = 1}^N \mu = \dfrac{1}{N} \cdot N \cdot \mu \\ &= \mu \end{aligned} \] We can now obtain the variance of the sample average (assuming that \(Y_1,...,Y_N\) are pairwise uncorrelated r.v.’s): \[ \begin{aligned} \mathbb{V}{\rm ar} (\overline{Y}) &= \mathbb{V}{\rm ar} (\overline{Y}) = \mathbb{V}{\rm ar} \left( \dfrac{1}{N} \sum_{i = 1}^N Y_i \right ) \\ &= \dfrac{1}{N^2} \mathbb{V}{\rm ar} \left( \sum_{i = 1}^N Y_i \right ) = \dfrac{1}{N^2} \sum_{i = 1}^N \mathbb{V}{\rm ar} \left( Y_i \right ) \\ &= \dfrac{1}{N^2} \sum_{i = 1}^N \sigma^2 = \dfrac{N \sigma^2}{N^2} \\ &= \dfrac{\sigma^2}{N} \end{aligned} \] Therefore, \(\mathbb{V}{\rm ar} (\overline{Y}) = \dfrac{\sigma^2}{N} \rightarrow 0\), as \(N \rightarrow \infty\). So, \(\overline{Y}\) is a consistent estimator of \(\mu\).

Example 3.3 Unbiased but not consistent: Assume that we select \(Y_1\) as an estimator of \(\mu\). This estimator is unbiased, since \(\mathbb{E} (Y_1) = \mu\). However, it is not consistent, since it does not become more concentrated around \(\mu\) as the sample size increases - regardless of the sample size, \(Y_1 \sim \mathcal{N}(\mu, \sigma^2)\).
Example 3.4 Consistent but not unbiased: Assume that we want to estimate \(\sigma^2\) using a sample \(Y_1,...,Y_N\) drawn i.i.d. from \(\mathcal{N}(\mu, \sigma^2)\). For our estimator, we select: \[ \widehat{\sigma}^2 = \dfrac{1}{N} \sum_{i = 1}^N (Y_i - \overline{Y})^2 \] In this case:

\[ \begin{aligned} \mathbb{E} [\widehat{\sigma}^2] = \mathbb{E} \left[ \dfrac{1}{N} \sum_{i = 1}^N (Y_i - \overline{Y})^2\right] = \dfrac{1}{N} \mathbb{E} \left[ \sum_{i = 1}^N Y_i^2 - 2 \sum_{i = 1}^N Y_i \overline{Y} + \sum_{i = 1}^N \overline{Y}^2 \right] \end{aligned} \] We know that \(\sum_{i = 1}^N Y_i = N \cdot \overline{Y}\) and \(\sum_{i = 1}^N \overline{Y}^2 = N \cdot \overline{Y}^2\). Hence: \[ \begin{aligned} \mathbb{E} [\widehat{\sigma}^2] &= \dfrac{1}{N} \mathbb{E} \left[ \sum_{i = 1}^N Y_i^2 - 2 N \overline{Y}^2 + N \overline{Y}^2 \right] \\ &= \dfrac{1}{N} \mathbb{E} \left[ \sum_{i = 1}^N Y_i^2 - N \overline{Y}^2 \right] \\ &= \dfrac{1}{N} \mathbb{E} \left[ \sum_{i = 1}^N Y_i^2\right] - \mathbb{E} \left[ \overline{Y}^2 \right] \\ &= \mathbb{E} \left[ Y^2 \right] - \mathbb{E} \left[ \overline{Y}^2 \right] \end{aligned} \] Where we assume that all \(Y_i\) are i.i.d. drawn from the same distribution, hence \(\mathbb{E} \left[ Y_i^2 \right] = \mathbb{E} \left[ Y^2 \right], \forall i = 1,...,N\). From the variance definition it follows that: \[ \begin{aligned} \sigma^2_Y &= \sigma^2 = \mathbb{E} \left[ Y^2 \right] - \mathbb{E} \left[ Y \right]^2 \\ \sigma^2_\overline{Y} &= \mathbb{E} \left[ \overline{Y}^2 \right] - \mathbb{E} \left[ \overline{Y} \right]^2 \end{aligned} \] Thus, using the fact that the samples are drawn i.i.d. (i.e. uncorrelated): \[ \begin{aligned} \mathbb{E} [\widehat{\sigma}^2] &= \mathbb{E} \left[ Y^2 \right] - \mathbb{E} \left[ \overline{Y}^2 \right] \\ &= (\sigma^2 + \mu^2) - (\sigma^2_\overline{Y} + \mu^2) \\ &= \sigma^2 - \sigma^2_\overline{Y} \\ &= \sigma^2 - \mathbb{V}{\rm ar}(\overline{Y}) \\ &= \sigma^2 - \dfrac{1}{N^2}\mathbb{V}{\rm ar}\left(\sum_{i = 1}^N Y_i \right) \\ &= \sigma^2 - \dfrac{1}{N}\mathbb{V}{\rm ar}\left(Y \right) \\ &= \dfrac{N-1}{N}\sigma^2 \end{aligned} \] In other words, \(\mathbb{E} [\widehat{\sigma}^2] \neq \sigma^2\) and the estimator is biased. We can similarly show that: \[ \mathbb{V}{\rm ar}\left[ \widehat{\sigma}^2\right] = \dfrac{2 \sigma^2 (N-1)}{N^2} \] As the sample size increases, we note that \(\mathbb{V}{\rm ar}\left[ \widehat{\sigma}^2\right] \rightarrow 0\) as \(N \rightarrow \infty\), thus this biased estimator is consistent (furthermore, in this example the bias also decreases as \(N \rightarrow \infty\)).

Some useful properties of consistent estimators are presented below.

Let \(\text{plim}(T_N) = \alpha\) and \(\text{plim}(U_N) = \beta\), then:

  • \(\text{plim}(T_N + U_N) = \alpha + \beta\);
  • \(\text{plim}(T_N \cdot U_N) = \alpha \cdot \beta\);
  • If \(\beta \neq 0\), then \(\text{plim} \left( \dfrac{T_N}{U_N} \right) = \dfrac{\alpha}{\beta}\);

Going back to our OLS estimators \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) we can use the proof of unbiasedness of the OLS estimators and express \(\widehat{\boldsymbol{\beta}}\) as: \[ \begin{aligned} \widehat{\boldsymbol{\beta}} &= \boldsymbol{\beta} + \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \boldsymbol{\varepsilon} \\ &= \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \dfrac{1}{N^2 \left( \dfrac{1}{N} \sum_{i = 1}^N X_i ^2 - \left( \dfrac{1}{N} \sum_{i = 1}^N X_i\right)^2\right)} \begin{bmatrix} \sum_{i = 1}^N X_i^2 & - \sum_{i = 1}^N X_i\\ -\sum_{i = 1}^N X_i & N \end{bmatrix} \begin{bmatrix} \sum_{i = 1}^N \epsilon_i \\ \sum_{i = 1}^N X_i \epsilon_i \end{bmatrix} = \\ &= \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \dfrac{1}{\dfrac{1}{N} \sum_{i = 1}^N X_i ^2 - \left( \dfrac{1}{N} \sum_{i = 1}^N X_i\right)^2} \begin{bmatrix} \left(\dfrac{1}{N} \sum_{i = 1}^N \epsilon_i \right) \left(\dfrac{1}{N} \sum_{i = 1}^N X_i^2 \right) - \left(\dfrac{1}{N} \sum_{i = 1}^N X_i \epsilon_i \right) \left(\dfrac{1}{N} \sum_{i = 1}^N X_i \right)\\ \dfrac{1}{N} \sum_{i = 1}^N X_i\epsilon_i - \left(\dfrac{1}{N} \sum_{i = 1}^N X_i \right)\left(\dfrac{1}{N} \sum_{i = 1}^N \epsilon_i \right) \end{bmatrix} \end{aligned} \] Then, as \(N \rightarrow \infty\) we have that: \[ \begin{aligned} \widehat{\boldsymbol{\beta}} &\rightarrow \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \dfrac{1}{\mathbb{V}{\rm ar}(X)} \begin{bmatrix} \mathbb{E} (\epsilon) \cdot \mathbb{E}(X^2) - \mathbb{E} (X \epsilon)\cdot \mathbb{E} (X) \\ \mathbb{E} (X \epsilon) - \mathbb{E}(X) \cdot \mathbb{E}(\epsilon) \end{bmatrix} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \dfrac{1}{\mathbb{V}{\rm ar}(X)} \begin{bmatrix} 0 \\ 0 \end{bmatrix} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \end{aligned} \] Since \(\mathbb{E}(\epsilon) = 0\) and \(\mathbb{E} (X \epsilon) = \mathbb{E} (X \epsilon) - \mathbb{E}(X) \mathbb{E}(\epsilon) = \mathbb{C}{\rm ov}(X, \epsilon) = 0\).

Which means that \(\widehat{\boldsymbol{\beta}} \rightarrow \boldsymbol{\beta}\), as \(N \rightarrow \infty\). So, the OLS parameter vector \(\widehat{\boldsymbol{\beta}}\) is a consistent estimator of \(\boldsymbol{\beta}\).

3.2.4.7 Practical illustration of the OLS properties

We will return to our example in this chapter. We have just proved the unbiasedness and consistency of the OLS estimators. To illustrate these properties empirically, we will generate 5000 replications (i.e. different samples) for each of the different sample sizes \(N \in \{11, 101, 1001 \}\) (the Python implementation below uses \(N \in \{10, 100, 1000 \}\)):

The reason that we generate 5000 different samples for each sample size is to calculate the average and the variance of the estimated parameters, as sketched below:
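A hedged Python sketch of such a simulation (it uses \(N \in \{10, 100, 1000\}\), as in the Python output further below, and prints the averages and variances per sample size rather than in the grouped layout shown):

np.random.seed(123)
n_repl = 5000
for N in (10, 100, 1000):
    est = np.zeros((n_repl, 2))
    x_r = np.arange(N)
    X_r = np.column_stack((np.ones(N), x_r))
    for r in range(n_repl):
        # Draw a new sample from the DGP and re-estimate the parameters
        y_r = 1 + 0.5 * x_r + np.random.normal(size=N)
        est[r, :] = np.linalg.inv(X_r.T @ X_r) @ X_r.T @ y_r
    print("With N = {}:".format(N))
    print("  AVERAGE of beta_0, beta_1:", est.mean(axis=0))
    print("  VARIANCE of beta_0, beta_1:", est.var(axis=0))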

## With N = 11:
##   the AVERAGE of the estimated parameters:
##       beta_0: 0.98409
##       beta_1: 0.50222
## With N = 101:
##   the AVERAGE of the estimated parameters:
##       beta_0: 0.99733
##       beta_1: 0.50005
## With N = 1001:
##   the AVERAGE of the estimated parameters:
##       beta_0: 0.99982
##       beta_1: 0.5
## With N = 11:
##   the VARIANCE of the estimated parameters:
##       beta_0: 0.29418
##       beta_1: 0.00716
## With N = 101:
##   the VARIANCE of the estimated parameters:
##       beta_0: 0.03831
##       beta_1: 1e-05
## With N = 1001:
##   the VARIANCE of the estimated parameters:
##       beta_0: 0.00396
##       beta_1: 0
## With N = 10:
##   the AVERAGE of the estimated parameters:
##       beta_0: 1.00437
##       beta_1: 0.49984
## With N = 100:
##   the AVERAGE of the estimated parameters:
##       beta_0: 0.99789
##       beta_1: 0.50006
## With N = 1000:
##   the AVERAGE of the estimated parameters:
##       beta_0: 0.99957
##       beta_1: 0.5
## With N = 10:
##   the VARIANCE of the estimated parameters:
##       beta_0: 0.3178
##       beta_1: 0.00904
## With N = 100:
##   the VARIANCE of the estimated parameters:
##       beta_0: 0.03896
##       beta_1: 1e-05
## With N = 1000:
##   the VARIANCE of the estimated parameters:
##       beta_0: 0.00394
##       beta_1: 0.0

Note that unbiasedness holds for any \(N\), while consistency is an asymptotic property, i.e. it holds as \(N \rightarrow \infty\). We can see that the means of the estimated parameters are close to the true parameter values regardless of sample size. The variance of the estimated parameters decreases with larger sample sizes, i.e. the larger the sample size, the closer our estimated parameters will tend to be to the true values.

We see that the histograms of the OLS estimators have a bell-shaped distribution. Under assumption (UR.4) it can be shown that, since \(\boldsymbol{\varepsilon} | \mathbf{X} \sim \mathcal{N} \left( \mathbf{0}, \sigma^2_\epsilon \mathbf{I} \right)\), the linear combination of the error terms in \(\widehat{\boldsymbol{\beta}} = \boldsymbol{\beta} + \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \boldsymbol{\varepsilon}\) will also be normal, i.e. \[ \widehat{\boldsymbol{\beta}} | \mathbf{X} \sim \mathcal{N} \left(\boldsymbol{\beta},\ \sigma^2 \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \right) \]

In this chapter we have shown how to derive the OLS estimators of the unknown parameters of a linear regression with one explanatory variable. We have also shown that these estimators (under conditions (UR.1) - (UR.3)) are good in the sense that, as the sample size increases, the estimated parameter values approach the true parameter values, and that the average of the estimates obtained from a number of different random samples is very close to the true underlying parameter vector.

3.2.5 On Generation of the Independent Variable \(X\)

In this chapter we have opted to use a deterministic, equally spaced \(X\) (namely \(X_i = i - 1\), \(i = 1,...,N\)) for simplicity. In practice, depending on the type of data, we may have different values of \(Y\) for repeating values of \(X\) (for example, if \(X\) is the income of employees at a firm), which is to be expected, since our model has a random component \(\epsilon\).

Nevertheless, we can easily use a more randomized specification of \(X\), say \(X \sim \mathcal{N} \left( \mu_X,\ \sigma^2_X \right)\), with \(N = 300\), \(\mu_X = 5\) and \(\sigma^2_X = 2\):
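For instance (a Python sketch; note that np.random.normal() takes the standard deviation, so we pass \(\sqrt{\sigma^2_X}\)):

N = 300
x = np.random.normal(loc=5, scale=np.sqrt(2), size=N)
y = 1 + 0.5 * x + np.random.normal(size=N)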

Alternatively, we can draw a random sample of \(X_i\), \(i = 1,...,N\), with replacement from a set \(\{1,...,50 \}\) and \(N = 300\):
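For instance (a Python sketch using np.random.choice() for sampling with replacement):

N = 300
x = np.random.choice(np.arange(1, 51), size=N, replace=True)
y = 1 + 0.5 * x + np.random.normal(size=N)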

Note that when plotting the fitted values, we actually need to sort the values of \(\widehat{Y}\) by \(X\). For the linear regression case, the regression appears to be drawn as expected, but for nonlinear models, plotting unsorted data will result in an unreadable plot.
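One way to do this in Python is sketched below (assuming y_fit holds the fitted values for this sample):

# Order the (X, fitted Y) pairs by the values of X before plotting
order = np.argsort(x)
x_sorted = x[order]
y_fit_sorted = y_fit[order]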

An example of the first 15 data point pairs of:

  • unsorted data points \((X_i, Y_i)\):
##        x     y_fit
##  [1,] 31 16.526761
##  [2,] 15  8.601252
##  [3,] 14  8.105907
##  [4,]  3  2.657120
##  [5,] 42 21.975549
##  [6,] 50 25.938304
##  [7,] 43 22.470893
##  [8,] 37 19.498827
##  [9,] 14  8.105907
## [10,] 25 13.554695
## [11,] 26 14.050040
## [12,] 27 14.545384
## [13,]  5  3.647808
## [14,] 27 14.545384
## [15,] 28 15.040728
## [[46.         23.73934845]
##  [ 3.          2.57705892]
##  [29.         15.37286189]
##  [35.         18.3257395 ]
##  [39.         20.29432457]
##  [18.          9.95925294]
##  [20.         10.94354548]
##  [43.         22.26290965]
##  [23.         12.41998428]
##  [34.         17.83359323]
##  [33.         17.34144696]
##  [50.         25.70793353]
##  [48.         24.72364099]
##  [10.          6.02208279]
##  [33.         17.34144696]]
  • sorted data points \((X_i, Y_i)\), such that \(X_i \leq X_j\) for \(i<j\):
##       [,1]     [,2]
##  [1,]    1 1.666431
##  [2,]    1 1.666431
##  [3,]    2 2.161775
##  [4,]    2 2.161775
##  [5,]    2 2.161775
##  [6,]    2 2.161775
##  [7,]    2 2.161775
##  [8,]    3 2.657120
##  [9,]    3 2.657120
## [10,]    3 2.657120
## [11,]    3 2.657120
## [12,]    4 3.152464
## [13,]    4 3.152464
## [14,]    4 3.152464
## [15,]    5 3.647808
## [[1.         1.59276638]
##  [1.         1.59276638]
##  [2.         2.08491265]
##  [2.         2.08491265]
##  [2.         2.08491265]
##  [2.         2.08491265]
##  [2.         2.08491265]
##  [2.         2.08491265]
##  [2.         2.08491265]
##  [2.         2.08491265]
##  [2.         2.08491265]
##  [3.         2.57705892]
##  [3.         2.57705892]
##  [3.         2.57705892]
##  [3.         2.57705892]]