## 3.4 Maximum Likelihood Estimator (MLE)

We have mentioned that (UR.4) is an optional assumption, which simplifies some statistical properties. But we will show how it can be applied to carry out an estimation method, which is based on the join distribution of $$Y_1,...,Y_N$$.

### 3.4.1 Important Distributions

Before we move on to examining model adequacy, like the coefficient significance and confidence intervals, we first need to talk about the distribution of $$Y$$. In order to do that, we will first introduce a few distributions, which are frequently encountered in econometrics literature.

#### 3.4.1.1 The Normal Distribution

The most widely used distribution in statistics and econometrics. A random normal variable $$X$$ is a continuous variable that can take any value. Its probability density function is defined as: $f(x) = \dfrac{1}{\sqrt{2 \pi \sigma^2}} \exp \left[-\dfrac{(x-\mu)^2}{2\sigma^2} \right],\quad -\infty < x <\infty$ where $$\mathbb{E}(X) = \int_{-\infty}^{\infty} f(x) dx = \mu$$, $$\mathbb{V}{\rm ar}(X) = \sigma^2$$. We say that $$X$$ has a normal distribution and write $$X \sim \mathcal{N}(\mu, \sigma^2)$$. The normal distribution is sometimes called the Gaussian distribution.

#
#
#
# Calculate the probability density function for values of x in [-6;6]
x <- seq(from = -6, to = 6, length.out = 200)
#
plot(x, dnorm(x, mean = 0, sd = 1), type = "l",
ylim = c(0, 0.8), ylab = "Density", lwd = 2)
lines(x, dnorm(x, mean = 1, sd = 1.5), type = "l", col = "red", lwd = 2)
lines(x, dnorm(x, mean = -1, sd = 0.5), type = "l", col = "blue", lwd = 2)
title(expression("Probability density function of N("~mu~","~sigma^2~")"))
#
legend(x = 3, y = 0.8, lty = c(1, 1, 1),
col = c(4,1,2), lwd = c(2,2,2),
legend = c(expression(mu==-1~","~sigma==0.5),
expression(mu==0~","~sigma==1),
expression(mu==1~","~sigma==1.5)))

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Calculate the probability density function for values of x in [-6;6]
x = np.linspace(start = -6, stop = 6, num = 200)
#
_ = plt.figure(num = 0, figsize = (10, 8))
_ = plt.plot(x, norm.pdf(x, loc = 0, scale = 1), color = "black",
label = "$\\mu=0$, $\sigma =1$")
_ = plt.plot(x, norm.pdf(x, loc = 1, scale = 1.5), color = "red",
label = "$\\mu=1$, $\sigma =1.5$")
_ = plt.plot(x, norm.pdf(x, loc = -1, scale = 0.5), color = "blue",
label = "$\\mu=-1$, $\sigma =0.5$")
_ = plt.title("Probability density function of $N(\\mu, \\sigma^2)$")
_ = plt.ylabel("Density")
_ = plt.legend()
plt.show()

Certain random variables appear to roughly follow a normal distribution. These include: a person’s height, weight, test scores; country unemployment rate.

On the other hand, other variables, like income do not appear to follow the normal distribution - the distribution is usually skewed towards the upper (i.e. right) tail. In some cases, a variable might be transformed to achieve normality - usually by taking the logarithm, as long as the random variable is positive. If $$X$$ is a positive, non-normal random variable, but $$\log(X)$$ has a normal distribution, then we say that $$X$$ has a log-normal distribution.

In most cases, income has a log-normal distribution (i.e. the logarithm of income has a normal distribution). Prices of goods also appear to be log-normally distributed.

#### 3.4.1.2 The Standard Normal Distribution

A special case of normal distribution occurs when $$\mu = 0$$ and $$\sigma^2 = 1$$. If $$Z \sim \mathcal{N}(0, 1)$$ - we say that $$Z$$ has a standard normal distribution. The pdf of $$Z$$ is then: $\phi(z) = \dfrac{1}{\sqrt{2\pi}} \exp \left[ - \dfrac{z^2}{2}\right], \quad -\infty < z <\infty$ The values of $$\phi(\cdot)$$ are easily tabulated and can be found in most (especially older) statistical textbooks as well as most statistical/econometrical software.

In most applications we start with a normally distributed random variable, but any normal random variable can be turned into a standard normal.

If $$X \sim \mathcal{N}(\mu, \sigma^2)$$ and $$a,b \in \mathbb{R}$$, then:

• $$\dfrac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$$;
• $$(aX + b) \sim \mathcal{N}(a\mu + b,\ a^2 \sigma^2)$$.

#### 3.4.1.3 The Chi-Square Distribution

The chi-square distribution is obtained directly from independent, standard normal random variables. Let $$Z_1, ..., Z_N$$ be independent random variables and $$Z_i \sim \mathcal{N}(0, 1)$$, $$\forall i = 1,...,N$$. Then a new random variable $$X$$, defined as: $X = \sum_{i = 1}^N Z^2_i$ Then $$X$$ has a chi-squared distribution with $$N$$ degrees of freedom (df) and write $$X \sim \mathcal{\chi}_N^2$$. The df corresponds to the number of terms in the summation of $$Z_i$$. Furthermore, $$\mathbb{E}(X) = N$$ and $$\mathbb{V}{\rm ar}(X) = 2N$$.

#
# Calculate the probability density function for values of x in [0;10]
x <- seq(from = 0, to = 10, length.out = 200)
#
plot(x, dchisq(x, df = 1), type = "l",
ylab = "Density", lwd = 2, col = 1, ylim = c(0, 0.6))
for(i in 2:6){
lines(x, dchisq(x, df = i), col = i, lwd = 2)
}
title(expression("Probability density function of "~chi[k]^2))
#
legend(x = 8, y = 0.6, col = 1:6, lty = 1, lwd = 2,
legend = paste0("k = ", 1:6))

from scipy.stats import chi2
# Calculate the probability density function for values of x in [0;10]
x = np.linspace(start = 0, stop = 10, num = 200)
#
_ = plt.figure(num = 1, figsize = (10, 8))
_ = plt.plot(x, chi2.pdf(x, df = 1), color = "black", label = "k = 1")
for i in range(2, 7):
_ = plt.plot(x, chi2.pdf(x, df = i), label = "k = " + str(i))
_ = plt.ylim((0, 0.6))
_ = plt.title("Probability density function of $\\chi^2_k$")
_ = plt.ylabel("Density")
_ = plt.legend()
plt.show()

Note: we calculate OLS estimates by minimizing $$\sum_{i = 1}^N \widehat{\epsilon}^2_i$$, which appears similar, except, that we did not assume that the residual variance is unity.

#### 3.4.1.4 The $$t$$ Distribution

The $$t$$ distribution is used in classical statistics and multiple regression analysis. We obtain a $$t$$ distribution from a standard normal, and a chi-square random variable.

Let $$Z \sim \mathcal{N}(0, 1)$$ and $$X \sim \mathcal{\chi}_N^2$$, let $$Z$$ and $$X$$ be independent random variables. Then the random variable $$T$$, defined as: $T = \dfrac{Z}{\sqrt{X / N}}$ has a (Students) $$t$$ distribution with $$N$$ degrees of freedom, which we denote as $$T \sim \mathcal{t}_{(N)}$$. Furthermore, $$\mathbb{E}(T) = 0$$ and $$\mathbb{V}{\rm ar}(T) = \dfrac{N}{N-2}$$, $$N > 2$$. As $$N \rightarrow \infty$$, the $$t$$ distribution approaches the standard normal distribution.

#
# Calculate the probability density function for values of x in [-5;5]
x <- seq(from = -5, to = 5, length.out = 200)
#
plot(x, dt(x, df = 1), type = "l",
ylab = "Density", lwd = 2, col = 1, ylim = c(0, 0.4))
for(i in 1:4){
lines(x, dt(x, df = c(2, 3, 5, 100)[i]), col = i+1, lwd = 2)
}
title(expression("Probability density function of "~t[k]))
#
legend(x = 3, y = 0.4, col = 1:5, lty = 1, lwd = 2,
legend = paste0("k = ", c(1, 2, 3, 5, 100)))

from scipy.stats import t
# Calculate the probability density function for values of x in [-5;5]
x = np.linspace(start = -5, stop = 5, num = 200)
#
_ = plt.figure(num = 2, figsize = (10, 8))
for i in range(0, 5):
_ = plt.plot(x, t.pdf(x, df = [1, 2, 3, 5, 100][i]),
label = "k = " + str([1, 2, 3, 5, 100][i]))
_ = plt.ylim((0, 0.4))
_ = plt.title("Probability density function of $t_k$")
_ = plt.ylabel("Density")
_ = plt.legend()
plt.show()

If $$X_1,...,X_N$$ is a random sample of independent random variables with $$X_i \sim \mathcal{N}(\mu, \sigma^2)$$, $$\forall i = 1,...,N$$, then the t-ratio statistic (or simply $$t$$-statistic) of an estimator of the sample mean $$\overline{X}$$, defined as: $t_{\overline{X}} = \dfrac{\overline{X} - \mu}{\text{se} \left(\overline{X} \right)} = \dfrac{\overline{X} - \mu}{\widehat{\sigma}^2 / \sqrt{N}}$ has the $$t_{N-1}$$ distribution.

#### 3.4.1.5 The $$F$$ Distribution

Lastly, an important distribution for statistics and econometrics is the $$F$$ distribution. It is usually used for testing hypothesis in the context of multiple regression analysis.

Let $$X_1 \sim \mathcal{\chi}_{k_1}^2$$ and $$X_2 \sim \mathcal{\chi}_{k_2}^2$$ be independent chi-squared random variables. The, the random variable: $F = \dfrac{X_1 / k_1}{X_2 / k_2}$ has an F distribution with $$(k_1, k_2)$$ degrees of freedom, which we denote by $$F \sim F_{k_1, k_2}$$.

The order of the degrees of freedom in $$F_{k_1, k_2}$$ is important:

• $$k_1$$ - the numerator degrees of freedom - is associated with the chi-square variable in the numerator, $$X_1$$;
• $$k_2$$ - the denominator degrees of freedom - is associated with the chi-square variable in the denominator, $$X_2$$;
#
# Calculate the probability density function for values of x in [0;3]
x <- seq(from = 0, to = 3, length.out = 200)
#
plot(x, df(x, df1 = 1, df2 = 1), type = "l",
ylab = "Density", lwd = 2, col = 1, ylim = c(0, 2))
lines(x, df(x, df1 = 3, df2 = 2), col = 2, lwd = 2)
lines(x, df(x, df1 = 2, df2 = 3), col = 2, lwd = 2, lty = 2)
lines(x, df(x, df1 = 6, df2 = 3), col = 4, lwd = 2)
lines(x, df(x, df1 = 3, df2 = 6), col = 4, lwd = 2, lty = 2)
lines(x, df(x, df1 = 100, df2 = 100), col = 6, lwd = 2)
title(expression("Probability density function of "~F[k[1]][k[2]]))
legend(x = 2, y = 2,
lty = c(1, 1, 2, 1, 2, 1), lwd = 2,
col = c(1, 2, 2, 4, 4, 6),
legend = c(expression(k[1]==1~","~k[2]==1),
expression(k[1]==3~","~k[2]==2),
expression(k[1]==2~","~k[2]==3),
expression(k[1]==6~","~k[2]==3),
expression(k[1]==3~","~k[2]==6),
expression(k[1]==100~","~k[2]==100)))

from scipy.stats import f
# Calculate the probability density function for values of x in [0;3]
x = np.linspace(start = 1e-10, stop = 3, num = 200)
#
_ = plt.figure(num = 3, figsize = (10, 8))
#
_ = plt.plot(x, f.pdf(x, dfn = 1, dfd = 1), label = "$k_1 = 1$, $k_2 = 1$")
_ = plt.plot(x, f.pdf(x, dfn = 3, dfd = 2), label = "$k_1 = 3$, $k_2 = 2$")
_ = plt.plot(x, f.pdf(x, dfn = 2, dfd = 3),
linestyle = "--", label = "$k_1 = 2$, $k_2 = 3$")
_ = plt.plot(x, f.pdf(x, dfn = 6, dfd = 3), label = "$k_1 = 6$, $k_2 = 3$")
_ = plt.plot(x, f.pdf(x, dfn = 3, dfd = 6),
linestyle = "--", label = "$k_1 = 3$, $k_2 = 6$")
_ = plt.plot(x, f.pdf(x, dfn = 100, dfd = 100), label = "$k_1 = 100$, $k_2 = 100$")
#
_ = plt.ylim((0, 2.1))
_ = plt.title("Probability density function of $F_{k_1, k_2}$")
_ = plt.ylabel("Density")
#
_ = plt.legend()
plt.show()

Now that we have introduced a few common distributions, we can look back at our univariate regression model, and examine its distribution more carefully. This also allows us to derive yet another model parameter estimation method, which is based on the assumptions on the underlying distribution of the data.

### 3.4.2 Univariate Linear Regression Model With Gaussian Noise

As before, let our linear equation be defined as: $\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}$ where:

• $$\mathbf{X}$$ can be either a non-random variable, or a random variable with some arbitrary distribution.
• $$\mathbb{V}{\rm ar}\left( \boldsymbol{\varepsilon} | \mathbf{X} \right) = \mathbb{V}{\rm ar}\left( \boldsymbol{\varepsilon} \right) = \sigma^2_\epsilon \mathbf{I}$$ - i.e. the error terms are independent of $$\mathbf{X}$$, independent across observations $$i=1,...,N$$ and have a constant variance.
As mentioned earlier, a consequence of (UR.4) assumption is that not only are the residuals $$\boldsymbol{\varepsilon}$$ normal, but the OLS estimators as well:

$\widehat{\boldsymbol{\beta}} | \mathbf{X} \sim \mathcal{N} \left(\boldsymbol{\beta}, \sigma^2 \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \right)$

The fact that the OLS estimators have a normal distribution can be shown by applying a combination of the Central Limit Theorem and Slutsky’s Theorem.

Additionally, if (UR.4) holds true, then it can be shown that:

$\mathbf{Y} | \mathbf{X} \sim \mathcal{N} \left(\mathbf{X} \boldsymbol{\beta},\ \sigma^2 \mathbf{I} \right)$

Since: \begin{aligned} \mathbb{E}(\mathbf{Y} |\mathbf{X}) &= \mathbb{E} \left(\mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} |\mathbf{X}\right) = \mathbb{E} \left(\mathbf{X} \boldsymbol{\beta} |\mathbf{X}\right) = \mathbf{X} \boldsymbol{\beta}\\ \mathbb{V}{\rm ar}\left( \mathbf{Y} | \mathbf{X} \right) &= \mathbb{V}{\rm ar}\left( \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} | \mathbf{X} \right) = \mathbb{V}{\rm ar}\left( \boldsymbol{\varepsilon} | \mathbf{X} \right) =\sigma^2 \mathbf{I} \end{aligned}

Furthermore, we see that a consequence of these assumptions is that $$Y_i$$ and $$Y_j$$ are independent, given $$X_i$$ and $$X_j$$, $$i\neq j$$.

We can also illustrate the distribution of $$\mathbf{Y}$$ graphically:

#
#
#
set.seed(123)
#
N <- 200
beta_0 <- 1
beta_1 <- 0.5
e_sd <- 0.5
#
x <- seq(from = -10, to = 10, length.out = N)
e <- rnorm(mean = 0, sd = e_sd, n = length(x))
y <- beta_0 + beta_1 * x + e
y_cond_exp <- beta_0 + beta_1 * x
# Fit the data
#y_mdl <- lm(y ~ x)
# Plot the density at specified X axis points:
plot_at <- c(-5, 0, 5)
import numpy as np
import matplotlib.pyplot as plt
#
np.random.seed(123)
#
N = 200
beta_0 = 1
beta_1 = 0.5
e_sd = 0.5
#
x = np.linspace(start = -10, stop = 10, num = N)
e = np.random.normal(loc = 0, scale = e_sd, size = len(x))
y = beta_0 + beta_1 * x + e
y_cond_exp = beta_0 + beta_1 * x
# Fit the data
# Plot the density at specified X axis points:
plot_at = [-5, 0, 5]

We will need to define our own function to plot the density on the scatter plot:

sideways_dnorm <- function(where_x, where_y, e_var, magnify = 4){
values = seq(-2, 2, by = 0.1) # calculate density for this interval
# Y|X ~ Normal(XB, sigma^2)
# (Y - XB)|X ~ Normal(0, sigma^2)
dens <- dnorm(x = values, mean = 0, sd = sqrt(e_var))
x <- where_x + dens * magnify
y <- where_y + values
#
return(cbind(x,y))
}
from scipy.stats import norm
#
def sideways_dnorm(where_x, where_y, e_var, magnify = 4):
values = np.arange(start = -2, stop = 2.1, step = 0.1) # calculate density for this interval
# Y|X ~ Normal(XB, sigma^2)
# (Y - XB)|X ~ Normal(0, sigma^2)
dens = norm.pdf(x = values, loc = 0, scale = np.sqrt(e_var))
x = where_x + dens * np.array(magnify)
y = where_y + values
return(np.vstack((x, y)))

Note: we opted to not use the sample to estimate a regression, nor the estimated variance of the residuals for the density plot. We instead opted to visualize the true underlying distribution on the true underlying regression of $$\mathbf{E}(Y_i|X_i)$$ and use the true variance of $$\epsilon_i \sim \mathcal{N}(0, 0.5^2)$$. Finally, our scatter plot, along with the DGP regression, and the (conditional) density plot of $$Y$$ will look like this:

plot(x, y, col = "red", main = "Linear Regression with Gaussian Errors")
#
#lines(x, fitted(y_mdl), col = "blue")
lines(x, y_cond_exp, col = "blue")
#
#
for(i in 1:length(plot_at)){
# y_fit <- coef(y_mdl)[1] + coef(y_mdl)[2] * plot_at[i]
y_fit <- beta_0 + beta_1 * plot_at[i]
#
xy <- sideways_dnorm(where_x = plot_at[i],
where_y = y_fit,
e_var = e_sd^2, #var(y_mdl$resid), magnify = 4) # lines(xy) # abline(h = y_fit, lty = 2) abline(v = plot_at[i], lty = 2) points(plot_at[i], y_fit, col = "darkgreen", pch = 19) } legend(x = -10, y = 6, lty = 1, col = "blue", legend = expression("E("~Y[i]~"|"~X[i]~")="~beta[0]+beta[1]*X)) _ = plt.figure(num = 4, figsize = (10, 8)) _ = plt.plot(x, y, linestyle = "None", marker = "o", color = "red", markerfacecolor = 'None') _ = plt.title("Linear Regression with Gaussian Errors") #plt.plot(x, y_mdl.fittedvalues, linestyle = "-", color = "blue") _ = plt.plot(x, y_cond_exp, linestyle = "-", color = "blue", label = "$E(Y_i|X_i) = \\beta_0 + \\beta_1 X_i") for i in range(0, len(plot_at)): # y_fit = y_mdl.params[0] + y_mdl.params[1] * plot_at[i] y_fit = beta_0 + beta_1 * plot_at[i] xy = sideways_dnorm(where_x = plot_at[i], where_y = y_fit, e_var = e_sd**2,#np.var(y_mdl.resid), magnify = 4) _ = plt.plot(xy[0], xy[1], linestyle = "-", color = "black") _ = plt.hlines(y = y_fit, linestyle = "--", color = "black", xmin = min(x), xmax = max(x)) _ = plt.vlines(x = plot_at[i], linestyle = "--", color = "black", ymin = min(y), ymax = max(y)) _ = plt.plot(plot_at[i], y_fit, linestyle = "None", marker = "o", color = "darkgreen") _ = plt.legend() plt.show() The assumption that the residual term is normal (or sometimes called Gaussian) does not always hold true in practice. If we believe that the random noise term is a combination of a number of independent smaller random causes, all similar in magnitude, then the error term will indeed be normal (via Central Limit Theorem). Nevertheless: The Gaussian-noise assumption is important in that it gives us a conditional joint distribution of the random sample $$\mathbf{Y}$$, which in turn gives us the sampling distribution for the OLS estimators of $$\boldsymbol{\beta}$$. The distributions are important when we are doing statistical inference on the parameters - calculating confidence intervals or testing null hypothesis for the parameters. Furthermore, this allows us to calculate the confidence intervals for $$\mathbb{E} \left(\mathbf{Y} | \mathbf{X}\right)$$, or prediction intervals for $$\widehat{\mathbf{Y}}$$, given a value of $$\mathbf{X}$$. Comparison of these intervals. Consequently, we will see that the conditional probability density function (pdf) of $$\mathbf{Y}$$, given $$\mathbf{X}$$ is a multivariate normal distribution. For $$Y_i$$, given $$X_i$$ the pdf is the same for each $$i = 1,...,N$$. Lets first look at the cumulative distribution function (cdf): \begin{aligned} F_{Y|X}(y_i | x_i) &= \mathbb{P} \left(Y \leq y_i |X = x_i \right) \\ &= \mathbb{P} \left(\beta_0 + \beta_1 X + \epsilon \leq y_i |X = x_i \right)\\ &= \mathbb{P} \left(\beta_0 + \beta_1 x_i + \epsilon \leq y_i\right)\\ &= \mathbb{P} \left(\epsilon \leq y_i - \beta_0 - \beta_1 x_i \right)\\ &= F_{\epsilon | X}(y_i - \beta_0 - \beta_1 X | X = x_i) \end{aligned} since $$\epsilon \sim \mathcal{N}(0, \sigma^2)$$, it follows that the conditional pdf of $$Y$$ on $$X$$ is the same across $$i = 1,...,N$$: $f_{Y|X}(y_i | x_i) = \dfrac{1}{\sqrt{2 \pi \sigma^2}} \text{e}^{-\dfrac{\left(y_i - (\beta_0 + \beta_1x_i) \right)^2}{2\sigma^2}}$ Because $$Y_i$$ are independent across observations, conditional on $$\mathbf{X}$$, the joint pdf is a product of the marginal pdf’s: \begin{aligned} f_{Y_1, ..., Y_N|X_1,...,X_N}(y_1,...,y_N | x_1,...,x_N) &= \prod_{i = 1}^N f_{Y|X}(y_i | x_i) \\ &= \prod_{i = 1}^N \dfrac{1}{\sqrt{2 \pi \sigma^2}} \exp \left[ -\dfrac{\left(y_i - (\beta_0 + \beta_1x_i) \right)^2}{2\sigma^2} \right] \end{aligned} which we can re-write as a multivariate normal distribution: \begin{aligned} f_{\mathbf{Y}|\mathbf{X}}(\mathbf{y} | \mathbf{x}) &= \dfrac{1}{(2 \pi)^{N/2} (\sigma^2)^{N/2}} \exp \left[ -\dfrac{1}{2} \left( \mathbf{y} - \mathbf{x} \boldsymbol{\beta}\right)^\top \left( \sigma^2 \mathbf{I}\right)^{-1} \left( \mathbf{y} - \mathbf{x} \boldsymbol{\beta}\right)\right] \end{aligned} Having defined the distribution of our random sample allows us to estimate the parameters in a different way - by using the probability density function. While the probability density function relates to the likelihood function of the parameters of a statistical model, given some observed data: $\mathcal{L}(\boldsymbol{\beta}, \sigma^2 | \mathbf{y}, \mathbf{X}) = \dfrac{1}{(2 \pi)^{N/2} (\sigma^2)^{N/2}} \exp \left[ -\dfrac{1}{2} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta}\right)^\top \left( \sigma^2 \mathbf{I}\right)^{-1} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta}\right)\right]$ The probability density function is a function of an outcome $$\mathbf{y}$$, given fixed parameters, while the likelihood function is a function of the parameters only, with the data held as fixed. ### 3.4.3 Maximizing The Log-Likelihood Function - The MLE In practice, it is much more convenient to work with the log-likelihood function: \begin{aligned} \mathcal{\ell}(\boldsymbol{\beta}, \sigma^2 | \mathbf{y}, \mathbf{X}) &= \log \left( \mathcal{L}(\boldsymbol{\beta}, \sigma^2 | \mathbf{y}, \mathbf{X}) \right) \\ &= -\dfrac{N}{2} \log (2 \pi) - N \log (\sigma) -\dfrac{1}{2\sigma^2} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta}\right)^\top \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta}\right) \end{aligned} Taking the partial derivatives allows us to fund the ML estimates: \begin{aligned} \dfrac{\partial \mathcal{\ell}}{\partial \boldsymbol{\beta}^\top} &= -\dfrac{1}{2\sigma^2} \left( -2\mathbf{X}^\top \mathbf{y} + 2 \mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}\right) = 0\\ \dfrac{\partial \mathcal{\ell}}{\partial \sigma^2} &= -\dfrac{N}{2}\dfrac{1}{\sigma^2} + \dfrac{1}{2 \sigma^4} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta}\right)^\top \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta}\right) = 0 \end{aligned} which give us the ML estimators: \begin{aligned} \widehat{\boldsymbol{\beta}}_{\text{ML}} &= \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{Y} \\ \widehat{\sigma}^2 &= \dfrac{1}{N}\left( \mathbf{y} - \mathbf{X} \widehat{\boldsymbol{\beta}}_{\text{ML}}\right)^\top \left( \mathbf{y} - \mathbf{X} \widehat{\boldsymbol{\beta}}_{\text{ML}}\right) \end{aligned} We see that these estimators exactly match the OLS estimators of $$\boldsymbol{\beta}$$. This is a special property of (UR.4) (and (UR.3)) assumption. The estimator of $$\sigma^2$$ is divided by $$N$$ (instead of $$N-2$$ in the OLS case). ### 3.4.4 Standard Errors of a MLE The standard errors can be found by calculating the inverse of the square root of the diagonal elements of the observed Fisher information matrix. In general, the Fisher information matrix $$\mathbf{I}(\boldsymbol{\gamma})$$ is a symmetrical $$k \times k$$ matrix (if the parameter vector is $$\boldsymbol{\gamma} = (\gamma_1,..., \gamma_k)^\top)$$, which contains the following entries: $\left(\mathbf{I}(\boldsymbol{\gamma}) \right)_{i,j} = -\dfrac{\partial^2 }{\partial \gamma_i \partial \gamma_j}\mathcal{\ell}(\boldsymbol{\gamma}),\quad 1 \leq i,j \leq p$ The observed Fisher information matrix is the information matrix evaluated at the MLE: $$\mathbf{I}(\widehat{\boldsymbol{\gamma}}_{\text{ML}})$$. Most statistical software calculates and returns the Hessian matrix. The Hessian is defined as $$\mathbf{H}(\boldsymbol{\gamma})$$: $\left(\mathbf{H}(\boldsymbol{\gamma}) \right)_{i,j} = \dfrac{\partial^2}{\partial \gamma_i \partial \gamma_j}\mathcal{\ell}(\boldsymbol{\gamma}),\quad 1 \leq i,j \leq p$ i.e. it is a matrix of second derivatives of the likelihood function with respect to the parameters. Often times we minimize the negative log-likelihood function (which is equivalent to maximizing the log-likelihood function), then $$\mathbf{I}(\widehat{\boldsymbol{\gamma}}_{\text{ML}}) = \mathbf{H}(\widehat{\boldsymbol{\gamma}}_{\text{ML}})$$. If we maximize the likelihood function, then $$\mathbf{I}(\widehat{\boldsymbol{\gamma}}_{\text{ML}}) = - \mathbf{H}(\widehat{\boldsymbol{\gamma}}_{\text{ML}})$$. Furthermore: $\mathbb{V}{\rm ar}(\boldsymbol{\gamma}) = \left[ \mathbf{I}(\widehat{\boldsymbol{\gamma}}_{\text{ML}}) \right]^{-1}$ and the standard errors are then the square roots of the diagonal elements of the covariance matrix. Generally, the asymptotic distribution for a maximum likelihood estimate is: $\widehat{\boldsymbol{\gamma}}_{\text{ML}} \sim \mathcal{N} \left(\boldsymbol{\gamma}, \left[ \mathbf{I}(\widehat{\boldsymbol{\gamma}}_{\text{ML}}) \right]^{-1} \right)$ ### 3.4.5 When to use MLE instead of OLS Assuming that (UR.1)-(UR.3) holds. If we additionally assume that that the property (UR.4) holds true, OLS and MLE estimates are equivalent. set.seed(123) # N <- 100 beta_0 = 3 beta_1 = 8 # X <- rnorm(n = N, mean = 10, sd = 2) e <- rnorm(n = length(X), mean = 0, sd = 2) Y <- beta_0 + beta_1 * X + e np.random.seed(123) # N = 100 beta_0 = 3 beta_1 = 8 # X = np.random.normal(loc = 10, scale = 2, size = N) e = np.random.normal(loc = 0, scale = 2, size = len(X)) Y = beta_0 + beta_1 * X + e log_lik <- function(par_vec, y, x) { # If the standard deviation prameter is negative, return a large value: if(par_vec[3] < 0) return(1e8) # The likelihood function values: lik <- dnorm(y, mean = par_vec[1] + par_vec[2] * x, sd = par_vec[3]) #This is similar to calculating the likelihood for Y - XB # res <- y - par_vec[1] - par_vec[2] * x # lik <- dnorm(res, mean = 0, sd = par_vec[3]) # If all logarithms are zero, return a large value if(all(lik == 0)) return(1e8) # Logarithm of zero = -Inf return(-sum(log(lik[lik != 0]))) } def log_lik(par_vec, y, x): # If the standard deviation prameter is negative, return a large value: if par_vec[2] < 0: return(1e8) # The likelihood function values: lik = norm.pdf(y, loc = par_vec[0] + par_vec[1] * x, scale = par_vec[2]) #This is similar to calculating the likelihood for Y - XB # res = y - par_vec[0] - par_vec[1] * x # lik = norm.pdf(res, loc = 0, sd = par_vec[2]) # If all logarithms are zero, return a large value if all(v == 0 for v in lik): return(1e8) # Logarithm of zero = -Inf return(-sum(np.log(lik[np.nonzero(lik)]))) Now, we can optimize the function and estimate the parameters: # # coef_est <- optim(par = c(0, 0, 10), fn = log_lik, hessian = T, y = Y, x = X) print(coef_est) ##par
## [1] 3.327122 7.946698 1.922506
##
## $value ## [1] 207.2262 ## ##$counts
##      196       NA
##
## $convergence ## [1] 0 ## ##$message
## NULL
##
## $hessian ## [,1] [,2] [,3] ## [1,] 27.05605401 275.4525943 0.01242708 ## [2,] 275.45259431 2893.6052373 0.20363677 ## [3,] 0.01242708 0.2036368 54.06254589 import scipy.optimize as optimize # opt_res = optimize.minimize(fun = log_lik, x0 = [0, 0, 10], args = (Y, X)) print(opt_res) ## fun: 208.1495503826878 ## hess_inv: array([[ 7.93434827e-01, -7.50597945e-02, 9.56556745e-04], ## [-7.50597945e-02, 7.45615953e-03, -8.14864036e-05], ## [ 9.56556745e-04, -8.14864036e-05, 1.89114136e-02]]) ## jac: array([-3.81469727e-06, -7.62939453e-06, -7.62939453e-06]) ## message: 'Optimization terminated successfully.' ## nfev: 150 ## nit: 19 ## njev: 30 ## status: 0 ## success: True ## x: array([3.12775139, 7.98340769, 1.93974579]) print(coef_est$par)
## [1] 3.327122 7.946698 1.922506
print(opt_res.x)
## [3.12775139 7.98340769 1.93974579]

Since the optim function minimizes a given function by default, we have calculated the negative log-likelihood function. As such, the standard errors can be calculated by taking the square root of the diagonal elements of the inverse of the Hessian matrix:

print(sqrt(diag(solve(coef_est$hessian)))) ## [1] 1.0945224 0.1058369 0.1360041 print(np.sqrt(np.diag(opt_res.hess_inv))) ## [0.89074959 0.08634906 0.13751878] The OLS estimates are: coef(summary(lm(Y ~ X))) ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.319110 1.1052951 3.002918 3.393390e-03 ## X 7.947528 0.1068786 74.360322 5.172918e-88 print(summary(lm(Y ~ X))$sigma)
## [1] 1.941429
import statsmodels.api as sm
#
print(sm.OLS(Y, sm.add_constant(X)).fit().params)
## [3.12775395 7.98340745]
print(np.sqrt(sum(sm.OLS(Y, sm.add_constant(X)).fit().resid**2) / (N - 2)))
## 1.9594392282713997

Note that the OLS estimates $$\beta_0$$ and $$\beta_1$$ and calculates $$\sigma$$, which can be extracted from the output separately. We see that the estimated values differ slightly. This is because optim uses general-purpose optimization algorithms. The results may also depend on the starting parameter values.

When we are using OLS to estimate the parameters by minimizing the sum of squared residuals, we do not make any assumptions about the underlying distribution of the errors.

If the errors are normal, then MLE is equivalent to OLS. However, if we have reason to believe that the errors are not normal, then specifying a correct likelihood function would yield the correct estimates using MLE.

Example 3.16 A Poisson regression is sometimes known as a log-linear model. It is used to model count data (i.e. integer-valued data):

• number of passengers in a plane;
• the number of calls in a call center;
• the number of insurance claims in an insurance firm, etc.
Poisson regression assumes the response variable $$Y$$ has a poisson distribution and that the logarithm of its expected value is modelled by a linear combination of its expeected values. If we assume that our DGP follows a Poisson regression, then:

$Y \sim Pois (\mu),\quad \Longrightarrow \mathbb{E}(Y) = \mathbb{V}{\rm ar}(Y) = \mu$ and: $\log(\mu) = \beta_0 + \beta_1 X \iff \mu = \exp \left[ \beta_0 + \beta_1 X\right]$ This leads to the following model: $\mathbb{E}(Y|X) = \exp \left[ \beta_0 + \beta_1 X\right] \iff \log \left( \mathbb{E}(Y|X) \right) = \beta_0 + \beta_1 X$ Since $$\mathbb{E}[\log(Y) | X] \neq \log(\mathbb{E}[Y|X])$$, we cannot simply take logarithms of $$Y$$ and apply OLS. However, we can apply MLE.

The likelihood function of $$Y$$ is: $\mathcal{L}(\boldsymbol{\beta} | \mathbf{y}, \mathbf{X}) = \prod_{i = 1}^N \dfrac{\exp\left( y_i \cdot(\beta_0 + \beta_1 x_i)\right) \cdot \exp\left(-\exp\left( \beta_0 + \beta_1 x_i \right) \right)}{y_i !}$ and the log-likelihood: $\mathcal{\ell}(\boldsymbol{\beta} | \mathbf{y}, \mathbf{X}) = \sum_{i = 1}^N \left( y_i \cdot(\beta_0 + \beta_1 x_i) -\exp\left( \beta_0 + \beta_1 x_i \right) - \log(y_i!) \right)$

We notice that the parameters $$\beta_0$$ and $$\beta_1$$ only appear in the first two terms. Since our goal is to estimate $$\beta_0$$ and $$\beta_1$$, we can drop $$\sum_{i = 1}^N\log(y_i!)$$ from our equation. This does not impact the maximization - removing (or adding) a constant value from an additive equation will not impact the optimization.

On the other hand, the initial likelihood function $$\mathcal{L}(\boldsymbol{\beta} | \mathbf{y}, \mathbf{X})$$ is a multiplicative equation - all the different terms are multiplied across $$i = 1,...,N$$.

This simplifies our expression to: $\mathcal{\ell}(\boldsymbol{\beta} | \mathbf{y}, \mathbf{X}) = \sum_{i = 1}^N \left( y_i \cdot(\beta_0 + \beta_1 x_i) -\exp\left( \beta_0 + \beta_1 x_i \right) \right)$ Unfortunately, calculating $$\dfrac{\partial \mathcal{\ell}(\boldsymbol{\beta} | \mathbf{y}, \mathbf{X})} {\partial \boldsymbol{\beta}}$$ will not yield a closed-form solution. Nevertheless, we can use the standard optimization functions to find the optimal parameter values.

Example 3.17 Let:
• $$Y$$ - the number of people who visited a cafe,
• $$X$$ - the rate of advertising done by a cafe (from 0 to 1).

We will generate an example with $$\beta_0 = 1$$, $$\beta_1 = 0.5$$ and $$N = 100$$.

set.seed(123)
#
N = 500
beta_0 <- 1
beta_1 <- 0.5
#
x  <- seq(from = 0, to = 1, length.out = N)
mu <- exp(beta_0 + beta_1 * x)
y  <- rpois(n = N, lambda = mu)
#
#
#
#
plot(x, y)

np.random.seed(123)
#
N = 500
beta_0 = 1
beta_1 = 0.5
#
x = np.linspace(start = 0, stop = 1, num = N)
mu = np.exp(beta_0 + beta_1 * x)
y = np.random.poisson(lam = mu, size = N)
#
_ = plt.figure(num = 5, figsize = (10, 8))
_ = plt.plot(x, y, linestyle = "None", marker = "o",
markerfacecolor = "None", color = "black")
plt.show()

and estimate the unknown parameters via MLE:

log_lik <- function(par_vec, y, x) {
# The likelihood function values:
lik <- dpois(y, lambda = exp(par_vec[1] + par_vec[2] * x))
# If all logarithms are zero, return a large value
if(all(lik == 0)) return(1e8)
# Logarithm of zero = -Inf
return(-sum(log(lik[lik != 0])))
}
#
#
from scipy.stats import poisson
#
def log_lik(par_vec, y, x):
# The likelihood function values:
lik = poisson.pmf(y, mu = np.exp(par_vec[0] + par_vec[1] * x))
# If all logarithms are zero, return a large value
if all(v == 0 for v in lik):
return(1e8)
# Logarithm of zero = -Inf
return(-sum(np.log(lik[np.nonzero(lik)])))
coef_est <- optim(par = c(0, 0),
fn = log_lik, hessian = T,
y = y, x = x)
print(coef_est)
## $par ## [1] 1.0120727 0.4729256 ## ##$value
## [1] 994.7103
##
## $counts ## function gradient ## 57 NA ## ##$convergence
## [1] 0
##
## $message ## NULL ## ##$hessian
##           [,1]     [,2]
## [1,] 1758.9687 948.8248
## [2,]  948.8248 657.3466
opt_res = optimize.minimize(fun = log_lik,
x0 = [0, 0],
args = (y, x))
print(opt_res)
##       fun: 1005.4483712482812
##  hess_inv: array([[ 0.00793944, -0.01120125],
##        [-0.01120125,  0.0168012 ]])
##       jac: array([6.10351562e-05, 3.81469727e-05])
##   message: 'Desired error not necessarily achieved due to precision loss.'
##      nfev: 327
##       nit: 8
##      njev: 79
##    status: 2
##   success: False
##         x: array([0.97663587, 0.52574927])

We see that the Maximum Likelihood (ML) estimates are close to the true parameter values. In conclusion, the MLE is quite handy for estimating more complex models, provided we know the true underlying distribution of the data. Since we don’t know this in practical applications, we can always look at the histogram of the data, to get some ideas:

#
#
hist(y, col = "cornflowerblue", breaks = 11)

_ = plt.figure(num = 6, figsize = (10, 8))
_ = plt.hist(y, bins = 11, histtype = 'bar', color = "cornflowerblue", ec = 'black')
plt.show()

as the data seems skewed to the right (indicating a non-normal distribution), the values are non-negative and integer valued, we should look-up possible distributions for discrete data, and examine, whether our sample is similar to (at least one) of them.

Though in practice, this is easier said than done.

Example 3.18 Let our process $$Y$$ be a mixture of normal processes $$X_1 \sim \mathcal{N}(1, (0.5)^2)$$ and $$X_2 \sim \mathcal{N}(4, 1^2)$$ with equal probability and further assume that we do not obeserve $$X_1$$ and $$X_2$$.
set.seed(123)
#
N = 500
x1 <- rnorm(mean = 1, sd = 0.5, n = N)
x2 <- rnorm(mean = 4, sd = 1, n = N)
# randomize the ordering of x1 and x2 variables
y <- sample(c(x1, x2))
#
# plot the data
#
par(mfrow = c(1, 2))
#
plot(y)
#
hist(y, col = "cornflowerblue", breaks = 40)

np.random.seed(123)
#
N = 500
x1 = np.random.normal(loc = 1, scale = 0.5, size = N)
x2 = np.random.normal(loc = 4, scale = 1, size = N)
# randomize the ordering of x1 and x2 variables
y = np.concatenate((x1, x2))
np.random.shuffle(y)
# plot the data
fig = plt.figure(num = 7, figsize = (10, 8))
_ = fig.add_subplot('121').plot(y, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = fig.add_subplot('122').hist(y, bins = 40,
color = "cornflowerblue", ec = 'black')
plt.show()

As we can see from the histogram (which has two peaks) and run-sequence plot (which appears to have values clustered around two means), $$Y$$ seems to be from a mixture of distributions.

For evaluating such cases, see Racine, J.S. (2008), Nonparametric Econometrics: A Primer.

For example, the data could contain information from:

• two shifts at work;
• sales on weekdays and weekends;
• two factories, etc.