## 5.1 Generalized Linear Model (GLM)

This chapter is based in part on Source 1, Source 2, Source 3, Source 4, and Source 5.

In a general linear model: $\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \mathbb{E} \left( \boldsymbol{\varepsilon} | \mathbf{X}\right) = \boldsymbol{0},\quad \mathbb{V}{\rm ar}\left( \boldsymbol{\varepsilon} | \mathbf{X} \right) = \mathbb{E} \left( \boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^\top \right)= \mathbf{\Sigma} = \sigma_\varepsilon^2 \mathbf{\Omega}$ the dependent variable $$\mathbf{Y}$$ is described as a linear function of the explanatory variables plus an error term.

While this specification has been useful in describing various relationships up to this point, there are cases when it is not appropriate. For example:

• when the range of $$\mathbf{Y}$$ is restricted to binary values, or counts (i.e. non-negative integer-valued data);
• when the variance of $$\mathbf{Y}$$ depends on the mean;

A natural extension that deals with these cases is the class of generalized linear models (GLMs), which extend general linear models.

### 5.1.1 GLM Specification

A Generalized Linear Model consists of several elements:

1. A linear predictor: $\boldsymbol{\eta} = \mathbf{X} \boldsymbol{\beta}$
2. A link function, $$g$$, which describes how the mean of the process $$\mathbf{Y}$$ depends on the linear predictor: $\mathbb{E}(\mathbf{Y}) = \boldsymbol{\mu} = g^{-1}(\boldsymbol{\eta}) = g^{-1}(\mathbf{X} \boldsymbol{\beta})$ or, equivalently, $$g(\boldsymbol{\mu}) = \boldsymbol{\eta}$$.
3. A variance function, $$V$$, and a dispersion parameter, $$\boldsymbol{\phi}$$, which describe how the variance of the process $$\mathbf{Y}$$ depends on the mean: $\mathbb{V}{\rm ar}(\mathbf{Y}) = \boldsymbol{\phi}V(\boldsymbol{\mu}) = \boldsymbol{\phi}V\left(g^{-1}(\mathbf{X} \boldsymbol{\beta}) \right)$ Sometimes this property is instead described by assuming a specific distribution for the dependent variable. For example, we may assume that $$\mathbf{Y}$$ follows a probability distribution from the exponential family. All three components are illustrated in the sketch after this list.
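
To make the three components concrete, below is a minimal sketch in Python (assuming numpy and statsmodels are available) that simulates data from a Poisson GLM with a log link and then fits it; the variable names and simulated coefficients are purely illustrative:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
X = sm.add_constant(rng.normal(size=n))  # design matrix for the linear predictor
beta = np.array([0.5, 0.3])              # illustrative coefficients
eta = X @ beta                           # 1. linear predictor: eta = X beta
mu = np.exp(eta)                         # 2. inverse link: mu = g^{-1}(eta), here g = log
y = rng.poisson(mu)                      # 3. family variance: Var(Y) = mu for Poisson

# Fitting recovers beta; the Poisson family implies the log link and V(mu) = mu
result = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(result.params)
```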

The following assumptions are implied by the GLM:

• The relationship between the dependent and independent variables may be non-linear;
• The dependent variable can have a non-normal distribution;
• In order to estimate the unknown parameters, the maximum likelihood estimation method needs to be applied (see chapter 3.4 for an introductory example of the MLE for a simple univariate OLS regression); a minimal sketch of such direct likelihood maximization is given after this list;
• The errors are independent but can have a non-normal distribution;
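
As a sketch of what applying MLE amounts to, the following Python snippet (assuming numpy and scipy; the data and coefficients are simulated for illustration) maximizes a logistic-regression log-likelihood directly with a generic optimizer:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])                      # illustrative coefficients
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta_true))))

def neg_loglik(beta):
    # Bernoulli log-likelihood: sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ]
    eta = X @ beta
    return -np.sum(y * eta - np.logaddexp(0.0, eta))   # logaddexp for stability

res = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print(res.x)  # estimates should be close to beta_true
```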

### 5.1.2 Exponential Family

In a GLM, each $$Y_i$$ is assumed to be generated from a particular distribution in the exponential family, where the probability density function (pdf) is written as: $f(y_i) = \exp\left( \dfrac{y_i \theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi) \right)$ where:

• $$\theta_i$$ is the location parameter (directly related to the mean);
• $$\phi$$ is the scale (i.e. standard deviation, or sometimes variance) parameter;
• $$a_i(\cdot)$$, $$b(\cdot)$$ and $$c(\cdot, \cdot)$$ are known functions.

Furthermore, $$\theta_i$$ is called the canonical parameter and $$b(\cdot)$$ is called the cumulant function.

It can be shown that if $$Y_i$$ has a distribution from the exponential family, then: \begin{aligned} \mathbb{E}(Y_i) &= \mu_i = b'(\theta_i)\\ \mathbb{V}{\rm ar}(Y_i) &= \sigma^2_i = b''(\theta_i) a_i(\phi) \end{aligned} It is sometimes assumed that $$a_i (\phi)$$ has the following form: $a_i(\phi) = \dfrac{\phi}{p_i},$ where $$p_i$$ is a known prior weight, usually 1.
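
These identities follow from standard likelihood theory. For a single observation the log-likelihood is $\ell(\theta_i) = \dfrac{y_i \theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi)$ and, under the usual regularity conditions, $\mathbb{E} \left( \dfrac{\partial \ell}{\partial \theta_i} \right) = 0, \quad \mathbb{E} \left( \dfrac{\partial^2 \ell}{\partial \theta_i^2} \right) + \mathbb{E} \left( \left( \dfrac{\partial \ell}{\partial \theta_i} \right)^2 \right) = 0$ The first condition gives $$\mathbb{E} \left( (Y_i - b'(\theta_i))/a_i(\phi) \right) = 0$$, i.e. $$\mathbb{E}(Y_i) = b'(\theta_i)$$; the second gives $$-b''(\theta_i)/a_i(\phi) + \mathbb{V}{\rm ar}(Y_i)/a_i^2(\phi) = 0$$, i.e. $$\mathbb{V}{\rm ar}(Y_i) = b''(\theta_i) a_i(\phi)$$.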

The exponential family includes various distributions such as:

• Gaussian (i.e. normal) distribution;
• Bernoulli distribution;
• Binomial distribution;
• Multinomial distribution;
• Poisson distribution;
• Exponential distribution;


Example 5.1 The Normal distribution has the following probability density function (pdf):

\begin{aligned} f(y_i) &= \dfrac{1}{\sqrt{2\pi \sigma^2}} \exp \left( -\dfrac{1}{2} \dfrac{(y_i - \mu_i)^2}{\sigma^2} \right) \\ &= \dfrac{1}{\sqrt{2\pi \sigma^2}} \exp \left( -\dfrac{1}{2} \dfrac{y_i^2 + \mu_i^2 - 2y_i\mu_i}{\sigma^2} \right) \\ &= \exp \left( \dfrac{y_i \mu_i - \dfrac{1}{2} \mu_i^2}{\sigma^2} - \dfrac{y_i^2}{2\sigma^2} - \dfrac{1}{2} \log(2\pi\sigma^2) \right) \end{aligned} From this expression it is clear that for the normal distribution case:

• $$\theta_i = \mu_i$$;
• $$\phi = \sigma^2$$;
• $$a_i(\phi) = \phi$$;
• $$b(\theta_i) = \dfrac{1}{2}\theta_i^2$$;
• $$c(y_i, \phi) = - \dfrac{y_i^2}{2\phi} - \dfrac{1}{2} \log(2\pi\phi)$$

Then the mean and variance are as we would expect of a normal distribution: \begin{aligned} \mathbb{E}(Y_i) &= b'(\theta_i) = \theta_i = \mu_i\\ \mathbb{V}{\rm ar}(Y_i) &= b''(\theta_i) a_i(\phi) = 1 \cdot \phi = \sigma^2 \end{aligned}
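
Since $$\theta_i = \mu_i$$, the canonical link for the normal distribution is the identity, so a Gaussian GLM reduces to the usual linear regression. A minimal check of this (assuming numpy and statsmodels; the data are simulated):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = sm.add_constant(rng.normal(size=n))
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

ols = sm.OLS(y, X).fit()
glm = sm.GLM(y, X, family=sm.families.Gaussian()).fit()  # identity link by default
print(np.allclose(ols.params, glm.params))               # True: the estimates coincide
```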

Example 5.2 The Binomial distribution has the following probability distribution function (for discrete r.v.’s this is known as the probability mass function (pmf)):

$f_i(y_i) = \mathbb{P}(Y_i = y_i) = \binom{N_i}{y_i} p_i^{y_i}(1 - p_i)^{N_i - y_i}, \quad y_i = 0, 1,..., N_i$ where $$\mathbb{P}(Y_i = y_i)$$ is the probability of obtaining $$y_i$$ successes and $$N_i-y_i$$ failures, which can occur in $$\binom{N_i}{y_i}$$ different ways.

Taking the logarithm of both sides yields: \begin{aligned} \log \left( f_i(y_i) \right ) &= y_i \log (p_i) + (N_i - y_i) \log (1 - p_i) + \log \left( \binom{N_i}{y_i} \right) \\ &= y_i \log \left( \dfrac{p_i}{1-p_i}\right) + N_i \log (1 - p_i) + \log \left( \binom{N_i}{y_i} \right) \end{aligned} We see that:

• $$\theta_i = \log \left( \dfrac{p_i}{1-p_i}\right)$$;

Then, solving for $$p_i$$ yields:

• $$p_i = \dfrac{\exp\left( \theta_i \right)}{1 + \exp\left( \theta_i \right)}$$ and $$1 - p_i = \dfrac{1}{1 + \exp\left( \theta_i \right)}$$

Taking the log of $$1 - p_i$$ yields $$\log(1 - p_i) = -\log \left( 1 + \exp\left( \theta_i \right) \right)$$ which allows us to write $$b(\cdot)$$ as:

• $$b(\theta_i) = N_i \log \left( 1 + \exp\left( \theta_i \right) \right)$$

The remaining term is:

• $$c(y_i, \phi) = \log \left( \binom{N_i}{y_i} \right)$$

Finally, we can set:

• $$a_i(\phi) = \phi$$ and $$\phi = 1$$.

Consequently, the mean and variance: \begin{aligned} \mathbb{E}(Y_i) &= b'(\theta_i) = N_i \dfrac{\exp\left( \theta_i \right)}{1 + \exp\left( \theta_i \right)} = N_i \cdot p_i = \mu_i\\ \mathbb{V}{\rm ar}(Y_i) &= b''(\theta_i) a_i(\phi) = N_i \dfrac{\exp\left( \theta_i \right)}{(1 + \exp\left( \theta_i \right))^2} \cdot 1 = N_i \cdot p_i (1 - p_i) \end{aligned} are in line with what we expect of a binomial r.v.
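
A quick numeric sanity check of these formulas, as a minimal sketch assuming numpy; the values $$N_i = 20$$ and $$p_i = 0.3$$ are arbitrary illustrative choices:

```python
import numpy as np

N, p = 20, 0.3
theta = np.log(p / (1 - p))                            # canonical parameter
b_prime = N * np.exp(theta) / (1 + np.exp(theta))      # b'(theta)  = N p       = 6.0
b_double = N * np.exp(theta) / (1 + np.exp(theta))**2  # b''(theta) = N p (1-p) = 4.2

rng = np.random.default_rng(7)
y = rng.binomial(N, p, size=200_000)
print(b_prime, y.mean())   # both approximately 6.0
print(b_double, y.var())   # both approximately 4.2
```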

Example 5.3 The Poisson distribution has the following probability mass function:

$f_i(y_i) = \mathbb{P}(Y_i = y_i) = \dfrac{\exp\left( -\mu_i\right)\mu_i^{y_i}}{y_i!}, \quad y_i = 0, 1, 2, ...$ Taking the logarithm of both sides yields: $\log \left( f_i(y_i) \right ) = y_i \log (\mu_i) - \mu_i - \log(y_i!)$ from which we see that we can take:

• $$a_i(\phi) = \phi$$ and $$\phi = 1$$

Then:

• $$\theta_i = \log(\mu_i)$$

which yields:

• $$\mu_i = \exp \left( \theta_i \right)$$

Then, the second term in the log pdf is:

• $$b(\theta_i) = \exp \left( \theta_i \right)$$

Finally, the last term is:

• $$c(y_i, \phi) = -\log(y_i!)$$

Consequently, the mean and variance: \begin{aligned} \mathbb{E}(Y_i) &= b'(\theta_i) = \exp \left( \theta_i \right) = \mu_i\\ \mathbb{V}{\rm ar}(Y_i) &= b''(\theta_i) a_i(\phi) = \exp \left( \theta_i \right) \cdot 1 = \exp \left( \theta_i \right) = \mu_i \end{aligned} are just as we would expect of a Poisson distribution: the mean and the variance are equal.
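
The equality of the mean and the variance (equidispersion) is easy to check by simulation; a minimal sketch assuming numpy, with an arbitrary illustrative $$\mu_i = 3.5$$:

```python
import numpy as np

rng = np.random.default_rng(11)
mu = 3.5                            # mu = exp(theta), i.e. theta = log(mu)
y = rng.poisson(mu, size=200_000)
print(y.mean(), y.var())            # both approximately 3.5
```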

Example 5.4 The Exponential distribution has the following probability density function:

$f(y_i) = \lambda \exp \left( -\lambda y_i \right),\quad \lambda > 0,\ y_i \geq 0$ We can rewrite the above as: $\log(f(y_i)) = \log(\lambda) -\lambda y_i = -y_i \lambda + \log(\lambda)$ Then we can take:

• $$a_i(\phi) = \phi$$ and $$\phi = 1$$

and setting:

• $$\theta_i = -\lambda$$

yields:

• $$b(\theta_i) = -\log(-\theta_i)$$.

Finally, the last term is:

• $$c(y_i, \phi) = 0$$.

Consequently, the mean and variance: \begin{aligned} \mathbb{E}(Y_i) &= b'(\theta_i) = - \dfrac{1}{\theta_i} = \dfrac{1}{\lambda} = \mu_i\\ \mathbb{V}{\rm ar}(Y_i) &= b''(\theta_i) a_i(\phi) = \dfrac{1}{\theta_i^2} = \dfrac{1}{\lambda^2} = \mu_i^2 \end{aligned} match the known mean and variance of an exponential r.v.
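
Closing the loop on all four examples, the derivatives of the cumulant functions can be verified symbolically. A minimal sketch assuming sympy (recall that $$a_i(\phi) = \phi = \sigma^2$$ for the normal distribution and $$\phi = 1$$ for the others):

```python
import sympy as sp

theta = sp.Symbol("theta")
N = sp.Symbol("N", positive=True)

# cumulant functions b(theta) from Examples 5.1 - 5.4
families = {
    "normal":      theta**2 / 2,
    "binomial":    N * sp.log(1 + sp.exp(theta)),
    "poisson":     sp.exp(theta),
    "exponential": -sp.log(-theta),
}
for name, b in families.items():
    mean = sp.simplify(sp.diff(b, theta))    # E(Y)  = b'(theta)
    var = sp.simplify(sp.diff(b, theta, 2))  # Var(Y) = b''(theta) * a(phi)
    print(f"{name}: b' = {mean}, b'' = {var}")
```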