5.2 Binary Response Variables

See (Agresti 2018) and (Afifi, May, and Clark 2011) for more details.

Assume that our dependent variable takes the following values: \[ Y_i = \begin{cases} 1, \text{ if success} \\\\ 0, \text{ if failure} \end{cases} \] In other words, our dependent variable can be considered a categorical variable with two outcomes. Remember that for categorical outcomes the actual values do not matter, since they are only used to indicate specific categories. The \(\{ 0, 1\}\) notation makes this variable closer to the indicator (dummy) variables, which we have come across in previous chapters. The difference is that now we need to create a model for such a variable, instead of using it as an exogenous explanatory variable.

In other words, we can regard \(Y_i\) as a Bernoulli random variable, denoted \(Y_i \sim Bern(p_i)\), where: \[ Y_i = \begin{cases} 1, \text{ with probability } p_i \\\\ 0, \text{ with probability } 1-p_i \end{cases} \] which we can also write as \(\mathbb{P}(Y_i = 1) = p_i\) and \(\mathbb{P}(Y_i = 0) = 1 - \mathbb{P}(Y_i = 1) = 1 - p_i\), or more compactly as: \[ \mathbb{P}(Y_i = y_i) =p_i^{y_i}(1 - p_i)^{1 - y_i}, \quad y_i \in \{ 0, 1 \} \] It can be verified that: \[ \begin{aligned} \mathbb{E} (Y_i) &= \mu_i = p_i \\ \mathbb{V}{\rm ar}(Y_i) &= \sigma^2_i = p_i(1-p_i) \end{aligned} \]

Since both the mean and the variance of the process depend on \(p_i\), any factors that have an effect on \(p_i\) will affect not only the process mean, but also the variance of \(Y_i\) (i.e. the variance is not constant, hence heteroskedasticity is present in the data).

Consequently, a linear model is not adequate since it assumes that the predictors affect only the mean and that the variance is constant.

5.2.1 Logistic Regression

In a logistic regression, we are interested in modelling how the probabilities of success and failure depend on a vector of regressors.

5.2.1.1 Model Specification

We will assume that we can model \(Y_i\) via a GLM with some kind of link function: \[ g\left( \mathbb{E}(Y_i | \mathbf{X}_i) \right) = g(\mu_i) = \mathbf{X}_i \boldsymbol{\beta} = \beta_0 + \beta_1 X_{1,i} + ... + \beta_k X_{k,i} \]

We note that the probability value is restricted - \(p_i \in [0, 1]\). Furthermore, it is clear that the conditional mean \(\mu_i\) is: \[ \mathbb{E}(Y_i | \mathbf{X}_i) = 1 \cdot \mathbb{P}(Y_i = 1| \mathbf{X}_i) + 0 \cdot \mathbb{P}(Y_i = 0| \mathbf{X}_i) = \mathbb{P}(Y_i = 1| \mathbf{X}_i) = p_i \] In order to estimate the probability, we want to find the link function for \(p_i = \mu_i\). To do this, we notice the following:

  • Because the probability must be non-negative, we can guarantee this by taking the exponent of the linear predictor (here we use \(p_i = \mu_i\)): \[ p_i = \exp \left( \mathbf{X}_i \boldsymbol{\beta} \right) \]

  • To make the probability less than one, we need to divide this quantity by something larger. The simplest way is: \[ p_i= \dfrac{\exp \left( \mathbf{X}_i \boldsymbol{\beta} \right)}{1 + \exp \left( \mathbf{X}_i \boldsymbol{\beta} \right)} \]

  • Then, the probability of failure can be defined as: \[ q_i = 1 - p_i = 1 - \dfrac{\exp \left( \mathbf{X}_i \boldsymbol{\beta} \right)}{1 + \exp \left( \mathbf{X}_i \boldsymbol{\beta} \right)} = \dfrac{1}{1 + \exp \left( \mathbf{X}_i \boldsymbol{\beta} \right)} \]

This allows us to calculate the odds as: \[ \dfrac{p_i}{1 - p_i} = \exp \left( \mathbf{X}_i \boldsymbol{\beta} \right) \]

The odds are defined as the ratio of favorable to unfavorable cases. If the probability \(p_i\) is very small, the odds are said to be long. Sometimes the odds are easier to interpret than the probability. For example, in gambling, odds of \(1:k\) indicate that the fair payoff for a stake of \(1\) is \(k\).

Taking the logarithm of both sides yields:

The logarithm of the odds is known as the logit of the probability \(p_i\) (or simply the log-odds), which can be written as the following logistic regression: \[ \log\left( \dfrac{p_i}{1-p_i}\right) = \mathbf{X}_i \boldsymbol{\beta} \]

Furthermore, we notice that we can rewrite the expression for the probability of success as: \[ \begin{aligned} p_i &= \dfrac{\exp \left( \mathbf{X}_i \boldsymbol{\beta} \right)}{1 + \exp \left( \mathbf{X}_i \boldsymbol{\beta} \right)} = \left( \dfrac{1 + \exp \left( \mathbf{X}_i \boldsymbol{\beta} \right)}{\exp \left( \mathbf{X}_i \boldsymbol{\beta} \right)} \right)^{-1}= \left( \dfrac{1}{\exp \left( \mathbf{X}_i \boldsymbol{\beta} \right)} + 1\right)^{-1} = \left( \exp \left( - \mathbf{X}_i \boldsymbol{\beta} \right) + 1\right)^{-1} = \dfrac{1}{1 + \exp \left( -\mathbf{X}_i \boldsymbol{\beta} \right)} \end{aligned} \]

From the definition of the GLM, we now have the following link function and mean function: \[ \begin{cases} \eta_i &= g(\mu_i) = g(p_i) = \log\left( \dfrac{p_i}{1-p_i}\right) = \text{logit}(p_i) = \mathbf{X}_i \boldsymbol{\beta}\\\\ \mu_i &= p_i = \text{logit}^{-1}(\eta_i) = \dfrac{\exp \left( \mathbf{X}_i \boldsymbol{\beta} \right)}{1 + \exp \left( \mathbf{X}_i \boldsymbol{\beta} \right)} = \dfrac{1}{1 + \exp \left( -\mathbf{X}_i \boldsymbol{\beta} \right)} \end{cases} \]
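For reference, these two transformations are easy to code directly. A minimal Python sketch (the function names logit and inv_logit are ours, not taken from any particular package):

import numpy as np

def logit(p):
    # log-odds: maps a probability in (0, 1) to the real line
    return np.log(p / (1.0 - p))

def inv_logit(eta):
    # logistic (inverse logit) function: maps the real line back to (0, 1)
    return 1.0 / (1.0 + np.exp(-eta))

p = np.array([0.1, 0.5, 0.9])
print(logit(p))               # [-2.1972  0.      2.1972]
print(inv_logit(logit(p)))    # recovers [0.1 0.5 0.9]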

Note that the logit of the probability \(p_i\), rather than the probability itself, follows a linear model.

For convenience, the probability density function (pdf) of a random variable \(Z\) from a logistic distribution is: \[ \boldsymbol{\lambda}(z) = \dfrac{\exp \left( z \right)}{(1 + \exp \left( z \right))^2} \] and its cumulative distribution function (cdf) is: \[ \boldsymbol{\Lambda}(z) = \mathbb{P}(Z \leq z) = \dfrac{\exp \left( z \right)}{1 + \exp \left( z \right)} = \dfrac{1}{1 + \exp \left( -z \right)} \]

5.2.1.2 Maximum Likelihood Estimation

Similarly to chapter 3.4, the likelihood function of the discrete random variable vector \(\mathbf{Y}\), given \(\mathbf{X}\), can be written as: \[ \mathcal{L}(\boldsymbol{\beta} | \mathbf{Y} = \mathbf{y}, \mathbf{X} = \mathbf{x}) = \prod_{i = 1}^N \mathbb{P}\left( Y_i = y_i| \mathbf{X} = \mathbf{x}, \boldsymbol{\beta}\right) \] Recall that for a sequence of independent Bernoulli trials \(Y_1, ..., Y_N\), where trial \(i\) has success probability \(p_i\), the probability of observing a particular sequence \(y_1,...,y_N\) is: \[ \mathbb{P} \left(Y_1 = y_1, ..., Y_N = y_N \right) = \prod_{i = 1}^N p_i^{y_i} (1 - p_i)^{1 - y_i} \]

5.2.1.2.1 Log-likelihood Function

Consequently, the log-likelihood function now becomes: \[ \begin{aligned} \ell(\boldsymbol{\beta} | \mathbf{Y} = \mathbf{y}, \mathbf{X} = \mathbf{x}) &= \log \left(\mathcal{L}(\boldsymbol{\beta} | \mathbf{Y} = \mathbf{y}, \mathbf{X} = \mathbf{x}) \right)= \sum_{i = 1}^N \left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right] \\ &= \sum_{i = 1}^N \left[y_i \log \left( \dfrac{p_i}{1 - p_i}\right) + \log(1 - p_i)\right] = \sum_{i = 1}^N \left[y_i \cdot \mathbf{x}_i \boldsymbol{\beta} - \log(1 + \exp \left( \mathbf{x}_i \boldsymbol{\beta} \right))\right] \end{aligned} \] Unfortunately, taking the partial derivatives and equating them to zero: \[ \dfrac{\partial \ell}{\partial \beta_j} = \sum_{i = 1}^N \left[ y_i x_{j,i} - \dfrac{1}{1 + \exp \left( \mathbf{x}_i \boldsymbol{\beta}\right)} \exp \left( \mathbf{x}_i \boldsymbol{\beta}\right) x_{j,i}\right] = \sum_{i = 1}^N \left[ \left( y_i - \dfrac{ \exp \left( \mathbf{x}_i \boldsymbol{\beta}\right) }{1 + \exp \left( \mathbf{x}_i \boldsymbol{\beta}\right)} \right) x_{j,i} \right] = 0 \] does not provide a closed-form solution. Fortunately, we can use numerical optimization to get an approximate solution.
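As an illustration of such a numerical approach (a sketch only, not the routine used by any particular econometrics package), the negative of the above log-likelihood can be minimized with a general-purpose optimizer; the data below are simulated purely for demonstration:

import numpy as np
from scipy.optimize import minimize

def neg_loglik(beta, y, X):
    # negative Bernoulli log-likelihood:
    # sum_i [ log(1 + exp(x_i' beta)) - y_i * x_i' beta ]  (stable via logaddexp)
    eta = X @ beta
    return np.sum(np.logaddexp(0.0, eta) - y * eta)

# illustrative data: a 0/1 response and a design matrix with a constant column
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * X[:, 1]))))

opt = minimize(neg_loglik, x0=np.zeros(X.shape[1]), args=(y, X), method="BFGS")
print(opt.x)   # numerical ML estimates, close to (0.5, 1.0) in large samples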

5.2.1.2.2 Newton-Raphson Numerical Optimization Method

There is a vast array of different optimization methods for both constrained and unconstrained optimization, using classical or evolutionary algorithms - no single method works universally well in all cases. However, some methods work well for the majority of the more commonly encountered problems.

One such method is the classical Newton-Raphson (or simply Newton's) method.

5.2.1.2.2.1 One-Dimensional Case

Assume that we want to minimize a function, \(f\), with one unknown parameter, \(\beta\). We want to find a global minimum \(\beta^*\). Assume that \(f\) is continuous and \(\beta^*\) is a regular interior minimum so that \(f'(\beta^*) = 0\) and \(f''(\beta^*) > 0\). Let’s begin by looking at the Taylor series expansion around the point \(\beta = \beta^*\): \[ f(\beta) = f(\beta^*) + f'(\beta^*)(\beta - \beta^*) + \dfrac{f''(\beta^*)}{2!}(\beta - \beta^*)^2 + ... + \dfrac{f^{(n)}(\beta^*)}{n!}(\beta - \beta^*)^n + ... \] In econometrics, a second (sometimes even a first) order expansion is usually an adequate approximation: \[ f(\beta) \approx f(\beta^*) + (\beta - \beta^*) \left[ \dfrac{d f(\beta)}{d \beta}\bigg\rvert_{\beta = \beta^*} \right] + \dfrac{(\beta - \beta^*)^2 }{2!} \left[ \dfrac{d^2 f(\beta)}{d \beta^2} \bigg\rvert_{\beta = \beta^*} \right] \] where \(\dfrac{d f(\beta)}{d \beta}\bigg\rvert_{\beta = \beta^*} = f'(\beta^*)\) and \(\dfrac{d^2 f(\beta)}{d \beta^2} \bigg\rvert_{\beta = \beta^*} = f''(\beta^*)\).

(Note that because \(\beta^*\) is the minimum, we would have \(\dfrac{d f(\beta)}{d \beta}\bigg\rvert_{\beta = \beta^*} = 0\)).

Newton’s method uses the second order Taylor series expansion with an initial guess for the optimal point, \(\beta^{(0)}\): \[ f(\beta) \approx f(\beta^{(0)}) + (\beta - \beta^{(0)})f'(\beta^{(0)}) + \dfrac{(\beta - \beta^{(0)})^2 }{2!} f''(\beta^{(0)}) \] Then, we take the derivative of the expansion with respect to \(\beta\) (not \(\beta^{(0)}\)) and set it to zero: \[ f'(\beta^{(0)}) + \dfrac{1}{2} f''(\beta^{(0)}) \cdot 2 \cdot (\beta - \beta^{(0)}) = 0 \] which reduces to : \[ \beta^{(1)} = \beta^{(0)} - \dfrac{f'(\beta^{(0)})}{f''(\beta^{(0)})} \] (here we expressed \(\beta\) as \(\beta^{(1)}\) to indicate a new value)

Then, we use the new value \(\beta^{(1)}\) in the second order approximation and iterate the procedure to get:

\[ \beta^{(n + 1)} = \beta^{(n)} - \dfrac{f^{\prime}(\beta^{(n)})}{f^{\prime\prime}(\beta^{(n)})} \] If we arrive at the true minimum point so that \(\beta^{(n + 1)} = \beta^{*}\), then further iterating will not change the optimal value, since \(f^{\prime}(\beta^*) = 0\). It can be shown that, in general, if \(\beta^{(0)}\) is close enough to \(\beta^{*}\), then \(\beta^{(n)} \rightarrow \beta^*\) and the convergence is quadratic: \(| \beta^{(n + 1)} - \beta^*| = O\left( | \beta^{(n)} - \beta^*|^2 \right)\).
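A minimal one-dimensional sketch of this iteration (the objective function below is an arbitrary illustration):

import math

def newton_1d(fprime, fsecond, beta0, tol=1e-10, max_iter=100):
    # Newton-Raphson iteration: beta_{n+1} = beta_n - f'(beta_n) / f''(beta_n)
    beta = beta0
    for _ in range(max_iter):
        step = fprime(beta) / fsecond(beta)
        beta -= step
        if abs(step) < tol:
            break
    return beta

# illustration: f(b) = exp(b) - 2b has its minimum at b = log(2) ~ 0.6931
print(newton_1d(lambda b: math.exp(b) - 2.0,   # f'(b)
                lambda b: math.exp(b),         # f''(b)
                beta0=0.0))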

5.2.1.2.2.2 Higher-Dimensional Case

Suppose that we want to minimize a function \(f(\beta_0, \beta_1, ..., \beta_k)\) and we want to find the optimal vector \(\boldsymbol{\beta}^*\). We can generalize Newton's method as:

\[ \boldsymbol{\beta}^{(n + 1)} = \boldsymbol{\beta}^{(n)} - \left[ \mathbf{H}(\boldsymbol{\beta}^{(n)}) \right]^{-1} \nabla f(\boldsymbol{\beta}^{(n)}) \]

where the gradient of \(f\): \[ \nabla f(\boldsymbol{\beta}) = \begin{bmatrix} \dfrac{\partial f}{\partial \beta_0}, & \dfrac{\partial f}{\partial \beta_1}, & ..., & \dfrac{\partial f}{\partial \beta_k} \end{bmatrix} \] and the Hessian of \(f\): \[ \mathbf{H}(\boldsymbol{\beta}) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial \beta_0^2} & \dfrac{\partial^2 f}{\partial \beta_0 \partial \beta_1} & ... & \dfrac{\partial^2 f}{\partial \beta_0 \partial \beta_k} \\ \dfrac{\partial^2 f}{\partial \beta_1 \partial \beta_0} & \dfrac{\partial^2 f}{\partial \beta_1^2} & ... & \dfrac{\partial^2 f}{\partial \beta_1 \partial \beta_k} \\ \vdots & \vdots & \ddots & \vdots\\ \dfrac{\partial^2 f}{\partial \beta_k \partial \beta_0} & \dfrac{\partial^2 f}{\partial \beta_k \partial \beta_1} & ... & \dfrac{\partial^2 f}{\partial \beta_k^2} \end{bmatrix} \] are calculated at \(\boldsymbol{\beta} =\boldsymbol{\beta}^{(n)}\).

Since calculating \(\left[ \mathbf{H}(\boldsymbol{\beta}^{(n)}) \right]^{-1}\) is usually time-consuming, various approximations exist, which lead to quasi-Newton methods.
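For the logistic log-likelihood the gradient and Hessian are available in closed form, so the full Newton-Raphson step can be written out explicitly. A sketch (the names y and X are assumed: a 0/1 response vector and a design matrix that includes a constant column):

import numpy as np

def logistic_newton(y, X, max_iter=25, tol=1e-10):
    # Newton-Raphson on the negative Bernoulli log-likelihood:
    # gradient g = X'(p - y),  Hessian H = X' W X  with  W = diag(p (1 - p))
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        g = X.T @ (p - y)
        H = X.T @ (X * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(H, g)
        beta -= step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# usage: beta_hat = logistic_newton(y, X)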

5.2.1.3 Iteratively (Re-)Weighted Least Squares (IRLS)

While general-purpose numerical optimization methods such as Newton's method can be used, in the logistic regression case we can also use an iteratively re-weighted least squares (IRLS) estimation method.
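A minimal sketch of IRLS for the logit link (it is algebraically equivalent to the Newton-Raphson iteration sketched above; the names y and X are again assumptions):

import numpy as np

def logistic_irls(y, X, max_iter=25, tol=1e-10):
    # at each step solve a weighted least squares problem
    #   beta = (X' W X)^{-1} X' W z
    # with weights W = diag(mu (1 - mu)) and working response
    #   z = eta + (y - mu) / (mu (1 - mu))
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = mu * (1.0 - mu)
        z = eta + (y - mu) / w
        XtW = X.T * w                                # X' W (broadcast the weights)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# usage: beta_hat = logistic_irls(y, X)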

5.2.1.4 Predicting Probabilities

In some econometric software, instead of the predicted probability we may get the fitted log-odds: \[ \log \left( \dfrac{\widehat{p}_i}{1-\widehat{p}_i} \right) = \widetilde{\mathbf{X}}_i \widehat{\boldsymbol{\beta}} \] But we can easily back-transform this to obtain the predicted probability.

So, we can predict a probability for a given explanatory variable matrix \(\widetilde{\mathbf{X}}\) as: \[ \widehat{p}_i\left( \widetilde{\mathbf{X}} \right) = \dfrac{\exp \left(\widetilde{\mathbf{X}}_i \widehat{\boldsymbol{\beta}} \right)}{1 + \exp \left(\widetilde{\mathbf{X}}_i \widehat{\boldsymbol{\beta}} \right)} \]

5.2.1.5 Confidence Intervals

5.2.1.5.1 Confidence Interval For a Parameter

Since the parameters of a logit model are estimated via Maximum Likelihood, they are asymptotically normal. Hence we can use the Wald confidence interval for the parameter \(\widehat{\beta}_j\) on the log-odds-ratio scale: \[ \widehat{\beta}_j \pm z_c \times se(\widehat{\beta}_j) \] By the invariance property of the MLE, we can obtain the corresponding confidence interval on the odds-ratio scale: \[ \exp\{\widehat{\beta}_j \pm z_c \times se(\widehat{\beta}_j)\} \]
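For example, with a fitted statsmodels logit result (here called logit_mle, an assumed name) these intervals can be computed as follows - a sketch:

import numpy as np
from scipy.stats import norm

# assumed: logit_mle = smf.logit("y ~ x1 + x2", data=data).fit()
z_c = norm.ppf(0.975)                      # critical value for a 95% interval
ci_log_odds = np.column_stack([logit_mle.params - z_c * logit_mle.bse,
                               logit_mle.params + z_c * logit_mle.bse])
ci_odds = np.exp(ci_log_odds)              # interval on the odds-ratio scale
# built-in equivalent of the first interval: logit_mle.conf_int(alpha=0.05)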

5.2.1.5.2 Confidence Interval For The Probability Estimates

Note that we can use the logistic regression to get the probability estimate \(\widehat{p} = \widehat{\mathbb{P}}(Y = 1 | \mathbf{X})\). Hence, we want to construct a confidence interval for the probability such that \(\mathbb{P}(p_L \leq \widehat{p} \leq p_U) = \alpha\) for a chosen confidence level \(\alpha\), where \(p_L\) and \(p_U\) are the lower and upper endpoints of the interval.

As per (Agresti 2018), this can be done via the endpoint transformation of the confidence intervals for the linear predictor \(\widehat{\boldsymbol{\eta}} = \mathbf{X} \widehat{\boldsymbol{\beta}}\).

Since we treat \(\mathbf{X}\) as fixed, we can use the parameter confidence intervals to get: \[ \mathbf{X} \widehat{\boldsymbol{\beta}} \pm z_c \times se(\mathbf{X} \widehat{\boldsymbol{\beta}}) \] Then, applying the inverse logit (logistic) transformation to the endpoints, we get the confidence interval for the predicted probability: \[ \widehat{p}_i\left( \mathbf{X} \right) \in \left[ \dfrac{\exp \left( \mathbf{X} \widehat{\boldsymbol{\beta}} - z_c \times se(\mathbf{X} \widehat{\boldsymbol{\beta}})\right)}{1 + \exp \left( \mathbf{X} \widehat{\boldsymbol{\beta}} - z_c \times se(\mathbf{X} \widehat{\boldsymbol{\beta}})\right)} ; \quad \dfrac{\exp \left( \mathbf{X} \widehat{\boldsymbol{\beta}} + z_c \times se(\mathbf{X} \widehat{\boldsymbol{\beta}})\right)}{1 + \exp \left( \mathbf{X} \widehat{\boldsymbol{\beta}} + z_c \times se(\mathbf{X} \widehat{\boldsymbol{\beta}})\right)} \right] \] where \(se(\mathbf{X} \widehat{\boldsymbol{\beta}}) = \sqrt{\text{diag}\left(\mathbb{V}{\rm ar}(\mathbf{X} \widehat{\boldsymbol{\beta}})\right)}\). Finally, the variance-covariance matrix of the linear predictor is: \[ \mathbb{V}{\rm ar}(\mathbf{X} \widehat{\boldsymbol{\beta}}) = \mathbf{X} \mathbb{V}{\rm ar}( \widehat{\boldsymbol{\beta}}) \mathbf{X}^\top \] Having analysed the MLE before, we know from section 3.4.4 that \(\mathbb{V}{\rm ar}( \widehat{\boldsymbol{\beta}}) = \left[ \mathbf{I}(\widehat{\boldsymbol{\beta}}_{\text{ML}}) \right]^{-1}\), where \(\mathbf{I}(\widehat{\boldsymbol{\beta}}_{\text{ML}})\) is the observed Fisher information matrix.
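A sketch of this endpoint transformation in Python (assuming a fitted logit result logit_mle and a design matrix X_new that contains a constant column, in the same column order as the estimated parameters):

import numpy as np
from scipy.stats import norm

beta_hat = np.asarray(logit_mle.params)
cov_beta = np.asarray(logit_mle.cov_params())        # estimated Var(beta_hat)

eta_hat = X_new @ beta_hat                           # linear predictor
se_eta = np.sqrt(np.diag(X_new @ cov_beta @ X_new.T))

z_c = norm.ppf(0.975)
p_lower = 1.0 / (1.0 + np.exp(-(eta_hat - z_c * se_eta)))   # inverse logit of the endpoints
p_upper = 1.0 / (1.0 + np.exp(-(eta_hat + z_c * se_eta)))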

Some discussions on confidence intervals: [1], [2], [3].

Note that these are not prediction intervals!

Prediction intervals are not as useful for GLM’s, as discussed (here).

In order to calculate prediction intervals for a binomial distribution, we would need to look at the 0.025 and 0.975 quantiles of the binomial distribution with size = 1. However, since the possible values are only 0 or 1, these quantiles will almost always be exactly 0 and 1, so the prediction interval is uninformative.

5.2.2 Probit Regression

While the logistic regression uses the cumulative logistic function, the probit regression uses the standard normal cumulative distribution function (cdf).

5.2.2.1 Model Specification

The probit model expresses the probability that the success alternative is chosen as: \[ p_i = \mathbb{P}(Y_i = 1 | \mathbf{X}_i) = \mathbf{\Phi} (\mathbf{X}_i \boldsymbol{\beta}) \] where \(\mathbf{\Phi}(z)\) is the standard normal cdf (this specification can be motivated via the latent variable formulation of section 5.2.4 below). This means that the link function is the inverse of the standard normal cdf: \[ \eta_i = g(\mu_i) = g(p_i) = \mathbf{\Phi}^{-1} (p_i) \] From the definition of the linear mean predictor we have the following:

The inverse of the standard normal cdf of the probability \(p_i\) can be written as the following probit regression: \[ \mathbf{\Phi}^{-1} (p_i) = \mathbf{X}_i \boldsymbol{\beta} \]

Note that the inverse cdf of the probability \(p_i\), rather than the probability itself, follows a linear model. Usually the probit regression is expressed in terms of the probability \(p_i\), as we have seen.

For convenience, the probability density function (pdf) of a random variable \(Z\) from a standard normal distribution is: \[ \boldsymbol{\phi}(z) = \dfrac{1}{\sqrt{2 \pi}} \exp \left( -\dfrac{z^2}{2}\right) \] and its cumulative distribution function (cdf) is: \[ \mathbf{\Phi}(z) = \mathbb{P}(Z \leq z) = \displaystyle\int\limits_{-\infty}^z \dfrac{1}{\sqrt{2 \pi}} \exp \left( -\dfrac{u^2}{2}\right) du \]

The logit and probit functions are extremely similar, particularly when the probit curve is rescaled so that its slope at the origin matches the slope of the logit. As a result, probit models are sometimes used in place of logit models.

5.2.3 Linear Probability Model (Simple Linear Regression) Drawbacks

Since we have mainly focused on the standard linear regression, we may want to specify \(p_i\) as a linear function of regressors: \[ p_i = \mathbf{X}_i \boldsymbol{\beta} \] This is known as a linear probability model, which is estimated via OLS. This specification is identical to the multiple regression model, but instead of \(Y_i\), we are modelling the probability \(p_i\). One main problem with this specification is that the probability is bounded, \(p_i \in [0, 1]\), but the linear predictor \(\mathbf{X}_i \boldsymbol{\beta}\) can take any real value, which may be outside the bounds.

We will illustrate this with an example.

Example 5.6 Assume that our dependent variable comes from a logistic regression process:

\[ \begin{aligned} Y_i | \mathbf{X}_i &\sim Bern(p_i) \\\\ p_i &= \dfrac{\exp(\beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i})}{1 + \exp(\beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i})} = \dfrac{1}{1 + \exp(-(\beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i}))} \end{aligned} \]
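The exact sample used below is not reproduced here, but data from such a process can be simulated along the following lines (the coefficient values, regressor distributions, sample size and seed are illustrative assumptions only):

import numpy as np
import pandas as pd

rng = np.random.default_rng(123)              # illustrative seed
N = 1000                                      # illustrative sample size
b0, b1, b2 = -2.0, 0.5, 0.2                   # illustrative coefficients

x1 = rng.uniform(0, 10, size=N)
x2 = rng.uniform(0, 20, size=N)
p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x1 + b2 * x2)))
y = rng.binomial(1, p)                        # Bernoulli draws with probability p

data = pd.DataFrame({"y": y, "x1": x1, "x2": x2})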

The data plots are:

Assume that we want to fit two models:

  • An OLS regression;
  • A Logistic regression;

We will do so:
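In Python, for example, the two specifications could be estimated as follows (a sketch; data denotes a data frame with columns y, x1 and x2, as in the simulation sketch above):

import statsmodels.api as sm
import statsmodels.formula.api as smf

# (1) linear probability model, estimated by OLS
lpm = smf.ols("y ~ x1 + x2", data=data).fit()
print(lpm.summary().tables[1])

# (2) logistic regression, estimated as a binomial GLM with the (default) logit link
logit_glm = smf.glm("y ~ x1 + x2", data=data, family=sm.families.Binomial()).fit()
print(logit_glm.summary().tables[1])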

##               Estimate  Std. Error  t value     Pr(>|t|)
## (Intercept) 0.26650163 0.071787476 3.712369 2.166990e-04
## x1          0.07381234 0.013605034 5.425369 7.260327e-08
## x2          0.03259153 0.006681385 4.877959 1.247050e-06
##               Coef.  Std.Err.         t         P>|t|    [0.025    0.975]
## Intercept  0.174465  0.078411  2.224992  2.630493e-02  0.020594  0.328335
## x1         0.108452  0.013679  7.928477  5.919460e-15  0.081609  0.135294
## x2         0.033137  0.007145  4.637872  3.988320e-06  0.019116  0.047157
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -1.5156679 0.39886968 -3.799908 1.447500e-04
## x1           0.4133253 0.07839541  5.272315 1.347137e-07
## x2           0.1809375 0.03792235  4.771262 1.830752e-06
##               Coef.  Std.Err.         z         P>|z|    [0.025    0.975]
## Intercept -1.877325  0.425286 -4.414263  1.013545e-05 -2.710871 -1.043780
## x1         0.577192  0.077278  7.468995  8.080979e-14  0.425729  0.728655
## x2         0.177027  0.039164  4.520162  6.179230e-06  0.100267  0.253787

Alternatively, we can estimate the model using an appropriate logit method:
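In statsmodels this corresponds to the discrete-choice Logit estimator (a sketch with the same assumed data frame):

import statsmodels.formula.api as smf

logit_mle = smf.logit("y ~ x1 + x2", data=data).fit()   # prints the optimizer message
print(logit_mle.summary2().tables[1])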

## Optimization terminated successfully.
##          Current function value: 0.554489
##          Iterations 6
##               Coef.  Std.Err.         z         P>|z|    [0.025    0.975]
## Intercept -1.877325  0.425286 -4.414263  1.013546e-05 -2.710871 -1.043780
## x1         0.577192  0.077278  7.468994  8.080996e-14  0.425729  0.728655
## x2         0.177027  0.039164  4.520162  6.179234e-06  0.100267  0.253787

As we can see, the OLS model coefficients are different from the GLM model coefficients. This is to be expected as the models are created under different assumptions.

But this is not the only difference - a much larger and potentially more important distinction lies in the predicted values. To get a clear view of the fitted value curves, we will calculate the fitted values for a new data matrix, in which the regressor values are increasing: \(X_{i,j+1} > X_{i,j}\), \(j = 1,...,M\), \(i = 1, 2\):
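One way to build such an increasing grid and compute the fitted values (a sketch; the grid ranges are illustrative and lpm, logit_mle denote the models fitted above):

import numpy as np
import pandas as pd

new_data = pd.DataFrame({"x1": np.linspace(0, 15, 100),    # increasing grid (illustrative)
                         "x2": np.linspace(0, 30, 100)})

fit_ols = lpm.predict(new_data)           # fitted values of the linear probability model
fit_logit = logit_mle.predict(new_data)   # fitted probabilities g^{-1}(X beta_hat)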

Considerations

Note that in the above we have calculated \(\widehat{\mu} = g^{-1}(\mathbf{X} \widehat{\boldsymbol{\beta}})\). If we instead wanted to calculate the linear predictor \(\mathbf{X} \widehat{\boldsymbol{\beta}}\):

##         1         2         3         4         5 
## -1.143938 -1.134257 -1.124576 -1.114895 -1.105214
## 0   -2.163504
## 1   -2.150915
## 2   -2.138327
## 3   -2.125738
## 4   -2.113150
## dtype: float64

We can verify that this is true by directly calculating \(\mathbf{X} \widehat{\beta}\):
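For instance, with numpy (a sketch, using the assumed grid new_data and fitted logit from above):

import numpy as np

# design matrix with a constant, in the same column order as the estimated parameters
X_new = np.column_stack([np.ones(len(new_data)), new_data["x1"], new_data["x2"]])
eta_manual = X_new @ np.asarray(logit_mle.params)      # X beta_hat
print(eta_manual[:5])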

##           [,1]
## [1,] -1.143938
## [2,] -1.134257
## [3,] -1.124576
## [4,] -1.114895
## [5,] -1.105214
## 0   -2.163504
## 1   -2.150915
## 2   -2.138327
## 3   -2.125738
## 4   -2.113150
## dtype: float64

So, the linear model predicts a probability greater than one for some combination of \((X_{1,j}, X_{2,j})\).

Another problem worth mentioning is that, in the linear model specification: \[ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{e} \] since \(Y_i\) can take only two values, the same holds true for the error term: \[ \begin{aligned} Y_i = 1 &\iff e_i = 1 - (\beta_0 + \beta_1 X_{1,i} + ... + \beta_k X_{k,i})\\ Y_i = 0 &\iff e_i = - (\beta_0 + \beta_1 X_{1,i} + ... + \beta_k X_{k,i}) \end{aligned} \]

To reiterate - the error term, which accounts for omitted variables, specification and measurement errors, etc., only takes two values. This is the consequence of imposing the linear regression structure on the binary choice.

Finally, the conditional variance of the error term is: \[ \mathbb{V}{\rm ar}\left( e_i | \mathbf{X}_i\right) = p_i(\mathbf{X}_i) \left[ 1-p_i(\mathbf{X}_i) \right] = \sigma_i^2 \neq const \]

The variance of the error term is heteroskedastic by construction. Consequently, when estimating a linear probability model we must do one of the following (a sketch of the first option is given after this list):

  • use heteroskedasticity-consistent standard errors (HCE) to correct the OLS standard errors;
  • apply an FGLS estimation methodology.
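For instance, heteroskedasticity-consistent standard errors can be requested directly when fitting the linear probability model by OLS (a sketch with the assumed data frame):

import statsmodels.formula.api as smf

# refit the linear probability model with HC1 robust standard errors
lpm_robust = smf.ols("y ~ x1 + x2", data=data).fit(cov_type="HC1")
print(lpm_robust.summary().tables[1])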

5.2.4 Latent Variable Model

Logit and probit models can be derived from an underlying latent variable model. As before, assume that \(Y_i\) is a random binary response variable (e.g. from a Bernoulli distribution). \(Y_i\) is called the manifest response.

Now, suppose that there is an unobservable continuous r.v. \(Y_i^*\), such that it determines the value of \(Y_i\) based on a certain threshold \(\theta\): \[ Y_i = \begin{cases} 1&, \text{ if } Y_i^* \geq \theta\\\\ 0&, \text{ if } Y_i^* < \theta \end{cases} \] Then, \(Y_i^*\) is called the latent response.

The interpretation of \(Y_i\) and \(Y_i^*\) depends on the context:

  • \(Y_i\) may be viewed as a binary choice or an attitude scale (agree/disagree), e.g., a choice between purchasing or renting a home;
  • \(Y_i^*\) may be viewed as the utility of each choice or the underlying attitude, which does not have a well-defined unit of measurement (in biometrics, \(Y_i^*\) may be viewed as a dose and \(Y_i\) as a response).

Using this latent variable formulation allows us to define the probability of success as: \[ p_i = \mathbb{P}(Y_i = 1) = \mathbb{P}(Y_i^* > \theta) \]

From the properties of a random variable, we may change the location (i.e. the mean) and scale (i.e. the deviation) of the latent variable and the threshold, without changing the probability \(p_i\): \[ p_i = \mathbb{P}(Y_i^* > \theta) = \mathbb{P}( a + c\cdot Y_i^* > a + c \cdot \theta) \] Consequently, to identify the model, we take the threshold \(\theta = 0\) and standardize \(Y_i^*\) to have a standard deviation of \(1\). Now, assume that the latent variable depends on a number of exogenous variables via a linear regression model: \[ Y_i^* = \mathbf{X}_i\boldsymbol{\beta} + U_i \] where the error term \(U_i\) is a random variable with a cdf \(F(\cdot)\), which may not necessarily be a normal distribution.

Then, the probability of observing a positive outcome is: \[ \begin{aligned} p_i &= \mathbb{P}(Y_i^* > 0) = \mathbb{P}\left( \mathbf{X}_i\boldsymbol{\beta} + U_i > 0 \right) \\ &= \mathbb{P}(U_i > - \mathbf{X}_i\boldsymbol{\beta}) = \mathbb{P}(U_i > - \eta_i) \\ &= 1 - F(-\eta_i) \end{aligned} \] where \(\mathbf{X}_i\boldsymbol{\beta} = \eta_i\) is the linear predictor.

If the error term \(U_i\) is symmetric about zero, then \(F(-u) = 1 - F(u)\), which allows us to write the probability as: \[ p_i = F(\eta_i) \] which defines a GLM with a Bernoulli response and the link function: \[ \eta_i = F^{-1}(p_i) \] If the error term \(U_i\) is not symmetric, the GLM link function is: \[ \eta_i = -F^{-1}(1-p_i) \]

5.2.4.1 Probit Model

If we assume \(U_i \sim \mathcal{N}(0, 1)\), then from the link function expression of the latent variable model we have that: \[ p_i = \mathbf{\Phi}(\eta_i) \] Then the inverse transformation for the linear predictor is called a probit and is defined as: \[ \eta_i = \mathbf{\Phi}^{-1}(p_i) \]

For a more general case, we may assume \(U_i \sim \mathcal{N}(0, \sigma^2)\). Following the properties of a r.v. and the definition of a latent variable this yields: \[ \begin{aligned} p_i &= \mathbb{P}(Y_i^* > 0) = \mathbb{P}(U_i > - \mathbf{X}_i\boldsymbol{\beta})\\\\ &= \mathbb{P}\left( \dfrac{U_i}{\sigma} > - \dfrac{\mathbf{X}_i\boldsymbol{\beta}}{\sigma} \right) = 1 - \mathbf{\Phi}\left(- \dfrac{\mathbf{X}_i\boldsymbol{\beta}}{\sigma} \right) \\\\ &= \mathbf{\Phi}\left(\mathbf{X}_i\left(\boldsymbol{\beta}/ \sigma\right) \right) \end{aligned} \] From this we see that we can estimate \(\boldsymbol{\beta}/ \sigma\) but we cannot identify \(\beta\) and \(\sigma\) separately.

We therefore usually either take \(\sigma = 1\), or interpret \(\boldsymbol{\beta}\) in units of standard deviation of the latent variable (instead of simple units).

Econometrics traditionally favors the normality assumption, hence the probit model may be preferred over the logit. On the other hand, the normality assumption may not always be desirable. Furthermore, using the normal distribution as the link for binary response models has the disadvantage that its cdf does not have a closed form (though it can be approximated).

5.2.4.2 Logit Model

An alternative to the normal distribution is the standard logistic distribution - its shape is similar to that of the normal distribution, but its cdf has a closed-form expression: \[ p_i = F(\eta_i) = \boldsymbol{\Lambda}(\eta_i) = \dfrac{\exp(\eta_i)}{1 + \exp(\eta_i)} \] which yields the logit linear predictor: \[ \eta_i = F^{-1}(p_i) = \log \left( \dfrac{p_i}{1-p_i} \right) \]

So, the coefficients of a logit regression can be interpreted not only in terms of the log-odds, but also as the effects on a latent variable which follows a linear model with logistic errors.

Furthermore, the standard logistic distribution has a mean \(0\), and variance \(\pi^2 / 3\).

5.2.4.3 Logit and Probit Comparison

The logit and probit transformations tend to give similar results because they are almost linear functions of each other for values of \(p_i\) that are not too close to \(0\) or \(1\). Noting that:

  • A random variable from the standard normal distribution has a mean \(0\) and variance \(1\);
  • A random variable from the standard logistic distribution has a mean of \(0\) and a variance of \(\pi^2 / 3\).

a comparison of probit and logit model coefficients should take into account the different variances of these distributions.

Recalling that in the probit model, if the error variance is \(\sigma^2\), we can only estimate \(\boldsymbol{\beta}/\sigma\) - the coefficients are only identified relative to the standard deviation of the latent error. Since the latent error of the logit model has standard deviation \(\pi/\sqrt{3}\), while the probit latent error has standard deviation \(1\), we can rescale the logit coefficients to the probit scale. Recall that the coefficients are realizations of random variables, so:

The coefficients of a logit model should be standardized by dividing them by the standard deviation \(\pi/\sqrt{3}\): \[ \widehat{\boldsymbol{\beta}}^{\rm logit}_{st} = \widehat{\boldsymbol{\beta}}^{\rm logit} \cdot \sqrt{3} / \pi \] Then they can be compared with the probit coefficients.

Alternatively, instead of \(\pi/\sqrt{3}\), we may divide by \(1.6\) - a factor chosen by trial and error so that the transformed logistic distribution approximates the standard normal distribution over a wide range.

Example 5.7 Going back to our generated sample data - let’s estimate a probit model alongside a logit model:
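In Python the probit is estimated analogously to the logit (a sketch with the same assumed data frame):

import statsmodels.formula.api as smf

logit_mle = smf.logit("y ~ x1 + x2", data=data).fit()
probit_mle = smf.probit("y ~ x1 + x2", data=data).fit()
print(probit_mle.summary2().tables[1])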
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -1.5156679 0.39886968 -3.799908 1.447500e-04
## x1           0.4133253 0.07839541  5.272315 1.347137e-07
## x2           0.1809375 0.03792235  4.771262 1.830752e-06
##               Coef.  Std.Err.         z         P>|z|    [0.025    0.975]
## Intercept -1.877325  0.425286 -4.414263  1.013545e-05 -2.710871 -1.043780
## x1         0.577192  0.077278  7.468995  8.080979e-14  0.425729  0.728655
## x2         0.177027  0.039164  4.520162  6.179230e-06  0.100267  0.253787
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -0.8742799 0.23544776 -3.713265 2.046024e-04
## x1           0.2427822 0.04562764  5.320946 1.032288e-07
## x2           0.1067521 0.02217174  4.814780 1.473620e-06
##               Coef.  Std.Err.         z         P>|z|    [0.025    0.975]
## Intercept -1.118343  0.251822 -4.441007  8.953902e-06 -1.611905 -0.624781
## x1         0.346657  0.045182  7.672487  1.686933e-14  0.258102  0.435212
## x2         0.106196  0.023067  4.603716  4.150184e-06  0.060985  0.151407

Upon initial examination, it would appear that the coefficients are different. However, if we standardize the logit coefficients:
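A sketch of the standardization (using the assumed fitted objects logit_mle and probit_mle from above):

import numpy as np

logit_std = logit_mle.params * np.sqrt(3) / np.pi    # divide by pi / sqrt(3)
print(logit_std)
print(probit_mle.params)                             # compare with the probit coefficients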

##                Estimate
## (Intercept) -0.83563152
## x1           0.22787817
## x2           0.09975606
## Intercept   -1.035024
## x1           0.318223
## x2           0.097600
## dtype: float64

We see that the standardized coefficients of the logit are quite similar to the ones we get from estimating the probit regression.

Another note regarding the calculation of the predicted values via R and Python:

  • We can always calculate the linear mean predictor \(\eta_i = \mathbf{X}_i \boldsymbol{\beta}\):
##   probit_eta logit_eta
## 1 -0.6540891 -1.143938
## 2 -0.6483905 -1.134257
## 3 -0.6426919 -1.124576
## 4 -0.6369932 -1.114895
## 5 -0.6312946 -1.105214
## 6 -0.6255960 -1.095533
##    probit_eta  logit_eta
## 0   -1.290519  -2.163504
## 1   -1.282962  -2.150915
## 2   -1.275405  -2.138327
## 3   -1.267848  -2.125738
## 4   -1.260290  -2.113150
## 5   -1.252733  -2.100562

which is equivalent:

##   probit_eta logit_eta
## 1 -0.6540891 -1.143938
## 2 -0.6483905 -1.134257
## 3 -0.6426919 -1.124576
## 4 -0.6369932 -1.114895
## 5 -0.6312946 -1.105214
## 6 -0.6255960 -1.095533
##    probit_eta  logit_eta
## 0   -1.290519  -2.163504
## 1   -1.282962  -2.150915
## 2   -1.275405  -2.138327
## 3   -1.267848  -2.125738
## 4   -1.260290  -2.113150
## 5   -1.252733  -2.100562

and then transform them to the predicted probability:

##   probit_pred logit_pred logit_pred_manual
## 1   0.2565272  0.2415981         0.2415981
## 2   0.2583662  0.2433763         0.2433763
## 3   0.2602120  0.2451634         0.2451634
## 4   0.2620646  0.2469594         0.2469594
## 5   0.2639240  0.2487642         0.2487642
## 6   0.2657900  0.2505777         0.2505777

Note that logit_pred and logit_pred_manual highlight the fact that we can utilize the logistic distribution cdf to transform the linear predictor into a probability manually.

  • Or equivalently calculate the probabilities directly:
##   probit_pred logit_pred
## 1   0.2565272  0.2415981
## 2   0.2583662  0.2433763
## 3   0.2602120  0.2451634
## 4   0.2620646  0.2469594
## 5   0.2639240  0.2487642
## 6   0.2657900  0.2505777
##    probit_pred  logit_pred
## 0     0.098435    0.103076
## 1     0.099753    0.104246
## 2     0.101083    0.105427
## 3     0.102426    0.106620
## 4     0.103782    0.107825
## 5     0.105151    0.109042

Finally, we can compare the predicted values:

and see that the predicted probabilities are indeed similar.

It may also be interesting to see the 3d scatterplot:

R also allows plotting interactive plots (though they do not always work correctly when viewed outside RStudio in a web browser).