3.5 Confidence Intervals

We will continue to examine the univariate linear regression model: \[ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} \] and assume that assumptions (UR.1)-(UR.4) hold.

In this section we will introduce the notion of interval estimation - a procedure for constructing ranges of values, called confidence intervals, in which the unknown parameters are likely to be located. Confidence interval construction relies heavily on assumption (UR.4).

3.5.1 Interval Estimation for Parameters

Recall that in section 3.2 we used the OLS to estimate the unknown parameter vector: \[ \widehat{\boldsymbol{\beta}} = \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{Y} \] The estimates \(\widehat{\boldsymbol{\beta}}\) are called point estimates - we obtain a single value for each parameter via OLS. In contrast, interval estimates are ranges of values in which the true parameters \(\beta_0\) and \(\beta_1\) are likely to fall (interval estimates are calculated separately for each coefficient). Interval estimation allows us to evaluate not only what other parameter values are plausible, but also the precision with which the current parameters are estimated. These interval estimates are also known as confidence intervals.

As we have mentioned in section 3.4, if assumptions (UR.1)-(UR.4) hold true, then the OLS estimators have a normal conditional distribution: \[ \widehat{\boldsymbol{\beta}} | \mathbf{X} \sim \mathcal{N} \left(\boldsymbol{\beta}, \sigma^2 \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \right) \] Recall from section 3.4 that we can standardize any normal random variable by subtracting its mean (in our case \(\mathbb{E}(\widehat{\beta}_i) = \beta_i\), \(i = 0,1\)) and dividing by its standard deviation: \[ Z_i = \dfrac{\widehat{\beta}_i - \beta_i}{\sqrt{{\mathbb{V}{\rm ar}} (\mathbf{\widehat{\beta}_i})}} \sim \mathcal{N}(0, 1) \] Note that the distribution of \(Z_i\) is not conditional on \(\mathbf{X}\). This means that when we make statements about \(Z_i\), we do not have to worry about whether \(\mathbf{X}\) is a random variable or not.

Since \(Z_i \sim \mathcal{N}(0,1)\), we can use a table of standard normal probabilities, from any statistics textbook or computed directly, to find that: \[ \mathbb{P}(-1.96 \leq Z_i \leq 1.96) = 0.95 \] Substituting the expression of \(Z_i\) yields: \[ \begin{aligned} \mathbb{P}\left(-1.96 \leq \dfrac{\widehat{\beta}_i - \beta_i}{\sqrt{{\mathbb{V}{\rm ar}} (\mathbf{\widehat{\beta}_i})}} \leq 1.96 \right) &= 0.95 \end{aligned} \] which we can rewrite as:

\[ \mathbb{P}\left(\widehat{\beta}_i - 1.96 \sqrt{{\mathbb{V}{\rm ar}} (\mathbf{\widehat{\beta}_i})}\leq \beta_i \leq \widehat{\beta}_i +1.96 \sqrt{{\mathbb{V}{\rm ar}} (\mathbf{\widehat{\beta}_i})}\right) = 0.95 \] This defines the interval which has a 0.95 probability of containing the parameter \(\beta_i\). In other words the end-points: \[ \widehat{\beta}_i \pm 1.96 \sqrt{{\mathbb{V}{\rm ar}} (\mathbf{\widehat{\beta}_i})},\quad i = 0,1 \] provide an interval estimator. If we construct intervals this way using all possible samples of size \(N\) from a population, then \(95\%\) of the intervals will contain the true parameter \(\beta_i\), \(i = 0, 1\). Note that this assumes that we know the true variance \(\mathbb{V}{\rm ar} (\mathbf{\widehat{\beta}_i})\).
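The value \(1.96\) used above can be computed directly rather than looked up in a table. A quick check of ours in Python, assuming scipy is available:

```python
from scipy.stats import norm

print(norm.ppf(0.975))                   # 1.9599... - the 97.5th percentile, rounded to 1.96
print(norm.cdf(1.96) - norm.cdf(-1.96))  # 0.9500... - P(-1.96 <= Z <= 1.96)
```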

As we have mentioned previously, we do not know the true variance of the error term in: \[ \mathbb{V}{\rm ar} (\widehat{\boldsymbol{\beta}}) = \begin{bmatrix} \mathbb{V}{\rm ar} (\widehat{\beta}_0) & \mathbb{C}{\rm ov} (\widehat{\beta}_0, \widehat{\beta}_1) \\ \mathbb{C}{\rm ov} (\widehat{\beta}_1, \widehat{\beta}_0) & \mathbb{V}{\rm ar} (\widehat{\beta}_1) \end{bmatrix} = \sigma^2 \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \]

but we can estimate it. However, substituting the estimate \(\widehat{\sigma}^2\) for \(\sigma^2\) changes the distribution of the standardized statistic from a standard normal to a \(t\)-distribution with \(N-2\) degrees of freedom: \[ t_i = \dfrac{\widehat{\beta}_i - \beta_i}{\text{se}(\widehat{\beta}_i)} \sim t_{(N-2)} \] where \(\text{se}(\widehat{\beta}_i) = \sqrt{\widehat{\mathbb{V}{\rm ar}} (\mathbf{\widehat{\beta}_i})}\). This is known as the \(t\)-ratio (or \(t\)-statistic) and it is the basis for interval estimation and hypothesis testing in the univariate linear regression model.

Proof. This follows from the fact that:

\[ \begin{aligned} \epsilon_i \sim \mathcal{N}(0, \sigma^2) \iff \dfrac{\epsilon_i}{\sigma} \sim \mathcal{N}(0,1) \end{aligned} \] then the sum of the squared independent standardized errors has a chi-squared distribution with \(N\) degrees of freedom: \[ \sum_{i = 1}^N \left( \dfrac{\epsilon_i}{\sigma}\right)^2 \sim \chi^2_N \] Since the true errors are unobservable, we replace them with the OLS residuals. Estimating the two parameters \(\beta_0\) and \(\beta_1\) costs two degrees of freedom, so the scaled variance estimator has a chi-squared distribution with \(N-2\) degrees of freedom: \[ V = \dfrac{\sum_{i = 1}^N \widehat{\epsilon}^2_i}{\sigma^2} = \dfrac{(N-2)\widehat{\sigma}^2}{\sigma^2} = \left(\dfrac{N-2}{\sigma^2}\right)\widehat{\sigma}^2 \sim \chi^2_{N-2} \] From the previously defined \(Z_i \sim \mathcal{N}(0,1)\) and the newly defined \(V \sim \chi^2_{N-2}\), which can be shown to be independent, we can define the following random variable: \[ t_i = \dfrac{Z_i}{\sqrt{V/(N-2)}} \sim t_{(N-2)} \] Substituting the expressions of \(Z_i\) and \(V\), it can be shown that: \[ t_i = \dfrac{\widehat{\beta}_i - \beta_i}{\text{se}(\widehat{\beta}_i)} \]
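The chi-squared claim for \(V\) can be checked by simulation. The following is a sketch of ours (not from the text, with an arbitrary seed and design): it draws many error vectors, computes the OLS residuals, and compares the sample mean and variance of \(V\) with those of a \(\chi^2_{N-2}\) distribution.

```python
import numpy as np

rng = np.random.default_rng(42)          # arbitrary seed
N, R, sigma = 20, 10_000, 1.0
x = np.linspace(0, 20, N)
X = np.column_stack((np.ones(N), x))
H = X @ np.linalg.inv(X.T @ X) @ X.T     # "hat" matrix: fitted values = H y

V = np.empty(R)
for r in range(R):
    eps = rng.normal(0, sigma, N)
    resid = eps - H @ eps                # OLS residuals: the X*beta part cancels, since HX = X
    V[r] = resid @ resid / sigma**2      # V = (N-2) * sigma_hat^2 / sigma^2

# chi^2_(N-2) has mean N - 2 = 18 and variance 2(N - 2) = 36:
print(V.mean(), V.var())
```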

For the 95th percentile of the \(t\)-distribution with \(N-2\) degrees of freedom the value \(t_{(0.95, N-2)}\) has the property that \(0.95\) of the probability falls to its left: \(\mathbb{P} \left( t_{(N-2)} \leq t_{(0.95, N-2)} \right) = 0.95\), where \(t_{(N-2)}\) is from a \(t\)-distribution with \(N-2\) degrees of freedom.

If we look at a statistical table of the percentile values for the \(t\)-distribution, we can find a critical value \(t_c\), such that: \[ \mathbb{P}(t_i \geq t_c) = \mathbb{P}(t_i \leq -t_c) = \dfrac{\alpha}{2} \] where \(\alpha\) is a probability, usually \(\alpha = 0.01\), \(\alpha = 0.05\) or \(\alpha = 0.1\). The critical value \(t_c\) for \(N-2\) degrees of freedom is the percentile value of the \(t\)-distribution \(t_{(1-\alpha/2, N-2)}\).
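As with the normal quantile above, the critical value can be computed directly instead of read from a table; a sketch assuming scipy:

```python
from scipy.stats import t

N, alpha = 10, 0.05
t_c = t.ppf(1 - alpha / 2, df=N - 2)   # the percentile t_(1-alpha/2, N-2)
print(t_c)                             # 2.3060... for N - 2 = 8 degrees of freedom
```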

We can illustrate this graphically for \(N = 10\):

Each blue shaded tail has area \(\alpha / 2\).
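A figure along these lines can be reproduced with the following sketch (our own plotting code, assuming matplotlib and scipy are available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t

N, alpha = 10, 0.05
t_c = t.ppf(1 - alpha / 2, df=N - 2)   # critical value for N - 2 = 8 degrees of freedom
q = np.linspace(-4, 4, 500)
pdf = t.pdf(q, df=N - 2)

plt.plot(q, pdf, color="black")
plt.fill_between(q, pdf, where=(q <= -t_c), color="blue", alpha=0.5)  # left tail, area alpha/2
plt.fill_between(q, pdf, where=(q >= t_c), color="blue", alpha=0.5)   # right tail, area alpha/2
plt.axvline(-t_c, linestyle="--", color="gray")
plt.axvline(t_c, linestyle="--", color="gray")
plt.title(r"$t_{(N-2)}$ density for $N = 10$; each shaded tail has area $\alpha/2$")
plt.show()
```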

Consequently, we can write the probability for the critical value \(t_c\) as: \[ \mathbb{P}(-t_c \leq t_i \leq t_c) = 1 - \alpha \]

For a \(95\%\) confidence interval, the critical values define a region of the \(t\)-distribution, with probability \(1-\alpha = 0.95\). The remaining probability \(\alpha = 0.05\) is divided equally (\(\alpha/2 = 0.025\)) between two tails: \[ \mathbb{P}(-t_{(0.975, N-2)} \leq t_i \leq t_{(0.975, N-2)}) = 0.95 \]

For the univariate linear regression case, the probability becomes: \[ \mathbb{P} \left( -t_c \leq \dfrac{\widehat{\beta}_i - \beta_i}{\text{se}(\widehat{\beta}_i)} \leq t_c \right) = 1 - \alpha \] which we can rewrite as:

\[ \mathbb{P} \left(\widehat{\beta}_i - t_c \cdot \text{se}(\widehat{\beta}_i)\leq \beta_i \leq \widehat{\beta}_i + t_c \cdot \text{se}(\widehat{\beta}_i)\right) = 1 - \alpha \] The interval estimator of \(\beta_i\) is defined by these endpoints: \[\widehat{\beta}_i \pm t_c \cdot \text{se}(\widehat{\beta}_i)\] Furthermore:

  • the interval endpoints \(\widehat{\beta}_i \pm t_c \cdot \text{se}(\widehat{\beta}_i)\) are random because they depend on the data sample;
  • the interval \(\widehat{\beta}_i \pm t_c \cdot \text{se}(\widehat{\beta}_i)\) has probability \(1-\alpha\) of containing the true unknown parameter \(\beta_i\). For a specific sample, this is also known as a \(100 \cdot (1-\alpha)\%\) interval estimate of \(\beta_i\), or the \(100 \cdot (1-\alpha)\%\) confidence interval;
This also highlights a few very important points about confidence intervals for OLS estimates:
  • If we collect all possible samples of size \(N\) from a population, calculate the OLS estimates \(\widehat{\beta}_i\) and their standard errors \(\text{se}(\widehat{\beta}_i)\), and construct the confidence interval \(\widehat{\beta}_i \pm t_c \cdot \text{se}(\widehat{\beta}_i)\) for each sample, then \(100 \cdot (1-\alpha)\%\) of all the intervals constructed would contain the true parameter \(\beta_i\) (see the simulation sketch after this list);
  • For a single sample, the interval estimate may or may not contain the true parameter \(\beta_i\). Since \(\beta_i\) is unknown, we will never know whether it does;
  • When talking about confidence intervals, note that the confidence is in the procedure used to construct the interval estimates, not in any one interval estimate calculated from a single sample of data.
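The repeated-sampling interpretation in the first point can be illustrated with a small Monte Carlo simulation - a sketch of ours, with an arbitrary data-generating process:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)           # arbitrary seed
N, R, alpha = 100, 5_000, 0.05
beta_0, beta_1 = 2.0, 0.4                # "true" parameters, known here by construction
x = np.linspace(0, 20, N)
X = np.column_stack((np.ones(N), x))
XtX_inv = np.linalg.inv(X.T @ X)
t_c = t.ppf(1 - alpha / 2, df=N - 2)

covered = 0
for _ in range(R):
    y = beta_0 + beta_1 * x + rng.normal(size=N)
    b = XtX_inv @ X.T @ y                                   # OLS estimates
    resid = y - X @ b
    se_b1 = np.sqrt(resid @ resid / (N - 2) * XtX_inv[1, 1])
    covered += (b[1] - t_c * se_b1 <= beta_1 <= b[1] + t_c * se_b1)

print(covered / R)   # empirical coverage, close to 1 - alpha = 0.95
```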

We can also use confidence intervals in our interpretation of the coefficients:

For a linear-linear univariate regression \(\widehat{Y} = \widehat{\beta}_0 + \widehat{\beta}_1 X\) we can say with \(95\%\) confidence that an additional unit of \(X\) changes the dependent variable \(Y\) by an amount between \(\widehat{\beta}_1 - t_c \cdot \text{se}(\widehat{\beta}_1)\) and \(\widehat{\beta}_1 + t_c \cdot \text{se}(\widehat{\beta}_1)\).

Example 3.19 We will generate a univariate linear regression with \(\beta_0 = 2\), \(\beta_1 = 0.4\), \(N = 100\) and \(X\) an equally spaced sequence in the interval \(\left[0, 20 \right]\).
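A minimal sketch of this data-generating process in Python (the random seed and the error standard deviation are our own choices - neither is specified in the text):

```python
import numpy as np

np.random.seed(123)   # arbitrary seed - the original is not stated

N = 100
beta_0, beta_1 = 2, 0.4
x = np.linspace(start=0, stop=20, num=N)        # equally spaced sequence in [0, 20]
e = np.random.normal(loc=0, scale=1, size=N)    # assumed: sigma = 1 (not given in the text)
y = beta_0 + beta_1 * x + e
```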

Next, we estimate its parameters and calculate the \(95\%\) confidence intervals of the parameters:
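The manual computation could look as follows - a sketch continuing the code above (the output shown below is from the original document's session, so the exact values depend on the generated sample):

```python
import numpy as np
from scipy.stats import t

X = np.column_stack((np.ones(N), x))            # design matrix with a constant column
XtX_inv = np.linalg.inv(X.T @ X)
beta_est = XtX_inv @ X.T @ y                    # OLS: (X'X)^{-1} X'y
resid = y - X @ beta_est
sigma2_est = resid @ resid / (N - 2)            # unbiased estimator of sigma^2
se_beta = np.sqrt(np.diag(sigma2_est * XtX_inv))
t_c = t.ppf(1 - 0.05 / 2, df=N - 2)             # critical value t_(0.975, N-2)
print(np.column_stack((beta_est - t_c * se_beta, beta_est + t_c * se_beta)))
```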

##       lower     upper
##   1.2116795 2.6527494
## x 0.3626166 0.4871029

We can also calculate these values with the built-in functions:
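For example, with statsmodels (a sketch; `conf_int()` returns the lower and upper bound for each coefficient):

```python
import statsmodels.api as sm

lm_fit = sm.OLS(y, sm.add_constant(x)).fit()
print(lm_fit.conf_int(alpha=0.05))   # rows: const, x; columns: lower, upper
```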

For the Python data sample, we estimate with \(95\%\) confidence that a unit increase in \(X\) would result in an increase in \(Y\) between \(0.325\) and \(0.480\).

However, we do not know if \(\beta_1\) is actually in the interval \(\left[0.325,\ 0.480 \right]\). Nevertheless, if we applied this procedure to all possible data samples of size \(N = 100\) from the same population, then \(95\%\) of all constructed interval estimates would contain the true parameter value \(\beta_1\). Hence, the interval estimation procedure “works” \(95\%\) of the time. We only have one sample, but given the reliability of our interval construction procedure, we would be “surprised” if the true parameter value \(\beta_1\) were not in the interval \(\left[0.325,\ 0.480 \right]\).

To sum up, we use the OLS estimators to obtain point estimates of the unknown parameters. The estimated variance and standard error \(\text{se}(\widehat{\beta}_i) = \sqrt{\widehat{\mathbb{V}{\rm ar}} (\mathbf{\widehat{\beta}_i})}\) provide information about the sampling variability of the OLS estimators from one sample to another. Interval estimators combine point estimation with sampling variability to provide a range of values in which the true unknown parameter value may fall. If an interval estimate is wide (which implies a large standard error), then we do not have enough information in the sample to draw meaningful conclusions about \(\beta_1\).

3.5.2 Interval Estimation for the Mean Response

In addition to confidence intervals for \(\beta_0\) and \(\beta_1\), we can calculate a confidence interval for the mean response. We would like an interval estimate for the mean \(\widehat{\mathbb{E}}(\mathbf{Y} | \mathbf{X}= \widetilde{\mathbf{X}}) = \widehat{\mathbf{Y}} = \widetilde{\mathbf{X}} \widehat{\boldsymbol{\beta}}\) for some value of \(\mathbf{X} = \widetilde{\mathbf{X}}\).

The estimator \(\widehat{\mathbf{Y}}\) is unbiased for \(\mathbb{E}(\mathbf{Y} | \mathbf{X}= \widetilde{\mathbf{X}})\): \[ \begin{aligned} \mathbb{E}(\widehat{\mathbf{Y}}) &= \mathbb{E}(\widetilde{\mathbf{X}} \widehat{\boldsymbol{\beta}}) \\ &= \widetilde{\mathbf{X}} \left( \mathbb{E}(\widehat{\boldsymbol{\beta}}) \right)\\ &= \widetilde{\mathbf{X}} \boldsymbol{\beta} \\ &= \mathbb{E}(\mathbf{Y} | \mathbf{X} = \widetilde{\mathbf{X}}) \end{aligned} \] The variance of the mean response is: \[ \begin{aligned} \mathbb{V}{\rm ar} (\widehat{\mathbf{Y}}) &= \mathbb{V}{\rm ar} (\widetilde{\mathbf{X}} \widehat{\boldsymbol{\beta}}) \\ &= \widetilde{\mathbf{X}} \mathbb{V}{\rm ar} ( \widehat{\boldsymbol{\beta}}) \widetilde{\mathbf{X}}^\top \\ &= \widetilde{\mathbf{X}} \sigma^2 \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \widetilde{\mathbf{X}}^\top\\ &= \sigma^2 \widetilde{\mathbf{X}} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \widetilde{\mathbf{X}}^\top \end{aligned} \] As we have seen previously, if assumption (UR.4) holds, then \(\widehat{\mathbf{Y}}\) follows a normal distribution: \[ \left(\widehat{\mathbf{Y}}|\widetilde{\mathbf{X}}, \mathbf{X}\right) \sim \mathcal{N}\left( \widetilde{\mathbf{X}} \boldsymbol{\beta},\quad \sigma^2 \widetilde{\mathbf{X}} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \widetilde{\mathbf{X}}^\top\right) \] where we can again replace \(\sigma^2\) with its estimate \(\widehat{\sigma}^2 = \dfrac{1}{N-2} \sum_{i = 1}^N \widehat{\epsilon}_i^2\).

Let \(\text{se}(\widehat{Y}_i) = \sqrt{\widehat{\mathbb{V}{\rm ar}} (\widehat{Y}_i)}\) be the square root of the corresponding \(i\)-th diagonal element of \(\widehat{\mathbb{V}{\rm ar}} (\widehat{\mathbf{Y}})\). Then, the \(100 \cdot (1 - \alpha) \%\) confidence interval for the mean response can be calculated as: \[ \widehat{Y}_i \pm t_{(1 - \alpha/2, N-2)} \cdot \text{se}(\widehat{Y}_i) \]

Example 3.20 We will continue our previous example and calculate the confidence interval of the mean response for the same values of \(X\) that we estimated the regression model on.
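A manual sketch, continuing the code from Example 3.19 (`X`, `XtX_inv`, `beta_est` and `sigma2_est` were defined there):

```python
import numpy as np
from scipy.stats import t

y_fit = X @ beta_est                                        # estimated mean response
var_fit = sigma2_est * np.sum((X @ XtX_inv) * X, axis=1)    # diagonal of sigma2_hat * X (X'X)^{-1} X'
se_fit = np.sqrt(var_fit)
t_c = t.ppf(1 - 0.05 / 2, df=N - 2)
ci = np.column_stack((y_fit, y_fit - t_c * se_fit, y_fit + t_c * se_fit))
print(ci[:6])   # fit, lower and upper bounds for the first few observations
```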

We can also do this automatically:
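For example, with statsmodels the mean-response intervals are available via `get_prediction()` - a sketch continuing the fitted model `lm_fit` from above (the outputs below are from the original document's R and Python sessions):

```python
# Mean-response confidence intervals at the sample values of X:
pred = lm_fit.get_prediction(sm.add_constant(x))
print(pred.summary_frame(alpha=0.05).head())   # mean, mean_se, mean_ci_lower/upper, obs_ci_lower/upper
```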

##        fit      lwr      upr
## 1 1.932214 1.211679 2.652749
## 2 2.018045 1.308344 2.727746
## 3 2.103875 1.404950 2.802800
## 4 2.189705 1.501495 2.877916
## 5 2.275535 1.597976 2.953095
## 6 2.361366 1.694390 3.028341
##           0         1         2
## 0  2.026150  1.128242  2.924059
## 1  2.107526  1.223118  2.991933
## 2  2.188901  1.317922  3.059880
## 3  2.270276  1.412649  3.127902
## 4  2.351651  1.507297  3.196005

Note that we are not restricted to using existing values of \(X\) - for example, we can calculate the confidence interval for the mean value of \(Y\) when \(X = -2\), or when \(X = 50\):
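A sketch with statsmodels, continuing `lm_fit` (the output shown below has the layout of a transposed `summary_frame()`):

```python
import numpy as np

# New X values, with a constant column prepended:
x_new = sm.add_constant(np.array([-2.0, 50.0]), has_constant='add')
print(lm_fit.get_prediction(x_new).summary_frame(alpha=0.05).T)
```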

##                       0          1
## mean           1.220537  22.166489
## mean_se        0.521486   1.579980
## mean_ci_lower  0.185664  19.031069
## mean_ci_upper  2.255410  25.301908
## obs_ci_lower  -3.419593  16.662790
## obs_ci_upper   5.860667  27.670188

Note that the confidence interval for the mean response is different from the prediction interval for new observations, which we will cover in a later section.

3.5.3 Rule of Thumb For Confidence Interval Construction

As we have seen, confidence intervals for the estimated parameters, or for the mean response, can be computed for any sample size \(N > 2\) and any significance level \(0 < \alpha < 1\). Furthermore, as we have mentioned in section 3.4, the \(t\)-distribution approaches the standard normal distribution as the degrees of freedom increase (i.e. as the sample size \(N\) gets larger).

In particular, for \(\alpha = 0.05\) we have that \(t_{(1 - \alpha/2, N)} \rightarrow 1.96\) as \(N \rightarrow \infty\):
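The printed values below can be reproduced with a sketch like the following (note that, matching the text's notation \(t_{(1 - \alpha/2, N)}\), the degrees of freedom are set to \(N\) here):

```python
from scipy.stats import t

for N in [10, 100, 1000, 10**5, 10**6]:
    # critical value t_(0.975, N), rounded to 4 decimal places
    print("N =", N, "crit.val. =", round(t.ppf(1 - 0.05 / 2, df=N), 4))
```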

## [1] "N = 10 crit.val. = 2.2281"
## [1] "N = 100 crit.val. = 1.984"
## [1] "N = 1000 crit.val. = 1.9623"
## [1] "N = 1e+05 crit.val. = 1.96"
## [1] "N = 1e+06 crit.val. = 1.96"
## N = 10 crit.val. = 2.2281
## N = 100 crit.val. = 1.984
## N = 1000 crit.val. = 1.9623
## N = 100000.0 crit.val. = 1.96
## N = 1000000.0 crit.val. = 1.96

In some cases, the population is clearly non-normal. Nevertheless, as long as the sample size is sufficiently large, the OLS estimators (as well as sample mean estimators) are approximately normally distributed. This lets us compute an approximate \(95\%\) confidence interval.

Let \(\overline{b}\) denote our estimate - either of a model parameter, or of some other population parameter, such as the process mean. Then a rule of thumb for an approximate \(95\%\) confidence interval is: \[ \left[ \overline{b} \pm 1.96 \cdot \text{se}(\overline{b})\right] \]

Sometimes an even rougher rule of thumb, \(\left[ \overline{b} \pm 2 \cdot \text{se}(\overline{b})\right]\), may be used.