3.7 OLS Prediction and Prediction Intervals

We have examined model specification, parameter estimation and interpretation techniques. However, usually we are not only interested in identifying and quantifying the independent variable effects on the dependent variable, but we also want to predict the (unknown) value of \(Y\) for any value of \(X\). Prediction plays an important role in financial analysis (forecasting sales, revenue, etc.), government policies (prediction of growth rates for income, inflation, tax revenue, etc.) and so on.

Let our univariate regression be defined by the linear model: \[ Y = \beta_0 + \beta_1 X + \epsilon \] and let assumptions (UR.1)-(UR.4) hold. Let \(\widetilde{X}\) be a given value of the explanatory variable.

3.7.1 OLS Prediction

We want to predict the value \(\widetilde{Y}\) for this given value \(\widetilde{X}\). In order to do that, we assume that the true DGP remains the same for \(\widetilde{Y}\). The difference from the mean response is that, when talking about prediction, the regression outcome is composed of two parts: \[ \widetilde{\mathbf{Y}}= \mathbb{E}\left(\widetilde{\mathbf{Y}} | \widetilde{\mathbf{X}} \right) + \widetilde{\boldsymbol{\varepsilon}} \] where:

  • \(\mathbb{E}\left(\widetilde{Y} | \widetilde{X} \right) = \beta_0 + \beta_1 \widetilde{X}\) is the systematic component;
  • \(\widetilde{\epsilon}\) - the random component;

The expected value of the random component is zero. We can estimate the systematic component using the OLS estimated parameters: \[ \widehat{\mathbf{Y}} = \widehat{\mathbb{E}}\left(\widetilde{\mathbf{Y}} | \widetilde{\mathbf{X}} \right)= \widetilde{\mathbf{X}} \widehat{\boldsymbol{\beta}} \] \(\widehat{\mathbf{Y}}\) is called the prediction.

3.7.1.1 The Conditional Expectation is The Best Predictor

We begin by outlining the main properties of the conditional moments, which will be useful (assume that \(X\) and \(Y\) are random variables):

  • Law of total expectation: \(\mathbb{E}\left[ \mathbb{E}\left(h(Y) | X \right) \right] = \mathbb{E}\left[h(Y)\right]\);
  • Conditional variance: \(\mathbb{V}{\rm ar} ( Y | X ) := \mathbb{E}\left( (Y - \mathbb{E}\left[ Y | X \right])^2| X\right) = \mathbb{E}( Y^2 | X) - \left(\mathbb{E}\left[ Y | X \right]\right)^2\);
  • Variance of conditional expectation: \(\mathbb{V}{\rm ar} (\mathbb{E}\left[ Y | X \right]) = \mathbb{E}\left[(\mathbb{E}\left[ Y | X \right])^2\right] - (\mathbb{E}\left[\mathbb{E}\left[ Y | X \right]\right])^2 = \mathbb{E}\left[(\mathbb{E}\left[ Y | X \right])^2\right] - (\mathbb{E}\left[Y\right])^2\);
  • Expectation of conditional variance: \(\mathbb{E}\left[ \mathbb{V}{\rm ar} (Y | X) \right] = \mathbb{E}\left[ (Y - \mathbb{E}\left[ Y | X \right])^2 \right] = \mathbb{E}\left[\mathbb{E}\left[ Y^2 | X \right]\right] - \mathbb{E}\left[(\mathbb{E}\left[ Y | X \right])^2\right] = \mathbb{E}\left[ Y^2 \right] - \mathbb{E}\left[(\mathbb{E}\left[ Y | X \right])^2\right]\);
  • Adding the third and fourth properties together gives us: \(\mathbb{V}{\rm ar}(Y) = \mathbb{E}\left[ Y^2 \right] - (\mathbb{E}\left[ Y \right])^2 = \mathbb{V}{\rm ar} (\mathbb{E}\left[ Y | X \right]) + \mathbb{E}\left[ \mathbb{V}{\rm ar} (Y | X) \right]\).
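
As a quick illustration of the last identity, the variance decomposition can be verified by simulation. The sketch below is in Python; the linear-Gaussian choice of DGP (and the seed) is an arbitrary illustrative assumption:

``` python
import numpy as np

np.random.seed(123)
n = 10**6

# DGP: X ~ U(0, 10) and Y | X ~ N(1 + 2 X, 1.5^2)
x = np.random.uniform(0, 10, size=n)
y = 1 + 2 * x + np.random.normal(0, 1.5, size=n)

cond_mean = 1 + 2 * x   # E[Y | X], known analytically for this DGP
cond_var = 1.5**2       # Var(Y | X), constant here

lhs = np.var(y)                      # Var(Y)
rhs = np.var(cond_mean) + cond_var   # Var(E[Y|X]) + E[Var(Y|X)]
print(lhs, rhs)                      # the two values should be close
```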

For simplicity, assume that we are interested in predicting \(\mathbf{Y}\) via the conditional expectation: \[ \widehat{\mathbf{Y}} = \mathbb{E}\left(\mathbf{Y} | \mathbf{X} \right) \] We will show that, in general, the conditional expectation is the best predictor of \(\mathbf{Y}\).

Assume that the best predictor of \(Y\) (a single value), given \(\mathbf{X}\), is some function \(g(\cdot)\), which minimizes the expected squared error: \[ \text{argmin}_{g(\mathbf{X})} \mathbb{E} \left[ (Y - g(\mathbf{X}))^2 \right]. \] Using the conditional moment properties, we can rewrite \(\mathbb{E} \left[ (Y - g(\mathbf{X}))^2 \right]\) as: \[ \begin{aligned} \mathbb{E} \left[ (Y - g(\mathbf{X}))^2 \right] &= \mathbb{E} \left[ (Y + \mathbb{E} [Y|\mathbf{X}] - \mathbb{E} [Y|\mathbf{X}] - g(\mathbf{X}))^2 \right] \\ &= \mathbb{E} \left[ (Y - \mathbb{E} [Y|\mathbf{X}])^2 + 2(Y - \mathbb{E} [Y|\mathbf{X}])(\mathbb{E} [Y|\mathbf{X}] - g(\mathbf{X})) + (\mathbb{E} [Y|\mathbf{X}] - g(\mathbf{X}))^2 \right] \\ &=\mathbb{E} \left[ \mathbb{E}\left((Y - \mathbb{E} [Y|\mathbf{X}])^2 | \mathbf{X}\right)\right] + \mathbb{E} \left[ 2(\mathbb{E} [Y|\mathbf{X}] - g(\mathbf{X}))\mathbb{E}\left[Y - \mathbb{E} [Y|\mathbf{X}] |\mathbf{X}\right] + \mathbb{E} \left[ (\mathbb{E} [Y|\mathbf{X}] - g(\mathbf{X}))^2 | \mathbf{X}\right] \right] \\ &= \mathbb{E}\left[ \mathbb{V}{\rm ar} (Y | \mathbf{X}) \right] + \mathbb{E} \left[ (\mathbb{E} [Y|\mathbf{X}] - g(\mathbf{X}))^2\right], \end{aligned} \] where the middle term vanishes because \(\mathbb{E}\left[Y - \mathbb{E} [Y|\mathbf{X}] |\mathbf{X}\right] = 0\). Taking \(g(\mathbf{X}) = \mathbb{E} [Y|\mathbf{X}]\) minimizes the above expression, reducing it to the expectation of the conditional variance of \(Y\) given \(\mathbf{X}\): \[ \mathbb{E} \left[ (Y - \mathbb{E} [Y|\mathbf{X}])^2 \right] = \mathbb{E}\left[ \mathbb{V}{\rm ar} (Y | \mathbf{X}) \right]. \] Thus, \(g(\mathbf{X}) = \mathbb{E} [Y|\mathbf{X}]\) is the best predictor of \(Y\).
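
This result can also be illustrated numerically: for simulated data, the sample analogue of \(\mathbb{E} \left[ (Y - g(\mathbf{X}))^2 \right]\) is smallest when \(g\) is the conditional mean. In the Python sketch below the quadratic DGP and the competing predictors are arbitrary illustrative choices:

``` python
import numpy as np

np.random.seed(123)
n = 10**6

# DGP: E[Y | X] = X^2 with standard normal errors
x = np.random.uniform(-2, 2, size=n)
y = x**2 + np.random.normal(0, 1, size=n)

predictors = {
    "conditional mean g(X) = X^2": x**2,
    "linear guess g(X) = 1 + X  ": 1 + x,
    "constant guess g(X) = E[Y] ": np.full(n, y.mean()),
}
for name, g in predictors.items():
    print(name, "MSE:", np.mean((y - g)**2))
# The conditional mean attains the smallest MSE, approximately E[Var(Y|X)] = 1
```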

3.7.2 Prediction Intervals

We can define the forecast error as \[ \widetilde{\boldsymbol{e}} = \widetilde{\mathbf{Y}} - \widehat{\mathbf{Y}} = \widetilde{\mathbf{X}} \boldsymbol{\beta} + \widetilde{\boldsymbol{\varepsilon}} - \widetilde{\mathbf{X}} \widehat{\boldsymbol{\beta}} \]

From the distribution of the dependent variable: \[ \mathbf{Y} | \mathbf{X} \sim \mathcal{N} \left(\mathbf{X} \boldsymbol{\beta},\ \sigma^2 \mathbf{I} \right) \] we know that the new observation \(\widetilde{\mathbf{Y}}\) will vary around the mean \(\widetilde{\mathbf{X}} \boldsymbol{\beta}\) with variance \(\sigma^2 \mathbf{I}\).

Furthermore, since \(\widetilde{\boldsymbol{\varepsilon}}\) are independent of \(\mathbf{Y}\), it holds that: \[ \begin{aligned} \mathbb{C}{\rm ov} (\widetilde{\mathbf{Y}}, \widehat{\mathbf{Y}}) &= \mathbb{C}{\rm ov} (\widetilde{\mathbf{X}} \boldsymbol{\beta} + \widetilde{\boldsymbol{\varepsilon}}, \widetilde{\mathbf{X}} \widehat{\boldsymbol{\beta}})\\ &= \mathbb{C}{\rm ov} (\widetilde{\boldsymbol{\varepsilon}}, \widetilde{\mathbf{X}} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{Y})\\ &= 0 \end{aligned} \] We again highlight that \(\widetilde{\boldsymbol{\varepsilon}}\) are shocks in \(\widetilde{\mathbf{Y}}\), which is some other realization from the DGP that is different from \(\mathbf{Y}\) (which has shocks \(\boldsymbol{\varepsilon}\), and was used when estimating parameters via OLS).

Because of this, the variance of the forecast error is (assuming that \(\mathbf{X}\) and \(\widetilde{\mathbf{X}}\) are fixed): \[ \begin{aligned} \mathbb{V}{\rm ar}\left( \widetilde{\boldsymbol{e}} \right) &= \mathbb{V}{\rm ar}\left( \widetilde{\mathbf{Y}} - \widehat{\mathbf{Y}} \right) \\ &= \mathbb{V}{\rm ar}\left( \widetilde{\mathbf{Y}} \right) - \mathbb{C}{\rm ov} (\widetilde{\mathbf{Y}}, \widehat{\mathbf{Y}}) - \mathbb{C}{\rm ov} ( \widehat{\mathbf{Y}}, \widetilde{\mathbf{Y}})+ \mathbb{V}{\rm ar}\left( \widehat{\mathbf{Y}} \right) \\ &= \mathbb{V}{\rm ar}\left( \widetilde{\mathbf{Y}} \right) + \mathbb{V}{\rm ar}\left( \widehat{\mathbf{Y}} \right)\\ &= \sigma^2 \mathbf{I} + \widetilde{\mathbf{X}} \sigma^2 \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \widetilde{\mathbf{X}}^\top \\ &= \sigma^2 \left( \mathbf{I} + \widetilde{\mathbf{X}} \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \widetilde{\mathbf{X}}^\top\right) \end{aligned} \]

Note that our prediction interval is affected not only by the variance of the true \(\widetilde{\mathbf{Y}}\) (due to random shocks), but also by the variance of \(\widehat{\mathbf{Y}}\) (since coefficient estimates, \(\widehat{\boldsymbol{\beta}}\), are generally imprecise and have a non-zero variance), i.e. it combines the uncertainty coming from the parameter estimates and the uncertainty coming from the randomness in a new observation.

Hence, a prediction interval will be wider than a confidence interval. In practice, we replace \(\sigma^2\) with its estimator \(\widehat{\sigma}^2 = \dfrac{1}{N-2} \sum_{i = 1}^N \widehat{\epsilon}_i^2\).

Let \(\text{se}(\widetilde{e}_i) = \sqrt{\widehat{\mathbb{V}{\rm ar}} (\widetilde{e}_i)}\) be the square root of the corresponding \(i\)-th diagonal element of \(\widehat{\mathbb{V}{\rm ar}} (\widetilde{\boldsymbol{e}})\). This is also known as the standard error of the forecast. Then, the \(100 \cdot (1 - \alpha) \%\) prediction interval can be calculated as: \[ \widehat{Y}_i \pm t_{(1 - \alpha/2, N-2)} \cdot \text{se}(\widetilde{e}_i) \]

Example 3.27 We will generate data from a univariate linear regression with \(\beta_0 = 2\), \(\beta_1 = 0.4\), \(N = 100\) and \(X\) an equally spaced sequence in the interval \(\left[0, 20 \right]\).
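
A minimal sketch of this DGP in Python (using numpy; the random seed and the choice \(\epsilon \sim \mathcal{N}(0, 1)\) are illustrative assumptions, since the example does not fix the error variance):

``` python
import numpy as np

np.random.seed(123)

beta_0, beta_1 = 2, 0.4
N = 100

x = np.linspace(start=0, stop=20, num=N)        # equally spaced X in [0, 20]
e = np.random.normal(loc=0, scale=1, size=N)    # epsilon ~ N(0, 1) (assumed variance)
y = beta_0 + beta_1 * x + e
```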

Next, we will estimate the coefficients and their standard errors:
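
For instance, with statsmodels (an illustrative choice of library), the coefficients and their standard errors can be obtained as follows:

``` python
import statsmodels.api as sm

X = sm.add_constant(x)          # design matrix with an intercept column
lm_fit = sm.OLS(y, X).fit()

print(lm_fit.params)            # estimates of beta_0 and beta_1
print(lm_fit.bse)               # their standard errors
```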

For simplicity, assume that we will predict \(Y\) for the existing values of \(X\):
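
Continuing the sketch, the point predictions and the standard errors of the forecast can be computed directly from the variance formula above (assuming numpy and scipy):

``` python
from scipy import stats

X_new = X                                    # predict at the in-sample X values
y_hat = X_new @ lm_fit.params                # point predictions: X_new * beta_hat

# Forecast-error variance: sigma^2 * (I + X_new (X'X)^{-1} X_new'), diagonal only
sigma2_hat = np.sum(lm_fit.resid**2) / (N - 2)          # equivalently lm_fit.scale
XtX_inv = np.linalg.inv(X.T @ X)
se_forecast = np.sqrt(sigma2_hat * (1 + np.sum((X_new @ XtX_inv) * X_new, axis=1)))

t_c = stats.t.ppf(1 - 0.05 / 2, df=N - 2)    # critical value for a 95% interval
pi_lower = y_hat - t_c * se_forecast
pi_upper = y_hat + t_c * se_forecast
```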

Just like for the confidence intervals, we can get the prediction intervals from the built-in functions:
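
For example, in statsmodels the `get_prediction()` method returns both the point predictions and the interval bounds; the `obs_ci_*` columns of `summary_frame()` contain the prediction intervals, which should match the manual calculation above:

``` python
pred = lm_fit.get_prediction(X_new)
pred_frame = pred.summary_frame(alpha=0.05)

# 'obs_ci_lower' / 'obs_ci_upper' hold the 95% prediction interval bounds
print(pred_frame[["mean", "obs_ci_lower", "obs_ci_upper"]].head())
```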

3.7.3 Confidence Intervals vs Prediction Intervals

Confidence intervals tell you how well you have determined the mean. Assume that the data really are randomly sampled from a Gaussian distribution. If you sample the data many times and calculate a \(95\%\) confidence interval of the mean from each sample, you’d expect about \(95\%\) of those intervals to include the true value of the population mean. The key point is that the confidence interval tells you about the likely location of the true population parameter.

Prediction intervals tell you where you can expect to see the next data point sampled. Assume that the data really are randomly sampled from a Gaussian distribution. Collect a sample of data and calculate a \(95\%\) prediction interval. Then sample one more value from the population. If you do this many times, you’d expect that next value to lie within the prediction interval in \(95\%\) of the samples. The key point is that the prediction interval tells you about the distribution of values, not the uncertainty in determining the population mean.

Prediction intervals are conceptually related to confidence intervals, but they are not the same. A prediction interval relates to a realization (which has not yet been observed, but will be observed in the future), whereas a confidence interval pertains to a parameter (which is in principle not observable, e.g., the population mean).

Another way to look at it is that a prediction interval is the confidence interval for an observation (as opposed to the mean), which includes an estimate of the error. A confidence interval gives a range for \(\mathbb{E} (\boldsymbol{Y}|\boldsymbol{X})\), whereas a prediction interval gives a range for \(\boldsymbol{Y}\) itself. Since our best guess for predicting \(\boldsymbol{Y}\) is \(\widehat{\mathbf{Y}} = \widehat{\mathbb{E}} (\boldsymbol{Y}|\boldsymbol{X})\), both the confidence interval and the prediction interval will be centered around \(\widetilde{\mathbf{X}} \widehat{\boldsymbol{\beta}}\), but the prediction interval will be wider than the confidence interval.


Example 3.28 We will continue our previous example and calculate the confidence interval for the mean response value for the same values of \(X\) that we estimated the regression model on.
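
Using the same fitted model as before, one way to obtain both intervals (again with statsmodels, as an illustration) is via `conf_int()` of the prediction results, where `obs=False` gives the confidence interval for the mean response and `obs=True` gives the prediction interval:

``` python
pred = lm_fit.get_prediction(X)                  # same in-sample X values as before
ci_mean = pred.conf_int(obs=False, alpha=0.05)   # 95% CI for the mean response E(Y|X)
pi_obs = pred.conf_int(obs=True, alpha=0.05)     # 95% prediction interval for Y

# The prediction interval is wider than the confidence interval at every X value
print(((pi_obs[:, 1] - pi_obs[:, 0]) >= (ci_mean[:, 1] - ci_mean[:, 0])).all())
```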

Prediction intervals must account for both: (i) the uncertainty of the population mean; (ii) the randomness (i.e. scatter) of the data. So, a prediction interval is always wider than a confidence interval.

In the time series context, prediction intervals are known as forecast intervals.

3.7.4 Prediction Intervals when \(Y\) is Transformed

We will examine the following exponential model: \[ Y = \exp(\beta_0 + \beta_1 X + \epsilon) \] which we can rewrite as a log-linear model: \[ \log(Y) = \beta_0 + \beta_1 X + \epsilon \]

Having estimated the log-linear model, we are interested in the predicted value \(\widehat{Y}\). Unfortunately, our specification only allows us to directly calculate the prediction of the log of \(Y\), \(\widehat{\log(Y)}\). Nevertheless, we can obtain the predicted values by taking the exponent of the prediction, namely: \[ \widehat{Y} = \exp \left(\widehat{\log(Y)} \right) = \exp \left(\widehat{\beta}_0 + \widehat{\beta}_1 X\right) \] Having obtained the point predictor \(\widehat{Y}\), we may be further interested in calculating the prediction (or forecast) intervals of \(\widehat{Y}\). In order to do so, we apply the same technique as for the point predictor - we estimate the prediction intervals for \(\widehat{\log(Y)}\) and take their exponent.

Then, a \(100 \cdot (1 - \alpha)\%\) prediction interval for \(Y\) is: \[ \left[ \exp\left(\widehat{\log(Y)} - t_c \cdot \text{se}(\widetilde{e}_i) \right);\quad \exp\left(\widehat{\log(Y)} + t_c \cdot \text{se}(\widetilde{e}_i) \right)\right] \] or more compactly, \(\left[ \exp\left(\widehat{\log(Y)} \pm t_c \cdot \text{se}(\widetilde{e}_i) \right)\right]\).

Example 3.29 Let \(\beta_0 = 0.2\), \(\beta_1 = -1.8\), \(N = 1000\) and \(\epsilon \sim \mathcal{N}(0,(0.2)^2)\).
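A possible simulation of this DGP in Python (the range of \(X\) is not specified in the example, so the equally spaced sequence in \([0, 1]\) below is an assumption, as is the random seed):

``` python
import numpy as np

np.random.seed(123)

beta_0, beta_1 = 0.2, -1.8
N = 1000

x = np.linspace(start=0, stop=1, num=N)           # assumed range for X
e = np.random.normal(loc=0, scale=0.2, size=N)    # epsilon ~ N(0, 0.2^2)
y = np.exp(beta_0 + beta_1 * x + e)               # exponential (log-linear) DGP
```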

We estimate the model via OLS and calculate the predicted values \(\widehat{\log(Y)}\):
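
Continuing the sketch (with statsmodels as an illustrative choice):

``` python
import statsmodels.api as sm

X = sm.add_constant(x)
log_fit = sm.OLS(np.log(y), X).fit()     # regression of log(Y) on X

log_y_hat = log_fit.fittedvalues         # predicted values of log(Y)
print(log_fit.params)
```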

We can plot \(\widehat{\log(Y)}\) along with their prediction intervals:
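
For example, with matplotlib (an illustrative choice), the fitted values and the \(95\%\) prediction intervals of \(\widehat{\log(Y)}\) can be plotted as follows:

``` python
import matplotlib.pyplot as plt

log_pred = log_fit.get_prediction(X).summary_frame(alpha=0.05)

plt.scatter(x, np.log(y), s=5, alpha=0.3, label="log(Y)")
plt.plot(x, log_pred["mean"], color="red", label="predicted log(Y)")
plt.plot(x, log_pred["obs_ci_lower"], "r--", label="95% prediction interval")
plt.plot(x, log_pred["obs_ci_upper"], "r--")
plt.legend()
plt.show()
```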

Finally, we take the exponent of \(\widehat{\log(Y)}\) and the prediction interval to get the predicted value and \(95\%\) prediction interval for \(\widehat{Y}\):
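
Continuing the sketch, exponentiating the point predictions and the interval bounds gives the predicted values and their \(95\%\) prediction intervals on the original scale of \(Y\):

``` python
y_hat = np.exp(log_pred["mean"].to_numpy())              # natural predictor of Y
pi_lower = np.exp(log_pred["obs_ci_lower"].to_numpy())   # lower 95% prediction bound
pi_upper = np.exp(log_pred["obs_ci_upper"].to_numpy())   # upper 95% prediction bound

plt.scatter(x, y, s=5, alpha=0.3, label="Y")
plt.plot(x, y_hat, color="red", label="predicted Y")
plt.plot(x, pi_lower, "r--", label="95% prediction interval")
plt.plot(x, pi_upper, "r--")
plt.legend()
plt.show()
```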

Alternatively, notice that for the log-linear (and similarly for the log-log) model: \[ \begin{aligned} Y &= \exp(\beta_0 + \beta_1 X + \epsilon) \\ &= \exp(\beta_0 + \beta_1 X) \cdot \exp(\epsilon)\\ &= \mathbb{E}(Y|X)\cdot \exp(\epsilon) \end{aligned} \] the prediction is comprised of the systematic and the random components, but they are multiplicative, rather than additive.

Therefore, we can use the properties of the log-normal distribution to derive an alternative, corrected prediction for the log-linear model: \[ \widehat{Y}_{c} = \widehat{\mathbb{E}}(Y|X) \cdot \exp(\widehat{\sigma}^2/2) = \widehat{Y}\cdot \exp(\widehat{\sigma}^2/2) \] This follows because, if \(\epsilon \sim \mathcal{N}(\mu, \sigma^2)\), then \(\exp(\epsilon)\) is log-normally distributed with \(\mathbb{E}(\exp(\epsilon)) = \exp(\mu + \sigma^2/2)\) and \(\mathbb{V}{\rm ar}(\exp(\epsilon)) = \left[ \exp(\sigma^2) - 1 \right] \exp(2 \mu + \sigma^2)\).

For larger sample sizes \(\widehat{Y}_{c}\) is closer to the true mean than \(\widehat{Y}\). On the other hand, in smaller samples \(\widehat{Y}\) performs better than \(\widehat{Y}_{c}\). Finally, it also depends on the scale of \(X\). In our case:
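
Continuing the sketch, a simple way to compare the natural and the corrected predictors on the simulated sample, using the residual variance estimate from the log-scale model as \(\widehat{\sigma}^2\):

``` python
sigma2_hat = log_fit.scale                  # sigma^2 estimate from the log-scale model
y_hat_c = y_hat * np.exp(sigma2_hat / 2)    # corrected predictor

# Compare both predictors against the simulated sample
print("natural predictor MSE:  ", np.mean((y - y_hat)**2))
print("corrected predictor MSE:", np.mean((y - y_hat_c)**2))
```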

The difference between the corrected and the natural predictor becomes more noticeable as the variance of the sample \(Y\) increases. Because \(\exp(0) = 1 \leq \exp(\widehat{\sigma}^2/2)\), the corrected predictor is always at least as large as the natural predictor: \(\widehat{Y}_c \geq \widehat{Y}\). Furthermore, this correction assumes that the errors have a normal distribution (i.e. that (UR.4) holds).

The same ideas apply when we examine a log-log model.