3.3 Regression Models and Interpretation

This chapter presents the most common forms of regression models, along with possible interpretations of their coefficients. Note that these models are presented for the univariate case but can be extended analogously to the multivariate case, as will be seen in later chapters.

3.3.1 Inclusion of the Constant Term in the Regression

In some cases we want to impose a restriction that if \(X = 0\), then \(Y = 0\) as well. An example could be the relationship between income (\(X\)) and income tax revenue (\(Y\)) - if there is no income, \(X = 0\), then the expected revenue from the taxes would also be zero - \(\mathbb{E}(Y | X = 0) = 0\).

Formally, we now choose a slope estimator, \(\beta_1\), from the following regression model: \[ Y_i = \beta_1 X_i + \epsilon_i,\ i = 1,...,N \] which is called a regression through the origin, because the conditional expected value: \[ \mathbb{E} \left(Y_i | X_i \right) = \beta_1 X_i \] of the regression passes through the origin point \(X = 0\), \(Y = 0\). We can obtain the estimate of the slope parameter via OLS by minimizing the sum of squared residuals: \[ \begin{aligned} \text{RSS} &= \sum_{i = 1}^N \widehat{\epsilon}_i^2 = \sum_{i = 1}^N \left( Y_i - \widehat{\beta}_1 X_i \right)^2 \\ \dfrac{\partial \text{RSS}}{\partial \widehat{\beta}_1} &= -2\sum_{i = 1}^N X_i\left( Y_i - \widehat{\beta}_1 X_i \right) = 0 \end{aligned} \] which leads to the following expression: \[ \begin{aligned} \widehat{\beta}_1 = \dfrac{\sum_{i = 1}^N X_iY_i}{\sum_{i = 1}^N X_i^2} \end{aligned} \]
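
As a quick illustration, the closed-form estimator above can be computed directly. The following minimal Python sketch (the simulated data and variable names are purely illustrative) compares the manual calculation with the estimate from a no-intercept OLS fit:

import numpy as np
import statsmodels.api as sm

np.random.seed(42)
x = np.linspace(1, 20, num=50)
y = 0.5 * x + np.random.normal(loc=0, scale=1, size=len(x))
# Manual estimate: sum(X * Y) / sum(X^2)
beta_1_manual = np.sum(x * y) / np.sum(x ** 2)
# The same estimate via OLS without a constant term
beta_1_ols = sm.OLS(y, x).fit().params[0]
print(beta_1_manual, beta_1_ols)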

So, it is possible to specify a regression without a constant term, but should we opt for it? The constant \(\beta_0\) can be interpreted as the mean value of \(Y\) when all predictor variables are set to zero. However, if the predictors cannot be zero, then the constant has no meaningful interpretation. Furthermore, even if all other predictors can take zero values, the constant may still be uninterpretable, since the intercept \(\beta_0\) acts as a sort of garbage collector for the regression model. The reason for this is the underlying assumption that the expected value of the residuals is zero. The regression is estimated so that the mean of the residuals is (very close to) zero, which means that any bias that is not accounted for by the model is collected in the intercept \(\beta_0\).

In general, inclusion of a constant in the regression model ensures that the model's residuals have a mean of zero; otherwise the estimated coefficients may be biased. Consequently, and as is often the case, without knowing the true underlying model it is generally not worth interpreting the regression constant.

3.3.2 Linear Regression Models

The regression model that we examined up to now is called a simple linear regression model. Looking at the univariate regression: \[ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} \] where \(\boldsymbol{\beta} = (\beta_0,\ \beta_1)^\top\). When we say linear regression, we mean linear in the parameters \(\boldsymbol{\beta}\). There are no restrictions on transformations of \(X\) and \(Y\), as long as the parameters enter the equation linearly. For example, we can use \(\log (X)\) and \(\log (Y)\), or \(\sqrt{X}\) and \(\sqrt{Y}\), etc. in the univariate regression. While transforming \(X\) and \(Y\) does not affect the linear regression specification itself, the interpretation of the coefficients depends on the transformation of \(X\) and \(Y\).

On the other hand, there are regression models which are not regarded as linear, since they are not linear in their parameters: \[ Y_i = \dfrac{1}{\beta_0 + \beta_1 X_i} + \epsilon_i,\ i = 1,...,N \] Furthermore, the estimation of such models is a separate topic, covered under nonlinear regression models.

3.3.2.1 Effects of Changing the Measurement Units

Generally, it is easy to figure out what happens to the intercept, \(\beta_0\), and slope, \(\beta_1\), estimates when the units of measurement are changed for the dependent variable, \(Y\):

\[ \widetilde{Y} = c \cdot Y = c \cdot \left( \beta_0 + \beta_1 X + \epsilon \right) = (c \cdot \beta_0) + (c \cdot \beta_1) X + (c \cdot \epsilon) \] In other words, if \(Y\) is multiplied by a constant \(c\), then the OLS estimates of \(\widetilde{Y} = c \cdot Y\) are \(\widetilde{\beta}_0 = c \cdot \beta_0\) and \(\widetilde{\beta}_1 = c \cdot \beta_1\).

We will generate a univariate regression with:

  • \(\beta_0 = 1\), \(\beta_1 = 0.5\)
  • \(X = 1, ..., 51\), \(\epsilon \sim \mathcal{N}(0, 1^2)\)
  • \(c = 10\)

The new variable \(Y^{new} = Y \cdot c\) (i.e. assume that we have \(Y\) in kilograms and we want to transform it into grams) and the data can be generated as follows:
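
A minimal Python sketch of this data generating process (the seed is arbitrary, so the exact estimates in the outputs further below, which come from the original sample, will differ slightly):

import numpy as np

np.random.seed(123)
beta_0, beta_1, c = 1, 0.5, 10
x = np.arange(1, 52)                                  # X = 1, ..., 51
e = np.random.normal(loc=0, scale=1, size=len(x))
y = beta_0 + beta_1 * x + e
y_new = c * y                                         # Y measured in the new units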

The data sample plots are:

and the parameters can be estimated as before:
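
A sketch of the estimation with statsmodels, continuing the code above (the printed values in the outputs below come from the original sample):

import statsmodels.api as sm

X = sm.add_constant(x)
print(sm.OLS(y, X).fit().params)                      # coefficients for the original Y
print(sm.OLS(y_new, X).fit().params)                  # coefficients for the scaled Y_new = c * Y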

The estimated OLS coefficients using the dependent variable \(Y\) are as we expected. If we scale our dependent variable and estimate the coefficients using \(Y^{new}\), then the parameters increase proportionally:

## (Intercept)           x 
##    8.337401    5.047895
## [8.38536209 5.09922195]

Note that the variance of \(c \cdot \epsilon\) is now \(c^2 \sigma^2\):
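
Continuing the sketch, this can be checked by comparing the residual variances of the two fitted models:

resid_y = sm.OLS(y, X).fit().resid
resid_y_new = sm.OLS(y_new, X).fit().resid
# The residual variance of the scaled model is c^2 times larger
print(np.var(resid_y, ddof=2))
print(np.var(resid_y_new, ddof=2))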

## 0.9570628350921718
## 95.70628350921723

This assumes that nothing changes in the scaling of the independent variable \(X\) used in OLS estimation.

If we change the units of measurement of an independent variable, \(X\), then only the slope, \(\beta_1\), (i.e. the coefficient of that independent variable) changes:

\[ Y = \beta_0 + \beta_1 X + \epsilon = \beta_0 + \left( \dfrac{\beta_1}{c} \right) \left( c \cdot X \right) + \epsilon \] In other words, if \(X\) is multiplied by a constant \(c\), then the OLS estimates of \(Y\) are \(\beta_0\) and \(\widetilde{\beta}_1 = \dfrac{\beta_1}{c}\).

We can verify that this is the case with our empirical data sample by creating new variables \(X^{(1)} = X \cdot c\) and \(X^{(2)} = X / c\)
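
Continuing the same sketch, the scaled regressors can be constructed and the two models re-estimated:

x1 = x * c                                            # X^(1) = X * c
x2 = x / c                                            # X^(2) = X / c
print(sm.OLS(y, sm.add_constant(x1)).fit().params)
print(sm.OLS(y, sm.add_constant(x2)).fit().params)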

## (Intercept)          x1 
##  0.83374010  0.05047895
## [0.83853621 0.05099222]
## (Intercept)          x2 
##   0.8337401   5.0478946
## [0.83853621 5.09922195]
Furthermore, if we scale both \(X\) and \(Y\) by the same constant:

\[ \widetilde{Y} = c \cdot Y = c \cdot \left( \beta_0 + \left( \dfrac{\beta_1}{c} \right) \left( c \cdot X \right) + \epsilon \right) = (c \cdot \beta_0) + \beta_1 \left( c \cdot X \right) + (c \cdot \epsilon) \] In other words, if both \(Y\) and \(X\) are multiplied by the same constant \(c\), then the OLS estimates for the intercept change to \(\widetilde{\beta_0} = c \cdot \beta_0\) but remain the same for the slope \(\beta_1\).

## (Intercept)          x1 
##   8.3374010   0.5047895
## [8.38536209 0.5099222 ]
Finally, if we scale \(Y\) by one constant and \(X\) by a different constant:

\[ \widetilde{Y} = a \cdot Y = a \cdot \left( \beta_0 + \left( \dfrac{\beta_1}{c} \right) \left( c \cdot X \right) + \epsilon \right) = (a \cdot \beta_0) + \left(\dfrac{a}{c}\cdot\beta_1 \right) \left( c \cdot X \right) + (a \cdot \epsilon) \] In other words, if \(Y\) is multiplied by a constant \(a\) and \(X\) is multiplied by a constant \(c\), then the OLS estimates for the intercept change to \(\widetilde{\beta}_0 = a \cdot \beta_0\) but for the slope they change to \(\widetilde{\beta}_1 = \dfrac{a}{c}\cdot \beta_1\).

## (Intercept)           x 
##   0.8337401   0.5047895
## [0.83853621 0.5099222 ]
## (Intercept)          x1 
##   4.1687005   0.2523947
## [4.19268105 0.2549611 ]

Usually, changing the units of measurement is referred to as data scaling. In all cases, after scaling the data the standard errors of the scaled regression coefficients change as well. However, as we will see later, this does not affect any test statistics related to the coefficients, or the model accuracy.

In multiple regression analysis this means that we can include variables with different units of measurement in the same regression without affecting the statistical accuracy of our model - e.g. \(Y\) is measured in thousands, \(X_1\) is measured in thousands and \(X_2\) is measured in single units (or millions, etc.).

3.3.2.2 Interpretation of the Parameters

In our univariate regression: \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \] \(\beta_1\) shows the amount by which the expected value of \(Y\) (remember that \(\mathbb{E}(Y_i|X_i) = \beta_0 + \beta_1 X_i\)) changes (either increases, or decreases), when \(X\) increases by 1 unit. This can be verified by specifying two cases - with \(X = x\) and \(X = x + 1\): \[ \begin{aligned} \mathbb{E} (Y | X = x) &= \widetilde{Y} = \beta_0 + \beta_1 x \\ \mathbb{E} (Y | X = x + 1) &= \widetilde{\widetilde{Y}} =\beta_0 + \beta_1 (x + 1) \end{aligned} \] then, taking the difference yields: \[ \begin{aligned} \widetilde{\widetilde{Y}} - \widetilde{Y} &= \beta_0 + \beta_1 (x + 1) - \beta_0 - \beta_1 x = \beta_1 \end{aligned} \]

For example, if \(X\) is in thousands of dollars, then \(\beta_1\) shows the amount that the expected value of \(Y\) changes, when \(X\) increases by one thousand.

Generally, \(\beta_0\) can be regarded as the expected value of \(Y\) with \(X = 0\). As mentioned previously, interpreting the intercept \(\beta_0\) is tricky not only because it usually accounts for bias in our specified model (in comparison to the true unknown model), but also because we may not have enough data when \(X = 0\) (as mentioned earlier, maybe \(X\) cannot obtain a zero value at all). In such cases, the estimated intercept is not a good approximation of reality in that region and we should avoid any interpretations of \(\beta_0\).

Alternatively, we can say that:

The defining feature of a univariate linear regression is that the change in (the expected value of) \(Y\) is equal to the change in \(X\) multiplied by \(\beta_1\). So, the marginal effect of \(X\) on \(Y\) is constant and equal to \(\beta_1\): \[ \Delta Y = \beta_1 \Delta X \] or alternatively: \[ \beta_1 = \dfrac{\Delta Y}{\Delta X} = \dfrac{\Delta \mathbb{E} (Y | X)}{\Delta X} = \dfrac{d \mathbb{E} (Y | X)}{d X} =: \text{slope} \] where \(d\) denotes the derivative of the expected value of \(Y\) with respect to \(X\). As we can see, in the linear regression case, the derivative is simply the slope of the regression line, \(\beta_1\).

Note that in this case we say that the marginal effect of \(X\) on \(Y\) is constant, because a one-unit change in \(X\) results in the same change in \(Y\), regardless of the initial value of \(X\).

Below we provide a graphical interpretation with \(\beta_0 = 10\), \(\beta_1 = 5\), \(\epsilon \sim \mathcal{N}(0, 5^2)\) and \(X\) from an evenly spaced set between \(0\) and \(10\), with \(N = 100\).

import numpy as np
import matplotlib.pyplot as plt
#
np.random.seed(123)
N = 100
beta_0 = 10
beta_1 = 5
x = np.linspace(start = 0, stop = 10, num = N)
e = np.random.normal(loc = 0, scale = 5, size = len(x))
y = beta_0 + beta_1 * x + e
# Conditional expectation of Y, given X
E_y = beta_0 + beta_1 * x
# Plot the data
_ = plt.figure(num = 1, figsize = (10, 8))
_ = plt.plot(x, y, linestyle = "None", marker = "o", 
            label = "$Y = \\beta_0 + \\beta_1 X + \\epsilon$",
            markerfacecolor = 'None', markeredgecolor = "darkgreen")
_ = plt.margins(x = 0)
_ = plt.ylim([0, max(y)])
# Plot the conditional expectation E(Y|X)
_ = plt.plot(x, E_y, linestyle = "-", color = "blue", 
            label='$E(Y|X) = \\beta_0 + \\beta_1 X$')
# plot beta_0
_ = plt.annotate("", xy = (0.5, E_y[0] / 2), 
             xytext = (0.2, E_y[0] / 2),
             arrowprops = dict(arrowstyle = "]-, widthA=3.5,lengthA=1",
                               connectionstyle = "arc", color='red'))
_ = plt.text(0.5, E_y[0] / 2, '$\\beta_0$, intercept', fontsize = 10)
# plot beta_1
_ = plt.plot([4, 5], [beta_0 + beta_1 * 4]*2, 
            linestyle = "--", color = "red")
_ = plt.plot([5, 5], beta_0 + beta_1 * np.array([4, 5]), 
            linestyle = "--", color = "red")
_ = plt.annotate("", xy = (5.5, beta_0 + beta_1 * 4.5), 
             xytext = (5.2, beta_0 + beta_1 * 4.5),
             arrowprops = dict(arrowstyle = "]-, widthA=1.8,lengthA=1",
                               connectionstyle = "arc", color='red'))
_ = plt.text(4, beta_0 + beta_1 * 3.3, 'unit increase in $X$', fontsize = 10)
_ = plt.annotate("", xy = (4.5, beta_0 + beta_1 * 3.5), 
             xytext = (4.5, beta_0 + beta_1 * 3.8),
             arrowprops = dict(arrowstyle = "]-, widthA=2.8,lengthA=1",
                               connectionstyle = "arc", color='red'))
_ = plt.text(5.5, beta_0 + beta_1 * 4.5, '$\\beta_1$, slope', fontsize = 10)
_ = plt.legend()
plt.show()

In econometrics, an even more popular characteristic is the rate of change. The proportionate (or relative) change of \(Y\), moving from \(Y\) to \(\widetilde{Y}\), is defined as the change in \(Y\) divided by its initial value: \[ \dfrac{\widetilde{Y} - Y}{Y} = \dfrac{\Delta Y}{Y},\ Y \neq 0 \] Usually, we measure changes in terms of percentages - a percentage change in \(Y\), from \(Y\) to \(\widetilde{Y}\), is defined as the proportionate change multiplied by 100: \[ \% \Delta Y = 100 \cdot \dfrac{\Delta Y}{Y} \approx 100 \cdot \Delta \log(Y) \]

Note: the approximate equality to the logarithm is useful for interpreting the coefficients when modelling \(\log (Y)\), instead of \(Y\).

Consequently we have that:

The elasticity of a variable \(Y\) with respect to \(X\) is defined as the percentage change in \(Y\) corresponding to a \(1\%\) increase in \(X\): \[ \eta = \eta(Y|X) = \dfrac{\% \Delta Y}{\% \Delta X} = \dfrac{100\cdot\dfrac{\Delta Y}{Y}}{100\cdot\dfrac{\Delta X}{X}} = \dfrac{\Delta Y}{\Delta X} \cdot \dfrac{X}{Y} \] So, the elasticity of the expected value of \(Y\) with respect to \(X\) is: \[ \eta = \dfrac{d \mathbb{E} (Y | X)}{d X}\cdot \dfrac{X}{\mathbb{E} (Y | X)} \] In the univariate regression case we have that: \[ \eta = \beta_1 \cdot \dfrac{X}{\mathbb{E} (Y | X)} \] In practice, in a linear model the elasticity is different at each point \((X_i, Y_i)\), \(i = 1,...,N\). Most commonly, the elasticity is estimated by substituting the sample means of \(X\) and \(Y\), i.e.: \[ \widehat{\eta} = \widehat{\beta}_1 \cdot \dfrac{\overline{X}}{\overline{Y}} \] The interpretation is that a \(1\%\) increase in \(X\) will yield, on average, a \(\widehat{\eta}\) percent (i.e. \(\% \widehat{\eta}\)) increase/decrease in \(Y\).
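
For example, given a fitted univariate model, the elasticity at the point of sample means could be computed as follows (a minimal sketch; x and y are assumed to be numpy arrays holding the sample data):

import numpy as np
import statsmodels.api as sm

# x and y are assumed to already contain the sample data
beta_hat = sm.OLS(y, sm.add_constant(x)).fit().params
eta_hat = beta_hat[1] * np.mean(x) / np.mean(y)
print(eta_hat)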

To reiterate - \(\eta\) shows a percentage and not a unit change in \(Y\) corresponding to a \(1\%\) change in \(X\).

If the elasticity is less than one, we say that \(Y\) is inelastic with respect to \(X\). If the elasticity is greater than one, we say that \(Y\) is elastic with respect to \(X\).

3.3.3 Nonlinearities in a Linear Regression

Economic variables are not always related by a straight-line relationship. In a simple linear regression the marginal effect of \(X\) on \(Y\) is constant, which is not realistic in many economic relationships. For example, consider how the price of a house (\(Y\)) relates to its size (\(X\)): in a linear specification, the expected price of an additional square foot is constant. However, it is possible that more expensive homes have a higher value for each additional square foot, compared to smaller, less expensive homes.

Fortunately, the linear regression model \(\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}\) is quite flexible - the variables \(\mathbf{Y}\) and \(\mathbf{X}\) can be transformed via logarithms, squares, cubes, or even so called indicator variables. All of these transformations can be used to account for a nonlinear relationship between the variables \(\mathbf{Y}\) and \(\mathbf{X}\) (but still expressed as a linear regression in terms of parameters \(\boldsymbol{\beta}\)).

Finally, since models are linear in parameters, we can use the usual OLS estimates. In other words:

If we have a linear regression with transformed variables: \[ f(Y_i) = \beta_0 + \beta_1 \cdot g(X_i) + \epsilon_i,\quad i = 1,...,N \] where \(f(\cdot)\) and \(g(\cdot)\) are some transformations of the initial values of \(Y\) and \(X\), then we can rewrite it in matrix notation: \[ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} \] where \(\mathbf{Y} =\left[ f(Y_1), ..., f(Y_N) \right]^\top\), \(\boldsymbol{\varepsilon} = [\epsilon_1,...,\epsilon_N]^\top\), \(\boldsymbol{\beta} =[ \beta_0, \beta_1]^\top\) and \(\mathbf{X} = \begin{bmatrix} 1 & g(X_1) \\ 1 & g(X_2) \\ \vdots & \vdots \\ 1 & g(X_N) \end{bmatrix}\).

This allows us to estimate the unknown parameters via OLS: \[ \widehat{\boldsymbol{\beta}} = \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{Y} \]

3.3.3.1 Quadratic Regression Model

The quadratic regression model: \[ Y = \beta_0 + \beta_1 X^2 + \epsilon \] is a parabola, where \(\beta_0\) is the intercept and \(\beta_1\) is the shape parameter of the curve: if \(\beta_1 > 0\), then the curve is \(\cup\)-shaped (convex); if \(\beta_1 < 0\), then the curve is \(\cap\)-shaped (concave).

Because of the nonlinearity in \(X\), our unit change in \(X\) effect on \(Y\) now depends on the initial value of \(X\). If, as before, we take \(X = x\) and \(X = x + 1\): \[ \begin{aligned} \mathbb{E} (Y | X = x) = \widetilde{Y} &= \beta_0 + \beta_1 x^2 \\ \mathbb{E} (Y | X = x + 1) = \widetilde{\widetilde{Y}} &=\beta_0 + \beta_1 (x + 1)^2 \\ &= \beta_0 + \beta_1 x^2 + \beta_1 \cdot(2x + 1) \end{aligned} \] so, the difference: \[ \widetilde{\widetilde{Y}} - \widetilde{Y} = \beta_1 \cdot(2x + 1) \] now depends on the initial value of \(x\) - the larger the initial value, the more pronounced the change in \(Y\) will be.

The slope of the quadratic regression is: \[ \text{slope} = \dfrac{d \mathbb{E} (Y | X)}{d X} = 2 \beta_1 X \] which changes as \(X\) changes. For large values of \(X\), the slope will be larger, for \(\beta_1 > 0\) (or smaller, if \(\beta_1 < 0\)), and would have a more pronounced change in \(Y\) for a unit increase in \(X\), compared to smaller values of \(X\). Note that, unlike the simple linear regression, in this case \(\beta_1\) is no longer the slope.

The elasticity (i.e. the percentage change in \(Y\), given a \(1\%\) change in \(X\)) is: \[ \eta = 2 \beta_1 X \cdot \dfrac{X}{Y} = \dfrac{2 \beta_1 X^2}{Y} \] A common approach is to choose a point on the fitted relationship, i.e. select a value of \(X\) and the corresponding fitted value \(\widehat{Y}\), and estimate \(\widehat{\eta}(\widehat{Y} | X) = 2 \widehat{\beta_1} X^2 / \widehat{Y}\). Note that we could use \(( \overline{Y}, \overline{X})\) since it is on the regression curve. However, we may be interested in seeing how the elasticity changes at different values of \(X\) - when \(X\) is small, when \(X\) is large, etc.

If we have a random sample of size \(N\) and specify: \[ \mathbf{X} = \begin{bmatrix} 1 & X_1^2\\ 1 & X_2^2\\ \vdots & \vdots\\ 1 & X_N^2\\ \end{bmatrix} \] then we can use the usual OLS formula to estimate \(\boldsymbol{\beta}\).

Example 3.5 Below we will assume that our data generating process satisfies (UR.1) - (UR.4) assumptions with the following parameters:
  • \(\beta_0 = 1\), \(\beta_1 \in \{ -0.02, 0.02 \}\);
  • \(X\) is a random sample with replacement from a set: \(X_i \in \{ 1, ..., 50 \}\), \(i = 1, ..., 100\);
  • \(\epsilon \sim \mathcal{N}(0, 5^2)\).

We will generate these values only once, in order to make the results easier to compare.
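
A sketch of the data generation and the manual OLS computation with the design matrix \([1, X^2]\) (the seed is arbitrary, so the estimates will differ slightly from the outputs below):

import numpy as np

np.random.seed(123)
N = 100
beta_0, beta_1 = 1, 0.02
x = np.random.choice(np.arange(1, 51), size=N, replace=True)
e = np.random.normal(loc=0, scale=5, size=N)
y_pos = beta_0 + beta_1 * x ** 2 + e                  # the beta_1 = 0.02 case
y_neg = beta_0 - beta_1 * x ** 2 + e                  # the beta_1 = -0.02 case
# Manual OLS with the design matrix [1, X^2]
X_mat = np.column_stack((np.ones(N), x ** 2))
beta_est = np.linalg.inv(X_mat.T @ X_mat) @ X_mat.T @ y_pos
print(beta_est)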

##           [,1]       [,2]
## [1,] 0.0347076 0.02033724
## [0.27578417 0.02026908]

We can also use the built-in functions to estimate linear regression parameters:
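
For instance, continuing the sketch above with the formula interface of statsmodels in Python, where I() plays the same bracketing role as in R:

import pandas as pd
import statsmodels.formula.api as smf

dt = pd.DataFrame({"x": x, "y_neg": y_neg, "y_pos": y_pos})
print(smf.ols("y_neg ~ I(x**2)", data=dt).fit().params)
print(smf.ols("y_pos ~ I(x**2)", data=dt).fit().params)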

## (Intercept)      I(x^2) 
##  0.03470760 -0.01966276
## (Intercept)      I(x^2) 
##  0.03470760  0.02033724

The function I() is used to bracket those portions of a model formula where the operators are used in their arithmetic sense, i.e. we want to estimate a parameter for the variable x^2.

3.3.3.2 Log-Linear Regression Model

Below are a couple of useful properties of the logarithm function, which are frequently applied to simplify some nonlinear model specifications and various approximations in econometric analysis:

Let \(X \gt 0\), \(X_0 \gt 0\) and \(X_1 \gt 0\), then: \[ \log(X_0 \cdot X_1) = \log(X_0) + \log(X_1) \] \[ \log(X_0 / X_1) = \log(X_0) - \log(X_1) \] \[ \log(X^\alpha) = \alpha\log(X),\ \alpha \in \mathbb{R} \] If \(X \gt 0\) is small (e.g. \(X = 0.01, 0.02,...,0.1\)), then: \[ \log(1 + X) \approx X \] The approximation deteriorates as \(X\) gets larger (e.g. \(X = 0.5\)).

For small changes in \(X\) it can be shown that: \[ \Delta \log (X) = \log(X_1) - \log(X_0) \approx \dfrac{X_1 - X_0}{X_0} = \dfrac{\Delta X}{X_0},\quad \text{where } \Delta X = X_1 - X_0 \] A percentage change in \(X\), from \(X_0\) to \(X_1\) is defined as the log difference multiplied by 100: \[ \% \Delta X \approx 100 \cdot \Delta \log(X) \]

Oftentimes, the dependent and/or independent variable may appear in logarithmic form. The log-linear model has a logarithmic term on the left-hand side of the equation and an untransformed variable on the right-hand side: \[ \log (Y) = \beta_0 + \beta_1 X + \epsilon \] In order to use this model we must have \(Y \gt 0\).

Furthermore, we may also sometimes want to take the logarithm in order to linearize \(Y\).

If \(Y\) is defined via the following exponential form: \[\begin{equation} Y = \exp(\beta_0 + \beta_1 X + \epsilon) \tag{3.10} \end{equation}\] Then we can take the logarithm of \(Y\) to get the log-linear expression \(\log (Y) = \beta_0 + \beta_1 X + \epsilon\).

From eq. (3.10) we can see that we can use the log-transformation to regularize the data (i.e. to make highly skewed distributions less skewed).

Example 3.6 The histogram of eq. (3.10) with \(\beta_0 = 0.8\), \(\beta_1 = 4\) and \(\epsilon \sim \mathcal{N}(0, 1)\) looks to be skewed to the right, because the tail to the right is longer. Whereas the histogram of \(\log (Y)\) (i.e. the dependent variable of the log-linear model) appears to be symmetric around the mean and similar to the normal distribution:
The extremely skewed distribution of \(Y\) became less skewed and more bell-shaped after taking the logarithm.

Many economic variables - price, income, wage, etc. - have skewed distributions, and taking logarithms to regularize the data is a common practice.

Furthermore, setting \(\mathbf{Y} = \left[\log(Y_1), ..., \log(Y_N) \right]^\top\) allows us to apply the OLS via the same formulas as we did before:
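
For example (a minimal sketch; the distribution of \(X\) in eq. (3.10) is not specified, so a uniform draw is assumed here for illustration):

import numpy as np
import statsmodels.api as sm

np.random.seed(123)
x = np.random.uniform(low=0, high=1, size=200)        # assumed distribution of X
e = np.random.normal(loc=0, scale=1, size=len(x))
y = np.exp(0.8 + 4 * x + e)                           # eq. (3.10) with beta_0 = 0.8, beta_1 = 4
log_lin_fit = sm.OLS(np.log(y), sm.add_constant(x)).fit()
print(log_lin_fit.params)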

Furthermore, we can calculate the log-transformed fitted values \(\log (\widehat{Y})\), as well as transform them back to \(\widehat{Y}\):
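
Continuing the sketch:

log_y_fit = log_lin_fit.fittedvalues                  # fitted values of log(Y)
y_fit = np.exp(log_y_fit)                             # transformed back to the scale of Y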

To better understand the effect that a change in \(X\) has on \(\log(Y)\), we calculate the expected value of \(\log(Y)\) when \(X\) changes from \(x\) to \(x + 1\): \[ \begin{aligned} \mathbb{E} (\log(Y) | X = x) = \widetilde{Y} &= \beta_0 + \beta_1 x \\ \mathbb{E} (\log(Y) | X = x + 1) = \widetilde{\widetilde{Y}} &=\beta_0 + \beta_1 (x + 1) \end{aligned} \] Then the difference: \[ \widetilde{\widetilde{Y}} - \widetilde{Y} = \beta_1 \] is similar as in the simple linear regression case, but the interpretation is different since we are talking about \(\log (Y)\), instead of \(Y\). In other words: \[ \begin{aligned} \Delta \log(Y) &= \beta_1 \Delta X \\ 100 \cdot \Delta \log(Y) &= (100 \cdot \beta_1) \Delta X \end{aligned} \] Because \(X\) and \(Y\) are related via the log-linear regression, it follows that: \[ \% \Delta Y \approx (100 \cdot \beta_1) \Delta X \]

In other words, for the log-linear model, a unit increase in \(X\) yields (approximately) a \((100 \cdot \beta_1)\) percentage change in \(Y\).

We can rewrite the previous equality as: \[ 100 \cdot \beta_1 = \dfrac{\% \Delta Y}{\Delta X} := \text{semi-elasticity} \] This quantity, known as the semi-elasticity, is the percentage change in \(Y\) when \(X\) increases by one unit. As we have just shown, in the log-linear regression the semi-elasticity is constant and equal to \(100 \cdot \beta_1\).

Furthermore, we can derive the same measurements as before: \[ \text{slope} := \dfrac{d \mathbb{E} (Y | X)}{d X} = \beta_1 Y \] unlike the simple linear regression model, in the log-linear regression model, the marginal effect increases for larger values of \(Y\) (note, that this is not \(\log (Y)\)).

The elasticity (the percentage change in \(Y\), given a \(1\%\) increase in \(X\)): \[ \eta = \text{slope} \cdot \dfrac{X}{Y} = \beta_1 X \]

If we wanted to change the units of measurement of \(Y\) in a log-linear model, then \(\beta_0\) would change, but \(\beta_1\) would remain unchanged, since: \[ \log(c Y) = \log(c) + \log(Y) = \left[\log(c) + \beta_0 \right] + \beta_1 X + \epsilon \]

3.3.3.3 Linear-Log Regression Model

Alternatively, we may describe the linear relationship, where \(X\) is log-transformed and \(Y\) is untransformed: \[ Y = \beta_0 + \beta_1 \log(X) + \epsilon \] where \(X \gt 0\). In this case, if \(X\) increases from \(X_0\) to \(X_1\), it follows that: \[ \begin{aligned} Y_0 &= \mathbb{E}(Y| X = X_0) = \beta_0 + \beta_1 \log(X_0)\\ Y_1 &= \mathbb{E}(Y| X = X_1) = \beta_0 + \beta_1 \log(X_1) \end{aligned} \] Setting \(\Delta Y = Y_1 - Y_0\) and \(\Delta \log(X) = \log(X_1) - \log(X_0)\), yields: \[ \Delta Y = \beta_1 \Delta \log(X) = \dfrac{\beta_1}{100} \left[100 \cdot \Delta \log(X) \right] \approx \dfrac{\beta_1}{100} (\% \Delta X) \]

In other words, in a linear-log model, a \(1\%\) increase in \(X\) yields (approximately) a \(\beta_1 / 100\) unit change in \(Y\).

Furthermore, we have that: \[ \text{slope} := \dfrac{d \mathbb{E} (Y | X)}{d X} = \beta_1 \dfrac{1}{X} \] and: \[ \eta = \text{slope} \cdot \dfrac{X}{Y} = \beta_1 \dfrac{1}{Y} \]

If we wanted to change the units of measurement of \(X\) in a linear-log model, then \(\beta_0\) would change, but \(\beta_1\) would remain unchanged, since: \[ Y = \beta_0 + \beta_1 \log \left( \dfrac{c}{c} X \right) + \epsilon = \left[\beta_0 -\beta_1 \log(c) \right] + \beta_1 \log\left( c X \right) + \epsilon \]

Sometimes a linear-log model is not much different (in terms of model accuracy) from a linear-linear (i.e. simple) regression. However, because of the functional form of the linear-log model, if \(\beta_1 \lt 0\), then the function decreases at a decreasing rate as \(X\) increases. This means that it may sometimes be useful if we do not want \(Y\) to become negative for reasonably large values of \(X\).

Example 3.7 Let \(Y\) be expenditure on leisure activities, and \(X\) - age. Let us say that we only have expenditure data on ages from 18 to 50 and we would like to predict the expenditure for ages 51 to 80.

In this case it is reasonable to assume that expenditure cannot be negative for reasonable values of \(X\) - an age of up to 80 may be a realistic assumption, while 100 years or more - less so.

Further assume that the underlying (true) model is indeed linear-log: \[ Y = \beta_0 + \beta_1 \log(X) + \epsilon \] with \(\beta_0 = 1200\), \(\beta_1 = -250\) and \(\epsilon \sim \mathcal{N}(0, 50^2)\).
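
A minimal sketch of this example (the sample size and the age values used for estimation are assumptions made for illustration):

import numpy as np
import statsmodels.api as sm

np.random.seed(123)
beta_0, beta_1 = 1200, -250
age = np.random.choice(np.arange(18, 51), size=200, replace=True)
e = np.random.normal(loc=0, scale=50, size=len(age))
spending = beta_0 + beta_1 * np.log(age) + e
# Fit a linear-linear and a linear-log model on ages 18-50
lin_lin = sm.OLS(spending, sm.add_constant(age)).fit()
lin_log = sm.OLS(spending, sm.add_constant(np.log(age))).fit()
# Predict expenditure for ages 51-80
age_new = np.arange(51, 81)
pred_lin_lin = lin_lin.predict(sm.add_constant(age_new))
pred_lin_log = lin_log.predict(sm.add_constant(np.log(age_new)))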

We see that the fitted values \(\widehat{Y}\) do not differ much in the linear-linear and linear-log cases, however, the predicted values (in this example \(Y\) is the expenditure) are more believable in the linear-log case, since they are non-negative.

3.3.3.4 Log-Log Regression Model

Elasticities are often important in applied economics. As such, it is sometimes convenient to have constant elasticity models. If we take logarithms of both \(Y\) and \(X\), then we arrive at the log-log model: \[ \log (Y) = \beta_0 + \beta_1 \log(X) + \epsilon \] where \(X, Y \gt 0\). Then: \[ \text{slope} := \dfrac{d \mathbb{E} (Y | X)}{d X} = \beta_1 \dfrac{Y}{X} \] and: \[ \eta = \text{slope} \cdot \dfrac{X}{Y} = \beta_1 \] i.e. the elasticity of the log-log model is constant.

In other words, in a log-log model, a \(1\%\) increase in \(X\) yields (approximately) a \(\beta_1\) percentage change in \(Y\).

3.3.4 Nonlinear Regression Models

Sometimes the scatter plot diagram, or theoretical assumptions, suggest a nonlinear relationship. The word nonlinear may be a bit confusing - in this context we mean that a regression is nonlinear in parameters.

3.3.4.1 Exponential Regression

An exponential regression model may be specified as: \[ Y = \beta_0 + \beta_1 \cdot \exp(\beta_2 X) + \epsilon \] We see that \(\beta_2\) does not enter the equation linearly. Nevertheless, we may estimate the parameters by minimizing the sum of squared residuals: \[ \text{RSS}(\boldsymbol{\beta}) = \sum_{i = 1}^N \widehat{\epsilon}_i^2 = \sum_{i = 1}^N \left( Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 \exp(\widehat{\beta}_2 X_i) \right)^2 \] Taking the partial derivatives and equating them to zero yields: \[ \begin{cases} \dfrac{\partial \text{RSS}}{\partial \widehat{\beta}_0} &= -2\sum_{i = 1}^N \left( Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 \exp(\widehat{\beta}_2 X_i) \right) = 0\\\\ \dfrac{\partial \text{RSS}}{\partial \widehat{\beta}_1} &= -2\sum_{i = 1}^N \left( Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 \exp(\widehat{\beta}_2 X_i) \right) \exp(\widehat{\beta}_2 X_i) = 0 \\\\ \dfrac{\partial \text{RSS}}{\partial \widehat{\beta}_2} &=-2\sum_{i = 1}^N \left( Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 \exp(\widehat{\beta}_2 X_i) \right) \widehat{\beta}_1 X_i \exp(\widehat{\beta}_2 X_i) = 0 \end{cases} \] This system has no closed-form solution and can be solved via an iterative numerical method, such as gradient descent.

We can also specify our \(\text{RSS}(\boldsymbol{\beta})\) function, which we can minimize via the built-in functions in R and Python.

Example 3.8 Below we will assume that our data generating process satisfies (UR.1) - (UR.4) assumptions with the following parameters:
  • \(\beta_0 = 0.8\), \(\beta_1 = 1.2\), \(\beta_2 = 3\);
  • \(\epsilon \sim \mathcal{N}(0, 1^2)\).

Firstly, define the function that we wish to minimize:
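
A sketch of the data generation (the distribution of \(X\) is not specified in the example, so a uniform draw on \([0, 1]\) is assumed here) and of the \(\text{RSS}(\boldsymbol{\beta})\) function:

import numpy as np

np.random.seed(123)
N = 100
beta_0, beta_1, beta_2 = 0.8, 1.2, 3
x = np.random.uniform(low=0, high=1, size=N)          # assumed distribution of X
e = np.random.normal(loc=0, scale=1, size=N)
y = beta_0 + beta_1 * np.exp(beta_2 * x) + e

def rss(beta, y, x):
    # Sum of squared residuals of the exponential regression
    return np.sum((y - beta[0] - beta[1] * np.exp(beta[2] * x)) ** 2)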

Then, use the built-in optimization function to minimize it. Note that we need to specify a starting parameter value; we choose \(\widehat{\beta}_0 = \widehat{\beta}_1 = \widehat{\beta}_2 = 0\) for this example:
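
A sketch using scipy (convergence from the zero starting point is not guaranteed for every sample, so the result should be checked):

from scipy.optimize import minimize

result = minimize(rss, x0=np.zeros(3), args=(y, x))
print(result.x)                                       # estimates of beta_0, beta_1, beta_2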

The estimated parameter values appear to be similar to the true values.

Alternatively, if we have the following exponential regression: \[ Y = \exp(\beta_0 + \beta_1 X + \epsilon) \] then we may notice, upon further inspection, that we may not need to deal with a nonlinear regression.

Sometimes it is possible to transform a non-linear regression into a linear regression.

We can transform the specified exponential model into a log-linear model by taking the logarithms of both sides of the equation: \[ \log (Y) = \beta_0 + \beta_1 X + \epsilon \] which would give us a log-linear regression model.

3.3.4.2 Power Regression

A power regression may be specified as: \[ Y = \beta_0 + \beta_1 X^{\beta_2} + \epsilon \]

Example 3.9 If we have the power regression defined as: \[ Y = \beta_1 X^{\beta_2} \exp(\epsilon) \] Then we can transform it into a log-log regression: \[ \log(Y) = \log(\beta_1) + \beta_2 \log(X) + \epsilon \]
Example 3.10 In the multivariable regression setting the Cobb-Douglas production function is expressed as a non-linear multiplicative regression:

\[ Y = A L^{\alpha} K^{\beta} \exp(\epsilon) \] where \(Y\) - total production value of all goods; \(L\) - labour input (e.g. total number of person-hours worked); \(K\) - capital input (value of all machinery, equipment, etc.); \(A\) - total-factor productivity; \(\alpha, \beta\) - output elasticities of capital and labor, respectively.

It can be transformed into a simple log-log model: \[ \log(Y) = \log(A) + \alpha \log(L) + \beta \log(K) + \epsilon \]

We have learned different specification possibilities for a regression model, as well as estimation methods in such cases. However, simply estimating the parameters of a number of different models says little about the overall accuracy of a model, or about whether the included explanatory variable actually helps us predict the dependent variable.

3.3.5 Choosing a Functional Form

Generally, we will never know the true relationship between \(Y\) and \(X\). The functional form that we select is only an approximation. As such, we should select a functional form, which satisfies our objectives, economic theory, the underlying model assumptions, like (UR.1) - (UR.3), and one which provides an adequate fit on the data sample.

Consequently, we may incorrectly specify a functional form for our regression model, or, some of our regression assumptions may not hold.

It is a good idea to focus on:

  1. Examining the regression results:
    • checking whether the signs of the coefficients follow economic logic;
    • checking the significance of the coefficients;
  2. Examining the residuals \(\mathbf{Y} - \widehat{\mathbf{Y}}\) - if our specified functional form is inadequate, then the residuals will generally not satisfy (UR.3), (UR.4).

Nevertheless, before examining the signs and model statistical properties, we should have a first look at the data and examine the dependent variable \(Y\) and the independent variable \(X\) plots to make some initial guesses at the functional form of the regression model.

3.3.5.2 Run-Sequence Plot

The run-sequence plot is an easy way to graphically summarize a univariate data set. A common assumption about univariate data sets is that the observations:

  • are random realizations from the data generating process;
  • come from the same population (i.e. all random draws have the same distribution);

The run-sequence plot (or simply, run chart) plots the variable of interest on the vertical axis and the observation index on the horizontal axis. It is primarily used to inspect whether there are any outliers, changes in the mean or the variance, or dependence across observations.

Example 3.11 We will generate three data samples with \(N = 200\) (a minimal generation sketch is provided after this list):
  • A \(r.v.\) from a normal distribution: \(Y_{1,i} \sim \mathcal{N}(\mu, \sigma^2)\);
  • A \(r.v.\) from a normal distribution, where the variance depends on the observation index: \(Y_{2,i} \sim \mathcal{N}(\mu, i \cdot \sigma^2)\)
  • A \(r.v.\), where \(Y_{3,i} \sim \mathcal{N}(\mu, \sigma^2)\) for \(i = 1,...,100\) and \(Y_{3,i} \sim \mathcal{N}(1.5 \cdot \mu, \sigma^2)\) for \(i = 101, ..., 200\)
  • \(\mu = 20\), \(\sigma = 1\).
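
A minimal sketch of generating and plotting these three samples (the seed and plot styling are arbitrary):

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)
N, mu, sigma = 200, 20, 1
idx = np.arange(1, N + 1)
y1 = np.random.normal(loc=mu, scale=sigma, size=N)
y2 = np.random.normal(loc=mu, scale=sigma * np.sqrt(idx), size=N)    # variance grows with the index
y3 = np.concatenate((np.random.normal(loc=mu, scale=sigma, size=N // 2),
                     np.random.normal(loc=1.5 * mu, scale=sigma, size=N // 2)))  # mean shift at i = 101
# Run-sequence plots: the variable of interest against the observation index
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(10, 8))
for ax, y in zip(axes, (y1, y2, y3)):
    _ = ax.plot(idx, y, linestyle="None", marker="o", markersize=3)
plt.show()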

We see that the underlying differences in the distribution of the data are visible in the run-sequence plots of the data samples.

The run-sequence plot can help answer questions, like:

  • Are there any (significant) changes in the mean?
  • Are there any (significant) changes in the variance?
  • Are there any outliers in the data?

The run-sequence plot can be useful when examining the residuals of a model.

Example 3.12 In our exponential model example: \[ Y = \exp(\beta_0 + \beta_1 X + \epsilon) \] what would the residuals look like if we fitted a linear-linear and a log-linear model?

We see that in the case of the linear-linear model, the residual variance is not the same across observations, while the log-linear model has residuals that appear to have the same mean and variance across observations.

We note that the values of \(X\) are ordered from smallest to largest - this means that a larger index value corresponds to a value of \(Y\) that was generated from a larger value of \(X\). If we were to randomize the order of \(X_i\) (and, as a result, of \(Y_i\) and \(\widehat{Y}_i\)), the run-sequence plot of the residuals would look as follows:

We again note that in cross-sectional data one of our assumptions is that the observations \((X_i, Y_i)\) are treated as independent from \((X_j, Y_j)\) for \(i \neq j\). As such, we can order the data in any way we want. In the first case (with the values of \(X\) ordered from smallest to largest), ordering the residuals by the value of \(X\) allows us to examine whether our model performs equally well for all values of \(X\). Clearly, for larger values of \(X\) this was not the case. In the second case (where the ordering of the data \((X_i, Y_i)\) is random), the residual run-sequence plot allows us to identify whether some residuals are more pronounced, but it does not identify when this happens, unlike the first case with ordering by the values of \(X\).

Run sequence plots are more common in time series data, but can still be utilized for cross-sectional data.

3.3.5.3 Scatter Plot

As we have seen, a scatter plot reveals relationships between two variables of interest in the form of a non-random structure. The vertical axis is usually used for the dependent (i.e. response) variable \(Y\), and the horizontal axis for the independent variable \(X\).

Example 3.13 Let: \[ Y = \beta_0 + \beta_1 \cdot X_1 + \epsilon \] where \(X_1 \sim \mathcal{N}(\mu_1, \sigma^2_1)\), \(\epsilon \sim \mathcal{N}(0, 1^2)\) and let \(X_2 \sim \mathcal{N}(\mu_2, \sigma^2_2)\) and \(X_2\) is independent of \(X_1\) and \(\epsilon\).
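
A sketch of this setup (the parameter values below are assumptions made only for illustration):

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)
N = 200
beta_0, beta_1 = 2, 1.5                               # assumed coefficient values
x1 = np.random.normal(loc=5, scale=2, size=N)         # X_1
x2 = np.random.normal(loc=0, scale=3, size=N)         # X_2, independent of X_1 and the error
y = beta_0 + beta_1 * x1 + np.random.normal(loc=0, scale=1, size=N)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
_ = axes[0].scatter(x1, y, facecolors="none", edgecolors="darkgreen")
_ = axes[0].set_xlabel("$X_1$")
_ = axes[1].scatter(x2, y, facecolors="none", edgecolors="darkgreen")
_ = axes[1].set_xlabel("$X_2$")
plt.show()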

We see a clear linear relationship between \(X_1\) and \(Y\). The scatter plot of \(X_2\) and \(Y\) appears to be random, i.e. \(Y\) does not depend on the value of \(X_2\).

The scatter plot can help us answer the following questions:

  • Are the variables \(X\) and \(Y\) related?
  • What kind of relationship (linear, exponential, etc.) could describe \(Y\) and \(X\)?
  • Are there any outliers?
  • Does the variation in \(Y\) change depending on the value of \(X\)?

In addition to the run-sequence plot, the scatter plot of the residuals \(\widehat{\epsilon}\) against the independent variable \(X\) can be useful to determine whether our model performs the same across all values of \(X\), or whether there are specific values of \(X\) for which our model is worse (which would result in larger residual values). If the values of \(X\) are ordered in ascending order, then the run-sequence plot will be very similar to the scatter plot of \(\widehat{\epsilon}\) and \(X\), with the only difference being the spacing between the points.

Example 3.14 In our exponential model example: \[ Y = \exp(\beta_0 + \beta_1 X + \epsilon) \] what would the residuals look like if we fitted a linear-linear and a log-linear model?

Notice that when \(X\) is ordered the only difference between the run-sequence plot and the scatter plot is in the spacing between the points.

Scatter plots are similar to run-sequence plots for univariate cross-sectional data. However, unlike run-sequence plots, scatter plots allow us to examine the relationship between \(X\) and \(Y\) (instead of only examining \(Y\)).

For multivariable cross-sectional data, run-sequence plots may be a faster way to focus on the properties of the variable \(Y\) itself.

3.3.5.4 Quantile-Quantile Plot

The quantile-quantile (q-q) plot can be used to determine if two data sets come from a common distribution. A 45-degree line is also plotted for easier comparison - if the two sets come from the same distribution, the points should fall approximately along this reference line. The greater the departure from this line, the greater the evidence for the conclusion that the two data sets have come from populations with different distributions. More specifically, the quantiles of the first dataset are plotted against the quantiles of the second dataset.

The quantile-quantile (q-q, or Q-Q) plot is a scatterplot created by plotting two sets of quantiles against one another.

The q-q plot is similar to a probability plot, which can be used to examine whether the model residuals are normally distributed. A probability plot is a q-q plot in which the quantiles of one of the data samples are replaced with the quantiles of a theoretical distribution.

In a normal probability plot the data are plotted against a theoretical (standard) normal distribution in such a way that the points should form an approximate straight line. Departures from this straight line indicate departures from normality. The normal distribution is a base distribution, and its quantiles are on the horizontal axis as the Theoretical Quantiles, while the sample data quantiles are on the vertical axis.

Example 3.15 We will generate \(Y_1 \sim \mathcal{N}(0, 1)\) and \(Y_2 = \exp(Y_1)\).
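
A sketch using the normal probability plot from scipy:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

np.random.seed(123)
y1 = np.random.normal(loc=0, scale=1, size=200)
y2 = np.exp(y1)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
_ = stats.probplot(y1, dist="norm", plot=axes[0])     # quantiles of Y_1 against the theoretical normal quantiles
_ = stats.probplot(y2, dist="norm", plot=axes[1])     # quantiles of Y_2 against the theoretical normal quantiles
plt.show()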

We see that the q-q plot of \(Y_1\) shows that \(Y_1\) is normally distributed, since all the quantiles are on the diagonal line. On the other hand, the q-q plot of \(Y_2\) shows that the data is skewed.

The probability plot is used to answer the following questions:

  • Does a specified theoretical distribution provide a good fit to my data?
  • What distribution best fits my data?
  • What are good estimates for the location and scale parameters of the chosen distribution?

In addition, the normal probability plot answers the following questions:

  • Is the data normally distributed?
  • What is the nature of the departure from normality (data skewed with shorter/longer tails)?

As mentioned, probability plots can be used to inspect whether the residuals are normally distributed. We will see an example of this in the residual diagnostics section later on.

Some notes on the terminology of quantiles, percentiles and quartiles:

  • Quantile levels go from 0 to 1. Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way. The p-quantile is defined as the value below which a proportion \(p\) of the observations fall (approximately \(p \cdot N\) observations), with \(0 \leq p \leq 1\) and \(N\) being the number of observations.
  • Percentiles go from 0 to 100. A percentile indicates the value below which a given percentage of observations in a group of observations falls. For example, the 20th percentile is the value below which \(20\%\) of the observations may be found.
  • Quartiles go from 0 to 4. They are values that divide a list of sorted values into quarters.

In general, percentiles and quartiles are specific types of quantiles. The relationship is as follows:

  • 0 quartile = 0 quantile = 0 percentile
  • 1 quartile = 0.25 quantile = 25 percentile
  • 2 quartile = 0.5 quantile = 50 percentile (median)
  • 3 quartile = 0.75 quantile = 75 percentile
  • 4 quartile = 1 quantile = 100 percentile

In our previous example if we wanted to mark the quantiles, we could do so with the following code:
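
For instance, continuing the previous sketch, the quartiles of \(Y_2\) could be marked on its histogram with np.percentile (a hypothetical illustration of one way to do it):

# quartiles of Y_2, expressed as percentiles (quantile values multiplied by 100)
q = np.percentile(y2, [25, 50, 75])
_ = plt.figure(figsize=(10, 4))
_ = plt.hist(y2, bins=30, edgecolor="black")
for val in q:
    _ = plt.axvline(x=val, color="red", linestyle="--")
plt.show()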

Note: a newer version of numpy in Python has the np.quantile() function, in which case we would not need to multiply the quantile values by 100.