3.3 Regression Models and Interpretation
This chapter presents the most common forms of regression models, along with possible interpretations of their coefficients. Note that these models are presented for the univariate case, but they can be extended analogously to the multivariate case, as will be seen in later chapters.
3.3.1 Inclusion of the constant term in the regression
In some cases we want to impose the restriction that if \(X = 0\), then \(Y = 0\) as well. An example could be the relationship between income (\(X\)) and income tax revenue (\(Y\)): if there is no income, \(X = 0\), then the expected revenue from taxes would also be zero: \(\mathbb{E}(Y \mid X = 0) = 0\).
Formally, we now choose a slope estimator, \(\beta_1\), from the following regression model: \[ Y_i = \beta_1 X_i + \epsilon_i,\ i = 1,...,N \] which is called a regression through the origin, because the conditional expected value: \[ \mathbb{E} \left(Y_i \mid X_i \right) = \beta_1 X_i \] of the regression passes through the origin point \(X = 0\), \(Y = 0\). We can obtain the estimate of the slope parameter via OLS by minimizing the sum of squared residuals: \[ \begin{aligned} \text{RSS} &= \sum_{i = 1}^N \widehat{\epsilon}_i^2 = \sum_{i = 1}^N \left( Y_i - \widehat{\beta}_1 X_i \right)^2 \\ \dfrac{\partial \text{RSS}}{\partial \widehat{\beta}_1} &= -2\sum_{i = 1}^N X_i\left( Y_i - \widehat{\beta}_1 X_i \right) = 0 \end{aligned} \] which leads to the following expression: \[ \begin{aligned} \widehat{\beta}_1 = \dfrac{\sum_{i = 1}^N X_iY_i}{\sum_{i = 1}^N X_i^2} \end{aligned} \]
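As a quick numerical check of this closed-form estimator, we can compare it against a generic least-squares solver applied to a one-column design matrix. The data below are purely illustrative (a true slope of 2 with unit-variance noise is our own choice):

```python
import numpy as np

# Hypothetical data where Y is roughly proportional to X (no intercept)
rng = np.random.default_rng(42)
x = np.arange(1.0, 21.0)
y = 2.0 * x + rng.normal(loc=0, scale=1, size=x.size)

# Closed-form OLS slope for the regression through the origin
beta_1_hat = np.sum(x * y) / np.sum(x**2)

# Cross-check: least squares on a design matrix with no constant column
beta_1_lstsq = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)[0][0]
```

Both approaches give the same slope; omitting the constant column is what forces the fitted line through \((0, 0)\).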
So, it is possible to specify a regression without a constant term, but should we opt for it? The constant \(\beta_0\) can be described as the mean value of \(Y\) when all predictor variables are set to zero. However, if the predictors can’t be zero, then it is impossible to interpret the constant. Furthermore, even if all predictors can take zero values, the constant may still be uninterpretable, as the intercept parameter \(\beta_0\) may be regarded as a sort of garbage collector for the regression model. The reason for this is the underlying assumption that the expected value of the residuals is zero. The regression is estimated so that the mean of the residuals is (very close to) zero, which means that any bias not accounted for by the model is collected in the intercept \(\beta_0\).
In general, including a constant in the regression model ensures that the model's residuals have a mean of zero; otherwise, the estimated coefficients may be biased. Consequently, and as is often the case, without knowing the true underlying model it is generally not worth interpreting the regression constant.
3.3.2 Linear Regression Models
The regression model that we have examined up to now is called a simple linear regression model. Consider the univariate regression: \[ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} \] where \(\boldsymbol{\beta} = (\beta_0,\ \beta_1)^\top\). When we say linear regression, we mean linear in the parameters \(\boldsymbol{\beta}\). There are no restrictions on transformations of \(X\) and \(Y\), as long as the parameters enter the equation linearly. For example, we can use \(\log (X)\) and \(\log (Y)\), or \(\sqrt{X}\) and \(\sqrt{Y}\), etc. in the univariate regression. While transforming \(X\) and \(Y\) does not affect the linear regression specification itself, the interpretation of the coefficients depends on the transformation of \(X\) and \(Y\).
On the other hand, there are regression models which are not regarded as linear, since they are not linear in their parameters: \[ Y_i = \dfrac{1}{\beta_0 + \beta_1 X_i} + \epsilon_i,\ i = 1,...,N \] Estimation of such models is a separate topic, covered under nonlinear regression models.
3.3.2.1 Effects of Changing the Measurement Units
Generally, it is easy to figure out what happens to the intercept, \(\beta_0\), and slope, \(\beta_1\), estimates when the units of measurement are changed for the dependent variable, \(Y\):
\[ \widetilde{Y} = c \cdot Y = c \cdot \left( \beta_0 + \beta_1 X + \epsilon \right) = (c \cdot \beta_0) + (c \cdot \beta_1) X + (c \cdot \epsilon) \] In other words, if \(Y\) is multiplied by a constant \(c\), then the OLS estimates of \(\widetilde{Y} = c \cdot Y\) are \(\widetilde{\beta}_0 = c \cdot \beta_0\) and \(\widetilde{\beta}_1 = c \cdot \beta_1\).
We will generate a univariate regression with:
- \(\beta_0 = 1\), \(\beta_1 = 0.5\)
- \(X = 1, ..., 51\), \(\epsilon \sim \mathcal{N}(0, 1^2)\)
- \(c = 10\)
The new variable is \(Y^{new} = Y \cdot c\) (i.e. assume that we have \(Y\) in kilograms and we want to transform it into grams), and the data can be generated as follows:
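The generation code itself is not shown here, so the following is one way the sample could be created on the Python side, using the parameters listed above and the variable names (x, y, new_y) that the plotting code below expects:

```python
import numpy as np

np.random.seed(123)
# Parameters from the text: beta_0 = 1, beta_1 = 0.5, epsilon ~ N(0, 1), c = 10
beta_0, beta_1, const = 1, 0.5, 10
x = np.arange(1, 52)                                     # X = 1, ..., 51
e = np.random.normal(loc = 0, scale = 1, size = len(x))
y = beta_0 + beta_1 * x + e
new_y = y * const                                        # Y in the new units
```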
The data sample plots are:
import matplotlib.pyplot as plt
#
# Plot
_ = plt.figure(num = 0, figsize = (10, 8))
_ = plt.plot(x, new_y, linestyle = "None", marker = "o", label = "$\\tilde{Y}$",
markerfacecolor = 'red', markeredgecolor = "red")
_ = plt.plot(x, y, linestyle = "None", marker = "o", label = "$Y$",
markerfacecolor = 'blue', markeredgecolor = "blue")
_ = plt.legend()
plt.show()
and the parameters can be estimated as before:
## (Intercept) x
## 0.8337401 0.5047895
The estimated OLS coefficients using the dependent variable \(Y\) are as expected. If we scale our dependent variable and estimate the coefficients using \(Y^{new}\), then the parameters increase proportionally:
## (Intercept) x
## 8.337401 5.047895
## [8.38536209 5.09922195]
Note that the variance of \(c \cdot \epsilon\) is now \(c^2 \sigma^2\).
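This variance claim is easy to verify by simulation (the sample size and seed below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(123)
c = 10
e = rng.normal(loc=0, scale=1, size=100_000)  # sigma^2 = 1
scaled_e = c * e                              # variance becomes c^2 * sigma^2

sample_var = scaled_e.var()                   # should be close to 100
```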
This assumes that nothing changes in the scaling of the independent variable \(X\) used in OLS estimation.
If we change the units of measurement of an independent variable, \(X\), then only the slope, \(\beta_1\) (i.e. the coefficient of that independent variable), changes: \[ Y = \beta_0 + \beta_1 X + \epsilon = \beta_0 + \left( \dfrac{\beta_1}{c} \right) \left( c \cdot X \right) + \epsilon \] In other words, if \(X\) is multiplied by a constant \(c\), then the OLS intercept estimate remains \(\beta_0\), while the slope becomes \(\widetilde{\beta}_1 = \dfrac{\beta_1}{c}\).
We can verify that this is the case with our empirical data sample by creating new variables \(X^{(1)} = X \cdot c\) and \(X^{(2)} = X / c\):
x1 <- x * const
x2 <- x / const
#
y_fit_x_mlt <- lm(y ~ x1)
y_fit_x_div <- lm(y ~ x2)
print(y_fit$coefficients)
## (Intercept) x
## 0.8337401 0.5047895
## (Intercept) x1
## 0.83374010 0.05047895
## [0.83853621 0.05099222]
## (Intercept) x2
## 0.8337401 5.0478946
## [0.83853621 5.09922195]
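The same check can be written compactly in Python; instead of a specific modelling library, the sketch below uses a small least-squares helper, and the data-generating values are assumptions chosen to mirror the example above:

```python
import numpy as np

np.random.seed(123)
const = 10
x = np.arange(1, 52)
y = 1 + 0.5 * x + np.random.normal(loc=0, scale=1, size=len(x))

def ols(x_col, y_col):
    """OLS with an intercept; returns (intercept, slope)."""
    X = np.column_stack((np.ones(len(x_col)), x_col))
    return np.linalg.lstsq(X, y_col, rcond=None)[0]

b      = ols(x, y)          # baseline fit
b_mult = ols(x * const, y)  # X * c: slope shrinks by a factor of c
b_div  = ols(x / const, y)  # X / c: slope grows by a factor of c
```

The intercept is unchanged in all three fits, while the slopes become \(\widehat{\beta}_1 / c\) and \(\widehat{\beta}_1 \cdot c\), respectively.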
\[ \widetilde{Y} = c \cdot Y = c \cdot \left( \beta_0 + \left( \dfrac{\beta_1}{c} \right) \left( c \cdot X \right) + \epsilon \right) = (c \cdot \beta_0) + \beta_1 \left( c \cdot X \right) + (c \cdot \epsilon) \] In other words, if both \(Y\) and \(X\) are multiplied by the same constant \(c\), then the OLS estimates for the intercept change to \(\widetilde{\beta_0} = c \cdot \beta_0\) but remain the same for the slope \(\beta_1\).
## (Intercept) x
## 0.8337401 0.5047895
x1_mat = sm.add_constant(x * const)
new_y = y * const
#
y_fit_scaled = sm.OLS(new_y, x1_mat).fit()
print(y_fit.params)
## [0.83853621 0.5099222 ]
## (Intercept) x1
## 8.3374010 0.5047895
## [8.38536209 0.5099222 ]
\[ \widetilde{Y} = a \cdot Y = a \cdot \left( \beta_0 + \left( \dfrac{\beta_1}{c} \right) \left( c \cdot X \right) + \epsilon \right) = (a \cdot \beta_0) + \left(\dfrac{a}{c}\cdot\beta_1 \right) \left( c \cdot X \right) + (a \cdot \epsilon) \] In other words, if \(Y\) is multiplied by a constant \(a\) and \(X\) is multiplied by a constant \(c\), then the OLS estimates for the intercept change to \(\widetilde{\beta}_0 = a \cdot \beta_0\) but for the slope they change to \(\widetilde{\beta}_1 = \dfrac{a}{c}\cdot \beta_1\).
## (Intercept) x
## 0.8337401 0.5047895
## [0.83853621 0.5099222 ]
## (Intercept) x1
## 4.1687005 0.2523947
## [4.19268105 0.2549611 ]
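This combined case can also be verified numerically. The sketch below uses \(a = 5\) and \(c = 10\), which match the proportions in the outputs above (intercept multiplied by 5, slope halved); the data-generating values are our own:

```python
import numpy as np

np.random.seed(123)
a, c = 5.0, 10.0
x = np.arange(1, 52)
y = 1 + 0.5 * x + np.random.normal(loc=0, scale=1, size=len(x))

def ols(x_col, y_col):
    """OLS with an intercept; returns (intercept, slope)."""
    X = np.column_stack((np.ones(len(x_col)), x_col))
    return np.linalg.lstsq(X, y_col, rcond=None)[0]

b        = ols(x, y)
b_scaled = ols(c * x, a * y)  # intercept -> a * b0, slope -> (a / c) * b1
```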
Usually, changing the unit of measurement is referred to as data scaling. In all cases, after scaling the data the standard errors of the scaled regression coefficients change as well; however, as we will see later, this does not affect any test statistics related to the coefficients, or the model accuracy.
In multiple regression analysis this means that we can include variables with different measurements in the same regression and it will not affect the accuracy (in terms of standard errors) of our model, e.g. \(Y\) is measured in thousands, \(X_1\) is measured in thousands and \(X_2\) is measured in single units (or millions, etc.).
3.3.2.2 Interpretation of the Parameters
In our univariate regression: \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \] \(\beta_1\) shows the amount by which the expected value of \(Y\) (remember that \(\mathbb{E}(Y_i \mid X_i) = \beta_0 + \beta_1 X_i\)) changes (either increases or decreases) when \(X\) increases by 1 unit. This can be verified by specifying two cases, \(X = x\) and \(X = x + 1\): \[ \begin{aligned} \mathbb{E} (Y \mid X = x) &= \widetilde{Y} = \beta_0 + \beta_1 x \\ \mathbb{E} (Y \mid X = x + 1) &= \widetilde{\widetilde{Y}} = \beta_0 + \beta_1 (x + 1) \end{aligned} \] then, taking the difference yields: \[ \begin{aligned} \widetilde{\widetilde{Y}} - \widetilde{Y} &= \beta_0 + \beta_1 (x + 1) - \beta_0 - \beta_1 x = \beta_1 \end{aligned} \]
For example, if \(X\) is in thousands of dollars, then \(\beta_1\) shows the amount that the expected value of \(Y\) changes, when \(X\) increases by one thousand.
Generally, \(\beta_0\) can be regarded as the expected value of \(Y\) with \(X = 0\). As mentioned previously, interpreting the intercept \(\beta_0\) is tricky not only because it usually accounts for bias in our specified model (in comparison to the true unknown model), but also because we may not have enough data when \(X = 0\) (as mentioned earlier, maybe \(X\) cannot obtain a zero value at all). In such cases, the estimated intercept is not a good approximation of reality in that region and we should avoid any interpretations of \(\beta_0\).
Alternatively, we can say that:
The defining feature of a univariate linear regression is that the change in (the expected value of) \(Y\) is equal to the change in \(X\) multiplied by \(\beta_1\). So, the marginal effect of \(X\) on \(Y\) is constant and equal to \(\beta_1\): \[ \Delta Y = \beta_1 \Delta X \] or alternatively: \[ \beta_1 = \dfrac{\Delta Y}{\Delta X} = \dfrac{\Delta \mathbb{E} (Y \mid X)}{\Delta X} = \dfrac{d \mathbb{E} (Y \mid X)}{d X} =: \text{slope} \] where \(d\) denotes the derivative of the expected value of \(Y\) with respect to \(X\). As we can see, in the linear regression case, the derivative is simply the slope of the regression line, \(\beta_1\).
Note that in this case we say that the marginal effect of \(X\) on \(Y\) is constant, because a one-unit change in \(X\) results in the same change in \(Y\), regardless of the initial value of \(X\).
Below we provide a graphical interpretation with \(\beta_0 = 10\), \(\beta_1 = 5\), \(\epsilon \sim \mathcal{N}(0, 5^2)\) and \(X\) from an evenly spaced set between \(0\) and \(10\), with \(N = 100\).
set.seed(123)
#
N <- 100
beta_0 <- 10
beta_1 <- 5
x <- seq(from = 0, to = 10, length.out = N)
e <- rnorm(mean = 0, sd = 5, n = length(x))
y <- beta_0 + beta_1 * x + e
# Conditional expectation of Y, given X
E_y <- beta_0 + beta_1 * x
# Plot the data
#
plot(x, y, ylim = c(0, max(y)), xaxs = "i", yaxs = "i", col = "darkgreen")
#
#
#
#
# Plot the conditional expectation E(Y|X)
lines(x, E_y, col = "blue")
#
# plot beta_0
pBrackets::brackets(x1 = 0, y1 = E_y[1], x2 = 0, y2 = 0, h = 0.4, col = "red")
text(x = 1, y = E_y[1] / 2, labels = expression(beta[0] ~ ", intercept"))
#
#
#
# plot beta_1
segments(x0 = 4, y0 = beta_0 + beta_1 * 4,
x1 = 5, y1 = beta_0 + beta_1 * 4, lty = 2, col = "red")
segments(x0 = 5, y0 = beta_0 + beta_1 * 4,
x1 = 5, y1 = beta_0 + beta_1 * 5, lty = 2, col = "red")
pBrackets::brackets(x1 = 5, y1 = beta_0 + beta_1 * 4,
x2 = 4, y2 = beta_0 + beta_1 * 4, h = 1, col = "red")
text(x = 5, y = beta_0+beta_1*3.7, labels = expression("unit increase in" ~ X))
pBrackets::brackets(x1 = 5, y1 = beta_0 + beta_1 * 5,
x2 = 5, y2 = beta_0 + beta_1 * 4, h = 0.2, col = "red")
text(x = 5.7, y = beta_0+beta_1*4.5, labels = expression(beta[1] ~ ", slope"))
# Add a legend
#
legend(x = 0.1, y = 60, lty = c(NA, 1), pch = c(1, NA),
col = c("darkgreen", "blue"),
legend = c(expression("Y =" ~ beta[0] + beta[1] * X + epsilon),
expression("E(Y|X) =" ~ beta[0] + beta[1] * X)))
np.random.seed(123)
#
N = 100
beta_0 = 10
beta_1 = 5
x = np.linspace(start = 0, stop = 10, num = N)
e = np.random.normal(loc = 0, scale = 5, size = len(x))
y = beta_0 + beta_1 * x + e
# Conditional expectation of Y, given X
E_y = beta_0 + beta_1 * x
# Plot the data
_ = plt.figure(num = 1, figsize = (10, 8))
_ = plt.plot(x, y, linestyle = "None", marker = "o",
label = "$Y = \\beta_0 + \\beta_1 X + \\epsilon$",
markerfacecolor = 'None', markeredgecolor = "darkgreen")
_ = plt.margins(x = 0)
_ = plt.ylim([0, max(y)])
# Plot the conditional expectation E(Y|X)
_ = plt.plot(x, E_y, linestyle = "-", color = "blue",
label='$E(Y|X) = \\beta_0 + \\beta_1 X$')
# plot beta_0
_ = plt.annotate("", xy = (0.5, E_y[0] / 2),
xytext = (0.2, E_y[0] / 2),
arrowprops = dict(arrowstyle = "]-, widthA=3.5,lengthA=1",
connectionstyle = "arc", color='red'))
_ = plt.text(0.5, E_y[0] / 2, '$\\beta_0$, intercept', fontsize = 10)
# plot beta_1
_ = plt.plot([4, 5], [beta_0 + beta_1 * 4]*2,
linestyle = "--", color = "red")
_ = plt.plot([5, 5], beta_0 + beta_1 * np.array([4, 5]),
linestyle = "--", color = "red")
_ = plt.annotate("", xy = (5.5, beta_0 + beta_1 * 4.5),
xytext = (5.2, beta_0 + beta_1 * 4.5),
arrowprops = dict(arrowstyle = "]-, widthA=1.8,lengthA=1",
connectionstyle = "arc", color='red'))
_ = plt.text(4, beta_0 + beta_1 * 3.3, 'unit increase in $X$', fontsize = 10)
_ = plt.annotate("", xy = (4.5, beta_0 + beta_1 * 3.5),
xytext = (4.5, beta_0 + beta_1 * 3.8),
arrowprops = dict(arrowstyle = "]-, widthA=2.8,lengthA=1",
connectionstyle = "arc", color='red'))
_ = plt.text(5.5, beta_0 + beta_1 * 4.5, '$\\beta_1$, slope', fontsize = 10)
_ = plt.legend()
plt.show()
In econometrics, an even more popular characteristic is the rate of change. The proportionate (or relative) change of \(Y\), moving from \(Y\) to \(\widetilde{Y}\), is defined as the change in \(Y\) divided by its initial value: \[ \dfrac{\widetilde{Y} - Y}{Y} = \dfrac{\Delta Y}{Y},\ Y \neq 0 \] Usually, we measure changes in terms of percentages: a percentage change in \(Y\), from \(Y\) to \(\widetilde{Y}\), is defined as the proportionate change multiplied by 100: \[ \% \Delta Y = 100 \cdot \dfrac{\Delta Y}{Y} \approx 100 \cdot \Delta \log(Y) \]
Note: the approximate equality to the logarithm is useful for interpreting the coefficients when modelling \(\log (Y)\), instead of \(Y\).
Consequently we have that:
The elasticity of a variable \(Y\) with respect to \(X\) is defined as the percentage change in \(Y\) corresponding to a \(1\%\) increase in \(X\): \[ \eta = \eta(Y \mid X) = \dfrac{\% \Delta Y}{\% \Delta X} = \dfrac{100\cdot\dfrac{\Delta Y}{Y}}{100\cdot\dfrac{\Delta X}{X}} = \dfrac{\Delta Y}{\Delta X} \cdot \dfrac{X}{Y} \] So, the elasticity of the expected value of \(Y\) with respect to \(X\) is: \[ \eta = \dfrac{d \mathbb{E} (Y \mid X)}{d X}\cdot \dfrac{X}{\mathbb{E} (Y \mid X)} \] In the univariate regression case we have: \[ \eta = \beta_1 \cdot \dfrac{X}{\mathbb{E} (Y \mid X)} \] In practice, in a linear model the elasticity is different at each point \((X_i, Y_i)\), \(i = 1,...,N\). Most commonly, the elasticity is estimated by substituting the sample means of \(X\) and \(Y\): \[ \widehat{\eta} = \widehat{\beta}_1 \cdot \dfrac{\overline{X}}{\overline{Y}} \] The interpretation is that a \(1\%\) increase in \(X\) will yield, on average, a \(\widehat{\eta}\) percent (i.e. \(\% \widehat{\eta}\)) increase/decrease in \(Y\).
To reiterate: \(\eta\) shows a percentage, not a unit, change in \(Y\) corresponding to a \(1\%\) change in \(X\).
If the elasticity is less than one, we say that \(Y\) is inelastic with respect to \(X\). If the elasticity is greater than one, we say that \(Y\) is elastic with respect to \(X\).
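A worked sketch of the estimator \(\widehat{\eta} = \widehat{\beta}_1 \cdot \overline{X} / \overline{Y}\) (the data-generating values are hypothetical):

```python
import numpy as np

# Illustrative linear model: Y = 2 + 3 X + noise
np.random.seed(1)
x = np.linspace(1, 10, 100)
y = 2 + 3 * x + np.random.normal(loc=0, scale=1, size=100)

# OLS with an intercept
X = np.column_stack((np.ones_like(x), x))
beta_0_hat, beta_1_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# Elasticity evaluated at the sample means
eta_hat = beta_1_hat * x.mean() / y.mean()
```

Here the true elasticity at the means is roughly \(3 \cdot 5.5 / 18.5 \approx 0.89\), so \(Y\) is inelastic with respect to \(X\) at that point.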
3.3.3 Nonlinearities in a Linear Regression
Economic variables are not always related by a straight-line relationship. In a simple linear regression the marginal effect of \(X\) on \(Y\) is constant, which is not realistic in many economic relationships. For example, consider how the price of a house (\(Y\)) relates to the size of the house (\(X\)): in a linear specification, the expected price for an additional square foot is constant. However, it is possible that more expensive homes have a higher value for each additional square foot, compared to smaller, less expensive homes.
Fortunately, the linear regression model \(\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}\) is quite flexible: the variables \(\mathbf{Y}\) and \(\mathbf{X}\) can be transformed via logarithms, squares, cubes, or even so-called indicator variables. All of these transformations can be used to account for a nonlinear relationship between the variables \(\mathbf{Y}\) and \(\mathbf{X}\) (while still being expressed as a linear regression in terms of the parameters \(\boldsymbol{\beta}\)).
Finally, since these models are linear in parameters, we can use the usual OLS estimates. In other words:
If we have a linear regression with transformed variables: \[ f(Y_i) = \beta_0 + \beta_1 \cdot g(X_i) + \epsilon_i,\quad i = 1,...,N \] then we can rewrite it in matrix notation: \[ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} \] where \(\mathbf{Y} =\left[ f(Y_1), ..., f(Y_N) \right]^\top\), \(\boldsymbol{\varepsilon} = [\epsilon_1,...,\epsilon_N]^\top\), \(\boldsymbol{\beta} =[ \beta_0, \beta_1]^\top\) and \(\mathbf{X} = \begin{bmatrix} 1 & g(X_1) \\ 1 & g(X_2) \\ \vdots & \vdots \\ 1 & g(X_N) \end{bmatrix}\), and where \(f(Y)\) and \(g(X)\) are some transformations of the initial values of \(Y\) and \(X\).
This allows us to estimate the unknown parameters via OLS: \[ \widehat{\boldsymbol{\beta}} = \left( \mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{Y} \]
3.3.3.1 Quadratic Regression Model
The quadratic regression model: \[ Y = \beta_0 + \beta_1 X^2 + \epsilon \] is a parabola, where \(\beta_0\) is the intercept and \(\beta_1\) is the shape parameter of the curve: if \(\beta_1 > 0\), then the curve is \(\cup\)-shaped (convex); if \(\beta_1 < 0\), then the curve is \(\cap\)-shaped (concave).
Because of the nonlinearity in \(X\), the effect on \(Y\) of a unit change in \(X\) now depends on the initial value of \(X\). If, as before, we take \(X = x\) and \(X = x + 1\): \[ \begin{aligned} \mathbb{E} (Y \mid X = x) = \widetilde{Y} &= \beta_0 + \beta_1 x^2 \\ \mathbb{E} (Y \mid X = x + 1) = \widetilde{\widetilde{Y}} &= \beta_0 + \beta_1 (x + 1)^2 \\ &= \beta_0 + \beta_1 x^2 + \beta_1 \cdot(2x + 1) \end{aligned} \] so, the difference: \[ \widetilde{\widetilde{Y}} - \widetilde{Y} = \beta_1 \cdot(2x + 1) \] now depends on the initial value of \(x\): the larger the initial value, the more pronounced the change in \(Y\) will be.
The slope of the quadratic regression is: \[ \text{slope} = \dfrac{d \mathbb{E} (Y \mid X)}{d X} = 2 \beta_1 X \] which changes as \(X\) changes. For large values of \(X\), the slope will be larger if \(\beta_1 > 0\) (or smaller if \(\beta_1 < 0\)), and a unit increase in \(X\) will produce a more pronounced change in \(Y\) than at smaller values of \(X\). Note that, unlike in the simple linear regression, \(\beta_1\) is no longer the slope here.
The elasticity (i.e. the percentage change in \(Y\), given a \(1\%\) change in \(X\)) is: \[ \eta = 2 \beta_1 X \cdot \dfrac{X}{Y} = \dfrac{2 \beta_1 X^2}{Y} \] A common approach is to choose a point on the fitted relationship, i.e. select a value of \(X\) and the corresponding fitted value \(\widehat{Y}\), and estimate \(\widehat{\eta}(\widehat{Y} \mid X) = 2 \widehat{\beta}_1 X^2 / \widehat{Y}\). Note that we could also evaluate it at the sample means \((\overline{X}, \overline{Y})\). However, we may be interested in how the elasticity changes at different values of \(X\): when \(X\) is small, when \(X\) is large, etc.
If we have a random sample of size \(N\) and specify: \[ \mathbf{X} = \begin{bmatrix} 1 & X_1^2\\ 1 & X_2^2\\ \vdots\\ 1 & X_N^2\\ \end{bmatrix} \] then we can use the usual OLS formula to estimate \(\boldsymbol{\beta}\).
We simulate a sample with:
- \(\beta_0 = 0.5\), \(\beta_1 \in \{ -0.02, 0.02 \}\);
- \(X\) is a random sample with replacement from a set: \(X_i \in \{ 1, ..., 50 \}\), \(i = 1, ..., 100\);
- \(\epsilon \sim \mathcal{N}(0, 3^2)\).
We will generate these values only once, in order to make the results easier to compare.
set.seed(123)
# Set the coefficients:
N = 100
beta_0 = 0.5
beta_1 = c(-0.02, 0.02)
# Generate sample data:
x <- sample(1:50, size = N, replace = TRUE)
e <- rnorm(mean = 0, sd = 3, n = length(x))
y1 <- beta_0 + beta_1[1] * x^2 + e
y2 <- beta_0 + beta_1[2] * x^2 + e
# Estimate the model:
x_mat <- cbind(1, x^2)
beta_mat_1 <- solve(t(x_mat) %*% x_mat) %*% t(x_mat) %*% y1
beta_mat_2 <- solve(t(x_mat) %*% x_mat) %*% t(x_mat) %*% y2
print(t(beta_mat_1))
## [,1] [,2]
## [1,] 0.0347076 -0.01966276
np.random.seed(123)
# Set the coefficients:
N = 100
beta_0 = 0.5
beta_1 = [-0.02, 0.02]
# Generate sample data:
x = np.random.choice(list(range(1, 51)), size = N, replace = True)
e = np.random.normal(loc = 0, scale = 3, size = len(x))
y1 = beta_0 + beta_1[0] * x**2 + e
y2 = beta_0 + beta_1[1] * x**2 + e
# Estimate the model:
x_mat = np.column_stack((np.ones(len(x)), x**2))
beta_mat_1 = np.dot(np.linalg.inv(np.dot(np.transpose(x_mat), x_mat)),
np.dot(np.transpose(x_mat), y1))
beta_mat_2 = np.dot(np.linalg.inv(np.dot(np.transpose(x_mat), x_mat)),
np.dot(np.transpose(x_mat), y2))
print(beta_mat_1)
## [ 0.27578417 -0.01973092]
## [,1] [,2]
## [1,] 0.0347076 0.02033724
## [0.27578417 0.02026908]
# Calculate the fitted values:
y_fit1 <- beta_mat_1[1] + beta_mat_1[2] * x^2
y_fit2 <- beta_mat_2[1] + beta_mat_2[2] * x^2
# Plot the data and the fitted regression:
plot(x, y1, col = "blue", xlim = c(0, max(x)),
ylim = c(min(y1, y2), max(y1, y2)))
#
points(x, y2, col = "red")
#
lines(x = x[order(x)], y = y_fit1[order(x)], col = "blue")
#
lines(x = x[order(x)], y = y_fit2[order(x)], col = "red")
#
legend(x = 0.1, y = 50, lty = c(1, 1), col = c("blue", "red"),
legend = c(expression(beta[1] < 0), expression(beta[1] > 0)))
# Calculate the fitted values
y_fit1 = beta_mat_1[0] + beta_mat_1[1] * x**2
y_fit2 = beta_mat_2[0] + beta_mat_2[1] * x**2
# Plot the data and the fitted regression:
_ = plt.figure(num = 2, figsize = (10, 8))
_ = plt.plot(x, y1, linestyle = "None", marker = "o", markerfacecolor = 'None',
markeredgecolor = "blue")
_ = plt.plot(x, y2, linestyle = "None", marker = "o", markerfacecolor = 'None',
markeredgecolor = "red")
_ = plt.plot(x[np.argsort(x)], y_fit1[np.argsort(x)], linestyle = "-",
color = "blue", label = "$\\beta_1 < 0$")
_ = plt.plot(x[np.argsort(x)], y_fit2[np.argsort(x)], linestyle = "-",
color = "red", label = "$\\beta_1 > 0$")
_ = plt.legend()
plt.show()
We can also use the built-in functions to estimate the linear regression parameters:
## (Intercept) I(x^2)
## 0.03470760 -0.01966276
## [ 0.27578417 -0.01973092]
## (Intercept) I(x^2)
## 0.03470760 0.02033724
The function I() is used to bracket those portions of a model formula where the operators are used in their arithmetic sense, i.e. we want to estimate a parameter for the variable x^2.
## [0.27578417 0.02026908]
3.3.3.2 Log-Linear Regression Model
A couple of useful properties of the logarithm function are frequently applied to simplify some nonlinear model specifications and various approximations in econometric analysis:
Let \(X \gt 0\), \(X_0 \gt 0\) and \(X_1 \gt 0\), then: \[ \log(X_0 \cdot X_1) = \log(X_0) + \log(X_1) \] \[ \log(X_0 / X_1) = \log(X_0) - \log(X_1) \] \[ \log(X^\alpha) = \alpha\log(X),\ \alpha \in \mathbb{R} \] If \(X \gt 0\) is small (e.g. \(X = 0.01, 0.02,...,0.1\)), then: \[ \log(1 + X) \approx X \] The approximation deteriorates as \(X\) gets larger (e.g. \(X = 0.5\)).
For small changes in \(X\) it can be shown that: \[ \Delta \log (X) = \log(X_1) - \log(X_0) \approx \dfrac{X_1 - X_0}{X_0} = \dfrac{\Delta X}{X_0},\quad \text{where } \Delta X = X_1 - X_0 \] A percentage change in \(X\), from \(X_0\) to \(X_1\), is defined as the log difference multiplied by 100: \[ \% \Delta X \approx 100 \cdot \Delta \log(X) \]
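A quick numerical check of this approximation:

```python
import numpy as np

x0, x1 = 100.0, 103.0                        # a 3% increase in X
pct_exact = 100 * (x1 - x0) / x0             # exact percentage change: 3.0
pct_log = 100 * (np.log(x1) - np.log(x0))    # log-difference approximation
```

The log approximation gives about 2.96, close to the exact 3%; the gap widens for larger changes.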
Oftentimes the dependent and/or independent variable may appear in logarithmic form. The log-linear model has a logarithmic term on the left-hand side of the equation and an untransformed variable on the right-hand side: \[ \log (Y) = \beta_0 + \beta_1 X + \epsilon \] In order to use this model we must have \(Y \gt 0\).
Furthermore, we may also sometimes want to take the logarithm in order to linearize \(Y\).
If \(Y\) is defined via the following exponential form: \[\begin{equation} Y = \exp(\beta_0 + \beta_1 X + \epsilon) \tag{3.10} \end{equation}\] then we can take the logarithm of \(Y\) to get the log-linear expression \(\log (Y) = \beta_0 + \beta_1 X + \epsilon\).
From eq. (3.10) we can see that we can use the log-transformation to regularize the data (i.e. to make highly skewed distributions less skewed).
set.seed(123)
#
N <- 1000
beta_0 <- 0.8
beta_1 <- 4
x <- seq(from = 0, to = 1, length.out = N)
e <- rnorm(mean = 0, sd = 1, n = length(x))
y <- exp(beta_0 + beta_1 * x + e)
#
#
#
#
#
#
par(mfrow = c(1, 2))
#
hist(y, breaks = 20, col = "cornflowerblue")
#
hist(log(y), breaks = 20, col = "cornflowerblue")
np.random.seed(123)
#
N = 1000
beta_0 = 0.8
beta_1 = 4
x = np.linspace(start = 0, stop = 1, num = N)
e = np.random.normal(loc = 0, scale = 1, size = len(x))
y = np.exp(beta_0 + beta_1 * x + e)
#
fig = plt.figure(num = 3, figsize = (10, 8))
ax = fig.add_subplot(1, 2, 1)
_ = ax.hist(y, bins = 20,
histtype = 'bar', color = "cornflowerblue", ec = 'black')
_ = ax.set_title("Histogram of $Y$")
ax = fig.add_subplot(1, 2, 2)
_ = ax.hist(np.log(y), bins = 20,
histtype = 'bar', color = "cornflowerblue", ec = 'black')
_ = ax.set_title("Histogram of $\\log(Y)$")
plt.show()
Many economic variables, such as price, income and wage, have skewed distributions, and taking logarithms to regularize the data is a common practice.
Furthermore, setting \(\mathbf{Y} = \left[\log(Y_1), ..., \log(Y_N) \right]^\top\) allows us to apply the OLS via the same formulas as we did before:
## (Intercept) x
## 0.8156225 4.0010107
## [0.81327991 3.89431192]
Furthermore, we can calculate the log-transformed fitted values \(\log (\widehat{Y})\), as well as transform them back to \(\widehat{Y}\):
par(mfrow = c(1, 2))
#
plot(x, y, ylim = c(0, 200),
main = expression(Y == exp(beta[0] + beta[1] * X + epsilon)))
#
#
lines(x, exp(fitted(lm_fit)), col = "blue", lwd = 2)
#
#
#
plot(x, log(y),
main = expression(log(Y) == beta[0] + beta[1] * X + epsilon))
#
#
lines(x, fitted(lm_fit), col = "blue", lwd = 2)
fig = plt.figure(num = 4, figsize = (10, 8))
ax = fig.add_subplot(1, 2, 1)
_ = ax.plot(x, y, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = ax.plot(x, np.exp(lm_fit.fittedvalues),
linestyle = "-", color = "blue", linewidth = 2)
_ = ax.set_ylim((0, 200))
_ = ax.set_title("$Y = \\exp(\\beta_0 + \\beta_1 X + \\epsilon)$")
ax = fig.add_subplot(1, 2, 2)
_ = ax.plot(x, np.log(y), color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = ax.plot(x, lm_fit.fittedvalues,
linestyle = "-", color = "blue", linewidth = 2)
_ = ax.set_title("$\\log(Y) = \\beta_0 + \\beta_1 X + \\epsilon$")
plt.show()
Since \(\Delta \log(Y) = \beta_1 \Delta X\), we have \(\% \Delta Y \approx 100 \cdot \Delta \log(Y) = (100 \cdot \beta_1) \Delta X\). In other words, for the log-linear model, a unit increase in \(X\) yields (approximately) a \((100 \cdot \beta_1)\) percent change in \(Y\).
We can rewrite the previous equality as: \[ 100 \cdot \beta_1 = \dfrac{\% \Delta Y}{\Delta X} := \text{semi-elasticity} \] This quantity, known as the semi-elasticity, is the percentage change in \(Y\) when \(X\) increases by one unit. As we have just shown, in the log-linear regression the semi-elasticity is constant and equal to \(100 \cdot \beta_1\).
Furthermore, we can derive the same measurements as before: \[ \text{slope} := \dfrac{d \mathbb{E} (Y \mid X)}{d X} = \beta_1 Y \] Unlike in the simple linear regression model, in the log-linear regression model the marginal effect increases for larger values of \(Y\) (note that this is \(Y\), not \(\log (Y)\)).
The elasticity (the percentage change in \(Y\), given a \(1\%\) increase in \(X\)): \[ \eta = \text{slope} \cdot \dfrac{X}{Y} = \beta_1 X \]
If we wanted to change the units of measurement of \(Y\) in a log-linear model, then \(\beta_0\) would change, but \(\beta_1\) would remain unchanged, since: \[ \log(c Y) = \log(c) + \log(Y) = \left[\log(c) + \beta_0 \right] + \beta_1 X + \epsilon \]
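This invariance is easy to verify numerically: rescaling \(Y\) by \(c\) shifts the estimated intercept by exactly \(\log(c)\) and leaves the slope untouched (the data-generating values below are arbitrary):

```python
import numpy as np

np.random.seed(0)
x = np.linspace(0, 1, 200)
y = np.exp(1 + 2 * x + np.random.normal(loc=0, scale=0.5, size=200))
c = 1000.0  # e.g. converting Y from thousands to single units

# OLS of log(y) and log(c * y) on a constant and x
X = np.column_stack((np.ones_like(x), x))
b        = np.linalg.lstsq(X, np.log(y), rcond=None)[0]
b_scaled = np.linalg.lstsq(X, np.log(c * y), rcond=None)[0]
```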
3.3.3.3 Linear-Log Regression Model
Alternatively, we may describe a linear relationship where \(X\) is log-transformed and \(Y\) is untransformed: \[ Y = \beta_0 + \beta_1 \log(X) + \epsilon \] where \(X \gt 0\). In this case, if \(X\) increases from \(X_0\) to \(X_1\), it follows that: \[ \begin{aligned} Y_0 &= \mathbb{E}(Y \mid X = X_0) = \beta_0 + \beta_1 \log(X_0)\\ Y_1 &= \mathbb{E}(Y \mid X = X_1) = \beta_0 + \beta_1 \log(X_1) \end{aligned} \] Setting \(\Delta Y = Y_1 - Y_0\) and \(\Delta \log(X) = \log(X_1) - \log(X_0)\) yields: \[ \Delta Y = \beta_1 \Delta \log(X) = \dfrac{\beta_1}{100} \left[100 \cdot \Delta \log(X) \right] \approx \dfrac{\beta_1}{100} (\% \Delta X) \] In other words, in a linear-log model, a \(1\%\) increase in \(X\) yields (approximately) a \(\beta_1 / 100\) unit change in \(Y\).
Furthermore, we have that: \[ \text{slope} := \dfrac{d \mathbb{E} (Y \mid X)}{d X} = \beta_1 \dfrac{1}{X} \] and: \[ \eta = \text{slope} \cdot \dfrac{X}{Y} = \beta_1 \dfrac{1}{Y} \]
If we wanted to change the units of measurement of \(X\) in a linear-log model, then \(\beta_0\) would change, but \(\beta_1\) would remain unchanged, since: \[ Y = \beta_0 + \beta_1 \log \left( \dfrac{c}{c} X \right) + \epsilon = \left[\beta_0 - \beta_1 \log(c) \right] + \beta_1 \log\left( c X \right) + \epsilon \]
Sometimes a linear-log model is not much different (in terms of model accuracy) from a linear-linear (i.e. simple) regression. However, because of the functional form of the linear-log model, if \(\beta_1 \lt 0\), then the function decreases at a decreasing rate as \(X\) increases. This means it may sometimes be useful when we do not want \(Y\) to take a negative value for a reasonably large value of \(X\).
Example 3.7 Let \(Y\) be expenditure on leisure activities, and \(X\) be age. Let us say that we only have expenditure data on ages from 18 to 50 and we would like to predict the expenditure for ages 51 to 80.
In this case it is reasonable to assume that expenditure cannot be negative for reasonable values of \(X\): an age of up to 80 may be a realistic assumption, while 100 years or more is less so. Further assume that the underlying (true) model is indeed linear-log: \[ Y = \beta_0 + \beta_1 \log(X) + \epsilon \] with \(\beta_0 = 1200\), \(\beta_1 = -250\) and \(\epsilon \sim \mathcal{N}(0, 50^2)\).
set.seed(123)
#
beta_0 <- 1200
beta_1 <- -250
x <- 18:50
max_x <- 80
e <- rnorm(mean = 0, sd = 50, n = length(x))
y <- beta_0 + beta_1 * log(x) + e
# Estimate a Linear-Linear and Linear-Log model:
mdl_1 <- lm(y ~ 1 + x)
y1 <- coef(mdl_1)[1] + coef(mdl_1)[2] * 18:max_x
mdl_2 <- lm(y ~ 1 + log(x))
y2 <- coef(mdl_2)[1] + coef(mdl_2)[2] * log(18:max_x)
np.random.seed(123)
#
beta_0 = 1200
beta_1 = -250
x = np.array(range(18, 51))
max_x = 80
e = np.random.normal(loc = 0, scale = 50, size = len(x))
y = beta_0 + beta_1 * np.log(x) + e
# Estimate a Linear-Linear and Linear-Log model:
mdl_1 = sm.OLS(y, sm.add_constant(x)).fit()
y1 = mdl_1.params[0] + mdl_1.params[1] * np.array(range(18, max_x + 1))
mdl_2 = sm.OLS(y, sm.add_constant(np.log(x))).fit()
y2 = mdl_2.params[0] + mdl_2.params[1] * np.log(np.array(range(18, max_x + 1)))
# Plot the sample data and the fitted values:
plot(x, y, ylim = c(-100, 700), xlim = c(18, max_x))
#
#
abline(h = 0, col = "black", lty = 2)
lines(18:max_x, y1, col = "red")
#
lines(18:max_x, y2, col = "blue")
#
legend(x = 50, y = 600, lty = c(1, 1), col = c("red", "blue"),
legend = c(expression("E(Y|X) = " ~ beta[0] + beta[1] * X),
expression("E(Y|X) = " ~ beta[0] + beta[1] * log(X))))
# Plot the sample data and the fitted values:
_ = plt.figure(num = 5, figsize = (10, 8))
_ = plt.plot(x, y, linestyle = "None", marker = "o", color = "black", markerfacecolor = 'None')
_ = plt.ylim((-100, 700))
_ = plt.xlim((15, max_x + 5))
_ = plt.axhline(y = 0, linestyle = "--", color = "black")
_ = plt.plot(np.array(range(18, max_x + 1)), y1,
linestyle = "-", color = "red", label = "$E(Y|X) = \\beta_0 + \\beta_1 X$")
_ = plt.plot(np.array(range(18, max_x + 1)), y2,
linestyle = "-", color = "blue", label = "$E(Y|X) = \\beta_0 + \\beta_1 \\log(X)$")
_ = plt.legend()
plt.show()
We see that the fitted values \(\widehat{Y}\) do not differ much between the linear-linear and linear-log cases; however, the predicted values (in this example \(Y\) is the expenditure) are more believable in the linear-log case, since they are non-negative.
3.3.3.4 Log-Log Regression Model
Elasticities are often important in applied economics. As such, it is sometimes convenient to have constant-elasticity models. If we take logarithms of both \(Y\) and \(X\), then we arrive at the log-log model: \[ \log (Y) = \beta_0 + \beta_1 \log(X) + \epsilon \] where \(X, Y \gt 0\). Then: \[ \text{slope} := \dfrac{d \mathbb{E} (Y \mid X)}{d X} = \beta_1 \dfrac{Y}{X} \] and: \[ \eta = \text{slope} \cdot \dfrac{X}{Y} = \beta_1 \] i.e. the elasticity of the log-log model is constant. In other words, in a log-log model, a \(1\%\) increase in \(X\) yields (approximately) a \(\beta_1\) percent change in \(Y\).
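To illustrate the constant-elasticity interpretation numerically, we can simulate data from a log-log model and verify that the OLS slope of \(\log(Y)\) on \(\log(X)\) recovers the elasticity. This is a sketch with assumed parameter values (\(\beta_0 = 1\), \(\beta_1 = 1.5\)):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
beta_0, beta_1 = 1.0, 1.5  # assumed true values; beta_1 is the elasticity

x = rng.uniform(1, 10, size=n)
eps = rng.normal(0, 0.1, size=n)
y = np.exp(beta_0 + beta_1 * np.log(x) + eps)

# OLS of log(Y) on log(X): the slope estimates the constant elasticity
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
```

The estimated slope should be close to 1.5, i.e. a \(1\%\) increase in \(X\) yields approximately a \(1.5\%\) increase in \(Y\), regardless of the level of \(X\).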
3.3.4 Nonlinear Regression Models
Sometimes the scatter plot, or theoretical assumptions, suggest a nonlinear relationship. The word nonlinear may be a bit confusing: in this context we mean that a regression is nonlinear in parameters.
3.3.4.1 Exponential Regression
An exponential regression model may be specified as: \[ Y = \beta_0 + \beta_1 \cdot \exp(\beta_2 X) + \epsilon \] We see that \(\beta_2\) does not enter the equation linearly. Nevertheless, we may estimate the parameters by minimizing the sum of squared residuals: \[ \text{RSS}(\boldsymbol{\beta}) = \sum_{i = 1}^N \widehat{\epsilon}_i^2 = \sum_{i = 1}^N \left( Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 \exp(\widehat{\beta}_2 X_i) \right)^2 \] Taking the partial derivatives and equating them to zero yields: \[ \begin{cases} \dfrac{\partial \text{RSS}}{\partial \widehat{\beta}_0} &= -2\sum_{i = 1}^N \left( Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 \exp(\widehat{\beta}_2 X_i) \right) = 0\\\\ \dfrac{\partial \text{RSS}}{\partial \widehat{\beta}_1} &= -2\sum_{i = 1}^N \left( Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 \exp(\widehat{\beta}_2 X_i) \right) \exp(\widehat{\beta}_2 X_i) = 0 \\\\ \dfrac{\partial \text{RSS}}{\partial \widehat{\beta}_2} &= -2\sum_{i = 1}^N \left( Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 \exp(\widehat{\beta}_2 X_i) \right) \widehat{\beta}_1 X_i \exp(\widehat{\beta}_2 X_i) = 0 \end{cases} \] This system has no closed-form solution, but it can be solved via an iterative numerical method, such as gradient descent.
We can also specify our \(\text{RSS}(\boldsymbol{\beta})\) function, which we can minimize via the built-in optimization functions in R and Python.
We simulate the data using the following parameters:
 \(\beta_0 = 0.8\), \(\beta_1 = 1.2\), \(\beta_2 = 3\);
 \(\epsilon \sim \mathcal{N}(0, 1^2)\).
np.random.seed(123)
#
N = 1500
beta_0 = 0.8
beta_1 = 1.2
beta_2 = 3
x = np.linspace(start = 0, stop = 1, num = N)
e = np.random.normal(loc = 0, scale = 1, size = len(x))
y = beta_0 + beta_1 * np.exp(beta_2 * x) + e
#
_ = plt.figure(num = 6, figsize = (10, 8))
_ = plt.plot(x, y, linestyle = "None", marker = "o",
color = "black", markerfacecolor = 'None')
plt.show()
Firstly, define the function that we wish to minimize:
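The original code listing for this function is not shown here; a minimal Python version consistent with the call to optimize.minimize further below (taking the parameter vector first, followed by the data) could be:

```python
import numpy as np

def RSS(beta, y, x):
    # Sum of squared residuals for the model Y = b0 + b1 * exp(b2 * X) + e
    return np.sum((y - beta[0] - beta[1] * np.exp(beta[2] * x)) ** 2)
```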
Then, use the built-in optimization function to minimize the RSS. Note that we need to specify starting parameter values; we choose \(\widehat{\beta}_0 = \widehat{\beta}_1 = \widehat{\beta}_2 = 0\) for this example:
## [1] 0.8362279 1.1878721 3.0108930
import scipy.optimize as optimize
#
opt_res = optimize.minimize(fun = RSS, x0 = [0, 0, 0], args = (y, x))
print(opt_res.x)
## [0.87673901 1.13297236 3.06450799]
The estimated parameter values appear to be similar to the true values.
Alternatively, if we have the following exponential regression: \[ Y = \exp(\beta_0 + \beta_1 X + \epsilon) \] then we may notice, upon further inspection, that we do not need to deal with a nonlinear regression. Sometimes it is possible to transform a nonlinear regression into a linear one.
We can transform the specified exponential model by taking the logarithm of both sides of the equation: \[ \log (Y) = \beta_0 + \beta_1 X + \epsilon \] which gives us a log-linear regression model.
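A quick simulation check confirms that OLS on the log-transformed data recovers the parameters of the exponential model. This is a sketch with assumed values \(\beta_0 = 0.8\), \(\beta_1 = 4\):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
beta_0, beta_1 = 0.8, 4.0  # assumed true values
x = rng.uniform(0, 1, size=n)
y = np.exp(beta_0 + beta_1 * x + rng.normal(0, 0.2, size=n))

# Taking logarithms linearizes the model: log(Y) = b0 + b1 * X + e
b1_hat, b0_hat = np.polyfit(x, np.log(y), 1)
```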
3.3.4.2 Power Regression
A power regression may be specified as: \[ Y = \beta_0 + \beta_1 X^{\beta_2} + \epsilon \]
A well-known example is the Cobb-Douglas production function: \[ Y = A L^{\alpha} K^{\beta} \exp(\epsilon) \] where \(Y\) is the total production value of all goods; \(L\) is the labour input (e.g. total number of person-hours worked); \(K\) is the capital input (the value of all machinery, equipment, etc.); \(A\) is the total-factor productivity; and \(\alpha, \beta\) are the output elasticities of labour and capital, respectively.
It can be transformed into a log-log model: \[ \log(Y) = \log(A) + \alpha \log(L) + \beta \log(K) + \epsilon \]
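As a sketch (with assumed values \(A = 2\), \(\alpha = 0.6\), \(\beta = 0.4\)), the transformed model can be estimated by OLS on the logarithms of simulated inputs:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
A, alpha, beta = 2.0, 0.6, 0.4  # assumed true values
L = rng.uniform(1, 100, size=n)  # labour input
K = rng.uniform(1, 100, size=n)  # capital input
eps = rng.normal(0, 0.05, size=n)
Y = A * L**alpha * K**beta * np.exp(eps)

# log-log transformation: log(Y) = log(A) + alpha*log(L) + beta*log(K) + e
X = np.column_stack([np.ones(n), np.log(L), np.log(K)])
coef, *_ = np.linalg.lstsq(X, np.log(Y), rcond=None)
log_A_hat, alpha_hat, beta_hat = coef
```

The estimates of \(\alpha\), \(\beta\) and \(\exp(\log(A))\) should be close to the assumed true values.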
We have learned different specification possibilities of a regression model, as well as estimation methods for each case. However, simply estimating the parameters of a number of different models says little about either the overall accuracy of a model, or about whether the included explanatory variable actually helps us predict the dependent variable.
3.3.5 Choosing a Functional Form
Generally, we will never know the true relationship between \(Y\) and \(X\). The functional form that we select is only an approximation. As such, we should select a functional form which satisfies our objectives, economic theory, and the underlying model assumptions, like (UR.1) - (UR.3), and which provides an adequate fit to the data sample.
Consequently, we may incorrectly specify a functional form for our regression model, or, some of our regression assumptions may not hold.
It is a good idea to focus on:
 Examining the regression results:
 checking whether the signs of the coefficients follow economic logic;
 checking the significance of the coefficients;
 Examining the residuals \(\mathbf{Y} - \widehat{\mathbf{Y}}\): if our specified functional form is inadequate, then the residuals would not necessarily satisfy (UR.3), (UR.4).
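For the significance check mentioned above, the slope's t-statistic in a simple regression can be computed directly from the OLS formulas. A sketch on simulated data (the parameter values are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(0, 1, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=n)  # assumed true slope of 2

xc = x - x.mean()
b1 = np.sum(xc * (y - y.mean())) / np.sum(xc**2)  # OLS slope
b0 = y.mean() - b1 * x.mean()                     # OLS intercept
resid = y - b0 - b1 * x
s2 = np.sum(resid**2) / (n - 2)                   # residual variance estimate
se_b1 = np.sqrt(s2 / np.sum(xc**2))               # standard error of the slope
t_stat = b1 / se_b1  # |t| much larger than ~2 suggests a significant slope
```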
Nevertheless, before examining the signs and model statistical properties, we should have a first look at the data and examine the dependent variable \(Y\) and the independent variable \(X\) plots to make some initial guesses at the functional form of the regression model.
3.3.5.1 Histogram of The Response Variable
np.random.seed(123)
#
N = 1000
#
y1 = np.random.normal(loc = 0, scale = 1, size = N)
y2 = np.exp(np.random.normal(loc = 0, scale = 1, size = N))
#
fig = plt.figure(num = 7, figsize = (10, 8))
ax = fig.add_subplot(121)
_ = ax.hist(y1, bins = 40, histtype = 'bar',
color = "cornflowerblue", ec = 'black')
ax = fig.add_subplot(122)
_ = ax.hist(y2, bins = 40, histtype = 'bar',
color = "cornflowerblue", ec = 'black')
plt.show()
The histogram can be used to answer the following questions:
 What kind of population distribution is the data sample most likely from?
 What is the sample mean of the data?
 Is the spread (i.e. the variance) of the data large?
 Are the data symmetric, or skewed?
 Are there outliers in the data?
3.3.5.2 Run-Sequence Plot
An easy way to graphically summarize a univariate data set. A common assumption about univariate data sets is that they:
 are random realizations;
 are from the same population (i.e. all random drawings have the same distribution);
The run-sequence plot (or simply, run chart) plots the variable of interest on the vertical axis, and the observation index on the horizontal axis. Such plots are primarily used to inspect whether there are any outliers, changes in the mean or variance, or dependence across observations.
 A \(r.v.\) from a normal distribution: \(Y_{1,i} \sim \mathcal{N}(\mu, \sigma^2)\);
 A \(r.v.\) from a normal distribution, where the variance depends on the observation index: \(Y_{2,i} \sim \mathcal{N}(\mu, i \cdot \sigma^2)\);
 A \(r.v.\), where \(Y_{3,i} \sim \mathcal{N}(\mu, \sigma^2)\) for \(i = 1,...,100\) and \(Y_{3,i} \sim \mathcal{N}(1.5 \cdot \mu, \sigma^2)\) for \(i = 101, ..., 200\);
 \(\mu = 20\), \(\sigma = 1\).
set.seed(123)
#
N <- 200
#
y1 <- rnorm(mean = 20, sd = 1, n = N)
y2 <- rnorm(mean = 20, sd = sqrt(1:N) * 1, n = N)
y3 <- c(rnorm(mean = 20, sd = 1, n = 100),
rnorm(mean = 30, sd = 1, n = 100))
#
#
#
par(mfrow = c(2, 2))
#
#
plot(y1)
title("Normal Distribution")
#
#
plot(y2)
title("Change in variance")
#
#
plot(y3)
title("Change in mean")
np.random.seed(123)
#
N = 200
#
y1 = np.random.normal(loc = 20, scale = 1, size = N)
y2 = np.random.normal(loc = 20, scale = np.sqrt(np.array(range(1,N+1))) * 1,
size = N)
y3 = np.concatenate((np.random.normal(loc = 20, scale = 1, size = 100),
np.random.normal(loc = 30, scale = 1, size = 100)))
#
fig = plt.figure(num = 8, figsize = (10, 8))
ax = fig.add_subplot(221)
_ = ax.plot(y1, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Normal Distribution")
ax = fig.add_subplot(222)
_ = ax.plot(y2, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Change in variance")
ax = fig.add_subplot(223)
_ = ax.plot(y3, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Change in mean")
plt.show()
We see that the underlying differences in the distribution of the data are visible in the run-sequence plot of the data sample.
The run-sequence plot can help answer questions like:
 Are there any (significant) changes in the mean?
 Are there any (significant) changes in the variance?
 Are there any outliers in the data?
The run-sequence plot can be useful when examining the residuals of a model.
set.seed(123)
#
N <- 1000
beta_0 <- 0.8
beta_1 <- 4
x <- seq(from = 0, to = 1, length.out = N)
e <- rnorm(mean = 0, sd = 1, n = length(x))
y <- exp(beta_0 + beta_1 * x + e)
#
# Fit linear-linear
lm_lin = lm(y ~ x)
# Fit log-linear
lm_log = lm(log(y) ~ x)
#
#Plot the residuals
#
#
#
par(mfrow = c(1, 2))
#
#
plot(lm_lin$residuals, main = "Residuals of a linear-linear model")
#
#
plot(lm_log$residuals, main = "Residuals of a log-linear model")
np.random.seed(123)
#
N = 1000
beta_0 = 0.8
beta_1 = 4
x = np.linspace(start = 0, stop = 1, num = N)
e = np.random.normal(loc = 0, scale = 1, size = len(x))
y = np.exp(beta_0 + beta_1 * x + e)
#
# Fit linear-linear
lm_lin = sm.OLS(y, sm.add_constant(x)).fit()
# Fit log-linear
lm_log = sm.OLS(np.log(y), sm.add_constant(x)).fit()
#
#Plot the residuals
fig = plt.figure(num = 9, figsize = (10, 8))
ax = fig.add_subplot(121)
_ = ax.plot(lm_lin.resid, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Residuals of a linearlinear model")
ax = fig.add_subplot(122)
_ = ax.plot(lm_log.resid, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Residuals of a log-linear model")
plt.show()
We see that in the case of a linear-linear model, the residual variance is not the same across observations, while the log-linear model has residuals that appear to have the same mean and variance across observations.
We note that the values of \(X\) are ordered from smallest to largest: a larger index value corresponds to a value of \(Y\) that was generated from a larger value of \(X\). If we were to randomize the order of \(X_i\) (and as a result, randomize \(Y_i\) and \(\widehat{Y}_i\)), the run-sequence plot of the residuals would look as follows:
np.random.seed(123)
i_index = np.array(range(0, len(x)))
np.random.shuffle(i_index)
#Plot the residuals
fig = plt.figure(num = 10, figsize = (10, 8))
ax = fig.add_subplot(121)
_ = ax.plot(lm_lin.resid[i_index], color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Residuals of a linear-linear model, index randomized")
ax = fig.add_subplot(122)
_ = ax.plot(lm_log.resid[i_index], color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Residuals of a log-linear model, index randomized")
plt.show()
We again note that in cross-sectional data one of our assumptions is that the observations \((X_i, Y_i)\) are independent of \((X_j, Y_j)\) for \(i \neq j\). As such, we can order the data in any way we want. In the first case (with the values of \(X\) ordered from smallest to largest), ordering the residuals by the value of \(X\) allows us to examine whether our model performs equally well for all values of \(X\); clearly, for larger values of \(X\) this was not the case. In the second case, where we shuffled the order of our data \((X_i, Y_i)\), the residual run-sequence plot allows us to identify whether some residuals are more pronounced, but it does not identify when this happens, unlike the first case, with ordering by the values of \(X\).
Run-sequence plots are more common for time series data, but can still be utilized for cross-sectional data.
3.3.5.3 Scatter Plot
As we have seen, a scatter plot reveals relationships between two variables of interest in the form of non-random structure. The vertical axis is usually used for the dependent (i.e. response) variable \(Y\), and the horizontal axis for the independent variable \(X\).
set.seed(123)
#
N <- 1000
beta_0 <- 0.8
beta_1 <- 4
x1 <- rnorm(mean = 2, sd = 2, n = N)
x2 <- rnorm(mean = 4, sd = 3, n = N)
e <- rnorm(mean = 0, sd = 1, n = N)
y <- beta_0 + beta_1 * x1 + e
#
#
#
#
#
par(mfrow = c(1, 2))
#
#
plot(x1, y, main = expression("Scatter plot of " ~ X[1] ~ " and " ~ Y))
#
#
plot(x2, y, main = expression("Scatter plot of " ~ X[2] ~ " and " ~ Y))
np.random.seed(123)
#
N = 1000
beta_0 = 0.8
beta_1 = 4
x1 = np.random.normal(loc = 2, scale = 2, size = N)
x2 = np.random.normal(loc = 4, scale = 3, size = N)
e = np.random.normal(loc = 0, scale = 1, size = N)
y = beta_0 + beta_1 * x1 + e
#
#
fig = plt.figure(num = 11, figsize = (10, 8))
ax = fig.add_subplot(121)
_ = ax.plot(x1, y, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Scatter plot of $X_1$ and $Y$")
ax = fig.add_subplot(122)
_ = ax.plot(x2, y, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Scatter plot of $X_2$ and $Y$")
plt.show()
We see a clear linear relationship between \(X_1\) and \(Y\). The scatter plot of \(X_2\) and \(Y\) appears to be random, i.e. \(Y\) does not depend on the value of \(X_2\).
The scatter plot can help us answer the following questions:
 Are the variables \(X\) and \(Y\) related?
 What kind of relationship (linear, exponential, etc.) could describe \(Y\) and \(X\)?
 Are there any outliers?
 Does the variation in \(Y\) change depending on the value of \(X\)?
In addition to the run-sequence plot, the scatter plot of the residuals \(\widehat{\epsilon}\) against the independent variable \(X\) can be useful to determine whether our model performs the same across all values of \(X\), or if there are specific values of \(X\) for which our model is worse (which would result in larger residual values). If the values of \(X\) are sorted in ascending order, then the run-sequence plot will be very similar to the scatter plot of \(\widehat{\epsilon}\) and \(X\), with the only difference being the spacing between the points.
set.seed(123)
#
N <- 50
beta_0 <- 0.2
beta_1 <- 0.02
# Generate a sequence of X, two times the size of N
x <- seq(from = 0, to = 20, length.out = N*2)
# Take a random sample from X, size N
x <- sample(x, replace = TRUE, size = N)
# Sort x
x <- sort(x)
e <- rnorm(mean = 0, sd = 1, n = length(x))
y <- exp(beta_0 + beta_1 * x + e)
#
# Fit linear-linear
lm_lin = lm(y ~ x)
# Fit log-linear
lm_log = lm(log(y) ~ x)
np.random.seed(123)
#
N = 50
beta_0 = 0.2
beta_1 = 0.02
# Generate a sequence of X, two times the size of N
x = np.linspace(start = 0, stop = 20, num = N*2)
# Take a random sample from X, size N
x = np.random.choice(x, size = N, replace = True)
# Sort x
x = np.sort(x)
e = np.random.normal(loc = 0, scale = 1, size = len(x))
y = np.exp(beta_0 + beta_1 * x + e)
#
# Fit linear-linear
lm_lin = sm.OLS(y, sm.add_constant(x)).fit()
# Fit log-linear
lm_log = sm.OLS(np.log(y), sm.add_constant(x)).fit()
#Plot the residuals
#
#
#
#
#
#
#
#
#
#
par(mfrow = c(2, 2))
#
#
plot(lm_lin$residuals, main = "Residual run-sequence plot \n linear-linear model")
plot(x, lm_lin$residuals, main = "Scatter plot of Residuals and X \n linear-linear model")
#
#
plot(lm_log$residuals, main = "Residual run-sequence plot \n log-linear model")
plot(x, lm_log$residuals, main = "Scatter plot of Residuals and X \n log-linear model")
#Plot the residuals
fig = plt.figure(num = 12, figsize = (10, 8))
ax = fig.add_subplot(221)
_ = ax.plot(lm_lin.resid, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Residual run-sequence plot \n linear-linear model")
ax = fig.add_subplot(222)
_ = ax.plot(x, lm_lin.resid, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Scatter plot of Residuals and X \n linear-linear model")
ax = fig.add_subplot(223)
_ = ax.plot(lm_log.resid, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Residual run-sequence plot \n log-linear model")
ax = fig.add_subplot(224)
_ = ax.plot(x, lm_log.resid, color = "black",
linestyle = "None", marker = "o", markerfacecolor = 'None')
_ = plt.title("Scatter plot of Residuals and X \n log-linear model")
plt.tight_layout()
plt.show()
Notice that when \(X\) is sorted, the only difference between the run-sequence plot and the scatter plot is in the spacing between the points.
Scatter plots are similar to run-sequence plots for univariate cross-sectional data. However, unlike run-sequence plots, scatter plots allow us to examine the relationship between \(X\) and \(Y\) (instead of only examining \(Y\)).
For multivariate cross-sectional data, run-sequence plots may be a quicker way to focus on the properties of the \(Y\) variable itself.
3.3.5.4 QuantileQuantile Plot
The quantile-quantile (q-q) plot can be used to determine whether two data sets come from a common distribution. A 45-degree line is also plotted for easier comparison: if the two sets come from the same distribution, the points should fall approximately along this reference line. The greater the departure from this line, the greater the evidence that the two data sets come from populations with different distributions. More specifically, the quantiles of the first dataset are plotted against the quantiles of the second dataset.
The quantile-quantile (q-q) plot is a scatter plot created by plotting two sets of quantiles against one another.
The q-q plot is similar to a probability plot, which can be used to examine whether the model residuals are normally distributed. A probability plot is a q-q plot in which the quantiles of one of the data samples are replaced with the quantiles of a theoretical distribution.
In a normal probability plot the data are plotted against a theoretical (standard) normal distribution in such a way that the points should form an approximate straight line. Departures from this straight line indicate departures from normality. The normal distribution is a base distribution, and its quantiles are on the horizontal axis as the Theoretical Quantiles, while the sample data quantiles are on the vertical axis.
#
#
set.seed(123)
#
N <- 1000
y1 <- rnorm(mean = 0, sd = 1, n = N)
y2 <- exp(y1)
#
#
par(mfrow = c(2, 2))
#
#
hist(y1, breaks = 20, col = "cornflowerblue",
main = expression("Histogram of"~Y[1]~"~N(0,1)"))
qqnorm(y1, main = expression("Normal QQ Plot of"~Y[1]~"~N(0,1)"))
qqline(y1, col = "cornflowerblue", lwd = 2)
#
#
hist(y2, breaks = 20, col = "cornflowerblue",
main = expression("Histogram of"~Y[2]==exp(Y[1])))
qqnorm(y2, main = expression("Normal QQ Plot of"~Y[2]==exp(Y[1])))
qqline(y2, col = "cornflowerblue", lwd = 2)
import scipy.stats as stats
#
np.random.seed(123)
#
N = 1000
y1 = np.random.normal(loc = 0, scale = 1, size = N)
y2 = np.exp(y1)
#
fig = plt.figure(num = 13, figsize = (10, 8))
_ = fig.add_subplot(221).hist(y1, color = "cornflowerblue", bins = 20, ec = 'black')
_ = plt.title("Histogram of $Y_1 \\sim N(0, 1)$")
ax = fig.add_subplot(222)
_ = stats.probplot(y1, dist = "norm", plot = ax)
_ = plt.title("Normal QQ plot of $Y_1 \\sim N(0, 1)$")
_ = fig.add_subplot(223).hist(y2, bins = 20, color = "cornflowerblue", ec = 'black')
_ = plt.title("Histogram of $Y_2 = \\exp(Y_1)$")
ax = fig.add_subplot(224)
_ = stats.probplot(y2, dist = "norm", plot = ax)
_ = plt.title("Normal QQ plot of $Y_2 = \\exp(Y_1)$")
plt.tight_layout()
plt.show()
We see that the q-q plot of \(Y_1\) indicates that \(Y_1\) is normally distributed, since the quantiles fall along the diagonal line. On the other hand, the q-q plot of \(Y_2\) shows that the data are skewed.
The probability plot is used to answer the following questions:
 Does a specified theoretical distribution provide a good fit to my data?
 What distribution best fits my data?
 What are good estimates for the location and scale parameters of the chosen distribution?
In addition, the normal probability plot answers the following questions:
 Is the data normally distributed?
 What is the nature of the departure from normality (data skewed with shorter/longer tails)?
As mentioned, probability plots can be used to inspect whether the residuals are normally distributed. We will see an example of this in the residual diagnostics section later on.
Some notes on the terminology of quantiles, quartiles and percentiles:
 Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way. The \(p\)-quantile is defined as the value below which \(p \cdot N\) of the observations fall, where \(0 \leq p \leq 1\) and \(N\) is the number of observations.
 Percentiles go from 0 to 100. A percentile indicates the value below which a given percentage of observations in a group of observations falls. For example, the 20th percentile is the value below which \(20\%\) of the observations may be found.
 Quartiles go from 0 to 4. They are values that divide a list of sorted values into quarters.
In general, percentiles and quartiles are specific types of quantiles. The relationship is as follows:

 0th quartile = 0 quantile = 0th percentile
 1st quartile = 0.25 quantile = 25th percentile
 2nd quartile = 0.5 quantile = 50th percentile (median)
 3rd quartile = 0.75 quantile = 75th percentile
 4th quartile = 1 quantile = 100th percentile
In our previous example if we wanted to mark the quantiles, we could do so with the following code:
quantiles <- c(0.25, 0.5, 0.75)
#
par(mfrow = c(1,2))
qqnorm(y1, main = expression("Normal QQ Plot of"~Y[1]~"~N(0,1)"), col = "blue")
qqline(y1, col = "red", lwd = 2)
# Draw the sample data quantiles:
abline(h = quantile(y1, quantiles),col="blue",lty=2)
#
# Draw the theoretical quantiles:
abline(v = qnorm(quantiles),col = "red",lty=2)
#
#
#
#
#
qqnorm(y2, main = expression("Normal QQ Plot of"~Y[2]==exp(Y[1])), col = "blue")
qqline(y2, col = "red", lwd = 2)
# Draw the sample data quantiles:
#
abline(h = quantile(y2, quantiles),col="blue",lty=2)
# Draw the theoretical quantiles:
#
abline(v = qnorm(quantiles),col = "red", lty=2)
quantiles = [0.25, 0.5, 0.75]
#
fig = plt.figure(num = 14, figsize = (10, 8))
ax = fig.add_subplot(121)
_ = stats.probplot(y1, dist = "norm", plot = ax)
# Draw the sample data quantiles:
_ = [plt.axhline(ii, linewidth = 1, linestyle = "--", color = 'b')
for ii in np.percentile(y1, q = np.array(quantiles)*100)]
# Draw the theoretical quantiles:
_ = [plt.axvline(ii, linewidth = 1, linestyle = "--", color = 'r')
for ii in stats.norm.ppf(quantiles)]
_ = plt.title("Normal QQ plot of $Y_1 \\sim N(0, 1)$")
ax = fig.add_subplot(122)
_ = stats.probplot(y2, dist = "norm", plot = ax)
# Draw the sample data quantiles:
_ = [plt.axhline(ii, linewidth = 1, linestyle = "--", color = 'b')
for ii in np.percentile(y2, q = np.array(quantiles)*100)]
# Draw the theoretical quantiles:
_ = [plt.axvline(ii, linewidth = 1, linestyle = "--", color = 'r')
for ii in stats.norm.ppf(quantiles)]
_ = plt.title("Normal QQ plot of $Y_2 = \\exp(Y_1)$")
plt.tight_layout()
plt.show()
Note: newer versions of numpy in Python include the np.quantile() function, in which case we would not need to multiply the quantile values by 100.
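For instance, when both functions are available, the two calls give identical results (a small sketch):

```python
import numpy as np

y = np.arange(1.0, 101.0)
q = [0.25, 0.5, 0.75]
# np.quantile takes probabilities in [0, 1];
# np.percentile takes percentages in [0, 100]
a = np.quantile(y, q)
b = np.percentile(y, np.array(q) * 100)
```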