4.3 Goodness-Of-Fit
Let our multiple regression model be defined as: \[ Y_i = \beta_0 + \beta_1 X_{1i} + ... + \beta_k X_{ki} + \epsilon_i,\quad i = 1,...,N \] Furthermore, assume that conditions (MR.1) - (MR.6) hold true.
4.3.1 \(F\)-test For Goodness of Fit: Joint Hypothesis Test for the Overall Model Significance
This is the joint hypothesis test for multiple coefficient significance, applied to all slope coefficients (excluding the intercept): \[ \begin{cases} H_0&: \beta_1 = \beta_2 = ... = \beta_k = 0 \\ H_1&: \beta_{j} \neq 0, \quad \text{for some } j \end{cases} \] with the associated \(F\)-statistic: \[ F = \dfrac{\text{ESS}/k}{\text{RSS}/(N-(k+1))} = \dfrac{R^2/k}{(1 - R^2)/(N-(k+1))} \] If \(F > F_{(1 - \alpha, k, N-(k+1))}\) for some significance level \(\alpha\) (or if the associated \(p\)-value is less than \(\alpha\)), we reject the null hypothesis.
The \(F\)-test of overall significance compares a model with no predictors to the specified model. A regression model that contains no predictors is also known as an intercept-only model, where all of the fitted values equal the mean of the response variable. Thus, if the \(p\)-value of the overall \(F\)-test is less than the significance level \(\alpha\), the specified model predicts the response variable better than the mean of the response.
This is somewhat similar to what \(R^2\) measures. However, \(R^2\) does not provide a formal hypothesis test (whereas the overall \(F\)-test does). Consequently, if the \(p\)-value of the overall \(F\)-test is less than the significance level \(\alpha\), we can conclude that \(R^2\) is significantly different from zero.
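As a sketch of how this test can be carried out in R, suppose that mdl is a fitted lm() model with \(k\) non-intercept regressors (the object name mdl is an assumption of this illustration; summary() already reports the same statistic and \(p\)-value):
# assume `mdl` is a fitted lm() object with k non-intercept regressors
r_sq   <- summary(mdl)$r.squared
k      <- length(coef(mdl)) - 1
N      <- nrow(model.frame(mdl))
F_val  <- (r_sq / k) / ((1 - r_sq) / (N - (k + 1)))
F_crit <- qf(1 - 0.05, df1 = k, df2 = N - (k + 1))                  # critical value at alpha = 0.05
p_val  <- pf(F_val, df1 = k, df2 = N - (k + 1), lower.tail = FALSE) # associated p-value
# reject H0 if F_val > F_crit (equivalently, if p_val < 0.05)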
4.3.2 \(R\)-squared and Adjusted \(R\)-squared
In the multiple regression model \(R^2\) is a measure of the proportion of variation in the dependent variable that is explained by all the explanatory variables included in the model: \[ R^2 = \dfrac{\text{ESS}}{\text{TSS}} = \dfrac{\sum_{i = 1}^N \left( \widehat{Y}_i - \overline{Y} \right)^2}{\sum_{i = 1}^N \left(Y_i - \overline{Y} \right)^2} = 1 - \dfrac{\text{RSS}}{\text{TSS}} = 1 - \dfrac{\sum_{i = 1}^N \widehat{\epsilon}_i^2}{\sum_{i = 1}^N \left(Y_i - \overline{Y} \right)^2} \] where \(\widehat{\epsilon}_i = Y_i - (\widehat{\beta}_0 + \widehat{\beta}_1 X_{1,i} + ... + \widehat{\beta}_k X_{k,i})\).
For the multiple regression model, \(R^2\) automatically increases when extra explanatory variables are added to the model, even if the added variables have no justification: if the model contains \(N-1\) explanatory variables, then \(R^2 = 1\). As such, an adjusted \(R^2\), \(R^2_{adj}\), may be used. This modification of \(R^2\) adjusts for the number of explanatory variables (excluding the constant) in the model, relative to the number of data points. Furthermore, \(R^2_{adj}\) can be negative and \(R^2_{adj} \leq R^2\). The adjusted \(R^2\) is defined as: \[ R^2_{adj} = 1 - (1 - R^2)\dfrac{N-1}{N - k - 1} = 1 - \dfrac{\text{RSS}/(N-k-1)}{\text{TSS} /(N-1)} \] The adjusted \(R^2\), \(R^2_{adj}\), can be interpreted as an approximately unbiased estimator of the population \(R^2\).
However, because of this correction, \(R^2_{adj}\) loses its direct interpretation - it is no longer the proportion of explained variation.
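As a sketch (again assuming a fitted lm() object named mdl, which is an assumption of this illustration), both measures can be computed directly from the residuals and compared with the values reported by summary():
# assume `mdl` is a fitted lm() object
y_obs <- model.frame(mdl)[, 1]          # the dependent variable used in the model
RSS   <- sum(residuals(mdl)^2)          # residual sum of squares
TSS   <- sum((y_obs - mean(y_obs))^2)   # total sum of squares
k     <- length(coef(mdl)) - 1
N     <- length(y_obs)
R_sq     <- 1 - RSS / TSS
R_sq_adj <- 1 - (1 - R_sq) * (N - 1) / (N - k - 1)
# these should match summary(mdl)$r.squared and summary(mdl)$adj.r.squared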
4.3.3 AIC and BIC
Selecting variables that maximize \(R^2_{adj}\) is equivalent to selecting variables that minimize \(\text{RSS}/(N - k - 1)\) - that is, minimizing \(\text{RSS}\) subject to a penalty based on the number of variables. We will introduce two more model selection criteria, which work in a similar way, but impose different penalties for the inclusion of additional variables:
- Akaike Information Criterion (AIC): \[ AIC = N + N \log(2\pi) + N \log \left( \dfrac{RSS}{N} \right) + 2(k + 1) \]
- Bayesian Information Criterion (BIC), also called Schwarz Criterion (SC):
\[
BIC = N + N \log(2\pi) + N \log \left( \dfrac{RSS}{N} \right) + (k + 1) \log(N)
\]
The formulas are based on the correction (from Professor Bret Larget's “Statistics 333: Applied Regression Analysis” course notes, Spring 2003 semester) of the formulas in Ramsey and Schafer (2012). These are the formulas used in R and Python, with slight differences.
On the one hand, the variance parameter (which is also estimated) should also be counted: overall, there are \(k + 2\) parameters, \(\beta_0, \beta_1,..., \beta_k, \sigma^2\). As such, \((k + 1)\) is sometimes replaced with \((k + 2)\).
On the other hand, some textbooks ignore the first two terms, \(N + N \log(2\pi)\), and use \(k+1\) instead of \(k+2\).
Note: in R, \((k + 2)\) is used, while in Python, \((k + 1)\) is used.
In both cases, the \(N \log \left( \text{RSS}/N \right)\) term decreases with each extra variable added, but the last term increases, penalizing the inclusion of additional variables. BIC penalizes extra variables more strictly than AIC (since \(\log(N) > 2\) for \(N \geq 8\)).
The model with the smallest AIC (or BIC) is preferred.
To illustrate these goodness-of-fit measures, we will simulate the following model: \[ \begin{aligned} \log(Y_i) = \beta_0 &+ \beta_1 X_{1i} + \beta_2 \log(X_{2i}) + \beta_3 \text{MARRIED}_{i} + \beta_4 \text{AGE}\_\text{GROUP}_{1i} + \beta_5 \text{AGE}\_\text{GROUP}_{2i} \\ &+ \beta_6 (\text{AGE}\_\text{GROUP}_{2i} \times X_{1i}) + \beta_7 (\text{MARRIED}_{i} \times \text{AGE}\_\text{GROUP}_{1i}) + \epsilon_i \end{aligned} \] where \(\text{MARRIED}_i = 1\) if the \(i\)-th person is married, and \(0\) otherwise; \(\text{AGE}\_\text{GROUP}_{ji}\) are indicators for different age groups: \(j = 1\) for ages \(20-30\) and \(j = 2\) for ages \(31-65\); the base group, \(\text{AGE}\_\text{GROUP}_{OTHER}\), consists of the people in the remaining age brackets, not covered by \(j = 1,2\).
We begin by specifying the parameter vector and sample size:
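For example (a minimal sketch: \(N = 1000\) matches the output below, but the particular coefficient values, distributions and seed used in these sketches are illustrative assumptions, so the printed output below will not be reproduced exactly; the sketches are given in R, and the Python chunks below assume analogous NumPy arrays):
set.seed(123)
# true coefficient vector (beta_0, ..., beta_7) - illustrative values only
beta_vec <- c(4, 0.16, -3, 0.05, 0.025, -0.15, 0.05, -0.03)
# sample size
N <- 1000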
We then generate the variables in the following way:
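One possible way to do this (the distributions and their parameters are assumptions; \(X_2\) is drawn strictly positive so that \(\log(X_2)\) is defined):
x1      <- rnorm(N, mean = 10, sd = 2)       # continuous explanatory variable
x2      <- runif(N, min = 2, max = 5)        # strictly positive explanatory variable
married <- rbinom(N, size = 1, prob = 0.5)   # marital status indicator
e       <- rnorm(N, mean = 0, sd = 0.05)     # random error term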
The different age groups can be generated randomly as well. We can further create separate indicator variables for two of the three groups. Doing it this way automatically classifies the remaining group of other ages as the base group and we will avoid the dummy variable trap:
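For instance (the group labels are the ones appearing in the dataset below; the group probabilities are assumptions):
age_group <- sample(c("aged_20_30", "aged_31_65", "other"),
                    size = N, replace = TRUE, prob = c(0.3, 0.4, 0.3))
age_gr1   <- as.numeric(age_group == "aged_20_30")   # indicator: aged 20-30
age_gr2   <- as.numeric(age_group == "aged_31_65")   # indicator: aged 31-65
# the remaining ages ("other") are left as the base group - no separate dummy is created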
Finally, we can create our dependent variable and combine all the data into a single dataset:
# create the design (model) matrix:
x_mat <- cbind(1, x1, log(x2), married,
age_gr1, age_gr2, age_gr2 * x1 , married * age_gr1)
colnames(x_mat) <- c("intercept", "x1", "log_x2", "married",
"age_gr1", "age_gr2", "age_gr2_x1", "married_age_gr1")
# generate the dependent variable:
y <- exp(x_mat %*% beta_vec + e)
# combine everything into a single data frame:
data_mat <- data.frame(y, x1, x2, married, age_gr1, age_gr2, age_group)
# preview the first few observations:
head(data_mat)
## y x1 x2 married age_gr1 age_gr2 age_group
## 1 29.313052 10.206100 2.450563 1 0 1 aged_31_65
## 2 14.656237 10.164646 2.976220 0 0 1 aged_31_65
## 3 28.228720 10.816951 2.255319 1 0 0 other
## 4 7.741518 11.713026 4.286608 1 0 1 aged_31_65
## 5 9.333522 6.220738 2.514393 1 1 0 aged_20_30
## 6 2.165643 5.813722 4.237797 1 0 1 aged_31_65
import numpy as np
import statsmodels.api as sm
import pandas as pd
# create the design (model) matrix:
x_mat = sm.add_constant(np.column_stack([x1, np.log(x2), married,
age_gr1, age_gr2, age_gr2 * x1, married * age_gr1]))
x_col_names = ["intercept", "x1", "log_x2", "married",
"age_gr1", "age_gr2", "age_gr2_x1", "married_age_gr1"]
# generate the dependent variable:
y = np.exp(np.dot(x_mat, beta_vec) + e)
# combine everything into a single data frame:
data_mat = pd.DataFrame(np.column_stack([y, x1, x2, married, age_gr1, age_gr2, age_group]),
columns = ["y", "x1", "x2", "married", "age_gr1", "age_gr2", "age_group"])
# convert data to float:
data_mat[data_mat.columns[:-1]] = data_mat[data_mat.columns[:-1]].astype(float)
# preview the first few observations:
print(data_mat.head())
## y x1 x2 married age_gr1 age_gr2 age_group
## 0 5.710176 8.502345 3.655820 1.0 0.0 1.0 aged_31_65
## 1 10.572763 11.135189 3.182728 0.0 0.0 0.0 other
## 2 17.613645 11.436301 2.732165 1.0 1.0 0.0 aged_20_30
## 3 4.586854 8.001239 3.411765 0.0 0.0 0.0 other
## 4 10.625736 10.949797 3.085106 0.0 1.0 0.0 aged_20_30
Next, we estimate the models and print their output:
# estimate the model via OLS:
mdl <- lm(log(y) ~ x1 + log(x2) + married + age_gr1 + age_gr2 + age_gr2 * x1 + married * age_gr1,
data = data_mat)
print(summary(mdl))
##
## Call:
## lm(formula = log(y) ~ x1 + log(x2) + married + age_gr1 + age_gr2 +
## age_gr2 * x1 + married * age_gr1, data = data_mat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.129248 -0.034729 0.001534 0.034763 0.163707
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0101765 0.0122819 326.511 < 2e-16 ***
## x1 0.1590640 0.0009389 169.416 < 2e-16 ***
## log(x2) -3.0023689 0.0059708 -502.843 < 2e-16 ***
## married 0.0466433 0.0038756 12.035 < 2e-16 ***
## age_gr1 0.0247338 0.0051786 4.776 2.06e-06 ***
## age_gr2 -0.1468251 0.0170318 -8.621 < 2e-16 ***
## x1:age_gr2 0.0503657 0.0016449 30.620 < 2e-16 ***
## married:age_gr1 -0.0267870 0.0066062 -4.055 5.41e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04951 on 992 degrees of freedom
## Multiple R-squared: 0.997, Adjusted R-squared: 0.9969
## F-statistic: 4.647e+04 on 7 and 992 DF, p-value: < 2.2e-16
import statsmodels.formula.api as smf
# estimate the model via OLS:
mdl = smf.ols(formula='np.log(y) ~ x1 + np.log(x2) + married + age_gr1 + age_gr2 + age_gr2 * x1 + married * age_gr1',
data = data_mat).fit()
print(mdl.summary2())
## Results: Ordinary least squares
## ===================================================================
## Model: OLS Adj. R-squared: 0.997
## Dependent Variable: np.log(y) AIC: -3144.4117
## Date: 2020-10-13 21:40 BIC: -3105.1497
## No. Observations: 1000 Log-Likelihood: 1580.2
## Df Model: 7 F-statistic: 4.224e+04
## Df Residuals: 992 Prob (F-statistic): 0.00
## R-squared: 0.997 Scale: 0.0025030
## -------------------------------------------------------------------
## Coef. Std.Err. t P>|t| [0.025 0.975]
## -------------------------------------------------------------------
## Intercept 3.9994 0.0131 305.8630 0.0000 3.9738 4.0251
## x1 0.1587 0.0010 155.5255 0.0000 0.1567 0.1607
## np.log(x2) -2.9943 0.0061 -490.5459 0.0000 -3.0063 -2.9823
## married 0.0568 0.0039 14.5523 0.0000 0.0492 0.0645
## age_gr1 0.0246 0.0051 4.8070 0.0000 0.0146 0.0347
## age_gr2 -0.1595 0.0179 -8.9329 0.0000 -0.1946 -0.1245
## age_gr2:x1 0.0514 0.0017 29.5491 0.0000 0.0480 0.0548
## married:age_gr1 -0.0412 0.0067 -6.1671 0.0000 -0.0543 -0.0281
## -------------------------------------------------------------------
## Omnibus: 0.209 Durbin-Watson: 1.934
## Prob(Omnibus): 0.901 Jarque-Bera (JB): 0.269
## Skew: -0.029 Prob(JB): 0.874
## Kurtosis: 2.944 Condition No.: 137
## ===================================================================
We also manually estimate the AIC and BIC (we again remind that the formulas used in R are different from the ones used in Python):
aic_lm <- nrow(mdl$model) + nrow(mdl$model) * log(2 * pi) +
  nrow(mdl$model) * log(sum(mdl$residuals^2) / nrow(mdl$model)) +
  2 * (length(mdl$coefficients) + 1)
print(aic_lm)
## [1] -3163.401
bic_lm <- nrow(mdl$model) + nrow(mdl$model) * log(2 * pi) +
  nrow(mdl$model) * log(sum(mdl$residuals^2) / nrow(mdl$model)) +
  (length(mdl$coefficients) + 1) * log(nrow(mdl$model))
print(bic_lm)
## [1] -3119.231
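We can cross-check these values against R's built-in AIC() and BIC() functions, which for a Gaussian lm() model count \(k + 2\) parameters (the \(k + 1\) coefficients plus \(\sigma^2\)) and should therefore return the same values as the manual calculation above:
# cross-check with the built-in functions:
print(AIC(mdl))
print(BIC(mdl))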
aic_lm = (mdl.nobs + mdl.nobs * np.log(2 * np.pi)
          + mdl.nobs * np.log(np.sum(np.array(mdl.resid)**2) / mdl.nobs)
          + 2 * len(mdl.params))
print(aic_lm)
## -3144.4116967048176
bic_lm = (mdl.nobs + mdl.nobs * np.log(2 * np.pi)
          + mdl.nobs * np.log(np.sum(np.array(mdl.resid)**2) / mdl.nobs)
          + len(mdl.params) * np.log(mdl.nobs))
print(bic_lm)
## -3105.1496544729607
4.3.4 Out-of-Sample (Hold-Out-Sample) GoF Testing
If our model is designed for prediction/forecasting, we want to evaluate its ability to forecast values of the dependent variable which have not been observed yet. One way to do this is to hold back some of the observations from estimation and evaluate how well the model can predict the omitted observations (this is also known as splitting the data into a training set and a testing set, e.g. an \(80\%\) training and a \(20\%\) testing set).
Let's say that out of the \(N\) observations, we use \(N-m\) to estimate the parameters and then calculate the predictions: \[ \widehat{Y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 X_{1,i} + ... + \widehat{\beta}_k X_{k,i},\quad i = N-m + 1,...,N \] Then, we can measure the model's out-of-sample forecasting accuracy by calculating the Root Mean Squared Error (RMSE): \[ RMSE = \sqrt{\dfrac{1}{m} \sum_{i = N-m+1}^N \left( Y_i - \widehat{Y}_i \right)^2} \] We can then compare different models based on their \(RMSE\) (as long as the models being compared have the same dependent variable, as well as the same number of held-back observations, \(m\)).
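As a sketch, using the simulated dataset from above (the choice of a \(20\%\) hold-out sample, taken as the last \(m\) observations, is an arbitrary illustration):
# hold back the last 20% of the observations as a test set
m     <- round(0.2 * nrow(data_mat))
train <- data_mat[1:(nrow(data_mat) - m), ]
test  <- data_mat[(nrow(data_mat) - m + 1):nrow(data_mat), ]
# re-estimate the model on the training set only
mdl_train <- lm(log(y) ~ x1 + log(x2) + married + age_gr1 + age_gr2 +
                  age_gr2 * x1 + married * age_gr1, data = train)
# predict the held-back observations and compute the RMSE
# (on the log scale, since log(y) is the dependent variable of this model)
y_pred <- predict(mdl_train, newdata = test)
rmse   <- sqrt(mean((log(test$y) - y_pred)^2))
print(rmse)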
References
Ramsey, F., and D. Schafer. 2012. The Statistical Sleuth: A Course in Methods of Data Analysis. Cengage Learning. https://books.google.lt/books?id=eSlLjA9TwkUC.