## 4.3 Goodness-Of-Fit

Let our multiple regression model be defined as: $Y_i = \beta_0 + \beta_1 X_{1i} + ... + \beta_k X_{ki} + \epsilon_i,\quad i = 1,...,N$ Furthermore, assume that conditions (MR.1) - (MR.6) hold true.

### 4.3.1 $$F$$-test For Goodness of Fit: Joint Hypothesis Test for the Overall Model Significance

This is the joint hypothesis test for multiple coefficient significance, applied for all slope coefficients (excluding the intercept): $\begin{cases} H_0&: \beta_1 = \beta_2 = ... = \beta_k = 0 \\ H_1&: \beta_{j} \neq 0, \quad \text{for some } j \end{cases}$ the associated $$F$$-statistic: $F = \dfrac{\text{ESS}/k}{\text{RSS}/(N-(k+1))} = \dfrac{R^2/k}{(1 - R^2)/(N-(k+1))}$ If $$F > F_{(1 - \alpha, k, N-(k+1))}$$ for some significance level $$\alpha$$ (or if the associated $$p$$-value is less than $$\alpha$$), we reject the null hypothesis.

The $$F$$-test of overall significance compares a model with no predictors to the specified model. A regression model that contains no predictors is also known as an intercept-only model, where all of the fitted values equal the mean of the response variable. Thus, if the $$p$$-value of the overall $$F$$-test is less than the significance level $$\alpha$$, the specified model predicts the response variable better than the mean of the response.

This is somewhat similar to the information provided by $$R^2$$. However, $$R^2$$ does not provide a formal hypothesis test (which the overall $$F$$-test does). Consequently, if the $$p$$-value of the overall $$F$$-test is less than the significance level $$\alpha$$, we can conclude that the $$R^2$$ value is significantly different from zero.
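As a quick numerical sketch of the decision rule, the $$F$$-statistic can be computed from $$R^2$$ and compared against the critical value with `scipy.stats.f` (the values of $$R^2$$, $$k$$ and $$N$$ below are illustrative assumptions, not taken from any particular model):

```python
from scipy.stats import f

# Illustrative (assumed) values: R^2, number of slope coefficients k, sample size N
R2, k, N = 0.997, 7, 1000
alpha = 0.05

# F-statistic expressed through R^2, as in the formula above
F_stat = (R2 / k) / ((1 - R2) / (N - (k + 1)))

# Critical value and p-value from the F(k, N - (k + 1)) distribution
F_crit = f.ppf(1 - alpha, k, N - (k + 1))
p_value = f.sf(F_stat, k, N - (k + 1))

print(F_stat > F_crit, p_value < alpha)  # both True => reject the null hypothesis
```

Here both checks are equivalent: $$F > F_{crit}$$ exactly when the $$p$$-value falls below $$\alpha$$.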

### 4.3.2 $$R$$-squared and Adjusted $$R$$-squared

In the multiple regression model $$R^2$$ is a measure of the proportion of variation in the dependent variable, that is explained by all the explanatory variables included in the model: $R^2 = \dfrac{\text{ESS}}{\text{TSS}} = \dfrac{\sum_{i = 1}^N \left( \widehat{Y}_i - \overline{Y} \right)^2}{\sum_{i = 1}^N \left(Y_i - \overline{Y} \right)^2} = 1 - \dfrac{\text{RSS}}{TSS} = 1 - \dfrac{\sum_{i = 1}^N \widehat{\epsilon}_i^2}{\sum_{i = 1}^N \left(Y_i - \overline{Y} \right)^2}$ where $$\widehat{\epsilon}_i = Y_i - (\widehat{\beta}_0 + \widehat{\beta}_1 X_{1,i} + ... + \widehat{\beta}_k X_{k,i})$$.

For the multiple regression model, $$R^2$$ automatically increases when extra explanatory variables are added to the model, even if the added variables have no justification: if the model contains $$N-1$$ variables, then $$R^2 = 1$$. As such, an adjusted $$R^2$$, $$R^2_{adj}$$, may be used. This modification of $$R^2$$ adjusts for the number of explanatory variables (excluding the constant) in the model, relative to the number of data points. Furthermore, $$R^2_{adj}$$ can be negative and $$R^2_{adj} \leq R^2$$. The adjusted $$R^2$$ is defined as: $R^2_{adj} = 1 - (1 - R^2)\dfrac{N-1}{N - k - 1} = 1 - \dfrac{\text{RSS}/(N-k-1)}{\text{TSS} /(N-1)}$ The adjusted $$R^2$$, $$R^2_{adj}$$, can be interpreted as a less biased estimator of the population $$R^2$$ (though it is not strictly unbiased).

However, because of this correction, $$R^2_{adj}$$ loses its original interpretation: it is no longer the proportion of explained variation.
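A minimal sketch of the two measures (the function name and the toy data are illustrative assumptions) shows how $$R^2_{adj}$$ can turn negative when a model explains nothing beyond the mean of the response:

```python
import numpy as np

def r2_and_r2_adj(y, y_fit, k):
    """R^2 and adjusted R^2 for a model with k explanatory variables."""
    N = len(y)
    rss = np.sum((y - y_fit) ** 2)               # residual sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)          # total sum of squares
    r2 = 1 - rss / tss
    r2_adj = 1 - (1 - r2) * (N - 1) / (N - k - 1)
    return r2, r2_adj

# A model whose fitted values equal the sample mean has R^2 = 0,
# but a negative adjusted R^2 once k > 0:
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_fit = np.full(5, y.mean())
r2, r2_adj = r2_and_r2_adj(y, y_fit, k = 2)
```

With $$N = 5$$ and $$k = 2$$ this gives $$R^2 = 0$$ and $$R^2_{adj} = 1 - 4/2 = -1$$, illustrating both $$R^2_{adj} \leq R^2$$ and the possibility of negative values.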

### 4.3.3 AIC and BIC

Selecting variables that maximize $$R^2_{adj}$$ is equivalent to selecting variables that minimize $$\text{RSS}$$, subject to a penalty based on the number of variables. We will introduce two model selection criteria, which work in a similar way, but have different penalties for the inclusion of additional variables:

• Akaike Information Criterion (AIC): $AIC = N + N \log(2\pi) + N \log \left( \dfrac{RSS}{N} \right) + 2(k + 1)$
• Bayesian Information Criterion (BIC), also called Schwarz Criterion (SC): $BIC = N + N \log(2\pi) + N \log \left( \dfrac{RSS}{N} \right) + (k + 1) \log(N)$

These formulas follow the correction by Professor Bret Larget ("Statistics 333: Applied Regression Analysis", Spring 2003) of the ones in Ramsey and Schafer (2012). They are the formulas used in R and Python, with slight differences.

On one hand, the variance parameter $$\sigma^2$$ (which is also estimated) should be counted as well: overall, there are $$k + 2$$ parameters, $$\beta_0, \beta_1,..., \beta_k, \sigma^2$$. As such, $$(k + 1)$$ is sometimes replaced with $$(k + 2)$$.

On the other hand, some textbooks ignore the first two terms, $$N + N \log(2\pi)$$, and use $$k+1$$, instead of $$k+2$$.

Note: in R $$(k + 2)$$ is used, while in Python $$(k + 1)$$ is used.

In both cases, the first term decreases with each extra variable added, but the second term increases, penalizing the inclusion of additional variables. BIC penalizes extra variables more strictly, compared to the AIC.

The model with the smallest AIC (or BIC) is preferred.
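The two formulas can be sketched as follows (the helper `aic_bic` and its inputs are illustrative assumptions; the penalty uses $$p = k + 1$$, mirroring the Python convention noted above, whereas R would use $$k + 2$$):

```python
import numpy as np

def aic_bic(rss, N, k):
    """AIC and BIC from the residual sum of squares, using p = k + 1 parameters."""
    # Shared part of both criteria: N + N*log(2*pi) + N*log(RSS/N)
    base = N + N * np.log(2 * np.pi) + N * np.log(rss / N)
    p = k + 1
    return base + 2 * p, base + p * np.log(N)

# For N > e^2 (i.e. log(N) > 2) the BIC penalty exceeds that of the AIC,
# so BIC > AIC for the same fitted model:
aic, bic = aic_bic(rss = 2.5, N = 1000, k = 7)
```

The difference $$BIC - AIC = (k+1)(\log(N) - 2)$$ grows with the sample size, which is why BIC tends to select smaller models than AIC on large samples.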

Example 4.20 We will use the following model to illustrate the presented methodology:

\begin{aligned} \log(Y_i) = \beta_0 &+ \beta_1 X_{1i} + \beta_2 \log(X_{2i}) + \beta_3 \text{MARRIED}_{i} + \beta_4 \text{AGE}\_\text{GROUP}_{1i} + \beta_5 \text{AGE}\_\text{GROUP}_{2i} \\ &+ \beta_6 (\text{AGE}\_\text{GROUP}_{2i} \times X_{1i}) + \beta_7 (\text{MARRIED}_{i} \times \text{AGE}\_\text{GROUP}_{1i}) + \epsilon_i \end{aligned} where $$\text{MARRIED}_i = 1$$ if the $$i$$-th person is married, and $$0$$ otherwise; $$\text{AGE}\_\text{GROUP}_{ji}$$ are indicators for different age groups: $$j = 1$$ for ages $$20-30$$, and $$j = 2$$ for ages $$31-65$$; the base group, $$\text{AGE}\_\text{GROUP}_{OTHER}$$, consists of the people in the remaining age brackets, not covered by $$j = 1,2$$.

We begin by specifying the parameter vector and sample size:

# Set the seed for reproducibility:
set.seed(132)
# Specify the sample size:
N <- 1000
# Specify the parameter vector:
beta_vec <- c(4, 0.16, -3, 0.05, 0.02, -0.15, 0.05, -0.03)
import numpy as np
# Set the seed for reproducibility:
np.random.seed(123)
# Specify the sample size:
N = 1000
# Specify the parameter vector:
beta_vec = np.array([4, 0.16, -3, 0.05, 0.02, -0.15, 0.05, -0.03])

We then generate the variables in the following way:

e <- rnorm(mean = 0, sd = 0.05, n = N)
x1 <- rnorm(mean = 10, sd = 2, n = N)
x2 <- sample(seq(from = 2, to = 5, length.out = floor(N * 0.8)),
             size = N, replace = TRUE)
# Marriage indicator variable:
married <- sample(c(0, 1), size = N, replace = TRUE)
e = np.random.normal(loc = 0, scale = 0.05, size = N)
x1 = np.random.normal(loc = 10, scale = 2, size = N)
x2 = np.random.choice(np.linspace(start = 2, stop = 5, num = int(np.floor(N * 0.8))),
                      size = N, replace = True)
# Marriage indicator variable:
married = np.random.choice([0, 1], size = N, replace = True)

The different age groups can be generated randomly as well. We can further create separate indicator variables for two of the three groups. Doing it this way automatically classifies the remaining group of other ages as the base group and we will avoid the dummy variable trap:

age_group <- sample(c("other", "aged_20_30", "aged_31_65"),
                    size = N, replace = TRUE)
age_gr1 <- rep(0, N)
age_gr1[age_group %in% "aged_20_30"] <- 1
age_gr2 <- rep(0, N)
age_gr2[age_group %in% "aged_31_65"] <- 1
age_group = np.random.choice(["other", "aged_20_30", "aged_31_65"],
                             size = N, replace = True)
age_gr1 = np.array([0] * N)
age_gr1[age_group == "aged_20_30"] = 1
age_gr2 = np.array([0] * N)
age_gr2[age_group == "aged_31_65"] = 1

Finally, we can create our dependent variable and combine all the data into a single dataset:

# Create the design matrix:
# the first column is the intercept,
# the base age group ("other") is excluded:
x_mat <- cbind(1, x1, log(x2), married,
               age_gr1, age_gr2, age_gr2 * x1, married * age_gr1)
colnames(x_mat) <- c("intercept", "x1", "log_x2", "married",
                     "age_gr1", "age_gr2", "age_gr2_x1", "married_age_gr1")
# Generate the dependent variable:
y <- exp(x_mat %*% beta_vec + e)
# Combine the data into a single dataset:
data_mat <- data.frame(y, x1, x2, married, age_gr1, age_gr2, age_group)
# Print the first few rows of the dataset:
head(data_mat)
##           y        x1       x2 married age_gr1 age_gr2  age_group
## 1 29.313052 10.206100 2.450563       1       0       1 aged_31_65
## 2 14.656237 10.164646 2.976220       0       0       1 aged_31_65
## 3 28.228720 10.816951 2.255319       1       0       0      other
## 4  7.741518 11.713026 4.286608       1       0       1 aged_31_65
## 5  9.333522  6.220738 2.514393       1       1       0 aged_20_30
## 6  2.165643  5.813722 4.237797       1       0       1 aged_31_65
import statsmodels.api as sm
import pandas as pd
# Create the design matrix:
x_mat = np.column_stack([np.ones(N), x1, np.log(x2), married,
                         age_gr1, age_gr2, age_gr2 * x1, married * age_gr1])
x_col_names = ["intercept", "x1", "log_x2", "married",
               "age_gr1", "age_gr2", "age_gr2_x1", "married_age_gr1"]
# Generate the dependent variable:
y = np.exp(np.dot(x_mat, beta_vec) + e)
# Combine the data into a single dataset:
data_mat = pd.DataFrame(np.column_stack([y, x1, x2, married, age_gr1, age_gr2, age_group]),
                        columns = ["y", "x1", "x2", "married", "age_gr1", "age_gr2", "age_group"])
# Convert the numeric columns to float:
data_mat[data_mat.columns[:-1]] = data_mat[data_mat.columns[:-1]].astype(float)
# Print the first few rows of the dataset:
print(data_mat.head())
##            y         x1        x2  married  age_gr1  age_gr2   age_group
## 0   5.710176   8.502345  3.655820      1.0      0.0      1.0  aged_31_65
## 1  10.572763  11.135189  3.182728      0.0      0.0      0.0       other
## 2  17.613645  11.436301  2.732165      1.0      1.0      0.0  aged_20_30
## 3   4.586854   8.001239  3.411765      0.0      0.0      0.0       other
## 4  10.625736  10.949797  3.085106      0.0      1.0      0.0  aged_20_30

Next, we estimate the models and print their output:

# Estimate the model via OLS
# (interactions are specified with '*' in the formula):
mdl <- lm(log(y) ~ x1 + log(x2) + married + age_gr1 + age_gr2 + age_gr2 * x1 + married * age_gr1,
          data = data_mat)
print(summary(mdl))
##
## Call:
## lm(formula = log(y) ~ x1 + log(x2) + married + age_gr1 + age_gr2 +
##     age_gr2 * x1 + married * age_gr1, data = data_mat)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.129248 -0.034729  0.001534  0.034763  0.163707
##
## Coefficients:
##                   Estimate Std. Error  t value Pr(>|t|)
## (Intercept)      4.0101765  0.0122819  326.511  < 2e-16 ***
## x1               0.1590640  0.0009389  169.416  < 2e-16 ***
## log(x2)         -3.0023689  0.0059708 -502.843  < 2e-16 ***
## married          0.0466433  0.0038756   12.035  < 2e-16 ***
## age_gr1          0.0247338  0.0051786    4.776 2.06e-06 ***
## age_gr2         -0.1468251  0.0170318   -8.621  < 2e-16 ***
## x1:age_gr2       0.0503657  0.0016449   30.620  < 2e-16 ***
## married:age_gr1 -0.0267870  0.0066062   -4.055 5.41e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04951 on 992 degrees of freedom
## Multiple R-squared:  0.997,  Adjusted R-squared:  0.9969
## F-statistic: 4.647e+04 on 7 and 992 DF,  p-value: < 2.2e-16
import statsmodels.formula.api as smf
# Estimate the model via OLS:
mdl = smf.ols(formula = 'np.log(y) ~ x1 + np.log(x2) + married + age_gr1 + age_gr2 + age_gr2 * x1 + married * age_gr1',
              data = data_mat).fit()
print(mdl.summary2())
##                   Results: Ordinary least squares
## ===================================================================
## Model:              OLS              Adj. R-squared:     0.997
## Dependent Variable: np.log(y)        AIC:                -3144.4117
## Date:               2020-10-13 21:40 BIC:                -3105.1497
## No. Observations:   1000             Log-Likelihood:     1580.2
## Df Model:           7                F-statistic:        4.224e+04
## Df Residuals:       992              Prob (F-statistic): 0.00
## R-squared:          0.997            Scale:              0.0025030
## -------------------------------------------------------------------
##                    Coef.  Std.Err.     t     P>|t|   [0.025  0.975]
## -------------------------------------------------------------------
## Intercept          3.9994   0.0131  305.8630 0.0000  3.9738  4.0251
## x1                 0.1587   0.0010  155.5255 0.0000  0.1567  0.1607
## np.log(x2)        -2.9943   0.0061 -490.5459 0.0000 -3.0063 -2.9823
## married            0.0568   0.0039   14.5523 0.0000  0.0492  0.0645
## age_gr1            0.0246   0.0051    4.8070 0.0000  0.0146  0.0347
## age_gr2           -0.1595   0.0179   -8.9329 0.0000 -0.1946 -0.1245
## age_gr2:x1         0.0514   0.0017   29.5491 0.0000  0.0480  0.0548
## married:age_gr1   -0.0412   0.0067   -6.1671 0.0000 -0.0543 -0.0281
## -------------------------------------------------------------------
## Omnibus:               0.209         Durbin-Watson:           1.934
## Prob(Omnibus):         0.901         Jarque-Bera (JB):        0.269
## Skew:                  -0.029        Prob(JB):                0.874
## Kurtosis:              2.944         Condition No.:           137
## ===================================================================
print(AIC(mdl))
## [1] -3163.401
print(BIC(mdl))
## [1] -3119.231
print(mdl.aic)
## -3144.4116967048176
print(mdl.bic)
## -3105.1496544729607

We also compute the AIC and BIC manually (recall that the parameter counts in the penalty term differ: R uses $$k + 2$$, while Python uses $$k + 1$$):

aic_lm <- nrow(mdl$model) + nrow(mdl$model) * log(2 * pi) +
  nrow(mdl$model) * log(sum(mdl$residuals^2) / nrow(mdl$model)) +
  2 * (length(mdl$coefficients) + 1)
print(aic_lm)
## [1] -3163.401
bic_lm <- nrow(mdl$model) + nrow(mdl$model) * log(2 * pi) +
  nrow(mdl$model) * log(sum(mdl$residuals^2) / nrow(mdl$model)) +
  (length(mdl$coefficients) + 1) * log(nrow(mdl$model))
print(bic_lm)
## [1] -3119.231
aic_lm = mdl.nobs + mdl.nobs * np.log(2 * np.pi) + \
    mdl.nobs * np.log(np.sum(np.array(mdl.resid)**2) / mdl.nobs) + \
    2 * len(mdl.params)
print(aic_lm)
## -3144.4116967048176
bic_lm = mdl.nobs + mdl.nobs * np.log(2 * np.pi) + \
    mdl.nobs * np.log(np.sum(np.array(mdl.resid)**2) / mdl.nobs) + \
    len(mdl.params) * np.log(mdl.nobs)
print(bic_lm)
## -3105.1496544729607

### 4.3.4 Out-of-Sample (Hold-Out-Sample) GoF Testing

If our model is designed for prediction/forecasting, we want to evaluate its ability to forecast values of the dependent variable which have not been observed yet. One way to do this is to hold back some of the observations from estimation and evaluate how well the model predicts the omitted observations (this is also known as splitting the data into, for example, an $$80\%$$ training set and a $$20\%$$ testing set).

Let's say that out of the $$N$$ observations, we use $$N-m$$ to estimate the parameters and then calculate the predictions: $\widehat{Y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 X_{1,i} + ... + \widehat{\beta}_k X_{k,i},\quad i = N-m + 1,...,N$ Then, we can measure the model's out-of-sample forecasting accuracy by calculating the Root Mean Squared Error (RMSE): $RMSE = \sqrt{\dfrac{1}{m} \sum_{i = N-m+1}^N \left( Y_i - \widehat{Y}_i \right)^2}$ We can then compare different models based on their $$RMSE$$ (as long as the models being compared have the same dependent variable, as well as the same number of held-back observations, $$m$$).
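A minimal sketch of this procedure (the simulated data-generating process, seed, and split sizes are illustrative assumptions, not part of the example above):

```python
import numpy as np

# Simulate a simple dataset: intercept plus one regressor, noise sd = 0.5
rng = np.random.default_rng(42)
N, m = 1000, 200
X = np.column_stack([np.ones(N), rng.normal(size = N)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale = 0.5, size = N)

# Estimate OLS on the first N - m observations only:
beta_hat, *_ = np.linalg.lstsq(X[:N - m], y[:N - m], rcond = None)

# Predict the m held-back observations and compute the RMSE:
y_pred = X[N - m:] @ beta_hat
rmse = np.sqrt(np.mean((y[N - m:] - y_pred) ** 2))
```

Since the simulated errors have a standard deviation of $$0.5$$, a well-specified model should produce an out-of-sample RMSE close to that value; a competing model with the same dependent variable and the same hold-out set would be preferred if its RMSE were smaller.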

### References

Ramsey, F., and D. Schafer. 2012. The Statistical Sleuth: A Course in Methods of Data Analysis. Cengage Learning. https://books.google.lt/books?id=eSlLjA9TwkUC.