Datasets can be found here. The MROZ dataset contains cross-sectional labor force participation data:
inlf - =1 if in labor force, 1975
hours - hours worked, 1975
kidslt6 - # kids < 6 years
kidsge6 - # kids 6-18
age - woman's age in yrs
educ - years of schooling
wage - estimated wage from earns., hours
repwage - reported wage at interview in 1976
hushrs - hours worked by husband, 1975
husage - husband's age
huseduc - husband's years of schooling
huswage - husband's hourly wage, 1975
faminc - family income, 1975
mtr - fed. marginal tax rate facing woman
motheduc - mother's years of schooling
fatheduc - father's years of schooling
unem - unem. rate in county of resid.
city - =1 if live in SMSA
exper - actual labor mkt exper
nwifeinc - (faminc - wage*hours)/1000
lwage - log(wage)
expersq - exper^2
Assume that we are interested in modelling the labor force participation, inlf, of married women.
dt5 <- foreign::read.dta("http://fmwww.bc.edu/ec-p/data/wooldridge/mroz.dta")
dt5 <- data.frame(dt5)
head(dt5)
We will begin by splitting the dataset into training and testing subsets. These are extremely important for binary outcome modelling - we need to make sure our model can correctly classify new data.
table(dt5$inlf)
We see that we have approximately the same amount of data for both non-participating and participating women. So, we can partition our data as we normally would into $80\%$ training and $20\%$ test sets.
print(nrow(dt5))
n <- floor(nrow(dt5) * 0.8)
print(n)
dt5_train <- dt5[1:n, ]
dt5_test <- dt5[-c(1:n), ]
print(nrow(dt5_train))
print(nrow(dt5_test))
The dt5_train subset is our training set, on which we will estimate our model, and dt5_test is our test subset, on which we will evaluate the adequacy of our model.
Note that a better way to do this would be to shuffle the rows first, if we are unsure of their randomness:
any(duplicated(sample(1:nrow(dt5)))) # sanity check: sample() returns a permutation of the row indices - no duplicates, no rows lost
#dt5 <- dt5[sample(1:nrow(dt5)), ]
#dt5_train <- dt5[1:n, ]
#dt5_test <- dt5[-c(1:n), ]
Though we will leave this for another time. For now, we will naively assume that the data is already randomly ordered.
Modelling binary data involves creating a model for the probability of success (in this case inlf = 1 - that a woman is participating in the labor force).
We have three primary choices for binary data modelling: the multiple (linear probability) regression, the logistic (logit) regression and the probit regression. The multiple regression model does not take into account the fact that the probability $p \in \left[0, 1\right]$. Therefore, we will opt to use the logistic and probit regressions.
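As a quick illustration of this point (a sketch, using the same regressors as in the models estimated below), we can fit a linear probability model and inspect the range of its fitted values:
# A linear probability model: OLS on the binary outcome inlf
lpm <- lm(inlf ~ educ + exper + age + kidslt6 + kidsge6, data = dt5_train)
# The fitted "probabilities" are not guaranteed to stay inside [0, 1]:
range(fitted(lpm))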
options(repr.plot.width = 10)
options(repr.plot.height = 8)
print(colnames(dt5_train))
par(mfrow = c(3, 3))
for(i in 2:19){
  tmp_formula <- formula(paste0(colnames(dt5_train)[i], " ~ inlf"))
  boxplot(tmp_formula, dt5_train,
          col = c("cornflowerblue", "orange"),
          xlab = "inlf", ylab = colnames(dt5_train)[i], outline = FALSE,
          main = paste0(colnames(dt5_train)[i], " by participation"))
}
Note that city is an indicator variable, for which the values 0 and 1 do not have any numerical interpretation, but rather only indicate whether the person lives in a city or not.
In order to meaningfully examine the differences in returning to the labor force between women living in a city and those living outside one, we can use a contingency table:
tbl <- xtabs(~ city + inlf, data = dt5)
print(tbl)
To make it easier to understand where the data falls in the upcoming barchart, we will relabel the categories of the variables:
dt5_tmp <- dt5
dt5_tmp$inlf[dt5_tmp$inlf == 0] <- "Not in labor force"
dt5_tmp$inlf[dt5_tmp$inlf == 1] <- "In labor force"
dt5_tmp$city[dt5_tmp$city == 1] <- "Yes"
dt5_tmp$city[dt5_tmp$city == 0] <- "No"
tbl <- xtabs(~ city + inlf, data = dt5_tmp)
print(tbl)
We can further visualize this with a barplot. The following transposed contingency table will be used for the barplot:
t(tbl)
Each column will be on the horizontal axis - in this case, the axis will have the No and Yes labels. Each row will be drawn as a differently colored bar - the bars will either be grouped side by side according to the columns, or stacked on top of each other according to the columns.
par(mfrow = c(1, 2))
bb <- barplot(t(tbl), beside = TRUE, col = c("grey60", "grey90"), main = "Barchart", xlab = "Living in a City",
legend.text = TRUE, args.legend = list(x = "topleft", bty = "n", title = "In Labor Force", cex = 0.6))
# Add the numeric values to the bar chart
text(bb, c(t(tbl))*0.95, labels = c(t(tbl)), cex = 0.9, col = "blue")
#
bb <- barplot(t(tbl), beside = FALSE, col = c("grey60", "grey90"), main = "Stacked barchart", xlab = "Living in a City",
legend.text = TRUE, args.legend = list(x = "topleft", bty = "y", title = "In Labor Force", cex = 0.6))
text(bb, t(tbl)[1,]*0.9, labels = t(tbl)[1,], cex = 0.9, col = "blue")
text(bb, colSums(t(tbl))*0.9, labels = t(tbl)[2,], cex = 0.9, col = "red")
#legend("topleft", legend = c("inlf = 0", "inlf = 1"), fill = c("black", "gray"), bty = "y")
Note: compare the values in t(tbl) with the barcharts - make sure you understand which variable is on the horizontal axis, and which variable is used for the different bar colors.
Overall, we see that there are more women from cities in this sample. However, the tendency is the same whether a woman is from a city or not: more women return to the labor force than stay at home.
It may be more interesting to examine the proportion of city and non-city women returning to the workforce:
round(prop.table(tbl, 1), 5)
Note that each row sums up to 1 (i.e. 100% of city women and 100% of non-city women are split between returning to the labor force and staying at home):
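We can verify this directly:
# Each row of the row-wise proportion table sums to exactly 1:
rowSums(prop.table(tbl, 1))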
par(mar=c(5.1, 4.1, 4.1, 8.1), xpd = TRUE)
barplot(t(prop.table(tbl, 1)), beside = FALSE, main = "Stacked barchart", xlab = "Living in a City",
legend.text = TRUE, args.legend = list(x = "topright", bty = "y", inset=c(-0.21, 0), title = "In Labor Force", cex = 0.55))
#legend("topright", inset=c(-0.21, 0) , legend = c("inlf = 0", "inlf = 1"), fill = c("black", "gray"))
So, while there are fewer women from non-cities in this data sample, the proportion of women who decide to return to the workforce is very similar, regardless of whether the woman lives in a city or not.
Finally, note that when calculating the proportions:
round(prop.table(tbl, 1), 5)
there is a difference between:
round(t(prop.table(tbl, 1)), 5)
and:
round(prop.table(t(tbl), 1), 5)
In the first case, we calculated the proportions of women in the labor force and not in the labor force separately for all women living in cities and for all women living outside cities, and then simply transposed that table.
In the second case, we calculated the proportions of women living in a city and outside a city separately for all in-labor-force women (so that they sum up to 100% of in-labor-force women) and for all not-in-labor-force women (so that they sum up to 100% of not-in-labor-force women).
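A quick way to see the difference is to check which margins sum to one in each case:
# Proportions within city groups, transposed afterwards - the rows no longer sum to 1:
rowSums(t(prop.table(tbl, 1)))
# Proportions within labor force groups - each row sums to 1:
rowSums(prop.table(t(tbl), 1))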
Note that we are not restricted to indicator variables only - we can just as successfully examine categorical variables containing multiple different categories.
# From: https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/pairs.html
panel.hist <- function(x, ...){
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...){
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(dt5_train[, c("hours", "age", "educ", "wage", "exper", "husage", "huseduc", "huswage", "faminc")])
pairs(dt5_train[, c("hours", "age", "educ", "wage", "exper", "husage", "huseduc", "huswage", "faminc")], diag.panel = panel.hist, lower.panel = panel.cor, upper.panel = panel.smooth)
#for(i in 2:19){
#print(colnames(dt5_train)[i])
#print(tapply(dt5_train[, colnames(dt5_train)[i]], dt5_train$inlf, summary))
#print("---------------------------------------------------------------------")
#}
We see that for some variables - like age, educ and exper - there is a difference between the two groups.
We would expect education and experience to increase the probability of labor force participation, and age and the number of children to decrease it.
We will estimate both logit and probit models.
logit_glm <- glm(inlf ~ educ + exper + age + kidslt6 + kidsge6, data = dt5_train, family = binomial(link = "logit"))
round(coef(summary(logit_glm)), 5)
probit_glm <- glm(inlf ~ educ + exper + age + kidslt6 + kidsge6, data = dt5_train, family = binomial(link = "probit"))
round(coef(summary(probit_glm)), 5)
The signs are as we would expect, but kidsge6 appears to be insignificant.
print(cbind(car::vif(logit_glm)))
We do not have any collinear variables.
We may want to include polynomial terms of the variables - we will examine the squares of the regressors. Regarding interaction terms, let's assume that we want to include an interaction between kidslt6 and kidsge6.
logit_glm <- glm(inlf ~ educ + I(educ^2) + exper + I(exper^2) + age + I(age^2) + kidslt6 + I(kidslt6^2) + kidsge6 + I(kidsge6^2), data = dt5_train, family = binomial(link = "logit"))
round(coef(summary(logit_glm)), 5)
We see lots of insignificant variables. Instead, let's try adding the squared terms one at a time.
logit_glm <- glm(inlf ~ educ + I(educ^2) + exper + age + kidslt6 + kidsge6, data = dt5_train, family = binomial(link = "logit"))
round(coef(summary(logit_glm)), 5)
educ^2 is insignificant.
logit_glm <- glm(inlf ~ educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = dt5_train, family = binomial(link = "logit"))
round(coef(summary(logit_glm)), 5)
exper^2 is significant.
logit_glm <- glm(inlf ~ educ + exper + I(exper^2) + age + I(age^2) + kidslt6 + kidsge6, data = dt5_train, family = binomial(link = "logit"))
round(coef(summary(logit_glm)), 5)
age^2 is insignificant.
If we include an interaction term:
logit_glm <- glm(inlf ~ educ + exper + I(exper^2) + age + kidslt6*kidsge6, data = dt5_train, family = binomial(link = "logit"))
round(coef(summary(logit_glm)), 5)
The interaction term is significant. But the squared terms of kidslt6 and kidsge6 are not:
logit_glm <- glm(inlf ~ educ + exper + I(exper^2) + age + kidslt6*kidsge6 + I(kidslt6^2) + I(kidsge6^2), data = dt5_train, family = binomial(link = "logit"))
round(coef(summary(logit_glm)), 5)
Hence, our finalized model is:
logit_glm <- glm(inlf ~ educ + exper + I(exper^2) + age + kidslt6*kidsge6, data = dt5_train, family = binomial(link = "logit"))
round(coef(summary(logit_glm)), 5)
Again, note that kidslt6:kidsge6 is significant, so we cannot remove the main effect kidsge6, even though it is insignificant (unless we want to make interpretation harder for ourselves, which we don't).
The same model is estimated for the probit regression:
probit_glm <- glm(inlf ~ educ + exper + I(exper^2) + age + kidslt6*kidsge6, data = dt5_train, family = binomial(link = "probit"))
round(coef(summary(probit_glm)), 5)
The coefficients are also significant.
Next, we will plot the predicted probabilities in a single plot: we will use the range of age values in the training dataset as continuous data, and the number of kids as discrete data, where the number of kids takes a finite set of values from the training dataset.
max(dt5_train$age) - min(dt5_train$age)
We will first repeat the first row of the data 31 times (one row for each age value between the minimum and the maximum):
predict_data_0 <- NULL
for(i in 1:31){
  predict_data_0 <- rbind(predict_data_0, dt5[1, , drop = FALSE])
}
head(predict_data_0)
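As a side note, the same result can be obtained without a loop, by repeating the row index (a minimal equivalent sketch):
# Repeat the first row 31 times via row indexing - equivalent to the loop above:
predict_data_0 <- dt5[rep(1, 31), , drop = FALSE]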
Next, we will replace age with a sequence of values spanning the range of the actual data:
predict_data_1 <- predict_data_0
predict_data_1$age <- min(dt5_train$age):max(dt5_train$age)
and separate two cases: kidslt6 = 0 and kidslt6 = 1:
predict_data_2 <- predict_data_1
predict_data_1$kidslt6 <- 0
predict_data_2$kidslt6 <- 1
Then, the predicted values are (here we also calculate their standard errors, which we will use to calculate the confidence bounds):
predict_1 <- predict(logit_glm, newdata = predict_data_1, type = "response", se.fit = TRUE)
predict_2 <- predict(logit_glm, newdata = predict_data_2, type = "response", se.fit = TRUE)
print(predict_1)
predict_1_bounds <- data.frame(lwr = predict_1$fit - 1.96 * predict_1$se.fit, upr = predict_1$fit + 1.96 * predict_1$se.fit)
predict_2_bounds <- data.frame(lwr = predict_2$fit - 1.96 * predict_2$se.fit, upr = predict_2$fit + 1.96 * predict_2$se.fit)
Note that:
predict(logit_glm, newdata = predict_data_1)
predicts the linear predictor $\mathbf{X} \widehat{\beta}$, while:
predict(logit_glm, newdata = predict_data_1, type = "response")
predicts the mean response $\widehat{\mu} = g^{-1}(\mathbf{X} \widehat{\beta})$.
options(repr.plot.width = 10)
options(repr.plot.height = 6)
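We can verify this relationship for the logit link, for which $g^{-1}$ is the logistic cdf (plogis in R) - a small sanity-check sketch:
# Link-scale prediction (the linear predictor) versus response-scale prediction:
eta_hat <- predict(logit_glm, newdata = predict_data_1)
p_hat <- predict(logit_glm, newdata = predict_data_1, type = "response")
# Applying the inverse link to the linear predictor recovers the probabilities:
all.equal(plogis(eta_hat), p_hat)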
plot(dt5_train$age, dt5_train$inlf)
# Plot the confidence region:
polygon(c(predict_data_1$age, rev(predict_data_1$age)), c(predict_1_bounds$upr, rev(predict_1_bounds$lwr)), col = "grey90", border = NA)
polygon(c(predict_data_2$age, rev(predict_data_2$age)), c(predict_2_bounds$upr, rev(predict_2_bounds$lwr)), col = "grey90", border = NA)
# Plot the fitted probability values:
lines(predict_data_1$age, predict_1$fit, col = "red")
lines(predict_data_2$age, predict_2$fit, col = "blue")
# Plot the confidence bounds:
lines(predict_data_1$age, predict_1$fit - 1.96 * predict_1$se.fit, col = "red", lty = 3)
lines(predict_data_1$age, predict_1$fit + 1.96 * predict_1$se.fit, col = "red", lty = 3)
#
lines(predict_data_2$age, predict_2$fit - 1.96 * predict_2$se.fit, col = "blue", lty = 3)
lines(predict_data_2$age, predict_2$fit + 1.96 * predict_2$se.fit, col = "blue", lty = 3)
title("Predicted probability values of the Logistic Regression with prediction 95% confidence bounds")
We can even iterate this for various values of the number of kids under 6 years old:
options(repr.plot.width = 10)
options(repr.plot.height = 8)
predict_data_temp <- predict_data_1
plot(dt5_train$age, dt5_train$inlf)
for(i in 0:max(dt5_train$kidslt6)){
  predict_data_temp$kidslt6 <- i
  lines(predict_data_temp$age, predict(logit_glm, newdata = predict_data_temp, type = "response"), col = i + 1)
  lines(predict_data_temp$age, predict(probit_glm, newdata = predict_data_temp, type = "response"), col = i + 1, lty = 2)
}
legend(x = 29, y = 0.5, legend = paste0(0:max(dt5_train$kidslt6)),
lty = rep(1, max(dt5_train$kidslt6) + 1), col = 0:max(dt5_train$kidslt6) + 1, cex = 0.8)
legend(x = 29, y = 0.6, legend = c("Logit", "Probit"), lty = c(1, 2), col = c(1, 1), cex = 0.8)
title("Probability to return to the workforce based on the number of children under the age of 6\n Logit vs Probit regression")
predict_data_temp <- predict_data_1
predict_data_temp$kidslt6 <- 0
plot(dt5_train$age, dt5_train$inlf)
for(i in 0:max(dt5_train$kidsge6)){
  predict_data_temp$kidsge6 <- i
  lines(predict_data_temp$age, predict(logit_glm, newdata = predict_data_temp, type = "response"), col = i + 1)
  lines(predict_data_temp$age, predict(probit_glm, newdata = predict_data_temp, type = "response"), col = i + 1, lty = 2)
}
legend(x = 29, y = 0.5, legend = paste0(0:max(dt5_train$kidsge6)),
lty = rep(1, max(dt5_train$kidsge6) + 1), col = 0:max(dt5_train$kidsge6) + 1, cex = 0.8)
legend(x = 29, y = 0.6, legend = c("Logit", "Probit"), lty = c(1, 2), col = c(1, 1), cex = 0.8)
title("Probability to return to the workforce based on the number of children OVER the age of 6\n Logit vs Probit regression")
We begin by examining the default cutoff results:
predicted_probs <- predict(logit_glm, dt5_train, type = "response")
print(InformationValue::confusionMatrix(dt5_train$inlf, predicted_probs, threshold = 0.5))
We compare the results with the true values:
table(dt5_train$inlf)
We see that we have correctly predicted that 386 women will return to the labor force. Furthermore, we have correctly predicted that 77 women will not return to the labor force.
On the other hand, we have incorrectly predicted that:
- 97 women will return to the labor force (they actually did not return);
- 42 women will not return to the labor force (they actually did return).
In total, if we were interested in determining the number of jobs needed for the women returning to the labor force, the difference between the number of predicted returns and the true number of returns is:
386 + 97 - 428
in other words, we would have overestimated the overall number of women returning to work (we predict 386 + 97 = 483 returns, while the true number is 428). This would have practical complications - for example, more jobs would be created than are actually needed.
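As a quick cross-check (a sketch), the same confusion matrix can be built by hand with table(), classifying with the 0.5 threshold:
# Rows: predicted class; columns: actual class - this should reproduce the counts above:
pred_class <- as.integer(predicted_probs > 0.5)
table(predicted = pred_class, actual = dt5_train$inlf)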
If we were to use the optimal cutoff:
optimalCutoff_logit <- InformationValue::optimalCutoff(dt5_train$inlf, predict(logit_glm, dt5_train, type = "response"))[1]
print(optimalCutoff_logit)
The optimal cutoff is around 0.045 less than the default - this seems like a small difference, but the final confusion matrix is now different:
print(InformationValue::confusionMatrix(dt5_train$inlf, predicted_probs, threshold = optimalCutoff_logit))
(NOTE: the columns sum up to the true totals of 0 and 1, as in table(dt5_train$inlf).)
In this case, compared to the default cutoff:
- the correctly predicted returns went from 386 to 401 (GOOD);
- the correctly predicted non-returns went from 77 to 71 (BAD);
- the incorrectly predicted non-returns (women who actually returned) went from 42 to 27 (GOOD);
- the incorrectly predicted returns (women who actually did not return) went from 97 to 103 (BAD).
In total, if we were interested in determining the number of jobs needed for the women returning to the labor force, the difference between the number of predicted returns and the true number of returns is:
401 + 103 - 428
We see an important result: changing the cutoff changes the balance between the different types of correct and incorrect classifications. In general, it is important to determine what our objective is. For example:
(401 / 386 - 1) * 100
We have correctly classified $3.89\%$ more returns to labor force with the optimal cutoff. On the other hand:
(103 / 97 - 1) * 100
We have incorrectly classified $6.19\%$ more false returns to labor force (i.e. we predicted 1, but it was actually 0).
If we want to look at the total correctly predicted returns to the labor force out of all predicted returns (i.e. predicted 1's) - that is, the precision:
InformationValue::precision(dt5_train$inlf, predicted_probs, threshold = 0.5)
InformationValue::precision(dt5_train$inlf, predicted_probs, threshold = optimalCutoff_logit)
Comparing with the probit model:
predicted_probs <- predict(probit_glm, dt5_train, type = "response")
print(InformationValue::confusionMatrix(dt5_train$inlf, predicted_probs, threshold = 0.5))
We see that the default cutoff produces the same results. Meanwhile, the optimal cutoff:
optimalCutoff_probit <- InformationValue::optimalCutoff(dt5_train$inlf, predict(probit_glm, dt5_train, type = "response"))[1]
print(optimalCutoff_probit)
is now around $0.1\%$ higher, compared to the logit case. This results in the following confusion matrix when using the optimal cutoff:
print(InformationValue::confusionMatrix(dt5_train$inlf, predicted_probs, threshold = optimalCutoff_probit))
The main points, compared to the default cutoff, are now slightly different:
- the correctly predicted returns went from 386 to 365 (BAD);
- the correctly predicted non-returns went from 77 to 107 (GOOD);
- the incorrectly predicted non-returns went from 42 to 63 (BAD);
- the incorrectly predicted returns went from 97 to 67 (GOOD).
So, while the models are close, the automated optimal cutoff procedures give slightly different optimal cutoff points. Now, the difference between the number of jobs that we would create and the true number of returns is:
365 + 67 - 428
Much less than initially! On the other hand, we now incorrectly classify more women as not returning to the labor force (when in fact they will return):
(63 / 42 - 1) * 100
$50\%$ more than before! If it is vital that we correctly classify the women who will return to the labor force as returning (maybe based on their work experience, number of children, and other factors which are specifically different for these misclassified cases), then this is quite bad. Furthermore:
(365 / 386 - 1) * 100
while our overall number of predicted labor force returns decreased, we have made more mistakes in classifying which specific women will return to the labor force. If the women were identical in terms of family composition, education, age and all other factors, this might not be a problem.
In reality, unfortunately, this is not the case - we may have misclassified women with more experience as not returning to the workforce, and women with less experience as returning. As a result, we may opt to create more lower-skill job positions, for which the actual women returning to the workforce would be overqualified. This may result in those women not taking the jobs and either being unemployed or - even worse - leaving the workforce altogether!
On the other hand, our model precision appears to have increased:
InformationValue::precision(dt5_train$inlf, predicted_probs, threshold = optimalCutoff_probit)
So, looking at it from an economic/policymaker perspective, it is quite difficult to determine what is more important - it all depends on our task and the data that we have. Consequently, we need to take care when analysing this type of data, as the results of TPR, FPR, Precision, Accuracy and any other statistic may not be as useful in every case.
InformationValue::plotROC(dt5_train$inlf, predicted_probs, Show.labels = TRUE)
We may want to examine the ROC curves for different models and different cutpoints:
ppred <- data.frame(logit = predict(logit_glm, dt5_train, type = "response"),
probit = predict(probit_glm, dt5_train, type = "response"))
TPR <- NULL
FPR <- NULL
for(j in 1:ncol(ppred)){
  tpr <- fpr <- NULL
  for(i in seq(from = 0.01, to = 0.99, by = 0.01)){
    tpr <- c(tpr, InformationValue::sensitivity(dt5_train$inlf, ppred[, j], threshold = i))
    fpr <- c(fpr, 1 - InformationValue::specificity(dt5_train$inlf, ppred[, j], threshold = i))
  }
  TPR <- cbind(TPR, tpr)
  FPR <- cbind(FPR, fpr)
}
plot(FPR[, 1], TPR[, 1], type = "l", col = "blue", lwd = 2, xlab = "False Positive Rate", ylab = "True Positive Rate")
lines(FPR[, 2], TPR[, 2], type = "l", col = "red", lwd = 2)
legend("topleft", legend = c("logit", "probit"), lty = 1, col = c("blue", "red"), lwd = 2)
lines(x = c(0, 1), y = c(0, 1), col = "grey50", lty = 2, lwd = 2)
We see that the logit and probit models do not differ much in terms of their ROC curves.
Finally, we want to examine the AUC:
pred_logit <- ROCR::prediction(predictions = ppred$logit, labels = dt5_train$inlf)
#
auc_logit <- ROCR::performance(pred_logit, measure = "auc")
auc_logit <- unlist(auc_logit@y.values)
print(auc_logit)
pred_probit <- ROCR::prediction(predictions = ppred$probit, labels = dt5_train$inlf)
#
auc_probit <- ROCR::performance(pred_probit, measure = "auc")
auc_probit <- unlist(auc_probit@y.values)
print(auc_probit)
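As a rough cross-check of these values (a sketch, not the ROCR implementation), we can approximate the AUC by trapezoidal integration over the manually computed (FPR, TPR) grid from before:
# Trapezoidal approximation of the area under the ROC curve:
trap_auc <- function(fpr, tpr){
  ord <- order(fpr)                   # integrate from left to right
  x <- c(0, fpr[ord], 1)              # pad with the (0, 0) and (1, 1) corners
  y <- c(0, tpr[ord], 1)
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}
print(trap_auc(FPR[, 1], TPR[, 1]))   # logit - should be close to auc_logit
print(trap_auc(FPR[, 2], TPR[, 2]))   # probit - should be close to auc_probit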
Generally: an AUC of $0.5$ indicates random guessing; $0.5 - 0.7$ indicates a poor model; $0.7 - 0.8$ indicates an adequate model. Ideally, we would strive for an AUC value $> 0.8$.
The AUC for both the logit and probit models is around $0.8$, indicating that our estimated models are adequate. The AUC is slightly higher for the logit regression.
The parameters of a linear regression model have a straightforward interpretation:
$ \beta_j \text{ measures the effect of a unit change in } X_j \text{ on } \mathbb{E}(Y|\mathbf{X}) $
On the other hand, the parameters of nonlinear logit and probit models are not as easy to interpret, since the linear predictor affects the mean through the link function. Nevertheless, we can measure this effect via the marginal effects:
$ \dfrac{\partial \mu}{\partial X_j} = \dfrac{\partial g^{-1}(\mathbf{X} \boldsymbol{\beta})}{\partial X_j} $
which have the same interpretations as the coefficients in the linear model.
For example, for the probit model: $ \dfrac{\partial \mu}{\partial X_j} = \dfrac{\partial \mathbf{\Phi} (\mathbf{X}_i \boldsymbol{\beta})}{\partial X_j} = \beta_j \cdot \mathbf{\phi} (\mathbf{X}_i \boldsymbol{\beta}) $
where $ \mathbf{\Phi} (\mathbf{X}_i \boldsymbol{\beta})$ is the standard normal cdf and $\mathbf{\phi} (\mathbf{X}_i \boldsymbol{\beta})$ is the standard normal pdf.
The partial effects differ across regressor values, which makes them difficult to present in a compact way. Therefore, it is common to aggregate the effects in two ways: as the Partial Effects at the Average (PEA), or as the Average Partial Effects (APE).
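To make these two aggregation schemes concrete before using a package, here is a hand-rolled sketch (not the mfx implementation used below) for a regressor that enters our fitted probit model linearly, such as educ:
# Linear predictor X * beta-hat for every observation in the training data:
eta <- predict(probit_glm, type = "link")
# APE: average the observation-level partial effects beta_educ * dnorm(eta):
ape_educ <- coef(probit_glm)["educ"] * mean(dnorm(eta))
# PEA: evaluate the partial effect at the average of the linear predictor:
pea_educ <- coef(probit_glm)["educ"] * dnorm(mean(eta))
print(c(APE = unname(ape_educ), PEA = unname(pea_educ)))
Both values should be close to the corresponding mfx output below.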
We will begin by examining the coefficients themselves:
round(coef(summary(logit_glm)), 5)
In the logit model, we can interpret the coefficients in terms of the log-odds-ratio, as we did before. Some of the coefficients can be interpreted as follows:
- If educ increases by 1 year, then the log-odds of returning to the labor force (versus not returning) increase by 0.19791, ceteris paribus.
- If age increases by 1 year, then the log-odds of returning to the labor force (versus not returning) decrease by 0.10395, ceteris paribus.
- If kidslt6 increases by 1 (i.e. if we have an additional child younger than 6 years), then the log-odds of returning to the labor force (versus not returning) change by $-2.2880 + 0.53630 \cdot kidsge6$, ceteris paribus.
- If exper increases by 1 year, then the log-odds of returning to the labor force (versus not returning) change by $0.21693 - 0.00318 \cdot (2 \cdot exper + 1)$, ceteris paribus.
We can also interpret them in terms of the odds-ratio:
round(exp(coef(summary(logit_glm)))[, 1, drop = FALSE], 5)
- If educ increases by 1 year, then the odds of returning to the labor force (versus not returning) change by a factor of 1.21885 (i.e. increase by roughly 21.9%).
- If age increases by 1 year, then the odds of returning to the labor force (versus not returning) change by a factor of 0.90127 (i.e. decrease by roughly 9.9%).
And so on.
Next up, we want to evaluate the partial effects of the explanatory variables on the probability:
PEA_logit = mfx::logitmfx(formula = logit_glm$formula, data = logit_glm$data, atmean = TRUE)
PEA_logit$mfxest
PEA - the Partial Effects at the Average - shows that:
- a unit increase in educ results in a 0.0356 increase in the probability of returning to the labour force;
- a unit increase in age results in a 0.018698 decrease in the probability of returning to the labour force;
- if kidslt6 increases by one, then the probability of returning to the labour force changes by $-0.411591 + 0.096475 \cdot \overline{kidsge6}$;
- if exper increases by one, then the probability of returning to the labor force changes by $0.039023 - 0.000573 \cdot (2 \cdot \overline{exper})$.
APE_logit = mfx::logitmfx(formula = logit_glm$formula, data = logit_glm$data, atmean = FALSE)
APE_logit$mfxest
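Before interpreting the output, we can cross-check the educ entry with the same hand-rolled approach as in the probit sketch above (educ enters the model linearly, so its partial effect is the coefficient times the logistic density):
# Sample average of beta_educ * dlogis(X * beta-hat) - should be close to the educ APE above:
eta <- predict(logit_glm, type = "link")
mean(dlogis(eta)) * coef(logit_glm)["educ"]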
APE - the Average Partial Effect - shows that:
- a unit increase in educ results in a sample average estimated increase of 0.0299 in the probability of returning to the labour force;
- a unit increase in age results in a sample average estimated decrease of 0.015705 in the probability of returning to the labour force;
- a unit increase in kidslt6 results in a sample average estimated change of $-0.01252 + 0.081028 \cdot \overline{kidsge6}$. Note the similarity to the PEA expression, which stems from the fact that the partial derivative with respect to kidslt6 returns $-0.01252 + 0.081028 \cdot kidsge6$. Since the APE averages all the partial effects, kidsge6 turns into its average $\overline{kidsge6}$. In contrast, the PEA plugs $\overline{kidsge6}$ in directly, so the expressions have the same form, although with different coefficients.
We can do the same with the probit regression:
round(coef(summary(probit_glm)), 5)
Interpretation is similar to the logit model, with the difference being that instead of the log-odds-ratio we now have changes in the inverse normal cdf of the probability, $\Phi^{-1}(p_i)$. Consequently, it makes more sense to evaluate the PEA and APE:
PEA_probit = mfx::probitmfx(formula = probit_glm$formula, data = probit_glm$data, atmean = TRUE)
PEA_probit$mfxest
APE_probit = mfx::probitmfx(formula = probit_glm$formula, data = probit_glm$data, atmean = FALSE)
APE_probit$mfxest
The interpretation of the PEA and APE is analogous to the logit model: a unit increase in the (average, if PEA) regressor results in the specified (average, if APE) change in the probability.
Marginal effects show the change in probability when the predictor or independent variable increases by one unit. For continuous variables, this represents the instantaneous rate of change, given that the 'unit' may be very small. For binary variables, the change is from 0 to 1, so one 'unit' in the usual sense.
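For a variable where a unit change is a discrete jump, we can also compute the effect directly by comparing predicted probabilities at the two values. A minimal sketch for kidslt6 moving from 0 to 1, averaged over the training sample:
# Counterfactual datasets: everyone with kidslt6 = 0 versus kidslt6 = 1:
d0 <- d1 <- dt5_train
d0$kidslt6 <- 0
d1$kidslt6 <- 1
# Average change in the predicted probability of participation:
mean(predict(logit_glm, newdata = d1, type = "response") -
     predict(logit_glm, newdata = d0, type = "response"))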
As in the multiple linear regression case, the intercept does not have a meaningful interpretation.
The fitted Logit model:
$ \log \left( \dfrac{\widehat{p}_i}{1 - \widehat{p}_i} \right) = \underset{(1.0492)}{1.7010} + \underset{(0.050)}{0.1979} \cdot educ + \underset{(0.0388)}{0.2169} \cdot exper - \underset{(0.0012)}{0.0032} \cdot exper^2 - \underset{(0.0183)}{0.1039} \cdot age - \underset{(0.3720)}{2.2880} \cdot kidslt6 - \underset{(0.0992)}{0.0829} \cdot kidsge6 + \underset{(0.1690)}{0.5363} \cdot (kidslt6 \times kidsge6) $
The fitted Probit model:
$ \widehat{p}_i = \mathbf{\Phi} \left( \underset{(0.6050)}{0.9593} + \underset{(0.0287)}{0.1164} \cdot educ + \underset{(0.0223)}{0.1280} \cdot exper - \underset{(0.0007)}{0.0019} \cdot exper^2 - \underset{(0.0104)}{0.0603} \cdot age - \underset{(0.2102)}{1.3413} \cdot kidslt6 - \underset{(0.0573)}{0.0465} \cdot kidsge6 + \underset{(0.0983)}{0.3150} \cdot (kidslt6 \times kidsge6) \right) $
round(coef(summary(logit_glm)), 5)
We will begin with a simple economic hypothesis: $$ \text{education is twice as effective, in absolute value, on the log-odds-ratio as age} $$ This is equivalent to the hypothesis that:
$ \begin{cases} H_0&: educ + 2 \cdot age = 0\\ H_1&: educ + 2 \cdot age \neq 0 \end{cases} $
car::linearHypothesis(logit_glm, c("educ + 2 * age = 0"))
Since the p-value > 0.05, we have no grounds to reject the null hypothesis that education's effect on the log-odds-ratio is twice as strong, in absolute value, as the effect of age.
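The same test can also be carried out by hand as a Wald test (a sketch: for a single restriction $R \widehat{\beta} = 0$, the statistic $(R \widehat{\beta})^2 / (R \widehat{V} R^\top)$ is asymptotically $\chi^2_1$ under $H_0$):
# Restriction vector R picking out educ + 2 * age:
b <- coef(logit_glm)
V <- vcov(logit_glm)
R <- setNames(rep(0, length(b)), names(b))
R["educ"] <- 1
R["age"] <- 2
# Wald statistic and its chi-squared p-value (1 restriction => 1 degree of freedom):
wald <- (sum(R * b))^2 / c(t(R) %*% V %*% R)
pchisq(wald, df = 1, lower.tail = FALSE)  # should match the car::linearHypothesis p-value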
Furthermore, we may examine another hypothesis:
$ \begin{cases} H_0&: educ + 2 \cdot age + kidslt6= 0\\ H_1&: educ + 2 \cdot age + kidslt6 \neq 0 \end{cases} $
car::linearHypothesis(logit_glm, c("educ + 2 * age + kidslt6 = 0"))
Since the p-value < 0.05, we reject the null hypothesis of this linear restriction.
Finally, we may examine a joint hypothesis of two linear restrictions:
$ \begin{cases} H_0&: educ + 2 \cdot age = 0 \text{ and } kidslt6 = 28 \cdot kidsge6\\ H_1&: educ + 2 \cdot age \neq 0, \text{ or } kidslt6 \neq 28 \cdot kidsge6, \text{ or both} \end{cases} $
print(car::linearHypothesis(logit_glm, c("educ + 2 * age = 0", "kidslt6 = 28 * kidsge6")))
Since the p-value > 0.05, we do not reject the joint hypothesis of these two linear restrictions.
We will begin by evaluating the confusion matrix on the test data, using the optimal cutoff which we found from the train dataset:
table(dt5_test$inlf)
Note that in this data sample, our test set contains only zeros! This is a nice example of how new data may arrive in practice - there is always a possibility that the new data, for which we do not yet know the outcome, contains only 1's, or only 0's.
Firstly, using the naive 0.5 threshold:
probs_logit_test <- predict(logit_glm, newdata = dt5_test, type = "response")
print(InformationValue::confusionMatrix(dt5_test$inlf, probs_logit_test, threshold = 0.5))
probs_probit_test <- predict(probit_glm, newdata = dt5_test, type = "response")
print(InformationValue::confusionMatrix(dt5_test$inlf, probs_probit_test, threshold = 0.5))
Both models give the same results - more than half of the observations are incorrectly classified.
If we look at the optimal cutoff threshold:
print(InformationValue::confusionMatrix(dt5_test$inlf, probs_logit_test, threshold = optimalCutoff_logit))
print(InformationValue::confusionMatrix(dt5_test$inlf, probs_probit_test, threshold = optimalCutoff_probit))
In this case, the results are slightly better for the probit regression.
Unfortunately, since we do not have any 1's in our test data, we cannot calculate the ROC or the AUC. At this point, our analysis of the test data would come to an end.
For interest's sake, assume that our new data contains both zeros and ones (we will ignore that some, or all, of the sample observations may come from our train set):
set.seed(123)
#
dt5_test_v2 <- dt5[sample(1:nrow(dt5), nrow(dt5) - n, replace = FALSE), ]
table(dt5_test_v2$inlf)
probs_logit_test <- predict(logit_glm, newdata = dt5_test_v2, type = "response")
print(InformationValue::confusionMatrix(dt5_test_v2$inlf, probs_logit_test, threshold = optimalCutoff_logit))
probs_probit_test <- predict(probit_glm, newdata = dt5_test_v2, type = "response")
print(InformationValue::confusionMatrix(dt5_test_v2$inlf, probs_probit_test, threshold = optimalCutoff_probit))
We see that for this test set, we correctly predict the majority of 1's. Again, part of this is because all of the observations with 1 were used to estimate our model. In this case, the logit model appears to be better at classifying 1's correctly, compared to the probit model.
options(repr.plot.width = 6)
options(repr.plot.height = 4)
InformationValue::plotROC(dt5_test_v2$inlf, probs_logit_test, Show.labels = FALSE)
InformationValue::plotROC(dt5_test_v2$inlf, probs_probit_test, Show.labels = FALSE)
The AUC is around $0.8$, indicating an adequate model.