We would expect the following relationship between the independent variables and the participation rate:
\(\beta_{educ}>0\) - higher education makes active labor force participation more likely: the person has already invested time in improving themselves and has a better chance of earning a higher wage, hence they are more motivated to return to the workforce;
\(\beta_{exper}>0\) - the more experience a person has, the more likely they are to return to a higher-paying job, hence the more motivated they are to return to the workforce;
\(\beta_{age}<0\) - the older a person is, the less likely they are to be hired back into a well-paying position, hence they may be discouraged from returning to the workforce.
In addition:
\(\beta_{kidslt6}<0\) - the more young kids in the family, the higher the chance that one of the parents will remain at home to look after them;
\(\beta_{kidsge6}>0\) - older children are at school or daycare, so the parent is less likely to stay at home when they could be working (especially since buying school supplies, clothes, etc., would benefit from additional income).
glm(inlf ~ 1 + educ + exper + age + kidslt6 + kidsge6 + hours,
    data = dt_train, family = binomial) %>%
  summary()
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Call:
glm(formula = inlf ~ 1 + educ + exper + age + kidslt6 + kidsge6 +
hours, family = binomial, data = dt_train)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -40.1499  5146.0208  -0.008    0.994
educ          1.0450   185.2361   0.006    0.995
exper        -0.4995   154.3390  -0.003    0.997
age           0.1465    65.7403   0.002    0.998
kidslt6      -0.3265   631.4149  -0.001    1.000
kidsge6       1.5228   333.1617   0.005    0.996
hours         2.2316    45.0036   0.050    0.960
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8.2043e+02 on 601 degrees of freedom
Residual deviance: 9.3449e-06 on 595 degrees of freedom
AIC: 14
Number of Fisher Scoring iterations: 25
Note the high \(p\)-values and the warning about convergence: since hours > 0 directly implies inlf = 1, we get a warning telling us that one of our variables is able to perfectly separate the different groups of inlf.
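To see why separation breaks the estimation, consider a minimal standalone sketch (a hypothetical toy data frame, not dt_train): whenever the predictor is positive the outcome is 1, so the likelihood keeps improving as the slope grows without bound and glm() cannot settle on a finite estimate:

```r
# Toy data with perfect separation: y = 1 exactly when x > 0,
# so the logit MLE does not exist and the slope estimate diverges.
df <- data.frame(y = c(0, 0, 0, 1, 1, 1),
                 x = c(0, 0, 0, 5, 8, 12))

table(df$y, df$x > 0)  # empty off-diagonal cells reveal the separation

fit <- glm(y ~ x, data = df, family = binomial)
# glm() warns, as in our model, that fitted probabilities are
# numerically 0 or 1; note the huge estimates and standard errors.
coef(fit)
```

The analogous check on our data, e.g. `with(dt_train, table(inlf, hours > 0))`, would show that hours > 0 coincides with inlf = 1, which is why hours should be dropped from the model.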
We may want to include the following variables:
\(educ^2\) - each additional year of education would have a decreasing effect on the participation rate, so that \(\beta_{educ^2} < 0\);
\(exper^2\) - each additional year of experience would have a decreasing effect on the participation rate, so that \(\beta_{exper^2} < 0\);
\(age^2\) - each additional year of age would have an increasingly negative effect on the participation rate, so that \(\beta_{age^2}<0\);
\(\rm kidslt6^2\) - each additional young kid would significantly reduce the participation rate further, so that \(\beta_{kidslt6^2}<0\);
\(\rm kidsge6^2\) - each additional older kid would further increase the participation rate, since more money is needed for clothes, books, trips, after-school activities, etc.; on the other hand, younger children may get hand-me-downs from their older siblings, so the sign of \(\beta_{kidsge6^2}\) is ambiguous.
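Collecting the base regressors and the quadratic terms above, one candidate specification (a sketch; which squared terms to actually keep is decided when we estimate the model) can be written as:

\[
\mathbb{P}(inlf = 1 \mid \mathbf{X}) = \dfrac{\exp(z)}{1 + \exp(z)},
\]
where
\[
z = \beta_0 + \beta_{educ}\, educ + \beta_{educ^2}\, educ^2 + \beta_{exper}\, exper + \beta_{exper^2}\, exper^2 + \beta_{age}\, age + \beta_{age^2}\, age^2 + \beta_{kidslt6}\, kidslt6 + \beta_{kidslt6^2}\, kidslt6^2 + \beta_{kidsge6}\, kidsge6 + \beta_{kidsge6^2}\, kidsge6^2.
\]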
Regarding interaction terms, let's assume that we want to include:
\(kidslt6\times kidsge6\) - if we already have a number of older kids and then have a younger child, it may be more likely that the older siblings will babysit the younger ones while the parents go to work.
my_formula <- "inlf ~ 1 + educ + I(educ^2) + exper + I(exper^2) + age + kidslt6 + kidsge6 + kidslt6*kidsge6"
logit_glm <- glm(my_formula, data = dt_train, family = binomial)
summary(logit_glm)
Call:
glm(formula = my_formula, family = binomial, data = dt_train)
Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)      5.040155   2.129311   2.367  0.01793 *
educ            -0.530697   0.313145  -1.695  0.09013 .
I(educ^2)        0.030340   0.012822   2.366  0.01797 *
exper            0.199566   0.034957   5.709 1.14e-08 ***
I(exper^2)      -0.003154   0.001104  -2.858  0.00426 **
age             -0.095150   0.016492  -5.770 7.95e-09 ***
kidslt6         -2.174765   0.351711  -6.183 6.27e-10 ***
kidsge6         -0.021683   0.093832  -0.231  0.81725
kidslt6:kidsge6  0.389727   0.139276   2.798  0.00514 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 820.43 on 601 degrees of freedom
Residual deviance: 647.80 on 593 degrees of freedom
AIC: 665.8
Number of Fisher Scoring iterations: 4
Note the use of the function I(...): here it is used to inhibit the interpretation of operators such as "+", "-", "*" and "^" as formula operators, so that they are used as arithmetical operators.
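A quick standalone way to see what I(...) changes (using a made-up column x, not part of the model above):

```r
d <- data.frame(x = c(2, 3))

# Without I(), "^" is a formula operator: x^2 means "x crossed with
# itself", which simplifies back to x, so no squared column appears.
colnames(model.matrix(~ x + x^2, d))     # "(Intercept)" "x"

# With I(), "^" is arithmetic: the squared term enters as a regressor.
colnames(model.matrix(~ x + I(x^2), d))  # "(Intercept)" "x" "I(x^2)"
```

Similarly, kidslt6*kidsge6 in the formula is shorthand for kidslt6 + kidsge6 + kidslt6:kidsge6, which is why the summary shows all three terms.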
Most of the variables are statistically significant; kidsge6 on its own is not, but its interaction with kidslt6 is, so we keep it (remember from linear regression that if we include interaction terms, we don't remove their individual variables from the model).
We can calculate the \(\widehat{\mathbb{P}}(Y = 1 | \mathbf{X})\) as follows: