Logistic regression model

Note

This section continues from the end of the previous section on exploratory data analysis.

We would expect the following relationship between the independent variables and the participation rate:

In addition:

We now estimate the model with these variables:

my_formula = "inlf ~ 1 + educ + exper + age + kidslt6 + kidsge6"
logit_glm  = glm(my_formula, data = dt_train, family = binomial)
summary(logit_glm)

Call:
glm(formula = my_formula, family = binomial, data = dt_train)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.49290    0.91870   0.537    0.592    
educ         0.20567    0.04563   4.507 6.57e-06 ***
exper        0.11060    0.01448   7.638 2.21e-14 ***
age         -0.08503    0.01517  -5.605 2.08e-08 ***
kidslt6     -1.42606    0.21722  -6.565 5.20e-11 ***
kidsge6      0.08920    0.08217   1.085    0.278    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 820.43  on 601  degrees of freedom
Residual deviance: 667.03  on 596  degrees of freedom
AIC: 679.03

Number of Fisher Scoring iterations: 4
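The estimated coefficients are on the log-odds scale. As an optional side check (not part of the output above), we could exponentiate them, together with their Wald confidence intervals, to read them as odds ratios:

# Odds ratios (sketch): exponentiating the coefficients and their Wald
# confidence intervals puts them on the odds scale.
exp(cbind(OR = coef(logit_glm), confint.default(logit_glm)))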

We see that all of the variables, except for kidsge6, are statistically significant. We also check for multicollinearity:

car::vif(logit_glm) %>% print()
    educ    exper      age  kidslt6  kidsge6 
1.068891 1.209205 1.654253 1.383618 1.285253 

Since all of the \(\rm VIF\) values are below 5, we have no indication of problematic collinearity. Next, to illustrate what happens when a predictor (almost) perfectly determines the outcome, we add hours, the number of hours worked, to the model:

glm(inlf ~ 1 + educ + exper + age + kidslt6 + kidsge6 + hours, data = dt_train, family = binomial) %>% summary()
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Call:
glm(formula = inlf ~ 1 + educ + exper + age + kidslt6 + kidsge6 + 
    hours, family = binomial, data = dt_train)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -40.1499  5146.0208  -0.008    0.994
educ           1.0450   185.2361   0.006    0.995
exper         -0.4995   154.3390  -0.003    0.997
age            0.1465    65.7403   0.002    0.998
kidslt6       -0.3265   631.4149  -0.001    1.000
kidsge6        1.5228   333.1617   0.005    0.996
hours          2.2316    45.0036   0.050    0.960

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8.2043e+02  on 601  degrees of freedom
Residual deviance: 9.3449e-06  on 595  degrees of freedom
AIC: 14

Number of Fisher Scoring iterations: 25

Note the high \(p\)-values and the warning about convergence: since hours > 0 directly corresponds to inlf = 1, we get a warning telling us that one of our variables is able to perfectly separate the two groups of inlf.
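A quick cross-tabulation makes the separation visible (assuming, as in this dataset, that hours is recorded as 0 for women who are not in the labor force):

# hours > 0 identifies inlf = 1 exactly, so the likelihood can always be
# increased by pushing the coefficient of hours towards infinity - hence
# the convergence warning and the huge standard errors.
table(inlf = dt_train$inlf, positive_hours = dt_train$hours > 0)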

We may want to include the following variables:

Regarding interaction terms, let's assume that we want to include:

my_formula = "inlf ~ 1 + educ + I(educ^2) + exper + I(exper^2) + age + kidslt6 + kidsge6 + kidslt6*kidsge6"
logit_glm  = glm(my_formula, data = dt_train, family = binomial)
summary(logit_glm)

Call:
glm(formula = my_formula, family = binomial, data = dt_train)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)      5.040155   2.129311   2.367  0.01793 *  
educ            -0.530697   0.313145  -1.695  0.09013 .  
I(educ^2)        0.030340   0.012822   2.366  0.01797 *  
exper            0.199566   0.034957   5.709 1.14e-08 ***
I(exper^2)      -0.003154   0.001104  -2.858  0.00426 ** 
age             -0.095150   0.016492  -5.770 7.95e-09 ***
kidslt6         -2.174765   0.351711  -6.183 6.27e-10 ***
kidsge6         -0.021683   0.093832  -0.231  0.81725    
kidslt6:kidsge6  0.389727   0.139276   2.798  0.00514 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 820.43  on 601  degrees of freedom
Residual deviance: 647.80  on 593  degrees of freedom
AIC: 665.8

Number of Fisher Scoring iterations: 4

(Note the use of the function I(...): here it is used to inhibit the interpretation of operators such as “+”, “-”, “*” and “^” as formula operators, so that they are treated as arithmetical operators.)
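A minimal illustration of the difference, using a small hypothetical data frame:

# Without I(), x^2 is read as a formula operator (x crossed with itself),
# which simplifies to x; wrapped in I(), it is the arithmetic square.
d <- data.frame(x = 1:3)
colnames(model.matrix(~ x + x^2,    data = d)) # no squared term
colnames(model.matrix(~ x + I(x^2), data = d)) # includes I(x^2)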


The added variables are statistically significant, so we do not need to remove them (remember from linear regression that if we include interaction terms, we do not remove the corresponding individual variables from the model).
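Since the first specification is nested in this one, we could additionally compare the two models with a likelihood-ratio test (a sketch, not part of the original output):

# Likelihood-ratio test (sketch): refit the smaller model and compare the
# deviances; a small p-value favours the richer specification.
logit_glm_base <- glm(inlf ~ 1 + educ + exper + age + kidslt6 + kidsge6,
                      data = dt_train, family = binomial)
anova(logit_glm_base, logit_glm, test = "Chisq")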

We can calculate the \(\widehat{\mathbb{P}}(Y = 1 | \mathbf{X})\) as follows:

# Note the difference between type = "response" and type = "link"!
predicted_probs_logit <- predict(logit_glm, newdata = dt_train, type = "response")
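To make the comment about the type argument concrete, here is a quick sanity check (a sketch): applying the logistic CDF to the linear predictor returned by type = "link" should reproduce the probabilities returned by type = "response".

# plogis() is the logistic CDF, i.e. the inverse of the logit link.
predicted_link_logit <- predict(logit_glm, newdata = dt_train, type = "link")
all.equal(plogis(predicted_link_logit), predicted_probs_logit)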
predicted_probs_logit %>% as_tibble() %>% 
    ggplot(aes(x = value)) + 
    geom_histogram(bins = 10) +
    theme_bw()


Next, we will specify a prediction rule with cutoff \(\theta\), such that: \[ \widehat{Y}_i = \begin{cases} 1, & \text{if } \widehat{p}_i \geq \theta \\ 0, & \text{if } \widehat{p}_i < \theta \end{cases} \]

Using the default prediction rule, \(\theta = 0.5\):

conf_mat_logit_default_cutoff <- cbind(actual = dt_train$inlf, predicted = as.numeric(predicted_probs_logit >= 0.5)) %>% as_tibble()
conf_mat_logit_default_cutoff <- table(conf_mat_logit_default_cutoff)
conf_mat_logit_default_cutoff
      predicted
actual   0   1
     0 155 100
     1  60 287

We can calculate the optimal cutoff based on the accuracy score:

opt_cutoff <- ROCR::performance(ROCR::prediction(predicted_probs_logit, dt_train$inlf), measure = "acc")
plot(opt_cutoff)

str(opt_cutoff)
Formal class 'performance' [package "ROCR"] with 6 slots
  ..@ x.name      : chr "Cutoff"
  ..@ y.name      : chr "Accuracy"
  ..@ alpha.name  : chr "none"
  ..@ x.values    :List of 1
  .. ..$ : Named num [1:584] Inf 0.982 0.977 0.969 0.965 ...
  .. .. ..- attr(*, "names")= chr [1:584] "" "463" "149" "20" ...
  ..@ y.values    :List of 1
  .. ..$ : num [1:584] 0.424 0.425 0.427 0.429 0.43 ...
  ..@ alpha.values: list()
opt_cutoff@x.values[[1]][which.max(opt_cutoff@y.values[[1]])] %>% print()
      154 
0.4517714 
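The same cutoff can be cross-checked by hand (a sketch): scan a grid of thresholds and keep the one with the highest in-sample accuracy; the result should be close to the value found by ROCR above.

# Manual grid search over cutoffs, using in-sample accuracy as the criterion.
cutoff_grid <- seq(0.01, 0.99, by = 0.01)
acc_grid    <- sapply(cutoff_grid, function(th) {
    mean(as.numeric(predicted_probs_logit >= th) == dt_train$inlf)
})
cutoff_grid[which.max(acc_grid)]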
roc <- roc(response = dt_train$inlf, predictor = predicted_probs_logit, percent = FALSE)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
coords(roc, "best", ret = c("threshold", "sens", "spec", "accuracy"), transpose = TRUE)
  threshold sensitivity specificity    accuracy 
  0.4511304   0.8703170   0.5843137   0.7491694 
optimalCutoff_logit <- coords(roc, "best", ret = c("threshold"), transpose = TRUE)
optimalCutoff_logit
threshold 
0.4511304 
conf_mat_logit_optimal_cutoff <- cbind(actual = dt_train$inlf, predicted = as.numeric(predicted_probs_logit >= optimalCutoff_logit)) %>% as_tibble()
conf_mat_logit_optimal_cutoff <- table(conf_mat_logit_optimal_cutoff)
conf_mat_logit_optimal_cutoff
      predicted
actual   0   1
     0 149 106
     1  45 302

We can compare the number of misclassifications:

conf_mat_logit_default_cutoff
      predicted
actual   0   1
     0 155 100
     1  60 287
conf_mat_logit_optimal_cutoff
      predicted
actual   0   1
     0 149 106
     1  45 302
print(paste0("Misclassifications with default cutoff: ", sum(sum(conf_mat_logit_default_cutoff[lower.tri(conf_mat_logit_default_cutoff) | upper.tri(conf_mat_logit_default_cutoff)]))))
[1] "Misclassifications with default cutoff: 160"
print(paste0("Misclassifications with optimal cutoff: ", sum(sum(conf_mat_logit_optimal_cutoff[lower.tri(conf_mat_logit_optimal_cutoff) | upper.tri(conf_mat_logit_optimal_cutoff)]))))
[1] "Misclassifications with optimal cutoff: 151"

And we see that the optimal cutoff reduces the number of misclassifications from 160 to 151.

However, the improvement is not uniform: the optimal cutoff is more accurate in predicting \(Y = 1\) (302 versus 287 correct), but less accurate in predicting \(Y = 0\) (149 versus 155 correct), as the per-class accuracies below show.
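A small sketch makes this explicit: since the rows of the confusion matrices are the actual classes, the diagonal shares give the per-class accuracy (specificity for \(Y = 0\), sensitivity for \(Y = 1\)).

# Share of correct predictions within each actual class.
class_accuracy <- function(cm) diag(cm) / rowSums(cm)
class_accuracy(conf_mat_logit_default_cutoff)
class_accuracy(conf_mat_logit_optimal_cutoff)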