5.4 Example

Note that the response variables in these datasets are fundamentally different from the ones in 3.11 and 4.11.

We will use the fifth dataset. The MROZ dataset contains cross-sectional labor force participation data:

  • inlf - =1 if in labor force, 1975

  • hours - hours worked, 1975

  • kidslt6 - # kids < 6 years

  • kidsge6 - # kids 6-18

  • age - woman’s age in years

  • educ - years of schooling

  • wage - estimated wage from earnings, hours

  • repwage - reported wage at interview in 1976

  • hushrs - hours worked by husband, 1975

  • husage - husband’s age

  • huseduc - husband’s years of schooling

  • huswage - husband’s hourly wage, 1975

  • faminc - family income, 1975

  • mtr - federal marginal tax rate facing the woman

  • motheduc - mother’s years of schooling

  • fatheduc - father’s years of schooling

  • unem - unemployment rate in county of residence

  • city - =1 if live in SMSA

  • exper - actual labor market experience

  • nwifeinc - (faminc - wage*hours)/1000

  • lwage - log(wage)

  • expersq - exper^2

Assume that we are interested in modelling the labor force participation, inlf, of married women.

##   inlf hours kidslt6 kidsge6 age educ   wage repwage hushrs husage huseduc huswage faminc    mtr motheduc fatheduc unem city exper  nwifeinc     lwage expersq
## 1    1  1610       1       0  32   12 3.3540    2.65   2708     34      12  4.0288  16310 0.7215       12        7  5.0    0    14 10.910060 1.2101541     196
## 2    1  1656       0       2  30   12 1.3889    2.65   2310     30       9  8.4416  21800 0.6615        7        7 11.0    1     5 19.499981 0.3285121      25
## 3    1  1980       1       3  35   12 4.5455    4.04   3072     40      12  3.5807  21040 0.6915       12        7  5.0    0    15 12.039910 1.5141380     225
## 4    1   456       0       3  34   12 1.0965    3.25   1920     53      10  3.5417   7300 0.7815        7        7  5.0    0     6  6.799996 0.0921233      36
## 5    1  1568       1       2  31   14 4.5918    3.60   2000     32      12 10.0000  27300 0.6215       12       14  9.5    1     7 20.100060 1.5242720      49
## 6    1  2032       0       0  54   12 4.7421    4.70   1040     57      11  6.7106  19495 0.6915       14        7  7.5    1    33  9.859054 1.5564801    1089
##    inlf   hours  kidslt6  kidsge6   age  educ    wage  repwage  hushrs  husage  huseduc  huswage   faminc     mtr  motheduc  fatheduc  unem  city  exper   nwifeinc     lwage  expersq
## 0   1.0  1610.0      1.0      0.0  32.0  12.0  3.3540     2.65  2708.0    34.0     12.0   4.0288  16310.0  0.7215      12.0       7.0   5.0   0.0   14.0  10.910060  1.210154    196.0
## 1   1.0  1656.0      0.0      2.0  30.0  12.0  1.3889     2.65  2310.0    30.0      9.0   8.4416  21800.0  0.6615       7.0       7.0  11.0   1.0    5.0  19.499981  0.328512     25.0
## 2   1.0  1980.0      1.0      3.0  35.0  12.0  4.5455     4.04  3072.0    40.0     12.0   3.5807  21040.0  0.6915      12.0       7.0   5.0   0.0   15.0  12.039910  1.514138    225.0
## 3   1.0   456.0      0.0      3.0  34.0  12.0  1.0965     3.25  1920.0    53.0     10.0   3.5417   7300.0  0.7815       7.0       7.0   5.0   0.0    6.0   6.799996  0.092123     36.0
## 4   1.0  1568.0      1.0      2.0  31.0  14.0  4.5918     3.60  2000.0    32.0     12.0  10.0000  27300.0  0.6215      12.0      14.0   9.5   1.0    7.0  20.100060  1.524272     49.0
## 5   1.0  2032.0      0.0      0.0  54.0  12.0  4.7421     4.70  1040.0    57.0     11.0   6.7106  19495.0  0.6915      14.0       7.0   7.5   1.0   33.0   9.859054  1.556480   1089.0

5.4.1 Exercise Set 1

We will begin by dividing our data into training and test sets. A train/test split is particularly important for binary outcome modelling, since we want to verify that our model can correctly classify new data.

## 
##   0   1 
## 325 428
## 1.0    428
## 0.0    325
## Name: inlf, dtype: int64

We see that the sample is reasonably balanced between non-participating (325) and participating (428) women. So, we can partition our data as we normally would into \(80\%\) training and \(20\%\) test sets.

dt5_train is our training set, on which we will estimate our model, and dt5_test is our test subset, on which we will evaluate the adequacy of the model.
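A minimal sketch of how this split could be done in Python (the file name mroz.csv is a hypothetical local copy of the MROZ data; the data frame names follow the text):

```python
import pandas as pd

# Load the MROZ data (hypothetical local CSV copy)
dt5 = pd.read_csv("mroz.csv")

# Naive 80/20 split that keeps the original row order
n_train = int(0.8 * len(dt5))
dt5_train = dt5.iloc[:n_train, :].copy()
dt5_test = dt5.iloc[n_train:, :].copy()

# Class balance of the dependent variable in the full sample
print(dt5["inlf"].value_counts())
```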

Considerations

Note that a better approach would be to shuffle the rows first, if we are unsure whether they are randomly ordered. We will leave this for another time, since we want the R and Python results to be comparable. For now, we will naively assume that the data is already randomly ordered.

5.4.1.1 Task 1

Modelling binary data involves creating a model for the probability of success (in this case it is inlf = 1 - that a woman is participating in the labor force).

We have three primary choices for binary data modelling:

  • A multiple linear regression;
  • A logit regression;
  • A probit regression;

The multiple regression model does not take into account the fact that the probability \(p \in \left[0, 1\right]\). Therefore, we will opt to use the logistic and probit regressions.
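For reference, the two nonlinear models specify the success probability as

\[ p_i = \mathbb{P}(inlf_i = 1 | \mathbf{x}_i) = \begin{cases} \Lambda(\mathbf{x}_i^\top \boldsymbol{\beta}) = \dfrac{1}{1 + \exp(-\mathbf{x}_i^\top \boldsymbol{\beta})}, & \text{logit model}, \\ \Phi(\mathbf{x}_i^\top \boldsymbol{\beta}), & \text{probit model}, \end{cases} \]

where \(\Lambda(\cdot)\) is the logistic CDF and \(\Phi(\cdot)\) is the standard normal CDF; both guarantee that the predicted probability lies in the interval \((0, 1)\).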

Considerations

Note that it is also possible to estimate a multiple linear regression and compare the results with logit and probit models - try it!

5.4.1.2 Task 2

We begin by examining the boxplots of the different explanatory variables, grouped by the value of the binary dependent variable. For interest's sake, we also include the city indicator variable.

Since city is an indicator variable, the values 0 and 1 do not have a numerical interpretation - they only indicate whether the person lives in a city or not.

In order to meaningfully examine the differences in labor force participation between women living in a city and those living outside one, we can use a contingency table:

##     inlf
## city   0   1
##    0 115 154
##    1 210 274
## inlf  0.0  1.0
## city          
## 0.0   115  154
## 1.0   210  274
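A sketch of how such a table could be produced with pandas (assuming the full data frame dt5 from before):

```python
import pandas as pd

# Cross-tabulate the city indicator against labor force participation
tbl = pd.crosstab(dt5["city"], dt5["inlf"])
print(tbl)
```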

To make the upcoming barchart easier to read, we will relabel the categories of the variables:

##      inlf
## city  In labor force Not in labor force
##   No             154                115
##   Yes            274                210
## inlf  In labor force  Not in labor force
## city                                    
## No               154                 115
## Yes              274                 210
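One way the relabelling could be done in Python (a sketch; the label strings follow the tables above):

```python
import pandas as pd

# Map the 0/1 codes to descriptive labels before cross-tabulating
city_lab = dt5["city"].map({0: "No", 1: "Yes"})
inlf_lab = dt5["inlf"].map({0: "Not in labor force", 1: "In labor force"})

tbl = pd.crosstab(city_lab, inlf_lab)
print(tbl)
```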

We can further visualize this with a barplot. The following transposed contingency table will be used for the barplot:

##                     city
## inlf                  No Yes
##   In labor force     154 274
##   Not in labor force 115 210
## city                 No  Yes
## inlf                        
## In labor force      154  274
## Not in labor force  115  210

where

  • Each column will be on the horizontal axis. In this case the axis will have the No and Yes labels.

  • Each row will be drawn as a differently colored set of bars - the bars will either be grouped side by side within each column category, or stacked on top of each other.
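A sketch of the barplot in Python. Note that pandas puts the DataFrame index on the horizontal axis and uses the columns for the bar colors, so to get the city labels on the horizontal axis we plot the table whose index is city (i.e. the un-transposed table from before):

```python
import matplotlib.pyplot as plt

# Grouped bars: city ("No"/"Yes") on the x-axis, inlf categories as colors
tbl.plot(kind="bar", rot=0)

# Stacked version of the same chart
tbl.plot(kind="bar", stacked=True, rot=0)
plt.show()
```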

Considerations

Note: compare the values in t(tbl) and the barcharts - make sure you understand which variable is on the horizontal axis and which variable is used for the different bar colors.

Overall, we see that there are more women from cities in this sample. However, the tendency is the same regardless of whether a woman is from a city or not - more women return to the labor force than stay at home.

It may be more interesting to examine the proportion of city and non-city women returning to the workforce:

Note that each row sums up to 1 (i.e. 100% of city and 100% of non-city women are split into returning to the labor force and staying at home):

  • the first row indicates the proportion of women in labor force and not in labor force of the total women not living in a city.
  • the second row indicates the proportions of women in labor force and not in labor force of the total women living in a city.

So, while there are fewer women from outside cities in this sample, the proportion of women who decide to return to the workforce is very similar, regardless of whether the woman lives in a city or not.

Finally, note that when calculating the proportions:

##      inlf
## city  In labor force Not in labor force
##   No         0.57249            0.42751
##   Yes        0.56612            0.43388
## inlf  In labor force  Not in labor force
## city                                    
## No          0.572491            0.427509
## Yes         0.566116            0.433884

It is important to know the difference between:

##                     city
## inlf                      No     Yes
##   In labor force     0.57249 0.56612
##   Not in labor force 0.42751 0.43388
## city                      No       Yes
## inlf                                  
## In labor force      0.572491  0.566116
## Not in labor force  0.427509  0.433884

and

##                     city
## inlf                      No     Yes
##   In labor force     0.35981 0.64019
##   Not in labor force 0.35385 0.64615
## city                      No       Yes
## inlf                                  
## In labor force      0.359813  0.640187
## Not in labor force  0.353846  0.646154

  • In the first case, we have calculated the proportions of women in the labor force and not in the labor force separately for all women living in cities and for all women living outside cities, and then simply transposed that table.

  • In the second case, we have calculated the proportions of women living in a city and outside a city separately for all in-labor-force women (so that they sum up to 100% of women in the labor force) and for all not-in-labor-force women (so that they sum up to 100% of women not in the labor force).
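In Python the two tables differ only in which variable is passed as the index to pd.crosstab with normalize="index" (a sketch using the relabelled variables from before):

```python
import pandas as pd

# Proportions within each city group (each city row sums to 1),
# transposed so that the city labels become the columns
prop_within_city = pd.crosstab(city_lab, inlf_lab, normalize="index").T

# Proportions within each inlf group (each inlf row sums to 1)
prop_within_inlf = pd.crosstab(inlf_lab, city_lab, normalize="index")

print(prop_within_city.round(6))
print(prop_within_inlf.round(6))
```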

Note that we are not restricted to indicator variables only - we can just as successfully examine categorical variables containing multiple different categories.

We see that for some variables - like age, educ, exper, there is a difference between the two groups.

We would expect the following relationship between the independent variables and the participation rate:

  • \(\beta_{educ} > 0\) - a higher education is more likely to lead to active labor force participation, since the person has already invested time into improving their skills and has a better chance of earning a higher wage, hence they are more likely to be motivated to return to the workforce;
  • \(\beta_{exper} > 0\) - the more experience they have, the more likely they are to return to a higher-paying job, hence they are more motivated to return to the workforce;
  • \(\beta_{age} < 0\) - the older they are, the less likely they are to be hired back into a well-paying position, hence they may be discouraged from returning to the workforce;

In addition:

  • \(\beta_{kidslt6} < 0\) - the more young kids in the family, the higher the chance that one of the parents will remain at home to look after them;
  • \(\beta_{kidsge6} > 0\) - older children are at school or daycare, so the parent is less likely to stay at home, when they could be working (especially since buying school supplies, clothes, etc. would benefit from additional income).

Considerations

We can see that the correlations in R and Python are different. We can check for missing values:

##         [,1]
## hours      0
## age        0
## educ       0
## wage     174
## exper      0
## husage     0
## huseduc    0
## huswage    0
## faminc     0
## hours        0
## age          0
## educ         0
## wage       174
## exper        0
## husage       0
## huseduc      0
## huswage      0
## faminc       0
## dtype: int64
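The missing-value check above could be done in Python with something like (a sketch):

```python
# Count missing values in the columns used for the correlation matrix
cols = ["hours", "age", "educ", "wage", "exper",
        "husage", "huseduc", "huswage", "faminc"]
print(dt5_train[cols].isnull().sum())
```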

So, wage has 174 missing observations. If we compare the correlation matrix of all of these variables:

##               hours         age        educ        wage       exper      husage     huseduc     huswage      faminc
## hours    1.00000000  0.05486465 -0.06487945 -0.09762756  0.29923008  0.04584942 -0.08593461 -0.11031196  0.15073816
## age      0.05486465  1.00000000 -0.05217135  0.03039444  0.48364624  0.89442269 -0.06931940  0.08869675  0.11392319
## educ    -0.06487945 -0.05217135  1.00000000  0.34195444 -0.01520617 -0.06985821  0.59434325  0.30300522  0.36232809
## wage    -0.09762756  0.03039444  0.34195444  1.00000000  0.05499150  0.02565592  0.16632982  0.21588573  0.30265172
## exper    0.29923008  0.48364624 -0.01520617  0.05499150  1.00000000  0.41387168 -0.08321311 -0.11167355 -0.02748859
## husage   0.04584942  0.89442269 -0.06985821  0.02565592  0.41387168  1.00000000 -0.11392087  0.07238782  0.08672004
## huseduc -0.08593461 -0.06931940  0.59434325  0.16632982 -0.08321311 -0.11392087  1.00000000  0.39641601  0.35468441
## huswage -0.11031196  0.08869675  0.30300522  0.21588573 -0.11167355  0.07238782  0.39641601  1.00000000  0.66875621
## faminc   0.15073816  0.11392319  0.36232809  0.30265172 -0.02748859  0.08672004  0.35468441  0.66875621  1.00000000
##             hours       age      educ      wage     exper    husage   huseduc   huswage    faminc
## hours    1.000000 -0.009723  0.079324 -0.097628  0.399119 -0.020243 -0.033895 -0.096275  0.186113
## age     -0.009723  1.000000 -0.131189  0.030394  0.370998  0.888971 -0.154236  0.039179  0.065755
## educ     0.079324 -0.131189  1.000000  0.341954  0.065681 -0.160103  0.592238  0.295962  0.348235
## wage    -0.097628  0.030394  0.341954  1.000000  0.054991  0.025656  0.166330  0.215886  0.302652
## exper    0.399119  0.370998  0.065681  0.054991  1.000000  0.291947 -0.042077 -0.100794 -0.012938
## husage  -0.020243  0.888971 -0.160103  0.025656  0.291947  1.000000 -0.198246  0.029199  0.048197
## huseduc -0.033895 -0.154236  0.592238  0.166330 -0.042077 -0.198246  1.000000  0.389622  0.342842
## huswage -0.096275  0.039179  0.295962  0.215886 -0.100794  0.029199  0.389622  1.000000  0.679731
## faminc   0.186113  0.065755  0.348235  0.302652 -0.012938  0.048197  0.342842  0.679731  1.000000

with the correlation matrix without wage:

##                hours          age        educ       exper      husage     huseduc     huswage      faminc
## hours    1.000000000 -0.009722653  0.07932400  0.39911906 -0.02024281 -0.03389459 -0.09627502  0.18611296
## age     -0.009722653  1.000000000 -0.13118868  0.37099790  0.88897071 -0.15423616  0.03917928  0.06575474
## educ     0.079323997 -0.131188678  1.00000000  0.06568143 -0.16010274  0.59223834  0.29596240  0.34823451
## exper    0.399119061  0.370997900  0.06568143  1.00000000  0.29194744 -0.04207658 -0.10079356 -0.01293827
## husage  -0.020242809  0.888970711 -0.16010274  0.29194744  1.00000000 -0.19824603  0.02919907  0.04819651
## huseduc -0.033894595 -0.154236156  0.59223834 -0.04207658 -0.19824603  1.00000000  0.38962214  0.34284234
## huswage -0.096275018  0.039179278  0.29596240 -0.10079356  0.02919907  0.38962214  1.00000000  0.67973079
## faminc   0.186112959  0.065754743  0.34823451 -0.01293827  0.04819651  0.34284234  0.67973079  1.00000000
##             hours       age     exper    husage   huseduc   huswage    faminc
## hours    1.000000 -0.009723  0.399119 -0.020243 -0.033895 -0.096275  0.186113
## age     -0.009723  1.000000  0.370998  0.888971 -0.154236  0.039179  0.065755
## exper    0.399119  0.370998  1.000000  0.291947 -0.042077 -0.100794 -0.012938
## husage  -0.020243  0.888971  0.291947  1.000000 -0.198246  0.029199  0.048197
## huseduc -0.033895 -0.154236 -0.042077 -0.198246  1.000000  0.389622  0.342842
## huswage -0.096275  0.039179 -0.100794  0.029199  0.389622  1.000000  0.679731
## faminc   0.186113  0.065755 -0.012938  0.048197  0.342842  0.679731  1.000000

We see that Python calculates pairwise correlations, based on the observations available for each pair of variables, while R excludes all the rows that have NA values and calculates the correlations for every variable on the same data.

When estimating a model, all of the rows where at least one of the included explanatory variables is NA will be dropped.

  • In this case, R calculates the correlation matrix, which would show the variable correlations if we were to include wage in the model.

  • On the other hand, if we do not include wage, then the pairwise correlation matrix calculated in Python is the convenient one.

If we wanted, we could make Python drop the rows with NA values and calculate the correlation matrix, to get the same results as in R:
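A minimal sketch of both computations in pandas (using the column list from the missing-value check above):

```python
cols = ["hours", "age", "educ", "wage", "exper",
        "husage", "huseduc", "huswage", "faminc"]

# Pairwise-complete correlations (the pandas default)
corr_pairwise = dt5_train[cols].corr()

# Listwise deletion, as in R's correlation on complete cases only
corr_complete = dt5_train[cols].dropna().corr()

print(corr_complete.round(6))
```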

##             hours       age      educ      wage     exper    husage   huseduc   huswage    faminc
## hours    1.000000  0.054865 -0.064879 -0.097628  0.299230  0.045849 -0.085935 -0.110312  0.150738
## age      0.054865  1.000000 -0.052171  0.030394  0.483646  0.894423 -0.069319  0.088697  0.113923
## educ    -0.064879 -0.052171  1.000000  0.341954 -0.015206 -0.069858  0.594343  0.303005  0.362328
## wage    -0.097628  0.030394  0.341954  1.000000  0.054991  0.025656  0.166330  0.215886  0.302652
## exper    0.299230  0.483646 -0.015206  0.054991  1.000000  0.413872 -0.083213 -0.111674 -0.027489
## husage   0.045849  0.894423 -0.069858  0.025656  0.413872  1.000000 -0.113921  0.072388  0.086720
## huseduc -0.085935 -0.069319  0.594343  0.166330 -0.083213 -0.113921  1.000000  0.396416  0.354684
## huswage -0.110312  0.088697  0.303005  0.215886 -0.111674  0.072388  0.396416  1.000000  0.668756
## faminc   0.150738  0.113923  0.362328  0.302652 -0.027489  0.086720  0.354684  0.668756  1.000000

Finally, looking at the above plots and the data definitions, we note that someone not in the labor force should not have any wage. So, it may be of interest to check whether wage is NA only for someone not in the labor force.

We will use the whole dataset to get a better look:
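One way to check this in Python (a sketch on the full data frame dt5):

```python
import pandas as pd

# Cross-tabulate whether wage is missing against labor force participation:
# if wage is NA only for non-participants, the True row should have
# non-zero counts only in the inlf = 0 column
print(pd.crosstab(dt5["wage"].isnull(), dt5["inlf"]))
```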

As we suspected, wage is NA only if someone is not in the labor force.

We might at first think that it would be more practical to replace the NA values with zero to indicate that such people “earn” zero dollars. On the other hand, it is natural that someone not working will also not earn any money, so the cases with wage = 0 would automatically mean that someone is not in the labor force. In other words, if our model were to include wage with 0 instead of NA, then it would perfectly predict when someone is not in the labor force.

In other words the statement that:

If someone “earns” a wage of 0 - they are not in the labor force.

is obvious, as only working people receive wages.

Hence, we should not replace NA in wage with 0, since the NA values are used to indicate whether someone is in the labor force - they are not just missing data!

  • Overall, the current wage itself will not help us determine whether someone will return to the labor force or not.

  • If we were to have a variable that indicated, for example, the average wage earned by the person in question - it may be useful in determining whether such a person would return to the labor force or not.

5.4.1.3 Task 3

We will estimate both logit and probit models.
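A sketch of how both models could be estimated in Python with statsmodels (the formula follows the variables discussed above; logit_fit and probit_fit are hypothetical names):

```python
import statsmodels.formula.api as smf

formula = "inlf ~ educ + exper + age + kidslt6 + kidsge6"

logit_fit = smf.logit(formula, data=dt5_train).fit(disp=0)    # logistic regression
probit_fit = smf.probit(formula, data=dt5_train).fit(disp=0)  # probit regression

print(logit_fit.summary2().tables[1].round(5))
print(probit_fit.summary2().tables[1].round(5))
```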

##             Estimate Std. Error  z value Pr(>|z|)
## (Intercept)  1.44243    1.00899  1.42958  0.15284
## educ         0.18468    0.04800  3.84757  0.00012
## exper        0.12986    0.01691  7.67801  0.00000
## age         -0.09074    0.01700 -5.33889  0.00000
## kidslt6     -1.42174    0.22830 -6.22740  0.00000
## kidsge6      0.07344    0.08762  0.83818  0.40193
##              Coef.  Std.Err.        z    P>|z|   [0.025   0.975]
## Intercept  1.44243   1.00899  1.42958  0.15284 -0.53515  3.42001
## educ       0.18468   0.04800  3.84757  0.00012  0.09060  0.27875
## exper      0.12986   0.01691  7.67801  0.00000  0.09671  0.16302
## age       -0.09074   0.01700 -5.33889  0.00000 -0.12405 -0.05743
## kidslt6   -1.42174   0.22830 -6.22740  0.00000 -1.86921 -0.97427
## kidsge6    0.07344   0.08762  0.83818  0.40193 -0.09829  0.24517
##             Estimate Std. Error  z value Pr(>|z|)
## (Intercept)  0.93704    0.58188  1.61038  0.10731
## educ         0.11016    0.02755  3.99892  0.00006
## exper        0.07380    0.00925  7.98287  0.00000
## age         -0.05496    0.00969 -5.67441  0.00000
## kidslt6     -0.86249    0.13284 -6.49269  0.00000
## kidsge6      0.03735    0.05087  0.73418  0.46284
##              Coef.  Std.Err.        z    P>|z|   [0.025   0.975]
## Intercept  0.93705   0.58526  1.60109  0.10936 -0.21003  2.08413
## educ       0.11016   0.02764  3.98617  0.00007  0.05600  0.16433
## exper      0.07380   0.00903  8.16866  0.00000  0.05610  0.09151
## age       -0.05496   0.00971 -5.66135  0.00000 -0.07399 -0.03594
## kidslt6   -0.86249   0.13360 -6.45577  0.00000 -1.12434 -0.60064
## kidsge6    0.03735   0.05020  0.74404  0.45685 -0.06104  0.13574

The signs are as we would expect, but kidsge6 appears to be insignificant.

5.4.1.4 Task 4

We will check for collinearity via the VIF (or GVIF, if needed):

##             [,1]
## educ    1.074126
## exper   1.156804
## age     1.703891
## kidslt6 1.450408
## kidsge6 1.242962

Note that if we were to calculate the \(VIF\) on an lm() model:

##             [,1]
## educ    1.048660
## exper   1.232963
## age     1.521952
## kidslt6 1.242066
## kidsge6 1.234228

Then we would get the same results as in Python.

Since all of the \(VIF < 5\), we do not have any collinear variables.
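For reference, the VIFs could be computed in Python from the design matrix of the corresponding linear specification (a sketch using patsy and statsmodels):

```python
import pandas as pd
from patsy import dmatrix
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Design matrix of the explanatory variables (including the intercept)
X = dmatrix("educ + exper + age + kidslt6 + kidsge6",
            data=dt5_train, return_type="dataframe")

# VIF for each explanatory variable (the intercept column is skipped)
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
                index=X.columns[1:])
print(vif)
```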

5.4.2 Exercise Set 2

5.4.2.1 Task 5

We may want to include the following variables:

  • \(educ^2\) - each additional year of education would have a decreasing effect on the participation rate, so that \(\beta_{educ^2} < 0\);
  • \(exper^2\) - each additional year of experience would have a decreasing effect on the participation rate, so that \(\beta_{exper^2} < 0\);
  • \(age^2\) - each additional year of age would have an increasingly negative effect on the participation rate, so that \(\beta_{age^2} < 0\);
  • \(kidslt6^2\) - each additional young kid would significantly reduce the participation rate further, so that \(\beta_{kidslt6^2} < 0\)
  • \(kidsge6^2\) - each additional older kid would further increase the participation rate, since more money is needed for clothes, books, trips, after school activities, etc. On the other hand, younger children may get hand-me-downs from their older siblings.

Regarding interaction terms, let's assume that we want to include:

  • \(kidslt6 \times kidsge6\) - if the family already has a number of older kids and then has a younger child, it may be more likely that the older siblings will babysit the younger ones while the parents go to work.

If we try to include all of them at the same time:

##              Estimate Std. Error  z value Pr(>|z|)
## (Intercept)   2.37291    3.89189  0.60971  0.54206
## educ         -0.21699    0.31343 -0.69231  0.48874
## I(educ^2)     0.01618    0.01287  1.25699  0.20876
## exper         0.20877    0.03959  5.27303  0.00000
## I(exper^2)   -0.00293    0.00128 -2.29261  0.02187
## age          -0.04097    0.16160 -0.25351  0.79987
## I(age^2)     -0.00054    0.00187 -0.28590  0.77495
## kidslt6      -0.87533    0.62116 -1.40920  0.15878
## I(kidslt6^2) -0.34784    0.36620 -0.94985  0.34219
## kidsge6       0.10136    0.22997  0.44074  0.65940
## I(kidsge6^2) -0.00725    0.05133 -0.14126  0.88767
##                         Coef.  Std.Err.        z    P>|z|   [0.025    0.975]
## Intercept             2.37291   3.89189  0.60971  0.54206 -5.25505  10.00088
## educ                 -0.21699   0.31343 -0.69231  0.48874 -0.83129   0.39731
## np.power(educ, 2)     0.01618   0.01287  1.25699  0.20876 -0.00905   0.04141
## exper                 0.20877   0.03959  5.27303  0.00000  0.13117   0.28638
## np.power(exper, 2)   -0.00293   0.00128 -2.29261  0.02187 -0.00544  -0.00043
## age                  -0.04097   0.16160 -0.25351  0.79987 -0.35769   0.27576
## np.power(age, 2)     -0.00054   0.00187 -0.28590  0.77495 -0.00421   0.00314
## kidslt6              -0.87533   0.62116 -1.40920  0.15878 -2.09278   0.34211
## np.power(kidslt6, 2) -0.34784   0.36620 -0.94985  0.34219 -1.06557   0.36990
## kidsge6               0.10136   0.22997  0.44074  0.65940 -0.34938   0.55210
## np.power(kidsge6, 2) -0.00725   0.05133 -0.14126  0.88767 -0.10785   0.09335

We see lots of insignificant variables. Instead, let's try adding them one at a time.

##             Estimate Std. Error  z value Pr(>|z|)
## (Intercept)  3.57652    2.06186  1.73461  0.08281
## educ        -0.17722    0.30834 -0.57476  0.56546
## I(educ^2)    0.01497    0.01265  1.18385  0.23647
## exper        0.12976    0.01686  7.69479  0.00000
## age         -0.09132    0.01699 -5.37619  0.00000
## kidslt6     -1.44724    0.23135 -6.25566  0.00000
## kidsge6      0.08243    0.08795  0.93722  0.34864
##                      Coef.  Std.Err.        z    P>|z|   [0.025   0.975]
## Intercept          3.57652   2.06186  1.73461  0.08281 -0.46465  7.61770
## educ              -0.17722   0.30834 -0.57476  0.56546 -0.78155  0.42711
## np.power(educ, 2)  0.01497   0.01265  1.18385  0.23647 -0.00982  0.03977
## exper              0.12976   0.01686  7.69479  0.00000  0.09671  0.16281
## age               -0.09132   0.01699 -5.37619  0.00000 -0.12461 -0.05803
## kidslt6           -1.44724   0.23135 -6.25566  0.00000 -1.90068 -0.99381
## kidsge6            0.08243   0.08795  0.93722  0.34864 -0.08996  0.25482

We see that educ^2 and kidsge6 are insignificant. We remove educ^2 and continue adding our variables (ignoring the insignificance of kidsge6 for now):

##             Estimate Std. Error  z value Pr(>|z|)
## (Intercept)  1.17661    1.02507  1.14783  0.25104
## educ         0.17237    0.04844  3.55860  0.00037
## exper        0.20720    0.03857  5.37226  0.00000
## I(exper^2)  -0.00293    0.00124 -2.36180  0.01819
## age         -0.08862    0.01729 -5.12716  0.00000
## kidslt6     -1.41915    0.22987 -6.17366  0.00000
## kidsge6      0.07470    0.08919  0.83753  0.40229
##                       Coef.  Std.Err.        z    P>|z|   [0.025   0.975]
## Intercept           1.17661   1.02513  1.14776  0.25107 -0.83262  3.18584
## educ                0.17237   0.04844  3.55834  0.00037  0.07743  0.26731
## exper               0.20720   0.03857  5.37172  0.00000  0.13160  0.28280
## np.power(exper, 2) -0.00293   0.00124 -2.36144  0.01820 -0.00535 -0.00050
## age                -0.08862   0.01729 -5.12677  0.00000 -0.12251 -0.05474
## kidslt6            -1.41915   0.22989 -6.17327  0.00000 -1.86971 -0.96858
## kidsge6             0.07470   0.08919  0.83749  0.40232 -0.10011  0.24951

We note that exper^2 is significant. Before removing kidsge6, let's try to include the interaction term which we have been considering:
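A sketch of the corresponding formula in Python (mirroring the np.power() and kidslt6:kidsge6 terms shown in the output below; logit_interact is a hypothetical name):

```python
import numpy as np  # needed so that np.power() can be evaluated inside the formula
import statsmodels.formula.api as smf

formula_interact = ("inlf ~ educ + exper + np.power(exper, 2) + age"
                    " + kidslt6 + kidsge6 + kidslt6:kidsge6")
logit_interact = smf.logit(formula_interact, data=dt5_train).fit(disp=0)
print(logit_interact.summary2().tables[1].round(5))
```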

##                 Estimate Std. Error  z value Pr(>|z|)
## (Intercept)      1.70098    1.04919  1.62123  0.10497
## educ             0.19791    0.05020  3.94197  0.00008
## exper            0.21693    0.03877  5.59589  0.00000
## I(exper^2)      -0.00318    0.00123 -2.57767  0.00995
## age             -0.10395    0.01832 -5.67236  0.00000
## kidslt6         -2.28803    0.37195 -6.15140  0.00000
## kidsge6         -0.08289    0.09919 -0.83566  0.40334
## kidslt6:kidsge6  0.53630    0.16900  3.17341  0.00151
##                       Coef.  Std.Err.        z    P>|z|   [0.025   0.975]
## Intercept           1.70098   1.04919  1.62123  0.10497 -0.35540  3.75735
## educ                0.19791   0.05020  3.94197  0.00008  0.09951  0.29631
## exper               0.21693   0.03877  5.59589  0.00000  0.14095  0.29291
## np.power(exper, 2) -0.00318   0.00123 -2.57767  0.00995 -0.00560 -0.00076
## age                -0.10395   0.01832 -5.67236  0.00000 -0.13986 -0.06803
## kidslt6            -2.28803   0.37195 -6.15140  0.00000 -3.01705 -1.55902
## kidsge6            -0.08289   0.09919 -0.83566  0.40334 -0.27729  0.11152
## kidslt6:kidsge6     0.53630   0.16900  3.17341  0.00151  0.20507  0.86753

The interaction term is significant. However, the squared terms of the variables involved in the interaction are not:

##                 Estimate Std. Error  z value Pr(>|z|)
## (Intercept)      1.73274    1.09457  1.58304  0.11341
## educ             0.19860    0.05039  3.94126  0.00008
## exper            0.21690    0.03890  5.57533  0.00000
## I(exper^2)      -0.00319    0.00124 -2.57792  0.00994
## age             -0.10446    0.01904 -5.48616  0.00000
## kidslt6         -2.26474    0.74439 -3.04240  0.00235
## kidsge6         -0.11724    0.22999 -0.50978  0.61021
## I(kidslt6^2)    -0.01828    0.37168 -0.04918  0.96078
## I(kidsge6^2)     0.00829    0.04956  0.16729  0.86714
## kidslt6:kidsge6  0.53807    0.17471  3.07977  0.00207
##                         Coef.  Std.Err.        z    P>|z|   [0.025   0.975]
## Intercept             1.73274   1.09457  1.58304  0.11341 -0.41257  3.87806
## educ                  0.19860   0.05039  3.94126  0.00008  0.09984  0.29736
## exper                 0.21690   0.03890  5.57533  0.00000  0.14065  0.29315
## np.power(exper, 2)   -0.00319   0.00124 -2.57792  0.00994 -0.00562 -0.00076
## age                  -0.10446   0.01904 -5.48616  0.00000 -0.14178 -0.06714
## kidslt6              -2.26474   0.74439 -3.04240  0.00235 -3.72372 -0.80576
## kidsge6              -0.11724   0.22999 -0.50978  0.61021 -0.56801  0.33352
## kidslt6:kidsge6       0.53807   0.17471  3.07977  0.00207  0.19564  0.88050
## np.power(kidslt6, 2) -0.01828   0.37168 -0.04918  0.96078 -0.74677  0.71021
## np.power(kidsge6, 2)  0.00829   0.04956  0.16729  0.86714 -0.08885  0.10543

Hence, our finalized model is:

##                 Estimate Std. Error  z value Pr(>|z|)
## (Intercept)      1.70098    1.04919  1.62123  0.10497
## educ             0.19791    0.05020  3.94197  0.00008
## exper            0.21693    0.03877  5.59589  0.00000
## I(exper^2)      -0.00318    0.00123 -2.57767  0.00995
## age             -0.10395    0.01832 -5.67236  0.00000
## kidslt6         -2.28803    0.37195 -6.15140  0.00000
## kidsge6         -0.08289    0.09919 -0.83566  0.40334
## kidslt6:kidsge6  0.53630    0.16900  3.17341  0.00151
##                       Coef.  Std.Err.        z    P>|z|   [0.025   0.975]
## Intercept           1.70098   1.04919  1.62123  0.10497 -0.35540  3.75735
## educ                0.19791   0.05020  3.94197  0.00008  0.09951  0.29631
## exper               0.21693   0.03877  5.59589  0.00000  0.14095  0.29291
## np.power(exper, 2) -0.00318   0.00123 -2.57767  0.00995 -0.00560 -0.00076
## age                -0.10395   0.01832 -5.67236  0.00000 -0.13986 -0.06803
## kidslt6            -2.28803   0.37195 -6.15140  0.00000 -3.01705 -1.55902
## kidsge6            -0.08289   0.09919 -0.83566  0.40334 -0.27729  0.11152
## kidslt6:kidsge6     0.53630   0.16900  3.17341  0.00151  0.20507  0.86753

Again note that kidslt6:kidsge6 is significant, so we cannot remove the main effect kidsge6, even though it is insignificant (unless we want to make interpretation harder for ourselves, which we don't).

The same model is estimated for the probit regression:

##                 Estimate Std. Error  z value Pr(>|z|)
## (Intercept)      0.95926    0.60504  1.58545  0.11286
## educ             0.11640    0.02866  4.06061  0.00005
## exper            0.12800    0.02226  5.75096  0.00000
## I(exper^2)      -0.00188    0.00071 -2.65361  0.00796
## age             -0.06025    0.01036 -5.81485  0.00000
## kidslt6         -1.34130    0.21021 -6.38088  0.00000
## kidsge6         -0.04653    0.05730 -0.81207  0.41675
## kidslt6:kidsge6  0.31497    0.09829  3.20436  0.00135
##                       Coef.  Std.Err.        z    P>|z|   [0.025   0.975]
## Intercept           0.95926   0.60729  1.57958  0.11420 -0.23100  2.14953
## educ                0.11640   0.02868  4.05878  0.00005  0.06019  0.17260
## exper               0.12800   0.02195  5.83111  0.00000  0.08498  0.17102
## np.power(exper, 2) -0.00188   0.00070 -2.69562  0.00703 -0.00324 -0.00051
## age                -0.06025   0.01031 -5.84510  0.00000 -0.08045 -0.04005
## kidslt6            -1.34130   0.20747 -6.46511  0.00000 -1.74793 -0.93467
## kidsge6            -0.04653   0.05694 -0.81717  0.41383 -0.15813  0.06507
## kidslt6:kidsge6     0.31497   0.09781  3.22010  0.00128  0.12326  0.50668

The coefficient signs and significance are similar to those in the logit model - kidsge6 remains insignificant.

5.4.2.2 Task 6

We will calculate the predicted probability for different variable value combinations in a single plot. To do this we will use:

  • the range of ages in the training dataset as continuous data
  • and the number of kids less than 6 years old, kidslt6, as discrete data, since the number of kids is a finite number from the training dataset.

Note that the number of distinct values of age in the training data is:

So, we will need to create a dataset for prediction, which consists of 31 different values of age, ceteris paribus.

To do this, we will first repeat the first row of the data 31 times:

##   inlf hours kidslt6 kidsge6 age educ  wage repwage hushrs husage huseduc huswage faminc    mtr motheduc fatheduc unem city exper nwifeinc    lwage expersq
## 1    1  1610       1       0  32   12 3.354    2.65   2708     34      12  4.0288  16310 0.7215       12        7    5    0    14 10.91006 1.210154     196
## 2    1  1610       1       0  32   12 3.354    2.65   2708     34      12  4.0288  16310 0.7215       12        7    5    0    14 10.91006 1.210154     196
## 3    1  1610       1       0  32   12 3.354    2.65   2708     34      12  4.0288  16310 0.7215       12        7    5    0    14 10.91006 1.210154     196
## 4    1  1610       1       0  32   12 3.354    2.65   2708     34      12  4.0288  16310 0.7215       12        7    5    0    14 10.91006 1.210154     196
## 5    1  1610       1       0  32   12 3.354    2.65   2708     34      12  4.0288  16310 0.7215       12        7    5    0    14 10.91006 1.210154     196
##    inlf   hours  kidslt6  kidsge6   age  educ   wage  repwage  hushrs  husage  huseduc  huswage   faminc     mtr  motheduc  fatheduc  unem  city  exper  nwifeinc     lwage  expersq
## 0   1.0  1610.0      1.0      0.0  32.0  12.0  3.354     2.65  2708.0    34.0     12.0   4.0288  16310.0  0.7215      12.0       7.0   5.0   0.0   14.0  10.91006  1.210154    196.0
## 1   1.0  1610.0      1.0      0.0  32.0  12.0  3.354     2.65  2708.0    34.0     12.0   4.0288  16310.0  0.7215      12.0       7.0   5.0   0.0   14.0  10.91006  1.210154    196.0
## 2   1.0  1610.0      1.0      0.0  32.0  12.0  3.354     2.65  2708.0    34.0     12.0   4.0288  16310.0  0.7215      12.0       7.0   5.0   0.0   14.0  10.91006  1.210154    196.0
## 3   1.0  1610.0      1.0      0.0  32.0  12.0  3.354     2.65  2708.0    34.0     12.0   4.0288  16310.0  0.7215      12.0       7.0   5.0   0.0   14.0  10.91006  1.210154    196.0
## 4   1.0  1610.0      1.0      0.0  32.0  12.0  3.354     2.65  2708.0    34.0     12.0   4.0288  16310.0  0.7215      12.0       7.0   5.0   0.0   14.0  10.91006  1.210154    196.0

Next, we will replace age with values spanning the range of ages observed in the actual data:

And we will separate two cases: kidslt6 = 0 and kidslt6 = 1:
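A sketch of these three steps in Python with pandas (the object names are hypothetical):

```python
import numpy as np
import pandas as pd

# Repeat the first row of the training data once for every age value in the data
age_values = np.arange(dt5_train["age"].min(), dt5_train["age"].max() + 1)
pred_data = pd.concat([dt5_train.iloc[[0]]] * len(age_values), ignore_index=True)

# Replace age with the full range of observed ages
pred_data["age"] = age_values

# Two scenarios: no children under 6 versus one child under 6
pred_data_0 = pred_data.assign(kidslt6=0)
pred_data_1 = pred_data.assign(kidslt6=1)
```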

Then, the predicted values are (here we also calculate their standard errors, which we will use to calculate the confidence bounds):

The parameter variance-covariance matrix, \(\mathbb{V}{\rm ar}( \widehat{\boldsymbol{\beta}})\):

##                   (Intercept)          educ         exper    I(exper^2)           age       kidslt6       kidsge6 kidslt6:kidsge6
## (Intercept)      1.1007987013 -2.844867e-02 -2.930640e-03  1.411362e-04 -1.508642e-02 -1.278700e-01 -4.944921e-02    2.712274e-02
## educ            -0.0284486727  2.520516e-03 -1.496552e-04  5.820870e-06 -1.804033e-05 -4.677607e-03 -5.880139e-05    1.352341e-03
## exper           -0.0029306398 -1.496552e-04  1.502766e-03 -4.365827e-05 -7.334632e-05 -1.375430e-03  8.216827e-05    7.159497e-04
## I(exper^2)       0.0001411362  5.820870e-06 -4.365827e-05  1.524588e-06 -5.787982e-07  2.636548e-05  3.380823e-06   -1.767184e-05
## age             -0.0150864155 -1.804033e-05 -7.334632e-05 -5.787982e-07  3.358019e-04  3.584616e-03  7.871850e-04   -8.362006e-04
## kidslt6         -0.1278700208 -4.677607e-03 -1.375430e-03  2.636548e-05  3.584616e-03  1.383491e-01  1.643111e-02   -4.760569e-02
## kidsge6         -0.0494492103 -5.880139e-05  8.216827e-05  3.380823e-06  7.871850e-04  1.643111e-02  9.837965e-03   -7.636762e-03
## kidslt6:kidsge6  0.0271227419  1.352341e-03  7.159497e-04 -1.767184e-05 -8.362006e-04 -4.760569e-02 -7.636762e-03    2.856037e-02
##                     Intercept      educ     exper  np.power(exper, 2)           age   kidslt6   kidsge6  kidslt6:kidsge6
## Intercept            1.100799 -0.028449 -0.002931        1.411362e-04 -1.508642e-02 -0.127870 -0.049449         0.027123
## educ                -0.028449  0.002521 -0.000150        5.820871e-06 -1.804037e-05 -0.004678 -0.000059         0.001352
## exper               -0.002931 -0.000150  0.001503       -4.365828e-05 -7.334637e-05 -0.001375  0.000082         0.000716
## np.power(exper, 2)   0.000141  0.000006 -0.000044        1.524588e-06 -5.787973e-07  0.000026  0.000003        -0.000018
## age                 -0.015086 -0.000018 -0.000073       -5.787973e-07  3.358020e-04  0.003585  0.000787        -0.000836
## kidslt6             -0.127870 -0.004678 -0.001375        2.636550e-05  3.584617e-03  0.138349  0.016431        -0.047606
## kidsge6             -0.049449 -0.000059  0.000082        3.380825e-06  7.871852e-04  0.016431  0.009838        -0.007637
## kidslt6:kidsge6      0.027123  0.001352  0.000716       -1.767184e-05 -8.362008e-04 -0.047606 -0.007637         0.028560

We can create the design \(\mathbf{X}\) matrix using the model formulas as follows:

##   (Intercept) educ exper I(exper^2) age kidslt6 kidsge6 kidslt6:kidsge6
## 1           1   12    14        196  30       0       0               0
## 2           1   12    14        196  31       0       0               0
## 3           1   12    14        196  32       0       0               0
## 4           1   12    14        196  33       0       0               0
## 5           1   12    14        196  34       0       0               0
##    Intercept  educ  exper  np.power(exper, 2)   age  kidslt6  kidsge6  kidslt6:kidsge6
## 0        1.0  12.0   14.0               196.0  30.0      0.0      0.0              0.0
## 1        1.0  12.0   14.0               196.0  31.0      0.0      0.0              0.0
## 2        1.0  12.0   14.0               196.0  32.0      0.0      0.0              0.0
## 3        1.0  12.0   14.0               196.0  33.0      0.0      0.0              0.0
## 4        1.0  12.0   14.0               196.0  34.0      0.0      0.0              0.0

Next, we can calculate the standard errors of the linear predictor, \(\sqrt{\mathbb{V}{\rm ar}(\mathbf{X} \widehat{\boldsymbol{\beta}})}\):

##  [1] 0.4006169 0.3848265 0.3692703 0.3539793 0.3389892 0.3243418 0.3100858 0.2962775 0.2829825 0.2702766 0.2582468 0.2469919 0.2366224 0.2272597 0.2190327 0.2120739 0.2065113 0.2024601 0.2000123 0.1992268 0.2001233 0.2026794 0.2068336 0.2124923 0.2195390 0.2278452 0.2372784 0.2477101 0.2590195 0.2710970 0.2838443
## [0.40061697 0.38482656 0.36927039 0.35397932 0.33898923 0.32434187
##  0.31008579 0.29627749 0.28298253 0.27027666 0.25824686 0.24699192
##  0.23662245 0.22725968 0.21903275 0.21207389 0.20651132 0.20246017
##  0.20001229 0.1992268  0.20012328 0.20267939 0.20683363 0.21249228
##  0.21953905 0.22784516 0.23727842 0.24771008 0.25901954 0.27109697
##  0.28384434]

Note that this gives the same results as in R, if we predict the log-odds:

##         1         2         3         4         5         6         7         8         9        10        11        12        13        14        15        16        17        18        19        20        21        22        23        24        25        26        27        28        29        30        31 
## 0.4006169 0.3848265 0.3692703 0.3539793 0.3389892 0.3243418 0.3100858 0.2962775 0.2829825 0.2702766 0.2582468 0.2469919 0.2366224 0.2272597 0.2190327 0.2120739 0.2065113 0.2024601 0.2000123 0.1992268 0.2001233 0.2026794 0.2068336 0.2124923 0.2195390 0.2278452 0.2372784 0.2477101 0.2590195 0.2710970 0.2838443
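These standard errors could be computed by hand as follows (a sketch; the design matrix is built with patsy from the same formula, so that its columns line up with the parameter vector of logit_interact from before):

```python
import numpy as np
from patsy import dmatrix

# Design matrix for the prediction data (kidslt6 = 0 scenario)
X_pred = dmatrix("educ + exper + np.power(exper, 2) + age"
                 " + kidslt6 + kidsge6 + kidslt6:kidsge6",
                 data=pred_data_0, return_type="dataframe")

# Var(X b) = X Var(b) X'; the square roots of its diagonal are the
# standard errors of the linear predictor (the log-odds)
V_beta = logit_interact.cov_params()
se_link = np.sqrt(np.diag(X_pred.values @ V_beta.values @ X_pred.values.T))
print(se_link)
```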

Finally, we can do the same for the other data frame:

##  [1] 0.3431336 0.3354213 0.3285511 0.3225768 0.3175490 0.3135132 0.3105082 0.3085639 0.3077007 0.3079274 0.3092418 0.3116301 0.3150678 0.3195211 0.3249482 0.3313013 0.3385283 0.3465744 0.3553841 0.3649020 0.3750743 0.3858491 0.3971776 0.4090135 0.4213143 0.4340403 0.4471553 0.4606260 0.4744222 0.4885162 0.5028830
## [0.34313361 0.33542129 0.32855107 0.32257678 0.31754899 0.31351322
##  0.31050817 0.30856396 0.30770068 0.30792743 0.30924182 0.31163009
##  0.3150678  0.3195211  0.32494823 0.33130133 0.33852828 0.34657442
##  0.3553841  0.36490204 0.37507431 0.38584917 0.39717758 0.40901355
##  0.4213143  0.43404032 0.44715531 0.46062603 0.47442219 0.48851621
##  0.50288306]

Considerations

As per (this answer on stackoverflow), the standard errors for type = "response" predictions give a first-order approximation of the true standard error.

This first-order approximation is known as the (Delta method) and an example is given (here).

We can compare with the results in R:

##          1          2          3          4          5          6          7          8          9         10         11         12         13         14         15         16         17         18         19         20         21         22         23         24         25         26         27         28         29         30         31 
## 0.01286851 0.01361606 0.01438090 0.01516033 0.01595159 0.01675199 0.01755923 0.01837176 0.01918925 0.02001320 0.02084766 0.02170008 0.02258218 0.02351082 0.02450861 0.02560416 0.02683167 0.02822957 0.02983826 0.03169688 0.03383959 0.03629184 0.03906718 0.04216511 0.04557000 0.04925119 0.05316388 0.05725080 0.06144430 0.06566886 0.06984389

As the link function is generally non-linear, it is better to compute the confidence interval on the link (linear predictor) scale, and then map it to the response scale.

There is also a problem with the way that glm() produces “prediction” intervals - see (1), (2), (3) and (4)

And finally, we can calculate the upper and lower confidence bounds for the predicted probability:
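A sketch of the bounds computed on the link scale and then mapped back to probabilities with the inverse logit (reusing X_pred and se_link from the previous sketch):

```python
import numpy as np
from scipy.stats import norm

# Linear predictor (log-odds) and its 95% confidence bounds
link_fit = X_pred.values @ logit_interact.params.values
z_crit = norm.ppf(0.975)
link_lower = link_fit - z_crit * se_link
link_upper = link_fit + z_crit * se_link

# Map everything to the probability scale with the inverse logit function
inv_logit = lambda x: 1.0 / (1.0 + np.exp(-x))
prob_fit = inv_logit(link_fit)
prob_lower = inv_logit(link_lower)
prob_upper = inv_logit(link_upper)
```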

The plot, along with the \(95\%\) confidence interval for the predicted probability is:

From the above we see that:

  • If someone does not have any kids under 6 years old (i.e. kidslt6 = 0), then the probability that they would return to the workforce is higher than if someone has one child under 6 years (i.e. kidslt6 = 1).

  • If a woman is 30 years old and has no children under 6 years old (i.e. age = 30 and kidslt6 = 0), then the probability of returning to the labor force is very high, and the confidence interval is quite narrow.

Considerations

Furthermore, at least visually, the confidence intervals for the predicted probability appear to be narrower for someone with kidslt6 = 0, up until the age of 60. However, how likely is it that someone will have young children in their late 50s?

We can even iterate this for various values of the number of kids under 6 years old:

  • We can conclude that the more children under 6 years old are in the family, the less likely that the mother would return to the labor force. If kidslt6 = 3, then the predicted probability is very close to zero, regardless of age.

  • The logit and probit models provide similar predicted probability values, hence we see little to no reason to use one over the other in terms of the prediction.

Finally, we can also do the same with the number of kids over 6 years old, kidsge6:

  • We see a similar trend - the more children there are, the less likely that the mother would return to the labor force.

  • On the other hand, if we compare with the previous plots with the changes in kidslt6 - we see that the decrease in probability of being in the labor force with each additional kidsge6 is less than the decrease in probability with each additional kidslt6.

Considerations

Note that both for kidslt6 and kidsge6, we have only used existing cases, i.e. up to the maximum value from the dataset.

How much could we trust our model(s) to predict for values of kidslt6 and kidsge6 higher than those in the dataset?

5.4.2.3 Task 7

We begin by examining the results using the default cutoff \(0.5\):

We compare the results with the true values:

## 
##   0   1 
## 174 428
## 1.0    428
## 0.0    174
## Name: inlf, dtype: int64

Considerations

From the above, it may be easier to format the results as:

This also highlights the differences in the way that the results are formatted in R and Python.

We see that:

  • We have correctly predicted that 386 women will return to the labor force.
  • Furthermore, we have correctly predicted that 77 women will not return to the labor force.

On the other hand:

  • We have incorrectly predicted that 97 women will return to the labor force (they actually did not return);
  • We have incorrectly predicted that 42 women will not return to the labor force (they actually did return);

Considerations

In total, if we were interested in determining the amount of jobs needed for women, returning to the labor force - the difference between the number of predicted returns, and the true returns to labor force is:

In other words, we would have overestimated the overall amount of women returning to the workforce. This would have several complications:

  • too many vacant job positions may be created;
  • too many school/daycare workers would be hired (as some of the women decided to remain at home to take care of their children);
  • if we made any other decisions regarding compensation/wage based on the expected value of returns to labor force - it would be overestimated;

If we were to use the optimal cutoff, which is based on minimizing the misclassification error:

The optimal cutoff is around 0.045 less than the default - this seems like a small value, but the final confusion matrix is now different:
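One way such a cutoff could be searched for is a simple grid search over candidate thresholds, minimizing the in-sample misclassification error (a sketch, reusing p_hat from before):

```python
import numpy as np

thresholds = np.linspace(0.01, 0.99, 99)
errors = [np.mean((p_hat >= c).astype(int) != dt5_train["inlf"]) for c in thresholds]
optimal_cutoff = thresholds[int(np.argmin(errors))]
print(optimal_cutoff)
```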

Take note of whether the columns (as in R) or the rows (as in Python) sum up to the true totals.

  • the amount of correct returns to labor force predictions increased from 386 to 401 (GOOD);
  • the amount of correct non-returns to labor force predictions decreased from 77 to 71 (BAD);
  • the amount of incorrect non-returns to labor force predictions (women predicted not to return, who actually did return) decreased from 42 to 27 (GOOD);
  • the amount of incorrect returns to labor force predictions (women predicted to return, who actually did not) increased from 97 to 103 (BAD);

Considerations

In total, if we were interested in determining the amount of jobs needed for women, returning to the labor force - the difference between the number of predicted returns, and the true returns to labor force is:

We see an important result:

  • Our total prediction of correct labor force returns increased;
  • Our incorrect predictions resulted in even more vacant jobs created than before

In general it is important to determine what our objective is:

  • If we are interested in correct prediction on an individual level - for example, if our correct predictions allow us to create specific specialized job positions - then the increase in correct predictions - the True Positive Rate (\(TPR\)) - outweighs the increase in the False Positive Rate (\(FPR\)).

Compared to the default cutoff:

We have correctly classified \(3.89\%\) more returns to labor force with the optimal cutoff.

On the other hand:

We have incorrectly classified \(6.19\%\) more false returns to labor force (i.e. we predicted 1, but it was actually 0) using the optimal cutoff, which is based on minimizing the misclassification error.

If we want to look at the total correctly predicted returns to labor out of all predicted returns to labor (i.e. predicted 1’s):

The optimal cutoff gives a slightly lower precision value, indicating that our model may be too optimistic (we are assuming that inlf = 1 - the return/participation in the labor force - is a positive result).

On the other hand, basing our analysis on the results using the optimal cutoff may result in the creation of more job positions - if this is important to our analysis, we may need to fine-tune the cutoff further.

Considerations

See 5.2.6.5, 5.2.6.7.1 and 5.2.6.8 for alternative methods for choosing the optimal cutoff.

Comparing with the probit model:

We see that the default cutoff produces the same results. Meanwhile the optimal cutoff:

is a little bit higher, compared to the logit case.

This results in the following confusion matrix table using the optimal cutoff:

Our insights are now slightly different:

  • The amount of correctly predicted returns to labor force decreased from 386 to 365 (BAD);
  • The amount of correctly predicted non-returns to labor increased from 77 to 107 (GOOD);
  • The amount of incorrectly predicted non-returns to labor increased from 42 to 63 (BAD);
  • The amount of incorrectly predicted returns to labor decreased from 97 to 67 (GOOD);

So, while the models are close, the automated optimal cutoff procedures give slightly different optimal cutoff points.

Considerations

Now, the total amount of jobs that we would create is:

We would predict 4 more people returning to the labor force than would return in reality.

On the other hand, we may have just gotten lucky with the amount of false positives. After all, our model still has to accurately predict the true 1's.

Furthermore, we now incorrectly classify that a woman will not return to the labor force (when in fact she will return):

With the optimal cutoff, we would make this incorrect classification \(50\%\) more often than with the default cutoff! If it is vital that we correctly classify those women who will return to the labor force as returning to the labor force (maybe based on their work experience, number of children, and other factors, which are specifically different for these misclassified cases), then this is quite bad.

Furthermore:

While our overall amount of predicted returns is close to the true amount of women returning to the labor force, we see that, using the optimal cutoff, we have made more mistakes in classifying which specific women will return to the labor force. If the women were identical in terms of family composition, education, age and all other factors, this might not be a problem.

In reality, unfortunately, this is not the case - we may have misclassified women who have more experience as not returning to the workforce, while classifying women with less experience as returning. As a result, we may opt to create more lower-skill job positions, for which the women actually returning to the workforce will be overqualified, which may result in those women not taking the jobs and either being unemployed or - even worse - leaving the workforce altogether!

On the other hand, our model precision appears to have increased:

So, looking at it from an economic/policymaker perspective - it is quite difficult to determine what is more important - it all depends on our task and the data that we have. Consequently, we need to take care when analysing this type of data as the results of \(TPR\), \(FPR\), \(Precision\), \(Accuracy\) and any other statistic may not be as useful in every case.

5.4.2.4 Task 8

We see that the logit and probit models do not differ much in terms of their ROC curves.

Finally, we want to examine the AUC:

Generally, an AUC of \(0.7 - 0.8\) indicates an adequate model, \(0.5\) indicates random guessing, and \(0.5 - 0.7\) indicates a poor model. Ideally, we would strive for an AUC value \(> 0.8\).

The AUC for both logit and probit models is around \(0.8\), indicating that our estimated models are adequate. The AUC is slightly higher for the logit regression.

Considerations

We can plot both the \(ROC\) and the \(AUC\) for each model separately:

Note that the values may be slightly different in R due to the way that the threshold vector is created.
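A sketch of how the ROC curve and AUC could be computed for one of the models with scikit-learn (reusing the fitted probabilities p_hat from before; the probit case is analogous):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

fpr, tpr, _ = roc_curve(dt5_train["inlf"], p_hat)
auc_value = roc_auc_score(dt5_train["inlf"], p_hat)

plt.plot(fpr, tpr, label=f"logit (AUC = {auc_value:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--")  # random-guessing reference line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```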

5.4.3 Exercise Set 3

5.4.3.1 Task 9

The parameters of a linear regression model have a straightforward interpretation:

\[ \beta_j \text{ measures the effect of a unit change in } X_j \text{ on } \mathbb{E}(Y|\mathbf{X}) \]

On the other hand, the parameters of nonlinear logit and probit models are not as easy to interpret, since the linear predictor affects the mean through the link function. Nevertheless, we can measure this effect via the marginal effects:

\[ \dfrac{\partial \mu}{\partial X_j} = \dfrac{\partial g^{-1}(\mathbf{X} \boldsymbol{\beta})}{\partial X_j} \]

which have the same interpretations as the coefficients in the linear model.

The partial effects differ across regressor values, which makes it difficult to present them in a compact way. Therefore, it is common to aggregate the effects in two ways:

  • Partial effects at the average value of \(X\) (\(PEA\));

  • Average partial effects (\(APE\)).

We will begin by examining the coefficients of the logistic regression themselves:

##                 Estimate Std. Error  z value Pr(>|z|)
## (Intercept)      1.70098    1.04919  1.62123  0.10497
## educ             0.19791    0.05020  3.94197  0.00008
## exper            0.21693    0.03877  5.59589  0.00000
## I(exper^2)      -0.00318    0.00123 -2.57767  0.00995
## age             -0.10395    0.01832 -5.67236  0.00000
## kidslt6         -2.28803    0.37195 -6.15140  0.00000
## kidsge6         -0.08289    0.09919 -0.83566  0.40334
## kidslt6:kidsge6  0.53630    0.16900  3.17341  0.00151
##                       Coef.  Std.Err.        z    P>|z|   [0.025   0.975]
## Intercept           1.70098   1.04919  1.62123  0.10497 -0.35540  3.75735
## educ                0.19791   0.05020  3.94197  0.00008  0.09951  0.29631
## exper               0.21693   0.03877  5.59589  0.00000  0.14095  0.29291
## np.power(exper, 2) -0.00318   0.00123 -2.57767  0.00995 -0.00560 -0.00076
## age                -0.10395   0.01832 -5.67236  0.00000 -0.13986 -0.06803
## kidslt6            -2.28803   0.37195 -6.15140  0.00000 -3.01705 -1.55902
## kidsge6            -0.08289   0.09919 -0.83566  0.40334 -0.27729  0.11152
## kidslt6:kidsge6     0.53630   0.16900  3.17341  0.00151  0.20507  0.86753

In the logit model, we can interpret the coefficients in terms of the log-odds-ratio, as we did before. Some of the coefficients can be interpreted as follows:

  • If educ increases by 1 year then the log-odds of returning to the labor force (versus not returning) increase by 0.19791.
  • If age increases by 1 year, then the log-odds of returning to the labor force (versus not returning) decrease by 0.10395.
  • If kidslt6 increases by 1 (i.e. if we have an additional child, younger than 6 years), then the log-odds of returning to the labor force (versus not returning) changes by \(−2.2880 + 0.53630 \cdot kidsge6\), ceteris paribus.
  • If exper increases by 1 year, then the log-odds of returning to the labor force (versus not returning) changes by \(0.21693 - 0.00318 \cdot (2 \cdot exper + 1)\), ceteris paribus.

We can also interpret them in terms of the odds-ratio:

##                 Estimate
## (Intercept)      5.47929
## educ             1.21885
## exper            1.24225
## I(exper^2)       0.99682
## age              0.90127
## kidslt6          0.10147
## kidsge6          0.92046
## kidslt6:kidsge6  1.70967
## Intercept             5.47929
## educ                  1.21885
## exper                 1.24225
## np.power(exper, 2)    0.99682
## age                   0.90127
## kidslt6               0.10147
## kidsge6               0.92046
## kidslt6:kidsge6       1.70967
## Name: Coef., dtype: float64
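These odds ratios are simply the exponentiated logit coefficients (a sketch):

```python
import numpy as np

print(np.exp(logit_interact.params).round(5))
```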

The interpretation is then as follows:

  • If educ increases by 1 year, then the odds of returning to the labor force (versus not returning) increase by a factor of 1.21885.
  • If age increases by 1 year, then the odds of returning to the labor force (versus not returning) change by a factor of 0.90127, i.e. they decrease.
  • and so on.

Next up, we want to evaluate the partial effects of the explanatory variables on the probability:
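In statsmodels the partial effects can be obtained with get_margeff(); at="mean" gives the PEA shown below and at="overall" gives the APE (a sketch):

```python
# Partial effects at the average regressor values (PEA)
print(logit_interact.get_margeff(at="mean", method="dydx").summary())

# Average partial effects (APE)
print(logit_interact.get_margeff(at="overall", method="dydx").summary())
```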

##                         dF/dx    Std. Err.          z        P>|z|
## educ             0.0356010216 0.0089769875  3.9658094 7.314729e-05
## exper            0.0390228423 0.0069853509  5.5863825 2.318482e-08
## I(exper^2)      -0.0005725429 0.0002234722 -2.5620319 1.040618e-02
## age             -0.0186986384 0.0032165601 -5.8132408 6.127481e-09
## kidslt6         -0.4115914788 0.0666696754 -6.1735936 6.675501e-10
## kidsge6         -0.0149103840 0.0177874785 -0.8382517 4.018894e-01
## kidslt6:kidsge6  0.0964745724 0.0304359033  3.1697621 1.525638e-03
##         Logit Marginal Effects       
## =====================================
## Dep. Variable:                   inlf
## Method:                          dydx
## At:                              mean
## ======================================================================================
##                         dy/dx    std err          z      P>|z|      [0.025      0.975]
## --------------------------------------------------------------------------------------
## educ                   0.0356      0.009      3.966      0.000       0.018       0.053
## exper                  0.0390      0.007      5.586      0.000       0.025       0.053
## np.power(exper, 2)    -0.0006      0.000     -2.562      0.010      -0.001      -0.000
## age                   -0.0187      0.003     -5.813      0.000      -0.025      -0.012
## kidslt6               -0.4116      0.067     -6.174      0.000      -0.542      -0.281
## kidsge6               -0.0149      0.018     -0.838      0.402      -0.050       0.020
## kidslt6:kidsge6        0.0965      0.030      3.170      0.002       0.037       0.156
## ======================================================================================

PEA - the Partial Effects at the Average - shows that:

  • for a woman with sample-average characteristics, a unit increase in educ results in a 0.0356 increase in the probability of returning to the labor force;
  • for a woman with sample-average characteristics, a unit increase in age results in a 0.018698 decrease in the probability of returning to the labor force;
  • if kidslt6 (the number of kids under the age of 6) increases by one from its sample average, then the probability of returning to the labor force changes by \(-0.411591 + 0.096475 \cdot \overline{kidsge6}\);
  • if exper increases by one from its sample average, then the probability of returning to the labor force changes by \(0.039023 - 0.000573 \cdot (2 \cdot \overline{exper})\).

##                         dF/dx    Std. Err.          z        P>|z|
## educ             0.0299008551 0.0081927749  3.6496615 2.625861e-04
## exper            0.0327747997 0.0068154334  4.8089091 1.517562e-06
## I(exper^2)      -0.0004808716 0.0001937699 -2.4816628 1.307709e-02
## age             -0.0157047537 0.0032096172 -4.8930302 9.929517e-07
## kidslt6         -0.3456905620 0.0667862419 -5.1760745 2.266030e-07
## kidsge6         -0.0125230460 0.0150278296 -0.8333237 4.046622e-01
## kidslt6:kidsge6  0.0810277930 0.0268936290  3.0128992 2.587649e-03
##         Logit Marginal Effects       
## =====================================
## Dep. Variable:                   inlf
## Method:                          dydx
## At:                           overall
## ======================================================================================
##                         dy/dx    std err          z      P>|z|      [0.025      0.975]
## --------------------------------------------------------------------------------------
## educ                   0.0299      0.007      4.112      0.000       0.016       0.044
## exper                  0.0328      0.005      6.186      0.000       0.022       0.043
## np.power(exper, 2)    -0.0005      0.000     -2.634      0.008      -0.001      -0.000
## age                   -0.0157      0.003     -6.214      0.000      -0.021      -0.011
## kidslt6               -0.3457      0.050     -6.894      0.000      -0.444      -0.247
## kidsge6               -0.0125      0.015     -0.836      0.403      -0.042       0.017
## kidslt6:kidsge6        0.0810      0.025      3.263      0.001       0.032       0.130
## ======================================================================================

APE - the Average Partial Effect - shows that:

  • a unit increase in educ results in the sample average estimated increase of 0.0299 in the probability of returning to the labor force;
  • a unit increase in age results in the sample average estimated decrease of 0.015705 in the probability of returning to the labor force;
  • a unit increase in kidslt6 results in a sample average estimated change of \(-0.345691 + 0.081028 \cdot \overline{kidsge6}\) in the probability of returning to the labor force.
    Note: this has the same form as the PEA expression, which stems from the fact that, through the interaction term, the partial effect of kidslt6 depends linearly on kidsge6. Since the APE averages the partial effects over all observations, kidsge6 is replaced by its average \(\overline{kidsge6}\), whereas the PEA plugs \(\overline{kidsge6}\) in directly. The two expressions therefore have the same form, although with different coefficient values (see the sketch below).
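
As an illustration, the combined kidslt6 effect can be evaluated by plugging the sample average of kidsge6 into both expressions. A sketch, assuming the training data frame is called `dt_train` (our name, not from the original code):

```python
kidsge6_mean = dt_train["kidsge6"].mean()

# Combined effect of one more child under 6 on the probability:
pea_kidslt6 = -0.411591 + 0.096475 * kidsge6_mean   # PEA: effects evaluated at the means
ape_kidslt6 = -0.345691 + 0.081028 * kidsge6_mean   # APE: effects averaged over the sample
print(pea_kidslt6, ape_kidslt6)
```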

We can do the same with the probit regression:
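
The probit model is estimated with the same formula as the logit. A minimal sketch, again assuming the training data frame is called `dt_train`:

```python
import numpy as np
import statsmodels.formula.api as smf

# Probit regression with a quadratic in experience and a kidslt6:kidsge6 interaction
probit_res = smf.probit(
    "inlf ~ educ + exper + np.power(exper, 2) + age + kidslt6 * kidsge6",
    data=dt_train
).fit()

# Coefficient table, rounded as in the output below
print(probit_res.summary2().tables[1].round(5))
```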

##                 Estimate Std. Error  z value Pr(>|z|)
## (Intercept)      0.95926    0.60504  1.58545  0.11286
## educ             0.11640    0.02866  4.06061  0.00005
## exper            0.12800    0.02226  5.75096  0.00000
## I(exper^2)      -0.00188    0.00071 -2.65361  0.00796
## age             -0.06025    0.01036 -5.81485  0.00000
## kidslt6         -1.34130    0.21021 -6.38088  0.00000
## kidsge6         -0.04653    0.05730 -0.81207  0.41675
## kidslt6:kidsge6  0.31497    0.09829  3.20436  0.00135
##                       Coef.  Std.Err.        z    P>|z|   [0.025   0.975]
## Intercept           0.95926   0.60729  1.57958  0.11420 -0.23100  2.14953
## educ                0.11640   0.02868  4.05878  0.00005  0.06019  0.17260
## exper               0.12800   0.02195  5.83111  0.00000  0.08498  0.17102
## np.power(exper, 2) -0.00188   0.00070 -2.69562  0.00703 -0.00324 -0.00051
## age                -0.06025   0.01031 -5.84510  0.00000 -0.08045 -0.04005
## kidslt6            -1.34130   0.20747 -6.46511  0.00000 -1.74793 -0.93467
## kidsge6            -0.04653   0.05694 -0.81717  0.41383 -0.15813  0.06507
## kidslt6:kidsge6     0.31497   0.09781  3.22010  0.00128  0.12326  0.50668

Interpretation follows a similar idea to the logit, with the difference being that instead of the log-odds-ratio we now have the inverse normal cdf of the probability, \(\Phi^{-1}(p_i)\).

Consequently, it makes more sense to evaluate the PEA and APE:

##                         dF/dx    Std. Err.          z        P>|z|
## educ             0.0364118090 0.0089471120  4.0696718 4.707942e-05
## exper            0.0400413820 0.0069878069  5.7301786 1.003249e-08
## I(exper^2)      -0.0005869283 0.0002220151 -2.6436419 8.201938e-03
## age             -0.0188477517 0.0032028840 -5.8846188 3.989724e-09
## kidslt6         -0.4195922325 0.0658753362 -6.3694891 1.896590e-10
## kidsge6         -0.0145560517 0.0178951313 -0.8134085 4.159839e-01
## kidslt6:kidsge6  0.0985304764 0.0307986942  3.1991771 1.378205e-03
##        Probit Marginal Effects       
## =====================================
## Dep. Variable:                   inlf
## Method:                          dydx
## At:                              mean
## ======================================================================================
##                         dy/dx    std err          z      P>|z|      [0.025      0.975]
## --------------------------------------------------------------------------------------
## educ                   0.0364      0.009      4.073      0.000       0.019       0.054
## exper                  0.0400      0.007      5.806      0.000       0.027       0.054
## np.power(exper, 2)    -0.0006      0.000     -2.684      0.007      -0.001      -0.000
## age                   -0.0188      0.003     -5.924      0.000      -0.025      -0.013
## kidslt6               -0.4196      0.065     -6.476      0.000      -0.547      -0.293
## kidsge6               -0.0146      0.018     -0.819      0.413      -0.049       0.020
## kidslt6:kidsge6        0.0985      0.031      3.220      0.001       0.039       0.159
## ======================================================================================
##                         dF/dx    Std. Err.          z        P>|z|
## educ             0.0301981575 0.0071744405  4.2091307 2.563551e-05
## exper            0.0332083463 0.0053187829  6.2435987 4.276162e-10
## I(exper^2)      -0.0004867694 0.0001802354 -2.7007420 6.918499e-03
## age             -0.0156313952 0.0024919704 -6.2727051 3.548286e-10
## kidslt6         -0.3479890919 0.0495031781 -7.0296313 2.070800e-12
## kidsge6         -0.0120720710 0.0148541006 -0.8127097 4.163845e-01
## kidslt6:kidsge6  0.0817163149 0.0249294073  3.2779085 1.045793e-03
##        Probit Marginal Effects       
## =====================================
## Dep. Variable:                   inlf
## Method:                          dydx
## At:                           overall
## ======================================================================================
##                         dy/dx    std err          z      P>|z|      [0.025      0.975]
## --------------------------------------------------------------------------------------
## educ                   0.0302      0.007      4.213      0.000       0.016       0.044
## exper                  0.0332      0.005      6.344      0.000       0.023       0.043
## np.power(exper, 2)    -0.0005      0.000     -2.743      0.006      -0.001      -0.000
## age                   -0.0156      0.002     -6.312      0.000      -0.020      -0.011
## kidslt6               -0.3480      0.049     -7.153      0.000      -0.443      -0.253
## kidsge6               -0.0121      0.015     -0.818      0.413      -0.041       0.017
## kidslt6:kidsge6        0.0817      0.025      3.296      0.001       0.033       0.130
## ======================================================================================

The interpretation of the PEA and APE of the probit model is analogous to that of the logit model - a unit increase in the (average, if PEA) regressor results in the specified change (or average change, in the case of APE) in the probability.

Considerations

Marginal effects show the change in probability when the predictor (independent) variable increases by one unit. For continuous variables this represents an instantaneous rate of change, since the ‘unit’ may be arbitrarily small. For binary variables, the change is from 0 to 1, i.e. one ‘unit’ in the usual sense.

As in the multiple linear regression case, the intercept does not have a meaningful interpretation.

5.4.3.2 Task 10

The fitted Logit model:

\[ \begin{aligned} \log \left( \dfrac{\widehat{p}_i}{1 - \widehat{p}_i} \right) &= \underset{(1.0492)}{1.7010} + \underset{(0.050)}{0.1979} \cdot educ + \underset{(0.0388)}{0.2169} \cdot exper - \underset{(0.0012)}{0.0032} \cdot exper^2 - \underset{(0.0183)}{0.1039} \cdot age \\ &- \underset{(0.3720)}{2.2880} \cdot kidslt6 - \underset{(0.0992)}{0.0829} \cdot kidsge6 + \underset{(0.1690)}{0.5363} \cdot (kidslt6 \times kidsge6) \end{aligned} \]

The fitted Probit model:

\[ \begin{aligned} \widehat{p}_i = \mathbf{\Phi} \bigg( \underset{(0.6050)}{0.9593} &+ \underset{(0.0287)}{0.1164} \cdot educ + \underset{(0.0223)}{0.1280} \cdot exper - \underset{(0.0007)}{0.0019} \cdot exper^2 - \underset{(0.0104)}{0.0603} \cdot age \\ &- \underset{(0.2102)}{1.3413} \cdot kidslt6 - \underset{(0.0573)}{0.0465} \cdot kidsge6 + \underset{(0.0983)}{0.3150} \cdot (kidslt6 \times kidsge6) \bigg) \end{aligned} \]

5.4.3.3 Task 11

We will check some arbitrary linear restrictions to highlight how such tests could be carried out:

##                 Estimate Std. Error  z value Pr(>|z|)
## (Intercept)      1.70098    1.04919  1.62123  0.10497
## educ             0.19791    0.05020  3.94197  0.00008
## exper            0.21693    0.03877  5.59589  0.00000
## I(exper^2)      -0.00318    0.00123 -2.57767  0.00995
## age             -0.10395    0.01832 -5.67236  0.00000
## kidslt6         -2.28803    0.37195 -6.15140  0.00000
## kidsge6         -0.08289    0.09919 -0.83566  0.40334
## kidslt6:kidsge6  0.53630    0.16900  3.17341  0.00151
##                       Coef.  Std.Err.        z    P>|z|   [0.025   0.975]
## Intercept           1.70098   1.04919  1.62123  0.10497 -0.35540  3.75735
## educ                0.19791   0.05020  3.94197  0.00008  0.09951  0.29631
## exper               0.21693   0.03877  5.59589  0.00000  0.14095  0.29291
## np.power(exper, 2) -0.00318   0.00123 -2.57767  0.00995 -0.00560 -0.00076
## age                -0.10395   0.01832 -5.67236  0.00000 -0.13986 -0.06803
## kidslt6            -2.28803   0.37195 -6.15140  0.00000 -3.01705 -1.55902
## kidsge6            -0.08289   0.09919 -0.83566  0.40334 -0.27729  0.11152
## kidslt6:kidsge6     0.53630   0.16900  3.17341  0.00151  0.20507  0.86753

We will begin with a simple economic hypothesis: \[ \text{education's effect on the log-odds-ratio is twice as large, in absolute value, as the effect of age} \] This is equivalent to the hypothesis that:

\[ \begin{cases} H_0&: educ + 2 \cdot age = 0\\ H_1&: educ + 2 \cdot age \neq 0 \end{cases} \]
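
In statsmodels, such a linear restriction can be tested with a Wald test directly on the fitted results object. A sketch, assuming the fitted logit results are stored in `logit_res`:

```python
# Wald test of the single linear restriction educ + 2*age = 0
print(logit_res.wald_test("educ + 2 * age = 0"))
```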

## Linear hypothesis test
## 
## Hypothesis:
## educ  + 2 age = 0
## 
## Model 1: restricted model
## Model 2: inlf ~ educ + exper + I(exper^2) + age + kidslt6 * kidsge6
## 
##   Res.Df Df  Chisq Pr(>Chisq)
## 1    595                     
## 2    594  1 0.0263     0.8712
## <Wald test (chi2): statistic=[[0.02629718]], p-value=0.8711766750958096, df_denom=1>

Since the p-value = 0.8712 > 0.05, we have no grounds to reject the null hypothesis that education's effect on the log-odds-ratio is twice as strong, in absolute value, as the effect of age.

Furthermore, we may examine another hypothesis:

\[ \begin{cases} H_0&: educ + 2 \cdot age + kidslt6= 0\\ H_1&: educ + 2 \cdot age + kidslt6 \neq 0 \end{cases} \]

## Linear hypothesis test
## 
## Hypothesis:
## educ  + 2 age  + kidslt6 = 0
## 
## Model 1: restricted model
## Model 2: inlf ~ educ + exper + I(exper^2) + age + kidslt6 * kidsge6
## 
##   Res.Df Df  Chisq Pr(>Chisq)    
## 1    595                         
## 2    594  1 35.894  2.083e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## <Wald test (chi2): statistic=[[35.89410407]], p-value=2.0833820432189484e-09, df_denom=1>

Since p-value < 0.05, we reject the null hypothesis of this linear restriction.

Finally, we may examine a joint hypothesis of two linear restrictions:

\[ \begin{cases} H_0&: educ + 2 \cdot age = 0 \text{ and } kidslt6 = 28 \cdot kidsge6\\ H_1&: educ + 2 \cdot age \neq 0, \text{ or } kidslt6 \neq 28 \cdot kidsge6, \text{ or both} \end{cases} \]
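
The joint test is carried out in the same way, supplying both restrictions at once (a sketch, continuing with `logit_res`):

```python
# Joint Wald test of the two linear restrictions
print(logit_res.wald_test("educ + 2 * age = 0, kidslt6 - 28 * kidsge6 = 0"))
```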

## Linear hypothesis test
## 
## Hypothesis:
## educ  + 2 age = 0
## kidslt6 - 28 kidsge6 = 0
## 
## Model 1: restricted model
## Model 2: inlf ~ educ + exper + I(exper^2) + age + kidslt6 * kidsge6
## 
##   Res.Df Df  Chisq Pr(>Chisq)
## 1    596                     
## 2    594  2 0.0271     0.9865
## <Wald test (chi2): statistic=[[0.02710237]], p-value=0.9865402197730535, df_denom=2>

Since p-value = 0.9865 > 0.05 we do not reject the joint hypothesis of these two linear restrictions.

5.4.3.4 Task 12

Before examining the model on the test data subset, we examine the \(Y\) values:
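
A sketch of this check, assuming the test data frame is called `dt_test` (our name, not from the original code):

```python
# Frequency table of the response variable in the test subset
print(dt_test["inlf"].value_counts())
```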

## 0.0    151
## Name: inlf, dtype: int64

Note that in this data sample, our test set contains only zeros!

Considerations

This may be a nice example of the type of new data that we may obtain for testing - there is always a possibility that new data, for which we do not yet know the outcome, contains only 1’s or only 0’s.

Firstly, using the naive \(0.5\) threshold:
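
A sketch of this classification rule, assuming the fitted results objects `logit_res` and `probit_res` and the test data frame `dt_test` from before:

```python
import pandas as pd

# Naive rule: classify as 1 if the predicted probability is at least 0.5
for name, res in [("logit", logit_res), ("probit", probit_res)]:
    pred = (res.predict(dt_test) >= 0.5).astype(int)
    print(name)
    print(pd.crosstab(dt_test["inlf"], pred,
                      rownames=["actual"], colnames=["predicted"]))
```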

Both models give the same results - more than half of the test observations are incorrectly classified.

If we look at the optimal cutoff prediction rule (which differs between the logit and probit models):
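
One common way to choose such a cutoff - sketched below, and not necessarily the exact rule used earlier - is the threshold that maximizes Youden's J = TPR - FPR on the training data:

```python
import numpy as np
from sklearn.metrics import roc_curve

def optimal_cutoff(res, data, y_col="inlf"):
    # ROC coordinates on the training data; pick the threshold maximizing TPR - FPR
    fpr, tpr, thresholds = roc_curve(data[y_col], res.predict(data))
    return thresholds[np.argmax(tpr - fpr)]

cutoff_logit = optimal_cutoff(logit_res, dt_train)
cutoff_probit = optimal_cutoff(probit_res, dt_train)
print(cutoff_logit, cutoff_probit)
```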

In this case, the results from using an optimal cutoff point are slightly better for the probit regression.

Unfortunately, since we do not have any 1’s in our test set, we cannot calculate the ROC curve or the AUC.

Considerations

At this point our analysis of the test data would come to an end - since there are no 1’s in the test data, we cannot carry out any further analysis.

For interest’s sake, assume that our new data contains both zeros and ones (we will ignore that some, or all, of these observations may come from our training set):

##   inlf hours kidslt6 kidsge6 age educ   wage repwage hushrs husage huseduc huswage faminc    mtr motheduc fatheduc unem city exper nwifeinc     lwage expersq
## 1    1   384       0       3  36   14 1.6927       0   3060     43      17  5.4020  32650 0.5815       10       10 14.0    1     1 32.00000 0.5263249       1
## 2    0     0       2       3  34   12     NA       0   3160     30      12  0.7595   5000 0.9415       12       12  7.5    0    10  5.00000        NA     100
## 3    1   528       0       2  51   12 3.7879       0   2500     50      11 10.4000  40000 0.6215        7        7  7.5    0     8 37.99999 1.3318120      64
## 4    0     0       0       0  52   11     NA       0   2210     51      12 10.4070  24000 0.6615        7        7 11.0    1     3 24.00000        NA       9
## 5    1   800       0       3  33   13 1.7500       0   3000     37      14  8.3333  32600 0.6215       12        7  7.5    1     4 31.20000 0.5596158      16
##    index  inlf   hours  kidslt6  kidsge6   age  educ    wage  repwage  hushrs  husage  huseduc  huswage   faminc     mtr  motheduc  fatheduc  unem  city  exper   nwifeinc     lwage  expersq
## 0    373   1.0  2143.0      1.0      1.0  34.0  12.0  4.4797     4.80  1935.0    40.0     12.0   3.4109  16200.0  0.7215      10.0      10.0  14.0   0.0   16.0   6.600003  1.499556    256.0
## 1    246   1.0  1980.0      0.0      4.0  31.0   9.0  2.5253     2.56  2070.0    36.0     10.0   4.6860  19900.0  0.7515      10.0       7.0   5.0   1.0    2.0  14.899910  0.926360      4.0
## 2    686   0.0     0.0      0.0      0.0  54.0  15.0     NaN     0.00  1920.0    52.0     15.0  10.4170  20000.0  0.6915      14.0      16.0  14.0   0.0    6.0  20.000000       NaN     36.0
## 3    691   0.0     0.0      0.0      0.0  55.0  16.0     NaN     0.00  2304.0    56.0     11.0   2.6701  13000.0  0.7215      12.0      12.0   7.5   0.0   33.0  13.000000       NaN   1089.0
## 4    381   1.0  1960.0      0.0      1.0  43.0  14.0  3.9286     3.85  2000.0    45.0     14.0   5.3350  24470.0  0.6915       7.0       7.0   7.5   0.0   14.0  16.769939  1.368283    196.0
## 
##  0  1 
## 61 90
## 1.0    85
## 0.0    66
## Name: inlf, dtype: int64

Now that our dataset has both 1’s and 0’s, we can compare the models with their optimal cutoffs:

We see that for this test set, we correctly predict the majority of 1’s. Again, part of this is because all of the observations with inlf = 1 were also included in our training data. In this case, the logit model appears to be better at classifying 1’s correctly than the probit model, while for the correct prediction of 0’s the probit model may be better suited.

We can also plot the \(ROC\) curve for the test data as follows:
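
A sketch using scikit-learn, with the same (assumed) objects as before:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# ROC curves of both models on the test subset
for name, res in [("logit", logit_res), ("probit", probit_res)]:
    probs = res.predict(dt_test)
    fpr, tpr, _ = roc_curve(dt_test["inlf"], probs)
    auc = roc_auc_score(dt_test["inlf"], probs)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc:.3f})")

plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # 45-degree reference line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```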

Note: since we have used randomization to select the subset - the results in R and Python will be different due to the different data subsamples selected.