The value of k in the KNN algorithm is related to the error rate of the model. A small value of k could lead to overfitting, while a large value of k can lead to underfitting.
For binary data we want to avoid the case of ties. For example, if k = 2, then we could have a case when one neighbor is 0 and the other one is 1.
Since we want to avoid overfitting, we will start from k = 3 and consider only odd values of k.
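Before plotting, we need the misclassification error for each candidate k. Below is a minimal sketch of how misclasserror could be built, assuming dt_train and dt_test are data frames with the same explanatory variables; the class::knn() call and the grid of odd k values are illustrative choices, not necessarily what was used here:

xvars <- c("educ", "exper", "age", "kidslt6", "kidsge6")
misclasserror <- data.frame(k = seq(3, 25, by = 2))  # odd values only, to avoid ties
misclasserror$error <- sapply(misclasserror$k, function(k) {
  pred <- class::knn(train = dt_train[, xvars],  # for a data.table, use dt_train[, ..xvars]
                     test  = dt_test[, xvars],
                     cl    = dt_train$inlf,
                     k     = k)
  mean(pred != dt_test$inlf)  # share of misclassified test observations
})

Since KNN is distance-based, it is common to standardize the explanatory variables first; the sketch above omits that step.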
misclasserror %>%
  ggplot(aes(x = k, y = error)) +
  geom_line() +
  geom_point() +
  theme_bw()
Important
We have to be very careful and include only explanatory variables, avoiding variables like hours, wage, nwifeinc, etc., which take their specific values as a consequence of being in the labor force.
The predicted class (in our case, either 0 or 1).
The predicted probability.
The percentage of observations in the node.
For example, at the first depth level we see that if someone has less than 6 years of experience, the node shows the share of (all) observations that did not return to the labor force, together with the corresponding predicted probability, while the remainder did return. Then, going one level down, among those who did return to the labor force, the tree further splits on whether they also had at least one child under 6 years old (kidslt6 >= 1); for that group the node shows the probability that they won't return to the labor force, as well as their share of the total sample.
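For reference, a tree like the one described above could be produced along these lines (a minimal sketch, assuming dt_train and the rpart and rpart.plot packages; the formula mirrors the discriminant analysis models below):

mdl_tree <- rpart::rpart(factor(inlf) ~ educ + exper + age + kidslt6 + kidsge6,
                         data = dt_train, method = "class")
rpart.plot::rpart.plot(mdl_tree)  # each node: predicted class, probability, % of observations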
mdl_lda <- MASS::lda(factor(inlf) ~ educ + exper + age + kidslt6 + kidsge6, data = dt_train)
mdl_qda <- MASS::qda(factor(inlf) ~ educ + exper + age + kidslt6 + kidsge6, data = dt_train)
mdl_nbc <- e1071::naiveBayes(factor(inlf) ~ educ + exper + age + kidslt6 + kidsge6, data = dt_train)
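To compare how these three classifiers perform, one could predict on a held-out set; a minimal sketch, assuming dt_test exists with the same columns:

pred_lda <- predict(mdl_lda, newdata = dt_test)$class  # MASS::lda returns a list; $class holds the labels
pred_qda <- predict(mdl_qda, newdata = dt_test)$class
pred_nbc <- predict(mdl_nbc, newdata = dt_test)        # naiveBayes returns class labels by default
# Share of correctly classified test observations for each model
sapply(list(LDA = pred_lda, QDA = pred_qda, NBC = pred_nbc),
       function(p) mean(as.character(p) == as.character(dt_test$inlf)))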