Tasks
Exercise Set 1: Exploratory Data Analysis (EDA)
Visually examine the data:
- Split your dataset into a training set (\(\sim 80\%\) of the data) and a testing set (the remaining \(\sim 20\%\) of the data).
- Look at the descriptive statistics of the dependent and explanatory variables. Are there any missing data? Are there any categorical or indicator variables?
- Think a bit more about the data itself. What are the variable definitions? What effect would you expect each variable to have on the dependent variable \(Y\)? (In your opinion, is the effect likely to be positive, negative, or difficult to determine?)
- Visually examine the relationships between the dependent and explanatory variables (boxplots, bar charts, etc.).
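The checks listed above could be sketched as follows. This is only an illustration on made-up data: the column names `income`, `owns_home` and `y`, and the injected missing values, are assumptions and should be replaced with your own dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the exercise); replace with your own dataset.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "income": rng.normal(50, 10, n),             # numeric explanatory variable
    "owns_home": rng.choice(["yes", "no"], n),   # categorical explanatory variable
})
# Inject 10 missing values so the missing-data check has something to find.
df.loc[rng.choice(n, 10, replace=False), "income"] = np.nan
df["y"] = (rng.random(n) < 0.4).astype(int)      # binary dependent variable

# Descriptive statistics for all columns (numeric and categorical).
summary = df.describe(include="all")

# Number of missing values per column.
missing = df.isna().sum()

# Columns that are not numeric are candidates for categorical/indicator coding.
categorical = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

# Numeric variable vs. Y: compare group means (the numbers behind a boxplot).
group_means = df.groupby("y")["income"].mean()

# Categorical variable vs. Y: share of each outcome within each category
# (the numbers behind a bar chart).
crosstab = pd.crosstab(df["owns_home"], df["y"], normalize="index")

print(summary, missing, categorical, group_means, crosstab, sep="\n\n")
```

If matplotlib is installed, `df.boxplot(column="income", by="y")` draws the corresponding boxplot.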
Before carrying out any tasks, split your dataset into a training set (\(\sim 80\%\) of the data) and a testing set (the remaining \(\sim 20\%\) of the data).
Unless stated otherwise, use the training set for Exercise Sets 1 through 4. The test set is used for out-of-sample validation and to compare model predictions with actual outcomes on a new set of data.
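One possible way to carry out the 80/20 split is sketched below, assuming scikit-learn is available. The simulated `X` and `y`, the seed, and the choice to stratify on the outcome are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical simulated data; replace X and y with your own variables.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)

# 80% training / 20% testing. stratify=y keeps the class proportions
# similar in both parts; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

Stratifying is optional, but it helps when one class is rare, since a purely random split could leave very few positive cases in the test set.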
Exercise Set 2: Logistic regression model
- Which variables would you include in a logistic regression model? What signs do you expect them to have? Estimate the model with your selected variables. 
- Check your model variables for multicollinearity. Remove any collinear variables. 
- Include polynomial and/or interaction terms in your model and explain your selection. Keep only the statistically significant variables.
- Select a cutoff prediction rule:
  - Use the default \(\theta = 0.5\) cutoff.
  - Calculate an optimal cutoff based on an accuracy metric of your choice.
  - Compare the confusion matrices for both cutoffs and select the best prediction rule.
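The cutoff comparison could look roughly like the sketch below. It assumes scikit-learn and uses Youden's J statistic (sensitivity + specificity - 1) as the metric, which is just one possible choice. The simulated data and its coefficients are made up. Note that scikit-learn's `LogisticRegression` applies an L2 penalty by default, so its estimates differ slightly from an unpenalized maximum likelihood fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve

# Hypothetical simulated training data with known coefficient signs.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
logit = -1.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1]
y = (rng.random(1000) < 1 / (1 + np.exp(-logit))).astype(int)

# Fit the model (L2-penalized by default in scikit-learn).
model = LogisticRegression().fit(X, y)
p_hat = model.predict_proba(X)[:, 1]   # predicted P(Y = 1)

# Candidate cutoffs come from the ROC curve; pick the one maximizing
# Youden's J = TPR - FPR (an assumed metric, other choices are possible).
fpr, tpr, thresholds = roc_curve(y, p_hat)
theta_opt = thresholds[np.argmax(tpr - fpr)]

def youden_j(cm):
    """Sensitivity + specificity - 1 from a 2x2 confusion matrix."""
    tn, fp, fn, tp = cm.ravel()
    return tp / (tp + fn) + tn / (tn + fp) - 1

# Confusion matrices (rows: actual, columns: predicted) for both rules.
cm_default = confusion_matrix(y, (p_hat >= 0.5).astype(int))
cm_optimal = confusion_matrix(y, (p_hat >= theta_opt).astype(int))

print("coefficients:", model.coef_[0])
print("default cutoff 0.5:\n", cm_default, "J =", youden_j(cm_default))
print("optimal cutoff", theta_opt, ":\n", cm_optimal, "J =", youden_j(cm_optimal))
```

When the positive class is the minority, the optimal cutoff under this metric is typically below 0.5, trading some false positives for fewer false negatives.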
 
Exercise Set 3: Other Classification methods
- Carry out classification via k-nearest neighbors (kNN). Determine the optimal value of \(k\) (i.e. how many neighboring points to use).
- Carry out classification via classification trees.
- Carry out classification via Naive Bayes, LDA and QDA. 
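The methods above could be fitted along the following lines, assuming scikit-learn. The data come from `make_classification` and are purely illustrative. Choosing \(k\) by 5-fold cross-validated accuracy over odd values up to 31, and limiting the tree depth to 4, are assumptions rather than requirements of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data; replace with your own training set.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# kNN is distance-based, so the features are standardized first.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Search odd k from 1 to 31 using 5-fold cross-validated accuracy.
grid = GridSearchCV(
    knn,
    {"kneighborsclassifier__n_neighbors": list(range(1, 32, 2))},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]

# The remaining classifiers, compared with the same cross-validation scheme.
models = {
    "knn": grid.best_estimator_,
    "tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "naive_bayes": GaussianNB(),
    "lda": LinearDiscriminantAnalysis(),
    "qda": QuadraticDiscriminantAnalysis(),
}
cv_accuracy = {
    name: cross_val_score(m, X, y, cv=5, scoring="accuracy").mean()
    for name, m in models.items()
}

print("best k:", best_k)
for name, acc in cv_accuracy.items():
    print(f"{name}: {acc:.3f}")
```

Odd values of \(k\) avoid tied votes in a two-class problem.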
Exercise Set 6: Model comparison on the training set
- Plot the ROC curves of the different models. 
- Calculate the optimal cutoffs for all models and compare the confusion matrices.
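A sketch of how the training-set ROC curves could be collected for several models is given below, assuming scikit-learn. The two models, the simulated data, and the Youden's J cutoff rule are illustrative assumptions. The stored `fpr` and `tpr` arrays are what one would pass to a plotting function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.naive_bayes import GaussianNB

# Hypothetical training data; replace with your own training set.
X_train, y_train = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=3
)

# Any classifier exposing predict_proba can be added to this dictionary.
models = {"logit": LogisticRegression(), "naive_bayes": GaussianNB()}

roc = {}
for name, model in models.items():
    # In-sample predicted probabilities of the positive class.
    scores = model.fit(X_train, y_train).predict_proba(X_train)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_train, scores)
    roc[name] = {
        "fpr": fpr,                                   # x-axis of the ROC curve
        "tpr": tpr,                                   # y-axis of the ROC curve
        "auc": roc_auc_score(y_train, scores),        # area under the curve
        "cutoff": thresholds[np.argmax(tpr - fpr)],   # Youden's J (assumed rule)
    }

for name, r in roc.items():
    print(f"{name}: AUC = {r['auc']:.3f}, cutoff = {r['cutoff']:.3f}")
```

Keep in mind that in-sample ROC curves flatter flexible models: kNN with \(k = 1\) or a fully grown tree can reach an AUC of 1 on the training data without predicting new data well.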
Exercise Set 7: Model comparison on the test set
- Calculate the ROC curves and AUC on the test set. 
- Using the optimal cutoff from the training set, calculate the confusion matrix on the test set. 
- Which model is the best in terms of the training and test sets?
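The out-of-sample comparison could be organized as in the sketch below, assuming scikit-learn. The key point is that the cutoff is chosen on the training set only and then applied unchanged to the test set. The helper `evaluate`, the simulated data, the two models, and the Youden's J rule are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Hypothetical data and split; replace with your own training and test sets.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

def evaluate(model):
    """Fit on the training set, pick the cutoff there, score on the test set."""
    model.fit(X_tr, y_tr)
    p_tr = model.predict_proba(X_tr)[:, 1]
    p_te = model.predict_proba(X_te)[:, 1]

    # Cutoff chosen on training data only (Youden's J, an assumed rule).
    fpr, tpr, thresholds = roc_curve(y_tr, p_tr)
    cutoff = thresholds[np.argmax(tpr - fpr)]

    return {
        "cutoff": cutoff,
        "auc_train": roc_auc_score(y_tr, p_tr),
        "auc_test": roc_auc_score(y_te, p_te),
        # Test-set confusion matrix using the training-set cutoff.
        "cm_test": confusion_matrix(y_te, (p_te >= cutoff).astype(int)),
    }

models = {"logit": LogisticRegression(), "lda": LinearDiscriminantAnalysis()}
results = {name: evaluate(m) for name, m in models.items()}

for name, r in results.items():
    print(f"{name}: train AUC = {r['auc_train']:.3f}, test AUC = {r['auc_test']:.3f}")
    print(r["cm_test"])
```

A large gap between the training and test AUC for a model suggests overfitting, which is worth weighing when deciding which model is best.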