Tasks
Exercise Set 1: Exploratory Data Analysis (EDA)
Explore the data:
Split your datasets into a training set (\(\sim 80\%\) of the data) and a test set (the remaining \(\sim 20\%\) of the data).
Look at the descriptive statistics of the dependent and explanatory variables. Are there any missing values? Are there any categorical or indicator variables?
Think a bit more about the data itself: what are the variable definitions? What effect would you expect each variable to have on the dependent variable \(Y\)? In your opinion, is the effect likely to be positive, negative, or difficult to determine?
Visually examine the relationships between the dependent and explanatory variables (box plots, bar charts, etc.).
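The descriptive checks above can be sketched with pandas. This is only an illustration on synthetic data: the DataFrame `df` and the column names `y`, `income`, and `region` are hypothetical stand-ins for the real dataset, and the missing values are injected artificially.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "y": rng.integers(0, 2, size=n),                   # binary dependent variable
    "income": rng.normal(50, 10, size=n),              # continuous regressor
    "region": rng.choice(["north", "south"], size=n),  # categorical regressor
})
# Inject 10 missing values so the missing-data check has something to find.
df.loc[df.sample(10, random_state=0).index, "income"] = np.nan

summary = df.describe(include="all")   # descriptive statistics for every column
missing = df.isna().sum()              # number of missing values per column
categorical = df.select_dtypes(exclude="number").columns.tolist()

# Numeric counterpart of a box plot of income by outcome.
income_by_outcome = df.groupby("y")["income"].describe()
# Share of y = 1 within each category, the numbers behind a bar chart.
share_by_region = df.groupby("region")["y"].mean()
```

If matplotlib is installed, `df.boxplot(column="income", by="y")` and `share_by_region.plot.bar()` would draw the corresponding plots.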
Note: complete the training/test split (\(\sim 80\%\) / \(\sim 20\%\)) before carrying out any other task.
Unless stated otherwise, use the training set for Exercise Sets 1 through 4. The test set is used for out-of-sample validation, that is, to compare model predictions with actual outcomes on new data.
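One possible way to carry out the 80/20 split is scikit-learn's `train_test_split`. The sketch below uses synthetic arrays `X` and `y` as placeholders; stratifying on `y` is an added assumption (not required by the task) that keeps the class proportions similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 200 observations, 3 regressors, binary outcome.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

# 80% training, 20% test; a fixed random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

Fixing `random_state` matters here because every later exercise should reuse exactly the same training and test observations.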
Exercise Set 2: Logistic regression model
Which variables would you include in a logistic regression model? What signs do you expect them to have? Estimate the model with your selected variables.
Check your model variables for multicollinearity. Remove any collinear variables.
Include polynomial and/or interaction terms in your model and explain your selection. Keep only statistically significant variables.
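The multicollinearity check can be sketched with variance inflation factors (VIF), where \(\text{VIF}_j = 1 / (1 - R_j^2)\) and \(R_j^2\) comes from regressing the \(j\)-th explanatory variable on all the others. The helper `vif` and the synthetic regressors below are illustrative assumptions, not part of the original exercise; a VIF above roughly 5 to 10 is a common rule of thumb for a problematic variable.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Auxiliary OLS regression of column j on the other columns plus an intercept.
        A = np.column_stack([np.ones(len(X)), others])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        r2 = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic regressors: x3 is almost a copy of x1, x2 is unrelated to both.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.05 * rng.normal(size=500)
vifs = vif(np.column_stack([x1, x2, x3]))
```

Here `x1` and `x3` receive very large VIFs while `x2` stays close to 1, so dropping either `x1` or `x3` would resolve the collinearity.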
Select a cutoff prediction rule:
- Use the default cutoff \(\theta = 0.5\).
- Calculate an optimal cutoff based on an accuracy metric of your choice.
- Compare the confusion matrices for the two cutoffs and select the better prediction rule.
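The cutoff comparison can be sketched as follows. Two assumptions are made here: the optimality criterion is Youden's \(J = \text{TPR} - \text{FPR}\) (one choice among many), and the model is scikit-learn's `LogisticRegression`, which applies L2 regularization by default and therefore differs slightly from a plain maximum likelihood fit. The data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve

# Synthetic, mildly imbalanced training data (about 70% zeros, 30% ones).
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3,
    weights=[0.7, 0.3], random_state=0,
)
model = LogisticRegression(max_iter=1000).fit(X, y)
p_hat = model.predict_proba(X)[:, 1]   # estimated P(Y = 1 | X)

# Candidate cutoffs are the distinct predicted probabilities.
fpr, tpr, thresholds = roc_curve(y, p_hat, drop_intermediate=False)
theta_opt = thresholds[np.argmax(tpr - fpr)]   # cutoff maximizing Youden's J

def youden_j(cm):
    """Youden's J from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    tn, fp, fn, tp = cm.ravel()
    return tp / (tp + fn) - fp / (fp + tn)

# Rows are actual classes, columns are predicted classes.
cm_default = confusion_matrix(y, (p_hat >= 0.5).astype(int))
cm_optimal = confusion_matrix(y, (p_hat >= theta_opt).astype(int))
```

With imbalanced classes the Youden cutoff is often below 0.5, trading some false positives for fewer false negatives; whether that trade is desirable depends on the cost of each error type.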
Exercise Set 3: Other classification methods
Carry out classification via k-nearest neighbors. Determine the optimal value of \(k\) (i.e. how many neighboring points should be used).
Carry out classification via classification trees.
Carry out classification via Naive Bayes, LDA and QDA.
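A minimal sketch of these classifiers with scikit-learn, on synthetic data. Choosing \(k\) by 5-fold cross-validated accuracy, the grid of odd \(k\) values, the tree depth of 4, and the Gaussian variant of Naive Bayes are all illustrative assumptions rather than requirements of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training set.
X_train, y_train = make_classification(
    n_samples=400, n_features=4, n_informative=4, n_redundant=0, random_state=0
)

# kNN is distance-based, so the features are standardized before fitting.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
k_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 32, 2))}
search = GridSearchCV(knn, k_grid, cv=5, scoring="accuracy").fit(X_train, y_train)
best_k = search.best_params_["kneighborsclassifier__n_neighbors"]

models = {
    "knn": search.best_estimator_,
    "tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "naive_bayes": GaussianNB(),
    "lda": LinearDiscriminantAnalysis(),
    "qda": QuadraticDiscriminantAnalysis(),
}
# Mean 5-fold cross-validated accuracy of each classifier on the training set.
cv_accuracy = {
    name: cross_val_score(m, X_train, y_train, cv=5, scoring="accuracy").mean()
    for name, m in models.items()
}
```

Odd values of \(k\) avoid ties in the majority vote for a binary outcome, which is why the grid skips even values.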
Exercise Set 4: Model comparison on the training set
Plot the ROC curves of the different models.
Calculate the optimal cutoffs for all models and compare the confusion matrices.
Exercise Set 5: Model comparison on the test set
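To make clear what a ROC curve plots, the sketch below computes its points by hand: the cutoff is swept over every distinct predicted score, and the true and false positive rates are recorded at each step. The helpers `roc_points` and `auc_trapezoid` are hypothetical names introduced for illustration; in practice a library routine such as scikit-learn's `roc_curve` does the same job.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) obtained by sweeping the cutoff over all distinct scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    # Start above the largest score so that the curve begins at (0, 0).
    cutoffs = np.r_[np.inf, np.unique(scores)[::-1]]
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    tpr = np.array([((scores >= c) & (y_true == 1)).sum() / n_pos for c in cutoffs])
    fpr = np.array([((scores >= c) & (y_true == 0)).sum() / n_neg for c in cutoffs])
    return fpr, tpr

def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# Tiny hand-checkable example with two hypothetical models scoring four observations.
y = [0, 0, 1, 1]
scores_by_model = {
    "model_a": [0.1, 0.4, 0.35, 0.8],   # one positive is ranked below a negative
    "model_b": [0.1, 0.2, 0.7, 0.9],    # positives and negatives perfectly separated
}
curves = {name: roc_points(y, s) for name, s in scores_by_model.items()}
aucs = {name: auc_trapezoid(*curve) for name, curve in curves.items()}
```

Plotting each `(fpr, tpr)` pair in `curves` on the same axes, together with the 45-degree line, gives the usual ROC comparison chart.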
Calculate the ROC curves and AUC on the test set.
Using the optimal cutoff from the training set, calculate the confusion matrix on the test set.
Which model performs best on the training set, and which performs best on the test set?
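The out-of-sample comparison can be sketched as below: each model is fitted on the training set, its cutoff is chosen on the training set only, and that cutoff is then applied unchanged to the test set. The three models, the Youden's \(J\) criterion, ranking by test AUC, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data, split 80/20 into training and test sets.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

models = {
    "logit": LogisticRegression(max_iter=1000),
    "lda": LinearDiscriminantAnalysis(),
    "naive_bayes": GaussianNB(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    p_train = model.predict_proba(X_train)[:, 1]
    p_test = model.predict_proba(X_test)[:, 1]
    # The cutoff is selected on the training set only (maximizing Youden's J).
    fpr, tpr, thresholds = roc_curve(y_train, p_train)
    theta = thresholds[np.argmax(tpr - fpr)]
    results[name] = {
        "theta": float(theta),
        "auc_train": roc_auc_score(y_train, p_train),
        "auc_test": roc_auc_score(y_test, p_test),
        # Test-set confusion matrix using the training-set cutoff.
        "cm_test": confusion_matrix(y_test, (p_test >= theta).astype(int)),
    }

best_on_test = max(results, key=lambda name: results[name]["auc_test"])
```

A large gap between `auc_train` and `auc_test` for a model would suggest overfitting, which is one reason the ranking on the two sets can differ.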