BankChurn is a project designed to analyze and predict customer attrition (churning) from the Credit Card Dataset utilizing various Machine Learning Algorithms and Data Sampling Techniques. Below is a summary of the findings from the notebook.
This project utilized various Machine Learning Algorithms to classify whether a customer would churn:
- KNN (k was chosen as the square root of the number of samples, as shown to be best practice here).
- SGD (i.e. The SGD Classifier in Sklearn)
- Decision Trees
- Random Forest
- SVM
- Soft Voting Ensembles (See here). These were chosen primarily for being distinct from one another (except Random Forests and Decision Trees), and to gauge the performance difference between classical ML algorithms and an Ensemble Method. (A sketch of how these might be instantiated follows this list.)
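As a rough illustration, the classifiers above could be instantiated in scikit-learn along these lines. This is a minimal sketch: hyperparameters other than k for KNN, and the particular models combined in the voting ensemble, are assumptions rather than the notebook's exact settings.

```python
# Sketch of instantiating the classifiers used in the project (assumed settings).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC

def build_classifiers(n_samples):
    # k chosen as the square root of the number of training samples
    k = int(np.sqrt(n_samples))
    models = {
        "KNN": KNeighborsClassifier(n_neighbors=k),
        "SGD": SGDClassifier(),            # linear model fit with stochastic gradient descent
        "DecisionTree": DecisionTreeClassifier(),
        "RandomForest": RandomForestClassifier(),
        "SVM": SVC(probability=True),      # probability=True enables soft voting
    }
    # A soft-voting ensemble over a subset of the models; which models to combine
    # is an assumption (the notebook built two such ensembles, described below)
    models["SoftVoting"] = VotingClassifier(
        estimators=[(name, clf) for name, clf in models.items()
                    if name in ("DecisionTree", "RandomForest", "SVM")],
        voting="soft",
    )
    return models
```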
The Credit Card Dataset is heavily imbalanced, with only 16% of customers attrited versus 84% non-attrited. This presents an issue when building a classifier: while classifiers may retain a high accuracy rating (~96% at most without sampling), they perform less adequately on metrics such as recall (~80% at most without sampling). In other words, classifiers are far better at identifying customers who will not churn than at catching the ones who actually do.
In this regard, recall is the primary metric to maximize, since doing so, paired with accuracy, minimizes the chance of false negatives when predicting customer churn. However, other metrics such as precision, F1 score, and PR curves were also considered. (More information is detailed in the notebook.)
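For reference, the metrics mentioned above can be computed with scikit-learn roughly as in this sketch; it assumes the churn label is encoded as the positive class (1), and the variable names are placeholders rather than the notebook's.

```python
# Sketch of the evaluation metrics (accuracy, recall, precision, F1, PR curve).
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, precision_recall_curve, auc)

def evaluate(clf, X_test, y_test):
    y_pred = clf.predict(X_test)
    scores = {
        "accuracy": accuracy_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),       # primary metric: penalizes false negatives
        "precision": precision_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }
    # The PR curve needs probabilities (or scores) for the positive (churn) class
    if hasattr(clf, "predict_proba"):
        proba = clf.predict_proba(X_test)[:, 1]
        precision, recall, _ = precision_recall_curve(y_test, proba)
        scores["pr_auc"] = auc(recall, precision)
    return scores
```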
There are many solutions to this issue, but the one employed here was Data Sampling, specifically Random Over-Sampling, Random Under-Sampling, and SMOTE. The sampling techniques were chosen along the same lines as the classifiers, namely to compare naive sampling methods (i.e. over- and under-sampling) against a more sophisticated sampler (SMOTE).
Data Sampling largely mitigates the issues of an unbalanced dataset by simply balancing it. Depending on the sampler, the majority class is undersampled (i.e. culled), or the minority class receives artificial copies (oversampling, done in a slightly more sophisticated way by SMOTE, which synthesizes new minority samples rather than duplicating existing ones).
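A minimal sketch of the three sampling strategies using imbalanced-learn is shown below; the `random_state` values are arbitrary choices for reproducibility, not settings taken from the notebook.

```python
# Sketch of the three sampling strategies with imbalanced-learn.
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

samplers = {
    "oversample": RandomOverSampler(random_state=42),    # duplicate minority-class rows
    "undersample": RandomUnderSampler(random_state=42),  # drop majority-class rows
    "smote": SMOTE(random_state=42),                      # synthesize new minority-class rows
}

# Each sampler rebalances the classes in one call, e.g.:
# X_balanced, y_balanced = samplers["smote"].fit_resample(X, y)
```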
Various preprocessing steps were undertaken, such as removing certain multicollinear columns and standardizing the dataset.
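A rough sketch of that preprocessing is shown below, assuming a pandas DataFrame loaded from the dataset; the target column name, its encoding, and the columns dropped for multicollinearity are illustrative assumptions rather than the exact choices made in the notebook.

```python
# Sketch of the preprocessing step: drop multicollinear columns and standardize.
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame, multicollinear_cols):
    # Drop columns flagged as multicollinear during EDA (names passed in by the caller)
    df = df.drop(columns=multicollinear_cols)
    # Separate the target (column name and labels assumed) and encode it as 0/1
    y = (df.pop("Attrition_Flag") == "Attrited Customer").astype(int)
    # Standardize numeric features to zero mean and unit variance
    numeric_cols = df.select_dtypes("number").columns
    df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
    return df, y
```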
Afterward, several copies of the dataset were made, and each sampling technique was applied to one of them. The modified datasets were then split into training and testing sets.
Separate classifiers were then trained on each of the sampled datasets. In total, 23 different classifiers were trained. (Note that the last of these was a Random Forest trained outside of this procedure, made to test feature importance.)
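A condensed sketch of that overall procedure is shown below: each sampler (plus a "no sampling" baseline) is applied to a copy of the data, the result is split into train/test sets, and every classifier is fit and evaluated on each variant. It reuses the sketches above (`samplers`, `build_classifiers`, `evaluate`), and the exact structure is an assumption about the notebook rather than a copy of it.

```python
# Sketch of the sampler x classifier training loop.
from sklearn.model_selection import train_test_split

results = {}
for s_name, sampler in {"none": None, **samplers}.items():
    # Apply the sampler to a copy of the full feature matrix X and labels y
    X_s, y_s = (X, y) if sampler is None else sampler.fit_resample(X, y)
    X_train, X_test, y_train, y_test = train_test_split(
        X_s, y_s, test_size=0.2, stratify=y_s, random_state=42)
    for c_name, clf in build_classifiers(len(X_train)).items():
        clf.fit(X_train, y_train)
        results[(s_name, c_name)] = evaluate(clf, X_test, y_test)
```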
- Oversampling typically retained high accuracy and overall F1 score, but lower recall.
- SMOTE was not particularly great at anything, but when paired with the Random Forest classifier it tied for the highest F1 score.
- Undersampling typically retained a high recall, but lower accuracy and F1 score.
- Using no sampler yielded the highest accuracy, as stated earlier, but also the lowest recall.
- Overall, KNN and SGD were quite quick but inadequate: they achieved ~70-80% recall but low precision, and produced the worst metrics on average.
- Decision Trees were slightly better in recall and F1 score, albeit slower.
- SVM showed promise with a high recall score (90% with Oversampling and Undersampling) but was the slowest to fit and had precision close to chance (~60% at its lowest).
- Random Forests were the best overall in terms of F1 score (~80%) and also retained a 94% or higher accuracy. When paired with several samplers, this classifier achieved nearly all of the highest metric values.
In addition, two soft-voting ensembles were built, each combining three models that were either the best overall or the best in recall:
- Ensembling the best in recall (SVM Oversampled, SVM Undersampled, Random Forest Undersampled) produced a whole lesser than the sum of its parts in terms of metrics.
- Ensembling the best overall (Random Forest Undersampled, Oversampled, and SMOTE-sampled) closely resembled the performance of the Oversampled Random Forest.
- Random Forest (Under Sampling) had a 95% recall and 94% accuracy, so by the metrics defined earlier, it’s the “best” classifier.
- While it has impressive primary metrics, it suffers from poor precision; however, false positives are the lesser evil compared to false negatives.
Utilizing the best classifier (i.e. the undersampled Random Forest), feature importances can be obtained directly from the fitted model via sklearn.
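A small sketch of what that might look like is shown below; `rf_under` and `X_train` stand in for the objects built in the notebook, and the use of the built-in `feature_importances_` attribute (rather than, say, permutation importance) is an assumption.

```python
# Sketch of extracting feature importances from the fitted undersampled Random Forest.
import pandas as pd

importances = pd.Series(rf_under.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))  # top risk-factor features
```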
Of the features explored, Total Transaction Count and Total Revolving Balance are the most telling in terms of risk factors.
None of this would have been possible without tortega-24, who helped provide the Exploratory Data Analysis for the notebook, insights into preprocessing the dataset, and more.