This project applies various machine learning techniques to a credit score dataset to predict loan risk levels for customers. By exploring both supervised and unsupervised learning models, the project demonstrates and compares the effectiveness of multiple algorithms, providing a hands-on learning experience.
The primary goal of this project was to build, train, and test predictive models on a Kaggle credit score dataset. The dataset includes customer attributes such as ID, gender, income, and education as predictor variables. Since a target variable was not provided, I created a new column, `STATUS`, with random binary values (0 and 1) representing "low risk" and "high risk" categories. While random assignment does not yield meaningful real-world predictions, it enables practical model testing and comparison.
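Creating the random target column might look like the following sketch. `STATUS` is the project's column name; the data frame name `credit` and its toy contents are assumptions for illustration:

```r
# Toy stand-in for the Kaggle data frame (assumed name: `credit`).
credit <- data.frame(AMT_INCOME_TOTAL = c(202500, 270000, 67500))

set.seed(42)  # make the random labels reproducible
# 0 = "low risk", 1 = "high risk"
credit$STATUS <- sample(c(0, 1), size = nrow(credit), replace = TRUE)
```

Setting a seed keeps the random labels stable across runs, so model comparisons are repeatable.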
- **Data Preprocessing and Cleaning**
  - Loaded the dataset and removed columns with minimal impact on prediction, such as `ID`, `CODE_GENDER`, and `FLAG_EMAIL`.
  - Converted `DAYS_BIRTH` and `DAYS_EMPLOYED` to years and took absolute values to ensure positivity.
  - Transformed categorical columns (e.g., `FLAG_OWN_CAR`, `NAME_INCOME_TYPE`) into factors for compatibility.
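The preprocessing steps above could be sketched as follows. The column names come from the Kaggle dataset; the data frame name `credit` and the sample values are assumptions:

```r
# Toy rows with the raw Kaggle column names (values are illustrative).
credit <- data.frame(
  ID               = c(5008804, 5008805),
  CODE_GENDER      = c("M", "F"),
  FLAG_EMAIL       = c(0, 1),
  DAYS_BIRTH       = c(-12005, -21474),  # days relative to the application date
  DAYS_EMPLOYED    = c(-4542, -1134),
  FLAG_OWN_CAR     = c("Y", "N"),
  NAME_INCOME_TYPE = c("Working", "Pensioner")
)

# Drop low-impact columns.
credit <- credit[, !(names(credit) %in% c("ID", "CODE_GENDER", "FLAG_EMAIL"))]

# Convert day counts to positive years.
credit$DAYS_BIRTH    <- abs(credit$DAYS_BIRTH) / 365
credit$DAYS_EMPLOYED <- abs(credit$DAYS_EMPLOYED) / 365

# Turn categorical columns into factors.
credit$FLAG_OWN_CAR     <- as.factor(credit$FLAG_OWN_CAR)
credit$NAME_INCOME_TYPE <- as.factor(credit$NAME_INCOME_TYPE)
```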
- **Data Normalization and Encoding**
  - Applied normalization to numeric columns for consistent scaling.
  - Encoded categorical columns numerically, converting factor levels to integers so they work with all models.
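One way to implement these two steps is min-max scaling plus integer encoding of factor levels. The scaler choice and the toy data frame are assumptions; the README only states that numeric columns were normalized and factors converted to integers:

```r
# Min-max normalization: rescales a numeric vector to [0, 1].
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Toy processed data frame (assumed contents).
credit <- data.frame(
  AMT_INCOME_TOTAL = c(67500, 202500, 270000),
  FLAG_OWN_CAR     = factor(c("N", "Y", "Y"))
)

# Scale every numeric column.
num_cols <- sapply(credit, is.numeric)
credit[num_cols] <- lapply(credit[num_cols], normalize)

# Encode factor levels as integers so all models receive numeric input.
credit[!num_cols] <- lapply(credit[!num_cols], function(f) as.integer(as.factor(f)))
```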
- **Train-Test Split**
  - Split the data into 70% training and 30% testing sets. This partition provides an unbiased accuracy estimate by evaluating models on unseen data.
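A 70/30 split can be done with base R's `sample()`; the data frame below is a placeholder for the processed dataset:

```r
set.seed(123)  # reproducible split
# Placeholder for the processed credit data.
credit <- data.frame(x = rnorm(100), STATUS = sample(0:1, 100, replace = TRUE))

# Draw 70% of row indices for training; the rest form the test set.
train_idx <- sample(nrow(credit), size = 0.7 * nrow(credit))
train <- credit[train_idx, ]
test  <- credit[-train_idx, ]
```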
- **Modeling**
  - K-Nearest Neighbors (KNN): Used KNN with `k = 10` as a baseline model to predict loan risk, assessing accuracy by comparing predictions against actual labels.
  - Naive Bayes: Implemented Naive Bayes (using the `e1071` library), a probabilistic approach that assumes independence among predictors.
  - Decision Tree: Trained a decision tree classifier to predict loan risk, visualized with `rpart.plot` to observe decision paths.
  - K-Means Clustering (Unsupervised): Applied K-means clustering with two clusters to explore patterns in the data, visualizing cluster centers and groupings.
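The four models above might be fit along these lines. The toy dataset and column names are assumptions; the Naive Bayes call is shown commented out since it needs the `e1071` package installed, while `class`, `rpart`, and `kmeans` ship with standard R:

```r
library(class)   # knn()
library(rpart)   # decision tree

set.seed(1)
# Toy normalized dataset standing in for the processed credit data.
n <- 200
credit <- data.frame(
  income         = runif(n),
  years_employed = runif(n),
  STATUS         = factor(sample(0:1, n, replace = TRUE))
)
train_idx <- sample(n, 0.7 * n)
train <- credit[train_idx, ]
test  <- credit[-train_idx, ]

# KNN with k = 10: each test label is predicted from its 10 nearest neighbors.
knn_pred <- knn(train[, c("income", "years_employed")],
                test[,  c("income", "years_employed")],
                cl = train$STATUS, k = 10)

# Naive Bayes follows the same pattern via e1071:
# nb_model <- e1071::naiveBayes(STATUS ~ ., data = train)
# nb_pred  <- predict(nb_model, test)

# Decision tree on the same formula.
tree_model <- rpart(STATUS ~ ., data = train, method = "class")
tree_pred  <- predict(tree_model, test, type = "class")

# K-means (unsupervised): two clusters over the predictors only.
km <- kmeans(credit[, c("income", "years_employed")], centers = 2)

knn_acc  <- mean(knn_pred == test$STATUS)
tree_acc <- mean(tree_pred == test$STATUS)
```

With random labels, accuracies near 0.5 are expected; the point is comparing the workflow of each algorithm, not the scores themselves.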
- **Accuracy Comparisons**
  - Calculated and compared accuracies for the KNN, Naive Bayes, and Decision Tree models.
  - Created a bar chart with `ggplot2` to compare model accuracies visually.
- Decision Tree: Visualized to display classification rules.
- K-Means Clustering Plot: Displays data points grouped by clusters, showing natural groupings.
- Accuracy Comparison Chart: Highlights accuracy for each model to identify the best-performing algorithm.
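The accuracy comparison chart could be built with `ggplot2` roughly as follows. The accuracy values here are placeholders (with random labels they hover around 0.5); in the project they come from the fitted models:

```r
library(ggplot2)

# Placeholder accuracies; the real values come from the fitted models.
acc <- data.frame(
  model    = c("KNN", "Naive Bayes", "Decision Tree"),
  accuracy = c(0.51, 0.49, 0.50)
)

p <- ggplot(acc, aes(x = model, y = accuracy, fill = model)) +
  geom_col() +
  ylim(0, 1) +
  labs(title = "Model Accuracy Comparison", x = "Model", y = "Accuracy")
# print(p)  # render the chart
```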
- Data Manipulation: `dplyr`, `lapply`, `as.factor`
- Modeling: `class` (KNN), `e1071` (Naive Bayes), `rpart` (Decision Tree), `kmeans`
- Visualization: `ggplot2` (bar chart), `rpart.plot` (decision tree visualization)
This project demonstrates a comprehensive approach to predictive analysis using multiple machine learning models. Although the dataset lacks a genuine target variable, the random target column enables an educational exploration of various classification techniques.