This project predicts whether a person's salary exceeds 50 thousand dollars per year from demographic and employment characteristics. It uses the Adult dataset from the UCI (University of California, Irvine) Machine Learning Repository.
The dataset contains 14 predictor variables and a target variable, with a total of 32,561 samples. The target variable indicates whether a person's salary is above 50K (">50K") or at most 50K ("<=50K"). The predictor variables are age, work class, final weight (fnlwgt), education, education number (education-num), marital status, occupation, relationship, race, sex, capital gain, capital loss, hours per week, and native country.
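As a rough illustration of this layout, the sketch below parses a few rows in the comma-separated format of the raw data file, using only the Python standard library. The name "income" for the target column is an assumption (the raw file leaves it unnamed), the two sample rows are made up for the example and are not real records, and the sketch assumes missing values are marked with "?".

```python
import csv
import io

# The 14 predictor names follow the dataset's documentation.
# "income" is an assumed name for the unnamed target column.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]
NUMERIC = {
    "age", "fnlwgt", "education-num",
    "capital-gain", "capital-loss", "hours-per-week",
}

# Made-up rows in the raw file's layout (not real records).
SAMPLE = """\
41, Private, 150000, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K
52, ?, 120000, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 45, United-States, >50K
"""


def parse_rows(text):
    """Parse raw rows into dicts, with None for missing values
    and a binary target (1 for ">50K", 0 for "<=50K")."""
    rows = []
    # skipinitialspace drops the blank that follows each comma.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = {}
        for name, value in zip(COLUMNS, fields):
            value = value.strip()
            if value == "?":
                row[name] = None  # assumed missing-value marker
            elif name in NUMERIC:
                row[name] = int(value)
            else:
                row[name] = value
        row["income"] = 1 if row["income"] == ">50K" else 0
        rows.append(row)
    return rows


rows = parse_rows(SAMPLE)
print(len(COLUMNS) - 1, "predictors;", len(rows), "rows parsed")
```

In a notebook, the same file could also be read into a data frame; the plain-Python version is used here only to make the column layout and target encoding explicit.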
The project involves several key tasks:
Preprocessing the dataset, which includes handling numerical and categorical features, dealing with outliers and null values, and transforming data.
Implementing various supervised learning models, specifically Logistic Regression, K-Nearest Neighbors, and Decision Trees.
Evaluating the performance of these models and selecting the most effective one.
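The tasks above can be sketched end to end with scikit-learn. This is a minimal illustration, not the notebook's actual code: it runs on a small synthetic stand-in for the data, and the imputation strategies, the hyperparameters, and the use of 5-fold cross-validated accuracy as the selection criterion are all assumptions. Outlier handling is left out of the sketch.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a few Adult columns (NOT real records).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(18, 70, n).astype(float),
    "hours-per-week": rng.integers(10, 70, n).astype(float),
    "education": rng.choice(["HS-grad", "Bachelors", "Masters"], n),
    "sex": rng.choice(["Male", "Female"], n),
})
# Toy binary target, computed before any values are blanked out.
noise = rng.normal(0, 10, n)
y = ((df["age"] + df["hours-per-week"] + noise) > 85).astype(int)
# Inject some null values so the imputation step has work to do.
df.loc[rng.choice(n, 15, replace=False), "age"] = np.nan

numeric_cols = ["age", "hours-per-week"]
categorical_cols = ["education", "sex"]

# Numerical features: fill nulls with the median, then standardize.
numeric_prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical features: fill nulls with the mode, then one-hot encode.
categorical_prep = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_prep, numeric_cols),
    ("cat", categorical_prep, categorical_cols),
])

# Hyperparameters here are illustrative choices, not tuned values.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {}
for name, model in models.items():
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    scores[name] = cross_val_score(
        pipe, df, y, cv=5, scoring="accuracy"
    ).mean()

best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("Selected:", best)
```

Wrapping the preprocessing and the model in one pipeline means the imputer, scaler, and encoder are refit on each training fold, so no information from the held-out fold leaks into the preprocessing.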
The whole process is implemented and illustrated in a Jupyter notebook, which forms the main deliverable of the project.