
This project includes data preprocessing of the dataset, feature engineering and comparative prediction models with different machine learning methods.


SubeyteT/End_to_End_Diabetes_Prediction


Diabetes_Prediction

This project includes data preprocessing of the dataset, feature engineering and comparative prediction models with different machine learning methods.

Aim of the project: To develop and compare machine learning models that predict whether a person has diabetes (1) or not (0) based on their diagnostic measurements.

Dataset Information

The dataset is part of a larger dataset held by the National Institute of Diabetes and Digestive and Kidney Diseases in the USA. The study covers 768 Pima Indian women who are at least 21 years old and live in Phoenix, USA. The dataset contains 768 observations and 8 numerical independent variables. The target variable is "Outcome": (1) means the person has diabetes and (0) means the person does not have diabetes.

Variables:

  • Pregnancies: Number of times pregnant
  • Glucose: Plasma glucose concentration over 2 hours in an oral glucose tolerance test
  • BloodPressure: Diastolic blood pressure (mm Hg)
  • SkinThickness: Triceps skin fold thickness (mm)
  • Insulin: 2-Hour serum insulin (mu U/ml)
  • BMI: Body mass index (weight in kg / (height in m)^2)
  • DiabetesPedigreeFunction: Diabetes pedigree function (a function which scores likelihood of diabetes based on family history)
  • Age: Age (years)
  • Outcome: Class variable (0 if non-diabetic, 1 if diabetic)

The dataset is publicly available.

Process

  • Data loading
  • Exploratory data analysis (includes graphs)
  • Feature engineering
  • Outlier detection
  • Missing value detection
  • Encoding (Label, rare and one-hot encoding)
  • Scaling
  • Regression Models (Logistic and Linear)
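
As an illustration of the scaling step, here is a minimal sketch of standard (z-score) scaling using only the Python standard library. The helper name `standard_scale` and the sample glucose values are hypothetical and do not come from the repository; the project itself may well use a library scaler instead. The sketch divides by the population standard deviation, which is an assumption about the convention used.

```python
from statistics import fmean, pstdev


def standard_scale(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Illustrative glucose readings (made up for this example).
glucose = [85.0, 148.0, 183.0, 89.0, 137.0]
scaled = standard_scale(glucose)
```

In a train/test setup, the mean and standard deviation would normally be computed on the training split only and then reused for the test split, so that no information leaks from the test data.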

Summary of feature engineering:

Correlations between variables were reviewed before the preprocessing steps. The literature was also consulted, and an expert (a doctor) was asked for an opinion to support the observations below.

  • Blood pressure, glucose, skin thickness, insulin and BMI values can't be 0, so zeros in these columns should be treated as NA.
  • Glucose affects blood pressure.
  • More broadly, skin thickness for women aged 21 and older ranges from 7 to 27 mm.
  • The optimal glucose level is 70-100. A glucose level below 70 is low, a level between 100 and 125 is risky, and a level above 125 indicates potential diabetes.
  • Insulin levels are widely distributed but mostly fall between 40 and 460, so observations above 460 can be considered outliers.
  • Pregnancy is also correlated with the outcome; it may trigger diabetes.
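
The rules above could be expressed roughly as follows. This is a sketch under stated assumptions, not the repository's code: the function names, the category labels and the sample row are hypothetical, and a reading of exactly 100 is assigned to the optimal band because the ranges above overlap at that value.

```python
import math

# Columns where a value of 0 is physiologically impossible (from the list above).
ZERO_INVALID = ("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")
INSULIN_UPPER = 460  # observations above this are treated as outliers


def zeros_to_missing(row):
    """Return a copy of the row with impossible zeros replaced by NaN."""
    cleaned = dict(row)
    for col in ZERO_INVALID:
        if cleaned.get(col) == 0:
            cleaned[col] = math.nan
    return cleaned


def glucose_category(glucose):
    """Bucket a glucose reading; the boundary at exactly 100 is an assumption."""
    if math.isnan(glucose):
        return "unknown"
    if glucose < 70:
        return "low"
    if glucose <= 100:
        return "optimal"
    if glucose <= 125:
        return "risky"
    return "potential_diabetes"


def is_insulin_outlier(insulin):
    """Flag insulin readings above the upper bound; missing values are not flagged."""
    return (not math.isnan(insulin)) and insulin > INSULIN_UPPER


# Illustrative row (values made up for this example).
row = {"Pregnancies": 0, "Glucose": 89, "BloodPressure": 66,
       "SkinThickness": 0, "Insulin": 0, "BMI": 28.1}
cleaned = zeros_to_missing(row)
```

Note that `Pregnancies` is left untouched, since zero pregnancies is a valid value.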

Machine Learning Methods

So far, only logistic regression has been applied. Standard scaling was chosen after several trials. Test results for the logistic regression model:

  • ACCURACY: 0.773
  • PRECISION: 0.708
  • RECALL: 0.618
  • F1: 0.660
  • AUC: 0.874
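
For readers unfamiliar with these metrics, the sketch below shows how accuracy, precision, recall and F1 follow from confusion-matrix counts. The function name and the counts are hypothetical, since the project's actual confusion matrix is not given here. AUC is not included because it requires predicted probabilities rather than counts. The last lines check that the reported F1 is consistent with the reported precision and recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to illustrate the formulas.
metrics = classification_metrics(tp=3, fp=1, fn=2, tn=4)

# F1 is the harmonic mean of precision and recall, so the reported
# values (0.708 and 0.618) can be cross-checked against the reported F1.
reported_f1 = 2 * 0.708 * 0.618 / (0.708 + 0.618)
print(round(reported_f1, 3))
```

The recomputed value rounds to 0.66, which agrees with the F1 score listed above.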

New ML models will be uploaded.

References: https://doi.org/10.2337/diacare.12.5.309
