Heart Disease with Random Forests
{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.10.13","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"none","dataSources":[{"sourceId":6674905,"sourceType":"datasetVersion","datasetId":1936563}],"dockerImageVersionId":30746,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# Introduction","metadata":{}},{"cell_type":"markdown","source":"Heart disease is a broad term for a range of conditions affecting the heart's function and structure. It is one of the leading causes of death worldwide, often linked to lifestyle factors like poor diet, lack of exercise, and smoking. Common types include coronary artery disease, heart attacks, and heart failure. Early detection and management through lifestyle changes or medical intervention can significantly improve outcomes. Understanding the risk factors and symptoms is key to prevention and treatment.","metadata":{}},{"cell_type":"markdown","source":"![Heart Disease](https://d3b6u46udi9ohd.cloudfront.net/wp-content/uploads/2023/06/04055004/Heart-failure.jpg)","metadata":{}},{"cell_type":"markdown","source":"Heart disease often involves the buildup of plaque in the coronary arteries, reducing blood flow to the heart muscle. This condition, called atherosclerosis, can lead to serious complications like heart attacks. Timely diagnosis and treatment, such as lifestyle changes or medical intervention, are essential to manage and prevent further damage to the heart. 
In this notebook, we use the dataset to explore which factors are associated with heart disease.","metadata":{}},{"cell_type":"markdown","source":"![Heart Disease](https://cdn-bohdg.nitrocdn.com/LRSkEHBfAjwsEFOOHlbAXIhAeKQgiLsG/assets/images/optimized/rev-803d478/www.thekeyholeheartclinic.com/wp-content/uploads/2021/04/Coronary-disease-illustration.jpg)","metadata":{}},{"cell_type":"markdown","source":"# Overview","metadata":{}},{"cell_type":"markdown","source":"The columns in the dataset are:\n- HeartDisease\n- BMI\n- Smoking\n- AlcoholDrinking\n- Stroke\n- PhysicalHealth\n- MentalHealth\n- DiffWalking\n- Sex\n- AgeCategory\n- Race\n- Diabetic\n- PhysicalActivity\n- GenHealth\n- SleepTime\n- Asthma\n- KidneyDisease\n- SkinCancer","metadata":{}},{"cell_type":"markdown","source":"# Importing","metadata":{}},{"cell_type":"code","source":"import pickle\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nimport warnings\nwarnings.filterwarnings(\"ignore\")\nfrom sklearn.model_selection import train_test_split , cross_val_score , RepeatedStratifiedKFold\nfrom sklearn.pipeline import make_pipeline\nfrom category_encoders import OneHotEncoder\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.svm import SVC\nimport imblearn\nfrom collections import Counter\nfrom imblearn.over_sampling import SMOTE\nfrom imblearn.under_sampling import RandomUnderSampler\nfrom sklearn.preprocessing import MinMaxScaler\nfrom sklearn.dummy import DummyClassifier\nfrom sklearn.metrics import roc_auc_score , classification_report , accuracy_score ,
confusion_matrix","metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","execution":{"iopub.status.busy":"2024-09-23T05:13:52.238099Z","iopub.execute_input":"2024-09-23T05:13:52.238481Z","iopub.status.idle":"2024-09-23T05:13:54.560852Z","shell.execute_reply.started":"2024-09-23T05:13:52.238445Z","shell.execute_reply":"2024-09-23T05:13:54.55965Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"df=pd.read_csv(\"/kaggle/input/personal-key-indicators-of-heart-disease/2020/heart_2020_cleaned.csv\")\n","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:13:54.563015Z","iopub.execute_input":"2024-09-23T05:13:54.563604Z","iopub.status.idle":"2024-09-23T05:13:55.944176Z","shell.execute_reply.started":"2024-09-23T05:13:54.563539Z","shell.execute_reply":"2024-09-23T05:13:55.943022Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"df.head()","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:13:55.945622Z","iopub.execute_input":"2024-09-23T05:13:55.946038Z","iopub.status.idle":"2024-09-23T05:13:55.985011Z","shell.execute_reply.started":"2024-09-23T05:13:55.946Z","shell.execute_reply":"2024-09-23T05:13:55.983784Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"df.info()","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:13:57.397416Z","iopub.execute_input":"2024-09-23T05:13:57.397854Z","iopub.status.idle":"2024-09-23T05:13:57.902465Z","shell.execute_reply.started":"2024-09-23T05:13:57.39782Z","shell.execute_reply":"2024-09-23T05:13:57.901272Z"},"trusted":true},"execution_count":null,"outputs":[]},
{"cell_type":"code","source":"# Print the unique values of every categorical (object-typed) column\nfor i in df.columns:\n    if df[i].dtype == \"object\":\n        print(i, df[i].unique())","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:14:11.496687Z","iopub.execute_input":"2024-09-23T05:14:11.497198Z","iopub.status.idle":"2024-09-23T05:14:11.919155Z","shell.execute_reply.started":"2024-09-23T05:14:11.497151Z","shell.execute_reply":"2024-09-23T05:14:11.91791Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# EDA","metadata":{}},{"cell_type":"markdown","source":"### What is the nature of the target column?","metadata":{}},{"cell_type":"code","source":"plt.hist(df[\"HeartDisease\"])\nplt.show()","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:17:55.269183Z","iopub.execute_input":"2024-09-23T05:17:55.269607Z","iopub.status.idle":"2024-09-23T05:17:55.638854Z","shell.execute_reply.started":"2024-09-23T05:17:55.269575Z","shell.execute_reply":"2024-09-23T05:17:55.637748Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"The target classes are imbalanced.","metadata":{}},{"cell_type":"code","source":"plt.hist(df[df['HeartDisease']=='No'][\"Smoking\"])\nplt.show()","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:32:59.224316Z","iopub.execute_input":"2024-09-23T05:32:59.224763Z","iopub.status.idle":"2024-09-23T05:32:59.658686Z","shell.execute_reply.started":"2024-09-23T05:32:59.224724Z","shell.execute_reply":"2024-09-23T05:32:59.657171Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"plt.hist(df[df['HeartDisease']=='Yes'][\"Smoking\"])\nplt.show()","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:33:07.760009Z","iopub.execute_input":"2024-09-23T05:33:07.760921Z","iopub.status.idle":"2024-09-23T05:33:08.064702Z","shell.execute_reply.started":"2024-09-23T05:33:07.760879Z","shell.execute_reply":"2024-09-23T05:33:08.063609Z"},"trusted":true},"execution_count":null,"outputs":[]},
{"cell_type":"markdown","source":"### The previous plots suggest that smoking is associated with heart disease","metadata":{}},{"cell_type":"markdown","source":"### Is there a relationship between alcohol and heart disease?","metadata":{}},{"cell_type":"code","source":"plt.pie(df.AlcoholDrinking.value_counts(),colors=['#384B70','#CDC1FF'],autopct='%1.1f%%',labels=df.AlcoholDrinking.value_counts().index)\n","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:43:17.048293Z","iopub.execute_input":"2024-09-23T05:43:17.048738Z","iopub.status.idle":"2024-09-23T05:43:17.311579Z","shell.execute_reply.started":"2024-09-23T05:43:17.048704Z","shell.execute_reply":"2024-09-23T05:43:17.309776Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"sns.countplot(x='AlcoholDrinking',hue='HeartDisease',data=df,palette='Set2')","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:43:21.877239Z","iopub.execute_input":"2024-09-23T05:43:21.877643Z","iopub.status.idle":"2024-09-23T05:43:22.765483Z","shell.execute_reply.started":"2024-09-23T05:43:21.877611Z","shell.execute_reply":"2024-09-23T05:43:22.764369Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"### Relationship between Stroke & Heart Disease","metadata":{}},{"cell_type":"code","source":"#Stroke\nsns.countplot(x='Stroke',hue='HeartDisease',data=df,palette='Set2')","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:44:07.739742Z","iopub.execute_input":"2024-09-23T05:44:07.740723Z","iopub.status.idle":"2024-09-23T05:44:08.571176Z","shell.execute_reply.started":"2024-09-23T05:44:07.740687Z","shell.execute_reply":"2024-09-23T05:44:08.569999Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"### Age Effect","metadata":{}},{"cell_type":"code","source":"plt.figure(figsize=(12, 8))\n\n# Count and sort AgeCategory in descending order\nsorted_order = df[df['HeartDisease'] == 'Yes']['AgeCategory'].value_counts().index\n\n# Plot the countplot with sorted AgeCategory and enlarged
size\nsns.countplot(x='AgeCategory', hue='HeartDisease',\n    data=df[df['HeartDisease'] == 'Yes'],\n    palette='Set2',\n    order=sorted_order)","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:18:11.19562Z","iopub.execute_input":"2024-09-23T05:18:11.19612Z","iopub.status.idle":"2024-09-23T05:18:11.819368Z","shell.execute_reply.started":"2024-09-23T05:18:11.196075Z","shell.execute_reply":"2024-09-23T05:18:11.818082Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"There is a relationship between age and heart disease.","metadata":{}},{"cell_type":"code","source":"plt.figure(figsize=(12, 8))\n\n# Order the Race categories by descending count among heart disease cases\nsorted_order = df[df['HeartDisease'] == 'Yes']['Race'].value_counts().index\n\n# Plot the counts per Race for heart disease cases, in that order\nsns.countplot(x='Race', hue='HeartDisease',\n    data=df[df['HeartDisease'] == 'Yes'],\n    palette='rocket',\n    order=sorted_order)","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:18:29.62267Z","iopub.execute_input":"2024-09-23T05:18:29.623078Z","iopub.status.idle":"2024-09-23T05:18:30.082535Z","shell.execute_reply.started":"2024-09-23T05:18:29.623046Z","shell.execute_reply":"2024-09-23T05:18:30.081302Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"### White respondents make up the largest share of heart disease cases","metadata":{}},{"cell_type":"markdown","source":"### The
Diabetic","metadata":{}},{"cell_type":"code","source":"# Share of each Diabetic category in the full dataset\nplt.pie(df.Diabetic.value_counts(),colors=['#A1D6B2','#CEDF9F','#F1F3C2','#E8B86D'],autopct='%1.1f%%',labels=df.Diabetic.value_counts().index)\n","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:45:07.66575Z","iopub.execute_input":"2024-09-23T05:45:07.666177Z","iopub.status.idle":"2024-09-23T05:45:07.936104Z","shell.execute_reply.started":"2024-09-23T05:45:07.666145Z","shell.execute_reply":"2024-09-23T05:45:07.934647Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"plt.figure(figsize=(12, 8))\n\n# Order the Race categories by descending count among respondents without heart disease\nsorted_order = df[df['HeartDisease'] == 'No']['Race'].value_counts().index\n\n# Plot the counts per Race for respondents without heart disease, in that order\nsns.countplot(x='Race', hue='HeartDisease',\n    data=df[df['HeartDisease'] == 'No'],\n    palette='crest',\n    order=sorted_order)","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:18:47.62022Z","iopub.execute_input":"2024-09-23T05:18:47.620639Z","iopub.status.idle":"2024-09-23T05:18:48.638764Z","shell.execute_reply.started":"2024-09-23T05:18:47.620607Z","shell.execute_reply":"2024-09-23T05:18:48.637526Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# List the numeric (non-object) columns\nfor i in df.columns:\n    if df[i].dtype != \"object\":\n        print(i)","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:28:28.861376Z","iopub.execute_input":"2024-09-23T05:28:28.861799Z","iopub.status.idle":"2024-09-23T05:28:28.868389Z","shell.execute_reply.started":"2024-09-23T05:28:28.861766Z","shell.execute_reply":"2024-09-23T05:28:28.867254Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"### Mental Health with
Age","metadata":{}},{"cell_type":"code","source":"#MentalHealth\nsns.kdeplot(data=df[df['HeartDisease']=='Yes'],x='MentalHealth',hue='AgeCategory')","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:30:20.020906Z","iopub.execute_input":"2024-09-23T05:30:20.021335Z","iopub.status.idle":"2024-09-23T05:30:20.930076Z","shell.execute_reply.started":"2024-09-23T05:30:20.021301Z","shell.execute_reply":"2024-09-23T05:30:20.928909Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"### Sleep Time","metadata":{}},{"cell_type":"code","source":"sns.kdeplot(data=df[df['HeartDisease']=='Yes'],x='SleepTime')","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:48:18.128463Z","iopub.execute_input":"2024-09-23T05:48:18.12888Z","iopub.status.idle":"2024-09-23T05:48:18.700462Z","shell.execute_reply.started":"2024-09-23T05:48:18.12885Z","shell.execute_reply":"2024-09-23T05:48:18.699198Z"},"trusted":true},"execution_count":null,"outputs":[]},
{"cell_type":"markdown","source":"### The high density around 8 hours is not informative, because that is a normal amount of sleep. However, the spread of the distribution starts to grow from about 4 hours.","metadata":{}},{"cell_type":"markdown","source":"# Encoding","metadata":{}},{"cell_type":"code","source":"from sklearn.preprocessing import LabelEncoder as LE\nle=LE()","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:59:07.462414Z","iopub.execute_input":"2024-09-23T05:59:07.463239Z","iopub.status.idle":"2024-09-23T05:59:07.469319Z","shell.execute_reply.started":"2024-09-23T05:59:07.463194Z","shell.execute_reply":"2024-09-23T05:59:07.467707Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# Label-encode every categorical (object-typed) column in place\nfor column in df.columns:\n    if df[column].dtype == \"object\":\n        df[column] = le.fit_transform(df[column])","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:59:08.112697Z","iopub.execute_input":"2024-09-23T05:59:08.113113Z","iopub.status.idle":"2024-09-23T05:59:09.456161Z","shell.execute_reply.started":"2024-09-23T05:59:08.11308Z","shell.execute_reply":"2024-09-23T05:59:09.455088Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"df.head()","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:59:11.731201Z","iopub.execute_input":"2024-09-23T05:59:11.731609Z","iopub.status.idle":"2024-09-23T05:59:11.752812Z","shell.execute_reply.started":"2024-09-23T05:59:11.731571Z","shell.execute_reply":"2024-09-23T05:59:11.751575Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"correlation_matrix = df.corr()\nHeart_Disease_correlation = correlation_matrix['HeartDisease'].sort_values(ascending=True)\nordered_correlation_matrix = correlation_matrix.loc[Heart_Disease_correlation.index, Heart_Disease_correlation.index]\nplt.figure(figsize=(12, 10))\nsns.heatmap(ordered_correlation_matrix[['HeartDisease']], annot=True,cmap='coolwarm', fmt='.2f', vmin=-1, vmax=1)\nplt.title('Correlation Heatmap with Heart Disease
(Ordered)')\nplt.show()","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:59:13.6688Z","iopub.execute_input":"2024-09-23T05:59:13.669211Z","iopub.status.idle":"2024-09-23T05:59:14.679001Z","shell.execute_reply.started":"2024-09-23T05:59:13.669176Z","shell.execute_reply":"2024-09-23T05:59:14.677762Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"target = df['HeartDisease']\nx=df.drop('HeartDisease',axis=1)\nX_train , X_test , y_train , y_test = train_test_split(x ,target ,test_size=0.2 , random_state=42 )\nprint(\"X_train shape:\", X_train.shape)\nprint(\"y_train shape:\", y_train.shape)\nprint(\"X_test shape:\", X_test.shape)\nprint(\"y_test shape:\", y_test.shape)","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:59:24.000038Z","iopub.execute_input":"2024-09-23T05:59:24.000435Z","iopub.status.idle":"2024-09-23T05:59:24.16115Z","shell.execute_reply.started":"2024-09-23T05:59:24.000403Z","shell.execute_reply":"2024-09-23T05:59:24.159918Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# Handling Imbalanced Data","metadata":{}},{"cell_type":"code","source":"over = SMOTE(sampling_strategy = 1)\nunder = RandomUnderSampler(sampling_strategy = 0.1)\n\nX_train_resampled, y_train_resampled = under.fit_resample(X_train, y_train)\nX_train_resampled, y_train_resampled = over.fit_resample(X_train_resampled, y_train_resampled)\nCounter(y_train_resampled)","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:59:37.438036Z","iopub.execute_input":"2024-09-23T05:59:37.438488Z","iopub.status.idle":"2024-09-23T05:59:39.254806Z","shell.execute_reply.started":"2024-09-23T05:59:37.438451Z","shell.execute_reply":"2024-09-23T05:59:39.253803Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"def train(classifier,x_train,y_train,x_test,y_test):\n \n classifier.fit(x_train,y_train)\n prediction = classifier.predict(x_test)\n cv = 
RepeatedStratifiedKFold(n_splits = 10,n_repeats = 3,random_state = 1)\n print(\"Cross Validation Score : \",'{0:.2%}'.format(cross_val_score(classifier,x_train,y_train,cv = cv,scoring = 'roc_auc').mean()))\n \n\ndef model_evaluation(classifier,x_test,y_test):\n \n # Confusion Matrix\n cm = confusion_matrix(y_test,classifier.predict(x_test))\n names = ['True Neg','False Pos','False Neg','True Pos']\n counts = [value for value in cm.flatten()]\n percentages = ['{0:.2%}'.format(value) for value in cm.flatten()/np.sum(cm)]\n labels = [f'{v1}\\n{v2}\\n{v3}' for v1, v2, v3 in zip(names,counts,percentages)]\n labels = np.asarray(labels).reshape(2,2)\n sns.heatmap(cm,annot = labels,cmap = 'Greens',fmt ='')\n \n # Classification Report\n print(classification_report(y_test,classifier.predict(x_test)))","metadata":{"execution":{"iopub.status.busy":"2024-09-23T05:59:43.61949Z","iopub.execute_input":"2024-09-23T05:59:43.619919Z","iopub.status.idle":"2024-09-23T05:59:43.630962Z","shell.execute_reply.started":"2024-09-23T05:59:43.619889Z","shell.execute_reply":"2024-09-23T05:59:43.629768Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# Baseline","metadata":{}},{"cell_type":"code","source":"dummy_classifier = DummyClassifier(strategy = 'most_frequent') \ndummy_classifier.fit(X_train, y_train) \ny_pred = dummy_classifier.predict(X_test)\naccuracy = accuracy_score(y_test, y_pred)\nprint(f\"Baseline Model Accuracy: {accuracy:.4f}\")","metadata":{"execution":{"iopub.status.busy":"2024-09-23T06:00:29.276396Z","iopub.execute_input":"2024-09-23T06:00:29.27727Z","iopub.status.idle":"2024-09-23T06:00:29.300397Z","shell.execute_reply.started":"2024-09-23T06:00:29.277222Z","shell.execute_reply":"2024-09-23T06:00:29.299297Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# Random Forest ","metadata":{}},{"cell_type":"code","source":"rf_classifier = make_pipeline(\n OneHotEncoder(use_cat_names = True),\n 
MinMaxScaler(),\n RandomForestClassifier(n_estimators=10, random_state=42)\n)","metadata":{"execution":{"iopub.status.busy":"2024-09-23T06:00:49.612086Z","iopub.execute_input":"2024-09-23T06:00:49.612485Z","iopub.status.idle":"2024-09-23T06:00:49.618503Z","shell.execute_reply.started":"2024-09-23T06:00:49.612452Z","shell.execute_reply":"2024-09-23T06:00:49.617161Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"train(rf_classifier, X_train_resampled, y_train_resampled, X_test, y_test)\nmodel_evaluation(rf_classifier, X_test, y_test)","metadata":{"execution":{"iopub.status.busy":"2024-09-23T06:00:51.552749Z","iopub.execute_input":"2024-09-23T06:00:51.553136Z","iopub.status.idle":"2024-09-23T06:06:02.993065Z","shell.execute_reply.started":"2024-09-23T06:00:51.553107Z","shell.execute_reply":"2024-09-23T06:06:02.991914Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# Conclusion\n- The older you are, the greater your chance of Heart Disease.\n- The percentage of people with heart disease is 6%.","metadata":{}},{"cell_type":"markdown","source":"## Thank You","metadata":{}}]}