w/ Logistic Regression using SAS Studio 🖥
- About Project
- Objectives
- Dataset Description
- Dataset Pre-processing
- Logistic Regression
- Output Delivery System (ODS)
👉 This dataset contains diagnoses of heart disease patients. A machine learning model is needed to determine whether a person has heart disease.
- Perform dataset exploration using various types of visualizations.
- Perform EDA on the given dataset.
- Build a logistic regression model to predict heart disease status.
👉 There are 14 variables in this dataset:
- 9 categorical variables, and
- 5 continuous variables.
👉 The structure of the two datasets that have been given:
Variable Name | Description | Sample Data |
---|---|---|
Age | Patient age (in years) | 63; 37; ... |
Sex | Gender of patient (0 = male; 1 = female) | 1; 0; ... |
cp | Chest pain type (4 values: 0, 1, 2, 3) | 3; 1; 2; ... |
trestbps | Resting blood pressure (in mm Hg) | 145; 130; ... |
chol | Serum cholesterol (in mg/dl) | 233; 250; ... |
fbs | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) | 1; 0; ... |
restecg | Resting electrocardiographic results (values 0, 1, 2) | 0; 1; ... |
thalach | Maximum heart rate achieved | 150; 187; ... |
exang | Exercise-induced angina (1 = yes; 0 = no) | 1; 0; ... |
oldpeak | ST depression induced by exercise relative to rest | 2.3; 3.5; ... |
slope | Slope of the peak exercise ST segment (values 0, 1, 2) | 0; 2; ... |
ca | Number of major vessels (0-4) colored by fluoroscopy | 0; 3; ... |
thal | (3 = normal; 6 = fixed defect; 7 = reversible defect) | 1; 3; ... |
Target | Target column (1 = Yes; 0 = No) | 1; 0; ... |
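
The structure listed above can be checked against the loaded data. The following is a minimal sketch, not the project's actual code: the file name `heart.csv` and the dataset name `work.heart` are assumptions and should be adjusted to the real location of the data.

```sas
/* Assumption: heart.csv is a hypothetical file name and path. */
proc import datafile="heart.csv"
    out=work.heart
    dbms=csv
    replace;
    getnames=yes;
run;

/* List variable names and types, and report the observation count. */
proc contents data=work.heart;
run;

/* Preview the first five rows. */
proc print data=work.heart (obs=5);
run;
```
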
- As mentioned above, there are 14 variables with 303 observations.
- sex (Gender)
  - Male patients outnumber female patients.
- cp (Chest Pain Type)
  - Chest pain type 0 is the most common type.
- fbs (Fasting Blood Sugar)
  - Most patients have a fasting blood sugar no higher than 120 mg/dl (fbs = 0).
- restecg (Resting Electrocardiographic Results)
  - Results 0 and 1 are more frequent than result 2.
  - Result 1 is the most frequent of the three.
- exang (Exercise Induced Angina)
  - Patients without exercise-induced angina outnumber patients with it.
- slope (Slope of the Peak Exercise)
  - Slopes 1 and 2 occur with almost the same frequency.
  - Slope 2 is the most frequent.
- ca (Number of Major Vessels)
  - Patients with 0 major vessels form the largest group.
- thal
  - The "thal" value 2 is the most frequent.
- target (Heart Disease Status)
  - Patients with heart disease outnumber patients without heart disease.
- age (Patient Age)
  - From the histogram and boxplot, this column appears approximately normally distributed. This is supported by its skewness value (-0.2).
  - The kurtosis value is -0.5, which indicates that the column is platykurtic.
  - In the Q-Q plot, the data values closely follow the 45-degree line, which means the data is likely normally distributed (as stated previously).
- trestbps (Resting Blood Pressure in mm Hg)
  - From the histogram, this column is moderately right skewed. This is supported by its skewness value (0.7).
  - There are some outliers at the upper part of the boxplot.
  - At the upper part of the Q-Q plot, the data values move away from the 45-degree line, which means the data is likely moderately right skewed (as stated previously).
  - The kurtosis value is 0.9. SAS reports excess kurtosis (0 for a normal distribution), so this positive value indicates that the column is mildly leptokurtic.
- chol (Serum Cholesterol in mg/dl)
  - From the histogram, this column is highly right skewed. This is supported by its skewness value (1.1).
  - There are some outliers at the upper part of the boxplot.
  - At the upper part of the Q-Q plot, there is a gap between the data values and the 45-degree line, which means the data is likely highly right skewed (as stated previously).
  - The kurtosis value is 4.5, which indicates that the column is leptokurtic.
- thalach (Maximum Heart Rate)
  - From the histogram, this column is moderately left skewed. This is supported by its skewness value (-0.5).
  - There is an outlier at the bottom part of the boxplot.
  - At the bottom part of the Q-Q plot, there is a gap between the data values and the 45-degree line, which means the data is likely moderately left skewed (as stated previously).
  - The kurtosis value is -0.06, which indicates that the column is very slightly platykurtic and close to the normal reference value of 0.
- oldpeak (ST Depression Induced by Exercise)
  - From the histogram, this column is highly right skewed. This is supported by its skewness value (1.3).
  - There are some outliers at the upper part of the boxplot.
  - In the Q-Q plot, the data values depart from the 45-degree line, which is consistent with the data being highly right skewed (as stated previously).
  - The kurtosis value is 1.57. SAS reports excess kurtosis (0 for a normal distribution), so this positive value indicates that the column is leptokurtic.
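
Summaries like the ones above can be produced with `PROC FREQ` for the categorical variables and with `PROC UNIVARIATE` and `PROC SGPLOT` for the continuous ones. This is a sketch under assumptions rather than the project's actual code: the dataset name `work.heart` is hypothetical, and the plots chosen here may differ from those in `main.sas`.

```sas
ods graphics on;

/* Frequency tables and bar charts for the categorical variables. */
proc freq data=work.heart;
    tables sex cp fbs restecg exang slope ca thal target
           / nocum plots=freqplot;
run;

/* Moments (including skewness and kurtosis), histograms, and
   normal Q-Q plots for the continuous variables. */
proc univariate data=work.heart;
    var age trestbps chol thalach oldpeak;
    histogram age trestbps chol thalach oldpeak / normal;
    qqplot age trestbps chol thalach oldpeak / normal(mu=est sigma=est);
run;

/* Boxplot for one continuous variable; repeat for the others. */
proc sgplot data=work.heart;
    vbox chol;
run;
```

`PROC UNIVARIATE` reports excess kurtosis, so values are read against 0 rather than 3: negative values are platykurtic and positive values are leptokurtic.
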
- During data pre-processing, one-hot encoding is applied to these columns:
  - cp (into cp_0, cp_1, cp_2, and cp_3)
  - thal (into thal_0, thal_1, thal_2, and thal_3)
  - slope (into slope_0, slope_1, and slope_2)
- After one-hot encoding, the original columns (cp, thal, and slope) are dropped from the table.
- Then, the observations are split into 80% train and 20% test sets using `PROC SURVEYSELECT`.
- Next, the new column (`Selected`) is dropped from both the train and test data.
- Finally, the target values in the test set are changed to `NULL` (missing) values.

Each data pre-processing step is available in part no. 3 of the `main.sas` file.
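
The pre-processing steps above can be sketched as follows. This is an illustration, not a copy of `main.sas`: the dataset names (`work.heart`, `work.heart_encoded`, `work.heart_split`, `work.train`, `work.test`), the sampling method, and the seed value are all assumptions.

```sas
/* One-hot encode cp, thal, and slope, then drop the originals.
   A comparison such as (cp = 0) evaluates to 1 or 0 in SAS. */
data work.heart_encoded;
    set work.heart;
    cp_0 = (cp = 0);  cp_1 = (cp = 1);  cp_2 = (cp = 2);  cp_3 = (cp = 3);
    thal_0 = (thal = 0);  thal_1 = (thal = 1);
    thal_2 = (thal = 2);  thal_3 = (thal = 3);
    slope_0 = (slope = 0);  slope_1 = (slope = 1);  slope_2 = (slope = 2);
    drop cp thal slope;
run;

/* Sample 80% of the rows. OUTALL keeps every row and adds the
   Selected indicator (1 = sampled, 0 = not sampled).
   Assumption: simple random sampling with an arbitrary seed. */
proc surveyselect data=work.heart_encoded
    out=work.heart_split
    method=srs
    samprate=0.8
    seed=12345
    outall;
run;

/* Route sampled rows to train and the rest to test,
   blanking out the target in the test set. */
data work.train work.test;
    set work.heart_split;
    if Selected = 1 then output work.train;
    else do;
        target = .;   /* missing value stands in for NULL */
        output work.test;
    end;
    drop Selected;
run;
```

Fixing the seed makes the split reproducible, so the same rows land in the train set on every run.
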
- [Image 1] - The train set contains 243 observations with no missing values. In addition, the numbers of patients with and without heart disease are balanced.
- [Image 2] - The "Model Convergence Status" reports that the convergence criterion is satisfied, which means the maximum likelihood estimation reached a stable solution. Convergence by itself does not show that the model predicts well. The fit statistics report an AIC value smaller than the SC value, which is expected because SC penalizes each parameter more heavily at this sample size; both statistics are mainly useful for comparing candidate models (lower is better).
- [Image 3] - The p-values in the "Pr > ChiSq" column show that not all variables are significant in the model. A p-value must be less than 0.05 for a variable to have a statistically significant effect on heart disease status. (Examples of significant predictors: sex, cp_0, exang, etc.)
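
Output of the kind described above comes from `PROC LOGISTIC`. The sketch below is a hedged example, not the project's code: the dataset names `work.train`, `work.test`, and `work.predictions` are assumptions, and the predictor list simply uses every column named in this document.

```sas
proc logistic data=work.train;
    /* Model the probability that target = 1 (heart disease present). */
    model target(event='1') = age sex trestbps chol fbs restecg thalach
                              exang oldpeak ca
                              cp_0 cp_1 cp_2 cp_3
                              thal_0 thal_1 thal_2 thal_3
                              slope_0 slope_1 slope_2;
    /* Score the test set. The output holds the predicted probabilities
       (P_0 and P_1) and the predicted class (I_target). */
    score data=work.test out=work.predictions;
run;
```

Because every level of each one-hot group is included, one dummy per group is a linear combination of the others; `PROC LOGISTIC` handles this by setting the redundant parameters to 0 and noting it in the output.
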
- Output Delivery System (ODS) presents the output of a SAS program as a formatted report, which helps the user understand the results of the analysis more easily. In this case, the predictions are exported as a PDF file (`.pdf`).
- The prediction report can be seen here.

Each step for creating the output (ODS) file is available in part no. 5 of the `main.sas` file.
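
A minimal sketch of exporting predictions to PDF with ODS is shown below. The file name `prediction_report.pdf`, the title text, and the dataset name `work.predictions` are assumptions for illustration only.

```sas
/* Assumption: the output path and file name are hypothetical. */
ods pdf file="prediction_report.pdf";

title "Heart Disease Predictions";

/* Everything printed between ODS PDF and ODS PDF CLOSE goes to the file. */
proc print data=work.predictions noobs;
run;

ods pdf close;
title;   /* clear the title for later steps */
```
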
👉 If you find this project useful, please ⭐ this repository 😆!
🎈 Check out my work on Kaggle here using various machine learning models!
👉 More about myself: here