EDA and PCA of Myocardial Infarction Dataset
The myocardial infarction dataset (MI dataset) was noisy and required substantial preprocessing before further analysis. A check for null values showed that 89 of the 124 variables contained missing values. We were interested in the variables obtained at the time of hospital admission. We dropped the variables with more than 30% null values, along with the patient ID variable, which reduced the number of variables available for further analysis to 98. The resulting dataset had 63 categorical variables, 34 numerical variables, and one categorical outcome variable with eight categories (alive and seven complication outcomes). The outcome variable was converted to a binary variable (dead or alive) for further analysis. Missing values were imputed with basic techniques: the mean for continuous variables and the mode for categorical variables. We also encoded some of the character-type categorical variables as numeric values.
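The preprocessing steps described above can be illustrated with a minimal sketch. It is written in Python using only the standard library and is not the code used in the study. The column names, the toy records, and the assumption that outcome code 0 means "alive" are all hypothetical and chosen only for illustration.

```python
from statistics import mean, mode

def drop_sparse_columns(rows, max_null_frac=0.30):
    """Drop columns whose fraction of missing values (None) exceeds max_null_frac."""
    n = len(rows)
    keep = [c for c in rows[0]
            if sum(r[c] is None for r in rows) / n <= max_null_frac]
    return [{c: r[c] for c in keep} for r in rows]

def impute(rows, numeric_cols):
    """Fill missing values with the mean (numeric columns) or the mode (all others)."""
    filled = [dict(r) for r in rows]
    for c in rows[0]:
        observed = [r[c] for r in rows if r[c] is not None]
        fill = mean(observed) if c in numeric_cols else mode(observed)
        for r in filled:
            if r[c] is None:
                r[c] = fill
    return filled

def binarize_outcome(rows, outcome_col, alive_code=0):
    """Collapse a multi-category outcome to 0 (alive) or 1 (dead).

    Assumption: alive_code marks survivors; every other code is a fatal outcome.
    """
    return [{**r, outcome_col: 0 if r[outcome_col] == alive_code else 1}
            for r in rows]

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 60,   "sex": "M",  "sparse": None, "outcome": 0},
    {"age": None, "sex": "F",  "sparse": None, "outcome": 3},
    {"age": 70,   "sex": None, "sparse": 1,    "outcome": 0},
    {"age": 80,   "sex": "M",  "sparse": None, "outcome": 7},
]

clean = drop_sparse_columns(rows)             # "sparse" is 75% missing and is dropped
clean = impute(clean, numeric_cols={"age"})   # mean for age, mode for sex
clean = binarize_outcome(clean, "outcome")    # eight categories reduced to two
```

In this sketch the 30% threshold is applied before imputation, matching the order in which the text describes the steps.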
The statistical analysis of the MI dataset was also done in RStudio. It was essential to reduce the number of variables and to select those with a meaningful association with the outcome variable. We computed Pearson's correlation to find the variables significantly correlated with the outcome variable; only variables with a correlation coefficient greater than 0.2 (20%) were retained, which reduced both the size of the correlation heatmap and the number of predictive features. To confirm the correlation results, we also performed a chi-square test of each variable against the target variable.
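As a rough illustration of this two-step screening, the sketch below computes Pearson's correlation of each candidate variable with a binary outcome, keeps those above the 0.2 cutoff, and then computes the chi-square statistic for a retained variable. It is written in Python rather than R, the feature names and toy values are hypothetical, and comparing the absolute value of the coefficient against the cutoff is an assumption, since the text does not say how negative correlations were handled.

```python
from collections import Counter
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def chi_square(x, y):
    """Chi-square statistic and degrees of freedom for two categorical sequences."""
    n = len(x)
    observed = Counter(zip(x, y))          # cell counts of the contingency table
    row_totals, col_totals = Counter(x), Counter(y)
    stat = 0.0
    for r, rt in row_totals.items():
        for c, ct in col_totals.items():
            expected = rt * ct / n         # expected count under independence
            stat += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Hypothetical toy data: a binary outcome and two candidate predictors.
outcome = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "feature_a": [0, 0, 0, 1, 1, 1, 1, 1],
    "feature_b": [0, 1, 0, 1, 0, 1, 0, 1],
}

THRESHOLD = 0.2  # cutoff from the text; use of |r| is an assumption
selected = [name for name, values in features.items()
            if abs(pearson(values, outcome)) > THRESHOLD]

# Confirm the retained variable with a chi-square test against the outcome.
stat, dof = chi_square(features["feature_a"], outcome)
```

The statistic would then be compared against the chi-square distribution with `dof` degrees of freedom to obtain a p-value; that step is omitted here to keep the sketch dependency-free.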
For the predictive modeling of MI, the initial model was built using the variables identified as associated with the outcome variable by the chi-square and Pearson correlation tests. We also performed PCA and plotted the results to find the number of top principal components that would capture most of the variation in the data. A second prediction model was built using these top principal components as the predictor variables. The two models were compared on performance using various metrics. Based on the significance of each variable reported in the model summary, we tried removing and/or replacing non-significant variables with other significant variables and checked whether model performance improved.
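The choice of how many principal components to keep can be sketched as follows. This is an illustrative Python example using NumPy, not the analysis code of the study. The 90% cumulative-variance target is an assumption standing in for "most of the variation", and the data are synthetic.

```python
import numpy as np

def pca_explained_variance(X):
    """Return the explained-variance ratio of each principal component, largest first."""
    X = np.asarray(X, dtype=float)
    # Standardize each column so variables on different scales contribute equally.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Squared singular values of the standardized data are proportional
    # to the variance captured by each component.
    singular_values = np.linalg.svd(Z, compute_uv=False)
    variances = singular_values ** 2 / (len(Z) - 1)
    return variances / variances.sum()

def components_needed(ratios, target=0.90):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(ratios))  # guard against floating-point shortfall

# Synthetic data: two nearly identical columns plus one independent column.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
X = np.hstack([
    base + 0.05 * rng.normal(size=(50, 1)),
    base + 0.05 * rng.normal(size=(50, 1)),
    rng.normal(size=(50, 1)),
])

ratios = pca_explained_variance(X)
n_components = components_needed(ratios, target=0.90)
```

Plotting `np.cumsum(ratios)` against the component index gives the kind of curve typically used to read off this cutoff visually.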