This is my submission for the Tech Weekend Data Science Challenge on Kaggle.
We are living in an “information age”. Terabytes of data are produced every day. Data mining is the process that turns a collection of data into knowledge. The health care industry generates a huge amount of data daily, yet most of it is not used effectively. Efficient tools for extracting knowledge from these databases, for clinical detection of diseases or other purposes, are not yet widespread. With the rise of Data Science and Machine Learning, it is possible to make sense of large datasets and provide assistance to doctors. This Tech Weekend we challenge the participants to predict whether a person, given his/her attributes, has heart disease or not.
- Since it is a classification problem, after visualizing and analyzing the dataset, I started with a KNN classifier, which gave me 61% accuracy.
- Then I switched to Logistic Regression, which increased my accuracy to 83%, and further to 87% after setting the class weight to balanced in Scikit-learn.
- After some research and Googling, I decided to use Random Forests, which gave me my best accuracy of 89.728%.
- The Notebook containing the source code can be found here. The training/testing dataset can be found in dataset.zip.
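
The progression of models described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the data is a synthetic stand-in generated with `make_classification`, and the hyperparameters (`n_neighbors=5`, `n_estimators=200`, `max_iter=1000`), the 75/25 split, and the variable names are all assumptions. Only the choice of models and the `class_weight="balanced"` setting come from the write-up, so the printed accuracies will not match the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the heart disease data (assumption: the real
# features and labels live in dataset.zip and are not reproduced here).
X, y = make_classification(
    n_samples=500,
    n_features=13,
    n_informative=6,
    weights=[0.6, 0.4],  # mildly imbalanced classes
    random_state=0,
)

# Hold out a quarter of the rows for evaluation, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The three model families tried in the write-up, in the same order.
models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logreg_balanced": LogisticRegression(class_weight="balanced", max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Fit each model and record its accuracy on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

To try this on the real data, replace the `make_classification` call with code that loads the feature matrix and labels from the contest files.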
Open to enhancements and bug reports.
This was my first contest on Kaggle and I hope to participate in more such contests. 😄