This project was completed for the AI and Healthcare course in the University of Washington's Professional Master's Program, taught by Dr. Karthik Mohan. It was a partner project.
The project covers data preprocessing, data generation, and model training and comparison, including a SOTA CNN model, for heartbeat classification, with a focus on class 'A' (arrhythmia). The SOTA CNN achieved a test-set F1 of 0.964 using a single feature, the MLII readings.
The dataset can be found in the mitbih_databse directory. The data folder contains 44 csv files, each with a corresponding txt file that annotates it. Using the annotated R-peaks from the txt files, we broke the csv files down into individual heartbeats for each patient. However, only 42 of the patients had the 'MLII' ECG reading, so we opted to use just that one feature for the majority of the work.
The data was preprocessed into a dataframe of 98,312 x 360. The first dimension is the total number of heartbeats we extracted from the 42 patients, and 360 is the number of MLII values in each heartbeat. Each heartbeat is created by taking 180 values to the left of the R-peak and 179 values to the right of it; together with the R-peak sample itself, this gives a 360-d vector representing one ECG sensor reading for one heartbeat. The database was collected using a two-channel ambulatory ECG between 1975 and 1979. The R-peaks were hand-annotated by cardiologists after digitization.
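The windowing step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function name `extract_beats`, the toy signal, and the choice to skip peaks that sit too close to the edge of a recording are all assumptions.

```python
import numpy as np

LEFT = 180   # samples taken before the R-peak
RIGHT = 179  # samples taken after the R-peak


def extract_beats(signal, r_peaks, left=LEFT, right=RIGHT):
    """Cut one fixed-length window per annotated R-peak.

    Each window is left + 1 + right samples long (360 with the defaults).
    Peaks too close to either end of the recording are skipped (assumption).
    """
    signal = np.asarray(signal, dtype=float)
    beats = []
    for peak in r_peaks:
        start, stop = peak - left, peak + right + 1
        if start < 0 or stop > len(signal):
            continue  # not enough samples on one side of this peak
        beats.append(signal[start:stop])
    if not beats:
        return np.empty((0, left + 1 + right))
    return np.vstack(beats)


# Toy recording: a ramp of 1000 samples with three hypothetical R-peak positions.
mlii = np.arange(1000, dtype=float)
beats = extract_beats(mlii, r_peaks=[100, 400, 700])
```

With this layout the R-peak always lands at column 180 of its row, so every heartbeat is aligned on the same sample.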
To ensure that each patient's data was normalized with respect to their own ECG readings, we normalized by patient before concatenating that patient's heartbeats into the larger dataframe containing all patients.
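A sketch of per-patient normalization is shown below. The text does not say which normalization was used, so the z-score here is an assumption, as are the function name `normalize_patient` and the toy arrays.

```python
import numpy as np


def normalize_patient(beats):
    """Z-score one patient's beats with that patient's own mean and std.

    Assumption: the write-up does not name the scheme; z-scoring is one option.
    """
    beats = np.asarray(beats, dtype=float)
    mean = beats.mean()
    std = beats.std()
    if std == 0:
        return beats - mean  # flat signal: centre only, avoid dividing by zero
    return (beats - mean) / std


# Two hypothetical patients with very different amplitude ranges.
patient_a = np.array([[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]])
patient_b = np.array([[100.0, 300.0, 500.0], [500.0, 300.0, 100.0]])

# Normalize each patient separately, then concatenate into one array.
dataset = np.vstack([normalize_patient(p) for p in (patient_a, patient_b)])
```

Because each patient is scaled on their own statistics, the two toy patients end up with identical rows even though their raw amplitudes differ by two orders of magnitude.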
Examples of Heartbeats from each class parsed during preprocessing:
Because normal heartbeats are far more common than any other type in a patient's ECG, the dataset is heavily imbalanced, as shown below. Note that there are 6 classes, with 'N' representing a normal heartbeat.
To overcome this imbalance, my partner and I tried both a basic autoencoder and a variational autoencoder, and found very comparable results between the two.
Using the autoencoder, we boosted the sample counts of the 5 underrepresented classes. Even so, each of those classes had only about 3/5 as many samples as the N class.
Using the variational autoencoder, we again boosted the underrepresented classes, this time leveling them with the N class.
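The two upsampling targets (about 3/5 of the N class versus level with it) come down to how many generated beats each minority class needs. The helper below is a hypothetical sketch of that arithmetic only; it does not generate any beats, and the class counts are made up rather than taken from the dataset.

```python
def synthetic_samples_needed(class_counts, majority="N", ratio=1.0):
    """Return how many generated beats each minority class needs so that it
    reaches ratio * (size of the majority class)."""
    target = round(class_counts[majority] * ratio)
    return {
        label: max(target - count, 0)
        for label, count in class_counts.items()
        if label != majority
    }


# Hypothetical counts, not the project's real class distribution.
counts = {"N": 1000, "L": 200, "A": 50}

partial = synthetic_samples_needed(counts, ratio=3 / 5)  # autoencoder setting
leveled = synthetic_samples_needed(counts, ratio=1.0)    # VAE setting
```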
The tables below show the metrics used to compare each model. The basic autoencoder actually produced better end results than the variational autoencoder.
We also implemented a data denoising feature that cleans each signal, so that the deep learning algorithms focus on the overall shape of the data fluctuations and not on the small jitters that don't contribute to the type of heartbeat.
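The denoising method is not specified here, so the following is only an illustration of the idea using a centred moving average, which is an assumed stand-in and not necessarily what the notebook does. The function name, window size, and toy signal are all hypothetical.

```python
import numpy as np


def moving_average(beat, window=5):
    """Smooth a 1-D signal with a centred moving average of odd length."""
    beat = np.asarray(beat, dtype=float)
    pad = window // 2
    # Repeat the edge values so the output has the same length as the input.
    padded = np.pad(beat, pad, mode="edge")
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")


# A slow sine wave (the "big picture") plus small deterministic jitter.
t = np.linspace(0.0, 2.0 * np.pi, 360)
clean = np.sin(t)
jitter = 0.05 * np.where(np.arange(360) % 2 == 0, 1.0, -1.0)
denoised = moving_average(clean + jitter)
```

The smoothed signal stays close to the slow sine wave while the sample-to-sample jitter is largely averaged out. A wider window removes more jitter but also flattens sharp features such as the R-peak, so the window size is a trade-off.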
Random Forest:
- depth: 20
- estimators: 25
- min_samples_split: 2
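As a sketch, the settings above map onto scikit-learn's `RandomForestClassifier` like this. It assumes scikit-learn was the library used (the parameter name `min_samples_split` suggests it, but the text does not say so), and the data is a synthetic stand-in for the beat matrix.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters from the list above, mapped to scikit-learn argument names.
forest = RandomForestClassifier(
    n_estimators=25,
    max_depth=20,
    min_samples_split=2,
    random_state=0,  # assumption: fixed seed, not stated in the write-up
)

# Synthetic stand-in for the beat matrix: 200 samples, 360 features, 3 classes.
X, y = make_classification(
    n_samples=200, n_features=360, n_informative=10, n_classes=3, random_state=0
)
forest.fit(X, y)
predictions = forest.predict(X)
```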
Feed Forward Neural Network:
- API: Keras Sequential Model
- Layer Count: 6
- Activation: ReLU
- Dropout: 0.6
- Loss: Categorical Cross Entropy
- Optimizer: Adam
- Target: Label Binarizer of the 6 classes N,L,R,A,V,U
We used the CNN architecture proposed in the paper “X. Xu and H. Liu, "ECG Heartbeat Classification Using Convolutional Neural Networks," in IEEE Access, vol. 8, pp. 8614-8619, 2020, doi: 10.1109/ACCESS.2020.2964749”.
- API: PyTorch
- 1D-Convolutional Layers: 4
- Pooling Layers: 2
- Linear Layers: 3
- Loss: Cross Entropy
- Optimizer: Adam
With the above specs, we trained the NN for 9 epochs and the CNN for 8 epochs. The CNN had much smoother accuracy and loss curves and proved better to use than the NN.
To measure the correctness of the model, we looked at the per-class metrics: precision, recall, and F1. With upsampling we were able to achieve a much higher accuracy for class 'A' (arrhythmic heartbeats), though it was still significantly worse than for the other classes. The CNN paper claims 99.4% accuracy, whereas we achieved 97%.
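For reference, per-class precision, recall, and F1 can be computed from true and predicted labels as sketched below. The function name and the labels are made up for illustration; they are not the project's results.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} from two label sequences."""
    metrics = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1)
    return metrics


# Made-up labels for illustration only.
y_true = ["N", "N", "N", "A", "A", "V"]
y_pred = ["N", "N", "A", "A", "N", "V"]
scores = per_class_metrics(y_true, y_pred, labels=["N", "A", "V"])
```

Looking at these numbers per class matters on an imbalanced dataset, because overall accuracy is dominated by the majority class and can hide poor performance on a rare class such as 'A'.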
After creating a new dataframe containing only the patients that have both MLII and V1 readings, we achieved a higher F1 score with the CNN than we had using only one feature, namely MLII.
Please check out the notebook for more intermediary results and full preprocessing, model creation, training and analysis functions / cells. Enjoy!