Dataset "usability" for AI #29

Open · stbnps opened this issue Apr 21, 2020 · 4 comments

@stbnps commented Apr 21, 2020

I performed the following experiment:

  • Downloaded datasets [1], [2] and [3]
  • Extracted PA views for control and pneumonia patients (for [2], all "pneumonia" images were used regardless of type, bacterial or viral; for [3], only "normal" or "lung opacity" patients were used)
  • Trained a convolutional network, using oversampling to balance both labels and datasets: control and pneumonia images were each sampled with 50% probability, and each dataset was sampled with 1/3 probability. This prevents the network from prioritizing any single dataset or label (see the sketch after this list).
  • Selected the epoch with the best "balanced" validation accuracy (the "balanced" accuracy was computed by oversampling the validation sets following the same strategy used for the training sets)
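
A minimal sketch of how such balanced oversampling could be set up, assuming a PyTorch pipeline; the variable names and counts are hypothetical placeholders, not code from the experiment:

```python
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def make_balanced_sampler(labels_per_dataset, num_samples):
    # labels_per_dataset: one list of binary labels (0 = control,
    # 1 = pneumonia) per dataset, in the order the datasets are concatenated.
    weights = []
    for ds_labels in labels_per_dataset:
        n_pos = sum(ds_labels)
        n_neg = len(ds_labels) - n_pos
        for y in ds_labels:
            # Each dataset gets total probability 1/3, each label within it
            # probability 1/2, split uniformly over the samples of that label.
            weights.append((1 / 3) * (1 / 2) / (n_pos if y == 1 else n_neg))
    return WeightedRandomSampler(weights, num_samples=num_samples, replacement=True)

# Hypothetical usage (ds1..ds3 are torch Datasets, y1..y3 their label lists):
# loader = DataLoader(ConcatDataset([ds1, ds2, ds3]), batch_size=32,
#                     sampler=make_balanced_sampler([y1, y2, y3], 10000))
```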

This achieved the following results:

Specificity:

  • Dataset [1]: 0.8746355685131195
  • Dataset [2]: 0.8632478632478633
  • Dataset [3]: 0.9661399548532731

Sensitivity:

  • Dataset [1]: 0.7647058823529411
  • Dataset [2]: 0.9794871794871794
  • Dataset [3]: 0.9581589958158996
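
For reference, these metrics follow the standard definitions (a sketch in the same spirit, not code from the experiment):

```python
def specificity(tn, fp):
    # True-negative rate: fraction of control images classified as control.
    return tn / (tn + fp)

def sensitivity(tp, fn):
    # True-positive rate: fraction of pneumonia images classified as pneumonia.
    return tp / (tp + fn)
```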

The issue

The network seems to perform very well on dataset [3], where each image was manually reviewed by radiologists [4]. However, it performs significantly worse on dataset [1], where most labels were extracted using NLP and the images were not reviewed (even leading to the inclusion of completely white or completely black images [5]).

Do you think the quality of the images and annotations may be a limiting factor for the performance of the network?

References

[1] http://ceib.bioinfo.cipf.es/covid19/resized_padchest_neumo.tar.gz
[2] https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
[3] https://www.kaggle.com/c/rsna-pneumonia-detection-challenge
[4] https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/overview/acknowledgements
[5] https://github.com/BIMCV-CSUSP/BIMCV-COVID-19/tree/master/padchest-covid#iti---proposal-for-datasets

@rahools commented Jun 11, 2020

Images that seem to be white or black still have data in them. Just normalize to [0, 1], multiply by 255, and plot or save the result.
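
For example (a sketch, assuming a 16-bit grayscale PNG; the file name is a placeholder):

```python
import numpy as np
from PIL import Image

img = np.array(Image.open("xray_16bit.png")).astype(np.float32)

# Min-max normalize to [0, 1], then rescale to the displayable 8-bit range.
img = (img - img.min()) / (img.max() - img.min() + 1e-8)
Image.fromarray((img * 255).astype(np.uint8)).save("xray_8bit.png")
```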

@samils7 commented Jun 16, 2020

> Images that seem to be white or black still have data in them. Just normalize to [0, 1], multiply by 255, and plot or save the result.

This comment is the answer to Q1 in: BIMCV-COVID19+/FAQ.md

@stbnps (Author) commented Jun 17, 2020

@rahools That's not true. Take a look at image 216840111366964013590140476722013038132133659_02-059-019.png:
[attached image: 216840111366964013590140476722013038132133659_02-059-019.png]

You can see a white line. That white line means the image is already scaled: its pixel values already reach the top of the dynamic range, so min-max normalization cannot recover any additional contrast.
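
A quick way to check this (a sketch; it just inspects the stored value range and assumes the PNG is in the working directory):

```python
import numpy as np
from PIL import Image

img = np.array(Image.open("216840111366964013590140476722013038132133659_02-059-019.png"))
# If the values already span the full 16-bit range, min-max rescaling
# cannot reveal any hidden structure.
print(img.dtype, img.min(), img.max())
```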

@samils7 That FAQ is for BIMCV-COVID19+, not for padchest-covid

@rahools commented Jun 17, 2020

My bad; I successfully applied normalization to BIMCV-COVID19+, so I thought it would translate to the PadChest dataset too. Thanks for the insight, @stbnps.
