Trouble reproducing baseline results #4
Hi, a. Can you provide the following details, which might help us narrow down the sources of the differences:
b. The models had 14 classes.
Thanks for the quick reply!
In short, I've tried a lot of different things, but perhaps not all together in the exact configuration you used. I'm happy to re-run my model with hyperparameters that more closely mirror your paper, so let me know! However, I've just never seen weight decay or batch size make the difference between a decent baseline (like what I've achieved) and a nearly state-of-the-art model (like yours) when almost everything else is the same. If you let me know exactly what you think I should try, I'll be sure to re-run and report back. I appreciate the help.
For thoroughness, here are the results I have so far. All models were trained on the exact data split you use, with the torchvision DenseNet121 pretrained on ImageNet and defined exactly as you have it, for at most 20 epochs. For validation, I just resize images to 224x224. For testing, I use 10x test-time augmentation: I generate 10 augmented versions of each test image (via the same augmentation pipeline used for training) and average the predictions. I'm aware this is slightly different from the 10-crop scheme you use in the paper.
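For concreteness, here is roughly what my TTA step looks like (just a sketch; `train_augment` and `model` stand in for my actual albumentations pipeline and network, they are not from this repo):

```python
import torch

@torch.no_grad()
def predict_tta(model, image, train_augment, n_aug=10):
    """Average sigmoid predictions over n_aug augmented copies of one image.

    `train_augment` is a placeholder for my training augmentation pipeline
    (it returns a normalized CHW tensor); `model` is the DenseNet121 in eval mode.
    """
    probs = []
    for _ in range(n_aug):
        x = train_augment(image)              # same augmentations as training
        logits = model(x.unsqueeze(0))        # add a batch dimension
        probs.append(torch.sigmoid(logits))   # per-class probabilities
    return torch.stack(probs).mean(dim=0)     # average over the n_aug predictions
```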
Just from reading your paper as carefully as possible, this last one looks to be the closest to BL5-DenseNet121, and as you can see its performance is considerably lower than the 0.822 AUROC you report. Of course it's entirely possible I just have a bug somewhere, but I want to post this for visibility in case others are also having trouble reaching your results. Overall, I'm totally puzzled why none of these runs reaches the validation or test metrics you report, despite being so similar to your approach. If you have any suggestions for how to bridge this gap, let me know.
1. Are you normalizing the input images with the ImageNet mean and standard deviation?
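i.e., something like the usual torchvision normalization (a sketch with the standard ImageNet constants, not code taken from this repo):

```python
from torchvision import transforms

# Standard ImageNet channel statistics used with torchvision's pretrained models
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
```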
I'm still at a loss for what is causing this. I suppose I can try using torchvision transforms instead of albumentations, but I would be shocked if that's the culprit; I just can't identify another difference. To add to the confusion, this repo was able to reach the metrics that you did, yet their training code doesn't exactly reflect what you did: they use a cosine annealing learning rate scheduler, seemingly an initial learning rate of 5e-4, no class weights on the BCE loss, etc. I guess I'll keep trying things 🤷
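As I read it, their setup is roughly the following (a sketch only; the choice of Adam and `T_max=20` are my guesses, not taken from their code):

```python
import torch
import torch.nn as nn
from torchvision import models

# Rough sketch of that repo's training configuration as I understand it.
# Adam and T_max=20 are my assumptions, not copied from their code.
model = models.densenet121(pretrained=True)               # classifier-head replacement omitted here
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)
criterion = nn.BCEWithLogitsLoss()                         # no class weights on the BCE loss
```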
Just finished another run where the only thing I changed was preprocessing: I used torchvision transformations just as you did (I essentially copied and pasted your code) instead of albumentations. This, as far as I can tell, is nearly identical to how you trained BL5-DenseNet121... and yet it still decidedly does not reach 0.82+ test AUROC.
This run seemed the most promising, since it had the highest validation AUROC I've seen and the highest test AUROC without test-time augmentation (TTA), 0.813; disappointingly, 10x TTA only bumped it up to 0.814. This is so confusing to me; I just can't work out what could be causing such a difference in final performance. Is there anything you can think of that I'm missing? Did you use class weights in your loss function?
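For reference, by class weights I mean per-class `pos_weight` values for the multi-label BCE, computed from the training label frequencies. This is my own helper, not code from this repo:

```python
import torch
import torch.nn as nn

def make_weighted_bce(labels: torch.Tensor) -> nn.BCEWithLogitsLoss:
    """Build a BCE loss with per-class pos_weight from a (num_samples, num_classes)
    binary label matrix, so rare positive classes get up-weighted."""
    pos_counts = labels.sum(dim=0)                    # positives per class
    neg_counts = labels.shape[0] - pos_counts         # negatives per class
    pos_weight = neg_counts / pos_counts.clamp(min=1) # avoid division by zero
    return nn.BCEWithLogitsLoss(pos_weight=pos_weight)
```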
Hi there! First of all, I think the paper is really interesting. Second, I appreciate that this is one of the only open-sourced repos I can find that provides training code for the NIH dataset and uses the official train-test split.
I am trying to independently reproduce the DenseNet121 baseline you provide in the paper (over 0.82 AUROC with augmentation), and not really coming close. I am using the exact same data splits as you, but have written my own training and model definition scripts; I am still using pre-trained ImageNet weights provided by torchvision and, as far as I can tell, all the same hyperparameters and preprocessing steps as you have.
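For what it's worth, here is roughly how I define the model in both the 14- and 15-class setups (my own sketch, not copied from this repo):

```python
import torch.nn as nn
from torchvision import models

num_classes = 14  # or 15 when treating "No Finding" as an extra output class

# ImageNet-pretrained DenseNet121 with the classifier swapped for the label set
model = models.densenet121(pretrained=True)
model.classifier = nn.Linear(model.classifier.in_features, num_classes)
```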
I've tried unweighted cross-entropy, class-weighted cross-entropy, heavy augmentation (including elastic deformations, cutout, etc.), light augmentation (e.g., just a random crop to 224 and a horizontal flip), using 14 classes, and using 15 classes (treating "No Finding" as another output class). No matter what I've tried, I see rather quick convergence and overfitting (best validation loss reached no later than epoch 7), and the highest validation AUROC I've seen is 0.814. This is considerably lower than the 0.839 validation AUROC you report in Table 2 for BL5-DenseNet121. The best test-set result I've achieved with a DenseNet121 architecture is 0.807 AUROC with 8x test-time augmentation.
I'm pretty puzzled by this because I don't think random variation in training or minor implementation differences should cause a >0.015-point drop in AUROC... Of course there are a million potential sources for this difference, but maybe you can help me pinpoint it.
For your final models, did you use 14 or 15 classes? Also, would you be able to provide any sort of training/learning curve showing loss or AUROC vs. # epochs? I am suspicious of how quickly my baseline models are converging, and am wondering how my training trajectories compare to yours.