
[Question] Understanding a poor training on the ADAM dataset #21

Open
AaronSaam opened this issue Jul 22, 2021 · 15 comments

Labels: question

@AaronSaam

Hi! Awesome work :)

Recently we trained nnDetection on the ADAM challenge, i.e., Task019FG_ADAM.
However, the predictions on the test set are pretty bad: a lot of false positives and a sensitivity approaching 0. We are trying to understand where things went wrong; maybe you could be of help.

  1. In your case, did nnDetection generate a low-resolution model for the ADAM challenge? Our run did generate one, which we did not use further.

  2. Do you have any suggestions on what could be different from your run?

The input data was unchanged apart from the omission of one patient due to its T1 image, and we did not deviate from the instructions. We trained all five folds and performed a sweep for each. After that we ran the consolidation and prediction steps as instructed.

Thank you for your help!

Best,
Aaron


Environment Information

Currently using an NVIDIA GeForce RTX 2080 Ti; PyTorch 1.8.0; CUDA 11.2.
nnDetection was installed from [docker | source].

PyTorch Version: <module 'torch.version' from '/opt/conda/lib/python3.8/site-packages/torch/version.py'>
PyTorch CUDA: 11.2
PyTorch Backend cudnn: 8100
PyTorch CUDA Arch List: ['sm_52', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'compute_86']
PyTorch Current Device Capability: (7, 5)
PyTorch CUDA available: True
System NVCC: nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Nov_30_19:08:53_PST_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0

System Arch List: 7.5
System OMP_NUM_THREADS: 1
System CUDA_HOME is None: False
Python Version: 3.8.5 (default, Sep  4 2020, 07:30:14) 
[GCC 7.3.0]
@mibaumgartner
Collaborator

Hi @AaronSaam ,

thank you for trying nnDetection :) That sounds strange and didn't happen in our experiments.

  1. Yes, the low-resolution model is generated/triggered but was not used.
  2. What do the validation results look like? (By passing -1 as the fold to nndet_eval it is also possible to evaluate the consolidated result, i.e. the result across the whole training data set.) Did you use the patient split (by running split.py before training the network) or a random split for training?

Best,
Michael

@mibaumgartner
Collaborator

Hi @AaronSaam,

Since the bad performance was specifically reported on the test set, maybe something is off in the conversion from the bounding boxes to the center coordinates (which are used for evaluation on the public leaderboard). I added our code below (I'll add it to the repository, too).

Note two things:

  1. For an input with axes x1, x2, x3, the output bounding box has the format (x1_low, x2_low, x1_high, x2_high, x3_low, x3_high); the provided function box_center_np automatically accounts for this and returns the center in x1, x2, x3 order. This format was adopted from the Medical Detection Toolkit and makes it easier to support 2D and 3D. A small sketch of this axis ordering follows after this list.
  2. I needed to reverse the coordinate order to align with the provided coordinates (see the last few lines, where the center is saved as c[2], c[1], c[0]).
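
For illustration, a minimal NumPy sketch of that ordering (the box values are made up; this is not meant to mirror box_center_np internals, only the interleaved low/high layout described in point 1):

import numpy as np

# Hypothetical 3D box in the (x1_low, x2_low, x1_high, x2_high, x3_low, x3_high) layout.
box = np.array([10.0, 20.0, 14.0, 26.0, 5.0, 9.0])

# Center in (x1, x2, x3) order: average the low/high pair of each axis.
center = np.array([
    (box[0] + box[2]) / 2,  # x1
    (box[1] + box[3]) / 2,  # x2
    (box[4] + box[5]) / 2,  # x3
])
print(center)  # [12. 23.  7.]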

Maybe this helps.

Best,
Michael

import argparse
from pathlib import Path
 
from nndet.io import load_pickle
from nndet.core.boxes.ops_np import box_center_np
 
THRESHOLD = 0.5
 
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('source', type=Path)
    args = parser.parse_args()
 
    source = args.source
 
    # Load the predicted boxes/scores and keep only predictions above the score threshold.
    predictions = load_pickle(source / "case_boxes.pkl")
    boxes = predictions["pred_boxes"]
    scores = predictions["pred_scores"]
    keep = scores > THRESHOLD

    boxes = boxes[keep]
    if boxes.size > 0:
        centers = box_center_np(boxes)
    else:
        centers = []
 
    # Write one center per line in reversed (x3, x2, x1) order; the last line
    # has no trailing newline (and the file is opened in append mode).
    with open(source / "result.txt", "a") as f:
        if len(centers) > 0:
            for c in centers[:-1]:
                f.write(f"{round(float(c[2]))}, {round(float(c[1]))}, {round(float(c[0]))}\n")
            c = centers[-1]
            f.write(f"{round(float(c[2]))}, {round(float(c[1]))}, {round(float(c[0]))}")

@AaronSaam
Author

Hi @mibaumgartner,
Thank you for the quick reply! I went on a debugging trip.
The validation set indeed performs better, with more true positives. We are currently visualizing the predicted bounding boxes, so I may have an update on that soon.
That is a great point regarding the patient split. In our training we used a random split, which could indeed introduce an unwanted bias.

Will update with more details.

Best,
Aaron

@mibaumgartner
Collaborator

Hi @AaronSaam,

Any update?

Best,
Michael

@AaronSaam
Author

Hi @mibaumgartner,

Currently I am training nnDetection on the ADAM dataset again, this time mapping the label for treated aneurysms (value=2) to background (value=0) and using a patient-stratified split. The first fold should be done training by tomorrow morning. We have not yet found the root cause of why things go differently, but it will be interesting to compare against the new performance.

Best,
Aaron

@mibaumgartner
Collaborator

mibaumgartner commented Jul 29, 2021

Since we ran your first setup in our experiments (both our MICCAI and our nnDetection submission), I don't think the new setup will change the overall results drastically (the nnDetection metrics will be worse since treated aneurysms are pretty easy to detect, but I wouldn't expect huge differences in the performance on untreated aneurysms).

From your original run (random split, treated and untreated aneurysms as foreground):

  1. Can you check the metrics provided by nnDetection in [model dir]/consolidated/val_results/results_boxes.json? Specifically the values of AP_IoU_0.10_MaxDet_100 and FROC_score_IoU_0.10 are of interest (see the sketch after this list for one way to read them out). If the file/folder does not exist in consolidated, please run nndet_eval with the fold set to -1 (make sure the task IDs are correct and there is no mix-up with the new run).
  2. The file [model dir]/consolidated/val_results/pred_hist_IoU@0_1.png contains a probability histogram. The title of the figure shows how many "pos" instances there are. This value corresponds to the total number of instances in the data set and is thus great to check for debugging (the plot of my run shows 156).
  3. If you evaluate the predictions with the official ADAM script, could you possibly post your script to convert the nnDetection predictions into the final ADAM format?
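
A minimal sketch for reading out those two values (assuming results_boxes.json is a flat JSON mapping of metric names to numbers; [model dir] is a placeholder):

import json
from pathlib import Path

# [model dir] is a placeholder for the training output directory.
results_file = Path("[model dir]/consolidated/val_results/results_boxes.json")

with open(results_file) as f:
    results = json.load(f)

# The two metrics of interest; a KeyError here would hint at a different file layout.
for key in ("AP_IoU_0.10_MaxDet_100", "FROC_score_IoU_0.10"):
    print(f"{key}: {results[key]}")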

@AaronSaam
Author

Thank you for the reply!

  1. The results_boxes.json contains, among others, the following values:
    AP_IoU_0.10_MaxDet_100: 0.835501827391805
    FROC_score_IoU_0.10: 0.8580705009276438
  2. The plot of my run shows 154 "pos" instances. (TP: 145; FP: 10699; FN: 9; pos: 154)
  3. See below for the code to generate a result.txt for every prediction. I think it resembles the code snippet you posted earlier quite closely.

Best,
Aaron

from pathlib import Path
import os
import numpy as np
import pandas as pd

THRESHOLD = 0.5

def get_bbox_center(bbox, round_output=False):
    """Return the center of predicted bounding boxes"""
    (z1, y1, z2, y2, x1, x2) = bbox
    center = [np.mean([x1, x2]), np.mean([y1, y2]), np.mean([z1, z2])]
    if round_output:
        return np.round(center).astype(int)
    else:
        return center

def main():
    path_pred = '[PATH_TO_PREDICTIONS]'
    path_output = '[PATH_TO_OUTPUT]'
    path_ref_csv = '[PATH_TO_REFERENCE_CSV]' # to convert ADAM_XXX to a patient
    
    path_pred = Path(path_pred)
    path_output = Path(path_output)
    files = os.listdir(path_pred)

    # load the reference csv to restore nnDet naming to original subject id
    df_reference = pd.read_csv(path_ref_csv, sep=';')

    for case in files:
        
        path_pkl = path_pred / case
        
        # only load pickle files
        if path_pkl.suffix != '.pkl':
            continue
        
        # obtain original subject id, e.g. ADAM_112 -> 20001
        case_id = case.split('_')[1]
        subject_id = df_reference.at[int(case_id), 'patient_id']
        
        # load predicted bounding boxes
        with open(path_pkl, 'rb') as src:
            data = np.load(src, allow_pickle=True)
        
        lOutput = []
        
        # log bbox centers for scores above the threshold
        loc = np.where(data['pred_scores'] > THRESHOLD)
        for idx in loc[0]:
            bbox = data['pred_boxes'][idx]
            center = get_bbox_center(bbox, round_output=True)
            lOutput.append(center)
        
        # create the output folder if needed (str() in case patient_id is stored as an integer)
        path_result = path_output / str(subject_id)
        path_result.mkdir(parents=True, exist_ok=True)
        
        # write to file
        with open(path_result / 'result.txt', 'w') as output:
            for row in range(len(lOutput)):
                output.write(', '.join([str(x) for x in lOutput[row]]))
                if row + 1 < len(lOutput):
                    output.write('\n')

@mibaumgartner
Collaborator

Hi @AaronSaam ,

  1. The validation results look good.
  2. The difference in the number of instances probably comes from the exclusion of the one case you mentioned in your first post (if I understood correctly).
  3. Thanks, the script looks good to me, too.

Some other ideas (I'll only have sparse internet access during the next week, so I can only look into this in more detail from the 14th of August):

  • maybe there is a shift in the probability distribution of the predictions w.r.t. the original run (FROC and AP are ranking-based metrics that do not respond to such a shift). That would be visible in [model_dir]/consolidated/val_results/pred_hist_IoU@0_1.png; a small sketch for plotting the score distribution directly follows after this list.
  • I can export the 5-fold CV predictions to the official format and run the challenge evaluation script. I'll report the results here so we can compare those, too.
  • possibly, it could be useful to run the test evaluation with nnDetection and check whether those results are bad, too (this can be done by providing imagesTs and labelsTs, rerunning the preprocessing script, and running the evaluation with --test).
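
A minimal sketch for plotting that score distribution directly, assuming the consolidated validation predictions are stored as *_boxes.pkl files with a "pred_scores" entry (as in the conversion snippet above); the directory path and glob pattern are assumptions:

from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np

from nndet.io import load_pickle

# Placeholder: directory with the consolidated validation predictions (*_boxes.pkl).
pred_dir = Path("[model dir]/consolidated/val_predictions")

# Collect the predicted scores of all cases into a single array.
scores = np.concatenate([
    np.asarray(load_pickle(p)["pred_scores"]).ravel()
    for p in sorted(pred_dir.glob("*_boxes.pkl"))
])

# A global shift w.r.t. the reference run would show up here even though FROC/AP stay unchanged.
plt.hist(scores, bins=50, range=(0, 1))
plt.xlabel("predicted score")
plt.ylabel("count")
plt.savefig("score_hist.png", dpi=150)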

Best,
Michael

@AaronSaam
Author

AaronSaam commented Aug 10, 2021

Hi @mibaumgartner ,

  • The pred_hist_IoU@0_1.png plot shows fewer TP compared to the original run (the video on the ADAM website); there are also more 'confident' FP values visible:
    [screenshot of the prediction histogram]
  • That would be great! Thanks
  • I did a rerun of the preprocessing script on labelsTs and copied them from preprocessed/labelsTs/ into preprocessed/. (Edit: I should have renamed preprocessed/labelsTr/ to preprocessed/labelsTs/, because I had renamed the original folder to labelsTr to run the script. Fixed that and did a rerun; the results are the same.)
    Evaluation of test_results/pred_hist_IoU@0_1.png gives TP: 10, FP: 14090, FN: 192, pos: 202, with about one TP above the threshold of 0.5.

The results of the second run should be ready soon.
If you are on a holiday, enjoy the time off!

Best,
Aaron

@mibaumgartner
Collaborator

mibaumgartner commented Aug 16, 2021

Hi @AaronSaam,

the prediction histogram looks fine; some smaller deviations are expected due to the randomness in the training process. The histogram from our paper submission is shown below. (The sweep resulted in slightly different probability thresholds, which leads to the different number of FP; since the final evaluation is thresholded at 0.5 anyway, this has no influence on the test predictions.)
[pred_hist_IoU@0_1.png from the paper submission]

Since the validation results look fine and only the test results are bad / basically nonexistent (also via the nnDetection evaluation), I suspect something might be off there. This more or less excludes any problems with the conversion from bounding boxes to center points; in case you are interested, I can still run the center-point evaluation on the training (5-fold CV) runs.

Since the test and validation predictions use the same function to predict the data, I was wondering if something is off with the input data. Could you double-check that the test data follows the same scheme as the training data, i.e. struct_aligned.nii.gz is mapped to channel 0 (the raw_splitted/imagesTs case file ends with _0000) and TOF.nii.gz is mapped to channel 1 (the case file ends with _0001)?
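
A small sketch of such a naming check (the data directory is a placeholder; it only verifies that every case has a _0000 and a _0001 file, not which modality ended up in which channel):

from pathlib import Path

# Placeholder: adjust to your raw_splitted directory.
images_ts = Path("[data dir]/raw_splitted/imagesTs")

# Group files by case ID, i.e. everything before the trailing _XXXX channel suffix.
cases = {}
for f in sorted(images_ts.glob("*.nii.gz")):
    case_id, _, channel = f.name[:-len(".nii.gz")].rpartition("_")
    cases.setdefault(case_id, set()).add(channel)

# Every case should have exactly the channels 0000 (struct_aligned) and 0001 (TOF).
for case_id, channels in cases.items():
    if channels != {"0000", "0001"}:
        print(f"{case_id}: unexpected channels {sorted(channels)}")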

I uploaded the necessary files for our test set submission here in case you want to check anything there.

Best,
Michael

@AaronSaam
Author

Hi @mibaumgartner,

Thanks for the results! I agree that the validation seems fine, which makes the disappointing test results all the more interesting. I double-checked (triple-checked, even) the test data and the input follows the instructions. My results for the second training (treated aneurysms mapped to background; patient-stratified split) were similarly nonexistent.

However, I might have a lead: I ran a prediction using the 'best model' instead of the standard 'last model', and on a quick inspection the predictions seem much improved. The predictions still need to be converted into numbers; I will do that first thing in the morning.

Best,
Aaron

@AaronSaam
Author

Quick update: following the ADAM evaluation method, the best-model checkpoint predictions give a sensitivity of 0.63 and an FP count of 0.32. I uploaded the loss curves for the five folds over here; could you check against your log files whether they look similar?

Code to plot the curves:

from pathlib import Path

import matplotlib.pyplot as plt

def get_loss(data: list, find_query: str):
    """Extract the float value following `find_query` from every matching log line."""
    output = []
    query = [x for x in data if find_query in x]
    
    for line in query:
        idx = line.find(find_query) + len(find_query)
        output.append(float(line[idx:].rstrip()))
    
    return output


task = 'Task019FG_ADAM'
network = 'nnDetection'

for fold in range(5):
    # path_data = r'D:\nnDetection\Task019FG_ADAM\log\fold4\train.log'
    
    # Path to .log file -- do make use of the variables above!
    path_data = rf'D:\{network}\{task}\log\fold{fold}\train.log'
    
    # Path to save loss curve image
    path_output = rf'D:\{network}\{task}\log\fold{fold}.png'
    
    path_data = Path(path_data)
    
    with open(path_data, 'r') as file:
        data = file.readlines()
    
    train_loss = get_loss(data, 'Train loss reached: ')
    val_loss   = get_loss(data, 'Val loss reached: ')
    #dice       = get_loss(data, 'Proxy FG Dice: ')
    
    fig, ax = plt.subplots()
    plt.title(f'{network} - {task} - fold {fold}')
    ax.plot(range(len(val_loss)), val_loss)
    #plt.plot(range(len(val_loss)), dice)
    ax.plot(range(1, len(val_loss)), train_loss)
    ax.legend(['Validation', 'Training'])
    ax.set_xlabel('Epoch')
    ax.set_ylim([-0.9, 3.4])
    #ax.set_xlim([0, 60])
    
    plt.savefig(path_output, dpi=150)

@mibaumgartner
Collaborator

Hi @AaronSaam ,

Those results look reasonable, a very good lead! (The version from the paper scored 0.64 at 0.3 on the test set.) I ran the script you provided to extract the plots and they look very similar. The main contributor to the rising validation loss is the classification part, which is computed via hard negative mining (only a very small portion of anchors contributes to the loss, and 2/3 of those are selected as hard negatives); a small illustration is sketched below. The metrics on the validation set improve for much longer (the current schedule is probably a bit long when we are only looking at ADAM, though).
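
For illustration, a toy NumPy sketch of the hard-negative-mining idea (not nnDetection's actual implementation; the function name and the 2:1 negative-to-positive selection just mirror the description above):

import numpy as np

def select_anchors_for_loss(cls_loss, labels, neg_fraction=2 / 3):
    """Toy hard negative mining: keep all positive anchors plus the highest-loss
    negatives so that negatives make up `neg_fraction` of the selected set.
    Returns the indices of the anchors that contribute to the classification loss."""
    pos_idx = np.where(labels == 1)[0]
    neg_idx = np.where(labels == 0)[0]

    # Number of negatives so that they form `neg_fraction` of the selection
    # (e.g. two negatives per positive for neg_fraction = 2/3).
    num_neg = int(round(len(pos_idx) * neg_fraction / (1 - neg_fraction)))
    num_neg = min(num_neg, len(neg_idx))

    # "Hard" negatives = negatives with the largest classification loss.
    hard_neg_idx = neg_idx[np.argsort(cls_loss[neg_idx])[::-1][:num_neg]]
    return np.concatenate([pos_idx, hard_neg_idx])

# Tiny example: 8 anchors, 2 positives -> 2 positives + 4 hard negatives are kept.
rng = np.random.default_rng(0)
cls_loss = rng.random(8)
labels = np.array([1, 0, 0, 1, 0, 0, 0, 0])
print(select_anchors_for_loss(cls_loss, labels))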

nnDetection uses MLflow for logging, which provides additional metrics and per-loss logging that might be more informative. Unfortunately, I forgot to save the MLflow log from my run and only kept the models/configs/normal logs.

I also rechecked the models we submitted to the open leaderboard for our paper: those were the last models from the training (each checkpoint contains the epoch at which it was saved), which did not perform well in your run.
If you need any other information, or if you have an idea of how I can create some kind of artificial setup to reproduce the issue, let me know.

Best,
Michael

@orouskhani

Hello All,

Thank you for your comments and help. I have used nnDetection on the ADAM dataset and ran convert.py to generate the coordinates. However, the result.txt contains the coordinates as 6 numbers per line, while it should be 3. Could you please let me know what the problem is?
This is the text printed in result.txt -->
238,264,49
297,269,47238,264,49
297,269,47238,264,49
297,269,47

Regards,
Maysam

@mibaumgartner added the question label Jan 2, 2024

github-actions bot commented Feb 2, 2024

This issue is stale because it has been open for 30 days with no activity.

@github-actions bot added the stale label Feb 2, 2024
@mibaumgartner removed the stale label Feb 2, 2024