
Assertions and other #144

Closed
mfalkiewicz opened this issue Feb 1, 2018 · 21 comments

Comments

@mfalkiewicz

Dear heudiconv developers,
thank you for coming up with this awesome project. I am now trying to use heudiconv to convert massive datasets, often with several thousand subjects. As you know, such datasets are often messy. A single pass through the data with heudiconv often takes several hours, and unfortunately heudiconv seems to crash on almost every inconsistency it encounters. For instance:

INFO: PROCESSING STARTS: {'session': None, 'outdir': '/home/homeGlobal/mfalkiewicz/largefs/dataset_conversion/1000Gehirne/', 'subject': '997128'}
INFO: Processing 3376 dicoms
INFO: Analyzing 3376 dicoms
Traceback (most recent call last):
File "/usr/bin/heudiconv", line 2197, in
main()
File "/usr/bin/heudiconv", line 2189, in main
return _main(args)
File "/usr/bin/heudiconv", line 2052, in _main
min_meta=args.minmeta)
File "/usr/bin/heudiconv", line 1364, in convert_dicoms
grouping=None, # no groupping
File "/usr/bin/heudiconv", line 425, in group_dicoms_into_seqinfos
assert studyUID == file_studyUID
AssertionError

This error popped up after a 3-hour wait and the processing of 17 subjects (the actual processing only started after 3 hours of scanning the data for ~1500 subjects). I think the default behavior in this setting should be to dump information about the failed subject to a text file and continue with the other subjects.

The other problem I encountered concerns DICOM folder templates: heudiconv crashes when it encounters a folder instead of a DICOM file. I think in this situation it should simply continue the execution.

Unfortunately, I do not have time to dig deeply into the logic of the above-mentioned assertion, nor into the exact mechanism of scanning through DICOM files. I would be very grateful if you could fix these issues. This software is extremely useful!
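For illustration, the requested behavior could look roughly like the following sketch, assuming a per-subject conversion loop (subjects, convert_subject, and failed_subjects.txt are hypothetical names, not actual heudiconv API):

# Hypothetical sketch of the requested behavior -- not actual heudiconv code.
failed = []
for subject in subjects:
    try:
        convert_subject(subject)  # stand-in for the per-subject conversion
    except AssertionError as exc:
        failed.append((subject, str(exc)))
        continue  # keep going with the remaining subjects

with open('failed_subjects.txt', 'w') as f:
    for subject, reason in failed:
        f.write('%s\t%s\n' % (subject, reason))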

@yarikoptic
Member

Could you run with the --dbg flag, so that you end up in pdb upon failure and can check the details of what is going on?

@mgxd
Member

mgxd commented Feb 2, 2018

@mfalkiewicz

Can you share the command you are using and how your data is set up? We may be able to optimize this and make the conversion much faster and more efficient.

@yarikoptic
Member

knock knock @mfalkiewicz -- without further info we won't be able to reproduce this or really figure out why it is happening, and would just have to close this issue unresolved

@mfalkiewicz
Author

My apologies guys, I will provide the details in the coming days.

@mfalkiewicz
Author

mfalkiewicz commented Mar 14, 2018

Hello,
sorry for the massive delay, but I finally managed to get back to this topic.

I think the assert studyUID == file_studyUID is simply too much. I mean, the check is valid and makes sense, but in my use cases what doesn't make sense is that it breaks the pipeline, i.e. heudiconv exits with an AssertionError. Having to restart the processing of 1000 subjects because of that error is really annoying. After removing that assertion (and replacing it with a warning), heudiconv nicely traverses the entire dataset, no worries. Running this with --dbg indeed shows that studyUID and file_studyUID don't match, but I'm pretty sure the data comes from the same protocol.
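For reference, the replacement described above could look roughly like this sketch (variable names follow the traceback earlier in the thread; this is not the actual heudiconv source):

import logging

lgr = logging.getLogger(__name__)

def check_study_uid(filename, file_studyUID, studyUID):
    # Sketch: warn on a StudyInstanceUID mismatch instead of asserting.
    if studyUID != file_studyUID:
        lgr.warning('%s: StudyInstanceUID %s does not match %s; skipping',
                    filename, file_studyUID, studyUID)
        return False
    return True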

Regarding the crashes caused by encountering folders instead of files, here is a more detailed description of the problem. I have datasets in which the DICOM folders are structured differently. Some DICOMs have a .DCM extension, some have no extension at all (for example, when the data come from different scanners). Thus, in order to match the extension-less files I need to use a wildcard *. However, the second layer of mess is that the subfolder structure of the DICOM directories also varies across subjects. For instance, subjectA has his dicoms in folders subA/dicom/*/*, but the pattern for the second subject would have to be subB/dicom/*/dicom/* (don't ask me about the reasons for this mess). Thus, if I set the DICOM folder template to the one that matches subjectA, heudiconv will crash for subjectB (because it encounters a folder where it expects a file).

One idea for solving the latter problem is to replicate dcm2niix's behavior when dealing with folders. It would be much more convenient if I could provide a template like 'sub*/dicom' and have all subdirectories traversed in search of DICOMs.
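Something like the following sketch could implement that idea (the glob handling and the helper name are hypothetical, chosen only to illustrate it):

import os
from glob import glob

def find_dicoms(template):
    # Expand a template like 'sub*/dicom' and recursively collect all
    # files beneath each match, descending into subdirectories instead
    # of crashing on them.
    files = []
    for top in glob(template):
        for root, _dirs, names in os.walk(top):
            files.extend(os.path.join(root, name) for name in names)
    return sorted(files)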

I hope that clarifies these issues.

@yarikoptic
Member

yarikoptic commented Mar 14, 2018

> dataset, no worries. Running this with --dbg indeed shows that studyUID and file_studyUID don't match, but I'm pretty sure the data comes from the same protocol

I love certainty and confidence, but I really prefer facts and "better safe than sorry"!
Figuring out why/how they came to differ would IMHO be necessary to figure out a proper solution for this case. Note that this assertion has been in place for a while, with probably thousands of sessions converted, and this is so far the single reported case where it got triggered. With 1000 subjects I would not be surprised to run into some kind of human error, or maybe even a console/PACS software bug. But again -- better to know the details of the situation and what kind of action/issue triggered it.

> Running this with --dbg indeed shows that studyUID and file_studyUID don't match,

Is there a chance you could share what those values were? (Were they anywhere close?) Maybe there is a chance to share (privately, if necessary) a minimal set of sample files, or a dcmdump of them, to figure out the reason?

@yarikoptic
Member

Re:

> ... if I could provide a template like 'sub*/dicom' and have all subdirectories traversed in search of DICOMs.

Well -- we kinda already have that, but I guess it might be triggered (only?) if no template is provided. You could try heudiconv --command ls --files sub*/dicom (start by pointing to just a few directories, maybe from different scanners) to see heudiconv grouping them by accession. Would that work on your data? Then your custom heuristic could provide a custom infotoids function which, based on what is provided for the accession, would return a dict with subject and session keys (and locator, which is pretty much a prefix for a study and could be left the same/empty) -- then you would not need to provide a subject template, but could just point heudiconv to the specific directory(ies) you would like to convert. Here is a somewhat elaborate example of that function in the reproin heuristic.
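A minimal sketch of such a function (signature modeled on the reproin example; deriving the subject from the accession number is just an assumption -- use whichever seqinfo fields fit your data):

def infotoids(seqinfos, outdir):
    # Map a group of seqinfos to study identifiers.
    seqinfo = next(iter(seqinfos))
    return {
        'locator': '',                        # study prefix; may stay empty
        'subject': seqinfo.accession_number,  # assumption: encodes subject id
        'session': None,                      # or parse one out of the data
    }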

@gureckis

I am also finding the studyUID check to be very strict. I have an old dataset from 2006, acquired on a Siemens Trio, that I'd like to update and publish/reanalyze. I know that the data come from the same subject and session, but for some reason the DICOM files have an empty string for Accession Number and a different Study Instance UID for every scan. Perhaps something was misconfigured on the machine at that time. However, as a result I can't move forward with heudiconv, despite this seeming like a non-fatal issue in the larger scheme of things (I know the structure of the files based on their filenames). Is the best practice to manually rewrite the headers of my DICOM files, or to comment out the assertion in heudiconv? It feels like there should be a third way that just emits a warning rather than stopping all processing. Does studyUID need to be consistent for later processing steps, or is this just a check in case two sets of files were somehow mixed up?

@mgxd
Member

mgxd commented Apr 16, 2019

@gureckis there was work on disabling the grouping in #200, and I'm all for revisiting it. This check was added to ensure conflicting studies would not be mixed together. However, as we see in #126 (comment) - we would want to give users the power to allow it under certain conditions.

@satra
Member

satra commented Apr 17, 2019

@nicholst - just sent an email about a grouping where the series id is the same but the images should be split into separate files, so that's another related problem that may require custom grouping.

@DanielDelbarre

Hi, I'm working on the dataset that @nicholst emailed about and wanted to provide some more details.

We have some datasets in which there are 2 scans (T2 and PD) with the same series UID that need to be kept as two separate scans within a session. When I run heudiconv, it groups these 2 scans together and outputs them as 2 runs of the same scan instead of 2 separate scans. I believe heudiconv is grouping them together because of the shared series UID, and I cannot find a way to separate them. Grouping by accession ID is not an option, as this field is either empty or the same across all scans within a session.

Custom grouping would be ideal for resolving this, as there are a number of fields in the DICOM headers that would allow the scans to be grouped as needed. Is there currently any functionality to do this?
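(For illustration, grouping on an arbitrary header field might look like the sketch below, using pydicom, with EchoNumbers purely as an example field; this is not existing heudiconv functionality.)

from collections import defaultdict

import pydicom

def group_by_field(files, field='EchoNumbers'):
    # Group DICOM files by an arbitrary header field.
    groups = defaultdict(list)
    for fname in files:
        dcm = pydicom.dcmread(fname, stop_before_pixels=True)
        groups[getattr(dcm, field, None)].append(fname)
    return dict(groups)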

Also, in the dicominfo tsv file, I know that some of the fields are distinct for the 2 scans (e.g. dcm_dir_name), and I have tried building a heuristic file using this field, but it did not work -- I assume because the grouping has already been performed by that stage.

Let me know if you need any more details. Any advice would be greatly appreciated!

@nicholst
Contributor

I wanted to give some more information on the problem with PD+T2 image pairs. The distressing thing is that we can't stop heudiconv from creating invalid BIDS files, because it appends an integer after our modality label.

In particular, our two images are coming out as:

sub-{subject}_ses-{session}_PDT2_1.nii.gz
sub-{subject}_ses-{session}_PDT2_2.nii.gz

which of course isn't valid.

Our heuristic file is below; you can see that we use the string PDT2_ at the end, but we only added the _ to make it easier to see the digit ("PDT21" and "PDT22" read really weirdly). We can manipulate the different image types, but we can't figure out how to control the naming of these two images.

Any suggestions will be gratefully appreciated! (w/ @DanielDelbarre)

import os


def create_key(template, outtype=('nii.gz',), annotation_classes=None):
    if template is None or not template:
        raise ValueError('Template must be a valid format string')
    return template, outtype, annotation_classes


def infotodict(seqinfo):
    """Heuristic evaluator for determining which runs belong where

    allowed template fields - follow python string module:

    item: index within category
    subject: participant id
    seqitem: run number during scanning
    subindex: sub index within group
    """

    PDT2 = create_key('sub-{subject}/{session}/anat/sub-{subject}_{session}_PDT2_')
    T1w = create_key('sub-{subject}/{session}/anat/sub-{subject}_{session}_T1w')
    t1ce = create_key('sub-{subject}/{session}/anat/sub-{subject}_{session}_ce-Gd_T1w')
    T1w_3d = create_key('sub-{subject}/{session}/anat/sub-{subject}_{session}_acq-3d_T1w')
    T2w = create_key('sub-{subject}/{session}/anat/sub-{subject}_{session}_T2w')
    PD = create_key('sub-{subject}/{session}/anat/sub-{subject}_{session}_PD')

    info = {PDT2: [], T1w: [], t1ce: [], T1w_3d: [], T2w: [], PD: []}

    for s in seqinfo:
        # PD and T2 arrive with a shared series UID; both scans match here
        if (s.dcm_dir_name in ('t2', 'pd')) and (s.dim3 > 60):
            info[PDT2] = [s.series_id]
        if s.dcm_dir_name == 't1n':
            info[T1w] = [s.series_id]
        if s.dcm_dir_name == 't1ce':
            info[t1ce] = [s.series_id]
        if s.dcm_dir_name == '3d_t1n':
            info[T1w_3d] = [s.series_id]
        if (s.dcm_dir_name == 't2') and (s.dim3 < 60):
            info[T2w] = [s.series_id]
        if (s.dcm_dir_name == 'pd') and (s.dim3 < 60):
            info[PD] = [s.series_id]

    return info

@yarikoptic
Member

FTR: Woohoo -- I managed to find the sample phantom PD+T2w acquisition I had done while troubleshooting another issue (multi-echo): http://datasets.datalad.org/?dir=/dicoms/dartmouth-phantoms/bids_test6-PD+T2w/

So just confirming the undesired behavior:

$> heudiconv --version
0.6.0.dev1

$> heudiconv -f reproin --bids --datalad -o bids_test6-PD+T2w-bids --files bids_test6-PD+T2w
...

$> ls bids_test6-PD+T2w-bids/head/library/sub-qa/ses-20180226/anat 
sub-qa_ses-20180226_acq-tseXtraX3mm_PD+T2w1.json     sub-qa_ses-20180226_acq-tseXtraX3mm_PD+T2w2.json
sub-qa_ses-20180226_acq-tseXtraX3mm_PD+T2w1.nii.gz@  sub-qa_ses-20180226_acq-tseXtraX3mm_PD+T2w2.nii.gz@

@yarikoptic
Member

@DanielDelbarre @nicholst -- if you take your dicoms and convert them manually with dcm2niix, using -b y for BIDS sidecar files -- what differences do you see between the two produced .json files? In my case it is just the echo times:

$> diff /home/yoh/.tmp/dcm2niixiE52aR/convert/anat_e{1,2}.json
24c24
< 	"AcquisitionTime": "09:32:3.255000",
---
> 	"AcquisitionTime": "09:32:3.302500",
29,30c29,30
< 	"EchoNumber": 1,
< 	"EchoTime": 0.0094,
---
> 	"EchoNumber": 2,
> 	"EchoTime": 0.094,

@nicholst
Contributor

Thank you for replicating this problem @yarikoptic!

@DanielDelbarre will have to confirm on Monday, but I'm pretty sure that is what we find as well -- the .json files are identical except for EchoNumber and EchoTime.

@mgxd
Member

mgxd commented May 14, 2019

I believe adding PDT2 to this list

supported_multiecho = ['_bold', '_epi', '_sbref', '_T1w']

will solve the problem (though hard-coding all the valid multi-echo acquisitions shouldn't be our long-term solution).
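Concretely, the change would presumably be the one-line edit below (the _PDT2 entry follows the naming pattern of the existing items):

supported_multiecho = ['_bold', '_epi', '_sbref', '_T1w', '_PDT2']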

I'm hoping to change this behavior in a future release by separating the dim4 seqinfo field into its parts, volumes and echoes, but this has the potential to break a large number of existing heuristics.

@nicholst
Contributor

Adding _PDT2 to the supported_multiecho list fixes the problem! The PD and T2 are identified by echo labels and not by an appended 1 or 2. Thank you @mgxd!

The fix is lodged as PR #345.

@yarikoptic
Member

I wonder if, in the long(er) run, e.g. in the context of the reproin heuristic, we should allow multiple modalities to be listed joined by +, e.g. _PD+T2w: if multiple images are detected, this suffix would get split and assigned accordingly, and if no + is detected, use -echo[123...] and leave the suffix intact.
AFAIK + is allowed in sequence names on Siemens and Philips.
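A sketch of the proposed handling (a hypothetical helper, not existing heudiconv code):

def split_modalities(suffix, n_images):
    # Split a '+'-joined suffix like 'PD+T2w' across the images produced
    # by a single acquisition; otherwise fall back to echo labels.
    parts = suffix.split('+')
    if len(parts) == n_images:
        return parts                                  # e.g. ['PD', 'T2w']
    # no '+' (or a count mismatch): keep the suffix, disambiguate by echo
    return ['echo-%d_%s' % (i + 1, suffix) for i in range(n_images)]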

@nicholst
Contributor

Someone with more MR physics knowledge than me should weigh in on this.

As far as I know, we're talking about the very special case of one excitation generating multiple images (via multiple echoes). The only use cases I know of are PD+T2 and the multi-echo sequences used to support parametric mapping (or other corrections that generate maps after some modelling/transformation process... this is in contrast to PD+T2, which is just two raw images with arbitrary units).

Hence, unless we see another use case I'd be against adding a generic <modality>+<modality> syntax.

@yarikoptic
Member

Well, I have also only had the PD+T2w combination. If I get it right, such "syntax handling" is largely heudiconv- (and then reproin-) specific, not BIDS per se; it is just a way to accomplish the mission generically (as opposed to a custom, hardcoded _PDT2 suffix).

I asked our physics guy; he mentioned that GE scanners can produce multiple contrasts from a single acquisition... Google led e.g. to this ad...
That is why I thought that, looking forward, it might be worth establishing a generic solution for cases where, from a single DICOM series, we might need to produce a number of separate images that are more specific than _echo-X of the same modality. Again, AFAIK this does not and will not interfere with BIDS -- our goal is to arrive at BIDS ;)

@yarikoptic
Member

OK, I created a dedicated issue #346 for this possible enhancement. The other assertions are about grouping, for which there are #200 and other issues... so I guess this can be closed for now. Reopen if there is more to discuss specifically in this now-lengthy thread.
