Merge pull request #242 from Emory-HITI/dev
Experimental workflows and nifti modules
pradeeban authored Nov 4, 2021
2 parents 48707a1 + 7768feb commit a023417
Showing 9 changed files with 826 additions and 54 deletions.
473 changes: 473 additions & 0 deletions modules/nifti-extraction/ImageExtractorNifti.py

Large diffs are not rendered by default.

94 changes: 94 additions & 0 deletions modules/nifti-extraction/README.md
@@ -0,0 +1,94 @@
# The Niffler PNG Extractor

The PNG Extractor converts a set of DICOM images into PNG images and extracts their metadata in a privacy-preserving manner.


## Configuring Niffler PNG Extractor

Find the config.json file in the folder and modify it accordingly *for each* Niffler PNG extraction.

* *DICOMHome*: The folder where you have your DICOM files whose metadata and binary imaging data (png) must be extracted.

* *OutputDirectory*: The root folder where Niffler produces the output after running the PNG Extractor.

* *Depth*: How far in the folder hierarchy from the DICOMHome are the DICOM images. For example, a patient/study/series/instances.dcm hierarchy indicates a depth of 3. If the DICOM files are in the DICOMHome itself with no folder hierarchy, the depth will be 0.

* *SplitIntoChunks*: How many chunks to split the metadata extraction process into. The default is 1, which is sufficient for most extractions; a single chunk handles about 10,000 files. For extremely large batches, increase it accordingly; for example, set it to 2 for 20,000 files.

* *UseProcesses*: How many CPU cores to use for the image extraction. The default is 0, which uses all the cores; 0.5 uses only half of the available cores. Any other number sets the number of cores to that value. If a value greater than the number of available cores is specified, all the cores will be used.

* *FlattenedToLevel*: Specify how you want your folder tree to be. The default is "patient" (produces patient/*.png).
You may change this value to "study" (patient/study/*.png) or "series" (patient/study/series/*.png). All IDs are de-identified.

* *is16Bit*: Specifies whether to save the extracted images as 16-bit images. By default, this is set to true. Set it to false to run an 8-bit extraction.

* *SendEmail*: Do you want to send an email notification when the extraction completes? The default is true. You may disable this if you do not want to receive an email upon completion.

* *YourEmail*: Replace "[email protected]" with a valid email if you would like to receive an email notification. If the SendEmail property is disabled, you can leave this as is.
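
The sketch below shows how a script might read these settings from config.json; it is illustrative only, not the exact Niffler code.

```python
import json

# Illustrative only: load the settings described above from config.json.
with open("config.json") as f:
    config = json.load(f)

dicom_home = config["DICOMHome"]                # root folder containing the DICOM files
depth = int(config["Depth"])                    # folder depth below DICOMHome
chunks = int(config.get("SplitIntoChunks", 1))  # one chunk handles roughly 10,000 files
use_processes = config.get("UseProcesses", 0)   # 0 = all cores, 0.5 = half of the cores
is_16bit = bool(config.get("is16Bit", True))    # 16-bit PNG output by default
```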


### Print the Images or Limit the Extraction to Include only the Common DICOM Attributes

The below two fields can be left unmodified for most executions. The default values are included below for these boolean properties.

* *PrintImages*: Do you want to print the images from these DICOM files? Default is _true_.

* *CommonHeadersOnly*: Do you want the resulting dataframe CSV to contain only the common headers? A header is considered common when fewer than 10% of the rows are missing that column. The default is _false_, which extracts all the headers (see the sketch below for the idea).
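
As an illustration of the "common headers" idea, a hypothetical sketch (not the Niffler implementation) could keep a column when fewer than 10% of its rows are missing:

```python
import pandas as pd

# Hypothetical sketch of the CommonHeadersOnly idea: keep only the DICOM
# header columns that are present in at least 90% of the extracted rows.
df = pd.read_csv("metadata.csv")
common_columns = [col for col in df.columns if df[col].isna().mean() < 0.10]
df[common_columns].to_csv("metadata-common.csv", index=False)
```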


## Running the Niffler PNG Extractor
```bash

$ python3 ImageExtractor.py

# With Nohup
$ nohup python3 ImageExtractor.py > UNIQUE-OUTPUT-FILE-FOR-YOUR-EXTRACTION.out &

# With Command Line Arguments
$ nohup python3 ImageExtractor.py --DICOMHome "/opt/data/new-study" --Depth 0 --PrintImages true --SendEmail true > UNIQUE-OUTPUT-FILE-FOR-YOUR-EXTRACTION.out &
```
Check that the extraction is proceeding smoothly with no errors:

```
$ tail -f UNIQUE-OUTPUT-FILE-FOR-YOUR-EXTRACTION.out
```

## The output files and folders

In the OutputDirectory, there will be several output files and sub-folders.

* *metadata.csv*: The metadata from the DICOM images, in CSV format.

* *mapping.csv*: A CSV file that maps the DICOM -> PNG file locations (see the example after this list).

* *ImageExtractor.out*: The log file.

* *extracted-images*: The folder that contains the extracted PNG images.

* *failed-dicom*: The folder that contains the DICOM images that failed to produce PNG images during the execution of the Niffler PNG Extractor. Failed DICOM images are stored in 4 sub-folders named 1, 2, 3, and 4, categorized according to their failure reason.
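
For example, mapping.csv can be inspected with pandas to see which PNG was produced from which DICOM file (the column names are whatever your extraction produced; check the header row of your own mapping.csv):

```python
import pandas as pd

# Inspect the DICOM -> PNG mapping produced by the extraction.
mapping = pd.read_csv("mapping.csv")
print(mapping.head())  # each row links a source DICOM file to its extracted PNG
print(len(mapping), "DICOM files were mapped to PNG images")
```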


## Running the Niffler PNG Extractor with Slurm

There is also an experimental PNG extractor implementation (ImageExtractorSlurm.py) that provides distributed execution on a cluster, based on Slurm.


## Troubleshooting

If you find your images ending up in the failed-dicom/3 folder (the folder signifying a base exception), check ImageExtractor.out.

If you observe the below error log, check whether you still have conda installed and configured correctly (by running "conda"):

"The following handlers are available to decode the pixel data however they are missing required dependencies: GDCM (req. GDCM)"

The above error indicates a missing gdcm, which usually happens either because it was not configured (the installation steps were not followed correctly) or because conda (together with gdcm) was later broken (mostly due to a system upgrade or a manual removal of conda).

Check whether conda is available by running "conda" in a terminal. If it is missing, install [Anaconda](https://www.anaconda.com/distribution/#download-section).

If you just installed conda, make sure to close and open your terminal. Then, install gdcm.

```
$ conda install -c conda-forge -y gdcm
```
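
After installing, a quick sanity check (a minimal sketch; run it in the same conda environment) is to import the gdcm Python bindings:

```python
# If this import succeeds, pydicom's GDCM pixel-data handler has its
# dependency available in the current environment.
import gdcm
print("gdcm is available")
```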
13 changes: 13 additions & 0 deletions modules/nifti-extraction/config.json
@@ -0,0 +1,13 @@
{
"DICOMHome": "/path/to/files",
"OutputDirectory": "/path/to/svae",
"Depth": 3,
"SplitIntoChunks": 3,
"PrintImages": true,
"CommonHeadersOnly": true,
"UseProcesses": 12 ,
"FlattenedToLevel": "patient",
"is16Bit":true,
"SendEmail": true,
"YourEmail": "[email protected]"
}
6 changes: 4 additions & 2 deletions modules/png-extraction/ImageExtractor.py
@@ -144,9 +144,9 @@ def extract_headers(f_list_elem):
except:
c = False
kv = get_tuples(plan) # gets tuple for field,val pairs for this file. function defined above
# dicom images should not have more than 300
# dicom images should not have more than 300 dicom tags
if len(kv)>500:
logging.debug(str(len(kv)) + " dicoms produced by " + ff)
logging.debug(str(len(kv)) + " dicom tags produced by " + ff)
kv.append(('file', f_list_elem[1])) # adds my custom field with the original filepath
kv.append(('has_pix_array',c)) # adds my custom field with if file has image
if c:
@@ -263,6 +263,8 @@ def fix_mismatch_callback(raw_elem, **kwargs):
values.convert_value(vr, raw_elem)
except ValueError:
pass
except TypeError:
continue
else:
raw_elem = raw_elem._replace(VR=vr)
return raw_elem
14 changes: 12 additions & 2 deletions modules/rta-extraction/README.md
@@ -2,24 +2,34 @@

The RTA Extractor runs continuously to load the data (labs, meds, and orders) in JSON format, clear data that has been in the database for more than 24 hours, and store the data in a tabular format (CSV file) upon receiving query parameters.

# Configuring Niffler RTA Extractor
## Configuring Niffler RTA Extractor

Niffler RTA Extractor must be configured as a service for it to run continuously and resume automatically even when the server restarts. Unless you are the administrator who is configuring Niffler for the first time, skip this section.
Niffler RTA Extractor must be configured as a service for it to run continuously and resume automatically when the server restarts. Unless you are the administrator who is configuring Niffler for the first time, skip this section.

Find the system.json file in the service folder and modify accordingly.

system.json entries are to be set *only once* for the Niffler deployment by the administrator. Once set, further extractions do not require a change.

* *LabsURL*: Set the URL providing continuous labs data.

* *MedsURL*: Set the URL providing continuous meds data.

* *OrdersURL*: Set the URL providing continuous orders data.

* *LabsDataLoadFrequency*: The frequency, in minutes, at which labs data is loaded into MongoDB.

* *MedsDataLoadFrequency*: The frequency, in minutes, at which meds data is loaded into MongoDB.

* *OrdersDataLoadFrequency*: The frequency, in minutes, at which orders data is loaded into MongoDB.

* *UserName*: Set the Username Credentials for RTA Connection.

* *PassCode*: Set the Passcode Credentials for RTA Connection.

* *MongoURI*: Set the MongoDB Connection URL.

* *MongoUserName*: Set the MongoDB Username for Credentials.

* *MongoPassCode*: Set the MongoDB Passcode for Credentials.
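
For illustration, the MongoDB entries above could be used to open a connection with pymongo. This is a minimal sketch, not the RTA Extractor's own code, and the database name used here is hypothetical.

```python
import json
from pymongo import MongoClient

# Minimal sketch: connect to MongoDB using the system.json entries above.
with open("system.json") as f:
    system = json.load(f)

client = MongoClient(system["MongoURI"],
                     username=system["MongoUserName"],
                     password=system["MongoPassCode"])
db = client["rta"]  # hypothetical database name, for illustration only
print(db.list_collection_names())
```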

## Configure DICOM attributes to extract
