[Multi-Modal Image+Text] Preprocess the text reports #55

sumedhasingla · 2018-08-06T15:22:22Z

Extract just the finding and impression sections from the reports.
Get an idea of the distribution of length (number of words, sentences) of the finding and impression sections across reports.
Perform a word count. Find the most frequent words.
Find the most frequent sentences.
Group similar sentences.
Fix some metric to define sentence similarity.
Train word2vec on the large text-report dataset (500K reports: spine, chest CT, head CT, others).
Train word embeddings of different dimensions, d = 100 to 300.
Get statistics about the most frequent tags (tags are obtained from the NOBLE tool, etc.).
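
For the sentence-grouping step, one simple option is Jaccard similarity on word sets with a greedy grouping pass. This is only a sketch under assumptions: the tokenizer, the function names, and the 0.5 threshold are hypothetical choices for illustration, not something decided in this issue.

```python
import re


def tokens(sentence):
    """Lowercase a sentence and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a, b):
    """Jaccard similarity between the token sets of two sentences (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def group_sentences(sentences, threshold=0.5):
    """Greedy grouping: a sentence joins the first group whose first member
    is at least `threshold` similar; otherwise it starts a new group."""
    groups = []
    for sentence in sentences:
        for group in groups:
            if jaccard(sentence, group[0]) >= threshold:
                group.append(sentence)
                break
        else:
            groups.append([sentence])
    return groups


# Toy sentences, not taken from the real reports.
example = [
    "No pleural effusion.",
    "No pleural effusion is seen.",
    "The heart size is normal.",
]
groups = group_sentences(example)
```

The greedy pass depends on sentence order and compares only against the first member of each group, so it is a rough first look rather than a proper clustering.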

sumedhasingla · 2018-08-06T15:33:17Z

Results on 7800 reports (we have matching images for a subset of these reports: RAD-ALL-Findings-Impressions.csv)
Most frequent words:

Current Size of vocabulary: 7495 distinct words

The length of FINDINGS varies from 0 to 651 words.
The length of IMPRESSIONS varies from 0 to 327 words.
The length of FINDINGS varies from 0 to 66 sentences.
The length of IMPRESSIONS varies from 0 to 30 sentences.
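
Statistics of this kind can be computed roughly as follows. This is a minimal sketch, not the script that produced the numbers above: the regex-based word and sentence splitters and the function names are assumptions, and a naive sentence splitter will miscount around abbreviations.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_words(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def section_stats(sections):
    """Length range (words, sentences) and word frequencies for a list of section texts."""
    word_lengths = [len(split_words(s)) for s in sections]
    sentence_lengths = [len(split_sentences(s)) for s in sections]
    vocabulary = Counter(w for s in sections for w in split_words(s))
    return {
        "min_words": min(word_lengths),
        "max_words": max(word_lengths),
        "min_sentences": min(sentence_lengths),
        "max_sentences": max(sentence_lengths),
        "vocab_size": len(vocabulary),
        "top_words": vocabulary.most_common(3),
    }


# Toy FINDINGS sections, including an empty one (length 0).
findings = [
    "No pleural effusion. No pneumothorax.",
    "",
    "Right lower lobe nodule is stable.",
]
stats = section_stats(findings)
```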

kayhan-batmanghelich · 2018-08-06T17:00:28Z

FYI @pyadolla

Thanks @sumedhasingla. This word count figure is very informative. The most frequent words are "no", "right", and "left", which shows we have to find a good negation pipeline.
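
To make the negation point concrete, here is a very small rule-based sketch in the spirit of NegEx-style trigger and scope rules. The trigger list, the scope terminators, and the function name are all assumptions for illustration; a real pipeline would need a much fuller rule set or an existing tool.

```python
import re

# Hypothetical, deliberately short lists for illustration only.
NEGATION_TRIGGERS = ("no evidence of", "negative for", "without", "no")
SCOPE_TERMINATORS = ("but", "however", "although")

_TERMINATOR_RE = re.compile(r"\b(?:" + "|".join(SCOPE_TERMINATORS) + r")\b|[.;]")


def negated_terms(sentence, terms):
    """Return the subset of `terms` that appear after a negation trigger and
    before the next scope terminator (or sentence end)."""
    text = sentence.lower()
    negated = set()
    for trigger in NEGATION_TRIGGERS:
        for match in re.finditer(r"\b" + re.escape(trigger) + r"\b", text):
            scope = text[match.end():]
            cut = _TERMINATOR_RE.search(scope)
            if cut:
                scope = scope[:cut.start()]
            for term in terms:
                if re.search(r"\b" + re.escape(term.lower()) + r"\b", scope):
                    negated.add(term)
    return negated


result = negated_terms(
    "No pleural effusion but there is a small nodule.",
    ["pleural effusion", "nodule"],
)
```

Word-boundary matching keeps "no" from firing inside words such as "nodule".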

Also, what do you think about this:

If we get a radiologist intern (I am working on it), do you think it makes sense to draw an "attention" box for each sentence?

kayhan-batmanghelich · 2018-08-06T17:04:07Z

@sumedhasingla can you show the word counts for impression and finding separately?

sumedhasingla · 2018-08-30T18:34:39Z

Results on 548,369 reports. These are all the reports in folder: '/pghbio/dbmi/batmanlab/Data/radiologyTextDataset2/Reports'. It contains reports from Lung (2013-2017), spine (2016-2017), neck (2016-2017), head (2017), abdomen (2017).

The impression and finding sections were extracted from all the reports and saved at location: '/pghbio/dbmi/batmanlab/Data/radiologyTextDataset2/singla/Processed_Reports/Text_To_Train_Glove'
Distribution of words
The length of FINDINGS varies from 1 to 1586 words and from 1 to 127 sentences.
The length of IMPRESSIONS varies from 1 to 693 words and from 1 to 67 sentences.

Most frequent words in findings

Most frequent words in impressions

kayhan-batmanghelich · 2018-09-05T10:00:00Z

Thanks @sumedhasingla

sumedhasingla · 2018-09-05T14:10:39Z

To calculate statistics about the most frequent tags, I ran the NOBLE tool on 6K reports and computed the inverse document frequency.
Top 20 concepts and semantic types

Concept	idf (concept)	Semantic type	idf (semantic type)
impression (attitude)	0.996964	Finding	1
Radiologic Impression	0.996964	Qualitative Concept	0.999831
Unremarkable	0.718455	Body Part, Organ, or Organ Component	0.999325
Mass	0.647267	Mental Process	0.997132
Breast Lump	0.589575	Spatial Concept	0.994433
Focus (area of enhancement)	0.56444	Disease or Syndrome	0.994433
Pericardial effusion	0.550101	Quantitative Concept	0.986336
Mild	0.538124	Functional Concept	0.976889
Lesion	0.529352	Pathologic Function	0.962719
Pericardial effusion body substance	0.512146	Body Location or Region	0.947031
Pleural Effusion	0.505735	Intellectual Product	0.93303
Lymphatic Diseases	0.486842	Idea or Concept	0.898617
Lymphadenopathy	0.486842	Temporal Concept	0.842611
Mild Adverse Event	0.443826	Body Substance	0.763158
Nodule	0.433536	Conceptual Entity	0.708333
centimeter	0.432692	Therapeutic or Preventive Procedure	0.637314
No status change	0.431174	Neoplastic Process	0.615385
Curium	0.430837	Acquired Abnormality	0.61471
Small	0.430162	Tissue	0.612854
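
For reference, a common textbook definition of inverse document frequency is idf = log(N / df), where N is the number of reports and df is the number of reports containing the concept. The comment above does not say which variant or normalization produced the table, so the sketch below is an assumption, and its values are not expected to match the table.

```python
import math
from collections import Counter


def inverse_document_frequency(docs):
    """docs: one iterable of concept names per report.
    Returns {concept: log(N / df)}, counting each concept once per report."""
    n_docs = len(docs)
    document_frequency = Counter()
    for concepts in docs:
        document_frequency.update(set(concepts))
    return {c: math.log(n_docs / df) for c, df in document_frequency.items()}


# Toy tag lists, not real NOBLE output.
docs = [
    ["Mass", "Nodule"],
    ["Mass"],
    ["Mass", "Mass", "Lesion"],
    ["Nodule"],
]
idf = inverse_document_frequency(docs)
```

Under this definition, concepts that occur in almost every report get values near zero, and rare concepts get large values.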

kayhan-batmanghelich · 2018-09-09T13:28:22Z

@sumedhasingla hmmm... many of those terms are too general and hence non-informative. Interesting things to look into:

Pleural Effusion
Lymphadenopathy
Pericardial effusion --- why is this intellectual product?!
Pericardial effusion body substance

What are those? Would you please take a look.

I also wonder what are the examples of the following concepts:
Mental Process
Intellectual Product
Conceptual Entity
Neoplastic Process

There are semantic tags that we should be able to predict; see the examples below. I think these should have a salient visual representation:

Neoplastic Process
Disease or Syndrome
Body Substance

@sumedhasingla could you please eyeball which concepts/semantic types co-occur? Given that we are going to do multi-label prediction, knowing this will be helpful.
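
One way to go beyond eyeballing is to count, per report, every pair of labels that appear together. A minimal sketch, assuming each report has already been reduced to a list of semantic types (or concepts); the input format and function name are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(report_labels):
    """Count how many reports contain each unordered pair of labels."""
    pairs = Counter()
    for labels in report_labels:
        # Deduplicate within a report and sort so each pair has one canonical order.
        for a, b in combinations(sorted(set(labels)), 2):
            pairs[(a, b)] += 1
    return pairs


# Toy label lists, not real NOBLE output.
reports = [
    ["Neoplastic Process", "Disease or Syndrome"],
    ["Disease or Syndrome", "Body Substance", "Neoplastic Process"],
    ["Body Substance"],
]
pair_counts = cooccurrence_counts(reports)
top_pair, top_count = pair_counts.most_common(1)[0]
```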

Thanks @sumedhasingla

shyamvis · 2018-09-09T16:14:28Z

We have been using Noble Coder to identify SNOMED-CT concepts in CT reports of patients with trauma. The goal is to identify sentences that are incidental findings in trauma CT reports and the work is being done by Gaurav Trivedi in ISP. We noticed that the versions of vocabularies that come with Noble are quite old.

You can get the latest UMLS in NLM format, unzip and concatenate the MR**.RRF files (cat MRCONSO.RRF.a* > MRCONSO.RRF) and use Noble Coder to import it as RRF. If you need the concept type hierarchy, download the latest NCI Metathesaurus in RRF format. Contact Eugene Tseytlin at [email protected] if you need more information about Noble Coder.
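
If a shell is not convenient, the concatenation step can also be done in Python. This sketch mirrors the `cat MRCONSO.RRF.a* > MRCONSO.RRF` command; the function name is hypothetical, and the demo writes two tiny stand-in files to a temporary directory rather than touching real UMLS data.

```python
import shutil
import tempfile
from pathlib import Path


def concatenate_parts(directory, stem="MRCONSO.RRF"):
    """Join the split files <stem>.aa, <stem>.ab, ... in `directory` into <stem>."""
    directory = Path(directory)
    parts = sorted(directory.glob(stem + ".a*"))  # lexicographic order: aa, ab, ...
    if not parts:
        raise FileNotFoundError(f"no {stem}.a* parts found in {directory}")
    target = directory / stem
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return target


# Demo with stand-in content only.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "MRCONSO.RRF.ab").write_bytes(b"row2|\n")
    (Path(tmp) / "MRCONSO.RRF.aa").write_bytes(b"row1|\n")
    merged = concatenate_parts(tmp).read_bytes()
```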

The UMLS semantic types are at https://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html. If useful you can selectively annotate only those types that are of interest.

In our experiments so far it seems that bag of words with and without embeddings is not improved by adding UMLS or SNOMED-CT concepts.

kayhan-batmanghelich · 2018-09-09T16:21:28Z

Thanks @sumedhasingla for the comment.

@sumedhasingla @pyadolla FYI.

sumedhasingla · 2018-09-18T19:15:33Z

Results of the NOBLE tool on 6K reports.
To get a better understanding of the semantic types and concepts, I collected stats on how many times any word is tagged with a given (semantic type, concept) pair.
Below are some examples of concepts within each semantic type:

Mental Process: 20 concepts in this semantic type
Regression - mental defense mechanism
Content
impression (attitude)
Psychological habituation
Suspicion
Persistence
Interpretation
Attention Concentration
Psychological fixation
Body Image

Intellectual Product: 165 concepts in this semantic type
Uncertain
Observation Interpretation - intermediate
Group Object
Financial Account
International Classification of Diseases
Participation Type - device
attestation
Leaflet
Severe Extremity Pain
Pattern of Bowel Movements

Conceptual Entity: 80 concepts in this semantic type
Upper Limit of Normal
Study Terminated
Combination
Computer Configuration
Fusiform shape
Content
Background
dictated - ParticipationMode
Spicule
Low Risk

Neoplastic Process: 69 concepts in this semantic type
Breast Carcinoma
Esophageal Neoplasms
Thyroid Gland Nodule
Metastatic Malignant Neoplasm in the Pleura
Cyst
Metastatic Malignant Neoplasm in the Lymph Nodes
Minimal Residual Disease
Mass of thyroid gland
Leiomyoma
Plasma Cell Myeloma
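
The per-type counts above can be derived from a flat list of tagged mentions along these lines. A minimal sketch, assuming the tagger output has already been reduced to (semantic type, concept) tuples; that input format and the function name are assumptions, not the NOBLE output format.

```python
from collections import Counter, defaultdict


def concepts_by_semantic_type(tags):
    """tags: iterable of (semantic_type, concept) pairs, one per tagged mention.
    Returns {semantic_type: Counter({concept: number_of_mentions})}."""
    grouped = defaultdict(Counter)
    for (semantic_type, concept), count in Counter(tags).items():
        grouped[semantic_type][concept] = count
    return grouped


# Toy mentions reusing names from the lists above; the counts are made up.
tags = [
    ("Neoplastic Process", "Cyst"),
    ("Neoplastic Process", "Cyst"),
    ("Neoplastic Process", "Leiomyoma"),
    ("Mental Process", "Suspicion"),
]
grouped = concepts_by_semantic_type(tags)
# Number of distinct concepts per semantic type.
n_concepts = {sem_type: len(counter) for sem_type, counter in grouped.items()}
```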

sumedhasingla · 2018-09-18T19:16:43Z

The interesting semantic types are:
'Acquired Abnormality', 'Anatomical Abnormality', 'Congenital Abnormality', 'Sign or Symptom', 'Finding', 'Pathologic Function', 'Disease or Syndrome', 'Disease or Syndrome, Anatomical Abnormality'

sumedhasingla · 2018-09-18T19:19:47Z

sumedhasingla self-assigned this Aug 6, 2018

sumedhasingla added the Coding/Data Wrangling label Aug 6, 2018
