[Multi-Modal Image+Text] Preprocess the text reports #55

Open
5 of 6 tasks
sumedhasingla opened this issue Aug 6, 2018 · 12 comments

@sumedhasingla
Contributor

sumedhasingla commented Aug 6, 2018

  • Extract just the finding and impression sections from the reports

  • Get an idea of the distribution of lengths (number of words, sentences) of the findings and impression sections across reports.

  • Perform the word-count. Find the most frequent words.

  • Find the most frequent sentences.

  • Group the similar sentences

  • Fix some metric to define sentence similarity

  • Train word2vec on the large text-report dataset (500K reports-spine, chest CT, head CT, others)

  • Train word embeddings of different dimensions, d = 100 to 300

  • Get statistics about the most frequent tags (tags are obtained from the NOBLE tool, etc.)
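The first task, pulling out only the findings and impression sections, could be sketched as below. This is a minimal sketch that assumes each section starts on its own line with an upper-case header followed by a colon (for example `FINDINGS:`); the real report layout is not shown in this thread, and `extract_sections` and the sample report are hypothetical.

```python
import re

# Assumed layout: a section starts at the beginning of a line with an
# upper-case header and a colon, and runs until the next such header.
SECTION_RE = re.compile(
    r"^(?P<name>[A-Z][A-Z /]+):[ \t]*(?P<body>.*?)(?=^[A-Z][A-Z /]+:|\Z)",
    re.MULTILINE | re.DOTALL,
)

def extract_sections(report, wanted=("FINDINGS", "IMPRESSION")):
    """Return {section name: text} for the wanted sections of one report."""
    sections = {}
    for match in SECTION_RE.finditer(report):
        name = match.group("name").strip()
        if name in wanted:
            # Collapse line breaks and repeated spaces inside the section.
            sections[name] = " ".join(match.group("body").split())
    return sections

# Made-up report text, for illustration only.
report = """CLINICAL HISTORY: Cough.
FINDINGS: No pleural effusion.
The heart size is normal.
IMPRESSION: No acute disease.
"""

print(extract_sections(report))
```

Header names vary between sites, so the `wanted` tuple would need to be adapted to the actual reports.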

@sumedhasingla
Contributor Author

sumedhasingla commented Aug 6, 2018

Results on 7800 reports (we have matching images for a subset of these reports: RAD-ALL-Findings-Impressions.csv)
Most frequent words:
[Figure: most frequent words]
Current size of vocabulary: 7495 distinct words

The length of FINDINGS varies from 0 to 651 words.
The length of IMPRESSIONS varies from 0 to 327 words.
The length of FINDINGS varies from 0 to 66 sentences.
The length of IMPRESSIONS varies from 0 to 30 sentences.
[Figure: length distribution]

@kayhan-batmanghelich
Collaborator

FYI @pyadolla

Thanks @sumedhasingla. This word count figure is very informative. The most frequent words are "no", "right" and "left", which shows that we need a good negation pipeline.
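To make the negation point concrete, here is a toy rule-based tagger in the spirit of trigger-and-scope approaches such as NegEx. It is a sketch under stated assumptions, not a proposed pipeline: the trigger list, the terminator list and the fixed scope of five tokens are all illustrative choices.

```python
import re

NEGATION_TRIGGERS = {"no", "not", "without", "negative"}      # illustrative
SCOPE_TERMINATORS = {"but", "however", "although", "except"}  # illustrative
MAX_SCOPE = 5  # assumed number of tokens after a trigger that are negated

def tag_negation(sentence):
    """Return (token, is_negated) pairs for one sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    tagged = []
    remaining = 0  # tokens still inside the current negation scope
    for tok in tokens:
        if tok in NEGATION_TRIGGERS:
            remaining = MAX_SCOPE      # open (or restart) a scope
            tagged.append((tok, False))
        elif tok in SCOPE_TERMINATORS:
            remaining = 0              # close the scope early
            tagged.append((tok, False))
        else:
            tagged.append((tok, remaining > 0))
            remaining = max(0, remaining - 1)
    return tagged

sentence = "No pleural effusion, but the right lung shows a nodule."
negated = [tok for tok, neg in tag_negation(sentence) if neg]
print(negated)
```

With such tags, "effusion" in "no pleural effusion" can be kept apart from a positive mention when counting words or building labels.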

Also, what do you think about this:

If we get a radiologist intern (I am working on it), do you think it makes sense to draw an "attention" box for each sentence?

@kayhan-batmanghelich
Collaborator

@sumedhasingla can you show the word counts for impressions and findings separately?

@sumedhasingla
Contributor Author

Results on 548,369 reports. These are all the reports in the folder '/pghbio/dbmi/batmanlab/Data/radiologyTextDataset2/Reports', which contains reports from lung (2013-2017), spine (2016-2017), neck (2016-2017), head (2017), and abdomen (2017).

  • The impression and finding sections were extracted from all the reports and saved at location: '/pghbio/dbmi/batmanlab/Data/radiologyTextDataset2/singla/Processed_Reports/Text_To_Train_Glove'

  • Distribution of words
    The length of FINDINGS varies from 1 to 1586 words and from 1 to 127 sentences.
    The length of IMPRESSIONS varies from 1 to 693 words and from 1 to 67 sentences.
    [Figure: length distribution]

Most frequent words in findings
[Figure: most frequent words in findings]

Most frequent words in impressions
[Figure: most frequent words in impressions]

@kayhan-batmanghelich
Collaborator

Thanks @sumedhasingla

@sumedhasingla
Contributor Author

To calculate statistics about the most frequent tags, I ran the NOBLE tool on 6K reports and computed the inverse document frequency.
Top 20 concepts and semantic types

| Concept | idf | SemanticType | idf |
|---|---|---|---|
| impression (attitude) | 0.996964 | Finding | 1 |
| Radiologic Impression | 0.996964 | Qualitative Concept | 0.999831 |
| Unremarkable | 0.718455 | Body Part, Organ, or Organ Component | 0.999325 |
| Mass | 0.647267 | Mental Process | 0.997132 |
| Breast Lump | 0.589575 | Spatial Concept | 0.994433 |
| Focus (area of enhancement) | 0.56444 | Disease or Syndrome | 0.994433 |
| Pericardial effusion | 0.550101 | Quantitative Concept | 0.986336 |
| Mild | 0.538124 | Functional Concept | 0.976889 |
| Lesion | 0.529352 | Pathologic Function | 0.962719 |
| Pericardial effusion body substance | 0.512146 | Body Location or Region | 0.947031 |
| Pleural Effusion | 0.505735 | Intellectual Product | 0.93303 |
| Lymphatic Diseases | 0.486842 | Idea or Concept | 0.898617 |
| Lymphadenopathy | 0.486842 | Temporal Concept | 0.842611 |
| Mild Adverse Event | 0.443826 | Body Substance | 0.763158 |
| Nodule | 0.433536 | Conceptual Entity | 0.708333 |
| centimeter | 0.432692 | Therapeutic or Preventive Procedure | 0.637314 |
| No status change | 0.431174 | Neoplastic Process | 0.615385 |
| Curium | 0.430837 | Acquired Abnormality | 0.61471 |
| Small | 0.430162 | Tissue | 0.612854 |
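For reference, a per-concept document statistic can be computed as sketched below. The values in the table lie between 0 and 1 and are largest for concepts that occur in almost every report, which is the pattern a document-frequency fraction would give; that reading is an assumption, so the sketch shows both that fraction and the standard log-scaled IDF. The input format (one list of concept names per report) and the sample data are hypothetical, not NOBLE's actual output format.

```python
import math
from collections import Counter

def document_frequency(docs):
    """Number of reports in which each concept appears at least once."""
    df = Counter()
    for concepts in docs:
        df.update(set(concepts))  # count each concept once per report
    return df

def df_fraction(docs):
    """Fraction of reports that contain each concept (between 0 and 1)."""
    n = len(docs)
    return {c: k / n for c, k in document_frequency(docs).items()}

def idf(docs):
    """Standard inverse document frequency, log(N / df)."""
    n = len(docs)
    return {c: math.log(n / k) for c, k in document_frequency(docs).items()}

# Made-up annotations: one list of concept names per report.
docs = [["Mass", "Nodule", "Mass"], ["Mass"], ["Nodule", "Lesion"], ["Mass"]]

print(df_fraction(docs))
print(idf(docs))
```

Under the log-scaled definition, near-universal concepts such as a section header get a score close to 0, which makes the rare, more specific concepts stand out.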

@kayhan-batmanghelich
Collaborator

@sumedhasingla hmmm... many of those terms are too general and hence non-informative. Interesting things to look into:

Pleural Effusion
Lymphadenopathy
Pericardial effusion --- why is this intellectual product?!
Pericardial effusion body substance

What are those? Would you please take a look?

I also wonder what the examples of the following concepts are:
Mental Process
Intellectual Product
Conceptual Entity
Neoplastic Process

There are semantic tags that we should be able to predict; for example, see below. I think these should have a salient visual representation:

  • Neoplastic Process
  • Disease or Syndrome
  • Body Substance

@sumedhasingla could you please eyeball which concepts/semantic types co-occur? Given that we are going to do multi-label prediction, knowing this will be helpful.
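One way to go beyond eyeballing is to count how often each pair of tags appears in the same report. The sketch below assumes one list of semantic types (or concepts) per report; `cooccurrence` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(reports):
    """Count, for every unordered pair of labels, the reports containing both."""
    pairs = Counter()
    for labels in reports:
        # sorted(set(...)) gives each pair once per report, in a fixed order.
        for a, b in combinations(sorted(set(labels)), 2):
            pairs[(a, b)] += 1
    return pairs

# Made-up per-report semantic types.
reports = [
    ["Neoplastic Process", "Disease or Syndrome"],
    ["Disease or Syndrome", "Body Substance", "Neoplastic Process"],
    ["Body Substance"],
]

for pair, count in cooccurrence(reports).most_common(3):
    print(pair, count)
```

Pairs with high counts relative to the individual label frequencies point to labels that are strongly dependent, which matters when choosing a multi-label loss or label grouping.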

Thanks @sumedhasingla

@shyamvis

shyamvis commented Sep 9, 2018

We have been using Noble Coder to identify SNOMED-CT concepts in CT reports of patients with trauma. The goal is to identify sentences that are incidental findings in trauma CT reports and the work is being done by Gaurav Trivedi in ISP. We noticed that the versions of vocabularies that come with Noble are quite old.

You can get the latest UMLS in NLM format, unzip it, concatenate the MR**.RRF files (cat MRCONSO.RRF.a* > MRCONSO.RRF), and use Noble Coder to import the result as RRF. If you need the concept type hierarchy, download the latest NCI Metathesaurus in RRF format. Contact Eugene Tseytlin at [email protected] if you need more information about Noble Coder.
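On a machine without `cat`, the concatenation step can be done in Python as sketched here. The helper name `concatenate_parts` is hypothetical, and the demo runs on throwaway files rather than a real UMLS download; it assumes the split parts sort correctly by file name (`.aa`, `.ab`, ...).

```python
import glob
import os
import shutil
import tempfile

def concatenate_parts(pattern, destination):
    """Append all files matching `pattern`, in name order, into `destination`."""
    parts = sorted(glob.glob(pattern))
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # streams, so large files are fine
    return parts

# Demo on small temporary files standing in for MRCONSO.RRF.aa, .ab, ...
with tempfile.TemporaryDirectory() as tmp:
    for suffix, text in (("aa", "row1\n"), ("ab", "row2\n")):
        with open(os.path.join(tmp, "MRCONSO.RRF." + suffix), "w") as f:
            f.write(text)
    target = os.path.join(tmp, "MRCONSO.RRF")
    used = concatenate_parts(os.path.join(tmp, "MRCONSO.RRF.a*"), target)
    with open(target) as f:
        merged = f.read()

print(len(used), "parts merged")
```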

The UMLS semantic types are at https://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html. If useful you can selectively annotate only those types that are of interest.

In our experiments so far it seems that bag of words with and without embeddings is not improved by adding UMLS or SNOMED-CT concepts.

@kayhan-batmanghelich
Collaborator

Thanks @sumedhasingla for the comment.

@sumedhasingla @pyadolla FYI.

@sumedhasingla
Contributor Author

Results of the NOBLE tool on 6K reports.
To get a better understanding of the semantic types and concepts, I collected statistics on how many times any word is tagged with a given (semantic type, concept) pair.
Below are some examples of concepts within a semantic type:
Mental Process: 20 (number of concepts in this semantic type)
  • Regression - mental defense mechanism
  • Content
  • impression (attitude)
  • Psychological habituation
  • Suspicion
  • Persistence
  • Interpretation
  • Attention Concentration
  • Psychological fixation
  • Body Image

Intellectual Product: 165 (number of concepts in this semantic type)
  • Uncertain
  • Observation Interpretation - intermediate
  • Group Object
  • Financial Account
  • International Classification of Diseases
  • Participation Type - device
  • attestation
  • Leaflet
  • Severe Extremity Pain
  • Pattern of Bowel Movements

Conceptual Entity: 80 (number of concepts in this semantic type)
  • Upper Limit of Normal
  • Study Terminated
  • Combination
  • Computer Configuration
  • Fusiform shape
  • Content
  • Background
  • dictated - ParticipationMode
  • Spicule
  • Low Risk

Neoplastic Process: 69 (number of concepts in this semantic type)
  • Breast Carcinoma
  • Esophageal Neoplasms
  • Thyroid Gland Nodule
  • Metastatic Malignant Neoplasm in the Pleura
  • Cyst
  • Metastatic Malignant Neoplasm in the Lymph Nodes
  • Minimal Residual Disease
  • Mass of thyroid gland
  • Leiomyoma
  • Plasma Cell Myeloma
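The tallies above (how often a word is tagged with a given pair, and how many distinct concepts each semantic type holds) can be collected as in the following sketch. It assumes the annotations have already been parsed into (word, semantic type, concept) triples; that format, the function name `count_pairs`, and the sample triples are hypothetical.

```python
from collections import Counter, defaultdict

def count_pairs(annotations):
    """annotations: iterable of (word, semantic_type, concept) triples."""
    pair_counts = Counter()              # (semantic_type, concept) -> tag count
    concepts_by_type = defaultdict(set)  # semantic_type -> distinct concepts
    for _word, sem_type, concept in annotations:
        pair_counts[(sem_type, concept)] += 1
        concepts_by_type[sem_type].add(concept)
    return pair_counts, concepts_by_type

# Made-up annotations, for illustration only.
annotations = [
    ("impression", "Mental Process", "impression (attitude)"),
    ("impression", "Mental Process", "impression (attitude)"),
    ("suspicious", "Mental Process", "Suspicion"),
    ("cyst", "Neoplastic Process", "Cyst"),
]

pair_counts, concepts_by_type = count_pairs(annotations)
for sem_type, concepts in concepts_by_type.items():
    print(sem_type, ":", len(concepts), sorted(concepts))
```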

@sumedhasingla
Contributor Author

The interesting semantic types are:
'Acquired Abnormality', 'Anatomical Abnormality', 'Congenital Abnormality', 'Sign or Symptom', 'Finding', 'Pathologic Function', 'Disease or Syndrome', 'Disease or Syndrome, Anatomical Abnormality'
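For the multi-label setup discussed earlier in the thread, the semantic types of a report can be turned into a fixed-length 0/1 target vector over these interesting types. This is a sketch with assumptions labeled in the comments: the combined tag 'Disease or Syndrome, Anatomical Abnormality' is split into its two parts, and `multi_hot` is a hypothetical helper.

```python
INTERESTING_TYPES = [
    "Acquired Abnormality", "Anatomical Abnormality", "Congenital Abnormality",
    "Sign or Symptom", "Finding", "Pathologic Function", "Disease or Syndrome",
]

def multi_hot(report_types, label_order=INTERESTING_TYPES):
    """Map the semantic types found in one report to a 0/1 vector."""
    present = set()
    for t in report_types:
        # Assumption: a combined tag is treated as each of its parts.
        # Splitting on commas is safe here only because none of the
        # interesting type names contains a comma.
        present.update(part.strip() for part in t.split(","))
    return [1 if label in present else 0 for label in label_order]

# Made-up semantic types for one report; "Tissue" is ignored as uninteresting.
types_in_report = ["Finding", "Disease or Syndrome, Anatomical Abnormality", "Tissue"]
print(multi_hot(types_in_report))
```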

@sumedhasingla
Contributor Author

image
