[Multi-Modal Image+Text] Preprocess the text reports #55
Comments
FYI @pyadolla. Thanks @sumedhasingla. This word count figure is very informative. The most frequent words are "no", "right" and "left", which shows we have to find a good negation pipeline. Also, what do you think about this: if we get a radiologist intern (I am working on it), does it make sense to draw an "attention" box for each sentence?
@sumedhasingla can you show the word count for the impression and findings sections separately?
Thanks @sumedhasingla
To calculate statistics about the most frequent tags, I ran the NOBLE tool on 6K reports and computed the inverse document frequency.
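For reference, a minimal sketch of how an inverse document frequency over tags could be computed. It assumes each report has already been reduced to a list of tag strings; the function name, the toy tags, and the unsmoothed `log(N / df)` variant are illustrative assumptions, not the NOBLE output format or the exact formula used here.

```python
import math
from collections import Counter


def inverse_document_frequency(report_tags):
    """Return {tag: log(N / df)} where df is the number of reports containing the tag.

    report_tags is a list with one iterable of tag strings per report
    (a hypothetical input format chosen for this sketch).
    """
    n_reports = len(report_tags)
    doc_freq = Counter()
    for tags in report_tags:
        doc_freq.update(set(tags))  # count each tag at most once per report
    return {tag: math.log(n_reports / df) for tag, df in doc_freq.items()}


# Toy data standing in for the tags of three reports.
reports = [
    ["pleural effusion", "atelectasis"],
    ["pleural effusion"],
    ["pneumothorax", "pleural effusion"],
]
idf = inverse_document_frequency(reports)

# Tags that appear in every report get a score of 0, i.e. they are non-informative.
for tag, score in sorted(idf.items(), key=lambda kv: kv[1]):
    print(f"{tag}\t{score:.3f}")
```

Sorting by ascending score puts the most common (least discriminative) tags first, which makes the overly general terms easy to spot.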
@sumedhasingla hmmm... many of those terms are too general and hence non-informative. Interesting things to look into: Pleural Effusion. What are those? Would you please take a look? I also wonder what examples there are of the following concepts: There are semantic tags that we should be able to predict; for example, see below. I think these should have a salient visual representation:
@sumedhasingla could you please eyeball which concepts/semantic types co-occur? Given that we are going to do multi-label prediction, knowing this will be helpful. Thanks @sumedhasingla
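One way to go beyond eyeballing is to count how often each pair of concepts appears in the same report. This is only a sketch under the assumption that every report is available as a list of concept or semantic-type strings; the function name and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(report_tags):
    """Count, for each unordered pair of tags, the number of reports containing both."""
    pair_counts = Counter()
    for tags in report_tags:
        # Sort the de-duplicated tags so that each pair has one canonical order.
        for a, b in combinations(sorted(set(tags)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Toy data standing in for the concepts found in three reports.
reports = [
    ["effusion", "atelectasis", "pneumothorax"],
    ["effusion", "atelectasis"],
    ["pneumothorax", "effusion"],
]
pairs = cooccurrence_counts(reports)
for (a, b), count in pairs.most_common():
    print(f"{a} + {b}: {count}")
```

Pairs with high counts are candidates for labels that will be strongly correlated in the multi-label setting.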
We have been using Noble Coder to identify SNOMED-CT concepts in CT reports of patients with trauma. The goal is to identify sentences that are incidental findings in trauma CT reports, and the work is being done by Gaurav Trivedi in ISP. We noticed that the versions of the vocabularies that come with Noble are quite old. You can get the latest UMLS in NLM format, unzip and concatenate the MR**.RRF files (cat MRCONSO.RRF.a* > MRCONSO.RRF), and use Noble Coder to import it as RRF. If you need the concept type hierarchy, download the latest NCI Metathesaurus in RRF format. Contact Eugene Tseytlin at [email protected] if you need more information about Noble Coder. The UMLS semantic types are at https://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html. If useful, you can selectively annotate only those types that are of interest. In our experiments so far, it seems that bag of words, with and without embeddings, is not improved by adding UMLS or SNOMED-CT concepts.
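Where `cat` is not available, the concatenation step can be sketched in Python. The helper name below is hypothetical, and the sketch assumes the split parts sort correctly by file name (as in `MRCONSO.RRF.aa`, `MRCONSO.RRF.ab`, and so on).

```python
import glob
import shutil


def concatenate_parts(pattern, output_path):
    """Append every file matching pattern, in sorted name order, to output_path.

    Rough equivalent of: cat <pattern> > <output_path>
    Returns the list of part files that were concatenated.
    """
    parts = sorted(glob.glob(pattern))
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # stream, so large files are not loaded into memory
    return parts


# Example call (paths are placeholders for wherever the UMLS archive was unzipped):
# concatenate_parts("MRCONSO.RRF.a*", "MRCONSO.RRF")
```

The files are copied in binary mode so that the content is passed through unchanged.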
Thanks @sumedhasingla for the comment. @sumedhasingla @pyadolla FYI.
Result of the NOBLE tool on 6K reports.
The interesting semantic types are:
Extract just the finding and impression sections from the reports
Get an idea of the distribution of length (number of words, sentences) of the finding and impression sections across reports.
Perform the word-count. Find the most frequent words.
Find the most frequent sentences.
Group the similar sentences
Fix some metric to define sentence similarity
Train word2vec on the large text-report dataset (500K reports-spine, chest CT, head CT, others)
Train word embeddings of different dimensions, d = 100 to 300
Get statistics about the most frequent tags (tags are obtained from the NOBLE tool, etc.)
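The first few items of this list (section extraction, length distribution, word count, most frequent sentences) can be sketched with the standard library alone. This is a rough starting point, not the project's pipeline: it assumes reports are plain text with upper-case `FINDINGS:` and `IMPRESSION:` headers, uses a naive punctuation-based sentence splitter, and all function names and the toy reports are hypothetical.

```python
import re
from collections import Counter

# Assumed layout: each section starts with an upper-case header followed by a colon
# and runs until the next such header or the end of the report.
SECTION_RE = re.compile(
    r"\b(FINDINGS|IMPRESSION)\b\s*:\s*(.*?)(?=\n[A-Z][A-Z ]+:|\Z)", re.S
)


def extract_sections(report):
    """Return {"findings": text, "impression": text} for the sections that are present."""
    return {name.lower(): body.strip() for name, body in SECTION_RE.findall(report)}


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lower-case alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def section_statistics(reports, section):
    """Per-report (n_words, n_sentences) plus word and sentence frequency counters."""
    lengths, word_counts, sentence_counts = [], Counter(), Counter()
    for report in reports:
        text = extract_sections(report).get(section, "")
        sentences = split_sentences(text)
        words = tokenize(text)
        lengths.append((len(words), len(sentences)))
        word_counts.update(words)
        sentence_counts.update(s.lower() for s in sentences)
    return lengths, word_counts, sentence_counts


# Two made-up reports, used only to exercise the functions.
reports = [
    "HISTORY: Trauma.\nFINDINGS: No pleural effusion. The lungs are clear.\n"
    "IMPRESSION: No acute abnormality.",
    "FINDINGS: No pleural effusion. Right lower lobe atelectasis.\n"
    "IMPRESSION: Right lower lobe atelectasis.",
]
lengths, word_counts, sentence_counts = section_statistics(reports, "findings")
print(lengths)
print(word_counts.most_common(3))
print(sentence_counts.most_common(1))
```

Calling `section_statistics(reports, "impression")` gives the same statistics for the impression section, so the two sections can be compared separately.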