Issue 60 spacy entity overlap #61

Open: wants to merge 4 commits into base: master
2 changes: 1 addition & 1 deletion README.md
@@ -54,7 +54,7 @@ If the matcher throws a warning during initialization, read [this page](https://

## spaCy pipeline component

-QuickUMLS can be used for standalone processing but it can also be use as a component in a modular spaCy pipeline. This follows traditional spaCy handling of concepts to be entity objects added to the Document object. These entity objects contain the CUI, similarity score and Semantic Types in the spacy "underscore" object.
+QuickUMLS can be used for standalone processing but it can also be used as a component in a modular spaCy pipeline. This follows traditional spaCy handling of concepts to be entity objects added to the Document object. These entity objects contain the CUI, similarity score and Semantic Types in the spacy "underscore" object. Note that this implementation follows a [known spacy convention](https://github.com/explosion/spaCy/issues/3608) that entity Spans cannot overlap on a single token. To prevent token overlap, matches are ranked according to the supplied `overlapping_criteria`; when two matches share a token, the higher-ranked match is kept and the other is skipped.

Adding QuickUMLS as a component in a pipeline can be done as follows:

23 changes: 22 additions & 1 deletion quickumls/spacy_component.py
@@ -13,12 +13,13 @@ def __init__(self, nlp, quickumls_fp, best_match=True, ignore_syntax=False, **kw

         This creates a QuickUMLS spaCy component which can be used in modular pipelines.
         This module adds entity Spans to the document where the entity label is the UMLS CUI and the Span's "underscore" object is extended to contain "similarity" and "semtypes" for matched concepts.
+        Note that this implementation follows and enforces a known spacy convention that entity Spans cannot overlap on a single token.

         Args:
             nlp: Existing spaCy pipeline. This is needed to update the vocabulary with UMLS CUI values
             quickumls_fp (str): Path to QuickUMLS data
             best_match (bool, optional): Whether to return only the top match or all overlapping candidates. Defaults to True.
-            ignore_syntax (bool, optional): Wether to use the heuristcs introduced in the paper (Soldaini and Goharian, 2016). TODO: clarify,. Defaults to False
+            ignore_syntax (bool, optional): Whether to use the heuristics introduced in the paper (Soldaini and Goharian, 2016). TODO: clarify. Defaults to False
             **kwargs: QuickUMLS keyword arguments (see QuickUMLS in core.py)
         """

@@ -43,6 +44,15 @@ def __call__(self, doc):
         # pass in the document which has been parsed to this point in the pipeline for ngrams and matches
         matches = self.quickumls._match(doc, best_match=self.best_match, ignore_syntax=self.ignore_syntax)

+        # NOTE: Spacy spans do not allow overlapping tokens, so we prevent the overlap here
+        # For more information, see: https://github.com/explosion/spaCy/issues/3608
+        tokens_in_ents_set = set()
+
+        # let's track any other entities which may have been attached via upstream components
+        for ent in doc.ents:
+            for token_index in range(ent.start, ent.end):
+                tokens_in_ents_set.add(token_index)
+
         # Convert QuickUMLS match objects into Spans
         for match in matches:
             # each match may match multiple ngrams
@@ -59,6 +69,17 @@ def __call__(self, doc):
                 # char_span() creates a Span from these character indices
                 # UMLS CUI should work well as the label here
                 span = doc.char_span(start_char_idx, end_char_idx, label = cui_label_value)
+
+                # before we add this, let's make sure that this entity does not overlap any tokens added thus far
+                candidate_token_indexes = set(range(span.start, span.end))
+
+                # check the intersection and skip this if there is any overlap
+                if len(tokens_in_ents_set.intersection(candidate_token_indexes)) > 0:
+                    continue
+
+                # track this to make sure we do not introduce overlap later
+                tokens_in_ents_set.update(candidate_token_indexes)
+
                 # add some custom metadata to the spans
                 span._.similarity = ngram_match_dict['similarity']
                 span._.semtypes = ngram_match_dict['semtypes']
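
The overlap check in the two hunks above can be read in isolation as a greedy filter over token indices. The following is a minimal sketch of that idea, not the component's actual code. It assumes candidates arrive already ranked (for example by `overlapping_criteria`) and represents each span as a plain `(start, end, label)` tuple instead of a spaCy `Span`. The name `drop_overlapping` and the labels `CUI_A` to `CUI_D` are hypothetical placeholders.

```python
from typing import Iterable, List, Set, Tuple

# (start_token, end_token_exclusive, label): a hypothetical stand-in for a spaCy Span
TokenSpan = Tuple[int, int, str]


def drop_overlapping(candidates: Iterable[TokenSpan],
                     existing: Iterable[TokenSpan] = ()) -> List[TokenSpan]:
    """Greedily keep candidates whose tokens are not already claimed.

    `candidates` is assumed to be sorted best-first, so the first span
    to claim a token wins and later spans touching that token are skipped.
    """
    # tokens already claimed by entities from upstream components
    claimed: Set[int] = set()
    for start, end, _ in existing:
        claimed.update(range(start, end))

    kept: List[TokenSpan] = []
    for start, end, label in candidates:
        tokens = set(range(start, end))
        # skip the candidate if it shares at least one token with a claimed span
        if claimed & tokens:
            continue
        # record its tokens so later candidates cannot overlap it
        claimed |= tokens
        kept.append((start, end, label))
    return kept


if __name__ == "__main__":
    upstream = [(0, 2, "PERSON")]  # tokens 0 and 1 are already taken
    ranked = [(1, 3, "CUI_A"), (3, 5, "CUI_B"), (4, 6, "CUI_C"), (6, 7, "CUI_D")]
    print(drop_overlapping(ranked, upstream))
    # [(3, 5, 'CUI_B'), (6, 7, 'CUI_D')]
```

Because the filter is greedy, its result depends entirely on the input order: upstream entities always win, and among QuickUMLS matches the earliest candidate wins. One point a reviewer may want to confirm separately: spaCy's `Doc.char_span` returns `None` when the character offsets do not align with token boundaries, so a guard before `span.start` is read could be worth considering if such offsets can occur here.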