SpanRuler on Retokenized tokens links back to original token text, not the token text with a split (space) introduced #13457
Unanswered
vrunm asked this question in Help: Coding & Implementations
Goal: First split a token into two tokens, then use SpanRuler to label the two resulting tokens as a single span with one label.
Problem: The labeled span's text is the original text (a single token) rather than the two tokens joined by a separating space (i.e. the text after retokenization).
What I did:
I added a custom token splitter as the first pipeline stage. It correctly splits the single token into two tokens.
I then detect the two split tokens using a SpanRuler. Note that the SpanRuler matches a pattern of two separate tokens (i.e. pattern=['abc', 'efg']) and correctly detects nothing if the pattern is the original single token (pattern='abcefg').
Note that the custom retokenizer respects spaCy's non-destructive tokenization.
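The setup described above can be sketched roughly as follows. This is not the original example from the post, only a minimal reconstruction under assumptions: the component name `split_abcefg`, the label `MY_LABEL`, and the sample sentence are hypothetical, and it assumes a spaCy 3 release that ships the `span_ruler` factory. It shows that `span.text` is a slice of the original document text, while the individual token texts can be joined with a space by hand.

```python
import spacy
from spacy.language import Language


@Language.component("split_abcefg")  # hypothetical component name
def split_abcefg(doc):
    # Split every "abcefg" token into "abc" + "efg". The pieces must
    # concatenate back to the original token text.
    with doc.retokenize() as retokenizer:
        for token in doc:
            if token.text == "abcefg":
                # Attach both subtokens to the first subtoken; the heads only
                # matter for a dependency parse, which is not used here.
                heads = [(token, 0), (token, 0)]
                retokenizer.split(token, ["abc", "efg"], heads=heads)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("split_abcefg")
# SpanRuler stores its matches under doc.spans["ruler"] by default.
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns(
    [{"label": "MY_LABEL", "pattern": [{"TEXT": "abc"}, {"TEXT": "efg"}]}]
)

doc = nlp("see abcefg here")  # hypothetical sample text
span = doc.spans["ruler"][0]

# span.text slices the underlying document text, which the split did not change.
tokens_in_span = [t.text for t in span]
# Joining the token texts manually gives the space-separated form.
spaced_text = " ".join(tokens_in_span)

print(span.text)
print(tokens_in_span)
print(spaced_text)
```

If this sketch matches the setup, the span does cover two tokens, and the missing space comes from `Span.text` reflecting the unchanged document text rather than from the SpanRuler. Joining the token texts, as in `spaced_text`, is one way to obtain the spaced form without altering the document.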
Thanks for any help.
Minimal Reproducible Example:
Actual Output:
Expected output: