make TermNormIsSubStringMappingStrategy handle multi-word substrings #24

EFord36 · 2024-06-04T15:00:34Z

Closes #274

Currently, the strategy only prefers a term if the term norm is a single word contained (as a whole word) in the ent_match_norm. This PR changes it to also prefer a term whose term norm is a sequence of words that appears as a contiguous substring of the ent_match_norm.
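
For reviewers who want a concrete picture of the intended behaviour, here is a minimal sketch of a whole-word, multi-word containment check using a regex. It is not the code in this PR: the function name `term_norm_is_word_substring` and the assumption that both norms are whitespace-separated strings are illustrative only.

```python
import re


def term_norm_is_word_substring(term_norm: str, ent_match_norm: str) -> bool:
    """Return True if the words of term_norm appear consecutively,
    as whole words, inside ent_match_norm.

    Hypothetical helper for illustration; assumes whitespace-separated norms.
    """
    words = term_norm.split()
    if not words:
        # An empty term norm should never be preferred.
        return False
    # Escape each word so punctuation in a norm is matched literally,
    # and allow any run of whitespace between consecutive words.
    body = r"\s+".join(re.escape(word) for word in words)
    # The lookarounds require whitespace (or the string edge) on both sides,
    # so "UNG CANCER" does not match inside "LUNG CANCER".
    pattern = r"(?<!\S)" + body + r"(?!\S)"
    return re.search(pattern, ent_match_norm) is not None
```

With this sketch, `term_norm_is_word_substring("LUNG CANCER", "SMALL CELL LUNG CANCER")` is true, while a term norm whose words appear in the match but not next to each other is rejected.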

In draft because I still need to:

- Check impact on performance (we're using a regex match now, is this too slow? We could compare to looking for the first word with list.index, and then iterating. The implementation here is longer though. We could also try mypyc on that for interest).
- Check impact on behaviour (does this change anything in the test documents we have for the different use cases? It should help, but does it?)
- Write tests
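
To make the performance comparison above concrete, the `list.index` alternative could look roughly like the sketch below. This is an assumption about what that alternative would be, not code from this PR: `contains_word_sequence` and `time_check` are hypothetical names, and both inputs are assumed to be pre-split lists of words.

```python
import timeit
from typing import List


def contains_word_sequence(needle: List[str], haystack: List[str]) -> bool:
    """Return True if needle occurs as a contiguous run of words in haystack.

    Hypothetical helper: finds each occurrence of the first word with
    list.index, then compares the following slice against needle.
    """
    if not needle:
        return False
    first, length = needle[0], len(needle)
    start = 0
    while True:
        try:
            # Next position of the first word at or after `start`.
            idx = haystack.index(first, start)
        except ValueError:
            # The first word does not occur again, so there is no match.
            return False
        if haystack[idx:idx + length] == needle:
            return True
        # Keep searching after this occurrence.
        start = idx + 1


def time_check(number: int = 10_000) -> float:
    """Return the total seconds taken by `number` calls on a small example."""
    needle = "LUNG CANCER".split()
    haystack = "NON SMALL CELL LUNG CANCER STAGE IV".split()
    return timeit.timeit(lambda: contains_word_sequence(needle, haystack), number=number)
```

Timing a regex-based check with the same `timeit` harness on the same inputs would give a like-for-like comparison. Realistic norms from the test documents would be more informative than the toy example used in `time_check`.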

That said, the code itself is ready for a look, to assess whether this is a good idea in a broad sense.

EFord36 requested a review from RichJackson June 4, 2024 15:00

EFord36 force-pushed the multi-word-substring-checking branch from 119f242 to 0d706b3 Compare June 4, 2024 16:07
