-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update fuzzy_match.R #181
Update fuzzy_match.R #181
Conversation
ehwenk
commented
Mar 11, 2024
- update fuzzy match algorithm to cycle through multiple "same distance" matches until one passes the "first letter" rules. This was includes, because found an instance where there was an equal closest match that was a completely different genus and because multiple matches the fuzzy matches all returned NA. This will also mean that if there are multiple equally good matches it will align with the first.
- Currently we have the more conservative approach of throwing out all fuzzy matches if there are multiple matches that are the same distance. I think this new approach is more appropriate.
* update fuzzy match algorithm to cycle through multiple "same distance" matches until one passes the "first letter" rules. This was includes, because found an instance where there was an equal closest match that was a completely different genus and because multiple matches the fuzzy matches all returned NA. This will also mean that if there are multiple equally good matches it will align with the first.
The tests are failing because one trio of alignments that were going to
@wcornwell @dfalster Do you think this is acceptable? If so I'll change the benchmark for the tests. |
further edits to fuzzy match - distances only calculated for names where the first letter of the first and second words in the input text matches names in the reference list with identical first letters for those words - this greatly sped up running the test dataset.
In an ideal world the test would be that we're right >98% (or something). I think in our current testing framework we enforce 100% "correct" but that's not realistic to expect that to stay constant if the algorithm (or the data) changes. |
I hadn't thought about some follow-on effects of filtering by first letter first - and they are philosophically interesting... Previously there were cases in our test datasets (and probably real datasets) where "no match" was returned because the closest distance changed the first letter and when that was thrown out, no further matches were attempted, This meant no match for "Danksia", "Acalyptus" in the test. But if you just go straight to only searching for matches where the first letter matches, such cases are matched - obviously incorrectly. But we could also well be losing good matches with this. Also, with filtering to "same first letter only", strings of text with no letters result in errors - as with one of the tests. So now our tests won't pass, but I actually think the algorithm is better - and far, far faster at fuzzy matching. |
I think we should discuss changing the testing framework a bit to handle future cases like this |
I think we should move to snapshot testing or just be mindful to run local tests/checks before creating a PR |
Please submit a new PR, merging into develop rather master |