Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update fuzzy_match.R #181

Closed
wants to merge 7 commits into from
Closed

Update fuzzy_match.R #181

wants to merge 7 commits into from

Conversation

ehwenk
Copy link
Collaborator

@ehwenk ehwenk commented Mar 11, 2024

  • update fuzzy match algorithm to cycle through multiple "same distance" matches until one passes the "first letter" rules. This was includes, because found an instance where there was an equal closest match that was a completely different genus and because multiple matches the fuzzy matches all returned NA. This will also mean that if there are multiple equally good matches it will align with the first.
  • Currently we have the more conservative approach of throwing out all fuzzy matches if there are multiple matches that are the same distance. I think this new approach is more appropriate.

* update fuzzy match algorithm to cycle through multiple "same distance" matches until one passes the "first letter" rules. This was includes, because found an instance where there was an equal closest match that was a completely different genus and because multiple matches the fuzzy matches all returned NA. This will also mean that if there are multiple equally good matches it will align with the first.
@ehwenk
Copy link
Collaborator Author

ehwenk commented Mar 13, 2024

The tests are failing because one trio of alignments that were going to NA are now being mis-aligned to the wrong genus... Now while it is the wrong genus, they were "garbage" names anyway, so I don't think this is a problem.

Ryandra abc / def -> Randia sp. [Ryandra abc / def; test_all_matches_TRUE] (instead of NA)
Ryandra abc x def -> Randia sp. [Ryandra abc x def; test_all_matches_TRUE] (instead of NA)
Ryandra abc--def -> Randia sp. [Ryandra abc--def; test_all_matches_TRUE] (instead of NA)

@wcornwell @dfalster Do you think this is acceptable? If so I'll change the benchmark for the tests.

further edits to fuzzy match - distances only calculated for names where the first letter of the first  and second words in the input text matches names in the reference list with identical first letters for those words - this greatly sped up running the test dataset.
@wcornwell
Copy link
Contributor

wcornwell commented Mar 13, 2024

In an ideal world the test would be that we're right >98% (or something). I think in our current testing framework we enforce 100% "correct" but that's not realistic to expect that to stay constant if the algorithm (or the data) changes.

@ehwenk
Copy link
Collaborator Author

ehwenk commented Mar 13, 2024

I hadn't thought about some follow-on effects of filtering by first letter first - and they are philosophically interesting...

Previously there were cases in our test datasets (and probably real datasets) where "no match" was returned because the closest distance changed the first letter and when that was thrown out, no further matches were attempted, This meant no match for "Danksia", "Acalyptus" in the test. But if you just go straight to only searching for matches where the first letter matches, such cases are matched - obviously incorrectly. But we could also well be losing good matches with this.

Also, with filtering to "same first letter only", strings of text with no letters result in errors - as with one of the tests.

So now our tests won't pass, but I actually think the algorithm is better - and far, far faster at fuzzy matching.

@wcornwell
Copy link
Contributor

I think we should discuss changing the testing framework a bit to handle future cases like this

@fontikar
Copy link
Collaborator

I think we should discuss changing the testing framework a bit to handle future cases like this

I think we should move to snapshot testing or just be mindful to run local tests/checks before creating a PR

@dfalster
Copy link
Member

Please submit a new PR, merging into develop rather master

@dfalster dfalster closed this Apr 18, 2024
@dfalster dfalster deleted the minor_fixes branch April 19, 2024 06:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants