Update fuzzy_match.R #181

ehwenk · 2024-03-11T08:28:34Z

Update the fuzzy match algorithm to cycle through multiple "same distance" matches until one passes the "first letter" rules. This was included because I found an instance where an equally close match belonged to a completely different genus and, because there were multiple matches, the fuzzy matches all returned NA. It also means that if there are multiple equally good matches, the input will align with the first.
Currently we take the more conservative approach of throwing out all fuzzy matches if there are multiple matches at the same distance. I think this new approach is more appropriate.
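
For readers following along, the tie-handling idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in fuzzy_match.R: `fuzzy_pick()`, `same_first_letter()` and the `max_dist` argument are hypothetical names, and the "first letter" rule is reduced to comparing the first character of each string.

```r
# Hypothetical stand-in for the "first letter" rules (assumption: first character only).
same_first_letter <- function(a, b) {
  identical(toupper(substr(a, 1, 1)), toupper(substr(b, 1, 1)))
}

# Hypothetical helper: return the first equally-close candidate that passes the rule.
fuzzy_pick <- function(txt, accepted, max_dist = 3) {
  if (length(accepted) == 0) return(NA_character_)
  d <- utils::adist(txt, accepted)[1, ]   # edit distance to every reference name
  if (min(d) > max_dist) return(NA_character_)
  tied <- accepted[d == min(d)]           # all closest candidates, in reference-list order
  for (candidate in tied) {
    if (same_first_letter(txt, candidate)) return(candidate)
  }
  NA_character_                           # no tied candidate passed the rule
}

# "Larix" and "Carex" are both one edit away from "Carix";
# "Larix" fails the first-letter rule, so the loop moves on to "Carex".
fuzzy_pick("Carix", c("Larix", "Carex"))
```

Under the more conservative behaviour, the same input would be discarded as soon as two candidates tied.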

ehwenk · 2024-03-13T00:37:23Z

The tests are failing because one trio of alignments that previously went to NA is now being aligned to the wrong genus. While it is the wrong genus, these were "garbage" names anyway, so I don't think this is a problem.

Ryandra abc / def -> Randia sp. [Ryandra abc / def; test_all_matches_TRUE] (instead of NA)
Ryandra abc x def -> Randia sp. [Ryandra abc x def; test_all_matches_TRUE] (instead of NA)
Ryandra abc--def -> Randia sp. [Ryandra abc--def; test_all_matches_TRUE] (instead of NA)

@wcornwell @dfalster Do you think this is acceptable? If so I'll change the benchmark for the tests.

Further edits to fuzzy match: distances are now only calculated against reference-list names whose first and second words start with the same letters as the first and second words of the input text. This greatly sped up running the test dataset.
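
A rough sketch of that pre-filtering step, for illustration only: `first_letter_key()` and `fuzzy_match_prefiltered()` are hypothetical names rather than functions from the package, and the `max_dist` cut-off is an assumption. The point is that `utils::adist()` is only called on the subset of reference names sharing the input's first-letter key.

```r
# Hypothetical helper: lower-cased first letters of the first two words, e.g. "bs".
first_letter_key <- function(x) {
  words <- strsplit(tolower(trimws(x)), "\\s+")
  vapply(words, function(w) {
    w <- w[seq_len(min(2, length(w)))]    # keep at most the first two words
    paste(substr(w, 1, 1), collapse = "")
  }, character(1))
}

# Hypothetical matcher: compute distances only for names with a matching key.
fuzzy_match_prefiltered <- function(txt, accepted, max_dist = 3) {
  candidates <- accepted[first_letter_key(accepted) == first_letter_key(txt)]
  if (length(candidates) == 0) return(NA_character_)
  d <- utils::adist(txt, candidates)[1, ]
  if (min(d) > max_dist) return(NA_character_)
  candidates[which.min(d)]                # first of the closest candidates
}

reference <- c("Banksia serrata", "Banksia spinulosa",
               "Acacia saligna", "Eucalyptus saligna")
fuzzy_match_prefiltered("Banksia serata", reference)   # only the two "bs" names are compared
```

The speed-up comes from shrinking the candidate set before any distances are computed. The trade-off is that a name whose first letter is itself misspelled can never reach its true match.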

wcornwell · 2024-03-13T02:17:41Z

In an ideal world the test would be that we're right >98% of the time (or something). I think our current testing framework enforces 100% "correct", but it's not realistic to expect that to stay constant if the algorithm (or the data) changes.
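
One way such a threshold check could look, as a sketch only: `alignment_accuracy()` is a hypothetical helper, the 0.98 cut-off is just the figure floated above, and the benchmark vectors are made-up illustrative data (200 names, of which three that should be NA are aligned to "Randia sp.").

```r
# Hypothetical helper: share of names whose alignment equals the benchmark.
# NA in both vectors counts as agreement (a name that is correctly left unmatched).
alignment_accuracy <- function(observed, expected) {
  stopifnot(length(observed) == length(expected))
  agree <- (is.na(observed) & is.na(expected)) |
    (!is.na(observed) & !is.na(expected) & observed == expected)
  mean(agree)
}

# Made-up benchmark for illustration, not real test data.
expected <- c(rep("Acacia saligna", 197), rep(NA_character_, 3))
observed <- c(rep("Acacia saligna", 197), rep("Randia sp.", 3))

acc <- alignment_accuracy(observed, expected)
stopifnot(acc >= 0.98)   # passes: 197 of 200 alignments agree
```

In a testthat suite the final check could be written with `expect_gte(acc, 0.98)` instead of `stopifnot()`.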

ehwenk · 2024-03-13T02:40:40Z

I hadn't thought about some follow-on effects of filtering by first letter first - and they are philosophically interesting...

Previously there were cases in our test datasets (and probably real datasets) where "no match" was returned because the closest match changed the first letter and, when that was thrown out, no further matches were attempted. This meant no match for "Danksia" and "Acalyptus" in the test. But if you go straight to searching only among names where the first letter matches, such cases are matched, obviously incorrectly. We could also well be losing good matches with this.

Also, with filtering to "same first letter only", strings of text with no letters result in errors - as with one of the tests.
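
A possible guard for the no-letters case, sketched under assumptions: `has_letters()` and `safe_fuzzy()` are hypothetical wrappers, and `matcher` stands for whatever fuzzy-matching function is actually in use. The idea is simply to return NA before any first-letter filtering is attempted on input that has no letters to filter by.

```r
# Hypothetical check: does the string contain at least one alphabetic character?
has_letters <- function(x) grepl("[[:alpha:]]", x)

# Hypothetical wrapper: skip matching entirely for NA or letter-free input.
safe_fuzzy <- function(txt, accepted, matcher) {
  if (is.na(txt) || !has_letters(txt)) return(NA_character_)
  matcher(txt, accepted)
}

# Toy matcher for illustration only: exact match or NA.
exact_or_na <- function(txt, accepted) {
  if (txt %in% accepted) txt else NA_character_
}

safe_fuzzy("1234 --", c("Acacia saligna"), exact_or_na)          # NA, matcher never called
safe_fuzzy("Acacia saligna", c("Acacia saligna"), exact_or_na)   # "Acacia saligna"
```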

So now our tests won't pass, but I actually think the algorithm is better - and far, far faster at fuzzy matching.

wcornwell · 2024-03-13T03:51:59Z

I think we should discuss changing the testing framework a bit to handle future cases like this

fontikar · 2024-03-13T22:52:41Z

> I think we should discuss changing the testing framework a bit to handle future cases like this

I think we should move to snapshot testing or just be mindful to run local tests/checks before creating a PR

dfalster · 2024-04-18T00:34:13Z

Please submit a new PR, merging into develop rather than master

Update fuzzy_match.R

b452bd1

fontikar and others added 4 commits March 27, 2024 11:26

Merge branch 'master' into minor_fixes

519f46f

changes following reviews

576696c

Merge branch 'master' into minor_fixes

ea9976a

Fix sequences function

1755e6d

dfalster closed this Apr 18, 2024

dfalster deleted the minor_fixes branch April 19, 2024 06:47
