limit common prefix in jaro-winkler #58
This fixes #53
With the second commit it now has the same behavior.
@@ -4,9 +4,14 @@ This project attempts to adhere to [Semantic Versioning](http://semver.org).

## [Unreleased]

### Changed

- only boost similarity in Jaro-Winkler once the Jaro similarity exceeds 0.7
Would it be worth mentioning this in the README and/or the function documentation? Since you said that you've also seen implementations that always boost.
Yes that probably makes sense.
I'll leave it to you to merge this or #67 first.
The Jaro-Winkler paper only allows the common prefix to be boosted for the first 4 characters. This can be found in Winkler's paper about the Jaro-Winkler similarity (https://files.eric.ed.gov/fulltext/ED325505.pdf). More specifically in
Results still differ from the ones I get in RapidFuzz, since I only boost the score if the Jaro similarity is above 0.7. The paper mentions:
This suggests a limit of 0.7. However, there is a C implementation from Winkler which only boosts scores starting at 0.7 here: https://web.archive.org/web/19990822155334/http://www.census.gov/geo/msb/stand/strcmp.c
I think so far I have only seen implementations which either always boost or boost starting at 0.7.
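For reference, here is a minimal Python sketch of the two behaviors discussed above: the common prefix is capped at 4 characters, and the boost is only applied once the Jaro similarity exceeds 0.7. This is not the code from this PR or from RapidFuzz. The function names and the parameters `prefix_weight`, `max_prefix` and `boost_threshold` are assumptions made up for illustration, and the scaling factor of 0.1 is the commonly used default rather than something stated in this thread.

```python
def jaro(s1: str, s2: str) -> float:
    """Plain Jaro similarity in [0, 1] (illustrative helper)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Two characters match if they are equal and at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that line up in a different order count as
    # half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1: str, s2: str,
                 prefix_weight: float = 0.1,    # assumed default scaling factor
                 max_prefix: int = 4,           # prefix cap discussed above
                 boost_threshold: float = 0.7   # boost only above this Jaro score
                 ) -> float:
    """Jaro-Winkler with a capped prefix and a boost threshold (sketch)."""
    sim = jaro(s1, s2)
    if sim <= boost_threshold:
        # Not similar enough: return the plain Jaro similarity.
        return sim

    # Length of the common prefix, capped at max_prefix characters.
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1

    return sim + prefix * prefix_weight * (1.0 - sim)
```

With the cap in place, `jaro_winkler("abcdefgh", "abcdefgx")` only gets credit for 4 of the 7 shared prefix characters, and a pair such as `"abxxxx"` / `"abyyyy"` with a low Jaro similarity is returned unboosted despite its shared prefix. An implementation that always boosts would correspond to a `boost_threshold` of 0.0 for pairs with a nonzero Jaro similarity.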