limit common prefix in jaro-winkler #58

Merged 2 commits on Jan 5, 2024
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -17,9 +17,12 @@ This project attempts to adhere to [Semantic Versioning](http://semver.org).
- reduce runtime in our own benchmark by more than `70%`
- reduce binary size by more than `25%`

+- only boost similarity in Jaro-Winkler once the Jaro similarity exceeds 0.7
Member:

Would it be worth mentioning this in the README and/or the function documentation, since you said that you've also seen implementations that always boost?

Member Author:

Yes, that probably makes sense.
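
For reference, the rule discussed in this thread can be written out as follows. This is a sketch derived from the diff in this PR; the symbols $\mathrm{sim}_J$, $\mathrm{sim}_{JW}$, $p$ and $\ell$ are notation introduced here for illustration, not names used by the crate.

$$
\mathrm{sim}_{JW} =
\begin{cases}
\mathrm{sim}_J + p \,\ell\, (1 - \mathrm{sim}_J) & \text{if } \mathrm{sim}_J > 0.7 \\
\mathrm{sim}_J & \text{otherwise}
\end{cases}
\qquad p = 0.1, \quad \ell = \min(\text{common prefix length},\, 4)
$$

Because $p\,\ell \le 0.4$, the boosted value can no longer exceed `1.0`, which is presumably why the diff below drops the old clamp.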


### Fixed

- Fix transposition counting in Jaro and Jaro-Winkler.
+- Limit common prefix in Jaro-Winkler to 4 characters

## [0.10.0] - (2020-01-31)

2 changes: 1 addition & 1 deletion README.md
@@ -10,7 +10,7 @@
- [Levenshtein] - distance & normalized
- [Optimal string alignment]
- [Damerau-Levenshtein] - distance & normalized
-- [Jaro and Jaro-Winkler] - this implementation of Jaro-Winkler does not limit the common prefix length
+- [Jaro and Jaro-Winkler]
- [Sørensen-Dice]

The normalized versions return values between `0.0` and `1.0`, where `1.0` means
31 changes: 14 additions & 17 deletions src/lib.rs
@@ -194,22 +194,19 @@ where
&'b Iter2: IntoIterator<Item = Elem2>,
Elem1: PartialEq<Elem2>,
{
-    let jaro_distance = generic_jaro(a, b);
+    let sim = generic_jaro(a, b);

-    // Don't limit the length of the common prefix
-    let prefix_length = a
-        .into_iter()
-        .zip(b)
-        .take_while(|(a_elem, b_elem)| a_elem == b_elem)
-        .count();
+    if sim > 0.7 {
+        let prefix_length = a
+            .into_iter()
+            .take(4)
+            .zip(b)
+            .take_while(|(a_elem, b_elem)| a_elem == b_elem)
+            .count();

-    let jaro_winkler_distance =
-        jaro_distance + (0.1 * prefix_length as f64 * (1.0 - jaro_distance));

-    if jaro_winkler_distance <= 1.0 {
-        jaro_winkler_distance
+        sim + 0.1 * prefix_length as f64 * (1.0 - sim)
     } else {
-        1.0
+        sim
     }
 }
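
The adjustment in the hunk above can be isolated into a small standalone sketch. The helper names `winkler_boost` and `common_prefix_len` and the three constants are hypothetical and introduced only for illustration; they are not part of the crate, and the sketch works on `&str` rather than the generic iterators used in `src/lib.rs`.

```rust
/// Hypothetical helper (not part of strsim): applies the Winkler prefix
/// boost to an already-computed Jaro similarity.
fn winkler_boost(sim: f64, common_prefix_len: usize) -> f64 {
    // Values taken from the diff: boost only above 0.7, cap the prefix
    // at 4 elements, scale by 0.1.
    const THRESHOLD: f64 = 0.7;
    const MAX_PREFIX: usize = 4;
    const SCALING: f64 = 0.1;

    if sim > THRESHOLD {
        let prefix = common_prefix_len.min(MAX_PREFIX) as f64;
        // With prefix <= 4 the result stays within [sim, 1.0],
        // so no clamp is needed.
        sim + SCALING * prefix * (1.0 - sim)
    } else {
        // At or below the threshold the Jaro similarity is returned unchanged.
        sim
    }
}

/// Hypothetical helper: number of leading `char`s shared by both strings.
fn common_prefix_len(a: &str, b: &str) -> usize {
    a.chars()
        .zip(b.chars())
        .take_while(|(x, y)| x == y)
        .count()
}
```

For instance, `common_prefix_len("cheeseburger", "cheese fries")` counts the six characters of `cheese`, but `winkler_boost(0.8, 6)` treats the prefix as 4 and yields roughly `0.88`, while `winkler_boost(0.6, 6)` leaves `0.6` untouched. Capping inside the helper is equivalent to the `.take(4)` placed before `.zip(b)` in the diff.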

@@ -218,7 +215,7 @@ where
/// ```
/// use strsim::jaro_winkler;
///
-/// assert!((0.911 - jaro_winkler("cheeseburger", "cheese fries")).abs() <
+/// assert!((0.866 - jaro_winkler("cheeseburger", "cheese fries")).abs() <
/// 0.001);
/// ```
pub fn jaro_winkler(a: &str, b: &str) -> f64 {
@@ -960,15 +957,15 @@ mod tests {
#[test]
fn jaro_winkler_names() {
assert_delta!(
-0.562,
+0.452,
jaro_winkler("Friedrich Nietzsche", "Fran-Paul Sartre"),
0.001
);
}

#[test]
fn jaro_winkler_long_prefix() {
-assert_delta!(0.911, jaro_winkler("cheeseburger", "cheese fries"), 0.001);
+assert_delta!(0.866, jaro_winkler("cheeseburger", "cheese fries"), 0.001);
}

#[test]
@@ -984,7 +981,7 @@ mod tests {
#[test]
fn jaro_winkler_very_long_prefix() {
assert_delta!(
-1.0,
+0.98519,
jaro_winkler("thequickbrownfoxjumpedoverx", "thequickbrownfoxjumpedovery")
);
}
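
The change from `1.0` to `0.98519` in `jaro_winkler_very_long_prefix` can be checked by hand. The working below assumes the standard Jaro formula with $m$ matched characters and $t$ transpositions, and assumes that the two 27-character inputs, which differ only in their last character, give $m = 26$ and $t = 0$.

$$
\mathrm{sim}_J = \frac{1}{3}\left(\frac{26}{27} + \frac{26}{27} + \frac{26 - 0}{26}\right) = \frac{79}{81} \approx 0.97531
$$

$$
\text{before: } \frac{79}{81} + 0.1 \cdot 26 \cdot \frac{2}{81} \approx 1.0395 \;\rightarrow\; \text{clamped to } 1.0
\qquad
\text{after: } \frac{79}{81} + 0.1 \cdot 4 \cdot \frac{2}{81} = \frac{79.8}{81} \approx 0.98519
$$

With the prefix capped at 4, two strings that are not identical no longer score a perfect `1.0`.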
2 changes: 1 addition & 1 deletion tests/lib.rs
@@ -67,5 +67,5 @@ fn jaro_works() {

#[test]
fn jaro_winkler_works() {
-assert_delta!(0.911, jaro_winkler("cheeseburger", "cheese fries"), 0.001);
+assert_delta!(0.866, jaro_winkler("cheeseburger", "cheese fries"), 0.001);
}
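
The `cheeseburger` / `cheese fries` expectation moves from `0.911` to `0.866` for the same reason. The working below is a hand check, assuming the standard Jaro formula yields eight matched characters (`cheese`, `r`, `e`) and no transpositions for these two 12-character strings; both the old and the new expected value are consistent with that assumption.

$$
\mathrm{sim}_J = \frac{1}{3}\left(\frac{8}{12} + \frac{8}{12} + \frac{8 - 0}{8}\right) = \frac{7}{9} \approx 0.7778
$$

$$
\text{before (prefix } 6\text{): } \frac{7}{9} + 0.1 \cdot 6 \cdot \frac{2}{9} \approx 0.9111
\qquad
\text{after (prefix capped at } 4\text{): } \frac{7}{9} + 0.1 \cdot 4 \cdot \frac{2}{9} \approx 0.8667
$$

Both results fall within the `0.001` delta used by the assertions.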