-
-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fuzzy matching for journal abbreviations #12477
base: main
Are you sure you want to change the base?
Fuzzy matching for journal abbreviations #12477
Conversation
build.gradle
Outdated
@@ -178,6 +178,7 @@ dependencies { | |||
implementation "org.apache.lucene:lucene-queries:$luceneVersion" | |||
implementation "org.apache.lucene:lucene-analysis-common:$luceneVersion" | |||
implementation "org.apache.lucene:lucene-highlighter:$luceneVersion" | |||
implementation 'org.apache.commons:commons-text:1.13.0' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for Apache commons text, we already have a Levenstein distance implementation
private final Levenshtein METRIC_DISTANCE = new Levenshtein(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun
, check the results, commit, and push.
You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun
, check the results, commit, and push.
You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun
, check the results, commit, and push.
You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".
build.gradle
Outdated
@@ -73,34 +73,34 @@ application { | |||
mainModule.set('org.jabref') | |||
|
|||
applicationDefaultJvmArgs = [ | |||
// On a change here, also adapt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please revert the formatting changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced the file with the original
Sure!
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun
, check the results, commit, and push.
You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, welcome to JabRef and thank you for your contribution!
I was unable to understand the need of a lot of changes which caused diff, so kindly explain the "why" in the comments below, and revert them.
Furthermore, is the PR description copy-pasted from somewhere? Please do not remove the mandatory checks and link the issue properly.
private transient String name; | ||
private final String abbreviation; | ||
private transient String dotlessAbbreviation; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific reason why the order was changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, but I removed some functions and rewrote them, which caused changes to the code structure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how that should remove class fields, but alright.
|
||
// Is the empty string if not available |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are the comments removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m sorry about that. This is my first open-source contribution! Personally, I often find comments to be annoying!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's alright. Read on why comments are used in code, and when they should be used. Furthermore, in open-source, whenever you touch existing code - know that it it was written by another developer and had gotten approved to be merged. So by default, any changes made to that (that don't pertain to the issue being solved) should be justified.
@@ -21,12 +25,12 @@ public Abbreviation(String name, String abbreviation) { | |||
public Abbreviation(String name, String abbreviation, String shortestUniqueAbbreviation) { | |||
this(name, | |||
abbreviation, | |||
// "L. N." becomes "L N ", we need to remove the double spaces inbetween |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you replaced the code directly from somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, but I removed some functions and rewrote them more than once, which caused changes to the code structure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted
return Objects.hash(getName(), getAbbreviation(), getShortestUniqueAbbreviation()); | ||
return Objects.hash(normalize(getName()), normalize(getAbbreviation()), normalize(getShortestUniqueAbbreviation())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the hashcode stay the same if two abbreviations have the same "normalized" name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this to make sure that both values consider the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I understand, there is no difference between the two names except for their case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but my question is, why does the hashCode
need to change in that case? Where do you use/plan to use it? I don't see any direct equality check or use of hashing anywhere in your changes.
Please do not mark discussions resolved until the person who asked the question does so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll revert the change!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun
, check the results, commit, and push.
You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun
, check the results, commit, and push.
You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".
@arraymahdi please don't resolve the conversations if it involves a question or discussion. If the questioner is satisfied, they'll resolve it themselves or reply that they're fine with your explanation. Otherwise, we have to expand it to see your responses and carry it forward. For simple comments like the "revert" ones, it's fine. |
Noted. |
For failing tests refer to https://devdocs.jabref.org/code-howtos/faq.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add test case for non-latin characters
Chinese Science Bulletin (中国科学通报) → Chin. Sci. Bull.
Full Title (中文) | Abbreviation (中文) |
---|---|
中国物理学报 | 物理学报 |
中国科学: 物理学 | 中科物理 |
数学学报 | 数学报 |
计算机学报 | 计学报 |
电子学报 | 电学报 |
化学学报 | 化学报 |
Also for German
I think, the current implemetnation would consider all chinese journal equal, which is wrong.
For French and German, your approach could work, because the likelyhood that there are only accented letters is low.
Please see the discussion at JabRef#235
Normalization should NOT remove characters, it should really normalize it.
I propose to sue ASCIIFoldingFilter for normalization
I also think that equals
should not use normalization, because there could be mistakes made there?!
@@ -152,7 +152,7 @@ public void addCustomAbbreviations(Collection<Abbreviation> abbreviationsToAdd) | |||
} | |||
|
|||
public Optional<String> getNextAbbreviation(String text) { | |||
return get(text).map(abbreviation -> abbreviation.getNext(text)); | |||
return get(text).map(abbreviation -> abbreviation.getNext(text.trim())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why a trim at the end but no trim at the beginning? Shoulnd't be text
be trimmed on method entry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Thank you very much for your insightful feedback! I’ve reviewed all your comments and did my best to implement them."
private static String normalize(String input) { | ||
return input.toLowerCase().replaceAll("[^a-z0-9 ]", "").trim(); | ||
} | ||
|
||
@Override | ||
public int compareTo(Abbreviation toCompare) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What you described in the PR description should appear as JavaDoc here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added DocStrings to the code.
return distance <= 2; | ||
} | ||
|
||
private static String normalize(String input) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add JavaDoc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
@@ -103,7 +130,9 @@ public boolean equals(Object o) { | |||
return false; | |||
} | |||
Abbreviation that = (Abbreviation) o; | |||
return getName().equals(that.getName()) && getAbbreviation().equals(that.getAbbreviation()) && getShortestUniqueAbbreviation().equals(that.getShortestUniqueAbbreviation()); | |||
return Objects.equals(normalize(name), normalize(that.name)) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add some JavaDoc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, there should be no normalization here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted this change
@Test | ||
void equalAbbrevationsWithFourComponentsAreAlsoCompareZero() { | ||
Abbreviation abbreviation1 = new Abbreviation("Long Name", "L. N.", "LN"); | ||
Abbreviation abbreviation2 = new Abbreviation("Long Name", "L. N.", "LN"); | ||
assertEquals(0, abbreviation1.compareTo(abbreviation2)); | ||
} | ||
|
||
@Test | ||
void testCompareToWithFuzzyMatching() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please convert to parameterized Test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a parameterized for compareTo
String normalizedOther = normalize(otherName); | ||
|
||
int distance = LEVENSHTEIN.apply(normalizedThis, normalizedOther); | ||
return distance <= 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a reasoning why "2" works.
@@ -1,12 +1,19 @@ | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No additional empty line at the beginning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
To ease organizational workflows I have linked the pull-request to the issue with syntax as described in https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue
Example:
|
Please do not remove the 'Mandatory checks' from the PR description. I manually readded them. Note that we cannot accept any change if you don't guarantee to own the ip rights of to the code submitted and license them under the terms of the MIT license. In this case, check the checkbox. If not, we will reject your changes and close this PR. |
@arraymahdi this is not a UI change, yet I see the box for "added screenshots in PR description" checked. We don't put the checks for the sake of formality. Please read what they say before you tick. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun
, check the results, commit, and push.
You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".
@arraymahdi Please, please, please, before requesting reviews fix the tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DO NOT REMOVE FILES
Some files of .vscode
dir were removed. This is not OK without checking with the devs.
I also don't see any changelog entry, yet you've ticked the check. |
fixes #12467
Summary
This PR enhances the
Abbreviation
class by introducing fuzzy matching to improve abbreviation comparisons. It allows for minor variations, such as punctuation differences, extra spaces, slight typos, and case insensitivity, ensuring a more flexible and accurate comparison system.Changes Made
✅ Implemented Fuzzy Matching in
isSimilar
Method✅ Enhanced
compareTo
to Use Fuzzy Matching"L. N."
vs."L N"
) are now considered equal if they meet the similarity threshold.✅ Added Unit Tests for Fuzzy Matching
"Long Name"
vs."Longn Name"
."Long Name"
vs."Long, Name"
."Long Name"
vs."LONG NAME"
."L. N."
vs."L N"
.Mandatory checks
CHANGELOG.md
described in a way that is understandable for the average user (if change is visible to the user)