Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fuzzy matching for journal abbreviations #12477

Draft
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

arraymahdi
Copy link

@arraymahdi arraymahdi commented Feb 10, 2025

fixes #12467

Summary

This PR enhances the Abbreviation class by introducing fuzzy matching to improve abbreviation comparisons. It allows for minor variations, such as punctuation differences, extra spaces, slight typos, and case insensitivity, ensuring a more flexible and accurate comparison system.

Changes Made

Implemented Fuzzy Matching in isSimilar Method

  • Uses Levenshtein Distance to allow minor typos (up to 2 character differences).
  • Normalizes text by removing punctuation and converting it to lowercase.

Enhanced compareTo to Use Fuzzy Matching

  • Abbreviations with slight variations (e.g., "L. N." vs. "L N") are now considered equal if they meet the similarity threshold.
  • Ensures consistent ordering even when minor differences exist.

Added Unit Tests for Fuzzy Matching

  • Handles minor typos: "Long Name" vs. "Longn Name".
  • Ignores punctuation differences: "Long Name" vs. "Long, Name".
  • Supports case insensitivity: "Long Name" vs. "LONG NAME".
  • Handles missing spaces: "L. N." vs. "L N".

Mandatory checks

  • I own the copyright of the code submitted and I licence it under the MIT license
  • Change in CHANGELOG.md described in a way that is understandable for the average user (if change is visible to the user)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

build.gradle Outdated
@@ -178,6 +178,7 @@ dependencies {
implementation "org.apache.lucene:lucene-queries:$luceneVersion"
implementation "org.apache.lucene:lucene-analysis-common:$luceneVersion"
implementation "org.apache.lucene:lucene-highlighter:$luceneVersion"
implementation 'org.apache.commons:commons-text:1.13.0'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for Apache commons text, we already have a Levenstein distance implementation

private final Levenshtein METRIC_DISTANCE = new Levenshtein();

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

build.gradle Outdated
@@ -73,34 +73,34 @@ application {
mainModule.set('org.jabref')

applicationDefaultJvmArgs = [
// On a change here, also adapt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please revert the formatting changes

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced the file with the original

@arraymahdi
Copy link
Author

arraymahdi commented Feb 10, 2025 via email

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

Copy link
Member

@subhramit subhramit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, welcome to JabRef and thank you for your contribution!
I was unable to understand the need of a lot of changes which caused diff, so kindly explain the "why" in the comments below, and revert them.

Furthermore, is the PR description copy-pasted from somewhere? Please do not remove the mandatory checks and link the issue properly.

Comment on lines 10 to 12
private transient String name;
private final String abbreviation;
private transient String dotlessAbbreviation;
Copy link
Member

@subhramit subhramit Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason why the order was changed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but I removed some functions and rewrote them, which caused changes to the code structure

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how that should remove class fields, but alright.


// Is the empty string if not available
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are the comments removed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m sorry about that. This is my first open-source contribution! Personally, I often find comments to be annoying!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's alright. Read on why comments are used in code, and when they should be used. Furthermore, in open-source, whenever you touch existing code - know that it it was written by another developer and had gotten approved to be merged. So by default, any changes made to that (that don't pertain to the issue being solved) should be justified.

@@ -21,12 +25,12 @@ public Abbreviation(String name, String abbreviation) {
public Abbreviation(String name, String abbreviation, String shortestUniqueAbbreviation) {
this(name,
abbreviation,
// "L. N." becomes "L N ", we need to remove the double spaces inbetween
Copy link
Member

@subhramit subhramit Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you replaced the code directly from somewhere?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but I removed some functions and rewrote them more than once, which caused changes to the code structure

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted

src/main/java/org/jabref/logic/journals/Abbreviation.java Outdated Show resolved Hide resolved
src/main/java/org/jabref/logic/journals/Abbreviation.java Outdated Show resolved Hide resolved
Comment on lines 111 to 130
return Objects.hash(getName(), getAbbreviation(), getShortestUniqueAbbreviation());
return Objects.hash(normalize(getName()), normalize(getAbbreviation()), normalize(getShortestUniqueAbbreviation()));
Copy link
Member

@subhramit subhramit Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the hashcode stay the same if two abbreviations have the same "normalized" name?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this to make sure that both values consider the same

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understand, there is no difference between the two names except for their case.

Copy link
Member

@subhramit subhramit Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but my question is, why does the hashCode need to change in that case? Where do you use/plan to use it? I don't see any direct equality check or use of hashing anywhere in your changes.

Please do not mark discussions resolved until the person who asked the question does so.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll revert the change!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

@arraymahdi arraymahdi requested a review from subhramit February 10, 2025 13:25
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

@subhramit
Copy link
Member

@arraymahdi please don't resolve the conversations if it involves a question or discussion. If the questioner is satisfied, they'll resolve it themselves or reply that they're fine with your explanation. Otherwise, we have to expand it to see your responses and carry it forward. For simple comments like the "revert" ones, it's fine.

@arraymahdi
Copy link
Author

@arraymahdi please don't resolve the conversations if it involves a question or discussion. If the questioner is satisfied, they'll resolve it themselves or reply that they're fine with your explanation. Otherwise, we have to expand it to see your responses and carry it forward. For simple comments like the "revert" ones, it's fine.

Noted.

@subhramit
Copy link
Member

For failing tests refer to https://devdocs.jabref.org/code-howtos/faq.html

Copy link
Member

@koppor koppor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add test case for non-latin characters

Chinese Science Bulletin (中国科学通报) → Chin. Sci. Bull.
Full Title (中文) Abbreviation (中文)
中国物理学报 物理学报
中国科学: 物理学 中科物理
数学学报 数学报
计算机学报 计学报
电子学报 电学报
化学学报 化学报

Also for German

I think, the current implemetnation would consider all chinese journal equal, which is wrong.

For French and German, your approach could work, because the likelyhood that there are only accented letters is low.

Please see the discussion at JabRef#235

Normalization should NOT remove characters, it should really normalize it.

I propose to sue ASCIIFoldingFilter for normalization


I also think that equals should not use normalization, because there could be mistakes made there?!

@@ -152,7 +152,7 @@ public void addCustomAbbreviations(Collection<Abbreviation> abbreviationsToAdd)
}

public Optional<String> getNextAbbreviation(String text) {
return get(text).map(abbreviation -> abbreviation.getNext(text));
return get(text).map(abbreviation -> abbreviation.getNext(text.trim()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a trim at the end but no trim at the beginning? Shoulnd't be text be trimmed on method entry?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Thank you very much for your insightful feedback! I’ve reviewed all your comments and did my best to implement them."

private static String normalize(String input) {
return input.toLowerCase().replaceAll("[^a-z0-9 ]", "").trim();
}

@Override
public int compareTo(Abbreviation toCompare) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you described in the PR description should appear as JavaDoc here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added DocStrings to the code.

return distance <= 2;
}

private static String normalize(String input) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add JavaDoc

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@@ -103,7 +130,9 @@ public boolean equals(Object o) {
return false;
}
Abbreviation that = (Abbreviation) o;
return getName().equals(that.getName()) && getAbbreviation().equals(that.getAbbreviation()) && getShortestUniqueAbbreviation().equals(that.getShortestUniqueAbbreviation());
return Objects.equals(normalize(name), normalize(that.name)) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some JavaDoc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, there should be no normalization here!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted this change

@Test
void equalAbbrevationsWithFourComponentsAreAlsoCompareZero() {
Abbreviation abbreviation1 = new Abbreviation("Long Name", "L. N.", "LN");
Abbreviation abbreviation2 = new Abbreviation("Long Name", "L. N.", "LN");
assertEquals(0, abbreviation1.compareTo(abbreviation2));
}

@Test
void testCompareToWithFuzzyMatching() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please convert to parameterized Test

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a parameterized for compareTo

String normalizedOther = normalize(otherName);

int distance = LEVENSHTEIN.apply(normalizedThis, normalizedOther);
return distance <= 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a reasoning why "2" works.

@@ -1,12 +1,19 @@

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No additional empty line at the beginning

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@koppor
Copy link
Member

koppor commented Feb 10, 2025

To ease organizational workflows I have linked the pull-request to the issue with syntax as described in https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue

Linking a pull request to an issue using a keyword

You can link a pull request to an issue by using a supported keyword in the pull request's description or in a commit message. The pull request must be on the default branch.

  • close
  • closes
  • closed
  • fix
  • fixes
  • fixed
  • resolve
  • resolves
  • resolved

If you use a keyword to reference a pull request comment in another pull request, the pull requests will be linked. Merging the referencing pull request also closes the referenced pull request.

The syntax for closing keywords depends on whether the issue is in the same repository as the pull request.

Example:

  • Fixes #xyz links pull-request to issue. Merging the PR will close the Issue.
  • Fixes https://github.com/JabRef/jabref/issues/xyz links pull-request to issue. Merging the PR will close the Issue.
  • Fixes https://github.com/Koppor/jabref/issues/xyz links pull-request to issue. Merging the PR will close the Issue.
  • Fixes [#xyz](https://github.com/JabRef/jabref/issues/xyz) links pull-request to issue. Merging the PR will NOT close the Issue.

@calixtus
Copy link
Member

Please do not remove the 'Mandatory checks' from the PR description. I manually readded them. Note that we cannot accept any change if you don't guarantee to own the ip rights of to the code submitted and license them under the terms of the MIT license. In this case, check the checkbox. If not, we will reject your changes and close this PR.

@subhramit
Copy link
Member

@arraymahdi this is not a UI change, yet I see the box for "added screenshots in PR description" checked. We don't put the checks for the sake of formality. Please read what they say before you tick.

@subhramit subhramit changed the title Feature fuzzy matching for journal abbreviations #12467 Fuzzy matching for journal abbreviations Feb 11, 2025
@arraymahdi arraymahdi requested a review from koppor February 12, 2025 14:04
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

@koppor
Copy link
Member

koppor commented Feb 16, 2025

@arraymahdi Please, please, please, before requesting reviews fix the tests.

Copy link
Member

@koppor koppor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DO NOT REMOVE FILES

Some files of .vscode dir were removed. This is not OK without checking with the devs.

@koppor koppor marked this pull request as draft February 16, 2025 16:50
@subhramit
Copy link
Member

subhramit commented Feb 16, 2025

@arraymahdi this is not a UI change, yet I see the box for "added screenshots in PR description" checked. We don't put the checks for the sake of formality. Please read what they say before you tick.

I also don't see any changelog entry, yet you've ticked the check.
Copy above, again. These are basic etiquettes we expect anybody genuinely interested in contributing to follow - to "read" what's written - otherwise all this distracts us from the work done in the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fuzzy matching for journal abbreviations
5 participants