Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fuzzy matching for journal abbreviations #12477

Draft
wants to merge 12 commits into
base: main
Choose a base branch
from
6 changes: 0 additions & 6 deletions .vscode/extensions.json

This file was deleted.

7 changes: 0 additions & 7 deletions .vscode/ltex.dictionary.en-US.txt

This file was deleted.

6 changes: 0 additions & 6 deletions .vscode/settings.json

This file was deleted.

31 changes: 30 additions & 1 deletion src/main/java/org/jabref/logic/journals/Abbreviation.java
Original file line number Diff line number Diff line change
@@ -1,12 +1,19 @@

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No additional empty line at the beginning

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

package org.jabref.logic.journals;

import java.io.Serial;
import java.io.Serializable;
import java.util.Objects;

import org.apache.commons.text.similarity.LevenshteinDistance;

public class Abbreviation implements Comparable<Abbreviation>, Serializable {

@Serial
private static final long serialVersionUID = 1;

private static final LevenshteinDistance LEVENSHTEIN = LevenshteinDistance.getDefaultInstance();

private transient String name;
private final String abbreviation;
private transient String dotlessAbbreviation;
Expand Down Expand Up @@ -56,13 +63,33 @@ public String getDotlessAbbreviation() {
return this.dotlessAbbreviation;
}

public boolean isSimilar(String otherName) {
String normalizedThis = normalize(this.name);
String normalizedOther = normalize(otherName);

int distance = LEVENSHTEIN.apply(normalizedThis, normalizedOther);
return distance <= 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a reasoning why "2" works.

}

private static String normalize(String input) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add JavaDoc

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

return input.toLowerCase().replaceAll("[^a-z0-9 ]", "").trim();
}

@Override
public int compareTo(Abbreviation toCompare) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you described in the PR description should appear as JavaDoc here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added DocStrings to the code.

if (isSimilar(toCompare.getName())) {
return 0;
}

int nameComparison = getName().compareTo(toCompare.getName());
if (nameComparison != 0) {
return nameComparison;
}

if (isSimilar(toCompare.getAbbreviation())) {
return 0;
}

int abbreviationComparison = getAbbreviation().compareTo(toCompare.getAbbreviation());
if (abbreviationComparison != 0) {
return abbreviationComparison;
Expand Down Expand Up @@ -103,7 +130,9 @@ public boolean equals(Object o) {
return false;
}
Abbreviation that = (Abbreviation) o;
return getName().equals(that.getName()) && getAbbreviation().equals(that.getAbbreviation()) && getShortestUniqueAbbreviation().equals(that.getShortestUniqueAbbreviation());
return Objects.equals(normalize(name), normalize(that.name)) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some JavaDoc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, there should be no normalization here!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted this change

Objects.equals(normalize(abbreviation), normalize(that.abbreviation)) &&
Objects.equals(normalize(shortestUniqueAbbreviation), normalize(that.shortestUniqueAbbreviation));
}

@Override
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -152,7 +152,7 @@ public void addCustomAbbreviations(Collection<Abbreviation> abbreviationsToAdd)
}

public Optional<String> getNextAbbreviation(String text) {
return get(text).map(abbreviation -> abbreviation.getNext(text));
return get(text).map(abbreviation -> abbreviation.getNext(text.trim()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a trim at the end but no trim at the beginning? Shoulnd't be text be trimmed on method entry?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Thank you very much for your insightful feedback! I’ve reviewed all your comments and did my best to implement them."

}

public Optional<String> getDefaultAbbreviation(String text) {
Expand Down
37 changes: 37 additions & 0 deletions src/test/java/org/jabref/logic/journals/AbbreviationTest.java
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNotEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

class AbbreviationTest {

Expand Down Expand Up @@ -111,10 +112,46 @@ void equals() {
assertNotEquals("String", abbreviation);
}

@Test
void testSlightDifferences() {
assertTrue(new Abbreviation("Long Name", "L. N.").isSimilar("Longn Name"));
}

@Test
void testMissingLetter() {
assertTrue(new Abbreviation("Long Name", "L. N.").isSimilar("Long ame"));
}

@Test
void testPunctuationDifferences() {
assertTrue(new Abbreviation("Long Name", "L. N.").isSimilar("Long, Name"));
}

@Test
void testCaseDifferences() {
assertTrue(new Abbreviation("Long Name", "L. N.").isSimilar("LONG NAME"));
}

@Test
void equalAbbrevationsWithFourComponentsAreAlsoCompareZero() {
Abbreviation abbreviation1 = new Abbreviation("Long Name", "L. N.", "LN");
Abbreviation abbreviation2 = new Abbreviation("Long Name", "L. N.", "LN");
assertEquals(0, abbreviation1.compareTo(abbreviation2));
}

@Test
void testCompareToWithFuzzyMatching() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please convert to parameterized Test

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a parameterized for compareTo

Abbreviation abbreviation1 = new Abbreviation("Long Name", "L. N.");
Abbreviation abbreviation2 = new Abbreviation("Long Name", "L. N.");
assertEquals(0, abbreviation1.compareTo(abbreviation2));

abbreviation2 = new Abbreviation("Long Name", "L. N. ");
assertEquals(0, abbreviation1.compareTo(abbreviation2));

abbreviation2 = new Abbreviation("Long Name", "L. N");
assertEquals(0, abbreviation1.compareTo(abbreviation2));

abbreviation2 = new Abbreviation("Short Name", "S. N.");
assertNotEquals(0, abbreviation1.compareTo(abbreviation2));
}
}
Loading