Implimented arXivId Parsing for PDF with arXivId #12335

ar-rana · 2024-12-24T13:15:07Z

Used the 'parse' method in ArXivIdentifier to get arXivId and added a testcase for the same using the link give in the issue.
Closes #12000
Closes https://github.com/koppor/jabref/issues/47".

Mandatory checks

I own the copyright of the code submitted and I licence it under the MIT license
Change in CHANGELOG.md described in a way that is understandable for the average user (if change is visible to the user)
Tests created for changes (if applicable)
Manually tested changed features in running JabRef (always required)
Screenshots added in PR description (for UI changes)
Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

github-actions

Your code currently does not meet JabRef's code guidelines.
We use Checkstyle to identify issues.
Please carefully follow the setup guide for the codestyle.
Afterwards, please run checkstyle locally and fix the issues.

In case of issues with the import order, double check that you activated Auto Import.
You can trigger fixing imports by pressing Ctrl+Alt+O to trigger Optimize Imports.

ar-rana · 2024-12-24T16:05:20Z

@koppor could you please review this PR and suggest changes where I am wrong.

Siedlerchr · 2024-12-24T18:53:01Z

Thanks for your contribution. Please fix the check style issues first

github-actions

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

ar-rana · 2024-12-25T07:34:19Z

Hello maintainers, could you please review this PR, I have fixed the previous issues.
The extra changes that are being reflected are because I merged the latest changes from upstream, please ignore them they will not be there in the actual PR.

Siedlerchr · 2024-12-25T17:10:37Z

Codewise looks good to me. You have accidentally commited the csl styles submodules, see https://devdocs.jabref.org/code-howtos/faq.html#the-problem for a solution

InAnYan · 2024-12-25T17:44:26Z

Hmm, I tried to import these papers:

https://arxiv.org/abs/2406.14319
https://arxiv.org/abs/2412.06769

But JabRef didn't import it properly, and arXiv was null all the time

ar-rana · 2024-12-27T14:26:29Z

I have changed the getArXivId method to this in my local branch:

private String getArXivId(String arXivId) {
        if (arXivId == null) {
            arXivId = ArXivIdentifier.parse(curString).map(ArXivIdentifier::asString).orElse(null);
            if (arXivId != null) {
                if (curString.length() > arXivId.length() + 7) {
                    curString = curString.substring(arXivId.length() + 7);
                    extractYear();
                }
                return arXivId;
            }
        }
        return arXivId;
    }

and modified the ArXivIdentifier.parse method a little by altering the identifier to this String identifier = value.split(" ")[0]; at line 41.

this fixes the null arXiv problem that @InAnYan was encontering when externally importing the papers, but for some reason it still does not import the titles correctly except for the paper that was mentioned in the issue.

github-actions

JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.

You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.

InAnYan · 2024-12-28T14:54:01Z

Hi, ar-rana! Thanks for working on this PR!

As holidays come soon, maintainers could be too busy these weeks, so don't worry if we give feedback too late

Siedlerchr · 2025-01-01T20:49:14Z

src/main/java/org/jabref/logic/importer/fileformat/PdfContentImporter.java

@@ -609,11 +609,15 @@ private String getDoi(String doi) {

    private String getArXivId(String arXivId) {
        if (arXivId == null) {
-            arXivId = ArXivIdentifier.parse(curString).map(ArXivIdentifier::asString).orElse(null);
+            String arXiv = curString.split(" ")[0];


This could lead to a null pointer if there is no whitespace and you access the index

Hello Siedlerchr, I tested this method with empty/non-empty strings with/without whitespaces and it would only give a null pointer if the curString is null which does not seem to be the case here. So should I add a change here or leave it, as getDoi also work the same way

HoussemNasri · 2025-01-05T21:56:47Z

src/main/java/org/jabref/logic/importer/fileformat/PdfContentImporter.java

+            String arXiv = curString.split(" ")[0];
+            arXivId = ArXivIdentifier.parse(arXiv).map(ArXivIdentifier::asString).orElse(null);
+            if (arXivId != null) {
+                if (curString.length() > arXivId.length() + 7) {


What does 7 stand for here? If possible, define a constant with a good name for the value 7 in this context.

Also, consider reducing the level of nesting in the code by returning early.

7 is for the 'arxiv:' prefix in the arxiv string,

If possible, define a constant with a good name for the value 7 in this context.

Also, consider reducing the level of nesting in the code by returning early.

sure, I will work on that.

In that case, ARXIV_PREFIX_LENGTH = "arxiv:".length() would be a good name.

HoussemNasri · 2025-01-06T18:02:12Z

src/main/java/org/jabref/logic/importer/fileformat/PdfContentImporter.java

@@ -608,18 +608,16 @@ private String getDoi(String doi) {
    }

    private String getArXivId(String arXivId) {
+        final int ARXIV_PREFIX_LENGTH = "arxiv:".length();


Let's define it as static and move it outside the method to avoid reallocating the same string object everytime the method is called.

private static final ...

should i move it to the ArXivIdentifier class instead, i think it might be more suited there

In general (and in OOP more specifically) we want to keep the data as close as possible to the code that uses it, if the constant is used there then okay otherwise let's keep it in the class where it is used.

sure, thank you

HoussemNasri · 2025-01-06T18:07:08Z

src/main/java/org/jabref/logic/importer/fileformat/PdfContentImporter.java

-                    proceedToNextNonEmptyLine();
-                }
-                return arXivId;
+            if (arXivId != null && curString.length() > arXivId.length() + ARXIV_PREFIX_LENGTH) {


This is not what I meant by reduced nesting, Please read this: https://szymonkrajewski.pl/why-should-you-return-early/

private String getArXivId(String arXivId) { if (arXivId != null) { return arXivId; } String arXiv = curString.split(" ")[0]; arXivId = ArXivIdentifier.parse(arXiv).map(ArXivIdentifier::asString).orElse(null); if (arXivId == null || curString.length() < arXivId.length() + ARXIV_PREFIX_LENGTH) { return arXivId; } // The arxiv string also contains the year curString = curString.substring(arXivId.length() + ARXIV_PREFIX_LENGTH); extractYear(); curString = ""; proceedToNextNonEmptyLine(); return arXivId; }

This is what I wrote based on my understanding of the article, please share you feedback

Yes, great, it looks better!

HoussemNasri · 2025-01-07T14:31:10Z

src/test/java/org/jabref/logic/importer/fileformat/PdfContentImporterTest.java

+                .withField((StandardField.KEYWORDS), "Test Automation Artificial Intelligence AI-assisted Test Automation Grey Literature Automated Test Generation Self-Healing Test Scripts");
+
+        String firstPageContent = """
+                arXiv:2408.06224v1 [cs.SE] 12 Aug 2024


Is it guaranteed that the arXiv id will be on the first line? If not we need to add at least two more tests.

ArXiv id in a middle line (the line begins with the id)

ArXiv ID in a middle line (the ID is somewhere in the middle of that line)

You can use JUnit 5 Parameterized Tests to reduce verbosity.

hello houssem, I checked multiple arXiv papers(10) and all of them had their arxiv string at the last line, here you can see it at the top(sorry for the oversight, will fix that) but moving it at the last also passes the test, and since all these papers have the same format i think the arxiv string would mostly be at the end

HoussemNasri · 2025-01-07T14:36:04Z

src/main/java/org/jabref/logic/importer/fileformat/PdfContentImporter.java

        // year is a class variable as the method extractYear() uses it;
        String publisher = null;

        EntryType type = StandardEntryType.InProceedings;
        if (curString.length() > 4) {
            // special case: possibly conference as first line on the page
+            arXivId = getArXivId(null);


This is an example why comments are to be avoided as much as possible, this line compiles fine, runs fine but the comment above it is misguiding. Please move comment to its correct location.

Implimented arXivId Parsing forPDF with arXivId

220ddac

github-actions bot reviewed Dec 24, 2024

View reviewed changes

ar-rana marked this pull request as draft December 24, 2024 15:59

added Optional parameter

32e9867

github-actions bot reviewed Dec 25, 2024

View reviewed changes

ar-rana added 2 commits December 25, 2024 11:48

Merge branch 'main' into feature/arXivId

9c80e04

merged fixes

89de378

ar-rana marked this pull request as ready for review December 25, 2024 13:20

Siedlerchr added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Dec 25, 2024

removed csl-styles

28755cf

fixed null arxiv issue on external imports

06b6bb0

github-actions bot reviewed Dec 28, 2024

View reviewed changes

Improved getArxivId Implementation

457bb3f

Siedlerchr reviewed Jan 1, 2025

View reviewed changes

HoussemNasri reviewed Jan 5, 2025

View reviewed changes

reduced nesting and added arxiv constant

a653de5

HoussemNasri reviewed Jan 6, 2025

View reviewed changes

reduced nesting

06c771b

HoussemNasri reviewed Jan 7, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implimented arXivId Parsing for PDF with arXivId #12335

Implimented arXivId Parsing for PDF with arXivId #12335

ar-rana commented Dec 24, 2024

github-actions bot left a comment

ar-rana commented Dec 24, 2024 •

edited

Loading

Siedlerchr commented Dec 24, 2024

github-actions bot left a comment

ar-rana commented Dec 25, 2024 •

edited

Loading

Siedlerchr commented Dec 25, 2024

InAnYan commented Dec 25, 2024 •

edited

Loading

ar-rana commented Dec 27, 2024

github-actions bot left a comment

InAnYan commented Dec 28, 2024

Siedlerchr Jan 1, 2025

ar-rana Jan 2, 2025 •

edited

Loading

HoussemNasri Jan 5, 2025

ar-rana Jan 6, 2025

HoussemNasri Jan 6, 2025

HoussemNasri Jan 6, 2025

ar-rana Jan 6, 2025

HoussemNasri Jan 6, 2025

ar-rana Jan 6, 2025

HoussemNasri Jan 6, 2025

ar-rana Jan 6, 2025

HoussemNasri Jan 6, 2025

HoussemNasri Jan 7, 2025

ar-rana Jan 7, 2025

HoussemNasri Jan 7, 2025

ar-rana Jan 7, 2025

Implimented arXivId Parsing for PDF with arXivId #12335

Are you sure you want to change the base?

Implimented arXivId Parsing for PDF with arXivId #12335

Conversation

ar-rana commented Dec 24, 2024

Mandatory checks

github-actions bot left a comment

Choose a reason for hiding this comment

ar-rana commented Dec 24, 2024 • edited Loading

Siedlerchr commented Dec 24, 2024

github-actions bot left a comment

Choose a reason for hiding this comment

ar-rana commented Dec 25, 2024 • edited Loading

Siedlerchr commented Dec 25, 2024

InAnYan commented Dec 25, 2024 • edited Loading

ar-rana commented Dec 27, 2024

github-actions bot left a comment

Choose a reason for hiding this comment

InAnYan commented Dec 28, 2024

Choose a reason for hiding this comment

ar-rana Jan 2, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ar-rana commented Dec 24, 2024 •

edited

Loading

ar-rana commented Dec 25, 2024 •

edited

Loading

InAnYan commented Dec 25, 2024 •

edited

Loading

ar-rana Jan 2, 2025 •

edited

Loading