Improve KeywordList parsing to support escaped characters and nested structures #12888

krishnagjsForGit · 2025-04-06T03:00:57Z

What I changed
Implemented escape handling in KeywordList.parse(String, Character, Character) so that the keyword separator can be escaped with a backslash (\). This prevents keywords from being split incorrectly when the delimiter appears within the keyword itself.

Refactored the parsing logic for clarity, including renaming loop variables for better readability and intent.

Added test cases in KeywordListTest to cover escape scenarios such as:

Escaped delimiter: "one\,two" → ["one,two"]

Escaped backslash: "one\\two" → ["one\two"]

Mixed escaped and unescaped delimiters.
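For illustration, the escape rules listed above can be sketched as a character-by-character loop. This is a minimal sketch, not the code from this PR: the helper name `splitRespectingEscapes` is borrowed from the diff, but its body, the wrapper class `EscapeAwareSplit`, and the handling of a trailing lone backslash are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper class, used only to keep the sketch self-contained.
class EscapeAwareSplit {

    // Splits input at every unescaped delimiter. A backslash makes the next
    // character literal, so "\," yields a comma and "\\" yields a backslash.
    static List<String> splitRespectingEscapes(String input, char delimiter) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false;

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (escaped) {
                // Previous character was a backslash: take this one literally.
                current.append(c);
                escaped = false;
            } else if (c == '\\') {
                escaped = true;
            } else if (c == delimiter) {
                // Unescaped delimiter closes the current token.
                tokens.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (escaped) {
            // Assumption: a trailing lone backslash is kept as a literal backslash.
            current.append('\\');
        }
        tokens.add(current.toString());
        return tokens;
    }
}
```

With this sketch, the Java literal `"one\\,two"` (the text `one\,two`) yields the single token `one,two`, and `"a\\,b,c"` yields the two tokens `a,b` and `c`.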

Where the changes are
KeywordList.java: Modified the parse method to include escape handling logic using a character-by-character loop.

KeywordListTest.java: Added JUnit test cases to ensure parsing behaves correctly with escaped delimiters and backslashes.

Why I made these changes
Fixes issue #12810: Current parsing breaks when delimiters appear inside keywords (e.g., in MeSH terms). There was no way to escape the delimiter character, leading to incorrect keyword splitting.

Improves consistency: Provides a more robust, predictable behavior for keyword parsing, especially when users or importers use delimiters in keyword values.

Next Steps
Awaiting review and feedback.

If the community prefers a different escape character or behavior, I’m happy to adjust the implementation.

After this is merged, similar logic could be extracted or reused where keyword parsing happens elsewhere (e.g., importers).

Mandatory checks

I own the copyright of the code submitted and I license it under the MIT license
[/] Change in CHANGELOG.md described in a way that is understandable for the average user (if change is visible to the user)
Tests created for changes (if applicable)
Manually tested changed features in running JabRef (always required)
[/] Screenshots added in PR description (if change is visible to the user)
[/] Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

src/test/java/org/jabref/model/entry/KeywordListTest.java

krishnagjsForGit · 2025-04-06T04:03:33Z

Can someone help me with this? Is this a valid comment from the bot? I have never used the DisplayName annotation; I only changed the method name to be more comprehensible. Am I missing anything here?

Siedlerchr · 2025-04-06T08:05:35Z

Seems like a false positive from the bot @koppor

koppor · 2025-04-06T08:27:16Z

@krishnagjsForGit please change the PR title to contain text summarizing the fix.

krishnagjsForGit · 2025-04-06T10:55:16Z

> @krishnagjsForGit please change the PR title to contain text summarizing the fix.

Done. Is this fine, @koppor?

src/main/java/org/jabref/model/entry/KeywordList.java

…r-issue-12810

trag-bot · 2025-04-16T14:28:17Z

@trag-bot didn't find any issues in the code! ✅✨

koppor · 2025-04-16T14:56:23Z

src/main/java/org/jabref/model/entry/KeywordList.java

-            String chain = tok.nextToken();
-            Keyword chainRoot = Keyword.of(chain.split(hierarchicalDelimiter.toString()));
-            keywordList.add(chainRoot);
+        List<String> tokens = splitRespectingEscapes(keywordString, delimiter);


@Yubo-Cao What do you think about this? Maybe an ANTLR grammar would help here, too?

Yubo-Cao · 2025-04-20T20:03:11Z

I'm sorry for not getting back to you sooner. I have two suggestions:

1. ANTLR probably won't help in this use case, since the delimiters are dynamically specified.
2. I am concerned regarding the Unicode handling of the code, since Java char is finicky, and potentially you want to use codepoints instead of .charAt.
One way to address 2 and increase readability without necessarily using ANTLR is to dynamically construct a Regular Expression with a lookbehind and use only .split and .replace. Take a look at this: https://regex101.com/r/O1kLWF/1
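A minimal sketch of what such a lookbehind-based split could look like in Java follows. It is not the expression from the regex101 link; the class name `LookbehindSplit`, the method name `split`, and the use of `replaceAll` for unescaping are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, used only to keep the sketch self-contained.
class LookbehindSplit {

    // Splits at every delimiter that is not directly preceded by a backslash,
    // then removes the escaping backslashes from each token.
    static List<String> split(String input, String delimiter) {
        // Regex (?<!\\) is a negative lookbehind for a single backslash.
        // Pattern.quote makes the dynamically supplied delimiter literal.
        Pattern unescapedDelimiter =
                Pattern.compile("(?<!\\\\)" + Pattern.quote(delimiter));

        return Arrays.stream(unescapedDelimiter.split(input, -1))
                // Regex \\(.) matches a backslash plus the escaped character;
                // "$1" keeps only the escaped character.
                .map(token -> token.replaceAll("\\\\(.)", "$1"))
                .collect(Collectors.toList());
    }
}
```

Because the delimiter is passed as a `String` and the regex engine matches by code point, this form should also cope with delimiters outside the Basic Multilingual Plane. One known limitation of a plain lookbehind: it cannot distinguish an escaped backslash followed by a real delimiter (the text `a\\,b`) from an escaped delimiter, so that input would not be split. A character-by-character loop does not have this ambiguity.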

Krishna Kumar Parthasarathy and others added 8 commits April 6, 2025 08:19

Fix for issue JabRef#12810

ddffe9b

Remove NotNull Import

9997f30

Update test names

eaace58

Fix Styling Issues

664b93c

Update test names

9c00761

Update test names

de3a84b

Update test names

923acfc

Merge branch 'main' into fix-for-issue-12810

15e4aca