Add heuristic detection for delimiters in importing process#15521
Add heuristic detection for delimiters in importing process#15521mikezhanghaozhe wants to merge 5 commits intoJabRef:mainfrom
Conversation
Add KeywordList.parseImport with default delimiters (checking semicolons first, then commas). It detects the default delimiters and normalize keywords with user's customized delimiter in preference during import.
Review Summary by QodoAdd heuristic keyword delimiter detection for BibTeX import
WalkthroughsDescription• Implement heuristic keyword delimiter detection during BibTeX import - Prioritizes semicolon over comma to avoid false splits - Normalizes imported keywords to user's preferred delimiter • Add KeywordList.parseImport() method with delimiter detection logic • Update keyword parsing in BibtexParser to use new import method • Remove duplicate keywords during import process • Update test cases to reflect deduplication behavior Diagramflowchart LR
A["BibTeX Import"] --> B["BibtexParser.parseField"]
B --> C["KeywordList.parseImport"]
C --> D["detectImportDelimiter"]
D --> E["Check Delimiters in Priority Order"]
E --> F["Semicolon Found?"]
F -->|Yes| G["Use Semicolon"]
F -->|No| H["Use Comma"]
G --> I["Normalize to User Delimiter"]
H --> I
I --> J["Store Keywords"]
File Changes1. jablib/src/main/java/org/jabref/logic/importer/fileformat/BibtexParser.java
|
Code Review by Qodo
|
| // Default delimiters to try when importing keywords, in priority order. | ||
| // It overrides the delimiter stored in preference. | ||
| private final List<Character> IMPORT_KEYWORD_DELIMITERS = List.of(';', ','); | ||
|
|
There was a problem hiding this comment.
1. import_keyword_delimiters not static final 📘 Rule violation ⚙ Maintainability
IMPORT_KEYWORD_DELIMITERS is an instance field named like a constant (UPPER_SNAKE_CASE), which violates existing naming conventions and may trigger style/check rules. It should be a `private static final` constant (or renamed to lowerCamelCase if intentionally instance-scoped).
Agent Prompt
## Issue description
`IMPORT_KEYWORD_DELIMITERS` is named like a constant but is not `static final`, which breaks the project's naming/style conventions.
## Issue Context
The surrounding fields in `BibtexParser` use `private static final ...` for constants; introducing an instance-level UPPER_SNAKE_CASE field is inconsistent and may fail style checks.
## Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/importer/fileformat/BibtexParser.java[112-115]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
|
|
||
| ### Added | ||
|
|
||
| - We added support for heuristic keywords delimiter detection: special delimiters such as ";" will be replaced with "," in the importing process. [#12974](https://github.com/JabRef/jabref/issues/12974) |
There was a problem hiding this comment.
2. Changelog claims comma replacement 📘 Rule violation ≡ Correctness
The new CHANGELOG entry states that ";" will be replaced with "," during import, but the implementation re-serializes keywords using the user's configured separator, which is not necessarily a comma. This makes the release note misleading for end users.
Agent Prompt
## Issue description
The CHANGELOG entry incorrectly claims that semicolons are replaced with commas during import.
## Issue Context
The implementation detects import delimiters heuristically but writes keywords back using the user's configured keyword separator, not always a comma.
## Fix Focus Areas
- CHANGELOG.md[14-14]
- jablib/src/main/java/org/jabref/logic/importer/fileformat/BibtexParser.java[810-815]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| /// Parses the keyword list using heuristic delimiter detection for the importing process. | ||
| /// Tries each delimiter in the provided list in priority order; if none found, falls back to comma. | ||
| /// | ||
| /// @param keywordString a String of keywordChains | ||
| /// @param delimiters a List of delimiters used for separating the keywords in the importing process | ||
| /// @return a parsed list containing the keywordChains | ||
| public static KeywordList parseImport(@NonNull String keywordString, @NonNull List<Character> delimiters) { | ||
| Character delimiter = detectImportDelimiter(keywordString, delimiters); | ||
| return parse(keywordString, delimiter); | ||
| } | ||
|
|
||
| /// Detects which delimiter to use by checking the keyword string for each candidate in priority order. | ||
| /// | ||
| /// @param keywordString a String of keywordChains | ||
| /// @param delimiters a List of delimiters used for separating the keywords in the importing process | ||
| /// @return a character representing the delimiter to use | ||
| static Character detectImportDelimiter(@NonNull String keywordString, @NonNull List<Character> delimiters) { | ||
| for (Character delimiter : delimiters) { | ||
| if (keywordString.indexOf(delimiter) >= 0) { | ||
| return delimiter; | ||
| } | ||
| } | ||
| // Falls back to comma if none of the delimiters are found. | ||
| return ','; | ||
| } |
There was a problem hiding this comment.
3. Missing openfasttrace requirement entry 📘 Rule violation ⚙ Maintainability
This PR introduces a user-visible change to BibTeX keyword import behavior (heuristic delimiter detection and semicolon splitting) but does not add a corresponding OpenFastTrace req~...~1 requirement in docs/requirements/. This breaks requirements tracing for a significant behavior change.
Agent Prompt
## Issue description
A significant import behavior change was implemented without adding a corresponding OpenFastTrace requirement entry under `docs/requirements/`.
## Issue Context
The PR adds heuristic delimiter detection for keyword import (`parseImport`/`detectImportDelimiter`) and new tests for semicolon-separated keywords.
## Fix Focus Areas
- jablib/src/main/java/org/jabref/model/entry/KeywordList.java[91-115]
- jablib/src/test/java/org/jabref/logic/importer/fileformat/BibtexParserTest.java[2181-2193]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| if (StandardField.KEYWORDS == field) { // If there are no duplicated keywords fields encountered yet. | ||
| // Import the keywordList with default, heuristic delimiter. | ||
| KeywordList parsed = KeywordList.parseImport(content, IMPORT_KEYWORD_DELIMITERS); | ||
| // Re-serialize with the user's preference delimiter. | ||
| entry.setField(field, parsed.getAsString( | ||
| importFormatPreferences.bibEntryPreferences().getKeywordSeparator())); | ||
| } else { |
There was a problem hiding this comment.
4. Comma keywords get split 🐞 Bug ≡ Correctness
BibtexParser re-serializes imported keywords using the user’s preferred delimiter without escaping delimiter characters inside a keyword, so a keyword containing ',' will be split into multiple keywords on later reads. This silently corrupts keyword data when importing semicolon-delimited keywords that contain commas and the user preference is ','.
Agent Prompt
## Issue description
When importing keywords, the code re-serializes `KeywordList` using `getAsString(preferenceDelimiter)`, which does not escape delimiter characters inside individual keywords. This corrupts data for cases like `keywordOne, keywordTwo; keywordThree` with preference delimiter `,`.
## Issue Context
- `KeywordList.parse(...)` supports escaping via `Keyword.DEFAULT_ESCAPE_SYMBOL`.
- `KeywordList.serialize(...)` already implements correct escaping for delimiters and hierarchical delimiters.
- `BibEntry` later parses the stored keywords field using the preference delimiter.
## Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/importer/fileformat/BibtexParser.java[769-818]
- jablib/src/main/java/org/jabref/model/entry/KeywordList.java[117-179]
- jablib/src/main/java/org/jabref/model/entry/BibEntry.java[785-804]
## What to change
- Ensure the string written into `StandardField.KEYWORDS` escapes the *output* delimiter when it occurs inside a keyword node.
- Prefer re-serializing via an escaping-aware routine (e.g., add a `KeywordList.serializeWithSpaces(...)` or adjust usage to `KeywordList.serialize(...)` and ensure spacing expectations are handled consistently).
- Add a regression test covering import of semicolon-separated keywords containing commas with preference delimiter `,` (expect embedded commas to remain part of the keyword, e.g. stored as `keywordOne\, keywordTwo, keywordThree` or equivalent escaped form).
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
|
JUnit tests of You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide. |
|
Your pull request conflicts with the target branch. Please merge with your code. For a step-by-step guide to resolve merge conflicts, see https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/addressing-merge-conflicts/resolving-a-merge-conflict-using-the-command-line. |
Related issues and pull requests
Closes #12974
PR Description
Delimiters are now separate, where default delimiters
IMPORT_KEYWORD_DELIMITERSare used in the importing process and user's preferred delimiters are used for displaying. Thus, the default delimiters can be flexible and accommodate";".Things to note:
[";", ","]. This list is flexible and can be connected/replaced by user's preferred delimiters later if needed.";"is detected,","will no longer be considered as a delimiter in the importing process. This addresses the concern of having","as a valid part of the keyword. #12974 Comment""" @Article{, Keywords={asdf,asdf,asdf}, } """inBibtexParserTestwill be deduplicated during importing process. The previous unit test preserved the duplicated keywords.Steps to test
";".","in this case).Checklist
CHANGELOG.mdin a way that can be understood by the average user (if change is visible to the user)