
Fix #12271: Integrity checker for year, location, and page numbers in booktitle #15465

Open
Chiragsd13 wants to merge 10 commits into JabRef:main from Chiragsd13:fix/12271-booktitle-integrity-checker

@Chiragsd13 Chiragsd13 commented Apr 1, 2026

Related issues and pull requests

Closes #12271

PR Description

Enhances BooktitleChecker with three new integrity checks that warn when a booktitle field contains data that belongs in dedicated fields: a 4-digit year (1000–2999), a country name from a hard-coded UN-recognised list, or an explicit page-number pattern (pp. X, pages X). A new Countries.java utility class holds the country set and builds a single pre-compiled regex alternation at class-load time for efficient matching.
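
As a rough illustration of the class-load-time alternation described above, the following sketch builds one regex from a set of names and compiles it once in a static initializer. The class name `CountryMatcherSketch`, the three-entry set, and the `\p{Alnum}` lookarounds are assumptions made for this example; the PR's `Countries` class holds the full list and may differ in detail.

```java
import java.util.Set;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical stand-in for the PR's Countries/BooktitleChecker pair.
final class CountryMatcherSketch {
    // Tiny illustrative subset; the PR's list is much longer.
    static final Set<String> COUNTRY_NAMES = Set.of("norway", "austria", "singapore");

    // Built once when the class is initialised, then reused for every call.
    static final Predicate<String> CONTAINS_COUNTRY;

    static {
        String alternation = COUNTRY_NAMES.stream()
                .map(Pattern::quote) // escape any regex metacharacters in a name
                .collect(Collectors.joining("|"));
        CONTAINS_COUNTRY = Pattern
                .compile("(?<!\\p{Alnum})(?:" + alternation + ")(?!\\p{Alnum})",
                        Pattern.CASE_INSENSITIVE)
                .asPredicate(); // predicate is true if the pattern is found anywhere
    }

    static boolean containsCountry(String booktitle) {
        return CONTAINS_COUNTRY.test(booktitle);
    }
}
```

Compiling in a static initializer means the cost of building the alternation is paid once per JVM rather than once per checked entry.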

Steps to test

  1. Open JabRef and create a new library.
  2. Add an @inproceedings entry and set the booktitle field to:
    • 2015 {IEEE} International Conference on Digital Signal Processing, {DSP} 2015, Singapore → flagged for year and location
    • European Conference on Circuit Theory and Design, {ECCTD} 2015, Trondheim, Norway → flagged for year and location
    • Advances in Neural Information Processing Systems, pp. 1234-1242 → flagged for page numbers
    • International Conference on Machine Learning → no warning
  3. Run Quality → Check Integrity and verify the expected warnings appear.

Checklist

  • I own the copyright of the code submitted and I license it under the MIT license
  • I manually tested my changes in running JabRef (always required)
  • I added JUnit tests for changes (if applicable)
  • I added screenshots in the PR description (if change is visible to the user)
  • I added a screenshot in the PR description showing a library with a single entry with me as author and as title the issue number
  • [/] I described the change in CHANGELOG.md in a way that can be understood by the average user (if change is visible to the user)
  • [/] I checked the user documentation for up to dateness and submitted a pull request to our user documentation repository

Enhance BooktitleChecker to flag booktitle values that contain:
- A 4-digit year (e.g. 2015)
- A country name (e.g. Norway, Austria, Singapore)
- Explicit page-number patterns (e.g. "pp. 1–10", "pages 3-7")

Add Countries.java with a hard-coded set of all UN-recognised country
names used for the country-presence check.  The set is built as a single
pre-compiled regex alternation so the pattern is compiled only once.

Update BooktitleCheckerTest with parameterised tests covering all three
new integrity rules and the blank-value / valid-value edge cases.

Closes JabRef#12271
github-actions bot added the "good second issue" label Apr 1, 2026
github-actions bot added the "status: changes-required" label Apr 1, 2026
@Chiragsd13 Chiragsd13 marked this pull request as ready for review April 1, 2026 04:32
@qodo-free-for-open-source-projects
Contributor

Review Summary by Qodo

Enhance BooktitleChecker with year, location, and page detection

✨ Enhancement


Walkthroughs

Description
• Adds three new integrity checks to BooktitleChecker for year, location, and page numbers
• Detects 4-digit years (1000–2999) in booktitle fields
• Detects country names from UN-recognized list using pre-compiled regex
• Detects explicit page-number patterns (pp., p., pages keywords)
• Creates Countries utility class with hard-coded country name set
• Adds comprehensive parameterized tests covering all new checks
• Adds localization keys for new warning messages
Diagram
flowchart LR
  BC["BooktitleChecker"]
  YC["Year Check<br/>1000-2999"]
  CC["Country Check<br/>UN list"]
  PC["Page Check<br/>pp/pages"]
  CO["Countries<br/>utility class"]
  L10N["Localization<br/>keys"]
  
  BC --> YC
  BC --> CC
  BC --> PC
  CC --> CO
  BC --> L10N


File Changes

1. jablib/src/main/java/org/jabref/logic/integrity/BooktitleChecker.java ✨ Enhancement +33/-0

Add year, country, and page detection logic

• Adds three static Predicate fields for year, country, and page-number detection
• Implements year detection using regex pattern for 4-digit numbers (1000–2999)
• Implements country detection using pre-compiled regex from Countries class
• Implements page-number detection for patterns like "pp.", "p.", "pages"
• Adds three new integrity check conditions in checkValue method
• Returns localized warning messages for each detected issue
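
The page-number regex itself is not quoted anywhere in this thread, so the following is only a guess at what such a predicate could look like for the "pp.", "p.", and "pages" keywords. The class name `PageNumberSketch` and the pattern are assumptions, not the PR's code.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical sketch; the PR's actual page-number regex may differ.
final class PageNumberSketch {
    // "p.", "pp.", "page" or "pages", optional whitespace, then at least one digit.
    static final Predicate<String> CONTAINS_PAGES =
            Pattern.compile("\\b(?:pp?\\.|pages?)\\s*\\d+", Pattern.CASE_INSENSITIVE)
                   .asPredicate();

    static boolean containsPages(String booktitle) {
        return CONTAINS_PAGES.test(booktitle);
    }
}
```

Requiring a digit after the keyword keeps titles that merely contain the word "pages" from being flagged.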



2. jablib/src/main/java/org/jabref/logic/integrity/Countries.java ✨ Enhancement +61/-0

Create Countries utility class with country names

• New utility class holding hard-coded set of UN-recognized country names
• Contains 195+ country names stored in lower-case for case-insensitive matching
• Includes both standard country names and common aliases (e.g., "czechia" and "czech republic")
• Designed to be used by BooktitleChecker for location detection in booktitle fields



3. jablib/src/test/java/org/jabref/logic/integrity/BooktitleCheckerTest.java 🧪 Tests +66/-3

Add comprehensive tests for new integrity checks

• Reorganizes tests with clear section comments for existing and new checks
• Adds year detection tests covering middle, start positions, and no-year cases
• Adds country detection tests for Austria, Singapore, and no-country cases
• Adds page-number detection tests for "pp." and "pages" patterns
• Removes year from existing test cases to isolate the "conference on" check
• Fixes typo in test method name from "DoesNotAccepts" to "DoesNotAccept"



4. jablib/src/main/resources/l10n/JabRef_en.properties Localization +3/-0

Add localization strings for new checks

• Adds three new localization keys for booktitle integrity warnings
• "booktitle should not contain a location" for country detection
• "booktitle should not contain a year" for year detection
• "booktitle should not contain page numbers" for page-number detection




@qodo-free-for-open-source-projects
Contributor

qodo-free-for-open-source-projects bot commented Apr 1, 2026

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (1) 📎 Requirement gaps (0) 🎨 UX Issues (0)



Action required

1. assertNotEquals(Optional.empty()) used 📘 Rule violation ≡ Correctness
Description
The new/modified tests only assert that an Optional is non-empty (via
assertNotEquals(Optional.empty(), ...)) instead of asserting the exact expected Optional
value/message. This weak predicate-style check can miss regressions where the checker returns the
wrong warning text.
Code

jablib/src/test/java/org/jabref/logic/integrity/BooktitleCheckerTest.java[R26-87]

+    void booktitleDoesNotAcceptIfItEndsWithConferenceOn() {
+        assertNotEquals(Optional.empty(), checker.checkValue("Digital Information and Communication Technology and its Applications (DICTAP), Fourth International Conference on"));
     }
     @Test
     void booktitleIsBlank() {
         assertEquals(Optional.empty(), checker.checkValue(" "));
     }
+
+    // ------------------------------------------------------------------
+    // Year detection
+    // ------------------------------------------------------------------
+
+    @Test
+    void booktitleFlagsYearInMiddle() {
+        // Example from the issue: year embedded inside a booktitle
+        assertNotEquals(Optional.empty(), checker.checkValue("European Conference on Circuit Theory and Design, {ECCTD} 2015, Trondheim, Norway"));
+    }
+
+    @Test
+    void booktitleFlagsYearAtStart() {
+        assertNotEquals(Optional.empty(), checker.checkValue("2015 {IEEE} International Conference on Digital Signal Processing"));
+    }
+
+    @Test
+    void booktitleAcceptsWhenNoYear() {
+        assertEquals(Optional.empty(), checker.checkValue("International Conference on Software Engineering"));
+    }
+
+    // ------------------------------------------------------------------
+    // Location (country) detection
+    // ------------------------------------------------------------------
+
+    @Test
+    void booktitleFlagsCountryName() {
+        // "Austria" is a country and should be flagged
+        assertNotEquals(Optional.empty(), checker.checkValue("Service-Oriented Computing, Fifth International Conference, Vienna, Austria, Proceedings"));
+    }
+
+    @Test
+    void booktitleFlagsCountryNameSingapore() {
+        assertNotEquals(Optional.empty(), checker.checkValue("{IEEE} International Conference on Digital Signal Processing, Singapore, Proceedings"));
+    }
+
+    @Test
+    void booktitleAcceptsWhenNoCountry() {
+        assertEquals(Optional.empty(), checker.checkValue("International Conference on Machine Learning Proceedings"));
+    }
+
+    // ------------------------------------------------------------------
+    // Page-number detection
+    // ------------------------------------------------------------------
+
+    @Test
+    void booktitleFlagsPagesPattern() {
+        assertNotEquals(Optional.empty(), checker.checkValue("Advances in Neural Information Processing Systems, pp. 1234-1242"));
+    }
+
+    @Test
+    void booktitleFlagsPagesKeyword() {
+        assertNotEquals(Optional.empty(), checker.checkValue("Advances in Neural Information Processing Systems, pages 1234-1242"));
+    }
Evidence
PR Compliance ID 28 requires asserting exact expected values/structures rather than weak predicate
checks; multiple newly added/modified assertions only verify that the Optional is non-empty. PR
Compliance ID 20 also discourages weak boolean/predicate-style assertions in tests when content
assertions are possible.

AGENTS.md
jablib/src/test/java/org/jabref/logic/integrity/BooktitleCheckerTest.java[26-27]
jablib/src/test/java/org/jabref/logic/integrity/BooktitleCheckerTest.java[40-48]
jablib/src/test/java/org/jabref/logic/integrity/BooktitleCheckerTest.java[60-68]
jablib/src/test/java/org/jabref/logic/integrity/BooktitleCheckerTest.java[79-87]
Best Practice: Learned patterns

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Tests use weak assertions like `assertNotEquals(Optional.empty(), ...)` which only check presence, not the exact expected warning message.
## Issue Context
`BooktitleChecker#checkValue` returns an `Optional<String>` containing a specific localized warning string. Tests should verify the exact returned `Optional` value (e.g., `Optional.of("booktitle should not contain a year")`) so regressions in message selection/order are caught.
## Fix Focus Areas
- jablib/src/test/java/org/jabref/logic/integrity/BooktitleCheckerTest.java[26-27]
- jablib/src/test/java/org/jabref/logic/integrity/BooktitleCheckerTest.java[40-48]
- jablib/src/test/java/org/jabref/logic/integrity/BooktitleCheckerTest.java[60-68]
- jablib/src/test/java/org/jabref/logic/integrity/BooktitleCheckerTest.java[79-87]
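
To make the difference between the two assertion styles concrete, here is a small self-contained sketch that avoids a JUnit dependency. `YearCheckSketch`, `weakCheck`, and `exactCheck` are hypothetical names introduced for illustration; only the message text and the `Optional<String>` return type come from the review above.

```java
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical stand-in for the checker under test.
final class YearCheckSketch {
    private static final Pattern YEAR = Pattern.compile("\\b[12]\\d{3}\\b");

    static Optional<String> checkValue(String value) {
        if (value == null || value.isBlank()) {
            return Optional.empty();
        }
        return YEAR.matcher(value).find()
                ? Optional.of("booktitle should not contain a year")
                : Optional.empty();
    }

    // Weak: true for any non-empty result, even one carrying the wrong message.
    static boolean weakCheck(String input) {
        return !Optional.empty().equals(checkValue(input));
    }

    // Strong: true only when exactly the expected message is returned.
    static boolean exactCheck(String input, String expectedMessage) {
        return Optional.of(expectedMessage).equals(checkValue(input));
    }
}
```

In a JUnit test the strong form would correspond to `assertEquals(Optional.of("booktitle should not contain a year"), checker.checkValue(...))`, which fails if the checker starts returning a different warning.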



2. Booktitle warns only once 🐞 Bug ≡ Correctness
Description
BooktitleChecker returns on the first matching condition (year, then country, then pages), so a
single booktitle containing multiple kinds of embedded metadata will only ever report one integrity
message and silently skip the others.
Code

jablib/src/main/java/org/jabref/logic/integrity/BooktitleChecker.java[R42-52]

+        if (CONTAINS_YEAR.test(value)) {
+            return Optional.of(Localization.lang("booktitle should not contain a year"));
+        }
+
+        if (CONTAINS_COUNTRY.test(value)) {
+            return Optional.of(Localization.lang("booktitle should not contain a location"));
+        }
+
+        if (CONTAINS_PAGES.test(value)) {
+            return Optional.of(Localization.lang("booktitle should not contain page numbers"));
+        }
Evidence
BooktitleChecker exits early on the first match, making subsequent checks unreachable for the same
value. The integrity system can emit multiple messages for the same field by registering multiple
ValueCheckers (FieldCheckers uses a Multimap), but this implementation keeps all checks in one
ValueChecker that can only return one Optional message.

jablib/src/main/java/org/jabref/logic/integrity/BooktitleChecker.java[38-52]
jablib/src/main/java/org/jabref/logic/integrity/ValueChecker.java[5-10]
jablib/src/main/java/org/jabref/logic/integrity/FieldChecker.java[23-27]
jablib/src/main/java/org/jabref/logic/integrity/FieldCheckers.java[18-55]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`BooktitleChecker.checkValue` returns after the first matched rule, so a booktitle containing both a year and a location (or pages) will only produce one warning.
### Issue Context
The integrity framework already supports multiple `ValueChecker`s per field via `FieldCheckers`’ `Multimap<Field, ValueChecker>`; each checker can emit one `IntegrityMessage`.
### Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/integrity/BooktitleChecker.java[38-52]
- jablib/src/main/java/org/jabref/logic/integrity/FieldCheckers.java[18-55]
### Suggested fix
1. Keep `BooktitleChecker` for the existing "ends with conference on" rule (or convert it into one focused checker).
2. Create three new `ValueChecker` implementations:
- `BooktitleContainsYearChecker`
- `BooktitleContainsCountryChecker`
- `BooktitleContainsPagesChecker`
3. Register all of them for `StandardField.BOOKTITLE` in `FieldCheckers.getAllMap(...)` using multiple `put(...)` calls.
4. Add/adjust tests to assert that a booktitle with both year and country yields *two* messages when run through the integrity pipeline (e.g., via `FieldCheckers.getForField(StandardField.BOOKTITLE)` + applying each checker).
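
The split suggested above can be sketched without any JabRef classes. `ValueCheckerSketch` is a local stand-in for JabRef's `ValueChecker` (only the `Optional<String>` return shape is taken from the review), and the three patterns are simplified placeholders rather than the PR's regexes.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Local stand-in for JabRef's ValueChecker interface (assumed shape).
interface ValueCheckerSketch {
    Optional<String> checkValue(String value);
}

final class SplitBooktitleCheckers {
    // Simplified placeholder patterns, not the PR's actual regexes.
    private static final Pattern YEAR = Pattern.compile("\\b[12]\\d{3}\\b");
    private static final Pattern COUNTRY =
            Pattern.compile("\\b(?:norway|austria|singapore)\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern PAGES =
            Pattern.compile("\\b(?:pp?\\.|pages?)\\s*\\d+", Pattern.CASE_INSENSITIVE);

    private static ValueCheckerSketch rule(Pattern pattern, String message) {
        return value -> pattern.matcher(value).find() ? Optional.of(message) : Optional.empty();
    }

    // One focused checker per rule, analogous to several put(...) calls for one field.
    static final List<ValueCheckerSketch> BOOKTITLE_CHECKERS = List.of(
            rule(YEAR, "booktitle should not contain a year"),
            rule(COUNTRY, "booktitle should not contain a location"),
            rule(PAGES, "booktitle should not contain page numbers"));

    // Every checker runs, so one match can no longer hide the others.
    static List<String> check(String booktitle) {
        return BOOKTITLE_CHECKERS.stream()
                .map(checker -> checker.checkValue(booktitle))
                .flatMap(Optional::stream)
                .collect(Collectors.toList());
    }
}
```

With this layout, a booktitle that contains both a year and a country yields two messages instead of one, which is the behaviour the suggested integration test would assert.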



3. Country regex matches tokens 🐞 Bug ≡ Correctness
Description
The country detection regex only checks adjacent letters for word boundaries, so it will match
country abbreviations inside alphanumeric tokens (e.g., "USA2015"), producing incorrect “location”
warnings despite the intent to match whole words.
Code

jablib/src/main/java/org/jabref/logic/integrity/BooktitleChecker.java[R26-30]

+        String alternation = Countries.COUNTRY_NAMES.stream()
+                                                    .map(Pattern::quote)
+                                                    .collect(Collectors.joining("|"));
+        CONTAINS_COUNTRY = Pattern.compile("(?i)(?<![a-z])(" + alternation + ")(?![a-z])").asPredicate();
+    }
Evidence
The pattern uses (?<![a-z]) and (?![a-z]), which allow digits immediately after a match. With
(?i) enabled, USA2015 matches usa because 2 is not a letter. The countries list explicitly
includes short abbreviations like usa, uk, and uae, increasing the chance of these false
matches.

jablib/src/main/java/org/jabref/logic/integrity/BooktitleChecker.java[25-30]
jablib/src/main/java/org/jabref/logic/integrity/Countries.java[52-55]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`CONTAINS_COUNTRY` claims to do whole-word matching, but its boundary checks only exclude letters, not digits/underscores. This makes it match abbreviations like `usa` inside `USA2015`.
### Issue Context
Short abbreviations (`usa`, `uk`, `uae`) exist in `Countries.COUNTRY_NAMES`, so alphanumeric conference tokens can be mis-flagged as locations.
### Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/integrity/BooktitleChecker.java[25-30]
- jablib/src/main/java/org/jabref/logic/integrity/Countries.java[52-55]
### Suggested fix
Replace the `[a-z]`-based lookarounds with real word boundaries, e.g.:
- `Pattern.compile("(?i)\\b(" + alternation + ")\\b")`
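Alternatively, use explicit alphanumeric lookarounds, which also treat digits (but not underscores) as part of a token. In Java, `\p{Alnum}` is ASCII-only unless `(?U)` (`Pattern.UNICODE_CHARACTER_CLASS`) is enabled:
- `(?<!\\p{Alnum})(...)(?!\\p{Alnum})`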
Add a regression test ensuring strings like `"Proceedings USA2015"` do **not** trigger the location warning.
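
The effect of the boundary change can be checked in isolation. In this sketch the alternation is a hard-coded placeholder (the PR builds it from `Countries.COUNTRY_NAMES`), and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class CountryBoundarySketch {
    // Placeholder alternation standing in for the full country list.
    private static final String ALTERNATION = "usa|uk|austria";

    // Lookarounds as in the reviewed diff: only letters count as token characters.
    static final Pattern LETTERS_ONLY =
            Pattern.compile("(?i)(?<![a-z])(" + ALTERNATION + ")(?![a-z])");

    // Digits also count as token characters (ASCII-only unless (?U) is added).
    static final Pattern ALNUM =
            Pattern.compile("(?i)(?<!\\p{Alnum})(" + ALTERNATION + ")(?!\\p{Alnum})");

    static boolean matches(Pattern pattern, String text) {
        return pattern.matcher(text).find();
    }
}
```

For the input `"Proceedings USA2015"`, the letters-only variant finds `USA` because the following `2` is not a letter, while the alphanumeric variant rejects it; both still match `"Vienna, Austria"`.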




Remediation recommended

4. Trivial // utility class 📘 Rule violation ⚙ Maintainability
Description
The new Countries utility class adds a trivial comment // utility class that restates what the
private constructor already implies. This violates the comment hygiene rule to avoid
trivial/restating comments.
Code

jablib/src/main/java/org/jabref/logic/integrity/Countries.java[R58-60]

+    private Countries() {
+        // utility class
+    }
Evidence
PR Compliance ID 4 forbids adding trivial comments that restate code; the comment // utility class
provides no additional intent beyond the private constructor.

AGENTS.md
jablib/src/main/java/org/jabref/logic/integrity/Countries.java[58-60]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A trivial comment (`// utility class`) was added in the private constructor of `Countries`, restating what the code already makes clear.
## Issue Context
Comment hygiene guidelines require comments to explain intent (“why”), not restate obvious code.
## Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/integrity/Countries.java[58-60]



5. Year regex matches tokens 🐞 Bug ≡ Correctness
Description
The year regex is bounded only by non-digits, so it also matches 4-digit sequences embedded in
alphanumeric tokens (e.g., "ICML2015"), contradicting the “standalone” intent and potentially
creating noisy warnings.
Code

jablib/src/main/java/org/jabref/logic/integrity/BooktitleChecker.java[R14-16]

+    // Matches a standalone 4-digit year in the range 1000–2999
+    private static final Predicate<String> CONTAINS_YEAR =
+            Pattern.compile("(?<![0-9])[12][0-9]{3}(?![0-9])").asPredicate();
Evidence
(?<![0-9]) / (?![0-9]) only prevent adjacent digits; letters are allowed. Therefore any
occurrence like ICML2015 (letter immediately before the digits) still matches the predicate and
will be flagged as containing a year.

jablib/src/main/java/org/jabref/logic/integrity/BooktitleChecker.java[14-16]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`CONTAINS_YEAR` matches 4-digit sequences even when directly attached to letters (e.g., `ICML2015`), even though the comment says it matches a standalone year.
### Issue Context
Current pattern: `(?<![0-9])[12][0-9]{3}(?![0-9])`.
### Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/integrity/BooktitleChecker.java[14-16]
### Suggested fix
If the intent is truly a standalone year token, switch to an alphanumeric-aware boundary, e.g.:
- `Pattern.compile("\\b[12]\\d{3}\\b")`
Then add a unit test demonstrating that `"ICML2015"` is not flagged while `"ICML 2015"` is flagged.
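
A minimal comparison of the current and suggested year patterns, using the two regexes quoted in this finding. The class and method names are hypothetical and exist only for this sketch.

```java
import java.util.regex.Pattern;

final class YearBoundarySketch {
    // Pattern from the reviewed diff: only neighbouring digits block a match.
    static final Pattern DIGIT_BOUNDED =
            Pattern.compile("(?<![0-9])[12][0-9]{3}(?![0-9])");

    // Suggested replacement: letters and digits on either side both block a match.
    static final Pattern WORD_BOUNDED = Pattern.compile("\\b[12]\\d{3}\\b");

    static boolean containsYear(Pattern pattern, String text) {
        return pattern.matcher(text).find();
    }
}
```

`ICML2015` is flagged by the digit-bounded pattern but not by the word-bounded one, whereas `ICML 2015` is flagged by both.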




@Chiragsd13
Author

The guard-review check is failing due to a missing .github/actions/pr-gate/action.yml in the main repository. This appears to be a CI infrastructure issue unrelated to the changes in this PR. All 51 other checks are passing.

@Chiragsd13
Author

CI Infrastructure Issue: guard-review job failing

Hi maintainers, I wanted to flag that the guard-review check is failing on this PR with the following error:

Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under
'/home/runner/work/jabref/jabref/.github/actions/pr-gate'.
Did you forget to run actions/checkout before running your local action?

Root cause: The workflow .github/workflows/remove-ready-for-review.yml references a local composite action at .github/actions/pr-gate, but that directory/file does not appear to exist in the repository.

Impact on this PR: None — this is unrelated to the code changes here. All 51 other checks are passing. This job only triggered because the PR was just converted from Draft to Ready for Review.

This seems to be a pre-existing infrastructure issue on the main repository side. Happy to proceed with review whenever convenient.

- Extract year, country, and page-number checks into separate
  ValueChecker classes (BooktitleContainsYearChecker,
  BooktitleContainsCountryChecker, BooktitleContainsPagesChecker)
  so all three issues in one booktitle are reported independently
- Fix word-boundary regex in country checker: replace [a-z]
  lookarounds with \p{Alnum} so tokens like USA2015 are not
  mis-flagged as locations
- Register all three new checkers in FieldCheckers for BOOKTITLE
- Strengthen tests: use assertEquals with exact expected message
  instead of assertNotEquals(Optional.empty()); add regression
  test for alphanumeric token false-positive
@testlens-app

testlens-app bot commented Apr 1, 2026

✅ All tests passed ✅

🏷️ Commit: 7a50c37
▶️ Tests: 10226 executed
⚪️ Checks: 49/49 completed



github-actions bot added "status: no-bot-comments" and removed "status: changes-required" Apr 1, 2026
github-actions bot added "status: changes-required" and removed "status: no-bot-comments" Apr 4, 2026
github-actions bot added "status: no-bot-comments" and removed "status: changes-required" Apr 5, 2026

Labels

good second issue (Issues that involve a tour of two or three interweaved components in JabRef), status: no-bot-comments

Development

Successfully merging this pull request may close these issues.

New integrity checker for booktitle

1 participant