Commit 48a9838
fix: Prevent false positives for legitimate preprint repositories (#130)
* fix: Prevent false positives for legitimate preprint repositories
Fixes a critical issue where legitimate preprint repositories like bioRxiv,
SSRN, Zenodo, and medRxiv were incorrectly flagged as [SUSPICIOUS]. These
accounted for ~37% of all false positives (~185 of ~500 false flags).
## Changes Made
### 1. Enhanced Preprint Detection System
- Expanded existing arXiv-only detection to comprehensive preprint repository detection
- Added patterns for: bioRxiv, SSRN, medRxiv, Zenodo, PsyArXiv, TechRxiv, Research Square, OSF, ChemRxiv, and more
- Improved arXiv pattern matching to handle subject classifications (cs.LG, eess.SP) and institutional variants
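
A minimal sketch of what pattern-based repository detection could look like, assuming a flat list of case-insensitive regexes matched against a venue string. The names `PREPRINT_PATTERNS` and `is_preprint_venue` are hypothetical and the patterns are illustrative; this is not the project's actual implementation.

```python
import re

# Hypothetical pattern list; repository names come from the commit message,
# the regexes themselves are illustrative assumptions.
PREPRINT_PATTERNS = [
    r"\barxiv\b",
    r"\bbiorxiv\b",
    r"\bmedrxiv\b",
    r"\bssrn\b",
    r"\bzenodo\b",
    r"\bpsyarxiv\b",
    r"\btechrxiv\b",
    r"\bresearch\s+square\b",
    r"\bosf\b",
    r"\bchemrxiv\b",
]

# One combined, case-insensitive expression keeps the lookup to a single search.
_PREPRINT_RE = re.compile("|".join(PREPRINT_PATTERNS), re.IGNORECASE)


def is_preprint_venue(venue: str) -> bool:
    """Return True if the venue string names a known preprint repository."""
    return bool(_PREPRINT_RE.search(venue))
```

Because the match is a word-bounded search rather than an exact comparison, a venue string that carries a subject classification or an institutional suffix after `arXiv` is still recognised.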
### 2. Updated Data Models and Processing
- Renamed `arxiv_entries_count` to `preprint_entries_count` to reflect broader scope
- Updated batch processing to use new comprehensive detection
- Enhanced logging to report "preprints from legitimate repositories" instead of just "arXiv"
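
As a rough illustration of the batch-processing change, the sketch below skips preprint entries before assessment and counts them in `preprint_entries_count`. Only that field name and the quoted log phrase come from this commit message; `BatchResult`, `process_batch`, `summarize`, and the injected `is_preprint` callable are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class BatchResult:
    # Field name taken from the commit message; everything else is hypothetical.
    preprint_entries_count: int = 0
    assessed_venues: List[str] = field(default_factory=list)


def process_batch(
    venues: Iterable[str], is_preprint: Callable[[str], bool]
) -> BatchResult:
    """Skip preprint venues and collect the rest for assessment."""
    result = BatchResult()
    for venue in venues:
        if is_preprint(venue):
            # Preprints are counted but never reach the assessment backends.
            result.preprint_entries_count += 1
            continue
        result.assessed_venues.append(venue)
    return result


def summarize(result: BatchResult) -> str:
    """Build a log line; the exact wording beyond the quoted phrase is assumed."""
    return (
        f"Skipped {result.preprint_entries_count} "
        "preprints from legitimate repositories"
    )
```

Passing the detector in as a callable keeps the sketch independent of how detection itself is implemented.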
### 3. Comprehensive Test Coverage
- Added test for comprehensive preprint repository detection (13 different repositories)
- Added test for mixed content scenarios (preprints + legitimate + suspicious journals)
- Added test for arXiv subject classification variants (cs.LG, eess.SP, stat.ML, Cornell)
- Updated existing arXiv test to use new variable names
## Root Cause Analysis
The tool was incorrectly flagging legitimate preprint repositories because:
1. Only arXiv was detected and skipped during assessment
2. Other preprints (bioRxiv, SSRN, etc.) were processed as regular journals
3. Heuristic backends flagged them as suspicious due to unconventional metrics
4. No legitimate lists confirmed them, resulting in SUSPICIOUS classification
## Impact
- Eliminates ~185 false positives (~37% reduction in total false flags)
- Prevents researchers from avoiding essential academic infrastructure
- Maintains existing arXiv detection while expanding coverage
- No impact on legitimate journal/conference assessment
## Testing
- All quality checks pass (linting, formatting, type checking, tests)
- 333 tests pass including 3 new comprehensive preprint tests
- Maintains backward compatibility with existing functionality
Resolves #123
* feat: Enhance ArXiv detection patterns based on real-world test data
Improves preprint detection to catch all ArXiv variants found in actual
bibliographies, significantly reducing false positives.
## Enhanced Detection Patterns
### 1. Comprehensive ArXiv Pattern Coverage
- Added support for `ArXivPreprint:ID` format (e.g., ArXivPreprint:2510.09378)
- Added detection for `ArXive-prints` (with hyphen and 'e')
- Improved handling of spacing variations (`ArXivpreprint` vs `ArXiv preprint`)
- Enhanced case-insensitive detection (`ArXiv` vs `arxiv`)
- Added word boundary detection for standalone `ArXiv` entries
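
The variants listed above could be covered by a single case-insensitive expression. The following is a sketch under that assumption; `ARXIV_VARIANT_RE` and `looks_like_arxiv` are hypothetical names, and the project may well use several separate patterns instead.

```python
import re

# Illustrative pattern (an assumption, not the project's regex):
#   arxiv                      base name, any casing
#   \s*e-?prints? | \s*preprints?   optional "e-prints" / "preprint" suffix,
#                              with or without a separating space
#   :\d{4}\.\d{4,5}            optional new-style identifier such as :2510.09378
#   \b ... \b                  word boundaries so embedded substrings do not match
ARXIV_VARIANT_RE = re.compile(
    r"\barxiv(?:\s*e-?prints?|\s*preprints?)?(?::\d{4}\.\d{4,5})?\b",
    re.IGNORECASE,
)


def looks_like_arxiv(value: str) -> bool:
    """Return True if the value contains one of the arXiv spellings above."""
    return bool(ARXIV_VARIANT_RE.search(value))


if __name__ == "__main__":
    for sample in ["ArXivPreprint:2510.09378", "ArXive-prints", "ArXiv preprint"]:
        print(sample, looks_like_arxiv(sample))
```

A plain `\barxiv\b` would miss `ArXivPreprint`, since there is no word boundary between `v` and `P`; that is why the optional suffix group sits before the closing boundary.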
### 2. Extended Field Coverage
- Added `publisher` field detection (e.g., publisher={ArXiv})
- Added `howpublished` field detection (e.g., howpublished={ArXiv})
- Maintains existing coverage of journal, booktitle, eprint, url, title fields
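
To illustrate the extended field coverage, here is a minimal sketch that scans the BibTeX fields named above. It assumes an entry is available as a plain mapping of field names to strings; `PREPRINT_FIELDS` and `find_arxiv_field` are hypothetical names, and the simple substring check stands in for whatever matching the project actually performs.

```python
from typing import Mapping, Optional

# Field names taken from the commit message; the scan order is an assumption.
PREPRINT_FIELDS = (
    "journal",
    "booktitle",
    "eprint",
    "url",
    "title",
    "publisher",
    "howpublished",
)


def find_arxiv_field(entry: Mapping[str, str]) -> Optional[str]:
    """Return the first field whose value mentions arXiv, or None."""
    for name in PREPRINT_FIELDS:
        value = entry.get(name, "")
        # Case-insensitive substring check; a stand-in for real pattern matching.
        if "arxiv" in value.lower():
            return name
    return None
```

For example, `find_arxiv_field({"howpublished": "ArXiv"})` identifies the entry through `howpublished`, a field that a journal-only check would never inspect.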
### 3. Comprehensive Test Coverage
- Added `test_arxiv_variants_from_real_world_data` with 7 real-world variants
- Tests all patterns found in florath's test dataset:
  - journal={ArXivPreprint:2510.09378}
  - booktitle={ArXivPreprint}
  - howpublished={ArXiv}
  - journal={ArXive-prints}
  - journal={ArXivpreprint}
  - journal={ArXiv}
  - publisher={ArXiv}
## Validation Results
All 7 ArXiv variants from real-world test data are now detected correctly:
✓ ArXivPreprint:2510.09378 | journal={ArXivPreprint:2510.09378}
✓ ArXivPreprint | booktitle={ArXivPreprint}
✓ ArXiv | howpublished={ArXiv}
✓ ArXive-prints | journal={ArXive-prints}
✓ ArXivpreprint | journal={ArXivpreprint}
✓ ArXiv | journal={ArXiv}
✓ ArXiv | publisher={ArXiv}
## Quality Assurance
- All 334 tests pass including new comprehensive test case
- All quality checks pass (linting, formatting, type checking, coverage)
- Maintains backward compatibility with existing detection patterns
This addresses the feedback that ArXiv was still appearing as "suspicious"
in real-world test datasets, providing much more robust detection coverage.
* docs: Clarify BibTeX validation use case in README
Add parenthetical explanation '(Are the journal entries in my bibtex file valid?)'
to question 2 to help users understand the additional use case of validating
BibTeX file entries for strange or invalid journal names.
This addresses the observation that the tool can be used as a rough check
for bibliography quality, not just reference legitimacy assessment.
* refactor: Remove legacy code and redundant pattern list
Clean up internal code as requested:
1. **Removed legacy `_is_arxiv_entry()` method**
   - Unnecessary backward compatibility for internal code
   - Function was only defined but never called elsewhere
2. **Removed redundant `misc_preprint_patterns` list**
   - Was a duplicate subset of patterns already in the main list
   - All repositories (arxiv, biorxiv, ssrn, medrxiv, zenodo) already covered
   - Simplified logic to use single comprehensive pattern matching
3. **Improved code quality**
   - Eliminated code duplication
   - Cleaner, more maintainable implementation
   - Same functionality with less complexity
All tests pass - no functional changes, just cleaner internal code structure.
---------
Co-authored-by: florath-ai-assistant[bot] <Andreas.Florath@telekom.de>

1 parent 5a9af42 commit 48a9838
5 files changed (+395, -52) in src/aletheia_probe and tests/unit