Commit 48a9838

fix: Prevent false positives for legitimate preprint repositories (#130)
* fix: Prevent false positives for legitimate preprint repositories

Fixes critical issue where legitimate preprint repositories like bioRxiv, SSRN, Zenodo, and medRxiv were incorrectly flagged as [SUSPICIOUS], representing ~37% of all false positives (~185 out of ~500 total flags).

## Changes Made

### 1. Enhanced Preprint Detection System
- Expanded existing arXiv-only detection to comprehensive preprint repository detection
- Added patterns for: bioRxiv, SSRN, medRxiv, Zenodo, PsyArXiv, TechRxiv, Research Square, OSF, ChemRxiv, and more
- Improved arXiv pattern matching to handle subject classifications (cs.LG, eess.SP) and institutional variants

### 2. Updated Data Models and Processing
- Renamed `arxiv_entries_count` to `preprint_entries_count` to reflect broader scope
- Updated batch processing to use new comprehensive detection
- Enhanced logging to report "preprints from legitimate repositories" instead of just "arXiv"

### 3. Comprehensive Test Coverage
- Added test for comprehensive preprint repository detection (13 different repositories)
- Added test for mixed content scenarios (preprints + legitimate + suspicious journals)
- Added test for arXiv subject classification variants (cs.LG, eess.SP, stat.ML, Cornell)
- Updated existing arXiv test to use new variable names

## Root Cause Analysis

The tool was incorrectly flagging legitimate preprint repositories because:
1. Only arXiv was detected and skipped during assessment
2. Other preprints (bioRxiv, SSRN, etc.) were processed as regular journals
3. Heuristic backends flagged them as suspicious due to unconventional metrics
4. No legitimate lists confirmed them, resulting in SUSPICIOUS classification

## Impact
- Eliminates ~185 false positives (~37% reduction in total false flags)
- Prevents researchers from avoiding essential academic infrastructure
- Maintains existing arXiv detection while expanding coverage
- No impact on legitimate journal/conference assessment

## Testing
- All quality checks pass (linting, formatting, type checking, tests)
- 333 tests pass including 3 new comprehensive preprint tests
- Maintains backward compatibility with existing functionality

Resolves #123

* feat: Enhance ArXiv detection patterns based on real-world test data

Improves preprint detection to catch all ArXiv variants found in actual bibliographies, significantly reducing false positives.

## Enhanced Detection Patterns

### 1. Comprehensive ArXiv Pattern Coverage
- Added support for `ArXivPreprint:ID` format (e.g., ArXivPreprint:2510.09378)
- Added detection for `ArXive-prints` (with hyphen and 'e')
- Improved handling of spacing variations (`ArXivpreprint` vs `ArXiv preprint`)
- Enhanced case-insensitive detection (`ArXiv` vs `arxiv`)
- Added word boundary detection for standalone `ArXiv` entries

### 2. Extended Field Coverage
- Added `publisher` field detection (e.g., publisher={ArXiv})
- Added `howpublished` field detection (e.g., howpublished={ArXiv})
- Maintains existing coverage of journal, booktitle, eprint, url, title fields

### 3. Comprehensive Test Coverage
- Added `test_arxiv_variants_from_real_world_data` with 7 real-world variants
- Tests all patterns found in florath's test dataset:
  - journal={ArXivPreprint:2510.09378}
  - booktitle={ArXivPreprint}
  - howpublished={ArXiv}
  - journal={ArXive-prints}
  - journal={ArXivpreprint}
  - journal={ArXiv}
  - publisher={ArXiv}

## Validation Results

All 7 ArXiv variants from real-world test data are now detected correctly:

✓ ArXivPreprint:2510.09378 | journal={ArXivPreprint:2510.09378}
✓ ArXivPreprint | booktitle={ArXivPreprint}
✓ ArXiv | howpublished={ArXiv}
✓ ArXive-prints | journal={ArXive-prints}
✓ ArXivpreprint | journal={ArXivpreprint}
✓ ArXiv | journal={ArXiv}
✓ ArXiv | publisher={ArXiv}

## Quality Assurance
- All 334 tests pass including new comprehensive test case
- All quality checks pass (linting, formatting, type checking, coverage)
- Maintains backward compatibility with existing detection patterns

This addresses the feedback that ArXiv was still appearing as "suspicious" in real-world test datasets, providing much more robust detection coverage.

* docs: Clarify BibTeX validation use case in README

Add parenthetical explanation '(Are the journal entries in my bibtex file valid?)' to question 2 to help users understand the additional use case of validating BibTeX file entries for strange or invalid journal names.

This addresses the observation that the tool can be used as a rough check for bibliography quality, not just reference legitimacy assessment.

* refactor: Remove legacy code and redundant pattern list

Clean up internal code as requested:

1. **Removed legacy _is_arxiv_entry() method**
   - Unnecessary backward compatibility for internal code
   - Function was only defined but never called elsewhere
2. **Removed redundant misc_preprint_patterns list**
   - Was duplicate subset of patterns already in main list
   - All repositories (arxiv, biorxiv, ssrn, medrxiv, zenodo) already covered
   - Simplified logic to use single comprehensive pattern matching
3. **Improved code quality**
   - Eliminated code duplication
   - Cleaner, more maintainable implementation
   - Same functionality with less complexity

All tests pass - no functional changes, just cleaner internal code structure.

---------

Co-authored-by: florath-ai-assistant[bot] <Andreas.Florath@telekom.de>
1 parent 5a9af42 commit 48a9838
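
The variants listed under "Validation Results" can be checked by hand against a few of the arXiv patterns that appear in this commit's diff. The following is a minimal sketch, not the project's test: `ARXIV_PATTERNS` is only a subset of the committed list, and `looks_like_arxiv` and `VARIANTS` are hypothetical names introduced for illustration.

```python
import re

# Subset of the arXiv patterns from this commit's diff (not the full list).
ARXIV_PATTERNS = [
    r"arxiv\s*preprint\s*:?\s*\d{4}\.\d{5}(?:v\d+)?",  # ArXivPreprint:2510.09378
    r"arxiv\s*e-?prints?",                             # ArXiv e-prints, ArXive-prints
    r"\barxive?\s*preprints?\b",                       # ArXivpreprint
    r"\barxive?\b",                                    # standalone ArXiv
]


def looks_like_arxiv(field_value: str) -> bool:
    """Hypothetical helper: True if any pattern matches the lowercased value."""
    text = field_value.lower()
    return any(re.search(pattern, text) for pattern in ARXIV_PATTERNS)


# The distinct field values behind the seven variants in the commit message
# ("ArXiv" occurs there in three different fields).
VARIANTS = [
    "ArXivPreprint:2510.09378",
    "ArXivPreprint",
    "ArXiv",
    "ArXive-prints",
    "ArXivpreprint",
]

for value in VARIANTS:
    print(f"{value!r}: {looks_like_arxiv(value)}")
```

A venue name without any arXiv-like token, such as "Journal of Machine Learning Research", matches none of these patterns in the sketch.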

5 files changed

+395
-52
lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ Aletheia-Probe helps answer two critical questions for researchers:
 ```bash
 aletheia-probe journal "Journal of Computer Science"
 ```
-2. **Are the references in my paper legitimate?**
+2. **Are the references in my paper legitimate?** (Are the journal entries in my bibtex file valid?)
 ```bash
 aletheia-probe bibtex references.bib
 ```

src/aletheia_probe/batch_assessor.py

Lines changed: 7 additions & 9 deletions
@@ -81,17 +81,17 @@ async def assess_bibtex_file(

         # Parse the BibTeX file to extract journal entries
         try:
-            bibtex_entries, skipped_count, arxiv_count = BibtexParser.parse_bibtex_file(
-                file_path, relax_bibtex
+            bibtex_entries, skipped_count, preprint_count = (
+                BibtexParser.parse_bibtex_file(file_path, relax_bibtex)
             )
             detail_logger.debug(
-                f"Successfully parsed {len(bibtex_entries)} entries, skipped {skipped_count}, found {arxiv_count} arXiv entries"
+                f"Successfully parsed {len(bibtex_entries)} entries, skipped {skipped_count}, found {preprint_count} preprint entries"
             )
         except Exception as e:
             detail_logger.error(f"Failed to parse BibTeX file: {e}")
             raise ValueError(f"Failed to parse BibTeX file: {e}") from e

-        total_entries = len(bibtex_entries) + skipped_count + arxiv_count
+        total_entries = len(bibtex_entries) + skipped_count + preprint_count
         status_logger.info(
             f"Found {len(bibtex_entries)} entries with journal information"
         )
@@ -101,7 +101,7 @@ async def assess_bibtex_file(
             file_path=str(file_path),
             total_entries=total_entries,
             entries_with_journals=len(bibtex_entries),
-            arxiv_entries_count=arxiv_count,
+            preprint_entries_count=preprint_count,
             skipped_entries_count=skipped_count,
             predatory_count=0,
             legitimate_count=0,
@@ -285,10 +285,8 @@ def format_summary(result: BibtexAssessmentResult, verbose: bool = False) -> str
         summary_lines.append(f"File: {result.file_path}")
         summary_lines.append(f"Total entries in file: {result.total_entries}")
         summary_lines.append(f"Entries assessed: {result.entries_with_journals}")
-        if result.arxiv_entries_count > 0:
-            summary_lines.append(
-                f"Skipped arXiv preprints: {result.arxiv_entries_count}"
-            )
+        if result.preprint_entries_count > 0:
+            summary_lines.append(f"Skipped preprints: {result.preprint_entries_count}")
         if result.skipped_entries_count > 0:
             summary_lines.append(
                 f"Skipped other entries: {result.skipped_entries_count}"
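
The bookkeeping in this file is simple arithmetic: the total is the sum of assessed, skipped, and preprint entries, and each "Skipped ..." summary line is emitted only when its count is positive. A small sketch of that logic follows; `ParseCounts` and `summary_lines` are hypothetical stand-ins for illustration, not the project's `BibtexAssessmentResult` or `format_summary`.

```python
from dataclasses import dataclass


@dataclass
class ParseCounts:
    """Hypothetical stand-in for the three values returned by the parser."""

    assessed: int   # entries with journal information
    skipped: int    # entries skipped for other reasons
    preprints: int  # entries recognised as preprints

    @property
    def total(self) -> int:
        # Mirrors: len(bibtex_entries) + skipped_count + preprint_count
        return self.assessed + self.skipped + self.preprints


def summary_lines(counts: ParseCounts) -> list[str]:
    """Build summary lines, omitting zero-count 'Skipped' lines."""
    lines = [
        f"Total entries in file: {counts.total}",
        f"Entries assessed: {counts.assessed}",
    ]
    if counts.preprints > 0:
        lines.append(f"Skipped preprints: {counts.preprints}")
    if counts.skipped > 0:
        lines.append(f"Skipped other entries: {counts.skipped}")
    return lines


counts = ParseCounts(assessed=40, skipped=2, preprints=8)
print("\n".join(summary_lines(counts)))
```

With the illustrative counts above, the total is 50 and both "Skipped" lines appear; with `preprints=0` the preprint line is omitted.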

src/aletheia_probe/bibtex_parser.py

Lines changed: 76 additions & 37 deletions
@@ -92,15 +92,15 @@ def parse_bibtex_file(

                     entries = []
                     skipped_entries = 0
-                    arxiv_entries = 0
+                    preprint_entries = 0

                     for entry_key, entry in bib_data.entries.items():
                         try:
-                            # First, check for arXiv entries to correctly categorize skipped entries
-                            if BibtexParser._is_arxiv_entry(entry):
-                                arxiv_entries += 1
+                            # First, check for preprint entries to correctly categorize skipped entries
+                            if BibtexParser._is_preprint_entry(entry):
+                                preprint_entries += 1
                                 detail_logger.debug(
-                                    f"Skipping arXiv entry: {entry_key}"
+                                    f"Skipping preprint entry: {entry_key}"
                                 )
                                 continue

@@ -125,10 +125,10 @@ def parse_bibtex_file(
                                 f"with {description}"
                             )

-                    # Inform user about arXiv preprints (if any)
-                    if arxiv_entries > 0:
+                    # Inform user about preprints (if any)
+                    if preprint_entries > 0:
                         status_logger.info(
-                            f"Skipped {arxiv_entries} arXiv preprint(s) - not publication venues"
+                            f"Skipped {preprint_entries} preprint(s) from legitimate repositories - not publication venues"
                         )

                     # Log other skipped entries
@@ -137,7 +137,7 @@ def parse_bibtex_file(
                             f"Skipped {skipped_entries} other entries due to processing errors"
                         )

-                    return entries, skipped_entries, arxiv_entries
+                    return entries, skipped_entries, preprint_entries

             except UnicodeDecodeError as e:
                 last_error = e
@@ -501,63 +501,101 @@ def _remove_nested_braces(value: str) -> str:
         return value.strip()

     @staticmethod
-    def _is_arxiv_entry(entry: Entry) -> bool:
-        """Detects if a BibTeX entry is an arXiv preprint.
+    def _is_preprint_entry(entry: Entry) -> bool:
+        """Detects if a BibTeX entry is a preprint from a legitimate repository.

-        Checks the 'journal', 'booktitle', 'eprint', and 'title' fields
-        for common arXiv patterns.
+        Checks the 'journal', 'booktitle', 'eprint', 'url', and 'title' fields
+        for patterns from legitimate preprint repositories to prevent false positives.

         Args:
             entry: BibTeX entry object.

         Returns:
-            True if the entry is identified as an arXiv preprint, False otherwise.
+            True if the entry is identified as a legitimate preprint, False otherwise.
         """
         import re

-        # Patterns to identify arXiv entries
-        # - "arXiv preprint arXiv:XXXX.XXXXX"
-        # - "ArXiv e-prints"
-        # - "arXiv:XXXX.XXXXX" (bare arXiv identifier)
-        # - "e-print" field containing "arXiv"
-        # - Journal field containing only arXiv identifier
-
+        # Comprehensive patterns for arXiv (all variants from real-world data)
         arxiv_patterns = [
-            r"arxiv\s+preprint\s+arxiv:\d{4}\.\d{5}(v\d+)?",  # arXiv preprint arXiv:XXXX.XXXXX
-            r"arxiv\s+e-prints",  # ArXiv e-prints
-            r"arxiv:\d{4}\.\d{5}(v\d+)?",  # bare arXiv identifier
-            r"arxiv:\w+\.\w+(v\d+)?",  # arXiv:cs.AI/9901001 (old style)
+            # Standard arXiv patterns
+            r"arxiv\s*preprint\s*(?:arxiv\s*:)?\s*\d{4}\.\d{5}(?:v\d+)?",  # arXiv preprint arXiv:XXXX.XXXXX
+            r"arxiv\s*preprint\s*:?\s*\d{4}\.\d{5}(?:v\d+)?",  # ArXivPreprint:2510.09378
+            r"arxiv\s*e-?prints?",  # ArXiv e-prints, ArXive-prints
+            r"arxiv:\d{4}\.\d{5}(?:v\d+)?",  # bare arXiv identifier
+            r"arxiv:\w+\.\w+(?:v\d+)?",  # arXiv:cs.AI/9901001 (old style)
             r"eprint:\s*arxiv",  # for entries where eprint field is "eprint = {arXiv}"
+            # ArXiv with classifications and institutional info
+            r"arxiv\s*\[[^\]]+\]",  # arXiv with subject classification (e.g., "arXiv [cs.LG]")
+            r"arxiv\s*\([^)]*\)",  # arXiv with parenthetical info (e.g., "arXiv (Cornell University)")
+            # Common variants found in real bibliographies
+            r"\barxive?\s*preprints?\b",  # ArXivpreprint, ArXivepreprint
+            r"\barxive?\b",  # Just "ArXiv" or "ArXive" as word boundary
+            r"^arxive?$",  # Just "ArXiv" or "ArXive" as whole field
+            r"^arxive?\s*preprints?",  # ArXiv preprint at start
+            r"arxive?\s*preprints?\s*$",  # ArXiv preprint at end
+        ]
+
+        # Patterns for other legitimate preprint repositories
+        preprint_patterns = [
+            # bioRxiv - biology preprints
+            r"biorxiv",
+            r"bio\s*rxiv",
+            r"www\.biorxiv\.org",
+            r"doi\.org/10\.1101/",
+            # SSRN - social sciences preprints
+            r"ssrn\s*electronic\s*journal",
+            r"social\s*science\s*research\s*network",
+            r"ssrn\.com",
+            r"\bssrn\b",
+            # medRxiv - medical preprints
+            r"medrxiv",
+            r"med\s*rxiv",
+            r"www\.medrxiv\.org",
+            # Zenodo - multidisciplinary repository
+            r"\bzenodo\b",
+            r"zenodo\.org",
+            r"doi\.org/10\.5281/zenodo",
+            # Other legitimate preprint repositories
+            r"psyarxiv",  # Psychology preprints
+            r"socarxiv",  # Social sciences preprints
+            r"eartharxiv",  # Earth sciences preprints
+            r"engrxiv",  # Engineering preprints
+            r"techrxiv",  # IEEE preprints
+            r"preprints\.org",  # MDPI preprints
+            r"research\s*square",  # Research Square preprints
+            r"researchsquare\.com",
+            r"osf\.io/preprints",  # Open Science Framework preprints
+            r"chemrxiv",  # Chemistry preprints
+            r"authorea\.com",  # Authorea preprints platform
         ]

+        # Combine all patterns
+        all_patterns = arxiv_patterns + preprint_patterns
+
         # Combine all relevant fields into a single string for pattern matching
-        # Prioritize 'journal' and 'booktitle' as they are often used for venue names
-        # 'eprint' is a direct indicator, 'title' might contain it if poorly formatted
+        # Include publisher and howpublished fields based on real-world data analysis
         fields_to_check = [
             BibtexParser._get_field_safely(entry, "journal"),
             BibtexParser._get_field_safely(entry, "booktitle"),
             BibtexParser._get_field_safely(entry, "eprint"),
+            BibtexParser._get_field_safely(entry, "url"),
             BibtexParser._get_field_safely(entry, "title"),
+            BibtexParser._get_field_safely(entry, "publisher"),
+            BibtexParser._get_field_safely(entry, "howpublished"),
         ]

         # Filter out None values and convert to lowercase for case-insensitive matching
         checked_content = " ".join(
             [f.lower() for f in fields_to_check if f is not None]
         )

-        for pattern in arxiv_patterns:
+        for pattern in all_patterns:
             if re.search(pattern, checked_content, re.IGNORECASE):
                 detail_logger.debug(
-                    f"Detected arXiv pattern '{pattern}' in entry: {entry.key}"
+                    f"Detected preprint pattern '{pattern}' in entry: {entry.key}"
                 )
                 return True

-        # Additionally, check if the entry type itself is 'misc' and contains 'arxiv' in title/journal
-        if entry.type.lower() == "misc":
-            if re.search(r"arxiv", checked_content, re.IGNORECASE):
-                detail_logger.debug(f"Detected arXiv in 'misc' type entry: {entry.key}")
-                return True
-
         return False

     @staticmethod
@staticmethod
@@ -586,8 +624,9 @@ def _detect_venue_type(entry: Entry, venue_name: str) -> VenueType:
         venue_name_lower = venue_name.lower()
         entry_type_lower = entry.type.lower()

-        # Check for arXiv/preprints first (highest priority)
-        if BibtexParser._is_arxiv_entry(entry):
+        # Check for preprints first (highest priority)
+        # This includes arXiv, bioRxiv, SSRN, medRxiv, Zenodo, and other legitimate repositories
+        if BibtexParser._is_preprint_entry(entry):
             return VenueType.PREPRINT

         # Symposium patterns (check first since they should have highest priority)
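
The detection approach in this file joins several BibTeX fields into one lowercased string and returns on the first matching pattern. The sketch below illustrates that idea for a few non-arXiv repositories. It assumes entries are plain dictionaries rather than the parser's `Entry` objects, uses only a handful of the committed patterns, and `is_preprint` is a hypothetical name.

```python
import re

# A few of the non-arXiv patterns from the diff above (not the full list).
PREPRINT_PATTERNS = [
    r"biorxiv",
    r"\bssrn\b",
    r"\bzenodo\b",
    r"research\s*square",
]

# Field names checked by the new method, in the order shown in the diff.
FIELDS = ("journal", "booktitle", "eprint", "url", "title", "publisher", "howpublished")


def is_preprint(entry: dict[str, str]) -> bool:
    """Hypothetical dict-based analogue of the combined-field check."""
    # Join the present fields into one lowercased string, as the diff does.
    combined = " ".join(entry[name].lower() for name in FIELDS if entry.get(name))
    return any(re.search(pattern, combined) for pattern in PREPRINT_PATTERNS)


samples = [
    {"journal": "bioRxiv"},
    {"title": "A dataset", "url": "https://doi.org/10.5281/zenodo.1234567"},
    {"journal": "SSRN Electronic Journal"},
    {"journal": "Nature", "publisher": "Springer"},
    {"journal": "Scientometrics", "title": "Who posts to bioRxiv?"},
]

for entry in samples:
    print(entry, "->", is_preprint(entry))
```

Because `title` is part of the combined string, the last sample is classified as a preprint in this sketch even though its journal is a regular venue. Whether that trade-off is acceptable depends on the bibliographies being checked.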

src/aletheia_probe/models.py

Lines changed: 3 additions & 2 deletions
@@ -215,8 +215,9 @@ class BibtexAssessmentResult(BaseModel):
     entries_with_journals: int = Field(
         ..., description="Number of entries with identifiable journals"
     )
-    arxiv_entries_count: int = Field(
-        0, description="Number of entries identified as arXiv preprints"
+    preprint_entries_count: int = Field(
+        0,
+        description="Number of entries identified as legitimate preprints (arXiv, bioRxiv, SSRN, etc.)",
     )
     skipped_entries_count: int = Field(
         0, description="Number of entries skipped for other reasons"
