fix: Prevent false positives for legitimate preprint repositories (#130)

coding-ai-assistant[bot] · florath · web-flow · commit 48a983872f78 · 2025-11-24T15:45:19.000+01:00
* fix: Prevent false positives for legitimate preprint repositories

Fixes critical issue where legitimate preprint repositories like bioRxiv, SSRN, Zenodo, and medRxiv were incorrectly flagged as [SUSPICIOUS], representing ~37% of all false positives (~185 out of ~500 total flags).

## Changes Made

### 1. Enhanced Preprint Detection System
- Expanded existing arXiv-only detection to comprehensive preprint repository detection
- Added patterns for: bioRxiv, SSRN, medRxiv, Zenodo, PsyArXiv, TechRxiv, Research Square, OSF, ChemRxiv, and more
- Improved arXiv pattern matching to handle subject classifications (cs.LG, eess.SP) and institutional variants

### 2. Updated Data Models and Processing
- Renamed `arxiv_entries_count` to `preprint_entries_count` to reflect broader scope
- Updated batch processing to use new comprehensive detection
- Enhanced logging to report "preprints from legitimate repositories" instead of just "arXiv"

### 3. Comprehensive Test Coverage
- Added test for comprehensive preprint repository detection (13 different repositories)
- Added test for mixed content scenarios (preprints + legitimate + suspicious journals)
- Added test for arXiv subject classification variants (cs.LG, eess.SP, stat.ML, Cornell)
- Updated existing arXiv test to use new variable names

## Root Cause Analysis

The tool was incorrectly flagging legitimate preprint repositories because:
1. Only arXiv was detected and skipped during assessment
2. Other preprints (bioRxiv, SSRN, etc.) were processed as regular journals
3. Heuristic backends flagged them as suspicious due to unconventional metrics
4. No legitimate lists confirmed them, resulting in SUSPICIOUS classification

## Impact
- Eliminates ~185 false positives (~37% reduction in total false flags)
- Prevents researchers from avoiding essential academic infrastructure
- Maintains existing arXiv detection while expanding coverage
- No impact on legitimate journal/conference assessment

## Testing
- All quality checks pass (linting, formatting, type checking, tests)
- 333 tests pass including 3 new comprehensive preprint tests
- Maintains backward compatibility with existing functionality

Resolves #123

* feat: Enhance ArXiv detection patterns based on real-world test data

Improves preprint detection to catch all ArXiv variants found in actual bibliographies, significantly reducing false positives.

## Enhanced Detection Patterns

### 1. Comprehensive ArXiv Pattern Coverage
- Added support for `ArXivPreprint:ID` format (e.g., ArXivPreprint:2510.09378)
- Added detection for `ArXive-prints` (with hyphen and 'e')
- Improved handling of spacing variations (`ArXivpreprint` vs `ArXiv preprint`)
- Enhanced case-insensitive detection (`ArXiv` vs `arxiv`)
- Added word boundary detection for standalone `ArXiv` entries

### 2. Extended Field Coverage
- Added `publisher` field detection (e.g., publisher={ArXiv})
- Added `howpublished` field detection (e.g., howpublished={ArXiv})
- Maintains existing coverage of journal, booktitle, eprint, url, title fields

### 3. Comprehensive Test Coverage
- Added `test_arxiv_variants_from_real_world_data` with 7 real-world variants
- Tests all patterns found in florath's test dataset:
  - journal={ArXivPreprint:2510.09378}
  - booktitle={ArXivPreprint}
  - howpublished={ArXiv}
  - journal={ArXive-prints}
  - journal={ArXivpreprint}
  - journal={ArXiv}
  - publisher={ArXiv}

## Validation Results

All 7 ArXiv variants from real-world test data are now detected correctly:

- ✓ ArXivPreprint:2510.09378 | journal={ArXivPreprint:2510.09378}
- ✓ ArXivPreprint | booktitle={ArXivPreprint}
- ✓ ArXiv | howpublished={ArXiv}
- ✓ ArXive-prints | journal={ArXive-prints}
- ✓ ArXivpreprint | journal={ArXivpreprint}
- ✓ ArXiv | journal={ArXiv}
- ✓ ArXiv | publisher={ArXiv}

## Quality Assurance
- All 334 tests pass including new comprehensive test case
- All quality checks pass (linting, formatting, type checking, coverage)
- Maintains backward compatibility with existing detection patterns

This addresses the feedback that ArXiv was still appearing as "suspicious" in real-world test datasets, providing much more robust detection coverage.

* docs: Clarify BibTeX validation use case in README

Add parenthetical explanation '(Are the journal entries in my bibtex file valid?)' to question 2 to help users understand the additional use case of validating BibTeX file entries for strange or invalid journal names.

This addresses the observation that the tool can be used as a rough check for bibliography quality, not just reference legitimacy assessment.

* refactor: Remove legacy code and redundant pattern list

Clean up internal code as requested:

1. **Removed legacy _is_arxiv_entry() method**
   - Unnecessary backward compatibility for internal code
   - Function was only defined but never called elsewhere
2. **Removed redundant misc_preprint_patterns list**
   - Was duplicate subset of patterns already in main list
   - All repositories (arxiv, biorxiv, ssrn, medrxiv, zenodo) already covered
   - Simplified logic to use single comprehensive pattern matching
3. **Improved code quality**
   - Eliminated code duplication
   - Cleaner, more maintainable implementation
   - Same functionality with less complexity

All tests pass - no functional changes, just cleaner internal code structure.

---------

Co-authored-by: florath-ai-assistant[bot] <Andreas.Florath@telekom.de>
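
The root cause described above comes down to a pre-filter: entries whose venue fields name a known preprint repository are counted and skipped before any journal assessment runs. The following is a minimal sketch of that idea, not the project's implementation. It assumes entries are plain dictionaries rather than the parser's `Entry` objects. `PREPRINT_PATTERNS`, `FIELDS`, `is_preprint`, and `partition_entries` are hypothetical names, and the pattern list is a small subset of the one in the diff.

```python
import re

# Hypothetical subset of the repository patterns added in this change.
PREPRINT_PATTERNS = [
    r"\barxive?\b",
    r"biorxiv",
    r"medrxiv",
    r"\bssrn\b",
    r"\bzenodo\b",
    r"chemrxiv",
]

# Fields inspected for repository names (same field names as in the diff).
FIELDS = (
    "journal", "booktitle", "eprint", "url",
    "title", "publisher", "howpublished",
)


def is_preprint(entry: dict[str, str]) -> bool:
    """Return True if any checked field names a known preprint repository."""
    content = " ".join(entry[f].lower() for f in FIELDS if f in entry)
    return any(re.search(p, content) for p in PREPRINT_PATTERNS)


def partition_entries(
    entries: list[dict[str, str]],
) -> tuple[list[dict[str, str]], int]:
    """Split entries into those to assess and a count of skipped preprints."""
    to_assess = [e for e in entries if not is_preprint(e)]
    return to_assess, len(entries) - len(to_assess)


# Mixed content: three preprints and two journal entries.
# The last journal name is made up for illustration.
mixed = [
    {"journal": "bioRxiv"},
    {"journal": "SSRN Electronic Journal"},
    {"howpublished": "Zenodo"},
    {"journal": "Nature"},
    {"journal": "International Journal of Advanced Everything"},
]
to_assess, preprint_count = partition_entries(mixed)
print(len(to_assess), preprint_count)
```

Only the two journal entries reach the assessment backends. The three repository entries are counted separately and never receive a SUSPICIOUS classification.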
diff --git a/README.md b/README.md
@@ -22,7 +22,7 @@ Aletheia-Probe helps answer two critical questions for researchers:
     ```bash
     aletheia-probe journal "Journal of Computer Science"
     ```
-2.  **Are the references in my paper legitimate?**
+2.  **Are the references in my paper legitimate?** (Are the journal entries in my bibtex file valid?)
     ```bash
     aletheia-probe bibtex references.bib
     ```
diff --git a/src/aletheia_probe/batch_assessor.py b/src/aletheia_probe/batch_assessor.py
@@ -81,17 +81,17 @@ async def assess_bibtex_file(
 
         # Parse the BibTeX file to extract journal entries
         try:
-            bibtex_entries, skipped_count, arxiv_count = BibtexParser.parse_bibtex_file(
-                file_path, relax_bibtex
+            bibtex_entries, skipped_count, preprint_count = (
+                BibtexParser.parse_bibtex_file(file_path, relax_bibtex)
             )
             detail_logger.debug(
-                f"Successfully parsed {len(bibtex_entries)} entries, skipped {skipped_count}, found {arxiv_count} arXiv entries"
+                f"Successfully parsed {len(bibtex_entries)} entries, skipped {skipped_count}, found {preprint_count} preprint entries"
             )
         except Exception as e:
             detail_logger.error(f"Failed to parse BibTeX file: {e}")
             raise ValueError(f"Failed to parse BibTeX file: {e}") from e
 
-        total_entries = len(bibtex_entries) + skipped_count + arxiv_count
+        total_entries = len(bibtex_entries) + skipped_count + preprint_count
         status_logger.info(
             f"Found {len(bibtex_entries)} entries with journal information"
         )
@@ -101,7 +101,7 @@ async def assess_bibtex_file(
             file_path=str(file_path),
             total_entries=total_entries,
             entries_with_journals=len(bibtex_entries),
-            arxiv_entries_count=arxiv_count,
+            preprint_entries_count=preprint_count,
             skipped_entries_count=skipped_count,
             predatory_count=0,
             legitimate_count=0,
@@ -285,10 +285,8 @@ def format_summary(result: BibtexAssessmentResult, verbose: bool = False) -> str
         summary_lines.append(f"File: {result.file_path}")
         summary_lines.append(f"Total entries in file: {result.total_entries}")
         summary_lines.append(f"Entries assessed: {result.entries_with_journals}")
-        if result.arxiv_entries_count > 0:
-            summary_lines.append(
-                f"Skipped arXiv preprints: {result.arxiv_entries_count}"
-            )
+        if result.preprint_entries_count > 0:
+            summary_lines.append(f"Skipped preprints: {result.preprint_entries_count}")
         if result.skipped_entries_count > 0:
             summary_lines.append(
                 f"Skipped other entries: {result.skipped_entries_count}"
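
After the rename, the summary still rests on a three-way split: assessed entries, skipped preprints, and entries skipped for other reasons, which together make up the file total. A small sketch of that accounting follows. `Counts` is a hypothetical dataclass standing in for the count fields of `BibtexAssessmentResult`, and `total_entries` is computed here as a property for brevity, whereas the diff stores it as a field.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical stand-in for the count fields of BibtexAssessmentResult."""

    entries_with_journals: int
    preprint_entries_count: int = 0
    skipped_entries_count: int = 0

    @property
    def total_entries(self) -> int:
        # Same sum as in assess_bibtex_file: assessed + skipped + preprints.
        return (
            self.entries_with_journals
            + self.skipped_entries_count
            + self.preprint_entries_count
        )


def summary_lines(c: Counts) -> list[str]:
    """Build the count lines of the summary, omitting zero-valued skip lines."""
    lines = [
        f"Total entries in file: {c.total_entries}",
        f"Entries assessed: {c.entries_with_journals}",
    ]
    if c.preprint_entries_count > 0:
        lines.append(f"Skipped preprints: {c.preprint_entries_count}")
    if c.skipped_entries_count > 0:
        lines.append(f"Skipped other entries: {c.skipped_entries_count}")
    return lines


print("\n".join(summary_lines(Counts(entries_with_journals=40, preprint_entries_count=7))))
```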
diff --git a/src/aletheia_probe/bibtex_parser.py b/src/aletheia_probe/bibtex_parser.py
@@ -92,15 +92,15 @@ def parse_bibtex_file(
 
                     entries = []
                     skipped_entries = 0
-                    arxiv_entries = 0
+                    preprint_entries = 0
 
                     for entry_key, entry in bib_data.entries.items():
                         try:
-                            # First, check for arXiv entries to correctly categorize skipped entries
-                            if BibtexParser._is_arxiv_entry(entry):
-                                arxiv_entries += 1
+                            # First, check for preprint entries to correctly categorize skipped entries
+                            if BibtexParser._is_preprint_entry(entry):
+                                preprint_entries += 1
                                 detail_logger.debug(
-                                    f"Skipping arXiv entry: {entry_key}"
+                                    f"Skipping preprint entry: {entry_key}"
                                 )
                                 continue
 
@@ -125,10 +125,10 @@ def parse_bibtex_file(
                         f"with {description}"
                     )
 
-                    # Inform user about arXiv preprints (if any)
-                    if arxiv_entries > 0:
+                    # Inform user about preprints (if any)
+                    if preprint_entries > 0:
                         status_logger.info(
-                            f"Skipped {arxiv_entries} arXiv preprint(s) - not publication venues"
+                            f"Skipped {preprint_entries} preprint(s) from legitimate repositories - not publication venues"
                         )
 
                     # Log other skipped entries
@@ -137,7 +137,7 @@ def parse_bibtex_file(
                             f"Skipped {skipped_entries} other entries due to processing errors"
                         )
 
-                    return entries, skipped_entries, arxiv_entries
+                    return entries, skipped_entries, preprint_entries
 
                 except UnicodeDecodeError as e:
                     last_error = e
@@ -501,63 +501,101 @@ def _remove_nested_braces(value: str) -> str:
         return value.strip()
 
     @staticmethod
-    def _is_arxiv_entry(entry: Entry) -> bool:
-        """Detects if a BibTeX entry is an arXiv preprint.
+    def _is_preprint_entry(entry: Entry) -> bool:
+        """Detects if a BibTeX entry is a preprint from a legitimate repository.
 
-        Checks the 'journal', 'booktitle', 'eprint', and 'title' fields
-        for common arXiv patterns.
+        Checks the 'journal', 'booktitle', 'eprint', 'url', and 'title' fields
+        for patterns from legitimate preprint repositories to prevent false positives.
 
         Args:
             entry: BibTeX entry object.
 
         Returns:
-            True if the entry is identified as an arXiv preprint, False otherwise.
+            True if the entry is identified as a legitimate preprint, False otherwise.
         """
         import re
 
-        # Patterns to identify arXiv entries
-        # - "arXiv preprint arXiv:XXXX.XXXXX"
-        # - "ArXiv e-prints"
-        # - "arXiv:XXXX.XXXXX" (bare arXiv identifier)
-        # - "e-print" field containing "arXiv"
-        # - Journal field containing only arXiv identifier
-
+        # Comprehensive patterns for arXiv (all variants from real-world data)
         arxiv_patterns = [
-            r"arxiv\s+preprint\s+arxiv:\d{4}\.\d{5}(v\d+)?",  # arXiv preprint arXiv:XXXX.XXXXX
-            r"arxiv\s+e-prints",  # ArXiv e-prints
-            r"arxiv:\d{4}\.\d{5}(v\d+)?",  # bare arXiv identifier
-            r"arxiv:\w+\.\w+(v\d+)?",  # arXiv:cs.AI/9901001 (old style)
+            # Standard arXiv patterns
+            r"arxiv\s*preprint\s*(?:arxiv\s*:)?\s*\d{4}\.\d{5}(?:v\d+)?",  # arXiv preprint arXiv:XXXX.XXXXX
+            r"arxiv\s*preprint\s*:?\s*\d{4}\.\d{5}(?:v\d+)?",  # ArXivPreprint:2510.09378
+            r"arxiv\s*e-?prints?",  # ArXiv e-prints, ArXive-prints
+            r"arxiv:\d{4}\.\d{5}(?:v\d+)?",  # bare arXiv identifier
+            r"arxiv:\w+\.\w+(?:v\d+)?",  # arXiv:cs.AI/9901001 (old style)
             r"eprint:\s*arxiv",  # for entries where eprint field is "eprint = {arXiv}"
+            # ArXiv with classifications and institutional info
+            r"arxiv\s*\[[^\]]+\]",  # arXiv with subject classification (e.g., "arXiv [cs.LG]")
+            r"arxiv\s*\([^)]*\)",  # arXiv with parenthetical info (e.g., "arXiv (Cornell University)")
+            # Common variants found in real bibliographies
+            r"\barxive?\s*preprints?\b",  # ArXivpreprint, ArXivepreprint
+            r"\barxive?\b",  # Just "ArXiv" or "ArXive" as word boundary
+            r"^arxive?$",  # Just "ArXiv" or "ArXive" as whole field
+            r"^arxive?\s*preprints?",  # ArXiv preprint at start
+            r"arxive?\s*preprints?\s*$",  # ArXiv preprint at end
+        ]
+
+        # Patterns for other legitimate preprint repositories
+        preprint_patterns = [
+            # bioRxiv - biology preprints
+            r"biorxiv",
+            r"bio\s*rxiv",
+            r"www\.biorxiv\.org",
+            r"doi\.org/10\.1101/",
+            # SSRN - social sciences preprints
+            r"ssrn\s*electronic\s*journal",
+            r"social\s*science\s*research\s*network",
+            r"ssrn\.com",
+            r"\bssrn\b",
+            # medRxiv - medical preprints
+            r"medrxiv",
+            r"med\s*rxiv",
+            r"www\.medrxiv\.org",
+            # Zenodo - multidisciplinary repository
+            r"\bzenodo\b",
+            r"zenodo\.org",
+            r"doi\.org/10\.5281/zenodo",
+            # Other legitimate preprint repositories
+            r"psyarxiv",  # Psychology preprints
+            r"socarxiv",  # Social sciences preprints
+            r"eartharxiv",  # Earth sciences preprints
+            r"engrxiv",  # Engineering preprints
+            r"techrxiv",  # IEEE preprints
+            r"preprints\.org",  # MDPI preprints
+            r"research\s*square",  # Research Square preprints
+            r"researchsquare\.com",
+            r"osf\.io/preprints",  # Open Science Framework preprints
+            r"chemrxiv",  # Chemistry preprints
+            r"authorea\.com",  # Authorea preprints platform
         ]
 
+        # Combine all patterns
+        all_patterns = arxiv_patterns + preprint_patterns
+
         # Combine all relevant fields into a single string for pattern matching
-        # Prioritize 'journal' and 'booktitle' as they are often used for venue names
-        # 'eprint' is a direct indicator, 'title' might contain it if poorly formatted
+        # Include publisher and howpublished fields based on real-world data analysis
         fields_to_check = [
             BibtexParser._get_field_safely(entry, "journal"),
             BibtexParser._get_field_safely(entry, "booktitle"),
             BibtexParser._get_field_safely(entry, "eprint"),
+            BibtexParser._get_field_safely(entry, "url"),
             BibtexParser._get_field_safely(entry, "title"),
+            BibtexParser._get_field_safely(entry, "publisher"),
+            BibtexParser._get_field_safely(entry, "howpublished"),
         ]
 
         # Filter out None values and convert to lowercase for case-insensitive matching
         checked_content = " ".join(
             [f.lower() for f in fields_to_check if f is not None]
         )
 
-        for pattern in arxiv_patterns:
+        for pattern in all_patterns:
             if re.search(pattern, checked_content, re.IGNORECASE):
                 detail_logger.debug(
-                    f"Detected arXiv pattern '{pattern}' in entry: {entry.key}"
+                    f"Detected preprint pattern '{pattern}' in entry: {entry.key}"
                 )
                 return True
 
-        # Additionally, check if the entry type itself is 'misc' and contains 'arxiv' in title/journal
-        if entry.type.lower() == "misc":
-            if re.search(r"arxiv", checked_content, re.IGNORECASE):
-                detail_logger.debug(f"Detected arXiv in 'misc' type entry: {entry.key}")
-                return True
-
         return False
 
     @staticmethod
@@ -586,8 +624,9 @@ def _detect_venue_type(entry: Entry, venue_name: str) -> VenueType:
         venue_name_lower = venue_name.lower()
         entry_type_lower = entry.type.lower()
 
-        # Check for arXiv/preprints first (highest priority)
-        if BibtexParser._is_arxiv_entry(entry):
+        # Check for preprints first (highest priority)
+        # This includes arXiv, bioRxiv, SSRN, medRxiv, Zenodo, and other legitimate repositories
+        if BibtexParser._is_preprint_entry(entry):
             return VenueType.PREPRINT
 
         # Symposium patterns (check first since they should have highest priority)
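
To see how the arXiv patterns in this file behave on the variants named in the commit message, the sketch below copies a subset of the regular expressions from the diff and runs them over single field values. `ARXIV_PATTERNS` and `matches_arxiv` are hypothetical names for this illustration. The real method joins several fields of an `Entry` before matching. The article title in the last check is made up.

```python
import re

# Subset of the arXiv patterns from the diff (anchored duplicates omitted).
ARXIV_PATTERNS = [
    r"arxiv\s*preprint\s*(?:arxiv\s*:)?\s*\d{4}\.\d{5}(?:v\d+)?",
    r"arxiv\s*preprint\s*:?\s*\d{4}\.\d{5}(?:v\d+)?",
    r"arxiv\s*e-?prints?",
    r"arxiv:\d{4}\.\d{5}(?:v\d+)?",
    r"arxiv\s*\[[^\]]+\]",
    r"arxiv\s*\([^)]*\)",
    r"\barxive?\s*preprints?\b",
    r"\barxive?\b",
]


def matches_arxiv(value: str) -> bool:
    """Return True if any arXiv pattern is found in the lowercased value."""
    content = value.lower()
    return any(re.search(p, content) for p in ARXIV_PATTERNS)


# Variants listed in the commit message, plus the classification forms
# mentioned in the pattern comments.
variants = [
    "ArXivPreprint:2510.09378",
    "ArXivPreprint",
    "ArXiv",
    "ArXive-prints",
    "ArXivpreprint",
    "arXiv [cs.LG]",
    "arXiv (Cornell University)",
]
for v in variants:
    print(f"{v!r}: {matches_arxiv(v)}")

# An ordinary journal name is left alone.
print(matches_arxiv("Journal of Machine Learning Research"))

# Made-up article title: the bare-word pattern also matches here.
print(matches_arxiv("A bibliometric study of arXiv submissions"))
```

Because `title` is among the fields joined before matching, the bare-word pattern `\barxive?\b` also matches a regular article whose title merely mentions arXiv, as the last check shows.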
diff --git a/src/aletheia_probe/models.py b/src/aletheia_probe/models.py
@@ -215,8 +215,9 @@ class BibtexAssessmentResult(BaseModel):
     entries_with_journals: int = Field(
         ..., description="Number of entries with identifiable journals"
     )
-    arxiv_entries_count: int = Field(
-        0, description="Number of entries identified as arXiv preprints"
+    preprint_entries_count: int = Field(
+        0,
+        description="Number of entries identified as legitimate preprints (arXiv, bioRxiv, SSRN, etc.)",
     )
     skipped_entries_count: int = Field(
         0, description="Number of entries skipped for other reasons"
diff --git a/tests/unit/test_bibtex_parser.py b/tests/unit/test_bibtex_parser.py
