Conversation

@jeremymanning
Member

auto verification

Implements solution for issue #37: Automated checking of bibliographic
entries against external sources to verify accuracy of metadata.

New features:
- bibverify.py: Python script to verify .bib entries against CrossRef API
- Parallel batch processing for efficient verification of large bibliographies
- DOI-based and title-based lookup strategies with fuzzy matching (see the sketch after this list)
- Comprehensive verification of titles, authors, years, journals, volumes, pages
- Detailed discrepancy reporting with suggestions for corrections
- Thread-safe parallel processing with configurable worker count
- BIBVERIFY_README.md: Complete documentation and usage guide
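
A minimal sketch of the DOI-based and title-based lookup with fuzzy matching, assuming the requests package; the function names and the 0.85 threshold shown here are illustrative, not the exact bibverify.py API:

```python
import difflib
import requests

CROSSREF = "https://api.crossref.org/works"

def title_similarity(a, b):
    """Case-insensitive fuzzy similarity between two titles (0.0-1.0)."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def lookup_entry(entry, threshold=0.85):
    """Fetch CrossRef metadata by DOI when available, else by fuzzy title search."""
    doi = entry.get("doi")
    if doi:
        resp = requests.get(f"{CROSSREF}/{doi}", timeout=30)
        if resp.ok:
            return resp.json()["message"]
    # Fall back to a title query and keep the top hit only if it is similar enough.
    resp = requests.get(CROSSREF, params={"query.title": entry["title"], "rows": 1}, timeout=30)
    if resp.ok:
        items = resp.json()["message"]["items"]
        if items and title_similarity(entry["title"], items[0].get("title", [""])[0]) >= threshold:
            return items[0]
    return None
```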

Technical details:
- Uses CrossRef REST API (170M+ records, free, unlimited)
- Supports 1-20 parallel workers for scalable performance (see the sketch after this list)
- Smart fuzzy matching with configurable similarity thresholds
- Respects API rate limits with built-in delays
- Framework in place for future auto-fix functionality
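
The parallel batch processing and built-in rate-limit delays can be sketched with the standard library as below; verify_entry is a hypothetical per-entry check standing in for the script's real comparison logic:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def verify_batch(entries, workers=10, delay=0.1):
    """Verify a dict of {key: entry} concurrently, spacing requests to respect rate limits."""
    results = {}

    def task(key, entry):
        time.sleep(delay)                # built-in delay before each request
        return key, verify_entry(entry)  # verify_entry: hypothetical single-entry check

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task, key, entry) for key, entry in entries.items()]
        for fut in as_completed(futures):
            key, report = fut.result()
            results[key] = report        # collected on the main thread, so no shared-state races
    return results
```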

Performance: Can verify 6,151 entries in ~30-60 minutes with 10 workers
(compared to 5-8 hours sequentially)

Related to: #37

Test results show:
- Processing speed: 17.5 entries/second with 10 workers
- Full bibliography (6,151 entries) estimated at ~6 minutes
- Found discrepancies in 78% of tested entries
- Common issues: missing DOIs, formatting differences, some genuine errors

This performance demonstrates that automated verification is feasible
and practical for the CDL bibliography at scale.

Major improvements based on user feedback:

1. Conservative Match Verification:
   - Requires ALL of: title ≥85%, authors ≥70%, journal ≥60%, year ≤1 difference (see the sketch after this list)
   - Rejects uncertain matches rather than reporting false positives
   - Fixes GuoEtal20 false positive (0% author match correctly rejected)

2. Focus on Metadata Accuracy:
   - Only verifies volume/pages/number when confident match found
   - Removed DOI suggestions (not needed per formatting guide)
   - Detects common errors (DOI in pages field)

3. Results Improvement:
   - 54% verified (vs 2% before)
   - 14% real errors (vs 78% false positives before)
   - No more wrong-paper suggestions
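
The conservative acceptance rule in item 1 amounts to requiring every field to clear its threshold; a sketch of that check, assuming precomputed similarity scores (the field names are illustrative, not the script's exact code):

```python
def is_confident_match(scores):
    """Accept a CrossRef candidate only if all four checks pass."""
    return (
        scores["title"] >= 0.85            # title similarity
        and scores["authors"] >= 0.70      # author-list overlap
        and scores["journal"] >= 0.60      # journal-name similarity
        and abs(scores["year_diff"]) <= 1  # publication years within one year
    )
```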

Test results on 100 entries:
- Real errors found: volume mismatches, DOI in pages, year discrepancies
- False positives eliminated: GuoEtal20, etc.
- Processing speed: ~12 entries/sec with conservative matching

Addresses feedback on issue #37.

…ication results

Changes:
- Integrated bibverify.py documentation into main README.md
- Removed standalone BIBVERIFY_README.md (now in README.md)
- Removed VERIFICATION_TEST_RESULTS.md (superseded by full report)
- Added full_verification_report.txt with complete verification results

Full verification results (6,151 entries in 6min 11sec):
- ✓ Verified: 3,988 entries (65%)
- ✗ Errors: 724 entries (12%) - real metadata issues
- ⚠ Warnings: 1,434 entries (23%) - not in CrossRef or uncertain match

Common errors found:
- Volume/issue number mismatches
- Page range errors (off-by-one)
- DOI in pages field instead of the doi field (see the sketch after this list)
- Year discrepancies (preprint vs published)
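
One way to flag the "DOI in pages field" error class, assuming entries are parsed into dicts; the regex is a generic DOI pattern and not necessarily the one bibverify.py uses:

```python
import re

DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")

def doi_in_pages(entry):
    """Return True if the pages field looks like a DOI rather than a page range."""
    return bool(DOI_PATTERN.search(entry.get("pages", "")))
```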

The bibverify tool successfully demonstrates the feasibility of automated
bibliographic verification at scale, addressing issue #37.

Removed per user request: the full report is not needed in the repository.

@jeremymanning merged commit 780a7ad into master Nov 6, 2025