Commit 1fdb2b7

docs: Add comprehensive documentation for retraction checking and acronym expansion features [AI-assisted] (#212)
Document two significant features that were implemented but undocumented:

1. Article retraction checking system:
   - Individual article retraction detection by DOI
   - Journal-level retraction statistics and risk levels
   - Three-tier data sources (local, Crossref API, caching)
   - Integration with BibTeX assessment and output formatting
   - Risk classification with rate-based and count-based thresholds
2. Conference acronym expansion system:
   - Automatic expansion during assessment (ICSE → International Conference...)
   - Intelligent acronym detection and database lookup
   - Auto-learning from parenthetical text
   - Integration with existing CLI management commands

Resolves issue #208 by providing conceptual documentation, usage examples, and integration details for these advanced features. Users can now understand and effectively utilize retraction checking and acronym expansion capabilities.

Updated Table of Contents and Key Features section to reflect new content. All quality checks pass.

Co-authored-by: florath-ai-assistant[bot] <Andreas.Florath@telekom.de>
1 parent 09e0eb9 commit 1fdb2b7

1 file changed

+190
-11
lines changed

docs/user-guide.md

Lines changed: 190 additions & 11 deletions
@@ -12,9 +12,10 @@ This comprehensive guide covers all aspects of using the Journal Assessment Tool
1212
6. [Understanding Results](#understanding-results)
1313
7. [Configuration](#configuration)
1414
8. [Conference Acronym Management](#conference-acronym-management)
15-
9. [Data Sources](#data-sources)
16-
10. [Best Practices](#best-practices)
17-
11. [Troubleshooting](#troubleshooting)
15+
9. [Article Retraction Checking](#article-retraction-checking)
16+
10. [Data Sources](#data-sources)
17+
11. [Best Practices](#best-practices)
18+
12. [Troubleshooting](#troubleshooting)
1819

1920
## Overview
2021

@@ -24,6 +25,8 @@ The Journal Assessment Tool helps researchers and institutions evaluate whether
2425

2526
- **Multi-source verification**: Combines DOAJ, Beall's List, Retraction Watch, and institutional data
2627
- **BibTeX batch processing**: Assess entire bibliographies from BibTeX files with automated exit codes
28+
- **Article retraction checking**: Identifies retracted publications by DOI with detailed statistics
29+
- **Conference acronym expansion**: Automatically expands abbreviations like "ICSE" to full conference names
2730
- **Intelligent matching**: Handles name variations, ISSNs, and publisher information
2831
- **Confidence scoring**: Provides probabilistic assessments with clear reasoning
2932
- **Fast performance**: Local caching reduces API calls and improves speed
@@ -552,18 +555,47 @@ backends:
552555
553556
## Conference Acronym Management
554557
555-
The conference acronym management feature helps expand conference abbreviations to their full names. This is particularly useful when processing bibliographic data where conferences may be referenced by common acronyms like "ICSE" (International Conference on Software Engineering) or "NIPS" (Neural Information Processing Systems).
558+
The conference acronym expansion system automatically expands conference abbreviations to their full names during assessment. This improves matching accuracy when processing bibliographic data where conferences may be referenced by common acronyms like "ICSE" (International Conference on Software Engineering) or "NIPS" (Neural Information Processing Systems).
556559
557-
### Concept
560+
### How Acronym Expansion Works
558561
559-
Conference acronyms are stored in a local database that maps short forms to their full conference names. The system automatically builds this mapping as it encounters conference data during journal assessments, and also allows manual management of acronym mappings.
562+
When you query a journal or conference name, the system automatically:
560563
561-
### Purpose
564+
1. **Detects acronyms**: Identifies when input appears to be an acronym (short, mostly uppercase)
565+
2. **Database lookup**: Searches the local acronym database for known expansions
566+
3. **Expands for matching**: Uses the full conference name for better backend matching
567+
4. **Preserves original**: Tracks the original acronym in `acronym_expanded_from` field
568+
5. **Auto-learns**: Extracts new acronym mappings from text like "International Conference on Software Engineering (ICSE)"
562569

563-
- **Standardization**: Ensure consistent conference naming across bibliographic data
564-
- **Expansion**: Convert acronyms to full names for better readability and processing
565-
- **Data Quality**: Maintain a curated database of conference name mappings
566-
- **Automation**: Reduce manual effort in processing conference references
570+
**Example workflow:**
571+
```bash
572+
# User queries an acronym
573+
aletheia-probe conference "ICSE"
574+
575+
# System automatically expands to "International Conference on Software Engineering"
576+
# Backend searches use the full name for better results
577+
# Output shows: acronym_expanded_from: "ICSE"
578+
```
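The detection, lookup, and auto-learning steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the function names `looks_like_acronym`, `extract_acronym_mapping`, and `expand`, the 60% uppercase threshold, and the 10-character limit are all assumptions chosen for the example.

```python
import re
from typing import Dict, Optional, Tuple


def looks_like_acronym(text: str, max_len: int = 10) -> bool:
    """Hypothetical heuristic: one short token whose letters are mostly uppercase."""
    candidate = text.strip()
    if not (2 <= len(candidate) <= max_len) or " " in candidate:
        return False
    letters = [c for c in candidate if c.isalpha()]
    if not letters:
        return False
    upper = sum(1 for c in letters if c.isupper())
    return upper / len(letters) >= 0.6  # assumed threshold for "mostly uppercase"


# Matches "Full Conference Name (ACRONYM)" with the acronym at the end.
_PAREN = re.compile(r"^(?P<name>.+?)\s*\((?P<acronym>[^()]+)\)\s*$")


def extract_acronym_mapping(text: str) -> Optional[Tuple[str, str]]:
    """Return (acronym, full_name) learned from parenthetical text, or None."""
    match = _PAREN.match(text.strip())
    if not match:
        return None
    acronym = match.group("acronym").strip()
    if not looks_like_acronym(acronym):
        return None
    return acronym, match.group("name").strip()


def expand(query: str, database: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Return (name_used_for_matching, acronym_expanded_from)."""
    q = query.strip()
    if looks_like_acronym(q) and q in database:
        return database[q], q
    return q, None
```

In this sketch, the second element of the tuple returned by `expand` plays the role of the `acronym_expanded_from` field: it is `None` unless an expansion actually took place.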
579+
580+
### Automatic vs Manual Acronym Management
581+
582+
**Automatic expansion** happens during assessment:
583+
- Input normalization checks if query looks like an acronym
584+
- Local database provides expansions for better matching
585+
- Parenthetical text like "(ICML)" automatically creates new mappings
586+
587+
**Manual management** allows database control:
588+
- Pre-populate common acronyms for your field
589+
- Correct automatic mappings
590+
- Add institution-specific abbreviations
591+
- View and clear stored mappings
592+
593+
### Benefits
594+
595+
- **Better Matching**: Full names improve backend search accuracy
596+
- **Standardization**: Consistent conference naming across bibliographic data
597+
- **Automation**: Reduces manual effort in processing conference references
598+
- **Transparency**: Original acronyms preserved in assessment results
567599

568600
### Available Commands
569601

@@ -672,6 +704,153 @@ The acronym database integrates automatically with journal assessment workflows.
672704

673705
For implementation details, see `src/aletheia_probe/cli.py`.
674706

707+
## Article Retraction Checking
708+
709+
The article retraction checking system identifies retracted academic publications by their DOI and integrates retraction data into assessment results. This helps users identify potentially problematic articles in their bibliographies and provides journal-level retraction statistics as quality indicators.
710+
711+
### What Retraction Checking Does
712+
713+
The system performs two levels of retraction analysis:
714+
715+
1. **Article-level checking**: Identifies individual retracted articles by DOI
716+
2. **Journal-level statistics**: Calculates retraction rates and risk levels for journals
717+
718+
When processing BibTeX files, any articles with DOIs are automatically checked against retraction databases. Retracted articles are flagged with clear warnings and detailed retraction information.
719+
720+
### How It Works
721+
722+
#### Data Sources (Three-Tier Lookup)
723+
724+
The system checks multiple sources in priority order:
725+
726+
1. **Local Retraction Watch Database** (fastest)
727+
- Pre-populated from Retraction Watch CSV dataset during sync
728+
- Contains more than 50,000 retracted articles indexed by DOI
729+
- Checked first for immediate results
730+
731+
2. **Crossref API** (real-time)
732+
- Queries Crossref metadata for retraction notices
733+
- Looks for 'updated-by' and 'update-to' fields
734+
- Used as fallback when local database doesn't have the article
735+
736+
3. **Intelligent Caching**
737+
- 30-day cache for all retraction checks
738+
- Negative results (non-retracted) also cached
739+
- Prevents redundant API calls
740+
741+
#### Retraction Detection Process
742+
743+
```text
744+
# For each article with a DOI:
745+
1. Check cache (30-day TTL) → Return if found
746+
2. Check local database → Cache result if retracted
747+
3. Query Crossref API → Cache result if retracted
748+
4. Cache negative result → Mark as not retracted
749+
```
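The lookup order above can be sketched as a small Python class. This is a hedged illustration under stated assumptions, not the code in `article_retraction_checker.py`: the class name `RetractionLookup`, the in-memory dictionaries, and the injected `remote_check` callable (standing in for the Crossref query) are hypothetical. For simplicity it caches every result, positive or negative, in one place.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

CACHE_TTL_SECONDS = 30 * 24 * 60 * 60  # 30-day TTL, as described above


@dataclass
class RetractionLookup:
    """Hypothetical three-tier lookup: cache, then local data, then remote API."""

    local_db: Dict[str, dict]  # assumed: lowercase DOI -> retraction details
    remote_check: Callable[[str], Optional[dict]]  # stand-in for the Crossref query
    clock: Callable[[], float] = time.time
    _cache: Dict[str, Tuple[float, Optional[dict]]] = field(
        default_factory=dict, init=False
    )

    def check(self, doi: str) -> Optional[dict]:
        """Return retraction details, or None if the article is not retracted."""
        key = doi.strip().lower()  # DOIs are case-insensitive
        now = self.clock()

        # 1. Cache hit within the TTL: return it, including negative results.
        cached = self._cache.get(key)
        if cached is not None and now - cached[0] < CACHE_TTL_SECONDS:
            return cached[1]

        # 2. Local database first, 3. remote lookup only as a fallback.
        info = self.local_db.get(key)
        if info is None:
            info = self.remote_check(key)

        # 4. Cache whatever was found; None marks "not retracted".
        self._cache[key] = (now, info)
        return info
```

Injecting `remote_check` and `clock` keeps the sketch testable without network access; a real implementation would presumably persist the cache rather than hold it in memory.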
750+
751+
### Retraction Information Provided
752+
753+
For retracted articles, the system provides:
754+
755+
- **Retraction status**: Whether the article has been retracted
756+
- **Retraction type**: Misconduct, error, plagiarism, etc.
757+
- **Retraction date**: When the retraction was issued
758+
- **Retraction DOI**: DOI of the retraction notice (if available)
759+
- **Retraction reason**: Human-readable explanation
760+
- **Source**: Where the retraction data was found
761+
762+
### Journal-Level Retraction Statistics
763+
764+
The system calculates comprehensive statistics for journals:
765+
766+
#### Risk Level Classification
767+
768+
Risk levels are calculated using rate-based thresholds when publication data is available; otherwise, the system falls back to absolute count thresholds:
769+
770+
**Rate-Based Risk Levels** (preferred):
771+
772+
| Level | Overall Rate | Recent Rate | Meaning |
773+
|-------|--------------|-------------|---------|
774+
| **critical** | ≥3.0% | ≥4.0% | Very high (25x+ normal rate) |
775+
| **high** | ≥1.5% | ≥2.5% | High (10x+ normal rate) |
776+
| **moderate** | ≥0.8% | ≥1.2% | Concerning (5x+ normal rate) |
777+
| **low** | ≥0.1% | ≥0.2% | Elevated (2-3x normal rate) |
778+
| **note** | <0.1% | <0.2% | Within normal range |
779+
| **none** | 0 | 0 | No retractions |
780+
781+
**Count-Based Fallback** (when publication data unavailable):
782+
783+
| Level | Total Count | Recent Count |
784+
|-------|------------|--------------|
785+
| **critical** | ≥21 | ≥10 |
786+
| **high** | ≥11 | ≥5 |
787+
| **moderate** | ≥6 | ≥3 |
788+
| **low** | ≥2 | ≥2 |
789+
790+
**Baseline for comparison**: Research literature average is 0.02-0.04% retraction rate (1 per 2,500-5,000 articles).
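The two threshold tables can be read as a simple top-down classifier. The sketch below is an illustration only: the function name `classify_risk` is hypothetical, and it assumes a level applies when either the overall or the recent threshold is met, which the tables do not state explicitly. For the count-based fallback it also assumes that zero retractions map to `none` and a single retraction maps to `note`.

```python
from typing import Optional

# (level, overall threshold, recent threshold), checked from most to least severe.
RATE_THRESHOLDS = [
    ("critical", 3.0, 4.0),
    ("high", 1.5, 2.5),
    ("moderate", 0.8, 1.2),
    ("low", 0.1, 0.2),
]
COUNT_THRESHOLDS = [
    ("critical", 21, 10),
    ("high", 11, 5),
    ("moderate", 6, 3),
    ("low", 2, 2),
]


def classify_risk(
    total: int,
    recent: int,
    total_publications: Optional[int] = None,
    recent_publications: Optional[int] = None,
) -> str:
    """Return a risk level; rates are percentages of published articles."""
    if total == 0:
        return "none"

    if total_publications:
        # Rate-based path (preferred when publication counts are known).
        overall_rate = 100.0 * total / total_publications
        recent_rate = (
            100.0 * recent / recent_publications if recent_publications else 0.0
        )
        for level, overall_min, recent_min in RATE_THRESHOLDS:
            # Assumption: meeting either threshold is enough for the level.
            if overall_rate >= overall_min or recent_rate >= recent_min:
                return level
        return "note"

    # Count-based fallback when publication data is unavailable.
    for level, total_min, recent_min in COUNT_THRESHOLDS:
        if total >= total_min or recent >= recent_min:
            return level
    return "note"
```

As a worked example of the rate arithmetic, 3 retractions out of 6,000 publications give 100 × 3 / 6000 = 0.05%, which falls below the 0.1% `low` threshold and is therefore classified as `note` in this sketch.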
791+
792+
#### Statistics Meaning
793+
794+
- **total_retractions**: Cumulative retracted articles (all time)
795+
- **recent_retractions**: Retractions in last 2 years (quality trend)
796+
- **very_recent_retractions**: Retractions in last 1 year (current trend)
797+
- **retraction_rate**: Percentage of all articles retracted
798+
- **recent_retraction_rate**: Percentage of recent articles retracted
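To make the counters concrete, here is a minimal sketch of how they could be derived from a list of retraction dates. The function `retraction_stats` is hypothetical, and the windows are approximated as 365 and 730 days, which is an assumption rather than the tool's documented behaviour. `recent_retraction_rate` is left out because it needs a count of recent publications that this example does not model.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, Optional


def retraction_stats(
    retraction_dates: Iterable[date],
    total_publications: Optional[int] = None,
    today: Optional[date] = None,
) -> Dict[str, Optional[float]]:
    """Compute journal-level counters from individual retraction dates."""
    today = today or date.today()
    dates = list(retraction_dates)

    # Assumed window definitions: 2 years ~ 730 days, 1 year ~ 365 days.
    recent_cutoff = today - timedelta(days=2 * 365)
    very_recent_cutoff = today - timedelta(days=365)

    stats: Dict[str, Optional[float]] = {
        "total_retractions": len(dates),
        "recent_retractions": sum(d >= recent_cutoff for d in dates),
        "very_recent_retractions": sum(d >= very_recent_cutoff for d in dates),
        "retraction_rate": None,  # stays None without publication data
    }
    if total_publications:
        stats["retraction_rate"] = 100.0 * len(dates) / total_publications
    return stats
```

Passing `today` explicitly makes the result reproducible, which is convenient when comparing statistics across runs.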
799+
800+
### Retraction Information in Output
801+
802+
#### BibTeX Assessment Output
803+
804+
Retracted articles are clearly marked in batch assessments:
805+
806+
```
807+
❌ Journal Name (predatory, confidence: 0.85) 🚫 RETRACTED
808+
🚫 RETRACTED: type=misconduct, date=2023-01-15
809+
Reason: Data fabrication and falsification
810+
```
811+
812+
#### Journal Assessment Output
813+
814+
Journal-level retraction risk appears in reasoning:
815+
816+
```
817+
⚠️ CRITICAL retraction risk: 45 retractions (18 recent) = 1.23% rate (3,654 publications)
818+
⚠️ Moderate retraction risk: 8 retractions (2 recent) = 0.15% rate (5,300 publications)
819+
📊 3 retraction(s): 0.05% rate (within normal range for 6,000 publications)
820+
```
821+
822+
#### Summary Statistics
823+
824+
For BibTeX files, the summary includes:
825+
- Count of retracted articles found
826+
- Number of articles checked (those with DOIs)
827+
- Example: "3 retracted articles found (of 47 articles with DOIs checked)"
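The summary line follows directly from the per-entry flags. The sketch below assumes a simplified stand-in record, `EntrySketch`, rather than the project's actual `BibtexEntry` class, and a hypothetical helper `summarize`; only entries that carry a DOI count as checked.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class EntrySketch:
    """Hypothetical stand-in for a parsed BibTeX entry."""

    key: str
    doi: Optional[str] = None
    is_retracted: bool = False


def summarize(entries: Iterable[EntrySketch]) -> str:
    """Build the summary line from entries that have a DOI."""
    checked = [e for e in entries if e.doi]
    retracted = sum(1 for e in checked if e.is_retracted)
    noun = "article" if retracted == 1 else "articles"
    return (
        f"{retracted} retracted {noun} found "
        f"(of {len(checked)} articles with DOIs checked)"
    )
```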
828+
829+
### Integration with Assessment Results
830+
831+
**BibtexEntry fields**:
832+
- `is_retracted`: Boolean flag indicating retraction status
833+
- `retraction_info`: Complete retraction details if retracted
834+
835+
**BibtexAssessmentResult fields**:
836+
- `retracted_articles_count`: Number of retracted articles found
837+
- `articles_checked_for_retraction`: Number of articles with DOIs
838+
839+
**Journal assessment**:
840+
- Retraction statistics influence confidence scores
841+
- High retraction rates trigger warnings in reasoning
842+
- Risk level affects overall journal assessment
843+
844+
### Data Management
845+
846+
Retraction data is kept current through:
847+
848+
- **Weekly sync**: Updates local Retraction Watch database
849+
- **On-demand queries**: Real-time Crossref API checks
850+
- **Intelligent caching**: Reduces API load while maintaining accuracy
851+
852+
For implementation details, see `src/aletheia_probe/article_retraction_checker.py` and `src/aletheia_probe/backends/retraction_watch.py`.
853+
675854
## Data Sources
676855

677856
### DOAJ (Directory of Open Access Journals)
