Problem
Full-text retrieval currently tries a single source and fails if that source lacks the content, even when the same content is available from alternative sources.
Current Behavior
If Europe PMC doesn't have full text → operation fails
Proposed Behavior
Implement a waterfall fallback pattern:
Europe PMC → Unpaywall → BioC → PMC Text → Failure
Implementation
Fallback Chain
from typing import Optional

async def get_full_text_with_fallback(doi: str, email: Optional[str] = None) -> dict:
    """Try multiple sources in order until one succeeds."""
    # Each entry defers its call, so later sources are only contacted on fallback.
    # Assumption: the get_* helpers are coroutines, hence the await below.
    sources = [
        ("Europe PMC", lambda: get_europepmc_full_text(doi)),
        ("Unpaywall", lambda: get_unpaywall_full_text(doi, email)),
        ("BioC", lambda: get_bioc_full_text(doi)),
        ("PMC Text", lambda: get_pmc_text(doi)),
    ]
    errors = []
    attempted = []
    for source_name, fetch_fn in sources:
        attempted.append(source_name)
        try:
            result = await fetch_fn()
        except Exception as e:
            errors.append(f"{source_name}: {e}")
            continue
        if result:
            return {
                "content": result,
                "source": source_name,
                "fallback_used": len(attempted) > 1,
                "attempted_sources": attempted,
            }
        # An empty result counts as a miss: record it and try the next source.
        errors.append(f"{source_name}: no content returned")
    raise ValueError(f"All sources failed: {', '.join(errors)}")
Fallback Strategy by Content Type
Full Text:
- Europe PMC (fastest, best quality)
- PMC Text API (good quality, free)
- BioC XML (structured, but needs parsing)
- Unpaywall PDF → Extract (slowest, requires email)
PDF Access:
- Unpaywall (open access, requires email)
- Europe PMC PDF
- PMC supplementary materials
Metadata:
- Europe PMC (comprehensive)
- CrossRef (DOI metadata)
- PubMed (PMID metadata)
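The per-content-type orderings above could be kept as plain data so that the fallback chain stays generic. This is only a sketch: the `FALLBACK_ORDER` table, the content-type keys, and `sources_for` are hypothetical names, not part of the existing codebase.

```python
from typing import Dict, List

# Hypothetical lookup table mirroring the lists above; keys are illustrative.
FALLBACK_ORDER: Dict[str, List[str]] = {
    "full_text": ["Europe PMC", "PMC Text API", "BioC XML", "Unpaywall PDF"],
    "pdf": ["Unpaywall", "Europe PMC PDF", "PMC supplementary materials"],
    "metadata": ["Europe PMC", "CrossRef", "PubMed"],
}


def sources_for(content_type: str) -> List[str]:
    """Return a copy of the ordered source list for one content type."""
    try:
        return list(FALLBACK_ORDER[content_type])
    except KeyError:
        raise ValueError(f"Unknown content type: {content_type!r}") from None
```

Returning a copy keeps callers from accidentally reordering the shared table.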
Configuration
Allow users to:
# Disable specific sources
export ARTL_DISABLE_SOURCES="unpaywall,bioc"
# Preferred source order
export ARTL_SOURCE_PRIORITY="europepmc,pmc,unpaywall"
# Max fallback attempts
export ARTL_MAX_FALLBACK_ATTEMPTS=3
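One possible way to turn these variables into an ordered source list is sketched below. It makes several assumptions that the issue does not settle: the source keys and default order are hypothetical, sources named in `ARTL_SOURCE_PRIORITY` are tried first with the remaining defaults appended, disabled sources are removed before the attempt limit is applied, and unknown or malformed values are ignored.

```python
import os
from typing import List, Mapping, Optional

# Hypothetical source keys and default order; the issue does not define them.
DEFAULT_ORDER = ["europepmc", "unpaywall", "bioc", "pmc"]


def _split_csv(value: Optional[str]) -> List[str]:
    """Split a comma-separated value into lowercase, non-empty tokens."""
    if not value:
        return []
    return [token.strip().lower() for token in value.split(",") if token.strip()]


def resolve_source_order(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the source keys to try, in order, honoring the ARTL_* variables."""
    env = os.environ if env is None else env
    disabled = set(_split_csv(env.get("ARTL_DISABLE_SOURCES")))
    preferred = [s for s in _split_csv(env.get("ARTL_SOURCE_PRIORITY")) if s in DEFAULT_ORDER]
    # Preferred sources come first; remaining defaults keep their usual order.
    ordered = list(dict.fromkeys(preferred + DEFAULT_ORDER))
    ordered = [s for s in ordered if s not in disabled]
    raw_limit = env.get("ARTL_MAX_FALLBACK_ATTEMPTS")
    if raw_limit and raw_limit.strip().isdigit():
        ordered = ordered[: int(raw_limit)]
    return ordered
```

Passing the environment as a mapping keeps the function easy to test without touching `os.environ`.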
User Feedback
Inform users when fallback is used:
{
  "content": "...",
  "source": "Unpaywall",
  "note": "Primary source (Europe PMC) unavailable, used fallback",
  "attempted_sources": ["Europe PMC", "Unpaywall"]
}
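A response of this shape could be assembled by a small helper that adds the note only when the first attempted source was not the one that succeeded. The function name `build_response` and its signature are hypothetical; the field names follow the example above.

```python
from typing import Any, Dict, List


def build_response(content: str, source: str, attempted_sources: List[str]) -> Dict[str, Any]:
    """Build the user-facing payload, adding a note only when a fallback was used."""
    response: Dict[str, Any] = {
        "content": content,
        "source": source,
        "attempted_sources": attempted_sources,
    }
    if attempted_sources and attempted_sources[0] != source:
        primary = attempted_sources[0]
        response["note"] = f"Primary source ({primary}) unavailable, used fallback"
    return response
```

When the primary source succeeds, the payload carries no `note` key, so clients can treat its presence as the fallback signal.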
Source Characteristics
| Source | Speed | Quality | Coverage | Email Required |
| --- | --- | --- | --- | --- |
| Europe PMC | Fast | High | PMC papers | No |
| PMC Text | Fast | High | PMC papers | No |
| BioC | Medium | High | PubMed subset | No |
| Unpaywall PDF | Slow | Medium | Open access | Yes |
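The table above could also be held as data, which would let the chain skip sources whose requirements are not met (for example, Unpaywall when no email is supplied). The `SourceInfo` class and `eligible_sources` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SourceInfo:
    """One row of the source characteristics table."""
    name: str
    speed: str
    quality: str
    coverage: str
    email_required: bool


SOURCES = [
    SourceInfo("Europe PMC", "Fast", "High", "PMC papers", False),
    SourceInfo("PMC Text", "Fast", "High", "PMC papers", False),
    SourceInfo("BioC", "Medium", "High", "PubMed subset", False),
    SourceInfo("Unpaywall PDF", "Slow", "Medium", "Open access", True),
]


def eligible_sources(email: Optional[str] = None) -> List[str]:
    """Return source names usable with the given email (or without one)."""
    return [s.name for s in SOURCES if email or not s.email_required]
```

Filtering up front avoids spending a fallback attempt on a source that cannot be queried.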
Testing
- Test each fallback scenario
- Verify error messages
- Test with sources disabled
- Test source priority configuration
- Verify performance: no extra requests or added latency when the first source succeeds
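Several of these scenarios can be exercised without network access by replacing each source with a stub that records when it is called. The sketch below uses a simplified stand-in for the fallback chain (`first_successful`) rather than the real function, and `make_stub` is a hypothetical test helper.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Tuple

Fetcher = Callable[[], Awaitable[Optional[str]]]


async def first_successful(sources: List[Tuple[str, Fetcher]]) -> Tuple[str, str]:
    """Simplified stand-in for the fallback chain: return (source_name, content)."""
    errors = []
    for name, fetch in sources:
        try:
            result = await fetch()
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            continue
        if result:
            return name, result
        errors.append(f"{name}: no content returned")
    raise ValueError(f"All sources failed: {', '.join(errors)}")


def make_stub(calls: List[str], name: str, result: Optional[str] = None,
              error: Optional[Exception] = None) -> Fetcher:
    """Build a fake source that records each call, then returns or raises."""
    async def stub() -> Optional[str]:
        calls.append(name)
        if error is not None:
            raise error
        return result
    return stub


def test_first_source_short_circuits() -> None:
    calls: List[str] = []
    sources = [("A", make_stub(calls, "A", result="text")),
               ("B", make_stub(calls, "B", result="other"))]
    assert asyncio.run(first_successful(sources)) == ("A", "text")
    assert calls == ["A"]  # B is never contacted when A succeeds


def test_falls_back_after_failure() -> None:
    calls: List[str] = []
    sources = [("A", make_stub(calls, "A", error=RuntimeError("404"))),
               ("B", make_stub(calls, "B", result="text"))]
    assert asyncio.run(first_successful(sources)) == ("B", "text")
    assert calls == ["A", "B"]


def test_all_sources_fail_lists_each_error() -> None:
    calls: List[str] = []
    sources = [("A", make_stub(calls, "A", error=RuntimeError("404"))),
               ("B", make_stub(calls, "B", result=None))]
    try:
        asyncio.run(first_successful(sources))
    except ValueError as exc:
        assert "A: 404" in str(exc) and "B: no content returned" in str(exc)
    else:
        raise AssertionError("expected ValueError")
```

Recording calls in a shared list makes the "don't slow down" requirement checkable as a plain assertion on which sources were contacted.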
Benefits
- Higher success rate: ~70% → ~90% for full text retrieval
- Better UX: Users get content instead of errors
- Transparency: Users know which source was used
Priority
High - Significantly improves reliability
Related