This repository was archived by the owner on Feb 11, 2026. It is now read-only.

Implement fallback sources for full text retrieval #222

@turbomam

Problem

Current full-text retrieval tries a single source and fails if that source doesn't have the content. This results in unnecessary failures when content is available from alternative sources.

Current Behavior

If Europe PMC doesn't have full text → operation fails

Proposed Behavior

Implement a waterfall fallback pattern:

Europe PMC → Unpaywall → BioC → PMC Text → Failure

Implementation

Fallback Chain

from typing import Optional


async def get_full_text_with_fallback(doi: str, email: Optional[str] = None) -> dict:
    """Try multiple sources in order until one returns content."""

    # Assumes each fetcher is a coroutine function. The lambdas defer the call,
    # so later sources are only contacted when earlier ones fail.
    sources = [
        ("Europe PMC", lambda: get_europepmc_full_text(doi)),
        ("Unpaywall", lambda: get_unpaywall_full_text(doi, email)),
        ("BioC", lambda: get_bioc_full_text(doi)),
        ("PMC Text", lambda: get_pmc_text(doi)),
    ]

    attempted = []
    errors = []
    for source_name, fetch_fn in sources:
        attempted.append(source_name)
        try:
            result = await fetch_fn()
        except Exception as e:
            errors.append(f"{source_name}: {e}")
            continue
        if result:
            return {
                "content": result,
                "source": source_name,
                "fallback_used": len(attempted) > 1,
                "attempted_sources": attempted,
            }
        # An empty result is a miss too; record it so the final error is complete.
        errors.append(f"{source_name}: no content")

    raise ValueError(f"All sources failed: {', '.join(errors)}")
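
One case the chain has to handle is a source that responds but returns nothing, without raising. The sketch below exercises that path end to end. It is a self-contained illustration, not the project's code: `first_available`, the `Fetcher` alias, and the three `stub_*` coroutines are hypothetical stand-ins for the real fetchers, and it assumes the fetchers are coroutine functions.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical alias: a zero-argument coroutine function returning text or None.
Fetcher = Callable[[], Awaitable[Optional[str]]]


async def first_available(sources: list[tuple[str, Fetcher]]) -> dict:
    """Return the first non-empty result, recording every source tried."""
    attempted: list[str] = []
    errors: list[str] = []
    for name, fetch in sources:
        attempted.append(name)
        try:
            content = await fetch()
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            continue
        if content:
            return {
                "content": content,
                "source": name,
                "fallback_used": len(attempted) > 1,
                "attempted_sources": attempted,
            }
        errors.append(f"{name}: no content")
    raise ValueError(f"All sources failed: {', '.join(errors)}")


# Hypothetical stubs standing in for the real fetchers.
async def stub_europepmc() -> Optional[str]:
    raise LookupError("no full text")


async def stub_unpaywall() -> Optional[str]:
    return None  # reachable, but nothing to return


async def stub_bioc() -> Optional[str]:
    return "<passage>...</passage>"


def demo() -> dict:
    """Run the chain against the stubs: one error, one empty result, one hit."""
    return asyncio.run(first_available([
        ("Europe PMC", stub_europepmc),
        ("Unpaywall", stub_unpaywall),
        ("BioC", stub_bioc),
    ]))
```

Tracking attempted sources separately from errors keeps `attempted_sources` and `fallback_used` accurate even when a source fails silently.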

Fallback Strategy by Content Type

Full Text:

  1. Europe PMC (fastest, best quality)
  2. PMC Text API (good quality, free)
  3. BioC XML (structured, but needs parsing)
  4. Unpaywall PDF → Extract (slowest, requires email)

PDF Access:

  1. Unpaywall (open access, requires email)
  2. Europe PMC PDF
  3. PMC supplementary materials

Metadata:

  1. Europe PMC (comprehensive)
  2. CrossRef (DOI metadata)
  3. PubMed (PMID metadata)
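
The per-content-type orders above could be held as plain data, so that the chain is looked up rather than hard-coded. This is a minimal sketch under assumptions: the `ContentType` enum, the source identifiers, and `chain_for` are hypothetical names, and skipping email-dependent sources when no email is supplied is one possible policy, not something the proposal specifies.

```python
from enum import Enum
from typing import Optional


class ContentType(Enum):
    FULL_TEXT = "full_text"
    PDF = "pdf"
    METADATA = "metadata"


# Orders copied from the lists above; the identifiers themselves are hypothetical.
FALLBACK_CHAINS = {
    ContentType.FULL_TEXT: ["europepmc", "pmc_text", "bioc", "unpaywall_pdf"],
    ContentType.PDF: ["unpaywall", "europepmc_pdf", "pmc_supplementary"],
    ContentType.METADATA: ["europepmc", "crossref", "pubmed"],
}

# Sources the lists above mark as requiring an email address.
EMAIL_REQUIRED = {"unpaywall", "unpaywall_pdf"}


def chain_for(content_type: ContentType, email: Optional[str] = None) -> list[str]:
    """Return the ordered sources, dropping email-dependent ones when no email is given."""
    return [
        source
        for source in FALLBACK_CHAINS[content_type]
        if email or source not in EMAIL_REQUIRED
    ]
```

For example, `chain_for(ContentType.FULL_TEXT)` omits the Unpaywall step, while `chain_for(ContentType.PDF, "user@example.org")` keeps Unpaywall first.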

Configuration

Allow users to configure the chain through environment variables:

# Disable specific sources
export ARTL_DISABLE_SOURCES="unpaywall,bioc"

# Preferred source order
export ARTL_SOURCE_PRIORITY="europepmc,pmc,unpaywall"

# Max fallback attempts
export ARTL_MAX_FALLBACK_ATTEMPTS=3
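
A sketch of how these three variables might be combined into a final source list. The variable names come from the examples above. Everything else is an assumption: `resolve_sources` and `DEFAULT_ORDER` are hypothetical, unlisted sources are appended after the prioritized ones rather than dropped, and `ARTL_MAX_FALLBACK_ATTEMPTS` is read as a cap on the total number of sources tried.

```python
import os
from typing import Mapping, Optional

# Hypothetical default order, using the short names from the examples above.
DEFAULT_ORDER = ["europepmc", "unpaywall", "bioc", "pmc"]


def _csv(value: str) -> list[str]:
    """Split a comma-separated value into lowercase, non-empty names."""
    return [item.strip().lower() for item in value.split(",") if item.strip()]


def resolve_sources(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Apply priority, then the disable list, then the attempt cap."""
    env = os.environ if env is None else env
    disabled = set(_csv(env.get("ARTL_DISABLE_SOURCES", "")))
    priority = _csv(env.get("ARTL_SOURCE_PRIORITY", ""))

    # Known prioritized sources first (deduplicated), then remaining defaults.
    ordered = list(dict.fromkeys(s for s in priority if s in DEFAULT_ORDER))
    ordered += [s for s in DEFAULT_ORDER if s not in ordered]

    enabled = [s for s in ordered if s not in disabled]
    # A non-numeric cap raises ValueError; a real implementation may prefer a default.
    limit = int(env.get("ARTL_MAX_FALLBACK_ATTEMPTS", len(enabled)))
    return enabled[:max(limit, 0)]
```

Passing the environment as a mapping keeps the function easy to test without touching `os.environ`.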

User Feedback

Inform users when fallback is used:

{
  "content": "...",
  "source": "Unpaywall",
  "note": "Primary source (Europe PMC) unavailable, used fallback",
  "attempted_sources": ["Europe PMC", "Unpaywall"]
}
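
The response above could be assembled by a small helper that adds the note only when a fallback was actually used. This is a sketch: `build_response` is a hypothetical name, and treating the first attempted source as the primary is an assumption.

```python
from typing import Any


def build_response(content: str, source: str, attempted: list[str]) -> dict[str, Any]:
    """Build the result payload; attach a note only when a fallback was used."""
    response: dict[str, Any] = {
        "content": content,
        "source": source,
        "attempted_sources": list(attempted),
    }
    if len(attempted) > 1:
        primary = attempted[0]  # assumed to be the primary source
        response["note"] = f"Primary source ({primary}) unavailable, used fallback"
    return response
```

Omitting the `note` key on a first-try success keeps the common-case payload unchanged for existing callers.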

Source Characteristics

| Source | Speed | Quality | Coverage | Email Required |
|---|---|---|---|---|
| Europe PMC | Fast | High | PMC papers | No |
| PMC Text | Fast | High | PMC papers | No |
| BioC | Medium | High | PubMed subset | No |
| Unpaywall PDF | Slow | Medium | Open access | Yes |

Testing

  • Test each fallback scenario
  • Verify error messages
  • Test with sources disabled
  • Test source priority configuration
  • Verify performance (no added latency or extra requests when the first source succeeds)
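
Several of these checks can be written with a test double that counts its calls. The sketch below is illustrative only: `RecordingSource` is a hypothetical helper, and `run_chain` is a simplified stand-in for the function under test, not the proposed implementation.

```python
import asyncio


class RecordingSource:
    """Hypothetical test double that records how often it is called."""

    def __init__(self, name, result=None, error=None):
        self.name, self.result, self.error = name, result, error
        self.calls = 0

    async def __call__(self):
        self.calls += 1
        if self.error is not None:
            raise self.error
        return self.result


async def run_chain(sources):
    """Stand-in for the function under test: first non-empty result wins."""
    for source in sources:
        try:
            result = await source()
        except Exception:
            continue
        if result:
            return {"content": result, "source": source.name}
    raise ValueError("All sources failed")


def test_first_source_short_circuits():
    primary = RecordingSource("Europe PMC", result="full text")
    backup = RecordingSource("Unpaywall", result="pdf text")
    out = asyncio.run(run_chain([primary, backup]))
    assert out["source"] == "Europe PMC"
    assert backup.calls == 0  # no extra requests when the first source works


def test_falls_back_on_error():
    primary = RecordingSource("Europe PMC", error=TimeoutError("timed out"))
    backup = RecordingSource("Unpaywall", result="pdf text")
    out = asyncio.run(run_chain([primary, backup]))
    assert out["source"] == "Unpaywall"
    assert primary.calls == 1 and backup.calls == 1


def test_all_sources_fail():
    sources = [
        RecordingSource("Europe PMC", error=LookupError("missing")),
        RecordingSource("Unpaywall", result=None),
    ]
    try:
        asyncio.run(run_chain(sources))
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Asserting on call counts covers the performance item directly: if the first source succeeds, no later source should be contacted at all.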

Benefits

  • Higher success rate: ~70% → ~90% for full text retrieval
  • Better UX: Users get content instead of errors
  • Transparency: Users know which source was used

Priority

High - Significantly improves reliability

Labels: enhancement (New feature or request)