feat(lit-sync): opt-in fulltext-retrieval phase + consolidate OA cascade to fetch_oa.py #226

Open
Yoojin-nam wants to merge 1 commit into main from feat/lit-sync-fulltext-phase

@Yoojin-nam

Contributor

Summary

Adds an opt-in Phase 2.7 (Fulltext Retrieval) to /lit-sync and makes
/fulltext-retrieval's fetch_oa.py the single authored home of the open-access cascade.

Phase 2.7 orchestrates two complementary routes and reconciles them in
references/fulltext_retrieval.json (a new /lit-sync sole-writer artifact):

  • Disk OA PDFs — delegates to the /fulltext-retrieval engine (no re-implementation;
    invoked via a ${MEDSCI_SKILLS_ROOT:-$HOME/workspace/medsci-skills} path contract, no import/vendor).
  • In-library Zotero-native PDFs — a user-run references/find_available_pdf.js snippet that
    triggers Zotero's own addAvailablePDF/addAvailablePDFs, reusing the user's own
    proxy/OpenURL config. No credentials, proxy hosts, or institutional identifiers enter the skill.
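
For readers less familiar with the shell `${VAR:-default}` form used in the path contract, here is a minimal Python sketch of the same fallback. The helper name `resolve_skills_root` is hypothetical and is not part of this PR; only the variable name and the default path come from the description above.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_skills_root(env: Optional[Mapping[str, str]] = None) -> Path:
    """Mirror ${MEDSCI_SKILLS_ROOT:-$HOME/workspace/medsci-skills}.

    Hypothetical helper for illustration only. `env` defaults to the
    process environment; passing a dict makes the function easy to test.
    """
    env = os.environ if env is None else env
    root = env.get("MEDSCI_SKILLS_ROOT")
    if root:
        # `:-` uses the default when the variable is unset OR empty,
        # so an empty string must fall through to the default as well.
        return Path(root)
    home = env.get("HOME") or str(Path.home())
    return Path(home) / "workspace" / "medsci-skills"


if __name__ == "__main__":
    print(resolve_skills_root())
```

Because the engine is located by path rather than imported or vendored, the only coupling between the two skills is this one resolved directory.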

Engine (fetch_oa.py) enhancements

  • Worklist parsing: TSV/CSV/Markdown-table + Title column (was plain/TSV only).
  • Direct arXiv resolution for 10.48550/arXiv.* DOIs (new/old-style, version suffixes).
  • --report retrieval_report.json (schema_version + per-DOI status/source/title_match).
  • Pure, offline-testable normalize_title/title_overlap/classify_title_match/build_report;
    best-effort pdftotext title cross-check flags mislabeled PDFs (tri-state match|mismatch|unavailable,
    never auto-rejects).
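
To make the tri-state title cross-check concrete, the sketch below shows one way such pure helpers could behave. The function names match the ones listed above, but the bodies, the token-overlap metric, and the 0.6 threshold are assumptions for illustration, not the code in fetch_oa.py.

```python
import re
import unicodedata
from typing import Optional, Set


def normalize_title(title: str) -> Set[str]:
    """Lowercase, strip accents and punctuation, return the set of word tokens."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def title_overlap(expected: str, extracted_text: str) -> float:
    """Fraction of the expected title's tokens that appear in the extracted text."""
    want = normalize_title(expected)
    if not want:
        return 0.0
    have = normalize_title(extracted_text)
    return len(want & have) / len(want)


def classify_title_match(
    expected: Optional[str],
    extracted_text: Optional[str],
    threshold: float = 0.6,  # assumed cutoff, not taken from the PR
) -> str:
    """Return 'match', 'mismatch', or 'unavailable'; never rejects a file."""
    if not expected or not extracted_text or not normalize_title(expected):
        # No worklist title, or pdftotext produced nothing usable.
        return "unavailable"
    score = title_overlap(expected, extracted_text)
    return "match" if score >= threshold else "mismatch"


if __name__ == "__main__":
    first_page = "Deep learning for radiology: a review\nAbstract ..."
    print(classify_title_match("Deep Learning for Radiology", first_page))
```

Keeping `unavailable` distinct from `mismatch` means a missing pdftotext binary or a scanned PDF only lowers confidence in the report and never marks a correct download as wrong.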

DRY consolidation

/search-lit Phase 5 now delegates to /fulltext-retrieval and drops the duplicated inline OA
code and the SCIHUB_BASE wording (which conflicted with forbidden_actions). Net:
api.unpaywall.org appears in exactly one authored file.

Correctness (Codex review of installed zotero-mcp-server 0.2.2)

Docs now state zotero_add_by_doi does not dedupe (the zotero_search_items search-first
step is what dedupes), that its attach_mode governs the OA child-PDF attach, and that
zotero_add_from_file would create duplicate parent items.

CI / verification

  • New validate.yml step runs the network-free fetch_oa_report_challenge/verify.sh
    (validation_commands alone are not executed by CI).
  • Local gates green: challenge + test_pdf_to_md.py, validate_skills.sh (ALL PASSED),
    gen_skill_docs/gen_distribution_manifest regenerated + --check clean, catalog/probe
    validators + locale/frontmatter/routing OK, PII self-scan clean.
  • Live OA smoke: arXiv DOI → direct arXiv PDF (title_match match), an Unpaywall hit, one OA
    miss correctly partitioned to not_retrieved + manual_needed.txt.
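
The arXiv leg of the smoke test above depends on mapping a `10.48550/arXiv.*` DOI straight to a PDF URL. A minimal sketch of that mapping follows; the function name `arxiv_pdf_url` and the regular expression are assumptions rather than the engine's actual code, while the `https://arxiv.org/pdf/<id>` URL form is arXiv's public convention.

```python
import re
from typing import Optional

# New-style ids look like 2301.01234; old-style ids look like hep-th/9901001
# or math.AG/0601001. Either may carry a version suffix such as v2.
_ARXIV_DOI = re.compile(
    r"^10\.48550/arxiv\."
    r"(?P<id>\d{4}\.\d{4,5}|[a-z\-]+(?:\.[a-z]{2})?/\d{7})"
    r"(?P<ver>v\d+)?$",
    re.IGNORECASE,  # DOI prefixes are case-insensitive
)


def arxiv_pdf_url(doi: str) -> Optional[str]:
    """Return the direct arXiv PDF URL for an arXiv DOI, else None.

    Hypothetical helper: non-arXiv DOIs return None so the caller can
    continue to the next source in the OA cascade.
    """
    match = _ARXIV_DOI.match(doi.strip())
    if match is None:
        return None
    return f"https://arxiv.org/pdf/{match.group('id')}{match.group('ver') or ''}"


if __name__ == "__main__":
    for doi in ("10.48550/arXiv.2301.01234v2", "10.1000/xyz123"):
        print(doi, "->", arxiv_pdf_url(doi))
```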

Additive — skill/detector/guideline/probe counts unchanged (51 / 41 / 38 / 14).

🤖 Generated with Claude Code

…ade to fetch_oa.py

Add /lit-sync Phase 2.7 (opt-in, owner-only) that orchestrates two complementary
full-text routes and reconciles them into references/fulltext_retrieval.json:
  - disk OA PDFs by delegating to the /fulltext-retrieval engine (no re-implementation;
    invoked via ${MEDSCI_SKILLS_ROOT:-$HOME/workspace/medsci-skills} path contract)
  - in-library Zotero-native PDFs via a user-run "Find Available PDF" snippet that reuses
    the user's own proxy/OpenURL config (no institutional info enters the skill)

Engine (fetch_oa.py) enhancements, kept as the single authored OA cascade:
  - worklist parsing: TSV/CSV/Markdown table + Title column (was plain/TSV only)
  - direct arXiv resolution for 10.48550/arXiv.* DOIs (new/old-style, version suffixes)
  - --report retrieval_report.json (schema_version + per-DOI status/source/title_match)
  - pure, offline-testable normalize_title/title_overlap/classify_title_match/build_report;
    best-effort pdftotext title cross-check flags mislabeled PDFs (tri-state, never auto-rejects)

DRY: /search-lit Phase 5 now delegates to /fulltext-retrieval and drops the duplicated
inline OA code + the SCIHUB_BASE wording (conflicted with forbidden_actions).

Correctness (per Codex review of installed zotero-mcp-server 0.2.2): document that
zotero_add_by_doi does NOT dedupe (search-first is what dedupes) and that its attach_mode
governs the OA child-PDF attach; note zotero_add_from_file would create duplicate parents.

CI: real wiring — new validate.yml step runs the network-free
fetch_oa_report_challenge/verify.sh (validation_commands alone are not executed by CI).
Additive: skill/detector/guideline/probe counts unchanged (51/41/38/14).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>