feat(lit-sync): opt-in fulltext-retrieval phase + consolidate OA cascade to fetch_oa.py #226
Open
Yoojin-nam wants to merge 1 commit into
Add /lit-sync Phase 2.7 (opt-in, owner-only) that orchestrates two complementary
full-text routes and reconciles them into references/fulltext_retrieval.json:
- disk OA PDFs by delegating to the /fulltext-retrieval engine (no re-implementation;
invoked via ${MEDSCI_SKILLS_ROOT:-$HOME/workspace/medsci-skills} path contract)
- in-library Zotero-native PDFs via a user-run "Find Available PDF" snippet that reuses
the user's own proxy/OpenURL config (no institutional info enters the skill)
Engine (fetch_oa.py) enhancements, kept as the single authored OA cascade:
- worklist parsing: TSV/CSV/Markdown table + Title column (was plain/TSV only)
- direct arXiv resolution for 10.48550/arXiv.* DOIs (new/old-style, version suffixes)
- --report retrieval_report.json (schema_version + per-DOI status/source/title_match)
- pure, offline-testable normalize_title/title_overlap/classify_title_match/build_report;
best-effort pdftotext title cross-check flags mislabeled PDFs (tri-state, never auto-rejects)
DRY: /search-lit Phase 5 now delegates to /fulltext-retrieval and drops the duplicated
inline OA code + the SCIHUB_BASE wording (conflicted with forbidden_actions).
Correctness (per Codex review of installed zotero-mcp-server 0.2.2): document that
zotero_add_by_doi does NOT dedupe (search-first is what dedupes) and that its attach_mode
governs the OA child-PDF attach; note zotero_add_from_file would create duplicate parents.
CI: real wiring — new validate.yml step runs the network-free
fetch_oa_report_challenge/verify.sh (validation_commands alone are not executed by CI).
Additive: skill/detector/guideline/probe counts unchanged (51/41/38/14).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Adds an opt-in Phase 2.7 (Fulltext Retrieval) to /lit-sync and makes /fulltext-retrieval's fetch_oa.py the single authored home of the open-access cascade.
Phase 2.7 orchestrates two complementary routes and reconciles them in references/fulltext_retrieval.json (a new /lit-sync sole-writer artifact):
- Disk OA PDFs: delegated to the /fulltext-retrieval engine (no re-implementation; invoked via a ${MEDSCI_SKILLS_ROOT:-$HOME/workspace/medsci-skills} path contract, no import/vendor).
- In-library Zotero-native PDFs: a user-run references/find_available_pdf.js snippet that triggers Zotero's own addAvailablePDF/addAvailablePDFs, reusing the user's own proxy/OpenURL config. No credentials, proxy hosts, or institutional identifiers enter the skill.
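The path contract above is a shell parameter expansion with a default. As a rough illustration only (the PR itself uses the shell form, and the helper name `resolve_skills_root` is hypothetical), the same fallback could be expressed in Python like this:

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_skills_root(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Path:
    """Mirror ${MEDSCI_SKILLS_ROOT:-$HOME/workspace/medsci-skills}.

    Hypothetical helper, not part of the PR. The ":-" form falls back
    when the variable is unset OR empty, so an empty string is treated
    the same as a missing key.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    configured = env.get("MEDSCI_SKILLS_ROOT", "")
    if configured:
        return Path(configured)
    return home / "workspace" / "medsci-skills"
```

Passing `env` and `home` explicitly keeps the function testable without touching the real environment.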
Engine (fetch_oa.py) enhancements
- Worklist parsing: TSV/CSV/Markdown table + Title column (was plain/TSV only).
- Direct arXiv resolution for 10.48550/arXiv.* DOIs (new/old-style, version suffixes).
- --report retrieval_report.json (schema_version + per-DOI status/source/title_match).
- Pure, offline-testable normalize_title/title_overlap/classify_title_match/build_report; best-effort pdftotext title cross-check flags mislabeled PDFs (tri-state match|mismatch|unavailable, never auto-rejects).
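To make the tri-state title cross-check concrete, here is a minimal sketch of how such pure helpers might look. The function names come from the PR description, but the signatures, the token-overlap metric, and the 0.6 threshold are assumptions made for illustration. They are not taken from fetch_oa.py.

```python
import re
import unicodedata
from typing import Optional


def normalize_title(title: str) -> str:
    """Lowercase, strip accents, and reduce punctuation runs to single spaces."""
    decomposed = unicodedata.normalize("NFKD", title)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return re.sub(r"[^a-z0-9]+", " ", without_accents.lower()).strip()


def title_overlap(expected: str, extracted_text: str) -> float:
    """Fraction of expected-title tokens that also occur in the extracted text."""
    expected_tokens = set(normalize_title(expected).split())
    if not expected_tokens:
        return 0.0
    found_tokens = set(normalize_title(extracted_text).split())
    return len(expected_tokens & found_tokens) / len(expected_tokens)


def classify_title_match(
    expected: Optional[str],
    extracted_text: Optional[str],
    threshold: float = 0.6,  # assumed cut-off, not the PR's value
) -> str:
    """Return 'match', 'mismatch', or 'unavailable'; never rejects a PDF itself."""
    if not expected or not extracted_text or not normalize_title(expected):
        # No worklist title, or pdftotext produced nothing usable.
        return "unavailable"
    if title_overlap(expected, extracted_text) >= threshold:
        return "match"
    return "mismatch"
```

Because the helpers take plain strings, they can be tested offline. Only the caller that shells out to pdftotext needs a real PDF, and a "mismatch" result is a flag for human review rather than grounds for discarding the file.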
DRY consolidation
/search-lit Phase 5 now delegates to /fulltext-retrieval and drops the duplicated inline OA code and the SCIHUB_BASE wording (which conflicted with forbidden_actions). Net: api.unpaywall.org appears in exactly one authored file.
Correctness (Codex review of installed zotero-mcp-server 0.2.2)
Docs now state that zotero_add_by_doi does not dedupe (the zotero_search_items search-first step is what dedupes), that its attach_mode governs the OA child-PDF attach, and that zotero_add_from_file would create duplicate parent items.
CI / verification
- New validate.yml step runs the network-free fetch_oa_report_challenge/verify.sh (validation_commands alone are not executed by CI).
- … test_pdf_to_md.py, validate_skills.sh (ALL PASSED), gen_skill_docs/gen_distribution_manifest regenerated + --check clean, catalog/probe validators + locale/frontmatter/routing OK, PII self-scan clean.
- … match), an Unpaywall hit, one OA miss correctly partitioned to not_retrieved + manual_needed.txt.
Additive: skill/detector/guideline/probe counts unchanged (51 / 41 / 38 / 14).
🤖 Generated with Claude Code