Project-specific instructions for Claude Code and other AI agents working with this repository.
This document is automatically loaded by Claude Code when running in this repo. It provides context, conventions, and common operations so agents can be productive immediately.
Name: Naturalistic fMRI Literature & Wisdom Synthesis (2021-2026)
Purpose: Curate 100 top-tier naturalistic fMRI papers, upload them to NotebookLM, query with 20 expert-level questions, and synthesize the answers into a wisdom document.
Status: Production pipeline. 100/100 PDFs uploaded, 20/20 queries executed, synthesis in progress.
Core philosophy:
- Evidence over assumption — every claim in answers must cite corpus papers
- Reproducibility first — all scripts run end-to-end with zero manual config
- AI as force multiplier — NotebookLM, Claude, and foundation models compound each other
Python pipeline (order of execution):
search_papers.py → PubMed E-utilities multi-query
select_and_download.py → Europe PMC PDF fetch (parallel)
retry_failed.py → HTTP 500 retry with backoff
select_100.py → Relevance-score selection
save_answer.py → NotebookLM JSON → markdown parser
upload_to_nlm.sh → Bulk NotebookLM upload
Metadata (JSON, all stable):
papers_all.json ← 335 filtered papers (full metadata)
papers_top120.json ← top 120 candidates
papers_selected.json ← download targets (120)
top100_paths.json ← selected 100 PDF paths
download_manifest.json ← what downloaded
retry_manifest.json ← retry outcomes
Query answers (markdown, 20 files):
answers/Q01_statistical_inference.md
answers/Q02_stimulus_standardization.md
...
answers/Q20_beyond_movies.md
(each contains: full answer + references with source_ids + cited_text)
Documentation (docs/):
01-quickstart.md ← 5-min onboarding
02-pipeline.md ← technical deep-dive on pipeline
03-notebooklm-guide.md ← NotebookLM setup + MCP + CLI usage
04-queries.md ← 20 queries verbatim + rationale
05-reproducibility.md ← reproduce from scratch
06-extending.md ← add papers / new queries
Ignored (gitignore):
pdfs/ ← 131 PDFs (~350 MB, copyright)
*.log ← runtime logs
__pycache__/, .venv/ ← Python artifacts
# Requires: pip, nlm CLI, NotebookLM Plus account
pip install -r requirements.txt
make full-pipeline # search → download → retry → select → upload

# Edit search_papers.py QUERIES (update year range)
python3 search_papers.py # Regenerate papers_all.json
python3 select_and_download.py # Download new papers
python3 select_100.py # Re-select top 100
# Manually diff top100_paths.json old vs. new, upload deltas

Via MCP (preferred in Claude Code):
# Use mcp__notebooklm__notebook_query
notebook_id = "d9265824-3383-4fd4-8d17-03512a338ee5" # default notebook
query = "Q21 — [your new question]: ..."
# Result saved to tool-results/, parse with save_answer.py

Via CLI:
nlm chat <notebook_id> "Q21 — [your question]"

python3 save_answer.py \
--file /path/to/tool-result.txt \
--qid 21 \
--title "my_query_topic"
# → answers/Q21_my_query_topic.md

# Manual synthesis (recommended; needs expert judgment)
# Open all answers/*.md, distill into structured wisdom_synthesis.md
# Structure: 7 categories × (SOTA / Outstanding / AI / Citations)

- Python 3.12+ (test on 3.10+)
- stdlib + requests only for pipeline scripts (no heavy deps)
- Type hints recommended but not enforced
- Functions over classes unless state is essential (stateless pipelines)
- Parallelism: concurrent.futures.ThreadPoolExecutor(max_workers=3-6) with random.uniform jitter to avoid rate limits
- Bash 4+ features OK (mapfile, ${!arr[@]})
- Use set -u for safety; avoid set -e in parallel contexts (failures are lost)
- Log everything to separate success / failure files
- JSON with indent=2, ensure_ascii=False (non-ASCII in titles preserved)
- Manifests as list of tuples [idx, pmid, status, path]
- Never mutate metadata files in place; regenerate from source
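The parallelism, logging, and manifest conventions above can be sketched together in one small helper. This is a minimal illustration under stated assumptions, not the repository's actual download code: fetch_one, run_batch, dump_manifest, and demo_fetch are hypothetical names, and the "ok" / "failed: ..." status strings are invented for the example.

```python
import json
import random
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_one(idx, pmid, fetch, max_jitter=0.5):
    """Call fetch(pmid) after a random delay and return one manifest row."""
    time.sleep(random.uniform(0.0, max_jitter))  # jitter spreads out request bursts
    try:
        return [idx, pmid, "ok", fetch(pmid)]
    except Exception as exc:  # record the failure instead of aborting the batch
        return [idx, pmid, f"failed: {exc}", None]


def run_batch(pmids, fetch, max_workers=4, max_jitter=0.5):
    """Process all PMIDs in a thread pool; rows come back in input order."""
    def job(pair):
        return fetch_one(pair[0], pair[1], fetch, max_jitter)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(job, enumerate(pmids)))


def dump_manifest(rows):
    """Serialize [idx, pmid, status, path] rows with the JSON settings above."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


def demo_fetch(pmid):
    """Stand-in for a real download (hypothetical): fails for one made-up PMID."""
    if pmid == "222":
        raise ValueError("HTTP 500")
    return f"pdfs/{pmid}.pdf"


if __name__ == "__main__":
    rows = run_batch(["111", "222", "333"], demo_fetch, max_workers=3)
    print(dump_manifest(rows))
```

Because every row carries its own status, splitting the result into separate success and failure files is a single filter on the third field.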
nlm login # Browser OAuth, saves tokens to ~/.nlm/
nlm login switch <profile> # Multi-account switching

Token location: ~/.nlm/profiles/<profile>/tokens.json (never commit).
Standard gh CLI: gh auth login. This repo's default account: snuconnectome.
PubMed E-utilities and Europe PMC are unauthenticated (with 3 req/s rate limit).
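One way to stay under the 3 req/s limit mentioned above is a small throttle called before each request. The sketch below is an assumption about how this could be done, not code taken from the pipeline scripts; make_throttle and its parameters are hypothetical. It is not thread-safe, so a threaded caller would need to guard wait() with a lock.

```python
import time


def make_throttle(max_per_second=3.0, clock=time.monotonic, sleep=time.sleep):
    """Return a wait() function that spaces calls at least 1/max_per_second apart."""
    min_interval = 1.0 / max_per_second
    last_call = None  # time of the previous wait(); None before the first call

    def wait():
        nonlocal last_call
        now = clock()
        if last_call is not None:
            remaining = min_interval - (now - last_call)
            if remaining > 0:
                sleep(remaining)  # pause only for the part of the interval still owed
                now = clock()
        last_call = now

    return wait
```

Typical use is wait = make_throttle(3.0), then wait() immediately before each HTTP call. The clock and sleep parameters exist only so the behaviour can be checked without real delays.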
- Never commit PDFs: .gitignore covers pdfs/ and *.pdf; check before push
- Never commit tool-results/ raw outputs: they contain system paths
- Never modify papers_all.json by hand: always regenerate via search_papers.py
- Never bypass save_answer.py when storing NotebookLM responses: it maintains citation integrity
- Never share the specific notebook ID d9265824-... without checking if it's been made public
- Never hit NCBI PMC directly for PDFs: use Europe PMC (PoW bypass)
- Never parallel-upload to NotebookLM with >6 workers: this hits rate limits
- Never trust inline LLM-generated content without verifying against NotebookLM citations: hallucination risk in synthesis
- Read before edit — this repo has carefully designed conventions
- Test scripts on 5 papers first before full-scale runs
- Commit after each phase — search → download → select → upload → query → synthesize
- Save raw NotebookLM JSON before parsing — useful for re-runs
- Use TaskCreate for anything 3+ steps
- Spawn Explore/Plan agents for tasks involving >3 searches
- Preserve citation source_ids in all synthesis outputs — enables forward traceability
- Keep answers/ files append-only — treat as immutable log of what NotebookLM said at time T
- Current default notebook ID: d9265824-3383-4fd4-8d17-03512a338ee5 (Naturalistic fMRI Literature 2021-2026)
- Deferred tools: mcp__notebooklm__*, mcp__tavily__*, mcp__context7__* (load via ToolSearch)
- Background uploads: Use run_in_background: true for upload_to_nlm.sh (100 uploads take ~5 min)
- Persisted outputs: NotebookLM answers >30 KB go to tool-results/ (path in response)
- Check answers/Q01-Q20_*.md before running new NotebookLM queries: the question is likely already answered
- When adding external knowledge, clearly mark as "outside corpus" to preserve citation integrity
- Documentation style: bilingual Korean/English where useful, Korean preferred for narrative
- Technical content: English first, Korean annotations
- Code identifiers: never translate; keep papers_all.json, search_papers.py in English
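The rule stated earlier that answers/ files are append-only can be enforced mechanically by opening files in exclusive-create mode, so an existing answer is never overwritten. The following is a minimal sketch under that assumption; write_answer_once is a hypothetical helper and is not part of save_answer.py. The file name follows the Q##_title.md pattern seen in answers/.

```python
from pathlib import Path


def write_answer_once(answers_dir, qid, title, text):
    """Create answers_dir/Q##_title.md; raise FileExistsError if it already exists."""
    path = Path(answers_dir) / f"Q{qid:02d}_{title}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" is exclusive creation: open() fails instead of truncating a file.
    with open(path, "x", encoding="utf-8") as fh:
        fh.write(text)
    return path
```

A caller that really does need a corrected answer would write it under a new question ID or file name rather than replacing the earlier record.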
The 20 existing queries follow a 3-part structure. When proposing new queries:
Q## — [Title]: [One-sentence scope setter referencing corpus theme]
(a) SOTA: What does the corpus reveal about [X]? [Specific paper hooks if possible]
(b) Outstanding: What remains unresolved — [dimension 1] vs. [dimension 2]?
(c) AI angle: How can [specific AI approach] address [specific limitation]?
Synthesize across multiple papers with citations.

Why this structure:
- (a) grounds the answer in the corpus (SOTA consensus)
- (b) elicits debates (productive tensions)
- (c) surfaces neuro-AI opportunities (the novel contribution)
Category placement: Match new queries to one of 7 existing categories (A-G in docs/04-queries.md). If it doesn't fit, propose a new category first.
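The three-part template above lends itself to a small formatter, which keeps new queries consistent with the existing 20. This is an illustrative sketch only: build_query and its argument names are hypothetical, and the wording of each part is copied from the template rather than from any existing script.

```python
EM_DASH = "\u2014"  # separator used in the existing query headers


def build_query(qid, title, scope, topic, dim_1, dim_2, ai_approach, limitation):
    """Fill the three-part query template; every argument is free text."""
    return "\n".join([
        f"Q{qid:02d} {EM_DASH} {title}: {scope}",
        f"(a) SOTA: What does the corpus reveal about {topic}?",
        f"(b) Outstanding: What remains unresolved {EM_DASH} {dim_1} vs. {dim_2}?",
        f"(c) AI angle: How can {ai_approach} address {limitation}?",
        "Synthesize across multiple papers with citations.",
    ])
```

The returned string can be passed as the query text to either the MCP tool or the CLI call shown earlier.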
- Run search_papers.py with updated queries
- Diff papers_all.json to find new PMIDs
- Download only the diff via select_and_download.py with filtered input
- Upload diff to NotebookLM via upload_to_nlm.sh
- (Optional) Re-run representative queries to check if answers change
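The diff step in the list above can be sketched as a set difference on PMIDs. This assumes, without confirmation from the source, that papers_all.json is a JSON list of objects each carrying a "pmid" field; load_records and added_records are hypothetical helper names.

```python
import json


def load_records(path):
    """Read a metadata file such as an old or new copy of papers_all.json."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def added_records(old_records, new_records, key="pmid"):
    """Return records from new_records whose PMID does not occur in old_records."""
    # PMIDs are compared as strings so that 123 and "123" count as the same paper.
    known = {str(rec[key]) for rec in old_records}
    return [rec for rec in new_records if str(rec[key]) not in known]
```

The resulting list is what would be handed to the download step as filtered input; how select_and_download.py accepts that input is not specified here.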
- Check answers/ first: is it already there?
- If no: craft 3-part query, submit via MCP, parse with save_answer.py
- Save to answers/Q##_title.md
- Update docs/04-queries.md to register the new query
- Load all answers/*.md files
- Group by category (A-G)
- For each category, extract: SOTA consensus, outstanding debates, AI opportunities, key citations
- Write wisdom_synthesis.md with structured template
- Include cross-cutting themes (e.g., foundation models appear in multiple categories)
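The load-and-group steps above can be sketched as follows. The mapping from question number to category lives in docs/04-queries.md and is not reproduced here, so category_of is a caller-supplied dictionary; qid_from_name and group_by_category are hypothetical names for illustration.

```python
import re
from collections import defaultdict


def qid_from_name(name):
    """Extract the question number from a file name like Q07_topic.md, else None."""
    match = re.match(r"Q(\d+)_", name)
    return int(match.group(1)) if match else None


def group_by_category(filenames, category_of):
    """Group answer file names by category letter using a {qid: category} mapping."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        qid = qid_from_name(name)
        if qid is None:
            continue  # skip files that do not follow the Q##_ naming pattern
        groups[category_of.get(qid, "uncategorized")].append(name)
    return dict(groups)
```

Anything that lands under "uncategorized" signals a query that has not yet been registered with a category.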
Before committing:
- git status: no PDFs or tool-results/ accidentally staged
- git diff: no hardcoded tokens or absolute paths (/home/juke/...)
- Answer files valid markdown (render in GitHub preview)
- Metadata JSON valid (python3 -c "import json; json.load(open('papers_all.json'))")
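Parts of the pre-commit checklist above can be automated. The sketch below is one possible approach rather than an existing script in this repository: the pattern list is an assumption derived from the rules in this document, and staged_files, forbidden_paths, and json_is_valid are hypothetical names.

```python
import json
import subprocess
from fnmatch import fnmatchcase

# Patterns derived from the rules in this document; adjust to match .gitignore.
FORBIDDEN_PATTERNS = ("pdfs/*", "*.pdf", "tool-results/*", "*.log")


def staged_files():
    """Return the paths currently staged in git."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def forbidden_paths(paths):
    """Return the subset of paths that should never be committed."""
    return [p for p in paths
            if any(fnmatchcase(p, pat) for pat in FORBIDDEN_PATTERNS)]


def json_is_valid(text):
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```

Running forbidden_paths(staged_files()) inside the repository and aborting on a non-empty result covers the first checklist item; the token and absolute-path checks still need a manual read of git diff.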
Before merging to main:
- All 20 query answers in answers/
- README.md reflects current state (version, dates, counts)
- docs/ updated if pipeline changed
- upload.log captured if NotebookLM state changed
Cause: NCBI's new PoW challenge.
Fix: Ensure fetch_pmc_pdf_url() uses europepmc.org/articles/PMC{id}?pdf=render.
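For reference, the URL pattern in the fix above can be built as follows. This is a hedged sketch, not the body of fetch_pmc_pdf_url(): europe_pmc_pdf_url is a hypothetical name, and the https:// scheme is an assumption, since the pattern above gives only the host and path.

```python
def europe_pmc_pdf_url(pmcid):
    """Build the Europe PMC render URL from 'PMC123' or a bare number."""
    digits = str(pmcid).strip().upper().removeprefix("PMC")
    if not digits.isdigit():
        raise ValueError(f"not a PMC id: {pmcid!r}")
    return f"https://europepmc.org/articles/PMC{digits}?pdf=render"
```

Accepting both "PMC123" and 123 avoids a doubled "PMCPMC" prefix when metadata files store the identifier inconsistently.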
Cause: Rate limit or paper not yet indexed.
Fix: retry_failed.py with exponential backoff (already implemented). If persistent, try BioC API https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_xml/PMC{id}/unicode as fallback.
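Exponential backoff, as referenced in the fix above, doubles the wait after each failed attempt up to a cap. The sketch below illustrates the general technique and is not the code in retry_failed.py; retry_with_backoff and all of its parameters are hypothetical.

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       jitter=0.1, sleep=time.sleep):
    """Call call() until it succeeds, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A small random extra keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0.0, delay * jitter))
```

With the defaults, the waits are roughly 1, 2, 4, and 8 seconds before the fifth and final attempt.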
Cause: Parallel >6 workers.
Fix: Reduce -P in upload_to_nlm.sh to 2-3; add sleep 0.5 between batches.
Fix: nlm login (opens browser). For headless: nlm login --device-code.
Fix: Use ToolSearch first:
ToolSearch(query="select:mcp__notebooklm__notebook_query,mcp__notebooklm__source_add", max_results=5)
When reporting progress to the user:
- Korean for conversational text (매뉴얼 "manuals", 진행 상황 "progress updates")
- English for code, identifiers, file paths
- Concise: ≤100 words for routine updates, longer for deliverables
- Show, don't tell: paste counts (100/100 success), not adjectives (많은 "many", 성공적 "successful")
- README.md: project overview
- docs/01-quickstart.md: 5-minute onboarding
- docs/02-pipeline.md: pipeline technical deep-dive
- docs/03-notebooklm-guide.md: NotebookLM setup + usage
- docs/04-queries.md: all 20 queries + design rationale
- docs/05-reproducibility.md: reproduce from scratch
- docs/06-extending.md: add papers / new queries
Last updated: 2026-04-14 by Claude Opus 4.6 orchestrating SNU Connectome Lab's naturalistic fMRI synthesis