Commit 9649622

inutano and claude committed
Add E2E test scenario: Kusako yeast heat shock RNA-seq reanalysis

Realistic scenario where an undergrad uses Claude Code to reanalyze published RNA-seq data (PMID 32109230, GSE135568) for TOR pathway genes not discussed in the original paper. Exercises the full chain: PMC text mining → TogoID ID conversion → fetchngs → rnaseq. All data paths verified against live APIs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 7d5dbd1 commit 9649622

1 file changed

Lines changed: 130 additions & 0 deletions

@@ -0,0 +1,130 @@
1+
# E2E Test 001: Kusako — Yeast Heat Shock RNA-seq Reanalysis
2+
3+
> **Test type:** End-to-end agent scenario
4+
> **Workflows:** fetchngs → rnaseq
5+
> **Organism:** *Saccharomyces cerevisiae* (R64-1-1)
6+
> **Data:** GSE135568 (PMID: 32109230)
7+
8+
## Scenario
9+
10+
Kusako is an undergraduate biology student working on a thesis about nutrient signaling in *Saccharomyces cerevisiae*. Her lab studies the TOR pathway, and she's interested in how TOR-regulated genes (e.g., *GAP1*, *MEP2*, *DAL5* — nitrogen permease genes) respond under heat shock. She found a paper about yeast heat shock transcriptomics, but the paper focuses on the RNA binding protein Mip6 and histone modifications — her target genes aren't discussed at all.
11+
12+
She wants to reanalyze the same RNA-seq data to look at expression of her specific gene set. She has Claude Code but no bioinformatics experience.
13+
14+
## Prompt
15+
16+
```
17+
I'm studying TOR pathway genes in yeast for my thesis. I found this paper about
18+
heat shock response (PMID: 32109230) that has RNA-seq data — they looked at Mip6
19+
and histone modifications but I want to check what happens to nitrogen permease
20+
genes like GAP1, MEP2, and DAL5 under the same conditions. Can you get the data
21+
and run the analysis for me?
22+
```
23+
24+
## Expected Agent Behavior
25+
26+
### Step 1: Understand the request
27+
28+
This is a secondary analysis: reuse published RNA-seq data to answer a different biological question. The student doesn't know about workflows, WES, or CWL — the agent handles everything.
29+
30+
### Step 2: Read AGENTS.md
31+
32+
Discover available workflows, the WES submission protocol, and provenance requirements.
33+
34+
### Step 3: Find the data
35+
36+
Read `docs/data-discovery.md`, then:
37+
38+
1. Try TogoID: PMID 32109230 → SRA run accessions (no direct route exists)
39+
2. **Fall back to PMC full text**: fetch the paper via Europe PMC or NCBI E-utilities
40+
- Paper: "A multi-omics dataset of heat-shock response in the yeast RNA binding protein Mip6" (PMC7046740)
41+
- Scan Data Availability / Data Records section for archive identifiers
42+
- Find: **GSE135568** (GEO series accession)
43+
3. Use TogoID to convert identifiers:
44+
- `GET https://api.togoid.dbcls.jp/convert?ids=GSE135568&route=geo_series,bioproject&format=json`
45+
- Result: **PRJNA559331**
46+
- `GET https://api.togoid.dbcls.jp/convert?ids=PRJNA559331&route=bioproject,sra_run&format=json`
47+
- Result: 67 SRA run accessions (SRR9929263–SRR9929329)
48+
4. Filter for RNA-seq only (the dataset also contains ChIP-seq):
49+
- Query ENA: `study_accession="PRJNA559331"` with `library_strategy` field
50+
- 24 RNA-seq runs, 36 ChIP-seq runs — select only RNA-seq
51+
5. Present the accessions to Kusako with metadata:
52+
- Organism: *S. cerevisiae* BY4741
53+
- Conditions: wild-type vs mip6Δ, 30°C vs 39°C (20 min and 120 min heat shock)
54+
- Library: paired-end, Illumina HiSeq 2000
55+
- Ask which samples she needs (all 24, or a subset for her comparison)
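
The identifier conversion and RNA-seq filtering in step 3 can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the agent's actual implementation: the helper names `togoid_url` and `rnaseq_runs` are hypothetical, the ENA report is assumed to be tab-separated with `run_accession` and `library_strategy` columns, and the rows in `TOY_REPORT` are invented. No network request is made; the sketch only rebuilds the TogoID URL quoted above and filters a report that has already been downloaded.

```python
import csv
import io
from urllib.parse import urlencode

# Endpoint as quoted in step 3 of this scenario.
TOGOID_CONVERT = "https://api.togoid.dbcls.jp/convert"


def togoid_url(ids, route):
    """Build a TogoID convert URL from a list of IDs and a list of dataset names."""
    query = urlencode(
        {"ids": ",".join(ids), "route": ",".join(route), "format": "json"},
        safe=",",  # keep commas literal so the URL matches the form shown in step 3
    )
    return f"{TOGOID_CONVERT}?{query}"


def rnaseq_runs(ena_tsv, strategy="RNA-Seq"):
    """Return run accessions whose library_strategy equals `strategy`.

    Assumes a tab-separated report with `run_accession` and
    `library_strategy` columns; the exact spelling "RNA-Seq" is an assumption.
    """
    reader = csv.DictReader(io.StringIO(ena_tsv), delimiter="\t")
    return [row["run_accession"] for row in reader if row["library_strategy"] == strategy]


# Invented rows for illustration only; these are not real run metadata.
TOY_REPORT = (
    "run_accession\tlibrary_strategy\n"
    "SRR0000001\tRNA-Seq\n"
    "SRR0000002\tChIP-Seq\n"
    "SRR0000003\tRNA-Seq\n"
)

if __name__ == "__main__":
    print(togoid_url(["GSE135568"], ["geo_series", "bioproject"]))
    print(rnaseq_runs(TOY_REPORT))
```

Keeping URL construction and filtering as pure functions means both can be checked offline before the agent touches a live API.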
56+
57+
### Step 4: Fetch raw data
58+
59+
Read `workflows/fetchngs/agent.yaml`:
60+
- Prepare input YAML with confirmed SRA accessions
61+
- Submit fetchngs to WES, poll until complete
62+
- Retrieve FASTQ output paths
63+
64+
**For testing:** Use a subset of 2–4 samples to keep runtime short (e.g., 1 wild-type 30°C + 1 wild-type 39°C).
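
The "submit, then poll until complete" step could look something like the sketch below. It assumes the WES endpoint follows the GA4GH convention of returning a JSON body with a `state` field from `GET /runs/{run_id}/status`; the deployment used here may differ. `wait_for_run` and the `get_status` callback are hypothetical names, and the set of terminal states is a subset chosen for illustration.

```python
import time

# States treated as terminal in this sketch (a subset of the GA4GH WES State enum).
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}


def wait_for_run(get_status, run_id, interval=30.0, max_polls=240, sleep=time.sleep):
    """Poll a run until it reaches a terminal state and return that state.

    `get_status` is a caller-supplied function (hypothetical) that performs
    GET /runs/{run_id}/status and returns the decoded JSON body as a dict.
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    for _ in range(max_polls):
        state = get_status(run_id)["state"]
        if state in TERMINAL_STATES:
            return state
        sleep(interval)
    raise TimeoutError(f"run {run_id} did not finish after {max_polls} polls")


if __name__ == "__main__":
    # Simulated status sequence; a real caller would issue HTTP requests here.
    fake_states = iter(["QUEUED", "RUNNING", "COMPLETE"])
    final = wait_for_run(
        lambda rid: {"run_id": rid, "state": next(fake_states)},
        "example-run",
        sleep=lambda seconds: None,
    )
    print(final)
```

Bounding the loop with `max_polls` keeps a stuck run from blocking the test indefinitely; the caller decides how to report a timeout to the student.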
65+
66+
### Step 5: Resolve references
67+
68+
Read `references/README.md` + `references/genomes.yaml`:
69+
- Look up *S. cerevisiae* R64-1-1 (Tier 1 in catalog)
70+
- rnaseq needs: genome FASTA + GTF + STAR index
71+
- Check iGenomes: `s3://ngi-igenomes/igenomes/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/STARIndex/`
72+
- Download pre-built STAR index, OR run `prepare-references` with `build_star: true`
73+
74+
### Step 6: Run RNA-seq quantification
75+
76+
Read `workflows/rnaseq/agent.yaml`:
77+
- Prepare inputs: FASTQs from step 4, genome FASTA + GTF + STAR index from step 5
78+
- Select aligner: STAR (default for model organisms with good annotation)
79+
- Submit to WES, poll until complete
80+
- Retrieve gene-level counts and MultiQC report
81+
82+
### Step 7: Retrieve provenance
83+
84+
`GET /runs/{run_id}/ro-crate` for both fetchngs and rnaseq runs.
85+
Save RO-Crate metadata.
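
One way to check a retrieved crate against the entity types named in the success criteria (Dataset, ComputationalWorkflow, CreateAction) is sketched below. It assumes the endpoint returns standard RO-Crate JSON-LD with a top-level `@graph` array; `crate_types` and `missing_types` are hypothetical helper names, and the toy crate is invented and far smaller than a real one.

```python
import json

# Entity types listed in this scenario's success criteria.
REQUIRED_TYPES = {"Dataset", "ComputationalWorkflow", "CreateAction"}


def crate_types(metadata):
    """Collect every @type value that appears in the crate's @graph."""
    found = set()
    for entity in metadata.get("@graph", []):
        types = entity.get("@type", [])
        # @type may be a single string or a list of strings.
        found.update([types] if isinstance(types, str) else types)
    return found


def missing_types(metadata):
    """Return the required types that the crate does not contain."""
    return REQUIRED_TYPES - crate_types(metadata)


# Invented, heavily trimmed crate for illustration only.
TOY_CRATE = json.loads("""
{
  "@graph": [
    {"@id": "./", "@type": "Dataset"},
    {"@id": "workflow.cwl", "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"]},
    {"@id": "#run", "@type": "CreateAction"}
  ]
}
""")

if __name__ == "__main__":
    print(sorted(missing_types(TOY_CRATE)))
```

An empty result means all three types are present. This checks only the presence of the types, not full RO-Crate validity.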
86+
87+
### Step 8: Present results in student-friendly terms
88+
89+
- Extract expression levels of GAP1, MEP2, DAL5 from the gene count matrix
90+
- Compare across conditions (30°C vs 39°C heat shock)
91+
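- Explain what the counts mean (a higher count means more reads were assigned to that gene, which suggests higher transcript abundance once sequencing depth is taken into account)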
92+
- Point to MultiQC report for data quality
93+
- Note that these are raw counts, not a differential expression analysis; suggest DESeq2 as the next step if she sees interesting patterns
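
Pulling the target genes out of the count matrix might be sketched as below. The column layout (`gene_id`, `gene_name`, then one column per sample) is an assumption about the rnaseq output, `counts_per_million` is a hypothetical helper, and every number in `TOY_MATRIX` is invented. Scaling to counts per million is added here only to show why raw counts from samples of different sequencing depth should not be compared directly; it is not a substitute for DESeq2.

```python
import csv
import io

TARGETS = ("GAP1", "MEP2", "DAL5")


def counts_per_million(matrix_tsv, targets=TARGETS):
    """Return {gene_name: {sample: CPM}} for the target genes.

    Assumes a tab-separated matrix with `gene_id` and `gene_name` columns
    followed by one raw-count column per sample.
    """
    reader = csv.DictReader(io.StringIO(matrix_tsv), delimiter="\t")
    samples = [c for c in reader.fieldnames if c not in ("gene_id", "gene_name")]
    rows = list(reader)
    # Library size per sample: the sum of counts over all genes in the matrix.
    totals = {s: sum(float(r[s]) for r in rows) for s in samples}
    return {
        r["gene_name"]: {s: float(r[s]) * 1e6 / totals[s] for s in samples}
        for r in rows
        if r["gene_name"] in targets
    }


# Invented counts for illustration only; the sample names are hypothetical.
TOY_MATRIX = (
    "gene_id\tgene_name\tctrl_30C\ths_39C\n"
    "YKR039W\tGAP1\t100\t400\n"
    "YNL142W\tMEP2\t50\t50\n"
    "YJR152W\tDAL5\t25\t100\n"
    "YFL039C\tACT1\t825\t1450\n"
)

if __name__ == "__main__":
    for gene, values in counts_per_million(TOY_MATRIX).items():
        print(gene, values)
```

In the toy data MEP2 has the same raw count in both samples, but the second sample has twice the total reads, so its scaled value is halved. This is the kind of effect worth explaining to a student before she interprets the numbers.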
94+
95+
## Verified Data Path
96+
97+
The data-discovery and reference-lookup path for this scenario has been verified against live APIs:
98+
99+
| Step | Input | Method | Output | Verified |
100+
|------|-------|--------|--------|----------|
101+
| PMC lookup | PMID 32109230 | Europe PMC / NCBI efetch | PMC7046740 (open access) | Yes |
102+
| Text mining | PMC7046740 full text | Scan Data Records section | GSE135568 | Yes |
103+
| ID conversion | GSE135568 | TogoID geo_series→bioproject | PRJNA559331 | Yes |
104+
| ID conversion | PRJNA559331 | TogoID bioproject→sra_run | 67 SRR accessions | Yes |
105+
| RNA-seq filter | 67 runs | ENA API library_strategy filter | 24 RNA-seq runs | Yes |
106+
| Genome lookup | *S. cerevisiae* | references/genomes.yaml | R64-1-1, Ensembl 115 | Yes |
107+
| STAR index | R64-1-1 | iGenomes S3 | Available | Yes |
108+
109+
## Success Criteria
110+
111+
- [ ] Agent fetches PMC full text and extracts GSE135568 from the paper
112+
- [ ] Agent converts GSE135568 → PRJNA559331 → SRR accessions via TogoID
113+
- [ ] Agent filters to RNA-seq runs only (24 of 67)
114+
- [ ] Agent presents sample metadata and asks Kusako to confirm
115+
- [ ] fetchngs completes and produces FASTQ files
116+
- [ ] Yeast reference resolved from genomes.yaml without manual intervention
117+
- [ ] rnaseq completes with STAR alignment and gene quantification
118+
- [ ] Agent reports expression values for GAP1, MEP2, DAL5
119+
- [ ] RO-Crate retrieved and valid for both workflow runs (Dataset + ComputationalWorkflow + CreateAction)
120+
121+
## Test Subset (for CI/quick validation)
122+
123+
For automated testing, use 2 samples to keep runtime under 30 minutes:
124+
125+
| Run | Condition | Genotype |
126+
|-----|-----------|----------|
127+
| SRR9929263 | 30°C (control) | BY4741 wild-type |
128+
| SRR9929265 | 39°C 20 min (heat shock) | BY4741 wild-type |
129+
130+
These two samples provide the minimal comparison: control vs heat shock in wild-type, which is what Kusako needs to see if her target genes respond to heat stress.
