| 1 | +# E2E Test 001: Kusako — Yeast Heat Shock RNA-seq Reanalysis |
| 2 | + |
| 3 | +> **Test type:** End-to-end agent scenario |
| 4 | +> **Workflows:** fetchngs → rnaseq |
| 5 | +> **Organism:** *Saccharomyces cerevisiae* (R64-1-1) |
| 6 | +> **Data:** GSE135568 (PMID: 32109230) |
| 7 | + |
| 8 | +## Scenario |
| 9 | + |
| 10 | +Kusako is an undergraduate biology student working on a thesis about nutrient signaling in *Saccharomyces cerevisiae*. Her lab studies the TOR pathway, and she's interested in how TOR-regulated genes (e.g., *GAP1*, *MEP2*, *DAL5* — nitrogen permease genes) respond under heat shock. She found a paper about yeast heat shock transcriptomics, but the paper focuses on the RNA binding protein Mip6 and histone modifications — her target genes aren't discussed at all. |
| 11 | + |
| 12 | +She wants to reanalyze the same RNA-seq data to look at expression of her specific gene set. She has Claude Code but no bioinformatics experience. |
| 13 | + |
| 14 | +## Prompt |
| 15 | + |
| 16 | +``` |
| 17 | +I'm studying TOR pathway genes in yeast for my thesis. I found this paper about |
| 18 | +heat shock response (PMID: 32109230) that has RNA-seq data — they looked at Mip6 |
| 19 | +and histone modifications but I want to check what happens to nitrogen permease |
| 20 | +genes like GAP1, MEP2, and DAL5 under the same conditions. Can you get the data |
| 21 | +and run the analysis for me? |
| 22 | +``` |
| 23 | + |
| 24 | +## Expected Agent Behavior |
| 25 | + |
| 26 | +### Step 1: Understand the request |
| 27 | + |
| 28 | +This is a secondary analysis: reuse published RNA-seq data to answer a different biological question. The student doesn't know about workflows, WES, or CWL — the agent handles everything. |
| 29 | + |
| 30 | +### Step 2: Read AGENTS.md |
| 31 | + |
| 32 | +Discover available workflows, the WES submission protocol, and provenance requirements. |
| 33 | + |
| 34 | +### Step 3: Find the data |
| 35 | + |
| 36 | +Read `docs/data-discovery.md`, then: |
| 37 | + |
| 38 | +1. Try TogoID for PMID 32109230 → SRA run accessions; this fails because no direct route exists |
| 39 | +2. **Fall back to PMC full text**: fetch the paper via Europe PMC or NCBI E-utilities |
| 40 | + - Paper: "A multi-omics dataset of heat-shock response in the yeast RNA binding protein Mip6" (PMC7046740) |
| 41 | + - Scan Data Availability / Data Records section for archive identifiers |
| 42 | + - Find: **GSE135568** (GEO series accession) |
| 43 | +3. Use TogoID to convert identifiers: |
| 44 | + - `GET https://api.togoid.dbcls.jp/convert?ids=GSE135568&route=geo_series,bioproject&format=json` |
| 45 | + - Result: **PRJNA559331** |
| 46 | + - `GET https://api.togoid.dbcls.jp/convert?ids=PRJNA559331&route=bioproject,sra_run&format=json` |
| 47 | + - Result: 67 SRA run accessions (SRR9929263–SRR9929329) |
| 48 | +4. Filter for RNA-seq only (the dataset also contains ChIP-seq): |
| 49 | + - Query ENA: `study_accession="PRJNA559331"` with `library_strategy` field |
| 50 | + - 24 RNA-seq runs, 36 ChIP-seq runs — select only RNA-seq |
| 51 | +5. Present the accessions to Kusako with metadata: |
| 52 | + - Organism: *S. cerevisiae* BY4741 |
| 53 | + - Conditions: wild-type vs mip6Δ, 30°C vs 39°C (20 min and 120 min heat shock) |
| 54 | + - Library: paired-end, Illumina HiSeq 2000 |
| 55 | + - Ask which samples she needs (all 24, or a subset for her comparison) |
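The ID conversion and RNA-seq filtering in this step can be sketched as two small helpers. This is a minimal illustration, not the agent's actual implementation: the function names are hypothetical, and the record shape (`run_accession` and `library_strategy` keys) is an assumption about what a parsed ENA response would look like. No network call is made here.

```python
from urllib.parse import urlencode

TOGOID_CONVERT = "https://api.togoid.dbcls.jp/convert"  # endpoint quoted in the steps above


def togoid_convert_url(ids, route):
    """Build a TogoID convert URL for a list of IDs and a (source, target) route."""
    params = {"ids": ",".join(ids), "route": ",".join(route), "format": "json"}
    # safe="," keeps the comma-separated values readable instead of %2C-encoding them
    return f"{TOGOID_CONVERT}?{urlencode(params, safe=',')}"


def select_rnaseq_runs(records):
    """Keep run accessions whose library_strategy is RNA-Seq (case-insensitive)."""
    return [
        r["run_accession"]
        for r in records
        if r.get("library_strategy", "").strip().lower() == "rna-seq"
    ]


if __name__ == "__main__":
    print(togoid_convert_url(["GSE135568"], ("geo_series", "bioproject")))
    # Hypothetical records; the ChIP-Seq accession is a placeholder, not a real run.
    records = [
        {"run_accession": "SRR9929263", "library_strategy": "RNA-Seq"},
        {"run_accession": "SRR_EXAMPLE_CHIP", "library_strategy": "ChIP-Seq"},
    ]
    print(select_rnaseq_runs(records))
```

Keeping URL construction separate from the HTTP request makes the conversion routes easy to check against the ones listed above before any request is sent.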
| 56 | + |
| 57 | +### Step 4: Fetch raw data |
| 58 | + |
| 59 | +Read `workflows/fetchngs/agent.yaml`: |
| 60 | +- Prepare input YAML with confirmed SRA accessions |
| 61 | +- Submit fetchngs to WES, poll until complete |
| 62 | +- Retrieve FASTQ output paths |
| 63 | + |
| 64 | +**For testing:** Use a subset of 2–4 samples to keep runtime short (e.g., 1 wild-type 30°C + 1 wild-type 39°C). |
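The "submit, then poll until complete" pattern used here (and again for rnaseq) might look like the sketch below. It assumes the server follows the GA4GH WES state names; the set of terminal states should be adjusted to the WES version actually deployed. `wait_for_run` and its parameters are hypothetical names, and the state lookup is injected as a callable so the loop can be exercised without a server.

```python
import time

# States treated as terminal (an assumption based on the GA4GH WES State enum);
# anything else, such as QUEUED or RUNNING, means "keep waiting".
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}


def wait_for_run(get_state, interval_s=30.0, timeout_s=3600.0, sleep=time.sleep):
    """Poll get_state() until the run reaches a terminal state or the timeout passes."""
    waited = 0.0
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_s:
            raise TimeoutError(f"run still in state {state!r} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s


if __name__ == "__main__":
    # Simulated state sequence standing in for repeated status requests.
    states = iter(["QUEUED", "RUNNING", "COMPLETE"])
    print(wait_for_run(lambda: next(states), interval_s=0.0))
```

In real use, `get_state` would wrap a request to the WES run status endpoint and return its `state` field; only a `COMPLETE` result should lead on to retrieving FASTQ paths.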
| 65 | + |
| 66 | +### Step 5: Resolve references |
| 67 | + |
| 68 | +Read `references/README.md` + `references/genomes.yaml`: |
| 69 | +- Look up *S. cerevisiae* R64-1-1 (Tier 1 in catalog) |
| 70 | +- rnaseq needs: genome FASTA + GTF + STAR index |
| 71 | +- Check iGenomes: `s3://ngi-igenomes/igenomes/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/STARIndex/` |
| 72 | +- Download pre-built STAR index, OR run `prepare-references` with `build_star: true` |
| 73 | + |
| 74 | +### Step 6: Run RNA-seq quantification |
| 75 | + |
| 76 | +Read `workflows/rnaseq/agent.yaml`: |
| 77 | +- Prepare inputs: FASTQs from step 4, genome FASTA + GTF + STAR index from step 5 |
| 78 | +- Select aligner: STAR (default for model organisms with good annotation) |
| 79 | +- Submit to WES, poll until complete |
| 80 | +- Retrieve gene-level counts and MultiQC report |
| 81 | + |
| 82 | +### Step 7: Retrieve provenance |
| 83 | + |
| 84 | +`GET /runs/{run_id}/ro-crate` for both fetchngs and rnaseq runs. |
| 85 | +Save RO-Crate metadata. |
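One way to sanity-check a retrieved crate is to confirm that its `@graph` contains the entity types this scenario expects (Dataset, ComputationalWorkflow, CreateAction). The sketch below is a lightweight check under that assumption, not a full RO-Crate validator; `missing_crate_types` is a hypothetical helper name, and the input is assumed to be the parsed `ro-crate-metadata.json`.

```python
REQUIRED_TYPES = {"Dataset", "ComputationalWorkflow", "CreateAction"}


def missing_crate_types(crate):
    """Return the required @type values that no entity in the crate's @graph declares."""
    found = set()
    for entity in crate.get("@graph", []):
        types = entity.get("@type", [])
        # JSON-LD allows @type to be a single string or a list of strings
        found.update([types] if isinstance(types, str) else types)
    return REQUIRED_TYPES - found


if __name__ == "__main__":
    # Tiny hand-written crate for illustration; real crates carry many more entities.
    crate = {
        "@graph": [
            {"@id": "./", "@type": "Dataset"},
            {"@id": "workflow", "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"]},
        ]
    }
    print(missing_crate_types(crate))  # CreateAction is absent from this example
```

An empty result for both the fetchngs and rnaseq crates would indicate that the expected entity types are present.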
| 86 | + |
| 87 | +### Step 8: Present results in student-friendly terms |
| 88 | + |
| 89 | +- Extract expression levels of GAP1, MEP2, DAL5 from the gene count matrix |
| 90 | +- Compare across conditions (30°C vs 39°C heat shock) |
| 91 | +- Explain what the counts mean (a higher count means more reads assigned to the gene, which suggests higher transcript abundance) |
| 92 | +- Point to MultiQC report for data quality |
| 93 | +- Note that these are raw counts (not normalized for sequencing depth), not a differential expression analysis; suggest DESeq2 as the next step if she sees interesting patterns |
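Pulling the three target genes out of the count matrix could be done roughly as follows. The sketch assumes a tab-separated matrix with `gene_id` and `gene_name` columns followed by one column per sample; the real output layout should be checked before relying on it. `extract_counts` is a hypothetical helper, and the numbers in the demo are made up for illustration, not results from GSE135568.

```python
import csv
import io

TARGET_GENES = ("GAP1", "MEP2", "DAL5")


def extract_counts(tsv_text, genes=TARGET_GENES, name_col="gene_name"):
    """Return {gene: {sample: count}} for the requested gene names."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    id_cols = {"gene_id", name_col}  # columns that identify the gene, not samples
    out = {}
    for row in reader:
        if row[name_col] in genes:
            out[row[name_col]] = {
                sample: float(value)
                for sample, value in row.items()
                if sample not in id_cols
            }
    return out


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    demo = (
        "gene_id\tgene_name\tSRR9929263\tSRR9929265\n"
        "YKR039W\tGAP1\t120\t45\n"
        "YFL039C\tACT1\t5000\t5100\n"
    )
    for gene, per_sample in extract_counts(demo).items():
        print(gene, per_sample)
```

Genes missing from the returned dictionary are worth reporting to the student explicitly, since a name mismatch in the annotation is more likely than a truly absent gene.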
| 94 | + |
| 95 | +## Verified Data Path |
| 96 | + |
| 97 | +The data-discovery and reference-lookup steps of this scenario have been verified: |
| 98 | + |
| 99 | +| Step | Input | Method | Output | Verified | |
| 100 | +|------|-------|--------|--------|----------| |
| 101 | +| PMC lookup | PMID 32109230 | Europe PMC / NCBI efetch | PMC7046740 (open access) | Yes | |
| 102 | +| Text mining | PMC7046740 full text | Scan Data Records section | GSE135568 | Yes | |
| 103 | +| ID conversion | GSE135568 | TogoID geo_series→bioproject | PRJNA559331 | Yes | |
| 104 | +| ID conversion | PRJNA559331 | TogoID bioproject→sra_run | 67 SRR accessions | Yes | |
| 105 | +| RNA-seq filter | 67 runs | ENA API library_strategy filter | 24 RNA-seq runs | Yes | |
| 106 | +| Genome lookup | *S. cerevisiae* | references/genomes.yaml | R64-1-1, Ensembl 115 | Yes | |
| 107 | +| STAR index | R64-1-1 | iGenomes S3 | Available | Yes | |
| 108 | + |
| 109 | +## Success Criteria |
| 110 | + |
| 111 | +- [ ] Agent fetches PMC full text and extracts GSE135568 from the paper |
| 112 | +- [ ] Agent converts GSE135568 → PRJNA559331 → SRR accessions via TogoID |
| 113 | +- [ ] Agent filters to RNA-seq runs only (24 of 67) |
| 114 | +- [ ] Agent presents sample metadata and asks Kusako to confirm |
| 115 | +- [ ] fetchngs completes and produces FASTQ files |
| 116 | +- [ ] Yeast reference resolved from genomes.yaml without manual intervention |
| 117 | +- [ ] rnaseq completes with STAR alignment and gene quantification |
| 118 | +- [ ] Agent reports expression values for GAP1, MEP2, DAL5 |
| 119 | +- [ ] RO-Crate retrieved and valid for both workflow runs (Dataset + ComputationalWorkflow + CreateAction) |
| 120 | + |
| 121 | +## Test Subset (for CI/quick validation) |
| 122 | + |
| 123 | +For automated testing, use 2 samples to keep runtime under 30 minutes: |
| 124 | + |
| 125 | +| Run | Condition | Genotype | |
| 126 | +|-----|-----------|----------| |
| 127 | +| SRR9929263 | 30°C (control) | BY4741 wild-type | |
| 128 | +| SRR9929265 | 39°C 20 min (heat shock) | BY4741 wild-type | |
| 129 | + |
| 130 | +These two samples provide the minimal comparison: control vs heat shock in wild-type, which is what Kusako needs to see if her target genes respond to heat stress. |