
Commit 862af51

dpark01 and claude committed
Add Terra performance analysis best practices to CLAUDE.md
Document lessons learned from regression analysis of filter_bam_to_taxa:
- Use GCS file timestamps (not Python logs) for accurate task timing
- Use wildcards in gcloud storage ls for efficient batch queries
- Handle attempt-* directories for preempted tasks
- Match samples to workflows by scanning stderr content

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 5e37f5f commit 862af51

1 file changed: CLAUDE.md (47 additions & 0 deletions)

@@ -188,3 +188,50 @@ Image versions are pinned in `requirements-modules.txt` and must be kept in sync

## Dockstore Integration

Workflows are registered on Dockstore for easy import to Terra, DNAnexus, and other platforms. The `.dockstore.yml` file defines all published workflows and their test parameter files.

## Terra Performance Analysis

When analyzing workflow performance from Terra submissions, use the Terra MCP tools for structure/status queries and direct GCS access for log analysis.
### Timing Methodology for WDL Tasks

When measuring task execution time from Terra logs:

1. **Start time**: Use first Python log timestamp in stderr
   - Pattern: `^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+`

2. **End time**: Use GCS file modification timestamp of stderr
   - Get via: `gcloud storage ls -l <path>/stderr`
   - This captures ALL execution, including post-Python BAM I/O

3. **Why not use Python log end time?**
   - Many tasks run external tools (Java, pysam) after Python logging ends
   - Python logs don't capture full execution time
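The two-timestamp recipe above can be sketched as a small Python helper (a hypothetical function, not part of the repo; it assumes log timestamps are UTC and that the GCS mtime has already been parsed into a timezone-aware `datetime`):

```python
import re
from datetime import datetime, timezone

# First Python log line in stderr marks the start (pattern from the list above).
LOG_TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+")

def task_wall_seconds(stderr_text: str, gcs_mtime: datetime) -> float:
    """Elapsed seconds from the first Python log line to stderr's GCS timestamp."""
    for line in stderr_text.splitlines():
        m = LOG_TS.match(line)
        if m:
            start = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            start = start.replace(tzinfo=timezone.utc)  # assumption: logs are UTC
            return (gcs_mtime - start).total_seconds()
    raise ValueError("no Python log timestamp found in stderr")
```

Using the GCS mtime as the end point is what captures the post-Python BAM I/O noted in step 2.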
### Efficient GCS Queries with Wildcards

Use wildcards to batch GCS queries instead of iterating:

```bash
# Get all stderr files from a submission with timestamps in one query
gcloud storage ls -l "gs://bucket/submissions/<sub_id>/classify_single/*/call-deplete/stderr"
gcloud storage ls -l "gs://bucket/submissions/<sub_id>/classify_single/*/call-deplete/attempt-*/stderr"
```
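A sketch of consuming that batched listing in Python. The column layout assumed here (`size  ISO-8601 timestamp  gs:// URL` per line) should be verified against your `gcloud` version's actual output before relying on it:

```python
from datetime import datetime

def parse_ls_l(output: str) -> dict:
    """Map each gs:// path in `gcloud storage ls -l` output to its timestamp.

    Assumes data lines look like: "<size>  <ISO-8601 timestamp>  <gs:// URL>";
    summary lines (e.g. TOTAL) are skipped because they don't fit that shape.
    """
    result = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2].startswith("gs://"):
            result[parts[2]] = datetime.fromisoformat(parts[1].replace("Z", "+00:00"))
    return result
```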
### Handling Preemption Retries

When a task is preempted, Cromwell creates `attempt-*` directories:

```
call-deplete/
  stderr          # First attempt (may be incomplete)
  attempt-2/      # Second attempt
    stderr        # Final successful run
```

**Always use the final (highest-numbered) attempt** for performance analysis - preemption time shouldn't count against code performance.
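A minimal sketch of selecting that final attempt from a list of stderr paths (hypothetical helper; the `attempt-N` naming follows the layout shown above):

```python
import re

def final_stderr(paths):
    """Return the stderr path of the highest-numbered attempt, else the base stderr."""
    best, best_n = None, 0
    for p in paths:
        m = re.search(r"attempt-(\d+)/stderr$", p)
        if m and int(m.group(1)) > best_n:
            # A retry outranks the base stderr and any lower-numbered attempt.
            best, best_n = p, int(m.group(1))
        elif best is None and p.endswith("/stderr"):
            best = p  # base stderr: only used if no attempt-N seen yet
    return best
```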
### Sample Identification

To identify which workflow corresponds to which sample:

1. Read first few KB of stderr from each workflow
2. Look for sample name in BAM file paths (e.g., `/S20.l1.xxxx.bam`)
3. Cache the sample-to-workflow mapping for reuse
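The steps above could be sketched as (hypothetical helpers; the BAM-path regex matches the `/S20.l1.xxxx.bam` example and would need adjusting for other naming schemes):

```python
import re

# Assumed naming scheme: /<sample>.l<lane>.<suffix>.bam, as in /S20.l1.xxxx.bam
BAM_PATH = re.compile(r"/([A-Za-z0-9_-]+)\.l\d+\.[^/]*\.bam")

def sample_from_stderr(stderr_head):
    """Return the first sample name seen in a BAM file path, or None."""
    m = BAM_PATH.search(stderr_head)
    return m.group(1) if m else None

def map_samples(workflow_stderrs):
    """Build a cacheable {workflow_id: sample} mapping from stderr excerpts."""
    return {wf: sample_from_stderr(text) for wf, text in workflow_stderrs.items()}
```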
