@@ -164,11 +164,13 @@ GitHub Actions (`.github/workflows/build.yml`) runs on all PRs and pushes:
 - Supports novoalign, bwa, or minimap2 aligners
 - Primary workflow for viral genome assembly
 
-- **assemble_denovo.wdl**: De novo assembly with SPAdes
+- **assemble_denovo_metagenomic.wdl**: De novo metagenomic assembly with SPAdes
 
-- **classify_kraken2.wdl**: Taxonomic classification of reads
+- **classify_single.wdl**: Taxonomic classification and depletion pipeline
 
-- **sarscov2_illumina_full.wdl**: Complete SARS-CoV-2 analysis pipeline
+- **nextclade_single.wdl**: Nextclade analysis for single samples
+
+- **genbank_single.wdl**: GenBank submission preparation for single samples
 
 - **augur_from_assemblies.wdl**: Nextstrain phylogenetic analysis from assemblies
 
@@ -195,7 +197,31 @@ When analyzing workflow performance from Terra submissions, use the Terra MCP to
 
 ### Timing Methodology for WDL Tasks
 
-When measuring task execution time from Terra logs:
+**Preferred method - use `get_batch_job_status`:**
+
+The Terra MCP's `get_batch_job_status` tool returns timing data directly from the Google Batch API:
+
+```
+get_batch_job_status(
+    workspace_namespace="<namespace>",
+    workspace_name="<workspace>",
+    submission_id="<submission-uuid>",
+    workflow_id="<workflow-uuid>",
+    task_name="<task_name>",
+    shard_index=<optional>,
+    attempt=<optional>
+)
+```
+
+Returns timing in the `batch_job.timing` field:
+- **run_duration**: Actual task execution time (what you usually want for performance analysis)
+- **pre_run_duration**: Queue and setup time (VM provisioning, Docker pull, etc.)
+
+This is more accurate than log-based methods because it captures the complete execution, including post-script I/O operations.
+
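As a sketch of how the two timing fields combine: `run_duration` alone measures execution, and adding `pre_run_duration` gives total wall time. The duration format below (seconds strings like `"142s"`) is an assumption for illustration, not a documented contract of the tool.

```python
def parse_duration(d: str) -> float:
    # Assumed format: a seconds string such as "142s".
    return float(d.rstrip("s"))

def timing_summary(timing: dict) -> dict:
    """Split total wall time into setup vs. execution.

    `timing` is assumed to mirror the batch_job.timing field, e.g.
    {"run_duration": "142s", "pre_run_duration": "63s"}.
    """
    run = parse_duration(timing["run_duration"])
    pre = parse_duration(timing["pre_run_duration"])
    return {"run_s": run, "setup_s": pre, "total_s": run + pre}

print(timing_summary({"run_duration": "142s", "pre_run_duration": "63s"}))
# → {'run_s': 142.0, 'setup_s': 63.0, 'total_s': 205.0}
```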
+**Alternative method - log-based timing (for detailed analysis):**
+
+When you need finer-grained timing within a task (e.g., timing individual steps):
 
 1. **Start time**: Use first Python log timestamp in stderr
    - Pattern: `^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+`
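The start-time extraction above can be sketched in Python using the exact pattern from step 1 (the sample log line is made up):

```python
import re
from datetime import datetime

# Pattern from step 1: Python log timestamps at the start of a stderr line.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+", re.MULTILINE)

def task_start(stderr_text: str) -> datetime:
    """Return the first Python log timestamp in a stderr log."""
    m = TS.search(stderr_text)
    if m is None:
        raise ValueError("no Python log timestamp found in stderr")
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")

stderr = "2024-05-01 12:00:03,123 INFO starting alignment\n"
print(task_start(stderr))  # → 2024-05-01 12:00:03
```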
@@ -210,6 +236,8 @@ When measuring task execution time from Terra logs:
 
 ### Efficient GCS Queries with Wildcards
 
+**Always use `gcloud storage` instead of `gsutil`** - it's faster, more reliable, and the preferred CLI for GCS operations.
+
 Use wildcards to batch GCS queries instead of iterating:
 ```bash
 # Get all stderr files from a submission with timestamps in one query
@@ -235,3 +263,58 @@ To identify which workflow corresponds to which sample:
 1. Read first few KB of stderr from each workflow
 2. Look for sample name in BAM file paths (e.g., `/S20.l1.xxxx.bam`)
 3. Cache the sample-to-workflow mapping for reuse
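A minimal sketch of steps 1-3, assuming the sample name is the path component before the first `.` in the BAM filename; the helper and its regex are illustrative, not part of the Terra MCP:

```python
import re

# Illustrative: sample name assumed to precede the first "." in the BAM filename.
BAM_PATH = re.compile(r"/([A-Za-z0-9_-]+)\.[^/\s]*\.bam\b")

def map_samples_to_workflows(stderr_by_workflow: dict) -> dict:
    """Scan the first few KB of each workflow's stderr for a BAM path
    and build a sample -> workflow_id mapping that can be cached for reuse."""
    mapping = {}
    for wf_id, stderr in stderr_by_workflow.items():
        m = BAM_PATH.search(stderr)
        if m:
            mapping[m.group(1)] = wf_id
    return mapping

logs = {"wf-1": "reading /cromwell_root/S20.l1.0000.bam ...",
        "wf-2": "reading /cromwell_root/S21.l1.0000.bam ..."}
print(map_samples_to_workflows(logs))  # → {'S20': 'wf-1', 'S21': 'wf-2'}
```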
+
+### Debugging Infrastructure-Level Failures
+
+Some workflow failures have errors that aren't visible in standard stderr logs. These include:
+- Docker pull failures (rate limits, image not found, auth errors)
+- VM provisioning failures
+- Preemption before task execution started
+- Network connectivity issues during container setup
+
+**Signs you need Batch logs instead of stderr:**
+- Batch reports exit code 0 (success) but the task is marked as failed ("GCP Batch task exited with Success(0)")
+- Error message says "The job was stopped before the command finished"
+- stderr is empty or very short
+- Error message says "Executor error" without details
+- Task failed instantly (0 seconds runtime)
+- `get_job_metadata` summary shows failure but no useful error message
+
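These signs can be folded into a small triage helper; the phrases are taken from the list above, and the stderr-length threshold is an illustrative choice:

```python
def needs_batch_logs(error_msg: str, stderr_text: str, runtime_s: float) -> bool:
    """Return True when failure signals point at infrastructure
    rather than the task's own code."""
    infra_phrases = (
        "GCP Batch task exited with Success(0)",
        "The job was stopped before the command finished",
        "Executor error",
    )
    if any(p in error_msg for p in infra_phrases):
        return True
    # Empty/near-empty stderr or an instant failure: the task likely never ran.
    return len(stderr_text.strip()) < 200 or runtime_s == 0

print(needs_batch_logs("Executor error", "long stderr " * 50, 120.0))  # → True
```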
+**Use `get_batch_job_status` to diagnose infrastructure failures:**
+
+The Terra MCP provides `get_batch_job_status`, which queries the Google Batch API directly:
+
+```
+get_batch_job_status(
+    workspace_namespace="<namespace>",
+    workspace_name="<workspace>",
+    submission_id="<submission-uuid>",
+    workflow_id="<workflow-uuid>",
+    task_name="<task_name>",
+    shard_index=<optional>,  # For scattered tasks
+    attempt=<optional>       # For retried tasks
+)
+```
+
+The tool returns:
+- **Batch job status**: QUEUED, SCHEDULED, RUNNING, SUCCEEDED, or FAILED
+- **Timing**: run_duration and pre_run_duration (queue/setup time)
+- **Resources**: machine_type, CPU, memory, disk sizes
+- **Status events**: State transitions with timestamps
+- **Detected issues**: Auto-detected problems with severity and suggestions
+- **Cloud Logging query**: Ready-to-use gcloud command for deeper debugging
+
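A response could then be scanned for actionable problems along these lines; the field names and shapes below are assumptions about the tool's output, not a documented schema:

```python
def summarize_issues(response: dict) -> list:
    """Pull high-severity detected issues out of a get_batch_job_status
    response (assumed shape: {"detected_issues": [{"severity": ...,
    "message": ..., "suggestion": ...}, ...]})."""
    return [
        f"{i['message']} -> {i['suggestion']}"
        for i in response.get("detected_issues", [])
        if i.get("severity") == "high"
    ]

resp = {"detected_issues": [
    {"severity": "high", "message": "Failed to pull image",
     "suggestion": "Check image name, tag, and registry auth"},
    {"severity": "low", "message": "Long queue time",
     "suggestion": "Consider a different region"},
]}
print(summarize_issues(resp))
# → ['Failed to pull image -> Check image name, tag, and registry auth']
```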
+**Recommended debugging workflow:**
+1. `get_submission_status` → identify failed workflows
+2. `get_job_metadata` (summary mode) → identify failed tasks and error messages
+3. `get_workflow_logs` → check stderr for application errors
+4. `get_batch_job_status` → check for infrastructure issues if the logs don't explain the failure
+
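The escalation order can be sketched as a driver that takes the four tools as injected callables; the wrapper signatures here are hypothetical simplifications of the MCP tools, shown only to illustrate the ordering:

```python
def triage_submission(get_submission_status, get_job_metadata,
                      get_workflow_logs, get_batch_job_status):
    """Walk the four steps above for the first failed workflow/task."""
    failed_workflows = get_submission_status()       # 1. failed workflows
    wf = failed_workflows[0]
    failed_task = get_job_metadata(wf)               # 2. failed task name
    stderr = get_workflow_logs(wf, failed_task)      # 3. application errors?
    if stderr.strip():
        return f"application error in {failed_task}: see stderr"
    batch = get_batch_job_status(wf, failed_task)    # 4. infrastructure
    return f"infrastructure issue in {failed_task}: {batch['status']}"

# Stubbed illustration of the escalation when stderr is empty:
print(triage_submission(
    lambda: ["wf-1"],
    lambda wf: "align.map_reads",
    lambda wf, task: "",
    lambda wf, task: {"status": "FAILED"},
))  # → infrastructure issue in align.map_reads: FAILED
```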
+**Common failure patterns detected:**
+- `"Failed to pull image"` - Check image name, tag, and registry auth
+- `"429 Too Many Requests"` - Registry rate limit; retry later
+- `"manifest unknown"` - Image tag doesn't exist
+- `"unauthorized"` - Service account lacks permission to pull from the registry
+- `"PREEMPTED"` - VM was preempted, usually retried automatically
+- `"exit code 137"` - OOM killed (out of memory)
+- `"exit code 1"` - Application error in the task script
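These signatures lend themselves to an ordered lookup table; a minimal sketch with advice strings paraphrased from the list above:

```python
# Substring -> suggested next step, paraphrased from the patterns above.
# Order matters: "exit code 137" must be checked before "exit code 1",
# since the latter is a substring of the former.
FAILURE_PATTERNS = [
    ("Failed to pull image", "Check image name, tag, and registry auth"),
    ("429 Too Many Requests", "Registry rate limit; retry later"),
    ("manifest unknown", "Image tag doesn't exist"),
    ("unauthorized", "Service account lacks pull permission on the registry"),
    ("PREEMPTED", "VM was preempted; usually retried automatically"),
    ("exit code 137", "OOM killed; increase task memory"),
    ("exit code 1", "Application error; read the task script's stderr"),
]

def diagnose(message: str) -> str:
    for pattern, advice in FAILURE_PATTERNS:
        if pattern in message:
            return advice
    return "Unrecognized failure; inspect Cloud Logging with the suggested query"

print(diagnose("docker: Failed to pull image us.gcr.io/x:latest"))
# → Check image name, tag, and registry auth
```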