
Commit 5457b7e: Merge pull request #629 from broadinstitute/dp-assemble

Skip indel realignment for large BAMs in align_reads task

2 parents: 9c7a822 + 6f1ca3a

14 files changed: +144 −49 lines
CLAUDE.md
Lines changed: 87 additions & 4 deletions

@@ -164,11 +164,13 @@ GitHub Actions (`.github/workflows/build.yml`) runs on all PRs and pushes:
 - Supports novoalign, bwa, or minimap2 aligners
 - Primary workflow for viral genome assembly
 
-- **assemble_denovo.wdl**: De novo assembly with SPAdes
+- **assemble_denovo_metagenomic.wdl**: De novo metagenomic assembly with SPAdes
 
-- **classify_kraken2.wdl**: Taxonomic classification of reads
+- **classify_single.wdl**: Taxonomic classification and depletion pipeline
 
-- **sarscov2_illumina_full.wdl**: Complete SARS-CoV-2 analysis pipeline
+- **nextclade_single.wdl**: Nextclade analysis for single samples
+
+- **genbank_single.wdl**: GenBank submission preparation for single samples
 
 - **augur_from_assemblies.wdl**: Nextstrain phylogenetic analysis from assemblies
 
@@ -195,7 +197,31 @@ When analyzing workflow performance from Terra submissions, use the Terra MCP to
 
 ### Timing Methodology for WDL Tasks
 
-When measuring task execution time from Terra logs:
+**Preferred method - use `get_batch_job_status`:**
+
+The Terra MCP's `get_batch_job_status` tool returns timing data directly from the Google Batch API:
+
+```
+get_batch_job_status(
+    workspace_namespace="<namespace>",
+    workspace_name="<workspace>",
+    submission_id="<submission-uuid>",
+    workflow_id="<workflow-uuid>",
+    task_name="<task_name>",
+    shard_index=<optional>,
+    attempt=<optional>
+)
+```
+
+Returns timing in the `batch_job.timing` field:
+- **run_duration**: Actual task execution time (what you usually want for performance analysis)
+- **pre_run_duration**: Queue and setup time (VM provisioning, Docker pull, etc.)
+
+This is more accurate than log-based methods because it captures the complete execution, including post-script I/O operations.
+
+**Alternative method - log-based timing (for detailed analysis):**
+
+When you need finer-grained timing within a task (e.g., timing individual steps):
 
 1. **Start time**: Use first Python log timestamp in stderr
    - Pattern: `^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+`
@@ -210,6 +236,8 @@ When measuring task execution time from Terra logs:
 
 ### Efficient GCS Queries with Wildcards
 
+**Always use `gcloud storage` instead of `gsutil`** - it's faster, more reliable, and the preferred CLI for GCS operations.
+
 Use wildcards to batch GCS queries instead of iterating:
 ```bash
 # Get all stderr files from a submission with timestamps in one query
@@ -235,3 +263,58 @@ To identify which workflow corresponds to which sample:
 1. Read first few KB of stderr from each workflow
 2. Look for sample name in BAM file paths (e.g., `/S20.l1.xxxx.bam`)
 3. Cache the sample-to-workflow mapping for reuse
+
+### Debugging Infrastructure-Level Failures
+
+Some workflow failures have errors that aren't visible in standard stderr logs. These include:
+- Docker pull failures (rate limits, image not found, auth errors)
+- VM provisioning failures
+- Preemption before task execution started
+- Network connectivity issues during container setup
+
+**Signs you need Batch logs instead of stderr:**
+- Batch reports exit code 0 (success) but the task is marked as failed ("GCP Batch task exited with Success(0)")
+- Error message says "The job was stopped before the command finished"
+- stderr is empty or very short
+- Error message says "Executor error" without details
+- Task failed instantly (0 seconds runtime)
+- `get_job_metadata` summary shows failure but no useful error message
+
+**Use `get_batch_job_status` to diagnose infrastructure failures:**
+
+The Terra MCP provides `get_batch_job_status`, which queries the Google Batch API directly:
+
+```
+get_batch_job_status(
+    workspace_namespace="<namespace>",
+    workspace_name="<workspace>",
+    submission_id="<submission-uuid>",
+    workflow_id="<workflow-uuid>",
+    task_name="<task_name>",
+    shard_index=<optional>,  # For scattered tasks
+    attempt=<optional>       # For retried tasks
+)
+```
+
+The tool returns:
+- **Batch job status**: QUEUED, SCHEDULED, RUNNING, SUCCEEDED, or FAILED
+- **Timing**: run_duration and pre_run_duration (queue/setup time)
+- **Resources**: machine_type, CPU, memory, disk sizes
+- **Status events**: State transitions with timestamps
+- **Detected issues**: Auto-detected problems with severity and suggestions
+- **Cloud Logging query**: Ready-to-use gcloud command for deeper debugging
+
+**Recommended debugging workflow:**
+1. `get_submission_status` → identify failed workflows
+2. `get_job_metadata` (summary mode) → identify failed tasks and error messages
+3. `get_workflow_logs` → check stderr for application errors
+4. `get_batch_job_status` → check infrastructure issues if logs don't explain the failure
+
+**Common failure patterns detected:**
+- `"Failed to pull image"` - Check image name, tag, and registry auth
+- `"429 Too Many Requests"` - Registry rate limit, retry later
+- `"manifest unknown"` - Image tag doesn't exist
+- `"unauthorized"` - Service account lacks permission to pull from registry
+- `"PREEMPTED"` - VM was preempted, usually retried automatically
+- `"exit code 137"` - OOM killed (out of memory)
+- `"exit code 1"` - Application error in the task script
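The log-based timing method described in the CLAUDE.md changes (first and last Python log timestamps in stderr, matched with the quoted regex) can be sketched in Python. The stderr excerpt below is invented for illustration; only the timestamp pattern comes from the notes, and the "last timestamp as end time" convention is an assumption:

```python
import re
from datetime import datetime

# Pattern quoted in the notes: Python log timestamps at the start of stderr lines,
# e.g. "2024-05-01 12:00:03,123 - assembly.py - INFO - ..."
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+")

def run_duration_seconds(stderr_text: str) -> float:
    """Estimate task runtime as last minus first timestamped log line (an assumption)."""
    stamps = [
        datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        for line in stderr_text.splitlines()
        if (m := TIMESTAMP_RE.match(line))
    ]
    if len(stamps) < 2:
        raise ValueError("need at least two timestamped log lines")
    return (stamps[-1] - stamps[0]).total_seconds()

# Hypothetical stderr excerpt (not from a real run):
stderr = """\
2024-05-01 12:00:03,123 - assembly.py - INFO - starting alignment
some untimestamped tool output
2024-05-01 12:07:33,456 - assembly.py - INFO - done
"""
print(run_duration_seconds(stderr))  # 450.0
```

As the notes say, prefer `get_batch_job_status` timing when available; this log-based estimate misses post-script I/O.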

pipes/WDL/tasks/tasks_assembly.wdl
Lines changed: 7 additions & 1 deletion

@@ -714,7 +714,12 @@ task align_reads {
     String sample_name = basename(basename(basename(reads_unmapped_bam, ".bam"), ".taxfilt"), ".clean")
 }
 
-Int disk_size = ceil((6 * size(reads_unmapped_bam, "GB") + 2 * size(reference_fasta, "GB") + 100) / 375.0) * 375
+# Note: GCP local SSDs must be allocated in pairs (2, 4, 8, 16, 24 × 375GB), so we round to 750GB multiples.
+Int disk_size = ceil((6 * size(reads_unmapped_bam, "GB") + 2 * size(reference_fasta, "GB") + 100) / 750.0) * 750
+
+# Skip indel realignment for large BAMs (>1GB) to save runtime
+Float reads_bam_size_gb = size(reads_unmapped_bam, "GB")
+Boolean skip_realign = reads_bam_size_gb >= 1.0
 
 # Autoscale CPU based on input size: 8 CPUs for small inputs, up to 64 CPUs for ~15 GB inputs
 # Linear scaling: 8 + (input_GB / 15) * 56, capped at 64, rounded to nearest multiple of 4
@@ -773,6 +778,7 @@ task align_reads {
         --aligner ~{aligner} \
         ~{'--aligner_options "' + aligner_options + '"'} \
         ~{true='--skipMarkDupes' false="" skip_mark_dupes} \
+        ~{true='--skipRealign' false="" skip_realign} \
         --JVMmemory "$mem_in_mb"m \
         ~{"--NOVOALIGN_LICENSE_PATH=" + novocraft_license} \
         --loglevel=DEBUG
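The `align_reads` sizing expressions in this hunk are plain arithmetic, so they can be checked outside WDL. A Python sketch of the same logic (input sizes are illustrative, and the function names are mine, not the task's):

```python
import math

def align_reads_disk_gb(reads_bam_gb: float, ref_fasta_gb: float) -> int:
    """Mirror of the WDL disk_size expression: scratch estimate rounded up to a
    750 GB multiple, since GCP local SSDs are allocated in pairs of 375 GB."""
    needed = 6 * reads_bam_gb + 2 * ref_fasta_gb + 100
    return math.ceil(needed / 750.0) * 750

def skip_realign(reads_bam_gb: float) -> bool:
    """Mirror of the skip_realign flag: skip indel realignment at >= 1 GB input."""
    return reads_bam_gb >= 1.0

print(align_reads_disk_gb(10.0, 0.1))   # 6*10 + 0.2 + 100 = 160.2 -> 750
print(align_reads_disk_gb(120.0, 0.1))  # 820.2 -> 1500
print(skip_realign(0.5), skip_realign(2.0))  # False True
```

Note the change of units versus the old code: the divisor and multiplier move together from 375 to 750, so small inputs now get a 750 GB minimum rather than 375 GB.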

pipes/WDL/tasks/tasks_demux.wdl
Lines changed: 2 additions & 1 deletion

@@ -180,7 +180,8 @@ task illumina_demux {
 
     # --- options for VM shape ----------------------
     Int? machine_mem_gb
-    Int disk_size = 2625
+    # Note: GCP local SSDs must be allocated in pairs (2, 4, 8, 16, 24 × 375GB), so use 3000 (8 SSDs) instead of 2625 (7 SSDs)
+    Int disk_size = 3000
     String docker = "quay.io/broadinstitute/viral-core:2.5.21"
 }
 
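The SSD arithmetic in the demux comment above can be verified directly. This sketch assumes the allocation rule exactly as stated in the diff comments (375 GB units, allowed counts 2, 4, 8, 16, 24); the helper name is hypothetical:

```python
import math

# Allowed local SSD counts per the diff comments: 2, 4, 8, 16, 24 (375 GB each).
ALLOWED_SSD_COUNTS = (2, 4, 8, 16, 24)

def local_ssd_disk_gb(requested_gb: int) -> int:
    """Round a disk request up to the next allowed local-SSD allocation."""
    needed = math.ceil(requested_gb / 375)
    count = min(c for c in ALLOWED_SSD_COUNTS if c >= needed)
    return count * 375

print(local_ssd_disk_gb(2625))  # 2625 GB = 7 SSDs -> next allowed count is 8 -> 3000 GB
```

This matches the comment's reasoning: 2625 GB would need 7 SSDs, which is not an allowed count, so the task requests 3000 GB (8 SSDs).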
pipes/WDL/tasks/tasks_interhost.wdl
Lines changed: 3 additions & 3 deletions

@@ -160,7 +160,7 @@ task multi_align_mafft_ref {
     Float? mafft_gapOpeningPenalty
 
     Int? machine_mem_gb
-    String docker = "quay.io/broadinstitute/viral-phylo:2.5.16.0"
+    String docker = "quay.io/broadinstitute/viral-phylo:2.5.21.0"
 }
 
 String fasta_basename = basename(reference_fasta, '.fasta')
@@ -207,7 +207,7 @@ task multi_align_mafft {
     Float? mafft_gapOpeningPenalty
 
     Int? machine_mem_gb
-    String docker = "quay.io/broadinstitute/viral-phylo:2.5.16.0"
+    String docker = "quay.io/broadinstitute/viral-phylo:2.5.21.0"
 }
 
 Int disk_size = 200
@@ -476,7 +476,7 @@ task merge_vcfs_gatk {
     File ref_fasta
 
     Int? machine_mem_gb
-    String docker = "quay.io/broadinstitute/viral-phylo:2.5.16.0"
+    String docker = "quay.io/broadinstitute/viral-phylo:2.5.21.0"
 
     String output_prefix = "merged"
 }

pipes/WDL/tasks/tasks_intrahost.wdl
Lines changed: 4 additions & 4 deletions

@@ -136,7 +136,7 @@ task lofreq {
     File reference_fasta
 
     String out_basename = basename(aligned_bam, '.bam')
-    String docker = "quay.io/broadinstitute/viral-phylo:2.5.16.0"
+    String docker = "quay.io/broadinstitute/viral-phylo:2.5.21.0"
 }
 Int disk_size = 200
 command <<<
@@ -196,7 +196,7 @@ task isnvs_per_sample {
     Boolean removeDoublyMappedReads = true
 
     Int? machine_mem_gb
-    String docker = "quay.io/broadinstitute/viral-phylo:2.5.16.0"
+    String docker = "quay.io/broadinstitute/viral-phylo:2.5.21.0"
 
     String sample_name = basename(basename(basename(mapped_bam, ".bam"), ".all"), ".mapped")
 }
@@ -239,7 +239,7 @@ task isnvs_vcf {
     Boolean naiveFilter = false
 
     Int? machine_mem_gb
-    String docker = "quay.io/broadinstitute/viral-phylo:2.5.16.0"
+    String docker = "quay.io/broadinstitute/viral-phylo:2.5.21.0"
 }
 
 parameter_meta {
@@ -313,7 +313,7 @@ task annotate_vcf_snpeff {
     String? emailAddress
 
     Int? machine_mem_gb
-    String docker = "quay.io/broadinstitute/viral-phylo:2.5.16.0"
+    String docker = "quay.io/broadinstitute/viral-phylo:2.5.21.0"
 
     String output_basename = basename(basename(in_vcf, ".gz"), ".vcf")
 }

pipes/WDL/tasks/tasks_megablast.wdl
Lines changed: 1 addition & 1 deletion

@@ -75,7 +75,7 @@ task lca_megablast {
     Int cpu = 16
     Int disk_size_gb = 300
 
-    String docker = "quay.io/broadinstitute/viral-classify:2.5.20.0"
+    String docker = "quay.io/broadinstitute/viral-classify:2.5.21.0"
 }
 parameter_meta {
     trimmed_fasta: {

pipes/WDL/tasks/tasks_metagenomics.wdl
Lines changed: 12 additions & 11 deletions

@@ -218,7 +218,7 @@ task kraken2 {
     Int? min_base_qual
 
     Int machine_mem_gb = 90
-    String docker = "quay.io/broadinstitute/viral-classify:2.5.20.0"
+    String docker = "quay.io/broadinstitute/viral-classify:2.5.21.0"
 }
 
 parameter_meta {
@@ -246,9 +246,10 @@ task kraken2 {
 
 # Disk autoscaling: BAM->FASTQ expansion is ~7-8x, plus kraken2 reads output (~1x input),
 # plus kraken2 database (1x localized tarball + 2x decompressed = 3x), plus overhead for krona and temp files.
-# Minimum 375GB to accommodate typical database sizes.
-Int disk_size_auto = ceil((8 * size(reads_bam, "GB") + 3 * size(kraken2_db_tgz, "GB") + 50) / 375.0) * 375
-Int disk_size = if disk_size_auto < 375 then 375 else disk_size_auto
+# Minimum 750GB to accommodate typical database sizes.
+# Note: GCP local SSDs must be allocated in pairs (2, 4, 8, 16, 24 × 375GB), so we round to 750GB multiples.
+Int disk_size_auto = ceil((8 * size(reads_bam, "GB") + 3 * size(kraken2_db_tgz, "GB") + 50) / 750.0) * 750
+Int disk_size = if disk_size_auto < 750 then 750 else disk_size_auto
 
 command <<<
 set -ex -o pipefail
@@ -350,7 +351,7 @@ task report_primary_kraken_taxa {
     File kraken_summary_report
     String focal_taxon = "Viruses"
 
-    String docker = "quay.io/broadinstitute/viral-classify:2.5.20.0"
+    String docker = "quay.io/broadinstitute/viral-classify:2.5.21.0"
 }
 String out_basename = basename(kraken_summary_report, '.txt')
 Int disk_size = 50
@@ -401,7 +402,7 @@ task filter_refs_to_found_taxa {
     File taxdump_tgz
     Int min_read_count = 100
 
-    String docker = "quay.io/broadinstitute/viral-classify:2.5.20.0"
+    String docker = "quay.io/broadinstitute/viral-classify:2.5.21.0"
 }
 String ref_basename = basename(taxid_to_ref_accessions_tsv, '.tsv')
 String hits_basename = basename(focal_report_tsv, '.tsv')
@@ -452,7 +453,7 @@ task build_kraken2_db {
     Int? zstd_compression_level
 
     Int machine_mem_gb = 100
-    String docker = "quay.io/broadinstitute/viral-classify:2.5.20.0"
+    String docker = "quay.io/broadinstitute/viral-classify:2.5.21.0"
 }
 
 Int disk_size = 750
@@ -594,7 +595,7 @@ task blastx {
     File krona_taxonomy_db_tgz
 
     Int machine_mem_gb = 8
-    String docker = "quay.io/broadinstitute/viral-classify:2.5.20.0"
+    String docker = "quay.io/broadinstitute/viral-classify:2.5.21.0"
 }
 
 parameter_meta {
@@ -684,7 +685,7 @@ task krona {
     Int? magnitude_column
 
     Int machine_mem_gb = 3
-    String docker = "quay.io/broadinstitute/viral-classify:2.5.20.0"
+    String docker = "quay.io/broadinstitute/viral-classify:2.5.21.0"
 }
 
 Int disk_size = 50
@@ -791,7 +792,7 @@ task filter_bam_to_taxa {
     String out_filename_suffix = "filtered"
 
     Int machine_mem_gb = 8
-    String docker = "quay.io/broadinstitute/viral-classify:2.5.20.0"
+    String docker = "quay.io/broadinstitute/viral-classify:2.5.21.0"
 }
 
 String out_basename = basename(classified_bam, ".bam") + "." + out_filename_suffix
@@ -884,7 +885,7 @@ task kaiju {
     File krona_taxonomy_db_tgz # taxonomy/taxonomy.tab
 
     Int machine_mem_gb = 100
-    String docker = "quay.io/broadinstitute/viral-classify:2.5.20.0"
+    String docker = "quay.io/broadinstitute/viral-classify:2.5.21.0"
 }
 
 String input_basename = basename(reads_unmapped_bam, ".bam")
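The kraken2 disk autoscaling in the hunk above combines the expansion estimate with a hard floor. A Python sketch of the same expression (function name and example sizes are mine):

```python
import math

def kraken2_disk_gb(reads_bam_gb: float, db_tgz_gb: float) -> int:
    """Mirror of the kraken2 disk sizing in the diff: ~8x BAM->FASTQ expansion,
    3x database (tarball + decompressed), 50 GB overhead, rounded up to a
    750 GB multiple, with an explicit 750 GB floor as in the WDL."""
    auto = math.ceil((8 * reads_bam_gb + 3 * db_tgz_gb + 50) / 750.0) * 750
    return max(auto, 750)

print(kraken2_disk_gb(5.0, 60.0))   # 8*5 + 180 + 50 = 270 -> rounds up to 750
print(kraken2_disk_gb(50.0, 120.0)) # 400 + 360 + 50 = 810 -> rounds up to 1500
```

Since the ceiling of any positive estimate already yields at least one 750 GB unit, the explicit floor is defensive; it mirrors the `if disk_size_auto < 750 then 750 else disk_size_auto` expression in the WDL.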

pipes/WDL/tasks/tasks_ncbi.wdl
Lines changed: 7 additions & 7 deletions

@@ -6,7 +6,7 @@ task download_fasta {
     Array[String]+ accessions
     String emailAddress
 
-    String docker = "quay.io/broadinstitute/viral-phylo:2.5.16.0"
+    String docker = "quay.io/broadinstitute/viral-phylo:2.5.21.0"
 }
 
 command <<<
@@ -42,7 +42,7 @@ task download_fasta_from_accession_string {
     String out_prefix
     String emailAddress
 
-    String docker = "quay.io/broadinstitute/viral-phylo:2.5.16.0"
+    String docker = "quay.io/broadinstitute/viral-phylo:2.5.21.0"
 }
 
 command <<<
@@ -94,7 +94,7 @@ task download_annotations {
     String emailAddress
     String combined_out_prefix
 
-    String docker = "quay.io/broadinstitute/viral-phylo:2.5.16.0"
+    String docker = "quay.io/broadinstitute/viral-phylo:2.5.21.0"
 }
 
 command <<<
@@ -136,7 +136,7 @@ task download_ref_genomes_from_tsv {
     File ref_genomes_tsv # [tax_id, isolate_prefix, taxname, colon_delim_accession_list]
     String emailAddress
 
-    String docker = "quay.io/broadinstitute/viral-phylo:2.5.16.0"
+    String docker = "quay.io/broadinstitute/viral-phylo:2.5.21.0"
 }
 
 command <<<
@@ -238,7 +238,7 @@ task align_and_annot_transfer_single {
 
     String out_basename = basename(genome_fasta, '.fasta')
     Int machine_mem_gb = 30
-    String docker = "quay.io/broadinstitute/viral-phylo:2.5.16.0"
+    String docker = "quay.io/broadinstitute/viral-phylo:2.5.21.0"
 }
 
 parameter_meta {
@@ -1246,7 +1246,7 @@ task table2asn {
 
     String out_basename = basename(assembly_fasta, ".fasta")
     Int machine_mem_gb = 8
-    String docker = "quay.io/broadinstitute/viral-phylo:2.5.16.0" # this could be a simpler docker image, we don't use anything beyond table2asn itself
+    String docker = "quay.io/broadinstitute/viral-phylo:2.5.21.0" # this could be a simpler docker image, we don't use anything beyond table2asn itself
 }
 Int disk_size = 50
 
@@ -1403,7 +1403,7 @@ task genbank_special_taxa {
     Int taxid
     File taxdump_tgz
     File vadr_by_taxid_tsv # "gs://pathogen-public-dbs/viral-references/annotation/vadr/vadr-by-taxid.tsv"
-    String docker = "quay.io/broadinstitute/viral-classify:2.5.20.0"
+    String docker = "quay.io/broadinstitute/viral-classify:2.5.21.0"
 }
 
 command <<<

pipes/WDL/tasks/tasks_nextstrain.wdl
Lines changed: 3 additions & 3 deletions

@@ -5,7 +5,7 @@ task taxid_to_nextclade_dataset_name {
     Int taxid
     File taxdump_tgz
     File nextclade_by_taxid_tsv # "gs://pathogen-public-dbs/viral-references/typing/nextclade-by-taxid.tsv"
-    String docker = "quay.io/broadinstitute/viral-classify:2.5.20.0"
+    String docker = "quay.io/broadinstitute/viral-classify:2.5.21.0"
 }
 command <<<
 set -e
@@ -1001,7 +1001,7 @@ task mafft_one_chr {
     Boolean large = false
     Boolean memsavetree = false
 
-    String docker = "quay.io/broadinstitute/viral-phylo:2.5.16.0"
+    String docker = "quay.io/broadinstitute/viral-phylo:2.5.21.0"
     Int mem_size = 500
     Int cpus = 64
     Int disk_size = 750
@@ -1091,7 +1091,7 @@ task mafft_one_chr_chunked {
     Int batch_chunk_size = 2000
     Int threads_per_job = 2
 
-    String docker = "quay.io/broadinstitute/viral-phylo:2.5.16.0"
+    String docker = "quay.io/broadinstitute/viral-phylo:2.5.21.0"
     Int mem_size = 32
     Int cpus = 64
     Int disk_size = 750
