Use dsub to run one-off compute jobs on Google Cloud when an analysis needs more compute or memory than the local environment provides, or a specific Docker image that is impractical to run locally. Typical cases:
- Analysis tools need >8GB RAM (e.g., VADR, BLAST, assembly)
- Need to run many independent jobs in parallel (batch processing)
- Need a specific Docker image with pre-installed tools
- Data lives in GCS and is most efficiently processed in-cloud
Prerequisites:
- dsub installed (ask the user where their dsub installation or venv is located)
- gcloud CLI authenticated with a GCP project that has the provider's API enabled (Cloud Life Sciences for `google-cls-v2`, Batch for `google-batch`)
- GCS bucket accessible by the project's default service account
```sh
dsub --provider google-cls-v2 \
  --project <gcp-project> \
  --regions <region> \
  --image <docker-image> \
  --machine-type <machine-type> \
  --script <script.sh> \
  --tasks <tasks.tsv> \
  --logging gs://<bucket>/logs/<job-name>/
```

| Parameter | Description |
|---|---|
| `--provider google-cls-v2` | dsub backend provider; `google-cls-v2` uses the Cloud Life Sciences v2beta API (the newer `google-batch` provider uses Google Cloud Batch) |
| `--project` | GCP project with the provider's API enabled |
| `--regions` | Compute region (e.g., `us-central1`) |
| `--image` | Docker image to run (e.g., `staphb/vadr:1.6.4`) |
| `--machine-type` | VM type (e.g., `n1-highmem-4` for 26GB RAM) |
| `--script` | Local shell script to execute inside the container |
| `--tasks` | TSV file defining one row per job (batch mode) |
| `--logging` | GCS path for stdout/stderr logs |
The tasks TSV defines inputs, outputs, and environment variables for each job. The header row uses column prefixes to declare types (columns are tab-separated):

```
--env VAR1  --env VAR2  --input FASTA            --output RESULT         --output LOG
value1      value2      gs://bucket/input.fasta  gs://bucket/output.txt  gs://bucket/log.txt
```

- `--env NAME`: environment variable passed to the script
- `--input NAME`: GCS file downloaded to a local path; the env var `NAME` is set to that local path
- `--output NAME`: local path; after the script finishes, the file at that path is uploaded to GCS

Each non-header row is one job. All jobs run independently in parallel.
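For many samples, it is usually easier to generate the tasks TSV than to write it by hand. A minimal sketch (the bucket, sample names, and column names here are hypothetical; adapt them to your script):

```python
# Sketch: generate a dsub tasks TSV. Bucket and sample names are
# placeholders; match the columns to your script's env/input/output names.
import csv

def write_tasks_tsv(path, samples, bucket):
    header = ["--env SAMPLE", "--input FASTA", "--output RESULT"]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(header)
        for sample in samples:
            writer.writerow([
                sample,
                f"{bucket}/fastas/{sample}.fasta",
                f"{bucket}/results/{sample}.txt",
            ])

write_tasks_tsv("tasks.tsv", ["sampleA", "sampleB"], "gs://my-bucket")
```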
The service account running dsub jobs must have read/write access to all GCS paths referenced in the tasks TSV. The simplest approach:
- Use a GCP project whose default service account already has access to your data
- Use a bucket within that same project for staging intermediate/output files
- For ephemeral results, use a temp bucket with a lifecycle policy (e.g., 30-day auto-delete)
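A 30-day auto-delete is a standard GCS lifecycle rule. A minimal sketch of the policy file:

```json
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 30}
    }
  ]
}
```

Apply it with `gcloud storage buckets update gs://<bucket> --lifecycle-file=lifecycle.json`.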
Most developers on the viral-ngs codebase use:
- GCP project: `gcid-viral-seq`
- Staging bucket: `gs://viral-temp-30d` (30-day auto-delete lifecycle)
These are not universal -- always confirm with the user before using them.
To check on running jobs:

```sh
# Check job status
dstat --provider google-cls-v2 --project <project> --jobs <job-id>

# Full status details for a job
dstat --provider google-cls-v2 --project <project> --jobs <job-id> --status '*'

# View logs
gcloud storage cat gs://<bucket>/logs/<job-name>/<task-id>.log
```

- Batch over single jobs: Always prefer `--tasks` with a TSV over individual `dsub` invocations. One TSV row per job is cleaner and easier to track.
- Machine sizing: Check your tool's memory requirements. VADR needs ~16GB; use `n1-highmem-4` (26GB). Most tools work fine with `n1-standard-4` (15GB).
- Script portability: Write the `--script` to be self-contained. It receives inputs/outputs as environment variables. Don't assume any local state.
- Logging: Always set `--logging` to a GCS path so you can debug failures.
- Idempotency: Re-running `dsub` creates new jobs. Check for existing outputs before re-submitting to avoid redundant computation.
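A sketch of that pre-submission check. The GCS existence test is injected as a function so the logic stays independent of any client library; in practice it might shell out to `gcloud storage ls` or use the google-cloud-storage client. All names below are illustrative:

```python
# Sketch: keep only tasks whose --output files don't all exist yet.
# `exists` is injected (e.g., a wrapper around `gcloud storage ls`).
def pending_tasks(header, rows, exists):
    out_cols = [i for i, col in enumerate(header) if col.startswith("--output")]
    return [row for row in rows
            if not all(exists(row[i]) for i in out_cols)]

header = ["--env SAMPLE", "--output RESULT"]
rows = [["a", "gs://bucket/a.txt"], ["b", "gs://bucket/b.txt"]]
already_done = {"gs://bucket/a.txt"}
todo = pending_tasks(header, rows, lambda path: path in already_done)
# Only sample "b" still needs to run.
```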
From the GATK-to-FreeBayes regression testing (PR #1053), we ran VADR on 30 FASTAs (15 assemblies x old/new) using dsub:
```sh
source ~/venvs/dsub/bin/activate  # or wherever dsub is installed

dsub --provider google-cls-v2 \
  --project gcid-viral-seq \
  --regions us-central1 \
  --image staphb/vadr:1.6.4 \
  --machine-type n1-highmem-4 \
  --script run_vadr.sh \
  --tasks vadr_tasks.tsv \
  --logging gs://viral-temp-30d/vadr_regression/logs/
```

The tasks TSV had columns for VADR options (`--env VADR_OPTS`), model URL (`--env MODEL_URL`), input FASTA (`--input FASTA`), and outputs (`--output NUM_ALERTS`, `--output ALERTS_TSV`, `--output VADR_TGZ`).
All 30 jobs completed in ~15 minutes total (running in parallel on GCP).
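A hedged sketch of how a vadr_tasks.tsv like that could be generated. Only the column headers come from the run described above; the sample names, bucket layout, options, and model URL below are invented:

```python
# Sketch: build a 30-row (15 assemblies x old/new) VADR tasks table.
# Paths, VADR options, and the model URL are placeholders.
def vadr_tasks(samples, bucket, vadr_opts, model_url):
    header = ["--env VADR_OPTS", "--env MODEL_URL", "--input FASTA",
              "--output NUM_ALERTS", "--output ALERTS_TSV", "--output VADR_TGZ"]
    rows = []
    for sample in samples:
        for variant in ("old", "new"):
            name = f"{sample}.{variant}"
            rows.append([
                vadr_opts, model_url,
                f"{bucket}/fastas/{name}.fasta",
                f"{bucket}/out/{name}.num_alerts.txt",
                f"{bucket}/out/{name}.alerts.tsv",
                f"{bucket}/out/{name}.vadr.tar.gz",
            ])
    return header, rows
```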