- 🔮 Overview
- 🚀 Getting Started
- 📑 Input Variables
- 📊 Benchmarking
- 🔧 Troubleshooting
- 📁 Useful Resources
- ☎️ Contacts
- 🎉 Credits
Spectronaut is the primary software platform we use for data-independent acquisition (DIA) proteomics analysis at the Proteomics Platform. Conventional Spectronaut workflows process samples sequentially on a single compute node, creating a throughput bottleneck that limits scalability in large-scale studies. Although Biognosys has proposed a conceptual framework for distributed analysis, no production-ready, fully validated implementation previously existed. To close this gap, we developed a cloud-native workflow that distributes analysis across virtual machines (VMs) on Google Cloud Platform (GCP) via Terra.
This implementation achieves >20x runtime improvement while simultaneously driving down compute costs.
- Robust Parallelization Framework: Implements the two-step Pulsar framework (Spectronaut v20.3+), eliminating batch-dependent variability by generating a spectral library using aggregated information from all samples.
- Flexible Scaling: Set `num_vms` to match your Spectronaut token budget, and the workflow will scale dynamically to your available resources.
- Parallelized HTRMS Conversion: Accelerates HTRMS conversion by distributing one file per VM.
- Auto-Resource Tuning: Dynamically allocates CPU, RAM, and disk space based on input size, preventing resource over-provisioning and reducing the configuration burden for users with limited computational experience.
- Comprehensive DIA Support: Automatically switches between directDIA, hybrid-DIA, and library-based DIA modes according to user preference.
- Fully Customizable: Provides granular access to Spectronaut search parameters to tailor the analysis to your experimental needs.
- Proven Reliability: Validated against the legacy workflow to ensure consistent identification and quantification performance. See Benchmarking for details.
- Cost-Efficient Computing: Support for Preemptible Instances and Call Caching drives down compute cost and enables restarting a failed run from the exact point of failure.
- Verbose Logging: Streamlines troubleshooting experience.
- Backward Compatible: Seamlessly reverts to the legacy pipeline when `num_vms = 1`.
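The scaling behavior described above (never launching more VMs than there are files) can be sketched in a few lines. This is a conceptual sketch, not the workflow's actual WDL; `plan_vm_batches` is a hypothetical helper:

```python
def plan_vm_batches(ms_files, num_vms):
    """Distribute MS files across VMs; never launch more VMs than files."""
    effective_vms = min(num_vms, len(ms_files))
    # Round-robin assignment keeps per-VM workloads balanced.
    return [ms_files[i::effective_vms] for i in range(effective_vms)]

# Requesting 20 VMs for 15 files yields only 15 single-file batches.
batches = plan_vm_batches([f"sample_{i}.raw" for i in range(15)], num_vms=20)
```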
- Set Up Terra:
  - Sign In: Go to Terra. New to Terra? Register here.
  - Create Workspace: Click Workspaces → "+". Assign a name and billing project. See Working with Terra workspaces for details.
- Create Docker Image:
  - Take the Dockerfile for your desired Spectronaut version and modify it according to the instructions, then build and push the image to a registry, such as Docker Hub or Google Container Registry.
  - If you are new to Docker, check out this step-by-step guide.
  - Update Workflow: In the WDL file, replace the existing `docker` path with your new image path: `docker: "your-registry/your-spectronaut-image:tag"`.
  - Import Workflow: Go to Library → Workflows → Terra Workflow Repository. Select "Create New Workflow" and upload your updated workflow file. See Create, edit, and share a new workflow for more details.
- Set Up `gcloud` CLI:
  - Install: Follow the gcloud CLI Installation Guide.
  - Initialize: Open your terminal and run `gcloud init`.
  - Authenticate: Log in with your Terra-linked Google account: `gcloud auth login`.
- Gather Input Files: Ensure all necessary inputs (see Input Variables) are ready for upload.
- Upload Files to Terra Workspace:
  - Find Your Workspace GCS Bucket: In the Cloud Information box of your workspace Dashboard, copy the path to its GCS bucket.
  - Upload Data: Upload your files to the GCS bucket using the following commands:
    - For a single file: `gcloud storage cp "/path/to/file" "gs://fc-your-bucket/destination/directory/"`
    - For a folder: `gcloud storage cp -r "/path/to/folder/" "gs://fc-your-bucket/destination/folder/"`
  - Important: The workflow expects the MS data directory to contain only the raw files intended for search. To prevent execution errors, do not include logs, reports, or other file formats in this folder.
  - See How to move data to/from a Google bucket & gcloud CLI cheat sheet for additional information.
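Before uploading, a quick local check can catch stray files in the MS data directory. This is a hypothetical helper, and the allowed-extension set is an assumption; adjust it to match your instrument's raw format:

```python
from pathlib import Path

# Assumption: typical raw-data extensions (".raw" for Orbitrap, ".d" for
# timsTOF, plus converted formats). Adjust for your instruments.
ALLOWED_SUFFIXES = {".raw", ".d", ".htrms", ".mzml"}

def find_unexpected_files(ms_dir):
    """List entries that should not ship with the raw data (logs, reports, ...)."""
    return sorted(p.name for p in Path(ms_dir).iterdir()
                  if p.suffix.lower() not in ALLOWED_SUFFIXES)
```

Run it on your staging folder and remove anything it reports before calling `gcloud storage cp`.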
- Configure & Launch:
  - Open your workflow in the Workflows tab.
  - Check "Use call caching" so that failed runs restart at the point of failure (optional but highly recommended).
  - Provide the necessary inputs and click Save → Launch.
- Completion: Terra will send an email notification once the analysis is complete.
- Download Spectronaut Output: From the Submission page, you can access the execution directory, where you can find the following key Spectronaut outputs:

| Output | Path |
|---|---|
| Final Outputs and Reports | `call-combine_sne/spectronaut_output.zip` |
| Experiment-Level Spectral Library | `call-combine_final_archives/merged_library.kit` |
| HTRMS-Converted Files | Located within `call-htrms_conversion/` subdirectories. |

If you converted your data to HTRMS and need to move the files to a local directory or another GCS bucket, run:

`gcloud storage cp -r "gs://fc-your-bucket/submissions/YOUR-JOB-ID/call-htrms_conversion/**.htrms" "/path/to/destination/"`

Important: All input GCS paths must start with `gs://...`.
| Variable | Type | Description |
|---|---|---|
| `num_vms` | Integer | Number of VMs to run in parallel; each VM consumes one license token. Set to 1 for single-VM directDIA mode (see Parallelization Flowchart: Pipeline 2A). |
| `experiment_name` | String | Name of the experiment. |
| `experiment_type` | String | Accepts `"proteome"` or `"ptm"`; determines CPU and RAM presets (default: `"proteome"`). |
| `file_directory` | String | GCS path to the folder containing the input MS files. |
| `directDIA_settings` | File | GCS path to a `*.prop` directDIA settings file. |
| `fasta_1` | File | GCS path to a `*.fasta` or `*.bgsfasta` database. |
| `fasta_[2-3]` | File (Optional) | GCS path to an additional protein database. |
| `report_schema_[1-4]` | File (Optional) | GCS path to a `*.rs` report schema. |
| `enzyme_database` | File (Optional) | GCS path to a custom enzyme database for non-standard proteolytic enzymes. |
| `condition_setup` | File (Optional) | GCS path to a `*.tsv` condition setup file (template). |
| `json_settings` | File (Optional) | GCS path to a `*.json` file to override Spectronaut search settings. |
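To illustrate how these variables fit together, here is a hypothetical Terra inputs configuration. The `spectronaut_parallel` namespace and all paths are placeholders; substitute your imported workflow's actual name and your own bucket paths:

```json
{
  "spectronaut_parallel.num_vms": 8,
  "spectronaut_parallel.experiment_name": "dia_study_example",
  "spectronaut_parallel.experiment_type": "proteome",
  "spectronaut_parallel.file_directory": "gs://fc-your-bucket/raw_data/",
  "spectronaut_parallel.directDIA_settings": "gs://fc-your-bucket/settings/directDIA.prop",
  "spectronaut_parallel.fasta_1": "gs://fc-your-bucket/databases/human_uniprot.fasta"
}
```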
HTRMS conversion is highly recommended for Bruker timsTOF files (`*.d/`) to reduce analysis cost and improve analytical performance, but it is not recommended for Thermo Fisher Orbitrap files (`*.raw`).
| Variable | Type | Description |
|---|---|---|
| `do_conversion` | Boolean (Optional) | Controls execution of HTRMS conversion (default: `false`). |
| `convert_schema` | File (Optional) | GCS path to a `*.prop` conversion schema from HTRMS Converter. |
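The vendor-based recommendation above reduces to a simple rule of thumb, sketched here for clarity (this is not the workflow's logic; the workflow converts whatever you point it at when `do_conversion = true`):

```python
def recommend_htrms_conversion(filename):
    """Rule of thumb: convert Bruker timsTOF data, leave Orbitrap data as-is."""
    name = filename.lower().rstrip("/")
    if name.endswith(".d"):    # Bruker timsTOF: conversion recommended
        return True
    if name.endswith(".raw"):  # Thermo Orbitrap: conversion not recommended
        return False
    return False               # default: leave other formats unconverted
```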
| Variable | Type | Description |
|---|---|---|
| `do_pulsar` | Boolean (Optional) | Controls execution of Pulsar Search (default: `true`). |
| `spectral_library_[1-3]` | String (Optional) | GCS path to a pre-built spectral library (`*.kit`). Required when `do_pulsar = false`. |
🔻 Important:
- When `do_pulsar = false`, at least one spectral library must be provided for a standard library-based DIA analysis.
- When `do_pulsar = true` and a spectral library is provided, the workflow will run hybrid-DIA analysis using the provided library and an in silico library generated via Pulsar Search.
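The analysis-mode selection implied by these rules can be summarized as follows. This is a sketch of the decision logic, not the workflow's actual validation code:

```python
def select_dia_mode(do_pulsar, spectral_libraries):
    """Return the DIA mode implied by do_pulsar and any provided libraries."""
    if do_pulsar and spectral_libraries:
        return "hybrid-DIA"          # Pulsar in silico library + provided library
    if do_pulsar:
        return "directDIA"           # in silico library from Pulsar Search only
    if spectral_libraries:
        return "library-based DIA"   # provided library only
    raise ValueError("do_pulsar = false requires at least one spectral library")
```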
| Variable | Type | Default | Description |
|---|---|---|---|
| `generate_sne_large_experiment` | Boolean (Optional) | `true` | Controls generation of the final `*.sne` file. Recommended to set to `false` for large experiments (N > 300 samples), where generating the final `*.sne` file may exceed available memory and disk space. |
| `average_file_size_gb` | Float (Optional) | 20 | Estimated average input file size in GB; only used when `do_conversion = false`. |
| `disk_size_multiplier` | Float (Optional) | 3 | Multiplier applied to total input size for disk allocation across most tasks. Increase if tasks run out of disk space. |
| `sne_combine_disk_size_multiplier` | Float (Optional) | 6 | Disk multiplier for the final `sne_combine` step only. Increase if the task runs out of disk space. |
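The auto-sizing these multipliers control can be approximated as below. This is a sketch under the assumption that total input size is estimated as file count times `average_file_size_gb`; the workflow's actual sizing may differ in detail:

```python
import math

def estimate_task_disk_gb(n_files, average_file_size_gb=20.0,
                          disk_size_multiplier=3.0):
    """Disk request = estimated total input size x safety multiplier."""
    total_input_gb = n_files * average_file_size_gb
    return math.ceil(total_input_gb * disk_size_multiplier)

estimate_task_disk_gb(10)  # 10 files x 20 GB x 3 = 600 GB
```

Raising `disk_size_multiplier` from 3 to 5 in this model grows a 10-file request from 600 GB to 1000 GB, which is why it is the first knob to turn on out-of-disk failures.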
Important: When `generate_sne_large_experiment = false`, at least one `*.rs` report schema must be provided to obtain analysis results. After the search is complete, `sne_combine` must be re-run on all VM-level `*.sne` files to generate additional reports; you can find a standalone `sne_combine` workflow here.
The `n_preemptible_*` variables control the number of attempts on preemptible VMs before falling back to Standard instances. Set to 0 to always use Standard instances.
| Variable | Type | Default |
|---|---|---|
| `n_preemptible_htrms_conversion` | Integer (Optional) | 2 |
| `n_preemptible_pulsar_step1` | Integer (Optional) | 1 |
| `n_preemptible_pulsar_step2` | Integer (Optional) | 0 |
| `n_preemptible_pulsar_step3` | Integer (Optional) | 1 |
| `n_preemptible_combine_archives` | Integer (Optional) | 0 |
| `n_preemptible_dia_analysis` | Integer (Optional) | 1 |
| `n_preemptible_combine_sne` | Integer (Optional) | 0 |
| `n_preemptible_directDIA_single_vm` | Integer (Optional) | 0 |
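Conceptually, each task's retry behavior looks like this. It is a Python sketch of the semantics only; the real scheduling is handled by Terra/Cromwell, and `InterruptedError` is just a stand-in for a VM preemption event:

```python
def run_with_preemptible_fallback(task, n_preemptible):
    """Try a task on preemptible VMs up to n_preemptible times, then use Standard."""
    for _ in range(n_preemptible):
        try:
            return task(preemptible=True)
        except InterruptedError:        # stand-in for a preemption event
            continue
    return task(preemptible=False)      # n_preemptible = 0 goes straight here
```

Preemptible instances are much cheaper but can be reclaimed mid-run, so short tasks default to a preemptible attempt or two while long, expensive steps (like the combine steps) default to 0.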
The parallelized workflow has undergone rigorous validation using data acquired on both Thermo Fisher Orbitrap and Bruker timsTOF instrument platforms. When benchmarked against the legacy workflow, it demonstrated exceptional consistency in both protein identification and quantification performance.
Run through this checklist before submitting your job; it covers the most common causes of failure:
- All raw input files are in a single GCS folder (not nested in subfolders)
- The FASTA file is in standard `.fasta` or `.bgsfasta` format and is accessible from your Terra workspace
- The `.prop` settings file was exported from the same version of Spectronaut supported by this workflow
- `experiment_type` is entered exactly as `"proteome"` or `"ptm"` (all lowercase, no spaces)
- If `do_pulsar = false`, at least one `spectral_library_*` input is provided
- "Use call caching" is checked in the workflow configuration page
- You are authenticated with the correct Broad Google account (`gcloud auth login`)
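Several of these checks can be automated locally before launch. The sketch below is a hypothetical helper, not part of the workflow; the dictionary keys mirror the input variable names documented above:

```python
def preflight_errors(inputs):
    """Flag the most common misconfigurations from the checklist above."""
    errors = []
    if inputs.get("experiment_type", "proteome") not in {"proteome", "ptm"}:
        errors.append('experiment_type must be exactly "proteome" or "ptm"')
    if not inputs.get("do_pulsar", True) and not any(
            inputs.get(f"spectral_library_{i}") for i in (1, 2, 3)):
        errors.append("do_pulsar = false requires at least one spectral library")
    if not str(inputs.get("file_directory", "")).startswith("gs://"):
        errors.append("file_directory must be a GCS path starting with gs://")
    return errors
```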
**I got an error about insufficient disk space.**
Increase the `disk_size_multiplier` input (e.g., from 3 to 4 or 5). If the error occurred in the SNE combine step specifically, increase `sne_combine_disk_size_multiplier`.
**I got a permission error ("You do not have access to this bucket") when uploading files.**
Re-authenticate your machine with your Broad Google account using `gcloud auth login`. This is the most common cause: another user previously signed in on the same computer, and their credentials are being used instead of yours. See the authentication section above for details.
**I'm not sure how many VMs to use.**
A reasonable starting point is one VM per 20–30 files. For 200 files, `num_vms = 8` or `num_vms = 10` works well. The workflow will not create more VMs than you have files; if you request 20 VMs for 15 files, it automatically scales down to 15 VMs.
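This rule of thumb can be written as a quick calculation. It is a hypothetical helper; `files_per_vm=25` is simply the midpoint of the 20–30 guidance, and the token-budget cap reflects the one-token-per-VM licensing described above:

```python
import math

def suggested_num_vms(n_files, files_per_vm=25, token_budget=None):
    """One VM per ~20-30 files, capped by files and the license-token budget."""
    vms = max(1, math.ceil(n_files / files_per_vm))
    vms = min(vms, n_files)          # never more VMs than files
    if token_budget is not None:
        vms = min(vms, token_budget) # never more VMs than license tokens
    return vms

suggested_num_vms(200)  # -> 8
```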
**I set `do_pulsar = false` but my job failed immediately with an error about spectral libraries.**
When `do_pulsar = false`, you must provide at least one spectral library via `spectral_library_1` (and optionally `spectral_library_2`, `spectral_library_3`). The workflow validates this at startup and exits early if no library is supplied.
- Spectronaut Manual
- Running Spectronaut on a Public Dataset in a Distributed Manner (Spectronaut v20.3+)
- Cameron Lian (glian@broadinstitute.org) Research Associate I, Proteomics Platform
- Natalie Clark (nclark@broadinstitute.org) Senior Computational Scientist I, Proteomics Platform
- DR Mani (manidr@broadinstitute.org) Director, Computational Proteomics
- Moe Haines Research Scientist I, Proteomics Platform




