broadinstitute/spectronaut-parallelization
Parallelized Workflow for Spectronaut DIA Proteomics Analysis

cover

🔮 Overview

Spectronaut is the primary software platform that we use for data-independent acquisition (DIA) proteomics analysis at the Proteomics Platform. Conventional Spectronaut workflows process samples sequentially on a single compute node, creating a critical throughput bottleneck that limits scalability in large-scale studies. Although Biognosys has proposed a conceptual framework, a production-ready, fully validated implementation has not previously existed. To address this gap, we developed a cloud-native workflow that distributes analysis across virtual machines (VMs) on Google Cloud Platform (GCP) via Terra.

This implementation achieves >20x runtime improvement while simultaneously driving down compute costs.

runtime-cost-comparison
Comparison of runtime and compute cost. Processing a dataset of 90 proteome samples acquired on a Thermo Fisher Orbitrap Astral Mass Spectrometer, the parallelized workflow demonstrates a 78.4% reduction in total runtime alongside a 67.4% reduction in compute cost.

✨ Key Features

Performance & Scalability

  • Robust Parallelization Framework: Implements the two-step Pulsar framework (Spectronaut v20.3+), eliminating batch-dependent variability by generating a spectral library using aggregated information from all samples.
  • Flexible Scaling: Set num_vms to match your Spectronaut token budget, and the workflow will scale dynamically to your available resources.
  • Parallelized HTRMS Conversion: Accelerates HTRMS conversion by distributing one file per VM.
  • Auto-Resource Tuning: Dynamically allocates CPU, RAM, and disk space based on input size, preventing over-provisioning and reducing the configuration burden for users with limited computational experience.

Scientific Versatility

  • Comprehensive DIA Support: Automatically switches between directDIA, hybrid-DIA, and library-based DIA modes according to user preference.
  • Fully Customizable: Provides granular access to Spectronaut search parameters to tailor the analysis to your experimental needs.
  • Proven Reliability: Validated against the legacy workflow to ensure consistent identification and quantification performance. See Benchmarking for details.

Additional Features

  • Cost-Efficient Computing: Support for Preemptible Instances and Call Caching drives down compute cost and enables restarting a failed run from the exact point of failure.
  • Verbose Logging: Detailed logs streamline troubleshooting.
  • Backward Compatible: Seamlessly reverts to the legacy pipeline when num_vms = 1.

🔍 How Does Parallelization Work?

flowchart
Flowchart of the parallelized workflow. The workflow executes parallelization by default (Pipeline 2B) while providing backward support for the legacy pipeline (2A). Parallelization adopts a two-step Pulsar search framework. Tasks highlighted in red are executed in parallel across VMs.

🚀 Getting Started

1. Prerequisites & Environment

  1. Set Up Terra:

    1. Sign In: Go to Terra. New to Terra? Register here.
    2. Create Workspace: Click Workspaces → "+". Assign a name and billing project. See Working with Terra workspaces for details.
  2. Create Docker Image:

    1. Take the Dockerfile for your desired Spectronaut version and modify it according to the instructions, then build and push the image to a registry, such as Docker Hub or Google Container Registry.
      1. If you are new to Docker, check out this step-by-step guide.
    2. Update Workflow: In the WDL file, replace the existing docker path with your new image path: docker: "your-registry/your-spectronaut-image:tag".
    3. Import Workflow: Go to Library → Workflows → Terra Workflow Repository. Select "Create New Workflow" and upload your updated workflow file. See Create, edit, and share a new workflow for more details.
  3. Set Up gcloud CLI:

    1. Install: Follow the gcloud CLI Installation Guide.

    2. Initialize: Open your terminal and run:

      gcloud init
    3. Authenticate: Log in using your Terra-linked Google account:

      gcloud auth login
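The Docker steps above can be sketched end to end. The registry, image name, and version tag below are placeholders for your own values, not names prescribed by this workflow:

```shell
# Placeholders -- substitute your own registry, image name, and Spectronaut version tag.
REGISTRY="your-registry"
IMAGE="your-spectronaut-image"
TAG="20.3"
IMAGE_PATH="${REGISTRY}/${IMAGE}:${TAG}"

# Build from the modified Dockerfile and push it (uncomment to run for real):
# docker build -t "${IMAGE_PATH}" .
# docker push "${IMAGE_PATH}"

# This is the value to paste into the WDL runtime section:
echo "docker: \"${IMAGE_PATH}\""
```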

2. File Preparation & Upload

  1. Gather Input Files: Ensure all necessary inputs (see Input Variables) are ready for upload.
  2. Upload Files to Terra Workspace:
    1. Find Your Workspace GCS Bucket: In the Cloud Information box of your workspace Dashboard, copy the path to its GCS bucket.

    2. Upload Data: Upload your files to the GCS bucket using the following commands:

      • For a single file:

        gcloud storage cp "/path/to/file" "gs://fc-your-bucket/destination/directory/"
      • For a folder:

        gcloud storage cp -r "/path/to/folder/" "gs://fc-your-bucket/destination/folder/"

        Important: The workflow expects the MS data directory to contain only the raw files intended for search. To prevent execution errors, do not include logs, reports, or other file formats in this folder.

See How to move data to/from a Google bucket & gcloud CLI cheat sheet for additional information.
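Because the workflow expects the MS data folder to contain only raw files, it can help to sanity-check the uploaded listing before launching. A sketch, using an inline sample listing in place of real `gcloud storage ls` output:

```shell
# Sketch: flag any non-raw files in a bucket listing before launching the workflow.
# Replace the sample listing with the real output of:
#   gcloud storage ls "gs://fc-your-bucket/destination/folder/"
LISTING="sample01.raw
sample02.raw
run.log"

# Keep anything that is not a *.raw file or a Bruker *.d/ directory.
UNEXPECTED=$(printf '%s\n' "$LISTING" | grep -v -E '\.(raw|d)/?$' || true)
echo "Unexpected files: ${UNEXPECTED}"
```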

3. Execution & Output

  1. Configure & Launch:

    1. Open your workflow in the Workflows tab.
    2. Check "Use call caching" to ensure failed runs restart at the point of failure (optional but highly recommended).
    3. Provide the necessary inputs and click Save → Launch.
  2. Completion: Terra will send an email notification once the analysis is complete.

  3. Download Spectronaut Output: From the Submission page, you can access the execution directory, where you can find the following key Spectronaut outputs:

| Output | Path |
| --- | --- |
| Final Outputs and Reports | call-combine_sne/spectronaut_output.zip |
| Experiment-Level Spectral Library | call-combine_final_archives/merged_library.kit |
| HTRMS-Converted Files | Located within call-htrms_conversion/ subdirectories |

📤 Copying HTRMS files to local or GCS bucket

If you converted your data to HTRMS and need to move them to a local directory or another GCS bucket, run:

gcloud storage cp -r "gs://fc-your-bucket/submissions/YOUR-JOB-ID/call-htrms_conversion/**.htrms" "/path/to/destination/"

📑 Input Variables

Important: All input GCS paths must start with "gs://...".

1. Core Search Inputs

| Variable | Type | Description |
| --- | --- | --- |
| num_vms | Integer | Number of VMs to run in parallel; each VM consumes one license token. Set to 1 for single-VM directDIA mode (see Parallelization Flowchart: Pipeline 2A). |
| experiment_name | String | Name of the experiment. |
| experiment_type | String | Accepts "proteome" or "ptm"; determines CPU and RAM presets (Default: "proteome"). |
| file_directory | String | GCS path to the folder containing the input MS files. |
| directDIA_settings | File | GCS path to a *.prop directDIA settings file. |
| fasta_1 | File | GCS path to a *.fasta or *.bgsfasta database. |
| fasta_[2-3] | File (Optional) | GCS path to an additional protein database. |
| report_schema_[1-4] | File (Optional) | GCS path to a *.rs report schema. |
| enzyme_database | File (Optional) | GCS path to a custom enzyme database for non-standard proteolytic enzymes. |
| condition_setup | File (Optional) | GCS path to a *.tsv condition setup file (template). |
| json_settings | File (Optional) | GCS path to a *.json file to override Spectronaut search settings. |

2. HTRMS Conversion

HTRMS conversion is highly recommended for Bruker timsTOF files (*.d/) to reduce analysis cost and improve analytical performance, but it is not recommended for Thermo Fisher Orbitrap files (*.raw).

| Variable | Type | Description |
| --- | --- | --- |
| do_conversion | Boolean (Optional) | Controls execution of HTRMS conversion (Default: false). |
| convert_schema | File (Optional) | GCS path to a *.prop conversion schema from HTRMS Converter. |

3. Hybrid-DIA Search

| Variable | Type | Description |
| --- | --- | --- |
| do_pulsar | Boolean (Optional) | Controls execution of Pulsar Search (Default: true). |
| spectral_library_[1-3] | String (Optional) | GCS path to a pre-built spectral library (*.kit). Required when do_pulsar = false. |

🔻 Important:

  • When do_pulsar = false, at least one spectral library must be provided for a standard library-based DIA analysis.
  • When do_pulsar = true and a spectral library is provided, the workflow will run hybrid-DIA analysis using the provided library and an in silico library generated via Pulsar Search.
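As a concrete illustration, a standard library-based run (do_pulsar = false) could be configured in a Terra inputs JSON along these lines. The workflow name spectronaut_parallel and the library path are hypothetical placeholders; use the names from your own imported workflow:

```json
{
  "spectronaut_parallel.do_pulsar": false,
  "spectronaut_parallel.spectral_library_1": "gs://fc-your-bucket/libraries/my_library.kit"
}
```

Setting do_pulsar back to true while keeping spectral_library_1 would instead trigger the hybrid-DIA mode described above.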

4. Resource & Performance

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| generate_sne_large_experiment | Boolean (Optional) | true | Controls generation of the final *.sne file. Recommended to set to false for large experiments (N > 300 samples), where generating the final *.sne file may exceed available memory and disk space. |
| average_file_size_gb | Float (Optional) | 20 | Estimated average input file size in GB; only used when do_conversion = false. |
| disk_size_multiplier | Float (Optional) | 3 | Multiplier applied to total input size for disk allocation across most tasks. Increase if tasks run out of disk space. |
| sne_combine_disk_size_multiplier | Float (Optional) | 6 | Disk multiplier for the final sne_combine step only. Increase if the task runs out of disk space. |

Important: When generate_sne_large_experiment = false, at least one *.rs report schema must be provided to obtain analysis results. After the search is complete, sne_combine must be re-run on all VM-level *.sne files to generate additional reports — you can find a standalone sne_combine workflow here.
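Before re-running sne_combine, the VM-level *.sne files have to be located. A sketch, with the path pattern assumed by analogy with the HTRMS copy command in this README rather than confirmed from the workflow itself:

```shell
# Placeholder submission ID -- substitute your actual Terra job ID.
JOB_ID="YOUR-JOB-ID"

# Assumed glob for VM-level *.sne files, mirroring the **.htrms pattern above:
SNE_GLOB="gs://fc-your-bucket/submissions/${JOB_ID}/**.sne"

# List them before feeding them to the standalone sne_combine workflow:
# gcloud storage ls "${SNE_GLOB}"
echo "${SNE_GLOB}"
```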

5. Preemptible Instances

The n_preemptible_* variables control the number of attempts on preemptible VMs before falling back to Standard instances. Set to 0 to always use Standard instances.

| Variable | Type | Default |
| --- | --- | --- |
| n_preemptible_htrms_conversion | Integer (Optional) | 2 |
| n_preemptible_pulsar_step1 | Integer (Optional) | 1 |
| n_preemptible_pulsar_step2 | Integer (Optional) | 0 |
| n_preemptible_pulsar_step3 | Integer (Optional) | 1 |
| n_preemptible_combine_archives | Integer (Optional) | 0 |
| n_preemptible_dia_analysis | Integer (Optional) | 1 |
| n_preemptible_combine_sne | Integer (Optional) | 0 |
| n_preemptible_directDIA_single_vm | Integer (Optional) | 0 |

📊 Benchmarking

The parallelized workflow has undergone rigorous validation using data acquired on both Thermo Fisher Orbitrap and Bruker timsTOF instrument platforms. Benchmarked against the legacy workflow, the parallelized workflow demonstrated exceptional consistency in both protein identification and quantification performance.

id-comparison
quant-correlation
Comparison of protein group identification and quantification between the legacy and parallelized workflows. Processing a benchmark dataset of 10 standard Jurkat QC samples acquired on Thermo Fisher Orbitrap Astral and Bruker timsTOF HT mass spectrometers, respectively, the parallelized workflow demonstrates exceptional consistency in both identification and quantification.

🔧 Troubleshooting

Run through this checklist before submitting your job — it covers the most common causes of failure:

  • All raw input files are in a single GCS folder (not nested in subfolders)
  • FASTA file is in standard .fasta or .bgsfasta format and is accessible from your Terra workspace
  • The .prop settings file was exported from the same version of Spectronaut supported by this workflow
  • experiment_type is entered exactly as "proteome" or "ptm" (all lowercase, no spaces)
  • If do_pulsar = false, at least one spectral_library_* input is provided
  • "Use call caching" is checked in the workflow configuration page
  • You are authenticated with the correct Broad Google account (gcloud auth login)

I got an error about insufficient disk space. Increase the disk_size_multiplier input (e.g., from 3 to 4 or 5). If the error was in the SNE combine step specifically, increase sne_combine_disk_size_multiplier.

I got a permission error when uploading files — "You do not have access to this bucket". Re-authenticate your machine with your Broad Google account using gcloud auth login. This is the most common cause: another user previously signed in on the same computer, and their credentials are being used instead of yours. See the authentication section above for details.

I'm not sure how many VMs to use. A reasonable starting point is one VM per 20–30 files. For 200 files, num_vms = 8 or num_vms = 10 works well. The workflow will not create more VMs than you have files — if you request 20 VMs for 15 files, it will automatically scale down to 15 VMs.
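That sizing heuristic can be written down directly. The ~25-files-per-VM midpoint and the token-budget cap below are assumptions drawn from the guidance above, not workflow defaults:

```shell
# Assumed heuristic: ~25 files per VM, capped by license tokens and file count.
N_FILES=200
TOKEN_BUDGET=10
FILES_PER_VM=25

NUM_VMS=$(( (N_FILES + FILES_PER_VM - 1) / FILES_PER_VM ))   # ceiling division
if [ "$NUM_VMS" -gt "$TOKEN_BUDGET" ]; then NUM_VMS=$TOKEN_BUDGET; fi  # one token per VM
if [ "$NUM_VMS" -gt "$N_FILES" ]; then NUM_VMS=$N_FILES; fi  # never more VMs than files
echo "num_vms = ${NUM_VMS}"
```

With 200 files this lands on 8, matching the suggestion above; for very large token budgets the file-count cap takes over.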

I set do_pulsar = false but my job failed immediately with an error about spectral libraries. When do_pulsar = false, you must provide at least one spectral library via spectral_library_1 (and optionally spectral_library_2, spectral_library_3). The workflow validates this at startup and exits early if no library is supplied.

📁 Useful Resources

☎️ Contacts

🎉 Credits

  • Moe Haines, Research Scientist I, Proteomics Platform
