spectrafuse

Incremental spectral clustering pipeline from quantms data. quantms is a workflow for reanalysis of public proteomics data. quantms not only releases a workflow to the public but also performs reanalysis of public proteomics data in a systematic way for TMT, LFQ, ITRAQ and other DDA methods.

quantms has reanalyzed an extensive number of datasets with almost 1 billion MS/MS (Mass Spectrometry/Mass Spectrometry) MS2 analyzed, comprising nearly 100 million PSMs (Peptide-Spectrum Matches) derived from various tissues, cell lines, and diseases. In light of this vast wealth of data, spectrafuse aims to apply spectral clustering techniques to organize this data and construct spectral libraries.

spectrafuse is a Nextflow workflow that performs incremental clustering of quantms data and is based on the tool MaRaCluster.

Workflow Overview

See docs/formats.md for the exact structure of every file the pipeline reads and writes — QPX inputs, the .dat binary, .scan_titles.txt, cluster-DB parquets, MSP, and the pre-existing cluster DB layout expected by --existing_cluster_db.

SpectrafUSE is a single pipeline. New QPX projects are always converted to MaRaCluster's .dat binary format (~100 bytes/spectrum), sliced into precursor m/z windows, clustered, and written to a cluster DB plus an MSP spectral library. If --existing_cluster_db <path> is supplied, representative spectra from that DB are extracted to .dat and clustered alongside the new data — the rest of the pipeline is identical, and the final step merges into the existing DB instead of writing a fresh one.

Parquet → Dat (PARQUET_TO_DAT): Converts PSM parquet files directly to MaRaCluster's binary .dat format. Replicates MaRaCluster's internal binning (bin = floor(mz / 1.000508 + 0.32), top-40 peaks). ~100 bytes/spectrum.
Representative Extraction (EXTRACT_REPS_DAT, only with --existing_cluster_db): Reads each existing cluster_metadata.parquet, writes one consensus spectrum per cluster to .dat with a rep:{cluster_id} scan title — becomes another source for the same clustering pipeline.
Concatenate (CONCAT_DAT_FILES): Per species/[instrument]/charge partition, concatenates all .dat sources (new projects + optional reps), renumbering scannr for global uniqueness.
Split m/z Windows (SPLIT_MZ_WINDOWS): Splits each partition into overlapping precursor m/z windows. MaRaCluster's 20 ppm precursor tolerance means spectra far apart in m/z can never cluster together; windowing formalizes this for Nextflow parallelism.
MaRaCluster (RUN_MARACLUSTER_DAT): Clusters spectra within each (partition, window) via MaRaCluster's -D flag (direct .dat read). Runs in parallel.
Merge m/z Windows (MERGE_MZ_WINDOWS): Reconciles cluster assignments across overlap zones — deterministic lowest-window-index wins, safe because the 1 Da overlap is far wider than the 20 ppm clustering tolerance.
Cluster DB build (BUILD_CLUSTER_DB or MERGE_INTO_EXISTING_DB): Generates cluster_metadata.parquet + psm_cluster_membership.parquet via DuckDB joins over scan_titles. When an existing DB is supplied, pre-existing cluster_ids are reused wherever a rep lands in a new cluster, and new PSMs are appended (dedup by USI). Both branches emit provenance columns (see below).
MSP library (GENERATE_MSP_FORMAT): Writes the consensus spectral library as *.msp.gz, one file per partition.

Partitioning

Clustering runs per species/instrument/charge, or per species/charge when --skip_instrument is set (useful when several instrument models produce interoperable spectra). Cluster DB and MSP publish paths mirror the partition key.

Provenance

cluster_metadata.parquet carries two columns for multi-round traceability:

is_reused_cluster (bool): True when this cluster_id was inherited from --existing_cluster_db via a representative match.
source_datasets (list): distinct project_accession values that contribute PSMs to this cluster, accumulated across rounds.

Benchmarks

Dataset	PSMs	Clusters	Reduction	Purity	Pipeline Time	Peak Memory
PXD014877 (3.6K)	3,608	3,389	6.1%	0.9996	~1 min	< 1 GB
PXD004452 (5.5M)	5,490,831	2,996,391	45.4%	0.9984	~40 min	~5 GB
7 datasets (11.8M)	11,797,213	6,912,696	41.4%	0.9979	~87 min	~8 GB

Key Features

Compact binary format: Parquet-to-dat conversion produces ~100 bytes/spectrum (~1.3 GB for 12.8M PSMs)
Incremental clustering: Add new datasets without re-clustering existing data
m/z window parallelism: Within each charge partition, spectra are split into precursor m/z windows and clustered in parallel, providing near-linear speedup at any scale
DuckDB-powered aggregation: All cluster stats, purity computation, and I/O use DuckDB for efficient performance from thousands to hundreds of millions of PSMs
Chunked cluster DB builder: Processes spectrum arrays in 200K-cluster chunks, keeping memory under 5 GB even for 3M+ PSM charges
Partitioned by metadata: Clustering is performed separately for each species/instrument/charge combination
Consensus spectra: Multiple strategies (best, bin, most, average) for generating consensus spectra from clusters

Input Requirements

The workflow requires:

QPX Parquet files: A directory containing one or more project subdirectories. Each project directory should contain the QPX file set:

File	Purpose
`{id}.psm.parquet`	PSM spectra (mz_array, intensity_array, charge, sequence, etc.)
`{id}.run.parquet`	Run metadata (run_file_name, instrument, samples, enzymes)
`{id}.sample.parquet`	Sample metadata (organism, organism_part)

Directory structure:

parquet_dir/
├── PXD004732/
│   ├── PXD004732.psm.parquet
│   ├── PXD004732.run.parquet
│   └── PXD004732.sample.parquet
├── PXD004733/
│   ├── PXD004733.psm.parquet
│   ├── PXD004733.run.parquet
│   └── PXD004733.sample.parquet
└── ...

Installation

Prerequisites

Nextflow >= 23.04.0
Docker, Singularity, Podman, or another container engine
Java 11 or later (for Nextflow)

Install Nextflow

curl -s https://get.nextflow.io | bash

Clone the Repository

git clone https://github.com/bigbio/spectrafuse.git
cd spectrafuse

Usage

Fresh run

nextflow run main.nf \
    --parquet_dir /path/to/projects \
    --default_species "Homo sapiens" \
    --default_instrument "Q Exactive HF" \
    -profile docker

Merge into an existing cluster DB

nextflow run main.nf \
    --parquet_dir /path/to/new_projects \
    --existing_cluster_db /path/to/cluster_db \
    --default_species "Homo sapiens" \
    --default_instrument "Q Exactive HF" \
    -profile docker

Cluster across instruments (drop instrument partitioning)

nextflow run main.nf \
    --parquet_dir /path/to/projects \
    --default_species "Homo sapiens" \
    --skip_instrument \
    -profile docker

Resume a Previous Run

nextflow run main.nf \
    --parquet_dir /path/to/projects \
    -profile docker \
    -resume

Test Run

nextflow run main.nf -profile test,docker

Parameters

Required Parameters

Parameter	Description	Example
`--parquet_dir`	Directory containing project subdirectories with QPX parquet files	`/data/projects`

Pipeline Parameters

Parameter	Description	Default
`--existing_cluster_db`	Optional path to a prior cluster DB. When set, its representative spectra feed the same pipeline and the result merges into that DB.	`null`
`--default_species`	Species label for cluster DB metadata	`null`
`--default_instrument`	Instrument label (ignored when `--skip_instrument` is set)	`null`
`--skip_instrument`	Drop the instrument level from partitioning and output paths (use when instruments cluster interchangeably)	`false`

MaRaCluster Parameters

Parameter	Description	Default
`--maracluster_pvalue_threshold`	Log10(p-value) threshold for MaRaCluster clustering (e.g., -10.0 = p-value 1e-10)	`-10.0`
`--maracluster_precursor_tolerance`	Precursor m/z tolerance in ppm for MaRaCluster	`20.0`
`--cluster_threshold`	P-value threshold for MaRaCluster output file naming (e.g., `*_p30.tsv`)	`30`
`--maracluster_verbose`	Enable verbose output from MaRaCluster	`false`

Precursor m/z Windowing Parameters

These parameters control parallelization within each charge partition. MaRaCluster uses 20 ppm precursor tolerance internally, so spectra far apart in m/z can never cluster together. Windowing formalizes this to enable parallel processing.

Parameter	Description	Default
`--mz_window_size`	Width of each precursor m/z window in Da	`300`
`--mz_window_overlap`	Overlap between adjacent windows in Da (safety margin)	`1.0`

At small scale (3K PSMs), this produces 1-2 windows with negligible overhead. At large scale (5M+ per charge), 5-10 windows provide near-linear speedup.

Consensus Strategy Parameters

Parameter	Description	Default	Options
`--strategytype`	Consensus spectrum generation method	`best`	`best`, `most`, `bin`, `average`

For most method (Most Similar Spectrum):

Parameter	Description	Default
`--sim`	Similarity measure method	`dot`
`--fragment_mz_tolerance`	Fragment m/z tolerance for spectrum comparison	`0.02`

For bin method (Binning Strategy):

Parameter	Description	Default
`--min_mz`	Minimum m/z value to consider	`100`
`--max_mz`	Maximum m/z value to consider	`2000`
`--bin_size`	Bin size in m/z units	`0.02`
`--peak_quorum`	Relative number of spectra in a cluster that need to contain a peak for it to be included (0.0-1.0)	`0.25`
`--edge_case_threshold`	Threshold for correcting m/z edge cases during binning (0.0-1.0)	`0.5`

For average method (Average Spectrum):

Parameter	Description	Default
`--diff_thresh`	Minimum distance between MS/MS peak clusters	`0.01`
`--dyn_range`	Dynamic range to apply to output spectra	`1000`
`--min_fraction`	Minimum fraction of cluster spectra where MS/MS peak is present (0.0-1.0)	`0.5`
`--pepmass`	Peptide mass calculation method	`lower_median`
`--msms_avg`	MS/MS averaging method	`weighted`

Output Parameters

Parameter	Description	Default
`--outdir`	Output directory for results	`./results`
`--publish_dir_mode`	Publish directory mode	`copy`

Resource Limits

Parameter	Description	Default
`--max_memory`	Maximum memory per process	`128.GB`
`--max_cpus`	Maximum CPUs per process	`16`
`--max_time`	Maximum time per process	`240.h`

Output Structure

results/
├── cluster_db/
│   └── {species}/{instrument}/{charge}/          # {species}/{charge} when --skip_instrument
│       ├── cluster_metadata.parquet              # one row per cluster (consensus, stats, provenance)
│       └── psm_cluster_membership.parquet        # one row per PSM (USI, cluster_id, scores)
├── msp_files/
│   └── {species}/{instrument}/{charge}/*.msp.gz  # consensus spectral library per partition
└── pipeline_info/
    ├── execution_report_*.html
    ├── execution_timeline_*.html
    ├── execution_trace_*.txt
    └── pipeline_dag_*.html

When --existing_cluster_db is supplied, cluster_metadata.parquet carries over pre-existing cluster_id values where a rep matched, and psm_cluster_membership.parquet is the merged membership (dedup by USI). Use the is_reused_cluster and source_datasets columns to separate new-this-round clusters from reused ones and to trace which projects contribute to each cluster.

Profiles

Container Engine Profiles

docker - Use Docker containers
singularity - Use Singularity containers
podman - Use Podman containers
apptainer - Use Apptainer containers
charliecloud - Use Charliecloud containers
shifter - Use Shifter containers

Platform Profiles

arm64 - For ARM64 architecture (e.g., Apple Silicon)
emulate_amd64 - Emulate AMD64 on ARM64 hosts

Other Profiles

test - Use test configuration with minimal dataset
debug - Enable debug mode with detailed logging
gpu - Enable GPU support
wave - Use Wave containers

Example Profile Combinations

# Docker on x86_64
nextflow run main.nf -profile docker

# Docker on Apple Silicon (ARM64)
nextflow run main.nf -profile docker,arm64

# Singularity on HPC cluster
nextflow run main.nf -profile singularity

# EMBL-EBI Codon SLURM cluster (Singularity + SLURM executor + shared cacheDir)
nextflow run main.nf -profile codon_slurm

# Test run with Docker
nextflow run main.nf -profile test,docker

EBI Codon SLURM

The codon_slurm profile (conf/codon_slurm.config) mirrors the pride_codon_slurm profile from quantms: SLURM executor with a queueSize of 2000, a shared Singularity cacheDir at /hps/nobackup/juan/pride/reanalysis/singularity/, the standard PRIDE reanalysis bind mount, generous retries for shared-FS latency, and per-process resource scaling. To pre-populate the cache, build the SIF locally with pyspectrafuse-lib/scripts/build_singularity.sh using MODE=remote OUT_DIR=/hps/nobackup/juan/pride/reanalysis/singularity/ — the output filename (ghcr.io-bigbio-pyspectrafuse-<version>.sif) is the name Nextflow expects.

Consensus Spectrum Strategies

The workflow supports four different strategies for generating consensus spectra from clusters:

best (default): Selects the spectrum with the best quality metric (lowest posterior error probability or global q-value) as the consensus spectrum.
most: Selects the spectrum most similar to all other spectra in the cluster using similarity measures (e.g., dot product).
bin: Uses binning-based approach where peaks are binned by m/z, and peaks present in a quorum of spectra are included in the consensus.
average: Computes an averaged spectrum from all spectra in the cluster, with configurable peak alignment and filtering.

Choose the strategy based on your use case:

best: Fast, good for high-quality data
most: Good for finding representative spectra
bin: Good for noisy data, robust peak detection
average: Good for high-quality clusters, smooth spectra

Troubleshooting

Common Issues

"Please provide a folder containing the files that will be clustered (--parquet_dir)"
- Solution: Ensure you provide the --parquet_dir parameter with a valid path.
Container platform mismatch (ARM64 vs AMD64)
- Solution: Use -profile docker,arm64 on Apple Silicon, or -profile docker,emulate_amd64 to emulate AMD64.
Memory errors during BUILD_CLUSTER_DB
- The cluster DB builder processes spectrum arrays in 200K-cluster chunks to stay within ~5 GB memory. If you see OOM kills (exit 137), check --max_memory is at least 8 GB.
"Error: No such command '/bin/bash'"
- The Docker image must use CMD (not ENTRYPOINT) so Nextflow can run bash scripts. Rebuild the image if you see this error.
Missing QPX parquet files
- Solution: Ensure each project directory contains .psm.parquet, .run.parquet, and .sample.parquet files.

Getting Help

Check the Nextflow log for detailed error messages
Review the execution report in results/pipeline_info/execution_report.html
Open an issue on GitHub

Citation

If you use spectrafuse in your research, please cite:

The spectrafuse workflow (when published)
MaRaCluster for the clustering algorithm
quantms for the reanalysis workflow

License

This workflow is licensed under the MIT License. See the LICENSE file for details.

Contributing

Contributions are welcome! Please see the Contributing Guidelines for more information.

Acknowledgments

MaRaCluster for the clustering algorithm
quantms for the proteomics reanalysis workflow
Nextflow for the workflow management system
nf-core for workflow best practices and infrastructure

Name		Name	Last commit message	Last commit date
Latest commit History 150 Commits
.github/workflows		.github/workflows
bin		bin
conf		conf
docs		docs
modules/local		modules/local
subworkflows/nf-core/utils_nextflow_pipeline		subworkflows/nf-core/utils_nextflow_pipeline
test_data/TEST001		test_data/TEST001
workflows		workflows
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
main.nf		main.nf
nextflow.config		nextflow.config

Folders and files

Latest commit

History

Repository files navigation

spectrafuse

Workflow Overview

Partitioning

Provenance

Benchmarks

Key Features

Input Requirements

Installation

Prerequisites

Install Nextflow

Clone the Repository

Usage

Fresh run

Merge into an existing cluster DB

Cluster across instruments (drop instrument partitioning)

Resume a Previous Run

Test Run

Parameters

Required Parameters

Pipeline Parameters

MaRaCluster Parameters

Precursor m/z Windowing Parameters

Consensus Strategy Parameters

Output Parameters

Resource Limits

Output Structure

Profiles

Container Engine Profiles

Platform Profiles

Other Profiles

Example Profile Combinations

EBI Codon SLURM

Consensus Spectrum Strategies

Troubleshooting

Common Issues

Getting Help

Citation

License

Contributing

Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages