MassIVE-KB-spectra-extractor

A Nextflow pipeline for extracting spectra from MassIVE-KB with robust error handling and resume functionality.

Quick Start

Prerequisites

Nextflow and the required Python packages (see environment.yml).

Installation

# Clone the repository
git clone https://github.com/bittremieux-lab/MassIVE-KB-spectra-extractor.git
cd MassIVE-KB-spectra-extractor

# Install dependencies (using conda/mamba)
conda env create -f environment.yml
conda activate nf-mkb

Usage

# Basic usage
nextflow run main.nf --task_id YOUR_MASSIVE_TASK_ID

# Resume from cache
nextflow run main.nf --task_id YOUR_MASSIVE_TASK_ID -resume

Pipeline Overview

The pipeline consists of several key processes:

DOWNLOAD_METADATA: Downloads the MassIVE-KB metadata file for the given task ID
GROUP_TSV: Groups the metadata by mzML/mzXML filename for parallel processing
MZML_GROUP_TO_MGF: Downloads each mzML/mzXML file and converts it to MGF format
MERGE_MGFS: Combines all successfully created MGF files into a single MGF file
COLLECT_FAILED_LOGS: Aggregates failure information for analysis
CREATE_PROCESSING_SUMMARY: Generates a processing report
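
The grouping step can be pictured with a minimal Python sketch. This is not the pipeline's actual GROUP_TSV code: the function name `group_rows_by_file`, the column name `filename`, and the sample rows are assumptions made for illustration only.

```python
import csv
import io
from collections import defaultdict


def group_rows_by_file(tsv_text, key="filename"):
    """Group metadata rows by the spectrum file they refer to."""
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        groups[row[key]].append(row)
    return dict(groups)


# Hypothetical metadata: two spectra from a.mzML, one from b.mzXML.
example_tsv = (
    "filename\tscan\n"
    "a.mzML\t1\n"
    "b.mzXML\t7\n"
    "a.mzML\t2\n"
)
groups = group_rows_by_file(example_tsv)
for name, rows in groups.items():
    print(name, len(rows))
```

Each group can then be handled independently, which is what allows one download and conversion task per mzML/mzXML file.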

Error Handling & Resume Functionality

Because of inconsistencies in mzML/mzXML files and in MassIVE-KB, MZML_GROUP_TO_MGF might fail unexpectedly. The pipeline is designed so that all successful MZML_GROUP_TO_MGF processes are reused from cache when it is rerun with the same task_id and the -resume flag. An overview of the failed processes is written to results_YOUR_TASK_ID/failed_processes.csv. These failures can be addressed by editing mzml_group_to_mgf.py. Edits to any other file might invalidate the cached results of successful MZML_GROUP_TO_MGF processes.
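
To decide which failure to fix first, it can help to count how often each error occurs in the failure report. The sketch below is only an illustration: the layout of failed_processes.csv is not documented here, so the column names `file` and `error`, the helper `count_failures`, and the sample rows are all assumptions.

```python
import csv
import io
from collections import Counter


def count_failures(csv_text, column="error"):
    """Count how often each error message appears in a failure report."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


# Hypothetical report contents; the real column names may differ.
example_report = (
    "file,error\n"
    "a.mzML,missing precursor charge\n"
    "b.mzXML,download failed\n"
    "c.mzML,missing precursor charge\n"
)
counts = count_failures(example_report)
for message, n in counts.most_common():
    print(n, message)
```

For a real run, the text of results_YOUR_TASK_ID/failed_processes.csv would be passed in instead of `example_report`, with `column` set to whichever column actually holds the error message.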

Workflow for Large Processing Jobs

Initial run: Process all files

nextflow run main.nf --task_id YOUR_TASK_ID

Check failures: Review processing summary and failed logs

# Check overall results (replace YOUR_TASK_ID with actual task ID)
cat results_YOUR_TASK_ID/processing_summary.txt

# Review specific failures
cat results_YOUR_TASK_ID/failed_processes.csv

Fix issues: Update the Python script to handle specific error cases

Resume processing: Only failed processes will retry

nextflow run main.nf --task_id YOUR_TASK_ID -resume

Repeat: Continue until all files process successfully
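
As an illustration of what handling a specific error case can look like, the sketch below formats one spectrum as an MGF entry and tolerates a missing precursor charge instead of raising an error. It is not taken from mzml_group_to_mgf.py: the function `format_mgf_entry`, its parameters, and the missing-charge scenario are hypothetical.

```python
def format_mgf_entry(title, pepmass, peaks, charge=None):
    """Format one spectrum as an MGF entry, tolerating a missing charge."""
    lines = ["BEGIN IONS", f"TITLE={title}", f"PEPMASS={pepmass}"]
    if charge is not None:
        # Assumes a positive charge, which MGF writes as e.g. "2+".
        lines.append(f"CHARGE={charge}+")
    # One "m/z intensity" pair per line.
    lines.extend(f"{mz} {intensity}" for mz, intensity in peaks)
    lines.append("END IONS")
    return "\n".join(lines) + "\n"


# A spectrum without a known charge: the CHARGE line is simply omitted.
entry = format_mgf_entry("scan=1", 500.25, [(100.1, 10.0), (200.2, 20.0)])
print(entry)
```

Whether omitting a field, substituting a default, or skipping the spectrum is appropriate depends on how the merged MGF file is used downstream.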

License

This project is licensed under the Apache 2.0 License; see the LICENSE file for details.
