A Nextflow pipeline for extracting spectra from MassIVE-KB with robust error handling and resume functionality.
Nextflow and the required Python packages are needed; see environment.yml.
# Clone the repository
git clone https://github.com/bittremieux-lab/MassIVE-KB-spectra-extractor.git
cd MassIVE-KB-spectra-extractor
# Install dependencies (using conda/mamba)
conda env create -f environment.yml
conda activate nf-mkb

# Basic usage
nextflow run main.nf --task_id YOUR_MASSIVE_TASK_ID
# Resume from cache
nextflow run main.nf --task_id YOUR_MASSIVE_TASK_ID -resume

The pipeline consists of several key processes:
- DOWNLOAD_METADATA: Downloads a MassIVE-KB metadata file from the task ID
- GROUP_TSV: Groups metadata by mzML/mzXML filename for parallel processing
- MZML_GROUP_TO_MGF: Downloads an mzML/mzXML file and converts it to MGF format
- MERGE_MGFS: Combines all successful MGF files into a single MGF file
- COLLECT_FAILED_LOGS: Aggregates failure information for analysis
- CREATE_PROCESSING_SUMMARY: Generates a comprehensive processing report
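
The grouping step is what makes per-file parallelism possible, so a small illustration may help. The following is only a sketch of the idea behind GROUP_TSV, not the pipeline's actual code: the column name `filename`, the helper `group_rows_by_file`, and the toy rows are all assumptions made for this example.

```python
import csv
import io
from collections import defaultdict


def group_rows_by_file(tsv_text, file_column="filename"):
    """Group TSV rows by the value in file_column.

    file_column is a hypothetical column name; the real MassIVE-KB
    metadata file may label this column differently.
    """
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        groups[row[file_column]].append(row)
    return dict(groups)


# Toy metadata, made up for illustration only.
example_tsv = (
    "filename\tscan\tsequence\n"
    "run_a.mzML\t101\tPEPTIDEK\n"
    "run_b.mzXML\t7\tSAMPLER\n"
    "run_a.mzML\t245\tELVISK\n"
)

groups = group_rows_by_file(example_tsv)
for name, rows in groups.items():
    # Each group could then be handled by one independent process.
    print(name, [r["scan"] for r in rows])
```

Because every group refers to a single mzML/mzXML file, each one can be downloaded and converted independently of the others.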
Because of inconsistencies in mzML/mzXML files and in MassIVE-KB, MZML_GROUP_TO_MGF might fail unexpectedly.
The pipeline is designed so that all successful MZML_GROUP_TO_MGF processes are reused from cache when
rerunning with the same task_id and the -resume flag. An overview of failed processes is also written
to results_YOUR_TASK_ID/failed_processes.csv. These failures can be addressed by editing mzml_group_to_mgf.py.
Edits to any other files might invalidate the cached results of successful MZML_GROUP_TO_MGF processes.
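
As one possible way to handle a specific error case in a conversion script, the sketch below collects per-spectrum failures instead of aborting on the first one. It is not taken from mzml_group_to_mgf.py: `convert_group`, `toy_convert`, the dictionary keys, and the sample data are all hypothetical, and whether a bad spectrum should be skipped or should fail the whole process is a choice left to whoever edits the script.

```python
def convert_group(spectra, convert_one):
    """Convert each spectrum, collecting failures instead of aborting.

    spectra is an iterable of (spectrum_id, raw) pairs and convert_one is
    the conversion function; both are hypothetical stand-ins.
    """
    converted, failures = [], []
    for spectrum_id, raw in spectra:
        try:
            converted.append(convert_one(raw))
        except (KeyError, ValueError) as err:
            # Record which spectrum failed and why, then keep going.
            failures.append((spectrum_id, f"{type(err).__name__}: {err}"))
    return converted, failures


def toy_convert(raw):
    # Hypothetical conversion: requires a numeric precursor m/z.
    return {"pepmass": float(raw["precursor_mz"]), "peaks": raw["peaks"]}


# Made-up input with two deliberately broken entries.
spectra = [
    ("scan=1", {"precursor_mz": "500.25", "peaks": [(100.0, 1.0)]}),
    ("scan=2", {"peaks": [(200.0, 2.0)]}),             # missing precursor m/z
    ("scan=3", {"precursor_mz": "n/a", "peaks": []}),  # unparseable value
]

converted, failures = convert_group(spectra, toy_convert)
print(len(converted), "converted")
for spectrum_id, reason in failures:
    print("failed:", spectrum_id, reason)
```

Catching only the exception types that are expected keeps genuinely unexpected errors visible, so they still surface as failed processes.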
- Initial run: Process all files
  nextflow run main.nf --task_id YOUR_TASK_ID
- Check failures: Review the processing summary and failed logs
  # Check overall results (replace YOUR_TASK_ID with the actual task ID)
  cat results_YOUR_TASK_ID/processing_summary.txt
  # Review specific failures
  cat results_YOUR_TASK_ID/failed_processes.csv
- Fix issues: Update the Python script to handle specific error cases
- Resume processing: Only failed processes will retry
  nextflow run main.nf --task_id YOUR_TASK_ID -resume
- Repeat: Continue until all files process successfully
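
When many files fail, tallying the failures by cause can show which fix to attempt first. The sketch below assumes that failed_processes.csv has a header row; its actual columns are not documented here, so the column name `error`, the helper `tally_failures`, and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def tally_failures(csv_file, column):
    """Count how often each value of `column` occurs in a CSV with a header."""
    reader = csv.DictReader(csv_file)
    if column not in (reader.fieldnames or []):
        raise KeyError(
            f"column {column!r} not found; available: {reader.fieldnames}"
        )
    return Counter(row[column] for row in reader)


# Made-up rows standing in for results_YOUR_TASK_ID/failed_processes.csv.
example = io.StringIO(
    "file,error\n"
    "run_a.mzML,download failed\n"
    "run_b.mzXML,missing scan\n"
    "run_c.mzML,download failed\n"
)

counts = tally_failures(example, "error")
for reason, n in counts.most_common():
    print(f"{n:>4}  {reason}")
```

To use it on a real run, open results_YOUR_TASK_ID/failed_processes.csv with `open(path, newline="")` and pass whichever column actually describes the failure.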
This project is licensed under the Apache 2.0 license; see the LICENSE file for details.