additional support for ms2rescore refactoring by ArthurDeclercq · Pull Request #9 · CompOmics/ms2rescore-rs

ArthurDeclercq · 2026-01-09T15:33:25Z

This pull request introduces significant performance improvements and new feature extraction capabilities to the ms2rescore-rs Rust crate. The main focus is on adding parallel processing for spectrum parsing and feature computation, integrating new dependencies, and exposing new feature extraction functions to Python via PyO3.

Performance improvements (parallelization):

Enabled parallel processing for reading and parsing MS2 spectra and precursor information from both mzData and timsrust sources, using the rayon crate.
Parallelization is also used in the new feature extraction functions (below) that are introduced to bring computational heavy feature calculation to rust. Also here parallelization should significantly speed up compute time.

New feature extraction and API enhancements:

Added a new module ms2_features with a ms2_features_from_ms2spectra function, which computes a variety of spectrum annotation and matching features (such as explained intensity, b/y ion coverage, and hyperscore) in parallel, and exposes it as a Python-callable function.
Added a new module ms2pip_features with a ms2pip_features_from_prediction_peak_arrays function, which computes all ms2pip features required for ms2rescore in parallel, and exposes it as a Python-callable function.
Registered both ms2_features_from_ms2spectra and ms2pip_features_from_prediction_peak_arrays as Python functions in the module initialization.

Python interoperability improvements:

Enhanced the MS2Spectrum and Precursor class with a #[pyclass(module = "ms2rescore_rs", get_all, set_all)] decorator, a Python-compatible constructor, and a __reduce__ method for better compatibility with Python serialization (e.g., pickling). This is required for multiprocessing with MS²Rescore.

Package management:

Bumped the crate version from 0.4.3 to 0.5.0 to reflect the new features and breaking changes.
Introduced new dependencies: rayon, rustyms, ordered-float, and numpy in Cargo.toml.

Copilot

Pull request overview

This PR enhances the ms2rescore-rs crate by adding parallelized spectrum parsing and introducing new Rust-based MS2/MS2PIP feature extraction functions exposed to Python via PyO3, alongside improved Python serialization support for core data classes.

Changes:

Parallelized spectrum/precursor parsing for both mzdata- and timsrust-backed readers using rayon.
Added new Python-callable feature extraction APIs: ms2_features_from_ms2spectra and ms2pip_features_from_prediction_peak_arrays.
Improved Python interoperability by adding #[pyclass(module=...)], constructors, and __reduce__ for Precursor and MS2Spectrum; updated dependencies and bumped crate version to 0.5.0.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
`src/lib.rs`	Releases the GIL around file parsing, registers new Python-callable feature functions, and adjusts error handling.
`src/parse_mzdata.rs`	Adds Rayon-based parallel processing for mzdata spectra/precursors.
`src/parse_timsrust.rs`	Adds Rayon-based parallel processing for timsrust spectrum reads and conversions.
`src/ms2_features.rs`	New module computing annotation/matching features via `rustyms` in parallel.
`src/ms2pip_features.rs`	New module computing MS2PIP-related features from NumPy-like peak arrays in parallel (block-wise).
`src/precursor.rs`	Adds PyO3 module association, constructor defaults, and `__reduce__` for pickling/multiprocessing.
`src/ms2_spectrum.rs`	Adds PyO3 module association, Python constructor, and `__reduce__` for pickling/multiprocessing.
`Cargo.toml` / `Cargo.lock`	Adds dependencies (`rayon`, `rustyms`, `ordered-float`, `numpy`) and updates versions/features.

Comments suppressed due to low confidence (1)

src/parse_timsrust.rs:84

parse_precursor_info builds an intermediate Vec<Spectrum> (spectra) and then does a second parallel pass to extract precursors. Since the final output is only a HashMap<String, Precursor>, this extra materialization can increase memory pressure unnecessarily. Consider extracting the precursor mapping directly from the first (parallel) pass (e.g., map each index to an Option<(id, precursor)> and collect), avoiding storing full spectra when they’re not needed.

    let spectra = (0..reader.len())
        .into_par_iter()
        .map(|index| match reader.get(index) {
            Ok(spectrum) => Ok(Some(spectrum)),
            Err(err) => handle_spectrum_reader_error(err).map(|_| None),
        })
        // resolve errors
        .collect::<Result<Vec<Option<timsrust::Spectrum>>, std::io::Error>>()
        .map_err(|e| std::io::Error::other(e.to_string()))?
        // remove None values
        .into_iter()
        .flatten()
        .collect::<Vec<_>>();

    let precursor_info = spectra
        .into_par_iter()
        .filter_map(|spectrum| match spectrum.precursor {
            Some(precursor) => Some((spectrum.index.to_string(), Precursor::from(precursor))),
            None => None,
        })
        .collect::<HashMap<_, _>>();

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-10T13:44:47Z

+                                if idx < seq_len {
+                                    match series {
+                                        'b' => {
+                                            b_sum += inten;
+                                            b_flags[idx] = true;
+                                            if calculate_hyperscore {
+                                                b_ints.push(inten);
+                                            }
+                                        }
+                                        'y' => {
+                                            y_sum += inten;
+                                            y_flags[idx] = true;
+                                            if calculate_hyperscore {
+                                                y_ints.push(inten);
+                                            }
+                                        }
+                                        _ => {}


idx parsed from fragment labels like "b5"/"y7" is typically 1-based, but it’s used directly as a 0-based index into b_flags/y_flags (b_flags[idx] = true). This likely shifts all coverage features by one position and can also drop the last ion depending on seq_len. Consider converting to 0-based (e.g., pos = idx - 1) and adjusting the bounds check accordingly (and/or sizing the flags to the number of possible fragments).

Suggested change

if idx < seq_len {

match series {

'b' => {

b_sum += inten;

b_flags[idx] = true;

if calculate_hyperscore {

b_ints.push(inten);

}

}

'y' => {

y_sum += inten;

y_flags[idx] = true;

if calculate_hyperscore {

y_ints.push(inten);

}

}

_ => {}

// Fragment indices from labels like "b5"/"y7" are typically 1-based;

// convert to 0-based before indexing into the flag arrays.

if idx > 0 {

let pos = idx - 1;

if pos < seq_len {

match series {

'b' => {

b_sum += inten;

b_flags[pos] = true;

if calculate_hyperscore {

b_ints.push(inten);

}

}

'y' => {

y_sum += inten;

y_flags[pos] = true;

if calculate_hyperscore {

y_ints.push(inten);

}

}

_ => {}

}

Copilot · 2026-02-10T13:44:48Z

+                feats.insert("matched_b_ions_pct".to_string(), matched_b as f64 / seq_len as f64);
+
+                feats.insert("matched_y_ions".to_string(), matched_y as f64);
+                feats.insert("matched_y_ions_pct".to_string(), matched_y as f64 / seq_len as f64);
+
+                feats.insert(
+                    "matched_ions_pct".to_string(),
+                    (matched_b + matched_y) as f64 / (2.0 * seq_len as f64),


The coverage percentages are computed as matched_* as f64 / seq_len as f64, but for b/y fragments the number of possible ions is usually seq_len - 1 (ions are numbered 1..(L-1)). If seq_len is the peptide length, this will systematically underreport coverage. Consider clarifying what seq_len represents and using the correct denominator for the intended definition.

Suggested change

feats.insert("matched_b_ions_pct".to_string(), matched_b as f64 / seq_len as f64);

feats.insert("matched_y_ions".to_string(), matched_y as f64);

feats.insert("matched_y_ions_pct".to_string(), matched_y as f64 / seq_len as f64);

feats.insert(

"matched_ions_pct".to_string(),

(matched_b + matched_y) as f64 / (2.0 * seq_len as f64),

let possible_ion_count = if seq_len > 1 { seq_len - 1 } else { 1 };

feats.insert(

"matched_b_ions_pct".to_string(),

matched_b as f64 / possible_ion_count as f64,

);

feats.insert("matched_y_ions".to_string(), matched_y as f64);

feats.insert(

"matched_y_ions_pct".to_string(),

matched_y as f64 / possible_ion_count as f64,

);

feats.insert(

"matched_ions_pct".to_string(),

(matched_b + matched_y) as f64 / (2.0 * possible_ion_count as f64),

Copilot · 2026-02-10T13:44:48Z

+                    if it.pb.len() != it.ob.len() || it.py.len() != it.oy.len() {
+                        return (it.idx, HashMap::new());
+                    }
+                    if it.pb.is_empty() && it.py.is_empty() {
+                        return (it.idx, HashMap::new());
+                    }


For per-spectrum array length mismatches (it.pb.len() != it.ob.len() / it.py.len() != it.oy.len()), the function currently returns an empty feature map for that PSM. This can silently hide upstream data alignment bugs and yield hard-to-debug downstream behavior. Consider raising a Python exception (or at least returning an explicit error indicator) when the per-PSM peak arrays are not aligned.

Copilot · 2026-02-10T13:44:49Z

+    let spectra: Vec<_> = reader
        .filter(|spectrum| spectrum.description.ms_level == 2)
+        .collect();
+    Ok(spectra
+        .into_par_iter()


parse_precursor_info now collects all MS2 spectra into a Vec before parallel processing. For large mzML/MGF files this defeats streaming and can significantly increase peak memory usage. If possible, consider using a streaming parallel approach (e.g., par_bridge()/chunking) or rely on mzdata’s internal parallelism rather than materializing the full spectrum list up front.

Suggested change

let spectra: Vec<_> = reader

.filter(|spectrum| spectrum.description.ms_level == 2)

.collect();

Ok(spectra

.into_par_iter()

Ok(reader

.filter(|spectrum| spectrum.description.ms_level == 2)

.par_bridge()

Copilot · 2026-02-10T13:44:49Z

+pub fn get_precursor_info(py: Python<'_>, spectrum_path: String) -> PyResult<HashMap<String, Precursor>> {
    let file_type = match_file_type(&spectrum_path);

-    let precursors = match file_type {
+    let precursors = py.allow_threads(|| match file_type {
        SpectrumFileType::MascotGenericFormat
        | SpectrumFileType::MzML
        | SpectrumFileType::MzMLb
        | SpectrumFileType::ThermoRaw => parse_mzdata::parse_precursor_info(&spectrum_path),
        SpectrumFileType::BrukerRaw => parse_timsrust::parse_precursor_info(&spectrum_path),
-        SpectrumFileType::Unknown => return Err(PyValueError::new_err("Unsupported file type")),
-    };
+        SpectrumFileType::Unknown => Err(std::io::Error::new(
+            std::io::ErrorKind::InvalidInput,
+            "Unsupported file type",
+        )),
+    });


For SpectrumFileType::Unknown, the code now returns an std::io::Error which is later converted into a generic PyException. Previously (per the removed PyValueError import and prior diff) this case likely surfaced as a Python ValueError. Consider returning PyValueError (or another specific Python exception) for invalid input paths/types so Python callers can reliably catch the right error type.

Copilot · 2026-02-10T13:44:49Z

+    pub fn __reduce__(&self, py: Python<'_>) -> PyResult<PrecursorReduceReturn> {
+        let cls = py.import("ms2rescore_rs")?.getattr("Precursor")?;
+        Ok((cls.into(), (self.mz, self.rt, self.im, self.charge, self.intensity)))
    }


Pickling support is introduced via __reduce__, but there’s no test coverage to ensure Precursor round-trips correctly through pickle.dumps/pickle.loads (including default values and attribute preservation). Adding a Python test would help validate the multiprocessing/serialization requirement this PR targets.

Copilot · 2026-02-10T13:44:49Z

+    pub fn __reduce__(&self, py: Python<'_>) -> PyResult<MS2SpectrumReduceReturn> {
+        let cls = py.import("ms2rescore_rs")?.getattr("MS2Spectrum")?;
+        Ok((
+            cls.into(),
+            (
+                self.identifier.clone(),
+                self.mz.clone(),
+                self.intensity.clone(),
+                self.precursor.clone(),
+            ),
+        ))


Pickling support is introduced via __reduce__, but there’s no test coverage to ensure MS2Spectrum (including the nested optional Precursor) round-trips correctly through pickle.dumps/pickle.loads. A small Python test would help confirm this works for multiprocessing and won’t regress.

Copilot · 2026-02-10T13:44:50Z

+    m.add_function(wrap_pyfunction!(
+        ms2_features::ms2_features_from_ms2spectra,
+        m
+    )?)?;
+    m.add_function(wrap_pyfunction!(ms2pip_features::ms2pip_features_from_prediction_peak_arrays, m)?)?;


New Python-callable APIs (ms2_features_from_ms2spectra, ms2pip_features_from_prediction_peak_arrays) are added to the module, but there are no corresponding tests alongside the existing Python tests for get_precursor_info/get_ms2_spectra. Adding at least basic correctness + shape/length validation tests (and ideally a small golden-value test) would help prevent regressions, especially given the parallelism and floating-point feature computations.

paretje · 2026-03-10T15:11:33Z

+
+use rayon::prelude::*;
+
+const CLIP_LOG2_MIN: f64 = -9.965_784_284_662_087; // (0.001_f64).log2()


Any reason to hard code the number instead of having (0.001_f64).log2()? I would be surprised if the rust compiler wouldn't calculate this at compile time. It might be a bit cleaner, and the code essentially becomes self-documenting.

paretje · 2026-03-12T10:47:47Z

+                    let mut feats: HashMap<String, f64> = HashMap::with_capacity(66);
+
+                    // log space
+                    feats.insert("spec_pearson_norm".into(), finite_or_zero(spec_pearson_norm));


Is there really a chance that all of these values are infinite? I also wonder if it would be interesting to calculate (most of) these values while building the HashMap. It would be less verbose, and I think easier to read.

paretje · 2026-03-12T17:58:40Z

+    // ---- Copy spectrum data out of Python objects (must hold GIL) ----
+    #[derive(Clone)]
+    struct OwnedSpec {
+        id: String,
+        mz: Vec<f64>,
+        intensity: Vec<f64>,
+        seq_len: usize,
+        precursor_charge: i32,
+        proforma: String,
+    }


Move struct definition out of function

paretje · 2026-03-12T18:04:23Z

+        proforma: String,
+    }
+
+    let mut owned: Vec<OwnedSpec> = Vec::with_capacity(n);


It would probably be better if we did the conversion in the iterator, this will copy every single spectrum in memory.

paretje · 2026-03-13T13:11:43Z

+    let mut frag_cache: HashMap<(String, i32), Arc<FragList>> = HashMap::new();
+    for item in &owned {
+        let key = (item.proforma.clone(), item.precursor_charge);
+        if item.precursor_charge <= 0 {
+            frag_cache.insert((item.proforma.clone(), item.precursor_charge), Arc::new(Vec::new()));
+            continue;
+        }
+        if frag_cache.contains_key(&key) {
+            continue;
+        }
+
+        // Parse peptide
+        let peptide = match CompoundPeptidoformIon::pro_forma(&item.proforma, None)
+        {
+            Ok(p) => p,
+            Err(_) => {
+                // Store empty fragments to mimic your Python behavior: return [] on parse/annotate failure
+                frag_cache.insert(key, Arc::new(Vec::new()));
+                continue;
+            }
+        };
+
+        // Generate theoretical fragments up to the precursor charge.
+        // If you want to cap this (e.g., min(charge, 2)) for speed, do it here.
+        let frag_charge = rustyms::system::isize::Charge::new::<rustyms::system::e>(
+            item.precursor_charge as isize,
+        );
+        let frags = peptide.generate_theoretical_fragments(frag_charge, &model);
+        frag_cache.insert(key, Arc::new(frags));
+    }


Probably should be a separate function?

paretje · 2026-03-13T13:20:04Z

+                let explained_ratio = if total_intensity > 0.0 {
+                    (matched_intensity / total_intensity + pseudo).ln()
+                } else {
+                    pseudo.ln()
+                };


As we do this three times here, maybe move this into a helper function?

paretje · 2026-03-13T13:27:53Z

+    let frag_cache = Arc::new(frag_cache);
+    let params = Arc::new(params);


It might be me, but I do kind of feel like it's a bit icky to turn these into an Arc after the fact, technically changing the type of these variables.

Bump pyo3 (0.23 → 0.28), numpy (0.23 → 0.28), and timsrust (0.4.1 → 0.4.2). Migrate to new PyO3 0.28 API: allow_threads → detach, PyObject → Py<PyAny>, downcast → cast, and add from_py_object to pyclass attrs. Apply rustfmt formatting throughout.

RalfG · 2026-04-10T11:09:56Z

Merging into release/0.5 to continue refactoring.

ArthurDeclercq added 12 commits December 23, 2025 18:06

add pickle support for ms2spectrum and precursor objects

cd2a3be

rayon multithreading

f3815e1

moving ms2 feature calculation to rust

7678af5

move ms2pip features calculation to rust

e1d5efd

version bump

abc7f6a

version bump

ffe6781

fix type_complexity issues with clippy

1033f2b

adapt ms2pip features to handle all input arrays and then convert

7d0952a

change function calls in feature generators

dad9bfe

add new function names to lib.rs

81710f0

add default back to precursor obj

1578acf

fix clippy needles bound error

9cd9440

ArthurDeclercq requested a review from paretje January 19, 2026 12:57

RalfG requested a review from Copilot February 10, 2026 13:36

Copilot started reviewing on behalf of RalfG February 10, 2026 13:36 View session

Copilot AI reviewed Feb 10, 2026

View reviewed changes

paretje reviewed Mar 13, 2026

View reviewed changes

RalfG added 2 commits March 19, 2026 11:06

Set version to alpha

2d5e1f4

RalfG changed the base branch from main to release/0.5 April 10, 2026 11:08

RalfG added this to the 0.5.0 milestone Apr 10, 2026

RalfG added the enhancement New feature or request label Apr 10, 2026

RalfG merged commit e64f5eb into release/0.5 Apr 10, 2026
12 checks passed


		use rayon::prelude::*;

		const CLIP_LOG2_MIN: f64 = -9.965_784_284_662_087; // (0.001_f64).log2()

		let frag_cache = Arc::new(frag_cache);
		let params = Arc::new(params);

Conversation

ArthurDeclercq commented Jan 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

paretje Mar 10, 2026

Choose a reason for hiding this comment

Uh oh!

paretje Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

paretje Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

paretje Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

paretje Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

paretje Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

paretje Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

RalfG commented Apr 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ArthurDeclercq commented Jan 9, 2026 •

edited

Loading