Skip to content

additional support for ms2rescore refactoring#9

Merged
RalfG merged 14 commits into
release/0.5from
pickle-support
Apr 10, 2026
Merged

additional support for ms2rescore refactoring#9
RalfG merged 14 commits into
release/0.5from
pickle-support

Conversation

@ArthurDeclercq
Copy link
Copy Markdown
Contributor

@ArthurDeclercq ArthurDeclercq commented Jan 9, 2026

This pull request introduces significant performance improvements and new feature extraction capabilities to the ms2rescore-rs Rust crate. The main focus is on adding parallel processing for spectrum parsing and feature computation, integrating new dependencies, and exposing new feature extraction functions to Python via PyO3.

Performance improvements (parallelization):

  • Enabled parallel processing for reading and parsing MS2 spectra and precursor information from both mzData and timsrust sources, using the rayon crate.
  • Parallelization is also used in the new feature extraction functions (below) that are introduced to bring computational heavy feature calculation to rust. Also here parallelization should significantly speed up compute time.

New feature extraction and API enhancements:

  • Added a new module ms2_features with a ms2_features_from_ms2spectra function, which computes a variety of spectrum annotation and matching features (such as explained intensity, b/y ion coverage, and hyperscore) in parallel, and exposes it as a Python-callable function.
  • Added a new module ms2pip_features with a ms2pip_features_from_prediction_peak_arrays function, which computes all ms2pip features required for ms2rescore in parallel, and exposes it as a Python-callable function.
  • Registered both ms2_features_from_ms2spectra and ms2pip_features_from_prediction_peak_arrays as Python functions in the module initialization.

Python interoperability improvements:

  • Enhanced the MS2Spectrum and Precursor class with a #[pyclass(module = "ms2rescore_rs", get_all, set_all)] decorator, a Python-compatible constructor, and a __reduce__ method for better compatibility with Python serialization (e.g., pickling). This is required for multiprocessing with MS²Rescore.

Package management:

  • Bumped the crate version from 0.4.3 to 0.5.0 to reflect the new features and breaking changes.
  • Introduced new dependencies: rayon, rustyms, ordered-float, and numpy in Cargo.toml.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances the ms2rescore-rs crate by adding parallelized spectrum parsing and introducing new Rust-based MS2/MS2PIP feature extraction functions exposed to Python via PyO3, alongside improved Python serialization support for core data classes.

Changes:

  • Parallelized spectrum/precursor parsing for both mzdata- and timsrust-backed readers using rayon.
  • Added new Python-callable feature extraction APIs: ms2_features_from_ms2spectra and ms2pip_features_from_prediction_peak_arrays.
  • Improved Python interoperability by adding #[pyclass(module=...)], constructors, and __reduce__ for Precursor and MS2Spectrum; updated dependencies and bumped crate version to 0.5.0.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/lib.rs Releases the GIL around file parsing, registers new Python-callable feature functions, and adjusts error handling.
src/parse_mzdata.rs Adds Rayon-based parallel processing for mzdata spectra/precursors.
src/parse_timsrust.rs Adds Rayon-based parallel processing for timsrust spectrum reads and conversions.
src/ms2_features.rs New module computing annotation/matching features via rustyms in parallel.
src/ms2pip_features.rs New module computing MS2PIP-related features from NumPy-like peak arrays in parallel (block-wise).
src/precursor.rs Adds PyO3 module association, constructor defaults, and __reduce__ for pickling/multiprocessing.
src/ms2_spectrum.rs Adds PyO3 module association, Python constructor, and __reduce__ for pickling/multiprocessing.
Cargo.toml / Cargo.lock Adds dependencies (rayon, rustyms, ordered-float, numpy) and updates versions/features.
Comments suppressed due to low confidence (1)

src/parse_timsrust.rs:84

  • parse_precursor_info builds an intermediate Vec<Spectrum> (spectra) and then does a second parallel pass to extract precursors. Since the final output is only a HashMap<String, Precursor>, this extra materialization can increase memory pressure unnecessarily. Consider extracting the precursor mapping directly from the first (parallel) pass (e.g., map each index to an Option<(id, precursor)> and collect), avoiding storing full spectra when they’re not needed.
    let spectra = (0..reader.len())
        .into_par_iter()
        .map(|index| match reader.get(index) {
            Ok(spectrum) => Ok(Some(spectrum)),
            Err(err) => handle_spectrum_reader_error(err).map(|_| None),
        })
        // resolve errors
        .collect::<Result<Vec<Option<timsrust::Spectrum>>, std::io::Error>>()
        .map_err(|e| std::io::Error::other(e.to_string()))?
        // remove None values
        .into_iter()
        .flatten()
        .collect::<Vec<_>>();

    let precursor_info = spectra
        .into_par_iter()
        .filter_map(|spectrum| match spectrum.precursor {
            Some(precursor) => Some((spectrum.index.to_string(), Precursor::from(precursor))),
            None => None,
        })
        .collect::<HashMap<_, _>>();

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/ms2_features.rs
Comment on lines +269 to +285
if idx < seq_len {
match series {
'b' => {
b_sum += inten;
b_flags[idx] = true;
if calculate_hyperscore {
b_ints.push(inten);
}
}
'y' => {
y_sum += inten;
y_flags[idx] = true;
if calculate_hyperscore {
y_ints.push(inten);
}
}
_ => {}
Copy link

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idx parsed from fragment labels like "b5"/"y7" is typically 1-based, but it’s used directly as a 0-based index into b_flags/y_flags (b_flags[idx] = true). This likely shifts all coverage features by one position and can also drop the last ion depending on seq_len. Consider converting to 0-based (e.g., pos = idx - 1) and adjusting the bounds check accordingly (and/or sizing the flags to the number of possible fragments).

Suggested change
if idx < seq_len {
match series {
'b' => {
b_sum += inten;
b_flags[idx] = true;
if calculate_hyperscore {
b_ints.push(inten);
}
}
'y' => {
y_sum += inten;
y_flags[idx] = true;
if calculate_hyperscore {
y_ints.push(inten);
}
}
_ => {}
// Fragment indices from labels like "b5"/"y7" are typically 1-based;
// convert to 0-based before indexing into the flag arrays.
if idx > 0 {
let pos = idx - 1;
if pos < seq_len {
match series {
'b' => {
b_sum += inten;
b_flags[pos] = true;
if calculate_hyperscore {
b_ints.push(inten);
}
}
'y' => {
y_sum += inten;
y_flags[pos] = true;
if calculate_hyperscore {
y_ints.push(inten);
}
}
_ => {}
}

Copilot uses AI. Check for mistakes.
Comment thread src/ms2_features.rs Outdated
Comment on lines +331 to +338
feats.insert("matched_b_ions_pct".to_string(), matched_b as f64 / seq_len as f64);

feats.insert("matched_y_ions".to_string(), matched_y as f64);
feats.insert("matched_y_ions_pct".to_string(), matched_y as f64 / seq_len as f64);

feats.insert(
"matched_ions_pct".to_string(),
(matched_b + matched_y) as f64 / (2.0 * seq_len as f64),
Copy link

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The coverage percentages are computed as matched_* as f64 / seq_len as f64, but for b/y fragments the number of possible ions is usually seq_len - 1 (ions are numbered 1..(L-1)). If seq_len is the peptide length, this will systematically underreport coverage. Consider clarifying what seq_len represents and using the correct denominator for the intended definition.

Suggested change
feats.insert("matched_b_ions_pct".to_string(), matched_b as f64 / seq_len as f64);
feats.insert("matched_y_ions".to_string(), matched_y as f64);
feats.insert("matched_y_ions_pct".to_string(), matched_y as f64 / seq_len as f64);
feats.insert(
"matched_ions_pct".to_string(),
(matched_b + matched_y) as f64 / (2.0 * seq_len as f64),
let possible_ion_count = if seq_len > 1 { seq_len - 1 } else { 1 };
feats.insert(
"matched_b_ions_pct".to_string(),
matched_b as f64 / possible_ion_count as f64,
);
feats.insert("matched_y_ions".to_string(), matched_y as f64);
feats.insert(
"matched_y_ions_pct".to_string(),
matched_y as f64 / possible_ion_count as f64,
);
feats.insert(
"matched_ions_pct".to_string(),
(matched_b + matched_y) as f64 / (2.0 * possible_ion_count as f64),

Copilot uses AI. Check for mistakes.
Comment thread src/ms2pip_features.rs
Comment on lines +396 to +401
if it.pb.len() != it.ob.len() || it.py.len() != it.oy.len() {
return (it.idx, HashMap::new());
}
if it.pb.is_empty() && it.py.is_empty() {
return (it.idx, HashMap::new());
}
Copy link

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For per-spectrum array length mismatches (it.pb.len() != it.ob.len() / it.py.len() != it.oy.len()), the function currently returns an empty feature map for that PSM. This can silently hide upstream data alignment bugs and yield hard-to-debug downstream behavior. Consider raising a Python exception (or at least returning an explicit error indicator) when the per-PSM peak arrays are not aligned.

Copilot uses AI. Check for mistakes.
Comment thread src/parse_mzdata.rs
Comment on lines +53 to +57
let spectra: Vec<_> = reader
.filter(|spectrum| spectrum.description.ms_level == 2)
.collect();
Ok(spectra
.into_par_iter()
Copy link

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse_precursor_info now collects all MS2 spectra into a Vec before parallel processing. For large mzML/MGF files this defeats streaming and can significantly increase peak memory usage. If possible, consider using a streaming parallel approach (e.g., par_bridge()/chunking) or rely on mzdata’s internal parallelism rather than materializing the full spectrum list up front.

Suggested change
let spectra: Vec<_> = reader
.filter(|spectrum| spectrum.description.ms_level == 2)
.collect();
Ok(spectra
.into_par_iter()
Ok(reader
.filter(|spectrum| spectrum.description.ms_level == 2)
.par_bridge()

Copilot uses AI. Check for mistakes.
Comment thread src/lib.rs Outdated
Comment on lines +28 to +41
pub fn get_precursor_info(py: Python<'_>, spectrum_path: String) -> PyResult<HashMap<String, Precursor>> {
let file_type = match_file_type(&spectrum_path);

let precursors = match file_type {
let precursors = py.allow_threads(|| match file_type {
SpectrumFileType::MascotGenericFormat
| SpectrumFileType::MzML
| SpectrumFileType::MzMLb
| SpectrumFileType::ThermoRaw => parse_mzdata::parse_precursor_info(&spectrum_path),
SpectrumFileType::BrukerRaw => parse_timsrust::parse_precursor_info(&spectrum_path),
SpectrumFileType::Unknown => return Err(PyValueError::new_err("Unsupported file type")),
};
SpectrumFileType::Unknown => Err(std::io::Error::new(
std::io::ErrorKind::InvalidInput,
"Unsupported file type",
)),
});
Copy link

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For SpectrumFileType::Unknown, the code now returns an std::io::Error which is later converted into a generic PyException. Previously (per the removed PyValueError import and prior diff) this case likely surfaced as a Python ValueError. Consider returning PyValueError (or another specific Python exception) for invalid input paths/types so Python callers can reliably catch the right error type.

Copilot uses AI. Check for mistakes.
Comment thread src/precursor.rs
Comment on lines +32 to 35
pub fn __reduce__(&self, py: Python<'_>) -> PyResult<PrecursorReduceReturn> {
let cls = py.import("ms2rescore_rs")?.getattr("Precursor")?;
Ok((cls.into(), (self.mz, self.rt, self.im, self.charge, self.intensity)))
}
Copy link

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pickling support is introduced via __reduce__, but there’s no test coverage to ensure Precursor round-trips correctly through pickle.dumps/pickle.loads (including default values and attribute preservation). Adding a Python test would help validate the multiprocessing/serialization requirement this PR targets.

Copilot uses AI. Check for mistakes.
Comment thread src/ms2_spectrum.rs
Comment on lines +48 to +58
pub fn __reduce__(&self, py: Python<'_>) -> PyResult<MS2SpectrumReduceReturn> {
let cls = py.import("ms2rescore_rs")?.getattr("MS2Spectrum")?;
Ok((
cls.into(),
(
self.identifier.clone(),
self.mz.clone(),
self.intensity.clone(),
self.precursor.clone(),
),
))
Copy link

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pickling support is introduced via __reduce__, but there’s no test coverage to ensure MS2Spectrum (including the nested optional Precursor) round-trips correctly through pickle.dumps/pickle.loads. A small Python test would help confirm this works for multiprocessing and won’t regress.

Copilot uses AI. Check for mistakes.
Comment thread src/lib.rs Outdated
Comment on lines +80 to +84
m.add_function(wrap_pyfunction!(
ms2_features::ms2_features_from_ms2spectra,
m
)?)?;
m.add_function(wrap_pyfunction!(ms2pip_features::ms2pip_features_from_prediction_peak_arrays, m)?)?;
Copy link

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New Python-callable APIs (ms2_features_from_ms2spectra, ms2pip_features_from_prediction_peak_arrays) are added to the module, but there are no corresponding tests alongside the existing Python tests for get_precursor_info/get_ms2_spectra. Adding at least basic correctness + shape/length validation tests (and ideally a small golden-value test) would help prevent regressions, especially given the parallelism and floating-point feature computations.

Copilot uses AI. Check for mistakes.
Comment thread src/ms2_features.rs
Comment thread src/ms2pip_features.rs

use rayon::prelude::*;

const CLIP_LOG2_MIN: f64 = -9.965_784_284_662_087; // (0.001_f64).log2()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to hard code the number instead of having (0.001_f64).log2()? I would be surprised if the rust compiler wouldn't calculate this at compile time. It might be a bit cleaner, and the code essentially becomes self-documenting.

Comment thread src/ms2pip_features.rs Outdated
let mut feats: HashMap<String, f64> = HashMap::with_capacity(66);

// log space
feats.insert("spec_pearson_norm".into(), finite_or_zero(spec_pearson_norm));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there really a chance that all of these values are infinite? I also wonder if it would be interesting to calculate (most of) these values while building the HashMap. It would be less verbose, and I think easier to read.

Comment thread src/ms2_features.rs
Comment on lines +108 to +117
// ---- Copy spectrum data out of Python objects (must hold GIL) ----
#[derive(Clone)]
struct OwnedSpec {
id: String,
mz: Vec<f64>,
intensity: Vec<f64>,
seq_len: usize,
precursor_charge: i32,
proforma: String,
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move struct definition out of function

Comment thread src/ms2_features.rs
proforma: String,
}

let mut owned: Vec<OwnedSpec> = Vec::with_capacity(n);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would probably be better if we did the conversion in the iterator, this will copy every single spectrum in memory.

Comment thread src/ms2_features.rs
Comment on lines +152 to +181
let mut frag_cache: HashMap<(String, i32), Arc<FragList>> = HashMap::new();
for item in &owned {
let key = (item.proforma.clone(), item.precursor_charge);
if item.precursor_charge <= 0 {
frag_cache.insert((item.proforma.clone(), item.precursor_charge), Arc::new(Vec::new()));
continue;
}
if frag_cache.contains_key(&key) {
continue;
}

// Parse peptide
let peptide = match CompoundPeptidoformIon::pro_forma(&item.proforma, None)
{
Ok(p) => p,
Err(_) => {
// Store empty fragments to mimic your Python behavior: return [] on parse/annotate failure
frag_cache.insert(key, Arc::new(Vec::new()));
continue;
}
};

// Generate theoretical fragments up to the precursor charge.
// If you want to cap this (e.g., min(charge, 2)) for speed, do it here.
let frag_charge = rustyms::system::isize::Charge::new::<rustyms::system::e>(
item.precursor_charge as isize,
);
let frags = peptide.generate_theoretical_fragments(frag_charge, &model);
frag_cache.insert(key, Arc::new(frags));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should be a separate function?

Comment thread src/ms2_features.rs
Comment on lines +300 to +304
let explained_ratio = if total_intensity > 0.0 {
(matched_intensity / total_intensity + pseudo).ln()
} else {
pseudo.ln()
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we do this three times here, maybe move this into a helper function?

Comment thread src/ms2_features.rs
Comment on lines +183 to +184
let frag_cache = Arc::new(frag_cache);
let params = Arc::new(params);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be me, but I do kind of feel like it's a bit icky to turn these into an Arc after the fact, technically changing the type of these variables.

RalfG added 2 commits March 19, 2026 11:06
Bump pyo3 (0.23 → 0.28), numpy (0.23 → 0.28), and timsrust (0.4.1 →
0.4.2). Migrate to new PyO3 0.28 API: allow_threads → detach, PyObject
→ Py<PyAny>, downcast → cast, and add from_py_object to pyclass attrs.
Apply rustfmt formatting throughout.
@RalfG RalfG changed the base branch from main to release/0.5 April 10, 2026 11:08
@RalfG RalfG added this to the 0.5.0 milestone Apr 10, 2026
@RalfG RalfG added the enhancement New feature or request label Apr 10, 2026
@RalfG
Copy link
Copy Markdown
Member

RalfG commented Apr 10, 2026

Merging into release/0.5 to continue refactoring.

@RalfG RalfG merged commit e64f5eb into release/0.5 Apr 10, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants