additional support for ms2rescore refactoring#9
Conversation
There was a problem hiding this comment.
Pull request overview
This PR enhances the ms2rescore-rs crate by adding parallelized spectrum parsing and introducing new Rust-based MS2/MS2PIP feature extraction functions exposed to Python via PyO3, alongside improved Python serialization support for core data classes.
Changes:
- Parallelized spectrum/precursor parsing for both
mzdata- andtimsrust-backed readers usingrayon. - Added new Python-callable feature extraction APIs:
ms2_features_from_ms2spectraandms2pip_features_from_prediction_peak_arrays. - Improved Python interoperability by adding
#[pyclass(module=...)], constructors, and__reduce__forPrecursorandMS2Spectrum; updated dependencies and bumped crate version to0.5.0.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
src/lib.rs |
Releases the GIL around file parsing, registers new Python-callable feature functions, and adjusts error handling. |
src/parse_mzdata.rs |
Adds Rayon-based parallel processing for mzdata spectra/precursors. |
src/parse_timsrust.rs |
Adds Rayon-based parallel processing for timsrust spectrum reads and conversions. |
src/ms2_features.rs |
New module computing annotation/matching features via rustyms in parallel. |
src/ms2pip_features.rs |
New module computing MS2PIP-related features from NumPy-like peak arrays in parallel (block-wise). |
src/precursor.rs |
Adds PyO3 module association, constructor defaults, and __reduce__ for pickling/multiprocessing. |
src/ms2_spectrum.rs |
Adds PyO3 module association, Python constructor, and __reduce__ for pickling/multiprocessing. |
Cargo.toml / Cargo.lock |
Adds dependencies (rayon, rustyms, ordered-float, numpy) and updates versions/features. |
Comments suppressed due to low confidence (1)
src/parse_timsrust.rs:84
parse_precursor_infobuilds an intermediateVec<Spectrum>(spectra) and then does a second parallel pass to extract precursors. Since the final output is only aHashMap<String, Precursor>, this extra materialization can increase memory pressure unnecessarily. Consider extracting the precursor mapping directly from the first (parallel) pass (e.g., map each index to anOption<(id, precursor)>and collect), avoiding storing full spectra when they’re not needed.
let spectra = (0..reader.len())
.into_par_iter()
.map(|index| match reader.get(index) {
Ok(spectrum) => Ok(Some(spectrum)),
Err(err) => handle_spectrum_reader_error(err).map(|_| None),
})
// resolve errors
.collect::<Result<Vec<Option<timsrust::Spectrum>>, std::io::Error>>()
.map_err(|e| std::io::Error::other(e.to_string()))?
// remove None values
.into_iter()
.flatten()
.collect::<Vec<_>>();
let precursor_info = spectra
.into_par_iter()
.filter_map(|spectrum| match spectrum.precursor {
Some(precursor) => Some((spectrum.index.to_string(), Precursor::from(precursor))),
None => None,
})
.collect::<HashMap<_, _>>();
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if idx < seq_len { | ||
| match series { | ||
| 'b' => { | ||
| b_sum += inten; | ||
| b_flags[idx] = true; | ||
| if calculate_hyperscore { | ||
| b_ints.push(inten); | ||
| } | ||
| } | ||
| 'y' => { | ||
| y_sum += inten; | ||
| y_flags[idx] = true; | ||
| if calculate_hyperscore { | ||
| y_ints.push(inten); | ||
| } | ||
| } | ||
| _ => {} |
There was a problem hiding this comment.
idx parsed from fragment labels like "b5"/"y7" is typically 1-based, but it’s used directly as a 0-based index into b_flags/y_flags (b_flags[idx] = true). This likely shifts all coverage features by one position and can also drop the last ion depending on seq_len. Consider converting to 0-based (e.g., pos = idx - 1) and adjusting the bounds check accordingly (and/or sizing the flags to the number of possible fragments).
| if idx < seq_len { | |
| match series { | |
| 'b' => { | |
| b_sum += inten; | |
| b_flags[idx] = true; | |
| if calculate_hyperscore { | |
| b_ints.push(inten); | |
| } | |
| } | |
| 'y' => { | |
| y_sum += inten; | |
| y_flags[idx] = true; | |
| if calculate_hyperscore { | |
| y_ints.push(inten); | |
| } | |
| } | |
| _ => {} | |
| // Fragment indices from labels like "b5"/"y7" are typically 1-based; | |
| // convert to 0-based before indexing into the flag arrays. | |
| if idx > 0 { | |
| let pos = idx - 1; | |
| if pos < seq_len { | |
| match series { | |
| 'b' => { | |
| b_sum += inten; | |
| b_flags[pos] = true; | |
| if calculate_hyperscore { | |
| b_ints.push(inten); | |
| } | |
| } | |
| 'y' => { | |
| y_sum += inten; | |
| y_flags[pos] = true; | |
| if calculate_hyperscore { | |
| y_ints.push(inten); | |
| } | |
| } | |
| _ => {} | |
| } |
| feats.insert("matched_b_ions_pct".to_string(), matched_b as f64 / seq_len as f64); | ||
|
|
||
| feats.insert("matched_y_ions".to_string(), matched_y as f64); | ||
| feats.insert("matched_y_ions_pct".to_string(), matched_y as f64 / seq_len as f64); | ||
|
|
||
| feats.insert( | ||
| "matched_ions_pct".to_string(), | ||
| (matched_b + matched_y) as f64 / (2.0 * seq_len as f64), |
There was a problem hiding this comment.
The coverage percentages are computed as matched_* as f64 / seq_len as f64, but for b/y fragments the number of possible ions is usually seq_len - 1 (ions are numbered 1..(L-1)). If seq_len is the peptide length, this will systematically underreport coverage. Consider clarifying what seq_len represents and using the correct denominator for the intended definition.
| feats.insert("matched_b_ions_pct".to_string(), matched_b as f64 / seq_len as f64); | |
| feats.insert("matched_y_ions".to_string(), matched_y as f64); | |
| feats.insert("matched_y_ions_pct".to_string(), matched_y as f64 / seq_len as f64); | |
| feats.insert( | |
| "matched_ions_pct".to_string(), | |
| (matched_b + matched_y) as f64 / (2.0 * seq_len as f64), | |
| let possible_ion_count = if seq_len > 1 { seq_len - 1 } else { 1 }; | |
| feats.insert( | |
| "matched_b_ions_pct".to_string(), | |
| matched_b as f64 / possible_ion_count as f64, | |
| ); | |
| feats.insert("matched_y_ions".to_string(), matched_y as f64); | |
| feats.insert( | |
| "matched_y_ions_pct".to_string(), | |
| matched_y as f64 / possible_ion_count as f64, | |
| ); | |
| feats.insert( | |
| "matched_ions_pct".to_string(), | |
| (matched_b + matched_y) as f64 / (2.0 * possible_ion_count as f64), |
| if it.pb.len() != it.ob.len() || it.py.len() != it.oy.len() { | ||
| return (it.idx, HashMap::new()); | ||
| } | ||
| if it.pb.is_empty() && it.py.is_empty() { | ||
| return (it.idx, HashMap::new()); | ||
| } |
There was a problem hiding this comment.
For per-spectrum array length mismatches (it.pb.len() != it.ob.len() / it.py.len() != it.oy.len()), the function currently returns an empty feature map for that PSM. This can silently hide upstream data alignment bugs and yield hard-to-debug downstream behavior. Consider raising a Python exception (or at least returning an explicit error indicator) when the per-PSM peak arrays are not aligned.
| let spectra: Vec<_> = reader | ||
| .filter(|spectrum| spectrum.description.ms_level == 2) | ||
| .collect(); | ||
| Ok(spectra | ||
| .into_par_iter() |
There was a problem hiding this comment.
parse_precursor_info now collects all MS2 spectra into a Vec before parallel processing. For large mzML/MGF files this defeats streaming and can significantly increase peak memory usage. If possible, consider using a streaming parallel approach (e.g., par_bridge()/chunking) or rely on mzdata’s internal parallelism rather than materializing the full spectrum list up front.
| let spectra: Vec<_> = reader | |
| .filter(|spectrum| spectrum.description.ms_level == 2) | |
| .collect(); | |
| Ok(spectra | |
| .into_par_iter() | |
| Ok(reader | |
| .filter(|spectrum| spectrum.description.ms_level == 2) | |
| .par_bridge() |
| pub fn get_precursor_info(py: Python<'_>, spectrum_path: String) -> PyResult<HashMap<String, Precursor>> { | ||
| let file_type = match_file_type(&spectrum_path); | ||
|
|
||
| let precursors = match file_type { | ||
| let precursors = py.allow_threads(|| match file_type { | ||
| SpectrumFileType::MascotGenericFormat | ||
| | SpectrumFileType::MzML | ||
| | SpectrumFileType::MzMLb | ||
| | SpectrumFileType::ThermoRaw => parse_mzdata::parse_precursor_info(&spectrum_path), | ||
| SpectrumFileType::BrukerRaw => parse_timsrust::parse_precursor_info(&spectrum_path), | ||
| SpectrumFileType::Unknown => return Err(PyValueError::new_err("Unsupported file type")), | ||
| }; | ||
| SpectrumFileType::Unknown => Err(std::io::Error::new( | ||
| std::io::ErrorKind::InvalidInput, | ||
| "Unsupported file type", | ||
| )), | ||
| }); |
There was a problem hiding this comment.
For SpectrumFileType::Unknown, the code now returns an std::io::Error which is later converted into a generic PyException. Previously (per the removed PyValueError import and prior diff) this case likely surfaced as a Python ValueError. Consider returning PyValueError (or another specific Python exception) for invalid input paths/types so Python callers can reliably catch the right error type.
| pub fn __reduce__(&self, py: Python<'_>) -> PyResult<PrecursorReduceReturn> { | ||
| let cls = py.import("ms2rescore_rs")?.getattr("Precursor")?; | ||
| Ok((cls.into(), (self.mz, self.rt, self.im, self.charge, self.intensity))) | ||
| } |
There was a problem hiding this comment.
Pickling support is introduced via __reduce__, but there’s no test coverage to ensure Precursor round-trips correctly through pickle.dumps/pickle.loads (including default values and attribute preservation). Adding a Python test would help validate the multiprocessing/serialization requirement this PR targets.
| pub fn __reduce__(&self, py: Python<'_>) -> PyResult<MS2SpectrumReduceReturn> { | ||
| let cls = py.import("ms2rescore_rs")?.getattr("MS2Spectrum")?; | ||
| Ok(( | ||
| cls.into(), | ||
| ( | ||
| self.identifier.clone(), | ||
| self.mz.clone(), | ||
| self.intensity.clone(), | ||
| self.precursor.clone(), | ||
| ), | ||
| )) |
There was a problem hiding this comment.
Pickling support is introduced via __reduce__, but there’s no test coverage to ensure MS2Spectrum (including the nested optional Precursor) round-trips correctly through pickle.dumps/pickle.loads. A small Python test would help confirm this works for multiprocessing and won’t regress.
| m.add_function(wrap_pyfunction!( | ||
| ms2_features::ms2_features_from_ms2spectra, | ||
| m | ||
| )?)?; | ||
| m.add_function(wrap_pyfunction!(ms2pip_features::ms2pip_features_from_prediction_peak_arrays, m)?)?; |
There was a problem hiding this comment.
New Python-callable APIs (ms2_features_from_ms2spectra, ms2pip_features_from_prediction_peak_arrays) are added to the module, but there are no corresponding tests alongside the existing Python tests for get_precursor_info/get_ms2_spectra. Adding at least basic correctness + shape/length validation tests (and ideally a small golden-value test) would help prevent regressions, especially given the parallelism and floating-point feature computations.
|
|
||
| use rayon::prelude::*; | ||
|
|
||
| const CLIP_LOG2_MIN: f64 = -9.965_784_284_662_087; // (0.001_f64).log2() |
There was a problem hiding this comment.
Any reason to hard code the number instead of having (0.001_f64).log2()? I would be surprised if the rust compiler wouldn't calculate this at compile time. It might be a bit cleaner, and the code essentially becomes self-documenting.
| let mut feats: HashMap<String, f64> = HashMap::with_capacity(66); | ||
|
|
||
| // log space | ||
| feats.insert("spec_pearson_norm".into(), finite_or_zero(spec_pearson_norm)); |
There was a problem hiding this comment.
Is there really a chance that all of these values are infinite? I also wonder if it would be interesting to calculate (most of) these values while building the HashMap. It would be less verbose, and I think easier to read.
| // ---- Copy spectrum data out of Python objects (must hold GIL) ---- | ||
| #[derive(Clone)] | ||
| struct OwnedSpec { | ||
| id: String, | ||
| mz: Vec<f64>, | ||
| intensity: Vec<f64>, | ||
| seq_len: usize, | ||
| precursor_charge: i32, | ||
| proforma: String, | ||
| } |
There was a problem hiding this comment.
Move struct definition out of function
| proforma: String, | ||
| } | ||
|
|
||
| let mut owned: Vec<OwnedSpec> = Vec::with_capacity(n); |
There was a problem hiding this comment.
It would probably be better if we did the conversion in the iterator, this will copy every single spectrum in memory.
| let mut frag_cache: HashMap<(String, i32), Arc<FragList>> = HashMap::new(); | ||
| for item in &owned { | ||
| let key = (item.proforma.clone(), item.precursor_charge); | ||
| if item.precursor_charge <= 0 { | ||
| frag_cache.insert((item.proforma.clone(), item.precursor_charge), Arc::new(Vec::new())); | ||
| continue; | ||
| } | ||
| if frag_cache.contains_key(&key) { | ||
| continue; | ||
| } | ||
|
|
||
| // Parse peptide | ||
| let peptide = match CompoundPeptidoformIon::pro_forma(&item.proforma, None) | ||
| { | ||
| Ok(p) => p, | ||
| Err(_) => { | ||
| // Store empty fragments to mimic your Python behavior: return [] on parse/annotate failure | ||
| frag_cache.insert(key, Arc::new(Vec::new())); | ||
| continue; | ||
| } | ||
| }; | ||
|
|
||
| // Generate theoretical fragments up to the precursor charge. | ||
| // If you want to cap this (e.g., min(charge, 2)) for speed, do it here. | ||
| let frag_charge = rustyms::system::isize::Charge::new::<rustyms::system::e>( | ||
| item.precursor_charge as isize, | ||
| ); | ||
| let frags = peptide.generate_theoretical_fragments(frag_charge, &model); | ||
| frag_cache.insert(key, Arc::new(frags)); | ||
| } |
There was a problem hiding this comment.
Probably should be a separate function?
| let explained_ratio = if total_intensity > 0.0 { | ||
| (matched_intensity / total_intensity + pseudo).ln() | ||
| } else { | ||
| pseudo.ln() | ||
| }; |
There was a problem hiding this comment.
As we do this three times here, maybe move this into a helper function?
| let frag_cache = Arc::new(frag_cache); | ||
| let params = Arc::new(params); |
There was a problem hiding this comment.
It might be me, but I do kind of feel like it's a bit icky to turn these into an Arc after the fact, technically changing the type of these variables.
Bump pyo3 (0.23 → 0.28), numpy (0.23 → 0.28), and timsrust (0.4.1 → 0.4.2). Migrate to new PyO3 0.28 API: allow_threads → detach, PyObject → Py<PyAny>, downcast → cast, and add from_py_object to pyclass attrs. Apply rustfmt formatting throughout.
|
Merging into release/0.5 to continue refactoring. |
This pull request introduces significant performance improvements and new feature extraction capabilities to the
ms2rescore-rsRust crate. The main focus is on adding parallel processing for spectrum parsing and feature computation, integrating new dependencies, and exposing new feature extraction functions to Python via PyO3.Performance improvements (parallelization):
rayoncrate.New feature extraction and API enhancements:
ms2_featureswith ams2_features_from_ms2spectrafunction, which computes a variety of spectrum annotation and matching features (such as explained intensity, b/y ion coverage, and hyperscore) in parallel, and exposes it as a Python-callable function.ms2pip_featureswith ams2pip_features_from_prediction_peak_arraysfunction, which computes all ms2pip features required for ms2rescore in parallel, and exposes it as a Python-callable function.ms2_features_from_ms2spectraandms2pip_features_from_prediction_peak_arraysas Python functions in the module initialization.Python interoperability improvements:
MS2SpectrumandPrecursorclass with a#[pyclass(module = "ms2rescore_rs", get_all, set_all)]decorator, a Python-compatible constructor, and a__reduce__method for better compatibility with Python serialization (e.g., pickling). This is required for multiprocessing with MS²Rescore.Package management:
0.4.3to0.5.0to reflect the new features and breaking changes.rayon,rustyms,ordered-float, andnumpyinCargo.toml.