
Add ms2pip feature vector and theoretical m/z computation #11

Merged: RalfG merged 8 commits into release/0.5 from feat/ms2pip-get-vectors on Apr 13, 2026

RalfG (Member) commented Apr 13, 2026

Implements the two remaining C code replacements needed to make ms2pip pure-Python: XGBoost feature vector computation and theoretical fragment m/z calculation. Together with the annotation and scoring functions from the previous PR, this eliminates all dependencies on ms2pip's C/Cython code.


Added

  • ms2pip_compute_features(proformas) — compute 139 XGBoost features per cleavage site, taking ProForma strings with charge suffix (e.g. PEPTIDE/2). Returns flat numpy float32 arrays, reshaped to (n_ions, 139) on the Python side. Feature vectors match the original C code exactly for XGBoost model compatibility.
  • ms2pip_compute_theoretical_mz(proformas, ion_types, fragmentation_model, mass_mode) — compute theoretical fragment m/z using rustyms, consistent with annotate_ms2_spectra. Supports all ion types including charge variants (b2, y2).
  • Pickle support for FragmentAnnotation and AnnotatedMS2Spectrum
  • Shared helpers in utils.rs: parse_fragment (direct FragmentType enum matching, no heap allocation), extract_charge, aa_to_ms2pip_index
  • Rust unit tests including exact C reference test for ACDE/2
  • 20 Python tests for both new functions
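The reshape step described for ms2pip_compute_features can be sketched roughly as below. This is a minimal illustration, not the package's code: the flat array is a zero-filled stand-in rather than real output, and the helper name reshape_features is hypothetical. Only the feature width of 139 and the float32 dtype come from the description above.

```python
import numpy as np

N_FEATURES = 139  # features per cleavage site, as stated in the PR description


def reshape_features(flat: np.ndarray) -> np.ndarray:
    """Reshape a flat feature buffer to (n_ions, 139).

    Hypothetical helper for illustration; not part of the package API.
    """
    if flat.size % N_FEATURES != 0:
        raise ValueError(
            f"buffer length {flat.size} is not a multiple of {N_FEATURES}"
        )
    # For a contiguous buffer, reshape returns a view, so no data is copied.
    return flat.reshape(-1, N_FEATURES)


# Stand-in for the flat array of one peptide: ACDE has 3 cleavage sites,
# assuming one feature row per cleavage site.
flat = np.zeros(3 * N_FEATURES, dtype=np.float32)
features = reshape_features(flat)
```

Validating the buffer length before reshaping turns a silent misalignment of feature columns into an explicit error.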

Changed

  • Version bumped to 0.5.0-alpha.1
  • Fragment parsing now uses direct enum matching on rustyms FragmentType instead of to_string() + string parsing, eliminating heap allocations in hot loops
  • Amino acid index lookup uses direct AminoAcid enum matching instead of to_string().chars().next()
  • Consolidated parse_ion_series_and_index and extract_fragment_charge from annotation.rs into shared parse_fragment in utils.rs

RalfG added 7 commits April 13, 2026 09:59
Compute 139 features per cleavage site matching the C code layout:
peptide-level properties, charge one-hot, AA counts, positional
properties, and quartile statistics. Takes ProForma strings directly
with charge suffix. Parsing and computation fully parallelized.
Includes 7 Rust unit tests.
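
The charge-suffix convention (for example PEPTIDE/2) can be illustrated with a small Python sketch. It is a simplified stand-in, not the Rust extract_charge helper: the name split_charge is hypothetical, and it only accepts a plain positive integer after the last slash, ignoring any richer charge notation that ProForma allows.

```python
def split_charge(proforma: str) -> tuple[str, int]:
    """Split 'PEPTIDE/2' into ('PEPTIDE', 2).

    Hypothetical, simplified helper: handles only a positive integer
    charge after the last '/' in the string.
    """
    sequence, sep, charge_text = proforma.rpartition("/")
    if not sep or not sequence:
        raise ValueError(f"missing charge suffix in {proforma!r}")
    try:
        charge = int(charge_text)
    except ValueError:
        raise ValueError(f"invalid charge in {proforma!r}") from None
    if charge < 1:
        raise ValueError(f"charge must be positive in {proforma!r}")
    return sequence, charge


sequence, charge = split_charge("PEPTIDE/2")
```

Splitting on the last slash leaves bracketed modifications in the sequence part untouched.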

Compute theoretical fragment m/z using rustyms, consistent with
annotate_ms2_spectra. Extract parse_ion_series_and_index and
extract_charge into utils.rs, used by annotation, feature_vectors,
and theoretical_mz modules.
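
For readers unfamiliar with what a theoretical fragment m/z is, the textbook relation for b and y ions of an unmodified peptide can be sketched as follows. This is not how the PR computes the values (it delegates to rustyms under the chosen fragmentation model and mass mode); the function names, the tuple keys, and the four-residue mass table are assumptions made for illustration.

```python
PROTON = 1.007276466621  # proton mass in Da
WATER = 18.010564684     # monoisotopic H2O mass in Da

# Monoisotopic residue masses in Da, limited to the residues used below.
RESIDUE = {"A": 71.037114, "C": 103.009185, "D": 115.026943, "E": 129.042593}


def mz(neutral_mass: float, charge: int) -> float:
    """m/z of a species carrying `charge` protons."""
    return (neutral_mass + charge * PROTON) / charge


def by_ions(sequence: str, charge: int = 1) -> dict[tuple[str, int], float]:
    """Map (series, index) to m/z for b and y ions of an unmodified peptide."""
    masses = [RESIDUE[aa] for aa in sequence]
    ions = {}
    for i in range(1, len(sequence)):
        # b ion: the first i residues from the N-terminus
        ions[("b", i)] = mz(sum(masses[:i]), charge)
        # y ion: the last i residues from the C-terminus, plus water
        ions[("y", i)] = mz(sum(masses[-i:]) + WATER, charge)
    return ions


singly = by_ions("ACDE")             # singly charged b and y ions
doubly = by_ions("ACDE", charge=2)   # doubly charged variants
```

A peptide of length n yields n-1 ions per series, one for each cleavage site.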

Replace to_string()-based fragment parsing with direct FragmentType enum
matching in parse_fragment(). Replace char-based AA lookup with AminoAcid
enum matching in aa_to_ms2pip_index(). Consolidate shared helpers in utils.

Fix c-ion boundary off-by-one (c_length = peplen-i, not peplen-i-1),
positional features using C's 1-indexed peptide_buf convention, quartile
denominators matching C's per-context formulas, and incremental AA count
tracking. Document intentional C code bugs replicated for XGBoost model
compatibility. Add exact C reference test for ACDE/2.

RalfG added this to the 0.5.0 milestone Apr 13, 2026
RalfG merged commit 2c189f7 into release/0.5 Apr 13, 2026
4 checks passed
RalfG deleted the feat/ms2pip-get-vectors branch April 13, 2026 19:42