Add ms2pip feature vector and theoretical m/z computation #11
Merged
Compute 139 features per cleavage site matching the C code layout: peptide-level properties, charge one-hot, AA counts, positional properties, and quartile statistics. Takes ProForma strings directly with charge suffix. Parsing and computation fully parallelized. Includes 7 Rust unit tests.
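As a minimal sketch of the input convention described above, the snippet below splits a ProForma string with a charge suffix (such as `PEPTIDE/2`) into its sequence and precursor charge. The helper name `split_charge_suffix` is hypothetical and is not taken from this PR; how the PR itself parses the suffix is not shown here.

```rust
/// Split a ProForma string such as "PEPTIDE/2" into (sequence, charge).
/// Hypothetical helper for illustration only; not part of this PR's API.
fn split_charge_suffix(proforma: &str) -> Option<(&str, u8)> {
    // Split on the last '/', so any earlier '/' stays in the sequence part.
    let (sequence, charge) = proforma.rsplit_once('/')?;
    // The suffix must be a positive integer charge state.
    let charge: u8 = charge.parse().ok()?;
    if sequence.is_empty() || charge == 0 {
        return None;
    }
    Some((sequence, charge))
}
```

Under these assumptions, `split_charge_suffix("ACDE/2")` yields the sequence `ACDE` with charge 2, while a string without a suffix yields `None`.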
Compute theoretical fragment m/z using rustyms, consistent with annotate_ms2_spectra. Extract parse_ion_series_and_index and extract_charge into utils.rs, used by annotation, feature_vectors, and theoretical_mz modules.
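For readers unfamiliar with the quantity being computed, here is a rough standalone sketch of theoretical b- and y-ion m/z for an unmodified peptide. It does not use rustyms and is not the PR's implementation; the function names, the four-residue mass table, and the restriction to monoisotopic masses are all assumptions made for illustration.

```rust
const PROTON: f64 = 1.00727647; // monoisotopic proton mass (Da)
const WATER: f64 = 18.01056468; // monoisotopic H2O mass (Da)

/// Monoisotopic residue masses for an illustrative subset of amino acids.
fn residue_mass(aa: char) -> Option<f64> {
    match aa {
        'G' => Some(57.02146372),
        'A' => Some(71.03711378),
        'S' => Some(87.03202840),
        'P' => Some(97.05276385),
        _ => None, // residues outside the subset are not supported here
    }
}

/// m/z of a fragment with the given neutral mass and positive charge.
fn to_mz(neutral_mass: f64, charge: u32) -> f64 {
    (neutral_mass + f64::from(charge) * PROTON) / f64::from(charge)
}

/// b-ion: the first `index` residues (1 <= index < peptide length).
fn b_ion_mz(sequence: &str, index: usize, charge: u32) -> Option<f64> {
    let n = sequence.chars().count();
    if index == 0 || index >= n || charge == 0 {
        return None;
    }
    let mut mass = 0.0;
    for aa in sequence.chars().take(index) {
        mass += residue_mass(aa)?;
    }
    Some(to_mz(mass, charge))
}

/// y-ion: the last `index` residues plus one water.
fn y_ion_mz(sequence: &str, index: usize, charge: u32) -> Option<f64> {
    let n = sequence.chars().count();
    if index == 0 || index >= n || charge == 0 {
        return None;
    }
    let mut mass = WATER;
    for aa in sequence.chars().skip(n - index) {
        mass += residue_mass(aa)?;
    }
    Some(to_mz(mass, charge))
}
```

The charge variants mentioned in this PR (b2, y2) correspond to calling these functions with `charge = 2`. A real implementation also has to handle modifications and other ion series, which is what delegating to rustyms provides.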
Replace to_string()-based fragment parsing with direct FragmentType enum matching in parse_fragment(). Replace char-based AA lookup with AminoAcid enum matching in aa_to_ms2pip_index(). Consolidate shared helpers in utils.
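The difference between the two lookup styles can be sketched with a small stand-in enum. `Residue` below is a hypothetical local type, not rustyms's `AminoAcid`, and the index values are placeholders rather than the actual ms2pip ordering.

```rust
use std::fmt;

/// Hypothetical stand-in for an amino acid enum (four variants only).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Residue {
    Alanine,
    Cysteine,
    AsparticAcid,
    GlutamicAcid,
}

impl fmt::Display for Residue {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        let code = match self {
            Residue::Alanine => 'A',
            Residue::Cysteine => 'C',
            Residue::AsparticAcid => 'D',
            Residue::GlutamicAcid => 'E',
        };
        write!(f, "{}", code)
    }
}

/// String-based lookup: formats into a heap-allocated String on every
/// call, then inspects its first character.
fn index_via_string(residue: Residue) -> Option<usize> {
    match residue.to_string().chars().next()? {
        'A' => Some(0),
        'C' => Some(1),
        'D' => Some(2),
        'E' => Some(3),
        _ => None,
    }
}

/// Enum-based lookup: a direct match with no allocation. The compiler
/// also checks that every variant is handled.
fn index_via_enum(residue: Residue) -> usize {
    match residue {
        Residue::Alanine => 0,
        Residue::Cysteine => 1,
        Residue::AsparticAcid => 2,
        Residue::GlutamicAcid => 3,
    }
}
```

Both functions return the same index for every variant. The enum version avoids the temporary `String`, which matters when the lookup runs once per residue per cleavage site.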
Fix c-ion boundary off-by-one (c_length = peplen-i, not peplen-i-1), positional features using C's 1-indexed peptide_buf convention, quartile denominators matching C's per-context formulas, and incremental AA count tracking. Document intentional C code bugs replicated for XGBoost model compatibility. Add exact C reference test for ACDE/2.
Implements the two remaining C code replacements needed to make ms2pip pure-Python: XGBoost feature vector computation and theoretical fragment m/z calculation. Together with the annotation and scoring functions from the previous PR, this eliminates all dependencies on ms2pip's C/Cython code.
Added
- `ms2pip_compute_features(proformas)`: compute 139 XGBoost features per cleavage site, taking ProForma strings with charge suffix (e.g. `PEPTIDE/2`). Returns flat numpy float32 arrays, reshaped to `(n_ions, 139)` on the Python side. Feature vectors match the original C code exactly for XGBoost model compatibility.
- `ms2pip_compute_theoretical_mz(proformas, ion_types, fragmentation_model, mass_mode)`: compute theoretical fragment m/z using rustyms, consistent with `annotate_ms2_spectra`. Supports all ion types including charge variants (b2, y2).
- `FragmentAnnotation` and `AnnotatedMS2Spectrum`
- utils.rs: `parse_fragment` (direct `FragmentType` enum matching, no heap allocation), `extract_charge`, `aa_to_ms2pip_index`
Changed
- Version: `0.5.0-alpha.1`
- Fragment parsing matches on `FragmentType` directly instead of `to_string()` + string parsing, eliminating heap allocations in hot loops
- Amino acid lookup uses `AminoAcid` enum matching instead of `to_string().chars().next()`
- Consolidated `parse_ion_series_and_index` and `extract_fragment_charge` from annotation.rs into shared `parse_fragment` in utils.rs
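The flat feature layout described under Added can be illustrated with a short sketch. It assumes the buffer is row-major with 139 consecutive values per ion, which is what reshaping to `(n_ions, 139)` implies; the names `N_FEATURES` and `feature_rows` are hypothetical and not taken from the PR.

```rust
/// Number of features per cleavage site, as stated in this PR.
const N_FEATURES: usize = 139;

/// View a flat, row-major feature buffer as one slice per ion.
/// Returns None if the length is not a whole number of rows.
fn feature_rows(flat: &[f32]) -> Option<Vec<&[f32]>> {
    if flat.len() % N_FEATURES != 0 {
        return None;
    }
    // chunks_exact yields consecutive, non-overlapping 139-element slices.
    Some(flat.chunks_exact(N_FEATURES).collect())
}
```

On the Python side the equivalent view would come from a NumPy reshape of the returned array. Returning one contiguous buffer rather than nested lists keeps the transfer across the language boundary to a single allocation.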