
Feat use mgf for sample spectra #173

Merged
JemmaLDaniel merged 11 commits into main from feat-use-mgf-for-sample-spectra on Jun 27, 2026

Conversation


@JemmaLDaniel JemmaLDaniel commented Mar 31, 2026

Collaborator

Summary

Point default train / predict configs and configuration docs at spectra.mgf for InstaNovo-style example data, so the sample pipeline uses the MGF sample files end-to-end.

Motivation

Using MGF for the Winnow example gives a realistic MS2 spectrum file that people already know from mass-spec workflows, and it is plain text and human-readable, which makes debugging and learning from the sample much easier than with opaque columnar formats like parquet.
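
To illustrate the readability point, here is a minimal sketch of reading an MGF record using only the Python standard library. The record contents (`example_scan_1`, the m/z and intensity values) are made up for illustration, and `parse_mgf` is a hypothetical helper, not part of Winnow or InstaNovo. It assumes the basic `BEGIN IONS` / `KEY=VALUE` / peak-line layout and ignores everything else.

```python
from io import StringIO


def parse_mgf(handle):
    """Yield one dict per BEGIN IONS / END IONS block (hypothetical helper)."""
    spectrum = None
    for raw_line in handle:
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                spectrum["params"][key.upper()] = value
            else:
                # Peak line: m/z and intensity separated by whitespace
                mz, intensity = line.split()[:2]
                spectrum["peaks"].append((float(mz), float(intensity)))


# Made-up record, for illustration only
EXAMPLE_MGF = """\
BEGIN IONS
TITLE=example_scan_1
PEPMASS=500.25
CHARGE=2+
100.0 10.0
200.5 25.0
END IONS
"""

spectra = list(parse_mgf(StringIO(EXAMPLE_MGF)))
print(spectra[0]["params"]["TITLE"], len(spectra[0]["peaks"]))
```

For real data a dedicated reader is the safer choice; the sketch only shows that the format can be inspected and parsed by eye.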

@JemmaLDaniel JemmaLDaniel requested a review from BioGeek March 31, 2026 08:21
@JemmaLDaniel JemmaLDaniel self-assigned this Mar 31, 2026
@JemmaLDaniel JemmaLDaniel added the enhancement New feature or request label Mar 31, 2026

github-actions Bot commented Mar 31, 2026


Coverage

Coverage Report
| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| __init__.py | 0 | 0 | 100% | |
| data_types.py | 4 | 0 | 100% | |
| calibration/__init__.py | 0 | 0 | 100% | |
| calibration/calibration_features.py | 5 | 0 | 100% | |
| calibration/calibrator.py | 90 | 15 | 83% | 69–70, 72, 106–109, 134–135, 137, 162–163, 167, 194–195 |
| calibration/diagnostics.py | 168 | 50 | 70% | 70, 96, 101, 111, 115, 137, 146, 203–218, 261–262, 266, 307, 309–324, 335–341 |
| calibration/features/__init__.py | 10 | 0 | 100% | |
| calibration/features/base.py | 8 | 0 | 100% | |
| calibration/features/beam.py | 47 | 0 | 100% | |
| calibration/features/chimeric.py | 78 | 1 | 98% | 204 |
| calibration/features/constants.py | 4 | 0 | 100% | |
| calibration/features/fragment_match.py | 74 | 1 | 98% | 194 |
| calibration/features/mass_error.py | 67 | 2 | 97% | 16, 20 |
| calibration/features/retention_time.py | 135 | 9 | 93% | 183, 190, 206, 257–259, 269, 272–273 |
| calibration/features/sequence.py | 19 | 0 | 100% | |
| calibration/features/token_score.py | 37 | 1 | 97% | 82 |
| calibration/features/utils.py | 135 | 2 | 98% | 35, 233 |
| compat/__init__.py | 0 | 0 | 100% | |
| compat/instanovo.py | 10 | 6 | 40% | 12, 14–15, 17, 24–25 |
| datasets/__init__.py | 0 | 0 | 100% | |
| datasets/calibration_dataset.py | 109 | 17 | 84% | 155, 169, 171, 173, 183, 196, 249, 251–252, 258–261, 263–266 |
| datasets/interfaces.py | 3 | 0 | 100% | |
| datasets/psm_dataset.py | 25 | 0 | 100% | |
| datasets/data_loaders/__init__.py | 5 | 0 | 100% | |
| datasets/data_loaders/instanovo.py | 119 | 19 | 84% | 90, 93, 119, 142, 168–169, 172–174, 176–177, 179, 182–183, 185, 343–345, 356 |
| datasets/data_loaders/mztab.py | 215 | 55 | 74% | 103, 106, 157, 161, 210–211, 223, 236–240, 287, 290, 302–303, 315–317, 319–320, 322, 324, 330, 334–336, 338–339, 343–346, 350, 514–515, 518, 521, 528, 542–546, 550–555, 561, 570–571, 573, 599 |
| datasets/data_loaders/pointnovo.py | 7 | 0 | 100% | |
| datasets/data_loaders/utils.py | 59 | 1 | 98% | 11 |
| datasets/data_loaders/winnow.py | 39 | 4 | 89% | 54–55, 91–92 |
| fdr/__init__.py | 0 | 0 | 100% | |
| fdr/base.py | 58 | 15 | 74% | 81, 85–86, 91, 98–99, 105, 126, 129–130, 135, 137–138, 144, 186 |
| fdr/database_grounded.py | 28 | 1 | 96% | 52 |
| fdr/nonparametric.py | 25 | 4 | 84% | 62, 68–69, 72 |
| scripts/__init__.py | 0 | 0 | 100% | |
| scripts/main.py | 256 | 256 | 0% | 8, 10–13, 16–20, 23–24, 26–28, 32, 39, 44, 47, 53, 55–56, 59, 68, 76, 79, 86, 88–90, 92, 94–99, 102, 104–105, 110, 125, 128, 135–141, 144–145, 148, 161–163, 166, 169, 174, 176–178, 180, 182–183, 186–187, 190, 192–193, 195, 197, 199–200, 202, 205–206, 209–210, 213–214, 217–219, 221–224, 227–229, 231, 234, 248–250, 252, 254, 259, 261–263, 265–266, 268, 270–271, 273–275, 277, 279, 281–282, 286–289, 291–292, 294–295, 297–298, 300, 303, 317–319, 322, 325, 330, 332–334, 336–338, 340–341, 344–345, 348, 350–351, 353, 355, 357–358, 360, 363–364, 370–372, 374–377, 380–381, 384–385, 388–389, 392–393, 401–403, 407, 410, 414, 417, 423–425, 427–428, 435–436, 438, 440, 445, 447–449, 451–452, 455, 457–458, 460–463, 465–466, 468–469, 471–473, 479–480, 484–485, 488, 495, 500–501, 506–508, 511, 516, 526, 533, 535, 539, 541–542, 546–547, 550, 573, 586–587, 590, 612, 624–625, 628, 653, 666–667, 670, 685, 697–698, 701, 716, 728–729, 732, 744, 756–757, 760, 775, 787–788, 791, 800, 812–813 |
| utils/__init__.py | 4 | 0 | 100% | |
| utils/config_formatter.py | 53 | 40 | 24% | 29, 37–38, 40–42, 44, 55, 58–60, 62–63, 66–69, 72–74, 77–78, 80, 91, 102, 113, 127–128, 130–132, 145–147, 150, 153–154, 157–158, 160 |
| utils/config_path.py | 76 | 5 | 93% | 24–26, 117–118 |
| utils/peptide.py | 16 | 0 | 100% | |
| TOTAL | 1988 | 504 | 74% | |

| Tests | Skipped | Failures | Errors | Time |
|---|---|---|---|---|
| 378 | 0 💤 | 0 ❌ | 0 🔥 | 42.510s ⏱️ |

Comment thread scripts/generate_sample_data.py Outdated

Contributor


If you're using a spectra.mgf as example data, why create a "fake" one? Why not use a real MGF file?

Collaborator Author


I would still be generating fake InstaNovo predictions, so the results would likely not make much more sense. The alternative is to use a real mgf file and real InstaNovo predictions, but that has a greater burden of maintenance for me to keep updated or it carries a higher computational cost for the user.

Contributor


> The alternative is to use a real mgf file and real InstaNovo predictions, but that has a greater burden of maintenance for me

Agreed, but it is something you should only have to do once (and maybe rerun the predictions when a new version of InstaNovo changes the output format, but then you would have to adapt the code anyway). So this option still has my preference.

> it carries a higher computational cost for the user

If you provide the predictions, I don't see why it should carry a higher computational cost for the user.

Collaborator Author


I've changed the sample data to use 100 spectra from a HeLa dataset and generated real InstaNovo v1.2.0 predictions (without refinement) on them.

@JemmaLDaniel JemmaLDaniel changed the base branch from feat-allow-mgf-files-for-instanovo-data-loader to main April 2, 2026 09:58
@JemmaLDaniel JemmaLDaniel requested a review from BioGeek June 22, 2026 15:38

@BioGeek BioGeek left a comment

Contributor


A few places where you missed replacing ipc with mgf.

Also at docs/examples.md line 5.

Also, if you now run winnow compute-features or winnow diagnose-calibration with defaults, they still resolve dataset.spectrum_path_or_directory to examples/example_data/spectra.ipc, which no longer exists:

$ winnow compute-features 
[06/25/26 18:22:44] INFO     Starting compute-features pipeline.                                                                                              main.py:265
                    INFO     Compute-features configuration: {'dataset': {'spectrum_path_or_directory': 'examples/example_data/spectra.ipc',                  main.py:266
                             'predictions_path': 'examples/example_data/predictions.csv'}, 'dataset_output_path': 'results/metadata.csv',                                
                             'filter_empty_predictions': True, 'labelled': True, 'residue_masses': {'G': 57.021464, 'A': 71.037114, 'S': 87.032028, 'P':                 
                             97.052764, 'V': 99.068414, 'T': 101.04767, 'C': 103.009185, 'L': 113.084064, 'I': 113.084064, 'N': 114.042927, 'D': 115.026943,             
                             'Q': 128.058578, 'K': 128.094963, 'E': 129.042593, 'M': 131.040485, 'H': 137.058912, 'F': 147.068414, 'R': 156.101111, 'Y':                 
                             163.063329, 'W': 186.079313, 'M[UNIMOD:35]': 147.0354, 'C[UNIMOD:4]': 160.030649, 'N[UNIMOD:7]': 115.026943, 'Q[UNIMOD:7]':                 
                             129.042594, 'R[UNIMOD:7]': 157.085127, 'P[UNIMOD:35]': 113.047679, 'S[UNIMOD:21]': 166.998028, 'T[UNIMOD:21]': 181.01367,                   
                             'Y[UNIMOD:21]': 243.029329, 'C[UNIMOD:312]': 222.013284, 'E[UNIMOD:27]': 111.032028, 'Q[UNIMOD:28]': 111.032029, '[UNIMOD:1]':              
                             42.010565, '[UNIMOD:5]': 43.005814, '[UNIMOD:385]': -17.026549, '(+25.98)': 25.980265}, 'calibrator': {'_target_':                          
                             'winnow.calibration.calibrator.ProbabilityCalibrator', 'seed': 42, 'hidden_layer_sizes': [50, 50], 'learning_rate_init': 0.001,             
                             'alpha': 0.0001, 'max_iter': 1000, 'early_stopping': True, 'validation_fraction': 0.1, 'features': {'mass_error': {'_target_':              
                             'winnow.calibration.calibration_features.MassErrorDaFeature', 'residue_masses': '${residue_masses}'}, 'fragment_match_features':            
                             {'_target_': 'winnow.calibration.calibration_features.FragmentMatchFeatures', 'mz_tolerance': 0.02, 'learn_from_missing': False,            
                             'intensity_model_name': '${koina.intensity_model}', 'max_precursor_charge': '${koina.constraints.max_precursor_charge}',                    
                             'max_peptide_length': '${koina.constraints.max_peptide_length}', 'unsupported_residues':                                                    
                             '${koina.constraints.unsupported_residues}', 'model_input_constants': '${koina.input_constants}'}, 'retention_time_feature':                
                             {'_target_': 'winnow.calibration.calibration_features.RetentionTimeFeature', 'train_fraction': 0.1, 'min_train_points': 10,                 
                             'learn_from_missing': False, 'seed': 42, 'irt_model_name': '${koina.irt_model}', 'max_peptide_length':                                      
                             '${koina.constraints.max_peptide_length}', 'unsupported_residues': '${koina.constraints.unsupported_residues}'}}}, 'koina':                 
                             {'intensity_model': 'Prosit_2025_intensity_22PTM', 'irt_model': 'Prosit_2025_irt_22PTM', 'input_constants':                                 
                             {'collision_energies': 27, 'fragmentation_types': 'HCD'}, 'input_columns': {}, 'constraints': {'max_precursor_charge': 6,                   
                             'max_peptide_length': 30, 'unsupported_residues': ['[UNIMOD:5]', '[UNIMOD:385]', '(+25.98)']}}, 'data_loader': {'_target_':                 
                             'winnow.datasets.data_loaders.InstaNovoDatasetLoader', 'add_index_cols': False, 'residue_masses': '${residue_masses}',                      
                             'residue_remapping': {'M(ox)': 'M[UNIMOD:35]', 'M(+15.99)': 'M[UNIMOD:35]', 'S(p)': 'S[UNIMOD:21]', 'T(p)': 'T[UNIMOD:21]',                 
                             'Y(p)': 'Y[UNIMOD:21]', 'S(+79.97)': 'S[UNIMOD:21]', 'T(+79.97)': 'T[UNIMOD:21]', 'Y(+79.97)': 'Y[UNIMOD:21]', 'Q(+0.98)':                  
                             'Q[UNIMOD:7]', 'N(+0.98)': 'N[UNIMOD:7]', 'Q(+.98)': 'Q[UNIMOD:7]', 'N(+.98)': 'N[UNIMOD:7]', 'C(+57.02)': 'C[UNIMOD:4]',                   
                             '(+42.01)': '[UNIMOD:1]', '(+43.01)': '[UNIMOD:5]', '(-17.03)': '[UNIMOD:385]'}, 'column_mapping': {'predictions':                          
                             'predictions', 'predictions_tokenised': 'predictions_tokenised', 'log_probability': 'log_probs'}, 'beam_columns': {'sequence':              
                             'predictions_beam_', 'log_probability': 'predictions_log_probability_beam_', 'token_log_probabilities':                                     
                             'predictions_token_log_probabilities_beam_'}}}                                                                                              
                    INFO     Loading dataset.                                                                                                                 main.py:270
╭────────────────────────────────────────────────────────────────── Traceback (most recent call last) ──────────────────────────────────────────────────────────────────╮
│ /home/j-vangoey/code/winnow/winnow/scripts/main.py:625 in compute_features                                                                                            │
│                                                                                                                                                                       │
│   622 ) -> None:                                                                                                                                                      │
│   623 │   """Compute calibration features and save metadata CSV."""                                                                                                   │
│   624 │   overrides = ctx.args if ctx.args else None                                                                                                                  │
│ ❱ 625 │   compute_features_entry_point(overrides, config_dir=config_dir)                                                                                              │
│   626                                                                                                                                                                 │
│   627                                                                                                                                                                 │
│   628 @app.command(                                                                                                                                                   │
│                                                                                                                                                                       │
│ /home/j-vangoey/code/winnow/winnow/scripts/main.py:277 in compute_features_entry_point                                                                                │
│                                                                                                                                                                       │
│   274 │   dataset_params["data_path"] = dataset_params.pop("spectrum_path_or_directory")                                                                              │
│   275 │   dataset_params["predictions_path"] = dataset_params.pop("predictions_path", None)                                                                           │
│   276 │                                                                                                                                                               │
│ ❱ 277 │   dataset = data_loader.load(**dataset_params)                                                                                                                │
│   278 │                                                                                                                                                               │
│   279 │   logger.info(f"Loaded: {len(dataset.metadata)} spectra")                                                                                                     │
│   280                                                                                                                                                                 │
│                                                                                                                                                                       │
│ /home/j-vangoey/code/winnow/winnow/datasets/data_loaders/instanovo.py:168 in load                                                                                     │
│                                                                                                                                                                       │
│   165 │   │   if predictions_path is None:                                                                                                                            │
│   166 │   │   │   raise ValueError("predictions_path is required for InstaNovoDatasetLoader")                                                                         │
│   167 │   │                                                                                                                                                           │
│ ❱ 168 │   │   inputs, has_labels = self._load_spectrum_data(data_path)                                                                                                │
│   169 │   │   inputs = self._process_spectrum_data(inputs, has_labels)                                                                                                │
│   170 │   │                                                                                                                                                           │
│   171 │   │   # Load beam predictions only if beam_columns is configured                                                                                              │
│                                                                                                                                                                       │
│ /home/j-vangoey/code/winnow/winnow/datasets/data_loaders/instanovo.py:198 in _load_spectrum_data                                                                      │
│                                                                                                                                                                       │
│   195 │   │   Returns:                                                                                                                                                │
│   196 │   │   │   Tuple[pl.DataFrame, bool]: A tuple containing the spectrum data and a boolea                                                                        │
│   197 │   │   """                                                                                                                                                     │
│ ❱ 198 │   │   return utils.load_spectrum_data(                                                                                                                        │
│   199 │   │   │   spectrum_path, add_index_cols=self.add_index_cols                                                                                                   │
│   200 │   │   )                                                                                                                                                       │
│   201                                                                                                                                                                 │
│                                                                                                                                                                       │
│ /home/j-vangoey/code/winnow/winnow/datasets/data_loaders/utils.py:132 in load_spectrum_data                                                                           │
│                                                                                                                                                                       │
│   129 │   if spectrum_path.suffix == ".parquet":                                                                                                                      │
│   130 │   │   df = pl.read_parquet(spectrum_path)                                                                                                                     │
│   131 │   elif spectrum_path.suffix == ".ipc":                                                                                                                        │
│ ❱ 132 │   │   df = pl.read_ipc(spectrum_path)                                                                                                                         │
│   133 │   elif spectrum_path.suffix == ".mgf":                                                                                                                        │
│   134 │   │   from matchms.importing import load_from_mgf                                                                                                             │
│   135                                                                                                                                                                 │
│                                                                                                                                                                       │
│ /home/j-vangoey/code/winnow/.venv/lib/python3.12/site-packages/polars/_utils/deprecation.py:128 in wrapper                                                            │
│                                                                                                                                                                       │
│   125 │   │   │   _rename_keyword_argument(                                                                                                                           │
│   126 │   │   │   │   old_name, new_name, kwargs, function.__qualname__, version                                                                                      │
│   127 │   │   │   )                                                                                                                                                   │
│ ❱ 128 │   │   │   return function(*args, **kwargs)                                                                                                                    │
│   129 │   │                                                                                                                                                           │
│   130 │   │   wrapper.__signature__ = inspect.signature(function)  # type: ignore[attr-defined                                                                        │131 │   │   return wrapper                                                                                                                                          │
│                                                                                                                                                                       │
│ /home/j-vangoey/code/winnow/.venv/lib/python3.12/site-packages/polars/_utils/deprecation.py:128 in wrapper                                                            │
│                                                                                                                                                                       │
│   125 │   │   │   _rename_keyword_argument(                                                                                                                           │
│   126 │   │   │   │   old_name, new_name, kwargs, function.__qualname__, version                                                                                      │
│   127 │   │   │   )                                                                                                                                                   │
│ ❱ 128 │   │   │   return function(*args, **kwargs)                                                                                                                    │
│   129 │   │                                                                                                                                                           │
│   130 │   │   wrapper.__signature__ = inspect.signature(function)  # type: ignore[attr-defined                                                                        │131 │   │   return wrapper                                                                                                                                          │
│                                                                                                                                                                       │
│ /home/j-vangoey/code/winnow/.venv/lib/python3.12/site-packages/polars/io/ipc/functions.py:179 in read_ipc                                                             │
│                                                                                                                                                                       │
│   176 │   │   │   │   df = df.slice(0, n_rows)                                                                                                                        │
│   177 │   │   │   return df                                                                                                                                           │
│   178 │   │                                                                                                                                                           │
│ ❱ 179 │   │   return _read_ipc_impl(                                                                                                                                  │
│   180 │   │   │   data,                                                                                                                                               │
│   181 │   │   │   columns=columns,                                                                                                                                    │
│   182 │   │   │   n_rows=n_rows,                                                                                                                                      │
│                                                                                                                                                                       │
│ /home/j-vangoey/code/winnow/.venv/lib/python3.12/site-packages/polars/io/ipc/functions.py:226 in _read_ipc_impl                                                       │
│                                                                                                                                                                       │
│   223 │   │   return df                                                                                                                                               │
│   224 │                                                                                                                                                               │
│   225 │   projection, columns = parse_columns_arg(columns)                                                                                                            │
│ ❱ 226 │   pydf = PyDataFrame.read_ipc(                                                                                                                                │
│   227 │   │   source,                                                                                                                                                 │
│   228 │   │   columns,                                                                                                                                                │
│   229 │   │   projection,                                                                                                                                             │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
FileNotFoundError: No such file or directory (os error 2): examples/example_data/spectra.ipc
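
A leftover reference like this can be caught before running a command by scanning the resolved configuration for paths that still end in `.ipc`. The following is a minimal sketch, assuming the configuration is available as a plain nested dict shaped like the one logged above; `find_stale_paths` is a hypothetical helper, not part of Winnow.

```python
def find_stale_paths(config, suffix=".ipc", prefix=""):
    """Return (dotted_key, value) pairs for string values ending in `suffix`.

    Hypothetical helper: walks a nested dict and reports leftover references.
    """
    hits = []
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested sections such as 'dataset'
            hits.extend(find_stale_paths(value, suffix, dotted))
        elif isinstance(value, str) and value.endswith(suffix):
            hits.append((dotted, value))
    return hits


# Excerpt of the configuration from the log above
resolved = {
    "dataset": {
        "spectrum_path_or_directory": "examples/example_data/spectra.ipc",
        "predictions_path": "examples/example_data/predictions.csv",
    },
    "dataset_output_path": "results/metadata.csv",
}

stale = find_stale_paths(resolved)
for dotted_key, value in stale:
    print(f"{dotted_key} still points at {value}")
```

The same check could run as a unit test over the default configs, so that a missed rename fails in CI rather than at the command line.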

Comment thread docs/configuration.md Outdated
Comment thread docs/cli.md Outdated
Comment thread docs/cli.md Outdated
Co-authored-by: Jeroen Van Goey <j.vangoey@instadeep.com>

Update docs/cli.md

Co-authored-by: Jeroen Van Goey <j.vangoey@instadeep.com>

Update docs/cli.md

Co-authored-by: Jeroen Van Goey <j.vangoey@instadeep.com>

chore: update ipc references to MGF reference
@JemmaLDaniel JemmaLDaniel force-pushed the feat-use-mgf-for-sample-spectra branch from 6422b10 to f15d0e0 Compare June 26, 2026 15:31

JemmaLDaniel commented Jun 26, 2026

Collaborator Author

I accepted your changes and squashed them into a single docs commit. The train and compute-features commands should now work out of the box; predict and diagnose-calibration will still fail when loading the general model until we merge #190.

@JemmaLDaniel JemmaLDaniel requested a review from BioGeek June 26, 2026 15:36

@BioGeek BioGeek left a comment

Contributor


Looking good now!

@JemmaLDaniel JemmaLDaniel merged commit 5927596 into main Jun 27, 2026
7 checks passed
@JemmaLDaniel JemmaLDaniel deleted the feat-use-mgf-for-sample-spectra branch June 27, 2026 12:56

2 participants