Feat use mgf for sample spectra by JemmaLDaniel · Pull Request #173 · instadeepai/winnow

JemmaLDaniel · 2026-03-31T08:21:16Z

Summary

Point default train / predict configs and configuration docs at spectra.mgf for InstaNovo-style example data, so the sample pipeline uses the MGF sample files end-to-end.

Motivation

Using MGF for the Winnow example gives a realistic MS2 spectrum file that people already know from mass-spec workflows, and it is plain text and human-readable, which makes debugging and learning from the sample much easier than with opaque columnar formats like parquet.
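To illustrate the "plain text and human-readable" point, here is a minimal sketch of reading an MGF block with only the Python standard library. The sample spectrum and the `iter_mgf` helper are invented for illustration and are not part of Winnow, whose loader relies on a dedicated MGF reader; real MGF files can carry more header fields than shown here.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical two-peak spectrum, invented for illustration only.
SAMPLE_MGF = """\
BEGIN IONS
TITLE=example_spectrum_0
PEPMASS=500.25
CHARGE=2+
100.0 10.0
200.5 25.0
END IONS
"""


def iter_mgf(text: str) -> Iterator[Tuple[Dict[str, str], List[Tuple[float, float]]]]:
    """Yield (header fields, peak list) for each BEGIN IONS / END IONS block."""
    headers: Dict[str, str] = {}
    peaks: List[Tuple[float, float]] = []
    inside = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            # Start a fresh spectrum.
            headers, peaks, inside = {}, [], True
        elif line == "END IONS":
            if inside:
                yield headers, peaks
            inside = False
        elif inside and "=" in line:
            # Header lines look like KEY=value.
            key, value = line.split("=", 1)
            headers[key] = value
        elif inside:
            # Peak lines are whitespace-separated "m/z intensity" pairs.
            mz, intensity = line.split()[:2]
            peaks.append((float(mz), float(intensity)))


spectra = list(iter_mgf(SAMPLE_MGF))
for headers, peaks in spectra:
    print(headers["TITLE"], headers["PEPMASS"], len(peaks))
```

Because every field is visible as text, a malformed spectrum can be spotted by opening the file in an editor, which is not possible with a binary columnar file.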

github-actions · 2026-03-31T08:22:47Z

Coverage Report

File	Stmts	Miss	Cover	Missing
__init__.py	0	0	100%
data_types.py	4	0	100%
calibration
__init__.py	0	0	100%
calibration_features.py	5	0	100%
calibrator.py	90	15	83%	69–70, 72, 106–109, 134–135, 137, 162–163, 167, 194–195
diagnostics.py	168	50	70%	70, 96, 101, 111, 115, 137, 146, 203–218, 261–262, 266, 307, 309–324, 335–341
calibration/features
__init__.py	10	0	100%
base.py	8	0	100%
beam.py	47	0	100%
chimeric.py	78	1	98%	204
constants.py	4	0	100%
fragment_match.py	74	1	98%	194
mass_error.py	67	2	97%	16, 20
retention_time.py	135	9	93%	183, 190, 206, 257–259, 269, 272–273
sequence.py	19	0	100%
token_score.py	37	1	97%	82
utils.py	135	2	98%	35, 233
compat
__init__.py	0	0	100%
instanovo.py	10	6	40%	12, 14–15, 17, 24–25
datasets
__init__.py	0	0	100%
calibration_dataset.py	109	17	84%	155, 169, 171, 173, 183, 196, 249, 251–252, 258–261, 263–266
interfaces.py	3	0	100%
psm_dataset.py	25	0	100%
datasets/data_loaders
__init__.py	5	0	100%
instanovo.py	119	19	84%	90, 93, 119, 142, 168–169, 172–174, 176–177, 179, 182–183, 185, 343–345, 356
mztab.py	215	55	74%	103, 106, 157, 161, 210–211, 223, 236–240, 287, 290, 302–303, 315–317, 319–320, 322, 324, 330, 334–336, 338–339, 343–346, 350, 514–515, 518, 521, 528, 542–546, 550–555, 561, 570–571, 573, 599
pointnovo.py	7	0	100%
utils.py	59	1	98%	11
winnow.py	39	4	89%	54–55, 91–92
fdr
__init__.py	0	0	100%
base.py	58	15	74%	81, 85–86, 91, 98–99, 105, 126, 129–130, 135, 137–138, 144, 186
database_grounded.py	28	1	96%	52
nonparametric.py	25	4	84%	62, 68–69, 72
scripts
__init__.py	0	0	100%
main.py	256	256	0%	8, 10–13, 16–20, 23–24, 26–28, 32, 39, 44, 47, 53, 55–56, 59, 68, 76, 79, 86, 88–90, 92, 94–99, 102, 104–105, 110, 125, 128, 135–141, 144–145, 148, 161–163, 166, 169, 174, 176–178, 180, 182–183, 186–187, 190, 192–193, 195, 197, 199–200, 202, 205–206, 209–210, 213–214, 217–219, 221–224, 227–229, 231, 234, 248–250, 252, 254, 259, 261–263, 265–266, 268, 270–271, 273–275, 277, 279, 281–282, 286–289, 291–292, 294–295, 297–298, 300, 303, 317–319, 322, 325, 330, 332–334, 336–338, 340–341, 344–345, 348, 350–351, 353, 355, 357–358, 360, 363–364, 370–372, 374–377, 380–381, 384–385, 388–389, 392–393, 401–403, 407, 410, 414, 417, 423–425, 427–428, 435–436, 438, 440, 445, 447–449, 451–452, 455, 457–458, 460–463, 465–466, 468–469, 471–473, 479–480, 484–485, 488, 495, 500–501, 506–508, 511, 516, 526, 533, 535, 539, 541–542, 546–547, 550, 573, 586–587, 590, 612, 624–625, 628, 653, 666–667, 670, 685, 697–698, 701, 716, 728–729, 732, 744, 756–757, 760, 775, 787–788, 791, 800, 812–813
utils
__init__.py	4	0	100%
config_formatter.py	53	40	24%	29, 37–38, 40–42, 44, 55, 58–60, 62–63, 66–69, 72–74, 77–78, 80, 91, 102, 113, 127–128, 130–132, 145–147, 150, 153–154, 157–158, 160
config_path.py	76	5	93%	24–26, 117–118
peptide.py	16	0	100%
TOTAL	1988	504	74%

Tests	Skipped	Failures	Errors	Time
378	0 💤	0 ❌	0 🔥	42.510s ⏱️

BioGeek · 2026-04-01T20:21:03Z

If you're using a spectra.mgf as example data, why create a "fake" one? Why not use a real MGF file?

JemmaLDaniel · 2026-04-02

I would still be generating fake InstaNovo predictions, so the results would likely not make much more sense. The alternative is to use a real mgf file and real InstaNovo predictions, but that has a greater burden of maintenance for me to keep updated or it carries a higher computational cost for the user.

BioGeek · 2026-04-08

> The alternative is to use a real mgf file and real InstaNovo predictions, but that has a greater burden of maintenance for me

Agreed, but it is something you should only have to do once (and maybe rerun the predictions when a new version of InstaNovo changes the output format, but then you would have to adapt the code anyway). So this option still has my preference.

> it carries a higher computational cost for the user

If you provide the predictions, I don't see why it should carry a higher computational cost for the user.

JemmaLDaniel · 2026-06-22

I've changed the sample data to use 100 spectra from a HeLa dataset and generated real InstaNovo v1.2.0 predictions (without refinement) on them.

BioGeek

A few places where you missed replacing ipc with mgf.

Also at docs/examples.md line 5.

Also, if you now run winnow compute-features or winnow diagnose-calibration with defaults, they still resolve dataset.spectrum_path_or_directory to examples/example_data/spectra.ipc, which no longer exists:

$ winnow compute-features 
[06/25/26 18:22:44] INFO     Starting compute-features pipeline.                                                                                              main.py:265
                    INFO     Compute-features configuration: {'dataset': {'spectrum_path_or_directory': 'examples/example_data/spectra.ipc',                  main.py:266
                             'predictions_path': 'examples/example_data/predictions.csv'}, 'dataset_output_path': 'results/metadata.csv',                                
                             'filter_empty_predictions': True, 'labelled': True, 'residue_masses': {'G': 57.021464, 'A': 71.037114, 'S': 87.032028, 'P':                 
                             97.052764, 'V': 99.068414, 'T': 101.04767, 'C': 103.009185, 'L': 113.084064, 'I': 113.084064, 'N': 114.042927, 'D': 115.026943,             
                             'Q': 128.058578, 'K': 128.094963, 'E': 129.042593, 'M': 131.040485, 'H': 137.058912, 'F': 147.068414, 'R': 156.101111, 'Y':                 
                             163.063329, 'W': 186.079313, 'M[UNIMOD:35]': 147.0354, 'C[UNIMOD:4]': 160.030649, 'N[UNIMOD:7]': 115.026943, 'Q[UNIMOD:7]':                 
                             129.042594, 'R[UNIMOD:7]': 157.085127, 'P[UNIMOD:35]': 113.047679, 'S[UNIMOD:21]': 166.998028, 'T[UNIMOD:21]': 181.01367,                   
                             'Y[UNIMOD:21]': 243.029329, 'C[UNIMOD:312]': 222.013284, 'E[UNIMOD:27]': 111.032028, 'Q[UNIMOD:28]': 111.032029, '[UNIMOD:1]':              
                             42.010565, '[UNIMOD:5]': 43.005814, '[UNIMOD:385]': -17.026549, '(+25.98)': 25.980265}, 'calibrator': {'_target_':                          
                             'winnow.calibration.calibrator.ProbabilityCalibrator', 'seed': 42, 'hidden_layer_sizes': [50, 50], 'learning_rate_init': 0.001,             
                             'alpha': 0.0001, 'max_iter': 1000, 'early_stopping': True, 'validation_fraction': 0.1, 'features': {'mass_error': {'_target_':              
                             'winnow.calibration.calibration_features.MassErrorDaFeature', 'residue_masses': '${residue_masses}'}, 'fragment_match_features':            
                             {'_target_': 'winnow.calibration.calibration_features.FragmentMatchFeatures', 'mz_tolerance': 0.02, 'learn_from_missing': False,            
                             'intensity_model_name': '${koina.intensity_model}', 'max_precursor_charge': '${koina.constraints.max_precursor_charge}',                    
                             'max_peptide_length': '${koina.constraints.max_peptide_length}', 'unsupported_residues':                                                    
                             '${koina.constraints.unsupported_residues}', 'model_input_constants': '${koina.input_constants}'}, 'retention_time_feature':                
                             {'_target_': 'winnow.calibration.calibration_features.RetentionTimeFeature', 'train_fraction': 0.1, 'min_train_points': 10,                 
                             'learn_from_missing': False, 'seed': 42, 'irt_model_name': '${koina.irt_model}', 'max_peptide_length':                                      
                             '${koina.constraints.max_peptide_length}', 'unsupported_residues': '${koina.constraints.unsupported_residues}'}}}, 'koina':                 
                             {'intensity_model': 'Prosit_2025_intensity_22PTM', 'irt_model': 'Prosit_2025_irt_22PTM', 'input_constants':                                 
                             {'collision_energies': 27, 'fragmentation_types': 'HCD'}, 'input_columns': {}, 'constraints': {'max_precursor_charge': 6,                   
                             'max_peptide_length': 30, 'unsupported_residues': ['[UNIMOD:5]', '[UNIMOD:385]', '(+25.98)']}}, 'data_loader': {'_target_':                 
                             'winnow.datasets.data_loaders.InstaNovoDatasetLoader', 'add_index_cols': False, 'residue_masses': '${residue_masses}',                      
                             'residue_remapping': {'M(ox)': 'M[UNIMOD:35]', 'M(+15.99)': 'M[UNIMOD:35]', 'S(p)': 'S[UNIMOD:21]', 'T(p)': 'T[UNIMOD:21]',                 
                             'Y(p)': 'Y[UNIMOD:21]', 'S(+79.97)': 'S[UNIMOD:21]', 'T(+79.97)': 'T[UNIMOD:21]', 'Y(+79.97)': 'Y[UNIMOD:21]', 'Q(+0.98)':                  
                             'Q[UNIMOD:7]', 'N(+0.98)': 'N[UNIMOD:7]', 'Q(+.98)': 'Q[UNIMOD:7]', 'N(+.98)': 'N[UNIMOD:7]', 'C(+57.02)': 'C[UNIMOD:4]',                   
                             '(+42.01)': '[UNIMOD:1]', '(+43.01)': '[UNIMOD:5]', '(-17.03)': '[UNIMOD:385]'}, 'column_mapping': {'predictions':                          
                             'predictions', 'predictions_tokenised': 'predictions_tokenised', 'log_probability': 'log_probs'}, 'beam_columns': {'sequence':              
                             'predictions_beam_', 'log_probability': 'predictions_log_probability_beam_', 'token_log_probabilities':                                     
                             'predictions_token_log_probabilities_beam_'}}}                                                                                              
                    INFO     Loading dataset.                                                                                                                 main.py:270
╭────────────────────────────────────────────────────────────────── Traceback (most recent call last) ──────────────────────────────────────────────────────────────────╮
│ /home/j-vangoey/code/winnow/winnow/scripts/main.py:625 in compute_features                                                                                            │
│                                                                                                                                                                       │
│   622 ) -> None:                                                                                                                                                      │
│   623 │   """Compute calibration features and save metadata CSV."""                                                                                                   │
│   624 │   overrides = ctx.args if ctx.args else None                                                                                                                  │
│ ❱ 625 │   compute_features_entry_point(overrides, config_dir=config_dir)                                                                                              │
│   626                                                                                                                                                                 │
│   627                                                                                                                                                                 │
│   628 @app.command(                                                                                                                                                   │
│                                                                                                                                                                       │
│ /home/j-vangoey/code/winnow/winnow/scripts/main.py:277 in compute_features_entry_point                                                                                │
│                                                                                                                                                                       │
│   274 │   dataset_params["data_path"] = dataset_params.pop("spectrum_path_or_directory")                                                                              │
│   275 │   dataset_params["predictions_path"] = dataset_params.pop("predictions_path", None)                                                                           │
│   276 │                                                                                                                                                               │
│ ❱ 277 │   dataset = data_loader.load(**dataset_params)                                                                                                                │
│   278 │                                                                                                                                                               │
│   279 │   logger.info(f"Loaded: {len(dataset.metadata)} spectra")                                                                                                     │
│   280                                                                                                                                                                 │
│                                                                                                                                                                       │
│ /home/j-vangoey/code/winnow/winnow/datasets/data_loaders/instanovo.py:168 in load                                                                                     │
│                                                                                                                                                                       │
│   165 │   │   if predictions_path is None:                                                                                                                            │
│   166 │   │   │   raise ValueError("predictions_path is required for InstaNovoDatasetLoader")                                                                         │
│   167 │   │                                                                                                                                                           │
│ ❱ 168 │   │   inputs, has_labels = self._load_spectrum_data(data_path)                                                                                                │
│   169 │   │   inputs = self._process_spectrum_data(inputs, has_labels)                                                                                                │
│   170 │   │                                                                                                                                                           │
│   171 │   │   # Load beam predictions only if beam_columns is configured                                                                                              │
│                                                                                                                                                                       │
│ /home/j-vangoey/code/winnow/winnow/datasets/data_loaders/instanovo.py:198 in _load_spectrum_data                                                                      │
│                                                                                                                                                                       │
│   195 │   │   Returns:                                                                                                                                                │
│   196 │   │   │   Tuple[pl.DataFrame, bool]: A tuple containing the spectrum data and a boolea                                                                        │
│   197 │   │   """                                                                                                                                                     │
│ ❱ 198 │   │   return utils.load_spectrum_data(                                                                                                                        │
│   199 │   │   │   spectrum_path, add_index_cols=self.add_index_cols                                                                                                   │
│   200 │   │   )                                                                                                                                                       │
│   201                                                                                                                                                                 │
│                                                                                                                                                                       │
│ /home/j-vangoey/code/winnow/winnow/datasets/data_loaders/utils.py:132 in load_spectrum_data                                                                           │
│                                                                                                                                                                       │
│   129 │   if spectrum_path.suffix == ".parquet":                                                                                                                      │
│   130 │   │   df = pl.read_parquet(spectrum_path)                                                                                                                     │
│   131 │   elif spectrum_path.suffix == ".ipc":                                                                                                                        │
│ ❱ 132 │   │   df = pl.read_ipc(spectrum_path)                                                                                                                         │
│   133 │   elif spectrum_path.suffix == ".mgf":                                                                                                                        │
│   134 │   │   from matchms.importing import load_from_mgf                                                                                                             │
│   135                                                                                                                                                                 │
│                                                                                                                                                                       │
│ /home/j-vangoey/code/winnow/.venv/lib/python3.12/site-packages/polars/_utils/deprecation.py:128 in wrapper                                                            │
│                                                                                                                                                                       │
│   125 │   │   │   _rename_keyword_argument(                                                                                                                           │
│   126 │   │   │   │   old_name, new_name, kwargs, function.__qualname__, version                                                                                      │
│   127 │   │   │   )                                                                                                                                                   │
│ ❱ 128 │   │   │   return function(*args, **kwargs)                                                                                                                    │
│   129 │   │                                                                                                                                                           │
│   130 │   │   wrapper.__signature__ = inspect.signature(function)  # type: ignore[attr-defined                                                                        │
│   131 │   │   return wrapper                                                                                                                                          │
│                                                                                                                                                                       │
│ /home/j-vangoey/code/winnow/.venv/lib/python3.12/site-packages/polars/_utils/deprecation.py:128 in wrapper                                                            │
│                                                                                                                                                                       │
│   125 │   │   │   _rename_keyword_argument(                                                                                                                           │
│   126 │   │   │   │   old_name, new_name, kwargs, function.__qualname__, version                                                                                      │
│   127 │   │   │   )                                                                                                                                                   │
│ ❱ 128 │   │   │   return function(*args, **kwargs)                                                                                                                    │
│   129 │   │                                                                                                                                                           │
│   130 │   │   wrapper.__signature__ = inspect.signature(function)  # type: ignore[attr-defined                                                                        │
│   131 │   │   return wrapper                                                                                                                                          │
│                                                                                                                                                                       │
│ /home/j-vangoey/code/winnow/.venv/lib/python3.12/site-packages/polars/io/ipc/functions.py:179 in read_ipc                                                             │
│                                                                                                                                                                       │
│   176 │   │   │   │   df = df.slice(0, n_rows)                                                                                                                        │
│   177 │   │   │   return df                                                                                                                                           │
│   178 │   │                                                                                                                                                           │
│ ❱ 179 │   │   return _read_ipc_impl(                                                                                                                                  │
│   180 │   │   │   data,                                                                                                                                               │
│   181 │   │   │   columns=columns,                                                                                                                                    │
│   182 │   │   │   n_rows=n_rows,                                                                                                                                      │
│                                                                                                                                                                       │
│ /home/j-vangoey/code/winnow/.venv/lib/python3.12/site-packages/polars/io/ipc/functions.py:226 in _read_ipc_impl                                                       │
│                                                                                                                                                                       │
│   223 │   │   return df                                                                                                                                               │
│   224 │                                                                                                                                                               │
│   225 │   projection, columns = parse_columns_arg(columns)                                                                                                            │
│ ❱ 226 │   pydf = PyDataFrame.read_ipc(                                                                                                                                │
│   227 │   │   source,                                                                                                                                                 │
│   228 │   │   columns,                                                                                                                                                │
│   229 │   │   projection,                                                                                                                                             │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
FileNotFoundError: No such file or directory (os error 2): examples/example_data/spectra.ipc
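
The failure above only surfaces deep inside the polars reader. As a rough sketch, a pre-flight check on the configured path could report the problem in terms of the config key instead. The `check_spectrum_path` helper and `KNOWN_SUFFIXES` set are hypothetical and not part of Winnow; the suffix list is taken only from the branches visible in the traceback, and the real loader may accept more.

```python
from pathlib import Path

# Suffixes visible in the traceback's dispatch (assumption: the real loader may support others).
KNOWN_SUFFIXES = {".parquet", ".ipc", ".mgf"}


def check_spectrum_path(path_str: str) -> Path:
    """Validate a spectrum path before handing it to a loader (hypothetical helper)."""
    path = Path(path_str)
    if not path.exists():
        raise FileNotFoundError(
            f"Spectrum path does not exist: {path}. "
            "Check dataset.spectrum_path_or_directory in the active config."
        )
    # Directories are passed through; only single files are checked by suffix.
    if path.is_file() and path.suffix not in KNOWN_SUFFIXES:
        raise ValueError(
            f"Unrecognised spectrum file type '{path.suffix}' for {path}; "
            f"expected one of {sorted(KNOWN_SUFFIXES)}."
        )
    return path


# Usage: a stale default such as the .ipc path is reported up front.
try:
    check_spectrum_path("examples/example_data/spectra.ipc")
except FileNotFoundError as err:
    print(err)
```

A check like this does not fix stale defaults, but it would point directly at the config key that needs updating.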

Co-authored-by: Jeroen Van Goey <j.vangoey@instadeep.com>
Update docs/cli.md
Co-authored-by: Jeroen Van Goey <j.vangoey@instadeep.com>
Update docs/cli.md
Co-authored-by: Jeroen Van Goey <j.vangoey@instadeep.com>
chore: update ipc references to MGF reference

JemmaLDaniel · 2026-06-26T15:36:17Z

I accepted your changes and squashed them into a single docs commit. The train and compute-features commands should now work out of the box; predict and diagnose-calibration will (still) fail on loading the general model until we merge #190.

BioGeek

Looking good now!

JemmaLDaniel added 3 commits March 30, 2026 19:25

feat: randomly generate spectra in mgf format

8430423

chore: update sample data to mgf format

a0c300b

docs: update config spectra file type to mgf

b72f918

JemmaLDaniel requested a review from BioGeek March 31, 2026 08:21

JemmaLDaniel self-assigned this Mar 31, 2026

JemmaLDaniel added the enhancement New feature or request label Mar 31, 2026

BioGeek reviewed Apr 1, 2026

BioGeek mentioned this pull request Apr 1, 2026

Feat: compute-features command #171

Merged

JemmaLDaniel changed the base branch from feat-allow-mgf-files-for-instanovo-data-loader to main April 2, 2026 09:58

JemmaLDaniel added 5 commits April 2, 2026 11:00

Merge branch 'main' into feat-use-mgf-for-sample-spectra

dffbd6b

chore: update gitignore with mgf example data type

4b820b0

chore: update example datasets to either parquet or mgf

11da31e

Merge branch 'main' into feat-use-mgf-for-sample-spectra

f3acdb8

feat: use real HeLa spectra and InstaNovo predictions as sample data

fc74073

JemmaLDaniel requested a review from BioGeek June 22, 2026 15:38

BioGeek requested changes Jun 25, 2026

Comment thread docs/configuration.md Outdated

Comment thread docs/cli.md Outdated

Comment thread docs/cli.md Outdated

JemmaLDaniel added 2 commits June 26, 2026 16:31

chore: update ipc spectra file to mgf

f15d0e0

JemmaLDaniel force-pushed the feat-use-mgf-for-sample-spectra branch from 6422b10 to f15d0e0 Compare June 26, 2026 15:31

chore: reduce minimum iRT regression training points

0fbe9ff

JemmaLDaniel requested a review from BioGeek June 26, 2026 15:36

BioGeek approved these changes Jun 26, 2026

JemmaLDaniel merged commit 5927596 into main Jun 27, 2026
7 checks passed

JemmaLDaniel deleted the feat-use-mgf-for-sample-spectra branch June 27, 2026 12:56

JemmaLDaniel merged 11 commits into main from feat-use-mgf-for-sample-spectra
