Feat use mgf for sample spectra #173
If you're using a spectra.mgf as example data, why create a "fake" one? Why not use a real MGF file?
I would still be generating fake InstaNovo predictions, so the results would likely not make much more sense. The alternative is to use a real mgf file and real InstaNovo predictions, but that has a greater burden of maintenance for me to keep updated or it carries a higher computational cost for the user.
> The alternative is to use a real mgf file and real InstaNovo predictions, but that has a greater burden of maintenance for me
Agreed, but it is something you should only have to do once (and maybe rerun the predictions when a new version of InstaNovo changes the output format, but then you would have to adapt the code anyway). So this option still has my preference.
> it carries a higher computational cost for the user
If you provide the predictions, I don't see why it should carry a higher computational cost for the user.
I've changed the sample data to use 100 spectra from a HeLa dataset and generated real InstaNovo v1.2.0 predictions (without refinement) on them.
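A subset like this can be cut from a larger MGF file with a few lines of standard-library Python. The sketch below is only an illustration of one way to do it, not necessarily how the sample in this PR was produced; the function name `take_first_spectra` and the file names in the usage comment are hypothetical.

```python
from typing import Iterable, Iterator


def take_first_spectra(lines: Iterable[str], n: int) -> Iterator[str]:
    """Yield the lines of the first `n` BEGIN IONS ... END IONS blocks."""
    kept = 0
    inside = False
    for line in lines:
        stripped = line.strip()
        if stripped == "BEGIN IONS":
            if kept >= n:
                break  # enough spectra collected, stop reading
            inside = True
        if inside:
            yield line
        if stripped == "END IONS" and inside:
            inside = False
            kept += 1


# Tiny made-up input with three spectra (values are illustrative only).
EXAMPLE = """\
BEGIN IONS
TITLE=a
PEPMASS=500.0
CHARGE=2+
100.0 1.0
END IONS
BEGIN IONS
TITLE=b
PEPMASS=600.0
CHARGE=2+
200.0 2.0
END IONS
BEGIN IONS
TITLE=c
PEPMASS=700.0
CHARGE=3+
300.0 3.0
END IONS
"""

subset = "".join(take_first_spectra(EXAMPLE.splitlines(keepends=True), 2))

# With real files (hypothetical names) the same helper streams line by line:
# with open("full.mgf") as src, open("subset.mgf", "w") as dst:
#     dst.writelines(take_first_spectra(src, 100))
```

Because the helper works on an iterator of lines, it never loads the whole source file into memory, which matters for full-size acquisition files.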
A few places where you missed replacing ipc with mgf.
Also at docs/examples.md line 5.
Also, if you now run winnow compute-features or winnow diagnose-calibration with defaults, they still resolve dataset.spectrum_path_or_directory to examples/example_data/spectra.ipc and fail with a FileNotFoundError:
$ winnow compute-features
[06/25/26 18:22:44] INFO Starting compute-features pipeline. main.py:265
INFO Compute-features configuration: {'dataset': {'spectrum_path_or_directory': 'examples/example_data/spectra.ipc', main.py:266
'predictions_path': 'examples/example_data/predictions.csv'}, 'dataset_output_path': 'results/metadata.csv',
'filter_empty_predictions': True, 'labelled': True, 'residue_masses': {'G': 57.021464, 'A': 71.037114, 'S': 87.032028, 'P':
97.052764, 'V': 99.068414, 'T': 101.04767, 'C': 103.009185, 'L': 113.084064, 'I': 113.084064, 'N': 114.042927, 'D': 115.026943,
'Q': 128.058578, 'K': 128.094963, 'E': 129.042593, 'M': 131.040485, 'H': 137.058912, 'F': 147.068414, 'R': 156.101111, 'Y':
163.063329, 'W': 186.079313, 'M[UNIMOD:35]': 147.0354, 'C[UNIMOD:4]': 160.030649, 'N[UNIMOD:7]': 115.026943, 'Q[UNIMOD:7]':
129.042594, 'R[UNIMOD:7]': 157.085127, 'P[UNIMOD:35]': 113.047679, 'S[UNIMOD:21]': 166.998028, 'T[UNIMOD:21]': 181.01367,
'Y[UNIMOD:21]': 243.029329, 'C[UNIMOD:312]': 222.013284, 'E[UNIMOD:27]': 111.032028, 'Q[UNIMOD:28]': 111.032029, '[UNIMOD:1]':
42.010565, '[UNIMOD:5]': 43.005814, '[UNIMOD:385]': -17.026549, '(+25.98)': 25.980265}, 'calibrator': {'_target_':
'winnow.calibration.calibrator.ProbabilityCalibrator', 'seed': 42, 'hidden_layer_sizes': [50, 50], 'learning_rate_init': 0.001,
'alpha': 0.0001, 'max_iter': 1000, 'early_stopping': True, 'validation_fraction': 0.1, 'features': {'mass_error': {'_target_':
'winnow.calibration.calibration_features.MassErrorDaFeature', 'residue_masses': '${residue_masses}'}, 'fragment_match_features':
{'_target_': 'winnow.calibration.calibration_features.FragmentMatchFeatures', 'mz_tolerance': 0.02, 'learn_from_missing': False,
'intensity_model_name': '${koina.intensity_model}', 'max_precursor_charge': '${koina.constraints.max_precursor_charge}',
'max_peptide_length': '${koina.constraints.max_peptide_length}', 'unsupported_residues':
'${koina.constraints.unsupported_residues}', 'model_input_constants': '${koina.input_constants}'}, 'retention_time_feature':
{'_target_': 'winnow.calibration.calibration_features.RetentionTimeFeature', 'train_fraction': 0.1, 'min_train_points': 10,
'learn_from_missing': False, 'seed': 42, 'irt_model_name': '${koina.irt_model}', 'max_peptide_length':
'${koina.constraints.max_peptide_length}', 'unsupported_residues': '${koina.constraints.unsupported_residues}'}}}, 'koina':
{'intensity_model': 'Prosit_2025_intensity_22PTM', 'irt_model': 'Prosit_2025_irt_22PTM', 'input_constants':
{'collision_energies': 27, 'fragmentation_types': 'HCD'}, 'input_columns': {}, 'constraints': {'max_precursor_charge': 6,
'max_peptide_length': 30, 'unsupported_residues': ['[UNIMOD:5]', '[UNIMOD:385]', '(+25.98)']}}, 'data_loader': {'_target_':
'winnow.datasets.data_loaders.InstaNovoDatasetLoader', 'add_index_cols': False, 'residue_masses': '${residue_masses}',
'residue_remapping': {'M(ox)': 'M[UNIMOD:35]', 'M(+15.99)': 'M[UNIMOD:35]', 'S(p)': 'S[UNIMOD:21]', 'T(p)': 'T[UNIMOD:21]',
'Y(p)': 'Y[UNIMOD:21]', 'S(+79.97)': 'S[UNIMOD:21]', 'T(+79.97)': 'T[UNIMOD:21]', 'Y(+79.97)': 'Y[UNIMOD:21]', 'Q(+0.98)':
'Q[UNIMOD:7]', 'N(+0.98)': 'N[UNIMOD:7]', 'Q(+.98)': 'Q[UNIMOD:7]', 'N(+.98)': 'N[UNIMOD:7]', 'C(+57.02)': 'C[UNIMOD:4]',
'(+42.01)': '[UNIMOD:1]', '(+43.01)': '[UNIMOD:5]', '(-17.03)': '[UNIMOD:385]'}, 'column_mapping': {'predictions':
'predictions', 'predictions_tokenised': 'predictions_tokenised', 'log_probability': 'log_probs'}, 'beam_columns': {'sequence':
'predictions_beam_', 'log_probability': 'predictions_log_probability_beam_', 'token_log_probabilities':
'predictions_token_log_probabilities_beam_'}}}
INFO Loading dataset. main.py:270
╭────────────────────────────────────────────────────────────────── Traceback (most recent call last) ──────────────────────────────────────────────────────────────────╮
│ /home/j-vangoey/code/winnow/winnow/scripts/main.py:625 in compute_features │
│ │
│ 622 ) -> None: │
│ 623 │ """Compute calibration features and save metadata CSV.""" │
│ 624 │ overrides = ctx.args if ctx.args else None │
│ ❱ 625 │ compute_features_entry_point(overrides, config_dir=config_dir) │
│ 626 │
│ 627 │
│ 628 @app.command( │
│ │
│ /home/j-vangoey/code/winnow/winnow/scripts/main.py:277 in compute_features_entry_point │
│ │
│ 274 │ dataset_params["data_path"] = dataset_params.pop("spectrum_path_or_directory") │
│ 275 │ dataset_params["predictions_path"] = dataset_params.pop("predictions_path", None) │
│ 276 │ │
│ ❱ 277 │ dataset = data_loader.load(**dataset_params) │
│ 278 │ │
│ 279 │ logger.info(f"Loaded: {len(dataset.metadata)} spectra") │
│ 280 │
│ │
│ /home/j-vangoey/code/winnow/winnow/datasets/data_loaders/instanovo.py:168 in load │
│ │
│ 165 │ │ if predictions_path is None: │
│ 166 │ │ │ raise ValueError("predictions_path is required for InstaNovoDatasetLoader") │
│ 167 │ │ │
│ ❱ 168 │ │ inputs, has_labels = self._load_spectrum_data(data_path) │
│ 169 │ │ inputs = self._process_spectrum_data(inputs, has_labels) │
│ 170 │ │ │
│ 171 │ │ # Load beam predictions only if beam_columns is configured │
│ │
│ /home/j-vangoey/code/winnow/winnow/datasets/data_loaders/instanovo.py:198 in _load_spectrum_data │
│ │
│ 195 │ │ Returns: │
│ 196 │ │ │ Tuple[pl.DataFrame, bool]: A tuple containing the spectrum data and a boolea │
│ 197 │ │ """ │
│ ❱ 198 │ │ return utils.load_spectrum_data( │
│ 199 │ │ │ spectrum_path, add_index_cols=self.add_index_cols │
│ 200 │ │ ) │
│ 201 │
│ │
│ /home/j-vangoey/code/winnow/winnow/datasets/data_loaders/utils.py:132 in load_spectrum_data │
│ │
│ 129 │ if spectrum_path.suffix == ".parquet": │
│ 130 │ │ df = pl.read_parquet(spectrum_path) │
│ 131 │ elif spectrum_path.suffix == ".ipc": │
│ ❱ 132 │ │ df = pl.read_ipc(spectrum_path) │
│ 133 │ elif spectrum_path.suffix == ".mgf": │
│ 134 │ │ from matchms.importing import load_from_mgf │
│ 135 │
│ │
│ /home/j-vangoey/code/winnow/.venv/lib/python3.12/site-packages/polars/_utils/deprecation.py:128 in wrapper │
│ │
│ 125 │ │ │ _rename_keyword_argument( │
│ 126 │ │ │ │ old_name, new_name, kwargs, function.__qualname__, version │
│ 127 │ │ │ ) │
│ ❱ 128 │ │ │ return function(*args, **kwargs) │
│ 129 │ │ │
│ 130 │ │ wrapper.__signature__ = inspect.signature(function) # type: ignore[attr-defined │
│ 131 │ │ return wrapper │
│ │
│ /home/j-vangoey/code/winnow/.venv/lib/python3.12/site-packages/polars/_utils/deprecation.py:128 in wrapper │
│ │
│ 125 │ │ │ _rename_keyword_argument( │
│ 126 │ │ │ │ old_name, new_name, kwargs, function.__qualname__, version │
│ 127 │ │ │ ) │
│ ❱ 128 │ │ │ return function(*args, **kwargs) │
│ 129 │ │ │
│ 130 │ │ wrapper.__signature__ = inspect.signature(function) # type: ignore[attr-defined │
│ 131 │ │ return wrapper │
│ │
│ /home/j-vangoey/code/winnow/.venv/lib/python3.12/site-packages/polars/io/ipc/functions.py:179 in read_ipc │
│ │
│ 176 │ │ │ │ df = df.slice(0, n_rows) │
│ 177 │ │ │ return df │
│ 178 │ │ │
│ ❱ 179 │ │ return _read_ipc_impl( │
│ 180 │ │ │ data, │
│ 181 │ │ │ columns=columns, │
│ 182 │ │ │ n_rows=n_rows, │
│ │
│ /home/j-vangoey/code/winnow/.venv/lib/python3.12/site-packages/polars/io/ipc/functions.py:226 in _read_ipc_impl │
│ │
│ 223 │ │ return df │
│ 224 │ │
│ 225 │ projection, columns = parse_columns_arg(columns) │
│ ❱ 226 │ pydf = PyDataFrame.read_ipc( │
│ 227 │ │ source, │
│ 228 │ │ columns, │
│ 229 │ │ projection, │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
FileNotFoundError: No such file or directory (os error 2): examples/example_data/spectra.ipc
Co-authored-by: Jeroen Van Goey <j.vangoey@instadeep.com>
Update docs/cli.md
Co-authored-by: Jeroen Van Goey <j.vangoey@instadeep.com>
Update docs/cli.md
Co-authored-by: Jeroen Van Goey <j.vangoey@instadeep.com>
chore: update ipc references to MGF reference
6422b10 to f15d0e0
I accepted your changes and squashed them into a single docs commit.
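To confirm that no leftover spectra.ipc references remain in docs or default configs after a change like this, a small scan can help. The following is a minimal sketch using only the standard library; the function name `find_stale_references`, the list of file suffixes, and the demo directory layout are assumptions for illustration, not part of Winnow.

```python
import tempfile
from pathlib import Path


def find_stale_references(
    root: Path,
    needle: str = "spectra.ipc",
    suffixes: tuple = (".md", ".yaml", ".yml", ".py"),
) -> list:
    """Return (path, line_number, line) for every line containing `needle`."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip unreadable or non-text files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((path, lineno, line.strip()))
    return hits


# Demo on a throwaway directory (made-up contents, not the real repository).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "docs").mkdir()
    (root / "docs" / "examples.md").write_text(
        "# Examples\n\nspectra: examples/example_data/spectra.ipc\n",
        encoding="utf-8",
    )
    (root / "config.yaml").write_text(
        "spectrum_path_or_directory: examples/example_data/spectra.mgf\n",
        encoding="utf-8",
    )
    hits = find_stale_references(root)
    # Store relative POSIX paths so the report is stable across platforms.
    report = [(p.relative_to(root).as_posix(), n) for p, n, _ in hits]
```

Run against the repository root, an empty result would indicate that docs and configs no longer mention the old file name.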
Summary
Point default train/predict configs and configuration docs at spectra.mgf for InstaNovo-style example data, so the sample pipeline uses the MGF sample files end-to-end.
Motivation
Using MGF for the Winnow example gives a realistic MS2 spectrum file that people already know from mass-spec workflows, and it is plain text and human-readable, which makes debugging and learning from the sample much easier than with opaque columnar formats like parquet.
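To illustrate the readability point, the sketch below parses a tiny MGF snippet with plain string handling. The traceback earlier in this thread shows Winnow's loader importing `load_from_mgf` from matchms for real files; this standalone version is only a teaching aid, and the function name `parse_mgf` and the spectrum values are made up.

```python
def parse_mgf(text: str) -> list:
    """Parse MGF text into a list of {'params': dict, 'peaks': list} records."""
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:
                # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                current["params"][key.upper()] = value
            else:
                # Peak line: m/z and intensity separated by whitespace
                mz, intensity = line.split()[:2]
                current["peaks"].append((float(mz), float(intensity)))
    return spectra


# Made-up single-spectrum example; values are illustrative only.
SAMPLE = """\
BEGIN IONS
TITLE=example_scan_1
PEPMASS=500.2500
CHARGE=2+
RTINSECONDS=120.5
100.1 200.0
250.2 1500.0
END IONS
"""

spectra = parse_mgf(SAMPLE)
```

Every field a newcomer might want to inspect (title, precursor m/z, charge, peak list) is visible directly in the text file, which is not the case for binary columnar formats.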