Feat xcorr by JemmaLDaniel · Pull Request #186 · instadeepai/winnow

JemmaLDaniel · 2026-04-10T14:13:28Z

Add cross-correlation score (XCorr) feature

Summary

Adds an xcorr feature to FragmentMatchFeatures (and chimeric_xcorr to ChimericFeatures). XCorr is the fast cross-correlation score originally introduced by SEQUEST and refined by Comet. It measures the correlation between an observed spectrum and a synthetic theoretical spectrum after background subtraction. Unlike the spectral angle, which compares matched peak intensities directly, XCorr operates on binned spectra and accounts for background noise, making it a complementary and well-established discriminant for PSM quality.

How it works

The implementation follows the Comet/SEQUEST fast cross-correlation approach:

1. Bin the observed spectrum into unit-dalton bins (bin width = 1.0005079 Da, offset = 0.4 Da) with square-root intensity compression and window-based normalisation (10 equal-width windows, max intensity normalised to 50).
2. Construct a theoretical spectrum by placing unit-intensity peaks at each theoretical fragment m/z, including flanking bins at ±1 Da with reduced intensity.
3. Subtract the background from the observed spectrum using a running mean over ±75 bins.
4. Compute the cross-correlation as the dot product of the background-subtracted observed spectrum and the theoretical spectrum.
5. Normalise the raw score by dividing by 10,000 to produce a value on a comparable scale to other features.
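
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in winnow/calibration/features/utils.py: the names `to_bin`, `preprocess_observed`, `theoretical_spectrum` and `xcorr_sketch` are hypothetical, the flanking weight of 0.5 is an assumption (the description only says "reduced intensity"), the window split via `np.array_split` is one possible reading of "equal-width", and empty spectra are not handled. The observed array is sized from the observed peaks only.

```python
import numpy as np

BIN_SIZE = 1.0005079   # Da, from the PR description
BIN_OFFSET = 0.4       # from the PR description
NUM_WINDOWS = 10
WINDOW_NORM = 50.0
MAX_OFFSET = 75
FLANK_WEIGHT = 0.5     # assumption: the PR only says "reduced intensity"


def to_bin(mz):
    """Map m/z values to integer bin indices."""
    return (np.asarray(mz, dtype=float) / BIN_SIZE + BIN_OFFSET).astype(int)


def preprocess_observed(mz, intensity):
    """Bin, sqrt-compress, window-normalise, then subtract the running mean."""
    bins = to_bin(mz)
    spec = np.zeros(bins.max() + 1)
    # Keep the largest sqrt-compressed intensity that falls into each bin.
    np.maximum.at(spec, bins, np.sqrt(np.asarray(intensity, dtype=float)))
    # Scale each window so that its largest value becomes WINDOW_NORM.
    for window in np.array_split(np.arange(spec.size), NUM_WINDOWS):
        if window.size and spec[window].max() > 0:
            spec[window] *= WINDOW_NORM / spec[window].max()
    # Background: mean over +/-MAX_OFFSET bins, excluding the bin itself.
    padded = np.pad(spec, MAX_OFFSET)
    csum = np.concatenate(([0.0], np.cumsum(padded)))
    width = 2 * MAX_OFFSET + 1
    window_sum = csum[width:] - csum[:-width]
    return spec - (window_sum - spec) / (2 * MAX_OFFSET)


def theoretical_spectrum(fragment_mz, num_bins):
    """Unit peaks at fragment bins, weaker peaks in the two flanking bins."""
    theo = np.zeros(num_bins)
    for b in to_bin(fragment_mz):
        for shift, weight in ((0, 1.0), (-1, FLANK_WEIGHT), (1, FLANK_WEIGHT)):
            if 0 <= b + shift < num_bins:   # bins outside the array are skipped
                theo[b + shift] = max(theo[b + shift], weight)
    return theo


def xcorr_sketch(obs_mz, obs_intensity, fragment_mz):
    """Dot product of the preprocessed observed and theoretical spectra, scaled."""
    y_prime = preprocess_observed(obs_mz, obs_intensity)
    x = theoretical_spectrum(fragment_mz, y_prime.size)
    return float(np.dot(x, y_prime)) / 10_000.0


print(xcorr_sketch([300.0, 500.0], [400.0, 100.0], [300.0, 500.0]))
```

A fragment that matches an observed peak contributes positively, while a fragment landing in an empty region more than 75 bins from any peak contributes nothing.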

Changes

winnow/calibration/features/utils.py:
- Adds _bin_observed_spectrum — bins observed peaks with sqrt compression.
- Adds _normalize_spectrum_windows — per-window intensity normalisation.
- Adds _build_theoretical_spectrum — creates unit-intensity theoretical peaks with flanking bins.
- Adds _subtract_background — running mean background subtraction.
- Adds compute_xcorr — orchestrates the full fast XCorr computation.
- compute_ion_identifications now calls compute_xcorr and returns it as an additional output.
winnow/calibration/features/constants.py:
- Adds constants for XCorr parameters (XCORR_BIN_SIZE, XCORR_BIN_OFFSET, XCORR_NUM_WINDOWS, XCORR_MAX_OFFSET, XCORR_WINDOW_NORM_VALUE).
winnow/calibration/features/fragment_match.py — adds xcorr to columns() and stores the computed value.
winnow/calibration/features/chimeric.py — adds chimeric_xcorr column.
docs/api/features/fragment_match.md — documents the XCorr feature with description of the algorithm and its parameters.
docs/api/features/chimeric.md — documents the chimeric XCorr column.
tests/calibration/features/test_utils.py — tests for each XCorr sub-function and the end-to-end compute_xcorr, covering binning, normalisation, theoretical spectrum construction, background subtraction, and edge cases (empty spectra, single peaks, no overlap).
tests/calibration/features/test_fragment_match.py, test_chimeric.py — assert new columns are present.

github-actions · 2026-04-10T14:15:05Z

Coverage Report

File	Stmts	Miss	Cover	Missing
__init__.py	0	0	100%
data_types.py	4	0	100%
calibration
__init__.py	0	0	100%
calibration_features.py	9	0	100%
calibrator.py	102	11	89%	69–70, 72, 107, 134–135, 137, 163, 168, 195–196
diagnostics.py	168	50	70%	70, 96, 101, 111, 115, 137, 146, 203–218, 261–262, 266, 307, 309–324, 335–341
calibration/features
__init__.py	10	0	100%
base.py	8	0	100%
beam.py	47	0	100%
chimeric.py	82	1	98%	213
constants.py	9	0	100%
fragment_match.py	78	1	98%	203
mass_error.py	67	2	97%	16, 20
retention_time.py	135	9	93%	183, 190, 206, 257–259, 269, 272–273
sequence.py	19	0	100%
token_score.py	37	1	97%	82
utils.py	261	3	98%	96, 368, 594
compat
__init__.py	0	0	100%
instanovo.py	10	6	40%	12, 14–15, 17, 24–25
datasets
__init__.py	0	0	100%
calibration_dataset.py	109	17	84%	155, 169, 171, 173, 183, 196, 249, 251–252, 258–261, 263–266
interfaces.py	3	0	100%
psm_dataset.py	25	0	100%
datasets/data_loaders
__init__.py	5	0	100%
instanovo.py	119	19	84%	90, 93, 119, 142, 168–169, 172–174, 176–177, 179, 182–183, 185, 343–345, 356
mztab.py	215	55	74%	103, 106, 157, 161, 210–211, 223, 236–240, 287, 290, 302–303, 315–317, 319–320, 322, 324, 330, 334–336, 338–339, 343–346, 350, 514–515, 518, 521, 528, 542–546, 550–555, 561, 570–571, 573, 599
pointnovo.py	7	0	100%
utils.py	59	1	98%	11
winnow.py	39	4	89%	54–55, 91–92
fdr
__init__.py	0	0	100%
base.py	58	15	74%	81, 85–86, 91, 98–99, 105, 126, 129–130, 135, 137–138, 144, 186
database_grounded.py	28	1	96%	52
nonparametric.py	25	4	84%	62, 68–69, 72
scripts
__init__.py	0	0	100%
main.py	256	256	0%	8, 10–13, 16–20, 23–24, 26–28, 32, 39, 44, 47, 53, 55–56, 59, 68, 76, 79, 86, 88–90, 92, 94–99, 102, 104–105, 110, 125, 128, 135–141, 144–145, 148, 161–163, 166, 169, 174, 176–178, 180, 182–183, 186–187, 190, 192–193, 195, 197, 199–200, 202, 205–206, 209–210, 213–214, 217–219, 221–224, 227–229, 231, 234, 248–250, 252, 254, 259, 261–263, 265–266, 268, 270–271, 273–275, 277, 279, 281–282, 286–289, 291–292, 294–295, 297–298, 300, 303, 317–319, 322, 325, 330, 332–334, 336–338, 340–341, 344–345, 348, 350–351, 353, 355, 357–358, 360, 363–364, 370–372, 374–377, 380–381, 384–385, 388–389, 392–393, 401–403, 407, 410, 414, 417, 423–425, 427–428, 435–436, 438, 440, 445, 447–449, 451–452, 455, 457–458, 460–463, 465–466, 468–469, 471–473, 479–480, 484–485, 488, 495, 500–501, 506–508, 511, 516, 526, 533, 535, 539, 541–542, 546–547, 550, 573, 586–587, 590, 612, 624–625, 628, 653, 666–667, 670, 685, 697–698, 701, 716, 728–729, 732, 744, 756–757, 760, 775, 787–788, 791, 800, 812–813
utils
__init__.py	4	0	100%
config_formatter.py	53	40	24%	29, 37–38, 40–42, 44, 55, 58–60, 62–63, 66–69, 72–74, 77–78, 80, 91, 102, 113, 127–128, 130–132, 145–147, 150, 153–154, 157–158, 160
config_path.py	76	5	93%	24–26, 117–118
peptide.py	16	0	100%
TOTAL	2143	501	76%

Tests	Skipped	Failures	Errors	Time
430	0 💤	0 ❌	0 🔥	34.839s ⏱️

BioGeek · 2026-06-25T14:53:46Z

+    max_mz = max(max(observed_mz), max(theoretical_mz))
+    num_bins = int(max_mz / bin_size + bin_offset) + 1


When a predicted fragment lies above the observed scan range, including max(theoretical_mz) in max_mz changes the number of bins and therefore the 10 normalization windows used for the observed spectrum. This makes the processed observed intensities depend on the candidate peptide. Adding an unmatched high-m/z theoretical ion can change the score contribution of already matching peaks even though the observed spectrum is unchanged. Size/normalize the observed spectrum from a fixed observed scan range, then handle out-of-range theoretical bins separately.
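
To make the dependency concrete, here is a small sketch of how the normalisation window of a fixed observed peak can move when the array is sized from a larger m/z. It assumes windows of equal width in bins (`num_bins / NUM_WINDOWS`), which may not match the PR's exact windowing. The helper names `num_bins` and `window_of` are hypothetical.

```python
BIN_SIZE, BIN_OFFSET, NUM_WINDOWS = 1.0005079, 0.4, 10


def num_bins(max_mz):
    """Array length, using the formula from the diff lines quoted above."""
    return int(max_mz / BIN_SIZE + BIN_OFFSET) + 1


def window_of(mz, max_mz):
    """Index of the normalisation window holding `mz` for an array sized by `max_mz`."""
    width = num_bins(max_mz) / NUM_WINDOWS
    return int(int(mz / BIN_SIZE + BIN_OFFSET) / width)


peak = 450.0
# Sized from the observed scan only (highest observed peak at 1000 m/z).
print(window_of(peak, 1000.0))
# Sized after an unmatched theoretical ion at 2000 m/z enlarges the array.
print(window_of(peak, 2000.0))
```

The same observed peak lands in a different window in the two cases, so it is normalised against a different set of neighbouring peaks even though the observed spectrum has not changed.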

This is not a faithful implementation of the Eng et al. 2008 algorithm. The paper’s key idea is: preprocess the acquired spectrum y once into y', then score each candidate theoretical spectrum x by a scalar dot product x · y'. The correction subtracts the mean of the ±75 shifted acquired spectra, excluding the zero shift, divided by 150. The PR does implement the broad shape: bin observed peaks, sqrt intensities, normalize in 10 windows, subtract ±75-bin background, then dot with theoretical bins. But these lines size the observed spectrum using max(max(observed_mz), max(theoretical_mz)). That makes y' depend on the candidate theoretical peptide, which contradicts the paper’s derivation that y' is an acquired-spectrum preprocessing step done once before candidate scoring.
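
For reference, the identity behind the "preprocess once" argument can be written as follows, with x the binned theoretical spectrum, y the binned and normalised acquired spectrum, and R_τ the correlation at shift τ (notation chosen here for illustration):

```latex
\begin{aligned}
R_\tau &= \sum_i x[i]\, y[i+\tau] \\
\mathrm{XCorr} &= R_0 - \frac{1}{150} \sum_{\substack{\tau=-75 \\ \tau \neq 0}}^{75} R_\tau
  = \sum_i x[i]\, y'[i] = x \cdot y' \\
y'[i] &= y[i] - \frac{1}{150} \sum_{\substack{\tau=-75 \\ \tau \neq 0}}^{75} y[i+\tau]
\end{aligned}
```

Since y' is built from y alone, anything that lets x influence the length or normalisation of y breaks this factorisation.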

See newly added test test_observed_preprocessing_independent_of_out_of_range_theoretical_ions. It asserts that adding an unmatched theoretical ion at 2000.0 m/z does not change the score for the same observed spectrum.

Compare with Comet: CometSearch/CometPreprocess.cpp:1326

pScoring->_spectrumInfoInternal.iArraySize = (int)((pScoring->_pepMassInfo.dExpPepMass + dCushion) * g_staticParams.dInverseBinWidth);

Comet sizes the xcorr/preprocessing array from the query precursor/experimental peptide mass plus a fixed cushion, then bin width. It does not use the candidate theoretical fragment list. Candidate ions are later looked up against this preprocessed array.

Or compare with Crux: src/model/Scorer.cpp:824

std::vector<FLOAT_T> observed(getMaxBin(), 0);

getMaxBin() comes from

return INTEGERIZE(sp_max_mz_, bin_width_, bin_offset_);

and sp_max_mz_ is set earlier from precursor-derived experimental_mass_cut_off, rounded up to a fixed block size. Again, it is independent of the candidate theoretical ion list.

So the key difference is: Comet and Crux size/preprocess the observed spectrum from scan/query constraints; the PR sizes it from both observed peaks and candidate theoretical peaks. That candidate dependency is the bug.

JemmaLDaniel · 2026-06-27

Wow, good catch. Fixed by sizing and preprocessing the observed spectrum from the scan range only (max(observed_mz)); out-of-range theoretical bins are skipped at dot-product time and no longer affect y′. This follows the SEQUEST/Tide/Crux description: preprocess the acquired spectrum once from the reported scan m/z range, rather than Comet’s precursor-mass-plus-cushion sizing.
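
A rough sketch of what skipping out-of-range bins at dot-product time could look like. The function `score_candidate` is hypothetical, `y_prime` is a stand-in array rather than a real preprocessed spectrum, and flanking bins and the final scaling are left out for brevity.

```python
import numpy as np

BIN_SIZE, BIN_OFFSET = 1.0005079, 0.4


def score_candidate(y_prime, fragment_mz):
    """Sum y' over a candidate's fragment bins, ignoring bins outside the array."""
    bins = (np.asarray(fragment_mz, dtype=float) / BIN_SIZE + BIN_OFFSET).astype(int)
    in_range = bins[(bins >= 0) & (bins < y_prime.size)]  # skip out-of-range bins
    # np.unique so that two fragments sharing a bin are counted once.
    return float(y_prime[np.unique(in_range)].sum())


# Stand-in for y': a fixed array computed once from the observed scan.
y_prime = np.zeros(1001)
y_prime[[300, 450]] = 50.0

base = score_candidate(y_prime, [300.0, 450.0])
with_far_ion = score_candidate(y_prime, [300.0, 450.0, 2000.0])
print(base, with_far_ion)
```

Because `y_prime` never sees the candidate's fragment list, adding the ion at 2000.0 m/z leaves the score unchanged, which is the property the new test asserts.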

BioGeek

Looking good now, thanks!

JemmaLDaniel self-assigned this Apr 10, 2026

JemmaLDaniel added the enhancement New feature or request label Apr 10, 2026

JemmaLDaniel force-pushed the feat-spectral-angle branch from 40e590e to 8842a46 Compare April 10, 2026 14:37

JemmaLDaniel force-pushed the feat-xcorr branch from c35a90c to b433c13 Compare April 10, 2026 14:37

JemmaLDaniel added 4 commits April 10, 2026 17:41

feat: add xcorr feature

653720f

test: add xcorr feature tests

16a8e47

docs: update feature documentation with xcorr feature

c525040

chore: move xcorr constants into constants file

489cd88

JemmaLDaniel force-pushed the feat-xcorr branch from b433c13 to 489cd88 Compare April 10, 2026 16:41

JemmaLDaniel force-pushed the feat-spectral-angle branch from 8842a46 to 5107498 Compare April 10, 2026 16:41

JemmaLDaniel added 2 commits June 22, 2026 15:18

fix: correct argument name in feature utils tests

cfd38fe

Merge branch 'feat-spectral-angle' into feat-xcorr

538bc8a

JemmaLDaniel force-pushed the feat-spectral-angle branch from cfd38fe to 17231f4 Compare June 22, 2026 14:29

Merge branch 'feat-spectral-angle' into feat-xcorr

c50bc8b

JemmaLDaniel requested a review from BioGeek June 22, 2026 14:34

test: add test_observed_preprocessing_independent_of_out_of_range_the…

a0efccc

…oretical_ions committed

BioGeek requested changes Jun 25, 2026

JemmaLDaniel added 2 commits June 27, 2026 16:07

Merge branch 'feat-spectral-angle' into feat-xcorr

30694e8

fix: preprocess observed spectrum equivalently irregardless of candid…

31a6cdd

…ate spectra

JemmaLDaniel force-pushed the feat-xcorr branch from 6a402f1 to 31a6cdd Compare June 27, 2026 15:26

JemmaLDaniel requested a review from BioGeek June 27, 2026 15:29

Base automatically changed from feat-spectral-angle to main June 27, 2026 15:30

BioGeek approved these changes Jun 27, 2026

JemmaLDaniel merged commit 76b9610 into main Jun 27, 2026
2 checks passed

JemmaLDaniel deleted the feat-xcorr branch June 27, 2026 16:20

JemmaLDaniel merged 10 commits into main from feat-xcorr
