Feat xcorr #186

Merged
JemmaLDaniel merged 10 commits into main from feat-xcorr on Jun 27, 2026
Conversation

@JemmaLDaniel JemmaLDaniel commented Apr 10, 2026

Collaborator

Add cross-correlation score (XCorr) feature

Summary

Adds an xcorr feature to FragmentMatchFeatures (and chimeric_xcorr to ChimericFeatures). XCorr is the fast cross-correlation score originally introduced by SEQUEST and refined by Comet: it measures the correlation between an observed spectrum and a synthetic theoretical spectrum after background subtraction. Unlike the spectral angle, which compares matched peak intensities directly, XCorr operates on binned spectra and accounts for background noise, making it a complementary and well-established discriminant for PSM quality.

How it works

The implementation follows the Comet/SEQUEST fast cross-correlation approach:

  1. Bin the observed spectrum into unit-dalton bins (bin width = 1.0005079 Da, offset = 0.4 Da) with square-root intensity compression and window-based normalisation (10 equal-width windows, max intensity normalised to 50).
  2. Construct a theoretical spectrum by placing unit-intensity peaks at each theoretical fragment m/z, including flanking bins at ±1 Da with reduced intensity.
  3. Subtract the background from the observed spectrum using a running mean over ±75 bins.
  4. Compute the cross-correlation as the dot product of the background-subtracted observed spectrum and the theoretical spectrum.
  5. Normalise the raw score by dividing by 10,000 to produce a value on a comparable scale to other features.
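
The five steps above can be sketched in dependency-free Python. This is an illustrative sketch, not the code in winnow/calibration/features/utils.py: the function names, the flanking intensity of 0.5 (the description only says "reduced intensity"), the window-width rule, and the fixed denominator of 150 in the background term are all assumptions. The observed array is sized from the observed peaks only, and theoretical bins outside that range are skipped.

```python
import math

# Values taken from the PR description.
BIN_SIZE = 1.0005079      # Da
BIN_OFFSET = 0.4
NUM_WINDOWS = 10
WINDOW_NORM = 50.0
MAX_OFFSET = 75
SCORE_SCALE = 10_000.0
# Assumption: the description only says "reduced intensity".
FLANK_INTENSITY = 0.5


def to_bin(mz):
    """Map a (positive) m/z value to a bin index."""
    return int(mz / BIN_SIZE + BIN_OFFSET)


def preprocess_observed(mzs, intensities):
    """Steps 1 and 3: bin, sqrt-compress, window-normalise, subtract background."""
    if not mzs:
        return []
    num_bins = to_bin(max(mzs)) + 1
    binned = [0.0] * num_bins
    for mz, intensity in zip(mzs, intensities):
        b = to_bin(mz)
        binned[b] = max(binned[b], math.sqrt(intensity))

    # Scale each window so that its most intense bin equals WINDOW_NORM.
    # Assumption: width chosen so that at most NUM_WINDOWS windows cover the array.
    width = num_bins // NUM_WINDOWS + 1
    for start in range(0, num_bins, width):
        window = binned[start:start + width]
        peak = max(window)
        if peak > 0:
            binned[start:start + width] = [v * WINDOW_NORM / peak for v in window]

    # Running background over +/-MAX_OFFSET bins, excluding the bin itself,
    # computed with prefix sums and divided by a fixed 2 * MAX_OFFSET.
    prefix = [0.0]
    for v in binned:
        prefix.append(prefix[-1] + v)
    processed = []
    for i, v in enumerate(binned):
        lo = max(0, i - MAX_OFFSET)
        hi = min(num_bins, i + MAX_OFFSET + 1)
        neighbours = prefix[hi] - prefix[lo] - v
        processed.append(v - neighbours / (2 * MAX_OFFSET))
    return processed


def xcorr(processed, theoretical_mzs):
    """Steps 2, 4 and 5: theoretical bins, dot product, scaling."""
    theoretical = {}
    for mz in theoretical_mzs:
        centre = to_bin(mz)
        for b, weight in ((centre, 1.0),
                          (centre - 1, FLANK_INTENSITY),
                          (centre + 1, FLANK_INTENSITY)):
            # Bins outside the observed array are skipped.
            if 0 <= b < len(processed):
                theoretical[b] = max(theoretical.get(b, 0.0), weight)
    raw = sum(weight * processed[b] for b, weight in theoretical.items())
    return raw / SCORE_SCALE


observed = preprocess_observed([100.0, 200.0, 300.0, 500.0],
                               [10.0, 100.0, 50.0, 20.0])
matching = xcorr(observed, [200.0, 300.0])
non_matching = xcorr(observed, [250.0])
```

In this toy spectrum the candidate whose fragments coincide with observed peaks scores above zero, while the candidate whose only fragment falls between peaks scores below zero, because the background term is subtracted from empty bins.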

Changes

  • winnow/calibration/features/utils.py:
    • Adds _bin_observed_spectrum — bins observed peaks with sqrt compression.
    • Adds _normalize_spectrum_windows — per-window intensity normalisation.
    • Adds _build_theoretical_spectrum — creates unit-intensity theoretical peaks with flanking bins.
    • Adds _subtract_background — running mean background subtraction.
    • Adds compute_xcorr — orchestrates the full fast XCorr computation.
    • compute_ion_identifications now calls compute_xcorr and returns it as an additional output.
  • winnow/calibration/features/constants.py:
    • Adds constants for XCorr parameters (XCORR_BIN_SIZE, XCORR_BIN_OFFSET, XCORR_NUM_WINDOWS, XCORR_MAX_OFFSET, XCORR_WINDOW_NORM_VALUE).
  • winnow/calibration/features/fragment_match.py — adds xcorr to columns() and stores the computed value.
  • winnow/calibration/features/chimeric.py — adds chimeric_xcorr column.
  • docs/api/features/fragment_match.md — documents the XCorr feature with description of the algorithm and its parameters.
  • docs/api/features/chimeric.md — documents the chimeric XCorr column.
  • tests/calibration/features/test_utils.py — tests for each XCorr sub-function and the end-to-end compute_xcorr, covering binning, normalisation, theoretical spectrum construction, background subtraction, and edge cases (empty spectra, single peaks, no overlap).
  • tests/calibration/features/test_fragment_match.py, test_chimeric.py — assert new columns are present.

@JemmaLDaniel JemmaLDaniel self-assigned this Apr 10, 2026
@JemmaLDaniel JemmaLDaniel added the enhancement New feature or request label Apr 10, 2026
github-actions Bot commented Apr 10, 2026

Coverage

Coverage Report
| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| __init__.py | 0 | 0 | 100% | |
| data_types.py | 4 | 0 | 100% | |
| calibration | | | | |
| __init__.py | 0 | 0 | 100% | |
| calibration_features.py | 9 | 0 | 100% | |
| calibrator.py | 102 | 11 | 89% | 69–70, 72, 107, 134–135, 137, 163, 168, 195–196 |
| diagnostics.py | 168 | 50 | 70% | 70, 96, 101, 111, 115, 137, 146, 203–218, 261–262, 266, 307, 309–324, 335–341 |
| calibration/features | | | | |
| __init__.py | 10 | 0 | 100% | |
| base.py | 8 | 0 | 100% | |
| beam.py | 47 | 0 | 100% | |
| chimeric.py | 82 | 1 | 98% | 213 |
| constants.py | 9 | 0 | 100% | |
| fragment_match.py | 78 | 1 | 98% | 203 |
| mass_error.py | 67 | 2 | 97% | 16, 20 |
| retention_time.py | 135 | 9 | 93% | 183, 190, 206, 257–259, 269, 272–273 |
| sequence.py | 19 | 0 | 100% | |
| token_score.py | 37 | 1 | 97% | 82 |
| utils.py | 261 | 3 | 98% | 96, 368, 594 |
| compat | | | | |
| __init__.py | 0 | 0 | 100% | |
| instanovo.py | 10 | 6 | 40% | 12, 14–15, 17, 24–25 |
| datasets | | | | |
| __init__.py | 0 | 0 | 100% | |
| calibration_dataset.py | 109 | 17 | 84% | 155, 169, 171, 173, 183, 196, 249, 251–252, 258–261, 263–266 |
| interfaces.py | 3 | 0 | 100% | |
| psm_dataset.py | 25 | 0 | 100% | |
| datasets/data_loaders | | | | |
| __init__.py | 5 | 0 | 100% | |
| instanovo.py | 119 | 19 | 84% | 90, 93, 119, 142, 168–169, 172–174, 176–177, 179, 182–183, 185, 343–345, 356 |
| mztab.py | 215 | 55 | 74% | 103, 106, 157, 161, 210–211, 223, 236–240, 287, 290, 302–303, 315–317, 319–320, 322, 324, 330, 334–336, 338–339, 343–346, 350, 514–515, 518, 521, 528, 542–546, 550–555, 561, 570–571, 573, 599 |
| pointnovo.py | 7 | 0 | 100% | |
| utils.py | 59 | 1 | 98% | 11 |
| winnow.py | 39 | 4 | 89% | 54–55, 91–92 |
| fdr | | | | |
| __init__.py | 0 | 0 | 100% | |
| base.py | 58 | 15 | 74% | 81, 85–86, 91, 98–99, 105, 126, 129–130, 135, 137–138, 144, 186 |
| database_grounded.py | 28 | 1 | 96% | 52 |
| nonparametric.py | 25 | 4 | 84% | 62, 68–69, 72 |
| scripts | | | | |
| __init__.py | 0 | 0 | 100% | |
| main.py | 256 | 256 | 0% | 8, 10–13, 16–20, 23–24, 26–28, 32, 39, 44, 47, 53, 55–56, 59, 68, 76, 79, 86, 88–90, 92, 94–99, 102, 104–105, 110, 125, 128, 135–141, 144–145, 148, 161–163, 166, 169, 174, 176–178, 180, 182–183, 186–187, 190, 192–193, 195, 197, 199–200, 202, 205–206, 209–210, 213–214, 217–219, 221–224, 227–229, 231, 234, 248–250, 252, 254, 259, 261–263, 265–266, 268, 270–271, 273–275, 277, 279, 281–282, 286–289, 291–292, 294–295, 297–298, 300, 303, 317–319, 322, 325, 330, 332–334, 336–338, 340–341, 344–345, 348, 350–351, 353, 355, 357–358, 360, 363–364, 370–372, 374–377, 380–381, 384–385, 388–389, 392–393, 401–403, 407, 410, 414, 417, 423–425, 427–428, 435–436, 438, 440, 445, 447–449, 451–452, 455, 457–458, 460–463, 465–466, 468–469, 471–473, 479–480, 484–485, 488, 495, 500–501, 506–508, 511, 516, 526, 533, 535, 539, 541–542, 546–547, 550, 573, 586–587, 590, 612, 624–625, 628, 653, 666–667, 670, 685, 697–698, 701, 716, 728–729, 732, 744, 756–757, 760, 775, 787–788, 791, 800, 812–813 |
| utils | | | | |
| __init__.py | 4 | 0 | 100% | |
| config_formatter.py | 53 | 40 | 24% | 29, 37–38, 40–42, 44, 55, 58–60, 62–63, 66–69, 72–74, 77–78, 80, 91, 102, 113, 127–128, 130–132, 145–147, 150, 153–154, 157–158, 160 |
| config_path.py | 76 | 5 | 93% | 24–26, 117–118 |
| peptide.py | 16 | 0 | 100% | |
| TOTAL | 2143 | 501 | 76% | |

Tests Skipped Failures Errors Time
430 0 💤 0 ❌ 0 🔥 34.839s ⏱️

@JemmaLDaniel JemmaLDaniel force-pushed the feat-spectral-angle branch from cfd38fe to 17231f4 Compare June 22, 2026 14:29
@JemmaLDaniel JemmaLDaniel requested a review from BioGeek June 22, 2026 14:34
Comment thread winnow/calibration/features/utils.py Outdated
Comment on lines +788 to +789
max_mz = max(max(observed_mz), max(theoretical_mz))
num_bins = int(max_mz / bin_size + bin_offset) + 1

@BioGeek BioGeek Jun 25, 2026

Contributor

When a predicted fragment lies above the observed scan range, including max(theoretical_mz) in max_mz changes the number of bins and therefore the 10 normalization windows used for the observed spectrum. This makes the processed observed intensities depend on the candidate peptide. Adding an unmatched high-m/z theoretical ion can change the score contribution of already matching peaks even though the observed spectrum is unchanged. Size/normalize the observed spectrum from a fixed observed scan range, then handle out-of-range theoretical bins separately.

This is not a faithful implementation of the Eng et al. 2008 algorithm. The paper’s key idea is: preprocess the acquired spectrum y once into y', then score each candidate theoretical spectrum x by a scalar dot product x · y'. The correction subtracts the mean of the ±75 shifted acquired spectra, excluding the zero shift, divided by 150. The PR does implement the broad shape: bin observed peaks, sqrt intensities, normalize in 10 windows, subtract ±75-bin background, then dot with theoretical bins. But these lines size the observed spectrum using max(max(observed_mz), max(theoretical_mz)). That makes y' depend on the candidate theoretical peptide, which contradicts the paper’s derivation that y' is an acquired-spectrum preprocessing step done once before candidate scoring.
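
Written out, the preprocessing described in the paragraph above looks roughly like this. The index i (bin position) and shift τ are notation chosen here for illustration; x is a candidate's theoretical spectrum and y the binned, normalised acquired spectrum:

```latex
y'_i = y_i - \frac{1}{150} \sum_{\substack{\tau = -75 \\ \tau \neq 0}}^{75} y_{i+\tau},
\qquad
\mathrm{XCorr}(x, y) = x \cdot y' = \sum_i x_i \, y'_i
```

Because y' contains no term in x, it can be computed once per acquired spectrum and reused for every candidate, which is the property the sizing on these lines breaks.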

See newly added test test_observed_preprocessing_independent_of_out_of_range_theoretical_ions. It asserts that adding an unmatched theoretical ion at 2000.0 m/z does not change the score for the same observed spectrum.

Compare with Comet: CometSearch/CometPreprocess.cpp:1326

pScoring->_spectrumInfoInternal.iArraySize =
      (int)((pScoring->_pepMassInfo.dExpPepMass + dCushion) * g_staticParams.dInverseBinWidth);

Comet sizes the xcorr/preprocessing array from the query precursor/experimental peptide mass plus a fixed cushion, then bin width. It does not use the candidate theoretical fragment list. Candidate ions are later looked up against this preprocessed array.

Or compare with Crux: src/model/Scorer.cpp:824

std::vector<FLOAT_T> observed(getMaxBin(), 0);

getMaxBin() comes from

return INTEGERIZE(sp_max_mz_, bin_width_, bin_offset_);

and sp_max_mz_ is set earlier from precursor-derived experimental_mass_cut_off, rounded up to a fixed block size. Again, it is independent of the candidate theoretical ion list.

So the key difference is: Comet and Crux size/preprocess the observed spectrum from scan/query constraints; the PR sizes it from both observed peaks and candidate theoretical peaks. That candidate dependency is the bug.
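
A small sketch of the candidate dependency, using the two outdated lines quoted at the top of this thread. The function names are hypothetical, and taking the observed range as max(observed_mz) is an assumption made for illustration; the constants are the ones from the PR description.

```python
BIN_SIZE = 1.0005079  # Da, from the PR description
BIN_OFFSET = 0.4      # from the PR description


def num_bins_candidate_dependent(observed_mz, theoretical_mz):
    """Array sizing as in the two outdated lines under review."""
    max_mz = max(max(observed_mz), max(theoretical_mz))
    return int(max_mz / BIN_SIZE + BIN_OFFSET) + 1


def num_bins_observed_only(observed_mz):
    """Array sizing from the observed peaks only (illustrative assumption)."""
    return int(max(observed_mz) / BIN_SIZE + BIN_OFFSET) + 1


observed = [100.0, 200.0, 300.0, 500.0]

# Same observed spectrum, two candidates: one adds an ion above the scan range.
in_range = num_bins_candidate_dependent(observed, [200.0, 300.0])
out_of_range = num_bins_candidate_dependent(observed, [200.0, 300.0, 2000.0])
fixed = num_bins_observed_only(observed)

print(in_range, out_of_range, fixed)
```

With the candidate-dependent sizing, the array grows from 501 to 2000 bins when the 2000.0 m/z ion is added, so each of the 10 normalisation windows spans roughly four times as many bins and the same observed peaks are grouped and rescaled differently. Sizing from the observed peaks alone gives 501 bins for both candidates.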

Collaborator Author

Wow, good catch. Fixed by sizing and preprocessing the observed spectrum from the scan range only (max(observed_mz)); out-of-range theoretical bins are skipped at dot-product time and no longer affect y′. This follows the SEQUEST/Tide/Crux description: preprocess the acquired spectrum once from the reported scan m/z range, rather than Comet’s precursor-mass-plus-cushion sizing.

@JemmaLDaniel JemmaLDaniel requested a review from BioGeek June 27, 2026 15:29
Base automatically changed from feat-spectral-angle to main June 27, 2026 15:30

@BioGeek BioGeek left a comment

Contributor

Looking good now, thanks!

@JemmaLDaniel JemmaLDaniel merged commit 76b9610 into main Jun 27, 2026
2 checks passed
@JemmaLDaniel JemmaLDaniel deleted the feat-xcorr branch June 27, 2026 16:20