Feat xcorr #186
max_mz = max(max(observed_mz), max(theoretical_mz))
num_bins = int(max_mz / bin_size + bin_offset) + 1
When a predicted fragment lies above the observed scan range, including max(theoretical_mz) in max_mz changes the number of bins and therefore the 10 normalization windows used for the observed spectrum. This makes the processed observed intensities depend on the candidate peptide. Adding an unmatched high-m/z theoretical ion can change the score contribution of already matching peaks even though the observed spectrum is unchanged. Size/normalize the observed spectrum from a fixed observed scan range, then handle out-of-range theoretical bins separately.
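The effect on bin count can be checked with a few lines of arithmetic. This is a minimal sketch, not the PR's code: the `BIN_SIZE` and `BIN_OFFSET` values and the `window_size` rule are assumptions chosen for illustration, and `buggy_max_mz` is a hypothetical helper that reproduces the sizing in the reviewed lines.

```python
# Assumed values for illustration; the PR's actual constants are not shown here.
BIN_SIZE = 1.0005079
BIN_OFFSET = 0.4
NUM_WINDOWS = 10


def num_bins(max_mz):
    # Same formula as the reviewed lines.
    return int(max_mz / BIN_SIZE + BIN_OFFSET) + 1


def window_size(n_bins):
    # Hypothetical rule: split the array into about NUM_WINDOWS equal windows.
    return n_bins // NUM_WINDOWS + 1


def buggy_max_mz(observed, theoretical):
    # Sizing as in the reviewed lines: the candidate's ions take part.
    return max(max(observed), max(theoretical))


observed_mz = [200.0, 500.0, 900.0]
theoretical_in_range = [200.0, 500.0]
theoretical_with_high_ion = [200.0, 500.0, 2000.0]

bins_a = num_bins(buggy_max_mz(observed_mz, theoretical_in_range))
bins_b = num_bins(buggy_max_mz(observed_mz, theoretical_with_high_ion))

# Same observed spectrum, different array size and window width.
print(bins_a, window_size(bins_a))
print(bins_b, window_size(bins_b))
```

Because the window width changes, observed peaks can fall into different normalization windows and be rescaled against a different window maximum, even though no observed peak moved.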
This is not a faithful implementation of the Eng et al. 2008 algorithm. The paper’s key idea is: preprocess the acquired spectrum y once into y', then score each candidate theoretical spectrum x by a scalar dot product x · y'. The correction subtracts the mean of the ±75 shifted acquired spectra, excluding the zero shift, divided by 150. The PR does implement the broad shape: bin observed peaks, sqrt intensities, normalize in 10 windows, subtract ±75-bin background, then dot with theoretical bins. But these lines size the observed spectrum using max(max(observed_mz), max(theoretical_mz)). That makes y' depend on the candidate theoretical peptide, which contradicts the paper’s derivation that y' is an acquired-spectrum preprocessing step done once before candidate scoring.
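Written out, the correction and the score described above look as follows. The symbols are an illustrative notation rather than the paper's exact typesetting: $y_i$ is the binned, normalized acquired spectrum, $x_i$ the binned theoretical spectrum, and bins outside the array are assumed to contribute zero.

$$
y'_i = y_i - \frac{1}{150} \sum_{\substack{\tau = -75 \\ \tau \neq 0}}^{75} y_{i+\tau},
\qquad
\mathrm{XCorr}(x, y) = x \cdot y' = \sum_i x_i \, y'_i
$$

Nothing on the right-hand side of the first equation involves $x$, which is why $y'$ can be computed once per spectrum and reused for every candidate.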
See newly added test test_observed_preprocessing_independent_of_out_of_range_theoretical_ions. It asserts that adding an unmatched theoretical ion at 2000.0 m/z does not change the score for the same observed spectrum.
Compare with Comet: CometSearch/CometPreprocess.cpp:1326
pScoring->_spectrumInfoInternal.iArraySize =
    (int)((pScoring->_pepMassInfo.dExpPepMass + dCushion) * g_staticParams.dInverseBinWidth);
Comet sizes the xcorr/preprocessing array from the query precursor/experimental peptide mass plus a fixed cushion, then bin width. It does not use the candidate theoretical fragment list. Candidate ions are later looked up against this preprocessed array.
Or compare with Crux: src/model/Scorer.cpp:824
std::vector<FLOAT_T> observed(getMaxBin(), 0);
getMaxBin() comes from
return INTEGERIZE(sp_max_mz_, bin_width_, bin_offset_);
and sp_max_mz_ is set earlier from precursor-derived experimental_mass_cut_off, rounded up to a fixed block size. Again, it is independent of the candidate theoretical ion list.
So the key difference is: Comet and Crux size/preprocess the observed spectrum from scan/query constraints; the PR sizes it from both observed peaks and candidate theoretical peaks. That candidate dependency is the bug.
Wow, good catch. Fixed by sizing and preprocessing the observed spectrum from the scan range only (max(observed_mz)); out-of-range theoretical bins are skipped at dot-product time and no longer affect y′. This follows the SEQUEST/Tide/Crux description: preprocess the acquired spectrum once from the reported scan m/z range, rather than Comet’s precursor-mass-plus-cushion sizing.
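A minimal sketch of the fixed flow, assuming the constant values below and simplified helpers (`preprocess_observed` and `score` are hypothetical names, not the functions in the PR; flanking bins and final scaling are left out):

```python
import math

# Names mirror the PR's constants; the values are assumptions for illustration.
XCORR_BIN_SIZE = 1.0005079
XCORR_BIN_OFFSET = 0.4
XCORR_NUM_WINDOWS = 10
XCORR_MAX_OFFSET = 75
XCORR_WINDOW_NORM_VALUE = 50.0


def mz_to_bin(mz):
    return int(mz / XCORR_BIN_SIZE + XCORR_BIN_OFFSET)


def preprocess_observed(observed_mz, observed_intensity):
    """Build y' from the observed peaks only; no candidate ions are involved."""
    num_bins = mz_to_bin(max(observed_mz)) + 1
    y = [0.0] * num_bins
    for mz, intensity in zip(observed_mz, observed_intensity):
        b = mz_to_bin(mz)
        y[b] = max(y[b], math.sqrt(intensity))  # sqrt compression

    # Normalize each window to the same maximum.
    window = num_bins // XCORR_NUM_WINDOWS + 1
    for start in range(0, num_bins, window):
        stop = min(start + window, num_bins)
        top = max(y[start:stop])
        if top > 0:
            for i in range(start, stop):
                y[i] *= XCORR_WINDOW_NORM_VALUE / top

    # Subtract the mean of the +/-75 neighbouring bins, excluding the bin itself.
    y_prime = []
    for i in range(num_bins):
        lo = max(0, i - XCORR_MAX_OFFSET)
        hi = min(num_bins, i + XCORR_MAX_OFFSET + 1)
        background = (sum(y[lo:hi]) - y[i]) / (2 * XCORR_MAX_OFFSET)
        y_prime.append(y[i] - background)
    return y_prime


def score(theoretical_mz, y_prime):
    """Dot product with unit-intensity ions; out-of-range bins are skipped."""
    total = 0.0
    for mz in theoretical_mz:
        b = mz_to_bin(mz)
        if 0 <= b < len(y_prime):
            total += y_prime[b]
    return total


y_prime = preprocess_observed([200.0, 500.0, 900.0], [100.0, 400.0, 225.0])
base = score([200.0, 500.0], y_prime)
with_high_ion = score([200.0, 500.0, 2000.0], y_prime)
```

Here `base` and `with_high_ion` are equal by construction, which is the property the new test asserts: the ion at 2000.0 m/z falls outside the array and is skipped instead of resizing it.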
BioGeek left a comment
Looking good now, thanks!
Add cross-correlation score (XCorr) feature
Summary
Adds an xcorr feature to FragmentMatchFeatures (and chimeric_xcorr to ChimericFeatures). XCorr is the fast cross-correlation score originally introduced by SEQUEST and refined by Comet; it measures the correlation between an observed spectrum and a synthetic theoretical spectrum after background subtraction. Unlike the spectral angle, which compares matched peak intensities directly, XCorr operates on binned spectra and accounts for background noise, making it a complementary and well-established discriminant for PSM quality.
How it works
The implementation follows the Comet/SEQUEST fast cross-correlation approach:
10,000 to produce a value on a comparable scale to other features.
Changes
- winnow/calibration/features/utils.py:
  - _bin_observed_spectrum: bins observed peaks with sqrt compression.
  - _normalize_spectrum_windows: per-window intensity normalisation.
  - _build_theoretical_spectrum: creates unit-intensity theoretical peaks with flanking bins.
  - _subtract_background: running mean background subtraction.
  - compute_xcorr: orchestrates the full fast XCorr computation.
  - compute_ion_identifications now calls compute_xcorr and returns it as an additional output.
- winnow/calibration/features/constants.py: XCORR_BIN_SIZE, XCORR_BIN_OFFSET, XCORR_NUM_WINDOWS, XCORR_MAX_OFFSET, XCORR_WINDOW_NORM_VALUE.
- winnow/calibration/features/fragment_match.py: adds xcorr to columns() and stores the computed value.
- winnow/calibration/features/chimeric.py: adds chimeric_xcorr column.
- docs/api/features/fragment_match.md: documents the XCorr feature with a description of the algorithm and its parameters.
- docs/api/features/chimeric.md: documents the chimeric XCorr column.
- tests/calibration/features/test_utils.py: tests for each XCorr sub-function and the end-to-end compute_xcorr, covering binning, normalisation, theoretical spectrum construction, background subtraction, and edge cases (empty spectra, single peaks, no overlap).
- tests/calibration/features/test_fragment_match.py, test_chimeric.py: assert new columns are present.
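For readers unfamiliar with the "unit-intensity theoretical peaks with flanking bins" step, here is one way it could look. This is a sketch under stated assumptions, not the helper in utils.py: the function name, its signature, the default bin parameters, and the `flank_intensity` value of 0.5 are all hypothetical.

```python
def build_theoretical_spectrum(
    theoretical_mz,
    num_bins,
    bin_size=1.0005079,   # assumed default
    bin_offset=0.4,       # assumed default
    flank_intensity=0.5,  # assumed flanking-bin intensity
):
    """Return a binned theoretical spectrum sized to the observed array."""
    x = [0.0] * num_bins
    for mz in theoretical_mz:
        b = int(mz / bin_size + bin_offset)
        if not 0 <= b < num_bins:
            continue  # ions outside the observed array are ignored
        x[b] = 1.0  # unit-intensity main peak
        for neighbour in (b - 1, b + 1):
            if 0 <= neighbour < num_bins:
                # Never overwrite a main peak with a weaker flank.
                x[neighbour] = max(x[neighbour], flank_intensity)
    return x


spectrum = build_theoretical_spectrum([100.0], num_bins=200)
out_of_range = build_theoretical_spectrum([2000.0], num_bins=200)
```

Passing `num_bins` in from the preprocessed observed array, instead of deriving it from the ions, keeps the theoretical spectrum from influencing the array size.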