Skip to content

Format-agnostic idXML/mzIdentML I/O + localizer performance optimizations#51

Closed
ypriverol wants to merge 14 commits into
mainfrom
fix/phosphors-flr-site-prob-ranking
Closed

Format-agnostic idXML/mzIdentML I/O + localizer performance optimizations#51
ypriverol wants to merge 14 commits into
mainfrom
fix/phosphors-flr-site-prob-ranking

Conversation

@ypriverol

@ypriverol ypriverol commented Jun 20, 2026

Copy link
Copy Markdown
Member

Summary

This branch adds two independent bodies of work on top of the existing PhosphoRS/AScore/LucXor pipeline.

1. Format-agnostic idXML / mzIdentML I/O (new feature)

The pipeline now reads and writes either idXML or mzIdentML by file extension:

onsite all -in spectra.mzML -id ids.mzid -out scored.mzid [--add-decoys]
  • New onsite/mzid_adapter.py: load_identifications / store_identifications (extension dispatch; the mzid store strips non-UNIMOD custom PhosphoDecoy modifications from search params so MzIdentMLFile.store accepts them — the modification on the hits round-trips intact), has_alanine, and a scan-number-tolerant validate_spectrum_refs (matches the tools' own scan-based resolver, so it never aborts a run the tools could score).
  • The three CLIs and run_all_localizers / merge_algorithm_results are wired through those helpers.
  • Decoy gating: default no-decoy; --add-decoys + Alanine present → decoy pack; --add-decoys + no Alanine → warns and falls back (never fails).
  • merge_algorithm_results now propagates all four score keys (AScore_site_scores, PhosphoRS_site_probs, PhosphoRS_site_delta, Luciphor_site_scores).

2. Localizer performance optimizations (numerically identical)

Profiling-guided hot-path fixes; all bit-exact vs baseline (max abs diff 0.0 across 7,394 hits each):

  • AScore: numpy array matching replaces millions of per-peak pyOpenMS getMZ()/size() calls; per-window top-depth m/z precomputed once; single reused instance.
  • PhosphoRS: real memoization of binomial_tail_probability (previously a dead no-op cache), constant-tolerance hoist, isoform theoretical-m/z caching.
  • LucXor: per-charge density model + constants hoisted out of the per-peak loop (model-aware for CID Gaussian vs HCD non-parametric kernel).

Testing

Full suite: 187 passed, 0 failed. New tests cover the format-agnostic round-trips, the PhosphoDecoy mzid store, scan-number-tolerant validation, and the decoy fallback.

Note on scope

This branch is far ahead of main and also carries prior PhosphoRS FLR investigation history, so the diff is large. The substantive new changes are the two sections above (onsite/mzid_adapter.py, onsite/onsitec.py, the three cli.py files, the perf changes, and tests/).

Summary by CodeRabbit

  • New Features

    • Added support for mzid format alongside idXML for identification file handling
    • Added spectrum reference validation against mzML data
  • Performance Improvements

    • Implemented precomputed m/z arrays for peptide modification site scoring
    • Added caching for repeated theoretical ion generation
    • Optimized charge model lookups in scoring loops
  • Tests

    • Added comprehensive test suite for identification format handling and integrated workflows

ypriverol added 14 commits June 20, 2026 09:15
…/LucXor

Profiling on PXD000138 file 1 (10,735 MS2) identified one dominant hotspot per
algorithm; all three fixes are numerically exact (max abs diff 0.0 vs baseline
across 7,394 hits) and the full suite (178 tests) passes.

AScore: rewrite numberOfMatchedIons_ on numpy m/z arrays (one get_peaks() vs
millions of Peak1D.getMZ()/MSSpectrum.size() binding calls); precompute each
window's top-i m/z arrays once instead of copy+sort x10 per window; reuse a
single AScore instance across PSMs (single-thread path) instead of rebuilding
the TheoreticalSpectrumGenerator per PSM.

PhosphoRS: memoize binomial_tail_probability on (k,n,p) — the prior cache was a
dead no-op, so every call recomputed the scipy binomial (62% of runtime); hoist
the constant Da tolerance out of the 3.3M-call per-ion closure; cache
charge-validated isoform theoretical m/z per (seq,charge) so they aren't
regenerated for both depth-selection and final scoring.

LucXor: hoist the per-charge density model + constants out of the per-peak loop
(eliminates 1.16M redundant get_charge_model lookups); dispatch on model type so
the CID Gaussian inlining is not applied to the HCD non-parametric kernel model.

Verified: bit-exact output; pytest tests/ -> 178 passed; CodeRabbit -> no findings.
…nups

- Finding 1: replace exact-string matching in validate_spectrum_refs with
  scan-number-tolerant matching (_extract_scan_number helper + native scan
  number set); a compact ref 'scan=N' now resolves against full mzML
  nativeIDs such as 'controllerType=0 controllerNumber=1 scan=N'
- Finding 2: replace unused 'import os' with 'import re' in mzid_adapter.py
- Finding 3: remove stale IdXMLFile import from lucxor/cli.py
- Finding 4: strengthen test_add_decoys_falls_back_when_no_ala to assert
  the '--add-decoys / no Alanine' fallback warning is emitted via capsys
- Finding 5: add in-place mutation note to _strip_custom_variable_mods docstring
- New regression test: test_validate_spectrum_refs_scan_number_tolerant
@qodo-code-review

Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Introduces onsite/mzid_adapter.py as a format-agnostic identification I/O layer supporting both mzIdentML and IdXML, including spectrum-reference validation and alanine detection. All three localizer CLIs and the top-level orchestrator are wired to use the adapter. A new run_all_localizers function replaces inline orchestration. AScore, PhosphoRS, and LucXor scoring hot paths are optimized with numpy two-pointer matching and bounded memoization caches.

Changes

mzid Adapter, I/O Wiring, Orchestration, and Scoring Performance

Layer / File(s) Summary
mzid_adapter: format detection, I/O, spectrum validation
onsite/mzid_adapter.py
New module adds is_mzid, load_identifications/store_identifications dispatching to IdXMLFile/MzIdentMLFile, PhosphoDecoy stripping on mzid write, has_alanine, scan-number extraction helpers, SpectrumRefError, ValidationResult, and validate_spectrum_refs resolving PSM spectrum references against an mzML by exact native-ID or scan-number match.
Wire all CLIs and onsitec to mzid_adapter I/O
onsite/ascore/cli.py, onsite/phosphors/cli.py, onsite/lucxor/cli.py, onsite/onsitec.py
All four modules replace direct IdXMLFile().load/store calls with load_identifications/store_identifications from onsite.mzid_adapter; onsitec.py also imports has_alanine and validate_spectrum_refs.
run_all_localizers orchestration and merge guard fixes
onsite/onsitec.py, onsite/lucxor/cli.py
run_all_localizers added to load identifications, validate spectrum refs, conditionally disable decoys based on alanine presence, and run all three localizers sequentially before merging; the all CLI command delegates to it. merge_algorithm_results now guards each score meta-key write on key presence. LucXor CLI's compute_all_scores path delegates to run_all_algorithms_from_single_cli.
AScore CLI: shared instance and injection
onsite/ascore/cli.py
Constructs a single shared_ascore instance once per CLI invocation; passes it via ascore_instance to process_peptide_identification, which now accepts the optional parameter and reuses it rather than constructing a new instance per PSM.
AScore: numpy two-pointer ion matching
onsite/ascore/ascore.py
compute() precomputes windows_depth_mz via spectrumMZArray_ and windowDepthMZArrays_; countMatchedMZ_ implements two-pointer float64 matching with absolute/ppm tolerance, replacing numberOfMatchedIons_ which rebuilt MSSpectrum objects per call; calculatePermutationPeptideScores_ sums countMatchedMZ_ across windows per depth.
PhosphoRS: binomial-tail cache, tolerance hoisting, isoform m/z memoization
onsite/phosphors/phosphors.py
Adds bounded FIFO _binomial_tail_cache with _store_binomial_tail covering all return paths; _count_matched_ions hoists tolerance to a constant Da path or per-mz ppm closure; _isoform_theo_mz gains optional cache/cache_tag parameters threaded through _reduce_by_peak_depth_optimization and calculate_phospho_localization_compomics_style.
LucXor PSM: hoist density callables out of per-peak loop
onsite/lucxor/psm.py
score_permutations dispatches once between HCD and CID density strategies before the inner peak loop, binding _intensity_density and _distance_density callables used for all subsequent matched-peak scoring.
Tests for mzid_adapter and integration flows
tests/test_mzid_adapter.py
Covers load/store round-trips (idXML and mzid), PhosphoDecoy preservation, has_alanine detection, spectrum-reference mismatch raising SpectrumRefError, compact scan=N reference resolution, AScore CLI accepting mzid I/O, run_all_localizers writing three score keys, and add_decoys fallback when no alanine via monkeypatching.

Sequence Diagram(s)

sequenceDiagram
  participant CLI as all / run_all_localizers
  participant adapter as mzid_adapter
  participant mzML as MSExperiment
  participant AScore as ascore CLI
  participant PhosphoRS as phosphors CLI
  participant LucXor as lucxor CLI
  participant merge as merge_algorithm_results

  CLI->>adapter: load_identifications(id_file)
  adapter-->>CLI: prot_ids, pep_ids
  CLI->>adapter: validate_spectrum_refs(pep_ids, mzml_path)
  adapter->>mzML: MzMLFile().load(mzml_path)
  mzML-->>adapter: spectra with nativeIDs
  adapter-->>CLI: ValidationResult
  CLI->>adapter: has_alanine(pep_ids)
  adapter-->>CLI: True / False
  alt no alanine
    CLI-->>CLI: disable add_decoys, emit warning
  end
  CLI->>AScore: ctx.invoke(ascore, ...)
  AScore-->>CLI: temp ascore idXML
  CLI->>PhosphoRS: ctx.invoke(phosphors, ...)
  PhosphoRS-->>CLI: temp phosphors idXML
  CLI->>LucXor: ctx.invoke(lucxor, ...)
  LucXor-->>CLI: temp lucxor idXML
  CLI->>merge: ascore_out, phosphors_out, lucxor_out
  merge->>adapter: store_identifications(out_file, ...)
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • bigbio/onsite#6: Introduces the initial AScore two-pointer matched-ion style, which this PR extends by replacing numberOfMatchedIons_ with countMatchedMZ_ over precomputed numpy arrays.
  • bigbio/onsite#22: Adds the --compute-all-scores CLI option and delegation to onsitec; this PR extends that path in LucXor CLI and replaces the inlined orchestration in onsitec.py with run_all_localizers.
  • bigbio/onsite#45: Modifies _isoform_theo_mz in phosphors.py's charge-validated ion generation, the same function this PR wraps with optional memoization cache parameters.

Suggested labels

Review effort 5/5

Suggested reviewers

  • weizhongchun

Poem

🐇 Hoppin' through formats, mzid or XML,
No more rebuilding MSSpectrum on the fly!
Two-pointer arrays zip through peaks with glee,
Cached binomial tails set the scoring free.
One adapter rules them all — load, store, validate,
The rabbit merges three scores and calls it great! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 69.09% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the two main feature sets: format-agnostic idXML/mzIdentML I/O support and localizer performance optimizations, matching the substantive changes across 7 files and test coverage.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/phosphors-flr-site-prob-ranking
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch fix/phosphors-flr-site-prob-ranking

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codacy-production

codacy-production Bot commented Jun 20, 2026

Copy link
Copy Markdown

Not up to standards ⛔

🔴 Issues 21 high · 8 medium · 2 minor

Alerts:
⚠ 31 issues (≤ 0 issues of at least minor severity)

Results:
31 new issues

Category Results
UnusedCode 3 medium
2 minor
ErrorProne 3 high
Security 18 high
Complexity 5 medium

View in Codacy

🟢 Metrics 88 complexity · 0 duplication

Metric Results
Complexity 88
Duplication 0

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
onsite/lucxor/cli.py (1)

287-290: 💤 Low value

Redundant condition: compute_all_scores is always False here.

When compute_all_scores=True, the function returns early at line 253. If execution reaches line 288, compute_all_scores is guaranteed to be False, making the condition redundant.

♻️ Suggested simplification
-        # Only call sys.exit if not being called from compute_all_scores
-        if not compute_all_scores:
-            sys.exit(exit_code)
-        return exit_code
+        sys.exit(exit_code)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@onsite/lucxor/cli.py` around lines 287 - 290, The condition checking `if not
compute_all_scores:` before calling `sys.exit(exit_code)` is redundant because
the function returns early when `compute_all_scores` is True, meaning execution
can only reach this point when `compute_all_scores` is False. Remove the
conditional check and call `sys.exit(exit_code)` directly without the if
statement, keeping only the `return exit_code` line after it if needed for
consistency.
onsite/mzid_adapter.py (1)

99-103: 💤 Low value

Minor inefficiency: _extract_scan_number called twice per native ID.

The set comprehension calls _extract_scan_number(nid) twice for each native ID—once in the condition and once for the value. For large mzML files this doubles regex operations unnecessarily.

♻️ Suggested optimization
-    native_scan_numbers = {
-        _extract_scan_number(nid)
-        for nid in native_ids
-        if _extract_scan_number(nid) is not None
-    }
+    native_scan_numbers = {
+        scan for scan in (_extract_scan_number(nid) for nid in native_ids) if scan is not None
+    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@onsite/mzid_adapter.py` around lines 99 - 103, The set comprehension for
native_scan_numbers calls _extract_scan_number(nid) twice for each native
ID—once in the if condition and once for the set value, which doubles the regex
operations unnecessarily. Refactor the comprehension to use the walrus operator
(`:=`) to extract and assign the scan number once in the condition, then
reference that assigned variable in the set value to eliminate the redundant
function call.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@onsite/lucxor/cli.py`:
- Around line 287-290: The condition checking `if not compute_all_scores:`
before calling `sys.exit(exit_code)` is redundant because the function returns
early when `compute_all_scores` is True, meaning execution can only reach this
point when `compute_all_scores` is False. Remove the conditional check and call
`sys.exit(exit_code)` directly without the if statement, keeping only the
`return exit_code` line after it if needed for consistency.

In `@onsite/mzid_adapter.py`:
- Around line 99-103: The set comprehension for native_scan_numbers calls
_extract_scan_number(nid) twice for each native ID—once in the if condition and
once for the set value, which doubles the regex operations unnecessarily.
Refactor the comprehension to use the walrus operator (`:=`) to extract and
assign the scan number once in the condition, then reference that assigned
variable in the set value to eliminate the redundant function call.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 45bf935a-2fbd-4204-853b-b05186716552

📥 Commits

Reviewing files that changed from the base of the PR and between 40a56b4 and 1e01552.

📒 Files selected for processing (10)
  • .gitignore
  • onsite/ascore/ascore.py
  • onsite/ascore/cli.py
  • onsite/lucxor/cli.py
  • onsite/lucxor/psm.py
  • onsite/mzid_adapter.py
  • onsite/onsitec.py
  • onsite/phosphors/cli.py
  • onsite/phosphors/phosphors.py
  • tests/test_mzid_adapter.py

@ypriverol ypriverol marked this pull request as draft June 20, 2026 15:36
@ypriverol

Copy link
Copy Markdown
Member Author

Parking this PR. main has since migrated identification I/O to idParquet (PRs #44-#49, new onsite/idparquet.py), which supersedes this branch's pyOpenMS-object mzid approach. Plan: rebuild on main's idParquet base as a unified format layer supporting idXML + mzIdentML + idParquet, carrying the performance optimizations forward. This branch is kept for reference.

@ypriverol

Copy link
Copy Markdown
Member Author

Superseded by #52, which rebuilds this on main's idParquet architecture as a unified idXML/mzIdentML/idParquet I/O layer (the format-agnostic approach you requested), carrying the performance optimizations forward. Closing in favor of #52.

@ypriverol ypriverol closed this Jun 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant