Format-agnostic idXML/mzIdentML I/O + localizer performance optimizations by ypriverol · Pull Request #51 · bigbio/onsite

ypriverol · 2026-06-20T15:15:41Z

Summary

This branch adds two independent bodies of work on top of the existing PhosphoRS/AScore/LucXor pipeline.

1. Format-agnostic idXML / mzIdentML I/O (new feature)

The pipeline now reads and writes either idXML or mzIdentML by file extension:

onsite all -in spectra.mzML -id ids.mzid -out scored.mzid [--add-decoys]

New onsite/mzid_adapter.py: load_identifications / store_identifications (extension dispatch; the mzid store strips non-UNIMOD custom PhosphoDecoy modifications from search params so MzIdentMLFile.store accepts them — the modification on the hits round-trips intact), has_alanine, and a scan-number-tolerant validate_spectrum_refs (matches the tools' own scan-based resolver, so it never aborts a run the tools could score).
The three CLIs and run_all_localizers / merge_algorithm_results are wired through those helpers.
Decoy gating: default no-decoy; --add-decoys + Alanine present → decoy pack; --add-decoys + no Alanine → warns and falls back (never fails).
merge_algorithm_results now propagates all four score keys (AScore_site_scores, PhosphoRS_site_probs, PhosphoRS_site_delta, Luciphor_site_scores).

2. Localizer performance optimizations (numerically identical)

Profiling-guided hot-path fixes; all bit-exact vs baseline (max abs diff 0.0 across 7,394 hits each):

AScore: numpy array matching replaces millions of per-peak pyOpenMS getMZ()/size() calls; per-window top-depth m/z precomputed once; single reused instance.
PhosphoRS: real memoization of binomial_tail_probability (previously a dead no-op cache), constant-tolerance hoist, isoform theoretical-m/z caching.
LucXor: per-charge density model + constants hoisted out of the per-peak loop (model-aware for CID Gaussian vs HCD non-parametric kernel).

Testing

Full suite: 187 passed, 0 failed. New tests cover the format-agnostic round-trips, the PhosphoDecoy mzid store, scan-number-tolerant validation, and the decoy fallback.

Note on scope

This branch is far ahead of main and also carries prior PhosphoRS FLR investigation history, so the diff is large. The substantive new changes are the two sections above (onsite/mzid_adapter.py, onsite/onsitec.py, the three cli.py files, the perf changes, and tests/).

Summary by CodeRabbit

New Features
- Added support for mzid format alongside idXML for identification file handling
- Added spectrum reference validation against mzML data
Performance Improvements
- Implemented precomputed m/z arrays for peptide modification site scoring
- Added caching for repeated theoretical ion generation
- Optimized charge model lookups in scoring loops
Tests
- Added comprehensive test suite for identification format handling and integrated workflows

…ripts)

…/LucXor Profiling on PXD000138 file 1 (10,735 MS2) identified one dominant hotspot per algorithm; all three fixes are numerically exact (max abs diff 0.0 vs baseline across 7,394 hits) and the full suite (178 tests) passes. AScore: rewrite numberOfMatchedIons_ on numpy m/z arrays (one get_peaks() vs millions of Peak1D.getMZ()/MSSpectrum.size() binding calls); precompute each window's top-i m/z arrays once instead of copy+sort x10 per window; reuse a single AScore instance across PSMs (single-thread path) instead of rebuilding the TheoreticalSpectrumGenerator per PSM. PhosphoRS: memoize binomial_tail_probability on (k,n,p) — the prior cache was a dead no-op, so every call recomputed the scipy binomial (62% of runtime); hoist the constant Da tolerance out of the 3.3M-call per-ion closure; cache charge-validated isoform theoretical m/z per (seq,charge) so they aren't regenerated for both depth-selection and final scoring. LucXor: hoist the per-charge density model + constants out of the per-peak loop (eliminates 1.16M redundant get_charge_model lookups); dispatch on model type so the CID Gaussian inlining is not applied to the HCD non-parametric kernel model. Verified: bit-exact output; pytest tests/ -> 178 passed; CodeRabbit -> no findings.

…in merge

…ats)

…lpers)

…t in run_all_localizers

…nups - Finding 1: replace exact-string matching in validate_spectrum_refs with scan-number-tolerant matching (_extract_scan_number helper + native scan number set); a compact ref 'scan=N' now resolves against full mzML nativeIDs such as 'controllerType=0 controllerNumber=1 scan=N' - Finding 2: replace unused 'import os' with 'import re' in mzid_adapter.py - Finding 3: remove stale IdXMLFile import from lucxor/cli.py - Finding 4: strengthen test_add_decoys_falls_back_when_no_ala to assert the '--add-decoys / no Alanine' fallback warning is emitted via capsys - Finding 5: add in-place mutation note to _strip_custom_variable_mods docstring - New regression test: test_validate_spectrum_refs_scan_number_tolerant

qodo-code-review · 2026-06-20T15:15:45Z

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

coderabbitai · 2026-06-20T15:15:56Z

📝 Walkthrough

Walkthrough

Introduces onsite/mzid_adapter.py as a format-agnostic identification I/O layer supporting both mzIdentML and IdXML, including spectrum-reference validation and alanine detection. All three localizer CLIs and the top-level orchestrator are wired to use the adapter. A new run_all_localizers function replaces inline orchestration. AScore, PhosphoRS, and LucXor scoring hot paths are optimized with numpy two-pointer matching and bounded memoization caches.

Changes

mzid Adapter, I/O Wiring, Orchestration, and Scoring Performance

Layer / File(s)	Summary
mzid_adapter: format detection, I/O, spectrum validation `onsite/mzid_adapter.py`	New module adds `is_mzid`, `load_identifications`/`store_identifications` dispatching to `IdXMLFile`/`MzIdentMLFile`, `PhosphoDecoy` stripping on mzid write, `has_alanine`, scan-number extraction helpers, `SpectrumRefError`, `ValidationResult`, and `validate_spectrum_refs` resolving PSM spectrum references against an mzML by exact native-ID or scan-number match.
Wire all CLIs and onsitec to mzid_adapter I/O `onsite/ascore/cli.py`, `onsite/phosphors/cli.py`, `onsite/lucxor/cli.py`, `onsite/onsitec.py`	All four modules replace direct `IdXMLFile().load`/`store` calls with `load_identifications`/`store_identifications` from `onsite.mzid_adapter`; `onsitec.py` also imports `has_alanine` and `validate_spectrum_refs`.
run_all_localizers orchestration and merge guard fixes `onsite/onsitec.py`, `onsite/lucxor/cli.py`	`run_all_localizers` added to load identifications, validate spectrum refs, conditionally disable decoys based on alanine presence, and run all three localizers sequentially before merging; the `all` CLI command delegates to it. `merge_algorithm_results` now guards each score meta-key write on key presence. LucXor CLI's `compute_all_scores` path delegates to `run_all_algorithms_from_single_cli`.
AScore CLI: shared instance and injection `onsite/ascore/cli.py`	Constructs a single `shared_ascore` instance once per CLI invocation; passes it via `ascore_instance` to `process_peptide_identification`, which now accepts the optional parameter and reuses it rather than constructing a new instance per PSM.
AScore: numpy two-pointer ion matching `onsite/ascore/ascore.py`	`compute()` precomputes `windows_depth_mz` via `spectrumMZArray_` and `windowDepthMZArrays_`; `countMatchedMZ_` implements two-pointer float64 matching with absolute/ppm tolerance, replacing `numberOfMatchedIons_` which rebuilt `MSSpectrum` objects per call; `calculatePermutationPeptideScores_` sums `countMatchedMZ_` across windows per depth.
PhosphoRS: binomial-tail cache, tolerance hoisting, isoform m/z memoization `onsite/phosphors/phosphors.py`	Adds bounded FIFO `_binomial_tail_cache` with `_store_binomial_tail` covering all return paths; `_count_matched_ions` hoists tolerance to a constant Da path or per-mz ppm closure; `_isoform_theo_mz` gains optional `cache`/`cache_tag` parameters threaded through `_reduce_by_peak_depth_optimization` and `calculate_phospho_localization_compomics_style`.
LucXor PSM: hoist density callables out of per-peak loop `onsite/lucxor/psm.py`	`score_permutations` dispatches once between HCD and CID density strategies before the inner peak loop, binding `_intensity_density` and `_distance_density` callables used for all subsequent matched-peak scoring.
Tests for mzid_adapter and integration flows `tests/test_mzid_adapter.py`	Covers load/store round-trips (idXML and mzid), `PhosphoDecoy` preservation, `has_alanine` detection, spectrum-reference mismatch raising `SpectrumRefError`, compact `scan=N` reference resolution, AScore CLI accepting mzid I/O, `run_all_localizers` writing three score keys, and `add_decoys` fallback when no alanine via monkeypatching.

Sequence Diagram(s)

sequenceDiagram
  participant CLI as all / run_all_localizers
  participant adapter as mzid_adapter
  participant mzML as MSExperiment
  participant AScore as ascore CLI
  participant PhosphoRS as phosphors CLI
  participant LucXor as lucxor CLI
  participant merge as merge_algorithm_results

  CLI->>adapter: load_identifications(id_file)
  adapter-->>CLI: prot_ids, pep_ids
  CLI->>adapter: validate_spectrum_refs(pep_ids, mzml_path)
  adapter->>mzML: MzMLFile().load(mzml_path)
  mzML-->>adapter: spectra with nativeIDs
  adapter-->>CLI: ValidationResult
  CLI->>adapter: has_alanine(pep_ids)
  adapter-->>CLI: True / False
  alt no alanine
    CLI-->>CLI: disable add_decoys, emit warning
  end
  CLI->>AScore: ctx.invoke(ascore, ...)
  AScore-->>CLI: temp ascore idXML
  CLI->>PhosphoRS: ctx.invoke(phosphors, ...)
  PhosphoRS-->>CLI: temp phosphors idXML
  CLI->>LucXor: ctx.invoke(lucxor, ...)
  LucXor-->>CLI: temp lucxor idXML
  CLI->>merge: ascore_out, phosphors_out, lucxor_out
  merge->>adapter: store_identifications(out_file, ...)

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

bigbio/onsite#6: Introduces the initial AScore two-pointer matched-ion style, which this PR extends by replacing numberOfMatchedIons_ with countMatchedMZ_ over precomputed numpy arrays.
bigbio/onsite#22: Adds the --compute-all-scores CLI option and delegation to onsitec; this PR extends that path in LucXor CLI and replaces the inlined orchestration in onsitec.py with run_all_localizers.
bigbio/onsite#45: Modifies _isoform_theo_mz in phosphors.py's charge-validated ion generation, the same function this PR wraps with optional memoization cache parameters.

Suggested labels

Review effort 5/5

Suggested reviewers

weizhongchun

Poem

🐇 Hoppin' through formats, mzid or XML,
No more rebuilding MSSpectrum on the fly!
Two-pointer arrays zip through peaks with glee,
Cached binomial tails set the scoring free.
One adapter rules them all — load, store, validate,
The rabbit merges three scores and calls it great! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 69.09% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the two main feature sets: format-agnostic idXML/mzIdentML I/O support and localizer performance optimizations, matching the substantive changes across 7 files and test coverage.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/phosphors-flr-site-prob-ranking

⚔️ Resolve merge conflicts

Resolve merge conflict in branch fix/phosphors-flr-site-prob-ranking

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codacy-production · 2026-06-20T15:16:59Z

Not up to standards ⛔

🔴 Issues 21 high · 8 medium · 2 minor

Alerts:
⚠ 31 issues (≤ 0 issues of at least minor severity)

Results:
31 new issues

Category Results

UnusedCode 3 medium
2 minor

ErrorProne 3 high

Security 18 high

Complexity 5 medium

View in Codacy

🟢 Metrics 88 complexity · 0 duplication

Metric Results

Complexity 88

Duplication 0

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

coderabbitai

🧹 Nitpick comments (2)

onsite/lucxor/cli.py (1)

287-290: 💤 Low value

Redundant condition: compute_all_scores is always False here.

When compute_all_scores=True, the function returns early at line 253. If execution reaches line 288, compute_all_scores is guaranteed to be False, making the condition redundant.

♻️ Suggested simplification

-        # Only call sys.exit if not being called from compute_all_scores
-        if not compute_all_scores:
-            sys.exit(exit_code)
-        return exit_code
+        sys.exit(exit_code)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@onsite/lucxor/cli.py` around lines 287 - 290, The condition checking `if not
compute_all_scores:` before calling `sys.exit(exit_code)` is redundant because
the function returns early when `compute_all_scores` is True, meaning execution
can only reach this point when `compute_all_scores` is False. Remove the
conditional check and call `sys.exit(exit_code)` directly without the if
statement, keeping only the `return exit_code` line after it if needed for
consistency.

onsite/mzid_adapter.py (1)

99-103: 💤 Low value

Minor inefficiency: _extract_scan_number called twice per native ID.

The set comprehension calls _extract_scan_number(nid) twice for each native ID—once in the condition and once for the value. For large mzML files this doubles regex operations unnecessarily.

♻️ Suggested optimization

-    native_scan_numbers = {
-        _extract_scan_number(nid)
-        for nid in native_ids
-        if _extract_scan_number(nid) is not None
-    }
+    native_scan_numbers = {
+        scan for scan in (_extract_scan_number(nid) for nid in native_ids) if scan is not None
+    }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@onsite/mzid_adapter.py` around lines 99 - 103, The set comprehension for
native_scan_numbers calls _extract_scan_number(nid) twice for each native
ID—once in the if condition and once for the set value, which doubles the regex
operations unnecessarily. Refactor the comprehension to use the walrus operator
(`:=`) to extract and assign the scan number once in the condition, then
reference that assigned variable in the set value to eliminate the redundant
function call.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@onsite/lucxor/cli.py`:
- Around line 287-290: The condition checking `if not compute_all_scores:`
before calling `sys.exit(exit_code)` is redundant because the function returns
early when `compute_all_scores` is True, meaning execution can only reach this
point when `compute_all_scores` is False. Remove the conditional check and call
`sys.exit(exit_code)` directly without the if statement, keeping only the
`return exit_code` line after it if needed for consistency.

In `@onsite/mzid_adapter.py`:
- Around line 99-103: The set comprehension for native_scan_numbers calls
_extract_scan_number(nid) twice for each native ID—once in the if condition and
once for the set value, which doubles the regex operations unnecessarily.
Refactor the comprehension to use the walrus operator (`:=`) to extract and
assign the scan number once in the condition, then reference that assigned
variable in the set value to eliminate the redundant function call.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 45bf935a-2fbd-4204-853b-b05186716552

📥 Commits

Reviewing files that changed from the base of the PR and between 40a56b4 and 1e01552.

📒 Files selected for processing (10)

.gitignore
onsite/ascore/ascore.py
onsite/ascore/cli.py
onsite/lucxor/cli.py
onsite/lucxor/psm.py
onsite/mzid_adapter.py
onsite/onsitec.py
onsite/phosphors/cli.py
onsite/phosphors/phosphors.py
tests/test_mzid_adapter.py

ypriverol · 2026-06-20T15:36:13Z

Parking this PR. main has since migrated identification I/O to idParquet (PRs #44-#49, new onsite/idparquet.py), which supersedes this branch's pyOpenMS-object mzid approach. Plan: rebuild on main's idParquet base as a unified format layer supporting idXML + mzIdentML + idParquet, carrying the performance optimizations forward. This branch is kept for reference.

ypriverol · 2026-06-20T18:07:37Z

Superseded by #52, which rebuilds this on main's idParquet architecture as a unified idXML/mzIdentML/idParquet I/O layer (the format-agnostic approach you requested), carrying the performance optimizations forward. Closing in favor of #52.

ypriverol added 14 commits June 20, 2026 09:15

chore: gitignore benchmarks/ (large mzML/idXML data + reproduction sc…

0604b68

…ripts)

docs: design spec for mzIdentML adapter (three localizers, decoy-gated)

0b1e350

docs: implementation plan for mzIdentML adapter

278289d

refactor(onsitec): extract run_all_localizers from all command

b530d7e

fix(onsitec): restore all-command banners, copy PhosphoRS_site_delta …

1e5201a

…in merge

feat(mzid): mzid<->idXML conversion helpers with Ala detection

ae57e89

docs: revise mzid design to format-agnostic I/O (read/write both form…

57ab20e

…ats)

feat(mzid): format-agnostic load/store helpers (replace conversion he…

e672012

…lpers)

feat(mzid): spectrum-reference validation against mzML native IDs

6b0c193

feat(mzid): wire format-agnostic load/store into the three CLIs

8c1f253

feat(mzid): decoy guard + spectrum validation + format-agnostic outpu…

5b773aa

…t in run_all_localizers

chore: remove mzid design spec and implementation plan from repo

1e01552

coderabbitai Bot reviewed Jun 20, 2026

View reviewed changes

ypriverol marked this pull request as draft June 20, 2026 15:36

ypriverol mentioned this pull request Jun 20, 2026

Format-agnostic identification I/O: idXML / mzIdentML / idParquet #52

Merged

ypriverol closed this Jun 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Format-agnostic idXML/mzIdentML I/O + localizer performance optimizations#51

Format-agnostic idXML/mzIdentML I/O + localizer performance optimizations#51
ypriverol wants to merge 14 commits into
mainfrom
fix/phosphors-flr-site-prob-ranking

ypriverol commented Jun 20, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

qodo-code-review Bot commented Jun 20, 2026

Uh oh!

coderabbitai Bot commented Jun 20, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

codacy-production Bot commented Jun 20, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

ypriverol commented Jun 20, 2026

Uh oh!

ypriverol commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ypriverol commented Jun 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. Format-agnostic idXML / mzIdentML I/O (new feature)

2. Localizer performance optimizations (numerically identical)

Testing

Note on scope

Summary by CodeRabbit

Uh oh!

qodo-code-review Bot commented Jun 20, 2026

Qodo reviews are paused for this user.

Uh oh!

coderabbitai Bot commented Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

codacy-production Bot commented Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Not up to standards ⛔

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

ypriverol commented Jun 20, 2026

Uh oh!

ypriverol commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ypriverol commented Jun 20, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 20, 2026 •

edited

Loading

codacy-production Bot commented Jun 20, 2026 •

edited

Loading