Format-agnostic idXML/mzIdentML I/O + localizer performance optimizations #51
ypriverol wants to merge 14 commits into
…/LucXor

Profiling on PXD000138 file 1 (10,735 MS2) identified one dominant hotspot per algorithm; all three fixes are numerically exact (max abs diff 0.0 vs baseline across 7,394 hits) and the full suite (178 tests) passes.

AScore: rewrite numberOfMatchedIons_ on numpy m/z arrays (one get_peaks() vs millions of Peak1D.getMZ()/MSSpectrum.size() binding calls); precompute each window's top-i m/z arrays once instead of copy+sort x10 per window; reuse a single AScore instance across PSMs (single-thread path) instead of rebuilding the TheoreticalSpectrumGenerator per PSM.

PhosphoRS: memoize binomial_tail_probability on (k,n,p); the prior cache was a dead no-op, so every call recomputed the scipy binomial (62% of runtime). Hoist the constant Da tolerance out of the 3.3M-call per-ion closure; cache charge-validated isoform theoretical m/z per (seq,charge) so they aren't regenerated for both depth-selection and final scoring.

LucXor: hoist the per-charge density model + constants out of the per-peak loop (eliminates 1.16M redundant get_charge_model lookups); dispatch on model type so the CID Gaussian inlining is not applied to the HCD non-parametric kernel model.

Verified: bit-exact output; pytest tests/ -> 178 passed; CodeRabbit -> no findings.
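The PhosphoRS memoization described in this commit message can be illustrated with a minimal sketch. It is not the code from the branch: it assumes the function computes the upper tail P(X >= k) of a Binomial(n, p), and it uses `math.comb` from the standard library in place of the scipy call so that it runs without dependencies. Only the name `binomial_tail_probability` and the `(k, n, p)` cache key come from the commit message.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def binomial_tail_probability(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), cached on the (k, n, p) triple.

    Assumption: the real function computes this upper tail; the branch itself
    calls scipy, which is replaced by an explicit sum here.
    """
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


# The second call with identical arguments is answered from the cache.
first = binomial_tail_probability(3, 10, 0.1)
second = binomial_tail_probability(3, 10, 0.1)
cache_hits = binomial_tail_probability.cache_info().hits
```

Because the cache key includes the float `p`, this only pays off when the same probability value recurs exactly, which is plausible when `p` is derived from a fixed tolerance and window size.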
…t in run_all_localizers
…nups
- Finding 1: replace exact-string matching in validate_spectrum_refs with scan-number-tolerant matching (_extract_scan_number helper + native scan number set); a compact ref 'scan=N' now resolves against full mzML nativeIDs such as 'controllerType=0 controllerNumber=1 scan=N'
- Finding 2: replace unused 'import os' with 'import re' in mzid_adapter.py
- Finding 3: remove stale IdXMLFile import from lucxor/cli.py
- Finding 4: strengthen test_add_decoys_falls_back_when_no_ala to assert the '--add-decoys / no Alanine' fallback warning is emitted via capsys
- Finding 5: add in-place mutation note to _strip_custom_variable_mods docstring
- New regression test: test_validate_spectrum_refs_scan_number_tolerant
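The scan-number-tolerant matching from Finding 1 could look roughly like the sketch below. It is an illustration under assumptions, not the code in `mzid_adapter.py`: the `scan=(\d+)` pattern, the helper name `extract_scan_number`, and the function `unmatched_refs` are hypothetical, and plain strings stand in for pyOpenMS objects.

```python
import re
from typing import Iterable, List, Optional

# Assumed pattern: the scan number follows a literal "scan=" token.
_SCAN_RE = re.compile(r"scan=(\d+)")


def extract_scan_number(ref: str) -> Optional[int]:
    """Return the integer after 'scan=' in a spectrum reference, or None."""
    match = _SCAN_RE.search(ref)
    return int(match.group(1)) if match else None


def unmatched_refs(refs: Iterable[str], native_ids: Iterable[str]) -> List[str]:
    """List the refs that match no native ID exactly or by scan number."""
    native_ids = list(native_ids)
    exact = set(native_ids)
    scans = {s for s in map(extract_scan_number, native_ids) if s is not None}
    missing = []
    for ref in refs:
        if ref in exact:
            continue  # exact string match still counts
        scan = extract_scan_number(ref)
        if scan is None or scan not in scans:
            missing.append(ref)
    return missing


native = ["controllerType=0 controllerNumber=1 scan=42"]
resolved = unmatched_refs(["scan=42"], native)
unresolved = unmatched_refs(["scan=7"], native)
```

Here the compact reference `scan=42` resolves against the full nativeID, while `scan=7` is reported as unmatched.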
📝 Walkthrough
Changes: mzid Adapter, I/O Wiring, Orchestration, and Scoring Performance
Sequence Diagram(s)
sequenceDiagram
participant CLI as all / run_all_localizers
participant adapter as mzid_adapter
participant mzML as MSExperiment
participant AScore as ascore CLI
participant PhosphoRS as phosphors CLI
participant LucXor as lucxor CLI
participant merge as merge_algorithm_results
CLI->>adapter: load_identifications(id_file)
adapter-->>CLI: prot_ids, pep_ids
CLI->>adapter: validate_spectrum_refs(pep_ids, mzml_path)
adapter->>mzML: MzMLFile().load(mzml_path)
mzML-->>adapter: spectra with nativeIDs
adapter-->>CLI: ValidationResult
CLI->>adapter: has_alanine(pep_ids)
adapter-->>CLI: True / False
alt no alanine
CLI-->>CLI: disable add_decoys, emit warning
end
CLI->>AScore: ctx.invoke(ascore, ...)
AScore-->>CLI: temp ascore idXML
CLI->>PhosphoRS: ctx.invoke(phosphors, ...)
PhosphoRS-->>CLI: temp phosphors idXML
CLI->>LucXor: ctx.invoke(lucxor, ...)
LucXor-->>CLI: temp lucxor idXML
CLI->>merge: ascore_out, phosphors_out, lucxor_out
merge->>adapter: store_identifications(out_file, ...)
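The `has_alanine` branch in the diagram above (disable `add_decoys` and warn rather than fail) can be sketched as follows. This is a simplified illustration: it assumes plain unmodified peptide strings instead of pyOpenMS peptide identifications, and `resolve_add_decoys` and the warning text are hypothetical.

```python
import warnings
from typing import Iterable


def has_alanine(sequences: Iterable[str]) -> bool:
    """True if any peptide sequence contains an Alanine residue ('A').

    Assumption: sequences are bare one-letter strings with no modification
    annotations that could themselves contain the letter 'A'.
    """
    return any("A" in seq for seq in sequences)


def resolve_add_decoys(add_decoys: bool, sequences: Iterable[str]) -> bool:
    """Return the effective add_decoys flag, warning instead of failing."""
    if add_decoys and not has_alanine(sequences):
        warnings.warn("--add-decoys requested but no Alanine found; continuing without decoys")
        return False
    return add_decoys
```

Returning the adjusted flag keeps the fallback decision in one place, so each localizer invocation receives the same effective setting.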
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Not up to standards ⛔
🔴 Issues

| Category | Results |
|---|---|
| UnusedCode | 3 medium 2 minor |
| ErrorProne | 3 high |
| Security | 18 high |
| Complexity | 5 medium |
🟢 Metrics 88 complexity · 0 duplication
🧹 Nitpick comments (2)
onsite/lucxor/cli.py (1)

287-290: 💤 Low value. Redundant condition: `compute_all_scores` is always `False` here.

When `compute_all_scores=True`, the function returns early at line 253. If execution reaches line 288, `compute_all_scores` is guaranteed to be `False`, making the condition redundant.

♻️ Suggested simplification

```diff
-    # Only call sys.exit if not being called from compute_all_scores
-    if not compute_all_scores:
-        sys.exit(exit_code)
-    return exit_code
+    sys.exit(exit_code)
```

onsite/mzid_adapter.py (1)

99-103: 💤 Low value. Minor inefficiency: `_extract_scan_number` called twice per native ID.

The set comprehension calls `_extract_scan_number(nid)` twice for each native ID, once in the condition and once for the value. For large mzML files this doubles regex operations unnecessarily.

♻️ Suggested optimization

```diff
-    native_scan_numbers = {
-        _extract_scan_number(nid)
-        for nid in native_ids
-        if _extract_scan_number(nid) is not None
-    }
+    native_scan_numbers = {
+        scan for scan in (_extract_scan_number(nid) for nid in native_ids) if scan is not None
+    }
```
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@onsite/lucxor/cli.py`:
- Around line 287-290: The condition checking `if not compute_all_scores:`
before calling `sys.exit(exit_code)` is redundant because the function returns
early when `compute_all_scores` is True, meaning execution can only reach this
point when `compute_all_scores` is False. Remove the conditional check and call
`sys.exit(exit_code)` directly without the if statement, keeping only the
`return exit_code` line after it if needed for consistency.
In `@onsite/mzid_adapter.py`:
- Around line 99-103: The set comprehension for native_scan_numbers calls
_extract_scan_number(nid) twice for each native ID—once in the if condition and
once for the set value, which doubles the regex operations unnecessarily.
Refactor the comprehension to use the walrus operator (`:=`) to extract and
assign the scan number once in the condition, then reference that assigned
variable in the set value to eliminate the redundant function call.
📒 Files selected for processing (10)
- .gitignore
- onsite/ascore/ascore.py
- onsite/ascore/cli.py
- onsite/lucxor/cli.py
- onsite/lucxor/psm.py
- onsite/mzid_adapter.py
- onsite/onsitec.py
- onsite/phosphors/cli.py
- onsite/phosphors/phosphors.py
- tests/test_mzid_adapter.py
Parking this PR. main has since migrated identification I/O to idParquet (PRs #44-#49, new onsite/idparquet.py), which supersedes this branch's pyOpenMS-object mzid approach. Plan: rebuild on main's idParquet base as a unified format layer supporting idXML + mzIdentML + idParquet, carrying the performance optimizations forward. This branch is kept for reference.
Summary
This branch adds two independent bodies of work on top of the existing PhosphoRS/AScore/LucXor pipeline.
1. Format-agnostic idXML / mzIdentML I/O (new feature)
The pipeline now reads and writes either idXML or mzIdentML by file extension:
- `onsite/mzid_adapter.py`: `load_identifications` / `store_identifications` (extension dispatch; the mzid store strips non-UNIMOD custom `PhosphoDecoy` modifications from search params so `MzIdentMLFile.store` accepts them, while the modification on the hits round-trips intact), `has_alanine`, and a scan-number-tolerant `validate_spectrum_refs` (matches the tools' own scan-based resolver, so it never aborts a run the tools could score).
- `run_all_localizers` / `merge_algorithm_results` are wired through those helpers.
- `--add-decoys` + Alanine present → decoy pack; `--add-decoys` + no Alanine → warns and falls back (never fails).
- `merge_algorithm_results` now propagates all four score keys (`AScore_site_scores`, `PhosphoRS_site_probs`, `PhosphoRS_site_delta`, `Luciphor_site_scores`).

2. Localizer performance optimizations (numerically identical)

Profiling-guided hot-path fixes; all bit-exact vs baseline (max abs diff 0.0 across 7,394 hits each):

- AScore: numpy m/z arrays in place of per-peak `getMZ()`/`size()` calls; per-window top-depth m/z precomputed once; single reused instance.
- PhosphoRS: memoized `binomial_tail_probability` (previously a dead no-op cache), constant-tolerance hoist, isoform theoretical-m/z caching.

Testing
Full suite: 187 passed, 0 failed. New tests cover the format-agnostic round-trips, the PhosphoDecoy mzid store, scan-number-tolerant validation, and the decoy fallback.
Note on scope
This branch is far ahead of `main` and also carries prior PhosphoRS FLR investigation history, so the diff is large. The substantive new changes are the two sections above (`onsite/mzid_adapter.py`, `onsite/onsitec.py`, the three `cli.py` files, the perf changes, and `tests/`).