
Support new QPX format and fix warnings#13

Merged
ypriverol merged 9 commits into bigbio:dev from Shen-YuFei:dev
Mar 21, 2026

Conversation

@Shen-YuFei (Contributor)

Support new QPX format and fix Bandit security warnings

Summary

Adapt mokume to the latest QPX parquet schema and fix Bandit warnings across the codebase.

Changes

New QPX format support

  • Detect and handle pg_accessions as list<struct> via list_transform(x -> x.accession)
  • Support is_decoy (bool) for optimized decoy filtering in SQLFilterBuilder
  • Support anchor_protein for protein-level grouping in get_low_frequency_peptides
  • Handle unique as bool (previously int); detect the new charge/run_file_name column names
  • Add comprehensive compatibility test suite (test_qpx_format_compat.py, 13 tests)

Bandit B608 — SQL injection prevention

  • Refactor SQLFilterBuilder.build_where_clause() to return (clause, params) tuple
  • Replace all f-string/.format() SQL construction with "".join() + parameterized execute(sql, params)
  • Convert HTML template in interactive.py from f-string to string.Template
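As a rough illustration of the (clause, params) pattern described above — the class and column names here are simplified assumptions, not the exact mokume SQLFilterBuilder API — a builder might accumulate conditions and parameters separately:

```python
# Hypothetical sketch of the (clause, params) pattern; names are illustrative,
# not the actual mokume SQLFilterBuilder implementation.
class SQLFilterBuilder:
    def __init__(self):
        self._conditions = []
        self._params = []

    def add_in_filter(self, column, values):
        # One "?" placeholder per value; the values travel separately as params.
        placeholders = ",".join(["?"] * len(values))
        self._conditions.append(column + " IN (" + placeholders + ")")
        self._params.extend(values)
        return self

    def add_decoy_filter(self, include_decoys=False):
        if not include_decoys:
            self._conditions.append("is_decoy = ?")
            self._params.append(False)
        return self

    def build_where_clause(self):
        if not self._conditions:
            return "", []
        clause = "WHERE " + " AND ".join(self._conditions)
        return clause, list(self._params)


clause, params = (
    SQLFilterBuilder()
    .add_in_filter("sample_accession", ["S1", "S2"])
    .add_decoy_filter()
    .build_where_clause()
)
print(clause)  # WHERE sample_accession IN (?,?) AND is_decoy = ?
print(params)  # ['S1', 'S2', False]
```

Because user-supplied values never enter the SQL string itself, Bandit's B608 heuristic has nothing to flag and the database driver handles quoting.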

Copilot AI review requested due to automatic review settings March 18, 2026 10:36
@coderabbitai

coderabbitai bot commented Mar 18, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


Copilot AI left a comment

Pull request overview

This PR updates mokume’s DuckDB/parquet ingestion and filtering to be compatible with the latest QPX schema (including pg_accessions as list<struct>, is_decoy, anchor_protein, and new column names), while refactoring query construction to address Bandit SQL-injection warnings via parameterized execution. It also adds/updates tests to cover compatibility across legacy and new QPX formats.

Changes:

  • Extend QPX parsing to normalize pg_accessions (list<struct> → list<string>), support anchor_protein, and surface is_decoy.
  • Refactor SQLFilterBuilder.build_where_clause() to return (clause, params) and update call sites to execute parameterized queries.
  • Add a new QPX format compatibility test suite and update existing tests for the new filter builder API.
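The struct-to-string normalization can be mimicked in plain pandas for intuition. This is a hedged sketch with made-up data, mirroring what DuckDB's list_transform(x -> x.accession) does on the SQL side:

```python
import pandas as pd

# Toy frame where pg_accessions is list<struct{accession, ...}>, as in new QPX.
df = pd.DataFrame({
    "pg_accessions": [
        [{"accession": "sp|P12345|ALBU_HUMAN"}],
        [{"accession": "sp|Q67890|TRFE_HUMAN"}],
    ]
})

# Equivalent of DuckDB's list_transform(pg_accessions, x -> x.accession):
# map each list of structs to a list of plain accession strings.
df["pg_accessions"] = df["pg_accessions"].apply(
    lambda lst: [elem["accession"] for elem in lst]
)

# Downstream code can now safely do .str[0] and split on "|".
first_acc = df["pg_accessions"].str[0]
print(first_acc.str.split("|").str[1].tolist())  # ['P12345', 'Q67890']
```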

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
mokume/io/feature.py Adds new/legacy QPX detection and normalization, parameterized filtering, and exposes optional new columns (is_decoy, anchor_protein).
mokume/quantification/ratio.py Updates QPX schema handling and switches to parameterized execution when appending filter clauses.
mokume/pipeline/stages.py Replaces f-string SQL with parameterized execute(sql, params) for filtered parquet view queries.
mokume/reports/interactive.py Replaces HTML f-string generation with string.Template substitution.
tests/test_qpx_format_compat.py Adds a new test suite covering both QPX schemas and deep-compat scenarios.
tests/test_peptide_normalize.py Updates tests to the new (clause, params) API and adjusts assertions.

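The string.Template conversion listed for interactive.py can be sketched as follows (the template markup and field names here are invented for illustration; the real file's HTML differs):

```python
import json
from string import Template

# Hypothetical report template; "$"-placeholders replace the old f-string
# fields, so literal braces in embedded HTML/JS no longer need escaping.
page = Template(
    "<html><body>\n"
    "<h1>$title</h1>\n"
    "<script>const data = $data_json;</script>\n"
    "</body></html>"
)

html = page.substitute(
    title="QPX report",
    data_json=json.dumps({"samples": ["S1", "S2"]}),  # JSON pre-dumped once
)
print("QPX report" in html)  # True
```

Pre-dumping the JSON keeps serialization out of the template and avoids interpolating raw Python objects into markup.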

Comment on lines 430 to +435

     def get_report_from_database(self, samples: list, columns: list = None):
         """Retrieves a standardized report from the database for specified samples."""
         cols = ",".join(columns) if columns is not None else "*"
    -    database = self.parquet_db.sql(
    -        """SELECT {} FROM parquet_db WHERE sample_accession IN {}""".format(
    -            cols, tuple(samples)
    -        )
    -    )
    +    placeholders = ",".join(["?"] * len(samples))
    +    sql = "".join(["SELECT ", cols, " FROM parquet_db WHERE sample_accession IN (", placeholders, ")"])
    +    database = self.parquet_db.execute(sql, samples)
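The parameterized IN clause in the replacement above can be exercised end to end. The sketch below uses Python's stdlib sqlite3 as a stand-in, since it shares DuckDB's "?" placeholder style; the table and columns are invented fixtures:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parquet_db (sample_accession TEXT, intensity REAL)")
conn.executemany(
    "INSERT INTO parquet_db VALUES (?, ?)",
    [("S1", 1.0), ("S2", 2.0), ("S3", 3.0)],
)

samples = ["S1", "S3"]
# One placeholder per sample; the values never touch the SQL string (B608).
placeholders = ",".join(["?"] * len(samples))
sql = "".join(
    ["SELECT sample_accession FROM parquet_db ",
     "WHERE sample_accession IN (", placeholders, ")"]
)
rows = conn.execute(sql, samples).fetchall()
print(sorted(r[0] for r in rows))  # ['S1', 'S3']
```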

Copilot AI left a comment


Pull request overview

This PR updates mokume’s QPX parquet ingestion to be compatible with the latest QPX schema while addressing Bandit SQL-injection warnings by moving DuckDB queries to parameterized execution.

Changes:

  • Add support for new QPX schema fields/types (e.g., pg_accessions as list<struct>, unique as bool, is_decoy, anchor_protein, new charge/run_file_name columns).
  • Refactor SQL filtering/query construction to return (where_clause, params) and execute parameterized DuckDB queries (reducing Bandit B608 findings).
  • Add/adjust tests for new/legacy QPX compatibility and update interactive report HTML generation to avoid f-string templating.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/test_qpx_format_compat.py New compatibility-focused tests covering new vs. legacy QPX parquet schemas
tests/test_peptide_normalize.py Updates tests to match the new parameterized SQLFilterBuilder API
pyproject.toml Adds Bandit configuration (including test exclusions)
mokume/reports/interactive.py Switches HTML generation from f-string to string.Template and pre-dumps JSON
mokume/quantification/ratio.py Updates ratio PSM loading for new QPX schema + parameterized DuckDB execution
mokume/quantification/init.py Updates MaxLFQ factory wiring (threads vs prior args)
mokume/pipeline/stages.py Parameterizes DuckDB query execution when applying filters
mokume/io/feature.py Core implementation: schema detection, normalized views, parameterized filters, new fields support


Comment on lines +204 to +219

    # In new QPX, pg_accessions is list<struct{accession,...}>
    # mokume code does: df["pg_accessions"].str[0] then .split("|")
    first_elem = df["pg_accessions"].str[0]
    # With struct, first_elem would be a dict, not a string
    print(f"pg_accessions type: {type(first_elem.iloc[0])}")
    print(f"pg_accessions[0]: {first_elem.iloc[0]}")

    # This is what mokume ratio.py does:
    first_acc = df["pg_accessions"].str[0].fillna("")
    result = np.where(
        first_acc.str.contains("|", regex=False),
        first_acc.str.split("|").str[1],
        first_acc,
    )
    print(f"Parsed protein names: {result}")

Copilot AI commented Mar 21, 2026
test_pg_accessions_struct_in_pandas only prints intermediate values and never asserts anything about the parsed output/type. As written, this test will pass even if pg_accessions struct handling regresses. Add concrete assertions (e.g., that df["pg_accessions"].str[0] is a string accession and that the parsed UniProt ID matches the expected value) and remove/guard the debug prints (e.g., via capsys).

Suggested change

    -# In new QPX, pg_accessions is list<struct{accession,...}>
    -# mokume code does: df["pg_accessions"].str[0] then .split("|")
    -first_elem = df["pg_accessions"].str[0]
    -# With struct, first_elem would be a dict, not a string
    -print(f"pg_accessions type: {type(first_elem.iloc[0])}")
    -print(f"pg_accessions[0]: {first_elem.iloc[0]}")
    -# This is what mokume ratio.py does:
    -first_acc = df["pg_accessions"].str[0].fillna("")
    -result = np.where(
    -    first_acc.str.contains("|", regex=False),
    -    first_acc.str.split("|").str[1],
    -    first_acc,
    -)
    -print(f"Parsed protein names: {result}")
    +# In new QPX, pg_accessions is list<struct{accession,...}>; mokume code
    +# expects to be able to treat it like list<string> and then split on "|".
    +first_elem = df["pg_accessions"].str[0]
    +# Ensure the first element behaves like a string accession, not a dict.
    +first_value = first_elem.iloc[0]
    +assert isinstance(
    +    first_value, str
    +), f"Expected string accession, got {type(first_value)}"
    +# This is what mokume ratio.py does to derive the UniProt ID:
    +first_acc = df["pg_accessions"].str[0].fillna("")
    +result = np.where(
    +    first_acc.str.contains("|", regex=False),
    +    first_acc.str.split("|").str[1],
    +    first_acc,
    +)
    +# Verify that the parsed UniProt ID matches what we would expect from
    +# the original accession string.
    +original_accession = first_value
    +if "|" in original_accession:
    +    expected_uniprot_id = original_accession.split("|")[1]
    +else:
    +    expected_uniprot_id = original_accession
    +assert (
    +    result[0] == expected_uniprot_id
    +), f"Parsed UniProt ID {result[0]!r} does not match expected {expected_uniprot_id!r}"

    feat = Feature(parquet_file)

    result = feat.get_low_frequency_peptides(percentage=0.2)
    print(f"Low frequency peptides: {result}")
Copilot AI commented Mar 21, 2026

test_get_low_frequency_peptides currently just prints the result without asserting expected behavior. This doesn’t validate the new anchor_protein/struct pg_accessions logic and can’t catch regressions. Please assert on the returned tuple contents/shape (e.g., that it’s non-empty and contains expected (protein, sequence) pairs for the test fixture).

Suggested change

    -print(f"Low frequency peptides: {result}")
    +# Validate that we get a non-empty collection of (protein, peptide) tuples.
    +assert isinstance(result, (list, tuple)), (
    +    "get_low_frequency_peptides should return a list/tuple of (protein, peptide) pairs"
    +)
    +assert len(result) > 0, "get_low_frequency_peptides returned no low-frequency peptides"
    +for item in result:
    +    assert isinstance(item, tuple), (
    +        "Each low-frequency peptide entry should be a tuple (protein, peptide)"
    +    )
    +    assert len(item) == 2, (
    +        "Each low-frequency peptide tuple should have length 2: (protein, peptide)"
    +    )
    +    protein, peptide = item
    +    assert isinstance(protein, str), "Protein identifier should be a string"
    +    assert isinstance(peptide, str), "Peptide sequence should be a string"
Comment on lines 71 to 125

    @@ -120,13 +120,13 @@ def get_quantification_method(method: str, **kwargs) -> ProteinQuantificationMet
         elif method_lower == "maxlfq":
             return MaxLFQQuantification(
                 min_peptides=kwargs.get("min_peptides", 2),
    -            n_jobs=kwargs.get("n_jobs", -1),
    -            use_variance_guided=kwargs.get("use_variance_guided", True),
    +            threads=kwargs.get("n_jobs", -1),
    +            verbose=kwargs.get("verbose", 0),
             )
Copilot AI commented Mar 21, 2026

get_quantification_method(..., method="maxlfq") maps kwargs["n_jobs"] into the threads constructor argument, but ignores a caller-provided threads kwarg. This is inconsistent with the public docs in this repo (e.g., README uses threads=...) and will lead to surprising behavior. Consider accepting both (prefer threads when present, fall back to n_jobs) and update the docstring bullets/examples to match.
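One way to implement the reviewer's suggestion — accept both kwargs, preferring threads — could look like the sketch below. This is a hedged illustration: MaxLFQQuantification's real signature and the factory's other branches are not reproduced here.

```python
# Hypothetical sketch of the suggested kwarg handling; the real factory in
# mokume/quantification has more methods and a fuller class.
class MaxLFQQuantification:
    def __init__(self, min_peptides=2, threads=-1, verbose=0):
        self.min_peptides = min_peptides
        self.threads = threads
        self.verbose = verbose


def get_quantification_method(method: str, **kwargs):
    if method.lower() == "maxlfq":
        # Prefer an explicit `threads`, fall back to legacy `n_jobs`.
        threads = kwargs.get("threads", kwargs.get("n_jobs", -1))
        return MaxLFQQuantification(
            min_peptides=kwargs.get("min_peptides", 2),
            threads=threads,
            verbose=kwargs.get("verbose", 0),
        )
    raise ValueError(f"Unknown method: {method}")


print(get_quantification_method("maxlfq", threads=4).threads)  # 4
print(get_quantification_method("maxlfq", n_jobs=2).threads)   # 2
```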

@ypriverol ypriverol merged commit 89546eb into bigbio:dev Mar 21, 2026
8 checks passed