Support new QPX format and fix warnings by Shen-YuFei · Pull Request #13 · bigbio/mokume

Shen-YuFei · 2026-03-18T10:36:27Z

Support new QPX format and fix Bandit security warnings

Summary

Adapt mokume to the latest QPX parquet schema and fix Bandit warnings across the codebase.

Changes

New QPX format support

Detect and handle pg_accessions as list<struct> via list_transform(x -> x.accession)
Support is_decoy (bool) for optimized decoy filtering in SQLFilterBuilder
Support anchor_protein for protein-level grouping in get_low_frequency_peptides
Handle unique as bool (previously int), charge/run_file_name column detection
Add comprehensive compatibility test suite (test_qpx_format_compat.py, 13 tests)

Bandit B608 — SQL injection prevention

Refactor SQLFilterBuilder.build_where_clause() to return (clause, params) tuple
Replace all f-string/.format() SQL construction with "".join() + parameterized execute(sql, params)
Convert HTML template in interactive.py from f-string to string.Template

coderabbitai · 2026-03-18T10:36:36Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2cbef764-a7d9-43da-bf28-5c75658d6480

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

📝 Coding Plan

Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Tip

You can disable sequence diagrams in the walkthrough.

Disable the reviews.sequence_diagrams setting to disable sequence diagrams in the walkthrough.

Copilot

Pull request overview

This PR updates mokume’s DuckDB/parquet ingestion and filtering to be compatible with the latest QPX schema (including pg_accessions as list<struct>, is_decoy, anchor_protein, and new column names), while refactoring query construction to address Bandit SQL-injection warnings via parameterized execution. It also adds/updates tests to cover compatibility across legacy and new QPX formats.

Changes:

Extend QPX parsing to normalize pg_accessions (list<struct> → list<string>), support anchor_protein, and surface is_decoy.
Refactor SQLFilterBuilder.build_where_clause() to return (clause, params) and update call sites to execute parameterized queries.
Add a new QPX format compatibility test suite and update existing tests for the new filter builder API.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
`mokume/io/feature.py`	Adds new/legacy QPX detection and normalization, parameterized filtering, and exposes optional new columns (`is_decoy`, `anchor_protein`).
`mokume/quantification/ratio.py`	Updates QPX schema handling and switches to parameterized execution when appending filter clauses.
`mokume/pipeline/stages.py`	Replaces f-string SQL with parameterized `execute(sql, params)` for filtered parquet view queries.
`mokume/reports/interactive.py`	Replaces HTML f-string generation with `string.Template` substitution.
`tests/test_qpx_format_compat.py`	Adds a new test suite covering both QPX schemas and deep-compat scenarios.
`tests/test_peptide_normalize.py`	Updates tests to the new `(clause, params)` API and adjusts assertions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

mokume/io/feature.py

    def get_report_from_database(self, samples: list, columns: list = None):
        """Retrieves a standardized report from the database for specified samples."""
        cols = ",".join(columns) if columns is not None else "*"
-        database = self.parquet_db.sql(
-            """SELECT {} FROM parquet_db WHERE sample_accession IN {}""".format(
-                cols, tuple(samples)
-            )
-        )
+        placeholders = ",".join(["?"] * len(samples))
+        sql = "".join(["SELECT ", cols, " FROM parquet_db WHERE sample_accession IN (", placeholders, ")"])
+        database = self.parquet_db.execute(sql, samples)


mokume/io/feature.py

tests/test_peptide_normalize.py

tests/test_qpx_format_compat.py

mokume/quantification/ratio.py

…est asserts, defer is_decoy detection

Copilot

Pull request overview

This PR updates mokume’s QPX parquet ingestion to be compatible with the latest QPX schema while addressing Bandit SQL-injection warnings by moving DuckDB queries to parameterized execution.

Changes:

Add support for new QPX schema fields/types (e.g., pg_accessions as list<struct>, unique as bool, is_decoy, anchor_protein, new charge/run_file_name columns).
Refactor SQL filtering/query construction to return (where_clause, params) and execute parameterized DuckDB queries (reducing Bandit B608 findings).
Add/adjust tests for new/legacy QPX compatibility and update interactive report HTML generation to avoid f-string templating.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
tests/test_qpx_format_compat.py	New compatibility-focused tests covering new vs. legacy QPX parquet schemas
tests/test_peptide_normalize.py	Updates tests to match the new parameterized `SQLFilterBuilder` API
pyproject.toml	Adds Bandit configuration (including test exclusions)
mokume/reports/interactive.py	Switches HTML generation from f-string to `string.Template` and pre-dumps JSON
mokume/quantification/ratio.py	Updates ratio PSM loading for new QPX schema + parameterized DuckDB execution
mokume/quantification/init.py	Updates MaxLFQ factory wiring (`threads` vs prior args)
mokume/pipeline/stages.py	Parameterizes DuckDB query execution when applying filters
mokume/io/feature.py	Core implementation: schema detection, normalized views, parameterized filters, new fields support

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-21T10:40:01Z

tests/test_qpx_format_compat.py

+        # In new QPX, pg_accessions is list<struct{accession,...}>
+        # mokume code does: df["pg_accessions"].str[0] then .split("|")
+        first_elem = df["pg_accessions"].str[0]
+        # With struct, first_elem would be a dict, not a string
+        print(f"pg_accessions type: {type(first_elem.iloc[0])}")
+        print(f"pg_accessions[0]: {first_elem.iloc[0]}")
+
+        # This is what mokume ratio.py does:
+        first_acc = df["pg_accessions"].str[0].fillna("")
+        result = np.where(
+            first_acc.str.contains("|", regex=False),
+            first_acc.str.split("|").str[1],
+            first_acc,
+        )
+        print(f"Parsed protein names: {result}")
+


test_pg_accessions_struct_in_pandas only prints intermediate values and never asserts anything about the parsed output/type. As written, this test will pass even if pg_accessions struct handling regresses. Add concrete assertions (e.g., that df["pg_accessions"].str[0] is a string accession and that the parsed UniProt ID matches the expected value) and remove/guard the debug prints (e.g., via capsys).

Suggested change

# In new QPX, pg_accessions is list<struct{accession,...}>

# mokume code does: df["pg_accessions"].str[0] then .split("|")

first_elem = df["pg_accessions"].str[0]

# With struct, first_elem would be a dict, not a string

print(f"pg_accessions type: {type(first_elem.iloc[0])}")

print(f"pg_accessions[0]: {first_elem.iloc[0]}")

# This is what mokume ratio.py does:

first_acc = df["pg_accessions"].str[0].fillna("")

result = np.where(

first_acc.str.contains("|", regex=False),

first_acc.str.split("|").str[1],

first_acc,

)

print(f"Parsed protein names: {result}")

# In new QPX, pg_accessions is list<struct{accession,...}>; mokume code

# expects to be able to treat it like list<string> and then split on "|".

first_elem = df["pg_accessions"].str[0]

# Ensure the first element behaves like a string accession, not a dict.

first_value = first_elem.iloc[0]

assert isinstance(

first_value, str

), f"Expected string accession, got {type(first_value)}"

# This is what mokume ratio.py does to derive the UniProt ID:

first_acc = df["pg_accessions"].str[0].fillna("")

result = np.where(

first_acc.str.contains("|", regex=False),

first_acc.str.split("|").str[1],

first_acc,

)

# Verify that the parsed UniProt ID matches what we would expect from

# the original accession string.

original_accession = first_value

if "|" in original_accession:

expected_uniprot_id = original_accession.split("|")[1]

else:

expected_uniprot_id = original_accession

assert (

result[0] == expected_uniprot_id

), f"Parsed UniProt ID {result[0]!r} does not match expected {expected_uniprot_id!r}"

Copilot · 2026-03-21T10:40:01Z

tests/test_qpx_format_compat.py

+        feat = Feature(parquet_file)
+
+        result = feat.get_low_frequency_peptides(percentage=0.2)
+        print(f"Low frequency peptides: {result}")


test_get_low_frequency_peptides currently just prints the result without asserting expected behavior. This doesn’t validate the new anchor_protein/struct pg_accessions logic and can’t catch regressions. Please assert on the returned tuple contents/shape (e.g., that it’s non-empty and contains expected (protein, sequence) pairs for the test fixture).

Suggested change

print(f"Low frequency peptides: {result}")

# Validate that we get a non-empty collection of (protein, peptide) tuples.

assert isinstance(result, (list, tuple)), (

"get_low_frequency_peptides should return a list/tuple of (protein, peptide) pairs"

)

assert len(result) > 0, "get_low_frequency_peptides returned no low-frequency peptides"

for item in result:

assert isinstance(item, tuple), (

"Each low-frequency peptide entry should be a tuple (protein, peptide)"

)

assert len(item) == 2, (

"Each low-frequency peptide tuple should have length 2: (protein, peptide)"

)

protein, peptide = item

assert isinstance(protein, str), "Protein identifier should be a string"

assert isinstance(peptide, str), "Peptide sequence should be a string"

Copilot · 2026-03-21T10:40:02Z

mokume/quantification/__init__.py

@@ -120,13 +120,13 @@ def get_quantification_method(method: str, **kwargs) -> ProteinQuantificationMet
    elif method_lower == "maxlfq":
        return MaxLFQQuantification(
            min_peptides=kwargs.get("min_peptides", 2),
-            n_jobs=kwargs.get("n_jobs", -1),
-            use_variance_guided=kwargs.get("use_variance_guided", True),
+            threads=kwargs.get("n_jobs", -1),
            verbose=kwargs.get("verbose", 0),
        )


get_quantification_method(..., method="maxlfq") maps kwargs["n_jobs"] into the threads constructor argument, but ignores a caller-provided threads kwarg. This is inconsistent with the public docs in this repo (e.g., README uses threads=...) and will lead to surprising behavior. Consider accepting both (prefer threads when present, fall back to n_jobs) and update the docstring bullets/examples to match.

Shen-YuFei added 6 commits March 18, 2026 17:09

support new QPX format

d24f49c

fix F401

bb78691

fix B101

e2358ae

fix B608

5f50d53

fix B101

63fba9e

fix

8d3eb34

Copilot AI review requested due to automatic review settings March 18, 2026 10:36

Copilot started reviewing on behalf of Shen-YuFei March 18, 2026 10:36 View session

Copilot AI reviewed Mar 18, 2026

View reviewed changes

Shen-YuFei added 3 commits March 18, 2026 19:04

parameterize parquet path, validate columns, fix dead code, restore t…

03b0968

…est asserts, defer is_decoy detection

fix

4e4ce1c

fix MaxLFQ param name n_jobs -> threads, remove use_variance_guided

3d1fbce

ypriverol requested review from Copilot and ypriverol March 21, 2026 10:35

Copilot started reviewing on behalf of ypriverol March 21, 2026 10:35 View session

Copilot AI reviewed Mar 21, 2026

View reviewed changes

ypriverol approved these changes Mar 21, 2026

View reviewed changes

ypriverol merged commit 89546eb into bigbio:dev Mar 21, 2026
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support new QPX format and fix warnings#13

Support new QPX format and fix warnings#13
ypriverol merged 9 commits intobigbio:devfrom
Shen-YuFei:dev

Shen-YuFei commented Mar 18, 2026

Uh oh!

coderabbitai bot commented Mar 18, 2026 •

edited

Loading

Review skipped

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 21, 2026

Uh oh!

Copilot AI Mar 21, 2026

Uh oh!

Copilot AI Mar 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

-        print(f"Low frequency peptides: {result}")
+        # Validate that we get a non-empty collection of (protein, peptide) tuples.
+        assert isinstance(result, (list, tuple)), (
+            "get_low_frequency_peptides should return a list/tuple of (protein, peptide) pairs"
+        )
+        assert len(result) > 0, "get_low_frequency_peptides returned no low-frequency peptides"
+        for item in result:
+            assert isinstance(item, tuple), (
+                "Each low-frequency peptide entry should be a tuple (protein, peptide)"
+            )
+            assert len(item) == 2, (
+                "Each low-frequency peptide tuple should have length 2: (protein, peptide)"
+            )
+            protein, peptide = item
+            assert isinstance(protein, str), "Protein identifier should be a string"
+            assert isinstance(peptide, str), "Peptide sequence should be a string"

Conversation

Shen-YuFei commented Mar 18, 2026

Support new QPX format and fix Bandit security warnings

Summary

Changes

New QPX format support

Bandit B608 — SQL injection prevention

Uh oh!

coderabbitai bot commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

coderabbitai bot commented Mar 18, 2026 •

edited

Loading