Support new QPX format and fix warnings#13
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Tip You can disable sequence diagrams in the walkthrough.Disable the |
There was a problem hiding this comment.
Pull request overview
This PR updates mokume’s DuckDB/parquet ingestion and filtering to be compatible with the latest QPX schema (including pg_accessions as list<struct>, is_decoy, anchor_protein, and new column names), while refactoring query construction to address Bandit SQL-injection warnings via parameterized execution. It also adds/updates tests to cover compatibility across legacy and new QPX formats.
Changes:
- Extend QPX parsing to normalize
pg_accessions(list<struct>→list<string>), supportanchor_protein, and surfaceis_decoy. - Refactor
SQLFilterBuilder.build_where_clause()to return(clause, params)and update call sites to execute parameterized queries. - Add a new QPX format compatibility test suite and update existing tests for the new filter builder API.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
mokume/io/feature.py |
Adds new/legacy QPX detection and normalization, parameterized filtering, and exposes optional new columns (is_decoy, anchor_protein). |
mokume/quantification/ratio.py |
Updates QPX schema handling and switches to parameterized execution when appending filter clauses. |
mokume/pipeline/stages.py |
Replaces f-string SQL with parameterized execute(sql, params) for filtered parquet view queries. |
mokume/reports/interactive.py |
Replaces HTML f-string generation with string.Template substitution. |
tests/test_qpx_format_compat.py |
Adds a new test suite covering both QPX schemas and deep-compat scenarios. |
tests/test_peptide_normalize.py |
Updates tests to the new (clause, params) API and adjusts assertions. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| def get_report_from_database(self, samples: list, columns: list = None): | ||
| """Retrieves a standardized report from the database for specified samples.""" | ||
| cols = ",".join(columns) if columns is not None else "*" | ||
| database = self.parquet_db.sql( | ||
| """SELECT {} FROM parquet_db WHERE sample_accession IN {}""".format( | ||
| cols, tuple(samples) | ||
| ) | ||
| ) | ||
| placeholders = ",".join(["?"] * len(samples)) | ||
| sql = "".join(["SELECT ", cols, " FROM parquet_db WHERE sample_accession IN (", placeholders, ")"]) | ||
| database = self.parquet_db.execute(sql, samples) |
…est asserts, defer is_decoy detection
There was a problem hiding this comment.
Pull request overview
This PR updates mokume’s QPX parquet ingestion to be compatible with the latest QPX schema while addressing Bandit SQL-injection warnings by moving DuckDB queries to parameterized execution.
Changes:
- Add support for new QPX schema fields/types (e.g.,
pg_accessionsaslist<struct>,uniqueasbool,is_decoy,anchor_protein, newcharge/run_file_namecolumns). - Refactor SQL filtering/query construction to return
(where_clause, params)and execute parameterized DuckDB queries (reducing Bandit B608 findings). - Add/adjust tests for new/legacy QPX compatibility and update interactive report HTML generation to avoid f-string templating.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_qpx_format_compat.py | New compatibility-focused tests covering new vs. legacy QPX parquet schemas |
| tests/test_peptide_normalize.py | Updates tests to match the new parameterized SQLFilterBuilder API |
| pyproject.toml | Adds Bandit configuration (including test exclusions) |
| mokume/reports/interactive.py | Switches HTML generation from f-string to string.Template and pre-dumps JSON |
| mokume/quantification/ratio.py | Updates ratio PSM loading for new QPX schema + parameterized DuckDB execution |
| mokume/quantification/init.py | Updates MaxLFQ factory wiring (threads vs prior args) |
| mokume/pipeline/stages.py | Parameterizes DuckDB query execution when applying filters |
| mokume/io/feature.py | Core implementation: schema detection, normalized views, parameterized filters, new fields support |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # In new QPX, pg_accessions is list<struct{accession,...}> | ||
| # mokume code does: df["pg_accessions"].str[0] then .split("|") | ||
| first_elem = df["pg_accessions"].str[0] | ||
| # With struct, first_elem would be a dict, not a string | ||
| print(f"pg_accessions type: {type(first_elem.iloc[0])}") | ||
| print(f"pg_accessions[0]: {first_elem.iloc[0]}") | ||
|
|
||
| # This is what mokume ratio.py does: | ||
| first_acc = df["pg_accessions"].str[0].fillna("") | ||
| result = np.where( | ||
| first_acc.str.contains("|", regex=False), | ||
| first_acc.str.split("|").str[1], | ||
| first_acc, | ||
| ) | ||
| print(f"Parsed protein names: {result}") | ||
|
|
There was a problem hiding this comment.
test_pg_accessions_struct_in_pandas only prints intermediate values and never asserts anything about the parsed output/type. As written, this test will pass even if pg_accessions struct handling regresses. Add concrete assertions (e.g., that df["pg_accessions"].str[0] is a string accession and that the parsed UniProt ID matches the expected value) and remove/guard the debug prints (e.g., via capsys).
| # In new QPX, pg_accessions is list<struct{accession,...}> | |
| # mokume code does: df["pg_accessions"].str[0] then .split("|") | |
| first_elem = df["pg_accessions"].str[0] | |
| # With struct, first_elem would be a dict, not a string | |
| print(f"pg_accessions type: {type(first_elem.iloc[0])}") | |
| print(f"pg_accessions[0]: {first_elem.iloc[0]}") | |
| # This is what mokume ratio.py does: | |
| first_acc = df["pg_accessions"].str[0].fillna("") | |
| result = np.where( | |
| first_acc.str.contains("|", regex=False), | |
| first_acc.str.split("|").str[1], | |
| first_acc, | |
| ) | |
| print(f"Parsed protein names: {result}") | |
| # In new QPX, pg_accessions is list<struct{accession,...}>; mokume code | |
| # expects to be able to treat it like list<string> and then split on "|". | |
| first_elem = df["pg_accessions"].str[0] | |
| # Ensure the first element behaves like a string accession, not a dict. | |
| first_value = first_elem.iloc[0] | |
| assert isinstance( | |
| first_value, str | |
| ), f"Expected string accession, got {type(first_value)}" | |
| # This is what mokume ratio.py does to derive the UniProt ID: | |
| first_acc = df["pg_accessions"].str[0].fillna("") | |
| result = np.where( | |
| first_acc.str.contains("|", regex=False), | |
| first_acc.str.split("|").str[1], | |
| first_acc, | |
| ) | |
| # Verify that the parsed UniProt ID matches what we would expect from | |
| # the original accession string. | |
| original_accession = first_value | |
| if "|" in original_accession: | |
| expected_uniprot_id = original_accession.split("|")[1] | |
| else: | |
| expected_uniprot_id = original_accession | |
| assert ( | |
| result[0] == expected_uniprot_id | |
| ), f"Parsed UniProt ID {result[0]!r} does not match expected {expected_uniprot_id!r}" |
| feat = Feature(parquet_file) | ||
|
|
||
| result = feat.get_low_frequency_peptides(percentage=0.2) | ||
| print(f"Low frequency peptides: {result}") |
There was a problem hiding this comment.
test_get_low_frequency_peptides currently just prints the result without asserting expected behavior. This doesn’t validate the new anchor_protein/struct pg_accessions logic and can’t catch regressions. Please assert on the returned tuple contents/shape (e.g., that it’s non-empty and contains expected (protein, sequence) pairs for the test fixture).
| print(f"Low frequency peptides: {result}") | |
| # Validate that we get a non-empty collection of (protein, peptide) tuples. | |
| assert isinstance(result, (list, tuple)), ( | |
| "get_low_frequency_peptides should return a list/tuple of (protein, peptide) pairs" | |
| ) | |
| assert len(result) > 0, "get_low_frequency_peptides returned no low-frequency peptides" | |
| for item in result: | |
| assert isinstance(item, tuple), ( | |
| "Each low-frequency peptide entry should be a tuple (protein, peptide)" | |
| ) | |
| assert len(item) == 2, ( | |
| "Each low-frequency peptide tuple should have length 2: (protein, peptide)" | |
| ) | |
| protein, peptide = item | |
| assert isinstance(protein, str), "Protein identifier should be a string" | |
| assert isinstance(peptide, str), "Peptide sequence should be a string" |
| @@ -120,13 +120,13 @@ def get_quantification_method(method: str, **kwargs) -> ProteinQuantificationMet | |||
| elif method_lower == "maxlfq": | |||
| return MaxLFQQuantification( | |||
| min_peptides=kwargs.get("min_peptides", 2), | |||
| n_jobs=kwargs.get("n_jobs", -1), | |||
| use_variance_guided=kwargs.get("use_variance_guided", True), | |||
| threads=kwargs.get("n_jobs", -1), | |||
| verbose=kwargs.get("verbose", 0), | |||
| ) | |||
There was a problem hiding this comment.
get_quantification_method(..., method="maxlfq") maps kwargs["n_jobs"] into the threads constructor argument, but ignores a caller-provided threads kwarg. This is inconsistent with the public docs in this repo (e.g., README uses threads=...) and will lead to surprising behavior. Consider accepting both (prefer threads when present, fall back to n_jobs) and update the docstring bullets/examples to match.
Support new QPX format and fix Bandit security warnings
Summary
Adapt mokume to the latest QPX parquet schema and fix Bandit warnings across the codebase.
Changes
New QPX format support
pg_accessionsaslist<struct>vialist_transform(x -> x.accession)is_decoy(bool) for optimized decoy filtering inSQLFilterBuilderanchor_proteinfor protein-level grouping inget_low_frequency_peptidesuniqueas bool (previously int),charge/run_file_namecolumn detectiontest_qpx_format_compat.py, 13 tests)Bandit B608 — SQL injection prevention
SQLFilterBuilder.build_where_clause()to return(clause, params)tuple.format()SQL construction with"".join()+ parameterizedexecute(sql, params)interactive.pyfrom f-string tostring.Template