📝 Walkthrough

Refactors the quantification pipeline into staged components (Loading, Normalization, Quantification, Postprocessing); adds IRS and TMM normalizers, Ratio quantification, DifferentialExpression (ttest/limma), a DuckDB-backed Feature API with SQLFilterBuilder, unified Intensity outputs, interactive/QC/workflow reports, a full PXD007683 benchmarking suite, and extensive docs and CI for docs deployment.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Pipeline as QuantPipeline
    participant Loader as LoadingStage
    participant Normalizer as NormalizationStage
    participant Quantifier as QuantificationStage
    participant Postproc as PostprocessingStage
    User->>Pipeline: features_to_proteins(config)
    activate Pipeline
    Pipeline->>Loader: load_for_mokume()
    activate Loader
    Loader->>Loader: read parquet + SDRF, apply SQL filters
    Loader-->>Pipeline: peptide_df
    deactivate Loader
    Pipeline->>Normalizer: apply_hierarchical/apply_tmm/apply_irs()
    activate Normalizer
    Normalizer->>Normalizer: compute normalization factors / IRS scaling
    Normalizer-->>Pipeline: normalized_peptides
    deactivate Normalizer
    Pipeline->>Quantifier: quantify(normalized_peptides)
    activate Quantifier
    Quantifier->>Quantifier: run quant method (ibaq/maxlfq/ratio/...)
    Quantifier-->>Pipeline: protein_df (wide, Intensity)
    deactivate Quantifier
    Pipeline->>Postproc: apply_batch_correction/run_differential_expression/generate_reports()
    activate Postproc
    Postproc->>Postproc: ComBat / DE (ttest/limma) / plotting / HTML reports
    Postproc-->>Pipeline: corrected_df, de_results, reports
    deactivate Postproc
    Pipeline-->>User: protein_df + artifacts (plots, reports)
    deactivate Pipeline
```
```mermaid
sequenceDiagram
    participant PSM as PSM_Data
    participant RatioQ as RatioQuantification
    participant Frac as FractionAgg
    participant Ref as ReferenceComp
    participant RatioCalc as RatioCalc
    participant Agg as Aggregation
    PSM->>RatioQ: psm_long_df
    activate RatioQ
    RatioQ->>Frac: _average_fractions()
    activate Frac
    Frac-->>RatioQ: averaged_psm_df
    deactivate Frac
    RatioQ->>Ref: _compute_reference_intensity()
    activate Ref
    Ref-->>RatioQ: reference_df
    deactivate Ref
    RatioQ->>RatioCalc: _compute_log2_ratios(averaged, reference)
    activate RatioCalc
    RatioCalc-->>RatioQ: ratio_df
    deactivate RatioCalc
    RatioQ->>Agg: aggregate PSM->peptide->protein
    activate Agg
    Agg-->>RatioQ: protein_df (log2 ratios)
    deactivate Agg
    RatioQ-->>PSM: pivot to wide (protein x sample)
    deactivate RatioQ
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Poem
✨ Finishing Touches
Actionable comments posted: 5
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
mokume/quantification/directlfq.py (1)
298-300: ⚠️ Potential issue | 🟡 Minor

Original traceback is lost when re-raising as `RuntimeError`. Use `raise ... from e` to preserve the exception chain, which aids debugging.

Proposed fix

```diff
 except Exception as e:
     logger.error(f"DirectLFQ failed: {e}")
-    raise RuntimeError(f"DirectLFQ quantification failed: {e}")
+    raise RuntimeError(f"DirectLFQ quantification failed: {e}") from e
```

docs/concepts/preprocessing.md (1)
145-146: ⚠️ Potential issue | 🟡 Minor

Missing trailing newline. Line 146 is not marked as a new line (`~`), yet this is a new file. This may indicate a missing final newline character.

mokume/preprocessing/filters/peptide.py (1)
269-288: ⚠️ Potential issue | 🟠 Major

Missed cleavage count is wrong when K/R is at the last position of the truncated sequence. After removing the C-terminal residue (line 275), K/R at the end of the truncated string is silently skipped (lines 285–287, `pass`). However, this K/R is an internal cleavage site — the character that followed it in the original sequence may not be P. The code should check the original next character instead of unconditionally skipping.

🐛 Proposed fix

```diff
 def count_missed_cleavages(seq):
     if pd.isna(seq):
         return 0
-    seq_str = str(seq).upper()
+    seq_full = str(seq).upper()
     # Remove last character (C-terminal cleavage site doesn't count)
-    if len(seq_str) > 1:
-        seq_str = seq_str[:-1]
+    if len(seq_full) <= 1:
+        return 0
+    seq_str = seq_full[:-1]
     if self.enzyme == "trypsin":
         # Count K and R not followed by P
         count = 0
         for i, aa in enumerate(seq_str):
             if aa in ["K", "R"]:
                 # Check if followed by P
-                if i + 1 < len(seq_str) and seq_str[i + 1] != "P":
-                    count += 1
-                elif i + 1 >= len(seq_str):
-                    # Last position doesn't count as missed
-                    pass
+                next_aa = seq_str[i + 1] if i + 1 < len(seq_str) else seq_full[-1]
+                if next_aa != "P":
+                    count += 1
     return count
```

mokume/commands/features2proteins.py (1)
472-516: ⚠️ Potential issue | 🟡 Minor

Add `directlfq_min_nonan` to the QuantificationConfig initialization. The `--directlfq-min-nonan` CLI option (line 132) is accepted but never forwarded to `run_pipeline`. The parameter is received by the command function (line 330) but not passed to `QuantificationConfig` during initialization, causing the user-supplied value to be silently dropped in favor of the default. Since the function signature is being expanded, address this gap by including `directlfq_min_nonan=directlfq_min_nonan` in the QuantificationConfig constructor call.

docs/reference/cli.md (1)
215-227: ⚠️ Potential issue | 🟡 Minor

The `tsne-visualization` options table is truncated. The file ends abruptly at line 227. The last table row for `-o/--pattern` is missing its description, and likely additional options (e.g., output path, perplexity, etc.) are missing entirely.
🤖 Fix all issues with AI agents
In `@benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/06_generate_report.py`:
- Around line 176-181: The f-string in the loop over `yeast_df.iterrows` (inside the `lines.append` call) applies the numeric format specifier `:.2f` to `row.get('compression_ratio', 'N/A')`, which raises `ValueError` when the fallback `'N/A'` is used. Fix by extracting `compression_ratio` into a variable (e.g., `comp = row.get('compression_ratio')`), then build a `comp_str` that is `"N/A"` when `comp` is None or non-numeric (or `pandas.isna(comp)`), otherwise format it with `"{:.2f}"`, and use `comp_str` in the `lines.append` f-string instead of applying `:.2f` directly to the `get(...)` expression.
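A minimal sketch of the guarded formatting described in that fix (the helper name `format_ratio` is hypothetical, not part of the benchmark script):

```python
import math


def format_ratio(value) -> str:
    """Format a possibly-missing compression ratio for a report line.

    Returns "N/A" for None, non-numeric, or NaN values instead of
    letting a bare f"{value:.2f}" raise ValueError/TypeError.
    """
    if value is None:
        return "N/A"
    try:
        value = float(value)
    except (TypeError, ValueError):
        return "N/A"
    if math.isnan(value):
        return "N/A"
    return f"{value:.2f}"
```

The report line would then interpolate `format_ratio(row.get('compression_ratio'))` instead of applying `:.2f` directly.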
In `@mokume/io/feature.py`:
- Around line 515-527: The query in get_irs_scaling_factors references a mixture
column that doesn't exist in the DuckDB views; update the view definitions used
to build self.parquet_db so that both the base parquet_db view and the
SDRF-enriched view include a deterministic mixture column (e.g., derive or
coalesce the appropriate SDRF/sample field into an alias named mixture) so the
SELECT and GROUP BY in the irs_df SQL succeed; ensure the column name is exactly
mixture to match the SELECT/GROUP BY in get_irs_scaling_factors.
In `@mokume/preprocessing/filters/peptide.py`:
- Around line 414-416: If self._compiled_patterns is empty, combined_pattern
becomes "" so df[col].str.contains("") matches everything and removes all rows;
fix by short-circuiting: if not self._compiled_patterns then set mask to a
Series of True for all rows (i.e., nothing excluded) or use a regex that matches
nothing (e.g., "(?!)") before calling
df[col].fillna("").astype(str).str.contains; update the logic around
combined_pattern, self._compiled_patterns and mask to handle the empty-list case
explicitly.
- Around line 199-204: When self.exclude_modifications is empty the code builds
patterns="" so str.contains("") matches every row and removes all peptides; fix
the block that builds patterns and mask in the peptide filter (referencing
self.exclude_modifications, patterns, mask, self.sequence_column, and df) by
adding a guard: if not self.exclude_modifications then set mask to a boolean
Series of True for df.index (i.e., keep all rows), otherwise build patterns =
"|".join(...) and compute mask =
~df[self.sequence_column].fillna("").astype(str).str.contains(patterns,
case=False, regex=True) as before.
In `@mokume/preprocessing/filters/protein.py`:
- Around line 63-66: The current vectorized contaminant filter builds
pattern_regex from self.patterns and then uses
upper_col.str.contains(pattern_regex), which with an empty self.patterns yields
an empty string and matches every row; add a guard: if not self.patterns (or
len(self.patterns) == 0) skip the contains call and set mask to a boolean Series
of True for all df rows (e.g., mask = pd.Series(True, index=df.index)) so no
rows are removed, otherwise build pattern_regex from self.patterns and apply
mask = ~upper_col.str.contains(pattern_regex, regex=True); reference the
variables/members self.patterns, self.protein_column, upper_col, pattern_regex
and mask to locate where to change.
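The three empty-pattern guards above share one shape: an empty pattern list must exclude nothing, because `"|".join([])` yields `""` and an empty regex matches every row. A pandas-free sketch of the idiom (names are illustrative, not the module's API):

```python
import re


def build_exclusion_regex(patterns):
    """Combine exclusion patterns into one compiled regex.

    "(?!)" is a negative lookahead that can never succeed, so an empty
    pattern list matches nothing and no rows are excluded.
    """
    if not patterns:
        return re.compile(r"(?!)")
    return re.compile("|".join(patterns), re.IGNORECASE)


def keep_rows(values, patterns):
    """Return the values NOT matched by any exclusion pattern."""
    rx = build_exclusion_regex(patterns)
    return [v for v in values if not rx.search(v)]
```

The same `"(?!)"` sentinel (or an all-True mask) works for the pandas `str.contains` variants in the comments above.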
🟠 Major comments (19)
mokume/quantification/ratio.py-323-346 (1)
323-346: ⚠️ Potential issue | 🟠 Major

Use parameterized queries to prevent SQL injection via the parquet file path. Line 327 interpolates `parquet_path` directly into the SQL string via f-string. If the path originates from user-controlled configuration and contains a crafted quote or SQL syntax, it can break out of the string literal and execute arbitrary DuckDB commands (e.g., exfiltrating data via `COPY ... TO`). Replace string interpolation with DuckDB's native parameterized query support:

Fix: Use parameterized query with placeholder

```diff
 conn.execute(
-    f"CREATE VIEW parquet_raw AS SELECT * FROM parquet_scan('{parquet_path}')"
+    "CREATE VIEW parquet_raw AS SELECT * FROM parquet_scan(?)",
+    [parquet_path],
 )
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/00_download_data.py-29-45 (1)
29-45: ⚠️ Potential issue | 🟠 Major

Partial downloads are not cleaned up on failure, causing silent corruption on retry. If the download fails mid-transfer, `urlretrieve` leaves a partial file on disk. The next run sees `dest.exists()` on line 22 and skips the download, silently using the corrupt file.

Proposed fix: remove partial file on error

```diff
 except Exception as e:
+    if dest.exists():
+        dest.unlink()
     print(f"\n    ERROR: {e}")
     return False
```

mokume/analysis/differential_expression.py-100-104 (1)
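The cleanup-on-failure pattern can be exercised standalone; a sketch assuming the same `dest.exists()` caching behavior as the script:

```python
import urllib.request
from pathlib import Path


def download(url: str, dest: Path) -> bool:
    """Download url to dest; remove any partial file if the transfer fails.

    Without the unlink, a partial file would satisfy dest.exists() on the
    next run and be silently treated as a complete download.
    """
    if dest.exists():
        return True
    try:
        urllib.request.urlretrieve(url, dest)
        return True
    except Exception:
        if dest.exists():
            dest.unlink()
        return False
```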
100-104: ⚠️ Potential issue | 🟠 Major

Zeros replaced with NaN even when `skip_log2=True`. Line 103 unconditionally replaces zeros with NaN. When `skip_log2=True` (data already log2-transformed), zero is a valid value (representing `log2(1) = 0`). This replacement would silently drop proteins with a true fold-change of 1× from the analysis.

Proposed fix: only replace zeros before log-transform

```diff
-intensity_matrix = intensity_matrix.replace(0, np.nan)
-log2_matrix = intensity_matrix if self.skip_log2 else np.log2(intensity_matrix)
+if self.skip_log2:
+    log2_matrix = intensity_matrix
+else:
+    intensity_matrix = intensity_matrix.replace(0, np.nan)
+    log2_matrix = np.log2(intensity_matrix)
```

mokume/analysis/differential_expression.py-76-78 (1)
76-78: ⚠️ Potential issue | 🟠 Major

Fragile assumption: first column is the protein identifier. `protein_col = protein_df.columns[0]` relies on column ordering rather than an explicit parameter. If a user passes a DataFrame with columns reordered (e.g., after a merge or join), this silently picks the wrong column, corrupting all downstream results. Consider accepting `protein_column` as a parameter (defaulting to `"ProteinName"`) for robustness, consistent with other methods in the codebase (e.g., `MaxLFQQuantification.quantify`).

Proposed fix

```diff
 def run(
     self,
     protein_df: pd.DataFrame,
     sample_to_condition: dict[str, str],
     contrast: tuple[str, str],
+    protein_column: str = "ProteinName",
 ) -> pd.DataFrame:
```

Then replace:

```diff
-        protein_col = protein_df.columns[0]
-        sample_cols = [c for c in protein_df.columns if c != protein_col]
+        protein_col = protein_column
+        sample_cols = [c for c in protein_df.columns if c != protein_col]
```

And propagate the parameter to `run_comparisons`:

```diff
 def run_comparisons(
     self,
     protein_df: pd.DataFrame,
     sample_to_condition: dict[str, str],
     contrasts: list[tuple[str, str]],
+    protein_column: str = "ProteinName",
 ) -> dict[str, pd.DataFrame]:
     ...
     for contrast in contrasts:
         key = f"{contrast[0]}-{contrast[1]}"
-        results[key] = self.run(protein_df, sample_to_condition, contrast)
+        results[key] = self.run(protein_df, sample_to_condition, contrast, protein_column)
```

mokume/io/feature.py-123-127 (1)
123-127: ⚠️ Potential issue | 🟠 Major

SQL injection / breakage if `database_path` contains a single quote. The path is interpolated directly into SQL via `str.format()`. A filename containing `'` (e.g., `/tmp/user's_data.parquet`) will produce malformed SQL. Use parameterized queries or escape the path.

Proposed fix

```diff
-        self.parquet_db.execute(
-            "CREATE VIEW parquet_db_raw AS SELECT * FROM parquet_scan('{}')".format(
-                database_path
-            )
-        )
+        safe_path = database_path.replace("'", "''")
+        self.parquet_db.execute(
+            f"CREATE VIEW parquet_db_raw AS SELECT * FROM parquet_scan('{safe_path}')"
+        )
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/05_cross_technology_correlation.py-64-73 (1)
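Quote-doubling is the standard way to embed a string in a SQL literal; in isolation (hypothetical helper name):

```python
def quote_sql_literal(text: str) -> str:
    """Wrap text in single quotes, doubling any embedded single quotes.

    Without the doubling, a path like /tmp/user's_data.parquet would
    terminate the SQL string literal early.
    """
    return "'" + text.replace("'", "''") + "'"
```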
64-73: ⚠️ Potential issue | 🟠 Major

Potential `KeyError` when DataFrame columns are non-string types. Lines 64-65 convert column values to strings for set intersection, but lines 72-73 index the original DataFrames with those string keys. If the parquet source yields integer column names (e.g., sample IDs stored as `int`), `.loc[..., common_samples]` will raise `KeyError` because the column index dtype won't match. The same pattern is repeated in `compute_per_protein_correlation` (lines 112-114, 121-122) and `compute_per_sample_correlation` (lines 143-145, 149-150).

Proposed fix: cast columns to string up-front in `load_quantified_data`

```diff
     df_wide = df.pivot(index="ProteinName", columns="SampleID", values="Intensity")
+    df_wide.columns = df_wide.columns.astype(str)
     return df_wide
```

This ensures all downstream string-based operations are consistent.
mokume/reports/workflow_comparison.py-287-292 (1)
287-292: ⚠️ Potential issue | 🟠 Major

XSS risk: unsanitized values injected into HTML and `<script>` blocks. (1) `title` and workflow names are interpolated directly into HTML (lines 291, 331) without escaping. If these ever contain user-controlled data with `<script>` or HTML entities, it's exploitable. (2) `json.dumps` on lines 417–420 does not escape `</script>` by default. A workflow name or protein name containing the literal string `</script>` would prematurely close the `<script>` tag, enabling injection. For (1), use `html.escape(title)`. For (2), a common defense is to post-process with `.replace("</", "<\\/")` or use `json.dumps(...).replace("<", "\\u003c")`.

Proposed fix sketch

```diff
+import html as html_mod
 ...
     return f"""<!DOCTYPE html>
 <html>
 <head>
     <meta charset="utf-8">
-    <title>{title}</title>
+    <title>{html_mod.escape(title)}</title>
 ...
-    <h1>{title}</h1>
+    <h1>{html_mod.escape(title)}</h1>
 ...
-const summaryRows = {json.dumps(summary_rows)};
-const pcaData = {json.dumps(pca_data)};
-const concordanceData = {json.dumps(concordance_json)};
-const fcScatterData = {json.dumps(fc_scatter_data)};
+const summaryRows = {json.dumps(summary_rows).replace("<", "\\u003c")};
+const pcaData = {json.dumps(pca_data).replace("<", "\\u003c")};
+const concordanceData = {json.dumps(concordance_json).replace("<", "\\u003c")};
+const fcScatterData = {json.dumps(fc_scatter_data).replace("<", "\\u003c")};
```

Also applies to: 401-420
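Both defenses can be verified without a browser; a stdlib-only sketch:

```python
import html
import json


def safe_script_payload(data) -> str:
    """Serialize data for embedding inside a <script> block.

    Replacing "<" with its \\u003c JSON escape keeps the payload
    semantically identical while preventing a literal "</script>"
    from closing the tag early.
    """
    return json.dumps(data).replace("<", "\\u003c")


def safe_html_text(text: str) -> str:
    """Escape user-controlled text before interpolating into HTML."""
    return html.escape(text)
```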
mokume/plotting/differential_expression.py-14-16 (1)
14-16: ⚠️ Potential issue | 🟠 Major

`matplotlib.use("Agg")` at import time is a global side effect. Importing this module forces the Agg backend for the entire process, which will break interactive plotting in Jupyter notebooks or GUI applications. Since these functions already accept `output_file` and return figures, the backend choice should be left to the caller. If Agg is needed for headless environments, set it conditionally or in the script entrypoint, not at library-module scope.

Proposed fix

```diff
 import matplotlib
-import seaborn as sns
-from sklearn.decomposition import PCA
-matplotlib.use("Agg")
+# Only set Agg backend if no display is available and no backend is already set
+import os
+if os.environ.get("DISPLAY") is None and matplotlib.get_backend() == "agg":
+    pass  # already non-interactive
+
+import seaborn as sns
+from sklearn.decomposition import PCA
```

Or simply remove the `matplotlib.use("Agg")` line and let callers control the backend.

mokume/preprocessing/aggregation.py-258-258 (1)
258-258:⚠️ Potential issue | 🟠 Major
`dropna(..., inplace=True)` mutates the caller's DataFrame. `merge_fractions`, `get_peptidoform_normalize_intensities`, and `sum_peptidoform_intensities` all call `dataset.dropna(subset=[NORM_INTENSITY], inplace=True)` before further processing. Since `inplace=True` modifies the DataFrame passed by the caller, this can cause surprising side effects upstream. Either operate on a copy or use the non-inplace form:

Proposed fix (example for merge_fractions)

```diff
-    dataset.dropna(subset=[NORM_INTENSITY], inplace=True)
+    dataset = dataset.dropna(subset=[NORM_INTENSITY])
```

Apply the same pattern to lines 294 and 327.

Also applies to: 294-294, 327-327
mokume/reports/qc_report.py-69-73 (1)
69-73: ⚠️ Potential issue | 🟠 Major

In log2 space, replacing 0 with NaN incorrectly drops proteins with intensity = 1. When `is_log2=True`, line 73 replaces 0 with NaN. But in log2 space, 0 is a valid value (it corresponds to a raw intensity of 1). This silently removes valid data points from all downstream QC metrics. Consider using a different sentinel or only replacing actual missing values (which should already be NaN in the input):

Proposed fix

```diff
 if not is_log2:
     intensity = intensity.replace(0, np.nan)
     log2_matrix = np.log2(intensity)
 else:
-    log2_matrix = intensity.replace(0, np.nan)
+    log2_matrix = intensity
```

mokume/preprocessing/aggregation.py-122-126 (1)
122-126: ⚠️ Potential issue | 🟠 Major

Regex injection: user-provided protein IDs are joined into a regex pattern without escaping. If the protein ID file contains characters with special regex meaning (`.`, `+`, `(`, `*`, etc.), the pattern will match unintended proteins. For example, a contaminant ID like `P12345.1` would match `P12345X1` due to the unescaped dot.

Proposed fix

```diff
+    import re as _re
     contaminants = [cont for cont in contaminants if cont.strip()]
-    cregex = "|".join(contaminants)
+    cregex = "|".join(_re.escape(cont) for cont in contaminants)
     return dataset[~dataset[protein_field].str.contains(cregex, regex=True)]
```

mokume/normalization/irs.py-96-106 (1)
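The effect of `re.escape` is easy to check in isolation (the IDs here are illustrative):

```python
import re

contaminant_ids = ["P12345.1", "KRT10+HUMAN"]

# Unescaped: "." is a regex wildcard, so "P12345.1" also matches "P12345X1"
raw_pattern = re.compile("|".join(contaminant_ids))

# Escaped: every metacharacter is matched literally
safe_pattern = re.compile("|".join(re.escape(c) for c in contaminant_ids))
```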
96-106: ⚠️ Potential issue | 🟠 Major

Negative reference intensities are not replaced with NaN despite the comment, and division-by-zero risk in scaling factors.
Line 98-99: The comment says "Replace 0 and negatives with NaN" but only
0is replaced. Negative values (e.g., from log-transformed input or data artifacts) would produceNaNfromnp.logsilently.Additionally, on line 106, if
plex_ref_intensity[plex]contains zeros for some proteins,global_ref / plex_ref_intensity[plex]producesinfscaling factors, which will silently corrupt downstream values intransform().Proposed fix
# Compute global reference intensity (geometric mean across plexes) ref_df = pd.DataFrame(plex_ref_intensity) - # Replace 0 and negatives with NaN for geometric mean - ref_df = ref_df.replace(0, np.nan) + # Replace 0 and negatives with NaN for geometric mean + ref_df[ref_df <= 0] = np.nanbenchmarks/quant-pxd007683-tmt-vs-lfq/scripts/09_quant_norm_interaction.py-304-353 (1)
304-353:⚠️ Potential issue | 🟠 Major
`x` and `quant_methods` leak from inner loop — misaligned tick labels and potential `NameError`. `x` (line 330) and `quant_methods` (line 326) are set inside the `for i, tech` loop. After the loop exits, lines 346–347 use them for `set_xticks`/`set_xticklabels`. Two problems:

- If the two technologies have different numbers of quantification methods, the ticks will only match the last technology processed — the other technology's bars will be misaligned.
- If both technologies have zero rows, `x` and `quant_methods` are never assigned → `NameError` at runtime.
benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/12_optimal_pipeline.py-462-510 (1)
462-510:⚠️ Potential issue | 🟠 MajorRadar chart is missing polar projection — renders as a Cartesian line plot.
plt.subplots(1, 2, ...)creates standard rectangular axes. Plotting angles as x-coordinates produces a meaningless Cartesian chart, not a radar/spider chart. You needsubplot_kw=dict(projection='polar').Proposed fix
- fig, axes = plt.subplots(1, 2, figsize=(14, 6)) + fig, axes = plt.subplots(1, 2, figsize=(14, 6), + subplot_kw=dict(projection='polar'))benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/12_optimal_pipeline.py-356-407 (1)
356-407:⚠️ Potential issue | 🟠 Major
run_pipelinereturns a partial dict on error — downstream code willKeyError.When an exception is caught at line 402,
resultsonly containstechnology,pipeline, and the config keys — it's missingwithin_cv,fc_rmse,n_proteins, etc. Downstream,main()andplot_pipeline_comparisonindex into these keys unconditionally (e.g.,tech_df["within_cv"].rank()), which will raiseKeyErroror produce silent NaN propagation.Either populate the missing keys with
np.nanin the error path, or filter out failed pipelines before downstream use.Proposed fix
except Exception as e: print(f" ERROR: {e}") import traceback traceback.print_exc() + # Fill missing metric keys so downstream doesn't break + for key in ["n_proteins", "n_yeast", "within_cv", "fc_rmse", "pct_good_cv"]: + results.setdefault(key, np.nan) return resultsbenchmarks/quant-pxd007683-tmt-vs-lfq/scripts/08_normalization_grid.py-155-179 (1)
155-179:⚠️ Potential issue | 🟠 MajorIncorrect quantile normalization implementation.
This implementation does not perform standard quantile normalization. Proper quantile normalization (1) ranks each column, (2) computes the mean value at each rank across all columns, then (3) replaces each value with the cross-column mean at its rank. The current code only sorts and re-indexes each column independently, which is effectively a no-op modulo tie-breaking.
Additionally, the nested
for col … for i, rank …loop is O(proteins × samples) with per-cell.ilocassignment, which will be very slow on large matrices.Consider using a well-known implementation (e.g., from
qnormpackage or the standard rank-then-average recipe):Standard quantile normalization sketch
elif method == "quantile": - from scipy.stats import rankdata - result = df.copy() - ranks = df.apply(lambda x: rankdata(x, method='average', nan_policy='omit'), axis=0) - sorted_means = df.apply(lambda x: np.sort(x.dropna().values), axis=0) - n_rows = len(df) - for col in df.columns: - col_ranks = ranks[col].values - col_sorted = np.sort(df[col].dropna().values) - for i, rank in enumerate(col_ranks): - if not np.isnan(rank) and not np.isnan(df[col].iloc[i]): - idx = int(rank) - 1 - if idx < len(col_sorted): - result.iloc[i, result.columns.get_loc(col)] = col_sorted[idx] - return result + ranked = df.rank(method="average") + sorted_vals = np.sort(df.values, axis=0) + mean_sorted = sorted_vals.mean(axis=1) + result = ranked.apply( + lambda col: np.interp(col, np.arange(1, len(mean_sorted) + 1), mean_sorted) + ) + # Restore NaN positions + result[df.isna()] = np.nan + return resultmokume/pipeline/stages.py-522-537 (1)
522-537:⚠️ Potential issue | 🟠 MajorNo guard for missing FASTA file in iBAQ quantification.
self.config.input.fasta_fileis passed directly toextract_fastaon line 531. If no FASTA was provided in the config, this will produce an opaque error. Consider an early validation:Suggested fix
def _quantify_ibaq(self, peptide_df: pd.DataFrame) -> pd.DataFrame: """Quantify using iBAQ method.""" from mokume.quantification.ibaq import extract_fasta + if not self.config.input.fasta_file: + raise ValueError( + "iBAQ quantification requires a FASTA file (--fasta)" + ) + proteins = peptide_df[PROTEIN_NAME].unique().tolist()mokume/pipeline/stages.py-730-736 (1)
730-736:⚠️ Potential issue | 🟠 MajorContrast parsing with
split("-")breaks on hyphenated condition names.Condition names like
"wild-type"or"drug-treated"are common in SDRF files. Splitting on"-"(line 732) will incorrectly parse"wild-type-control"as 3 parts instead of identifying the two conditions.Consider using a different delimiter (e.g.,
" vs ","__vs__") or requiring exactly one separator that is unlikely to appear in condition names.Suggested fix
if self.config.de.contrasts: for c in self.config.de.contrasts: - parts = c.split("-") - if len(parts) == 2: - contrasts.append((parts[0].strip(), parts[1].strip())) + # Support "A vs B" or "A-B" (first try " vs ", fallback to single "-") + if " vs " in c: + parts = c.split(" vs ", 1) + elif c.count("-") == 1: + parts = c.split("-", 1) else: - logger.warning(f"Invalid contrast format '{c}', expected 'A-B'") + logger.warning(f"Ambiguous contrast format '{c}', use 'A vs B'") + continue + contrasts.append((parts[0].strip(), parts[1].strip()))mokume/pipeline/stages.py-215-260 (1)
215-260:⚠️ Potential issue | 🟠 MajorMissing SDRF guard in
load_for_ratio.
detect_pooled_from_sdrf(line 231) anddetect_plexes_from_sdrf(line 249) are called withself.config.input.sdrfwithout first verifying it's notNone. Compare withapply_irs(line 366) which properly guards against this. IfsdrfisNone, these calls will likely fail with an unhelpful error.Suggested fix
def load_for_ratio(self) -> tuple: ... + if not self.config.input.sdrf: + raise ValueError( + "Ratio quantification requires an SDRF file (--sdrf)" + ) + # Detect reference samples (reuse IRS detection logic) ref_samples = detect_pooled_from_sdrf(self.config.input.sdrf)
🟡 Minor comments (33)
docs/user-guide/features2peptides.md-70-127 (1)
70-127:⚠️ Potential issue | 🟡 MinorInconsistent CLI option naming: underscores vs hyphens.
Lines 74–78 use underscores (
--min_aa,--min_unique,--remove_decoy_contaminants), while lines 109–126 use hyphens (--filter-min-intensity,--filter-cv-threshold). Most CLI tools (including Click) conventionally use hyphens. Consider unifying the convention.benchmarks/quant-pxd007683-tmt-vs-lfq/results/BENCHMARK_REPORT.md-30-40 (1)
30-40:⚠️ Potential issue | 🟡 MinorNarrative contradicts the data.
The table shows condition explains only ~1% of variance (with ~99% residual), yet line 39 states "Condition (yeast spike-in level) explains a significant portion of variance." This is misleading — 1% is negligible. The text should be revised to accurately reflect the findings, or the variance decomposition methodology should be re-examined.
benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/config.py-118-118 (1)
118-118:⚠️ Potential issue | 🟡 MinorNon-deterministic ordering from
list(set(...)).
setdoes not guarantee iteration order, soQUANTIFICATION_METHODSmay differ between runs. If any downstream code depends on consistent ordering (e.g., report tables, indexing), this can cause flaky results. Usedict.fromkeys(...)to deduplicate while preserving insertion order.Proposed fix
-QUANTIFICATION_METHODS = list(set(QUANTIFICATION_METHODS_TMT + QUANTIFICATION_METHODS_LFQ)) +QUANTIFICATION_METHODS = list(dict.fromkeys(QUANTIFICATION_METHODS_TMT + QUANTIFICATION_METHODS_LFQ))mokume/normalization/tmm.py-254-265 (1)
254-265:⚠️ Potential issue | 🟡 MinorEdge case: all-NaN or all-zero input passes the
X.emptycheck.
X.emptyonly checks shape (no rows or no columns). A DataFrame full of NaN/zero values would pass this check but cause_select_reference_sampleto produce NaN upper quartiles, leading to an error inidxmin()at line 128.Proposed guard
if X.empty: raise ValueError("Input DataFrame is empty") + # Check that there is at least some valid data + if X.replace(0, np.nan).dropna(how='all').empty: + raise ValueError("Input DataFrame contains no valid (non-zero, non-NaN) values") + # Handle log-transformed inputmokume/io/feature.py-518-522 (1)
518-522:⚠️ Potential issue | 🟡 Minor
CASTtoINTEGERwill throw on non-numeric run names.If
rundoesn't contain_and isn't numeric,CAST(run AS INTEGER)will raise a DuckDB error. Similarly, if the part after_isn't numeric. UseTRY_CASTwhich returnsNULLon failure instead.Proposed fix
- CASE WHEN position('_' in run) > 0 THEN CAST(split_part(run, '_', 2) AS INTEGER) - ELSE CAST(run AS INTEGER) END AS techreplicate + CASE WHEN position('_' in run) > 0 THEN TRY_CAST(split_part(run, '_', 2) AS INTEGER) + ELSE TRY_CAST(run AS INTEGER) END AS techreplicatemokume/io/feature.py-80-89 (1)
80-89:⚠️ Potential issue | 🟡 MinorLIKE wildcard characters in
contaminant_patternsare not escaped.If a pattern contains
%or_, these are interpreted as SQL LIKE wildcards rather than literal characters. The default patterns are safe, but sincecontaminant_patternsis a public field, external callers could supply patterns containing these characters and get unexpected match behavior.Proposed fix: escape LIKE wildcards
for pattern in self.contaminant_patterns: - # Escape any SQL special characters in the pattern safe_pattern = pattern.replace("'", "''") + safe_pattern = safe_pattern.replace("%", "\\%").replace("_", "\\_") pattern_conditions.append( - f"pg_accessions::text NOT LIKE '%{safe_pattern}%'" + f"pg_accessions::text NOT LIKE '%{safe_pattern}%' ESCAPE '\\'" )benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/05_cross_technology_correlation.py-212-238 (1)
212-238:⚠️ Potential issue | 🟡 MinorUnguarded division when
per_protein_dfis empty.If
compute_per_protein_correlationreturns an empty DataFrame (e.g., fewer than 3 common samples),len(per_protein_df)is 0 and line 224 will raiseZeroDivisionError.Proposed guard
+ if per_protein_df.empty: + return + fig, ax = plt.subplots(figsize=(10, 6))docs/user-guide/peptides2protein.md-31-36 (1)
31-36:⚠️ Potential issue | 🟡 MinorUnused import
AllPeptidesQuantificationin the Python example.
AllPeptidesQuantificationis imported on Line 34 but never used in the code snippet that follows. Either remove it or add an example demonstrating its usage.Proposed fix
from mokume.quantification import ( TopNQuantification, MaxLFQQuantification, - AllPeptidesQuantification, peptides_to_protein, )mokume/postprocessing/batch_correction.py-126-142 (1)
126-142:⚠️ Potential issue | 🟡 MinorRegex won't match purely alphabetic prefixes like
"abc_1".The pattern
^([a-zA-Z]+\d+)_requires at least one digit after the alphabetic chars. A sample ID like"abc_1"or"batch_3"will fall through to the full-string fallback instead of extracting"abc"or"batch"as the prefix. If that's intentional (onlyletter+digitcombos likep1,TMT10are valid plex prefixes), worth a brief comment in the code to clarify.If purely alphabetic prefixes should also be matched
- m = re.match(r"^([a-zA-Z]+\d+)_", s) + m = re.match(r"^([a-zA-Z]+\d*)_", s)Change
\d+to\d*to also capture prefixes without trailing digits.docs/reference/cli.md-3-3 (1)
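The `\d*` variant can be checked directly (`plex_prefix` is a hypothetical wrapper around the same pattern):

```python
import re


def plex_prefix(sample_id: str) -> str:
    """Extract a plex/batch prefix from IDs like 'p1_s3' or 'batch_3'.

    \\d* (zero or more digits) also accepts purely alphabetic prefixes,
    which the original \\d+ rejected.
    """
    m = re.match(r"^([a-zA-Z]+\d*)_", sample_id)
    return m.group(1) if m else sample_id
```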
3-3: ⚠️ Potential issue | 🟡 Minor

Command count is incorrect: five commands are documented, not four. The text says "four main commands" but the page documents five: `features2proteins`, `features2peptides`, `peptides2protein`, `correct-batches`, and `tsne-visualization`.

📝 Proposed fix

```diff
-mokume provides four main commands. Use `mokume --help` or `mokume <command> --help` for details.
+mokume provides five main commands. Use `mokume --help` or `mokume <command> --help` for details.
```

tests/test_batch_correction_integration.py-208-208 (1)
208-208: ⚠️ Potential issue | 🟡 Minor

Typo in test class name: `TestSDRFCovariatExtraction` → `TestSDRFCovariateExtraction`. Missing 'e' in "Covariate".
benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/02_variance_decomposition.py-306-313 (1)
306-313: ⚠️ Potential issue | 🟡 Minor

Hardcoded `.png` extension in print message doesn't match `FIGURE_FORMAT`. Line 311 correctly uses `FIGURE_FORMAT` for the actual file path, but line 313 hardcodes `.png` in the print message. If `FIGURE_FORMAT` is e.g. `"svg"`, the log will mislead.

Proposed fix

```diff
-    print(f"Combined PCA plot saved to: {FIGURES_DIR / 'pca_combined.png'}")
+    print(f"Combined PCA plot saved to: {FIGURES_DIR / f'pca_combined.{FIGURE_FORMAT}'}")
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/run_benchmark.py-51-57 (1)
51-57: ⚠️ Potential issue | 🟡 Minor

Redundant return code check — `check=True` already guarantees success. `subprocess.run(..., check=True)` raises `CalledProcessError` for any non-zero return code (caught on line 58). Therefore, `result.returncode == 0` on line 57 is always `True` when reached. Either remove `check=True` and rely on the returncode check, or simplify the success path.

Proposed fix (simplify)

```diff
 try:
     result = subprocess.run(
         [sys.executable, str(script_path)],
         cwd=str(SCRIPT_DIR),
         check=True,
     )
-    return result.returncode == 0
+    return True
 except subprocess.CalledProcessError as e:
```

mokume/reports/workflow_comparison.py-138-147 (1)
138-147: ⚠️ Potential issue | 🟡 Minor

Potential `ZeroDivisionError` if the merge on overlapping proteins yields an empty DataFrame.

`overlap` is non-empty here, but if `de_a` or `de_b` contain duplicate `ProteinName` entries that get filtered out during the merge (or if the `significance` column is missing for some rows), `merged_sig` could end up empty, causing `len(merged_sig) == 0` on line 145.

Proposed defensive guard

```diff
 if overlap:
     merged_sig = de_a[de_a["ProteinName"].isin(overlap)][["ProteinName", "significance"]].merge(
         de_b[de_b["ProteinName"].isin(overlap)][["ProteinName", "significance"]],
         on="ProteinName",
         suffixes=("_a", "_b"),
     )
-    direction_agree = (merged_sig["significance_a"] == merged_sig["significance_b"]).sum()
-    direction_concordance = float(direction_agree / len(merged_sig))
+    if not merged_sig.empty:
+        direction_agree = (merged_sig["significance_a"] == merged_sig["significance_b"]).sum()
+        direction_concordance = float(direction_agree / len(merged_sig))
+    else:
+        direction_concordance = float("nan")
 else:
     direction_concordance = float("nan")
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/01_grid_search_methods.py-160-193 (1)
160-193: ⚠️ Potential issue | 🟡 Minor

Unused `sample_conditions` parameter — function uses global instead.

`compute_cv_within_conditions` accepts `sample_conditions` as a parameter but never references it. Instead, it calls `get_condition()` on line 168, which reads from the module-global `SAMPLE_CONDITIONS`. This makes the parameter misleading and the function harder to test or reuse with different condition mappings.

Proposed fix: use the parameter

```diff
 def compute_cv_within_conditions(df_wide: pd.DataFrame, sample_conditions: dict) -> float:
     """Compute mean CV within conditions."""
     cvs = []
     # Map columns to conditions
     col_conditions = {}
     for col in df_wide.columns:
         sample_id = extract_sample_id(str(col))
-        condition = get_condition(sample_id)
+        condition = sample_conditions.get(sample_id, "Unknown")
         if condition != "Unknown":
             col_conditions[col] = condition
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/02_variance_decomposition.py-196-207 (1)
196-207: ⚠️ Potential issue | 🟡 Minor

Potential `IndexError` on `var_explained[1]` if PCA has only 1 component.

`plot_pca_by_condition` accesses `var_explained[0]` and `var_explained[1]` (lines 199–200) unconditionally. If `n_components` ends up as 1 (when there's only 1 sample or 1 feature), indexing `[1]` will raise an `IndexError`. The same pattern appears in `plot_combined_pca` at lines 231–232.

For the benchmark dataset this is unlikely, but a guard would be safer.

Proposed fix for plot_pca_by_condition

```diff
 ax.set_xlabel(f"PC1 ({var_explained[0]*100:.1f}% variance)")
-ax.set_ylabel(f"PC2 ({var_explained[1]*100:.1f}% variance)")
+ax.set_ylabel(f"PC2 ({var_explained[1]*100:.1f}% variance)" if len(var_explained) > 1 else "PC2")
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/04_stability_metrics.py-148-164 (1)
148-164: ⚠️ Potential issue | 🟡 Minor

Same `NameError` risk as `10_imputation_benchmark.py` if both technologies have no data.

`methods` and `x` are defined inside the inner loop and used outside it (lines 163-164). If both `tmt` and `lfq` have empty DataFrames, these variables are unbound.

mokume/reports/interactive.py-424-424 (1)
424-424: ⚠️ Potential issue | 🟡 Minor

Protein names with special characters create invalid HTML element IDs.

`tr.id = 'row-' + row.protein` — protein accessions often contain characters like `;`, `.`, `/`, which are invalid in HTML IDs. `document.getElementById('row-' + protein)` at line 444 will then fail to find the row, breaking the highlight-on-click interaction between the volcano plot and the table.

Consider using a CSS-safe encoding (e.g., a simple hash or index-based ID) and maintaining a JS lookup map from protein name to row ID.

mokume/reports/interactive.py-227-227 (1)
227-227: ⚠️ Potential issue | 🟡 Minor

Unpinned Plotly CDN (`plotly-latest.min.js`) may cause future breakage.

Using `plotly-latest.min.js` means any breaking change in a new Plotly release can silently break the generated reports. The QC report (`qc_report.py`) already pins `plotly-2.27.0.min.js` — this file should do the same for consistency and stability.

Proposed fix

```diff
- <script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
+ <script src="https://cdn.plot.ly/plotly-2.27.0.min.js"></script>
```

mokume/preprocessing/aggregation.py-186-186 (1)
186-186: ⚠️ Potential issue | 🟡 Minor

Hardcoded `"Condition"` string instead of the `CONDITION` constant.

Line 186 uses the string literal `"Condition"` while the module imports the `CONDITION` constant. Use the constant for consistency and to avoid silent breakage if the constant value changes.

Proposed fix

```diff
- data_df = data_df[(data_df["Condition"] != "Empty") | (data_df["Condition"].isnull())]
+ data_df = data_df[(data_df[CONDITION] != "Empty") | (data_df[CONDITION].isnull())]
```

mokume/reports/interactive.py-162-163 (1)
162-163: ⚠️ Potential issue | 🟡 Minor

Specify `encoding="utf-8"` when writing the HTML file.

The HTML declares `<meta charset="utf-8">` but the `open()` call uses the platform default encoding. On some systems (e.g., Windows), this defaults to a locale-specific encoding, which can corrupt non-ASCII characters (e.g., protein names or annotation text).

Proposed fix

```diff
- with open(output_html, "w") as f:
+ with open(output_html, "w", encoding="utf-8") as f:
```

mokume/reports/qc_report.py-491-492 (1)
491-492: ⚠️ Potential issue | 🟡 Minor

Specify `encoding="utf-8"` when writing the HTML file.

Same issue as in `interactive.py` — the HTML declares UTF-8 charset but the `open()` call uses platform-default encoding.

Proposed fix

```diff
- with open(output_html, "w") as f:
+ with open(output_html, "w", encoding="utf-8") as f:
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/04_stability_metrics.py-282-284 (1)
282-284: ⚠️ Potential issue | 🟡 Minor

Potential `KeyError` if `quality_within` dict is empty.

`assess_quality` returns `{}` when the input series is empty (lines 79-80). If `compute_cv_metrics` returns an empty `quality_within`, line 284's `metrics['quality_within']['good']` raises `KeyError`.

Proposed fix

```diff
- print(f"  Quality: {metrics['quality_within']['good']*100:.1f}% proteins with CV < 20%")
+ pct_good = metrics['quality_within'].get('good', 0) * 100
+ print(f"  Quality: {pct_good:.1f}% proteins with CV < 20%")
```

mokume/reports/qc_report.py-746-755 (1)
746-755: ⚠️ Potential issue | 🟡 Minor

`showTab` relies on implicit `event` object — fragile across JS environments.

Line 752 uses `event.target` without receiving `event` as a parameter. While this works via the global `event` in most browsers' inline handlers, it's not reliable in strict mode or all environments.

Proposed fix — pass the event explicitly

```diff
-function showTab(tabId) {{
+function showTab(tabId, evt) {{
     document.querySelectorAll('.tab-content').forEach(el => el.classList.remove('active'));
     document.querySelectorAll('.tab-btn').forEach(el => el.classList.remove('active'));
     document.getElementById('tab-' + tabId).classList.add('active');
-    event.target.classList.add('active');
+    evt.target.classList.add('active');
```

And update the onclick attributes:

```diff
- <button class="tab-btn active" onclick="showTab('sample-qc')">Sample QC</button>
- <button class="tab-btn" onclick="showTab('reproducibility')">Reproducibility</button>
- <button class="tab-btn" onclick="showTab('de-quality')">DE Quality</button>
+ <button class="tab-btn active" onclick="showTab('sample-qc', event)">Sample QC</button>
+ <button class="tab-btn" onclick="showTab('reproducibility', event)">Reproducibility</button>
+ <button class="tab-btn" onclick="showTab('de-quality', event)">DE Quality</button>
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/10_imputation_benchmark.py-295-307 (1)
295-307: ⚠️ Potential issue | 🟡 Minor

Potential `NameError` if both technologies have no data for a metric.

If both `tmt` and `lfq` DataFrames are empty, the inner loop (lines 286-300) executes `continue` for both iterations, leaving `x` and `methods` unbound. Lines 305-306 then raise `NameError`.

Proposed fix

```diff
 for ax, (metric, title, lower_better) in zip(axes.flat, metrics):
+    x = np.array([])
+    methods = []
     for i, tech in enumerate(["tmt", "lfq"]):
         tech_df = results_df[results_df["technology"] == tech]
         if len(tech_df) == 0:
             continue
         methods = tech_df["imputation"].tolist()
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/03_fold_change_accuracy.py-397-404 (1)
397-404: ⚠️ Potential issue | 🟡 Minor

Hardcoded `.png` in print statement — should use `FIGURE_FORMAT`.

Line 402 correctly uses `FIGURE_FORMAT` for the actual saved file, but line 404 hardcodes `.png` in the print message. If `FIGURE_FORMAT` is `"svg"` or `"pdf"`, the logged path won't match the actual file.

```diff
- print(f"Comparison plot saved to: {FIGURES_DIR / 'fold_change_comparison.png'}")
+ print(f"Comparison plot saved to: {FIGURES_DIR / f'fold_change_comparison.{FIGURE_FORMAT}'}")
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/07_tmm_normalization_benchmark.py-132-134 (1)
132-134: ⚠️ Potential issue | 🟡 Minor

Division by zero in `cv_improvement` when `within_cv_median` is zero.

If `metrics_before['within_cv_median']` is 0 (e.g., all proteins have zero variance within conditions), line 134 will raise `ZeroDivisionError`.

Proposed fix

```diff
- cv_improvement = (metrics_before['within_cv_median'] - metrics_after['within_cv_median']) / metrics_before['within_cv_median'] * 100
+ if metrics_before['within_cv_median'] and metrics_before['within_cv_median'] != 0:
+     cv_improvement = (metrics_before['within_cv_median'] - metrics_after['within_cv_median']) / metrics_before['within_cv_median'] * 100
+ else:
+     cv_improvement = np.nan
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/07_tmm_normalization_benchmark.py-376-383 (1)
376-383: ⚠️ Potential issue | 🟡 Minor

Unsafe `.values[0]` access — `IndexError` if `maxlfq` was not benchmarked for a technology.

Line 383 filters `tech_data` for `method == 'maxlfq'` and immediately accesses `.values[0]`. If `maxlfq` was skipped (e.g., file not found), the filtered result is empty and `.values[0]` raises `IndexError`.

Proposed fix

```diff
- print(f"  Factor variation (maxlfq): {tech_data[tech_data['method']=='maxlfq']['factor_std'].values[0]:.3f}")
+ maxlfq_data = tech_data[tech_data['method'] == 'maxlfq']
+ if len(maxlfq_data) > 0:
+     print(f"  Factor variation (maxlfq): {maxlfq_data['factor_std'].values[0]:.3f}")
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/09_quant_norm_interaction.py-89-93 (1)
89-93: ⚠️ Potential issue | 🟡 Minor

Division by zero when a sample median is zero.

If any column in `df` is all-NaN or all-zero, `sample_medians` will contain 0 or NaN, and `global_median / sample_medians` will produce `inf` values silently propagated through the data. Consider guarding against zero medians.

Proposed fix

```diff
 elif method == "globalmedian":
     global_median = df.median().median()
     sample_medians = df.median()
+    sample_medians = sample_medians.replace(0, np.nan)
     factors = global_median / sample_medians
     return df * factors
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/03_fold_change_accuracy.py-114-143 (1)
114-143: ⚠️ Potential issue | 🟡 Minor

No zero filtering before computing log2 fold-change — potential `inf` values.

`mean1` and `mean2` (lines 118-119) can contain zeros when proteins are undetected in a condition. `np.log2(mean1 / mean2)` then produces `inf`, `-inf`, or `NaN`. While `pd.notna` on line 127 filters `NaN`, it does not filter `inf`/`-inf`, so infinite fold-changes will be included in the results and skew metrics.

Proposed fix

```diff
 # Compute fold-change (cond1 / cond2)
-log2_fc = np.log2(mean1 / mean2)
+ratio = mean1 / mean2
+ratio = ratio.replace([np.inf, -np.inf, 0], np.nan)
+log2_fc = np.log2(ratio)
```

mokume/pipeline/stages.py-501-520 (1)
501-520: ⚠️ Potential issue | 🟡 Minor

Fragile fallback for intensity column detection.

Lines 506–509: If `"Intensity"` is not present, the code falls back to the last numeric column. This is a silent assumption that could pick an unrelated column (e.g., a count or index column). Consider logging a warning when using the fallback, or raising if the expected column is missing.

Suggested improvement

```diff
 if intensity_col not in long_df.columns:
-    # Fallback: use the last numeric column
     numeric_cols = long_df.select_dtypes(include=[np.number]).columns
     if len(numeric_cols) > 0:
         intensity_col = numeric_cols[-1]
+        logger.warning(
+            f"'Intensity' column not found, falling back to '{intensity_col}'"
+        )
+    else:
+        raise ValueError(
+            f"No numeric columns found in quantification output. "
+            f"Columns: {list(long_df.columns)}"
+        )
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/11_preprocessing_filters.py-267-275 (1)
267-275: ⚠️ Potential issue | 🟡 Minor

Guard against NaN/zero in median normalization.

If any sample has all-NaN values after the `dropna(thresh=3)` row filter, `sample_medians` will contain NaN, and `global_median / sample_medians` silently produces NaN factors, zeroing out that sample's data after multiplication. Consider adding a check:

Suggested guard

```diff
 global_median = df_quant.median().median()
 sample_medians = df_quant.median()
-factors = global_median / sample_medians
-df_norm = df_quant * factors
+if sample_medians.isna().any() or (sample_medians == 0).any():
+    print("  WARNING: some samples have NaN/zero median, skipping normalization")
+    df_norm = df_quant
+else:
+    factors = global_median / sample_medians
+    df_norm = df_quant * factors
```

mokume/pipeline/stages.py-481-486 (1)
481-486: ⚠️ Potential issue | 🟡 Minor

Non-numeric `top` suffix silently defaults to `n=3`.

A method like `"topx"` passes the `startswith("top")` check but fails the regex, silently defaulting to `n=3`. Consider warning or raising for unparseable values.

Suggested fix

```diff
 elif quant_method.startswith("top"):
     match = re.match(r"top(\d+)", quant_method)
     if match:
         n = int(match.group(1))
     else:
-        n = 3
+        logger.warning(
+            f"Could not parse N from '{quant_method}', defaulting to top3"
+        )
+        n = 3
```
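The silent-default behaviour called out above is easy to reproduce in isolation. A minimal sketch — the `parse_top_n` helper is hypothetical, standing in for the parsing logic in `stages.py`:

```python
import re

def parse_top_n(quant_method: str, default: int = 3) -> int:
    """Parse N from a method name like 'top3'; fall back to default if unparseable.

    A name like 'topx' passes a startswith('top') check but fails the
    regex, which is why a silent default can mask user typos.
    """
    match = re.match(r"top(\d+)", quant_method)
    if match:
        return int(match.group(1))
    # The reviewed code silently returns 3 here; emitting a warning
    # (or raising) makes the fallback visible to the user.
    return default

assert parse_top_n("top3") == 3
assert parse_top_n("top10") == 10
assert parse_top_n("topx") == 3  # the silent fallback the review warns about
```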
```python
for _, row in yeast_df.iterrows():
    lines.append(
        f"| {row['technology'].upper()} | {row['comparison']} | "
        f"{row['expected_log2_fc']:.2f} | {row['observed_median_log2_fc']:.2f} | "
        f"{row.get('compression_ratio', 'N/A'):.2f} | {row['rmse']:.3f} |"
    )
```

Bug: `:.2f` format applied to string fallback `'N/A'` raises `ValueError`.

When `compression_ratio` is not present in the row, `row.get('compression_ratio', 'N/A')` returns the string `'N/A'`, which is then formatted with `:.2f`, causing a `ValueError` at runtime.
Proposed fix
for _, row in yeast_df.iterrows():
+ compression = row.get('compression_ratio', None)
+ compression_str = f"{compression:.2f}" if pd.notna(compression) else "N/A"
lines.append(
f"| {row['technology'].upper()} | {row['comparison']} | "
- f"{row['expected_log2_fc']:.2f} | {row['observed_median_log2_fc']:.2f} | "
- f"{row.get('compression_ratio', 'N/A'):.2f} | {row['rmse']:.3f} |"
+ f"{row['expected_log2_fc']:.2f} | {row['observed_median_log2_fc']:.2f} | "
+ f"{compression_str} | {row['rmse']:.3f} |"
🤖 Prompt for AI Agents

In `benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/06_generate_report.py` around lines 176-181, the f-string in the loop over `yeast_df.iterrows()` (inside the `lines.append` call) applies the numeric format specifier `:.2f` to `row.get('compression_ratio', 'N/A')`, which raises `ValueError` when the fallback `'N/A'` is used; fix by extracting `compression_ratio` into a variable (e.g., `comp = row.get('compression_ratio')`), then build a `comp_str` that is `"N/A"` when `comp` is None or non-numeric (or `pandas.isna(comp)`), otherwise format it with `"{:.2f}"`, and use `comp_str` in the `lines.append` f-string instead of applying `:.2f` directly to the `get(...)` expression.
```python
# Vectorized contaminant matching using regex OR pattern
upper_col = df[self.protein_column].fillna("").astype(str).str.upper()
pattern_regex = "|".join(re.escape(p.upper()) for p in self.patterns)
mask = ~upper_col.str.contains(pattern_regex, regex=True)
```

Empty patterns list causes all rows to be removed.

If `self.patterns` is an empty list, `pattern_regex` becomes an empty string `""`. `str.contains("")` returns `True` for every row, so the negated mask excludes the entire DataFrame. Add a guard for empty patterns.
🐛 Proposed fix

```diff
 # Vectorized contaminant matching using regex OR pattern
+if not self.patterns:
+    logger.debug("%s: No contaminant patterns provided, skipping filter", self.name)
+    return df, self._create_result(input_count, input_count)
+
 upper_col = df[self.protein_column].fillna("").astype(str).str.upper()
 pattern_regex = "|".join(re.escape(p.upper()) for p in self.patterns)
 mask = ~upper_col.str.contains(pattern_regex, regex=True)
```
🤖 Prompt for AI Agents

In `mokume/preprocessing/filters/protein.py` around lines 63-66, the vectorized contaminant filter builds `pattern_regex` from `self.patterns` and then uses `upper_col.str.contains(pattern_regex)`, which with an empty `self.patterns` yields an empty string and matches every row; add a guard: if `self.patterns` is empty (`if not self.patterns` or `len(self.patterns) == 0`), skip the `contains` call and set `mask` to a boolean Series of True for all `df` rows (e.g., `mask = pd.Series(True, index=df.index)`) so no rows are removed; otherwise build `pattern_regex` from `self.patterns` and apply `mask = ~upper_col.str.contains(pattern_regex, regex=True)`; reference the variables/members `self.patterns`, `self.protein_column`, `upper_col`, `pattern_regex` and `mask` to locate where to change.
Actionable comments posted: 9
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (7)
mokume/normalization/peptide.py (2)
177-186: ⚠️ Potential issue | 🟠 Major

Writers are not closed if an exception occurs in the processing loop — resource leak.

`WriteCSVTask` and `WriteParquetTask` are started here but only closed at lines 362-365. If any exception is raised during the `for samples, df in feature.iter_samples()` loop, these writers will never be closed, potentially leaving background threads running and output files corrupted or incomplete. Wrap the processing and close logic in `try`/`finally`.

Proposed fix (sketch)

```diff
+try:
     for samples, df in feature.iter_samples():
         ...  # existing loop body
+finally:
     if write_csv:
         write_csv_task.close()
     if save_parquet:
         writer_parquet_task.close()
```

Also applies to: 362-365

352-353: ⚠️ Potential issue | 🟡 Minor

`np.log2` applied without guarding against zero or negative intensities.

If any `NORM_INTENSITY` values are zero or negative after normalization/IRS scaling, this will produce `-inf` or `NaN` silently. Consider filtering or clamping before the transform, or at minimum logging a warning when non-positive values are encountered.

mokume/postprocessing/batch_correction.py (2)
287-295: ⚠️ Potential issue | 🟠 Major

Substring-based partial matching for sample IDs is fragile.

`sample_id in sdrf_sample or sdrf_sample in sample_id` will match `"S1"` against `"S10"`, `"S11"`, etc. This could silently assign wrong covariate values. Consider using a stricter matching strategy, or at minimum prefer exact matches first and only fall back to substring with a warning.

268-276: ⚠️ Potential issue | 🟡 Minor

Partial column matching is bidirectional and could match unintended columns.

`col_lower in sdrf_col or sdrf_col in col_lower` means a short covariate name like `"sex"` would match any column containing `"sex"` (e.g., `characteristics[sex]` — intended, but also `characteristics[osexorientation]` — not intended). Consider requiring that the match target contains the full bracketed form, or use a more specific matching pattern.

mokume/preprocessing/filters/peptide.py (1)
269-291: ⚠️ Potential issue | 🟠 Major

Missed cleavage counting has a real off-by-one error affecting trypsin sequences.

Lines 274–275 remove the last character to exclude the C-terminal cleavage site, but then lines 280–287 loop through the truncated string. When K/R appears at the penultimate position of the original sequence (e.g., `KA`, `KR`), it becomes the last character of the truncated string. The condition at line 283 (`if i + 1 < len(seq_str)`) fails, and the `elif` at lines 285–287 explicitly does not count it as a missed cleavage.

This violates the trypsin cleavage model (Keil rule): K/R followed by any non-proline residue should be counted as a missed cleavage. Test cases confirm:

- `KA` (length 2) → truncates to `K`, returns 0 (incorrect; should be 1)
- `KR` (length 2) → truncates to `K`, returns 0 (incorrect; should be 1)
- `KAR` (length 3) → truncates to `KA`, returns 1 (correct; K–A is counted)

Additionally, there are no unit tests for `MissedCleavageFilter`, leaving this logic bug undetected.

Fix: check the full sequence but skip counting if K/R is truly at the terminal position, rather than truncating before analysis.

mokume/quantification/directlfq.py (1)
298-300: ⚠️ Potential issue | 🟡 Minor

Use `raise ... from e` to preserve the exception chain.

The original traceback is lost when re-raising. This makes debugging DirectLFQ failures harder since the root-cause stack trace is discarded.

Proposed fix

```diff
 except Exception as e:
     logger.error(f"DirectLFQ failed: {e}")
-    raise RuntimeError(f"DirectLFQ quantification failed: {e}")
+    raise RuntimeError(f"DirectLFQ quantification failed: {e}") from e
```

tests/test_quantification_accuracy.py (1)
69-81: ⚠️ Potential issue | 🟡 Minor

Network call without timeout could hang CI indefinitely.

`urllib.request.urlretrieve` on line 72 has no timeout. If the PRIDE FTP server is slow or unresponsive, this will block indefinitely. Also, `zf.extractall` on line 79 is susceptible to zip-slip attacks if the archive contains path-traversal entries — though the risk is lower in a test context.

🛡️ Proposed fix for timeout

```diff
- urllib.request.urlretrieve(PRIDE_DATASET_URL, zip_path)
+ import urllib.request as _req
+ _req.urlretrieve(PRIDE_DATASET_URL, zip_path)  # consider using requests with timeout
```

A more robust approach:

```python
import shutil
import urllib.request

req = urllib.request.Request(PRIDE_DATASET_URL)
with urllib.request.urlopen(req, timeout=300) as resp, open(zip_path, "wb") as out:
    shutil.copyfileobj(resp, out)
```
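As an aside on the `raise ... from e` recommendation a few comments up: chaining preserves the root cause as `__cause__`, so both tracebacks appear in the error output. A self-contained sketch (the `run_directlfq_like` function is hypothetical, standing in for the DirectLFQ wrapper):

```python
def run_directlfq_like():
    """Illustrates exception chaining with 'raise ... from e'."""
    try:
        # Stand-in for whatever actually fails inside DirectLFQ.
        raise ValueError("input matrix is empty")
    except Exception as e:
        # 'from e' keeps the original exception attached as __cause__.
        raise RuntimeError(f"DirectLFQ quantification failed: {e}") from e

try:
    run_directlfq_like()
except RuntimeError as err:
    # The original exception survives, so tracebacks show both frames.
    assert isinstance(err.__cause__, ValueError)
    assert "input matrix is empty" in str(err.__cause__)
```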
🤖 Fix all issues with AI agents

In `@benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/05_cross_technology_correlation.py`:

- Around lines 55-105: The bug is that `compute_overall_correlation` (and the related `compute_per_protein_correlation` and `compute_per_sample_correlation`) cast column labels to strings to find `common_samples` but then use those string IDs to index the original DataFrames whose columns may be ints, causing `KeyError`; fix by finding common sample labels without changing their dtype (e.g., use `tmt_wide.columns.intersection(lfq_wide.columns)` or compute `common_samples = [c for c in tmt_wide.columns if c in lfq_wide.columns]`) so the returned `common_samples` keep the original type and can be used directly for indexing in `tmt_wide.loc[...]` and `lfq_wide.loc[...]`, and apply the same change in `compute_per_protein_correlation` and `compute_per_sample_correlation`.
In `@benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/06_generate_report.py`:

- Around lines 176-181: The f-string in the loop over `yeast_df.iterrows()` applies the numeric format specifier `:.2f` to `row.get('compression_ratio', 'N/A')`, which raises `ValueError` when the value is the string `'N/A'` or non-numeric/NaN; update the code in the loop that builds `lines` so you detect and normalize `compression_ratio` (and any other possibly-NaN fields like `expected_log2_fc` or `observed_median_log2_fc`) before formatting: retrieve `compression_ratio` into a local variable, use `pandas.isna()` or `isinstance(..., numbers.Number)` to decide whether to format as `"{:.2f}"` or use a placeholder like `"N/A"`, and then use the prepared string in the f-string instead of applying `:.2f` directly to the raw `row.get(...)` expression.
In `@benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/08_normalization_grid.py`:

- Around lines 155-179: The quantile branch (`method == "quantile"`) is incorrect: `sorted_means` is unused, fractional ranks from `rankdata` are truncated, the code maps values to the same-column sorted value (no cross-column normalization), and it uses a slow nested Python loop; replace this block to perform true quantile normalization by computing the mean across columns at each rank and mapping each value to that cross-column mean (handling fractional ranks by averaging appropriate quantile positions), or simply call the existing, tested implementation `mokume.normalization.protein.quantile_normalize` instead of the custom logic; update/remove the unused variable `sorted_means` and eliminate the O(n×m) loops (use vectorized rank/argsort operations or the mokume helper) so `df` → `result` is the proper cross-column ranked-mean mapping.
In `@benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/12_optimal_pipeline.py`:

- Around lines 462-510: The `plot_radar_chart` function is creating regular Cartesian Axes (via `plt.subplots`), so the computed angles are plotted on a linear axis instead of a polar/spider plot; change the axes creation to polar axes for each subplot (use `projection="polar"` when creating the axes for the two panels), ensure the angles (variable `angles`) remain in radians and are used as the theta coordinates, and keep the rest of the plotting logic (`values`, `ax.plot`, `ax.fill`, `ax.set_xticks`/`set_xticklabels`, `ax.set_title`, `ax.legend`) but apply it to the polar Axes returned for `plot_radar_chart` so the radar/spider visualization renders correctly.
In `@mokume/analysis/differential_expression.py`:

- Around lines 100-104: The current code always replaces zeros with NaN before deciding whether to skip log2, which drops valid zeros when `skip_log2` is True; change the logic in the block that builds `intensity_matrix`/`log2_matrix` so that `intensity_matrix = protein_df.set_index(protein_col)[sample_cols]` is created first, then if `self.skip_log2` is False perform zero/negative handling (e.g., replace 0 with `np.nan` and handle negatives) and then compute `log2_matrix = np.log2(intensity_matrix)`, otherwise (`self.skip_log2` True) set `log2_matrix = intensity_matrix` without modifying zeros; update references to `intensity_matrix`/`log2_matrix` accordingly (look for `intensity_matrix`, `log2_matrix`, `self.skip_log2`, `protein_df.set_index`, `sample_cols`).
In `@mokume/io/feature.py`:

- Around lines 515-527: The SQL in `get_irs_scaling_factors` references a non-existent column `"mixture"` when building `irs_df` from the `parquet_db` view; either add a `"mixture"` column to the `parquet_db` (e.g. modify `enrich_with_sdrf` to populate a mixture alias from SDRF/sample metadata) or change the query and subsequent post-processing to use an existing column (for example `"condition"` or `"sample_accession"`) consistently; update the SELECT, GROUP BY and any downstream uses of `irs_df` that expect `mixture` so they reference the chosen existing column or the new mixture alias.
In `@mokume/preprocessing/filters/peptide.py`:

- Around lines 414-417: When `self._compiled_patterns` is empty, `combined_pattern` becomes `""` and `str.contains("")` matches every row, so change the logic in the filtering block to short-circuit: if `self._compiled_patterns` is empty, set `mask` to a Series of True for all of `df.index` (or simply return `df.copy()`) so no rows are filtered; otherwise keep the existing `combined_pattern`/`str.contains` flow that assigns `mask` (referencing `self._compiled_patterns`, `combined_pattern`, `mask`, `df` and `col`).
- Around lines 199-205: The current filter building in the peptide filter uses `patterns = "|".join(...)` on `self.exclude_modifications`, which yields an empty string for an empty list and causes `str.contains("", regex=True)` to match every row; update the logic in the method that computes `mask` (referencing `self.exclude_modifications`, `self.sequence_column`, `df`, `mask`, `filtered_df`) to first check if `self.exclude_modifications` is empty and in that case skip the regex filter (i.e., set `mask` to True for all rows or simply copy `df` to `filtered_df`), otherwise build the joined pattern and apply `str.contains` as before so the empty-list case does not remove all rows.
In `@mokume/preprocessing/filters/protein.py`:

- Around lines 63-66: The current vectorized contaminant matching in the method using `self.patterns` builds `pattern_regex` and then computes `mask = ~upper_col.str.contains(pattern_regex, ...)`, which when `self.patterns` is empty creates an empty regex that matches every row and removes all rows; fix by adding an early check for an empty patterns list (`if not self.patterns` or similar) and return the original DataFrame (or set `mask` to all True) before constructing `pattern_regex`; update uses of `self.protein_column`, `pattern_regex`, and `mask` accordingly so the function preserves rows when there are no patterns.
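Several of these prompts touch `mokume/preprocessing/filters/peptide.py`; the missed-cleavage fix from the out-of-diff comment above (iterate the full sequence, skip the terminal residue, apply the Keil rule) can be sketched as a standalone counter — a hypothetical helper, not the `MissedCleavageFilter` implementation:

```python
def count_missed_cleavages(sequence: str) -> int:
    """Count internal K/R not followed by P (Keil rule for trypsin).

    The C-terminal residue is excluded as a cleavage *position* by
    iterating over the full sequence but skipping the last index,
    rather than truncating the string first.
    """
    seq = sequence.upper()
    count = 0
    for i in range(len(seq) - 1):  # last residue is the peptide terminus
        if seq[i] in "KR" and seq[i + 1] != "P":
            count += 1
    return count

# The cases from the review discussion:
assert count_missed_cleavages("KA") == 1   # truncation-based code returned 0
assert count_missed_cleavages("KR") == 1   # K–R counted; terminal R is not
assert count_missed_cleavages("KAR") == 1
assert count_missed_cleavages("KPR") == 0  # K before P is not cleaved
```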
🟠 Major comments (21)
mokume/normalization/irs.py-96-106 (1)
96-106: ⚠️ Potential issue | 🟠 Major

Division by zero when computing IRS scaling factors.

`plex_ref_intensity[plex]` (line 106) retains the original values from lines 90-94 (which may contain zeros), but only `ref_df` (the copy used for geometric-mean computation) has zeros replaced with NaN on line 99. When a protein has zero reference intensity in a plex, `global_ref / plex_ref_intensity[plex]` produces `inf`, which propagates through `transform()` into the final protein matrix.

Guard against zero/NaN denominators:

🐛 Proposed fix

```diff
 # Compute scaling factors: global_ref / plex_ref for each plex
 scaling_factors = {}
 for plex in plexes:
-    scaling_factors[plex] = global_ref / plex_ref_intensity[plex]
+    denom = plex_ref_intensity[plex].replace(0, np.nan)
+    scaling_factors[plex] = global_ref / denom
```

pyproject.toml-47-47 (1)
47-47: ⚠️ Potential issue | 🟠 Major

Move `statsmodels` to an optional dependency group since differential expression analysis is an optional pipeline feature.

`statsmodels` is imported solely in `mokume/analysis/differential_expression.py` for multiple-testing correction, and the DE feature itself is disabled by default (`n_expression: bool = False`). Following the existing pattern for optional features (e.g., `directlfq`, `batch-correction`), add a `de = ["statsmodels"]` extra to keep the base install lightweight.

mokume/reports/qc_report.py-302-345 (1)
302-345: ⚠️ Potential issue | 🟠 Major

`_compute_variance_decomposition` crashes when `n_components` evaluates to 0.

If there are fewer than 2 samples, `data.shape[0] - 1` can be 0 (or negative with 0 samples), and `PCA(n_components=0)` will raise. Unlike `_compute_silhouette` (which guards `len(data) < 3`), this function has no such guard.

Proposed fix

```diff
 n_components = min(10, data.shape[0] - 1, data.shape[1])
+if n_components < 1:
+    return {
+        "condition_pct": 0.0,
+        "plex_pct": 0.0,
+        "residual_pct": 100.0,
+        "n_pcs_used": 0,
+        "total_var_captured": 0.0,
+    }
 pca = PCA(n_components=n_components)
```

mokume/quantification/ratio.py-326-328 (1)
326-328: ⚠️ Potential issue | 🟠 Major

Use parameterized queries for `parquet_path` to prevent SQL injection and path parsing errors.

Direct f-string interpolation of `parquet_path` into the SQL string is unsafe. If the path contains a single quote or special characters, it will break the SQL syntax. DuckDB supports parameterized queries for `parquet_scan()`, so use the `?` placeholder instead.

Proposed fix

```diff
 conn.execute(
-    f"CREATE VIEW parquet_raw AS SELECT * FROM parquet_scan('{parquet_path}')"
+    "CREATE VIEW parquet_raw AS SELECT * FROM parquet_scan(?)",
+    [parquet_path],
 )
```

mokume/preprocessing/aggregation.py-148-156 (1)
148-156: ⚠️ Potential issue | 🟠 Major

Potential `AttributeError` when `choice` is `None` for non-LFQ labels.

The `choice` parameter is `Optional[IsobaricLabel]`, so it can be `None`. If `label` is not `LFQ` and `choice` is `None`, line 154 calls `choice.channels()` on `None`.

Proposed fix: add a guard

```diff
 if label == QuantificationCategory.LFQ:
     data_df.drop(CHANNEL, inplace=True, axis=1)
 else:
+    if choice is None:
+        raise ValueError("Isobaric label choice must be provided for non-LFQ quantification")
     data_df[CHANNEL] = data_df[CHANNEL].map(choice.channels())
```

mokume/preprocessing/aggregation.py-97-99 (1)
97-99: ⚠️ Potential issue | 🟠 Major

`str.contains` without `na=False` will raise on NaN values. If `protein_field` contains any NaN values, `str.contains(cregex)` returns NaN for those rows, and the boolean indexing with `~` will raise a `ValueError`. The same issue exists in `remove_protein_by_ids` at Line 126.

Proposed fix:

```diff
-return dataset[~dataset[protein_field].str.contains(cregex)]
+return dataset[~dataset[protein_field].str.contains(cregex, na=False)]
```

And in `remove_protein_by_ids`:

```diff
-return dataset[~dataset[protein_field].str.contains(cregex, regex=True)]
+return dataset[~dataset[protein_field].str.contains(cregex, regex=True, na=False)]
```

mokume/io/feature.py-307-316 (1)
307-316: ⚠️ Potential issue | 🟠 Major

SQL injection risk in `get_report_from_database` and `get_report_condition_from_database`. `tuple(samples)` and `tuple(cons)` are interpolated directly into SQL strings via `.format()` / f-strings (Lines 311 and 405). Sample accession strings containing quotes will break or inject SQL. Additionally, if `samples` has exactly one element, `tuple(samples)` produces `('value',)` with a trailing comma — DuckDB handles this, but it's fragile.

Proposed fix using parameterized queries:

```diff
 def get_report_from_database(self, samples: list, columns: list = None):
     cols = ",".join(columns) if columns is not None else "*"
+    placeholders = ",".join(["?"] * len(samples))
     database = self.parquet_db.sql(
-        """SELECT {} FROM parquet_db WHERE sample_accession IN {}""".format(
-            cols, tuple(samples)
-        )
+        f"SELECT {cols} FROM parquet_db WHERE sample_accession IN ({placeholders})",
+        params=samples,
     )
```

mokume/preprocessing/aggregation.py-122-126 (1)
122-126: ⚠️ Potential issue | 🟠 Major

Protein IDs from file are used as regex without escaping special characters. If any protein ID in the file contains regex metacharacters (e.g., `.`, `(`, `+`), the resulting regex will either match unintended strings or raise a `re.error`. Use `re.escape` on each identifier.

Proposed fix:

```diff
+import re as _re
 contaminants = [cont for cont in contaminants if cont.strip()]
-cregex = "|".join(contaminants)
+cregex = "|".join(_re.escape(cont) for cont in contaminants)
 return dataset[~dataset[protein_field].str.contains(cregex, regex=True, na=False)]
```

mokume/io/feature.py-116-155 (1)
116-155: ⚠️ Potential issue | 🟠 Major

Use parameterized queries instead of string interpolation for all SQL statements. The code has multiple SQL injection vulnerabilities caused by interpolating user-controlled or path values directly into SQL:

- Line 124: `database_path` is interpolated into `parquet_scan()` via `.format()`
- Lines 311–312: `cols` and `tuple(samples)` are interpolated into the WHERE clause
- Line 405: `cols` and `tuple(cons)` are interpolated into the WHERE clause
- Line 512: `irs_channel` is interpolated directly into a filter condition

DuckDB's Python API supports parameterized queries with `?` placeholders. Example fix for the `parquet_scan` call:

```diff
-self.parquet_db.execute(
-    "CREATE VIEW parquet_db_raw AS SELECT * FROM parquet_scan('{}')".format(
-        database_path
-    )
-)
+self.parquet_db.execute(
+    "CREATE VIEW parquet_db_raw AS SELECT * FROM parquet_scan(?)",
+    [database_path],
+)
```

Apply the same pattern to the other affected queries at lines 311–312, 405, and 512.
benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/12_optimal_pipeline.py-425-455 (1)
425-455: ⚠️ Potential issue | 🟠 Major

Tick labels reflect only the last technology's pipeline names. `x`, `values`, and `pipelines` are reassigned inside the inner `for tech in ["tmt", "lfq"]` loop. After the loop, `ax.set_xticks(x)` and `ax.set_xticklabels(pipelines, ...)` use whichever technology was processed last. If TMT and LFQ have different pipeline names, the labels will be incorrect for the other technology.

Consider computing a unified set of pipeline names (or using grouped bar positions derived from both technologies) before setting tick labels.
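One way to build that unified label set is to union the pipeline names up front, before the per-technology loop; a minimal sketch with made-up names (the `names_by_tech` dict is illustrative):

```python
import numpy as np

# Hypothetical per-technology pipeline names.
names_by_tech = {
    "tmt": ["p1", "p2", "p3"],
    "lfq": ["p1", "p4"],
}
# Union in first-seen order, so both bar groups share one x axis.
pipelines = list(dict.fromkeys(n for names in names_by_tech.values() for n in names))
x = np.arange(len(pipelines))
assert pipelines == ["p1", "p2", "p3", "p4"]
```

Each technology's bars are then placed at the positions of its own names within `pipelines`, and the tick labels are correct for both groups.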
benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/run_benchmark.py-114-117 (1)
114-117: ⚠️ Potential issue | 🟠 Major

`input()` will hang in non-interactive / CI environments. If this pipeline runs in CI or a headless context, `input()` on line 116 will block indefinitely (or raise `EOFError` if stdin is closed). Consider adding a `--no-interactive` / `--fail-fast` flag that defaults to aborting (or continuing) without prompting.

🛡️ Proposed fix — add a CLI flag and guard the prompt:

```diff
 parser.add_argument(
     "--skip-download",
     action="store_true",
     help="Skip data download step",
 )
+parser.add_argument(
+    "--fail-fast",
+    action="store_true",
+    help="Abort immediately on first failure (no interactive prompt)",
+)
 args = parser.parse_args()
```

```diff
 if not success:
     print(f"\nStep failed: {script}")
-    if input("Continue with remaining steps? (y/n): ").lower() != "y":
+    if args.fail_fast or not sys.stdin.isatty():
+        break
+    elif input("Continue with remaining steps? (y/n): ").lower() != "y":
         break
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/09_quant_norm_interaction.py-88-93 (1)
88-93: ⚠️ Potential issue | 🟠 Major

Division by zero in `globalmedian` normalization when a sample median is zero. If any column is entirely zero or NaN, `sample_medians` contains 0 or NaN entries, making `factors` contain `inf`/NaN. Multiplying by `inf` silently corrupts the entire sample column.

Suggested fix:

```diff
 elif method == "globalmedian":
     global_median = df.median().median()
     sample_medians = df.median()
+    sample_medians = sample_medians.replace(0, np.nan)
     factors = global_median / sample_medians
-    return df * factors
+    return df * factors.fillna(1.0)
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/09_quant_norm_interaction.py-149-155 (1)
149-155: ⚠️ Potential issue | 🟠 Major

Same `inf` contamination from zero-division as in `03_fold_change_accuracy.py`. `condition_intensities[cond1] / condition_intensities[cond2]` produces `inf` when `cond2` means are zero. `dropna()` does not remove `inf`, so `np.log2(inf) = inf` flows into the RMSE calculation.

Suggested fix:

```diff
 obs_fc = (condition_intensities[cond1] / condition_intensities[cond2]).dropna()
+obs_fc = obs_fc.replace([np.inf, -np.inf], np.nan).dropna()
 obs_log2_fc = np.log2(obs_fc)
```

mokume/normalization/tmm.py-83-98 (1)
83-98: ⚠️ Potential issue | 🟠 Major

Missing parameter validation for trim fractions. `m_trim` and `a_trim` are used as trim proportions but are never validated. Values outside `(0, 0.5)` would cause silent misbehavior — e.g., `m_trim=0.6` trims more than half from each side, producing `m_lo > m_hi` and an empty keep set; `m_trim=-0.1` or `m_trim=1.0` would produce nonsensical bounds.

Suggested validation:

```diff
 def __init__(
     self,
     m_trim: float = 0.3,
     a_trim: float = 0.05,
     ref_sample: Optional[str] = None,
     log_transform: bool = False,
 ):
+    if not 0 < m_trim < 0.5:
+        raise ValueError(f"m_trim must be in (0, 0.5), got {m_trim}")
+    if not 0 < a_trim < 0.5:
+        raise ValueError(f"a_trim must be in (0, 0.5), got {a_trim}")
     self.m_trim = m_trim
     self.a_trim = a_trim
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/03_fold_change_accuracy.py-121-142 (1)
121-142: ⚠️ Potential issue | 🟠 Major

`np.inf` values from zero-mean conditions pass the `notna` check and contaminate results. When `mean2` is 0 for a protein, `log2(mean1 / 0)` yields `inf`. Since `pd.notna(np.inf)` returns `True`, these infinite fold-changes are silently included in the results, corrupting downstream RMSE and bias calculations.

Suggested fix:

```diff
 log2_fc = np.log2(mean1 / mean2)
+log2_fc = log2_fc.replace([np.inf, -np.inf], np.nan)
 expected_log2_fc = np.log2(expected_fc)
```

mokume/plotting/differential_expression.py-16-16 (1)
16-16: ⚠️ Potential issue | 🟠 Major

`matplotlib.use("Agg")` at module level breaks interactive environments. This is library code imported by other modules. Unconditionally forcing the Agg backend at import time will prevent figure display in Jupyter notebooks, IPython, and any interactive session. This side-effect belongs in scripts/entry points, not in a library module.

Proposed fix:

```diff
-matplotlib.use("Agg")
```

Remove this line. Callers who need a non-interactive backend can set it themselves before importing matplotlib/pyplot.
benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/01_grid_search_methods.py-101-132 (1)
101-132: ⚠️ Potential issue | 🟠 Major

Silent fallthrough for unrecognized normalization methods returns raw data. If `method` doesn't match any known branch (e.g., a typo like `"meadian"`), execution falls through to Line 132 and returns the original `df_wide` without normalization or any warning. The same pattern exists in `apply_imputation` (Line 157). This silently masks configuration errors.

Proposed fix:

```diff
-return df_wide
+raise ValueError(f"Unknown normalization method: {method}")
```

And similarly for `apply_imputation`:

```diff
-return df_wide
+raise ValueError(f"Unknown imputation method: {method}")
```

mokume/plotting/differential_expression.py-79-79 (1)
79-79: ⚠️ Potential issue | 🟠 Major

`str.contains` without `regex=False` treats gene names as regex patterns. Protein/gene names can contain characters like `+`, `.`, `(`, `)` (e.g., `"S100A8+A9"`, `"TP53(1-393)"`). Without `regex=False`, these are interpreted as regex metacharacters, leading to either wrong matches or `re.error` exceptions.

```diff
-matches = df[df["ProteinName"].str.contains(gene, case=False, na=False)]
+matches = df[df["ProteinName"].str.contains(gene, case=False, na=False, regex=False)]
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/08_normalization_grid.py-398-433 (1)
398-433: ⚠️ Potential issue | 🟠 Major

`plot_comparison_bars` will crash if fewer than 5 results exist for a technology. Line 420 creates `x = np.arange(5)`, and Line 419 takes `.head(5)`. If `tech_df` has fewer than 5 rows, `top5` will have fewer entries, but `x` is always length 5. `ax.barh(x, top5[metric])` will raise a length mismatch error.

```diff
-top5 = tech_df.head(5)
-x = np.arange(5) + i * 6
+top_n = tech_df.head(5)
+x = np.arange(len(top_n)) + i * 6
+colors = ["#2ecc71" if j == 0 else "#3498db" for j in range(len(top_n))]
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/10_imputation_benchmark.py-295-308 (1)
295-308: ⚠️ Potential issue | 🟠 Major

`UnboundLocalError` for `x` and `methods` when both technologies have no data. After the inner `for i, tech in ...` loop, `x` and `methods` (Lines 305–306) are referenced outside the loop. If both `"tmt"` and `"lfq"` DataFrames are empty (e.g., no results yet), the `continue` on Line 290 skips the assignments and `x`/`methods` are never defined.

Proposed fix:

```diff
 for ax, (metric, title, lower_better) in zip(axes.flat, metrics):
+    x = np.array([])
+    methods = []
     for i, tech in enumerate(["tmt", "lfq"]):
         tech_df = results_df[results_df["technology"] == tech]
         if len(tech_df) == 0:
             continue
         methods = tech_df["imputation"].tolist()
         values = tech_df[metric].tolist()
         x = np.arange(len(methods))
         ...
-    ax.set_xticks(x)
-    ax.set_xticklabels(methods, rotation=45, ha="right")
+    if len(x) > 0:
+        ax.set_xticks(x)
+        ax.set_xticklabels(methods, rotation=45, ha="right")
```

mokume/pipeline/stages.py-731-736 (1)
731-736: ⚠️ Potential issue | 🟠 Major

Contrast parsing with `split("-")` breaks on condition names containing hyphens. If a condition name contains a hyphen (e.g., `"non-treated-control"`), `c.split("-")` will produce more than 2 parts, silently skipping a valid contrast. Consider using `split("-", 1)` or a different delimiter (e.g., `" vs "`).

Proposed fix:

```diff
 if self.config.de.contrasts:
     for c in self.config.de.contrasts:
-        parts = c.split("-")
+        parts = c.split("-", 1)
         if len(parts) == 2:
             contrasts.append((parts[0].strip(), parts[1].strip()))
         else:
             logger.warning(f"Invalid contrast format '{c}', expected 'A-B'")
```

Note: Even with `maxsplit=1`, a condition like `"A-B-C"` would yield `("A", "B-C")`. If both condition names can contain hyphens, a different delimiter (e.g., `" vs "`) would be more robust.
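The `maxsplit=1` behavior can be sketched in isolation (`parse_contrast` is a hypothetical helper for illustration, not part of the codebase):

```python
def parse_contrast(c: str) -> tuple:
    # maxsplit=1 keeps hyphens inside the second condition name intact
    parts = c.split("-", 1)
    if len(parts) != 2:
        raise ValueError(f"Invalid contrast format '{c}', expected 'A-B'")
    return parts[0].strip(), parts[1].strip()

assert parse_contrast("treated-control") == ("treated", "control")
# the hyphenated second condition survives; with a plain split("-") it would not
assert parse_contrast("treated-non-treated-control") == ("treated", "non-treated-control")
```

This illustrates both sides of the note above: the first condition can never contain a hyphen under this scheme, which is why a delimiter that cannot occur in condition names is the more robust option.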
🟡 Minor comments (34)
docs/user-guide/batch-correct.md-1-3 (1)
1-3: ⚠️ Potential issue | 🟡 Minor

Title/command name mismatch may confuse users. The page title says "batch-correct" but the actual CLI command referenced on Line 3 (and throughout) is `correct-batches`. Consider aligning the title with the command name, e.g., `# correct-batches: Batch Correction`.

mokume/postprocessing/batch_correction.py-126-144 (1)
126-144: ⚠️ Potential issue | 🟡 Minor

Regex fallback doesn't cover all reasonable non-hyphen sample name patterns. The regex `^([a-zA-Z]+\d+)_` requires at least one letter followed by at least one digit before the underscore. Sample names like `"sample_1"` (letters only, no trailing digits before `_`) or `"1_sample"` (digits first) will silently fall through to the full-string fallback, effectively assigning each such sample to its own batch. This may produce unexpected single-sample batches that later fail the `TooFewSamplesInBatch` check.

Consider either broadening the regex to capture more patterns (e.g., `r"^(.+?)_\d+$"` to split on the last `_digits` suffix), or logging a warning when the fallback is used so users know the prefix extraction didn't match.

Suggested improvement:

```diff
-# Match leading non-numeric + optional digits as prefix (e.g. p1_1 → p1)
-m = re.match(r"^([a-zA-Z]+\d+)_", s)
+# Match everything before the last _digits suffix (e.g. p1_1 → p1)
+m = re.match(r"^(.+?)_\d+$", s)
 if m:
     prefixes.append(m.group(1))
 else:
+    logger.warning(
+        f"Could not extract batch prefix from '{s}', using full name"
+    )
     prefixes.append(s)
```

mokume/reports/interactive.py-227-227 (1)
227-227: ⚠️ Potential issue | 🟡 Minor

Unpinned Plotly CDN version; inconsistent with `qc_report.py`. This file uses `plotly-latest.min.js` while `qc_report.py` (line 580) pins `plotly-2.27.0.min.js`. Using `latest` is fragile — a breaking Plotly release could silently break the interactive report. Pin to the same version for consistency and reproducibility.

Proposed fix:

```diff
-<script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
+<script src="https://cdn.plot.ly/plotly-2.27.0.min.js"></script>
```

mokume/reports/qc_report.py-797-803 (1)
797-803: ⚠️ Potential issue | 🟡 Minor

JS accesses `pca.var_explained[1]`, which is undefined when PCA has only 1 component. When `n_components == 1`, the `var_explained` array has a single element. Accessing index `[1]` on Line 800 yields `undefined`, rendering `"PC2 (undefined%)"` in the axis label. Guard with a fallback.

Proposed fix:

```diff
 xaxis: {{ title: 'PC1 (' + pca.var_explained[0].toFixed(1) + '%)' }},
-yaxis: {{ title: 'PC2 (' + pca.var_explained[1].toFixed(1) + '%)' }},
+yaxis: {{ title: 'PC2 (' + (pca.var_explained.length > 1 ? pca.var_explained[1].toFixed(1) : '0.0') + '%)' }},
```

mokume/preprocessing/sdrf.py-16-47 (1)
16-47: ⚠️ Potential issue | 🟡 Minor

Potential `KeyError` if the `comment[technical replicate]` column is absent. Not all SDRF files contain the `comment[technical replicate]` column. If it's missing, Line 46 will raise an unhandled `KeyError`. Consider guarding with a default:

```diff
-technical_repetitions = len(sdrf_df["comment[technical replicate]"].unique())
+if "comment[technical replicate]" in sdrf_df.columns:
+    technical_repetitions = len(sdrf_df["comment[technical replicate]"].unique())
+else:
+    technical_repetitions = 1
```

mokume/reports/interactive.py-432-435 (1)
432-435: ⚠️ Potential issue | 🟡 Minor

Protein names are injected into `innerHTML` without escaping — potential DOM XSS. If a protein name contains HTML metacharacters (e.g. `<`, `>`), it will be interpreted as markup. While unlikely in typical proteomics accessions, this is still unsanitized user-derived data flowing into `innerHTML`. Use `textContent` or escape HTML entities.

Sketch: escape helper in JS

```diff
+function escapeHtml(str) {
+  return str.replace(/&/g,'&amp;').replace(/</g,'&lt;').replace(/>/g,'&gt;').replace(/"/g,'&quot;');
+}
+
 // In populateTable:
-tr.innerHTML = '<td>' + (row.highlighted ? '<b>' + row.protein + '</b>' : row.protein) + '</td>'
+tr.innerHTML = '<td>' + (row.highlighted ? '<b>' + escapeHtml(row.protein) + '</b>' : escapeHtml(row.protein)) + '</td>'
```
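Since the report HTML is generated in Python, the same escaping can alternatively be applied server-side before the value is embedded in the page; a small stdlib sketch (the protein string is a contrived example):

```python
import html

# A hostile "protein name" carrying markup -- contrived, but shows the failure mode.
protein = '<img src=x onerror=alert(1)>P53'
cell = "<td>" + html.escape(protein) + "</td>"
# The metacharacters are now inert entities instead of live markup.
assert cell == "<td>&lt;img src=x onerror=alert(1)&gt;P53</td>"
```

Escaping at generation time covers the static parts of the report, but values injected later by JS (as in the snippet above) still need the client-side helper or `textContent`.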
80-91:⚠️ Potential issue | 🟡 MinorSQL
LIKEwildcards in contaminant patterns are not escaped.The
safe_patternonly escapes single quotes, but DuckDBLIKEtreats%and_as wildcards. If a contaminant pattern contained these characters, matches would be broader than intended. Low risk since the patterns are typically controlled strings like"CONTAMINANT".mokume/preprocessing/aggregation.py-186-186 (1)
186-186: ⚠️ Potential issue | 🟡 Minor

Hardcoded `"Condition"` instead of the `CONDITION` constant. The rest of this module uses `CONDITION` (imported from `mokume.core.constants`), but this line uses a hardcoded `"Condition"` string twice. If `CONDITION` ever changes, this line will silently break.

Proposed fix:

```diff
-data_df = data_df[(data_df["Condition"] != "Empty") | (data_df["Condition"].isnull())]
+data_df = data_df[(data_df[CONDITION] != "Empty") | (data_df[CONDITION].isnull())]
```

mokume/io/feature.py-393-398 (1)
393-398: ⚠️ Potential issue | 🟡 Minor

Division by zero if `global_med` or a per-sample median is zero. If all sample medians are zero (e.g., heavily filtered data), `global_med` will be 0 and the division at Line 397 will produce `inf`/NaN. Same risk in `get_median_map_to_condition` at Line 463, where `meds.mean()` could be 0.

Proposed guard:

```diff
 global_med = np.median(list(med_map.values()))
+if global_med == 0:
+    logger.warning("Global median intensity is 0; normalization factors will be set to 1.")
+    return {s: 1.0 for s in med_map}
 for sample, med in med_map.items():
-    med_map[sample] = med / global_med
+    med_map[sample] = med / global_med if global_med != 0 else 1.0
```

docs/reference/cli.md-226-226 (1)
226-226: ⚠️ Potential issue | 🟡 Minor

`-o` short flag conflicts with typical "output" semantics, and the table appears truncated. Using `-o` for `--pattern` is confusing since `-o` conventionally means `--output`. Also, the tsne-visualization options table seems incomplete — it ends abruptly without closing.

docs/reference/cli.md-3-3 (1)
3-3: ⚠️ Potential issue | 🟡 Minor

Incorrect command count: "four" should be "five". The page documents five commands (features2proteins, features2peptides, peptides2protein, correct-batches, tsne-visualization).

Proposed fix:

```diff
-mokume provides four main commands. Use `mokume --help` or `mokume <command> --help` for details.
+mokume provides five main commands. Use `mokume --help` or `mokume <command> --help` for details.
```

docs/reference/configuration.md-123-125 (1)
123-125: ⚠️ Potential issue | 🟡 Minor

Example uses `method="median"` but the default/recommended is `"maxlfq"`. The configuration table (line 49) and the Quick Example on the landing page both default to `"maxlfq"`. Using `"median"` here without explanation may confuse users copying the example. Consider using `"maxlfq"` for consistency, or add a comment explaining the choice.

Proposed fix:

```diff
 quantification=QuantificationConfig(
-    method="median",
+    method="maxlfq",
 ),
```

docs/reference/computed-values.md-12-12 (1)
12-12: ⚠️ Potential issue | 🟡 Minor

`IbaqPpb` multiplier doesn't match "parts per billion". "Parts per billion" implies multiplication by 10⁹ (1,000,000,000), but the formula shows `IbaqNorm * 100,000,000`, which is 10⁸. Either the multiplier should be `1,000,000,000` or the description should say "parts per hundred million."

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/00_download_data.py-22-24 (1)
22-24: ⚠️ Potential issue | 🟡 Minor

Partial downloads will be silently accepted and never re-downloaded. If a download is interrupted, a partial file will remain on disk. On the next run, the `dest.exists()` check on line 22 will skip it, leaving the user with a corrupt file. Consider checking file size against `Content-Length` or adding a checksum verification step.

Also applies to: 38-38
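A size check against the server-reported length could look like this sketch (`needs_download` is a hypothetical helper, not part of the benchmark scripts):

```python
import tempfile
from pathlib import Path

def needs_download(dest: Path, expected_size: int) -> bool:
    # Treat a missing or truncated file as "not downloaded yet".
    return not dest.exists() or dest.stat().st_size != expected_size

with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / "partial.bin"
    p.write_bytes(b"abc")            # simulate an interrupted 3-byte download
    assert needs_download(p, 10)     # server reported 10 bytes -> retry
    assert not needs_download(p, 3)  # sizes agree -> skip
```

The expected size would come from the `Content-Length` response header before (re)downloading; a checksum comparison is stronger but requires the remote side to publish digests.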
benchmarks/quant-pxd007683-tmt-vs-lfq/results/BENCHMARK_REPORT.md-32-40 (1)
32-40: ⚠️ Potential issue | 🟡 Minor

Misleading interpretation: 1% variance is not "a significant portion." The table shows condition explains only ~1% of variance (TMT: 1.0%, LFQ: 0.9%) with ~99% residual, yet lines 39-40 state "Condition (yeast spike-in level) explains a significant portion of variance." This contradicts the data. Consider revising to something like "Condition explains only a small portion of overall variance, suggesting technical variability dominates."

docs/concepts/quantification.md-7-16 (1)
docs/concepts/quantification.md-7-16 (1)
7-16:⚠️ Potential issue | 🟡 MinorInconsistent "Class" column:
peptides_to_protein()is a function, not a class.The overview table header says "Class" but the iBAQ row lists a function call
peptides_to_protein(). For consistency with the other rows (which list class names likeTopNQuantification), consider either renaming the column to "API" or listing the actual class/module used for iBAQ.benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/00_generate_ibaq.py-183-184 (1)
183-184: ⚠️ Potential issue | 🟡 Minor

Hardcoded step number "2." prints for both TMT and LFQ iterations. Both loop iterations display "2. Processing …", making the output misleading. Use an enumerated counter.

🐛 Proposed fix:

```diff
-for technology in ["tmt", "lfq"]:
-    print(f"\n2. Processing {technology.upper()}...")
+for step_idx, technology in enumerate(["tmt", "lfq"], start=2):
+    print(f"\n{step_idx}. Processing {technology.upper()}...")
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/12_optimal_pipeline.py-356-407 (1)
356-407: ⚠️ Potential issue | 🟡 Minor

On exception, `run_pipeline` returns a dict missing metric keys, which may cause a downstream `KeyError` or silent NaN propagation. If the `try` block at line 363 raises, `results` is returned without `n_proteins`, `within_cv`, `fc_rmse`, etc. `pd.DataFrame(all_results)` will fill these as NaN, which is tolerable, but downstream code (e.g., line 400 with `:.3f` formatting, line 585 with `idxmax()`) may fail on NaN or produce confusing output. Consider populating default metric values in the error path.

🛡️ Proposed fix:

```diff
 except Exception as e:
     print(f"  ERROR: {e}")
     import traceback
     traceback.print_exc()
+    results.update({
+        "n_proteins": 0,
+        "n_yeast": 0,
+        "within_cv": np.nan,
+        "fc_rmse": np.nan,
+        "pct_good_cv": 0,
+    })
 return results
```

mokume/normalization/tmm.py-100-130 (1)
100-130: ⚠️ Potential issue | 🟡 Minor

`_select_reference_sample` will raise on all-NaN upper quartiles. If every sample has only zeros/NaN, `uq_series` will be all NaN, `mean_uq` will be NaN, and `(uq_series - mean_uq).abs().idxmin()` raises `ValueError`. While unlikely with real data, an early check and a clear error message would be more user-friendly than a cryptic `idxmin` error.

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/04_stability_metrics.py-282-284 (1)
282-284: ⚠️ Potential issue | 🟡 Minor

Potential `KeyError` when `quality_within` is an empty dict. If all within-condition CVs are empty (e.g., all samples map to "Unknown" or conditions have fewer than 2 samples), `assess_quality` returns `{}`, and `metrics['quality_within']['good']` will raise a `KeyError`.

Suggested fix:

```diff
-print(f"  Quality: {metrics['quality_within']['good']*100:.1f}% proteins with CV < 20%")
+pct_good = metrics['quality_within'].get('good', 0) * 100
+print(f"  Quality: {pct_good:.1f}% proteins with CV < 20%")
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/06_generate_report.py-347-348 (1)
347-348: ⚠️ Potential issue | 🟡 Minor

Hardcoded `*.png` glob ignores `FIGURE_FORMAT` from config. The figures section enumerates files matching `*.png`, but the benchmark scripts save figures using the `FIGURE_FORMAT` config constant. If the format is changed (e.g., to `svg` or …)

Suggested fix:

```diff
-for fig in sorted(FIGURES_DIR.glob("*.png")):
+for fig in sorted(FIGURES_DIR.glob(f"*.{FIGURE_FORMAT}")):
```

This requires adding `FIGURE_FORMAT` to the import from `config`.

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/07_tmm_normalization_benchmark.py-381-383 (1)
381-383: ⚠️ Potential issue | 🟡 Minor

Unsafe `values[0]` indexing when `maxlfq` might not be in results. If `maxlfq` data was not available for a technology, the filter `tech_data[tech_data['method']=='maxlfq']` returns an empty DataFrame, and `.values[0]` raises `IndexError`.

Suggested fix:

```diff
-print(f"  Factor variation (maxlfq): {tech_data[tech_data['method']=='maxlfq']['factor_std'].values[0]:.3f}")
+maxlfq_data = tech_data[tech_data['method'] == 'maxlfq']
+if len(maxlfq_data) > 0:
+    print(f"  Factor variation (maxlfq): {maxlfq_data['factor_std'].values[0]:.3f}")
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/04_stability_metrics.py-103-106 (1)
103-106: ⚠️ Potential issue | 🟡 Minor

Hardcoded methods list duplicates config and diverges from `get_methods_for_technology`. `config.py` already provides `get_methods_for_technology()` and per-technology method lists. The hardcoded list here omits `ibaq` (present in config) and includes `directlfq` for both technologies, which contradicts the documented constraint that DirectLFQ should not be used for TMT.

Suggested fix:

```diff
-methods = ["sum", "top3", "top5", "top10", "maxlfq", "directlfq"]
+methods = get_methods_for_technology(technology)
```

This requires importing `get_methods_for_technology` from config (add to the existing import block).

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/03_fold_change_accuracy.py-54-59 (1)
54-59: ⚠️ Potential issue | 🟡 Minor

Return type annotation is inaccurate — the function can return `None`. The function signature says `-> pd.DataFrame` but returns `None` when the parquet file doesn't exist. This should be `-> Optional[pd.DataFrame]` to match the actual behavior and avoid type-checking confusion.

Suggested fix:

```diff
-def load_species_info(technology: str) -> pd.DataFrame:
+def load_species_info(technology: str) -> Optional[pd.DataFrame]:
```

(Requires `from typing import Optional` — add to the imports.)

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/07_tmm_normalization_benchmark.py-132-134 (1)
132-134: ⚠️ Potential issue | 🟡 Minor

Division by zero when `within_cv_median` is zero or NaN. If `metrics_before['within_cv_median']` is 0 (perfectly reproducible data) or `NaN` (no valid conditions), this line will produce `inf` or `NaN` silently, which propagates into the results dict and CSV output.

Suggested fix:

```diff
-cv_improvement = (metrics_before['within_cv_median'] - metrics_after['within_cv_median']) / metrics_before['within_cv_median'] * 100
+before_cv = metrics_before['within_cv_median']
+after_cv = metrics_after['within_cv_median']
+if before_cv and np.isfinite(before_cv) and before_cv != 0:
+    cv_improvement = (before_cv - after_cv) / before_cv * 100
+else:
+    cv_improvement = np.nan
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/09_quant_norm_interaction.py-340-349 (1)
340-349: ⚠️ Potential issue | 🟡 Minor

`x` and `quant_methods` may be unbound if no technology has data. `x` (Line 330) and `quant_methods` (Line 326) are defined inside the inner `for i, tech` loop. If both technologies have empty DataFrames for a metric (both hit `continue` at Line 318), these variables are never assigned, and Lines 346–347 raise `NameError`.

Suggested fix:

```diff
+x = np.array([])
+quant_methods = []
 for i, tech in enumerate(["tmt", "lfq"]):
     tech_df = results_df[results_df["technology"] == tech]
     if len(tech_df) == 0:
         continue
     ...
 ax.set_xlabel("Quantification Method")
 ax.set_ylabel(metric)
 ax.set_title(title)
-ax.set_xticks(x)
-ax.set_xticklabels(quant_methods, rotation=45, ha="right")
+if len(x) > 0:
+    ax.set_xticks(x)
+    ax.set_xticklabels(quant_methods, rotation=45, ha="right")
 ax.legend()
 ax.grid(True, alpha=0.3, axis="y")
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/02_variance_decomposition.py-196-207 (1)
196-207: ⚠️ Potential issue | 🟡 Minor

Potential `IndexError` on `var_explained[1]` if only 1 PCA component is computed. `n_components` at Line 135 is `min(5, X.shape[0], X.shape[1])`. If there are fewer than 2 samples (columns in `complete_data`), `X.shape[0]` would be 1, producing a single component. Accessing `var_explained[1]` at Lines 200 and 232 would then raise an `IndexError`. While unlikely for this dataset, a guard would make this robust.

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/08_normalization_grid.py-98-108 (1)
98-108: ⚠️ Potential issue | 🟡 Minor

Same division-by-zero risk in `globalmedian` normalization. If any `sample_medians` entry is zero (all-NaN or all-zero column after cleaning), the division on Line 107 produces `inf`. This pattern recurs across scripts 08, 10, and 11.

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/11_preprocessing_filters.py-270-275 (1)
270-275: ⚠️ Potential issue | 🟡 Minor

Division by zero in normalization when a sample's median is zero. Same issue as in `10_imputation_benchmark.py`: if any `sample_medians` entry is zero, `global_median / sample_medians` produces `inf`, corrupting that column.

```diff
 global_median = df_quant.median().median()
 sample_medians = df_quant.median()
+sample_medians = sample_medians.replace(0, np.nan)
 factors = global_median / sample_medians
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/02_variance_decomposition.py-313-313 (1)
313-313: ⚠️ Potential issue | 🟡 Minor

Hardcoded `.png` in log message while the actual file uses `FIGURE_FORMAT`. Line 311 saves the file with `FIGURE_FORMAT` (could be `svg`, …), while the log message hardcodes `.png`. This will confuse users when the configured format is not PNG.

```diff
-print(f"Combined PCA plot saved to: {FIGURES_DIR / 'pca_combined.png'}")
+print(f"Combined PCA plot saved to: {FIGURES_DIR / f'pca_combined.{FIGURE_FORMAT}'}")
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/10_imputation_benchmark.py-121-131 (1)
121-131: ⚠️ Potential issue | 🟡 Minor

Division by zero when `sample_medians` contains zero or NaN. If all values in a sample column are `NaN` (e.g., after imputation of an all-missing column), `df.median()` returns `NaN` for that column. The division `global_median / sample_medians` then produces `NaN` silently, which is tolerable — but if a column has a median of exactly `0`, this produces `inf`, corrupting that column's data.

Proposed fix:

```diff
 def apply_sample_normalization(df: pd.DataFrame, method: str = "globalmedian") -> pd.DataFrame:
     """Apply sample-level normalization."""
     if method == "none":
         return df.copy()
     elif method == "globalmedian":
         global_median = df.median().median()
         sample_medians = df.median()
+        sample_medians = sample_medians.replace(0, np.nan)
         factors = global_median / sample_medians
         return df * factors
```

mokume/pipeline/stages.py-501-520 (1)
501-520: ⚠️ Potential issue | 🟡 Minor

Fallback intensity column selection (`numeric_cols[-1]`) is fragile. If `"Intensity"` is not present, the code falls back to the last numeric column. This is order-dependent and could silently pick the wrong column (e.g., a peptide count column). Consider raising an error or logging a warning when the expected column is missing rather than silently guessing.

mokume/pipeline/stages.py-668-674 (1)
668-674: ⚠️ Potential issue | 🟡 Minor

Proteins with any missing sample are left entirely uncorrected. Lines 668-670 split on `isna().any(axis=1)`, meaning a protein missing intensity in even one sample out of many is excluded from ComBat. Depending on the dataset sparsity, this could leave a large fraction of proteins uncorrected, silently degrading the batch correction. Consider imputation before ComBat or at least logging the fraction excluded.

mokume/pipeline/features_to_proteins.py-83-95 (1)
83-95: ⚠️ Potential issue | 🟡 Minor

Missing existence check for SDRF file. Parquet (line 85) and FASTA (line 94) files are validated for existence, but `config.input.sdrf` is not. A non-existent SDRF path will produce a confusing error deep in processing (e.g., during `analyse_sdrf` or `detect_condition_from_sdrf`).

Proposed fix:

```diff
 if self.config.input.fasta_file and not Path(self.config.input.fasta_file).exists():
     raise FileNotFoundError(f"FASTA file not found: {self.config.input.fasta_file}")
+
+if self.config.input.sdrf and not Path(self.config.input.sdrf).exists():
+    raise FileNotFoundError(f"SDRF file not found: {self.config.input.sdrf}")
```
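Since the same check now appears three times (parquet, FASTA, SDRF), it could be factored into a small guard; a sketch (`require_file` is a hypothetical helper, not the project's API):

```python
from pathlib import Path
from typing import Optional

def require_file(path: Optional[str], label: str) -> None:
    # Fail fast with a clear message instead of a confusing downstream error.
    if path and not Path(path).exists():
        raise FileNotFoundError(f"{label} file not found: {path}")

require_file(None, "SDRF")  # no-op when the input is not configured
try:
    require_file("does_not_exist.sdrf.tsv", "SDRF")
    raised = False
except FileNotFoundError as e:
    raised = "SDRF file not found" in str(e)
assert raised
```

The pipeline's validation step would then call `require_file` once per configured input path.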
```python
def compute_overall_correlation(tmt_wide: pd.DataFrame, lfq_wide: pd.DataFrame) -> dict:
    """Compute overall correlation between TMT and LFQ."""
    # Find common proteins
    common_proteins = tmt_wide.index.intersection(lfq_wide.index)

    if len(common_proteins) == 0:
        return {"error": "No common proteins found"}

    # Find common samples (by sample number)
    tmt_samples = set(str(s) for s in tmt_wide.columns)
    lfq_samples = set(str(s) for s in lfq_wide.columns)
    common_samples = sorted(tmt_samples & lfq_samples, key=lambda x: int(x) if x.isdigit() else 0)

    if len(common_samples) == 0:
        return {"error": "No matching samples found"}

    # Create aligned matrices
    tmt_aligned = tmt_wide.loc[common_proteins, common_samples]
    lfq_aligned = lfq_wide.loc[common_proteins, common_samples]

    # Log transform for correlation
    tmt_log = np.log10(tmt_aligned.replace(0, np.nan))
    lfq_log = np.log10(lfq_aligned.replace(0, np.nan))

    # Flatten and remove NaN pairs
    tmt_flat = tmt_log.values.flatten()
    lfq_flat = lfq_log.values.flatten()

    valid = ~(np.isnan(tmt_flat) | np.isnan(lfq_flat))
    tmt_valid = tmt_flat[valid]
    lfq_valid = lfq_flat[valid]

    if len(tmt_valid) < 10:
        return {"error": "Not enough valid values for correlation"}

    # Compute correlations
    pearson_r, pearson_p = stats.pearsonr(tmt_valid, lfq_valid)
    spearman_r, spearman_p = stats.spearmanr(tmt_valid, lfq_valid)

    return {
        "n_proteins": len(common_proteins),
        "n_samples": len(common_samples),
        "n_values": len(tmt_valid),
        "pearson_r": pearson_r,
        "pearson_p": pearson_p,
        "spearman_r": spearman_r,
        "spearman_p": spearman_p,
        "tmt_log": tmt_log,
        "lfq_log": lfq_log,
        "common_samples": common_samples,
    }
```
Type mismatch: string sample IDs used to index potentially integer DataFrame columns.
Lines 64–65 convert column labels to strings for set intersection, producing string-typed common_samples. However, Line 72 then uses these strings to index tmt_wide and lfq_wide, whose columns retain their original type from the Parquet file (often int). This would raise KeyError if SampleID is stored as integer.
The same pattern appears in compute_per_protein_correlation (Line 114) and compute_per_sample_correlation (Line 145).
Proposed fix

```diff
  # Find common samples (by sample number)
- tmt_samples = set(str(s) for s in tmt_wide.columns)
- lfq_samples = set(str(s) for s in lfq_wide.columns)
- common_samples = sorted(tmt_samples & lfq_samples, key=lambda x: int(x) if x.isdigit() else 0)
+ tmt_sample_map = {str(s): s for s in tmt_wide.columns}
+ lfq_sample_map = {str(s): s for s in lfq_wide.columns}
+ common_keys = sorted(set(tmt_sample_map) & set(lfq_sample_map),
+                      key=lambda x: int(x) if x.isdigit() else 0)
+ tmt_common = [tmt_sample_map[k] for k in common_keys]
+ lfq_common = [lfq_sample_map[k] for k in common_keys]
  ...
  # Create aligned matrices
- tmt_aligned = tmt_wide.loc[common_proteins, common_samples]
- lfq_aligned = lfq_wide.loc[common_proteins, common_samples]
+ tmt_aligned = tmt_wide.loc[common_proteins, tmt_common]
+ lfq_aligned = lfq_wide.loc[common_proteins, lfq_common]
```
+ lfq_aligned = lfq_wide.loc[common_proteins, lfq_common]🤖 Prompt for AI Agents
In
`@benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/05_cross_technology_correlation.py`
around lines 55 - 105, The bug is that compute_overall_correlation (and the
related compute_per_protein_correlation and compute_per_sample_correlation) cast
column labels to strings to find common_samples but then use those string IDs to
index the original DataFrames whose columns may be ints, causing KeyError; fix
by finding common sample labels without changing their dtype (e.g., use
tmt_wide.columns.intersection(lfq_wide.columns) or compute common_samples = [c
for c in tmt_wide.columns if c in lfq_wide.columns]) so the returned
common_samples keep the original type and can be used directly for indexing in
tmt_wide.loc[...] and lfq_wide.loc[...], and apply the same change in
compute_per_protein_correlation and compute_per_sample_correlation.
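The dtype pitfall is easy to reproduce in isolation. The sketch below uses a toy matrix (not the benchmark data) whose columns are ints, as when SampleID is stored as an integer in Parquet:

```python
import pandas as pd

# Toy wide matrix with integer column labels.
wide = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=["P1", "P2"], columns=[1, 2])

# Casting labels to str for the set intersection loses the original dtype,
# so indexing with the string keys raises KeyError.
str_keys = sorted(str(c) for c in wide.columns)
try:
    wide.loc[:, str_keys]
    raised = False
except KeyError:
    raised = True

# Mapping each string key back to its original label keeps indexing valid.
sample_map = {str(c): c for c in wide.columns}
aligned = wide.loc[:, [sample_map[k] for k in str_keys]]
```

This is the same str-to-original-label mapping idea as the proposed fix above, reduced to two columns.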
```python
for _, row in yeast_df.iterrows():
    lines.append(
        f"| {row['technology'].upper()} | {row['comparison']} | "
        f"{row['expected_log2_fc']:.2f} | {row['observed_median_log2_fc']:.2f} | "
        f"{row.get('compression_ratio', 'N/A'):.2f} | {row['rmse']:.3f} |"
    )
```
:.2f format on a string default causes ValueError.
If compression_ratio is absent from the row, row.get('compression_ratio', 'N/A') returns the string 'N/A', and f"{...:.2f}" raises ValueError: Unknown format code 'f' for object of type 'str'. Similarly, NaN values format as nan rather than a user-friendly placeholder.
Suggested fix

```diff
 for _, row in yeast_df.iterrows():
+    compression = row.get('compression_ratio', np.nan)
+    compression_str = f"{compression:.2f}" if pd.notna(compression) else "N/A"
     lines.append(
         f"| {row['technology'].upper()} | {row['comparison']} | "
         f"{row['expected_log2_fc']:.2f} | {row['observed_median_log2_fc']:.2f} | "
-        f"{row.get('compression_ratio', 'N/A'):.2f} | {row['rmse']:.3f} |"
+        f"{compression_str} | {row['rmse']:.3f} |"
     )
```
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
for _, row in yeast_df.iterrows():
    compression = row.get('compression_ratio', np.nan)
    compression_str = f"{compression:.2f}" if pd.notna(compression) else "N/A"
    lines.append(
        f"| {row['technology'].upper()} | {row['comparison']} | "
        f"{row['expected_log2_fc']:.2f} | {row['observed_median_log2_fc']:.2f} | "
        f"{compression_str} | {row['rmse']:.3f} |"
    )
```
🤖 Prompt for AI Agents
In `@benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/06_generate_report.py` around
lines 176 - 181, The f-string in the loop over yeast_df.iterrows() applies the
numeric format specifier :.2f to row.get('compression_ratio', 'N/A'), which
raises ValueError when the value is the string 'N/A' or non-numeric/NaN; update
the code in the loop that builds lines so you detect and normalize
compression_ratio (and any other possibly-NaN fields like expected_log2_fc or
observed_median_log2_fc) before formatting: retrieve compression_ratio into a
local variable, use pandas.isna() or isinstance(..., numbers.Number) to decide
whether to format as "{:.2f}" or use a placeholder like "N/A", and then use the
prepared string in the f-string instead of applying :.2f directly to the raw
row.get(...) expression.
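A quick standalone repro of the failure mode and the NaN-normalizing fix, with a hypothetical row that lacks `compression_ratio`:

```python
import numpy as np
import pandas as pd

row = pd.Series({"rmse": 0.123})  # toy row with no compression_ratio key

# Applying :.2f to the 'N/A' string default raises ValueError.
try:
    f"{row.get('compression_ratio', 'N/A'):.2f}"
    raised = False
except ValueError:
    raised = True

# Normalizing to NaN first allows a clean placeholder instead.
value = row.get("compression_ratio", np.nan)
cell = f"{value:.2f}" if pd.notna(value) else "N/A"
```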
```python
elif method == "quantile":
    # Quantile normalization
    from scipy.stats import rankdata

    result = df.copy()
    # Get ranks for each column
    ranks = df.apply(lambda x: rankdata(x, method='average', nan_policy='omit'), axis=0)
    # Get sorted values (ignoring NaN)
    sorted_means = df.apply(lambda x: np.sort(x.dropna().values), axis=0)

    # Use mean of sorted values at each rank
    n_rows = len(df)
    for col in df.columns:
        col_ranks = ranks[col].values
        col_sorted = np.sort(df[col].dropna().values)

        # Interpolate for each rank
        for i, rank in enumerate(col_ranks):
            if not np.isnan(rank) and not np.isnan(df[col].iloc[i]):
                # Map rank to quantile value
                idx = int(rank) - 1
                if idx < len(col_sorted):
                    result.iloc[i, result.columns.get_loc(col)] = col_sorted[idx]

    return result
```
Quantile normalization implementation is incorrect.

Several problems:

- `sorted_means` (line 163) is computed but never used — it's dead code.
- Fractional ranks are truncated — `rankdata(..., method='average')` returns non-integer ranks for ties (e.g., `2.5`). `int(rank) - 1` truncates to `1`, silently mapping tied values to the wrong quantile position.
- Algorithm is wrong — standard quantile normalization replaces each value with the cross-column mean at its rank. This implementation replaces values with the same column's sorted value, which is essentially a no-op (value → sort → index by own rank → same value, modulo tie-breaking errors).
- O(n × m) Python loop is very slow for proteomics-scale matrices.

Consider using `mokume.normalization.protein.quantile_normalize` (already imported in 01_grid_search_methods.py) or a well-tested implementation instead.
🤖 Prompt for AI Agents
In `@benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/08_normalization_grid.py`
around lines 155 - 179, The quantile branch (method == "quantile") is incorrect:
sorted_means is unused, fractional ranks from rankdata are truncated, the code
maps values to the same-column sorted value (no cross-column normalization), and
it uses a slow nested Python loop; replace this block to perform true quantile
normalization by computing the mean across columns at each rank and mapping each
value to that cross-column mean (handling fractional ranks by averaging
appropriate quantile positions), or simply call the existing, tested
implementation mokume.normalization.protein.quantile_normalize instead of the
custom logic; update/remove the unused variable sorted_means and eliminate the
O(n×m) loops (use vectorized rank/argsort operations or the mokume helper) so df
→ result is the proper cross-column ranked-mean mapping.
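For reference, the rank-to-cross-column-mean mapping can be sketched in a few vectorized lines. This is a minimal version assuming a complete matrix; ties all take the lowest rank here, and NaN handling is left to the tested mokume helper:

```python
import numpy as np
import pandas as pd

def quantile_normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Map each value to the cross-column mean of the values at its rank."""
    # Mean of the i-th smallest value across all columns.
    rank_means = pd.Series(np.sort(df.to_numpy(), axis=0).mean(axis=1))
    # 0-based rank of each value within its own column (ties -> lowest rank).
    ranks = df.rank(method="min").astype(int) - 1
    # Replace every value with the cross-column mean at its rank.
    return ranks.apply(lambda col: col.map(rank_means))

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0]})
normalized = quantile_normalize(df)
```

After normalization each column has an identical distribution (the rank means), which is the defining property the reviewed implementation lacks.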
```python
def plot_radar_chart(results_df: pd.DataFrame, output_path: Path):
    """Plot radar chart comparing pipelines."""
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))

    metrics = ["n_proteins", "pct_good_cv", "within_cv_inv", "fc_rmse_inv"]
    metric_labels = ["Proteins", "% Good CV", "1/CV", "1/RMSE"]

    for ax, tech in zip(axes, ["tmt", "lfq"]):
        tech_df = results_df[results_df["technology"] == tech].copy()

        if len(tech_df) == 0:
            continue

        # Invert CV and RMSE so higher is better
        tech_df["within_cv_inv"] = 1 / tech_df["within_cv"]
        tech_df["fc_rmse_inv"] = 1 / tech_df["fc_rmse"].replace(0, np.nan)

        # Normalize to 0-1
        for m in metrics:
            if m in tech_df.columns:
                min_val = tech_df[m].min()
                max_val = tech_df[m].max()
                if max_val > min_val:
                    tech_df[m + "_norm"] = (tech_df[m] - min_val) / (max_val - min_val)
                else:
                    tech_df[m + "_norm"] = 0.5

        # Angles for radar
        angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
        angles += angles[:1]  # Complete the circle

        # Plot each pipeline
        colors = plt.cm.Set2(np.linspace(0, 1, len(tech_df)))
        for i, (_, row) in enumerate(tech_df.iterrows()):
            values = [row.get(m + "_norm", 0) for m in metrics]
            values += values[:1]

            ax.plot(angles, values, "o-", linewidth=2, label=row["pipeline"],
                    color=colors[i])
            ax.fill(angles, values, alpha=0.1, color=colors[i])

        ax.set_xticks(angles[:-1])
        ax.set_xticklabels(metric_labels)
        ax.set_title(f"{tech.upper()}: Pipeline Comparison")
        ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1))

    plt.tight_layout()
    plt.savefig(output_path, dpi=FIGURE_DPI)
    plt.close()
```
plot_radar_chart creates Cartesian axes — the radar/spider chart won't render correctly.
plt.subplots(1, 2) creates standard Cartesian axes. For a radar chart, you need polar axes. The angles (in radians) will be plotted on a linear x-axis, producing a meaningless line chart instead of a spider plot.
🐛 Proposed fix

```diff
-    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
+    fig, axes = plt.subplots(1, 2, figsize=(14, 6),
+                             subplot_kw=dict(polar=True))
```

🤖 Prompt for AI Agents
In `@benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/12_optimal_pipeline.py` around
lines 462 - 510, The plot_radar_chart function is creating regular Cartesian
Axes (via plt.subplots) so the computed angles are plotted on a linear axis
instead of a polar/spider plot; change the axes creation to polar axes for each
subplot (use projection="polar" when creating axes for the two panels), ensure
angles (variable angles) remain in radians and are used as the theta
coordinates, and keep the rest of the plotting logic (values, ax.plot, ax.fill,
ax.set_xticks/set_xticklabels, ax.set_title, ax.legend) but applied to the polar
Axes returned for plot_radar_chart so the radar/spider visualization renders
correctly.
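A minimal standalone sketch of the difference (hypothetical metric values, non-interactive Agg backend; not the benchmark script itself):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt
import numpy as np

labels = ["Proteins", "% Good CV", "1/CV", "1/RMSE"]
values = [0.8, 0.6, 0.9, 0.7]

angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
angles += angles[:1]  # close the polygon
values += values[:1]

# subplot_kw=dict(polar=True) is what makes the angle list act as theta;
# on default Cartesian axes these radians plot as an ordinary line chart.
fig, ax = plt.subplots(subplot_kw=dict(polar=True))
ax.plot(angles, values, "o-")
ax.fill(angles, values, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
```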
```python
# Vectorized contaminant matching using regex OR pattern
upper_col = df[self.protein_column].fillna("").astype(str).str.upper()
pattern_regex = "|".join(re.escape(p.upper()) for p in self.patterns)
mask = ~upper_col.str.contains(pattern_regex, regex=True)
```
Bug: empty patterns list removes all rows.
When self.patterns is empty, "|".join(...) produces "", and str.contains("", regex=True) matches every row. The negated mask then filters out the entire DataFrame.
Add an early return when there are no patterns to match against.
Proposed fix

```diff
+if not self.patterns:
+    logger.debug("%s: No contaminant patterns provided, skipping filter", self.name)
+    return df, self._create_result(input_count, input_count)
+
 # Vectorized contaminant matching using regex OR pattern
 upper_col = df[self.protein_column].fillna("").astype(str).str.upper()
 pattern_regex = "|".join(re.escape(p.upper()) for p in self.patterns)
```
pattern_regex = "|".join(re.escape(p.upper()) for p in self.patterns)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if not self.patterns:
    logger.debug("%s: No contaminant patterns provided, skipping filter", self.name)
    return df, self._create_result(input_count, input_count)

# Vectorized contaminant matching using regex OR pattern
upper_col = df[self.protein_column].fillna("").astype(str).str.upper()
pattern_regex = "|".join(re.escape(p.upper()) for p in self.patterns)
mask = ~upper_col.str.contains(pattern_regex, regex=True)
```
🤖 Prompt for AI Agents
In `@mokume/preprocessing/filters/protein.py` around lines 63 - 66, The current
vectorized contaminant matching in the method using self.patterns builds
pattern_regex and then computes mask =
~upper_col.str.contains(pattern_regex...), which when self.patterns is empty
creates an empty regex that matches every row and removes all rows; fix by
adding an early check for an empty patterns list (check "if not self.patterns"
or similar) and return the original DataFrame (or set mask to all True) before
constructing pattern_regex; update uses of self.protein_column, pattern_regex,
and mask accordingly so the function preserves rows when there are no patterns.
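The empty-pattern behavior is easy to confirm in isolation (toy series, hypothetical pattern list):

```python
import pandas as pd

proteins = pd.Series(["ALBU_HUMAN", "KRT1_HUMAN"])
patterns = []  # no contaminant patterns configured

# Joining an empty list yields "", and str.contains("") matches every row,
# so the negated mask would drop the whole frame.
mask_buggy = ~proteins.str.contains("|".join(patterns), regex=True)

# Guarding on the empty list keeps all rows instead.
if not patterns:
    mask = pd.Series(True, index=proteins.index)
else:
    mask = ~proteins.str.contains("|".join(patterns), regex=True)
```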
Actionable comments posted: 18
🤖 Fix all issues with AI agents
In `@mokume/analysis/differential_expression.py`:
- Around line 342-349: The code filters s2 and df into valid arrays then calls
np.median(s2) which can be empty; add an explicit guard after filtering: if
len(s2) == 0 then log a clear warning via logger.warning and return a safe
default (e.g. prior_df=3.0, prior_var=1.0) or raise a ValueError depending on
desired behavior—ensure you update the block that currently does "if len(s2) <
3: ... return 3.0, float(np.median(s2))" to handle the zero-length case
separately so np.median is never called on an empty s2.
In `@mokume/io/feature.py`:
- Around line 404-411: The SQL in get_report_condition_from_database currently
injects cons via f-string and tuple(cons); change it to a parameterized query
using placeholders for the IN clause and pass cons as parameters to
parquet_db.sql to avoid SQL injection (e.g., build a placeholder list based on
len(cons) and call parquet_db.sql(query, params)). Update the call sites in
get_report_condition_from_database and ensure Feature.standardize_df(report)
still receives the resulting dataframe from database.df(); keep using cols =
",".join(columns) if provided but do not inline cons directly into the SQL
string.
- Around line 310-319: The current get_report_from_database (and similarly
get_report_condition_from_database) builds SQL by interpolating columns and
tuple(samples) which is an SQL injection risk; change it to use parameterized
queries for the samples (e.g., pass samples as a parameter to
parquet_db.sql/execute instead of formatting them into the string) and avoid
direct string interpolation for columns by validating/whitelisting column names
or mapping requested column names to safe identifiers before joining them into
the SELECT list; update the query construction to use a parameter placeholder
for the IN list and call parquet_db.sql/execute with the samples parameter, then
return Feature.standardize_df(report) as before.
- Around line 518-530: The SQL building the irs_df uses CAST(split_part(run,
'_', 2) AS INTEGER) to compute techreplicate which can silently yield NULL for
non-numeric segments; change this to use TRY_CAST and COALESCE so non-numeric
parts don't drop data (e.g., compute techreplicate with
TRY_CAST(split_part(run,'_',2) AS INTEGER) and COALESCE to fallback to
TRY_CAST(split_part(run,'_',-1) AS INTEGER) or TRY_CAST(run AS INTEGER) similar
to get_unique_tec_reps()), ensuring the SELECT in the irs_df query (symbols:
irs_df, parquet_db, stat_fn, where_clause, techreplicate) preserves rows instead
of producing NULL group keys.
In `@mokume/normalization/irs.py`:
- Around line 96-106: The code computing scaling_factors can divide by
zero/negative values because plex_ref_intensity[plex] is not sanitized; update
the logic in the block around plex_ref_intensity, global_ref and scaling_factors
so denominators <= 0 are replaced with np.nan (or a small positive epsilon)
before division, compute scaling_factors as global_ref / sanitized_plex_ref and
then handle resulting NaNs (e.g., propagate NaN or fill with 1.0) to avoid
inf/-inf values; refer to variables plex_ref_intensity, global_ref,
scaling_factors and the loop iterating over plexes when making this change.
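A sketch of that sanitization on toy per-plex reference intensities (`plex_ref`, `safe_ref`, and `scaling` are illustrative names, not the mokume API):

```python
import numpy as np
import pandas as pd

# Hypothetical per-plex reference-channel intensities; plex2 is all-zero.
plex_ref = pd.Series({"plex1": 1.0e6, "plex2": 0.0, "plex3": 2.0e6})

# Replace non-positive denominators with NaN before dividing.
safe_ref = plex_ref.where(plex_ref > 0, np.nan)
# Geometric mean over the valid plexes only (NaNs are skipped by mean()).
global_ref = float(np.exp(np.log(safe_ref).mean()))
# Leave plexes without a usable reference unscaled (1.0) instead of inf.
scaling = (global_ref / safe_ref).fillna(1.0)
```

With a geometric-mean reference, the valid scaling factors multiply out to 1, so overall intensity scale is preserved across plexes.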
In `@mokume/pipeline/stages.py`:
- Around line 722-725: The broad "except Exception as e" in the ComBat batch
correction block should not silently swallow failures; change it to catch
narrower, likely failure types (e.g., except (ValueError, KeyError, TypeError)
as e:) and call logger.exception("Batch correction failed") to record the full
traceback and context (include protein_df/batch info if helpful); for truly
unexpected errors keep a final except Exception as e:
logger.exception("Unexpected batch correction error") and re-raise the exception
instead of returning uncorrected protein_df so real bugs surface. Ensure you
update the except clause in the batch correction block that currently logs
"Batch correction failed: {e}" and uses protein_df as the fallback.
- Around line 744-751: The contrast-parsing fallback that splits on "-" (the
branch handling variable c and appending to contrasts) misparses hyphenated
condition names; update that branch in mokume/pipeline/stages.py (the block
referencing variable c and the contrasts list) to emit a warning when "-" is
used as the delimiter (e.g., via the module logger or warnings.warn) explaining
that "-" is ambiguous and recommending using " vs " instead, while keeping the
existing split behavior to preserve backwards compatibility.
- Around line 708-720: The code unnecessarily calls
corrected_df.rename(columns={"index": protein_col}) after
corrected_matrix.reset_index(): because the DataFrame index already has the name
protein_col the reset_index() call will produce a column named protein_col, so
remove the redundant rename call (or, alternatively, change it to rename from
the actual column name if you expect a different name); update the block that
creates corrected_df (using corrected_matrix.reset_index()) to omit the rename
and leave corrected_df as-is, ensuring logger and return continue to use
corrected_df, corrected_matrix, incomplete_matrix, protein_col, and sample_cols
as before.
In `@mokume/preprocessing/aggregation.py`:
- Around line 151-158: The code calls choice.channels() for non-LFQ branches
without checking that choice (Optional[IsobaricLabel]) is not None, which can
raise AttributeError; in the branch that assigns data_df[CHANNEL] =
data_df[CHANNEL].map(choice.channels()) (and where label !=
QuantificationCategory.LFQ) add a defensive check that choice is not None and
raise a clear ValueError (or handle a sensible default) if it is; reference the
symbols choice, IsobaricLabel, CHANNEL, QuantificationCategory.LFQ, and the
assignment to data_df[CHANNEL] so the guard is added immediately before calling
choice.channels().
- Around line 79-99: The current remove_contaminants_entrapments_decoys function
uses dataset[protein_field].str.contains(cregex) without handling NaNs, which
causes rows with missing protein names to be dropped; update the call to
str.contains to pass na=False (and if you want to be extra safe against special
characters, build the pattern using re.escape on each contaminant before
joining) so that missing protein_field values are not treated as matches and are
preserved.
- Around line 185-199: Replace hardcoded column names with constants: change
"Condition" to CONDITION and "Reference" to REFERENCE in the data filtering and
missing-files logic inside the block that manipulates data_df (where INTENSITY
and RUN are already used). Add an import for REFERENCE from
mokume.core.constants alongside existing imports (so CONDITION and REFERENCE are
available), and update the logger message to reference the REFERENCE constant
usage location; keep the rest of the logic (filtering zeros, dropping NA RUNs)
unchanged.
In `@mokume/quantification/ratio.py`:
- Around line 382-388: The current filter df =
df[~df[PROTEIN_NAME].str.contains(";")] will drop rows where PROTEIN_NAME is NaN
because str.contains defaults na=NaN; update the filter to handle missing values
explicitly (e.g., use str.contains(";", na=False)) so NaN protein rows are not
silently removed, and then continue with the peptide_counts =
df.groupby(PROTEIN_NAME)[PEPTIDE_CANONICAL].nunique() / valid_proteins logic
unchanged to enforce min_unique_peptides on actual protein names.
In `@mokume/reports/qc_report.py`:
- Around line 134-154: In _compute_pca, guard against the case where data
becomes empty after dropna (or has too few rows) before calling PCA: check
data.shape and if there are no columns or fewer than the minimum samples (mirror
the _compute_tsne check, e.g., if data.shape[0] < 4 or data.shape[1] == 0),
return a result dictionary with the same keys ("samples", "conditions",
"plexes", "pc1", "pc2", optional "pc3", and "var_explained") filled with safe
defaults (zeros or empty lists of correct length and var_explained filled with
zeros) instead of calling PCA; ensure you reference _compute_pca, samples,
sample_to_condition, and sample_to_plex so callers receive consistent shaped
outputs.
- Around line 492-493: The file is opened without an explicit encoding which can
corrupt non-ASCII characters; modify the open call that writes the HTML (the
with open(output_html, "w") as f: block that writes the html variable) to
specify encoding="utf-8" so the output HTML (which declares <meta
charset="utf-8">) is written with matching UTF-8 encoding.
- Around line 68-74: The code incorrectly replaces zeros with NaN when is_log2
is True, dropping valid zero log2 values; update the block around
intensity/log2_matrix so zeros are only replaced with NaN when converting raw
intensities to log2 (i.e., when not is_log2). Specifically, in the branch for
is_log2 use log2_matrix = intensity (don’t call replace(0, np.nan)), and keep
the replace+np.log2 behavior only in the not is_log2 branch
(intensity.replace(0, np.nan) then np.log2). Refer to the variables intensity,
log2_matrix, is_log2, protein_df, protein_col, and exp_samples to locate and fix
the logic (mirror the fix used in differential_expression.py).
- Around line 319-336: The loop that computes weights divides by
total_var_explained which can be zero (n_components, PCA,
pca.explained_variance_ratio_, total_var_explained); to fix, guard against
total_var_explained == 0 by computing a safe denominator: if total_var_explained
> 0 keep the current weight = pca.explained_variance_ratio_[i] /
total_var_explained, otherwise use an equal weight = 1.0 / n_components for each
component; apply this change where weight is computed before calling _anova_r2
and use the equal-weight branch when explained variance sums to zero.
- Around line 884-967: The embedded JSON for distributions, condition_cv
(cvData) and missing['per_sample'] is not being escaped for a closing script
tag; update the json.dumps(...) calls that produce the JS constants "dist" (used
for boxplot), "cvData" (used for cv-violin) and the "missing" object (used for
missing-plot) to apply .replace('</', '<\\/') on the dumped string so any
"</script>" sequences are neutralized before injecting into the page; keep the
existing variable names and plot targets ('boxplot', 'cv-violin',
'missing-plot') so only the json embedding calls change.
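The escape itself is a one-liner and round-trips cleanly through JSON (illustrative payload, not the report data):

```python
import json

payload = {"note": "</script><script>alert(1)</script>"}

raw = json.dumps(payload)         # contains a literal "</script>"
safe = raw.replace("</", "<\\/")  # "<\/" is still valid JSON for "/"

roundtrip = json.loads(safe)      # parses back to the original payload
```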
In `@mokume/reports/workflow_comparison.py`:
- Around line 83-88: The local variable named html returned from
_build_comparison_html shadows the imported html module (so subsequent calls
like html.escape would fail); rename that local variable (e.g., rendered_html or
html_content) wherever it's used and write that renamed variable to the file
instead of html, and open the file with encoding="utf-8" (use output_html and
the result of _build_comparison_html accordingly) so the module remains
accessible and the file encoding matches the declared charset.
🧹 Nitpick comments (17)
mokume/normalization/irs.py (2)
128-139: `transform` silently skips samples missing from scaling factors or the DataFrame.

Line 136 skips any sample not found in `result.columns` or not mapped to a known plex. If the caller passes a different set of samples than what was fitted (e.g., a typo in sample names), no error or warning is raised, and un-scaled columns are returned as-is. Consider logging a warning when samples are skipped so misconfigurations surface early.

Proposed enhancement

```diff
+    skipped = []
     for sample, plex in self._sample_to_plex.items():
         if sample in result.columns and plex in self.scaling_factors_:
             result[sample] = result[sample] * self.scaling_factors_[plex]
+        else:
+            skipped.append(sample)
+
+    if skipped:
+        logger.warning(f"IRS transform: {len(skipped)} samples skipped (not in DataFrame or plex): {skipped}")

     return result.reset_index()
```
+ skipped = [] for sample, plex in self._sample_to_plex.items(): if sample in result.columns and plex in self.scaling_factors_: result[sample] = result[sample] * self.scaling_factors_[plex] + else: + skipped.append(sample) + + if skipped: + logger.warning(f"IRS transform: {len(skipped)} samples skipped (not in DataFrame or plex): {skipped}") return result.reset_index()
270-305: Fallback to"plex1"may silently group unrelated samples.When source names don't match either the
p\d+_\d+pattern or theprefix_digitconvention (lines 292–301), all such samples are lumped into a single"plex1". In datasets with non-standard naming, this could silently produce incorrect plex assignments. The warning at line 302 helps, but for a normalization step where correctness is critical, consider raising an error (or at minimum returning a flag) when a significant fraction of samples fall through to the default.mokume/analysis/differential_expression.py (2)
122-122: Prefix unusedstatwith underscore.Static analysis flags
statas unused. Minor lint fix.Suggested fix
- stat, pvalue = stats.ttest_ind(vals_a, vals_b, equal_var=False) + _stat, pvalue = stats.ttest_ind(vals_a, vals_b, equal_var=False)
245-247: Prefix unusedrejectwith underscore.Suggested fix
- reject, adj_pvalues, _, _ = multipletests( + _reject, adj_pvalues, _, _ = multipletests(mokume/reports/qc_report.py (2)
499-499:highlight_genesandde_resultsparameters are accepted but unused.Static analysis confirms these are dead parameters. Either remove them from the signature or implement the intended highlighting/DE logic. If this is planned for a future iteration, add a TODO.
981-1034: DE quality JS rendering looks correct.Consider importing
jsonat the module level (it's already used in_build_qc_html) rather than locally in both functions, but this is a minor style point.mokume/reports/workflow_comparison.py (1)
622-638: User-supplied marker gene names are not HTML-escaped.
info['name'],acc,r['significance'], andfcare injected directly into HTML without escaping. While marker gene data is typically programmatic, applyinghtml.escape()would be safer against unexpected inputs.Suggested fix (example for line 624)
- row = f"<tr><td><b>{info['name']}</b></td><td>{acc}</td>" - row += f"<td class='{'good' if info['expected'] == 'UP' else 'bad'}'>{info['expected']}</td>" + row = f"<tr><td><b>{html.escape(info['name'])}</b></td><td>{html.escape(acc)}</td>" + row += f"<td class='{'good' if info['expected'] == 'UP' else 'bad'}'>{html.escape(info['expected'])}</td>"mokume/preprocessing/aggregation.py (2)
208-219: Fragile technical replicate extraction from run names.The heuristic at lines 209-219 attempts to parse tech replicate IDs from run names. If the last underscore-delimited segment is not numeric (e.g., run name
"sample_Brain"), theastype("int")on line 212 throwsValueError, caught by the except block. The fallback (line 217-219) assigns sequential indices, which may not reflect actual technical replicates. This is acceptable as a best-effort approach, but worth documenting the assumption.
358-365: Consider usinggroupby().agg()instead oftransform("sum")+drop_duplicates().The current approach replaces each row's intensity with the group sum via
transform, then deduplicates. A directgroupby().agg({"NormIntensity": "sum"}).reset_index()would be more idiomatic, clearer in intent, and avoids the intermediate oversized DataFrame.mokume/quantification/ratio.py (1)
131-163: Side-effect:_compute_reference_intensitymutates the caller'sdfby adding_plex.Line 138 adds a
_plexcolumn directly to thedfpassed byquantify(). While_compute_log2_ratioslater overwrites it, this hidden mutation couples the two methods. Consider operating on a copy or filtering without modifying the input.mokume/pipeline/stages.py (7)
73-82: Unused variablesample_names.
sample_names(line 80) is unpacked but never referenced. Prefix with_to signal intentional discard.Suggested fix
technical_repetitions, label, sample_names, choice = analyse_sdrf( + technical_repetitions, label, _sample_names, choice = analyse_sdrf(
112-116:groupby().filter(lambda x: len(set(...)))is slow — prefernunique().Creating a
setinside a lambda for each group is inefficient on large DataFrames. A vectorized approach is faster and clearer:Suggested refactor
- dataset_df = dataset_df.groupby(PROTEIN_NAME).filter( - lambda x: len(set(x[PEPTIDE_CANONICAL])) - >= self.config.filtering.min_unique_peptides - ) + peptide_counts = dataset_df.groupby(PROTEIN_NAME)[PEPTIDE_CANONICAL].nunique() + valid_proteins = peptide_counts[peptide_counts >= self.config.filtering.min_unique_peptides].index + dataset_df = dataset_df[dataset_df[PROTEIN_NAME].isin(valid_proteins)]
178-192: Duplicated protein-accession parsing logic across files.The pattern at lines 180-185 (extract first
pg_accessionselement, split by|, take UniProt ID) is repeated inmokume/quantification/ratio.pylines 352-357 and here. Extract a shared helper (e.g.,parse_pg_accessions(series) -> Series) to avoid drift between the two copies.
416-429: Overly complex column deduplication — simplify reference sample removal.The logic prepends
protein_col, then re-includes it viac == protein_colin the comprehension, then manually deduplicates. This can be simplified:Suggested simplification
- protein_col = protein_df.columns[0] - cols_to_keep = [protein_col] + [ - c for c in protein_df.columns - if c == protein_col or c not in ref_samples - ] - # Deduplicate while preserving order - seen = set() - unique_cols = [] - for c in cols_to_keep: - if c not in seen: - seen.add(c) - unique_cols.append(c) - protein_df = protein_df[unique_cols] + protein_col = protein_df.columns[0] + cols_to_keep = [c for c in protein_df.columns if c == protein_col or c not in ref_samples] + protein_df = protein_df[cols_to_keep]
507-526: Unusedmethod_nameparameter.The
method_nameparameter is accepted but never referenced in the method body. Either remove it or use it (e.g., in the debug log at line 517).
540-546: Unused `mw_dict` from `extract_fasta` return value.

Prefix with `_` to signal intentional discard:

```diff
-    unique_peptide_counts, mw_dict, found_proteins = extract_fasta(
+    unique_peptide_counts, _mw_dict, found_proteins = extract_fasta(
```
896-903: `rsplit(".", 1)` result used ambiguously when the filename has no extension.

If `report_output` has no `.` (no extension), `rsplit(".", 1)` returns a single-element list. Line 900 accesses `base[0]`, which works, but `f"{base[0]}_{contrast_name}.html"` always appends `.html` regardless of the original extension. This is probably fine, but the variable name `base` for a list is a bit unclear.
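The behavior in question is easy to pin down with plain `rsplit` (the `contrast_name` value is made up):

```python
# rsplit(".", 1) yields two parts only when a dot exists; otherwise the whole
# string comes back as a single-element list, so base[0] still works.
with_ext = "report.html".rsplit(".", 1)
without_ext = "report".rsplit(".", 1)
print(with_ext)     # ['report', 'html']
print(without_ext)  # ['report']

contrast_name = "A_vs_B"  # hypothetical contrast label
# Either way the code appends .html, discarding any original extension:
print(f"{with_ext[0]}_{contrast_name}.html")     # report_A_vs_B.html
print(f"{without_ext[0]}_{contrast_name}.html")  # report_A_vs_B.html
```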
```python
# Filter out zeros/inf
valid = np.isfinite(s2) & (s2 > 0) & np.isfinite(df) & (df > 0)
s2 = s2[valid]
df = df[valid]

if len(s2) < 3:
    logger.warning("Too few proteins for eBayes estimation, using defaults")
    return 3.0, float(np.median(s2))
```
Edge case: np.median(s2) when filtered array is empty.
If all s2 values are zero, non-finite, or df <= 0, the valid mask could leave an empty array. np.median([]) produces a runtime warning and returns nan, which would propagate into the moderated variance calculation and silently produce nan p-values for all proteins.
Consider returning early with a more defensive default or raising when no valid variances remain.
Suggested guard

```diff
 if len(s2) < 3:
     logger.warning("Too few proteins for eBayes estimation, using defaults")
-    return 3.0, float(np.median(s2))
+    median_s2 = float(np.median(s2)) if len(s2) > 0 else 1.0
+    return 3.0, median_s2
```
🤖 Prompt for AI Agents
In `@mokume/analysis/differential_expression.py` around lines 342 - 349, The code
filters s2 and df into valid arrays then calls np.median(s2) which can be empty;
add an explicit guard after filtering: if len(s2) == 0 then log a clear warning
via logger.warning and return a safe default (e.g. prior_df=3.0, prior_var=1.0)
or raise a ValueError depending on desired behavior—ensure you update the block
that currently does "if len(s2) < 3: ... return 3.0, float(np.median(s2))" to
handle the zero-length case separately so np.median is never called on an empty
s2.
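The failure mode is straightforward to reproduce with NumPy alone, together with the suggested fallback:

```python
import warnings
import numpy as np

s2 = np.array([0.0, -1.0, np.inf, np.nan])
df = np.array([1.0, 1.0, 1.0, 0.0])

# Same validity mask as above: here every entry is filtered out.
valid = np.isfinite(s2) & (s2 > 0) & np.isfinite(df) & (df > 0)
s2 = s2[valid]

with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    bad_median = np.median(s2)  # nan on an empty array

# Defensive fallback: only call median when data remains.
median_s2 = float(np.median(s2)) if len(s2) > 0 else 1.0
print(len(s2), bool(np.isnan(bad_median)), median_s2)  # 0 True 1.0
```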
```python
def get_report_from_database(self, samples: list, columns: list = None):
    """Retrieves a standardized report from the database for specified samples."""
    cols = ",".join(columns) if columns is not None else "*"
    database = self.parquet_db.sql(
        """SELECT {} FROM parquet_db WHERE sample_accession IN {}""".format(
            cols, tuple(samples)
        )
    )
    report = database.df()
    return Feature.standardize_df(report)
```
SQL injection risk: samples and columns are interpolated directly into SQL.
tuple(samples) relies on Python's repr() for string quoting, which does not perform SQL escaping. A sample name containing a single quote (e.g., "sample'1") would produce a malformed query or injection vector. Use DuckDB's parameterized queries instead:
Suggested approach

```diff
 def get_report_from_database(self, samples: list, columns: list = None):
     cols = ",".join(columns) if columns is not None else "*"
-    database = self.parquet_db.sql(
-        """SELECT {} FROM parquet_db WHERE sample_accession IN {}""".format(
-            cols, tuple(samples)
-        )
-    )
+    placeholders = ", ".join(["?"] * len(samples))
+    database = self.parquet_db.execute(
+        f"SELECT {cols} FROM parquet_db WHERE sample_accession IN ({placeholders})",
+        samples,
+    )
     report = database.df()
     return Feature.standardize_df(report)
```

Note: the same pattern exists in `get_report_condition_from_database` (line 408).
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 310-310: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
[error] 314-316: Possible SQL injection vector through string-based query construction
(S608)
🤖 Prompt for AI Agents
In `@mokume/io/feature.py` around lines 310 - 319, The current
get_report_from_database (and similarly get_report_condition_from_database)
builds SQL by interpolating columns and tuple(samples) which is an SQL injection
risk; change it to use parameterized queries for the samples (e.g., pass samples
as a parameter to parquet_db.sql/execute instead of formatting them into the
string) and avoid direct string interpolation for columns by
validating/whitelisting column names or mapping requested column names to safe
identifiers before joining them into the SELECT list; update the query
construction to use a parameter placeholder for the IN list and call
parquet_db.sql/execute with the samples parameter, then return
Feature.standardize_df(report) as before.
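The placeholder pattern generalizes beyond DuckDB. A minimal sketch with the stdlib `sqlite3` driver (table and sample names invented for the demo) shows why parameter binding stays safe even for quoted values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parquet_db (sample_accession TEXT, intensity REAL)")
conn.executemany(
    "INSERT INTO parquet_db VALUES (?, ?)",
    [("sample-1", 10.0), ("sample'1", 20.0), ("sample-3", 30.0)],
)

# A quoted sample name is harmless because values travel as bound parameters,
# never as SQL text.
samples = ["sample-1", "sample'1"]
placeholders = ", ".join(["?"] * len(samples))
rows = conn.execute(
    f"SELECT sample_accession, intensity FROM parquet_db "
    f"WHERE sample_accession IN ({placeholders})",
    samples,
).fetchall()

print(sorted(rows))  # both requested samples, including the quoted one
```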
```python
def get_report_condition_from_database(self, cons: list, columns: list = None) -> pd.DataFrame:
    """Retrieves a standardized report from the database for specified conditions."""
    cols = ",".join(columns) if columns is not None else "*"
    database = self.parquet_db.sql(
        f"""SELECT {cols} FROM parquet_db WHERE condition IN {tuple(cons)}"""
    )
    report = database.df()
    return Feature.standardize_df(report)
```
Same SQL injection pattern as get_report_from_database.
get_report_condition_from_database uses the same tuple(cons) interpolation into SQL. Apply the same parameterized query fix recommended for get_report_from_database.
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 404-404: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
[error] 408-408: Possible SQL injection vector through string-based query construction
(S608)
🤖 Prompt for AI Agents
In `@mokume/io/feature.py` around lines 404 - 411, The SQL in
get_report_condition_from_database currently injects cons via f-string and
tuple(cons); change it to a parameterized query using placeholders for the IN
clause and pass cons as parameters to parquet_db.sql to avoid SQL injection
(e.g., build a placeholder list based on len(cons) and call
parquet_db.sql(query, params)). Update the call sites in
get_report_condition_from_database and ensure Feature.standardize_df(report)
still receives the resulting dataframe from database.df(); keep using cols =
",".join(columns) if provided but do not inline cons directly into the SQL
string.
mokume/io/feature.py
Outdated
```python
irs_df = self.parquet_db.sql(
    f"""
    SELECT run, {stat_fn}(intensity) as irs_value, mixture, techreplicate as techrep_guess
    FROM (
        SELECT *,
            CASE WHEN position('_' in run) > 0 THEN CAST(split_part(run, '_', 2) AS INTEGER)
            ELSE CAST(run AS INTEGER) END AS techreplicate
        FROM parquet_db
        WHERE {where_clause}
    )
    GROUP BY run, mixture, techrep_guess
    """
).df()
```
🧩 Analysis chain
🌐 Web query:
DuckDB TRY_CAST function support documentation
💡 Result:
DuckDB supports TRY_CAST as a “safe cast” that returns NULL instead of throwing an error when a conversion can’t be performed. Syntax: TRY_CAST(expr AS TYPENAME). Example: SELECT TRY_CAST('hello' AS INTEGER); returns NULL. [1]
Related docs/notes:
- Regular `CAST` can fail (e.g., invalid strings, overflow), while `TRY_CAST` converts those failures to `NULL` (including overflow cases like casting `999` to `TINYINT`). [1][2]
- DuckDB also provides a more general `TRY(<expression>)` wrapper (inspired by `TRY_CAST`) to return `NULL` for rows where an expression errors, not just casts. [3][4]
Sources: DuckDB docs on Casting / TRY_CAST [1], Typecasting (overflow note) [2], TRY expression [3], DuckDB 1.3.0 release notes (generalization to TRY) [4].
CAST(split_part(run, '_', 2) AS INTEGER) may return NULL with non-numeric position-2 segments, causing GROUP BY rows to be lost.
If run names have a non-numeric value at position 2 after splitting by _ (e.g., "sample_brain_1"), the CAST silently returns NULL in DuckDB. This causes those rows to be grouped under a NULL techreplicate value, losing data. While the method won't crash—the call is wrapped in a broad try/except—silently dropping data is worse than early failure.
Additionally, this approach differs from get_unique_tec_reps() (lines 342–365), which extracts the last segment and includes Python-level error handling.
Consider using TRY_CAST with COALESCE to either extract position 2 or fall back to the entire run value:
Suggested fix

```diff
-    CASE WHEN position('_' in run) > 0 THEN CAST(split_part(run, '_', 2) AS INTEGER)
-    ELSE CAST(run AS INTEGER) END AS techreplicate
+    COALESCE(
+        TRY_CAST(split_part(run, '_', 2) AS INTEGER),
+        TRY_CAST(run AS INTEGER)
+    ) AS techreplicate
```
🧰 Tools
🪛 Ruff (0.14.14)
[error] 519-529: Possible SQL injection vector through string-based query construction
(S608)
🤖 Prompt for AI Agents
In `@mokume/io/feature.py` around lines 518 - 530, The SQL building the irs_df
uses CAST(split_part(run, '_', 2) AS INTEGER) to compute techreplicate which can
silently yield NULL for non-numeric segments; change this to use TRY_CAST and
COALESCE so non-numeric parts don't drop data (e.g., compute techreplicate with
TRY_CAST(split_part(run,'_',2) AS INTEGER) and COALESCE to fallback to
TRY_CAST(split_part(run,'_',-1) AS INTEGER) or TRY_CAST(run AS INTEGER) similar
to get_unique_tec_reps()), ensuring the SELECT in the irs_df query (symbols:
irs_df, parquet_db, stat_fn, where_clause, techreplicate) preserves rows instead
of producing NULL group keys.
```python
# Compute global reference intensity (geometric mean across plexes)
ref_df = pd.DataFrame(plex_ref_intensity)
# Replace 0 and negatives with NaN for geometric mean
ref_df[ref_df <= 0] = np.nan
log_refs = np.log(ref_df)
global_ref = np.exp(log_refs.mean(axis=1))

# Compute scaling factors: global_ref / plex_ref for each plex
scaling_factors = {}
for plex in plexes:
    scaling_factors[plex] = global_ref / plex_ref_intensity[plex]
```
Division by zero / negative denominator when computing scaling factors.
plex_ref_intensity[plex] (line 106) is computed from raw data (lines 90–94) and may contain zero or negative values for some proteins. While global_ref sanitizes non-positive values via the NaN replacement on line 99, the denominator is not similarly protected. This produces inf / -inf scaling factors that silently corrupt downstream intensities.
Proposed fix

```diff
 # Compute scaling factors: global_ref / plex_ref for each plex
 scaling_factors = {}
 for plex in plexes:
-    scaling_factors[plex] = global_ref / plex_ref_intensity[plex]
+    denom = plex_ref_intensity[plex].replace(0, np.nan)
+    denom[denom < 0] = np.nan
+    scaling_factors[plex] = global_ref / denom
```
🤖 Prompt for AI Agents
In `@mokume/normalization/irs.py` around lines 96 - 106, The code computing
scaling_factors can divide by zero/negative values because
plex_ref_intensity[plex] is not sanitized; update the logic in the block around
plex_ref_intensity, global_ref and scaling_factors so denominators <= 0 are
replaced with np.nan (or a small positive epsilon) before division, compute
scaling_factors as global_ref / sanitized_plex_ref and then handle resulting
NaNs (e.g., propagate NaN or fill with 1.0) to avoid inf/-inf values; refer to
variables plex_ref_intensity, global_ref, scaling_factors and the loop iterating
over plexes when making this change.
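A small pandas example confirms both the `inf` leak and the effect of sanitizing the denominator (values are made up):

```python
import numpy as np
import pandas as pd

global_ref = pd.Series([100.0, 50.0, 10.0])
plex_ref = pd.Series([20.0, 0.0, -5.0])  # unsanitized per-plex reference

naive = global_ref / plex_ref  # zero/negative denominators slip through

denom = plex_ref.replace(0, np.nan)
denom[denom < 0] = np.nan
safe = global_ref / denom  # bad denominators become NaN instead of inf

print(naive.tolist())  # [5.0, inf, -2.0]
print(safe.tolist())   # [5.0, nan, nan]
```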
```python
def _compute_pca(log2_matrix, samples, sample_to_condition, sample_to_plex):
    """Compute PCA coordinates."""
    data = log2_matrix[samples].T.dropna(axis=1)
    data = data.fillna(data.median())

    n_components = min(3, data.shape[0], data.shape[1])
    pca = PCA(n_components=n_components)
    coords = pca.fit_transform(data)

    result = {
        "samples": list(data.index),
        "conditions": [sample_to_condition.get(s, "Unknown") for s in data.index],
        "plexes": [sample_to_plex.get(s, "plex1") for s in data.index],
        "pc1": coords[:, 0].tolist(),
        "pc2": coords[:, 1].tolist() if n_components >= 2 else [0] * len(data),
        "var_explained": (pca.explained_variance_ratio_ * 100).tolist(),
    }
    if n_components >= 3:
        result["pc3"] = coords[:, 2].tolist()

    return result
```
Edge case: PCA will raise if all proteins are dropped after dropna.
If every protein has at least one NaN across all samples, dropna(axis=1) on the transposed matrix leaves zero columns and PCA.fit_transform will fail. A guard similar to _compute_tsne's if data.shape[0] < 4 check would be prudent.
🤖 Prompt for AI Agents
In `@mokume/reports/qc_report.py` around lines 134 - 154, In _compute_pca, guard
against the case where data becomes empty after dropna (or has too few rows)
before calling PCA: check data.shape and if there are no columns or fewer than
the minimum samples (mirror the _compute_tsne check, e.g., if data.shape[0] < 4
or data.shape[1] == 0), return a result dictionary with the same keys
("samples", "conditions", "plexes", "pc1", "pc2", optional "pc3", and
"var_explained") filled with safe defaults (zeros or empty lists of correct
length and var_explained filled with zeros) instead of calling PCA; ensure you
reference _compute_pca, samples, sample_to_condition, and sample_to_plex so
callers receive consistent shaped outputs.
```python
n_components = min(10, data.shape[0] - 1, data.shape[1])
pca = PCA(n_components=n_components)
pc_data = pca.fit_transform(data)

# For each PC, compute R² for condition and plex using simple 1-way ANOVA
condition_r2_weighted = 0
plex_r2_weighted = 0
total_var_explained = sum(pca.explained_variance_ratio_[:n_components])

for i in range(n_components):
    pc_values = pc_data[:, i]
    weight = pca.explained_variance_ratio_[i] / total_var_explained

    cond_r2 = _anova_r2(pc_values, conditions)
    plex_r2 = _anova_r2(pc_values, plexes)

    condition_r2_weighted += cond_r2 * weight
    plex_r2_weighted += plex_r2 * weight
```
Division-by-zero risk if total_var_explained is zero.
If all PCA components explain zero variance (e.g., all-constant data), Line 330 divides by total_var_explained = 0. Guard against this.
Suggested fix

```diff
 total_var_explained = sum(pca.explained_variance_ratio_[:n_components])
+if total_var_explained == 0:
+    return {
+        "condition_pct": 0.0,
+        "plex_pct": 0.0,
+        "residual_pct": 100.0,
+        "n_pcs_used": n_components,
+        "total_var_captured": 0.0,
+    }
 for i in range(n_components):
     pc_values = pc_data[:, i]
     weight = pca.explained_variance_ratio_[i] / total_var_explained
```
🤖 Prompt for AI Agents
In `@mokume/reports/qc_report.py` around lines 319 - 336, The loop that computes
weights divides by total_var_explained which can be zero (n_components, PCA,
pca.explained_variance_ratio_, total_var_explained); to fix, guard against
total_var_explained == 0 by computing a safe denominator: if total_var_explained
> 0 keep the current weight = pca.explained_variance_ratio_[i] /
total_var_explained, otherwise use an equal weight = 1.0 / n_components for each
component; apply this change where weight is computed before calling _anova_r2
and use the equal-weight branch when explained variance sums to zero.
```python
with open(output_html, "w") as f:
    f.write(html)
```
Specify encoding="utf-8" when writing the HTML file.
The HTML declares <meta charset="utf-8"> (Line 579) but the file is opened without explicit encoding. On some platforms the default encoding is not UTF-8, which could corrupt non-ASCII characters in sample/protein names.
Suggested fix

```diff
-    with open(output_html, "w") as f:
+    with open(output_html, "w", encoding="utf-8") as f:
```
🤖 Prompt for AI Agents
In `@mokume/reports/qc_report.py` around lines 492 - 493, The file is opened
without an explicit encoding which can corrupt non-ASCII characters; modify the
open call that writes the HTML (the with open(output_html, "w") as f: block that
writes the html variable) to specify encoding="utf-8" so the output HTML (which
declares <meta charset="utf-8">) is written with matching UTF-8 encoding.
```python
        // Per-sample boxplots (distribution)
        (function() {{
            const dist = {json.dumps(distributions)};
            const samples = Object.keys(dist).sort();
            const traces = [];
            samples.forEach(s => {{
                const d = dist[s];
                traces.push({{
                    type: 'box',
                    name: s,
                    y: [d.min, d.q25, d.median, d.q75, d.max],
                    boxmean: true,
                    marker: {{ color: condColors[d.condition] || '#999' }},
                    line: {{ width: 1 }},
                }});
            }});
            // Use simplified box representation
            const boxTraces = [];
            const conds = [...new Set(samples.map(s => dist[s].condition))].sort();
            conds.forEach(cond => {{
                const condSamples = samples.filter(s => dist[s].condition === cond);
                boxTraces.push({{
                    type: 'bar',
                    name: cond,
                    x: condSamples,
                    y: condSamples.map(s => dist[s].median),
                    error_y: {{
                        type: 'data',
                        symmetric: false,
                        array: condSamples.map(s => dist[s].q75 - dist[s].median),
                        arrayminus: condSamples.map(s => dist[s].median - dist[s].q25),
                    }},
                    marker: {{ color: condColors[cond] || '#999' }},
                }});
            }});
            Plotly.newPlot('boxplot', boxTraces, {{
                title: {{ text: 'Per-Sample Distribution (median + IQR)', font: {{ size: 14 }} }},
                xaxis: {{ tickangle: 45, tickfont: {{ size: 9 }} }},
                yaxis: {{ title: '{"log2 ratio" if summary["is_log2"] else "log2 intensity"}' }},
                barmode: 'group',
                margin: {{ t: 50, b: 100, l: 60, r: 20 }},
            }}, {{responsive: true}});
        }})();

        // CV violin/box by condition
        (function() {{
            const cvData = {json.dumps({k: v['cv_values'] for k, v in condition_cv.items()})};
            const traces = [];
            Object.keys(cvData).sort().forEach(cond => {{
                traces.push({{
                    type: 'violin',
                    name: cond,
                    y: cvData[cond],
                    box: {{ visible: true }},
                    meanline: {{ visible: true }},
                    marker: {{ color: condColors[cond] || '#999' }},
                    line: {{ width: 1 }},
                }});
            }});
            Plotly.newPlot('cv-violin', traces, {{
                title: {{ text: 'Per-Protein CV Distribution by Condition', font: {{ size: 14 }} }},
                yaxis: {{ title: 'Coefficient of Variation', range: [0, 3] }},
                margin: {{ t: 50, b: 40, l: 60, r: 20 }},
            }}, {{responsive: true}});
        }})();

        // Missing values per sample
        (function() {{
            const missing = {json.dumps(missing['per_sample'])};
            const samples = Object.keys(missing).sort();
            Plotly.newPlot('missing-plot', [{{
                type: 'bar',
                x: samples,
                y: samples.map(s => (missing[s].rate * 100).toFixed(1)),
                marker: {{ color: samples.map(s => missing[s].rate > 0.3 ? '#e74c3c' : '#27ae60') }},
                text: samples.map(s => (missing[s].rate * 100).toFixed(1) + '%'),
                textposition: 'auto',
            }}], {{
                title: {{ text: 'Missing Value Rate per Sample', font: {{ size: 14 }} }},
                xaxis: {{ tickangle: 45, tickfont: {{ size: 9 }} }},
                yaxis: {{ title: '% Missing', range: [0, 100] }},
                margin: {{ t: 50, b: 100, l: 60, r: 20 }},
            }}, {{responsive: true}});
        }})();
```
Inconsistent </script> injection protection in JSON embedding.
Lines 766, 809, 865 correctly use .replace('</', '<\\/') on json.dumps() output to prevent </script> tag injection, but Lines 886, 930, and 952 omit this escaping. Apply it consistently to all embedded JSON data.
Suggested fix (example for line 886)

```diff
-            const dist = {json.dumps(distributions)};
+            const dist = {json.dumps(distributions).replace('</', '<\\/')};
```
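A quick stdlib-only check of the escaping, using a deliberately hostile key (made up for the demo):

```python
import json

# A malicious-looking value that would otherwise close the <script> tag early.
distributions = {"sample</script><script>alert(1)//": {"median": 1.5}}

raw = json.dumps(distributions)
escaped = raw.replace('</', '<\\/')

print('</script>' in raw)      # True: raw JSON can break out of the tag
print('</script>' in escaped)  # False: the sequence is neutralized
# JSON allows the escaped solidus "\/", so the payload round-trips unchanged:
print(json.loads(escaped) == distributions)  # True
```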
🤖 Prompt for AI Agents
In `@mokume/reports/qc_report.py` around lines 884 - 967, The embedded JSON for
distributions, condition_cv (cvData) and missing['per_sample'] is not being
escaped for a closing script tag; update the json.dumps(...) calls that produce
the JS constants "dist" (used for boxplot), "cvData" (used for cv-violin) and
the "missing" object (used for missing-plot) to apply .replace('</', '<\\/') on
the dumped string so any "</script>" sequences are neutralized before injecting
into the page; keep the existing variable names and plot targets ('boxplot',
'cv-violin', 'missing-plot') so only the json embedding calls change.
| html = _build_comparison_html( | ||
| title, workflow_metrics, concordance, marker_results, marker_genes, | ||
| ) | ||
|
|
||
| with open(output_html, "w") as f: | ||
| f.write(html) |
html variable shadows the html module import — maintenance trap.
Line 83 assigns the string result to a local variable named html, which shadows the import html on Line 14. Any future code in this function referencing html.escape() will raise AttributeError. Also, specify encoding="utf-8" on the file write to match the charset declared in the output.
Suggested fix
- html = _build_comparison_html(
+ report_html = _build_comparison_html(
title, workflow_metrics, concordance, marker_results, marker_genes,
)
- with open(output_html, "w") as f:
- f.write(html)
+ with open(output_html, "w", encoding="utf-8") as f:
+ f.write(report_html)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| html = _build_comparison_html( | |
| title, workflow_metrics, concordance, marker_results, marker_genes, | |
| ) | |
| with open(output_html, "w") as f: | |
| f.write(html) | |
| report_html = _build_comparison_html( | |
| title, workflow_metrics, concordance, marker_results, marker_genes, | |
| ) | |
| with open(output_html, "w", encoding="utf-8") as f: | |
| f.write(report_html) |
🤖 Prompt for AI Agents
In `@mokume/reports/workflow_comparison.py` around lines 83 - 88, The local
variable named html returned from _build_comparison_html shadows the imported
html module (so subsequent calls like html.escape would fail); rename that local
variable (e.g., rendered_html or html_content) wherever it's used and write that
renamed variable to the file instead of html, and open the file with
encoding="utf-8" (use output_html and the result of _build_comparison_html
accordingly) so the module remains accessible and the file encoding matches the
declared charset.
support new QPX feature schema with backward compatibility
Actionable comments posted: 3
♻️ Duplicate comments (2)
mokume/io/feature.py (2)
360-369: ⚠️ Potential issue | 🟠 Major
Still interpolating raw values into the two report queries.
tuple(samples)/tuple(cons) and ",".join(columns) are still spliced directly into SQL. A quoted sample/condition or an unvalidated column name can break the query or inject SQL; use placeholders for the IN (...) values and whitelist column names before joining.
Also applies to: 448-455
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mokume/io/feature.py` around lines 360 - 369, The get_report_from_database method currently injects raw values into the SQL string (using tuple(samples) and ",".join(columns)); fix it by parameterizing the IN list with placeholders and passing samples as parameters to parquet_db.sql, and validate/whitelist column names before joining to prevent injection; update the same pattern in the other report query referenced (the block around lines 448-455) so both queries use prepared placeholders for sample/condition values and only allow vetted column names when building cols, leaving Feature.standardize_df(report) unchanged.
552-562: ⚠️ Potential issue | 🔴 Critical
IRS tech-rep parsing will fail on valid new-QPX run names.
The new-QPX fixture in tests/example/sdrf_new_qpx.tsv:2-3 uses run_A.raw, so the run value becomes run_A. split_part(run, '_', 2) is A, and CAST('A' AS INTEGER) raises in DuckDB, so IRS breaks on that dataset. Please mirror get_unique_tec_reps() by parsing the last token and using TRY_CAST plus explicit null handling.
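The suggested parsing change can be mirrored in pure Python to show why "last token plus TRY_CAST-style fallback" is safer than a hard CAST of the second token; `parse_tech_rep` is an illustrative name, not mokume code:

```python
def parse_tech_rep(run: str, default: int = 1) -> int:
    """Parse a technical-replicate number from a run name.

    Mirrors TRY_CAST(... AS INTEGER) semantics: take the *last*
    underscore-separated token and fall back to a default when it is
    not numeric, instead of raising like a plain CAST would.
    """
    token = run.rsplit("_", 1)[-1]
    try:
        return int(token)      # e.g. "sample_1_2" -> 2
    except ValueError:
        return default         # e.g. "run_A" -> fallback, like TRY_CAST -> NULL


print(parse_tech_rep("sample_1_2"), parse_tech_rep("run_A"))
```

A `CAST('A' AS INTEGER)` aborts the whole query; the fallback keeps IRS running on datasets whose run names carry no numeric suffix.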
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@mokume/io/feature.py`:
- Around line 271-294: The parquet_db view is rebuilt with SDRF-derived
sample_accession and condition but self.samples is only populated once in
__init__, so callers (iter_samples, experimental_inference,
get_low_frequency_peptides) still see stale sample metadata; after creating the
parquet_db view in the method containing this SQL (in mokume/io/feature.py)
refresh the cached sample list by reassigning or repopulating self.samples from
the updated parquet_db (reuse the same logic that __init__ uses to build
samples) so downstream methods reflect the enriched SDRF values.
In `@mokume/quantification/ratio.py`:
- Around line 396-400: The code currently falls back to run_file_name when the
SDRF join misses (df = df.merge(...); df['sample_accession'] =
df['source_name'].fillna(df['run_file_name'])), which hides unmapped
(run_file_name, label) pairs and later causes silent drops; remove the fillna
fallback and instead detect rows where the merge produced null SDRF values
(e.g., df['source_name'].isna()) immediately after the merge and raise a clear
error (RuntimeError) listing the offending (run_file_name, label) combinations
so the pipeline fails fast; update any downstream expectations (e.g., mappings
built in mokume/pipeline/stages.py) to rely on the validated
df['sample_accession'] populated only from SDRF values.
- Around line 364-373: The WHERE uses unprojected aliases so DuckDB will fail;
change the query in ratio.py where query is constructed (using charge_col,
unnest_sql, where_clause) to materialize the unnested rows first via a CTE or
subquery (e.g., WITH unnested AS (SELECT parquet_raw.*, unnest.* FROM
parquet_raw, UNNEST(intensities) AS unnest) SELECT pg_accessions, sequence,
{charge_col} AS precursor_charge, {unnest_sql} FROM unnested WHERE
{where_clause}) so predicates emitted by SQLFilterBuilder.build_where_clause()
can reference properly-scoped columns (or alternatively update
SQLFilterBuilder.build_where_clause to accept and apply explicit prefixes like
"unnest." to its column names); ensure all references to intensity and "unique"
are fully qualified to the subquery/CTE aliases.
---
Duplicate comments:
In `@mokume/io/feature.py`:
- Around line 360-369: The get_report_from_database method currently injects raw
values into the SQL string (using tuple(samples) and ",".join(columns)); fix it
by parameterizing the IN list with placeholders and passing samples as
parameters to parquet_db.sql, and validate/whitelist column names before joining
to prevent injection; update the same pattern in the other report query
referenced (the block around lines 448-455) so both queries use prepared
placeholders for sample/condition values and only allow vetted column names when
building cols, leaving Feature.standardize_df(report) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5f7d27c6-d70d-4cb5-9f57-c56aad0d0d06
⛔ Files ignored due to path filters (2)
tests/example/feature_wide_new_qpx.parquet is excluded by !**/*.parquet
tests/example/sdrf_new_qpx.tsv is excluded by !**/*.tsv
📒 Files selected for processing (4)
mokume/core/constants.py
mokume/io/feature.py
mokume/quantification/ratio.py
tests/test_peptide_normalize.py
🚧 Files skipped from review as they are similar to previous changes (1)
- mokume/core/constants.py
| self.parquet_db.execute("DROP VIEW IF EXISTS parquet_db") | ||
| self.parquet_db.execute(f""" | ||
| CREATE VIEW parquet_db AS | ||
| SELECT | ||
| p.sequence, | ||
| p.peptidoform, | ||
| p.pg_accessions, | ||
| p.charge, | ||
| p.run_file_name, | ||
| p."unique", | ||
| COALESCE(s.sdrf_sample_accession, {sa_fallback}) as sample_accession, | ||
| p.channel, | ||
| p.intensity, | ||
| p.run, | ||
| COALESCE(s.sdrf_condition, {sa_fallback}) as condition, | ||
| COALESCE(CAST(s.sdrf_biological_replicate AS INTEGER), 1) as biological_replicate, | ||
| COALESCE(CAST(s.sdrf_fraction AS VARCHAR), '1') as fraction, | ||
| split_part(COALESCE(s.sdrf_sample_accession, {sa_fallback}), '_', 1) as mixture | ||
| FROM parquet_db_unnested p | ||
| LEFT JOIN sdrf_mapping s | ||
| {join_clause} | ||
| """) | ||
|
|
||
| logger.info("Enriched parquet data with SDRF metadata from %s", sdrf_path) |
Refresh cached sample metadata after rebuilding parquet_db.
__init__() snapshots self.samples once, but this method replaces sample_accession and condition with SDRF-derived values. After enrichment, iter_samples(), experimental_inference(), and the denominator in get_low_frequency_peptides() still use the pre-SDRF run list.
♻️ Suggested fix
self.parquet_db.execute(f"""
CREATE VIEW parquet_db AS
SELECT
p.sequence,
p.peptidoform,
p.pg_accessions,
p.charge,
p.run_file_name,
p."unique",
COALESCE(s.sdrf_sample_accession, {sa_fallback}) as sample_accession,
p.channel,
p.intensity,
p.run,
COALESCE(s.sdrf_condition, {sa_fallback}) as condition,
COALESCE(CAST(s.sdrf_biological_replicate AS INTEGER), 1) as biological_replicate,
COALESCE(CAST(s.sdrf_fraction AS VARCHAR), '1') as fraction,
split_part(COALESCE(s.sdrf_sample_accession, {sa_fallback}), '_', 1) as mixture
FROM parquet_db_unnested p
LEFT JOIN sdrf_mapping s
{join_clause}
""")
+
+ self.samples = self.get_unique_samples()
+ self.labels = None
+ self.label = None
+ self.choice = None
+ self.technical_repetitions = None
 logger.info("Enriched parquet data with SDRF metadata from %s", sdrf_path)
🧰 Tools
🪛 Ruff (0.15.5)
[error] 272-292: Possible SQL injection vector through string-based query construction
(S608)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mokume/io/feature.py` around lines 271 - 294, The parquet_db view is rebuilt
with SDRF-derived sample_accession and condition but self.samples is only
populated once in __init__, so callers (iter_samples, experimental_inference,
get_low_frequency_peptides) still see stale sample metadata; after creating the
parquet_db view in the method containing this SQL (in mokume/io/feature.py)
refresh the cached sample list by reassigning or repopulating self.samples from
the updated parquet_db (reuse the same logic that __init__ uses to build
samples) so downstream methods reflect the enriched SDRF values.
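A toy sketch of the stale-cache problem this comment describes, and of the refresh-after-rebuild fix, using a plain Python class rather than the real DuckDB-backed Feature API:

```python
class FeatureStore:
    """Illustrative stand-in: snapshots sample metadata at construction."""

    def __init__(self, rows):
        self._rows = rows
        self.samples = self._unique_samples()  # snapshot, like __init__

    def _unique_samples(self):
        return sorted({r["sample_accession"] for r in self._rows})

    def enrich_with_sdrf(self, mapping):
        # Rebuild the "view" with SDRF-derived sample accessions ...
        for r in self._rows:
            r["sample_accession"] = mapping.get(
                r["sample_accession"], r["sample_accession"]
            )
        # ... and refresh the cached metadata in the same method, so that
        # iterators and downstream consumers see the enriched names.
        self.samples = self._unique_samples()


store = FeatureStore([{"sample_accession": "run1"}, {"sample_accession": "run2"}])
store.enrich_with_sdrf({"run1": "Sample-1", "run2": "Sample-2"})
print(store.samples)
```

Without the final reassignment, `store.samples` would still report the raw run ids after enrichment, which is exactly the inconsistency flagged for `iter_samples()` and friends.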
| query = f""" | ||
| SELECT | ||
| pg_accessions, | ||
| sequence, | ||
| {charge_col} as precursor_charge, | ||
| {unnest_sql} | ||
| FROM parquet_raw, UNNEST(intensities) as unnest | ||
| WHERE unnest.intensity IS NOT NULL | ||
| AND {where_clause} | ||
| """ |
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the file and the specific lines mentioned
cd mokume && head -n 373 quantification/ratio.py | tail -n 20Repository: bigbio/mokume
Length of output: 816
🏁 Script executed:
# Let's get more context around those lines
cd mokume && sed -n '350,380p' quantification/ratio.pyRepository: bigbio/mokume
Length of output: 1151
🏁 Script executed:
# Find SQLFilterBuilder definition
fd -t f . -name "*.py" | xargs rg "class SQLFilterBuilder"Repository: bigbio/mokume
Length of output: 349
🏁 Script executed:
# Search for where_clause building in ratio.py
cd mokume && rg "where_clause" quantification/ratio.py -B 2 -A 2Repository: bigbio/mokume
Length of output: 337
🏁 Script executed:
# Find and read the SQLFilterBuilder class
rg "class SQLFilterBuilder" -A 50 mokume/io/feature.pyRepository: bigbio/mokume
Length of output: 1845
🏁 Script executed:
# Also look for the build_where_clause method
rg "def build_where_clause" -A 30 mokume/io/feature.pyRepository: bigbio/mokume
Length of output: 1171
🏁 Script executed:
# Check what columns SQLFilterBuilder processes
rg "intensity" mokume/io/feature.py -B 3 -A 3Repository: bigbio/mokume
Length of output: 6342
🏁 Script executed:
# Search for tests of load_psm_data function
rg "load_psm_data" --type pyRepository: bigbio/mokume
Length of output: 252
🏁 Script executed:
# Look for test files related to ratio or quantification
fd -type f -name "*test*.py" | xargs rg -l "load_psm_data\|ratio" 2>/dev/nullRepository: bigbio/mokume
Length of output: 229
🏁 Script executed:
# Check the full implementation of load_psm_data to see context
sed -n '320,400p' mokume/quantification/ratio.pyRepository: bigbio/mokume
Length of output: 3076
🏁 Script executed:
# Verify DuckDB's behavior by checking if there are any comments or issues about this
rg "SELECT.*as.*WHERE" mokume/quantification/ratio.py -A 5 -B 5Repository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Search for test files
find . -type f -name "*test*.py" -o -name "test_*.py" | head -20Repository: bigbio/mokume
Length of output: 404
🏁 Script executed:
# Look at how load_psm_data is called in stages.py
rg -B 10 -A 10 "load_psm_data\(" mokume/pipeline/stages.pyRepository: bigbio/mokume
Length of output: 795
🏁 Script executed:
# Check if there are any error messages or issues reported with the SQL
rg "Column.*intensity.*not found\|intensity.*not found" . --type pyRepository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Let me verify DuckDB behavior - check if there's a simple way to test SELECT alias in WHERE
cat > /tmp/test_duckdb.py << 'EOF'
import duckdb
# Test if DuckDB allows SELECT aliases in WHERE clause
try:
conn = duckdb.connect(':memory:')
# Create test data
conn.execute("""
CREATE TABLE test AS
SELECT 1 as id, ARRAY[10, 20, 30] as values
""")
# Try to use SELECT alias in WHERE clause
result = conn.execute("""
SELECT id, v as intensity
FROM test, UNNEST(values) as t(v)
WHERE intensity > 15
""").fetchall()
print("SUCCESS: DuckDB allows SELECT aliases in WHERE")
print(result)
except Exception as e:
print(f"FAILED: {e}")
# Test with column prefix instead
try:
result = conn.execute("""
SELECT id, v as intensity
FROM test, UNNEST(values) as t(v)
WHERE v > 15
""").fetchall()
print("\nSUCCESS: Using actual column name works")
print(result)
except Exception as e:
print(f"FAILED: {e}")
EOF
python /tmp/test_duckdb.pyRepository: bigbio/mokume
Length of output: 205
🏁 Script executed:
# Check what columns are expected in parquet_raw by looking at schema handling
rg "sequence\|unique" mokume/quantification/ratio.py -B 2 -A 2Repository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Look for any column mapping or schema definition
rg "CREATE VIEW parquet_raw" mokume/quantification/ratio.py -A 5 -B 5Repository: bigbio/mokume
Length of output: 366
🏁 Script executed:
# Check if there's a separate WHERE clause for different contexts
rg "where_clause.*=" mokume/quantification/ratio.py -B 3 -A 3Repository: bigbio/mokume
Length of output: 263
🏁 Script executed:
# Look at the full function signature and inputs to load_psm_data
sed -n '310,330p' mokume/quantification/ratio.pyRepository: bigbio/mokume
Length of output: 643
🏁 Script executed:
# Get the full load_psm_data function definition
sed -n '310,320p' mokume/quantification/ratio.pyRepository: bigbio/mokume
Length of output: 400
🏁 Script executed:
# Look for function signature
rg "def load_psm_data" -A 15 mokume/quantification/ratio.pyRepository: bigbio/mokume
Length of output: 478
🏁 Script executed:
# Check if sequence and unique are expected columns in parquet
rg "pg_accessions\|sequence\|unique" mokume/quantification/ratio.py | head -20Repository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Look at the complete SELECT query to understand column sourcing
sed -n '363,380p' mokume/quantification/ratio.pyRepository: bigbio/mokume
Length of output: 615
🏁 Script executed:
# Check the actual WHERE clause construction more carefully - does it prefix columns?
sed -n '250,280p' mokume/io/feature.pyRepository: bigbio/mokume
Length of output: 1230
🏁 Script executed:
# Check how intensities is structured - is it an array of structs?
rg "intensities" mokume/quantification/ratio.py -B 2 -A 2 | head -40Repository: bigbio/mokume
Length of output: 965
🏁 Script executed:
# Look at how the Feature class handles the same query pattern
sed -n '140,170p' mokume/io/feature.pyRepository: bigbio/mokume
Length of output: 1279
🏁 Script executed:
# Check if there are any known issues or comments about WHERE clause scope
rg "WHERE.*unnest\|WHERE.*intensity" mokume/io/feature.py -B 2 -A 2Repository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Check the exact place where where_clause is built and used
rg -B 5 "AND {where_clause}" mokume/Repository: bigbio/mokume
Length of output: 162
🏁 Script executed:
# Get the complete build_where_clause method
rg "def build_where_clause" -A 50 mokume/io/feature.pyRepository: bigbio/mokume
Length of output: 2060
🏁 Script executed:
# Check if there's a WHERE clause being used in other similar contexts
rg "UNNEST.*unnest.*WHERE" mokume/io/feature.py -A 3 -B 3Repository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Carefully look at how Feature class uses the filter builder
rg "filter_builder.build_where_clause" mokume/io/feature.py -B 5 -A 5Repository: bigbio/mokume
Length of output: 1527
🏁 Script executed:
# Now let me trace exactly how Feature class uses this - does it use the same where_clause pattern?
sed -n '155,175p' mokume/io/feature.pyRepository: bigbio/mokume
Length of output: 797
🏁 Script executed:
# Look at the view creation in Feature to understand the difference
sed -n '175,210p' mokume/io/feature.pyRepository: bigbio/mokume
Length of output: 1375
🏁 Script executed:
# See if there's any column scope documentation in comments
rg "column.*scope\|SELECT.*alias\|WHERE.*alias" mokume/Repository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Check if load_psm_data has been used in actual tests
find . -name "*.py" -path "*/tests/*" -exec grep -l "load_psm_data" {} \;Repository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Check if there are any subquery patterns used elsewhere with where_clause
rg "FROM.*SELECT" mokume/quantification/ratio.py mokume/io/feature.py -A 5 -B 5Repository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Look for any recent commits that might show this was added recently
cd mokume && git log --oneline -20 2>/dev/null | head -20Repository: bigbio/mokume
Length of output: 107
🏁 Script executed:
# Check if the function has been changed recently
cd mokume && git log --oneline mokume/quantification/ratio.py 2>/dev/null | head -10Repository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Verify the issue more carefully - check if DuckDB actually allows bare column names after UNNEST
# by looking at similar patterns in the codebase
rg "UNNEST.*as.*unnest" mokume/ -A 10 | grep -A 10 "SELECT"Repository: bigbio/mokume
Length of output: 197
🏁 Script executed:
# Check the git history of load_psm_data function specifically
cd mokume && git log -p --all -S "def load_psm_data" -- mokume/quantification/ratio.py 2>/dev/null | head -100Repository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Check if this code path is actually exercised by looking for test files
find . -path "*/tests/*" -name "*.py" -exec grep -l "quantification\|ratio" {} \;Repository: bigbio/mokume
Length of output: 276
🏁 Script executed:
# Look at test_quantification_accuracy.py to see if it tests load_psm_data
rg "load_psm_data\|RatioQuantification" ./tests/test_quantification_accuracy.pyRepository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# One more check - see if there's documentation on expected parquet schema
rg "intensities.*label.*intensity\|UNNEST.*intensities" mokume/ -B 2 -A 2Repository: bigbio/mokume
Length of output: 39
Wrap the UNNEST in a subquery or CTE before applying the filter clause.
SQLFilterBuilder.build_where_clause() emits predicates like intensity > 0 and "unique" = 1, but these reference columns that don't exist in the WHERE clause scope. The source field unnest.intensity is only projected as a SELECT alias, and "unique" is from the base table. DuckDB evaluates WHERE before SELECT projections, so unqualified column names will fail at runtime. Either use a CTE/subquery to materialize the unnested data first, or modify the filter builder to accept and apply column prefixes.
🧰 Tools
🪛 Ruff (0.15.5)
[error] 364-373: Possible SQL injection vector through string-based query construction
(S608)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mokume/quantification/ratio.py` around lines 364 - 373, The WHERE uses
unprojected aliases so DuckDB will fail; change the query in ratio.py where
query is constructed (using charge_col, unnest_sql, where_clause) to materialize
the unnested rows first via a CTE or subquery (e.g., WITH unnested AS (SELECT
parquet_raw.*, unnest.* FROM parquet_raw, UNNEST(intensities) AS unnest) SELECT
pg_accessions, sequence, {charge_col} AS precursor_charge, {unnest_sql} FROM
unnested WHERE {where_clause}) so predicates emitted by
SQLFilterBuilder.build_where_clause() can reference properly-scoped columns (or
alternatively update SQLFilterBuilder.build_where_clause to accept and apply
explicit prefixes like "unnest." to its column names); ensure all references to
intensity and "unique" are fully qualified to the subquery/CTE aliases.
| df = df.merge( | ||
| sdrf_map, on=['run_file_name', 'label'], how='left', | ||
| ) | ||
| df['sample_accession'] = df['source_name'].fillna(df['run_file_name']) | ||
| df['Fraction'] = df['fraction'].fillna('1') |
Do not silently fall back to run_file_name on SDRF join misses.
In mokume/pipeline/stages.py:215-266, both reference_samples and sample_to_plex are derived from SDRF sample names, not run ids. If a (run_file_name, label) pair misses here, the fallback sample id no longer matches those maps and the row is silently dropped during plex/reference matching. Fail fast on unmapped pairs instead of producing a partial matrix.
♻️ Suggested fix
df = df.merge(
sdrf_map, on=['run_file_name', 'label'], how='left',
)
- df['sample_accession'] = df['source_name'].fillna(df['run_file_name'])
+ unmapped = df['source_name'].isna()
+ if unmapped.any():
+ missing = df.loc[unmapped, ['run_file_name', 'label']].drop_duplicates()
+ raise ValueError(
+ "Missing SDRF rows for run/label pairs: "
+ f"{missing.to_dict('records')[:10]}"
+ )
+ df['sample_accession'] = df['source_name']
 df['Fraction'] = df['fraction'].fillna('1')
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mokume/quantification/ratio.py` around lines 396 - 400, The code currently
falls back to run_file_name when the SDRF join misses (df = df.merge(...);
df['sample_accession'] = df['source_name'].fillna(df['run_file_name'])), which
hides unmapped (run_file_name, label) pairs and later causes silent drops;
remove the fillna fallback and instead detect rows where the merge produced null
SDRF values (e.g., df['source_name'].isna()) immediately after the merge and
raise a clear error (RuntimeError) listing the offending (run_file_name, label)
combinations so the pipeline fails fast; update any downstream expectations
(e.g., mappings built in mokume/pipeline/stages.py) to rely on the validated
df['sample_accession'] populated only from SDRF values.
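The fail-fast behavior can be sketched without pandas; `map_samples` is an illustrative name, and plain dicts stand in for the merged DataFrame:

```python
def map_samples(rows, sdrf_map):
    """Map each PSM row to its SDRF sample, raising on any unmapped pair.

    rows: list of dicts with 'run_file_name' and 'label' keys.
    sdrf_map: {(run_file_name, label): sample_accession}.
    """
    unmapped = sorted(
        {(r["run_file_name"], r["label"]) for r in rows}
        - set(sdrf_map)
    )
    if unmapped:
        # Fail fast with the offending pairs instead of silently falling
        # back to the run id and dropping rows later during plex matching.
        raise RuntimeError(f"Missing SDRF rows for run/label pairs: {unmapped[:10]}")
    return [
        {**r, "sample_accession": sdrf_map[(r["run_file_name"], r["label"])]}
        for r in rows
    ]


sdrf = {("a.raw", "TMT126"): "Sample-1"}
print(map_samples([{"run_file_name": "a.raw", "label": "TMT126"}], sdrf))
```

The equivalent pandas check is the `df['source_name'].isna()` test in the suggested fix; the point in both is that a partial matrix is worse than a loud error.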
extract f-string backslash to _safe_json for Python 3.10/3.11 compat
Actionable comments posted: 8
♻️ Duplicate comments (1)
mokume/reports/workflow_comparison.py (1)
83-88: ⚠️ Potential issue | 🟠 Major
Avoid shadowing html and write the report with explicit UTF-8.
Line 83 masks the imported html module from Line 14, and Line 87 should explicitly use UTF-8 to match the document charset.
Suggested fix
- html = _build_comparison_html(
+ report_html = _build_comparison_html(
      title, workflow_metrics, concordance, marker_results, marker_genes,
  )
- with open(output_html, "w") as f:
-     f.write(html)
+ with open(output_html, "w", encoding="utf-8") as f:
+     f.write(report_html)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mokume/reports/workflow_comparison.py` around lines 83 - 88, The local variable named html is shadowing the imported html module; rename the result of _build_comparison_html (e.g., html_content or report_html) and then write the file using an explicit UTF-8 encoding (open(output_html, "w", encoding="utf-8")). Update references to the renamed variable and keep using the _build_comparison_html call and output_html target.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@mokume/reports/qc_report.py`:
- Around line 136-137: The code currently drops any protein column with any
missing value by using .dropna(axis=1) then filling NAs, which removes partially
observed features; change the drop to only remove columns that are entirely
missing (e.g., replace .dropna(axis=1) with .dropna(axis=1, how='all') or
.dropna(axis=1, thresh=1)), so that partially observed proteins are preserved
and then imputed via data.fillna(data.median()). Make this same change in the
other callsites inside _compute_tsne, _compute_silhouette, and
_compute_variance_decomposition where you see data =
log2_matrix[samples].T.dropna(axis=1) (and the analogous blocks at lines
referenced) so all four metrics use the same missing-value policy.
- Around line 319-321: The code computes n_components = min(10, data.shape[0] -
1, data.shape[1]) and immediately constructs PCA(n_components=n_components),
which will raise when n_components <= 0 (e.g., only one sample or no proteins);
update the PVCA branch to check n_components > 0 before creating PCA (and before
calling pca.fit_transform(data)), and if n_components <= 0 return or emit a safe
PVCA summary/placeholder result (or skip PVCA) so report generation does not
abort; reference the variables n_components, PCA, pca, pc_data and the input
matrix data when adding this guard.
- Around line 524-540: The condition labels (variable cond) are written raw into
HTML in the loops that build cv_summary_rows and missing_summary_rows, which can
break markup or allow injection; update the code that builds rows for
condition_cv and missing["per_condition"] to HTML-escape cond (e.g., use
html.escape or equivalent) before concatenation, ensure the html module is
imported if not already, and use the escaped value in the f-strings for
cv_summary_rows, intra_corr lookup context (still use original cond for keys if
needed), and missing_summary_rows to safely render condition names.
- Around line 139-149: The PCA payload can be invalid for single-component
inputs because pc2 is padded but var_explained remains a single entry; update
the PCA post-processing (around n_components, PCA(), coords, and the result dict
construction) to ensure var_explained is padded to at least two entries (e.g.,
append 0 when n_components < 2) so entries "pc2" and "var_explained" always have
two elements and downstream code referencing pca.var_explained[1] won’t fail.
In `@mokume/reports/workflow_comparison.py`:
- Around line 456-468: The template assignment to tr.innerHTML injects unescaped
dynamic values (e.g., row.name, row.n_proteins, row.pi1, row.effect_size,
row.direction_balance and any marker/comparison strings) creating an XSS risk;
replace this innerHTML string construction with safe DOM creation or
text-escaping: build the <td> elements using document.createElement /
textContent (or a sanitizer/escape function) and set className from
cellClass(...) but assign cell.textContent for values (handling NaN formatting
before setting text) so no untrusted HTML is inserted; update all similar blocks
(the ones at the top of this diff and the other occurrences noted around lines
527-544 and 611-636) to follow the same pattern.
- Around line 176-177: The CAT curve disappears when max_k < 10 because ks is
constructed with range(10, max_k+1, 5) which yields an empty list; update the ks
construction to always include valid k values by checking max_k and falling back
to a smaller-step or start value: for example, if max_k < 10 set ks =
list(range(1, max_k + 1)) (or use range(1, max_k+1, max(1, 5//max_k))) otherwise
keep ks = list(range(10, max_k + 1, 5)); modify the code around max_k, sorted_a,
sorted_b and ks to implement this conditional so small result sets still produce
a CAT curve.
- Around line 516-517: The x/y axis title construction in workflow_comparison.py
directly indexes pca.var_explained[0] and pca.var_explained[1], which throws
when PCA produced fewer than two components; update the code that builds the
axis titles (the expressions using pca.var_explained) to guard access—check
length or existence before reading index 1 and use a safe fallback (e.g., omit
the percent or use '0.0%') for PC2 when pca.var_explained has only one value;
ensure the same guarded logic is applied for both xaxis and yaxis title
formulations to avoid report rendering errors.
In `@tests/test_quantification_accuracy.py`:
- Around line 351-354: Several print statements use an unnecessary f-string
prefix for plain literal messages; remove the leading f from those print calls
that have no {placeholders} (e.g., the plain label print " [Small DIA-NN Subset]
MaxLFQ results:" and the equivalent plain-label prints elsewhere) while leaving
f-strings that do include expressions (like print(... {len(result):,} ...) and
the intensity range print) unchanged; update the print calls that match the
literal label strings to regular string literals (remove the f) so Ruff F541 is
resolved.
---
Duplicate comments:
In `@mokume/reports/workflow_comparison.py`:
- Around line 83-88: The local variable named html is shadowing the imported
html module; rename the result of _build_comparison_html (e.g., html_content or
report_html) and then write the file using an explicit UTF-8 encoding
(open(output_html, "w", encoding="utf-8")). Update references to the renamed
variable and keep using the _build_comparison_html call and output_html target.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2a1223db-e1a1-4423-b453-0ee5bb356996
📒 Files selected for processing (4)
.github/workflows/python-app.yml
mokume/reports/qc_report.py
mokume/reports/workflow_comparison.py
tests/test_quantification_accuracy.py
| data = log2_matrix[samples].T.dropna(axis=1) | ||
| data = data.fillna(data.median()) |
Preserve partially observed proteins before imputation.
Lines 136-137, 164-165, 277-278, and 312-313 drop any protein with a single missing sample, so the following fillna(data.median()) never gets to impute anything. On proteomics matrices, that can collapse PCA/t-SNE/silhouette/PVCA onto a tiny complete-case subset, or leave no features at all.
Suggested change
- data = log2_matrix[samples].T.dropna(axis=1)
+ data = log2_matrix[samples].T.dropna(axis=1, how="all")
 data = data.fillna(data.median())

Apply the same change in _compute_tsne, _compute_silhouette, and _compute_variance_decomposition so all four metrics use the same missing-value policy.
Also applies to: 164-165, 277-278, 312-313
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mokume/reports/qc_report.py` around lines 136 - 137, The code currently drops
any protein column with any missing value by using .dropna(axis=1) then filling
NAs, which removes partially observed features; change the drop to only remove
columns that are entirely missing (e.g., replace .dropna(axis=1) with
.dropna(axis=1, how='all') or .dropna(axis=1, thresh=1)), so that partially
observed proteins are preserved and then imputed via data.fillna(data.median()).
Make this same change in the other callsites inside _compute_tsne,
_compute_silhouette, and _compute_variance_decomposition where you see data =
log2_matrix[samples].T.dropna(axis=1) (and the analogous blocks at lines
referenced) so all four metrics use the same missing-value policy.
```python
n_components = min(3, data.shape[0], data.shape[1])
pca = PCA(n_components=n_components)
coords = pca.fit_transform(data)

result = {
    "samples": list(data.index),
    "conditions": [sample_to_condition.get(s, "Unknown") for s in data.index],
    "plexes": [sample_to_plex.get(s, "plex1") for s in data.index],
    "pc1": coords[:, 0].tolist(),
    "pc2": coords[:, 1].tolist() if n_components >= 2 else [0] * len(data),
    "var_explained": (pca.explained_variance_ratio_ * 100).tolist(),
```
Return a stable 2D PCA payload for small inputs.
When only one principal component is available, this function pads pc2 but leaves var_explained with a single entry. Line 808 later does pca.var_explained[1].toFixed(1), which throws and aborts the rest of the report script.
Suggested change
```diff
  n_components = min(3, data.shape[0], data.shape[1])
+ if n_components == 0:
+     return {
+         "samples": list(data.index),
+         "conditions": [sample_to_condition.get(s, "Unknown") for s in data.index],
+         "plexes": [sample_to_plex.get(s, "plex1") for s in data.index],
+         "pc1": [0.0] * len(data),
+         "pc2": [0.0] * len(data),
+         "var_explained": [0.0, 0.0],
+     }
  pca = PCA(n_components=n_components)
  coords = pca.fit_transform(data)
+ var_explained = (pca.explained_variance_ratio_ * 100).tolist()
+ if len(var_explained) < 2:
+     var_explained.extend([0.0] * (2 - len(var_explained)))
  result = {
      "samples": list(data.index),
      "conditions": [sample_to_condition.get(s, "Unknown") for s in data.index],
      "plexes": [sample_to_plex.get(s, "plex1") for s in data.index],
      "pc1": coords[:, 0].tolist(),
      "pc2": coords[:, 1].tolist() if n_components >= 2 else [0] * len(data),
-     "var_explained": (pca.explained_variance_ratio_ * 100).tolist(),
+     "var_explained": var_explained,
  }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mokume/reports/qc_report.py` around lines 139 - 149, The PCA payload can be
invalid for single-component inputs because pc2 is padded but var_explained
remains a single entry; update the PCA post-processing (around n_components,
PCA(), coords, and the result dict construction) to ensure var_explained is
padded to at least two entries (e.g., append 0 when n_components < 2) so entries
"pc2" and "var_explained" always have two elements and downstream code
referencing pca.var_explained[1] won’t fail.
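The padding logic can be isolated in a small helper. This is a sketch, not the module's actual code: `pad_pca_payload` is a hypothetical name, and the inputs stand in for `pca.fit_transform` output and `explained_variance_ratio_`:

```python
import numpy as np

def pad_pca_payload(coords: np.ndarray, var_ratio: np.ndarray, n_rows: int) -> dict:
    """Return a PCA payload that always carries two PCs and two variance entries."""
    n_components = coords.shape[1] if coords.size else 0
    var_explained = (var_ratio * 100).tolist()
    if len(var_explained) < 2:
        # Pad so downstream JS reading var_explained[1] never throws.
        var_explained.extend([0.0] * (2 - len(var_explained)))
    pc1 = coords[:, 0].tolist() if n_components >= 1 else [0.0] * n_rows
    pc2 = coords[:, 1].tolist() if n_components >= 2 else [0.0] * n_rows
    return {"pc1": pc1, "pc2": pc2, "var_explained": var_explained}

# One-component case: pc2 and var_explained[1] are still present.
payload = pad_pca_payload(np.array([[1.5], [-1.5]]), np.array([1.0]), n_rows=2)
print(payload["var_explained"])  # [100.0, 0.0]
print(payload["pc2"])            # [0.0, 0.0]
```

Padding on the Python side keeps the report's JavaScript free of defensive length checks.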
```python
n_components = min(10, data.shape[0] - 1, data.shape[1])
pca = PCA(n_components=n_components)
pc_data = pca.fit_transform(data)
```
Guard PVCA before constructing PCA(n_components=0).
Line 319 evaluates to 0 whenever only one sample survives preprocessing or all proteins are filtered out. PCA(n_components=0) raises here, so a small/empty matrix can abort report generation instead of returning a safe PVCA summary.
Suggested change
```diff
  n_components = min(10, data.shape[0] - 1, data.shape[1])
+ if n_components <= 0:
+     return {
+         "condition_pct": 0.0,
+         "plex_pct": 0.0,
+         "residual_pct": 100.0,
+         "n_pcs_used": 0,
+         "total_var_captured": 0.0,
+     }
  pca = PCA(n_components=n_components)
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
n_components = min(10, data.shape[0] - 1, data.shape[1])
if n_components <= 0:
    return {
        "condition_pct": 0.0,
        "plex_pct": 0.0,
        "residual_pct": 100.0,
        "n_pcs_used": 0,
        "total_var_captured": 0.0,
    }
pca = PCA(n_components=n_components)
pc_data = pca.fit_transform(data)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mokume/reports/qc_report.py` around lines 319 - 321, The code computes
n_components = min(10, data.shape[0] - 1, data.shape[1]) and immediately
constructs PCA(n_components=n_components), which will raise when n_components <=
0 (e.g., only one sample or no proteins); update the PVCA branch to check
n_components > 0 before creating PCA (and before calling
pca.fit_transform(data)), and if n_components <= 0 return or emit a safe PVCA
summary/placeholder result (or skip PVCA) so report generation does not abort;
reference the variables n_components, PCA, pca, pc_data and the input matrix
data when adding this guard.
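The guard condition itself can be sketched independently of the PCA fit. `pvca_or_placeholder` is a hypothetical helper, not the module's real function; it only shows how the component count and the placeholder branch interact:

```python
def pvca_or_placeholder(n_samples: int, n_features: int, cap: int = 10) -> dict:
    """Sketch of the guard: skip PCA entirely when no component is available."""
    n_components = min(cap, n_samples - 1, n_features)
    if n_components <= 0:
        # Safe summary so report generation continues instead of raising.
        return {
            "condition_pct": 0.0,
            "plex_pct": 0.0,
            "residual_pct": 100.0,
            "n_pcs_used": 0,
            "total_var_captured": 0.0,
        }
    # The real implementation would fit PCA(n_components=n_components) here.
    return {"n_pcs_used": n_components}

print(pvca_or_placeholder(1, 500))   # placeholder: one sample, zero components
print(pvca_or_placeholder(12, 500))  # {'n_pcs_used': 10}
```

Note that `data.shape[0] - 1` makes a single surviving sample yield zero components, which is exactly the case `PCA(n_components=0)` rejects.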
```python
for cond, cv_data in condition_cv.items():
    cv_summary_rows += (
        f"<tr><td>{cond}</td>"
        f"<td>{cv_data['median_cv']:.3f}</td>"
        f"<td>{cv_data['q25']:.3f} - {cv_data['q75']:.3f}</td>"
        f"<td>{intra_corr.get(cond, float('nan')):.3f}</td></tr>"
    )

# Format missing value summary
missing_summary_rows = ""
for cond, m in missing["per_condition"].items():
    total = m["complete"] + m["partially_missing"] + m["completely_missing"]
    missing_summary_rows += (
        f"<tr><td>{cond}</td>"
        f"<td>{m['complete']}</td>"
        f"<td>{m['partially_missing']}</td>"
        f"<td>{m['completely_missing']}</td>"
```
Escape condition labels before writing HTML rows.
Lines 526 and 537 insert cond directly into the page. A condition name containing <, &, or quotes will corrupt the markup, and crafted metadata can inject HTML/JS into the generated report.
Suggested change
```diff
  cv_summary_rows = ""
  for cond, cv_data in condition_cv.items():
+     cond_html = html_module.escape(cond)
      cv_summary_rows += (
-         f"<tr><td>{cond}</td>"
+         f"<tr><td>{cond_html}</td>"
          f"<td>{cv_data['median_cv']:.3f}</td>"
          f"<td>{cv_data['q25']:.3f} - {cv_data['q75']:.3f}</td>"
          f"<td>{intra_corr.get(cond, float('nan')):.3f}</td></tr>"
      )
  # Format missing value summary
  missing_summary_rows = ""
  for cond, m in missing["per_condition"].items():
+     cond_html = html_module.escape(cond)
      total = m["complete"] + m["partially_missing"] + m["completely_missing"]
      missing_summary_rows += (
-         f"<tr><td>{cond}</td>"
+         f"<tr><td>{cond_html}</td>"
          f"<td>{m['complete']}</td>"
          f"<td>{m['partially_missing']}</td>"
          f"<td>{m['completely_missing']}</td>"
          f"<td>{total}</td></tr>"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mokume/reports/qc_report.py` around lines 524 - 540, The condition labels
(variable cond) are written raw into HTML in the loops that build
cv_summary_rows and missing_summary_rows, which can break markup or allow
injection; update the code that builds rows for condition_cv and
missing["per_condition"] to HTML-escape cond (e.g., use html.escape or
equivalent) before concatenation, ensure the html module is imported if not
already, and use the escaped value in the f-strings for cv_summary_rows,
intra_corr lookup context (still use original cond for keys if needed), and
missing_summary_rows to safely render condition names.
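A minimal sketch of the escaping step, using the standard library's `html.escape` and a hypothetical condition label chosen to contain the problem characters:

```python
import html

# Hypothetical CV summary with a condition name containing <, &, and quotes.
condition_cv = {"A<B & 'ctrl'": {"median_cv": 0.12, "q25": 0.08, "q75": 0.20}}

rows = ""
for cond, cv_data in condition_cv.items():
    cond_html = html.escape(cond)  # escapes &, <, >, and both quote styles
    rows += (
        f"<tr><td>{cond_html}</td>"
        f"<td>{cv_data['median_cv']:.3f}</td>"
        f"<td>{cv_data['q25']:.3f} - {cv_data['q75']:.3f}</td></tr>"
    )

print(rows)
# <tr><td>A&lt;B &amp; &#x27;ctrl&#x27;</td><td>0.120</td><td>0.080 - 0.200</td></tr>
```

Note that the unescaped `cond` should still be used for dictionary lookups (such as `intra_corr.get(cond, ...)`); only the rendered text changes.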
```python
max_k = min(max_k, len(sorted_a), len(sorted_b))
ks = list(range(10, max_k + 1, 5))
```
CAT curve silently disappears for small result sets (max_k < 10).
For small DE tables, ks becomes empty and the CAT section shows no curve even though data exists.
Suggested fix
```diff
- ks = list(range(10, max_k + 1, 5))
+ start_k = 1 if max_k < 10 else 10
+ step = 1 if max_k < 10 else 5
+ ks = list(range(start_k, max_k + 1, step))
```
📝 Committable suggestion
```python
max_k = min(max_k, len(sorted_a), len(sorted_b))
start_k = 1 if max_k < 10 else 10
step = 1 if max_k < 10 else 5
ks = list(range(start_k, max_k + 1, step))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mokume/reports/workflow_comparison.py` around lines 176 - 177, The CAT curve
disappears when max_k < 10 because ks is constructed with range(10, max_k+1, 5)
which yields an empty list; update the ks construction to always include valid k
values by checking max_k and falling back to a smaller-step or start value: for
example, if max_k < 10 set ks = list(range(1, max_k + 1)) (or use range(1,
max_k+1, max(1, 5//max_k))) otherwise keep ks = list(range(10, max_k + 1, 5));
modify the code around max_k, sorted_a, sorted_b and ks to implement this
conditional so small result sets still produce a CAT curve.
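The empty-range failure and the fallback are easy to verify in isolation; `cat_ks` below is a hypothetical helper name wrapping the suggested logic:

```python
def cat_ks(max_k: int) -> list:
    """k values for a CAT curve; fall back to step 1 for small result sets."""
    if max_k < 10:
        return list(range(1, max_k + 1))
    return list(range(10, max_k + 1, 5))

# Original construction: empty for max_k < 10, so no curve is drawn.
print(list(range(10, 7 + 1, 5)))  # []

# With the fallback, small DE tables still get a curve.
print(cat_ks(7))   # [1, 2, 3, 4, 5, 6, 7]
print(cat_ks(25))  # [10, 15, 20, 25]
```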
```javascript
tr.innerHTML = `
    <td style="text-align:left;font-weight:bold">${{row.name}}</td>
    <td>${{row.n_proteins}}</td>
    <td class="${{cellClass('silhouette', row.silhouette)}}">${{row.silhouette.toFixed(3)}}</td>
    <td class="${{cellClass('var_condition', row.var_condition)}}">${{row.var_condition.toFixed(1)}}%</td>
    <td class="${{cellClass('var_plex', row.var_plex)}}">${{row.var_plex.toFixed(1)}}%</td>
    <td class="${{cellClass('median_cv', row.median_cv)}}">${{row.median_cv.toFixed(3)}}</td>
    <td class="${{cellClass('replicate_corr', row.replicate_corr)}}">${{row.replicate_corr.toFixed(3)}}</td>
    <td>${{row.n_de}} (${{row.n_up}}/${{row.n_down}})</td>
    <td class="${{cellClass('pi1', row.pi1)}}">${{isNaN(row.pi1) ? 'N/A' : row.pi1.toFixed(2)}}</td>
    <td class="${{cellClass('effect_size', row.effect_size)}}">${{row.effect_size.toFixed(2)}}</td>
    <td>${{isNaN(row.direction_balance) ? 'N/A' : row.direction_balance.toFixed(2)}}</td>
`;
```
Unescaped dynamic values are injected into HTML (innerHTML) — XSS risk.
Workflow names, comparison keys, marker names/accessions/significance are rendered as raw HTML. If any input contains markup/script, the report becomes executable content.
Suggested hardening
```diff
@@
  (function() {{
+   function escapeHtml(v) {{
+     return String(v)
+       .replaceAll('&', '&amp;')
+       .replaceAll('<', '&lt;')
+       .replaceAll('>', '&gt;')
+       .replaceAll('"', '&quot;')
+       .replaceAll("'", '&#39;');
+   }}
@@
-   <td style="text-align:left;font-weight:bold">${{row.name}}</td>
+   <td style="text-align:left;font-weight:bold">${{escapeHtml(row.name)}}</td>
@@
-   <td style="text-align:left">${{key}}</td>
+   <td style="text-align:left">${{escapeHtml(key)}}</td>
```
And in `_build_marker_table(marker_results, marker_genes, workflow_metrics)`:
```diff
@@
-     for wf in wf_names:
-         header += f"<th colspan='2'>{wf}</th>"
+     for wf in wf_names:
+         header += f"<th colspan='2'>{html.escape(str(wf))}</th>"
@@
-     row = f"<tr><td><b>{info['name']}</b></td><td>{acc}</td>"
-     row += f"<td class='{'good' if info['expected'] == 'UP' else 'bad'}'>{info['expected']}</td>"
+     safe_gene = html.escape(str(info["name"]))
+     safe_acc = html.escape(str(acc))
+     safe_expected = html.escape(str(info["expected"]))
+     row = f"<tr><td><b>{safe_gene}</b></td><td>{safe_acc}</td>"
+     row += f"<td class='{'good' if info['expected'] == 'UP' else 'bad'}'>{safe_expected}</td>"
@@
-     sig = r["significance"]
+     sig = html.escape(str(r["significance"]))
```
Also applies to: 527-544, 611-636
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mokume/reports/workflow_comparison.py` around lines 456 - 468, The template
assignment to tr.innerHTML injects unescaped dynamic values (e.g., row.name,
row.n_proteins, row.pi1, row.effect_size, row.direction_balance and any
marker/comparison strings) creating an XSS risk; replace this innerHTML string
construction with safe DOM creation or text-escaping: build the <td> elements
using document.createElement / textContent (or a sanitizer/escape function) and
set className from cellClass(...) but assign cell.textContent for values
(handling NaN formatting before setting text) so no untrusted HTML is inserted;
update all similar blocks (the ones at the top of this diff and the other
occurrences noted around lines 527-544 and 611-636) to follow the same pattern.
```javascript
xaxis: {{ title: 'PC1 (' + pca.var_explained[0].toFixed(1) + '%)' }},
yaxis: {{ title: 'PC2 (' + pca.var_explained[1].toFixed(1) + '%)' }},
```
Guard PCA variance labels when fewer than 2 components are available.
Line 517 assumes pca.var_explained[1] exists; with 1-component PCA this throws and breaks report rendering.
Suggested fix
```diff
- Plotly.newPlot(`pca-${{wfIdx}}`, traces, {{
+ const pc1Var = (pca.var_explained && pca.var_explained.length > 0) ? pca.var_explained[0] : 0;
+ const pc2Var = (pca.var_explained && pca.var_explained.length > 1) ? pca.var_explained[1] : 0;
+ Plotly.newPlot(`pca-${{wfIdx}}`, traces, {{
    title: {{ text: wfName, font: {{ size: 14 }} }},
-   xaxis: {{ title: 'PC1 (' + pca.var_explained[0].toFixed(1) + '%)' }},
-   yaxis: {{ title: 'PC2 (' + pca.var_explained[1].toFixed(1) + '%)' }},
+   xaxis: {{ title: 'PC1 (' + pc1Var.toFixed(1) + '%)' }},
+   yaxis: {{ title: 'PC2 (' + pc2Var.toFixed(1) + '%)' }},
    legend: {{ font: {{ size: 9 }}, orientation: 'h', y: -0.15 }},
    margin: {{ t: 50, b: 80, l: 60, r: 20 }},
  }}, {{responsive: true}});
```
Verify each finding against the current code and only fix it if needed.
In `@mokume/reports/workflow_comparison.py` around lines 516 - 517, The x/y axis
title construction in workflow_comparison.py directly indexes
pca.var_explained[0] and pca.var_explained[1], which throws when PCA produced
fewer than two components; update the code that builds the axis titles (the
expressions using pca.var_explained) to guard access—check length or existence
before reading index 1 and use a safe fallback (e.g., omit the percent or use
'0.0%') for PC2 when pca.var_explained has only one value; ensure the same
guarded logic is applied for both xaxis and yaxis title formulations to avoid
report rendering errors.
```python
print(f"\n[Small DIA-NN Subset] MaxLFQ results:")
print(f"  Protein-sample combinations: {len(result):,}")
print(f"  Intensity range: {result['MaxLFQIntensity'].min():.2f} - {result['MaxLFQIntensity'].max():.2f}")
print(f"  Intensity range: {result['Intensity'].min():.2f} - {result['Intensity'].max():.2f}")
```
Remove extraneous f prefix from f-strings without placeholders.
Static analysis (Ruff F541) correctly identifies that line 351 has an f-string without any placeholders. The same issue exists at lines 486 and 540.
Proposed fix
```diff
- print(f"\n[Small DIA-NN Subset] MaxLFQ results:")
+ print("\n[Small DIA-NN Subset] MaxLFQ results:")
```
Line 486:
```diff
- print(f"\n[Small DIA-NN Subset] Built-in MaxLFQ fallback:")
+ print("\n[Small DIA-NN Subset] Built-in MaxLFQ fallback:")
```
Line 540:
```diff
- print(f"\n[PRIDE PXD063291] MaxLFQ results:")
+ print("\n[PRIDE PXD063291] MaxLFQ results:")
```
📝 Committable suggestion
```python
print("\n[Small DIA-NN Subset] MaxLFQ results:")
print(f"  Protein-sample combinations: {len(result):,}")
print(f"  Intensity range: {result['Intensity'].min():.2f} - {result['Intensity'].max():.2f}")
```
🧰 Tools
🪛 Ruff (0.15.5)
[error] 351-351: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_quantification_accuracy.py` around lines 351 - 354, Several print
statements use an unnecessary f-string prefix for plain literal messages; remove
the leading f from those print calls that have no {placeholders} (e.g., the
plain label print " [Small DIA-NN Subset] MaxLFQ results:" and the equivalent
plain-label prints elsewhere) while leaving f-strings that do include
expressions (like print(... {len(result):,} ...) and the intensity range print)
unchanged; update the print calls that match the literal label strings to
regular string literals (remove the f) so Ruff F541 is resolved.