📝 Walkthrough

Refactors the quantification pipeline into staged components (Loading, Normalization, Quantification, Postprocessing); adds IRS and TMM normalizers, Ratio quantification, DifferentialExpression (ttest/limma), a DuckDB-backed Feature API with SQLFilterBuilder, unified Intensity outputs, interactive/QC/workflow reports, a full PXD007683 benchmarking suite, and extensive docs and CI for docs deployment.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Pipeline as QuantPipeline
    participant Loader as LoadingStage
    participant Normalizer as NormalizationStage
    participant Quantifier as QuantificationStage
    participant Postproc as PostprocessingStage
    User->>Pipeline: features_to_proteins(config)
    activate Pipeline
    Pipeline->>Loader: load_for_mokume()
    activate Loader
    Loader->>Loader: read parquet + SDRF, apply SQL filters
    Loader-->>Pipeline: peptide_df
    deactivate Loader
    Pipeline->>Normalizer: apply_hierarchical/apply_tmm/apply_irs()
    activate Normalizer
    Normalizer->>Normalizer: compute normalization factors / IRS scaling
    Normalizer-->>Pipeline: normalized_peptides
    deactivate Normalizer
    Pipeline->>Quantifier: quantify(normalized_peptides)
    activate Quantifier
    Quantifier->>Quantifier: run quant method (ibaq/maxlfq/ratio/...)
    Quantifier-->>Pipeline: protein_df (wide, Intensity)
    deactivate Quantifier
    Pipeline->>Postproc: apply_batch_correction/run_differential_expression/generate_reports()
    activate Postproc
    Postproc->>Postproc: ComBat / DE (ttest/limma) / plotting / HTML reports
    Postproc-->>Pipeline: corrected_df, de_results, reports
    deactivate Postproc
    Pipeline-->>User: protein_df + artifacts (plots, reports)
    deactivate Pipeline
```
```mermaid
sequenceDiagram
    participant PSM as PSM_Data
    participant RatioQ as RatioQuantification
    participant Frac as FractionAgg
    participant Ref as ReferenceComp
    participant RatioCalc as RatioCalc
    participant Agg as Aggregation
    PSM->>RatioQ: psm_long_df
    activate RatioQ
    RatioQ->>Frac: _average_fractions()
    activate Frac
    Frac-->>RatioQ: averaged_psm_df
    deactivate Frac
    RatioQ->>Ref: _compute_reference_intensity()
    activate Ref
    Ref-->>RatioQ: reference_df
    deactivate Ref
    RatioQ->>RatioCalc: _compute_log2_ratios(averaged, reference)
    activate RatioCalc
    RatioCalc-->>RatioQ: ratio_df
    deactivate RatioCalc
    RatioQ->>Agg: aggregate PSM->peptide->protein
    activate Agg
    Agg-->>RatioQ: protein_df (log2 ratios)
    deactivate Agg
    RatioQ-->>PSM: pivot to wide (protein x sample)
    deactivate RatioQ
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Poem
✨ Finishing Touches
Actionable comments posted: 5
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
mokume/quantification/directlfq.py (1)
298-300: ⚠️ Potential issue | 🟡 Minor

Original traceback is lost when re-raising as `RuntimeError`. Use `raise ... from e` to preserve the exception chain, which aids debugging.

Proposed fix

```diff
 except Exception as e:
     logger.error(f"DirectLFQ failed: {e}")
-    raise RuntimeError(f"DirectLFQ quantification failed: {e}")
+    raise RuntimeError(f"DirectLFQ quantification failed: {e}") from e
```

docs/concepts/preprocessing.md (1)
145-146: ⚠️ Potential issue | 🟡 Minor

Missing trailing newline. Line 146 is not marked as a new line (`~`), yet this is a new file. This may indicate a missing final newline character.

mokume/preprocessing/filters/peptide.py (1)
269-288: ⚠️ Potential issue | 🟠 Major

Missed cleavage count is wrong when K/R is at the last position of the truncated sequence. After removing the C-terminal residue (line 275), K/R at the end of the truncated string is silently skipped (lines 285–287, `pass`). However, this K/R is an internal cleavage site — the character that followed it in the original sequence may not be P. The code should check the original next character instead of unconditionally skipping.

🐛 Proposed fix

```diff
 def count_missed_cleavages(seq):
     if pd.isna(seq):
         return 0
-    seq_str = str(seq).upper()
+    seq_full = str(seq).upper()
     # Remove last character (C-terminal cleavage site doesn't count)
-    if len(seq_str) > 1:
-        seq_str = seq_str[:-1]
+    if len(seq_full) <= 1:
+        return 0
+    seq_str = seq_full[:-1]
     if self.enzyme == "trypsin":
         # Count K and R not followed by P
         count = 0
         for i, aa in enumerate(seq_str):
             if aa in ["K", "R"]:
                 # Check if followed by P
-                if i + 1 < len(seq_str) and seq_str[i + 1] != "P":
-                    count += 1
-                elif i + 1 >= len(seq_str):
-                    # Last position doesn't count as missed
-                    pass
+                next_aa = seq_str[i + 1] if i + 1 < len(seq_str) else seq_full[-1]
+                if next_aa != "P":
+                    count += 1
     return count
```

mokume/commands/features2proteins.py (1)
472-516: ⚠️ Potential issue | 🟡 Minor

Add `directlfq_min_nonan` to the QuantificationConfig initialization. The `--directlfq-min-nonan` CLI option (line 132) is accepted but never forwarded to `run_pipeline`. The parameter is received by the command function (line 330) but not passed to `QuantificationConfig` during initialization, causing the user-supplied value to be silently dropped in favor of the default. Since the function signature is being expanded, address this gap by including `directlfq_min_nonan=directlfq_min_nonan` in the QuantificationConfig constructor call.

docs/reference/cli.md (1)
215-227: ⚠️ Potential issue | 🟡 Minor

The `tsne-visualization` options table is truncated. The file ends abruptly at line 227. The last table row for `-o/--pattern` is missing its description, and likely additional options (e.g., output path, perplexity, etc.) are missing entirely.
🤖 Fix all issues with AI agents
In `@benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/06_generate_report.py`:
- Around line 176-181: The f-string in the loop over `yeast_df.iterrows` (inside the `lines.append` call) applies the numeric format specifier `:.2f` to `row.get('compression_ratio', 'N/A')`, which raises `ValueError` when the fallback `'N/A'` is used. Fix by extracting `compression_ratio` into a variable (e.g., `comp = row.get('compression_ratio')`), then build a `comp_str` that is `"N/A"` when `comp` is None or non-numeric (or `pandas.isna(comp)`), otherwise format it with `"{:.2f}"`, and use `comp_str` in the `lines.append` f-string instead of applying `:.2f` directly to the `get(...)` expression.
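A minimal sketch of the guarded formatting described in that fix (the helper name `format_ratio` is hypothetical, not part of the benchmark script):

```python
import math


def format_ratio(value) -> str:
    """Format a possibly-missing compression ratio for a report line.

    Returns "N/A" for None, non-numeric, or NaN values instead of
    letting a bare f"{value:.2f}" raise ValueError/TypeError.
    """
    if value is None:
        return "N/A"
    try:
        value = float(value)
    except (TypeError, ValueError):
        return "N/A"
    if math.isnan(value):
        return "N/A"
    return f"{value:.2f}"
```

The report line would then interpolate `format_ratio(row.get('compression_ratio'))` instead of applying `:.2f` directly.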
In `@mokume/io/feature.py`:
- Around line 515-527: The query in get_irs_scaling_factors references a mixture
column that doesn't exist in the DuckDB views; update the view definitions used
to build self.parquet_db so that both the base parquet_db view and the
SDRF-enriched view include a deterministic mixture column (e.g., derive or
coalesce the appropriate SDRF/sample field into an alias named mixture) so the
SELECT and GROUP BY in the irs_df SQL succeed; ensure the column name is exactly
mixture to match the SELECT/GROUP BY in get_irs_scaling_factors.
In `@mokume/preprocessing/filters/peptide.py`:
- Around line 414-416: If self._compiled_patterns is empty, combined_pattern
becomes "" so df[col].str.contains("") matches everything and removes all rows;
fix by short-circuiting: if not self._compiled_patterns then set mask to a
Series of True for all rows (i.e., nothing excluded) or use a regex that matches
nothing (e.g., "(?!)") before calling
df[col].fillna("").astype(str).str.contains; update the logic around
combined_pattern, self._compiled_patterns and mask to handle the empty-list case
explicitly.
- Around line 199-204: When self.exclude_modifications is empty the code builds
patterns="" so str.contains("") matches every row and removes all peptides; fix
the block that builds patterns and mask in the peptide filter (referencing
self.exclude_modifications, patterns, mask, self.sequence_column, and df) by
adding a guard: if not self.exclude_modifications then set mask to a boolean
Series of True for df.index (i.e., keep all rows), otherwise build patterns =
"|".join(...) and compute mask =
~df[self.sequence_column].fillna("").astype(str).str.contains(patterns,
case=False, regex=True) as before.
In `@mokume/preprocessing/filters/protein.py`:
- Around line 63-66: The current vectorized contaminant filter builds
pattern_regex from self.patterns and then uses
upper_col.str.contains(pattern_regex), which with an empty self.patterns yields
an empty string and matches every row; add a guard: if not self.patterns (or
len(self.patterns) == 0) skip the contains call and set mask to a boolean Series
of True for all df rows (e.g., mask = pd.Series(True, index=df.index)) so no
rows are removed, otherwise build pattern_regex from self.patterns and apply
mask = ~upper_col.str.contains(pattern_regex, regex=True); reference the
variables/members self.patterns, self.protein_column, upper_col, pattern_regex
and mask to locate where to change.
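The three empty-pattern guards above share one shape: an empty pattern list must exclude nothing, because `"|".join([])` yields `""` and an empty regex matches every row. A pandas-free sketch of the idiom (names are illustrative, not the module's API):

```python
import re


def build_exclusion_regex(patterns):
    """Combine exclusion patterns into one compiled regex.

    "(?!)" is a negative lookahead that can never succeed, so an empty
    pattern list matches nothing and no rows are excluded.
    """
    if not patterns:
        return re.compile(r"(?!)")
    return re.compile("|".join(patterns), re.IGNORECASE)


def keep_rows(values, patterns):
    """Return the values NOT matched by any exclusion pattern."""
    rx = build_exclusion_regex(patterns)
    return [v for v in values if not rx.search(v)]
```

The same `"(?!)"` sentinel (or an all-True mask) works for the pandas `str.contains` variants in the comments above.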
🟠 Major comments (19)
mokume/quantification/ratio.py-323-346 (1)
323-346: ⚠️ Potential issue | 🟠 Major

Use parameterized queries to prevent SQL injection via the parquet file path. Line 327 interpolates `parquet_path` directly into the SQL string via f-string. If the path originates from user-controlled configuration and contains a crafted quote or SQL syntax, it can break out of the string literal and execute arbitrary DuckDB commands (e.g., exfiltrating data via `COPY ... TO`). Replace string interpolation with DuckDB's native parameterized query support:

Fix: Use parameterized query with placeholder

```diff
 conn.execute(
-    f"CREATE VIEW parquet_raw AS SELECT * FROM parquet_scan('{parquet_path}')"
+    "CREATE VIEW parquet_raw AS SELECT * FROM parquet_scan(?)",
+    [parquet_path],
 )
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/00_download_data.py-29-45 (1)
29-45: ⚠️ Potential issue | 🟠 Major

Partial downloads are not cleaned up on failure, causing silent corruption on retry. If the download fails mid-transfer, `urlretrieve` leaves a partial file on disk. The next run sees `dest.exists()` on line 22 and skips the download, silently using the corrupt file.

Proposed fix: remove partial file on error

```diff
 except Exception as e:
+    if dest.exists():
+        dest.unlink()
     print(f"\n    ERROR: {e}")
     return False
```

mokume/analysis/differential_expression.py-100-104 (1)
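The cleanup-on-failure pattern can be exercised standalone; a sketch assuming the same `dest.exists()` caching behavior as the script:

```python
import urllib.request
from pathlib import Path


def download(url: str, dest: Path) -> bool:
    """Download url to dest; remove any partial file if the transfer fails.

    Without the unlink, a partial file would satisfy dest.exists() on the
    next run and be silently treated as a complete download.
    """
    if dest.exists():
        return True
    try:
        urllib.request.urlretrieve(url, dest)
        return True
    except Exception:
        if dest.exists():
            dest.unlink()
        return False
```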
100-104: ⚠️ Potential issue | 🟠 Major

Zeros replaced with NaN even when `skip_log2=True`. Line 103 unconditionally replaces zeros with NaN. When `skip_log2=True` (data already log2-transformed), zero is a valid value (representing `log2(1) = 0`). This replacement would silently drop proteins with a true fold-change of 1× from the analysis.

Proposed fix: only replace zeros before log-transform

```diff
-intensity_matrix = intensity_matrix.replace(0, np.nan)
-log2_matrix = intensity_matrix if self.skip_log2 else np.log2(intensity_matrix)
+if self.skip_log2:
+    log2_matrix = intensity_matrix
+else:
+    intensity_matrix = intensity_matrix.replace(0, np.nan)
+    log2_matrix = np.log2(intensity_matrix)
```

mokume/analysis/differential_expression.py-76-78 (1)
76-78: ⚠️ Potential issue | 🟠 Major

Fragile assumption: first column is the protein identifier. `protein_col = protein_df.columns[0]` relies on column ordering rather than an explicit parameter. If a user passes a DataFrame with columns reordered (e.g., after a merge or join), this silently picks the wrong column, corrupting all downstream results. Consider accepting `protein_column` as a parameter (defaulting to `"ProteinName"`) for robustness, consistent with other methods in the codebase (e.g., `MaxLFQQuantification.quantify`).

Proposed fix

```diff
 def run(
     self,
     protein_df: pd.DataFrame,
     sample_to_condition: dict[str, str],
     contrast: tuple[str, str],
+    protein_column: str = "ProteinName",
 ) -> pd.DataFrame:
```

Then replace:

```diff
-        protein_col = protein_df.columns[0]
-        sample_cols = [c for c in protein_df.columns if c != protein_col]
+        protein_col = protein_column
+        sample_cols = [c for c in protein_df.columns if c != protein_col]
```

And propagate the parameter to `run_comparisons`:

```diff
 def run_comparisons(
     self,
     protein_df: pd.DataFrame,
     sample_to_condition: dict[str, str],
     contrasts: list[tuple[str, str]],
+    protein_column: str = "ProteinName",
 ) -> dict[str, pd.DataFrame]:
     ...
     for contrast in contrasts:
         key = f"{contrast[0]}-{contrast[1]}"
-        results[key] = self.run(protein_df, sample_to_condition, contrast)
+        results[key] = self.run(protein_df, sample_to_condition, contrast, protein_column)
```

mokume/io/feature.py-123-127 (1)
123-127: ⚠️ Potential issue | 🟠 Major

SQL injection / breakage if `database_path` contains a single quote. The path is interpolated directly into SQL via `str.format()`. A filename containing `'` (e.g., `/tmp/user's_data.parquet`) will produce malformed SQL. Use parameterized queries or escape the path.

Proposed fix

```diff
-        self.parquet_db.execute(
-            "CREATE VIEW parquet_db_raw AS SELECT * FROM parquet_scan('{}')".format(
-                database_path
-            )
-        )
+        safe_path = database_path.replace("'", "''")
+        self.parquet_db.execute(
+            f"CREATE VIEW parquet_db_raw AS SELECT * FROM parquet_scan('{safe_path}')"
+        )
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/05_cross_technology_correlation.py-64-73 (1)
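Quote-doubling is the standard way to embed a string in a SQL literal; in isolation (hypothetical helper name):

```python
def quote_sql_literal(text: str) -> str:
    """Wrap text in single quotes, doubling any embedded single quotes.

    Without the doubling, a path like /tmp/user's_data.parquet would
    terminate the SQL string literal early.
    """
    return "'" + text.replace("'", "''") + "'"
```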
64-73: ⚠️ Potential issue | 🟠 Major

Potential `KeyError` when DataFrame columns are non-string types. Lines 64-65 convert column values to strings for set intersection, but lines 72-73 index the original DataFrames with those string keys. If the parquet source yields integer column names (e.g., sample IDs stored as `int`), `.loc[..., common_samples]` will raise `KeyError` because the column index dtype won't match. The same pattern is repeated in `compute_per_protein_correlation` (lines 112-114, 121-122) and `compute_per_sample_correlation` (lines 143-145, 149-150).

Proposed fix: cast columns to string up-front in `load_quantified_data`

```diff
     df_wide = df.pivot(index="ProteinName", columns="SampleID", values="Intensity")
+    df_wide.columns = df_wide.columns.astype(str)
     return df_wide
```

This ensures all downstream string-based operations are consistent.
mokume/reports/workflow_comparison.py-287-292 (1)
287-292: ⚠️ Potential issue | 🟠 Major

XSS risk: unsanitized values injected into HTML and `<script>` blocks. (1) `title` and workflow names are interpolated directly into HTML (lines 291, 331) without escaping. If these ever contain user-controlled data with `<script>` or HTML entities, it's exploitable. (2) `json.dumps` on lines 417–420 does not escape `</script>` by default. A workflow name or protein name containing the literal string `</script>` would prematurely close the `<script>` tag, enabling injection. For (1), use `html.escape(title)`. For (2), a common defense is to post-process with `.replace("</", "<\\/")` or use `json.dumps(...).replace("<", "\\u003c")`.

Proposed fix sketch

```diff
+import html as html_mod
 ...
     return f"""<!DOCTYPE html>
 <html>
 <head>
     <meta charset="utf-8">
-    <title>{title}</title>
+    <title>{html_mod.escape(title)}</title>
 ...
-    <h1>{title}</h1>
+    <h1>{html_mod.escape(title)}</h1>
 ...
-const summaryRows = {json.dumps(summary_rows)};
-const pcaData = {json.dumps(pca_data)};
-const concordanceData = {json.dumps(concordance_json)};
-const fcScatterData = {json.dumps(fc_scatter_data)};
+const summaryRows = {json.dumps(summary_rows).replace("<", "\\u003c")};
+const pcaData = {json.dumps(pca_data).replace("<", "\\u003c")};
+const concordanceData = {json.dumps(concordance_json).replace("<", "\\u003c")};
+const fcScatterData = {json.dumps(fc_scatter_data).replace("<", "\\u003c")};
```

Also applies to: 401-420
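Both defenses can be verified without a browser; a stdlib-only sketch:

```python
import html
import json


def safe_script_payload(data) -> str:
    """Serialize data for embedding inside a <script> block.

    Replacing "<" with its \\u003c JSON escape keeps the payload
    semantically identical while preventing a literal "</script>"
    from closing the tag early.
    """
    return json.dumps(data).replace("<", "\\u003c")


def safe_html_text(text: str) -> str:
    """Escape user-controlled text before interpolating into HTML."""
    return html.escape(text)
```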
mokume/plotting/differential_expression.py-14-16 (1)
14-16: ⚠️ Potential issue | 🟠 Major

`matplotlib.use("Agg")` at import time is a global side effect. Importing this module forces the Agg backend for the entire process, which will break interactive plotting in Jupyter notebooks or GUI applications. Since these functions already accept `output_file` and return figures, the backend choice should be left to the caller. If Agg is needed for headless environments, set it conditionally or in the script entrypoint, not at library-module scope.

Proposed fix

```diff
 import matplotlib
-import seaborn as sns
-from sklearn.decomposition import PCA
-matplotlib.use("Agg")
+# Only set Agg backend if no display is available and no backend is already set
+import os
+if os.environ.get("DISPLAY") is None and matplotlib.get_backend() == "agg":
+    pass  # already non-interactive
+
+import seaborn as sns
+from sklearn.decomposition import PCA
```

Or simply remove the `matplotlib.use("Agg")` line and let callers control the backend.

mokume/preprocessing/aggregation.py-258-258 (1)
258-258:⚠️ Potential issue | 🟠 Major
`dropna(..., inplace=True)` mutates the caller's DataFrame. `merge_fractions`, `get_peptidoform_normalize_intensities`, and `sum_peptidoform_intensities` all call `dataset.dropna(subset=[NORM_INTENSITY], inplace=True)` before further processing. Since `inplace=True` modifies the DataFrame passed by the caller, this can cause surprising side effects upstream. Either operate on a copy or use the non-inplace form:

Proposed fix (example for merge_fractions)

```diff
-    dataset.dropna(subset=[NORM_INTENSITY], inplace=True)
+    dataset = dataset.dropna(subset=[NORM_INTENSITY])
```

Apply the same pattern to lines 294 and 327.

Also applies to: 294-294, 327-327
mokume/reports/qc_report.py-69-73 (1)
69-73: ⚠️ Potential issue | 🟠 Major

In log2 space, replacing 0 with NaN incorrectly drops proteins with intensity = 1. When `is_log2=True`, line 73 replaces 0 with NaN. But in log2 space, 0 is a valid value (it corresponds to a raw intensity of 1). This silently removes valid data points from all downstream QC metrics. Consider using a different sentinel or only replacing actual missing values (which should already be NaN in the input):

Proposed fix

```diff
 if not is_log2:
     intensity = intensity.replace(0, np.nan)
     log2_matrix = np.log2(intensity)
 else:
-    log2_matrix = intensity.replace(0, np.nan)
+    log2_matrix = intensity
```

mokume/preprocessing/aggregation.py-122-126 (1)
122-126: ⚠️ Potential issue | 🟠 Major

Regex injection: user-provided protein IDs are joined into a regex pattern without escaping. If the protein ID file contains characters with special regex meaning (`.`, `+`, `(`, `*`, etc.), the pattern will match unintended proteins. For example, a contaminant ID like `P12345.1` would match `P12345X1` due to the unescaped dot.

Proposed fix

```diff
+    import re as _re
     contaminants = [cont for cont in contaminants if cont.strip()]
-    cregex = "|".join(contaminants)
+    cregex = "|".join(_re.escape(cont) for cont in contaminants)
     return dataset[~dataset[protein_field].str.contains(cregex, regex=True)]
```

mokume/normalization/irs.py-96-106 (1)
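The effect of `re.escape` is easy to check in isolation (the IDs here are illustrative):

```python
import re

contaminant_ids = ["P12345.1", "KRT10+HUMAN"]

# Unescaped: "." is a regex wildcard, so "P12345.1" also matches "P12345X1"
raw_pattern = re.compile("|".join(contaminant_ids))

# Escaped: every metacharacter is matched literally
safe_pattern = re.compile("|".join(re.escape(c) for c in contaminant_ids))
```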
96-106: ⚠️ Potential issue | 🟠 Major

Negative reference intensities are not replaced with NaN despite the comment, and division-by-zero risk in scaling factors.
Line 98-99: The comment says "Replace 0 and negatives with NaN" but only
0is replaced. Negative values (e.g., from log-transformed input or data artifacts) would produceNaNfromnp.logsilently.Additionally, on line 106, if
plex_ref_intensity[plex]contains zeros for some proteins,global_ref / plex_ref_intensity[plex]producesinfscaling factors, which will silently corrupt downstream values intransform().Proposed fix
# Compute global reference intensity (geometric mean across plexes) ref_df = pd.DataFrame(plex_ref_intensity) - # Replace 0 and negatives with NaN for geometric mean - ref_df = ref_df.replace(0, np.nan) + # Replace 0 and negatives with NaN for geometric mean + ref_df[ref_df <= 0] = np.nanbenchmarks/quant-pxd007683-tmt-vs-lfq/scripts/09_quant_norm_interaction.py-304-353 (1)
304-353:⚠️ Potential issue | 🟠 Major
`x` and `quant_methods` leak from inner loop — misaligned tick labels and potential `NameError`. `x` (line 330) and `quant_methods` (line 326) are set inside the `for i, tech` loop. After the loop exits, lines 346–347 use them for `set_xticks`/`set_xticklabels`. Two problems:

- If the two technologies have different numbers of quantification methods, the ticks will only match the last technology processed — the other technology's bars will be misaligned.
- If both technologies have zero rows, `x` and `quant_methods` are never assigned → `NameError` at runtime.
benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/12_optimal_pipeline.py-462-510 (1)
462-510:⚠️ Potential issue | 🟠 MajorRadar chart is missing polar projection — renders as a Cartesian line plot.
plt.subplots(1, 2, ...)creates standard rectangular axes. Plotting angles as x-coordinates produces a meaningless Cartesian chart, not a radar/spider chart. You needsubplot_kw=dict(projection='polar').Proposed fix
- fig, axes = plt.subplots(1, 2, figsize=(14, 6)) + fig, axes = plt.subplots(1, 2, figsize=(14, 6), + subplot_kw=dict(projection='polar'))benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/12_optimal_pipeline.py-356-407 (1)
356-407:⚠️ Potential issue | 🟠 Major
run_pipelinereturns a partial dict on error — downstream code willKeyError.When an exception is caught at line 402,
resultsonly containstechnology,pipeline, and the config keys — it's missingwithin_cv,fc_rmse,n_proteins, etc. Downstream,main()andplot_pipeline_comparisonindex into these keys unconditionally (e.g.,tech_df["within_cv"].rank()), which will raiseKeyErroror produce silent NaN propagation.Either populate the missing keys with
np.nanin the error path, or filter out failed pipelines before downstream use.Proposed fix
except Exception as e: print(f" ERROR: {e}") import traceback traceback.print_exc() + # Fill missing metric keys so downstream doesn't break + for key in ["n_proteins", "n_yeast", "within_cv", "fc_rmse", "pct_good_cv"]: + results.setdefault(key, np.nan) return resultsbenchmarks/quant-pxd007683-tmt-vs-lfq/scripts/08_normalization_grid.py-155-179 (1)
155-179:⚠️ Potential issue | 🟠 MajorIncorrect quantile normalization implementation.
This implementation does not perform standard quantile normalization. Proper quantile normalization (1) ranks each column, (2) computes the mean value at each rank across all columns, then (3) replaces each value with the cross-column mean at its rank. The current code only sorts and re-indexes each column independently, which is effectively a no-op modulo tie-breaking.
Additionally, the nested
for col … for i, rank …loop is O(proteins × samples) with per-cell.ilocassignment, which will be very slow on large matrices.Consider using a well-known implementation (e.g., from
qnormpackage or the standard rank-then-average recipe):Standard quantile normalization sketch
elif method == "quantile": - from scipy.stats import rankdata - result = df.copy() - ranks = df.apply(lambda x: rankdata(x, method='average', nan_policy='omit'), axis=0) - sorted_means = df.apply(lambda x: np.sort(x.dropna().values), axis=0) - n_rows = len(df) - for col in df.columns: - col_ranks = ranks[col].values - col_sorted = np.sort(df[col].dropna().values) - for i, rank in enumerate(col_ranks): - if not np.isnan(rank) and not np.isnan(df[col].iloc[i]): - idx = int(rank) - 1 - if idx < len(col_sorted): - result.iloc[i, result.columns.get_loc(col)] = col_sorted[idx] - return result + ranked = df.rank(method="average") + sorted_vals = np.sort(df.values, axis=0) + mean_sorted = sorted_vals.mean(axis=1) + result = ranked.apply( + lambda col: np.interp(col, np.arange(1, len(mean_sorted) + 1), mean_sorted) + ) + # Restore NaN positions + result[df.isna()] = np.nan + return resultmokume/pipeline/stages.py-522-537 (1)
522-537:⚠️ Potential issue | 🟠 MajorNo guard for missing FASTA file in iBAQ quantification.
self.config.input.fasta_fileis passed directly toextract_fastaon line 531. If no FASTA was provided in the config, this will produce an opaque error. Consider an early validation:Suggested fix
def _quantify_ibaq(self, peptide_df: pd.DataFrame) -> pd.DataFrame: """Quantify using iBAQ method.""" from mokume.quantification.ibaq import extract_fasta + if not self.config.input.fasta_file: + raise ValueError( + "iBAQ quantification requires a FASTA file (--fasta)" + ) + proteins = peptide_df[PROTEIN_NAME].unique().tolist()mokume/pipeline/stages.py-730-736 (1)
730-736:⚠️ Potential issue | 🟠 MajorContrast parsing with
split("-")breaks on hyphenated condition names.Condition names like
"wild-type"or"drug-treated"are common in SDRF files. Splitting on"-"(line 732) will incorrectly parse"wild-type-control"as 3 parts instead of identifying the two conditions.Consider using a different delimiter (e.g.,
" vs ","__vs__") or requiring exactly one separator that is unlikely to appear in condition names.Suggested fix
if self.config.de.contrasts: for c in self.config.de.contrasts: - parts = c.split("-") - if len(parts) == 2: - contrasts.append((parts[0].strip(), parts[1].strip())) + # Support "A vs B" or "A-B" (first try " vs ", fallback to single "-") + if " vs " in c: + parts = c.split(" vs ", 1) + elif c.count("-") == 1: + parts = c.split("-", 1) else: - logger.warning(f"Invalid contrast format '{c}', expected 'A-B'") + logger.warning(f"Ambiguous contrast format '{c}', use 'A vs B'") + continue + contrasts.append((parts[0].strip(), parts[1].strip()))mokume/pipeline/stages.py-215-260 (1)
215-260:⚠️ Potential issue | 🟠 MajorMissing SDRF guard in
load_for_ratio.
detect_pooled_from_sdrf(line 231) anddetect_plexes_from_sdrf(line 249) are called withself.config.input.sdrfwithout first verifying it's notNone. Compare withapply_irs(line 366) which properly guards against this. IfsdrfisNone, these calls will likely fail with an unhelpful error.Suggested fix
def load_for_ratio(self) -> tuple: ... + if not self.config.input.sdrf: + raise ValueError( + "Ratio quantification requires an SDRF file (--sdrf)" + ) + # Detect reference samples (reuse IRS detection logic) ref_samples = detect_pooled_from_sdrf(self.config.input.sdrf)
🟡 Minor comments (33)
docs/user-guide/features2peptides.md-70-127 (1)
70-127:⚠️ Potential issue | 🟡 MinorInconsistent CLI option naming: underscores vs hyphens.
Lines 74–78 use underscores (
--min_aa,--min_unique,--remove_decoy_contaminants), while lines 109–126 use hyphens (--filter-min-intensity,--filter-cv-threshold). Most CLI tools (including Click) conventionally use hyphens. Consider unifying the convention.benchmarks/quant-pxd007683-tmt-vs-lfq/results/BENCHMARK_REPORT.md-30-40 (1)
30-40:⚠️ Potential issue | 🟡 MinorNarrative contradicts the data.
The table shows condition explains only ~1% of variance (with ~99% residual), yet line 39 states "Condition (yeast spike-in level) explains a significant portion of variance." This is misleading — 1% is negligible. The text should be revised to accurately reflect the findings, or the variance decomposition methodology should be re-examined.
benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/config.py-118-118 (1)
118-118:⚠️ Potential issue | 🟡 MinorNon-deterministic ordering from
list(set(...)).
setdoes not guarantee iteration order, soQUANTIFICATION_METHODSmay differ between runs. If any downstream code depends on consistent ordering (e.g., report tables, indexing), this can cause flaky results. Usedict.fromkeys(...)to deduplicate while preserving insertion order.Proposed fix
-QUANTIFICATION_METHODS = list(set(QUANTIFICATION_METHODS_TMT + QUANTIFICATION_METHODS_LFQ)) +QUANTIFICATION_METHODS = list(dict.fromkeys(QUANTIFICATION_METHODS_TMT + QUANTIFICATION_METHODS_LFQ))mokume/normalization/tmm.py-254-265 (1)
254-265:⚠️ Potential issue | 🟡 MinorEdge case: all-NaN or all-zero input passes the
X.emptycheck.
X.emptyonly checks shape (no rows or no columns). A DataFrame full of NaN/zero values would pass this check but cause_select_reference_sampleto produce NaN upper quartiles, leading to an error inidxmin()at line 128.Proposed guard
if X.empty: raise ValueError("Input DataFrame is empty") + # Check that there is at least some valid data + if X.replace(0, np.nan).dropna(how='all').empty: + raise ValueError("Input DataFrame contains no valid (non-zero, non-NaN) values") + # Handle log-transformed inputmokume/io/feature.py-518-522 (1)
518-522:⚠️ Potential issue | 🟡 Minor
CASTtoINTEGERwill throw on non-numeric run names.If
rundoesn't contain_and isn't numeric,CAST(run AS INTEGER)will raise a DuckDB error. Similarly, if the part after_isn't numeric. UseTRY_CASTwhich returnsNULLon failure instead.Proposed fix
- CASE WHEN position('_' in run) > 0 THEN CAST(split_part(run, '_', 2) AS INTEGER) - ELSE CAST(run AS INTEGER) END AS techreplicate + CASE WHEN position('_' in run) > 0 THEN TRY_CAST(split_part(run, '_', 2) AS INTEGER) + ELSE TRY_CAST(run AS INTEGER) END AS techreplicatemokume/io/feature.py-80-89 (1)
80-89:⚠️ Potential issue | 🟡 MinorLIKE wildcard characters in
contaminant_patternsare not escaped.If a pattern contains
%or_, these are interpreted as SQL LIKE wildcards rather than literal characters. The default patterns are safe, but sincecontaminant_patternsis a public field, external callers could supply patterns containing these characters and get unexpected match behavior.Proposed fix: escape LIKE wildcards
for pattern in self.contaminant_patterns: - # Escape any SQL special characters in the pattern safe_pattern = pattern.replace("'", "''") + safe_pattern = safe_pattern.replace("%", "\\%").replace("_", "\\_") pattern_conditions.append( - f"pg_accessions::text NOT LIKE '%{safe_pattern}%'" + f"pg_accessions::text NOT LIKE '%{safe_pattern}%' ESCAPE '\\'" )benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/05_cross_technology_correlation.py-212-238 (1)
212-238:⚠️ Potential issue | 🟡 MinorUnguarded division when
per_protein_dfis empty.If
compute_per_protein_correlationreturns an empty DataFrame (e.g., fewer than 3 common samples),len(per_protein_df)is 0 and line 224 will raiseZeroDivisionError.Proposed guard
+ if per_protein_df.empty: + return + fig, ax = plt.subplots(figsize=(10, 6))docs/user-guide/peptides2protein.md-31-36 (1)
31-36:⚠️ Potential issue | 🟡 MinorUnused import
AllPeptidesQuantificationin the Python example.
AllPeptidesQuantificationis imported on Line 34 but never used in the code snippet that follows. Either remove it or add an example demonstrating its usage.Proposed fix
from mokume.quantification import ( TopNQuantification, MaxLFQQuantification, - AllPeptidesQuantification, peptides_to_protein, )mokume/postprocessing/batch_correction.py-126-142 (1)
126-142:⚠️ Potential issue | 🟡 MinorRegex won't match purely alphabetic prefixes like
"abc_1".The pattern
^([a-zA-Z]+\d+)_requires at least one digit after the alphabetic chars. A sample ID like"abc_1"or"batch_3"will fall through to the full-string fallback instead of extracting"abc"or"batch"as the prefix. If that's intentional (onlyletter+digitcombos likep1,TMT10are valid plex prefixes), worth a brief comment in the code to clarify.If purely alphabetic prefixes should also be matched
- m = re.match(r"^([a-zA-Z]+\d+)_", s) + m = re.match(r"^([a-zA-Z]+\d*)_", s)Change
\d+to\d*to also capture prefixes without trailing digits.docs/reference/cli.md-3-3 (1)
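The `\d*` variant can be checked directly (`plex_prefix` is a hypothetical wrapper around the same pattern):

```python
import re


def plex_prefix(sample_id: str) -> str:
    """Extract a plex/batch prefix from IDs like 'p1_s3' or 'batch_3'.

    \\d* (zero or more digits) also accepts purely alphabetic prefixes,
    which the original \\d+ rejected.
    """
    m = re.match(r"^([a-zA-Z]+\d*)_", sample_id)
    return m.group(1) if m else sample_id
```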
3-3: ⚠️ Potential issue | 🟡 Minor

Command count is incorrect: five commands are documented, not four. The text says "four main commands" but the page documents five: `features2proteins`, `features2peptides`, `peptides2protein`, `correct-batches`, and `tsne-visualization`.

📝 Proposed fix

```diff
-mokume provides four main commands. Use `mokume --help` or `mokume <command> --help` for details.
+mokume provides five main commands. Use `mokume --help` or `mokume <command> --help` for details.
```

tests/test_batch_correction_integration.py-208-208 (1)
208-208: ⚠️ Potential issue | 🟡 Minor

Typo in test class name: `TestSDRFCovariatExtraction` → `TestSDRFCovariateExtraction`. Missing 'e' in "Covariate".
benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/02_variance_decomposition.py-306-313 (1)
306-313: ⚠️ Potential issue | 🟡 Minor

Hardcoded `.png` extension in print message doesn't match `FIGURE_FORMAT`. Line 311 correctly uses `FIGURE_FORMAT` for the actual file path, but line 313 hardcodes `.png` in the print message. If `FIGURE_FORMAT` is e.g. `"svg"`, the log will mislead.

Proposed fix

```diff
-    print(f"Combined PCA plot saved to: {FIGURES_DIR / 'pca_combined.png'}")
+    print(f"Combined PCA plot saved to: {FIGURES_DIR / f'pca_combined.{FIGURE_FORMAT}'}")
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/run_benchmark.py-51-57 (1)
51-57: ⚠️ Potential issue | 🟡 Minor

Redundant return code check — `check=True` already guarantees success. `subprocess.run(..., check=True)` raises `CalledProcessError` for any non-zero return code (caught on line 58). Therefore, `result.returncode == 0` on line 57 is always `True` when reached. Either remove `check=True` and rely on the returncode check, or simplify the success path.

Proposed fix (simplify)

```diff
 try:
     result = subprocess.run(
         [sys.executable, str(script_path)],
         cwd=str(SCRIPT_DIR),
         check=True,
     )
-    return result.returncode == 0
+    return True
 except subprocess.CalledProcessError as e:
```

mokume/reports/workflow_comparison.py-138-147 (1)
138-147: ⚠️ Potential issue | 🟡 Minor

Potential `ZeroDivisionError` if the merge on overlapping proteins yields an empty DataFrame.

`overlap` is non-empty here, but if `de_a` or `de_b` contain duplicate `ProteinName` entries that get filtered out during the merge (or if the `significance` column is missing for some rows), `merged_sig` could end up empty, causing `len(merged_sig) == 0` on line 145.

Proposed defensive guard

```diff
 if overlap:
     merged_sig = de_a[de_a["ProteinName"].isin(overlap)][["ProteinName", "significance"]].merge(
         de_b[de_b["ProteinName"].isin(overlap)][["ProteinName", "significance"]],
         on="ProteinName",
         suffixes=("_a", "_b"),
     )
-    direction_agree = (merged_sig["significance_a"] == merged_sig["significance_b"]).sum()
-    direction_concordance = float(direction_agree / len(merged_sig))
+    if not merged_sig.empty:
+        direction_agree = (merged_sig["significance_a"] == merged_sig["significance_b"]).sum()
+        direction_concordance = float(direction_agree / len(merged_sig))
+    else:
+        direction_concordance = float("nan")
 else:
     direction_concordance = float("nan")
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/01_grid_search_methods.py-160-193 (1)
160-193: ⚠️ Potential issue | 🟡 Minor

Unused `sample_conditions` parameter — function uses global instead.

`compute_cv_within_conditions` accepts `sample_conditions` as a parameter but never references it. Instead, it calls `get_condition()` on line 168, which reads from the module-global `SAMPLE_CONDITIONS`. This makes the parameter misleading and the function harder to test or reuse with different condition mappings.

Proposed fix: use the parameter

```diff
 def compute_cv_within_conditions(df_wide: pd.DataFrame, sample_conditions: dict) -> float:
     """Compute mean CV within conditions."""
     cvs = []
     # Map columns to conditions
     col_conditions = {}
     for col in df_wide.columns:
         sample_id = extract_sample_id(str(col))
-        condition = get_condition(sample_id)
+        condition = sample_conditions.get(sample_id, "Unknown")
         if condition != "Unknown":
             col_conditions[col] = condition
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/02_variance_decomposition.py-196-207 (1)
196-207: ⚠️ Potential issue | 🟡 Minor

Potential `IndexError` on `var_explained[1]` if PCA has only 1 component.

`plot_pca_by_condition` accesses `var_explained[0]` and `var_explained[1]` (lines 199–200) unconditionally. If `n_components` ends up as 1 (when there's only 1 sample or 1 feature), indexing `[1]` will raise an `IndexError`. The same pattern appears in `plot_combined_pca` at lines 231–232.

For the benchmark dataset this is unlikely, but a guard would be safer.

Proposed fix for plot_pca_by_condition

```diff
 ax.set_xlabel(f"PC1 ({var_explained[0]*100:.1f}% variance)")
-ax.set_ylabel(f"PC2 ({var_explained[1]*100:.1f}% variance)")
+ax.set_ylabel(f"PC2 ({var_explained[1]*100:.1f}% variance)" if len(var_explained) > 1 else "PC2")
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/04_stability_metrics.py-148-164 (1)
148-164: ⚠️ Potential issue | 🟡 Minor

Same `NameError` risk as `10_imputation_benchmark.py` if both technologies have no data.

`methods` and `x` are defined inside the inner loop and used outside it (lines 163-164). If both `tmt` and `lfq` have empty DataFrames, these variables are unbound.

mokume/reports/interactive.py-424-424 (1)
424-424: ⚠️ Potential issue | 🟡 Minor

Protein names with special characters create invalid HTML element IDs.

`tr.id = 'row-' + row.protein` — protein accessions often contain characters like `;`, `.`, `/`, which are invalid in HTML IDs. `document.getElementById('row-' + protein)` at line 444 will then fail to find the row, breaking the highlight-on-click interaction between the volcano plot and the table.

Consider using a CSS-safe encoding (e.g., a simple hash or index-based ID) and maintaining a JS lookup map from protein name to row ID.

mokume/reports/interactive.py-227-227 (1)
227-227: ⚠️ Potential issue | 🟡 Minor

Unpinned Plotly CDN (`plotly-latest.min.js`) may cause future breakage.

Using `plotly-latest.min.js` means any breaking change in a new Plotly release can silently break the generated reports. The QC report (`qc_report.py`) already pins `plotly-2.27.0.min.js` — this file should do the same for consistency and stability.

Proposed fix

```diff
- <script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
+ <script src="https://cdn.plot.ly/plotly-2.27.0.min.js"></script>
```

mokume/preprocessing/aggregation.py-186-186 (1)
186-186: ⚠️ Potential issue | 🟡 Minor

Hardcoded `"Condition"` string instead of the `CONDITION` constant.

Line 186 uses the string literal `"Condition"` while the module imports the `CONDITION` constant. Use the constant for consistency and to avoid silent breakage if the constant value changes.

Proposed fix

```diff
- data_df = data_df[(data_df["Condition"] != "Empty") | (data_df["Condition"].isnull())]
+ data_df = data_df[(data_df[CONDITION] != "Empty") | (data_df[CONDITION].isnull())]
```

mokume/reports/interactive.py-162-163 (1)
162-163: ⚠️ Potential issue | 🟡 Minor

Specify `encoding="utf-8"` when writing the HTML file.

The HTML declares `<meta charset="utf-8">` but the `open()` call uses the platform default encoding. On some systems (e.g., Windows), this defaults to a locale-specific encoding, which can corrupt non-ASCII characters (e.g., protein names or annotation text).

Proposed fix

```diff
- with open(output_html, "w") as f:
+ with open(output_html, "w", encoding="utf-8") as f:
```

mokume/reports/qc_report.py-491-492 (1)
491-492: ⚠️ Potential issue | 🟡 Minor

Specify `encoding="utf-8"` when writing the HTML file.

Same issue as in `interactive.py` — the HTML declares UTF-8 charset but the `open()` call uses platform-default encoding.

Proposed fix

```diff
- with open(output_html, "w") as f:
+ with open(output_html, "w", encoding="utf-8") as f:
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/04_stability_metrics.py-282-284 (1)
282-284: ⚠️ Potential issue | 🟡 Minor

Potential `KeyError` if `quality_within` dict is empty.

`assess_quality` returns `{}` when the input series is empty (lines 79-80). If `compute_cv_metrics` returns an empty `quality_within`, line 284's `metrics['quality_within']['good']` raises `KeyError`.

Proposed fix

```diff
- print(f"  Quality: {metrics['quality_within']['good']*100:.1f}% proteins with CV < 20%")
+ pct_good = metrics['quality_within'].get('good', 0) * 100
+ print(f"  Quality: {pct_good:.1f}% proteins with CV < 20%")
```

mokume/reports/qc_report.py-746-755 (1)
746-755: ⚠️ Potential issue | 🟡 Minor

`showTab` relies on implicit `event` object — fragile across JS environments.

Line 752 uses `event.target` without receiving `event` as a parameter. While this works via the global `event` in most browsers' inline handlers, it's not reliable in strict mode or all environments.

Proposed fix — pass the event explicitly

```diff
-function showTab(tabId) {{
+function showTab(tabId, evt) {{
     document.querySelectorAll('.tab-content').forEach(el => el.classList.remove('active'));
     document.querySelectorAll('.tab-btn').forEach(el => el.classList.remove('active'));
     document.getElementById('tab-' + tabId).classList.add('active');
-    event.target.classList.add('active');
+    evt.target.classList.add('active');
```

And update the onclick attributes:

```diff
- <button class="tab-btn active" onclick="showTab('sample-qc')">Sample QC</button>
- <button class="tab-btn" onclick="showTab('reproducibility')">Reproducibility</button>
- <button class="tab-btn" onclick="showTab('de-quality')">DE Quality</button>
+ <button class="tab-btn active" onclick="showTab('sample-qc', event)">Sample QC</button>
+ <button class="tab-btn" onclick="showTab('reproducibility', event)">Reproducibility</button>
+ <button class="tab-btn" onclick="showTab('de-quality', event)">DE Quality</button>
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/10_imputation_benchmark.py-295-307 (1)
295-307: ⚠️ Potential issue | 🟡 Minor

Potential `NameError` if both technologies have no data for a metric.

If both `tmt` and `lfq` DataFrames are empty, the inner loop (lines 286-300) executes `continue` for both iterations, leaving `x` and `methods` unbound. Lines 305-306 then raise `NameError`.

Proposed fix

```diff
 for ax, (metric, title, lower_better) in zip(axes.flat, metrics):
+    x = np.array([])
+    methods = []
     for i, tech in enumerate(["tmt", "lfq"]):
         tech_df = results_df[results_df["technology"] == tech]
         if len(tech_df) == 0:
             continue
         methods = tech_df["imputation"].tolist()
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/03_fold_change_accuracy.py-397-404 (1)
397-404: ⚠️ Potential issue | 🟡 Minor

Hardcoded `.png` in print statement — should use `FIGURE_FORMAT`.

Line 402 correctly uses `FIGURE_FORMAT` for the actual saved file, but line 404 hardcodes `.png` in the print message. If `FIGURE_FORMAT` is `"svg"` or `"pdf"`, the logged path won't match the actual file.

```diff
- print(f"Comparison plot saved to: {FIGURES_DIR / 'fold_change_comparison.png'}")
+ print(f"Comparison plot saved to: {FIGURES_DIR / f'fold_change_comparison.{FIGURE_FORMAT}'}")
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/07_tmm_normalization_benchmark.py-132-134 (1)
132-134: ⚠️ Potential issue | 🟡 Minor

Division by zero in `cv_improvement` when `within_cv_median` is zero.

If `metrics_before['within_cv_median']` is 0 (e.g., all proteins have zero variance within conditions), line 134 will raise `ZeroDivisionError`.

Proposed fix

```diff
- cv_improvement = (metrics_before['within_cv_median'] - metrics_after['within_cv_median']) / metrics_before['within_cv_median'] * 100
+ if metrics_before['within_cv_median'] and metrics_before['within_cv_median'] != 0:
+     cv_improvement = (metrics_before['within_cv_median'] - metrics_after['within_cv_median']) / metrics_before['within_cv_median'] * 100
+ else:
+     cv_improvement = np.nan
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/07_tmm_normalization_benchmark.py-376-383 (1)
376-383: ⚠️ Potential issue | 🟡 Minor

Unsafe `.values[0]` access — `IndexError` if `maxlfq` was not benchmarked for a technology.

Line 383 filters `tech_data` for `method == 'maxlfq'` and immediately accesses `.values[0]`. If `maxlfq` was skipped (e.g., file not found), the filtered result is empty and `.values[0]` raises `IndexError`.

Proposed fix

```diff
- print(f"  Factor variation (maxlfq): {tech_data[tech_data['method']=='maxlfq']['factor_std'].values[0]:.3f}")
+ maxlfq_data = tech_data[tech_data['method'] == 'maxlfq']
+ if len(maxlfq_data) > 0:
+     print(f"  Factor variation (maxlfq): {maxlfq_data['factor_std'].values[0]:.3f}")
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/09_quant_norm_interaction.py-89-93 (1)
89-93: ⚠️ Potential issue | 🟡 Minor

Division by zero when a sample median is zero.

If any column in `df` is all-NaN or all-zero, `sample_medians` will contain 0 or NaN, and `global_median / sample_medians` will produce `inf` values silently propagated through the data. Consider guarding against zero medians.

Proposed fix

```diff
 elif method == "globalmedian":
     global_median = df.median().median()
     sample_medians = df.median()
+    sample_medians = sample_medians.replace(0, np.nan)
     factors = global_median / sample_medians
     return df * factors
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/03_fold_change_accuracy.py-114-143 (1)
114-143: ⚠️ Potential issue | 🟡 Minor

No zero filtering before computing log2 fold-change — potential `inf` values.

`mean1` and `mean2` (lines 118-119) can contain zeros when proteins are undetected in a condition. `np.log2(mean1 / mean2)` then produces `inf`, `-inf`, or `NaN`. While `pd.notna` on line 127 filters `NaN`, it does not filter `inf`/`-inf`, so infinite fold-changes will be included in the results and skew metrics.

Proposed fix

```diff
 # Compute fold-change (cond1 / cond2)
-log2_fc = np.log2(mean1 / mean2)
+ratio = mean1 / mean2
+ratio = ratio.replace([np.inf, -np.inf, 0], np.nan)
+log2_fc = np.log2(ratio)
```

mokume/pipeline/stages.py-501-520 (1)
501-520: ⚠️ Potential issue | 🟡 Minor

Fragile fallback for intensity column detection.

Lines 506–509: If `"Intensity"` is not present, the code falls back to the last numeric column. This is a silent assumption that could pick an unrelated column (e.g., a count or index column). Consider logging a warning when using the fallback, or raising if the expected column is missing.

Suggested improvement

```diff
 if intensity_col not in long_df.columns:
-    # Fallback: use the last numeric column
     numeric_cols = long_df.select_dtypes(include=[np.number]).columns
     if len(numeric_cols) > 0:
         intensity_col = numeric_cols[-1]
+        logger.warning(
+            f"'Intensity' column not found, falling back to '{intensity_col}'"
+        )
+    else:
+        raise ValueError(
+            f"No numeric columns found in quantification output. "
+            f"Columns: {list(long_df.columns)}"
+        )
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/11_preprocessing_filters.py-267-275 (1)
267-275: ⚠️ Potential issue | 🟡 Minor

Guard against NaN/zero in median normalization.

If any sample has all-NaN values after the `dropna(thresh=3)` row filter, `sample_medians` will contain NaN, and `global_median / sample_medians` silently produces NaN factors, zeroing out that sample's data after multiplication. Consider adding a check:

Suggested guard

```diff
 global_median = df_quant.median().median()
 sample_medians = df_quant.median()
-factors = global_median / sample_medians
-df_norm = df_quant * factors
+if sample_medians.isna().any() or (sample_medians == 0).any():
+    print("  WARNING: some samples have NaN/zero median, skipping normalization")
+    df_norm = df_quant
+else:
+    factors = global_median / sample_medians
+    df_norm = df_quant * factors
```

mokume/pipeline/stages.py-481-486 (1)
481-486: ⚠️ Potential issue | 🟡 Minor

Non-numeric `top` suffix silently defaults to `n=3`.

A method like `"topx"` passes the `startswith("top")` check but fails the regex, silently defaulting to `n=3`. Consider warning or raising for unparseable values.

Suggested fix

```diff
 elif quant_method.startswith("top"):
     match = re.match(r"top(\d+)", quant_method)
     if match:
         n = int(match.group(1))
     else:
-        n = 3
+        logger.warning(
+            f"Could not parse N from '{quant_method}', defaulting to top3"
+        )
+        n = 3
```
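The silent-default behaviour called out above is easy to reproduce in isolation. A minimal sketch — the `parse_top_n` helper is hypothetical, standing in for the parsing logic in `stages.py`:

```python
import re

def parse_top_n(quant_method: str, default: int = 3) -> int:
    """Parse N from a method name like 'top3'; fall back to default if unparseable.

    A name like 'topx' passes a startswith('top') check but fails the
    regex, which is why a silent default can mask user typos.
    """
    match = re.match(r"top(\d+)", quant_method)
    if match:
        return int(match.group(1))
    # The reviewed code silently returns 3 here; emitting a warning
    # (or raising) makes the fallback visible to the user.
    return default

assert parse_top_n("top3") == 3
assert parse_top_n("top10") == 10
assert parse_top_n("topx") == 3  # the silent fallback the review warns about
```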
```python
for _, row in yeast_df.iterrows():
    lines.append(
        f"| {row['technology'].upper()} | {row['comparison']} | "
        f"{row['expected_log2_fc']:.2f} | {row['observed_median_log2_fc']:.2f} | "
        f"{row.get('compression_ratio', 'N/A'):.2f} | {row['rmse']:.3f} |"
    )
```

Bug: `:.2f` format applied to string fallback `'N/A'` raises `ValueError`.

When `compression_ratio` is not present in the row, `row.get('compression_ratio', 'N/A')` returns the string `'N/A'`, which is then formatted with `:.2f`, causing a `ValueError` at runtime.
Proposed fix
for _, row in yeast_df.iterrows():
+ compression = row.get('compression_ratio', None)
+ compression_str = f"{compression:.2f}" if pd.notna(compression) else "N/A"
lines.append(
f"| {row['technology'].upper()} | {row['comparison']} | "
- f"{row['expected_log2_fc']:.2f} | {row['observed_median_log2_fc']:.2f} | "
- f"{row.get('compression_ratio', 'N/A'):.2f} | {row['rmse']:.3f} |"
+ f"{row['expected_log2_fc']:.2f} | {row['observed_median_log2_fc']:.2f} | "
+ f"{compression_str} | {row['rmse']:.3f} |"
🤖 Prompt for AI Agents

In `benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/06_generate_report.py` around lines 176-181, the f-string in the loop over `yeast_df.iterrows()` (inside the `lines.append` call) applies the numeric format specifier `:.2f` to `row.get('compression_ratio', 'N/A')`, which raises `ValueError` when the fallback `'N/A'` is used; fix by extracting `compression_ratio` into a variable (e.g., `comp = row.get('compression_ratio')`), then build a `comp_str` that is `"N/A"` when `comp` is None or non-numeric (or `pandas.isna(comp)`), otherwise format it with `"{:.2f}"`, and use `comp_str` in the `lines.append` f-string instead of applying `:.2f` directly to the `get(...)` expression.
```python
# Vectorized contaminant matching using regex OR pattern
upper_col = df[self.protein_column].fillna("").astype(str).str.upper()
pattern_regex = "|".join(re.escape(p.upper()) for p in self.patterns)
mask = ~upper_col.str.contains(pattern_regex, regex=True)
```

Empty patterns list causes all rows to be removed.

If `self.patterns` is an empty list, `pattern_regex` becomes an empty string `""`. `str.contains("")` returns `True` for every row, so the negated mask excludes the entire DataFrame. Add a guard for empty patterns.
🐛 Proposed fix

```diff
 # Vectorized contaminant matching using regex OR pattern
+if not self.patterns:
+    logger.debug("%s: No contaminant patterns provided, skipping filter", self.name)
+    return df, self._create_result(input_count, input_count)
+
 upper_col = df[self.protein_column].fillna("").astype(str).str.upper()
 pattern_regex = "|".join(re.escape(p.upper()) for p in self.patterns)
 mask = ~upper_col.str.contains(pattern_regex, regex=True)
```
🤖 Prompt for AI Agents

In `mokume/preprocessing/filters/protein.py` around lines 63-66, the vectorized contaminant filter builds `pattern_regex` from `self.patterns` and then uses `upper_col.str.contains(pattern_regex)`, which with an empty `self.patterns` yields an empty string and matches every row; add a guard: if `self.patterns` is empty (`if not self.patterns` or `len(self.patterns) == 0`), skip the `contains` call and set `mask` to a boolean Series of True for all `df` rows (e.g., `mask = pd.Series(True, index=df.index)`) so no rows are removed; otherwise build `pattern_regex` from `self.patterns` and apply `mask = ~upper_col.str.contains(pattern_regex, regex=True)`; reference the variables/members `self.patterns`, `self.protein_column`, `upper_col`, `pattern_regex` and `mask` to locate where to change.
Actionable comments posted: 9
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (7)
mokume/normalization/peptide.py (2)
177-186: ⚠️ Potential issue | 🟠 Major

Writers are not closed if an exception occurs in the processing loop — resource leak.

`WriteCSVTask` and `WriteParquetTask` are started here but only closed at lines 362-365. If any exception is raised during the `for samples, df in feature.iter_samples()` loop, these writers will never be closed, potentially leaving background threads running and output files corrupted or incomplete. Wrap the processing and close logic in `try`/`finally`.

Proposed fix (sketch)

```diff
+try:
     for samples, df in feature.iter_samples():
         ...  # existing loop body
+finally:
     if write_csv:
         write_csv_task.close()
     if save_parquet:
         writer_parquet_task.close()
```

Also applies to: 362-365

352-353: ⚠️ Potential issue | 🟡 Minor

`np.log2` applied without guarding against zero or negative intensities.

If any `NORM_INTENSITY` values are zero or negative after normalization/IRS scaling, this will produce `-inf` or `NaN` silently. Consider filtering or clamping before the transform, or at minimum logging a warning when non-positive values are encountered.

mokume/postprocessing/batch_correction.py (2)
287-295: ⚠️ Potential issue | 🟠 Major

Substring-based partial matching for sample IDs is fragile.

`sample_id in sdrf_sample or sdrf_sample in sample_id` will match `"S1"` against `"S10"`, `"S11"`, etc. This could silently assign wrong covariate values. Consider using a stricter matching strategy, or at minimum prefer exact matches first and only fall back to substring with a warning.

268-276: ⚠️ Potential issue | 🟡 Minor

Partial column matching is bidirectional and could match unintended columns.

`col_lower in sdrf_col or sdrf_col in col_lower` means a short covariate name like `"sex"` would match any column containing `"sex"` (e.g., `characteristics[sex]` — intended, but also `characteristics[osexorientation]` — not intended). Consider requiring that the match target contains the full bracketed form, or use a more specific matching pattern.

mokume/preprocessing/filters/peptide.py (1)
269-291: ⚠️ Potential issue | 🟠 Major

Missed cleavage counting has a real off-by-one error affecting trypsin sequences.

Lines 274–275 remove the last character to exclude the C-terminal cleavage site, but then lines 280–287 loop through the truncated string. When K/R appears at the penultimate position of the original sequence (e.g., `KA`, `KR`), it becomes the last character of the truncated string. The condition at line 283 (`if i + 1 < len(seq_str)`) fails, and the `elif` at lines 285–287 explicitly does not count it as a missed cleavage.

This violates the trypsin cleavage model (Keil rule): K/R followed by any non-proline residue should be counted as a missed cleavage. Test cases confirm:

- `KA` (length 2) → truncates to `K`, returns 0 (incorrect; should be 1)
- `KR` (length 2) → truncates to `K`, returns 0 (incorrect; should be 1)
- `KAR` (length 3) → truncates to `KA`, returns 1 (correct; K–A is counted)

Additionally, there are no unit tests for `MissedCleavageFilter`, leaving this logic bug undetected.

Fix: check the full sequence but skip counting if K/R is truly at the terminal position, rather than truncating before analysis.

mokume/quantification/directlfq.py (1)
298-300: ⚠️ Potential issue | 🟡 Minor

Use `raise ... from e` to preserve the exception chain.

The original traceback is lost when re-raising. This makes debugging DirectLFQ failures harder since the root-cause stack trace is discarded.

Proposed fix

```diff
 except Exception as e:
     logger.error(f"DirectLFQ failed: {e}")
-    raise RuntimeError(f"DirectLFQ quantification failed: {e}")
+    raise RuntimeError(f"DirectLFQ quantification failed: {e}") from e
```

tests/test_quantification_accuracy.py (1)
69-81: ⚠️ Potential issue | 🟡 Minor

Network call without timeout could hang CI indefinitely.

`urllib.request.urlretrieve` on line 72 has no timeout. If the PRIDE FTP server is slow or unresponsive, this will block indefinitely. Also, `zf.extractall` on line 79 is susceptible to zip-slip attacks if the archive contains path-traversal entries — though the risk is lower in a test context.

🛡️ Proposed fix for timeout

```diff
- urllib.request.urlretrieve(PRIDE_DATASET_URL, zip_path)
+ import urllib.request as _req
+ _req.urlretrieve(PRIDE_DATASET_URL, zip_path)  # consider using requests with timeout
```

A more robust approach:

```python
import shutil
import urllib.request

req = urllib.request.Request(PRIDE_DATASET_URL)
with urllib.request.urlopen(req, timeout=300) as resp, open(zip_path, "wb") as out:
    shutil.copyfileobj(resp, out)
```
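As an aside on the `raise ... from e` recommendation a few comments up: chaining preserves the root cause as `__cause__`, so both tracebacks appear in the error output. A self-contained sketch (the `run_directlfq_like` function is hypothetical, standing in for the DirectLFQ wrapper):

```python
def run_directlfq_like():
    """Illustrates exception chaining with 'raise ... from e'."""
    try:
        # Stand-in for whatever actually fails inside DirectLFQ.
        raise ValueError("input matrix is empty")
    except Exception as e:
        # 'from e' keeps the original exception attached as __cause__.
        raise RuntimeError(f"DirectLFQ quantification failed: {e}") from e

try:
    run_directlfq_like()
except RuntimeError as err:
    # The original exception survives, so tracebacks show both frames.
    assert isinstance(err.__cause__, ValueError)
    assert "input matrix is empty" in str(err.__cause__)
```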
🤖 Fix all issues with AI agents

In `@benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/05_cross_technology_correlation.py`:

- Around lines 55-105: The bug is that `compute_overall_correlation` (and the related `compute_per_protein_correlation` and `compute_per_sample_correlation`) cast column labels to strings to find `common_samples` but then use those string IDs to index the original DataFrames whose columns may be ints, causing `KeyError`; fix by finding common sample labels without changing their dtype (e.g., use `tmt_wide.columns.intersection(lfq_wide.columns)` or compute `common_samples = [c for c in tmt_wide.columns if c in lfq_wide.columns]`) so the returned `common_samples` keep the original type and can be used directly for indexing in `tmt_wide.loc[...]` and `lfq_wide.loc[...]`, and apply the same change in `compute_per_protein_correlation` and `compute_per_sample_correlation`.
In `@benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/06_generate_report.py`:

- Around lines 176-181: The f-string in the loop over `yeast_df.iterrows()` applies the numeric format specifier `:.2f` to `row.get('compression_ratio', 'N/A')`, which raises `ValueError` when the value is the string `'N/A'` or non-numeric/NaN; update the code in the loop that builds `lines` so you detect and normalize `compression_ratio` (and any other possibly-NaN fields like `expected_log2_fc` or `observed_median_log2_fc`) before formatting: retrieve `compression_ratio` into a local variable, use `pandas.isna()` or `isinstance(..., numbers.Number)` to decide whether to format as `"{:.2f}"` or use a placeholder like `"N/A"`, and then use the prepared string in the f-string instead of applying `:.2f` directly to the raw `row.get(...)` expression.
In `@benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/08_normalization_grid.py`:

- Around lines 155-179: The quantile branch (`method == "quantile"`) is incorrect: `sorted_means` is unused, fractional ranks from `rankdata` are truncated, the code maps values to the same-column sorted value (no cross-column normalization), and it uses a slow nested Python loop; replace this block to perform true quantile normalization by computing the mean across columns at each rank and mapping each value to that cross-column mean (handling fractional ranks by averaging appropriate quantile positions), or simply call the existing, tested implementation `mokume.normalization.protein.quantile_normalize` instead of the custom logic; update/remove the unused variable `sorted_means` and eliminate the O(n×m) loops (use vectorized rank/argsort operations or the mokume helper) so `df` → `result` is the proper cross-column ranked-mean mapping.
In `@benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/12_optimal_pipeline.py`:

- Around lines 462-510: The `plot_radar_chart` function is creating regular Cartesian Axes (via `plt.subplots`), so the computed angles are plotted on a linear axis instead of a polar/spider plot; change the axes creation to polar axes for each subplot (use `projection="polar"` when creating the axes for the two panels), ensure the angles (variable `angles`) remain in radians and are used as the theta coordinates, and keep the rest of the plotting logic (`values`, `ax.plot`, `ax.fill`, `ax.set_xticks`/`set_xticklabels`, `ax.set_title`, `ax.legend`) but apply it to the polar Axes returned for `plot_radar_chart` so the radar/spider visualization renders correctly.
In `@mokume/analysis/differential_expression.py`:

- Around lines 100-104: The current code always replaces zeros with NaN before deciding whether to skip log2, which drops valid zeros when `skip_log2` is True; change the logic in the block that builds `intensity_matrix`/`log2_matrix` so that `intensity_matrix = protein_df.set_index(protein_col)[sample_cols]` is created first, then if `self.skip_log2` is False perform zero/negative handling (e.g., replace 0 with `np.nan` and handle negatives) and then compute `log2_matrix = np.log2(intensity_matrix)`, otherwise (`self.skip_log2` True) set `log2_matrix = intensity_matrix` without modifying zeros; update references to `intensity_matrix`/`log2_matrix` accordingly (look for `intensity_matrix`, `log2_matrix`, `self.skip_log2`, `protein_df.set_index`, `sample_cols`).
In `@mokume/io/feature.py`:

- Around lines 515-527: The SQL in `get_irs_scaling_factors` references a non-existent column `"mixture"` when building `irs_df` from the `parquet_db` view; either add a `"mixture"` column to the `parquet_db` (e.g. modify `enrich_with_sdrf` to populate a mixture alias from SDRF/sample metadata) or change the query and subsequent post-processing to use an existing column (for example `"condition"` or `"sample_accession"`) consistently; update the SELECT, GROUP BY and any downstream uses of `irs_df` that expect `mixture` so they reference the chosen existing column or the new mixture alias.
In `@mokume/preprocessing/filters/peptide.py`:

- Around lines 414-417: When `self._compiled_patterns` is empty, `combined_pattern` becomes `""` and `str.contains("")` matches every row, so change the logic in the filtering block to short-circuit: if `self._compiled_patterns` is empty, set `mask` to a Series of True for all of `df.index` (or simply return `df.copy()`) so no rows are filtered; otherwise keep the existing `combined_pattern`/`str.contains` flow that assigns `mask` (referencing `self._compiled_patterns`, `combined_pattern`, `mask`, `df` and `col`).
- Around lines 199-205: The current filter building in the peptide filter uses `patterns = "|".join(...)` on `self.exclude_modifications`, which yields an empty string for an empty list and causes `str.contains("", regex=True)` to match every row; update the logic in the method that computes `mask` (referencing `self.exclude_modifications`, `self.sequence_column`, `df`, `mask`, `filtered_df`) to first check if `self.exclude_modifications` is empty and in that case skip the regex filter (i.e., set `mask` to True for all rows or simply copy `df` to `filtered_df`), otherwise build the joined pattern and apply `str.contains` as before so the empty-list case does not remove all rows.
In `@mokume/preprocessing/filters/protein.py`:

- Around lines 63-66: The current vectorized contaminant matching in the method using `self.patterns` builds `pattern_regex` and then computes `mask = ~upper_col.str.contains(pattern_regex, ...)`, which when `self.patterns` is empty creates an empty regex that matches every row and removes all rows; fix by adding an early check for an empty patterns list (`if not self.patterns` or similar) and return the original DataFrame (or set `mask` to all True) before constructing `pattern_regex`; update uses of `self.protein_column`, `pattern_regex`, and `mask` accordingly so the function preserves rows when there are no patterns.
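Several of these prompts touch `mokume/preprocessing/filters/peptide.py`; the missed-cleavage fix from the out-of-diff comment above (iterate the full sequence, skip the terminal residue, apply the Keil rule) can be sketched as a standalone counter — a hypothetical helper, not the `MissedCleavageFilter` implementation:

```python
def count_missed_cleavages(sequence: str) -> int:
    """Count internal K/R not followed by P (Keil rule for trypsin).

    The C-terminal residue is excluded as a cleavage *position* by
    iterating over the full sequence but skipping the last index,
    rather than truncating the string first.
    """
    seq = sequence.upper()
    count = 0
    for i in range(len(seq) - 1):  # last residue is the peptide terminus
        if seq[i] in "KR" and seq[i + 1] != "P":
            count += 1
    return count

# The cases from the review discussion:
assert count_missed_cleavages("KA") == 1   # truncation-based code returned 0
assert count_missed_cleavages("KR") == 1   # K–R counted; terminal R is not
assert count_missed_cleavages("KAR") == 1
assert count_missed_cleavages("KPR") == 0  # K before P is not cleaved
```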
🟠 Major comments (21)
mokume/normalization/irs.py-96-106 (1)
96-106: ⚠️ Potential issue | 🟠 Major

Division by zero when computing IRS scaling factors.

`plex_ref_intensity[plex]` (line 106) retains the original values from lines 90-94 (which may contain zeros), but only `ref_df` (the copy used for geometric-mean computation) has zeros replaced with NaN on line 99. When a protein has zero reference intensity in a plex, `global_ref / plex_ref_intensity[plex]` produces `inf`, which propagates through `transform()` into the final protein matrix.

Guard against zero/NaN denominators:

🐛 Proposed fix

```diff
 # Compute scaling factors: global_ref / plex_ref for each plex
 scaling_factors = {}
 for plex in plexes:
-    scaling_factors[plex] = global_ref / plex_ref_intensity[plex]
+    denom = plex_ref_intensity[plex].replace(0, np.nan)
+    scaling_factors[plex] = global_ref / denom
```

pyproject.toml-47-47 (1)
47-47: ⚠️ Potential issue | 🟠 Major

Move `statsmodels` to an optional dependency group since differential expression analysis is an optional pipeline feature.

`statsmodels` is imported solely in `mokume/analysis/differential_expression.py` for multiple-testing correction, and the DE feature itself is disabled by default (`n_expression: bool = False`). Following the existing pattern for optional features (e.g., `directlfq`, `batch-correction`), add a `de = ["statsmodels"]` extra to keep the base install lightweight.

mokume/reports/qc_report.py-302-345 (1)
302-345: ⚠️ Potential issue | 🟠 Major

`_compute_variance_decomposition` crashes when `n_components` evaluates to 0.

If there are fewer than 2 samples, `data.shape[0] - 1` can be 0 (or negative with 0 samples), and `PCA(n_components=0)` will raise. Unlike `_compute_silhouette` (which guards `len(data) < 3`), this function has no such guard.

Proposed fix

```diff
 n_components = min(10, data.shape[0] - 1, data.shape[1])
+if n_components < 1:
+    return {
+        "condition_pct": 0.0,
+        "plex_pct": 0.0,
+        "residual_pct": 100.0,
+        "n_pcs_used": 0,
+        "total_var_captured": 0.0,
+    }
 pca = PCA(n_components=n_components)
```

mokume/quantification/ratio.py-326-328 (1)
326-328: ⚠️ Potential issue | 🟠 Major

Use parameterized queries for `parquet_path` to prevent SQL injection and path parsing errors.

Direct f-string interpolation of `parquet_path` into the SQL string is unsafe. If the path contains a single quote or special characters, it will break the SQL syntax. DuckDB supports parameterized queries for `parquet_scan()`, so use the `?` placeholder instead.

Proposed fix

```diff
 conn.execute(
-    f"CREATE VIEW parquet_raw AS SELECT * FROM parquet_scan('{parquet_path}')"
+    "CREATE VIEW parquet_raw AS SELECT * FROM parquet_scan(?)",
+    [parquet_path],
 )
```

mokume/preprocessing/aggregation.py-148-156 (1)
148-156: ⚠️ Potential issue | 🟠 Major

Potential `AttributeError` when `choice` is `None` for non-LFQ labels.

The `choice` parameter is `Optional[IsobaricLabel]`, so it can be `None`. If `label` is not `LFQ` and `choice` is `None`, line 154 calls `choice.channels()` on `None`.

Proposed fix: add a guard

```diff
 if label == QuantificationCategory.LFQ:
     data_df.drop(CHANNEL, inplace=True, axis=1)
 else:
+    if choice is None:
+        raise ValueError("Isobaric label choice must be provided for non-LFQ quantification")
     data_df[CHANNEL] = data_df[CHANNEL].map(choice.channels())
```

mokume/preprocessing/aggregation.py-97-99 (1)
97-99: ⚠️ Potential issue | 🟠 Major

`str.contains` without `na=False` will raise on NaN values. If `protein_field` contains any NaN values, `str.contains(cregex)` returns NaN for those rows, and the boolean indexing with `~` will raise a `ValueError`. The same issue exists in `remove_protein_by_ids` at Line 126.

Proposed fix:

```diff
-return dataset[~dataset[protein_field].str.contains(cregex)]
+return dataset[~dataset[protein_field].str.contains(cregex, na=False)]
```

And in `remove_protein_by_ids`:

```diff
-return dataset[~dataset[protein_field].str.contains(cregex, regex=True)]
+return dataset[~dataset[protein_field].str.contains(cregex, regex=True, na=False)]
```

mokume/io/feature.py-307-316 (1)
307-316: ⚠️ Potential issue | 🟠 Major

SQL injection risk in `get_report_from_database` and `get_report_condition_from_database`. `tuple(samples)` and `tuple(cons)` are interpolated directly into SQL strings via `.format()` / f-strings (Lines 311 and 405). Sample accession strings containing quotes will break or inject SQL. Additionally, if `samples` has exactly one element, `tuple(samples)` produces `('value',)` with a trailing comma — DuckDB handles this, but it's fragile.

Proposed fix using parameterized queries:

```diff
 def get_report_from_database(self, samples: list, columns: list = None):
     cols = ",".join(columns) if columns is not None else "*"
+    placeholders = ",".join(["?"] * len(samples))
     database = self.parquet_db.sql(
-        """SELECT {} FROM parquet_db WHERE sample_accession IN {}""".format(
-            cols, tuple(samples)
-        )
+        f"SELECT {cols} FROM parquet_db WHERE sample_accession IN ({placeholders})",
+        params=samples,
     )
```

mokume/preprocessing/aggregation.py-122-126 (1)
122-126: ⚠️ Potential issue | 🟠 Major

Protein IDs from file are used as regex without escaping special characters. If any protein ID in the file contains regex metacharacters (e.g., `.`, `(`, `+`), the resulting regex will either match unintended strings or raise a `re.error`. Use `re.escape` on each identifier.

Proposed fix:

```diff
+import re as _re
 contaminants = [cont for cont in contaminants if cont.strip()]
-cregex = "|".join(contaminants)
+cregex = "|".join(_re.escape(cont) for cont in contaminants)
 return dataset[~dataset[protein_field].str.contains(cregex, regex=True, na=False)]
```

mokume/io/feature.py-116-155 (1)
116-155: ⚠️ Potential issue | 🟠 Major

Use parameterized queries instead of string interpolation for all SQL statements. The code has multiple SQL injection vulnerabilities caused by interpolating user-controlled or path values directly into SQL:

- Line 124: `database_path` is interpolated into `parquet_scan()` via `.format()`
- Lines 311–312: `cols` and `tuple(samples)` are interpolated into the WHERE clause
- Line 405: `cols` and `tuple(cons)` are interpolated into the WHERE clause
- Line 512: `irs_channel` is interpolated directly into a filter condition

DuckDB's Python API supports parameterized queries with `?` placeholders. Example fix for the `parquet_scan` call:

```diff
-self.parquet_db.execute(
-    "CREATE VIEW parquet_db_raw AS SELECT * FROM parquet_scan('{}')".format(
-        database_path
-    )
-)
+self.parquet_db.execute(
+    "CREATE VIEW parquet_db_raw AS SELECT * FROM parquet_scan(?)",
+    [database_path],
+)
```

Apply the same pattern to the other affected queries at lines 311–312, 405, and 512.
benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/12_optimal_pipeline.py-425-455 (1)
425-455: ⚠️ Potential issue | 🟠 Major

Tick labels reflect only the last technology's pipeline names. `x`, `values`, and `pipelines` are reassigned inside the inner `for tech in ["tmt", "lfq"]` loop. After the loop, `ax.set_xticks(x)` and `ax.set_xticklabels(pipelines, ...)` use whichever technology was processed last. If TMT and LFQ have different pipeline names, the labels will be incorrect for the other technology.

Consider computing a unified set of pipeline names (or using grouped bar positions derived from both technologies) before setting tick labels.
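One way to build that unified label set is to union the pipeline names up front, before the per-technology loop; a minimal sketch with made-up names (the `names_by_tech` dict is illustrative):

```python
import numpy as np

# Hypothetical per-technology pipeline names.
names_by_tech = {
    "tmt": ["p1", "p2", "p3"],
    "lfq": ["p1", "p4"],
}
# Union in first-seen order, so both bar groups share one x axis.
pipelines = list(dict.fromkeys(n for names in names_by_tech.values() for n in names))
x = np.arange(len(pipelines))
assert pipelines == ["p1", "p2", "p3", "p4"]
```

Each technology's bars are then placed at the positions of its own names within `pipelines`, and the tick labels are correct for both groups.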
benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/run_benchmark.py-114-117 (1)
114-117: ⚠️ Potential issue | 🟠 Major

`input()` will hang in non-interactive / CI environments. If this pipeline runs in CI or a headless context, `input()` on line 116 will block indefinitely (or raise `EOFError` if stdin is closed). Consider adding a `--no-interactive` / `--fail-fast` flag that defaults to aborting (or continuing) without prompting.

🛡️ Proposed fix — add a CLI flag and guard the prompt:

```diff
 parser.add_argument(
     "--skip-download",
     action="store_true",
     help="Skip data download step",
 )
+parser.add_argument(
+    "--fail-fast",
+    action="store_true",
+    help="Abort immediately on first failure (no interactive prompt)",
+)
 args = parser.parse_args()
```

```diff
 if not success:
     print(f"\nStep failed: {script}")
-    if input("Continue with remaining steps? (y/n): ").lower() != "y":
+    if args.fail_fast or not sys.stdin.isatty():
+        break
+    elif input("Continue with remaining steps? (y/n): ").lower() != "y":
         break
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/09_quant_norm_interaction.py-88-93 (1)
88-93: ⚠️ Potential issue | 🟠 Major

Division by zero in `globalmedian` normalization when a sample median is zero. If any column is entirely zero or NaN, `sample_medians` contains 0 or NaN entries, making `factors` contain `inf`/NaN. Multiplying by `inf` silently corrupts the entire sample column.

Suggested fix:

```diff
 elif method == "globalmedian":
     global_median = df.median().median()
     sample_medians = df.median()
+    sample_medians = sample_medians.replace(0, np.nan)
     factors = global_median / sample_medians
-    return df * factors
+    return df * factors.fillna(1.0)
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/09_quant_norm_interaction.py-149-155 (1)
149-155: ⚠️ Potential issue | 🟠 Major

Same `inf` contamination from zero-division as in `03_fold_change_accuracy.py`. `condition_intensities[cond1] / condition_intensities[cond2]` produces `inf` when `cond2` means are zero. `dropna()` does not remove `inf`, so `np.log2(inf) = inf` flows into the RMSE calculation.

Suggested fix:

```diff
 obs_fc = (condition_intensities[cond1] / condition_intensities[cond2]).dropna()
+obs_fc = obs_fc.replace([np.inf, -np.inf], np.nan).dropna()
 obs_log2_fc = np.log2(obs_fc)
```

mokume/normalization/tmm.py-83-98 (1)
83-98: ⚠️ Potential issue | 🟠 Major

Missing parameter validation for trim fractions. `m_trim` and `a_trim` are used as trim proportions but are never validated. Values outside `(0, 0.5)` would cause silent misbehavior — e.g., `m_trim=0.6` trims more than half from each side, producing `m_lo > m_hi` and an empty keep set; `m_trim=-0.1` or `m_trim=1.0` would produce nonsensical bounds.

Suggested validation:

```diff
 def __init__(
     self,
     m_trim: float = 0.3,
     a_trim: float = 0.05,
     ref_sample: Optional[str] = None,
     log_transform: bool = False,
 ):
+    if not 0 < m_trim < 0.5:
+        raise ValueError(f"m_trim must be in (0, 0.5), got {m_trim}")
+    if not 0 < a_trim < 0.5:
+        raise ValueError(f"a_trim must be in (0, 0.5), got {a_trim}")
     self.m_trim = m_trim
     self.a_trim = a_trim
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/03_fold_change_accuracy.py-121-142 (1)
121-142: ⚠️ Potential issue | 🟠 Major

`np.inf` values from zero-mean conditions pass the `notna` check and contaminate results. When `mean2` is 0 for a protein, `log2(mean1 / 0)` yields `inf`. Since `pd.notna(np.inf)` returns `True`, these infinite fold-changes are silently included in the results, corrupting downstream RMSE and bias calculations.

Suggested fix:

```diff
 log2_fc = np.log2(mean1 / mean2)
+log2_fc = log2_fc.replace([np.inf, -np.inf], np.nan)
 expected_log2_fc = np.log2(expected_fc)
```

mokume/plotting/differential_expression.py-16-16 (1)
16-16: ⚠️ Potential issue | 🟠 Major

`matplotlib.use("Agg")` at module level breaks interactive environments. This is library code imported by other modules. Unconditionally forcing the Agg backend at import time will prevent figure display in Jupyter notebooks, IPython, and any interactive session. This side-effect belongs in scripts/entry points, not in a library module.

Proposed fix:

```diff
-matplotlib.use("Agg")
```

Remove this line. Callers who need a non-interactive backend can set it themselves before importing matplotlib/pyplot.
benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/01_grid_search_methods.py-101-132 (1)
101-132: ⚠️ Potential issue | 🟠 Major

Silent fallthrough for unrecognized normalization methods returns raw data. If `method` doesn't match any known branch (e.g., a typo like `"meadian"`), execution falls through to Line 132 and returns the original `df_wide` without normalization or any warning. The same pattern exists in `apply_imputation` (Line 157). This silently masks configuration errors.

Proposed fix:

```diff
-return df_wide
+raise ValueError(f"Unknown normalization method: {method}")
```

And similarly for `apply_imputation`:

```diff
-return df_wide
+raise ValueError(f"Unknown imputation method: {method}")
```

mokume/plotting/differential_expression.py-79-79 (1)
79-79: ⚠️ Potential issue | 🟠 Major

`str.contains` without `regex=False` treats gene names as regex patterns. Protein/gene names can contain characters like `+`, `.`, `(`, `)` (e.g., `"S100A8+A9"`, `"TP53(1-393)"`). Without `regex=False`, these are interpreted as regex metacharacters, leading to either wrong matches or `re.error` exceptions.

```diff
-matches = df[df["ProteinName"].str.contains(gene, case=False, na=False)]
+matches = df[df["ProteinName"].str.contains(gene, case=False, na=False, regex=False)]
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/08_normalization_grid.py-398-433 (1)
398-433: ⚠️ Potential issue | 🟠 Major

`plot_comparison_bars` will crash if fewer than 5 results exist for a technology. Line 420 creates `x = np.arange(5)`, and Line 419 takes `.head(5)`. If `tech_df` has fewer than 5 rows, `top5` will have fewer entries, but `x` is always length 5. `ax.barh(x, top5[metric])` will raise a length mismatch error.

```diff
-top5 = tech_df.head(5)
-x = np.arange(5) + i * 6
+top_n = tech_df.head(5)
+x = np.arange(len(top_n)) + i * 6
+colors = ["#2ecc71" if j == 0 else "#3498db" for j in range(len(top_n))]
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/10_imputation_benchmark.py-295-308 (1)
295-308: ⚠️ Potential issue | 🟠 Major

`UnboundLocalError` for `x` and `methods` when both technologies have no data. After the inner `for i, tech in ...` loop, `x` and `methods` (Lines 305–306) are referenced outside the loop. If both `"tmt"` and `"lfq"` DataFrames are empty (e.g., no results yet), the `continue` on Line 290 skips the assignments and `x`/`methods` are never defined.

Proposed fix:

```diff
 for ax, (metric, title, lower_better) in zip(axes.flat, metrics):
+    x = np.array([])
+    methods = []
     for i, tech in enumerate(["tmt", "lfq"]):
         tech_df = results_df[results_df["technology"] == tech]
         if len(tech_df) == 0:
             continue
         methods = tech_df["imputation"].tolist()
         values = tech_df[metric].tolist()
         x = np.arange(len(methods))
         ...
-    ax.set_xticks(x)
-    ax.set_xticklabels(methods, rotation=45, ha="right")
+    if len(x) > 0:
+        ax.set_xticks(x)
+        ax.set_xticklabels(methods, rotation=45, ha="right")
```

mokume/pipeline/stages.py-731-736 (1)
731-736: ⚠️ Potential issue | 🟠 Major

Contrast parsing with `split("-")` breaks on condition names containing hyphens. If a condition name contains a hyphen (e.g., `"non-treated-control"`), `c.split("-")` will produce more than 2 parts, silently skipping a valid contrast. Consider using `split("-", 1)` or a different delimiter (e.g., `" vs "`).

Proposed fix:

```diff
 if self.config.de.contrasts:
     for c in self.config.de.contrasts:
-        parts = c.split("-")
+        parts = c.split("-", 1)
         if len(parts) == 2:
             contrasts.append((parts[0].strip(), parts[1].strip()))
         else:
             logger.warning(f"Invalid contrast format '{c}', expected 'A-B'")
```

Note: Even with `maxsplit=1`, a condition like `"A-B-C"` would yield `("A", "B-C")`. If both condition names can contain hyphens, a different delimiter (e.g., `" vs "`) would be more robust.
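The `maxsplit=1` behavior can be sketched in isolation (`parse_contrast` is a hypothetical helper for illustration, not part of the codebase):

```python
def parse_contrast(c: str) -> tuple:
    # maxsplit=1 keeps hyphens inside the second condition name intact
    parts = c.split("-", 1)
    if len(parts) != 2:
        raise ValueError(f"Invalid contrast format '{c}', expected 'A-B'")
    return parts[0].strip(), parts[1].strip()

assert parse_contrast("treated-control") == ("treated", "control")
# the hyphenated second condition survives; with a plain split("-") it would not
assert parse_contrast("treated-non-treated-control") == ("treated", "non-treated-control")
```

This illustrates both sides of the note above: the first condition can never contain a hyphen under this scheme, which is why a delimiter that cannot occur in condition names is the more robust option.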
🟡 Minor comments (34)
docs/user-guide/batch-correct.md-1-3 (1)
1-3: ⚠️ Potential issue | 🟡 Minor

Title/command name mismatch may confuse users. The page title says "batch-correct" but the actual CLI command referenced on Line 3 (and throughout) is `correct-batches`. Consider aligning the title with the command name, e.g., `# correct-batches: Batch Correction`.

mokume/postprocessing/batch_correction.py-126-144 (1)
126-144: ⚠️ Potential issue | 🟡 Minor

Regex fallback doesn't cover all reasonable non-hyphen sample name patterns. The regex `^([a-zA-Z]+\d+)_` requires at least one letter followed by at least one digit before the underscore. Sample names like `"sample_1"` (letters only, no trailing digits before `_`) or `"1_sample"` (digits first) will silently fall through to the full-string fallback, effectively assigning each such sample to its own batch. This may produce unexpected single-sample batches that later fail the `TooFewSamplesInBatch` check.

Consider either broadening the regex to capture more patterns (e.g., `r"^(.+?)_\d+$"` to split on the last `_digits` suffix), or logging a warning when the fallback is used so users know the prefix extraction didn't match.

Suggested improvement:

```diff
-# Match leading non-numeric + optional digits as prefix (e.g. p1_1 → p1)
-m = re.match(r"^([a-zA-Z]+\d+)_", s)
+# Match everything before the last _digits suffix (e.g. p1_1 → p1)
+m = re.match(r"^(.+?)_\d+$", s)
 if m:
     prefixes.append(m.group(1))
 else:
+    logger.warning(
+        f"Could not extract batch prefix from '{s}', using full name"
+    )
     prefixes.append(s)
```

mokume/reports/interactive.py-227-227 (1)
227-227: ⚠️ Potential issue | 🟡 Minor

Unpinned Plotly CDN version; inconsistent with `qc_report.py`. This file uses `plotly-latest.min.js` while `qc_report.py` (line 580) pins `plotly-2.27.0.min.js`. Using `latest` is fragile — a breaking Plotly release could silently break the interactive report. Pin to the same version for consistency and reproducibility.

Proposed fix:

```diff
-<script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
+<script src="https://cdn.plot.ly/plotly-2.27.0.min.js"></script>
```

mokume/reports/qc_report.py-797-803 (1)
797-803: ⚠️ Potential issue | 🟡 Minor

JS accesses `pca.var_explained[1]`, which is undefined when PCA has only 1 component. When `n_components == 1`, the `var_explained` array has a single element. Accessing index `[1]` on Line 800 yields `undefined`, rendering `"PC2 (undefined%)"` in the axis label. Guard with a fallback.

Proposed fix:

```diff
 xaxis: {{ title: 'PC1 (' + pca.var_explained[0].toFixed(1) + '%)' }},
-yaxis: {{ title: 'PC2 (' + pca.var_explained[1].toFixed(1) + '%)' }},
+yaxis: {{ title: 'PC2 (' + (pca.var_explained.length > 1 ? pca.var_explained[1].toFixed(1) : '0.0') + '%)' }},
```

mokume/preprocessing/sdrf.py-16-47 (1)
16-47: ⚠️ Potential issue | 🟡 Minor

Potential `KeyError` if the `comment[technical replicate]` column is absent. Not all SDRF files contain the `comment[technical replicate]` column. If it's missing, Line 46 will raise an unhandled `KeyError`. Consider guarding with a default:

```diff
-technical_repetitions = len(sdrf_df["comment[technical replicate]"].unique())
+if "comment[technical replicate]" in sdrf_df.columns:
+    technical_repetitions = len(sdrf_df["comment[technical replicate]"].unique())
+else:
+    technical_repetitions = 1
```

mokume/reports/interactive.py-432-435 (1)
432-435: ⚠️ Potential issue | 🟡 Minor

Protein names are injected into `innerHTML` without escaping — potential DOM XSS. If a protein name contains HTML metacharacters (e.g. `<`, `>`), it will be interpreted as markup. While unlikely in typical proteomics accessions, this is still unsanitized user-derived data flowing into `innerHTML`. Use `textContent` or escape HTML entities.

Sketch: escape helper in JS

```diff
+function escapeHtml(str) {
+  return str.replace(/&/g,'&amp;').replace(/</g,'&lt;').replace(/>/g,'&gt;').replace(/"/g,'&quot;');
+}
+
 // In populateTable:
-tr.innerHTML = '<td>' + (row.highlighted ? '<b>' + row.protein + '</b>' : row.protein) + '</td>'
+tr.innerHTML = '<td>' + (row.highlighted ? '<b>' + escapeHtml(row.protein) + '</b>' : escapeHtml(row.protein)) + '</td>'
```
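Since the report HTML is generated in Python, the same escaping can alternatively be applied server-side before the value is embedded in the page; a small stdlib sketch (the protein string is a contrived example):

```python
import html

# A hostile "protein name" carrying markup -- contrived, but shows the failure mode.
protein = '<img src=x onerror=alert(1)>P53'
cell = "<td>" + html.escape(protein) + "</td>"
# The metacharacters are now inert entities instead of live markup.
assert cell == "<td>&lt;img src=x onerror=alert(1)&gt;P53</td>"
```

Escaping at generation time covers the static parts of the report, but values injected later by JS (as in the snippet above) still need the client-side helper or `textContent`.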
80-91:⚠️ Potential issue | 🟡 MinorSQL
LIKEwildcards in contaminant patterns are not escaped.The
safe_patternonly escapes single quotes, but DuckDBLIKEtreats%and_as wildcards. If a contaminant pattern contained these characters, matches would be broader than intended. Low risk since the patterns are typically controlled strings like"CONTAMINANT".mokume/preprocessing/aggregation.py-186-186 (1)
186-186: ⚠️ Potential issue | 🟡 Minor

Hardcoded `"Condition"` instead of the `CONDITION` constant. The rest of this module uses `CONDITION` (imported from `mokume.core.constants`), but this line uses a hardcoded `"Condition"` string twice. If `CONDITION` ever changes, this line will silently break.

Proposed fix:

```diff
-data_df = data_df[(data_df["Condition"] != "Empty") | (data_df["Condition"].isnull())]
+data_df = data_df[(data_df[CONDITION] != "Empty") | (data_df[CONDITION].isnull())]
```

mokume/io/feature.py-393-398 (1)
393-398: ⚠️ Potential issue | 🟡 Minor

Division by zero if `global_med` or a per-sample median is zero. If all sample medians are zero (e.g., heavily filtered data), `global_med` will be 0 and the division at Line 397 will produce `inf`/NaN. Same risk in `get_median_map_to_condition` at Line 463, where `meds.mean()` could be 0.

Proposed guard:

```diff
 global_med = np.median(list(med_map.values()))
+if global_med == 0:
+    logger.warning("Global median intensity is 0; normalization factors will be set to 1.")
+    return {s: 1.0 for s in med_map}
 for sample, med in med_map.items():
-    med_map[sample] = med / global_med
+    med_map[sample] = med / global_med if global_med != 0 else 1.0
```

docs/reference/cli.md-226-226 (1)
226-226: ⚠️ Potential issue | 🟡 Minor

`-o` short flag conflicts with typical "output" semantics, and the table appears truncated. Using `-o` for `--pattern` is confusing since `-o` conventionally means `--output`. Also, the tsne-visualization options table seems incomplete — it ends abruptly without closing.

docs/reference/cli.md-3-3 (1)
3-3: ⚠️ Potential issue | 🟡 Minor

Incorrect command count: "four" should be "five". The page documents five commands (features2proteins, features2peptides, peptides2protein, correct-batches, tsne-visualization).

Proposed fix:

```diff
-mokume provides four main commands. Use `mokume --help` or `mokume <command> --help` for details.
+mokume provides five main commands. Use `mokume --help` or `mokume <command> --help` for details.
```

docs/reference/configuration.md-123-125 (1)
123-125: ⚠️ Potential issue | 🟡 Minor

Example uses `method="median"` but the default/recommended is `"maxlfq"`. The configuration table (line 49) and the Quick Example on the landing page both default to `"maxlfq"`. Using `"median"` here without explanation may confuse users copying the example. Consider using `"maxlfq"` for consistency, or add a comment explaining the choice.

Proposed fix:

```diff
 quantification=QuantificationConfig(
-    method="median",
+    method="maxlfq",
 ),
```

docs/reference/computed-values.md-12-12 (1)
12-12: ⚠️ Potential issue | 🟡 Minor

`IbaqPpb` multiplier doesn't match "parts per billion". "Parts per billion" implies multiplication by 10⁹ (1,000,000,000), but the formula shows `IbaqNorm * 100,000,000`, which is 10⁸. Either the multiplier should be `1,000,000,000` or the description should say "parts per hundred million."

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/00_download_data.py-22-24 (1)
22-24: ⚠️ Potential issue | 🟡 Minor

Partial downloads will be silently accepted and never re-downloaded. If a download is interrupted, a partial file will remain on disk. On the next run, the `dest.exists()` check on line 22 will skip it, leaving the user with a corrupt file. Consider checking file size against `Content-Length` or adding a checksum verification step.

Also applies to: 38-38
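A size check against the server-reported length could look like this sketch (`needs_download` is a hypothetical helper, not part of the benchmark scripts):

```python
import tempfile
from pathlib import Path

def needs_download(dest: Path, expected_size: int) -> bool:
    # Treat a missing or truncated file as "not downloaded yet".
    return not dest.exists() or dest.stat().st_size != expected_size

with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / "partial.bin"
    p.write_bytes(b"abc")            # simulate an interrupted 3-byte download
    assert needs_download(p, 10)     # server reported 10 bytes -> retry
    assert not needs_download(p, 3)  # sizes agree -> skip
```

The expected size would come from the `Content-Length` response header before (re)downloading; a checksum comparison is stronger but requires the remote side to publish digests.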
benchmarks/quant-pxd007683-tmt-vs-lfq/results/BENCHMARK_REPORT.md-32-40 (1)
32-40: ⚠️ Potential issue | 🟡 Minor

Misleading interpretation: 1% variance is not "a significant portion." The table shows condition explains only ~1% of variance (TMT: 1.0%, LFQ: 0.9%) with ~99% residual, yet lines 39-40 state "Condition (yeast spike-in level) explains a significant portion of variance." This contradicts the data. Consider revising to something like "Condition explains only a small portion of overall variance, suggesting technical variability dominates."

docs/concepts/quantification.md-7-16 (1)
docs/concepts/quantification.md-7-16 (1)
7-16:⚠️ Potential issue | 🟡 MinorInconsistent "Class" column:
peptides_to_protein()is a function, not a class.The overview table header says "Class" but the iBAQ row lists a function call
peptides_to_protein(). For consistency with the other rows (which list class names likeTopNQuantification), consider either renaming the column to "API" or listing the actual class/module used for iBAQ.benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/00_generate_ibaq.py-183-184 (1)
183-184: ⚠️ Potential issue | 🟡 Minor

Hardcoded step number "2." prints for both TMT and LFQ iterations. Both loop iterations display "2. Processing …", making the output misleading. Use an enumerated counter.

🐛 Proposed fix:

```diff
-for technology in ["tmt", "lfq"]:
-    print(f"\n2. Processing {technology.upper()}...")
+for step_idx, technology in enumerate(["tmt", "lfq"], start=2):
+    print(f"\n{step_idx}. Processing {technology.upper()}...")
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/12_optimal_pipeline.py-356-407 (1)
356-407: ⚠️ Potential issue | 🟡 Minor

On exception, `run_pipeline` returns a dict missing metric keys, which may cause a downstream `KeyError` or silent NaN propagation. If the `try` block at line 363 raises, `results` is returned without `n_proteins`, `within_cv`, `fc_rmse`, etc. `pd.DataFrame(all_results)` will fill these as NaN, which is tolerable, but downstream code (e.g., line 400 with `:.3f` formatting, line 585 with `idxmax()`) may fail on NaN or produce confusing output. Consider populating default metric values in the error path.

🛡️ Proposed fix:

```diff
 except Exception as e:
     print(f"  ERROR: {e}")
     import traceback
     traceback.print_exc()
+    results.update({
+        "n_proteins": 0,
+        "n_yeast": 0,
+        "within_cv": np.nan,
+        "fc_rmse": np.nan,
+        "pct_good_cv": 0,
+    })
 return results
```

mokume/normalization/tmm.py-100-130 (1)
100-130: ⚠️ Potential issue | 🟡 Minor

`_select_reference_sample` will raise on all-NaN upper quartiles. If every sample has only zeros/NaN, `uq_series` will be all NaN, `mean_uq` will be NaN, and `(uq_series - mean_uq).abs().idxmin()` raises `ValueError`. While unlikely with real data, an early check and a clear error message would be more user-friendly than a cryptic `idxmin` error.

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/04_stability_metrics.py-282-284 (1)
282-284: ⚠️ Potential issue | 🟡 Minor

Potential `KeyError` when `quality_within` is an empty dict. If all within-condition CVs are empty (e.g., all samples map to "Unknown" or conditions have fewer than 2 samples), `assess_quality` returns `{}`, and `metrics['quality_within']['good']` will raise a `KeyError`.

Suggested fix:

```diff
-print(f"  Quality: {metrics['quality_within']['good']*100:.1f}% proteins with CV < 20%")
+pct_good = metrics['quality_within'].get('good', 0) * 100
+print(f"  Quality: {pct_good:.1f}% proteins with CV < 20%")
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/06_generate_report.py-347-348 (1)
347-348: ⚠️ Potential issue | 🟡 Minor

Hardcoded `*.png` glob ignores `FIGURE_FORMAT` from config. The figures section enumerates files matching `*.png`, but the benchmark scripts save figures using the `FIGURE_FORMAT` config constant. If the format is changed (e.g., to `svg` or …)

Suggested fix:

```diff
-for fig in sorted(FIGURES_DIR.glob("*.png")):
+for fig in sorted(FIGURES_DIR.glob(f"*.{FIGURE_FORMAT}")):
```

This requires adding `FIGURE_FORMAT` to the import from `config`.

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/07_tmm_normalization_benchmark.py-381-383 (1)
381-383: ⚠️ Potential issue | 🟡 Minor

Unsafe `values[0]` indexing when `maxlfq` might not be in results. If `maxlfq` data was not available for a technology, the filter `tech_data[tech_data['method']=='maxlfq']` returns an empty DataFrame, and `.values[0]` raises `IndexError`.

Suggested fix:

```diff
-print(f"  Factor variation (maxlfq): {tech_data[tech_data['method']=='maxlfq']['factor_std'].values[0]:.3f}")
+maxlfq_data = tech_data[tech_data['method'] == 'maxlfq']
+if len(maxlfq_data) > 0:
+    print(f"  Factor variation (maxlfq): {maxlfq_data['factor_std'].values[0]:.3f}")
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/04_stability_metrics.py-103-106 (1)
103-106: ⚠️ Potential issue | 🟡 Minor

Hardcoded methods list duplicates config and diverges from `get_methods_for_technology`. `config.py` already provides `get_methods_for_technology()` and per-technology method lists. The hardcoded list here omits `ibaq` (present in config) and includes `directlfq` for both technologies, which contradicts the documented constraint that DirectLFQ should not be used for TMT.

Suggested fix:

```diff
-methods = ["sum", "top3", "top5", "top10", "maxlfq", "directlfq"]
+methods = get_methods_for_technology(technology)
```

This requires importing `get_methods_for_technology` from config (add to the existing import block).

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/03_fold_change_accuracy.py-54-59 (1)
54-59: ⚠️ Potential issue | 🟡 Minor

Return type annotation is inaccurate — the function can return `None`. The function signature says `-> pd.DataFrame` but returns `None` when the parquet file doesn't exist. This should be `-> Optional[pd.DataFrame]` to match the actual behavior and avoid type-checking confusion.

Suggested fix:

```diff
-def load_species_info(technology: str) -> pd.DataFrame:
+def load_species_info(technology: str) -> Optional[pd.DataFrame]:
```

(Requires `from typing import Optional` — add to the imports.)

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/07_tmm_normalization_benchmark.py-132-134 (1)
132-134: ⚠️ Potential issue | 🟡 Minor

Division by zero when `within_cv_median` is zero or NaN. If `metrics_before['within_cv_median']` is 0 (perfectly reproducible data) or `NaN` (no valid conditions), this line will produce `inf` or `NaN` silently, which propagates into the results dict and CSV output.

Suggested fix:

```diff
-cv_improvement = (metrics_before['within_cv_median'] - metrics_after['within_cv_median']) / metrics_before['within_cv_median'] * 100
+before_cv = metrics_before['within_cv_median']
+after_cv = metrics_after['within_cv_median']
+if before_cv and np.isfinite(before_cv) and before_cv != 0:
+    cv_improvement = (before_cv - after_cv) / before_cv * 100
+else:
+    cv_improvement = np.nan
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/09_quant_norm_interaction.py-340-349 (1)
340-349: ⚠️ Potential issue | 🟡 Minor

`x` and `quant_methods` may be unbound if no technology has data. `x` (Line 330) and `quant_methods` (Line 326) are defined inside the inner `for i, tech` loop. If both technologies have empty DataFrames for a metric (both hit `continue` at Line 318), these variables are never assigned, and Lines 346–347 raise `NameError`.

Suggested fix:

```diff
+x = np.array([])
+quant_methods = []
 for i, tech in enumerate(["tmt", "lfq"]):
     tech_df = results_df[results_df["technology"] == tech]
     if len(tech_df) == 0:
         continue
     ...
 ax.set_xlabel("Quantification Method")
 ax.set_ylabel(metric)
 ax.set_title(title)
-ax.set_xticks(x)
-ax.set_xticklabels(quant_methods, rotation=45, ha="right")
+if len(x) > 0:
+    ax.set_xticks(x)
+    ax.set_xticklabels(quant_methods, rotation=45, ha="right")
 ax.legend()
 ax.grid(True, alpha=0.3, axis="y")
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/02_variance_decomposition.py-196-207 (1)
196-207: ⚠️ Potential issue | 🟡 Minor

Potential `IndexError` on `var_explained[1]` if only 1 PCA component is computed. `n_components` at Line 135 is `min(5, X.shape[0], X.shape[1])`. If there are fewer than 2 samples (columns in `complete_data`), `X.shape[0]` would be 1, producing a single component. Accessing `var_explained[1]` at Lines 200 and 232 would then raise an `IndexError`. While unlikely for this dataset, a guard would make this robust.

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/08_normalization_grid.py-98-108 (1)
98-108: ⚠️ Potential issue | 🟡 Minor

Same division-by-zero risk in `globalmedian` normalization. If any `sample_medians` entry is zero (all-NaN or all-zero column after cleaning), the division on Line 107 produces `inf`. This pattern recurs across scripts 08, 10, and 11.

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/11_preprocessing_filters.py-270-275 (1)
270-275: ⚠️ Potential issue | 🟡 Minor

Division by zero in normalization when a sample's median is zero. Same issue as in `10_imputation_benchmark.py`: if any `sample_medians` entry is zero, `global_median / sample_medians` produces `inf`, corrupting that column.

```diff
 global_median = df_quant.median().median()
 sample_medians = df_quant.median()
+sample_medians = sample_medians.replace(0, np.nan)
 factors = global_median / sample_medians
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/02_variance_decomposition.py-313-313 (1)
313-313: ⚠️ Potential issue | 🟡 Minor

Hardcoded `.png` in log message while the actual file uses `FIGURE_FORMAT`. Line 311 saves the file with `FIGURE_FORMAT` (could be `svg`, …), while the log message hardcodes `.png`. This will confuse users when the configured format is not PNG.

```diff
-print(f"Combined PCA plot saved to: {FIGURES_DIR / 'pca_combined.png'}")
+print(f"Combined PCA plot saved to: {FIGURES_DIR / f'pca_combined.{FIGURE_FORMAT}'}")
```

benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/10_imputation_benchmark.py-121-131 (1)
121-131: ⚠️ Potential issue | 🟡 Minor

Division by zero when `sample_medians` contains zero or NaN. If all values in a sample column are `NaN` (e.g., after imputation of an all-missing column), `df.median()` returns `NaN` for that column. The division `global_median / sample_medians` then produces `NaN` silently, which is tolerable — but if a column has a median of exactly `0`, this produces `inf`, corrupting that column's data.

Proposed fix:

```diff
 def apply_sample_normalization(df: pd.DataFrame, method: str = "globalmedian") -> pd.DataFrame:
     """Apply sample-level normalization."""
     if method == "none":
         return df.copy()
     elif method == "globalmedian":
         global_median = df.median().median()
         sample_medians = df.median()
+        sample_medians = sample_medians.replace(0, np.nan)
         factors = global_median / sample_medians
         return df * factors
```

mokume/pipeline/stages.py-501-520 (1)
501-520: ⚠️ Potential issue | 🟡 Minor

Fallback intensity column selection (`numeric_cols[-1]`) is fragile. If `"Intensity"` is not present, the code falls back to the last numeric column. This is order-dependent and could silently pick the wrong column (e.g., a peptide count column). Consider raising an error or logging a warning when the expected column is missing rather than silently guessing.

mokume/pipeline/stages.py-668-674 (1)
668-674: ⚠️ Potential issue | 🟡 Minor

Proteins with any missing sample are left entirely uncorrected. Lines 668-670 split on `isna().any(axis=1)`, meaning a protein missing intensity in even one sample out of many is excluded from ComBat. Depending on the dataset sparsity, this could leave a large fraction of proteins uncorrected, silently degrading the batch correction. Consider imputation before ComBat or at least logging the fraction excluded.

mokume/pipeline/features_to_proteins.py-83-95 (1)
83-95: ⚠️ Potential issue | 🟡 Minor

Missing existence check for SDRF file. Parquet (line 85) and FASTA (line 94) files are validated for existence, but `config.input.sdrf` is not. A non-existent SDRF path will produce a confusing error deep in processing (e.g., during `analyse_sdrf` or `detect_condition_from_sdrf`).

Proposed fix:

```diff
 if self.config.input.fasta_file and not Path(self.config.input.fasta_file).exists():
     raise FileNotFoundError(f"FASTA file not found: {self.config.input.fasta_file}")
+
+if self.config.input.sdrf and not Path(self.config.input.sdrf).exists():
+    raise FileNotFoundError(f"SDRF file not found: {self.config.input.sdrf}")
```
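Since the same check now appears three times (parquet, FASTA, SDRF), it could be factored into a small guard; a sketch (`require_file` is a hypothetical helper, not the project's API):

```python
from pathlib import Path
from typing import Optional

def require_file(path: Optional[str], label: str) -> None:
    # Fail fast with a clear message instead of a confusing downstream error.
    if path and not Path(path).exists():
        raise FileNotFoundError(f"{label} file not found: {path}")

require_file(None, "SDRF")  # no-op when the input is not configured
try:
    require_file("does_not_exist.sdrf.tsv", "SDRF")
    raised = False
except FileNotFoundError as e:
    raised = "SDRF file not found" in str(e)
assert raised
```

The pipeline's validation step would then call `require_file` once per configured input path.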
```python
def compute_overall_correlation(tmt_wide: pd.DataFrame, lfq_wide: pd.DataFrame) -> dict:
    """Compute overall correlation between TMT and LFQ."""
    # Find common proteins
    common_proteins = tmt_wide.index.intersection(lfq_wide.index)

    if len(common_proteins) == 0:
        return {"error": "No common proteins found"}

    # Find common samples (by sample number)
    tmt_samples = set(str(s) for s in tmt_wide.columns)
    lfq_samples = set(str(s) for s in lfq_wide.columns)
    common_samples = sorted(tmt_samples & lfq_samples, key=lambda x: int(x) if x.isdigit() else 0)

    if len(common_samples) == 0:
        return {"error": "No matching samples found"}

    # Create aligned matrices
    tmt_aligned = tmt_wide.loc[common_proteins, common_samples]
    lfq_aligned = lfq_wide.loc[common_proteins, common_samples]

    # Log transform for correlation
    tmt_log = np.log10(tmt_aligned.replace(0, np.nan))
    lfq_log = np.log10(lfq_aligned.replace(0, np.nan))

    # Flatten and remove NaN pairs
    tmt_flat = tmt_log.values.flatten()
    lfq_flat = lfq_log.values.flatten()

    valid = ~(np.isnan(tmt_flat) | np.isnan(lfq_flat))
    tmt_valid = tmt_flat[valid]
    lfq_valid = lfq_flat[valid]

    if len(tmt_valid) < 10:
        return {"error": "Not enough valid values for correlation"}

    # Compute correlations
    pearson_r, pearson_p = stats.pearsonr(tmt_valid, lfq_valid)
    spearman_r, spearman_p = stats.spearmanr(tmt_valid, lfq_valid)

    return {
        "n_proteins": len(common_proteins),
        "n_samples": len(common_samples),
        "n_values": len(tmt_valid),
        "pearson_r": pearson_r,
        "pearson_p": pearson_p,
        "spearman_r": spearman_r,
        "spearman_p": spearman_p,
        "tmt_log": tmt_log,
        "lfq_log": lfq_log,
        "common_samples": common_samples,
    }
```
Type mismatch: string sample IDs used to index potentially integer DataFrame columns.
Lines 64–65 convert column labels to strings for set intersection, producing string-typed common_samples. However, Line 72 then uses these strings to index tmt_wide and lfq_wide, whose columns retain their original type from the Parquet file (often int). This would raise KeyError if SampleID is stored as integer.
The same pattern appears in compute_per_protein_correlation (Line 114) and compute_per_sample_correlation (Line 145).
Proposed fix

```diff
  # Find common samples (by sample number)
- tmt_samples = set(str(s) for s in tmt_wide.columns)
- lfq_samples = set(str(s) for s in lfq_wide.columns)
- common_samples = sorted(tmt_samples & lfq_samples, key=lambda x: int(x) if x.isdigit() else 0)
+ tmt_sample_map = {str(s): s for s in tmt_wide.columns}
+ lfq_sample_map = {str(s): s for s in lfq_wide.columns}
+ common_keys = sorted(set(tmt_sample_map) & set(lfq_sample_map),
+                      key=lambda x: int(x) if x.isdigit() else 0)
+ tmt_common = [tmt_sample_map[k] for k in common_keys]
+ lfq_common = [lfq_sample_map[k] for k in common_keys]
  ...
  # Create aligned matrices
- tmt_aligned = tmt_wide.loc[common_proteins, common_samples]
- lfq_aligned = lfq_wide.loc[common_proteins, common_samples]
+ tmt_aligned = tmt_wide.loc[common_proteins, tmt_common]
+ lfq_aligned = lfq_wide.loc[common_proteins, lfq_common]
```
+ lfq_aligned = lfq_wide.loc[common_proteins, lfq_common]🤖 Prompt for AI Agents
In
`@benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/05_cross_technology_correlation.py`
around lines 55 - 105, The bug is that compute_overall_correlation (and the
related compute_per_protein_correlation and compute_per_sample_correlation) cast
column labels to strings to find common_samples but then use those string IDs to
index the original DataFrames whose columns may be ints, causing KeyError; fix
by finding common sample labels without changing their dtype (e.g., use
tmt_wide.columns.intersection(lfq_wide.columns) or compute common_samples = [c
for c in tmt_wide.columns if c in lfq_wide.columns]) so the returned
common_samples keep the original type and can be used directly for indexing in
tmt_wide.loc[...] and lfq_wide.loc[...], and apply the same change in
compute_per_protein_correlation and compute_per_sample_correlation.
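The dtype pitfall is easy to reproduce in isolation. The sketch below uses a toy matrix (not the benchmark data) whose columns are ints, as when SampleID is stored as an integer in Parquet:

```python
import pandas as pd

# Toy wide matrix with integer column labels.
wide = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=["P1", "P2"], columns=[1, 2])

# Casting labels to str for the set intersection loses the original dtype,
# so indexing with the string keys raises KeyError.
str_keys = sorted(str(c) for c in wide.columns)
try:
    wide.loc[:, str_keys]
    raised = False
except KeyError:
    raised = True

# Mapping each string key back to its original label keeps indexing valid.
sample_map = {str(c): c for c in wide.columns}
aligned = wide.loc[:, [sample_map[k] for k in str_keys]]
```

This is the same str-to-original-label mapping idea as the proposed fix above, reduced to two columns.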
```python
for _, row in yeast_df.iterrows():
    lines.append(
        f"| {row['technology'].upper()} | {row['comparison']} | "
        f"{row['expected_log2_fc']:.2f} | {row['observed_median_log2_fc']:.2f} | "
        f"{row.get('compression_ratio', 'N/A'):.2f} | {row['rmse']:.3f} |"
    )
```
:.2f format on a string default causes ValueError.
If compression_ratio is absent from the row, row.get('compression_ratio', 'N/A') returns the string 'N/A', and f"{...:.2f}" raises ValueError: Unknown format code 'f' for object of type 'str'. Similarly, NaN values format as nan rather than a user-friendly placeholder.
Suggested fix

```diff
 for _, row in yeast_df.iterrows():
+    compression = row.get('compression_ratio', np.nan)
+    compression_str = f"{compression:.2f}" if pd.notna(compression) else "N/A"
     lines.append(
         f"| {row['technology'].upper()} | {row['comparison']} | "
         f"{row['expected_log2_fc']:.2f} | {row['observed_median_log2_fc']:.2f} | "
-        f"{row.get('compression_ratio', 'N/A'):.2f} | {row['rmse']:.3f} |"
+        f"{compression_str} | {row['rmse']:.3f} |"
     )
```
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
for _, row in yeast_df.iterrows():
    compression = row.get('compression_ratio', np.nan)
    compression_str = f"{compression:.2f}" if pd.notna(compression) else "N/A"
    lines.append(
        f"| {row['technology'].upper()} | {row['comparison']} | "
        f"{row['expected_log2_fc']:.2f} | {row['observed_median_log2_fc']:.2f} | "
        f"{compression_str} | {row['rmse']:.3f} |"
    )
```
🤖 Prompt for AI Agents
In `@benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/06_generate_report.py` around
lines 176 - 181, The f-string in the loop over yeast_df.iterrows() applies the
numeric format specifier :.2f to row.get('compression_ratio', 'N/A'), which
raises ValueError when the value is the string 'N/A' or non-numeric/NaN; update
the code in the loop that builds lines so you detect and normalize
compression_ratio (and any other possibly-NaN fields like expected_log2_fc or
observed_median_log2_fc) before formatting: retrieve compression_ratio into a
local variable, use pandas.isna() or isinstance(..., numbers.Number) to decide
whether to format as "{:.2f}" or use a placeholder like "N/A", and then use the
prepared string in the f-string instead of applying :.2f directly to the raw
row.get(...) expression.
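A quick standalone repro of the failure mode and the NaN-normalizing fix, with a hypothetical row that lacks `compression_ratio`:

```python
import numpy as np
import pandas as pd

row = pd.Series({"rmse": 0.123})  # toy row with no compression_ratio key

# Applying :.2f to the 'N/A' string default raises ValueError.
try:
    f"{row.get('compression_ratio', 'N/A'):.2f}"
    raised = False
except ValueError:
    raised = True

# Normalizing to NaN first allows a clean placeholder instead.
value = row.get("compression_ratio", np.nan)
cell = f"{value:.2f}" if pd.notna(value) else "N/A"
```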
```python
elif method == "quantile":
    # Quantile normalization
    from scipy.stats import rankdata

    result = df.copy()
    # Get ranks for each column
    ranks = df.apply(lambda x: rankdata(x, method='average', nan_policy='omit'), axis=0)
    # Get sorted values (ignoring NaN)
    sorted_means = df.apply(lambda x: np.sort(x.dropna().values), axis=0)

    # Use mean of sorted values at each rank
    n_rows = len(df)
    for col in df.columns:
        col_ranks = ranks[col].values
        col_sorted = np.sort(df[col].dropna().values)

        # Interpolate for each rank
        for i, rank in enumerate(col_ranks):
            if not np.isnan(rank) and not np.isnan(df[col].iloc[i]):
                # Map rank to quantile value
                idx = int(rank) - 1
                if idx < len(col_sorted):
                    result.iloc[i, result.columns.get_loc(col)] = col_sorted[idx]

    return result
```
Quantile normalization implementation is incorrect.

Several problems:

- `sorted_means` (line 163) is computed but never used — it's dead code.
- Fractional ranks are truncated — `rankdata(..., method='average')` returns non-integer ranks for ties (e.g., `2.5`). `int(rank) - 1` truncates to `1`, silently mapping tied values to the wrong quantile position.
- Algorithm is wrong — standard quantile normalization replaces each value with the cross-column mean at its rank. This implementation replaces values with the same column's sorted value, which is essentially a no-op (value → sort → index by own rank → same value, modulo tie-breaking errors).
- O(n × m) Python loop is very slow for proteomics-scale matrices.

Consider using `mokume.normalization.protein.quantile_normalize` (already imported in 01_grid_search_methods.py) or a well-tested implementation instead.
🤖 Prompt for AI Agents
In `@benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/08_normalization_grid.py`
around lines 155 - 179, The quantile branch (method == "quantile") is incorrect:
sorted_means is unused, fractional ranks from rankdata are truncated, the code
maps values to the same-column sorted value (no cross-column normalization), and
it uses a slow nested Python loop; replace this block to perform true quantile
normalization by computing the mean across columns at each rank and mapping each
value to that cross-column mean (handling fractional ranks by averaging
appropriate quantile positions), or simply call the existing, tested
implementation mokume.normalization.protein.quantile_normalize instead of the
custom logic; update/remove the unused variable sorted_means and eliminate the
O(n×m) loops (use vectorized rank/argsort operations or the mokume helper) so df
→ result is the proper cross-column ranked-mean mapping.
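For reference, the rank-to-cross-column-mean mapping can be sketched in a few vectorized lines. This is a minimal version assuming a complete matrix; ties all take the lowest rank here, and NaN handling is left to the tested mokume helper:

```python
import numpy as np
import pandas as pd

def quantile_normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Map each value to the cross-column mean of the values at its rank."""
    # Mean of the i-th smallest value across all columns.
    rank_means = pd.Series(np.sort(df.to_numpy(), axis=0).mean(axis=1))
    # 0-based rank of each value within its own column (ties -> lowest rank).
    ranks = df.rank(method="min").astype(int) - 1
    # Replace every value with the cross-column mean at its rank.
    return ranks.apply(lambda col: col.map(rank_means))

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0]})
normalized = quantile_normalize(df)
```

After normalization each column has an identical distribution (the rank means), which is the defining property the reviewed implementation lacks.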
```python
def plot_radar_chart(results_df: pd.DataFrame, output_path: Path):
    """Plot radar chart comparing pipelines."""
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))

    metrics = ["n_proteins", "pct_good_cv", "within_cv_inv", "fc_rmse_inv"]
    metric_labels = ["Proteins", "% Good CV", "1/CV", "1/RMSE"]

    for ax, tech in zip(axes, ["tmt", "lfq"]):
        tech_df = results_df[results_df["technology"] == tech].copy()

        if len(tech_df) == 0:
            continue

        # Invert CV and RMSE so higher is better
        tech_df["within_cv_inv"] = 1 / tech_df["within_cv"]
        tech_df["fc_rmse_inv"] = 1 / tech_df["fc_rmse"].replace(0, np.nan)

        # Normalize to 0-1
        for m in metrics:
            if m in tech_df.columns:
                min_val = tech_df[m].min()
                max_val = tech_df[m].max()
                if max_val > min_val:
                    tech_df[m + "_norm"] = (tech_df[m] - min_val) / (max_val - min_val)
                else:
                    tech_df[m + "_norm"] = 0.5

        # Angles for radar
        angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
        angles += angles[:1]  # Complete the circle

        # Plot each pipeline
        colors = plt.cm.Set2(np.linspace(0, 1, len(tech_df)))
        for i, (_, row) in enumerate(tech_df.iterrows()):
            values = [row.get(m + "_norm", 0) for m in metrics]
            values += values[:1]

            ax.plot(angles, values, "o-", linewidth=2, label=row["pipeline"],
                    color=colors[i])
            ax.fill(angles, values, alpha=0.1, color=colors[i])

        ax.set_xticks(angles[:-1])
        ax.set_xticklabels(metric_labels)
        ax.set_title(f"{tech.upper()}: Pipeline Comparison")
        ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1))

    plt.tight_layout()
    plt.savefig(output_path, dpi=FIGURE_DPI)
    plt.close()
```
plot_radar_chart creates Cartesian axes — the radar/spider chart won't render correctly.
plt.subplots(1, 2) creates standard Cartesian axes. For a radar chart, you need polar axes. The angles (in radians) will be plotted on a linear x-axis, producing a meaningless line chart instead of a spider plot.
🐛 Proposed fix

```diff
-    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
+    fig, axes = plt.subplots(1, 2, figsize=(14, 6),
+                             subplot_kw=dict(polar=True))
```

🤖 Prompt for AI Agents
In `@benchmarks/quant-pxd007683-tmt-vs-lfq/scripts/12_optimal_pipeline.py` around
lines 462 - 510, The plot_radar_chart function is creating regular Cartesian
Axes (via plt.subplots) so the computed angles are plotted on a linear axis
instead of a polar/spider plot; change the axes creation to polar axes for each
subplot (use projection="polar" when creating axes for the two panels), ensure
angles (variable angles) remain in radians and are used as the theta
coordinates, and keep the rest of the plotting logic (values, ax.plot, ax.fill,
ax.set_xticks/set_xticklabels, ax.set_title, ax.legend) but applied to the polar
Axes returned for plot_radar_chart so the radar/spider visualization renders
correctly.
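A minimal standalone sketch of the difference (hypothetical metric values, non-interactive Agg backend; not the benchmark script itself):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt
import numpy as np

labels = ["Proteins", "% Good CV", "1/CV", "1/RMSE"]
values = [0.8, 0.6, 0.9, 0.7]

angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
angles += angles[:1]  # close the polygon
values += values[:1]

# subplot_kw=dict(polar=True) is what makes the angle list act as theta;
# on default Cartesian axes these radians plot as an ordinary line chart.
fig, ax = plt.subplots(subplot_kw=dict(polar=True))
ax.plot(angles, values, "o-")
ax.fill(angles, values, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
```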
```python
# Vectorized contaminant matching using regex OR pattern
upper_col = df[self.protein_column].fillna("").astype(str).str.upper()
pattern_regex = "|".join(re.escape(p.upper()) for p in self.patterns)
mask = ~upper_col.str.contains(pattern_regex, regex=True)
```
Bug: empty patterns list removes all rows.
When self.patterns is empty, "|".join(...) produces "", and str.contains("", regex=True) matches every row. The negated mask then filters out the entire DataFrame.
Add an early return when there are no patterns to match against.
Proposed fix

```diff
+if not self.patterns:
+    logger.debug("%s: No contaminant patterns provided, skipping filter", self.name)
+    return df, self._create_result(input_count, input_count)
+
 # Vectorized contaminant matching using regex OR pattern
 upper_col = df[self.protein_column].fillna("").astype(str).str.upper()
 pattern_regex = "|".join(re.escape(p.upper()) for p in self.patterns)
```
pattern_regex = "|".join(re.escape(p.upper()) for p in self.patterns)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if not self.patterns:
    logger.debug("%s: No contaminant patterns provided, skipping filter", self.name)
    return df, self._create_result(input_count, input_count)

# Vectorized contaminant matching using regex OR pattern
upper_col = df[self.protein_column].fillna("").astype(str).str.upper()
pattern_regex = "|".join(re.escape(p.upper()) for p in self.patterns)
mask = ~upper_col.str.contains(pattern_regex, regex=True)
```
🤖 Prompt for AI Agents
In `@mokume/preprocessing/filters/protein.py` around lines 63 - 66, The current
vectorized contaminant matching in the method using self.patterns builds
pattern_regex and then computes mask =
~upper_col.str.contains(pattern_regex...), which when self.patterns is empty
creates an empty regex that matches every row and removes all rows; fix by
adding an early check for an empty patterns list (check "if not self.patterns"
or similar) and return the original DataFrame (or set mask to all True) before
constructing pattern_regex; update uses of self.protein_column, pattern_regex,
and mask accordingly so the function preserves rows when there are no patterns.
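The empty-pattern behavior is easy to confirm in isolation (toy series, hypothetical pattern list):

```python
import pandas as pd

proteins = pd.Series(["ALBU_HUMAN", "KRT1_HUMAN"])
patterns = []  # no contaminant patterns configured

# Joining an empty list yields "", and str.contains("") matches every row,
# so the negated mask would drop the whole frame.
mask_buggy = ~proteins.str.contains("|".join(patterns), regex=True)

# Guarding on the empty list keeps all rows instead.
if not patterns:
    mask = pd.Series(True, index=proteins.index)
else:
    mask = ~proteins.str.contains("|".join(patterns), regex=True)
```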
Actionable comments posted: 18
🤖 Fix all issues with AI agents
In `@mokume/analysis/differential_expression.py`:
- Around line 342-349: The code filters s2 and df into valid arrays then calls
np.median(s2) which can be empty; add an explicit guard after filtering: if
len(s2) == 0 then log a clear warning via logger.warning and return a safe
default (e.g. prior_df=3.0, prior_var=1.0) or raise a ValueError depending on
desired behavior—ensure you update the block that currently does "if len(s2) <
3: ... return 3.0, float(np.median(s2))" to handle the zero-length case
separately so np.median is never called on an empty s2.
In `@mokume/io/feature.py`:
- Around line 404-411: The SQL in get_report_condition_from_database currently
injects cons via f-string and tuple(cons); change it to a parameterized query
using placeholders for the IN clause and pass cons as parameters to
parquet_db.sql to avoid SQL injection (e.g., build a placeholder list based on
len(cons) and call parquet_db.sql(query, params)). Update the call sites in
get_report_condition_from_database and ensure Feature.standardize_df(report)
still receives the resulting dataframe from database.df(); keep using cols =
",".join(columns) if provided but do not inline cons directly into the SQL
string.
- Around line 310-319: The current get_report_from_database (and similarly
get_report_condition_from_database) builds SQL by interpolating columns and
tuple(samples) which is an SQL injection risk; change it to use parameterized
queries for the samples (e.g., pass samples as a parameter to
parquet_db.sql/execute instead of formatting them into the string) and avoid
direct string interpolation for columns by validating/whitelisting column names
or mapping requested column names to safe identifiers before joining them into
the SELECT list; update the query construction to use a parameter placeholder
for the IN list and call parquet_db.sql/execute with the samples parameter, then
return Feature.standardize_df(report) as before.
- Around line 518-530: The SQL building the irs_df uses CAST(split_part(run,
'_', 2) AS INTEGER) to compute techreplicate which can silently yield NULL for
non-numeric segments; change this to use TRY_CAST and COALESCE so non-numeric
parts don't drop data (e.g., compute techreplicate with
TRY_CAST(split_part(run,'_',2) AS INTEGER) and COALESCE to fallback to
TRY_CAST(split_part(run,'_',-1) AS INTEGER) or TRY_CAST(run AS INTEGER) similar
to get_unique_tec_reps()), ensuring the SELECT in the irs_df query (symbols:
irs_df, parquet_db, stat_fn, where_clause, techreplicate) preserves rows instead
of producing NULL group keys.
In `@mokume/normalization/irs.py`:
- Around line 96-106: The code computing scaling_factors can divide by
zero/negative values because plex_ref_intensity[plex] is not sanitized; update
the logic in the block around plex_ref_intensity, global_ref and scaling_factors
so denominators <= 0 are replaced with np.nan (or a small positive epsilon)
before division, compute scaling_factors as global_ref / sanitized_plex_ref and
then handle resulting NaNs (e.g., propagate NaN or fill with 1.0) to avoid
inf/-inf values; refer to variables plex_ref_intensity, global_ref,
scaling_factors and the loop iterating over plexes when making this change.
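A sketch of that sanitization on toy per-plex reference intensities (`plex_ref`, `safe_ref`, and `scaling` are illustrative names, not the mokume API):

```python
import numpy as np
import pandas as pd

# Hypothetical per-plex reference-channel intensities; plex2 is all-zero.
plex_ref = pd.Series({"plex1": 1.0e6, "plex2": 0.0, "plex3": 2.0e6})

# Replace non-positive denominators with NaN before dividing.
safe_ref = plex_ref.where(plex_ref > 0, np.nan)
# Geometric mean over the valid plexes only (NaNs are skipped by mean()).
global_ref = float(np.exp(np.log(safe_ref).mean()))
# Leave plexes without a usable reference unscaled (1.0) instead of inf.
scaling = (global_ref / safe_ref).fillna(1.0)
```

With a geometric-mean reference, the valid scaling factors multiply out to 1, so overall intensity scale is preserved across plexes.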
In `@mokume/pipeline/stages.py`:
- Around line 722-725: The broad "except Exception as e" in the ComBat batch
correction block should not silently swallow failures; change it to catch
narrower, likely failure types (e.g., except (ValueError, KeyError, TypeError)
as e:) and call logger.exception("Batch correction failed") to record the full
traceback and context (include protein_df/batch info if helpful); for truly
unexpected errors keep a final except Exception as e:
logger.exception("Unexpected batch correction error") and re-raise the exception
instead of returning uncorrected protein_df so real bugs surface. Ensure you
update the except clause in the batch correction block that currently logs
"Batch correction failed: {e}" and uses protein_df as the fallback.
- Around line 744-751: The contrast-parsing fallback that splits on "-" (the
branch handling variable c and appending to contrasts) misparses hyphenated
condition names; update that branch in mokume/pipeline/stages.py (the block
referencing variable c and the contrasts list) to emit a warning when "-" is
used as the delimiter (e.g., via the module logger or warnings.warn) explaining
that "-" is ambiguous and recommending using " vs " instead, while keeping the
existing split behavior to preserve backwards compatibility.
- Around line 708-720: The code unnecessarily calls
corrected_df.rename(columns={"index": protein_col}) after
corrected_matrix.reset_index(): because the DataFrame index already has the name
protein_col the reset_index() call will produce a column named protein_col, so
remove the redundant rename call (or, alternatively, change it to rename from
the actual column name if you expect a different name); update the block that
creates corrected_df (using corrected_matrix.reset_index()) to omit the rename
and leave corrected_df as-is, ensuring logger and return continue to use
corrected_df, corrected_matrix, incomplete_matrix, protein_col, and sample_cols
as before.
In `@mokume/preprocessing/aggregation.py`:
- Around line 151-158: The code calls choice.channels() for non-LFQ branches
without checking that choice (Optional[IsobaricLabel]) is not None, which can
raise AttributeError; in the branch that assigns data_df[CHANNEL] =
data_df[CHANNEL].map(choice.channels()) (and where label !=
QuantificationCategory.LFQ) add a defensive check that choice is not None and
raise a clear ValueError (or handle a sensible default) if it is; reference the
symbols choice, IsobaricLabel, CHANNEL, QuantificationCategory.LFQ, and the
assignment to data_df[CHANNEL] so the guard is added immediately before calling
choice.channels().
- Around line 79-99: The current remove_contaminants_entrapments_decoys function
uses dataset[protein_field].str.contains(cregex) without handling NaNs, which
causes rows with missing protein names to be dropped; update the call to
str.contains to pass na=False (and if you want to be extra safe against special
characters, build the pattern using re.escape on each contaminant before
joining) so that missing protein_field values are not treated as matches and are
preserved.
- Around line 185-199: Replace hardcoded column names with constants: change
"Condition" to CONDITION and "Reference" to REFERENCE in the data filtering and
missing-files logic inside the block that manipulates data_df (where INTENSITY
and RUN are already used). Add an import for REFERENCE from
mokume.core.constants alongside existing imports (so CONDITION and REFERENCE are
available), and update the logger message to reference the REFERENCE constant
usage location; keep the rest of the logic (filtering zeros, dropping NA RUNs)
unchanged.
In `@mokume/quantification/ratio.py`:
- Around line 382-388: The current filter df =
df[~df[PROTEIN_NAME].str.contains(";")] will drop rows where PROTEIN_NAME is NaN
because str.contains defaults na=NaN; update the filter to handle missing values
explicitly (e.g., use str.contains(";", na=False)) so NaN protein rows are not
silently removed, and then continue with the peptide_counts =
df.groupby(PROTEIN_NAME)[PEPTIDE_CANONICAL].nunique() / valid_proteins logic
unchanged to enforce min_unique_peptides on actual protein names.
In `@mokume/reports/qc_report.py`:
- Around line 134-154: In _compute_pca, guard against the case where data
becomes empty after dropna (or has too few rows) before calling PCA: check
data.shape and if there are no columns or fewer than the minimum samples (mirror
the _compute_tsne check, e.g., if data.shape[0] < 4 or data.shape[1] == 0),
return a result dictionary with the same keys ("samples", "conditions",
"plexes", "pc1", "pc2", optional "pc3", and "var_explained") filled with safe
defaults (zeros or empty lists of correct length and var_explained filled with
zeros) instead of calling PCA; ensure you reference _compute_pca, samples,
sample_to_condition, and sample_to_plex so callers receive consistent shaped
outputs.
- Around line 492-493: The file is opened without an explicit encoding which can
corrupt non-ASCII characters; modify the open call that writes the HTML (the
with open(output_html, "w") as f: block that writes the html variable) to
specify encoding="utf-8" so the output HTML (which declares <meta
charset="utf-8">) is written with matching UTF-8 encoding.
- Around line 68-74: The code incorrectly replaces zeros with NaN when is_log2
is True, dropping valid zero log2 values; update the block around
intensity/log2_matrix so zeros are only replaced with NaN when converting raw
intensities to log2 (i.e., when not is_log2). Specifically, in the branch for
is_log2 use log2_matrix = intensity (don’t call replace(0, np.nan)), and keep
the replace+np.log2 behavior only in the not is_log2 branch
(intensity.replace(0, np.nan) then np.log2). Refer to the variables intensity,
log2_matrix, is_log2, protein_df, protein_col, and exp_samples to locate and fix
the logic (mirror the fix used in differential_expression.py).
- Around line 319-336: The loop that computes weights divides by
total_var_explained which can be zero (n_components, PCA,
pca.explained_variance_ratio_, total_var_explained); to fix, guard against
total_var_explained == 0 by computing a safe denominator: if total_var_explained
> 0 keep the current weight = pca.explained_variance_ratio_[i] /
total_var_explained, otherwise use an equal weight = 1.0 / n_components for each
component; apply this change where weight is computed before calling _anova_r2
and use the equal-weight branch when explained variance sums to zero.
- Around line 884-967: The embedded JSON for distributions, condition_cv
(cvData) and missing['per_sample'] is not being escaped for a closing script
tag; update the json.dumps(...) calls that produce the JS constants "dist" (used
for boxplot), "cvData" (used for cv-violin) and the "missing" object (used for
missing-plot) to apply .replace('</', '<\\/') on the dumped string so any
"</script>" sequences are neutralized before injecting into the page; keep the
existing variable names and plot targets ('boxplot', 'cv-violin',
'missing-plot') so only the json embedding calls change.
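The escape itself is a one-liner and round-trips cleanly through JSON (illustrative payload, not the report data):

```python
import json

payload = {"note": "</script><script>alert(1)</script>"}

raw = json.dumps(payload)         # contains a literal "</script>"
safe = raw.replace("</", "<\\/")  # "<\/" is still valid JSON for "/"

roundtrip = json.loads(safe)      # parses back to the original payload
```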
In `@mokume/reports/workflow_comparison.py`:
- Around line 83-88: The local variable named html returned from
_build_comparison_html shadows the imported html module (so subsequent calls
like html.escape would fail); rename that local variable (e.g., rendered_html or
html_content) wherever it's used and write that renamed variable to the file
instead of html, and open the file with encoding="utf-8" (use output_html and
the result of _build_comparison_html accordingly) so the module remains
accessible and the file encoding matches the declared charset.
🧹 Nitpick comments (17)
mokume/normalization/irs.py (2)
128-139: `transform` silently skips samples missing from scaling factors or the DataFrame.

Line 136 skips any sample not found in `result.columns` or not mapped to a known plex. If the caller passes a different set of samples than what was fitted (e.g., a typo in sample names), no error or warning is raised, and un-scaled columns are returned as-is. Consider logging a warning when samples are skipped so misconfigurations surface early.

Proposed enhancement

```diff
+    skipped = []
     for sample, plex in self._sample_to_plex.items():
         if sample in result.columns and plex in self.scaling_factors_:
             result[sample] = result[sample] * self.scaling_factors_[plex]
+        else:
+            skipped.append(sample)
+
+    if skipped:
+        logger.warning(f"IRS transform: {len(skipped)} samples skipped (not in DataFrame or plex): {skipped}")

     return result.reset_index()
```
+ skipped = [] for sample, plex in self._sample_to_plex.items(): if sample in result.columns and plex in self.scaling_factors_: result[sample] = result[sample] * self.scaling_factors_[plex] + else: + skipped.append(sample) + + if skipped: + logger.warning(f"IRS transform: {len(skipped)} samples skipped (not in DataFrame or plex): {skipped}") return result.reset_index()
270-305: Fallback to"plex1"may silently group unrelated samples.When source names don't match either the
p\d+_\d+pattern or theprefix_digitconvention (lines 292–301), all such samples are lumped into a single"plex1". In datasets with non-standard naming, this could silently produce incorrect plex assignments. The warning at line 302 helps, but for a normalization step where correctness is critical, consider raising an error (or at minimum returning a flag) when a significant fraction of samples fall through to the default.mokume/analysis/differential_expression.py (2)
122-122: Prefix unusedstatwith underscore.Static analysis flags
statas unused. Minor lint fix.Suggested fix
- stat, pvalue = stats.ttest_ind(vals_a, vals_b, equal_var=False) + _stat, pvalue = stats.ttest_ind(vals_a, vals_b, equal_var=False)
245-247: Prefix unusedrejectwith underscore.Suggested fix
- reject, adj_pvalues, _, _ = multipletests( + _reject, adj_pvalues, _, _ = multipletests(mokume/reports/qc_report.py (2)
499-499:highlight_genesandde_resultsparameters are accepted but unused.Static analysis confirms these are dead parameters. Either remove them from the signature or implement the intended highlighting/DE logic. If this is planned for a future iteration, add a TODO.
981-1034: DE quality JS rendering looks correct.Consider importing
jsonat the module level (it's already used in_build_qc_html) rather than locally in both functions, but this is a minor style point.mokume/reports/workflow_comparison.py (1)
622-638: User-supplied marker gene names are not HTML-escaped.
info['name'],acc,r['significance'], andfcare injected directly into HTML without escaping. While marker gene data is typically programmatic, applyinghtml.escape()would be safer against unexpected inputs.Suggested fix (example for line 624)
- row = f"<tr><td><b>{info['name']}</b></td><td>{acc}</td>" - row += f"<td class='{'good' if info['expected'] == 'UP' else 'bad'}'>{info['expected']}</td>" + row = f"<tr><td><b>{html.escape(info['name'])}</b></td><td>{html.escape(acc)}</td>" + row += f"<td class='{'good' if info['expected'] == 'UP' else 'bad'}'>{html.escape(info['expected'])}</td>"mokume/preprocessing/aggregation.py (2)
208-219: Fragile technical replicate extraction from run names.The heuristic at lines 209-219 attempts to parse tech replicate IDs from run names. If the last underscore-delimited segment is not numeric (e.g., run name
"sample_Brain"), theastype("int")on line 212 throwsValueError, caught by the except block. The fallback (line 217-219) assigns sequential indices, which may not reflect actual technical replicates. This is acceptable as a best-effort approach, but worth documenting the assumption.
358-365: Consider usinggroupby().agg()instead oftransform("sum")+drop_duplicates().The current approach replaces each row's intensity with the group sum via
transform, then deduplicates. A directgroupby().agg({"NormIntensity": "sum"}).reset_index()would be more idiomatic, clearer in intent, and avoids the intermediate oversized DataFrame.mokume/quantification/ratio.py (1)
131-163: Side-effect:_compute_reference_intensitymutates the caller'sdfby adding_plex.Line 138 adds a
_plexcolumn directly to thedfpassed byquantify(). While_compute_log2_ratioslater overwrites it, this hidden mutation couples the two methods. Consider operating on a copy or filtering without modifying the input.mokume/pipeline/stages.py (7)
73-82: Unused variablesample_names.
sample_names(line 80) is unpacked but never referenced. Prefix with_to signal intentional discard.Suggested fix
technical_repetitions, label, sample_names, choice = analyse_sdrf( + technical_repetitions, label, _sample_names, choice = analyse_sdrf(
112-116:groupby().filter(lambda x: len(set(...)))is slow — prefernunique().Creating a
setinside a lambda for each group is inefficient on large DataFrames. A vectorized approach is faster and clearer:Suggested refactor
- dataset_df = dataset_df.groupby(PROTEIN_NAME).filter( - lambda x: len(set(x[PEPTIDE_CANONICAL])) - >= self.config.filtering.min_unique_peptides - ) + peptide_counts = dataset_df.groupby(PROTEIN_NAME)[PEPTIDE_CANONICAL].nunique() + valid_proteins = peptide_counts[peptide_counts >= self.config.filtering.min_unique_peptides].index + dataset_df = dataset_df[dataset_df[PROTEIN_NAME].isin(valid_proteins)]
178-192: Duplicated protein-accession parsing logic across files.The pattern at lines 180-185 (extract first
pg_accessionselement, split by|, take UniProt ID) is repeated inmokume/quantification/ratio.pylines 352-357 and here. Extract a shared helper (e.g.,parse_pg_accessions(series) -> Series) to avoid drift between the two copies.
416-429: Overly complex column deduplication — simplify reference sample removal.The logic prepends
protein_col, then re-includes it viac == protein_colin the comprehension, then manually deduplicates. This can be simplified:Suggested simplification
- protein_col = protein_df.columns[0] - cols_to_keep = [protein_col] + [ - c for c in protein_df.columns - if c == protein_col or c not in ref_samples - ] - # Deduplicate while preserving order - seen = set() - unique_cols = [] - for c in cols_to_keep: - if c not in seen: - seen.add(c) - unique_cols.append(c) - protein_df = protein_df[unique_cols] + protein_col = protein_df.columns[0] + cols_to_keep = [c for c in protein_df.columns if c == protein_col or c not in ref_samples] + protein_df = protein_df[cols_to_keep]
507-526: Unusedmethod_nameparameter.The
method_nameparameter is accepted but never referenced in the method body. Either remove it or use it (e.g., in the debug log at line 517).
540-546: Unused `mw_dict` from `extract_fasta` return value.

Prefix with `_` to signal intentional discard:

```diff
-    unique_peptide_counts, mw_dict, found_proteins = extract_fasta(
+    unique_peptide_counts, _mw_dict, found_proteins = extract_fasta(
```
896-903: `rsplit(".", 1)` result used ambiguously when the filename has no extension.

If `report_output` has no `.` (no extension), `rsplit(".", 1)` returns a single-element list. Line 900 accesses `base[0]`, which works, but `f"{base[0]}_{contrast_name}.html"` always appends `.html` regardless of the original extension. This is probably fine, but the variable name `base` for a list is a bit unclear.
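The behavior in question is easy to pin down with plain `rsplit` (the `contrast_name` value is made up):

```python
# rsplit(".", 1) yields two parts only when a dot exists; otherwise the whole
# string comes back as a single-element list, so base[0] still works.
with_ext = "report.html".rsplit(".", 1)
without_ext = "report".rsplit(".", 1)
print(with_ext)     # ['report', 'html']
print(without_ext)  # ['report']

contrast_name = "A_vs_B"  # hypothetical contrast label
# Either way the code appends .html, discarding any original extension:
print(f"{with_ext[0]}_{contrast_name}.html")     # report_A_vs_B.html
print(f"{without_ext[0]}_{contrast_name}.html")  # report_A_vs_B.html
```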
```python
# Filter out zeros/inf
valid = np.isfinite(s2) & (s2 > 0) & np.isfinite(df) & (df > 0)
s2 = s2[valid]
df = df[valid]

if len(s2) < 3:
    logger.warning("Too few proteins for eBayes estimation, using defaults")
    return 3.0, float(np.median(s2))
```
Edge case: np.median(s2) when filtered array is empty.
If all s2 values are zero, non-finite, or df <= 0, the valid mask could leave an empty array. np.median([]) produces a runtime warning and returns nan, which would propagate into the moderated variance calculation and silently produce nan p-values for all proteins.
Consider returning early with a more defensive default or raising when no valid variances remain.
Suggested guard

```diff
 if len(s2) < 3:
     logger.warning("Too few proteins for eBayes estimation, using defaults")
-    return 3.0, float(np.median(s2))
+    median_s2 = float(np.median(s2)) if len(s2) > 0 else 1.0
+    return 3.0, median_s2
```
🤖 Prompt for AI Agents
In `@mokume/analysis/differential_expression.py` around lines 342 - 349, The code
filters s2 and df into valid arrays then calls np.median(s2) which can be empty;
add an explicit guard after filtering: if len(s2) == 0 then log a clear warning
via logger.warning and return a safe default (e.g. prior_df=3.0, prior_var=1.0)
or raise a ValueError depending on desired behavior—ensure you update the block
that currently does "if len(s2) < 3: ... return 3.0, float(np.median(s2))" to
handle the zero-length case separately so np.median is never called on an empty
s2.
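The failure mode is straightforward to reproduce with NumPy alone, together with the suggested fallback:

```python
import warnings
import numpy as np

s2 = np.array([0.0, -1.0, np.inf, np.nan])
df = np.array([1.0, 1.0, 1.0, 0.0])

# Same validity mask as above: here every entry is filtered out.
valid = np.isfinite(s2) & (s2 > 0) & np.isfinite(df) & (df > 0)
s2 = s2[valid]

with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    bad_median = np.median(s2)  # nan on an empty array

# Defensive fallback: only call median when data remains.
median_s2 = float(np.median(s2)) if len(s2) > 0 else 1.0
print(len(s2), bool(np.isnan(bad_median)), median_s2)  # 0 True 1.0
```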
```python
def get_report_from_database(self, samples: list, columns: list = None):
    """Retrieves a standardized report from the database for specified samples."""
    cols = ",".join(columns) if columns is not None else "*"
    database = self.parquet_db.sql(
        """SELECT {} FROM parquet_db WHERE sample_accession IN {}""".format(
            cols, tuple(samples)
        )
    )
    report = database.df()
    return Feature.standardize_df(report)
```
SQL injection risk: samples and columns are interpolated directly into SQL.
tuple(samples) relies on Python's repr() for string quoting, which does not perform SQL escaping. A sample name containing a single quote (e.g., "sample'1") would produce a malformed query or injection vector. Use DuckDB's parameterized queries instead:
Suggested approach

```diff
 def get_report_from_database(self, samples: list, columns: list = None):
     cols = ",".join(columns) if columns is not None else "*"
-    database = self.parquet_db.sql(
-        """SELECT {} FROM parquet_db WHERE sample_accession IN {}""".format(
-            cols, tuple(samples)
-        )
-    )
+    placeholders = ", ".join(["?"] * len(samples))
+    database = self.parquet_db.execute(
+        f"SELECT {cols} FROM parquet_db WHERE sample_accession IN ({placeholders})",
+        samples,
+    )
     report = database.df()
     return Feature.standardize_df(report)
```

Note: the same pattern exists in `get_report_condition_from_database` (line 408).
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 310-310: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
[error] 314-316: Possible SQL injection vector through string-based query construction
(S608)
🤖 Prompt for AI Agents
In `@mokume/io/feature.py` around lines 310 - 319, The current
get_report_from_database (and similarly get_report_condition_from_database)
builds SQL by interpolating columns and tuple(samples) which is an SQL injection
risk; change it to use parameterized queries for the samples (e.g., pass samples
as a parameter to parquet_db.sql/execute instead of formatting them into the
string) and avoid direct string interpolation for columns by
validating/whitelisting column names or mapping requested column names to safe
identifiers before joining them into the SELECT list; update the query
construction to use a parameter placeholder for the IN list and call
parquet_db.sql/execute with the samples parameter, then return
Feature.standardize_df(report) as before.
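The placeholder pattern generalizes beyond DuckDB. A minimal sketch with the stdlib `sqlite3` driver (table and sample names invented for the demo) shows why parameter binding stays safe even for quoted values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parquet_db (sample_accession TEXT, intensity REAL)")
conn.executemany(
    "INSERT INTO parquet_db VALUES (?, ?)",
    [("sample-1", 10.0), ("sample'1", 20.0), ("sample-3", 30.0)],
)

# A quoted sample name is harmless because values travel as bound parameters,
# never as SQL text.
samples = ["sample-1", "sample'1"]
placeholders = ", ".join(["?"] * len(samples))
rows = conn.execute(
    f"SELECT sample_accession, intensity FROM parquet_db "
    f"WHERE sample_accession IN ({placeholders})",
    samples,
).fetchall()

print(sorted(rows))  # both requested samples, including the quoted one
```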
```python
def get_report_condition_from_database(self, cons: list, columns: list = None) -> pd.DataFrame:
    """Retrieves a standardized report from the database for specified conditions."""
    cols = ",".join(columns) if columns is not None else "*"
    database = self.parquet_db.sql(
        f"""SELECT {cols} FROM parquet_db WHERE condition IN {tuple(cons)}"""
    )
    report = database.df()
    return Feature.standardize_df(report)
```
Same SQL injection pattern as get_report_from_database.
get_report_condition_from_database uses the same tuple(cons) interpolation into SQL. Apply the same parameterized query fix recommended for get_report_from_database.
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 404-404: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
[error] 408-408: Possible SQL injection vector through string-based query construction
(S608)
🤖 Prompt for AI Agents
In `@mokume/io/feature.py` around lines 404 - 411, The SQL in
get_report_condition_from_database currently injects cons via f-string and
tuple(cons); change it to a parameterized query using placeholders for the IN
clause and pass cons as parameters to parquet_db.sql to avoid SQL injection
(e.g., build a placeholder list based on len(cons) and call
parquet_db.sql(query, params)). Update the call sites in
get_report_condition_from_database and ensure Feature.standardize_df(report)
still receives the resulting dataframe from database.df(); keep using cols =
",".join(columns) if provided but do not inline cons directly into the SQL
string.
mokume/io/feature.py
Outdated
```python
irs_df = self.parquet_db.sql(
    f"""
    SELECT run, {stat_fn}(intensity) as irs_value, mixture, techreplicate as techrep_guess
    FROM (
        SELECT *,
            CASE WHEN position('_' in run) > 0 THEN CAST(split_part(run, '_', 2) AS INTEGER)
            ELSE CAST(run AS INTEGER) END AS techreplicate
        FROM parquet_db
        WHERE {where_clause}
    )
    GROUP BY run, mixture, techrep_guess
    """
).df()
```
🧩 Analysis chain
🌐 Web query:
DuckDB TRY_CAST function support documentation
💡 Result:
DuckDB supports TRY_CAST as a “safe cast” that returns NULL instead of throwing an error when a conversion can’t be performed. Syntax: TRY_CAST(expr AS TYPENAME). Example: SELECT TRY_CAST('hello' AS INTEGER); returns NULL. [1]
Related docs/notes:
- Regular `CAST` can fail (e.g., invalid strings, overflow), while `TRY_CAST` converts those failures to `NULL` (including overflow cases like casting `999` to `TINYINT`). [1][2]
- DuckDB also provides a more general `TRY(<expression>)` wrapper (inspired by `TRY_CAST`) to return `NULL` for rows where an expression errors, not just casts. [3][4]
Sources: DuckDB docs on Casting / TRY_CAST [1], Typecasting (overflow note) [2], TRY expression [3], DuckDB 1.3.0 release notes (generalization to TRY) [4].
CAST(split_part(run, '_', 2) AS INTEGER) may return NULL with non-numeric position-2 segments, causing GROUP BY rows to be lost.
If run names have a non-numeric value at position 2 after splitting by _ (e.g., "sample_brain_1"), the CAST silently returns NULL in DuckDB. This causes those rows to be grouped under a NULL techreplicate value, losing data. While the method won't crash—the call is wrapped in a broad try/except—silently dropping data is worse than early failure.
Additionally, this approach differs from get_unique_tec_reps() (lines 342–365), which extracts the last segment and includes Python-level error handling.
Consider using TRY_CAST with COALESCE to either extract position 2 or fall back to the entire run value:
Suggested fix

```diff
-    CASE WHEN position('_' in run) > 0 THEN CAST(split_part(run, '_', 2) AS INTEGER)
-    ELSE CAST(run AS INTEGER) END AS techreplicate
+    COALESCE(
+        TRY_CAST(split_part(run, '_', 2) AS INTEGER),
+        TRY_CAST(run AS INTEGER)
+    ) AS techreplicate
```
🧰 Tools
🪛 Ruff (0.14.14)
[error] 519-529: Possible SQL injection vector through string-based query construction
(S608)
🤖 Prompt for AI Agents
In `@mokume/io/feature.py` around lines 518 - 530, The SQL building the irs_df
uses CAST(split_part(run, '_', 2) AS INTEGER) to compute techreplicate which can
silently yield NULL for non-numeric segments; change this to use TRY_CAST and
COALESCE so non-numeric parts don't drop data (e.g., compute techreplicate with
TRY_CAST(split_part(run,'_',2) AS INTEGER) and COALESCE to fallback to
TRY_CAST(split_part(run,'_',-1) AS INTEGER) or TRY_CAST(run AS INTEGER) similar
to get_unique_tec_reps()), ensuring the SELECT in the irs_df query (symbols:
irs_df, parquet_db, stat_fn, where_clause, techreplicate) preserves rows instead
of producing NULL group keys.
```python
# Compute global reference intensity (geometric mean across plexes)
ref_df = pd.DataFrame(plex_ref_intensity)
# Replace 0 and negatives with NaN for geometric mean
ref_df[ref_df <= 0] = np.nan
log_refs = np.log(ref_df)
global_ref = np.exp(log_refs.mean(axis=1))

# Compute scaling factors: global_ref / plex_ref for each plex
scaling_factors = {}
for plex in plexes:
    scaling_factors[plex] = global_ref / plex_ref_intensity[plex]
```
Division by zero / negative denominator when computing scaling factors.
plex_ref_intensity[plex] (line 106) is computed from raw data (lines 90–94) and may contain zero or negative values for some proteins. While global_ref sanitizes non-positive values via the NaN replacement on line 99, the denominator is not similarly protected. This produces inf / -inf scaling factors that silently corrupt downstream intensities.
Proposed fix

```diff
 # Compute scaling factors: global_ref / plex_ref for each plex
 scaling_factors = {}
 for plex in plexes:
-    scaling_factors[plex] = global_ref / plex_ref_intensity[plex]
+    denom = plex_ref_intensity[plex].replace(0, np.nan)
+    denom[denom < 0] = np.nan
+    scaling_factors[plex] = global_ref / denom
```
🤖 Prompt for AI Agents
In `@mokume/normalization/irs.py` around lines 96 - 106, The code computing
scaling_factors can divide by zero/negative values because
plex_ref_intensity[plex] is not sanitized; update the logic in the block around
plex_ref_intensity, global_ref and scaling_factors so denominators <= 0 are
replaced with np.nan (or a small positive epsilon) before division, compute
scaling_factors as global_ref / sanitized_plex_ref and then handle resulting
NaNs (e.g., propagate NaN or fill with 1.0) to avoid inf/-inf values; refer to
variables plex_ref_intensity, global_ref, scaling_factors and the loop iterating
over plexes when making this change.
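A small pandas example confirms both the `inf` leak and the effect of sanitizing the denominator (values are made up):

```python
import numpy as np
import pandas as pd

global_ref = pd.Series([100.0, 50.0, 10.0])
plex_ref = pd.Series([20.0, 0.0, -5.0])  # unsanitized per-plex reference

naive = global_ref / plex_ref  # zero/negative denominators slip through

denom = plex_ref.replace(0, np.nan)
denom[denom < 0] = np.nan
safe = global_ref / denom  # bad denominators become NaN instead of inf

print(naive.tolist())  # [5.0, inf, -2.0]
print(safe.tolist())   # [5.0, nan, nan]
```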
```python
def _compute_pca(log2_matrix, samples, sample_to_condition, sample_to_plex):
    """Compute PCA coordinates."""
    data = log2_matrix[samples].T.dropna(axis=1)
    data = data.fillna(data.median())

    n_components = min(3, data.shape[0], data.shape[1])
    pca = PCA(n_components=n_components)
    coords = pca.fit_transform(data)

    result = {
        "samples": list(data.index),
        "conditions": [sample_to_condition.get(s, "Unknown") for s in data.index],
        "plexes": [sample_to_plex.get(s, "plex1") for s in data.index],
        "pc1": coords[:, 0].tolist(),
        "pc2": coords[:, 1].tolist() if n_components >= 2 else [0] * len(data),
        "var_explained": (pca.explained_variance_ratio_ * 100).tolist(),
    }
    if n_components >= 3:
        result["pc3"] = coords[:, 2].tolist()

    return result
```
Edge case: PCA will raise if all proteins are dropped after dropna.
If every protein has at least one NaN across all samples, dropna(axis=1) on the transposed matrix leaves zero columns and PCA.fit_transform will fail. A guard similar to _compute_tsne's if data.shape[0] < 4 check would be prudent.
🤖 Prompt for AI Agents
In `@mokume/reports/qc_report.py` around lines 134 - 154, In _compute_pca, guard
against the case where data becomes empty after dropna (or has too few rows)
before calling PCA: check data.shape and if there are no columns or fewer than
the minimum samples (mirror the _compute_tsne check, e.g., if data.shape[0] < 4
or data.shape[1] == 0), return a result dictionary with the same keys
("samples", "conditions", "plexes", "pc1", "pc2", optional "pc3", and
"var_explained") filled with safe defaults (zeros or empty lists of correct
length and var_explained filled with zeros) instead of calling PCA; ensure you
reference _compute_pca, samples, sample_to_condition, and sample_to_plex so
callers receive consistent shaped outputs.
```python
n_components = min(10, data.shape[0] - 1, data.shape[1])
pca = PCA(n_components=n_components)
pc_data = pca.fit_transform(data)

# For each PC, compute R² for condition and plex using simple 1-way ANOVA
condition_r2_weighted = 0
plex_r2_weighted = 0
total_var_explained = sum(pca.explained_variance_ratio_[:n_components])

for i in range(n_components):
    pc_values = pc_data[:, i]
    weight = pca.explained_variance_ratio_[i] / total_var_explained

    cond_r2 = _anova_r2(pc_values, conditions)
    plex_r2 = _anova_r2(pc_values, plexes)

    condition_r2_weighted += cond_r2 * weight
    plex_r2_weighted += plex_r2 * weight
```
Division-by-zero risk if total_var_explained is zero.
If all PCA components explain zero variance (e.g., all-constant data), Line 330 divides by total_var_explained = 0. Guard against this.
Suggested fix

```diff
 total_var_explained = sum(pca.explained_variance_ratio_[:n_components])
+if total_var_explained == 0:
+    return {
+        "condition_pct": 0.0,
+        "plex_pct": 0.0,
+        "residual_pct": 100.0,
+        "n_pcs_used": n_components,
+        "total_var_captured": 0.0,
+    }
 for i in range(n_components):
     pc_values = pc_data[:, i]
     weight = pca.explained_variance_ratio_[i] / total_var_explained
```
🤖 Prompt for AI Agents
In `@mokume/reports/qc_report.py` around lines 319 - 336, The loop that computes
weights divides by total_var_explained which can be zero (n_components, PCA,
pca.explained_variance_ratio_, total_var_explained); to fix, guard against
total_var_explained == 0 by computing a safe denominator: if total_var_explained
> 0 keep the current weight = pca.explained_variance_ratio_[i] /
total_var_explained, otherwise use an equal weight = 1.0 / n_components for each
component; apply this change where weight is computed before calling _anova_r2
and use the equal-weight branch when explained variance sums to zero.
```python
with open(output_html, "w") as f:
    f.write(html)
```
Specify encoding="utf-8" when writing the HTML file.
The HTML declares <meta charset="utf-8"> (Line 579) but the file is opened without explicit encoding. On some platforms the default encoding is not UTF-8, which could corrupt non-ASCII characters in sample/protein names.
Suggested fix

```diff
-    with open(output_html, "w") as f:
+    with open(output_html, "w", encoding="utf-8") as f:
```
🤖 Prompt for AI Agents
In `@mokume/reports/qc_report.py` around lines 492 - 493, The file is opened
without an explicit encoding which can corrupt non-ASCII characters; modify the
open call that writes the HTML (the with open(output_html, "w") as f: block that
writes the html variable) to specify encoding="utf-8" so the output HTML (which
declares <meta charset="utf-8">) is written with matching UTF-8 encoding.
```python
        // Per-sample boxplots (distribution)
        (function() {{
            const dist = {json.dumps(distributions)};
            const samples = Object.keys(dist).sort();
            const traces = [];
            samples.forEach(s => {{
                const d = dist[s];
                traces.push({{
                    type: 'box',
                    name: s,
                    y: [d.min, d.q25, d.median, d.q75, d.max],
                    boxmean: true,
                    marker: {{ color: condColors[d.condition] || '#999' }},
                    line: {{ width: 1 }},
                }});
            }});
            // Use simplified box representation
            const boxTraces = [];
            const conds = [...new Set(samples.map(s => dist[s].condition))].sort();
            conds.forEach(cond => {{
                const condSamples = samples.filter(s => dist[s].condition === cond);
                boxTraces.push({{
                    type: 'bar',
                    name: cond,
                    x: condSamples,
                    y: condSamples.map(s => dist[s].median),
                    error_y: {{
                        type: 'data',
                        symmetric: false,
                        array: condSamples.map(s => dist[s].q75 - dist[s].median),
                        arrayminus: condSamples.map(s => dist[s].median - dist[s].q25),
                    }},
                    marker: {{ color: condColors[cond] || '#999' }},
                }});
            }});
            Plotly.newPlot('boxplot', boxTraces, {{
                title: {{ text: 'Per-Sample Distribution (median + IQR)', font: {{ size: 14 }} }},
                xaxis: {{ tickangle: 45, tickfont: {{ size: 9 }} }},
                yaxis: {{ title: '{"log2 ratio" if summary["is_log2"] else "log2 intensity"}' }},
                barmode: 'group',
                margin: {{ t: 50, b: 100, l: 60, r: 20 }},
            }}, {{responsive: true}});
        }})();

        // CV violin/box by condition
        (function() {{
            const cvData = {json.dumps({k: v['cv_values'] for k, v in condition_cv.items()})};
            const traces = [];
            Object.keys(cvData).sort().forEach(cond => {{
                traces.push({{
                    type: 'violin',
                    name: cond,
                    y: cvData[cond],
                    box: {{ visible: true }},
                    meanline: {{ visible: true }},
                    marker: {{ color: condColors[cond] || '#999' }},
                    line: {{ width: 1 }},
                }});
            }});
            Plotly.newPlot('cv-violin', traces, {{
                title: {{ text: 'Per-Protein CV Distribution by Condition', font: {{ size: 14 }} }},
                yaxis: {{ title: 'Coefficient of Variation', range: [0, 3] }},
                margin: {{ t: 50, b: 40, l: 60, r: 20 }},
            }}, {{responsive: true}});
        }})();

        // Missing values per sample
        (function() {{
            const missing = {json.dumps(missing['per_sample'])};
            const samples = Object.keys(missing).sort();
            Plotly.newPlot('missing-plot', [{{
                type: 'bar',
                x: samples,
                y: samples.map(s => (missing[s].rate * 100).toFixed(1)),
                marker: {{ color: samples.map(s => missing[s].rate > 0.3 ? '#e74c3c' : '#27ae60') }},
                text: samples.map(s => (missing[s].rate * 100).toFixed(1) + '%'),
                textposition: 'auto',
            }}], {{
                title: {{ text: 'Missing Value Rate per Sample', font: {{ size: 14 }} }},
                xaxis: {{ tickangle: 45, tickfont: {{ size: 9 }} }},
                yaxis: {{ title: '% Missing', range: [0, 100] }},
                margin: {{ t: 50, b: 100, l: 60, r: 20 }},
            }}, {{responsive: true}});
        }})();
```
Inconsistent </script> injection protection in JSON embedding.
Lines 766, 809, 865 correctly use .replace('</', '<\\/') on json.dumps() output to prevent </script> tag injection, but Lines 886, 930, and 952 omit this escaping. Apply it consistently to all embedded JSON data.
Suggested fix (example for line 886)

```diff
-            const dist = {json.dumps(distributions)};
+            const dist = {json.dumps(distributions).replace('</', '<\\/')};
```
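A quick stdlib-only check of the escaping, using a deliberately hostile key (made up for the demo):

```python
import json

# A malicious-looking value that would otherwise close the <script> tag early.
distributions = {"sample</script><script>alert(1)//": {"median": 1.5}}

raw = json.dumps(distributions)
escaped = raw.replace('</', '<\\/')

print('</script>' in raw)      # True: raw JSON can break out of the tag
print('</script>' in escaped)  # False: the sequence is neutralized
# JSON allows the escaped solidus "\/", so the payload round-trips unchanged:
print(json.loads(escaped) == distributions)  # True
```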
🤖 Prompt for AI Agents
In `@mokume/reports/qc_report.py` around lines 884 - 967, The embedded JSON for
distributions, condition_cv (cvData) and missing['per_sample'] is not being
escaped for a closing script tag; update the json.dumps(...) calls that produce
the JS constants "dist" (used for boxplot), "cvData" (used for cv-violin) and
the "missing" object (used for missing-plot) to apply .replace('</', '<\\/') on
the dumped string so any "</script>" sequences are neutralized before injecting
into the page; keep the existing variable names and plot targets ('boxplot',
'cv-violin', 'missing-plot') so only the json embedding calls change.
| html = _build_comparison_html( | ||
| title, workflow_metrics, concordance, marker_results, marker_genes, | ||
| ) | ||
|
|
||
| with open(output_html, "w") as f: | ||
| f.write(html) |
html variable shadows the html module import — maintenance trap.
Line 83 assigns the string result to a local variable named html, which shadows the import html on Line 14. Any future code in this function referencing html.escape() will raise AttributeError. Also, specify encoding="utf-8" on the file write to match the charset declared in the output.
Suggested fix
- html = _build_comparison_html(
+ report_html = _build_comparison_html(
title, workflow_metrics, concordance, marker_results, marker_genes,
)
- with open(output_html, "w") as f:
- f.write(html)
+ with open(output_html, "w", encoding="utf-8") as f:
+ f.write(report_html)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| html = _build_comparison_html( | |
| title, workflow_metrics, concordance, marker_results, marker_genes, | |
| ) | |
| with open(output_html, "w") as f: | |
| f.write(html) | |
| report_html = _build_comparison_html( | |
| title, workflow_metrics, concordance, marker_results, marker_genes, | |
| ) | |
| with open(output_html, "w", encoding="utf-8") as f: | |
| f.write(report_html) |
🤖 Prompt for AI Agents
In `@mokume/reports/workflow_comparison.py` around lines 83 - 88, The local
variable named html returned from _build_comparison_html shadows the imported
html module (so subsequent calls like html.escape would fail); rename that local
variable (e.g., rendered_html or html_content) wherever it's used and write that
renamed variable to the file instead of html, and open the file with
encoding="utf-8" (use output_html and the result of _build_comparison_html
accordingly) so the module remains accessible and the file encoding matches the
declared charset.
support new QPX feature schema with backward compatibility
Actionable comments posted: 3
♻️ Duplicate comments (2)
mokume/io/feature.py (2)
360-369: ⚠️ Potential issue | 🟠 Major
Still interpolating raw values into the two report queries.
tuple(samples)/tuple(cons) and ",".join(columns) are still spliced directly into SQL. A quoted sample/condition or an unvalidated column name can break the query or inject SQL; use placeholders for the IN (...) values and whitelist column names before joining.
Also applies to: 448-455
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mokume/io/feature.py` around lines 360 - 369, The get_report_from_database method currently injects raw values into the SQL string (using tuple(samples) and ",".join(columns)); fix it by parameterizing the IN list with placeholders and passing samples as parameters to parquet_db.sql, and validate/whitelist column names before joining to prevent injection; update the same pattern in the other report query referenced (the block around lines 448-455) so both queries use prepared placeholders for sample/condition values and only allow vetted column names when building cols, leaving Feature.standardize_df(report) unchanged.
552-562: ⚠️ Potential issue | 🔴 Critical
IRS tech-rep parsing will fail on valid new-QPX run names.
The new-QPX fixture in tests/example/sdrf_new_qpx.tsv:2-3 uses run_A.raw, so the run value becomes run_A. split_part(run, '_', 2) is A, and CAST('A' AS INTEGER) raises in DuckDB, so IRS breaks on that dataset. Please mirror get_unique_tec_reps() by parsing the last token and using TRY_CAST plus explicit null handling.
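The suggested parsing change can be mirrored in pure Python to show why "last token plus TRY_CAST-style fallback" is safer than a hard CAST of the second token; `parse_tech_rep` is an illustrative name, not mokume code:

```python
def parse_tech_rep(run: str, default: int = 1) -> int:
    """Parse a technical-replicate number from a run name.

    Mirrors TRY_CAST(... AS INTEGER) semantics: take the *last*
    underscore-separated token and fall back to a default when it is
    not numeric, instead of raising like a plain CAST would.
    """
    token = run.rsplit("_", 1)[-1]
    try:
        return int(token)      # e.g. "sample_1_2" -> 2
    except ValueError:
        return default         # e.g. "run_A" -> fallback, like TRY_CAST -> NULL


print(parse_tech_rep("sample_1_2"), parse_tech_rep("run_A"))
```

A `CAST('A' AS INTEGER)` aborts the whole query; the fallback keeps IRS running on datasets whose run names carry no numeric suffix.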
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@mokume/io/feature.py`:
- Around line 271-294: The parquet_db view is rebuilt with SDRF-derived
sample_accession and condition but self.samples is only populated once in
__init__, so callers (iter_samples, experimental_inference,
get_low_frequency_peptides) still see stale sample metadata; after creating the
parquet_db view in the method containing this SQL (in mokume/io/feature.py)
refresh the cached sample list by reassigning or repopulating self.samples from
the updated parquet_db (reuse the same logic that __init__ uses to build
samples) so downstream methods reflect the enriched SDRF values.
In `@mokume/quantification/ratio.py`:
- Around line 396-400: The code currently falls back to run_file_name when the
SDRF join misses (df = df.merge(...); df['sample_accession'] =
df['source_name'].fillna(df['run_file_name'])), which hides unmapped
(run_file_name, label) pairs and later causes silent drops; remove the fillna
fallback and instead detect rows where the merge produced null SDRF values
(e.g., df['source_name'].isna()) immediately after the merge and raise a clear
error (RuntimeError) listing the offending (run_file_name, label) combinations
so the pipeline fails fast; update any downstream expectations (e.g., mappings
built in mokume/pipeline/stages.py) to rely on the validated
df['sample_accession'] populated only from SDRF values.
- Around line 364-373: The WHERE uses unprojected aliases so DuckDB will fail;
change the query in ratio.py where query is constructed (using charge_col,
unnest_sql, where_clause) to materialize the unnested rows first via a CTE or
subquery (e.g., WITH unnested AS (SELECT parquet_raw.*, unnest.* FROM
parquet_raw, UNNEST(intensities) AS unnest) SELECT pg_accessions, sequence,
{charge_col} AS precursor_charge, {unnest_sql} FROM unnested WHERE
{where_clause}) so predicates emitted by SQLFilterBuilder.build_where_clause()
can reference properly-scoped columns (or alternatively update
SQLFilterBuilder.build_where_clause to accept and apply explicit prefixes like
"unnest." to its column names); ensure all references to intensity and "unique"
are fully qualified to the subquery/CTE aliases.
---
Duplicate comments:
In `@mokume/io/feature.py`:
- Around line 360-369: The get_report_from_database method currently injects raw
values into the SQL string (using tuple(samples) and ",".join(columns)); fix it
by parameterizing the IN list with placeholders and passing samples as
parameters to parquet_db.sql, and validate/whitelist column names before joining
to prevent injection; update the same pattern in the other report query
referenced (the block around lines 448-455) so both queries use prepared
placeholders for sample/condition values and only allow vetted column names when
building cols, leaving Feature.standardize_df(report) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5f7d27c6-d70d-4cb5-9f57-c56aad0d0d06
⛔ Files ignored due to path filters (2)
tests/example/feature_wide_new_qpx.parquet is excluded by !**/*.parquet
tests/example/sdrf_new_qpx.tsv is excluded by !**/*.tsv
📒 Files selected for processing (4)
mokume/core/constants.py
mokume/io/feature.py
mokume/quantification/ratio.py
tests/test_peptide_normalize.py
🚧 Files skipped from review as they are similar to previous changes (1)
- mokume/core/constants.py
| self.parquet_db.execute("DROP VIEW IF EXISTS parquet_db") | ||
| self.parquet_db.execute(f""" | ||
| CREATE VIEW parquet_db AS | ||
| SELECT | ||
| p.sequence, | ||
| p.peptidoform, | ||
| p.pg_accessions, | ||
| p.charge, | ||
| p.run_file_name, | ||
| p."unique", | ||
| COALESCE(s.sdrf_sample_accession, {sa_fallback}) as sample_accession, | ||
| p.channel, | ||
| p.intensity, | ||
| p.run, | ||
| COALESCE(s.sdrf_condition, {sa_fallback}) as condition, | ||
| COALESCE(CAST(s.sdrf_biological_replicate AS INTEGER), 1) as biological_replicate, | ||
| COALESCE(CAST(s.sdrf_fraction AS VARCHAR), '1') as fraction, | ||
| split_part(COALESCE(s.sdrf_sample_accession, {sa_fallback}), '_', 1) as mixture | ||
| FROM parquet_db_unnested p | ||
| LEFT JOIN sdrf_mapping s | ||
| {join_clause} | ||
| """) | ||
|
|
||
| logger.info("Enriched parquet data with SDRF metadata from %s", sdrf_path) |
Refresh cached sample metadata after rebuilding parquet_db.
__init__() snapshots self.samples once, but this method replaces sample_accession and condition with SDRF-derived values. After enrichment, iter_samples(), experimental_inference(), and the denominator in get_low_frequency_peptides() still use the pre-SDRF run list.
♻️ Suggested fix
self.parquet_db.execute(f"""
CREATE VIEW parquet_db AS
SELECT
p.sequence,
p.peptidoform,
p.pg_accessions,
p.charge,
p.run_file_name,
p."unique",
COALESCE(s.sdrf_sample_accession, {sa_fallback}) as sample_accession,
p.channel,
p.intensity,
p.run,
COALESCE(s.sdrf_condition, {sa_fallback}) as condition,
COALESCE(CAST(s.sdrf_biological_replicate AS INTEGER), 1) as biological_replicate,
COALESCE(CAST(s.sdrf_fraction AS VARCHAR), '1') as fraction,
split_part(COALESCE(s.sdrf_sample_accession, {sa_fallback}), '_', 1) as mixture
FROM parquet_db_unnested p
LEFT JOIN sdrf_mapping s
{join_clause}
""")
+
+ self.samples = self.get_unique_samples()
+ self.labels = None
+ self.label = None
+ self.choice = None
+ self.technical_repetitions = None
 logger.info("Enriched parquet data with SDRF metadata from %s", sdrf_path)
🧰 Tools
🪛 Ruff (0.15.5)
[error] 272-292: Possible SQL injection vector through string-based query construction
(S608)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mokume/io/feature.py` around lines 271 - 294, The parquet_db view is rebuilt
with SDRF-derived sample_accession and condition but self.samples is only
populated once in __init__, so callers (iter_samples, experimental_inference,
get_low_frequency_peptides) still see stale sample metadata; after creating the
parquet_db view in the method containing this SQL (in mokume/io/feature.py)
refresh the cached sample list by reassigning or repopulating self.samples from
the updated parquet_db (reuse the same logic that __init__ uses to build
samples) so downstream methods reflect the enriched SDRF values.
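A toy sketch of the stale-cache problem this comment describes, and of the refresh-after-rebuild fix, using a plain Python class rather than the real DuckDB-backed Feature API:

```python
class FeatureStore:
    """Illustrative stand-in: snapshots sample metadata at construction."""

    def __init__(self, rows):
        self._rows = rows
        self.samples = self._unique_samples()  # snapshot, like __init__

    def _unique_samples(self):
        return sorted({r["sample_accession"] for r in self._rows})

    def enrich_with_sdrf(self, mapping):
        # Rebuild the "view" with SDRF-derived sample accessions ...
        for r in self._rows:
            r["sample_accession"] = mapping.get(
                r["sample_accession"], r["sample_accession"]
            )
        # ... and refresh the cached metadata in the same method, so that
        # iterators and downstream consumers see the enriched names.
        self.samples = self._unique_samples()


store = FeatureStore([{"sample_accession": "run1"}, {"sample_accession": "run2"}])
store.enrich_with_sdrf({"run1": "Sample-1", "run2": "Sample-2"})
print(store.samples)
```

Without the final reassignment, `store.samples` would still report the raw run ids after enrichment, which is exactly the inconsistency flagged for `iter_samples()` and friends.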
| query = f""" | ||
| SELECT | ||
| pg_accessions, | ||
| sequence, | ||
| {charge_col} as precursor_charge, | ||
| {unnest_sql} | ||
| FROM parquet_raw, UNNEST(intensities) as unnest | ||
| WHERE unnest.intensity IS NOT NULL | ||
| AND {where_clause} | ||
| """ |
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the file and the specific lines mentioned
cd mokume && head -n 373 quantification/ratio.py | tail -n 20Repository: bigbio/mokume
Length of output: 816
🏁 Script executed:
# Let's get more context around those lines
cd mokume && sed -n '350,380p' quantification/ratio.pyRepository: bigbio/mokume
Length of output: 1151
🏁 Script executed:
# Find SQLFilterBuilder definition
fd -t f . -name "*.py" | xargs rg "class SQLFilterBuilder"Repository: bigbio/mokume
Length of output: 349
🏁 Script executed:
# Search for where_clause building in ratio.py
cd mokume && rg "where_clause" quantification/ratio.py -B 2 -A 2Repository: bigbio/mokume
Length of output: 337
🏁 Script executed:
# Find and read the SQLFilterBuilder class
rg "class SQLFilterBuilder" -A 50 mokume/io/feature.pyRepository: bigbio/mokume
Length of output: 1845
🏁 Script executed:
# Also look for the build_where_clause method
rg "def build_where_clause" -A 30 mokume/io/feature.pyRepository: bigbio/mokume
Length of output: 1171
🏁 Script executed:
# Check what columns SQLFilterBuilder processes
rg "intensity" mokume/io/feature.py -B 3 -A 3Repository: bigbio/mokume
Length of output: 6342
🏁 Script executed:
# Search for tests of load_psm_data function
rg "load_psm_data" --type pyRepository: bigbio/mokume
Length of output: 252
🏁 Script executed:
# Look for test files related to ratio or quantification
fd -type f -name "*test*.py" | xargs rg -l "load_psm_data\|ratio" 2>/dev/nullRepository: bigbio/mokume
Length of output: 229
🏁 Script executed:
# Check the full implementation of load_psm_data to see context
sed -n '320,400p' mokume/quantification/ratio.pyRepository: bigbio/mokume
Length of output: 3076
🏁 Script executed:
# Verify DuckDB's behavior by checking if there are any comments or issues about this
rg "SELECT.*as.*WHERE" mokume/quantification/ratio.py -A 5 -B 5Repository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Search for test files
find . -type f -name "*test*.py" -o -name "test_*.py" | head -20Repository: bigbio/mokume
Length of output: 404
🏁 Script executed:
# Look at how load_psm_data is called in stages.py
rg -B 10 -A 10 "load_psm_data\(" mokume/pipeline/stages.pyRepository: bigbio/mokume
Length of output: 795
🏁 Script executed:
# Check if there are any error messages or issues reported with the SQL
rg "Column.*intensity.*not found\|intensity.*not found" . --type pyRepository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Let me verify DuckDB behavior - check if there's a simple way to test SELECT alias in WHERE
cat > /tmp/test_duckdb.py << 'EOF'
import duckdb
# Test if DuckDB allows SELECT aliases in WHERE clause
try:
conn = duckdb.connect(':memory:')
# Create test data
conn.execute("""
CREATE TABLE test AS
SELECT 1 as id, ARRAY[10, 20, 30] as values
""")
# Try to use SELECT alias in WHERE clause
result = conn.execute("""
SELECT id, v as intensity
FROM test, UNNEST(values) as t(v)
WHERE intensity > 15
""").fetchall()
print("SUCCESS: DuckDB allows SELECT aliases in WHERE")
print(result)
except Exception as e:
print(f"FAILED: {e}")
# Test with column prefix instead
try:
result = conn.execute("""
SELECT id, v as intensity
FROM test, UNNEST(values) as t(v)
WHERE v > 15
""").fetchall()
print("\nSUCCESS: Using actual column name works")
print(result)
except Exception as e:
print(f"FAILED: {e}")
EOF
python /tmp/test_duckdb.pyRepository: bigbio/mokume
Length of output: 205
🏁 Script executed:
# Check what columns are expected in parquet_raw by looking at schema handling
rg "sequence\|unique" mokume/quantification/ratio.py -B 2 -A 2Repository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Look for any column mapping or schema definition
rg "CREATE VIEW parquet_raw" mokume/quantification/ratio.py -A 5 -B 5Repository: bigbio/mokume
Length of output: 366
🏁 Script executed:
# Check if there's a separate WHERE clause for different contexts
rg "where_clause.*=" mokume/quantification/ratio.py -B 3 -A 3Repository: bigbio/mokume
Length of output: 263
🏁 Script executed:
# Look at the full function signature and inputs to load_psm_data
sed -n '310,330p' mokume/quantification/ratio.pyRepository: bigbio/mokume
Length of output: 643
🏁 Script executed:
# Get the full load_psm_data function definition
sed -n '310,320p' mokume/quantification/ratio.pyRepository: bigbio/mokume
Length of output: 400
🏁 Script executed:
# Look for function signature
rg "def load_psm_data" -A 15 mokume/quantification/ratio.pyRepository: bigbio/mokume
Length of output: 478
🏁 Script executed:
# Check if sequence and unique are expected columns in parquet
rg "pg_accessions\|sequence\|unique" mokume/quantification/ratio.py | head -20Repository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Look at the complete SELECT query to understand column sourcing
sed -n '363,380p' mokume/quantification/ratio.pyRepository: bigbio/mokume
Length of output: 615
🏁 Script executed:
# Check the actual WHERE clause construction more carefully - does it prefix columns?
sed -n '250,280p' mokume/io/feature.pyRepository: bigbio/mokume
Length of output: 1230
🏁 Script executed:
# Check how intensities is structured - is it an array of structs?
rg "intensities" mokume/quantification/ratio.py -B 2 -A 2 | head -40Repository: bigbio/mokume
Length of output: 965
🏁 Script executed:
# Look at how the Feature class handles the same query pattern
sed -n '140,170p' mokume/io/feature.pyRepository: bigbio/mokume
Length of output: 1279
🏁 Script executed:
# Check if there are any known issues or comments about WHERE clause scope
rg "WHERE.*unnest\|WHERE.*intensity" mokume/io/feature.py -B 2 -A 2Repository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Check the exact place where where_clause is built and used
rg -B 5 "AND {where_clause}" mokume/Repository: bigbio/mokume
Length of output: 162
🏁 Script executed:
# Get the complete build_where_clause method
rg "def build_where_clause" -A 50 mokume/io/feature.pyRepository: bigbio/mokume
Length of output: 2060
🏁 Script executed:
# Check if there's a WHERE clause being used in other similar contexts
rg "UNNEST.*unnest.*WHERE" mokume/io/feature.py -A 3 -B 3Repository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Carefully look at how Feature class uses the filter builder
rg "filter_builder.build_where_clause" mokume/io/feature.py -B 5 -A 5Repository: bigbio/mokume
Length of output: 1527
🏁 Script executed:
# Now let me trace exactly how Feature class uses this - does it use the same where_clause pattern?
sed -n '155,175p' mokume/io/feature.pyRepository: bigbio/mokume
Length of output: 797
🏁 Script executed:
# Look at the view creation in Feature to understand the difference
sed -n '175,210p' mokume/io/feature.pyRepository: bigbio/mokume
Length of output: 1375
🏁 Script executed:
# See if there's any column scope documentation in comments
rg "column.*scope\|SELECT.*alias\|WHERE.*alias" mokume/Repository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Check if load_psm_data has been used in actual tests
find . -name "*.py" -path "*/tests/*" -exec grep -l "load_psm_data" {} \;Repository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Check if there are any subquery patterns used elsewhere with where_clause
rg "FROM.*SELECT" mokume/quantification/ratio.py mokume/io/feature.py -A 5 -B 5Repository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Look for any recent commits that might show this was added recently
cd mokume && git log --oneline -20 2>/dev/null | head -20Repository: bigbio/mokume
Length of output: 107
🏁 Script executed:
# Check if the function has been changed recently
cd mokume && git log --oneline mokume/quantification/ratio.py 2>/dev/null | head -10Repository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Verify the issue more carefully - check if DuckDB actually allows bare column names after UNNEST
# by looking at similar patterns in the codebase
rg "UNNEST.*as.*unnest" mokume/ -A 10 | grep -A 10 "SELECT"Repository: bigbio/mokume
Length of output: 197
🏁 Script executed:
# Check the git history of load_psm_data function specifically
cd mokume && git log -p --all -S "def load_psm_data" -- mokume/quantification/ratio.py 2>/dev/null | head -100Repository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# Check if this code path is actually exercised by looking for test files
find . -path "*/tests/*" -name "*.py" -exec grep -l "quantification\|ratio" {} \;Repository: bigbio/mokume
Length of output: 276
🏁 Script executed:
# Look at test_quantification_accuracy.py to see if it tests load_psm_data
rg "load_psm_data\|RatioQuantification" ./tests/test_quantification_accuracy.pyRepository: bigbio/mokume
Length of output: 39
🏁 Script executed:
# One more check - see if there's documentation on expected parquet schema
rg "intensities.*label.*intensity\|UNNEST.*intensities" mokume/ -B 2 -A 2Repository: bigbio/mokume
Length of output: 39
Wrap the UNNEST in a subquery or CTE before applying the filter clause.
SQLFilterBuilder.build_where_clause() emits predicates like intensity > 0 and "unique" = 1, but these reference columns that don't exist in the WHERE clause scope. The source field unnest.intensity is only projected as a SELECT alias, and "unique" is from the base table. DuckDB evaluates WHERE before SELECT projections, so unqualified column names will fail at runtime. Either use a CTE/subquery to materialize the unnested data first, or modify the filter builder to accept and apply column prefixes.
🧰 Tools
🪛 Ruff (0.15.5)
[error] 364-373: Possible SQL injection vector through string-based query construction
(S608)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mokume/quantification/ratio.py` around lines 364 - 373, The WHERE uses
unprojected aliases so DuckDB will fail; change the query in ratio.py where
query is constructed (using charge_col, unnest_sql, where_clause) to materialize
the unnested rows first via a CTE or subquery (e.g., WITH unnested AS (SELECT
parquet_raw.*, unnest.* FROM parquet_raw, UNNEST(intensities) AS unnest) SELECT
pg_accessions, sequence, {charge_col} AS precursor_charge, {unnest_sql} FROM
unnested WHERE {where_clause}) so predicates emitted by
SQLFilterBuilder.build_where_clause() can reference properly-scoped columns (or
alternatively update SQLFilterBuilder.build_where_clause to accept and apply
explicit prefixes like "unnest." to its column names); ensure all references to
intensity and "unique" are fully qualified to the subquery/CTE aliases.
| df = df.merge( | ||
| sdrf_map, on=['run_file_name', 'label'], how='left', | ||
| ) | ||
| df['sample_accession'] = df['source_name'].fillna(df['run_file_name']) | ||
| df['Fraction'] = df['fraction'].fillna('1') |
Do not silently fall back to run_file_name on SDRF join misses.
In mokume/pipeline/stages.py:215-266, both reference_samples and sample_to_plex are derived from SDRF sample names, not run ids. If a (run_file_name, label) pair misses here, the fallback sample id no longer matches those maps and the row is silently dropped during plex/reference matching. Fail fast on unmapped pairs instead of producing a partial matrix.
♻️ Suggested fix
df = df.merge(
sdrf_map, on=['run_file_name', 'label'], how='left',
)
- df['sample_accession'] = df['source_name'].fillna(df['run_file_name'])
+ unmapped = df['source_name'].isna()
+ if unmapped.any():
+ missing = df.loc[unmapped, ['run_file_name', 'label']].drop_duplicates()
+ raise ValueError(
+ "Missing SDRF rows for run/label pairs: "
+ f"{missing.to_dict('records')[:10]}"
+ )
+ df['sample_accession'] = df['source_name']
 df['Fraction'] = df['fraction'].fillna('1')
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mokume/quantification/ratio.py` around lines 396 - 400, The code currently
falls back to run_file_name when the SDRF join misses (df = df.merge(...);
df['sample_accession'] = df['source_name'].fillna(df['run_file_name'])), which
hides unmapped (run_file_name, label) pairs and later causes silent drops;
remove the fillna fallback and instead detect rows where the merge produced null
SDRF values (e.g., df['source_name'].isna()) immediately after the merge and
raise a clear error (RuntimeError) listing the offending (run_file_name, label)
combinations so the pipeline fails fast; update any downstream expectations
(e.g., mappings built in mokume/pipeline/stages.py) to rely on the validated
df['sample_accession'] populated only from SDRF values.
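The fail-fast behavior can be sketched without pandas; `map_samples` is an illustrative name, and plain dicts stand in for the merged DataFrame:

```python
def map_samples(rows, sdrf_map):
    """Map each PSM row to its SDRF sample, raising on any unmapped pair.

    rows: list of dicts with 'run_file_name' and 'label' keys.
    sdrf_map: {(run_file_name, label): sample_accession}.
    """
    unmapped = sorted(
        {(r["run_file_name"], r["label"]) for r in rows}
        - set(sdrf_map)
    )
    if unmapped:
        # Fail fast with the offending pairs instead of silently falling
        # back to the run id and dropping rows later during plex matching.
        raise RuntimeError(f"Missing SDRF rows for run/label pairs: {unmapped[:10]}")
    return [
        {**r, "sample_accession": sdrf_map[(r["run_file_name"], r["label"])]}
        for r in rows
    ]


sdrf = {("a.raw", "TMT126"): "Sample-1"}
print(map_samples([{"run_file_name": "a.raw", "label": "TMT126"}], sdrf))
```

The equivalent pandas check is the `df['source_name'].isna()` test in the suggested fix; the point in both is that a partial matrix is worse than a loud error.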
extract f-string backslash to _safe_json for Python 3.10/3.11 compat
Actionable comments posted: 8
♻️ Duplicate comments (1)
mokume/reports/workflow_comparison.py (1)
83-88: ⚠️ Potential issue | 🟠 Major
Avoid shadowing html and write the report with explicit UTF-8.
Line 83 masks the imported html module from Line 14, and Line 87 should explicitly use UTF-8 to match the document charset.
Suggested fix
- html = _build_comparison_html(
+ report_html = _build_comparison_html(
      title, workflow_metrics, concordance, marker_results, marker_genes,
  )
- with open(output_html, "w") as f:
-     f.write(html)
+ with open(output_html, "w", encoding="utf-8") as f:
+     f.write(report_html)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mokume/reports/workflow_comparison.py` around lines 83 - 88, The local variable named html is shadowing the imported html module; rename the result of _build_comparison_html (e.g., html_content or report_html) and then write the file using an explicit UTF-8 encoding (open(output_html, "w", encoding="utf-8")). Update references to the renamed variable and keep using the _build_comparison_html call and output_html target.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@mokume/reports/qc_report.py`:
- Around line 136-137: The code currently drops any protein column with any
missing value by using .dropna(axis=1) then filling NAs, which removes partially
observed features; change the drop to only remove columns that are entirely
missing (e.g., replace .dropna(axis=1) with .dropna(axis=1, how='all') or
.dropna(axis=1, thresh=1)), so that partially observed proteins are preserved
and then imputed via data.fillna(data.median()). Make this same change in the
other callsites inside _compute_tsne, _compute_silhouette, and
_compute_variance_decomposition where you see data =
log2_matrix[samples].T.dropna(axis=1) (and the analogous blocks at lines
referenced) so all four metrics use the same missing-value policy.
- Around line 319-321: The code computes n_components = min(10, data.shape[0] -
1, data.shape[1]) and immediately constructs PCA(n_components=n_components),
which will raise when n_components <= 0 (e.g., only one sample or no proteins);
update the PVCA branch to check n_components > 0 before creating PCA (and before
calling pca.fit_transform(data)), and if n_components <= 0 return or emit a safe
PVCA summary/placeholder result (or skip PVCA) so report generation does not
abort; reference the variables n_components, PCA, pca, pc_data and the input
matrix data when adding this guard.
- Around line 524-540: The condition labels (variable cond) are written raw into
HTML in the loops that build cv_summary_rows and missing_summary_rows, which can
break markup or allow injection; update the code that builds rows for
condition_cv and missing["per_condition"] to HTML-escape cond (e.g., use
html.escape or equivalent) before concatenation, ensure the html module is
imported if not already, and use the escaped value in the f-strings for
cv_summary_rows, intra_corr lookup context (still use original cond for keys if
needed), and missing_summary_rows to safely render condition names.
- Around line 139-149: The PCA payload can be invalid for single-component
inputs because pc2 is padded but var_explained remains a single entry; update
the PCA post-processing (around n_components, PCA(), coords, and the result dict
construction) to ensure var_explained is padded to at least two entries (e.g.,
append 0 when n_components < 2) so entries "pc2" and "var_explained" always have
two elements and downstream code referencing pca.var_explained[1] won’t fail.
In `@mokume/reports/workflow_comparison.py`:
- Around line 456-468: The template assignment to tr.innerHTML injects unescaped
dynamic values (e.g., row.name, row.n_proteins, row.pi1, row.effect_size,
row.direction_balance and any marker/comparison strings) creating an XSS risk;
replace this innerHTML string construction with safe DOM creation or
text-escaping: build the <td> elements using document.createElement /
textContent (or a sanitizer/escape function) and set className from
cellClass(...) but assign cell.textContent for values (handling NaN formatting
before setting text) so no untrusted HTML is inserted; update all similar blocks
(the ones at the top of this diff and the other occurrences noted around lines
527-544 and 611-636) to follow the same pattern.
- Around line 176-177: The CAT curve disappears when max_k < 10 because ks is
constructed with range(10, max_k+1, 5) which yields an empty list; update the ks
construction to always include valid k values by checking max_k and falling back
to a smaller-step or start value: for example, if max_k < 10 set ks =
list(range(1, max_k + 1)) (or use range(1, max_k+1, max(1, 5//max_k))) otherwise
keep ks = list(range(10, max_k + 1, 5)); modify the code around max_k, sorted_a,
sorted_b and ks to implement this conditional so small result sets still produce
a CAT curve.
- Around line 516-517: The x/y axis title construction in workflow_comparison.py
directly indexes pca.var_explained[0] and pca.var_explained[1], which throws
when PCA produced fewer than two components; update the code that builds the
axis titles (the expressions using pca.var_explained) to guard access—check
length or existence before reading index 1 and use a safe fallback (e.g., omit
the percent or use '0.0%') for PC2 when pca.var_explained has only one value;
ensure the same guarded logic is applied for both xaxis and yaxis title
formulations to avoid report rendering errors.
In `@tests/test_quantification_accuracy.py`:
- Around line 351-354: Several print statements use an unnecessary f-string
prefix for plain literal messages; remove the leading f from those print calls
that have no {placeholders} (e.g., the plain label print " [Small DIA-NN Subset]
MaxLFQ results:" and the equivalent plain-label prints elsewhere) while leaving
f-strings that do include expressions (like print(... {len(result):,} ...) and
the intensity range print) unchanged; update the print calls that match the
literal label strings to regular string literals (remove the f) so Ruff F541 is
resolved.
---
Duplicate comments:
In `@mokume/reports/workflow_comparison.py`:
- Around line 83-88: The local variable named html is shadowing the imported
html module; rename the result of _build_comparison_html (e.g., html_content or
report_html) and then write the file using an explicit UTF-8 encoding
(open(output_html, "w", encoding="utf-8")). Update references to the renamed
variable and keep using the _build_comparison_html call and output_html target.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2a1223db-e1a1-4423-b453-0ee5bb356996
📒 Files selected for processing (4)
.github/workflows/python-app.yml
mokume/reports/qc_report.py
mokume/reports/workflow_comparison.py
tests/test_quantification_accuracy.py
| data = log2_matrix[samples].T.dropna(axis=1) | ||
| data = data.fillna(data.median()) |
Preserve partially observed proteins before imputation.
Lines 136-137, 164-165, 277-278, and 312-313 drop any protein with a single missing sample, so the following fillna(data.median()) never gets to impute anything. On proteomics matrices, that can collapse PCA/t-SNE/silhouette/PVCA onto a tiny complete-case subset, or leave no features at all.
Suggested change
- data = log2_matrix[samples].T.dropna(axis=1)
+ data = log2_matrix[samples].T.dropna(axis=1, how="all")
 data = data.fillna(data.median())

Apply the same change in _compute_tsne, _compute_silhouette, and _compute_variance_decomposition so all four metrics use the same missing-value policy.
Also applies to: 164-165, 277-278, 312-313
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mokume/reports/qc_report.py` around lines 136 - 137, The code currently drops
any protein column with any missing value by using .dropna(axis=1) then filling
NAs, which removes partially observed features; change the drop to only remove
columns that are entirely missing (e.g., replace .dropna(axis=1) with
.dropna(axis=1, how='all') or .dropna(axis=1, thresh=1)), so that partially
observed proteins are preserved and then imputed via data.fillna(data.median()).
Make this same change in the other callsites inside _compute_tsne,
_compute_silhouette, and _compute_variance_decomposition where you see data =
log2_matrix[samples].T.dropna(axis=1) (and the analogous blocks at lines
referenced) so all four metrics use the same missing-value policy.
```python
n_components = min(3, data.shape[0], data.shape[1])
pca = PCA(n_components=n_components)
coords = pca.fit_transform(data)

result = {
    "samples": list(data.index),
    "conditions": [sample_to_condition.get(s, "Unknown") for s in data.index],
    "plexes": [sample_to_plex.get(s, "plex1") for s in data.index],
    "pc1": coords[:, 0].tolist(),
    "pc2": coords[:, 1].tolist() if n_components >= 2 else [0] * len(data),
    "var_explained": (pca.explained_variance_ratio_ * 100).tolist(),
```
Return a stable 2D PCA payload for small inputs.
When only one principal component is available, this function pads pc2 but leaves var_explained with a single entry. Line 808 later does pca.var_explained[1].toFixed(1), which throws and aborts the rest of the report script.
Suggested change
```diff
  n_components = min(3, data.shape[0], data.shape[1])
+ if n_components == 0:
+     return {
+         "samples": list(data.index),
+         "conditions": [sample_to_condition.get(s, "Unknown") for s in data.index],
+         "plexes": [sample_to_plex.get(s, "plex1") for s in data.index],
+         "pc1": [0.0] * len(data),
+         "pc2": [0.0] * len(data),
+         "var_explained": [0.0, 0.0],
+     }
  pca = PCA(n_components=n_components)
  coords = pca.fit_transform(data)
+ var_explained = (pca.explained_variance_ratio_ * 100).tolist()
+ if len(var_explained) < 2:
+     var_explained.extend([0.0] * (2 - len(var_explained)))
  result = {
      "samples": list(data.index),
      "conditions": [sample_to_condition.get(s, "Unknown") for s in data.index],
      "plexes": [sample_to_plex.get(s, "plex1") for s in data.index],
      "pc1": coords[:, 0].tolist(),
      "pc2": coords[:, 1].tolist() if n_components >= 2 else [0] * len(data),
-     "var_explained": (pca.explained_variance_ratio_ * 100).tolist(),
+     "var_explained": var_explained,
  }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mokume/reports/qc_report.py` around lines 139 - 149, The PCA payload can be
invalid for single-component inputs because pc2 is padded but var_explained
remains a single entry; update the PCA post-processing (around n_components,
PCA(), coords, and the result dict construction) to ensure var_explained is
padded to at least two entries (e.g., append 0 when n_components < 2) so entries
"pc2" and "var_explained" always have two elements and downstream code
referencing pca.var_explained[1] won’t fail.
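The padding logic can be isolated in a small helper. This is a sketch, not the module's actual code: `pad_pca_payload` is a hypothetical name, and the inputs stand in for `pca.fit_transform` output and `explained_variance_ratio_`:

```python
import numpy as np

def pad_pca_payload(coords: np.ndarray, var_ratio: np.ndarray, n_rows: int) -> dict:
    """Return a PCA payload that always carries two PCs and two variance entries."""
    n_components = coords.shape[1] if coords.size else 0
    var_explained = (var_ratio * 100).tolist()
    if len(var_explained) < 2:
        # Pad so downstream JS reading var_explained[1] never throws.
        var_explained.extend([0.0] * (2 - len(var_explained)))
    pc1 = coords[:, 0].tolist() if n_components >= 1 else [0.0] * n_rows
    pc2 = coords[:, 1].tolist() if n_components >= 2 else [0.0] * n_rows
    return {"pc1": pc1, "pc2": pc2, "var_explained": var_explained}

# One-component case: pc2 and var_explained[1] are still present.
payload = pad_pca_payload(np.array([[1.5], [-1.5]]), np.array([1.0]), n_rows=2)
print(payload["var_explained"])  # [100.0, 0.0]
print(payload["pc2"])            # [0.0, 0.0]
```

Padding on the Python side keeps the report's JavaScript free of defensive length checks.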
```python
n_components = min(10, data.shape[0] - 1, data.shape[1])
pca = PCA(n_components=n_components)
pc_data = pca.fit_transform(data)
```
Guard PVCA before constructing PCA(n_components=0).
Line 319 evaluates to 0 whenever only one sample survives preprocessing or all proteins are filtered out. PCA(n_components=0) raises here, so a small/empty matrix can abort report generation instead of returning a safe PVCA summary.
Suggested change
```diff
  n_components = min(10, data.shape[0] - 1, data.shape[1])
+ if n_components <= 0:
+     return {
+         "condition_pct": 0.0,
+         "plex_pct": 0.0,
+         "residual_pct": 100.0,
+         "n_pcs_used": 0,
+         "total_var_captured": 0.0,
+     }
  pca = PCA(n_components=n_components)
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
n_components = min(10, data.shape[0] - 1, data.shape[1])
if n_components <= 0:
    return {
        "condition_pct": 0.0,
        "plex_pct": 0.0,
        "residual_pct": 100.0,
        "n_pcs_used": 0,
        "total_var_captured": 0.0,
    }
pca = PCA(n_components=n_components)
pc_data = pca.fit_transform(data)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mokume/reports/qc_report.py` around lines 319 - 321, The code computes
n_components = min(10, data.shape[0] - 1, data.shape[1]) and immediately
constructs PCA(n_components=n_components), which will raise when n_components <=
0 (e.g., only one sample or no proteins); update the PVCA branch to check
n_components > 0 before creating PCA (and before calling
pca.fit_transform(data)), and if n_components <= 0 return or emit a safe PVCA
summary/placeholder result (or skip PVCA) so report generation does not abort;
reference the variables n_components, PCA, pca, pc_data and the input matrix
data when adding this guard.
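The guard condition itself can be sketched independently of the PCA fit. `pvca_or_placeholder` is a hypothetical helper, not the module's real function; it only shows how the component count and the placeholder branch interact:

```python
def pvca_or_placeholder(n_samples: int, n_features: int, cap: int = 10) -> dict:
    """Sketch of the guard: skip PCA entirely when no component is available."""
    n_components = min(cap, n_samples - 1, n_features)
    if n_components <= 0:
        # Safe summary so report generation continues instead of raising.
        return {
            "condition_pct": 0.0,
            "plex_pct": 0.0,
            "residual_pct": 100.0,
            "n_pcs_used": 0,
            "total_var_captured": 0.0,
        }
    # The real implementation would fit PCA(n_components=n_components) here.
    return {"n_pcs_used": n_components}

print(pvca_or_placeholder(1, 500))   # placeholder: one sample, zero components
print(pvca_or_placeholder(12, 500))  # {'n_pcs_used': 10}
```

Note that `data.shape[0] - 1` makes a single surviving sample yield zero components, which is exactly the case `PCA(n_components=0)` rejects.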
```python
for cond, cv_data in condition_cv.items():
    cv_summary_rows += (
        f"<tr><td>{cond}</td>"
        f"<td>{cv_data['median_cv']:.3f}</td>"
        f"<td>{cv_data['q25']:.3f} - {cv_data['q75']:.3f}</td>"
        f"<td>{intra_corr.get(cond, float('nan')):.3f}</td></tr>"
    )

# Format missing value summary
missing_summary_rows = ""
for cond, m in missing["per_condition"].items():
    total = m["complete"] + m["partially_missing"] + m["completely_missing"]
    missing_summary_rows += (
        f"<tr><td>{cond}</td>"
        f"<td>{m['complete']}</td>"
        f"<td>{m['partially_missing']}</td>"
        f"<td>{m['completely_missing']}</td>"
```
Escape condition labels before writing HTML rows.
Lines 526 and 537 insert cond directly into the page. A condition name containing <, &, or quotes will corrupt the markup, and crafted metadata can inject HTML/JS into the generated report.
Suggested change
```diff
  cv_summary_rows = ""
  for cond, cv_data in condition_cv.items():
+     cond_html = html_module.escape(cond)
      cv_summary_rows += (
-         f"<tr><td>{cond}</td>"
+         f"<tr><td>{cond_html}</td>"
          f"<td>{cv_data['median_cv']:.3f}</td>"
          f"<td>{cv_data['q25']:.3f} - {cv_data['q75']:.3f}</td>"
          f"<td>{intra_corr.get(cond, float('nan')):.3f}</td></tr>"
      )
  # Format missing value summary
  missing_summary_rows = ""
  for cond, m in missing["per_condition"].items():
+     cond_html = html_module.escape(cond)
      total = m["complete"] + m["partially_missing"] + m["completely_missing"]
      missing_summary_rows += (
-         f"<tr><td>{cond}</td>"
+         f"<tr><td>{cond_html}</td>"
          f"<td>{m['complete']}</td>"
          f"<td>{m['partially_missing']}</td>"
          f"<td>{m['completely_missing']}</td>"
          f"<td>{total}</td></tr>"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mokume/reports/qc_report.py` around lines 524 - 540, The condition labels
(variable cond) are written raw into HTML in the loops that build
cv_summary_rows and missing_summary_rows, which can break markup or allow
injection; update the code that builds rows for condition_cv and
missing["per_condition"] to HTML-escape cond (e.g., use html.escape or
equivalent) before concatenation, ensure the html module is imported if not
already, and use the escaped value in the f-strings for cv_summary_rows,
intra_corr lookup context (still use original cond for keys if needed), and
missing_summary_rows to safely render condition names.
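A minimal sketch of the escaping step, using the standard library's `html.escape` and a hypothetical condition label chosen to contain the problem characters:

```python
import html

# Hypothetical CV summary with a condition name containing <, &, and quotes.
condition_cv = {"A<B & 'ctrl'": {"median_cv": 0.12, "q25": 0.08, "q75": 0.20}}

rows = ""
for cond, cv_data in condition_cv.items():
    cond_html = html.escape(cond)  # escapes &, <, >, and both quote styles
    rows += (
        f"<tr><td>{cond_html}</td>"
        f"<td>{cv_data['median_cv']:.3f}</td>"
        f"<td>{cv_data['q25']:.3f} - {cv_data['q75']:.3f}</td></tr>"
    )

print(rows)
# <tr><td>A&lt;B &amp; &#x27;ctrl&#x27;</td><td>0.120</td><td>0.080 - 0.200</td></tr>
```

Note that the unescaped `cond` should still be used for dictionary lookups (such as `intra_corr.get(cond, ...)`); only the rendered text changes.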
```python
max_k = min(max_k, len(sorted_a), len(sorted_b))
ks = list(range(10, max_k + 1, 5))
```
CAT curve silently disappears for small result sets (max_k < 10).
For small DE tables, ks becomes empty and the CAT section shows no curve even though data exists.
Suggested fix
```diff
- ks = list(range(10, max_k + 1, 5))
+ start_k = 1 if max_k < 10 else 10
+ step = 1 if max_k < 10 else 5
+ ks = list(range(start_k, max_k + 1, step))
```
📝 Committable suggestion
```python
max_k = min(max_k, len(sorted_a), len(sorted_b))
start_k = 1 if max_k < 10 else 10
step = 1 if max_k < 10 else 5
ks = list(range(start_k, max_k + 1, step))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mokume/reports/workflow_comparison.py` around lines 176 - 177, The CAT curve
disappears when max_k < 10 because ks is constructed with range(10, max_k+1, 5)
which yields an empty list; update the ks construction to always include valid k
values by checking max_k and falling back to a smaller-step or start value: for
example, if max_k < 10 set ks = list(range(1, max_k + 1)) (or use range(1,
max_k+1, max(1, 5//max_k))) otherwise keep ks = list(range(10, max_k + 1, 5));
modify the code around max_k, sorted_a, sorted_b and ks to implement this
conditional so small result sets still produce a CAT curve.
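The empty-range failure and the fallback are easy to verify in isolation; `cat_ks` below is a hypothetical helper name wrapping the suggested logic:

```python
def cat_ks(max_k: int) -> list:
    """k values for a CAT curve; fall back to step 1 for small result sets."""
    if max_k < 10:
        return list(range(1, max_k + 1))
    return list(range(10, max_k + 1, 5))

# Original construction: empty for max_k < 10, so no curve is drawn.
print(list(range(10, 7 + 1, 5)))  # []

# With the fallback, small DE tables still get a curve.
print(cat_ks(7))   # [1, 2, 3, 4, 5, 6, 7]
print(cat_ks(25))  # [10, 15, 20, 25]
```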
```javascript
tr.innerHTML = `
    <td style="text-align:left;font-weight:bold">${{row.name}}</td>
    <td>${{row.n_proteins}}</td>
    <td class="${{cellClass('silhouette', row.silhouette)}}">${{row.silhouette.toFixed(3)}}</td>
    <td class="${{cellClass('var_condition', row.var_condition)}}">${{row.var_condition.toFixed(1)}}%</td>
    <td class="${{cellClass('var_plex', row.var_plex)}}">${{row.var_plex.toFixed(1)}}%</td>
    <td class="${{cellClass('median_cv', row.median_cv)}}">${{row.median_cv.toFixed(3)}}</td>
    <td class="${{cellClass('replicate_corr', row.replicate_corr)}}">${{row.replicate_corr.toFixed(3)}}</td>
    <td>${{row.n_de}} (${{row.n_up}}/${{row.n_down}})</td>
    <td class="${{cellClass('pi1', row.pi1)}}">${{isNaN(row.pi1) ? 'N/A' : row.pi1.toFixed(2)}}</td>
    <td class="${{cellClass('effect_size', row.effect_size)}}">${{row.effect_size.toFixed(2)}}</td>
    <td>${{isNaN(row.direction_balance) ? 'N/A' : row.direction_balance.toFixed(2)}}</td>
`;
```
Unescaped dynamic values are injected into HTML (innerHTML) — XSS risk.
Workflow names, comparison keys, marker names/accessions/significance are rendered as raw HTML. If any input contains markup/script, the report becomes executable content.
Suggested hardening
```diff
@@
  (function() {{
+   function escapeHtml(v) {{
+     return String(v)
+       .replaceAll('&', '&amp;')
+       .replaceAll('<', '&lt;')
+       .replaceAll('>', '&gt;')
+       .replaceAll('"', '&quot;')
+       .replaceAll("'", '&#39;');
+   }}
@@
-   <td style="text-align:left;font-weight:bold">${{row.name}}</td>
+   <td style="text-align:left;font-weight:bold">${{escapeHtml(row.name)}}</td>
@@
-   <td style="text-align:left">${{key}}</td>
+   <td style="text-align:left">${{escapeHtml(key)}}</td>
```
And in `_build_marker_table(marker_results, marker_genes, workflow_metrics)`:
```diff
@@
-     for wf in wf_names:
-         header += f"<th colspan='2'>{wf}</th>"
+     for wf in wf_names:
+         header += f"<th colspan='2'>{html.escape(str(wf))}</th>"
@@
-     row = f"<tr><td><b>{info['name']}</b></td><td>{acc}</td>"
-     row += f"<td class='{'good' if info['expected'] == 'UP' else 'bad'}'>{info['expected']}</td>"
+     safe_gene = html.escape(str(info["name"]))
+     safe_acc = html.escape(str(acc))
+     safe_expected = html.escape(str(info["expected"]))
+     row = f"<tr><td><b>{safe_gene}</b></td><td>{safe_acc}</td>"
+     row += f"<td class='{'good' if info['expected'] == 'UP' else 'bad'}'>{safe_expected}</td>"
@@
-     sig = r["significance"]
+     sig = html.escape(str(r["significance"]))
```
Also applies to: 527-544, 611-636
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mokume/reports/workflow_comparison.py` around lines 456 - 468, The template
assignment to tr.innerHTML injects unescaped dynamic values (e.g., row.name,
row.n_proteins, row.pi1, row.effect_size, row.direction_balance and any
marker/comparison strings) creating an XSS risk; replace this innerHTML string
construction with safe DOM creation or text-escaping: build the <td> elements
using document.createElement / textContent (or a sanitizer/escape function) and
set className from cellClass(...) but assign cell.textContent for values
(handling NaN formatting before setting text) so no untrusted HTML is inserted;
update all similar blocks (the ones at the top of this diff and the other
occurrences noted around lines 527-544 and 611-636) to follow the same pattern.
```javascript
xaxis: {{ title: 'PC1 (' + pca.var_explained[0].toFixed(1) + '%)' }},
yaxis: {{ title: 'PC2 (' + pca.var_explained[1].toFixed(1) + '%)' }},
```
Guard PCA variance labels when fewer than 2 components are available.
Line 517 assumes pca.var_explained[1] exists; with 1-component PCA this throws and breaks report rendering.
Suggested fix
```diff
- Plotly.newPlot(`pca-${{wfIdx}}`, traces, {{
+ const pc1Var = (pca.var_explained && pca.var_explained.length > 0) ? pca.var_explained[0] : 0;
+ const pc2Var = (pca.var_explained && pca.var_explained.length > 1) ? pca.var_explained[1] : 0;
+ Plotly.newPlot(`pca-${{wfIdx}}`, traces, {{
    title: {{ text: wfName, font: {{ size: 14 }} }},
-   xaxis: {{ title: 'PC1 (' + pca.var_explained[0].toFixed(1) + '%)' }},
-   yaxis: {{ title: 'PC2 (' + pca.var_explained[1].toFixed(1) + '%)' }},
+   xaxis: {{ title: 'PC1 (' + pc1Var.toFixed(1) + '%)' }},
+   yaxis: {{ title: 'PC2 (' + pc2Var.toFixed(1) + '%)' }},
    legend: {{ font: {{ size: 9 }}, orientation: 'h', y: -0.15 }},
    margin: {{ t: 50, b: 80, l: 60, r: 20 }},
  }}, {{responsive: true}});
```
Verify each finding against the current code and only fix it if needed.
In `@mokume/reports/workflow_comparison.py` around lines 516 - 517, The x/y axis
title construction in workflow_comparison.py directly indexes
pca.var_explained[0] and pca.var_explained[1], which throws when PCA produced
fewer than two components; update the code that builds the axis titles (the
expressions using pca.var_explained) to guard access—check length or existence
before reading index 1 and use a safe fallback (e.g., omit the percent or use
'0.0%') for PC2 when pca.var_explained has only one value; ensure the same
guarded logic is applied for both xaxis and yaxis title formulations to avoid
report rendering errors.
```python
print(f"\n[Small DIA-NN Subset] MaxLFQ results:")
print(f"  Protein-sample combinations: {len(result):,}")
print(f"  Intensity range: {result['MaxLFQIntensity'].min():.2f} - {result['MaxLFQIntensity'].max():.2f}")
print(f"  Intensity range: {result['Intensity'].min():.2f} - {result['Intensity'].max():.2f}")
```
Remove extraneous f prefix from f-strings without placeholders.
Static analysis (Ruff F541) correctly identifies that line 351 has an f-string without any placeholders. The same issue exists at lines 486 and 540.
Proposed fix
```diff
- print(f"\n[Small DIA-NN Subset] MaxLFQ results:")
+ print("\n[Small DIA-NN Subset] MaxLFQ results:")
```
Line 486:
```diff
- print(f"\n[Small DIA-NN Subset] Built-in MaxLFQ fallback:")
+ print("\n[Small DIA-NN Subset] Built-in MaxLFQ fallback:")
```
Line 540:
```diff
- print(f"\n[PRIDE PXD063291] MaxLFQ results:")
+ print("\n[PRIDE PXD063291] MaxLFQ results:")
```
📝 Committable suggestion
```python
print("\n[Small DIA-NN Subset] MaxLFQ results:")
print(f"  Protein-sample combinations: {len(result):,}")
print(f"  Intensity range: {result['Intensity'].min():.2f} - {result['Intensity'].max():.2f}")
```
🧰 Tools
🪛 Ruff (0.15.5)
[error] 351-351: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_quantification_accuracy.py` around lines 351 - 354, Several print
statements use an unnecessary f-string prefix for plain literal messages; remove
the leading f from those print calls that have no {placeholders} (e.g., the
plain label print " [Small DIA-NN Subset] MaxLFQ results:" and the equivalent
plain-label prints elsewhere) while leaving f-strings that do include
expressions (like print(... {len(result):,} ...) and the intensity range print)
unchanged; update the print calls that match the literal label strings to
regular string literals (remove the f) so Ruff F541 is resolved.