Added Metamorpheus Datasets - MBR+NoMBR wNormalization to benchmarking #163
base: devel
Conversation
PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

PR Code Suggestions ✨

Explore these optional code suggestions:
```r
input = input %>% filter(`Protein Group` %in% protein_mappings$`Protein Groups`)

output = MetamorpheusToMSstatsFormat(input, annot)
```
The `MetamorpheusToMSstatsFormat` function also has these two parameters, both of which default to `TRUE`:

- `removeFewMeasurements`
- `removeProtein_with1Feature`
Could you double check there aren't major differences in empirical FDR when these two parameters are set to FALSE? I'm thinking due to the absence of PIP, certain proteins may be filtered out altogether, which could explain better empirical FDR with no PIP.
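To make that check concrete, here is a minimal sketch of the side-by-side run (parameter names as listed above; `input`, `annot`, `protein_mappings`, `dataProcess`, and `calculate_Metrics` are the objects and helpers introduced elsewhere in this PR, so treat the exact call shapes as assumptions):

```r
# Sketch only: run the converter with default pre-filtering and with both
# pre-filters disabled, then compare empirical FDR on the two outputs.
output_default <- MetamorpheusToMSstatsFormat(input, annot)
output_nofilter <- MetamorpheusToMSstatsFormat(
  input, annot,
  removeFewMeasurements = FALSE,
  removeProtein_with1Feature = FALSE
)

# Summarize each and compute metrics at the same threshold for comparison:
metrics_default  <- calculate_Metrics(dataProcess(output_default),  protein_mappings, "default-filters")
metrics_nofilter <- calculate_Metrics(dataProcess(output_nofilter), protein_mappings, "no-filters")
```

If the two runs diverge mainly in proteins with few measurements, that would support the PIP-absence explanation above.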
Walkthrough

Retargets CI to a different HPC host and secret, updates remote paths, Slurm job config, and the R environment; adds a MetaMorpheus benchmarking R harness and a metrics helper; and updates dataset entries in the benchmark controller JSON.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Dev as GitHub Actions
    participant SSH as Explorer SSH
    participant Slurm as Slurm Scheduler
    participant Node as Compute Node
    Dev->>SSH: Setup SSH (new key/host)
    Dev->>SSH: rsync benchmark/ -> /projects/.../benchmark
    Dev->>SSH: sbatch /projects/.../benchmark/config.slurm
    SSH->>Slurm: Submit job
    Slurm-->>Dev: Job ID
    Slurm->>Node: Launch job
    Node->>Node: Setup R env, install deps
    Node->>Node: Run R scripts (line-buffered)
    Node-->>SSH: Write job_output.txt / job_error.txt
    Dev->>SSH: Retrieve outputs from /projects/.../benchmark/
    Dev-->>Dev: Upload artifact
```

```mermaid
sequenceDiagram
    autonumber
    participant R as benchmark_Metamorpheus.R
    participant CFG as scriptController.json
    participant Proc as MSstats/MSstatsConvert
    participant Met as calculate_Metrics
    R->>CFG: Load dataset config
    R->>R: Read QuantifiedPeaks/Proteins + annotation
    R->>R: Filter decoys/multiproteins and map organisms
    R->>Proc: MetamorpheusToMSstatsFormat()
    par Six processing configs
        R->>Proc: dataProcess(...) per config
    and
    end
    loop For each summarized result
        R->>Met: calculate_Metrics(summarized, mappings, label)
    end
    R->>R: Aggregate FDR metrics and print timing
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~40 minutes
Actionable comments posted: 9
♻️ Duplicate comments (5)
metamorpheus_code.R (3)
21-26: Remove commented-out code blocks

These stale comments add noise.

Apply this diff:

```diff
-# input_no_mbr$`Protein Group` = ifelse(
-#   input_no_mbr$`Protein Group` %in% ecoli$`Protein Groups`,
-#   paste(input_no_mbr$`Protein Group`, "|ECOLI", sep = ""),
-#   paste(input_no_mbr$`Protein Group`, "|HUMAN", sep = ""))
-# write.csv(input_no_mbr, "QuantifiedPeaks.csv", row.names = FALSE)
```

and

```diff
-# input$`Protein Group` = ifelse(
-#   input$`Protein Group` %in% ecoli$`Protein Groups`,
-#   paste(input$`Protein Group`, "|ECOLI", sep = ""),
-#   paste(input$`Protein Group`, "|HUMAN", sep = ""))
-# write.csv(input, "QuantifiedPeaks-MBR.csv", row.names = FALSE)
```

Also applies to: 73-78
1-112: Please remove this exploratory script from the PR (env-specific, redundant, and not used in the CI path)

This file hard-codes absolute HPC paths, duplicates logic that now exists in the benchmarking harness, and includes ad-hoc visualization (hist) and commented code. It's not invoked by the workflow, and a prior review already requested its removal.

Recommendation:

- Remove this file from the PR (or move it to a docs/examples location with parameterized paths and no hard-coded env specifics).
- The CI and Slurm pipeline already use benchmark/benchmark_Metamorpheus.R + benchmark/metamorpheus_Process.R.
49-56: Don't hard-code a single comparison; reuse calculate_Metrics and compute robust FDRs

You're filtering only "B-A" and doing one-off FDR logic here. The PR already introduces calculate_Metrics for exactly this. Also guard against division by zero.

Apply this diff:

```diff
-e_group_no_mbr = model_no_mbr$ComparisonResult %>% filter(Label == "B-A") %>% filter(is.na(issue))
-ecoli_no_mbr = e_group_no_mbr %>% filter(ecoli == TRUE)
-hist(ecoli_no_mbr$log2FC)
-
-ecoli_no_mbr = e_group_no_mbr %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == TRUE)
-human_no_mbr = e_group_no_mbr %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == FALSE)
-FDR_no_mbr = nrow(human_no_mbr) / (nrow(ecoli_no_mbr) + nrow(human_no_mbr))
+source("benchmark/metamorpheus_Process.R")
+metrics_no_mbr <- calculate_Metrics(QuantData_no_mbr, protein_mappings, task_label = "NoMBR", alpha = 0.05)
+print(metrics_no_mbr)
```

benchmark/benchmark_Metamorpheus.R (2)
31-35: Prefer using the Organism column in QuantifiedPeaks.tsv instead of joining via QuantifiedProteins.tsv

Per earlier feedback, newer MetaMorpheus outputs include Organism directly in QuantifiedPeaks.tsv. Filtering in-place avoids an extra read and potential mismatches between "Protein Group(s)" columns across files.

Apply this diff to inline the filter with a fallback to the old behavior if Organism is absent:

```diff
-protein_mappings = data.table::fread(file.path(filePath, "QuantifiedProteins.tsv"))
-protein_mappings = protein_mappings %>% filter(Organism %in% c("Escherichia coli (strain K12)", "Homo sapiens"))
-
-input = input %>% filter(`Protein Group` %in% protein_mappings$`Protein Groups`)
+if ("Organism" %in% names(input)) {
+  input <- input %>%
+    dplyr::filter(Organism %in% c("Escherichia coli (strain K12)", "Homo sapiens"))
+} else {
+  protein_mappings <- data.table::fread(file.path(filePath, "QuantifiedProteins.tsv"))
+  protein_mappings <- protein_mappings %>%
+    dplyr::filter(Organism %in% c("Escherichia coli (strain K12)", "Homo sapiens"))
+  input <- input %>%
+    dplyr::filter(`Protein Group` %in% protein_mappings$`Protein Groups`)
+}
```

If organism strings vary across datasets (e.g., "Escherichia coli str. K-12"), consider a regex-based match or drive organism filters from config.
36-36: Double-check MSstatsConvert pre-filtering options for FDR comparability

MetamorpheusToMSstatsFormat likely forwards or reimplements MSstatsConvert behavior. Please verify the effect of:

- removeFewMeasurements
- removeProtein_with1Feature
Defaults are typically TRUE and may change empirical FDR (especially without PIP). Consider exposing these via config or explicitly setting them here for the benchmarking runs to ensure apples-to-apples comparisons.
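For reference, the empirical-FDR quantity being compared here reduces to the fraction of significant calls that are human (i.e., false positives in a spike-in design where only E. coli proteins truly change). A minimal sketch, assuming a `ComparisonResult`-style data frame with `adj.pvalue` and an `ecoli` flag as used in the snippets above:

```r
# Sketch: empirical FDR among significant comparisons, with a
# divide-by-zero guard for the no-significant-hits case.
empirical_fdr <- function(comp, alpha = 0.05) {
  sig <- comp[!is.na(comp$adj.pvalue) & comp$adj.pvalue < alpha, ]
  tp <- sum(sig$ecoli)    # E. coli hits: expected changers
  fp <- sum(!sig$ecoli)   # human hits: false positives
  if (tp + fp == 0) return(NA_real_)
  fp / (tp + fp)
}
```

Running this on both parameterizations at the same alpha gives directly comparable numbers.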
🧹 Nitpick comments (11)
benchmark/metamorpheus_Process.R (2)
19-22: Harden organism matching and avoid reliance on attached packages

Hard-coded exact string matching is brittle. Also, this function assumes dplyr is attached. Prefer explicit namespaces and robust matching.

Apply this diff:

```diff
-  ecoli_ids <- protein_mappings %>%
-    filter(Organism == "Escherichia coli (strain K12)") %>%
-    pull(`Protein Groups`)
+  ecoli_ids <- protein_mappings |>
+    dplyr::filter(grepl("Escherichia coli", Organism, fixed = TRUE)) |>
+    dplyr::pull(`Protein Groups`)
```

If you prefer to keep `%>%`, either import magrittr or switch the file to use `|>` consistently. Using namespaces (dplyr::) keeps this helper self-contained and safer when sourced.
23-26: Namespace dplyr calls to reduce hidden dependencies

This function uses dplyr verbs without importing. Namespace them to avoid runtime errors when the caller hasn't attached dplyr.

Apply this diff:

```diff
-  comp <- model$ComparisonResult %>%
-    mutate(ecoli = Protein %in% ecoli_ids) %>%
-    filter(is.na(issue))
+  comp <- model$ComparisonResult |>
+    dplyr::mutate(ecoli = Protein %in% ecoli_ids) |>
+    dplyr::filter(is.na(issue))
```

.github/workflows/benchmark.yml (4)
24-26: Sanity-check SSH key handling

Using echo for multi-line secrets often works, but printf is safer. Not mandatory, but reduces surprises with shell expansions.

Apply this diff:

```diff
-          echo "${{ secrets.SSH_PRIVATE_KEY_EXPLORER }}" > ~/.ssh/id_rsa
+          printf '%s' "${{ secrets.SSH_PRIVATE_KEY_EXPLORER }}" > ~/.ssh/id_rsa
```
34-37: Fix YAML trailing spaces flagged by linter

There are trailing spaces on these lines.

Apply this diff (removes the trailing whitespace):

```diff
-          ssh [email protected] "cd /projects/VitekLab/Projects/Benchmarking/benchmark && sbatch config.slurm" | tee slurm_job_id.txt
-          slurm_job_id=$(grep -oP '\d+' slurm_job_id.txt)
+          ssh [email protected] "cd /projects/VitekLab/Projects/Benchmarking/benchmark && sbatch config.slurm" | tee slurm_job_id.txt
+          slurm_job_id=$(grep -oP '\d+' slurm_job_id.txt)
           echo "Slurm Job ID is $slurm_job_id"
-          echo "slurm_job_id=$slurm_job_id" >> $GITHUB_ENV
+          echo "slurm_job_id=$slurm_job_id" >> $GITHUB_ENV
```
17-17: Update to checkout v4

Keep GA actions current.

Apply this diff:

```diff
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4
```
31-48: Optional: export FDR threshold to Slurm job environment

You define `FDR_THRESHOLD` but never pass it along. If the R scripts read it, export via sbatch.

Apply this diff:

```diff
-          ssh [email protected] "cd /projects/VitekLab/Projects/Benchmarking/benchmark && sbatch config.slurm" | tee slurm_job_id.txt
+          ssh [email protected] "cd /projects/VitekLab/Projects/Benchmarking/benchmark && sbatch --export=ALL,FDR_THRESHOLD=${FDR_THRESHOLD} config.slurm" | tee slurm_job_id.txt
```

benchmark/config.slurm (1)
12-14: Confirm toolchain availability for source builds (nloptr) on the cluster

You removed the explicit gcc module but are forcing a source build of nloptr, which requires a compiler (and NLopt via cmake). If GCC isn't loaded by default on Explorer, this will fail.

Options:

- Load a compiler toolchain, e.g., `module load gcc` (and possibly `gcc/12` or the cluster default).
- Or avoid forcing source with `install.packages("nloptr", repos = "https://cloud.r-project.org")` and let R decide.

Would you like me to adjust the script for a fully pinned toolchain?
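One way to combine both options and fail fast, sketched under the assumption that the cluster exposes compilers on PATH once the relevant module is loaded (the `gcc` module name is an assumption about Explorer's module tree):

```r
# Sketch: prefer letting R choose binary vs source, but fail early with a
# clear message if a source build would be attempted without a compiler.
has_compiler <- nzchar(Sys.which("gcc")) || nzchar(Sys.which("cc"))
if (!has_compiler) {
  stop("No C compiler on PATH; run `module load gcc` (module name assumed) before this script")
}
install.packages("nloptr", repos = "https://cloud.r-project.org")
```

This keeps the Slurm log readable when the toolchain is missing instead of failing deep inside the nloptr build.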
benchmark/benchmark_Metamorpheus.R (4)
28-30: Decoy filtering: use a dedicated Decoy flag when present

If available, prefer the explicit Decoy/IsDecoy boolean column over string matching. It's more robust across naming schemes.

```diff
-input = input %>% filter(!str_detect(`Protein Group`, ";")) # remove multiple protein group in same cell
-input = input %>% filter(!str_detect(`Protein Group`, "DECOY")) # remove decoys
+input <- input %>% dplyr::filter(!str_detect(`Protein Group`, ";")) # remove multiple protein group in same cell
+if ("Decoy" %in% names(input)) {
+  input <- input %>% dplyr::filter(!Decoy)
+} else if ("IsDecoy" %in% names(input)) {
+  input <- input %>% dplyr::filter(!IsDecoy)
+} else {
+  input <- input %>% dplyr::filter(!str_detect(`Protein Group`, "DECOY"))
+}
```
67-72: Guard against zero/oversubscribed cores in mclapply

detectCores() - 1 can be zero on small VMs, and oversubscription can hurt performance. Bound cores to [1, length(tasks)].

```diff
-num_cores <- detectCores() - 1
+num_cores <- max(1L, min(length(data_process_tasks), detectCores(logical = TRUE) - 1L))
 summarized_results <- mclapply(data_process_tasks, function(task) {
   list(label = task$label, summarized = task$result())
 }, mc.cores = num_cores)
```
74-77: Carry dataset name into the metrics output for traceability

Downstream aggregation across datasets benefits from an explicit Dataset column.

```diff
-results_list <- mclapply(summarized_results, function(res) {
-  calculate_Metrics(res$summarized, protein_mappings, res$label)
-}, mc.cores = num_cores)
+results_list <- mclapply(summarized_results, function(res) {
+  df <- calculate_Metrics(res$summarized, protein_mappings, res$label)
+  df$Dataset <- dataset_config$name
+  df
+}, mc.cores = num_cores)
```
89-90: Optional: Drive datasets from the controller instead of hard-coding

To reduce maintenance and keep scripts DRY, consider deriving the targets from config (e.g., those with "Metamorpheus" in the key or a dedicated flag).

Example replacement:

```diff
-runBenchmarkForMetaMorpheusData("DDA-Solivais2024-Metamorpheus_MBR_LFQ", config)
-runBenchmarkForMetaMorpheusData("DDA-Solivais2024-Metamorpheus_NoMBR_LFQ", config)
+mm_keys <- names(config$datasets)
+mm_keys <- mm_keys[grepl("Metamorpheus", mm_keys, ignore.case = TRUE)]
+invisible(lapply(mm_keys, runBenchmarkForMetaMorpheusData, config = config))
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- .github/workflows/benchmark.yml (3 hunks)
- benchmark/benchmark_Metamorpheus.R (1 hunks)
- benchmark/config.slurm (1 hunks)
- benchmark/metamorpheus_Process.R (1 hunks)
- benchmark/scriptController.json (2 hunks)
- metamorpheus_code.R (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
.github/workflows/benchmark.yml
[error] 34-34: trailing spaces
(trailing-spaces)
[error] 35-35: trailing spaces
(trailing-spaces)
[error] 37-37: trailing spaces
(trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build
- GitHub Check: Benchmarking-pipeline
🔇 Additional comments (3)
benchmark/scriptController.json (2)
35-49: LGTM: new MetaMorpheus NoMBR dataset entry is coherent and consistent

The `parent` + `data` convention and sample patterns align with the new benchmarking harness. Patterns are properly anchored.
50-64: LGTM: new MetaMorpheus MBR dataset entry is consistent

Matches NoMBR structure; sample patterns and parent/data paths look good.
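For readers without the file open, the shape of these entries is roughly as follows. This is a hypothetical sketch of the `parent` + `data` convention only; the field names other than the dataset key are illustrative placeholders, not copied from scriptController.json:

```json
{
  "DDA-Solivais2024-Metamorpheus_MBR_LFQ": {
    "parent": "DDA-Solivais2024",
    "data": "Metamorpheus/MBR",
    "samplePatterns": ["^A[0-9]+$", "^B[0-9]+$"]
  }
}
```

The point of the convention is that both MBR and NoMBR entries share a parent dataset and differ only in the data subpath and patterns.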
benchmark/config.slurm (1)
43-50: Nice: line-buffered Rscript output for Slurm logs

Line buffering via stdbuf improves observability; loop structure and logging are fine.
```r
source("metamorpheus_Process.R")
config <- fromJSON("scriptController.json", simplifyVector = FALSE)
```
💡 Verification agent
🧩 Analysis chain
Fix: MetamorpheusToMSstatsFormat likely undefined; also make sources/paths robust to working directory
MetamorpheusToMSstatsFormat isn’t defined in this script and isn’t exported by MSstats/MSstatsConvert. It appears to live in metamorpheus_code.R, which isn’t sourced here. Also, both the source() and fromJSON() calls are relative to the current working directory and will break if the script isn’t launched from benchmark/.
Apply this diff to source both helper files and load the controller JSON robustly:

```diff
-source("metamorpheus_Process.R")
-config <- fromJSON("scriptController.json", simplifyVector = FALSE)
+script_args <- commandArgs(trailingOnly = FALSE)
+script_path <- sub("^--file=", "", script_args[grep("^--file=", script_args)])
+base_dir <- if (length(script_path)) dirname(normalizePath(script_path)) else "."
+repo_root <- normalizePath(file.path(base_dir, ".."), mustWork = FALSE)
+
+source(file.path(base_dir, "metamorpheus_Process.R"))
+# Ensure MetamorpheusToMSstatsFormat is available
+if (file.exists(file.path(repo_root, "metamorpheus_code.R"))) {
+  source(file.path(repo_root, "metamorpheus_code.R"))
+}
+config <- fromJSON(file.path(base_dir, "scriptController.json"), simplifyVector = FALSE)
```

🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "PWD: $(pwd)"
echo
echo "---- benchmark directory listing ----"
ls -la benchmark || true
echo
echo "---- show benchmark/benchmark_Metamorpheus.R (first 200 lines) ----"
sed -n '1,200p' benchmark/benchmark_Metamorpheus.R || true
echo
echo "---- rg: search for MetamorpheusToMSstatsFormat ----"
rg -n --hidden --no-ignore -S "MetamorpheusToMSstatsFormat" || true
echo
echo "---- rg: search for metamorpheus_code.R references ----"
rg -n --hidden --no-ignore -S "metamorpheus_code\\.R" || true
echo
echo "---- fd: find files named metamorpheus_code.R or metamorpheus_Process.R ----"
fd -I -e R --hidden "metamorpheus_code.R" || true
fd -I -e R --hidden "metamorpheus_Process.R" || true
echo
echo "---- rg: search for source(...) of metamorpheus_Process.R ----"
rg -n --hidden --no-ignore -S "source\\(.*metamorpheus_Process\\.R" || true
echo
echo "---- rg: search for fromJSON\\(.*scriptController.json ----"
rg -n --hidden --no-ignore -S "fromJSON\\(.*scriptController\\.json" || true
echo
echo "---- check for metamorpheus_code.R at repo root ----"
if [ -f "metamorpheus_code.R" ]; then echo "FOUND: metamorpheus_code.R at repo root"; else echo "NOT FOUND: metamorpheus_code.R at repo root"; fi
```

Length of output: 6452
Fix required: source metamorpheus_code.R and make file paths robust to working directory
MetamorpheusToMSstatsFormat is defined in metamorpheus_code.R at the repo root (rg matches). benchmark/benchmark_Metamorpheus.R currently only does source("metamorpheus_Process.R") and fromJSON("scriptController.json"), which will break if the script is run from a different working directory — source the helper that defines MetamorpheusToMSstatsFormat and resolve paths relative to the script.
Files to update:
- benchmark/benchmark_Metamorpheus.R — replace the relative source()/fromJSON() calls (lines ~8–10).
- metamorpheus_code.R (repo root) — contains MetamorpheusToMSstatsFormat and must be sourced by the benchmark script.
- benchmark/metamorpheus_Process.R — keep sourced, but source via computed script path.
Apply this diff to benchmark/benchmark_Metamorpheus.R:
```diff
-source("metamorpheus_Process.R")
-config <- fromJSON("scriptController.json", simplifyVector = FALSE)
+script_args <- commandArgs(trailingOnly = FALSE)
+script_path <- sub("^--file=", "", script_args[grep("^--file=", script_args)])
+base_dir <- if (length(script_path)) dirname(normalizePath(script_path)) else "."
+repo_root <- normalizePath(file.path(base_dir, ".."), mustWork = FALSE)
+
+source(file.path(base_dir, "metamorpheus_Process.R"))
+# Ensure MetamorpheusToMSstatsFormat is available
+if (file.exists(file.path(repo_root, "metamorpheus_code.R"))) {
+  source(file.path(repo_root, "metamorpheus_code.R"))
+}
+config <- fromJSON(file.path(base_dir, "scriptController.json"), simplifyVector = FALSE)
```
+ config <- fromJSON(file.path(base_dir, "scriptController.json"), simplifyVector = FALSE)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| source("metamorpheus_Process.R") | |
| config <- fromJSON("scriptController.json", simplifyVector = FALSE) | |
| script_args <- commandArgs(trailingOnly = FALSE) | |
| script_path <- sub("^--file=", "", script_args[grep("^--file=", script_args)]) | |
| base_dir <- if (length(script_path)) dirname(normalizePath(script_path)) else "." | |
| repo_root <- normalizePath(file.path(base_dir, ".."), mustWork = FALSE) | |
| source(file.path(base_dir, "metamorpheus_Process.R")) | |
| # Ensure MetamorpheusToMSstatsFormat is available | |
| if (file.exists(file.path(repo_root, "metamorpheus_code.R"))) { | |
| source(file.path(repo_root, "metamorpheus_code.R")) | |
| } | |
| config <- fromJSON(file.path(base_dir, "scriptController.json"), simplifyVector = FALSE) |
```r
    label = "Data process with Normalization and MBImpute False",
    result = function() dataProcess(output, featureSubset = "topN", n_top_feature = 20, MBimpute = FALSE)
  ),
  list(
    label = "Data process without Normalization",
    result = function() dataProcess(output, featureSubset = "topN", normalization = "FALSE", n_top_feature = 20)
  ),
  list(
    label = "Data process without Normalization with MBImpute False",
    result = function() dataProcess(output, featureSubset = "topN", normalization = "FALSE", n_top_feature = 20, MBimpute = FALSE)
  ),
  list(
    label = "Data process without Normalization and Imputation On for all features",
    result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBimpute = FALSE)
  ),
  list(
    label = "Data process without Normalization and Imputation On for top3 features",
    result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = FALSE)
  )
```
Correct MBimpute argument name and align “Imputation On” task behavior
Two issues:
- Typo: MBImpute (capital I) is not a valid dataProcess() argument; it should be MBimpute.
- Tasks labeled “…Imputation On…” currently pass FALSE, turning imputation off and contradicting the labels.
Apply this diff to fix both problems:
```diff
 list(
   label = "Data process with Normalization and MBImpute False",
   result = function() dataProcess(output, featureSubset = "topN", n_top_feature = 20, MBimpute = FALSE)
 ),
@@
 list(
   label = "Data process without Normalization with MBImpute False",
   result = function() dataProcess(output, featureSubset = "topN", normalization = "FALSE", n_top_feature = 20, MBimpute = FALSE)
 ),
 list(
   label = "Data process without Normalization and Imputation On for all features",
-  result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBImpute = FALSE)
+  result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBimpute = TRUE)
 ),
 list(
   label = "Data process without Normalization and Imputation On for top3 features",
-  result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBImpute = FALSE)
+  result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = TRUE)
 )
```

Note: The first two blocks already use the correct argument spelling (MBimpute). The changes above fix the last two blocks to both correct the spelling and enable imputation.
```sh
Rscript -e '
.libPaths("/home/raina.ans/R/x86_64-pc-linux-gnu-library/4.4")
Sys.unsetenv("R_LIBS")
cat(".libPaths():\n"); print(.libPaths())
if (!requireNamespace("remotes", quietly = TRUE))
  install.packages("remotes", repos = "https://cloud.r-project.org")
install.packages("nloptr", type = "source", repos = "https://cloud.r-project.org")
for (pkg in c("dplyr", "stringr", "ggplot2")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = "https://cloud.r-project.org")
  }
}
BiocManager::install(c("MSstatsConvert", "preprocessCore"), force = TRUE)
remotes::install_github("Vitek-Lab/MSstats", ref = "devel", force = TRUE)
'
```
This will fail if BiocManager is missing; install it before calling BiocManager::install
BiocManager::install errors if BiocManager isn’t installed. Add a guard to install BiocManager first.
Apply this diff:
```diff
 Rscript -e '
 .libPaths("/home/raina.ans/R/x86_64-pc-linux-gnu-library/4.4")
 Sys.unsetenv("R_LIBS")
 cat(".libPaths():\n"); print(.libPaths())
 if (!requireNamespace("remotes", quietly = TRUE))
   install.packages("remotes", repos = "https://cloud.r-project.org")
 install.packages("nloptr", type = "source", repos = "https://cloud.r-project.org")
 for (pkg in c("dplyr", "stringr", "ggplot2")) {
   if (!requireNamespace(pkg, quietly = TRUE)) {
     install.packages(pkg, repos = "https://cloud.r-project.org")
   }
 }
-BiocManager::install(c("MSstatsConvert", "preprocessCore"), force = TRUE)
+if (!requireNamespace("BiocManager", quietly = TRUE)) {
+  install.packages("BiocManager", repos = "https://cloud.r-project.org")
+}
+BiocManager::install(c("MSstatsConvert", "preprocessCore"), force = TRUE)
 remotes::install_github("Vitek-Lab/MSstats", ref = "devel", force = TRUE)
 '
```
```r
  groups <- levels(QuantData$ProteinLevelData$GROUP)
  colnames(comparison) <- groups[order(as.numeric(groups))]
```
🛠️ Refactor suggestion
Avoid fragile column ordering: stop coercing group labels to numeric
as.numeric(groups) will warn and is unreliable when GROUP levels are non-numeric (e.g., "A","B","C"...). You only need to align by names; just set the column names directly. Optionally validate the expected length.
Apply this diff:
- groups <- levels(QuantData$ProteinLevelData$GROUP)
- colnames(comparison) <- groups[order(as.numeric(groups))]
+ groups <- levels(QuantData$ProteinLevelData$GROUP)
+ if (length(groups) != ncol(comparison)) {
+ stop(sprintf("Contrast matrix expects %d groups but found %d in data: %s",
+ ncol(comparison), length(groups), paste(groups, collapse = ", ")))
+ }
+ colnames(comparison) <- groups
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| groups <- levels(QuantData$ProteinLevelData$GROUP) | |
| colnames(comparison) <- groups[order(as.numeric(groups))] | |
| groups <- levels(QuantData$ProteinLevelData$GROUP) | |
| if (length(groups) != ncol(comparison)) { | |
| stop(sprintf("Contrast matrix expects %d groups but found %d in data: %s", | |
| ncol(comparison), length(groups), paste(groups, collapse = ", "))) | |
| } | |
| colnames(comparison) <- groups |
🤖 Prompt for AI Agents
In benchmark/metamorpheus_Process.R around lines 10 to 12, the code coerces
group level labels to numeric via as.numeric(groups) which is fragile and warns
for non-numeric levels; instead assign column names directly using the factor
levels (or sorted/validated names if needed) and optionally verify the length
matches ncol(comparison) before setting colnames to avoid misalignment. Ensure
you remove the as.numeric() coercion and replace it with a direct assignment of
groups (or a deterministic name ordering) and add a simple length check to throw
a clear error if sizes differ.
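For readers following along, the failure mode is easy to reproduce in a standalone R session (illustrative snippet, not part of the PR):

```r
# levels() returns a character vector; for letter-coded groups the numeric
# coercion produces only NAs, and order() silently drops NAs by default.
groups <- c("A", "B", "C")
idx <- suppressWarnings(order(as.numeric(groups)))  # as.numeric() -> NA NA NA
length(idx)          # 0: order() dropped every NA
length(groups[idx])  # 0: indexing with integer(0) yields character(0)
# Assigning a zero-length vector to colnames() of a 3-column contrast
# matrix then fails with a "length of 'dimnames'" error.
```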
benchmark/metamorpheus_Process.R
Outdated
| labels <- unique(comp$Label) | ||
| result_rows <- lapply(labels, function(lbl) { | ||
| df <- comp %>% filter(Label == lbl) | ||
| sig <- df %>% filter(adj.pvalue < alpha) | ||
|
|
| tp <- sig %>% filter(ecoli) %>% nrow() | ||
| fp <- sig %>% filter(!ecoli) %>% nrow() | ||
| tot <- tp + fp | ||
| fdr <- if (tot > 0) fp / tot else NA_real_ | ||
|
|
| data.frame( | ||
| Task = task_label, | ||
| Comparison = lbl, | ||
| FDR = fdr, | ||
| stringsAsFactors = FALSE | ||
| ) | ||
| }) | ||
|
|
| results <- do.call(rbind, result_rows) | ||
| return(results) |
🛠️ Refactor suggestion
Guard against empty results from rbind
If comp is empty (e.g., no comparisons available), do.call(rbind, list()) returns NULL rather than a data.frame, which breaks downstream code that expects one. Return an empty data.frame instead to keep the pipeline resilient.
Apply this diff:
- labels <- unique(comp$Label)
- result_rows <- lapply(labels, function(lbl) {
+ labels <- unique(comp$Label)
+ if (length(labels) == 0L) {
+ return(data.frame(Task = character(), Comparison = character(), FDR = numeric(),
+ stringsAsFactors = FALSE))
+ }
+ result_rows <- lapply(labels, function(lbl) {
df <- comp %>% filter(Label == lbl)
sig <- df %>% filter(adj.pvalue < alpha)
tp <- sig %>% filter(ecoli) %>% nrow()
fp <- sig %>% filter(!ecoli) %>% nrow()
tot <- tp + fp
fdr <- if (tot > 0) fp / tot else NA_real_
data.frame(
Task = task_label,
Comparison = lbl,
FDR = fdr,
stringsAsFactors = FALSE
)
})
results <- do.call(rbind, result_rows)
return(results)
📝 Committable suggestion
| labels <- unique(comp$Label) | |
| result_rows <- lapply(labels, function(lbl) { | |
| df <- comp %>% filter(Label == lbl) | |
| sig <- df %>% filter(adj.pvalue < alpha) | |
| tp <- sig %>% filter(ecoli) %>% nrow() | |
| fp <- sig %>% filter(!ecoli) %>% nrow() | |
| tot <- tp + fp | |
| fdr <- if (tot > 0) fp / tot else NA_real_ | |
| data.frame( | |
| Task = task_label, | |
| Comparison = lbl, | |
| FDR = fdr, | |
| stringsAsFactors = FALSE | |
| ) | |
| }) | |
| results <- do.call(rbind, result_rows) | |
| return(results) | |
| labels <- unique(comp$Label) | |
| if (length(labels) == 0L) { | |
| return(data.frame(Task = character(), Comparison = character(), FDR = numeric(), | |
| stringsAsFactors = FALSE)) | |
| } | |
| result_rows <- lapply(labels, function(lbl) { | |
| df <- comp %>% filter(Label == lbl) | |
| sig <- df %>% filter(adj.pvalue < alpha) | |
| tp <- sig %>% filter(ecoli) %>% nrow() | |
| fp <- sig %>% filter(!ecoli) %>% nrow() | |
| tot <- tp + fp | |
| fdr <- if (tot > 0) fp / tot else NA_real_ | |
| data.frame( | |
| Task = task_label, | |
| Comparison = lbl, | |
| FDR = fdr, | |
| stringsAsFactors = FALSE | |
| ) | |
| }) | |
| results <- do.call(rbind, result_rows) | |
| return(results) |
🤖 Prompt for AI Agents
In benchmark/metamorpheus_Process.R around lines 27 to 46, the code calls
do.call(rbind, result_rows) which returns NULL when result_rows is empty; modify
the logic to check if result_rows has length > 0 before binding and, if empty,
return an empty data.frame with the same columns (Task, Comparison, FDR) and
types (stringsAsFactors = FALSE) so the function returns a valid empty result
instead of failing.
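One nuance worth noting when applying this guard (standalone base-R check, not from the PR): an empty rbind does not raise an error itself; it yields NULL, which then trips up anything expecting a data.frame:

```r
rows <- list()               # no comparisons produced
res <- do.call(rbind, rows)  # equivalent to rbind() with no arguments
is.null(res)                 # TRUE
nrow(res)                    # NULL, so guards like nrow(res) > 0 return logical(0)
```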
| "name": "DDA-Dowell2021-HEqe408_LFQ", | ||
| "file": "/work/VitekLab/Data/MS/Benchmarking/DDA-Dowell2021-HEqe408_LFQ/FragPipe/TOP0/MSstats.csv", | ||
| "file": "/projects/VitekLab/Data/MS/Benchmarking/DDA-Dowell2021-HEqe408_LFQ/FragPipe/TOP0/MSstats.csv", | ||
| "samples": { | ||
| "Human": { | ||
| "pattern": "_HUMAN$", | ||
| "type": "insignificant" | ||
| }, | ||
| "Ecoli": { | ||
| "pattern": "_ECOLI$", | ||
| "type": "significant" | ||
| } | ||
| } |
💡 Verification agent
🧩 Analysis chain
Confirm consumer code reads new fields (parent and data)
Older datasets use a file field; the MetaMorpheus entries introduce parent + data. Ensure the benchmarking script branches correctly when these fields are present.
Run this to verify the harness reads parent and data and constructs file paths accordingly:
🏁 Script executed:
#!/bin/bash
# Inspect benchmark scripts for usage of scriptController.json
rg -n -C3 'scriptController\.json|parent|data"|samples' --type=r benchmark
Length of output: 4202
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Searching benchmark R scripts for dataset_config usage and path construction..."
rg -n -C3 'dataset_config|filePath|file.path\(|dataset_config\$file|dataset_config\$parent|dataset_config\$data' --type=r benchmark || true
echo
echo "----- Full contents of the three benchmark scripts referenced in previous output -----"
for f in benchmark/benchmark_Metamorpheus.R benchmark/benchmark_Dowell2021-HEqe408_LFQ.R benchmark/benchmark_Puyvelde2022-HYE5600735_LFQ.R; do
echo "=== $f ==="
sed -n '1,240p' "$f" || true
echo
done
Length of output: 14136
Action: Make benchmark scripts accept both file and parent+data config shapes
Verified: benchmark/benchmark_Metamorpheus.R already constructs paths from dataset_config$parent + dataset_config$data; benchmark_Dowell2021-HEqe408_LFQ.R and benchmark_Puyvelde2022-HYE5600735_LFQ.R still assume dataset_config$file only. This will break if scriptController.json uses the new parent+data shape for non‑Metamorpheus datasets.
Files to update:
- benchmark/benchmark_Metamorpheus.R — lines ~18–21: add a fallback to dataset_config$file when parent/data are absent.
- benchmark/benchmark_Dowell2021-HEqe408_LFQ.R — lines ~14–21: resolve a data_file from either dataset_config$file or parent+data, then use it for fread and logging.
- benchmark/benchmark_Puyvelde2022-HYE5600735_LFQ.R — lines ~14–21: same change as Dowell.
Suggested minimal patches (apply the same pattern to both Dowell and Puyvelde):
Before:
cat("Processing Dataset:", dataset_config$name, "\n")
cat("Dataset File Path:", dataset_config$file, "\n")
...
fragpipe_raw <- data.table::fread(dataset_config$file)
After:
data_file <- if (!is.null(dataset_config$file)) {
dataset_config$file
} else if (!is.null(dataset_config$parent) && !is.null(dataset_config$data)) {
file.path(dataset_config$parent, dataset_config$data)
} else {
stop("Dataset path not specified in config (expected 'file' or 'parent' + 'data')")
}
cat("Processing Dataset:", dataset_config$name, "\n")
cat("Dataset File Path:", data_file, "\n")
...
fragpipe_raw <- data.table::fread(data_file)
Suggested change for Metamorpheus (keep existing parent+data behavior, add fallback to file):
Before:
filePath <- file.path(dataset_config$parent, dataset_config$data)
annotPath <- dataset_config$parent
After:
if (!is.null(dataset_config$file)) {
# If a file path is supplied, use its directory as the dataset directory
filePath <- dirname(dataset_config$file)
annotPath <- dirname(filePath)
} else {
filePath <- file.path(dataset_config$parent, dataset_config$data)
annotPath <- dataset_config$parent
}
Reasoning: This makes each script robust to either config shape (old file or new parent+data) and fails fast with a clear error if neither is present.
Please apply these small changes to the three scripts (or ensure scriptController.json keeps the original file field for datasets consumed by the FragPipe-based scripts).
🤖 Prompt for AI Agents
In benchmark/scriptController.json around lines 22–33 and the three R scripts:
benchmark/benchmark_Metamorpheus.R (~lines 18–21),
benchmark/benchmark_Dowell2021-HEqe408_LFQ.R (~lines 14–21), and
benchmark/benchmark_Puyvelde2022-HYE5600735_LFQ.R (~lines 14–21), update each
script to accept either dataset_config$file OR dataset_config$parent +
dataset_config$data: for Dowell and Puyvelde, resolve a single data_file by
using dataset_config$file if present, otherwise build it with
file.path(dataset_config$parent, dataset_config$data), stop with a clear error
if neither, then use data_file for logging and fread; for Metamorpheus, keep
parent+data behavior but add a fallback where if dataset_config$file is
provided, set filePath to dirname(dataset_config$file) and annotPath to its
parent (dirname(filePath)); ensure all logging uses the resolved path and fail
fast with a descriptive message when no valid path is provided.
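The fallback described above can also be factored into one small helper shared by all three scripts; a sketch assuming each dataset entry arrives as a plain list from jsonlite (names are illustrative, not committed code):

```r
# Hypothetical helper, not part of the PR: resolve a dataset path from
# either config shape ('file', or 'parent' + 'data').
resolve_dataset_file <- function(dataset_config) {
  if (!is.null(dataset_config$file)) {
    dataset_config$file
  } else if (!is.null(dataset_config$parent) && !is.null(dataset_config$data)) {
    file.path(dataset_config$parent, dataset_config$data)
  } else {
    stop("Dataset path not specified in config (expected 'file' or 'parent' + 'data')")
  }
}

resolve_dataset_file(list(file = "/projects/x/MSstats.csv"))  # "/projects/x/MSstats.csv"
resolve_dataset_file(list(parent = "/projects/x", data = "QuantifiedPeaks.tsv"))  # "/projects/x/QuantifiedPeaks.tsv"
```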
metamorpheus_code.R
Outdated
| groups = levels(QuantData_no_mbr$ProteinLevelData$GROUP) | ||
| colnames(comparison) <- groups[order(as.numeric(groups))] | ||
| model_no_mbr <- groupComparison(contrast.matrix=comparison, data=QuantData_no_mbr, |
🛠️ Refactor suggestion
Fix contrast column naming; current coercion to numeric is fragile and warns
Identical issue as in the helper: as.numeric(groups) is unreliable for non-numeric labels.
Apply this diff:
-groups = levels(QuantData_no_mbr$ProteinLevelData$GROUP)
-colnames(comparison) <- groups[order(as.numeric(groups))]
+groups <- levels(QuantData_no_mbr$ProteinLevelData$GROUP)
+colnames(comparison) <- groups
📝 Committable suggestion
| groups = levels(QuantData_no_mbr$ProteinLevelData$GROUP) | |
| colnames(comparison) <- groups[order(as.numeric(groups))] | |
| model_no_mbr <- groupComparison(contrast.matrix=comparison, data=QuantData_no_mbr, | |
| groups <- levels(QuantData_no_mbr$ProteinLevelData$GROUP) | |
| colnames(comparison) <- groups | |
| model_no_mbr <- groupComparison(contrast.matrix=comparison, data=QuantData_no_mbr, |
🤖 Prompt for AI Agents
In metamorpheus_code.R around lines 40 to 42, the contrast column naming uses
order(as.numeric(groups)) which coerces non-numeric group labels to NA and emits
warnings; replace that coercion with a safe ordering such as
order(as.character(groups)) (or order(as.numeric(as.character(groups))) only if
you are certain labels are numeric) and assign colnames(comparison) <-
groups[order(as.character(groups))] so non-numeric labels are handled without
warnings.
metamorpheus_code.R
Outdated
| e_group = model$ComparisonResult %>% filter(Label == "B-A") %>% filter(is.na(issue)) | ||
| ecoli = e_group %>% filter(ecoli == TRUE) | ||
| hist(ecoli$log2FC) | ||
|
|
| ecoli = e_group %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == TRUE) | ||
| human = e_group %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == FALSE) | ||
| FDR = nrow(human) / (nrow(ecoli) + nrow(human)) | ||
|
|
🛠️ Refactor suggestion
Avoid hard-coded “B-A” and duplicated FDR code; delegate to calculate_Metrics
Mirror the No MBR fix here; remove ad-hoc histogram and use the shared helper.
Apply this diff:
-e_group = model$ComparisonResult %>% filter(Label == "B-A") %>% filter(is.na(issue))
-ecoli = e_group %>% filter(ecoli == TRUE)
-hist(ecoli$log2FC)
-
-ecoli = e_group %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == TRUE)
-human = e_group %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == FALSE)
-FDR = nrow(human) / (nrow(ecoli) + nrow(human))
-
-cat("FDR MBR", FDR, "\n")
+metrics_mbr <- calculate_Metrics(QuantData, protein_mappings, task_label = "MBR", alpha = 0.05)
+print(metrics_mbr)
📝 Committable suggestion
| e_group = model$ComparisonResult %>% filter(Label == "B-A") %>% filter(is.na(issue)) | |
| ecoli = e_group %>% filter(ecoli == TRUE) | |
| hist(ecoli$log2FC) | |
| ecoli = e_group %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == TRUE) | |
| human = e_group %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == FALSE) | |
| FDR = nrow(human) / (nrow(ecoli) + nrow(human)) | |
| metrics_mbr <- calculate_Metrics(QuantData, protein_mappings, task_label = "MBR", alpha = 0.05) | |
| print(metrics_mbr) |
Actionable comments posted: 1
♻️ Duplicate comments (3)
benchmark/benchmark_Metamorpheus.R (3)
31-37: Good use of Organism-based filtering (addresses prior feedback)
Filtering QuantifiedProteins.tsv to organisms present in QuantifiedPeaks.tsv is correct and aligns with earlier feedback to use the Organism column.
8-10: Fix brittle paths and ensure MetamorpheusToMSstatsFormat is available
Using source("...") and fromJSON("...") relative to the CWD will break when invoked outside benchmark/. Also, MetamorpheusToMSstatsFormat is not guaranteed to be in scope unless metamorpheus_code.R is sourced. Make paths robust and fail fast if the function is missing.
Apply this diff:
-source("metamorpheus_Process.R")
-config <- fromJSON("scriptController.json", simplifyVector = FALSE)
+## Resolve paths relative to this script; source helpers robustly
+script_args <- commandArgs(trailingOnly = FALSE)
+script_path <- sub("^--file=", "", script_args[grep("^--file=", script_args)])
+base_dir <- if (length(script_path)) dirname(normalizePath(script_path)) else normalizePath(".")
+repo_root <- normalizePath(file.path(base_dir, ".."), mustWork = FALSE)
+
+source(file.path(base_dir, "metamorpheus_Process.R"))
+
+# Ensure MetamorpheusToMSstatsFormat is available (defined in metamorpheus_code.R at repo root)
+mm_code <- file.path(repo_root, "metamorpheus_code.R")
+if (file.exists(mm_code)) {
+  source(mm_code)
+}
+if (!exists("MetamorpheusToMSstatsFormat")) {
+  stop("MetamorpheusToMSstatsFormat not found; please source metamorpheus_code.R or add it to the search path.")
+}
+
+# Load controller JSON from the benchmark directory
+config <- fromJSON(file.path(base_dir, "scriptController.json"), simplifyVector = FALSE)
58-64: “Imputation On” tasks currently disable imputation; set MBimpute = TRUE
Labels promise “Imputation On” but MBimpute is FALSE in both tasks, changing the experiment semantics and metrics.
list(
  label = "Data process without Normalization and Imputation On for all features",
- result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBImpute = FALSE)
+ result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBimpute = TRUE)
),
list(
  label = "Data process without Normalization and Imputation On for top3 features",
- result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = FALSE)
+ result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = TRUE)
)
🧹 Nitpick comments (7)
benchmark/benchmark_Metamorpheus.R (7)
21-23: Add file existence checks and read only necessary columns to reduce I/O
Fail fast with clear messages if expected inputs are missing. Also, reading only “Protein Groups” and “Organism” trims memory and speeds up processing.
- input = data.table::fread(file.path(filePath, "QuantifiedPeaks.tsv"))
- annot = data.table::fread(file.path(annotPath, "annotation.csv"))
+ peaks_path <- file.path(filePath, "QuantifiedPeaks.tsv")
+ annot_path <- file.path(annotPath, "annotation.csv")
+ if (!file.exists(peaks_path)) stop("Missing QuantifiedPeaks.tsv at: ", peaks_path)
+ if (!file.exists(annot_path)) stop("Missing annotation.csv at: ", annot_path)
+ input <- data.table::fread(peaks_path)
+ annot <- data.table::fread(annot_path)
@@
- protein_mappings = data.table::fread(file.path(filePath, "QuantifiedProteins.tsv"))
+ proteins_path <- file.path(filePath, "QuantifiedProteins.tsv")
+ if (!file.exists(proteins_path)) stop("Missing QuantifiedProteins.tsv at: ", proteins_path)
+ protein_mappings <- data.table::fread(proteins_path, select = c("Protein Groups", "Organism"))
Also applies to: 31-31
28-30: Use fixed() in str_detect to avoid regex overhead and edge cases
Treat ";" and "DECOY" as fixed strings instead of regex to speed up filtering and avoid unintended regex behavior.
- input = input %>% filter(!str_detect(`Protein Group`, ";")) # remove multiple protein group in same cell
- input = input %>% filter(!str_detect(`Protein Group`, "DECOY")) # remove decoys
+ input <- input %>% filter(!str_detect(`Protein Group`, fixed(";"))) # remove multiple protein group in same cell
+ input <- input %>% filter(!str_detect(`Protein Group`, fixed("DECOY"))) # remove decoys
16-17: Be resilient if dataset_config$name is missing
Some controller entries may not set name. Fall back to datasetPath for logging.
- cat("Processing Dataset:", dataset_config$name, "\n")
+ ds_name <- if (!is.null(dataset_config$name)) dataset_config$name else datasetPath
+ cat("Processing Dataset:", ds_name, "\n")
40-47: Clarify task label to reflect actual settings
First task uses default MBimpute; consider making that explicit in the label to avoid confusion when comparing results.
- label = "Data process with Normalized Data",
+ label = "Data process with Normalization (default MBimpute)",
38-38: Expose MSstatsConvert filters to config for FDR sensitivity analysis
MetamorpheusToMSstatsFormat defaults removeFewMeasurements = TRUE and removeProtein_with1Feature = TRUE. To investigate empirical FDR sensitivity (as raised previously), parameterize these via scriptController.json and pass through here. I can draft the wiring if you want.
81-86: Persist results to disk for downstream aggregation
Printing is transient. Consider writing CSV/TSV labeled by dataset/MBR mode to the dataset folder or a results directory for the workflow.
final_results <- do.call(rbind, results_list)
end_time <- Sys.time()
total_time <- end_time - start_time
- print(final_results)
- print(paste("Total Execution Time:", total_time))
+ print(final_results)
+ print(paste("Total Execution Time:", total_time))
+ # Optional: persist results
+ out_file <- file.path(annotPath, paste0("benchmark_Metamorpheus_", gsub("\\s+", "_", ds_name), ".csv"))
+ data.table::fwrite(final_results, out_file)
+ message("Wrote results to: ", out_file)
91-92: Isolate dataset runs so one failure doesn’t stop the entire batch
Wrap each invocation in tryCatch to continue processing the remaining dataset if one fails.
-runBenchmarkForMetaMorpheusData("DDA-Solivais2024-Metamorpheus_MBR_LFQ", config)
-runBenchmarkForMetaMorpheusData("DDA-Solivais2024-Metamorpheus_NoMBR_LFQ", config)
+lapply(
+  c("DDA-Solivais2024-Metamorpheus_MBR_LFQ", "DDA-Solivais2024-Metamorpheus_NoMBR_LFQ"),
+  function(ds) {
+    tryCatch(
+      runBenchmarkForMetaMorpheusData(ds, config),
+      error = function(e) {
+        message("ERROR in ", ds, ": ", conditionMessage(e))
+      }
+    )
+  }
+)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
benchmark/benchmark_Metamorpheus.R (1 hunks)
benchmark/metamorpheus_Process.R (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- benchmark/metamorpheus_Process.R
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build
- GitHub Check: Benchmarking-pipeline
| num_cores <- detectCores() - 1 | ||
|
|
| summarized_results <- mclapply(data_process_tasks, function(task) { | ||
| list(label = task$label, summarized = task$result()) | ||
| }, mc.cores = num_cores) | ||
|
|
|
|
| results_list <- mclapply(summarized_results, function(res) { | ||
| calculate_Metrics(res$summarized, protein_mappings, res$label) | ||
| }, mc.cores = num_cores) | ||
|
|
💡 Verification agent
🧩 Analysis chain
Guard against mc.cores == 0 and provide a Windows-safe fallback
detectCores() - 1 can be zero; mclapply will error. Also, mclapply with mc.cores > 1 is unavailable on Windows. Add a safe core calculation and fall back to lapply when needed.
- num_cores <- detectCores() - 1
-
- summarized_results <- mclapply(data_process_tasks, function(task) {
- list(label = task$label, summarized = task$result())
- }, mc.cores = num_cores)
+ num_cores <- max(1L, as.integer(detectCores() - 1L))
+ if (.Platform$OS.type != "windows" && num_cores > 1L) {
+ summarized_results <- mclapply(
+ data_process_tasks,
+ function(task) list(label = task$label, summarized = task$result()),
+ mc.cores = num_cores
+ )
+ } else {
+ summarized_results <- lapply(
+ data_process_tasks,
+ function(task) list(label = task$label, summarized = task$result())
+ )
+ }
@@
- results_list <- mclapply(summarized_results, function(res) {
- calculate_Metrics(res$summarized, protein_mappings, res$label)
- }, mc.cores = num_cores)
+ if (.Platform$OS.type != "windows" && num_cores > 1L) {
+ results_list <- mclapply(
+ summarized_results,
+ function(res) calculate_Metrics(res$summarized, protein_mappings, res$label),
+ mc.cores = num_cores
+ )
+ } else {
+ results_list <- lapply(
+ summarized_results,
+ function(res) calculate_Metrics(res$summarized, protein_mappings, res$label)
+ )
+ }
Optional verification: please run a small dataset on a 1-core environment to confirm no regression when num_cores == 1.
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Checking for benchmark/benchmark_Metamorpheus.R..."
if [ -f benchmark/benchmark_Metamorpheus.R ]; then
echo "== File: benchmark/benchmark_Metamorpheus.R (first 220 lines) =="
sed -n '1,220p' benchmark/benchmark_Metamorpheus.R
else
echo "benchmark/benchmark_Metamorpheus.R not found"
fi
echo
echo "== Searching repo for parallel-related usages =="
rg -n --hidden -S 'detectCores|mclapply|library\\(parallel\\)|require\\(parallel\\)|\\.Platform' || true
echo
echo "== Showing matches in benchmark/ directory with context =="
rg -n --hidden -S --context 3 'detectCores|mclapply|\\.Platform' benchmark || true
Length of output: 8050
Guard against mc.cores == 0 and add a Windows-safe fallback (applies to multiple benchmark scripts)
detectCores() - 1 can be 0 (causing mclapply to error) and mclapply’s multicore mode is not available on Windows — fall back to lapply when multicore isn’t usable.
Files to update:
- benchmark/benchmark_Metamorpheus.R (lines ~69–79)
- benchmark/benchmark_Puyvelde2022-HYE5600735_LFQ.R (lines ~47–56)
- benchmark/benchmark_Dowell2021-HEqe408_LFQ.R (lines ~48–56)
Suggested change (apply the same pattern to each file):
- num_cores <- detectCores() - 1
-
- summarized_results <- mclapply(data_process_tasks, function(task) {
- list(label = task$label, summarized = task$result())
- }, mc.cores = num_cores)
+ num_cores <- max(1L, as.integer(detectCores() - 1L))
+ if (.Platform$OS.type != "windows" && num_cores > 1L) {
+ summarized_results <- mclapply(
+ data_process_tasks,
+ function(task) list(label = task$label, summarized = task$result()),
+ mc.cores = num_cores
+ )
+ } else {
+ summarized_results <- lapply(
+ data_process_tasks,
+ function(task) list(label = task$label, summarized = task$result())
+ )
+ }
@@
- results_list <- mclapply(summarized_results, function(res) {
- calculate_Metrics(res$summarized, protein_mappings, res$label)
- }, mc.cores = num_cores)
+ if (.Platform$OS.type != "windows" && num_cores > 1L) {
+ results_list <- mclapply(
+ summarized_results,
+ function(res) calculate_Metrics(res$summarized, protein_mappings, res$label),
+ mc.cores = num_cores
+ )
+ } else {
+ results_list <- lapply(
+ summarized_results,
+ function(res) calculate_Metrics(res$summarized, protein_mappings, res$label)
+ )
+ }
Optional: verify on a 1-core / Windows-like environment to ensure no regressions.
📝 Committable suggestion
| num_cores <- detectCores() - 1 | |
| summarized_results <- mclapply(data_process_tasks, function(task) { | |
| list(label = task$label, summarized = task$result()) | |
| }, mc.cores = num_cores) | |
| results_list <- mclapply(summarized_results, function(res) { | |
| calculate_Metrics(res$summarized, protein_mappings, res$label) | |
| }, mc.cores = num_cores) | |
| num_cores <- max(1L, as.integer(detectCores() - 1L)) | |
| if (.Platform$OS.type != "windows" && num_cores > 1L) { | |
| summarized_results <- mclapply( | |
| data_process_tasks, | |
| function(task) list(label = task$label, summarized = task$result()), | |
| mc.cores = num_cores | |
| ) | |
| } else { | |
| summarized_results <- lapply( | |
| data_process_tasks, | |
| function(task) list(label = task$label, summarized = task$result()) | |
| ) | |
| } | |
| if (.Platform$OS.type != "windows" && num_cores > 1L) { | |
| results_list <- mclapply( | |
| summarized_results, | |
| function(res) calculate_Metrics(res$summarized, protein_mappings, res$label), | |
| mc.cores = num_cores | |
| ) | |
| } else { | |
| results_list <- lapply( | |
| summarized_results, | |
| function(res) calculate_Metrics(res$summarized, protein_mappings, res$label) | |
| ) | |
| } |
🤖 Prompt for AI Agents
In benchmark/benchmark_Metamorpheus.R around lines 69–79, detectCores() - 1 may
be 0 and mclapply is not available on Windows; change to compute num_cores <-
max(detectCores() - 1L, 1L) and then call either lapply when running on Windows
or when num_cores == 1, otherwise call mclapply with mc.cores = num_cores; apply
the same pattern to the other two files (benchmark_Puyvelde2022-HYE5600735_LFQ.R
lines ~47–56 and benchmark_Dowell2021-HEqe408_LFQ.R lines ~48–56) so multicore
is used only when supported and falls back to lapply safely.
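If the same guard ends up in all three scripts, it may be cleaner to pull it into a shared wrapper; a sketch of the pattern (illustrative only, not committed code):

```r
library(parallel)

# Fall back to sequential lapply when forked workers are unavailable
# (Windows) or when only one core is left after reserving one.
safe_parallel_lapply <- function(x, fun) {
  num_cores <- max(1L, as.integer(detectCores() - 1L))
  if (.Platform$OS.type != "windows" && num_cores > 1L) {
    mclapply(x, fun, mc.cores = num_cores)
  } else {
    lapply(x, fun)
  }
}
```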
Actionable comments posted: 0
♻️ Duplicate comments (9)
metamorpheus_code.R (4)
1-1: Consider removing this standalone script from the PR
This file appears to be a standalone analysis script that duplicates functionality from the benchmarking framework. Since the benchmarking harness (benchmark/benchmark_Metamorpheus.R) already implements the same workflow in a more structured way, keeping this file may cause confusion and maintenance burden.
99-108: Use the shared calculate_Metrics helper for consistency
The FDR calculation is duplicated from the no-MBR section. Using the shared helper would improve maintainability.
Apply this diff:
-e_group = model$ComparisonResult %>% filter(Label == "B-A") %>% filter(is.na(issue))
-ecoli = e_group %>% filter(ecoli == TRUE)
-hist(ecoli$log2FC)
-
-ecoli = e_group %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == TRUE)
-human = e_group %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == FALSE)
-FDR = nrow(human) / (nrow(ecoli) + nrow(human))
-
-cat("FDR MBR", FDR, "\n")
+metrics_mbr <- calculate_Metrics(QuantData, protein_mappings, task_label = "MBR", alpha = 0.05)
+print(metrics_mbr)
40-42: Fix fragile group ordering for contrast column names
Using as.numeric(groups) on factor levels can produce NA values and warnings for non-numeric group labels. This approach is unreliable and should be replaced with a more robust ordering mechanism.
Apply this diff to fix the issue:
-groups = levels(QuantData_no_mbr$ProteinLevelData$GROUP)
-colnames(comparison) <- groups[order(as.numeric(groups))]
+groups <- levels(QuantData_no_mbr$ProteinLevelData$GROUP)
+colnames(comparison) <- groups
90-92: Fix fragile group ordering for MBR contrast column names
Same issue as in the no-MBR section - using as.numeric(groups) on factor levels is unreliable.
Apply this diff:
-groups = levels(QuantData$ProteinLevelData$GROUP)
-colnames(comparison) <- groups[order(as.numeric(groups))]
+groups <- levels(QuantData$ProteinLevelData$GROUP)
+colnames(comparison) <- groups
benchmark/benchmark_Metamorpheus.R (5)
31-33: Use the Organism column from QuantifiedPeaks.tsv if available
It's been noted that Metamorpheus now includes an "Organism" column in the QuantifiedPeaks.tsv file. Using this built-in column would be more robust than relying on a separate proteins file.
Consider checking if the Organism column exists in the input data and using it directly:
-protein_mappings = data.table::fread(file.path(filePath, "QuantifiedProteins.tsv"))
-protein_mappings = protein_mappings %>% filter(Organism %in% c("Escherichia coli (strain K12)", "Homo sapiens"))
+if ("Organism" %in% colnames(input)) {
+  # Use built-in Organism column
+  input = input %>% filter(Organism %in% c("Escherichia coli (strain K12)", "Homo sapiens"))
+  protein_mappings = input %>% distinct(`Protein Group`, Organism)
+} else {
+  # Fall back to QuantifiedProteins.tsv
+  protein_mappings = data.table::fread(file.path(filePath, "QuantifiedProteins.tsv"))
+  protein_mappings = protein_mappings %>% filter(Organism %in% c("Escherichia coli (strain K12)", "Homo sapiens"))
+}
36-36: Verify impact of removeFewMeasurements and removeProtein_with1Feature parameters
The MetamorpheusToMSstatsFormat function has two parameters (removeFewMeasurements, removeProtein_with1Feature) that are TRUE by default. These could filter out proteins and potentially explain FDR differences between MBR and no-MBR workflows.
#!/bin/bash # Check if MetamorpheusToMSstatsFormat is defined and its default parameters rg -n -A 10 "MetamorpheusToMSstatsFormat\s*<-\s*function" --type R
8-10: Make file paths robust to working directory changes

The script uses relative paths for sourcing files and loading JSON, which will fail if run from a different working directory. Additionally, `MetamorpheusToMSstatsFormat` appears to be defined in metamorpheus_code.R at the repo root but isn't sourced here.

Apply this diff to make paths robust:

```diff
-source("metamorpheus_Process.R")
-config <- fromJSON("scriptController.json", simplifyVector = FALSE)
+script_args <- commandArgs(trailingOnly = FALSE)
+script_path <- sub("^--file=", "", script_args[grep("^--file=", script_args)])
+base_dir <- if (length(script_path)) dirname(normalizePath(script_path)) else "."
+repo_root <- normalizePath(file.path(base_dir, ".."), mustWork = FALSE)
+
+source(file.path(base_dir, "metamorpheus_Process.R"))
+# Source MetamorpheusToMSstatsFormat definition
+if (file.exists(file.path(repo_root, "metamorpheus_code.R"))) {
+  source(file.path(repo_root, "metamorpheus_code.R"))
+}
+config <- fromJSON(file.path(base_dir, "scriptController.json"), simplifyVector = FALSE)
```
56-57: Fix inconsistent imputation settings in task labels

Tasks labeled "Imputation On" are actually passing `MBimpute = FALSE`, which disables imputation. This contradicts the task labels and will produce misleading benchmark results.

Apply this diff to enable imputation for the "Imputation On" tasks:

```diff
 list(
   label = "Data process without Normalization and Imputation On for all features",
-  result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBimpute = FALSE)
+  result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBimpute = TRUE)
 ),
 list(
   label = "Data process without Normalization and Imputation On for top3 features",
-  result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = FALSE)
+  result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = TRUE)
 )
```

Also applies to: 60-61
67-76: Add safeguards for parallel execution on limited-core or Windows systems

The code can fail in two scenarios: (1) when `detectCores() - 1` returns 0 on single-core systems, and (2) on Windows, where `mclapply` with `mc.cores > 1` isn't available.

Apply this diff to add proper safeguards:

```diff
-num_cores <- detectCores() - 1
-
-summarized_results <- mclapply(data_process_tasks, function(task) {
-  list(label = task$label, summarized = task$result())
-}, mc.cores = num_cores)
-
-results_list <- mclapply(summarized_results, function(res) {
-  calculate_Metrics(res$summarized, protein_mappings, res$label)
-}, mc.cores = num_cores)
+num_cores <- max(1L, as.integer(detectCores() - 1L))
+
+if (.Platform$OS.type != "windows" && num_cores > 1L) {
+  summarized_results <- mclapply(
+    data_process_tasks,
+    function(task) list(label = task$label, summarized = task$result()),
+    mc.cores = num_cores
+  )
+
+  results_list <- mclapply(
+    summarized_results,
+    function(res) calculate_Metrics(res$summarized, protein_mappings, res$label),
+    mc.cores = num_cores
+  )
+} else {
+  summarized_results <- lapply(
+    data_process_tasks,
+    function(task) list(label = task$label, summarized = task$result())
+  )
+
+  results_list <- lapply(
+    summarized_results,
+    function(res) calculate_Metrics(res$summarized, protein_mappings, res$label)
+  )
+}
```
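The parallel-with-serial-fallback pattern suggested above is language-agnostic; a minimal sketch in Python (thread-based for portability, not part of the PR's R code — `run_tasks` and `worker` are illustrative names):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def run_tasks(tasks, worker):
    """Map worker over tasks in parallel when more than one core is
    available; otherwise fall back to a plain serial map, mirroring
    the mclapply/lapply fallback suggested in the review."""
    cores = max(1, (os.cpu_count() or 1) - 1)
    if cores > 1:
        with ThreadPoolExecutor(max_workers=cores) as pool:
            return list(pool.map(worker, tasks))
    return [worker(t) for t in tasks]

print(run_tasks([1, 2, 3], lambda x: 2 * x))  # [2, 4, 6]
```

The key design point is that both branches produce the same list shape, so downstream code never needs to know which path ran.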
🧹 Nitpick comments (1)
metamorpheus_code.R (1)
49-57: Consider using the shared calculate_Metrics helper function

The FDR calculation logic is duplicated here and in the MBR section. Since `benchmark/metamorpheus_Process.R` provides a `calculate_Metrics` function for this exact purpose, consider using it for consistency and maintainability.

Apply this diff to use the shared helper:

```diff
-e_group_no_mbr = model_no_mbr$ComparisonResult %>% filter(Label == "B-A") %>% filter(is.na(issue))
-ecoli_no_mbr = e_group_no_mbr %>% filter(ecoli == TRUE)
-hist(ecoli_no_mbr$log2FC)
-
-ecoli_no_mbr = e_group_no_mbr %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == TRUE)
-human_no_mbr = e_group_no_mbr %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == FALSE)
-FDR_no_mbr = nrow(human_no_mbr) / (nrow(ecoli_no_mbr) + nrow(human_no_mbr))
-
-cat("FDR no MBR", FDR_no_mbr, "\n")
+# Source the helper if not already done
+source("benchmark/metamorpheus_Process.R")
+metrics_no_mbr <- calculate_Metrics(QuantData_no_mbr, protein_mappings, task_label = "No MBR", alpha = 0.05)
+print(metrics_no_mbr)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- benchmark/benchmark_Metamorpheus.R (1 hunks)
- benchmark/metamorpheus_Process.R (1 hunks)
- metamorpheus_code.R (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- benchmark/metamorpheus_Process.R
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build
- GitHub Check: Benchmarking-pipeline
Actionable comments posted: 0
♻️ Duplicate comments (4)
benchmark/benchmark_Metamorpheus.R (4)
40-41: Verify impact of removeFewMeasurements/removeProtein_with1Feature in MetamorpheusToMSstatsFormat

Prior feedback asked to double-check empirical FDR when these are set to FALSE. The current call relies on defaults; consider running a scenario with both flags FALSE for comparison.
Example variant to try (not necessarily for commit):
- output_alt <- MetamorpheusToMSstatsFormat(input, annot, removeFewMeasurements = FALSE, removeProtein_with1Feature = FALSE)
- Run the same tasks on output_alt and compare FDR metrics against output.
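The empirical FDR being compared here reduces to a simple ratio: in the two-organism spike-in design, significant human proteins are counted as false positives and significant E. coli proteins as true positives. A minimal sketch in Python (counts are made up for illustration, not from the PR's runs):

```python
def empirical_fdr(n_human_sig: int, n_ecoli_sig: int) -> float:
    """Empirical FDR in a two-organism spike-in benchmark: human proteins
    called significant are treated as false positives, E. coli proteins
    (the true differentially abundant spike-ins) as true positives."""
    total = n_human_sig + n_ecoli_sig
    return n_human_sig / total if total else 0.0

# Hypothetical counts passing adj.pvalue < 0.05:
print(empirical_fdr(12, 388))  # 0.03
```

Because the filtering flags change which proteins survive to testing, both the numerator and denominator can shift, which is why the side-by-side comparison is worth running.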
71-80: Guard against mc.cores == 0 and add a Windows fallback for mclapply

`detectCores() - 1` can be 0, and multicore `mclapply` isn't available on Windows. Fall back to `lapply` when needed.

```diff
-num_cores <- detectCores() - 1
-
-summarized_results <- mclapply(data_process_tasks, function(task) {
-  list(label = task$label, summarized = task$result())
-}, mc.cores = num_cores)
+num_cores <- max(1L, as.integer(detectCores() - 1L))
+if (.Platform$OS.type != "windows" && num_cores > 1L) {
+  summarized_results <- mclapply(
+    data_process_tasks,
+    function(task) list(label = task$label, summarized = task$result()),
+    mc.cores = num_cores
+  )
+} else {
+  summarized_results <- lapply(
+    data_process_tasks,
+    function(task) list(label = task$label, summarized = task$result())
+  )
+}
@@
-results_list <- mclapply(summarized_results, function(res) {
-  calculate_Metrics(res$summarized, protein_mappings, res$label)
-}, mc.cores = num_cores)
+if (.Platform$OS.type != "windows" && num_cores > 1L) {
+  results_list <- mclapply(
+    summarized_results,
+    function(res) calculate_Metrics(res$summarized, protein_mappings, res$label),
+    mc.cores = num_cores
+  )
+} else {
+  results_list <- lapply(
+    summarized_results,
+    function(res) calculate_Metrics(res$summarized, protein_mappings, res$label)
+  )
+}
```
8-10: Source metamorpheus_code.R and resolve paths relative to the script (prevents undefined function error and WD fragility)

`MetamorpheusToMSstatsFormat` is used later (Line 40) but this file doesn't source metamorpheus_code.R where it's defined. Also, the relative `source()`/`fromJSON()` calls will break if the working directory isn't benchmark/. Make paths relative to this script and source metamorpheus_code.R from the repo root.

Apply this diff:

```diff
-source("metamorpheus_Process.R")
-config <- fromJSON("scriptController.json", simplifyVector = FALSE)
+script_args <- commandArgs(trailingOnly = FALSE)
+script_path <- sub("^--file=", "", script_args[grep("^--file=", script_args)])
+base_dir <- if (length(script_path)) dirname(normalizePath(script_path)) else getwd()
+repo_root <- normalizePath(file.path(base_dir, ".."), mustWork = FALSE)
+
+source(file.path(base_dir, "metamorpheus_Process.R"))
+# Ensure MetamorpheusToMSstatsFormat is available
+if (file.exists(file.path(repo_root, "metamorpheus_code.R"))) {
+  source(file.path(repo_root, "metamorpheus_code.R"))
+}
+config <- fromJSON(file.path(base_dir, "scriptController.json"), simplifyVector = FALSE)
```
60-66: Align "Imputation On" task behavior with label (set MBimpute = TRUE)

Both "Imputation On" tasks currently disable imputation.

```diff
 list(
   label = "Data process without Normalization and Imputation On for all features",
-  result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBimpute = FALSE)
+  result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBimpute = TRUE)
 ),
 list(
   label = "Data process without Normalization and Imputation On for top3 features",
-  result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = FALSE)
+  result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = TRUE)
 )
```
🧹 Nitpick comments (5)
benchmark/benchmark_Metamorpheus.R (5)
25-27: Clarify the annotation path printout

You print "Annotation File Path" but output the directory path, not the file. Print the file path to avoid confusion.

```diff
-cat("Annotation File Path:", annotPath, "\n")
+cat("Annotation File:", file.path(annotPath, "annotation.csv"), "\n")
```
33-37: Guard against NA organisms when filtering protein mappings

Avoid carrying NA into the filter set.

```diff
-valid_organisms <- unique(input$Organism)
+valid_organisms <- unique(stats::na.omit(input$Organism))
```
31-39: Harden mapping: confirm the 'Protein Groups' column name and avoid a hard-coded name

QuantifiedProteins.tsv may expose either "Protein Groups" or "Protein Group". Use a guard to avoid a column-name mismatch.

```diff
-input = input %>% filter(`Protein Group` %in% protein_mappings$`Protein Groups`)
+protein_groups_col <- if ("Protein Groups" %in% names(protein_mappings)) "Protein Groups" else "Protein Group"
+input = input %>% filter(`Protein Group` %in% protein_mappings[[protein_groups_col]])
```

If you can, please also verify the column names contained in QuantifiedProteins.tsv for your datasets to confirm this guard is sufficient.
83-88: Optionally persist metrics to disk per dataset

Printing is fine, but persisting results helps downstream CI/HPC collection. For example:

```r
out_file <- file.path(filePath, paste0("benchmark_results_", gsub("\\s+", "_", dataset_config$name), ".tsv"))
data.table::fwrite(final_results, out_file, sep = "\t")
```
93-94: Protect per-dataset runs so one failure doesn't abort the other

Wrap invocations in tryCatch to continue on error.

```diff
-runBenchmarkForMetaMorpheusData("DDA-Solivais2024-Metamorpheus_MBR_LFQ", config)
-runBenchmarkForMetaMorpheusData("DDA-Solivais2024-Metamorpheus_NoMBR_LFQ", config)
+for (ds in c("DDA-Solivais2024-Metamorpheus_MBR_LFQ", "DDA-Solivais2024-Metamorpheus_NoMBR_LFQ")) {
+  tryCatch(
+    runBenchmarkForMetaMorpheusData(ds, config),
+    error = function(e) message(sprintf("Dataset %s failed: %s", ds, conditionMessage(e)))
+  )
+}
```
📜 Review details
📒 Files selected for processing (1)
- benchmark/benchmark_Metamorpheus.R (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build
- GitHub Check: Benchmarking-pipeline
🔇 Additional comments (1)
benchmark/benchmark_Metamorpheus.R (1)
31-34: Good: leveraging Organism column from QuantifiedPeaks.tsv

Using input$Organism to constrain protein mappings addresses earlier feedback and improves specificity.
Actionable comments posted: 0
♻️ Duplicate comments (4)
benchmark/benchmark_Metamorpheus.R (4)
40-41: Double-check impact of removeFewMeasurements/removeProtein_with1Feature on empirical FDR

MetamorpheusToMSstatsFormat defaults to removing sparse proteins/features (both TRUE). Given no-PIP scenarios, toggling these to FALSE may materially change recall/FDR. Please validate and document the chosen setting.
I can wire these flags through from the controller JSON (and produce side-by-side metrics) if you want.
8-10: Source helpers and config using robust paths; ensure MetamorpheusToMSstatsFormat is loadedCurrent relative paths will break when not launched from benchmark/, and MetamorpheusToMSstatsFormat isn’t defined in this file. Source files via the script’s directory and include metamorpheus_code.R from repo root.
Apply this diff:
- source("metamorpheus_Process.R") - config <- fromJSON("scriptController.json", simplifyVector = FALSE) + script_args <- commandArgs(trailingOnly = FALSE) + script_path <- sub("^--file=", "", script_args[grep("^--file=", script_args)]) + base_dir <- if (length(script_path)) dirname(normalizePath(script_path)) else getwd() + repo_root <- normalizePath(file.path(base_dir, ".."), mustWork = FALSE) + + source(file.path(base_dir, "metamorpheus_Process.R")) + # Ensure MetamorpheusToMSstatsFormat is available + if (file.exists(file.path(repo_root, "metamorpheus_code.R"))) { + source(file.path(repo_root, "metamorpheus_code.R")) + } + config <- fromJSON(file.path(base_dir, "scriptController.json"), simplifyVector = FALSE)Optional verification to confirm the function is available:
#!/bin/bash set -euo pipefail # Check the definition of MetamorpheusToMSstatsFormat rg -nP '^\s*MetamorpheusToMSstatsFormat\s*<-' metamorpheus_code.R || rg -n 'MetamorpheusToMSstatsFormat' --hidden --no-ignore
59-65: Fix "Imputation On" tasks: wrong argument name and value

Both tasks labeled "Imputation On" pass MBImpute = FALSE and use a misspelled argument (MBImpute). Use MBimpute = TRUE so behavior matches the labels.

```diff
-  label = "Data process without Normalization and Imputation On for all features",
-  result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBImpute = FALSE)
+  label = "Data process without Normalization and Imputation On for all features",
+  result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBimpute = TRUE)
@@
-  label = "Data process without Normalization and Imputation On for top3 features",
-  result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBImpute = FALSE)
+  label = "Data process without Normalization and Imputation On for top3 features",
+  result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = TRUE)
```
70-79: Make parallel section Windows-safe and guard against 0 cores

`detectCores() - 1` can be 0, and mclapply multicore isn't available on Windows. Fall back to lapply when needed.

```diff
-num_cores <- detectCores() - 1
-
-summarized_results <- mclapply(data_process_tasks, function(task) {
-  list(label = task$label, summarized = task$result())
-}, mc.cores = num_cores)
-
-results_list <- mclapply(summarized_results, function(res) {
-  calculate_Metrics(res$summarized, protein_mappings, res$label)
-}, mc.cores = num_cores)
+num_cores <- max(1L, as.integer(detectCores() - 1L))
+if (.Platform$OS.type != "windows" && num_cores > 1L) {
+  summarized_results <- mclapply(
+    data_process_tasks,
+    function(task) list(label = task$label, summarized = task$result()),
+    mc.cores = num_cores
+  )
+  results_list <- mclapply(
+    summarized_results,
+    function(res) calculate_Metrics(res$summarized, protein_mappings, res$label),
+    mc.cores = num_cores
+  )
+} else {
+  summarized_results <- lapply(
+    data_process_tasks,
+    function(task) list(label = task$label, summarized = task$result())
+  )
+  results_list <- lapply(
+    summarized_results,
+    function(res) calculate_Metrics(res$summarized, protein_mappings, res$label)
+  )
+}
```
🧹 Nitpick comments (3)
benchmark/benchmark_Metamorpheus.R (3)
31-34: Fail fast if expected columns are missing

Protect against schema differences to avoid opaque dplyr errors later.

```diff
 protein_mappings = data.table::fread(file.path(filePath, "QuantifiedProteins.tsv"))
-
-protein_mappings = protein_mappings %>% filter(Organism %in% c("Escherichia coli (strain K12)", "Homo sapiens"))
+# Defensive checks
+if (!"Protein Group" %in% names(input)) {
+  stop("Expected column 'Protein Group' not found in QuantifiedPeaks.tsv")
+}
+required_cols <- c("Protein Groups", "Organism")
+missing_cols <- setdiff(required_cols, names(protein_mappings))
+if (length(missing_cols)) {
+  stop(sprintf("Expected columns missing in QuantifiedProteins.tsv: %s", paste(missing_cols, collapse = ", ")))
+}
+protein_mappings = protein_mappings %>% filter(Organism %in% c("Escherichia coli (strain K12)", "Homo sapiens"))
```
37-37: Prefer semi_join over %in% for clarity and potential performance

Makes the key relationship explicit and avoids vector materialization for large tables.

```diff
-input = input %>% filter(`Protein Group` %in% protein_mappings$`Protein Groups`)
+input = input %>% semi_join(protein_mappings, by = c("Protein Group" = "Protein Groups"))
```
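For readers less familiar with semi-join semantics (keep only rows of the left table that have a match in the right table, without adding any of its columns), a minimal illustration of the same filtering idea, sketched in Python with made-up protein-group values:

```python
# Hypothetical left-table rows and right-table keys, for illustration only
rows = [{"pg": "A"}, {"pg": "B"}, {"pg": "C"}]
mapping_keys = {"A", "C"}  # stand-in for protein_mappings$`Protein Groups`

# Semi-join: keep rows whose key appears in the mapping; no columns are added
kept = [r for r in rows if r["pg"] in mapping_keys]
print([r["pg"] for r in kept])  # ['A', 'C']
```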
35-35: Reduce noisy output for large protein_mappings

Printing the entire table can flood logs. Print a compact summary instead.

```diff
-print(protein_mappings)
+cat(
+  "protein_mappings:",
+  sprintf("%d rows x %d cols", nrow(protein_mappings), ncol(protein_mappings)),
+  " | Organisms:",
+  paste(unique(protein_mappings$Organism), collapse = ", "),
+  "\n"
+)
+print(utils::head(protein_mappings, 3))
```
📜 Review details
📒 Files selected for processing (1)
- benchmark/benchmark_Metamorpheus.R (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build
- GitHub Check: Benchmarking-pipeline
User description
Motivation and Context
Please include relevant motivation and context of the problem along with a short summary of the solution.
Changes
Please provide a detailed bullet point list of your changes.
Testing
Please describe any unit tests you added or modified to verify your changes.
Checklist Before Requesting a Review
PR Type
Enhancement, Other
Description
Add Metamorpheus benchmarking R script
Create helper function file for metrics calculation
Update SLURM and workflow configs for new script
Add dataset entry in scriptController.json
Changes walkthrough 📝
- benchmark.yml: Update benchmark workflow GitHub Actions config (.github/workflows/benchmark.yml)
- config.slurm: Include Metamorpheus script in SLURM config (benchmark/config.slurm)
- scriptController.json: Extend dataset config for Metamorpheus (benchmark/scriptController.json)
- benchmark_Metamorpheus.R: Add Metamorpheus benchmarking R script (benchmark/benchmark_Metamorpheus.R)
- metamorpheus_Process.R: Add metrics calculation helper function (benchmark/metamorpheus_Process.R)
- metamorpheus_code.R: Add example Metamorpheus analysis script (metamorpheus_code.R)
Summary by CodeRabbit
New Features
Chores