45 commits
d21a720
Changes for metamorpheus #1
Jun 12, 2025
eaee20e
Added changes for metamorpheus no mbr file
Jun 13, 2025
b09fae2
Explorer Migration Changes #1
Jun 13, 2025
f5927d8
Path correction
Jun 13, 2025
8bd37a9
Transfer all benchmark files - Correction
Jun 13, 2025
71f62f4
Changes for benchmark folder
Jun 13, 2025
d2275bd
Changes for path correction in slurm file
Jun 14, 2025
7300bcc
Removed wrong character in script
Jun 14, 2025
79e0a2d
Changes for gcc added
Jun 16, 2025
434c1c0
Changing back R-LIBS_User env in config
Jun 17, 2025
83958b8
changes done to fix env
Jun 17, 2025
bf4938d
Changes for failing package
Jun 17, 2025
6b784f1
Added changes for path
Jun 17, 2025
959d271
Changes added for nolptr
Jun 17, 2025
138c802
Changes for Library path
Jun 17, 2025
1980ad9
Added changes for POC #1
Jun 18, 2025
331ba24
Added symlink of error package in our directory
Jun 18, 2025
5c497f4
Changes for slurm
Jun 18, 2025
d78a1e9
Changes for lesser RAM
Jun 18, 2025
4917efe
Changes for MSStats Convert added
Jun 18, 2025
90ca387
Changes for MSStats
Jun 18, 2025
09cacfa
Changes for Script with fix
Jun 19, 2025
e0472b9
Changes to debug output
Jun 19, 2025
2d8e3b3
Change for Script order
Jun 19, 2025
73e6bcc
Changes to see file print
Jun 19, 2025
244d42f
Corrections added for metamorpheus script file
Jun 19, 2025
462925f
Rerun metamorpheus benchmark
Jun 19, 2025
1b962cf
Changes for library
Jun 19, 2025
f0ec621
Changes for MBR
Jun 19, 2025
ec9fe42
Changes for calculate metrics
Aug 15, 2025
c7e7b60
All PR comments resolved
Aug 15, 2025
d8bcb03
Fix Bug : Unique comparisons not visible
Aug 15, 2025
ba65d2d
Changes reverted
Aug 15, 2025
423b485
Removed unnecessary file
Aug 15, 2025
7485d95
Changes for variable name correction
Aug 15, 2025
4fb9d06
Variable rename revert
Aug 15, 2025
9fc6a86
PR feedbacks
Aug 15, 2025
8518c44
Changes for Unique Organisms
Aug 15, 2025
8eefbff
Changes for protein_mappings
Aug 15, 2025
a9fbe80
Correction for Organism column
Aug 15, 2025
3be70e4
Organisms
Aug 15, 2025
85a41cd
Changes for organisms column
Aug 18, 2025
c4c0c86
Reverted changes
Aug 18, 2025
f6ac63e
Added new arguments
Aug 18, 2025
90cf018
Removed params
Aug 18, 2025
18 changes: 9 additions & 9 deletions .github/workflows/benchmark.yml
@@ -3,8 +3,8 @@ name: Run Simple R Script on HPC via Slurm
on:
push:
branches:
# - feature/multiple-scripts
- devel
- feature/metamorpheus-scripts
# - devel

jobs:
Benchmarking-pipeline:
@@ -21,24 +21,24 @@ jobs:
mkdir -p ~/.ssh
touch ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa
echo "${{ secrets.SSH_PRIVATE_KEY }}" > ~/.ssh/id_rsa
ssh-keyscan -H login-00.discovery.neu.edu >> ~/.ssh/known_hosts || exit 1
echo "${{ secrets.SSH_PRIVATE_KEY_EXPLORER }}" > ~/.ssh/id_rsa
ssh-keyscan -H login.explorer.northeastern.edu >> ~/.ssh/known_hosts || exit 1

- name: Transfer Files to HPC
run: |
scp -O benchmark/benchmark_Dowell2021-HEqe408_LFQ.R benchmark/benchmark_Puyvelde2022-HYE5600735_LFQ.R benchmark/scriptController.json benchmark/calculateMetrics.R benchmark/config.slurm raina.ans@login-00.discovery.neu.edu:/work/VitekLab/Projects/Benchmarking || exit 1
scp -r benchmark raina.ans@login.explorer.northeastern.edu:/projects/VitekLab/Projects/Benchmarking || exit 1

- name: Submit Slurm Job and Capture Job ID
id: submit_job
run: |
ssh raina.ans@login-00.discovery.neu.edu "cd /work/VitekLab/Projects/Benchmarking && sbatch config.slurm" | tee slurm_job_id.txt
ssh raina.ans@login.explorer.northeastern.edu "cd /projects/VitekLab/Projects/Benchmarking/benchmark && sbatch config.slurm" | tee slurm_job_id.txt
slurm_job_id=$(grep -oP '\d+' slurm_job_id.txt)
echo "Slurm Job ID is $slurm_job_id"
echo "slurm_job_id=$slurm_job_id" >> $GITHUB_ENV

- name: Monitor Slurm Job
run: |
ssh raina.ans@login-00.discovery.neu.edu "
ssh raina.ans@login.explorer.northeastern.edu "
while squeue -j ${{ env.slurm_job_id }} | grep -q ${{ env.slurm_job_id }}; do
echo 'Job Id : ${{ env.slurm_job_id }} is still running...'
sleep 10
@@ -48,8 +48,8 @@ jobs:

- name: Fetch Output
run: |
scp -O raina.ans@login-00.discovery.neu.edu:/work/VitekLab/Projects/Benchmarking/job_output.txt job_output.txt
scp -O raina.ans@login-00.discovery.neu.edu:/work/VitekLab/Projects/Benchmarking/job_error.txt job_error.txt
scp -O raina.ans@login.explorer.northeastern.edu:/projects/VitekLab/Projects/Benchmarking/benchmark/job_output.txt job_output.txt
scp -O raina.ans@login.explorer.northeastern.edu:/projects/VitekLab/Projects/Benchmarking/benchmark/job_error.txt job_error.txt

- name: Upload Output as Artifact
uses: actions/upload-artifact@v4
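One note on the "Submit Slurm Job and Capture Job ID" step in the workflow above: `grep -oP '\d+'` matches every digit run in the `sbatch` output, so the capture works only because the job ID happens to be the first number printed. `sbatch --parsable` prints just the job ID and sidesteps the parsing entirely. A minimal sketch (the `sbatch` output line is simulated here, not produced by a real submission):

```shell
# Simulated output of a plain "sbatch config.slurm" call
sbatch_output="Submitted batch job 123456"

# Current approach: grab digit runs, keeping only the first match explicitly
slurm_job_id=$(grep -oP '\d+' <<< "$sbatch_output" | head -n 1)
echo "parsed id: $slurm_job_id"   # prints: parsed id: 123456

# Simpler alternative on the cluster itself:
#   slurm_job_id=$(sbatch --parsable config.slurm)
# --parsable makes sbatch print only the job ID (optionally ";cluster"),
# so no grep is needed.
```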
93 changes: 93 additions & 0 deletions benchmark/benchmark_Metamorpheus.R
@@ -0,0 +1,93 @@
library(MSstatsConvert)
library(MSstats)
library(parallel)
library(stringr)
library(jsonlite)
library(dplyr)

source("metamorpheus_Process.R")
config <- fromJSON("scriptController.json", simplifyVector = FALSE)

Comment on lines +8 to +10
💡 Verification agent

🧩 Analysis chain

Fix: MetamorpheusToMSstatsFormat likely undefined; also make sources/paths robust to working directory

MetamorpheusToMSstatsFormat isn’t defined in this script and isn’t exported by MSstats/MSstatsConvert. It appears to live in metamorpheus_code.R, which isn’t sourced here. Also, both the source() and fromJSON() calls are relative to the current working directory and will break if the script isn’t launched from benchmark/.

Apply this diff to source both helper files and load the controller JSON robustly:

- source("metamorpheus_Process.R")
-config <- fromJSON("scriptController.json", simplifyVector = FALSE)
+script_args <- commandArgs(trailingOnly = FALSE)
+script_path <- sub("^--file=", "", script_args[grep("^--file=", script_args)])
+base_dir <- if (length(script_path)) dirname(normalizePath(script_path)) else "."
+repo_root <- normalizePath(file.path(base_dir, ".."), mustWork = FALSE)
+
+source(file.path(base_dir, "metamorpheus_Process.R"))
+# Ensure MetamorpheusToMSstatsFormat is available
+if (file.exists(file.path(repo_root, "metamorpheus_code.R"))) {
+  source(file.path(repo_root, "metamorpheus_code.R"))
+}
+config <- fromJSON(file.path(base_dir, "scriptController.json"), simplifyVector = FALSE)

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "PWD: $(pwd)"
echo
echo "---- benchmark directory listing ----"
ls -la benchmark || true
echo
echo "---- show benchmark/benchmark_Metamorpheus.R (first 200 lines) ----"
sed -n '1,200p' benchmark/benchmark_Metamorpheus.R || true
echo
echo "---- rg: search for MetamorpheusToMSstatsFormat ----"
rg -n --hidden --no-ignore -S "MetamorpheusToMSstatsFormat" || true
echo
echo "---- rg: search for metamorpheus_code.R references ----"
rg -n --hidden --no-ignore -S "metamorpheus_code\\.R" || true
echo
echo "---- fd: find files named metamorpheus_code.R or metamorpheus_Process.R ----"
fd -I -e R --hidden "metamorpheus_code.R" || true
fd -I -e R --hidden "metamorpheus_Process.R" || true
echo
echo "---- rg: search for source(...) of metamorpheus_Process.R ----"
rg -n --hidden --no-ignore -S "source\\(.*metamorpheus_Process\\.R" || true
echo
echo "---- rg: search for fromJSON\\(.*scriptController.json ----"
rg -n --hidden --no-ignore -S "fromJSON\\(.*scriptController\\.json" || true
echo
echo "---- check for metamorpheus_code.R at repo root ----"
if [ -f "metamorpheus_code.R" ]; then echo "FOUND: metamorpheus_code.R at repo root"; else echo "NOT FOUND: metamorpheus_code.R at repo root"; fi

Length of output: 6452


Fix required: source metamorpheus_code.R and make file paths robust to working directory

MetamorpheusToMSstatsFormat is defined in metamorpheus_code.R at the repo root (rg matches). benchmark/benchmark_Metamorpheus.R currently only does source("metamorpheus_Process.R") and fromJSON("scriptController.json"), which will break if the script is run from a different working directory — source the helper that defines MetamorpheusToMSstatsFormat and resolve paths relative to the script.

Files to update:

  • benchmark/benchmark_Metamorpheus.R — replace the relative source()/fromJSON() calls (lines ~8–10).
  • metamorpheus_code.R (repo root) — contains MetamorpheusToMSstatsFormat and must be sourced by the benchmark script.
  • benchmark/metamorpheus_Process.R — keep sourced, but source via computed script path.

Apply this diff to benchmark/benchmark_Metamorpheus.R:

- source("metamorpheus_Process.R")
- config <- fromJSON("scriptController.json", simplifyVector = FALSE)
+ script_args <- commandArgs(trailingOnly = FALSE)
+ script_path <- sub("^--file=", "", script_args[grep("^--file=", script_args)])
+ base_dir <- if (length(script_path)) dirname(normalizePath(script_path)) else "."
+ repo_root <- normalizePath(file.path(base_dir, ".."), mustWork = FALSE)
+
+ source(file.path(base_dir, "metamorpheus_Process.R"))
+ # Ensure MetamorpheusToMSstatsFormat is available
+ if (file.exists(file.path(repo_root, "metamorpheus_code.R"))) {
+   source(file.path(repo_root, "metamorpheus_code.R"))
+ }
+ config <- fromJSON(file.path(base_dir, "scriptController.json"), simplifyVector = FALSE)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
source("metamorpheus_Process.R")
config <- fromJSON("scriptController.json", simplifyVector = FALSE)
script_args <- commandArgs(trailingOnly = FALSE)
script_path <- sub("^--file=", "", script_args[grep("^--file=", script_args)])
base_dir <- if (length(script_path)) dirname(normalizePath(script_path)) else "."
repo_root <- normalizePath(file.path(base_dir, ".."), mustWork = FALSE)
source(file.path(base_dir, "metamorpheus_Process.R"))
# Ensure MetamorpheusToMSstatsFormat is available
if (file.exists(file.path(repo_root, "metamorpheus_code.R"))) {
source(file.path(repo_root, "metamorpheus_code.R"))
}
config <- fromJSON(file.path(base_dir, "scriptController.json"), simplifyVector = FALSE)
🤖 Prompt for AI Agents
In benchmark/benchmark_Metamorpheus.R around lines 8–10, the script currently
sources metamorpheus_Process.R and reads scriptController.json using relative
paths which will break if run from a different working directory; change the top
of the script to compute the script directory, source the repo-root
metamorpheus_code.R and the local metamorpheus_Process.R via absolute paths, and
read scriptController.json via that computed path (e.g. determine script_dir
with a robust pattern that falls back to getwd(), then call
source(file.path(script_dir, "metamorpheus_Process.R")) and
source(file.path(script_dir, "..", "metamorpheus_code.R")) and config <-
fromJSON(file.path(script_dir, "scriptController.json"), simplifyVector =
FALSE)).

runBenchmarkForMetaMorpheusData <- function(datasetPath, config) {

dataset_config <- config$datasets[[datasetPath]]
dataset_config <- as.list(dataset_config)

cat("Processing Dataset:", dataset_config$name, "\n")

filePath <- file.path(dataset_config$parent, dataset_config$data)
annotPath <- dataset_config$parent

input = data.table::fread(file.path(filePath, "QuantifiedPeaks.tsv"))
annot = data.table::fread(file.path(annotPath, "annotation.csv"))


cat("Dataset File Path:", filePath, "\n")
cat("Annotation File Path:", annotPath, "\n")

input = input %>% filter(!str_detect(`Protein Group`, ";")) # remove multiple protein group in same cell
input = input %>% filter(!str_detect(`Protein Group`, "DECOY")) # remove decoys

protein_mappings = data.table::fread(file.path(filePath, "QuantifiedProteins.tsv"))

protein_mappings = protein_mappings %>% filter(Organism %in% c("Escherichia coli (strain K12)", "Homo sapiens"))

print(protein_mappings)

input = input %>% filter(`Protein Group` %in% protein_mappings$`Protein Groups`)

output = MetamorpheusToMSstatsFormat(input, annot)
Contributor
@tonywu1999 tonywu1999 Jun 26, 2025
The MetamorpheusToMSstatsFormat function also has these two parameters. By default, they're set to TRUE

removeFewMeasurements, removeProtein_with1Feature

Could you double check there aren't major differences in empirical FDR when these two parameters are set to FALSE? I'm thinking due to the absence of PIP, certain proteins may be filtered out altogether, which could explain better empirical FDR with no PIP.
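The comparison the reviewer asks for could be run by calling the converter both ways and diffing the resulting protein sets. A sketch, assuming `MetamorpheusToMSstatsFormat` accepts the two parameters named in the comment and returns the standard MSstats long format with a `ProteinName` column (`input` and `annot` as loaded in the script above):

```r
# Default behaviour: both filters on, as in the current script
output_filtered <- MetamorpheusToMSstatsFormat(input, annot)

# Comparison run: keep sparse features and single-feature proteins
output_unfiltered <- MetamorpheusToMSstatsFormat(
  input, annot,
  removeFewMeasurements      = FALSE,  # keep features with few measurements
  removeProtein_with1Feature = FALSE   # keep proteins supported by one feature
)

# Proteins dropped by the default filters -- candidates for explaining
# the empirical FDR difference between the MBR and no-MBR runs
dropped <- setdiff(unique(output_unfiltered$ProteinName),
                   unique(output_filtered$ProteinName))
cat(length(dropped), "proteins removed by default filtering\n")
```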


data_process_tasks <- list(
list(
label = "Data process with Normalized Data",
result = function() dataProcess(output, featureSubset = "topN", n_top_feature = 20)
),
list(
label = "Data process with Normalization and MBImpute False",
result = function() dataProcess(output, featureSubset = "topN", n_top_feature = 20, MBimpute = FALSE)
),
list(
label = "Data process without Normalization",
result = function() dataProcess(output, featureSubset = "topN", normalization = "FALSE", n_top_feature = 20)
),
list(
label = "Data process without Normalization with MBImpute False",
result = function() dataProcess(output, featureSubset = "topN", normalization = "FALSE", n_top_feature = 20, MBimpute = FALSE)
),
list(
label = "Data process without Normalization and Imputation On for all features",
result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBimpute = FALSE)
),
list(
label = "Data process without Normalization and Imputation On for top3 features",
result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = FALSE)
)
Comment on lines +47 to +65
⚠️ Potential issue

Correct MBimpute argument name and align “Imputation On” task behavior

Two issues:

  • Typo: MBImpute (capital I) is not a valid dataProcess() argument; it should be MBimpute.
  • Tasks labeled “…Imputation On…” currently pass FALSE, turning imputation off and contradicting the labels.

Apply this diff to fix both problems:

     list(
       label = "Data process with Normalization and MBImpute False",
-      result = function() dataProcess(output, featureSubset = "topN", n_top_feature = 20, MBimpute = FALSE)
+      result = function() dataProcess(output, featureSubset = "topN", n_top_feature = 20, MBimpute = FALSE)
     ),
@@
     list(
       label = "Data process without Normalization with MBImpute False",
-      result = function() dataProcess(output, featureSubset = "topN", normalization = "FALSE", n_top_feature = 20, MBimpute = FALSE)
+      result = function() dataProcess(output, featureSubset = "topN", normalization = "FALSE", n_top_feature = 20, MBimpute = FALSE)
     ),
     list(
       label = "Data process without Normalization and Imputation On for all features",
-      result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBImpute = FALSE)
+      result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBimpute = TRUE)
     ),
     list(
       label = "Data process without Normalization and Imputation On for top3 features",
-      result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBImpute = FALSE)
+      result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = TRUE)
     )

Note: The first two blocks already use the correct argument spelling (MBimpute). The changes above fix the last two blocks to both correct the spelling and enable imputation.

📝 Committable suggestion


Suggested change
label = "Data process with Normalization and MBImpute False",
result = function() dataProcess(output, featureSubset = "topN", n_top_feature = 20, MBimpute = FALSE)
),
list(
label = "Data process without Normalization",
result = function() dataProcess(output, featureSubset = "topN", normalization = "FALSE", n_top_feature = 20)
),
list(
label = "Data process without Normalization with MBImpute False",
result = function() dataProcess(output, featureSubset = "topN", normalization = "FALSE", n_top_feature = 20, MBimpute = FALSE)
),
list(
label = "Data process without Normalization and Imputation On for all features",
result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBimpute = FALSE)
),
list(
label = "Data process without Normalization and Imputation On for top3 features",
result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = FALSE)
)
label = "Data process with Normalization and MBImpute False",
result = function() dataProcess(output, featureSubset = "topN", n_top_feature = 20, MBimpute = FALSE)
),
list(
label = "Data process without Normalization",
result = function() dataProcess(output, featureSubset = "topN", normalization = "FALSE", n_top_feature = 20)
),
list(
label = "Data process without Normalization with MBImpute False",
result = function() dataProcess(output, featureSubset = "topN", normalization = "FALSE", n_top_feature = 20, MBimpute = FALSE)
),
list(
label = "Data process without Normalization and Imputation On for all features",
result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBimpute = TRUE)
),
list(
label = "Data process without Normalization and Imputation On for top3 features",
result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = TRUE)
)
🤖 Prompt for AI Agents
In benchmark/benchmark_Metamorpheus.R around lines 44 to 62, the dataProcess
calls have a typo (MBImpute with a capital I) and the two tasks labeled
"...Imputation On..." pass MBimpute = FALSE (disabling imputation) which
contradicts the labels; fix by renaming the argument to MBimpute (lowercase i)
where misspelled and set MBimpute = TRUE for the two "Imputation On" cases so
their behavior matches their labels.

)

start_time <- Sys.time()

num_cores <- detectCores() - 1

summarized_results <- mclapply(data_process_tasks, function(task) {
list(label = task$label, summarized = task$result())
}, mc.cores = num_cores)


results_list <- mclapply(summarized_results, function(res) {
calculate_Metrics(res$summarized, protein_mappings, res$label)
}, mc.cores = num_cores)

Comment on lines +70 to +80
💡 Verification agent

🧩 Analysis chain

Guard against mc.cores == 0 and provide a Windows-safe fallback

detectCores() - 1 can be zero; mclapply will error. Also, mclapply with mc.cores > 1 is unavailable on Windows. Add a safe core calculation and fall back to lapply when needed.

-  num_cores <- detectCores() - 1 
-
-  summarized_results <- mclapply(data_process_tasks, function(task) {
-    list(label = task$label, summarized = task$result())
-  }, mc.cores = num_cores)	
+  num_cores <- max(1L, as.integer(detectCores() - 1L))
+  if (.Platform$OS.type != "windows" && num_cores > 1L) {
+    summarized_results <- mclapply(
+      data_process_tasks,
+      function(task) list(label = task$label, summarized = task$result()),
+      mc.cores = num_cores
+    )
+  } else {
+    summarized_results <- lapply(
+      data_process_tasks,
+      function(task) list(label = task$label, summarized = task$result())
+    )
+  }
@@
-  results_list <- mclapply(summarized_results, function(res) {
-    calculate_Metrics(res$summarized, protein_mappings, res$label)
-  }, mc.cores = num_cores)
+  if (.Platform$OS.type != "windows" && num_cores > 1L) {
+    results_list <- mclapply(
+      summarized_results,
+      function(res) calculate_Metrics(res$summarized, protein_mappings, res$label),
+      mc.cores = num_cores
+    )
+  } else {
+    results_list <- lapply(
+      summarized_results,
+      function(res) calculate_Metrics(res$summarized, protein_mappings, res$label)
+    )
+  }

Optional verification: please run a small dataset on a 1-core environment to confirm no regression when num_cores == 1.


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "Checking for benchmark/benchmark_Metamorpheus.R..."
if [ -f benchmark/benchmark_Metamorpheus.R ]; then
  echo "== File: benchmark/benchmark_Metamorpheus.R (first 220 lines) =="
  sed -n '1,220p' benchmark/benchmark_Metamorpheus.R
else
  echo "benchmark/benchmark_Metamorpheus.R not found"
fi

echo
echo "== Searching repo for parallel-related usages =="
rg -n --hidden -S 'detectCores|mclapply|library\\(parallel\\)|require\\(parallel\\)|\\.Platform' || true

echo
echo "== Showing matches in benchmark/ directory with context =="
rg -n --hidden -S --context 3 'detectCores|mclapply|\\.Platform' benchmark || true

Length of output: 8050


Guard against mc.cores == 0 and add a Windows-safe fallback (applies to multiple benchmark scripts)

detectCores() - 1 can be 0 (causing mclapply to error) and mclapply’s multicore mode is not available on Windows — fall back to lapply when multicore isn’t usable.

Files to update:

  • benchmark/benchmark_Metamorpheus.R (lines ~69–79)
  • benchmark/benchmark_Puyvelde2022-HYE5600735_LFQ.R (lines ~47–56)
  • benchmark/benchmark_Dowell2021-HEqe408_LFQ.R (lines ~48–56)

Suggested change (apply the same pattern to each file):

-  num_cores <- detectCores() - 1 
-
-  summarized_results <- mclapply(data_process_tasks, function(task) {
-    list(label = task$label, summarized = task$result())
-  }, mc.cores = num_cores)	
+  num_cores <- max(1L, as.integer(detectCores() - 1L))
+  if (.Platform$OS.type != "windows" && num_cores > 1L) {
+    summarized_results <- mclapply(
+      data_process_tasks,
+      function(task) list(label = task$label, summarized = task$result()),
+      mc.cores = num_cores
+    )
+  } else {
+    summarized_results <- lapply(
+      data_process_tasks,
+      function(task) list(label = task$label, summarized = task$result())
+    )
+  }
@@
-  results_list <- mclapply(summarized_results, function(res) {
-    calculate_Metrics(res$summarized, protein_mappings, res$label)
-  }, mc.cores = num_cores)
+  if (.Platform$OS.type != "windows" && num_cores > 1L) {
+    results_list <- mclapply(
+      summarized_results,
+      function(res) calculate_Metrics(res$summarized, protein_mappings, res$label),
+      mc.cores = num_cores
+    )
+  } else {
+    results_list <- lapply(
+      summarized_results,
+      function(res) calculate_Metrics(res$summarized, protein_mappings, res$label)
+    )
+  }

Optional: verify on a 1-core / Windows-like environment to ensure no regressions.

📝 Committable suggestion


Suggested change
num_cores <- detectCores() - 1
summarized_results <- mclapply(data_process_tasks, function(task) {
list(label = task$label, summarized = task$result())
}, mc.cores = num_cores)
results_list <- mclapply(summarized_results, function(res) {
calculate_Metrics(res$summarized, protein_mappings, res$label)
}, mc.cores = num_cores)
num_cores <- max(1L, as.integer(detectCores() - 1L))
if (.Platform$OS.type != "windows" && num_cores > 1L) {
summarized_results <- mclapply(
data_process_tasks,
function(task) list(label = task$label, summarized = task$result()),
mc.cores = num_cores
)
} else {
summarized_results <- lapply(
data_process_tasks,
function(task) list(label = task$label, summarized = task$result())
)
}
if (.Platform$OS.type != "windows" && num_cores > 1L) {
results_list <- mclapply(
summarized_results,
function(res) calculate_Metrics(res$summarized, protein_mappings, res$label),
mc.cores = num_cores
)
} else {
results_list <- lapply(
summarized_results,
function(res) calculate_Metrics(res$summarized, protein_mappings, res$label)
)
}
🤖 Prompt for AI Agents
In benchmark/benchmark_Metamorpheus.R around lines 69–79, detectCores() - 1 may
be 0 and mclapply is not available on Windows; change to compute num_cores <-
max(detectCores() - 1L, 1L) and then call either lapply when running on Windows
or when num_cores == 1, otherwise call mclapply with mc.cores = num_cores; apply
the same pattern to the other two files (benchmark_Puyvelde2022-HYE5600735_LFQ.R
lines ~47–56 and benchmark_Dowell2021-HEqe408_LFQ.R lines ~48–56) so multicore
is used only when supported and falls back to lapply safely.


final_results <- do.call(rbind, results_list)
end_time <- Sys.time()
total_time <- end_time - start_time
print(final_results)
print(paste("Total Execution Time:", total_time))

}



runBenchmarkForMetaMorpheusData("DDA-Solivais2024-Metamorpheus_MBR_LFQ", config)
runBenchmarkForMetaMorpheusData("DDA-Solivais2024-Metamorpheus_NoMBR_LFQ", config)
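For reference, the fields the script reads from scriptController.json (`config$datasets[[datasetPath]]` with `$name`, `$parent`, and `$data`) imply a shape along these lines; the values shown are illustrative placeholders, not taken from the repository:

```json
{
  "datasets": {
    "DDA-Solivais2024-Metamorpheus_MBR_LFQ": {
      "name": "Solivais2024 Metamorpheus (MBR)",
      "parent": "/projects/VitekLab/Data/Solivais2024",
      "data": "Metamorpheus_MBR"
    },
    "DDA-Solivais2024-Metamorpheus_NoMBR_LFQ": {
      "name": "Solivais2024 Metamorpheus (no MBR)",
      "parent": "/projects/VitekLab/Data/Solivais2024",
      "data": "Metamorpheus_NoMBR"
    }
  }
}
```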
59 changes: 36 additions & 23 deletions benchmark/config.slurm
@@ -1,38 +1,51 @@
#!/bin/bash
#SBATCH --job-name=msstats_benchmark_job_updated
#SBATCH --chdir=/work/VitekLab/Projects/Benchmarking/
#SBATCH --chdir=/projects/VitekLab/Projects/Benchmarking/benchmark
#SBATCH --output=job_output.txt
#SBATCH --error=job_error.txt
#SBATCH --open-mode=append
#SBATCH --time=01:00:00 # Set the maximum run time
#SBATCH --ntasks=1 # Number of tasks (one process)
#SBATCH --cpus-per-task=8 # Use 8 CPU cores for the task
#SBATCH --mem=256G # Request 256GB of memory
#SBATCH --partition=short # Use the 'short' partition (or change as needed)
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=128G
#SBATCH --partition=short

module load R-geospatial

module load gcc/11.1.0
module load cmake/3.23.2
module load R
module load cmake/3.30.2

export LC_ALL=C
export R_LIBS_USER=/home/raina.ans/R/x86_64-pc-linux-gnu-library/4.2-geospatial
export R_LIBS_USER=/home/raina.ans/R/x86_64-pc-linux-gnu-library/4.4
mkdir -p $R_LIBS_USER

mkdir -p $HOME/lib_fix
ln -sf /shared/EL9/explorer/R/4.4.1/lib64/R/lib/libRlapack.so $HOME/lib_fix/libRlapack.so.3

mkdir -p $R_LIBS_USER
export LD_LIBRARY_PATH=$HOME/lib_fix:/shared/EL9/explorer/R/4.4.1/lib64/R/lib:/usr/lib64:$LD_LIBRARY_PATH

module load R
Rscript -e "if (!requireNamespace('remotes', quietly = TRUE)) install.packages('remotes', lib = Sys.getenv('R_LIBS_USER'), repos = 'https://cloud.r-project.org'); \
remotes::install_github('Vitek-Lab/MSstats', ref = 'devel', lib = Sys.getenv('R_LIBS_USER')); \
remotes::install_github('Vitek-Lab/MSstatsConvert', ref = 'master', lib = Sys.getenv('R_LIBS_USER')); \
install.packages(c('dplyr', 'stringr', 'ggplot2'), lib = Sys.getenv('R_LIBS_USER'), repos = 'https://cloud.r-project.org')"
Rscript -e '
.libPaths("/home/raina.ans/R/x86_64-pc-linux-gnu-library/4.4")
Sys.unsetenv("R_LIBS")
cat(".libPaths():\n"); print(.libPaths())

if (!requireNamespace("remotes", quietly = TRUE))
install.packages("remotes", repos = "https://cloud.r-project.org")

install.packages("nloptr", type = "source", repos = "https://cloud.r-project.org")

for (pkg in c("dplyr", "stringr", "ggplot2")) {
if (!requireNamespace(pkg, quietly = TRUE)) {
install.packages(pkg, repos = "https://cloud.r-project.org")
}
}
BiocManager::install(c("MSstatsConvert", "preprocessCore"), force = TRUE)
remotes::install_github("Vitek-Lab/MSstats", ref = "devel", force = TRUE)
'
Comment on lines +24 to +41
⚠️ Potential issue

This will fail if BiocManager is missing; install it before calling BiocManager::install

BiocManager::install errors if BiocManager isn’t installed. Add a guard to install BiocManager first.

Apply this diff:

 Rscript -e '
 .libPaths("/home/raina.ans/R/x86_64-pc-linux-gnu-library/4.4")
 Sys.unsetenv("R_LIBS")
 cat(".libPaths():\n"); print(.libPaths())
 
 if (!requireNamespace("remotes", quietly = TRUE))
     install.packages("remotes", repos = "https://cloud.r-project.org")
 
-install.packages("nloptr", type = "source", repos = "https://cloud.r-project.org")
+install.packages("nloptr", type = "source", repos = "https://cloud.r-project.org")
 
 for (pkg in c("dplyr", "stringr", "ggplot2")) {
     if (!requireNamespace(pkg, quietly = TRUE)) {
         install.packages(pkg, repos = "https://cloud.r-project.org")
     }
 }
-BiocManager::install(c("MSstatsConvert", "preprocessCore"), force = TRUE)
+if (!requireNamespace("BiocManager", quietly = TRUE)) {
+    install.packages("BiocManager", repos = "https://cloud.r-project.org")
+}
+BiocManager::install(c("MSstatsConvert", "preprocessCore"), force = TRUE)
 remotes::install_github("Vitek-Lab/MSstats", ref = "devel", force = TRUE)
 '
📝 Committable suggestion


Suggested change
Rscript -e '
.libPaths("/home/raina.ans/R/x86_64-pc-linux-gnu-library/4.4")
Sys.unsetenv("R_LIBS")
cat(".libPaths():\n"); print(.libPaths())
if (!requireNamespace("remotes", quietly = TRUE))
install.packages("remotes", repos = "https://cloud.r-project.org")
install.packages("nloptr", type = "source", repos = "https://cloud.r-project.org")
for (pkg in c("dplyr", "stringr", "ggplot2")) {
if (!requireNamespace(pkg, quietly = TRUE)) {
install.packages(pkg, repos = "https://cloud.r-project.org")
}
}
BiocManager::install(c("MSstatsConvert", "preprocessCore"), force = TRUE)
remotes::install_github("Vitek-Lab/MSstats", ref = "devel", force = TRUE)
'
Rscript -e '
.libPaths("/home/raina.ans/R/x86_64-pc-linux-gnu-library/4.4")
Sys.unsetenv("R_LIBS")
cat(".libPaths():\n"); print(.libPaths())
if (!requireNamespace("remotes", quietly = TRUE))
install.packages("remotes", repos = "https://cloud.r-project.org")
install.packages("nloptr", type = "source", repos = "https://cloud.r-project.org")
for (pkg in c("dplyr", "stringr", "ggplot2")) {
if (!requireNamespace(pkg, quietly = TRUE)) {
install.packages(pkg, repos = "https://cloud.r-project.org")
}
}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager", repos = "https://cloud.r-project.org")
}
BiocManager::install(c("MSstatsConvert", "preprocessCore"), force = TRUE)
remotes::install_github("Vitek-Lab/MSstats", ref = "devel", force = TRUE)
'
🤖 Prompt for AI Agents
In benchmark/config.slurm around lines 24 to 41, the script calls
BiocManager::install without ensuring the BiocManager package is present; add a
guard to check for BiocManager with requireNamespace("BiocManager",
quietly=TRUE) and, if missing, install it (install.packages("BiocManager",
repos="https://cloud.r-project.org")) before calling BiocManager::install, then
proceed with the existing BiocManager::install call.


R_SCRIPTS=("benchmark_Dowell2021-HEqe408_LFQ.R" "benchmark_Puyvelde2022-HYE5600735_LFQ.R")
R_SCRIPTS=("benchmark_Dowell2021-HEqe408_LFQ.R" "benchmark_Puyvelde2022-HYE5600735_LFQ.R" "benchmark_Metamorpheus.R" )

for script in "${R_SCRIPTS[@]}"; do
-echo "Executing script: $script" >> job_output.txt
-Rscript "$script" >> job_output.txt 2>> job_error.txt
-wait
-echo "Finished executing script: $script" >> job_output.txt
+echo "Executing script: $script" >> job_output.txt
+stdbuf -oL -eL Rscript "$script" >> job_output.txt 2>> job_error.txt
+wait
+echo "Finished executing script: $script" >> job_output.txt
echo -e "\n\n"
done
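The loop above runs each benchmark sequentially but discards each script's exit status, so a failed benchmark looks the same as a successful one in job_output.txt. A minimal sketch of a variant that records the status is below; the Rscript call is commented out and replaced by a stand-in so the sketch runs anywhere, and the script names are placeholders rather than guaranteed filenames.

```shell
#!/bin/bash
# Variant of the benchmark loop that also records each script's exit status,
# so a failed benchmark is visible in job_output.txt rather than silent.
run_benchmarks() {
  for script in "$@"; do
    echo "Executing script: $script" >> job_output.txt
    # stdbuf -oL -eL Rscript "$script" >> job_output.txt 2>> job_error.txt
    true  # stand-in for the Rscript call so this sketch is self-contained
    status=$?
    echo "Finished executing script: $script (exit $status)" >> job_output.txt
    if [ "$status" -ne 0 ]; then
      echo "WARNING: $script failed with exit code $status" >> job_error.txt
    fi
  done
}

run_benchmarks "benchmark_Dowell2021-HEqe408_LFQ.R" "benchmark_Metamorpheus.R"
```

Capturing `$?` immediately after the command is the important detail: any intervening command (even an `echo`) would overwrite it.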
47 changes: 47 additions & 0 deletions benchmark/metamorpheus_Process.R
@@ -0,0 +1,47 @@
calculate_Metrics <- function(QuantData, protein_mappings, task_label, alpha = 0.05) {
comparison <- matrix(
c(-1,0,0,0,1, # E-A
-1,0,0,1,0, # D-A
-1,0,1,0,0, # C-A
-1,1,0,0,0), # B-A
nrow = 4, byrow = TRUE
)
rownames(comparison) <- c("E-A", "D-A", "C-A", "B-A")
groups <- levels(QuantData$ProteinLevelData$GROUP)
colnames(comparison) <- groups[order(as.numeric(groups))]

Comment on lines +10 to +12

🛠️ Refactor suggestion

Avoid fragile column ordering: stop coercing group labels to numeric

as.numeric(groups) will warn and is unreliable when GROUP levels are non-numeric (e.g., "A","B","C"...). You only need to align by names; just set the column names directly. Optionally validate the expected length.

Apply this diff:

-  groups <- levels(QuantData$ProteinLevelData$GROUP)
-  colnames(comparison) <- groups[order(as.numeric(groups))]
+  groups <- levels(QuantData$ProteinLevelData$GROUP)
+  if (length(groups) != ncol(comparison)) {
+    stop(sprintf("Contrast matrix expects %d groups but found %d in data: %s",
+                 ncol(comparison), length(groups), paste(groups, collapse = ", ")))
+  }
+  colnames(comparison) <- groups
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-  groups <- levels(QuantData$ProteinLevelData$GROUP)
-  colnames(comparison) <- groups[order(as.numeric(groups))]
+  groups <- levels(QuantData$ProteinLevelData$GROUP)
+  if (length(groups) != ncol(comparison)) {
+    stop(sprintf("Contrast matrix expects %d groups but found %d in data: %s",
+                 ncol(comparison), length(groups), paste(groups, collapse = ", ")))
+  }
+  colnames(comparison) <- groups
🤖 Prompt for AI Agents
In benchmark/metamorpheus_Process.R around lines 10 to 12, the code coerces
group level labels to numeric via as.numeric(groups) which is fragile and warns
for non-numeric levels; instead assign column names directly using the factor
levels (or sorted/validated names if needed) and optionally verify the length
matches ncol(comparison) before setting colnames to avoid misalignment. Ensure
you remove the as.numeric() coercion and replace it with a direct assignment of
groups (or a deterministic name ordering) and add a simple length check to throw
a clear error if sizes differ.

model <- groupComparison(
contrast.matrix = comparison,
data = QuantData,
use_log_file = FALSE
)

ecoli_ids <- protein_mappings %>%
filter(Organism == "Escherichia coli (strain K12)") %>%
pull(`Protein Groups`)

filtered_comparison_result <- model$ComparisonResult %>%
mutate(ecoli = Protein %in% ecoli_ids) %>%
filter(is.na(issue))

labels <- unique(filtered_comparison_result$Label)
result_rows <- lapply(labels, function(lbl) {
df <- filtered_comparison_result %>% filter(Label == lbl)
sig <- df %>% filter(adj.pvalue < alpha)

tp <- sig %>% filter(ecoli) %>% nrow()
fp <- sig %>% filter(!ecoli) %>% nrow()
tot <- tp + fp
fdr <- if (tot > 0) fp / tot else NA_real_

data.frame(
Task = task_label,
Comparison = lbl,
FDR = fdr,
stringsAsFactors = FALSE
)
})

results <- do.call(rbind, result_rows)
return(results)
}
34 changes: 32 additions & 2 deletions benchmark/scriptController.json
@@ -2,7 +2,7 @@
"datasets": {
"DDA-Puyvelde2022-HYE5600735_LFQ": {
"name": "DDA-Puyvelde2022-HYE5600735_LFQ",
-      "file": "/work/VitekLab/Data/MS/Benchmarking/DDA-Puyvelde2022/DDA-Puyvelde2022-HYE5600735_LFQ/FragPipe/TOP0/MSstats_fixed.csv",
+      "file": "/projects/VitekLab/Data/MS/Benchmarking/DDA-Puyvelde2022/DDA-Puyvelde2022-HYE5600735_LFQ/FragPipe/TOP0/MSstats_fixed.csv",
"samples": {
"Human": {
"pattern": "_HUMAN$",
@@ -20,7 +20,37 @@
},
"DDA-Dowell2021-HEqe408_LFQ": {
"name": "DDA-Dowell2021-HEqe408_LFQ",
-      "file": "/work/VitekLab/Data/MS/Benchmarking/DDA-Dowell2021-HEqe408_LFQ/FragPipe/TOP0/MSstats.csv",
+      "file": "/projects/VitekLab/Data/MS/Benchmarking/DDA-Dowell2021-HEqe408_LFQ/FragPipe/TOP0/MSstats.csv",
"samples": {
"Human": {
"pattern": "_HUMAN$",
"type": "insignificant"
},
"Ecoli": {
"pattern": "_ECOLI$",
"type": "significant"
}
}
Comment on lines 22 to +33

💡 Verification agent

🧩 Analysis chain

Confirm consumer code reads new fields (parent and data)

Older datasets use a file field; the MetaMorpheus entries introduce parent + data. Ensure the benchmarking script branches correctly when these fields are present.

Run this to verify the harness reads parent and data and constructs file paths accordingly:


🏁 Script executed:

#!/bin/bash
# Inspect benchmark scripts for usage of scriptController.json
rg -n -C3 'scriptController\.json|parent|data"|samples' --type=r benchmark

Length of output: 4202


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Searching benchmark R scripts for dataset_config usage and path construction..."
rg -n -C3 'dataset_config|filePath|file.path\(|dataset_config\$file|dataset_config\$parent|dataset_config\$data' --type=r benchmark || true

echo
echo "----- Full contents of the three benchmark scripts referenced in previous output -----"
for f in benchmark/benchmark_Metamorpheus.R benchmark/benchmark_Dowell2021-HEqe408_LFQ.R benchmark/benchmark_Puyvelde2022-HYE5600735_LFQ.R; do
  echo "=== $f ==="
  sed -n '1,240p' "$f" || true
  echo
done

Length of output: 14136


Action: Make benchmark scripts accept both file and parent+data config shapes

Verified: benchmark/benchmark_Metamorpheus.R already constructs paths from dataset_config$parent + dataset_config$data; benchmark_Dowell2021-HEqe408_LFQ.R and benchmark_Puyvelde2022-HYE5600735_LFQ.R still assume dataset_config$file only. This will break if scriptController.json uses the new parent+data shape for non‑Metamorpheus datasets.

Files to update:

  • benchmark/benchmark_Metamorpheus.R — lines ~18–21: add a fallback to dataset_config$file when parent/data are absent.
  • benchmark/benchmark_Dowell2021-HEqe408_LFQ.R — lines ~14–21: resolve a data_file from either dataset_config$file or parent+data, then use it for fread and logging.
  • benchmark/benchmark_Puyvelde2022-HYE5600735_LFQ.R — lines ~14–21: same change as Dowell.

Suggested minimal patches (apply the same pattern to both Dowell and Puyvelde):

Before:

cat("Processing Dataset:", dataset_config$name, "\n")
cat("Dataset File Path:", dataset_config$file, "\n")
...
fragpipe_raw <- data.table::fread(dataset_config$file)

After:

data_file <- if (!is.null(dataset_config$file)) {
  dataset_config$file
} else if (!is.null(dataset_config$parent) && !is.null(dataset_config$data)) {
  file.path(dataset_config$parent, dataset_config$data)
} else {
  stop("Dataset path not specified in config (expected 'file' or 'parent' + 'data')")
}
cat("Processing Dataset:", dataset_config$name, "\n")
cat("Dataset File Path:", data_file, "\n")
...
fragpipe_raw <- data.table::fread(data_file)

Suggested change for Metamorpheus (keep existing parent+data behavior, add fallback to file):

Before:

filePath <- file.path(dataset_config$parent, dataset_config$data)
annotPath <- dataset_config$parent

After:

if (!is.null(dataset_config$file)) {
  # If a file path is supplied, use its directory as the dataset directory
  filePath <- dirname(dataset_config$file)
  annotPath <- dirname(filePath)
} else {
  filePath <- file.path(dataset_config$parent, dataset_config$data)
  annotPath <- dataset_config$parent
}

Reasoning: This makes each script robust to either config shape (old file or new parent+data) and fails fast with a clear error if neither is present.

Please apply these small changes to the three scripts (or ensure scriptController.json keeps the original file field for datasets consumed by the FragPipe-based scripts).

🤖 Prompt for AI Agents
In benchmark/scriptController.json around lines 22–33 and the three R scripts:
benchmark/benchmark_Metamorpheus.R (~lines 18–21),
benchmark/benchmark_Dowell2021-HEqe408_LFQ.R (~lines 14–21), and
benchmark/benchmark_Puyvelde2022-HYE5600735_LFQ.R (~lines 14–21), update each
script to accept either dataset_config$file OR dataset_config$parent +
dataset_config$data: for Dowell and Puyvelde, resolve a single data_file by
using dataset_config$file if present, otherwise build it with
file.path(dataset_config$parent, dataset_config$data), stop with a clear error
if neither, then use data_file for logging and fread; for Metamorpheus, keep
parent+data behavior but add a fallback where if dataset_config$file is
provided, set filePath to dirname(dataset_config$file) and annotPath to its
parent (dirname(filePath)); ensure all logging uses the resolved path and fail
fast with a descriptive message when no valid path is provided.
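The fallback logic the review proposes for the R scripts (prefer an explicit `file`, else join `parent` + `data`, else fail fast) can be sketched generically in shell. `DATASET_FILE`, `DATASET_PARENT`, and `DATASET_DATA` are hypothetical variables used for illustration, not fields the harness actually exports from scriptController.json.

```shell
#!/bin/bash
# Sketch of the 'file' vs 'parent'+'data' resolution described above.
resolve_dataset_path() {
  if [ -n "${DATASET_FILE:-}" ]; then
    printf '%s\n' "$DATASET_FILE"                       # old shape: explicit file
  elif [ -n "${DATASET_PARENT:-}" ] && [ -n "${DATASET_DATA:-}" ]; then
    printf '%s/%s\n' "$DATASET_PARENT" "$DATASET_DATA"  # new shape: parent + data
  else
    echo "Dataset path not specified (expected 'file' or 'parent' + 'data')" >&2
    return 1
  fi
}

DATASET_PARENT="/projects/VitekLab/Data/MS/Benchmarking/DDA-Solivais2024_Metamorpheus/Current"
DATASET_DATA="FlashLFQ_NoNormalization_NoPIP"
resolve_dataset_path
```

Checking the explicit path first keeps old config entries working unchanged, while the final branch fails fast with a clear message instead of passing an empty path to a downstream reader.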

},
"DDA-Solivais2024-Metamorpheus_NoMBR_LFQ": {
"name": "DDA-Solivais2024-Metamorpheus_NoMBR_LFQ",
"parent": "/projects/VitekLab/Data/MS/Benchmarking/DDA-Solivais2024_Metamorpheus/Current",
"data":"FlashLFQ_NoNormalization_NoPIP",
"samples": {
"Human": {
"pattern": "_HUMAN$",
"type": "insignificant"
},
"Ecoli": {
"pattern": "_ECOLI$",
"type": "significant"
}
}
},
"DDA-Solivais2024-Metamorpheus_MBR_LFQ": {
"name": "DDA-Solivais2024-Metamorpheus_MBR_LFQ",
"parent": "/projects/VitekLab/Data/MS/Benchmarking/DDA-Solivais2024_Metamorpheus/Current",
"data":"FlashLFQ_v1.0_NoNormalization_wPIP",
"samples": {
"Human": {
"pattern": "_HUMAN$",