
Conversation

@anshuman-raina (Contributor) commented Jun 13, 2025

User description

Motivation and Context

Please include relevant motivation and context of the problem along with a short summary of the solution.

Changes

Please provide a detailed bullet point list of your changes.

Testing

Please describe any unit tests you added or modified to verify your changes.

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules
  • I have run the devtools::document() command after my changes and committed the added files

PR Type

Enhancement, Other


Description

  • Add Metamorpheus benchmarking R script

  • Create helper function file for metrics calculation

  • Update SLURM and workflow configs for new script

  • Add dataset entry in scriptController.json


Changes walkthrough 📝

Relevant files

Configuration changes

  benchmark.yml — Update benchmark workflow GitHub Actions config
  (.github/workflows/benchmark.yml)
  • Switch branch filter to feature/metamorpheus-scripts
  • Update SSH key secret and explorer host
  • Change scp commands to recursive copy
  • Adjust SSH/SCP paths to new cluster
  +9/-9

  config.slurm — Include Metamorpheus script in SLURM config
  (benchmark/config.slurm)
  • Include benchmark_Metamorpheus.R in R_SCRIPTS
  • Adjust script array syntax for SLURM
  +1/-1

  scriptController.json — Extend dataset config for Metamorpheus
  (benchmark/scriptController.json)
  • Add "DDA-Solivais2024-Metamorpheus_NoMBR_LFQ" dataset
  • Specify parent path, data folder, sample patterns
  +15/-0

Enhancement

  benchmark_Metamorpheus.R — Add Metamorpheus benchmarking R script
  (benchmark/benchmark_Metamorpheus.R)
  • New R script for Metamorpheus benchmarking
  • Reads dataset config from scriptController.json
  • Runs parallel dataProcess scenarios
  • Aggregates and prints metric results
  +80/-0

  metamorpheus_Process.R — Add metrics calculation helper function
  (benchmark/metamorpheus_Process.R)
  • Define calculate_Metrics helper function
  • Set up contrast matrix for groupComparison
  • Compute and report False Discovery Rate
  +41/-0

Miscellaneous

  metamorpheus_code.R — Add example Metamorpheus analysis script
  (metamorpheus_code.R)
  • Provide standalone Metamorpheus analysis example
  • Compare MBR vs no-MBR workflows and FDR
  • Generate QC plots and compute metrics
  +112/-0

    Summary by CodeRabbit

    • New Features

      • Added a parallelized MetaMorpheus benchmarking workflow that computes per-task FDR metrics and reports timing.
      • Added two MetaMorpheus datasets and dataset sample-type mappings (Human vs Ecoli).
    • Chores

      • Migrated benchmark execution to a different HPC host and updated remote storage/working paths and SSH secret.
      • Updated job submission/runtime settings (reduced memory, updated R/runtime libs), revamped dependency installation, and enabled line‑buffered logging.

    @anshuman-raina anshuman-raina changed the title Feature/metamorpheus scripts WIP : Feature/metamorpheus scripts Jun 13, 2025
    @github-actions

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Array Formatting

    The R_SCRIPTS array contains a stray comma that could break the shell array syntax and prevent proper iteration over scripts.

    R_SCRIPTS=("benchmark_Dowell2021-HEqe408_LFQ.R" "benchmark_Puyvelde2022-HYE5600735_LFQ.R", "benchmark_Metamorpheus.R")
    Missing JSON Dependency

    The script calls fromJSON without loading jsonlite (or another JSON library), leading to a potential undefined function error.

    config <- fromJSON("scriptController.json", simplifyVector = FALSE)
    Argument Type Mismatch

    The normalization parameter is passed as the string "FALSE" instead of the logical FALSE, which may not be handled correctly by dataProcess.

    result = function() dataProcess(output, featureSubset = "topN", normalization = "FALSE", n_top_feature = 20)

    @github-actions

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    Category | Suggestion | Impact
    Possible issue
    Fix Bash array syntax

    Remove the comma in the array declaration so that Bash treats each script as a
    separate element. Bash arrays should be space-separated without commas.

    benchmark/config.slurm [30]

    -R_SCRIPTS=("benchmark_Dowell2021-HEqe408_LFQ.R" "benchmark_Puyvelde2022-HYE5600735_LFQ.R", "benchmark_Metamorpheus.R")
    +R_SCRIPTS=("benchmark_Dowell2021-HEqe408_LFQ.R" "benchmark_Puyvelde2022-HYE5600735_LFQ.R" "benchmark_Metamorpheus.R")
    Suggestion importance[1-10]: 8


    Why: The comma in the R_SCRIPTS array breaks Bash parsing and will lead to incorrect script names, so removing it fixes script execution.

    Medium
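    The effect of the stray comma is easy to reproduce in a small Bash snippet (script names here are placeholders, not the actual benchmark files):

    ```shell
    # A comma is not a separator in Bash arrays; it becomes part of the element.
    bad=("a.R" "b.R", "c.R")
    good=("a.R" "b.R" "c.R")

    echo "bad[1]=${bad[1]}"    # the comma sticks to the element: "b.R,"
    echo "good[1]=${good[1]}"  # "b.R"

    # Iterating over the bad array would try to run a file named "b.R,",
    # which does not exist on disk.
    for s in "${bad[@]}"; do echo "would run: $s"; done
    ```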
    Use logical FALSE for normalization

    The normalization argument expects a logical, not a string. Change "FALSE" to FALSE
    to disable normalization correctly.

    benchmark/benchmark_Metamorpheus.R [35-51]

     data_process_tasks <- list(
       list(
         label = "Data process without Normalization",
    -    result = function() dataProcess(output, featureSubset = "topN", normalization = "FALSE", n_top_feature = 20)
    +    result = function() dataProcess(output, featureSubset = "topN", normalization = FALSE, n_top_feature = 20)
       ),
       list(
         label = "Data process without Normalization with MBImpute False",
    -    result = function() dataProcess(output, featureSubset = "topN", normalization = "FALSE", n_top_feature = 20, MBimpute = FALSE)
    +    result = function() dataProcess(output, featureSubset = "topN", normalization = FALSE, n_top_feature = 20, MBimpute = FALSE)
       ),
       …
     )
    Suggestion importance[1-10]: 7


    Why: Passing normalization as a string instead of a logical may cause unexpected behavior; using FALSE matches the dataProcess API.

    Medium
    General
    Prevent division by zero

    Guard against division by zero when no proteins pass filtering. Compute the
    denominator first and set FDR to NA or zero if it is zero.

    benchmark/metamorpheus_Process.R [30]

    -FDR = nrow(human) / (nrow(ecoli) + nrow(human))
    +denom <- nrow(ecoli) + nrow(human)
    +FDR <- if (denom > 0) nrow(human) / denom else NA
    Suggestion importance[1-10]: 7


    Why: Calculating FDR without checking if the denominator is zero can cause a divide-by-zero error or Inf, so validating it ensures safe computation.

    Medium
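    The guarded computation could also be pulled into a small helper; a base-R sketch (the function name is illustrative, not from the PR):

    ```r
    # Hypothetical helper: FDR as the fraction of significant human (false
    # positive) proteins among all significant proteins, returning NA when
    # nothing passes the filter instead of NaN from 0/0.
    safe_fdr <- function(n_human, n_ecoli) {
      denom <- n_human + n_ecoli
      if (denom > 0) n_human / denom else NA_real_
    }

    safe_fdr(1, 3)  # 0.25
    safe_fdr(0, 0)  # NA rather than NaN
    ```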
    Ensure at least one core

    If detectCores() returns 1, num_cores becomes zero and mclapply will error. Ensure
    at least one core by using max().

    benchmark/benchmark_Metamorpheus.R [64]

    -num_cores <- detectCores() - 1
    +num_cores <- max(detectCores() - 1, 1)
    Suggestion importance[1-10]: 6


    Why: If detectCores() returns 1 then mc.cores=0 causes mclapply to error; ensuring at least one core prevents this runtime failure.

    Low

    @anshuman-raina anshuman-raina changed the title WIP : Feature/metamorpheus scripts Added Metamorpheus Datasets - MBR+NoMBR wNormalization to benchmarking Jun 19, 2025

    input = input %>% filter(`Protein Group` %in% protein_mappings$`Protein Groups`)

    output = MetamorpheusToMSstatsFormat(input, annot)
    @tonywu1999 (Contributor) commented Jun 26, 2025

    The MetamorpheusToMSstatsFormat function also has these two parameters. By default, they're set to TRUE

    removeFewMeasurements, removeProtein_with1Feature
    

    Could you double check there aren't major differences in empirical FDR when these two parameters are set to FALSE? I'm thinking due to the absence of PIP, certain proteins may be filtered out altogether, which could explain better empirical FDR with no PIP.
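    One way to check this would be to rerun the conversion with both filters disabled and diff the protein sets; a sketch assuming the parameters named above (the `input`/`annot` objects are the ones already built in the PR's script and are not defined here):

    ```r
    # Sketch only: compare conversion output with and without the two
    # pre-filters (both default to TRUE).
    output_filtered   <- MetamorpheusToMSstatsFormat(input, annot)
    output_unfiltered <- MetamorpheusToMSstatsFormat(
      input, annot,
      removeFewMeasurements      = FALSE,
      removeProtein_with1Feature = FALSE
    )

    # Proteins dropped by the default filters; if many E. coli proteins show
    # up here, that could explain the better empirical FDR without PIP.
    setdiff(unique(output_unfiltered$ProteinName),
            unique(output_filtered$ProteinName))
    ```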

    @coderabbitai

    coderabbitai bot commented Aug 15, 2025

    Walkthrough

    Retargets CI to a different HPC host and secret, updates remote paths and Slurm job config and R environment, adds a MetaMorpheus benchmarking R harness and a metrics helper, and updates dataset entries in the benchmark controller JSON.

    Changes

    Cohort / File(s) Summary
    CI workflow retargeting to Explorer HPC
    .github/workflows/benchmark.yml
    Branch filter changed to feature/metamorpheus-scripts; use SSH_PRIVATE_KEY_EXPLORER and login.explorer.northeastern.edu; rsync entire benchmark dir to /projects/VitekLab/Projects/Benchmarking; update remote working dir, sbatch invocation, monitoring host, and output retrieval paths; artifact step retained.
    Slurm job config and R environment refresh
    benchmark/config.slurm
    Change chdir to /projects/VitekLab/Projects/Benchmarking/benchmark; memory 256G→128G; simplify module loads to R and cmake/3.30.2; set R 4.4 lib paths, create lib_fix and symlink, extend LD_LIBRARY_PATH; replace inline installs with an Rscript that sets .libPaths, installs nloptr from source, Bioc packages, and force-installs MSstats from GitHub; add benchmark_Metamorpheus.R to script list; run Rscript via stdbuf for line-buffered output.
    MetaMorpheus benchmarking harness
    benchmark/benchmark_Metamorpheus.R
    Add runBenchmarkForMetaMorpheusData(datasetPath, config): load config/files, filter multi-protein entries and decoys, map proteins by organism, convert to MSstats format, define six processing configs, execute them in parallel with mclapply, compute metrics per summarized result, aggregate results, and invoke for two MetaMorpheus datasets.
    Metrics computation helper
    benchmark/metamorpheus_Process.R
    Add calculate_Metrics(QuantData, protein_mappings, task_label, alpha = 0.05); build 4×5 contrast matrix, run groupComparison, flag E. coli proteins via protein_mappings, compute per-label true/false positives and FDR, and return aggregated per-task FDRs.
    Dataset controller updates
    benchmark/scriptController.json
    Update dataset file paths from /work/.../projects/...; add samples mapping for DDA-Dowell2021-HEqe408_LFQ; add datasets DDA-Solivais2024-Metamorpheus_NoMBR_LFQ and DDA-Solivais2024-Metamorpheus_MBR_LFQ with parent/data and Human/Ecoli sample patterns.
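    The 4×5 contrast matrix mentioned for calculate_Metrics can be sketched in base R (condition and comparison names here are illustrative placeholders, not the PR's actual labels):

    ```r
    conditions  <- paste0("C", 1:5)                            # 5 conditions
    comparisons <- paste0(conditions[-1], "-", conditions[1])  # "C2-C1" ... "C5-C1"

    # 4 pairwise contrasts against a reference condition C1.
    contrast.matrix <- matrix(0, nrow = 4, ncol = 5,
                              dimnames = list(comparisons, conditions))
    contrast.matrix[, 1] <- -1                    # reference condition
    for (i in 1:4) contrast.matrix[i, i + 1] <- 1 # condition under test

    contrast.matrix
    # Each row sums to 0, as groupComparison expects for pairwise contrasts.
    ```

    In the helper this matrix would then be passed as the contrast.matrix argument of MSstats groupComparison together with the summarized data.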

    Sequence Diagram(s)

    sequenceDiagram
      autonumber
      actor Dev as GitHub Actions
      participant SSH as Explorer SSH
      participant Slurm as Slurm Scheduler
      participant Node as Compute Node
    
      Dev->>SSH: Setup SSH (new key/host)
      Dev->>SSH: rsync benchmark/ -> /projects/.../benchmark
      Dev->>SSH: sbatch /projects/.../benchmark/config.slurm
      SSH->>Slurm: Submit job
      Slurm-->>Dev: Job ID
      Slurm->>Node: Launch job
      Node->>Node: Setup R env, install deps
      Node->>Node: Run R scripts (line-buffered)
      Node-->>SSH: Write job_output.txt / job_error.txt
      Dev->>SSH: Retrieve outputs from /projects/.../benchmark/
      Dev-->>Dev: Upload artifact
    
    sequenceDiagram
      autonumber
      participant R as benchmark_Metamorpheus.R
      participant CFG as scriptController.json
      participant Proc as MSstats/MSstatsConvert
      participant Met as calculate_Metrics
    
      R->>CFG: Load dataset config
      R->>R: Read QuantifiedPeaks/Proteins + annotation
      R->>R: Filter decoys/multiproteins and map organisms
      R->>Proc: MetamorpheusToMSstatsFormat()
      par Six processing configs
        R->>Proc: dataProcess(...) per config
      end
      loop For each summarized result
        R->>Met: calculate_Metrics(summarized, mappings, label)
      end
      R->>R: Aggregate FDR metrics and print timing
    

    Estimated code review effort

    🎯 4 (Complex) | ⏱️ ~40 minutes

    Poem

    I thump a paw on rsync lanes,
    New hosts and scripts, and slurm-file trains.
    Six tiny trials hop in parallel rows,
    E. coli peeks where the protein wind blows.
    R scripts hum — the benchmark burrow grows.


    @coderabbitai coderabbitai bot left a comment

    Actionable comments posted: 9

    ♻️ Duplicate comments (5)
    metamorpheus_code.R (3)

    21-26: Remove commented-out code blocks

    These stale comments add noise.

    Apply this diff:

    -# input_no_mbr$`Protein Group` = ifelse(
    -#   input_no_mbr$`Protein Group` %in% ecoli$`Protein Groups`, 
    -#   paste(input_no_mbr$`Protein Group`, "|ECOLI", sep = ""), 
    -#   paste(input_no_mbr$`Protein Group`, "|HUMAN", sep = ""))
    -# write.csv(input_no_mbr, "QuantifiedPeaks.csv", row.names = FALSE)

    and

    -# input$`Protein Group` = ifelse(
    -#   input$`Protein Group` %in% ecoli$`Protein Groups`,
    -#   paste(input$`Protein Group`, "|ECOLI", sep = ""),
    -#   paste(input$`Protein Group`, "|HUMAN", sep = ""))
    -# write.csv(input, "QuantifiedPeaks-MBR.csv", row.names = FALSE)

    Also applies to: 73-78


    1-112: Please remove this exploratory script from the PR (env-specific, redundant, and not used in the CI path)

    This file hard-codes absolute HPC paths, duplicates logic that now exists in the benchmarking harness, and includes ad-hoc visualization (hist) and commented code. It’s not invoked by the workflow, and a prior review already requested its removal.

    Recommendation:

    • Remove this file from the PR (or move it to a docs/examples location with parameterized paths and no hard-coded env specifics).
    • The CI and Slurm pipeline already use benchmark/benchmark_Metamorpheus.R + benchmark/metamorpheus_Process.R.

    49-56: Don’t hard-code a single comparison; reuse calculate_Metrics and compute robust FDRs

    You’re filtering only “B-A” and doing one-off FDR logic here. The PR already introduces calculate_Metrics for exactly this. Also guard against division by zero.

    Apply this diff:

    -e_group_no_mbr = model_no_mbr$ComparisonResult %>% filter(Label == "B-A") %>% filter(is.na(issue))
    -ecoli_no_mbr = e_group_no_mbr %>% filter(ecoli == TRUE)
    -hist(ecoli_no_mbr$log2FC)
    -
    -ecoli_no_mbr = e_group_no_mbr %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == TRUE)
    -human_no_mbr = e_group_no_mbr %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == FALSE)
    -FDR_no_mbr = nrow(human_no_mbr) / (nrow(ecoli_no_mbr) + nrow(human_no_mbr))
    +source("benchmark/metamorpheus_Process.R")
    +metrics_no_mbr <- calculate_Metrics(QuantData_no_mbr, protein_mappings, task_label = "NoMBR", alpha = 0.05)
    +print(metrics_no_mbr)
    benchmark/benchmark_Metamorpheus.R (2)

    31-35: Prefer using the Organism column in QuantifiedPeaks.tsv instead of joining via QuantifiedProteins.tsv

    Per earlier feedback, newer MetaMorpheus outputs include Organism directly in QuantifiedPeaks.tsv. Filtering in-place avoids an extra read and potential mismatches between “Protein Group(s)” columns across files.

    Apply this diff to inline the filter with a fallback to the old behavior if Organism is absent:

    -protein_mappings = data.table::fread(file.path(filePath, "QuantifiedProteins.tsv"))
    -protein_mappings = protein_mappings %>% filter(Organism %in% c("Escherichia coli (strain K12)", "Homo sapiens"))
    -
    -input = input %>% filter(`Protein Group` %in% protein_mappings$`Protein Groups`)
    +if ("Organism" %in% names(input)) {
    +  input <- input %>%
    +    dplyr::filter(Organism %in% c("Escherichia coli (strain K12)", "Homo sapiens"))
    +} else {
    +  protein_mappings <- data.table::fread(file.path(filePath, "QuantifiedProteins.tsv"))
    +  protein_mappings <- protein_mappings %>%
    +    dplyr::filter(Organism %in% c("Escherichia coli (strain K12)", "Homo sapiens"))
    +  input <- input %>%
    +    dplyr::filter(`Protein Group` %in% protein_mappings$`Protein Groups`)
    +}

    If organism strings vary across datasets (e.g., “Escherichia coli str. K-12”), consider a regex-based match or drive organism filters from config.


    36-36: Double-check MSstatsConvert pre-filtering options for FDR comparability

    MetamorpheusToMSstatsFormat likely forwards or reimplements MSstatsConvert behavior. Please verify the effect of:

    • removeFewMeasurements
    • removeProtein_with1Feature

    Defaults are typically TRUE and may change empirical FDR (especially without PIP). Consider exposing these via config or explicitly setting them here for the benchmarking runs to ensure apples-to-apples comparisons.

    🧹 Nitpick comments (11)
    benchmark/metamorpheus_Process.R (2)

    19-22: Harden organism matching and avoid reliance on attached packages

    Hard-coded exact string matching is brittle. Also, this function assumes dplyr is attached. Prefer explicit namespaces and robust matching.

    Apply this diff:

    -  ecoli_ids <- protein_mappings %>%
    -    filter(Organism == "Escherichia coli (strain K12)") %>%
    -    pull(`Protein Groups`)
    +  ecoli_ids <- protein_mappings |>
    +    dplyr::filter(grepl("Escherichia coli", Organism, fixed = TRUE)) |>
    +    dplyr::pull(`Protein Groups`)

    If you prefer to keep %>%, either import magrittr or switch the file to use |> consistently. Using namespaces (dplyr::) keeps this helper self-contained and safer when sourced.


    23-26: Namespace dplyr calls to reduce hidden dependencies

    This function uses dplyr verbs without importing. Namespace them to avoid runtime errors when the caller hasn't attached dplyr.

    Apply this diff:

    -  comp <- model$ComparisonResult %>%
    -    mutate(ecoli = Protein %in% ecoli_ids) %>%
    -    filter(is.na(issue))
    +  comp <- model$ComparisonResult |>
    +    dplyr::mutate(ecoli = Protein %in% ecoli_ids) |>
    +    dplyr::filter(is.na(issue))
    .github/workflows/benchmark.yml (4)

    24-26: Sanity-check SSH key handling

    Using echo for multi-line secrets often works, but printf is safer. Not mandatory, but reduces surprises with shell expansions.

    Apply this diff:

    -        echo "${{ secrets.SSH_PRIVATE_KEY_EXPLORER }}" > ~/.ssh/id_rsa
    +        printf '%s' "${{ secrets.SSH_PRIVATE_KEY_EXPLORER }}" > ~/.ssh/id_rsa

    34-37: Fix YAML trailing spaces flagged by linter

    There are trailing spaces on these lines.

    Apply this diff:

    -        ssh [email protected] "cd /projects/VitekLab/Projects/Benchmarking/benchmark && sbatch config.slurm" | tee slurm_job_id.txt 
    -        slurm_job_id=$(grep -oP '\d+' slurm_job_id.txt) 
    +        ssh [email protected] "cd /projects/VitekLab/Projects/Benchmarking/benchmark && sbatch config.slurm" | tee slurm_job_id.txt
    +        slurm_job_id=$(grep -oP '\d+' slurm_job_id.txt)
             echo "Slurm Job ID is $slurm_job_id"
    -        echo "slurm_job_id=$slurm_job_id" >> $GITHUB_ENV  
    +        echo "slurm_job_id=$slurm_job_id" >> $GITHUB_ENV

    17-17: Update to checkout v4

    Keep GA actions current.

    Apply this diff:

    -      uses: actions/checkout@v3
    +      uses: actions/checkout@v4

    31-48: Optional: export FDR threshold to Slurm job environment

    You define FDR_THRESHOLD but never pass it along. If the R scripts read it, export via sbatch.

    Apply this diff:

    -        ssh [email protected] "cd /projects/VitekLab/Projects/Benchmarking/benchmark && sbatch config.slurm" | tee slurm_job_id.txt
    +        ssh [email protected] "cd /projects/VitekLab/Projects/Benchmarking/benchmark && sbatch --export=ALL,FDR_THRESHOLD=${FDR_THRESHOLD} config.slurm" | tee slurm_job_id.txt
    benchmark/config.slurm (1)

    12-14: Confirm toolchain availability for source builds (nloptr) on the cluster

    You removed the explicit gcc module but are forcing a source build of nloptr, which requires a compiler (and NLopt via cmake). If GCC isn’t loaded by default on Explorer, this will fail.

    Options:

    • Load a compiler toolchain, e.g., module load gcc (and possibly gcc/12 or cluster default).
    • Or avoid forcing source with install.packages("nloptr", repos = "https://cloud.r-project.org") and let R decide.

    Would you like me to adjust the script for a fully pinned toolchain?

    benchmark/benchmark_Metamorpheus.R (4)

    28-30: Decoy filtering: use a dedicated Decoy flag when present

    If available, prefer the explicit Decoy/IsDecoy boolean column over string matching. It’s more robust across naming schemes.

    -input = input %>% filter(!str_detect(`Protein Group`, ";")) # remove multiple protein group in same cell
    -input = input %>% filter(!str_detect(`Protein Group`, "DECOY")) # remove decoys
    +input <- input %>% dplyr::filter(!str_detect(`Protein Group`, ";")) # remove multiple protein group in same cell
    +if ("Decoy" %in% names(input)) {
    +  input <- input %>% dplyr::filter(!Decoy)
    +} else if ("IsDecoy" %in% names(input)) {
    +  input <- input %>% dplyr::filter(!IsDecoy)
    +} else {
    +  input <- input %>% dplyr::filter(!str_detect(`Protein Group`, "DECOY"))
    +}

    67-72: Guard against zero/oversubscribed cores in mclapply

    detectCores() - 1 can be zero on small VMs, and oversubscription can hurt performance. Bound cores to [1, length(tasks)].

    -num_cores <- detectCores() - 1 
    +num_cores <- max(1L, min(length(data_process_tasks), detectCores(logical = TRUE) - 1L))
     
     summarized_results <- mclapply(data_process_tasks, function(task) {
       list(label = task$label, summarized = task$result())
    -}, mc.cores = num_cores)	
    +}, mc.cores = num_cores)

    74-77: Carry dataset name into the metrics output for traceability

    Downstream aggregation across datasets benefits from an explicit Dataset column.

    -results_list <- mclapply(summarized_results, function(res) {
    -  calculate_Metrics(res$summarized, protein_mappings, res$label)
    -}, mc.cores = num_cores)
    +results_list <- mclapply(summarized_results, function(res) {
    +  df <- calculate_Metrics(res$summarized, protein_mappings, res$label)
    +  df$Dataset <- dataset_config$name
    +  df
    +}, mc.cores = num_cores)

    89-90: Optional: Drive datasets from the controller instead of hard-coding

    To reduce maintenance and keep scripts DRY, consider deriving the targets from config (e.g., those with “Metamorpheus” in the key or a dedicated flag).

    Example replacement:

    -runBenchmarkForMetaMorpheusData("DDA-Solivais2024-Metamorpheus_MBR_LFQ", config)
    -runBenchmarkForMetaMorpheusData("DDA-Solivais2024-Metamorpheus_NoMBR_LFQ", config)
    +mm_keys <- names(config$datasets)
    +mm_keys <- mm_keys[grepl("Metamorpheus", mm_keys, ignore.case = TRUE)]
    +invisible(lapply(mm_keys, runBenchmarkForMetaMorpheusData, config = config))
    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    💡 Knowledge Base configuration:

    • MCP integration is disabled by default for public repositories
    • Jira integration is disabled by default for public repositories
    • Linear integration is disabled by default for public repositories

    You can enable these sources in your CodeRabbit configuration.

    📥 Commits

    Reviewing files that changed from the base of the PR and between b5d79e8 and ec9fe42.

    📒 Files selected for processing (6)
    • .github/workflows/benchmark.yml (3 hunks)
    • benchmark/benchmark_Metamorpheus.R (1 hunks)
    • benchmark/config.slurm (1 hunks)
    • benchmark/metamorpheus_Process.R (1 hunks)
    • benchmark/scriptController.json (2 hunks)
    • metamorpheus_code.R (1 hunks)
    🧰 Additional context used
    🪛 YAMLlint (1.37.1)
    .github/workflows/benchmark.yml

    [error] 34-34: trailing spaces

    (trailing-spaces)


    [error] 35-35: trailing spaces

    (trailing-spaces)


    [error] 37-37: trailing spaces

    (trailing-spaces)

    ⏰ Checks skipped due to 90000ms timeout (2)
    • GitHub Check: build
    • GitHub Check: Benchmarking-pipeline
    🔇 Additional comments (3)
    benchmark/scriptController.json (2)

    35-49: LGTM: new MetaMorpheus NoMBR dataset entry is coherent and consistent

    The parent + data convention and sample patterns align with the new benchmarking harness. Patterns are properly anchored.


    50-64: LGTM: new MetaMorpheus MBR dataset entry is consistent

    Matches NoMBR structure; sample patterns and parent/data paths look good.

    benchmark/config.slurm (1)

    43-50: Nice: line-buffered Rscript output for Slurm logs

    Line buffering via stdbuf improves observability; loop structure and logging are fine.

    Comment on lines +8 to +10
    source("metamorpheus_Process.R")
    config <- fromJSON("scriptController.json", simplifyVector = FALSE)


    💡 Verification agent

    🧩 Analysis chain

    Fix: MetamorpheusToMSstatsFormat likely undefined; also make sources/paths robust to working directory

    MetamorpheusToMSstatsFormat isn’t defined in this script and isn’t exported by MSstats/MSstatsConvert. It appears to live in metamorpheus_code.R, which isn’t sourced here. Also, both the source() and fromJSON() calls are relative to the current working directory and will break if the script isn’t launched from benchmark/.

    Apply this diff to source both helper files and load the controller JSON robustly:

    - source("metamorpheus_Process.R")
    -config <- fromJSON("scriptController.json", simplifyVector = FALSE)
    +script_args <- commandArgs(trailingOnly = FALSE)
    +script_path <- sub("^--file=", "", script_args[grep("^--file=", script_args)])
    +base_dir <- if (length(script_path)) dirname(normalizePath(script_path)) else "."
    +repo_root <- normalizePath(file.path(base_dir, ".."), mustWork = FALSE)
    +
    +source(file.path(base_dir, "metamorpheus_Process.R"))
    +# Ensure MetamorpheusToMSstatsFormat is available
    +if (file.exists(file.path(repo_root, "metamorpheus_code.R"))) {
    +  source(file.path(repo_root, "metamorpheus_code.R"))
    +}
    +config <- fromJSON(file.path(base_dir, "scriptController.json"), simplifyVector = FALSE)

    🏁 Script executed:

    #!/bin/bash
    set -euo pipefail
    
    echo "PWD: $(pwd)"
    echo
    echo "---- benchmark directory listing ----"
    ls -la benchmark || true
    echo
    echo "---- show benchmark/benchmark_Metamorpheus.R (first 200 lines) ----"
    sed -n '1,200p' benchmark/benchmark_Metamorpheus.R || true
    echo
    echo "---- rg: search for MetamorpheusToMSstatsFormat ----"
    rg -n --hidden --no-ignore -S "MetamorpheusToMSstatsFormat" || true
    echo
    echo "---- rg: search for metamorpheus_code.R references ----"
    rg -n --hidden --no-ignore -S "metamorpheus_code\\.R" || true
    echo
    echo "---- fd: find files named metamorpheus_code.R or metamorpheus_Process.R ----"
    fd -I -e R --hidden "metamorpheus_code.R" || true
    fd -I -e R --hidden "metamorpheus_Process.R" || true
    echo
    echo "---- rg: search for source(...) of metamorpheus_Process.R ----"
    rg -n --hidden --no-ignore -S "source\\(.*metamorpheus_Process\\.R" || true
    echo
    echo "---- rg: search for fromJSON\\(.*scriptController.json ----"
    rg -n --hidden --no-ignore -S "fromJSON\\(.*scriptController\\.json" || true
    echo
    echo "---- check for metamorpheus_code.R at repo root ----"
    if [ -f "metamorpheus_code.R" ]; then echo "FOUND: metamorpheus_code.R at repo root"; else echo "NOT FOUND: metamorpheus_code.R at repo root"; fi

    Length of output: 6452


    Fix required: source metamorpheus_code.R and make file paths robust to working directory

    MetamorpheusToMSstatsFormat is defined in metamorpheus_code.R at the repo root (rg matches). benchmark/benchmark_Metamorpheus.R currently only does source("metamorpheus_Process.R") and fromJSON("scriptController.json"), which will break if the script is run from a different working directory — source the helper that defines MetamorpheusToMSstatsFormat and resolve paths relative to the script.

    Files to update:

    • benchmark/benchmark_Metamorpheus.R — replace the relative source()/fromJSON() calls (lines ~8–10).
    • metamorpheus_code.R (repo root) — contains MetamorpheusToMSstatsFormat and must be sourced by the benchmark script.
    • benchmark/metamorpheus_Process.R — keep sourced, but source via computed script path.

    Apply this diff to benchmark/benchmark_Metamorpheus.R:

    - source("metamorpheus_Process.R")
    - config <- fromJSON("scriptController.json", simplifyVector = FALSE)
    + script_args <- commandArgs(trailingOnly = FALSE)
    + script_path <- sub("^--file=", "", script_args[grep("^--file=", script_args)])
    + base_dir <- if (length(script_path)) dirname(normalizePath(script_path)) else "."
    + repo_root <- normalizePath(file.path(base_dir, ".."), mustWork = FALSE)
    +
    + source(file.path(base_dir, "metamorpheus_Process.R"))
    + # Ensure MetamorpheusToMSstatsFormat is available
    + if (file.exists(file.path(repo_root, "metamorpheus_code.R"))) {
    +   source(file.path(repo_root, "metamorpheus_code.R"))
    + }
    + config <- fromJSON(file.path(base_dir, "scriptController.json"), simplifyVector = FALSE)
    🤖 Prompt for AI Agents
    In benchmark/benchmark_Metamorpheus.R around lines 8–10, the script currently
    sources metamorpheus_Process.R and reads scriptController.json using relative
    paths which will break if run from a different working directory; change the top
    of the script to compute the script directory, source the repo-root
    metamorpheus_code.R and the local metamorpheus_Process.R via absolute paths, and
    read scriptController.json via that computed path (e.g. determine script_dir
    with a robust pattern that falls back to getwd(), then call
    source(file.path(script_dir, "metamorpheus_Process.R")) and
    source(file.path(script_dir, "..", "metamorpheus_code.R")) and config <-
    fromJSON(file.path(script_dir, "scriptController.json"), simplifyVector =
    FALSE)).
    

    Comment on lines +44 to +62
    label = "Data process with Normalization and MBImpute False",
    result = function() dataProcess(output, featureSubset = "topN", n_top_feature = 20, MBimpute = FALSE)
    ),
    list(
    label = "Data process without Normalization",
    result = function() dataProcess(output, featureSubset = "topN", normalization = "FALSE", n_top_feature = 20)
    ),
    list(
    label = "Data process without Normalization with MBImpute False",
    result = function() dataProcess(output, featureSubset = "topN", normalization = "FALSE", n_top_feature = 20, MBimpute = FALSE)
    ),
    list(
    label = "Data process without Normalization and Imputation On for all features",
    result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBimpute = FALSE)
    ),
    list(
    label = "Data process without Normalization and Imputation On for top3 features",
    result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = FALSE)
    )
    ⚠️ Potential issue

    Correct MBimpute argument name and align “Imputation On” task behavior

    Two issues:

    • Typo: MBImpute (capital I) is not a valid dataProcess() argument; the correct name is MBimpute. R's argument matching is case-sensitive, so the misspelled name never binds to the real argument.
    • Tasks labeled “…Imputation On…” currently pass FALSE, which turns imputation off and contradicts their labels.

    Apply this diff to fix both problems:

         list(
           label = "Data process with Normalization and MBImpute False",
    -      result = function() dataProcess(output, featureSubset = "topN", n_top_feature = 20, MBimpute = FALSE)
    +      result = function() dataProcess(output, featureSubset = "topN", n_top_feature = 20, MBimpute = FALSE)
         ),
    @@
         list(
           label = "Data process without Normalization with MBImpute False",
    -      result = function() dataProcess(output, featureSubset = "topN", normalization = "FALSE", n_top_feature = 20, MBimpute = FALSE)
    +      result = function() dataProcess(output, featureSubset = "topN", normalization = "FALSE", n_top_feature = 20, MBimpute = FALSE)
         ),
         list(
           label = "Data process without Normalization and Imputation On for all features",
    -      result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBImpute = FALSE)
    +      result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBimpute = TRUE)
         ),
         list(
           label = "Data process without Normalization and Imputation On for top3 features",
    -      result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBImpute = FALSE)
    +      result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = TRUE)
         )

    Note: The first two blocks already use the correct argument spelling (MBimpute). The changes above fix the last two blocks to both correct the spelling and enable imputation.
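    The typo is easy to miss because R's argument matching is case-sensitive: if dataProcess() accepts a `...` argument, the misspelled name is silently swallowed and the default is used instead of raising an error. A minimal sketch with a hypothetical stand-in function `f` (not part of MSstats) shows the failure mode:

    ```r
    # Hypothetical stand-in for a function with an MBimpute argument plus `...`
    f <- function(MBimpute = TRUE, ...) MBimpute

    f(MBimpute = FALSE)  # binds the argument, returns FALSE
    f(MBImpute = FALSE)  # capital I never matches: lands in `...`, returns the default TRUE
    ```

    If the function had no `...`, the misspelling would at least fail fast with an "unused argument" error; with `...` it changes behavior silently, which is why the labels and the results can diverge without warning.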

    🤖 Prompt for AI Agents
    In benchmark/benchmark_Metamorpheus.R around lines 44 to 62, the dataProcess
    calls have a typo (MBImpute with a capital I) and the two tasks labeled
    "...Imputation On..." pass MBimpute = FALSE (disabling imputation) which
    contradicts the labels; fix by renaming the argument to MBimpute (lowercase i)
    where misspelled and set MBimpute = TRUE for the two "Imputation On" cases so
    their behavior matches their labels.
    

    Comment on lines +24 to +41
    Rscript -e '
    .libPaths("/home/raina.ans/R/x86_64-pc-linux-gnu-library/4.4")
    Sys.unsetenv("R_LIBS")
    cat(".libPaths():\n"); print(.libPaths())
    if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes", repos = "https://cloud.r-project.org")
    install.packages("nloptr", type = "source", repos = "https://cloud.r-project.org")
    for (pkg in c("dplyr", "stringr", "ggplot2")) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = "https://cloud.r-project.org")
    }
    }
    BiocManager::install(c("MSstatsConvert", "preprocessCore"), force = TRUE)
    remotes::install_github("Vitek-Lab/MSstats", ref = "devel", force = TRUE)
    '
    ⚠️ Potential issue

    This will fail if BiocManager is missing; install it before calling BiocManager::install

    BiocManager::install errors if BiocManager isn’t installed. Add a guard to install BiocManager first.

    Apply this diff:

     Rscript -e '
     .libPaths("/home/raina.ans/R/x86_64-pc-linux-gnu-library/4.4")
     Sys.unsetenv("R_LIBS")
     cat(".libPaths():\n"); print(.libPaths())
     
     if (!requireNamespace("remotes", quietly = TRUE))
         install.packages("remotes", repos = "https://cloud.r-project.org")
     
    -install.packages("nloptr", type = "source", repos = "https://cloud.r-project.org")
    +install.packages("nloptr", type = "source", repos = "https://cloud.r-project.org")
     
     for (pkg in c("dplyr", "stringr", "ggplot2")) {
         if (!requireNamespace(pkg, quietly = TRUE)) {
             install.packages(pkg, repos = "https://cloud.r-project.org")
         }
     }
    -BiocManager::install(c("MSstatsConvert", "preprocessCore"), force = TRUE)
    +if (!requireNamespace("BiocManager", quietly = TRUE)) {
    +    install.packages("BiocManager", repos = "https://cloud.r-project.org")
    +}
    +BiocManager::install(c("MSstatsConvert", "preprocessCore"), force = TRUE)
     remotes::install_github("Vitek-Lab/MSstats", ref = "devel", force = TRUE)
     '
    🤖 Prompt for AI Agents
    In benchmark/config.slurm around lines 24 to 41, the script calls
    BiocManager::install without ensuring the BiocManager package is present; add a
    guard to check for BiocManager with requireNamespace("BiocManager",
    quietly=TRUE) and, if missing, install it (install.packages("BiocManager",
    repos="https://cloud.r-project.org")) before calling BiocManager::install, then
    proceed with the existing BiocManager::install call.
    

    Comment on lines +10 to +12
    groups <- levels(QuantData$ProteinLevelData$GROUP)
    colnames(comparison) <- groups[order(as.numeric(groups))]

    🛠️ Refactor suggestion

    Avoid fragile column ordering: stop coercing group labels to numeric

    as.numeric(groups) will warn and is unreliable when GROUP levels are non-numeric (e.g., "A","B","C"...). You only need to align by names; just set the column names directly. Optionally validate the expected length.

    Apply this diff:

    -  groups <- levels(QuantData$ProteinLevelData$GROUP)
    -  colnames(comparison) <- groups[order(as.numeric(groups))]
    +  groups <- levels(QuantData$ProteinLevelData$GROUP)
    +  if (length(groups) != ncol(comparison)) {
    +    stop(sprintf("Contrast matrix expects %d groups but found %d in data: %s",
    +                 ncol(comparison), length(groups), paste(groups, collapse = ", ")))
    +  }
    +  colnames(comparison) <- groups
    🤖 Prompt for AI Agents
    In benchmark/metamorpheus_Process.R around lines 10 to 12, the code coerces
    group level labels to numeric via as.numeric(groups) which is fragile and warns
    for non-numeric levels; instead assign column names directly using the factor
    levels (or sorted/validated names if needed) and optionally verify the length
    matches ncol(comparison) before setting colnames to avoid misalignment. Ensure
    you remove the as.numeric() coercion and replace it with a direct assignment of
    groups (or a deterministic name ordering) and add a simple length check to throw
    a clear error if sizes differ.
    

    Comment on lines 27 to 46
    labels <- unique(comp$Label)
    result_rows <- lapply(labels, function(lbl) {
    df <- comp %>% filter(Label == lbl)
    sig <- df %>% filter(adj.pvalue < alpha)

    tp <- sig %>% filter(ecoli) %>% nrow()
    fp <- sig %>% filter(!ecoli) %>% nrow()
    tot <- tp + fp
    fdr <- if (tot > 0) fp / tot else NA_real_

    data.frame(
    Task = task_label,
    Comparison = lbl,
    FDR = fdr,
    stringsAsFactors = FALSE
    )
    })

    results <- do.call(rbind, result_rows)
    return(results)
    🛠️ Refactor suggestion

    Guard against empty results so the function always returns a data.frame

    If comp is empty (e.g., no comparisons available), do.call(rbind, list()) returns NULL rather than a data.frame, which breaks downstream code that expects tabular results. Return an empty data.frame with the expected columns instead to keep the pipeline resilient.

    Apply this diff:

    -  labels <- unique(comp$Label)
    -  result_rows <- lapply(labels, function(lbl) {
    +  labels <- unique(comp$Label)
    +  if (length(labels) == 0L) {
    +    return(data.frame(Task = character(), Comparison = character(), FDR = numeric(),
    +                      stringsAsFactors = FALSE))
    +  }
    +  result_rows <- lapply(labels, function(lbl) {
         df <- comp %>% filter(Label == lbl)
         sig <- df %>% filter(adj.pvalue < alpha)
     
         tp <- sig %>% filter(ecoli) %>% nrow()
         fp <- sig %>% filter(!ecoli) %>% nrow()
         tot <- tp + fp
         fdr <- if (tot > 0) fp / tot else NA_real_
     
         data.frame(
           Task       = task_label,
           Comparison = lbl,
           FDR        = fdr,
           stringsAsFactors = FALSE
         )
       })
     
       results <- do.call(rbind, result_rows)
       return(results)
    🤖 Prompt for AI Agents
    In benchmark/metamorpheus_Process.R around lines 27 to 46, the code calls
    do.call(rbind, result_rows) which will error when result_rows is empty; modify
    the logic to check if result_rows has length > 0 before binding and, if empty,
    return an empty data.frame with the same columns (Task, Comparison, FDR) and
    types (stringsAsFactors = FALSE) so the function returns a valid empty result
    instead of failing.
    

    Comment on lines 22 to +33
    "name": "DDA-Dowell2021-HEqe408_LFQ",
    -    "file": "/work/VitekLab/Data/MS/Benchmarking/DDA-Dowell2021-HEqe408_LFQ/FragPipe/TOP0/MSstats.csv",
    +    "file": "/projects/VitekLab/Data/MS/Benchmarking/DDA-Dowell2021-HEqe408_LFQ/FragPipe/TOP0/MSstats.csv",
    "samples": {
    "Human": {
    "pattern": "_HUMAN$",
    "type": "insignificant"
    },
    "Ecoli": {
    "pattern": "_ECOLI$",
    "type": "significant"
    }
    }
    💡 Verification agent

    🧩 Analysis chain

    Confirm consumer code reads new fields (parent and data)

    Older datasets use a file field; the MetaMorpheus entries introduce parent + data. Ensure the benchmarking script branches correctly when these fields are present.

    Run this to verify the harness reads parent and data and constructs file paths accordingly:


    🏁 Script executed:

    #!/bin/bash
    # Inspect benchmark scripts for usage of scriptController.json
    rg -n -C3 'scriptController\.json|parent|data"|samples' --type=r benchmark

    Length of output: 4202


    🏁 Script executed:

    #!/bin/bash
    set -euo pipefail
    
    echo "Searching benchmark R scripts for dataset_config usage and path construction..."
    rg -n -C3 'dataset_config|filePath|file.path\(|dataset_config\$file|dataset_config\$parent|dataset_config\$data' --type=r benchmark || true
    
    echo
    echo "----- Full contents of the three benchmark scripts referenced in previous output -----"
    for f in benchmark/benchmark_Metamorpheus.R benchmark/benchmark_Dowell2021-HEqe408_LFQ.R benchmark/benchmark_Puyvelde2022-HYE5600735_LFQ.R; do
      echo "=== $f ==="
      sed -n '1,240p' "$f" || true
      echo
    done

    Length of output: 14136


    Action: Make benchmark scripts accept both file and parent+data config shapes

    Verified: benchmark/benchmark_Metamorpheus.R already constructs paths from dataset_config$parent + dataset_config$data; benchmark_Dowell2021-HEqe408_LFQ.R and benchmark_Puyvelde2022-HYE5600735_LFQ.R still assume dataset_config$file only. This will break if scriptController.json uses the new parent+data shape for non‑Metamorpheus datasets.

    Files to update:

    • benchmark/benchmark_Metamorpheus.R — lines ~18–21: add a fallback to dataset_config$file when parent/data are absent.
    • benchmark/benchmark_Dowell2021-HEqe408_LFQ.R — lines ~14–21: resolve a data_file from either dataset_config$file or parent+data, then use it for fread and logging.
    • benchmark/benchmark_Puyvelde2022-HYE5600735_LFQ.R — lines ~14–21: same change as Dowell.

    Suggested minimal patches (apply the same pattern to both Dowell and Puyvelde):

    Before:

    cat("Processing Dataset:", dataset_config$name, "\n")
    cat("Dataset File Path:", dataset_config$file, "\n")
    ...
    fragpipe_raw <- data.table::fread(dataset_config$file)

    After:

    data_file <- if (!is.null(dataset_config$file)) {
      dataset_config$file
    } else if (!is.null(dataset_config$parent) && !is.null(dataset_config$data)) {
      file.path(dataset_config$parent, dataset_config$data)
    } else {
      stop("Dataset path not specified in config (expected 'file' or 'parent' + 'data')")
    }
    cat("Processing Dataset:", dataset_config$name, "\n")
    cat("Dataset File Path:", data_file, "\n")
    ...
    fragpipe_raw <- data.table::fread(data_file)

    Suggested change for Metamorpheus (keep existing parent+data behavior, add fallback to file):

    Before:

    filePath <- file.path(dataset_config$parent, dataset_config$data)
    annotPath <- dataset_config$parent

    After:

    if (!is.null(dataset_config$file)) {
      # If a file path is supplied, use its directory as the dataset directory
      filePath <- dirname(dataset_config$file)
      annotPath <- dirname(filePath)
    } else {
      filePath <- file.path(dataset_config$parent, dataset_config$data)
      annotPath <- dataset_config$parent
    }

    Reasoning: This makes each script robust to either config shape (old file or new parent+data) and fails fast with a clear error if neither is present.

    Please apply these small changes to the three scripts (or ensure scriptController.json keeps the original file field for datasets consumed by the FragPipe-based scripts).

    🤖 Prompt for AI Agents
    In benchmark/scriptController.json around lines 22–33 and the three R scripts:
    benchmark/benchmark_Metamorpheus.R (~lines 18–21),
    benchmark/benchmark_Dowell2021-HEqe408_LFQ.R (~lines 14–21), and
    benchmark/benchmark_Puyvelde2022-HYE5600735_LFQ.R (~lines 14–21), update each
    script to accept either dataset_config$file OR dataset_config$parent +
    dataset_config$data: for Dowell and Puyvelde, resolve a single data_file by
    using dataset_config$file if present, otherwise build it with
    file.path(dataset_config$parent, dataset_config$data), stop with a clear error
    if neither, then use data_file for logging and fread; for Metamorpheus, keep
    parent+data behavior but add a fallback where if dataset_config$file is
    provided, set filePath to dirname(dataset_config$file) and annotPath to its
    parent (dirname(filePath)); ensure all logging uses the resolved path and fail
    fast with a descriptive message when no valid path is provided.
    

    Comment on lines 40 to 42
    groups = levels(QuantData_no_mbr$ProteinLevelData$GROUP)
    colnames(comparison) <- groups[order(as.numeric(groups))]
    model_no_mbr <- groupComparison(contrast.matrix=comparison, data=QuantData_no_mbr,
    🛠️ Refactor suggestion

    Fix contrast column naming; current coercion to numeric is fragile and warns

    Identical issue as in the helper: as.numeric(groups) is unreliable for non-numeric labels.

    Apply this diff:

    -groups = levels(QuantData_no_mbr$ProteinLevelData$GROUP)
    -colnames(comparison) <- groups[order(as.numeric(groups))]
    +groups <- levels(QuantData_no_mbr$ProteinLevelData$GROUP)
    +colnames(comparison) <- groups
    🤖 Prompt for AI Agents
    In metamorpheus_code.R around lines 40 to 42, the contrast column naming uses
    order(as.numeric(groups)) which coerces non-numeric group labels to NA and emits
    warnings; replace that coercion with a safe ordering such as
    order(as.character(groups)) (or order(as.numeric(as.character(groups))) only if
    you are certain labels are numeric) and assign colnames(comparison) <-
    groups[order(as.character(groups))] so non-numeric labels are handled without
    warnings.
    

    Comment on lines 99 to 106
    e_group = model$ComparisonResult %>% filter(Label == "B-A") %>% filter(is.na(issue))
    ecoli = e_group %>% filter(ecoli == TRUE)
    hist(ecoli$log2FC)

    ecoli = e_group %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == TRUE)
    human = e_group %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == FALSE)
    FDR = nrow(human) / (nrow(ecoli) + nrow(human))

    🛠️ Refactor suggestion

    Avoid hard-coded “B-A” and duplicated FDR code; delegate to calculate_Metrics

    Mirror the No MBR fix here; remove ad-hoc histogram and use the shared helper.

    Apply this diff:

    -e_group = model$ComparisonResult %>% filter(Label == "B-A") %>% filter(is.na(issue))
    -ecoli = e_group %>% filter(ecoli == TRUE)
    -hist(ecoli$log2FC)
    -
    -ecoli = e_group %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == TRUE)
    -human = e_group %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == FALSE)
    -FDR = nrow(human) / (nrow(ecoli) + nrow(human))
    -
    -cat("FDR MBR", FDR, "\n")
    +metrics_mbr <- calculate_Metrics(QuantData, protein_mappings, task_label = "MBR", alpha = 0.05)
    +print(metrics_mbr)
    @coderabbitai coderabbitai bot left a comment

    Actionable comments posted: 1

    ♻️ Duplicate comments (3)
    benchmark/benchmark_Metamorpheus.R (3)

    31-37: Good use of Organism-based filtering (addresses prior feedback)

    Filtering QuantifiedProteins.tsv to organisms present in QuantifiedPeaks.tsv is correct and aligns with earlier feedback to use the Organism column.
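    For reference, a minimal sketch of that filtering pattern (the `Organism` column and file names are taken from the review context and assumed here, not re-verified against the actual MetaMorpheus outputs):

    ```r
    library(data.table)
    library(dplyr)

    peaks    <- data.table::fread(file.path(filePath, "QuantifiedPeaks.tsv"))
    proteins <- data.table::fread(file.path(filePath, "QuantifiedProteins.tsv"))

    # Keep only protein groups whose Organism also appears among the quantified peaks
    protein_mappings <- proteins %>%
      filter(Organism %in% unique(peaks$Organism))
    ```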


    8-10: Fix brittle paths and ensure MetamorpheusToMSstatsFormat is available

    Using source("...") and fromJSON("...") relative to the CWD will break when invoked outside benchmark/. Also, MetamorpheusToMSstatsFormat is not guaranteed to be in scope unless metamorpheus_code.R is sourced. Make paths robust and fail fast if the function is missing.

    Apply this diff:

    -source("metamorpheus_Process.R")
    -config <- fromJSON("scriptController.json", simplifyVector = FALSE)
    +## Resolve paths relative to this script; source helpers robustly
    +script_args <- commandArgs(trailingOnly = FALSE)
    +script_path <- sub("^--file=", "", script_args[grep("^--file=", script_args)])
    +base_dir <- if (length(script_path)) dirname(normalizePath(script_path)) else normalizePath(".")
    +repo_root <- normalizePath(file.path(base_dir, ".."), mustWork = FALSE)
    +
    +source(file.path(base_dir, "metamorpheus_Process.R"))
    +
    +# Ensure MetamorpheusToMSstatsFormat is available (defined in metamorpheus_code.R at repo root)
    +mm_code <- file.path(repo_root, "metamorpheus_code.R")
    +if (file.exists(mm_code)) {
    +  source(mm_code)
    +}
    +if (!exists("MetamorpheusToMSstatsFormat")) {
    +  stop("MetamorpheusToMSstatsFormat not found; please source metamorpheus_code.R or add it to the search path.")
    +}
    +
    +# Load controller JSON from the benchmark directory
    +config <- fromJSON(file.path(base_dir, "scriptController.json"), simplifyVector = FALSE)

    58-64: “Imputation On” tasks currently disable imputation; set MBimpute = TRUE

    Labels promise “Imputation On” but MBimpute is FALSE in both tasks, changing the experiment semantics and metrics.

         list(
           label = "Data process without Normalization and Imputation On for all features",
    -      result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBImpute = FALSE)
    +      result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBimpute = TRUE)
         ),
         list(
           label = "Data process without Normalization and Imputation On for top3 features",
    -      result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = FALSE)
    +      result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = TRUE)
         )
    🧹 Nitpick comments (7)
    benchmark/benchmark_Metamorpheus.R (7)

    21-23: Add file existence checks and read only necessary columns to reduce I/O

    Fail fast with clear messages if expected inputs are missing. Also, reading only “Protein Groups” and “Organism” trims memory and speeds up processing.

    -  input = data.table::fread(file.path(filePath, "QuantifiedPeaks.tsv"))
    -  annot = data.table::fread(file.path(annotPath, "annotation.csv"))
    +  peaks_path <- file.path(filePath, "QuantifiedPeaks.tsv")
    +  annot_path <- file.path(annotPath, "annotation.csv")
    +  if (!file.exists(peaks_path)) stop("Missing QuantifiedPeaks.tsv at: ", peaks_path)
    +  if (!file.exists(annot_path)) stop("Missing annotation.csv at: ", annot_path)
    +  input <- data.table::fread(peaks_path)
    +  annot <- data.table::fread(annot_path)
    @@
    -  protein_mappings = data.table::fread(file.path(filePath, "QuantifiedProteins.tsv"))
    +  proteins_path <- file.path(filePath, "QuantifiedProteins.tsv")
    +  if (!file.exists(proteins_path)) stop("Missing QuantifiedProteins.tsv at: ", proteins_path)
    +  protein_mappings <- data.table::fread(proteins_path, select = c("Protein Groups", "Organism"))

    Also applies to: 31-31


    28-30: Use fixed() in str_detect to avoid regex overhead and edge cases

    Treat ";" and "DECOY" as fixed strings instead of regex to speed up filtering and avoid unintended regex behavior.

    -  input = input %>% filter(!str_detect(`Protein Group`, ";")) # remove multiple protein group in same cell
    -  input = input %>% filter(!str_detect(`Protein Group`, "DECOY")) # remove decoys
    +  input <- input %>% filter(!str_detect(`Protein Group`, fixed(";")))     # remove multiple protein group in same cell
    +  input <- input %>% filter(!str_detect(`Protein Group`, fixed("DECOY"))) # remove decoys

    16-17: Be resilient if dataset_config$name is missing

    Some controller entries may not set name. Fall back to datasetPath for logging.

    -  cat("Processing Dataset:", dataset_config$name, "\n")
    +  ds_name <- if (!is.null(dataset_config$name)) dataset_config$name else datasetPath
    +  cat("Processing Dataset:", ds_name, "\n")

    40-47: Clarify task label to reflect actual settings

    First task uses default MBimpute; consider making that explicit in the label to avoid confusion when comparing results.

    -      label = "Data process with Normalized Data",
    +      label = "Data process with Normalization (default MBimpute)",

    38-38: Expose MSstatsConvert filters to config for FDR sensitivity analysis

    MetamorpheusToMSstatsFormat defaults removeFewMeasurements = TRUE and removeProtein_with1Feature = TRUE. To investigate empirical FDR sensitivity (as raised previously), parameterize these via scriptController.json and pass through here. I can draft the wiring if you want.
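A minimal sketch of what that wiring could look like (the `converterOptions` block and its key names are hypothetical additions to the controller schema, not part of the existing scriptController.json):

```json
{
  "DDA-Solivais2024-Metamorpheus_NoMBR_LFQ": {
    "converterOptions": {
      "removeFewMeasurements": false,
      "removeProtein_with1Feature": false
    }
  }
}
```

On the R side, the script could then forward whatever options are present while keeping the converter defaults otherwise, e.g. `opts <- dataset_config$converterOptions; output <- do.call(MetamorpheusToMSstatsFormat, c(list(input, annot), opts))`.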


    81-86: Persist results to disk for downstream aggregation

    Printing is transient. Consider writing CSV/TSV labeled by dataset/MBR mode to the dataset folder or a results directory for the workflow.

       final_results <- do.call(rbind, results_list)
       end_time <- Sys.time()
       total_time <- end_time - start_time
   print(final_results)
   print(paste("Total Execution Time:", total_time))
    +  # Optional: persist results
    +  out_file <- file.path(annotPath, paste0("benchmark_Metamorpheus_", gsub("\\s+", "_", dataset_config$name), ".csv"))
    +  data.table::fwrite(final_results, out_file)
    +  message("Wrote results to: ", out_file)

    91-92: Isolate dataset runs so one failure doesn’t stop the entire batch

    Wrap each invocation in tryCatch to continue processing the remaining dataset if one fails.

    -runBenchmarkForMetaMorpheusData("DDA-Solivais2024-Metamorpheus_MBR_LFQ", config)
    -runBenchmarkForMetaMorpheusData("DDA-Solivais2024-Metamorpheus_NoMBR_LFQ", config)
    +lapply(
    +  c("DDA-Solivais2024-Metamorpheus_MBR_LFQ", "DDA-Solivais2024-Metamorpheus_NoMBR_LFQ"),
    +  function(ds) {
    +    tryCatch(
    +      runBenchmarkForMetaMorpheusData(ds, config),
    +      error = function(e) {
    +        message("ERROR in ", ds, ": ", conditionMessage(e))
    +      }
    +    )
    +  }
    +)
    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    💡 Knowledge Base configuration:

    • MCP integration is disabled by default for public repositories
    • Jira integration is disabled by default for public repositories
    • Linear integration is disabled by default for public repositories

    You can enable these sources in your CodeRabbit configuration.

    📥 Commits

    Reviewing files that changed from the base of the PR and between ec9fe42 and c7e7b60.

    📒 Files selected for processing (2)
    • benchmark/benchmark_Metamorpheus.R (1 hunks)
    • benchmark/metamorpheus_Process.R (1 hunks)
    🚧 Files skipped from review as they are similar to previous changes (1)
    • benchmark/metamorpheus_Process.R
    ⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
    • GitHub Check: build
    • GitHub Check: Benchmarking-pipeline

    Comment on lines +69 to +79
    num_cores <- detectCores() - 1

    summarized_results <- mclapply(data_process_tasks, function(task) {
      list(label = task$label, summarized = task$result())
    }, mc.cores = num_cores)


    results_list <- mclapply(summarized_results, function(res) {
      calculate_Metrics(res$summarized, protein_mappings, res$label)
    }, mc.cores = num_cores)


    💡 Verification agent

    🧩 Analysis chain

    Guard against mc.cores == 0 and provide a Windows-safe fallback

    detectCores() - 1 can be zero; mclapply will error. Also, mclapply with mc.cores > 1 is unavailable on Windows. Add a safe core calculation and fall back to lapply when needed.

    -  num_cores <- detectCores() - 1 
    -
    -  summarized_results <- mclapply(data_process_tasks, function(task) {
    -    list(label = task$label, summarized = task$result())
    -  }, mc.cores = num_cores)	
    +  num_cores <- max(1L, as.integer(detectCores() - 1L))
    +  if (.Platform$OS.type != "windows" && num_cores > 1L) {
    +    summarized_results <- mclapply(
    +      data_process_tasks,
    +      function(task) list(label = task$label, summarized = task$result()),
    +      mc.cores = num_cores
    +    )
    +  } else {
    +    summarized_results <- lapply(
    +      data_process_tasks,
    +      function(task) list(label = task$label, summarized = task$result())
    +    )
    +  }
    @@
    -  results_list <- mclapply(summarized_results, function(res) {
    -    calculate_Metrics(res$summarized, protein_mappings, res$label)
    -  }, mc.cores = num_cores)
    +  if (.Platform$OS.type != "windows" && num_cores > 1L) {
    +    results_list <- mclapply(
    +      summarized_results,
    +      function(res) calculate_Metrics(res$summarized, protein_mappings, res$label),
    +      mc.cores = num_cores
    +    )
    +  } else {
    +    results_list <- lapply(
    +      summarized_results,
    +      function(res) calculate_Metrics(res$summarized, protein_mappings, res$label)
    +    )
    +  }

    Optional verification: please run a small dataset on a 1-core environment to confirm no regression when num_cores == 1.


    🏁 Script executed:

    #!/bin/bash
    set -euo pipefail
    echo "Checking for benchmark/benchmark_Metamorpheus.R..."
    if [ -f benchmark/benchmark_Metamorpheus.R ]; then
      echo "== File: benchmark/benchmark_Metamorpheus.R (first 220 lines) =="
      sed -n '1,220p' benchmark/benchmark_Metamorpheus.R
    else
      echo "benchmark/benchmark_Metamorpheus.R not found"
    fi
    
    echo
    echo "== Searching repo for parallel-related usages =="
    rg -n --hidden -S 'detectCores|mclapply|library\(parallel\)|require\(parallel\)|\.Platform' || true
    
    echo
    echo "== Showing matches in benchmark/ directory with context =="
    rg -n --hidden -S --context 3 'detectCores|mclapply|\.Platform' benchmark || true

    Length of output: 8050


    Guard against mc.cores == 0 and add a Windows-safe fallback (applies to multiple benchmark scripts)

    detectCores() - 1 can be 0 (causing mclapply to error) and mclapply’s multicore mode is not available on Windows — fall back to lapply when multicore isn’t usable.

    Files to update:

    • benchmark/benchmark_Metamorpheus.R (lines ~69–79)
    • benchmark/benchmark_Puyvelde2022-HYE5600735_LFQ.R (lines ~47–56)
    • benchmark/benchmark_Dowell2021-HEqe408_LFQ.R (lines ~48–56)

    Suggested change (apply the same pattern to each file):

    -  num_cores <- detectCores() - 1 
    -
    -  summarized_results <- mclapply(data_process_tasks, function(task) {
    -    list(label = task$label, summarized = task$result())
    -  }, mc.cores = num_cores)	
    +  num_cores <- max(1L, as.integer(detectCores() - 1L))
    +  if (.Platform$OS.type != "windows" && num_cores > 1L) {
    +    summarized_results <- mclapply(
    +      data_process_tasks,
    +      function(task) list(label = task$label, summarized = task$result()),
    +      mc.cores = num_cores
    +    )
    +  } else {
    +    summarized_results <- lapply(
    +      data_process_tasks,
    +      function(task) list(label = task$label, summarized = task$result())
    +    )
    +  }
    @@
    -  results_list <- mclapply(summarized_results, function(res) {
    -    calculate_Metrics(res$summarized, protein_mappings, res$label)
    -  }, mc.cores = num_cores)
    +  if (.Platform$OS.type != "windows" && num_cores > 1L) {
    +    results_list <- mclapply(
    +      summarized_results,
    +      function(res) calculate_Metrics(res$summarized, protein_mappings, res$label),
    +      mc.cores = num_cores
    +    )
    +  } else {
    +    results_list <- lapply(
    +      summarized_results,
    +      function(res) calculate_Metrics(res$summarized, protein_mappings, res$label)
    +    )
    +  }

    Optional: verify on a 1-core / Windows-like environment to ensure no regressions.

    📝 Committable suggestion

    ‼️ IMPORTANT
    Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    Suggested change
    -num_cores <- detectCores() - 1
    -
    -summarized_results <- mclapply(data_process_tasks, function(task) {
    -  list(label = task$label, summarized = task$result())
    -}, mc.cores = num_cores)
    -
    -results_list <- mclapply(summarized_results, function(res) {
    -  calculate_Metrics(res$summarized, protein_mappings, res$label)
    -}, mc.cores = num_cores)
    +num_cores <- max(1L, as.integer(detectCores() - 1L))
    +if (.Platform$OS.type != "windows" && num_cores > 1L) {
    +  summarized_results <- mclapply(
    +    data_process_tasks,
    +    function(task) list(label = task$label, summarized = task$result()),
    +    mc.cores = num_cores
    +  )
    +} else {
    +  summarized_results <- lapply(
    +    data_process_tasks,
    +    function(task) list(label = task$label, summarized = task$result())
    +  )
    +}
    +if (.Platform$OS.type != "windows" && num_cores > 1L) {
    +  results_list <- mclapply(
    +    summarized_results,
    +    function(res) calculate_Metrics(res$summarized, protein_mappings, res$label),
    +    mc.cores = num_cores
    +  )
    +} else {
    +  results_list <- lapply(
    +    summarized_results,
    +    function(res) calculate_Metrics(res$summarized, protein_mappings, res$label)
    +  )
    +}
    🤖 Prompt for AI Agents
    In benchmark/benchmark_Metamorpheus.R around lines 69–79, detectCores() - 1 may
    be 0 and mclapply is not available on Windows; change to compute num_cores <-
    max(detectCores() - 1L, 1L) and then call either lapply when running on Windows
    or when num_cores == 1, otherwise call mclapply with mc.cores = num_cores; apply
    the same pattern to the other two files (benchmark_Puyvelde2022-HYE5600735_LFQ.R
    lines ~47–56 and benchmark_Dowell2021-HEqe408_LFQ.R lines ~48–56) so multicore
    is used only when supported and falls back to lapply safely.
    

    @coderabbitai coderabbitai bot left a comment


    Actionable comments posted: 0

    ♻️ Duplicate comments (9)
    metamorpheus_code.R (4)

    1-1: Consider removing this standalone script from the PR

    This file appears to be a standalone analysis script that duplicates functionality from the benchmarking framework. Since the benchmarking harness (benchmark/benchmark_Metamorpheus.R) already implements the same workflow in a more structured way, keeping this file may cause confusion and maintenance burden.


    99-108: Use the shared calculate_Metrics helper for consistency

    The FDR calculation is duplicated from the no-MBR section. Using the shared helper would improve maintainability.

    Apply this diff:

    -e_group = model$ComparisonResult %>% filter(Label == "B-A") %>% filter(is.na(issue))
    -ecoli = e_group %>% filter(ecoli == TRUE)
    -hist(ecoli$log2FC)
    -
    -ecoli = e_group %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == TRUE)
    -human = e_group %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == FALSE)
    -FDR = nrow(human) / (nrow(ecoli) + nrow(human))
    -
    -
    -cat("FDR MBR", FDR, "\n")
    +metrics_mbr <- calculate_Metrics(QuantData, protein_mappings, task_label = "MBR", alpha = 0.05)
    +print(metrics_mbr)

    40-42: Fix fragile group ordering for contrast column names

    Using as.numeric(groups) on factor levels can produce NA values and warnings for non-numeric group labels. This approach is unreliable and should be replaced with a more robust ordering mechanism.

    Apply this diff to fix the issue:

    -groups = levels(QuantData_no_mbr$ProteinLevelData$GROUP)
    -colnames(comparison) <- groups[order(as.numeric(groups))]
    +groups <- levels(QuantData_no_mbr$ProteinLevelData$GROUP)
    +colnames(comparison) <- groups

    90-92: Fix fragile group ordering for MBR contrast column names

    Same issue as in the no-MBR section - using as.numeric(groups) on factor levels is unreliable.

    Apply this diff:

    -groups = levels(QuantData$ProteinLevelData$GROUP)
    -colnames(comparison) <- groups[order(as.numeric(groups))]
    +groups <- levels(QuantData$ProteinLevelData$GROUP)
    +colnames(comparison) <- groups
    benchmark/benchmark_Metamorpheus.R (5)

    31-33: Use the Organism column from QuantifiedPeaks.tsv if available

    It's been noted that Metamorpheus now includes an "Organism" column in the QuantifiedPeaks.tsv file. Using this built-in column would be more robust than relying on a separate proteins file.

    Consider checking if the Organism column exists in the input data and using it directly:

    -protein_mappings = data.table::fread(file.path(filePath, "QuantifiedProteins.tsv"))
    -protein_mappings = protein_mappings %>% filter(Organism %in% c("Escherichia coli (strain K12)", "Homo sapiens"))
    +if ("Organism" %in% colnames(input)) {
    +  # Use built-in Organism column
    +  input = input %>% filter(Organism %in% c("Escherichia coli (strain K12)", "Homo sapiens"))
    +  protein_mappings = input %>% distinct(`Protein Group`, Organism)
    +} else {
    +  # Fall back to QuantifiedProteins.tsv
    +  protein_mappings = data.table::fread(file.path(filePath, "QuantifiedProteins.tsv"))
    +  protein_mappings = protein_mappings %>% filter(Organism %in% c("Escherichia coli (strain K12)", "Homo sapiens"))
    +}

    36-36: Verify impact of removeFewMeasurements and removeProtein_with1Feature parameters

    The MetamorpheusToMSstatsFormat function has two parameters (removeFewMeasurements, removeProtein_with1Feature) that are TRUE by default. These could filter out proteins and potentially explain FDR differences between MBR and no-MBR workflows.

    Please verify if setting these parameters to FALSE affects the empirical FDR results:

    #!/bin/bash
    # Check if MetamorpheusToMSstatsFormat is defined and its default parameters
    rg -n -A 10 "MetamorpheusToMSstatsFormat\s*<-\s*function" --type r

    8-10: Make file paths robust to working directory changes

    The script uses relative paths for sourcing files and loading JSON, which will fail if run from a different working directory. Additionally, MetamorpheusToMSstatsFormat appears to be defined in metamorpheus_code.R at the repo root but isn't sourced here.

    Apply this diff to make paths robust:

    -source("metamorpheus_Process.R")
    -config <- fromJSON("scriptController.json", simplifyVector = FALSE)
    +script_args <- commandArgs(trailingOnly = FALSE)
    +script_path <- sub("^--file=", "", script_args[grep("^--file=", script_args)])
    +base_dir <- if (length(script_path)) dirname(normalizePath(script_path)) else "."
    +repo_root <- normalizePath(file.path(base_dir, ".."), mustWork = FALSE)
    +
    +source(file.path(base_dir, "metamorpheus_Process.R"))
    +# Source MetamorpheusToMSstatsFormat definition
    +if (file.exists(file.path(repo_root, "metamorpheus_code.R"))) {
    +  source(file.path(repo_root, "metamorpheus_code.R"))
    +}
    +config <- fromJSON(file.path(base_dir, "scriptController.json"), simplifyVector = FALSE)

    56-57: Fix inconsistent imputation settings in task labels

    Tasks labeled "Imputation On" are actually passing MBimpute = FALSE, which disables imputation. This contradicts the task labels and will produce misleading benchmark results.

    Apply this diff to enable imputation for the "Imputation On" tasks:

         list(
           label = "Data process without Normalization and Imputation On for all features",
    -      result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBimpute = FALSE)
    +      result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBimpute = TRUE)
         ),
         list(
           label = "Data process without Normalization and Imputation On for top3 features",
    -      result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = FALSE)
    +      result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = TRUE)
         )

    Also applies to: 60-61


    67-76: Add safeguards for parallel execution on limited-core or Windows systems

    The code can fail in two scenarios: (1) when detectCores() - 1 returns 0 on single-core systems, and (2) on Windows where mclapply with mc.cores > 1 isn't available.

    Apply this diff to add proper safeguards:

    -  num_cores <- detectCores() - 1 
    -
    -  summarized_results <- mclapply(data_process_tasks, function(task) {
    -    list(label = task$label, summarized = task$result())
    -  }, mc.cores = num_cores)	
    -
    -
    -  results_list <- mclapply(summarized_results, function(res) {
    -    calculate_Metrics(res$summarized, protein_mappings, res$label)
    -  }, mc.cores = num_cores)
    +  num_cores <- max(1L, as.integer(detectCores() - 1L))
    +  
    +  if (.Platform$OS.type != "windows" && num_cores > 1L) {
    +    summarized_results <- mclapply(
    +      data_process_tasks,
    +      function(task) list(label = task$label, summarized = task$result()),
    +      mc.cores = num_cores
    +    )
    +    
    +    results_list <- mclapply(
    +      summarized_results,
    +      function(res) calculate_Metrics(res$summarized, protein_mappings, res$label),
    +      mc.cores = num_cores
    +    )
    +  } else {
    +    summarized_results <- lapply(
    +      data_process_tasks,
    +      function(task) list(label = task$label, summarized = task$result())
    +    )
    +    
    +    results_list <- lapply(
    +      summarized_results,
    +      function(res) calculate_Metrics(res$summarized, protein_mappings, res$label)
    +    )
    +  }
    🧹 Nitpick comments (1)
    metamorpheus_code.R (1)

    49-57: Consider using the shared calculate_Metrics helper function

    The FDR calculation logic is duplicated here and in the MBR section. Since benchmark/metamorpheus_Process.R provides a calculate_Metrics function for this exact purpose, consider using it for consistency and maintainability.

    Apply this diff to use the shared helper:

    -e_group_no_mbr = model_no_mbr$ComparisonResult %>% filter(Label == "B-A") %>% filter(is.na(issue))
    -ecoli_no_mbr = e_group_no_mbr %>% filter(ecoli == TRUE)
    -hist(ecoli_no_mbr$log2FC)
    -
    -ecoli_no_mbr = e_group_no_mbr %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == TRUE)
    -human_no_mbr = e_group_no_mbr %>% filter(adj.pvalue < 0.05) %>% filter(ecoli == FALSE)
    -FDR_no_mbr = nrow(human_no_mbr) / (nrow(ecoli_no_mbr) + nrow(human_no_mbr))
    -
    -cat("FDR no MBR", FDR_no_mbr, "\n")
    +# Source the helper if not already done
    +source("benchmark/metamorpheus_Process.R")
    +metrics_no_mbr <- calculate_Metrics(QuantData_no_mbr, protein_mappings, task_label = "No MBR", alpha = 0.05)
    +print(metrics_no_mbr)
    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro


    📥 Commits

    Reviewing files that changed from the base of the PR and between d8bcb03 and ba65d2d.

    📒 Files selected for processing (3)
    • benchmark/benchmark_Metamorpheus.R (1 hunks)
    • benchmark/metamorpheus_Process.R (1 hunks)
    • metamorpheus_code.R (1 hunks)
    🚧 Files skipped from review as they are similar to previous changes (1)
    • benchmark/metamorpheus_Process.R

    @coderabbitai coderabbitai bot left a comment


    Actionable comments posted: 0

    ♻️ Duplicate comments (4)
    benchmark/benchmark_Metamorpheus.R (4)

    40-41: Verify impact of removeFewMeasurements and removeProtein_with1Feature in MetamorpheusToMSstatsFormat

    Prior feedback asked to double-check empirical FDR when these are set to FALSE. Current call relies on defaults; consider running a scenario with both flags FALSE for comparison.

    Example variant to try (not necessarily for commit):

    • output_alt <- MetamorpheusToMSstatsFormat(input, annot, removeFewMeasurements = FALSE, removeProtein_with1Feature = FALSE)
    • Run the same tasks on output_alt and compare FDR metrics against output.

    71-80: Guard against mc.cores == 0 and Windows fallback for mclapply

    detectCores() - 1 can be 0, and multicore mclapply isn’t available on Windows. Fall back to lapply when needed.

    -  num_cores <- detectCores() - 1 
    -
    -  summarized_results <- mclapply(data_process_tasks, function(task) {
    -    list(label = task$label, summarized = task$result())
    -  }, mc.cores = num_cores)	
    +  num_cores <- max(1L, as.integer(detectCores() - 1L))
    +  if (.Platform$OS.type != "windows" && num_cores > 1L) {
    +    summarized_results <- mclapply(
    +      data_process_tasks,
    +      function(task) list(label = task$label, summarized = task$result()),
    +      mc.cores = num_cores
    +    )
    +  } else {
    +    summarized_results <- lapply(
    +      data_process_tasks,
    +      function(task) list(label = task$label, summarized = task$result())
    +    )
    +  }
    @@
    -  results_list <- mclapply(summarized_results, function(res) {
    -    calculate_Metrics(res$summarized, protein_mappings, res$label)
    -  }, mc.cores = num_cores)
    +  if (.Platform$OS.type != "windows" && num_cores > 1L) {
    +    results_list <- mclapply(
    +      summarized_results,
    +      function(res) calculate_Metrics(res$summarized, protein_mappings, res$label),
    +      mc.cores = num_cores
    +    )
    +  } else {
    +    results_list <- lapply(
    +      summarized_results,
    +      function(res) calculate_Metrics(res$summarized, protein_mappings, res$label)
    +    )
    +  }

    8-10: Source metamorpheus_code.R and resolve paths relative to the script (prevents undefined function error and WD fragility)

    MetamorpheusToMSstatsFormat is used later (Line 40) but this file doesn’t source metamorpheus_code.R where it’s defined. Also, the relative source()/fromJSON() calls will break if the working directory isn’t benchmark/. Make paths relative to this script and source metamorpheus_code.R from repo root.

    Apply this diff:

    -source("metamorpheus_Process.R")
    -config <- fromJSON("scriptController.json", simplifyVector = FALSE)
    +script_args <- commandArgs(trailingOnly = FALSE)
    +script_path <- sub("^--file=", "", script_args[grep("^--file=", script_args)])
    +base_dir <- if (length(script_path)) dirname(normalizePath(script_path)) else getwd()
    +repo_root <- normalizePath(file.path(base_dir, ".."), mustWork = FALSE)
    +
    +source(file.path(base_dir, "metamorpheus_Process.R"))
    +# Ensure MetamorpheusToMSstatsFormat is available
    +mm_code <- file.path(repo_root, "metamorpheus_code.R")
    +if (file.exists(mm_code)) {
    +  source(mm_code)
    +} else {
    +  warning("metamorpheus_code.R not found at repo root; ensure MetamorpheusToMSstatsFormat is available on the search path")
    +}
    +config <- fromJSON(file.path(base_dir, "scriptController.json"), simplifyVector = FALSE)

    60-66: Align “Imputation On” task behavior with label (set MBimpute = TRUE)

    Both “Imputation On” tasks currently disable imputation.

       list(
         label = "Data process without Normalization and Imputation On for all features",
    -      result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBimpute = FALSE)
    +      result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBimpute = TRUE)
       ),
       list(
         label = "Data process without Normalization and Imputation On for top3 features",
    -      result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = FALSE)
    +      result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = TRUE)
       )
    🧹 Nitpick comments (5)
    benchmark/benchmark_Metamorpheus.R (5)

    25-27: Clarify the annotation path printout

    You print “Annotation File Path” but output the directory path, not the file. Print the file path to avoid confusion.

    -  cat("Annotation File Path:", annotPath, "\n")
    +  cat("Annotation File:", file.path(annotPath, "annotation.csv"), "\n")

    33-37: Guard against NA organisms when filtering protein mappings

    Avoid carrying NA into the filter set.

    -  valid_organisms <- unique(input$Organism)
    +  valid_organisms <- unique(stats::na.omit(input$Organism))

    31-39: Harden mapping: confirm ‘Protein Groups’ column name and avoid hard-coded name

    QuantifiedProteins.tsv may expose either “Protein Groups” or “Protein Group”. Use a guard and avoid column name mismatch.

    -  input = input %>% filter(`Protein Group` %in% protein_mappings$`Protein Groups`)
    +  protein_groups_col <- if ("Protein Groups" %in% names(protein_mappings)) "Protein Groups" else "Protein Group"
    +  input = input %>% filter(`Protein Group` %in% protein_mappings[[protein_groups_col]])

    If you can, please also verify the column names contained in QuantifiedProteins.tsv for your datasets to confirm this guard is sufficient.
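For a quick manual check of those header names, the first row can be printed one column per line; sketched here against a synthetic stand-in file (the real input would be the dataset's QuantifiedProteins.tsv, and the columns shown are illustrative):

```shell
# Create a stand-in header row; substitute the real QuantifiedProteins.tsv path
printf 'Protein Groups\tOrganism\tNumber of Peptides\n' > /tmp/qp_demo.tsv

# Print each header column on its own line so the exact spelling is visible
head -n 1 /tmp/qp_demo.tsv | tr '\t' '\n'
```

This makes it obvious whether the file uses "Protein Groups" (plural) or "Protein Group" before hard-coding either name.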


    83-88: Optionally persist metrics to disk per dataset

    Printing is fine, but persisting results helps downstream CI/HPC collection.

    For example:

    • out_file <- file.path(filePath, paste0("benchmark_results_", gsub("\\s+", "_", dataset_config$name), ".tsv"))
    • data.table::fwrite(final_results, out_file, sep = "\t")

    93-94: Protect per-dataset runs so one failure doesn’t abort the other

    Wrap invocations in tryCatch to continue on error.

    -runBenchmarkForMetaMorpheusData("DDA-Solivais2024-Metamorpheus_MBR_LFQ", config)
    -runBenchmarkForMetaMorpheusData("DDA-Solivais2024-Metamorpheus_NoMBR_LFQ", config)
    +for (ds in c("DDA-Solivais2024-Metamorpheus_MBR_LFQ", "DDA-Solivais2024-Metamorpheus_NoMBR_LFQ")) {
    +  tryCatch(
    +    runBenchmarkForMetaMorpheusData(ds, config),
    +    error = function(e) message(sprintf("Dataset %s failed: %s", ds, conditionMessage(e)))
    +  )
    +}
    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro


    📥 Commits

    Reviewing files that changed from the base of the PR and between 9fc6a86 and 8518c44.

    📒 Files selected for processing (1)
    • benchmark/benchmark_Metamorpheus.R (1 hunks)
    🔇 Additional comments (1)
    benchmark/benchmark_Metamorpheus.R (1)

    33-34: Good: leveraging Organism column from QuantifiedPeaks.tsv

    Using input$Organism to constrain protein mappings addresses earlier feedback and improves specificity.

    @coderabbitai coderabbitai bot left a comment


    Actionable comments posted: 0

    ♻️ Duplicate comments (4)
    benchmark/benchmark_Metamorpheus.R (4)

    39-39: Double-check impact of removeFewMeasurements/removeProtein_with1Feature on empirical FDR

    MetamorpheusToMSstatsFormat defaults to removing sparse proteins/features (both TRUE). Given no-PIP scenarios, toggling these to FALSE may materially change recall/FDR. Please validate and document the chosen setting.

    I can wire these flags through from the controller JSON (and produce side-by-side metrics) if you want.


    8-10: Source helpers and config using robust paths; ensure MetamorpheusToMSstatsFormat is loaded

    Current relative paths will break when not launched from benchmark/, and MetamorpheusToMSstatsFormat isn’t defined in this file. Source files via the script’s directory and include metamorpheus_code.R from repo root.

    Apply this diff:

    - source("metamorpheus_Process.R")
    - config <- fromJSON("scriptController.json", simplifyVector = FALSE)
    + script_args <- commandArgs(trailingOnly = FALSE)
    + script_path <- sub("^--file=", "", script_args[grep("^--file=", script_args)])
    + base_dir <- if (length(script_path)) dirname(normalizePath(script_path)) else getwd()
    + repo_root <- normalizePath(file.path(base_dir, ".."), mustWork = FALSE)
    +
    + source(file.path(base_dir, "metamorpheus_Process.R"))
    + # Ensure MetamorpheusToMSstatsFormat is available
    + if (file.exists(file.path(repo_root, "metamorpheus_code.R"))) {
    +   source(file.path(repo_root, "metamorpheus_code.R"))
    + }
    + config <- fromJSON(file.path(base_dir, "scriptController.json"), simplifyVector = FALSE)

    Optional verification to confirm the function is available:

    #!/bin/bash
    set -euo pipefail
    # Check the definition of MetamorpheusToMSstatsFormat
    rg -nP '^\s*MetamorpheusToMSstatsFormat\s*<-' metamorpheus_code.R || rg -n 'MetamorpheusToMSstatsFormat' --hidden --no-ignore

    59-65: Fix “Imputation On” tasks: wrong argument name and value

    Both tasks labeled “Imputation On” pass MBImpute = FALSE and use a misspelled argument (MBImpute). Use MBimpute = TRUE so behavior matches labels.

    -      label = "Data process without Normalization and Imputation On for all features",
    -      result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBImpute = FALSE)
    +      label = "Data process without Normalization and Imputation On for all features",
    +      result = function() dataProcess(output, featureSubset = "all", normalization = "FALSE", MBimpute = TRUE)
    @@
    -      label = "Data process without Normalization and Imputation On for top3 features",
    -      result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBImpute = FALSE)
    +      label = "Data process without Normalization and Imputation On for top3 features",
    +      result = function() dataProcess(output, featureSubset = "top3", normalization = "FALSE", MBimpute = TRUE)

    70-79: Make parallel section Windows-safe and guard against 0 cores

    detectCores() - 1 can be 0, and mclapply multicore isn’t available on Windows. Fallback to lapply when needed.

    -  num_cores <- detectCores() - 1 
    -
    -  summarized_results <- mclapply(data_process_tasks, function(task) {
    -    list(label = task$label, summarized = task$result())
    -  }, mc.cores = num_cores)	
    -
    -
    -  results_list <- mclapply(summarized_results, function(res) {
    -    calculate_Metrics(res$summarized, protein_mappings, res$label)
    -  }, mc.cores = num_cores)
    +  num_cores <- max(1L, as.integer(detectCores() - 1L))
    +  if (.Platform$OS.type != "windows" && num_cores > 1L) {
    +    summarized_results <- mclapply(
    +      data_process_tasks,
    +      function(task) list(label = task$label, summarized = task$result()),
    +      mc.cores = num_cores
    +    )
    +    results_list <- mclapply(
    +      summarized_results,
    +      function(res) calculate_Metrics(res$summarized, protein_mappings, res$label),
    +      mc.cores = num_cores
    +    )
    +  } else {
    +    summarized_results <- lapply(
    +      data_process_tasks,
    +      function(task) list(label = task$label, summarized = task$result())
    +    )
    +    results_list <- lapply(
    +      summarized_results,
    +      function(res) calculate_Metrics(res$summarized, protein_mappings, res$label)
    +    )
    +  }
    🧹 Nitpick comments (3)
    benchmark/benchmark_Metamorpheus.R (3)

    31-34: Fail fast if expected columns are missing

    Protect against schema differences to avoid opaque dplyr errors later.

       protein_mappings = data.table::fread(file.path(filePath, "QuantifiedProteins.tsv"))
    -
    -  protein_mappings = protein_mappings %>% filter(Organism %in% c("Escherichia coli (strain K12)", "Homo sapiens"))
    +  # Defensive checks
    +  if (!"Protein Group" %in% names(input)) {
    +    stop("Expected column 'Protein Group' not found in QuantifiedPeaks.tsv")
    +  }
    +  required_cols <- c("Protein Groups", "Organism")
    +  missing_cols <- setdiff(required_cols, names(protein_mappings))
    +  if (length(missing_cols)) {
    +    stop(sprintf("Expected columns missing in QuantifiedProteins.tsv: %s", paste(missing_cols, collapse = ", ")))
    +  }
    +  protein_mappings = protein_mappings %>% filter(Organism %in% c("Escherichia coli (strain K12)", "Homo sapiens"))

    37-37: Prefer semi_join over %in% for clarity and potential performance

    Makes the key relationship explicit and avoids vector materialization for large tables.

    -  input = input %>% filter(`Protein Group` %in% protein_mappings$`Protein Groups`)
    +  input = input %>% semi_join(protein_mappings, by = c("Protein Group" = "Protein Groups"))

    35-35: Reduce noisy output for large protein_mappings

    Printing the entire table can flood logs. Print a compact summary instead.

    -  print(protein_mappings)
    +  cat(
    +    "protein_mappings:",
    +    sprintf("%d rows x %d cols", nrow(protein_mappings), ncol(protein_mappings)),
    +    " | Organisms:",
    +    paste(unique(protein_mappings$Organism), collapse = ", "),
    +    "\n"
    +  )
    +  print(utils::head(protein_mappings, 3))
    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    💡 Knowledge Base configuration:

    • MCP integration is disabled by default for public repositories
    • Jira integration is disabled by default for public repositories
    • Linear integration is disabled by default for public repositories

    You can enable these sources in your CodeRabbit configuration.

    📥 Commits

    Reviewing files that changed from the base of the PR and between f6ac63e and 90cf018.

    📒 Files selected for processing (1)
    • benchmark/benchmark_Metamorpheus.R (1 hunks)
    ⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
    • GitHub Check: build
    • GitHub Check: Benchmarking-pipeline
