
Conversation

@adamjtaylor
Collaborator

This pull request introduces a robust retry and resource scaling strategy for Nextflow processes, with special emphasis on handling Docker out-of-memory (OOM) errors. The changes improve reliability by dynamically increasing resources and retry attempts for failed tasks, and add new configuration for very memory-intensive processes.

Error handling and retry strategy improvements:

  • Added a global error strategy that retries tasks up to three times for Docker OOM errors (exit codes 125, 137, 139), and up to two times for general failures, with exponential scaling of CPU and memory resources on each retry. (nextflow.config)
  • Updated profile-specific error strategies (tower profile) to consistently handle Docker OOM errors with increased retry limits and resource scaling for different process labels (process_low, process_medium, process_high). (nextflow.config)

Resource scaling enhancements:

  • Changed CPU and memory allocation logic to use exponential scaling based on retry attempts, ensuring more resources are allocated to tasks that fail repeatedly. (nextflow.config)
  • For process_high and process_high_memory labels, increased maximum retry attempts and introduced higher memory scaling for very demanding tasks. (nextflow.config)
  • Added a new process_high_memory label for processes requiring exceptionally large memory allocations; a configuration sketch of the overall strategy follows below. (nextflow.config)
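
A minimal nextflow.config sketch of the strategy described above; the exit codes and the retry split (three attempts for OOM, two otherwise) come from the PR body, but the specific CPU and memory figures are illustrative rather than copied from the diff:

    process {
        // Retry Docker OOM-style exit codes up to three times, other
        // failures up to two, then ignore the task instead of failing
        errorStrategy = {
            if (task.exitStatus in [125, 137, 139]) {
                return task.attempt <= 3 ? 'retry' : 'ignore'
            }
            return task.attempt <= 2 ? 'retry' : 'ignore'
        }
        maxRetries = 3

        // Exponential resource scaling on each retry
        cpus   = { 2 * task.attempt }
        memory = { 4.GB * Math.pow(2, task.attempt - 1) } // 4GB, 8GB, 16GB
    }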


Copilot AI left a comment


Pull request overview

This PR refactors the Nextflow configuration to implement a robust retry and resource scaling strategy for handling Docker out-of-memory (OOM) errors across different process types. The changes introduce global default settings and enhance profile-specific configurations with exponential memory scaling.

Changes:

  • Added global error handling strategy for Docker OOM errors with retry logic and exponential resource scaling
  • Updated the tower profile with exponential memory scaling and differentiated retry logic for OOM vs. general failures
  • Introduced a new process_high_memory label for exceptionally memory-intensive processes


maxRetries = 3

// Default resource scaling
cpus = { 2 * task.attempt }

Copilot AI Jan 15, 2026


The global default CPU allocation grows unbounded with retry attempts (2, 4, 6, 8, etc.). Consider adding a Math.min() cap similar to the tower profile's process_medium and process_high labels to prevent excessive CPU allocation on later retries.

Suggested change
cpus = { 2 * task.attempt }
cpus = { Math.min(2 * task.attempt, 8) }
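
A hedged variant of the suggested cap that also honors a pipeline-wide ceiling, assuming a params.max_cpus option exists (hypothetical; not confirmed by this PR):

    // Sketch only: params.max_cpus is an assumed pipeline parameter
    cpus = { Math.min(2 * task.attempt, params.max_cpus ?: 8) }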

if (task.exitStatus in [125, 137, 139]) {
    return task.attempt <= 3 ? 'retry' : 'ignore'
}
return task.attempt <= 3 ? 'retry' : 'ignore'

Copilot AI Jan 15, 2026


The error strategy for process_medium has redundant logic: both the OOM error case and the default case return the same condition (task.attempt <= 3). Either simplify it to a single return statement without the if, or give the non-OOM case a lower retry limit to match the differentiation in the other process labels.

Suggested change
return task.attempt <= 3 ? 'retry' : 'ignore'
return task.attempt <= 2 ? 'retry' : 'ignore'
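
For reference, a sketch of a differentiated process_medium block; the retry limits follow the suggestion above, while the cpus cap is illustrative:

    withLabel: 'process_medium' {
        errorStrategy = {
            // Extra retry only for Docker OOM-style exit codes
            if (task.exitStatus in [125, 137, 139]) {
                return task.attempt <= 3 ? 'retry' : 'ignore'
            }
            return task.attempt <= 2 ? 'retry' : 'ignore'
        }
        maxRetries = 3
        cpus       = { Math.min(4 * task.attempt, 8) } // illustrative cap
    }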

errorStrategy = {task.attempt <= 3 ? 'retry' : 'ignore' }
cpus = { Math.min(8 * task.attempt, 16) }
memory = { 16.GB * Math.pow(2, task.attempt - 1) } // 16GB, 32GB, 64GB
maxRetries = 4 // One extra retry for high-memory processes

Copilot AI Jan 15, 2026


The comment states 'high-memory processes', but this block configures the process_high label, not process_high_memory. The comment should say 'high processes'; otherwise it is confusing, since process_high_memory is introduced later.

Suggested change
maxRetries = 4 // One extra retry for high-memory processes
maxRetries = 4 // One extra retry for high processes
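
Since process_high_memory is only mentioned in passing here, a speculative sketch of what such a label block might contain, extrapolating from the process_high values quoted in this hunk (the actual figures in the PR may differ):

    withLabel: 'process_high_memory' {
        // Speculative values; the real definition lives in nextflow.config
        errorStrategy = { task.attempt <= 4 ? 'retry' : 'ignore' }
        maxRetries    = 4
        cpus          = { Math.min(8 * task.attempt, 16) }
        memory        = { 64.GB * Math.pow(2, task.attempt - 1) } // 64GB, 128GB, 256GB
    }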

@aditigopalan
Collaborator

Looks like sage.config uses check_max() and slow() functions while the tower profile uses direct calculations. Is this intentional?
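
For context, the two patterns being contrasted look roughly like this; the check_max() call follows the common nf-core convention, and its use in sage.config is assumed from the comment above rather than verified:

    // nf-core-style pattern: scale with attempt, then clamp via check_max()
    memory = { check_max( 16.GB * task.attempt, 'memory' ) }

    // Direct-calculation pattern used in the tower profile
    memory = { 16.GB * Math.pow(2, task.attempt - 1) }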
