Refactor resource scaling in nextflow.config tower profile #28
…mproved retry logic and memory management
Pull request overview
This PR refactors the Nextflow configuration to implement a robust retry and resource scaling strategy for handling Docker out-of-memory (OOM) errors across different process types. The changes introduce global default settings and enhance profile-specific configurations with exponential memory scaling.
Changes:
- Added global error handling strategy for Docker OOM errors with retry logic and exponential resource scaling
- Updated the tower profile with exponential memory scaling and differentiated retry logic for OOM vs. general failures
- Introduced a new process_high_memory label for exceptionally memory-intensive processes
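For orientation, a label-specific block of the kind described above might look roughly like the following. This is a minimal sketch, not the PR's actual code: the 32 GB base, the CPU cap, and the retry limit are assumed values chosen for illustration, and only the label name, the tower profile, and the doubling pattern come from the overview.

```groovy
// Hypothetical sketch of a process_high_memory block inside the tower profile.
// All numeric values below are assumptions, not taken from the PR.
profiles {
    tower {
        process {
            withLabel: process_high_memory {
                // Retry a failed task up to the third attempt, then ignore it
                errorStrategy = { task.attempt <= 3 ? 'retry' : 'ignore' }
                maxRetries    = 3

                // CPUs grow with each attempt but never exceed 16
                cpus   = { Math.min(8 * task.attempt, 16) }

                // Memory doubles on every attempt: 32 GB, 64 GB, 128 GB, ...
                memory = { 32.GB * Math.pow(2, task.attempt - 1) }
            }
        }
    }
}
```

A process opts in by declaring `label 'process_high_memory'` in its definition.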
| maxRetries = 3 | ||
| // Default resource scaling | ||
| cpus = { 2 * task.attempt } |
Copilot
AI
Jan 15, 2026
The global default CPU allocation grows unbounded with retry attempts (2, 4, 6, 8, etc.). Consider adding a Math.min() cap similar to the tower profile's process_medium and process_high labels to prevent excessive CPU allocation on later retries.
| cpus = { 2 * task.attempt } | |
| cpus = { Math.min(2 * task.attempt, 8) } |
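To see what the suggested cap changes, the two expressions can be evaluated side by side in plain Groovy. This is only an illustration: the range of six attempts is an assumption made to show the cap taking effect. With maxRetries = 3 a task would normally run at most four times, in which case both expressions yield the same values.

```groovy
// Compare the uncapped and capped CPU expressions for attempts 1 through 6.
// `attempt` stands in for Nextflow's task.attempt; the range is illustrative.
(1..6).each { attempt ->
    def uncapped = 2 * attempt
    def capped   = Math.min(2 * attempt, 8)
    println "attempt ${attempt}: uncapped=${uncapped}, capped=${capped}"
}
// attempt 1: uncapped=2, capped=2
// attempt 2: uncapped=4, capped=4
// attempt 3: uncapped=6, capped=6
// attempt 4: uncapped=8, capped=8
// attempt 5: uncapped=10, capped=8
// attempt 6: uncapped=12, capped=8
```

The cap therefore acts as a safeguard in case the retry limit is raised later, rather than changing behavior at the current limit.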
| if (task.exitStatus in [125, 137, 139]) { | ||
| return task.attempt <= 3 ? 'retry' : 'ignore' | ||
| } | ||
| return task.attempt <= 3 ? 'retry' : 'ignore' |
Copilot
AI
Jan 15, 2026
The error strategy for process_medium has redundant logic: both the OOM error case and the default case return the same expression (task.attempt <= 3 ? 'retry' : 'ignore'). Either drop the if statement and keep the single return, or give the non-OOM case a different retry limit to match the differentiation in other process labels.
| return task.attempt <= 3 ? 'retry' : 'ignore' | |
| return task.attempt <= 2 ? 'retry' : 'ignore' |
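Put together, the differentiated strategy from this suggestion could be sketched as below. The closure body is assembled from the diff lines and the suggested change; the surrounding process and withLabel wrappers and the maxRetries value are assumptions about where the fragment lives.

```groovy
// Sketch of a differentiated errorStrategy for process_medium.
// Wrapper blocks and maxRetries are assumed; the closure follows the diff.
process {
    withLabel: process_medium {
        errorStrategy = {
            // Exit codes the diff treats as OOM-related container failures
            if (task.exitStatus in [125, 137, 139]) {
                return task.attempt <= 3 ? 'retry' : 'ignore'
            }
            // Any other failure gets one fewer retry before being ignored
            return task.attempt <= 2 ? 'retry' : 'ignore'
        }
        maxRetries = 3
    }
}
```

With this shape the if statement is no longer redundant, because the two branches return different limits.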
| errorStrategy = {task.attempt <= 3 ? 'retry' : 'ignore' } | ||
| cpus = { Math.min(8 * task.attempt, 16) } | ||
| memory = { 16.GB * Math.pow(2, task.attempt - 1) } // 16GB, 32GB, 64GB | ||
| maxRetries = 4 // One extra retry for high-memory processes |
Copilot
AI
Jan 15, 2026
The comment states 'high-memory processes' but this is for the process_high label, not process_high_memory. Either the comment should say 'high processes' or it may be confusing since process_high_memory is introduced later.
| maxRetries = 4 // One extra retry for high-memory processes | |
| maxRetries = 4 // One extra retry for high processes |
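The interaction between the memory expression, the errorStrategy closure, and maxRetries in this hunk can be traced in plain Groovy. This sketch assumes that task.attempt starts at 1 and that the closure is evaluated after each failed attempt. Under those assumptions a fourth attempt at 128 GB is reached, which the inline comment (16GB, 32GB, 64GB) does not list.

```groovy
// Trace memory per attempt for the process_high settings shown in the diff.
// `attempt` stands in for task.attempt; 1 << (attempt - 1) equals 2^(attempt - 1).
def baseGb = 16
(1..4).each { attempt ->
    def memGb    = baseGb * (1 << (attempt - 1))
    def decision = attempt <= 3 ? 'retry' : 'ignore'
    println "attempt ${attempt}: ${memGb} GB, on failure -> ${decision}"
}
// attempt 1: 16 GB, on failure -> retry
// attempt 2: 32 GB, on failure -> retry
// attempt 3: 64 GB, on failure -> retry
// attempt 4: 128 GB, on failure -> ignore
```

Under the same assumptions the closure returns 'ignore' on the fourth failure, so the task would stop retrying before maxRetries = 4 is used up.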
Looks like sage.config uses check_max() and slow() functions, while the tower profile uses direct calculations. Is this intentional?
This pull request introduces a robust retry and resource scaling strategy for Nextflow processes, with special emphasis on handling Docker out-of-memory (OOM) errors. The changes improve reliability by dynamically increasing resources and retry attempts for failed tasks, and add new configuration for very memory-intensive processes.
Error handling and retry strategy improvements:
- … (nextflow.config)
- … tower profile) to consistently handle Docker OOM errors with increased retry limits and resource scaling for different process labels (process_low, process_medium, process_high). (nextflow.config)

Resource scaling enhancements:
- … (nextflow.config)
- … process_high and process_high_memory labels, increased maximum retry attempts and introduced higher memory scaling for very demanding tasks. (nextflow.config)
- … process_high_memory label for processes requiring exceptionally large