feat: Add optional max_walltime to prevent infinite looping in Slurm jobs (#638)
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Co-authored-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
docs/libraries/nemo-evaluator-launcher/configuration/executors/slurm.md
25 additions & 0 deletions
@@ -198,6 +198,31 @@ The Slurm executor includes advanced auto-resume capabilities:
3. **Automatic Resubmission**: New job is submitted with dependency on previous job
4. **Progress Preservation**: Evaluation continues from where it left off

### Maximum Total Walltime

To prevent a job chain from resuming indefinitely, set the `max_walltime` parameter to cap the total wall-clock time accumulated across all resumes:
```yaml
execution:
  walltime: "04:00:00"      # Time limit per job submission
  max_walltime: "24:00:00"  # Maximum total time across all resumes (optional)
```
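As a rough illustration of how the two settings relate, the sketch below converts `HH:MM:SS` strings to seconds and treats `null` (loaded as `None`) as "no limit". The helper name `parse_walltime` is hypothetical and not part of the launcher; the launcher's own parsing may differ.

```python
from typing import Optional


def parse_walltime(value: Optional[str]) -> Optional[int]:
    """Convert an HH:MM:SS string to seconds; None means no limit.

    Hypothetical helper for illustration, not the launcher's API.
    """
    if value is None:
        return None
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    if hours < 0 or not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError(f"invalid walltime: {value!r}")
    return hours * 3600 + minutes * 60 + seconds


per_job = parse_walltime("04:00:00")   # walltime: limit per submission
total = parse_walltime("24:00:00")     # max_walltime: limit across resumes

# Number of full-length submissions that fit in the total budget
full_length_jobs = total // per_job
print(per_job, total, full_length_jobs)  # prints: 14400 86400 6
```

With these values, roughly six full-length submissions fit into the budget. Jobs that end early, for example on preemption, consume only their actual elapsed time, so a chain may contain more submissions than that.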
**How it works:**

- The actual runtime of each job is read from Slurm's `sacct` accounting command
- When a job resumes, the previous job's elapsed time is added to the accumulated total
- Before each resumed job starts, the accumulated runtime is checked against `max_walltime`
- If the accumulated runtime exceeds `max_walltime`, the job chain stops with an error
- This prevents runaway job chains that would otherwise resume indefinitely

**Configuration:**
- `max_walltime`: Maximum total runtime in `HH:MM:SS` format (e.g., `"24:00:00"` for 24 hours)
- Defaults to `"72:00:00"` (72 hours). Set to `null` to allow unlimited resuming

:::{note}
`max_walltime` counts **actual job execution time only** and excludes time spent waiting in the queue. Runtime accounting therefore stays accurate even when jobs are repeatedly preempted or must wait for resources.
:::
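The accounting described in this section can be sketched as follows. This is a minimal illustration, not the launcher's implementation: the function names are hypothetical, the `sacct` flags shown (`-X`, `--noheader`, `--parsable2`, `--format=Elapsed`) are one plausible way to read a job's elapsed time, and the strict `>` comparison is an assumption based on the word "exceeds" above.

```python
import subprocess
from typing import Optional, Tuple


def elapsed_to_seconds(elapsed: str) -> int:
    """Parse sacct's Elapsed format, [DD-]HH:MM:SS, into seconds."""
    days = 0
    if "-" in elapsed:
        day_part, elapsed = elapsed.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(p) for p in elapsed.split(":"))
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


def query_elapsed_seconds(job_id: str) -> int:
    """Ask sacct for a finished job's run time (assumed flags, see above)."""
    out = subprocess.run(
        ["sacct", "-j", job_id, "-X", "--noheader", "--parsable2",
         "--format=Elapsed"],
        check=True, capture_output=True, text=True,
    ).stdout
    return elapsed_to_seconds(out.strip().splitlines()[0])


def may_resume(accumulated: int, previous_elapsed: int,
               max_walltime: Optional[int]) -> Tuple[int, bool]:
    """Add the previous job's run time, then check the budget.

    Returns the new accumulated total and whether the chain may continue.
    max_walltime=None models `max_walltime: null` (no limit).
    """
    accumulated += previous_elapsed
    if max_walltime is not None and accumulated > max_walltime:
        return accumulated, False
    return accumulated, True


# Simulated chain with a 24-hour budget. The strings stand in for the
# Elapsed values sacct might report; queue wait never appears here
# because only run time is accumulated.
total = 0
for reported in ["04:00:00", "20:00:00", "00:00:01"]:
    total, ok = may_resume(total, elapsed_to_seconds(reported), 24 * 3600)
    print(reported, total, ok)
    if not ok:
        break  # the chain stops instead of submitting another job
```

In this simulation the chain is allowed to continue while the total is at or below 24 hours and stops once the total passes it. Whether the real check treats an exact match as exceeded is not stated in the documentation.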
## Monitoring and Job Management
For monitoring jobs, checking status, and managing evaluations, see the [Executors Overview](overview.md#job-management) section.