
Commit 1446a76

wprazuch and marta-sd authored

feat: Add optional max_walltime to prevent infinite looping in Slurm jobs (#638)

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Co-authored-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
1 parent 1f20816 commit 1446a76

4 files changed: 354 additions & 4 deletions


docs/libraries/nemo-evaluator-launcher/configuration/executors/slurm.md (25 additions, 0 deletions)

````diff
@@ -198,6 +198,31 @@ The Slurm executor includes advanced auto-resume capabilities:
 3. **Automatic Resubmission**: New job is submitted with dependency on previous job
 4. **Progress Preservation**: Evaluation continues from where it left off
 
+### Maximum Total Walltime
+
+To prevent jobs from resuming indefinitely, you can configure a maximum total wall-clock time across all resumes using the `max_walltime` parameter:
+
+```yaml
+execution:
+  walltime: "04:00:00"        # Time limit per job submission
+  max_walltime: "24:00:00"    # Maximum total time across all resumes (optional)
+```
+
+**How it works:**
+- The actual runtime of each job is tracked using SLURM's `sacct` command
+- When a job resumes, the previous job's actual elapsed time is added to the accumulated total
+- Before starting each resumed job, the accumulated runtime is checked against `max_walltime`
+- If the accumulated runtime exceeds `max_walltime`, the job chain stops with an error
+- This prevents runaway jobs that might otherwise resume indefinitely
+
+**Configuration:**
+- `max_walltime`: Maximum total runtime in `HH:MM:SS` format (e.g., `"24:00:00"` for 24 hours)
+- Defaults to `"120:00:00"` (120 hours). Set to `null` for unlimited resuming
+
+:::{note}
+The `max_walltime` tracks **actual job execution time only**, excluding time spent waiting in the queue. This ensures accurate runtime accounting even when jobs are repeatedly preempted or must wait for resources.
+:::
+
 ## Monitoring and Job Management
 
 For monitoring jobs, checking status, and managing evaluations, see the [Executors Overview](overview.md#job-management) section.
````
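The accounting loop the documentation describes (read the previous job's elapsed time via `sacct`, add it to the accumulated total, compare against `max_walltime`) can be sketched in Python. `walltime_to_seconds` and `may_resume` are hypothetical names for illustration, not part of the launcher; the real logic runs in the generated bash handler:

```python
# Sketch of the accumulation bookkeeping described above; walltime_to_seconds
# and may_resume are hypothetical names, not launcher APIs.

def walltime_to_seconds(time_str: str) -> int:
    """Parse sacct-style elapsed strings: D-HH:MM:SS, HH:MM:SS, MM:SS, or SS."""
    days = 0
    if "-" in time_str:
        day_part, time_str = time_str.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in time_str.split(":")]
    while len(parts) < 3:  # left-pad MM:SS or bare SS up to (H, M, S)
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return days * 86400 + hours * 3600 + minutes * 60 + seconds

def may_resume(accumulated: int, prev_elapsed: str, max_walltime: str) -> tuple:
    """Add the previous job's elapsed time, then compare against the cap."""
    total = accumulated + walltime_to_seconds(prev_elapsed)
    return total, total < walltime_to_seconds(max_walltime)

# Two 4-hour runs already accumulated; a third 4-hour run still fits a 24 h cap.
total, ok = may_resume(2 * 4 * 3600, "04:00:00", "24:00:00")
```

Because only actual elapsed time is summed, queue wait between resubmissions never counts toward the cap.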

packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/execution/slurm/default.yaml (1 addition, 0 deletions)

```diff
@@ -24,6 +24,7 @@ num_nodes: 1
 ntasks_per_node: 1
 gres: gpu:8
 walltime: 01:00:00
+max_walltime: "120:00:00" # Maximum total runtime across all resumes. Set to null for unlimited.
 subproject: nemo-evaluator-launcher
 sbatch_comment: null # Optional comment for SLURM job (translates to #SBATCH --comment='...')
```
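A subtlety of this default: the executor reads the key with `cfg.execution.get("max_walltime", "120:00:00")`, so the fallback applies only when the key is absent. An explicit `max_walltime: null` comes back as `None`, which disables the walltime check entirely. A minimal sketch of that lookup semantics, with plain dicts standing in for the real config object:

```python
from typing import Optional

# Sketch of the default-vs-null lookup; plain dicts stand in for the
# execution config. Mirrors cfg.execution.get("max_walltime", "120:00:00").
def resolve_max_walltime(execution: dict) -> Optional[str]:
    return execution.get("max_walltime", "120:00:00")

default = resolve_max_walltime({})                            # key absent -> fallback default
unlimited = resolve_max_walltime({"max_walltime": None})      # explicit null -> None (unlimited)
explicit = resolve_max_walltime({"max_walltime": "24:00:00"})  # explicit value wins
```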

packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py (97 additions, 4 deletions)

```diff
@@ -620,8 +620,9 @@ def _create_slurm_sbatch_script(
     if env_vars:
         s += "\n"
 
-    # auto resume after timeout
-    s += _AUTORESUME_HANDLER
+    # auto resume after timeout (with optional max_walltime enforcement)
+    max_walltime = cfg.execution.get("max_walltime", "120:00:00")
+    s += _generate_autoresume_handler(remote_task_subdir, max_walltime)
     s += "\n\n"
 
     # echo the current SLURM_JOB_ID
@@ -1369,9 +1370,100 @@ def _get_progress(
     return progress_list
 
 
-_AUTORESUME_HANDLER = """
+def _generate_autoresume_handler(
+    remote_task_subdir: Path, max_walltime: Optional[str] = None
+) -> str:
+    """Generate the autoresume handler script with optional max walltime enforcement.
+
+    Args:
+        remote_task_subdir: The remote directory path for storing timing files.
+        max_walltime: Maximum total wall-clock time (e.g., "24:00:00"). None means unlimited.
+
+    Returns:
+        The autoresume handler script as a string.
+    """
+    start_time_file = remote_task_subdir / ".job_start_time"
+    accumulated_walltime_file = remote_task_subdir / ".accumulated_walltime"
+
+    # Generate max walltime check logic if max_walltime is specified
+    if max_walltime:
+        max_walltime_check = f'''
+# Check if max_walltime has been exceeded
+_max_walltime="{max_walltime}"
+_start_time_file="{start_time_file}"
+_accumulated_walltime_file="{accumulated_walltime_file}"
+
+# Convert HH:MM:SS or D-HH:MM:SS to seconds
+_walltime_to_seconds() {{
+    local time_str=$1
+    local days=0 hours=0 minutes=0 seconds=0
+
+    # Handle format with days: D-HH:MM:SS (sacct output format)
+    if [[ "$time_str" =~ ^([0-9]+)-([0-9]+):([0-9]+):([0-9]+)$ ]]; then
+        days=${{BASH_REMATCH[1]}}
+        hours=${{BASH_REMATCH[2]}}
+        minutes=${{BASH_REMATCH[3]}}
+        seconds=${{BASH_REMATCH[4]}}
+    # Handle different formats: HH:MM:SS, MM:SS, or just seconds
+    elif [[ "$time_str" =~ ^([0-9]+):([0-9]+):([0-9]+)$ ]]; then
+        hours=${{BASH_REMATCH[1]}}
+        minutes=${{BASH_REMATCH[2]}}
+        seconds=${{BASH_REMATCH[3]}}
+    elif [[ "$time_str" =~ ^([0-9]+):([0-9]+)$ ]]; then
+        minutes=${{BASH_REMATCH[1]}}
+        seconds=${{BASH_REMATCH[2]}}
+    elif [[ "$time_str" =~ ^([0-9]+)$ ]]; then
+        seconds=${{BASH_REMATCH[1]}}
+    fi
+
+    echo $((days * 86400 + hours * 3600 + minutes * 60 + seconds))
+}}
+
+_max_walltime_seconds=$(_walltime_to_seconds "$_max_walltime")
+
+# Initialize accumulated walltime file on first run or on manual resume
+if [[ ! -f "$_accumulated_walltime_file" || ! -n "$_prev_slurm_job_id" ]]; then
+    echo "0" > "$_accumulated_walltime_file"
+    echo "Job chain started at $(date). Max total walltime: $_max_walltime"
+fi
+
+# Read accumulated walltime from previous jobs
+_accumulated_seconds=$(cat "$_accumulated_walltime_file")
+
+# If there's a previous job, add its actual elapsed time (from sacct) to the accumulated walltime
+# This must happen BEFORE the max walltime check to ensure accurate tracking
+if [[ -n "$_prev_slurm_job_id" ]]; then
+    _prev_elapsed=$(sacct -j $_prev_slurm_job_id -P -n -o Elapsed | head -n 1)
+    if [[ -n "$_prev_elapsed" ]]; then
+        _prev_elapsed_seconds=$(_walltime_to_seconds "$_prev_elapsed")
+        _accumulated_seconds=$((_accumulated_seconds + _prev_elapsed_seconds))
+        echo "$_accumulated_seconds" > "$_accumulated_walltime_file"
+        echo "Previous job $_prev_slurm_job_id ran for $_prev_elapsed"
+    fi
+fi
+
+_elapsed_formatted=$(printf '%02d:%02d:%02d' $((_accumulated_seconds/3600)) $(((_accumulated_seconds%3600)/60)) $((_accumulated_seconds%60)))
+
+echo "Total accumulated walltime: $_elapsed_formatted (max: $_max_walltime)"
+
+# Check if we've exceeded max walltime - if so, don't schedule next job and exit
+if [[ $_accumulated_seconds -ge $_max_walltime_seconds ]]; then
+    echo "ERROR: Maximum total walltime ($_max_walltime) exceeded. Accumulated: $_elapsed_formatted"
+    echo "Stopping job chain to prevent infinite resuming."
+    exit 1
+fi
+
+# Record job start time for this job (for debugging/logging purposes)
+date +%s > "$_start_time_file"
+'''
+    else:
+        max_walltime_check = ""
+
+    handler = f"""
 _this_script=$0
 _prev_slurm_job_id=$1
+{max_walltime_check}
 # Handle automatic resumption after some failed state.
 if [[ "$_prev_slurm_job_id" != "" ]]; then
     _prev_state=`sacct -j $_prev_slurm_job_id -P -n -o State | head -n 1`
@@ -1390,7 +1482,8 @@ def _get_progress(
     # Schedule next execution of this script with the current $SLURM_JOB_ID as an argument.
     # "afternotok" means next execution will be invoked only if the current execution terminates in some failed state.
     sbatch --dependency=afternotok:$SLURM_JOB_ID $_this_script $SLURM_JOB_ID
-""".strip()
+"""
+    return handler.strip()
 
 
 def _generate_haproxy_config_with_placeholders(cfg):
```
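The bash `_walltime_to_seconds` helper in this diff branches on the elapsed-time shapes `sacct` can emit (`D-HH:MM:SS`, `HH:MM:SS`, `MM:SS`, bare seconds). A hypothetical Python port of the same regex branches, plus the `printf`-style formatting applied to the accumulated total, is sketched below; it is a verification aid, not launcher code:

```python
import re

# Hypothetical Python port of the bash _walltime_to_seconds regex branches,
# plus the printf-style HH:MM:SS formatting used for the accumulated total.
_PATTERNS = [
    (re.compile(r"^(\d+)-(\d+):(\d+):(\d+)$"), (86400, 3600, 60, 1)),  # D-HH:MM:SS
    (re.compile(r"^(\d+):(\d+):(\d+)$"), (3600, 60, 1)),               # HH:MM:SS
    (re.compile(r"^(\d+):(\d+)$"), (60, 1)),                           # MM:SS
    (re.compile(r"^(\d+)$"), (1,)),                                    # bare seconds
]

def elapsed_to_seconds(time_str: str) -> int:
    for pattern, weights in _PATTERNS:
        match = pattern.match(time_str)
        if match:
            return sum(int(g) * w for g, w in zip(match.groups(), weights))
    return 0  # no branch matched: all fields stay 0, like the bash fallthrough

def format_hms(total_seconds: int) -> str:
    # Mirrors printf '%02d:%02d:%02d'; hours may exceed 24 (e.g. "49:30:00")
    return f"{total_seconds // 3600:02d}:{total_seconds % 3600 // 60:02d}:{total_seconds % 60:02d}"
```

Note that unparseable `sacct` output silently contributes zero seconds, so the cap check simply skips it rather than failing the chain.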

0 commit comments