
feat: add generic heartbeat for periodic script execution during evaluation#876

Open
AdamRajfer wants to merge 4 commits into main from arajfer/live-mlflow-export

Conversation

Contributor

@AdamRajfer AdamRajfer commented Mar 20, 2026

Summary

Adds a heartbeat mechanism that runs a user-provided script periodically during evaluation jobs. Supported on all executors (SLURM, local, lepton).

Configuration

execution:
  heartbeat:
    script: 'ls -la artifacts/ 2>/dev/null || echo "no artifacts yet"'
    interval: 60          # seconds (default: 300)
    container: null       # optional container image
| Field | Default | Description |
| --- | --- | --- |
| `script` | `null` | Command to run periodically. When `null`, the heartbeat is disabled. |
| `interval` | `300` | Seconds between heartbeat executions. |
| `container` | `null` | Optional container image for the script. |

How it works

When script is set, the executor starts a background loop that runs the command every interval seconds. The loop runs alongside the evaluation and stops automatically when the evaluation completes or fails. Script failures are logged but do not affect the evaluation.
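The executors inject this loop as shell into their launch scripts, but the semantics described above (run every `interval` seconds, swallow and log script failures, stop when the job ends) can be sketched in Python. This is an illustration only — `start_heartbeat` and its signature are inventions for this example, not the launcher's API:

```python
import subprocess
import threading
from typing import Optional


def start_heartbeat(
    script: str,
    interval: float = 300,
    stop: Optional[threading.Event] = None,
) -> threading.Event:
    """Run `script` every `interval` seconds until `stop` is set.

    Script failures are logged and swallowed, so a broken heartbeat
    never affects the evaluation itself.
    """
    stop = stop or threading.Event()

    def loop() -> None:
        # Event.wait doubles as the sleep and the shutdown check:
        # it returns True (ending the loop) as soon as stop is set.
        while not stop.wait(interval):
            try:
                subprocess.run(script, shell=True, check=True)
            except subprocess.CalledProcessError as err:
                print(f"heartbeat failed (ignored): {err}")

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

Calling `stop.set()` once the evaluation finishes ends the loop; the `daemon` flag additionally guarantees the thread dies with the process, matching the "stops automatically" behavior above.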

Changes

  • executor.py (SLURM): heartbeat background loop in sbatch script, with optional container support via srun
  • run.template.sh (local): heartbeat background loop in bash template
  • executor.py (lepton): heartbeat background loop in launch script
  • Config defaults added to all execution configs
  • Example config: examples/slurm_heartbeat.yaml
  • Docs: heartbeat section added to SLURM executor docs

Removed

  • exporters/mlflow_live.py and tests — replaced by the generic heartbeat

Test plan

  • 8 heartbeat unit tests across all executors
  • All existing tests pass (926 passed)
  • Pre-commit clean
  • End-to-end SLURM test: heartbeat logged every 10s throughout job

When auto_export includes "mlflow" and export.mlflow.live is true
(default), the SLURM executor tracks the job lifecycle in MLflow
in real-time:

  deploying → server_healthy → evaluating → completed/failed/timeout

A standalone mlflow_live.py script runs inside the mlflow container
via srun at each lifecycle stage. A background heartbeat uploads
artifacts every 5 minutes, giving live visibility into running
evaluations.

When live is false, the same container-based export runs once
after evaluation completes (no pip-installing the launcher).

Set export.mlflow.live to false to disable live tracking and use
only the final export.

Key changes:
- exporters/mlflow_live.py: standalone MLflow export script
  (init, update, cancel, finalize modes) reads export_config.yml
- executor.py: injects lifecycle hooks into sbatch script —
  signal traps (USR1/TERM/EXIT), heartbeat loop, status updates;
  mlflow filtered from old auto-export destinations
- Artifacts uploaded with same exclusion patterns and directory
  structure as the existing MLflow exporter
- Finalize runs before server cleanup to avoid walltime race

The user config API is unchanged — same export.mlflow.* keys.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@AdamRajfer
Contributor Author

/ok to test 05e85eb

feat: add generic heartbeat for periodic script execution during evaluation

Adds a heartbeat mechanism that runs a user-provided script periodically
during evaluation jobs. Supported on all executors (SLURM, local, lepton).

Configuration:
  execution:
    heartbeat:
      script: 'echo heartbeat'
      interval: 60
      container: null

The heartbeat starts before evaluation and stops after completion.
Script failures are logged but do not affect the evaluation.

Replaces the previous MLflow-specific live export with a generic,
executor-agnostic approach.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 20, 2026
@AdamRajfer AdamRajfer changed the title feat: add live MLflow export for SLURM evaluation jobs feat: add generic heartbeat for periodic script execution during evaluation Mar 20, 2026
@AdamRajfer
Contributor Author

/ok to test aa40815

from jinja2 import Environment

template_text = (
Path(__file__).parent.parent.parent
Contributor


Please check if we can use package resources instead

Contributor Author


We can. Fixed
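The package-resources fix the reviewer asks for could look like the sketch below. `load_template` is a hypothetical helper, and the package path in the comment is an assumption (the PR only names `run.template.sh`):

```python
from importlib.resources import files


def load_template(package: str, name: str) -> str:
    """Read a file shipped inside a package, without Path(__file__) traversal."""
    return files(package).joinpath(name).read_text()


# In the launcher this might become (package name assumed):
#   template_text = load_template("nemo_evaluator_launcher.executors", "run.template.sh")
#   template = Environment().from_string(template_text)
```

Unlike `Path(__file__).parent.parent.parent`, `importlib.resources.files` (Python 3.9+) resolves the resource through the package's loader, so it keeps working when the package is installed or zipped rather than run from a source checkout.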

from jinja2 import Environment

template_text = (
Path(__file__).parent.parent.parent
Contributor


Same here

Contributor Author


We can. Fixed

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@AdamRajfer
Contributor Author

/ok to test 22f21ac

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@AdamRajfer
Contributor Author

/ok to test 4207f59


Labels

documentation (Improvements or additions to documentation), nemo-evaluator-launcher, tests
