
feat: add generic heartbeat for periodic script execution during evaluation#876

Open
AdamRajfer wants to merge 4 commits into main from arajfer/live-mlflow-export

Conversation

Contributor

@AdamRajfer AdamRajfer commented Mar 20, 2026

Summary

Adds a heartbeat mechanism that runs a user-provided script periodically during evaluation jobs. Supported on all executors (SLURM, local, lepton).

Configuration

execution:
  heartbeat:
    script: 'ls -la artifacts/ 2>/dev/null || echo "no artifacts yet"'
    interval: 60          # seconds (default: 300)
    container: null       # optional container image
| Field | Default | Description |
| --- | --- | --- |
| `script` | `null` | Command to run periodically. When `null`, the heartbeat is disabled. |
| `interval` | `300` | Seconds between heartbeat executions. |
| `container` | `null` | Optional container image for the script. |

How it works

When script is set, the executor starts a background loop that runs the command every interval seconds. The loop runs alongside the evaluation and stops automatically when the evaluation completes or fails. Script failures are logged but do not affect the evaluation.
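The executors inject this loop as shell into their launch scripts, but the semantics described above (run every `interval` seconds, swallow and log script failures, stop when the job ends) can be sketched in Python. This is an illustration only — `start_heartbeat` and its signature are inventions for this example, not the launcher's API:

```python
import subprocess
import threading
from typing import Optional


def start_heartbeat(
    script: str,
    interval: float = 300,
    stop: Optional[threading.Event] = None,
) -> threading.Event:
    """Run `script` every `interval` seconds until `stop` is set.

    Script failures are logged and swallowed, so a broken heartbeat
    never affects the evaluation itself.
    """
    stop = stop or threading.Event()

    def loop() -> None:
        # Event.wait doubles as the sleep and the shutdown check:
        # it returns True (ending the loop) as soon as stop is set.
        while not stop.wait(interval):
            try:
                subprocess.run(script, shell=True, check=True)
            except subprocess.CalledProcessError as err:
                print(f"heartbeat failed (ignored): {err}")

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

Calling `stop.set()` once the evaluation finishes ends the loop; the `daemon` flag additionally guarantees the thread dies with the process, matching the "stops automatically" behavior above.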

Changes

  • executor.py (SLURM): heartbeat background loop in sbatch script, with optional container support via srun
  • run.template.sh (local): heartbeat background loop in bash template
  • executor.py (lepton): heartbeat background loop in launch script
  • Config defaults added to all execution configs
  • Example config: examples/slurm_heartbeat.yaml
  • Docs: heartbeat section added to SLURM executor docs

Removed

  • exporters/mlflow_live.py and tests — replaced by the generic heartbeat

Test plan

  • 8 heartbeat unit tests across all executors
  • All existing tests pass (926 passed)
  • Pre-commit clean
  • End-to-end SLURM test: heartbeat logged every 10s throughout job

When auto_export includes "mlflow" and export.mlflow.live is true
(default), the SLURM executor tracks the job lifecycle in MLflow
in real-time:

  deploying → server_healthy → evaluating → completed/failed/timeout

A standalone mlflow_live.py script runs inside the mlflow container
via srun at each lifecycle stage. A background heartbeat uploads
artifacts every 5 minutes, giving live visibility into running
evaluations.

When live is false, the same container-based export runs once
after evaluation completes (no pip-installing the launcher).

Set export.mlflow.live to false to disable live tracking and use
only the final export.

Key changes:
- exporters/mlflow_live.py: standalone MLflow export script
  (init, update, cancel, finalize modes) reads export_config.yml
- executor.py: injects lifecycle hooks into sbatch script —
  signal traps (USR1/TERM/EXIT), heartbeat loop, status updates;
  mlflow filtered from old auto-export destinations
- Artifacts uploaded with same exclusion patterns and directory
  structure as the existing MLflow exporter
- Finalize runs before server cleanup to avoid walltime race

The user config API is unchanged — same export.mlflow.* keys.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@AdamRajfer
Contributor Author

/ok to test 05e85eb

feat: add generic heartbeat for periodic script execution during evaluation

Adds a heartbeat mechanism that runs a user-provided script periodically
during evaluation jobs. Supported on all executors (SLURM, local, lepton).

Configuration:
  execution:
    heartbeat:
      script: 'echo heartbeat'
      interval: 60
      container: null

The heartbeat starts before evaluation and stops after completion.
Script failures are logged but do not affect the evaluation.

Replaces the previous MLflow-specific live export with a generic,
executor-agnostic approach.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 20, 2026
@AdamRajfer AdamRajfer changed the title feat: add live MLflow export for SLURM evaluation jobs feat: add generic heartbeat for periodic script execution during evaluation Mar 20, 2026
@AdamRajfer
Contributor Author

/ok to test aa40815

from jinja2 import Environment

template_text = (
Path(__file__).parent.parent.parent
Contributor


Please check if we can use package resources instead

Contributor Author


We can. Fixed
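The package-resources fix the reviewer asks for could look like the sketch below. `load_template` is a hypothetical helper, and the package path in the comment is an assumption (the PR only names `run.template.sh`):

```python
from importlib.resources import files


def load_template(package: str, name: str) -> str:
    """Read a file shipped inside a package, without Path(__file__) traversal."""
    return files(package).joinpath(name).read_text()


# In the launcher this might become (package name assumed):
#   template_text = load_template("nemo_evaluator_launcher.executors", "run.template.sh")
#   template = Environment().from_string(template_text)
```

Unlike `Path(__file__).parent.parent.parent`, `importlib.resources.files` (Python 3.9+) resolves the resource through the package's loader, so it keeps working when the package is installed or zipped rather than run from a source checkout.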

from jinja2 import Environment

template_text = (
Path(__file__).parent.parent.parent
Contributor


Same here

Contributor Author


We can. Fixed

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@AdamRajfer
Contributor Author

/ok to test 22f21ac

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@AdamRajfer
Contributor Author

/ok to test 4207f59


Labels

documentation (Improvements or additions to documentation), nemo-evaluator-launcher, tests
