feat: add generic heartbeat for periodic script execution during evaluation #876
Open
AdamRajfer wants to merge 4 commits into main
When auto_export includes "mlflow" and export.mlflow.live is true (default), the SLURM executor tracks the job lifecycle in MLflow in real time: deploying → server_healthy → evaluating → completed/failed/timeout

A standalone mlflow_live.py script runs inside the mlflow container via srun at each lifecycle stage. A background heartbeat uploads artifacts every 5 minutes, giving live visibility into running evaluations.

When live is false, the same container-based export runs once after evaluation completes (no pip-installing the launcher). Set export.mlflow.live to false to disable live tracking and use only the final export.

Key changes:
- exporters/mlflow_live.py: standalone MLflow export script (init, update, cancel, finalize modes) that reads export_config.yml
- executor.py: injects lifecycle hooks into the sbatch script (signal traps for USR1/TERM/EXIT, heartbeat loop, status updates); mlflow is filtered from the old auto-export destinations
- Artifacts are uploaded with the same exclusion patterns and directory structure as the existing MLflow exporter
- Finalize runs before server cleanup to avoid a walltime race

The user config API is unchanged: same export.mlflow.* keys.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
Contributor Author
/ok to test 05e85eb
…uation
Adds a heartbeat mechanism that runs a user-provided script periodically
during evaluation jobs. Supported on all executors (SLURM, local, lepton).
Configuration:
    execution:
      heartbeat:
        script: 'echo heartbeat'
        interval: 60
        container: null
The heartbeat starts before evaluation and stops after completion.
Script failures are logged but do not affect the evaluation.
Replaces the previous MLflow-specific live export with a generic,
executor-agnostic approach.
Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
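A minimal sketch of how the three heartbeat keys above could be read with defaults applied. It assumes heartbeat is nested under execution (the indentation is inferred from the key order), takes the default interval of 300 seconds from the PR description, and uses the hypothetical names HeartbeatConfig and parse_heartbeat, which are not code from this PR.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class HeartbeatConfig:
    """Hypothetical holder for the execution.heartbeat keys."""

    script: Optional[str] = None     # None (YAML null) disables the heartbeat
    interval: int = 300              # seconds between runs (default per the PR description)
    container: Optional[str] = None  # optional container to run the script in

    @property
    def enabled(self) -> bool:
        return self.script is not None


def parse_heartbeat(config: Mapping[str, Any]) -> HeartbeatConfig:
    """Read execution.heartbeat from a parsed config mapping, applying defaults."""
    raw = (config.get("execution") or {}).get("heartbeat") or {}
    interval = raw.get("interval")
    hb = HeartbeatConfig(
        script=raw.get("script"),
        interval=300 if interval is None else int(interval),
        container=raw.get("container"),
    )
    if hb.interval <= 0:
        raise ValueError("heartbeat interval must be a positive number of seconds")
    return hb
```

With the example from the commit message, parse_heartbeat returns an enabled config with a 60-second interval; with no heartbeat section at all, it returns a disabled config that still carries the 300-second default.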
Contributor Author
/ok to test aa40815
marta-sd
reviewed
Mar 20, 2026
    from jinja2 import Environment

    template_text = (
        Path(__file__).parent.parent.parent
Contributor
Please check if we can use package resources instead
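For context, a minimal sketch of what loading a bundled file through package resources can look like, using the standard-library importlib.resources.files API (Python 3.9+). The helper name read_packaged_text is hypothetical, and the real package and template path for this PR are not shown in the snippet above, so none are assumed here.

```python
from importlib import resources


def read_packaged_text(package: str, *parts: str) -> str:
    """Read a text file shipped inside an importable package.

    Unlike Path(__file__).parent chains, this does not depend on the
    calling module's position in the source tree.
    """
    resource = resources.files(package)
    for part in parts:
        # Descend one path component at a time.
        resource = resource.joinpath(part)
    return resource.read_text(encoding="utf-8")
```

The returned string could then be passed to jinja2's Environment().from_string(...) in place of text read via a relative filesystem path.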
Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
Contributor Author
/ok to test 22f21ac
Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
Contributor Author
/ok to test 4207f59
marta-sd
approved these changes
Apr 9, 2026
Summary
Adds a heartbeat mechanism that runs a user-provided script periodically during evaluation jobs. Supported on all executors (SLURM, local, lepton).
Configuration
| Field | Default | Description |
| --- | --- | --- |
| script | null | Command to run. If null, heartbeat is disabled. |
| interval | 300 | Seconds between runs. |
| container | null | Optional container to run the script in. |

How it works
When script is set, the executor starts a background loop that runs the command every interval seconds. The loop runs alongside the evaluation and stops automatically when the evaluation completes or fails. Script failures are logged but do not affect the evaluation.

Changes
- executor.py (SLURM): heartbeat background loop in sbatch script, with optional container support via srun
- run.template.sh (local): heartbeat background loop in bash template
- executor.py (lepton): heartbeat background loop in launch script
- examples/slurm_heartbeat.yaml

Removed
- exporters/mlflow_live.py and tests: replaced by the generic heartbeat

Test plan
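A minimal sketch of one check that fits this section: it exercises the documented behaviour that a failing heartbeat script is logged but does not affect the evaluation. It models the heartbeat with a Python thread rather than the bash background loop this PR adds, and the names heartbeat and evaluate_with_failing_heartbeat are hypothetical, not code from the PR.

```python
import logging
import subprocess
import threading
import time
from contextlib import contextmanager

logger = logging.getLogger("heartbeat")


@contextmanager
def heartbeat(script, interval):
    """Run `script` every `interval` seconds in the background until the block exits."""
    stop = threading.Event()
    return_codes = []  # recorded so a test can inspect what happened

    def loop():
        while not stop.is_set():
            try:
                result = subprocess.run(script, shell=True, capture_output=True, text=True)
                return_codes.append(result.returncode)
                if result.returncode != 0:
                    # Failures are logged only; they never propagate to the caller.
                    logger.warning("heartbeat script exited with %d", result.returncode)
            except Exception:
                logger.exception("heartbeat script could not be run")
            stop.wait(interval)  # sleeps, but wakes immediately on stop

    thread = None
    if script is not None:  # a null script disables the heartbeat
        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
    try:
        yield return_codes
    finally:
        stop.set()
        if thread is not None:
            thread.join()


def evaluate_with_failing_heartbeat():
    """Stand-in evaluation that runs while a heartbeat script keeps failing."""
    with heartbeat("exit 3", interval=0.05) as codes:
        time.sleep(0.3)  # placeholder for the real evaluation work
        result = "evaluation finished"
    return result, list(codes)
```

The stand-in evaluation should return its result unchanged even though every heartbeat run exits non-zero, and passing script=None should start no background thread at all.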