fix: move auto-export to separate CPU-only sbatch job #901
Merged
Conversation
marta-sd reviewed on Apr 10, 2026
Auto-export (MLflow, wandb) was running inline in the GPU evaluation sbatch job via srun --overlap, wasting GPU time on I/O-bound uploads.

Changes:
- Generate a separate export.sbatch script with no GPU request
- Submit it via sbatch from the eval script on success
- Mount the full invocation dir + output_dir (fixes a mount mismatch that caused MLflow export to fail because the --job-dirs path was not accessible inside the export container)
- Support auto_export.partition config to target a CPU partition

The export job runs after the GPU job completes, releasing GPUs immediately instead of holding them during artifact uploads.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
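The PR does not include the generated script itself, but the description above implies roughly this shape. Below is a minimal sketch of what a generated CPU-only export.sbatch could look like; all names (`export.sbatch`, `EXPORT_IMAGE`, `INVOCATION_DIR`, `OUTPUT_DIR`, `python -m exporter`) are illustrative assumptions, not the actual generated script.

```shell
#!/bin/bash
# Sketch: generate a CPU-only export.sbatch (hypothetical names throughout).
# The quoted 'EOF' keeps the $VARS unexpanded so they resolve at job runtime.
cat > export.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=auto-export
#SBATCH --partition=cpu
#SBATCH --ntasks=1
# Note: no --gpus request; the export job is I/O-bound and needs no GPUs.
# Mount both the invocation dir and output_dir so the --job-dirs path
# is visible inside the export container (the mount-mismatch fix above).
srun --container-image="$EXPORT_IMAGE" \
     --container-mounts="$INVOCATION_DIR:$INVOCATION_DIR,$OUTPUT_DIR:$OUTPUT_DIR" \
     python -m exporter --job-dirs "$OUTPUT_DIR"
EOF
echo "generated export.sbatch"
```

The eval script would then submit this with `sbatch export.sbatch` only after the GPU job succeeds.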
Force-pushed from 186f693 to 15aed72
Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
- When the top-level 'export' section is missing or empty, fall back to per-destination config from auto_export (e.g. auto_export.mlflow) so the exporter has tracking_uri/entity/project.
- Use --ignore-installed in the default pip install command to avoid distutils conflicts (e.g. blinker) in pre-built container images.
- Add interrupted marker check from main (SIGTERM handling).
- Add tests for the empty-export fallback, the interrupted marker, and the --ignore-installed default.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
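The two mechanical changes above can be sketched as follows. This is an illustrative shell sketch, not the actual launcher code: the config is stood in for by environment variables (`EXPORT_MLFLOW_TRACKING_URI`, `AUTO_EXPORT_MLFLOW_TRACKING_URI`), and the function names are hypothetical.

```shell
#!/bin/bash
# Fallback sketch: prefer the top-level export section; if it is empty,
# fall back to the per-destination auto_export config (e.g. auto_export.mlflow).
resolve_tracking_uri() {
  echo "${EXPORT_MLFLOW_TRACKING_URI:-$AUTO_EXPORT_MLFLOW_TRACKING_URI}"
}

# Default install command sketch: --ignore-installed lets pip replace
# distutils-installed packages (e.g. blinker) baked into pre-built images,
# instead of failing with "Cannot uninstall ..." conflicts.
default_install_cmd() {
  echo "pip install --ignore-installed -r requirements.txt"
}
```

The same prefer-then-fallback pattern would apply to entity and project, and to other destinations such as wandb.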
Reorganize the function to follow main's structure more closely and reduce diff noise. No behavior changes. Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
Contributor
Author
/ok to test af58520
Add partition, image, and launcher_install_cmd fields under auto_export in the default slurm execution config. This documents available options and allows users to target a CPU partition for export jobs. Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
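As a sketch of how those documented fields might appear in the default slurm execution config (field names taken from the commit message above; the null defaults and comments are illustrative assumptions):

```yaml
auto_export:
  partition: null            # optional CPU partition for the export job
  image: null                # optional container image for the export job
  launcher_install_cmd: null # optional install command run before export
```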
Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
…titions

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
Contributor
Author
/ok to test d27878d
Clusters like HSG require a GPU spec on the batch partition. When cpu_partition is not set, the export sbatch falls back to batch and requests --gpus 1 to avoid rejection. When cpu_partition is set, no GPU flag is added.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
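That fallback logic can be sketched as a small shell function; the variable name `CPU_PARTITION` and the function name are illustrative stand-ins for however the launcher reads auto_export config:

```shell
#!/bin/bash
# Sketch of the partition/GPU-flag selection described above.
export_sbatch_flags() {
  if [ -n "$CPU_PARTITION" ]; then
    # Dedicated CPU partition configured: no GPU flag needed.
    echo "--partition=$CPU_PARTITION"
  else
    # Fall back to the batch partition; clusters like HSG reject
    # GPU-less jobs there, so request a single GPU to avoid rejection.
    echo "--partition=batch --gpus 1"
  fi
}
```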
…partition

HSG requires the GPU spec at the sbatch level, not just the srun level.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
The HSG CPU partition requires an explicit --gpus 0 in srun to avoid inheriting GPU gres from the job allocation. Without it, srun fails with "Invalid gres".

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
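A sketch of that srun-level guard, with the same caveats as above (`CPU_PARTITION`, the function name, and `python -m exporter` are all illustrative, not the launcher's actual identifiers):

```shell
#!/bin/bash
# Sketch: on a CPU partition, pass an explicit --gpus 0 so the srun step
# does not inherit GPU gres from the allocation ("Invalid gres" otherwise).
export_srun_cmd() {
  if [ -n "$CPU_PARTITION" ]; then
    echo "srun --gpus 0 python -m exporter"
  else
    echo "srun python -m exporter"
  fi
}
```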
Clusters like HSG have minimum GPU QOS requirements on the batch partition that make the fallback impractical. Users on such clusters must set cpu_partition explicitly.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
marta-sd approved these changes on Apr 15, 2026