fix: move auto-export to separate CPU-only sbatch job#901

Merged
marta-sd merged 21 commits into main from arajfer/fix-export-cpu-partition
Apr 15, 2026

Conversation

@AdamRajfer
Contributor

Auto-export (MLflow, wandb) was running inline in the GPU evaluation sbatch job via srun --overlap, wasting GPU time on I/O-bound uploads.

Changes:

  • Generate a separate export.sbatch script with no GPU request
  • Submit it via sbatch from the eval script on success
  • Mount the full invocation dir + output_dir (fixes mount mismatch that caused MLflow export to fail because --job-dirs path was not accessible inside the export container)
  • Support auto_export.partition config to target a CPU partition

The export job runs after the GPU job completes and releases GPUs immediately, instead of holding them during artifact uploads.
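The flow described above can be sketched as follows. This is an illustrative sketch, not the actual nemo-evaluator-launcher implementation: the function name, config fields, and CLI invocation inside the script are assumptions; only the overall shape (a CPU-only `export.sbatch` with no GPU request, submitted via `sbatch` on eval success) comes from the PR description.

```python
# Hypothetical sketch of the PR's approach: render a CPU-only export.sbatch
# (no --gpus request) and submit it after the GPU eval job succeeds.
# All names here are illustrative, not the real launcher API.
from pathlib import Path

def write_export_sbatch(output_dir: str, partition: str, destinations: list[str]) -> Path:
    """Render a CPU-only sbatch script that uploads eval artifacts."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --partition={partition}",  # CPU partition: note no GPU request
        "#SBATCH --job-name=auto-export",
        # Placeholder export command; the real CLI invocation may differ.
        f"nemo-evaluator-launcher export {output_dir} --dest {' '.join(destinations)}",
    ]
    script = Path(output_dir) / "export.sbatch"
    script.write_text("\n".join(lines) + "\n")
    return script
```

The eval script would then run something like `sbatch "$EXPORT_SCRIPT"` only on success, so GPUs are released before the I/O-bound uploads start.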


@marta-sd marta-sd left a comment


Two small comments

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
- When top-level 'export' section is missing or empty, fall back to
  per-destination config from auto_export (e.g. auto_export.mlflow)
  so the exporter has tracking_uri/entity/project.
- Use --ignore-installed in default pip install command to avoid
  distutils conflicts (e.g. blinker) in pre-built container images.
- Add interrupted marker check from main (SIGTERM handling).
- Add tests for empty export fallback, interrupted marker, and
  --ignore-installed default.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
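The config fallback in the commit above can be sketched like this. The config shape (`export` vs. `auto_export` sections keyed by destination) is inferred from the commit message; field names beyond `tracking_uri` are illustrative.

```python
# Sketch of the fallback: when the top-level "export" section is missing
# or empty for a destination, fall back to the per-destination config
# under auto_export (e.g. auto_export.mlflow), so the exporter still
# gets tracking_uri/entity/project. Config shape is illustrative.
def resolve_export_config(config: dict, destination: str) -> dict:
    export_cfg = config.get("export") or {}
    if export_cfg.get(destination):
        return export_cfg[destination]
    # Fall back to auto_export.<destination> when export.<destination> is absent.
    return (config.get("auto_export") or {}).get(destination, {})
```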
Reorganize the function to follow main's structure more closely
and reduce diff noise. No behavior changes.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@AdamRajfer
Contributor Author

/ok to test af58520

Add partition, image, and launcher_install_cmd fields under
auto_export in the default slurm execution config. This documents
available options and allows users to target a CPU partition for
export jobs.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
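A hedged sketch of what the resulting config section might look like. The key names (`partition`, `image`, `launcher_install_cmd`) come from the commit message above; all values, including the package name in the install command, are placeholders.

```yaml
# Illustrative values only; key names are from the commit above.
auto_export:
  partition: cpu_short                # CPU partition for the export sbatch job
  image: my-registry/export:latest    # container image for the export job (placeholder)
  launcher_install_cmd: pip install --ignore-installed nemo-evaluator-launcher
```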
Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
…titions

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@AdamRajfer
Contributor Author

/ok to test d27878d

Clusters like HSG require a GPU spec on the batch partition. When
cpu_partition is not set, the export sbatch falls back to batch and
requests --gpus 1 to avoid rejection. When cpu_partition is set, no
GPU flag is added.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
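The fallback logic above can be sketched as follows; the function name and the literal `batch` default are illustrative, while the flag behavior is from the commit message.

```python
# Sketch: without cpu_partition, submit to "batch" and request one GPU so
# clusters that require a GPU spec (e.g. HSG) accept the job; with
# cpu_partition set, emit no GPU flag at all. Names are illustrative.
from typing import Optional

def export_sbatch_directives(cpu_partition: Optional[str]) -> list[str]:
    if cpu_partition:
        return [f"#SBATCH --partition={cpu_partition}"]
    return ["#SBATCH --partition=batch", "#SBATCH --gpus=1"]
```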
…partition

HSG requires a GPU spec at the sbatch level, not just the srun level.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
The HSG CPU partition requires an explicit --gpus 0 in srun to avoid
inheriting the job's GPU gres. Without it, srun fails with "Invalid gres".

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
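That srun detail can be sketched as a small argument builder; the function name is illustrative, and the `--gpus 0` behavior is from the commit above.

```python
# Sketch: on a CPU partition, pass an explicit "--gpus 0" so srun does
# not inherit the allocation's GPU gres (which fails with "Invalid gres"
# on clusters like HSG). Names are illustrative.
from typing import Optional

def export_srun_args(cpu_partition: Optional[str]) -> list[str]:
    args = ["srun"]
    if cpu_partition:
        args += ["--gpus", "0"]  # explicitly request zero GPUs on CPU partitions
    return args
```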
Clusters like HSG have minimum GPU QOS requirements on the batch
partition that make the fallback impractical. Users on such clusters
must set cpu_partition explicitly.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
Signed-off-by: Adam Rajfer <arajfer@nvidia.com>

Labels: documentation (Improvements or additions to documentation), nemo-evaluator-launcher, tests


2 participants