fix: move auto-export to separate CPU-only sbatch job#901

Merged
marta-sd merged 21 commits into main from arajfer/fix-export-cpu-partition
Apr 15, 2026

Conversation

@AdamRajfer
Contributor

Auto-export (MLflow, wandb) was running inline in the GPU evaluation sbatch job via srun --overlap, wasting GPU time on I/O-bound uploads.

Changes:

  • Generate a separate export.sbatch script with no GPU request
  • Submit it via sbatch from the eval script on success
  • Mount the full invocation dir + output_dir (fixes mount mismatch that caused MLflow export to fail because --job-dirs path was not accessible inside the export container)
  • Support auto_export.partition config to target a CPU partition

The export job runs after the GPU job completes and releases GPUs immediately, instead of holding them during artifact uploads.
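The flow described above can be sketched as follows. This is an illustrative sketch, not the actual nemo-evaluator-launcher implementation: the function name, config fields, and CLI invocation inside the script are assumptions; only the overall shape (a CPU-only `export.sbatch` with no GPU request, submitted via `sbatch` on eval success) comes from the PR description.

```python
# Hypothetical sketch of the PR's approach: render a CPU-only export.sbatch
# (no --gpus request) and submit it after the GPU eval job succeeds.
# All names here are illustrative, not the real launcher API.
from pathlib import Path

def write_export_sbatch(output_dir: str, partition: str, destinations: list[str]) -> Path:
    """Render a CPU-only sbatch script that uploads eval artifacts."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --partition={partition}",  # CPU partition: note no GPU request
        "#SBATCH --job-name=auto-export",
        # Placeholder export command; the real CLI invocation may differ.
        f"nemo-evaluator-launcher export {output_dir} --dest {' '.join(destinations)}",
    ]
    script = Path(output_dir) / "export.sbatch"
    script.write_text("\n".join(lines) + "\n")
    return script
```

The eval script would then run something like `sbatch "$EXPORT_SCRIPT"` only on success, so GPUs are released before the I/O-bound uploads start.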


@marta-sd marta-sd left a comment


Two small comments

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
- When top-level 'export' section is missing or empty, fall back to
  per-destination config from auto_export (e.g. auto_export.mlflow)
  so the exporter has tracking_uri/entity/project.
- Use --ignore-installed in default pip install command to avoid
  distutils conflicts (e.g. blinker) in pre-built container images.
- Add interrupted marker check from main (SIGTERM handling).
- Add tests for empty export fallback, interrupted marker, and
  --ignore-installed default.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
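The config fallback in the commit above can be sketched like this. The config shape (`export` vs. `auto_export` sections keyed by destination) is inferred from the commit message; field names beyond `tracking_uri` are illustrative.

```python
# Sketch of the fallback: when the top-level "export" section is missing
# or empty for a destination, fall back to the per-destination config
# under auto_export (e.g. auto_export.mlflow), so the exporter still
# gets tracking_uri/entity/project. Config shape is illustrative.
def resolve_export_config(config: dict, destination: str) -> dict:
    export_cfg = config.get("export") or {}
    if export_cfg.get(destination):
        return export_cfg[destination]
    # Fall back to auto_export.<destination> when export.<destination> is absent.
    return (config.get("auto_export") or {}).get(destination, {})
```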
Reorganize the function to follow main's structure more closely
and reduce diff noise. No behavior changes.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@AdamRajfer
Contributor Author

/ok to test af58520

Add partition, image, and launcher_install_cmd fields under
auto_export in the default slurm execution config. This documents
available options and allows users to target a CPU partition for
export jobs.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
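A hedged sketch of what the resulting config section might look like. The key names (`partition`, `image`, `launcher_install_cmd`) come from the commit message above; all values, including the package name in the install command, are placeholders.

```yaml
# Illustrative values only; key names are from the commit above.
auto_export:
  partition: cpu_short                # CPU partition for the export sbatch job
  image: my-registry/export:latest    # container image for the export job (placeholder)
  launcher_install_cmd: pip install --ignore-installed nemo-evaluator-launcher
```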
Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
…titions

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@AdamRajfer
Contributor Author

/ok to test d27878d

Clusters like HSG require a GPU spec on the batch partition. When
cpu_partition is not set, the export sbatch falls back to batch and
requests --gpus 1 to avoid rejection. When cpu_partition is set, no
GPU flag is added.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
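The fallback logic above can be sketched as follows; the function name and the literal `batch` default are illustrative, while the flag behavior is from the commit message.

```python
# Sketch: without cpu_partition, submit to "batch" and request one GPU so
# clusters that require a GPU spec (e.g. HSG) accept the job; with
# cpu_partition set, emit no GPU flag at all. Names are illustrative.
from typing import Optional

def export_sbatch_directives(cpu_partition: Optional[str]) -> list[str]:
    if cpu_partition:
        return [f"#SBATCH --partition={cpu_partition}"]
    return ["#SBATCH --partition=batch", "#SBATCH --gpus=1"]
```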
…partition

HSG requires a GPU spec at the sbatch level, not just the srun level.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
The HSG CPU partition requires an explicit --gpus 0 in srun to avoid
inheriting the job's GPU gres. Without it, srun fails with "Invalid gres".

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
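That srun detail can be sketched as a small argument builder; the function name is illustrative, and the `--gpus 0` behavior is from the commit above.

```python
# Sketch: on a CPU partition, pass an explicit "--gpus 0" so srun does
# not inherit the allocation's GPU gres (which fails with "Invalid gres"
# on clusters like HSG). Names are illustrative.
from typing import Optional

def export_srun_args(cpu_partition: Optional[str]) -> list[str]:
    args = ["srun"]
    if cpu_partition:
        args += ["--gpus", "0"]  # explicitly request zero GPUs on CPU partitions
    return args
```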
Clusters like HSG have minimum GPU QOS requirements on the batch
partition that make the fallback impractical. Users on such clusters
must set cpu_partition explicitly.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
Signed-off-by: Adam Rajfer <arajfer@nvidia.com>

Labels: documentation (Improvements or additions to documentation), nemo-evaluator-launcher, tests


2 participants