Headless install of workloads not supported by gpu_type gives cryptic error message #27

@keyvaann

Describe the bug
I'm trying to install the microbenchmarks in headless mode, but they don't seem to be recognized. Because of #26, I'm using v25.10 for now: it can install the other workloads, but it fails for the microbenchmarks:

=== HEADLESS INSTALLATION MODE ===
✓ Configuration loaded from: /cm-tests/.../dgx-automation/config/dgx-headless-config-gb300.yaml
Environment type: uv
Install path: /cm-tests/dgxc-benchmarking/gb300/workloads
GPU type: gb300
Node architecture: aarch64
Install method: local
Selected workloads: pretrain_nemotron4-15b, pretrain_nemotron4-340b, pretrain_llama3.1, pretrain_deepseek-v3, pretrain_grok1, pretrain_nemotron-h, microbenchmark_cpu_overhead, microbenchmark_nccl

Development mode: Using repository at /cm-tests/dgxc-benchmarking/gb300
Error: Selected workloads not found: ['microbenchmark_cpu_overhead', 'microbenchmark_nccl']
Custom script failed for run gb300, version v25.10
Preparation step failed with code 1.

I also tried other names, such as nccl and microbenchmark-nccl, but none of them worked.

Steps/Code to reproduce bug
Here is my headless play file:

venv_type: uv
install_path: /cm-tests/dgxc-benchmarking/gb300/workloads
slurm_info:
  slurm:
    account: root
    gpu_partition: main
    cpu_partition: main
    gpu_partition_gres: 8
    cpu_partition_gres: null
    node_architecture: aarch64
gpu_type: gb300
node_architecture: aarch64
install_method: local
selected_workloads:
  - pretrain_nemotron4-15b
  - pretrain_nemotron4-340b
  - pretrain_llama3.1
  - pretrain_deepseek-v3
  - pretrain_grok1
  - pretrain_nemotron-h
  - microbenchmark_cpu_overhead
  - microbenchmark_nccl
env_vars:
  HF_TOKEN: hf_

I run it with this command: ./install.sh --play config.yaml -v -d

Expected behavior
The installation should succeed. If it cannot, the error should clearly state the cause, for example that a selected workload is not supported for the configured gpu_type.
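
To illustrate the kind of message that would help, here is a minimal Python sketch. It is not taken from the installer: the catalog `SUPPORTED_GPUS`, the function `check_workloads`, and the GPU lists are all hypothetical placeholders, and I don't know which GPU types the microbenchmarks actually support. The point is only that "unknown name" and "known, but not available for this gpu_type" could be reported separately.

```python
# Hypothetical catalog: workload name -> GPU types it supports.
# These entries are placeholders for illustration, not the real support matrix.
SUPPORTED_GPUS = {
    "pretrain_llama3.1": {"h100", "gb200", "gb300"},
    "microbenchmark_nccl": {"h100"},
}


def check_workloads(selected, gpu_type, catalog=SUPPORTED_GPUS):
    """Return a list of human-readable problems; empty if everything is fine."""
    problems = []
    for name in selected:
        if name not in catalog:
            # The name does not exist at all (typo or wrong naming scheme).
            problems.append(f"'{name}': unknown workload name")
        elif gpu_type not in catalog[name]:
            # The name is valid, but not for the configured GPU type.
            supported = ", ".join(sorted(catalog[name]))
            problems.append(
                f"'{name}': not supported on gpu_type '{gpu_type}' "
                f"(supported: {supported})"
            )
    return problems


if __name__ == "__main__":
    selected = ["pretrain_llama3.1", "microbenchmark_nccl", "nccl"]
    for problem in check_workloads(selected, "gb300"):
        print("Error:", problem)
```

With a message along these lines, it would have been clear right away whether the workload names were wrong or whether the workloads are simply unavailable for gb300.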

Environment details

Environment location: Cloud(Nebius)
Method of DGXC Benchmarking install: From source with UV


Labels: enhancement (New feature or request)
