Add Benchmark

Guide for adding a new benchmark to benchmarks/ in the distributed-workloads repo.

Directory layout

Each benchmark lives in its own subdirectory under benchmarks/:

benchmarks/<benchmark-name>/
  Dockerfile              # Multi-stage build for the benchmark image
  Dockerfile.cuda         # (optional) CUDA variant
  mpi-runtime.yaml        # ClusterTrainingRuntime defining the MPI execution environment
  trainjob.yaml           # TrainJob manifest to submit the benchmark
  README.md               # Documentation (what, files, quick start, parameters, output)
  <scripts>               # (optional) Training/benchmark scripts mounted via ConfigMap

See benchmarks/osu-benchmarks/ and benchmarks/kftv2-mpi-ddp-sft/ as reference implementations.

Dockerfile conventions

Follow the multi-stage build pattern used in benchmarks/osu-benchmarks/Dockerfile:

Stage 1 (builder) - compile dependencies from source (e.g., OpenMPI, benchmark binaries)
Stage 2 (runtime) - copy built artifacts, configure SSH for MPI, set up the runtime environment

Key requirements:

Base image from quay.io/opendatahub/ or quay.io/modh/
USER 0 only during build stages; final image must use USER 1001
OpenShift GID 0 pattern: chgrp -R 0 <dir> && chmod -R g=u <dir>
Allow random UID: chmod g=u /etc/passwd
SSH setup with keys baked into /tmp/ssh/ (Training Operator does not auto-inject SSH keys)
For CUDA variants, create a separate Dockerfile.cuda extending the base

ClusterTrainingRuntime

Define a ClusterTrainingRuntime resource with MPI configuration. Key fields:

apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: <runtime-name>
spec:
  mlPolicy:
    mpi:
      mpiImplementation: OpenMPI
      sshAuthMountPath: /tmp/ssh
  template:
    spec:
      replicatedJobs:
        - name: launcher
          replicas: 1
          template: ...
        - name: worker
          replicas: <N>
          template: ...

Launcher: runs the benchmark command (mpirun/mpiexec)
Workers: run sshd and wait for MPI connections
Both need the SSH setup commands in their entrypoints

See benchmarks/osu-benchmarks/mpi-runtime-cpu.yaml for a complete example.

TrainJob

Submit benchmarks using a TrainJob with generateName (not fixed name):

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  generateName: <benchmark-name>-
  namespace: <namespace>
spec:
  runtimeRef:
    name: <runtime-name>
  trainer:
    numNodes: 2
    resourcesPerNode:
      requests:
        nvidia.com/gpu: "2"
    env:
      - name: PARAM_NAME
        value: "value"

Use trainer.env for benchmark parameters - the controller injects them into all pod containers.

See benchmarks/kftv2-mpi-ddp-sft/trainjob.yaml for a complete example.

Makefile targets

Add build/push targets to the root Makefile following the existing pattern:

BENCHMARK_VERSION ?= latest

.PHONY: build-<name>-benchmark-image
build-<name>-benchmark-image:
	$(CONTAINER_ENGINE) build -t quay.io/modh/distributed-workloads-benchmark:trainer-mpi-<name>-$(BENCHMARK_VERSION) \
	  -f benchmarks/<name>/Dockerfile benchmarks/<name>/

.PHONY: push-<name>-benchmark-image
push-<name>-benchmark-image:
	$(CONTAINER_ENGINE) push quay.io/modh/distributed-workloads-benchmark:trainer-mpi-<name>-$(BENCHMARK_VERSION)

Registry: quay.io/modh/distributed-workloads-benchmark Tag format: trainer-mpi-<name>-<version>

CI workflow

Create .github/workflows/build-and-push-<name>-benchmark.yml matching the structure in build-and-push-osu-benchmark.yml:

Trigger on push/PR when files under benchmarks/<name>/ change
Build on all branches, push only on main
Use docker/build-push-action with appropriate Dockerfile path

README

Every benchmark must include a README.md with these sections (see benchmarks/kftv2-mpi-ddp-sft/README.md):

Section	Content
Title + summary	One-line description of what the benchmark measures
What this benchmark does	Table with algorithm, model, dataset, backend, runtime, image
Files	Table mapping each file to its purpose
Quick start	Numbered steps: deploy runtime, create namespace/ConfigMap, submit TrainJob, monitor
Scaling	Table showing node/GPU configurations
Benchmark parameters	Tables for training and infrastructure parameters with defaults and impact
Expected output	Example benchmark summary output
Known issues	Documented limitations and workarounds
Cleanup	Commands to remove all created resources

Checklist

Dockerfile builds successfully: make build-<name>-benchmark-image
ClusterTrainingRuntime applies: oc apply -f benchmarks/<name>/mpi-runtime.yaml
TrainJob submits and runs: oc create -f benchmarks/<name>/trainjob.yaml
README has all required sections
Makefile targets added for build and push
CI workflow triggers on path changes to benchmarks/<name>/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add Benchmark

Directory layout

Dockerfile conventions

ClusterTrainingRuntime

TrainJob

Makefile targets

CI workflow

README

Checklist

Uh oh!

FilesExpand file tree

SKILL.md

Latest commit

History

SKILL.md

File metadata and controls

Add Benchmark

Directory layout

Dockerfile conventions

ClusterTrainingRuntime

TrainJob

Makefile targets

CI workflow

README

Checklist