feat(nvml-mock): NCCL / nccl-tests collective-comms simulation by giuliocalzo · Pull Request #412 · NVIDIA/k8s-test-infra

giuliocalzo · 2026-06-22T19:06:46Z

Summary

Adds a hardware-free NCCL collective-comms simulation so that multi-node nccl-tests-style flows run on Kind without GPUs. A 2-pod all_reduce-style run completes and reports non-zero algbw/busbw derived from the topology (NVLink for intra-node links, InfiniBand rate_gbps for inter-node links). Closes #372.
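For context on the reported numbers: nccl-tests documents busbw as algbw (size divided by elapsed time) scaled by a per-collective factor, 2(n-1)/n for all_reduce, (n-1)/n for all_gather and reduce_scatter, and 1 for broadcast and reduce. The sketch below shows that relationship in Go. The type and function names are illustrative assumptions, not the identifiers used in pkg/gpu/mocknccl/engine, and the engine's actual factors may differ.

```go
package main

import (
	"fmt"
	"time"
)

// Collective enumerates the operations covered by this sketch.
// The names are hypothetical and chosen for illustration only.
type Collective int

const (
	AllReduce Collective = iota
	AllGather
	ReduceScatter
	Broadcast
	Reduce
)

// BusBWFactor returns the algbw -> busbw multiplier for n ranks,
// using the factors documented by nccl-tests.
func BusBWFactor(c Collective, n int) float64 {
	if n < 1 {
		return 0 // no ranks, no traffic
	}
	switch c {
	case AllReduce:
		return 2 * float64(n-1) / float64(n)
	case AllGather, ReduceScatter:
		return float64(n-1) / float64(n)
	default: // Broadcast, Reduce
		return 1
	}
}

// AlgBW returns size/time in GB/s, where 1 GB = 1e9 bytes.
func AlgBW(sizeBytes int64, elapsed time.Duration) float64 {
	if elapsed <= 0 {
		return 0
	}
	return float64(sizeBytes) / elapsed.Seconds() / 1e9
}

func main() {
	// Example values only: a 1 GiB all_reduce measured at 100 ms on 2 ranks.
	alg := AlgBW(1<<30, 100*time.Millisecond)
	bus := alg * BusBWFactor(AllReduce, 2)
	fmt.Printf("algbw=%.2f GB/s busbw=%.2f GB/s\n", alg, bus)
}
```

Under these factors the all_reduce multiplier is exactly 1 for two ranks, so a 2-pod all_reduce run would report equal algbw and busbw.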

What's included

pkg/gpu/mocknccl/engine — collective types + busbw factors, a linear cost model (time = latency + size/effective_bw), profile-YAML + MOCK_NCCL_* config resolution, an MPI-free TCP rendezvous barrier, and capped collective execution.
pkg/gpu/mockcuda — host wall-clock CUDA stream/event timing engine + CGo exports (cudaStreamCreate, cudaEvent*, cudaEventElapsedTime, cudaMemset). This is the timing source the driver measures.
libnccl.so.2 bridge (pkg/gpu/mocknccl/bridge) — common NCCL ABI (version/init/comm-mgmt/group + AllReduce/AllGather/ReduceScatter/Broadcast/Reduce) delegating to the engine.
mock-coll-perf nccl-tests-style C driver + driver-facing shim headers (nccl.h, cuda_runtime.h), Makefile, and a build-tagged ABI smoke test.
Packaging — Dockerfile builds + installs the lib and driver into the nvml-mock image.
Chart — optional 2-pod Indexed Job + headless rendezvous Service (--set nccl.test.enabled=true), values.schema.json, and helm unittests.
E2E — tests/e2e/validate-nccl.sh + an nccl-multinode CI job; plus README/architecture/CHANGELOG docs.
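The linear cost model named in the engine item (time = latency + size/effective_bw) can be sketched as follows. This is an illustration under stated assumptions: LinearModel, Cost, and GbpsToBytesPerSec are hypothetical names, and the 5 µs latency and 400 Gbps rate are made-up example values, not defaults from this PR.

```go
package main

import (
	"fmt"
	"time"
)

// LinearModel is a hypothetical parameterization of
// time = latency + size/effective_bw.
type LinearModel struct {
	Latency     time.Duration // fixed per-operation cost
	EffectiveBW float64       // bytes per second; must be > 0
}

// Cost returns the modeled duration for moving sizeBytes over the link.
func (m LinearModel) Cost(sizeBytes int64) (time.Duration, error) {
	if m.EffectiveBW <= 0 {
		return 0, fmt.Errorf("effective bandwidth must be positive, got %g", m.EffectiveBW)
	}
	if sizeBytes < 0 {
		return 0, fmt.Errorf("size must be non-negative, got %d", sizeBytes)
	}
	transferSeconds := float64(sizeBytes) / m.EffectiveBW
	return m.Latency + time.Duration(transferSeconds*float64(time.Second)), nil
}

// GbpsToBytesPerSec converts a link rate in gigabits per second
// (such as an InfiniBand rate_gbps value) to bytes per second.
func GbpsToBytesPerSec(gbps float64) float64 {
	return gbps * 1e9 / 8
}

func main() {
	// Example values only; not taken from any profile in this PR.
	ib := LinearModel{
		Latency:     5 * time.Microsecond,
		EffectiveBW: GbpsToBytesPerSec(400),
	}
	d, err := ib.Cost(64 << 20)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("modeled time for 64 MiB: %v\n", d)
}
```

Because the latency term is constant, small messages are latency-bound and large messages approach the effective bandwidth, which is the shape nccl-tests output typically shows across message sizes.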

Design notes / deviations

Opaque CUDA/NCCL handles are backed by real C memory instead of pointers fabricated from integers, which keeps go vet clean.
Driver prototypes live in dedicated shim headers (nccl.h/cuda_runtime.h) so the cgo-included *_types.h stay types-only and the bridge build is unaffected.
E2E uses the branch's shell-validator + CI-job convention (the Go ginkgo harness lives in a separate, unmerged branch).

Test plan

go test full suite — pass
golangci-lint run — 0 issues; gofmt clean
helm lint + helm unittest — 7 suites / 122 tests pass
Docker builder stage: mock-coll-perf links + runs on Linux (NCCL version 22304, non-zero time(us))
ABI smoke test links against libnccl.so.2
Live nccl-multinode Kind job (runs in GitHub Actions; can't run on the macOS host)

🤖 Draft PR; opened for review/CI.

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

In Kubernetes the rendezvous address is a headless Service DNS name that ranks dial; rank 0 cannot bind() that remote name, so ncclCommInitRank failed on every pod and the Indexed Job hit its backoff limit. Rank 0 now listens on the local wildcard interface for rdzvAddr's port while other ranks still dial the full address. Adds an end-to-end Rendezvous test covering the rank 0 listen path that previously had no coverage. Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
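The address split described in this fix can be sketched roughly as below: rank 0 keeps only the port and binds the wildcard interface, while every other rank dials the full address. The function name rendezvousEndpoint and the Service name and port in main are hypothetical; the real change lives in the rendezvous code of this PR and may be structured differently.

```go
package main

import (
	"fmt"
	"net"
)

// rendezvousEndpoint returns the address a rank should use and whether
// that rank listens (rank 0) or dials (all other ranks).
func rendezvousEndpoint(rank int, rdzvAddr string) (string, bool, error) {
	host, port, err := net.SplitHostPort(rdzvAddr)
	if err != nil {
		return "", false, fmt.Errorf("invalid rendezvous address %q: %w", rdzvAddr, err)
	}
	if rank == 0 {
		// The Service DNS name is not a local address, so bind every
		// local interface on the rendezvous port instead.
		return net.JoinHostPort("", port), true, nil
	}
	return net.JoinHostPort(host, port), false, nil
}

func main() {
	// Hypothetical headless Service name and port, for illustration only.
	const rdzv = "nccl-rdzv.default.svc.cluster.local:29500"
	for rank := 0; rank < 2; rank++ {
		addr, listen, err := rendezvousEndpoint(rank, rdzv)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("rank %d: addr=%q listen=%v\n", rank, addr, listen)
	}
}
```

The returned listen address would then be passed to net.Listen("tcp", addr) on rank 0 and the dial address to net.Dial("tcp", addr) on the other ranks.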

giuliocalzo added 17 commits June 22, 2026 17:46

feat(mocknccl): collective types and busbw factors

cb7ab91

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

feat(mocknccl): linear cost model

aad5dc6

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

feat(mocknccl): profile + env config and model resolution

3079b46

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

feat(mocknccl): MPI-free TCP rendezvous barrier

c4f56c1

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

feat(mocknccl): comm state and capped collective execution

b8c6b26

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

feat(mockcuda): host wall-clock CUDA stream and event timing

375c722

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

feat(mockcuda): bridge exports for streams and timing events

03732a2

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

feat(mocknccl): libnccl.so.2 bridge with common collective surface

2fe0e57

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

build(mocknccl): Makefile and libnccl ABI smoke test

f3f939e

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

build: ignore versioned shared library artifacts

8556fb5

Broaden the *.so ignore rule to *.so.* so versioned c-shared outputs (e.g. libnccl.so.2, libnccl.so.2.23.4) and the cgo-generated header are not accidentally committed. Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

feat(mocknccl): nccl-tests-style mock-coll-perf driver

6915bd9

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

build(nvml-mock): package mock libnccl and mock-coll-perf driver

e100da1

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

feat(chart): optional 2-pod NCCL collective-comms test Job

2e2a8b1

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

test(e2e): 2-pod mock NCCL busbw validator and multi-node CI job

f8191b0

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

docs(mocknccl): document NCCL collective-comms simulation

ebee14d

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

fix(mocknccl): check deferred Close errors in rendezvous

af06880

Satisfy golangci-lint errcheck by wrapping the deferred listener and connection Close calls, matching the repo convention. Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

giuliocalzo wants to merge 17 commits into NVIDIA:main from giuliocalzo:feat/nccl-collective-comms
