feat(nvml-mock): NCCL / nccl-tests collective-comms simulation #412

Draft
giuliocalzo wants to merge 17 commits into NVIDIA:main from giuliocalzo:feat/nccl-collective-comms

Conversation

@giuliocalzo

Contributor

Summary

Hardware-free NCCL collective-comms simulation so multi-node nccl-tests-style flows run on Kind without GPUs. A 2-pod all_reduce-style run completes and reports non-zero algbw/busbw derived from the topology (NVLink intra-node, InfiniBand rate_gbps inter-node). Closes #372.

What's included

  • pkg/gpu/mocknccl/engine — collective types + busbw factors, a linear cost model (time = latency + size/effective_bw), profile-YAML + MOCK_NCCL_* config resolution, an MPI-free TCP rendezvous barrier, and capped collective execution.
  • pkg/gpu/mockcuda — host wall-clock CUDA stream/event timing engine + CGo exports (cudaStreamCreate, cudaEvent*, cudaEventElapsedTime, cudaMemset). This is the timing source the driver measures.
  • libnccl.so.2 bridge (pkg/gpu/mocknccl/bridge) — common NCCL ABI (version/init/comm-mgmt/group + AllReduce/AllGather/ReduceScatter/Broadcast/Reduce) delegating to the engine.
  • mock-coll-perf nccl-tests-style C driver + driver-facing shim headers (nccl.h, cuda_runtime.h), Makefile, and a build-tagged ABI smoke test.
  • Packaging — Dockerfile builds + installs the lib and driver into the nvml-mock image.
  • Chart — optional 2-pod Indexed Job + headless rendezvous Service (--set nccl.test.enabled=true), values.schema.json, and helm unittests.
  • E2E: tests/e2e/validate-nccl.sh + an nccl-multinode CI job; plus README/architecture/CHANGELOG docs.

Design notes / deviations

  • Opaque CUDA/NCCL handles are backed by real C memory (vet-clean) instead of fabricating pointers from integers.
  • Driver prototypes live in dedicated shim headers (nccl.h/cuda_runtime.h) so the cgo-included *_types.h stay types-only and the bridge build is unaffected.
  • E2E uses the branch's shell-validator + CI-job convention (the Go ginkgo harness lives in a separate, unmerged branch).

Test plan

  • go test full suite — pass
  • golangci-lint run — 0 issues; gofmt clean
  • helm lint + helm unittest — 7 suites / 122 tests pass
  • Docker builder stage: mock-coll-perf links + runs on Linux (NCCL version 22304, non-zero time(us))
  • ABI smoke test links against libnccl.so.2
  • Live nccl-multinode Kind job (runs in GitHub Actions; can't run on the macOS host)

🤖 Draft PR; opened for review/CI.

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
Broaden the *.so ignore rule to *.so.* so versioned c-shared outputs
(e.g. libnccl.so.2, libnccl.so.2.23.4) and the cgo-generated header are
not accidentally committed.

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
Satisfy golangci-lint errcheck by wrapping the deferred listener and
connection Close calls, matching the repo convention.

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
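
For readers unfamiliar with the errcheck pattern mentioned in this commit, a minimal Go sketch follows. The exact wrapper form used by the repo convention is not shown in this PR, so the closure below is an assumption, and `listenOnce` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net"
)

// listenOnce opens a listener on an ephemeral loopback port, reports the
// bound address, and closes the listener on return.
func listenOnce() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	// A bare `defer ln.Close()` silently drops the returned error.
	// Wrapping the call in a closure makes the discard explicit.
	defer func() { _ = ln.Close() }()

	return ln.Addr().String(), nil
}

func main() {
	addr, err := listenOnce()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	fmt.Println("listened on", addr)
}
```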
In Kubernetes the rendezvous address is a headless Service DNS name that
ranks dial; rank 0 cannot bind() that remote name, so ncclCommInitRank
failed on every pod and the Indexed Job hit its backoff limit. Rank 0 now
listens on the local wildcard interface for rdzvAddr's port while other
ranks still dial the full address. Adds an end-to-end Rendezvous test
covering the rank 0 listen path that previously had no coverage.

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
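
The fix described in this commit can be sketched as follows: rank 0 keeps only the port of rdzvAddr and binds the wildcard interface, while every other rank dials the full address. This is an illustration under assumptions, not the PR's code; `listenAddr`, `endpoint`, and the Service DNS name in `main` are hypothetical.

```go
package main

import (
	"fmt"
	"net"
)

// listenAddr returns the address rank 0 should bind: the local wildcard
// interface on the port taken from rdzvAddr.
func listenAddr(rdzvAddr string) (string, error) {
	_, port, err := net.SplitHostPort(rdzvAddr)
	if err != nil {
		return "", fmt.Errorf("parse rendezvous address %q: %w", rdzvAddr, err)
	}
	return net.JoinHostPort("", port), nil
}

// endpoint reports the address a rank should use and whether it listens
// (rank 0) or dials (all other ranks).
func endpoint(rank int, rdzvAddr string) (addr string, listen bool, err error) {
	if rank == 0 {
		addr, err = listenAddr(rdzvAddr)
		return addr, true, err
	}
	return rdzvAddr, false, nil
}

func main() {
	const rdzv = "nccl-rdzv.default.svc.cluster.local:29500" // hypothetical Service name and port
	for rank := 0; rank < 2; rank++ {
		addr, listen, err := endpoint(rank, rdzv)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("rank %d: listen=%v addr=%q\n", rank, listen, addr)
	}
}
```

Binding the wildcard address avoids resolving the Service name on rank 0 at all, which is why the pod no longer needs that name to map to a local interface.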