feat(nvml-mock): NCCL / nccl-tests collective-comms simulation #412
Draft
giuliocalzo wants to merge 17 commits into
Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
Broaden the *.so ignore rule to *.so.* so versioned c-shared outputs (e.g. libnccl.so.2, libnccl.so.2.23.4) and the cgo-generated header are not accidentally committed. Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
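To illustrate why the bare `*.so` rule misses versioned shared objects, here is a minimal sketch. It uses Go's `filepath.Match` as a stand-in for gitignore globbing, which is an assumption that holds only for simple, separator-free file names like these; the helper name `ignored` is hypothetical and not part of the PR.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// ignored reports whether name matches any of the glob patterns.
// filepath.Match only approximates gitignore semantics; it is used here
// purely to illustrate the two patterns discussed in the commit message.
func ignored(name string, patterns ...string) bool {
	for _, p := range patterns {
		if ok, err := filepath.Match(p, name); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	for _, name := range []string{"libnccl.so", "libnccl.so.2", "libnccl.so.2.23.4"} {
		fmt.Printf("%-20s *.so only: %-5v  with *.so.*: %v\n",
			name, ignored(name, "*.so"), ignored(name, "*.so", "*.so.*"))
	}
}
```

Only the unversioned name matches `*.so`; the versioned names need the `*.so.*` pattern.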
Satisfy golangci-lint errcheck by wrapping the deferred listener and connection Close calls, matching the repo convention. Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
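One common way to wrap a deferred `Close` so that errcheck accepts it is sketched below. The exact form of the repo convention is not shown in this PR, so the `_ =` discard is an assumption, and `recorder` and `useAndClose` are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// recorder is a hypothetical io.Closer that stands in for a listener or
// connection and remembers whether Close was called.
type recorder struct{ closed bool }

func (r *recorder) Close() error {
	r.closed = true
	return errors.New("close failed")
}

// useAndClose shows the wrapping pattern: a bare `defer c.Close()` drops
// the returned error implicitly, while the closure below discards it
// explicitly with `_ =`.
func useAndClose(c io.Closer) {
	defer func() { _ = c.Close() }()
	// ... work with the resource here ...
}

func main() {
	r := &recorder{}
	useAndClose(r)
	fmt.Println("closed:", r.closed)
}
```

The explicit discard documents that ignoring the error is intentional; logging the error inside the closure would be an alternative where the failure matters.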
In Kubernetes the rendezvous address is a headless Service DNS name that ranks dial; rank 0 cannot bind() that remote name, so ncclCommInitRank failed on every pod and the Indexed Job hit its backoff limit. Rank 0 now listens on the local wildcard interface for rdzvAddr's port while other ranks still dial the full address. Adds an end-to-end Rendezvous test covering the rank 0 listen path that previously had no coverage. Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
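The address selection described in this commit can be sketched as follows. This is a minimal illustration rather than the PR's actual code: `listenAddrFor` and `endpointFor` are hypothetical helpers, and the Service name and port are made-up examples. Only the idea comes from the commit message: rank 0 binds the wildcard interface on `rdzvAddr`'s port, and every other rank dials the full address.

```go
package main

import (
	"fmt"
	"net"
)

// listenAddrFor keeps only the port of the rendezvous address and pairs it
// with an empty host, which net.Listen treats as the local wildcard interface.
func listenAddrFor(rdzvAddr string) (string, error) {
	_, port, err := net.SplitHostPort(rdzvAddr)
	if err != nil {
		return "", fmt.Errorf("parse rendezvous address %q: %w", rdzvAddr, err)
	}
	return net.JoinHostPort("", port), nil
}

// endpointFor decides what a rank does with rdzvAddr: rank 0 listens on the
// wildcard address, all other ranks dial the address unchanged.
func endpointFor(rank int, rdzvAddr string) (addr string, listen bool, err error) {
	if rank == 0 {
		addr, err = listenAddrFor(rdzvAddr)
		return addr, true, err
	}
	return rdzvAddr, false, nil
}

func main() {
	// Hypothetical headless Service DNS name and port.
	const rdzv = "nccl-rdzv.default.svc.cluster.local:29500"
	for rank := 0; rank < 2; rank++ {
		addr, listen, err := endpointFor(rank, rdzv)
		if err != nil {
			panic(err)
		}
		fmt.Printf("rank %d listen=%v addr=%s\n", rank, listen, addr)
	}
}
```

Rank 0 would pass its result to `net.Listen("tcp", addr)` and the other ranks would pass theirs to `net.Dial("tcp", addr)`, so the Service DNS name is only ever resolved by the dialing side.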
Summary
Hardware-free NCCL collective-comms simulation so multi-node nccl-tests-style flows run on Kind without GPUs. A 2-pod `all_reduce`-style run completes and reports non-zero algbw/busbw derived from the topology (NVLink intra-node, InfiniBand `rate_gbps` inter-node). Closes #372.

What's included
- `pkg/gpu/mocknccl/engine`: collective types + busbw factors, a linear cost model (`time = latency + size/effective_bw`), profile-YAML + `MOCK_NCCL_*` config resolution, an MPI-free TCP rendezvous barrier, and capped collective execution.
- `pkg/gpu/mockcuda`: host wall-clock CUDA stream/event timing engine + CGo exports (`cudaStreamCreate`, `cudaEvent*`, `cudaEventElapsedTime`, `cudaMemset`). This is the timing source the driver measures.
- `libnccl.so.2` bridge (`pkg/gpu/mocknccl/bridge`): common NCCL ABI (version/init/comm-mgmt/group + AllReduce/AllGather/ReduceScatter/Broadcast/Reduce) delegating to the engine.
- `mock-coll-perf` nccl-tests-style C driver + driver-facing shim headers (`nccl.h`, `cuda_runtime.h`), `Makefile`, and a build-tagged ABI smoke test.
- `nvml-mock` image.
- `Job` + headless rendezvous `Service` (`--set nccl.test.enabled=true`), `values.schema.json`, and helm unittests.
- `tests/e2e/validate-nccl.sh` + an `nccl-multinode` CI job; plus README/architecture/CHANGELOG docs.

Design notes / deviations
- (`nccl.h`/`cuda_runtime.h`) so the cgo-included `*_types.h` stay types-only and the bridge build is unaffected.

Test plan
- `go test` full suite: pass
- `golangci-lint run`: 0 issues; `gofmt` clean
- `helm lint` + `helm unittest`: 7 suites / 122 tests pass
- `mock-coll-perf` links + runs on Linux (NCCL version `22304`, non-zero `time(us)`)
- `libnccl.so.2`
- `nccl-multinode` Kind job (runs in GitHub Actions; can't run on the macOS host)

🤖 Draft PR; opened for review/CI.