feat(mocknvml): add mock libnvidia-ml.so implementation for GPU testing #204
base: main
Conversation
Add NVIDIA go-nvml library as a dependency to support the mock NVML implementation. This provides the nvml.Interface and mock/dgxa100 packages needed for GPU simulation. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Implement a CGo-based mock NVML library that simulates NVIDIA GPU hardware for testing.

Key components:
- engine/: Core mock engine with thread-safe handle management
  - config.go: Environment-based configuration (MOCK_NVML_NUM_DEVICES)
  - device.go: Enhanced device wrapper with PCI/BAR1 info
  - engine.go: Singleton engine with reference-counted init/shutdown
  - handles.go: Thread-safe C handle to Go object mapping
- bridge/: Generated CGo bridge (via cmd/generate-bridge)
  - Exports nvml* functions callable from C
  - Memory-safe error string caching
- Dockerfile/Makefile: Build tooling for libnvidia-ml.so

The mock uses the dgxa100 server from go-nvml, simulating an 8-GPU DGX A100.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
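The thread-safe C-handle-to-Go-object mapping that handles.go is described as providing can be sketched roughly like this. This is an illustrative reconstruction, not the PR's actual code; the type and method names (`handleRegistry`, `Register`, `Lookup`) are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// handleRegistry is a hypothetical sketch of a thread-safe mapping from
// opaque handle values (handed out to C callers) to Go device objects.
type handleRegistry struct {
	mu      sync.Mutex
	next    uintptr
	objects map[uintptr]interface{}
}

func newHandleRegistry() *handleRegistry {
	return &handleRegistry{next: 1, objects: make(map[uintptr]interface{})}
}

// Register stores obj and returns a stable opaque handle for C callers.
func (r *handleRegistry) Register(obj interface{}) uintptr {
	r.mu.Lock()
	defer r.mu.Unlock()
	h := r.next
	r.next++
	r.objects[h] = obj
	return h
}

// Lookup resolves a handle back to its Go object, or nil if unknown.
func (r *handleRegistry) Lookup(h uintptr) interface{} {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.objects[h]
}

func main() {
	reg := newHandleRegistry()
	h := reg.Register("GPU-0")
	fmt.Println(reg.Lookup(h))       // GPU-0
	fmt.Println(reg.Lookup(h + 100)) // <nil>
}
```

Because the handle values are plain integers rather than real Go pointers, they can safely cross the CGo boundary without violating cgo pointer-passing rules.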
Add test coverage for the mock NVML library.

Unit tests (pkg/gpu/mocknvml/engine/):
- config_test.go: Environment variable parsing
- engine_test.go: Init/Shutdown lifecycle, device enumeration
- handles_test.go: Concurrent handle registration and lookup
- device_test.go: BAR1 memory, PCI info, process queries

Integration tests (tests/mocknvml/):
- Docker-based test that uses go-nvml with mock libnvidia-ml.so
- Validates device enumeration, UUID/PCI lookups

CI configuration:
- golang.yaml: Updated workflow to run mocknvml tests

Tests use t.Setenv() for isolation and atomic counters for goroutine safety.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add detailed README for the mock NVML library covering:
- Architecture overview and component descriptions
- Build instructions (Docker and native)
- Configuration via environment variables
- Usage examples and integration guide
- API coverage matrix

Remove obsolete pkg/README.md placeholder.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Ignore build artifacts from mock NVML library:
- libnvidia-ml.so* shared library files
- Build output directories

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
…tibility

Adopt comprehensive mock NVML implementation with:
- YAML configuration for full GPU property control (memory, power, thermal, clocks, ECC, PCIe)
- nvidia-smi compatibility (works with real nvidia-smi binary)
- 50+ NVML function implementations (vs previous 16)
- GPU profiles for A100 and GB200
- ConfigurableDevice replacing EnhancedDevice for YAML-driven properties
- Debug logging via MOCK_NVML_DEBUG environment variable

Based on dims/k8s-test-infra@16811b9

Co-authored-by: Davanum Srinivas <davanum@gmail.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add detailed documentation for the mock NVML library:
- README.md: Overview and documentation index
- quickstart.md: 5-minute getting started guide
- architecture.md: System design, components, and data flow diagrams
- configuration.md: Complete YAML configuration reference
- examples.md: Common usage patterns and scenarios
- development.md: Contributing and extending guide
- troubleshooting.md: Common issues and solutions

Documentation covers:
- Building and installation
- YAML-based GPU configuration
- nvidia-smi integration
- Testing scenarios (CI/CD, Kubernetes, Docker)
- Custom GPU profiles
- Debug mode usage

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Pull request overview
This PR adds vendored dependencies from the NVIDIA go-nvml library (v0.13.0-1) to support mock NVML implementations for GPU testing. The changes introduce comprehensive NVML function bindings, mock interfaces, and test utilities that enable testing GPU-related software without requiring actual NVIDIA hardware.
Changes:
- Added go-nvml v0.13.0-1 dependency with mock support packages
- Vendored generated API bindings (1154 lines), type definitions, and implementations
- Included mock implementations for Device, GpuInstance, ComputeInstance, EventSet, and other NVML types
- Added DGX A100 profile data for MIG testing scenarios
Reviewed changes
Copilot reviewed 32 out of 73 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| vendor/modules.txt | Added go-nvml v0.13.0-1 dependency entry with mock packages |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/zz_generated.api.go | Generated API bindings for 397 NVML package-level functions |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/vgpu.go | VGPU-related NVML function implementations |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/unit.go | Unit/chassis management function implementations |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/types_gen.go | Generated type definitions for NVML structures |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/system.go | System-level NVML function implementations |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/return.go | Error string handling and Return type implementation |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/refcount.go | Reference counting utilities for Init/Shutdown |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/mock/*.go | Generated mock implementations using moq |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100/*.go | DGX A100 MIG profile data and mock server |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/init.go | Initialization and shutdown implementations |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/event_set.go | Event handling implementations |
Fix linter errors reported by CI:
- Remove embedded field "Device" from selectors (QF1008 staticcheck)
- Check error return values from fmt.Sscanf (errcheck)
- Add nolint directive for intentional uintptr to unsafe.Pointer conversion

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
dims left a comment:
@ArangoGutierrez this looks awesome!
LGTM
elezar left a comment:
Thanks @ArangoGutierrez. Some initial comments. I'll work through this a bit more tomorrow.
```bash
# Simulate 8x A100 GPUs on your laptop
LD_LIBRARY_PATH=. MOCK_NVML_CONFIG=configs/mock-nvml-config-a100.yaml nvidia-smi
```
Question: What provides the nvidia-smi binary in this case?
You are expected to have it installed. One of the features is being able to run nvidia-smi against the fake lib.
I don't quite understand this comment. One of the original use cases for this mock library was to let us run it on CPU-only nodes.
Oh, I meant if the user wants to test it against the nvidia-smi binary. My understanding is that you are expected to have the binary.
For example, one use case I am testing is A100 simulation on a GH100 node. It should work on a CPU-only node, but also on a GPU node with the mock lib on the library path, and that's what the example is showing.
And it looks like you can install nvidia-smi on a CPU-only node from the utils package:
```console
$ dpkg -S nvidia-smi
nvidia-utils-570: /usr/share/man/man1/nvidia-smi.1.gz
nvidia-utils-570: /usr/bin/nvidia-smi
nvidia-utils-570: /usr/share/doc/nvidia-utils-570/nvidia-smi.html
$ dpkg-deb -c nvidia-utils-570-server_570.195.03-0ubuntu0.22.04.3_arm64.deb
drwxr-xr-x root/root       0 2025-10-23 12:15 ./
drwxr-xr-x root/root       0 2025-10-23 12:15 ./usr/
drwxr-xr-x root/root       0 2025-10-23 12:15 ./usr/bin/
-rwxr-xr-x root/root   66251 2025-09-20 02:40 ./usr/bin/nvidia-bug-report.sh
-rwxr-xr-x root/root  134000 2025-09-20 00:51 ./usr/bin/nvidia-debugdump
-rwxr-xr-x root/root 1149184 2025-09-20 00:54 ./usr/bin/nvidia-smi
-rwxr-xr-x root/root  203424 2025-09-20 00:51 ./usr/bin/nvidia-xconfig
drwxr-xr-x root/root       0 2025-10-23 12:15 ./usr/share/
drwxr-xr-x root/root       0 2025-10-23 12:15 ./usr/share/doc/
drwxr-xr-x root/root       0 2025-10-23 12:15 ./usr/share/doc/nvidia-utils-570-server/
-rw-r--r-- root/root     453 2025-10-23 12:15 ./usr/share/doc/nvidia-utils-570-server/changelog.Debian.gz
-rw-r--r-- root/root   29736 2025-10-10 17:31 ./usr/share/doc/nvidia-utils-570-server/copyright
-rw-r--r-- root/root    3199 2025-09-20 00:52 ./usr/share/doc/nvidia-utils-570-server/nvidia-debugdump.html
-rw-r--r-- root/root    2870 2025-09-20 00:52 ./usr/share/doc/nvidia-utils-570-server/nvidia-smi.html
drwxr-xr-x root/root       0 2025-10-23 12:15 ./usr/share/man/
drwxr-xr-x root/root       0 2025-10-23 12:15 ./usr/share/man/man1/
-rw-r--r-- root/root   35664 2025-10-23 12:15 ./usr/share/man/man1/nvidia-smi.1.gz
-rw-r--r-- root/root    8046 2025-10-23 12:15 ./usr/share/man/man1/nvidia-xconfig.1.gz
```
Anyway, let's have @ArangoGutierrez explain what is implied here :)
As long as you have nvidia-smi on any box, you can build the mock libnvidia-ml.so, place it in the LD_LIBRARY_PATH, and when you run nvidia-smi you will see A100 in the output.
If you want, say, GB200, then you set MOCK_NVML_CONFIG to the path of the YAML file:
pkg/gpu/mocknvml/configs/mock-nvml-config-gb200.yaml
and then run nvidia-smi, and you will see the output change to GB200.
This PR solely focuses on the mock libnvidia-ml.so.
On the PoC PR #182 I have extra logic for an init container that will be in charge of installing nvidia-smi, as @guptaNswati pointed out, and creating mock /dev files, among other things. Since that PR is too big, I am working on smaller PRs. So this PR should be evaluated simply as a mock for libnvidia-ml.so, assuming everything else on the system is not mocked; that will be handled by follow-up PRs that disaggregate PR #182.
| PCIe | `GetPciInfo`, `GetCurrPcieLinkGeneration` | ✅ Full |
| Utilization | `GetUtilizationRates` | ✅ Full |
| MIG | `GetMigMode` | ✅ Basic |
| Other | 340+ additional functions | ⚠️ Stubs |
Looking at the implementation it seems as if these will all raise errors?
```c
#include <stdio.h>
#include <stdint.h>

typedef int nvmlReturn_t;
```
My question here would be why we don't include vendor/github.com/NVIDIA/go-nvml/pkg/nvml/nvml.h instead of redefining these types?
```go
errorStringCache   = make(map[nvml.Return]*C.char)
errorStringCacheMu sync.Mutex
```
Question: is it less error-prone to combine these into a struct instead of managing the lock separately?
```go
func toReturn(ret nvml.Return) C.nvmlReturn_t {
	return C.nvmlReturn_t(ret)
}
```
Question: Since this file is generated, is this any more useful than just casting every return?
```go
//export nvmlSystemGetProcessName
func nvmlSystemGetProcessName(pid C.uint, name unsafe.Pointer, length C.uint) C.nvmlReturn_t {
	debugLog("[NVML-STUB] nvmlSystemGetProcessName called (NOT IMPLEMENTED)\n")
	return C.NVML_ERROR_NOT_SUPPORTED
}
```
Question: Is NOT_SUPPORTED what we want here? In some cases that is used to signal different behaviour across devices. Should we not panic instead?
As a general question: How does nvidia-smi handle the cases where these stubs are missing?
```go
dev := engine.GetEngine().LookupDevice(uintptr(nvmlDevice))
if dev == nil {
	return C.NVML_ERROR_INVALID_ARGUMENT
}
```
If we implement LookupDevice to return an invalidDevice implementation that returns INVALID_ARGUMENT for all calls, it's not required to implement these checks all the time.
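The sentinel pattern the reviewer describes can be sketched like this. All names here (`device`, `invalidDevice`, `lookupDevice`) are hypothetical stand-ins for the PR's interfaces, and the integer codes mirror NVML's return values only for illustration.

```go
package main

import "fmt"

// Illustrative codes mirroring NVML return values.
const (
	nvmlSuccess         = 0
	nvmlInvalidArgument = 2
)

// device is a minimal stand-in for the mock's device interface.
type device interface {
	GetName() (string, int)
}

type realDevice struct{ name string }

func (d *realDevice) GetName() (string, int) { return d.name, nvmlSuccess }

// invalidDevice is the suggested sentinel: every method returns
// INVALID_ARGUMENT, so call sites never need an explicit nil check.
type invalidDevice struct{}

func (invalidDevice) GetName() (string, int) { return "", nvmlInvalidArgument }

// lookupDevice never returns nil; unknown handles resolve to invalidDevice.
func lookupDevice(devices map[uintptr]device, h uintptr) device {
	if d, ok := devices[h]; ok {
		return d
	}
	return invalidDevice{}
}

func main() {
	devices := map[uintptr]device{1: &realDevice{name: "Mock A100"}}
	name, ret := lookupDevice(devices, 1).GetName()
	fmt.Println(name, ret) // Mock A100 0
	_, ret = lookupDevice(devices, 99).GetName()
	fmt.Println(ret) // 2
}
```

With this shape, each exported bridge function can simply call through the returned device and propagate its return code, collapsing the repeated `if dev == nil` checks into the lookup itself.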
```go
//export nvmlInitWithFlags
func nvmlInitWithFlags(flags C.uint) C.nvmlReturn_t {
	ret := engine.GetEngine().Init()
```
Should this not be "NOT IMPLEMENTED"? Why do we ignore the flags?
```go
	return false
}

func getImplementation(funcName string) string {
```
This seems like quite a tedious way to implement these functions. We're not able to lean on an IDE at all, since we always have to provide raw strings.
Would it not be clearer to actually implement the functions that we provide mappings for in a Go file and also accept this as input -- or just check the required annotations / comments to ensure C exports using the generate tooling?
The basic thinking would be that we would have a file (or files) pkg/gpu/mocknvml/bridge/impl.go where we could have, for example:
```go
//export nvmlInit_v2
func nvmlInit_v2() C.nvmlReturn_t {
	ret := engine.GetEngine().Init()
	return toReturn(ret)
}

//export nvmlShutdown
func nvmlShutdown() C.nvmlReturn_t {
	ret := engine.GetEngine().Shutdown()
	return toReturn(ret)
}
```
Running generate-bridge would then:
- Ensure that the correct `//export` functions are present.
- Ensure that the arguments match the expected arguments.
- Generate failing / panicking stubs for the functions that are NOT implemented.
Alternatively, what we could do here is map the C functions to the engine functions and automate the argument conversion.
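That alternative — a table mapping C entry points to engine-side handlers — could be sketched as below. This is a hypothetical design sketch, not the PR's code: the uniform `bridgeFunc` signature elides the per-function argument marshalling that a real generator would have to emit, and the symbol names in the table are just examples.

```go
package main

import "fmt"

// bridgeFunc is a uniform signature the generator could target; real NVML
// entry points would need per-function argument conversion on top of this.
type bridgeFunc func(args ...interface{}) int

// dispatch maps exported C symbol names to engine-side handlers, so the
// generator only needs to emit thin trampolines that consult this table.
var dispatch = map[string]bridgeFunc{
	"nvmlInit_v2":  func(args ...interface{}) int { return 0 /* NVML_SUCCESS */ },
	"nvmlShutdown": func(args ...interface{}) int { return 0 /* NVML_SUCCESS */ },
}

// call resolves a symbol; unmapped functions get NOT_SUPPORTED (code 3).
func call(name string, args ...interface{}) int {
	if f, ok := dispatch[name]; ok {
		return f(args...)
	}
	return 3 // NVML_ERROR_NOT_SUPPORTED
}

func main() {
	fmt.Println(call("nvmlInit_v2"))           // 0
	fmt.Println(call("nvmlDeviceGetFanSpeed")) // 3
}
```

The trade-off is that a string-keyed table gives up compile-time checking of signatures, which is exactly the IDE-support concern raised above; generating typed trampolines from the table would recover it.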
```go
if paramType == "string" && needsCCharParam(f.Name, p.Name) {
	paramType = "*C.char"
}
```
Under which conditions do we NOT want a *C.char parameter? Does it not make more sense to return *C.char from getCType?
Summary
Add a CGo-based mock NVML library that simulates NVIDIA GPU hardware for testing purposes. This enables testing GPU-related software (like device plugins, nvidia-smi workflows, and monitoring tools) without requiring actual NVIDIA hardware.
Key features:
- YAML-based configuration for full GPU property control
- nvidia-smi compatibility (works with the real nvidia-smi binary)
- GPU profiles for A100 and GB200
- Debug logging via the `MOCK_NVML_DEBUG` environment variable

Components:
- `pkg/gpu/mocknvml/engine/` - Core mock engine with device configuration
- `pkg/gpu/mocknvml/bridge/` - Generated CGo bridge exporting nvml* functions
- `pkg/gpu/mocknvml/configs/` - Example YAML configurations (A100, GB200)
- `tests/mocknvml/` - Integration tests using go-nvml
- `docs/mocknvml/` - Comprehensive documentation

Based on dims/k8s-test-infra@16811b9
Test plan
- `go test ./pkg/gpu/mocknvml/engine/...`
- `make -C tests/mocknvml test`

Commits
- chore(deps): add go-nvml v0.13.0-1 dependency
- feat(mocknvml): add mock libnvidia-ml.so implementation
- test(mocknvml): add comprehensive unit and integration tests
- docs(mocknvml): add comprehensive documentation
- chore: update gitignore for mock library artifacts
- feat(mocknvml): add YAML-based GPU configuration and nvidia-smi compatibility
- docs(mocknvml): add comprehensive documentation (architecture, troubleshooting, examples)

Co-authored-by: Davanum Srinivas davanum@gmail.com
Signed-off-by: Carlos Eduardo Arango Gutierrez eduardoa@nvidia.com