Conversation

@ArangoGutierrez (Collaborator)

Summary

Add a CGo-based mock NVML library that simulates NVIDIA GPU hardware for testing purposes. This enables testing GPU-related software (like device plugins, nvidia-smi workflows, and monitoring tools) without requiring actual NVIDIA hardware.

Key features:

  • YAML-based configuration for full GPU property control (memory, power, thermal, clocks, ECC, PCIe)
  • nvidia-smi compatibility - works with the real nvidia-smi binary
  • 50+ NVML function implementations
  • GPU profiles for A100 and GB200 architectures
  • Debug logging via MOCK_NVML_DEBUG environment variable
  • Thread-safe handle management with proper CGo memory handling

Components:

  • pkg/gpu/mocknvml/engine/ - Core mock engine with device configuration
  • pkg/gpu/mocknvml/bridge/ - Generated CGo bridge exporting nvml* functions
  • pkg/gpu/mocknvml/configs/ - Example YAML configurations (A100, GB200)
  • tests/mocknvml/ - Integration tests using go-nvml
  • docs/mocknvml/ - Comprehensive documentation

Based on dims/k8s-test-infra@16811b9

Test plan

  • Unit tests pass: go test ./pkg/gpu/mocknvml/engine/...
  • Integration tests pass: make -C tests/mocknvml test
  • Concurrency tests verify thread safety
  • golangci-lint passes
  • CI workflow runs successfully

Commits

  1. chore(deps): add go-nvml v0.13.0-1 dependency
  2. feat(mocknvml): add mock libnvidia-ml.so implementation
  3. test(mocknvml): add comprehensive unit and integration tests
  4. docs(mocknvml): add comprehensive documentation
  5. chore: update gitignore for mock library artifacts
  6. feat(mocknvml): add YAML-based GPU configuration and nvidia-smi compatibility
  7. docs(mocknvml): add comprehensive documentation (architecture, troubleshooting, examples)

Co-authored-by: Davanum Srinivas davanum@gmail.com
Signed-off-by: Carlos Eduardo Arango Gutierrez eduardoa@nvidia.com

ArangoGutierrez and others added 7 commits January 19, 2026 19:40
Add NVIDIA go-nvml library as a dependency to support the mock NVML
implementation. This provides the nvml.Interface and mock/dgxa100
packages needed for GPU simulation.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Implement a CGo-based mock NVML library that simulates NVIDIA GPU
hardware for testing. Key components:

- engine/: Core mock engine with thread-safe handle management
  - config.go: Environment-based configuration (MOCK_NVML_NUM_DEVICES)
  - device.go: Enhanced device wrapper with PCI/BAR1 info
  - engine.go: Singleton engine with reference-counted init/shutdown
  - handles.go: Thread-safe C handle to Go object mapping

- bridge/: Generated CGo bridge (via cmd/generate-bridge)
  - Exports nvml* functions callable from C
  - Memory-safe error string caching

- Dockerfile/Makefile: Build tooling for libnvidia-ml.so

The mock uses dgxa100 server from go-nvml, simulating an 8-GPU DGX A100.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add test coverage for the mock NVML library:

Unit tests (pkg/gpu/mocknvml/engine/):
- config_test.go: Environment variable parsing
- engine_test.go: Init/Shutdown lifecycle, device enumeration
- handles_test.go: Concurrent handle registration and lookup
- device_test.go: BAR1 memory, PCI info, process queries

Integration tests (tests/mocknvml/):
- Docker-based test that uses go-nvml with mock libnvidia-ml.so
- Validates device enumeration, UUID/PCI lookups

CI configuration:
- golang.yaml: Updated workflow to run mocknvml tests

Tests use t.Setenv() for isolation and atomic counters for goroutine
safety.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add detailed README for the mock NVML library covering:

- Architecture overview and component descriptions
- Build instructions (Docker and native)
- Configuration via environment variables
- Usage examples and integration guide
- API coverage matrix

Remove obsolete pkg/README.md placeholder.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Ignore build artifacts from mock NVML library:
- libnvidia-ml.so* shared library files
- Build output directories

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
feat(mocknvml): add YAML-based GPU configuration and nvidia-smi compatibility

Adopt comprehensive mock NVML implementation with:
- YAML configuration for full GPU property control (memory, power, thermal, clocks, ECC, PCIe)
- nvidia-smi compatibility (works with real nvidia-smi binary)
- 50+ NVML function implementations (vs previous 16)
- GPU profiles for A100 and GB200
- ConfigurableDevice replacing EnhancedDevice for YAML-driven properties
- Debug logging via MOCK_NVML_DEBUG environment variable

Based on dims/k8s-test-infra@16811b9

Co-authored-by: Davanum Srinivas <davanum@gmail.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add detailed documentation for the mock NVML library:

- README.md: Overview and documentation index
- quickstart.md: 5-minute getting started guide
- architecture.md: System design, components, and data flow diagrams
- configuration.md: Complete YAML configuration reference
- examples.md: Common usage patterns and scenarios
- development.md: Contributing and extending guide
- troubleshooting.md: Common issues and solutions

Documentation covers:
- Building and installation
- YAML-based GPU configuration
- nvidia-smi integration
- Testing scenarios (CI/CD, Kubernetes, Docker)
- Custom GPU profiles
- Debug mode usage

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Copilot AI left a comment

Pull request overview

This PR adds vendored dependencies from the NVIDIA go-nvml library (v0.13.0-1) to support mock NVML implementations for GPU testing. The changes introduce comprehensive NVML function bindings, mock interfaces, and test utilities that enable testing GPU-related software without requiring actual NVIDIA hardware.

Changes:

  • Added go-nvml v0.13.0-1 dependency with mock support packages
  • Vendored generated API bindings (1154 lines), type definitions, and implementations
  • Included mock implementations for Device, GpuInstance, ComputeInstance, EventSet, and other NVML types
  • Added DGX A100 profile data for MIG testing scenarios

Reviewed changes

Copilot reviewed 32 out of 73 changed files in this pull request and generated no comments.

| File | Description |
|------|-------------|
| vendor/modules.txt | Added go-nvml v0.13.0-1 dependency entry with mock packages |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/zz_generated.api.go | Generated API bindings for 397 NVML package-level functions |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/vgpu.go | VGPU-related NVML function implementations |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/unit.go | Unit/chassis management function implementations |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/types_gen.go | Generated type definitions for NVML structures |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/system.go | System-level NVML function implementations |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/return.go | Error string handling and Return type implementation |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/refcount.go | Reference counting utilities for Init/Shutdown |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/mock/*.go | Generated mock implementations using moq |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100/*.go | DGX A100 MIG profile data and mock server |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/init.go | Initialization and shutdown implementations |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/event_set.go | Event handling implementations |


Fix linter errors reported by CI:
- Remove embedded field "Device" from selectors (QF1008 staticcheck)
- Check error return values from fmt.Sscanf (errcheck)
- Add nolint directive for intentional uintptr to unsafe.Pointer conversion

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@dims dims left a comment

@ArangoGutierrez this looks awesome!

LGTM

@elezar elezar (Member) left a comment

Thanks @ArangoGutierrez. Some initial comments. I'll work through this a bit more tomorrow.

```bash
# Simulate 8x A100 GPUs on your laptop
LD_LIBRARY_PATH=. MOCK_NVML_CONFIG=configs/mock-nvml-config-a100.yaml nvidia-smi
```

Question: What provides the nvidia-smi binary in this case?


You are expected to have it installed; one of the features is being able to run nvidia-smi against the fake lib.


I don't quite understand this comment. One of the original use cases for this mock library was so we could run it on CPU only nodes.


Oh, I meant if the user wants to test it against the nvidia-smi binary. My understanding is that you are expected to have the binary.


Like one example use case I am testing: A100 simulation on a GH100 node. It should work on a CPU node, but also on a GPU node with a mock lib path, and that's what the example is showing.


And it looks like you can install nvidia-smi on a CPU-only node from the utils package:

```bash
$ dpkg -S nvidia-smi
nvidia-utils-570: /usr/share/man/man1/nvidia-smi.1.gz
nvidia-utils-570: /usr/bin/nvidia-smi
nvidia-utils-570: /usr/share/doc/nvidia-utils-570/nvidia-smi.html

$ dpkg-deb -c nvidia-utils-570-server_570.195.03-0ubuntu0.22.04.3_arm64.deb
drwxr-xr-x root/root         0 2025-10-23 12:15 ./
drwxr-xr-x root/root         0 2025-10-23 12:15 ./usr/
drwxr-xr-x root/root         0 2025-10-23 12:15 ./usr/bin/
-rwxr-xr-x root/root     66251 2025-09-20 02:40 ./usr/bin/nvidia-bug-report.sh
-rwxr-xr-x root/root    134000 2025-09-20 00:51 ./usr/bin/nvidia-debugdump
-rwxr-xr-x root/root   1149184 2025-09-20 00:54 ./usr/bin/nvidia-smi
-rwxr-xr-x root/root    203424 2025-09-20 00:51 ./usr/bin/nvidia-xconfig
drwxr-xr-x root/root         0 2025-10-23 12:15 ./usr/share/
drwxr-xr-x root/root         0 2025-10-23 12:15 ./usr/share/doc/
drwxr-xr-x root/root         0 2025-10-23 12:15 ./usr/share/doc/nvidia-utils-570-server/
-rw-r--r-- root/root       453 2025-10-23 12:15 ./usr/share/doc/nvidia-utils-570-server/changelog.Debian.gz
-rw-r--r-- root/root     29736 2025-10-10 17:31 ./usr/share/doc/nvidia-utils-570-server/copyright
-rw-r--r-- root/root      3199 2025-09-20 00:52 ./usr/share/doc/nvidia-utils-570-server/nvidia-debugdump.html
-rw-r--r-- root/root      2870 2025-09-20 00:52 ./usr/share/doc/nvidia-utils-570-server/nvidia-smi.html
drwxr-xr-x root/root         0 2025-10-23 12:15 ./usr/share/man/
drwxr-xr-x root/root         0 2025-10-23 12:15 ./usr/share/man/man1/
-rw-r--r-- root/root     35664 2025-10-23 12:15 ./usr/share/man/man1/nvidia-smi.1.gz
-rw-r--r-- root/root      8046 2025-10-23 12:15 ./usr/share/man/man1/nvidia-xconfig.1.gz
```


Anyway, let's have @ArangoGutierrez explain what is implied here :)


As long as you have nvidia-smi on any box, you can build the mock libnvidia-ml.so, place it in the LD_LIBRARY_PATH, and when you run nvidia-smi you will see A100 in the output.

If you want, say, GB200, then you set MOCK_NVML_CONFIG to the path of the YAML file:
pkg/gpu/mocknvml/configs/mock-nvml-config-gb200.yaml

Then run nvidia-smi again and you will see the output change to GB200.

@ArangoGutierrez (Collaborator, Author)

This PR solely focuses on the mock libnvidia-ml.so.
On the PoC PR #182 I have extra logic for an init container that will be in charge of installing nvidia-smi, as @guptaNswati pointed out, and creating mock /dev files, among other things. Since that PR is too big, I am working on smaller PRs, so this PR must be evaluated simply as a mock for libnvidia-ml.so, assuming everything else on the system is not mocked; that will be handled by follow-up PRs that disaggregate PR #182.

| PCIe | `GetPciInfo`, `GetCurrPcieLinkGeneration` | ✅ Full |
| Utilization | `GetUtilizationRates` | ✅ Full |
| MIG | `GetMigMode` | ✅ Basic |
| Other | 340+ additional functions | ⚠️ Stubs |

Looking at the implementation it seems as if these will all raise errors?

```c
#include <stdio.h>
#include <stdint.h>

typedef int nvmlReturn_t;
```
My question here would be why we don't include vendor/github.com/NVIDIA/go-nvml/pkg/nvml/nvml.h instead of redefining these types?

Comment on lines +103 to +104

```go
errorStringCache = make(map[nvml.Return]*C.char)
errorStringCacheMu sync.Mutex
```
Question: is it less error-prone to combine these into a struct instead of managing the lock separately?

Comment on lines +121 to +123

```go
func toReturn(ret nvml.Return) C.nvmlReturn_t {
	return C.nvmlReturn_t(ret)
}
```
Question: Since this file is generated, is this any more useful than just casting every return?

```go
//export nvmlSystemGetProcessName
func nvmlSystemGetProcessName(pid C.uint, name unsafe.Pointer, length C.uint) C.nvmlReturn_t {
	debugLog("[NVML-STUB] nvmlSystemGetProcessName called (NOT IMPLEMENTED)\n")
	return C.NVML_ERROR_NOT_SUPPORTED
```
Question: Is NOT_SUPPORTED what we want here? In some cases that is used to signal different behaviour across devices. Should we not panic instead?

As a general question: How does nvidia-smi handle the cases where these stubs are missing?

Comment on lines +367 to +370

```go
dev := engine.GetEngine().LookupDevice(uintptr(nvmlDevice))
if dev == nil {
	return C.NVML_ERROR_INVALID_ARGUMENT
}
```
If we implement LookupDevice to return an invalidDevice implementation that returns INVALID_ARGUMENT for all calls, it's not required to implement these checks all the time.


```go
//export nvmlInitWithFlags
func nvmlInitWithFlags(flags C.uint) C.nvmlReturn_t {
	ret := engine.GetEngine().Init()
```
Should this not be "NOT IMPLEMENTED"? Why do we ignore the flags?

```go
	return false
}

func getImplementation(funcName string) string {
```
This seems like quite a tedious way to implement these functions. We're not able to lean on an IDE at all since we always have to provide real strings.

Would it not be clearer to actually implement the functions that we provide mappings for in a Go file and also accept this as input -- or just check the required annotations / comments to ensure C exports using the generate tooling?

The basic thinking would be that we would have a file (or files) pkg/gpu/mocknvml/bridge/impl.go where we could have, for example:

```go
//export nvmlInit_v2
func nvmlInit_v2() C.nvmlReturn_t {
	ret := engine.GetEngine().Init()
	return toReturn(ret)
}

//export nvmlShutdown
func nvmlShutdown() C.nvmlReturn_t {
	ret := engine.GetEngine().Shutdown()
	return toReturn(ret)
}
```

running generate-bridge would then:

  1. Ensure that the correct //export functions are present.
  2. Ensure that the arguments match the expected arguments.
  3. Generate failing / panicking stubs for the functions that are NOT implemented.

Alternatively, what we could do here is map the C functions to the engine functions and automate the argument conversion.

Comment on lines +227 to +229

```go
if paramType == "string" && needsCCharParam(f.Name, p.Name) {
	paramType = "*C.char"
}
```
Under which conditions do we NOT want a *C.char parameter? Does it not make more sense to return *C.char from getCType?
