Conversation

@ArangoGutierrez (Collaborator)

Summary

Add a CGo-based mock NVML library that simulates NVIDIA GPU hardware for testing purposes. This enables testing GPU-related software (like device plugins, nvidia-smi workflows, and monitoring tools) without requiring actual NVIDIA hardware.

Key features:

  • YAML-based configuration for full GPU property control (memory, power, thermal, clocks, ECC, PCIe)
  • nvidia-smi compatibility - works with the real nvidia-smi binary
  • 50+ NVML function implementations
  • GPU profiles for A100 and GB200 architectures
  • Debug logging via MOCK_NVML_DEBUG environment variable
  • Thread-safe handle management with proper CGo memory handling

Components:

  • pkg/gpu/mocknvml/engine/ - Core mock engine with device configuration
  • pkg/gpu/mocknvml/bridge/ - Generated CGo bridge exporting nvml* functions
  • pkg/gpu/mocknvml/configs/ - Example YAML configurations (A100, GB200)
  • tests/mocknvml/ - Integration tests using go-nvml
  • docs/mocknvml/ - Comprehensive documentation

Based on dims/k8s-test-infra@16811b9

Test plan

  • Unit tests pass: go test ./pkg/gpu/mocknvml/engine/...
  • Integration tests pass: make -C tests/mocknvml test
  • Concurrency tests verify thread safety
  • golangci-lint passes
  • CI workflow runs successfully

Commits

  1. chore(deps): add go-nvml v0.13.0-1 dependency
  2. feat(mocknvml): add mock libnvidia-ml.so implementation
  3. test(mocknvml): add comprehensive unit and integration tests
  4. docs(mocknvml): add comprehensive documentation
  5. chore: update gitignore for mock library artifacts
  6. feat(mocknvml): add YAML-based GPU configuration and nvidia-smi compatibility
  7. docs(mocknvml): add comprehensive documentation (architecture, troubleshooting, examples)

Co-authored-by: Davanum Srinivas davanum@gmail.com
Signed-off-by: Carlos Eduardo Arango Gutierrez eduardoa@nvidia.com

ArangoGutierrez and others added 7 commits January 19, 2026 19:40
Add NVIDIA go-nvml library as a dependency to support the mock NVML
implementation. This provides the nvml.Interface and mock/dgxa100
packages needed for GPU simulation.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Implement a CGo-based mock NVML library that simulates NVIDIA GPU
hardware for testing. Key components:

- engine/: Core mock engine with thread-safe handle management
  - config.go: Environment-based configuration (MOCK_NVML_NUM_DEVICES)
  - device.go: Enhanced device wrapper with PCI/BAR1 info
  - engine.go: Singleton engine with reference-counted init/shutdown
  - handles.go: Thread-safe C handle to Go object mapping

- bridge/: Generated CGo bridge (via cmd/generate-bridge)
  - Exports nvml* functions callable from C
  - Memory-safe error string caching

- Dockerfile/Makefile: Build tooling for libnvidia-ml.so

The mock uses dgxa100 server from go-nvml, simulating an 8-GPU DGX A100.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add test coverage for the mock NVML library:

Unit tests (pkg/gpu/mocknvml/engine/):
- config_test.go: Environment variable parsing
- engine_test.go: Init/Shutdown lifecycle, device enumeration
- handles_test.go: Concurrent handle registration and lookup
- device_test.go: BAR1 memory, PCI info, process queries

Integration tests (tests/mocknvml/):
- Docker-based test that uses go-nvml with mock libnvidia-ml.so
- Validates device enumeration, UUID/PCI lookups

CI configuration:
- golang.yaml: Updated workflow to run mocknvml tests

Tests use t.Setenv() for isolation and atomic counters for goroutine
safety.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add detailed README for the mock NVML library covering:

- Architecture overview and component descriptions
- Build instructions (Docker and native)
- Configuration via environment variables
- Usage examples and integration guide
- API coverage matrix

Remove obsolete pkg/README.md placeholder.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Ignore build artifacts from mock NVML library:
- libnvidia-ml.so* shared library files
- Build output directories

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
feat(mocknvml): add YAML-based GPU configuration and nvidia-smi compatibility

Adopt comprehensive mock NVML implementation with:
- YAML configuration for full GPU property control (memory, power, thermal, clocks, ECC, PCIe)
- nvidia-smi compatibility (works with real nvidia-smi binary)
- 50+ NVML function implementations (vs previous 16)
- GPU profiles for A100 and GB200
- ConfigurableDevice replacing EnhancedDevice for YAML-driven properties
- Debug logging via MOCK_NVML_DEBUG environment variable

Based on dims/k8s-test-infra@16811b9

Co-authored-by: Davanum Srinivas <davanum@gmail.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add detailed documentation for the mock NVML library:

- README.md: Overview and documentation index
- quickstart.md: 5-minute getting started guide
- architecture.md: System design, components, and data flow diagrams
- configuration.md: Complete YAML configuration reference
- examples.md: Common usage patterns and scenarios
- development.md: Contributing and extending guide
- troubleshooting.md: Common issues and solutions

Documentation covers:
- Building and installation
- YAML-based GPU configuration
- nvidia-smi integration
- Testing scenarios (CI/CD, Kubernetes, Docker)
- Custom GPU profiles
- Debug mode usage

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Copilot AI left a comment

Pull request overview

This PR adds vendored dependencies from the NVIDIA go-nvml library (v0.13.0-1) to support mock NVML implementations for GPU testing. The changes introduce comprehensive NVML function bindings, mock interfaces, and test utilities that enable testing GPU-related software without requiring actual NVIDIA hardware.

Changes:

  • Added go-nvml v0.13.0-1 dependency with mock support packages
  • Vendored generated API bindings (1154 lines), type definitions, and implementations
  • Included mock implementations for Device, GpuInstance, ComputeInstance, EventSet, and other NVML types
  • Added DGX A100 profile data for MIG testing scenarios

Reviewed changes

Copilot reviewed 32 out of 73 changed files in this pull request and generated no comments.

| File | Description |
|------|-------------|
| vendor/modules.txt | Added go-nvml v0.13.0-1 dependency entry with mock packages |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/zz_generated.api.go | Generated API bindings for 397 NVML package-level functions |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/vgpu.go | VGPU-related NVML function implementations |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/unit.go | Unit/chassis management function implementations |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/types_gen.go | Generated type definitions for NVML structures |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/system.go | System-level NVML function implementations |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/return.go | Error string handling and Return type implementation |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/refcount.go | Reference counting utilities for Init/Shutdown |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/mock/*.go | Generated mock implementations using moq |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100/*.go | DGX A100 MIG profile data and mock server |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/init.go | Initialization and shutdown implementations |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/event_set.go | Event handling implementations |


Fix linter errors reported by CI:
- Remove embedded field "Device" from selectors (QF1008 staticcheck)
- Check error return values from fmt.Sscanf (errcheck)
- Add nolint directive for intentional uintptr to unsafe.Pointer conversion

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@dims dims left a comment

@ArangoGutierrez this looks awesome!

LGTM

@elezar elezar (Member) left a comment

Thanks @ArangoGutierrez. Some initial comments. I'll work through this a bit more tomorrow.

```bash
# Simulate 8x A100 GPUs on your laptop
LD_LIBRARY_PATH=. MOCK_NVML_CONFIG=configs/mock-nvml-config-a100.yaml nvidia-smi
```

Question: What provides the nvidia-smi binary in this case?


You are expected to have it installed; one of the features is being able to run nvidia-smi against the fake lib.


I don't quite understand this comment. One of the original use cases for this mock library was so we could run it on CPU only nodes.


Oh, I meant if the user wants to test it against the nvidia-smi binary. My understanding is that you are expected to have the binary.


Like one example use case I am testing: A100 simulation on a GH100 node. It should work on a CPU node, but also on a GPU node with a mock lib path, and that's what the example is showing.


And it looks like you can install nvidia-smi on a CPU-only node from the utils package:

```bash
$ dpkg -S nvidia-smi
nvidia-utils-570: /usr/share/man/man1/nvidia-smi.1.gz
nvidia-utils-570: /usr/bin/nvidia-smi
nvidia-utils-570: /usr/share/doc/nvidia-utils-570/nvidia-smi.html

$ dpkg-deb -c nvidia-utils-570-server_570.195.03-0ubuntu0.22.04.3_arm64.deb
drwxr-xr-x root/root         0 2025-10-23 12:15 ./
drwxr-xr-x root/root         0 2025-10-23 12:15 ./usr/
drwxr-xr-x root/root         0 2025-10-23 12:15 ./usr/bin/
-rwxr-xr-x root/root     66251 2025-09-20 02:40 ./usr/bin/nvidia-bug-report.sh
-rwxr-xr-x root/root    134000 2025-09-20 00:51 ./usr/bin/nvidia-debugdump
-rwxr-xr-x root/root   1149184 2025-09-20 00:54 ./usr/bin/nvidia-smi
-rwxr-xr-x root/root    203424 2025-09-20 00:51 ./usr/bin/nvidia-xconfig
drwxr-xr-x root/root         0 2025-10-23 12:15 ./usr/share/
drwxr-xr-x root/root         0 2025-10-23 12:15 ./usr/share/doc/
drwxr-xr-x root/root         0 2025-10-23 12:15 ./usr/share/doc/nvidia-utils-570-server/
-rw-r--r-- root/root       453 2025-10-23 12:15 ./usr/share/doc/nvidia-utils-570-server/changelog.Debian.gz
-rw-r--r-- root/root     29736 2025-10-10 17:31 ./usr/share/doc/nvidia-utils-570-server/copyright
-rw-r--r-- root/root      3199 2025-09-20 00:52 ./usr/share/doc/nvidia-utils-570-server/nvidia-debugdump.html
-rw-r--r-- root/root      2870 2025-09-20 00:52 ./usr/share/doc/nvidia-utils-570-server/nvidia-smi.html
drwxr-xr-x root/root         0 2025-10-23 12:15 ./usr/share/man/
drwxr-xr-x root/root         0 2025-10-23 12:15 ./usr/share/man/man1/
-rw-r--r-- root/root     35664 2025-10-23 12:15 ./usr/share/man/man1/nvidia-smi.1.gz
-rw-r--r-- root/root      8046 2025-10-23 12:15 ./usr/share/man/man1/nvidia-xconfig.1.gz
```


Anyway, let's have @ArangoGutierrez explain what is implied here :)


As long as you have nvidia-smi on any box, you can build the mock libnvidia-ml.so, place it in the LD_LIBRARY_PATH, and when you run nvidia-smi you will see A100 in the output.

If you want, say, GB200, then you set MOCK_NVML_CONFIG to the path of the YAML file:
pkg/gpu/mocknvml/configs/mock-nvml-config-gb200.yaml

Then run nvidia-smi again and you will see the output change to GB200.

@ArangoGutierrez (Collaborator, Author)

This PR solely focuses on the mock libnvidia-ml.so.
On the PoC PR #182 I have extra logic for an init container that will be in charge of installing nvidia-smi, as @guptaNswati pointed out, and creating mock /dev files, among other things. Since that PR is too big, I am working on smaller PRs, so this PR must be evaluated simply as a mock for libnvidia-ml.so, assuming everything else on the system is not mocked; that will be handled by follow-up PRs that disaggregate PR #182.

| PCIe | `GetPciInfo`, `GetCurrPcieLinkGeneration` | ✅ Full |
| Utilization | `GetUtilizationRates` | ✅ Full |
| MIG | `GetMigMode` | ✅ Basic |
| Other | 340+ additional functions | ⚠️ Stubs |

Looking at the implementation it seems as if these will all raise errors?

```c
#include <stdio.h>
#include <stdint.h>

typedef int nvmlReturn_t;
```
My question here would be why we don't include vendor/github.com/NVIDIA/go-nvml/pkg/nvml/nvml.h instead of redefining these types?

Comment on lines +103 to +104

```go
errorStringCache = make(map[nvml.Return]*C.char)
errorStringCacheMu sync.Mutex
```
Question: is it less error-prone to combine these into a struct instead of managing the lock separately?

Comment on lines +121 to +123

```go
func toReturn(ret nvml.Return) C.nvmlReturn_t {
	return C.nvmlReturn_t(ret)
}
```
Question: Since this file is generated, is this any more useful than just casting every return?

```go
//export nvmlSystemGetProcessName
func nvmlSystemGetProcessName(pid C.uint, name unsafe.Pointer, length C.uint) C.nvmlReturn_t {
	debugLog("[NVML-STUB] nvmlSystemGetProcessName called (NOT IMPLEMENTED)\n")
	return C.NVML_ERROR_NOT_SUPPORTED
```
Question: Is NOT_SUPPORTED what we want here? In some cases that is used to signal different behaviour across devices. Should we not panic instead?

As a general question: How does nvidia-smi handle the cases where these stubs are missing?

Comment on lines +367 to +370

```go
dev := engine.GetEngine().LookupDevice(uintptr(nvmlDevice))
if dev == nil {
	return C.NVML_ERROR_INVALID_ARGUMENT
}
```
If we implement LookupDevice to return an invalidDevice implementation that returns INVALID_ARGUMENT for all calls, it's not required to implement these checks all the time.


```go
//export nvmlInitWithFlags
func nvmlInitWithFlags(flags C.uint) C.nvmlReturn_t {
	ret := engine.GetEngine().Init()
```
Should this not be "NOT IMPLEMENTED"? Why do we ignore the flags?

```go
	return false
}

func getImplementation(funcName string) string {
```
This seems like quite a tedious way to implement these functions. We're not able to lean on an IDE at all since we always have to provide real strings.

Would it not be clearer to actually implement the functions that we provide mappings for in a Go file and also accept this as input -- or just check the required annotations / comments to ensure C exports using the generate tooling?

The basic thinking would be that we would have a file (or files) pkg/gpu/mocknvml/bridge/impl.go where we could have, for example:

```go
//export nvmlInit_v2
func nvmlInit_v2() C.nvmlReturn_t {
	ret := engine.GetEngine().Init()
	return toReturn(ret)
}

//export nvmlShutdown
func nvmlShutdown() C.nvmlReturn_t {
	ret := engine.GetEngine().Shutdown()
	return toReturn(ret)
}
```

running generate-bridge would then:

  1. Ensure that the correct //export functions are present.
  2. Ensure that the arguments match the expected arguments.
  3. Generate failing / panicking stubs for the functions that are NOT implemented.

Alternatively, what we could do here is map the C functions to the engine functions and automate the argument conversion.

Comment on lines +227 to +229

```go
if paramType == "string" && needsCCharParam(f.Name, p.Name) {
	paramType = "*C.char"
}
```
Under which conditions do we NOT want a *C.char parameter? Does it not make more sense to return *C.char from getCType?
