Commit c9336f9

Add comprehensive README for mock NVML library

Documents usage, configuration, and examples for the mock libnvidia-ml.so. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

1 parent 2747561

pkg/gpu/mocknvml/README.md (1 file changed, +244 −0)

# Mock NVML Library

A CGo-based mock implementation of NVIDIA's NVML (NVIDIA Management Library) for testing GPU-enabled applications without physical GPUs.

## Overview

This library provides a drop-in replacement for `libnvidia-ml.so` that simulates GPU devices and their properties. It's designed for testing Kubernetes components, such as the NVIDIA device plugin, in environments without actual GPUs.

**Key Features:**

- 🔧 **Zero-config default**: Simulates a DGX A100 system (8 GPUs) out of the box
- 📋 **CDI-driven configuration**: Define custom GPU topologies via CDI specifications
- 🐳 **Docker build support**: Build Linux binaries on macOS
- 🧵 **Thread-safe**: Proper synchronization for concurrent access
- ✅ **Well-tested**: Comprehensive unit tests with race detection

## Quick Start

### Building the Library

#### Local Build (Linux)

```bash
cd pkg/gpu/mocknvml
make
```

#### Docker Build (Cross-platform)

```bash
cd pkg/gpu/mocknvml
make docker-build
```

This produces `libnvidia-ml.so` and `libnvidia-ml.h` in the current directory.

### Using the Library

#### Option 1: Default Configuration (8 A100 GPUs)

```bash
# Set library path
export LD_LIBRARY_PATH=/path/to/pkg/gpu/mocknvml:$LD_LIBRARY_PATH

# Run your application
./your-gpu-application
```

#### Option 2: Custom Device Count

```bash
export LD_LIBRARY_PATH=/path/to/pkg/gpu/mocknvml:$LD_LIBRARY_PATH
export MOCK_NVML_NUM_DEVICES=4

./your-gpu-application
```

#### Option 3: CDI Specification

```bash
export LD_LIBRARY_PATH=/path/to/pkg/gpu/mocknvml:$LD_LIBRARY_PATH
export NVIDIA_MOCK_CDI_SPEC=/path/to/cdi-spec.yaml

./your-gpu-application
```

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `NVIDIA_MOCK_CDI_SPEC` | Path to a CDI specification file | _(none)_ |
| `MOCK_NVML_NUM_DEVICES` | Number of GPU devices (max 8) | `8` |
| `MOCK_NVML_DRIVER_VERSION` | NVIDIA driver version string | `550.54.15` |

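As an illustration of how the device-count variable might be interpreted, here is a small Go sketch that parses a `MOCK_NVML_NUM_DEVICES` value and clamps it to the 1–8 range, falling back to the default of 8 on unparsable or out-of-range input. The function name `clampDeviceCount` and the fallback behavior are illustrative assumptions, not the library's actual API:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// maxDevices is the ceiling imposed by the underlying dgxa100 mock.
const maxDevices = 8

// clampDeviceCount parses a MOCK_NVML_NUM_DEVICES value and clamps it
// to 1..maxDevices, defaulting to maxDevices on bad or missing input.
func clampDeviceCount(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return maxDevices // unset or invalid: use the default of 8
	}
	if n > maxDevices {
		return maxDevices // the mock cannot exceed 8 devices
	}
	return n
}

func main() {
	fmt.Println(clampDeviceCount(os.Getenv("MOCK_NVML_NUM_DEVICES")))
}
```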
### CDI Specification

Create a CDI spec to define custom GPU configurations.

**Example: 2-GPU Configuration** (`cdi-spec-2gpu.yaml`)

```yaml
cdiVersion: 0.6.0
kind: nvidia.com/gpu
devices:
  - name: "0"
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0
          type: c
          major: 195
          minor: 0
    annotations:
      nvidia.com/gpu.product: "NVIDIA A100-SXM4-40GB"
      nvidia.com/gpu.uuid: "GPU-12345678-1234-1234-1234-123456789012"

  - name: "1"
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia1
          type: c
          major: 195
          minor: 1
    annotations:
      nvidia.com/gpu.product: "NVIDIA A100-SXM4-40GB"
      nvidia.com/gpu.uuid: "GPU-87654321-4321-4321-4321-210987654321"
```

**Usage:**

```bash
export NVIDIA_MOCK_CDI_SPEC=./cdi-spec-2gpu.yaml
export LD_LIBRARY_PATH=/path/to/pkg/gpu/mocknvml:$LD_LIBRARY_PATH
./your-application
```

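To show the shape of the data the mock would extract from such a spec, here is a minimal Go sketch that pulls the per-device UUID annotations out of a CDI document. The struct and function names are illustrative, and the example uses the JSON form of CDI for brevity (a real loader for `.yaml` files would need a YAML parser, which is outside the standard library):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cdiSpec and cdiDevice model only the fields this sketch needs;
// the real CDI schema has many more.
type cdiSpec struct {
	Kind    string      `json:"kind"`
	Devices []cdiDevice `json:"devices"`
}

type cdiDevice struct {
	Name        string            `json:"name"`
	Annotations map[string]string `json:"annotations"`
}

// deviceUUIDs parses a CDI document and collects each device's
// nvidia.com/gpu.uuid annotation, in device order.
func deviceUUIDs(raw []byte) ([]string, error) {
	var spec cdiSpec
	if err := json.Unmarshal(raw, &spec); err != nil {
		return nil, err
	}
	uuids := make([]string, 0, len(spec.Devices))
	for _, d := range spec.Devices {
		uuids = append(uuids, d.Annotations["nvidia.com/gpu.uuid"])
	}
	return uuids, nil
}

func main() {
	raw := []byte(`{"kind":"nvidia.com/gpu","devices":[{"name":"0","annotations":{"nvidia.com/gpu.uuid":"GPU-1234"}}]}`)
	uuids, err := deviceUUIDs(raw)
	fmt.Println(uuids, err)
}
```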
## Example: Testing NVIDIA Device Plugin

```bash
# Build the mock library
cd pkg/gpu/mocknvml
make docker-build

# Set up environment
export LD_LIBRARY_PATH=$(pwd):$LD_LIBRARY_PATH
export MOCK_NVML_NUM_DEVICES=4

# Run device plugin
kubectl apply -f device-plugin.yaml
```

## Example: Integration with Kind

```bash
# Create Kind cluster
kind create cluster --name gpu-test

# Load mock library into worker nodes
docker cp pkg/gpu/mocknvml/libnvidia-ml.so gpu-test-worker:/usr/lib/x86_64-linux-gnu/

# Deploy device plugin with mock library
kubectl apply -f deployments/device-plugin-mock.yaml
```

## Supported NVML Functions

The mock library currently implements the following NVML functions:

### Initialization

- `nvmlInit` / `nvmlInit_v2`
- `nvmlShutdown`

### Device Enumeration

- `nvmlDeviceGetCount` / `nvmlDeviceGetCount_v2`
- `nvmlDeviceGetHandleByIndex` / `nvmlDeviceGetHandleByIndex_v2`
- `nvmlDeviceGetHandleByUUID`
- `nvmlDeviceGetHandleByPciBusId` / `nvmlDeviceGetHandleByPciBusId_v2`

### Device Information

- `nvmlDeviceGetName`
- `nvmlDeviceGetUUID`
- `nvmlDeviceGetPciInfo` / `nvmlDeviceGetPciInfo_v3`
- `nvmlDeviceGetMemoryInfo`

### Process Information

- `nvmlDeviceGetComputeRunningProcesses` (returns an empty list)
- `nvmlDeviceGetGraphicsRunningProcesses` (returns an empty list)

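The running-process queries follow NVML's caller-supplied-buffer convention: the caller passes its buffer capacity through an in/out count pointer, and the function writes back the number of entries. Because the mock reports no processes, it can set the count to zero and return success for any capacity. A minimal Go sketch of that behavior (the function name and return-code constant mirror the NVML C API but are illustrative here):

```go
package main

import "fmt"

// nvmlSuccess mirrors NVML_SUCCESS (0) from the NVML C API.
const nvmlSuccess = 0

// getComputeRunningProcesses mimics the NVML in/out count protocol:
// on entry *infoCount holds the caller's buffer capacity; on return it
// holds the number of entries written. The mock always reports zero
// processes, so any capacity is sufficient and no entries are written.
func getComputeRunningProcesses(infoCount *uint32) int {
	*infoCount = 0 // no simulated processes
	return nvmlSuccess
}

func main() {
	count := uint32(16) // capacity of the caller's (unused) buffer
	ret := getComputeRunningProcesses(&count)
	fmt.Println(ret, count)
}
```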
## Architecture

```
┌───────────────────────────────────────────┐
│             Your Application              │
│         (e.g., k8s-device-plugin)         │
└──────────────────┬────────────────────────┘
                   │ NVML C API
┌──────────────────▼────────────────────────┐
│          libnvidia-ml.so (Mock)           │
│                                           │
│ ┌───────────────────────────────────────┐ │
│ │          Bridge Layer (CGo)           │ │
│ │  - C function exports                 │ │
│ │  - Type conversions                   │ │
│ └─────────────┬─────────────────────────┘ │
│               │                           │
│ ┌─────────────▼─────────────────────────┐ │
│ │          Engine Layer (Go)            │ │
│ │  - Lifecycle management               │ │
│ │  - Handle table (C ↔ Go)              │ │
│ │  - Configuration                      │ │
│ └─────────────┬─────────────────────────┘ │
│               │                           │
│ ┌─────────────▼─────────────────────────┐ │
│ │        go-nvml Mock (dgxa100)         │ │
│ │  - Device simulation                  │ │
│ │  - Property storage                   │ │
│ └───────────────────────────────────────┘ │
└───────────────────────────────────────────┘
```

## Limitations

- **Maximum 8 GPUs**: Limited by the underlying `dgxa100` mock implementation
- **Subset of NVML API**: Only implements functions required for device plugin operation
- **Static device properties**: Device properties are set at initialization and don't change
- **No MIG support**: Multi-Instance GPU features are not implemented

## Development

### Running Tests

```bash
cd pkg/gpu/mocknvml/engine
go test -v -race -coverprofile=coverage.out ./...
```

### Adding New NVML Functions

1. Add the C function signature to `bridge/bridge.go`
2. Implement the Go wrapper that calls the engine
3. Add the corresponding method to the engine if needed
4. Add tests for the new functionality

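As a sketch of steps 2 and 3, here is what a wrapper/engine pair might look like for a function such as `nvmlSystemGetDriverVersion`. All names and shapes below are illustrative assumptions about the layering, not the library's actual code; the real bridge would sit behind a CGo `//export` declaration and copy into a caller-supplied C buffer, which is omitted here so the example stays plain Go:

```go
package main

import "fmt"

// nvmlSuccess mirrors NVML_SUCCESS (0).
const nvmlSuccess = 0

// engine stands in for the mock's engine state (illustrative shape).
type engine struct {
	driverVersion string
}

// Step 3: the engine-side method, working with plain Go types.
func (e *engine) DriverVersion() string {
	return e.driverVersion
}

// Step 2: the Go wrapper a CGo export would call. In the real bridge
// the destination would be a C char buffer supplied by the caller.
func systemGetDriverVersion(e *engine, buf []byte) int {
	copy(buf, e.DriverVersion())
	return nvmlSuccess
}

func main() {
	e := &engine{driverVersion: "550.54.15"}
	buf := make([]byte, 80)
	ret := systemGetDriverVersion(e, buf)
	fmt.Println(ret, string(buf[:len(e.DriverVersion())]))
}
```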
### Debugging

Enable verbose logging:

```bash
export MOCK_NVML_DEBUG=1
```

## Contributing

When adding new features:

1. Add unit tests for new functionality
2. Update this README with new configuration options
3. Ensure the Docker build still works
4. Run tests with race detection: `go test -race ./...`

## License

Apache License 2.0 - See the LICENSE file for details.

## Related Projects

- [go-nvml](https://github.com/NVIDIA/go-nvml) - Official NVIDIA Go bindings for NVML
- [k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin) - NVIDIA device plugin for Kubernetes
- [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) - Container toolkit for GPU support