Add comprehensive README for mock NVML library

ArangoGutierrez · ArangoGutierrez · commit c9336f998a08 · 2025-11-24T15:29:14.000+01:00
Documents usage, configuration, and examples for the mock libnvidia-ml.so

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
diff --git a/pkg/gpu/mocknvml/README.md b/pkg/gpu/mocknvml/README.md
@@ -0,0 +1,244 @@
+# Mock NVML Library
+
+A CGo-based mock implementation of NVIDIA's NVML (NVIDIA Management Library) for testing GPU-enabled applications without physical GPUs.
+
+## Overview
+
+This library provides a drop-in replacement for `libnvidia-ml.so` that simulates GPU devices and their properties. It's designed for testing Kubernetes components like the NVIDIA device plugin in environments without actual GPUs.
+
+**Key Features:**
+- 🔧 **Zero-config default**: Simulates DGX A100 system (8 GPUs) out of the box
+- 📋 **CDI-driven configuration**: Define custom GPU topologies via CDI specifications
+- 🐳 **Docker build support**: Build Linux binaries on macOS
+- 🧵 **Thread-safe**: Proper synchronization for concurrent access
+- ✅ **Well-tested**: Comprehensive unit tests with race detection
+
+## Quick Start
+
+### Building the Library
+
+#### Local Build (Linux)
+```bash
+cd pkg/gpu/mocknvml
+make
+```
+
+#### Docker Build (Cross-platform)
+```bash
+cd pkg/gpu/mocknvml
+make docker-build
+```
+
+This produces `libnvidia-ml.so` and `libnvidia-ml.h` in the current directory.
+
+### Using the Library
+
+#### Option 1: Default Configuration (8 A100 GPUs)
+
+```bash
+# Set library path
+export LD_LIBRARY_PATH=/path/to/pkg/gpu/mocknvml:$LD_LIBRARY_PATH
+
+# Run your application
+./your-gpu-application
+```
+
+#### Option 2: Custom Device Count
+
+```bash
+export LD_LIBRARY_PATH=/path/to/pkg/gpu/mocknvml:$LD_LIBRARY_PATH
+export MOCK_NVML_NUM_DEVICES=4
+
+./your-gpu-application
+```
+
+#### Option 3: CDI Specification
+
+```bash
+export LD_LIBRARY_PATH=/path/to/pkg/gpu/mocknvml:$LD_LIBRARY_PATH
+export NVIDIA_MOCK_CDI_SPEC=/path/to/cdi-spec.yaml
+
+./your-gpu-application
+```
+
+## Configuration
+
+### Environment Variables
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `NVIDIA_MOCK_CDI_SPEC` | Path to CDI specification file | _(none)_ |
+| `MOCK_NVML_NUM_DEVICES` | Number of GPU devices (max 8) | `8` |
+| `MOCK_NVML_DRIVER_VERSION` | NVIDIA driver version string | `550.54.15` |
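+
+The following is a minimal sketch of how the variables in the table could be resolved into a configuration. It is not the library's actual code: `mockConfig` and `loadConfig` are hypothetical names, and clamping an oversized device count to 8 (rather than rejecting it) is an assumption.
+
+```go
+package main
+
+import (
+	"fmt"
+	"strconv"
+)
+
+const (
+	defaultNumDevices    = 8           // default from the table above
+	maxNumDevices        = 8           // limit from the table above
+	defaultDriverVersion = "550.54.15" // default from the table above
+)
+
+// mockConfig is a hypothetical holder for the resolved settings.
+type mockConfig struct {
+	CDISpecPath   string
+	NumDevices    int
+	DriverVersion string
+}
+
+// loadConfig reads settings through a lookup function so it can be exercised
+// without touching the real environment (os.Getenv would be passed in practice).
+func loadConfig(getenv func(string) string) mockConfig {
+	cfg := mockConfig{
+		CDISpecPath:   getenv("NVIDIA_MOCK_CDI_SPEC"),
+		NumDevices:    defaultNumDevices,
+		DriverVersion: defaultDriverVersion,
+	}
+	if v := getenv("MOCK_NVML_DRIVER_VERSION"); v != "" {
+		cfg.DriverVersion = v
+	}
+	// Unset, non-numeric, or non-positive values keep the default.
+	if n, err := strconv.Atoi(getenv("MOCK_NVML_NUM_DEVICES")); err == nil && n > 0 {
+		// Assumption: values above the limit are clamped, not rejected.
+		if n > maxNumDevices {
+			n = maxNumDevices
+		}
+		cfg.NumDevices = n
+	}
+	return cfg
+}
+
+func main() {
+	env := map[string]string{"MOCK_NVML_NUM_DEVICES": "4"}
+	cfg := loadConfig(func(key string) string { return env[key] })
+	fmt.Printf("%d devices, driver %s\n", cfg.NumDevices, cfg.DriverVersion)
+}
+```
+
+Passing the lookup function as a parameter keeps the sketch testable; how the real engine orders precedence between the CDI spec and the device count is not shown here.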
+
+### CDI Specification
+
+Create a CDI spec to define custom GPU configurations:
+
+**Example: 2 GPU Configuration** (`cdi-spec-2gpu.yaml`)
+
+```yaml
+cdiVersion: 0.6.0
+kind: nvidia.com/gpu
+devices:
+  - name: "0"
+    containerEdits:
+      deviceNodes:
+        - path: /dev/nvidia0
+          type: c
+          major: 195
+          minor: 0
+    annotations:
+      nvidia.com/gpu.product: "NVIDIA A100-SXM4-40GB"
+      nvidia.com/gpu.uuid: "GPU-12345678-1234-1234-1234-123456789012"
+
+  - name: "1"
+    containerEdits:
+      deviceNodes:
+        - path: /dev/nvidia1
+          type: c
+          major: 195
+          minor: 1
+    annotations:
+      nvidia.com/gpu.product: "NVIDIA A100-SXM4-40GB"
+      nvidia.com/gpu.uuid: "GPU-87654321-4321-4321-4321-210987654321"
+```
+
+**Usage:**
+```bash
+export NVIDIA_MOCK_CDI_SPEC=./cdi-spec-2gpu.yaml
+export LD_LIBRARY_PATH=/path/to/pkg/gpu/mocknvml:$LD_LIBRARY_PATH
+./your-application
+```
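+
+As a rough illustration of how the annotations in such a spec could be turned into simulated devices, the sketch below extracts the product name and UUID per device. To stay within the Go standard library it parses a JSON encoding of the same fields; the YAML file above would need a YAML decoder instead. The types `cdiSpec`, `cdiDevice`, and `mockGPU`, the function `gpusFromSpec`, and the rule that a missing UUID is an error are all assumptions, not the library's implementation.
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+)
+
+// cdiSpec mirrors only the fields used by this sketch.
+type cdiSpec struct {
+	CDIVersion string      `json:"cdiVersion"`
+	Kind       string      `json:"kind"`
+	Devices    []cdiDevice `json:"devices"`
+}
+
+type cdiDevice struct {
+	Name        string            `json:"name"`
+	Annotations map[string]string `json:"annotations"`
+}
+
+// mockGPU is a hypothetical record for one simulated device.
+type mockGPU struct {
+	Index   int
+	Product string
+	UUID    string
+}
+
+// gpusFromSpec decodes the spec and maps each device entry to a mockGPU.
+func gpusFromSpec(data []byte) ([]mockGPU, error) {
+	var spec cdiSpec
+	if err := json.Unmarshal(data, &spec); err != nil {
+		return nil, fmt.Errorf("parse CDI spec: %w", err)
+	}
+	gpus := make([]mockGPU, 0, len(spec.Devices))
+	for i, d := range spec.Devices {
+		uuid := d.Annotations["nvidia.com/gpu.uuid"]
+		if uuid == "" {
+			// Assumption: a device without a UUID annotation is rejected.
+			return nil, fmt.Errorf("device %q: missing nvidia.com/gpu.uuid", d.Name)
+		}
+		gpus = append(gpus, mockGPU{
+			Index:   i,
+			Product: d.Annotations["nvidia.com/gpu.product"],
+			UUID:    uuid,
+		})
+	}
+	return gpus, nil
+}
+
+// exampleSpec is a JSON rendering of the first device from the YAML example.
+const exampleSpec = `{
+  "cdiVersion": "0.6.0",
+  "kind": "nvidia.com/gpu",
+  "devices": [
+    {
+      "name": "0",
+      "annotations": {
+        "nvidia.com/gpu.product": "NVIDIA A100-SXM4-40GB",
+        "nvidia.com/gpu.uuid": "GPU-12345678-1234-1234-1234-123456789012"
+      }
+    }
+  ]
+}`
+
+func main() {
+	gpus, err := gpusFromSpec([]byte(exampleSpec))
+	if err != nil {
+		fmt.Println("error:", err)
+		return
+	}
+	for _, g := range gpus {
+		fmt.Printf("GPU %d: %s (%s)\n", g.Index, g.Product, g.UUID)
+	}
+}
+```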
+
+## Example: Testing NVIDIA Device Plugin
+
+```bash
+# Build the mock library
+cd pkg/gpu/mocknvml
+make docker-build
+
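+# Set up environment (affects only processes started from this shell)
+export LD_LIBRARY_PATH=$(pwd):$LD_LIBRARY_PATH
+export MOCK_NVML_NUM_DEVICES=4
+
+# Run device plugin (the pod does not inherit the variables above; set them in
+# the container spec of device-plugin.yaml and make the library available there)
+kubectl apply -f device-plugin.yaml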
+```
+
+## Example: Integration with Kind
+
+```bash
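+# Create Kind cluster (by default this creates a single node named
+# gpu-test-control-plane; pass --config with a worker node so that
+# gpu-test-worker exists)
+kind create cluster --name gpu-test
+
+# Copy the mock library into the worker node container
+docker cp pkg/gpu/mocknvml/libnvidia-ml.so gpu-test-worker:/usr/lib/x86_64-linux-gnu/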
+
+# Deploy device plugin with mock library
+kubectl apply -f deployments/device-plugin-mock.yaml
+```
+
+## Supported NVML Functions
+
+The mock library currently implements the following NVML functions:
+
+### Initialization
+- `nvmlInit` / `nvmlInit_v2`
+- `nvmlShutdown`
+
+### Device Enumeration
+- `nvmlDeviceGetCount` / `nvmlDeviceGetCount_v2`
+- `nvmlDeviceGetHandleByIndex` / `nvmlDeviceGetHandleByIndex_v2`
+- `nvmlDeviceGetHandleByUUID`
+- `nvmlDeviceGetHandleByPciBusId` / `nvmlDeviceGetHandleByPciBusId_v2`
+
+### Device Information
+- `nvmlDeviceGetName`
+- `nvmlDeviceGetUUID`
+- `nvmlDeviceGetPciInfo` / `nvmlDeviceGetPciInfo_v3`
+- `nvmlDeviceGetMemoryInfo`
+
+### Process Information
+- `nvmlDeviceGetComputeRunningProcesses` (returns empty list)
+- `nvmlDeviceGetGraphicsRunningProcesses` (returns empty list)
+
+## Architecture
+
+```
+┌─────────────────────────────────────────┐
+│         Your Application                 │
+│    (e.g., k8s-device-plugin)            │
+└─────────────────┬───────────────────────┘
+                  │ NVML C API
+┌─────────────────▼───────────────────────┐
+│      libnvidia-ml.so (Mock)             │
+│                                          │
+│  ┌────────────────────────────────────┐ │
+│  │  Bridge Layer (CGo)                │ │
+│  │  - C function exports              │ │
+│  │  - Type conversions                │ │
+│  └────────────┬───────────────────────┘ │
+│               │                          │
+│  ┌────────────▼───────────────────────┐ │
+│  │  Engine Layer (Go)                 │ │
+│  │  - Lifecycle management            │ │
+│  │  - Handle table (C ↔ Go)          │ │
+│  │  - Configuration                   │ │
+│  └────────────┬───────────────────────┘ │
+│               │                          │
+│  ┌────────────▼───────────────────────┐ │
+│  │  go-nvml Mock (dgxa100)            │ │
+│  │  - Device simulation               │ │
+│  │  - Property storage                │ │
+│  └────────────────────────────────────┘ │
+└──────────────────────────────────────────┘
+```
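+
+The "Handle table (C ↔ Go)" box exists because cgo does not allow C code to keep Go pointers after a call returns, so C callers are usually given opaque integer handles instead. The sketch below shows one way such a table could look, guarded by a `sync.RWMutex` for concurrent access. The names `handleTable`, `register`, `lookup`, and `size` are hypothetical and not taken from the engine.
+
+```go
+package main
+
+import (
+	"fmt"
+	"sync"
+)
+
+// device is a stand-in for the Go-side mock device object.
+type device struct {
+	Name string
+	UUID string
+}
+
+// handleTable maps opaque integer handles to Go objects so that only
+// integers, never Go pointers, cross the C boundary.
+type handleTable struct {
+	mu      sync.RWMutex
+	next    uintptr
+	devices map[uintptr]*device
+}
+
+func newHandleTable() *handleTable {
+	// Handle 0 is reserved so that a zeroed C handle is always invalid.
+	return &handleTable{next: 1, devices: make(map[uintptr]*device)}
+}
+
+// register stores d and returns a fresh non-zero handle for it.
+func (t *handleTable) register(d *device) uintptr {
+	t.mu.Lock()
+	defer t.mu.Unlock()
+	h := t.next
+	t.next++
+	t.devices[h] = d
+	return h
+}
+
+// lookup resolves a handle back to its device; ok is false for unknown handles.
+func (t *handleTable) lookup(h uintptr) (*device, bool) {
+	t.mu.RLock()
+	defer t.mu.RUnlock()
+	d, ok := t.devices[h]
+	return d, ok
+}
+
+// size reports how many handles are currently registered.
+func (t *handleTable) size() int {
+	t.mu.RLock()
+	defer t.mu.RUnlock()
+	return len(t.devices)
+}
+
+func main() {
+	table := newHandleTable()
+	var wg sync.WaitGroup
+	for i := 0; i < 8; i++ {
+		wg.Add(1)
+		go func(i int) {
+			defer wg.Done()
+			table.register(&device{
+				Name: "NVIDIA A100-SXM4-40GB",
+				UUID: fmt.Sprintf("GPU-%04d", i),
+			})
+		}(i)
+	}
+	wg.Wait()
+	fmt.Println("registered devices:", table.size())
+}
+```
+
+The standard library's `runtime/cgo.Handle` (Go 1.17+) offers the same idea ready-made; a hand-rolled table is shown only to make the mechanism visible.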
+
+## Limitations
+
+- **Maximum 8 GPUs**: Limited by the underlying `dgxa100` mock implementation
+- **Subset of NVML API**: Only implements functions required for device plugin operation
+- **Static device properties**: Device properties are set at initialization and don't change
+- **No MIG support**: Multi-Instance GPU features are not implemented
+
+## Development
+
+### Running Tests
+
+```bash
+cd pkg/gpu/mocknvml/engine
+go test -v -race -coverprofile=coverage.out ./...
+```
+
+### Adding New NVML Functions
+
+1. Add the C function signature to `bridge/bridge.go`
+2. Implement the Go wrapper that calls the engine
+3. Add a corresponding method to the engine if needed
+4. Add tests for the new functionality
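+
+The steps above can be sketched in plain Go, using `nvmlDeviceGetMinorNumber` (a real NVML function that is not in the supported list) as the function being added. Everything here is illustrative: the `engine` type, its `deviceMinorNumber` method, and `bridgeDeviceGetMinorNumber` are hypothetical, and the wrapper is an ordinary Go function standing in for the exported cgo symbol that would live in `bridge/bridge.go`. Only the numeric return codes are taken from `nvml.h`.
+
+```go
+package main
+
+import (
+	"errors"
+	"fmt"
+)
+
+// Return code values as defined in nvml.h.
+const (
+	nvmlSuccess              = 0
+	nvmlErrorUninitialized   = 1
+	nvmlErrorInvalidArgument = 2
+)
+
+var (
+	errNotInitialized = errors.New("engine not initialized")
+	errBadIndex       = errors.New("device index out of range")
+)
+
+// engine is a hypothetical, heavily reduced stand-in for the engine layer.
+type engine struct {
+	initialized bool
+	minors      []uint32
+}
+
+// deviceMinorNumber is the engine method (step 3): idiomatic Go, returns an error.
+func (e *engine) deviceMinorNumber(index int) (uint32, error) {
+	if !e.initialized {
+		return 0, errNotInitialized
+	}
+	if index < 0 || index >= len(e.minors) {
+		return 0, errBadIndex
+	}
+	return e.minors[index], nil
+}
+
+// bridgeDeviceGetMinorNumber is the wrapper (steps 1 and 2): it validates the
+// out-pointer, calls the engine, and translates Go errors to NVML return codes.
+func bridgeDeviceGetMinorNumber(e *engine, index int, out *uint32) int {
+	if out == nil {
+		return nvmlErrorInvalidArgument
+	}
+	minor, err := e.deviceMinorNumber(index)
+	switch {
+	case errors.Is(err, errNotInitialized):
+		return nvmlErrorUninitialized
+	case err != nil:
+		return nvmlErrorInvalidArgument
+	}
+	*out = minor
+	return nvmlSuccess
+}
+
+func main() {
+	e := &engine{initialized: true, minors: []uint32{0, 1}}
+	var minor uint32
+	ret := bridgeDeviceGetMinorNumber(e, 1, &minor)
+	fmt.Printf("ret=%d minor=%d\n", ret, minor)
+}
+```
+
+Keeping error translation in the wrapper lets the engine method be tested (step 4) with ordinary Go tests, without a C toolchain.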
+
+### Debugging
+
+Enable verbose logging:
+```bash
+export MOCK_NVML_DEBUG=1
+```
+
+## Contributing
+
+When adding new features:
+1. Add unit tests for new functionality
+2. Update this README with new configuration options
+3. Ensure Docker build still works
+4. Run tests with race detection: `go test -race ./...`
+
+## License
+
+Apache License 2.0 - See LICENSE file for details.
+
+## Related Projects
+
+- [go-nvml](https://github.com/NVIDIA/go-nvml) - Official NVIDIA Go bindings for NVML
+- [k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin) - NVIDIA device plugin for Kubernetes
+- [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) - Container toolkit for GPU support