
Conversation

@ajcasagrande (Contributor) commented Jan 23, 2026

Summary by CodeRabbit

  • New Features

    • Added PyNVML-based local GPU telemetry option and a collector-type switch; added SUMMARY and REALTIME_DASHBOARD modes
    • Expanded GPU metrics (memory, SM, decoder, encoder, JPEG utilizations)
  • Documentation

    • Updated GPU telemetry guide with PyNVML setup, examples, comparisons vs DCGM, tips, and clarified default endpoint behavior
  • Chores

    • Added runtime dependency for PyNVML support (nvidia-ml-py)


@github-actions github-actions bot added the feat label Jan 23, 2026

github-actions bot commented Jan 23, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@2e7356d373093ae3e950e5bfb0dbe0ad579894c6

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@2e7356d373093ae3e950e5bfb0dbe0ad579894c6

Last updated for commit: 2e7356d

@ajcasagrande ajcasagrande requested a review from lkomali January 23, 2026 02:09

coderabbitai bot commented Jan 23, 2026

Walkthrough

This PR adds PyNVML-based local GPU telemetry alongside DCGM, refactors telemetry into protocol/factory patterns, renames and re-exports DCGM collector, extends telemetry metrics and constants, updates configuration parsing and docs, and adds comprehensive PyNVML tests and related test updates.

Changes

Cohort / File(s) Summary
Documentation
docs/cli_options.md, docs/tutorials/gpu-telemetry.md
Documented new pynvml mode, added local GPU monitoring path, examples, prerequisites, comparison table, and clarified mode/endpoint behavior.
Enums & Types
src/aiperf/common/enums/...
Added GPUTelemetryCollectorType (DCGM, PYNVML) and new GPUTelemetryMode members (SUMMARY, REALTIME_DASHBOARD) and exported them.
Config Parsing
src/aiperf/common/config/user_config.py
Added gpu_telemetry_collector_type parsing, validation (disallow mixing pynvml with DCGM URLs), runtime pynvml availability checks, and public accessor.
Models
src/aiperf/common/models/telemetry_models.py
Added metrics fields: mem_utilization, sm_utilization, decoder_utilization, encoder_utilization, jpg_utilization; relaxed dcgm_url description.
Constants & Public API
src/aiperf/gpu_telemetry/constants.py, src/aiperf/gpu_telemetry/__init__.py
Added PYNVML_SOURCE_IDENTIFIER, DCGM_SCALING_FACTORS, PYNVML_SCALING_FACTORS; expanded DCGM field mapping and updated public exports to include collectors, factory, protocol, and scaling constants.
DCGM Collector Refactor
src/aiperf/gpu_telemetry/dcgm_collector.py
Renamed GPUTelemetryDataCollector → DCGMTelemetryCollector, registered with factory/protocol, switched to DCGM_SCALING_FACTORS, updated docstrings and exports.
PyNVML Collector (new)
src/aiperf/gpu_telemetry/pynvml_collector.py
New PyNVMLTelemetryCollector implementing protocol: NVML init/shutdown, periodic metric collection, TelemetryRecord emission, error handling, and callbacks.
Factory & Protocol
src/aiperf/gpu_telemetry/factories.py
New GPUTelemetryCollectorProtocol, TRecordCallback, TErrorCallback, and GPUTelemetryCollectorFactory to register/create collectors by type.
Manager
src/aiperf/gpu_telemetry/manager.py
Refactored to use protocol/factory, added collector_type selection, separate DCGM and pynvml configuration paths, generalized endpoint handling and status reporting.
Controller / Mixins
src/aiperf/controller/system_controller.py, src/aiperf/common/mixins/base_metrics_collector_mixin.py
Minor doc/logging tweaks (removed debug log, updated docstring/copyright).
Tests — Collector Renames & Updates
tests/.../test_dcgm_faker.py, tests/unit/gpu_telemetry/..., tests/unit/server/test_dcgm_faker.py
Replaced GPUTelemetryDataCollector usages with DCGMTelemetryCollector, updated imports and assertions to new API (id/endpoint_url).
Tests — PyNVML
tests/unit/gpu_telemetry/test_pynvml_collector.py
New comprehensive unit tests (mocked pynvml) covering init, NVML lifecycle, metrics, scaling, callbacks, and error cases.
Tests — Config & Manager
tests/unit/common/config/test_user_config.py, tests/unit/gpu_telemetry/test_telemetry_manager.py
Added tests for pynvml validation (mixing DCGM URLs), dashboard interactions, and manager collector_type flows.
Packaging
pyproject.toml
Added dependency nvidia-ml-py (no version pinned).
Integration Test Adjustments
tests/integration/test_custom_gpu_metrics.py, tests/integration/test_dcgm_faker.py
Adjusted expected metric counts and removed direct reliance on GPU_TELEMETRY_METRICS_CONFIG; added DCGM faker metric count constant.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I hopped through logs and NVML fields,
I found new collectors in tidy shields.
DCGM and pynvml now share the burrow,
Factories line up in neat little rows,
Metrics collected, carrots and code — hooray! 🎉

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly and specifically describes the main change: adding support for local GPU telemetry via the pynvml library. The title directly maps to the substantial feature additions across the codebase.
Docstring Coverage ✅ Passed Docstring coverage is 89.60% which is sufficient. The required threshold is 80.00%.




@coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unit/gpu_telemetry/test_telemetry_manager.py (1)

639-696: Add test coverage for PYNVML collector integration in manager's configure phase.

While test_pynvml_collector.py covers PyNVMLTelemetryCollector functionality thoroughly, the manager-level integration tests in this file only exercise the DCGM collector path. The _profile_configure_command method should be tested with GPUTelemetryCollectorType.PYNVML to ensure feature parity with DCGM coverage.

🤖 Fix all issues with AI agents
In `@src/aiperf/gpu_telemetry/manager.py`:
- Around line 226-236: The broad `except Exception` handlers should be narrowed
or explicitly exempted from Ruff BLE001; locate the three handlers that log "GPU
Telemetry: Failed to configure pynvml collector" (the block shown and the other
occurrences around the comment ranges 317-319 and 381-382) and either (A)
replace `except Exception as e:` with specific exception types that you expect
from the pynvml setup/telemetry code (e.g., NVML-specific exceptions, OSError,
ImportError, ValueError, etc.), or (B) if a broad catch is intentional for
fault-tolerant telemetry, add a `# noqa: BLE001` comment on the `except
Exception` line and keep the logging behavior unchanged; ensure the first
RuntimeError handler remains unchanged and that the error messages remain
descriptive.
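Option (A) from the fix instructions above can be sketched as follows; the logger name, message, and chosen exception types are illustrative, not copied from aiperf's manager.py:

```python
# Sketch of narrowing a broad `except Exception` to expected failure modes
# (option A above). The helper name and exception list are assumptions.
import logging

logger = logging.getLogger("gpu_telemetry")


def configure_pynvml_collector(make_collector):
    try:
        return make_collector()
    except (ImportError, OSError, ValueError) as e:
        # Expected failure modes: pynvml missing, driver/library errors,
        # or invalid configuration -- log and degrade gracefully.
        logger.warning("GPU Telemetry: Failed to configure pynvml collector: %s", e)
        return None
```

Anything outside the listed types then propagates, which is the point of satisfying BLE001 without a `# noqa`.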

In `@src/aiperf/gpu_telemetry/pynvml_collector.py`:
- Around line 121-148: The async method is_url_reachable performs blocking NVML
calls directly (pynvml.nvmlInit(), nvmlDeviceGetCount(), pynvml.nvmlShutdown())
which can block the event loop; move the entire synchronous NVML probe into a
synchronous helper and call it via asyncio.to_thread (or wrap each blocking call
with asyncio.to_thread) from is_url_reachable so the NVML init/count/shutdown
run off the event loop, propagate errors to return False on failure, and keep
the existing behavior of returning True only if device count > 0; reference the
is_url_reachable function and the NVML calls (pynvml.nvmlInit,
nvmlDeviceGetCount, pynvml.nvmlShutdown) when making the change.
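The suggested fix can be sketched as below: keep the NVML probe synchronous in a helper and run it off the event loop with asyncio.to_thread. The pynvml module is passed in as a parameter here so the structure can be shown without an NVIDIA driver present:

```python
# Sketch of the suggested fix: blocking NVML calls moved into a synchronous
# helper that runs in a worker thread. `pynvml` is injected for illustration.
import asyncio


def _probe_nvml(pynvml) -> bool:
    """Blocking NVML probe: init, count devices, shut down."""
    try:
        pynvml.nvmlInit()
        try:
            return pynvml.nvmlDeviceGetCount() > 0
        finally:
            pynvml.nvmlShutdown()
    except Exception:
        return False


async def is_url_reachable(pynvml) -> bool:
    # Run the synchronous probe in a thread so the event loop stays free.
    return await asyncio.to_thread(_probe_nvml, pynvml)
```

This preserves the existing behavior (True only when the device count is positive, False on any failure) while keeping the event loop responsive during init.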
🧹 Nitpick comments (5)
src/aiperf/common/config/user_config.py (2)

482-490: Consider removing the unused # noqa: F401 directive.

The # noqa: F401 comment suppresses "imported but unused" warnings, but based on the static analysis hint, this rule may not be enabled in your linter configuration. If F401 is not enforced, the directive is unnecessary.

However, if you want to keep it for future-proofing (in case F401 gets enabled), that's also acceptable.

🔧 Suggested fix
             elif item.lower() == "pynvml":
                 collector_type = GPUTelemetryCollectorType.PYNVML
                 try:
-                    import pynvml  # noqa: F401
+                    import pynvml  # Runtime check for availability
+                    del pynvml  # Explicitly mark as unused
                 except ImportError as e:
                     raise ValueError(
                         "pynvml package not installed. Install with: pip install nvidia-ml-py"
                     ) from e

525-528: Consider adding a setter for symmetry with gpu_telemetry_mode.

The gpu_telemetry_mode property has both a getter and setter, but gpu_telemetry_collector_type only has a getter. If there's a use case for programmatically changing the collector type after initialization, adding a setter would maintain consistency.

tests/unit/server/test_dcgm_faker.py (1)

188-189: Minor: Update docstring to reflect renamed class.

The class docstring still references "TelemetryDataCollector" but the code now uses DCGMTelemetryCollector.

Suggested fix
 class TestDCGMFakerTelemetryCollector:
-    """Test DCGMFaker by parsing output with actual TelemetryDataCollector."""
+    """Test DCGMFaker by parsing output with actual DCGMTelemetryCollector."""
tests/integration/test_dcgm_faker.py (1)

12-17: Minor: Update docstrings to reflect renamed class.

Both the class docstring (line 13) and test method docstring (line 17) still reference "TelemetryDataCollector" instead of DCGMTelemetryCollector.

Suggested fix
 class TestDCGMFaker:
-    """Test DCGMFaker by parsing output with actual TelemetryDataCollector."""
+    """Test DCGMFaker by parsing output with actual DCGMTelemetryCollector."""
 
     @pytest.mark.parametrize("gpu_name", GPU_CONFIGS.keys())
     def test_faker_output_parsed_by_real_telemetry_collector(self, gpu_name):
-        """Test that faker output is parsed correctly by actual TelemetryDataCollector."""
+        """Test that faker output is parsed correctly by actual DCGMTelemetryCollector."""
src/aiperf/gpu_telemetry/pynvml_collector.py (1)

43-48: Remove unnecessary noqa directive.

The # noqa: F401 is flagged as unused since the F401 rule isn't enabled. Remove it for cleaner code.

Suggested fix
     try:
-        import pynvml  # noqa: F401
+        import pynvml

         return True
     except ImportError:
         return False


codecov bot commented Jan 23, 2026

Codecov Report

❌ Patch coverage is 95.48387% with 14 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/aiperf/gpu_telemetry/pynvml_collector.py 94.44% 7 Missing and 5 partials ⚠️
src/aiperf/common/config/user_config.py 88.23% 2 Missing ⚠️


@ajcasagrande (Contributor, Author):

@coderabbitai review the updated changes please


coderabbitai bot commented Jan 23, 2026

@ajcasagrande I'll review the updated changes in this PR. Let me analyze the implementation of the pynvml GPU telemetry support.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@matthewkotila (Contributor) left a comment


was cool to read 🙌

Contributor:

this is only useful when the server is being run locally, right?

should we display a warning if the server url is not localhost and user uses --gpu-telemetry pynvml?
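The warning the reviewer suggests could be sketched as below; the helper name, host list, and message are hypothetical, not part of the PR:

```python
# Sketch of warning when pynvml telemetry is paired with a non-local server
# URL, since pynvml only sees GPUs on the machine running aiperf.
# `warn_if_remote` and LOCAL_HOSTS are illustrative names.
from urllib.parse import urlparse

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1", "0.0.0.0"}


def warn_if_remote(server_url: str):
    host = urlparse(server_url).hostname
    if host and host not in LOCAL_HOSTS:
        return (
            f"pynvml telemetry only measures local GPUs, but the server URL "
            f"points at {host!r}; metrics may not reflect the serving host."
        )
    return None  # local host: nothing to warn about
```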

RuntimeError: If NVML initialization or GPU discovery fails.
"""
try:
pynvml.nvmlInit()
Contributor:

this is generally a really fast call? should we wrap it in .to_thread() like other pynvml calls and await it?

@brandonpelfrey (Contributor) commented Jan 27, 2026:

This is very fast and only happens once during program startup. I ran this in a loop (separate processes, each only timing the init call) and the mean time is 0.01 seconds.
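A rough reproduction of that timing approach is sketched below. The comment above timed separate processes; this in-process harness only illustrates the averaging, with the NVML call swapped for an injectable callable so it runs without a driver:

```python
# Illustrative timing harness: average the wall-clock cost of a call over
# several repeats. Pass pynvml.nvmlInit (or any callable) as `fn`.
import statistics
import time


def mean_call_time(fn, repeats: int = 10) -> float:
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)
```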

from aiperf.gpu_telemetry.pynvml_collector import PyNVMLTelemetryCollector

collector_id = "pynvml_collector"
collector = PyNVMLTelemetryCollector(
Contributor:

should this use GPUTelemetryCollectorFactory.create_instance()?

timestamp_ns: int = Field(
description="Nanosecond wall-clock timestamp when telemetry was collected (time_ns)"
)
dcgm_url: str = Field(
Contributor:

should this identifier be renamed to something like telemetry_url?

telemetry_data = TelemetryMetrics()

# Power usage (milliwatts -> watts)
with contextlib.suppress(NVMLError):
Contributor:

should we have some debug logging for the cases where collecting a metric fails?
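That suggestion could be sketched as a small helper that replaces the bare contextlib.suppress with a logged skip. NVMLError is stubbed as a plain Exception subclass here so the example runs without pynvml; the helper name is hypothetical:

```python
# Sketch: collect one metric, logging failures at debug level instead of
# silently suppressing them. `collect_metric` and the stub NVMLError are
# illustrative, not the PR's actual code.
import logging

logger = logging.getLogger("pynvml_collector")


class NVMLError(Exception):  # stand-in for pynvml.NVMLError
    pass


def collect_metric(name: str, getter, sink: dict) -> None:
    try:
        sink[name] = getter()
    except NVMLError as e:
        logger.debug("Skipping metric %s: %s", name, e)
```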

Contributor:

For readability, it might be nice if there was a single context here. I think a one-time warning for unhandled NVMLError would make sense.

Note that the NVMLNotFound specific error (not 100% sure on the name) is thrown in the happy case when nothing went 'wrong' but there simply isn't data to report. This can happen for instance if at the start of sampling no program is using the GPU yet -- so that shouldn't show as an error to anyone.



4 participants