
Conversation

@ajcasagrande (Contributor) commented Jan 23, 2026

Summary by CodeRabbit

  • New Features

    • Added PyNVML-based local GPU telemetry option and a collector-type switch; added SUMMARY and REALTIME_DASHBOARD modes
    • Expanded GPU metrics (memory, SM, decoder, encoder, JPEG utilizations)
  • Documentation

    • Updated GPU telemetry guide with PyNVML setup, examples, comparisons vs DCGM, tips, and clarified default endpoint behavior
  • Chores

    • Added runtime dependency for PyNVML support (nvidia-ml-py)


@github-actions github-actions bot added the feat label Jan 23, 2026

github-actions bot commented Jan 23, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@2e7356d373093ae3e950e5bfb0dbe0ad579894c6

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@2e7356d373093ae3e950e5bfb0dbe0ad579894c6

Last updated for commit: 2e7356d

@ajcasagrande ajcasagrande requested a review from lkomali January 23, 2026 02:09

coderabbitai bot commented Jan 23, 2026

Walkthrough

This PR adds PyNVML-based local GPU telemetry alongside DCGM, refactors telemetry into protocol/factory patterns, renames and re-exports DCGM collector, extends telemetry metrics and constants, updates configuration parsing and docs, and adds comprehensive PyNVML tests and related test updates.

Changes

Cohort / File(s) Summary
Documentation
docs/cli_options.md, docs/tutorials/gpu-telemetry.md
Documented new pynvml mode, added local GPU monitoring path, examples, prerequisites, comparison table, and clarified mode/endpoint behavior.
Enums & Types
src/aiperf/common/enums/...
Added GPUTelemetryCollectorType (DCGM, PYNVML) and new GPUTelemetryMode members (SUMMARY, REALTIME_DASHBOARD) and exported them.
Config Parsing
src/aiperf/common/config/user_config.py
Added gpu_telemetry_collector_type parsing, validation (disallow mixing pynvml with DCGM URLs), runtime pynvml availability checks, and public accessor.
Models
src/aiperf/common/models/telemetry_models.py
Added metrics fields: mem_utilization, sm_utilization, decoder_utilization, encoder_utilization, jpg_utilization; relaxed dcgm_url description.
Constants & Public API
src/aiperf/gpu_telemetry/constants.py, src/aiperf/gpu_telemetry/__init__.py
Added PYNVML_SOURCE_IDENTIFIER, DCGM_SCALING_FACTORS, PYNVML_SCALING_FACTORS; expanded DCGM field mapping and updated public exports to include collectors, factory, protocol, and scaling constants.
DCGM Collector Refactor
src/aiperf/gpu_telemetry/dcgm_collector.py
Renamed GPUTelemetryDataCollector → DCGMTelemetryCollector, registered with factory/protocol, switched to DCGM_SCALING_FACTORS, updated docstrings and exports.
PyNVML Collector (new)
src/aiperf/gpu_telemetry/pynvml_collector.py
New PyNVMLTelemetryCollector implementing protocol: NVML init/shutdown, periodic metric collection, TelemetryRecord emission, error handling, and callbacks.
Factory & Protocol
src/aiperf/gpu_telemetry/factories.py
New GPUTelemetryCollectorProtocol, TRecordCallback, TErrorCallback, and GPUTelemetryCollectorFactory to register/create collectors by type.
Manager
src/aiperf/gpu_telemetry/manager.py
Refactored to use protocol/factory, added collector_type selection, separate DCGM and pynvml configuration paths, generalized endpoint handling and status reporting.
Controller / Mixins
src/aiperf/controller/system_controller.py, src/aiperf/common/mixins/base_metrics_collector_mixin.py
Minor doc/logging tweaks (removed debug log, updated docstring/copyright).
Tests — Collector Renames & Updates
tests/.../test_dcgm_faker.py, tests/unit/gpu_telemetry/..., tests/unit/server/test_dcgm_faker.py
Replaced GPUTelemetryDataCollector usages with DCGMTelemetryCollector, updated imports and assertions to new API (id/endpoint_url).
Tests — PyNVML
tests/unit/gpu_telemetry/test_pynvml_collector.py
New comprehensive unit tests (mocked pynvml) covering init, NVML lifecycle, metrics, scaling, callbacks, and error cases.
Tests — Config & Manager
tests/unit/common/config/test_user_config.py, tests/unit/gpu_telemetry/test_telemetry_manager.py
Added tests for pynvml validation (mixing DCGM URLs), dashboard interactions, and manager collector_type flows.
Packaging
pyproject.toml
Added dependency nvidia-ml-py (no version pinned).
Integration Test Adjustments
tests/integration/test_custom_gpu_metrics.py, tests/integration/test_dcgm_faker.py
Adjusted expected metric counts and removed direct reliance on GPU_TELEMETRY_METRICS_CONFIG; added DCGM faker metric count constant.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I hopped through logs and NVML fields,
I found new collectors in tidy shields.
DCGM and pynvml now share the burrow,
Factories line up in neat little rows,
Metrics collected, carrots and code — hooray! 🎉

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly and specifically describes the main change: adding support for local GPU telemetry via the pynvml library. The title directly maps to the substantial feature additions across the codebase.
Docstring Coverage ✅ Passed Docstring coverage is 89.60% which is sufficient. The required threshold is 80.00%.




@coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unit/gpu_telemetry/test_telemetry_manager.py (1)

639-696: Add test coverage for PYNVML collector integration in manager's configure phase.

While test_pynvml_collector.py covers PyNVMLTelemetryCollector functionality thoroughly, the manager-level integration tests in this file only exercise the DCGM collector path. The _profile_configure_command method should be tested with GPUTelemetryCollectorType.PYNVML to ensure feature parity with DCGM coverage.

🤖 Fix all issues with AI agents
In `@src/aiperf/gpu_telemetry/manager.py`:
- Around line 226-236: The broad `except Exception` handlers should be narrowed
or explicitly exempted from Ruff BLE001; locate the three handlers that log "GPU
Telemetry: Failed to configure pynvml collector" (the block shown and the other
occurrences around the comment ranges 317-319 and 381-382) and either (A)
replace `except Exception as e:` with specific exception types that you expect
from the pynvml setup/telemetry code (e.g., NVML-specific exceptions, OSError,
ImportError, ValueError, etc.), or (B) if a broad catch is intentional for
fault-tolerant telemetry, add a `# noqa: BLE001` comment on the `except
Exception` line and keep the logging behavior unchanged; ensure the first
RuntimeError handler remains unchanged and that the error messages remain
descriptive.
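Option (A) from the fix instructions above can be sketched as follows; the logger name, message, and chosen exception types are illustrative, not copied from aiperf's manager.py:

```python
# Sketch of narrowing a broad `except Exception` to expected failure modes
# (option A above). The helper name and exception list are assumptions.
import logging

logger = logging.getLogger("gpu_telemetry")


def configure_pynvml_collector(make_collector):
    try:
        return make_collector()
    except (ImportError, OSError, ValueError) as e:
        # Expected failure modes: pynvml missing, driver/library errors,
        # or invalid configuration -- log and degrade gracefully.
        logger.warning("GPU Telemetry: Failed to configure pynvml collector: %s", e)
        return None
```

Anything outside the listed types then propagates, which is the point of satisfying BLE001 without a `# noqa`.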

In `@src/aiperf/gpu_telemetry/pynvml_collector.py`:
- Around line 121-148: The async method is_url_reachable performs blocking NVML
calls directly (pynvml.nvmlInit(), nvmlDeviceGetCount(), pynvml.nvmlShutdown())
which can block the event loop; move the entire synchronous NVML probe into a
synchronous helper and call it via asyncio.to_thread (or wrap each blocking call
with asyncio.to_thread) from is_url_reachable so the NVML init/count/shutdown
run off the event loop, propagate errors to return False on failure, and keep
the existing behavior of returning True only if device count > 0; reference the
is_url_reachable function and the NVML calls (pynvml.nvmlInit,
nvmlDeviceGetCount, pynvml.nvmlShutdown) when making the change.
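The suggested fix can be sketched as below: keep the NVML probe synchronous in a helper and run it off the event loop with asyncio.to_thread. The pynvml module is passed in as a parameter here so the structure can be shown without an NVIDIA driver present:

```python
# Sketch of the suggested fix: blocking NVML calls moved into a synchronous
# helper that runs in a worker thread. `pynvml` is injected for illustration.
import asyncio


def _probe_nvml(pynvml) -> bool:
    """Blocking NVML probe: init, count devices, shut down."""
    try:
        pynvml.nvmlInit()
        try:
            return pynvml.nvmlDeviceGetCount() > 0
        finally:
            pynvml.nvmlShutdown()
    except Exception:
        return False


async def is_url_reachable(pynvml) -> bool:
    # Run the synchronous probe in a thread so the event loop stays free.
    return await asyncio.to_thread(_probe_nvml, pynvml)
```

This preserves the existing behavior (True only when the device count is positive, False on any failure) while keeping the event loop responsive during init.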
🧹 Nitpick comments (5)
src/aiperf/common/config/user_config.py (2)

482-490: Consider removing the unused # noqa: F401 directive.

The # noqa: F401 comment suppresses "imported but unused" warnings, but based on the static analysis hint, this rule may not be enabled in your linter configuration. If F401 is not enforced, the directive is unnecessary.

However, if you want to keep it for future-proofing (in case F401 gets enabled), that's also acceptable.

🔧 Suggested fix
             elif item.lower() == "pynvml":
                 collector_type = GPUTelemetryCollectorType.PYNVML
                 try:
-                    import pynvml  # noqa: F401
+                    import pynvml  # Runtime check for availability
+                    del pynvml  # Explicitly mark as unused
                 except ImportError as e:
                     raise ValueError(
                         "pynvml package not installed. Install with: pip install nvidia-ml-py"
                     ) from e

525-528: Consider adding a setter for symmetry with gpu_telemetry_mode.

The gpu_telemetry_mode property has both a getter and setter, but gpu_telemetry_collector_type only has a getter. If there's a use case for programmatically changing the collector type after initialization, adding a setter would maintain consistency.

tests/unit/server/test_dcgm_faker.py (1)

188-189: Minor: Update docstring to reflect renamed class.

The class docstring still references "TelemetryDataCollector" but the code now uses DCGMTelemetryCollector.

Suggested fix
 class TestDCGMFakerTelemetryCollector:
-    """Test DCGMFaker by parsing output with actual TelemetryDataCollector."""
+    """Test DCGMFaker by parsing output with actual DCGMTelemetryCollector."""
tests/integration/test_dcgm_faker.py (1)

12-17: Minor: Update docstrings to reflect renamed class.

Both the class docstring (line 13) and test method docstring (line 17) still reference "TelemetryDataCollector" instead of DCGMTelemetryCollector.

Suggested fix
 class TestDCGMFaker:
-    """Test DCGMFaker by parsing output with actual TelemetryDataCollector."""
+    """Test DCGMFaker by parsing output with actual DCGMTelemetryCollector."""
 
     @pytest.mark.parametrize("gpu_name", GPU_CONFIGS.keys())
     def test_faker_output_parsed_by_real_telemetry_collector(self, gpu_name):
-        """Test that faker output is parsed correctly by actual TelemetryDataCollector."""
+        """Test that faker output is parsed correctly by actual DCGMTelemetryCollector."""
src/aiperf/gpu_telemetry/pynvml_collector.py (1)

43-48: Remove unnecessary noqa directive.

The # noqa: F401 is flagged as unused since the F401 rule isn't enabled. Remove it for cleaner code.

Suggested fix
     try:
-        import pynvml  # noqa: F401
+        import pynvml

         return True
     except ImportError:
         return False


codecov bot commented Jan 23, 2026

Codecov Report

❌ Patch coverage is 95.48387% with 14 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/aiperf/gpu_telemetry/pynvml_collector.py 94.44% 7 Missing and 5 partials ⚠️
src/aiperf/common/config/user_config.py 88.23% 2 Missing ⚠️


@ajcasagrande (Contributor, Author):

@coderabbitai review the updated changes please


coderabbitai bot commented Jan 23, 2026

@ajcasagrande I'll review the updated changes in this PR. Let me analyze the implementation of the pynvml GPU telemetry support.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@matthewkotila (Contributor) left a comment


was cool to read 🙌

Contributor:

this is only useful when the server is being run locally, right?

should we display a warning if the server url is not localhost and user uses --gpu-telemetry pynvml?
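The warning the reviewer suggests could be sketched as below; the helper name, host list, and message are hypothetical, not part of the PR:

```python
# Sketch of warning when pynvml telemetry is paired with a non-local server
# URL, since pynvml only sees GPUs on the machine running aiperf.
# `warn_if_remote` and LOCAL_HOSTS are illustrative names.
from urllib.parse import urlparse

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1", "0.0.0.0"}


def warn_if_remote(server_url: str):
    host = urlparse(server_url).hostname
    if host and host not in LOCAL_HOSTS:
        return (
            f"pynvml telemetry only measures local GPUs, but the server URL "
            f"points at {host!r}; metrics may not reflect the serving host."
        )
    return None  # local host: nothing to warn about
```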

RuntimeError: If NVML initialization or GPU discovery fails.
"""
try:
pynvml.nvmlInit()
Contributor:

this is generally a really fast call? should we wrap it in .to_thread() like other pynvml calls and await it?

@brandonpelfrey (Contributor) commented Jan 27, 2026:

This is very fast and only happens once during program startup. I ran this in a loop (separate processes, each only timing the init call) and the mean time is 0.01 seconds.
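A rough reproduction of that timing approach is sketched below. The comment above timed separate processes; this in-process harness only illustrates the averaging, with the NVML call swapped for an injectable callable so it runs without a driver:

```python
# Illustrative timing harness: average the wall-clock cost of a call over
# several repeats. Pass pynvml.nvmlInit (or any callable) as `fn`.
import statistics
import time


def mean_call_time(fn, repeats: int = 10) -> float:
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)
```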

from aiperf.gpu_telemetry.pynvml_collector import PyNVMLTelemetryCollector

collector_id = "pynvml_collector"
collector = PyNVMLTelemetryCollector(
Contributor:

should this use GPUTelemetryCollectorFactory.create_instance()?

timestamp_ns: int = Field(
description="Nanosecond wall-clock timestamp when telemetry was collected (time_ns)"
)
dcgm_url: str = Field(
Contributor:

should this identifier be renamed to something like telemetry_url?

telemetry_data = TelemetryMetrics()

# Power usage (milliwatts -> watts)
with contextlib.suppress(NVMLError):
Contributor:

should we have some debug logging for the cases where collecting a metric fails?
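That suggestion could be sketched as a small helper that replaces the bare contextlib.suppress with a logged skip. NVMLError is stubbed as a plain Exception subclass here so the example runs without pynvml; the helper name is hypothetical:

```python
# Sketch: collect one metric, logging failures at debug level instead of
# silently suppressing them. `collect_metric` and the stub NVMLError are
# illustrative, not the PR's actual code.
import logging

logger = logging.getLogger("pynvml_collector")


class NVMLError(Exception):  # stand-in for pynvml.NVMLError
    pass


def collect_metric(name: str, getter, sink: dict) -> None:
    try:
        sink[name] = getter()
    except NVMLError as e:
        logger.debug("Skipping metric %s: %s", name, e)
```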

Contributor:

For readability, it might be nice if there was a single context here. I think a one-time warning for unhandled NVMLError would make sense.

Note that the NVMLNotFound specific error (not 100% sure on the name) is thrown in the happy case when nothing went 'wrong' but there simply isn't data to report. This can happen for instance if at the start of sampling no program is using the GPU yet -- so that shouldn't show as an error to anyone.



4 participants