-
Notifications
You must be signed in to change notification settings - Fork 25
feat: support for local gpu telemetry via pynvml #595
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@2e7356d373093ae3e950e5bfb0dbe0ad579894c6Recommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@2e7356d373093ae3e950e5bfb0dbe0ad579894c6Last updated for commit: |
WalkthroughThis PR adds PyNVML-based local GPU telemetry alongside DCGM, refactors telemetry into protocol/factory patterns, renames and re-exports DCGM collector, extends telemetry metrics and constants, updates configuration parsing and docs, and adds comprehensive PyNVML tests and related test updates. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unit/gpu_telemetry/test_telemetry_manager.py (1)
639-696: Add test coverage for PYNVML collector integration in manager's configure phase.While
test_pynvml_collector.pycovers PyNVMLTelemetryCollector functionality thoroughly, the manager-level integration tests in this file only exercise the DCGM collector path. The_profile_configure_commandmethod should be tested withGPUTelemetryCollectorType.PYNVMLto ensure feature parity with DCGM coverage.
🤖 Fix all issues with AI agents
In `@src/aiperf/gpu_telemetry/manager.py`:
- Around line 226-236: The broad `except Exception` handlers should be narrowed
or explicitly exempted from Ruff BLE001; locate the three handlers that log "GPU
Telemetry: Failed to configure pynvml collector" (the block shown and the other
occurrences around the comment ranges 317-319 and 381-382) and either (A)
replace `except Exception as e:` with specific exception types that you expect
from the pynvml setup/telemetry code (e.g., NVML-specific exceptions, OSError,
ImportError, ValueError, etc.), or (B) if a broad catch is intentional for
fault-tolerant telemetry, add a `# noqa: BLE001` comment on the `except
Exception` line and keep the logging behavior unchanged; ensure the first
RuntimeError handler remains unchanged and that the error messages remain
descriptive.
In `@src/aiperf/gpu_telemetry/pynvml_collector.py`:
- Around line 121-148: The async method is_url_reachable performs blocking NVML
calls directly (pynvml.nvmlInit(), nvmlDeviceGetCount(), pynvml.nvmlShutdown())
which can block the event loop; move the entire synchronous NVML probe into a
synchronous helper and call it via asyncio.to_thread (or wrap each blocking call
with asyncio.to_thread) from is_url_reachable so the NVML init/count/shutdown
run off the event loop, propagate errors to return False on failure, and keep
the existing behavior of returning True only if device count > 0; reference the
is_url_reachable function and the NVML calls (pynvml.nvmlInit,
nvmlDeviceGetCount, pynvml.nvmlShutdown) when making the change.
🧹 Nitpick comments (5)
src/aiperf/common/config/user_config.py (2)
482-490: Consider removing the unused# noqa: F401directive.The
# noqa: F401comment suppresses "imported but unused" warnings, but based on the static analysis hint, this rule may not be enabled in your linter configuration. IfF401is not enforced, the directive is unnecessary.However, if you want to keep it for future-proofing (in case F401 gets enabled), that's also acceptable.
🔧 Suggested fix
elif item.lower() == "pynvml": collector_type = GPUTelemetryCollectorType.PYNVML try: - import pynvml # noqa: F401 + import pynvml # Runtime check for availability + del pynvml # Explicitly mark as unused except ImportError as e: raise ValueError( "pynvml package not installed. Install with: pip install nvidia-ml-py" ) from e
525-528: Consider adding a setter for symmetry withgpu_telemetry_mode.The
gpu_telemetry_modeproperty has both a getter and setter, butgpu_telemetry_collector_typeonly has a getter. If there's a use case for programmatically changing the collector type after initialization, adding a setter would maintain consistency.tests/unit/server/test_dcgm_faker.py (1)
188-189: Minor: Update docstring to reflect renamed class.The class docstring still references "TelemetryDataCollector" but the code now uses
DCGMTelemetryCollector.Suggested fix
class TestDCGMFakerTelemetryCollector: - """Test DCGMFaker by parsing output with actual TelemetryDataCollector.""" + """Test DCGMFaker by parsing output with actual DCGMTelemetryCollector."""tests/integration/test_dcgm_faker.py (1)
12-17: Minor: Update docstrings to reflect renamed class.Both the class docstring (line 13) and test method docstring (line 17) still reference "TelemetryDataCollector" instead of
DCGMTelemetryCollector.Suggested fix
class TestDCGMFaker: - """Test DCGMFaker by parsing output with actual TelemetryDataCollector.""" + """Test DCGMFaker by parsing output with actual DCGMTelemetryCollector.""" `@pytest.mark.parametrize`("gpu_name", GPU_CONFIGS.keys()) def test_faker_output_parsed_by_real_telemetry_collector(self, gpu_name): - """Test that faker output is parsed correctly by actual TelemetryDataCollector.""" + """Test that faker output is parsed correctly by actual DCGMTelemetryCollector."""src/aiperf/gpu_telemetry/pynvml_collector.py (1)
43-48: Remove unnecessarynoqadirective.The
# noqa: F401is flagged as unused since the F401 rule isn't enabled. Remove it for cleaner code.Suggested fix
try: - import pynvml # noqa: F401 + import pynvml return True except ImportError: return False
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
@coderabbitai review the updated changes please |
|
✅ Actions performedReview triggered.
|
matthewkotila
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was cool to read 🙌
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is only useful when the server is being run locally, right?
should we display a warning if the server url is not localhost and user uses --gpu-telemetry pynvml?
| RuntimeError: If NVML initialization or GPU discovery fails. | ||
| """ | ||
| try: | ||
| pynvml.nvmlInit() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is generally a really fast call? should we wrap it in .to_thread() like other pynvml calls and await it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very fast and only happens once during program startup. I ran this in a loop (separate processes, each only timing the init call) and the mean time is 0.01 seconds.
| from aiperf.gpu_telemetry.pynvml_collector import PyNVMLTelemetryCollector | ||
|
|
||
| collector_id = "pynvml_collector" | ||
| collector = PyNVMLTelemetryCollector( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this use GPUTelemetryCollectorFactory.create_instance()?
| timestamp_ns: int = Field( | ||
| description="Nanosecond wall-clock timestamp when telemetry was collected (time_ns)" | ||
| ) | ||
| dcgm_url: str = Field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this identifier be renamed to something like telemetry_url?
| telemetry_data = TelemetryMetrics() | ||
|
|
||
| # Power usage (milliwatts -> watts) | ||
| with contextlib.suppress(NVMLError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we have some debug logging for the cases where collecting a metric fails?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For readability, it might be nice if there was a single context here. I think a one-time warning for unhandled NVMLError would make sense.
Note that the NVMLNotFound specific error (not 100% sure on the name) is thrown in the happy case when nothing went 'wrong' but there simply isn't data to report. This can happen for instance if at the start of sampling no program is using the GPU yet -- so that shouldn't show as an error to anyone.
Summary by CodeRabbit
New Features
Documentation
Chores
✏️ Tip: You can customize this high-level summary in your review settings.