feat: filter out warmup from gpu telemetry, counter deltas #596
base: main
Conversation
Try out this PR

Quick install:

```shell
pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@101cb2b8f39457004741d6bb9720cf35abbec65b
```

Recommended with a virtual environment (using uv):

```shell
uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@101cb2b8f39457004741d6bb9720cf35abbec65b
```

Last updated for commit:
Walkthrough

Introduces time-range filtering capabilities for telemetry data.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
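The time-range filtering the walkthrough describes can be pictured with a small sketch. All names here (`Sample`, `filter_samples`, `counter_delta`) are hypothetical stand-ins, not the PR's actual models; the idea is that warmup samples fall outside the benchmark window, and counter metrics report the delta between the last and first sample inside it.

```python
from dataclasses import dataclass

# Hypothetical sample type; the PR's real telemetry time-series models differ.
@dataclass
class Sample:
    timestamp_ns: int
    value: float

def filter_samples(samples, start_ns, end_ns):
    """Keep only samples inside [start_ns, end_ns], dropping warmup."""
    return [s for s in samples if start_ns <= s.timestamp_ns <= end_ns]

def counter_delta(samples):
    """Counters are monotonically increasing totals; report the delta
    between the last and first sample in the filtered window."""
    if len(samples) < 2:
        return 0.0
    return samples[-1].value - samples[0].value

samples = [Sample(0, 100.0), Sample(10, 120.0), Sample(20, 150.0), Sample(30, 180.0)]
window = filter_samples(samples, 10, 30)   # drops the warmup sample at t=0
print(counter_delta(window))               # 60.0
```

Reporting a delta rather than raw cumulative values is what makes warmup exclusion meaningful for counters: the warmup traffic accumulated before the window simply never enters the subtraction.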
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/aiperf/gpu_telemetry/accumulator.py`:
- Around line 249-269: The blanket except Exception in the metric loop is
intentional but loses traceback; replace the self.warning(...) call with
self.exception(...) so the full stack trace is logged when
gpu_data.get_metric_result or other code fails (inside the loop that iterates
get_gpu_telemetry_metrics_config()), and add an inline "# noqa: BLE001" comment
on that except Exception line to document intentional broad exception masking;
keep the existing except NoMetricValue: continue behavior and reference
GPU_TELEMETRY_COUNTER_METRICS, get_gpu_telemetry_metrics_config,
gpu_data.get_metric_result and NoMetricValue when making the change.
```diff
                 metric_display,
                 metric_key,
                 unit,
                 unit_enum,
             ) in get_gpu_telemetry_metrics_config():
                 try:
                     is_counter = metric_key in GPU_TELEMETRY_COUNTER_METRICS
                     metric_result = gpu_data.get_metric_result(
-                        metric_key, metric_key, metric_key, unit
+                        metric_key,
+                        metric_key,
+                        metric_display,
+                        unit_enum,
+                        time_filter=time_filter,
+                        is_counter=is_counter,
                     )
                     metrics_dict[metric_key] = metric_result.to_json_result()
-                except Exception:
+                except NoMetricValue:
                     continue
+                except Exception as e:
+                    self.warning(
+                        f"Failed to compute metric '{metric_key}' for GPU {gpu_uuid[:12]}: {e}"
+                    )
```
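The control flow in the diff above, skipping metrics that simply have no data while logging unexpected failures and continuing, follows a common accumulator pattern. This sketch uses hypothetical stand-ins (`NoMetricValue` is reimplemented locally, and `get_metric_result` is passed in as a callable) rather than the actual aiperf types:

```python
import logging

logger = logging.getLogger(__name__)

class NoMetricValue(Exception):
    """Raised when a metric has no samples in the requested window."""

def compute_metrics(metric_keys, get_metric_result):
    """Accumulate results per metric; one failure must not stop the rest."""
    results = {}
    for metric_key in metric_keys:
        try:
            results[metric_key] = get_metric_result(metric_key)
        except NoMetricValue:
            # Expected when a metric is absent on this GPU; skip silently.
            continue
        except Exception:  # noqa: BLE001 - intentional broad catch for resilience
            logger.exception("Failed to compute metric '%s'", metric_key)
    return results

def fake_result(key):
    # Stand-in for gpu_data.get_metric_result; only one metric has data.
    if key == "gpu_power_usage":
        return 212.5
    raise NoMetricValue(key)

print(compute_metrics(["gpu_power_usage", "nvlink_tx"], fake_result))
# {'gpu_power_usage': 212.5}
```

The two-tier handling is the point: the narrow `except NoMetricValue` encodes an expected, benign condition, while the broad `except Exception` is a deliberate resilience boundary around third-party or data-dependent code.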
🧩 Analysis chain

🏁 Scripts executed:

```shell
# Find telemetry_models file and inspect get_metric_result
find . -type f -name '*telemetry*models*.py' | head -20

# Search for get_metric_result definition
rg -n 'def get_metric_result' --type py

# Look for exception definitions and raises in the telemetry module
rg -n 'class.*Error|class.*Exception' src/aiperf/gpu_telemetry/ --type py
rg -n 'raise ' src/aiperf/gpu_telemetry/ --type py | head -30

# Check the accumulator.py file to see imports and context
cat -n src/aiperf/gpu_telemetry/accumulator.py | head -50
cat -n src/aiperf/gpu_telemetry/accumulator.py | sed -n '240,280p'

# Read get_metric_result implementation
cat -n src/aiperf/common/models/telemetry_models.py | sed -n '441,500p'

# Check for more context, including exception handling in get_metric_result
cat -n src/aiperf/common/models/telemetry_models.py | sed -n '430,550p'

# Look for what exceptions are defined/raised in telemetry_models
rg -n 'raise|except' src/aiperf/common/models/telemetry_models.py | head -40

# Find the time-series class and its to_metric_result methods
rg -n 'def to_metric_result' src/aiperf/common/models/telemetry_models.py -A 15

# Check the entire to_metric_result_filtered to see what it does
rg -n 'def to_metric_result_filtered' src/aiperf/common/models/ -A 20

# Get the full implementation of to_metric_result
cat -n src/aiperf/common/models/telemetry_models.py | sed -n '213,280p'

# Get the full implementation of to_metric_result_filtered
cat -n src/aiperf/common/models/telemetry_models.py | sed -n '320,410p'
```
Use `self.exception()` with an explicit `# noqa: BLE001` to justify the blanket exception handling.

Ruff flags BLE001 here. While the blanket catch serves resilience (preventing one metric failure from stopping the others), `self.warning()` drops the traceback, hindering debugging. Change to `self.exception()` to log the full stack trace, and add `# noqa: BLE001` to explicitly document the intentional exception masking.

Suggested change:

```python
except Exception:  # noqa: BLE001
    self.exception(
        f"Failed to compute metric '{metric_key}' for GPU {gpu_uuid[:12]}"
    )
```

🧰 Tools
🪛 Ruff (0.14.13)

266-266: Do not catch blind exception: `Exception` (BLE001)
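The distinction the review draws, that `warning()` records only the message while `exception()` also records the active traceback, can be demonstrated with the stdlib logger (illustrative only; the `self.warning`/`self.exception` mixin methods in aiperf are assumed to wrap these):

```python
import io
import logging

# Route log output into a string so we can inspect what each call recorded.
buf = io.StringIO()
logging.basicConfig(stream=buf, level=logging.WARNING, force=True)
log = logging.getLogger("demo")

try:
    raise ValueError("boom")
except Exception:  # noqa: BLE001
    log.warning("failed")      # message only, no traceback

try:
    raise ValueError("boom")
except Exception:  # noqa: BLE001
    log.exception("failed")    # message plus the full traceback

out = buf.getvalue()
print("Traceback" in out)  # True - only log.exception captured it
```

`Logger.exception` must be called from inside an `except` block; it logs at ERROR level and appends `sys.exc_info()` automatically, which is exactly the debugging context `warning()` discards.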
Codecov Report

❌ Patch coverage is
debermudez
left a comment
Nice job on this.
lkomali
left a comment
Great work! Thank you
Summary by CodeRabbit
Release Notes
New Features
Improvements