Conversation

@ajcasagrande (Contributor) commented Jan 23, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added time-range filtering for telemetry metrics to enable window-based data analysis.
    • Introduced counter metric support with delta calculations for energy consumption, XID errors, and power violations.
    • Implemented profile completion handler to finalize profiling operations.
    • Added baseline metrics collection during profiling initialization.
  • Improvements

    • Enhanced statistical calculations using sample standard deviation for multiple data points.

✏️ Tip: You can customize this high-level summary in your review settings.

github-actions bot commented Jan 23, 2026

Try out this PR

Quick install:

```shell
pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@101cb2b8f39457004741d6bb9720cf35abbec65b
```

Recommended with a virtual environment (using uv):

```shell
uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@101cb2b8f39457004741d6bb9720cf35abbec65b
```

Last updated for commit: 101cb2b

@github-actions github-actions bot added the feat label Jan 23, 2026
coderabbitai bot commented Jan 23, 2026

Walkthrough

Introduces time-range filtering for telemetry data by adding TimeRangeFilter support and time-based utilities (mask generation, reference indexing), enabling filtered metric statistics and counter-based delta calculations for gauge metrics. Also defines a counter-metrics constant and orchestrates profile completion with baseline metric capture.
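The filtering-plus-delta flow described above can be sketched as follows. This is a minimal illustration only; the names `time_mask` and `counter_delta` are hypothetical stand-ins, not the PR's actual API:

```python
import numpy as np

def time_mask(timestamps_ns: np.ndarray, start_ns: int, end_ns: int) -> np.ndarray:
    """Boolean mask selecting samples inside [start_ns, end_ns]."""
    return (timestamps_ns >= start_ns) & (timestamps_ns <= end_ns)

def counter_delta(values: np.ndarray, mask: np.ndarray) -> float:
    """Delta of a cumulative counter over the filtered window.

    For counters such as energy_consumption, the meaningful statistic is
    last - first within the window, not the mean of the raw readings.
    """
    window = values[mask]
    if window.size == 0:
        raise ValueError("no samples in window")
    return float(window[-1] - window[0])

# Example: a cumulative energy counter sampled over time
ts = np.array([0, 10, 20, 30, 40])                 # timestamps (ns)
energy = np.array([100.0, 110.0, 130.0, 160.0, 200.0])
m = time_mask(ts, 10, 30)                          # drop the warmup sample at t=0
print(counter_delta(energy, m))                    # 160.0 - 110.0 = 50.0
```

The same mask can feed ordinary gauge statistics (mean/std over `values[m]`), which is why a single time filter serves both metric kinds.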

Changes

Cohort / File(s) — Summary

  • Core telemetry model enhancements — src/aiperf/common/models/telemetry_models.py
    Adds TimeRangeFilter dependency and time-series filtering methods: get_time_mask(), get_reference_idx(), and to_metric_result_filtered(). Reworks statistics calculation to use sample standard deviation (ddof=1) for multiple samples. Extends the get_metric_result() signature to support time-filtered and counter-based results.
  • GPU telemetry configuration — src/aiperf/gpu_telemetry/constants.py, src/aiperf/gpu_telemetry/__init__.py
    Defines the new GPU_TELEMETRY_COUNTER_METRICS constant identifying cumulative counter metrics (energy_consumption, xid_errors, power_violation). Re-exports the constant as public API.
  • GPU telemetry orchestration — src/aiperf/gpu_telemetry/accumulator.py, src/aiperf/gpu_telemetry/manager.py
    Updates the accumulator to pass time-range filters and counter metadata to metric retrieval calls; adjusts metric rendering and error handling. Introduces a profile completion handler and Phase 2 baseline metric capture during configuration.
  • Test coverage expansions — tests/unit/gpu_telemetry/test_telemetry_manager.py, tests/unit/gpu_telemetry/test_telemetry_models.py
    Adds mocking for initialization and collection phases in manager tests. Expands telemetry model tests with parametrized time-range filtering scenarios, counter delta calculations, and sample statistics validation.
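As a side note on the ddof=1 change: sample standard deviation applies Bessel's correction, which only matters when more than one sample exists. A hedged sketch (not the PR's exact code):

```python
import numpy as np

samples = np.array([250.0, 260.0, 270.0])   # e.g. three power-draw readings

pop_std = np.std(samples)                   # ddof=0: population std dev
sample_std = np.std(samples, ddof=1)        # ddof=1: sample std dev (Bessel's correction)

# Population std divides the squared deviations by n; sample std divides by n-1,
# so sample_std is always >= pop_std for n > 1.
print(pop_std, sample_std)                  # ~8.165 vs 10.0

# With a single sample, ddof=1 would divide by zero; implementations
# typically special-case that and report 0.0 (an assumption here, not
# necessarily what this PR does).
```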

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Time filters bloom on telemetry trails,
Counters leap where delta prevails,
Gauges dance through windowed time,
Baselines captured, metrics align—
Profiling complete, efficiency gains! ✨

🚥 Pre-merge checks: ✅ 3 passed

  • Description Check — ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed — The title accurately captures the two main objectives: filtering out warmup data from GPU telemetry metrics and implementing counter delta calculations.
  • Docstring Coverage — ✅ Passed — Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/aiperf/gpu_telemetry/accumulator.py`:
- Around line 249-269: The blanket except Exception in the metric loop is
intentional but loses traceback; replace the self.warning(...) call with
self.exception(...) so the full stack trace is logged when
gpu_data.get_metric_result or other code fails (inside the loop that iterates
get_gpu_telemetry_metrics_config()), and add an inline "# noqa: BLE001" comment
on that except Exception line to document intentional broad exception masking;
keep the existing except NoMetricValue: continue behavior and reference
GPU_TELEMETRY_COUNTER_METRICS, get_gpu_telemetry_metrics_config,
gpu_data.get_metric_result and NoMetricValue when making the change.

Comment on lines +249 to +269

```diff
                     metric_display,
                     metric_key,
                     unit,
                     unit_enum,
                 ) in get_gpu_telemetry_metrics_config():
                     try:
+                        is_counter = metric_key in GPU_TELEMETRY_COUNTER_METRICS
                         metric_result = gpu_data.get_metric_result(
-                            metric_key, metric_key, metric_key, unit
+                            metric_key,
+                            metric_key,
+                            metric_display,
+                            unit_enum,
+                            time_filter=time_filter,
+                            is_counter=is_counter,
                         )
                         metrics_dict[metric_key] = metric_result.to_json_result()
-                    except Exception:
+                    except NoMetricValue:
                         continue
+                    except Exception as e:
+                        self.warning(
+                            f"Failed to compute metric '{metric_key}' for GPU {gpu_uuid[:12]}: {e}"
+                        )
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Find telemetry_models file and inspect get_metric_result
find . -type f -name '*telemetry*models*.py' | head -20
```

Repository: ai-dynamo/aiperf
Length of output: 158

🏁 Script executed:

```shell
#!/bin/bash
# Search for get_metric_result definition
rg -n 'def get_metric_result' --type py
```

Length of output: 135

🏁 Script executed:

```shell
#!/bin/bash
# Look for exception definitions and raises in the telemetry module
rg -n 'class.*Error|class.*Exception' src/aiperf/gpu_telemetry/ --type py
rg -n 'raise ' src/aiperf/gpu_telemetry/ --type py | head -30
```

Length of output: 228

🏁 Script executed:

```shell
#!/bin/bash
# Check the accumulator.py file to see imports and context
cat -n src/aiperf/gpu_telemetry/accumulator.py | head -50
cat -n src/aiperf/gpu_telemetry/accumulator.py | sed -n '240,280p'
```

Length of output: 4837

🏁 Script executed:

```shell
#!/bin/bash
# Read get_metric_result implementation
cat -n src/aiperf/common/models/telemetry_models.py | sed -n '441,500p'
```

Length of output: 2679

🏁 Script executed:

```shell
#!/bin/bash
# Check for more context, including exception handling in get_metric_result
cat -n src/aiperf/common/models/telemetry_models.py | sed -n '430,550p'
```

Length of output: 4811

🏁 Script executed:

```shell
#!/bin/bash
# Look for what exceptions are defined/raised in the telemetry_models
rg -n 'raise|except' src/aiperf/common/models/telemetry_models.py | head -40
```

Length of output: 275

🏁 Script executed:

```shell
#!/bin/bash
# Find the time-series class and its to_metric_result methods
rg -n 'def to_metric_result' src/aiperf/common/models/telemetry_models.py -A 15
```

Length of output: 1321

🏁 Script executed:

```shell
#!/bin/bash
# Check the entire to_metric_result_filtered to see what it does
rg -n 'def to_metric_result_filtered' src/aiperf/common/models/ -A 20
```

Length of output: 1964

🏁 Script executed:

```shell
#!/bin/bash
# Get full implementation of to_metric_result
cat -n src/aiperf/common/models/telemetry_models.py | sed -n '213,280p'
```

Length of output: 2829

🏁 Script executed:

```shell
#!/bin/bash
# Get full implementation of to_metric_result_filtered
cat -n src/aiperf/common/models/telemetry_models.py | sed -n '320,410p'
```

Length of output: 3857


Use self.exception() with explicit # noqa: BLE001 to justify blanket exception handling.

Ruff flags BLE001 here; while the blanket catch serves resilience (preventing one metric failure from stopping others), using self.warning() drops the traceback, hindering debugging. Change to self.exception() to log the full stack trace and add # noqa: BLE001 to explicitly document this as intentional exception masking.

Suggested change:

```python
                        except Exception as e:  # noqa: BLE001
                            self.exception(
                                f"Failed to compute metric '{metric_key}' for GPU {gpu_uuid[:12]}"
                            )
```
🧰 Tools
🪛 Ruff (0.14.13)

266-266: Do not catch blind exception: Exception

(BLE001)


codecov bot commented Jan 23, 2026

Codecov Report

❌ Patch coverage is 86.11111% with 10 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines missing
src/aiperf/gpu_telemetry/accumulator.py | 60.00% | 4 missing ⚠️
src/aiperf/gpu_telemetry/manager.py | 80.95% | 4 missing ⚠️
src/aiperf/common/models/telemetry_models.py | 95.00% | 1 missing, 1 partial ⚠️


@debermudez (Contributor) left a comment:

Nice job on this.

@lkomali (Contributor) left a comment:

Great work! Thank you
