feat: filter out warmup from gpu telemetry, counter deltas #596
base: main
Conversation
Try out this PR

Quick install:

```shell
pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@101cb2b8f39457004741d6bb9720cf35abbec65b
```

Recommended with a virtual environment (using uv):

```shell
uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@101cb2b8f39457004741d6bb9720cf35abbec65b
```

Last updated for commit:
Walkthrough

Introduces time-range filtering capabilities for telemetry data.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
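The time-range filtering the walkthrough describes can be pictured with a small sketch. All names here (`Sample`, `filter_samples`, `counter_delta`) are hypothetical stand-ins, not the PR's actual models; the idea is that warmup samples fall outside the benchmark window, and counter metrics report the delta between the last and first sample inside it.

```python
from dataclasses import dataclass

# Hypothetical sample type; the PR's real telemetry time-series models differ.
@dataclass
class Sample:
    timestamp_ns: int
    value: float

def filter_samples(samples, start_ns, end_ns):
    """Keep only samples inside [start_ns, end_ns], dropping warmup."""
    return [s for s in samples if start_ns <= s.timestamp_ns <= end_ns]

def counter_delta(samples):
    """Counters are monotonically increasing totals; report the delta
    between the last and first sample in the filtered window."""
    if len(samples) < 2:
        return 0.0
    return samples[-1].value - samples[0].value

samples = [Sample(0, 100.0), Sample(10, 120.0), Sample(20, 150.0), Sample(30, 180.0)]
window = filter_samples(samples, 10, 30)   # drops the warmup sample at t=0
print(counter_delta(window))               # 60.0
```

Reporting a delta rather than raw cumulative values is what makes warmup exclusion meaningful for counters: the warmup traffic accumulated before the window simply never enters the subtraction.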
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/aiperf/gpu_telemetry/accumulator.py`:
- Around line 249-269: The blanket except Exception in the metric loop is
intentional but loses traceback; replace the self.warning(...) call with
self.exception(...) so the full stack trace is logged when
gpu_data.get_metric_result or other code fails (inside the loop that iterates
get_gpu_telemetry_metrics_config()), and add an inline "# noqa: BLE001" comment
on that except Exception line to document intentional broad exception masking;
keep the existing except NoMetricValue: continue behavior and reference
GPU_TELEMETRY_COUNTER_METRICS, get_gpu_telemetry_metrics_config,
gpu_data.get_metric_result and NoMetricValue when making the change.
```diff
                 metric_display,
                 metric_key,
                 unit,
                 unit_enum,
             ) in get_gpu_telemetry_metrics_config():
                 try:
                     is_counter = metric_key in GPU_TELEMETRY_COUNTER_METRICS
                     metric_result = gpu_data.get_metric_result(
-                        metric_key, metric_key, metric_key, unit
+                        metric_key,
+                        metric_key,
+                        metric_display,
+                        unit_enum,
+                        time_filter=time_filter,
+                        is_counter=is_counter,
                     )
                     metrics_dict[metric_key] = metric_result.to_json_result()
-                except Exception:
+                except NoMetricValue:
                     continue
+                except Exception as e:
+                    self.warning(
+                        f"Failed to compute metric '{metric_key}' for GPU {gpu_uuid[:12]}: {e}"
+                    )
```
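The control flow in the diff above, skipping metrics that simply have no data while logging unexpected failures and continuing, follows a common accumulator pattern. This sketch uses hypothetical stand-ins (`NoMetricValue` is reimplemented locally, and `get_metric_result` is passed in as a callable) rather than the actual aiperf types:

```python
import logging

logger = logging.getLogger(__name__)

class NoMetricValue(Exception):
    """Raised when a metric has no samples in the requested window."""

def compute_metrics(metric_keys, get_metric_result):
    """Accumulate results per metric; one failure must not stop the rest."""
    results = {}
    for metric_key in metric_keys:
        try:
            results[metric_key] = get_metric_result(metric_key)
        except NoMetricValue:
            # Expected when a metric is absent on this GPU; skip silently.
            continue
        except Exception:  # noqa: BLE001 - intentional broad catch for resilience
            logger.exception("Failed to compute metric '%s'", metric_key)
    return results

def fake_result(key):
    # Stand-in for gpu_data.get_metric_result; only one metric has data.
    if key == "gpu_power_usage":
        return 212.5
    raise NoMetricValue(key)

print(compute_metrics(["gpu_power_usage", "nvlink_tx"], fake_result))
# {'gpu_power_usage': 212.5}
```

The two-tier handling is the point: the narrow `except NoMetricValue` encodes an expected, benign condition, while the broad `except Exception` is a deliberate resilience boundary around third-party or data-dependent code.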
🧩 Analysis chain

🏁 Scripts executed:

```shell
# Find telemetry_models file and inspect get_metric_result
find . -type f -name '*telemetry*models*.py' | head -20

# Search for get_metric_result definition
rg -n 'def get_metric_result' --type py

# Look for exception definitions and raises in the telemetry module
rg -n 'class.*Error|class.*Exception' src/aiperf/gpu_telemetry/ --type py
rg -n 'raise ' src/aiperf/gpu_telemetry/ --type py | head -30

# Check the accumulator.py file to see imports and context
cat -n src/aiperf/gpu_telemetry/accumulator.py | head -50
cat -n src/aiperf/gpu_telemetry/accumulator.py | sed -n '240,280p'

# Read get_metric_result implementation
cat -n src/aiperf/common/models/telemetry_models.py | sed -n '441,500p'

# Check for more context, including exception handling in get_metric_result
cat -n src/aiperf/common/models/telemetry_models.py | sed -n '430,550p'

# Look for what exceptions are defined/raised in telemetry_models
rg -n 'raise|except' src/aiperf/common/models/telemetry_models.py | head -40

# Find the time-series class and its to_metric_result methods
rg -n 'def to_metric_result' src/aiperf/common/models/telemetry_models.py -A 15

# Check the entire to_metric_result_filtered to see what it does
rg -n 'def to_metric_result_filtered' src/aiperf/common/models/ -A 20

# Get the full implementation of to_metric_result
cat -n src/aiperf/common/models/telemetry_models.py | sed -n '213,280p'

# Get the full implementation of to_metric_result_filtered
cat -n src/aiperf/common/models/telemetry_models.py | sed -n '320,410p'
```
Use `self.exception()` with an explicit `# noqa: BLE001` to justify the blanket exception handling.

Ruff flags BLE001 here. While the blanket catch serves resilience (preventing one metric failure from stopping the others), `self.warning()` drops the traceback, hindering debugging. Change to `self.exception()` to log the full stack trace, and add `# noqa: BLE001` to explicitly document the intentional exception masking.

Suggested change:

```python
except Exception:  # noqa: BLE001
    self.exception(
        f"Failed to compute metric '{metric_key}' for GPU {gpu_uuid[:12]}"
    )
```

🧰 Tools
🪛 Ruff (0.14.13)

266-266: Do not catch blind exception: `Exception` (BLE001)
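The distinction the review draws, that `warning()` records only the message while `exception()` also records the active traceback, can be demonstrated with the stdlib logger (illustrative only; the `self.warning`/`self.exception` mixin methods in aiperf are assumed to wrap these):

```python
import io
import logging

# Route log output into a string so we can inspect what each call recorded.
buf = io.StringIO()
logging.basicConfig(stream=buf, level=logging.WARNING, force=True)
log = logging.getLogger("demo")

try:
    raise ValueError("boom")
except Exception:  # noqa: BLE001
    log.warning("failed")      # message only, no traceback

try:
    raise ValueError("boom")
except Exception:  # noqa: BLE001
    log.exception("failed")    # message plus the full traceback

out = buf.getvalue()
print("Traceback" in out)  # True - only log.exception captured it
```

`Logger.exception` must be called from inside an `except` block; it logs at ERROR level and appends `sys.exc_info()` automatically, which is exactly the debugging context `warning()` discards.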
Codecov Report

❌ Patch coverage is
debermudez
left a comment
Nice job on this.
lkomali
left a comment
Great work! Thank you
Summary by CodeRabbit
Release Notes
New Features
Improvements