chore: add label to nvlink metric#521

Closed
tmcroberts97 wants to merge 1 commit into NVIDIA:main from tmcroberts97:chore/add-label-to-nvlink-metric

@tmcroberts97
Contributor

Description

Add an instance ID label to the nvlink config apply latency metric, for more granular alerting.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Add an instance ID label to the nvlink config apply latency metric, for
more granular alerting.

Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
@tmcroberts97 tmcroberts97 requested a review from a team as a code owner March 11, 2026 16:42
@github-actions

🛡️ Vulnerability Scan

🚨 Found 74 vulnerability(ies)
📊 vs main: 74 (no change)

Severity Breakdown:

  • 🔴 Critical/High: 74
  • 🟡 Medium: 0
  • 🔵 Low/Info: 0

🔗 View full details in Security tab

🕐 Last updated: 2026-03-11 16:44:25 UTC | Commit: 0f78094

@github-actions

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-03-11 16:44:25 UTC | Commit: 0f78094

for (duration_ms, instance_id) in &metrics.nvlink_config_apply_durations_ms {
    self.nvlink_config_apply_latency.record(
        *duration_ms,
        &[KeyValue::new("instance_id", instance_id.clone())],
Contributor

This will lead to a lot of time series. For example, on a site with 1000 instances, you would get about 1000 * number_of_histogram_buckets new lines in the Prometheus metrics output (plus a `_sum` and a `_count` line per instance).
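
To put a rough number on the cardinality concern, here is a minimal sketch of the arithmetic. It assumes the histogram is exposed in the Prometheus text format, where each label set produces one `_bucket` line per explicit boundary, one `+Inf` bucket, and one `_sum` and `_count` line. The boundary count of 15 is an assumption for illustration; the PR does not show the actual bucket configuration, and any other labels on the metric would multiply the result further.

```rust
/// Lines exposed for a single label set of a Prometheus-format histogram:
/// one `_bucket` line per explicit boundary, one `+Inf` bucket,
/// plus the `_sum` and `_count` lines.
fn series_per_label_set(explicit_boundaries: u64) -> u64 {
    explicit_boundaries + 1 + 2
}

/// Total lines when the histogram is split across `label_sets`
/// distinct label combinations (here: distinct instance IDs).
fn total_series(label_sets: u64, explicit_boundaries: u64) -> u64 {
    label_sets * series_per_label_set(explicit_boundaries)
}

fn main() {
    // Assumption: 15 explicit bucket boundaries (not taken from the PR).
    let boundaries = 15;
    println!("without instance_id: {}", total_series(1, boundaries));
    println!("with 1000 instance IDs: {}", total_series(1000, boundaries));
}
```

Under these assumptions a single unlabelled histogram costs 18 lines, while 1000 instance IDs cost 18000.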

I'd avoid it, and think about how we can communicate high-latency issues in a different fashion.
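
One possible direction, sketched here purely as an illustration and not as anything proposed in the PR: keep the histogram unlabelled and surface the offending instance IDs through another channel, such as a log line, only when an apply is slow. The function name `split_samples`, the threshold constant, and the `u64` duration type are all hypothetical; only the `(duration_ms, instance_id)` pair shape comes from the diff.

```rust
/// Hypothetical threshold; the PR does not define one.
const SLOW_APPLY_THRESHOLD_MS: u64 = 5_000;

/// Splits `(duration_ms, instance_id)` samples into:
/// - every duration, to be recorded on the unlabelled histogram, and
/// - the IDs of instances whose apply took longer than `threshold_ms`,
///   to be reported outside the metric (for example, in a log line).
fn split_samples(samples: &[(u64, String)], threshold_ms: u64) -> (Vec<u64>, Vec<String>) {
    let mut durations = Vec::with_capacity(samples.len());
    let mut slow_instances = Vec::new();
    for (duration_ms, instance_id) in samples {
        durations.push(*duration_ms);
        if *duration_ms > threshold_ms {
            slow_instances.push(instance_id.clone());
        }
    }
    (durations, slow_instances)
}

fn main() {
    // Made-up sample data for illustration only.
    let samples = vec![
        (120, "instance-a".to_string()),
        (9_000, "instance-b".to_string()),
    ];
    let (durations, slow_instances) = split_samples(&samples, SLOW_APPLY_THRESHOLD_MS);

    // In the real code these durations would go to the histogram without labels.
    println!("durations to record: {:?}", durations);
    for id in &slow_instances {
        eprintln!("slow nvlink config apply on {}", id);
    }
}
```

This keeps the metric's cardinality constant regardless of site size, at the cost of needing a log query (or similar) to find which instance was slow.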

3 participants