feat: scout_heartbeat_timeout health alert #513

Open
krish-nvidia wants to merge 3 commits into NVIDIA:main from krish-nvidia:scout-heartbeat-timeout-health-alert

@krish-nvidia krish-nvidia commented Mar 10, 2026

Description

Problem: Machines in the Ready state can become unresponsive (e.g., left in a BIOS menu, OS issues, power faults). These failures are silent: carbide does not detect them until an instance creation attempt fails. A metric already tracks this condition, but the host's health state never reflected the timeout, so allocations were not blocked.

Fix: Add a scout heartbeat timeout health alert for machines in the Ready state:

  • Create a merge health override when the time since last_scout_contact_time exceeds scout_reporting_timeout (default 5 minutes)
  • Remove the override automatically when the scout heartbeat recovers while the host is in Ready
  • Clear the scout heartbeat timeout alert whenever the host transitions out of Ready, so stale alerts do not leak across state changes
  • Continue emitting the hosts_with_scout_heartbeat_timeout metric
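
For reviewers, the intended behavior can be sketched roughly as below. This is an illustrative Python model only, not the code in this PR: the `Host` type, the `merge_overrides` set, the `ALERT_ID` constant, and `reconcile_scout_heartbeat` are hypothetical names, and only `last_scout_contact_time`, `scout_reporting_timeout`, and the 5 minute default come from the description. Whether the metric is derived this way is also an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

# Default taken from the PR description (5 minutes).
SCOUT_REPORTING_TIMEOUT = timedelta(minutes=5)
# Hypothetical identifier for the alert carried by the merge override.
ALERT_ID = "scout_heartbeat_timeout"


@dataclass
class Host:
    """Hypothetical, simplified stand-in for a managed host."""
    state: str
    last_scout_contact_time: datetime
    # Models merge health overrides as a set of alert IDs (assumption).
    merge_overrides: set = field(default_factory=set)


def reconcile_scout_heartbeat(host, now, timeout=SCOUT_REPORTING_TIMEOUT):
    """Apply or clear the timeout alert; return True if the alert is active."""
    if host.state != "Ready":
        # Leaving Ready always clears the alert so it cannot go stale.
        host.merge_overrides.discard(ALERT_ID)
        return False

    if now - host.last_scout_contact_time > timeout:
        # Heartbeat is overdue: add the override so allocations are blocked.
        host.merge_overrides.add(ALERT_ID)
        return True

    # Heartbeat recovered while still in Ready: remove the override.
    host.merge_overrides.discard(ALERT_ID)
    return False


if __name__ == "__main__":
    now = datetime(2026, 3, 10, 12, 0, tzinfo=timezone.utc)
    hosts = [
        Host("Ready", now - timedelta(minutes=6)),
        Host("Ready", now - timedelta(minutes=1)),
    ]
    # One possible way to derive a count for the existing metric (assumption).
    timed_out = sum(reconcile_scout_heartbeat(h, now) for h in hosts)
    print("hosts_with_scout_heartbeat_timeout:", timed_out)
```

In this model the override is recomputed on every pass, so recovery and state transitions need no separate cleanup path; the actual change may structure this differently.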

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Signed-off-by: Krish Dandiwala <kdandiwala@nvidia.com>
@krish-nvidia krish-nvidia requested a review from a team as a code owner March 10, 2026 22:27
@krish-nvidia krish-nvidia self-assigned this Mar 10, 2026
@github-actions

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🕐 Last updated: 2026-03-11 14:11:52 UTC | Commit: 76f2032

@github-actions

🛡️ Vulnerability Scan

🚨 Found 74 vulnerability(ies)
📊 vs main: 74 (no change)

Severity Breakdown:

  • 🔴 Critical/High: 74
  • 🟡 Medium: 0
  • 🔵 Low/Info: 0

🕐 Last updated: 2026-03-11 14:12:04 UTC | Commit: 76f2032

Signed-off-by: Krish Dandiwala <kdandiwala@nvidia.com>
@krish-nvidia krish-nvidia requested a review from yoks March 11, 2026 15:57