feat: upstream model health monitoring #22

Merged: christianromeni merged 1 commit into main from feat/model-health on Mar 24, 2026

Summary

Proactive upstream model health monitoring with 3 configurable probe levels.

Health Check Levels

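| Level      | Probe                                          | Default  |
|------------|------------------------------------------------|----------|
| health     | GET server root (any HTTP response = healthy)  | On, 30s  |
| models     | GET /models (2xx = ok)                         | Off, 60s |
| functional | POST /chat/completions with 1 token            | Off, 5m  |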

Backend

  • New internal/health/ package with Checker, 3 probe types
  • Copy-on-write for thread-safe concurrent results
  • Error sanitization (no internal URLs/IPs leaked)
  • Minimum interval enforcement (10s health/models, 60s functional)
  • Prometheus gauges: model_health_status + model_health_latency_seconds
  • API: GET /api/v1/models/health
  • Dashboard stats: models_healthy / models_unhealthy / models_degraded
  • 15 health checker tests

Frontend

  • Health badges on Models page (green/yellow/red dots)
  • Model Health section on Dashboard
  • Model Performance table uses health latency + TPS (replacing the misleading average duration)
  • useModelHealth hook with 15s polling

Config

settings:
  health_check:
    health:
      enabled: true
      interval: 30s

Test plan

  • go test ./... -race: 17/17 passed
  • npx tsc --noEmit: clean
  • npm run lint: clean
  • npm run test -- --run: 305/305 passed
  • Manual: dolphin-mistral shows "healthy" with 1ms latency
  • Manual: unreachable models show "unhealthy"

- 3-level health checks: health (ping), models (GET /models), functional (completion)
- Each level independently configurable: enabled + interval
- Default: health probe every 30s, others disabled
- Copy-on-write for thread-safe concurrent probe results
- Error sanitization (no internal URLs/IPs in API responses)
- Minimum interval enforcement (10s health/models, 60s functional)
- Health latency in Model Performance table (replaces misleading avg duration)
- TPS (tokens/sec) throughput metric from usage data
- Health badges on Models page (healthy/degraded/unhealthy)
- Health summary on Dashboard (healthy/degraded/unhealthy counts)
- Prometheus gauges: model_health_status + model_health_latency_seconds
- API: GET /api/v1/models/health
- 15 health checker tests
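
The error sanitization item could be approached along these lines. This is a regex-based sketch under stated assumptions, not the PR's implementation: the function name `sanitize` and the placeholder strings are hypothetical, and only http(s) URLs and IPv4 addresses are covered (IPv6 and bare hostnames are not).

```go
package main

import "regexp"

var (
	// Matches http(s) URLs up to the next whitespace or quote character.
	urlPattern = regexp.MustCompile(`https?://[^\s"']+`)
	// Matches IPv4 addresses with an optional :port suffix.
	ipPattern = regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}(?::\d+)?\b`)
)

// sanitize redacts URLs and IPv4 addresses from an error message before
// it is exposed through an API response.
func sanitize(msg string) string {
	msg = urlPattern.ReplaceAllString(msg, "[redacted-url]")
	msg = ipPattern.ReplaceAllString(msg, "[redacted-host]")
	return msg
}
```

URLs are redacted first so that an address embedded in a URL is removed together with its path, leaving the IP pattern to catch only standalone addresses such as those in dial errors.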
@christianromeni christianromeni merged commit 2752afe into main Mar 24, 2026
5 checks passed
@christianromeni christianromeni deleted the feat/model-health branch March 24, 2026 17:30