feat: upstream model health monitoring by christianromeni · Pull Request #22 · voidmind-io/voidllm

christianromeni · 2026-03-24T17:16:09Z

Summary

Proactive upstream model health monitoring with 3 configurable probe levels.

Health Check Levels

| Level | Probe | Default |
| --- | --- | --- |
| `health` | GET server root (any HTTP response = healthy) | On, 30s |
| `models` | GET /models (2xx = ok) | Off, 60s |
| `functional` | POST /chat/completions with 1 token | Off, 5m |

Backend

New internal/health/ package with Checker, 3 probe types
Copy-on-write for thread-safe concurrent results
Error sanitization (no internal URLs/IPs leaked)
Minimum interval enforcement (10s health/models, 60s functional)
Prometheus gauges: model_health_status + model_health_latency_seconds
API: GET /api/v1/models/health
Dashboard stats: models_healthy / models_unhealthy / models_degraded
15 health checker tests

Frontend

Health badges on Models page (green/yellow/red dots)
Model Health section on Dashboard
Model Performance table shows health-probe latency and TPS in place of the misleading average duration
useModelHealth hook with 15s polling

Config

settings:
  health_check:
    health:
      enabled: true
      interval: 30s

Test plan

go test ./... -race 17/17 passed
npx tsc --noEmit clean
npm run lint clean
npm run test -- --run 305/305 passed
Manual: dolphin-mistral shows "healthy" with 1ms latency
Manual: unreachable models show "unhealthy"

- 3-level health checks: health (ping), models (GET /models), functional (completion)
- Each level independently configurable: enabled + interval
- Default: health probe every 30s, others disabled
- Copy-on-write for thread-safe concurrent probe results
- Error sanitization (no internal URLs/IPs in API responses)
- Minimum interval enforcement (10s health/models, 60s functional)
- Health latency in Model Performance table (replaces misleading avg duration)
- TPS (tokens/sec) throughput metric from usage data
- Health badges on Models page (healthy/degraded/unhealthy)
- Health summary on Dashboard (healthy/degraded/unhealthy counts)
- Prometheus gauges: model_health_status + model_health_latency_seconds
- API: GET /api/v1/models/health
- 15 health checker tests

christianromeni merged commit 2752afe into main Mar 24, 2026
5 checks passed

christianromeni deleted the feat/model-health branch March 24, 2026 17:30

christianromeni merged 1 commit into main from feat/model-health
