| Status | Active |
|---|---|
| Owner | HyperFleet Platform Team |
| Last Updated | 2026-03-20 |
This document defines the standard conventions for Prometheus metrics across all HyperFleet components (API, Sentinel, Adapters).
- Consistency: All components follow the same naming and labeling conventions
- Predictability: Engineers can anticipate metric names and labels across repositories
- Observability: Standardized metrics enable unified dashboards and alerting
- Creating a shared metrics library
- Mandating a specific metrics framework beyond Prometheus compatibility
All metrics MUST follow the Prometheus naming best practices with HyperFleet-specific prefixes.
hyperfleet_<component>_<metric_name>_<unit>
| Element | Description | Examples |
|---|---|---|
hyperfleet_ |
Global prefix for all HyperFleet metrics | Required |
<component> |
Component name | api, sentinel, adapter |
<metric_name> |
Descriptive name using snake_case | events_processed, request_duration |
<unit> |
Unit suffix (when applicable) | _seconds, _bytes, _total |
- Use snake_case for all metric names
- Use
_totalsuffix for counters - Use
_secondssuffix for durations (always use seconds, not milliseconds) - Use
_bytessuffix for sizes - Use
_infosuffix for info metrics (gauges with static labels) - Avoid redundant prefixes - don't repeat component name in metric name
hyperfleet_api_requests_total
hyperfleet_sentinel_poll_duration_seconds
hyperfleet_adapter_events_processed_total
All metrics MUST include these standard labels:
| Label | Description | Example Values |
|---|---|---|
component |
Component name | api, sentinel, adapter-validation |
version |
Component version | v1.2.3, dev-abc123 |
hyperfleet_sentinel_events_published_total{component="sentinel", version="v1.2.3", resource_type="clusters"} 1523
hyperfleet_adapter_events_processed_total{component="adapter-validation", version="v1.0.0", status="success"} 834
hyperfleet_api_requests_total{component="api", version="v2.1.0", method="GET", status_code="200"} 10234
DO:
- Use labels for dimensions that need filtering/aggregation
- Keep label cardinality low (< 100 unique values per label)
- Use consistent label names across metrics
- Sanitize dynamic values (replace IDs with placeholders)
DON'T:
- Use high-cardinality labels (cluster IDs, user IDs, timestamps)
- Include sensitive information in labels
- Use labels for data that changes frequently
Endpoint Path Sanitization:
/clusters/cls-abc123 → /clusters/{id}
/clusters/cls-abc/nodepools/np-1 → /clusters/{id}/nodepools/{id}
/namespaces/ns-123/jobs/job-456 → /namespaces/{ns}/jobs/{name}
Every HyperFleet component MUST expose these baseline metrics:
Go applications automatically expose these via the Prometheus client library:
| Metric | Type | Description |
|---|---|---|
go_goroutines |
Gauge | Number of goroutines |
go_memstats_alloc_bytes |
Gauge | Bytes allocated and in use |
go_gc_duration_seconds |
Summary | GC pause duration |
process_cpu_seconds_total |
Counter | Total CPU time |
process_resident_memory_bytes |
Gauge | Resident memory size |
Every component MUST expose a build info metric:
hyperfleet_sentinel_build_info{component="sentinel", version="v1.2.3", commit="abc123", go_version="go1.21"} 1
Every component SHOULD expose health status as a metric:
hyperfleet_sentinel_up{component="sentinel", version="v1.2.3"} 1
Use for values that only increase (can reset to zero on restart).
# Total requests processed
hyperfleet_api_requests_total{component="api", version="v1.0.0", method="GET", path="/clusters/{id}", status_code="200"} 1523
# Total errors
hyperfleet_adapter_errors_total{component="adapter-validation", version="v1.0.0", error_type="api_error"} 12
Use for values that can go up or down.
# Current queue depth
hyperfleet_sentinel_pending_resources{component="sentinel", version="v1.2.3", resource_type="clusters"} 42
# Current connections
hyperfleet_api_active_connections{component="api", version="v1.0.0"} 15
Use for measuring distributions (latency, size).
hyperfleet_api_request_duration_seconds_bucket{component="api", version="v1.0.0", method="GET", le="0.1"} 1200
hyperfleet_api_request_duration_seconds_bucket{component="api", version="v1.0.0", method="GET", le="0.5"} 1450
hyperfleet_api_request_duration_seconds_bucket{component="api", version="v1.0.0", method="GET", le="+Inf"} 1523
hyperfleet_api_request_duration_seconds_sum{component="api", version="v1.0.0", method="GET"} 234.56
hyperfleet_api_request_duration_seconds_count{component="api", version="v1.0.0", method="GET"} 1523
Avoid summaries in favor of histograms. Histograms are more flexible for aggregation.
Choose buckets based on the expected value distribution:
For HTTP API calls (fast operations):
Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}For adapter event processing (potentially long operations):
Buckets: []float64{0.1, 0.5, 1, 2, 5, 10, 30, 60, 120}For database operations:
Buckets: []float64{0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1}- Include buckets at expected p50, p90, p95, p99 values
- Ensure at least 2-3 buckets below typical values
- Include buckets above SLO thresholds for alerting
- Maximum 10-15 buckets to limit cardinality
All HyperFleet components MUST use:
| Port | Path | Description |
|---|---|---|
9090 |
/metrics |
Prometheus metrics endpoint |
All metrics MUST be compatible with OpenMetrics format. The Prometheus Go client handles this automatically.
For detailed metrics definitions per component, see:
- Sentinel: Sentinel Operator Guide
- Adapters: Adapter Metrics