szibis/metrics-governor

metrics-governor


metrics-governor is a high-performance metrics governance proxy for OTLP and Prometheus Remote Write. Drop it between your apps and your backend to control cardinality, transform metrics in-flight, and scale horizontally — with zero data loss.

Any pipeline. Any backend. On-prem or cloud. Whether you're shipping metrics to Prometheus, Grafana Cloud, Datadog, Splunk, VictoriaMetrics, or any OTLP-compatible backend — metrics-governor sits in front and gives you governance powers that no collector, agent, or vendor provides out of the box.

Two native pipelines. Zero conversion. Zero allocation. OTLP stays OTLP. PRW stays PRW. Each protocol runs its own receive-process-export path with full feature parity, no conversion overhead, and zero-allocation serialization via vtprotobuf.

What's New

  • v1.2.0 — LLM/GenAI metric governance — Token budget tracking, gen_ai.* metric governance with limits rules, per-model/provider visibility. First Prometheus-native proxy to govern LLM observability metrics. Details
  • v1.0.1 — Memory optimization — GOGC tuning (200→100) + Green Tea GC + reduced buffer/queue allocation. Memory at 50k dps dropped 48% (37.5%→19.5%) with only +0.19pp CPU. Memory budget metrics added for operational visibility. Details
  • v1.0 stable release — All 15 deprecated CLI flags, legacy sampling metrics, and backward-compatibility shims removed. Clean, unified API surface.
  • vtprotobuf integration (v0.44) — Zero-allocation protobuf marshal/unmarshal via PlanetScale vtprotobuf with sync.Pool message reuse. Measured <1% CPU at 100k dps.
  • Pipeline performance (v1.0.1) — Lock-free atomic counters, single-shot zstd, pooled compression. Stats full-mode now viable for production.
  • 3,100+ tests — Comprehensive coverage including race detector, vtprotobuf integration, and parity tests across all packages.

Migrating from v0.x? All deprecated flags have replacements — see DEPRECATIONS.md for the full migration table.

The Cardinality Problem — And Why It's Still Unsolved

Metric cardinality is the silent budget killer in observability. Every distinct combination of metric name and label values creates a separate time series. One unbounded label — a user ID, a request path, an ephemeral container name — can turn a single counter into millions of series, crushing your storage backend and exploding your costs.

What's missing across the industry is governance in transit — intelligence between your apps and your backend that knows who the offenders are, protects everyone else, and escalates gradually instead of cutting blindly. That's what metrics-governor does.

Comparison: Open-Source Collectors & Agents

How metrics-governor compares against the most common open-source metrics collectors and agents:

Feature metrics-governor OTel Collector Grafana Alloy vmagent Vector Prometheus Cribl Stream
Cardinality Governance
Adaptive limiting (drop only top offenders)
Tiered escalation (log→sample→strip→drop)
Per-group / per-tenant quotas ⚠️
Dry-run mode for limits ⚠️
Dead rule detection
Rule ownership labels (team routing)
Processing
Static filter / drop
Label transform (rename, regex, add/remove)
Downsample (per-series temporal compression) ⚠️
Cross-series aggregation (avg, sum, p95) ⚠️ ⚠️ ⚠️ ⚠️ ⚠️
Classify (derive ownership labels) ⚠️
Pipeline
OTLP native (gRPC + HTTP) ⚠️ ⚠️ ⚠️
PRW native (no conversion) ⚠️ ⚠️
Persistent queue / zero data loss ⚠️ ⚠️ ⚠️
Consistent hash sharding ⚠️ ⚠️ ⚠️
Circuit breaker / backpressure ⚠️ ⚠️ ⚠️ ⚠️
Observability
LLM/GenAI metric governance
Legend and notes
  • ✅ Fully supported — ⚠️ Partial or limited — ❌ Not available
  • vmagent OTLP: experimental ingestion since v1.93+, primarily PRW-focused
  • vmagent downsample: stream aggregation provides time-based aggregation, not per-series compression algorithms (LTTB, SDT, CV-based)
  • vmagent sharding: requires external hashmod relabeling across multiple instances
  • OTel Collector PRW: available via contrib receiver/exporter, involves internal conversion
  • OTel Collector aggregation: groupbyattrsprocessor provides basic grouping, not full statistical aggregation
  • OTel Collector persistent queue: file_storage extension, limited compared to dedicated disk queue
  • Grafana Alloy sharding: clustering mode with hash ring distribution
  • Vector OTLP: source and sink available, later addition to the platform
  • Vector aggregation: aggregate transform provides interval-based reduction, limited cross-series operations
  • Prometheus OTLP: receiver available since v2.47+, recording rules provide aggregation (not in forwarding path)
  • Prometheus persistent queue: WAL-based remote write queue, limited durability guarantees
  • Cribl Stream quotas: routing by source/destination, not per-metric-group adaptive enforcement
  • Cribl Stream classify: data classification available, not metrics-ownership-specific

Comparison: Vendor Cardinality Management

How metrics-governor's in-transit governance compares against vendor-side cardinality management solutions:

Feature metrics-governor Datadog MwL Grafana Adaptive Metrics Splunk MPM Chronosphere New Relic
Where it runs In transit (your infra) Backend (SaaS) Backend (SaaS) Backend (SaaS) Backend (SaaS) Backend (SaaS)
Open source
Reduces volume before shipping ⚠️
Adaptive limiting (top offenders only) ⚠️ ⚠️
Tiered escalation
Tag allowlist / blocklist ⚠️
Per-group / per-tenant quotas
Unused dimension detection ⚠️ ⚠️
ML-based recommendations ⚠️
Downsample / aggregate in-transit ⚠️
Dead rule detection
Works with any backend
No vendor lock-in
Self-hosted / on-prem ⚠️
LLM/GenAI metric governance
Legend and notes
  • ✅ Fully supported — ⚠️ Partial or limited — ❌ Not available
  • Datadog Metrics without Limits: Decouples ingestion from indexing — all data is ingested (and billed), you choose which tags to keep queryable. Does not reduce data in transit.
  • Grafana Adaptive Metrics: ML-based recommendations for tag aggregation in Grafana Cloud. Suggestions only — requires manual approval. Cloud-only, not available on-prem.
  • Splunk MPM: Dimension utilization ranking (R0-R5), aggregation rules. Available in Splunk Observability Cloud only. Aggregation reduces stored MTS but doesn't reduce ingest volume.
  • Chronosphere: Control plane with aggregation rules and quotas. Available as SaaS and on-prem (limited). Reduces stored data but relies on Chronosphere's storage.
  • New Relic: Drop rules and data management. Limited cardinality-specific controls compared to dedicated governance tools.
  • metrics-governor unused dimension detection: Dead rule detection tracks stale rules; per-metric stats in full mode tracks cardinality per metric. Not ML-based discovery.

Universal Governance for Mixed Environments

Whether you're running legacy Prometheus Remote Write, migrating to modern OpenTelemetry, or operating both in parallel — metrics-governor provides a single governance layer across all your metrics traffic.

  • Bridge old and new — adopt OTel incrementally while maintaining full control over existing Prometheus infrastructure
  • Same rules, same protection — cardinality limits, processing rules, and alerting work identically across both protocols
  • Single pane of governance — one proxy, one config, one set of dashboards for your entire metrics pipeline regardless of protocol mix

Why metrics-governor?

Challenge How metrics-governor Solves It
Cardinality explosions crush your backend Adaptive limiting identifies and drops only the top offenders — well-behaved services keep flowing
All-or-nothing enforcement kills good data Tiered escalation with graduated responses: log → sample → strip labels → drop
Raw volume too high for storage budget Processing rules sample, downsample, aggregate, classify, transform, or drop metrics before they leave the proxy
Storage explosion from a noisy tenant Multi-tenancy with per-tenant quotas and adaptive limits — detect tenants, enforce budgets, protect storage without blanket-dropping
No team accountability for metric costs Rule ownership labels attach team, slack_channel, pagerduty_service to any rule for Alertmanager routing
Data loss during backend outages Always-queue architecture with circuit breaker, persistent disk queue, and exponential backoff — zero data loss by default
Single backend can't keep up Consistent sharding fans out to N backends via K8s DNS discovery with stable hash routing
No visibility into the metrics pipeline Real-time stats, 13 production alerts, Grafana dashboards, and dead rule detection
Unpredictable costs from runaway services Per-group tracking with configurable limits, dry-run mode, and ownership labels for team routing
Need team/severity labels derived from business values Transform rules — build severity, team, env from metric names and label values
Stale rules pile up unnoticed Dead rule detection tracks last-match time for every rule, with alerts for stale cleanup
Complex deployment planning Interactive Playground generates Helm, app, and limits YAML from your throughput inputs

Architecture

metrics-governor architecture

View as text diagram (Mermaid)
flowchart LR
    subgraph Sources["&nbsp; Sources &nbsp;"]
        S1["OTLP gRPC / HTTP\nApps · Agents · SDKs"]:::source
        S2["PRW 1.0 / 2.0\nPrometheus · Grafana Agent"]:::source
    end

    subgraph MG["&nbsp; ⚡ metrics-governor &nbsp;"]
        direction TB
        subgraph OTLP["&thinsp; OTLP Pipeline &thinsp;"]
            direction LR
            O1(["Receive"]):::rx --> O2(["Process"]):::proc --> O3(["Limit"]):::limit
            O3 --> O4(["Queue"]):::queue --> O5(["Prepare"]):::prep --> O6(["Send"]):::send
            O6 -. "retry" .-> O4
        end
        subgraph PRW["&thinsp; PRW Pipeline &thinsp;"]
            direction LR
            P1(["Receive"]):::rx --> P2(["Process"]):::proc --> P3(["Limit"]):::limit
            P3 --> P4(["Queue"]):::queue --> P5(["Prepare"]):::prep --> P6(["Send"]):::send
            P6 -. "retry" .-> P4
        end
    end

    subgraph Backends["&nbsp; Backends &nbsp;"]
        B1["Collector · Mimir\nVictoriaMetrics · Grafana Cloud"]:::backend
        B2["Prometheus · Thanos\nVictoriaMetrics · Cortex"]:::backend
    end

    S1 -->|"gRPC :4317\nHTTP :4318"| O1
    S2 -->|"HTTP :9091"| P1
    O6 --> B1
    P6 --> B2

    classDef source fill:#3498db,stroke:#1a5276,color:#fff,stroke-width:2px
    classDef rx fill:#1abc9c,stroke:#0e6655,color:#fff,stroke-width:2px
    classDef proc fill:#9b59b6,stroke:#6c3483,color:#fff,stroke-width:2px
    classDef limit fill:#e74c3c,stroke:#922b21,color:#fff,stroke-width:2px
    classDef queue fill:#f39c12,stroke:#b7770a,color:#fff,stroke-width:2px
    classDef prep fill:#3498db,stroke:#1a5276,color:#fff,stroke-width:2px
    classDef send fill:#2ecc71,stroke:#1a8c4e,color:#fff,stroke-width:2px
    classDef backend fill:#2ecc71,stroke:#1a8c4e,color:#fff,stroke-width:2px

    style Sources fill:#eaf2f8,stroke:#2980b9,stroke-width:2px,color:#1a5276
    style MG fill:#f9f3e3,stroke:#d4a017,stroke-width:3px,color:#7d6608
    style OTLP fill:#e8f6f3,stroke:#1abc9c,stroke-width:1px,color:#0e6655
    style PRW fill:#fef5e7,stroke:#f39c12,stroke-width:1px,color:#b7770a
    style Backends fill:#eafaf1,stroke:#2ecc71,stroke-width:2px,color:#1a8c4e

Each pipeline runs independently: Receive → Process → Limit → Queue → Prepare → Send → Backend. Failed exports retry through the queue with circuit breaker protection.


Features

Receive — Dual Native Protocols

| Protocol | Ports | Capabilities |
|---|---|---|
| OTLP gRPC | :4317 | Full ExportMetricsService, TLS/mTLS, bearer token, gzip/zstd, vtprotobuf zero-alloc unmarshal |
| OTLP HTTP | :4318 | Protobuf + JSON, gzip/zstd/snappy decompression, content negotiation, vtprotobuf pool reuse |
| PRW 1.0/2.0 | :9091 | Auto-detect version, native histograms, VictoriaMetrics mode, exemplars |

Backpressure built in: capacity-bounded buffers return 429 / ResourceExhausted when full. Docs
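The backpressure behaviour described above can be sketched in Go as a non-blocking push into a bounded channel. This is a minimal illustration, not the project's implementation: `boundedBuffer`, `tryPush`, `statusFor`, and the `Retry-After` header are all hypothetical choices.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// boundedBuffer is a hypothetical capacity-bounded buffer (not the project's type).
type boundedBuffer struct {
	slots chan []byte
}

func newBoundedBuffer(capacity int) *boundedBuffer {
	return &boundedBuffer{slots: make(chan []byte, capacity)}
}

// tryPush enqueues without blocking and reports false when the buffer is full.
func (b *boundedBuffer) tryPush(p []byte) bool {
	select {
	case b.slots <- p:
		return true
	default:
		return false
	}
}

// handler accepts a payload, or answers 429 so the sender can back off and retry.
func (b *boundedBuffer) handler(w http.ResponseWriter, r *http.Request) {
	if !b.tryPush([]byte("payload")) {
		w.Header().Set("Retry-After", "1") // assumption: a one second hint
		http.Error(w, "buffer full", http.StatusTooManyRequests)
		return
	}
	w.WriteHeader(http.StatusAccepted)
}

// statusFor sends one request through the handler and returns the HTTP status.
func statusFor(b *boundedBuffer) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/v1/metrics", nil)
	b.handler(rec, req)
	return rec.Code
}

func main() {
	buf := newBoundedBuffer(1)
	fmt.Println(statusFor(buf)) // the first request fits
	fmt.Println(statusFor(buf)) // the second finds the buffer full
}
```

A gRPC receiver would map the same full-buffer condition to the ResourceExhausted status code instead of 429.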

Supported backends:

Protocol Backends
OTLP OpenTelemetry Collector, Grafana Mimir, Cortex, VictoriaMetrics, ClickHouse, Grafana Cloud
PRW Prometheus, VictoriaMetrics, Grafana Mimir, Cortex, Thanos Receive, Amazon Managed Prometheus, GCP Managed Prometheus, Grafana Cloud

Process — Unified Rules Engine

Six actions in a single ordered pipeline — first match wins:

| Action | What It Does | Terminal? |
|---|---|---|
| Sample | Stochastic reduction (probabilistic or head-N) | Yes |
| Downsample | Per-series compression: 10 methods incl. adaptive CV-based, LTTB, SDT | Yes |
| Aggregate | Cross-series reduction with group_by: avg, sum, p95, stddev, and more | Yes |
| Transform | 12 label operations: rename, regex replace, add, remove, keep, drop | No (chains) |
| Classify | Derive ownership labels (team, severity, priority) from metric metadata | No (chains) |
| Drop | Unconditional removal | Yes |

Transform → Classify chaining: non-terminal actions chain — classify metrics into categories, then transform labels to match your storage schema in a single pass. Plus dead rule detection: always-on metrics track when rules stop matching, with optional scanner and alert rules for stale rule cleanup. Docs
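One way to picture the "first match wins, non-terminal actions chain" behaviour is the loop below. It is a simplified sketch under stated assumptions: the `rule` struct, prefix matching, and `exampleRules` are hypothetical and do not reflect the project's rule schema.

```go
package main

import (
	"fmt"
	"strings"
)

type metric struct {
	name   string
	labels map[string]string
}

// rule is a hypothetical shape: a name-prefix matcher plus an action.
type rule struct {
	prefix   string
	terminal bool
	apply    func(m *metric) (keep bool)
}

// process walks the rules in order. Non-terminal rules mutate the metric and
// evaluation continues; the first matching terminal rule decides and stops.
func process(rules []rule, m *metric) bool {
	for _, r := range rules {
		if !strings.HasPrefix(m.name, r.prefix) {
			continue
		}
		keep := r.apply(m)
		if r.terminal {
			return keep
		}
	}
	return true // no terminal rule matched: pass through
}

// exampleRules: classify, then transform, then a terminal drop for debug metrics.
func exampleRules() []rule {
	return []rule{
		{prefix: "http_", apply: func(m *metric) bool { m.labels["team"] = "web"; return true }},
		{prefix: "http_", apply: func(m *metric) bool { delete(m.labels, "pod"); return true }},
		{prefix: "debug_", terminal: true, apply: func(m *metric) bool { return false }},
	}
}

func main() {
	m := &metric{name: "http_requests_total", labels: map[string]string{"pod": "a-1"}}
	fmt.Println(process(exampleRules(), m), m.labels)

	d := &metric{name: "debug_heap", labels: map[string]string{}}
	fmt.Println(process(exampleRules(), d))
}
```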

Control — Intelligent Cardinality Governance

  • Adaptive Limiting — Drops only the top offenders, not everything. Per-group tracking by service, namespace, or any label combination. Tiered escalation: log → sample → strip labels → drop. Dry-run mode for safe rollouts
  • Cardinality Tracking — Three modes: Bloom filter (98% less memory — 1.2 MB vs 75 MB @ 1M series), HyperLogLog (constant 12 KB), Hybrid (auto-switches at threshold)
  • Bloom Persistence — Save/restore filter state across restarts, eliminating cold-start re-learning
  • Rule Ownership Labels — Attach team, slack_channel, pagerduty_service to any rule for Alertmanager routing
  • LLM/GenAI Token Budget Tracking — Monitor token consumption rates, budget burn, per-model/provider visibility. Govern gen_ai.* metrics with limits rules or dedicated tracker
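The tiered escalation listed above (log → sample → strip labels → drop) can be sketched as a mapping from "how far over the limit" to an action. The thresholds below are illustrative assumptions, not the project's defaults, and `escalate` is a hypothetical function name.

```go
package main

import "fmt"

type action int

const (
	allow action = iota
	logOnly
	sample
	stripLabels
	drop
)

func (a action) String() string {
	return [...]string{"allow", "log", "sample", "strip", "drop"}[a]
}

// escalate maps how far a group is over its limit to a graduated response.
// The 1.10 / 1.50 / 2.00 thresholds are assumptions made for this sketch.
func escalate(current, limit int) action {
	if limit <= 0 || current <= limit {
		return allow
	}
	over := float64(current) / float64(limit)
	switch {
	case over <= 1.10:
		return logOnly
	case over <= 1.50:
		return sample
	case over <= 2.00:
		return stripLabels
	default:
		return drop
	}
}

func main() {
	for _, c := range []int{9000, 10500, 14000, 19000, 30000} {
		fmt.Println(c, escalate(c, 10000))
	}
}
```

The point of the graduated design is that a group slightly over budget only produces a log line, while hard drops are reserved for groups far beyond their limit.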

Export — High-Throughput Pipeline

| Optimization | Impact | How |
|---|---|---|
| vtprotobuf | Zero-allocation marshal/unmarshal | PlanetScale vtprotobuf with sync.Pool message reuse, near-zero GC pressure |
| Pipeline Split | +60-76% throughput | CPU-bound preparers (NumCPU) compress, I/O-bound senders (NumCPU x 2) send HTTP |
| AIMD Batch Tuning | Auto-discovers optimal batch size | +25% after 10 successes, -50% on failure, HTTP 413 ceiling discovery |
| Adaptive Worker Scaling | 1 to NumCPU x 4 workers | EWMA latency tracking, scale up on queue depth, halve on 30s idle |
| Async Send | Max network utilization | Semaphore-bounded concurrency: 4/sender, NumCPU x 8 global |
| Connection Pre-warming | Zero cold-start latency | HEAD requests at startup establish connection pools |
| String Interning | 76% fewer allocations | Label deduplication across the hot path |
| Compression Pooling | 80% fewer allocs | Reusable gzip/zstd/snappy encoder pools |

Protect — Zero Data Loss Architecture

  • Always-Queue — All data flows through the queue (VMAgent/OTel-inspired), eliminating flush-time blocking
  • Persistent Queue — FastQueue disk-backed with snappy compression, 256 KB buffered I/O, write coalescing — 128x fewer IOPS, 70% less disk I/O
  • Circuit Breaker — Three-state (closed/open/half-open) with CAS transitions, prevents cascading failures
  • Split-on-Error — Oversized batches auto-split on HTTP 413 from Mimir, Thanos, VictoriaMetrics, Cortex
  • Backpressure — Buffer returns 429/ResourceExhausted; percentage-based memory sizing (15% buffer, 15% queue)
  • Graceful Shutdown — Drains in-flight exports and persists queue state before termination

Scale — Horizontal and Hierarchical

  • Consistent Sharding — Hash ring with 150 virtual nodes per endpoint, K8s DNS discovery with automatic failover. Same series always routes to same backend (OTLP and PRW)
  • Two-Tier Architecture — DaemonSet edge (Tier 1) processes per-node, StatefulSet gateway (Tier 2) aggregates globally — 10-50x traffic reduction between nodes
  • Percentage-Based Memory — Buffer and queue sizes auto-scale with container resources via cgroup detection
  • Three Queue Modes: memory (fastest), disk (durable), hybrid (best of both)
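The consistent sharding idea (a hash ring with 150 virtual nodes per endpoint, so the same series always routes to the same backend) can be illustrated with the sketch below. The FNV-1a hash, the `endpoint#i` virtual-node naming, and the endpoint addresses are assumptions made for the example; the project may hash differently.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

const virtualNodes = 150 // per endpoint, as stated in the list above

type ring struct {
	points []uint32          // sorted virtual-node positions
	owner  map[uint32]string // position -> endpoint
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(endpoints []string) *ring {
	r := &ring{owner: make(map[uint32]string)}
	for _, ep := range endpoints {
		for i := 0; i < virtualNodes; i++ {
			p := hashKey(fmt.Sprintf("%s#%d", ep, i))
			if _, taken := r.owner[p]; taken {
				continue // rare collision: the first owner keeps the point
			}
			r.owner[p] = ep
			r.points = append(r.points, p)
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// lookup returns the endpoint owning the first ring point at or after the key's hash.
func (r *ring) lookup(seriesKey string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hashKey(seriesKey)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	r := newRing([]string{"backend-0:4317", "backend-1:4317", "backend-2:4317"})
	key := `http_requests_total{service="checkout"}`
	fmt.Println(r.lookup(key) == r.lookup(key)) // a series key maps to one stable endpoint
}
```

When an endpoint is added, a key either keeps its previous owner or moves to the new endpoint, which is why resharding only disturbs a fraction of the series.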

Monitor — Full Observability

  • Real-Time Statistics — Per-metric cardinality, datapoints, and limit violations with three stats levels (none/basic/full)
  • 13 Production Alerts — Zero-overlap design: DataLoss, ExportDegraded, QueueSaturated, CircuitOpen, OOMRisk, CardinalityExplosion, and more — each with runbooks
  • Dead Rule Detection — Always-on last-match tracking for processing and limits rules, with alert rules for stale rule cleanup
  • Grafana Dashboards — Operations and development dashboards included, auto-imported via provisioning
  • Health Endpoints: /live and /ready probes with per-component JSON status for Kubernetes

Deploy — Production Ready from Day One

  • Helm Chart — Full production chart with probes, ConfigMap sidecar, HPA-ready, alert rules integrated
  • Profiles — 6 presets (minimal, balanced, safety, observable, resilient, performance) — one flag to set 30+ parameters, tuned from measured vtprotobuf benchmarks
  • Hot Reload — SIGHUP reloads limits and processing rules without restart; ConfigMap sidecar for Kubernetes
  • Interactive Playground — Browser tool estimates resources, generates Helm/YAML/limits configs, recommends cloud storage classes
  • TLS/mTLS + Auth — Full TLS, mutual TLS, bearer token, basic auth, custom headers
  • Zero-Config Start — Works out of the box with sensible defaults; add limits and sharding when needed

Performance at a Glance

Measured comparison — governor vs OTel Collector vs vmagent (4-core, 1 GB, OTLP gRPC → HTTP):

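| Load | Tool | CPU avg | Memory avg | Ingestion |
|---|---|---|---|---|
| 50k dps | metrics-governor (balanced) | 4.51% | 19.5% | 99.25% |
| 50k dps | OTel Collector | 4.51% | 15.3% | 99.83% |
| 50k dps | vmagent | 2.94% | 7.3% | 99.90% |
| 100k dps | metrics-governor (balanced) | 6.47% | 18.4% | 99.53% |
| 100k dps | OTel Collector | 6.58% | 9.3% | 99.83% |
| 100k dps | vmagent | 16.70% | 3.2% | 99.83% |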

Governor scales sublinearly: 1.43x CPU for 2x load (50k→100k). At 100k dps, governor uses less CPU than OTel Collector while providing full governance features neither tool offers.

| Optimization | Impact |
|---|---|
| vtprotobuf marshal/unmarshal | Zero allocations: sync.Pool message reuse, near-zero GC pressure |
| Pipeline split | +60-76% throughput: CPU-bound preparers + I/O-bound senders |
| Green Tea GC + GOGC=100 | 48% memory reduction vs default GC tuning |
| Cardinality memory (Bloom) | 1.2 MB per 1M series (98% less than maps) |
| String interning | 76% fewer allocations on the hot path |
| Disk I/O (buffered + coalesced) | 128x fewer IOPS, 70% less disk I/O |
| Queue compression (snappy) | 2.5-3x storage capacity |
| Two-tier traffic reduction | 10-50x between DaemonSet and StatefulSet tiers |

See Performance Guide and Benchmarks for methodology and full results.


Flexible Operating Modes

One binary, six profiles — choose durability, observability, cost efficiency, or raw throughput:

| Priority | Queue Mode | Stats Level | Profile | Cost Efficiency | Trade-off |
|---|---|---|---|---|---|
| Maximum Safety | disk | full | safety | High | Full crash recovery + per-metric cost tracking |
| Durable + Observable | hybrid | full | observable | High | Disk spillover + full per-metric stats for cost visibility |
| Resilient | hybrid | basic | resilient | Medium | Memory-speed normally, disk spillover for spikes |
| High Throughput | hybrid | basic | performance | Low | Pipeline split + max throughput + adaptive tuning |
| Balanced (default) | memory | basic | balanced | Medium | Best performance with essential metrics |
| Minimal Footprint | memory | none | minimal | | Smallest resource usage, pure proxy |

Higher proxy resources (disk, CPU) can save 10–100x in backend SaaS costs by identifying and reducing expensive metrics before they reach your storage. See Cost Efficiency.

See Profiles and Performance Tuning for details.


Quick Start

# Start metrics-governor with adaptive limits
metrics-governor \
  -exporter-endpoint otel-collector:4317 \
  -limits-config limits.yaml \
  -limits-dry-run=false \
  -stats-labels service,env

# Point your apps at metrics-governor instead of the collector
# export OTEL_EXPORTER_OTLP_ENDPOINT=http://metrics-governor:4317

# limits.yaml: adaptive limiting by service
rules:
  - name: "per-service-limits"
    match:
      labels:
        service: "*"
    max_cardinality: 10000
    max_datapoints_rate: 100000
    action: adaptive
    group_by: ["service"]

When cardinality exceeds 10,000, metrics-governor identifies which service is the top contributor and drops only that service's excess metrics — everyone else keeps flowing.
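The selection step described above, finding the top contributor and shedding only its excess, can be sketched in Go. This is an illustration under assumptions, not the project's algorithm: `topOffender` is a hypothetical function and the per-service counts are made up.

```go
package main

import "fmt"

// topOffender returns the group with the highest series count and how many
// series must be shed from it to bring the total back under the limit.
// Ties are broken by group name so the result is deterministic.
func topOffender(perGroup map[string]int, maxCardinality int) (group string, excess int) {
	total := 0
	top := -1
	for g, n := range perGroup {
		total += n
		if n > top || (n == top && g < group) {
			group, top = g, n
		}
	}
	if total <= maxCardinality {
		return "", 0 // within budget: nothing to drop
	}
	excess = total - maxCardinality
	if excess > top {
		excess = top // never shed more than the offender contributes
	}
	return group, excess
}

func main() {
	// Made-up per-service series counts for a rule with max_cardinality: 10000.
	perService := map[string]int{"checkout": 7000, "search": 2500, "auth": 1500}
	group, excess := topOffender(perService, 10000)
	fmt.Println(group, excess)
}
```

With these counts the total is 11,000, so only `checkout` loses its 1,000 excess series while `search` and `auth` are untouched.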


Playground

Plan your deployment in seconds. The interactive Playground estimates CPU, memory, disk I/O, and K8s pod sizing from your throughput inputs, and generates ready-to-use Helm, app config, and limits YAML — all in a single zero-dependency HTML page.

Open Playground | Source

Throughput Inputs — Simple & Advanced modes
Resource Estimation — CPU, memory, disk, fit check
Editable YAML — Bidirectional sync with inputs
Fit Check — Pod override & resource validation

Documentation

Guide Description
🚀 Installation Source, Docker, or Helm chart
⚙️ Configuration YAML config and CLI flags reference
📋 Profiles 6 presets: minimal, balanced, safety, observable, resilient, performance
📡 Receiving OTLP gRPC/HTTP, PRW 1.0/2.0, backpressure
📡 PRW Protocol PRW 1.0/2.0, native histograms, VictoriaMetrics mode
🔄 Processing Rules Sample, downsample, aggregate, transform, classify, drop, dead rule detection
🏗️ Two-Tier Architecture DaemonSet edge + StatefulSet gateway pattern
🎯 Limits Adaptive limiting, tiered escalation, per-label limits, rule ownership
👥 Multi-Tenancy Tenant detection (header/label/attribute), per-tenant quotas, priority-based enforcement
🔀 Sharding Consistent hashing, K8s DNS discovery
📊 Statistics Per-metric tracking, three stats levels
Export Pipeline Pipeline split, batch tuning, adaptive scaling
Performance Bloom filters, string interning, I/O optimization
🛡️ Resilience Circuit breaker, persistent queue, backoff
📦 Queue Memory, disk, hybrid queue modes
🔢 Cardinality Tracking Bloom, HyperLogLog, Hybrid mode
💾 Bloom Persistence Save/restore filter state across restarts
🚨 Alerting 13 alerts with runbooks, dead rule detection
🎯 SLOs SLI definitions, error budgets, burn-rate alerts, health dashboard
🤖 LLM Governance Token budget tracking, gen_ai.* metric governance, example configs
📊 Dashboards Grafana operations and development dashboards
🏭 Production Guide Sizing, HPA/VPA, DaemonSet, bare metal
🔧 Stability Tuning Graduated spillover, load shedding, drain ordering, backpressure tuning
🏥 Health Kubernetes liveness and readiness probes
🔄 Dynamic Reload Hot-reload via SIGHUP with ConfigMap sidecar
🔐 TLS Server/client TLS, mTLS
🔑 Auth Bearer token, basic auth, custom headers
📦 Compression gzip, zstd, snappy
🌐 HTTP Settings Connection pools, timeouts, HTTP/2
📝 Logging JSON structured logging
🖥️ Playground Interactive deployment planner
🧪 Testing Test environment, Docker Compose
🛠️ Development Building, contributing
📜 Changelog Release history with breaking changes
⚠️ Deprecations Deprecation lifecycle, migration table

Contributing

Contributions welcome! See Development Guide for details.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

Apache License 2.0 — see LICENSE.

Support


Built with ❤️ for the observability community
