metrics-governor supports two configuration methods:
YAML configuration file (recommended for complex setups)
CLI flags (for simple setups or quick overrides)
Dual Pipeline Support: All components (receivers, buffers, exporters, limits, sharding, queues) behave the same way in the OTLP and PRW pipelines. The two pipelines are configured and run completely separately: OTLP options use the standard flags, and PRW options use -prw-* prefixed flags.
Supported Backends
metrics-governor can export metrics to any OTLP or Prometheus Remote Write compatible backend:
OTLP Protocol (gRPC or HTTP)
| Backend | Protocol | Default Path | Notes |
|---|---|---|---|
| OpenTelemetry Collector | gRPC (4317) or HTTP (4318) | /v1/metrics | Most common setup |
| Prometheus (with OTLP receiver) | gRPC (4317) | /v1/metrics | Requires --enable-feature=otlp-write-receiver |
| Grafana Mimir | gRPC or HTTP | /otlp/v1/metrics | Native OTLP support |
| Cortex | gRPC or HTTP | /api/v1/push | Via OTLP receiver |
| Thanos | gRPC | /v1/metrics | Via sidecar or receive |
| VictoriaMetrics | HTTP only | /opentelemetry/v1/metrics | Native OTLP support |
| ClickHouse | gRPC or HTTP | /v1/metrics | Via OTLP receiver |
| Grafana Cloud | gRPC or HTTP | /otlp/v1/metrics | Cloud hosted |
Prometheus Remote Write (PRW)
| Backend | Default Path | Notes |
|---|---|---|
| Prometheus | /api/v1/write | Native PRW support |
| VictoriaMetrics | /api/v1/write or /write | Use -prw-exporter-vm-mode for optimizations |
| Grafana Mimir | /api/v1/push | PRW compatible |
| Cortex | /api/v1/push | PRW compatible |
| Thanos Receive | /api/v1/receive | PRW compatible |
| Grafana Cloud | /api/prom/push | Cloud hosted |
| Amazon Managed Prometheus | /api/v1/remote_write | AWS hosted |
| Google Cloud Managed Prometheus | Custom | GCP hosted |
YAML Configuration File
Use the -config flag to specify a YAML configuration file:
All settings can also be configured via CLI flags.
Configuration Flag
| Flag | Default | Description |
|---|---|---|
| -config | | Path to YAML configuration file |
Receiver Options
| Flag | Default | Description |
|---|---|---|
| -grpc-listen | :4317 | gRPC receiver listen address |
| -http-listen | :4318 | HTTP receiver listen address |
| -http-receiver-path | /v1/metrics | URL path for HTTP receiver |
| -receiver-tls-enabled | false | Enable TLS for receivers |
| -receiver-tls-cert | | Path to server certificate file |
| -receiver-tls-key | | Path to server private key file |
| -receiver-tls-ca | | Path to CA certificate for client verification (mTLS) |
| -receiver-tls-client-auth | false | Require client certificates (mTLS) |
| -receiver-auth-enabled | false | Enable authentication for receivers |
| -receiver-auth-bearer-token | | Expected bearer token for authentication |
| -receiver-auth-basic-username | | Basic auth username |
| -receiver-auth-basic-password | | Basic auth password |
Exporter Options
The OTLP exporter supports any OTLP-compatible backend via gRPC or HTTP protocols: OpenTelemetry Collector, Prometheus, Grafana Mimir, Cortex, Thanos, VictoriaMetrics, and others.
| Flag | Default | Description |
|---|---|---|
| -exporter-endpoint | localhost:4317 | OTLP exporter endpoint (host:port for gRPC, URL for HTTP) |
| -exporter-protocol | grpc | Exporter protocol: grpc (recommended, most backends) or http |
| -exporter-default-path | /v1/metrics | Default HTTP path when endpoint has no path. Standard: /v1/metrics. VictoriaMetrics: /opentelemetry/v1/metrics |
| -exporter-insecure | true | Use insecure connection (no TLS) for exporter |
| -exporter-timeout | 30s | Exporter request timeout |
| -exporter-tls-enabled | false | Enable custom TLS config for exporter |
| -exporter-tls-cert | | Path to client certificate file (mTLS) |
| -exporter-tls-key | | Path to client private key file (mTLS) |
| -exporter-tls-ca | | Path to CA certificate for server verification |
| -exporter-tls-skip-verify | false | Skip TLS certificate verification |
| -exporter-tls-server-name | | Override server name for TLS verification |
| -exporter-auth-bearer-token | | Bearer token to send with requests |
| -exporter-auth-basic-username | | Basic auth username |
| -exporter-auth-basic-password | | Basic auth password |
| -exporter-auth-headers | | Custom headers (format: key1=value1,key2=value2) |
Buffer Options
| Flag | Default | Description |
|---|---|---|
| -buffer-size | 10000 | Maximum number of metrics to buffer |
| -flush-interval | 5s | Buffer flush interval |
| -batch-size | 1000 | Maximum batch size for export (by count) |
| -max-batch-bytes | 8388608 | Maximum batch size in bytes (8MB). Batches exceeding this are recursively split. Set below backend limit. 0 disables byte splitting. |
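The recursive splitting described for -max-batch-bytes can be pictured with a short sketch. This is not the project's implementation: the function names are hypothetical, items are represented only by their encoded sizes, and halving by item count is an assumed splitting strategy.

```go
package main

import "fmt"

// total returns the combined size of all items in a batch.
func total(sizes []int) int {
	sum := 0
	for _, s := range sizes {
		sum += s
	}
	return sum
}

// splitByBytes halves a batch recursively until every part fits within
// maxBytes. sizes holds the encoded size of each item (a stand-in for real
// metrics). maxBytes == 0 disables byte splitting, as the flag description says.
// A single item larger than maxBytes cannot be split further and is returned as is.
func splitByBytes(sizes []int, maxBytes int) [][]int {
	if maxBytes == 0 || len(sizes) <= 1 || total(sizes) <= maxBytes {
		return [][]int{sizes}
	}
	mid := len(sizes) / 2
	left := splitByBytes(sizes[:mid], maxBytes)
	right := splitByBytes(sizes[mid:], maxBytes)
	return append(left, right...)
}

func main() {
	batch := []int{3, 3, 3, 3, 3} // item sizes, in MB for readability
	for _, part := range splitByBytes(batch, 8) {
		fmt.Println(part, "total:", total(part))
	}
}
```

Splitting by halves keeps the number of export requests close to the minimum while never needing to know the backend's exact limit, which is why the flag only has to be set somewhere below it.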
Stats Options
| Flag | Default | Description |
|---|---|---|
| -stats-addr | :9090 | Stats/metrics HTTP endpoint address |
| -stats-labels | | Comma-separated labels to track (e.g., service,env,cluster) |
| -stats-level | basic | Stats collection level: none (disabled), basic (core counters), or full (per-label breakdowns) |
Limits Options
| Flag | Default | Description |
|---|---|---|
| -limits-config | | Path to limits configuration YAML file |
| -limits-dry-run | true | Dry run mode: log violations but don't enforce |
Queue Options (FastQueue)
The queue uses a high-performance FastQueue implementation inspired by VictoriaMetrics' persistentqueue. It provides metadata-only persistence with in-memory buffering for high throughput. See resilience.md for circuit breaker and backoff documentation.
| Flag | Default | Description |
|---|---|---|
| -queue-enabled | true | Enable failover queue (safety net for export failures) |
| -queue-type | memory | Queue type: memory (bounded in-memory, fast) or disk (FastQueue, durable, survives restarts) |
| -queue-mode | memory | Queue mode: memory (in-memory only), disk (fully disk-backed via FastQueue), or hybrid (L1 memory + L2 disk spillover) |
| -queue-path | ./queue | Queue storage directory (disk and hybrid modes) |
| -queue-max-size | 10000 | Maximum number of batches in queue |
| -queue-max-bytes | 268435456 | Maximum memory for in-memory queue (256MB). In hybrid mode, this is the L1 memory capacity before spilling to disk. |
| -queue-hybrid-spillover-pct | 80 | Percentage of in-memory queue capacity before spilling to disk (hybrid mode only, 1-100) |
| -queue-retry-interval | 5s | Initial retry interval |
| -queue-max-retry-delay | 5m | Maximum retry backoff delay |
| -queue-full-behavior | drop_oldest | Queue full behavior: drop_oldest, drop_newest, or block |
| -queue-adaptive-enabled | true | Enable adaptive queue sizing (disk mode only) |
| -queue-target-utilization | 0.85 | Target disk utilization (0.0-1.0, disk mode only) |
| -queue-inmemory-blocks | 2048 | In-memory channel size for fast path (disk mode only) |
| -queue-chunk-size | 536870912 | Chunk file size in bytes (512MB, disk mode only) |
| -queue-meta-sync | 1s | Metadata sync interval (max data loss window, disk mode only) |
| -queue-stale-flush | 30s | Interval to flush stale in-memory blocks to disk (disk mode only) |
| -queue-write-buffer-size | 262144 | Buffered writer size in bytes (256KB, disk mode only) |
URL path for PRW receiver (empty = register both /api/v1/write and /write)
| Flag | Default | Description |
|---|---|---|
| -prw-receiver-version | auto | Protocol version: 1.0, 2.0, or auto |
| -prw-receiver-tls-enabled | false | Enable TLS for PRW receiver |
| -prw-receiver-tls-cert | | Certificate file path |
| -prw-receiver-tls-key | | Private key file path |
| -prw-receiver-auth-enabled | false | Enable authentication |
| -prw-receiver-auth-bearer-token | | Expected bearer token |
PRW Exporter Options
The Prometheus Remote Write exporter supports any PRW-compatible backend: Prometheus, Grafana Mimir, Cortex, Thanos, VictoriaMetrics, and others.
| Flag | Default | Description |
|---|---|---|
| -prw-exporter-endpoint | | PRW backend URL (empty = disabled) |
| -prw-exporter-default-path | /api/v1/write | Default PRW path when endpoint has no path. Standard: /api/v1/write. Mimir/Cortex: /api/v1/push. Thanos: /api/v1/receive |
| -prw-exporter-version | auto | Protocol version: 1.0 (standard), 2.0 (native histograms), or auto |
| -prw-exporter-timeout | 30s | Request timeout |
| -prw-exporter-tls-enabled | false | Enable TLS |
| -prw-exporter-tls-cert | | Client certificate (mTLS) |
| -prw-exporter-tls-key | | Client key (mTLS) |
| -prw-exporter-tls-ca | | CA certificate |
| -prw-exporter-auth-bearer-token | | Bearer token for auth |
| -prw-exporter-vm-mode | false | Enable VictoriaMetrics mode |
| -prw-exporter-vm-compression | snappy | Compression: snappy or zstd |
PRW Buffer Options
| Flag | Default | Description |
|---|---|---|
| -prw-buffer-size | 10000 | Maximum requests in buffer |
| -prw-flush-interval | 5s | Flush interval |
| -prw-batch-size | 1000 | Batch size for export |
PRW Queue Options
The PRW queue uses the same high-performance disk-backed SendQueue as the OTLP pipeline, providing persistent storage, circuit breaker, exponential backoff, and split-on-error. See resilience.md for detailed resilience documentation.
| Flag | Default | Description |
|---|---|---|
| -prw-queue-enabled | false | Enable persistent retry queue |
| -prw-queue-path | ./prw-queue | Queue directory (disk-backed, survives restarts) |
| -prw-queue-max-size | 10000 | Max queue entries |
| -prw-queue-max-bytes | 1073741824 | Max queue size in bytes (1GB) |
| -prw-queue-retry-interval | 5s | Initial retry interval |
| -prw-queue-max-retry-delay | 5m | Maximum retry backoff delay |
| -prw-queue-backoff-enabled | true | Enable exponential backoff for retries |
| -prw-queue-backoff-multiplier | 2.0 | Multiply delay by this on each failure |
| -prw-queue-circuit-breaker-enabled | true | Enable circuit breaker pattern |
| -prw-queue-circuit-breaker-threshold | 5 | Consecutive failures before opening circuit |
| -prw-queue-circuit-breaker-reset-timeout | 30s | Time before half-open state |
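To see how the retry interval, backoff multiplier, and maximum retry delay interact, the sketch below computes the delay schedule for the default values. It is an illustration only: the function name is hypothetical, and whether the real queue applies the multiplier before the first retry or adds jitter is not stated here, so neither is modelled.

```go
package main

import (
	"fmt"
	"time"
)

// Defaults taken from the flag table: 5s initial interval, 5m cap, 2.0 multiplier.
const (
	initialDelay = 5 * time.Second
	maxDelay     = 5 * time.Minute
	multiplier   = 2.0
)

// backoffDelay returns the delay to wait after the given number of
// consecutive failures: initial * mult^failures, capped at ceiling.
func backoffDelay(initial, ceiling time.Duration, mult float64, failures int) time.Duration {
	delay := float64(initial)
	for i := 0; i < failures; i++ {
		delay *= mult
		if delay >= float64(ceiling) {
			return ceiling
		}
	}
	return time.Duration(delay)
}

func main() {
	for failures := 0; failures <= 7; failures++ {
		fmt.Println(failures, backoffDelay(initialDelay, maxDelay, multiplier, failures))
	}
}
```

Under these assumptions the delay doubles from 5s up to 2m40s and then stays at the 5m cap from the sixth consecutive failure onward.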
PRW Sharding Options
| Flag | Default | Description |
|---|---|---|
| -prw-sharding-enabled | false | Enable consistent sharding |
| -prw-sharding-headless-service | | K8s headless service DNS name with port |
| -prw-sharding-labels | | Comma-separated labels for shard key |
| -prw-sharding-dns-refresh-interval | 30s | DNS refresh interval |
| -prw-sharding-virtual-nodes | 150 | Virtual nodes per endpoint |
OTLP Queue Options
The queue provides durability for export failures with memory or disk-backed storage. Memory mode (default) is fast with bounded in-memory queue. Disk mode uses a high-performance FastQueue implementation. See resilience.md for detailed information on circuit breaker, backoff, failover queue, and split-on-error behavior.
| Flag | Default | Description |
|---|---|---|
| -queue-enabled | true | Enable failover queue (safety net for export failures) |
| -queue-type | memory | Queue type: memory (bounded, fast) or disk (FastQueue, durable) |
| -queue-mode | memory | Queue mode: memory, disk, or hybrid (L1 memory + L2 disk spillover). See queue.md for details. |
| -queue-path | ./queue | Queue directory path (disk and hybrid modes) |
| -queue-max-size | 10000 | Max queue entries |
| -queue-max-bytes | 268435456 | Maximum memory for in-memory queue (256MB) |
| -queue-hybrid-spillover-pct | 80 | Percentage of in-memory queue capacity before spilling to disk (hybrid mode only) |
| -queue-retry-interval | 5s | Initial retry interval |
| -queue-max-retry-delay | 5m | Maximum retry backoff delay |
| -queue-full-behavior | drop_oldest | Behavior when full: drop_oldest, drop_newest, block |
| -queue-adaptive-enabled | true | Enable adaptive queue sizing (disk mode only) |
| -queue-target-utilization | 0.85 | Target disk utilization (disk mode only) |
| -queue-inmemory-blocks | 256 | In-memory channel size (disk mode only) |
| -queue-chunk-size | 536870912 | Chunk file size in bytes (disk mode only) |
| -queue-meta-sync | 1s | Metadata sync interval (disk mode only) |
| -queue-stale-flush | 5s | Flush stale in-memory blocks to disk (disk mode only) |
Queue Resilience Options
| Flag | Default | Description |
|---|---|---|
| -queue-backoff-enabled | true | Enable exponential backoff for retries |
| -queue-backoff-multiplier | 2.0 | Multiply delay by this on each failure |
| -queue-circuit-breaker-enabled | true | Enable circuit breaker pattern |
| -queue-circuit-breaker-threshold | 10 | Consecutive failures before opening circuit |
| -queue-circuit-breaker-reset-timeout | 30s | Time before half-open state |
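The circuit breaker flags describe a closed, open, and half-open cycle. A minimal sketch of that state machine follows, assuming the textbook pattern rather than the project's actual code; the type and method names are hypothetical, and the sketch is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"time"
)

type state int

const (
	closed state = iota
	open
	halfOpen
)

// breaker opens after threshold consecutive failures and lets a probe
// request through once resetTimeout has elapsed (half-open state).
type breaker struct {
	threshold    int
	resetTimeout time.Duration
	failures     int
	state        state
	openedAt     time.Time
}

// allow reports whether a request may be sent at time now.
func (b *breaker) allow(now time.Time) bool {
	if b.state == open && now.Sub(b.openedAt) >= b.resetTimeout {
		b.state = halfOpen
	}
	return b.state != open
}

// record feeds the outcome of a request back into the breaker.
func (b *breaker) record(ok bool, now time.Time) {
	if ok {
		b.failures = 0
		b.state = closed
		return
	}
	b.failures++
	// A failed probe in half-open state reopens the circuit immediately.
	if b.state == halfOpen || b.failures >= b.threshold {
		b.state = open
		b.openedAt = now
	}
}

func main() {
	cb := &breaker{threshold: 10, resetTimeout: 30 * time.Second}
	start := time.Now()
	for i := 0; i < 10; i++ {
		cb.record(false, start)
	}
	fmt.Println(cb.allow(start))                       // false: circuit is open
	fmt.Println(cb.allow(start.Add(30 * time.Second))) // true: half-open probe
}
```

While the circuit is open, batches would stay in the queue instead of being sent, which keeps a failing backend from being hammered with retries.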
Memory Limit Options
| Flag | Default | Description |
|---|---|---|
| -memory-limit-ratio | 0.9 | Ratio of container memory for GOMEMLIMIT (0.0-1.0, 0=disabled) |
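As a rough illustration of what -memory-limit-ratio implies, the sketch below derives a Go soft memory limit from a container limit and a ratio. The helper name and the hard-coded 2 GiB container size are assumptions; how metrics-governor actually detects the container limit is not covered in this section.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// memoryLimit returns the soft memory limit in bytes for the Go runtime,
// or 0 when the feature is disabled (ratio of 0) or the inputs are invalid.
func memoryLimit(containerBytes int64, ratio float64) int64 {
	if ratio <= 0 || ratio > 1 || containerBytes <= 0 {
		return 0
	}
	return int64(float64(containerBytes) * ratio)
}

func main() {
	const containerBytes = 2 << 30 // assumed 2 GiB container memory limit
	if limit := memoryLimit(containerBytes, 0.9); limit > 0 {
		// debug.SetMemoryLimit has the same effect as the GOMEMLIMIT
		// environment variable, applied at runtime.
		debug.SetMemoryLimit(limit)
		fmt.Println("soft memory limit set to", limit, "bytes")
	}
}
```

Leaving roughly 10% of the container's memory outside the Go heap limit gives the garbage collector a target to work toward before the kernel's OOM killer would step in.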
Sharding Options
| Flag | Default | Description |
|---|---|---|
| -sharding-enabled | false | Enable consistent sharding |
| -sharding-headless-service | | K8s headless service DNS name with port |
| -sharding-labels | | Comma-separated labels for shard key |
| -sharding-dns-refresh-interval | 30s | DNS refresh interval |
| -sharding-virtual-nodes | 150 | Virtual nodes per endpoint |
| -sharding-fallback-on-empty | false | Fall back to default exporter if no labels match |
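Consistent sharding with virtual nodes is commonly built as a hash ring. The sketch below shows that general technique, not the project's code: the FNV hash, the "endpoint#index" virtual-node naming, and all identifiers are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring maps hash positions to endpoints; each endpoint owns several
// positions (virtual nodes) so that load spreads more evenly.
type ring struct {
	hashes []uint32
	owner  map[uint32]string
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// newRing places virtualNodes positions on the ring for every endpoint.
func newRing(endpoints []string, virtualNodes int) *ring {
	r := &ring{owner: make(map[uint32]string)}
	for _, ep := range endpoints {
		for i := 0; i < virtualNodes; i++ {
			h := hashKey(fmt.Sprintf("%s#%d", ep, i))
			if _, taken := r.owner[h]; taken {
				continue // hash collision: keep the first owner
			}
			r.owner[h] = ep
			r.hashes = append(r.hashes, h)
		}
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// lookup returns the endpoint owning the first ring position at or after
// the hash of shardKey, wrapping around at the end of the ring.
func (r *ring) lookup(shardKey string) string {
	if len(r.hashes) == 0 {
		return ""
	}
	h := hashKey(shardKey)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0
	}
	return r.owner[r.hashes[i]]
}

func main() {
	// Endpoints would come from the headless service DNS lookup.
	r := newRing([]string{"10.0.0.1:4317", "10.0.0.2:4317", "10.0.0.3:4317"}, 150)
	// The shard key is built from the configured sharding labels.
	fmt.Println(r.lookup("service=checkout,env=prod"))
}
```

The point of the ring is stability: when an endpoint appears or disappears on a DNS refresh, only the series that hashed to that endpoint's positions move, while all other series keep their shard.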
Always-Queue & Worker Pool Options
| Flag | Default | Description |
|---|---|---|
| -queue-always-queue | true | Always route data through queue (workers pull from queue) |
| -queue-workers | 0 | Worker count for queue drain (0 = 2×NumCPU) |
| -buffer-full-policy | reject | Buffer full policy: reject (429/ResourceExhausted), drop_oldest, block |
| -buffer-memory-percent | 0.15 | Buffer capacity as a fraction of the detected memory limit (0.0-1.0) |
| -queue-memory-percent | 0.15 | Queue in-memory capacity as a fraction of the detected memory limit (0.0-1.0) |
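The always-queue model, where workers pull batches from the queue, can be sketched with a channel and a fixed set of goroutines. This is a generic Go pattern under assumed names (workerCount, drain, and the export callback are hypothetical), not the project's worker pool.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// workerCount resolves the configured worker count; 0 means 2 x NumCPU,
// matching the documented default for -queue-workers.
func workerCount(configured int) int {
	if configured > 0 {
		return configured
	}
	return 2 * runtime.NumCPU()
}

// drain starts the given number of workers, each pulling batches from the
// queue until it is closed, and returns how many batches were exported.
func drain(queue <-chan []byte, workers int, export func([]byte) error) int {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		exported int
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for batch := range queue {
				if err := export(batch); err == nil {
					mu.Lock()
					exported++
					mu.Unlock()
				}
			}
		}()
	}
	wg.Wait()
	return exported
}

func main() {
	queue := make(chan []byte, 8)
	for i := 0; i < 8; i++ {
		queue <- []byte("batch")
	}
	close(queue)
	n := drain(queue, workerCount(0), func([]byte) error { return nil })
	fmt.Println("exported batches:", n)
}
```

Because exporting is dominated by network waits, running more workers than CPU cores lets other batches make progress while one request is in flight.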
Performance Options
| Flag | Default | Description |
|---|---|---|
| -string-interning | true | Enable string interning for label deduplication |
| -intern-max-value-length | 64 | Max length for label value interning |
Telemetry Options (OTLP Self-Monitoring)
| Flag | Default | Description |
|---|---|---|
| -telemetry-endpoint | | OTLP endpoint for self-monitoring (empty = disabled) |
| -telemetry-protocol | grpc | OTLP protocol: grpc or http |
| -telemetry-insecure | true | Use insecure connection for OTLP telemetry |
When -telemetry-endpoint is set, metrics-governor exports its own logs (as OTLP log records) and Prometheus metrics (bridged to OTLP metric format) to the specified endpoint.
HTTP Client Tuning (Exporter)
| Flag | Default | Description |
|---|---|---|
| -exporter-max-idle-conns | 100 | Maximum idle connections across all hosts |
| -exporter-max-idle-conns-per-host | 100 | Maximum idle connections per host |
| -exporter-max-conns-per-host | 0 | Maximum total connections per host (0 = unlimited) |
sample_rate must be >0 and <=1. The sampling is deterministic (hash-based), so the same series are consistently kept or dropped within a window.
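One way to get deterministic, hash-based sampling is to map each series identity to a fixed bucket and compare it with the rate. The sketch below assumes FNV hashing and a string series key; the real hash, key construction, and any per-window salting are not specified here.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// keep reports whether a series survives sampling at the given rate.
// The same seriesKey always lands in the same bucket, so the decision is
// stable across datapoints instead of being a per-datapoint coin flip.
func keep(seriesKey string, sampleRate float64) bool {
	h := fnv.New64a()
	h.Write([]byte(seriesKey))
	// Map the hash to a bucket in [0, 1) and keep buckets below the rate.
	bucket := float64(h.Sum64()%10000) / 10000.0
	return bucket < sampleRate
}

func main() {
	series := []string{
		`http_request_total{service="api"}`,
		`http_request_total{service="web"}`,
		`http_request_total{service="batch"}`,
	}
	for _, s := range series {
		fmt.Println(s, "kept:", keep(s, 0.5))
	}
}
```

A useful property of this approach is that raising the rate only adds series: anything kept at 0.3 is also kept at 0.6.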
Strip labels action:
rules:
  - name: "strip-high-cardinality-labels"
    match:
      metric_name: "http_request_.*"
    max_cardinality: 5000
    action: strip_labels
    strip_labels: ["request_id", "trace_id", "span_id"]  # Must be non-empty
strip_labels must contain at least one label name. Only the listed attributes are removed; the datapoint itself is preserved.
Tiered Escalation
Tiers allow a single rule to escalate its response as utilization increases. When tiers is set, the highest matching tier's action overrides the rule's base action during violations.
Each tier specifies an at_percent threshold (1-100) representing the percentage of the rule's limit that triggers it. Tiers must be sorted ascending by at_percent.
rules:
  - name: "escalating-response"
    match:
      metric_name: "http_request_.*"
    max_cardinality: 10000
    action: log  # Base action (used below first tier)
    tiers:
      - at_percent: 80  # At 80% utilization: start sampling
        action: sample
        sample_rate: 0.5
      - at_percent: 95  # At 95%: strip labels to reduce cardinality
        action: strip_labels
        strip_labels: ["request_id"]
      - at_percent: 100  # At 100%: drop everything
        action: drop
action (required): Action for this tier (log, sample, strip_labels, drop, adaptive)
sample_rate: Required when tier action is sample
strip_labels: Required when tier action is strip_labels
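Tier selection as described above amounts to picking the highest tier whose at_percent threshold has been reached. A minimal sketch, with hypothetical type and function names and utilization taken as current cardinality relative to the rule's limit:

```go
package main

import "fmt"

// tier mirrors one entry of a rule's tiers list.
type tier struct {
	atPercent int
	action    string
}

// effectiveAction returns the action of the highest tier whose threshold
// has been reached, or the base action when no tier matches.
// tiers must be sorted ascending by atPercent, as the configuration requires.
func effectiveAction(base string, tiers []tier, current, limit int) string {
	action := base
	if limit <= 0 {
		return action
	}
	for _, t := range tiers {
		// Integer form of: current/limit*100 >= atPercent.
		if current*100 >= t.atPercent*limit {
			action = t.action
		}
	}
	return action
}

func main() {
	tiers := []tier{{80, "sample"}, {95, "strip_labels"}, {100, "drop"}}
	for _, current := range []int{5000, 8000, 9500, 10000} {
		fmt.Println(current, effectiveAction("log", tiers, current, 10000))
	}
}
```

With the example rule (limit 10000), this yields log below 8000 series, sample from 8000, strip_labels from 9500, and drop once the limit is reached.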
Per-Label Cardinality Limits
label_limits sets per-label cardinality limits. Each key is a label name, the value is the maximum unique values allowed for that label. When exceeded, label_limit_action controls the response.
rules:
  - name: "per-label-cardinality"
    match:
      metric_name: "http_request_.*"
    max_cardinality: 10000
    action: adaptive
    group_by: ["service"]
    label_limits:
      request_id: 1000  # Max 1000 unique request_id values
      user_id: 500  # Max 500 unique user_id values
    label_limit_action: strip  # "strip" (default) or "drop"
| Field | Default | Description |
|---|---|---|
| label_limits | (none) | Map of label name to max unique values. 0 = always strip/drop that label. |
| label_limit_action | strip | strip removes the offending label; drop drops the entire datapoint. |
Per-label limits are evaluated independently of the rule's max_cardinality. They track cardinality per label name and act when any individual label exceeds its threshold.
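The per-label tracking can be pictured as one set of seen values per limited label. The sketch below models only label_limit_action: strip, and it assumes that values seen before the limit was reached keep passing through; the real behaviour for already-known values and for windowing is not described here. All names are hypothetical.

```go
package main

import "fmt"

// labelTracker counts unique values per limited label.
type labelTracker struct {
	limits map[string]int                 // label name -> max unique values
	seen   map[string]map[string]struct{} // label name -> values seen so far
}

func newLabelTracker(limits map[string]int) *labelTracker {
	return &labelTracker{limits: limits, seen: make(map[string]map[string]struct{})}
}

// apply returns a copy of labels with every over-limit label removed.
// Labels without a configured limit always pass through unchanged.
func (t *labelTracker) apply(labels map[string]string) map[string]string {
	out := make(map[string]string, len(labels))
	for name, value := range labels {
		limit, limited := t.limits[name]
		if !limited {
			out[name] = value
			continue
		}
		values := t.seen[name]
		if values == nil {
			values = make(map[string]struct{})
			t.seen[name] = values
		}
		if _, known := values[value]; known {
			out[name] = value
			continue
		}
		if len(values) < limit {
			values[value] = struct{}{}
			out[name] = value
		}
		// Otherwise the value is new and the label is over its limit: strip it.
	}
	return out
}

func main() {
	tracker := newLabelTracker(map[string]int{"request_id": 2})
	for _, id := range []string{"r1", "r2", "r3"} {
		fmt.Println(tracker.apply(map[string]string{"service": "api", "request_id": id}))
	}
}
```

A limit of 0 never admits a value, which matches the documented "always strip" meaning of 0.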
Adaptive Priority
adaptive_priority configures priority-based dropping for action: adaptive. When set, groups are sorted by priority (highest preserved longest) before falling back to contribution-based ordering.
# Run both OTLP and PRW pipelines simultaneously
metrics-governor \
-grpc-listen :4317 \
-exporter-endpoint otel-collector:4317 \
-prw-listen :9090 \
-prw-exporter-endpoint http://victoriametrics:8428
Buffering and Performance
# Adjust buffering for high throughput
metrics-governor -buffer-size 50000 -flush-interval 10s -batch-size 2000
# Byte-aware batch splitting (default 8MB, set below backend limit)
metrics-governor -max-batch-bytes 8388608
# Enable stats tracking by service, environment and cluster
metrics-governor -stats-labels service,env,cluster
# Performance tuning: configure worker pool
metrics-governor -queue-workers 32
# High-load environment with byte splitting
metrics-governor -queue-workers 64 -buffer-size 100000 -batch-size 5000 -max-batch-bytes 8388608
Limits Enforcement
# Enable limits enforcement (dry-run by default)
metrics-governor -limits-config /etc/metrics-governor/limits.yaml
# Enable limits enforcement with actual enforcement
metrics-governor -limits-config /etc/metrics-governor/limits.yaml -limits-dry-run=false
Performance Tuning
metrics-governor includes performance optimizations for high-throughput environments. These techniques are inspired by concepts described in VictoriaMetrics blog articles on TSDB optimization:
Note: These are original implementations using standard Go patterns (sync.Map, channel-based semaphores), not copied code from VictoriaMetrics. We only adopted the conceptual approaches.
String Interning
When enabled (default), identical label names and values are deduplicated in memory for the PRW pipeline:
Prometheus labels (e.g., __name__, job, instance) are always interned
Label values shorter than -intern-max-value-length (default: 64) are interned
Applied to PRW label parsing and shard key building
Reduces memory allocations by up to 66% for PRW unmarshal operations
Achieves 99%+ cache hit rate for common labels
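The interning rules listed above can be sketched with a sync.Map acting as the string pool. This is an illustration under assumed names (interner, intern, internValue, size), not the project's implementation, and it leaves out any eviction or size bounding a production pool might need.

```go
package main

import (
	"fmt"
	"sync"
)

// interner deduplicates strings so that repeated label names and short
// label values share a single copy in memory.
type interner struct {
	maxValueLen int
	pool        sync.Map // string -> string
}

// intern returns the pooled copy of s, storing s if it is not pooled yet.
func (in *interner) intern(s string) string {
	if cached, ok := in.pool.Load(s); ok {
		return cached.(string)
	}
	actual, _ := in.pool.LoadOrStore(s, s)
	return actual.(string)
}

// internValue interns label values shorter than maxValueLen and returns
// longer values untouched, since they are unlikely to repeat.
func (in *interner) internValue(v string) string {
	if len(v) >= in.maxValueLen {
		return v
	}
	return in.intern(v)
}

// size reports how many distinct strings are pooled.
func (in *interner) size() int {
	n := 0
	in.pool.Range(func(_, _ any) bool {
		n++
		return true
	})
	return n
}

func main() {
	in := &interner{maxValueLen: 64}
	for i := 0; i < 1000; i++ {
		in.intern("__name__")      // label names are always interned
		in.internValue("checkout") // short values are interned
	}
	fmt.Println("pooled strings:", in.size())
}
```

The length cutoff keeps long, mostly unique values such as request IDs out of the pool, where they would grow memory without ever being reused.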
Worker Pool
Pull-based workers drain the queue concurrently, replacing the previous semaphore-based concurrency limiting:
Default: 2 × NumCPU workers (I/O-bound, benefits from exceeding CPU count)