Metrics Observability for Gateway #508
Tharsanan1 started this conversation in General
Overview
We need to support metrics observability in our gateway components. The Router (Envoy proxy) already supports metrics, so no metrics collection code is needed on the router side. The Gateway Controller and Policy Engine, however, need their own metrics implemented and exposed via metrics endpoints.
I propose implementing the following metrics in each component.
Policy Engine Metrics
1. Request Processing Metrics 🎯
| Metric | Labels |
| --- | --- |
| policy_engine_requests_total | phase (request_headers/request_body/response_headers/response_body), route, api_name, api_version |
| policy_engine_request_duration_seconds | phase, route |
| policy_engine_request_errors_total | phase, error_type (context_build_failed/policy_execution_failed/chain_not_found/stream_error), route |
| policy_engine_short_circuits_total | route, policy_name |

2. Policy Execution Metrics 🔄
| Metric | Labels |
| --- | --- |
| policy_engine_policy_executions_total | policy_name, policy_version, api, route, status (executed/skipped/error) |
| policy_engine_policy_duration_seconds | policy_name, policy_version, api, route |
| policy_engine_policy_skipped_total | policy_name, api, route, reason (condition_not_met/disabled) |
| policy_engine_policies_per_chain | route, api |

3. Configuration Management Metrics ⚙️
| Metric | Labels |
| --- | --- |
| policy_engine_policy_chains_loaded | mode (xds/file) |
| policy_engine_xds_updates_total | status (success/failure), type (full/incremental) |
| policy_engine_xds_connection_state | state (connected/disconnected/connecting) |

4. Resource & Performance Metrics 💻
| Metric | Labels |
| --- | --- |
| policy_engine_active_streams | (none) |
| policy_engine_body_bytes_processed | phase, operation (read/write) |
| policy_engine_context_build_duration_seconds | type (request/response) |

5. System Health Metrics 🏥
| Metric | Labels |
| --- | --- |
| policy_engine_up | (none) |
| policy_engine_grpc_connections_active | type (envoy/xds) |
| policy_engine_goroutines | (none) |
| policy_engine_memory_bytes | type (heap_alloc/heap_sys/stack) |

6. Error & Reliability Metrics ⚠️
| Metric | Labels |
| --- | --- |
| policy_engine_policy_errors_total | policy_name, error_type |
| policy_engine_stream_errors_total | error_type (send/receive/init) |
| policy_engine_route_lookup_failures_total | (none) |
| policy_engine_panic_recoveries_total | component |

Gateway Controller Metrics
1. API Configuration Management Metrics 📋
| Metric | Labels |
| --- | --- |
| gateway_controller_api_operations_total | operation (create/update/delete/get/list), status (success/failure), api_type (rest/websub) |
| gateway_controller_api_operation_duration_seconds | operation, api_type |
| gateway_controller_apis_total | api_type (rest/websub), status (pending/deployed/failed) |
| gateway_controller_validation_errors_total | operation, error_type (schema/policy/duplicate/invalid_upstream) |
| gateway_controller_deployment_latency_seconds | (none) |

2. xDS Snapshot & Translation Metrics 🔄
| Metric | Labels |
| --- | --- |
| gateway_controller_snapshot_generation_duration_seconds | type (router/policy) |
| gateway_controller_snapshot_generation_total | type (router/policy), status (success/failure), trigger (api_create/api_update/api_delete/policy_update/cert_update) |
| gateway_controller_snapshot_size | resource_type (listeners/clusters/routes/endpoints/secrets) |
| gateway_controller_translation_errors_total | error_type (invalid_vhost/missing_upstream/cert_not_found) |
| gateway_controller_routes_per_api | (none) |

3. xDS Server & Client Metrics 🌐
| Metric | Labels |
| --- | --- |
| gateway_controller_xds_clients_connected | server (router/policy), node_id |
| gateway_controller_xds_stream_requests_total | server, type_url (listener/cluster/route/endpoint/secret/policy), operation (subscribe/unsubscribe) |
| gateway_controller_xds_snapshot_ack_total | server, node_id, status (ack/nack) |
| gateway_controller_xds_stream_duration_seconds | server, node_id |

4. Storage & Database Metrics 💾
| Metric | Labels |
| --- | --- |
| gateway_controller_database_operations_total | operation (insert/update/delete/select), table (api_configurations/llm_provider_templates/certificates), status (success/failure) |
| gateway_controller_database_operation_duration_seconds | operation, table |
| gateway_controller_database_size_bytes | database (sqlite/config_store) |
| gateway_controller_config_store_size | type (apis/policies/templates/certificates) |
| gateway_controller_storage_errors_total | operation, error_type (locked/not_found/constraint_violation); constraint_violation = duplicate configs |

5. Certificate Management Metrics 🔐
| Metric | Labels |
| --- | --- |
| gateway_controller_certificates_total | type (custom/system) |
| gateway_controller_certificate_operations_total | operation (upload/delete/load), status (success/failure) |
| gateway_controller_certificate_expiry_seconds | cert_id, cert_name |
| gateway_controller_sds_updates_total | status (success/failure) |

6. Policy Management Metrics 📜
| Metric | Labels |
| --- | --- |
| gateway_controller_policies_total | api_id, route |
| gateway_controller_policy_chain_length | api_id, route |
| gateway_controller_policy_snapshot_updates_total | status (success/failure) |
| gateway_controller_policy_validation_errors_total | error_type (unknown_policy/invalid_params/condition_error) |

7. Control Plane Integration Metrics 🔌
| Metric | Labels |
| --- | --- |
| gateway_controller_control_plane_connection_state | state (disconnected/connecting/connected/reconnecting) |
| gateway_controller_control_plane_reconnections_total | (none) |
| gateway_controller_control_plane_events_sent_total | event_type (api_deployed/api_deleted), status (success/failure) |
| gateway_controller_control_plane_message_latency_seconds | (none) |

8. HTTP API Server Metrics 🌍
| Metric | Labels |
| --- | --- |
| gateway_controller_http_requests_total | method, endpoint, status_code |
| gateway_controller_http_request_duration_seconds | method, endpoint |
| gateway_controller_http_request_size_bytes | endpoint |
| gateway_controller_http_response_size_bytes | endpoint |
| gateway_controller_concurrent_requests | (none) |

9. LLM & MCP Management Metrics 🤖
| Metric | Labels |
| --- | --- |
| gateway_controller_llm_providers_total | status (active/inactive) |
| gateway_controller_llm_provider_templates_total | (none) |
| gateway_controller_mcp_proxies_total | type, status |
| gateway_controller_llm_operations_total | operation (create/update/delete), status (success/failure) |

10. System Health & Resource Metrics 💻
| Metric | Labels |
| --- | --- |
| gateway_controller_up | (none) |
| gateway_controller_info | version, storage_type, build_date |
| gateway_controller_goroutines | (none) |
| gateway_controller_memory_bytes | type (heap_alloc/heap_sys/stack_inuse) |
| gateway_controller_snapshot_cache_size_bytes | (none) |
| gateway_controller_config_reload_timestamp | (none) |

11. Validation & Error Metrics ⚠️
| Metric | Labels |
| --- | --- |
| gateway_controller_validation_duration_seconds | validator_type (api/policy/llm/mcp) |
| gateway_controller_errors_total | component, error_type |
| gateway_controller_panic_recoveries_total | component |

Implementation Strategy
Technology Stack
I propose using Prometheus client libraries in the code to implement these metric collectors and expose them via standard Prometheus metrics endpoints.
Deployment Options
When users install the gateway with metrics enabled, they can:
- Use the default metrics profile: Enable the metrics profile in the Docker Compose command to get the default setup, with Prometheus for scraping and Grafana for visualization.
- Use a custom observability stack: Skip the default setup and scrape and visualize the metrics with their own preferred tooling.
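Both options assume the components expose a scrapeable endpoint whenever metrics are enabled, regardless of which scraper is used. A minimal sketch of how a component might read that toggle at startup is shown below. The environment variable names `GATEWAY_METRICS_ENABLED` and `GATEWAY_METRICS_ADDR` and the default address are hypothetical; the actual configuration mechanism is not decided in this proposal.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// metricsConfig holds hypothetical settings for the metrics endpoint.
type metricsConfig struct {
	Enabled bool
	Addr    string
}

// loadMetricsConfig reads the settings through getenv, so callers can pass
// os.Getenv in production and a stub in tests.
func loadMetricsConfig(getenv func(string) string) (metricsConfig, error) {
	// Defaults are assumptions: metrics off, listening on :9100 when enabled.
	cfg := metricsConfig{Enabled: false, Addr: ":9100"}
	if v := getenv("GATEWAY_METRICS_ENABLED"); v != "" {
		enabled, err := strconv.ParseBool(v)
		if err != nil {
			return cfg, fmt.Errorf("GATEWAY_METRICS_ENABLED: %w", err)
		}
		cfg.Enabled = enabled
	}
	if v := getenv("GATEWAY_METRICS_ADDR"); v != "" {
		cfg.Addr = v
	}
	return cfg, nil
}

func main() {
	cfg, err := loadMetricsConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid metrics config:", err)
		os.Exit(1)
	}
	if !cfg.Enabled {
		fmt.Println("metrics endpoint disabled")
		return
	}
	// A real component would start its metrics HTTP server here.
	fmt.Println("metrics endpoint would listen on", cfg.Addr)
}
```

With a toggle like this, the default profile and a custom stack differ only in what scrapes the endpoint, not in how the components are built.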
Please refer to discussion #506, where we explore how profiles can be used to support different observability strategies.
This approach provides flexibility while also offering an out-of-the-box option for users who want to get started quickly with metrics observability.