This document walks through some of the choices available when setting up your gateway.
If your gateway service is of type ClusterIP, you can optionally create an Ingress to expose the gateway, for example with Traefik:
gateway:
  service:
    type: ClusterIP
ingress:
  enabled: true
  ingressClassName: traefik
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: web

In our experience, there are a few gotchas when benchmarking for competitive numbers. The rest of this document captures the gateway-related challenges we encountered and the workarounds we derived.
To reduce the chance that the Envoy pod becomes a bottleneck, it can help to increase its resources. The llm-d-infra chart exposes this setting through an AgentgatewayParameters manifest if you are using agentgateway (or the deprecated kgateway compatibility mode), or through a ConfigMap if you are using Istio.
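For Istio, a ConfigMap like this is typically consumed through the Gateway API `spec.infrastructure.parametersRef` field, which Istio supports for customizing the Deployment it generates for a Gateway. The sketch below shows that raw Istio shape, assuming a recent Istio release with this feature; it does not reproduce the llm-d-infra chart's own values keys, and the Gateway name, ConfigMap name, listener, and resource figures are all hypothetical placeholders. The agentgateway path is not covered here.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: llm-d-gateway              # hypothetical name
spec:
  gatewayClassName: istio
  infrastructure:
    parametersRef:
      group: ""                    # core API group, since the target is a ConfigMap
      kind: ConfigMap
      name: llm-d-gateway-options  # hypothetical ConfigMap name
  listeners:
    - name: http
      port: 80
      protocol: HTTP
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-d-gateway-options
data:
  # Istio overlays this snippet onto the Deployment it generates for the Gateway.
  deployment: |
    spec:
      template:
        spec:
          containers:
            - name: istio-proxy    # the Envoy container in Istio-managed gateways
              resources:
                requests:          # illustrative values only; tune for your benchmark
                  cpu: "4"
                  memory: 4Gi
                limits:
                  cpu: "8"
                  memory: 8Gi
```

Raising both requests and limits keeps the proxy from being throttled under load; the exact numbers depend on your request rate and should be validated against your own benchmark.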
Currently, we only have a workaround for increasing the maximum number of connections and the timeouts with Istio; we would like to extend it to other providers in the future. The workaround uses an Istio DestinationRule, and you can see an example values configuration for it in our Istio values file.
The llm-d-infra chart exposes this through its DestinationRule template; we hope this manifest will eventually make its way upstream into the GAIE charts.
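As a rough illustration of what such a DestinationRule controls, here is a standalone manifest using Istio's `trafficPolicy.connectionPool` settings. This is a sketch, not the manifest the llm-d-infra chart renders: it assumes an Istio release that serves the `networking.istio.io/v1` API (older releases use `v1beta1`), and the resource name, target host, and all numeric values are hypothetical placeholders.

```yaml
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: llm-d-connection-limits    # hypothetical name
spec:
  # Hypothetical host: the service the gateway forwards traffic to.
  host: my-backend.my-namespace.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100000         # raise the per-host TCP connection cap
        connectTimeout: 30s            # allow slower connection establishment
      http:
        http1MaxPendingRequests: 100000  # queue depth before requests are rejected
        http2MaxRequests: 100000         # max concurrent requests to the host
        maxRequestsPerConnection: 0      # 0 means no per-connection limit
        idleTimeout: 900s                # keep idle upstream connections open longer
```

Istio's defaults for these fields are conservative enough that they can cap throughput during a high-concurrency benchmark, which is why raising them is the main lever here.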
To reduce the workload on the Envoy gateway container, we lowered the log level to error so that only critical issues are logged. You can change this through the --v flag on the GAIE (inferencepool) chart; see the precise-kv-cache-aware guide for an example.