Default refresh metrics interval could be too low #189
Description
Referenced code: gateway-api-inference-extension/pkg/ext-proc/main.go, lines 58 to 62 at commit 1b1d139.
The gateway-api-inference-extension collects metrics from inference engines for load-balancing decisions every 50ms by default. Assuming 20 inference gateways, each scraping every 50ms, every inference engine has to serve 20 × 20 = 400 metrics requests per second. Taking the currently supported vLLM as an example, Python code has to be executed for each scrape, whether the engine is fronted by Triton or exposes the metrics interface directly, and a single Python thread can come under considerable pressure handling 400 requests per second.
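As a rough sanity check of that arithmetic, here is a minimal sketch; the gateway count and the 50ms interval are the assumptions stated above, not values read from the repository:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumptions from the scenario above, not values taken from main.go.
	const numGateways = 20
	refreshInterval := 50 * time.Millisecond

	// Each gateway scrapes every engine once per refresh interval, so the
	// per-engine scrape rate is numGateways * (1s / interval).
	scrapesPerGatewayPerSecond := float64(time.Second) / float64(refreshInterval)
	perEngineQPS := numGateways * scrapesPerGatewayPerSecond

	fmt.Printf("each engine serves %.0f metrics requests per second\n", perEngineQPS)
	// Output: each engine serves 400 metrics requests per second
}
```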
It is also not just a matter of being able to process the requests: each scrape has to complete within 50ms in the vast majority of cases. If 50ms is only the P90 scrape latency, then 10% of load-balancing decisions will be made on metrics that are not what was expected, so scrapes really need to complete within 50ms at P99 or better.
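To put that in concrete numbers, purely as an illustration built from the assumed figures above rather than measurements:

```go
package main

import "fmt"

func main() {
	// Illustration: 400 metrics requests per second per engine (from the
	// calculation above), with an assumed scrape latency where 50ms is
	// only the P90, i.e. 10% of scrapes miss the refresh budget.
	const scrapesPerSecond = 400.0
	const lateFraction = 0.10 // assumed: 50ms is P90, not P99

	// Every late scrape leaves a gateway routing on metrics that are at
	// least one full refresh interval stale.
	fmt.Printf("%.0f scrapes per second per engine arrive after the 50ms budget\n",
		scrapesPerSecond*lateFraction)
}
```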
Fortunately, inference requests essentially cannot complete within 50ms (even with PD, prefill/decode, disaggregation, 50ms is not enough to finish the prefill phase), so this default can be raised a bit. After all, the number of queued requests will not change much over such a window (KV-cache usage metrics are another story).
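For context, the referenced lines in pkg/ext-proc/main.go declare this interval as a command-line flag; the sketch below shows roughly what that looks like, but the flag name and default here are assumptions rather than a verbatim copy of the code at 1b1d139:

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// Sketch of the flag declaration around the referenced lines; the flag
// name and default are assumed, check main.go at commit 1b1d139.
var refreshMetricsInterval = flag.Duration(
	"refreshMetricsInterval", 50*time.Millisecond,
	"interval at which metrics are scraped from each model server")

func main() {
	flag.Parse()
	fmt.Println("refreshing metrics every", *refreshMetricsInterval)
}
```

Assuming that flag name, an operator can already raise the interval at deployment time (e.g. `-refreshMetricsInterval=200ms`, which in the 20-gateway example above would cut the per-engine scrape load from 400 to 100 requests per second); the suggestion here is mainly about shipping a less aggressive default.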