Default refresh metrics interval could be too low #189
Description
Referenced code: gateway-api-inference-extension/pkg/ext-proc/main.go, lines 58 to 62 at commit 1b1d139.
The gateway-api-inference-extension collects metrics from inference engines for load-balancing decisions every 50ms by default. Assuming 20 inference gateways, each scraping every 50ms, every inference engine has to serve 20 × 20 = 400 metrics requests per second. Taking the currently supported vLLM as an example, Python code has to be executed for each scrape, whether the engine is fronted by Triton or exposes the metrics interface directly, and a single Python thread can come under considerable pressure handling 400 requests per second.
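As a rough sanity check of that arithmetic, here is a minimal sketch; the gateway count and the 50ms interval are the assumptions stated above, not values read from the repository:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumptions from the scenario above, not values taken from main.go.
	const numGateways = 20
	refreshInterval := 50 * time.Millisecond

	// Each gateway scrapes every engine once per refresh interval, so the
	// per-engine scrape rate is numGateways * (1s / interval).
	scrapesPerGatewayPerSecond := float64(time.Second) / float64(refreshInterval)
	perEngineQPS := numGateways * scrapesPerGatewayPerSecond

	fmt.Printf("each engine serves %.0f metrics requests per second\n", perEngineQPS)
	// Output: each engine serves 400 metrics requests per second
}
```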
It is also not just a matter of being able to process the requests: each scrape has to complete within 50ms in the vast majority of cases. If 50ms is only the P90 scrape latency, then 10% of load-balancing decisions will be made on metrics that are not what was expected, so scrapes really need to complete within 50ms at P99 or better.
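To put that in concrete numbers, purely as an illustration built from the assumed figures above rather than measurements:

```go
package main

import "fmt"

func main() {
	// Illustration: 400 metrics requests per second per engine (from the
	// calculation above), with an assumed scrape latency where 50ms is
	// only the P90, i.e. 10% of scrapes miss the refresh budget.
	const scrapesPerSecond = 400.0
	const lateFraction = 0.10 // assumed: 50ms is P90, not P99

	// Every late scrape leaves a gateway routing on metrics that are at
	// least one full refresh interval stale.
	fmt.Printf("%.0f scrapes per second per engine arrive after the 50ms budget\n",
		scrapesPerSecond*lateFraction)
}
```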
Fortunately, inference requests essentially cannot complete within 50ms (even with PD, prefill/decode, disaggregation, 50ms is not enough to finish the prefill phase), so this default can be raised a bit. After all, the number of queued requests will not change much over such a window (KV-cache usage metrics are another story).
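For context, the referenced lines in pkg/ext-proc/main.go declare this interval as a command-line flag; the sketch below shows roughly what that looks like, but the flag name and default here are assumptions rather than a verbatim copy of the code at 1b1d139:

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// Sketch of the flag declaration around the referenced lines; the flag
// name and default are assumed, check main.go at commit 1b1d139.
var refreshMetricsInterval = flag.Duration(
	"refreshMetricsInterval", 50*time.Millisecond,
	"interval at which metrics are scraped from each model server")

func main() {
	flag.Parse()
	fmt.Println("refreshing metrics every", *refreshMetricsInterval)
}
```

Assuming that flag name, an operator can already raise the interval at deployment time (e.g. `-refreshMetricsInterval=200ms`, which in the 20-gateway example above would cut the per-engine scrape load from 400 to 100 requests per second); the suggestion here is mainly about shipping a less aggressive default.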