Model Redirect and Traffic Splitting within an Inference Pool #1695

zetxqx · 2025-10-09T16:49:58Z

zetxqx
Oct 9, 2025

This idea addresses a feature missing from the current inference pool: the ability to redirect requests to a different model and to split traffic between models. At its core, it enables request body mutation, so that the model name in a request can be changed before the model server processes it.

Currently, while we can split traffic between multiple inference pools (example), we cannot mutate the request body. This is a limitation when the model servers are configured differently and expect specific model names. This proposal aims to solve this by allowing for in-flight modification of the request.

Use cases

Here are several potential scenarios where this functionality would be beneficial:

* Seamless Model Upgrades: An ML engineer or model server owner wants to roll out a new model version (e.g., modelA-v2) to replace an existing one (modelA-v1). With model redirection, they could gradually shift traffic to the new version without any changes to the client application, which would still request modelA. This is particularly crucial for LoRA adapter updates, which must happen within the same inference pool.
* A/B Testing: An ML engineer wants to compare the performance of two different models. They could send a small percentage of user traffic to a new model and compare its performance against the current one before a full rollout.
* Large-Scale LoRA Management: For scenarios with hundreds of LoRA adapters, a robust model mapping system is needed to manage the different adapters effectively.
* Semantic Routing: To optimize for cost and performance, requests could be routed to different models within an InferencePool based on the prompt's content.

High-Level Idea

There is an existing proposal from @ahg-g that leverages httpHeaderModifier and BBR. However, it is constrained by Envoy filter ordering and by the limit of 16 HTTPRouteRule entries per route.

The core idea is to introduce a mechanism that allows users to define model redirect and traffic splitting rules. These rules would then be used to mutate the model field within the request body before it reaches the model server.

For example, a rule could specify that for any request asking for "production-chatbot", 10% of the traffic is rewritten to be served by "stable-chatbot-v6" and 90% by "stable-chatbot-v5".
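
To make the example concrete, here is a minimal Python sketch of a weighted rewrite of the model field. It is only an illustration, not a proposed implementation or API. The names REDIRECT_RULES, pick_target, and rewrite_body are hypothetical, and it assumes an OpenAI-style JSON body with a top-level "model" field.

```python
import json
import random

# Hypothetical rule table: client-facing model name -> weighted targets.
# The weights mirror the example above (10% v6, 90% v5).
REDIRECT_RULES = {
    "production-chatbot": [
        ("stable-chatbot-v6", 10),
        ("stable-chatbot-v5", 90),
    ],
}


def pick_target(model, rules, rng):
    """Pick a target model by weight; return the name unchanged if no rule matches."""
    targets = rules.get(model)
    if not targets:
        return model
    names = [name for name, _ in targets]
    weights = [weight for _, weight in targets]
    return rng.choices(names, weights=weights, k=1)[0]


def rewrite_body(raw_body, rules, rng=random):
    """Rewrite the top-level "model" field of a JSON request body (bytes in, bytes out)."""
    body = json.loads(raw_body)
    # Assumption: the model name is a string under the top-level "model" key.
    if isinstance(body, dict) and isinstance(body.get("model"), str):
        body["model"] = pick_target(body["model"], rules, rng)
    return json.dumps(body).encode("utf-8")
```

Calling rewrite_body(b'{"model": "production-chatbot", "prompt": "hi"}', REDIRECT_RULES) returns the same body with the model set to one of the two stable-chatbot versions, while requests for models without a rule pass through unchanged. Passing a seeded random.Random as rng makes the split reproducible in tests.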

We are considering different approaches to implement this, but we first want to gather feedback on the use cases and the general direction.

What are your thoughts on these use cases? Are there other scenarios we should be considering for in-pool model redirection?

zetxqx · 2025-10-23T00:19:20Z

zetxqx
Oct 23, 2025
Author

Sharing the doc here: https://docs.google.com/document/d/12yR_nAWM-Tg2ZmgGYX1h-dlUNi0AqYoACUjNElipl0M/edit?usp=sharing
