Sharing the doc here: https://docs.google.com/document/d/12yR_nAWM-Tg2ZmgGYX1h-dlUNi0AqYoACUjNElipl0M/edit?usp=sharing
This idea addresses a feature missing from the current inference pool: the ability to redirect models and split traffic for requests. At its core, it enables request body mutation, so that the model name in a request can be changed before the model server processes it.
Currently, while we can split traffic between multiple inference pools (example), we cannot mutate the request body. This is a limitation when the model servers are configured differently and expect specific model names. This proposal aims to solve this by allowing for in-flight modification of the request.
Use cases
Here are several potential scenarios where this functionality would be beneficial:
* Semantic Routing: To optimize for cost and performance, requests could be routed to different models within an InferencePool based on the prompt's content.
High-Level Idea
There is a proposal from @ahg-g that leverages httpHeaderModifier and BBR; however, it is limited by Envoy filter ordering and by the maximum of 16 HTTPRouteRule entries per route.
The core idea is to introduce a mechanism that allows users to define model redirect and traffic splitting rules. These rules would then be used to mutate the model field within the request body before it reaches the model server.
For example, a rule could specify that for any request asking for "production-chatbot", 10% of the traffic is rewritten to be served by "stable-chatbot-v6" and 90% by "stable-chatbot-v5".
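To make the example concrete, here is a minimal Python sketch of a weighted rewrite of the model field. It is only an illustration of the idea, not a proposed implementation: the rule table layout and the names REDIRECT_RULES, pick_target, and rewrite_body are hypothetical, and it assumes an OpenAI-style JSON body with a top-level "model" string.

```python
import json
import random

# Hypothetical rule table: requested model name -> list of (target model, weight).
# The names and weights mirror the example above; the structure is an assumption.
REDIRECT_RULES = {
    "production-chatbot": [
        ("stable-chatbot-v6", 10),
        ("stable-chatbot-v5", 90),
    ],
}


def pick_target(model, rules=REDIRECT_RULES, rng=random):
    """Return the model to serve, chosen by weight; unchanged if no rule matches."""
    targets = rules.get(model)
    if not targets:
        return model
    names = [name for name, _ in targets]
    weights = [weight for _, weight in targets]
    return rng.choices(names, weights=weights, k=1)[0]


def rewrite_body(raw_body, rules=REDIRECT_RULES, rng=random):
    """Rewrite the top-level "model" field of a JSON request body (bytes in, bytes out)."""
    body = json.loads(raw_body)
    if isinstance(body, dict) and isinstance(body.get("model"), str):
        body["model"] = pick_target(body["model"], rules, rng)
    return json.dumps(body).encode("utf-8")
```

Calling rewrite_body(b'{"model": "production-chatbot", "prompt": "hi"}') returns the same body with the model set to one of the two stable versions, while a request for a model with no matching rule passes through with its model name unchanged.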
We are considering different approaches to implement this, but we first want to gather feedback on the use cases and the general direction.
What are your thoughts on these use cases? Are there other scenarios we should be considering for in-pool model redirection?