Add Endpoint Picker Protocol Proposal #164

liu-cong · 2025-01-07T04:43:12Z

This is adapted from the initial doc

I didn't include the ORCA load reporting section as it's not currently required by the inference extension, though parallel efforts are happening. The intention is to keep the scope of this protocol small and expand in the future if needed.

I see this as a very early effort to define the contract. It is an evolving process in which we monitor industry trends and drive more unification.

ahg-g · 2025-01-07T20:08:51Z

@smarterclayton

docs/proposals/003-model-server-protocol/protocol.md

kfswain · 2025-01-09T20:35:07Z

This is awesome!
I'm gonna put a hold on here so we can discuss this in our Contributors meeting before it merges. Does next week sound good to have this on the agenda? Or is a little more time preferable?

/hold

netlify · 2025-01-23T21:25:41Z

✅ Deploy Preview for gateway-api-inference-extension ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | `00c6e61` |
| 🔍 Latest deploy log | https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67996e9c4c15ae00088a21e6 |
| 😎 Deploy Preview | https://deploy-preview-164--gateway-api-inference-extension.netlify.app |


docs/proposals/003-model-server-protocol/protocol.md

danehans · 2025-01-11T00:24:33Z

docs/proposals/003-model-server-protocol/protocol.md

+| TimePerPrefillToken | Histogram | The prefill latency per token in the last W seconds. W will be decided by simulation/benchmarking.  In time series metric the latency is typically reported as Histogram and we can derive the average from the Histogram. | `vllm:time_to_first_token_seconds` | 
+| TimePerDecodeToken | Histogram | The decode latency per token in the last W seconds. W will be decided by simulation/benchmarking. | `vllm:time_per_output_token_seconds` | 
+
+## LoRA Adapter Serving


We should consider making the model server protocol pluggable with LoRA being a reference plugin implementation.
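The quoted metrics rows say the per-token latency over the last W seconds can be derived from a histogram. A minimal sketch of that derivation, assuming the model server exposes Prometheus text-format metrics and that two scrapes taken W seconds apart are available (the sample payloads and function names are made up for illustration):

```python
import re
from typing import Optional, Tuple

# Matches "metric_name{optional labels} value" sample lines.
SAMPLE_LINE = re.compile(r"^([^\s{]+)(?:\{[^}]*\})?\s+(\S+)")


def sum_and_count(metrics_text: str, name: str) -> Tuple[float, float]:
    """Add up the _sum and _count series of one histogram across all label sets."""
    total, count = 0.0, 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if not match:
            continue
        metric, value = match.group(1), float(match.group(2))
        if metric == name + "_sum":
            total += value
        elif metric == name + "_count":
            count += value
    return total, count


def windowed_average(prev_scrape: str, curr_scrape: str, name: str) -> Optional[float]:
    """Average observation between two scrapes, or None if nothing was observed."""
    prev_sum, prev_count = sum_and_count(prev_scrape, name)
    curr_sum, curr_count = sum_and_count(curr_scrape, name)
    delta_count = curr_count - prev_count
    if delta_count <= 0:
        return None
    return (curr_sum - prev_sum) / delta_count


NAME = "vllm:time_per_output_token_seconds"
SCRAPE_T0 = """
# TYPE vllm:time_per_output_token_seconds histogram
vllm:time_per_output_token_seconds_sum{model_name="m"} 12.0
vllm:time_per_output_token_seconds_count{model_name="m"} 400
"""
SCRAPE_T1 = """
# TYPE vllm:time_per_output_token_seconds histogram
vllm:time_per_output_token_seconds_sum{model_name="m"} 15.0
vllm:time_per_output_token_seconds_count{model_name="m"} 500
"""
```

Here `windowed_average(SCRAPE_T0, SCRAPE_T1, NAME)` divides the increase in `_sum` by the increase in `_count`, which gives the mean latency of the observations recorded between the two scrapes rather than the lifetime mean.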

docs/proposals/003-model-server-protocol/protocol.md

kfswain

Some nits and thoughts, but overall LGTM! This is great!

docs/proposals/003-model-server-protocol/protocol.md

kfswain · 2025-01-24T17:40:54Z

/lgtm

docs/proposals/003-model-server-protocol/protocol.md

ahg-g · 2025-01-27T16:23:01Z

I suggest the following:

* Call this "Endpoint Picker Protocol" and give it version 0.1.0 (or some other versioning format).
* Split it into two sections: Model Server Protocol and Proxy Protocol.
* The Model Server Protocol includes what this PR currently has.
* The Proxy Protocol specifies that the protocol between the proxy and the endpoint picker is External Processing.

ahg-g · 2025-01-28T00:55:29Z

Also, part of this protocol is how the picked endpoint is communicated back to the proxy, that is, the request header.
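To make the idea of communicating the picked endpoint through a request header concrete, here is a rough sketch in plain Python. It does not use any real External Processing API. The header name `x-picked-endpoint`, the `Endpoint` type, and the least-queue picking rule are all hypothetical placeholders, not something this PR specifies.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical header name, used only for illustration.
PICKED_ENDPOINT_HEADER = "x-picked-endpoint"


@dataclass(frozen=True)
class Endpoint:
    address: str      # "ip:port" of a model server pod
    queue_depth: int  # number of requests waiting on that pod


def pick_endpoint(endpoints: List[Endpoint]) -> Endpoint:
    """Pick the endpoint with the shortest queue (a stand-in for real scheduling logic)."""
    if not endpoints:
        raise ValueError("no endpoints available")
    return min(endpoints, key=lambda e: e.queue_depth)


def header_mutation(endpoint: Endpoint) -> Dict[str, str]:
    """Build the request header the picker would ask the proxy to set."""
    return {PICKED_ENDPOINT_HEADER: endpoint.address}


candidates = [
    Endpoint("10.0.0.1:8000", queue_depth=7),
    Endpoint("10.0.0.2:8000", queue_depth=2),
]
mutation = header_mutation(pick_endpoint(candidates))
```

The proxy would then route the request to whatever address the header carries, so the picker never needs to forward traffic itself.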

ahg-g · 2025-01-28T02:39:34Z

/retitle "Add Endpoint Picker Protocol Proposal"

ahg-g · 2025-01-28T02:42:46Z

docs/proposals/003-model-server-protocol/protocol.md

+request, provided the requested adapter is valid.
+
+The model server MUST expose the following LoRA adapter information via a RESTful API with response
+in JSON :


Does the current EPP implementation support this?

No, it does not. It uses the current vLLM metrics implementation.

The protocol is part of the release, so I think we should remove this until the next release, when we actually implement it, and only spec the metrics-based approach for now.

I can document the current metrics approach for vLLM, but I don't want to make it a "protocol", because it's really awkward. It's not something we should recommend that the next model server implement.

The purpose of the protocol is to set up a contract for any new model server integration to follow. So I think we should document this here, with a note that the current vLLM workaround should converge to it too.

But that is why we have versioning. It doesn't make sense to document a contract that the EPP doesn't implement. Once the EPP implements it, we update the protocol and create a new release. @robscott @smarterclayton for opinions.

IMO this is just a short-term workaround and we don't need a protocol to document it; the code speaks for itself. I think the protocol should be something reasonably stable based on our best knowledge, and something we are willing to adopt when integrating with another model server.

One option is to just not document this at all, until we have some implementation to support it.

Not documenting anything is an option, but I don't think we should document a protocol and make it part of a release that doesn't implement it. Remember that we are versioning the protocol with the EPP image.

I would still lean towards documenting what we implemented and released. We can change that in the next release; that is what versioning allows us to do. The initial iterations are expected to include more frequent changes, and the protocol will stabilize after that.

I would probably align with Abdullah here: the goal is to spec what is currently required, so that you don't read about things that don't exist. There will be people using 0.1 for quite a while, so be accurate to 0.1. Anything that is removed can simply be moved to a separate PR, a "draft for 0.2 proposed changes", so it is not lost.

Thanks for the feedback, I updated the doc.
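For readers wondering what the "current vLLM metrics approach" for LoRA looks like in practice, the sketch below reads adapter state from metric labels instead of a dedicated REST API. It assumes a gauge named `vllm:lora_requests_info` with `max_lora` and `running_lora_adapters` labels; that name and label set are an assumption not stated in this thread, and the sample payload is invented, so check the merged protocol doc before relying on either.

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumed metric name; not confirmed anywhere in this thread.
LORA_METRIC = "vllm:lora_requests_info"


def parse_labels(sample_line: str) -> Dict[str, str]:
    """Extract key="value" label pairs from one Prometheus sample line."""
    return dict(re.findall(r'(\w+)="([^"]*)"', sample_line))


def lora_state(metrics_text: str) -> Optional[Tuple[int, List[str]]]:
    """Return (max adapters, running adapter names) from the first matching series."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line.startswith(LORA_METRIC + "{"):
            continue
        labels = parse_labels(line)
        running = [
            name.strip()
            for name in labels.get("running_lora_adapters", "").split(",")
            if name.strip()
        ]
        return int(labels.get("max_lora", "0")), running
    return None  # the server exposed no LoRA series


# Invented sample payload for illustration only.
SAMPLE = (
    "# TYPE vllm:lora_requests_info gauge\n"
    'vllm:lora_requests_info{max_lora="4",'
    'running_lora_adapters="adapter-a,adapter-b"} 1.0\n'
)
state = lora_state(SAMPLE)
```

Packing a list into a comma-separated label value is the kind of awkwardness the thread refers to, which is why it is documented as the current behavior rather than recommended to new model servers.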

ahg-g · 2025-01-29T00:33:31Z

/lgtm
/approve

Thanks @liu-cong !

k8s-ci-robot · 2025-01-29T00:33:38Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, liu-cong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [ahg-g]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ahg-g · 2025-01-29T00:36:39Z

/hold cancel

* Add model server protocol proposal
* Remove future work and focus on current release
* address comments
* document current lora metrics

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 7, 2025

k8s-ci-robot requested review from kfswain and robscott January 7, 2025 04:43

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 7, 2025

liu-cong mentioned this pull request Jan 7, 2025

Add model server configurations to InferencePool #163

Closed

liu-cong marked this pull request as draft January 7, 2025 19:01

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 7, 2025

ahg-g reviewed Jan 7, 2025


Add model server protocol proposal

a5e340c

liu-cong force-pushed the protocol branch from 1bb383c to a5e340c Compare January 7, 2025 22:58

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 7, 2025

liu-cong marked this pull request as ready for review January 7, 2025 22:59

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 7, 2025

k8s-ci-robot requested a review from ahg-g January 7, 2025 22:59

liu-cong mentioned this pull request Jan 8, 2025

Support for model servers other than vllm #95

Closed

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 9, 2025

Remove future work and focus on current release

cbc2639

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 23, 2025

danehans reviewed Jan 24, 2025


kfswain reviewed Jan 24, 2025


k8s-ci-robot assigned kfswain Jan 24, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 24, 2025

ahg-g reviewed Jan 27, 2025


address comments

c470005

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 28, 2025

k8s-ci-robot changed the title ~~Add model server protocol proposal~~ "Add Endpoint Picker Protocol Proposal" Jan 28, 2025

ahg-g changed the title ~~"Add Endpoint Picker Protocol Proposal"~~ Add Endpoint Picker Protocol Proposal Jan 28, 2025

ahg-g reviewed Jan 28, 2025


AndresGuedez mentioned this pull request Jan 28, 2025

InferencePool config proposal for API review #162

Merged

document current lora metrics

00c6e61

liu-cong force-pushed the protocol branch from 978ec6c to 00c6e61 Compare January 28, 2025 23:56

k8s-ci-robot assigned ahg-g Jan 29, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 29, 2025

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 29, 2025

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 29, 2025

k8s-ci-robot merged commit ee46fd9 into kubernetes-sigs:main Jan 29, 2025
6 of 7 checks passed
