
Add Endpoint Picker Protocol Proposal #164

Merged · 4 commits into kubernetes-sigs:main · Jan 29, 2025

Conversation


@liu-cong liu-cong commented Jan 7, 2025

This is adapted from the initial doc.

I didn't include the ORCA load reporting section, as it's not currently required by the inference extension, though parallel efforts are happening. The intention is to keep the scope of this protocol small and expand it in the future if needed.

I see this as an initial effort to define the contract; it will be an evolving process of monitoring industry trends and driving more unification.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 7, 2025
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 7, 2025
@liu-cong liu-cong marked this pull request as draft January 7, 2025 19:01
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 7, 2025

ahg-g commented Jan 7, 2025

@smarterclayton

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 7, 2025
@liu-cong liu-cong marked this pull request as ready for review January 7, 2025 22:59
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 7, 2025
@k8s-ci-robot k8s-ci-robot requested a review from ahg-g January 7, 2025 22:59

kfswain commented Jan 9, 2025

This is awesome!
I'm gonna put a hold on here so we can discuss this in our Contributors meeting before it merges. Does next week sound good to have this on the agenda? Or is a little more time preferable?

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 9, 2025
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 23, 2025

netlify bot commented Jan 23, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 00c6e61
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67996e9c4c15ae00088a21e6
😎 Deploy Preview: https://deploy-preview-164--gateway-api-inference-extension.netlify.app

Review thread on docs/proposals/003-model-server-protocol/protocol.md (outdated, resolved):
| Metric | Type | Description | vLLM metric |
| --- | --- | --- | --- |
| TimePerPrefillToken | Histogram | The prefill latency per token in the last W seconds. W will be decided by simulation/benchmarking. In time-series metrics, latency is typically reported as a Histogram, and we can derive the average from the Histogram. | `vllm:time_to_first_token_seconds` |
| TimePerDecodeToken | Histogram | The decode latency per token in the last W seconds. W will be decided by simulation/benchmarking. | `vllm:time_per_output_token_seconds` |
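Both metrics are Prometheus-style histograms, and the table notes that the average can be derived from the histogram. A minimal sketch of that derivation, assuming the standard Prometheus `_sum`/`_count` suffixes on the quoted vLLM metric name; the sample values are invented for illustration:

```python
# Sketch: derive a mean per-token latency from Prometheus histogram samples.
# A histogram's _sum divided by its _count gives the mean observation.

def histogram_average(samples: dict) -> float:
    """Mean decode latency per token: _sum / _count of the histogram."""
    total = samples["vllm:time_per_output_token_seconds_sum"]
    count = samples["vllm:time_per_output_token_seconds_count"]
    return total / count if count else 0.0

# Invented example scrape values: 500 tokens observed, 12.5 s total.
scrape = {
    "vllm:time_per_output_token_seconds_sum": 12.5,
    "vllm:time_per_output_token_seconds_count": 500,
}
print(histogram_average(scrape))  # mean seconds per decode token
```

Note that `_sum / _count` is the mean over the histogram's whole lifetime; a windowed average over the last W seconds would instead be computed server-side, e.g. with PromQL `rate(..._sum[W]) / rate(..._count[W])`.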

## LoRA Adapter Serving
Review comment (Contributor):
We should consider making the model server protocol pluggable with LoRA being a reference plugin implementation.

Further review threads on docs/proposals/003-model-server-protocol/protocol.md (outdated, resolved).
@kfswain kfswain left a comment


Some nits and thoughts, but overall LGTM! This is great!

Additional review threads on docs/proposals/003-model-server-protocol/protocol.md (outdated, resolved).

kfswain commented Jan 24, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 24, 2025

ahg-g commented Jan 27, 2025

I suggest the following:

  • Call this the "Endpoint Picker Protocol" and give it version 0.1.0 (or some other versioning format)
  • Split it into two sections: Model Server Protocol and Proxy Protocol
  • The Model Server Protocol includes what this PR currently has
  • The Proxy Protocol specifies that the protocol between the proxy and the endpoint picker is External Processing


ahg-g commented Jan 28, 2025

Also, part of this protocol is how the picked endpoint is communicated, i.e. the request header.
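As a conceptual sketch of that idea (not the real External Processing API): the endpoint picker's decision could be surfaced to the proxy as a request header mutation. The header name and the trivial picking logic below are placeholders, not part of the proposal:

```python
# Hypothetical sketch: express the picked endpoint as a request header
# mutation for the proxy. The header name below is a placeholder; the
# actual name would be defined by the Endpoint Picker Protocol.
PICKED_ENDPOINT_HEADER = "x-picked-endpoint"  # placeholder name

def pick_endpoint_headers(endpoints: list[str]) -> dict[str, str]:
    """Pick an endpoint (trivially, the first) and return it as a header."""
    chosen = endpoints[0]
    return {PICKED_ENDPOINT_HEADER: chosen}

print(pick_endpoint_headers(["10.0.0.12:8000", "10.0.0.13:8000"]))
```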

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 28, 2025

ahg-g commented Jan 28, 2025

/retitle "Add Endpoint Picker Protocol Proposal"

@k8s-ci-robot k8s-ci-robot changed the title Add model server protocol proposal "Add Endpoint Picker Protocol Proposal" Jan 28, 2025
@ahg-g ahg-g changed the title "Add Endpoint Picker Protocol Proposal" Add Endpoint Picker Protocol Proposal Jan 28, 2025
request, provided the requested adapter is valid.

The model server MUST expose the following LoRA adapter information via a RESTful API with the response in JSON:
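The discussion below notes that this REST API is not yet implemented (the EPP currently reads vLLM metrics instead), and the JSON schema is not shown in this excerpt. Purely as a hypothetical illustration of such a response, with invented field names and values:

```python
# Hypothetical sketch only: the JSON schema is NOT specified in this
# excerpt, so every field name and value below is invented.
import json

def lora_info_response() -> str:
    """Illustrative JSON body a model server might return."""
    body = {
        "base_model": "llama-3-8b",                    # assumed field
        "loaded_adapters": ["sql-lora", "chat-lora"],  # assumed field
        "max_adapters": 4,                             # assumed field
    }
    return json.dumps(body)

print(lora_info_response())
```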
Contributor:

Does the current EPP implementation support this?

liu-cong (author):

No, it does not. It uses the current vLLM metrics implementation.

Contributor:

The protocol is part of the release, so I think we should remove this until the next release, when we actually implement it, and only spec the metrics-based approach.

liu-cong (author):

I can document the current metrics approach for vLLM, but I don't want to make it a "protocol", because it's really awkward. It's not something we should recommend for the next model server to implement.

The purpose of the protocol is to set up a contract for any new model server integration to follow. So I think we should document this here, with a note that the current vLLM workaround should converge to it.

Contributor:

But that is why we have versioning; it doesn't make sense to document a contract that the EPP doesn't implement. Once the EPP implements it, we update the protocol and create a new release. @robscott @smarterclayton for opinions.

liu-cong (author):

IMO this is just a short-term workaround, and we don't need a protocol to document it; the code speaks for itself. I think the protocol should be something reasonably stable, based on our best knowledge, that we are willing to stand behind when integrating with another model server.

One option is to not document this at all until we have an implementation to support it.

Contributor:

Not documenting anything is an option, but documenting a protocol and making it part of a release that doesn't implement it is, I think, not what we should do here. Remember that we are versioning the protocol with the EPP image.

I would still lean towards documenting what we implemented and released; we can change that in the next release, which is what versioning allows us to do. It is expected that the initial iterations will include more frequent changes, and the protocol will stabilize after that.

smarterclayton commented Jan 28, 2025

I would probably align with Abdullah here: the goal is to spec what is currently required, so that you don't read about things that don't exist. There will be people using 0.1 for quite a while, so be accurate to 0.1. Anything that is to be removed can simply be moved to a separate PR, "draft for 0.2 proposed changes", and not lost.

liu-cong (author):

Thanks for the feedback; I updated the doc.


ahg-g commented Jan 29, 2025

/lgtm
/approve

Thanks @liu-cong !

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 29, 2025
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, liu-cong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 29, 2025
@ahg-g
Copy link
Contributor

ahg-g commented Jan 29, 2025

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 29, 2025
@k8s-ci-robot k8s-ci-robot merged commit ee46fd9 into kubernetes-sigs:main Jan 29, 2025
6 of 7 checks passed
Labels:
- approved: Indicates a PR has been approved by an approver from all required OWNERS files.
- cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
- lgtm: "Looks good to me", indicates that a PR is ready to be merged.
- size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.
6 participants