feat(api_server): Add OpenAI-compatible API server for MaxText models #2313
Conversation
RissyRan
left a comment
Great work! I will take a second round of review for the files maxtext_generator, maxtext_server, and server_models.
RissyRan
left a comment
Thanks! LGTM in general. Could you leverage Gemini to build some unit tests, especially for maxtext_generator? More unit tests are very welcome!
Great job! LGTM to unblock.
+1 to Ran's comment; it would be great to have some unit tests guarding your functionality.
RissyRan
left a comment
I am fine with merging this for now (it lives in separate files and does not break the existing codebase), but could @hengtaoguo or @bvandermoon help test and verify end-to-end? There are currently no tests for these scripts and their functionality.
RissyRan
left a comment
After discussing with @bvandermoon and @hengtaoguo, we will follow up if any issues arise.
This commit introduces a fully-featured, OpenAI-compatible RESTful API server for serving MaxText models. The server is built with FastAPI, supports multi-host inference on TPUs, and is designed for both interactive use and large-scale benchmarking.
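For reference, an OpenAI-compatible `/v1/completions` response carries a fixed set of fields. The helper below is an illustrative, stdlib-only sketch of that shape (it is not code from this PR, and the whitespace-based token counts are placeholders for real tokenizer counts):

```python
import time
import uuid

def build_completion_response(model, prompt, text, finish_reason="stop"):
    """Assemble a minimal OpenAI-style /v1/completions response body.

    Field names follow the OpenAI completions schema. Token counts here
    are naive whitespace splits; a real server would count tokenizer tokens.
    """
    prompt_tokens = len(prompt.split())
    completion_tokens = len(text.split())
    return {
        "id": f"cmpl-{uuid.uuid4().hex}",
        "object": "text_completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {
                "index": 0,
                "text": text,
                "logprobs": None,
                "finish_reason": finish_reason,
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```

In the server itself, this schema is expressed as Pydantic models in `server_models.py`, so responses are validated rather than hand-assembled.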
Key features and additions:
1. **Core Server Implementation:**
- Adds `maxtext_server.py`, a FastAPI application that serves `/v1/completions` and `/v1/chat/completions` endpoints.
- Implements dynamic request batching to efficiently utilize underlying hardware.
- Uses `maxtext_generator.py` to encapsulate the MaxText inference engine, handling model loading, tokenization, and the generation loop.
- Includes Pydantic models in `server_models.py` for robust, OpenAI-compliant request and response validation.
2. **Deployment and Utilities:**
- Provides `start_server.sh` to simplify launching the server from the project root.
- Adds `port_forward_xpk.sh`, a utility script to automatically find and connect to a server running on a GKE cluster via `xpk`, supporting custom namespaces.
- Isolates server-specific dependencies in `benchmarks/api_server/requirements.txt` (`uvicorn`, `fastapi`, `openai-harmony`).
3. **Comprehensive Documentation:**
- A new `README.md` in the `api_server` directory offers a complete guide covering:
- Installation and environment setup.
- Launching the server in both single-pod and multi-pod GKE environments.
- Detailed examples for interacting with the API using `curl` and the `openai` Python client.
- Step-by-step instructions for running benchmarks with `lm-evaluation-harness` and `evalchemy` for both log-likelihood and generative tasks.
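The dynamic request batching mentioned above reduces to a small, framework-independent idea: block until the first pending request arrives, then keep draining the queue until the batch is full or a short deadline passes. A stdlib sketch of that loop (function and parameter names are illustrative, not taken from the PR):

```python
import queue
import time

def collect_batch(requests, max_batch_size=8, max_wait_s=0.05):
    """Collect up to max_batch_size items from a queue.Queue.

    Blocks for the first item, then waits at most max_wait_s (total)
    for more, so a lone request is never delayed longer than the window.
    """
    batch = [requests.get()]  # block until at least one request arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

In a server like this one, a worker thread would call such a function in a loop and hand each batch to the generation loop, amortizing the per-step cost of the model across concurrent requests.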