feat(tokenization): replace RenderChat with RenderChatCompletion RPC #432
vMaroon merged 29 commits into llm-d:main
Pull request overview
This PR updates the KV-cache manager’s UDS tokenizer client to use the newer vLLM renderer RPCs (RenderChatCompletion / RenderCompletion) instead of the legacy chat-template rendering flow, and adjusts tests/protobuf bindings accordingly.
Changes:
- Switch the Go UDS tokenizer client: `Render` now calls `RenderCompletion` and `RenderChat` calls `RenderChatCompletion`.
- Update Go and Python tests to reflect the new RPCs and response shapes (notably: offsets are no longer asserted).
- Regenerate Go protobuf/grpc bindings to include the new RPCs and message types.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/e2e/uds_tokenizer/uds_e2e_test.go | Updates e2e assertions to ignore offsets and validate determinism under the new render RPCs. |
| services/uds_tokenizer/tests/test_renderer.py | Adjusts integration tests for RenderChatCompletion; one assertion was weakened. |
| pkg/tokenization/uds_tokenizer_test.go | Updates mock server + unit tests to cover RenderChatCompletion / RenderCompletion. |
| pkg/tokenization/uds_tokenizer.go | Main client change: builds OpenAI-ish JSON payloads and calls new renderer RPCs. |
| api/tokenizerpb/tokenizer_grpc.pb.go | Regenerated gRPC client/server stubs with new RPC methods. |
| api/tokenizerpb/tokenizer.pb.go | Regenerated protobuf messages for new render request/response + MM feature types. |
| api/indexerpb/indexer_grpc.pb.go | Regenerated header/version metadata. |
| api/indexerpb/indexer.pb.go | Regenerated header/version metadata. |
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.
```python
    ),
    self.renderer_service.render_chat(chat_request, request.model_name),
    self._loop,
).result()
```
Not yours, but I think we need a timeout in order not to block the whole grpc server.
```python
from concurrent.futures import TimeoutError as FuturesTimeoutError

try:
    result = asyncio.run_coroutine_threadsafe(
        self.renderer_service.render_chat(chat_request, request.model_name),
        self._loop,
    ).result(timeout=30)
except FuturesTimeoutError:
    context.abort(grpc.StatusCode.DEADLINE_EXCEEDED, "render_chat timed out")
```
Since we're already migrating to async here, would it make more sense to add it there?
Let's follow up separately if needed.
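For reference, the timeout pattern discussed in this thread can be exercised in isolation with only the standard library. The coroutine and loop names below are illustrative stand-ins, not code from the PR:

```python
import asyncio
import threading
from concurrent.futures import TimeoutError as FuturesTimeoutError

# Background event loop, analogous to the servicer's self._loop.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

async def render_chat(delay: float) -> str:
    # Stand-in for renderer_service.render_chat(...).
    await asyncio.sleep(delay)
    return "rendered"

# Fast case: the coroutine finishes well within the timeout.
ok = asyncio.run_coroutine_threadsafe(render_chat(0.01), loop).result(timeout=1)

# Slow case: .result(timeout=...) raises instead of blocking forever.
slow = asyncio.run_coroutine_threadsafe(render_chat(5), loop)
try:
    slow.result(timeout=0.1)
    timed_out = False
except FuturesTimeoutError:
    timed_out = True
slow.cancel()
loop.call_soon_threadsafe(loop.stop)
```

In the real servicer, the `except` branch would map the timeout to a gRPC status (e.g. `DEADLINE_EXCEEDED`) as in the suggestion above.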
/lgtm
Closes #425. Rebased on top of #461.
Changes Overview
Switches `Render` and `RenderChat` in `UdsTokenizer` to use the new `RenderCompletion` and `RenderChatCompletion` RPCs introduced in #461, replacing the old `RenderChatTemplate` flow.

On the Go side, `RenderChat` now builds a native `RenderChatCompletionRequest` proto (messages, tools, chat_template_kwargs) and returns token IDs directly instead of calling `Encode` on a rendered prompt string. `Render` calls `RenderCompletion` with the prompt list and also returns token IDs directly; neither returns character offsets anymore, since the renderer service doesn't produce them.

Protocol Design
Tools and `chat_template_kwargs` are both serialized as JSON strings in the proto (`tools_json`, `chat_template_kwargs`). This avoids building a typed proto structure for fields that are already arbitrary JSON at the call site (`[]interface{}` in GIE's `ChatCompletionsRequest`), and lets Python deserialize them directly without field renaming or special-casing.

On the Python side, the gRPC servicer and renderer are updated to match: the renderer service methods now accept typed `ChatCompletionRequest` / `CompletionRequest` objects directly instead of going through a JSON round-trip. Proto-to-request conversion uses `MessageToDict` and `json.loads` for the JSON string fields.

Also excludes generated pb.go and pb2 files from golangci-lint and ruff in CI.
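The JSON-string design can be sketched without the generated protos. The helpers below are illustrative stand-ins: plain dicts play the role of the proto message (as `MessageToDict` would produce) and of the typed request's keyword arguments; only the `tools_json` / `chat_template_kwargs` field names follow the PR's description:

```python
import json

def build_request(messages, tools, chat_template_kwargs):
    # Go-client side (stand-in): arbitrary-JSON fields travel as JSON
    # strings rather than typed proto structures.
    return {
        "messages": messages,
        "tools_json": json.dumps(tools) if tools else "",
        "chat_template_kwargs": json.dumps(chat_template_kwargs or {}),
    }

def to_chat_completion_kwargs(req):
    # Python side (stand-in): MessageToDict would yield a dict like `req`;
    # the JSON string fields are decoded with json.loads, with no field
    # renaming or special-casing needed.
    kwargs = {"messages": req["messages"]}
    if req["tools_json"]:
        kwargs["tools"] = json.loads(req["tools_json"])
    kwargs["chat_template_kwargs"] = json.loads(req["chat_template_kwargs"])
    return kwargs

req = build_request(
    [{"role": "user", "content": "hi"}],
    [{"type": "function", "function": {"name": "get_weather"}}],
    {"add_generation_prompt": True},
)
out = to_chat_completion_kwargs(req)
```

The round-trip preserves the nested tool and kwargs structure exactly, which is the property that lets the renderer consume them without a typed proto schema.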