[serve][llm][doc] Document direct streaming and ingress request router#63860
[serve][llm][doc] Document direct streaming and ingress request router#63860eicherseiji wants to merge 20 commits into
Conversation
Add a user guide for direct streaming (RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING), covering how the LLMServer deployment becomes the ingress, how the ingress request router (LLMRouter) selects replicas via choose_replica(), how to customize replica selection and session affinity, and the current limitations. Add a routing-policies-guide cross-reference label and a note pointing readers to the ingress request router, and link the new page from the user guides index. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request introduces documentation for a new "Direct streaming" feature in Ray Serve LLM, which reduces streaming latency by bypassing the ingress proxy hop and routing requests directly to model replicas. The documentation includes architectural details, usage guides, and limitations. A review comment correctly identifies a Python code example error where build_openai_app is invoked with a dictionary instead of keyword arguments, which would cause a runtime error.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
…forwarding and TCP_NODELAY notes - Clarify in the alpha warning and throughout that the ingress request router (its router deployment, /internal/route endpoint, and replica selection plumbing) is a private implementation detail. The supported configuration surface is the env vars and request_router_config. - Document RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY=1 as the opt-in required for body-aware policies (prefix-aware routing). - Add a tip about RAY_SERVE_HAPROXY_TCP_NODELAY for streaming latency. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Defer the internal direct-streaming mechanism (and its caveats) to the direct streaming guide. The routing-policies doc only notes that the public request_router_config and policies still apply under direct streaming. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…n routing-policies note Signed-off-by: Seiji Eicher <seiji@anyscale.com>
… doc_code example Move the inline enable snippet into doc/source/llm/doc_code/serve/ direct_streaming/ and pull it in via literalinclude, mirroring the prefix-aware example. The glob in doc/BUILD.bazel runs it as a team:llm GPU test, so the documented build_openai_app call and config stay verified. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…eck, truncation scope - Make the request_router_config example a tested doc_code literalinclude (direct_streaming_custom_router_example.py); no inline Python blocks remain. - Verify direct streaming via the LLMRouter deployment in the Serve dashboard instead of quoting a volatile log line. - Clarify that body truncation affects only the routing-decision copy, not the request the model processes or the client response, and that the captured region is always the head (sized by RAY_SERVE_HAPROXY_INGRESS_REQUEST_ROUTER_BUFSIZE). Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Reference RAY_SERVE_HAPROXY_INGRESS_REQUEST_ROUTER_BUFSIZE by name and describe the memory tradeoff qualitatively instead of quoting the default value, which would drift if the constant changes. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…ex limits - Reframe body forwarding/truncation around TTFT (the buffering wait) rather than memory, and explain there's no free way to buffer the whole body. - Mention serve_haproxy_ingress_router_truncations_total and cross-link the HAProxy ingress request router metrics in the monitoring guide. - Clarify the session-id header case/-/_ matching with concrete examples. - Document that LoRA/multiplex-aware routing isn't supported, and add a Workarounds section (multiple single-model apps; default ingress for LoRA). Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…cation note Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…imitations Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…a limitations restatement Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Apply the platform-agnostic Anyscale style rules: split semicolons and explanatory colons into simple sentences, drop sentence parentheticals, define TTFT on first use, remove timeless words (currently/today) and the 'as usual' filler. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…drop redundant reference Signed-off-by: Seiji Eicher <seiji@anyscale.com>
… minor sentence splits docs-style-auditor flagged an apparent contradiction: this guide stated the direct-streaming default router is RoundRobinRouter while routing-policies.md states Ray Serve defaults to Power of Two Choices. Both are true (direct streaming overrides the general default in builder.py); make that explicit. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…ng CI to direct streaming guide
Bring the useful additions from the alternate draft into the direct streaming
user guide:
- Add a "Supported serving patterns" section covering the OpenAI, data parallel
attention, and prefill/decode builders, all of which support direct streaming.
- Add a YAML deployment tab alongside the Python example, backed by a
literalincluded config.yaml that mirrors the Python snippet.
- Note the engine-native routes plus the GET /v1/models/{id} route that direct
streaming adds for client.models.retrieve(...).
Wire the YAML config into CI via a companion doc_code example that loads and
deploys it (the qwen YAML config pattern). The direct_streaming doc examples now
run in their own bazel target with RAY_SERVE_ENABLE_HA_PROXY and
RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING set, so they exercise the real direct
streaming path instead of the default ingress. They are isolated from the shared
serve doc target because the flags would reject the multi-config qwen example
under direct streaming.
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
| (direct-streaming-guide)= | ||
| # Direct streaming | ||
|
|
||
| Lower streaming latency by removing the ingress proxy hop and routing requests directly to model replicas. |
There was a problem hiding this comment.
Can you add a figure to make it intuitive what is happening?
|
|
||
| The deployed application is OpenAI-compatible and exposes the engine's native routes, including `/v1/chat/completions`, `/v1/completions`, and `/v1/models`. Ray Serve LLM also adds `GET /v1/models/{id}`, so clients can call `client.models.retrieve(...)` as they would against the standalone ingress. | ||
|
|
||
| To confirm direct streaming is active, open the Serve dashboard and check that the ingress request router deployment (listed as `LLMRouter`) is running alongside your model deployment. |
There was a problem hiding this comment.
What would be the ray list command equivalent here? or serve status equivalent so that it is agent friendly?
- Lead with the Enable section (env vars + deploy) instead of burying it. - Drop internal-detail prose (capacity semaphore) and the single-model framing in "When to use" (already covered in Limitations). - Note the ingress event loop is shared between the request path (TTFT) and the streamed-token path (TPOT). - Clarify the routing-policies cross-reference and add an explicit link. - Switch doc_code examples to Qwen/Qwen3.5-0.8B with the vLLM recipe engine args (tool-call and reasoning parsers). Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 21a8624. Configure here.
… streaming verification Replace the one-line dashboard pointer in the verification section with a serve status example and a dashboard screenshot, both showing the LLMServer model deployment alongside the LLMRouter ingress request router. Note that LLMRouter replaces OpenAiIngress and is the signal that direct streaming is on. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Replace the direct streaming ASCII diagram in How it works with an architecture figure showing HAProxy calling /internal/route on the LLMRouter to pick an LLMServer replica, then sending the request and response directly to that replica. Fold the default-path diagram into prose. Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Why are these changes needed?
Direct streaming (
RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING) and the ingress request router shipped across several PRs (#63167, #63280, #63468, #63517, #63362) with no user-facing docs.What this PR adds
doc/source/serve/llm/user-guides/direct-streaming.md: what direct streaming is, how to enable and verify it, how to configure replica selection (including body-aware routing and session affinity), and its limitations. The ingress request router is documented as an internal mechanism; the supported surface is the env vars plusrequest_router_config.doc_codeexamples backing every Python snippet in the guide (run by the existingteam:llmGPU suite).architecture/routing-policies.md, and a toctree entry inuser-guides/index.md.Docs-only; all changes under
doc/.Related issue number
N/A
Checks