[serve][llm][doc] Document direct streaming and ingress request router by eicherseiji · Pull Request #63860 · ray-project/ray

eicherseiji · 2026-06-04T21:09:50Z

Why are these changes needed?

Direct streaming (RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING) and the ingress request router shipped across several PRs (#63167, #63280, #63468, #63517, #63362) with no user-facing docs.

What this PR adds

A user guide, doc/source/serve/llm/user-guides/direct-streaming.md: what direct streaming is, how to enable and verify it, how to configure replica selection (including body-aware routing and session affinity), and its limitations. The ingress request router is documented as an internal mechanism; the supported surface is the env vars plus request_router_config.
Tested doc_code examples backing every Python snippet in the guide (run by the existing team:llm GPU suite).
A cross-reference label and note in architecture/routing-policies.md, and a toctree entry in user-guides/index.md.

Docs-only; all changes under doc/.

Related issue number

N/A

Checks

I've signed off every commit (DCO).
I've made sure the tests are passing.
Testing strategy: docs-only; Python snippets are tested doc_code examples (team:llm GPU suite).

Add a user guide for direct streaming (RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING), covering how the LLMServer deployment becomes the ingress, how the ingress request router (LLMRouter) selects replicas via choose_replica(), how to customize replica selection and session affinity, and the current limitations. Add a routing-policies-guide cross-reference label and a note pointing readers to the ingress request router, and link the new page from the user guides index. Signed-off-by: Seiji Eicher <seiji@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces documentation for a new "Direct streaming" feature in Ray Serve LLM, which reduces streaming latency by bypassing the ingress proxy hop and routing requests directly to model replicas. The documentation includes architectural details, usage guides, and limitations. A review comment correctly identifies a Python code example error where build_openai_app is invoked with a dictionary instead of keyword arguments, which would cause a runtime error.


…forwarding and TCP_NODELAY notes

- Clarify in the alpha warning and throughout that the ingress request router (its router deployment, /internal/route endpoint, and replica selection plumbing) is a private implementation detail. The supported configuration surface is the env vars and request_router_config.
- Document RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY=1 as the opt-in required for body-aware policies (prefix-aware routing).
- Add a tip about RAY_SERVE_HAPROXY_TCP_NODELAY for streaming latency.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
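The environment variables named in this PR could be assembled as in the following minimal sketch. The helper `direct_streaming_env` is hypothetical. The value `"1"` for `RAY_SERVE_ENABLE_HA_PROXY` and `RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING` is an assumption, because the PR text names these flags without quoting their values. Only `RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY=1` is quoted directly.

```python
import os


def direct_streaming_env(forward_body: bool = False) -> dict[str, str]:
    """Return a copy of the current environment with direct streaming flags set.

    Hypothetical helper for illustration; it is not part of Ray.
    """
    env = dict(os.environ)
    # Assumption: "1" enables these flags. The PR names them but not their values.
    env["RAY_SERVE_ENABLE_HA_PROXY"] = "1"
    env["RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING"] = "1"
    if forward_body:
        # Quoted in the commit message: opt-in required for body-aware
        # policies such as prefix-aware routing.
        env["RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY"] = "1"
    return env


env = direct_streaming_env(forward_body=True)
```

The resulting mapping can be passed as the `env` argument of `subprocess.run` when launching Serve, so the flags are set before any Serve process starts.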

Defer the internal direct-streaming mechanism (and its caveats) to the direct streaming guide. The routing-policies doc only notes that the public request_router_config and policies still apply under direct streaming. Signed-off-by: Seiji Eicher <seiji@anyscale.com>

…n routing-policies note Signed-off-by: Seiji Eicher <seiji@anyscale.com>

… doc_code example

Move the inline enable snippet into doc/source/llm/doc_code/serve/direct_streaming/ and pull it in via literalinclude, mirroring the prefix-aware example. The glob in doc/BUILD.bazel runs it as a team:llm GPU test, so the documented build_openai_app call and config stay verified.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

…eck, truncation scope

- Make the request_router_config example a tested doc_code literalinclude (direct_streaming_custom_router_example.py); no inline Python blocks remain.
- Verify direct streaming via the LLMRouter deployment in the Serve dashboard instead of quoting a volatile log line.
- Clarify that body truncation affects only the routing-decision copy, not the request the model processes or the client response, and that the captured region is always the head (sized by RAY_SERVE_HAPROXY_INGRESS_REQUEST_ROUTER_BUFSIZE).

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
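The truncation scope described in this commit can be illustrated with a small sketch. The function `routing_copy` is hypothetical and is not the HAProxy or Ray implementation. It only shows the stated behavior: the routing decision sees the head of the body up to a buffer size, while the original request body is left untouched.

```python
def routing_copy(body: bytes, bufsize: int) -> tuple[bytes, bool]:
    """Return the head of the body used for routing, and whether it was truncated.

    Hypothetical illustration; bufsize stands in for the limit set by
    RAY_SERVE_HAPROXY_INGRESS_REQUEST_ROUTER_BUFSIZE.
    """
    head = body[:bufsize]  # the captured region is always the head
    return head, len(body) > bufsize


# The full body still reaches the model; only the routing copy is shortened.
body = b'{"model": "m", "prompt": "' + b"x" * 100 + b'"}'
head, truncated = routing_copy(body, bufsize=32)
```

A body-aware policy that depends on content past the captured head would not see it, which is why the commit ties the buffer size to routing quality.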

Reference RAY_SERVE_HAPROXY_INGRESS_REQUEST_ROUTER_BUFSIZE by name and describe the memory tradeoff qualitatively instead of quoting the default value, which would drift if the constant changes. Signed-off-by: Seiji Eicher <seiji@anyscale.com>

…ex limits

- Reframe body forwarding/truncation around TTFT (the buffering wait) rather than memory, and explain there's no free way to buffer the whole body.
- Mention serve_haproxy_ingress_router_truncations_total and cross-link the HAProxy ingress request router metrics in the monitoring guide.
- Clarify the session-id header case/-/_ matching with concrete examples.
- Document that LoRA/multiplex-aware routing isn't supported, and add a Workarounds section (multiple single-model apps; default ingress for LoRA).

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
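The commit mentions session-id header matching across case and `-`/`_` variants but does not spell out the rule. The sketch below shows one plausible reading, assuming names match case-insensitively and that `-` and `_` are interchangeable. The header name `X-Session-Id` and both helper functions are hypothetical.

```python
from typing import Optional


def normalize_header_name(name: str) -> str:
    """Fold case and treat '_' as '-' (an assumed matching rule)."""
    return name.strip().lower().replace("_", "-")


def find_session_id(headers: dict[str, str], configured_name: str) -> Optional[str]:
    """Return the value of the first header that matches the configured name."""
    wanted = normalize_header_name(configured_name)
    for name, value in headers.items():
        if normalize_header_name(name) == wanted:
            return value
    return None


# Under the assumed rule, all of these spellings refer to the same header.
variants = ["X-Session-Id", "x-session-id", "X_SESSION_ID"]
matches = [find_session_id({v: "abc123"}, "x-session-id") for v in variants]
```

The guide itself remains the authority on the matching rule that ships.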

…cation note Signed-off-by: Seiji Eicher <seiji@anyscale.com>

…imitations Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

…a limitations restatement Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Apply the platform-agnostic Anyscale style rules: split semicolons and explanatory colons into simple sentences, drop sentence parentheticals, define TTFT on first use, remove timeless words (currently/today) and the 'as usual' filler. Signed-off-by: Seiji Eicher <seiji@anyscale.com>

…drop redundant reference Signed-off-by: Seiji Eicher <seiji@anyscale.com>

… minor sentence splits

docs-style-auditor flagged an apparent contradiction: this guide stated the direct-streaming default router is RoundRobinRouter while routing-policies.md states Ray Serve defaults to Power of Two Choices. Both are true (direct streaming overrides the general default in builder.py); make that explicit.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
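For readers unfamiliar with the two defaults named in this commit, the following sketch contrasts the general ideas. It is a textbook illustration of round-robin and power-of-two-choices selection, not Ray Serve's `RoundRobinRouter` or its Power of Two Choices router. The replica names and queue lengths are made up.

```python
import itertools
import random


def round_robin(replicas):
    """Cycle through replicas in a fixed order, ignoring load."""
    return itertools.cycle(replicas)


def power_of_two_choices(queue_lengths, rng=random):
    """Sample two distinct replicas and pick the one with the shorter queue."""
    a, b = rng.sample(list(queue_lengths), 2)
    return a if queue_lengths[a] <= queue_lengths[b] else b


rr = round_robin(["r0", "r1", "r2"])
rr_picks = [next(rr) for _ in range(4)]

# With exactly two replicas, both are always sampled, so the idle one wins.
p2c_pick = power_of_two_choices({"r0": 5, "r1": 0})
```

Round robin needs no load information, while power of two choices needs a load signal for the sampled replicas.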

…ng CI to direct streaming guide

Bring the useful additions from the alternate draft into the direct streaming user guide:

- Add a "Supported serving patterns" section covering the OpenAI, data parallel attention, and prefill/decode builders, all of which support direct streaming.
- Add a YAML deployment tab alongside the Python example, backed by a literalincluded config.yaml that mirrors the Python snippet.
- Note the engine-native routes plus the GET /v1/models/{id} route that direct streaming adds for client.models.retrieve(...).

Wire the YAML config into CI via a companion doc_code example that loads and deploys it (the qwen YAML config pattern). The direct_streaming doc examples now run in their own bazel target with RAY_SERVE_ENABLE_HA_PROXY and RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING set, so they exercise the real direct streaming path instead of the default ingress. They are isolated from the shared serve doc target because the flags would reject the multi-config qwen example under direct streaming.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

dstrodtman

stamp

kouroshHakha · 2026-06-08T19:24:58Z

+(direct-streaming-guide)=
+# Direct streaming
+
+Lower streaming latency by removing the ingress proxy hop and routing requests directly to model replicas.


Can you add a figure to make it intuitive what is happening?

kouroshHakha · 2026-06-08T19:32:22Z

+
+The deployed application is OpenAI-compatible and exposes the engine's native routes, including `/v1/chat/completions`, `/v1/completions`, and `/v1/models`. Ray Serve LLM also adds `GET /v1/models/{id}`, so clients can call `client.models.retrieve(...)` as they would against the standalone ingress.
+
+To confirm direct streaming is active, open the Serve dashboard and check that the ingress request router deployment (listed as `LLMRouter`) is running alongside your model deployment.


Take a screenshot?

What would be the `ray list` command equivalent here? Or the `serve status` equivalent, so that it is agent friendly?

- Lead with the Enable section (env vars + deploy) instead of burying it.
- Drop internal-detail prose (capacity semaphore) and the single-model framing in "When to use" (already covered in Limitations).
- Note the ingress event loop is shared between the request path (TTFT) and the streamed-token path (TPOT).
- Clarify the routing-policies cross-reference and add an explicit link.
- Switch doc_code examples to Qwen/Qwen3.5-0.8B with the vLLM recipe engine args (tool-call and reasoning parsers).

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 21a8624.

… streaming verification

Replace the one-line dashboard pointer in the verification section with a serve status example and a dashboard screenshot, both showing the LLMServer model deployment alongside the LLMRouter ingress request router. Note that LLMRouter replaces OpenAiIngress and is the signal that direct streaming is on.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
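A scripted version of this check might look like the sketch below. It assumes the `serve status` output has already been parsed into a dictionary with an `applications` mapping whose entries each hold a `deployments` mapping keyed by deployment name. That shape, the deployment names in the sample data, and the helper `has_ingress_request_router` are all assumptions for illustration.

```python
def has_ingress_request_router(status: dict) -> bool:
    """Return True if any application lists a deployment named like LLMRouter."""
    for app in status.get("applications", {}).values():
        for name in app.get("deployments", {}):
            if "LLMRouter" in name:
                return True
    return False


# Made-up status snapshots in the assumed shape.
direct = {
    "applications": {
        "default": {
            "deployments": {
                "LLMServer:my-model": {"status": "HEALTHY"},
                "LLMRouter": {"status": "HEALTHY"},
            }
        }
    }
}
default_ingress = {
    "applications": {
        "default": {
            "deployments": {
                "LLMServer:my-model": {"status": "HEALTHY"},
                "OpenAiIngress": {"status": "HEALTHY"},
            }
        }
    }
}
```

Matching on the `LLMRouter` substring follows the commit's statement that its presence, in place of `OpenAiIngress`, signals that direct streaming is on.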

Replace the direct streaming ASCII diagram in How it works with an architecture figure showing HAProxy calling /internal/route on the LLMRouter to pick an LLMServer replica, then sending the request and response directly to that replica. Fold the default-path diagram into prose. Signed-off-by: Seiji Eicher <seiji@anyscale.com>

gemini-code-assist Bot reviewed Jun 4, 2026


Comment thread doc/source/serve/llm/user-guides/direct-streaming.md Outdated

eicherseiji added 11 commits June 4, 2026 22:02

[serve][llm][doc] Say 'implementation detail' instead of 'internal' i…

f92af8b

…n routing-policies note Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[serve][llm][doc] Fold workarounds into limitation bullets, trim trun…

8883401

…cation note Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[serve][llm][doc] Trim non-limitation bullets from direct streaming l…

d055bae

…imitations Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[serve][llm][doc] Keep limitations to genuine functional constraints

0d1b3e1

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[serve][llm][doc] Reframe 'When to use' as the low-latency path, not …

6d36eef

…a limitations restatement Signed-off-by: Seiji Eicher <seiji@anyscale.com>

eicherseiji self-assigned this Jun 4, 2026

eicherseiji added the "go" label (add ONLY when ready to merge, run all tests) Jun 4, 2026

eicherseiji marked this pull request as ready for review June 4, 2026 22:55

eicherseiji requested review from a team as code owners June 4, 2026 22:55

ray-gardener Bot added the "serve" (Ray Serve Related Issue) and "docs" (An issue or change related to documentation) labels Jun 5, 2026

eicherseiji added 4 commits June 5, 2026 16:58

[serve][llm][doc] Apply style-auditor findings: split long sentence, …

0229d84

…drop redundant reference Signed-off-by: Seiji Eicher <seiji@anyscale.com>

dstrodtman approved these changes Jun 6, 2026


kouroshHakha reviewed Jun 8, 2026


jeffreywang-anyscale reviewed Jun 8, 2026


Comment thread doc/source/llm/doc_code/serve/direct_streaming/direct_streaming_config.yaml Outdated

Comment thread doc/source/serve/llm/architecture/routing-policies.md Outdated

Comment thread doc/source/serve/llm/user-guides/direct-streaming.md Outdated

eicherseiji mentioned this pull request Jun 8, 2026

[serve][llm] Make direct streaming LoRA/multiplex-aware #63934

Draft

2 tasks

[serve][llm][doc] Note affinity-aware direct streaming is planned

21a8624

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

cursor Bot reviewed Jun 8, 2026


Comment thread doc/source/llm/doc_code/serve/direct_streaming/direct_streaming_example.py

eicherseiji added 2 commits June 9, 2026 23:02

eicherseiji requested review from jeffreywang-anyscale and kouroshHakha June 10, 2026 00:38

eicherseiji wants to merge 20 commits into ray-project:master from eicherseiji:seiji/docs-llm-direct-streaming


