[serve][llm][doc] Document direct streaming and ingress request router #63860

Open
eicherseiji wants to merge 20 commits into ray-project:master from eicherseiji:seiji/docs-llm-direct-streaming

Conversation

@eicherseiji eicherseiji commented Jun 4, 2026

Contributor

Why are these changes needed?

Direct streaming (RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING) and the ingress request router shipped across several PRs (#63167, #63280, #63468, #63517, #63362) with no user-facing docs.

What this PR adds

  • A user guide, doc/source/serve/llm/user-guides/direct-streaming.md: what direct streaming is, how to enable and verify it, how to configure replica selection (including body-aware routing and session affinity), and its limitations. The ingress request router is documented as an internal mechanism; the supported surface is the env vars plus request_router_config.
  • Tested doc_code examples backing every Python snippet in the guide (run by the existing team:llm GPU suite).
  • A cross-reference label and note in architecture/routing-policies.md, and a toctree entry in user-guides/index.md.

Docs-only; all changes under doc/.

Related issue number

N/A

Checks

  • I've signed off every commit (DCO).
  • I've made sure the tests are passing.
  • Testing strategy: docs-only; Python snippets are tested doc_code examples (team:llm GPU suite).

Add a user guide for direct streaming (RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING),
covering how the LLMServer deployment becomes the ingress, how the ingress
request router (LLMRouter) selects replicas via choose_replica(), how to
customize replica selection and session affinity, and the current limitations.

Add a routing-policies-guide cross-reference label and a note pointing readers
to the ingress request router, and link the new page from the user guides
index.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces documentation for a new "Direct streaming" feature in Ray Serve LLM, which reduces streaming latency by bypassing the ingress proxy hop and routing requests directly to model replicas. The documentation includes architectural details, usage guides, and limitations. A review comment correctly identifies a Python code example error where build_openai_app is invoked with a dictionary instead of keyword arguments, which would cause a runtime error.

Comment thread doc/source/serve/llm/user-guides/direct-streaming.md Outdated
…forwarding and TCP_NODELAY notes

- Clarify in the alpha warning and throughout that the ingress request
  router (its router deployment, /internal/route endpoint, and replica
  selection plumbing) is a private implementation detail. The supported
  configuration surface is the env vars and request_router_config.
- Document RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY=1 as the opt-in
  required for body-aware policies (prefix-aware routing).
- Add a tip about RAY_SERVE_HAPROXY_TCP_NODELAY for streaming latency.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Defer the internal direct-streaming mechanism (and its caveats) to the
direct streaming guide. The routing-policies doc only notes that the public
request_router_config and policies still apply under direct streaming.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…n routing-policies note

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
… doc_code example

Move the inline enable snippet into doc/source/llm/doc_code/serve/
direct_streaming/ and pull it in via literalinclude, mirroring the
prefix-aware example. The glob in doc/BUILD.bazel runs it as a team:llm GPU
test, so the documented build_openai_app call and config stay verified.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…eck, truncation scope

- Make the request_router_config example a tested doc_code literalinclude
  (direct_streaming_custom_router_example.py); no inline Python blocks remain.
- Verify direct streaming via the LLMRouter deployment in the Serve dashboard
  instead of quoting a volatile log line.
- Clarify that body truncation affects only the routing-decision copy, not the
  request the model processes or the client response, and that the captured
  region is always the head (sized by RAY_SERVE_HAPROXY_INGRESS_REQUEST_ROUTER_BUFSIZE).
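The truncation scope described in this commit can be illustrated with a minimal Python sketch. This is not the HAProxy or Ray Serve implementation. `routing_copy`, `RoutingCopy`, and the `bufsize` parameter are hypothetical names that stand in for the head-capture limit controlled by `RAY_SERVE_HAPROXY_INGRESS_REQUEST_ROUTER_BUFSIZE`.

```python
from typing import NamedTuple


class RoutingCopy(NamedTuple):
    head: bytes       # bytes the routing decision is allowed to see
    truncated: bool   # True when the body was longer than the buffer


def routing_copy(body: bytes, bufsize: int) -> RoutingCopy:
    """Return the head of the body for routing; the body itself is untouched."""
    if bufsize < 0:
        raise ValueError("bufsize must be non-negative")
    return RoutingCopy(head=body[:bufsize], truncated=len(body) > bufsize)


# Hypothetical request body that is longer than the routing buffer.
body = b'{"model": "m", "prompt": "' + b"x" * 100 + b'"}'
copy = routing_copy(body, bufsize=32)
# Only the first 32 bytes feed the routing decision. `body` is unchanged,
# so the model would still process the full request.
```

The sketch makes the point of the commit concrete: truncation shortens only the copy used to choose a replica, and the captured region is always the start of the body.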

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Reference RAY_SERVE_HAPROXY_INGRESS_REQUEST_ROUTER_BUFSIZE by name and describe
the memory tradeoff qualitatively instead of quoting the default value, which
would drift if the constant changes.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…ex limits

- Reframe body forwarding/truncation around TTFT (the buffering wait) rather
  than memory, and explain there's no free way to buffer the whole body.
- Mention serve_haproxy_ingress_router_truncations_total and cross-link the
  HAProxy ingress request router metrics in the monitoring guide.
- Clarify the session-id header case/-/_ matching with concrete examples.
- Document that LoRA/multiplex-aware routing isn't supported, and add a
  Workarounds section (multiple single-model apps; default ingress for LoRA).
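The session-id header matching mentioned in this commit can be sketched as follows. The commit message does not spell out the exact rule, so this sketch assumes that matching ignores case and treats `-` and `_` as equivalent. The header name `X-Session-Id` and both function names are hypothetical.

```python
from typing import Mapping, Optional


def normalize_header_name(name: str) -> str:
    """Fold case and treat '-' and '_' as the same character (assumed rule)."""
    return name.strip().lower().replace("_", "-")


def find_session_id(headers: Mapping[str, str], configured: str) -> Optional[str]:
    """Return the value of the first header whose normalized name matches."""
    wanted = normalize_header_name(configured)
    for name, value in headers.items():
        if normalize_header_name(name) == wanted:
            return value
    return None


# Under the assumed rule, these spellings all refer to the same header.
variants = ["X-Session-Id", "x-session-id", "X_SESSION_ID", "x_session-ID"]
matches = [find_session_id({v: "abc"}, "X-Session-Id") for v in variants]
```

Under this assumption every entry in `matches` is the same session id, while an unrelated header such as `X-Request-Id` would not match.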

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…cation note

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…imitations

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…a limitations restatement

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@eicherseiji eicherseiji self-assigned this Jun 4, 2026
@eicherseiji eicherseiji added the `go` label (add ONLY when ready to merge, run all tests) Jun 4, 2026
@eicherseiji eicherseiji marked this pull request as ready for review June 4, 2026 22:55
@eicherseiji eicherseiji requested review from a team as code owners June 4, 2026 22:55
@ray-gardener ray-gardener Bot added the `serve` (Ray Serve Related Issue) and `docs` (An issue or change related to documentation) labels Jun 5, 2026
Apply the platform-agnostic Anyscale style rules: split semicolons and
explanatory colons into simple sentences, drop sentence parentheticals,
define TTFT on first use, remove timeless words (currently/today) and the
'as usual' filler.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…drop redundant reference

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
… minor sentence splits

docs-style-auditor flagged an apparent contradiction: this guide stated the
direct-streaming default router is RoundRobinRouter while routing-policies.md
states Ray Serve defaults to Power of Two Choices. Both are true (direct
streaming overrides the general default in builder.py); make that explicit.
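For readers unfamiliar with the two policies named in this commit, the difference can be sketched in plain Python. This is a textbook illustration of round robin and power of two choices, not the code of `RoundRobinRouter` or of Ray Serve's default router. The replica names and the `in_flight` load map are hypothetical.

```python
import itertools
import random
from typing import Dict, Iterator, List, Optional


def round_robin(replicas: List[str]) -> Iterator[str]:
    """Yield replicas in a fixed rotation, ignoring load."""
    return itertools.cycle(replicas)


def power_of_two_choices(
    in_flight: Dict[str, int], rng: Optional[random.Random] = None
) -> str:
    """Sample two distinct replicas and return the less loaded one."""
    if not in_flight:
        raise ValueError("no replicas available")
    rng = rng or random.Random()
    names = list(in_flight)
    if len(names) == 1:
        return names[0]
    first, second = rng.sample(names, 2)
    return first if in_flight[first] <= in_flight[second] else second


rr = round_robin(["replica-a", "replica-b", "replica-c"])
first_four = [next(rr) for _ in range(4)]  # wraps back to replica-a

# Hypothetical in-flight request counts per replica.
load = {"replica-a": 5, "replica-b": 0, "replica-c": 5}
picks = [power_of_two_choices(load, random.Random(seed)) for seed in range(50)]
```

Round robin spreads requests evenly without looking at load, while power of two choices favors the idle replica whenever it is one of the two sampled.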

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…ng CI to direct streaming guide

Bring the useful additions from the alternate draft into the direct streaming
user guide:

- Add a "Supported serving patterns" section covering the OpenAI, data parallel
  attention, and prefill/decode builders, all of which support direct streaming.
- Add a YAML deployment tab alongside the Python example, backed by a
  literalincluded config.yaml that mirrors the Python snippet.
- Note the engine-native routes plus the GET /v1/models/{id} route that direct
  streaming adds for client.models.retrieve(...).

Wire the YAML config into CI via a companion doc_code example that loads and
deploys it (the qwen YAML config pattern). The direct_streaming doc examples now
run in their own bazel target with RAY_SERVE_ENABLE_HA_PROXY and
RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING set, so they exercise the real direct
streaming path instead of the default ingress. They are isolated from the shared
serve doc target because the flags would reject the multi-config qwen example
under direct streaming.
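A launcher that sets both flags named in this commit could look like the sketch below. The value `"1"` is an assumption, modeled on the `RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY=1` form used elsewhere in this PR. `with_direct_streaming` and `DIRECT_STREAMING_FLAGS` are hypothetical names, and the commented command is only an illustration.

```python
import os
from typing import Dict, Mapping, Optional

# The two environment variables this commit sets for the isolated doc target.
DIRECT_STREAMING_FLAGS = (
    "RAY_SERVE_ENABLE_HA_PROXY",
    "RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING",
)


def with_direct_streaming(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with both flags set to "1"."""
    env = dict(os.environ if base is None else base)
    for flag in DIRECT_STREAMING_FLAGS:
        env[flag] = "1"  # assumed truthy value, not confirmed by this PR page
    return env


env = with_direct_streaming({"PATH": "/usr/bin"})
# A launcher could then pass the environment to a child process, for example:
#   subprocess.run(["serve", "run", "config.yaml"], env=env, check=True)
```

Building a copy instead of mutating `os.environ` keeps the flags scoped to the process that needs them, which mirrors the reason the commit isolates these examples in their own target.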

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

@dstrodtman dstrodtman left a comment

Contributor

stamp

Comment thread doc/source/serve/llm/architecture/routing-policies.md Outdated
(direct-streaming-guide)=
# Direct streaming

Lower streaming latency by removing the ingress proxy hop and routing requests directly to model replicas.

Contributor

Can you add a figure to make it intuitive what is happening?

Comment thread doc/source/serve/llm/user-guides/direct-streaming.md Outdated
Comment thread doc/source/serve/llm/user-guides/direct-streaming.md Outdated
Comment thread doc/source/serve/llm/user-guides/direct-streaming.md Outdated
Comment thread doc/source/serve/llm/user-guides/direct-streaming.md Outdated
Comment thread doc/source/serve/llm/user-guides/direct-streaming.md

The deployed application is OpenAI-compatible and exposes the engine's native routes, including `/v1/chat/completions`, `/v1/completions`, and `/v1/models`. Ray Serve LLM also adds `GET /v1/models/{id}`, so clients can call `client.models.retrieve(...)` as they would against the standalone ingress.

To confirm direct streaming is active, open the Serve dashboard and check that the ingress request router deployment (listed as `LLMRouter`) is running alongside your model deployment.

Contributor

Take a screenshot?

Contributor

What would be the `ray list` command equivalent here, or the `serve status` equivalent, so that it is agent-friendly?

Comment thread doc/source/serve/llm/user-guides/direct-streaming.md Outdated
Comment thread doc/source/llm/doc_code/serve/direct_streaming/direct_streaming_config.yaml Outdated
Comment thread doc/source/serve/llm/architecture/routing-policies.md Outdated
Comment thread doc/source/serve/llm/user-guides/direct-streaming.md Outdated
- Lead with the Enable section (env vars + deploy) instead of burying it.
- Drop internal-detail prose (capacity semaphore) and the single-model
  framing in "When to use" (already covered in Limitations).
- Note the ingress event loop is shared between the request path (TTFT)
  and the streamed-token path (TPOT).
- Clarify the routing-policies cross-reference and add an explicit link.
- Switch doc_code examples to Qwen/Qwen3.5-0.8B with the vLLM recipe
  engine args (tool-call and reasoning parsers).

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 21a8624.

… streaming verification

Replace the one-line dashboard pointer in the verification section with a
serve status example and a dashboard screenshot, both showing the LLMServer
model deployment alongside the LLMRouter ingress request router. Note that
LLMRouter replaces OpenAiIngress and is the signal that direct streaming is on.
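The check described in this commit can also be expressed as a small script-friendly test. The sketch below assumes the deployment names have already been extracted from `serve status` output. It uses substring matching because the exact names, including any prefixes or suffixes, are not shown on this page. `direct_streaming_active` and the sample names are hypothetical.

```python
from typing import Iterable


def direct_streaming_active(deployment_names: Iterable[str]) -> bool:
    """True when an LLMRouter deployment is present and OpenAiIngress is not."""
    names = list(deployment_names)
    has_router = any("LLMRouter" in name for name in names)
    has_default_ingress = any("OpenAiIngress" in name for name in names)
    return has_router and not has_default_ingress


# Hypothetical deployment lists for the two ingress modes.
direct = direct_streaming_active(["LLMServer:qwen", "LLMRouter"])
default = direct_streaming_active(["LLMServer:qwen", "OpenAiIngress"])
```

Here `direct` is true and `default` is false, matching the commit's statement that `LLMRouter` replacing `OpenAiIngress` is the signal that direct streaming is on.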

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Replace the direct streaming ASCII diagram in How it works with an
architecture figure showing HAProxy calling /internal/route on the LLMRouter
to pick an LLMServer replica, then sending the request and response directly
to that replica. Fold the default-path diagram into prose.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Labels

`docs` (An issue or change related to documentation), `go` (add ONLY when ready to merge, run all tests), `serve` (Ray Serve Related Issue)

4 participants