
feat(inference): add native vLLM Responses API passthrough #5602

Draft
franciscojavierarceo wants to merge 8 commits into ogx-ai:main from franciscojavierarceo:feat/native-vllm-responses

Conversation

@franciscojavierarceo
Collaborator

Summary

  • Adds native_responses config flag (default false) to the vLLM inference adapter, allowing operators to route Responses API requests directly to vLLM's /v1/responses endpoint instead of decomposing into chat completions
  • Adds openai_response() as an optional method on InferenceProvider protocol (default NotImplementedError), with routing in InferenceRouter and implementation in VLLMInferenceAdapter
  • Modifies StreamingResponseOrchestrator to try native responses first, falling back transparently to the existing chat-completions path when the provider doesn't support it
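The optional-method pattern described above might be sketched roughly as follows. This is a simplified, hypothetical illustration of the shape of the design, not the actual llama-stack signatures; the class names match the PR, but the parameter and return types here are stand-in dicts.

```python
class InferenceProvider:
    """Base protocol: providers override only the methods they support."""

    async def openai_response(self, params: dict) -> dict:
        # Default implementation signals no native Responses support,
        # letting the orchestrator fall back to chat completions.
        raise NotImplementedError(
            f"{type(self).__name__} does not implement openai_response"
        )


class VLLMInferenceAdapter(InferenceProvider):
    """vLLM adapter that can forward directly to /v1/responses."""

    def __init__(self, native_responses: bool = False):
        self.native_responses = native_responses

    async def openai_response(self, params: dict) -> dict:
        if not self.native_responses:
            raise NotImplementedError("native_responses is disabled")
        # The real adapter would POST params to vLLM's /v1/responses here.
        return {"endpoint": "/v1/responses", "params": params}
```

The key property is that callers can treat `NotImplementedError` as a capability signal rather than a hard failure.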

Motivation

The chat-completions path loses information during the Responses → chat completions (CC) → Responses conversion:

  • Reasoning tokens require fragile extraction through non-standard CC fields and custom wrapper types; non-streaming reasoning is not supported at all
  • Token accounting loses reasoning_tokens, cached_tokens, and per-turn breakdowns
  • Format conversion overhead adds complexity and potential for information loss on every turn

vLLM's native /v1/responses endpoint returns ReasoningItem objects, structured token usage, and Responses-format streaming events directly — eliminating the translation layer.

What changes

| Component | Before | After |
| --- | --- | --- |
| Inference call | `openai_chat_completion()` → `/v1/chat/completions` | `openai_response()` → `/v1/responses` (with CC fallback) |
| Reasoning | Extracted via `delta.reasoning_content` hack | Native `ReasoningItem` from vLLM |
| Token accounting | Aggregate `prompt_tokens`/`completion_tokens` only | Structured `reasoning_tokens`, `cached_tokens` |
| Streaming events | Synthesized from flat CC delta chunks | Parsed directly from vLLM's structured events |

Everything else stays the same: tool calling, state management, persistence, guardrails, vector stores, files, compaction, prompts — all owned by llama-stack.
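In a run config, enabling the passthrough for a vLLM provider might look something like the fragment below. This is illustrative only; the exact keys and structure of the provider config may differ from the real llama-stack schema.

```yaml
providers:
  inference:
    - provider_id: vllm
      provider_type: remote::vllm
      config:
        url: http://localhost:8000/v1
        native_responses: true  # default false; routes Responses API calls to vLLM's /v1/responses
```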

Test plan

  • 22 new unit tests (16 vLLM adapter + 6 builtin passthrough)
  • uv run pytest tests/unit/providers/responses/ tests/unit/providers/remote/inference/vllm/ -x — 258 passed
  • uv run pre-commit run --all-files — all hooks pass including mypy
  • End-to-end tested against live vLLM server on DGX with openai/gpt-oss-120b
  • Integration tests with --setup vllm

🤖 Generated with Claude Code

Allow vLLM providers to forward Responses API requests directly to
vLLM's /v1/responses endpoint instead of decomposing into chat
completions, preserving reasoning tokens, structured token accounting,
and native streaming events. Controlled by native_responses config flag
(default false). Falls back to chat completions when the provider does
not support native responses.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
The meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Apr 21, 2026.
Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@github-actions
Contributor

Recordings committed successfully

Recordings from the integration tests have been committed to this PR.

View commit workflow

franciscojavierarceo and others added 6 commits April 21, 2026 13:48
…for native responses

Replace the try/except NotImplementedError fallback pattern with an
explicit supports_native_responses property on InferenceProvider and
check_native_responses_support() on the router. The orchestrator now
checks the config flag before choosing the inference path, making the
routing decision deterministic based on the run config rather than
relying on exception-based control flow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
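The deterministic capability check this commit describes could be sketched as below. The names mirror those in the commit message, but the bodies are simplified guesses at the shape of the change, not the actual implementation.

```python
class InferenceProvider:
    """Base provider: native Responses support is opt-in."""

    # Default capability flag; concrete adapters may override it.
    supports_native_responses: bool = False


class VLLMInferenceAdapter(InferenceProvider):
    def __init__(self, native_responses: bool = False):
        self._native_responses = native_responses

    @property
    def supports_native_responses(self) -> bool:
        # Deterministic: driven by the run config, not by raising
        # NotImplementedError and catching it downstream.
        return self._native_responses


def check_native_responses_support(provider: object) -> bool:
    """Router-side check. getattr with a default never raises, even for
    model classes that throw on access to undefined attributes."""
    return bool(getattr(provider, "supports_native_responses", False))
```

The orchestrator can then branch on `check_native_responses_support(provider)` before issuing the inference call, instead of using exceptions for control flow.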
Add documentation explaining the native_responses config flag,
when to use it, what it changes vs what stays the same, and
how to verify it works. Currently only supported for remote::vllm.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
…yError on Pydantic models

Pydantic BaseModel subclasses raise KeyError (not AttributeError) for
undefined fields, causing all non-vLLM providers to crash when the
router accesses provider.supports_native_responses directly. Use
getattr with a default of False instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
…e events

Fix three protocol regressions in native vLLM responses passthrough:

1. Streaming is now incremental instead of buffered — _process_native_response_events
   yields events as they arrive via async generator, matching the CC path pattern.

2. Tool-loop follow-up turns no longer crash — CC-format messages are serialized to
   dicts instead of validated through TypeAdapter[OpenAIResponseInput], which rejected
   OpenAIAssistantMessageParam and OpenAIToolMessageParam on subsequent inference turns.

3. Provider lifecycle events (created, in_progress, completed) are filtered out since
   Llama Stack's create_response() already emits its own with correct response_id and
   sequence numbers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
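Points 1 and 3 above can be sketched together as an async generator that forwards provider events as they arrive while dropping duplicated lifecycle events. The event-type strings and dict-shaped events here are illustrative assumptions, not the real llama-stack event model.

```python
from typing import AsyncIterator

# Lifecycle events llama-stack's create_response() already emits itself,
# so the provider's copies are filtered out (hypothetical event names).
_PROVIDER_LIFECYCLE = {"response.created", "response.in_progress", "response.completed"}


async def process_native_response_events(
    events: AsyncIterator[dict],
) -> AsyncIterator[dict]:
    """Yield provider events incrementally, skipping duplicated lifecycle events."""
    async for event in events:
        if event.get("type") in _PROVIDER_LIFECYCLE:
            # The stack emits its own versions with the correct
            # response_id and sequence numbers.
            continue
        yield event  # forwarded as it arrives -- no buffering
```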
…r handling, add violation check

Fix four issues from self-review:

1. Serialize tool_choice and truncation via model_dump() in the vLLM
   response payload to prevent TypeError when httpx JSON-encodes
   Pydantic model objects.

2. Narrow the SSE parser's except clause from bare Exception to
   (json.JSONDecodeError, ValidationError) so unexpected errors
   propagate instead of being silently swallowed.

3. Add missing violation_detected early return on the native responses
   path, matching the CC path's behavior when output guardrails
   detect a violation during streaming.

4. Remove unused output_messages parameter from
   _process_native_response_events.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
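The narrowed error handling in point 2 might look like the sketch below, which tolerates only malformed JSON and lets anything unexpected propagate. The real code also catches pydantic's `ValidationError`; this standalone version omits it to stay stdlib-only, and the function name is hypothetical.

```python
import json
from typing import Optional


def parse_sse_data(line: str) -> Optional[dict]:
    """Parse one SSE 'data:' line into an event dict.

    Malformed JSON yields None (the chunk is skipped); any other
    exception propagates instead of being silently swallowed.
    """
    if not line.startswith("data:"):
        return None  # comments, blank lines, other SSE fields
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream sentinel
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None  # skip only malformed chunks; unexpected errors still raise
```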
Add native responses coverage to developer-facing docs:
- ARCHITECTURE.md: document optional provider capabilities pattern
  and the getattr safety for Pydantic BaseModel subclasses
- Remote inference README: document native responses support and
  how providers can implement openai_response()
- Responses provider README: document native passthrough in the
  StreamingResponseOrchestrator
- InferenceRouter.openai_response(): add docstring

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>