
feat(inference): add native vLLM Responses API passthrough #5602

Draft
franciscojavierarceo wants to merge 8 commits into ogx-ai:main from franciscojavierarceo:feat/native-vllm-responses

Conversation

@franciscojavierarceo
Collaborator

Summary

  • Adds native_responses config flag (default false) to the vLLM inference adapter, allowing operators to route Responses API requests directly to vLLM's /v1/responses endpoint instead of decomposing into chat completions
  • Adds openai_response() as an optional method on InferenceProvider protocol (default NotImplementedError), with routing in InferenceRouter and implementation in VLLMInferenceAdapter
  • Modifies StreamingResponseOrchestrator to try native responses first, falling back transparently to the existing chat-completions path when the provider doesn't support it
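The optional-method pattern described above might be sketched roughly as follows. This is a simplified, hypothetical illustration of the shape of the design, not the actual llama-stack signatures; the class names match the PR, but the parameter and return types here are stand-in dicts.

```python
class InferenceProvider:
    """Base protocol: providers override only the methods they support."""

    async def openai_response(self, params: dict) -> dict:
        # Default implementation signals no native Responses support,
        # letting the orchestrator fall back to chat completions.
        raise NotImplementedError(
            f"{type(self).__name__} does not implement openai_response"
        )


class VLLMInferenceAdapter(InferenceProvider):
    """vLLM adapter that can forward directly to /v1/responses."""

    def __init__(self, native_responses: bool = False):
        self.native_responses = native_responses

    async def openai_response(self, params: dict) -> dict:
        if not self.native_responses:
            raise NotImplementedError("native_responses is disabled")
        # The real adapter would POST params to vLLM's /v1/responses here.
        return {"endpoint": "/v1/responses", "params": params}
```

The key property is that callers can treat `NotImplementedError` as a capability signal rather than a hard failure.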

Motivation

The chat-completions path loses information during the Responses → chat completions (CC) → Responses conversion:

  • Reasoning tokens require fragile extraction through non-standard CC fields and custom wrapper types; non-streaming reasoning is not supported at all
  • Token accounting loses reasoning_tokens, cached_tokens, and per-turn breakdowns
  • Format conversion overhead adds complexity and potential for information loss on every turn

vLLM's native /v1/responses endpoint returns ReasoningItem objects, structured token usage, and Responses-format streaming events directly — eliminating the translation layer.

What changes

| Component | Before | After |
| --- | --- | --- |
| Inference call | `openai_chat_completion()` → `/v1/chat/completions` | `openai_response()` → `/v1/responses` (with CC fallback) |
| Reasoning | Extracted via `delta.reasoning_content` hack | Native `ReasoningItem` from vLLM |
| Token accounting | Aggregate `prompt_tokens`/`completion_tokens` only | Structured `reasoning_tokens`, `cached_tokens` |
| Streaming events | Synthesized from flat CC delta chunks | Parsed directly from vLLM's structured events |

Everything else stays the same: tool calling, state management, persistence, guardrails, vector stores, files, compaction, prompts — all owned by llama-stack.
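In a run config, enabling the passthrough for a vLLM provider might look something like the fragment below. This is illustrative only; the exact keys and structure of the provider config may differ from the real llama-stack schema.

```yaml
providers:
  inference:
    - provider_id: vllm
      provider_type: remote::vllm
      config:
        url: http://localhost:8000/v1
        native_responses: true  # default false; routes Responses API calls to vLLM's /v1/responses
```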

Test plan

  • 22 new unit tests (16 vLLM adapter + 6 builtin passthrough)
  • uv run pytest tests/unit/providers/responses/ tests/unit/providers/remote/inference/vllm/ -x — 258 passed
  • uv run pre-commit run --all-files — all hooks pass including mypy
  • End-to-end tested against live vLLM server on DGX with openai/gpt-oss-120b
  • Integration tests with --setup vllm

🤖 Generated with Claude Code

Allow vLLM providers to forward Responses API requests directly to
vLLM's /v1/responses endpoint instead of decomposing into chat
completions, preserving reasoning tokens, structured token accounting,
and native streaming events. Controlled by native_responses config flag
(default false). Falls back to chat completions when the provider does
not support native responses.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
The meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Apr 21, 2026.
Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@github-actions
Contributor

Recordings committed successfully

Recordings from the integration tests have been committed to this PR.

View commit workflow

franciscojavierarceo and others added 6 commits April 21, 2026 13:48
…for native responses

Replace the try/except NotImplementedError fallback pattern with an
explicit supports_native_responses property on InferenceProvider and
check_native_responses_support() on the router. The orchestrator now
checks the config flag before choosing the inference path, making the
routing decision deterministic based on the run config rather than
relying on exception-based control flow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
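The deterministic capability check this commit describes could be sketched as below. The names mirror those in the commit message, but the bodies are simplified guesses at the shape of the change, not the actual implementation.

```python
class InferenceProvider:
    """Base provider: native Responses support is opt-in."""

    # Default capability flag; concrete adapters may override it.
    supports_native_responses: bool = False


class VLLMInferenceAdapter(InferenceProvider):
    def __init__(self, native_responses: bool = False):
        self._native_responses = native_responses

    @property
    def supports_native_responses(self) -> bool:
        # Deterministic: driven by the run config, not by raising
        # NotImplementedError and catching it downstream.
        return self._native_responses


def check_native_responses_support(provider: object) -> bool:
    """Router-side check. getattr with a default never raises, even for
    model classes that throw on access to undefined attributes."""
    return bool(getattr(provider, "supports_native_responses", False))
```

The orchestrator can then branch on `check_native_responses_support(provider)` before issuing the inference call, instead of using exceptions for control flow.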
Add documentation explaining the native_responses config flag,
when to use it, what it changes vs what stays the same, and
how to verify it works. Currently only supported for remote::vllm.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
…yError on Pydantic models

Pydantic BaseModel subclasses raise KeyError (not AttributeError) for
undefined fields, causing all non-vLLM providers to crash when the
router accesses provider.supports_native_responses directly. Use
getattr with a default of False instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
…e events

Fix three protocol regressions in native vLLM responses passthrough:

1. Streaming is now incremental instead of buffered — _process_native_response_events
   yields events as they arrive via async generator, matching the CC path pattern.

2. Tool-loop follow-up turns no longer crash — CC-format messages are serialized to
   dicts instead of validated through TypeAdapter[OpenAIResponseInput], which rejected
   OpenAIAssistantMessageParam and OpenAIToolMessageParam on subsequent inference turns.

3. Provider lifecycle events (created, in_progress, completed) are filtered out since
   Llama Stack's create_response() already emits its own with correct response_id and
   sequence numbers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
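Points 1 and 3 above can be sketched together as an async generator that forwards provider events as they arrive while dropping duplicated lifecycle events. The event-type strings and dict-shaped events here are illustrative assumptions, not the real llama-stack event model.

```python
from typing import AsyncIterator

# Lifecycle events llama-stack's create_response() already emits itself,
# so the provider's copies are filtered out (hypothetical event names).
_PROVIDER_LIFECYCLE = {"response.created", "response.in_progress", "response.completed"}


async def process_native_response_events(
    events: AsyncIterator[dict],
) -> AsyncIterator[dict]:
    """Yield provider events incrementally, skipping duplicated lifecycle events."""
    async for event in events:
        if event.get("type") in _PROVIDER_LIFECYCLE:
            # The stack emits its own versions with the correct
            # response_id and sequence numbers.
            continue
        yield event  # forwarded as it arrives -- no buffering
```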
…r handling, add violation check

Fix four issues from self-review:

1. Serialize tool_choice and truncation via model_dump() in the vLLM
   response payload to prevent TypeError when httpx JSON-encodes
   Pydantic model objects.

2. Narrow the SSE parser's except clause from bare Exception to
   (json.JSONDecodeError, ValidationError) so unexpected errors
   propagate instead of being silently swallowed.

3. Add missing violation_detected early return on the native responses
   path, matching the CC path's behavior when output guardrails
   detect a violation during streaming.

4. Remove unused output_messages parameter from
   _process_native_response_events.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
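The narrowed error handling in point 2 might look like the sketch below, which tolerates only malformed JSON and lets anything unexpected propagate. The real code also catches pydantic's `ValidationError`; this standalone version omits it to stay stdlib-only, and the function name is hypothetical.

```python
import json
from typing import Optional


def parse_sse_data(line: str) -> Optional[dict]:
    """Parse one SSE 'data:' line into an event dict.

    Malformed JSON yields None (the chunk is skipped); any other
    exception propagates instead of being silently swallowed.
    """
    if not line.startswith("data:"):
        return None  # comments, blank lines, other SSE fields
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream sentinel
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None  # skip only malformed chunks; unexpected errors still raise
```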
Add native responses coverage to developer-facing docs:
- ARCHITECTURE.md: document optional provider capabilities pattern
  and the getattr safety for Pydantic BaseModel subclasses
- Remote inference README: document native responses support and
  how providers can implement openai_response()
- Responses provider README: document native passthrough in the
  StreamingResponseOrchestrator
- InferenceRouter.openai_response(): add docstring

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>