feat(inference): add native vLLM Responses API passthrough #5602
Draft
franciscojavierarceo wants to merge 8 commits into ogx-ai:main
Conversation
Allow vLLM providers to forward Responses API requests directly to vLLM's /v1/responses endpoint instead of decomposing them into chat completions, preserving reasoning tokens, structured token accounting, and native streaming events. Controlled by the native_responses config flag (default false). Falls back to chat completions when the provider does not support native responses.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
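A minimal sketch of the routing decision the flag controls. The config class and function names here are illustrative stand-ins, not the actual llama-stack definitions:

```python
# Hypothetical sketch of the adapter config and routing decision
# described above -- names are illustrative, not the real llama-stack code.
from dataclasses import dataclass


@dataclass
class VLLMAdapterConfig:
    url: str
    native_responses: bool = False  # opt-in; default preserves current behavior


def choose_path(config: VLLMAdapterConfig, provider_supports_native: bool) -> str:
    """Forward to /v1/responses only when the operator opted in AND the
    provider can actually serve it; otherwise decompose into chat completions."""
    if config.native_responses and provider_supports_native:
        return "native_responses"
    return "chat_completions"


print(choose_path(VLLMAdapterConfig(url="http://localhost:8000"), True))
# chat_completions (the flag defaults to False)
```

The point of the default is backwards compatibility: existing run configs keep the chat-completions behavior unless an operator explicitly opts in.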
Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Contributor
✅ Recordings committed successfully. Recordings from the integration tests have been committed to this PR.
…for native responses

Replace the try/except NotImplementedError fallback pattern with an explicit supports_native_responses property on InferenceProvider and check_native_responses_support() on the router. The orchestrator now checks the config flag before choosing the inference path, making the routing decision deterministic based on the run config rather than relying on exception-based control flow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
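The capability-flag pattern this commit describes can be sketched as follows. Class and method names approximate the real interfaces; the bodies are stand-ins:

```python
# Illustrative sketch: an explicit boolean capability property instead of
# try/except NotImplementedError control flow. Names approximate the real
# llama-stack interfaces.
class InferenceProvider:
    @property
    def supports_native_responses(self) -> bool:
        return False  # conservative default for all providers


class VLLMInferenceAdapter(InferenceProvider):
    def __init__(self, native_responses: bool = False):
        self._native_responses = native_responses

    @property
    def supports_native_responses(self) -> bool:
        return self._native_responses  # driven by the run config flag


def check_native_responses_support(provider: InferenceProvider) -> bool:
    # The router can now make a deterministic decision up front,
    # instead of attempting the call and catching NotImplementedError.
    return provider.supports_native_responses


assert check_native_responses_support(InferenceProvider()) is False
assert check_native_responses_support(VLLMInferenceAdapter(native_responses=True)) is True
```

Exception-based fallback made the routing decision observable only after a failed call; a property makes it inspectable before any request is issued.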
Add documentation explaining the native_responses config flag: when to use it, what it changes vs what stays the same, and how to verify it works. Currently only supported for remote::vllm.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
…yError on Pydantic models

Pydantic BaseModel subclasses raise KeyError (not AttributeError) for undefined fields, causing all non-vLLM providers to crash when the router accesses provider.supports_native_responses directly. Use getattr with a default of False instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
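The defensive-access pattern the fix adopts, sketched with plain stand-in classes (the real providers are Pydantic models, omitted here to keep the example dependency-free):

```python
# Sketch of the defensive-access fix: read the capability flag with
# getattr() and a False default, so providers that never define the
# attribute are treated as "unsupported" rather than crashing the router.
class LegacyProvider:
    """Stand-in for a provider that predates the capability flag."""


class NativeProvider:
    """Stand-in for a provider that declares the flag."""

    supports_native_responses = True


def supports_native(provider) -> bool:
    # Direct attribute access would raise on providers that lack the
    # field; getattr with a default degrades gracefully instead.
    return getattr(provider, "supports_native_responses", False)


assert supports_native(LegacyProvider()) is False
assert supports_native(NativeProvider()) is True
```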
…e events

Fix three protocol regressions in the native vLLM responses passthrough:

1. Streaming is now incremental instead of buffered: _process_native_response_events yields events as they arrive via async generator, matching the CC path pattern.
2. Tool-loop follow-up turns no longer crash: CC-format messages are serialized to dicts instead of validated through TypeAdapter[OpenAIResponseInput], which rejected OpenAIAssistantMessageParam and OpenAIToolMessageParam on subsequent inference turns.
3. Provider lifecycle events (created, in_progress, completed) are filtered out, since Llama Stack's create_response() already emits its own with correct response_id and sequence numbers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
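The first and third fixes can be sketched together: an async generator that forwards provider events as they arrive while dropping the provider's own lifecycle events. The event dicts and function name are simplified stand-ins for the real stream types:

```python
# Illustrative sketch: incremental (non-buffered) streaming plus filtering
# of provider lifecycle events, which the orchestrator re-emits itself with
# the correct response_id and sequence numbers. Event shapes are simplified.
import asyncio

LIFECYCLE_TYPES = {"response.created", "response.in_progress", "response.completed"}


async def process_native_response_events(provider_stream):
    async for event in provider_stream:        # yield as events arrive; no buffering
        if event["type"] in LIFECYCLE_TYPES:   # orchestrator owns these events
            continue
        yield event


async def fake_vllm_stream():
    """Stand-in for the provider's SSE event stream."""
    for event in [
        {"type": "response.created"},
        {"type": "response.output_text.delta", "delta": "hi"},
        {"type": "response.completed"},
    ]:
        yield event


async def main():
    return [e async for e in process_native_response_events(fake_vllm_stream())]


print(asyncio.run(main()))  # only the delta event survives the filter
```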
…r handling, add violation check

Fix four issues from self-review:

1. Serialize tool_choice and truncation via model_dump() in the vLLM response payload to prevent TypeError when httpx JSON-encodes Pydantic model objects.
2. Narrow the SSE parser's except clause from bare Exception to (json.JSONDecodeError, ValidationError) so unexpected errors propagate instead of being silently swallowed.
3. Add missing violation_detected early return on the native responses path, matching the CC path's behavior when output guardrails detect a violation during streaming.
4. Remove unused output_messages parameter from _process_native_response_events.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
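Fix 2 is the classic "narrow the except clause" change. A stdlib-only sketch (the real parser also catches pydantic's ValidationError, omitted here; the function name is a stand-in):

```python
# Sketch of the narrowed SSE-parsing fix: only malformed JSON payloads are
# skipped, while any unexpected error propagates instead of being silently
# swallowed by a bare `except Exception`.
import json


def parse_sse_data_lines(lines):
    events = []
    for line in lines:
        if not line.startswith("data: "):
            continue                     # ignore comments / other SSE fields
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break                        # end-of-stream sentinel
        try:
            events.append(json.loads(payload))
        except json.JSONDecodeError:
            continue                     # skip malformed chunk; let all else surface
    return events


good = 'data: {"type": "response.output_text.delta"}'
bad = "data: {not json"
assert parse_sse_data_lines([good, bad, "data: [DONE]"]) == [
    {"type": "response.output_text.delta"}
]
```

The trade-off: a bare `Exception` hides real bugs (wrong types, attribute errors) behind a dropped event, whereas the narrowed clause only tolerates the one failure mode the parser is designed to survive.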
Add native responses coverage to developer-facing docs:

- ARCHITECTURE.md: document optional provider capabilities pattern and the getattr safety for Pydantic BaseModel subclasses
- Remote inference README: document native responses support and how providers can implement openai_response()
- Responses provider README: document native passthrough in the StreamingResponseOrchestrator
- InferenceRouter.openai_response(): add docstring

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
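The "how providers can implement openai_response()" pattern the docs cover can be sketched as an optional protocol method. Names approximate the real interfaces, and the adapter body is a stand-in (the real one would POST to the provider's /v1/responses endpoint):

```python
# Hedged sketch of the optional-method pattern: the base protocol defines
# openai_response() but raises NotImplementedError by default, so only
# providers that implement it opt into native passthrough.
import asyncio


class InferenceProvider:
    async def openai_response(self, request: dict) -> dict:
        raise NotImplementedError("provider does not support native responses")


class VLLMInferenceAdapter(InferenceProvider):
    async def openai_response(self, request: dict) -> dict:
        # Stand-in body; the real adapter forwards the request to
        # {base_url}/v1/responses and returns the provider's response.
        return {"id": "resp_demo", "status": "completed", "output": []}


print(asyncio.run(VLLMInferenceAdapter().openai_response({"input": "hi"})))
```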
Summary
- Add `native_responses` config flag (default `false`) to the vLLM inference adapter, allowing operators to route Responses API requests directly to vLLM's `/v1/responses` endpoint instead of decomposing into chat completions
- Add `openai_response()` as an optional method on the `InferenceProvider` protocol (default `NotImplementedError`), with routing in `InferenceRouter` and implementation in `VLLMInferenceAdapter`
- Update `StreamingResponseOrchestrator` to try native responses first, falling back transparently to the existing chat-completions path when the provider doesn't support it

Motivation
The chat-completions path loses information during the Responses → CC → Responses conversion:
- `reasoning_tokens`, `cached_tokens`, and per-turn breakdowns

vLLM's native `/v1/responses` endpoint returns `ReasoningItem` objects, structured token usage, and Responses-format streaming events directly, eliminating the translation layer.

What changes
| | Chat-completions path | Native responses path |
| --- | --- | --- |
| Call | `openai_chat_completion()` → `/v1/chat/completions` | `openai_response()` → `/v1/responses` (with CC fallback) |
| Reasoning | `delta.reasoning_content` hack | `ReasoningItem` from vLLM |
| Token usage | `prompt_tokens`/`completion_tokens` only | adds `reasoning_tokens`, `cached_tokens` |

Everything else stays the same: tool calling, state management, persistence, guardrails, vector stores, files, compaction, prompts — all owned by llama-stack.
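The token-accounting row can be made concrete with a side-by-side of the two usage shapes. Field names follow the OpenAI-style Responses usage layout; the exact payloads vLLM emits may differ, and the numbers are made up:

```python
# Illustrative comparison of token accounting on the two paths.
# The chat-completions path exposes flat counters only:
cc_usage = {"prompt_tokens": 120, "completion_tokens": 300, "total_tokens": 420}

# The native responses path (OpenAI-style shape, assumed) carries
# structured detail the CC path cannot express:
native_usage = {
    "input_tokens": 120,
    "input_tokens_details": {"cached_tokens": 64},
    "output_tokens": 300,
    "output_tokens_details": {"reasoning_tokens": 180},
    "total_tokens": 420,
}

reasoning = native_usage["output_tokens_details"]["reasoning_tokens"]
assert reasoning == 180 and "reasoning_tokens" not in cc_usage
```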
Test plan
- `uv run pytest tests/unit/providers/responses/ tests/unit/providers/remote/inference/vllm/ -x`: 258 passed
- `uv run pre-commit run --all-files`: all hooks pass, including mypy
- `openai/gpt-oss-120b` with `--setup vllm`

🤖 Generated with Claude Code