Skip to content

Latest commit

 

History

History
203 lines (165 loc) · 8.15 KB

File metadata and controls

203 lines (165 loc) · 8.15 KB

OpenAI Responses API subset (/v1/responses)

mlxcel serve and mlxcel-server expose a Phase-1 subset of OpenAI's Responses API. Basic client.responses.create(...) and streaming flows can be used with the OpenAI Python SDK when base_url points at the mlxcel server, but this is not a full implementation of every OpenAI Responses feature.

Implementation source map:

Module Responsibility
src/server/types/responses_request.rs Request types.
src/server/types/responses_response.rs Response types.
src/server/types/responses_stream.rs SSE event enum.
src/server/responses_translator.rs Responses ↔ chat-completions translation.
src/server/responses_store.rs In-memory response store.
src/server/conversation_store.rs In-memory conversation transcript store.
src/server/routes/responses.rs Route handlers.

Implemented endpoints

Method Path Description
POST /v1/responses Create a response, either non-streaming or streaming.
GET /v1/responses/{id} Retrieve a stored response.
DELETE /v1/responses/{id} Delete a stored response.
POST /v1/responses/{id}/cancel Best-effort cancellation / cancellation marking.

Aliases without /v1 are also mounted for the same implemented routes: /responses, /responses/{id}, and /responses/{id}/cancel.

The following OpenAI-style surfaces are not mounted in this implementation:

  • GET /v1/responses/{id}/input_items
  • POST /v1/responses/compact
  • POST /v1/responses/input_tokens

Quickstart

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-local")

resp = client.responses.create(
    model="qwen3-0.6b-4bit",
    input="Reply with: hello",
    max_output_tokens=64,
)
print(resp.status)
print(resp.output_text)

with client.responses.stream(
    model="qwen3-0.6b-4bit",
    input="Count to 5.",
    max_output_tokens=64,
) as stream:
    for event in stream:
        print(event.type, getattr(event, "delta", ""))
    final = stream.get_final_response()
    print(final.usage)

Supported request fields

Field Status Notes
model required Must match the loaded model alias/path accepted by the server.
input supported String or typed input item array.
instructions supported Prepended as a system-style message; not inherited through previous_response_id.
tools function-only Only {"type":"function", ...} is accepted.
tool_choice supported subset String or named function choice compatible with chat-completions tooling.
parallel_tool_calls accepted Forwarded to existing tool-call handling.
text.format supported subset text and json_schema shapes are handled through existing structured-output code.
reasoning echoed/advisory Recorded and echoed; model-specific thinking behavior remains template/runtime dependent.
conversation supported String id or { "id": "..." }; uses in-memory conversation store.
previous_response_id supported Rehydrates stored prior input/output items. Mutually exclusive with conversation.
store supported Defaults to true; false skips persistence.
stream supported Streams typed SSE events.
max_output_tokens supported Must be greater than zero.
max_tool_calls supported Soft cap on emitted function-call items.
temperature, top_p, top_logprobs supported subset Mapped to chat-completions sampling fields.
metadata supported Maximum 16 entries.
prompt_cache_key accepted Forwarded to prompt-cache plumbing.
user, safety_identifier accepted user is used when both are present; safety_identifier is used as a fallback.
background rejected when true Async polling is not implemented.
truncation only disabled Other values, including auto, return 400.
service_tier accepted Echoed/ignored; no scheduling tier is implemented.

Input items

Phase 1 supports these typed items:

[
  {"type":"message", "role":"user", "content":"hello"},
  {"type":"message", "role":"system", "content":[{"type":"text", "text":"sys"}]},
  {"type":"function_call", "call_id":"call_abc", "name":"f", "arguments":"{}"},
  {"type":"function_call_output", "call_id":"call_abc", "output":"ok"},
  {"type":"reasoning", "content":[{"type":"reasoning_text", "text":"..."}]}
]

developer role is treated like system. Reasoning input items are accepted but are not fed back into the next prompt.

Message content parts reuse mlxcel's chat-completions content part types. This means text, image_url, video_url, and input_audio can deserialize, but actual execution still depends on the loaded model's media support. OpenAI Responses-specific input_image / input_file compatibility is not complete.

Response shape

Responses use an OpenAI-like object shape:

{
  "id": "resp_...",
  "object": "response",
  "created_at": 1234.0,
  "completed_at": 1235.0,
  "status": "completed",
  "model": "...",
  "output": [
    {"type":"reasoning", "id":"rs_...", "status":"completed", "content":[...]},
    {"type":"function_call", "id":"fc_...", "call_id":"call_...", "name":"...", "arguments":"{}", "status":"completed"},
    {"type":"message", "id":"msg_...", "role":"assistant", "status":"completed", "content":[...]}
  ],
  "output_text": "...",
  "usage": {
    "input_tokens": 12,
    "output_tokens": 34,
    "total_tokens": 46,
    "input_tokens_details": {"cached_tokens": 0},
    "output_tokens_details": {"reasoning_tokens": 0}
  }
}

Several request fields are echoed back when present. Treat this as compatibility surface, not as proof that every echoed field changes runtime behavior.

Streaming events

SSE frames are typed and include a monotonic sequence_number per response. Phase 1 emits events such as:

  • response.created
  • response.in_progress
  • response.output_item.added
  • response.content_part.added
  • response.output_text.delta
  • response.output_text.done
  • response.content_part.done
  • response.output_item.done
  • response.function_call_arguments.delta
  • response.function_call_arguments.done
  • response.reasoning_text.delta
  • response.reasoning_text.done
  • response.completed
  • failure/incomplete/error events on error paths

Response and conversation stores

The stores are in memory and are bounded by entry count and TTL.

Flag Default Env var Notes
--responses-store-max-entries 1024 LLAMA_ARG_RESPONSES_STORE_MAX_ENTRIES 0 disables response persistence.
--responses-store-ttl-secs 3600 LLAMA_ARG_RESPONSES_STORE_TTL_SECS 0 disables TTL.
--conversation-store-max-entries 256 LLAMA_ARG_CONVERSATION_STORE_MAX_ENTRIES 0 disables conversations.
--conversation-store-ttl-secs 3600 LLAMA_ARG_CONVERSATION_STORE_TTL_SECS 0 disables TTL.

When response storage is disabled, retrieve/delete/cancel-by-id and previous_response_id chaining return an error. When conversation storage is disabled, requests using conversation return an error.

Chaining semantics

  • previous_response_id loads the stored response's input items and output items as prior conversation history, then appends the new input.
  • conversation loads and appends to an in-memory transcript by id.
  • The two fields are mutually exclusive.
  • instructions from the referenced prior response are not carried over.

Unsupported tool types

Only function tools are accepted. Built-in/external tool types such as web_search, file_search, computer_use_preview, code_interpreter, image_generation, mcp, custom, apply_patch, and function_shell return 400 responses. mlxcel does not execute external tools for the Responses API.

Differences from OpenAI's full API

Notable gaps:

  • no background job mode;
  • no input-items pagination endpoint;
  • no server-side compaction endpoint;
  • no token-count endpoint;
  • no built-in tools or MCP connector execution;
  • no disk-persisted response store;
  • incomplete OpenAI Responses multimodal part compatibility.