Commit 0741eb5

QVAC-18733 feat[api]: add openai responses routes with in-memory store (#2030)
* QVAC-18733 feat[api]: add OpenAI Responses routes with in-memory store

  Implement POST /v1/responses (blocking + SSE), GET/DELETE /v1/responses/{id}, GET /v1/responses/{id}/input_items, previous_response_id chaining, LRU+TTL store, X-QVAC-Stub: responses-volatile header, and startup banner.

* fix: align Responses streaming with finalized response and add usage stats

  - Approach (b): always include the assistant `message` item in `response.output[0]`, even when tool calls are present, so the streamed item tree matches `response.completed`.
  - Pre-allocate `msgItemId` and `fcItemIds` once and reuse them across SSE events and the finalized `output[]`, fixing client-side accumulation by `item_id`.
  - Use distinct `output_index` per tool call (1..n) and set `item_id` on `response.function_call_arguments.delta`/`.done` to the function-call item id (was the OpenAI `call_id`, causing collisions and wrong wiring).
  - Populate `required_action.submit_tool_outputs.tool_calls` so OpenAI clients can satisfy tool calls instead of hanging in `requires_action` with no payload.
  - Drop the duplicate `previous_response_id` lookup in `handlePostResponses`.
  - Drop `parallel_tool_calls` from the unsupported-params log: it is honored.
  - Recognise `function_call_output` (-> `tool` role) and `function_call` (-> synthesized assistant `<tool_call>` content) in `openaiResponsesInputToHistory` and `historyPrefixFromStoredResponse` so chained tool round-trips actually carry through `previous_response_id`.
  - Use `crypto.randomUUID()` for `resp_`/`msg_`/`fc_`/input-item ids.
  - Surface real `usage.output_tokens` from `result.stats.generatedTokens` (Responses + chat.completions, blocking + streaming); fall back to word count when stats are missing. `input_tokens` stays 0 with an inline note that the SDK does not expose a prompt-token count today.
  - Tighten `CompletionResult.stats` to a structured `CompletionRunStats` shape.

  Tests: extend `responses.test.ts` and `translate.test.ts`; add `responses-streaming.test.ts` driving the new exported `writeStreamingResponse` / `writeBlockingResponse` helpers with a fake `CompletionResult` and `ServerResponse`.

* test[skiplog]: stabilize Responses chain e2e for tiny reasoning model

  Pin temperature=0 + seed and bump max_output_tokens to 512 so Qwen3-600M has room for both its <think> block and the actual answer. The test exercises previous_response_id chain wiring; it should not depend on sampling luck or the model's reasoning length.

* fix: walk previous_response_id chain so multi-turn keeps grandparent history

  Each StoredResponse.inputItems only carries that turn's NEW input (`normalizeResponsesInputItemsForStorage(body['input'])`), so a chain of depth >= 3 silently lost the grandparent turn:

      resp_1 input "A" -> output "X" (stored: ["A"])
      resp_2 prev=resp_1 input "B" history sent: [A, X, B] (stored: ["B"])
      resp_3 prev=resp_2 input "C" history sent: [B, Y, C] -- A and X gone

  historyPrefixFromStoredResponse now walks the chain via responseObject.previous_response_id when given a resolver, prepending earlier turns oldest-first. Cap depth at 32 to bound work and protect against pathological cycles. Routes pass `(id) => store.get(id)` as the resolver. Legacy single-step callers still work unchanged when the resolver is omitted.

  Tests:
  - unit: depth-3 chain produces all six prefix entries in order; maxDepth cap honored.
  - e2e: resp_1 sets "code word is XYZZY", resp_2 acks, resp_3 asks for the word and recovers it -- would silently fail before this fix.

* fix: address Responses review nits (SSE sentinel, dup event, types, max_tokens warn, README)

  Five low-severity items from PR #2030 review:
  - Drop the `data: [DONE]` sentinel on `/v1/responses` SSE: spec ends on `response.completed`. Adds an `EndSSEOptions { sentinel?: boolean }` knob to `endSSE` so chat-completions keeps its existing sentinel and Responses opts out via `endSSE(res, { sentinel: false })`. E2E flips the assertion accordingly.
  - Drop the duplicate `response.in_progress` event emitted back-to-back with `response.created` (same payload, no state transition — strict parsers can choke).
  - Tighten `BuildResponseObjectParams.parallelToolCalls` from `boolean | undefined` to `boolean` (the route already resolves a default before calling), eliminating a dead `?? true` fallback.
  - Warn on `max_tokens` for /v1/responses (spec field is `max_output_tokens`); still accepted as a fallback so existing clients don't break, but they get a logger.warn nudge.
  - README: add a "serve openai" section listing all routes and a Responses subsection that documents volatility, the `X-QVAC-Stub` header, the `store: false` opt-out, and curl examples. The README previously listed no openai-compat endpoints at all.

  Skipped from the review:
  - #2 (no client-disconnect handling in streaming): pre-existing gap shared with /v1/chat/completions, reviewer marked out of scope.
  - #7 (per-entry byte-size cap on the in-memory store): reviewer marked follow-up; `maxEntries` + TTL still bound memory pressure for the local-first single-user target audience.

* fix: address Simon review nits (stream error sentinel, input_items after cursor)

  Two surfaced post-rebase:
  1. sendError gained an opt-in { sseSentinel: false } so callers inside an active stream can suppress the trailing `data: [DONE]\n\n` after the `response.error` SSE event. Responses streaming error path now passes it, closing the gap that the happy path already handled (response.completed already used endSSE({ sentinel: false })).
  2. GET /v1/responses/:id/input_items now reads the `after` cursor from the query string in addition to `limit`. Spec-compliant pagination would have re-fetched page 1 forever; the store already implemented the cursor.

  Added a store-level pagination test that walks all pages by `last_id`.
1 parent 93984e0 commit 0741eb5
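
The chain walk described in the commit message can be illustrated with a minimal sketch. The `Turn` shape, the `historyPrefix` name, and the string-based history are simplifications invented for this example; the actual `historyPrefixFromStoredResponse` is not shown on this page and may differ. Here `maxDepth` counts the turns included, starting with the one passed in.

```typescript
// Hypothetical, simplified turn record: each turn stores only its NEW input.
interface Turn {
  id: string
  previousResponseId: string | null
  input: string[]
  output: string[]
}

type Resolver = (id: string) => Turn | undefined

// Follows previousResponseId links (bounded by maxDepth) and returns the
// accumulated history oldest-first. Without a resolver it degrades to the
// single-step behaviour: only the given turn contributes.
function historyPrefix (start: Turn, resolve?: Resolver, maxDepth = 32): string[] {
  const chain: Turn[] = [start]
  let current = start
  while (resolve !== undefined && chain.length < maxDepth && current.previousResponseId !== null) {
    const parent = resolve(current.previousResponseId)
    if (parent === undefined) break // parent expired, evicted, or never stored
    chain.push(parent)
    current = parent
  }
  const prefix: string[] = []
  for (const turn of chain.reverse()) prefix.push(...turn.input, ...turn.output)
  return prefix
}

// Demo: resp_2 chains to resp_1, mirroring the A/X/B/Y example above.
const demoTurns = new Map<string, Turn>([
  ['resp_1', { id: 'resp_1', previousResponseId: null, input: ['A'], output: ['X'] }],
  ['resp_2', { id: 'resp_2', previousResponseId: 'resp_1', input: ['B'], output: ['Y'] }]
])
const demoStart = demoTurns.get('resp_2')
if (demoStart !== undefined) {
  console.log(historyPrefix(demoStart, (id) => demoTurns.get(id))) // [ 'A', 'X', 'B', 'Y' ]
  console.log(historyPrefix(demoStart)) // [ 'B', 'Y' ]
}
```

The depth cap doubles as the only cycle guard in this sketch, which is the same trade-off the commit message describes: a cyclic chain terminates after at most `maxDepth` turns instead of being detected explicitly.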

18 files changed

Lines changed: 2260 additions & 30 deletions

packages/cli/README.md

Lines changed: 1 addition & 1 deletion
@@ -306,7 +306,7 @@ Run an **OpenAI-compatible HTTP server** backed by locally configured QVAC model
 qvac serve openai [options]
 ```
 
-See **[docs/serve-openai.md](./docs/serve-openai.md)** for supported `/v1/...` routes, multipart request shapes, and how to register models — including **`whispercpp-audio-translation`** for `POST /v1/audio/translations` (Whisper translate-to-English).
+See **[docs/serve-openai.md](./docs/serve-openai.md)** for supported `/v1/...` routes, multipart request shapes, and how to register models — including **`whispercpp-audio-translation`** for `POST /v1/audio/translations` (Whisper translate-to-English) and the volatile **`POST /v1/responses`** Responses API with `previous_response_id` chaining.
 
 ## Configuration

packages/cli/docs/serve-openai.md

Lines changed: 41 additions & 0 deletions
@@ -12,12 +12,53 @@ This document describes the supported routes and how to configure `serve.models`
 | `GET` | `/v1/models/{id}` | Model metadata |
 | `DELETE` | `/v1/models/{id}` | Unload |
 | `POST` | `/v1/chat/completions` | Chat |
+| `POST` | `/v1/responses` | Responses API (blocking + SSE streaming); volatile, see below |
+| `GET` | `/v1/responses/{id}` | Retrieve a stored response |
+| `DELETE` | `/v1/responses/{id}` | Delete a stored response |
+| `GET` | `/v1/responses/{id}/input_items` | Paginate the original input items |
 | `POST` | `/v1/embeddings` | Embeddings |
 | `POST` | `/v1/audio/transcriptions` | Speech-to-text (source language) |
 | `POST` | `/v1/audio/translations` | Speech-to-text **into English** (Whisper translate task) |
 
 Other OpenAI routes may be added over time; this file is updated when they ship.
 
+## `POST /v1/responses`
+
+OpenAI-compatible Responses API: blocking, SSE streaming, retrieval by id,
+and `previous_response_id` chaining. Backed by the same chat models registered
+under `serve.models` (any alias whose endpoint category is `chat`).
+
+> **Volatile state.** All responses are kept in process memory only — there is
+> no disk or P2P persistence. Stored ids expire on server restart, after the
+> per-entry TTL (1h by default), or once the LRU cap (256 entries) evicts
+> them. Each response is also tagged with `X-QVAC-Stub: responses-volatile`
+> and a one-line warn is logged at startup so operators know the surface is
+> not durable. Pass `store: false` in the request body to skip persistence
+> entirely.
+
+Intentionally rejected with `400`: `conversation`, `background: true`, and
+built-in tools (`web_search`, `file_search`, `code_interpreter`).
+`function`-typed tools work normally.
+
+### Examples
+
+```bash
+# Blocking
+curl -sS http://127.0.0.1:11434/v1/responses \
+  -H "Content-Type: application/json" \
+  -d '{"model":"<alias>","input":"ping","store":true}'
+
+# Streaming (SSE)
+curl -sN http://127.0.0.1:11434/v1/responses \
+  -H "Content-Type: application/json" \
+  -d '{"model":"<alias>","input":"ping","stream":true}'
+
+# Multi-turn via previous_response_id
+curl -sS http://127.0.0.1:11434/v1/responses \
+  -H "Content-Type: application/json" \
+  -d '{"model":"<alias>","input":"and now?","previous_response_id":"resp_..."}'
+```
+
 ## `POST /v1/audio/translations`
 
 OpenAI’s **translations** endpoint always returns **English text**. It maps to Whisper’s **translate** task (not “transcribe then run a text translator”).
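
For the `GET /v1/responses/{id}/input_items` route listed in the table of this docs hunk, the commit message says pagination advances by passing the previous page's `last_id` as the `after` cursor. The sketch below walks all pages that way. The `ListPage` callback and the `collectAllInputItems` name are hypothetical; the callback stands in for either the store's `listInputItems` bound to one id or an HTTP call, and the in-memory fake only imitates the cursor semantics visible in the store diff.

```typescript
// Page shape as returned by the store's listInputItems in this commit.
interface InputItemsPage {
  object: string
  data: unknown[]
  first_id: string | null
  last_id: string | null
  has_more: boolean
}

// Hypothetical page source: returns null when the response id is unknown.
type ListPage = (opts: { limit: number, after?: string }) => InputItemsPage | null

// Collects every input item by feeding each page's last_id back as `after`.
function collectAllInputItems (listPage: ListPage, limit = 20): unknown[] {
  const all: unknown[] = []
  let after: string | undefined
  for (;;) {
    const page = listPage(after === undefined ? { limit } : { limit, after })
    if (page === null) break // unknown, expired, or evicted response id
    all.push(...page.data)
    // Stop when no more pages exist or there is no cursor to advance with.
    if (!page.has_more || page.last_id === null) break
    after = page.last_id
  }
  return all
}

// In-memory fake with five items, for demonstration only.
const demoItems = ['in_1', 'in_2', 'in_3', 'in_4', 'in_5'].map((id) => ({ id }))
const demoListPage: ListPage = ({ limit, after }) => {
  const idx = after === undefined ? -1 : demoItems.findIndex((it) => it.id === after)
  const start = after === undefined ? 0 : (idx >= 0 ? idx + 1 : demoItems.length)
  const data = demoItems.slice(start, start + limit)
  const first = data[0]
  const last = data[data.length - 1]
  return {
    object: 'list',
    data,
    first_id: first !== undefined ? first.id : null,
    last_id: last !== undefined ? last.id : null,
    has_more: start + data.length < demoItems.length
  }
}

console.log(collectAllInputItems(demoListPage, 2).length) // 5
```

A client that ignores `after` and only sends `limit` would receive the first page on every request, which is the failure mode the last fix in the commit message addresses on the server side.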

packages/cli/src/serve/adapters/openai/index.ts

Lines changed: 12 additions & 0 deletions
@@ -32,6 +32,18 @@ export function createOpenAIAdapter (): APIAdapter {
       return true
     }
 
+    if (method === 'POST' && path === '/v1/responses') {
+      const { handlePostResponses } = await import('./routes/responses.js')
+      await handlePostResponses(req, res, ctx)
+      return true
+    }
+
+    if (path.startsWith('/v1/responses/')) {
+      const { routeResponsesId } = await import('./routes/responses-id.js')
+      const handled = await routeResponsesId(req, res, ctx)
+      if (handled) return true
+    }
+
     if (method === 'POST' && path === '/v1/chat/completions') {
       const { handleChatCompletions } = await import('./routes/chat.js')
       await handleChatCompletions(req, res, ctx)
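
The hunk above delegates every path under `/v1/responses/` to `routeResponsesId`, whose body is not part of this page. As a rough illustration of what such a dispatcher has to decide, the sketch below maps a method and path onto the three id-scoped routes from the docs table. The `matchResponsesIdRoute` name and the `ResponsesIdRoute` type are invented for this example, and the path is assumed to arrive without its query string.

```typescript
// Hypothetical result type for the three id-scoped routes.
type ResponsesIdRoute =
  | { kind: 'retrieve', id: string }
  | { kind: 'delete', id: string }
  | { kind: 'input_items', id: string }

// Returns the matched route, or null so the caller can fall through to
// later handlers (the adapter only returns true when a route was handled).
function matchResponsesIdRoute (method: string, path: string): ResponsesIdRoute | null {
  const prefix = '/v1/responses/'
  if (!path.startsWith(prefix)) return null
  const segments = path.slice(prefix.length).split('/')
  const id = segments[0]
  if (id === undefined || id === '') return null
  if (segments.length === 1) {
    if (method === 'GET') return { kind: 'retrieve', id }
    if (method === 'DELETE') return { kind: 'delete', id }
    return null
  }
  if (segments.length === 2 && segments[1] === 'input_items' && method === 'GET') {
    return { kind: 'input_items', id }
  }
  return null
}

console.log(matchResponsesIdRoute('GET', '/v1/responses/resp_123/input_items'))
// { kind: 'input_items', id: 'resp_123' }
```

Returning `null` for unmatched method and path combinations mirrors the `if (handled) return true` pattern in the adapter, where an unhandled request continues to the remaining route checks.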
Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
+import crypto from 'node:crypto'
+import type { SDKToolCall, CompletionRunStats } from '../../core/sdk.js'
+import { sdkToolCallsToOpenai } from './translate.js'
+
+export function responseId (): string {
+  return `resp_${randomId()}`
+}
+
+export function messageId (): string {
+  return `msg_${randomId()}`
+}
+
+export function functionCallOutputItemId (): string {
+  return `fc_${randomId()}`
+}
+
+function randomId (): string {
+  return crypto.randomUUID()
+}
+
+export interface BuildResponseObjectParams {
+  id: string
+  modelAlias: string
+  text: string
+  toolCalls: SDKToolCall[] | null | undefined
+  createdAtSec: number
+  metadata: Record<string, unknown> | null | undefined
+  temperature: number | undefined
+  topP: number | undefined
+  maxOutputTokens: number | undefined
+  parallelToolCalls: boolean
+  previousResponseId: string | null | undefined
+  store: boolean
+  /** When set (e.g. streaming), must match SSE item ids so finalized response matches the stream. */
+  messageItemId?: string
+  /** When set, must align with `toolCalls` length; same ids as streamed function_call items. */
+  functionCallItemIds?: string[]
+  /** From SDK completion stats; `generatedTokens` maps to `usage.output_tokens`. */
+  stats?: CompletionRunStats
+}
+
+function wordCountFallback (text: string): number {
+  return text ? text.split(/\s+/).filter(Boolean).length : 0
+}
+
+export function buildResponseObject (params: BuildResponseObjectParams): Record<string, unknown> {
+  const hasToolCalls = params.toolCalls !== null && params.toolCalls !== undefined && params.toolCalls.length > 0
+  const msgId = params.messageItemId ?? messageId()
+  const output: unknown[] = []
+
+  output.push({
+    type: 'message',
+    id: msgId,
+    status: 'completed',
+    role: 'assistant',
+    content: [{ type: 'output_text', text: params.text || '', annotations: [] }]
+  })
+
+  const openaiCalls = sdkToolCallsToOpenai(params.toolCalls)
+  if (hasToolCalls) {
+    const ids = params.functionCallItemIds
+    let i = 0
+    for (const tc of openaiCalls ?? []) {
+      const fcId = ids !== undefined && ids[i] !== undefined ? ids[i]! : functionCallOutputItemId()
+      i++
+      output.push({
+        type: 'function_call',
+        id: fcId,
+        call_id: tc.id,
+        name: tc.function.name,
+        arguments: tc.function.arguments,
+        status: 'completed'
+      })
+    }
+  }
+
+  const outputTokens =
+    typeof params.stats?.generatedTokens === 'number' && Number.isFinite(params.stats.generatedTokens)
+      ? params.stats.generatedTokens
+      : wordCountFallback(params.text || '')
+  // SDK does not expose prompt token count today; `cacheTokens` is KV-cache hit count, not full prompt size.
+  const inputTokens = 0
+  const usage = {
+    input_tokens: inputTokens,
+    output_tokens: outputTokens,
+    total_tokens: inputTokens + outputTokens
+  }
+
+  const base: Record<string, unknown> = {
+    id: params.id,
+    object: 'response',
+    created_at: params.createdAtSec,
+    status: hasToolCalls ? 'requires_action' : 'completed',
+    model: params.modelAlias,
+    output,
+    output_text: params.text || '',
+    usage,
+    parallel_tool_calls: params.parallelToolCalls,
+    store: params.store
+  }
+
+  if (hasToolCalls) {
+    base['required_action'] = {
+      type: 'submit_tool_outputs',
+      submit_tool_outputs: {
+        tool_calls: (openaiCalls ?? []).map((tc) => ({
+          id: tc.id,
+          type: 'function',
+          function: {
+            name: tc.function.name,
+            arguments: tc.function.arguments
+          }
+        }))
+      }
+    }
+  }
+
+  if (params.metadata !== undefined && params.metadata !== null) {
+    base['metadata'] = params.metadata
+  }
+  if (params.temperature !== undefined) base['temperature'] = params.temperature
+  if (params.topP !== undefined) base['top_p'] = params.topP
+  if (params.maxOutputTokens !== undefined) base['max_output_tokens'] = params.maxOutputTokens
+  if (params.previousResponseId) base['previous_response_id'] = params.previousResponseId
+
+  return base
+}
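
On the consuming side, a client that receives the object built by `buildResponseObject` needs to pull the pending tool calls out of `output[]` when `status` is `requires_action`. The following is a minimal sketch of that step. The `pendingToolCalls` helper and the `PendingToolCall` shape are hypothetical, and the sketch assumes `arguments` usually arrives as a JSON string (as in the OpenAI function-call format), falling back to the raw value otherwise.

```typescript
// Hypothetical client-side view of one function_call output item.
interface PendingToolCall {
  callId: string
  name: string
  args: unknown
}

// Extracts function_call items from a response object shaped like the one
// assembled in the diff above. Returns [] unless status is requires_action.
function pendingToolCalls (response: Record<string, unknown>): PendingToolCall[] {
  if (response['status'] !== 'requires_action') return []
  const output: unknown[] = Array.isArray(response['output']) ? response['output'] : []
  const calls: PendingToolCall[] = []
  for (const item of output) {
    if (typeof item !== 'object' || item === null) continue
    const rec = item as Record<string, unknown>
    if (rec['type'] !== 'function_call') continue
    const callId = rec['call_id']
    const name = rec['name']
    if (typeof callId !== 'string' || typeof name !== 'string') continue
    let args: unknown = rec['arguments']
    if (typeof args === 'string') {
      try {
        args = JSON.parse(args)
      } catch {
        // Not valid JSON: keep the raw string so the caller can decide.
      }
    }
    calls.push({ callId, name, args })
  }
  return calls
}

// Demo input: the assistant message sits at output[0], tool calls follow.
const demoResponse: Record<string, unknown> = {
  status: 'requires_action',
  output: [
    { type: 'message', id: 'msg_1', status: 'completed', role: 'assistant', content: [] },
    { type: 'function_call', id: 'fc_1', call_id: 'call_1', name: 'get_weather', arguments: '{"city":"Oslo"}', status: 'completed' }
  ]
}
console.log(pendingToolCalls(demoResponse))
```

Filtering by `type` rather than by position matters here because, per the commit message, the assistant `message` item always occupies `output[0]` even when tool calls are present.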
Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
+export const RESPONSES_VOLATILE_STUB = 'responses-volatile'
+
+export interface StoredResponse {
+  id: string
+  createdAtSec: number
+  expiresAtSec: number
+  responseObject: Record<string, unknown>
+  inputItems: unknown[]
+  modelAlias: string
+}
+
+export interface ResponsesStoreOptions {
+  maxEntries?: number
+  ttlMs?: number
+  now?: () => number
+}
+
+export interface ListInputItemsOptions {
+  limit?: number
+  after?: string | undefined
+}
+
+export interface ResponsesStore {
+  put: (record: StoredResponse) => void
+  get: (id: string) => StoredResponse | undefined
+  delete: (id: string) => boolean
+  listInputItems: (id: string, opts?: ListInputItemsOptions) => {
+    object: string
+    data: unknown[]
+    first_id: string | null
+    last_id: string | null
+    has_more: boolean
+  } | null
+  size: () => number
+  bannerLine: () => string
+}
+
+const DEFAULT_MAX = 256
+const DEFAULT_TTL_MS = 60 * 60 * 1000
+
+export const RESPONSES_DEFAULT_TTL_SEC = Math.floor(DEFAULT_TTL_MS / 1000)
+
+export function createResponsesStore (options: ResponsesStoreOptions = {}): ResponsesStore {
+  const maxEntries = options.maxEntries ?? DEFAULT_MAX
+  const ttlMs = options.ttlMs ?? DEFAULT_TTL_MS
+  const nowMs = options.now ?? ((): number => Date.now())
+
+  const map = new Map<string, StoredResponse>()
+
+  function pruneExpired (): void {
+    const t = nowMs() / 1000
+    for (const [k, v] of map) {
+      if (v.expiresAtSec <= t) map.delete(k)
+    }
+  }
+
+  function bump (id: string, rec: StoredResponse): void {
+    map.delete(id)
+    map.set(id, rec)
+  }
+
+  return {
+    put (record: StoredResponse): void {
+      pruneExpired()
+      bump(record.id, record)
+      while (map.size > maxEntries) {
+        const first = map.keys().next().value
+        if (first === undefined) break
+        map.delete(first)
+      }
+    },
+
+    get (id: string): StoredResponse | undefined {
+      pruneExpired()
+      const rec = map.get(id)
+      if (!rec) return undefined
+      if (rec.expiresAtSec <= nowMs() / 1000) {
+        map.delete(id)
+        return undefined
+      }
+      bump(id, rec)
+      return rec
+    },
+
+    delete (id: string): boolean {
+      return map.delete(id)
+    },
+
+    listInputItems (id: string, opts?: ListInputItemsOptions): {
+      object: string
+      data: unknown[]
+      first_id: string | null
+      last_id: string | null
+      has_more: boolean
+    } | null {
+      pruneExpired()
+      const rec = map.get(id)
+      if (!rec) return null
+      if (rec.expiresAtSec <= nowMs() / 1000) {
+        map.delete(id)
+        return null
+      }
+      const limit = typeof opts?.limit === 'number' && opts.limit > 0 ? Math.min(opts.limit, 100) : 20
+      const items = rec.inputItems as Array<{ id?: string }>
+      let start = 0
+      if (opts?.after) {
+        const idx = items.findIndex((it) => {
+          if (!it || typeof it !== 'object') return false
+          const ito = it as Record<string, unknown>
+          return ito['id'] === opts.after
+        })
+        start = idx >= 0 ? idx + 1 : items.length
+      }
+      const slice = items.slice(start, start + limit)
+      const hasMore = start + slice.length < items.length
+      const firstId = slice[0] && typeof slice[0] === 'object' && typeof (slice[0] as { id?: string }).id === 'string'
+        ? (slice[0] as { id: string }).id
+        : null
+      const last = slice[slice.length - 1]
+      const lastId = last && typeof last === 'object' && typeof (last as { id?: string }).id === 'string'
+        ? (last as { id: string }).id
+        : null
+      return {
+        object: 'list',
+        data: slice,
+        first_id: firstId,
+        last_id: lastId,
+        has_more: hasMore
+      }
+    },
+
+    size (): number {
+      pruneExpired()
+      return map.size
+    },
+
+    bannerLine (): string {
+      const ttlMin = Math.round(ttlMs / 60000)
+      return `responses: in-memory only — IDs expire on restart, max ${maxEntries} entries, ${ttlMin}m TTL`
+    }
+  }
+}
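
The store above gets its LRU behaviour from the insertion order of a JavaScript `Map`: deleting and re-inserting a key moves it to the end, so the first key is always the least recently used. The stripped-down sketch below isolates that technique together with the injectable clock. The `createLruTtlCache` factory is a hypothetical reduction written for this example; unlike the real store, it computes the expiry on `put` rather than reading `expiresAtSec` from the record.

```typescript
// Hypothetical minimal cache; not the project's ResponsesStore.
interface LruTtlCache<V> {
  put: (key: string, value: V) => void
  get: (key: string) => V | undefined
  size: () => number
}

function createLruTtlCache<V> (
  maxEntries: number,
  ttlMs: number,
  now: () => number = () => Date.now()
): LruTtlCache<V> {
  // Map iteration order is insertion order, so the first key is the oldest.
  const map = new Map<string, { value: V, expiresAtMs: number }>()
  return {
    put (key, value) {
      map.delete(key) // re-insert so the key becomes the most recent
      map.set(key, { value, expiresAtMs: now() + ttlMs })
      while (map.size > maxEntries) {
        const oldest = map.keys().next().value
        if (oldest === undefined) break
        map.delete(oldest)
      }
    },
    get (key) {
      const entry = map.get(key)
      if (entry === undefined) return undefined
      if (entry.expiresAtMs <= now()) {
        map.delete(key) // lazily drop expired entries on access
        return undefined
      }
      map.delete(key)
      map.set(key, entry) // bump recency on read; expiry is unchanged
      return entry.value
    },
    size () {
      return map.size
    }
  }
}

// Demo with a fake clock so expiry is deterministic.
let fakeNow = 0
const demoCache = createLruTtlCache<string>(2, 1000, () => fakeNow)
demoCache.put('resp_a', 'A')
demoCache.put('resp_b', 'B')
demoCache.get('resp_a') // bump resp_a, leaving resp_b as the oldest entry
demoCache.put('resp_c', 'C') // over capacity: evicts resp_b
console.log(demoCache.get('resp_b')) // undefined
fakeNow = 1000
console.log(demoCache.get('resp_a')) // undefined (TTL elapsed)
```

Passing the clock in as a function, as `ResponsesStoreOptions.now` does in the diff, is what lets tests advance time without timers or sleeps.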
