
Commit a26b3db

Add WebSocket drift detection tests (#37)
## Summary

- **Fix Responses WS input format**: handler now accepts the flat `response.create` format matching the real OpenAI API (previously required a non-standard nested `response: { ... }` envelope)
- **4 verified WS drift tests**: OpenAI Responses WS (text + tool call) and OpenAI Realtime (text + tool call), triangulated against real APIs
- **Model canaries**: Realtime preview model availability check (detects deprecation, suggests GA replacement); Gemini Live text-capable model availability check (enables drift tests when Google ships one)
- **Gemini Live**: protocol implemented per docs, documented as unverified — no text-capable `bidiGenerateContent` model exists yet
- TLS WebSocket client (`ws-providers.ts`) with RFC 6455 framing, ping/pong, connection-scoped cursors
- SDK shapes for Realtime and Gemini Live event sequences
- Fix README Gemini Live response shape example and Responses WS example

## Test plan

- [x] `pnpm test` — 540 unit tests pass
- [x] `pnpm test:drift` without keys — all 27 tests skip gracefully
- [x] `pnpm test:drift` with keys — 25 pass, 2 skip (Gemini Live text/tool)
- [x] `pnpm run format:check` — clean
- [x] `pnpm run lint` — clean
- [x] 5 rounds of CR→fix loop with full pr-review-toolkit suite

🤖 Generated with [Claude Code](https://claude.com/claude-code)
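The headline fix in concrete terms: a sketch contrasting the two `response.create` shapes (the flat shape matches the real OpenAI Responses API; values here are illustrative):

```typescript
// Before the fix llmock required a non-standard nested envelope;
// after it, the flat event the real OpenAI API uses is accepted.

// Previously required (non-standard nested envelope):
const nested = {
  type: "response.create",
  response: {
    instructions: "You are a helpful assistant.",
    input: [
      { type: "message", role: "user", content: [{ type: "input_text", text: "Hello" }] },
    ],
  },
};

// Now accepted (flat, matching the real API):
const flat = {
  type: "response.create",
  model: "gpt-4o",
  instructions: "You are a helpful assistant.",
  input: [
    { type: "message", role: "user", content: [{ type: "input_text", text: "Hello" }] },
  ],
};

// The payload fields move from `response.*` to the top level.
console.log("input" in flat, "response" in flat); // true false
```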
2 parents 756127b + e5870ed commit a26b3db

16 files changed

Lines changed: 1470 additions & 54 deletions

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -1,5 +1,15 @@
 # @copilotkit/llmock

+## 1.3.3
+
+### Patch Changes
+
+- Fix Responses WS handler to accept flat `response.create` format matching the real OpenAI API (previously required a non-standard nested `response: { ... }` envelope)
+- WebSocket drift detection tests: TLS client for real provider WS endpoints, 4 verified drift tests (Responses WS + Realtime), Gemini Live canary for text-capable model availability
+- Realtime model canary: detects when `gpt-4o-mini-realtime-preview` is deprecated and suggests GA replacement
+- Gemini Live documented as unverified (no text-capable `bidiGenerateContent` model exists yet)
+- Fix README Gemini Live response shape example (`modelTurn.parts`, not `modelTurnComplete`)
+
 ## 1.3.2

 ### Patch Changes

DRIFT.md

Lines changed: 27 additions & 2 deletions
@@ -101,7 +101,32 @@ When a model is deprecated:
 3. Add raw fetch client functions to `src/__tests__/drift/providers.ts`
 4. Create `src/__tests__/drift/<provider>.drift.ts` with 4 test scenarios
 5. Add model listing function to `providers.ts` and model check to `models.drift.ts`
-6. Update the allowlist in `schema.ts` if needed
+6. If the provider uses WebSocket, add protocol functions to `ws-providers.ts` and create `ws-<provider>.drift.ts`
+7. Update the allowlist in `schema.ts` if needed
+
+## WebSocket Drift Coverage
+
+In addition to the 19 existing drift tests (16 HTTP response-shape + 3 model deprecation), WebSocket drift tests cover llmock's WS protocols:
+
+| Protocol            | Text | Tool Call | Real Endpoint                                                       | Status     |
+| ------------------- | ---- | --------- | ------------------------------------------------------------------- | ---------- |
+| OpenAI Responses WS | ✓    | ✓         | `wss://api.openai.com/v1/responses`                                 | Verified   |
+| OpenAI Realtime     | ✓    | ✓         | `wss://api.openai.com/v1/realtime`                                  | Verified   |
+| Gemini Live         | ✓    | ✓         | `wss://generativelanguage.googleapis.com/ws/...BidiGenerateContent` | Unverified |
+
+**Models**: `gpt-4o-mini` for Responses WS, `gpt-4o-mini-realtime-preview` for Realtime.
+
+**Auth**: Uses the same `OPENAI_API_KEY` and `GOOGLE_API_KEY` environment variables as HTTP tests. No new secrets needed.
+
+**How it works**: A TLS WebSocket client (`ws-providers.ts`) connects to real provider endpoints using `node:tls` with RFC 6455 framing. Each protocol function handles the setup sequence (e.g., Realtime session negotiation, Gemini Live setup/setupComplete) and collects messages until a terminal event. The mock side uses the existing `ws-test-client.ts` plaintext client against the local llmock server.
+
+### Gemini Live: unverified
+
+llmock's Gemini Live handler implements the text-based `BidiGenerateContent` protocol as documented in Google's [Live API reference](https://ai.google.dev/api/live): `setup`/`setupComplete` handshake, `clientContent` with turns, `serverContent` with `modelTurn.parts[].text`, and `toolCall` responses. The protocol format is correct per the docs.
+
+However, as of March 2026, the only models that support `bidiGenerateContent` are native-audio models (`gemini-2.5-flash-native-audio-*`), which reject text-only requests. No text-capable model exists for this endpoint yet, so we cannot triangulate llmock's output against a real API response.
+
+A canary test (`ws-gemini-live.drift.ts`) queries the Gemini model listing API on each drift run and checks for a non-audio model that supports `bidiGenerateContent`. When Google ships one, the canary will flag it and the full drift tests can be enabled.

 ## CI Schedule
@@ -115,4 +140,4 @@ See `.github/workflows/test-drift.yml`.

 ## Cost

-~20 API calls per run using the cheapest available models (`gpt-4o-mini`, `claude-haiku-4-5-20251001`, `gemini-2.5-flash`) with 10-100 max tokens each. Under $0.01/week.
+~25 API calls per run (16 HTTP response-shape + 3 model listing + 4 WS + 2 canaries) using the cheapest available models (`gpt-4o-mini`, `gpt-4o-mini-realtime-preview`, `claude-haiku-4-5-20251001`, `gemini-2.5-flash`) with 10-100 max tokens each. Under $0.02/week. When Gemini Live text-capable models become available, this will increase to 6 WS calls.
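The RFC 6455 framing used by the TLS drift client can be illustrated with a minimal client-side text-frame encoder. This is a sketch, not the actual `ws-providers.ts` code: real frames also need extended payload lengths, fragmentation, and ping/pong handling.

```typescript
// Encode a client-to-server WebSocket text frame per RFC 6455.
// Client frames MUST be masked with a 4-byte key.
function encodeTextFrame(text: string, mask: Uint8Array): Uint8Array {
  const payload = new TextEncoder().encode(text);
  if (payload.length > 125) throw new Error("extended lengths omitted in this sketch");
  const frame = new Uint8Array(2 + 4 + payload.length);
  frame[0] = 0x81; // FIN = 1, opcode = 0x1 (text)
  frame[1] = 0x80 | payload.length; // MASK = 1, 7-bit payload length
  frame.set(mask, 2); // 4-byte masking key
  for (let i = 0; i < payload.length; i++) {
    frame[6 + i] = payload[i] ^ mask[i % 4]; // XOR-mask the payload
  }
  return frame;
}

const frame = encodeTextFrame("Hi", new Uint8Array([1, 2, 3, 4]));
console.log(frame[0].toString(16)); // "81"
console.log(frame.length); // 8
```

The server unmasks by XORing with the same key, so round-tripping a byte through the mask twice recovers the original payload.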

README.md

Lines changed: 12 additions & 12 deletions
@@ -500,7 +500,7 @@ WebSocket endpoints:

 - **WS `/v1/responses`** — OpenAI Responses API over WebSocket
 - **WS `/v1/realtime`** — OpenAI Realtime API (text + tool calls)
-- **WS `/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent`** — Gemini Live
+- **WS `/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent`** — Gemini Live ([unverified](#gemini-live-bidigeneratecontent))

 All endpoints share the same fixture pool — the same fixtures work across all providers. Requests are translated to a common format internally for fixture matching.
@@ -518,13 +518,11 @@ Connect to `ws://localhost:5555/v1/responses` and send a `response.create` event

 // → Client sends:
 {
   "type": "response.create",
-  "response": {
-    "modalities": ["text"],
-    "instructions": "You are a helpful assistant.",
-    "input": [
-      { "type": "message", "role": "user", "content": [{ "type": "input_text", "text": "Hello" }] },
-    ],
-  },
+  "model": "gpt-4o",
+  "instructions": "You are a helpful assistant.",
+  "input": [
+    { "type": "message", "role": "user", "content": [{ "type": "input_text", "text": "Hello" }] },
+  ],
 }

 // ← Server streams:
@@ -567,19 +565,21 @@ Connect to `ws://localhost:5555/v1/realtime`. The Realtime API uses a session-ba

 ### Gemini Live (BidiGenerateContent)

-Connect to `ws://localhost:5555/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent`. Gemini Live uses a setup/content/response flow:
+Connect to `ws://localhost:5555/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent`. Gemini Live uses a setup/content/response flow.
+
+> **⚠️ Unverified**: As of March 2026, Google's only `bidiGenerateContent`-capable models are audio-only — no text-capable model exists for this endpoint. llmock implements the text-based protocol as documented in Google's [Live API reference](https://ai.google.dev/api/live), but the response shapes have not been verified against real API output. Code you write against this mock may need adjustment when Google ships a text-capable Live model. See [DRIFT.md](DRIFT.md#gemini-live-unverified) for details and the automated canary that tracks model availability.

 ```jsonc
 // → Setup message (must be first):
-{ "setup": { "model": "models/gemini-2.0-flash-live", "generationConfig": { "responseModalities": ["TEXT"] } } }
+{ "setup": { "model": "models/gemini-2.5-flash", "generationConfig": { "responseModalities": ["TEXT"] } } }

 // → Send user content:
 { "clientContent": { "turns": [{ "role": "user", "parts": [{ "text": "Hello" }] }], "turnComplete": true } }

 // ← Server streams:
 // {"setupComplete": {}}
-// {"serverContent": {"modelTurnComplete": false, "parts": [{"text": "Hello"}]}}
-// {"serverContent": {"modelTurnComplete": true}}
+// {"serverContent": {"modelTurn": {"parts": [{"text": "Hello"}]}, "turnComplete": false}}
+// {"serverContent": {"modelTurn": {"parts": [{"text": "!"}]}, "turnComplete": true}}
 ```

 ## CLI
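Unlike OpenAI events, Gemini Live messages carry no `type` field, so the drift helpers derive an event name from the top-level key. A hypothetical sketch of such a classifier — the real `classifyGeminiMessage` in `ws-providers.ts` may differ in names and coverage:

```typescript
// Hypothetical classifier: derive an event name from the single
// top-level key each Gemini Live server message carries.
function classifyGeminiMessage(msg: Record<string, unknown>): string {
  if ("setupComplete" in msg) return "setupComplete";
  if ("toolCall" in msg) return "toolCall";
  if ("serverContent" in msg) return "serverContent";
  return "unknown";
}

console.log(classifyGeminiMessage({ setupComplete: {} }));
// → "setupComplete"
console.log(
  classifyGeminiMessage({ serverContent: { modelTurn: { parts: [{ text: "Hi" }] } } }),
);
// → "serverContent"
```

This keeps the WS drift events shape-compatible with the SSE drift events, which already key on a `type` string.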

docs/index.html

Lines changed: 4 additions & 2 deletions
@@ -1199,7 +1199,9 @@ <h3>WebSocket APIs</h3>
 <ul>
   <li>OpenAI Responses API over WebSocket</li>
   <li>OpenAI Realtime API — text + tool calls</li>
-  <li>Gemini Live BidiGenerateContent</li>
+  <li>
+    Gemini Live BidiGenerateContent (unverified — no text-capable model exists yet)
+  </li>
   <li>No audio/video — text and tool call paths only</li>
 </ul>
 </div>
@@ -1308,7 +1310,7 @@ <h2 class="section-title">llmock vs MSW</h2>
   <td class="manual">Manual — build data SSE yourself</td>
 </tr>
 <tr>
-  <td>WebSocket APIs (Realtime, Gemini Live)</td>
+  <td>WebSocket APIs (Realtime, Gemini Live*)</td>
   <td class="yes">Built-in ✓</td>
   <td class="no">No</td>
 </tr>

package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "@copilotkit/llmock",
-  "version": "1.3.2",
+  "version": "1.3.3",
   "description": "Deterministic mock LLM server for testing (OpenAI, Anthropic, Gemini)",
   "license": "MIT",
   "packageManager": "pnpm@10.28.2",

src/__tests__/drift/helpers.ts

Lines changed: 80 additions & 0 deletions
@@ -10,6 +10,12 @@
 import http from "node:http";
 import { createServer, type ServerInstance } from "../../server.js";
 import type { Fixture } from "../../types.js";
+import type { WSTestClient } from "../ws-test-client.js";
+import { extractShape, type SSEEventShape } from "./schema.js";
+
+import { classifyGeminiMessage } from "./ws-providers.js";
+
+export { classifyGeminiMessage };

 // ---------------------------------------------------------------------------
 // HTTP helpers
@@ -101,3 +107,77 @@ export async function startDriftServer(): Promise<ServerInstance> {
 export async function stopDriftServer(instance: ServerInstance): Promise<void> {
   await new Promise<void>((r) => instance.server.close(() => r()));
 }
+
+// ---------------------------------------------------------------------------
+// WebSocket helpers
+// ---------------------------------------------------------------------------
+
+export const GEMINI_WS_PATH =
+  "/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent";
+
+/**
+ * Collect mock WS messages until a terminal predicate fires.
+ *
+ * Uses a polling loop on waitForMessages() since ws-test-client doesn't
+ * support predicate-based collection. The `skip` parameter tells us how
+ * many messages have already been consumed so we don't re-read them.
+ *
+ * Throws if the terminal predicate never fires before the timeout expires.
+ */
+export async function collectMockWSMessages(
+  client: WSTestClient,
+  terminal: (msg: unknown) => boolean,
+  timeoutMs = 15000,
+  skip = 0,
+): Promise<{ events: SSEEventShape[]; rawMessages: unknown[] }> {
+  const rawMessages: unknown[] = [];
+  const deadline = Date.now() + timeoutMs;
+  let count = skip;
+  let terminated = false;
+
+  while (Date.now() < deadline) {
+    const nextCount = count + 1;
+    let msgs: string[];
+    try {
+      msgs = await client.waitForMessages(nextCount, Math.min(2000, deadline - Date.now()));
+    } catch (e: unknown) {
+      // Only suppress waitForMessages timeout — rethrow anything else
+      if (e instanceof Error && e.message.includes("Timeout waiting for")) {
+        if (Date.now() >= deadline) break;
+        continue;
+      }
+      throw e;
+    }
+    // Only increment count after successful receipt
+    count = nextCount;
+    const latest = msgs[count - 1];
+    let parsed: unknown;
+    try {
+      parsed = typeof latest === "string" ? JSON.parse(latest) : latest;
+    } catch {
+      throw new Error(
+        `collectMockWSMessages: failed to parse message ${count}: ${String(latest).slice(0, 200)}`,
+      );
+    }
+    rawMessages.push(parsed);
+    if (terminal(parsed)) {
+      terminated = true;
+      break;
+    }
+  }
+
+  if (!terminated) {
+    throw new Error(
+      `collectMockWSMessages timed out after ${timeoutMs}ms without terminal message. ` +
+        `Collected ${rawMessages.length} messages.`,
+    );
+  }
+
+  const events: SSEEventShape[] = rawMessages.map((msg) => {
+    const m = msg as Record<string, any>;
+    const type = m.type ?? classifyGeminiMessage(m as Record<string, unknown>);
+    return { type, dataShape: extractShape(msg) };
+  });
+
+  return { events, rawMessages };
+}
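A usage sketch of the polling pattern above, using a stand-in client. The real `WSTestClient` lives in `ws-test-client.ts`; this fake only mimics the `waitForMessages` contract the helper assumes, and the loop is simplified (no deadline or shape extraction).

```typescript
// Stand-in for ws-test-client's WSTestClient: resolves with the first n
// buffered messages, or throws the same "Timeout waiting for" error shape.
class FakeWSClient {
  constructor(private messages: string[]) {}
  async waitForMessages(n: number, _timeoutMs: number): Promise<string[]> {
    if (this.messages.length >= n) return this.messages.slice(0, n);
    throw new Error(`Timeout waiting for ${n} messages`);
  }
}

// Simplified collect loop: read one message at a time until the
// terminal predicate fires.
async function collect(
  client: FakeWSClient,
  terminal: (msg: unknown) => boolean,
): Promise<unknown[]> {
  const out: unknown[] = [];
  let count = 0;
  for (;;) {
    const msgs = await client.waitForMessages(count + 1, 2000);
    count += 1;
    const parsed: unknown = JSON.parse(msgs[count - 1]);
    out.push(parsed);
    if (terminal(parsed)) return out;
  }
}

async function main() {
  const client = new FakeWSClient([
    JSON.stringify({ type: "response.created" }),
    JSON.stringify({ type: "response.output_text.delta", delta: "Hi" }),
    JSON.stringify({ type: "response.done" }),
  ]);
  const events = await collect(
    client,
    (m) => (m as { type?: string }).type === "response.done",
  );
  console.log(events.length); // 3
}
main();
```

The terminal predicate is what varies per protocol: `response.done` for Responses WS, `response.done`-equivalent session events for Realtime, and `turnComplete` inside `serverContent` for Gemini Live.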

src/__tests__/drift/models.drift.ts

Lines changed: 7 additions & 4 deletions
@@ -72,7 +72,7 @@ describe.skipIf(!process.env.ANTHROPIC_API_KEY)("Anthropic model availability",
   if (referenced.length === 0) return;

   for (const m of referenced) {
-    const found = models.some((available) => available === m || available.startsWith(`${m}`));
+    const found = models.some((available) => available === m || available.startsWith(m));
     expect(found, `Model ${m} no longer available at Anthropic`).toBe(true);
   }
 });
@@ -89,11 +89,14 @@ describe.skipIf(!process.env.GOOGLE_API_KEY)("Gemini model availability", () =>

   if (referenced.length === 0) return;

-  // Skip experimental and live-only models — they're ephemeral
-  const stable = referenced.filter((m) => !m.includes("-exp") && !m.endsWith("-live"));
+  // Skip experimental models, live-only models, and anchor-link fragments
+  // scraped from markdown (e.g., "gemini-live-bidigeneratecontent")
+  const stable = referenced.filter(
+    (m) => !m.includes("-exp") && !m.includes("-live") && !m.includes("bidigeneratecontent"),
+  );

   for (const m of stable) {
-    const found = models.some((available) => available === m || available.startsWith(`${m}`));
+    const found = models.some((available) => available === m || available.startsWith(m));
     expect(found, `Model ${m} no longer available at Gemini`).toBe(true);
   }
 });
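The Gemini filter and prefix check above can be demonstrated on sample identifiers (the model names here are illustrative, not a claim about what the scraper actually collects):

```typescript
// Illustrative referenced names: the filter drops experimental models,
// live-only models, and markdown anchor fragments before checking
// availability against the provider's model listing.
const referenced = [
  "gemini-2.5-flash",
  "gemini-2.0-flash-exp",
  "gemini-2.0-flash-live",
  "gemini-live-bidigeneratecontent",
];

const stable = referenced.filter(
  (m) => !m.includes("-exp") && !m.includes("-live") && !m.includes("bidigeneratecontent"),
);
console.log(stable); // [ 'gemini-2.5-flash' ]

// Prefix matching lets a dated listing entry satisfy an undated reference:
const available = ["gemini-2.5-flash-001"];
const found = available.some((a) => a === "gemini-2.5-flash" || a.startsWith("gemini-2.5-flash"));
console.log(found); // true
```

Switching from `endsWith("-live")` to `includes("-live")` is what excludes the `gemini-live-bidigeneratecontent` anchor fragment, which contains `-live` in the middle rather than at the end.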
