Commit e75918a

Add live API drift detection + fix missing refusal field (#33)
## Summary

- **Bug fix**: OpenAI Chat Completions responses now include `refusal: null` — a field both the SDK and real API return that llmock was omitting. Conformance and unit tests updated to assert the field.
- **New feature**: Three-layer drift detection test suite that triangulates between SDK types, real API responses, and llmock output to catch response shape drift across all 4 providers (OpenAI Chat, OpenAI Responses, Anthropic Claude, Google Gemini)
- **CI**: Weekly GitHub Actions workflow for automated drift checks + manual trigger

## Details

19 drift tests across 5 files:

- 16 shape comparison tests (4 providers × 4 scenarios: non-streaming text/tool, streaming text/tool)
- 3 model deprecation checks (one per API key: OpenAI, Anthropic, Google)

Key robustness features:

- All provider functions fail fast on non-2xx responses with status code + body in the error message
- All streaming tests assert events were actually received (no silent pass on zero events)
- SSE parsers handle `\r\n` line endings and continuation lines (Gemini sends wrapped JSON)
- Retry with exponential backoff on 429/500/502/503
- `ping` and other transport-level SSE events classified as `info`, not `critical`
- Known intentional differences (usage fields, system_fingerprint, etc.) allowlisted

The refusal bug was discovered by running the drift tests against real APIs — exactly the value prop.

See [DRIFT.md](DRIFT.md) for full documentation.

## Test plan

- [x] `pnpm test` — 540/540 existing tests pass (including new refusal assertions)
- [x] `pnpm test:drift` with all 3 API keys — 19/19 pass
- [x] `pnpm test:drift` without keys — 19 tests skip gracefully
- [x] Prettier + ESLint clean
- [x] 4 rounds of code review (code-reviewer, silent-failure-hunter, code-simplifier, comment-analyzer, pr-test-analyzer, type-design-analyzer) — all clean

🤖 Generated with [Claude Code](https://claude.com/claude-code)
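The SSE handling mentioned in the commit message can be illustrated with a minimal sketch. This is not the parser from this commit: `parseSse` and `SseEvent` are hypothetical names, and treating an unprefixed line as a continuation of the previous `data:` line is an assumption about what "continuation lines" means here.

```typescript
interface SseEvent {
  event: string; // "message" when the block carries no event: field
  data: string;
}

// Hypothetical helper: split a raw SSE payload into events.
function parseSse(raw: string): SseEvent[] {
  const events: SseEvent[] = [];
  // Normalize \r\n (and bare \r) so a single split handles every line-ending style.
  const blocks = raw.replace(/\r\n?/g, "\n").split(/\n{2,}/);
  for (const block of blocks) {
    let event = "message";
    const data: string[] = [];
    for (const line of block.split("\n")) {
      if (line === "" || line.startsWith(":")) continue; // blank line or SSE comment
      if (/^(id|retry):/.test(line)) continue; // fields this sketch ignores
      if (line.startsWith("event:")) {
        event = line.slice("event:".length).trim();
      } else if (line.startsWith("data:")) {
        // Drop the prefix and at most one leading space.
        data.push(line.slice("data:".length).replace(/^ /, ""));
      } else if (data.length > 0) {
        // Assumption: an unprefixed line continues the previous data line.
        data[data.length - 1] += "\n" + line;
      }
    }
    if (data.length > 0) events.push({ event, data: data.join("\n") });
  }
  return events;
}
```

A streaming test built on a parser like this could then assert that the returned array is non-empty before comparing shapes, so a stream that yields zero events cannot pass silently.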
2 parents 63b5892 + 7a961f8 commit e75918a

20 files changed

Lines changed: 2859 additions & 3 deletions

.github/workflows/test-drift.yml

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
name: Drift Tests
on:
  schedule:
    - cron: "0 6 * * 1" # Weekly Monday 6am UTC
  workflow_dispatch: # Manual trigger
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      - run: pnpm test:drift
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -1,5 +1,14 @@
 # @copilotkit/llmock
 
+## 1.3.2
+
+### Patch Changes
+
+- Fix missing `refusal` field on OpenAI Chat Completions responses — both the SDK and real API return `refusal: null` on non-refusal messages, but llmock was omitting it
+- Live API drift detection test suite: three-layer triangulation between SDK types, real API responses, and llmock output across OpenAI (Chat + Responses), Anthropic Claude, and Google Gemini
+- Weekly CI workflow for automated drift checks
+- `DRIFT.md` documentation for the drift detection system
+
 ## 1.3.1
 
 ### Patch Changes

DRIFT.md

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
# Live API Drift Detection

llmock produces responses shaped like real LLM APIs. Providers change their APIs over time. **Drift** means the mock no longer matches reality — your tests pass against llmock but break against the real API.

## Three-Layer Approach

Drift detection compares three independent sources to triangulate the cause of any mismatch:

| SDK types = Real API? | Real API = llmock? | Diagnosis                                                            |
| --------------------- | ------------------ | -------------------------------------------------------------------- |
| Yes                   | No                 | **llmock drift** — response builders need updating                   |
| No                    | No                 | **Provider changed before SDK update** — flag, wait for SDK catch-up |
| Yes                   | Yes                | **No drift** — all clear                                             |
| No                    | Yes                | **SDK drift** — provider deprecated something SDK still references   |

Two-way comparison (mock vs real) can't distinguish between "we need to fix llmock" and "the SDK hasn't caught up yet." Three-way comparison can.
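The decision table above can be read as a small pure function. The sketch below is only an illustration of that table; the `Diagnosis` labels and the `diagnose` name are hypothetical and are not taken from the drift test suite.

```typescript
type Diagnosis =
  | "no-drift"
  | "llmock-drift"
  | "provider-ahead-of-sdk"
  | "sdk-drift";

// Hypothetical helper mirroring the four rows of the triangulation table.
function diagnose(sdkMatchesReal: boolean, realMatchesMock: boolean): Diagnosis {
  if (sdkMatchesReal) {
    // SDK and real API agree, so any mismatch is on the llmock side.
    return realMatchesMock ? "no-drift" : "llmock-drift";
  }
  // SDK and real API disagree: either the SDK is stale or the provider moved first.
  return realMatchesMock ? "sdk-drift" : "provider-ahead-of-sdk";
}
```

With only the second argument (mock vs real), the `llmock-drift` and `provider-ahead-of-sdk` cases would be indistinguishable, which is the point the table makes.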

## Running Drift Tests

```bash
# All providers (requires all three API keys)
OPENAI_API_KEY=sk-... ANTHROPIC_API_KEY=sk-... GOOGLE_API_KEY=... pnpm test:drift

# Single provider (others skip automatically)
OPENAI_API_KEY=sk-... pnpm test:drift

# Strict mode — warnings also fail
STRICT_DRIFT=1 OPENAI_API_KEY=sk-... pnpm test:drift
```

Required environment variables:

- `OPENAI_API_KEY` — OpenAI API key
- `ANTHROPIC_API_KEY` — Anthropic API key
- `GOOGLE_API_KEY` — Google AI API key

Each provider's tests skip independently if its key is not set. You can run drift tests for just one provider.
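One way to picture the per-provider skipping is a helper that maps the environment to the set of providers whose tests should run. This is a sketch under assumptions: `enabledProviders` and `PROVIDER_KEYS` are hypothetical names, and treating an empty or whitespace-only value as "not set" is a choice made here, not something the document specifies.

```typescript
type Env = Record<string, string | undefined>;

// Environment variable names come from the list above; the mapping itself is illustrative.
const PROVIDER_KEYS = {
  openai: "OPENAI_API_KEY",
  anthropic: "ANTHROPIC_API_KEY",
  google: "GOOGLE_API_KEY",
} as const;

type Provider = keyof typeof PROVIDER_KEYS;

// Hypothetical helper: which providers have a usable key in this environment?
function enabledProviders(env: Env): Provider[] {
  return (Object.keys(PROVIDER_KEYS) as Provider[]).filter((provider) => {
    const value = env[PROVIDER_KEYS[provider]];
    // Assumption: empty or whitespace-only values count as unset.
    return value !== undefined && value.trim() !== "";
  });
}
```

In a Vitest suite, a result like this could feed `describe.skipIf(...)` so that each provider's block is skipped rather than failed when its key is absent.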

## Reading Results

### Severity levels

- **critical** — Test fails. llmock produces a different shape than the real API for a field that both the SDK and real API agree on. This means llmock needs an update.
- **warning** — Test passes (unless `STRICT_DRIFT=1`). The real API has a field that neither the SDK nor llmock knows about, or the SDK and real API disagree. Usually means a provider added something new.
- **info** — Always passes. Known intentional differences (usage fields are always zero, optional fields llmock omits, etc.).
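The pass/fail rules for the three severity levels can be sketched as a single predicate. The names `DriftFinding`, `isStrict`, and `shouldFail` are hypothetical, and checking `STRICT_DRIFT` for the exact string `"1"` is an assumption based on the usage shown earlier.

```typescript
type Severity = "critical" | "warning" | "info";

interface DriftFinding {
  severity: Severity;
  path: string;
}

// Assumption: strict mode is on only when STRICT_DRIFT is exactly "1".
function isStrict(env: Record<string, string | undefined>): boolean {
  return env["STRICT_DRIFT"] === "1";
}

// critical always fails; warning fails only in strict mode; info never fails.
function shouldFail(findings: DriftFinding[], strict: boolean): boolean {
  return findings.some(
    (f) => f.severity === "critical" || (strict && f.severity === "warning"),
  );
}
```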

### Example report output

```
API DRIFT DETECTED: OpenAI Chat Completions (non-streaming text)

  1. [critical] LLMOCK DRIFT — field in SDK + real API but missing from mock
     Path: usage.completion_tokens_details
     SDK: object { reasoning_tokens: number }
     Real: object { reasoning_tokens: number, accepted_prediction_tokens: number }
     Mock: <absent>

  2. [warning] PROVIDER ADDED FIELD — in real API but not in SDK or mock
     Path: system_fingerprint
     SDK: <absent>
     Real: string
     Mock: <absent>

  3. [info] MOCK EXTRA FIELD — in mock but not in real API
     Path: choices[0].logprobs
     SDK: null | object
     Real: <absent>
     Mock: null
```
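To make the `Path:` entries in a report like this concrete, here is a minimal sketch of shape extraction and comparison. It is not the suite's actual implementation: `Shape`, `extractShape`, and `missingPaths` are hypothetical names, arrays are summarized by their first element only, and only fields missing from the mock are reported.

```typescript
type Shape = string | Shape[] | { [key: string]: Shape };

// Reduce a JSON value to its structure: type names instead of concrete values.
function extractShape(value: unknown): Shape {
  if (value === null) return "null";
  if (Array.isArray(value)) return value.length > 0 ? [extractShape(value[0])] : [];
  if (typeof value === "object") {
    const out: { [key: string]: Shape } = {};
    for (const key of Object.keys(value as Record<string, unknown>)) {
      out[key] = extractShape((value as Record<string, unknown>)[key]);
    }
    return out;
  }
  return typeof value;
}

function isRecord(shape: Shape | undefined): shape is { [key: string]: Shape } {
  return typeof shape === "object" && !Array.isArray(shape);
}

// List the paths that exist in the real shape but are absent from the mock shape.
function missingPaths(real: Shape, mock: Shape | undefined, path = ""): string[] {
  if (mock === undefined) return [path];
  if (Array.isArray(real)) {
    if (real.length === 0 || !Array.isArray(mock) || mock.length === 0) return [];
    return missingPaths(real[0], mock[0], `${path}[0]`);
  }
  if (!isRecord(real) || !isRecord(mock)) return [];
  const missing: string[] = [];
  for (const key of Object.keys(real)) {
    const childPath = path === "" ? key : `${path}.${key}`;
    missing.push(...missingPaths(real[key], mock[key], childPath));
  }
  return missing;
}
```

Running the same comparison with the arguments swapped would surface the "mock extra field" case.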

## Fixing Detected Drift

When a `critical` drift is detected:

1. **Identify the response builder** — the report path tells you which provider and field:
   - OpenAI Chat Completions → `src/helpers.ts` (`buildTextCompletion`, `buildToolCallCompletion`, `buildTextChunks`, `buildToolCallChunks`)
   - OpenAI Responses API → `src/responses.ts` (`buildTextResponse`, `buildToolCallResponse`, `buildTextStreamEvents`, `buildToolCallStreamEvents`)
   - Anthropic Claude → `src/messages.ts` (`buildClaudeTextResponse`, `buildClaudeToolCallResponse`, `buildClaudeTextStreamEvents`, `buildClaudeToolCallStreamEvents`)
   - Google Gemini → `src/gemini.ts` (`buildGeminiTextResponse`, `buildGeminiToolCallResponse`, `buildGeminiTextStreamChunks`, `buildGeminiToolCallStreamChunks`)

2. **Update the builder** — add or modify the field to match the real API shape.

3. **Run conformance tests** with `pnpm test` to verify existing API conformance tests still pass.

4. **Run drift tests** with `pnpm test:drift` to verify the drift is resolved.
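As an illustration of step 2, the `refusal` fix in this commit amounts to always emitting the field with a `null` value on non-refusal messages. The sketch below is hypothetical: `AssistantMessageSketch` and `buildAssistantMessage` are illustrative names, not the real builders in `src/helpers.ts`, whose signatures are not shown here.

```typescript
// Simplified message type for illustration; the real response carries more fields.
interface AssistantMessageSketch {
  role: "assistant";
  content: string | null;
  refusal: string | null;
}

// Hypothetical builder: the key point is that `refusal` is present and null,
// not omitted, on a normal (non-refusal) assistant message.
function buildAssistantMessage(content: string): AssistantMessageSketch {
  return { role: "assistant", content, refusal: null };
}
```

Because `JSON.stringify` keeps `null` properties but drops `undefined` ones, setting the field to `null` explicitly is what makes it appear on the wire.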

## Model Deprecation

The `models.drift.ts` test scrapes model names referenced in llmock's test files, README, and fixtures, then checks each provider's model listing API to verify they still exist.

When a model is deprecated:

1. Update the model name in the affected test files and fixtures
2. Update `src/__tests__/drift/providers.ts` if the cheap test model changed
3. Run `pnpm test` and `pnpm test:drift`
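The scrape-then-check idea can be sketched with two pure functions. This is an assumption-laden illustration, not the code in `models.drift.ts`: the function names and the model-name regular expression are hypothetical, and stripping a `models/` prefix reflects how Gemini's listing is understood to name models rather than anything stated in this document.

```typescript
// Hypothetical pattern: names starting with a known provider prefix.
const MODEL_PATTERN = /\b(?:gpt|claude|gemini)-[\w.-]+\b/g;

// Collect the distinct model names mentioned in a piece of text.
function extractModelNames(text: string): string[] {
  const matches = text.match(MODEL_PATTERN) || [];
  // Keep the first occurrence of each name, preserving order.
  return matches.filter((name, index) => matches.indexOf(name) === index);
}

// Return the referenced models that the provider's listing no longer contains.
function findMissingModels(referenced: string[], listed: string[]): string[] {
  // Assumption: some listings prefix names with "models/"; normalize before comparing.
  const available = listed.map((name) => name.replace(/^models\//, ""));
  return referenced.filter((name) => available.indexOf(name) === -1);
}
```

A deprecation check would then fail when `findMissingModels` returns a non-empty list for any provider.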

## Adding a New Provider

1. Add the provider's SDK as a devDependency in `package.json`
2. Add shape extraction functions to `src/__tests__/drift/sdk-shapes.ts`
3. Add raw fetch client functions to `src/__tests__/drift/providers.ts`
4. Create `src/__tests__/drift/<provider>.drift.ts` with 4 test scenarios
5. Add model listing function to `providers.ts` and model check to `models.drift.ts`
6. Update the allowlist in `schema.ts` if needed
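For step 3, the commit message says provider functions fail fast on non-2xx responses and retry with exponential backoff on 429/500/502/503. A minimal sketch of that behavior follows. Everything here is hypothetical: the names, the 500 ms base delay, and the limit of 3 retries are assumptions, and the HTTP call is injected as `send` so the sketch does not depend on any particular fetch API.

```typescript
const RETRYABLE_STATUSES = [429, 500, 502, 503];

function isRetryable(status: number): boolean {
  return RETRYABLE_STATUSES.indexOf(status) !== -1;
}

// Exponential backoff: baseMs, 2*baseMs, 4*baseMs, ... (base value is an assumption).
function backoffDelayMs(attempt: number, baseMs = 500): number {
  return baseMs * 2 ** attempt;
}

interface HttpResult {
  status: number;
  body: string;
}

// Hypothetical wrapper: `send` performs one HTTP request and resolves with status + body.
async function requestWithRetry(
  send: () => Promise<HttpResult>,
  maxRetries = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<string> {
  for (let attempt = 0; ; attempt++) {
    const { status, body } = await send();
    if (status >= 200 && status < 300) return body;
    if (!isRetryable(status) || attempt >= maxRetries) {
      // Fail fast with status code and body so the cause is visible in test output.
      throw new Error(`HTTP ${status}: ${body}`);
    }
    await sleep(backoffDelayMs(attempt));
  }
}
```

Injecting `sleep` keeps the wrapper testable without real delays; a new provider's client function would pass a `send` that wraps its own request.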

## CI Schedule

Drift tests run on a schedule:

- **Weekly**: Monday 6:00 AM UTC
- **Manual**: Trigger via GitHub Actions UI (`workflow_dispatch`)
- **NOT** on PR or push — these tests hit real APIs and cost money

See `.github/workflows/test-drift.yml`.

## Cost

~20 API calls per run using the cheapest available models (`gpt-4o-mini`, `claude-haiku-4-5-20251001`, `gemini-2.5-flash`) with 10-100 max tokens each. Under $0.01/week.

README.md

Lines changed: 1 addition & 1 deletion
@@ -673,7 +673,7 @@ Areas where llmock could grow, and explicit non-goals for the current scope.
 
 ### Testing
 
-- **Live API conformance**: The `api-conformance` tests validate response format structure but do not run against real LLM APIs. A subset of tests that hit actual OpenAI/Anthropic/Gemini endpoints (gated behind API keys) would catch format drift as providers evolve their APIs.
+- **Live API drift detection**: The `drift` test suite runs against real OpenAI, Anthropic, and Gemini APIs to catch response format drift. See [DRIFT.md](DRIFT.md) for details on the three-layer triangulation approach, how to run tests, and how to fix detected drift. Runs weekly in CI; requires API keys.
 - **Token counts**: Usage fields are always zero across all providers.
 - **Vision/image content**: Image content parts are not handled by any provider.

package.json

Lines changed: 5 additions & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "@copilotkit/llmock",
-  "version": "1.3.1",
+  "version": "1.3.2",
   "description": "Deterministic mock LLM server for testing (OpenAI, Anthropic, Gemini)",
   "license": "MIT",
   "packageManager": "pnpm@10.28.2",
@@ -36,6 +36,7 @@
   "scripts": {
     "build": "tsdown",
     "test": "vitest run",
+    "test:drift": "vitest run --config vitest.config.drift.ts",
     "test:exports": "publint && attw --pack .",
     "lint": "eslint .",
     "format:check": "prettier --check .",
@@ -60,6 +61,9 @@
     "tsdown": "^0.12.5",
     "typescript": "^5.8.3",
     "typescript-eslint": "^8.35.1",
+    "@anthropic-ai/sdk": "^0.78.0",
+    "@google/generative-ai": "^0.24.0",
+    "openai": "^4.0.0",
     "vitest": "^3.2.1"
   }
 }
