Commit 71d4154

Improve 1B inference quality loop
1 parent f25eb5a commit 71d4154

14 files changed

Lines changed: 648 additions & 68 deletions

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+---
+"@razroo/ray-core": minor
+---
+
+Add JSON repair diagnostics for tiny model classification, task-routing diagnostics, prompt-format benchmark sweeps, scored quality metrics, benchmark history output, and sanitized capability hints.

README.md

Lines changed: 3 additions & 2 deletions
@@ -239,17 +239,18 @@ pnpm benchmark:assert:cx23
 pnpm benchmark:assert:cax11
 pnpm benchmark:assert:cx23:1b
 pnpm benchmark:assert:8gb:1b
+pnpm benchmark:1b:prompt-formats
 ```
 
-Those commands write the latest report to `.ray/benchmarks/` and compare the run against the baseline JSON in `examples/benchmarks/baselines/`. The 1B workload also checks simple output quality signals such as JSON validity, prompt echo, stop-token leakage, and generic email filler.
+Those commands write the latest report to `.ray/benchmarks/`, append JSONL history when configured, and compare the run against the baseline JSON in `examples/benchmarks/baselines/`. The 1B workload also checks scored output quality signals such as JSON validity, prompt echo, stop-token leakage, call-to-action presence, forbidden wrappers, and generic email filler.
 
 For prompt-family quality checks across cold outreach, follow-up, classification, rewrite, and section generation:
 
 ```bash
 pnpm eval:prompt-families:1b
 ```
 
-The structured benchmark output includes provider diagnostics such as prompt format, request shape, model ref, launch preset, slot reuse, cached tokens, and context window so a quality regression can be tied back to the backend path Ray chose.
+The structured benchmark output includes provider diagnostics such as prompt format, request shape, model ref, launch preset, slot reuse, cached tokens, JSON repair attempts, and context window so a quality regression can be tied back to the backend path Ray chose. `/health` also exposes detected backend capabilities, and `/v1/config` includes sanitized capability hints for the configured profile.
 
 ### Quality gate (matches CI)
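
Reviewer note: the README change says runs can append JSONL history, which suggests comparing consecutive runs for the same label. The sketch below shows one way to read such a file and compute the change in average quality score. The record fields (`label`, `timestamp`, `qualityScoreAvg`) are assumptions for illustration; the commit does not show what a history line actually contains.

```typescript
// Hypothetical record shape: the diff does not show the real history schema.
interface HistoryRecord {
  label: string;
  timestamp: string; // assumed ISO-8601, so string order matches time order
  qualityScoreAvg: number;
}

// Parse JSONL text: one JSON object per non-empty line.
function parseHistory(jsonl: string): HistoryRecord[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as HistoryRecord);
}

// Change in average quality score between the two most recent runs of a label.
// Returns undefined when fewer than two runs exist for that label.
function latestScoreDelta(records: HistoryRecord[], label: string): number | undefined {
  const runs = records
    .filter((record) => record.label === label)
    .sort((a, b) => a.timestamp.localeCompare(b.timestamp));
  if (runs.length < 2) {
    return undefined;
  }
  return runs[runs.length - 1].qualityScoreAvg - runs[runs.length - 2].qualityScoreAvg;
}
```

A negative delta for a label such as `hetzner-cx23-1b` would be a prompt to inspect the provider diagnostics of the newer run.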

docs/integrations/razroo-email-ai.md

Lines changed: 4 additions & 3 deletions
@@ -38,8 +38,8 @@ The gateway exposes:
 - `POST /v1/jobs` — async durable submission (same inference fields, plus optional `callbackUrl`). Returns `202 Accepted` and a job location.
 - `GET /v1/jobs/:id` — durable job state and final result/error.
 - `GET /livez` — lightweight unauthenticated liveness for reverse proxies.
-- `GET /health` — detailed queue/provider snapshot, plus `asyncQueue` when enabled. Public profiles require Bearer auth.
-- `GET /v1/config` — non-secret config (sanitized). Public profiles require Bearer auth.
+- `GET /health` — detailed queue/provider snapshot, detected backend capabilities (`applyTemplate`, `chatTemplate`, `jsonMode`, context window, slots), plus `asyncQueue` when enabled. Public profiles require Bearer auth.
+- `GET /v1/config` — non-secret config (sanitized) with capability hints for the configured model/profile. Public profiles require Bearer auth.
 
 With the public profile, a minimal `curl` check is:
 
@@ -63,10 +63,11 @@ Benchmark the 1B email path with:
 ```bash
 pnpm benchmark:assert:cx23:1b
 pnpm benchmark:assert:8gb:1b
+pnpm benchmark:1b:prompt-formats
 pnpm autotune:1b
 ```
 
-The workload in [email-1b-workload.jsonl](../../examples/workloads/email-1b-workload.jsonl) exercises cold outreach, follow-up, reply classification, reply rewrite, and a direct section-generation prompt shaped like the app's product flow. It asserts JSON validity for classification and rejects common prompt echo, stop-token leakage, and generic email filler.
+The workload in [email-1b-workload.jsonl](../../examples/workloads/email-1b-workload.jsonl) exercises cold outreach, follow-up, reply classification, reply rewrite, and a direct section-generation prompt shaped like the app's product flow. It asserts JSON validity for classification and rejects common prompt echo, stop-token leakage, and generic email filler. Benchmark runs can append JSONL history under `.ray/benchmarks/history` so prompt/config changes can be compared over time.
 
 [email-prompt-families-1b.json](../../examples/evals/email-prompt-families-1b.json) is the smaller golden eval set for prompt wording changes. Run it with `pnpm eval:prompt-families:1b` against a live Ray gateway. The output includes provider diagnostics for `promptFormat`, `promptFormatReason`, `modelRef`, `launchPreset`, cached tokens, slot reuse, and context window.
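
Reviewer note: a client could read the detected capabilities from `GET /health` before deciding whether to send a `json_object` request. The sketch below is a guess at how that might look. The capability names come from the doc change above, but the `capabilities` key and the overall payload layout are assumptions, as are the helper names.

```typescript
// Capability names are taken from the doc text; where they sit in the payload is assumed.
interface BackendCapabilities {
  applyTemplate?: boolean;
  chatTemplate?: boolean;
  jsonMode?: boolean;
  contextWindow?: number;
  slots?: number;
}

// Public profiles require Bearer auth on /health and /v1/config.
function authHeaders(token: string): Record<string, string> {
  return { authorization: `Bearer ${token}` };
}

// Hypothetical helper: extract capabilities from a parsed /health body,
// assuming they live under a top-level `capabilities` key.
function readCapabilities(health: unknown): BackendCapabilities {
  if (typeof health !== "object" || health === null) {
    return {};
  }
  const caps = (health as { capabilities?: unknown }).capabilities;
  return typeof caps === "object" && caps !== null ? (caps as BackendCapabilities) : {};
}

// Fetch /health from a running gateway (uses the global fetch in Node 18+).
async function fetchCapabilities(baseUrl: string, token: string): Promise<BackendCapabilities> {
  const response = await fetch(`${baseUrl}/health`, { headers: authHeaders(token) });
  if (!response.ok) {
    throw new Error(`/health returned ${response.status}`);
  }
  return readCapabilities(await response.json());
}
```

Treating a missing capability as `undefined` rather than `false` keeps the client from assuming a feature is absent when the gateway simply did not report it.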

examples/benchmarks/baselines/hetzner-cx23-1b.json

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@
     "maxTtftP95Ms": 2800,
     "minCompletionTokensPerSecondAvg": 8,
     "minQualityPassRate": 80,
+    "minQualityScoreAvg": 80,
     "minValidJsonRate": 100
   },
   "notes": [

examples/benchmarks/baselines/single-node-8gb-1b.json

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@
     "maxTtftP95Ms": 2200,
     "minCompletionTokensPerSecondAvg": 14,
     "minQualityPassRate": 85,
+    "minQualityScoreAvg": 85,
     "minValidJsonRate": 100
   },
   "notes": [
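
Reviewer note: the new `minQualityScoreAvg` floor sits next to the existing thresholds. The sketch below illustrates how a run summary might be checked against them. The threshold keys are the ones in the baseline files; the `RunSummary` field names and the `checkBaseline` helper are hypothetical, since `scripts/benchmark.ts` is not part of the visible diff.

```typescript
// Threshold keys as they appear in the baseline JSON files.
interface BaselineThresholds {
  maxTtftP95Ms?: number;
  minCompletionTokensPerSecondAvg?: number;
  minQualityPassRate?: number;
  minQualityScoreAvg?: number;
  minValidJsonRate?: number;
}

// Hypothetical summary of one benchmark run; the real report shape is not shown.
interface RunSummary {
  ttftP95Ms: number;
  completionTokensPerSecondAvg: number;
  qualityPassRate: number;
  qualityScoreAvg: number;
  validJsonRate: number;
}

// Return one message per violated threshold; an empty array means the run passes.
function checkBaseline(run: RunSummary, thresholds: BaselineThresholds): string[] {
  const failures: string[] = [];
  const requireAtMost = (name: string, actual: number, limit: number | undefined): void => {
    if (limit !== undefined && actual > limit) {
      failures.push(`${name}: ${actual} > ${limit}`);
    }
  };
  const requireAtLeast = (name: string, actual: number, floor: number | undefined): void => {
    if (floor !== undefined && actual < floor) {
      failures.push(`${name}: ${actual} < ${floor}`);
    }
  };

  requireAtMost("ttftP95Ms", run.ttftP95Ms, thresholds.maxTtftP95Ms);
  requireAtLeast(
    "completionTokensPerSecondAvg",
    run.completionTokensPerSecondAvg,
    thresholds.minCompletionTokensPerSecondAvg,
  );
  requireAtLeast("qualityPassRate", run.qualityPassRate, thresholds.minQualityPassRate);
  requireAtLeast("qualityScoreAvg", run.qualityScoreAvg, thresholds.minQualityScoreAvg);
  requireAtLeast("validJsonRate", run.validJsonRate, thresholds.minValidJsonRate);
  return failures;
}
```

Keeping pass rate and average score as separate floors means a run can clear the pass-rate gate and still fail on the average score.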

package.json

Lines changed: 5 additions & 4 deletions
@@ -40,12 +40,13 @@
     "start:hetzner-email-ai:public": "node ./apps/gateway/dist/index.js --config ./examples/config/ray.hetzner-cx23-qwen0.6b.public.json",
     "benchmark": "node --import tsx ./scripts/benchmark.ts",
     "benchmark:email": "node --import tsx ./scripts/benchmark.ts --base-url http://127.0.0.1:3000 --workload ./examples/workloads/email-workload.jsonl --concurrency 2",
-    "benchmark:1b": "node --import tsx ./scripts/benchmark.ts --base-url http://127.0.0.1:3000 --workload ./examples/workloads/email-1b-workload.jsonl --concurrency 1",
-    "eval:prompt-families:1b": "node --import tsx ./scripts/benchmark.ts --base-url http://127.0.0.1:3000 --workload ./examples/evals/email-prompt-families-1b.json --concurrency 1 --requests 5 --label email-prompt-families-1b --output ./.ray/evals/email-prompt-families-1b.latest.json",
+    "benchmark:1b": "node --import tsx ./scripts/benchmark.ts --base-url http://127.0.0.1:3000 --workload ./examples/workloads/email-1b-workload.jsonl --concurrency 1 --history-dir ./.ray/benchmarks/history",
+    "benchmark:1b:prompt-formats": "node --import tsx ./scripts/benchmark.ts --base-url http://127.0.0.1:3000 --workload ./examples/workloads/email-1b-workload.jsonl --concurrency 1 --requests 10 --label email-1b-prompt-formats --prompt-format-sweep --output ./.ray/benchmarks/email-1b-prompt-formats.latest.json --history-dir ./.ray/benchmarks/history",
+    "eval:prompt-families:1b": "node --import tsx ./scripts/benchmark.ts --base-url http://127.0.0.1:3000 --workload ./examples/evals/email-prompt-families-1b.json --concurrency 1 --requests 5 --label email-prompt-families-1b --output ./.ray/evals/email-prompt-families-1b.latest.json --history-dir ./.ray/evals/history",
     "benchmark:assert:cx23": "node --import tsx ./scripts/benchmark.ts --base-url http://127.0.0.1:3000 --workload ./examples/workloads/email-workload.jsonl --concurrency 2 --requests 16 --label hetzner-cx23-sub1b --baseline ./examples/benchmarks/baselines/hetzner-cx23-sub1b.json --assert-baseline --output ./.ray/benchmarks/hetzner-cx23-sub1b.latest.json",
     "benchmark:assert:cax11": "node --import tsx ./scripts/benchmark.ts --base-url http://127.0.0.1:3000 --workload ./examples/workloads/email-workload.jsonl --concurrency 2 --requests 16 --label hetzner-cax11-sub1b --baseline ./examples/benchmarks/baselines/hetzner-cax11-sub1b.json --assert-baseline --output ./.ray/benchmarks/hetzner-cax11-sub1b.latest.json",
-    "benchmark:assert:cx23:1b": "node --import tsx ./scripts/benchmark.ts --base-url http://127.0.0.1:3000 --workload ./examples/workloads/email-1b-workload.jsonl --concurrency 1 --requests 10 --label hetzner-cx23-1b --baseline ./examples/benchmarks/baselines/hetzner-cx23-1b.json --assert-baseline --output ./.ray/benchmarks/hetzner-cx23-1b.latest.json",
-    "benchmark:assert:8gb:1b": "node --import tsx ./scripts/benchmark.ts --base-url http://127.0.0.1:3000 --workload ./examples/workloads/email-1b-workload.jsonl --concurrency 2 --requests 16 --label single-node-8gb-1b --baseline ./examples/benchmarks/baselines/single-node-8gb-1b.json --assert-baseline --output ./.ray/benchmarks/single-node-8gb-1b.latest.json",
+    "benchmark:assert:cx23:1b": "node --import tsx ./scripts/benchmark.ts --base-url http://127.0.0.1:3000 --workload ./examples/workloads/email-1b-workload.jsonl --concurrency 1 --requests 10 --label hetzner-cx23-1b --baseline ./examples/benchmarks/baselines/hetzner-cx23-1b.json --assert-baseline --output ./.ray/benchmarks/hetzner-cx23-1b.latest.json --history-dir ./.ray/benchmarks/history",
+    "benchmark:assert:8gb:1b": "node --import tsx ./scripts/benchmark.ts --base-url http://127.0.0.1:3000 --workload ./examples/workloads/email-1b-workload.jsonl --concurrency 2 --requests 16 --label single-node-8gb-1b --baseline ./examples/benchmarks/baselines/single-node-8gb-1b.json --assert-baseline --output ./.ray/benchmarks/single-node-8gb-1b.latest.json --history-dir ./.ray/benchmarks/history",
     "autotune:hetzner-email-ai": "node --import tsx ./scripts/benchmark.ts --autotune --config ./examples/config/ray.hetzner-cx23-qwen0.6b.json --workload ./examples/workloads/email-workload.jsonl --concurrency 2 --requests 16",
     "autotune:1b": "node --import tsx ./scripts/benchmark.ts --autotune --config ./examples/config/ray.1b.json --workload ./examples/workloads/email-1b-workload.jsonl --concurrency 1 --requests 10",
     "autotune:1b:8gb": "node --import tsx ./scripts/benchmark.ts --autotune --config ./examples/config/ray.1b.8gb.json --workload ./examples/workloads/email-1b-workload.jsonl --concurrency 2 --requests 16",

packages/config/src/defaults.test.ts

Lines changed: 5 additions & 0 deletions
@@ -94,7 +94,12 @@ test("sanitizeConfig redacts upstream adapter headers", () => {
         headers: Record<string, string>;
       };
     };
+    capabilityHints: {
+      modelId: string;
+      operational?: unknown;
+    };
   };
 
   assert.equal(safe.model.adapter.headers.authorization, "[redacted]");
+  assert.equal(safe.capabilityHints.modelId, "qwen2.5-3b-instruct-q4");
 });

packages/config/src/index.ts

Lines changed: 25 additions & 1 deletion
@@ -641,6 +641,27 @@ export async function loadRayConfig(options: LoadRayConfigOptions = {}): Promise
 
 export function sanitizeConfig(config: RayConfig): Record<string, unknown> {
   const safe = structuredClone(config) as RayConfig;
+  const capabilityHints = {
+    profile: safe.profile,
+    modelId: safe.model.id,
+    family: safe.model.family,
+    quantization: safe.model.quantization,
+    contextWindow: safe.model.contextWindow,
+    maxOutputTokens: safe.model.maxOutputTokens,
+    operational: safe.model.operational,
+    ...(safe.model.adapter.kind === "llama.cpp"
+      ? {
+          llamaCpp: {
+            modelRef: safe.model.adapter.modelRef,
+            launchPreset: safe.model.adapter.launchProfile?.preset,
+            ctxSize: safe.model.adapter.launchProfile?.ctxSize,
+            parallel: safe.model.adapter.launchProfile?.parallel,
+            cacheRamMiB: safe.model.adapter.launchProfile?.cacheRamMiB,
+            cachePrompt: safe.model.adapter.cachePrompt,
+          },
+        }
+      : {}),
+  };
 
   if (
     (safe.model.adapter.kind === "openai-compatible" || safe.model.adapter.kind === "llama.cpp") &&
@@ -651,7 +672,10 @@ export function sanitizeConfig(config: RayConfig): Record<string, unknown> {
     );
   }
 
-  return safe as unknown as Record<string, unknown>;
+  return {
+    ...(safe as unknown as Record<string, unknown>),
+    capabilityHints,
+  };
 }
 
 export { createDefaultConfig, mergeConfig };
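
Reviewer note: the pattern in this hunk is "clone, derive hints, redact, then return the clone plus hints". The reduced sketch below isolates that pattern so the conditional `llamaCpp` spread is easier to see. `MiniConfig` and `sanitizeMini` are hypothetical stand-ins, not the real `RayConfig` or `sanitizeConfig`, and redacting every header value is an assumption based only on the test name and its `authorization` assertion.

```typescript
// Hypothetical, heavily reduced config shape for illustration only.
interface MiniConfig {
  profile: string;
  model: {
    id: string;
    contextWindow: number;
    adapter:
      | { kind: "llama.cpp"; modelRef?: string; headers?: Record<string, string> }
      | { kind: "mock" };
  };
}

function sanitizeMini(config: MiniConfig): Record<string, unknown> {
  // Work on a deep copy so the caller's config is never mutated.
  const safe = structuredClone(config);
  const adapter = safe.model.adapter;

  // Hints are derived from non-secret fields only; the llamaCpp block
  // appears solely when the adapter is llama.cpp.
  const capabilityHints = {
    profile: safe.profile,
    modelId: safe.model.id,
    contextWindow: safe.model.contextWindow,
    ...(adapter.kind === "llama.cpp" ? { llamaCpp: { modelRef: adapter.modelRef } } : {}),
  };

  // Assumed redaction rule: replace every header value with a placeholder.
  if (adapter.kind === "llama.cpp" && adapter.headers) {
    adapter.headers = Object.fromEntries(
      Object.keys(adapter.headers).map((name) => [name, "[redacted]"]),
    );
  }

  return { ...safe, capabilityHints };
}
```

Because the hints copy only identifiers and sizes, they never need redaction themselves.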

packages/core/src/types.ts

Lines changed: 10 additions & 0 deletions
@@ -365,6 +365,8 @@ export interface ProviderDiagnostics {
   requestShape?: "openai-chat" | "llama.cpp-completion";
   promptFormat?: "llama.cpp-template" | "prompt-scaffold" | "ray-chat-fallback";
   promptFormatReason?: string;
+  jsonRepairAttempted?: boolean;
+  jsonRepairSucceeded?: boolean;
   modelRef?: string;
   backendModel?: string;
   launchPreset?: string;
@@ -454,10 +456,18 @@ export interface LearnedOutputCapDiagnostics {
   percentile: number;
 }
 
+export interface TaskRoutingDiagnostics {
+  taskKind: "classification" | "rewrite" | "draft" | "unknown";
+  recommendedModelRole: "classifier" | "drafter" | "general";
+  activeModelRole?: string;
+  matchedActiveRole: boolean;
+}
+
 export interface InferenceDiagnostics {
   promptCompiler?: PromptCompilerDiagnostics;
   learnedOutputCap?: LearnedOutputCapDiagnostics;
   adaptiveTuning?: AdaptiveTuningDiagnostics;
   taskRouting?: TaskRoutingDiagnostics;
   provider?: ProviderDiagnostics;
 }
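
Reviewer note: the new `TaskRoutingDiagnostics` type implies a mapping from task kind to recommended model role. The sketch below shows one plausible way to populate it. The interface mirrors the one added in this hunk, but the mapping itself (classification to classifier, rewrite and draft to drafter, unknown to general) and both helper names are assumptions; the code that fills this type in is not part of the visible diff.

```typescript
type TaskKind = "classification" | "rewrite" | "draft" | "unknown";
type ModelRole = "classifier" | "drafter" | "general";

// Mirrors the interface added in packages/core/src/types.ts.
interface TaskRoutingDiagnostics {
  taskKind: TaskKind;
  recommendedModelRole: ModelRole;
  activeModelRole?: string;
  matchedActiveRole: boolean;
}

// Assumed mapping from task kind to the role best suited to it.
function recommendRole(taskKind: TaskKind): ModelRole {
  switch (taskKind) {
    case "classification":
      return "classifier";
    case "rewrite":
    case "draft":
      return "drafter";
    default:
      return "general";
  }
}

// Hypothetical builder: records whether the active model's role matches the recommendation.
function buildTaskRouting(taskKind: TaskKind, activeModelRole?: string): TaskRoutingDiagnostics {
  const recommendedModelRole = recommendRole(taskKind);
  return {
    taskKind,
    recommendedModelRole,
    ...(activeModelRole !== undefined ? { activeModelRole } : {}),
    matchedActiveRole: activeModelRole === recommendedModelRole,
  };
}
```

In this sketch an unset `activeModelRole` always yields `matchedActiveRole: false`; whether the real implementation treats that case as a mismatch is not visible here.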

packages/models/src/llama-cpp.test.ts

Lines changed: 83 additions & 0 deletions
@@ -471,6 +471,89 @@ test("llama.cpp provider falls back to chat completions for json_object requests
   assert.ok(!seenPaths.includes("/completion"));
 });
 
+test("llama.cpp provider repairs invalid json_object chat responses once", async (t) => {
+  let chatCalls = 0;
+
+  const server = createServer(async (request, response) => {
+    if (request.url === "/apply-template") {
+      response.writeHead(200, { "content-type": "application/json" });
+      response.end(JSON.stringify({ prompt: "<s>json prompt" }));
+      return;
+    }
+
+    if (request.url === "/tokenize") {
+      response.writeHead(200, { "content-type": "application/json" });
+      response.end(JSON.stringify({ tokens: [1, 2, 3] }));
+      return;
+    }
+
+    if (request.url === "/v1/chat/completions") {
+      chatCalls += 1;
+      response.writeHead(200, { "content-type": "application/json" });
+      response.end(
+        JSON.stringify({
+          choices: [
+            {
+              message: {
+                content: chatCalls === 1 ? "intent: positive" : '{"intent":"positive"}',
+              },
+            },
+          ],
+          usage: {
+            prompt_tokens: chatCalls === 1 ? 3 : 5,
+            completion_tokens: chatCalls === 1 ? 4 : 2,
+            total_tokens: chatCalls === 1 ? 7 : 7,
+          },
+        }),
+      );
+      return;
+    }
+
+    response.writeHead(404);
+    response.end();
+  });
+
+  await new Promise<void>((resolve) => server.listen(0, "127.0.0.1", resolve));
+  t.after(() => server.close());
+
+  const address = server.address();
+  if (!address || typeof address === "string") {
+    throw new Error("Expected a TCP server address");
+  }
+
+  const model = createModel(`http://127.0.0.1:${address.port}`, 500);
+  const provider = new LlamaCppProvider(model, model.adapter as LlamaCppProviderConfig);
+  const context = createContext(model, new AbortController().signal);
+  const request = {
+    input: "Classify the reply",
+    system: "Return only compact JSON.",
+    maxTokens: 64,
+    temperature: 0.2,
+    topP: 0.95,
+    cache: true,
+    metadata: {},
+    responseFormat: {
+      type: "json_object" as const,
+    },
+  };
+
+  const preparation = await provider.prepare(request, context);
+  const result = await provider.infer(request, {
+    ...context,
+    preparation,
+  });
+
+  assert.equal(result.output, '{"intent":"positive"}');
+  assert.equal(chatCalls, 2);
+  assert.equal(result.diagnostics?.jsonRepairAttempted, true);
+  assert.equal(result.diagnostics?.jsonRepairSucceeded, true);
+  assert.deepEqual(result.usage?.tokens, {
+    prompt: 8,
+    completion: 6,
+    total: 14,
+  });
+});
+
 test("llama.cpp provider degrades gracefully when slot snapshots time out", async (t) => {
   const seenPaths: string[] = [];
