Commit ca9c8b2

update persona model
1 parent 3ab1c0d commit ca9c8b2

8 files changed

Lines changed: 49 additions & 3 deletions

docs/generated/workspace-inventory.md

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,6 @@
 # Workspace Inventory

-Generated: 2026-05-15T07:43:59.795Z
+Generated: 2026-05-15T12:20:13.570Z

 ```text
 AGENTS.md
@@ -446,6 +446,7 @@ src/shared/utils/
 src/shared/utils/json.ts
 src/shared/utils/logging.ts
 src/shared/utils/safe-static-path.ts
+src/shared/utils/scoring.ts
 src/shared/utils/secret-cipher.ts
 src/shared/utils/template.ts
 tests/

docs/product-specs/current-state.md

Lines changed: 5 additions & 1 deletion
@@ -1,6 +1,6 @@
 # Current State

-Last validated against `platform.md`: 2026-04-17
+Last validated against `platform.md`: 2026-05-15

 ## Implemented scenarios

@@ -10,6 +10,7 @@ Last validated against `platform.md`: 2026-04-17
 - [x] List command shows available scenarios
 - [x] Dry-run mode records intent without contacting external systems
 - [x] Judge requests preserve cache-friendly prompt prefixes
+- [x] Persona simulation uses a configurable default model with hidden reasoning
 - [x] Parallel mode overlaps scenario execution while preserving ordering
 - [ ] Multi-session memory scenarios preserve pinned identity and session controls
 - [ ] AutoGPT preset forges auth tokens internally
@@ -52,6 +53,9 @@ Last validated against `platform.md`: 2026-04-17
 - Judge-model requests now preserve a stable rubric-first prefix, add a stable
 prompt cache key, and enable supported provider caching on the OpenRouter
 Responses path.
+- Persona simulator requests default to `deepseek/deepseek-v4-flash` unless a
+persona-level `model` or `AGENTPROBE_PERSONA_MODEL` override is present, and
+they use medium reasoning effort while excluding reasoning from responses.
 - The OpenClaw CLI surface is implemented behind websocket endpoint presets and
 can create sessions, send chat turns, and read session history.
 - `bun run fast-feedback` now refreshes generated docs and quality score before
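
Reviewer note: the model-resolution rule described in the `current-state.md` hunk above can be sketched roughly as follows. This is a minimal illustration, not the body of `resolvePersonaModel` in `simulator.ts`, which this diff does not show. The `PersonaLike` type, the `resolvePersonaModelSketch` name, the explicit `env` parameter, the whitespace trimming, and the precedence order (persona-level `model` first, then `AGENTPROBE_PERSONA_MODEL`) are all assumptions made for the example.

```typescript
// Hypothetical persona shape: only the optional `model` field matters here.
type PersonaLike = { model?: string };

// Default taken from the diff; everything else in this block is illustrative.
const DEFAULT_PERSONA_MODEL = "deepseek/deepseek-v4-flash";

// Assumed precedence: persona-level model, then env override, then default.
// The environment is passed in explicitly so the function stays pure.
function resolvePersonaModelSketch(
  persona: PersonaLike,
  env: Record<string, string | undefined>,
): string {
  const fromPersona = persona.model?.trim();
  if (fromPersona) return fromPersona;

  const fromEnv = env["AGENTPROBE_PERSONA_MODEL"]?.trim();
  if (fromEnv) return fromEnv;

  return DEFAULT_PERSONA_MODEL;
}

// No overrides: falls back to the default model.
console.log(resolvePersonaModelSketch({}, {}));
// Env override only: the env value is used.
console.log(
  resolvePersonaModelSketch({}, { AGENTPROBE_PERSONA_MODEL: "env-persona-model" }),
);
```

Passing the environment as an argument keeps the sketch easy to test without mutating `process.env`, which the real unit test does directly.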

docs/product-specs/e2e-checklist.md

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ Derived from `platform.md`. Every scenario should have a coverage owner.
 | List command shows available scenarios | `tests/e2e/cli.e2e.test.ts` | ⏳ planned |
 | Dry-run mode records intent without contacting external systems | `tests/e2e/cli.e2e.test.ts` | ✅ covered |
 | Judge requests preserve cache-friendly prompt prefixes | `tests/unit/judge.test.ts` | ✅ covered |
+| Persona simulation uses a configurable default model with hidden reasoning | `tests/unit/simulator.test.ts` | ✅ covered |
 | Parallel mode overlaps scenario execution while preserving ordering | `tests/e2e/cli.e2e.test.ts` | ✅ covered |
 | Multi-session memory scenarios preserve pinned identity and session controls | `tests/unit/runner.test.ts` + `tests/unit/scenario-parsing.test.ts` | ⏳ planned |
 | AutoGPT preset forges auth tokens internally | `tests/unit/autogpt-auth.test.ts` + `tests/unit/adapters.test.ts` | ⏳ expanding |

docs/product-specs/platform.md

Lines changed: 9 additions & 0 deletions
@@ -58,6 +58,15 @@ judge-model calls
 pushes transcript-specific content to the tail, and enables supported provider
 prompt caching without changing the scoring contract

+### Persona simulation uses a configurable default model with hidden reasoning
+
+**Given** a persona without an explicit `model` field and no
+`AGENTPROBE_PERSONA_MODEL` override
+**When** AgentProbe simulates the next persona turn
+**Then** the CLI sends the simulator request with
+`deepseek/deepseek-v4-flash` as the default model, medium reasoning effort, and
+reasoning excluded from the response
+
 ### Parallel mode overlaps scenario execution while preserving ordering

 **Given** valid endpoint, scenario, persona, and rubric YAML files with more

src/domains/evaluation/simulator.ts

Lines changed: 5 additions & 1 deletion
@@ -8,7 +8,7 @@ import type {
 import { AgentProbeRuntimeError } from "../../shared/utils/errors.ts";
 import type { LlmResponsesClient } from "./ports.ts";

-const DEFAULT_PERSONA_MODEL = "moonshotai/kimi-k2.6";
+const DEFAULT_PERSONA_MODEL = "deepseek/deepseek-v4-flash";

 type ConversationHistory =
   | string
@@ -582,6 +582,10 @@ export async function generatePersonaStep(
     model: resolvePersonaModel(persona),
     instructions: simulatorInstructions(persona, requireResponse),
     input: baseInput,
+    reasoning: {
+      effort: "medium",
+      exclude: true,
+    },
     text: {
       format: {
         type: "json_schema",

src/providers/sdk/openai-responses.ts

Lines changed: 8 additions & 0 deletions
@@ -251,6 +251,14 @@ export class OpenAiResponsesClient {
       },
       temperature: request.temperature,
       max_output_tokens: request.maxOutputTokens,
+      reasoning: request.reasoning
+        ? {
+            effort: request.reasoning.effort,
+            max_tokens: request.reasoning.maxTokens,
+            exclude: request.reasoning.exclude,
+            enabled: request.reasoning.enabled,
+          }
+        : undefined,
       prompt_cache_key: request.promptCacheKey,
       cache_control: request.cacheControl
         ? {
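
Reviewer note: the hunk above maps the camelCase `request.reasoning` options onto snake_case wire fields. The sketch below isolates that mapping to show what would reach the provider if the body is serialized with `JSON.stringify`, which drops properties whose value is `undefined`. The `ReasoningOptions` and `toWireReasoning` names are hypothetical, and serializing via `JSON.stringify` is an assumption about the client, which this diff does not show.

```typescript
// Mirrors the optional `reasoning` shape added to OpenAiResponsesRequest.
type ReasoningOptions = {
  effort?: "xhigh" | "high" | "medium" | "low" | "minimal" | "none";
  maxTokens?: number;
  exclude?: boolean;
  enabled?: boolean;
};

// Hypothetical helper: the same camelCase-to-snake_case mapping as the hunk.
function toWireReasoning(reasoning?: ReasoningOptions) {
  return reasoning
    ? {
        effort: reasoning.effort,
        max_tokens: reasoning.maxTokens,
        exclude: reasoning.exclude,
        enabled: reasoning.enabled,
      }
    : undefined;
}

// The persona simulator's settings from this commit.
const withReasoning = JSON.stringify({
  reasoning: toWireReasoning({ effort: "medium", exclude: true }),
});
console.log(withReasoning); // {"reasoning":{"effort":"medium","exclude":true}}

// A caller that sets no reasoning options sends no `reasoning` key at all.
const withoutReasoning = JSON.stringify({ reasoning: toWireReasoning(undefined) });
console.log(withoutReasoning); // {}
```

Under that assumption, unset fields such as `max_tokens` and `enabled` never appear in the serialized body, so existing callers that pass no `reasoning` are unaffected.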

src/shared/types/contracts.ts

Lines changed: 6 additions & 0 deletions
@@ -633,6 +633,12 @@ export type OpenAiResponsesRequest = {
   model: string;
   instructions: string;
   input: string | OpenAiResponsesInputMessage[];
+  reasoning?: {
+    effort?: "xhigh" | "high" | "medium" | "low" | "minimal" | "none";
+    maxTokens?: number;
+    exclude?: boolean;
+    enabled?: boolean;
+  };
   text: {
     format: {
       type: "json_schema";

tests/unit/simulator.test.ts

Lines changed: 13 additions & 0 deletions
@@ -3,6 +3,7 @@ import { beforeEach, describe, expect, test } from "bun:test";
 import {
   generateNextStep,
   generatePersonaStep,
+  resolvePersonaModel,
 } from "../../src/domains/evaluation/simulator.ts";
 import { AgentProbeRuntimeError } from "../../src/shared/utils/errors.ts";
 import {
@@ -23,6 +24,14 @@ describe("simulator", () => {
     }
   });

+  test("uses DeepSeek Flash as the default persona model", () => {
+    delete process.env.AGENTPROBE_PERSONA_MODEL;
+
+    expect(resolvePersonaModel(buildPersona())).toBe(
+      "deepseek/deepseek-v4-flash",
+    );
+  });
+
   test("uses env default model and guidance for required turns", async () => {
     process.env.AGENTPROBE_PERSONA_MODEL = "env-persona-model";
     const client = new FakeResponsesClient([
@@ -80,6 +89,10 @@ describe("simulator", () => {
     });
     expect(client.calls).toHaveLength(1);
     expect(client.calls[0]?.model).toBe("env-persona-model");
+    expect(client.calls[0]?.reasoning).toEqual({
+      effort: "medium",
+      exclude: true,
+    });
     expect(client.calls[0]?.instructions).toContain("Frustrated Customer");
     expect(client.calls[0]?.input).toContain("Ask about refund timing.");
     expect(client.calls[0]?.input).toContain("Conversation so far:");
