Skip to content

feat: multiturn synthetic user Runner#1441

Open
chiang-daniel wants to merge 17 commits into
leonard/kil-632-feat-multiturn-taskfrom
dchiang/multiturn-synthetic-user
Open

feat: multiturn synthetic user Runner#1441
chiang-daniel wants to merge 17 commits into
leonard/kil-632-feat-multiturn-taskfrom
dchiang/multiturn-synthetic-user

Conversation

@chiang-daniel
Copy link
Copy Markdown
Contributor

@chiang-daniel chiang-daniel commented Jun 2, 2026

What does this PR do?

  • New libs/core kiln_ai.synthetic_user.runnerdrive_case + run_cases_batch
  • New libs/core SyntheticUserCase contract
  • New SyntheticUserClient wrapping kiln_server /v1/synthetic_user/generate
  • New studio_server routes: generate_cases (sync) + run_cases_batch (SSE)

Pipeline

  • Author cases via remote /generate (pro-gated, kiln-AI keys)
  • Drive locally: target adapter ↔ SyntheticUserDriver each turn, user's own keys
  • Persist chains as multi-turn TaskRuns tagged synthetic_user_case + synthetic_user_batch:<tag>
  • Fan out N cases under asyncio.Semaphore(4); stream BatchEvents over SSE

Notable

  • Runner lives in libs/core alongside EvalRunner / RagJobRunner — same pattern
  • SSE total_cost honestly sums target adapter + SU driver spend
  • Tool-dispatch-only assistant turns filtered before role_swap (tool-using targets)
  • Module constants: NUM_CASES_MAX=10, MAX_TURNS_DEFAULT=5, CONCURRENCY=4

Flow

   ┌─────────────────────────── all local ───────────────────────────┐

   Task Runner                          SU Driver
   ───────────                          ─────────
        │                                   │
        │ ◄─────── seed_prompt ─────────────│  (turn 1 only)
        │                                   │
        ▼                                   │
   invoke target task                       │
   (local; uses run config:                 │
    model, provider, prompt,                │
    tools, etc.)                            │
        │                                   │
        ▼                                   │
    TaskRun                                 │
        │                                   │
        ├────────── trace ────────────────► │
        │                                   │
        │                                   ▼
        │                          generate reply
        │                          (local; uses SU
        │                           model + provider)
        │                                   │
        │ ◄────── next user message ────────│
        │                                   │
        ▼                                   │
      (loop until max_turns)                │

Test plan

  • 134 unit tests across libs/core/kiln_ai/synthetic_user + studio_server routes
  • End-to-end smoke (_smoke.py, untracked): 3 hand-crafted cases → 3 persisted chains, $0.04 total

Related Issues

Contributor License Agreement

I, @, confirm that I have read and agree to the Contributors License Agreement.

Checklists

  • Tests have been run locally and passed
  • New tests have been added to any work in /lib

chiang-daniel and others added 13 commits June 1, 2026 15:51
Removes the /respond SDK module and its supporting wire types
(RespondRequest/Response, SyntheticUserDriverConfig, ConversationTurn,
the nested SyntheticUserInfo model). Per-turn synthetic-user invocation
moves to OSS at libs/core/kiln_ai/synthetic_user/ in a subsequent commit.

Collapses SyntheticUserCase.synthetic_user_info to a single tagged blob
string:
  <persona>...</persona><goal>...</goal><behavior_guidance>...</behavior_guidance>
The server treats the blob as opaque; the local player parses it.

Adds a typed `code` literal on /generate's 502 response
(llm_unavailable | upstream_invalid_output) so callers can discriminate
between transient model failures and unparseable model output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OSS-side per-turn synthetic-user invocation — the replacement for
kiln_server's removed /respond endpoint. Lives in
libs/core/kiln_ai/synthetic_user/ so the runner can call the LLM using
the user's own provider keys rather than a hosted endpoint.

Modules:
- models — Pydantic SyntheticUserInfo (parsed form) + SyntheticUserDriverConfig.
- parser — tagged-blob ↔ SyntheticUserInfo. Required: <persona>, <goal>;
  optional: <behavior_guidance>. Unknown tags ignored (forward-compat).
- role_swap — flips eval-frame user/assistant labels into LLM-frame labels;
  raises on system/tool roles (the driver filters those upstream) and on
  non-string content.
- prompt — persona-playing system prompt. No <DONE>/<CANCEL> guidance:
  drive loop is fixed-length; SU stays engaged across the conversation.
- driver — SyntheticUserDriver. Parses the blob once at construction,
  renders the system prompt once, builds the adapter once. respond()
  filters visible roles, role-swaps, prepends the system prompt as
  prior_trace[0], calls adapter.invoke_returning_run_output (in-memory —
  the SU never persists a TaskRun), returns the raw string.

56 unit tests covering: parser roundtrip / required-tag enforcement /
whitespace / unknown-tag forward-compat; role_swap empty/alternating/
preserves-order/raises-on-system-or-tool; prompt structural assertions
(persona/goal/conventions present, behavior_guidance only when set, no
<DONE>/<CANCEL>); driver happy path, role-swap shape, custom
visible_roles, ends-on-assistant invariant, non-string output guard,
parse-error on construction, adapter reuse across turns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up an OpenAPI description on GenerateSyntheticUsersResponse.cases
documenting the strict-N batch contract. No shape change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thin async wrapper around the SDK's /v1/synthetic_user/generate
endpoint. The SDK now parses 401/422/500/502 into typed response
models, so the wrapper switches on the parsed type rather than
reading raw bytes — 502 surfaces its typed `code` literal
(llm_unavailable | upstream_invalid_output) directly to callers.

No retry loop. /generate is a once-per-batch authoring call;
kiln_server's pipeline already retries transient provider failures
internally before returning 502, so a 502 reaching us is a genuine
per-batch failure that should propagate. Drops the v1 client's
SyntheticUserTransientError + backoff machinery.

No /respond. Per-turn synthetic-user invocation lives at
libs/core/kiln_ai/synthetic_user/ and runs locally with the user's keys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an explicit "your entire output is the user's next message, verbatim
and nothing else: no narration, no meta-commentary, no quotes, no labels
like 'User:'" clause to the persona-playing system prompt.

A team running similar SU-driven evals reported the persona-playing
model frequently breaks character — narrating ("I would now ask..."),
self-evaluating, or labeling its output. This clause pins that down at
the prompt boundary so we don't end up reaching for post-processing
band-aids later.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
drive_loop.py:
- drive_case(*, case, target_invoker, su_driver, turns, on_turn) runs the
  loop for exactly `turns` iterations — no early termination, no
  stop_signal plumbing. Returns DriveCaseResult(chain) with the persisted
  TaskRun chain.
- TargetInvoker + TurnHook Protocols. The SU driver does all role
  filtering / role swap / invariant checks internally; the drive loop
  passes the cumulative trace as-is.

runner.py:
- run_cases_batch is an async generator yielding typed BatchEvents
  (BatchStartedEvent / TurnCompletedEvent / CaseCompletedEvent /
  CaseFailedEvent / BatchCompletedEvent). No stop_signal/stop_reason
  fields — drive loop is fixed-length.
- Constructs a SyntheticUserDriver per case; a malformed
  synthetic_user_info blob surfaces as a CaseFailedEvent for that case
  alone (other cases continue).
- _make_target_invoker / _build_input_source / _tag_leaf patterns kept
  from the prior v1 commits (target persistence + SU attribution
  unchanged). input_source now carries the opaque blob on the root run
  + slim {batch_tag, turn_index} on subsequent turns.
- Per-case try/except now WRAPS _tag_leaf too, so a save_to_file failure
  surfaces as case_failed instead of silently disappearing into
  asyncio.gather(return_exceptions=True). Same try also wraps the
  target_invoker construction.
- Case tasks are kicked off before the first BatchStartedEvent yield and
  the entire drain loop is inside a try/finally that cancels them on
  consumer disconnect — fixes the v1 issue where browser disconnect kept
  the request alive for the full duration of every in-flight case.

14 tests cover: input validation, happy-path event stream, leaf tagging,
auto-generated batch_tag, malformed blob → case_failed, target invoke
failure → case_failed, tag-save failure → case_failed, concurrency
semaphore enforcing max-in-flight, root vs slim input_source
attribution, and consumer cancellation propagating to case tasks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two routes for the multi-turn synthetic-user data-generation pipeline:
- POST .../multiturn_sdg/generate_cases (sync JSON)
- POST .../multiturn_sdg/run_cases_batch (SSE via CancellableStreamingResponse)

Wires connect_multiturn_sdg_api into desktop_server.make_app and registers
the Multiturn SDG tag in kiln_server's tags_metadata so the regenerated
api_schema.d.ts surfaces the routes in the typed client.

Both routes guard task.turn_mode == multiturn before doing any upstream
work and route SyntheticUserClient typed errors through to faithful HTTP
statuses (401/422/502 preserved, not collapsed). The SSE route threads
build_save_context(request) into run_cases_batch and uses an isinstance
whitelist on the JSON encoder so future Pydantic types on the wire need
explicit review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Rename total_cost -> target_total_cost on CaseCompletedEvent and
  BatchCompletedEvent. The runner only sees target adapter spend; the SU
  driver's per-turn cost isn't rolled up here. Old name was misleading
  in a beta where users pick the SU model.

- Thread an optional save_context through run_cases_batch and wrap the
  leaf-tag save. Adapter writes inside adapter.invoke still bypass — a
  kiln_ai-side gap shared with the chat SSE pattern, documented in the
  runner docstring.

- Add a re-run idempotency test for _tag_leaf to lock in the spec's
  "set-union + sort, preserves pre-existing tags" contract.

- Drop the dead UNSET/None branch in client._code_or_default; the
  remaining one-liner has identical behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Rename DEFAULT_TURNS -> MAX_TURNS_DEFAULT to match spec naming.
- Name asyncio.create_task instances so debug dumps point at this code.
- Pre-assert non-empty seed_prompt in drive_case (assert-loud invariant).
- Document invariants on _make_target_invoker (sequential-per-case),
  _tag_leaf (one-writer-per-leaf), and _close_when_done (final put on
  cancel path goes into the void).
- Drop the unreachable generic fallback in _to_http_exception; tighten
  the param type to the two real subclasses so the type checker enforces
  exhaustiveness at the call site.
- Log a warning in _format_validation_detail when every item is skipped
  so a silent SDK shape drift surfaces.
- Tests: parameterize turns<1 with negatives, lock in
  _event_to_payload's unregistered-event guard, and couple the
  auto-batch_tag test to the public regex instead of the implementation.
- Stale "Phase 3" docstring scrub + f-string cosmetic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root TaskRun's input_source.properties now carries the decomposed SU
case context — persona, goal, behavior_guidance (when present),
seed_prompt — instead of the opaque tagged blob.

Lets dataset readers and eval tooling inspect SU attribution by direct
property access rather than re-parsing the XML each time. The blob is
losslessly reconstructable from these fields via build_synthetic_user_info
if a downstream tool needs the original wire form.

Parse happens once per case in _build_input_source on the root turn; the
SU driver constructor already validated the blob, so the re-parse here
can't surface a new error class. behavior_guidance is omitted when the
parser returns None (the DataSource validator rejects empty strings).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SyntheticUserDriver.respond now returns (message, cost) — the per-call
cost is read from the in-memory TaskRun's usage.cost (the only place SU
spend surfaces, since SU turns aren't persisted as TaskRuns).

drive_case accumulates su_total_cost across turns and exposes it on
DriveCaseResult. The runner adds it to the leaf's cumulative_usage.cost
to produce an honest CaseCompletedEvent.total_cost — renamed from
target_total_cost since the field now reports total spend, not just the
target adapter's. BatchCompletedEvent.total_cost sums across successful
cases the same way.

Matters now because the SU model is user-selectable: someone picking
Sonnet for higher-quality probes would have had ~half their spend
invisible under the old target-only total.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…source

Every input to the filter has stronger upstream protection now:
seed_prompt is asserted non-empty in drive_case; persona and goal are
required-non-empty by parse_synthetic_user_info; behavior_guidance is
already conditionally skipped if None; the remaining keys are Pydantic-
validated or non-string. The filter was guarding nothing.

The DataSource validator stays as the real backstop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure relocation + boundary update; behavior unchanged.

run_cases_batch and drive_case now live at
libs/core/kiln_ai/synthetic_user/{runner,drive_loop}.py alongside the
existing SyntheticUserDriver. Same neighborhood as EvalRunner /
RagJobRunner / ExtractorRunner — runners belong in libs/core.

To make libs/core SDK-agnostic, introduce a small
kiln_ai.synthetic_user.SyntheticUserCase Pydantic model (two fields,
field-identical to the kiln_server SDK's case shape). The
multiturn_sdg_api route validates dicts straight into the libs/core type
via Pydantic, so the runner never sees the SDK class. The SDK case is
still used for `/generate_cases` output via `to_dict()` — nothing
about that pro-gated authoring path changes.

Tests move with the code. studio_server keeps only the SDK-wrapper
SyntheticUserClient and the FastAPI route, which is exactly the
established shape for eval_api driving EvalRunner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 2, 2026

Review Change Stack

Walkthrough

This PR introduces a complete multi-turn synthetic data generation pipeline enabling local execution of multi-turn synthetic user conversations against target tasks. The implementation includes a batch runner with concurrent case execution, per-turn drivers that invoke adapters, FastAPI endpoints exposing generation and execution, an SDK client wrapper for upstream case generation, comprehensive event streaming via SSE, and end-to-end test coverage.

Changes

Multi-turn Synthetic Data Generation Pipeline

Layer / File(s) Summary
Core synthetic user types and data models
libs/core/kiln_ai/synthetic_user/__init__.py, libs/core/kiln_ai/synthetic_user/case.py, libs/core/kiln_ai/synthetic_user/models.py, libs/core/kiln_ai/synthetic_user/parser.py, libs/core/kiln_ai/synthetic_user/prompt.py, libs/core/kiln_ai/synthetic_user/role_swap.py, libs/core/kiln_ai/synthetic_user/test_*.py
Introduce SyntheticUserCase (seed prompt + info blob), SyntheticUserInfo (persona/goal/behavior guidance), parsing/serialization with XML-like tagged format, system prompt rendering, role-swapping utility for eval-to-LLM frame conversion, and full unit test coverage.
Synthetic user per-turn driver
libs/core/kiln_ai/synthetic_user/driver.py, libs/core/kiln_ai/synthetic_user/test_driver.py
Implement SyntheticUserDriver that parses synthetic-user info at construction, filters conversation messages by visible roles, drops tool-dispatch-only turns, applies role swap, invokes adapter, and returns synthetic user reply plus per-call cost; includes 18 test cases validating parsing, visibility filtering, tool-call handling, and adapter reuse.
Single-case drive loop for multi-turn iteration
libs/core/kiln_ai/synthetic_user/drive_loop.py, libs/core/kiln_ai/synthetic_user/test_drive_loop.py
Add drive_case function and DriveCaseResult to orchestrate fixed-turn iteration: seed with case prompt, invoke target task via adapter, thread cumulative trace to SU driver, collect persisted TaskRun chain, aggregate SU cost; includes 11 test cases covering turn sequencing, trace threading, hook callbacks, and error propagation.
Batch runner with concurrent execution and event streaming
libs/core/kiln_ai/synthetic_user/runner.py, libs/core/kiln_ai/synthetic_user/test_runner.py
Implement async run_cases_batch generator yielding strongly-typed BatchEvents (started, turn completed, case completed/failed, batch completed); execute cases concurrently under semaphore; emit turn snapshots with cumulative trace and cost; tag leaf runs; isolate per-case failures; includes 18 test cases validating event sequencing, concurrency caps, input source decomposition, and cancellation handling.
SDK client wrapper and exception handling
app/desktop/studio_server/synthetic_user/__init__.py, app/desktop/studio_server/synthetic_user/client.py, app/desktop/studio_server/synthetic_user/test_client.py
Wrap kiln_server SDK endpoint for /v1/synthetic_user/generate; define typed exception hierarchy (SyntheticUserError, SyntheticUserRequestError for 4xx, SyntheticUserServerError for 5xx); translate SDK responses into wrapper types; includes 13 test cases covering success path, typed error codes, fallback status classification, and no-retry behavior.
FastAPI endpoints and desktop server integration
app/desktop/studio_server/multiturn_sdg_api.py, app/desktop/studio_server/test_multiturn_sdg_api.py, app/desktop/desktop_server.py
Add /generate_cases synchronous route and /run_cases_batch SSE route under /api/projects/{project_id}/tasks/{task_id}/multiturn_sdg/; validate multiturn task requirement; map upstream errors to HTTP status codes; stream SSE JSON frames with custom serialization for MessageUsage; wrap with CancellableStreamingResponse; apply _git_sync_no_write_lock decorator; includes 23 test cases covering happy path, validation, error preservation, and structural behavior.
Frontend API schema and minor UI updates
app/web_ui/src/lib/api_schema.d.ts, app/web_ui/src/lib/ui/conversation/multiturn_composer.svelte, libs/server/kiln_server/server.py
Generate TypeScript type definitions for new endpoints and request/response models; add "Multiturn SDG" OpenAPI tag; fix comment formatting in Svelte component.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • leonardmq
  • scosman
  • tawnymanticore

🐰 A pipeline flows, cases now sync,
Turn by turn the models think,
Batch events stream in SSE delight,
Synthetic users chat through the night!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 34.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the main change: adding a multiturn synthetic user Runner to the codebase, which is the core feature of this PR.
Description check ✅ Passed The PR description covers most required sections: purpose (What does this PR do), pipeline explanation, notable features, flow diagram, test plan, and related issues. However, some template sections are incomplete (CLA confirmation uses placeholder @ and checklist items are unchecked), but the core descriptive content is substantial and well-structured.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch dchiang/multiturn-synthetic-user

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces multi-turn synthetic data generation (SDG) capabilities, adding FastAPI routes, a local synthetic-user driver, client wrappers, and comprehensive unit tests, alongside updates to tracking models. The review feedback highlights several critical issues: multiple model files (chat_session_list_item.py, kiln_base_model.py, task_output.py, task_output_rating.py, and task_run.py) use datetime.datetime.fromisoformat without importing the datetime module, which will cause runtime NameErrors. Additionally, manually overriding the Content-Type header with a hardcoded boundary in the prompt optimization endpoint is fragile and should be removed, and role_swap.py needs to gracefully handle None content in assistant messages to prevent crashes during tool-use turns.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread libs/core/kiln_ai/synthetic_user/role_swap.py
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 2, 2026

📊 Coverage Report

Overall Coverage: 92%

Diff: origin/leonard/kil-632-feat-multiturn-task...HEAD

  • app/desktop/desktop_server.py (100%)
  • app/desktop/studio_server/multiturn_sdg_api.py (100%)
  • app/desktop/studio_server/synthetic_user/init.py (100%)
  • app/desktop/studio_server/synthetic_user/client.py (91.5%): Missing lines 173-174,182-183,189
  • libs/core/kiln_ai/synthetic_user/init.py (100%)
  • libs/core/kiln_ai/synthetic_user/case.py (100%)
  • libs/core/kiln_ai/synthetic_user/drive_loop.py (97.1%): Missing lines 101
  • libs/core/kiln_ai/synthetic_user/driver.py (97.4%): Missing lines 118
  • libs/core/kiln_ai/synthetic_user/models.py (100%)
  • libs/core/kiln_ai/synthetic_user/parser.py (100%)
  • libs/core/kiln_ai/synthetic_user/prompt.py (100%)
  • libs/core/kiln_ai/synthetic_user/role_swap.py (93.8%): Missing lines 45
  • libs/core/kiln_ai/synthetic_user/runner.py (99.3%): Missing lines 420

Summary

  • Total: 447 lines
  • Missing: 9 lines
  • Coverage: 97%

Line-by-line

View line-by-line diff coverage

app/desktop/studio_server/synthetic_user/client.py

Lines 169-178

  169     parts: list[str] = []
  170     skipped = 0
  171     for item in detail:
  172         if not isinstance(item, ValidationError):
! 173             skipped += 1
! 174             continue
  175         loc = ".".join(str(x) for x in item.loc)
  176         parts.append(f"{loc}: {item.msg}")
  177     if not parts:
  178         # The SDK's HTTPValidationError.detail had items the SDK couldn't

Lines 178-187

  178         # The SDK's HTTPValidationError.detail had items the SDK couldn't
  179         # parse as ValidationError — a shape we don't expect today. Log
  180         # so we can spot the discrepancy if it ever appears in the wild,
  181         # instead of silently returning the empty fallback.
! 182         if skipped:
! 183             logger.warning(
  184                 "HTTPValidationError carried %d non-ValidationError detail item(s); "
  185                 "raw detail repr: %r",
  186                 skipped,
  187                 detail,

Lines 185-191

  185                 "raw detail repr: %r",
  186                 skipped,
  187                 detail,
  188             )
! 189         return "Validation error (no detail)."
  190     return "Validation error: " + "; ".join(parts)

libs/core/kiln_ai/synthetic_user/drive_loop.py

Lines 97-105

   97     # Assert-loud on missing seed. An empty string would silently flow
   98     # into the target adapter and surface as a confusing model-side error
   99     # rather than a clean "the case is malformed" signal.
  100     if not case.seed_prompt:
! 101         raise ValueError("case.seed_prompt must be a non-empty string")
  102 
  103     user_msg: str = case.seed_prompt
  104     prev_run: TaskRun | None = None
  105     prev_trace: list[ChatCompletionMessageParam] | None = None

libs/core/kiln_ai/synthetic_user/driver.py

Lines 114-122

  114         swapped = role_swap(visible)
  115         last = swapped[-1]
  116         user_input = last["content"]
  117         if not isinstance(user_input, str):
! 118             raise RuntimeError(
  119                 "synthetic user input must be a plain string after role_swap"
  120             )
  121 
  122         system_msg: ChatCompletionSystemMessageParam = {

libs/core/kiln_ai/synthetic_user/role_swap.py

Lines 41-49

  41         # the target. Narrowing here lets us assign into the swapped wrapper
  42         # type without a cast.
  43         content = msg["content"]
  44         if not isinstance(content, str):
! 45             raise ValueError(
  46                 f"role_swap requires string content for role {role!r}; "
  47                 f"got {type(content).__name__}"
  48             )
  49         if role == "user":

libs/core/kiln_ai/synthetic_user/runner.py

Lines 416-424

  416     missing (defensive against fakes in unit tests that don't populate it).
  417     """
  418     usage = getattr(run, "cumulative_usage", None)
  419     if usage is None:
! 420         return 0.0
  421     return float(getattr(usage, "cost", None) or 0.0)
  422 
  423 
  424 def _tag_leaf(leaf: TaskRun, batch_tag: str) -> None:


@chiang-daniel chiang-daniel changed the title Dchiang/multiturn synthetic user feat: multiturn synthetic user Runner Jun 2, 2026
chiang-daniel and others added 4 commits June 2, 2026 16:04
… role_swap

Tool-using targets emit assistant turns with content=None and tool_calls
set — pure tool dispatches, not user-facing speech. Pre-this-fix, those
hit role_swap's strict-content invariant and crashed the SU run. Gemini's
suggestion (coerce None → "") would have let them through but degraded
the SU LLM's conversation view to consecutive user turns with empty
content — silently worse than the crash.

The right place to filter is at the driver, next to the existing
visible_message_roles filter — "what's visible to the SU" is the driver's
responsibility. role_swap stays strict on None content (the trip wire
for any caller bypassing the driver's filter).

Filter predicate: drop assistant turns where content is None. Keep
assistant turns that carry text alongside tool_calls — the text is
user-facing speech the SU should respond to.

Addresses gemini-code-assist comment on PR #1441 / role_swap.py without
applying the suggested empty-string coercion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…context

Fix comment numbering in driver.py (4→5), correct "greedy" to "non-greedy"
in parser.py, remove inaccurate drive-loop claim from studio_server __init__.
Strip historical /respond migration references, remove app-layer concerns
(SSE, @no_write_lock) from SDK-level docstrings, deduplicate cost-attribution
explanations across driver/runner/drive_loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stray U+200B (zero-width space) between "disables/" and "spinners" in a
comment tripped eslint no-irregular-whitespace. Likely a paste artifact
from Leonard's recent commit; fixed in passing during the merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

headers["Content-Type"] = "multipart/form-data; boundary=+++"

_kwargs["headers"] = headers
Copy link
Copy Markdown
Contributor Author

@chiang-daniel chiang-daniel Jun 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the changes under /api_client are files copied from the new server SDK. No need to review those.

@chiang-daniel chiang-daniel marked this pull request as ready for review June 3, 2026 17:12
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
libs/core/kiln_ai/synthetic_user/runner.py (1)

57-66: ⚖️ Poor tradeoff

TurnCompletedEvent.cumulative_cost omits SU-driver spend while CaseCompletedEvent.total_cost includes it.

A live cost ticker driven off cumulative_cost will undercount during turns, then jump up when case_completed adds result.su_total_cost. This matches the documented "honest totals only at case end" intent, so it's not a bug — just flagging the per-turn vs per-case inconsistency in case the UI relies on a smooth running total. Threading the running SU cost into on_turn would remove the jump.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/synthetic_user/runner.py` around lines 57 - 66,
TurnCompletedEvent.cumulative_cost currently excludes SU-driver spend while
CaseCompletedEvent.total_cost includes it, causing per-turn cost undercounts
then a jump at case completion; update the on-turn flow to thread the running SU
cost into each TurnCompletedEvent so cumulative_cost reflects assistant+SU spend
per turn (adjust the code paths that construct TurnCompletedEvent and any
function handling on_turn to accept and pass the incremental su_running_cost),
and ensure CaseCompletedEvent.total_cost still aggregates final su_total_cost so
the live ticker remains smooth and consistent with the end-of-case total.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/web_ui/src/lib/api_schema.d.ts`:
- Around line 17079-17104: The OpenAPI docs currently advertise
stream_run_cases_batch
(stream_run_cases_batch_api_projects__project_id__tasks__task_id__multiturn_sdg_run_cases_batch_post)
as returning "application/json" but the route actually returns a
StreamingResponse with media_type="text/event-stream"; update the FastAPI route
in app/desktop/studio_server/multiturn_sdg_api.py to declare the 200 response
content type as "text/event-stream" (e.g., add responses={200: {"content":
{"text/event-stream": {"schema": {"type":"string"}}}}} or set
response_class/response_model metadata appropriately) so the OpenAPI spec
reflects SSE, then run app/web_ui/src/lib/generate_schema.sh to regenerate
app/web_ui/src/lib/api_schema.d.ts; do not manually edit the generated TS file.

---

Nitpick comments:
In `@libs/core/kiln_ai/synthetic_user/runner.py`:
- Around line 57-66: TurnCompletedEvent.cumulative_cost currently excludes
SU-driver spend while CaseCompletedEvent.total_cost includes it, causing
per-turn cost undercounts then a jump at case completion; update the on-turn
flow to thread the running SU cost into each TurnCompletedEvent so
cumulative_cost reflects assistant+SU spend per turn (adjust the code paths that
construct TurnCompletedEvent and any function handling on_turn to accept and
pass the incremental su_running_cost), and ensure CaseCompletedEvent.total_cost
still aggregates final su_total_cost so the live ticker remains smooth and
consistent with the end-of-case total.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: dd8cddc7-7358-4d89-a388-06e6d09f5738

📥 Commits

Reviewing files that changed from the base of the PR and between d2c3f99 and d032dcf.

⛔ Files ignored due to path filters (20)
  • app/desktop/studio_server/api_client/kiln_ai_server_client/api/jobs/start_prompt_optimization_job_v1_jobs_prompt_optimization_job_start_post.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
  • app/desktop/studio_server/api_client/kiln_ai_server_client/api/jobs/start_sample_job_v1_jobs_sample_job_start_post.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
  • app/desktop/studio_server/api_client/kiln_ai_server_client/api/synthetic_user/__init__.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
  • app/desktop/studio_server/api_client/kiln_ai_server_client/api/synthetic_user/generate_v1_synthetic_user_generate_post.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
  • app/desktop/studio_server/api_client/kiln_ai_server_client/models/__init__.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
  • app/desktop/studio_server/api_client/kiln_ai_server_client/models/chat_completion_assistant_message_param_wrapper.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
  • app/desktop/studio_server/api_client/kiln_ai_server_client/models/chat_session_list_item.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
  • app/desktop/studio_server/api_client/kiln_ai_server_client/models/generate_synthetic_users_request.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
  • app/desktop/studio_server/api_client/kiln_ai_server_client/models/generate_synthetic_users_response.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
  • app/desktop/studio_server/api_client/kiln_ai_server_client/models/generate_v1_synthetic_user_generate_post_response_401.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
  • app/desktop/studio_server/api_client/kiln_ai_server_client/models/generate_v1_synthetic_user_generate_post_response_500.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
  • app/desktop/studio_server/api_client/kiln_ai_server_client/models/generate_v1_synthetic_user_generate_post_response_502.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
  • app/desktop/studio_server/api_client/kiln_ai_server_client/models/generate_v1_synthetic_user_generate_post_response_502_code.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
  • app/desktop/studio_server/api_client/kiln_ai_server_client/models/kiln_base_model.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
  • app/desktop/studio_server/api_client/kiln_ai_server_client/models/message_usage.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
  • app/desktop/studio_server/api_client/kiln_ai_server_client/models/synthetic_user_case.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
  • app/desktop/studio_server/api_client/kiln_ai_server_client/models/task_output.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
  • app/desktop/studio_server/api_client/kiln_ai_server_client/models/task_output_rating.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
  • app/desktop/studio_server/api_client/kiln_ai_server_client/models/task_run.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
  • app/desktop/studio_server/api_client/kiln_ai_server_client/models/usage.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
📒 Files selected for processing (26)
  • app/desktop/desktop_server.py
  • app/desktop/studio_server/multiturn_sdg_api.py
  • app/desktop/studio_server/synthetic_user/__init__.py
  • app/desktop/studio_server/synthetic_user/client.py
  • app/desktop/studio_server/synthetic_user/test_client.py
  • app/desktop/studio_server/test_multiturn_sdg_api.py
  • app/web_ui/src/lib/api_schema.d.ts
  • app/web_ui/src/lib/ui/conversation/multiturn_composer.svelte
  • libs/core/kiln_ai/synthetic_user/__init__.py
  • libs/core/kiln_ai/synthetic_user/case.py
  • libs/core/kiln_ai/synthetic_user/drive_loop.py
  • libs/core/kiln_ai/synthetic_user/driver.py
  • libs/core/kiln_ai/synthetic_user/models.py
  • libs/core/kiln_ai/synthetic_user/parser.py
  • libs/core/kiln_ai/synthetic_user/prompt.py
  • libs/core/kiln_ai/synthetic_user/role_swap.py
  • libs/core/kiln_ai/synthetic_user/runner.py
  • libs/core/kiln_ai/synthetic_user/test_case.py
  • libs/core/kiln_ai/synthetic_user/test_drive_loop.py
  • libs/core/kiln_ai/synthetic_user/test_driver.py
  • libs/core/kiln_ai/synthetic_user/test_models.py
  • libs/core/kiln_ai/synthetic_user/test_parser.py
  • libs/core/kiln_ai/synthetic_user/test_prompt.py
  • libs/core/kiln_ai/synthetic_user/test_role_swap.py
  • libs/core/kiln_ai/synthetic_user/test_runner.py
  • libs/server/kiln_server/server.py

Comment on lines +17079 to +17104
stream_run_cases_batch_api_projects__project_id__tasks__task_id__multiturn_sdg_run_cases_batch_post: {
parameters: {
query?: never;
header?: never;
path: {
/** @description ID of the project containing the target task. */
project_id: string;
/** @description ID of the target task. Must be a multi-turn task. */
task_id: string;
};
cookie?: never;
};
requestBody: {
content: {
"application/json": components["schemas"]["RunCasesBatchApiInput"];
};
};
responses: {
/** @description Successful Response */
200: {
headers: {
[name: string]: unknown;
};
content: {
"application/json": unknown;
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

run_cases_batch response media type is mis-modeled as JSON instead of SSE.

stream_run_cases_batch is typed with 200 -> application/json, but the backend route returns StreamingResponse(..., media_type="text/event-stream") (see app/desktop/studio_server/multiturn_sdg_api.py). This weakens the generated client contract for streaming and can break typed frontend consumption.

Please update the backend route OpenAPI metadata/response docs to advertise text/event-stream, then regenerate app/web_ui/src/lib/api_schema.d.ts via app/web_ui/src/lib/generate_schema.sh rather than editing this file directly.
Based on learnings: "app/web_ui/src/lib/api_schema.d.ts is auto-generated by openapi-typescript; do not propose manual edits. Schema changes should be made in the FastAPI backend … then re-generate the TS types."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/web_ui/src/lib/api_schema.d.ts` around lines 17079 - 17104, The OpenAPI
docs currently advertise stream_run_cases_batch
(stream_run_cases_batch_api_projects__project_id__tasks__task_id__multiturn_sdg_run_cases_batch_post)
as returning "application/json" but the route actually returns a
StreamingResponse with media_type="text/event-stream"; update the FastAPI route
in app/desktop/studio_server/multiturn_sdg_api.py to declare the 200 response
content type as "text/event-stream" (e.g., add responses={200: {"content":
{"text/event-stream": {"schema": {"type":"string"}}}}} or set
response_class/response_model metadata appropriately) so the OpenAPI spec
reflects SSE, then run app/web_ui/src/lib/generate_schema.sh to regenerate
app/web_ui/src/lib/api_schema.d.ts; do not manually edit the generated TS file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant