feat: multiturn synthetic user Runner by chiang-daniel · Pull Request #1441 · Kiln-AI/Kiln

chiang-daniel · 2026-06-02T09:00:47Z

What does this PR do?

New libs/core kiln_ai.synthetic_user.runner — drive_case + run_cases_batch
New libs/core SyntheticUserCase contract
New SyntheticUserClient wrapping kiln_server /v1/synthetic_user/generate
New studio_server routes: generate_cases (sync) + run_cases_batch (SSE)

Pipeline

Author cases via remote /generate (pro-gated, kiln-AI keys)
Drive locally: target adapter ↔ SyntheticUserDriver each turn, user's own keys
Persist chains as multi-turn TaskRuns tagged synthetic_user_case + synthetic_user_batch:<tag>
Fan out N cases under asyncio.Semaphore(4); stream BatchEvents over SSE

Notable

Runner lives in libs/core alongside EvalRunner / RagJobRunner — same pattern
SSE total_cost honestly sums target adapter + SU driver spend
Tool-dispatch-only assistant turns filtered before role_swap (tool-using targets)
Module constants: NUM_CASES_MAX=10, MAX_TURNS_DEFAULT=5, CONCURRENCY=4

Flow

   ┌─────────────────────────── all local ───────────────────────────┐

   Task Runner                          SU Driver
   ───────────                          ─────────
        │                                   │
        │ ◄─────── seed_prompt ─────────────│  (turn 1 only)
        │                                   │
        ▼                                   │
   invoke target task                       │
   (local; uses run config:                 │
    model, provider, prompt,                │
    tools, etc.)                            │
        │                                   │
        ▼                                   │
    TaskRun                                 │
        │                                   │
        ├────────── trace ────────────────► │
        │                                   │
        │                                   ▼
        │                          generate reply
        │                          (local; uses SU
        │                           model + provider)
        │                                   │
        │ ◄────── next user message ────────│
        │                                   │
        ▼                                   │
      (loop until max_turns)                │

Test plan

134 unit tests across libs/core/kiln_ai/synthetic_user + studio_server routes
End-to-end smoke (_smoke.py, untracked): 3 hand-crafted cases → 3 persisted chains, $0.04 total

Related Issues

Contributor License Agreement

I, @, confirm that I have read and agree to the Contributors License Agreement.

Checklists

Tests have been run locally and passed
New tests have been added to any work in /lib

Removes the /respond SDK module and its supporting wire types (RespondRequest/Response, SyntheticUserDriverConfig, ConversationTurn, the nested SyntheticUserInfo model). Per-turn synthetic-user invocation moves to OSS at libs/core/kiln_ai/synthetic_user/ in a subsequent commit. Collapses SyntheticUserCase.synthetic_user_info to a single tagged blob string: <persona>...</persona><goal>...</goal><behavior_guidance>...</behavior_guidance> The server treats the blob as opaque; the local player parses it. Adds a typed `code` literal on /generate's 502 response (llm_unavailable | upstream_invalid_output) so callers can discriminate between transient model failures and unparseable model output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

OSS-side per-turn synthetic-user invocation — the replacement for kiln_server's removed /respond endpoint. Lives in libs/core/kiln_ai/synthetic_user/ so the runner can call the LLM using the user's own provider keys rather than a hosted endpoint. Modules: - models — Pydantic SyntheticUserInfo (parsed form) + SyntheticUserDriverConfig. - parser — tagged-blob ↔ SyntheticUserInfo. Required: <persona>, <goal>; optional: <behavior_guidance>. Unknown tags ignored (forward-compat). - role_swap — flips eval-frame user/assistant labels into LLM-frame labels; raises on system/tool roles (the driver filters those upstream) and on non-string content. - prompt — persona-playing system prompt. No <DONE>/<CANCEL> guidance: drive loop is fixed-length; SU stays engaged across the conversation. - driver — SyntheticUserDriver. Parses the blob once at construction, renders the system prompt once, builds the adapter once. respond() filters visible roles, role-swaps, prepends the system prompt as prior_trace[0], calls adapter.invoke_returning_run_output (in-memory — the SU never persists a TaskRun), returns the raw string. 56 unit tests covering: parser roundtrip / required-tag enforcement / whitespace / unknown-tag forward-compat; role_swap empty/alternating/ preserves-order/raises-on-system-or-tool; prompt structural assertions (persona/goal/conventions present, behavior_guidance only when set, no <DONE>/<CANCEL>); driver happy path, role-swap shape, custom visible_roles, ends-on-assistant invariant, non-string output guard, parse-error on construction, adapter reuse across turns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Picks up an OpenAPI description on GenerateSyntheticUsersResponse.cases documenting the strict-N batch contract. No shape change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Thin async wrapper around the SDK's /v1/synthetic_user/generate endpoint. The SDK now parses 401/422/500/502 into typed response models, so the wrapper switches on the parsed type rather than reading raw bytes — 502 surfaces its typed `code` literal (llm_unavailable | upstream_invalid_output) directly to callers. No retry loop. /generate is a once-per-batch authoring call; kiln_server's pipeline already retries transient provider failures internally before returning 502, so a 502 reaching us is a genuine per-batch failure that should propagate. Drops the v1 client's SyntheticUserTransientError + backoff machinery. No /respond. Per-turn synthetic-user invocation lives at libs/core/kiln_ai/synthetic_user/ and runs locally with the user's keys. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds an explicit "your entire output is the user's next message, verbatim and nothing else: no narration, no meta-commentary, no quotes, no labels like 'User:'" clause to the persona-playing system prompt. A team running similar SU-driven evals reported the persona-playing model frequently breaks character — narrating ("I would now ask..."), self-evaluating, or labeling its output. This clause pins that down at the prompt boundary so we don't end up reaching for post-processing band-aids later. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

drive_loop.py: - drive_case(*, case, target_invoker, su_driver, turns, on_turn) runs the loop for exactly `turns` iterations — no early termination, no stop_signal plumbing. Returns DriveCaseResult(chain) with the persisted TaskRun chain. - TargetInvoker + TurnHook Protocols. The SU driver does all role filtering / role swap / invariant checks internally; the drive loop passes the cumulative trace as-is. runner.py: - run_cases_batch is an async generator yielding typed BatchEvents (BatchStartedEvent / TurnCompletedEvent / CaseCompletedEvent / CaseFailedEvent / BatchCompletedEvent). No stop_signal/stop_reason fields — drive loop is fixed-length. - Constructs a SyntheticUserDriver per case; a malformed synthetic_user_info blob surfaces as a CaseFailedEvent for that case alone (other cases continue). - _make_target_invoker / _build_input_source / _tag_leaf patterns kept from the prior v1 commits (target persistence + SU attribution unchanged). input_source now carries the opaque blob on the root run + slim {batch_tag, turn_index} on subsequent turns. - Per-case try/except now WRAPS _tag_leaf too, so a save_to_file failure surfaces as case_failed instead of silently disappearing into asyncio.gather(return_exceptions=True). Same try also wraps the target_invoker construction. - Case tasks are kicked off before the first BatchStartedEvent yield and the entire drain loop is inside a try/finally that cancels them on consumer disconnect — fixes the v1 issue where browser disconnect kept the request alive for the full duration of every in-flight case. 14 tests cover: input validation, happy-path event stream, leaf tagging, auto-generated batch_tag, malformed blob → case_failed, target invoke failure → case_failed, tag-save failure → case_failed, concurrency semaphore enforcing max-in-flight, root vs slim input_source attribution, and consumer cancellation propagating to case tasks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two routes for the multi-turn synthetic-user data-generation pipeline: - POST .../multiturn_sdg/generate_cases (sync JSON) - POST .../multiturn_sdg/run_cases_batch (SSE via CancellableStreamingResponse) Wires connect_multiturn_sdg_api into desktop_server.make_app and registers the Multiturn SDG tag in kiln_server's tags_metadata so the regenerated api_schema.d.ts surfaces the routes in the typed client. Both routes guard task.turn_mode == multiturn before doing any upstream work and route SyntheticUserClient typed errors through to faithful HTTP statuses (401/422/502 preserved, not collapsed). The SSE route threads build_save_context(request) into run_cases_batch and uses an isinstance whitelist on the JSON encoder so future Pydantic types on the wire need explicit review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Rename total_cost -> target_total_cost on CaseCompletedEvent and BatchCompletedEvent. The runner only sees target adapter spend; the SU driver's per-turn cost isn't rolled up here. Old name was misleading in a beta where users pick the SU model. - Thread an optional save_context through run_cases_batch and wrap the leaf-tag save. Adapter writes inside adapter.invoke still bypass — a kiln_ai-side gap shared with the chat SSE pattern, documented in the runner docstring. - Add a re-run idempotency test for _tag_leaf to lock in the spec's "set-union + sort, preserves pre-existing tags" contract. - Drop the dead UNSET/None branch in client._code_or_default; the remaining one-liner has identical behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Rename DEFAULT_TURNS -> MAX_TURNS_DEFAULT to match spec naming. - Name asyncio.create_task instances so debug dumps point at this code. - Pre-assert non-empty seed_prompt in drive_case (assert-loud invariant). - Document invariants on _make_target_invoker (sequential-per-case), _tag_leaf (one-writer-per-leaf), and _close_when_done (final put on cancel path goes into the void). - Drop the unreachable generic fallback in _to_http_exception; tighten the param type to the two real subclasses so the type checker enforces exhaustiveness at the call site. - Log a warning in _format_validation_detail when every item is skipped so a silent SDK shape drift surfaces. - Tests: parameterize turns<1 with negatives, lock in _event_to_payload's unregistered-event guard, and couple the auto-batch_tag test to the public regex instead of the implementation. - Stale "Phase 3" docstring scrub + f-string cosmetic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Root TaskRun's input_source.properties now carries the decomposed SU case context — persona, goal, behavior_guidance (when present), seed_prompt — instead of the opaque tagged blob. Lets dataset readers and eval tooling inspect SU attribution by direct property access rather than re-parsing the XML each time. The blob is losslessly reconstructable from these fields via build_synthetic_user_info if a downstream tool needs the original wire form. Parse happens once per case in _build_input_source on the root turn; the SU driver constructor already validated the blob, so the re-parse here can't surface a new error class. behavior_guidance is omitted when the parser returns None (the DataSource validator rejects empty strings). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

SyntheticUserDriver.respond now returns (message, cost) — the per-call cost is read from the in-memory TaskRun's usage.cost (the only place SU spend surfaces, since SU turns aren't persisted as TaskRuns). drive_case accumulates su_total_cost across turns and exposes it on DriveCaseResult. The runner adds it to the leaf's cumulative_usage.cost to produce an honest CaseCompletedEvent.total_cost — renamed from target_total_cost since the field now reports total spend, not just the target adapter's. BatchCompletedEvent.total_cost sums across successful cases the same way. Matters now because the SU model is user-selectable: someone picking Sonnet for higher-quality probes would have had ~half their spend invisible under the old target-only total. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…source Every input to the filter has stronger upstream protection now: seed_prompt is asserted non-empty in drive_case; persona and goal are required-non-empty by parse_synthetic_user_info; behavior_guidance is already conditionally skipped if None; the remaining keys are Pydantic- validated or non-string. The filter was guarding nothing. The DataSource validator stays as the real backstop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pure relocation + boundary update; behavior unchanged. run_cases_batch and drive_case now live at libs/core/kiln_ai/synthetic_user/{runner,drive_loop}.py alongside the existing SyntheticUserDriver. Same neighborhood as EvalRunner / RagJobRunner / ExtractorRunner — runners belong in libs/core. To make libs/core SDK-agnostic, introduce a small kiln_ai.synthetic_user.SyntheticUserCase Pydantic model (two fields, field-identical to the kiln_server SDK's case shape). The multiturn_sdg_api route validates dicts straight into the libs/core type via Pydantic, so the runner never sees the SDK class. The SDK case is still used for `/generate_cases` output via `to_dict()` — nothing about that pro-gated authoring path changes. Tests move with the code. studio_server keeps only the SDK-wrapper SyntheticUserClient and the FastAPI route, which is exactly the established shape for eval_api driving EvalRunner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-02T09:00:55Z

Walkthrough

This PR introduces a complete multi-turn synthetic data generation pipeline enabling local execution of multi-turn synthetic user conversations against target tasks. The implementation includes a batch runner with concurrent case execution, per-turn drivers that invoke adapters, FastAPI endpoints exposing generation and execution, an SDK client wrapper for upstream case generation, comprehensive event streaming via SSE, and end-to-end test coverage.

Changes

Multi-turn Synthetic Data Generation Pipeline

Layer / File(s)	Summary
Core synthetic user types and data models `libs/core/kiln_ai/synthetic_user/__init__.py`, `libs/core/kiln_ai/synthetic_user/case.py`, `libs/core/kiln_ai/synthetic_user/models.py`, `libs/core/kiln_ai/synthetic_user/parser.py`, `libs/core/kiln_ai/synthetic_user/prompt.py`, `libs/core/kiln_ai/synthetic_user/role_swap.py`, `libs/core/kiln_ai/synthetic_user/test_*.py`	Introduce `SyntheticUserCase` (seed prompt + info blob), `SyntheticUserInfo` (persona/goal/behavior guidance), parsing/serialization with XML-like tagged format, system prompt rendering, role-swapping utility for eval-to-LLM frame conversion, and full unit test coverage.
Synthetic user per-turn driver `libs/core/kiln_ai/synthetic_user/driver.py`, `libs/core/kiln_ai/synthetic_user/test_driver.py`	Implement `SyntheticUserDriver` that parses synthetic-user info at construction, filters conversation messages by visible roles, drops tool-dispatch-only turns, applies role swap, invokes adapter, and returns synthetic user reply plus per-call cost; includes 18 test cases validating parsing, visibility filtering, tool-call handling, and adapter reuse.
Single-case drive loop for multi-turn iteration `libs/core/kiln_ai/synthetic_user/drive_loop.py`, `libs/core/kiln_ai/synthetic_user/test_drive_loop.py`	Add `drive_case` function and `DriveCaseResult` to orchestrate fixed-turn iteration: seed with case prompt, invoke target task via adapter, thread cumulative trace to SU driver, collect persisted `TaskRun` chain, aggregate SU cost; includes 11 test cases covering turn sequencing, trace threading, hook callbacks, and error propagation.
Batch runner with concurrent execution and event streaming `libs/core/kiln_ai/synthetic_user/runner.py`, `libs/core/kiln_ai/synthetic_user/test_runner.py`	Implement async `run_cases_batch` generator yielding strongly-typed `BatchEvent`s (started, turn completed, case completed/failed, batch completed); execute cases concurrently under semaphore; emit turn snapshots with cumulative trace and cost; tag leaf runs; isolate per-case failures; includes 18 test cases validating event sequencing, concurrency caps, input source decomposition, and cancellation handling.
SDK client wrapper and exception handling `app/desktop/studio_server/synthetic_user/__init__.py`, `app/desktop/studio_server/synthetic_user/client.py`, `app/desktop/studio_server/synthetic_user/test_client.py`	Wrap kiln_server SDK endpoint for `/v1/synthetic_user/generate`; define typed exception hierarchy (`SyntheticUserError`, `SyntheticUserRequestError` for 4xx, `SyntheticUserServerError` for 5xx); translate SDK responses into wrapper types; includes 13 test cases covering success path, typed error codes, fallback status classification, and no-retry behavior.
FastAPI endpoints and desktop server integration `app/desktop/studio_server/multiturn_sdg_api.py`, `app/desktop/studio_server/test_multiturn_sdg_api.py`, `app/desktop/desktop_server.py`	Add `/generate_cases` synchronous route and `/run_cases_batch` SSE route under `/api/projects/{project_id}/tasks/{task_id}/multiturn_sdg/`; validate multiturn task requirement; map upstream errors to HTTP status codes; stream SSE JSON frames with custom serialization for `MessageUsage`; wrap with `CancellableStreamingResponse`; apply `_git_sync_no_write_lock` decorator; includes 23 test cases covering happy path, validation, error preservation, and structural behavior.
Frontend API schema and minor UI updates `app/web_ui/src/lib/api_schema.d.ts`, `app/web_ui/src/lib/ui/conversation/multiturn_composer.svelte`, `libs/server/kiln_server/server.py`	Generate TypeScript type definitions for new endpoints and request/response models; add "Multiturn SDG" OpenAPI tag; fix comment formatting in Svelte component.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

leonardmq
scosman
tawnymanticore

🐰 A pipeline flows, cases now sync,
Turn by turn the models think,
Batch events stream in SSE delight,
Synthetic users chat through the night! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 34.43% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately reflects the main change: adding a multiturn synthetic user Runner to the codebase, which is the core feature of this PR.
Description check	✅ Passed	The PR description covers most required sections: purpose (What does this PR do), pipeline explanation, notable features, flow diagram, test plan, and related issues. However, some template sections are incomplete (CLA confirmation uses placeholder @ and checklist items are unchecked), but the core descriptive content is substantial and well-structured.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch dchiang/multiturn-synthetic-user

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces multi-turn synthetic data generation (SDG) capabilities, adding FastAPI routes, a local synthetic-user driver, client wrappers, and comprehensive unit tests, alongside updates to tracking models. The review feedback highlights several critical issues: multiple model files (chat_session_list_item.py, kiln_base_model.py, task_output.py, task_output_rating.py, and task_run.py) use datetime.datetime.fromisoformat without importing the datetime module, which will cause runtime NameErrors. Additionally, manually overriding the Content-Type header with a hardcoded boundary in the prompt optimization endpoint is fragile and should be removed, and role_swap.py needs to gracefully handle None content in assistant messages to prevent crashes during tool-use turns.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

github-actions · 2026-06-02T09:03:39Z

📊 Coverage Report

Overall Coverage: 92%

Diff: origin/leonard/kil-632-feat-multiturn-task...HEAD

app/desktop/desktop_server.py (100%)
app/desktop/studio_server/multiturn_sdg_api.py (100%)
app/desktop/studio_server/synthetic_user/init.py (100%)
app/desktop/studio_server/synthetic_user/client.py (91.5%): Missing lines 173-174,182-183,189
libs/core/kiln_ai/synthetic_user/init.py (100%)
libs/core/kiln_ai/synthetic_user/case.py (100%)
libs/core/kiln_ai/synthetic_user/drive_loop.py (97.1%): Missing lines 101
libs/core/kiln_ai/synthetic_user/driver.py (97.4%): Missing lines 118
libs/core/kiln_ai/synthetic_user/models.py (100%)
libs/core/kiln_ai/synthetic_user/parser.py (100%)
libs/core/kiln_ai/synthetic_user/prompt.py (100%)
libs/core/kiln_ai/synthetic_user/role_swap.py (93.8%): Missing lines 45
libs/core/kiln_ai/synthetic_user/runner.py (99.3%): Missing lines 420

Summary

Total: 447 lines
Missing: 9 lines
Coverage: 97%

Line-by-line

View line-by-line diff coverage

app/desktop/studio_server/synthetic_user/client.py

Lines 169-178

  169     parts: list[str] = []
  170     skipped = 0
  171     for item in detail:
  172         if not isinstance(item, ValidationError):
! 173             skipped += 1
! 174             continue
  175         loc = ".".join(str(x) for x in item.loc)
  176         parts.append(f"{loc}: {item.msg}")
  177     if not parts:
  178         # The SDK's HTTPValidationError.detail had items the SDK couldn't

Lines 178-187

  178         # The SDK's HTTPValidationError.detail had items the SDK couldn't
  179         # parse as ValidationError — a shape we don't expect today. Log
  180         # so we can spot the discrepancy if it ever appears in the wild,
  181         # instead of silently returning the empty fallback.
! 182         if skipped:
! 183             logger.warning(
  184                 "HTTPValidationError carried %d non-ValidationError detail item(s); "
  185                 "raw detail repr: %r",
  186                 skipped,
  187                 detail,

Lines 185-191

  185                 "raw detail repr: %r",
  186                 skipped,
  187                 detail,
  188             )
! 189         return "Validation error (no detail)."
  190     return "Validation error: " + "; ".join(parts)

libs/core/kiln_ai/synthetic_user/drive_loop.py

Lines 97-105

   97     # Assert-loud on missing seed. An empty string would silently flow
   98     # into the target adapter and surface as a confusing model-side error
   99     # rather than a clean "the case is malformed" signal.
  100     if not case.seed_prompt:
! 101         raise ValueError("case.seed_prompt must be a non-empty string")
  102 
  103     user_msg: str = case.seed_prompt
  104     prev_run: TaskRun | None = None
  105     prev_trace: list[ChatCompletionMessageParam] | None = None

libs/core/kiln_ai/synthetic_user/driver.py

Lines 114-122

  114         swapped = role_swap(visible)
  115         last = swapped[-1]
  116         user_input = last["content"]
  117         if not isinstance(user_input, str):
! 118             raise RuntimeError(
  119                 "synthetic user input must be a plain string after role_swap"
  120             )
  121 
  122         system_msg: ChatCompletionSystemMessageParam = {

libs/core/kiln_ai/synthetic_user/role_swap.py

Lines 41-49

  41         # the target. Narrowing here lets us assign into the swapped wrapper
  42         # type without a cast.
  43         content = msg["content"]
  44         if not isinstance(content, str):
! 45             raise ValueError(
  46                 f"role_swap requires string content for role {role!r}; "
  47                 f"got {type(content).__name__}"
  48             )
  49         if role == "user":

libs/core/kiln_ai/synthetic_user/runner.py

Lines 416-424

  416     missing (defensive against fakes in unit tests that don't populate it).
  417     """
  418     usage = getattr(run, "cumulative_usage", None)
  419     if usage is None:
! 420         return 0.0
  421     return float(getattr(usage, "cost", None) or 0.0)
  422 
  423 
  424 def _tag_leaf(leaf: TaskRun, batch_tag: str) -> None:

📊 HTML Coverage Report - Interactive coverage report
📈 Diff Coverage Report - Detailed diff analysis
Github Actions Run - View the full coverage report

… role_swap Tool-using targets emit assistant turns with content=None and tool_calls set — pure tool dispatches, not user-facing speech. Pre-this-fix, those hit role_swap's strict-content invariant and crashed the SU run. Gemini's suggestion (coerce None → "") would have let them through but degraded the SU LLM's conversation view to consecutive user turns with empty content — silently worse than the crash. The right place to filter is at the driver, next to the existing visible_message_roles filter — "what's visible to the SU" is the driver's responsibility. role_swap stays strict on None content (the trip wire for any caller bypassing the driver's filter). Filter predicate: drop assistant turns where content is None. Keep assistant turns that carry text alongside tool_calls — the text is user-facing speech the SU should respond to. Addresses gemini-code-assist comment on PR #1441 / role_swap.py without applying the suggested empty-string coercion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…context Fix comment numbering in driver.py (4→5), correct "greedy" to "non-greedy" in parser.py, remove inaccurate drive-loop claim from studio_server __init__. Strip historical /respond migration references, remove app-layer concerns (SSE, @no_write_lock) from SDK-level docstrings, deduplicate cost-attribution explanations across driver/runner/drive_loop. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ask' into dchiang/multiturn-synthetic-user

Stray U+200B (zero-width space) between "disables/" and "spinners" in a comment tripped eslint no-irregular-whitespace. Likely a paste artifact from Leonard's recent commit; fixed in passing during the merge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chiang-daniel · 2026-06-03T17:07:28Z


+    headers["Content-Type"] = "multipart/form-data; boundary=+++"
+
    _kwargs["headers"] = headers


All the changes under /api_client are files copied from the new server SDK. No need to review those.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

libs/core/kiln_ai/synthetic_user/runner.py (1)
57-66: ⚖️ Poor tradeoff

TurnCompletedEvent.cumulative_cost omits SU-driver spend while CaseCompletedEvent.total_cost includes it.

A live cost ticker driven off cumulative_cost will undercount during turns, then jump up when case_completed adds result.su_total_cost. This matches the documented "honest totals only at case end" intent, so it's not a bug — just flagging the per-turn vs per-case inconsistency in case the UI relies on a smooth running total. Threading the running SU cost into on_turn would remove the jump.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/synthetic_user/runner.py` around lines 57 - 66,
TurnCompletedEvent.cumulative_cost currently excludes SU-driver spend while
CaseCompletedEvent.total_cost includes it, causing per-turn cost undercounts
then a jump at case completion; update the on-turn flow to thread the running SU
cost into each TurnCompletedEvent so cumulative_cost reflects assistant+SU spend
per turn (adjust the code paths that construct TurnCompletedEvent and any
function handling on_turn to accept and pass the incremental su_running_cost),
and ensure CaseCompletedEvent.total_cost still aggregates final su_total_cost so
the live ticker remains smooth and consistent with the end-of-case total.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/web_ui/src/lib/api_schema.d.ts`:
- Around line 17079-17104: The OpenAPI docs currently advertise
stream_run_cases_batch
(stream_run_cases_batch_api_projects__project_id__tasks__task_id__multiturn_sdg_run_cases_batch_post)
as returning "application/json" but the route actually returns a
StreamingResponse with media_type="text/event-stream"; update the FastAPI route
in app/desktop/studio_server/multiturn_sdg_api.py to declare the 200 response
content type as "text/event-stream" (e.g., add responses={200: {"content":
{"text/event-stream": {"schema": {"type":"string"}}}}} or set
response_class/response_model metadata appropriately) so the OpenAPI spec
reflects SSE, then run app/web_ui/src/lib/generate_schema.sh to regenerate
app/web_ui/src/lib/api_schema.d.ts; do not manually edit the generated TS file.

---

Nitpick comments:
In `@libs/core/kiln_ai/synthetic_user/runner.py`:
- Around line 57-66: TurnCompletedEvent.cumulative_cost currently excludes
SU-driver spend while CaseCompletedEvent.total_cost includes it, causing
per-turn cost undercounts then a jump at case completion; update the on-turn
flow to thread the running SU cost into each TurnCompletedEvent so
cumulative_cost reflects assistant+SU spend per turn (adjust the code paths that
construct TurnCompletedEvent and any function handling on_turn to accept and
pass the incremental su_running_cost), and ensure CaseCompletedEvent.total_cost
still aggregates final su_total_cost so the live ticker remains smooth and
consistent with the end-of-case total.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: dd8cddc7-7358-4d89-a388-06e6d09f5738

📥 Commits

Reviewing files that changed from the base of the PR and between d2c3f99 and d032dcf.

⛔ Files ignored due to path filters (20)

app/desktop/studio_server/api_client/kiln_ai_server_client/api/jobs/start_prompt_optimization_job_v1_jobs_prompt_optimization_job_start_post.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
app/desktop/studio_server/api_client/kiln_ai_server_client/api/jobs/start_sample_job_v1_jobs_sample_job_start_post.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
app/desktop/studio_server/api_client/kiln_ai_server_client/api/synthetic_user/__init__.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
app/desktop/studio_server/api_client/kiln_ai_server_client/api/synthetic_user/generate_v1_synthetic_user_generate_post.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
app/desktop/studio_server/api_client/kiln_ai_server_client/models/__init__.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
app/desktop/studio_server/api_client/kiln_ai_server_client/models/chat_completion_assistant_message_param_wrapper.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
app/desktop/studio_server/api_client/kiln_ai_server_client/models/chat_session_list_item.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
app/desktop/studio_server/api_client/kiln_ai_server_client/models/generate_synthetic_users_request.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
app/desktop/studio_server/api_client/kiln_ai_server_client/models/generate_synthetic_users_response.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
app/desktop/studio_server/api_client/kiln_ai_server_client/models/generate_v1_synthetic_user_generate_post_response_401.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
app/desktop/studio_server/api_client/kiln_ai_server_client/models/generate_v1_synthetic_user_generate_post_response_500.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
app/desktop/studio_server/api_client/kiln_ai_server_client/models/generate_v1_synthetic_user_generate_post_response_502.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
app/desktop/studio_server/api_client/kiln_ai_server_client/models/generate_v1_synthetic_user_generate_post_response_502_code.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
app/desktop/studio_server/api_client/kiln_ai_server_client/models/kiln_base_model.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
app/desktop/studio_server/api_client/kiln_ai_server_client/models/message_usage.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
app/desktop/studio_server/api_client/kiln_ai_server_client/models/synthetic_user_case.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
app/desktop/studio_server/api_client/kiln_ai_server_client/models/task_output.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
app/desktop/studio_server/api_client/kiln_ai_server_client/models/task_output_rating.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
app/desktop/studio_server/api_client/kiln_ai_server_client/models/task_run.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**
app/desktop/studio_server/api_client/kiln_ai_server_client/models/usage.py is excluded by !app/desktop/studio_server/api_client/kiln_ai_server_client/**

📒 Files selected for processing (26)

app/desktop/desktop_server.py
app/desktop/studio_server/multiturn_sdg_api.py
app/desktop/studio_server/synthetic_user/__init__.py
app/desktop/studio_server/synthetic_user/client.py
app/desktop/studio_server/synthetic_user/test_client.py
app/desktop/studio_server/test_multiturn_sdg_api.py
app/web_ui/src/lib/api_schema.d.ts
app/web_ui/src/lib/ui/conversation/multiturn_composer.svelte
libs/core/kiln_ai/synthetic_user/__init__.py
libs/core/kiln_ai/synthetic_user/case.py
libs/core/kiln_ai/synthetic_user/drive_loop.py
libs/core/kiln_ai/synthetic_user/driver.py
libs/core/kiln_ai/synthetic_user/models.py
libs/core/kiln_ai/synthetic_user/parser.py
libs/core/kiln_ai/synthetic_user/prompt.py
libs/core/kiln_ai/synthetic_user/role_swap.py
libs/core/kiln_ai/synthetic_user/runner.py
libs/core/kiln_ai/synthetic_user/test_case.py
libs/core/kiln_ai/synthetic_user/test_drive_loop.py
libs/core/kiln_ai/synthetic_user/test_driver.py
libs/core/kiln_ai/synthetic_user/test_models.py
libs/core/kiln_ai/synthetic_user/test_parser.py
libs/core/kiln_ai/synthetic_user/test_prompt.py
libs/core/kiln_ai/synthetic_user/test_role_swap.py
libs/core/kiln_ai/synthetic_user/test_runner.py
libs/server/kiln_server/server.py

coderabbitai · 2026-06-03T17:24:32Z

+    stream_run_cases_batch_api_projects__project_id__tasks__task_id__multiturn_sdg_run_cases_batch_post: {
+        parameters: {
+            query?: never;
+            header?: never;
+            path: {
+                /** @description ID of the project containing the target task. */
+                project_id: string;
+                /** @description ID of the target task. Must be a multi-turn task. */
+                task_id: string;
+            };
+            cookie?: never;
+        };
+        requestBody: {
+            content: {
+                "application/json": components["schemas"]["RunCasesBatchApiInput"];
+            };
+        };
+        responses: {
+            /** @description Successful Response */
+            200: {
+                headers: {
+                    [name: string]: unknown;
+                };
+                content: {
+                    "application/json": unknown;
+                };


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

run_cases_batch response media type is mis-modeled as JSON instead of SSE.

stream_run_cases_batch is typed with 200 -> application/json, but the backend route returns StreamingResponse(..., media_type="text/event-stream") (see app/desktop/studio_server/multiturn_sdg_api.py). This weakens the generated client contract for streaming and can break typed frontend consumption.

Please update the backend route OpenAPI metadata/response docs to advertise text/event-stream, then regenerate app/web_ui/src/lib/api_schema.d.ts via app/web_ui/src/lib/generate_schema.sh rather than editing this file directly.
Based on learnings: "app/web_ui/src/lib/api_schema.d.ts is auto-generated by openapi-typescript; do not propose manual edits. Schema changes should be made in the FastAPI backend … then re-generate the TS types."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/web_ui/src/lib/api_schema.d.ts` around lines 17079 - 17104, The OpenAPI docs currently advertise stream_run_cases_batch (stream_run_cases_batch_api_projects__project_id__tasks__task_id__multiturn_sdg_run_cases_batch_post) as returning "application/json" but the route actually returns a StreamingResponse with media_type="text/event-stream"; update the FastAPI route in app/desktop/studio_server/multiturn_sdg_api.py to declare the 200 response content type as "text/event-stream" (e.g., add responses={200: {"content": {"text/event-stream": {"schema": {"type":"string"}}}}} or set response_class/response_model metadata appropriately) so the OpenAPI spec reflects SSE, then run app/web_ui/src/lib/generate_schema.sh to regenerate app/web_ui/src/lib/api_schema.d.ts; do not manually edit the generated TS file.

chiang-daniel and others added 13 commits June 1, 2026 15:51

chore: re-vendor SDK from kiln_server@acbd0ce

b3337da

Picks up an OpenAPI description on GenerateSyntheticUsersResponse.cases documenting the strict-N batch contract. No shape change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jun 2, 2026

View reviewed changes

chiang-daniel changed the title ~~Dchiang/multiturn synthetic user~~ feat: multiturn synthetic user Runner Jun 2, 2026

chiang-daniel and others added 4 commits June 2, 2026 16:04

Merge remote-tracking branch 'origin/leonard/kil-632-feat-multiturn-t…

d9a3517

…ask' into dchiang/multiturn-synthetic-user

chiang-daniel commented Jun 3, 2026

View reviewed changes

chiang-daniel marked this pull request as ready for review June 3, 2026 17:12

coderabbitai Bot reviewed Jun 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: multiturn synthetic user Runner#1441

feat: multiturn synthetic user Runner#1441
chiang-daniel wants to merge 17 commits into
leonard/kil-632-feat-multiturn-taskfrom
dchiang/multiturn-synthetic-user

chiang-daniel commented Jun 2, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 2, 2026 •

edited

Loading

app/desktop/studio_server/synthetic_user/client.py

libs/core/kiln_ai/synthetic_user/drive_loop.py

libs/core/kiln_ai/synthetic_user/driver.py

libs/core/kiln_ai/synthetic_user/role_swap.py

libs/core/kiln_ai/synthetic_user/runner.py

Uh oh!

chiang-daniel Jun 3, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant


		headers["Content-Type"] = "multipart/form-data; boundary=+++"

		_kwargs["headers"] = headers

Conversation

chiang-daniel commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Pipeline

Notable

Flow

Test plan

Related Issues

Contributor License Agreement

Checklists

Uh oh!

coderabbitai Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📊 Coverage Report

Diff: origin/leonard/kil-632-feat-multiturn-task...HEAD

Summary

Line-by-line

app/desktop/studio_server/synthetic_user/client.py

libs/core/kiln_ai/synthetic_user/drive_loop.py

libs/core/kiln_ai/synthetic_user/driver.py

libs/core/kiln_ai/synthetic_user/role_swap.py

libs/core/kiln_ai/synthetic_user/runner.py

Uh oh!

chiang-daniel Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

chiang-daniel commented Jun 2, 2026 •

edited

Loading

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading

github-actions Bot commented Jun 2, 2026 •

edited

Loading

chiang-daniel Jun 3, 2026 •

edited

Loading