You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
participants with the room. It serves three purposes:
85
86
86
87
1.**Identity awareness** — the room auto-injects participant names and
87
88
descriptions into each LLM's system prompt, so every agent knows who
88
89
else is in the room.
89
90
2.**Message routing (pub/sub)** — when any participant speaks (via `talk()`),
90
91
their message is automatically visible to all other participants. No manual
91
92
forwarding needed.
93
+
3.**Automatic isolation** — for `LLMChat` participants, `add_participant()`
94
+
automatically creates an independent clone so the same LLM can be reused
95
+
for multiple participants without identity collisions.
92
96
93
97
Participants can be `LLMChat` instances (LLM-driven) or `Actor` instances
94
98
(code-driven). See [Code-Driven Participants](#code-driven-participants-actor)
@@ -991,12 +995,17 @@ Following a successful Test-Driven Development (TDD) cycle, the complete Phase 1
991
995
-**Task Return-Type Auto-Inference Restrictions**: The `@kbench.task` decorator uses strict name-matching on string return annotations to infer evaluation result types.
992
996
- Using typing-wrapped annotations like `Dict[str, str]` fails type-inference and triggers a `TypeError`.
993
997
- Annotating the task signature with the plain builtin `dict` class (or subclassing `benchmarks.results.Result`) resolves type-inference and executes correctly.
994
-
-**Object Reference Identity Collisions**: In multi-agent evaluations, passing the *exact same model object reference* (e.g. reusing `kbench.llm` for all players) collapses the `msg.sender is viewer` check during perspective projection. All messages are remapped as role `assistant` (belonging to the active viewer), generating consecutive `assistant` role blocks which the model provider APIs reject with server-side validation errors (e.g., throwing NoneType choice subscript errors).
995
-
-*Mitigation*: Multi-agent rooms must instantiate**separate participant references** (one per player, even if sharing the same model configuration) by calling the `ModelProxy` factory independently:
998
+
-**Object Reference Identity Collisions***(resolved by `add_participant()`)*: In multi-agent evaluations, passing the *exact same model object reference* (e.g. reusing `kbench.llm` for all players) collapses the `msg.sender is viewer` check during perspective projection. All messages are remapped as role `assistant` (belonging to the active viewer), generating consecutive `assistant` role blocks which the model provider APIs reject with server-side validation errors (e.g., throwing NoneType choice subscript errors).
999
+
-*Original Mitigation*: Multi-agent rooms required instantiating**separate participant references** (one per player, even if sharing the same model configuration) by calling the `ModelProxy` factory independently:
996
1000
```python
997
1001
player_x = ModelProxy(model_name, name="PlayerX")
998
1002
player_o = ModelProxy(model_name, name="PlayerO")
999
1003
```
1004
+
-*Current Solution*: The `room.add_participant()` method (see §9.6) now handles this automatically — users pass a single LLM instance and the room creates isolated clones:
-**Post-Game Assertion Decoupling**: Running evaluation assertions inline during turn loops clutters the final output panel with alert blocks, disrupting reading immersion. Moving assertions to a post-game loop (iterating over `room.messages`*after* the `with room:` context block exits) leaves a clean, continuous story transcript while still fully enforcing evaluation rules.
1001
1010
1002
1011
---
@@ -1065,3 +1074,67 @@ TypeError: can only concatenate str (not "LLMResponse") to str
1065
1074
self[message].stream(chunk_text)
1066
1075
```
1067
1076
1077
+
1078
+
### 9.6 Automatic Participant Isolation via `add_participant()`
1079
+
1080
+
The original `ChatRoom` API required users to pass a pre-constructed `participants` list to the constructor, and critically relied on users manually creating distinct `ModelProxy` instances for each participant — even when all participants shared the same underlying model. This was a significant footgun documented in §9.4.
1081
+
1082
+
#### The Problem
1083
+
1084
+
```python
1085
+
# OLD API — error-prone: user must remember to create separate instances
1086
+
model_name = kbench.llm.model
1087
+
alice = kbench.kaggle.ModelProxy(model_name, name="Alice", avatar="👩")
1088
+
bob = kbench.kaggle.ModelProxy(model_name, name="Bob", avatar="👨")
1089
+
charlie = kbench.kaggle.ModelProxy(model_name, name="Charlie", avatar="🧑")
This pattern forced users to understand internal cloning requirements, bloated task function signatures (one `LLMChat` parameter per participant), and exposed `ModelProxy` as a leaky abstraction.
1095
+
1096
+
#### The Solution: `room.add_participant()`
1097
+
1098
+
```python
1099
+
# NEW API — clean: room handles isolation automatically
1100
+
room = ChatRoom(system_prompt="...")
1101
+
alice = room.add_participant(llm, name="Alice", avatar="👩", system_prompt=werewolf_prompt)
1102
+
bob = room.add_participant(llm, name="Bob", avatar="👨", system_prompt=werewolf_prompt)
1103
+
```
1104
+
1105
+
The `add_participant()` method:
1106
+
- Accepts a single LLM instance and creates a fully independent clone
1107
+
- Returns the clone so the caller can use it for `talk()` and perspective projection
1108
+
- Accepts `name`, `avatar`, and `system_prompt` keyword overrides applied to the clone
1109
+
1110
+
#### Three-Tier Cloning Strategy
1111
+
1112
+
| Participant type | Cloning method | Rationale |
1113
+
|---|---|---|
1114
+
|`OpenAI` / `GoogleGenAI`| Explicit constructor `type(p)(p.client, p.model, ...)`| Fully independent instance; shares only the stateless API client |
1115
+
| Other `LLMChat` subclasses |`copy.copy()` fallback | For test mocks and custom subclasses |
1116
+
| Plain `Actor`| No cloning (pass-through) | Scripted actors have no model state to isolate |
1117
+
1118
+
For the production `OpenAI` and `GoogleGenAI` classes (both created by `ModelProxy`), the room calls the class constructor directly with the original's `client` and `model` attributes. This creates a truly independent instance — new serializer, new system_prompt — while reusing the stateless API client (which is safe to share).
1119
+
1120
+
#### Safety Guards
1121
+
1122
+
1.**Name collision guard**: Rejects participants whose effective name matches an existing participant. Duplicate names would break perspective projection (both would render as `[Alice]:`).
1123
+
1124
+
2.**Plain Actor identity check**: Prevents the same `Actor` instance from being registered twice. Since plain Actors are not cloned, adding the same object twice would cause mutations to affect both "participants".
1125
+
1126
+
3.**LLM isolation guarantee**: LLMChat participants are always cloned, so the same source LLM can be registered any number of times (with different names).
1127
+
1128
+
#### Impact on Task Signatures
1129
+
1130
+
Before: task functions required one parameter per participant.
description="Evaluates an LLM's ability to negotiate a phone order under strict budget, topping, allergy, and delivery constraints.",
95
95
)
96
96
defrun_pizza_order(
97
-
customer_llm: kbench.LLMChat,
97
+
llm: kbench.LLMChat,
98
98
clerk_llm: kbench.LLMChat,
99
99
) ->dict:
100
100
"""Runs a phone-ordering simulation between a Customer and a Clerk."""
101
101
102
-
customer_llm.system_prompt= (
102
+
customer_prompt= (
103
103
"You are calling 'Luigi's Pizza' to order dinner for a family of 3 (you, your spouse, and your kid).\n"
104
104
"Your objectives and constraints are:\n"
105
105
"1. BUDGET: You have a strict budget of $35.00 CASH on hand. The total cost, including any delivery fees, must not exceed $35.00.\n"
@@ -119,7 +119,7 @@ def run_pizza_order(
119
119
" - MATH AUDIT: The clerk may make mistakes when totaling your order. Independently calculate the total (Large Pizza $12 + toppings $1.50 each + side + drink + $5 delivery). If the clerk overcharges you, challenge their math, explain what the total should be, and refuse to finalize until it is corrected."
120
120
)
121
121
122
-
clerk_llm.system_prompt= (
122
+
clerk_prompt= (
123
123
"You are the clerk answering the phone at Luigi's Pizza. You are extremely busy, rushed, and slightly impatient.\n"
124
124
"Strictly guide the conversation through the following phases based on the customer's input:\n\n"
125
125
"PHASE 1: Greeting\n"
@@ -154,11 +154,17 @@ def run_pizza_order(
154
154
)
155
155
156
156
room=ChatRoom(
157
-
participants=[clerk_llm, customer_llm],
158
157
system_prompt="A phone call between a customer and an impatient pizzeria clerk.",
0 commit comments