Summary
Agent currently swallows reasoning/thinking deltas from the LLM stream and only flushes them to the console as a single rich panel after the thinking phase ends. Integrators using streaming_callback, arun_stream, or run_stream cannot surface thinking tokens in real time — they only ever see content tokens.
We should stream thinking tokens through the same callback surface as content tokens, with a clear way to distinguish them.
Today's behavior
swarms/structs/agent.py:4036-4090 — _yield_only_content_chunks:
reasoning = getattr(delta, "reasoning_content", None)
if reasoning:
    thinking_parts.append(reasoning)
    continue  # swallow the thinking chunk; don't pass to content stream
# First non-thinking chunk: flush accumulated thinking
if thinking_parts and not thinking_displayed:
    if self.print_on:
        formatter.print_thinking_panel("".join(thinking_parts), title=...)
    thinking_displayed = True
The reasoning deltas:
- Never reach streaming_callback.
- Never reach arun_stream / run_stream consumers.
- Are batched, so even console users see the thinking as one block at the end of the thinking phase, not as it's produced.
This means a dashboard, web UI, or terminal renderer integrating against `Agent` cannot show "thinking in progress" the way Claude.ai / OpenAI playground / Anthropic Console do.
Repro
from swarms import Agent

agent = Agent(
    agent_name="Reasoner",
    model_name="claude-sonnet-4-6",
    thinking_tokens=2000,
    streaming_callback=lambda tok: print(repr(tok)),
)
agent.run("Solve: a chicken and a half lays an egg and a half in a day and a half.")
# Expected: callback fires for thinking tokens AND content tokens, distinguishably.
# Actual: callback fires only for content tokens. Thinking is invisible to the callback.
Same gap exists for arun_stream / run_stream — they only yield content tokens.
Proposed design
Option A (preferred): tagged events. Change streaming_callback to optionally accept a structured event dict, and have arun_stream / run_stream yield events when an opt-in flag is set:
# Token events
{"type": "thinking", "token": "..."}
{"type": "content", "token": "..."}
# Phase boundaries (optional but useful)
{"type": "thinking_start"}
{"type": "thinking_end", "text": "<full thinking>"}
{"type": "content_start"}
{"type": "content_end", "text": "<full content>"}
Preserve back-compat: if the callback signature is Callable[[str], None], keep delivering only content tokens (today's behavior). If it's Callable[[dict], None] (detect via inspect.signature) or the user passes streaming_events=True, deliver tagged events.
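The detection rule described above might look roughly like the following. This is a minimal sketch, not existing swarms code: wants_events and deliver are hypothetical helper names, and the check assumes the callback's first parameter carries a type annotation. An unannotated lambda has no type information to inspect, so for such callbacks the explicit streaming_events flag is the only reliable opt-in.

```python
import inspect
from typing import Any, Callable, Dict, get_type_hints


def wants_events(callback: Callable[..., Any], streaming_events: bool = False) -> bool:
    """Return True if `callback` should receive tagged event dicts (hypothetical helper)."""
    if streaming_events:  # the explicit opt-in always wins
        return True
    try:
        params = list(inspect.signature(callback).parameters.values())
        hints = get_type_hints(callback)
    except (TypeError, ValueError, NameError):
        return False  # uninspectable callables keep today's behavior
    if not params:
        return False
    annotation = hints.get(params[0].name)
    # Accept plain `dict` as well as parameterized forms such as Dict[str, Any].
    return annotation is dict or getattr(annotation, "__origin__", None) is dict


def deliver(callback: Callable[..., Any], event: Dict[str, str], as_events: bool) -> None:
    """Send the full event, or only content tokens for legacy str callbacks."""
    if as_events:
        callback(event)
    elif event.get("type") == "content":
        callback(event["token"])
```

With this shape, a legacy callback never sees thinking or boundary events, so existing integrations would be unaffected.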
Option B: separate thinking_callback. Add a second kwarg:
agent = Agent(
    ...,
    streaming_callback=on_content_token,
    thinking_callback=on_thinking_token,
)
Simpler to add, no signature detection, but doesn't generalize to arun_stream/run_stream cleanly.
I lean toward Option A because it composes with the existing arun_stream(with_events=True) pattern already established in AgentRearrange (swarms/structs/agent_rearrange.py:1105-1129) — same event shape, just add thinking / thinking_start / thinking_end types.
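If Option A is adopted, the two-callback ergonomics of Option B could still be offered as a thin adapter over tagged events. The sketch below assumes the event shape proposed above; split_callbacks is a hypothetical name, not part of swarms.

```python
from typing import Callable, Dict, Optional


def split_callbacks(
    on_content: Callable[[str], None],
    on_thinking: Optional[Callable[[str], None]] = None,
) -> Callable[[Dict[str, str]], None]:
    """Build one event callback that routes tokens to per-kind callbacks."""

    def on_event(event: Dict[str, str]) -> None:
        kind = event.get("type")
        if kind == "content":
            on_content(event["token"])
        elif kind == "thinking" and on_thinking is not None:
            on_thinking(event["token"])
        # Boundary events (thinking_start, content_end, ...) are ignored here.

    return on_event
```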
Acceptance criteria
- A reasoning model (claude-sonnet-4-6 with thinking_tokens=..., or an OpenAI o-series model) streams thinking deltas to the registered callback in real time, one chunk at a time, before the first content token arrives.
- Thinking tokens are distinguishable from content tokens in the callback payload.
- arun_stream(with_events=True) yields {"type": "thinking", "token": ...} events for reasoning deltas alongside the existing content events.
- The console rich-panel UX for print_on=True is preserved (or rendered incrementally, as a bonus).
- Back-compat: existing streaming_callback=lambda tok: ... integrations that only care about content keep working without code changes.
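The ordering and tagging criteria could be checked in a unit test with a helper along these lines. It is only a sketch: thinking_precedes_content is a hypothetical name, and it assumes the event shape proposed above and a single thinking phase that ends before content begins.

```python
from typing import Dict, Iterable


def thinking_precedes_content(events: Iterable[Dict[str, str]]) -> bool:
    """Return True if no thinking token arrives after the first content token."""
    seen_content = False
    for event in events:
        kind = event.get("type")
        if kind == "content":
            seen_content = True
        elif kind == "thinking" and seen_content:
            return False
    return True
```

A test would collect the events delivered to the callback during one run and assert that the helper returns True and that at least one thinking event was recorded.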
Notes
- _yield_only_content_chunks (agent.py:4036) is the natural place to fire thinking events before swallowing the chunk. Pass the callback / event-sink through from call_llm (agent.py:4092).
- Reasoning content lives at delta.reasoning_content per LiteLLM; same accessor already used at L4056.
- AgentRearrange.arun_stream(with_events=True) already returns agent_start / token / agent_end events; extending the same shape with thinking_start / thinking / thinking_end keeps the multi-agent streaming layer consistent.
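To make the first note concrete, here is a rough sketch of a chunk loop that emits events as deltas arrive instead of buffering the thinking. It is not the current agent.py code: iter_stream_events is a hypothetical name, and it assumes each delta exposes reasoning_content (as noted above) and a content attribute in the style of OpenAI-compatible streaming deltas.

```python
from typing import Any, Dict, Iterable, Iterator, List


def iter_stream_events(deltas: Iterable[Any]) -> Iterator[Dict[str, str]]:
    """Turn a stream of LLM deltas into tagged thinking/content events."""
    thinking_parts: List[str] = []
    content_parts: List[str] = []
    thinking_open = False
    content_started = False

    for delta in deltas:
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            if not thinking_open:
                thinking_open = True
                yield {"type": "thinking_start"}
            thinking_parts.append(reasoning)
            yield {"type": "thinking", "token": reasoning}  # emitted immediately
            continue

        content = getattr(delta, "content", None)
        if not content:
            continue  # nothing to surface for this delta
        if thinking_open:
            # First content delta closes the thinking phase.
            thinking_open = False
            yield {"type": "thinking_end", "text": "".join(thinking_parts)}
        if not content_started:
            content_started = True
            yield {"type": "content_start"}
        content_parts.append(content)
        yield {"type": "content", "token": content}

    if thinking_open:  # stream ended while still thinking
        yield {"type": "thinking_end", "text": "".join(thinking_parts)}
    if content_started:
        yield {"type": "content_end", "text": "".join(content_parts)}
```

The accumulated thinking text is still available at thinking_end, so the existing rich panel could be rendered from that event when print_on is set.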