[improvement][Agent][stream thinking tokens through streaming_callback and arun_stream] #1621

@kyegomez

Summary

Agent currently swallows reasoning/thinking deltas from the LLM stream and only flushes them to the console as a single rich panel after the thinking phase ends. Integrators using streaming_callback, arun_stream, or run_stream cannot surface thinking tokens in real time — they only ever see content tokens.

We should stream thinking tokens through the same callback surface as content tokens, with a clear way to distinguish them.

Today's behavior

swarms/structs/agent.py:4036-4090, _yield_only_content_chunks:

reasoning = getattr(delta, "reasoning_content", None)
if reasoning:
    thinking_parts.append(reasoning)
    continue  # swallow the thinking chunk; don't pass to content stream

# First non-thinking chunk: flush accumulated thinking
if thinking_parts and not thinking_displayed:
    if self.print_on:
        formatter.print_thinking_panel("".join(thinking_parts), title=...)
    thinking_displayed = True

The reasoning deltas:

  • Never reach streaming_callback.
  • Never reach arun_stream / run_stream consumers.
  • Are batched, so even console users see the thinking as one block at the end of the thinking phase, not as it's produced.

This means a dashboard, web UI, or terminal renderer integrating against `Agent` cannot show "thinking in progress" the way Claude.ai / OpenAI playground / Anthropic Console do.

Repro

from swarms import Agent

agent = Agent(
    agent_name="Reasoner",
    model_name="claude-sonnet-4-6",
    thinking_tokens=2000,
    streaming_callback=lambda tok: print(repr(tok)),
)
agent.run("Solve: a chicken and a half lays an egg and a half in a day and a half.")
# Expected: callback fires for thinking tokens AND content tokens, distinguishably.
# Actual:   callback fires only for content tokens. Thinking is invisible to the callback.

Same gap exists for arun_stream / run_stream — they only yield content tokens.

Proposed design

Option A (preferred): tagged events. Change streaming_callback to optionally accept a structured event dict, and have arun_stream / run_stream yield events when an opt-in flag is set:

# Token event
{"type": "thinking", "token": "..."}
{"type": "content",  "token": "..."}

# Phase boundaries (optional but useful)
{"type": "thinking_start"}
{"type": "thinking_end",   "text": "<full thinking>"}
{"type": "content_start"}
{"type": "content_end",    "text": "<full content>"}
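To make the proposed event shapes concrete, here is a minimal sketch of how a flat sequence of tagged tokens could be expanded into token events plus phase boundaries. The names StreamEvent and tag_events are hypothetical and exist only for illustration; nothing here is part of the current swarms API.

```python
from typing import Iterable, Iterator, List, Optional, Tuple, TypedDict


class StreamEvent(TypedDict, total=False):
    # Hypothetical event type mirroring the shapes proposed above.
    type: str   # "thinking", "content", or a "*_start" / "*_end" boundary
    token: str  # present on "thinking" / "content" events
    text: str   # present on "*_end" events (full accumulated text)


def tag_events(chunks: Iterable[Tuple[str, str]]) -> Iterator[StreamEvent]:
    """Expand (kind, token) pairs into token events plus phase boundaries.

    `kind` is assumed to be either "thinking" or "content".
    """
    current: Optional[str] = None
    parts: List[str] = []
    for kind, token in chunks:
        if kind != current:
            # Close the previous phase (if any) before opening the new one.
            if current is not None:
                yield {"type": f"{current}_end", "text": "".join(parts)}
            yield {"type": f"{kind}_start"}
            current, parts = kind, []
        parts.append(token)
        yield {"type": kind, "token": token}
    if current is not None:
        yield {"type": f"{current}_end", "text": "".join(parts)}
```

A consumer that only cares about tokens can ignore the boundary events; a UI that renders a collapsible "thinking" panel can key off thinking_start / thinking_end.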

Preserve back-compat: if the callback signature is Callable[[str], None], keep delivering only content tokens (today's behavior). If it's Callable[[dict], None] (detect via inspect.signature) or the user passes streaming_events=True, deliver tagged events.
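One way the back-compat check could look is sketched below. It relies on inspect.signature, which only exposes annotations, so an unannotated callback such as lambda tok: ... cannot be told apart from an event consumer; this sketch treats that case as content-only unless the flag is set. The helper name wants_events is hypothetical, and streaming_events is the flag name floated above, not an existing parameter.

```python
import inspect
from typing import Any, Callable


def wants_events(callback: Callable[..., Any], streaming_events: bool = False) -> bool:
    """Return True if the callback should receive tagged event dicts."""
    if streaming_events:
        return True  # explicit opt-in always wins
    try:
        params = list(inspect.signature(callback).parameters.values())
    except (TypeError, ValueError):
        return False  # signature not introspectable; keep today's behavior
    if not params:
        return False
    annotation = params[0].annotation
    # Accept `dict`, `Dict[str, Any]` / `dict[str, Any]`, or the string "dict"
    # (which is what postponed annotations produce).
    origin = getattr(annotation, "__origin__", annotation)
    return origin is dict or annotation == "dict"
```

Because unannotated callbacks fall through to the content-only path, existing integrations would keep their current behavior; the cost is that event consumers must either annotate their first parameter or pass the flag.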

Option B: separate thinking_callback. Add a second kwarg:

agent = Agent(
    ...,
    streaming_callback=on_content_token,
    thinking_callback=on_thinking_token,
)

Simpler to add, no signature detection, but doesn't generalize to arun_stream/run_stream cleanly.
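For comparison, the two-callback shape of Option B can be expressed as a thin adapter over tagged events. This is only a sketch under the assumption that events have the {"type", "token"} shape from Option A; route_events is a hypothetical helper, not an existing function.

```python
from typing import Any, Callable, Dict, Optional


def route_events(
    on_content: Callable[[str], None],
    on_thinking: Optional[Callable[[str], None]] = None,
) -> Callable[[Dict[str, Any]], None]:
    """Build an event callback that forwards tokens to per-kind callbacks."""

    def on_event(event: Dict[str, Any]) -> None:
        kind = event.get("type")
        if kind == "content":
            on_content(event["token"])
        elif kind == "thinking" and on_thinking is not None:
            on_thinking(event["token"])
        # Boundary events (thinking_start, content_end, ...) are ignored here.

    return on_event
```

If something like this holds up, Option B would not need separate plumbing: it could be offered as a convenience wrapper on top of the event stream.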

I lean toward Option A because it composes with the existing arun_stream(with_events=True) pattern already established in AgentRearrange (swarms/structs/agent_rearrange.py:1105-1129) — same event shape, just add thinking / thinking_start / thinking_end types.

Acceptance criteria

  • A reasoning model (claude-sonnet-4-6 with thinking_tokens=..., or an OpenAI o-series model) streams thinking deltas to the registered callback in real time, one chunk at a time, before the first content token arrives.
  • Thinking tokens are distinguishable from content tokens in the callback payload.
  • arun_stream(with_events=True) yields {"type": "thinking", "token": ...} events for reasoning deltas alongside the existing content events.
  • The console rich-panel UX for print_on=True is preserved (or rendered incrementally — bonus).
  • Back-compat: existing streaming_callback=lambda tok: ... integrations that only care about content keep working without code changes.
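The ordering and distinguishability criteria above could be checked in a test with a small predicate over captured events. This is a sketch that assumes the Option A event shape; thinking_precedes_content is a hypothetical helper name.

```python
from typing import Any, Dict, List


def thinking_precedes_content(events: List[Dict[str, Any]]) -> bool:
    """True if at least one thinking token was seen and none arrived after content began."""
    seen_thinking = False
    seen_content = False
    for event in events:
        if event["type"] == "content":
            seen_content = True
        elif event["type"] == "thinking":
            if seen_content:
                return False  # a thinking token arrived after content started
            seen_thinking = True
    return seen_thinking
```

A test would register a callback that appends every event to a list, run the agent against a reasoning model (or a stubbed stream), and assert this predicate on the captured list.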

Notes

  • _yield_only_content_chunks (agent.py:4036) is the natural place to fire thinking events before swallowing the chunk. Pass the callback / event-sink through from call_llm (agent.py:4092).
  • Reasoning content lives at delta.reasoning_content per LiteLLM; the same accessor is already used at agent.py:4056.
  • AgentRearrange.arun_stream(with_events=True) already returns agent_start / token / agent_end events — extending the same shape with thinking_start / thinking / thinking_end keeps the multi-agent streaming layer consistent.

Labels: enhancement (New feature or request)
