Infinite Loop Bug in fix_message_list Function when using Nested Agent-as-a-tool caused CAI to Hang Indefinitely #410

@cryansky

Context

I was experimenting with agents that call other agents as tools. In some cases, when one agent calls another agent (as a tool) and that agent invokes a further tool (including an MCP tool), CAI hangs indefinitely even after the tool output is returned.

After some debugging, I found that the fix_message_list function in cai/util.py contains a bug that causes execution to hang. It occurs when an assistant message has multiple tool_calls and the tool responses are out of order in the conversation history.

Bug Description

The fix_message_list function has a second pass (a while loop) that ensures every tool message directly follows its matching assistant message. To decide whether a tool message is already in the right place, it checks whether the immediately preceding message is the matching assistant. This check is too strict: it doesn't account for sibling tool messages that belong to the same assistant.

Affected Code

# Only checks the single previous message
prev_msg = processed_messages[i - 1]

is_valid_sequence = (
    prev_msg.get("role") == "assistant"
    and prev_msg.get("tool_calls")
    and any(tc.get("id") == tool_id for tc in prev_msg.get("tool_calls", []))
)

When an assistant has two tool calls (call_1, call_2), the valid sequence looks like:

[0] assistant(tool_calls=[call_1, call_2])
[1] tool(call_1)
[2] tool(call_2)

But at index 2, the code sees tool(call_1) as the previous message, which is not an assistant, and wrongly concludes that tool(call_2) is out of place. It then moves tool(call_2) to position 1, which pushes tool(call_1) to position 2, where the same wrong check triggers again. This creates an infinite ping-pong:

iteration 1:
  [0] assistant(tool_calls=[call_1, call_2])
  [1] tool(call_1)
  [2] tool(call_2)  ← i=2, prev is tool(call_1) not assistant → move to pos 1

iteration 2:
  [0] assistant(tool_calls=[call_1, call_2])
  [1] tool(call_2)  ← just moved here
  [2] tool(call_1)  ← i=2, prev is tool(call_2) not assistant → move to pos 1

iteration 3:
  [0] assistant(tool_calls=[call_1, call_2])
  [1] tool(call_1)  ← just moved here
  [2] tool(call_2)  ← i=2, same as iteration 1...

... forever
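The ping-pong above can be reproduced with a small, bounded model of the strict check. This is a simplified sketch, not the actual code in cai/util.py: the function name simulate_strict_second_pass, the max_moves cap, and the assumption that the loop index is not advanced after a move are all introduced here for illustration.

```python
def simulate_strict_second_pass(messages, max_moves=4):
    """Simplified model of the strict check; not the actual cai/util.py code.

    Returns (history, terminated): the tool_call_id order after each move,
    and whether the loop reached the end of the list before hitting max_moves.
    """
    msgs = list(messages)
    history = []
    i = 0
    while i < len(msgs) and len(history) < max_moves:
        msg = msgs[i]
        if i == 0 or msg.get("role") != "tool":
            i += 1
            continue
        tool_id = msg.get("tool_call_id")
        prev_msg = msgs[i - 1]
        # The strict check: only the single previous message is inspected.
        is_valid_sequence = (
            prev_msg.get("role") == "assistant"
            and any(tc.get("id") == tool_id for tc in prev_msg.get("tool_calls") or [])
        )
        if is_valid_sequence:
            i += 1
            continue
        # Locate the assistant that issued this tool call.
        owner = next(
            (j for j, m in enumerate(msgs)
             if m.get("role") == "assistant"
             and any(tc.get("id") == tool_id for tc in m.get("tool_calls") or [])),
            None,
        )
        if owner is None:
            i += 1
            continue
        # Move the tool message to directly after its assistant (assumed behavior).
        moved = msgs.pop(i)
        target = owner + 1 if owner < i else owner
        msgs.insert(target, moved)
        history.append([m["tool_call_id"] for m in msgs if m.get("role") == "tool"])
        # i is deliberately not advanced, so the same index is re-checked.
    return history, i >= len(msgs)


conversation = [
    {"role": "assistant", "tool_calls": [{"id": "call_1"}, {"id": "call_2"}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "first"},
    {"role": "tool", "tool_call_id": "call_2", "content": "second"},
]
history, terminated = simulate_strict_second_pass(conversation)
print(history)     # the two tool messages swap places on every move
print(terminated)  # False: the cap was hit, the loop never finished on its own
```

Without the max_moves cap, this model would never exit, which matches the hang described in this report.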

Fix

Instead of only checking the single previous message, walk backward past sibling tool messages to find the nearest assistant. If that assistant owns the current tool message, the sequence is valid and no move is needed.
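A minimal sketch of that backward walk, assuming messages are plain dicts shaped like the ones in this report. The helper name is_valid_tool_position is hypothetical and is not part of CAI; it only illustrates the check, not the surrounding reordering loop.

```python
def is_valid_tool_position(messages, i):
    """Return True if the tool message at index i is owned by the nearest
    preceding assistant, ignoring sibling tool messages in between.

    Hypothetical helper for illustration; not the actual cai/util.py code.
    """
    tool_id = messages[i].get("tool_call_id")
    j = i - 1
    # Walk backward past sibling tool messages.
    while j >= 0 and messages[j].get("role") == "tool":
        j -= 1
    if j < 0:
        return False  # No preceding non-tool message at all.
    candidate = messages[j]
    if candidate.get("role") != "assistant":
        return False  # Nearest non-tool message is not an assistant.
    # Valid only if that assistant actually issued this tool call.
    return any(tc.get("id") == tool_id for tc in candidate.get("tool_calls") or [])
```

With this check, both tool(call_1) and tool(call_2) are accepted in either order after assistant(tool_calls=[call_1, call_2]), so neither message is moved and the loop can advance.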

When Does This Happen?

This triggers when tool responses arrive out of order relative to the tool_calls array. In my case it happened with the agent-as-tool approach, where sub-agents run concurrently.

Example Setup

from cai.sdk.agents import Agent

wiz_operator_agent = Agent(
    name="wiz_operator_agent",
    instructions="...",
    mcp_servers=[wiz],  # Wiz MCP (list_issues, get_issue_v2, ...)
)

github_operator_agent = Agent(
    name="github_operator_agent",
    instructions="...",
    mcp_servers=[github],  # GitHub MCP (get_file_contents, search_code, ...)
)

# Parent agent calls sub-agents as tools
alert_agent = Agent(
    name="alert_agent",
    instructions="...",
    tools=[
        wiz_operator_agent.as_tool(
            tool_name="verify_wiz_exposure",
            tool_description="Re-verify exposure claims in Wiz",
        ),
        github_operator_agent.as_tool(
            tool_name="verify_github_code",
            tool_description="Re-verify code analysis: check for fixes/PRs",
        ),
    ],
)

When alert_agent calls both tools in one turn, each triggers a Runner.run() inside as_tool. The GitHub call might complete before the Wiz call, even though the LLM listed Wiz first in tool_calls:

# What the LLM produced
{"role": "assistant", "tool_calls": [
    {"id": "call_1", "function": {"name": "verify_wiz_exposure", ...}},
    {"id": "call_2", "function": {"name": "verify_github_code", ...}}
]}

# What arrived (GitHub finished first due to simpler MCP call chain)
{"role": "tool", "tool_call_id": "call_2", "content": "GitHub: PR #1234 patches CVE..."}
{"role": "tool", "tool_call_id": "call_1", "content": "Wiz: resource is internet-facing..."}

call_2 arrived before call_1, but call_1 was listed first → out of order → infinite loop.
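To check whether a given conversation is in this state, the arrival order of the tool responses can be compared with the order of the tool_calls array. The sketch below assumes the dict shapes shown above; the helper name response_order_mismatch is hypothetical and not part of CAI.

```python
def response_order_mismatch(assistant_msg, tool_msgs):
    """Return True if tool responses arrived in a different order than
    the assistant listed them in tool_calls.

    Hypothetical diagnostic helper; not part of CAI.
    """
    listed = [tc.get("id") for tc in assistant_msg.get("tool_calls") or []]
    arrived = [m.get("tool_call_id") for m in tool_msgs if m.get("role") == "tool"]
    # Compare only the IDs present on both sides, each in its own order.
    listed_seen = [call_id for call_id in listed if call_id in arrived]
    arrived_known = [call_id for call_id in arrived if call_id in listed]
    return arrived_known != listed_seen


assistant = {"role": "assistant", "tool_calls": [
    {"id": "call_1", "type": "function", "function": {"name": "verify_wiz_exposure", "arguments": "{}"}},
    {"id": "call_2", "type": "function", "function": {"name": "verify_github_code", "arguments": "{}"}},
]}
responses = [
    {"role": "tool", "tool_call_id": "call_2", "content": "GitHub result"},
    {"role": "tool", "tool_call_id": "call_1", "content": "Wiz result"},
]
print(response_order_mismatch(assistant, responses))  # True: call_2 arrived first
```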

Steps to Reproduce

Pass this message list to fix_message_list:

from cai.util import fix_message_list

messages = [
    {"role": "user", "content": "Validate alert"},
    {"role": "assistant", "tool_calls": [
        {"id": "call_1", "type": "function", "function": {"name": "verify_wiz_exposure", "arguments": "{}"}},
        {"id": "call_2", "type": "function", "function": {"name": "verify_github_code", "arguments": "{}"}}
    ]},
    {"role": "tool", "tool_call_id": "call_2", "content": "GitHub result"},  # Out of order
    {"role": "tool", "tool_call_id": "call_1", "content": "Wiz result"},
]

fix_message_list(messages)  # Hangs forever
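Because the call above never returns on an affected version, it can be convenient to run it under a time limit. The helper below is a generic sketch using only the standard library; the name finishes_within and the 2-second default are assumptions, not part of CAI.

```python
import threading


def finishes_within(func, args=(), timeout=2.0):
    """Run func(*args) in a daemon thread and report whether it finished
    within `timeout` seconds.

    Note: a call that never returns keeps running in the background until
    the process exits, and a call that raises also counts as finished.
    """
    worker = threading.Thread(target=func, args=args, daemon=True)
    worker.start()
    worker.join(timeout)
    return not worker.is_alive()


# Usage with the reproduction above (requires CAI to be installed):
#   from cai.util import fix_message_list
#   print(finishes_within(fix_message_list, (messages,)))
```

Going by the behavior reported here, this should print False on an affected version and True once the check is fixed.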
