Multi-agent evaluation as an LLM testing dimension #2540

ThinkOffApp · 2026-03-08T15:48:07Z

deepeval covers single-model evaluation dimensions well (hallucination, bias, toxicity, etc.). We've been working on a dimension that doesn't fit neatly into single-model evaluation: how LLMs behave in multi-agent group settings.

OMATS (OpenClaw Multi-Agent Test Suite, https://github.com/ThinkOffApp/openclaw-multi-agent-test-suite) tests models across 28 scenarios in shared room environments. The failure modes are specific to group dynamics and don't show up in single-model evaluation:

- Agents echoing what others said instead of contributing new information
- Ignoring stop orders when peers are still talking
- Going into ACK loops (two agents endlessly acknowledging each other)
- Compounding guardrails (each agent adds safety caveats to the previous agent's output)
- Mixing up who said what in a multi-party conversation
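To make two of these failure modes concrete, here is a minimal sketch of how an ACK loop and an echo could be flagged in a room transcript. This is not how OMATS is implemented; the `Message` type, the acknowledgement phrase list, and the Jaccard-overlap heuristic are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Message:
    """One turn in a shared room (hypothetical structure)."""
    speaker: str
    text: str


# Assumed list of acknowledgement phrases; a real suite would need a broader one.
ACK_PATTERN = re.compile(
    r"^\s*(ok(ay)?|ack(nowledged)?|got it|thanks|thank you|understood|noted)\b[\s.!,]*$",
    re.IGNORECASE,
)


def is_ack(text: str) -> bool:
    """True if the message consists of nothing but an acknowledgement."""
    return bool(ACK_PATTERN.match(text))


def longest_ack_run(transcript: list[Message]) -> int:
    """Length of the longest run of consecutive acknowledgement-only messages."""
    longest = current = 0
    for msg in transcript:
        current = current + 1 if is_ack(msg.text) else 0
        longest = max(longest, current)
    return longest


def token_set(text: str) -> set[str]:
    """Lowercased word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def echo_score(message: Message, earlier: list[Message]) -> float:
    """Highest Jaccard overlap between `message` and any earlier message
    written by a different speaker (1.0 means an exact restatement)."""
    tokens = token_set(message.text)
    best = 0.0
    for prev in earlier:
        if prev.speaker == message.speaker:
            continue  # repeating yourself is not an echo of a peer
        other = token_set(prev.text)
        union = tokens | other
        if union:
            best = max(best, len(tokens & other) / len(union))
    return best
```

A long `longest_ack_run` suggests an ACK loop, and an `echo_score` near 1.0 suggests an agent restating a peer. Both are lexical heuristics, so paraphrased echoes would need a semantic similarity measure instead.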

We've done preliminary testing of 12 models, scoring each on a continuous 0.0-1.0 scale across three stages (agent discipline, multi-agent communication, agent management). These runs used direct LLM calls that simulate OpenClaw rather than real OpenClaw agents. Next we will run the tests with OpenClaw agents.
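As a rough illustration of the scoring shape described above, the sketch below rolls per-scenario scores up into per-stage and overall scores. The three stage names come from this post; the unweighted mean, the function names, and the `(stage, score)` input format are assumptions, since the post does not say how OMATS aggregates.

```python
from collections import defaultdict
from statistics import mean

# Stage names as listed in this post.
STAGES = ("agent discipline", "multi-agent communication", "agent management")


def stage_scores(results: list[tuple[str, float]]) -> dict[str, float]:
    """Average the per-scenario scores within each stage.

    `results` is a list of (stage, score) pairs, with score in [0.0, 1.0].
    Unweighted averaging is an assumption made for this sketch.
    """
    by_stage: dict[str, list[float]] = defaultdict(list)
    for stage, score in results:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage!r}")
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        by_stage[stage].append(score)
    return {stage: mean(scores) for stage, scores in by_stage.items()}


def overall_score(results: list[tuple[str, float]]) -> float:
    """Mean of the per-stage means, so a stage with more scenarios
    does not dominate the total (again, an assumed design choice)."""
    return mean(list(stage_scores(results).values()))
```

Averaging stage means rather than all scenarios at once keeps each stage equally weighted. Whether that matches the intent of the suite is an open question.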

Multi-agent evaluation could be an interesting addition to deepeval's framework, especially as more teams deploy multiple LLM agents working together.
