Multi-agent evaluation as an LLM testing dimension #2540
ThinkOffApp
started this conversation in
General
deepeval covers single-model evaluation dimensions well (hallucination, bias, toxicity, etc.). We've been working on a dimension that doesn't fit neatly into single-model evaluation: how LLMs behave in multi-agent group settings.
OMATS (OpenClaw Multi-Agent Test Suite, https://github.com/ThinkOffApp/openclaw-multi-agent-test-suite) tests models across 28 scenarios in shared room environments. The failure modes are specific to group dynamics and don't show up in single-model evaluation.
We've run preliminary tests on 12 models, scoring each on a continuous 0.0-1.0 scale across three stages (agent discipline, multi-agent communication, agent management). These runs used direct LLM calls that simulate OpenClaw; the next step is to run the tests with actual OpenClaw agents.
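To give a rough idea of the scoring shape, here is a minimal sketch of how per-scenario scores on the 0.0-1.0 scale could be averaged per stage. This is not OMATS's actual code: the `ScenarioResult` type, the `stage_averages` helper, the scenario ids, and the use of a plain mean are all assumptions made for illustration, and the scores in the usage example are placeholders, not real results.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The three stages named in the post.
STAGES = ("agent discipline", "multi-agent communication", "agent management")


@dataclass(frozen=True)
class ScenarioResult:
    """One scored scenario (hypothetical structure, for illustration only)."""
    scenario: str  # hypothetical scenario id
    stage: str     # one of STAGES
    score: float   # continuous score in [0.0, 1.0]


def stage_averages(results):
    """Return the mean score per stage, for stages that have at least one result."""
    by_stage = defaultdict(list)
    for r in results:
        if r.stage not in STAGES:
            raise ValueError(f"unknown stage: {r.stage!r}")
        if not 0.0 <= r.score <= 1.0:
            raise ValueError(f"score out of range for {r.scenario!r}: {r.score}")
        by_stage[r.stage].append(r.score)
    # A plain mean is an assumption; a real suite might weight scenarios differently.
    return {stage: mean(scores) for stage, scores in by_stage.items()}


if __name__ == "__main__":
    # Placeholder scores, not measured data.
    demo = [
        ScenarioResult("scenario-a", "agent discipline", 0.5),
        ScenarioResult("scenario-b", "agent discipline", 1.0),
        ScenarioResult("scenario-c", "multi-agent communication", 0.25),
    ]
    print(stage_averages(demo))
```

Keeping the stage as a field on each result means the same list of results can feed either a per-stage breakdown or a single overall number, depending on what a framework wants to report.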
Multi-agent evaluation could be an interesting addition to deepeval's framework, especially as more teams deploy multiple LLM agents working together.