
Agent evaluation methodology #23

@leothomas

Description


Feature Type

New functionality

Problem Statement

Evaluation is a key pain point for engineers developing LLM-powered agentic workflows/frameworks. It arises in a few key areas, namely:

  • Quantifying the agent’s capacity to produce useful output, in order to more objectively evaluate proposed code modifications (eg: How do we know that updates to the system prompt actually made the responses “better”, other than by “feeling”?)
  • Creating a set of desired functionality such that proposed code modifications to enable one use case do not cause regressions in others (eg: How can we be sure that the addition of a new tool doesn’t somehow confuse the model when working through a use case it used to handle well?)

The intent is to make it easy and quick to create and execute an evaluation protocol that continuously monitors the quality of the agent’s output and guards against regressions as the agent is developed and used.

Several challenges arise in creating such a suite of tests, mostly relating to the fact that agentic inputs and outputs are “squishy”: a single use case may be represented by a large variety of inputs, which may all differ slightly in formulation, structure, or context, or might include typos. For example, the three following questions are all essentially synonymous formulations of the same user intent (and could be answered with identical, or very similar, agentic workflows):

  • “How much damage did Hurricane Milton cause?”
  • “Quantify the economic impacts of Hurrican Milton on Florida’s public infrastructure and residences”
  • “How much damage did hurricane Martin cause in Florida on October 11, 2024?”

Similarly, agentic outputs are “squishy”. The three questions above could be answered with a single dollar amount, with a paragraph summarizing several different sources, or with a suite of charts. Our evaluation protocol must be able to 1) identify objectively correct and incorrect outputs (eg: flagging a poem about the hurricane as incorrect) and 2) score (or at least rank) multiple similar outputs.

This evaluation protocol should also be able to handle and evaluate intermediate steps occurring within human-in-the-loop interactions, ignoring effects directly caused by the user’s input (eg: if the user makes a mistake which temporarily “confuses” the agent, any intermediate errors should not be interpreted as mistakes on the agent’s part).

Why is this important:

Beyond the developer-experience aspects listed above (the ability to continuously monitor the quality of the agent and guard against regression), such an evaluation protocol will help make a more generalizable tool. When developing, engineers tend to repeat a select few test questions, which can lead to “overfitting” to those specific use cases/formulations (eg: what happens when we ask about an earthquake? Or ask for damages relating to endangered species’ habitats?).

Lastly, an evaluation protocol makes it possible to compare agents or agentic approaches against one another, and is a key element of any subsequent publication.

Proposed Solution

LLM-as-judge is likely the easiest initial implementation:

  1. Generate a small QA set by working through a handful of representative workflows, and validate the agent output
  2. Use an LLM to generate a large number of perturbations of each QA input
  3. Run the agent on each perturbed input and evaluate the result with an LLM instructed to output a binary assessment of whether or not the output correctly addresses the input
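
In practice, steps 2 and 3 could look something like the sketch below. This is only a minimal sketch assuming the OpenAI Python SDK (any chat-completion client would work); `run_agent` is a hypothetical entry point into the agent under test, and the model names and prompts are placeholders.

```python
# Minimal sketch of steps 2-3: perturb each seed question with an LLM,
# then have a judge LLM emit a binary pass/fail for the agent's answer.
from openai import OpenAI

client = OpenAI()

def perturb(question: str, n: int = 5) -> list[str]:
    """Ask an LLM for n rephrasings of a seed question (wording, structure, typos)."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{
            "role": "user",
            "content": (
                f"Rewrite the question below in {n} different ways, one per line. "
                f"Vary the wording and structure; occasionally include a typo.\n\n{question}"
            ),
        }],
    )
    return [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]

def judge(question: str, reference_answer: str, agent_answer: str) -> bool:
    """Binary LLM-as-judge: does the agent's answer correctly address the question?"""
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=[{
            "role": "user",
            "content": (
                "You are grading an agent's answer to a question.\n"
                f"Question: {question}\n"
                f"Reference answer: {reference_answer}\n"
                f"Agent answer: {agent_answer}\n"
                "Reply with exactly PASS or FAIL."
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")

# Usage against one (seed_question, reference) pair from the QA set,
# where run_agent() is the (hypothetical) agent entry point:
# results = {p: judge(p, reference, run_agent(p)) for p in perturb(seed_question)}
```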

Further improvements to the evaluation metrics include:

  • Prompt the LLM to evaluate not just the output but also the intermediate reasoning steps
  • Store the initial output set as the current best results; for each subsequent output, prompt an LLM to choose whether the new output is better than the current best result. If yes, consider the test passed and store the new output as the best output
  • Assert that all tool calls from the perturbed input match the tool calls from the original input (matching can be order-dependent or order-independent)
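
As a rough illustration of the last point, a sketch of the tool-call comparison is below. `ToolCall` is a hypothetical trace record to be adapted to whatever format the agent framework actually produces, and matching on tool names only (ignoring arguments) is an assumption made to keep the check robust to paraphrased inputs.

```python
# Rough sketch: compare the tool calls recorded for a perturbed input
# against those recorded for the original input.
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    # Hypothetical trace record; arguments are deliberately ignored, since
    # paraphrased inputs will often produce slightly different arguments.
    name: str

def tool_calls_match(
    expected: list[ToolCall], actual: list[ToolCall], order_sensitive: bool = False
) -> bool:
    expected_names = [c.name for c in expected]
    actual_names = [c.name for c in actual]
    if order_sensitive:
        # Order-dependent: the exact same sequence of tool calls.
        return expected_names == actual_names
    # Order-independent: the same tools called the same number of times.
    return Counter(expected_names) == Counter(actual_names)
```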

Alternative Solutions

No response

User Benefits

  • Reduced regressions (very common in my experience when working on prompting, etc)
  • Improved ability to quantitatively assess proposed code modifications
  • Ability to benchmark/compare to existing tools
  • A wide swath of test cases will improve the agent's generalizability across a variety of use cases (and avoid developer overfitting to a small set of test/sample use cases)

Implementation Ideas

No response

Contribution

  • I'm willing to submit a PR for this feature
  • I'm willing to test this feature
  • I'm willing to help document this feature

Additional Context

Prior art:

  • High level survey on evaluation of LLM-based Agents: https://arxiv.org/pdf/2503.16416
  • Evaluation Framework for LLM powered conversational agents: https://aisel.aisnet.org/cgi/viewcontent.cgi?article=1013&context=pacis2024
  • Metamorphic Multi-turn Testing for LLM-based Dialogue Systems (MORTAR): https://arxiv.org/pdf/2412.15557
    • Creates a knowledge graph of an existing agent <> user interaction, which it then uses to evaluate the answers to systematically perturbed versions of the original input (addresses the “squishiness” of inputs and outputs)
  • Agent state modifications: https://katherine-munro.com/p/evaluating-llm-systems-with-realistic-data-at-scale
    • The approach is very similar to the one presented above: it relies on extracting a variety of agent states from previous workflow executions and creating a set of perturbations to the state at various points within the agent execution. Depending on how and where the state is modified, the tests/evaluation can more precisely evaluate certain aspects of the agent’s execution, such as tool selection, routing, error handling, etc.
  • LLM-as-judge: https://arxiv.org/pdf/2411.15594
    • Use an LLM to evaluate the agent’s output, given a specific input, in a few different ways:
      • on a numerical scale (including a binary scale)
      • using a series of true/false questions about the output
      • selecting the best output from 2 or more outputs (the paper notes this most closely resembles the way humans would evaluate outputs; see the sketch after this list)
  • Simulating multi-turn interactions using LLMs: https://arxiv.org/pdf/2502.04349
    • This takes the above methodology even further: it uses an LLM to simulate target users and has that LLM interact with the agent. This method should be constrained to very specific use cases; otherwise the evaluation blurs the distinction between the agent being evaluated and the LLM used for the evaluation.
  • From previous projects: one previous attempt at setting up unit-test-style tests was to establish a set of test prompts and examine the agent’s output for one (or several) specific tool calls. This works well to ascertain that the model can understand user intention and select the appropriate tool; however, it doesn’t evaluate a variety of inputs, nor does it test the model’s capacity to reassemble the output from multiple tools into a response.
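
To make the pairwise judging mode (and the “current best output” scheme from the Proposed Solution) concrete, a rough sketch follows, again assuming an OpenAI-compatible client with placeholder prompts. A production version should also swap the A/B positions between calls to control for the judge’s position bias.

```python
# Rough sketch of a pairwise LLM-as-judge comparison between a stored
# "current best" output and a newly produced candidate output.
from openai import OpenAI

client = OpenAI()

def prefer_candidate(question: str, current_best: str, candidate: str) -> bool:
    """Return True if the judge prefers the candidate over the stored best output."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n\n"
                f"Answer A:\n{current_best}\n\n"
                f"Answer B:\n{candidate}\n\n"
                "Which answer addresses the question better? Reply with exactly A or B."
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("B")

# If the candidate wins, the test passes and the candidate becomes the new stored best.
```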

Existing frameworks/libraries:

  • AgentBench: https://github.com/THUDM/AgentBench
    • LLM agent benchmarking dataset and library, revolving around a set of key domains (eg: Digital Card Game, Lateral Thinking Puzzles, Web Shopping or Web Browsing)
  • DeepEval: https://github.com/confident-ai/deepeval
    • Evaluation framework that provides access to a variety of metrics and evaluation strategies, such as evaluating tool calls, with or without tool input/output evaluation, order independence, or error catching/retrying, etc.
    • Also provides red-teaming functionality (SQL injection, etc)
    • Integrates with Pytest (a rough usage sketch follows this list)
  • AgentQuest: https://arxiv.org/pdf/2404.06411
    • Enables additional evaluation metrics, such as progress rate (how quickly the agent is progressing towards the answer/completion) and repetition rate (how often the agent gets stuck in loops)
  • Agent-as-judge: https://github.com/metauto-ai/agent-as-a-judge
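
For reference, here is a rough sketch of what a Pytest-integrated DeepEval test could look like, based on its documented LLMTestCase/GEval pattern (the API evolves, so verify against the current docs). `run_agent` and `SEED_QA` are hypothetical placeholders for the agent entry point and the hand-validated seed QA set.

```python
import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

from my_agent import run_agent  # hypothetical entry point into the agent under test

# Hypothetical seed QA set: (question, hand-validated expected answer) pairs.
SEED_QA = [
    ("How much damage did Hurricane Milton cause?", "<expected answer validated by hand>"),
]

# LLM-as-judge correctness metric, graded against the expected output.
correctness = GEval(
    name="Correctness",
    criteria=(
        "Determine whether the actual output correctly answers the input question, "
        "consistent with the expected output."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

@pytest.mark.parametrize("question,expected", SEED_QA)
def test_agent_answers_seed_questions(question, expected):
    test_case = LLMTestCase(
        input=question,
        actual_output=run_agent(question),
        expected_output=expected,
    )
    assert_test(test_case, [correctness])
```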
