This repository contains a collection of benchmarks and evaluators for agents built on top of Llama Stack.
```bash
git clone https://github.com/yanxi0830/llama-stack-evals.git
cd llama-stack-evals
pip install -e .
```
- 📓 Check out `notebooks/` for working examples of how to run benchmarks using Llama Stack.
```python
# Example evaluation setup
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.types.agent_create_params import AgentConfig

from llama_stack_evals.benchmarks.hotpotqa import HotpotQAEvaluator

# Set up the client pointing at your Llama Stack server
client = LlamaStackClient(...)

# Set up the agent configuration
agent_config = AgentConfig(...)

# Set up the evaluator
evaluator = HotpotQAEvaluator()

# Run the evaluation
results = evaluator.run(agent_config, client)
```
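For orientation, the elided arguments might be filled in roughly as in the sketch below. This is only illustrative: the base URL and model identifier are placeholders that depend on your Llama Stack deployment, and the exact `AgentConfig` fields depend on the `llama_stack_client` version you have installed.

```python
# Illustrative placeholders only; adjust to your own Llama Stack deployment.
from llama_stack_client import LlamaStackClient
from llama_stack_client.types.agent_create_params import AgentConfig

# Point the client at a running Llama Stack server (URL is an example).
client = LlamaStackClient(base_url="http://localhost:8321")

# A bare-bones agent configuration (model name and instructions are examples).
agent_config = AgentConfig(
    model="meta-llama/Llama-3.1-8B-Instruct",
    instructions="You are a helpful assistant. Answer questions concisely.",
)
```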
While Llama Stack makes it easy to build and deploy LLM agents, evaluating those agents comprehensively can be challenging. The current evaluation APIs have several limitations:
- Complex setup requirements for running evaluations (data preparation, defining scoring functions, benchmark registration)
- Difficulty in using private datasets that can't be shared with servers
- Limited flexibility in implementing custom scoring functions
- Challenges in evaluating agents with custom client tools
- Lack of streamlined solutions for running evaluations against popular benchmarks
`llama-stack-evals` addresses these challenges by providing a developer-friendly framework for evaluating applications built on top of Llama Stack.
### Simplified Evaluation Flow
- Reduce complex evaluation setups to just a few lines of code
- Intuitive APIs that feel natural to Python developers
- Built-in support for popular open benchmarks
### Flexibility & Customization
- Evaluate agents with custom client tools
- Use private datasets without uploading to servers
- Implement custom scoring functions easily (see the sketch after this list)
- Run component-level evaluations (e.g., retrieval tools)
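The scoring-function interface itself isn't spelled out in this section, so the sketch below is purely illustrative: a plain Python callable that computes exact-match accuracy, which you could adapt to whatever hook the evaluators expose. The function name and return format are assumptions, not part of the published API.

```python
# Hypothetical sketch of a custom scoring function; the signature and how it is
# registered with an evaluator are assumptions, not llama-stack-evals' API.
from typing import Dict, List


def exact_match_score(predictions: List[str], references: List[str]) -> Dict[str, float]:
    """Score 1.0 per example when prediction and reference match (case/whitespace-insensitive)."""
    if not predictions:
        return {"exact_match": 0.0}
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return {"exact_match": hits / len(predictions)}


# Toy usage on a two-example batch.
print(exact_match_score(["Paris", "42"], ["paris", "43"]))  # {'exact_match': 0.5}
```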
### Out-of-the-Box Benchmark Examples
- Model Evaluation: report lightweight benchmark numbers on any Llama Stack SDK-compatible endpoint
- Agent Evaluation:
  - End-to-end (E2E) evaluation via `AgentConfig` (e.g., HotpotQA)
  - Component-level evaluation (e.g., BEIR for retrieval; see the recall@k sketch after this list)
- Complex Simulation Support:
  - Tau-Bench (with simulated users)
  - CRAG
  - RAG evaluations with vector DB integration
- Not a Recipe Repository: Provide structured, maintained evaluation tools rather than ad-hoc evaluation scripts
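To make the component-level idea concrete, the sketch below computes recall@k for a retrieval tool from plain Python data. It is generic illustration, not the BEIR evaluator shipped with this repo (whose interface isn't shown here); all names in it are placeholders.

```python
# Illustrative recall@k for a retrieval component; generic Python, not the
# BEIR evaluator bundled with llama-stack-evals.
from typing import Dict, List, Set


def recall_at_k(retrieved: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int) -> float:
    """Average fraction of relevant documents found in the top-k results per query."""
    scores = []
    for query_id, docs in retrieved.items():
        gold = relevant.get(query_id, set())
        if not gold:
            continue
        hits = len(set(docs[:k]) & gold)
        scores.append(hits / len(gold))
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: two queries, top-2 retrieval.
retrieved = {"q1": ["d1", "d3"], "q2": ["d9", "d2"]}
relevant = {"q1": {"d1"}, "q2": {"d2", "d5"}}
print(recall_at_k(retrieved, relevant, k=2))  # (1.0 + 0.5) / 2 = 0.75
```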
✨ We welcome Pull Requests with improvements or suggestions.
🐛 If you want to flag an issue or propose an improvement but aren't sure how to implement it, please open a GitHub Issue.