
Llama Stack Evals

This repository contains a collection of benchmarks and evaluators for agents built on top of Llama Stack.

🔧 Installation

git clone https://github.com/yanxi0830/llama-stack-evals.git
cd llama-stack-evals
pip install -e .

🚀 Usage

  • 📓 Check out notebooks/ for working examples of how to run benchmarks with Llama Stack.
# Example evaluation setup
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.types.agent_create_params import AgentConfig

from llama_stack_evals.benchmarks.hotpotqa import HotpotQAEvaluator

# Set up the client, pointing at your Llama Stack endpoint
client = LlamaStackClient(...)
# Set up the agent configuration
agent_config = AgentConfig(...)
# Set up the evaluator for the chosen benchmark
evaluator = HotpotQAEvaluator()
# Run the evaluation
results = evaluator.run(agent_config, client)

Why llama-stack-evals?

While llama-stack makes it easy to build and deploy LLM agents, evaluating these agents comprehensively can be challenging. The current evaluation APIs have several limitations:

  • Complex setup requirements for running evaluations (data preparation, defining scoring functions, benchmark registration)
  • Difficulty in using private datasets that can't be shared with servers
  • Limited flexibility in implementing custom scoring functions
  • Challenges in evaluating agents with custom client tools
  • Lack of streamlined solutions for running evaluations against popular benchmarks

llama-stack-evals addresses these challenges by providing a developer-friendly framework for evaluating applications built on top of Llama Stack.

What This Library Offers

Simplified Evaluation Flow

  • Reduce complex evaluation setups to just a few lines of code
  • Intuitive APIs that feel natural to Python developers
  • Built-in support for popular open benchmarks

Flexibility & Customization

  • Evaluate agents with custom client tools
  • Use private datasets without uploading them to a server
  • Implement custom scoring functions easily (see the sketch after this list)
  • Run component-level evaluations (e.g., retrieval tools)
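
To make the scoring and private-dataset points concrete, here is a minimal, self-contained sketch. It assumes nothing about the llama-stack-evals scorer interface: the exact_match function, the JSONL loader, and the (prediction, reference) signature are illustrative stand-ins, not the library's documented API.

# Hypothetical custom scorer and private-dataset loader (illustrative only;
# the function signatures are assumptions, not the llama-stack-evals API).
import json
from typing import Iterator


def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 when the normalized prediction matches the reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def load_private_dataset(path: str) -> Iterator[dict]:
    """Stream a private JSONL dataset from local disk; rows are never uploaded."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)  # e.g. {"question": ..., "answer": ...}

Because scoring runs client-side against locally loaded rows, private data never has to leave the machine that drives the evaluation.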

Out-of-the-Box Benchmark Examples

  • Model Evaluation: Release lightweight benchmark numbers on ANY llama-stack SDK-compatible endpoint
  • Agent Evaluation:
    • E2E evaluation via AgentConfig (e.g., HotpotQA)
    • Component-level evaluation (e.g., BEIR for retrieval; a standalone metric sketch follows this list)
  • Complex Simulation Support:
    • Tau-Bench (with simulated users)
    • CRAG
    • RAG evaluations with vector DB integration
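
As a concrete picture of what a component-level retrieval metric looks like, the snippet below computes recall@k over retrieved document IDs. It is a standalone illustration and does not call the BEIR or llama-stack-evals APIs.

# Illustrative recall@k metric for component-level retrieval evaluation.
# Standalone sketch; not the BEIR or llama-stack-evals implementation.
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# Example: two of the three relevant docs show up in the top-10 retrieved set.
print(recall_at_k(["d1", "d7", "d3"], {"d1", "d3", "d9"}, k=10))  # -> 0.666...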

What This Library Is Not

  • Not a Recipe Repository: It provides structured, maintained evaluation tools rather than ad-hoc evaluation scripts.

🙌 Want to contribute?

✨ We welcome Pull Requests with improvements or suggestions.

🐛 If you want to flag an issue or propose an improvement but aren't sure how to implement it, open a GitHub Issue.
