Adding Behavioral Tests for an Agent

This guide explains how to add behavioral tests for an agent that doesn't have test coverage yet.

Prerequisites

The agent is deployed and exposes the standard /chat/completions and /health endpoints
You know what tools the agent has (check its src/ directory for @tool decorators or tool definitions)
Test dependencies are installed: uv pip install -e ".[test]"

1. Create the Directory Structure

mkdir -p agents/<framework>/<agent>/tests/behavioral/fixtures

For example, the LangGraph ReAct agent tests live at:

agents/langgraph/react_agent/tests/behavioral/
├── conftest.py
├── test_tool_usage.py
├── test_response_quality.py
├── test_cost_latency.py
├── test_reliability.py
└── fixtures/
    └── golden_queries.yaml

2. Create conftest.py

The conftest defines fixtures specific to your agent. Because agent tests live under agents/ (a separate directory tree from tests/behavioral/), pytest's conftest discovery won't find the shared fixtures. You must define http_client and eval_config locally. At minimum you need:

agent_url — reads from a new env var specific to your agent
http_client — must be redefined locally (not inherited from tests/behavioral/conftest.py)
eval_config — resolves the path to tests/behavioral/configs/thresholds.yaml relative to the repo root (adjust .parents[N] based on your agent's directory depth)
known_tools — lists the tools in the agent's schema (used by hallucination detection)
agent_thresholds — pulls from the shared eval_config fixture
run_eval — overrides the root fixture to add MLflow trace enrichment

load_golden() helper: Import the shared loader from harness.fixtures and create a thin wrapper that binds fixtures_dir to Path(__file__).parent / "fixtures":

from pathlib import Path
from typing import Any

from harness.fixtures import load_golden as _load_golden_from

FIXTURES_DIR = Path(__file__).parent / "fixtures"

def load_golden(category: str | None = None) -> list[dict[str, Any]]:
    return _load_golden_from(FIXTURES_DIR, category)

See existing agent implementations for working examples:

agents/langgraph/react_agent/tests/behavioral/conftest.py
agents/vanilla_python/openai_responses_agent/tests/behavioral/conftest.py
agents/autogen/mcp_agent/tests/behavioral/conftest.py
agents/crewai/websearch_agent/tests/behavioral/conftest.py
agents/langgraph/agentic_rag/tests/behavioral/conftest.py

3. Add Thresholds

Add a section for your agent in tests/behavioral/configs/thresholds.yaml:

my_agent:
  tool_selection_accuracy: 0.85
  response_coherence_accuracy: 0.75
  max_latency_p95: 10.0
  pass_at_k: 8

4. Create Golden Queries

Create fixtures/golden_queries.yaml with test inputs for your agent:

queries:
  - query: "A question that should trigger tool_a"
    expected_tools: ["tool_a"]
    expected_elements: ["keyword_from_tool_output"]
    difficulty: easy
    category: factual

  - query: "Hello"
    expected_tools: []
    expected_elements: []
    difficulty: easy
    category: greeting

5. Write Test Files

There are four standard test files. Each one uses pytest markers to describe what is being tested, not which agent:

File	Marker	What it tests
`test_tool_usage.py`	`@pytest.mark.<agent>`	Correct tool selection, no hallucinated tools, valid args
`test_response_quality.py`	`@pytest.mark.<agent>`	Plan coherence, completeness, multi-tool synthesis
`test_cost_latency.py`	`@pytest.mark.<agent>`	Response time within threshold
`test_reliability.py`	`@pytest.mark.<agent>`, `@pytest.mark.slow`	pass@k consistency over repeated runs

See the existing implementations for reference:

agents/langgraph/react_agent/tests/behavioral/ (single tool: search)
agents/vanilla_python/openai_responses_agent/tests/behavioral/ (two tools: search_price, search_reviews)
agents/autogen/mcp_agent/tests/behavioral/ (two tools via MCP: add, sub)
agents/crewai/websearch_agent/tests/behavioral/ (single tool: Web Search)
agents/langgraph/agentic_rag/tests/behavioral/ (single tool: retriever)

6. Register the Agent Marker

Add your agent marker to pyproject.toml under the agent markers section:

markers = [
    # Agent markers
    "langgraph_react: ...",
    "vanilla_python: ...",
    "my_agent: My Agent description (tool_a + tool_b)",
    ...
]

7. Add the Env Var to the README

Add your agent's URL env var to the table in the Behavioral Tests section of the root README.md.

8. Verify

# Check tests are discovered
pytest agents/<framework>/<agent>/tests/behavioral/ --collect-only

# Run against a deployed agent
MY_AGENT_URL=https://my-agent.example.com pytest agents/<framework>/<agent>/tests/behavioral/ -v

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding Behavioral Tests for an Agent

Prerequisites

1. Create the Directory Structure

2. Create conftest.py

3. Add Thresholds

4. Create Golden Queries

5. Write Test Files

6. Register the Agent Marker

7. Add the Env Var to the README

8. Verify

FilesExpand file tree

adding-behavioral-tests.md

Latest commit

History

adding-behavioral-tests.md

File metadata and controls

Adding Behavioral Tests for an Agent

Prerequisites

1. Create the Directory Structure

2. Create conftest.py

3. Add Thresholds

4. Create Golden Queries

5. Write Test Files

6. Register the Agent Marker

7. Add the Env Var to the README

8. Verify