| name | description |
|---|---|
| testing-agent | Run goal-based evaluation tests for agents. Use when you need to verify an agent meets its goals, debug failing tests, or iterate on agent improvements based on test results. |
This skill provides tools for testing agents built with the building-agents skill.
- `mcp__agent-builder__list_tests` - Check what tests exist
- `mcp__agent-builder__generate_constraint_tests` or `mcp__agent-builder__generate_success_tests` - Get test guidelines
- Write tests directly using the Write tool with the guidelines provided
- `mcp__agent-builder__run_tests` - Execute tests
- `mcp__agent-builder__debug_test` - Debug failures
The generate_*_tests MCP tools return guidelines and templates - they do NOT generate test code via LLM.
You (Claude) write the tests directly using the Write tool based on the guidelines.
# Step 1: Get test guidelines
result = mcp__agent-builder__generate_constraint_tests(
goal_id="my-goal",
goal_json='{"id": "...", "constraints": [...]}',
agent_path="exports/my_agent"
)
# Step 2: The result contains:
# - output_file: where to write tests
# - file_header: imports and fixtures to use
# - test_template: format for test functions
# - constraints_formatted: the constraints to test
# - test_guidelines: rules for writing tests
# Step 3: Write tests directly using the Write tool
Write(
file_path=result["output_file"],
content=result["file_header"] + test_code_you_write
)
# Step 4: Run tests via MCP tool
mcp__agent-builder__run_tests(
goal_id="my-goal",
agent_path="exports/my_agent"
)
# Step 5: Debug failures via MCP tool
mcp__agent-builder__debug_test(
goal_id="my-goal",
test_name="test_constraint_foo",
agent_path="exports/my_agent"
)

Run goal-based evaluation tests for agents built with the building-agents skill.
Key Principle: MCP tools provide guidelines, Claude writes tests directly
- ✅ Get guidelines: `generate_constraint_tests`, `generate_success_tests` → returns templates and guidelines
- ✅ Write tests: Use the Write tool with the provided file_header and test_template
- ✅ Run tests: `run_tests` (runs pytest via subprocess)
- ✅ Debug failures: `debug_test` (re-runs single test with verbose output)
- ✅ List tests: `list_tests` (scans Python test files)
- ✅ Tests stored in `exports/{agent}/tests/test_*.py`
exports/my_agent/
├── __init__.py
├── agent.py ← Agent to test
├── nodes/__init__.py
├── config.py
├── __main__.py
└── tests/ ← Test files written by MCP tools
├── conftest.py # Shared fixtures (auto-created)
├── test_constraints.py
├── test_success_criteria.py
└── test_edge_cases.py
Tests import the agent directly:
import pytest
from exports.my_agent import default_agent
@pytest.mark.asyncio
async def test_happy_path(mock_mode):
result = await default_agent.run({"query": "test"}, mock_mode=mock_mode)
assert result.success
assert len(result.output) > 0

- MCP tools provide consistent test guidelines with proper imports, fixtures, and API key enforcement
- Claude writes tests directly, eliminating circular LLM dependencies in the MCP server
- `run_tests` parses pytest output into structured results for iteration
- `debug_test` provides formatted output with actionable debugging info
- File headers include conftest.py setup with proper fixtures
1. Check existing tests - `list_tests(goal_id, agent_path)`
2. Get test guidelines - `generate_constraint_tests` or `generate_success_tests`
3. Write tests - Use the Write tool with the provided file_header and guidelines
4. Run tests - `run_tests(goal_id, agent_path)`
5. Debug failures - `debug_test(goal_id, test_name, agent_path)`
6. Iterate - Repeat steps 4-5 until all pass
CRITICAL: Testing requires ALL credentials the agent depends on. This includes both the LLM API key AND any tool-specific credentials (HubSpot, Brave Search, etc.).
Before running agent tests, you MUST collect ALL required credentials from the user.
Step 1: LLM API Key (always required)
export ANTHROPIC_API_KEY="your-key-here"

Step 2: Tool-specific credentials (depends on agent's tools)
Inspect the agent's mcp_servers.json and tool configuration to determine which tools the agent uses, then check for all required credentials:
from aden_tools.credentials import CredentialManager, CREDENTIAL_SPECS
creds = CredentialManager()
# Determine which tools the agent uses (from agent.json or mcp_servers.json)
agent_tools = [...] # e.g., ["hubspot_search_contacts", "web_search", ...]
# Find all missing credentials for those tools
missing = creds.get_missing_for_tools(agent_tools)

Common tool credentials:
| Tool | Env Var | Help URL |
|---|---|---|
| HubSpot CRM | HUBSPOT_ACCESS_TOKEN | https://developers.hubspot.com/docs/api/private-apps |
| Brave Search | BRAVE_SEARCH_API_KEY | https://brave.com/search/api/ |
| Google Search | GOOGLE_SEARCH_API_KEY + GOOGLE_SEARCH_CX | https://developers.google.com/custom-search |
Why ALL credentials are required:
- Tests need to execute the agent's LLM nodes to validate behavior
- Tools with missing credentials will return error dicts instead of real data
- Mock mode bypasses everything, providing no confidence in real-world performance
- The `AgentRunner.run()` method validates credentials at startup and will fail fast if any are missing
Mock mode (--mock flag or mock_mode=True) is ONLY for structure validation:
✓ Validates graph structure (nodes, edges, connections)
✓ Tests that code doesn't crash on execution
✗ Does NOT test LLM message generation
✗ Does NOT test reasoning or decision-making quality
✗ Does NOT test constraint validation (length limits, format rules)
✗ Does NOT test real API integrations or tool use
✗ Does NOT test personalization or content quality
Bottom line: If you're testing whether an agent achieves its goal, you MUST use real credentials for ALL services.
When generating tests, ALWAYS include credential checks for ALL required services:
import os
import pytest
from aden_tools.credentials import CredentialManager
# At the top of every test file
pytestmark = pytest.mark.skipif(
not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
reason="API key required for real testing. Set ANTHROPIC_API_KEY or use MOCK_MODE=1 for structure validation only."
)
@pytest.fixture(scope="session", autouse=True)
def check_credentials():
"""Ensure ALL required credentials are set for real testing."""
creds = CredentialManager()
mock_mode = os.environ.get("MOCK_MODE")
# Always check LLM key
if not creds.is_available("anthropic"):
if mock_mode:
print("\n⚠️ Running in MOCK MODE - structure validation only")
print(" This does NOT test LLM behavior or agent quality")
print(" Set ANTHROPIC_API_KEY for real testing\n")
else:
pytest.fail(
"\n❌ ANTHROPIC_API_KEY not set!\n\n"
"Real testing requires an API key. Choose one:\n"
"1. Set API key (RECOMMENDED):\n"
" export ANTHROPIC_API_KEY='your-key-here'\n"
"2. Run structure validation only:\n"
" MOCK_MODE=1 pytest exports/{agent}/tests/\n\n"
"Note: Mock mode does NOT validate agent behavior or quality."
)
# Check tool-specific credentials (skip in mock mode)
if not mock_mode:
# List the tools this agent uses - update per agent
agent_tools = [] # e.g., ["hubspot_search_contacts", "hubspot_get_contact"]
missing = creds.get_missing_for_tools(agent_tools)
if missing:
lines = ["\n❌ Missing tool credentials!\n"]
for name in missing:
spec = creds.specs.get(name)
if spec:
lines.append(f" {spec.env_var} - {spec.description}")
if spec.help_url:
lines.append(f" Setup: {spec.help_url}")
lines.append("\nSet the required environment variables and re-run.")
pytest.fail("\n".join(lines))

When the user asks to test an agent, ALWAYS check for ALL credentials first — not just the LLM key:
- Identify the agent's tools from `agent.json` or `mcp_servers.json`
- Check ALL required credentials using `CredentialManager`
- Ask the user to provide any missing credentials before proceeding
from aden_tools.credentials import CredentialManager, CREDENTIAL_SPECS
creds = CredentialManager()
# 1. Check LLM key
missing_creds = []
if not creds.is_available("anthropic"):
missing_creds.append(("ANTHROPIC_API_KEY", "Anthropic API key for LLM calls"))
# 2. Check tool-specific credentials
agent_tools = [...] # Determined from agent config
missing_tools = creds.get_missing_for_tools(agent_tools)
for name in missing_tools:
spec = CREDENTIAL_SPECS.get(name)
if spec:
missing_creds.append((spec.env_var, spec.description))
# 3. Present ALL missing credentials to the user at once
if missing_creds:
print("⚠️ Missing credentials required by this agent:\n")
for env_var, description in missing_creds:
print(f" • {env_var} — {description}")
print()
print("Please set the missing environment variables:")
for env_var, _ in missing_creds:
print(f" export {env_var}='your-value-here'")
print()
print("Or run in mock mode (structure validation only):")
print(" MOCK_MODE=1 pytest exports/{agent}/tests/")
# Ask user to provide credentials or choose mock mode
AskUserQuestion(...)

IMPORTANT: Do NOT skip credential collection. If an agent uses HubSpot tools, the user MUST provide HUBSPOT_ACCESS_TOKEN. If it uses web search, the user MUST provide the appropriate search API key. Collect ALL missing credentials in a single prompt rather than discovering them one at a time during test failures.
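One way to phrase that single prompt, as a minimal sketch only — the wording and option labels below are illustrative, not part of this skill:

```python
# Illustrative sketch: surface every missing credential in one AskUserQuestion call.
# `missing_creds` is the list of (env_var, description) pairs built above.
AskUserQuestion(
    questions=[{
        "question": (
            "This agent needs credentials that are not set: "
            + ", ".join(env_var for env_var, _ in missing_creds)
            + ". How would you like to proceed?"
        ),
        "header": "Missing credentials",
        "options": [
            {
                "label": "I'll set them now",
                "description": "Export the environment variables above, then re-run the tests",
            },
            {
                "label": "Run in mock mode",
                "description": "Structure validation only - does NOT test agent behavior or quality",
            },
        ],
        "multiSelect": False,
    }]
)
```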
┌─────────────────────────────────────────────────────────────────────────┐
│ GOAL STAGE │
│ (building-agents skill) │
│ │
│ 1. User defines goal with success_criteria and constraints │
│ 2. Goal written to agent.py immediately │
│ 3. Generate CONSTRAINT TESTS → Write to tests/ → USER APPROVAL │
│ Files created: exports/{agent}/tests/test_constraints.py │
└─────────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────────┐
│ AGENT STAGE │
│ (building-agents skill) │
│ │
│ Build nodes + edges, written immediately to files │
│ Constraint tests can run during development: │
│ run_tests(goal_id, agent_path, test_types='["constraint"]') │
└─────────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────────┐
│ EVAL STAGE (this skill) │
│ │
│ 1. Generate SUCCESS_CRITERIA TESTS → Write to tests/ → USER APPROVAL │
│ Files created: exports/{agent}/tests/test_success_criteria.py │
│ 2. Run all tests: run_tests(goal_id, agent_path) │
│ 3. On failure → debug_test(goal_id, test_name, agent_path) │
│ 4. Iterate: Edit agent code → Re-run run_tests (instant feedback) │
└─────────────────────────────────────────────────────────────────────────┘
ALWAYS check first before generating new tests:
mcp__agent-builder__list_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent"
)

This shows what test files already exist. If tests exist:
- Review the list to see what's covered
- Ask user if they want to add more or run existing tests
After goal is defined, get test guidelines using the MCP tool:
# First, read the goal from agent.py to get the goal JSON
goal_code = Read(file_path="exports/your_agent/agent.py")
# Extract the goal definition and convert to JSON
# Get constraint test guidelines via MCP tool
result = mcp__agent-builder__generate_constraint_tests(
goal_id="your-goal-id",
goal_json='{"id": "goal-id", "name": "...", "constraints": [...]}',
agent_path="exports/your_agent"
)

Response includes:
- `output_file`: Where to write tests (e.g., `exports/your_agent/tests/test_constraints.py`)
- `file_header`: Imports, fixtures, and pytest setup to use at the top of the file
- `test_template`: Format for test functions
- `constraints_formatted`: The constraints to test
- `test_guidelines`: Rules and best practices for writing tests
- `instruction`: How to proceed
Write tests directly using the provided guidelines:
# Write tests using the Write tool
Write(
file_path=result["output_file"],
content=result["file_header"] + "\n\n" + your_test_code
)

After agent is fully built, get success criteria test guidelines:
# Get success criteria test guidelines via MCP tool
result = mcp__agent-builder__generate_success_tests(
goal_id="your-goal-id",
goal_json='{"id": "goal-id", "name": "...", "success_criteria": [...]}',
node_names="analyze_request,search_web,format_results",
tool_names="web_search,web_scrape",
agent_path="exports/your_agent"
)

Write tests directly using the provided guidelines:
# Write tests using the Write tool
Write(
file_path=result["output_file"],
content=result["file_header"] + "\n\n" + your_test_code
)

The file_header returned by the MCP tools includes proper imports and fixtures.
You should also create a conftest.py file in the tests directory with shared fixtures:
# Create conftest.py with the conftest template
Write(
file_path="exports/your_agent/tests/conftest.py",
content=conftest_content # Use PYTEST_CONFTEST_TEMPLATE format
)
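The authoritative conftest content comes from the PYTEST_CONFTEST_TEMPLATE in the MCP tool guidelines. As a rough sketch of what it needs to provide — chiefly the `mock_mode` fixture used by every test in this skill, assuming MOCK_MODE is the controlling environment variable:

```python
# Sketch only - prefer the conftest template returned by the MCP tools.
import os

import pytest


@pytest.fixture
def mock_mode() -> bool:
    """True when MOCK_MODE is set: structure validation only, no real LLM calls."""
    return bool(os.environ.get("MOCK_MODE"))
```

Combine this with the `check_credentials` session fixture shown in the credential section above.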
Use the MCP tool to run tests (not pytest directly):

mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent"
)
**Response includes structured results:**
```json
{
"goal_id": "your-goal-id",
"overall_passed": false,
"summary": {
"total": 12,
"passed": 10,
"failed": 2,
"skipped": 0,
"errors": 0,
"pass_rate": "83.3%"
},
"test_results": [
{"file": "test_constraints.py", "test_name": "test_constraint_api_rate_limits", "status": "passed"},
{"file": "test_success_criteria.py", "test_name": "test_success_find_relevant_results", "status": "failed"}
],
"failures": [
{"test_name": "test_success_find_relevant_results", "details": "AssertionError: Expected 3-5 results..."}
]
}
```

Options for `run_tests`:
# Run only constraint tests
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
test_types='["constraint"]'
)
# Run with parallel workers
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
parallel=4
)
# Stop on first failure
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
fail_fast=True
)

Use the MCP tool to debug (not Bash/pytest directly):
mcp__agent-builder__debug_test(
goal_id="your-goal-id",
test_name="test_success_find_relevant_results",
agent_path="exports/your_agent"
)

Response includes (see the sketch after this list):
- Full verbose output from the test
- Stack trace with exact line numbers
- Captured logs and prints
- Suggestions for fixing the issue
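The exact response schema is not documented here; as a rough illustration only — these field names are assumptions, not the tool's documented output:

```python
# Hypothetical shape of a debug_test response; field names are illustrative only.
debug_result = {
    "test_name": "test_success_find_relevant_results",
    "status": "failed",
    "verbose_output": "... full pytest -vvs output ...",
    "stack_trace": "... traceback with exact line numbers ...",
    "captured_logs": ["... prints and log records ..."],
    "suggestions": ["... hints for fixing the failure ..."],
}
```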
When a test fails, categorize the error to guide iteration:
def categorize_test_failure(test_output, agent_code):
"""Categorize test failure to guide iteration."""
# Read test output and agent code
failure_info = {
"test_name": "...",
"error_message": "...",
"stack_trace": "...",
}
# Pattern-based categorization
if any(pattern in failure_info["error_message"].lower() for pattern in [
"typeerror", "attributeerror", "keyerror", "valueerror",
"null", "none", "undefined", "tool call failed"
]):
category = "IMPLEMENTATION_ERROR"
guidance = {
"stage": "Agent",
"action": "Fix the bug in agent code",
"files_to_edit": ["agent.py", "nodes/__init__.py"],
"restart_required": False,
"description": "Code bug - fix and re-run tests"
}
elif any(pattern in failure_info["error_message"].lower() for pattern in [
"assertion", "expected", "got", "should be", "success criteria"
]):
category = "LOGIC_ERROR"
guidance = {
"stage": "Goal",
"action": "Update goal definition",
"files_to_edit": ["agent.py (goal section)"],
"restart_required": True,
"description": "Goal definition is wrong - update and rebuild"
}
elif any(pattern in failure_info["error_message"].lower() for pattern in [
"timeout", "rate limit", "empty", "boundary", "edge case"
]):
category = "EDGE_CASE"
guidance = {
"stage": "Eval",
"action": "Add edge case test and fix handling",
"files_to_edit": ["agent.py", "tests/test_edge_cases.py"],
"restart_required": False,
"description": "New scenario - add test and handle it"
}
else:
category = "UNKNOWN"
guidance = {
"stage": "Unknown",
"action": "Manual investigation required",
"restart_required": False
}
return {
"category": category,
"guidance": guidance,
"failure_info": failure_info
}

Show categorization to user:
AskUserQuestion(
questions=[{
"question": f"Test failed with {category}. How would you like to proceed?",
"header": "Test Failure",
"options": [
{
"label": "Fix code directly (Recommended)" if category == "IMPLEMENTATION_ERROR" else "Update goal",
"description": guidance["description"]
},
{
"label": "Show detailed error info",
"description": "View full stack trace and logs"
},
{
"label": "Skip for now",
"description": "Continue with other tests"
}
],
"multiSelect": false
}]
)

For IMPLEMENTATION_ERROR (fix the code directly):

# 1. Show user the exact file and line that failed
print(f"Error in: exports/{agent_name}/nodes/__init__.py:42")
print(f"Issue: 'NoneType' object has no attribute 'get'")
# 2. Read the problematic code
code = Read(file_path=f"exports/{agent_name}/nodes/__init__.py")
# 3. User can fix directly, or you suggest a fix:
Edit(
file_path=f"exports/{agent_name}/nodes/__init__.py",
old_string="if results.get('videos'):",
new_string="if results and results.get('videos'):"
)
# 4. Re-run tests immediately (instant feedback!)
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path=f"exports/{agent_name}"
)

For LOGIC_ERROR (update the goal definition):

# 1. Show user the goal definition
goal_code = Read(file_path=f"exports/{agent_name}/agent.py")
# 2. Discuss what needs to change in success_criteria or constraints
# 3. Edit the goal
Edit(
file_path=f"exports/{agent_name}/agent.py",
old_string='target="3-5 videos"',
new_string='target="1-5 videos"' # More realistic
)
# 4. May need to regenerate agent nodes if goal changed significantly
# This requires going back to building-agents skill

For EDGE_CASE (add a test and handle the scenario):

# 1. Create new edge case test with API key enforcement
edge_case_test = '''
@pytest.mark.asyncio
async def test_edge_case_empty_results(mock_mode):
"""Test: Agent handles no results gracefully"""
result = await default_agent.run({"query": "xyzabc123nonsense"}, mock_mode=mock_mode)
# Should succeed with empty results, not crash
assert result.success or result.error is not None
if result.success:
assert result.output.get("message") == "No results found"
'''
# 2. Add to test file
Edit(
file_path=f"exports/{agent_name}/tests/test_edge_cases.py",
old_string="# Add edge case tests here",
new_string=edge_case_test
)
# 3. Fix agent to handle edge case
# Edit agent code to handle empty results
# 4. Re-run tests

Use the generate_constraint_tests and generate_success_tests MCP tools to create properly structured tests with correct imports and fixtures.
These templates show the structure of generated tests for reference only.
"""Constraint tests for {agent_name}.
These tests validate that the agent respects its defined constraints.
Requires ANTHROPIC_API_KEY for real testing.
"""
import os
import pytest
from exports.{agent_name} import default_agent
from aden_tools.credentials import CredentialManager
# Enforce API key for real testing
pytestmark = pytest.mark.skipif(
not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
reason="API key required. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
)
@pytest.mark.asyncio
async def test_constraint_{constraint_id}():
"""Test: {constraint_description}"""
# Test implementation based on constraint type
mock_mode = bool(os.environ.get("MOCK_MODE"))
result = await default_agent.run({{"test": "input"}}, mock_mode=mock_mode)
# Assert constraint is respected
assert True # Replace with actual check"""Success criteria tests for {agent_name}.
These tests validate that the agent achieves its defined success criteria.
Requires ANTHROPIC_API_KEY for real testing - mock mode cannot validate success criteria.
"""
import os
import pytest
from exports.{agent_name} import default_agent
from aden_tools.credentials import CredentialManager
# Enforce API key for real testing
pytestmark = pytest.mark.skipif(
not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
reason="API key required. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
)
@pytest.mark.asyncio
async def test_success_{criteria_id}():
"""Test: {criteria_description}"""
mock_mode = bool(os.environ.get("MOCK_MODE"))
result = await default_agent.run({{"test": "input"}}, mock_mode=mock_mode)
assert result.success, f"Agent failed: {{result.error}}"
# Verify success criterion met
# e.g., assert metric meets target
assert True # Replace with actual check"""Edge case tests for {agent_name}.
These tests validate agent behavior in unusual or boundary conditions.
Requires ANTHROPIC_API_KEY for real testing.
"""
import os
import pytest
from exports.{agent_name} import default_agent
from aden_tools.credentials import CredentialManager
# Enforce API key for real testing
pytestmark = pytest.mark.skipif(
not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
reason="API key required. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
)
@pytest.mark.asyncio
async def test_edge_case_{scenario_name}():
"""Test: Agent handles {scenario_description}"""
mock_mode = bool(os.environ.get("MOCK_MODE"))
result = await default_agent.run({{"edge": "case_input"}}, mock_mode=mock_mode)
# Verify graceful handling
assert result.success or result.error is not None

During agent construction (Agent stage), you can run constraint tests incrementally:
# After adding first node
print("Added search_node. Running relevant constraint tests...")
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path=f"exports/{agent_name}",
test_types='["constraint"]'
)
# After adding second node
print("Added filter_node. Running all constraint tests...")
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path=f"exports/{agent_name}",
test_types='["constraint"]'
)

This provides immediate feedback during development, catching issues early.
Note: All test patterns should include API key enforcement via conftest.py.
The framework now automatically validates and cleans node outputs using a fast LLM (Cerebras llama-3.3-70b) at edge traversal time. This prevents cascading failures from malformed output.
What OutputCleaner does:
- ✅ Validates output matches next node's input schema
- ✅ Detects JSON parsing trap (entire response in one key)
- ✅ Cleans malformed output automatically (~200-500ms, ~$0.001 per cleaning)
- ✅ Boosts success rates by 1.8-2.2x
Impact on tests: Tests should still use safe patterns because OutputCleaner may not catch all issues in test mode.
❌ UNSAFE (will cause test failures):
# Direct key access - can crash!
approval_decision = result.output["approval_decision"]
assert approval_decision == "APPROVED"
# Nested access without checks
category = result.output["analysis"]["category"]
# Assuming parsed JSON structure
for issue in result.output["compliance_issues"]:
...

✅ SAFE (correct patterns):
# 1. Safe dict access with .get()
output = result.output or {}
approval_decision = output.get("approval_decision", "UNKNOWN")
assert "APPROVED" in approval_decision or approval_decision == "APPROVED"
# 2. Type checking before operations
analysis = output.get("analysis", {})
if isinstance(analysis, dict):
category = analysis.get("category", "unknown")
# 3. Parse JSON from strings (the JSON parsing trap!)
import json
recommendation = output.get("recommendation", "{}")
if isinstance(recommendation, str):
try:
parsed = json.loads(recommendation)
if isinstance(parsed, dict):
approval = parsed.get("approval_decision", "UNKNOWN")
except json.JSONDecodeError:
approval = "UNKNOWN"
elif isinstance(recommendation, dict):
approval = recommendation.get("approval_decision", "UNKNOWN")
# 4. Safe iteration with type check
compliance_issues = output.get("compliance_issues", [])
if isinstance(compliance_issues, list):
for issue in compliance_issues:
...

Add to conftest.py:
import json
import re
def _parse_json_from_output(result, key):
"""Parse JSON from agent output (framework may store full LLM response as string)."""
response_text = result.output.get(key, "")
# Remove markdown code blocks if present
json_text = re.sub(r'```json\s*|\s*```', '', response_text).strip()
try:
return json.loads(json_text)
except (json.JSONDecodeError, AttributeError, TypeError):
return result.output.get(key)
def safe_get_nested(result, key_path, default=None):
"""Safely get nested value from result.output."""
output = result.output or {}
current = output
for key in key_path:
if isinstance(current, dict):
current = current.get(key)
elif isinstance(current, str):
try:
json_text = re.sub(r'```json\s*|\s*```', '', current).strip()
parsed = json.loads(json_text)
if isinstance(parsed, dict):
current = parsed.get(key)
else:
return default
except json.JSONDecodeError:
return default
else:
return default
return current if current is not None else default
# Make available in tests
pytest.parse_json_from_output = _parse_json_from_output
pytest.safe_get_nested = safe_get_nested

Usage in tests:
# Use helper to parse JSON safely
parsed = pytest.parse_json_from_output(result, "recommendation")
if isinstance(parsed, dict):
approval = parsed.get("approval_decision", "UNKNOWN")
# Safe nested access
risk_score = pytest.safe_get_nested(result, ["analysis", "risk_score"], default=0.0)

Generate 8-15 tests total, NOT 30+
- ✅ 2-3 tests per success criterion
- ✅ 1 happy path test
- ✅ 1 boundary/edge case test
- ✅ 1 error handling test (optional)
Why fewer tests?:
- Each test requires real LLM call (~3 seconds, costs money)
- 30 tests = 90 seconds, $0.30+ in costs
- 12 tests = 36 seconds, $0.12 in costs
- Focus on quality over quantity
result.success=True means NO exception, NOT goal achieved
# ❌ WRONG - assumes goal achieved
assert result.success
# ✅ RIGHT - check success AND output
assert result.success, f"Agent failed: {result.error}"
output = result.output or {}
approval = output.get("approval_decision")
assert approval == "APPROVED", f"Expected APPROVED, got {approval}"All ExecutionResult fields:
- `success: bool` - Execution completed without exception (NOT goal achieved!)
- `output: dict` - Complete memory snapshot (may contain raw strings)
- `error: str | None` - Error message if failed
- `steps_executed: int` - Number of nodes executed
- `total_tokens: int` - Cumulative token usage
- `total_latency_ms: int` - Total execution time
- `path: list[str]` - Node IDs traversed
- `paused_at: str | None` - Node ID if HITL pause occurred
- `session_state: dict` - State for resuming
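For example, a short sketch that checks a few of these fields beyond success/output — the node id and latency budget below are arbitrary placeholders, not values this skill prescribes:

```python
@pytest.mark.asyncio
async def test_execution_metadata(mock_mode):
    """Sketch: assert on execution metadata, not just success/output."""
    result = await default_agent.run({"query": "test"}, mock_mode=mock_mode)
    assert result.success, f"Agent failed: {result.error}"
    assert result.steps_executed >= 1          # at least one node executed
    assert "search_web" in result.path         # hypothetical node id - adjust per agent
    assert result.total_latency_ms < 30_000    # arbitrary example budget (30s)
```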
@pytest.mark.asyncio
async def test_happy_path(mock_mode):
"""Test normal successful execution"""
result = await default_agent.run({{"query": "python tutorials"}}, mock_mode=mock_mode)
assert result.success
assert len(result.output) > 0

@pytest.mark.asyncio
async def test_boundary_minimum(mock_mode):
"""Test at minimum threshold"""
result = await default_agent.run({{"query": "very specific niche topic"}}, mock_mode=mock_mode)
assert result.success
assert len(result.output.get("results", [])) >= 1

@pytest.mark.asyncio
async def test_error_handling(mock_mode):
"""Test graceful error handling"""
result = await default_agent.run({{"query": ""}}, mock_mode=mock_mode) # Invalid input
assert not result.success or result.output.get("error") is not None

@pytest.mark.asyncio
async def test_performance_latency(mock_mode):
"""Test response time is acceptable"""
import time
start = time.time()
result = await default_agent.run({{"query": "test"}}, mock_mode=mock_mode)
duration = time.time() - start
assert duration < 5.0, f"Took {{duration}}s, expected <5s"

| Scenario | From | To | Action |
|---|---|---|---|
| Agent built, ready to test | building-agents | testing-agent | Generate success tests |
| LOGIC_ERROR found | testing-agent | building-agents | Update goal, rebuild |
| IMPLEMENTATION_ERROR found | testing-agent | Direct fix | Edit agent files, re-run tests |
| EDGE_CASE found | testing-agent | testing-agent | Add edge case test |
| All tests pass | testing-agent | Done | Agent validated ✅ |
| Scenario | Old Approach | New Approach |
|---|---|---|
| Bug Fix | Rebuild via MCP tools (14 min) | Edit Python file, pytest (2 min) |
| Add Test | Generate via MCP, export (5 min) | Write test file directly (1 min) |
| Debug | Read subprocess logs | pdb, breakpoints, prints |
| Inspect | Limited visibility | Full Python introspection |
| Don't | Do Instead |
|---|---|
| ❌ Write tests without getting guidelines first | ✅ Use generate_*_tests to get proper file_header and guidelines |
| ❌ Run pytest via Bash | ✅ Use run_tests MCP tool for structured results |
| ❌ Debug tests with Bash pytest -vvs | ✅ Use debug_test MCP tool for formatted output |
| ❌ Check for tests with Glob | ✅ Use list_tests MCP tool |
| ❌ Skip the file_header from guidelines | ✅ Always include the file_header for proper imports and fixtures |
| Don't | Do Instead |
|---|---|
| ❌ Treat all failures the same | ✅ Use debug_test to categorize and iterate appropriately |
| ❌ Rebuild entire agent for small bugs | ✅ Edit code directly, re-run tests |
| ❌ Run tests without API key | ✅ Always set ANTHROPIC_API_KEY first |
| ❌ Write tests without understanding the constraints/criteria | ✅ Read the formatted constraints/criteria from guidelines |
1. Check existing tests: list_tests(goal_id, agent_path)
→ Scans exports/{agent}/tests/test_*.py
↓
2. Get test guidelines: generate_constraint_tests, generate_success_tests
→ Returns file_header, test_template, constraints/criteria, guidelines
↓
3. Write tests: Use Write tool with the provided guidelines
→ Write tests to exports/{agent}/tests/test_*.py
↓
4. Run tests: run_tests(goal_id, agent_path)
→ Executes: pytest exports/{agent}/tests/ -v
↓
5. Debug failures: debug_test(goal_id, test_name, agent_path)
→ Re-runs single test with verbose output
↓
6. Fix based on category:
- IMPLEMENTATION_ERROR → Edit agent code directly
- ASSERTION_FAILURE → Fix agent logic or update test
- IMPORT_ERROR → Check package structure
- API_ERROR → Check API keys and connectivity
↓
7. Re-run tests: run_tests(goal_id, agent_path)
↓
8. Repeat until all pass ✅
# Check existing tests (scans Python test files)
mcp__agent-builder__list_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent"
)
# Get constraint test guidelines (returns templates and guidelines, NOT generated tests)
mcp__agent-builder__generate_constraint_tests(
goal_id="your-goal-id",
goal_json='{"id": "...", "constraints": [...]}',
agent_path="exports/your_agent"
)
# Returns: output_file, file_header, test_template, constraints_formatted, test_guidelines
# Get success criteria test guidelines
mcp__agent-builder__generate_success_tests(
goal_id="your-goal-id",
goal_json='{"id": "...", "success_criteria": [...]}',
node_names="node1,node2",
tool_names="tool1,tool2",
agent_path="exports/your_agent"
)
# Returns: output_file, file_header, test_template, success_criteria_formatted, test_guidelines
# Run tests via pytest subprocess
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent"
)
# Debug a failed test (re-runs with verbose output)
mcp__agent-builder__debug_test(
goal_id="your-goal-id",
test_name="test_constraint_foo",
agent_path="exports/your_agent"
)

# Run only constraint tests
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
test_types='["constraint"]'
)
# Run only success criteria tests
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
test_types='["success"]'
)
# Run with pytest-xdist parallelism (requires pytest-xdist)
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
parallel=4
)
# Stop on first failure
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
fail_fast=True
)

You can also run tests directly with pytest (the MCP tools use pytest internally):
# Run all tests
pytest exports/your_agent/tests/ -v
# Run specific test file
pytest exports/your_agent/tests/test_constraints.py -v
# Run specific test
pytest exports/your_agent/tests/test_constraints.py::test_constraint_foo -vvs
# Run in mock mode (structure validation only)
MOCK_MODE=1 pytest exports/your_agent/tests/ -v

The MCP tools provide test guidelines, Claude writes the tests to Python files, and run_tests executes them via pytest.