This directory contains tools and test data for evaluating AWS Transform agents and their capabilities.
The evaluation infrastructure consists of:
- Test Data Generator - Intelligent test case generation from teacher samples and source context
- Test Samples - Curated test cases for agent evaluation
- Evaluation Framework (Coming Soon) - Automated test execution and scoring
evaluation/
├── README.md # This file
├── requirements.txt # Python dependencies
├── test_data_generator/ # Intelligent test case generator
│ ├── README.md # Generator documentation
│ ├── ARCHITECTURE.md # Design decisions and rationale
│ ├── TEST_README.md # Testing guide
│ ├── cli.py # Command-line interface
│ ├── intelligent_generator.py # Main generation logic
│ ├── domain_analyzer.py # Domain understanding from samples
│ ├── context_loader.py # Source context loading strategies
│ ├── deduplicate_tests.py # Deduplication utilities
│ ├── example.py # Usage examples
│ ├── test_basic.py # Smoke tests
│ └── test_units.py # Unit test suite (22 tests)
├── test_samples/ # Sample test cases
│ └── onboarding_intermediate.json
└── generated_test_data/ # Generated tests (gitignored)
Purpose: Generate diverse, high-quality test cases for agent evaluation.
Key Features:
- Learns from teacher samples to understand domain patterns
- Analyzes source context (skills, code, documentation)
- Generates tests with controlled diversity
- Ensures complexity distribution (simple, medium, complex)
- Automatic deduplication
- Configurable loading strategies for different domains
Quick Start:
# Generate 20 test cases from source context only
python -m evaluation.test_data_generator.cli \
--source-context /path/to/agent/code/ \
--count 20 \
--output generated_tests/
# Generate with teacher samples + source context
python -m evaluation.test_data_generator.cli \
--teacher-samples evaluation/test_samples/ \
--source-context /path/to/agent/code/ \
--count 20 \
--output generated_tests/
# High diversity generation for edge cases
python -m evaluation.test_data_generator.cli \
--source-context /path/to/agent/code/ \
--count 10 \
--diversity 0.95 \
--output edge_cases/Requirements:
- Python 3.11+
- AWS credentials with Bedrock access
- boto3 installed
Documentation:
- Generator README - Usage guide
- Architecture - Design decisions
- Testing Guide - Running tests
Testing:
# Run smoke tests (no AWS required)
python3 evaluation/test_data_generator/test_basic.py
# Run full unit test suite
pytest evaluation/test_data_generator/test_units.py -vPurpose: An example of test cases demonstrating expected agent behavior.
Current Samples:
test_samples/onboarding_intermediate.json- Intermediate user onboarding scenario
Test Case Schema:
{
"id": "unique-test-id",
"name": "Human-readable test name",
"user_message": "Initial prompt to agent",
"description": "What this test validates",
"complexity": "simple|medium|complex",
"tags": ["category", "type"],
"max_turns": 12,
"timeout_seconds": 600,
"simulated_human_guidance": "Persona and behavior for simulated user",
"metadata": {
"domain": "agent_builder",
"scenario_type": "onboarding"
},
"assertions": [
{
"name": "assertion_name",
"type": "llm_judge|tool_called|transcript_contains|transcript_not_contains",
"description": "What this checks",
"check": "Evaluation criteria or pattern"
}
]
}Assertion Types:
llm_judge- LLM evaluates if behavior meets criteriatool_called- Verifies specific tool was invokedtranscript_contains- Pattern matching in transcripttranscript_not_contains- Ensure pattern is absent
Planned Features:
- Automated test execution against agents
- LLM-based assertion evaluation
- Scoring and metrics (pass rate, ...)
- Test result reporting (JSON, HTML, markdown)
- Integration with CI/CD pipelines
## Generating Test Data
### For Agent Evaluation
Generate diverse tests covering the agent's capabilities:
```bash
python -m evaluation.test_data_generator.cli \
--source-context /path/to/agent/source/ \
--count 50 \
--diversity 0.8 \
--output generated_tests/agent_eval/
Generate tests with specific complexity:
python -m evaluation.test_data_generator.cli \
--teacher-samples evaluation/test_samples/ \
--source-context /path/to/agent/source/ \
--count 30 \
--complexity medium \
--output generated_tests/regression/Use high diversity to find edge cases:
python -m evaluation.test_data_generator.cli \
--source-context /path/to/agent/source/ \
--count 20 \
--diversity 0.95 \
--temperature 0.9 \
--output generated_tests/edge_cases/The generator includes built-in quality controls:
✅ Domain Understanding - Analyzes source context to understand capabilities
✅ Diversity Control - --diversity parameter (0.0-1.0) controls novelty
✅ Complexity Distribution - Ensures mix of simple/medium/complex tests
✅ Automatic Deduplication - Removes duplicate test names
✅ Structural Validation - Ensures all required fields present
✅ Assertion Quality - Generates testable, specific assertions
The context loader supports different strategies for different tasks:
agent_evaluation(default) - Focus on instructions, capabilities, rulesapi_analysis- Prioritize API schemas, endpointscode_understanding- Focus on source codearchitecture_review- Prioritize design docsconfiguration_audit- Focus on config filesgeneric- Balanced loading
python -m evaluation.test_data_generator.cli \
--source-context /path/to/code/ \
--loading-strategy code_understanding \
--output generated_tests/When using deduplicate_tests.py:
keep_first- Keep first occurrence of each namekeep_best- Keep test with most assertionskeep_all_unique- Rename duplicates to make unique
python -m evaluation.test_data_generator.deduplicate_tests \
--input generated_tests/all.json \
--output generated_tests/unique.json \
--strategy keep_best# Test data generator smoke tests
python3 evaluation/test_data_generator/test_basic.py
# Full unit test suite
pytest evaluation/test_data_generator/test_units.py -v
# With coverage
pytest evaluation/test_data_generator/test_units.py \
--cov=evaluation.test_data_generator \
--cov-report=term-missing- Create a new JSON file in
test_samples/ - Follow the test case schema (see above)
- Include diverse assertion types
- Add simulated_human_guidance for reproducibility
- Validate JSON syntax:
python -m json.tool test_samples/new_test.json
Generate initial test suite from source code:
# 1. Generate diverse tests
python -m evaluation.test_data_generator.cli \
--source-context /path/to/agent/ \
--count 50 \
--diversity 0.8 \
--output bootstrap_tests/
# 2. Review and curate
# Manually review generated_tests/all_generated_tests.json
# Move high-quality tests to test_samples/
# 3. Use curated tests as teacher samples for refinement
python -m evaluation.test_data_generator.cli \
--teacher-samples test_samples/ \
--source-context /path/to/agent/ \
--count 30 \
--output refined_tests/# Generate stable, deterministic tests
python -m evaluation.test_data_generator.cli \
--teacher-samples test_samples/ \
--source-context /path/to/agent/ \
--count 40 \
--diversity 0.5 \
--temperature 0.7 \
--output regression_suite/- Intelligent test data generator
- Context-aware test generation
- Deduplication utilities
- Comprehensive unit tests
- Evaluation framework - Automated test execution
- Test runner - Parallel test execution
- Scoring engine - Pass/fail with metrics
- Results dashboard - Visualization and reporting
- CI/CD integration - GitHub Actions workflow
- Regression tracking - Historical comparison
Install dependencies:
pip install -r evaluation/requirements.txtFor Test Generation:
- Python 3.11+
- boto3>=1.28.0
- AWS credentials with Bedrock access
For Development/Testing:
- pytest>=7.0.0
- pytest-cov>=4.0.0 (optional, for coverage reports)
- unittest (standard library)
- Mock AWS credentials (for unit tests)
When adding new capabilities:
- Document in source code - Clear docstrings and comments
- Add unit tests - Cover deterministic logic without AWS calls
- Update examples - Add usage examples to
example.py - Update README - Document new features and workflows
For issues or questions:
- Check existing documentation in
test_data_generator/ - Run smoke tests to validate setup:
python3 evaluation/test_data_generator/test_basic.py - Review examples:
evaluation/test_data_generator/example.py
Status: Test data generation is complete and production-ready. Evaluation framework is planned for future development.