Commit 2141c7d

added unit tests for test data generator, fixed duplicated dead code in cli.py and import issues in example.py, Remove generated_test_data from git tracking
1 parent c4732a4 commit 2141c7d

58 files changed

Lines changed: 1091 additions & 3137 deletions

.gitignore

Lines changed: 4 additions & 1 deletion
@@ -15,4 +15,7 @@ DEPENDENCY-LICENSES.txt
 bom.xml
 sbom.json
 BSD-licenses.txt
-MIT-licenses.txt
+MIT-licenses.txt
+
+# Generated test data
+evaluation/generated_test_data

evaluation/README.md

Lines changed: 335 additions & 0 deletions
@@ -0,0 +1,335 @@
# Evaluation Infrastructure

This directory contains tools and test data for evaluating AWS Transform agents and their capabilities.

## Overview

The evaluation infrastructure consists of:
1. **Test Data Generator** - Intelligent test case generation from teacher samples and source context
2. **Test Samples** - Curated test cases for agent evaluation
3. **Evaluation Framework** *(Coming Soon)* - Automated test execution and scoring

## Directory Structure

```
evaluation/
├── README.md                        # This file
├── test_data_generator/             # Intelligent test case generator
│   ├── README.md                    # Generator documentation
│   ├── ARCHITECTURE.md              # Design decisions and rationale
│   ├── TEST_README.md               # Testing guide
│   ├── cli.py                       # Command-line interface
│   ├── intelligent_generator.py     # Main generation logic
│   ├── domain_analyzer.py           # Domain understanding from samples
│   ├── context_loader.py            # Source context loading strategies
│   ├── deduplicate_tests.py         # Deduplication utilities
│   ├── example.py                   # Usage examples
│   ├── test_basic.py                # Smoke tests
│   └── test_units.py                # Unit test suite (22 tests)
├── test_sample/                     # Sample test cases
│   └── onboarding_intermediate.json
└── generated_test_data/             # Generated tests (gitignored)
```

## Components

### 1. Test Data Generator

**Purpose:** Generate diverse, high-quality test cases for agent evaluation.

**Key Features:**
- Learns from teacher samples to understand domain patterns
- Analyzes source context (skills, code, documentation)
- Generates tests with controlled diversity
- Ensures complexity distribution (simple, medium, complex)
- Automatic deduplication
- Configurable loading strategies for different domains

**Quick Start:**
```bash
# Generate 20 test cases from source context only
python -m evaluation.test_data_generator.cli \
  --source-context /path/to/agent/code/ \
  --count 20 \
  --output generated_tests/

# Generate with teacher samples + source context
python -m evaluation.test_data_generator.cli \
  --teacher-samples evaluation/test_sample/ \
  --source-context /path/to/agent/code/ \
  --count 20 \
  --output generated_tests/

# High diversity generation for edge cases
python -m evaluation.test_data_generator.cli \
  --source-context /path/to/agent/code/ \
  --count 10 \
  --diversity 0.95 \
  --output edge_cases/
```

**Requirements:**
- Python 3.11+
- AWS credentials with Bedrock access
- boto3 installed

**Documentation:**
- [Generator README](test_data_generator/README.md) - Usage guide
- [Architecture](test_data_generator/ARCHITECTURE.md) - Design decisions
- [Testing Guide](test_data_generator/TEST_README.md) - Running tests

**Testing:**
```bash
# Run smoke tests (no AWS required)
python3 evaluation/test_data_generator/test_basic.py

# Run full unit test suite
pytest evaluation/test_data_generator/test_units.py -v
```

### 2. Test Samples

**Purpose:** Example test cases that demonstrate expected agent behavior.

**Current Samples:**
- `test_sample/onboarding_intermediate.json` - Intermediate user onboarding scenario

**Test Case Schema:**
```json
{
  "id": "unique-test-id",
  "name": "Human-readable test name",
  "user_message": "Initial prompt to agent",
  "description": "What this test validates",
  "complexity": "simple|medium|complex",
  "tags": ["category", "type"],
  "max_turns": 12,
  "timeout_seconds": 600,
  "simulated_human_guidance": "Persona and behavior for simulated user",
  "metadata": {
    "domain": "agent_builder",
    "scenario_type": "onboarding"
  },
  "assertions": [
    {
      "name": "assertion_name",
      "type": "llm_judge|tool_called|transcript_contains|transcript_not_contains",
      "description": "What this checks",
      "check": "Evaluation criteria or pattern"
    }
  ]
}
```

**Assertion Types:**
- `llm_judge` - An LLM evaluates whether the behavior meets the criteria
- `tool_called` - Verifies that a specific tool was invoked
- `transcript_contains` - Checks that a pattern appears in the transcript
- `transcript_not_contains` - Checks that a pattern is absent from the transcript

### 3. Evaluation Framework *(Coming Soon)*

**Planned Features:**
- Automated test execution against agents
- LLM-based assertion evaluation
- Scoring and metrics (pass rate, ...)
- Test result reporting (JSON, HTML, markdown)
- Integration with CI/CD pipelines

## Generating Test Data

### For Agent Evaluation
Generate diverse tests covering the agent's capabilities:

```bash
python -m evaluation.test_data_generator.cli \
  --source-context /path/to/agent/source/ \
  --count 50 \
  --diversity 0.8 \
  --output generated_tests/agent_eval/
```

### For Regression Testing
Generate tests with specific complexity:

```bash
python -m evaluation.test_data_generator.cli \
  --teacher-samples evaluation/test_sample/ \
  --source-context /path/to/agent/source/ \
  --count 30 \
  --complexity medium \
  --output generated_tests/regression/
```

### For Edge Case Discovery
Use high diversity to find edge cases:

```bash
python -m evaluation.test_data_generator.cli \
  --source-context /path/to/agent/source/ \
  --count 20 \
  --diversity 0.95 \
  --temperature 0.9 \
  --output generated_tests/edge_cases/
```

## Test Data Quality

The generator includes built-in quality controls:

- **Domain Understanding** - Analyzes source context to understand capabilities
- **Diversity Control** - `--diversity` parameter (0.0-1.0) controls novelty
- **Complexity Distribution** - Ensures a mix of simple/medium/complex tests
- **Automatic Deduplication** - Removes tests with duplicate names
- **Structural Validation** - Ensures all required fields are present
- **Assertion Quality** - Generates testable, specific assertions
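
To spot-check the complexity mix of a generated batch, a few lines of Python are enough. This is a minimal sketch and not part of the generator: `complexity_distribution` is a hypothetical helper, and it assumes each test is a dict with the `complexity` field from the test case schema.

```python
from collections import Counter


def complexity_distribution(tests: list[dict]) -> dict[str, float]:
    """Return the share of tests per complexity level (hypothetical helper)."""
    # Tests without a `complexity` field are grouped under "unknown".
    counts = Counter(t.get("complexity", "unknown") for t in tests)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {level: n / total for level, n in counts.items()}
```

A heavily skewed result (for example, almost all `simple`) is a hint to regenerate with a different `--complexity` or `--diversity` setting.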

## Configuration

### Loading Strategies

The context loader supports different strategies for different tasks:

- `agent_evaluation` (default) - Focus on instructions, capabilities, rules
- `api_analysis` - Prioritize API schemas, endpoints
- `code_understanding` - Focus on source code
- `architecture_review` - Prioritize design docs
- `configuration_audit` - Focus on config files
- `generic` - Balanced loading

```bash
python -m evaluation.test_data_generator.cli \
  --source-context /path/to/code/ \
  --loading-strategy code_understanding \
  --output generated_tests/
```

### Deduplication Strategies

When using `deduplicate_tests.py`:

- `keep_first` - Keep first occurrence of each name
- `keep_best` - Keep test with most assertions
- `keep_all_unique` - Rename duplicates to make unique

```bash
python -m evaluation.test_data_generator.deduplicate_tests \
  --input generated_tests/all.json \
  --output generated_tests/unique.json \
  --strategy keep_best
```
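
The difference between the three strategies is easiest to see in code. The following is a sketch of the behavior described above, not the actual contents of `deduplicate_tests.py`: the `deduplicate` function, the tie-breaking rule in `keep_best` (first test wins), and the ` (2)` rename suffix are all assumptions made for illustration.

```python
def deduplicate(tests: list[dict], strategy: str = "keep_first") -> list[dict]:
    """Deduplicate tests by `name` (illustrative sketch, not the real module)."""
    if strategy == "keep_first":
        first: dict[str, dict] = {}
        for test in tests:
            first.setdefault(test["name"], test)
        return list(first.values())

    if strategy == "keep_best":
        best: dict[str, dict] = {}
        for test in tests:
            current = best.get(test["name"])
            # Strictly more assertions wins; on a tie the earlier test stays.
            if current is None or len(test.get("assertions", [])) > len(
                current.get("assertions", [])
            ):
                best[test["name"]] = test
        return list(best.values())

    if strategy == "keep_all_unique":
        seen: dict[str, int] = {}
        result = []
        for test in tests:
            n = seen.get(test["name"], 0) + 1
            seen[test["name"]] = n
            # Assumed suffix format; does not guard against a test that is
            # already literally named "X (2)".
            result.append(test if n == 1 else {**test, "name": f"{test['name']} ({n})"})
        return result

    raise ValueError(f"Unknown strategy: {strategy!r}")
```

`keep_first` and `keep_best` shrink the list, while `keep_all_unique` preserves every test and only changes names.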

## Development

### Running Tests

```bash
# Test data generator smoke tests
python3 evaluation/test_data_generator/test_basic.py

# Full unit test suite
pytest evaluation/test_data_generator/test_units.py -v

# With coverage
pytest evaluation/test_data_generator/test_units.py \
  --cov=evaluation.test_data_generator \
  --cov-report=term-missing
```

### Adding New Test Samples

1. Create a new JSON file in `test_sample/`
2. Follow the test case schema (see above)
3. Include diverse assertion types
4. Add `simulated_human_guidance` for reproducibility
5. Validate JSON syntax: `python -m json.tool test_sample/new_test.json`
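
`json.tool` only confirms that the file parses. A structural check against the schema can catch more, along the lines of the sketch below. Everything here is an assumption for illustration: `validate_test_case`, `validate_file`, and the choice of which fields count as required are hypothetical, and the sample file is assumed to hold either one test object or a list of them. The allowed `complexity` and assertion `type` values are taken from the schema above.

```python
import json
from pathlib import Path

# Assumption: these top-level fields are mandatory; the README does not say.
REQUIRED_FIELDS = {"id", "name", "user_message", "description", "complexity", "assertions"}
ASSERTION_FIELDS = {"name", "type", "description", "check"}
COMPLEXITY_LEVELS = {"simple", "medium", "complex"}
ASSERTION_TYPES = {"llm_judge", "tool_called", "transcript_contains", "transcript_not_contains"}


def validate_test_case(case: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - set(case))]
    if "complexity" in case and case["complexity"] not in COMPLEXITY_LEVELS:
        errors.append(f"invalid complexity: {case['complexity']!r}")
    for i, assertion in enumerate(case.get("assertions", [])):
        for f in sorted(ASSERTION_FIELDS - set(assertion)):
            errors.append(f"assertions[{i}]: missing field: {f}")
        if "type" in assertion and assertion["type"] not in ASSERTION_TYPES:
            errors.append(f"assertions[{i}]: invalid type: {assertion['type']!r}")
    return errors


def validate_file(path: str) -> dict[str, list[str]]:
    """Validate every test case in a JSON file, keyed by test id."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    cases = data if isinstance(data, list) else [data]
    return {case.get("id", f"<index {i}>"): validate_test_case(case) for i, case in enumerate(cases)}
```

Calling `validate_file("test_sample/new_test.json")` before committing a sample would surface missing fields or misspelled assertion types early.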

## Common Workflows

### Workflow 1: Bootstrap Test Suite
Generate an initial test suite from source code:

```bash
# 1. Generate diverse tests
python -m evaluation.test_data_generator.cli \
  --source-context /path/to/agent/ \
  --count 50 \
  --diversity 0.8 \
  --output bootstrap_tests/

# 2. Review and curate
#    Manually review bootstrap_tests/all_generated_tests.json
#    Move high-quality tests to evaluation/test_sample/

# 3. Use curated tests as teacher samples for refinement
python -m evaluation.test_data_generator.cli \
  --teacher-samples evaluation/test_sample/ \
  --source-context /path/to/agent/ \
  --count 30 \
  --output refined_tests/
```

### Workflow 2: Regression Suite
Generate a more stable test set for regression runs:

```bash
# Moderate diversity and temperature reduce run-to-run variation
python -m evaluation.test_data_generator.cli \
  --teacher-samples evaluation/test_sample/ \
  --source-context /path/to/agent/ \
  --count 40 \
  --diversity 0.5 \
  --temperature 0.7 \
  --output regression_suite/
```

## Roadmap

- [x] Intelligent test data generator
- [x] Context-aware test generation
- [x] Deduplication utilities
- [x] Comprehensive unit tests
- [ ] **Evaluation framework** - Automated test execution
- [ ] **Test runner** - Parallel test execution
- [ ] **Scoring engine** - Pass/fail with metrics
- [ ] **Results dashboard** - Visualization and reporting
- [ ] **CI/CD integration** - GitHub Actions workflow
- [ ] **Regression tracking** - Historical comparison

## Requirements

**For Test Generation:**
- Python 3.11+
- boto3
- AWS credentials with Bedrock access

**For Development/Testing:**
- pytest
- unittest (standard library)
- Mock AWS credentials (for unit tests)

## Contributing

When adding new capabilities:

1. **Document in source code** - Clear docstrings and comments
2. **Add unit tests** - Cover deterministic logic without AWS calls
3. **Update examples** - Add usage examples to `example.py`
4. **Update README** - Document new features and workflows

## Resources

- [Test Data Generator README](test_data_generator/README.md)
- [Architecture Documentation](test_data_generator/ARCHITECTURE.md)
- [Testing Guide](test_data_generator/TEST_README.md)
- [Example Usage](test_data_generator/example.py)

## Support

For issues or questions:
1. Check existing documentation in `test_data_generator/`
2. Run smoke tests to validate setup: `python3 evaluation/test_data_generator/test_basic.py`
3. Review examples: `evaluation/test_data_generator/example.py`

---

**Status:** Test data generation is complete and production-ready. Evaluation framework is planned for future development.
