diff --git a/evaluation/.github-ci-example.yml b/evaluation/.github-ci-example.yml new file mode 100644 index 0000000..cdca0f2 --- /dev/null +++ b/evaluation/.github-ci-example.yml @@ -0,0 +1,67 @@ +# Example GitHub Actions workflow for evaluation package CI +# This file is for reference only - integrate into your existing CI pipeline + +name: Evaluation Tests + +on: + pull_request: + paths: + - 'evaluation/**' + push: + branches: [main, develop] + paths: + - 'evaluation/**' + +jobs: + test: + runs-on: ubuntu-latest + strategy: + matrix: + python-version: ['3.11', '3.12', '3.13'] + + steps: + - uses: actions/checkout@v4 + + - name: Set up Python ${{ matrix.python-version }} + uses: actions/setup-python@v5 + with: + python-version: ${{ matrix.python-version }} + + - name: Install evaluation package + run: | + cd evaluation + pip install -e ".[test]" + + - name: Run smoke tests + run: | + python3 evaluation/test_data_generator/test_basic.py + + - name: Run unit tests + run: | + pytest evaluation/test_data_generator/test_units.py -v --tb=short + + - name: Run unit tests with coverage + if: matrix.python-version == '3.11' + run: | + pytest evaluation/test_data_generator/test_units.py \ + --cov=evaluation.test_data_generator \ + --cov-report=term-missing \ + --cov-fail-under=70 + + lint: + runs-on: ubuntu-latest + + steps: + - uses: actions/checkout@v4 + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: '3.11' + + - name: Install ruff + run: pip install ruff + + - name: Lint with ruff + run: | + ruff check evaluation/ --config pyproject.toml diff --git a/evaluation/README.md b/evaluation/README.md new file mode 100644 index 0000000..05e005e --- /dev/null +++ b/evaluation/README.md @@ -0,0 +1,338 @@ +# Evaluation Infrastructure + +This directory contains tools and test data for evaluating AWS Transform agents and their capabilities. + +## Overview + +The evaluation infrastructure consists of: +1. 
**Test Data Generator** - Intelligent test case generation from teacher samples and source context +2. **Test Samples** - Curated test cases for agent evaluation +3. **Evaluation Framework** *(Coming Soon)* - Automated test execution and scoring + +## Directory Structure + +``` +evaluation/ +├── README.md # This file +├── pyproject.toml # Package configuration (setuptools) +├── test_data_generator/ # Intelligent test case generator +│ ├── README.md # Complete documentation +│ ├── cli.py # Command-line interface +│ ├── intelligent_generator.py # Main generation logic +│ ├── domain_analyzer.py # Domain understanding from samples +│ ├── context_loader.py # Source context loading strategies +│ ├── deduplicate_tests.py # Deduplication utilities +│ ├── example.py # Usage examples +│ ├── test_basic.py # Smoke tests +│ └── test_units.py # Unit test suite (22 tests) +├── test_samples/ # Sample test cases +│ └── onboarding_intermediate.json +└── generated_test_data/ # Generated tests (gitignored) +``` + +## Components + +### 1. Test Data Generator + +**Purpose:** Generate diverse, high-quality test cases for agent evaluation. 
+ +**Key Features:** +- Learns from teacher samples to understand domain patterns +- Analyzes source context (skills, code, documentation) +- Generates tests with controlled diversity +- Ensures complexity distribution (simple, medium, complex) +- Automatic deduplication +- Configurable loading strategies for different domains + +**Quick Start:** +```bash +# Generate 20 test cases from source context only +python -m evaluation.test_data_generator.cli \ + --source-context /path/to/agent/code/ \ + --count 20 \ + --output generated_tests/ + +# Generate with teacher samples + source context +python -m evaluation.test_data_generator.cli \ + --teacher-samples evaluation/test_samples/ \ + --source-context /path/to/agent/code/ \ + --count 20 \ + --output generated_tests/ + +# High diversity generation for edge cases +python -m evaluation.test_data_generator.cli \ + --source-context /path/to/agent/code/ \ + --count 10 \ + --diversity 0.95 \ + --output edge_cases/ +``` + +**Requirements:** +- Python 3.11+ +- AWS credentials with Bedrock access +- boto3 installed + +**Documentation:** +- [Generator README](test_data_generator/README.md) - Complete guide (usage, architecture, testing) + +**Testing:** +```bash +# Run smoke tests (no AWS required) +python3 evaluation/test_data_generator/test_basic.py + +# Run full unit test suite +pytest evaluation/test_data_generator/test_units.py -v +``` + +### 2. Test Samples + +**Purpose:** Example test cases demonstrating expected agent behavior.
+ +**Current Samples:** +- `test_samples/onboarding_intermediate.json` - Intermediate user onboarding scenario + +**Test Case Schema:** +```json +{ + "id": "unique-test-id", + "name": "Human-readable test name", + "user_message": "Initial prompt to agent", + "description": "What this test validates", + "complexity": "simple|medium|complex", + "tags": ["category", "type"], + "max_turns": 12, + "timeout_seconds": 600, + "simulated_human_guidance": "Persona and behavior for simulated user", + "metadata": { + "domain": "agent_builder", + "scenario_type": "onboarding" + }, + "assertions": [ + { + "name": "assertion_name", + "type": "llm_judge|tool_called|transcript_contains|transcript_not_contains", + "description": "What this checks", + "check": "Evaluation criteria or pattern" + } + ] +} +``` + +**Assertion Types:** +- `llm_judge` - LLM evaluates if behavior meets criteria +- `tool_called` - Verifies specific tool was invoked +- `transcript_contains` - Pattern matching in transcript +- `transcript_not_contains` - Ensure pattern is absent + +### 3. Evaluation Framework *(Coming Soon)* + +**Planned Features:** +- Automated test execution against agents +- LLM-based assertion evaluation +- Scoring and metrics (pass rate, ...) 
+- Test result reporting (JSON, HTML, markdown) +- Integration with CI/CD pipelines + +``` + +## Generating Test Data + +### For Agent Evaluation +Generate diverse tests covering the agent's capabilities: + +```bash +python -m evaluation.test_data_generator.cli \ + --source-context /path/to/agent/source/ \ + --count 50 \ + --diversity 0.8 \ + --output generated_tests/agent_eval/ +``` + +### For Regression Testing +Generate tests with specific complexity: + +```bash +python -m evaluation.test_data_generator.cli \ + --teacher-samples evaluation/test_samples/ \ + --source-context /path/to/agent/source/ \ + --count 30 \ + --complexity medium \ + --output generated_tests/regression/ +``` + +### For Edge Case Discovery +Use high diversity to find edge cases: + +```bash +python -m evaluation.test_data_generator.cli \ + --source-context /path/to/agent/source/ \ + --count 20 \ + --diversity 0.95 \ + --temperature 0.9 \ + --output generated_tests/edge_cases/ +``` + +## Test Data Quality + +The generator includes built-in quality controls: + +✅ **Domain Understanding** - Analyzes source context to understand capabilities +✅ **Diversity Control** - `--diversity` parameter (0.0-1.0) controls novelty +✅ **Complexity Distribution** - Ensures mix of simple/medium/complex tests +✅ **Automatic Deduplication** - Removes duplicate test names +✅ **Structural Validation** - Ensures all required fields present +✅ **Assertion Quality** - Generates testable, specific assertions + +## Configuration + +### Loading Strategies + +The context loader supports different strategies for different tasks: + +- `agent_evaluation` (default) - Focus on instructions, capabilities, rules +- `api_analysis` - Prioritize API schemas, endpoints +- `code_understanding` - Focus on source code +- `architecture_review` - Prioritize design docs +- `configuration_audit` - Focus on config files +- `generic` - Balanced loading + +```bash +python -m evaluation.test_data_generator.cli \ + --source-context 
/path/to/code/ \ + --loading-strategy code_understanding \ + --output generated_tests/ +``` + +### Deduplication Strategies + +When using `deduplicate_tests.py`: + +- `keep_first` - Keep first occurrence of each name +- `keep_best` - Keep test with most assertions +- `keep_all_unique` - Rename duplicates to make unique + +```bash +python -m evaluation.test_data_generator.deduplicate_tests \ + --input generated_tests/all.json \ + --output generated_tests/unique.json \ + --strategy keep_best +``` + +## Development + +### Running Tests + +```bash +# Test data generator smoke tests +python3 evaluation/test_data_generator/test_basic.py + +# Full unit test suite +pytest evaluation/test_data_generator/test_units.py -v + +# With coverage +pytest evaluation/test_data_generator/test_units.py \ + --cov=evaluation.test_data_generator \ + --cov-report=term-missing +``` + +### Adding New Test Samples + +1. Create a new JSON file in `test_samples/` +2. Follow the test case schema (see above) +3. Include diverse assertion types +4. Add `simulated_human_guidance` for reproducibility +5. Validate JSON syntax: `python -m json.tool test_samples/new_test.json` + +## Common Workflows + +### Workflow 1: Bootstrap Test Suite +Generate initial test suite from source code: + +```bash +# 1. Generate diverse tests +python -m evaluation.test_data_generator.cli \ + --source-context /path/to/agent/ \ + --count 50 \ + --diversity 0.8 \ + --output bootstrap_tests/ + +# 2. Review and curate +# Manually review bootstrap_tests/all_generated_tests.json +# Move high-quality tests to test_samples/ + +# 3.
Use curated tests as teacher samples for refinement +python -m evaluation.test_data_generator.cli \ + --teacher-samples test_samples/ \ + --source-context /path/to/agent/ \ + --count 30 \ + --output refined_tests/ +``` + + +```bash +# Generate more stable, lower-diversity tests +python -m evaluation.test_data_generator.cli \ + --teacher-samples test_samples/ \ + --source-context /path/to/agent/ \ + --count 40 \ + --diversity 0.5 \ + --temperature 0.7 \ + --output regression_suite/ +``` + +## Roadmap + +- [x] Intelligent test data generator +- [x] Context-aware test generation +- [x] Deduplication utilities +- [x] Comprehensive unit tests +- [ ] **Evaluation framework** - Automated test execution +- [ ] **Test runner** - Parallel test execution +- [ ] **Scoring engine** - Pass/fail with metrics +- [ ] **Results dashboard** - Visualization and reporting +- [ ] **CI/CD integration** - GitHub Actions workflow +- [ ] **Regression tracking** - Historical comparison + +## Installation + +Install the evaluation package: + +```bash +# From agent-builder-toolkit-aws-transform/ +cd evaluation +pip install -e . + +# Or with test dependencies +pip install -e ".[test]" +``` + +**Requirements:** +- Python 3.11+ +- boto3>=1.28.0 (AWS Bedrock access) +- AWS credentials configured +- pytest>=7.0.0 (for testing, optional) + +## Contributing + +When adding new capabilities: + +1. **Document in source code** - Clear docstrings and comments +2. **Add unit tests** - Cover deterministic logic without AWS calls +3. **Update examples** - Add usage examples to `example.py` +4. **Update README** - Document new features and workflows + +## Resources + +- [Test Data Generator README](test_data_generator/README.md) - Complete documentation +- [Example Usage](test_data_generator/example.py) - Code examples + +## Support + +For issues or questions: +1. Check existing documentation in `test_data_generator/` +2. Run smoke tests to validate setup: `python3 evaluation/test_data_generator/test_basic.py` +3.
Review examples: `evaluation/test_data_generator/example.py` + +--- + +**Status:** Test data generation is complete and production-ready. Evaluation framework is planned for future development. diff --git a/evaluation/pyproject.toml b/evaluation/pyproject.toml new file mode 100644 index 0000000..1f880e1 --- /dev/null +++ b/evaluation/pyproject.toml @@ -0,0 +1,37 @@ +[build-system] +requires = ["setuptools>=68", "wheel"] +build-backend = "setuptools.build_meta" + +[project] +name = "agent-builder-evaluation-aws-transform" +version = "0.0.0" +description = "Evaluation framework and test data generator for AWS Transform agents" +readme = "README.md" +requires-python = ">=3.11" +authors = [{ name = "AWS Transform Team" }] +classifiers = [ + "Programming Language :: Python :: 3", + "Programming Language :: Python :: 3.11", + "Programming Language :: Python :: 3.12", + "Programming Language :: Python :: 3.13", + "Operating System :: OS Independent", +] +dependencies = [ + "boto3>=1.28.0", + "botocore>=1.28.0", +] + +[project.optional-dependencies] +test = [ + "pytest>=7.0.0", + "pytest-cov>=4.0.0", +] + +[tool.setuptools.packages.find] +where = ["."] + +[tool.pytest.ini_options] +testpaths = ["test_data_generator"] +python_files = ["test_*.py"] +python_classes = ["Test*"] +python_functions = ["test_*"] diff --git a/evaluation/test_data_generator/README.md b/evaluation/test_data_generator/README.md new file mode 100644 index 0000000..a26c445 --- /dev/null +++ b/evaluation/test_data_generator/README.md @@ -0,0 +1,547 @@ +# Intelligent Test Data Generator + +Generate diverse, high-quality test cases by understanding your task domain and learning from teacher samples. 
+ +## Table of Contents + +- [Overview](#overview) +- [Quick Start](#quick-start) +- [Command-Line Options](#command-line-options) +- [How It Works](#how-it-works) +- [Programmatic Usage](#programmatic-usage) +- [Best Practices](#best-practices) +- [Troubleshooting](#troubleshooting) +- [Architecture](#architecture) +- [Testing](#testing) +- [Contributing](#contributing) + +## Overview + +This intelligent test generator solves the problem of limited test data by: + +1. **Understanding your domain** - Analyzes source code/documentation and optional teacher test samples using an LLM to extract domain patterns, capabilities, user personas, and success criteria +2. **Generating diverse tests** - Creates new test cases that explore different scenarios, edge cases, and complexity levels +3. **Maintaining quality** - Ensures generated tests follow proper structure and quality standards + +**Key Feature**: Source context (code/docs) is **always required** for domain understanding. Teacher samples are optional - you can bootstrap from source code alone! + +**Why Use This?** + +- **Problem**: You need more test samples for agent evolution, but manually creating test cases is time-consuming +- **Solution**: This generator learns from your source context and an optional small set of teacher samples, then generates diverse, realistic test cases automatically + +## Quick Start + +### 1. Bootstrap from Source Context (No Teacher Samples) + +Generate tests directly from your source code without any teacher samples: + +```bash +python -m test_data_generator.cli \ + --source-context /path/to/your/source/code/ \ + --count 20 \ + --output generated_tests/ +``` + +### 2. Generate with Teacher Samples + Source Context + +For best results, combine teacher samples with source context: + +```bash +python -m test_data_generator.cli \ + --teacher-samples test_samples/ \ + --source-context /path/to/your/source/code/ \ + --count 20 \ + --output generated_tests/ +``` + +### 3.
Advanced: Tune Complexity and Diversity + +```bash +python -m test_data_generator.cli \ + --teacher-samples test_samples/ \ + --source-context /path/to/source/folder/ \ + --count 15 \ + --complexity medium \ + --diversity 0.9 \ + --output generated_tests/ +``` + +### 4. Domain Analysis Only + +Analyze your domain without generating tests: + +```bash +python -m test_data_generator.cli \ + --source-context /path/to/source/folder/ \ + --teacher-samples test_samples/ \ + --analyze-only \ + --output analysis/ +``` + +## Command-Line Options + +``` +Required: + --source-context PATH Path to source code/docs directory or file + REQUIRED for domain understanding + --output PATH Output directory for generated tests + +Optional: + --teacher-samples PATH Path to teacher test samples (file or directory) + Optional - can bootstrap from source context alone + --count N Number of tests to generate (default: 10) + --complexity LEVEL Generate specific complexity: simple, medium, or complex + --diversity FACTOR Diversity 0-1: 0=similar, 1=very diverse (default: 0.8) + --region REGION AWS region (default: us-west-2) + --model-id MODEL Bedrock model ID (default: claude-opus-4-5) + --temperature TEMP Generation temperature 0-1 (default: 0.8) + --analyze-only Only analyze domain, don't generate + --no-deduplicate Disable test name deduplication + --no-ensure-complex Disable ensuring 20% complex tests + --use-two-pass-analysis Use two-pass analysis for large source context + --loading-strategy STR Strategy for loading files (default: agent_evaluation) + --verbose Enable verbose logging +``` + +### Diversity Control + +The `--diversity` parameter controls how different generated tests are: + +- **0.0 - 0.3**: Low diversity - Stay close to patterns, mostly variations +- **0.4 - 0.6**: Medium diversity - Explore different aspects of core capabilities +- **0.7 - 1.0**: High diversity - Explore edge cases, error handling, unusual scenarios + +## How It Works + +### Phase 1: Domain Analysis + 
+The generator analyzes your source context and optional teacher samples to extract: + +- **Core capabilities** being tested +- **Domain-specific patterns** and scenarios +- **User personas** and interaction styles +- **Success criteria** and quality expectations +- **Edge cases** and complexity factors +- **Assertion patterns** and what they validate + +### Phase 2: Intelligent Generation + +Using the domain understanding, it generates tests that: + +- Follow the same structure as teacher samples (or infer structure from source) +- Explore different scenarios and edge cases +- Cover different user personas and skill levels +- Maintain appropriate assertion quality +- Match desired complexity distribution + +### Phase 3: Validation + +Each generated test is validated to ensure: + +- Required fields are present +- Assertions are valid and complete +- Structure matches teacher samples +- Reasonable defaults for timing/turns + +### Output Files + +After generation, you'll find: + +``` +output_directory/ +├── domain_analysis.json # Domain understanding and patterns +├── all_generated_tests.json # All tests in one file +├── generated_test_001.json # Individual test files +├── generated_test_002.json +└── ... 
+``` + +## Programmatic Usage + +### Bootstrap Without Teacher Samples + +```python +from test_data_generator import IntelligentTestGenerator +from test_data_generator.context_loader import ContextLoader + +# Load source context (required) +loader = ContextLoader(strategy='agent_evaluation') +source_context = loader.load('/path/to/your/source/') + +generator = IntelligentTestGenerator( + region_name='us-west-2', + model_id='us.anthropic.claude-opus-4-5-20251101-v1:0', + temperature=0.8 +) + +# Generate from source context only +generated = generator.generate_test_cases( + teacher_samples=[], # Empty list + count=10, + source_context=source_context, # Required + diversity_factor=0.8, + output_dir='output/' +) +``` + +### With Teacher Samples + Source Context + +```python +# Load source context and teacher samples +loader = ContextLoader(strategy='agent_evaluation') +source_context = loader.load('/path/to/your/source/') +teacher_samples = [...] # Your test samples + +# Generate with both +generated = generator.generate_test_cases( + teacher_samples=teacher_samples, + count=20, + source_context=source_context, + complexity='medium', + diversity_factor=0.8, + output_dir='output/' +) +``` + +## Best Practices + +### Always Required: +1. **Provide comprehensive source context**: Include documentation, code, configuration files +2. **Organize your source well**: Clear structure helps the analyzer understand your domain +3. **Include key files**: POWER.md, README, main entry points, core logic + +### With Teacher Samples: +1. **Start with quality teachers**: Better teacher samples = better generated tests +2. **Start at 0.8 diversity**: Adjust based on results +3. **Review generated tests**: Spot-check first few generations +4. **Iterate**: Use domain analysis to understand coverage gaps + +### Without Teacher Samples (Bootstrap): +1. **Start with lower count**: Generate 5-10 tests first to verify quality +2. 
**Review structure**: First-generation tests may need manual refinement +3. **Use refined tests as teachers**: Use generated tests as teacher samples for next round +4. **Iterate to improve**: Each generation learns from previous results + +## Troubleshooting + +**No tests generated**: Check that source context is provided and teacher samples (if used) have valid structure with assertions + +**Low quality tests**: Try lowering diversity factor or providing more comprehensive source context + +**Bedrock errors**: Verify AWS credentials and model access in your region + +**Memory issues**: Reduce count or generate in smaller batches + +**"Source context very large" warning**: +- Tool auto-skips .git, node_modules, __pycache__, etc. +- Auto-skips files >100KB +- Content truncated intelligently for analysis +- Use specific file instead of full directory if needed + +## Architecture + +### System Overview + +``` +INPUT PROCESSING OUTPUT +───── ────────── ────── + +┌──────────────┐ ┌────────────────────┐ ┌─────────────┐ +│ Teacher │ │ Domain Analyzer │ │ Generated │ +│ Samples │─────────────>│ │ │ Tests │ +│ (optional) │ │ • Extract patterns│ │ (10-50 tests)│ +└──────────────┘ │ • Understand │ └─────────────┘ + │ capabilities │ +┌──────────────┐ │ • Identify personas│ ┌─────────────┐ +│Source Context│─────────────>│ • Analyze │ │ Domain │ +│ (required) │ │ assertions │ │ Analysis │ +└──────────────┘ └────────┬───────────┘ └─────────────┘ + │ + │ Domain Understanding + │ + ▼ + ┌────────────────────┐ + │ Intelligent Gen. 
│ + │ │ + │ • Build prompts │ + │ • Generate batches│ + │ • Ensure diversity│ + │ • Validate output │ + └────────┬───────────┘ + │ + ▼ + ┌────────────────────┐ + │ AWS Bedrock │ + │ (Claude Models) │ + └────────────────────┘ +``` + +### Component Architecture + +``` +test_data_generator/ +│ +├── domain_analyzer.py +│ └─ DomainAnalyzer +│ ├─ analyze_test_samples() # Phase 1: Understanding +│ ├─ _extract_structural_patterns() +│ ├─ _extract_domain_understanding() +│ ├─ _two_pass_analysis() +│ ├─ _smart_truncate() +│ └─ _call_bedrock() +│ +├── intelligent_generator.py +│ └─ IntelligentTestGenerator +│ ├─ generate_test_cases() # Phase 2: Generation +│ ├─ _generate_batch() +│ ├─ _build_generation_prompt() +│ ├─ _validate_and_fix_tests() +│ └─ _final_quality_pass() +│ +├── context_loader.py +│ └─ ContextLoader +│ ├─ load() # Load source context +│ ├─ _discover_files() +│ └─ _prioritize_files() +│ +└── cli.py + ├─ main() # Command-line interface + ├─ load_teacher_samples() + └─ Argument parsing +``` + +### Key Design Decisions + +**1. Two-Phase Approach (Analysis → Generation)** + +*Why?* Separate understanding from generation, reusable domain analysis, better quality control + +*Tradeoffs:* Two LLM calls instead of one, slightly slower but much better quality + +**2. Batch Generation** + +*Why?* Ensures diversity across batches, better progress tracking, fault tolerance + +*Tradeoffs:* More API calls and complexity, but better results and reliability + +**3. Validation & Auto-Fix** + +*Why?* LLM output can be imperfect, ensures structural consistency, reduces manual cleanup + +*Tradeoffs:* May mask generation issues, but much better usability + +**4. Source Context Required, Teacher Samples Optional** + +*Why?* Source context provides ground truth, enables bootstrapping without existing tests + +*Tradeoffs:* More setup required, but more flexible and powerful + +### Validation Pipeline + +``` +Generated Test + │ + ▼ +┌─────────────┐ +│ Has ID? 
│ ─No─> Generate ID +└──────┬──────┘ + │ Yes + ▼ +┌─────────────┐ +│ Has name? │ ─No─> Generate name +└──────┬──────┘ + │ Yes + ▼ +┌─────────────┐ +│ Valid │ ─No─> Set default +│ complexity? │ +└──────┬──────┘ + │ Yes + ▼ +┌─────────────┐ +│ Has │ ─No─> Skip test +│ assertions? │ +└──────┬──────┘ + │ Yes + ▼ +┌─────────────┐ +│ Validate │ ─Invalid─> Remove invalid +│ assertions │ +└──────┬──────┘ + │ Valid + ▼ +┌─────────────┐ +│ Add to │ +│ output set │ +└─────────────┘ +``` + +### Error Handling Strategy + +**Level 1: Graceful Degradation** +- No teacher samples? Generate from source context alone +- Some tests invalid? Use valid ones +- Batch failed? Continue with others + +**Level 2: Validation & Auto-Fix** +- Missing fields? Add defaults +- Invalid assertions? Remove them +- Wrong structure? Fix if possible + +**Level 3: Clear Errors** +- No source context? Error & exit +- Bedrock unavailable? Error & exit +- All tests invalid? Error & exit + +## Testing + +### Running Tests + +**Quick Smoke Test (No Dependencies)** +```bash +python3 evaluation/test_data_generator/test_basic.py +``` + +**Full Unit Test Suite** +```bash +# Run all unit tests +pytest evaluation/test_data_generator/test_units.py -v + +# Run with coverage +pytest evaluation/test_data_generator/test_units.py \ + --cov=evaluation.test_data_generator \ + --cov-report=term-missing +``` + +**Run Specific Test Classes** +```bash +# Test only ContextLoader +pytest evaluation/test_data_generator/test_units.py::TestContextLoader -v + +# Test only Deduplication +pytest evaluation/test_data_generator/test_units.py::TestDeduplication -v + +# Test only DomainAnalyzer +pytest evaluation/test_data_generator/test_units.py::TestDomainAnalyzer -v +``` + +### Test Coverage Summary + +**TestContextLoader (10 tests)** +- Strategy initialization and fallback behavior +- Binary file detection +- Single file and directory loading +- Skip patterns (directories, large files) +- Custom strategy creation + 
+**TestDeduplication (4 tests)** +- `keep_first` - Keep first occurrence +- `keep_best` - Keep test with most assertions +- `keep_all_unique` - Rename duplicates +- No duplicates case + +**TestDomainAnalyzer (5 tests)** +- Structural pattern extraction +- Complexity distribution analysis +- Assertion pattern analysis +- Default structure generation +- Empty sample handling + +**TestCustomStrategyCreation (3 tests)** +- Basic custom strategy creation +- Priority pattern configuration +- File size limits + +### What's Tested vs. Not Tested + +**Tested (22 unit tests, ~0.1s runtime)** +- Context loading (~200 lines) +- Deduplication (~100 lines) +- Domain analysis (structural parts) +- File discovery and filtering +- Pattern extraction + +**Not Tested (Requires AWS/Bedrock)** +- Domain understanding generation +- Test case generation +- Two-pass analysis +- LLM API calls + +### CI/CD Integration + +```bash +# Run tests and fail on any failures +pytest evaluation/test_data_generator/test_units.py --tb=short || exit 1 + +# Run with coverage threshold +pytest evaluation/test_data_generator/test_units.py \ + --cov=evaluation.test_data_generator \ + --cov-fail-under=70 +``` + +## Contributing + +The generator is designed to be extensible. To customize: + +1. **Modify domain analysis**: Edit `domain_analyzer.py` to extract additional patterns +2. **Adjust generation**: Update `intelligent_generator.py` prompts for your domain +3. **Add validation**: Add custom validation logic for your test structure +4. **Create strategies**: Add new context loading strategies in `context_loader.py` + +### Code Style + +- Follow existing patterns for consistency +- Add docstrings for public methods +- Use type hints where appropriate +- Keep functions focused and small +- Add unit tests for deterministic logic + +### Pull Request Guidelines + +1. Ensure all tests pass +2. Add tests for new functionality +3. Update documentation as needed +4. Keep changes focused and atomic +5. 
Provide clear commit messages + +### Writing New Tests + +When adding new testable functionality: + +1. Add unit tests to `test_units.py` if the logic is **deterministic** +2. Use mocking (`unittest.mock`) for external dependencies +3. Use temp directories for file I/O tests +4. Keep tests **fast** + +Example: +```python +class TestNewFeature(unittest.TestCase): + def test_feature_works(self): + """Test that feature behaves correctly.""" + # Arrange, Act, Assert +``` + +## Installation + +Install the evaluation package from the repository root: + +```bash +# From agent-builder-toolkit-aws-transform/ +cd evaluation +pip install -e . + +# Or with test dependencies +pip install -e ".[test]" +``` + +## Requirements + +- Python 3.11+ +- boto3 (AWS Bedrock access) +- AWS credentials configured +- Access to Claude models in Bedrock + +## License + +Same as parent project. diff --git a/evaluation/test_data_generator/__init__.py b/evaluation/test_data_generator/__init__.py new file mode 100644 index 0000000..b7e6e17 --- /dev/null +++ b/evaluation/test_data_generator/__init__.py @@ -0,0 +1,10 @@ +"""Test data generator package for AWS Transform agent evaluation. + +This package provides intelligent test data generation based on understanding +of the task domain and teacher test samples. 
+""" + +from .intelligent_generator import IntelligentTestGenerator +from .domain_analyzer import DomainAnalyzer + +__all__ = ['IntelligentTestGenerator', 'DomainAnalyzer'] diff --git a/evaluation/test_data_generator/cli.py b/evaluation/test_data_generator/cli.py new file mode 100755 index 0000000..815d5cc --- /dev/null +++ b/evaluation/test_data_generator/cli.py @@ -0,0 +1,316 @@ +#!/usr/bin/env python3 +"""Command-line interface for intelligent test data generation.""" + +import argparse +import json +import logging +import sys +from pathlib import Path +from typing import List, Dict, Any + +from .intelligent_generator import IntelligentTestGenerator +from .domain_analyzer import DomainAnalyzer +from .context_loader import ContextLoader, LOADING_STRATEGIES + +logging.basicConfig( + level=logging.INFO, + format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' +) +logger = logging.getLogger(__name__) + + +def load_teacher_samples(path: str) -> List[Dict[str, Any]]: + """Load teacher samples from path (file or directory).""" + path_obj = Path(path) + + if not path_obj.exists(): + raise FileNotFoundError(f"Path not found: {path}") + + samples = [] + + if path_obj.is_file(): + with open(path_obj, 'r') as f: + data = json.load(f) + if isinstance(data, list): + samples.extend(data) + elif isinstance(data, dict) and 'test_cases' in data: + samples.extend(data['test_cases']) + else: + # Assume it's a single test case object; add it as one sample + samples.append(data) + + elif path_obj.is_dir(): + for json_file in path_obj.glob("*.json"): + try: + with open(json_file, 'r') as f: + data = json.load(f) + if isinstance(data, list): + samples.extend(data) + else: + samples.append(data) + except Exception as e: + logger.warning(f"Failed to load {json_file}: {e}") + + logger.info(f"Loaded {len(samples)} teacher samples from {path}") + return samples + + +def main(): + """Main CLI entry point.""" + parser = argparse.ArgumentParser( + description="Intelligent Test Data Generator - 
Generate diverse test cases from teacher samples", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + # Generate from source context only (no teacher samples) + python -m test_data_generator.cli \\ + --source-context /path/to/source/folder/ \\ + --count 20 \\ + --output generated_tests/ + + # Generate with teacher samples + source context + python -m test_data_generator.cli \\ + --teacher-samples test_samples/ \\ + --source-context /path/to/source/folder/ \\ + --count 20 \\ + --output generated_tests/ + + # Generate with specific complexity and high diversity + python -m test_data_generator.cli \\ + --teacher-samples test_samples/onboarding_intermediate.json \\ + --source-context /path/to/source/folder/ \\ + --count 10 \\ + --complexity medium \\ + --diversity 0.9 \\ + --output generated_tests/ + + # Just analyze domain without generating + python -m test_data_generator.cli \\ + --teacher-samples test_samples/ \\ + --source-context /path/to/source/folder/ \\ + --analyze-only \\ + --output analysis/ + """ + ) + + parser.add_argument( + '--teacher-samples', + required=False, + help='Path to teacher test samples (file or directory with JSON files). Optional - can generate from source context only.' + ) + + parser.add_argument( + '--count', + type=int, + default=10, + help='Number of test cases to generate (default: 10)' + ) + + parser.add_argument( + '--output', + required=True, + help='Output directory for generated tests' + ) + + parser.add_argument( + '--source-context', + dest='source_context', + required=True, + help='Path to source context file or directory (loads all source/config/doc files recursively). REQUIRED for domain understanding.' 
+ ) + + parser.add_argument( + '--complexity', + choices=['simple', 'medium', 'complex'], + help='Generate tests of specific complexity (default: mixed)' + ) + + parser.add_argument( + '--diversity', + type=float, + default=0.8, + help='Diversity factor (0-1): 0=similar to teachers, 1=very diverse (default: 0.8)' + ) + + parser.add_argument( + '--region', + default='us-west-2', + help='AWS region for Bedrock (default: us-west-2)' + ) + + parser.add_argument( + '--model-id', + default='us.anthropic.claude-opus-4-5-20251101-v1:0', + help='Bedrock model ID (default: Claude Opus 4.5)' + ) + + parser.add_argument( + '--temperature', + type=float, + default=0.8, + help='Generation temperature (0-1, default: 0.8)' + ) + + parser.add_argument( + '--analyze-only', + action='store_true', + help='Only analyze domain, do not generate tests' + ) + + parser.add_argument( + '--no-deduplicate', + action='store_true', + help='Disable automatic deduplication of test names (default: deduplicate enabled)' + ) + + parser.add_argument( + '--no-ensure-complex', + action='store_true', + help='Disable ensuring 20%% complex tests (default: ensure enabled)' + ) + + parser.add_argument( + '--use-two-pass-analysis', + action='store_true', + help='Use two-pass analysis for large source context (more comprehensive but slower)' + ) + + parser.add_argument( + '--loading-strategy', + choices=list(LOADING_STRATEGIES.keys()), + default='agent_evaluation', + help='Strategy for loading source context files (default: agent_evaluation)' + ) + + parser.add_argument( + '--verbose', + action='store_true', + help='Enable verbose logging' + ) + + args = parser.parse_args() + + # Set log level + if args.verbose: + logging.getLogger().setLevel(logging.DEBUG) + + try: + # Load teacher samples + logger.info("=" * 80) + logger.info("INTELLIGENT TEST DATA GENERATOR") + logger.info("=" * 80) + + teacher_samples = [] + if args.teacher_samples: + logger.info(f"\nStep 1a: Loading teacher samples from 
{args.teacher_samples}") + teacher_samples = load_teacher_samples(args.teacher_samples) + if not teacher_samples: + logger.error("No teacher samples found!") + sys.exit(1) + else: + logger.info("\nStep 1a: No teacher samples provided - will generate from source context only") + + # Load source context (always required) + logger.info(f"\nStep 1b: Loading source context from {args.source_context}") + loader = ContextLoader(strategy=args.loading_strategy) + source_context = loader.load(args.source_context) + + if not source_context: + logger.error("ERROR: Failed to load source context!") + logger.error(" Source context is required for domain understanding.") + sys.exit(1) + + # Create output directory + output_path = Path(args.output) + output_path.mkdir(parents=True, exist_ok=True) + logger.info(f"Output directory: {output_path}") + + # Initialize generator + generator = IntelligentTestGenerator( + region_name=args.region, + model_id=args.model_id, + temperature=args.temperature + ) + + if args.analyze_only: + # Just analyze and save + logger.info("\nAnalyzing domain (analysis-only mode)...") + analyzer = DomainAnalyzer( + args.region, + args.model_id, + use_two_pass_analysis=args.use_two_pass_analysis + ) + analysis = analyzer.analyze_test_samples(teacher_samples, source_context) + + analysis_file = output_path / "domain_analysis.json" + analyzer.save_analysis(analysis, str(analysis_file)) + + logger.info(f"\n{'=' * 80}") + logger.info("ANALYSIS COMPLETE") + logger.info(f"{'=' * 80}") + logger.info(f"\nDomain Analysis saved to: {analysis_file}") + logger.info(f"\nKey Findings:") + logger.info(f" - Domain: {analysis['domain_understanding'].get('domain_description', 'N/A')}") + logger.info(f" - Capabilities: {len(analysis['domain_understanding'].get('core_capabilities', []))}") + logger.info(f" - User Personas: {len(analysis['domain_understanding'].get('user_personas', []))}") + logger.info(f" - Complexity Levels: 
{list(analysis['complexity_distribution']['distribution'].keys())}") + + else: + # Generate tests + logger.info(f"\nGenerating {args.count} test cases...") + logger.info(f" - Complexity: {args.complexity or 'mixed'}") + logger.info(f" - Diversity: {args.diversity}") + logger.info(f" - Deduplication: {'disabled' if args.no_deduplicate else 'enabled'}") + logger.info(f" - Ensure complex: {'disabled' if args.no_ensure_complex else 'enabled (20%)'}") + logger.info(f" - Model: {args.model_id}") + + generated_tests = generator.generate_test_cases( + teacher_samples=teacher_samples, + count=args.count, + source_context=source_context, + complexity=args.complexity, + diversity_factor=args.diversity, + output_dir=str(output_path), + deduplicate=not args.no_deduplicate, + ensure_complex_tests=not args.no_ensure_complex + ) + + logger.info(f"\n{'=' * 80}") + logger.info("GENERATION COMPLETE") + logger.info(f"{'=' * 80}") + logger.info(f"\nGenerated {len(generated_tests)} test cases") + logger.info(f"Output directory: {output_path}") + logger.info(f"\nFiles created:") + logger.info(f" - domain_analysis.json (domain understanding)") + logger.info(f" - all_generated_tests.json (all tests in one file)") + logger.info(f" - generated_test_001.json, ... (individual test files)") + + # Print summary + complexity_counts = {} + for test in generated_tests: + comp = test.get('complexity', 'unknown') + complexity_counts[comp] = complexity_counts.get(comp, 0) + 1 + + logger.info(f"\nComplexity distribution:") + for comp, count in sorted(complexity_counts.items()): + logger.info(f" - {comp}: {count}") + + # Print some example test names + logger.info(f"\nExample test names:") + for i, test in enumerate(generated_tests[:5]): + logger.info(f" {i+1}. 
{test.get('name', 'Unnamed')}") + + logger.info(f"\n{'=' * 80}") + logger.info("SUCCESS") + logger.info(f"{'=' * 80}\n") + + except KeyboardInterrupt: + logger.info("\nWARN: Interrupted by user") + sys.exit(1) + except Exception as e: + logger.exception("FAIL: Generation failed") + sys.exit(1) + + +if __name__ == '__main__': + main() diff --git a/evaluation/test_data_generator/context_loader.py b/evaluation/test_data_generator/context_loader.py new file mode 100644 index 0000000..65e87d0 --- /dev/null +++ b/evaluation/test_data_generator/context_loader.py @@ -0,0 +1,406 @@ +"""Configurable context loading with task-specific strategies.""" + +import logging +from pathlib import Path +from typing import Dict, List, Optional, Callable +from dataclasses import dataclass, field + +logger = logging.getLogger(__name__) + + +@dataclass +class LoadingStrategy: + """Defines how to prioritize and load files based on task needs.""" + + name: str + description: str + + # File prioritization rules + priority_patterns: Dict[str, int] = field(default_factory=dict) + # filename_pattern -> priority_score + + extension_priorities: Dict[str, int] = field(default_factory=dict) + # .ext -> priority_score + + # Filters + required_patterns: List[str] = field(default_factory=list) + # Must include files matching these patterns + + exclude_patterns: List[str] = field(default_factory=list) + # Must exclude files matching these patterns + + # Size limits + max_file_size: int = 100 * 1024 # 100KB default + max_total_size: int = 500 * 1024 # 500KB total default + + # Modifiers + depth_penalty: int = 5 # Penalty per directory level + small_file_boost: int = 10 # Boost for files < 10KB + + # Custom scoring function (optional) + custom_scorer: Optional[Callable[[Path], int]] = None + + +# Pre-defined strategies for common tasks +LOADING_STRATEGIES = { + "agent_evaluation": LoadingStrategy( + name="agent_evaluation", + description="Load agent instructions, capabilities, and behavior rules", + 
priority_patterns={ + "power.md": 100, + "claude.md": 100, + "instructions.md": 90, + "readme.md": 80, + "capabilities.md": 85, + "rules.md": 85, + }, + extension_priorities={ + ".md": 50, + ".txt": 45, + ".rst": 40, + ".yaml": 30, + ".yml": 30, + ".json": 25, + ".toml": 25, + ".py": 20, + ".js": 15, + }, + required_patterns=["*.md"], # Must have at least one markdown file + ), + + "api_analysis": LoadingStrategy( + name="api_analysis", + description="Load API definitions, schemas, and configuration", + priority_patterns={ + "openapi.yaml": 100, + "swagger.yaml": 100, + "api.yaml": 90, + "schema.json": 85, + "endpoints.md": 80, + }, + extension_priorities={ + ".yaml": 60, + ".yml": 60, + ".json": 55, + ".md": 40, + ".py": 30, + ".js": 30, + }, + ), + + "code_understanding": LoadingStrategy( + name="code_understanding", + description="Load source code, with docs as context", + priority_patterns={ + "main.py": 90, + "app.py": 90, + "index.js": 90, + "readme.md": 70, + }, + extension_priorities={ + ".py": 60, + ".js": 55, + ".ts": 55, + ".go": 50, + ".rs": 50, + ".java": 45, + ".md": 40, + ".yaml": 30, + }, + ), + + "architecture_review": LoadingStrategy( + name="architecture_review", + description="Load architecture docs, design decisions, and diagrams", + priority_patterns={ + "architecture.md": 100, + "design.md": 95, + "adr.md": 90, # Architecture Decision Records + "decisions.md": 90, + "system-design.md": 85, + }, + extension_priorities={ + ".md": 60, + ".txt": 50, + ".mmd": 55, # Mermaid diagrams + ".puml": 55, # PlantUML + ".drawio": 45, + ".yaml": 30, + }, + ), + + "configuration_audit": LoadingStrategy( + name="configuration_audit", + description="Load configuration files and settings", + priority_patterns={ + "config.yaml": 100, + "settings.yaml": 95, + ".env.example": 90, + "defaults.yaml": 85, + }, + extension_priorities={ + ".yaml": 70, + ".yml": 70, + ".toml": 65, + ".ini": 60, + ".env": 60, + ".json": 55, + ".conf": 50, + ".md": 30, + }, + ), + + 
"generic": LoadingStrategy( + name="generic", + description="Balanced loading for general analysis (default)", + priority_patterns={ + "readme.md": 80, + "index.md": 75, + "main.py": 60, + "config.yaml": 60, + }, + extension_priorities={ + ".md": 50, + ".txt": 45, + ".yaml": 40, + ".yml": 40, + ".json": 35, + ".toml": 35, + ".py": 30, + ".js": 25, + ".ts": 25, + }, + ), +} + + +class ContextLoader: + """Load source context with configurable strategies.""" + + def __init__( + self, + strategy: str = "generic", + custom_strategy: Optional[LoadingStrategy] = None + ): + """Initialize context loader. + + Args: + strategy: Name of pre-defined strategy, or "custom" + custom_strategy: Custom LoadingStrategy if strategy="custom" + """ + if custom_strategy: + self.strategy = custom_strategy + elif strategy in LOADING_STRATEGIES: + self.strategy = LOADING_STRATEGIES[strategy] + else: + logger.warning(f"Unknown strategy '{strategy}', using 'generic'") + self.strategy = LOADING_STRATEGIES["generic"] + + logger.info(f"Using loading strategy: {self.strategy.name}") + + def load(self, path: str) -> Optional[str]: + """Load context from path using configured strategy. 
+ + Args: + path: File or directory path + + Returns: + Combined context string, or None if nothing loaded + """ + if not path: + return None + + path_obj = Path(path) + if not path_obj.exists(): + logger.warning(f"Path not found: {path}") + return None + + if path_obj.is_file(): + return self._load_single_file(path_obj) + elif path_obj.is_dir(): + return self._load_directory(path_obj) + + return None + + def _load_single_file(self, file_path: Path) -> Optional[str]: + """Load a single file.""" + try: + with open(file_path, 'r', encoding='utf-8', errors='ignore') as f: + content = f.read() + logger.info(f"Loaded {file_path.name} ({len(content)} chars)") + return content + except Exception as e: + logger.warning(f"Failed to load {file_path}: {e}") + return None + + def _load_directory(self, dir_path: Path) -> Optional[str]: + """Load files from directory using strategy.""" + + # Common directories to skip + SKIP_DIRS = { + '.git', '__pycache__', 'node_modules', '.venv', 'venv', + 'build', 'dist', 'target', '.pytest_cache', '.mypy_cache', + 'coverage', '.tox', '.eggs' + } + + # Known binary extensions to skip + BINARY_EXTS = { + '.pyc', '.so', '.dll', '.exe', '.bin', '.obj', '.o', + '.zip', '.tar', '.gz', '.bz2', '.xz', '.7z', + '.jpg', '.jpeg', '.png', '.gif', '.bmp', '.ico', + '.pdf', '.mp3', '.mp4', '.avi', '.mov', '.wav', + '.whl', '.jar', '.class' + } + + # Collect files with priority scores + file_scores = [] + + for file_path in dir_path.rglob('*'): + if not file_path.is_file(): + continue + + # Apply filters + if any(skip in file_path.parts for skip in SKIP_DIRS): + continue + + if file_path.suffix.lower() in BINARY_EXTS: + continue + + # Check file size + try: + file_size = file_path.stat().st_size + if file_size > self.strategy.max_file_size: + continue + if file_size == 0: + continue + except Exception: + continue + + # Check if text file + if not self._is_text_file(file_path): + continue + + # Calculate priority score + score = 
self._calculate_score(file_path, file_size, dir_path) + file_scores.append((score, file_path, file_size)) + + # Sort by score (highest first) + file_scores.sort(key=lambda x: x[0], reverse=True) + + # Load files up to max total size + contents = [] + total_size = 0 + + for score, file_path, file_size in file_scores: + if total_size + file_size > self.strategy.max_total_size: + logger.info(f"Reached max total size ({self.strategy.max_total_size} bytes)") + break + + try: + with open(file_path, 'r', encoding='utf-8', errors='ignore') as f: + content = f.read() + + if not content.strip(): + continue + + rel_path = file_path.relative_to(dir_path) + contents.append(f"# File: {rel_path}\n\n{content}") + logger.info(f" - Loaded {rel_path} (score: {score}, {file_size} bytes)") + + total_size += file_size + + except Exception as e: + logger.warning(f"Failed to load {file_path}: {e}") + + if not contents: + logger.warning(f"No files loaded from {dir_path}") + return None + + combined = ("\n\n" + "="*80 + "\n\n").join(contents) # parenthesized so the rule line separates every file + logger.info(f"Loaded {len(contents)} files ({total_size} bytes total)") + + return combined + + def _calculate_score( + self, + file_path: Path, + file_size: int, + base_path: Path + ) -> int: + """Calculate priority score for a file.""" + score = 0 + + # Use custom scorer if provided + if self.strategy.custom_scorer: + return self.strategy.custom_scorer(file_path) + + # Check priority patterns (filename matches) + filename_lower = file_path.name.lower() + for pattern, priority in self.strategy.priority_patterns.items(): + if pattern.lower() == filename_lower: + score += priority + break + + # Check extension priorities + ext_lower = file_path.suffix.lower() + if ext_lower in self.strategy.extension_priorities: + score += self.strategy.extension_priorities[ext_lower] + + # Depth penalty (prefer root-level files) + depth = len(file_path.relative_to(base_path).parts) - 1 + score -= depth * self.strategy.depth_penalty + + # Small file boost + if file_size 
< 10 * 1024: + score += self.strategy.small_file_boost + + return score + + def _is_text_file(self, file_path: Path) -> bool: + """Check if file is text by looking for null bytes.""" + try: + with open(file_path, 'rb') as f: + sample = f.read(512) + return b'\x00' not in sample + except Exception: + return False + + +def create_custom_strategy( + name: str, + description: str, + **kwargs +) -> LoadingStrategy: + """Helper to create custom loading strategies. + + Example: + strategy = create_custom_strategy( + name="my_task", + description="Load files for my specific task", + priority_patterns={"important.md": 100}, + extension_priorities={".py": 70, ".md": 50} + ) + """ + return LoadingStrategy( + name=name, + description=description, + **kwargs + ) + + +# Convenience function for backward compatibility +def load_source_context( + path: str, + strategy: str = "agent_evaluation" +) -> Optional[str]: + """Load source context using specified strategy. + + Args: + path: File or directory path + strategy: Loading strategy name + + Returns: + Loaded context string + """ + loader = ContextLoader(strategy=strategy) + return loader.load(path) diff --git a/evaluation/test_data_generator/deduplicate_tests.py b/evaluation/test_data_generator/deduplicate_tests.py new file mode 100644 index 0000000..70d31b5 --- /dev/null +++ b/evaluation/test_data_generator/deduplicate_tests.py @@ -0,0 +1,156 @@ +#!/usr/bin/env python3 +"""Deduplicate generated test cases based on name similarity.""" + +import json +import argparse +import sys +from pathlib import Path +from collections import defaultdict + +def deduplicate_tests(input_file: str, output_file: str, strategy: str = "keep_first"): + """Deduplicate tests based on name. 
+ + Args: + input_file: Input JSON file with all tests + output_file: Output JSON file with deduplicated tests + strategy: "keep_first", "keep_best", or "keep_all_unique" + """ + # Load tests + with open(input_file, 'r') as f: + tests = json.load(f) + + print(f"Loaded {len(tests)} tests") + + # Group by name + by_name = defaultdict(list) + for test in tests: + by_name[test['name']].append(test) + + # Find duplicates + duplicates = {name: tests for name, tests in by_name.items() if len(tests) > 1} + unique_names = {name: tests for name, tests in by_name.items() if len(tests) == 1} + + print(f"\nFound:") + print(f" - {len(unique_names)} unique test names") + print(f" - {len(duplicates)} duplicate test names") + print(f" - {sum(len(tests) for tests in duplicates.values())} total duplicate tests") + + # Deduplicate + deduplicated = [] + + # Add unique tests + for tests in unique_names.values(): + deduplicated.append(tests[0]) + + # Handle duplicates based on strategy + if strategy == "keep_first": + print(f"\nStrategy: Keep first instance of each duplicate") + for name, tests in duplicates.items(): + deduplicated.append(tests[0]) + print(f" - '{name}': keeping 1 of {len(tests)}") + + elif strategy == "keep_best": + print(f"\nStrategy: Keep test with most assertions") + for name, tests in duplicates.items(): + # Sort by number of assertions (descending) + sorted_tests = sorted(tests, key=lambda t: len(t.get('assertions', [])), reverse=True) + best = sorted_tests[0] + deduplicated.append(best) + print(f" - '{name}': keeping test with {len(best.get('assertions', []))} assertions (had {len(tests)} duplicates)") + + elif strategy == "keep_all_unique": + print(f"\nStrategy: Keep all tests but rename duplicates") + for name, tests in duplicates.items(): + for i, test in enumerate(tests): + if i == 0: + deduplicated.append(test) + else: + # Rename to make unique + test['name'] = f"{name} (variant {i+1})" + test['id'] = f"{test['id']}_v{i+1}" + deduplicated.append(test) + 
print(f" - '{name}': kept all {len(tests)} with unique names") + original_count = sum(len(group) for group in by_name.values()) # 'tests' is rebound by the grouping loops above + # Sort by ID for consistency + deduplicated.sort(key=lambda t: t['id']) + + # Save + with open(output_file, 'w') as f: + json.dump(deduplicated, f, indent=2) + + print(f"\nSaved {len(deduplicated)} deduplicated tests to {output_file}") + print(f"Reduction: {original_count} → {len(deduplicated)} ({original_count - len(deduplicated)} removed)") + + # Print summary + print(f"\n{'='*60}") + print("Deduplication Summary") + print(f"{'='*60}") + print(f"Original tests: {original_count}") + print(f"Unique tests: {len(deduplicated)}") + print(f"Tests removed: {original_count - len(deduplicated)}") + print(f"Reduction: {(original_count - len(deduplicated)) / max(original_count, 1) * 100:.1f}%") + + # Complexity distribution + complexity_dist = {} + for test in deduplicated: + comp = test.get('complexity', 'unknown') + complexity_dist[comp] = complexity_dist.get(comp, 0) + 1 + + print(f"\nComplexity distribution:") + for comp, count in sorted(complexity_dist.items()): + print(f" - {comp}: {count}") + + return deduplicated + + +def main(): + parser = argparse.ArgumentParser( + description="Deduplicate generated test cases", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + # Keep first instance of each duplicate (default) + python deduplicate_tests.py --input all_generated_tests.json --output deduplicated_tests.json + + # Keep test with most assertions + python deduplicate_tests.py --input all_generated_tests.json --output deduplicated_tests.json --strategy keep_best + + # Keep all tests but rename duplicates + python deduplicate_tests.py --input all_generated_tests.json --output deduplicated_tests.json --strategy keep_all_unique + """ + ) + + parser.add_argument( + '--input', + required=True, + help='Input JSON file with all generated tests' + ) + + parser.add_argument( + '--output', + required=True, + help='Output JSON file for deduplicated tests' + ) + + parser.add_argument( + '--strategy', + 
choices=['keep_first', 'keep_best', 'keep_all_unique'], + default='keep_first', + help='Deduplication strategy (default: keep_first)' + ) + + args = parser.parse_args() + + # Validate input exists + if not Path(args.input).exists(): + print(f"Error: Input file not found: {args.input}") + return 1 + + # Deduplicate + deduplicate_tests(args.input, args.output, args.strategy) + + return 0 + + +if __name__ == '__main__': + sys.exit(main()) diff --git a/evaluation/test_data_generator/domain_analyzer.py b/evaluation/test_data_generator/domain_analyzer.py new file mode 100644 index 0000000..a21d013 --- /dev/null +++ b/evaluation/test_data_generator/domain_analyzer.py @@ -0,0 +1,680 @@ +"""Domain analyzer that extracts patterns and requirements from teacher test samples.""" + +import json +import logging +import re +from typing import Dict, Any, List, Optional, Tuple +from collections import Counter +import boto3 + +logger = logging.getLogger(__name__) + + +class DomainAnalyzer: + """Analyzes teacher test samples to understand the domain and requirements.""" + + MAX_INSTRUCTION_CHARS = 200000 + + def __init__( + self, + region_name: str, + model_id: str, + use_two_pass_analysis: bool = False + ): + """Initialize domain analyzer. 
+ + Args: + region_name: AWS region for Bedrock + model_id: Model ID for analysis + use_two_pass_analysis: If True, use two-pass analysis for large instructions + """ + self.region_name = region_name + self.model_id = model_id + self.use_two_pass_analysis = use_two_pass_analysis + + # Configure Bedrock client with longer timeout + from botocore.config import Config + config = Config( + read_timeout=300, # 5 minutes + connect_timeout=60, + retries={'max_attempts': 3, 'mode': 'adaptive'} + ) + self.bedrock = boto3.client('bedrock-runtime', region_name=region_name, config=config) + + def analyze_test_samples( + self, + test_samples: List[Dict[str, Any]], + source_context: str + ) -> Dict[str, Any]: + """Analyze source context and optional teacher test samples to extract domain understanding. + + Args: + test_samples: List of teacher test samples (can be empty list) + source_context: Source code/docs content - REQUIRED for domain understanding + + Returns: + Analysis results with domain patterns, requirements, and characteristics + """ + if not source_context: + raise ValueError("source_context is required for domain understanding") + + mode = "teacher samples + source context" if test_samples else "source context only" + logger.info(f"Analyzing domain from {mode}...") + if test_samples: + logger.info(f" {len(test_samples)} teacher samples available") + logger.info(f" Using source context ({len(source_context)} chars)") + + # Extract structural patterns (only if samples exist) + structural_patterns = self._extract_structural_patterns(test_samples) if test_samples else self._get_default_structure() + + # Choose analysis strategy based on instruction size and configuration + if self.use_two_pass_analysis and source_context and len(source_context) > 80000: + logger.info("Using two-pass analysis for comprehensive instruction understanding...") + domain_understanding = self._two_pass_analysis(test_samples, source_context) + else: + # Standard analysis with smart chunking + 
domain_understanding = self._extract_domain_understanding( + test_samples, + source_context + ) + + # Combine results + analysis = { + "structural_patterns": structural_patterns, + "domain_understanding": domain_understanding, + "sample_count": len(test_samples), + "complexity_distribution": self._analyze_complexity(test_samples), + "assertion_patterns": self._analyze_assertions(test_samples), + } + + logger.info("Domain analysis complete") + return analysis + + def _extract_structural_patterns( + self, + test_samples: List[Dict[str, Any]] + ) -> Dict[str, Any]: + """Extract structural patterns from test samples.""" + patterns = { + "fields": {}, + "metadata_keys": set(), + "assertion_types": Counter(), + "tags": Counter(), + "complexity_levels": Counter(), + } + + for sample in test_samples: + # Track field presence + for field in sample.keys(): + if field not in patterns["fields"]: + patterns["fields"][field] = { + "count": 0, + "types": set(), + "examples": [] + } + patterns["fields"][field]["count"] += 1 + patterns["fields"][field]["types"].add(type(sample[field]).__name__) + if len(patterns["fields"][field]["examples"]) < 3: + patterns["fields"][field]["examples"].append(sample[field]) + + # Track metadata keys + if "metadata" in sample: + patterns["metadata_keys"].update(sample["metadata"].keys()) + + # Track assertion types + if "assertions" in sample: + for assertion in sample["assertions"]: + assertion_type = assertion.get("type", "unknown") + patterns["assertion_types"][assertion_type] += 1 + + # Track tags + if "tags" in sample: + patterns["tags"].update(sample["tags"]) + + # Track complexity + if "complexity" in sample: + patterns["complexity_levels"][sample["complexity"]] += 1 + + # Convert sets to lists for JSON serialization + patterns["metadata_keys"] = list(patterns["metadata_keys"]) + for field_data in patterns["fields"].values(): + field_data["types"] = list(field_data["types"]) + patterns["assertion_types"] = dict(patterns["assertion_types"]) + 
patterns["tags"] = dict(patterns["tags"]) + patterns["complexity_levels"] = dict(patterns["complexity_levels"]) + + return patterns + + def _get_default_structure(self) -> Dict[str, Any]: + """Get default test structure when no teacher samples are available.""" + return { + "fields": { + "id": {"count": 0, "types": ["str"], "examples": []}, + "name": {"count": 0, "types": ["str"], "examples": []}, + "user_message": {"count": 0, "types": ["str"], "examples": []}, + "description": {"count": 0, "types": ["str"], "examples": []}, + "complexity": {"count": 0, "types": ["str"], "examples": ["simple", "medium", "complex"]}, + "tags": {"count": 0, "types": ["list"], "examples": []}, + "max_turns": {"count": 0, "types": ["int"], "examples": [10]}, + "timeout_seconds": {"count": 0, "types": ["int"], "examples": [300]}, + "simulated_human_guidance": {"count": 0, "types": ["str"], "examples": []}, + "metadata": {"count": 0, "types": ["dict"], "examples": []}, + "assertions": {"count": 0, "types": ["list"], "examples": []} + }, + "metadata_keys": [], + "assertion_types": {"llm_judge": 0, "tool_called": 0}, + "tags": {}, + "complexity_levels": {"simple": 0, "medium": 0, "complex": 0} + } + + def _extract_domain_understanding( + self, + test_samples: List[Dict[str, Any]], + source_context: Optional[str] + ) -> Dict[str, Any]: + """Use LLM to extract deep domain understanding.""" + logger.info("Extracting domain understanding via LLM...") + + # Build prompt with smart chunking + prompt = self._build_analysis_prompt(test_samples, source_context) + + # Call LLM + try: + response = self._call_bedrock(prompt) + understanding = self._parse_analysis_response(response) + return understanding + except Exception as e: + logger.exception(f"LLM analysis failed: {e}") + return {"error": str(e)} + + def _two_pass_analysis( + self, + test_samples: List[Dict[str, Any]], + source_context: str + ) -> Dict[str, Any]: + """Two-pass analysis for comprehensive understanding of large instructions. 
+ + Pass 1: Analyze instructions alone to extract key capabilities and rules + Pass 2: Analyze test samples with condensed instruction summary + """ + logger.info("Pass 1: Analyzing power instructions...") + + # Pass 1: Extract instruction summary + instruction_summary = self._analyze_instructions(source_context) + + logger.info("Pass 2: Analyzing test samples with instruction context...") + + # Pass 2: Analyze test samples with condensed context + domain_understanding = self._extract_domain_understanding( + test_samples, + instruction_summary.get("condensed_instructions", source_context[:80000]) + ) + + # Merge instruction insights with domain understanding + domain_understanding["instruction_analysis"] = instruction_summary + + return domain_understanding + + def _analyze_instructions(self, source_context: str) -> Dict[str, Any]: + """Analyze power instructions alone to extract key information. + + Returns: + Dictionary with: + - core_capabilities: List of agent capabilities + - key_rules: Important behavioral rules + - success_criteria: Quality expectations + - condensed_instructions: Summarized version for downstream use + """ + prompt = f"""Analyze the following agent instructions and extract key information. + +# Agent Instructions +``` +{source_context} +``` + +# Task +Extract and summarize: +1. **Core Capabilities**: What can this agent do? +2. **Key Rules**: Critical behavioral constraints and guidelines +3. **Success Criteria**: Quality expectations and requirements +4. **Edge Cases**: Known failure modes or special scenarios +5. 
**Condensed Summary**: A comprehensive but condensed version (max 20K chars) preserving all critical information + +Provide your analysis in JSON format: + +```json +{{ + "core_capabilities": [ + {{"name": "capability1", "description": "what it does", "priority": "high|medium|low"}} + ], + "key_rules": [ + {{"rule": "rule description", "category": "category", "rationale": "why this matters"}} + ], + "success_criteria": {{ + "must_have": ["criterion1"], + "should_have": ["criterion2"], + "quality_signals": ["signal1"] + }}, + "edge_cases": [ + {{"scenario": "edge case", "handling": "how to handle"}} + ], + "condensed_instructions": "Comprehensive summary preserving all critical information..." +}} +```""" + + try: + response = self._call_bedrock(prompt, max_tokens=16000) + analysis = self._parse_analysis_response(response) + logger.info("Instruction analysis complete") + return analysis + except Exception as e: + logger.exception(f"Instruction analysis failed: {e}") + # Fallback to smart truncation + return { + "error": str(e), + "condensed_instructions": self._smart_truncate(source_context, 80000) + } + + def _smart_truncate( + self, + instructions: str, + max_chars: int, + preserve_structure: bool = True + ) -> str: + """Intelligently truncate instructions preserving key sections. 
+ + Args: + instructions: Full instruction text + max_chars: Maximum character limit + preserve_structure: If True, preserve markdown structure + + Returns: + Truncated instructions with preserved key sections + """ + if len(instructions) <= max_chars: + return instructions + + logger.info(f"Smart truncating instructions from {len(instructions)} to {max_chars} chars...") + + if preserve_structure: + # Extract key sections from markdown + sections = self._extract_key_sections(instructions) + + # Prioritize sections + prioritized = self._prioritize_sections(sections) + + # Build truncated content + truncated = [] + current_length = 0 + + for section_name, content in prioritized: + section_text = f"\n# {section_name}\n{content}\n" + if current_length + len(section_text) <= max_chars: + truncated.append(section_text) + current_length += len(section_text) + else: + # Add partial section if space remains + remaining = max_chars - current_length + if remaining > 200: # Only add if meaningful space left + truncated.append(section_text[:remaining] + "\n[... truncated ...]") + break + + result = "".join(truncated) + logger.info(f"Preserved {len(truncated)} key sections in truncated instructions") + return result + else: + # Simple truncation with warning + return instructions[:max_chars] + "\n\n[... truncated ...]" + + def _extract_key_sections(self, instructions: str) -> Dict[str, str]: + """Extract sections from markdown content. 
+ + Returns: + Dictionary mapping section names to their content + """ + sections = {} + current_section = "Introduction" + current_content = [] + + lines = instructions.split('\n') + + for line in lines: + # Check for markdown headers + header_match = re.match(r'^#+\s+(.+)$', line) + if header_match: + # Save previous section + if current_content: + sections[current_section] = '\n'.join(current_content) + + # Start new section + current_section = header_match.group(1) + current_content = [] + else: + current_content.append(line) + + # Save last section + if current_content: + sections[current_section] = '\n'.join(current_content) + + return sections + + def _prioritize_sections( + self, + sections: Dict[str, str] + ) -> List[Tuple[str, str]]: + """Prioritize sections by importance. + + Args: + sections: Dictionary of section name to content + + Returns: + List of (section_name, content) tuples in priority order + """ + # Define priority keywords (higher score = higher priority) + priority_keywords = { + 'capabilities': 100, + 'capability': 100, + 'features': 90, + 'rules': 85, + 'rule': 85, + 'requirements': 80, + 'behavior': 75, + 'guidelines': 70, + 'examples': 60, + 'example': 60, + 'edge cases': 55, + 'edge': 55, + 'scenarios': 50, + 'usage': 45, + 'overview': 40, + 'introduction': 30, + } + + def calculate_priority(section_name: str) -> int: + """Calculate priority score for a section. + + Uses the highest matching keyword score to avoid inflating + scores for sections with multiple keywords. 
+ """ + name_lower = section_name.lower() + max_score = 0 + for keyword, weight in priority_keywords.items(): + if keyword in name_lower: + max_score = max(max_score, weight) + return max_score + + # Sort sections by priority + prioritized = sorted( + sections.items(), + key=lambda x: calculate_priority(x[0]), + reverse=True + ) + + return prioritized + + def _build_analysis_prompt( + self, + test_samples: List[Dict[str, Any]], + source_context: str + ) -> str: + """Build prompt for domain analysis.""" + has_samples = bool(test_samples) + + if has_samples: + prompt = """You are analyzing source code/documentation and test cases for an AI agent to understand the domain, requirements, and testing patterns. + +# Your Task +Analyze the provided source context and teacher test samples to extract: +1. **Core capabilities** being tested +2. **Domain-specific patterns** and scenarios +3. **User personas** and interaction styles +4. **Success criteria** and quality expectations +5. **Edge cases** and complexity factors +6. **Assertion patterns** and what they validate + +""" + else: + prompt = """You are analyzing source code and documentation for an AI agent to understand the domain and generate appropriate test cases. + +# Your Task +From the provided source context, extract: +1. **Core capabilities** that should be tested +2. **Domain-specific patterns** and use cases +3. **User personas** who would interact with this system +4. **Success criteria** and quality expectations +5. **Edge cases** and complexity factors to test +6. **Appropriate assertion types** for validation + +""" + + # source_context is always provided now + # Smart truncation preserving structure + if len(source_context) > self.MAX_INSTRUCTION_CHARS: + logger.warning( + f"Source context is very large ({len(source_context)} chars, ~{len(source_context)//4} tokens). " + f"Using smart truncation to {self.MAX_INSTRUCTION_CHARS} chars (~{self.MAX_INSTRUCTION_CHARS//4} tokens) while preserving key sections. 
" + f"For comprehensive analysis, consider using --use-two-pass-analysis flag." + ) + truncated_instructions = self._smart_truncate( + source_context, + self.MAX_INSTRUCTION_CHARS, + preserve_structure=True + ) + else: + truncated_instructions = source_context + + prompt += f"""# Source Context (Code, Documentation, Configuration) +``` +{truncated_instructions} +``` + +""" + + if has_samples: + prompt += f"""# Teacher Test Samples ({len(test_samples)} samples) + +""" + + # Include representative samples + for i, sample in enumerate(test_samples[:3]): # Show max 3 full samples + prompt += f"""## Sample {i+1}: {sample.get('name', 'Unnamed')} +```json +{json.dumps(sample, indent=2)} +``` + +""" + else: + prompt += """# Test Generation Mode +No teacher samples provided. You will need to infer appropriate test structures from the source context above. + +""" + + prompt += """# Analysis Requirements + +Provide a comprehensive analysis in JSON format: + +```json +{ + "domain_description": "Brief description of what domain/system this tests", + "core_capabilities": [ + {"name": "capability1", "description": "what it does", "criticality": "high|medium|low"} + ], + "user_personas": [ + {"name": "persona1", "characteristics": "description", "typical_scenarios": ["scenario1", "scenario2"]} + ], + "interaction_patterns": [ + {"pattern": "pattern_name", "description": "what the pattern is", "frequency": "common|occasional|rare"} + ], + "success_criteria": { + "must_have": ["criterion1", "criterion2"], + "should_have": ["criterion3"], + "quality_signals": ["signal1", "signal2"] + }, + "complexity_factors": { + "simple": "what makes a test simple", + "medium": "what makes a test medium complexity", + "complex": "what makes a test complex" + }, + "assertion_categories": [ + {"type": "assertion_type", "purpose": "what it validates", "examples": ["example1", "example2"]} + ], + "edge_cases_to_test": [ + {"scenario": "edge case", "why_important": "reason", "suggested_complexity": 
"simple|medium|complex"} + ], + "generation_guidance": { + "key_dimensions": ["dimension1", "dimension2"], + "diversity_strategies": ["strategy1", "strategy2"], + "avoid_patterns": ["pattern1", "pattern2"] + } +} +``` + +Be thorough and specific. This analysis will guide automated test generation.""" + + return prompt + + def _call_bedrock(self, prompt: str, max_tokens: int = 8000) -> str: + """Call Bedrock API. + + Args: + prompt: The prompt to send + max_tokens: Maximum tokens for response (default: 8000) + + Returns: + Response text from the model + """ + body = json.dumps({ + "anthropic_version": "bedrock-2023-05-31", + "max_tokens": max_tokens, + "messages": [ + {"role": "user", "content": prompt} + ], + "temperature": 0.3 # Lower for analysis consistency + }) + + response = self.bedrock.invoke_model( + modelId=self.model_id, + body=body + ) + + response_body = json.loads(response['body'].read()) + return response_body['content'][0]['text'] + + def _parse_analysis_response(self, response: str) -> Dict[str, Any]: + """Parse LLM analysis response.""" + # Find JSON in response + json_match = re.search(r'\{[\s\S]*\}', response) + if not json_match: + logger.error("No JSON found in analysis response") + return {"raw_response": response} + + try: + analysis = json.loads(json_match.group(0)) + return analysis + except json.JSONDecodeError as e: + logger.error(f"Failed to parse analysis JSON: {e}") + return {"raw_response": response, "error": str(e)} + + def _analyze_complexity( + self, + test_samples: List[Dict[str, Any]] + ) -> Dict[str, Any]: + """Analyze complexity distribution.""" + complexity_counts = Counter() + complexity_characteristics = {} + + for sample in test_samples: + complexity = sample.get("complexity", "unknown") + complexity_counts[complexity] += 1 + + # Gather characteristics for each complexity level + if complexity not in complexity_characteristics: + complexity_characteristics[complexity] = { + "avg_assertions": [], + "avg_max_turns": [], + 
"example_scenarios": [] + } + + if "assertions" in sample: + complexity_characteristics[complexity]["avg_assertions"].append( + len(sample["assertions"]) + ) + + if "max_turns" in sample: + complexity_characteristics[complexity]["avg_max_turns"].append( + sample["max_turns"] + ) + + if len(complexity_characteristics[complexity]["example_scenarios"]) < 2: + complexity_characteristics[complexity]["example_scenarios"].append( + sample.get("name", sample.get("id", "unnamed")) + ) + + # Calculate averages + for complexity, chars in complexity_characteristics.items(): + if chars["avg_assertions"]: + chars["avg_assertions"] = sum(chars["avg_assertions"]) / len(chars["avg_assertions"]) + else: + chars["avg_assertions"] = 0 + + if chars["avg_max_turns"]: + chars["avg_max_turns"] = sum(chars["avg_max_turns"]) / len(chars["avg_max_turns"]) + else: + chars["avg_max_turns"] = 0 + + return { + "distribution": dict(complexity_counts), + "characteristics": complexity_characteristics, + "total_samples": len(test_samples) + } + + def _analyze_assertions( + self, + test_samples: List[Dict[str, Any]] + ) -> Dict[str, Any]: + """Analyze assertion patterns.""" + assertion_names = Counter() + assertion_types = Counter() + assertion_descriptions = [] + + for sample in test_samples: + if "assertions" in sample: + for assertion in sample["assertions"]: + assertion_names[assertion.get("name", "unnamed")] += 1 + assertion_types[assertion.get("type", "unknown")] += 1 + + if "description" in assertion: + assertion_descriptions.append({ + "name": assertion.get("name"), + "type": assertion.get("type"), + "description": assertion["description"] + }) + + return { + "common_assertion_names": dict(assertion_names.most_common(10)), + "assertion_types": dict(assertion_types), + "assertion_examples": assertion_descriptions[:5] + } + + def save_analysis(self, analysis: Dict[str, Any], output_path: str): + """Save analysis results to file. 
+ + Args: + analysis: Analysis results + output_path: Path to save JSON file + """ + with open(output_path, 'w') as f: + json.dump(analysis, f, indent=2) + logger.info(f"Analysis saved to {output_path}") + + def load_analysis(self, input_path: str) -> Dict[str, Any]: + """Load analysis results from file. + + Args: + input_path: Path to JSON file + + Returns: + Analysis results + """ + with open(input_path, 'r') as f: + return json.load(f) diff --git a/evaluation/test_data_generator/example.py b/evaluation/test_data_generator/example.py new file mode 100644 index 0000000..db5e4a0 --- /dev/null +++ b/evaluation/test_data_generator/example.py @@ -0,0 +1,246 @@ +#!/usr/bin/env python3 +"""Example usage of the intelligent test generator.""" + +import json +import logging +from pathlib import Path + +from .intelligent_generator import IntelligentTestGenerator +from .domain_analyzer import DomainAnalyzer + +# Setup logging +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + + +def example_basic_generation(): + """Example: Basic test generation from teacher samples.""" + print("\n" + "=" * 80) + print("EXAMPLE 1: Basic Test Generation") + print("=" * 80 + "\n") + + # Load teacher samples + test_data_dir = Path(__file__).parent.parent / "test_data" + teacher_sample_file = test_data_dir / "onboarding_intermediate.json" + + with open(teacher_sample_file, 'r') as f: + teacher_samples = json.load(f) + + logger.info(f"Loaded {len(teacher_samples)} teacher samples") + + # Initialize generator + generator = IntelligentTestGenerator( + region_name='us-west-2', + model_id='us.anthropic.claude-opus-4-5-20251101-v1:0', + temperature=0.8 + ) + + # Generate 5 tests + generated = generator.generate_test_cases( + teacher_samples=teacher_samples, + count=5, + diversity_factor=0.7, + output_dir="./example_output/basic" + ) + + logger.info(f"\nGenerated {len(generated)} tests") + for i, test in enumerate(generated, 1): + logger.info(f" {i}. 
{test['name']} (complexity: {test['complexity']})") + + +def example_with_power_md(): + """Example: Generation with POWER.md context.""" + print("\n" + "=" * 80) + print("EXAMPLE 2: Generation with POWER.md Context") + print("=" * 80 + "\n") + + # Load teacher samples + test_data_dir = Path(__file__).parent.parent / "test_data" + teacher_sample_file = test_data_dir / "onboarding_intermediate.json" + + with open(teacher_sample_file, 'r') as f: + teacher_samples = json.load(f) + + # Load POWER.md + power_md_path = Path("/path/to/file") + power_instructions = None + + if power_md_path.exists(): + with open(power_md_path, 'r') as f: + power_instructions = f.read() + logger.info(f"Loaded POWER.md ({len(power_instructions)} chars)") + else: + # generate_test_cases() raises ValueError when source_context is empty + logger.warning("POWER.md not found; source_context is required, skipping example") + return + + # Initialize generator + generator = IntelligentTestGenerator( + region_name='us-west-2', + model_id='us.anthropic.claude-opus-4-5-20251101-v1:0', + temperature=0.8 + ) + + # Generate tests with POWER.md context + generated = generator.generate_test_cases( + teacher_samples=teacher_samples, + count=3, + source_context=power_instructions, + diversity_factor=0.8, + output_dir="./example_output/with_power" + ) + + logger.info(f"\nGenerated {len(generated)} tests with POWER.md context") + + +def example_domain_analysis(): + """Example: Domain analysis without generation. + + Note: source_context is required for domain understanding. + This example uses a placeholder, but in practice you should load + from POWER.md, source code, or other documentation. 
+ """ + print("\n" + "=" * 80) + print("EXAMPLE 3: Domain Analysis Only") + print("=" * 80 + "\n") + + # Load teacher samples + test_data_dir = Path(__file__).parent.parent / "test_data" + teacher_sample_file = test_data_dir / "onboarding_intermediate.json" + + with open(teacher_sample_file, 'r') as f: + teacher_samples = json.load(f) + + # Initialize analyzer + analyzer = DomainAnalyzer( + region_name='us-west-2', + model_id='us.anthropic.claude-opus-4-5-20251101-v1:0' + ) + + # Load source context (required for domain understanding) + # In real usage, load from POWER.md or source code + # For this example, we'll use a placeholder + source_context = """ + # Example Agent Instructions + This is a placeholder for actual agent instructions, source code, or documentation. + In practice, you should load this from: + - POWER.md files with agent instructions + - Source code documentation + - Configuration files + - Any context that describes what the agent should do + """ + + # Analyze domain + analysis = analyzer.analyze_test_samples(teacher_samples, source_context) + + # Save analysis + output_path = "./example_output/analysis/domain_analysis.json" + Path(output_path).parent.mkdir(parents=True, exist_ok=True) + analyzer.save_analysis(analysis, output_path) + + # Print summary + logger.info("\nDomain Analysis Summary:") + logger.info(f" Domain: {analysis['domain_understanding'].get('domain_description', 'N/A')}") + logger.info(f" Core Capabilities: {len(analysis['domain_understanding'].get('core_capabilities', []))}") + logger.info(f" User Personas: {len(analysis['domain_understanding'].get('user_personas', []))}") + logger.info(f" Assertion Types: {list(analysis['assertion_patterns']['assertion_types'].keys())}") + logger.info(f"\nAnalysis saved to: {output_path}") + + +def example_high_diversity(): + """Example: High diversity generation for edge cases.""" + print("\n" + "=" * 80) + print("EXAMPLE 4: High Diversity Generation") + print("=" * 80 + "\n") + + # Load 
teacher samples + test_data_dir = Path(__file__).parent.parent / "test_data" + teacher_sample_file = test_data_dir / "onboarding_intermediate.json" + + with open(teacher_sample_file, 'r') as f: + teacher_samples = json.load(f) + + # Initialize generator + generator = IntelligentTestGenerator( + region_name='us-west-2', + model_id='us.anthropic.claude-opus-4-5-20251101-v1:0', + temperature=0.9 # Higher temperature for more creativity + ) + + # Generate with high diversity + generated = generator.generate_test_cases( + teacher_samples=teacher_samples, + count=5, + diversity_factor=0.95, # Very high diversity + output_dir="./example_output/high_diversity" + ) + + logger.info(f"\nGenerated {len(generated)} highly diverse tests") + logger.info("\nScenarios covered:") + for i, test in enumerate(generated, 1): + logger.info(f" {i}. {test['name']}") + logger.info(f" Complexity: {test['complexity']}, Assertions: {len(test.get('assertions', []))}") + + +def example_specific_complexity(): + """Example: Generate tests of specific complexity.""" + print("\n" + "=" * 80) + print("EXAMPLE 5: Specific Complexity Generation") + print("=" * 80 + "\n") + + # Load teacher samples + test_data_dir = Path(__file__).parent.parent / "test_data" + teacher_sample_file = test_data_dir / "onboarding_intermediate.json" + + with open(teacher_sample_file, 'r') as f: + teacher_samples = json.load(f) + + # Initialize generator + generator = IntelligentTestGenerator( + region_name='us-west-2', + model_id='us.anthropic.claude-opus-4-5-20251101-v1:0' + ) + + # Generate only complex tests + generated = generator.generate_test_cases( + teacher_samples=teacher_samples, + count=3, + complexity='complex', + diversity_factor=0.8, + output_dir="./example_output/complex_only" + ) + + logger.info(f"\nGenerated {len(generated)} complex tests") + for test in generated: + logger.info(f" - {test['name']}: {len(test.get('assertions', []))} assertions") + + +def main(): + """Run all examples.""" + print("\n" + "=" * 
80) + print("INTELLIGENT TEST GENERATOR - EXAMPLES") + print("=" * 80) + + examples = [ + ("Basic Generation", example_basic_generation), + ("With POWER.md", example_with_power_md), + ("Domain Analysis", example_domain_analysis), + ("High Diversity", example_high_diversity), + ("Specific Complexity", example_specific_complexity), + ] + + print("\nAvailable examples:") + for i, (name, _) in enumerate(examples, 1): + print(f" {i}. {name}") + + print("\nRunning Example 3 (Domain Analysis) as it's quickest...") + print("To run other examples, call them directly from this file.\n") + + # Run domain analysis example (quickest) + example_domain_analysis() + + print("\n" + "=" * 80) + print("Example complete! Check ./example_output/ for results") + print("=" * 80 + "\n") + + +if __name__ == '__main__': + main() diff --git a/evaluation/test_data_generator/intelligent_generator.py b/evaluation/test_data_generator/intelligent_generator.py new file mode 100644 index 0000000..ebaf58b --- /dev/null +++ b/evaluation/test_data_generator/intelligent_generator.py @@ -0,0 +1,648 @@ +"""Intelligent test data generator that understands the task and generates diverse samples.""" + +import json +import logging +from typing import Dict, Any, List, Optional +from pathlib import Path +import boto3 + +from .domain_analyzer import DomainAnalyzer + +logger = logging.getLogger(__name__) + + +class IntelligentTestGenerator: + """Generates test cases based on domain understanding from teacher samples.""" + + def __init__( + self, + region_name: str, + model_id: str, + temperature: float = 0.8 + ): + """Initialize intelligent test generator. 
+ + Args: + region_name: AWS region for Bedrock + model_id: Model ID for generation + temperature: Temperature for creative generation (0-1) + """ + self.region_name = region_name + self.model_id = model_id + self.temperature = temperature + + # Configure Bedrock client with longer timeout + from botocore.config import Config + config = Config( + read_timeout=300, # 5 minutes + connect_timeout=60, + retries={'max_attempts': 3, 'mode': 'adaptive'} + ) + self.bedrock = boto3.client('bedrock-runtime', region_name=region_name, config=config) + self.analyzer = DomainAnalyzer(region_name, model_id) + + def generate_test_cases( + self, + teacher_samples: List[Dict[str, Any]], + count: int, + source_context: str, + complexity: Optional[str] = None, + diversity_factor: float = 0.8, + output_dir: Optional[str] = None, + deduplicate: bool = True, + ensure_complex_tests: bool = True + ) -> List[Dict[str, Any]]: + """Generate diverse test cases based on source context and optional teacher samples. + + Args: + teacher_samples: Teacher test samples to learn from (can be empty list) + count: Number of test cases to generate + source_context: Source context (POWER.md, code, etc.) 
- REQUIRED for domain understanding + complexity: Optional complexity filter (simple/medium/complex) + diversity_factor: How diverse to make tests (0=similar, 1=very diverse) + output_dir: Optional directory to save generated tests + deduplicate: Remove duplicate test names (default: True) + ensure_complex_tests: Ensure at least 20% complex tests (default: True) + + Returns: + List of generated test cases + """ + if not source_context: + raise ValueError("source_context is required for domain understanding") + if count <= 0: + # Guards the batch-size and percentage divisions below + raise ValueError("count must be a positive integer") + + mode = "teacher samples + source context" if teacher_samples else "source context only" + logger.info(f"Generating {count} test cases from {mode}...") + if teacher_samples: + logger.info(f" Using {len(teacher_samples)} teacher samples") + logger.info(f" Using source context ({len(source_context)} chars)") + + # Step 1: Analyze domain + logger.info(f"Step 1: Analyzing domain from {mode}...") + domain_analysis = self.analyzer.analyze_test_samples( + teacher_samples, + source_context + ) + + # Save analysis if output directory provided + if output_dir: + Path(output_dir).mkdir(parents=True, exist_ok=True) + analysis_path = Path(output_dir) / "domain_analysis.json" + self.analyzer.save_analysis(domain_analysis, str(analysis_path)) + + # Step 2: Calculate target complexity distribution + target_complex = 0 + if ensure_complex_tests and not complexity: + target_complex = max(int(count * 0.2), 1) # About 20% complex (rounded down), minimum 1 + logger.info(f"Target: {target_complex} complex tests ({target_complex/count*100:.0f}%)") + + # Step 3: Generate test cases in batches with diversity enforcement + logger.info("Step 2: Generating test cases...") + generated_tests = [] + seen_names = set() # Track names for deduplication + batch_size = min(5, count) + num_batches = (count + batch_size - 1) // batch_size + + # Add extra batches to account for tests dropped by deduplication + if deduplicate: + num_batches = int(num_batches * 1.5) # Up to 50% more batches for filtering + + for batch_idx in 
range(num_batches): + if len(generated_tests) >= count: + break + + batch_count = min(batch_size, count - len(generated_tests) + 3) # +3 buffer + logger.info(f"Generating batch {batch_idx + 1}/{num_batches} ({batch_count} tests)...") + + # Adjust complexity for this batch + batch_complexity = complexity + if ensure_complex_tests and not complexity: + # Ensure we generate enough complex tests + complex_so_far = sum(1 for t in generated_tests if t.get('complexity') == 'complex') + if complex_so_far < target_complex and batch_idx >= num_batches // 2: + batch_complexity = 'complex' + logger.info(f" Focusing on complex tests (have {complex_so_far}/{target_complex})") + + batch_tests = self._generate_batch( + teacher_samples=teacher_samples, + domain_analysis=domain_analysis, + count=batch_count, + complexity=batch_complexity, + diversity_factor=diversity_factor, + batch_idx=batch_idx, + existing_names=seen_names # Pass for deduplication + ) + + # Add unique tests only + for test in batch_tests: + test_name = test.get('name', '') + if not deduplicate or test_name not in seen_names: + generated_tests.append(test) + seen_names.add(test_name) + if len(generated_tests) >= count: + break + + # Trim to exact count + generated_tests = generated_tests[:count] + + # Step 4: Validate and post-process + logger.info("Step 3: Validating generated tests...") + validated_tests = self._validate_and_fix_tests(generated_tests, teacher_samples) + + # Step 5: Final quality checks + logger.info("Step 4: Final quality checks...") + validated_tests = self._final_quality_pass(validated_tests, count, ensure_complex_tests) + + # Reassign IDs after final quality pass to ensure sequential numbering + for i, test in enumerate(validated_tests): + test["id"] = f"generated_{i+1:03d}" + + # Save tests if output directory provided + if output_dir: + for i, test in enumerate(validated_tests): + test_file = Path(output_dir) / f"generated_test_{i+1:03d}.json" + with open(test_file, 'w') as f: + 
json.dump([test], f, indent=2) + logger.debug(f"Saved test to {test_file}") + + # Also save all tests in one file + all_tests_file = Path(output_dir) / "all_generated_tests.json" + with open(all_tests_file, 'w') as f: + json.dump(validated_tests, f, indent=2) + logger.info(f"Saved all tests to {all_tests_file}") + + logger.info(f"Successfully generated {len(validated_tests)} test cases") + return validated_tests + + def _generate_batch( + self, + teacher_samples: List[Dict[str, Any]], + domain_analysis: Dict[str, Any], + count: int, + complexity: Optional[str], + diversity_factor: float, + batch_idx: int, + existing_names: Optional[set] = None + ) -> List[Dict[str, Any]]: + """Generate a batch of test cases.""" + prompt = self._build_generation_prompt( + teacher_samples=teacher_samples, + domain_analysis=domain_analysis, + count=count, + complexity=complexity, + diversity_factor=diversity_factor, + batch_idx=batch_idx, + existing_names=existing_names + ) + + # Retry logic for timeouts + max_retries = 2 + for attempt in range(max_retries + 1): + try: + response = self._call_bedrock(prompt) + tests = self._parse_generation_response(response) + return tests + except Exception as e: + if attempt < max_retries and ('timeout' in str(e).lower() or 'timed out' in str(e).lower()): + logger.warning(f"Batch generation attempt {attempt + 1} timed out, retrying...") + continue + else: + logger.exception(f"Batch generation failed after {attempt + 1} attempts: {e}") + return [] + + logger.error("Batch generation exited retry loop without returning a result; returning empty list.") + return [] + + def _build_generation_prompt( + self, + teacher_samples: List[Dict[str, Any]], + domain_analysis: Dict[str, Any], + count: int, + complexity: Optional[str], + diversity_factor: float, + batch_idx: int, + existing_names: Optional[set] = None + ) -> str: + """Build prompt for test case generation.""" + has_samples = bool(teacher_samples) + + prompt = f"""You are an expert test case 
generator for AI agent evaluation. Your task is to generate {count} diverse, realistic test cases. + +# Domain Context + +## Domain Understanding +{json.dumps(domain_analysis.get("domain_understanding", {}), indent=2)} + +## Structural Patterns +{json.dumps(domain_analysis.get("structural_patterns", {}), indent=2)} + +## Complexity Distribution +{json.dumps(domain_analysis.get("complexity_distribution", {}), indent=2)} + +""" + + if has_samples: + prompt += """# Teacher Test Samples (Learn from these) + +""" + # Show 2-3 representative teacher samples + num_examples = min(3, len(teacher_samples)) + for i in range(num_examples): + sample = teacher_samples[i % len(teacher_samples)] + prompt += f"""## Teacher Sample {i+1} +```json +{json.dumps(sample, indent=2)} +``` + +""" + prompt += f"""# Generation Requirements + +Generate **{count} NEW, DIVERSE test cases** that: + +1. **Follow the same structure** as teacher samples (same fields, assertion patterns) +2. **Test different scenarios** - do NOT simply copy teacher samples with minor changes +3. **Maintain quality** - assertions should be specific and validate real capabilities +4. **Match complexity**: {complexity if complexity else "Mix of simple (30%), medium (50%), complex (20%)"} +5. **Diversity level**: {diversity_factor:.1f} (0=similar to teachers, 1=very different scenarios)""" + else: + prompt += f"""# Test Structure Requirements + +Since no teacher samples are available, generate tests with this structure: +- **id**: Unique identifier (generated_001, generated_002, etc.) 
+- **name**: Descriptive name +- **user_message** or **prompt**: User's initial message/request +- **description**: What this test validates +- **complexity**: simple, medium, or complex +- **tags**: Array of relevant tags +- **max_turns**: Expected conversation length (default: 10) +- **timeout_seconds**: Timeout for the test (default: 300) +- **simulated_human_guidance**: Detailed instructions for simulated user behavior +- **metadata**: Domain-specific metadata (domain, scenario_type, etc.) +- **assertions**: Array of assertion objects with: + - name: assertion identifier + - type: llm_judge, tool_called, transcript_not_contains, etc. + - description: What is being validated + - check: The validation criteria or tool name + +# Generation Requirements + +Generate **{count} NEW, DIVERSE test cases** that: + +1. **Test different capabilities** identified in the domain understanding above +2. **Cover various scenarios** - include both common and edge cases +3. **Maintain quality** - assertions should be specific and validate real capabilities +4. **Match complexity**: {complexity if complexity else "Mix of simple (30%), medium (50%), complex (20%)"} +5. 
**Diversity level**: {diversity_factor:.1f} (0=similar scenarios, 1=very different scenarios) + +## Diversity Guidelines (factor: {diversity_factor:.1f}) +""" + + if diversity_factor >= 0.7: + prompt += """ +- Explore edge cases and unusual scenarios +- Test failure modes and error handling +- Include different user personas and skill levels +- Vary interaction patterns significantly +- Test boundary conditions +""" + elif diversity_factor >= 0.4: + prompt += """ +- Cover different aspects of core capabilities +- Include variations in user requests +- Test both happy paths and some error cases +- Moderate variation in complexity and scope +""" + else: + prompt += """ +- Stay close to teacher sample patterns +- Focus on core capability variations +- Mostly happy path scenarios +- Similar complexity and scope +""" + + # Add deduplication instructions + if existing_names: + prompt += f""" +## CRITICAL: Avoid Duplicates +The following test names have already been generated - DO NOT create tests with these names: +{chr(10).join(f' - "{name}"' for name in sorted(existing_names)[:20])} +{' ... and more' if len(existing_names) > 20 else ''} + +Create COMPLETELY NEW scenarios with UNIQUE names. +""" + + # Add batch-specific guidance for diversity across batches + if batch_idx > 0: + prompt += f""" +## Batch #{batch_idx + 1} Focus +This is batch {batch_idx + 1}. Generate scenarios that are distinct from earlier batches. +Focus on: {self._get_batch_focus(batch_idx, domain_analysis)} +""" + + prompt += """ +# Output Format + +Return a JSON array of test cases. Each test must include: +- **id**: Unique identifier (generated_001, generated_002, etc.) 
+- **name**: Descriptive name +- **user_message** or **prompt**: User's initial message +- **description**: What this test validates +- **complexity**: simple, medium, or complex +- **tags**: Array of relevant tags +- **max_turns**: Expected conversation length +- **timeout_seconds**: Timeout for the test +- **simulated_human_guidance**: Detailed instructions for simulated user behavior +- **metadata**: Domain-specific metadata (domain, source_platform, target_platform, scenario_type, etc.) +- **assertions**: Array of assertion objects with: + - name: assertion identifier + - type: llm_judge, tool_called, transcript_not_contains, etc. + - description: What is being validated + - check: The validation criteria or tool name + +**CRITICAL**: Generate COMPLETE, VALID test cases. Do not use placeholders like "..." or "etc." + +```json +[ + { + "id": "generated_001", + "name": "...", + "user_message": "...", + "description": "...", + "complexity": "...", + "tags": [...], + "max_turns": ..., + "timeout_seconds": ..., + "simulated_human_guidance": "...", + "metadata": {...}, + "assertions": [...] 
+ } +] +``` + +""" + + # Appended as an f-string so {count} is interpolated; the template above is a plain string + prompt += f"Generate exactly {count} complete test cases now:" + + return prompt + + def _get_batch_focus(self, batch_idx: int, domain_analysis: Dict[str, Any]) -> str: + """Get focus area for a specific batch to ensure diversity.""" + domain_understanding = domain_analysis.get("domain_understanding", {}) + + # Extract focus areas from domain analysis + capabilities = domain_understanding.get("core_capabilities", []) + personas = domain_understanding.get("user_personas", []) + edge_cases = domain_understanding.get("edge_cases_to_test", []) + + focus_areas = [] + + # Rotate through different aspects + if batch_idx % 3 == 0 and capabilities: + # Focus on specific capabilities + cap_idx = (batch_idx // 3) % len(capabilities) + cap = capabilities[cap_idx] + focus_areas.append(f"capability '{cap.get('name', 'unknown')}'") + + if batch_idx % 3 == 1 and personas: + # Focus on specific persona + persona_idx = (batch_idx // 3) % len(personas) + persona = personas[persona_idx] + focus_areas.append(f"user persona '{persona.get('name', 'unknown')}'") + + if batch_idx % 3 == 2 and edge_cases: + # Focus on edge cases + edge_idx = (batch_idx // 3) % len(edge_cases) + edge = edge_cases[edge_idx] + focus_areas.append(f"edge case: {edge.get('scenario', 'unknown')}") + + if not focus_areas: + focus_areas.append("alternative scenarios and variations") + + return ", ".join(focus_areas) + + def _call_bedrock(self, prompt: str) -> str: + """Call Bedrock API.""" + body = json.dumps({ + "anthropic_version": "bedrock-2023-05-31", + "max_tokens": 16000, # Large for multiple test cases + "messages": [ + {"role": "user", "content": prompt} + ], + "temperature": self.temperature + }) + + response = self.bedrock.invoke_model( + modelId=self.model_id, + body=body + ) + + response_body = json.loads(response['body'].read()) + return response_body['content'][0]['text'] + + def _parse_generation_response(self, response: str) -> List[Dict[str, Any]]: + """Parse LLM generation response.""" + 
import re + + # Find JSON array in response + json_match = re.search(r'\[[\s\S]*\]', response) + if not json_match: + logger.error("No JSON array found in generation response") + return [] + + try: + tests = json.loads(json_match.group(0)) + if not isinstance(tests, list): + logger.error("Response is not a list") + return [] + return tests + except json.JSONDecodeError as e: + logger.error(f"Failed to parse generation JSON: {e}") + return [] + + def _validate_and_fix_tests( + self, + generated_tests: List[Dict[str, Any]], + teacher_samples: List[Dict[str, Any]] + ) -> List[Dict[str, Any]]: + """Validate and fix generated test cases.""" + validated = [] + + # Get required fields from teacher samples or use defaults + if teacher_samples: + required_fields = set(teacher_samples[0].keys()) + else: + # Default required fields when no teacher samples + required_fields = { + "id", "name", "description", "complexity", "tags", + "max_turns", "timeout_seconds", "simulated_human_guidance", + "metadata", "assertions" + } + + for i, test in enumerate(generated_tests): + try: + # Ensure required fields (ID will be assigned after final quality pass) + missing_fields = required_fields - set(test.keys()) + if missing_fields: + logger.warning(f"Test {i} missing required fields: {sorted(missing_fields)}") + for field in missing_fields: + if field in {"tags", "assertions"}: + test[field] = [] + elif field in {"metadata"}: + test[field] = {} + elif field in {"max_turns", "timeout_seconds"}: + test[field] = 0 + elif field in {"complexity"}: + test[field] = "medium" + else: + test[field] = "" + if not test.get("name"): + test["name"] = f"Generated Test {i+1}" + + if not test.get("description"): + test["description"] = test.get("name", "Generated test case") + + # Ensure complexity + if not test.get("complexity") or test.get("complexity") not in ["simple", "medium", "complex"]: + test["complexity"] = "medium" + + # Ensure metadata + if "metadata" not in test: + test["metadata"] = {} + + # 
Ensure assertions exist + if "assertions" not in test or not test["assertions"]: + logger.warning(f"Test {test.get('id') or i} has no assertions, skipping") + continue + + # Ensure user_message or prompt + if not test.get("user_message") and not test.get("prompt"): + test["user_message"] = f"User message for {test['name']}" + + # Ensure simulated_human_guidance + if not test.get("simulated_human_guidance"): + test["simulated_human_guidance"] = f"Simulated user behavior for {test['name']}" + + # Ensure reasonable defaults + if not test.get("max_turns"): + test["max_turns"] = 10 + + if not test.get("timeout_seconds"): + test["timeout_seconds"] = 300 + + if not test.get("tags"): + test["tags"] = ["generated"] + + # Validate assertions + valid_assertions = [] + for assertion in test.get("assertions", []): + if isinstance(assertion, dict) and assertion.get("name") and assertion.get("type") and assertion.get("check"): + valid_assertions.append(assertion) + else: + logger.warning(f"Invalid assertion in test {test.get('id') or i}: {assertion}") + + test["assertions"] = valid_assertions + + if valid_assertions: + validated.append(test) + else: + logger.warning(f"Test {test.get('id') or i} has no valid assertions, skipping") + + except Exception as e: + logger.exception(f"Failed to validate test {i}: {e}") + + logger.info(f"Validated {len(validated)}/{len(generated_tests)} tests") + return validated + + def generate_from_analysis( + self, + domain_analysis_path: str, + teacher_samples: List[Dict[str, Any]], + count: int, + **kwargs + ) -> List[Dict[str, Any]]: + """Generate test cases from pre-computed domain analysis. 
+ + Args: + domain_analysis_path: Path to domain analysis JSON + teacher_samples: Teacher samples (for structure reference) + count: Number of tests to generate + **kwargs: Additional generation parameters + + Returns: + Generated test cases + """ + logger.info(f"Loading domain analysis from {domain_analysis_path}") + domain_analysis = self.analyzer.load_analysis(domain_analysis_path) + + generated_tests = [] + batch_size = max(1, min(5, count))  # never 0, so the division below is safe when count <= 0 + num_batches = (count + batch_size - 1) // batch_size + + for batch_idx in range(num_batches): + batch_count = min(batch_size, count - len(generated_tests)) + batch_tests = self._generate_batch( + teacher_samples=teacher_samples, + domain_analysis=domain_analysis, + count=batch_count, + complexity=kwargs.get("complexity"), + diversity_factor=kwargs.get("diversity_factor", 0.8), + batch_idx=batch_idx + ) + generated_tests.extend(batch_tests) + + validated_tests = self._validate_and_fix_tests(generated_tests, teacher_samples) + return validated_tests + + def _final_quality_pass( + self, + tests: List[Dict[str, Any]], + target_count: int, + ensure_complex: bool + ) -> List[Dict[str, Any]]: + """Final quality check: deduplication and complexity balance. 
+ + Args: + tests: Validated test cases + target_count: Target number of tests + ensure_complex: Whether to ensure complex test quota + + Returns: + Quality-checked tests + """ + # Deduplication by name + seen_names = {} + deduplicated = [] + + for test in tests: + name = test.get('name', '') + if name not in seen_names: + deduplicated.append(test) + seen_names[name] = test + else: + # Keep the one with more assertions + existing = seen_names[name] + if len(test.get('assertions', [])) > len(existing.get('assertions', [])): + # Replace with better version + deduplicated.remove(existing) + deduplicated.append(test) + seen_names[name] = test + + if len(deduplicated) < len(tests): + logger.info(f"Removed {len(tests) - len(deduplicated)} duplicate tests") + + # Check complexity distribution + complexity_counts = {} + for test in deduplicated: + comp = test.get('complexity', 'medium') + complexity_counts[comp] = complexity_counts.get(comp, 0) + 1 + + logger.info(f"Complexity distribution: {complexity_counts}") + + # Warn if complex tests are missing + if ensure_complex and len(deduplicated) >= 10: + complex_count = complexity_counts.get('complex', 0) + target_complex = max(int(len(deduplicated) * 0.2), 1) + if complex_count < target_complex: + logger.warning( + f"Only {complex_count}/{target_complex} complex tests. " + f"Consider regenerating with --ensure-complex-tests or generating more batches." 
+ ) + + return deduplicated[:target_count] diff --git a/evaluation/test_data_generator/test_basic.py b/evaluation/test_data_generator/test_basic.py new file mode 100644 index 0000000..3e44167 --- /dev/null +++ b/evaluation/test_data_generator/test_basic.py @@ -0,0 +1,92 @@ +#!/usr/bin/env python3 +"""Basic smoke tests for the intelligent test generator - can be run as a script.""" + +import json +import sys +from pathlib import Path + +# Import using absolute imports when run as script +if __name__ == '__main__': + # Add parent to path for imports when run as script + parent_dir = str(Path(__file__).parent.parent.parent) + if parent_dir not in sys.path: + sys.path.insert(0, parent_dir) + from evaluation.test_data_generator.domain_analyzer import DomainAnalyzer +else: + from .domain_analyzer import DomainAnalyzer + +def run_smoke_tests(): + """Run basic smoke tests without AWS credentials.""" + print("Running basic smoke tests (no AWS required)...\n") + + # Test that we can load a test sample + test_data_dir = Path(__file__).parent.parent / "test_samples" + test_file = test_data_dir / "onboarding_intermediate.json" + + if test_file.exists(): + with open(test_file, 'r') as f: + samples = json.load(f) + print(f"PASS: Loaded {len(samples)} test samples from {test_file.name}") + else: + print(f"FAIL: Test file not found: {test_file}") + return False + + # Create analyzer (the methods exercised below make no API calls) + try: + analyzer = DomainAnalyzer( + region_name='us-west-2', + model_id='us.anthropic.claude-opus-4-5-20251101-v1:0' + ) + print("PASS: DomainAnalyzer initialized") + except Exception as e: + print(f"FAIL: DomainAnalyzer init failed (may need AWS config): {e}") + return False + + # Test structural pattern extraction (no API call) + try: + patterns = analyzer._extract_structural_patterns(samples) + print("PASS: Structural analysis works") + print(f" - Found {len(patterns['fields'])} field types") + 
print(f" - Found {len(patterns['metadata_keys'])} metadata keys") + print(f" - Found {len(patterns['assertion_types'])} assertion types") + except Exception as e: + print(f"FAIL: Structural analysis failed: {e}") + import traceback + traceback.print_exc() + return False + + # Test complexity analysis (no API call) + try: + complexity_analysis = analyzer._analyze_complexity(samples) + print("PASS: Complexity analysis works") + print(f" - Distribution: {complexity_analysis['distribution']}") + except Exception as e: + print(f"FAIL: Complexity analysis failed: {e}") + import traceback + traceback.print_exc() + return False + + # Test assertion analysis (no API call) + try: + assertion_analysis = analyzer._analyze_assertions(samples) + print("PASS: Assertion analysis works") + print(f" - Assertion types: {list(assertion_analysis['assertion_types'].keys())}") + except Exception as e: + print(f"FAIL: Assertion analysis failed: {e}") + import traceback + traceback.print_exc() + return False + + print("\n" + "=" * 60) + print("PASS: ALL BASIC SMOKE TESTS PASSED") + print("=" * 60) + print("\nThe test data generator is ready to use!") + print("\nTo generate tests (requires AWS/Bedrock):") + print(" python -m evaluation.test_data_generator.cli --help") + print("\nTo run full unit tests:") + print(" pytest evaluation/test_data_generator/") + return True + +if __name__ == '__main__': + success = run_smoke_tests() + sys.exit(0 if success else 1) diff --git a/evaluation/test_data_generator/test_units.py b/evaluation/test_data_generator/test_units.py new file mode 100644 index 0000000..427a868 --- /dev/null +++ b/evaluation/test_data_generator/test_units.py @@ -0,0 +1,403 @@ +#!/usr/bin/env python3 +"""Unit tests for test data generator modules (no AWS credentials required).""" + +import unittest +import json +import tempfile +from pathlib import Path +from unittest.mock import patch, MagicMock + +# Import modules to test +from evaluation.test_data_generator.context_loader 
import ( + LoadingStrategy, + ContextLoader, + LOADING_STRATEGIES, + create_custom_strategy +) +from evaluation.test_data_generator.deduplicate_tests import deduplicate_tests +from evaluation.test_data_generator.domain_analyzer import DomainAnalyzer + + +class TestContextLoader(unittest.TestCase): + """Test ContextLoader against files in a temporary directory.""" + + def setUp(self): + """Set up test fixtures.""" + self.temp_dir = tempfile.mkdtemp() + self.temp_path = Path(self.temp_dir) + + def tearDown(self): + """Clean up temp directory.""" + import shutil + shutil.rmtree(self.temp_dir, ignore_errors=True) + + def test_loading_strategy_creation(self): + """Test LoadingStrategy can be created with valid config.""" + strategy = LoadingStrategy( + name="test_strategy", + description="Test description", + priority_patterns={'main.py': 100}, + extension_priorities={'.py': 50}, + max_file_size=50000 + ) + self.assertEqual(strategy.name, "test_strategy") + self.assertEqual(strategy.description, "Test description") + self.assertIn('main.py', strategy.priority_patterns) + self.assertIn('.py', strategy.extension_priorities) + self.assertEqual(strategy.max_file_size, 50000) + + def test_predefined_strategies_exist(self): + """Test that predefined loading strategies are available.""" + self.assertIn('agent_evaluation', LOADING_STRATEGIES) + self.assertIn('generic', LOADING_STRATEGIES) + self.assertIsInstance(LOADING_STRATEGIES['agent_evaluation'], LoadingStrategy) + + def test_context_loader_initialization(self): + """Test ContextLoader initializes with a strategy.""" + loader = ContextLoader(strategy='generic') + self.assertIsNotNone(loader.strategy) + self.assertEqual(loader.strategy.name, 'generic') + + def test_context_loader_invalid_strategy(self): + """Test ContextLoader falls back to generic for invalid strategy.""" + loader = ContextLoader(strategy='nonexistent_strategy') + # Should fall back to 'generic' + self.assertEqual(loader.strategy.name, 'generic') + + def 
test_context_loader_with_custom_strategy(self): + """Test ContextLoader accepts custom strategy object.""" + custom = create_custom_strategy( + name="custom", + description="Custom test strategy", + exclude_patterns=['node_modules/**'] + ) + loader = ContextLoader(custom_strategy=custom) + self.assertEqual(loader.strategy.name, "custom") + + def test_is_text_file_detects_binary(self): + """Test binary file detection.""" + # Create binary file + binary_file = self.temp_path / "binary.bin" + binary_file.write_bytes(b'\x00\x01\x02\xff') + + # Create text file + text_file = self.temp_path / "text.txt" + text_file.write_text("Hello world") + + loader = ContextLoader(strategy='generic') + self.assertFalse(loader._is_text_file(binary_file)) + self.assertTrue(loader._is_text_file(text_file)) + + def test_load_single_file(self): + """Test loading a single text file.""" + test_file = self.temp_path / "test.md" + test_content = "# Test Document\n\nThis is a test." + test_file.write_text(test_content) + + loader = ContextLoader(strategy='generic') + result = loader.load(str(test_file)) + + self.assertIsNotNone(result) + self.assertIn("Test Document", result) + self.assertIn(test_content, result) + + def test_load_nonexistent_path(self): + """Test loading from nonexistent path returns None.""" + loader = ContextLoader(strategy='generic') + result = loader.load("/nonexistent/path/to/nowhere") + self.assertIsNone(result) + + def test_skip_directories(self): + """Test that skip_dirs are properly excluded.""" + # Create directory structure + (self.temp_path / ".git").mkdir() + (self.temp_path / ".git" / "config").write_text("git config") + (self.temp_path / "src").mkdir() + (self.temp_path / "src" / "main.py").write_text("print('hello')") + + loader = ContextLoader(strategy='generic') + result = loader.load(str(self.temp_path)) + + # Should have main.py but not .git/config + self.assertIsNotNone(result) + self.assertIn("main.py", result) + self.assertNotIn("git config", result) + + 
def test_skip_large_files(self): + """Test that files exceeding max_file_size are skipped.""" + large_file = self.temp_path / "large.txt" + large_file.write_text("x" * (200 * 1024)) # 200KB + + small_file = self.temp_path / "small.txt" + small_file.write_text("small content") + + loader = ContextLoader(strategy='generic') + result = loader.load(str(self.temp_path)) + + self.assertIsNotNone(result) + self.assertIn("small content", result) + self.assertNotIn("x" * 1000, result) # Large file should be skipped + + +class TestDeduplication(unittest.TestCase): + """Test deduplication logic.""" + + def setUp(self): + """Set up test fixtures.""" + self.temp_dir = tempfile.mkdtemp() + self.temp_path = Path(self.temp_dir) + + def tearDown(self): + """Clean up temp directory.""" + import shutil + shutil.rmtree(self.temp_dir, ignore_errors=True) + + def test_deduplicate_keep_first(self): + """Test keep_first strategy.""" + tests = [ + {"name": "test_one", "id": "1", "complexity": "simple"}, + {"name": "test_one", "id": "2", "complexity": "medium"}, + {"name": "test_two", "id": "3", "complexity": "complex"} + ] + + input_file = self.temp_path / "input.json" + output_file = self.temp_path / "output.json" + input_file.write_text(json.dumps(tests)) + + deduplicate_tests(str(input_file), str(output_file), strategy="keep_first") + + result = json.loads(output_file.read_text()) + self.assertEqual(len(result), 2) + # Should keep first "test_one" (id=1) + test_one = [t for t in result if t["name"] == "test_one"][0] + self.assertEqual(test_one["id"], "1") + + def test_deduplicate_keep_best(self): + """Test keep_best strategy (keeps test with most assertions).""" + tests = [ + {"name": "test_one", "id": "1", "complexity": "simple", "assertions": [{"name": "a1"}]}, + {"name": "test_one", "id": "2", "complexity": "medium", "assertions": [{"name": "a2"}, {"name": "a3"}]}, + {"name": "test_two", "id": "3", "complexity": "complex", "assertions": []} + ] + + input_file = self.temp_path / 
"input.json" + output_file = self.temp_path / "output.json" + input_file.write_text(json.dumps(tests)) + + deduplicate_tests(str(input_file), str(output_file), strategy="keep_best") + + result = json.loads(output_file.read_text()) + self.assertEqual(len(result), 2) + # Should keep "test_one" with most assertions (id=2, 2 assertions) + test_one = [t for t in result if t["name"] == "test_one"][0] + self.assertEqual(test_one["id"], "2") + self.assertEqual(len(test_one["assertions"]), 2) + + def test_deduplicate_keep_all_unique(self): + """Test keep_all_unique strategy (renames duplicates).""" + tests = [ + { + "name": "test_one", + "id": "1", + "assertions": [{"name": "a1"}, {"name": "a2"}] + }, + { + "name": "test_one", + "id": "2", + "assertions": [{"name": "a3"}] + } + ] + + input_file = self.temp_path / "input.json" + output_file = self.temp_path / "output.json" + input_file.write_text(json.dumps(tests)) + + deduplicate_tests(str(input_file), str(output_file), strategy="keep_all_unique") + + result = json.loads(output_file.read_text()) + self.assertEqual(len(result), 2) + # Should have renamed second test + names = {t["name"] for t in result} + self.assertIn("test_one", names) + self.assertIn("test_one (variant 2)", names) + + def test_deduplicate_no_duplicates(self): + """Test deduplication with no duplicates.""" + tests = [ + {"name": "test_one", "id": "1"}, + {"name": "test_two", "id": "2"}, + {"name": "test_three", "id": "3"} + ] + + input_file = self.temp_path / "input.json" + output_file = self.temp_path / "output.json" + input_file.write_text(json.dumps(tests)) + + deduplicate_tests(str(input_file), str(output_file), strategy="keep_first") + + result = json.loads(output_file.read_text()) + self.assertEqual(len(result), 3) + + +class TestDomainAnalyzer(unittest.TestCase): + """Test DomainAnalyzer structural analysis (no API calls).""" + + def setUp(self): + """Set up test fixtures.""" + # Mock boto3 to avoid needing AWS credentials + self.boto_patcher = 
patch('evaluation.test_data_generator.domain_analyzer.boto3') + self.mock_boto = self.boto_patcher.start() + self.mock_boto.client.return_value = MagicMock() + + self.analyzer = DomainAnalyzer( + region_name='us-west-2', + model_id='test-model' + ) + + self.sample_tests = [ + { + "id": "test-1", + "name": "Test One", + "complexity": "simple", + "tags": ["unit", "basic"], + "metadata": {"domain": "testing", "type": "functional"}, + "assertions": [ + {"name": "assert_1", "type": "llm_judge", "check": "something"}, + {"name": "assert_2", "type": "tool_called", "check": "tool_name"} + ] + }, + { + "id": "test-2", + "name": "Test Two", + "complexity": "medium", + "tags": ["integration"], + "metadata": {"domain": "testing", "category": "api"}, + "assertions": [ + {"name": "assert_3", "type": "transcript_contains", "check": "expected"} + ] + } + ] + + def tearDown(self): + """Clean up patches.""" + self.boto_patcher.stop() + + def test_extract_structural_patterns(self): + """Test structural pattern extraction.""" + patterns = self.analyzer._extract_structural_patterns(self.sample_tests) + + self.assertIn('fields', patterns) + self.assertIn('metadata_keys', patterns) + self.assertIn('assertion_types', patterns) + self.assertIn('tags', patterns) + + # Check field extraction + self.assertIn('id', patterns['fields']) + self.assertIn('name', patterns['fields']) + self.assertIn('complexity', patterns['fields']) + + # Check metadata keys + self.assertIn('domain', patterns['metadata_keys']) + self.assertIn('type', patterns['metadata_keys']) + + # Check assertion types + self.assertIn('llm_judge', patterns['assertion_types']) + self.assertIn('tool_called', patterns['assertion_types']) + + def test_analyze_complexity(self): + """Test complexity distribution analysis.""" + analysis = self.analyzer._analyze_complexity(self.sample_tests) + + self.assertIn('distribution', analysis) + self.assertIn('simple', analysis['distribution']) + self.assertIn('medium', analysis['distribution']) + 
self.assertEqual(analysis['distribution']['simple'], 1) + self.assertEqual(analysis['distribution']['medium'], 1) + + def test_analyze_assertions(self): + """Test assertion pattern analysis.""" + analysis = self.analyzer._analyze_assertions(self.sample_tests) + + self.assertIn('assertion_types', analysis) + self.assertIn('llm_judge', analysis['assertion_types']) + self.assertIn('tool_called', analysis['assertion_types']) + self.assertIn('transcript_contains', analysis['assertion_types']) + + # Check counts + self.assertEqual(analysis['assertion_types']['llm_judge'], 1) + self.assertEqual(analysis['assertion_types']['tool_called'], 1) + + def test_get_default_structure(self): + """Test default structure when no samples provided.""" + default = self.analyzer._get_default_structure() + + self.assertIn('fields', default) + self.assertIn('metadata_keys', default) + self.assertIn('assertion_types', default) + self.assertIsInstance(default['fields'], dict) + + def test_analyze_empty_samples(self): + """Test analysis handles empty sample list.""" + patterns = self.analyzer._extract_structural_patterns([]) + self.assertIn('fields', patterns) + self.assertEqual(len(patterns['fields']), 0) + + +class TestCustomStrategyCreation(unittest.TestCase): + """Test custom strategy creation helper.""" + + def test_create_custom_strategy_basic(self): + """Test basic custom strategy creation.""" + strategy = create_custom_strategy( + name="my_strategy", + description="My test strategy", + exclude_patterns=['test/**'] + ) + self.assertEqual(strategy.name, "my_strategy") + self.assertEqual(strategy.description, "My test strategy") + self.assertIn('test/**', strategy.exclude_patterns) + + def test_create_custom_strategy_with_priorities(self): + """Test custom strategy with priority patterns.""" + strategy = create_custom_strategy( + name="extended", + description="Extended strategy", + priority_patterns={'important.md': 100}, + extension_priorities={'.md': 80} + ) + 
self.assertEqual(strategy.name, "extended") + self.assertIn('important.md', strategy.priority_patterns) + self.assertIn('.md', strategy.extension_priorities) + self.assertEqual(strategy.priority_patterns['important.md'], 100) + + def test_create_custom_strategy_file_size(self): + """Test custom max_file_size setting.""" + strategy = create_custom_strategy( + name="large_files", + description="Strategy for large files", + max_file_size=500000 + ) + self.assertEqual(strategy.max_file_size, 500000) + + +def run_tests(): + """Run all unit tests.""" + loader = unittest.TestLoader() + suite = unittest.TestSuite() + + # Add all test classes + suite.addTests(loader.loadTestsFromTestCase(TestContextLoader)) + suite.addTests(loader.loadTestsFromTestCase(TestDeduplication)) + suite.addTests(loader.loadTestsFromTestCase(TestDomainAnalyzer)) + suite.addTests(loader.loadTestsFromTestCase(TestCustomStrategyCreation)) + + runner = unittest.TextTestRunner(verbosity=2) + result = runner.run(suite) + + return result.wasSuccessful() + + +if __name__ == '__main__': + import sys + success = run_tests() + sys.exit(0 if success else 1) diff --git a/evaluation/test_samples/onboarding_intermediate.json b/evaluation/test_samples/onboarding_intermediate.json new file mode 100644 index 0000000..f3c6dc9 --- /dev/null +++ b/evaluation/test_samples/onboarding_intermediate.json @@ -0,0 +1,95 @@ +[ +{ + "id": "onboarding-intermediate", + "name": "Intermediate user onboarding — has tools installed, knows the platform", + "user_message": "I just installed the agent-builder power and want to use it.", + "prompt": "I just installed the agent-builder power and want to use it.", + "description": "An intermediate user who has Python 3.11, AWS CLI, and Finch already installed and configured. They know the AWS Transform platform and understand what the power can do. 
The power should walk through onboarding efficiently: validate prerequisites, offer SDK install, hooks, MCP config, then demonstrate a doc search when asked.", + "complexity": "medium", + "tags": ["matrix", "onboarding", "intermediate"], + "max_turns": 12, + "timeout_seconds": 600, + "simulated_human_guidance": "You are a developer who just installed the agent-builder Kiro Power and want to get started. You have Python 3.11, AWS CLI, and finch installed. Your AWS credentials are configured. Cooperate with the agent throughout the onboarding flow. When the agent asks about your Python version, say you have Python 3.11. When it offers to install the SDK, agree. When it offers to add workspace hooks, agree. When it offers MCP configuration, accept the defaults. Once the agent finishes the onboarding setup (tool validation, SDK, hooks, MCP config), ask it: 'Can you show me a quick example? Search the AWS Transform docs for how to build an orchestrator agent.' After the agent performs the search and presents results, output __DONE__.", + "metadata": { + "domain": "agent_builder", + "source_platform": "kiro", + "target_platform": "aws_transform", + "scenario_type": "onboarding", + "source_file": "evaluation/test_samples/onboarding_intermediate.json" + }, + "assertions": [ + { + "name": "introduces_capabilities", + "type": "llm_judge", + "description": "Agent introduces itself and lists its key capabilities", + "check": "Did the agent introduce itself as the AWS Transform agent-builder power (or AWS Transform agent development assistant) and mention at least three of its capabilities? Capabilities include: documentation search, agent registration, deployment to agent runtime, code generation, debugging, and skill management. The introduction should give the user a clear picture of what the power can help with." 
+ }, + { + "name": "validates_python", + "type": "llm_judge", + "description": "Agent checks or asks about Python version (3.11+ required)", + "check": "Did the agent validate or ask about the Python installation? It should check that Python 3.11 or higher is available, either by running a command (python3 --version) or asking the user. Simply mentioning Python as a prerequisite in a list counts if the agent is actively walking through validation." + }, + { + "name": "validates_aws_cli", + "type": "llm_judge", + "description": "Agent checks or asks about AWS CLI installation and authentication", + "check": "Did the agent validate or ask about AWS CLI? It should check that the AWS CLI is installed (aws --version) and that credentials are configured (aws sts get-caller-identity). Mentioning AWS CLI validation as part of the onboarding walkthrough counts." + }, + { + "name": "validates_container_runtime", + "type": "llm_judge", + "description": "Agent checks or asks about Finch or Docker availability", + "check": "Did the agent validate or ask about a container runtime (Finch or Docker)? It should check that at least one is available for building ARM64 images. Mentioning container runtime validation as part of the onboarding walkthrough counts." + }, + { + "name": "checks_sdk_installation", + "type": "llm_judge", + "description": "Agent addresses AWS Transform Agent SDK installation", + "check": "Did the agent address the AWS Transform Agent SDK installation? This could be checking if the SDK is already installed, offering to install it, or providing installation instructions (install.sh or manual pip install of .whl files). The agent should not skip this step." + }, + { + "name": "offers_workspace_hooks", + "type": "llm_judge", + "description": "Agent offers to add workspace hooks for deployment validation", + "check": "Did the agent offer to add workspace hooks (specifically a validate-deployment hook or similar)? 
The POWER.md onboarding Step 3 defines a hook that validates IAM roles, container runtime, and AWS access before deployment. The agent should offer to set this up." + }, + { + "name": "offers_mcp_configuration", + "type": "llm_judge", + "description": "Agent offers MCP server environment configuration", + "check": "Did the agent mention or offer to configure the MCP server environment (STAGE, REGION settings in mcp.json)? This is Step 4 of the POWER.md onboarding. The agent may present it as optional, which is correct." + }, + { + "name": "performs_doc_search_demo", + "type": "llm_judge", + "description": "Agent demonstrates the power by performing a documentation search", + "check": "After the user asked for a doc search demo (e.g., 'search the AWS Transform docs for how to build an orchestrator agent'), did the agent perform a live documentation search? It should have called keyword_search or search_by_source and presented results to the user. Simply describing the search capability without actually performing a search does NOT count." + }, + { + "name": "search_tool_called", + "type": "tool_called", + "description": "Agent actually invoked a search MCP tool during the demo", + "check": "keyword_search" + }, + { + "name": "includes_citations", + "type": "llm_judge", + "description": "Agent includes citation tags when presenting search results", + "check": "When the agent presented documentation search results, did it include citation tags (e.g., [dev-guide:...], [sdk:...], [api:...])? The POWER.md grounding rules require citations in every response that uses search results." + }, + { + "name": "follows_onboarding_sequence", + "type": "llm_judge", + "description": "Agent follows a logical onboarding sequence without skipping major steps", + "check": "Review the full transcript. Did the agent follow a logical onboarding progression? 
The expected sequence is: (1) introduce capabilities, (2) validate tools and access, (3) SDK installation, (4) workspace hooks, (5) MCP configuration. After onboarding, the user asks for a doc search demo, and the agent should perform it. The agent does not need to follow this exact order rigidly, but it should cover all major onboarding steps and respond to the user's search request. Minor reordering is acceptable." + }, + { + "name": "power_did_not_error", + "type": "transcript_not_contains", + "description": "No framework-level errors in the transcript", + "check": "ERROR:" + } + ] +} +]
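Review note on `_parse_generation_response` in the generator module earlier in this diff: the pattern `r'\[[\s\S]*\]'` is greedy, so it spans from the first `[` to the last `]` in the model reply. Any bracketed text around the array (for example a `[note]` in a preamble) would then make `json.loads` fail and the whole batch would be dropped. One possible alternative, sketched here under that assumption, scans with `json.JSONDecoder.raw_decode` and returns the first non-empty array of objects. The helper name `extract_first_json_array` is hypothetical and not part of this change.

```python
import json
from typing import Any, Dict, List


def extract_first_json_array(text: str) -> List[Dict[str, Any]]:
    """Return the first non-empty JSON array of objects found in ``text``.

    Hypothetical helper: tries to decode a JSON value at every '[' in turn
    and returns [] when no candidate decodes to a list of dicts.
    """
    decoder = json.JSONDecoder()
    start = text.find("[")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever text follows it.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, list) and value and all(
            isinstance(item, dict) for item in value
        ):
            return value
        # Not a usable array here; move on to the next '[' in the reply.
        start = text.find("[", start + 1)
    return []


if __name__ == "__main__":
    reply = 'Here are the tests [draft]:\n[{"name": "a", "assertions": []}]\nDone [end].'
    print(extract_first_json_array(reply))
```

A known limitation of this sketch: if the outer array is truncated, the scan can fall through to a nested array of objects (such as an `assertions` list), so a caller would probably still want the shape checks done in `_validate_and_fix_tests`.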