Commit aaca23d

harmjeff, claude, and scottschreckengaust authored
feat: add AIDLC Evaluation & Reporting Framework (awslabs#115)
* feat: add aidlc-evaluator framework

  Evaluation and reporting framework for validating AI-DLC workflow changes. Includes execution, qualitative/quantitative scoring, contract testing, reporting packages, and CLI/IDE harness adapters.

  Also fixes pytest import-mode collision for same-named test files across packages, and documents known Windows test_run_command.py failures.

  Co-Authored-By: Claude Opus 4.6 <[email protected]>

* test: fix cross-platform compatibility in test_run_command.py

  Replace shell-specific commands with Python equivalents to ensure tests pass on all platforms (Windows/Mac/Linux) when using shell=False:
  - Replace `echo 'content' > file` with Python pathlib file writing
  - Replace shell builtin `exit N` with Python `sys.exit(N)`
  - Replace `echo 'msg' >&2` with Python `sys.stderr.write()`
  - Update command-not-found test to handle both OSError and exit code 127

  All 245 tests now pass successfully on Windows.

  Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

* Update scripts/aidlc-evaluator/packages/ide-harness/src/ide_harness/prompt_template.py
  Co-authored-by: Scott Schreckengaust <[email protected]>
* Update scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/prompt_template.py
  Co-authored-by: Scott Schreckengaust <[email protected]>
* Remove region profile
* More profile updates
* More profile updates
* Update scripts/aidlc-evaluator/pyproject.toml
  Co-authored-by: Scott Schreckengaust <[email protected]>
* Update scripts/aidlc-evaluator/README.md
  Co-authored-by: Scott Schreckengaust <[email protected]>
* Update scripts/aidlc-evaluator/packages/shared/src/shared/sandbox.py
  Co-authored-by: Scott Schreckengaust <[email protected]>
* Update scripts/aidlc-evaluator/packages/shared/src/shared/sandbox.py
  Co-authored-by: Scott Schreckengaust <[email protected]>
* Update scripts/aidlc-evaluator/packages/shared/src/shared/sandbox.py
  Co-authored-by: Scott Schreckengaust <[email protected]>
* Update scripts/aidlc-evaluator/packages/shared/src/shared/sandbox.py
  Co-authored-by: Scott Schreckengaust <[email protected]>
* Update scripts/aidlc-evaluator/README.md
  Co-authored-by: Scott Schreckengaust <[email protected]>
* Make docker builder script executable
* Update scripts/aidlc-evaluator/README.md
  Co-authored-by: Scott Schreckengaust <[email protected]>
* Update scripts/aidlc-evaluator/packages/shared/src/shared/sandbox.py
  Codebuilder fixes
  Co-authored-by: Scott Schreckengaust <[email protected]>
* Update scripts/aidlc-evaluator/packages/shared/src/shared/sandbox.py
  Codebuilder fixes
  Co-authored-by: Scott Schreckengaust <[email protected]>
* Update scripts/aidlc-evaluator/packages/shared/src/shared/sandbox.py
  Codebuilder fixes
  Co-authored-by: Scott Schreckengaust <[email protected]>
* Update scripts/aidlc-evaluator/packages/shared/src/shared/sandbox.py
  Codebuilder fixes
  Co-authored-by: Scott Schreckengaust <[email protected]>
* Update scripts/aidlc-evaluator/packages/shared/src/shared/sandbox.py
  Codebuilder fixes
  Co-authored-by: Scott Schreckengaust <[email protected]>
* Update scripts/aidlc-evaluator/packages/contracttest/src/contracttest/server.py
  Codebuilder fixes
  Co-authored-by: Scott Schreckengaust <[email protected]>
* Update scripts/aidlc-evaluator/packages/execution/src/aidlc_runner/post_run.py
  Codebuilder fixes
  Co-authored-by: Scott Schreckengaust <[email protected]>
* Update scripts/aidlc-evaluator/packages/execution/src/aidlc_runner/post_run.py
  Codebuilder fixes
  Co-authored-by: Scott Schreckengaust <[email protected]>
* Update scripts/aidlc-evaluator/packages/execution/src/aidlc_runner/post_run.py
  Codebuilder fixes
  Co-authored-by: Scott Schreckengaust <[email protected]>
* Update scripts/aidlc-evaluator/packages/execution/src/aidlc_runner/post_run.py
  Codebuilder fixes
  Co-authored-by: Scott Schreckengaust <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: Scott Schreckengaust <[email protected]>
1 parent 5ffc938 commit aaca23d

182 files changed

Lines changed: 29393 additions & 0 deletions
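
The cross-platform test fix described in the commit message (Python equivalents instead of shell builtins when running with `shell=False`) can be illustrated with a minimal sketch. It assumes the tests launch commands through `subprocess.run`; the helper name `run_python_snippet` is hypothetical and is not taken from `test_run_command.py`.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_python_snippet(code: str) -> subprocess.CompletedProcess:
    """Run a Python one-liner without a shell (hypothetical helper, not from the repo)."""
    # sys.executable avoids relying on a `python` binary being on PATH.
    return subprocess.run(
        [sys.executable, "-c", code],
        shell=False,
        capture_output=True,
        text=True,
    )


# Instead of the shell builtin `exit 3`:
exit_result = run_python_snippet("import sys; sys.exit(3)")

# Instead of `echo 'msg' >&2`:
stderr_result = run_python_snippet("import sys; sys.stderr.write('msg')")

# Instead of `echo 'content' > file`:
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "out.txt"
    write_result = run_python_snippet(
        f"from pathlib import Path; Path({str(target)!r}).write_text('content')"
    )
    written = target.read_text()
```

Because no shell is involved, the same argument list behaves identically on Windows, macOS, and Linux, which is the property the commit message describes.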

scripts/aidlc-evaluator/.gitignore

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
1+
# Python
2+
__pycache__/
3+
*.py[cod]
4+
*$py.class
5+
*.so
6+
.Python
7+
build/
8+
develop-eggs/
9+
dist/
10+
downloads/
11+
eggs/
12+
.eggs/
13+
lib/
14+
lib64/
15+
parts/
16+
sdist/
17+
var/
18+
wheels/
19+
*.egg-info/
20+
.installed.cfg
21+
*.egg
22+
23+
config.yaml
24+
25+
# Virtual environments
26+
venv/
27+
env/
28+
ENV/
29+
30+
# Testing
31+
.pytest_cache/
32+
.coverage
33+
htmlcov/
34+
.tox/
35+
36+
# IDEs
37+
.vscode/
38+
.idea/
39+
*.swp
40+
*.swo
41+
*~
42+
43+
# OS
44+
.DS_Store
45+
Thumbs.db
46+
47+
# Project specific
48+
test_results/
49+
reports/
50+
*.log
51+
.env
52+
runs
53+
ralph-coded
54+
.venv/
55+
.ruff_cache/
56+
.claude/

scripts/aidlc-evaluator/ARCHITECTURE.md

Lines changed: 655 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
1+
# Contributing to AI-DLC Evaluation Framework
2+
3+
Thank you for contributing to the AI-DLC workflows evaluation and reporting framework!
4+
5+
## Getting Started
6+
7+
### Prerequisites
8+
9+
- Python 3.13+
10+
- [uv](https://github.com/astral-sh/uv) package manager
11+
- Git
12+
13+
### Setup
14+
15+
```bash
16+
# Clone the repository
17+
git clone <repository-url>
18+
cd aidlc-evaluation-framework
19+
20+
# Install dependencies
21+
uv sync
22+
23+
# Run tests to verify setup
24+
uv run pytest
25+
```
26+
27+
## Development Workflow
28+
29+
### 1. Create a Branch
30+
31+
```bash
32+
git checkout -b feature/your-feature-name
33+
```
34+
35+
### 2. Make Changes
36+
37+
Work in the appropriate package:
38+
- `aidlc-runner/` - Execution Framework (two-agent AIDLC workflow runner)
39+
- `packages/qualitative/` - Semantic Evaluation (intent & design similarity scoring)
40+
- `packages/quantitative/` - Code Evaluation (linting, security, organization)
41+
- `packages/nonfunctional/` - NFR Evaluation (tokens, timing, consistency)
42+
- `packages/reporting/` - Report generation
43+
- `packages/shared/` - Common utilities
44+
45+
Or contribute to other work streams:
46+
- `test_cases/` - Golden Test Cases (baseline inputs)
47+
- `writing-inputs/` - Vision and tech-env document guides
48+
- `.github/workflows/` - GitHub CI/CD Integration & Management
49+
50+
### 3. Run Tests
51+
52+
```bash
53+
# Run all tests
54+
uv run pytest
55+
56+
# Run specific package tests
57+
uv run pytest tests/test_qualitative.py
58+
59+
# Run with coverage
60+
uv run pytest --cov
61+
```
62+
63+
### 4. Lint Your Code
64+
65+
```bash
66+
# Check code style
67+
uv run ruff check .
68+
69+
# Auto-fix issues
70+
uv run ruff check --fix .
71+
72+
# Format code
73+
uv run ruff format .
74+
```
75+
76+
### 5. Commit Changes
77+
78+
Write clear, descriptive commit messages:
79+
80+
```bash
81+
git add .
82+
git commit -m "Add token tracking to nonfunctional package"
83+
```
84+
85+
### 6. Submit a Pull Request
86+
87+
- Push your branch to the repository
88+
- Open a PR with a clear description of changes
89+
- Link to any related issues
90+
- Wait for automated tests to pass
91+
- Address review feedback
92+
93+
## Work Streams
94+
95+
The project is organized around six major work streams ("big rocks"). Your changes will typically fall into one or more of these:
96+
97+
| Work Stream | Description | Package / Area |
98+
|---|---|---|
99+
| **Golden Test Case** | Curated baseline test inputs | `test_cases/` |
100+
| **Execution Framework** | Two-agent AIDLC workflow runner (Owner: Jeff) | `aidlc-runner/` |
101+
| **Semantic Evaluation** | Intent & design similarity scoring | `packages/qualitative/` |
102+
| **Code Evaluation** | Linting, security, organization | `packages/quantitative/` |
103+
| **NFR Evaluation** | Tokens, timing, consistency | `packages/nonfunctional/` |
104+
| **GitHub CI/CD** | Pipeline integration & management | `.github/workflows/` |
105+
106+
## Code Standards
107+
108+
### Python Style
109+
110+
- Follow PEP 8 (enforced by Ruff)
111+
- Use type hints
112+
- Maximum line length: 100 characters
113+
- Write docstrings for public functions and classes
114+
115+
### Testing
116+
117+
- Write tests for new functionality
118+
- Maintain or improve code coverage
119+
- Use descriptive test names: `test_<what>_<condition>_<expected>`
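
As an illustration of the naming pattern above, here is a minimal pytest-style sketch. `clamp_score` is a hypothetical function invented for this example; it is not part of any package in this repository.

```python
def clamp_score(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Clamp a score into the inclusive range [low, high] (hypothetical helper)."""
    return max(low, min(high, value))


# Naming pattern: test_<what>_<condition>_<expected>
def test_clamp_score_above_range_returns_high() -> None:
    assert clamp_score(1.7) == 1.0


def test_clamp_score_below_range_returns_low() -> None:
    assert clamp_score(-0.2) == 0.0


def test_clamp_score_within_range_returns_value() -> None:
    assert clamp_score(0.4) == 0.4
```

Each name states the unit under test, the input condition, and the expected outcome, so a failing test is identifiable from the pytest summary line alone.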
120+
121+
### Documentation
122+
123+
- Update README.md if adding new features
124+
- Add docstrings to new modules and functions
125+
- Update relevant docs in the `docs/` directory
126+
127+
## Package Dependencies
128+
129+
When adding dependencies:
130+
131+
1. Add to the appropriate `pyproject.toml` in `packages/<package>/` or `aidlc-runner/`
132+
2. Run `uv sync` to update the lock file
133+
3. Document why the dependency is needed in your PR
134+
135+
## Reporting Issues
136+
137+
When reporting bugs or requesting features:
138+
139+
- Use GitHub Issues
140+
- Provide clear reproduction steps
141+
- Include relevant logs or error messages
142+
- Specify which package is affected
143+
144+
## Questions?
145+
146+
- Review [FAQ.md](./FAQ.md) for common questions
147+
- Check [OPERATING_PRINCIPLES.md](./OPERATING_PRINCIPLES.md) for decision-making guidance
148+
- Ask in PR comments or open a discussion
149+
150+
## Code of Conduct
151+
152+
- Be respectful and constructive
153+
- Focus on the code, not the person
154+
- Welcome diverse perspectives
155+
- Help others learn and grow
156+
157+
Thank you for helping improve the AI-DLC evaluation framework!

scripts/aidlc-evaluator/FAQ.md

Lines changed: 106 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,106 @@
1+
# AI-DLC Workflows Evaluation & Reporting Framework - FAQ
2+
3+
## What is this?
4+
5+
A comprehensive testing and reporting framework that validates changes to the AI-DLC workflows repository. It automatically evaluates code quality, semantic correctness, and performance to ensure changes don't negatively impact the system.
6+
7+
## Who is this for?
8+
9+
- **Maintainers** who need confidence that changes are safe to merge
10+
- **Contributors** who want to demonstrate their changes improve (or don't harm) the system
11+
- **Users** who depend on consistent, high-quality AI-assisted development workflows
12+
13+
## What are the major work streams?
14+
15+
The framework is organized around six big rocks:
16+
17+
**1. Golden Test Case**
18+
- Curated baseline test cases containing full AIDLC docs and code output
19+
- Versioned reference inputs that all evaluations run against
20+
- Ensures consistent, reproducible evaluation across changes
21+
22+
**2. Execution Framework (Jeff)**
23+
- Core orchestration engine that runs golden test cases through each evaluation
24+
- Manages the pipeline from test case input to structured results output
25+
- Coordinates across all evaluation dimensions
26+
27+
**3. Semantic Evaluation**
28+
- Uses AI to semantically evaluate outputs at major human review points
29+
- Scores outputs for correctness, completeness, and appropriateness
30+
- Validates that AI-generated content meets quality standards
31+
- All semantic metrics are reported **@k** — each evaluation runs multiple trials to account for non-determinism in AI-based grading (see "What does @k mean?" below)
32+
33+
**4. Code Evaluation**
34+
- **Linting:** Code style correctness
35+
- **Security:** Semgrep analysis for vulnerabilities
36+
- **Organization:** Code duplication detection, library usage patterns
37+
- Produces numeric scores (e.g., "3 high-severity security issues")
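
A numeric score such as the one quoted above could be derived by counting findings per severity. The sketch below is only an illustration: the finding records and rule names are hypothetical, and the real tool output formats are not shown in this document.

```python
from collections import Counter

# Hypothetical finding records, invented for this example.
findings = [
    {"rule": "hardcoded-secret", "severity": "high"},
    {"rule": "sql-injection", "severity": "high"},
    {"rule": "unsafe-deserialization", "severity": "high"},
    {"rule": "weak-hash", "severity": "medium"},
]

# Count findings per severity level.
counts = Counter(finding["severity"] for finding in findings)
headline = f"{counts['high']} high-severity security issues"
```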
38+
39+
**5. NFR Evaluation**
40+
- Token consumption per workflow
41+
- Execution time measurements
42+
- Cross-model consistency checks
43+
- Resource utilization metrics
44+
45+
**6. GitHub CI/CD Integration & Management**
46+
- Automated pipelines triggering evaluations on PRs
47+
- Human-readable report generation and attachment
48+
- Versioned report archiving for historical comparison
49+
50+
## How does it work?
51+
52+
1. **Golden test cases** define the reference inputs (AIDLC docs + expected code output)
53+
2. The **execution framework** runs these test cases through each evaluation dimension
54+
3. **Semantic, code, and NFR evaluations** produce structured results
55+
4. **Reports** are generated summarizing impact across all dimensions
56+
5. **GitHub CI/CD** automates the entire pipeline on PRs and attaches reports for review
57+
6. Versioned reports are archived for historical comparison
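
The steps above can be sketched as a small pipeline. This is a sketch under stated assumptions, not the framework's actual design: every name here (`EvaluationResult`, `run_pipeline`, `summarize`, the stub evaluators, and the metric names) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: an evaluator maps a test case name to named metrics.
Evaluator = Callable[[str], dict[str, float]]


@dataclass
class EvaluationResult:
    test_case: str
    dimension: str
    metrics: dict[str, float]


def run_pipeline(
    test_cases: list[str], evaluators: dict[str, Evaluator]
) -> list[EvaluationResult]:
    """Run every golden test case through every evaluation dimension."""
    results = []
    for case in test_cases:
        for dimension, evaluate in evaluators.items():
            results.append(EvaluationResult(case, dimension, evaluate(case)))
    return results


def summarize(results: list[EvaluationResult]) -> str:
    """Render a small human-readable report, one line per result."""
    lines = []
    for result in results:
        metrics = ", ".join(
            f"{name}={value}" for name, value in sorted(result.metrics.items())
        )
        lines.append(f"{result.test_case} [{result.dimension}]: {metrics}")
    return "\n".join(lines)


# Stub evaluators standing in for the semantic, code, and NFR evaluations.
evaluators = {
    "semantic": lambda case: {"score": 0.9},
    "code": lambda case: {"lint_findings": 2.0},
    "nfr": lambda case: {"tokens": 1200.0},
}
report = summarize(run_pipeline(["golden-case-1"], evaluators))
```

Keeping each dimension behind the same callable shape is one way to let new evaluations be added without changing the orchestration loop.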
58+
59+
## What environments are supported?
60+
61+
Kiro is a first-class citizen for testing, but the framework supports multiple AI tools and environments to meet customers where they are.
62+
63+
## What does @k mean for semantic metrics?
64+
65+
AI-based evaluations are non-deterministic — the same input can produce different scores across runs. To get trustworthy results, the framework runs each semantic evaluation multiple times (*k* trials) and reports two complementary metrics (see [Anthropic: Demystifying Evals for AI Agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)):
66+
67+
- **pass@k** — The probability of at least one success in *k* attempts. Answers: *"Can this workflow produce a correct result?"* Higher *k* increases the score, since more attempts mean higher odds of at least one success.
68+
- **pass^k** — The probability that *all k* attempts succeed. Answers: *"Does this workflow consistently produce correct results?"* Higher *k* makes this harder to achieve, since every trial must pass.
69+
70+
At *k*=1 the two metrics are identical (both equal the per-trial success rate). As *k* grows they diverge: for any per-trial success rate strictly between 0% and 100%, pass@k approaches 100% while pass^k drops toward 0%. Together they tell you both the capability ceiling and the reliability floor of a workflow change.
71+
72+
Code evaluation and NFR metrics are deterministic and do not require @k.
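
To make the two definitions concrete, here is a minimal sketch that assumes the *k* trials are independent and share one per-trial success rate `p`. Under that assumption pass@k is `1 - (1 - p)**k` and pass^k is `p**k`. How the framework actually estimates these values is not shown here; the function names are illustrative.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent trials."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed."""
    return p**k


p = 0.8  # assumed per-trial success rate, chosen only for illustration
for k in (1, 3, 10):
    print(f"k={k}: pass@k={pass_at_k(p, k):.3f}, pass^k={pass_hat_k(p, k):.3f}")
```

With `p = 0.8`, both metrics equal 0.800 at *k*=1; at *k*=3, pass@k rises to 0.992 while pass^k falls to 0.512.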
73+
74+
## How do I interpret the reports?
75+
76+
Reports include:
77+
- **Semantic scores @k:** AI-evaluated ratings with pass@k (capability) and pass^k (reliability)
78+
- **Code scores:** Numeric metrics for linting, security, duplication (deterministic)
79+
- **NFR metrics:** Token usage, execution time, consistency (deterministic)
80+
- **Trend analysis:** Comparison to previous versions (against golden test cases)
81+
- **Pass/fail gates:** Clear indicators of whether changes meet thresholds
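
A pass/fail gate can be thought of as a threshold check on one metric. The sketch below is an assumption about how such gates might be modeled; the `Gate` class, metric names, and thresholds are hypothetical and not taken from the framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """A threshold on a single metric (hypothetical model)."""

    metric: str
    threshold: float
    higher_is_better: bool = True

    def passes(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


# Hypothetical metric names and thresholds, chosen only for illustration.
gates = [
    Gate("semantic_pass_hat_k", 0.5),
    Gate("high_severity_findings", 0, higher_is_better=False),
]
measured = {"semantic_pass_hat_k": 0.512, "high_severity_findings": 3}
outcome = {gate.metric: gate.passes(measured[gate.metric]) for gate in gates}
```

Here the semantic gate passes and the security gate fails, which is the kind of per-metric indicator a report could surface.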
82+
83+
## What if my change shows a regression?
84+
85+
Regressions don't automatically block merges; they provide context. Work with maintainers to:
86+
- Understand if the regression is acceptable given the benefits
87+
- Identify ways to mitigate the regression
88+
- Document known trade-offs
89+
90+
## How does this relate to the AI-DLC workflows repository?
91+
92+
This framework monitors and validates the [AI-DLC workflows](https://github.com/awslabs/aidlc-workflows) to ensure changes maintain or improve quality. It's a testing layer on top of the workflows themselves.
93+
94+
## Can I run tests locally before submitting a PR?
95+
96+
Yes—the framework is designed to run in CI/CD but can also be executed locally to get early feedback.
97+
98+
## How are reports versioned?
99+
100+
Each test run produces a numbered/named version that includes:
101+
- Timestamp and commit SHA
102+
- Full test results
103+
- Comparison to baseline
104+
- Human-readable summary
105+
106+
Reports are stored for historical analysis and trend tracking.
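
One possible shape for a versioned report record, covering the fields listed above, is sketched below. The `ReportVersion` class, its field names, and `build_report` are hypothetical; the timestamp and metric values are arbitrary illustration data, and the framework's real storage format is not shown here.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ReportVersion:
    """One archived report; field names are hypothetical."""

    commit_sha: str
    timestamp: str  # ISO 8601, UTC
    results: dict[str, float]
    baseline_delta: dict[str, float]
    summary: str


def build_report(
    commit_sha: str,
    results: dict[str, float],
    baseline: dict[str, float],
    now: Optional[datetime] = None,
) -> ReportVersion:
    """Bundle results with their difference from the baseline run."""
    now = now or datetime.now(timezone.utc)
    delta = {
        name: results[name] - baseline[name] for name in results if name in baseline
    }
    changed = sum(1 for value in delta.values() if value != 0)
    summary = f"{changed} of {len(delta)} metrics changed versus baseline"
    return ReportVersion(commit_sha, now.isoformat(), results, delta, summary)


report = build_report(
    "aaca23d",
    results={"tokens": 1200.0, "lint_findings": 2.0},
    baseline={"tokens": 1000.0, "lint_findings": 2.0},
    now=datetime(2025, 1, 1, tzinfo=timezone.utc),  # arbitrary fixed time
)
archived = json.dumps(asdict(report), indent=2, sort_keys=True)
```

Serializing with sorted keys keeps archived reports stable across runs, which makes later comparisons between versions easier.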
