harmjeff
diff --git a/scripts/aidlc-evaluator/.gitignore b/scripts/aidlc-evaluator/.gitignore
Lines changed: 56 additions & 0 deletions
diff --git a/scripts/aidlc-evaluator/ARCHITECTURE.md b/scripts/aidlc-evaluator/ARCHITECTURE.md
Lines changed: 655 additions & 0 deletions
diff --git a/scripts/aidlc-evaluator/CONTRIBUTING.md b/scripts/aidlc-evaluator/CONTRIBUTING.md
Lines changed: 157 additions & 0 deletions
diff --git a/scripts/aidlc-evaluator/FAQ.md b/scripts/aidlc-evaluator/FAQ.md
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,56 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+
+config.yaml
+
+# Virtual environments
+venv/
+env/
+ENV/
+
+# Testing
+.pytest_cache/
+.coverage
+htmlcov/
+.tox/
+
+# IDEs
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+
+# OS
+.DS_Store
+Thumbs.db
+
+# Project specific
+test_results/
+reports/
+*.log
+.env
+runs
+ralph-coded
+.venv/
+.ruff_cache/
+.claude/
@@ -0,0 +1,157 @@
+# Contributing to AI-DLC Evaluation Framework
+
+Thank you for contributing to the AI-DLC workflows evaluation and reporting framework!
+
+## Getting Started
+
+### Prerequisites
+
+- Python 3.13+
+- [uv](https://github.com/astral-sh/uv) package manager
+- Git
+
+### Setup
+
+```bash
+# Clone the repository
+git clone <repository-url>
+cd aidlc-evaluation-framework
+
+# Install dependencies
+uv sync
+
+# Run tests to verify setup
+uv run pytest
+```
+
+## Development Workflow
+
+### 1. Create a Branch
+
+```bash
+git checkout -b feature/your-feature-name
+```
+
+### 2. Make Changes
+
+Work in the appropriate package:
+- `aidlc-runner/` - Execution Framework (two-agent AIDLC workflow runner)
+- `packages/qualitative/` - Semantic Evaluation (intent & design similarity scoring)
+- `packages/quantitative/` - Code Evaluation (linting, security, organization)
+- `packages/nonfunctional/` - NFR Evaluation (tokens, timing, consistency)
+- `packages/reporting/` - Report generation
+- `packages/shared/` - Common utilities
+
+Or contribute to other work streams:
+- `test_cases/` - Golden Test Cases (baseline inputs)
+- `writing-inputs/` - Vision and tech-env document guides
+- `.github/workflows/` - GitHub CI/CD Integration & Management
+
+### 3. Run Tests
+
+```bash
+# Run all tests
+uv run pytest
+
+# Run specific package tests
+uv run pytest tests/test_qualitative.py
+
+# Run with coverage
+uv run pytest --cov
+```
+
+### 4. Lint Your Code
+
+```bash
+# Check code style
+uv run ruff check .
+
+# Auto-fix issues
+uv run ruff check --fix .
+
+# Format code
+uv run ruff format .
+```
+
+### 5. Commit Changes
+
+Write clear, descriptive commit messages:
+
+```bash
+git add .
+git commit -m "Add token tracking to nonfunctional package"
+```
+
+### 6. Submit a Pull Request
+
+- Push your branch to the repository
+- Open a PR with a clear description of changes
+- Link to any related issues
+- Wait for automated tests to pass
+- Address review feedback
+
+## Work Streams
+
+The project is organized around six big rocks. Your changes will typically fall into one or more of these:
+
+| Work Stream | Description | Package / Area |
+|---|---|---|
+| **Golden Test Case** | Curated baseline test inputs | `test_cases/` |
+| **Execution Framework** | Two-agent AIDLC workflow runner (Owner: Jeff) | `aidlc-runner/` |
+| **Semantic Evaluation** | Intent & design similarity scoring | `packages/qualitative/` |
+| **Code Evaluation** | Linting, security, organization | `packages/quantitative/` |
+| **NFR Evaluation** | Tokens, timing, consistency | `packages/nonfunctional/` |
+| **GitHub CI/CD** | Pipeline integration & management | `.github/workflows/` |
+
+## Code Standards
+
+### Python Style
+
+- Follow PEP 8 (enforced by Ruff)
+- Use type hints
+- Maximum line length: 100 characters
+- Write docstrings for public functions and classes
+
+### Testing
+
+- Write tests for new functionality
+- Maintain or improve code coverage
+- Use descriptive test names: `test_<what>_<condition>_<expected>`
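A minimal sketch of the conventions listed above (type hints, docstrings, and the `test_<what>_<condition>_<expected>` naming pattern). The helper `clamp_score` is hypothetical and exists only to give the tests something to exercise; it is not part of the framework.

```python
def clamp_score(value: float) -> float:
    """Hypothetical helper: clamp a score into the range [0.0, 1.0]."""
    return max(0.0, min(1.0, value))


# Naming pattern: test_<what>_<condition>_<expected>
def test_clamp_score_value_above_one_returns_one() -> None:
    assert clamp_score(1.7) == 1.0


def test_clamp_score_negative_value_returns_zero() -> None:
    assert clamp_score(-0.2) == 0.0
```

Each test name states what is under test (`clamp_score`), the condition (`value_above_one`), and the expected outcome (`returns_one`), so a failure is readable from the pytest summary alone.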
+
+### Documentation
+
+- Update README.md if adding new features
+- Add docstrings to new modules and functions
+- Update relevant docs in `docs/` directory
+
+## Package Dependencies
+
+When adding dependencies:
+
+1. Add to the appropriate `pyproject.toml` in `packages/<package>/` or `aidlc-runner/`
+2. Run `uv sync` to update the lock file
+3. Document why the dependency is needed in your PR
+
+## Reporting Issues
+
+When reporting bugs or requesting features:
+
+- Use GitHub Issues
+- Provide clear reproduction steps
+- Include relevant logs or error messages
+- Specify which package is affected
+
+## Questions?
+
+- Review [FAQ.md](./FAQ.md) for common questions
+- Check [OPERATING_PRINCIPLES.md](./OPERATING_PRINCIPLES.md) for decision-making guidance
+- Ask in PR comments or open a discussion
+
+## Code of Conduct
+
+- Be respectful and constructive
+- Focus on the code, not the person
+- Welcome diverse perspectives
+- Help others learn and grow
+
+Thank you for helping improve the AI-DLC evaluation framework!
@@ -0,0 +1,106 @@
+# AI-DLC Workflows Evaluation & Reporting Framework - FAQ
+
+## What is this?
+
+A comprehensive testing and reporting framework that validates changes to the AI-DLC workflows repository. It automatically evaluates code quality, semantic correctness, and performance to ensure changes don't negatively impact the system.
+
+## Who is this for?
+
+- **Maintainers** who need confidence that changes are safe to merge
+- **Contributors** who want to demonstrate their changes improve (or don't harm) the system
+- **Users** who depend on consistent, high-quality AI-assisted development workflows
+
+## What are the major work streams?
+
+The framework is organized around six big rocks:
+
+**1. Golden Test Case**
+- Curated baseline test cases containing full AIDLC docs and code output
+- Versioned reference inputs that all evaluations run against
+- Ensures consistent, reproducible evaluation across changes
+
+**2. Execution Framework (Jeff)**
+- Core orchestration engine that runs golden test cases through each evaluation
+- Manages the pipeline from test case input to structured results output
+- Coordinates across all evaluation dimensions
+
+**3. Semantic Evaluation**
+- Uses AI to semantically evaluate outputs at major human review points
+- Scores outputs for correctness, completeness, and appropriateness
+- Validates that AI-generated content meets quality standards
+- All semantic metrics are reported **@k**: each evaluation runs multiple trials to account for non-determinism in AI-based grading (see "What does @k mean for semantic metrics?" below)
+
+**4. Code Evaluation**
+- **Linting:** Code style correctness
+- **Security:** Semgrep analysis for vulnerabilities
+- **Organization:** Code duplication detection, library usage patterns
+- Produces numeric scores (e.g., "3 high-severity security issues")
+
+**5. NFR Evaluation**
+- Token consumption per workflow
+- Execution time measurements
+- Cross-model consistency checks
+- Resource utilization metrics
+
+**6. GitHub CI/CD Integration & Management**
+- Automated pipelines triggering evaluations on PRs
+- Human-readable report generation and attachment
+- Versioned report archiving for historical comparison
+
+## How does it work?
+
+1. **Golden test cases** define the reference inputs (AIDLC docs + expected code output)
+2. The **execution framework** runs these test cases through each evaluation dimension
+3. **Semantic, code, and NFR evaluations** produce structured results
+4. **Reports** are generated summarizing impact across all dimensions
+5. **GitHub CI/CD** automates the entire pipeline on PRs and attaches reports for review
+6. Versioned reports are archived for historical comparison
+
+## What environments are supported?
+
+Kiro is a first-class citizen for testing, but the framework supports multiple AI tools and environments to meet customers where they are.
+
+## What does @k mean for semantic metrics?
+
+AI-based evaluations are non-deterministic — the same input can produce different scores across runs. To get trustworthy results, the framework runs each semantic evaluation multiple times (*k* trials) and reports two complementary metrics (see [Anthropic: Demystifying Evals for AI Agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)):
+
+- **pass@k** — The probability of at least one success in *k* attempts. Answers: *"Can this workflow produce a correct result?"* Higher *k* increases the score, since more attempts mean higher odds of at least one success.
+- **pass^k** — The probability that *all k* attempts succeed. Answers: *"Does this workflow consistently produce correct results?"* Higher *k* makes this harder to achieve, since every trial must pass.
+
+At *k*=1 the two metrics are identical (both equal the per-trial success rate). As *k* grows they diverge: for any per-trial success rate strictly between 0% and 100%, pass@k approaches 100% while pass^k drops toward 0%. Together they tell you both the capability ceiling and the reliability floor of a workflow change.
+
+Code evaluation and NFR metrics are deterministic and do not require @k.
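The relationship between the two metrics can be sketched as follows, assuming the *k* trials are independent and share a single per-trial success rate `p`. That assumption, and the function names, are illustrative only; the FAQ does not say how the framework itself estimates these values.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (pass^k)."""
    return p ** k


if __name__ == "__main__":
    p = 0.8  # assumed per-trial success rate, chosen only for illustration
    for k in (1, 3, 5):
        print(f"k={k}: pass@k={pass_at_k(p, k):.3f}  pass^k={pass_hat_k(p, k):.3f}")
```

With `p = 0.8`, both metrics equal 0.8 at *k*=1; at *k*=3, pass@k rises to 0.992 while pass^k falls to 0.512, which is the divergence described above.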
+
+## How do I interpret the reports?
+
+Reports include:
+- **Semantic scores @k:** AI-evaluated ratings with pass@k (capability) and pass^k (reliability)
+- **Code scores:** Numeric metrics for linting, security, duplication (deterministic)
+- **NFR metrics:** Token usage, execution time, consistency (deterministic)
+- **Trend analysis:** Comparison to previous versions (against golden test cases)
+- **Pass/fail gates:** Clear indicators of whether changes meet thresholds
+
+## What if my change shows a regression?
+
+A regression doesn't automatically block a merge; it provides context. Work with maintainers to:
+- Understand if the regression is acceptable given the benefits
+- Identify ways to mitigate the regression
+- Document known trade-offs
+
+## How does this relate to the AI-DLC workflows repository?
+
+This framework monitors and validates the [AI-DLC workflows](https://github.com/awslabs/aidlc-workflows) to ensure changes maintain or improve quality. It's a testing layer on top of the workflows themselves.
+
+## Can I run tests locally before submitting a PR?
+
+Yes—the framework is designed to run in CI/CD but can also be executed locally to get early feedback.
+
+## How are reports versioned?
+
+Each test run produces a numbered/named version that includes:
+- Timestamp and commit SHA
+- Full test results
+- Comparison to baseline
+- Human-readable summary
+
+Reports are stored for historical analysis and trend tracking.
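One way to picture a versioned report is as a small record carrying the fields listed above. This is a sketch under stated assumptions: the `ReportVersion` class, its field names, and the version-id format are all hypothetical and not taken from the framework.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ReportVersion:
    """Hypothetical record for one versioned report."""

    commit_sha: str
    timestamp: datetime
    results: dict[str, float]   # full test results, keyed by metric name
    baseline: dict[str, float]  # baseline values to compare against

    def version_id(self) -> str:
        """Build a sortable name from the timestamp and a short commit SHA."""
        return f"{self.timestamp:%Y%m%dT%H%M%SZ}-{self.commit_sha[:7]}"

    def deltas(self) -> dict[str, float]:
        """Difference from baseline for every metric present in both."""
        return {
            name: value - self.baseline[name]
            for name, value in self.results.items()
            if name in self.baseline
        }

    def summary(self) -> str:
        """Human-readable summary, one line per compared metric."""
        lines = [f"Report {self.version_id()}"]
        for name, delta in sorted(self.deltas().items()):
            lines.append(f"  {name}: {delta:+.2f} vs baseline")
        return "\n".join(lines)


if __name__ == "__main__":
    report = ReportVersion(
        commit_sha="abcdef1234",
        timestamp=datetime(2025, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
        results={"lint_errors": 3.0},
        baseline={"lint_errors": 5.0},
    )
    print(report.summary())
```

Putting the timestamp first in the identifier keeps stored reports in chronological order when sorted by name, which suits the historical comparison the FAQ describes.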