AIDLC Evaluation Framework — Design Document

1. Purpose

This document describes the architecture, design decisions, data flows, and internal mechanics of the AI-DLC Workflows Evaluation & Reporting Framework. It is intended for developers who need to understand how the system works, extend it, or debug it.

The framework validates changes to the AI-DLC workflows repository by running an AI-driven software development lifecycle end-to-end, then scoring the outputs across multiple quality dimensions: functional correctness, code quality, API contract conformance, and semantic similarity to a golden baseline.

2. High-Level Architecture

                              ┌──────────────────────┐
                              │   Entry Points (CLI) │
                              └──────────┬───────────┘
                 ┌───────────────────────┼──────────────────────┐
                 │                       │                      │
        run_evaluation.py      run_batch_evaluation.py   run_ide_evaluation.py
         (single model)          (multi-model loop)       (IDE adapter)
                 │                       │                      │
                 └───────────┬───────────┘                      │
                             │                                  │
              ┌──────────────▼──────────────┐   ┌───────────────▼──────────┐
              │       6-Stage Pipeline      │   │     IDE Harness          │
              │  ┌──────────────────────┐   │   │  ┌───────────────────┐   │
              │  │ 1. Execution         │   │   │  │ Adapter (Cursor,  │   │
              │  │    (Strands Swarm)   │   │   │  │ Cline, Kiro, ...) │   │
              │  ├──────────────────────┤   │   │  └────────┬──────────┘   │
              │  │ 2. Post-Run Tests    │   │   │           │              │
              │  ├──────────────────────┤   │   │  ┌────────▼──────────┐   │
              │  │ 3. Quantitative      │   │   │  │ Output Normalizer │   │
              │  ├──────────────────────┤   │   │  └────────┬──────────┘   │
              │  │ 4. Contract Tests    │   │   │           │              │
              │  ├──────────────────────┤   │   └───────────┼──────────────┘
              │  │ 5. Qualitative       │   │               │
              │  ├──────────────────────┤   │    ┌──────────▼──────────┐
              │  │ 6. Report Generation │   │    │ --evaluate-only     │
              │  └──────────────────────┘   │    │ (stages 2-6)        │
              └─────────────────────────────┘    └─────────────────────┘
                             │
              ┌──────────────▼───────────────┐
              │  runs/<timestamp>/           │
              │   ├── aidlc-docs/            │
              │   ├── workspace/             │
              │   ├── run-meta.yaml          │
              │   ├── run-metrics.yaml       │
              │   ├── test-results.yaml      │
              │   ├── quality-report.yaml    │
              │   ├── contract-test-results… │
              │   ├── qualitative-comparison…│
              │   ├── report.md / .html      │
              │   └── evaluation-config.yaml │
              └─────────────────────────────┘

3. Package Structure

The project uses a uv workspace (defined in the root pyproject.toml) with eight internal packages. Each package is independently structured with its own pyproject.toml, src/ layout, and tests/ directory.

Package	PyPI Name	Purpose
`packages/execution`	`aidlc-runner`	Two-agent swarm that runs the AIDLC workflow
`packages/qualitative`	`aidlc-qualitative`	Semantic scoring of documents vs golden baseline
`packages/quantitative`	`aidlc-quantitative`	Static analysis: linting, security, duplication
`packages/contracttest`	`aidlc-contracttest`	API contract testing against OpenAPI specs
`packages/nonfunctional`	`aidlc-nonfunctional`	NFR evaluation (tokens, timing, consistency)
`packages/reporting`	`aidlc-reporting`	Consolidated report generation (Markdown + HTML)
`packages/ide-harness`	(not published)	IDE adapter framework for third-party AI assistants
`packages/shared`	`aidlc-shared`	Common utilities shared across packages

Dependency graph (simplified):

run_evaluation.py ──► execution (aidlc-runner)
                  ──► quantitative
                  ──► contracttest
                  ──► qualitative
                  ──► reporting ──► reporting.collector
                                ──► reporting.baseline
                                ──► reporting.render_md
                                ──► reporting.render_html

All packages communicate through YAML files on disk. There are no in-process library-level dependencies between the evaluation packages — the orchestrator (run_evaluation.py) invokes each package as a subprocess via python -m <package>, passing file paths as arguments. This design keeps packages independently testable and allows each to be run in isolation.

4. Configuration System

4.1 Layered Config Resolution

Configuration follows a three-tier precedence model:

CLI flags  >  YAML config file  >  Built-in Python defaults

Built-in defaults are defined as dataclass field defaults in packages/execution/src/aidlc_runner/config.py (RunnerConfig and its nested dataclasses).
YAML config is loaded from config/default.yaml (or a custom path via --config). The _merge_dict_into_dataclass() function recursively overlays YAML values onto the dataclass tree.
CLI flags (e.g., --executor-model, --profile) are applied last, overriding both YAML and defaults.

4.2 Config Dataclass Hierarchy

RunnerConfig
  ├── aws: AwsConfig          # profile, region
  ├── models: ModelsConfig
  │     ├── executor: ModelConfig   # provider, model_id
  │     └── simulator: ModelConfig
  ├── aidlc: AidlcConfig      # rules_source, rules_repo, rules_ref
  ├── swarm: SwarmConfig       # max_handoffs, max_iterations, timeouts
  ├── runs: RunsConfig         # output_dir
  └── execution: ExecutionConfig  # enabled, command_timeout, post_run_tests

4.3 Per-Model Config Files

Files in config/ (e.g., config/sonnet-4-5.yaml, config/nova-pro.yaml) override only the models.executor.model_id field. The batch runner (run_batch_evaluation.py) discovers these automatically by scanning config/*.yaml and excluding default.yaml.

5. Stage-by-Stage Pipeline Design

5.1 Stage 1: Execution (`packages/execution`)

This is the core of the framework. It uses the Strands SDK multi-agent orchestration to run the full AIDLC workflow.

Two-Agent Swarm Architecture

                    ┌──────────────────────┐
                    │   Strands Swarm      │
                    │                      │
  initial prompt ──►│  ┌────────────────┐  │
                    │  │   Executor     │  │
                    │  │   Agent        │◄─┤── handoff ──┐
                    │  │                ├──┤── handoff ──│
                    │  └────────────────┘  │             │
                    │                      │  ┌──────────▼─┐
                    │                      │  │ Simulator  │
                    │                      │  │ Agent      │
                    │                      │  └────────────┘
                    └──────────────────────┘

Executor Agent — Drives the AIDLC workflow through all phases (Inception → Construction). It:

Loads AIDLC rule files on demand via the load_rule tool (lazy loading keeps context window usage low)
Reads/writes files in the run folder via sandboxed read_file, write_file, list_files tools
Executes shell commands (dependency install, test runs) via the run_command tool
Hands off to the Simulator when human input is needed (questions, approvals, reviews)

Simulator Agent — Acts as a simulated human stakeholder. It:

Has the vision document (and optional tech-env document) embedded in its system prompt
Answers clarifying questions, approves documents, reviews code
Always hands back to the Executor to continue the workflow

Key design decisions:

Sandboxed file operations: All file tools use _resolve_safe() to prevent path traversal outside the run folder
Sandboxed command execution: run_command uses a restricted environment (only PATH, HOME, LANG) to isolate execution
Lazy rule loading: Rules are loaded one-at-a-time as each stage begins, rather than pre-loading all rules into the system prompt
Progress streaming: AgentProgressHandler logs tool invocations to stderr without printing full LLM output; SwarmProgressHook logs handoff timing
Metrics collection: MetricsCollector records token usage, handoff timing, context size samples, and error events during execution

AIDLC Workflow Stages

The Executor drives this sequence (some stages are conditional based on project scope):

#	Stage	Phase	Conditional?
1	Workspace Detection	Inception	Always
2	Reverse Engineering	Inception	Brownfield only
3	Requirements Analysis	Inception	Always
4	User Stories	Inception	If complex
5	Workflow Planning	Inception	Always
6	Application Design	Inception	If needed
7	Units Generation	Inception	If needed
8	Functional Design	Construction	If needed
9	NFR Requirements	Construction	If needed
10	NFR Design	Construction	If needed
11	Infrastructure Design	Construction	If needed
12	Code Generation	Construction	Always
13	Build and Test	Construction	Always

Each stage loads its corresponding rule file (e.g., inception/requirements-analysis.md) before execution. The Executor writes all documentation artifacts to aidlc-docs/ and all generated code to workspace/.

Rules Setup

The runner either:

Git clones the AIDLC rules repository (default: awslabs/aidlc-workflows, ref configurable) into the run folder, then extracts the aidlc-rules/ content
Copies from a local path when rules_source: "local" is configured

Run Folder Layout

runs/<YYYYMMDDTHHMMSS>-<rules_slug>/
  ├── vision.md                      # Copied input
  ├── tech-env.md                    # Copied input (if provided)
  ├── aidlc-rules/                   # AIDLC workflow rules
  │   ├── aws-aidlc-rules/           # Core workflow definition
  │   └── aws-aidlc-rule-details/    # Per-stage rule files
  ├── aidlc-docs/                    # Generated AIDLC documents
  │   ├── inception/                 # Requirements, user stories, design docs
  │   ├── construction/              # Functional design, code-gen docs
  │   ├── aidlc-state.md             # Workflow state tracker
  │   └── audit.md                   # Timestamped audit log
  ├── workspace/                     # Generated application code
  └── run-meta.yaml                  # Run identity and config snapshot

Post-Run Test Evaluation

After the swarm completes, post_run.py performs automatic testing:

Project detection: BFS scan of workspace/ for marker files (pyproject.toml, package.json, Cargo.toml, go.mod) up to 3 levels deep
Dependency install: Runs the appropriate install command (e.g., uv pip install -e ".[dev]")
Test execution: Runs the appropriate test command (e.g., uv run pytest)
Output parsing: Language-specific parsers extract pass/fail counts from test output (pytest, Jest/Vitest, cargo test, go test)
Results: Written to test-results.yaml

5.2 Stage 2: Post-Run Tests (Summary)

This stage reads test-results.yaml written by Stage 1 and prints a human-readable summary. It is embedded in the execution stage — the orchestrator reads the file for its summary display.

5.3 Stage 3: Quantitative Analysis (`packages/quantitative`)

Runs static analysis tools against the generated code in workspace/. The analysis is language-aware.

Tool Selection by Project Type

Project Type	Linter	Security Scanner	Duplication
Python	ruff	bandit + semgrep	PMD CPD
Node.js	eslint	npm audit + semgrep	PMD CPD

Analysis Flow

scan_workspace(path)
  ├── detect project type (pyproject.toml → Python, package.json → Node)
  ├── run_ruff() or run_eslint()         → LintFinding[]
  ├── run_bandit() or run_npm_audit()    → SecurityFinding[]
  ├── run_semgrep()                       → SecurityFinding[]
  ├── run_cpd()                           → DuplicationFinding[]
  └── compute_summary()                   → QualityReport

Each tool runner:

Checks if the tool is available (shutil.which or uv run --version)
Executes with JSON output format
Parses structured output into standardized finding models
Returns a ToolResult with findings and metadata

Graceful degradation: If any tool is not installed, the analysis for that tool is skipped with a note — it never fails the evaluation.

Output: quality-report.yaml

5.4 Stage 4: Contract Tests (`packages/contracttest`)

Validates the generated application's API endpoints against an OpenAPI 3.x specification.

Architecture

openapi.yaml ──► spec.py (parser) ──► ContractSpec
                                          ├── AppConfig (module, port, framework)
                                          └── TestCase[] (from x-test-cases extensions)

workspace/ ──► server.py (ServerProcess) ──► uvicorn subprocess
                                                  │
                                                  ▼
ContractSpec ──► runner.py ──► HTTP requests ──► CaseResult[]
                                                  │
                                                  ▼
                                          ContractTestResults

Key mechanics:

Spec parsing: The OpenAPI spec uses custom x-app (server configuration) and x-test-cases (per-operation test inputs/expected outputs) extensions
Server management: ServerProcess creates an isolated venv for the workspace project, starts uvicorn, polls /health until ready, and cleanly shuts down after tests
Test execution: Each test case sends an HTTP request and validates: status code matches, response body contains expected keys/values (recursive deep match with floating-point tolerance)
Abort conditions: Testing stops early if the server process dies or after 3 consecutive connection errors

Output: contract-test-results.yaml

5.5 Stage 5: Qualitative Evaluation (`packages/qualitative`)

Compares the generated AIDLC documents against a golden baseline using semantic similarity scoring.

Document Matching

golden aidlc-docs/         candidate aidlc-docs/
  inception/                 inception/
    requirements.md    ◄──►   requirements.md         (paired)
    user-stories.md    ◄──►   user-stories.md         (paired)
  construction/              construction/
    code-generation.md ◄──►   code-generation.md      (paired)
                              extra-doc.md             (unmatched candidate)

Documents are paired by relative path. Internal workflow files (aidlc-state.md, audit.md) are excluded.

Scoring Dimensions

Each document pair is scored on three dimensions (0.0 to 1.0):

Dimension	Weight	What It Measures
Intent Similarity	0.4	Same goals, requirements, and purpose
Design Similarity	0.4	Same architecture, components, patterns
Completeness	0.2	Candidate covers all reference topics

Overall per-document = 0.4 × intent + 0.4 × design + 0.2 × completeness

Scores are aggregated per-phase (inception, construction) then into an overall score.

Two Scorer Implementations

HeuristicScorer (offline, deterministic):

Intent: Term-frequency cosine similarity with stopword removal
Design: Weighted blend of technical identifier Jaccard similarity (0.6) and heading structure Jaccard similarity (0.4)
Completeness: Fraction of reference headings present in candidate

LlmScorer (default, requires Bedrock):

Sends both documents to an LLM via the Bedrock converse API
Prompt asks for JSON with the three dimension scores plus notes
Uses temperature 0.0 for reproducibility
Content truncated to 15K characters per document

Output: qualitative-comparison.yaml

5.6 Stage 6: Report Generation (`packages/reporting`)

Generates consolidated reports by collecting all YAML artifacts from the run folder.

Data Collection

reporting.collector.collect(run_folder) reads all YAML files and assembles a ReportData dataclass containing:

RunMeta — identity, timing, models, rules
RunMetrics — tokens (total + per-agent), wall clock, handoff timeline, artifact counts, error counts, context size stats
TestResults — unit test pass/fail/total with pass percentage
QualityReport — lint, security, duplication findings
ContractResults — per-endpoint test results
QualitativeResults — per-document and per-phase semantic scores

Baseline Comparison

If a golden.yaml baseline file exists (auto-discovered next to the --golden directory), the report includes a regression comparison:

extract_baseline() flattens ReportData into a BaselineMetrics with ~30 numeric fields
compare() computes deltas and classifies each metric as improved/regressed/unchanged
Classification respects directionality (e.g., fewer lint errors = improved, higher test pass% = improved)

Output Formats

Markdown: render_markdown() produces GitHub-flavored Markdown with verdict banners, tables, delta indicators, and collapsible detail sections
HTML: render_html() wraps the Markdown with CSS styling for standalone viewing

6. Orchestrators

6.1 Single-Model Pipeline (`run_evaluation.py`)

The main entry point. Orchestrates all six stages sequentially:

parse CLI args
  │
  ├── --test mode ──► run pytest on all packages ──► exit
  │
  ├── --evaluate-only mode ──► skip Stage 1
  │     ├── Stage 3 (quantitative)
  │     ├── Stage 4 (contract)
  │     ├── Stage 5 (qualitative)
  │     └── Stage 6 (report)
  │
  └── full pipeline mode
        ├── Stage 1 (execution) ──► creates timestamped run folder
        ├── Save evaluation config and repo info
        ├── Stage 2 (read test-results.yaml from Stage 1)
        ├── Stage 3 (quantitative)
        ├── Stage 4 (contract, if --openapi provided)
        ├── Stage 5 (qualitative)
        ├── Stage 6 (report)
        └── Print summary, exit 0 if all pass

Resilience: If the Strands swarm exits non-zero but AIDLC documents were produced, evaluation continues (the swarm may fail on a late handoff after all documents are written).

6.2 Batch Evaluation (`run_batch_evaluation.py`)

Runs run_evaluation.py in a loop for each selected model config:

discover_models()     ← scans config/*.yaml, excludes default.yaml
  │
  for each model:
  │  ├── build CLI command with --executor-model override
  │  ├── run as subprocess, capture stdout/stderr to log file
  │  ├── find new timestamped run folder
  │  ├── rename folder: <timestamp>-<slug>-<model-name>
  │  └── write per-model batch-summary.yaml
  │
  write batch-summary.yaml with timing and pass/fail for all models

Each model run is fully isolated — a separate subprocess invocation with its own run folder.

6.3 Cross-Model Comparison (`run_comparison_report.py`)

Generates a side-by-side comparison matrix after batch evaluation:

find_model_runs()     ← discovers run folders by model name suffix
  │
  for each model:
  │  └── collect() + extract_baseline() → BaselineMetrics
  │
  load golden baseline (golden.yaml)
  │
  generate_comparison_markdown()   → comparison-report.md
  generate_comparison_yaml()       → comparison-data.yaml

The comparison table includes ~30 metrics across unit tests, contract tests, code quality, qualitative scores, artifacts, execution cost, and context size — with delta indicators (^ better, v worse) relative to the golden baseline.

6.4 IDE Evaluation (`run_ide_evaluation.py`)

Runs the AIDLC workflow through third-party IDE AI assistants:

get_adapter(name)     ← lazy import from registry
  │
  ├── check_prerequisites()
  ├── adapter.run(config) ──► IDE-specific automation
  ├── normalize_output()  ──► standard run folder layout
  └── run_evaluation.py --evaluate-only  ──► stages 2-6

Adapter pattern: Each IDE is implemented as a subclass of IDEAdapter with three methods:

check_prerequisites() — verify the IDE is installed and configured
run(config) — execute the AIDLC process through the IDE
name — human-readable identifier

Output normalization: normalizer.py converts IDE-specific output layouts into the standard run folder structure expected by the evaluation pipeline, generating synthetic run-meta.yaml and run-metrics.yaml.

Supported adapters: Cursor, Cline, Copilot, Kiro, Windsurf, Antigravity.

6.5 CLI Evaluation (`run_cli_evaluation.py`)

Runs the AIDLC workflow through CLI-based AI assistants (Claude Code, Kiro CLI, etc.):

load_adapters_from_config(cfg_data)  ← register any custom adapters from config.yaml
  │
get_adapter(name)     ← lazy import from registry
  │
  ├── check_prerequisites()
  ├── HumanSimulator built once by orchestrator (vision + tech_env + openapi injected)
  ├── adapter.run(config) ──► CLI-specific automation + simulator gate reviews
  ├── normalize_output()  ──► standard run folder layout
  └── run_evaluation.py --evaluate-only  ──► stages 2-6

Adapter pattern: Each CLI tool is implemented as a subclass of CLIAdapter (packages/cli-harness/src/cli_harness/adapter.py) with three methods:

name — human-readable identifier (e.g. "kiro-cli")
check_prerequisites() — verify the CLI tool is installed and credentials are valid
run(config: AdapterConfig) -> AdapterResult — execute the AIDLC workflow and return results

HumanSimulator injection: The orchestrator constructs a single HumanSimulator with the full document context (vision, tech-env, OpenAPI spec) before calling the adapter. It is passed in as config.simulator. Adapters access it via config.simulator.respond(message) — they do not construct it themselves.

Simulator gates: Adapters use config.simulator to inject human-reviewer feedback at key workflow stages. The kiro-cli adapter uses 4 stage gates (requirements → design → code-gen plan → construction); the claude-code-sdk adapter intercepts handoff_to_simulator tool calls inline.

Plugin registration: Custom adapters can be added without modifying framework code — see Adding a New CLI Adapter below.

Supported built-in adapters: claude-code, claude-code-sdk, kiro-cli.

7. Data Flow: YAML Artifact Graph

Every stage communicates through YAML files in the run folder. No in-memory state crosses stage boundaries.

Stage 1 (execution)
  ├── writes: run-meta.yaml, run-metrics.yaml, test-results.yaml
  ├── writes: aidlc-docs/**/*.md, workspace/**/*
  │
Stage 3 (quantitative) reads: workspace/
  └── writes: quality-report.yaml
  │
Stage 4 (contract) reads: workspace/, openapi.yaml (test input)
  └── writes: contract-test-results.yaml
  │
Stage 5 (qualitative) reads: aidlc-docs/, golden-aidlc-docs/ (test input)
  └── writes: qualitative-comparison.yaml
  │
Stage 6 (report) reads: ALL of the above YAML files + golden.yaml
  └── writes: report.md, report.html

The orchestrator also writes evaluation-config.yaml (full resolved config snapshot) and updates run-meta.yaml with evaluation-level fields.

8. Key Data Models

8.1 Execution Metrics (`run-metrics.yaml`)

tokens:
  total: {input_tokens, output_tokens, total_tokens, cache_read_tokens, cache_write_tokens}
  per_agent:
    executor: {input_tokens, output_tokens, total_tokens}
    simulator: {input_tokens, output_tokens, total_tokens}
timing:
  total_wall_clock_ms: int
  handoffs: [{handoff: int, node_id: str, duration_ms: int}, ...]
handoff_patterns:
  total_handoffs: int
  sequence: [str, ...]
  per_agent: {agent: {turn_count, total_duration_ms, avg_turn_duration_ms}}
artifacts:
  workspace: {source_files, test_files, config_files, total_files, total_lines_of_code}
  aidlc_docs: {inception_files, construction_files, total_files}
errors:
  throttle_events, timeout_events, failed_tool_calls, model_error_events, ...
context_size:
  total: {min_tokens, max_tokens, avg_tokens, median_tokens, sample_count}
  per_agent: {executor: {...}, simulator: {...}}

8.2 Qualitative Scores (`qualitative-comparison.yaml`)

overall_score: float  # 0.0 to 1.0
phases:
  - phase: inception
    avg_intent: float
    avg_design: float
    avg_completeness: float
    avg_overall: float
    documents:
      - path: inception/requirements.md
        intent_similarity: float
        design_similarity: float
        completeness: float
        overall: float
        notes: str

8.3 Golden Baseline (`golden.yaml`)

A flat numeric snapshot of ~30 key metrics from a promoted run. Used as the regression comparison target. Fields span execution cost, artifacts, test results, code quality, and qualitative scores.

9. Tool Integration

9.1 Strands SDK (Multi-Agent)

The execution package uses the Strands Agents SDK for:

Agent — wraps a Bedrock model with a system prompt and tool set
Swarm — orchestrates handoffs between agents with configurable limits (max handoffs, max iterations, execution timeout, node timeout)
@tool decorator — registers Python functions as callable tools for agents
BedrockModel — Bedrock model provider with configurable retry policy
Hook system — BeforeNodeCallEvent / AfterNodeCallEvent for progress tracking

9.2 Amazon Bedrock

All LLM calls go through Amazon Bedrock via boto3. Configuration:

Read timeout: 900s (15 min) for execution agents, 300s (5 min) for the qualitative scorer
Connect timeout: 30s
Retry policy: 10 attempts with adaptive mode
Models: Configurable per role (executor, simulator, scorer)

9.3 Static Analysis Tools

Tool	Purpose	Output Format	Graceful Degradation
ruff	Python linting	JSON	Skipped if not on PATH
bandit	Python security	JSON	Skipped if not on PATH
semgrep	Multi-language security	JSON	Skipped if not on PATH
eslint	JS/TS linting	JSON	Falls back to npx
npm audit	JS dependency security	JSON	Needs package-lock.json
PMD CPD	Code duplication	XML	Configurable path or PATH scan

10. Security Model

10.1 File Sandboxing

All file operations performed by AI agents are sandboxed to the run folder:

_resolve_safe(run_folder, relative_path) resolves the path and verifies it stays within the run folder boundary
Path traversal attempts (e.g., ../../etc/passwd) are rejected with a ValueError
Applied to: read_file, write_file, list_files, run_command

10.2 Command Sandboxing

The run_command tool provides a restricted shell environment:

Only PATH, HOME, LANG, TERM are set (plus tool-specific vars like UV_CACHE_DIR)
HOME is set to the run folder to prevent reading host user configuration
Commands have a configurable timeout (default 120s)
Output is truncated at 50K characters

10.3 Server Isolation (Contract Tests)

The contract test server runs in its own venv:

ServerProcess._ensure_venv() creates an isolated venv in the workspace project
This prevents uv run from walking up the directory tree and resolving the parent project
The server is started via the venv's own Python binary

11. Test Cases

Test cases live in test_cases/ and follow a standard structure:

test_cases/<case-name>/
  ├── vision.md              # Project vision and constraints
  ├── tech-env.md            # Technical environment requirements
  ├── openapi.yaml           # API contract spec with x-test-cases
  ├── golden-aidlc-docs/     # Reference aidlc-docs output (golden baseline)
  │   ├── inception/
  │   │   ├── requirements.md
  │   │   └── ...
  │   └── construction/
  │       ├── code-generation.md
  │       └── ...
  └── golden.yaml            # Promoted baseline metrics

The default test case is sci-calc (a scientific calculator API). All CLI defaults point to this test case.

12. Extension Points

Adding a New Model

Create config/<model-name>.yaml with models.executor.model_id set to the Bedrock model ID
The batch runner will automatically discover it

Adding a New CLI Adapter

CLI adapters live in packages/cli-harness and follow a plugin pattern — no framework code changes are needed.

Step 1 — Implement the adapter

Create a module anywhere importable (e.g. packages/cli-harness/src/cli_harness/adapters/my_tool.py):

from cli_harness.adapter import AdapterConfig, AdapterResult, CLIAdapter

class MyToolAdapter(CLIAdapter):
    @property
    def name(self) -> str:
        return "my-tool"

    def check_prerequisites(self) -> tuple[bool, str]:
        import shutil
        if not shutil.which("my-tool"):
            return False, "'my-tool' not found in PATH"
        return True, "my-tool found"

    def run(self, config: AdapterConfig) -> AdapterResult:
        import time, shutil
        from cli_harness.normalizer import normalize_output

        start = time.monotonic()
        workspace = config.output_dir / "workspace"
        workspace.mkdir(parents=True, exist_ok=True)

        # Copy inputs, inject rules, run the CLI tool...
        # Use config.simulator.respond(message) at review gates.
        simulator = config.simulator  # pre-built with vision/tech_env/openapi context
        if simulator is None:
            raise RuntimeError("my-tool requires a simulator (set --simulator-model)")

        # ... run CLI tool stages, call simulator.respond() between stages ...

        elapsed = time.monotonic() - start
        normalize_output(
            source_dir=workspace,
            output_dir=config.output_dir,
            adapter_name=self.name,
            elapsed_seconds=elapsed,
        )
        dst_docs = config.output_dir / "aidlc-docs"
        return AdapterResult(
            success=dst_docs.is_dir(),
            output_dir=config.output_dir,
            aidlc_docs_dir=dst_docs if dst_docs.is_dir() else None,
            workspace_dir=workspace,
            elapsed_seconds=elapsed,
        )

Step 2 — Register in config (no framework edits needed)

Add one line to config/default.yaml (or your own config file):

cli:
  adapters:
    my-tool: "cli_harness.adapters.my_tool.MyToolAdapter"

Step 3 — Verify

# Confirm it appears
uv run python run.py cli --list

# Check prerequisites
uv run python run.py cli --cli my-tool --check-only

# Run evaluation
uv run python run.py cli --cli my-tool --scenario sci-calc

Key contracts for adapter implementors:

What	Where	Notes
Abstract base	`cli_harness/adapter.py` - `CLIAdapter`	Implement `name`, `check_prerequisites`, `run`
Simulator	`config.simulator` (`HumanSimulator`)	Call `.respond(message)` at review gates; never construct it yourself
Output layout	`cli_harness/normalizer.py` (`normalize_output()`)	Call at end of `run()` to write `run-meta.yaml` / `run-metrics.yaml`
Post-run tests	`aidlc_runner.post_run.run_post_evaluation()`	Optional; call after `normalize_output()` to run generated project tests
Document context	`config.vision_path`, `config.tech_env_path`, `config.openapi_content`	Available if needed; simulator already has this context

Adding a New IDE Adapter

Create packages/ide-harness/src/ide_harness/adapters/<name>.py
Implement the IDEAdapter abstract class (three methods: name, check_prerequisites, run)
Register in _ADAPTER_MAP in packages/ide-harness/src/ide_harness/registry.py

Adding a New Static Analysis Tool

Add an analyzer function in packages/quantitative/src/quantitative/analyzers.py (follow the run_ruff pattern)
Define a finding model if needed in models.py
Call it from scanner.py based on project type detection

Adding a New Test Case

Create a directory under test_cases/<case-name>/
Provide vision.md, tech-env.md, and optionally openapi.yaml
Run the full pipeline once to generate the golden baseline
Use reporting.baseline.promote() to create golden.yaml
Copy the run's aidlc-docs/ as golden-aidlc-docs/

13. Dependency Stack

Component	Technology
Language	Python 3.13+
Package manager	uv (workspace mode)
AI orchestration	Strands Agents SDK
LLM provider	Amazon Bedrock (boto3)
HTTP client	httpx (contract tests)
ASGI server	uvicorn (contract tests)
Test framework	pytest
Serialization	PyYAML
Linting	ruff
Security scanning	bandit, semgrep
Duplication detection	PMD CPD (external)

FilesExpand file tree

ARCHITECTURE.md

Latest commit

History