Build Pipeline

Daniel Babjak edited this page Apr 8, 2026 · 1 revision

The build pipeline is the bounded context that turns "implement this feature" into actual files on disk inside an isolated workspace, verifies them, runs acceptance evaluation, and packages a delivery the operator can ship.

Code: agent/build/. Tests: tests/test_build_domain.py.

Design contract. A build is a deterministic transformation of a workspace. The codegen step (LLM) is the only stochastic component. Everything around it — workspace setup, mutation apply, verification, acceptance, delivery — is deterministic and resumable.


Lifecycle

operator (Telegram /build, /intake, or HTTP /api/build)
    │
    ▼
intake.qualify_operator_intake()
    │   ├─ validate work_type, repo, description, focus_areas
    │   └─ structured denial if blocked
    ▼
intake.preview_operator_intake()  ← optional dry-run
    │
    ▼
intake.submit_operator_intake()
    │   ├─ budget check (hard cap, stop-loss, single-tx approval cap)
    │   ├─ multi-step approval if required
    │   └─ schedules BuildJob
    ▼
BuildService.run_build()
    │
    ├─ 1. workspace setup
    │     - WorkspaceManager creates an isolated dir under workspaces/<id>/
    │     - hash-chained audit trail records every file mutation
    │     - workspace_id linked to build job in SQLite
    │
    ├─ 2. capability validation
    │     - check the requested capability is in the catalog
    │     - block unsupported scopes early
    │
    ├─ 3. discover or accept implementation plan
    │     - if operator provided plan_file → use it
    │     - else call codegen (LLM → BuildOperation[])
    │     - AUDIT_MARKER_ONLY guard: refuse if codegen failed or plan is empty
    │
    ├─ 4. apply mutations
    │     - 10 supported mutation types (see below)
    │     - each mutation hashed into the workspace audit chain
    │     - capability scope enforced per-mutation
    │
    ├─ 5. verification
    │     - discover or accept verification plan
    │     - run test/lint/typecheck steps in Docker (agent/build/docker_executor.py)
    │     - 512 MB RAM, 1 CPU, no network, image whitelist
    │     - capture artifacts: stdout, stderr, exit code, duration
    │
    ├─ 6. acceptance evaluation
    │     - acceptance criteria from operator (or auto-discovered)
    │     - 3 evaluator types: auto, verify, review
    │     - each criterion scored independently
    │
    ├─ 7. artifact persistence
    │     - BuildStorage (SQLite WAL mode) saves job + every artifact
    │     - artifacts: patch.diff, build_report.md, verification.json, ...
    │     - whitelisted DDL identifiers, parameterized writes
    │
    └─ 8. delivery package
          - prepare bundle (workspace tarball + artifacts)
          - awaiting_approval state
          - operator approves via /deliver, /api/operator/deliveries, or dashboard
          - handed_off → operator can rsync/scp/git push the bundle

Build job model

agent/build/models.py::BuildJob:

| Field | Type | Purpose |
| --- | --- | --- |
| id | str | UUID-like, used everywhere as the join key |
| build_type | enum | IMPLEMENT_FROM_SCRATCH, IMPLEMENT_FROM_PLAN, BOUNDED_REFACTOR, ... |
| status | enum | PROPOSED, RUNNING, VERIFYING, COMPLETED, FAILED, BLOCKED, DELIVERED |
| implementation_mode | enum | LLM_GENERATED, OPERATOR_PLAN, AUDIT_MARKER_ONLY |
| repo_path | str | Where the source code lives |
| description | str | Operator's natural-language ask |
| target_files | list[str] | Optional scoping hint |
| implementation_plan | list[BuildOperation] | The mutation plan (from codegen or operator) |
| implementation_results | list[BuildOperationResult] | What actually happened per mutation |
| verification_plan | list[VerificationStep] | test/lint/typecheck steps to run |
| verification_results | list[VerificationResult] | Per-step outcome |
| acceptance_criteria | list[AcceptanceCriterion] | Operator's "is this done?" definition |
| acceptance_results | list[AcceptanceResult] | Per-criterion verdict |
| workspace_id | str | Link to agent/work/workspace.py |
| delivery_state | enum | NONE, PREPARED, AWAITING_APPROVAL, APPROVED, HANDED_OFF, REJECTED |
| timing | object | created_at, started_at, completed_at, durations |
| usage | object | input_tokens, output_tokens, cost_usd |

Mutation types

Defined in BuildOperation.kind:

| Kind | What it does |
| --- | --- |
| create_file | Write a new file (no overwrite; fails if the file exists) |
| overwrite_file | Replace an existing file's contents |
| edit_file | Targeted edit (find + replace, with context match) |
| delete_file | Remove a file (capability scope must allow it) |
| copy_file | Copy from source_path to target_path |
| move_file | Atomic rename within the workspace |
| create_dir | mkdir -p |
| append_file | Append text to an existing file |
| prepend_file | Insert text at the top |
| apply_patch | Apply a unified diff |
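
The difference between create_file and overwrite_file can be sketched as follows. This is a minimal illustration, not the code in agent/build/: the function names, the `root` parameter, and the choice of exceptions are assumptions made for the example.

```python
from pathlib import Path


def create_file(root: Path, target_path: str, content: str) -> int:
    """Write a new file and return bytes written; fail if it already exists."""
    path = root / target_path
    path.parent.mkdir(parents=True, exist_ok=True)
    data = content.encode("utf-8")
    # Mode "xb" is exclusive creation: open() raises FileExistsError
    # instead of silently overwriting an existing file.
    with open(path, "xb") as handle:
        return handle.write(data)


def overwrite_file(root: Path, target_path: str, content: str) -> int:
    """Replace the contents of an existing file; fail if it is missing."""
    path = root / target_path
    if not path.is_file():
        raise FileNotFoundError(target_path)
    data = content.encode("utf-8")
    path.write_bytes(data)
    return len(data)
```

Keeping the two kinds separate means a plan that accidentally targets an existing file with create_file fails loudly rather than destroying content.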

Every mutation:

  • Validates its target_path against the capability scope (allowed prefixes/suffixes per build type).
  • Records itself in the workspace audit chain (each entry hashed with the previous, so tampering is detectable).
  • Returns a structured BuildOperationResult with success, error, bytes_written, path.
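
To illustrate why a hash-chained trail makes tampering detectable, here is a small sketch. The entry layout, the `GENESIS` seed, and the use of SHA-256 over canonical JSON are assumptions for the example; the real workspace audit format may differ.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical seed used as the "previous hash" of the first entry


def _digest(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with this entry's payload."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + body).hexdigest()


def append_entry(chain: list, payload: dict) -> dict:
    """Append a mutation record that is linked to the entry before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": _digest(prev_hash, payload),
    }
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every hash in order; any edited entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Because each hash covers the previous one, changing an early entry invalidates it and every entry after it unless all later hashes are recomputed as well.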

If any mutation fails and stop_on_first_failure=True, the rest of the plan is skipped and the job moves to FAILED with the failed operation listed in artifacts.
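
A sketch of how scope enforcement and stop_on_first_failure could interact is shown below. `OpResult`, `in_scope`, and `apply_plan` are hypothetical names, operations are plain dicts here, and only prefix checks are modelled (the real scope also covers suffixes).

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable, Optional


@dataclass
class OpResult:
    path: str
    success: bool
    error: Optional[str] = None


def in_scope(target_path: str, allowed_prefixes: tuple) -> bool:
    """Reject absolute paths and '..' segments, then require an allowed prefix."""
    path = PurePosixPath(target_path)
    if path.is_absolute() or ".." in path.parts:
        return False
    return any(path.is_relative_to(prefix) for prefix in allowed_prefixes)


def apply_plan(
    operations: list,
    apply_one: Callable[[dict], None],
    allowed_prefixes: tuple,
    stop_on_first_failure: bool = True,
) -> list:
    """Apply operations in order and return one result per attempted operation."""
    results = []
    for op in operations:
        target = op["target_path"]
        if not in_scope(target, allowed_prefixes):
            result = OpResult(target, False, "target_path outside capability scope")
        else:
            try:
                apply_one(op)
                result = OpResult(target, True)
            except OSError as exc:
                result = OpResult(target, False, str(exc))
        results.append(result)
        if not result.success and stop_on_first_failure:
            break  # remaining operations are skipped, not attempted
    return results
```

With the flag set, the result list ends at the failed operation, which is what lets the job report exactly where the plan stopped.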


The AUDIT_MARKER_ONLY fail-closed guard

Before v1.35.0, a build with implementation_mode == AUDIT_MARKER_ONLY (used for builds that only emit a marker file rather than mutate code) could pass verification even if the codegen LLM call failed or returned no real plan. The job would silently complete with an empty result.

The fix in agent/build/service.py rejects this combination explicitly:

if (
    job.implementation_mode == BuildImplementationMode.AUDIT_MARKER_ONLY
    and (codegen_failed or no_real_plan)
):
    job.status = BuildJobStatus.FAILED
    job.fail_reason = "AUDIT_MARKER_ONLY: codegen failed or returned no real plan"
    return job

Tested in tests/test_build_domain.py::TestCodegenFallbackGuard.
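
The guard can be exercised in isolation with stand-in types. Everything below is a self-contained sketch: the enum values, the `Job` dataclass, and the rule that "no real plan" means an empty list are assumptions, not the definitions in agent/build/models.py.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class BuildImplementationMode(Enum):
    LLM_GENERATED = "llm_generated"
    OPERATOR_PLAN = "operator_plan"
    AUDIT_MARKER_ONLY = "audit_marker_only"


class BuildJobStatus(Enum):
    RUNNING = "running"
    FAILED = "failed"


@dataclass
class Job:  # stand-in for BuildJob with only the fields the guard touches
    implementation_mode: BuildImplementationMode
    status: BuildJobStatus = BuildJobStatus.RUNNING
    fail_reason: Optional[str] = None


def apply_marker_guard(job: Job, codegen_failed: bool, plan: list) -> Job:
    """Fail closed when a marker-only build has no usable plan behind it."""
    no_real_plan = len(plan) == 0  # assumption: an empty plan is "no real plan"
    if (
        job.implementation_mode is BuildImplementationMode.AUDIT_MARKER_ONLY
        and (codegen_failed or no_real_plan)
    ):
        job.status = BuildJobStatus.FAILED
        job.fail_reason = "AUDIT_MARKER_ONLY: codegen failed or returned no real plan"
    return job
```

The point of the fail-closed shape is that the absence of a plan is treated as an error rather than as "nothing to do".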


Verification

Verification steps live in agent/build/verification.py. Each step has:

  • kind: pytest, ruff, mypy, docker_build, custom_command, ...
  • command: shell command (run inside Docker if requires_isolation=True)
  • working_dir: relative to the workspace root
  • timeout_seconds: hard cap
  • success_signal: regex or exit-code condition
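
As a rough picture of these fields, the sketch below models a step as a dataclass and evaluates its success signal. The defaults and the convention that `success_signal=None` means "exit code 0" are assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class VerificationStep:
    kind: str
    command: str
    working_dir: str = "."
    timeout_seconds: int = 300  # assumed default
    requires_isolation: bool = True
    success_signal: Optional[str] = None  # regex; None means "exit code 0" (assumption)


def step_passed(step: VerificationStep, exit_code: int, stdout: str) -> bool:
    """Decide the outcome from the exit code or from a regex match on stdout."""
    if step.success_signal is None:
        return exit_code == 0
    return re.search(step.success_signal, stdout) is not None
```

For instance, a step with `success_signal=r"\d+ passed"` would pass on pytest's summary line regardless of the exit code, which is why a regex signal should be chosen carefully.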

The discovered plan is the default (auto-detect pyproject.toml, package.json, Cargo.toml, ...). The operator can override with an explicit --build-verification-file.
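
Auto-detection of this kind can be approximated by looking for marker files in the workspace root. The mapping below is purely illustrative: the commands attached to each marker are assumptions, and `discover_verification_plan` is a hypothetical name.

```python
from pathlib import Path

# Hypothetical mapping from marker file to (kind, command) pairs.
MARKER_STEPS = {
    "pyproject.toml": [
        ("pytest", "pytest -q"),
        ("ruff", "ruff check ."),
        ("mypy", "mypy ."),
    ],
    "package.json": [("custom_command", "npm test")],
    "Cargo.toml": [("custom_command", "cargo test")],
}


def discover_verification_plan(workspace_root: Path) -> list:
    """Return default steps for every project marker found in the root."""
    plan = []
    for marker, steps in MARKER_STEPS.items():
        if (workspace_root / marker).is_file():
            plan.extend(
                {"kind": kind, "command": command, "working_dir": "."}
                for kind, command in steps
            )
    return plan
```

An operator-supplied verification file would simply replace the list this function returns.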

Verification results are persisted as BuildArtifact(kind=VERIFICATION).


Docker isolation

Build commands marked requires_isolation=True run inside a Docker container managed by agent/build/docker_executor.py:

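# Comments sit on their own lines so the backslash continuations stay valid shell.
#   --network=none       no network during tests
#   --memory / --cpus    512 MB and 1 CPU per build
#   --read-only          read-only container root filesystem
#                        (the read-only workspace bind mount is not shown here)
#   --tmpfs /work        writable copy of the workspace
#   python:3.12-slim     image whitelist (same as sandbox)
# docker run has no --timeout flag, so the 5-minute hard cap has to be enforced
# from outside the container; coreutils timeout is shown here as one way to do it.
timeout 300s docker run --rm \
  --network=none \
  --memory=512m --cpus=1.0 \
  --read-only \
  --tmpfs /work:size=512m \
  --security-opt=no-new-privileges \
  --pids-limit=50 \
  python:3.12-slim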

The workspace is mounted read-only, and the build container works on a writable tmpfs copy. After verification, the diff between the original workspace and the tmpfs copy is captured as a patch.diff artifact and applied back to the workspace audit trail.

This isolation is what makes builds safe to run without operator approval per-mutation. The blast radius is contained by the kernel.
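
One way an executor might assemble such an invocation from Python is sketched below. The function names, the whitelist contents, and the `/src` mount point are assumptions; the hard cap is enforced through the `timeout` argument of `subprocess.run`, since the limit has to live outside the docker flags.

```python
import subprocess

ALLOWED_IMAGES = frozenset({"python:3.12-slim"})  # hypothetical whitelist contents


def build_docker_argv(image: str, command: str, workspace: str) -> list:
    """Assemble the docker run argument vector; refuse images off the whitelist."""
    if image not in ALLOWED_IMAGES:
        raise ValueError(f"image not whitelisted: {image}")
    return [
        "docker", "run", "--rm",
        "--network=none",                    # no network during tests
        "--memory=512m", "--cpus=1.0",       # resource limits per build
        "--read-only",                       # read-only container root filesystem
        "--tmpfs", "/work:size=512m",        # writable scratch area
        "--security-opt=no-new-privileges",
        "--pids-limit=50",
        "--volume", f"{workspace}:/src:ro",  # assumed read-only workspace mount
        "--workdir", "/work",
        image,
        "sh", "-c", command,
    ]


def run_isolated(argv: list, timeout_seconds: int = 300) -> subprocess.CompletedProcess:
    """Run the container; subprocess raises TimeoutExpired past the hard cap."""
    return subprocess.run(argv, capture_output=True, text=True, timeout=timeout_seconds)
```

Passing the command as a list rather than a shell string keeps the host shell out of the picture; only the container's `sh -c` ever interprets the verification command.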


Acceptance criteria

AcceptanceCriterion has three evaluator types:

| Evaluator | What it does |
| --- | --- |
| auto | Static check on the workspace (file exists, function exists, line count below threshold, ...) |
| verify | Re-uses a verification step's result (e.g. "all pytest tests pass") |
| review | Requires operator approval; blocks COMPLETED until manually accepted |

Example acceptance file:

{
  "criteria": [
    {
      "id": "tests-green",
      "evaluator": "verify",
      "verification_step_id": "pytest",
      "expected": "passed"
    },
    {
      "id": "no-todo-comments",
      "evaluator": "auto",
      "check": "grep_count",
      "pattern": "TODO|FIXME",
      "max_count": 0
    },
    {
      "id": "operator-sign-off",
      "evaluator": "review",
      "description": "operator confirms the new endpoint behaves correctly"
    }
  ]
}

A criterion that fails marks the whole job as BLOCKED (not FAILED) so the operator can either fix it manually or override.
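
The three evaluator types and the BLOCKED rule can be sketched against the example file above. The function names, the in-memory `files` mapping, the "pending" verdict for unreviewed criteria, and the choice to leave the status unchanged while a review is outstanding are all assumptions for illustration.

```python
import re


def evaluate_criterion(criterion: dict, files: dict, verification: dict, approvals: set) -> str:
    """Return 'passed', 'failed', or 'pending' for a single criterion."""
    evaluator = criterion["evaluator"]
    if evaluator == "verify":
        # Re-use the recorded outcome of a verification step.
        outcome = verification.get(criterion["verification_step_id"])
        return "passed" if outcome == criterion["expected"] else "failed"
    if evaluator == "auto":
        if criterion["check"] != "grep_count":
            raise ValueError(f"unsupported auto check: {criterion['check']}")
        pattern = re.compile(criterion["pattern"])
        count = sum(len(pattern.findall(text)) for text in files.values())
        return "passed" if count <= criterion["max_count"] else "failed"
    if evaluator == "review":
        # Stays pending until the operator has explicitly accepted it.
        return "passed" if criterion["id"] in approvals else "pending"
    raise ValueError(f"unknown evaluator: {evaluator}")


def resolve_status(verdicts: dict, current: str = "VERIFYING") -> str:
    """Any failure blocks the job; COMPLETED requires every criterion to pass."""
    if "failed" in verdicts.values():
        return "BLOCKED"
    if all(verdict == "passed" for verdict in verdicts.values()):
        return "COMPLETED"
    return current  # assumption: pending reviews leave the status unchanged
```

Scoring each criterion independently, as here, lets the operator see every failing criterion at once instead of only the first.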


Delivery packaging

After acceptance, the build moves into the delivery lifecycle:

PREPARED  ← bundle is built (workspace tarball + manifest + artifacts)
   │
   ▼
AWAITING_APPROVAL  ← operator must approve via /deliver, /api/operator/deliveries, or dashboard
   │
   ├─ APPROVED  → HANDED_OFF (operator picks up the bundle)
   │
   └─ REJECTED  → operator can re-run, modify, or close the job

Delivery state changes are recorded as control plane traces and surfaced in the operator inbox (agent --report).
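
The lifecycle above amounts to a small state machine. In the sketch below the enum member names follow the delivery_state field, while the transition table is an assumption: in particular, NONE → PREPARED and treating REJECTED as terminal are guesses for the example.

```python
from enum import Enum


class DeliveryState(Enum):
    NONE = "none"
    PREPARED = "prepared"
    AWAITING_APPROVAL = "awaiting_approval"
    APPROVED = "approved"
    HANDED_OFF = "handed_off"
    REJECTED = "rejected"


# Hypothetical transition table derived from the diagram above.
ALLOWED_TRANSITIONS = {
    DeliveryState.NONE: {DeliveryState.PREPARED},
    DeliveryState.PREPARED: {DeliveryState.AWAITING_APPROVAL},
    DeliveryState.AWAITING_APPROVAL: {DeliveryState.APPROVED, DeliveryState.REJECTED},
    DeliveryState.APPROVED: {DeliveryState.HANDED_OFF},
    DeliveryState.HANDED_OFF: set(),
    DeliveryState.REJECTED: set(),
}


def transition(current: DeliveryState, target: DeliveryState) -> DeliveryState:
    """Return the new state, or raise if the move skips a required step."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal delivery transition: {current.name} -> {target.name}")
    return target
```

An explicit table like this makes it impossible to reach HANDED_OFF without passing through AWAITING_APPROVAL and APPROVED.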


Storage layout (agent/build/storage.py)

CREATE TABLE build_jobs (
    id TEXT PRIMARY KEY,
    data TEXT NOT NULL,           -- full BuildJob JSON
    status TEXT,
    build_type TEXT,
    created_at TEXT
);

CREATE TABLE build_artifacts (
    id TEXT PRIMARY KEY,
    job_id TEXT NOT NULL,
    artifact_kind TEXT,
    content TEXT,
    content_json TEXT,
    format TEXT DEFAULT 'text',
    created_at TEXT
);

  • WAL mode (PRAGMA journal_mode=WAL) for safe concurrent reads.
  • check_same_thread=False because blocking work is handed off from the asyncio loop to executor threads, so the connection is used from threads other than the one that created it.
  • Dynamic DDL (_ensure_text_column) goes through the whitelist + identifier regex + escape pattern, the same approach as agent/review/storage.py.
  • Artifact content is capped at 5 MB per row; oversized payloads are truncated with a warning event.
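
The storage rules above can be sketched with the standard sqlite3 module. The whitelist contents, the identifier regex, and the helper names are assumptions; this is not the code in agent/build/storage.py.

```python
import re
import sqlite3

ALLOWED_TABLES = frozenset({"build_jobs", "build_artifacts"})
ALLOWED_COLUMNS = frozenset({"fail_reason", "workspace_id"})  # hypothetical whitelist
IDENTIFIER_RE = re.compile(r"^[a-z_][a-z0-9_]*$")
MAX_ARTIFACT_BYTES = 5 * 1024 * 1024  # 5 MB cap per artifact row


def open_db(path: str) -> sqlite3.Connection:
    """Open a connection usable from executor threads, with WAL enabled."""
    conn = sqlite3.connect(path, check_same_thread=False)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def quote_identifier(name: str) -> str:
    """Whitelist and regex check first, then escape as a last line of defence."""
    if name not in (ALLOWED_TABLES | ALLOWED_COLUMNS) or not IDENTIFIER_RE.match(name):
        raise ValueError(f"identifier not allowed: {name!r}")
    return '"' + name.replace('"', '""') + '"'


def ensure_text_column(conn: sqlite3.Connection, table: str, column: str) -> None:
    """Add a TEXT column if missing; identifiers cannot be bound as parameters."""
    q_table, q_column = quote_identifier(table), quote_identifier(column)
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({q_table})")}
    if column not in existing:
        conn.execute(f"ALTER TABLE {q_table} ADD COLUMN {q_column} TEXT")


def cap_content(content: str) -> tuple:
    """Truncate to the byte cap; the flag tells the caller to emit a warning."""
    raw = content.encode("utf-8")
    if len(raw) <= MAX_ARTIFACT_BYTES:
        return content, False
    return raw[:MAX_ARTIFACT_BYTES].decode("utf-8", errors="ignore"), True
```

Values still go through parameterized writes (`?` placeholders); the whitelist path exists only because SQL identifiers cannot be passed that way.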

Operator surface

| Channel | Command | What |
| --- | --- | --- |
| Telegram | /build <description> | Quick build with auto-plan |
| Telegram | /intake build <description> | Unified intake (qualify → preview → submit) |
| Telegram | /jobs | List recent build jobs with status |
| Telegram | /deliver <job_id> | Show delivery state, approve/reject |
| HTTP | POST /api/operator/intake | Same as /intake build |
| HTTP | GET /api/operator/jobs | Same as /jobs |
| HTTP | POST /api/operator/deliveries/<id>/approve | Approve delivery |
| CLI | python -m agent --build-repo . --build-description "..." | Direct build invocation |
| CLI | python -m agent --build-repo . ... --build-plan-file plan.json | Use a pre-built plan |
| CLI | python -m agent --build-repo . ... --build-acceptance-file acceptance.json | Custom acceptance criteria |
| Dashboard | Build panel | Job listing, delivery preview, approve/reject buttons |

Things the build pipeline does NOT do

  • It doesn't push to git. The build produces a workspace bundle and a delivery package. The operator picks it up and pushes (or runs git apply, or rsyncs to a deploy host). This is intentional — pushing is irreversible and human-only.
  • It doesn't run untrusted shell commands on the host. Every command marked requires_isolation=True runs inside Docker. Verification commands that operate on the workspace (e.g. grep) can run on the host because they're read-only.
  • It doesn't auto-merge with main. Build deliveries are bundles, not PRs. The operator decides how to integrate.
  • It doesn't bypass the budget. Every codegen LLM call goes through the budget check. A build that would exceed the daily hard cap is blocked at intake.
  • It doesn't remember failed plans. If codegen fails twice in a row, the operator has to either provide a manual plan or wait for the rate-limit/budget reset. There is no auto-retry loop on stochastic failures.
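
The budget rule in the list above (block at intake when the daily hard cap would be exceeded) can be pictured with a small check. `BudgetPolicy`, `check_budget`, and the three return values are hypothetical, and stop-loss handling is left out because the page does not describe its semantics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetPolicy:
    daily_hard_cap_usd: float
    single_tx_approval_cap_usd: float


def check_budget(policy: BudgetPolicy, spent_today_usd: float, estimated_cost_usd: float) -> str:
    """Classify a request as 'blocked', 'needs_approval', or 'ok'."""
    # The hard cap is checked first: nothing can approve past it.
    if spent_today_usd + estimated_cost_usd > policy.daily_hard_cap_usd:
        return "blocked"
    # A single expensive call is allowed only with explicit approval.
    if estimated_cost_usd > policy.single_tx_approval_cap_usd:
        return "needs_approval"
    return "ok"
```

Running the check at intake, before any codegen call is made, is what keeps a rejected build from spending anything at all.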