This roadmap treats each chapter as a falsifiable research unit: a hypothesis, a set of architectural questions, known open problems, and concrete artifacts (diagrams, patterns, benchmarks) that make the claims testable.
AI-first engineering becomes reliable when we treat the harness, tooling, and evaluation as the primary performance lever, and model selection as secondary within a bounded capability envelope.
- What is the minimal “AI-first” system boundary (what must be externalized into the harness vs kept in the model context)?
- Which responsibilities belong to the human (governance) vs the system (automation) vs the model (inference)?
- What invariants distinguish an AI-first loop from “LLM-assisted” development (traceability, determinism constraints, evaluation gates)?
- Where does autonomy live: within an agent loop, within CI, within runtime systems, or in all three?
- Separating apparent capability gains from harness-induced improvements.
- Defining “done” for reasoning-heavy work where correctness is probabilistic.
- Characterizing failure surfaces (mis-specification vs tool error vs model error vs evaluation gaps).
- Diagrams: lifecycle map of AI-first development (intent → plan → tool execution → verification → trace → iteration).
- Patterns: “harness-is-product” framing; “evaluation-gated autonomy.”
- Benchmarks: before/after harness interventions (same model, same tasks).
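
The before/after benchmark in the last bullet can be sketched as a paired comparison over the same task set. This is only an illustration: `TaskResult`, `compare_harnesses`, and the report keys are hypothetical names, and a boolean pass/fail per task is an assumption about how results are recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    # Hypothetical record: one outcome per task, same model in both runs.
    task_id: str
    passed: bool


def compare_harnesses(baseline, candidate):
    """Paired comparison restricted to tasks present in both runs."""
    base = {r.task_id: r.passed for r in baseline}
    cand = {r.task_id: r.passed for r in candidate}
    shared = sorted(base.keys() & cand.keys())
    if not shared:
        raise ValueError("no shared tasks to compare")
    return {
        "tasks": len(shared),
        "baseline_pass_rate": sum(base[t] for t in shared) / len(shared),
        "candidate_pass_rate": sum(cand[t] for t in shared) / len(shared),
        # Tasks the intervention fixed, and tasks it regressed.
        "fixed": [t for t in shared if not base[t] and cand[t]],
        "broken": [t for t in shared if base[t] and not cand[t]],
    }
```

Reporting `fixed` and `broken` separately, rather than only the two pass rates, keeps regressions visible even when the aggregate improves.
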
Most reliability gains come from harness design: tool schemas, constraints, loop control, error handling, and evaluation integration.
- What is the minimal harness interface (tools, file operations, build/test hooks) that yields predictable behavior?
- How should the harness constrain the search space (budgets, stop conditions, allowed edits)?
- What’s the right decomposition: monolithic mega-prompt vs layered prompts + tool contracts?
- How do we design harnesses that are portable across models and tasks?
- Keeping prompt/tool contracts robust to drift over time.
- Guardrail design that prevents “helpful but wrong” actions without blocking progress.
- Measuring harness quality independent of model choice.
- Diagrams: harness architecture (control plane, tool plane, evaluation plane, state plane).
- Patterns: structured patching, diff discipline, tool-first execution.
- Benchmarks: harness A/B (same tasks) with trace-based metrics (iterations, regressions, time-to-fix).
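
One way to read "tool contracts" above is as a check the harness runs on every tool call before executing it. The sketch below assumes a contract is just a mapping from argument name to expected Python type; `ToolContract` and the `apply_patch` tool name used in the usage note are hypothetical and not tied to any particular agent framework.

```python
from dataclasses import dataclass


@dataclass
class ToolContract:
    # Hypothetical contract: tool name plus required arguments and their types.
    name: str
    required: dict  # argument name -> expected Python type

    def validate(self, args):
        """Return a list of violations; an empty list means the call is well-formed."""
        errors = []
        for key, expected in self.required.items():
            if key not in args:
                errors.append(f"missing argument: {key}")
            elif not isinstance(args[key], expected):
                errors.append(
                    f"{key}: expected {expected.__name__}, got {type(args[key]).__name__}"
                )
        # Reject arguments the contract does not know about.
        for key in args:
            if key not in self.required:
                errors.append(f"unexpected argument: {key}")
        return errors
```

For example, `ToolContract("apply_patch", {"path": str, "diff": str}).validate({"path": 3})` reports both the wrong type for `path` and the missing `diff`, which the harness can feed back to the model instead of executing a malformed call.
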
Small, well-governed “autonomous kernels” (tight loops with explicit budgets and evaluation gates) outperform broad autonomy in stability and debuggability.
- What is the kernel’s execution model (plan → act → verify → commit) and what state is persisted?
- How do kernels compose (nested loops, delegation, sub-agents) without losing traceability?
- What are the minimal correctness checks for kernel actions (unit tests, lint, static checks, spec checks)?
- How should kernels handle uncertainty (branching vs asking vs running experiments)?
- Avoiding local minima (kernel keeps making small safe changes but never solves the root issue).
- Reducing “tool thrash” (too many actions with low information gain).
- Formalizing stop conditions for open-ended tasks.
- Diagrams: kernel state machine; failure recovery paths.
- Patterns: budgeted loop, verify-first, reversible edits.
- Benchmarks: tasks solved per iteration budget; regression rate under autonomy.
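
The plan → act → verify → commit model and the "budgeted loop" and "reversible edits" patterns can be sketched together. This is a minimal illustration under stated assumptions: `run_kernel`, `KernelResult`, and the four callbacks (`propose`, `apply_change`, `verify`, `revert`) are hypothetical names, and a real kernel would persist its log rather than keep it in memory.

```python
from dataclasses import dataclass, field


@dataclass
class KernelResult:
    status: str       # "committed" or "budget_exhausted"
    iterations: int
    log: list = field(default_factory=list)


def run_kernel(propose, apply_change, verify, revert, max_iterations=5):
    """Budgeted loop: stop on the first verified change or when the budget runs out."""
    log = []
    for i in range(1, max_iterations + 1):
        change = propose(log)      # plan, informed by earlier attempts
        apply_change(change)       # act
        ok = verify()              # verify (for example, run the test suite)
        log.append((i, change, ok))
        if ok:
            return KernelResult("committed", i, log)   # commit
        revert(change)             # reversible edit: undo a failed attempt
    return KernelResult("budget_exhausted", max_iterations, log)
```

The explicit `budget_exhausted` status gives the caller a defined stop condition to escalate on, rather than letting the loop continue with low-information actions.
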
Persistent memory improves long-horizon work only when it is structured, queryable, and governed (with provenance), not when it is an uncurated dump of past context.
- What memory classes are required (episodic traces, semantic knowledge, project state, decisions, constraints)?
- What is the read/write policy (when to store, how to summarize, how to expire, how to version)?
- How do we preserve provenance and prevent stale knowledge from dominating?
- What retrieval strategies work under time and token budgets?
- Memory poisoning via incorrect intermediate conclusions.
- Summarization loss leading to systematic blind spots.
- Evaluating memory usefulness without circularity (memory helps because it says it helps).
- Diagrams: memory layers and data flows; provenance model.
- Patterns: decision records, trace-indexed memory, “facts vs hypotheses” tagging.
- Benchmarks: long-horizon tasks with/without memory; drift detection on stored assertions.
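
A small sketch of governed memory, combining the "facts vs hypotheses" tagging, provenance, and expiry ideas above. All names here (`MemoryRecord`, `retrieve`, the `source_trace` field) are hypothetical, and substring matching stands in for whatever retrieval strategy a real system would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRecord:
    text: str
    kind: str            # "fact" or "hypothesis"
    source_trace: str    # provenance: ID of the trace that produced this record
    created_at: float    # seconds since epoch
    ttl_seconds: float   # expiry policy: the record is ignored after this age


def retrieve(records, now, query, limit=3):
    """Return unexpired records matching the query, facts first, newest first."""
    live = [
        r for r in records
        if now - r.created_at <= r.ttl_seconds and query.lower() in r.text.lower()
    ]
    # False sorts before True, so facts rank ahead of hypotheses.
    live.sort(key=lambda r: (r.kind != "fact", -r.created_at))
    return live[:limit]
```

Because every record carries `source_trace`, a stored assertion that later proves wrong can be traced back to the run that wrote it, which is one way to approach the memory-poisoning problem listed above.
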
Trace-first engineering (capturing actions, tool I/O, diffs, and checks) is necessary to make AI-first systems reproducible and to attribute failures to the right layer.
- What trace schema is sufficient (inputs, outputs, tool calls, diffs, evaluations, budgets)?
- Which evaluations are mandatory gates (tests, lint, type checks, property tests, spec conformance)?
- How do we design evals that measure system quality, not just model performance?
- How do we detect capability drift and harness regressions over time?
- Preventing eval gaming (optimizing for the metric while harming real quality).
- Designing cheap evaluations for expensive tasks.
- Aligning qualitative judgments (readability, maintainability) with automated checks.
- Diagrams: trace pipeline; evaluation gating in CI.
- Patterns: “evals as contracts,” trace sampling, regression triage.
- Benchmarks: pass@k under gated loops; change-risk scoring vs post-merge defects.
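
The trace schema and mandatory-gate questions above can be made concrete with a small record type and a gate check. The field names follow the list in the first question; the gate names in `MANDATORY_GATES` and the rule that a missing result counts as a failure are assumptions made for this sketch.

```python
import json
from dataclasses import asdict, dataclass, field

# Assumed gate names; a real project would choose its own mandatory set.
MANDATORY_GATES = ("tests", "lint", "type_check")


@dataclass
class TraceRecord:
    task_id: str
    inputs: dict
    tool_calls: list = field(default_factory=list)
    diff: str = ""
    evaluations: dict = field(default_factory=dict)   # gate name -> passed?
    budget_used: dict = field(default_factory=dict)

    def to_json(self):
        """Serialize the record so it can be stored and replayed later."""
        return json.dumps(asdict(self), sort_keys=True)


def gate(trace, mandatory=MANDATORY_GATES):
    """Return (allowed, reasons). A gate with no recorded result blocks the change."""
    reasons = [
        f"{name}: {'failed' if name in trace.evaluations else 'missing'}"
        for name in mandatory
        if not trace.evaluations.get(name, False)
    ]
    return (not reasons, reasons)
```

Treating a missing evaluation as a failure means a harness bug that skips a check cannot silently pass the gate.
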
Governance mechanisms (permissions, budgets, review policies, and auditability) are not optional; they define the safe operating envelope for autonomy.
- What permissions should agents have by default (read-only, patch-only, tool-limited)?
- Where do approvals live (human-in-the-loop checkpoints, protected paths, release gates)?
- How do we represent and enforce policy (constitution, agent rules, CI policies)?
- What is the incident response model when autonomy misbehaves?
- Policy conflicts (speed vs safety) and how to resolve them mechanically.
- Auditing at scale (what to log, how to search, what to retain).
- Governance drift (rules change, old traces become incomparable).
- Diagrams: permission model; escalation paths; audit log flow.
- Patterns: protected resources, diff-only changes, evaluation before merge.
- Benchmarks: prevented-incident rate; false-positive friction cost.
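
A default permission model with protected paths might look like the following sketch. The path patterns, the three-way `allow` / `needs_approval` / `deny` outcome, and the function name `decide` are all assumptions for illustration; the point is that policy is enforced mechanically and that anything not explicitly allowed is denied.

```python
import posixpath
from fnmatch import fnmatchcase

# Hypothetical policy: patterns are illustrative, not a recommendation.
PROTECTED = (".github/*", "infra/*", "*.lock")   # patch requires human approval
WRITABLE = ("src/*", "tests/*", "docs/*")        # patch allowed without approval


def decide(action, path):
    """Return 'allow', 'needs_approval', or 'deny' for an agent action on a path."""
    normalized = posixpath.normpath(path)
    # Refuse absolute paths and anything that escapes the repository root.
    if normalized.startswith("/") or normalized == ".." or normalized.startswith("../"):
        return "deny"
    if action == "read":
        return "allow"
    if action != "patch":
        return "deny"          # patch-only: no delete, move, or raw write
    if any(fnmatchcase(normalized, p) for p in PROTECTED):
        return "needs_approval"
    if any(fnmatchcase(normalized, p) for p in WRITABLE):
        return "allow"
    return "deny"
```

Normalizing the path before matching matters: without it, `src/../infra/main.tf` would match the writable `src/*` pattern and bypass the protected-path check.
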
AI-first systems in production behave like distributed systems: reliability depends on orchestration, observability, caching, cost control, and reproducible environments.
- What runtime components are required (tool servers, sandboxes, queues, cache, secret handling, artifact store)?
- What are the operational SLOs (latency, cost, error rate, rollback time, trace coverage)?
- How do we isolate failures (per-task sandboxes, deterministic replays, capability flags)?
- What is the minimal infrastructure to move from “local agent” to “team-scale system”?
- Cost predictability under variable reasoning depth.
- Secure tool execution in heterogeneous environments.
- Reproducible reruns (same inputs) across changing models and dependencies.
- Diagrams: reference architecture for production agent systems.
- Patterns: sandboxed tool plane, cache + replay, artifact-first runs.
- Benchmarks: cost/latency per successful task; replay success rate.
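
The "cache + replay" pattern can be sketched as a content-addressed cache over tool calls. `ReplayCache` and the `execute` callback are hypothetical; the sketch assumes tool arguments and results are JSON-serializable and that a tool call is deterministic for identical arguments, which a real system would need to verify per tool.

```python
import hashlib
import json


class ReplayCache:
    """Caches tool results keyed by a hash of the tool name and arguments."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(tool, args):
        # sort_keys makes the key independent of argument ordering.
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, tool, args, execute):
        k = self.key(tool, args)
        if k in self._store:
            self.hits += 1
            return self._store[k]       # replay: no tool execution
        self.misses += 1
        result = execute(tool, args)    # live execution, recorded for later replay
        self._store[k] = result
        return result
```

The `hits` and `misses` counters give a rough replay success rate for a rerun: a fully reproducible rerun of the same inputs should produce no misses.
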
The main frontier is not larger models; it is better system-level interfaces: verifiable tool contracts, stronger evaluations, and memory/governance primitives that scale.
- What would “formal methods for agent loops” look like in practice?
- Which tasks can be made provably safe vs only statistically reliable?
- How should standards emerge (trace formats, tool schemas, evaluation suites)?
- What new failure classes appear as autonomy scales (org-level coupling, supply chains of prompts and tools)?
- Cross-model portability of traces and evaluations.
- Long-horizon alignment between product intent and accumulated memory.
- Standardizing interoperability without freezing innovation.
- Diagrams: maturity model for AI-first teams and systems.
- Patterns: interoperability contracts, evaluation portability.
- Benchmarks: cross-model reproducibility; “upgrade impact” reports.
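
One very small reading of "formal methods for agent loops" is exhaustive checking of a loop's state machine: if the control flow is finite, a property such as "no commit without verification" can be proven by search rather than estimated from runs. The transition table and the function `reachable_without` below are hypothetical and only illustrate the idea; they say nothing about the model's behavior inside each state.

```python
# Hypothetical control-flow graph for an agent loop.
TRANSITIONS = {
    "plan": {"act"},
    "act": {"verify"},
    "verify": {"commit", "plan", "abort"},
    "commit": set(),
    "abort": set(),
}


def reachable_without(transitions, start, target, forbidden):
    """True if `target` can be reached from `start` on a path that never visits `forbidden`."""
    if start == forbidden:
        return False
    seen, stack = {start}, [start]
    while stack:
        state = stack.pop()
        if state == target:
            return True
        for nxt in transitions.get(state, ()):
            if nxt != forbidden and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


# Safety property: every path from "plan" to "commit" passes through "verify".
commit_is_gated = not reachable_without(TRANSITIONS, "plan", "commit", "verify")
```

Here the property holds for the control flow by construction, which is the "provably safe" side of the question above; whether the verification step itself is adequate remains a statistical question answered by evaluations.
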