Motivation — live evidence from VerticalBench
blueprint-agent's GEPA lift loop is producing proto-skills right now: seeded with a deliberately thin worker prompt ("build to a real, working, compiling state"), generation 10 independently rediscovered the no-shims directive, the grounding lever ("no progress after two edits → read the real docs / web search"), and production-completeness — and wrote them to driver-skills/vb-gepa-worker-10.md. The system synthesizes operational knowledge from experience; the artifacts then die as run outputs. There is no registry, no provenance, no lift score attached, no maintenance. Every consumer that wants this (blueprint-agent today; every product on the stack tomorrow) would hand-roll the same lifecycle.
The thesis: a consumer that has defineAgent + the improvement loop should get the full artifact lifecycle for free by flipping a switch.
The structural insight
Every lifecycle stage is one of exactly three shapes the runtime already ships:
| Stage | Shape | Existing primitive |
| --- | --- | --- |
| Distill (traces/findings → candidate artifact) | one-shot completion | reflectiveGenerator |
| Build (tool code, MCP server, hook script) | agentic session in a sandboxed worktree | agenticGenerator |
| Eval (marginal lift, held-out) | multi-shot loop | runImprovementLoop + HeldOutGate (agent-eval) |
| Dedupe/merge | one-shot judge over pairs | JudgeConfig |
| Promote | gate decision | Gate |
| Drift-watch (decay detection) | scheduled re-eval | scorecard/production-loop cron pattern |
| Refresh | re-seeded improvement loop | improvementDriver |
| Retire | registry state change | none (registry missing) |
Nothing here needs a new execution model. What's missing is the artifact model + the orchestration.
Artifact taxonomy (everything in an AgentProfile is lifecycle-able)
AgentSurfaces already declares mutable dirs for systemPrompt sections, tools, personas, rubric, knowledge, scaffolding, memory, rag, outputSchema. Proposed extensions:
- skill (markdown directive bundle — what VB's driver-skills are)
- tool (executable code + tests, not just <tool>/README.md)
- mcp-server (full build: scaffold → implement → compile → serve → register; this is how coding agents actually consume custom tools)
- hook (pre/post-toolcall scripts)
- subagent (persona + tool allowlist + model)
- prompt-surface (the existing sections)
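One way to picture the extended taxonomy is as a discriminated union, so each lifecycle stage can branch on the artifact kind. This is only a sketch: the kind names come from the list above, while every field name and the helper isBuildableSurface are hypothetical and not part of AgentSurfaces today. The buildable grouping follows the phasing proposed later in this issue (tool, mcp-server, hook, subagent).

```typescript
// Hypothetical discriminated union over the proposed artifact kinds.
// Kind names come from the list above; every field is an assumption.
type Artifact =
  | { kind: "skill"; markdown: string }
  | { kind: "tool"; entrypoint: string; testCommand: string }
  | { kind: "mcp-server"; entrypoint: string; buildCommand: string; serveCommand: string }
  | { kind: "hook"; phase: "pre-toolcall" | "post-toolcall"; script: string }
  | { kind: "subagent"; persona: string; toolAllowlist: string[]; model: string }
  | { kind: "prompt-surface"; section: string; text: string };

// Hypothetical helper: true for the kinds this issue groups under
// "buildable surfaces", false for the doc-style kinds.
function isBuildableSurface(artifact: Artifact): boolean {
  switch (artifact.kind) {
    case "tool":
    case "mcp-server":
    case "hook":
    case "subagent":
      return true;
    case "skill":
    case "prompt-surface":
      return false;
  }
}
```

Because the switch is exhaustive over kind, adding a new kind to the union without handling it here becomes a compile error under strict settings, which is the main reason to prefer a union over a free-form string.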
The gaps (the actual feature request)
- ArtifactRegistry: a versioned, provenance-carrying store with records of the form { id, type, version, content-ref, provenance: { runIds, scenarios, generatorKind }, scores: { marginalLift, heldOutSuite, n, measuredAt }, state: candidate|active|decayed|retired, tenancy: tenant|vertical|general }. Without a lift score an artifact doesn't exist; that rule is what keeps a 5,000-artifact library an asset instead of a junk drawer. Natural home: agent-knowledge or a new agent-runtime/registry subpath.
- Ablation runner: a first-class measureMarginalLift(profile, artifact, heldOutSuite) that runs campaign cells for profile-with vs profile-without. This is the atom of the whole system; it is expressible today via matrix composition, but nobody should hand-roll it.
- MCP/tool/hook/subagent surface types: extend AgentSurfaces beyond doc dirs to buildable artifacts. agenticGenerator already runs real coding harnesses in candidate worktrees, so "build an MCP server and prove it compiles+serves" is its existing job description pointed at a new surface.
- Lifecycle orchestration in defineAgent, declared as config: lifecycles: { skills: { distill: 'reflective', eval: { suite, minLift }, driftCheck: '7d' }, mcpServers: { build: 'agentic', ... } }. The production loop runs the stages.
- Dedupe/merge stage, based on behavioral equivalence: two artifacts whose lifts don't stack → merge or retire one. Judge-over-registry, scheduled.
- Composer: composeProfile(domain, k) selects the top-k active artifacts by marginal lift on that domain's suite → AgentProfile. The tenancy boundary is enforced here: tenant-local artifacts never promote without cross-tenant lift evidence, which is both the IP boundary and the anti-overfit guarantee.
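To make the registry record, the ablation primitive, and the composer concrete, here is a minimal sketch under stated assumptions. The field names mirror the record described above; the concrete types, the Scorer callback (a stand-in for running a campaign cell), and the exact signatures are hypothetical. In particular, composeProfile here takes the registry and a suite name explicitly rather than a domain, and the tenancy check is left out because the issue does not specify how cross-tenant evidence is recorded.

```typescript
// All names and types below are illustrative assumptions, not a shipped API.
type ArtifactState = "candidate" | "active" | "decayed" | "retired";
type Tenancy = "tenant" | "vertical" | "general";

interface ArtifactRecord {
  id: string;
  type: string;
  version: number;
  contentRef: string;
  provenance: { runIds: string[]; scenarios: string[]; generatorKind: string };
  // Required on purpose: a record without a measured lift cannot be constructed.
  scores: { marginalLift: number; heldOutSuite: string; n: number; measuredAt: string };
  state: ArtifactState;
  tenancy: Tenancy;
}

// Hypothetical scorer: evaluates one held-out scenario with the given
// artifact ids enabled. In practice this would be a campaign cell.
type Scorer = (enabledArtifactIds: readonly string[], scenario: string) => number;

// Mean paired difference (profile-with minus profile-without) over the suite.
function measureMarginalLift(
  profileArtifactIds: readonly string[],
  artifactId: string,
  heldOutScenarios: readonly string[],
  score: Scorer,
): { marginalLift: number; n: number } {
  if (heldOutScenarios.length === 0) {
    throw new Error("held-out suite must contain at least one scenario");
  }
  const without = profileArtifactIds.filter((id) => id !== artifactId);
  const withArtifact = [...without, artifactId];
  let total = 0;
  for (const scenario of heldOutScenarios) {
    total += score(withArtifact, scenario) - score(without, scenario);
  }
  return { marginalLift: total / heldOutScenarios.length, n: heldOutScenarios.length };
}

// Top-k active artifacts by marginal lift on one suite, highest lift first.
function composeProfile(
  registry: readonly ArtifactRecord[],
  heldOutSuite: string,
  k: number,
): ArtifactRecord[] {
  return registry
    .filter((a) => a.state === "active" && a.scores.heldOutSuite === heldOutSuite)
    .sort((a, b) => b.scores.marginalLift - a.scores.marginalLift)
    .slice(0, k);
}
```

Pairing the with/without scores per scenario, rather than comparing two independent means, is a deliberate choice in this sketch: it keeps scenario difficulty from leaking into the lift estimate. A real runner would also need repeated trials and a variance estimate, which this sketch omits.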
How much exists today vs custom (boundary audit, runtime 0.49 / eval 0.89)
| Capability | Status |
| --- | --- |
| One-shot + agentic-session generators | ✅ runtime (reflectiveGenerator / agenticGenerator) |
| Mutable-surface declaration + validation | ✅ runtime (AgentSurfaces) |
| Improvement loop, GEPA, held-out gate, judges | ✅ agent-eval |
| Findings → surface routing | ✅ runtime (analyst-loop + FindingSubject resolution) |
| Sandboxed candidate worktrees | ✅ runtime (agenticGenerator) |
| Artifact registry w/ provenance + scores + states | ❌ missing |
| Marginal-lift ablation primitive | ❌ missing |
| MCP-server / hook / subagent surfaces | ❌ missing |
| Drift-watch + auto-refresh wiring | ❌ missing (cron patterns exist, not artifact-scoped) |
| Top-k composer + tenancy promotion | ❌ missing |
| Consumer-side and staying there | verticals/leaves, scaffold families, realness specs, backend choice (cli-bridge vs sandbox), domain judges (pure data/config, correctly not substrate) |
VB's current loop uses agent-eval's gepaDriver over a custom 3-section string surface instead of AgentSurfaces — i.e., the first consumer migration is also the first validation of the design.
Suggested phasing
- ArtifactRegistry + ablation runner (everything else hangs off these).
- defineAgent.lifecycles orchestration for skill + prompt-surface types (reuse both generators as-is).
- Buildable surfaces: tool, mcp-server (agenticGenerator + compile/serve verification), then hook/subagent.
- Drift-watch + dedupe/merge + composer.
- Migrate blueprint-agent's lift loop onto it (deletes VB's custom surface plumbing; proves the "for free" claim).
First consumer experiment queued in blueprint-agent regardless: skill-profile ablation matrix (the skills axis of AgentProfileSpec has only ever carried one bundle — decompose it + the GEPA winners into ~8 atomic candidates and measure per-skill marginal lift). The registry schema should be shaped by what that produces.
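The matrix for that experiment could be laid out as one with/without pair per atomic candidate. The sketch below assumes a leave-one-out design (full bundle vs. bundle minus one skill), which is only one possible reading of "per-skill marginal lift"; the AblationCell shape and leaveOneOutCells are hypothetical, and the skill ids are placeholders named after the directives mentioned in the motivation.

```typescript
// Hypothetical cell shape: one paired comparison per candidate skill.
interface AblationCell {
  candidate: string;
  withSkills: string[];    // full bundle, candidate included
  withoutSkills: string[]; // same bundle with the candidate removed
}

// Leave-one-out layout (an assumption): each candidate is measured against
// the bundle that contains every other candidate.
function leaveOneOutCells(candidates: readonly string[]): AblationCell[] {
  return candidates.map((candidate) => ({
    candidate,
    withSkills: [...candidates],
    withoutSkills: candidates.filter((id) => id !== candidate),
  }));
}

// Placeholder ids; the real decomposition would yield roughly eight.
const cells = leaveOneOutCells(["no-shims", "grounding-lever", "production-completeness"]);
```

An add-one design (empty baseline plus a single skill) answers a different question, namely each skill's standalone value; comparing the two designs is one way to see whether lifts stack.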
🤖 Generated with Claude Code