feat(lifecycle): profile-artifact lifecycle as a runtime primitive (generate/eval/promote/maintain skills, tools, MCPs, hooks, subagents for free) #267

## Motivation — live evidence from VerticalBench

blueprint-agent's GEPA lift loop is producing proto-skills *right now*: seeded with a deliberately thin worker prompt ("build to a real, working, compiling state"), generation 10 **independently rediscovered** the no-shims directive, the grounding lever ("no progress after two edits → read the real docs / web search"), and production-completeness — and wrote them to `driver-skills/vb-gepa-worker-10.md`. The system synthesizes operational knowledge from experience; the artifacts then **die as run outputs**. There is no registry, no provenance, no lift score attached, no maintenance. Every consumer that wants this (blueprint-agent today; every product on the stack tomorrow) would hand-roll the same lifecycle.

The thesis: a consumer that has `defineAgent` + the improvement loop should get the full artifact lifecycle **for free by flipping a switch**.

## The structural insight

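Every lifecycle stage that executes work reduces to one of three shapes the runtime already ships: a one-shot completion, an agentic session, or a multi-shot loop. The one exception is Retire, which is a plain registry state change with no primitive behind it yet: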

| Stage | Shape | Existing primitive |
|---|---|---|
| **Distill** (traces/findings → candidate artifact) | one-shot completion | `reflectiveGenerator` |
| **Build** (tool code, MCP server, hook script) | agentic session in a sandboxed worktree | `agenticGenerator` |
| **Eval** (marginal lift, held-out) | multi-shot loop | `runImprovementLoop` + `HeldOutGate` (agent-eval) |
| **Dedupe/merge** | one-shot judge over pairs | `JudgeConfig` |
| **Promote** | gate decision | `Gate` |
| **Drift-watch** (decay detection) | scheduled re-eval | scorecard/production-loop cron pattern |
| **Refresh** | re-seeded improvement loop | `improvementDriver` |
| **Retire** | registry state change | — (registry missing) |

Nothing here needs a new execution model. What's missing is the **artifact model + the orchestration**.

## Artifact taxonomy (everything in an AgentProfile is lifecycle-able)

`AgentSurfaces` already declares mutable dirs for `systemPrompt` sections, `tools`, `personas`, `rubric`, `knowledge`, `scaffolding`, `memory`, `rag`, `outputSchema`. Proposed extensions:

- **skill** (markdown directive bundle — what VB's driver-skills are)
- **tool** (executable code + tests, not just `<tool>/README.md`)
- **mcp-server** (full build: scaffold → implement → compile → serve → register; this is how coding agents actually consume custom tools)
- **hook** (pre/post-toolcall scripts)
- **subagent** (persona + tool allowlist + model)
- **prompt-surface** (the existing sections)
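
One way to picture the taxonomy is as a discriminated union, with the generator choice falling out of the artifact type. This is only a sketch: the field names, the `generatorFor` helper, and the decision to treat `subagent` as declarative config rather than built code are all assumptions for illustration, not existing runtime API.

```typescript
// Hypothetical union over the six proposed artifact types.
// Field names are illustrative assumptions, not the runtime's actual shapes.
type ProfileArtifact =
  | { type: "skill"; markdown: string }
  | { type: "tool"; sourceDir: string; testCommand: string }
  | { type: "mcp-server"; sourceDir: string; serveCommand: string }
  | { type: "hook"; phase: "pre-toolcall" | "post-toolcall"; script: string }
  | { type: "subagent"; persona: string; toolAllowlist: string[]; model: string }
  | { type: "prompt-surface"; section: string; text: string };

type GeneratorKind = "reflective" | "agentic";

// Text-only artifacts can be distilled in one shot; anything that has to
// compile or run needs an agentic build session in a worktree.
function generatorFor(artifact: ProfileArtifact): GeneratorKind {
  switch (artifact.type) {
    case "skill":
    case "prompt-surface":
    case "subagent": // assumption: a subagent is declarative config, not built code
      return "reflective";
    case "tool":
    case "mcp-server":
    case "hook":
      return "agentic";
    default: {
      // Exhaustiveness check: a new type without a case fails to compile.
      const unreachable: never = artifact;
      throw new Error(`unhandled artifact type: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

The `never` branch means adding a seventh artifact type forces a decision about which generator owns it.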

## The gaps (the actual feature request)

1. **`ArtifactRegistry`** — versioned, provenance-carrying store: `{ id, type, version, content-ref, provenance: { runIds, scenarios, generatorKind }, scores: { marginalLift, heldOutSuite, n, measuredAt }, state: candidate|active|decayed|retired, tenancy: tenant|vertical|general }`. Without a lift score an artifact doesn't exist — that rule is what keeps a 5,000-artifact library an asset instead of a junk drawer. Natural home: agent-knowledge or a new `agent-runtime/registry` subpath.
2. **Ablation runner** — first-class `measureMarginalLift(profile, artifact, heldOutSuite)`: campaign cells for profile-with vs profile-without. The atom of the whole system; expressible today via matrix composition but nobody should hand-roll it.
3. **MCP/tool/hook/subagent surface types** — extend `AgentSurfaces` beyond doc dirs to buildable artifacts; `agenticGenerator` already runs real coding harnesses in candidate worktrees, so "build an MCP server and prove it compiles+serves" is its existing job description pointed at a new surface.
4. **Lifecycle orchestration in `defineAgent`** — declarative: `lifecycles: { skills: { distill: 'reflective', eval: { suite, minLift }, driftCheck: '7d' }, mcpServers: { build: 'agentic', ... } }`. Production loop runs the stages.
5. **Dedupe/merge stage** — behavioral equivalence: two artifacts whose lifts don't stack → merge or retire one. Judge-over-registry, scheduled.
6. **Composer** — `composeProfile(domain, k)`: top-k active artifacts by marginal lift on that domain's suite → AgentProfile. Tenancy boundary enforced here (tenant-local artifacts never promote without cross-tenant lift evidence — this is both the IP boundary and the anti-overfit guarantee).
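
To make gap 1 concrete, here is a minimal in-memory sketch of the registry record and the "no lift score, no artifact" rule. The record fields mirror the inline schema above (`content-ref` is spelled `contentRef`); the class name, the `register`/`transition` methods, the reading of `n` as a sample count, and the transition table are all assumptions, since the issue names the states but not the edges between them.

```typescript
type ArtifactState = "candidate" | "active" | "decayed" | "retired";
type Tenancy = "tenant" | "vertical" | "general";

interface ArtifactScores {
  marginalLift: number;
  heldOutSuite: string;
  n: number; // assumed to be the number of held-out samples behind the score
  measuredAt: string; // ISO-8601 timestamp
}

interface ArtifactRecord {
  id: string;
  type: string;
  version: number;
  contentRef: string;
  provenance: { runIds: string[]; scenarios: string[]; generatorKind: string };
  scores: ArtifactScores;
  state: ArtifactState;
  tenancy: Tenancy;
}

// A submission may arrive unscored; the registry decides whether to accept it.
type ArtifactSubmission = Omit<ArtifactRecord, "state" | "scores"> & {
  scores?: ArtifactScores;
};

// Assumed transition table (hypothetical, not specified in the issue).
const ALLOWED_TRANSITIONS: Record<ArtifactState, readonly ArtifactState[]> = {
  candidate: ["active", "retired"],
  active: ["decayed", "retired"],
  decayed: ["active", "retired"], // a refresh can revive a decayed artifact
  retired: [],
};

class InMemoryArtifactRegistry {
  private readonly records = new Map<string, ArtifactRecord>();

  // Rejects any submission that does not carry a measured lift.
  register(submission: ArtifactSubmission): ArtifactRecord {
    const scores = submission.scores;
    if (!scores || !Number.isFinite(scores.marginalLift) || scores.n <= 0) {
      throw new Error(`artifact ${submission.id} has no measured lift; refusing to register`);
    }
    const record: ArtifactRecord = { ...submission, scores, state: "candidate" };
    this.records.set(record.id, record);
    return record;
  }

  // Moves an artifact to a new state if the transition table allows it.
  transition(id: string, next: ArtifactState): ArtifactRecord {
    const current = this.records.get(id);
    if (!current) throw new Error(`unknown artifact ${id}`);
    if (!ALLOWED_TRANSITIONS[current.state].includes(next)) {
      throw new Error(`illegal transition ${current.state} -> ${next} for ${id}`);
    }
    const updated: ArtifactRecord = { ...current, state: next };
    this.records.set(id, updated);
    return updated;
  }
}
```

Keeping the score check inside `register` makes the rule structural rather than a convention: an unscored artifact cannot reach `candidate`, let alone `active`.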

## How much exists today vs custom (boundary audit, runtime 0.49 / eval 0.89)

| Capability | Status |
|---|---|
| One-shot + agentic-session generators | ✅ runtime (`reflectiveGenerator` / `agenticGenerator`) |
| Mutable-surface declaration + validation | ✅ runtime (`AgentSurfaces`) |
| Improvement loop, GEPA, held-out gate, judges | ✅ agent-eval |
| Findings → surface routing | ✅ runtime (analyst-loop + `FindingSubject` resolution) |
| Sandboxed candidate worktrees | ✅ runtime (`agenticGenerator`) |
| Artifact registry w/ provenance + scores + states | ❌ missing |
| Marginal-lift ablation primitive | ❌ missing |
| MCP-server / hook / subagent surfaces | ❌ missing |
| Drift-watch + auto-refresh wiring | ❌ missing (cron patterns exist, not artifact-scoped) |
| Top-k composer + tenancy promotion | ❌ missing |
| Consumer-side and staying there | verticals/leaves, scaffold families, realness specs, backend choice (cli-bridge vs sandbox), domain judges — pure data/config, correctly not substrate |

VB's current loop runs agent-eval's `gepaDriver` over a custom 3-section string surface instead of `AgentSurfaces`, so migrating it doubles as the first validation of the design.

## Suggested phasing

1. `ArtifactRegistry` + ablation runner (everything else hangs off these).
2. `defineAgent.lifecycles` orchestration for `skill` + `prompt-surface` types (reuse both generators as-is).
3. Buildable surfaces: `tool`, `mcp-server` (agenticGenerator + compile/serve verification), then `hook`/`subagent`.
4. Drift-watch + dedupe/merge + composer.
5. Migrate blueprint-agent's lift loop onto it (deletes VB's custom surface plumbing; proves the "for free" claim).
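
For phase 4, the composer from gap 6 could be as small as a filter plus a sort. The sketch below is an assumption-heavy illustration: the issue only gives `composeProfile(domain, k)`, so the `entries` and `requestingTenant` parameters, the `RegistryEntry` shape, and the reading of the tenancy boundary as "tenant-local artifacts are visible only to their owning tenant" are all hypothetical.

```typescript
// Hypothetical flattened view of a registry row, just enough for composition.
interface RegistryEntry {
  id: string;
  domain: string; // the domain whose held-out suite produced the lift
  state: "candidate" | "active" | "decayed" | "retired";
  tenancy: "tenant" | "vertical" | "general";
  ownerTenant?: string; // assumed to be set only for tenant-local artifacts
  marginalLift: number;
}

// Returns the ids of the top-k active artifacts for a domain, by marginal lift.
function composeProfile(
  entries: readonly RegistryEntry[],
  domain: string,
  k: number,
  requestingTenant: string,
): string[] {
  return entries
    // Only active artifacts measured on this domain are eligible.
    .filter((e) => e.state === "active" && e.domain === domain)
    // Tenancy boundary: tenant-local artifacts never leak to other tenants.
    .filter((e) => e.tenancy !== "tenant" || e.ownerTenant === requestingTenant)
    // filter() returned a fresh array, so sorting in place is safe.
    .sort((a, b) => b.marginalLift - a.marginalLift)
    .slice(0, Math.max(0, k))
    .map((e) => e.id);
}
```

Promotion from `tenant` to `vertical` or `general` would happen elsewhere, gated on cross-tenant lift evidence; the composer only has to respect whatever tenancy the registry currently records.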

First consumer experiment queued in blueprint-agent regardless: skill-profile ablation matrix (the `skills` axis of `AgentProfileSpec` has only ever carried one bundle — decompose it + the GEPA winners into ~8 atomic candidates and measure per-skill marginal lift). The registry schema should be shaped by what that produces.
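
The per-skill measurement in that experiment is a leave-one-out ablation: score the profile with the skill, score it without, and take the difference over the held-out suite. Below is a rough sketch of that shape. The `Scorer` signature, `ablationMatrix`, and the synchronous call style are assumptions (a real scorer would run agent sessions and be async), and the scorer at the bottom is synthetic with made-up point values, there only to make the example runnable.

```typescript
// Scores one scenario for a given set of enabled skills (hypothetical signature).
type Scorer = (enabledSkills: ReadonlySet<string>, scenario: string) => number;

interface LiftResult {
  skill: string;
  lift: number; // mean score with the skill minus mean score without it
  n: number; // number of held-out scenarios behind the estimate
}

function meanScore(skills: ReadonlySet<string>, suite: readonly string[], score: Scorer): number {
  const total = suite.reduce((sum, scenario) => sum + score(skills, scenario), 0);
  return total / suite.length;
}

// Profile-with vs profile-without, evaluated on the same held-out scenarios.
function measureMarginalLift(
  profileSkills: readonly string[],
  skill: string,
  heldOutSuite: readonly string[],
  score: Scorer,
): LiftResult {
  if (heldOutSuite.length === 0) throw new Error("held-out suite is empty");
  const withSkill = new Set(profileSkills).add(skill);
  const withoutSkill = new Set(profileSkills);
  withoutSkill.delete(skill);
  const lift =
    meanScore(withSkill, heldOutSuite, score) - meanScore(withoutSkill, heldOutSuite, score);
  return { skill, lift, n: heldOutSuite.length };
}

// One row per candidate: each skill ablated against the full candidate set.
function ablationMatrix(
  candidates: readonly string[],
  heldOutSuite: readonly string[],
  score: Scorer,
): LiftResult[] {
  return candidates.map((skill) => measureMarginalLift(candidates, skill, heldOutSuite, score));
}

// Synthetic scorer for illustration only: the point values are invented.
const toyPoints: Record<string, number> = { "skill-a": 20, "skill-b": 10, "skill-c": 0 };
const toyScorer: Scorer = (enabled) =>
  [...enabled].reduce((sum, s) => sum + (toyPoints[s] ?? 0), 50);

const toyMatrix = ablationMatrix(["skill-a", "skill-b", "skill-c"], ["s1", "s2"], toyScorer);
console.log(toyMatrix);
```

Leave-one-out measures each skill's lift in the context of all the others, which is also what the dedupe stage needs: two skills that each show lift alone but little lift when the other is present are candidates for merging.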

🤖 Generated with [Claude Code](https://claude.com/claude-code)
