feat(lifecycle): profile-artifact lifecycle as a runtime primitive — generate/eval/promote/maintain skills, tools, MCPs, hooks, subagents for free #267

@drewstone

Motivation — live evidence from VerticalBench

blueprint-agent's GEPA lift loop is producing proto-skills right now: seeded with a deliberately thin worker prompt ("build to a real, working, compiling state"), generation 10 independently rediscovered the no-shims directive, the grounding lever ("no progress after two edits → read the real docs / web search"), and production-completeness — and wrote them to driver-skills/vb-gepa-worker-10.md. The system synthesizes operational knowledge from experience; the artifacts then die as run outputs. There is no registry, no provenance, no lift score attached, no maintenance. Every consumer that wants this (blueprint-agent today; every product on the stack tomorrow) would hand-roll the same lifecycle.

The thesis: a consumer that has defineAgent + the improvement loop should get the full artifact lifecycle for free by flipping a switch.

The structural insight

Every lifecycle stage is one of exactly three shapes the runtime already ships:

| Stage | Shape | Existing primitive |
| --- | --- | --- |
| Distill (traces/findings → candidate artifact) | one-shot completion | reflectiveGenerator |
| Build (tool code, MCP server, hook script) | agentic session in a sandboxed worktree | agenticGenerator |
| Eval (marginal lift, held-out) | multi-shot loop | runImprovementLoop + HeldOutGate (agent-eval) |
| Dedupe/merge | one-shot judge over pairs | JudgeConfig |
| Promote | gate decision | Gate |
| Drift-watch (decay detection) | scheduled re-eval | scorecard/production-loop cron pattern |
| Refresh | re-seeded improvement loop | improvementDriver |
| Retire | registry state change | none (registry missing) |

Nothing here needs a new execution model. What's missing is the artifact model + the orchestration.

Artifact taxonomy (everything in an AgentProfile is lifecycle-able)

AgentSurfaces already declares mutable dirs for systemPrompt sections, tools, personas, rubric, knowledge, scaffolding, memory, rag, outputSchema. Proposed extensions:

  • skill (markdown directive bundle — what VB's driver-skills are)
  • tool (executable code + tests, not just <tool>/README.md)
  • mcp-server (full build: scaffold → implement → compile → serve → register; this is how coding agents actually consume custom tools)
  • hook (pre/post-toolcall scripts)
  • subagent (persona + tool allowlist + model)
  • prompt-surface (the existing sections)
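As a rough illustration of the taxonomy above, the artifact kinds could be modeled as a string union, with the generator shape chosen per kind. This is a sketch only: `ArtifactKind`, `BUILDABLE`, `needsBuild`, and `generatorFor` are hypothetical names, and treating `subagent` as a config-only artifact (no build step) is an assumption, not something the runtime defines.

```typescript
// Hypothetical names throughout: none of these are confirmed runtime API.
type ArtifactKind =
  | "skill"
  | "tool"
  | "mcp-server"
  | "hook"
  | "subagent"
  | "prompt-surface";

// Kinds whose content is code that must compile or run, matching the
// "Build (tool code, MCP server, hook script)" row of the stage table.
// Assumption: subagent is treated as config (persona + allowlist + model).
const BUILDABLE: ReadonlySet<ArtifactKind> = new Set<ArtifactKind>([
  "tool",
  "mcp-server",
  "hook",
]);

function needsBuild(kind: ArtifactKind): boolean {
  return BUILDABLE.has(kind);
}

// Buildable artifacts go to an agentic session; document-style artifacts
// go to a one-shot completion.
function generatorFor(kind: ArtifactKind): "agentic" | "reflective" {
  return needsBuild(kind) ? "agentic" : "reflective";
}
```

Keeping the kind as a closed union means a new artifact type forces every switch over kinds to be revisited at compile time.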

The gaps (the actual feature request)

  1. ArtifactRegistry — versioned, provenance-carrying store: { id, type, version, content-ref, provenance: { runIds, scenarios, generatorKind }, scores: { marginalLift, heldOutSuite, n, measuredAt }, state: candidate|active|decayed|retired, tenancy: tenant|vertical|general }. Without a lift score an artifact doesn't exist — that rule is what keeps a 5,000-artifact library an asset instead of a junk drawer. Natural home: agent-knowledge or a new agent-runtime/registry subpath.
  2. Ablation runner — first-class measureMarginalLift(profile, artifact, heldOutSuite): campaign cells for profile-with vs profile-without. The atom of the whole system; expressible today via matrix composition but nobody should hand-roll it.
  3. MCP/tool/hook/subagent surface types — extend AgentSurfaces beyond doc dirs to buildable artifacts; agenticGenerator already runs real coding harnesses in candidate worktrees, so "build an MCP server and prove it compiles+serves" is its existing job description pointed at a new surface.
  4. Lifecycle orchestration in defineAgent — declarative: lifecycles: { skills: { distill: 'reflective', eval: { suite, minLift }, driftCheck: '7d' }, mcpServers: { build: 'agentic', ... } }. Production loop runs the stages.
  5. Dedupe/merge stage — behavioral equivalence: two artifacts whose lifts don't stack → merge or retire one. Judge-over-registry, scheduled.
  6. Composer: composeProfile(domain, k) takes the top-k active artifacts by marginal lift on that domain's suite and assembles them into an AgentProfile. The tenancy boundary is enforced here: tenant-local artifacts never promote without cross-tenant lift evidence, which is both the IP boundary and the anti-overfit guarantee.
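To make the registry, ablation, and composer gaps concrete, here is a minimal in-memory sketch. Field names follow the record shape in item 1; everything else (`ArtifactRecord`, `marginalLift`, `promote`, `composeTopK`) is hypothetical and not existing API. Lift is assumed to be a difference of mean scores between the profile-with and profile-without cells, and the tenancy check from item 6 is deliberately left out.

```typescript
// Hypothetical sketch: field names follow the issue text; nothing here is shipped API.
type ArtifactState = "candidate" | "active" | "decayed" | "retired";
type Tenancy = "tenant" | "vertical" | "general";

interface Scores {
  marginalLift: number; // mean(with) - mean(without) on the held-out suite
  heldOutSuite: string;
  n: number; // number of held-out scenarios measured
  measuredAt: string; // ISO timestamp
}

interface ArtifactRecord {
  id: string;
  type: string;
  version: number;
  contentRef: string;
  provenance: { runIds: string[]; scenarios: string[]; generatorKind: string };
  scores?: Scores; // absent until an ablation has been run
  state: ArtifactState;
  tenancy: Tenancy;
}

// Assumed definition of lift: difference of mean scores between the two cells.
function marginalLift(withArtifact: number[], withoutArtifact: number[]): number {
  if (withArtifact.length === 0 || withoutArtifact.length === 0) {
    throw new Error("both cells need at least one score");
  }
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  return mean(withArtifact) - mean(withoutArtifact);
}

// Promotion gate: a candidate without a measured lift can never become active.
function promote(a: ArtifactRecord, minLift: number): ArtifactRecord {
  if (a.state !== "candidate" || !a.scores || a.scores.marginalLift < minLift) {
    return a;
  }
  return { ...a, state: "active" };
}

// Composer core: top-k active artifacts measured on the given suite, by lift.
// Tenancy enforcement is omitted from this sketch.
function composeTopK(registry: ArtifactRecord[], suite: string, k: number): ArtifactRecord[] {
  return registry
    .filter((a) => a.state === "active" && a.scores?.heldOutSuite === suite)
    .sort((a, b) => b.scores!.marginalLift - a.scores!.marginalLift)
    .slice(0, k);
}
```

Making `scores` optional and checking it in `promote` is one way to encode the "without a lift score an artifact doesn't exist" rule: an unscored record can be stored but can never reach `active`, so the composer never sees it.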

How much exists today vs custom (boundary audit, runtime 0.49 / eval 0.89)

| Capability | Status |
| --- | --- |
| One-shot + agentic-session generators | ✅ runtime (reflectiveGenerator / agenticGenerator) |
| Mutable-surface declaration + validation | ✅ runtime (AgentSurfaces) |
| Improvement loop, GEPA, held-out gate, judges | ✅ agent-eval |
| Findings → surface routing | ✅ runtime (analyst-loop + FindingSubject resolution) |
| Sandboxed candidate worktrees | ✅ runtime (agenticGenerator) |
| Artifact registry w/ provenance + scores + states | ❌ missing |
| Marginal-lift ablation primitive | ❌ missing |
| MCP-server / hook / subagent surfaces | ❌ missing |
| Drift-watch + auto-refresh wiring | ❌ missing (cron patterns exist, not artifact-scoped) |
| Top-k composer + tenancy promotion | ❌ missing |
| Consumer-side and staying there | verticals/leaves, scaffold families, realness specs, backend choice (cli-bridge vs sandbox), domain judges (pure data/config, correctly not substrate) |

VB's current loop uses agent-eval's gepaDriver over a custom 3-section string surface instead of AgentSurfaces — i.e., the first consumer migration is also the first validation of the design.

Suggested phasing

  1. ArtifactRegistry + ablation runner (everything else hangs off these).
  2. defineAgent.lifecycles orchestration for skill + prompt-surface types (reuse both generators as-is).
  3. Buildable surfaces: tool, mcp-server (agenticGenerator + compile/serve verification), then hook/subagent.
  4. Drift-watch + dedupe/merge + composer.
  5. Migrate blueprint-agent's lift loop onto it (deletes VB's custom surface plumbing; proves the "for free" claim).
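For phase 2, the declarative `lifecycles` shape proposed in the gaps list might be typed roughly as below, with the drift-check interval parsed into a due/not-due decision. This is a sketch under assumptions: the real `defineAgent` signature is not shown in this issue, `LifecycleSpec`, `parseInterval`, and `isDriftCheckDue` are hypothetical, the supported units (`h`, `d`) are a guess, and the suite name and `minLift` value are placeholders.

```typescript
// Hypothetical types for the proposed `lifecycles` option.
interface EvalSpec {
  suite: string;
  minLift: number;
}

interface LifecycleSpec {
  distill?: "reflective";
  build?: "agentic";
  eval?: EvalSpec;
  driftCheck?: string; // e.g. "7d"
}

type Lifecycles = Record<string, LifecycleSpec>;

// Convert "7d" or "12h" into milliseconds. The unit set is an assumption.
function parseInterval(spec: string): number {
  const m = /^(\d+)([hd])$/.exec(spec);
  if (!m) throw new Error(`unsupported interval: ${spec}`);
  const unitMs = m[2] === "h" ? 3600000 : 86400000;
  return Number(m[1]) * unitMs;
}

// An artifact is due for re-eval once its last measurement is at least
// one interval old.
function isDriftCheckDue(measuredAt: Date, spec: string, now: Date): boolean {
  return now.getTime() - measuredAt.getTime() >= parseInterval(spec);
}

// Placeholder values: the suite name and minLift are illustrative only.
const lifecycles: Lifecycles = {
  skills: {
    distill: "reflective",
    eval: { suite: "held-out-v1", minLift: 0.02 },
    driftCheck: "7d",
  },
};
```

Passing `now` explicitly keeps the due check deterministic, which matters if the production loop's cron is itself under test.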

First consumer experiment queued in blueprint-agent regardless: skill-profile ablation matrix (the skills axis of AgentProfileSpec has only ever carried one bundle — decompose it + the GEPA winners into ~8 atomic candidates and measure per-skill marginal lift). The registry schema should be shaped by what that produces.
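One way to read "per-skill marginal lift" in this experiment is a leave-one-out ablation: score the full bundle, then score it again with each skill removed. The sketch below assumes that reading; `Evaluate` and `perSkillLift` are hypothetical, and `Evaluate` stands in for a real campaign run over a held-out suite.

```typescript
// Hypothetical: a stand-in for running a profile with the given skills
// against a held-out suite and returning an aggregate score.
type Evaluate = (skills: string[]) => number;

// Leave-one-out ablation: lift(s) = score(all skills) - score(all except s).
// Costs skills.length + 1 evaluations.
function perSkillLift(skills: string[], evaluate: Evaluate): Map<string, number> {
  const full = evaluate(skills);
  const lifts = new Map<string, number>();
  for (const s of skills) {
    const without = skills.filter((x) => x !== s);
    lifts.set(s, full - evaluate(without));
  }
  return lifts;
}

// Toy evaluator for illustration only: each skill contributes a fixed weight.
const weights: Record<string, number> = { "no-shims": 3, grounding: 1, filler: 0 };
const toyEvaluate: Evaluate = (skills) =>
  skills.reduce((sum, s) => sum + (weights[s] ?? 0), 0);

const lifts = perSkillLift(["no-shims", "grounding", "filler"], toyEvaluate);
```

With an additive toy evaluator the lifts simply recover the weights. Real skills can overlap, in which case leave-one-out lifts will not sum to the total and the dedupe/merge stage becomes relevant.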

🤖 Generated with Claude Code
