Skip to content

continua-ai/happy-paths

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

78 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

🤸‍♀️ Happy Paths

Happy Paths watercolor jumping girl mark

Stop paying repeatedly for the same wrong turns.

CI License TypeScript

Website · Architecture · HTTP ingest · Wrong-turn flow · Feasibility gate · Skateboard E2E · Metrics · Related work · Benchmark · Roadmap

Contributor trust policy: VOUCHED.td


Happy Paths is a trace-driven learning loop for agentic coding. It captures agent traces, indexes them, mines wrong-turn corrections, and feeds those recoveries back into future runs so each session wastes less time and fewer tokens than the last.

Why this exists

Every coding agent session starts from zero. If the agent hits pytest: command not found, spends 4 steps figuring out it needs a venv, and eventually succeeds — the next session on the same project will repeat the exact same detour.

Happy Paths remembers what worked and intervenes at the moment of failure, before the agent wastes steps rediscovering the fix.

Two wins, one thesis

We ran 17 benchmark iterations (~1,000+ runs across three suites) to find what actually works. The thesis: Happy Paths doesn't make smart models smarter at things they already know. It makes undiscoverable things discoverable.

Win 1: Tool registry eliminates reinvention waste (−100%)

Mining 300 real sessions revealed 9,012 throwaway inline scripts (~2.3M wasted tokens). Agents kept rewriting the same Linear API / GCloud boilerplate because existing tools weren't discoverable. A 10-line markdown table in AGENTS.md fixed it completely:

Metric Without registry With registry
Throwaway heredocs (36 runs) 9 0
CLI tool usage 59 163 (2.8×)
Wasted tokens 1,048 0

Cost: ~200 tokens in the system prompt. Savings: ~1,000+ per session.

Win 2: Error-time hints save 4–11% on undocumented repos

When a repo has no README and the only way to run tests is an undocumented CLI tool, hints at the moment of error give the agent a direct path:

Repo What's missing Δ time
ledgerkit No README, ./kit CLI undiscoverable −11%
logparse No README, ./qa CLI undiscoverable −4%

Discoverability gate: selective hint suppression

Not all repos need hints. A discoverability gate scans README.md at session start and suppresses hints when the fix is already documented:

Repo Without gate With gate Why
ledgerkit −11% −14% README doesn't document ./kit → hint fires
toolhub +10% ~0% README documents ./th setup → hint suppressed

Where it doesn't help

  • Well-documented repos: agent reads README (gate now suppresses hints here)
  • Standard errors (git push conflicts, venv setup): model already knows
  • Too many hints or hints injected too early: adds noise, net-harmful

See Benchmark results below for the full data.

How it works

Happy Paths uses Pi's tool_result hook to intercept errors in real time. When a tool call returns an error matching a known pattern, Happy Paths appends a short recovery hint to the error output before the agent sees it.

Agent runs `pytest tests/` → error: "pytest: command not found"
                                    ↓
            Happy Paths matches error pattern
                                    ↓
            Appends: "This project needs setup. Create a venv,
            install dev deps, check for setup scripts in the
            repo root, then use .venv/bin/pytest."
                                    ↓
            Agent follows recipe → skips 3-4 wrong turns

The hints are error-keyed (matched by regex on error output), not command-keyed. This means the same hint fires regardless of which command produced the error. Hints are deduplicated per session — each hint fires at most once.

The story arc

The same failure pattern repeats at every scale. It starts with one engineer and one agent looping on avoidable dead-ends, then compounds when multiple agents run concurrently and replay each other's mistakes. At team scale, engineers rediscover similar fixes independently and the cost becomes org-wide. The natural endpoint is opt-in global sharing of learned happy paths — similar in spirit to skill exchange, but extracted and curated from real traces.

Single engineer single agent Single engineer multiple agents
Team learning network Global crowdsourced learning

Visuals above are auto-generated concept illustrations for storytelling.

Core principles

  1. Correctness first — never make the agent less reliable.
  2. Precise over prolific — one good hint beats three noisy ones.
  3. Error-time delivery — intervene at the moment of failure, not before.
  4. Lexical/signature retrieval first — exact and near-exact matching before heavier semantic techniques.
  5. No mandatory external deps — local mode has no database or vector dependency.
  6. Pluggable — adapters/backends are swappable (harness, storage, index).

Install

# Bun (preferred)
bun install && bun run verify

# npm
npm install && npm run verify

Quick start

As a Pi extension (recommended)

# from npm
pi install npm:@continua-ai/happy-paths

# or from source
pi install git:github.com/continua-ai/happy-paths

That's it. Happy Paths will capture traces and inject hints automatically.

Configuration (env vars)

Variable Default Description
HAPPY_PATHS_TRACE_ROOT ~/.happy-paths/traces Where traces are stored
HAPPY_PATHS_TRACE_SCOPE personal personal, team, or public
HAPPY_PATHS_MAX_SUGGESTIONS 3 Max hints per session start
HAPPY_PATHS_ERROR_TIME_HINTS on Enable/disable error-time hints
HAPPY_PATHS_BEFORE_AGENT_START true Enable/disable pre-session hints
HAPPY_PATHS_HINT_MODE suggest suggest, inject, or none
HAPPY_PATHS_SESSION_ID (auto) Override session ID (for benchmarks)

Programmatic usage

import { createLocalLearningLoop } from "@continua-ai/happy-paths";

// Create a learning loop backed by local JSONL files
const loop = createLocalLearningLoop({ dataDir: ".happy-paths" });

// Ingest a trace event (normally done automatically by the Pi adapter)
await loop.ingest({
  id: crypto.randomUUID(),
  timestamp: new Date().toISOString(),
  sessionId: "session-1",
  harness: "pi",
  scope: "personal",
  type: "tool_result",
  payload: {
    command: "npm test",
    output: "Error: Cannot find module 'foo'",
    isError: true,
  },
});

// Retrieve relevant past events
const hits = await loop.retrieve({ text: "cannot find module" });

Rehydrate from persisted traces

import { initializeLocalLearningLoop } from "@continua-ai/happy-paths";

// Bootstraps in-memory index from on-disk JSONL traces
const { loop, bootstrap } = await initializeLocalLearningLoop({
  dataDir: ".happy-paths",
});

console.log(`Loaded ${bootstrap.eventCount} events from prior sessions`);

Ship traces to a hosted endpoint

export HAPPY_PATHS_INGEST_URL=https://your-ingest-server.example.com
export HAPPY_PATHS_TEAM_ID=team_abc
export HAPPY_PATHS_TEAM_TOKEN_FILE=~/.happy-paths/team-token.txt
export HAPPY_PATHS_TRACE_ROOTS=~/.happy-paths/traces

npx @continua-ai/happy-paths ingest ship

Project identity overrides

Brand-specific identifiers are centralized in src/core/projectIdentity.ts and can be overridden per integration:

const loop = createLocalLearningLoop({
  projectIdentity: {
    displayName: "YourBrand",
    defaultDataDirName: ".yourbrand",
    extensionCustomType: "yourbrand",
  },
});

Development

npm run verify          # lint + typecheck + test
npm run test            # unit tests only
npm run build           # compile TypeScript

# Quality gates
npm run test:wrong-turn-gate       # wrong-turn retrieval quality gate
npm run eval:wrong-turn            # wrong-turn evaluator (hit@k, MRR)
npm run eval:feasibility           # feasibility gate evaluation
npm run eval:skateboard            # skateboard E2E evaluation

See docs/metrics.md for evaluation methodology and docs/feasibility-gate.md for the go/no-go validation flow.

Benchmark results

We built a recurring-pattern benchmark to measure whether error-time hints actually save time and tokens. The benchmark uses synthetic Python repos with intentional traps — undocumented CLI tools, misdirecting error messages, non-standard project setup — that simulate the kinds of knowledge gaps models can't resolve from training data alone.

Setup

  • Model: gpt-5.3-codex (via Pi + OpenAI Codex provider)
  • Design: A/B — each task runs OFF (no hints) and ON (hints enabled), interleaved, with 3 replicates per variant
  • Metric: wall-clock time, error count, and tool-call count per run
  • Repos: 14 synthetic Python projects, 56 tasks, 27 unique traps
  • Trap families: undocumented tooling, misdirecting error messages, non-standard test setup, format-before-lint, build target syntax, hallucinated tool names, reinvention waste, git workflow
  • Real sad paths: 2 repos mined from 300 real Pi sessions (~2,275 categorized errors across 95K tool calls)
  • Total runs: ~1,000+ across 17+ iterations

How we got here (13 iterations)

Finding the right hint strategy took systematic iteration. Early attempts were net-harmful — they added overhead without reducing errors. Each iteration isolated one variable:

Version Strategy ledgerkit Δ logparse Δ Key lesson
v3 Easy-trap hints (venv, deps) +89% slower Models handle standard errors fine — don't hint what they already know
v7 Undocumented-tool hints + pre-session injection +31% slower +42% slower Hints fire but pre-session overhead dominates
v8 3 separate per-error hints + pre-session +15% slower +27% slower Fewer hints = less overhead, but still net-negative
v9 1 comprehensive recipe + pre-session +1% slower +10% slower Single hint dramatically better than multiple
v10 1 recipe, error-time only (no pre-session) −5% faster +7% slower Removing pre-session noise flips ledgerkit net-positive
v11 Prescriptive recipe, error-time only −11% faster −4% faster Explicit .venv/bin/pytest prevents model shortcuts
v12 Terse format (just the fix command) +14% slower −15% faster Terse best for simple fixes, verbose for discovery
v13 Adaptive format (terse/verbose per hint) −2% faster +89%* slower Middle-of-road; v11 remains best general policy

* v13 logparse average skewed by single 596s outlier; median: −7%.

v11 results (current)

Error-time-only mode with a prescriptive setup recipe. Key wording change from v10: "Use .venv/bin/pytest (not pytest or python -m pytest)" — this forces the model to create a venv instead of taking shortcuts that cause additional errors.

ledgerkit (undocumented ./kit CLI tool, no README):

Variant Avg time Avg errors/run Avg calls/run
OFF (no hints) 65s 3.2 17.7
ON (error-time recipe) 58s 3.3 18.0
Δ −11% +0.1 errors +0.3 calls

logparse (undocumented ./qa CLI tool, no README):

Variant Avg time Avg errors/run Avg calls/run
OFF (no hints) 51s 3.4 15.8
ON (error-time recipe) 49s 3.0 15.7
Δ −4% −0.4 errors −0.1 calls

webutil (misdirecting error messages, session fixture timeout trap):

Variant Avg time Avg errors/run Avg calls/run
OFF (no hints) 91s 2.7 15.7
ON (error-time recipe) 92s 2.5 14.8
Δ +1% −0.2 errors −0.8 calls

Both ledgerkit and logparse are net-positive. Webutil is neutral on time but reduces errors and tool calls.

Real sad path analysis (session mining)

We mined 300 real Pi sessions (~95K tool calls) and identified 14 recurring sad path families. The top errors agents hit repeatedly:

Category Real freq In benchmark?
Format before lint 533x ✅ monobuild (new)
Build target syntax 368x ✅ monobuild (new)
dx preflight timeout 329x (CI-specific)
Git push conflicts 244x (git-specific)
Git dirty rebase 135x (git-specific)
Git worktree confusion 132x (git-specific)
Hallucinated tool names 92x ✅ toolhub (new)
Missing Python modules 88x ✅ toolhub (new)

The 4 git-specific patterns (push conflicts, dirty rebase, worktree confusion) and the CI timeout pattern require git/CI infrastructure in the benchmark — a future improvement.

Reinvention waste benchmark (new)

We discovered a second class of waste beyond error recovery: agents writing throwaway scripts for operations that have existing repo tools. Mining 300 real Pi sessions revealed 9,012 inline Python heredocs (~2.3M wasted tokens), with 55% being Linear API and GCloud boilerplate rewritten every session.

We built a separate benchmark to measure this — 3 synthetic repos (issuetracker, opsboard, dataquery) with 12 tasks, 151-191 files each, and existing CLI tools (./track, ./ops, jq) buried in docs:

Version Files/repo Intervention Heredocs (36 runs) CLI usage Token waste
v3 (baseline) 151-191 None 9 59 1,048
v3 + hints 151-191 Tool-call hints only 6 67 971 (−7%)
v4 (registry) 151-191 AGENTS.md tool registry 0 163 (2.8x) 0 (−100%)

The fix isn't an algorithm — it's making tools discoverable. A 10-line markdown table in AGENTS.md mapping operations → CLI commands completely eliminated throwaway scripts and nearly tripled CLI usage. Cost: ~200 tokens in the system prompt. Savings: ~1,000+ tokens per session.

Higher-confidence results (r=5)

We re-ran webutil and toolhub with 5 replicates (80 sessions) to reduce noise:

Repo OFF median ON median Δ median ON faster?
webutil 84s 100s +18% 1/4 tasks
toolhub 48s 54s +10% 0/4 tasks

Both repos are clearly net-harmful with hints at r=5. These are well-documented repos where the agent discovers tools on its own.

Git workflow (new)

We added push-conflict and dirty-rebase traps — the top git sad paths from session mining (244× and 135× respectively). Results (24 sessions, r=3):

Task OFF median ON median Δ
push-after-diverge 77s 85s +10%
push-conflict-multiply 38s 36s −5%
rebase-dirty-subtract 46s 47s +2%
rebase-dirty-upper 50s 60s +20%

Overall +10% slower with hints. Models handle standard git errors fine.

What the data teaches

  1. One comprehensive hint > many small hints. When the agent hits pytest: command not found, give it the full recipe (venv + deps + check for setup scripts + run tests). Don't drip-feed 3 hints across 3 errors.

  2. Error-time delivery > pre-session injection. Injecting hints before the agent starts (via before_agent_start) adds overhead even when the hints are relevant. The agent hasn't seen the project yet, so generic warnings just add noise. Error-time delivery waits until the agent has context.

  3. Don't hint what the model already knows. gpt-5.3-codex handles pip install, venv creation, and standard toolchain errors in 1-2 steps. Hinting on those is net-harmful — it adds processing overhead without saving any steps.

  4. Be prescriptive, not advisory. "Use .venv/bin/pytest" works better than "create a venv first" because the model can't take a shortcut — .venv/bin/pytest won't exist without the venv. Name the specific tools (./kit, ./qa) instead of saying "check for executable files."

  5. Hints work when errors misdirect; they hurt when README already explains. Toolhub has a clear README and ./th setup — hints add noise. Ledgerkit and logparse have NO README and opaque error messages — hints save 4-7 steps.

  6. The value gap is narrow but real. Happy Paths helps most when:

    • Error messages point the wrong way (e.g., "See https://internal.docs/" for a URL that doesn't exist)
    • The fix requires running a tool that isn't mentioned in any repo file
    • The project uses internal/proprietary tooling that the model has no training data for
  7. Modern models are excellent explorers. Even with zero documentation, gpt-5.3-codex discovers undocumented CLI tools via ls → find → read script → execute. Hints provide a more direct path, but the model usually gets there on its own in 3-4 extra steps.

Methodology notes

  • All benchmark repos are synthetic (no real user data). Source: scripts/build-recurring-pattern-benchmark.ts
  • Runs use git clean -fdx between tasks to ensure clean state
  • Traces are captured per-run and analyzed post-hoc for error counts, hint firing, and tool-call sequences
  • Full methodology: docs/recurring-pattern-benchmark.md

Prior work: SWE-bench Lite

We also ran ~15 matrix iterations on a SWE-bench Lite lane (real open-source bug fixes). Hints were consistently net-harmful there because the tasks don't share failure modes — each bug is unique, so there's nothing useful to learn across sessions. This confirmed that Happy Paths is specifically valuable for recurring patterns, not one-off bug fixes.

Hosted vision

The hosted direction is opt-in sharing that grows from personal → team → global scope, with privacy controls and artifact review at each stage. Learned recoveries can be safely published and reused at internet scale.

Credits

Made with care by David Petrou (@dpetrou) and collaborators at Continua AI.

License

Apache-2.0 (see LICENSE).

About

Trace-driven learning loop for coding agents: capture, index, mine, and share successful paths.

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors