feat: input-side token-saving upgrades + eval quality gate by edubraqd · Pull Request #579 · JuliusBrussee/caveman

edubraqd · 2026-06-27T20:06:28Z

Why

In agentic Claude Code the weekly token limit is dominated by input — the whole context (system prompt, tool schemas, prior prose, every tool result) is re-sent every turn; caveman's output compression is the smallest slice. These changes target the input side and add a quality gate so compression never silently trades away correctness. A token saved on the re-sent prefix is re-billed every later turn (compounds); a token saved on output is billed once.

What

1. Agentic-loop output discipline + gated reinforcement + one-shot nudge (24c8e03)

skills/caveman/SKILL.md: new ## Agentic Loop section — result-first, no recap of files just read, plan once then deltas, one intent clause per tool batch. Cross-references Auto-Clarity so security/irreversible/multi-step stay full prose.
caveman-mode-tracker.js: per-turn reinforcement now fires every 3rd turn (+ on mode (re)activation), not every turn — the counter lives in a separate flag file so the injected string stays byte-stable and keeps hitting the prompt cache.
caveman-activate.js: statusline-setup nudge is one-shot (a .caveman-nudged marker) instead of re-appending ~90 tokens to the cached SessionStart prefix every session.
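
The cadence rule above can be sketched as follows. This is a minimal Python illustration, not the hook's code (the real hook is JavaScript); `REINFORCEMENT`, `should_reinforce` and `payload_for` are hypothetical names, and the 1, 4, 7, ... schedule follows the commit message.

```python
REINFORCEMENT = "[caveman] stay terse"  # hypothetical payload; a constant, so it stays byte-stable


def should_reinforce(turn: int, reactivated: bool = False, every: int = 3) -> bool:
    """True on turns 1, 4, 7, ... and on any turn that (re)activates the mode."""
    return reactivated or (turn - 1) % every == 0


def payload_for(turn: int, reactivated: bool = False) -> str:
    # The turn counter is never interpolated into the payload, so every
    # emitted string is identical and the cached prefix is not invalidated.
    return REINFORCEMENT if should_reinforce(turn, reactivated) else ""
```

Keeping the counter outside the injected string is the point of the design: a payload that embedded the turn number would differ on every emission and could not be served from the prompt cache.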

2. Eval quality gate — fidelity axis + two-gate accept rule (5681437, 5c7e19a)

measure.py only counts tokens, which is gameable (a skill replying `k` scores -99% and "wins"). Adds evals/judge.py (LLM judge for rubric facts + deterministic verbatim check -> fidelity.json) and evals/gate.py (accept only if tokens hold AND fidelity holds; high-risk prompts like the git-rebase "rewrites history" warning allow zero fidelity loss).
evals/prompts/rubrics.json for all 10 prompts; a committed fidelity.json baseline; tests/test_eval_gate.py (incl. the canonical "reply k -> REJECT"); .github/workflows/eval-gate.yml.
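
A rough sketch of the two-gate accept rule, under stated assumptions: the 10pt/0pt risk-band limits come from the commit message, while `noise_floor`, `mean_tol`, `PromptScore` and the `gate` signature are hypothetical and do not reflect the real evals/gate.py interface.

```python
from dataclasses import dataclass

# Hard per-prompt limits on fidelity drop, in points (from the commit message).
HARD_LIMIT = {"normal": 10.0, "high": 0.0}


@dataclass
class PromptScore:
    fidelity: float    # 0..100
    risk: str          # "normal" or "high"
    verbatim_ok: bool  # every verbatim invariant survived


def gate(base_savings, cand_savings, base, cand, noise_floor=2.0, mean_tol=2.0):
    """Return "ACCEPT" only if the token gate AND the quality gate both hold."""
    # Token gate: savings may not regress past the noise floor (assumed value).
    token_ok = cand_savings >= base_savings - noise_floor
    # Quality gate: mean drop within tolerance, no prompt past its risk band,
    # and every verbatim invariant held.
    drops = {p: base[p].fidelity - cand[p].fidelity for p in base}
    mean_ok = sum(drops.values()) / len(drops) <= mean_tol
    band_ok = all(drops[p] <= HARD_LIMIT[cand[p].risk] for p in base)
    verbatim_ok = all(cand[p].verbatim_ok for p in base)
    return "ACCEPT" if token_ok and mean_ok and band_ok and verbatim_ok else "REJECT"
```

Under this rule a candidate that answers `k` to everything has excellent token savings but fails the quality gate, so it is rejected.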

3. Opt-in PostToolUse tool-result trim (de7da10)

caveman-trim-tool-result.js: when enabled, replaces oversized Read/Bash/Grep/Glob results (via the public updatedToolOutput) with a deterministic, lossless-first trimmed version before they enter context. OFF by default (enable with CAVEMAN_TRIM_TOOL_RESULTS=1); wired via --with-trim, independent of the plugin (no double-fire).

4. cavecrew as the default locate route (e2e7076)

Nudge the main thread to delegate locate-shaped work to cavecrew-investigator (its verbose Grep/Read stays out of main context); cap investigator/reviewer output at 25 rows.

5. Structural ultra + honest input/output split in /caveman-stats (b01efd3)

skills/caveman/SKILL.md: the ultra level now collapses multi-sentence explanation into one causal chain and cuts transition sentences (not just word abbreviation), keeping the verbatim code-symbol/error-string exemption.
caveman-stats.js: /caveman-stats reports the input-vs-output token split — output is the demo, input is where the weekly budget goes — and points at the input-side levers. Number formatting pinned to en-US so output is locale-deterministic.

Testing

tests/test_hook_output.js (8), tests/test_trim_tool_result.js (15), tests/test_eval_gate.py (7), tests/test_caveman_stats.js (31) all pass.
Trim hook adversarially reviewed (determinism, data corruption, fail-open) — no issues.
Gate verified end-to-end on the committed snapshot: caveman saves ~50% tokens at 100% fidelity.

Notes

Clean-room: hook mechanisms are from the public Claude Code hook docs.
The Windows safeWriteFlag temp-leak fix is a separate PR (fix(hooks): prevent safeWriteFlag temp-file leak on Windows rename co… #578) — this branch deliberately leaves caveman-config.js untouched.
fidelity.json was judged inline (an authenticated claude -p was unavailable in the author's environment); regenerate with evals/judge.py once CLI auth is set up.

🤖 Generated with Claude Code

…ed reinforcement, one-shot nudge)
In agentic Claude Code the weekly limit is dominated by INPUT (context replayed every turn), not output. These Tier-1 changes trim the input caveman itself adds and the structural output that compounds into input:
- skills/caveman/SKILL.md: new `## Agentic Loop` section (result-first, no recap of files just read, plan once then deltas, one intent clause per tool batch, no redundant confirmations). Cross-references Auto-Clarity so security / irreversible / multi-step work stays full prose.
- caveman-mode-tracker.js: emit the per-turn reinforcement every 3rd turn (1,4,7, ...) plus any turn that (re)activates the mode, instead of every turn. The counter lives in a separate flag file so the injected string stays byte-stable and keeps hitting the prompt cache. caveman-activate.js resets it each session.
- caveman-activate.js: make the statusline-setup nudge one-shot via a .caveman-nudged marker, instead of re-appending ~90 tokens to the cached SessionStart prefix every session for users who never wire up the statusline.
- caveman-help SKILL.md: document /caveman-compress on memory files as the headline input-token lever, plus statusline setup.
- tests/test_hook_output.js: cover cadence, byte-stable cache-safe payloads, and the one-shot nudge.
Update checksums.sha256.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

measure.py only counts tokens, which is gameable (a skill replying `k` scores -99% and "wins"). Add the missing correctness axis so a compression change is accepted only if it saves tokens AND stays correct.
- evals/prompts/rubrics.json: per-prompt binary fact checklist, risk class, and verbatim invariants (tokens that must survive compression, e.g. TCP/EXPLAIN/rebase) for every prompt in en.txt.
- evals/judge.py: scores each (arm, prompt) answer into fidelity.json, using an LLM judge (temp 0, optional --runs majority) for facts plus a deterministic verbatim check. Fails closed on unparseable judge output.
- evals/gate.py: pure/offline two-gate decision. TOKEN gate (savings don't regress past a noise floor) AND QUALITY gate (mean fidelity drop <= tol; no prompt past its risk band's hard limit, normal 10pt and high 0pt; every verbatim invariant holds). High-risk prompts (e.g. git-rebase history warning) allow zero fidelity loss.
- tests/test_eval_gate.py: offline unit tests of the gate logic (no LLM/tiktoken), incl. the canonical "reply k to everything is REJECTED".
- .github/workflows/eval-gate.yml: runs the offline gate self-test + token measurement on PRs touching SKILL.md / rules / evals.
- evals/README.md: document the fidelity axis, the two-gate rule, and calibration.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
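
The deterministic half of the judge could look roughly like this. It is a sketch only: `verbatim_held` and `majority` are hypothetical helpers, exact case-sensitive substring matching is an assumption, and so is treating a tied vote as a failure.

```python
from collections import Counter


def verbatim_held(answer: str, invariants: list[str]) -> bool:
    """Every invariant token (e.g. "EXPLAIN", "rebase") must appear verbatim."""
    return all(token in answer for token in invariants)


def majority(verdicts: list[bool]) -> bool:
    """Combine repeated judge runs for one fact; a tie counts as a failure."""
    counts = Counter(verdicts)
    return counts[True] > counts[False]
```

Because the verbatim check needs no model call, it gives the same answer on every run, which is what lets the gate treat a dropped invariant as a hard failure.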

…(Tier 2)
In agentic Claude Code the weekly token limit is dominated by INPUT: every turn re-sends every prior tool result. caveman's output compression never touches those bytes. This adds a PostToolUse hook that, when enabled, replaces oversized built-in tool results (Read/Bash/Grep/Glob) with a trimmed version via the public `updatedToolOutput` field, before they ever cost context.
- src/hooks/caveman-trim-tool-result.js: a pure, deterministic transform that strips ANSI/CR/whitespace noise (lossless), then for still-huge text keeps head+tail with a re-run marker. Never touches JSON-ish results. Fails open (emits nothing on any error, so the original result passes through). Determinism keeps the re-sent result byte-stable so the prompt cache stays warm.
- OFF by default: the hook exits immediately unless CAVEMAN_TRIM_TOOL_RESULTS=1. Threshold tunable via CAVEMAN_TRIM_THRESHOLD (default 8000 chars).
- bin/install.js --with-trim: wires a PostToolUse(Read|Bash|Grep|Glob) entry in settings.json, independent of the plugin (which registers no PostToolUse), so it never double-fires. Not wired by the plugin manifest, which avoids a per-tool hook spawn for users who don't opt in.
- bin/lib/settings.js: addCommandHook now supports an optional `matcher`; caveman-trim-tool-result.js added to MANAGED_HOOK_BASENAMES.
- tests/test_trim_tool_result.js: 15 tests covering determinism, lossless stripping, head/tail elision, JSON passthrough, token preservation, fail-open, env gate.
- src/hooks/README.md: document opt-in wiring + runtime enable.
checksums updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
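
A minimal Python sketch of the transform described in this commit; the real hook is JavaScript. The 8000-char threshold is the documented default, while the `keep` size, the marker wording and the JSON-ish heuristic are assumptions made for illustration.

```python
import re

# CSI escape sequences such as "\x1b[31m" (colours, cursor movement).
ANSI_RE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


def strip_noise(text: str) -> str:
    """Drop ANSI escapes, normalise CR/CRLF to LF, and strip trailing whitespace."""
    text = ANSI_RE.sub("", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return "\n".join(line.rstrip() for line in text.split("\n"))


def looks_like_json(text: str) -> bool:
    # Assumed heuristic: anything that opens like an object or array is left alone.
    return text.lstrip().startswith(("{", "["))


def trim(text: str, threshold: int = 8000, keep: int = 3000) -> str:
    if looks_like_json(text):
        return text  # never touch structured results
    cleaned = strip_noise(text)
    if len(cleaned) <= threshold:
        return cleaned
    omitted = len(cleaned) - 2 * keep
    marker = f"\n[... {omitted} chars trimmed; re-run the tool for the full output ...]\n"
    return cleaned[:keep] + marker + cleaned[-keep:]
```

The function has no clock, randomness or environment dependence, so the same input always yields the same bytes, which is the property the commit relies on to keep the prompt cache warm.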

…nt output
Locate-shaped work ("where is X", "what calls Y", "list uses", "map dir") is where delegation pays off most: the verbose Grep/Read runs in the subagent and never enters main context; only a ~60%-smaller path:line map returns. Nudge the main thread toward it and bound the result so a delegation can't itself blow up.
- caveman-mode-tracker.js: on investigation-shaped prompts (tightly-scoped verb regex), append an advisory nudge to prefer cavecrew-investigator over inline Grep/Read. Assembled alongside the cadence-gated reinforcement as byte-stable segments (no per-turn-varying token -> cache-safe). Advisory only: the model still skips it for a one-line lookup.
- agents/cavecrew-investigator.md / cavecrew-reviewer.md: cap output at 25 rows/findings with a "+N more" line, so a huge match set returns bounded and the main thread knows it was capped.
- skills/cavecrew/SKILL.md: document the locate default + the row cap.
- tests/test_hook_output.js: cover the nudge (fires on locate prompts incl. off-cadence, silent otherwise).
checksums updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
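
To illustrate the two pieces above: the regex below is a hypothetical stand-in for the hook's "tightly-scoped verb regex" (the actual pattern is not shown in this PR text), and `cap_rows` only models the output shape that the agent prompts ask for, since the real cap is an instruction in the agent files rather than code.

```python
import re

# Hypothetical pattern built from the example phrasings in the commit message.
LOCATE_RE = re.compile(r"^\s*(where is|what calls|list uses|map dir)\b", re.IGNORECASE)


def is_locate_prompt(prompt: str) -> bool:
    return LOCATE_RE.search(prompt) is not None


def cap_rows(rows: list[str], limit: int = 25) -> list[str]:
    """Return at most `limit` rows, plus a "+N more" line when rows were dropped."""
    if len(rows) <= limit:
        return list(rows)
    return list(rows[:limit]) + [f"+{len(rows) - limit} more"]
```

The trailing "+N more" line matters as much as the cap itself: without it the main thread could not tell a complete result from a truncated one.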

- evals/snapshots/fidelity.json: baseline correctness snapshot for the caveman arm, with all 10 prompts at 100% fidelity and all verbatim invariants held (incl. the high-risk git-rebase "rewrites history" warning). Judged inline by Opus because judge.py's `claude -p` path returns 401 in this environment; regenerate with evals/judge.py once that auth is available. Pairs with the committed results.json.
- evals/gate.py: replace the check/cross/em-dash glyphs in the report with ASCII so it doesn't crash on a Windows cp1252 console (UnicodeEncodeError).
Verified end-to-end: gate baseline-vs-self ACCEPTs (token savings ~50%, 0 fidelity drop).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…(Tier 4)
- skills/caveman/SKILL.md: strengthen the `ultra` level to collapse multi-sentence explanations into a single causal chain and cut transition sentences (not just abbreviate words), keeping the verbatim code-symbol/error-string exemption.
- caveman-stats.js: /caveman-stats now reports the input vs output token split. caveman compresses output, but in agentic use the weekly limit is dominated by input (the whole context is re-sent every turn). Reframes the win and points at the input-side levers (/caveman-compress, the opt-in trim hook). Also pin number formatting to en-US so output is deterministic regardless of system locale (a pt-BR locale rendered "1.000" and broke the existing savings test).
- tests/test_caveman_stats.js: cover the split (shown with input data, omitted without).
checksums regenerated from the committed (LF) blobs.
Deferred with reasons: aggressive auto-intensity (trivial->ultra) needs the Tier-3 gate to validate empirically first; caveman-shrink tools/call result compression is low-impact (MCP-proxy users only) and risks corrupting structured data (the Tier-2 trim hook already covers built-in tool results).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
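
A small sketch of a locale-independent input/output split line. The wording of the line and the `split_line` name are hypothetical; the real command is JavaScript and pins en-US there, whereas this Python version relies on the `,` format specifier, which does not consult the system locale.

```python
from typing import Optional


def split_line(input_tokens: Optional[int], output_tokens: int) -> str:
    """Render the input/output share; omit the line when no input data exists."""
    if input_tokens is None:
        return ""
    total = input_tokens + output_tokens
    if total == 0:
        return ""
    pct_in = round(100 * input_tokens / total)
    # "," always groups with commas, so a pt-BR machine still prints "9,000".
    return f"input {input_tokens:,} ({pct_in}%) / output {output_tokens:,} ({100 - pct_in}%)"
```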

Replace the inline-Opus baseline with a real judge.py run (claude-haiku-4-5, temp 0): the caveman arm scores 100% fidelity on all 10 prompts, every verbatim invariant held (incl. the high-risk git-rebase "rewrites history" warning). Per-fact verdicts now come from the judge model rather than inline judgment — same result as the inline pass, confirming it. Gate baseline-vs-self ACCEPTs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…uirement
Two Windows-user bugs surfaced while regenerating fidelity.json:
- run_claude hardcoded ["claude", ...]; on Windows the CLI is a .cmd shim that CreateProcess can't launch without a shell, so judge.py died before any judge call. Resolve via shutil.which (prefers claude.exe via PATHEXT) and route a .cmd/.bat shim through `cmd /c`. Verified it resolves the local npm shim.
- the docs said `uv run python evals/judge.py`, but judge.py is pure stdlib and the repo ships no pyproject.toml / uv. Use plain `python`; note tiktoken is needed only for gate.py/measure.py.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
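
The resolution logic described in the first bullet might look like the sketch below. `claude_argv` is a hypothetical helper name and the injectable `which` parameter exists only so the function can be exercised without a real CLI on PATH; the actual run_claude in judge.py may be structured differently.

```python
import shutil


def claude_argv(args, which=shutil.which):
    """Build an argv that launches the claude CLI on both POSIX and Windows."""
    exe = which("claude")
    if exe is None:
        raise FileNotFoundError("claude CLI not found on PATH")
    # A .cmd/.bat shim is a batch script, so it has to go through the shell.
    if exe.lower().endswith((".cmd", ".bat")):
        return ["cmd", "/c", exe, *args]
    return [exe, *args]
```

The resulting list can be handed to `subprocess.run` unchanged, for example `subprocess.run(claude_argv(["-p", prompt]), capture_output=True, text=True)`.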

caveman-shrink only compressed tools/list descriptions before. Add an opt-in (CAVEMAN_SHRINK_RESULTS=1) pass over tools/call result content[].text: a LOSSLESS strip of ANSI/terminal noise + whitespace only — never the prose compress() (that would corrupt data the model acts on). JSON-ish text and non-text content (images/resources) are skipped; deterministic so the re-sent result stays cache-stable. Tests cover the strip, the no-prose-compress guarantee, JSON passthrough, and determinism. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

/caveman auto lets caveman pick the intensity per answer instead of a fixed level: trivial fact/yes-no -> ultra fragments, routine explain/fix -> full, design/security/irreversible/multi-step -> full prose (Auto-Clarity wins). Opt-in, so it never surprises a user who set a fixed level: no behaviour change by default, hence no eval-gate dependency.
- caveman-config.js: add `auto` to VALID_MODES (so /caveman auto and the flag round-trip; the mode-tracker slash handler already routes any valid level).
- skills/caveman/SKILL.md: `auto` intensity row + frontmatter; caveman-help: row.
- caveman-statusline.sh/.ps1: whitelist `auto` so the badge renders [CAVEMAN:AUTO].
- tests/test_hook_output.js: /caveman auto writes the flag.
checksums updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ot just strings The Bash tool_response is an object ({stdout, stderr, ...}); the string-only guard made the trim hook a no-op for Bash. extractText() now pulls stdout (or output/content/text/result) from an object, trims it, re-appends a short stderr tail, and returns a string updatedToolOutput. Objects with no text field and JSON-ish text still pass through. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
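
The extraction step can be sketched in Python as follows (the hook itself is JavaScript). The key order follows the commit message; the stderr tail length and the `[stderr]` separator are assumptions.

```python
TEXT_KEYS = ("stdout", "output", "content", "text", "result")


def extract_text(tool_response):
    """Return the text payload of a tool response, or None if there is none."""
    if isinstance(tool_response, str):
        return tool_response
    if isinstance(tool_response, dict):
        for key in TEXT_KEYS:
            value = tool_response.get(key)
            if isinstance(value, str):
                return value
    return None  # no text field: the caller lets the original pass through


def with_stderr_tail(text, tool_response, tail_chars=500):
    """Re-append a short stderr tail so error context is not lost."""
    stderr = tool_response.get("stderr") if isinstance(tool_response, dict) else None
    if isinstance(stderr, str) and stderr:
        return text + "\n[stderr]\n" + stderr[-tail_chars:]
    return text
```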

… Code
Verified empirically on a live session (logging wrapper on the PostToolUse hook):
1. Claude Code does NOT apply a string updatedToolOutput when tool_response is a structured object, and every built-in tool returns one (Bash {stdout,...}, Read {type,file}, Grep {mode,content,...}, Glob likewise). The hook fires and emits a valid trimmed updatedToolOutput, but CC shows the original unchanged. So it's a no-op for all four matched tools, not just Bash.
2. CC already persists oversized results natively (~30KB -> disk + ~2KB preview), more aggressively than this hook would.
The hook code is correct + unit-tested; the limitation is in the PostToolUse API surface. Left OFF by default. README note added so users/maintainers know its real-world effect is ~0 until a future CC honors updatedToolOutput for objects (which would require returning the object shape with a trimmed text field).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

edubraqd · 2026-06-27T22:10:40Z

Heads-up on item 3 (the opt-in PostToolUse trim hook)

Verified empirically on a live Claude Code session (logging wrapper on the hook): the trim hook currently has ~no real-world effect, for two reasons:

Claude Code does not apply a string updatedToolOutput when the tool_response is a structured object — and every built-in tool returns one: Bash {stdout, stderr, interrupted, …}, Read {type, file}, Grep {mode, content, numLines, …}, Glob likewise. The hook fires, extracts the text, trims it, and emits a valid updatedToolOutput string — but CC shows the original result unchanged. So it's a no-op for all four matched tools.
Claude Code already persists oversized results natively (~30KB → saved to disk + a ~2KB preview), more aggressively than this hook would.

The hook code is correct and unit-tested; the limitation is in the PostToolUse API surface, not the implementation. It ships OFF by default, so it harms nothing — but it won't save tokens on current Claude Code. A doc note was added (src/hooks/README.md). If a future CC honors updatedToolOutput for object results, the hook would need to return the object shape with a trimmed text field (not a string).

The other items in this PR (agentic-loop discipline, gated reinforcement, one-shot nudge, eval quality gate, cavecrew locate routing, structural ultra, /caveman-stats input/output split) are unaffected.

…urce) PostToolUse can't trim built-in results (CC honors output replacement for MCP tools only); PreToolUse updatedInput IS applied to the real call (verified live). caveman-bound-tool-input.js caps an unbounded Read's `limit` (maxResultSizeChars is Infinity = no native protection) and an oversized Grep head_limit, at the source. Opt-in (--with-bound + CAVEMAN_BOUND_TOOL_INPUT=1). 13 tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
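
A sketch of the input-bounding idea in Python. Only the `limit` and `head_limit` field names come from the commit message; the cap values, the function name and the "return None when nothing changes" convention are assumptions, and the hook output envelope that carries updatedInput is deliberately left out.

```python
def bound_tool_input(tool_name, tool_input, read_cap=2000, grep_cap=250):
    """Return a bounded copy of tool_input, or None when no change is needed."""
    updated = dict(tool_input)  # never mutate the caller's dict
    if tool_name == "Read" and updated.get("limit") is None:
        updated["limit"] = read_cap  # unbounded Read: cap the line count
    elif tool_name == "Grep" and (updated.get("head_limit") or 0) > grep_cap:
        updated["head_limit"] = grep_cap  # oversized Grep: clamp it
    else:
        return None
    return updated
```

Bounding the request rather than the response means the oversized result never exists in the first place, which sidesteps the PostToolUse limitation entirely.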

…lever)
The weekly token limit is dominated by cache_read: the entire conversation context is re-sent on EVERY turn, so a long session that never resets is the true burn (a 17k-turn session re-sending a ~500K context = billions of cache_read tokens, far more than any output the caveman style saves). mode-tracker now reads the transcript tail each turn, estimates the live context size (input + cache_creation + cache_read of the last turn), and writes a humanized value to .caveman-ctx. The statusline renders it color-coded (green <180K / yellow <320K / red), so the user SEES the session ballooning. When it crosses the soft/hard threshold the hook periodically nudges the model to suggest /clear or a fresh session, rate-limited so it never spams.
- caveman-mode-tracker.js: readContextSize() (256KB tail read, fail-open), humanizeTok(), .caveman-ctx write via safeWriteFlag, graduated guard segments
- caveman-statusline.{sh,ps1}: render `ctx 200K` color-coded; same symlink-refuse + whitelist hardening as the flag/savings files
- 4 new tests (ctx write, no-transcript silence, hard guard at turn 20, below-threshold silence); 13/13 pass
- checksums.sha256 regenerated for the 3 changed hook files
- src/hooks/README.md: document the meter + guard
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
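
The meter's arithmetic can be sketched in Python as below. The three usage field names are an assumption about how the last turn's usage appears in the transcript, and the function names are illustrative equivalents of readContextSize()/humanizeTok(), not the hook's code; the colour thresholds are the ones stated above.

```python
USAGE_KEYS = ("input_tokens", "cache_creation_input_tokens", "cache_read_input_tokens")


def context_size(usage: dict) -> int:
    """Live context estimate: input + cache_creation + cache_read of the last turn."""
    return sum(int(usage.get(key) or 0) for key in USAGE_KEYS)


def humanize_tok(n: int) -> str:
    if n >= 1_000_000:
        return f"{n / 1_000_000:.1f}M"
    if n >= 1_000:
        return f"{round(n / 1_000)}K"
    return str(n)


def ctx_color(n: int) -> str:
    if n < 180_000:
        return "green"
    if n < 320_000:
        return "yellow"
    return "red"
```

As a rough upper bound on the example in the commit message, 17,000 turns each re-sending about 500K tokens comes to 17,000 x 500,000 = 8.5 billion cache_read tokens, assuming the context sat near 500K for the whole session.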

edubraqd and others added 12 commits June 27, 2026 17:02

edubraqd and others added 2 commits June 27, 2026 19:36

edubraqd wants to merge 14 commits into JuliusBrussee:main from edubraqd:feat/input-token-reductions
