feat: input-side token-saving upgrades + eval quality gate #579

Open
edubraqd wants to merge 14 commits into JuliusBrussee:main from edubraqd:feat/input-token-reductions

edubraqd commented Jun 27, 2026

Why

In agentic Claude Code the weekly token limit is dominated by input: the whole context (system prompt, tool schemas, prior prose, every tool result) is re-sent every turn, so caveman's output compression touches only the smallest slice. These changes target the input side and add a quality gate so compression never silently trades away correctness. A token cut from the re-sent prefix is saved again on every later turn, so the saving compounds; a token cut from output is saved once.

What

1. Agentic-loop output discipline + gated reinforcement + one-shot nudge (24c8e03)

  • skills/caveman/SKILL.md: new ## Agentic Loop section — result-first, no recap of files just read, plan once then deltas, one intent clause per tool batch. Cross-references Auto-Clarity so security/irreversible/multi-step stay full prose.
  • caveman-mode-tracker.js: per-turn reinforcement now fires every 3rd turn (+ on mode (re)activation), not every turn — the counter lives in a separate flag file so the injected string stays byte-stable and keeps hitting the prompt cache.
  • caveman-activate.js: statusline-setup nudge is one-shot (a .caveman-nudged marker) instead of re-appending ~90 tokens to the cached SessionStart prefix every session.
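
The cadence gate described above can be sketched as follows. This is a minimal Python illustration, not the hook's actual JavaScript: the function names and the payload string are hypothetical, and only the cadence (turns 1, 4, 7, ... plus any reactivation turn) and the byte-stable payload come from this PR.

```python
REINFORCE_EVERY = 3  # cadence stated in the PR: every 3rd turn

# Hypothetical payload. The point is that it is a constant: the turn counter
# is never interpolated into it, so the injected bytes never vary.
REINFORCEMENT = "caveman mode active"


def should_reinforce(turn: int, reactivated: bool = False) -> bool:
    """True on turns 1, 4, 7, ... and on any turn that (re)activates the mode."""
    return reactivated or (turn - 1) % REINFORCE_EVERY == 0


def build_injection(turn: int, reactivated: bool = False) -> str:
    """Return the constant payload on cadence turns, otherwise nothing."""
    return REINFORCEMENT if should_reinforce(turn, reactivated) else ""
```

Keeping the counter outside the payload (in a flag file, per the PR) is what lets the same string be emitted each time, so a cached prefix containing it stays valid.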

2. Eval quality gate — fidelity axis + two-gate accept rule (5681437, 5c7e19a)

  • measure.py only counts tokens, which is gameable (a skill that answers every prompt with k scores -99% and "wins"). Adds evals/judge.py (LLM judge for rubric facts + deterministic verbatim check -> fidelity.json) and evals/gate.py (accept only if token savings hold AND fidelity holds; high-risk prompts, such as the git-rebase "rewrites history" warning, allow zero fidelity loss).
  • evals/prompts/rubrics.json for all 10 prompts; a committed fidelity.json baseline; tests/test_eval_gate.py (incl. the canonical "reply k -> REJECT"); .github/workflows/eval-gate.yml.

3. Opt-in PostToolUse tool-result trim (de7da10)

  • caveman-trim-tool-result.js: when enabled, replaces oversized Read/Bash/Grep/Glob results (via the public updatedToolOutput) with a deterministic, lossless-first trimmed version before they enter context. OFF by default (enable with CAVEMAN_TRIM_TOOL_RESULTS=1); wired via --with-trim, independent of the plugin (no double-fire).

4. cavecrew as the default locate route (e2e7076)

  • Nudge the main thread to delegate locate-shaped work to cavecrew-investigator (its verbose Grep/Read stays out of main context); cap investigator/reviewer output at 25 rows.

5. Structural ultra + honest input/output split in /caveman-stats (b01efd3)

  • skills/caveman/SKILL.md: the ultra level now collapses multi-sentence explanation into one causal chain and cuts transition sentences (not just word abbreviation), keeping the verbatim code-symbol/error-string exemption.
  • caveman-stats.js: /caveman-stats reports the input-vs-output token split — output is the demo, input is where the weekly budget goes — and points at the input-side levers. Number formatting pinned to en-US so output is locale-deterministic.

Testing

  • tests/test_hook_output.js (8), tests/test_trim_tool_result.js (15), tests/test_eval_gate.py (7), tests/test_caveman_stats.js (31) all pass.
  • Trim hook adversarially reviewed (determinism, data corruption, fail-open) — no issues.
  • Gate verified end-to-end on the committed snapshot: caveman saves ~50% tokens at 100% fidelity.

Notes

  • Clean-room: hook mechanisms are from the public Claude Code hook docs.
  • The Windows safeWriteFlag temp-leak fix is a separate PR (fix(hooks): prevent safeWriteFlag temp-file leak on Windows rename co… #578) — this branch deliberately leaves caveman-config.js untouched.
  • fidelity.json was judged inline (an authenticated claude -p was unavailable in the author's environment); regenerate with evals/judge.py once CLI auth is set up.

🤖 Generated with Claude Code

edubraqd and others added 12 commits June 27, 2026 17:02
…ed reinforcement, one-shot nudge)

In agentic Claude Code the weekly limit is dominated by INPUT (context replayed
every turn), not output. These Tier-1 changes trim the input caveman itself adds
and the structural output that compounds into input:

- skills/caveman/SKILL.md: new `## Agentic Loop` section — result-first, no recap
  of files just read, plan once then deltas, one intent clause per tool batch, no
  redundant confirmations. Cross-references Auto-Clarity so security / irreversible
  / multi-step work stays full prose.
- caveman-mode-tracker.js: emit the per-turn reinforcement every 3rd turn (1,4,7,
  ...) plus any turn that (re)activates the mode, instead of every turn. The
  counter lives in a separate flag file so the injected string stays byte-stable
  and keeps hitting the prompt cache. caveman-activate.js resets it each session.
- caveman-activate.js: make the statusline-setup nudge one-shot via a
  .caveman-nudged marker, instead of re-appending ~90 tokens to the cached
  SessionStart prefix every session for users who never wire up the statusline.
- caveman-help SKILL.md: document /caveman-compress on memory files as the
  headline input-token lever, plus statusline setup.
- tests/test_hook_output.js: cover cadence, byte-stable cache-safe payloads, and
  the one-shot nudge. Update checksums.sha256.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
measure.py only counts tokens, which is gameable (a skill replying `k` scores
-99% and "wins"). Add the missing correctness axis so a compression change is
accepted only if it saves tokens AND stays correct.

- evals/prompts/rubrics.json: per-prompt binary fact checklist, risk class, and
  verbatim invariants (tokens that must survive compression, e.g. TCP/EXPLAIN/
  rebase) for every prompt in en.txt.
- evals/judge.py: scores each (arm, prompt) answer — LLM judge (temp 0, optional
  --runs majority) for facts + deterministic verbatim check — into fidelity.json.
  Fails closed on unparseable judge output.
- evals/gate.py: pure/offline two-gate decision. TOKEN gate (savings don't
  regress past a noise floor) AND QUALITY gate (mean fidelity drop <= tol, no
  prompt past its risk band's hard limit — normal 10pt, high 0pt — and every
  verbatim invariant holds). High-risk prompts (e.g. git-rebase history warning)
  allow zero fidelity loss.
- tests/test_eval_gate.py: offline unit tests of the gate logic (no LLM/tiktoken),
  incl. the canonical "reply k to everything is REJECTED".
- .github/workflows/eval-gate.yml: runs the offline gate self-test + token
  measurement on PRs touching SKILL.md / rules / evals.
- evals/README.md: document the fidelity axis, the two-gate rule, and calibration.
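
A minimal sketch of the two-gate decision, for illustration only. The per-band hard limits (normal 10pt, high 0pt) come from this commit message; the `PromptScore` shape, the `noise_floor` and `mean_tol` defaults, and the convention that savings are positive percentage points are assumptions, not the contents of evals/gate.py.

```python
from dataclasses import dataclass

# Per-prompt fidelity drop allowed, in points (from the commit message).
HARD_LIMIT_PTS = {"normal": 10.0, "high": 0.0}


@dataclass
class PromptScore:  # hypothetical record shape
    risk: str             # "normal" or "high"
    base_fidelity: float  # baseline fidelity, 0..100
    cand_fidelity: float  # candidate fidelity, 0..100
    verbatim_ok: bool     # every verbatim invariant survived


def gate(base_savings, cand_savings, scores, noise_floor=2.0, mean_tol=5.0):
    """Return "ACCEPT" only if the TOKEN gate and the QUALITY gate both pass."""
    # TOKEN gate: savings may not regress past the noise floor.
    token_ok = cand_savings >= base_savings - noise_floor

    # QUALITY gate: mean drop within tolerance, no prompt past its band's
    # hard limit, and every verbatim invariant holds.
    drops = [s.base_fidelity - s.cand_fidelity for s in scores]
    mean_ok = (sum(drops) / len(drops)) <= mean_tol if drops else True
    band_ok = all(d <= HARD_LIMIT_PTS[s.risk] for d, s in zip(drops, scores))
    verbatim_ok = all(s.verbatim_ok for s in scores)

    ok = token_ok and mean_ok and band_ok and verbatim_ok
    return "ACCEPT" if ok else "REJECT"
```

Under these assumptions, a candidate that replies `k` to everything maximizes token savings but fails the quality gate, and a high-risk prompt that loses even one fidelity point is rejected.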

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…(Tier 2)

In agentic Claude Code the weekly token limit is dominated by INPUT — every turn
re-sends every prior tool result. caveman's output compression never touches
those bytes. This adds a PostToolUse hook that, when enabled, replaces oversized
built-in tool results (Read/Bash/Grep/Glob) with a trimmed version via the public
`updatedToolOutput` field, before they ever cost context.

- src/hooks/caveman-trim-tool-result.js: pure, deterministic transform — strips
  ANSI/CR/whitespace noise (lossless), then for still-huge text keeps head+tail
  with a re-run marker. Never touches JSON-ish results. Fails open (emits nothing
  on any error, so the original result passes through). Determinism keeps the
  re-sent result byte-stable so the prompt cache stays warm.
- OFF by default: the hook exits immediately unless CAVEMAN_TRIM_TOOL_RESULTS=1.
  Threshold tunable via CAVEMAN_TRIM_THRESHOLD (default 8000 chars).
- bin/install.js --with-trim: wires a PostToolUse(Read|Bash|Grep|Glob) entry in
  settings.json, independent of the plugin (which registers no PostToolUse), so
  it never double-fires. Not wired by the plugin manifest — avoids a per-tool
  hook spawn for users who don't opt in.
- bin/lib/settings.js: addCommandHook now supports an optional `matcher`;
  caveman-trim-tool-result.js added to MANAGED_HOOK_BASENAMES.
- tests/test_trim_tool_result.js: 15 tests — determinism, lossless stripping,
  head/tail elision, JSON passthrough, token preservation, fail-open, env gate.
- src/hooks/README.md: document opt-in wiring + runtime enable. checksums updated.
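
The transform described above could look roughly like this in Python. It is a sketch under stated assumptions, not the hook's JavaScript: the function names, the marker wording, the `keep` size, the JSON heuristic, and the exact definition of "whitespace noise" are all hypothetical. Only the 8000-char default threshold, the lossless-first order, the head+tail elision, and the JSON passthrough come from the commit message.

```python
import re

ANSI_RE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")  # CSI escape sequences


def strip_noise(text: str) -> str:
    """Remove ANSI escapes, carriage returns, trailing spaces, blank-line runs."""
    text = ANSI_RE.sub("", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    return re.sub(r"\n{3,}", "\n\n", text)


def looks_like_json(text: str) -> bool:
    """Crude heuristic (assumption): structured text starts with { or [."""
    return text.strip()[:1] in ("{", "[")


def trim_result(text: str, threshold: int = 8000, keep: int = 3000) -> str:
    """Pure function of its input, so repeated calls return identical bytes."""
    if looks_like_json(text):
        return text  # never touch JSON-ish results
    cleaned = strip_noise(text)
    if len(cleaned) <= threshold:
        return cleaned
    keep = min(keep, threshold // 2)  # head and tail can never overlap
    omitted = len(cleaned) - 2 * keep
    marker = f"\n[... {omitted} chars trimmed; re-run the tool to see them ...]\n"
    return cleaned[:keep] + marker + cleaned[-keep:]
```

A real hook would wrap this in a try/except that emits nothing on error, which is the fail-open behavior the commit describes.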

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nt output

Locate-shaped work ("where is X", "what calls Y", "list uses", "map dir") is
where delegation pays off most: the verbose Grep/Read runs in the subagent and
never enters main context — only a ~60%-smaller path:line map returns. Nudge the
main thread toward it and bound the result so a delegation can't itself blow up.

- caveman-mode-tracker.js: on investigation-shaped prompts (tightly-scoped verb
  regex), append an advisory nudge to prefer cavecrew-investigator over inline
  Grep/Read. Assembled alongside the cadence-gated reinforcement as byte-stable
  segments (no per-turn-varying token -> cache-safe). Advisory only — the model
  still skips it for a one-line lookup.
- agents/cavecrew-investigator.md / cavecrew-reviewer.md: cap output at 25 rows/
  findings with a "+N more" line, so a huge match set returns bounded and the
  main thread knows it was capped.
- skills/cavecrew/SKILL.md: document the locate default + the row cap.
- tests/test_hook_output.js: cover the nudge (fires on locate prompts incl.
  off-cadence, silent otherwise). checksums updated.
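
For illustration, the locate detection and the row cap might be sketched like this. The regex is hypothetical (the commit only says "tightly-scoped verb regex" and does not show it), as are the function names; the 25-row cap and the "+N more" line come from the commit message.

```python
import re

# Hypothetical pattern built from the example phrasings in the commit message.
LOCATE_RE = re.compile(
    r"^\s*(where\s+is|what\s+calls|list\s+uses|map\s+dir)\b",
    re.IGNORECASE,
)


def is_locate_prompt(prompt: str) -> bool:
    """True when the prompt opens with a locate-shaped verb phrase."""
    return bool(LOCATE_RE.search(prompt))


def cap_rows(rows, limit: int = 25):
    """Bound a result list and tell the caller how many rows were dropped."""
    rows = list(rows)
    if len(rows) <= limit:
        return rows
    return rows[:limit] + [f"+{len(rows) - limit} more"]
```

Anchoring the pattern at the start of the prompt keeps it conservative: a prompt that merely mentions "where is" mid-sentence does not trigger the nudge in this sketch.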

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- evals/snapshots/fidelity.json: baseline correctness snapshot for the caveman
  arm — all 10 prompts at 100% fidelity, all verbatim invariants held (incl. the
  high-risk git-rebase "rewrites history" warning). Judged inline by Opus because
  judge.py's `claude -p` path returns 401 in this environment; regenerate with
  evals/judge.py once that auth is available. Pairs with the committed results.json.
- evals/gate.py: replace the check/cross/em-dash glyphs in the report with ASCII
  so it doesn't crash on a Windows cp1252 console (UnicodeEncodeError). Verified
  end-to-end: gate baseline-vs-self ACCEPTs (token savings ~50%, 0 fidelity drop).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…(Tier 4)

- skills/caveman/SKILL.md: strengthen the `ultra` level to collapse multi-sentence
  explanations into a single causal chain and cut transition sentences (not just
  abbreviate words), keeping the verbatim code-symbol/error-string exemption.
- caveman-stats.js: /caveman-stats now reports the input vs output token split —
  caveman compresses output, but in agentic use the weekly limit is dominated by
  input (the whole context is re-sent every turn). Reframes the win and points at
  the input-side levers (/caveman-compress, the opt-in trim hook). Also pin number
  formatting to en-US so output is deterministic regardless of system locale (a
  pt-BR locale rendered "1.000" and broke the existing savings test).
- tests/test_caveman_stats.js: cover the split (shown with input data, omitted
  without). checksums regenerated from the committed (LF) blobs.
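
A small Python sketch of a locale-independent input/output split line. The function name and output wording are hypothetical, and caveman-stats.js itself is JavaScript. Python's `,` format spec does not consult the system locale, so it plays the same role here as pinning en-US does in the hook.

```python
def format_split(input_tokens, output_tokens):
    """Return a one-line input/output split, or None when input data is absent."""
    if input_tokens is None:
        return None  # the split is omitted without input data
    total = input_tokens + output_tokens
    if total == 0:
        return None
    in_pct = round(100 * input_tokens / total)
    # The "," format spec always groups with commas, regardless of locale.
    return (
        f"input {input_tokens:,} ({in_pct}%) / "
        f"output {output_tokens:,} ({100 - in_pct}%)"
    )
```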

Deferred with reasons: aggressive auto-intensity (trivial->ultra) needs the
Tier-3 gate to validate empirically first; caveman-shrink tools/call result
compression is low-impact (MCP-proxy users only) and risks corrupting structured
data — the Tier-2 trim hook already covers built-in tool results.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the inline-Opus baseline with a real judge.py run (claude-haiku-4-5,
temp 0): the caveman arm scores 100% fidelity on all 10 prompts, every verbatim
invariant held (incl. the high-risk git-rebase "rewrites history" warning).
Per-fact verdicts now come from the judge model rather than inline judgment —
same result as the inline pass, confirming it. Gate baseline-vs-self ACCEPTs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…uirement

Two Windows-user bugs surfaced while regenerating fidelity.json:
- run_claude hardcoded ["claude", ...]; on Windows the CLI is a .cmd shim that
  CreateProcess can't launch without a shell, so judge.py died before any judge
  call. Resolve via shutil.which (prefers claude.exe via PATHEXT) and route a
  .cmd/.bat shim through `cmd /c` — verified it resolves the local npm shim.
- the docs said `uv run python evals/judge.py`, but judge.py is pure stdlib and
  the repo ships no pyproject.toml / uv. Use plain `python`; note tiktoken is
  needed only for gate.py/measure.py.
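
The resolution step can be sketched as below. `resolve_cli` is a hypothetical helper name, and the injectable `which` parameter exists only so the logic can be exercised without a real CLI on PATH; `shutil.which` and routing through `cmd /c` are the mechanisms named in the commit message.

```python
import shutil


def resolve_cli(name: str = "claude", which=shutil.which):
    """Return an argv prefix that can launch the CLI without shell=True."""
    path = which(name)
    if path is None:
        raise FileNotFoundError(f"{name} not found on PATH")
    # A .cmd/.bat shim is a batch script, not an executable image, so it has
    # to be run through the command interpreter.
    if path.lower().endswith((".cmd", ".bat")):
        return ["cmd", "/c", path]
    return [path]
```

A caller would then build the full command as `resolve_cli() + [...]` and pass it to `subprocess.run`.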

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
caveman-shrink only compressed tools/list descriptions before. Add an opt-in
(CAVEMAN_SHRINK_RESULTS=1) pass over tools/call result content[].text: a LOSSLESS
strip of ANSI/terminal noise + whitespace only — never the prose compress()
(that would corrupt data the model acts on). JSON-ish text and non-text content
(images/resources) are skipped; deterministic so the re-sent result stays
cache-stable. Tests cover the strip, the no-prose-compress guarantee, JSON
passthrough, and determinism.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
/caveman auto lets caveman pick the intensity per answer instead of a fixed
level: trivial fact/yes-no -> ultra fragments, routine explain/fix -> full,
design/security/irreversible/multi-step -> full prose (Auto-Clarity wins).
Opt-in, so it never surprises a user who set a fixed level — no behaviour change
by default, hence no eval-gate dependency.

- caveman-config.js: add `auto` to VALID_MODES (so /caveman auto and the flag
  round-trip; the mode-tracker slash handler already routes any valid level).
- skills/caveman/SKILL.md: `auto` intensity row + frontmatter; caveman-help: row.
- caveman-statusline.sh/.ps1: whitelist `auto` so the badge renders [CAVEMAN:AUTO].
- tests/test_hook_output.js: /caveman auto writes the flag. checksums updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ot just strings

The Bash tool_response is an object ({stdout, stderr, ...}); the string-only
guard made the trim hook a no-op for Bash. extractText() now pulls stdout (or
output/content/text/result) from an object, trims it, re-appends a short stderr
tail, and returns a string updatedToolOutput. Objects with no text field and
JSON-ish text still pass through.
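
A rough Python sketch of the extraction step, assuming the field order given above. The helper names, the stderr marker text, and the tail length are hypothetical; the real extractText() is JavaScript.

```python
# Candidate text fields, in the order listed in the commit message.
TEXT_FIELDS = ("stdout", "output", "content", "text", "result")


def extract_text(tool_response):
    """Return the text payload of a tool response, or None to pass it through."""
    if isinstance(tool_response, str):
        return tool_response
    if isinstance(tool_response, dict):
        for field in TEXT_FIELDS:
            value = tool_response.get(field)
            if isinstance(value, str) and value:
                return value
    return None  # no text field: leave the original result untouched


def with_stderr_tail(trimmed: str, tool_response, tail_chars: int = 500) -> str:
    """Re-append a short stderr tail so error output is not lost by the trim."""
    stderr = tool_response.get("stderr") if isinstance(tool_response, dict) else None
    if isinstance(stderr, str) and stderr.strip():
        return trimmed + "\n[stderr tail]\n" + stderr[-tail_chars:]
    return trimmed
```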

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… Code

Verified empirically on a live session (logging wrapper on the PostToolUse hook):
1. Claude Code does NOT apply a string updatedToolOutput when tool_response is a
   structured object — and every built-in tool returns one (Bash {stdout,...},
   Read {type,file}, Grep {mode,content,...}, Glob likewise). The hook fires and
   emits a valid trimmed updatedToolOutput, but CC shows the original unchanged.
   So it's a no-op for all four matched tools, not just Bash.
2. CC already persists oversized results natively (~30KB -> disk + ~2KB preview),
   more aggressively than this hook would.

The hook code is correct + unit-tested; the limitation is in the PostToolUse API
surface. Left OFF by default. README note added so users/maintainers know its
real-world effect is ~0 until a future CC honors updatedToolOutput for objects
(which would require returning the object shape with a trimmed text field).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@edubraqd

Author

Heads-up on item 3 (the opt-in PostToolUse trim hook)

Verified empirically on a live Claude Code session (logging wrapper on the hook): the trim hook currently has ~no real-world effect, for two reasons:

  1. Claude Code does not apply a string updatedToolOutput when the tool_response is a structured object — and every built-in tool returns one: Bash {stdout, stderr, interrupted, …}, Read {type, file}, Grep {mode, content, numLines, …}, Glob likewise. The hook fires, extracts the text, trims it, and emits a valid updatedToolOutput string — but CC shows the original result unchanged. So it's a no-op for all four matched tools.
  2. Claude Code already persists oversized results natively (~30KB → saved to disk + a ~2KB preview), more aggressively than this hook would.

The hook code is correct and unit-tested; the limitation is in the PostToolUse API surface, not the implementation. It ships OFF by default, so it harms nothing — but it won't save tokens on current Claude Code. A doc note was added (src/hooks/README.md). If a future CC honors updatedToolOutput for object results, the hook would need to return the object shape with a trimmed text field (not a string).

The other items in this PR (agentic-loop discipline, gated reinforcement, one-shot nudge, eval quality gate, cavecrew locate routing, structural ultra, /caveman-stats input/output split) are unaffected.

edubraqd and others added 2 commits June 27, 2026 19:36
…urce)

PostToolUse can't trim built-in results (CC honors output replacement for MCP
tools only); PreToolUse updatedInput IS applied to the real call (verified live).
caveman-bound-tool-input.js caps an unbounded Read's `limit` (maxResultSizeChars
is Infinity = no native protection) and an oversized Grep head_limit, at the
source. Opt-in (--with-bound + CAVEMAN_BOUND_TOOL_INPUT=1). 13 tests.
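
The input-bounding idea can be illustrated as follows. The cap values and the function name are hypothetical; only the `limit` and `head_limit` fields and the "cap at the source" approach come from the commit message, and the actual hook is JavaScript.

```python
READ_LIMIT_CAP = 2000  # hypothetical cap on lines returned by an unbounded Read
GREP_HEAD_CAP = 200    # hypothetical cap on Grep head_limit


def bound_tool_input(tool_name: str, tool_input: dict):
    """Return a bounded copy of tool_input, or None to leave the call as-is."""
    if tool_name == "Read" and tool_input.get("limit") is None:
        # Unbounded Read: add a limit rather than trimming the result later.
        return {**tool_input, "limit": READ_LIMIT_CAP}
    if tool_name == "Grep":
        head_limit = tool_input.get("head_limit")
        if isinstance(head_limit, int) and head_limit > GREP_HEAD_CAP:
            return {**tool_input, "head_limit": GREP_HEAD_CAP}
    return None
```

Returning None for the unchanged case mirrors a fail-open hook: when there is nothing to bound, the original input goes through untouched.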

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lever)

The weekly token limit is dominated by cache_read: the entire conversation
context is re-sent on EVERY turn, so a long session that never resets is the
true burn (a 17k-turn session re-sending a ~500K context = billions of
cache_read tokens, far more than any output the caveman style saves).

mode-tracker now reads the transcript tail each turn, estimates the live
context size (input + cache_creation + cache_read of the last turn), and writes
a humanized value to .caveman-ctx. The statusline renders it color-coded
(green <180K / yellow <320K / red), so the user SEES the session ballooning.
When it crosses the soft/hard threshold the hook periodically nudges the model
to suggest /clear or a fresh session — rate-limited so it never spams.

- caveman-mode-tracker.js: readContextSize() (256KB tail read, fail-open),
  humanizeTok(), .caveman-ctx write via safeWriteFlag, graduated guard segments
- caveman-statusline.{sh,ps1}: render `ctx 200K` color-coded; same symlink-
  refuse + whitelist hardening as the flag/savings files
- 4 new tests (ctx write, no-transcript silence, hard guard at turn 20,
  below-threshold silence); 13/13 pass
- checksums.sha256 regenerated for the 3 changed hook files
- src/hooks/README.md: document the meter + guard
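
A minimal Python sketch of the meter arithmetic. The usage-record shape and the `*_input_tokens` field names follow the Anthropic API usage object and are an assumption about what the transcript contains; the function names and rounding rules are illustrative. The 180K/320K color bands and the input + cache_creation + cache_read sum come from the commit message.

```python
USAGE_KEYS = (
    "input_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def context_size(usage: dict) -> int:
    """Estimate live context as the sum of the last turn's input-side counts."""
    return sum(int(usage.get(key) or 0) for key in USAGE_KEYS)


def humanize_tok(n: int) -> str:
    """Render a token count compactly, e.g. 200000 as '200K'."""
    if n >= 1_000_000:
        return f"{n / 1_000_000:.1f}M"
    if n >= 1_000:
        return f"{round(n / 1000)}K"
    return str(n)


def ctx_color(n: int, soft: int = 180_000, hard: int = 320_000) -> str:
    """Green below the soft threshold, yellow below the hard one, red above."""
    if n < soft:
        return "green"
    return "yellow" if n < hard else "red"
```

For scale, the example in the commit message works out to 17,000 turns × 500,000 tokens = 8.5 billion cache_read tokens.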

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>