
Add metrics framework, setup automation, and context router hardening#16

Open
stbiadmin wants to merge 1 commit into GMaN1911:main from stbiadmin:feature/setup-automation-and-metrics

Conversation

@stbiadmin

Summary

This PR adds three things that I believe claude-cognitive needs in order to deliver on its claims:

  1. A metrics framework to determine whether the system actually reduces token usage, or at least provides meaningful context injection and routing.
  2. An interactive setup skill that replaces the tedious manual setup with a 5-minute guided workflow
  3. 10 bug fixes in the context router code, including two critical ones (Windows breakage, stdin corruption)

The existing context routing and pool coordination code is solid. These additions make the system usable by people who didn't write it, and provable by people who want evidence it works.

Motivation

I set up claude-cognitive on a real project and hit these issues:

  • Default keywords only match the author's project. Running the router with generic prompts ("How does the API work?") produces zero activations. A new user who doesn't create keywords.json gets a system that silently does nothing.
  • Setup requires deep knowledge of the internals. Path resolution, fractal doc format, keyword mapping, co-activation rules, hook wiring - each requires reading source code to understand.
  • No way to verify it's working. The README claims 64-95% token savings but ships no tooling to measure this. The usage tracker exists but isn't wired into hooks.
  • Silent failures everywhere. When keywords don't match, the router outputs nothing. When keywords.json is malformed, it falls back silently. Users can't tell "working, nothing to inject" from "completely broken."

What changed

Metrics framework (scripts/metrics/, 5 files, ~2,800 lines)

JSONL-based event collection that hooks into the existing UserPromptSubmit, SessionStart, and Stop lifecycle. Captures per-turn injection size, keyword matches, attention tier distribution, and transition data.

The analyzer computes: token savings statistics (mean/median/percentiles), keyword hit rates, attention dynamics, coverage gaps (which docs never activate), and trends over time.
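The collection side can be sketched roughly as follows. This is an illustrative stand-in, not the actual collector: the file naming and field names are assumptions, but it shows the append-only JSONL pattern with daily rotation that the framework relies on.

```python
import json
import time
from pathlib import Path

def record_event(log_dir, event_type, **fields):
    """Append one event per line; append-only writes tolerate concurrent hooks."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    # Daily rotation: one file per calendar day keeps individual files small.
    path = log_dir / time.strftime("events-%Y-%m-%d.jsonl")
    event = {"ts": time.time(), "type": event_type, **fields}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
    return path
```

Each hook (UserPromptSubmit, SessionStart, Stop) would call this with its own event type and per-turn fields such as injection size and keyword matches.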

Reports use a practical three-scenario framing:

  • Baseline (CLAUDE.md only) - what you get without cognitive
  • With cognitive (baseline + targeted injection) - what the router provides
  • Dump everything (all docs) - the naive alternative

This replaces the original "99.9% savings" metric, which compared against a baseline nobody would/could actually use.

Setup automation (/cognitive-setup skill, install.sh)

A 6-phase skill that:

  1. Checks environment (Python version, existing config)
  2. Scans the codebase (language detection, module discovery, framework identification)
  3. Generates keywords.json from the analysis
  4. Creates fractal documentation stubs with proper <!-- WARM CONTEXT ENDS --> markers
  5. Verifies hook configuration
  6. Runs dry-run validation against project-relevant prompts

Each phase presents its output for review before writing files. The whole thing is idempotent.
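To make phases 2-3 concrete, here is a toy sketch of deriving a keywords.json draft from module names found in the codebase. The real analyzer is more sophisticated (language detection, framework identification); the heuristic and function name here are illustrative only.

```python
from pathlib import Path

def draft_keywords(root):
    """Map each Python module name to candidate source paths (toy heuristic)."""
    root = Path(root)
    mapping = {}
    for py in root.rglob("*.py"):
        # Normalize module names to hyphenated keywords, e.g. pool_loader -> pool-loader.
        keyword = py.stem.replace("_", "-")
        mapping.setdefault(keyword, []).append(str(py.relative_to(root)))
    return mapping
```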

install.sh handles the mechanical parts (copy scripts, merge hooks into settings.json, install skills).

Context router hardening (scripts/context-router-v2.py, +261 lines)

Bug fixes:

| Bug | Severity |
| --- | --- |
| import fcntl crashes on Windows | Critical |
| stdin read twice in except branch (data already consumed) | Critical |
| pinned config parsed but variable unused | Major |
| --diagnostics runs after save_state(), mutating state | Major |
| Session state file race between concurrent sessions | Major |
| datetime.utcnow() deprecated since Python 3.12 | Major |
| Threshold constants duplicated across files | Major |
| save_report() name parameter unsanitized (path traversal) | Major |
| argparse --help exits and kills hook process | Minor |
| _read_last_router_output() reads entire log file | Minor |
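For reviewers, the shape of the two critical fixes is roughly the following. This is a sketch with illustrative helper names, not the patch itself: guard the POSIX-only fcntl import behind a flag, and capture stdin exactly once so the except branch can reuse it.

```python
import json
import sys

# Fix 1: fcntl exists only on POSIX, so an unconditional import crashes on Windows.
try:
    import fcntl
    HAS_FCNTL = True
except ImportError:  # Windows: degrade to no-op locking
    HAS_FCNTL = False

def lock_exclusive(f):
    """Best-effort exclusive lock; silently a no-op where fcntl is unavailable."""
    if HAS_FCNTL:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)

# Fix 2: a second stream.read() in the except branch returns "" because the
# data was already consumed, so read once up front and reuse the captured text.
def read_payload(stream=None):
    raw = (stream or sys.stdin).read()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"prompt": raw}  # reuse the already-captured text
```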

New capabilities:

  • --validate "prompt" for dry-run testing without state mutation
  • --diagnostics for JSON diagnostic output
  • Non-silent failure messages on first 3 turns when nothing activates
  • Project-local .claude/ always preferred over global ~/.claude/
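The project-local-over-global preference amounts to a lookup order like the one below (function name and signature are illustrative, not the router's actual API):

```python
from pathlib import Path

def resolve_config(name, project_root, home):
    """Prefer the project-local .claude/ copy; fall back to the global one."""
    local = Path(project_root) / ".claude" / name
    if local.exists():
        return local
    return Path(home) / ".claude" / name
```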

Supporting additions

  • /cognitive-status skill for health checks (file presence, hook config, attention state)
  • /cognitive-state skill and standalone cognitive-state.py script for checking attention without burning context tokens
  • /cognitive-metrics skill for interactive analysis

What did NOT change

  • No changes to the pool coordination scripts (pool-loader.py, pool-extractor.py, pool-auto-update.py, pool-query.py)
  • No changes to the hook contract (stdin JSON in, stdout text out)
  • No changes to the .claude/settings.json schema
  • No new runtime dependencies beyond Python stdlib
  • All changes are additive. Existing configurations continue to work without modification.

How to review

There are a lot of changes. I recommend focusing on these, in order:

  1. scripts/context-router-v2.py - the bug fixes (search for HAS_FCNTL, raw = sys.stdin.read(), pinned)
  2. scripts/metrics/collector.py - how per-turn data is captured
  3. scripts/metrics/analyzer.py - the analysis logic
  4. .claude/skills/cognitive-setup/SKILL.md - the setup workflow instructions

These can be skimmed or skipped:

  • .claude/skills/*/SKILL.md (other than setup) - skill definitions (natural language instructions)
  • templates/ - example configurations

Testing

Verified by running against the claude-cognitive repo itself.

Design decisions

Why JSONL for metrics storage? Consistency with the existing attention_history.jsonl and instance_state.jsonl patterns. Append-only writes avoid corruption from concurrent hooks. Daily rotation keeps files manageable.

Why a skill instead of a standalone setup script? Users are likely already in a Claude Code session when they want to set up. Skills integrate with the existing workflow. The skill can also auto-install missing scripts, solving the chicken-and-egg problem.

Why reframe the token savings metric? The original metric compared against "inject every .md file on every turn," which inflates savings to 99.9%. The practical comparison is: without cognitive you get CLAUDE.md (399 tokens). With cognitive you get CLAUDE.md plus targeted injection (424 tokens for a quiet turn, more when keywords match). This is honest and useful.
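The arithmetic behind this reframing, using the figures quoted above (the 50,000-token "dump everything" figure is a made-up placeholder for illustration):

```python
def savings_pct(alternative_tokens, actual_tokens):
    """Percentage saved relative to an alternative; negative means overhead."""
    return 100.0 * (alternative_tokens - actual_tokens) / alternative_tokens

baseline = 399            # CLAUDE.md only
quiet_turn = 424          # CLAUDE.md + targeted injection, quiet turn
dump_everything = 50_000  # hypothetical: every .md file on every turn
```

Against the naive alternative the savings look like 99%+; against the realistic CLAUDE.md-only baseline, a quiet turn is actually a small overhead (about -6%), which is the honest number.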

Why not wire usage_tracker.py into hooks? The metrics collector already captures keyword effectiveness and file activation data per-turn, which overlaps with usage_tracker's purpose. Wiring it in would require parsing tool call results to infer file access, which is complex for unclear incremental value.
