Add metrics framework, setup automation, and context router hardening #16
Open
stbiadmin wants to merge 1 commit into GMaN1911:main from
Adds a metrics framework for measuring context routing effectiveness, an interactive setup skill that replaces manual configuration with a guided workflow, and 10 bug fixes in the context router, including Windows compatibility and stdin corruption.

New components:
- Metrics framework (scripts/metrics/) with JSONL event collection, statistical analysis, and report generation
- /cognitive-setup skill with a project analyzer for automated keyword generation and documentation stub creation
- /cognitive-status, /cognitive-state, and /cognitive-metrics skills
- install.sh for one-command installation

Context router fixes:
- fcntl import crash on Windows (conditional import with fallback)
- stdin double-read in an except branch (capture once, reuse)
- pinned config parsed but never applied
- --diagnostics running after save_state(), mutating state
- Session state file race between concurrent sessions
- datetime.utcnow() deprecated since Python 3.12
- save_report() path traversal via an unsanitized name parameter
- argparse --help killing the hook process

All changes are additive. Existing configurations continue to work.
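Two of the router fixes listed above follow well-known patterns; a minimal sketch with a hypothetical function name (not the PR's actual code):

```python
import json
import sys

# Conditional import: fcntl only exists on POSIX systems, so a bare
# `import fcntl` would crash the hook process on Windows.
try:
    import fcntl  # noqa: F401
    HAS_FCNTL = True
except ImportError:
    HAS_FCNTL = False


def read_hook_payload():
    # Capture stdin exactly once; a second sys.stdin.read() in an
    # except branch returns '' because the stream is already drained.
    raw = sys.stdin.read()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Reuse the captured string rather than re-reading stdin.
        return {"prompt": raw}
```

Code that needs locking can then branch on HAS_FCNTL and fall back to a no-op (or a different locking primitive) on Windows.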
Summary
This PR adds three things that I believe claude-cognitive needs in order to deliver on its claims:

- a metrics framework for measuring context routing effectiveness
- setup automation that replaces manual configuration
- hardening of the context router

The existing context routing and pool coordination code is solid. These additions make the system usable by people who didn't write it, and provable by people who want evidence that it works.
Motivation
I set up claude-cognitive on a real project and hit these issues:
- A user without keywords.json gets a system that silently does nothing.
- If keywords.json is malformed, the router falls back silently. Users can't tell "working, nothing to inject" from "completely broken."

What changed
Metrics framework (scripts/metrics/, 5 files, ~2,800 lines)

JSONL-based event collection that hooks into the existing UserPromptSubmit, SessionStart, and Stop lifecycle. Captures per-turn injection size, keyword matches, attention tier distribution, and transition data.

The analyzer computes token savings statistics (mean/median/percentiles), keyword hit rates, attention dynamics, coverage gaps (which docs never activate), and trends over time.
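A minimal sketch of this collection pattern, assuming a hypothetical metrics directory and illustrative field names (not the collector's actual schema):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

METRICS_DIR = Path(".claude/metrics")  # hypothetical location


def record_event(event_type: str, **fields) -> Path:
    """Append one event as a JSONL line; one file per day keeps logs manageable."""
    METRICS_DIR.mkdir(parents=True, exist_ok=True)
    now = datetime.now(timezone.utc)  # utcnow() is deprecated since Python 3.12
    path = METRICS_DIR / f"events-{now:%Y-%m-%d}.jsonl"
    event = {"ts": now.isoformat(), "type": event_type, **fields}
    # A single append-only write per event keeps concurrent hooks from
    # corrupting each other's records.
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
    return path


path = record_event("user_prompt_submit",
                    injected_tokens=424,
                    keyword_matches=["router", "metrics"])
```

Each lifecycle hook would call record_event with its own event type; the analyzer then only needs to read lines back with json.loads.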
Reports use a practical three-scenario framing. This replaces the original "99.9% savings" metric, which compared against a baseline nobody would or could actually use.
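The summary statistics the analyzer reports (mean/median/percentiles) can be computed with the standard library; a sketch over collected JSONL lines, assuming an illustrative injected_tokens field:

```python
import json
from statistics import mean, median, quantiles


def injection_stats(jsonl_lines):
    """Summarize per-turn injected token counts from collected events."""
    sizes = [json.loads(line)["injected_tokens"]
             for line in jsonl_lines if line.strip()]
    if len(sizes) < 2:
        return None  # quantiles() needs at least two data points
    pct = quantiles(sizes, n=100, method="inclusive")  # pct[i-1] = i-th percentile
    return {
        "turns": len(sizes),
        "mean": mean(sizes),
        "median": median(sizes),
        "p90": pct[89],
        "p99": pct[98],
    }
```

The same shape extends to keyword hit rates and coverage gaps by grouping events on other fields before aggregating.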
Setup automation (/cognitive-setup skill, install.sh)

A 6-phase skill that:
- generates keywords.json from the analysis
- creates documentation stubs with <!-- WARM CONTEXT ENDS --> markers

Each phase presents its output for review before writing files. The whole thing is idempotent.
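The idempotence guarantee for stub creation can be sketched like this (the marker text comes from the description above; the template body, paths, and function name are illustrative):

```python
from pathlib import Path

# Illustrative stub body; only the marker comment is taken from the PR text.
STUB_TEMPLATE = """# {title}

<!-- WARM CONTEXT ENDS -->
"""


def ensure_doc_stub(path: Path, title: str) -> bool:
    """Create a documentation stub once; re-running is a no-op.

    Returns True if the stub was written, False if it already existed.
    """
    if path.exists():  # idempotence: never clobber a user-edited doc
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(STUB_TEMPLATE.format(title=title), encoding="utf-8")
    return True
```

Checking for existence before writing means re-running the setup skill leaves user edits untouched.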
install.sh handles the mechanical parts (copy scripts, merge hooks into settings.json, install skills).

Context router hardening (scripts/context-router-v2.py, +261 lines)

Bug fixes:
- import fcntl crashes on Windows
- pinned config parsed but the variable is unused
- --diagnostics runs after save_state(), mutating state
- datetime.utcnow() deprecated since Python 3.12
- save_report() name parameter unsanitized (path traversal)
- argparse --help exits and kills the hook process
- _read_last_router_output() reads the entire log file

New capabilities:
- --validate "prompt" for dry-run testing without state mutation
- --diagnostics for JSON diagnostic output
- .claude/ always preferred over global ~/.claude/

Supporting additions
- /cognitive-status skill for health checks (file presence, hook config, attention state)
- /cognitive-state skill and a standalone cognitive-state.py script for checking attention without burning context tokens
- /cognitive-metrics skill for interactive analysis

What did NOT change
- Pool coordination scripts (pool-loader.py, pool-extractor.py, pool-auto-update.py, pool-query.py)
- The .claude/settings.json schema

How to review
There are a lot of changes. I recommend that you focus on these, in order:
1. scripts/context-router-v2.py - the bug fixes (search for HAS_FCNTL, raw = sys.stdin.read(), pinned)
2. scripts/metrics/collector.py - how per-turn data is captured
3. scripts/metrics/analyzer.py - the analysis logic
4. .claude/skills/cognitive-setup/SKILL.md - the setup workflow instructions

These can be skimmed or skipped:
- .claude/skills/*/SKILL.md (other than setup) - skill definitions (natural-language instructions)
- templates/ - example configurations

Testing
Verified by running against the claude-cognitive repo itself.
Design decisions
Why JSONL for metrics storage? Consistency with the existing attention_history.jsonl and instance_state.jsonl patterns. Append-only writes avoid corruption from concurrent hooks. Daily rotation keeps files manageable.

Why a skill instead of a standalone setup script? Users are likely already in a Claude Code session when they want to set up. Skills integrate with the existing workflow. The skill can also auto-install missing scripts, solving the chicken-and-egg problem.
Why reframe the token savings metric? The original metric compared against "inject every .md file on every turn," which inflates savings to 99.9%. The practical comparison is: without cognitive you get CLAUDE.md (399 tokens). With cognitive you get CLAUDE.md plus targeted injection (424 tokens for a quiet turn, more when keywords match). This is honest and useful.
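The arithmetic behind the reframing, using the figures from this description (the naive baseline total is a made-up illustration, not a measured number):

```python
claude_md = 399      # tokens: CLAUDE.md alone, from the description above
quiet_turn = 424     # tokens: CLAUDE.md plus targeted injection, quiet turn
naive_baseline = 500_000  # HYPOTHETICAL "inject every .md file" total

# Against the inflated baseline, savings look near-total:
naive_savings = 1 - quiet_turn / naive_baseline   # comes out above 99.9%

# The honest framing is the marginal cost over plain CLAUDE.md:
overhead = quiet_turn - claude_md                 # 25 tokens per quiet turn
```

Any sufficiently large denominator produces a headline savings number, which is why the comparison point matters more than the percentage.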
Why not wire usage_tracker.py into hooks? The metrics collector already captures keyword effectiveness and file activation data per-turn, which overlaps with usage_tracker's purpose. Wiring it in would require parsing tool call results to infer file access, which is complex for unclear incremental value.