| title | Architecture |
|---|---|
| description | How ContextCrawler is put together - the lib+bin split, the Claude Code hook gate, the filter pipeline, config and data layout, the env-var shim, and the module map. |

This page describes how ContextCrawler is structured as of 0.4.0, for
contributors and for anyone embedding the library. It is synthesised from the
source; the code in src/ is the source of truth.
ContextCrawler is a downstream fork. Each piece keeps its origin, and the single binary is assembled from upstream rtk plus ported contextzip modules, with Tirith invoked subprocess-only (no AGPL link):
```mermaid
flowchart TB
    RTK["rtk-ai/rtk<br/>(Apache-2.0 / MIT)<br/>v0.39.0 core<br/>+ 60+ command filters"]
    CZIP["jee599/contextzip<br/>(MIT)<br/>session compactor<br/>error_cmd, web_cmd"]
    TIRITH["sheeki03/tirith<br/>(AGPL-3.0)<br/>shell-command<br/>security gate"]
    FORK["contextcrawler fork branch:<br/>contextzip-downstream<br/>sentinel-blocked patches"]
    PATCHES["Downstream modules:<br/>supply_chain_gate<br/>tirith_gate<br/>security_cmd<br/>error_cmd · curl/wget HTML"]
    BIN["<code>contextcrawler</code><br/>single Rust binary + library"]
    USERS["You / Claude / Cursor /<br/>Copilot / Gemini / OpenCode"]
    RTK -- "git rebase" --> FORK
    CZIP -- "ported MIT source<br/>(SPDX headers)" --> PATCHES
    FORK --> BIN
    PATCHES --> BIN
    TIRITH -. "subprocess only<br/>(no AGPL link)" .-> BIN
    BIN --> USERS
    classDef upstream fill:#1a1a2e,stroke:#888,color:#ddd
    classDef ours fill:#2a0a2e,stroke:#e83e8c,color:#fff
    class RTK,CZIP,TIRITH upstream
    class FORK,PATCHES,BIN ours
```
ContextCrawler is one crate that builds both a library and a binary from a single compile tree.
- `src/main.rs` is a five-line shim. It calls the library and exits with the returned code: `fn main() { std::process::exit(contextcrawler::run()); }`
- `src/lib.rs` is the crate root. It declares the whole module tree (`analytics`, `api`, `cmds`, `core`, `discover`, `hooks`, `learn`, `parser`, `cli`) and re-exports the curated public surface. See ContextCrawler as a library for that surface.
- `src/cli.rs` holds `pub fn run() -> i32`, the actual CLI. It defines the clap command tree, dispatches each subcommand to its filter module, and returns the process exit code.
Because the binary is a shim over cli::run, the binary exercises the exact
code path a downstream embedder would. There is no separate "binary-only" logic
to drift out of sync.
run() does three things in order before dispatching:
1. On Unix, reset `SIGPIPE` to `SIG_DFL` so a broken pipe on stdout terminates cleanly (exit 141) instead of panicking inside `println!`.
2. Call `core::path_migrate::migrate_legacy_dirs_once()` (see Path migration below).
3. Run `run_cli()`, mapping any `anyhow::Error` to a stderr line and exit code 1.
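The order of those steps can be sketched as follows. This is a minimal illustration, not the real `cli.rs`: `run_cli` here is a hypothetical stand-in that takes its arguments as a slice and returns `Result<i32, String>` instead of `anyhow::Result`, the migration body is empty, and the `SIGPIPE` reset is only marked with a comment because it needs a platform crate.

```rust
use std::sync::Once;

static MIGRATE: Once = Once::new();

/// Hypothetical stand-in for `core::path_migrate::migrate_legacy_dirs_once()`.
fn migrate_legacy_dirs_once() {
    MIGRATE.call_once(|| {
        // The real function moves legacy settings files; this sketch does nothing.
    });
}

/// Hypothetical stand-in for `run_cli()`; the real one returns an `anyhow` result.
fn run_cli(args: &[&str]) -> Result<i32, String> {
    match args.first() {
        Some(&"ok") => Ok(0),
        Some(other) => Err(format!("unknown command: {other}")),
        None => Err("no command given".to_string()),
    }
}

/// Sketch of the three-step entry point described above.
pub fn run(args: &[&str]) -> i32 {
    // Step 1 (Unix only in the real code): reset SIGPIPE to SIG_DFL. Omitted here.
    // Step 2: migrate legacy directories before anything reads config or filters.
    migrate_legacy_dirs_once();
    // Step 3: run the CLI and map any error to a stderr line plus exit code 1.
    match run_cli(args) {
        Ok(code) => code,
        Err(err) => {
            eprintln!("contextcrawler: {err}");
            1
        }
    }
}
```

Because the error is flattened to an exit code at this single point, a binary shim only has to pass the returned `i32` to `std::process::exit`.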
ContextCrawler integrates with AI coding agents through a PreToolUse-style hook. There are two code paths, and only one is live in the Claude Code wiring:
| Path | Entry point | File | Status |
|---|---|---|---|
| Live | `contextcrawler hook claude` | `src/hooks/hook_cmd.rs` | This is what runs for Claude Code. |
| Legacy | `contextcrawler rewrite <cmd>` | `src/hooks/rewrite_cmd.rs` | Older path, still compiled. |
Both share the gate logic in src/hooks/tirith_gate.rs and
src/hooks/supply_chain_gate.rs. The decisive difference is gate ordering:
the live path gates the raw command before deciding on a rewrite, so every
command class is gated. The legacy path gates inside the rewrite branch and so
has a known gap for non-rewritable commands. New work should follow the live
path's ordering.
The live flow lives in hook_cmd.rs. The agent sends a PreToolUse JSON payload
on stdin; run_claude reads it (size-limited, failing closed on a malformed or
oversized payload) and hands the parsed value to
process_claude_payload_with_gate.
The critical ordering property: the defence-in-depth gates run on the RAW command before any rewrite is computed. A rewrite never has the chance to mask a flagged command from the gate.
```mermaid
flowchart TD
    A[PreToolUse JSON on stdin] --> B{Parse payload}
    B -- malformed / oversized --> DENY1[Deny - fail closed]
    B -- ok --> C{Extract /tool_input/command}
    C -- absent --> IGN1[Ignore - non-Bash tool]
    C -- present but not a string --> DENY2[Deny - malformed payload]
    C -- empty string --> IGN2[Ignore - nothing to do]
    C -- non-empty string --> D[permission check_command]
    D -- Deny verdict --> DENY3[Deny - deny rule]
    D -- Allow / Ask --> E[run_gates on RAW cmd]
    E --> F[Tirith gate + supply-chain gate]
    F -- Deny --> DENY4[Deny - supply-chain block]
    F -- Ask --> G[set gate_ask = true]
    F -- Proceed --> H[get_rewritten cmd]
    G --> H
    H -- no rewrite --> I{gate_ask or verdict == Ask}
    I -- yes --> ASK[Ask - user prompted #2286 / #111]
    I -- no --> SKIP[Skip - passthrough unchanged]
    H -- has rewrite --> J[build updatedInput with rewritten cmd]
    J --> K{verdict == Allow AND not gate_ask}
    K -- yes --> ALLOW[Rewrite + permissionDecision allow]
    K -- no --> ASKR[Rewrite, no auto-allow - host prompts]
```
Things worth calling out:
- Fail closed on shape errors. A payload that cannot be parsed at all (malformed or oversized), or whose `command` is present but not a string, returns `Deny`. An absent or empty `command` is ignored, since there is nothing to run. The hook never lets a command it could not reason about run unchecked.
- A gate `Ask` always wins over a permissions `Allow` (#100 / #197). When a gate flags a command, the auto-allow is suppressed: even with an explicit `Allow` rule, ContextCrawler omits `permissionDecision` so Claude Code prompts the user. A copy-paste trust hint from the Tirith verdict is surfaced as the prompt reason where available.
- A permission `Ask` is never silently dropped (#2286). If the command has no rewrite but `check_command` returned `Ask` (a construct that cannot be auto-evaluated, such as command substitution or a file-write redirect, or an explicit ask-rule), the hook escalates to a real Claude Code `ask` rather than skipping. Without this, a host `Bash(git:*)` allow rule could auto-approve something like `git status $(whoami)`.
- Supply-chain hard block is a `Deny`. A supply-chain `Deny` verdict fails closed; an `Unavailable` or `Ask` from the (opt-in) supply-chain gate also fails closed to a prompt rather than waving an install through.
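The decision logic in the flowchart, from the permission check onward, can be sketched as a pure function. This is an illustration under assumptions, not the code in `hook_cmd.rs`: the enum and function names are hypothetical, payload parsing is left out, and the gates and the rewrite lookup are passed in as closures so the ordering is visible.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Permission {
    Allow,
    Ask,
    Deny,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Gate {
    Proceed,
    Ask,
    Deny,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Decision {
    Deny,
    Ask,
    Skip,
    /// Rewritten command, auto-allowed.
    RewriteAllow(String),
    /// Rewritten command, no auto-allow: the host prompts the user.
    RewritePrompt(String),
}

/// Hypothetical sketch: gates always see the RAW command, before any rewrite.
pub fn decide(
    raw: &str,
    permission: Permission,
    run_gates: impl Fn(&str) -> Gate,
    rewrite: impl Fn(&str) -> Option<String>,
) -> Decision {
    if permission == Permission::Deny {
        return Decision::Deny;
    }
    // Gate the raw command first, so a rewrite can never mask a flagged command.
    let gate_ask = match run_gates(raw) {
        Gate::Deny => return Decision::Deny,
        Gate::Ask => true,
        Gate::Proceed => false,
    };
    match rewrite(raw) {
        // No rewrite: an Ask from either source still reaches the user.
        None if gate_ask || permission == Permission::Ask => Decision::Ask,
        None => Decision::Skip,
        // Rewrite: auto-allow only when nothing asked for a prompt.
        Some(cmd) if permission == Permission::Allow && !gate_ask => Decision::RewriteAllow(cmd),
        Some(cmd) => Decision::RewritePrompt(cmd),
    }
}
```

Taking the gate and rewrite steps as closures keeps the sketch testable without a real Tirith binary, and makes it obvious that `rewrite` is only consulted after `run_gates` has returned.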
Gate wiring (run_gates) is shared between this hook path and the
contextcrawler proxy CLI path, so proxy no longer bypasses the gates. Both
gates are opt-in and default off: with Tirith unavailable and the supply-chain
gate disabled, the path is byte-for-byte the same as no gate at all.
For a command that is run through ContextCrawler (rather than gated at the hook layer), the pipeline is:
command -> routing / rewrite lookup -> filter module -> tracking
- Routing. The clap command tree in `cli.rs` routes a recognised command (`git`, `cargo`, `pytest`, `grep`, ...) to its typed subcommand. On the hook path, `rewrite_command` in `src/discover/registry.rs` maps a raw command string to its `contextcrawler` equivalent and returns `None` when there is no equivalent.
- Filter module. Each ecosystem has a module under `src/cmds/<ecosystem>/`. The module runs the real underlying tool, captures its output, and applies a filter to produce the compact form. The shared execution skeleton lives in `src/core/runner.rs`; the `no_bloat` guard there ensures a filter never emits more tokens than the raw baseline would have cost.
- Tracking. Input and output token counts are recorded so `contextcrawler gain` can report savings (see Config and data layout).
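The filter and tracking steps can be illustrated with a small sketch. It assumes the guard falls back to the raw output whenever the filtered form would be larger, and it uses a whitespace word count as a stand-in token estimate; both are assumptions for illustration, and `apply_with_no_bloat` and `Tracked` are hypothetical names, not the API in `src/core/runner.rs`.

```rust
/// Hypothetical token estimate: whitespace-separated words.
fn estimate_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// What a tracking layer might record for one command (illustrative fields).
#[derive(Debug, PartialEq, Eq)]
pub struct Tracked {
    pub output: String,
    pub input_tokens: usize,
    pub output_tokens: usize,
}

/// Apply `filter` to `raw`, but never emit more tokens than `raw` itself costs.
pub fn apply_with_no_bloat(raw: &str, filter: impl Fn(&str) -> String) -> Tracked {
    let filtered = filter(raw);
    let input_tokens = estimate_tokens(raw);
    // Assumed guard behaviour: keep the raw output if filtering made things bigger.
    let output = if estimate_tokens(&filtered) <= input_tokens {
        filtered
    } else {
        raw.to_string()
    };
    let output_tokens = estimate_tokens(&output);
    Tracked { output, input_tokens, output_tokens }
}
```

With this shape, the difference between `input_tokens` and `output_tokens` is never negative, which is the property a savings report depends on.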
When clap does not match a typed subcommand, run_fallback takes over. It:
- refuses to fall back for ContextCrawler's own meta-commands (so a typo in `gain --badflag` shows clap's error rather than trying to run `gain` from `PATH`);
- applies cloud-CLI hardening (per-tool argument deny-lists and environment stripping) for `kubectl`, `docker`, `aws`, `psql`, `curl`, `wget`, `gh`, `glab`, `gt`;
- otherwise looks up a project or global TOML filter (`find_matching_filter`, bypassable with `CTXCRL_NO_TOML=1`) and applies it, or finally passes the command through unfiltered.
Commands that take the raw-passthrough branch are recorded as parse failures
(zero savings) so contextcrawler gain --failures can surface them.
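The branch order of the fallback can be sketched as a small classifier. This is a simplified reading of the list above, not the real `run_fallback`: the `Fallback` enum and `choose_fallback` are hypothetical, the meta-command list contains only the one example the page names, and the hardening step is treated as its own branch.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Fallback {
    /// Own meta-command: surface clap's error instead of falling back.
    ClapError,
    /// Cloud CLI: run with argument deny-lists and environment stripping.
    Hardened,
    /// A project or global TOML filter matched.
    TomlFilter,
    /// Nothing matched: run unfiltered (recorded as a parse failure).
    Passthrough,
}

/// Illustrative only: the page names `gain`; the full list lives in the source.
const META_COMMANDS: &[&str] = &["gain"];

/// Tools the page lists for cloud-CLI hardening.
const HARDENED_TOOLS: &[&str] = &[
    "kubectl", "docker", "aws", "psql", "curl", "wget", "gh", "glab", "gt",
];

/// `toml_disabled` models `CTXCRL_NO_TOML=1`.
pub fn choose_fallback(program: &str, has_toml_filter: bool, toml_disabled: bool) -> Fallback {
    if META_COMMANDS.contains(&program) {
        return Fallback::ClapError;
    }
    if HARDENED_TOOLS.contains(&program) {
        return Fallback::Hardened;
    }
    if has_toml_filter && !toml_disabled {
        return Fallback::TomlFilter;
    }
    Fallback::Passthrough
}
```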
Path basenames are defined once in src/core/constants.rs
(RTK_DATA_DIR = "ctxcrl"):
| What | Path |
|---|---|
| User config | ~/.config/ctxcrl/config.toml |
| User filters | ~/.config/ctxcrl/filters.toml |
| Trusted filters | ~/.config/ctxcrl/trusted_filters.json |
| Project-local filters | ./.ctxcrl/filters.toml, ./.ctxcrl/filters/*.toml |
| Tracking database | ~/.local/share/ctxcrl/history.db (SQLite) |
The tracking database path can be overridden with the CTXCRL_DB_PATH
environment variable; this also acts as the explicit opt-in that lets a caller
redirect tracking writes (for example in tests, which otherwise redirect to a
per-process temp file so they never touch the production database).
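As a rough sketch of that resolution rule: the override wins when present, otherwise the default sits under the user data directory. The function name and parameters are hypothetical; the caller is assumed to have already read `CTXCRL_DB_PATH` and resolved the data directory (for example `~/.local/share`).

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: pick the tracking database path.
/// `override_path` models the value of `CTXCRL_DB_PATH`, if set.
pub fn tracking_db_path(override_path: Option<&str>, data_home: &Path) -> PathBuf {
    match override_path {
        // An explicit override redirects all tracking writes.
        Some(path) => PathBuf::from(path),
        // Default layout: <data_home>/ctxcrl/history.db
        None => data_home.join("ctxcrl").join("history.db"),
    }
}
```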
The canonical env-var prefix is CTXCRL_. The legacy RTK_ prefix is still
honoured as a deprecated fallback. All runtime env reads go through
src/core/env_compat.rs, which is the only place the old prefix survives:
- `env_var("CTXCRL_FOO")` returns `CTXCRL_FOO` if set, otherwise falls back to `RTK_FOO`. The canonical name always wins when both are set.
- `env_flag(...)` is true when the canonical or legacy var equals exactly `"1"`.
- `env_present(...)` is true when either name is set to any value.
So CTXCRL_NO_TOML, CTXCRL_DB_PATH, and friends are the names to use; the
matching RTK_* names keep older setups working but are slated for removal.
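The three lookups can be sketched as follows. This is an illustration of the stated rules, not the contents of `src/core/env_compat.rs`: the real signatures are not shown on this page, so the sketch takes a `lookup` closure in place of the process environment (which also keeps it testable), and `legacy_name` is a hypothetical helper.

```rust
/// Map a canonical `CTXCRL_*` name to its legacy `RTK_*` name (hypothetical helper).
fn legacy_name(canonical: &str) -> Option<String> {
    canonical
        .strip_prefix("CTXCRL_")
        .map(|rest| format!("RTK_{rest}"))
}

/// Canonical value if set, otherwise the legacy value. Canonical always wins.
pub fn env_var<F: Fn(&str) -> Option<String>>(name: &str, lookup: &F) -> Option<String> {
    lookup(name).or_else(|| legacy_name(name).and_then(|legacy| lookup(&legacy)))
}

/// True when the canonical or the legacy variable equals exactly "1".
pub fn env_flag<F: Fn(&str) -> Option<String>>(name: &str, lookup: &F) -> bool {
    let is_one = |key: &str| lookup(key).as_deref() == Some("1");
    is_one(name) || legacy_name(name).map_or(false, |legacy| is_one(&legacy))
}

/// True when either name is set to any value, including the empty string.
pub fn env_present<F: Fn(&str) -> Option<String>>(name: &str, lookup: &F) -> bool {
    lookup(name).is_some() || legacy_name(name).map_or(false, |legacy| lookup(&legacy).is_some())
}
```

In real use the closure would be something like `|key| std::env::var(key).ok()`; routing every read through one module is what lets the legacy prefix be removed in a single place later.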
On every run, core::path_migrate::migrate_legacy_dirs_once() moves on-disk
settings from the legacy rtk directory names to the canonical ctxcrl names.
It is cheap, idempotent (guarded by a std::sync::Once), and runs before
anything reads config or filters.
What it does:
- Migrates the schema-stable settings files (`config.toml`, `filters.toml`, `trusted_filters.json`) from the legacy `rtk` segment in both the user config dir and the user data dir. Each file is checked independently: a pre-existing destination directory (for example one already holding a fresh `history.db` or `.device_salt`) never blocks moving a file whose destination is still absent. An existing destination file is never overwritten.
- Migrates the project-local `./.rtk` directory to `./.ctxcrl` - a whole-dir move when the destination is absent, otherwise a per-file migration of the known filter payload.
- Refuses to migrate a symlinked legacy `.rtk` directory, so a planted symlink cannot redirect the move.
What it deliberately does not do: migrate history.db. The tracking
database is reset rather than migrated. A fresh database with the current schema
is created at the new ctxcrl path on first run, and any database at the old path
is left orphaned and untouched. There is no DB back-compat and no ALTER.
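The per-file rules (move only when the destination is absent, never overwrite, refuse a symlinked legacy directory) can be sketched with the standard library alone. This is a hedged illustration, not `path_migrate.rs`: `migrate_file` and `Outcome` are hypothetical names, the symlink check is applied to whatever legacy directory is passed in, and `fs::rename` is assumed to be sufficient (it can fail across filesystems).

```rust
use std::fs;
use std::io;
use std::path::Path;

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Moved,
    NothingToMove,
    DestinationExists,
    RefusedSymlink,
}

/// Move one settings file from `legacy_dir` to `new_dir` if it is safe to do so.
pub fn migrate_file(legacy_dir: &Path, new_dir: &Path, file_name: &str) -> io::Result<Outcome> {
    // `symlink_metadata` does not follow symlinks, so a planted link is detected.
    match fs::symlink_metadata(legacy_dir) {
        Ok(meta) if meta.file_type().is_symlink() => return Ok(Outcome::RefusedSymlink),
        Ok(_) => {}
        Err(err) if err.kind() == io::ErrorKind::NotFound => return Ok(Outcome::NothingToMove),
        Err(err) => return Err(err),
    }
    let src = legacy_dir.join(file_name);
    let dst = new_dir.join(file_name);
    if !src.is_file() {
        return Ok(Outcome::NothingToMove);
    }
    // Never overwrite: any existing destination entry wins, even a dangling link.
    if fs::symlink_metadata(&dst).is_ok() {
        return Ok(Outcome::DestinationExists);
    }
    // A pre-existing destination directory is fine; only the file itself matters.
    fs::create_dir_all(new_dir)?;
    fs::rename(&src, &dst)?;
    Ok(Outcome::Moved)
}
```

Calling it a second time returns `NothingToMove`, which is the idempotence the migration relies on even without the `Once` guard.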
For contributors, the top-level layout under src/:
| Module | Responsibility |
|---|---|
| `main.rs` | Binary shim - calls `contextcrawler::run()`. |
| `lib.rs` | Crate root - module tree + curated public re-exports. |
| `cli.rs` | clap command tree, `pub fn run`, routing, clap-fallback path. |
| `api.rs` | Curated filtering API (`filter_output`, `auto_filter_output`, `available_filters`). |
| `core/` | Shared infrastructure: `config`, `tracking`, `tee`, `utils`, `filter`, `toml_filter`, `output_summary`, `runner`, `env_compat`, `path_migrate`, `constants`, `telemetry`. |
| `hooks/` | Hook system: `hook_cmd` (live Claude path), `rewrite_cmd` (legacy), `tirith_gate`, `supply_chain_gate`, `permissions`, `init`, `trust`, `integrity`, `verify_cmd`. |
| `analytics/` | Token-savings reporting: `gain`, `cc_economics`, `ccusage`, `session_cmd`. |
| `cmds/` | Per-ecosystem filter modules: `git/`, `rust/`, `js/`, `python/`, `go/`, `jvm/`, `dotnet/`, `cloud/`, `system/`, `ruby/`. |
| `discover/` | Claude Code history analysis (missed-savings discovery), plus `registry` - the command-to-rewrite map (`rewrite_command`) the hook path uses. |
| `learn/` | CLI-correction detection from error history. |
| `parser/` | Parser infrastructure shared by filters. |
| `filters/` | TOML filter configs (the DSL filter definitions). |
- ContextCrawler as a library - the embeddable public API.