|
1 | 1 | ## Architecture |
2 | 2 |
|
3 | | -`ctxeng` builds LLM-ready context from your repository by running a simple pipeline: |
| 3 | +`ctxeng` is a local-first “context compiler”: it **discovers** code, **scores** it against your task, optionally **expands/retrieves** more relevant neighbors, then **budgets + renders** a paste-ready context bundle. |
4 | 4 |
|
5 | 5 | ```mermaid |
6 | 6 | flowchart TD |
7 | | - discover[Discover_files] --> score[Score_and_rank] |
8 | | - score --> expand[Expand_import_graph_optional] |
9 | | - expand --> retrieve[RAG_chunk_retrieval_optional] |
10 | | - retrieve --> redact[Redact_secrets_and_PII] |
11 | | - redact --> budget[Fit_token_budget] |
12 | | - budget --> render[Render_XML_or_Markdown] |
13 | | - render --> output[Context_output] |
| 7 | + subgraph Inputs |
| 8 | + Q[Task / query] |
| 9 | + R[(Repository root)] |
| 10 | + O[Options / flags] |
| 11 | + end |
| 12 | +
|
| 13 | + subgraph Pipeline |
| 14 | + D[Discover files] |
| 15 | + S[Score + rank] |
| 16 | + IG["Import-graph expansion (optional)"] |
| 17 | + RAG["RAG chunk retrieval (optional)"] |
| 18 | + SK["AST skeleton (optional)"] |
| 19 | + RX["Redact secrets + PII (optional, default on)"] |
| 20 | + B[Fit token budget] |
| 21 | + RND["Render output (XML / Markdown / plain)"] |
| 22 | + end |
| 23 | +
|
| 24 | + Q --> S |
| 25 | + R --> D --> S --> IG --> RAG --> SK --> RX --> B --> RND |
| 26 | + O --> D |
| 27 | + O --> S |
| 28 | + O --> IG |
| 29 | + O --> RAG |
| 30 | + O --> SK |
| 31 | + O --> RX |
| 32 | + O --> B |
| 33 | + RND --> OUT[Context output + metadata] |
14 | 34 | ``` |
15 | 35 |
|
16 | | -### Inputs |
17 | | -- **Repository root** (`--root`) |
18 | | -- **Task/query** (free text) |
19 | | -- **Options**: import graph, semantic scoring, RAG, skeleton mode, redaction, tracing |
| 36 | +## Key design goals |
| 37 | + |
| 38 | +- **Deterministic & explainable**: scoring signals are explicit; optional tracing records “what was included and why”. |
| 39 | +- **Portable**: output is not tied to any editor or model vendor (works with chat UIs, APIs, CI). |
| 40 | +- **Budget-safe**: context is constructed to fit a specific model window (or explicit `--budget`). |
| 41 | +- **Safety by default**: redaction runs before token counting, tracing, and rendering. |
| 42 | + |
| 43 | +## Components (where the logic lives) |
| 44 | + |
| 45 | +- **Discovery / sources**: `ctxeng/sources/__init__.py` |
| 46 | + - `collect_filesystem()`: walks the repo and yields `(path, content)` |
| 47 | + - `collect_git_changed()`: yields changed files (`--git-diff`) |
| 48 | + - `collect_explicit()`: yields exact paths (`--files` / `include_files()`) |
| 49 | +- **Ignore rules**: `ctxeng/ignore.py` |
| 50 | + - merges `.gitignore` + `.ctxengignore` into a single matcher (gitwildmatch via `pathspec`) |
| 51 | +- **Scoring / ranking**: `ctxeng/scorer.py` |
| 52 | + - combines signals into a score in \([0, 1]\) |
| 53 | + - supports optional semantic scoring + multi-language AST symbol overlap |
| 54 | + - supports configurable scoring weights (`.ctxeng/config.json` / `--scoring-config`) |
| 55 | +- **Import graph expansion (Python)**: `ctxeng/import_graph.py` |
| 56 | + - builds a static import graph among discovered `.py` files |
| 57 | + - expands context with imported neighbors (with score decay) before budgeting |
| 58 | +- **Chunking + retrieval (RAG)**: `ctxeng/chunking.py`, `ctxeng/retrieval.py` |
| 59 | + - splits files into overlapping chunks |
| 60 | + - lexical retrieval is always available; embeddings retrieval requires `ctxeng[semantic]` |
| 61 | +- **Skeleton mode (Python)**: `ctxeng/ast_skeleton.py` |
| 62 | + - replaces Python bodies with an AST-derived outline for “overview” requests |
| 63 | +- **Redaction**: `ctxeng/redaction.py` |
| 64 | + - masks secrets + PII with stable hashes so traces and output don’t leak sensitive values |
| 65 | +- **Budgeting + truncation**: `ctxeng/optimizer.py` |
| 66 | + - estimates token counts (uses `tiktoken` if installed, otherwise heuristic) |
| 67 | + - greedily fills budget by score, smart-truncates large files |
| 68 | +- **Tracing (optional)**: `ctxeng/tracing.py` |
| 69 | + - writes JSONL events under `.ctxeng/traces/` (payloads are redacted before they are written) |
| 70 | +- **Snapshots (optional)**: `ctxeng/snapshots.py` |
| 71 | + - writes `context.txt` + `manifest.json` under `.ctxeng/snapshots/<id>/` |
| 72 | +- **Orchestration**: `ctxeng/core.py` |
| 73 | + - `ContextEngine.build()` coordinates the whole pipeline |
| 74 | +- **Fluent API**: `ctxeng/builder.py` |
| 75 | + - `ContextBuilder` provides a chainable configuration layer |
| 76 | + |
| 77 | +## Pipeline, step-by-step |
| 78 | + |
| 79 | +### 1) Discover files |
| 80 | + |
| 81 | +Discovery chooses the candidate set: |
| 82 | + |
| 83 | +- **Filesystem** (default): walk from repo root, apply ignore rules, skip binary-ish files, apply size guard. |
| 84 | +- **Git diff** (`--git-diff`): only changed/untracked files (great for PR reviews). |
| 85 | +- **Explicit files** (`--files` / `include_files()`): skip the repository walk and use exactly the listed paths; useful for targeted tasks. |
| 86 | + |
| 87 | +### 2) Score + rank |
| 88 | + |
| 89 | +Each file is scored using a weighted mix of signals, then sorted descending. Signals include: |
| 90 | + |
| 91 | +- **Keyword overlap**: query token overlap with content |
| 92 | +- **Path relevance**: filename + directory names matching query tokens |
| 93 | +- **AST symbol overlap**: |
| 94 | + - Python (built-in) |
| 95 | + - JS/TS/Go (optional, tree-sitter) |
| 96 | +- **Git recency**: recently changed files get a boost (optional) |
| 97 | +- **Semantic similarity**: optional local embeddings (`sentence-transformers`) |
| 98 | + |
| 99 | +Weights can be customized with `--scoring-config` (or `.ctxeng/config.json`). |
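
To make the "weighted mix" concrete, here is a minimal sketch of how per-file signals could be combined and ranked. The signal keys, the weight values, and the `combine` / `rank` helpers are hypothetical illustrations, not the actual contents of `ctxeng/scorer.py`; the sketch assumes each signal is already normalized to \([0, 1]\) and that missing optional signals are simply left out of the average.

```python
# Hypothetical signal names and weights, for illustration only.
DEFAULT_WEIGHTS = {
    "keyword": 0.4,
    "path": 0.2,
    "ast": 0.2,
    "recency": 0.1,
    "semantic": 0.1,
}


def combine(signals, weights=DEFAULT_WEIGHTS):
    """Weighted mean over the signals that are present; result stays in [0, 1]."""
    active = {name: w for name, w in weights.items() if name in signals and w > 0}
    total = sum(active.values())
    if total == 0:
        return 0.0
    # Clamp each signal so one out-of-range value cannot push the score past 1.
    weighted = sum(min(max(signals[name], 0.0), 1.0) * w for name, w in active.items())
    return weighted / total


def rank(files):
    """Sort (path, score) pairs by descending score, breaking ties by path."""
    scored = [(path, combine(signals)) for path, signals in files.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

Renormalizing over the signals that are present is one possible design choice: it keeps scores comparable whether or not optional signals such as semantic similarity are enabled.
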
| 100 | + |
| 101 | +### 3) Import graph expansion (optional, Python) |
| 102 | + |
| 103 | +If enabled, ctxeng pulls in Python modules that ranked files import, restricted to the discovered set, and assigns each one a decayed score. This helps when a high-ranked file calls a function that is defined elsewhere. |
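
The idea of expansion with score decay can be sketched with the standard-library `ast` module. This is an assumption-laden illustration rather than the code in `ctxeng/import_graph.py`: the `local_imports` and `expand` helpers, the one-hop limit, the `decay=0.5` default, and keying files by dotted module name are all hypothetical.

```python
import ast


def local_imports(source, known_modules):
    """Return the absolute imports in `source` that name a discovered module."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        found.update(name for name in names if name in known_modules)
    return found


def expand(scores, sources, decay=0.5):
    """Add one hop of imported neighbors, each scored at importer_score * decay."""
    expanded = dict(scores)
    known = set(sources)
    for module, score in scores.items():
        for dep in local_imports(sources[module], known):
            # Keep the higher score if the neighbor was already ranked.
            expanded[dep] = max(expanded.get(dep, 0.0), score * decay)
    return expanded
```

Third-party and standard-library imports are ignored here because they are not in the discovered set, which mirrors the "from the discovered set" restriction described above.
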
| 104 | + |
| 105 | +### 4) RAG chunk retrieval (optional) |
| 106 | + |
| 107 | +For large repos, `--rag` switches from whole-file inclusion to chunk-level selection: |
| 108 | + |
| 109 | +- Candidate set: top-ranked files are chunked (keeps runtime bounded). |
| 110 | +- Retrieval: |
| 111 | + - embeddings (if installed) or lexical fallback (always available) |
| 112 | +- Output: |
| 113 | + - retrieved chunks become the “ranked list” fed into budgeting |
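
The chunk-then-retrieve flow above can be sketched as follows. This only illustrates the lexical fallback under stated assumptions: line-based chunks, the `size` / `overlap` defaults, and the token-overlap scoring are hypothetical choices, not the behavior of `ctxeng/chunking.py` or `ctxeng/retrieval.py`.

```python
import re


def chunk_lines(text, size=40, overlap=10):
    """Split text into chunks of `size` lines; consecutive chunks share `overlap` lines."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    lines = text.splitlines()
    step = size - overlap
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append("\n".join(lines[start:start + size]))
        if start + size >= len(lines):
            break  # the last chunk already reaches the end of the file
    return chunks


def tokens(text):
    """Lowercased identifier-like tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(query, chunks, top_k=3):
    """Return up to `top_k` chunks ranked by the fraction of query tokens they contain."""
    q = tokens(query)
    if not q:
        return []
    scored = [(len(q & tokens(chunk)) / len(q), i) for i, chunk in enumerate(chunks)]
    best = sorted(scored, key=lambda s: (-s[0], s[1]))[:top_k]
    return [chunks[i] for score, i in best if score > 0]
```

An embeddings-based retriever would replace only the scoring inside `retrieve`; the chunking step and the "ranked list" it feeds into budgeting would stay the same shape.
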
| 114 | + |
| 115 | +### 5) Skeleton mode (optional, Python) |
| 116 | + |
| 117 | +`--skeleton` replaces Python file bodies with an AST-derived outline (imports, defs, methods). This is best for: |
20 | 118 |
|
21 | | -### Scoring signals |
22 | | -Each file gets a relevance score in \([0,1]\) from a weighted mix of: |
23 | | -- **Keyword overlap** with the query |
24 | | -- **Path relevance** (filename + directories) |
25 | | -- **AST symbol overlap** (Python built-in; JS/TS/Go via optional tree-sitter) |
26 | | -- **Git recency** (recently changed files score higher) |
27 | | -- **Semantic similarity** (optional, sentence-transformers) |
| 119 | +- “high-level architecture overview” |
| 120 | +- “what are the key modules/classes?” |
| 121 | +- tight budgets where full bodies are too expensive |
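
As a rough picture of what an AST-derived outline looks like, the sketch below keeps imports, class headers, and function signatures and drops everything else. The `skeleton` function is a hypothetical stand-in, not the implementation in `ctxeng/ast_skeleton.py`; it relies on `ast.unparse`, so it assumes Python 3.9 or newer.

```python
import ast


def skeleton(source):
    """Return an outline of `source`: imports, classes, and def signatures only."""
    lines = []

    def visit(body, indent):
        pad = "    " * indent
        for node in body:
            if isinstance(node, (ast.Import, ast.ImportFrom)):
                lines.append(pad + ast.unparse(node))
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                keyword = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
                # Function bodies are replaced by "..." to save tokens.
                lines.append(f"{pad}{keyword} {node.name}({ast.unparse(node.args)}): ...")
            elif isinstance(node, ast.ClassDef):
                lines.append(f"{pad}class {node.name}:")
                visit(node.body, indent + 1)

    visit(ast.parse(source).body, 0)
    return "\n".join(lines)
```

The output is meant for reading, not for execution: a class with no methods is emitted as a bare header.
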
28 | 122 |
|
29 | | -### Budgeting |
30 | | -After ranking, files/chunks are added in descending score order until the token budget is reached.\nWhen a file is too large, ctxeng applies smart truncation (head + tail) before skipping. |
| 123 | +### 6) Redaction (optional, default on) |
31 | 124 |
|
32 | | -### Safety |
33 | | -Before any output or trace is written, ctxeng can **redact secrets/PII**. This happens **before token counting, tracing, and rendering**. |
| 125 | +Redaction runs **before** token counting, tracing, and rendering. |
| 126 | + |
| 127 | +This ordering is intentional: redacting first keeps sensitive values out of everything produced downstream, including: |
| 128 | + |
| 129 | +- output text |
| 130 | +- trace logs |
| 131 | +- token counting artifacts |
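
Masking with stable hashes means the same secret always maps to the same placeholder, so a reader can still tell that two occurrences refer to one value. The sketch below shows the idea under stated assumptions: the two regex patterns, the `[REDACTED:...]` placeholder format, and the 8-character SHA-256 prefix are hypothetical and are not taken from `ctxeng/redaction.py`.

```python
import hashlib
import re

# Hypothetical patterns, for illustration only; a real redactor would cover many more.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "AWS_KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def redact(text):
    """Replace each match with a placeholder derived from a hash of the matched value."""

    def make_replacer(kind):
        def replace(match):
            digest = hashlib.sha256(match.group(0).encode("utf-8")).hexdigest()[:8]
            return f"[REDACTED:{kind}:{digest}]"

        return replace

    for kind, pattern in PATTERNS.items():
        text = pattern.sub(make_replacer(kind), text)
    return text
```

Because redaction happens first, later stages such as token counting and tracing only ever see the placeholder text.
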
| 132 | + |
| 133 | +### 7) Fit token budget |
| 134 | + |
| 135 | +Budgeting includes: |
| 136 | + |
| 137 | +- counting query/system tokens |
| 138 | +- greedily including top-ranked items |
| 139 | +- smart truncation (head + tail) when a file is too large |
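
The three budgeting steps can be sketched as a greedy loop. Everything here is an illustrative assumption rather than the code in `ctxeng/optimizer.py`: the four-characters-per-token heuristic, the truncation marker, and the `fit` / `head_tail` helpers are hypothetical, and a real run may count tokens with `tiktoken` instead.

```python
MARKER = "\n... [truncated] ...\n"


def estimate_tokens(text):
    """Heuristic: roughly four characters per token, rounded up."""
    return (len(text) + 3) // 4


def head_tail(text, max_tokens):
    """Keep the head and tail of `text` within `max_tokens`; None if there is no room."""
    room = max_tokens * 4 - len(MARKER)
    if room < 2:
        return None
    half = room // 2
    return text[:half] + MARKER + text[-half:]


def fit(ranked, budget, reserved=0):
    """Greedily take (path, score, text) items by descending score within the budget.

    `reserved` stands for tokens already spent on the query and system prompt.
    """
    remaining = budget - reserved
    chosen = []
    for path, _score, text in sorted(ranked, key=lambda item: -item[1]):
        cost = estimate_tokens(text)
        if cost > remaining:
            # Try head + tail truncation before skipping the item.
            text = head_tail(text, remaining)
            if text is None:
                continue
            cost = estimate_tokens(text)
        chosen.append((path, text))
        remaining -= cost
    return chosen
```

In this sketch a large file that is truncated can consume all the remaining budget, which is one reason a real implementation might cap how much any single file may take.
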
| 140 | + |
| 141 | +### 8) Render output |
| 142 | + |
| 143 | +Context is rendered as: |
| 144 | + |
| 145 | +- `xml` (default) |
| 146 | +- `markdown` |
| 147 | +- `plain` |
| 148 | + |
| 149 | +Metadata may include trace/snapshot ids and other build details. |
| 150 | + |
| 151 | +## Practical recipes |
| 152 | + |
| 153 | +### PR review (fast + focused) |
| 154 | + |
| 155 | +```bash |
| 156 | +ctxeng build "Review this PR. Focus on security and correctness." --git-diff --fmt markdown --output ctx.md |
| 157 | +``` |
| 158 | + |
| 159 | +### Large repo explanation (RAG + tracing) |
| 160 | + |
| 161 | +```bash |
| 162 | +ctxeng build "Explain the authentication flow end-to-end" --rag --trace --fmt markdown --output ctx.md |
| 163 | +``` |
| 164 | + |
| 165 | +### High-level overview (skeleton) |
| 166 | + |
| 167 | +```bash |
| 168 | +ctxeng build "Give me a high-level architecture overview" --skeleton --fmt markdown --output ctx.md |
| 169 | +``` |
34 | 170 |
|