Commit ed94aba

docs: improve differentiation and onboarding
Add why-ctxeng section, simple visual, 30-second start, before/after impact, and expand architecture/performance/FAQ docs. Made-with: Cursor
1 parent 2469e11 commit ed94aba

4 files changed

Lines changed: 432 additions & 38 deletions

README.md

Lines changed: 64 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -30,6 +30,35 @@ The quality of your LLM's output depends almost entirely on *what you put in the
3030
- **Fits the budget** — smart truncation keeps the best parts within any model's token limit
3131
- **Ships ready to paste** — XML, Markdown, or plain text output that works with Claude, GPT-4o, Gemini, and every other model
3232

33+
---
34+
35+
## Why ctxeng vs Cursor / Copilot?
36+
37+
Cursor and Copilot are great for **in-editor assistance**. `ctxeng` is different: it’s a **local-first context builder** that creates a *portable*, *reproducible* context bundle you can use with **any** LLM (chat UI, API, CI).
38+
39+
What `ctxeng` provides that in-editor assistants usually don’t handle well:
40+
41+
- **Deterministic context selection**: explicit scoring + ranking signals (keyword, AST, git recency, import graph, optional semantic).
42+
- **Token-budget guarantee**: builds a context that *fits* your model window via budgeting + smart truncation.
43+
- **Safety by default**: secrets/PII redaction happens before token counting, tracing, or output.
44+
- **RAG for big repos**: chunk-level retrieval (embeddings optional, lexical fallback).
45+
- **Operational workflows**: tracing + snapshots + `ctxeng ci` for pipeline runs and reproducible artifacts.
46+
47+
If you want the shortest summary: **Cursor/Copilot help you ask; ctxeng helps you package the right evidence reliably.**
48+
49+
---
50+
51+
## Visual: what ctxeng does
52+
53+
```mermaid
54+
flowchart LR
55+
U[You / task] --> C[ctxeng]
56+
R[(Your repo)] --> C
57+
C --> O["LLM-ready context<br/>(XML/Markdown/plain)"]
58+
O --> A["LLM (ChatGPT/Claude/Gemini/API)"]
59+
A --> B["Better answers<br/>less hallucination"]
60+
```
61+
3362
Docs:
3463
- [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md)
3564
- [`docs/PERFORMANCE.md`](docs/PERFORMANCE.md)
@@ -79,6 +108,15 @@ pip install "ctxeng[all]" # everything
79108

80109
## Quickstart
81110

111+
### Start in 30 seconds
112+
113+
```bash
114+
pip install ctxeng
115+
ctxeng build "Fix the auth bug" --git-diff --fmt markdown --output ctx.md
116+
```
117+
118+
Open `ctx.md` and paste it into your LLM.
119+
82120
### Python API
83121

84122
```python
@@ -152,6 +190,32 @@ ctxeng build "Explain the payment flow" --output context.md
152190
ctxeng info
153191
```
154192

193+
---
194+
195+
## Real-world impact (Before vs After)
196+
197+
**Without ctxeng** (typical outcome when you paste a few files / vague snippets):
198+
199+
- Misses the real failing module
200+
- Suggests changes that don’t match your codebase
201+
- Spends tokens asking for more files
202+
203+
**With ctxeng** (same question, but with an evidence-packed context):
204+
205+
- Ranks the most relevant files first, so the failing module is more likely to be included on the first try
206+
- Uses imports + git recency to pull relevant neighbors
207+
- Fits your model window automatically and stays reproducible (trace/snapshot)
208+
209+
Example prompt:
210+
211+
> “Why is `test_login` failing after my last commit? Provide a minimal fix.”
212+
213+
Example command:
214+
215+
```bash
216+
ctxeng build "Why is test_login failing after my last commit? Provide a minimal fix." --git-diff --trace --fmt markdown --output ctx.md
217+
```
218+
155219
### Watch mode
156220

157221
Automatically rebuild context when files change (requires `watchdog`):

docs/ARCHITECTURE.md

Lines changed: 159 additions & 23 deletions
@@ -1,34 +1,170 @@
11
## Architecture
22

3-
`ctxeng` builds LLM-ready context from your repository by running a simple pipeline:
3+
`ctxeng` is a local-first “context compiler”: it **discovers** code, **scores** it against your task, optionally **expands/retrieves** more relevant neighbors, then **budgets + renders** a paste-ready context bundle.
44

55
```mermaid
66
flowchart TD
7-
discover[Discover_files] --> score[Score_and_rank]
8-
score --> expand[Expand_import_graph_optional]
9-
expand --> retrieve[RAG_chunk_retrieval_optional]
10-
retrieve --> redact[Redact_secrets_and_PII]
11-
redact --> budget[Fit_token_budget]
12-
budget --> render[Render_XML_or_Markdown]
13-
render --> output[Context_output]
7+
subgraph Inputs
8+
Q[Task / query]
9+
R[(Repository root)]
10+
O[Options / flags]
11+
end
12+
13+
subgraph Pipeline
14+
D[Discover files]
15+
S[Score + rank]
16+
IG["Import-graph expansion (optional)"]
17+
RAG["RAG chunk retrieval (optional)"]
18+
SK["AST skeleton (optional)"]
19+
RX["Redact secrets + PII (optional, default on)"]
20+
B[Fit token budget]
21+
RND["Render output (XML / Markdown / plain)"]
22+
end
23+
24+
Q --> S
25+
R --> D --> S --> IG --> RAG --> SK --> RX --> B --> RND
26+
O --> D
27+
O --> S
28+
O --> IG
29+
O --> RAG
30+
O --> SK
31+
O --> RX
32+
O --> B
33+
RND --> OUT[Context output + metadata]
1434
```
1535

16-
### Inputs
17-
- **Repository root** (`--root`)
18-
- **Task/query** (free text)
19-
- **Options**: import graph, semantic scoring, RAG, skeleton mode, redaction, tracing
36+
## Key design goals
37+
38+
- **Deterministic & explainable**: scoring signals are explicit; optional tracing records “what was included and why”.
39+
- **Portable**: output is not tied to any editor or model vendor (works with chat UIs, APIs, CI).
40+
- **Budget-safe**: context is constructed to fit a specific model window (or explicit `--budget`).
41+
- **Safety by default**: redaction runs before token counting, tracing, and rendering.
42+
43+
## Components (where the logic lives)
44+
45+
- **Discovery / sources**: `ctxeng/sources/__init__.py`
46+
- `collect_filesystem()`: walks the repo and yields `(path, content)`
47+
- `collect_git_changed()`: yields changed files (`--git-diff`)
48+
- `collect_explicit()`: yields exact paths (`--files` / `include_files()`)
49+
- **Ignore rules**: `ctxeng/ignore.py`
50+
- merges `.gitignore` + `.ctxengignore` into a single matcher (gitwildmatch via `pathspec`)
51+
- **Scoring / ranking**: `ctxeng/scorer.py`
52+
- combines signals into a score in \([0, 1]\)
53+
- supports optional semantic scoring + multi-language AST symbol overlap
54+
- supports configurable scoring weights (`.ctxeng/config.json` / `--scoring-config`)
55+
- **Import graph expansion (Python)**: `ctxeng/import_graph.py`
56+
- builds a static import graph among discovered `.py` files
57+
- expands context with imported neighbors (with score decay) before budgeting
58+
- **Chunking + retrieval (RAG)**: `ctxeng/chunking.py`, `ctxeng/retrieval.py`
59+
- splits files into overlapping chunks
60+
- lexical retrieval is always available; embeddings retrieval requires `ctxeng[semantic]`
61+
- **Skeleton mode (Python)**: `ctxeng/ast_skeleton.py`
62+
- replaces Python bodies with an AST-derived outline for “overview” requests
63+
- **Redaction**: `ctxeng/redaction.py`
64+
- masks secrets + PII with stable hashes so traces and output don’t leak sensitive values
65+
- **Budgeting + truncation**: `ctxeng/optimizer.py`
66+
- estimates token counts (uses `tiktoken` if installed, otherwise heuristic)
67+
- greedily fills budget by score, smart-truncates large files
68+
- **Tracing (optional)**: `ctxeng/tracing.py`
69+
- writes JSONL events under `.ctxeng/traces/` (safe payloads)
70+
- **Snapshots (optional)**: `ctxeng/snapshots.py`
71+
- writes `context.txt` + `manifest.json` under `.ctxeng/snapshots/<id>/`
72+
- **Orchestration**: `ctxeng/core.py`
73+
- `ContextEngine.build()` coordinates the whole pipeline
74+
- **Fluent API**: `ctxeng/builder.py`
75+
- `ContextBuilder` provides a chainable configuration layer
76+
77+
## Pipeline, step-by-step
78+
79+
### 1) Discover files
80+
81+
Discovery chooses the candidate set:
82+
83+
- **Filesystem** (default): walk from the repo root, apply ignore rules, skip files that look binary, and enforce a file-size limit.
84+
- **Git diff** (`--git-diff`): only changed/untracked files (great for PR reviews).
85+
- **Explicit files** (`--files` / `include_files()`): bypass discovery, useful for targeted tasks.
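The filesystem mode described above can be sketched roughly as follows. This is a minimal illustration, not ctxeng's implementation: `discover`, `looks_binary`, and `MAX_BYTES` are hypothetical names, and `fnmatch` stands in for the gitwildmatch matching that the real ignore rules use via `pathspec`.

```python
import fnmatch
import os
from typing import Iterator

MAX_BYTES = 200_000  # hypothetical size guard, not ctxeng's actual limit


def looks_binary(data: bytes) -> bool:
    """Heuristic: a NUL byte in the first 1 KiB suggests a binary file."""
    return b"\x00" in data[:1024]


def discover(root: str, ignore: list[str]) -> Iterator[tuple[str, str]]:
    """Yield (relative_path, text) for files that pass every filter."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk never descends into them.
        dirnames[:] = [
            d for d in dirnames
            if not any(fnmatch.fnmatch(d, pattern) for pattern in ignore)
        ]
        for name in sorted(filenames):
            if any(fnmatch.fnmatch(name, pattern) for pattern in ignore):
                continue
            full = os.path.join(dirpath, name)
            if os.path.getsize(full) > MAX_BYTES:
                continue  # size guard
            with open(full, "rb") as fh:
                data = fh.read()
            if looks_binary(data):
                continue
            yield os.path.relpath(full, root), data.decode("utf-8", errors="replace")
```

Calling `dict(discover(".", ["node_modules", "*.log"]))` would give a path-to-text mapping of the candidate set under these simplified rules.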
86+
87+
### 2) Score + rank
88+
89+
Each file is scored using a weighted mix of signals, then sorted descending. Signals include:
90+
91+
- **Keyword overlap**: query token overlap with content
92+
- **Path relevance**: filename + directory names matching query tokens
93+
- **AST symbol overlap**:
94+
- Python (built-in)
95+
- JS/TS/Go (optional, tree-sitter)
96+
- **Git recency**: recently changed files get a boost (optional)
97+
- **Semantic similarity**: optional local embeddings (`sentence-transformers`)
98+
99+
Weights can be customized with `--scoring-config` (or `.ctxeng/config.json`).
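To make the "weighted mix" concrete, here is a small sketch that combines three of the signals into a score in [0, 1]. The weights, the tokenizer, and the function names are assumptions chosen for illustration; ctxeng's actual defaults and signal definitions may differ.

```python
import re

# Hypothetical weights for illustration; ctxeng's defaults may differ.
WEIGHTS = {"keyword": 0.5, "path": 0.3, "recency": 0.2}


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; underscores and punctuation act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score_file(query: str, path: str, content: str, recency: float) -> float:
    """Weighted average of per-signal scores, each already in [0, 1]."""
    q = tokens(query)
    if not q:
        return 0.0
    signals = {
        "keyword": len(q & tokens(content)) / len(q),
        "path": len(q & tokens(path)) / len(q),
        "recency": min(max(recency, 0.0), 1.0),
    }
    total = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS) / total


def rank(query: str, files: list[tuple[str, str, float]]) -> list[tuple[float, str]]:
    """files holds (path, content, recency); returns (score, path) sorted descending."""
    scored = [(score_file(query, p, c, r), p) for p, c, r in files]
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

Dividing by the sum of the weights keeps the result in [0, 1] even if a custom configuration supplies weights that do not add up to 1.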
100+
101+
### 3) Import graph expansion (optional, Python)
102+
103+
If enabled, ctxeng can pull in locally imported Python modules from the discovered set (with score decay). This helps with “function is defined elsewhere” cases.
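A one-hop version of this idea might look like the sketch below. It assumes files are keyed by dotted module name and uses a made-up decay factor; `local_imports`, `expand`, and `DECAY` are illustrative names, not ctxeng's API.

```python
import ast

DECAY = 0.5  # hypothetical decay factor applied to imported neighbors


def local_imports(source: str, known_modules: set[str]) -> set[str]:
    """Return absolute imports in `source` that refer to modules in the discovered set."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module)
    return found & known_modules


def expand(scores: dict[str, float], sources: dict[str, str]) -> dict[str, float]:
    """Give each locally imported module at least `score * DECAY` of its importer."""
    known = set(sources)
    expanded = dict(scores)
    for module, score in scores.items():
        for dep in local_imports(sources[module], known):
            expanded[dep] = max(expanded.get(dep, 0.0), score * DECAY)
    return expanded
```

Third-party and standard-library imports drop out automatically because they are not in the discovered set. Relative imports are ignored here for brevity.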
104+
105+
### 4) RAG chunk retrieval (optional)
106+
107+
For large repos, `--rag` switches from whole-file inclusion to chunk-level selection:
108+
109+
- Candidate set: top-ranked files are chunked (keeps runtime bounded).
110+
- Retrieval:
111+
- embeddings (if installed) or lexical fallback (always available)
112+
- Output:
113+
- retrieved chunks become the “ranked list” fed into budgeting
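The chunk-then-retrieve flow above can be approximated with a few lines of standard-library Python. This sketch covers only the lexical fallback, and the chunk size, overlap, and scoring function are assumptions rather than ctxeng's actual parameters.

```python
import re


def chunk_lines(text: str, size: int = 40, overlap: int = 10) -> list[tuple[int, str]]:
    """Split text into windows of `size` lines; consecutive windows share `overlap` lines.

    Returns (1-based start line, chunk text) pairs.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    lines = text.splitlines()
    step = size - overlap
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append((start + 1, "\n".join(lines[start:start + size])))
        if start + size >= len(lines):
            break  # the last window already reaches the end of the file
    return chunks


def lexical_score(query: str, chunk: str) -> float:
    """Fraction of query tokens that appear in the chunk."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    c = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    return len(q & c) / len(q) if q else 0.0


def retrieve(query: str, text: str, top_k: int = 3, **chunk_args) -> list[tuple[int, str]]:
    """Return the top_k chunks by lexical score, best first."""
    chunks = chunk_lines(text, **chunk_args)
    return sorted(chunks, key=lambda ch: lexical_score(query, ch[1]), reverse=True)[:top_k]
```

The overlap exists so that a definition straddling a window boundary still appears whole in at least one chunk.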
114+
115+
### 5) Skeleton mode (optional, Python)
116+
117+
`--skeleton` replaces Python file bodies with an AST-derived outline (imports, defs, methods). This is best for:
20118

21-
### Scoring signals
22-
Each file gets a relevance score in \([0,1]\) from a weighted mix of:
23-
- **Keyword overlap** with the query
24-
- **Path relevance** (filename + directories)
25-
- **AST symbol overlap** (Python built-in; JS/TS/Go via optional tree-sitter)
26-
- **Git recency** (recently changed files score higher)
27-
- **Semantic similarity** (optional, sentence-transformers)
119+
- “high-level architecture overview”
120+
- “what are the key modules/classes?”
121+
- tight budgets where full bodies are too expensive
28122

29-
### Budgeting
30-
After ranking, files/chunks are added in descending score order until the token budget is reached.\nWhen a file is too large, ctxeng applies smart truncation (head + tail) before skipping.
123+
### 6) Redaction (optional, default on)
31124

32-
### Safety
33-
Before any output or trace is written, ctxeng can **redact secrets/PII**. This happens **before token counting, tracing, and rendering**.
125+
Redaction runs **before** token counting, tracing, and rendering.
126+
127+
This is intentional: it prevents accidental leakage through:
128+
129+
- output text
130+
- trace logs
131+
- token counting artifacts
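A minimal sketch of hash-based masking, assuming just two illustrative patterns (email addresses and AWS access key IDs). The pattern set, the placeholder format, and the function name are hypothetical; `ctxeng/redaction.py` may work differently.

```python
import hashlib
import re

# Illustrative patterns only; a real redactor needs a much broader set.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "AWS_KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def redact(text: str) -> str:
    """Replace each match with a placeholder carrying a short, stable hash of the value."""

    def mask(kind: str):
        def repl(match: re.Match) -> str:
            digest = hashlib.sha256(match.group(0).encode("utf-8")).hexdigest()[:8]
            return f"[REDACTED:{kind}:{digest}]"
        return repl

    for kind, pattern in PATTERNS.items():
        text = pattern.sub(mask(kind), text)
    return text
```

Because the hash depends only on the matched value, the same secret maps to the same placeholder across files and runs, so a reader can still tell that two occurrences refer to the same value. For low-entropy values a plain truncated hash can be guessed by brute force; a keyed hash (for example `hmac` with a local key) would be a safer variation.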
132+
133+
### 7) Fit token budget
134+
135+
Budgeting includes:
136+
137+
- counting query/system tokens
138+
- greedily including top-ranked items
139+
- smart truncation (head + tail) when a file is too large
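The three budgeting steps can be sketched as below. The 4-characters-per-token estimate, the `min_useful` threshold, and all names are assumptions for illustration; the real optimizer uses `tiktoken` when it is installed and may truncate differently.

```python
MARKER = "\n... [truncated] ...\n"


def estimate_tokens(text: str) -> int:
    """Rough heuristic (~4 characters per token); a real tokenizer is more accurate."""
    return max(1, len(text) // 4) if text else 0


def truncate_head_tail(text: str, max_tokens: int) -> str:
    """Keep the start and end of the text and drop the middle."""
    max_chars = max_tokens * 4
    if len(text) <= max_chars:
        return text
    half = (max_chars - len(MARKER)) // 2
    if half <= 0:
        return ""  # not enough room for any content around the marker
    return text[:half] + MARKER + text[-half:]


def fit_budget(ranked, budget, reserved=0, min_useful=50):
    """Greedily add (path, text) items, given in descending score order.

    `reserved` accounts for query/system tokens. An item that does not fit is
    truncated to the remaining space, unless less than `min_useful` tokens remain.
    """
    remaining = budget - reserved
    selected = []
    for path, text in ranked:
        cost = estimate_tokens(text)
        if cost > remaining:
            if remaining < min_useful:
                continue
            text = truncate_head_tail(text, remaining)
            if not text:
                continue
            cost = estimate_tokens(text)
        selected.append((path, text))
        remaining -= cost
    return selected, remaining
```

Keeping both the head and the tail tends to preserve imports and module-level definitions at the top as well as entry points near the bottom of a file.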
140+
141+
### 8) Render output
142+
143+
Context is rendered as:
144+
145+
- `xml` (default)
146+
- `markdown`
147+
- `plain`
148+
149+
Metadata may include trace/snapshot ids and other build details.
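For a sense of how the three formats differ, here is a small rendering sketch. The tag names, headings, and separators are hypothetical and are not ctxeng's actual output layout.

```python
from xml.sax.saxutils import escape, quoteattr


def render(files: list[tuple[str, str]], fmt: str = "xml") -> str:
    """Render (path, text) pairs in one of three illustrative layouts."""
    if fmt == "xml":
        parts = [
            f"<file path={quoteattr(path)}>\n{escape(text)}\n</file>"
            for path, text in files
        ]
        return "<context>\n" + "\n".join(parts) + "\n</context>"
    if fmt == "markdown":
        fence = "`" * 3  # built at runtime so this example can sit inside a fenced block
        return "\n\n".join(
            f"### {path}\n\n{fence}\n{text}\n{fence}" for path, text in files
        )
    if fmt == "plain":
        return "\n\n".join(f"==> {path} <==\n{text}" for path, text in files)
    raise ValueError(f"unknown format: {fmt!r}")
```

Escaping matters for the XML layout: source code routinely contains `<`, `>`, and `&`, which would otherwise produce malformed output.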
150+
151+
## Practical recipes
152+
153+
### PR review (fast + focused)
154+
155+
```bash
156+
ctxeng build "Review this PR. Focus on security and correctness." --git-diff --fmt markdown --output ctx.md
157+
```
158+
159+
### Large repo explanation (RAG + tracing)
160+
161+
```bash
162+
ctxeng build "Explain the authentication flow end-to-end" --rag --trace --fmt markdown --output ctx.md
163+
```
164+
165+
### High-level overview (skeleton)
166+
167+
```bash
168+
ctxeng build "Give me a high-level architecture overview" --skeleton --fmt markdown --output ctx.md
169+
```
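If you drive these recipes from a script (for example in CI), a thin wrapper that assembles the argument list can help. The sketch below only uses the flags shown in the recipes above; `build_command` is a hypothetical helper, not part of ctxeng.

```python
import shlex


def build_command(query, *, git_diff=False, rag=False, trace=False,
                  skeleton=False, fmt="markdown", output="ctx.md"):
    """Assemble a `ctxeng build` argument list from the flags used in the recipes."""
    argv = ["ctxeng", "build", query]
    switches = [
        (git_diff, "--git-diff"),
        (rag, "--rag"),
        (trace, "--trace"),
        (skeleton, "--skeleton"),
    ]
    argv += [flag for enabled, flag in switches if enabled]
    argv += ["--fmt", fmt, "--output", output]
    return argv


# Print the shell-quoted command for the "large repo" recipe.
print(shlex.join(build_command(
    "Explain the authentication flow end-to-end", rag=True, trace=True)))
```

Passing the list to `subprocess.run(argv, check=True)` would execute it without any shell-quoting concerns around the query string.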
34170

docs/FAQ.md

Lines changed: 79 additions & 3 deletions
@@ -1,16 +1,92 @@
11
## FAQ
22

3+
### Why ctxeng vs Cursor / Copilot?
4+
5+
Cursor and Copilot are great for **in-editor assistance**. `ctxeng` focuses on a different problem: building a **portable**, **budget-safe**, **reproducible** context bundle you can use with *any* LLM (chat UI, API, CI).
6+
7+
`ctxeng` is especially useful when you need:
8+
9+
- deterministic selection (ranked evidence, not “whatever is open”)
10+
- strict token budgeting (fits the model window)
11+
- safety (redaction before output/traces)
12+
- large-repo workflows (RAG chunk retrieval)
13+
- automation (CI, snapshots, tracing)
14+
315
### Does ctxeng send my whole repo to an LLM?
16+
417
No. ctxeng selects a subset of files (or chunks with `--rag`) based on your query and token budget.
518

19+
Also: **ctxeng itself does not call an LLM** unless you use an optional integration function (e.g. `ask_claude()`).
20+
621
### How does redaction work?
22+
723
When enabled (default), ctxeng masks common secrets and PII patterns before token counting, tracing, or output.
824

25+
To disable:
26+
27+
```bash
28+
ctxeng build "Your query" --no-redact
29+
```
30+
931
### What languages are supported?
10-
- File discovery and keyword/path scoring work for many text/code files.
11-
- Python import graph and skeleton mode are Python-specific.
12-
- JS/TS/Go AST symbol extraction is supported via optional tree-sitter dependencies.
32+
33+
- **Discovery + keyword/path scoring**: works for many text/code files.
34+
- **Python-only features**:
35+
- import graph expansion
36+
- skeleton mode
37+
- **JS/TS/Go symbols**: supported via optional tree-sitter dependencies:
38+
39+
```bash
40+
pip install "ctxeng[ast]"
41+
```
1342

1443
### Why is the VS Code extension disabled?
44+
1545
The extension is still under development, so it is disabled in releases to avoid unstable activation. Use the CLI or the Python package instead.
1646

47+
### I got `Token required because branch is protected` from Codecov
48+
49+
That usually means Codecov requires a token for uploads on protected branches.
50+
51+
Fix:
52+
53+
- Add `CODECOV_TOKEN` in your GitHub repo secrets
54+
- Or configure Codecov to allow tokenless uploads for your setup
55+
56+
### PyPI upload problems
57+
58+
#### `No module named twine`
59+
60+
Install it in the environment you’re using:
61+
62+
```bash
63+
python -m pip install -U twine build
64+
```
65+
66+
#### `HTTPError: 400 Bad Request` on upload
67+
68+
Most common causes:
69+
70+
- that version already exists on PyPI
71+
- you’re uploading old artifacts from `dist/`
72+
73+
Recommended flow:
74+
75+
```bash
76+
rm -rf dist build
77+
python -m build
78+
python -m twine check dist/*
79+
python -m twine upload dist/*
80+
```
81+
82+
### “It feels overwhelming—where do I start?”
83+
84+
Start with the smallest workflow:
85+
86+
```bash
87+
pip install ctxeng
88+
ctxeng build "Fix the auth bug" --git-diff --fmt markdown --output ctx.md
89+
```
90+
91+
Paste `ctx.md` into your LLM. Add `--trace` once you want explainability, and `--rag` once your repo is large.
92+
