docs: improve differentiation and onboarding

Sayeem3051 · Sayeem3051 · commit ed94aba5ac0b · 2026-04-14T17:43:44.000+05:30
Add why-ctxeng section, simple visual, 30-second start, before/after impact, and expand architecture/performance/FAQ docs.

Made-with: Cursor
diff --git a/README.md b/README.md
@@ -30,6 +30,35 @@ The quality of your LLM's output depends almost entirely on *what you put in the
 - **Fits the budget** — smart truncation keeps the best parts within any model's token limit
 - **Ships ready to paste** — XML, Markdown, or plain text output that works with Claude, GPT-4o, Gemini, and every other model
 
+---
+
+## Why ctxeng vs Cursor / Copilot?
+
+Cursor and Copilot are great for **in-editor assistance**. `ctxeng` is different: it’s a **local-first context builder** that creates a *portable*, *reproducible* context bundle you can use with **any** LLM (chat UI, API, CI).
+
+What `ctxeng` provides that “just ask the editor” usually doesn’t handle well:
+
+- **Deterministic context selection**: explicit scoring + ranking signals (keyword, AST, git recency, import graph, optional semantic).
+- **Token-budget guarantee**: builds a context that *fits* your model window via budgeting + smart truncation.
+- **Safety by default**: secrets/PII redaction happens before token counting, tracing, or output.
+- **RAG for big repos**: chunk-level retrieval (embeddings optional, lexical fallback).
+- **Operational workflows**: tracing + snapshots + `ctxeng ci` for pipeline runs and reproducible artifacts.
+
+If you want the shortest summary: **Cursor/Copilot help you ask; ctxeng helps you package the right evidence reliably.**
+
+---
+
+## Visual: what ctxeng does
+
+```mermaid
+flowchart LR
+  U[You / task] --> C[ctxeng]
+  R[(Your repo)] --> C
+  C --> O["LLM-ready context\n(XML/Markdown/plain)"]
+  O --> A["LLM (ChatGPT/Claude/Gemini/API)"]
+  A --> B[Better answers\nless hallucination]
+```
+
 Docs:
 - [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md)
 - [`docs/PERFORMANCE.md`](docs/PERFORMANCE.md)
@@ -79,6 +108,15 @@ pip install "ctxeng[all]"          # everything
 
 ## Quickstart
 
+### Start in 30 seconds
+
+```bash
+pip install ctxeng
+ctxeng build "Fix the auth bug" --git-diff --fmt markdown --output ctx.md
+```
+
+Open `ctx.md` and paste it into your LLM.
+
 ### Python API
 
 ```python
@@ -152,6 +190,32 @@ ctxeng build "Explain the payment flow" --output context.md
 ctxeng info
 ```
 
+---
+
+## Real-world impact (Before vs After)
+
+**Without ctxeng** (typical outcome when you paste a few files / vague snippets):
+
+- Misses the real failing module
+- Suggests changes that don’t match your codebase
+- Spends tokens asking for more files
+
+**With ctxeng** (same question, but with an evidence-packed context):
+
+- Gets the right files (ranked) on the first try
+- Uses imports + git recency to pull relevant neighbors
+- Fits your model window automatically and stays reproducible (trace/snapshot)
+
+Example prompt:
+
+> “Why is `test_login` failing after my last commit? Provide a minimal fix.”
+
+Example command:
+
+```bash
+ctxeng build "Why is test_login failing after my last commit? Provide a minimal fix." --git-diff --trace --fmt markdown --output ctx.md
+```
+
 ### Watch mode
 
 Automatically rebuild context when files change (requires `watchdog`):
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
@@ -1,34 +1,170 @@
 ## Architecture
 
-`ctxeng` builds LLM-ready context from your repository by running a simple pipeline:
+`ctxeng` is a local-first “context compiler”: it **discovers** code, **scores** it against your task, optionally **expands/retrieves** more relevant neighbors, then **budgets + renders** a paste-ready context bundle.
 
 ```mermaid
 flowchart TD
-  discover[Discover_files] --> score[Score_and_rank]
-  score --> expand[Expand_import_graph_optional]
-  expand --> retrieve[RAG_chunk_retrieval_optional]
-  retrieve --> redact[Redact_secrets_and_PII]
-  redact --> budget[Fit_token_budget]
-  budget --> render[Render_XML_or_Markdown]
-  render --> output[Context_output]
+  subgraph Inputs
+    Q[Task / query]
+    R[(Repository root)]
+    O[Options / flags]
+  end
+
+  subgraph Pipeline
+    D[Discover files]
+    S[Score + rank]
+    IG["Import-graph expansion (optional)"]
+    RAG["RAG chunk retrieval (optional)"]
+    SK["AST skeleton (optional)"]
+    RX["Redact secrets + PII (optional, default on)"]
+    B[Fit token budget]
+    RND["Render output (XML / Markdown / plain)"]
+  end
+
+  Q --> S
+  R --> D --> S --> IG --> RAG --> SK --> RX --> B --> RND
+  O --> D
+  O --> S
+  O --> IG
+  O --> RAG
+  O --> SK
+  O --> RX
+  O --> B
+  RND --> OUT[Context output + metadata]
 ```
 
-### Inputs
-- **Repository root** (`--root`)
-- **Task/query** (free text)
-- **Options**: import graph, semantic scoring, RAG, skeleton mode, redaction, tracing
+## Key design goals
+
+- **Deterministic & explainable**: scoring signals are explicit; optional tracing records “what was included and why”.
+- **Portable**: output is not tied to any editor or model vendor (works with chat UIs, APIs, CI).
+- **Budget-safe**: context is constructed to fit a specific model window (or explicit `--budget`).
+- **Safety by default**: redaction runs before token counting, tracing, and rendering.
+
+## Components (where the logic lives)
+
+- **Discovery / sources**: `ctxeng/sources/__init__.py`
+  - `collect_filesystem()`: walks the repo and yields `(path, content)`
+  - `collect_git_changed()`: yields changed files (`--git-diff`)
+  - `collect_explicit()`: yields exact paths (`--files` / `include_files()`)
+- **Ignore rules**: `ctxeng/ignore.py`
+  - merges `.gitignore` + `.ctxengignore` into a single matcher (gitwildmatch via `pathspec`)
+- **Scoring / ranking**: `ctxeng/scorer.py`
+  - combines signals into a score in `[0, 1]`
+  - supports optional semantic scoring + multi-language AST symbol overlap
+  - supports configurable scoring weights (`.ctxeng/config.json` / `--scoring-config`)
+- **Import graph expansion (Python)**: `ctxeng/import_graph.py`
+  - builds a static import graph among discovered `.py` files
+  - expands context with imported neighbors (with score decay) before budgeting
+- **Chunking + retrieval (RAG)**: `ctxeng/chunking.py`, `ctxeng/retrieval.py`
+  - splits files into overlapping chunks
+  - lexical retrieval is always available; embedding-based retrieval requires `ctxeng[semantic]`
+- **Skeleton mode (Python)**: `ctxeng/ast_skeleton.py`
+  - replaces Python bodies with an AST-derived outline for “overview” requests
+- **Redaction**: `ctxeng/redaction.py`
+  - masks secrets + PII with stable hashes so traces and output don’t leak sensitive values
+- **Budgeting + truncation**: `ctxeng/optimizer.py`
+  - estimates token counts (uses `tiktoken` if installed, otherwise a heuristic)
+  - greedily fills budget by score, smart-truncates large files
+- **Tracing (optional)**: `ctxeng/tracing.py`
+  - writes JSONL events under `.ctxeng/traces/` (safe payloads)
+- **Snapshots (optional)**: `ctxeng/snapshots.py`
+  - writes `context.txt` + `manifest.json` under `.ctxeng/snapshots/<id>/`
+- **Orchestration**: `ctxeng/core.py`
+  - `ContextEngine.build()` coordinates the whole pipeline
+- **Fluent API**: `ctxeng/builder.py`
+  - `ContextBuilder` provides a chainable configuration layer
+
+## Pipeline, step-by-step
+
+### 1) Discover files
+
+Discovery chooses the candidate set:
+
+- **Filesystem** (default): walk from repo root, apply ignore rules, skip binary-ish files, apply size guard.
+- **Git diff** (`--git-diff`): only changed/untracked files (great for PR reviews).
+- **Explicit files** (`--files` / `include_files()`): bypass discovery, useful for targeted tasks.
+
+### 2) Score + rank
+
+Each file is scored using a weighted mix of signals, then sorted descending. Signals include:
+
+- **Keyword overlap**: query token overlap with content
+- **Path relevance**: filename + directory names matching query tokens
+- **AST symbol overlap**:
+  - Python (built-in)
+  - JS/TS/Go (optional, tree-sitter)
+- **Git recency**: recently changed files get a boost (optional)
+- **Semantic similarity**: optional local embeddings (`sentence-transformers`)
+
+Weights can be customized with `--scoring-config` (or `.ctxeng/config.json`).
+
+### 3) Import graph expansion (optional, Python)
+
+If enabled, ctxeng can pull in locally imported Python modules from the discovered set (with score decay). This helps with “function is defined elsewhere” cases.
+
+### 4) RAG chunk retrieval (optional)
+
+For large repos, `--rag` switches from whole-file inclusion to chunk-level selection:
+
+- Candidate set: top-ranked files are chunked (keeps runtime bounded).
+- Retrieval:
+  - embeddings (if installed) or lexical fallback (always available)
+- Output:
+  - retrieved chunks become the “ranked list” fed into budgeting
+
+### 5) Skeleton mode (optional, Python)
+
+`--skeleton` replaces Python file bodies with an AST-derived outline (imports, defs, methods). This is best for:
 
-### Scoring signals
-Each file gets a relevance score in \([0,1]\) from a weighted mix of:
-- **Keyword overlap** with the query
-- **Path relevance** (filename + directories)
-- **AST symbol overlap** (Python built-in; JS/TS/Go via optional tree-sitter)
-- **Git recency** (recently changed files score higher)
-- **Semantic similarity** (optional, sentence-transformers)
+- “high-level architecture overview”
+- “what are the key modules/classes?”
+- tight budgets where full bodies are too expensive
 
-### Budgeting
-After ranking, files/chunks are added in descending score order until the token budget is reached.\nWhen a file is too large, ctxeng applies smart truncation (head + tail) before skipping.
+### 6) Redaction (optional, default on)
 
-### Safety
-Before any output or trace is written, ctxeng can **redact secrets/PII**. This happens **before token counting, tracing, and rendering**.
+Redaction runs **before** token counting, tracing, and rendering.
+
+This ordering is intentional. It prevents accidental leakage through:
+
+- output text
+- trace logs
+- token counting artifacts
+
+### 7) Fit token budget
+
+Budgeting includes:
+
+- counting query/system tokens
+- greedily including top-ranked items
+- smart truncation (head + tail) when a file is too large
+
+### 8) Render output
+
+Context is rendered as:
+
+- `xml` (default)
+- `markdown`
+- `plain`
+
+Metadata may include trace/snapshot ids and other build details.
+
+## Practical recipes
+
+### PR review (fast + focused)
+
+```bash
+ctxeng build "Review this PR. Focus on security and correctness." --git-diff --fmt markdown --output ctx.md
+```
+
+### Large repo explanation (RAG + tracing)
+
+```bash
+ctxeng build "Explain the authentication flow end-to-end" --rag --trace --fmt markdown --output ctx.md
+```
+
+### High-level overview (skeleton)
+
+```bash
+ctxeng build "Give me a high-level architecture overview" --skeleton --fmt markdown --output ctx.md
+```
 
diff --git a/docs/FAQ.md b/docs/FAQ.md
@@ -1,16 +1,92 @@
 ## FAQ
 
+### Why ctxeng vs Cursor / Copilot?
+
+Cursor and Copilot are great for **in-editor assistance**. `ctxeng` focuses on a different problem: building a **portable**, **budget-safe**, **reproducible** context bundle you can use with *any* LLM (chat UI, API, CI).
+
+`ctxeng` is especially useful when you need:
+
+- deterministic selection (ranked evidence, not “whatever is open”)
+- strict token budgeting (fits the model window)
+- safety (redaction before output/traces)
+- large-repo workflows (RAG chunk retrieval)
+- automation (CI, snapshots, tracing)
+
 ### Does ctxeng send my whole repo to an LLM?
+
 No. ctxeng selects a subset of files (or chunks with `--rag`) based on your query and token budget.
 
+Also: **ctxeng itself does not call an LLM** unless you use an optional integration function (e.g. `ask_claude()`).
+
 ### How does redaction work?
+
 When enabled (default), ctxeng masks common secrets and PII patterns before token counting, tracing, or output.
 
+To disable:
+
+```bash
+ctxeng build "Your query" --no-redact
+```
+
 ### What languages are supported?
-- File discovery and keyword/path scoring work for many text/code files.
-- Python import graph and skeleton mode are Python-specific.
-- JS/TS/Go AST symbol extraction is supported via optional tree-sitter dependencies.
+
+- **Discovery + keyword/path scoring**: works for many text/code files.
+- **Python-only features**:
+  - import graph expansion
+  - skeleton mode
+- **JS/TS/Go symbols**: supported via optional tree-sitter dependencies:
+
+```bash
+pip install "ctxeng[ast]"
+```
 
 ### Why is the VSCode extension disabled?
+
 It is currently under development and disabled to avoid unstable activation in releases. Use the CLI/Python package.
 
+### I got `Token required because branch is protected` from Codecov
+
+That usually means Codecov requires a token for uploads on protected branches.
+
+Fix:
+
+- Add `CODECOV_TOKEN` in your GitHub repo secrets
+- Or configure Codecov to allow tokenless uploads for your setup
+
+### PyPI upload problems
+
+#### `No module named twine`
+
+Install it in the environment you’re using:
+
+```bash
+python -m pip install -U twine build
+```
+
+#### `HTTPError: 400 Bad Request` on upload
+
+Most common causes:
+
+- that version already exists on PyPI
+- you’re uploading old artifacts from `dist/`
+
+Recommended flow:
+
+```bash
+rm -rf dist build
+python -m build
+python -m twine check dist/*
+python -m twine upload dist/*
+```
+
+### “It feels overwhelming—where do I start?”
+
+Start with the smallest workflow:
+
+```bash
+pip install ctxeng
+ctxeng build "Fix the auth bug" --git-diff --fmt markdown --output ctx.md
+```
+
+Paste `ctx.md` into your LLM. Add `--trace` once you want explainability, and `--rag` once your repo is large.
+
diff --git a/docs/PERFORMANCE.md b/docs/PERFORMANCE.md