Commit 6dc0b3e

feat: insights, RAG improvements, scoring config
Enhance ctxeng info with size/import-graph insights and a markdown pie chart.
Improve RAG with Python structure-aware chunking and a chunk context window.
Add scoring-weights config support and multi-language AST symbols (JS/TS/Go).

Made-with: Cursor
1 parent 07dd9d8 commit 6dc0b3e

15 files changed

Lines changed: 584 additions & 52 deletions

README.md

Lines changed: 23 additions & 2 deletions
@@ -8,6 +8,7 @@
 <p align="center">
   <a href="https://pypi.org/project/ctxeng/"><img src="https://img.shields.io/pypi/v/ctxeng?color=blue&label=pypi&cacheSeconds=3600" alt="PyPI"></a>
   <a href="https://github.com/sayeem3051/python-context-engineer/actions"><img src="https://github.com/sayeem3051/python-context-engineer/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
+  <a href="https://codecov.io/gh/Sayeem3051/python-context-engineer"><img src="https://codecov.io/gh/Sayeem3051/python-context-engineer/branch/main/graph/badge.svg" alt="Coverage"></a>
   <a href="https://pypi.org/project/ctxeng/"><img src="https://img.shields.io/pypi/pyversions/ctxeng?cacheSeconds=3600" alt="Python"></a>
   <img src="https://img.shields.io/github/license/sayeem3051/python-context-engineer?cacheSeconds=3600" alt="License">
   <a href="https://pepy.tech/project/ctxeng"><img src="https://static.pepy.tech/badge/ctxeng/month" alt="Downloads"></a>
@@ -29,6 +30,11 @@ The quality of your LLM's output depends almost entirely on *what you put in the
 - **Fits the budget** — smart truncation keeps the best parts within any model's token limit
 - **Ships ready to paste** — XML, Markdown, or plain text output that works with Claude, GPT-4o, Gemini, and every other model
 
+Docs:
+- [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md)
+- [`docs/PERFORMANCE.md`](docs/PERFORMANCE.md)
+- [`docs/FAQ.md`](docs/FAQ.md)
+
 One small dependency ([pathspec](https://pypi.org/project/pathspec/)) powers `.ctxengignore` (gitignore-style patterns). Works with any LLM.
 
 ---
@@ -59,6 +65,8 @@ For semantic similarity scoring (optional local embeddings):
 pip install "ctxeng[semantic]"
 ```
 
+Default semantic model is `all-mpnet-base-v2`. Override with `--semantic-model` when building context.
+
 For one-line LLM calls:
 
 ```bash
@@ -222,6 +230,7 @@ For large repositories, `--rag` switches from whole-file inclusion to **chunk-le
 
 - Uses **embeddings** when `sentence-transformers` is installed
 - Falls back to **lexical retrieval** when embeddings aren't available
+- Chunks Python files by **function/class** boundaries when possible
 
 ```bash
 ctxeng build "Explain the login flow" --rag
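The lexical fallback mentioned in the bullets above can be sketched as a token-overlap scorer: rank each chunk by how many query tokens it contains. This is an illustrative sketch, not ctxeng's actual retrieval code; `lexical_score` and the sample chunks are made up for the example.

```python
import re
from collections import Counter

def lexical_score(query: str, chunk: str) -> float:
    """Fraction of query tokens that also appear in the chunk."""
    def tokens(s: str) -> list[str]:
        return re.findall(r"[a-z0-9_]+", s.lower())
    q, c = Counter(tokens(query)), Counter(tokens(chunk))
    overlap = sum(min(q[t], c[t]) for t in q)
    return overlap / (sum(q.values()) or 1)

chunks = [
    "def render_chart(data): ...",
    "def login(user, password): check_password(user, password)",
]
# Rank chunks against the query; the login chunk should score highest
ranked = sorted(chunks, key=lambda ch: lexical_score("Explain the login flow", ch), reverse=True)
print(ranked[0])
```

Embedding retrieval replaces this overlap score with cosine similarity between dense vectors, but the ranking step is the same.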
@@ -514,7 +523,7 @@ build options:
   --show-cost / --no-show-cost
                     Include estimated input cost in stderr summary (default: on)
   --semantic        Enable semantic similarity scoring (requires sentence-transformers)
-  --semantic-model  Semantic model name (default: all-MiniLM-L6-v2)
+  --semantic-model  Semantic model name (default: all-mpnet-base-v2)
   --gitignore / --no-gitignore
                     Respect .gitignore in addition to .ctxengignore (default: on)
   --allow           Allowlist path prefixes; only these paths may be included
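Under `--semantic`, files are ranked by the similarity of their embedding to the query's embedding; the core comparison is cosine similarity. A toy sketch with hand-made vectors (real embeddings come from a sentence-transformers model such as `all-mpnet-base-v2`; `cosine` here is illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

print(cosine([1.0, 0.0], [1.0, 0.0]))  # → 1.0 (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # → 0.0 (orthogonal, unrelated)
```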
@@ -541,7 +550,7 @@ build options:
 watch options:
   --interval S      Polling interval in seconds (default: 1.0)
   --semantic        Enable semantic similarity scoring (requires sentence-transformers)
-  --semantic-model  Semantic model name (default: all-MiniLM-L6-v2)
+  --semantic-model  Semantic model name (default: all-mpnet-base-v2)
   --gitignore / --no-gitignore
                     Respect .gitignore in addition to .ctxengignore (default: on)
   --allow           Allowlist path prefixes; only these paths may be included
@@ -648,6 +657,18 @@ You could. But you'll hit these problems immediately:
 
 PRs welcome! See [CONTRIBUTING.md](CONTRIBUTING.md).
 
+## Test coverage
+
+We track coverage with `pytest-cov` in CI and upload `coverage.xml` to Codecov.
+
+- **Goal**: keep coverage **above 80%**
+- **Local run**:
+
+```bash
+pip install -e ".[dev]"
+pytest --cov=ctxeng --cov-report=term-missing --cov-report=xml
+```
+
 ```bash
 git clone https://github.com/sayeem3051/python-context-engineer
 cd python-context-engineer

ctxeng/builder.py

Lines changed: 13 additions & 2 deletions
@@ -43,7 +43,7 @@ def __init__(self, root: str | Path = ".") -> None:
         self._use_import_graph = True
         self._import_graph_depth = 1
         self._use_semantic = False
-        self._semantic_model = "all-MiniLM-L6-v2"
+        self._semantic_model = "all-mpnet-base-v2"
         self._respect_gitignore = True
         self._allow_paths: list[str | Path] = []
         self._deny_paths: list[str | Path] = []
@@ -54,12 +54,14 @@ def __init__(self, root: str | Path = ".") -> None:
         self._rag_max_chunks = 20
         self._rag_chunk_max_lines = 120
         self._rag_chunk_overlap = 20
+        self._rag_chunk_context_lines = 3
         self._rag_embedding_model = "all-MiniLM-L6-v2"
         self._skeleton = False
         self._redact = True
         self._fewshot = False
         self._fewshot_dir: str | Path = ".ctxeng/examples"
         self._fewshot_max_files = 5
+        self._scoring_config: str | Path | None = None
 
     def for_model(self, model: str) -> ContextBuilder:
         """Set the target model (determines token budget)."""
@@ -118,7 +120,7 @@ def no_import_graph(self) -> ContextBuilder:
         self._use_import_graph = False
         return self
 
-    def use_semantic(self, model: str = "all-MiniLM-L6-v2") -> ContextBuilder:
+    def use_semantic(self, model: str = "all-mpnet-base-v2") -> ContextBuilder:
         """Enable semantic similarity scoring (requires `sentence-transformers`)."""
         self._use_semantic = True
         self._semantic_model = model
@@ -153,13 +155,15 @@ def rag(
         max_chunks: int = 20,
         chunk_max_lines: int = 120,
         chunk_overlap: int = 20,
+        chunk_context_lines: int = 3,
         embedding_model: str = "all-MiniLM-L6-v2",
     ) -> ContextBuilder:
         """Enable chunk-level retrieval (RAG)."""
         self._rag = enabled
         self._rag_max_chunks = max_chunks
         self._rag_chunk_max_lines = chunk_max_lines
         self._rag_chunk_overlap = chunk_overlap
+        self._rag_chunk_context_lines = chunk_context_lines
         self._rag_embedding_model = embedding_model
         return self
 
@@ -186,6 +190,11 @@ def fewshot(
         self._fewshot_max_files = max_files
         return self
 
+    def scoring_config(self, path: str | Path) -> ContextBuilder:
+        """Load scoring weights from a config file."""
+        self._scoring_config = path
+        return self
+
     def build(self, query: str = "") -> Context:
         """
         Build and return the optimized Context.
@@ -229,10 +238,12 @@ def _build_engine(self) -> ContextEngine:
             rag_max_chunks=self._rag_max_chunks,
             rag_chunk_max_lines=self._rag_chunk_max_lines,
             rag_chunk_overlap=self._rag_chunk_overlap,
+            rag_chunk_context_lines=self._rag_chunk_context_lines,
             rag_embedding_model=self._rag_embedding_model,
             skeleton=self._skeleton,
             redact=self._redact,
             fewshot=self._fewshot,
             fewshot_dir=self._fewshot_dir,
             fewshot_max_files=self._fewshot_max_files,
+            scoring_config=self._scoring_config,
         )
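The new options thread through the builder's fluent interface: each setter stores state and returns `self` so calls chain. A stripped-down standalone sketch of that pattern (`MiniBuilder` is illustrative, not the real `ContextBuilder`; `"weights.toml"` is a made-up path):

```python
from __future__ import annotations

from pathlib import Path

class MiniBuilder:
    """Toy fluent builder mirroring the rag()/scoring_config() additions."""

    def __init__(self) -> None:
        self._rag = False
        self._rag_chunk_context_lines = 3
        self._scoring_config: str | Path | None = None

    def rag(self, enabled: bool = True, *, chunk_context_lines: int = 3) -> MiniBuilder:
        self._rag = enabled
        self._rag_chunk_context_lines = chunk_context_lines
        return self  # returning self is what makes the calls chainable

    def scoring_config(self, path: str | Path) -> MiniBuilder:
        self._scoring_config = path
        return self

b = MiniBuilder().rag(chunk_context_lines=5).scoring_config("weights.toml")
print(b._rag, b._rag_chunk_context_lines, b._scoring_config)  # → True 5 weights.toml
```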

ctxeng/chunking.py

Lines changed: 85 additions & 1 deletion
@@ -2,6 +2,7 @@
 
 from __future__ import annotations
 
+import ast
 from dataclasses import dataclass
 from pathlib import Path
 
@@ -17,12 +18,35 @@ def id(self) -> str:
         return f"{self.path.as_posix()}:{self.start_line}-{self.end_line}"
 
 
+def chunk_file(
+    path: Path,
+    text: str,
+    *,
+    max_lines: int = 120,
+    overlap: int = 20,
+    context_lines: int = 3,
+) -> list[Chunk]:
+    """
+    Structure-aware chunking:
+    - Python: chunk by class/function using AST line spans.
+    - Other: fall back to line chunking.
+
+    `context_lines` expands each chunk with surrounding lines for better local context.
+    """
+    if path.suffix.lower() == ".py":
+        chunks = _chunk_python_ast(path, text, max_lines=max_lines, context_lines=context_lines)
+        if chunks:
+            return chunks
+    return chunk_text(path, text, max_lines=max_lines, overlap=overlap, context_lines=context_lines)
+
+
 def chunk_text(
     path: Path,
     text: str,
     *,
     max_lines: int = 120,
     overlap: int = 20,
+    context_lines: int = 0,
 ) -> list[Chunk]:
     """
     Split file content into overlapping line chunks.
@@ -44,13 +68,73 @@ def chunk_text(
     n = len(lines)
     while i < n:
         j = min(n, i + max_lines)
-        chunk_lines = lines[i:j]
         start_line = i + 1
         end_line = j
+
+        # Expand with surrounding context
+        start0 = max(1, start_line - context_lines)
+        end0 = min(n, end_line + context_lines)
+        chunk_lines = lines[start0 - 1 : end0]
+        start_line = start0
+        end_line = end0
         chunks.append(Chunk(path=path, start_line=start_line, end_line=end_line, text="\n".join(chunk_lines)))
         if j >= n:
             break
         i = max(0, j - overlap)
 
     return chunks
 
+
+def _chunk_python_ast(
+    path: Path,
+    text: str,
+    *,
+    max_lines: int,
+    context_lines: int,
+) -> list[Chunk]:
+    lines = text.splitlines()
+    if not lines:
+        return []
+    try:
+        tree = ast.parse(text)
+    except SyntaxError:
+        return []
+
+    spans: list[tuple[int, int]] = []
+    for node in ast.walk(tree):
+        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
+            start = getattr(node, "lineno", None)
+            end = getattr(node, "end_lineno", None)
+            if isinstance(start, int) and isinstance(end, int) and end >= start:
+                spans.append((start, end))
+
+    if not spans:
+        return []
+
+    # Dedupe + sort
+    spans = sorted(set(spans))
+
+    out: list[Chunk] = []
+    n = len(lines)
+    for start, end in spans:
+        # Expand context window
+        s = max(1, start - context_lines)
+        e = min(n, end + context_lines)
+        if e - s + 1 > max_lines:
+            # Too big: fall back to line chunking inside span (no overlap; keep boundaries)
+            segment = "\n".join(lines[s - 1 : e])
+            for ch in chunk_text(path, segment, max_lines=max_lines, overlap=0, context_lines=0):
+                # Rebase line numbers
+                out.append(
+                    Chunk(
+                        path=path,
+                        start_line=s + (ch.start_line - 1),
+                        end_line=s + (ch.end_line - 1),
+                        text=ch.text,
+                    )
+                )
+        else:
+            out.append(Chunk(path=path, start_line=s, end_line=e, text="\n".join(lines[s - 1 : e])))
+
+    return out
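The heart of the structure-aware path above is collecting function/class line spans from the AST via `lineno`/`end_lineno`. A self-contained sketch of just that step (`python_spans` is illustrative; the real code wraps the spans into `Chunk` objects):

```python
import ast

def python_spans(source: str) -> list[tuple[int, int]]:
    """Collect (start, end) line spans of every function/class definition."""
    spans = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if node.end_lineno is not None:
                spans.add((node.lineno, node.end_lineno))
    return sorted(spans)

src = "def a():\n    return 1\n\nclass B:\n    def m(self):\n        return 2\n"
print(python_spans(src))  # → [(1, 2), (4, 6), (5, 6)]
```

Note that `ast.walk` visits nested definitions too, so a method yields its own span overlapping its class's span, matching the diff's behavior.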
