Codebase RAG

This README explains how the repository works end‑to‑end and exactly how to run it.
It also documents the main modules, data flow, CLI flags, expected inputs/outputs, and API surfaces you can call from code.

The system is model‑agnostic. Bring your own embedder exposing .encode(list[str]) -> np.ndarray.


Which file do I run? (Entry Point)

Use the CLI at cgx/cli/main.py. It now routes through a new auto‑wired pipeline that connects all key components.

# Make the src/ layout importable
export PYTHONPATH="$PWD/src:$PYTHONPATH"

# 1) INDEX — parse → graph → records → two-view embeddings → FAISS → persist
python -m cgx.cli.main index \
  --project-root /path/to/your/codebase \
  --embedder "myproj.embed:make_model" \
  --out-dir /tmp/cgx_index \
  --metric cosine \
  --index-type flat

# 2) QUERY — hybrid retrieval (semantic + optional lexical + graph) + aggregation + insertion anchors
python -m cgx.cli.main query \
  --index-dir /tmp/cgx_index/indices \
  --records /tmp/cgx_index/records.jsonl \
  --embedder "myproj.embed:make_model" \
  --query "How do we add a new FastAPI route?"

Under the hood the CLI calls the auto‑wired pipeline:

  • src/cgx/pipeline/auto.py
    • run_index_auto(project_root, embedder, out_dir, metric, index_type)
    • run_query_auto(index_dir, records_path, embedder, query, ...)

Your original pipeline remains unchanged and available:

  • src/cgx/pipeline/run.py
    • run_index(...)
    • run_query(...)

You can keep using these or delegate them to the auto‑wired versions if you want one canonical path.


Architecture & Data Flow

  1. Parse → Chunks
    cgx.parser.parse_codebase.parse_codebase(project_root)
    Produces canonical chunks (files/classes/functions/methods) and, if available, basic call edges.

  2. Graph
    cgx.graph.build_graph.build_knowledge_graph(chunks, calls=None)
    Builds a NetworkX knowledge graph over code entities and relations (calls/modules/attrs/etc.).

  3. Records & Two‑View Corpus

    • cgx.embeddings.records.make_index_records(chunks, G) → records (deterministic S4‑style)
    • cgx.embeddings.records.prepare_embedding_corpus(records, which=('intent','impl')) → corpus
      • intent view: NL‑friendly summary (names/docstrings/comments)
      • impl view: implementation‑centric text (code/signatures), optionally normalized
  4. Embeddings & FAISS per view

    • Embeddings: cgx.embeddings.build.build_embeddings(...) encodes the text of each view
    • ANN index: cgx.embeddings.index.build_faiss_index(...) builds one FAISS index per view
    • Persist per‑view artifacts (metadata + row mappings) via cgx.io.persist.save_indices(...)
  5. Retrieval, Fusion & Post‑processing

    • Hybrid two‑view retrieval (semantic on both views + optional lexical + optional graph) with RRF fusion:
      cgx.retrieval.orchestrator.hybrid_retrieve_two_view(...)
    • Aggregate to implementation units:
      • cgx.retrieval.orchestrator.aggregate_by_file(...)
      • cgx.retrieval.orchestrator.aggregate_by_class(...)
    • Suggest insertion points for new code:
      • cgx.retrieval.orchestrator.suggest_insertion_points(query, results, records)
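
For orientation, steps 1–3 can be wired by hand using only the signatures listed above; this is a minimal sketch, and steps 4–5 are left to the auto‑wired pipeline because their arguments are not spelled out here.

from cgx.parser.parse_codebase import parse_codebase
from cgx.graph.build_graph import build_knowledge_graph
from cgx.embeddings.records import make_index_records, prepare_embedding_corpus

# NOTE: parse_codebase may also surface call edges alongside the chunks; adjust the
# unpacking to its actual return value.
chunks = parse_codebase("/path/to/your/codebase")           # 1) canonical chunks
G = build_knowledge_graph(chunks, calls=None)               # 2) NetworkX knowledge graph
records = make_index_records(chunks, G)                     # 3) deterministic records
corpus = prepare_embedding_corpus(records, which=("intent", "impl"))  # two-view corpus

# Steps 4-5 (build_embeddings, build_faiss_index, save_indices, hybrid retrieval) take
# further arguments not listed in this README; run_index_auto / run_query_auto wire them.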

CLI Reference

index

Builds two FAISS indices (one per view) and saves metadata + rows + records.

Flags

  • --project-root (required): Repository to index.
  • --embedder (required): Import spec "module:attr" that yields an object with .encode(list[str]) -> ndarray.
    • Class → instantiated with no args.
    • Callable (factory) → called to produce the object.
    • Pre‑instantiated object (module attr) → used directly.
  • --out-dir (required): Output directory.
  • --metric (default cosine): One of cosine|l2|ip.
  • --index-type (default flat): One of flat|ivf|hnsw.
  • Compatibility flags (kept for UX continuity; not required):
    --no-normalize-impl, --strip-literals-impl.
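
For illustration, a hypothetical myproj/embed.py matching the --embedder spec used in the examples could look like the sketch below; sentence-transformers is only an example backend, and the class and model names here are assumptions, not part of this repository.

# myproj/embed.py — hypothetical module, referenced as --embedder "myproj.embed:make_model"
import numpy as np
from sentence_transformers import SentenceTransformer  # example backend; any encoder works

class MyEmbedder:
    def __init__(self, name: str = "all-MiniLM-L6-v2"):
        self._model = SentenceTransformer(name)

    def encode(self, texts: list[str]) -> np.ndarray:
        # Must return a float32 array of shape (N, D).
        return self._model.encode(texts, convert_to_numpy=True).astype("float32")

def make_model() -> MyEmbedder:
    # Factory form: the CLI calls this with no arguments.
    return MyEmbedder()

# Equivalent specs: "myproj.embed:MyEmbedder" (class, instantiated with no args)
# or "myproj.embed:EMBEDDER" where EMBEDDER = MyEmbedder() (pre-instantiated object).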

Output Layout

/out-dir/
  ├── indices/
  │   ├── meta.json
  │   ├── intent.index
  │   ├── intent.rows.jsonl
  │   ├── impl.index
  │   └── impl.rows.jsonl
  └── records.jsonl
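
records.jsonl and the *.rows.jsonl files are plain JSON Lines, so an index run can be sanity‑checked with a few lines of Python (the field names inside each record are not assumed here):

import json

with open("/tmp/cgx_index/records.jsonl", encoding="utf-8") as fh:
    records = [json.loads(line) for line in fh if line.strip()]

print(f"{len(records)} records")
if records:
    print(sorted(records[0].keys()))  # inspect which fields a record carries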

query

Runs hybrid two‑view retrieval + file/class aggregation + insertion‑point suggestions.
Optionally, runs a single‑view semantic helper for debugging.

Flags

  • --index-dir (required): Directory containing indices/ from the index step.
  • --records (required): Path to records.jsonl.
  • --embedder (required): Same import spec used at index time.
  • --query (required): User question / task.
  • --chunks: Optional chunks.jsonl to power lexical search.
  • --graph: Optional JSON graph (if you wish to include graph expansion).
  • --top-k (default 10): Per‑view semantic top‑k.
  • --depth (default 1): Graph neighbor depth (if graph expansion is enabled).
  • --no-lexical: Disable lexical component.
  • --single-view {intent,impl}: Also run the semantic_search(...) helper on a single view and return its top‑k.

Output (printed JSON)

  • hits — fused top‑k chunks with ranks/scores.
  • top_files — aggregated by file.
  • top_classes — aggregated by class.
  • anchors — suggested insertion points (deterministic overlap signals).
  • single_view — optional block (when --single-view is provided).
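
Because query prints a single JSON document, its output is easy to post‑process. A minimal sketch, assuming the CLI writes only that JSON to stdout and that PYTHONPATH is set up as shown above:

import json
import subprocess

cmd = [
    "python", "-m", "cgx.cli.main", "query",
    "--index-dir", "/tmp/cgx_index/indices",
    "--records", "/tmp/cgx_index/records.jsonl",
    "--embedder", "myproj.embed:make_model",
    "--query", "How do we add a new FastAPI route?",
]
out = subprocess.run(cmd, capture_output=True, text=True, check=True)
result = json.loads(out.stdout)

print(result["top_files"])   # file-level rollups
print(result["anchors"])     # suggested insertion points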

Programmatic Usage

Auto‑wired pipeline (same as the CLI)

from cgx.pipeline.auto import run_index_auto, run_query_auto
from myproj.embed import make_model  # your embedder factory (the --embedder "module:attr" target)

# Build indices
summary = run_index_auto(
    project_root="/path/to/code",
    embedder=make_model(),          # object with .encode(list[str]) -> ndarray
    out_dir="/tmp/cgx_index",
    metric="cosine",
    index_type="flat",
)

# Query with hybrid fusion + anchors
results = run_query_auto(
    index_dir="/tmp/cgx_index/indices",
    records_path="/tmp/cgx_index/records.jsonl",
    embedder=make_model(),
    query="How to add JWT validation?",
    top_k_per_view=10,
    neighbor_depth=1,
    use_lexical=True,
    single_view=None,               # or "intent"/"impl"
)

Legacy pipeline (kept intact)

from cgx.pipeline.run import run_index, run_query
# These remain available and unchanged.

Configuration Objects

Typed configs live in cgx/config.py and support a simple overrides surface:

  • EmbeddingConfig.from_overrides(...).to_dict()
  • FaissConfig.from_overrides(metric="cosine", index_type="flat").to_dict()
  • HybridSearchConfig.from_overrides(rrf_k=60.0, ...).to_dict()

Some fields may also read environment variables (see cgx/config.py for exact names).
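
A short sketch of that surface, using only the fields shown above and assuming the remaining overrides are optional (other field names and defaults live in cgx/config.py and are not assumed here):

from cgx.config import FaissConfig, HybridSearchConfig

faiss_cfg = FaissConfig.from_overrides(metric="cosine", index_type="flat").to_dict()

# rrf_k is the Reciprocal Rank Fusion constant: each ranked list contributes 1 / (rrf_k + rank).
hybrid_cfg = HybridSearchConfig.from_overrides(rrf_k=60.0).to_dict()

print(faiss_cfg)
print(hybrid_cfg)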


Embedder Contract (BYO Model)

Any embedder works as long as it implements:

.encode(list[str]) -> numpy.ndarray  # shape (N, D), dtype float32 preferred

Tips

  • For cosine/ip metrics, L2‑normalize vectors across rows (both for index and query).
  • Reuse model/tokenizer across calls; batch requests to avoid overhead.
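
A minimal normalization helper for the cosine/ip case (plain NumPy; the epsilon guard only avoids division by zero on degenerate inputs):

import numpy as np

def l2_normalize_rows(vecs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    # Normalize each row to unit length so inner product equals cosine similarity.
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return (vecs / np.maximum(norms, eps)).astype("float32")

# Apply the same normalization to both index-time and query-time embeddings.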

Troubleshooting & Tips

  • PYTHONPATH: Always export PYTHONPATH="$PWD/src:$PYTHONPATH" for src‑layout imports.
  • Missing FAISS: The persist layer degrades gracefully; install a FAISS build (pip CPU wheel or conda) for best performance.
  • Graph Optionality: Hybrid retrieval works without a graph; provide one to enable graph expansion.
  • Index/Query Mismatch: Use the same embedder for querying as you used for indexing; mixing models makes the vectors incomparable.
  • Large Repos: Consider ivf/hnsw for larger corpora; tune nlist, efSearch, etc., if exposed by your build.
  • Determinism: Records and row order are deterministic; indices map back to stable record IDs written in *.rows.jsonl.

Module Map

  • Parsing — cgx.parser.parse_codebase.parse_codebase
  • Graph — cgx.graph.build_graph.build_knowledge_graph
  • Records — cgx.embeddings.records.make_index_records, prepare_embedding_corpus
  • Embeddings — cgx.embeddings.build.build_embeddings
  • Index — cgx.embeddings.index.build_faiss_index
  • Orchestrator — cgx.retrieval.orchestrator.hybrid_retrieve_two_view, aggregate_by_file, aggregate_by_class, suggest_insertion_points
  • Persistence — cgx.io.persist.save_indices/load_indices/save_jsonl/load_jsonl
  • CLI — cgx.cli.main
  • Pipeline — auto: cgx.pipeline.auto.run_index_auto, run_query_auto; legacy: cgx.pipeline.run.run_index, run_query

Capabilities & Usage

This project indexes an entire codebase and lets you ask questions, find the right places to modify, and add new functionality that fits the existing patterns. It does this by parsing code into canonical chunks, building a two‑view embedding index (intent & implementation), optionally expanding across a code graph, and fusing multiple signals for grounded retrieval. It also suggests insertion points to help you place new code safely.

Bring‑Your‑Own Embedder (BYOE). Any model works as long as it exposes .encode(list[str]) -> numpy.ndarray.


What you can do

  • Ask questions about the codebase in natural language (e.g., “Where is JWT verification implemented?”)
  • Find where to add new functionality (e.g., “Where should I add CSV export for reports?”)
  • Reuse patterns (e.g., “Show me canonical logging setup and usage across services”)
  • Discover APIs & contracts (e.g., “Which class validates requests?”)
  • Explore related code using optional graph expansion (follow callers/callees/imports)
  • Get insertion anchors for new code (files/classes/locations most likely to be correct)

The system returns:

  • hits (top chunks)
  • top_files (file rollups)
  • top_classes (class rollups)
  • anchors (suggested insertion points)

Install / Environment

python -m venv .venv
source .venv/bin/activate    # Windows: .venv\Scripts\activate
pip install -r requirements.txt

# Make src/ layout importable for local development
export PYTHONPATH="$PWD/src:$PYTHONPATH"
# (Alternatively, pip install -e . if you have a proper pyproject/setup)

Embedding Model — provide an object created from "module:attr":

  • If it’s a class: it will be instantiated with no arguments.
  • If it’s a callable factory: it will be called to obtain the object.
  • If it’s an object: it will be used directly.
  • The object must expose: .encode(list[str]) -> numpy.ndarray

Scenarios (Copy/Paste)

  • Find where to add a feature (CSV export):

    python -m cgx.cli.main query \
      --index-dir /tmp/cgx_index/indices \
      --records /tmp/cgx_index/records.jsonl \
      --embedder "myproj.embed:make_model" \
      --query "Where should I add CSV export for reports? Show helpers and similar code paths."
  • Follow canonical logging pattern:

    python -m cgx.cli.main query \
      --index-dir /tmp/cgx_index/indices \
      --records /tmp/cgx_index/records.jsonl \
      --embedder "myproj.embed:make_model" \
      --query "Find canonical logging setup and usage patterns across services" \
      --single-view impl
  • New OAuth provider with awareness of neighbors:

    python -m cgx.cli.main query \
      --index-dir /tmp/cgx_index/indices \
      --records /tmp/cgx_index/records.jsonl \
      --embedder "myproj.embed:make_model" \
      --query "Add new OAuth provider: where to plug in config, handlers, and tests?" \
      --depth 2

License & Contributing
