Agentic AI Codebase Assistant

A production-grade Agentic RAG system that lets you have intelligent conversations with any Python codebase — powered by hybrid search, AST-based code parsing, self-healing retrieval, and a multi-step agent with reflection.

Why This Project • Architecture • Features • Quick Start • API Reference • Design Decisions • Tech Stack

Why This Project?

Most RAG demos are toys. They do: embed → retrieve → generate and call it done. Real codebases break that pattern in seconds.

This system is built on what production AI systems (Cursor, Perplexity, GitHub Copilot) actually use:

Hybrid search — Vector search misses exact function names. BM25 misses semantic meaning. Combined with Reciprocal Rank Fusion, you get both.
AST-based chunking — Text splitters destroy code structure. This system parses Python at the AST level, extracting functions with their arguments, decorators, call graphs, and docstrings.
Self-healing RAG — When confidence is low, the system automatically expands retrieval, regenerates, and picks the best answer.
Agentic routing — Not every question needs retrieval. The planner decides: code search, dependency analysis, function explanation, or full RAG.
Hallucination detection — Every answer is verified against its source context before being returned.

For Recruiters: This project demonstrates production ML engineering skills — not just calling an LLM API, but building the full reliability and quality stack around it: observability, caching, self-healing, agent loops, evaluation, and load testing.

Architecture

                         User Query
                              │
                              ▼
                    ┌─────────────────┐
                    │   FastAPI       │  Async, Streaming, Health, Metrics
                    └────────┬────────┘
                             │
                             ▼
              ┌──────────────────────────────┐
              │       Agent Planner          │
              │  Decides HOW to answer:      │
              │  ┌────────────────────────┐  │
              │  │ • Code Search Tool     │  │
              │  │ • Dependency Finder    │  │
              │  │ • Function Explainer   │  │
              │  │ • Full RAG Pipeline    │  │
              │  └────────────────────────┘  │
              │  Memory │ Reflection Loop     │
              └──────────────┬───────────────┘
                             │
                    ┌────────▼────────┐
                    │  RAG Pipeline   │
                    └────────┬────────┘
                             │
              ┌──────────────▼──────────────────┐
              │         Retrieval Layer          │
              │  Query Rewrite → Decompose       │
              │  Multi-Query Expansion           │
              │  ┌───────────────────────────┐   │
              │  │  BM25 Keyword Search      │   │
              │  │  Vector Similarity Search │   │
              │  │  RRF Fusion + Reranking   │   │
              │  └───────────────────────────┘   │
              └──────────────┬──────────────────┘
                             │
              ┌──────────────▼──────────────────┐
              │       Quality Control            │
              │  Context Compression             │
              │  LLM Generation (versioned)      │
              │  Reflection → Improvement        │
              │  Verification (SUPPORTED check)  │
              │  Hallucination Detection         │
              │  Confidence Score (0–100)        │
              │  Grounding Score (0–100)         │
              │  Self-Healing Retry if low       │
              └──────────────┬──────────────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │    Response     │
                    │  answer +       │
                    │  confidence +   │
                    │  sources +      │
                    │  tool trace     │
                    └─────────────────┘

Features

Retrieval — Finding the Right Code

Feature	Detail
AST-Based Chunking	Extracts functions, classes, and methods with full metadata: arguments, return types, decorators, call graphs, docstrings
Hybrid BM25 + Vector Search	Exact keyword matching + semantic similarity, run in parallel
Reciprocal Rank Fusion (RRF)	Rank-based, score-agnostic fusion that is common in hybrid search (used, for example, by Azure AI Search)
Multi-Query Expansion	Complex queries decomposed and expanded to maximize recall
Query Rewriting	LLM-powered query reformulation for better retrieval
Cross-Encoder Reranking	BGE reranker re-scores top candidates for precision (retrieve 20 → rerank to 5)
Adaptive Retrieval Depth	Automatically expands from k=8 to k=15 during self-healing

Reliability — Answers You Can Trust

Feature	Detail
Answer Verification	Checks if the answer is `SUPPORTED` by the retrieved context
Hallucination Detection	Explicit LLM check: does the answer make claims not in the source?
Confidence Scoring	0–100 score quantifying answer certainty
Grounding Score	0–100 score measuring faithfulness to retrieved context
Self-Healing Loop	Low score → expand retrieval → regenerate → pick best → retry
Circuit Breaker Pattern	Prevents cascading failures in the retrieval/generation pipeline
Retry with Backoff	Async retry logic for transient LLM API failures
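
As an illustration of the Retry with Backoff row, here is a minimal sketch of async retry with exponential backoff and jitter. The function name `retry_async` and its parameters are hypothetical and are not taken from app/resilience.py; the real module may differ.

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def retry_async(
    call: Callable[[], Awaitable[T]],
    *,
    attempts: int = 3,            # assumed default, not from the project
    base_delay: float = 0.5,      # seconds before the first retry (assumed)
    max_delay: float = 8.0,       # upper bound on the backoff delay (assumed)
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Await call(), retrying on failure with exponential backoff plus jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles on every failed attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads out retries from concurrent requests.
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

A caller would wrap the LLM request in a zero-argument coroutine function, for example `await retry_async(lambda: client_call(prompt))`, where `client_call` stands for whatever async function issues the request.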

Agent Capabilities — Intelligent Routing

Feature	Detail
Planner Agent	Decides the best strategy for each query type
Code Search Tool	Targeted search for specific functions, classes, or patterns
Dependency Finder	AST-based analysis of what imports, calls, or depends on a given symbol
Function Explainer	Detailed breakdown: purpose, parameters, return value, side effects
Reflection Loop	Agent evaluates its own answers: `GOOD` / `RETRY` / `EXPAND`
Conversation Memory	Multi-turn context window with recent Q&A history
Step Limiter	Safety cap of 4 agent steps to prevent runaway loops
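
The planner, reflection, and step-limit rows above can be pictured as one small loop. The sketch below is an assumption about the control flow, not the code in app/agent.py: `run_agent`, `plan`, and `reflect` are hypothetical names, and it treats `RETRY` and `EXPAND` alike by handing the trace back to the planner for another step.

```python
from typing import Callable, Dict, List, Tuple

MAX_STEPS = 4  # safety cap described in the Step Limiter row


def run_agent(
    query: str,
    plan: Callable[[str, List[dict]], Tuple[str, str]],
    tools: Dict[str, Callable[[str], str]],
    reflect: Callable[[str, str], str],
    max_steps: int = MAX_STEPS,
) -> dict:
    """Plan, call a tool, reflect, and repeat until GOOD or the step cap."""
    trace: List[dict] = []
    answer = ""
    for step in range(1, max_steps + 1):
        # The planner sees earlier steps so it can switch tools after a bad one.
        tool_name, reason = plan(query, trace)
        answer = tools[tool_name](query)
        verdict = reflect(query, answer)  # "GOOD", "RETRY", or "EXPAND"
        trace.append(
            {"step": step, "tool": tool_name, "reason": reason, "reflection": verdict}
        )
        if verdict == "GOOD":
            break
    return {"answer": answer, "tool_trace": trace, "steps": len(trace)}
```

The returned dictionary mirrors the `answer`, `tool_trace`, and `steps` fields of the agent response; conversation memory is left out to keep the sketch short.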

Performance & Observability

Feature	Detail
Streaming API	Token-by-token response via Server-Sent Events
Semantic Cache	Embedding similarity cache (threshold 0.92) — catches paraphrase queries
Redis Cache	Fast exact-match cache for repeated queries
Async Throughout	Fully async FastAPI with `asyncio` for non-blocking I/O
Prometheus Metrics	Request counts, latency histograms, cache hit rates, error rates
Structured Logging	`structlog` with JSON output for log aggregation
Cost Tracking	Token usage and estimated cost per query via `/stats`

Quick Start

Prerequisites

Python 3.10+
OpenAI API Key
Docker & Docker Compose (for Qdrant + Redis)

1. Clone and Install

git clone https://github.com/alihashim786/agentic-ai-codebase-assistant.git
cd agentic-ai-codebase-assistant
pip install -r requirements.txt

2. Configure Environment

cp .env.example .env

Edit .env:

OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini
QDRANT_URL=http://localhost:6333
REDIS_URL=redis://localhost:6379

3. Start Infrastructure

docker-compose -f docker/docker-compose.yml up -d
# Starts Qdrant (vector DB) and Redis (cache). Run from the repository root so the next step finds app/main.py.

4. Start the API Server

python app/main.py
# API available at http://localhost:8000
# Docs at http://localhost:8000/docs

5. Ingest a Codebase

Point it at any Python project directory:

curl -X POST http://localhost:8000/ingest \
  -H "Content-Type: application/json" \
  -d '{"path": "/path/to/your/python/project"}'

The ingestion pipeline:

Recursively loads all .py files
Parses each file with Python's AST module
Extracts functions, classes, and methods with full metadata
Indexes into Qdrant (vector) + BM25 (keyword) stores
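
The first two ingestion steps can be sketched with the standard library alone. This is an illustrative sketch rather than the contents of ingestion/loaders.py: the function name `load_python_files` is hypothetical, and skipping files that fail to parse is an assumed policy.

```python
import ast
from pathlib import Path
from typing import Iterator, Tuple, Union


def load_python_files(root: Union[str, Path]) -> Iterator[Tuple[Path, str]]:
    """Yield (path, source) for every .py file under root that parses cleanly."""
    for path in sorted(Path(root).rglob("*.py")):
        try:
            source = path.read_text(encoding="utf-8")
            ast.parse(source)  # validate now so later stages can assume valid code
        except (SyntaxError, ValueError):
            # ValueError covers undecodable bytes; such files are skipped
            # (assumed policy, the real loader may log or raise instead).
            continue
        yield path, source
```

Each yielded source string would then go to the AST chunker and on to the Qdrant and BM25 indexes.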

6. Ask Questions

Standard RAG query:

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "How does the authentication flow work?"}'

Agentic query (planner + tools):

curl -X POST http://localhost:8000/agent \
  -H "Content-Type: application/json" \
  -d '{"query": "What classes depend on UserRepository?"}'

Streaming response:

curl -X POST http://localhost:8000/query/stream \
  -H "Content-Type: application/json" \
  -d '{"query": "Explain the payment processing module"}'
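
The same requests can be sent from Python. The sketch below uses only the standard library and assumes the server from step 4 is running on localhost:8000; the helper names `build_request` and `ask` are illustrative and not part of the project. It covers the JSON endpoints (/query and /agent), not the streaming one.

```python
import json
import urllib.request


def build_request(base_url: str, endpoint: str, query: str) -> urllib.request.Request:
    """Build the POST request that the curl examples send."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, endpoint: str, query: str, timeout: float = 60.0) -> dict:
    """Send the query and return the decoded JSON response."""
    request = build_request(base_url, endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires a running server with an ingested codebase.
    result = ask("http://localhost:8000", "/query", "How does the authentication flow work?")
    print(result["answer"])
```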

API Reference

Method	Endpoint	Description
`POST`	`/ingest`	Ingest a Python codebase from a local path
`POST`	`/query`	RAG query — full pipeline with self-healing
`POST`	`/agent`	Agentic query — planner routes to best tool
`POST`	`/query/stream`	Streaming RAG (token-by-token SSE)
`GET`	`/health`	Health check
`GET`	`/metrics`	Prometheus metrics endpoint
`GET`	`/stats`	Token usage and cost statistics

Query Response Shape

{
  "answer": "The authentication flow starts with...",
  "confidence": 87,
  "grounding": 91,
  "cached": false,
  "sources": [
    {
      "file": "app/auth/middleware.py",
      "name": "verify_token",
      "type": "async_function",
      "lines": "45-78"
    }
  ],
  "metadata": {
    "docs_retrieved": 8,
    "context_length": 3240,
    "prompt_version": "v3"
  }
}

Agent Response Shape

{
  "answer": "UserRepository is imported and used by...",
  "tool_trace": [
    {
      "step": 1,
      "tool": "dependency_finder",
      "reason": "Query asks about dependencies on a class",
      "reflection": "GOOD",
      "result_preview": "UserRepository is used in..."
    }
  ],
  "steps": 1,
  "memory_size": 3
}

Design Decisions

Why AST-Based Chunking Over Text Splitting?

Text splitters destroy code structure — they cut a function in the middle, strip context, and discard metadata. AST parsing extracts each function as a natural semantic unit with its full signature, docstring, call graph, and type annotations. This enables function-level retrieval precision that text splitting fundamentally cannot achieve.

Why Hybrid Search + RRF?

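Vector search is excellent at semantic similarity but fails on exact identifiers (a query for verify_jwt_token is hard to match semantically). BM25 is excellent at exact keyword matching but misses paraphrases. Reciprocal Rank Fusion (RRF) combines the two by rank position rather than raw score: each document receives 1 / (k + rank) from every list it appears in, and the summed values decide the final order. Because only ranks are used, both signals merge into a single ranked list without score normalization problems. RRF is a common choice for hybrid search and is used, for example, by Azure AI Search.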
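
A minimal sketch of Reciprocal Rank Fusion follows. The function name is illustrative, and `k = 60` is the constant commonly used with RRF; whether app/retrieval/hybrid_retriever.py uses the same value is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def reciprocal_rank_fusion(
    rankings: List[List[str]], k: int = 60
) -> List[Tuple[str, float]]:
    """Fuse ranked lists of document ids; each list adds 1 / (k + rank) per id."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_hits = ["verify_jwt_token", "login", "logout"]
vector_hits = ["login", "refresh_session", "verify_jwt_token"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Documents that appear in both lists (here `login` and `verify_jwt_token`) rise above documents found by only one retriever, and the raw BM25 and cosine scores never need to be compared with each other.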

Why Cross-Encoder Reranking?

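Embedding (bi-encoder) models score the query and each document independently, which makes them fast enough to search the whole index but coarse at ordering the top results; they serve as the recall stage. Cross-encoders read the query and a candidate together, which is slower but ranks more precisely. Two-stage retrieval (get 20, rerank to 5) keeps the expensive model on a small candidate set, so the added latency stays small. The often-quoted 20–40% improvement in answer quality depends on the corpus and metric, so measure it with the evaluation suite rather than assuming it.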
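
The second stage can be sketched independently of any model. In the sketch below, `rerank` and `overlap_scorer` are hypothetical names, and the word-overlap scorer is only a stand-in so the example runs without downloading a model. With sentence-transformers, a `CrossEncoder` instance's `predict` method accepts the same list of (query, document) pairs and returns one score per pair; which BGE checkpoint the project loads is not stated here.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[List[Pair]], Sequence[float]],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair jointly and keep the top_n."""
    pairs = [(query, text) for text in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


def overlap_scorer(pairs: List[Pair]) -> List[float]:
    """Toy stand-in for a cross-encoder: counts words shared by query and text."""
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in pairs
    ]


candidates = ["def verify token for a user", "def send email", "class token cache"]
top = rerank("verify user token", candidates, overlap_scorer, top_n=2)
```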

Why Semantic Cache?

Exact-match Redis caching misses functionally identical queries: "What is RAG?" and "What is retrieval augmented generation?" are the same question. An embedding similarity cache with threshold 0.92 catches these, significantly reducing API costs for repeated question patterns.
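
A minimal in-memory sketch of the idea is shown below. The class name `SemanticCache` and the linear scan are assumptions made for illustration; the cache in app/cache.py may store vectors elsewhere and search them differently. The `embed` argument stands for any function that maps text to a vector.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a cached answer when a new query is close enough to an old one."""

    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.92):
        self._embed = embed
        self._threshold = threshold
        self._entries: List[Tuple[Sequence[float], str]] = []

    def put(self, query: str, answer: str) -> None:
        self._entries.append((self._embed(query), answer))

    def get(self, query: str) -> Optional[str]:
        vector = self._embed(query)
        best_score, best_answer = 0.0, None
        for cached_vector, answer in self._entries:
            score = cosine_similarity(vector, cached_vector)
            if score > best_score:
                best_score, best_answer = score, answer
        # Only a match at or above the threshold counts as a cache hit.
        return best_answer if best_score >= self._threshold else None
```

The threshold trades cost against correctness: a lower value saves more API calls but risks returning an answer to a question that only looks similar.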

Why Self-Healing?

Single-pass RAG fails silently — if the retrieved context is poor, the LLM generates a plausible-sounding but wrong answer. Self-healing explicitly scores both confidence and grounding after generation. Low scores trigger automatic retrieval expansion and regeneration, picking the best answer. This is how production RAG achieves reliability.
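
The control flow can be sketched in a few lines. The retrieval depths k=8 and k=15 come from the feature list above; everything else is an assumption made for illustration, including the names `Attempt` and `answer_with_self_healing`, the threshold of 70, and the choice to rank attempts by the lower of the two scores.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Attempt:
    answer: str
    confidence: int   # 0 to 100
    grounding: int    # 0 to 100
    k: int            # retrieval depth used for this attempt

    @property
    def score(self) -> int:
        # Assumed policy: an answer is only as good as its weaker score.
        return min(self.confidence, self.grounding)


def answer_with_self_healing(
    query: str,
    run_once: Callable[[str, int], Attempt],
    k_initial: int = 8,
    k_expanded: int = 15,
    min_score: int = 70,   # hypothetical threshold
) -> Attempt:
    """Run the pipeline once; if the score is low, retry deeper and keep the best."""
    first = run_once(query, k_initial)
    if first.score >= min_score:
        return first
    second = run_once(query, k_expanded)
    return max(first, second, key=lambda attempt: attempt.score)
```

Here `run_once` stands for one full pass of retrieval, generation, and scoring. Keeping the first attempt as a candidate means that expanding retrieval can never make the returned answer worse by this measure.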

Project Structure

agentic-ai-codebase-assistant/
├── app/
│   ├── main.py                     # Entry point, FastAPI app initialization
│   ├── api.py                      # All API route handlers
│   ├── agent.py                    # Agent loop: Planner → Tool → Reflect → Memory
│   ├── rag_pipeline.py             # Core RAG with self-healing (11-step pipeline)
│   ├── cache.py                    # Redis + semantic embedding cache
│   ├── config.py                   # All configuration via env vars
│   ├── resilience.py               # Retry, timeout, circuit breaker
│   ├── retrieval/
│   │   ├── hybrid_retriever.py     # BM25 + Vector + RRF fusion
│   │   ├── reranker.py             # BGE cross-encoder reranker
│   │   ├── multi_query.py          # Query expansion and decomposition
│   │   ├── query_rewriter.py       # LLM-powered query reformulation
│   │   └── retrieval_orchestrator.py  # Full retrieval pipeline
│   ├── tools/
│   │   ├── tool_base.py            # Tool interface and registry
│   │   ├── code_search.py          # Targeted code search tool
│   │   ├── dependency_finder.py    # Dependency/import analysis tool
│   │   └── function_explainer.py   # Detailed function explanation tool
│   ├── prompts/
│   │   └── prompt_manager.py       # Versioned prompt templates
│   ├── memory/
│   │   └── memory.py               # Multi-turn conversation memory
│   └── observability/
│       ├── logger.py               # structlog JSON logging
│       └── metrics.py              # Prometheus counters and histograms
├── ingestion/
│   ├── loaders.py                  # Recursive Python file loader
│   ├── chunking.py                 # AST-based code chunker (key differentiator)
│   └── indexing.py                 # Qdrant + BM25 dual indexing
├── evaluation/
│   ├── datasets/testset.json       # Evaluation test cases
│   ├── eval_runner.py              # RAGAS evaluation runner
│   └── metrics.py                  # Faithfulness, relevancy, precision, recall
├── benchmarks/
│   ├── latency_test.py             # End-to-end latency benchmarks
│   └── load_test.py                # Locust load testing
├── docker/
│   ├── Dockerfile
│   └── docker-compose.yml          # Qdrant + Redis + App
├── scripts/
│   └── ingest_data.py              # CLI script for codebase ingestion
├── tests/
├── requirements.txt
├── .env.example
├── Makefile
└── architecture.md                 # Detailed design doc with all trade-offs

Running Evaluations

After ingesting a codebase:

make eval

Runs RAGAS evaluation across the test set and reports:

Metric	Measures
Faithfulness	Is the answer supported by retrieved context?
Answer Relevancy	Does the answer address the question?
Context Precision	Are retrieved chunks actually relevant?
Context Recall	Did retrieval find all necessary information?

Load Testing

# Run latency benchmarks
python benchmarks/latency_test.py

# Run load test (requires Locust)
locust -f benchmarks/load_test.py --host http://localhost:8000

Tech Stack

Component	Technology	Why
API Framework	FastAPI (async)	Non-blocking I/O, auto OpenAPI docs, streaming
LLM	OpenAI GPT-4o-mini	Best cost/quality ratio for code understanding
Embeddings	text-embedding-3-small	Fast, cost-effective, strong on code
Vector DB	Qdrant	Production-grade, filtering support, self-hostable
Keyword Search	BM25 (rank-bm25)	Exact identifier matching
Reranker	BGE (sentence-transformers)	Cross-encoder precision on top of recall
Cache	Redis + Semantic Cache	Exact + approximate query deduplication
Logging	structlog	JSON structured logs, easy aggregation
Metrics	Prometheus Client	Standard metrics, Grafana compatible
Evaluation	RAGAS	Industry-standard RAG evaluation framework
Load Testing	Locust	Python-native, async support
Container	Docker Compose	One-command local infrastructure

Future Roadmap

Multi-language support (JavaScript, TypeScript, Java, Go)
Interactive call graph visualization
Fine-tuned code embeddings (CodeBERT / UniXcoder)
RAG evaluation dashboard with trend tracking
Auto prompt optimization using DSPy
Cost optimization engine (dynamic model routing)
GitHub Actions CI with evaluation gate

License

This project is licensed under the MIT License — see LICENSE for details.
