A production-grade Agentic RAG system that lets you have intelligent conversations with any Python codebase — powered by hybrid search, AST-based code parsing, self-healing retrieval, and a multi-step agent with reflection.
Why This Project • Architecture • Features • Quick Start • API Reference • Design Decisions • Tech Stack
Most RAG demos are toys. They do: embed → retrieve → generate and call it done. Real codebases break that pattern in seconds.
This system is built on what production AI systems (Cursor, Perplexity, GitHub Copilot) actually use:
- Hybrid search — Vector search misses exact function names. BM25 misses semantic meaning. Combined with Reciprocal Rank Fusion, you get both.
- AST-based chunking — Text splitters destroy code structure. This system parses Python at the AST level, extracting functions with their arguments, decorators, call graphs, and docstrings.
- Self-healing RAG — When confidence is low, the system automatically expands retrieval, regenerates, and picks the best answer.
- Agentic routing — Not every question needs retrieval. The planner decides: code search, dependency analysis, function explanation, or full RAG.
- Hallucination detection — Every answer is verified against its source context before being returned.
For Recruiters: This project demonstrates production ML engineering skills — not just calling an LLM API, but building the full reliability and quality stack around it: observability, caching, self-healing, agent loops, evaluation, and load testing.
              User Query
                   │
                   ▼
          ┌─────────────────┐
          │     FastAPI     │  Async, Streaming, Health, Metrics
          └────────┬────────┘
                   │
                   ▼
    ┌──────────────────────────────┐
    │        Agent Planner         │
    │  Decides HOW to answer:      │
    │  ┌────────────────────────┐  │
    │  │ • Code Search Tool     │  │
    │  │ • Dependency Finder    │  │
    │  │ • Function Explainer   │  │
    │  │ • Full RAG Pipeline    │  │
    │  └────────────────────────┘  │
    │  Memory  │  Reflection Loop  │
    └──────────────┬───────────────┘
                   │
          ┌────────▼────────┐
          │  RAG Pipeline   │
          └────────┬────────┘
                   │
    ┌──────────────▼──────────────────┐
    │         Retrieval Layer         │
    │  Query Rewrite → Decompose      │
    │  Multi-Query Expansion          │
    │  ┌───────────────────────────┐  │
    │  │ BM25 Keyword Search       │  │
    │  │ Vector Similarity Search  │  │
    │  │ RRF Fusion + Reranking    │  │
    │  └───────────────────────────┘  │
    └──────────────┬──────────────────┘
                   │
    ┌──────────────▼──────────────────┐
    │         Quality Control         │
    │  Context Compression            │
    │  LLM Generation (versioned)     │
    │  Reflection → Improvement       │
    │  Verification (SUPPORTED check) │
    │  Hallucination Detection        │
    │  Confidence Score (0–100)       │
    │  Grounding Score (0–100)        │
    │  Self-Healing Retry if low      │
    └──────────────┬──────────────────┘
                   │
                   ▼
          ┌─────────────────┐
          │    Response     │
          │  answer +       │
          │  confidence +   │
          │  sources +      │
          │  tool trace     │
          └─────────────────┘
| Feature | Detail |
|---|---|
| AST-Based Chunking | Extracts functions, classes, and methods with full metadata: arguments, return types, decorators, call graphs, docstrings |
| Hybrid BM25 + Vector Search | Exact keyword matching + semantic similarity, run in parallel |
| Reciprocal Rank Fusion (RRF) | Industry-standard score-agnostic fusion. Used by Microsoft and Meta RAG systems |
| Multi-Query Expansion | Complex queries decomposed and expanded to maximize recall |
| Query Rewriting | LLM-powered query reformulation for better retrieval |
| Cross-Encoder Reranking | BGE reranker re-scores top candidates for precision (retrieve 20 → rerank to 5) |
| Adaptive Retrieval Depth | Automatically expands from k=8 to k=15 during self-healing |
| Feature | Detail |
|---|---|
| Answer Verification | Checks if the answer is SUPPORTED by the retrieved context |
| Hallucination Detection | Explicit LLM check: does the answer make claims not in the source? |
| Confidence Scoring | 0–100 score quantifying answer certainty |
| Grounding Score | 0–100 score measuring faithfulness to retrieved context |
| Self-Healing Loop | Low score → expand retrieval → regenerate → pick best → retry |
| Circuit Breaker Pattern | Prevents cascading failures in the retrieval/generation pipeline |
| Retry with Backoff | Async retry logic for transient LLM API failures |
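The retry behaviour in the last row can be sketched with plain asyncio. This is a minimal illustration, not the code in app/resilience.py: the function name `retry_async`, its parameters, and the choice of retryable exception types are all assumptions.

```python
import asyncio
import random


async def retry_async(func, *, attempts=3, base_delay=0.5, max_delay=8.0,
                      retry_on=(ConnectionError, TimeoutError)):
    """Await func(), retrying transient failures with exponential backoff.

    Hypothetical helper: names and defaults are illustrative only.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each attempt (base, 2x, 4x, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter keeps concurrent callers from retrying in lockstep.
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))
```

A call site would wrap the LLM request in a zero-argument coroutine function, for example `await retry_async(lambda: client_call(prompt))`, where `client_call` stands in for whatever async client the pipeline uses.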
| Feature | Detail |
|---|---|
| Planner Agent | Decides the best strategy for each query type |
| Code Search Tool | Targeted search for specific functions, classes, or patterns |
| Dependency Finder | AST-based analysis of what imports, calls, or depends on a given symbol |
| Function Explainer | Detailed breakdown: purpose, parameters, return value, side effects |
| Reflection Loop | Agent evaluates its own answers: GOOD / RETRY / EXPAND |
| Conversation Memory | Multi-turn context window with recent Q&A history |
| Step Limiter | Safety cap of 4 agent steps to prevent runaway loops |
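The planner, reflection loop, and step limiter fit together roughly as follows. This is a sketch under stated assumptions: `run_agent`, the callable signatures, and the handling of RETRY versus EXPAND are hypothetical, and only the GOOD / RETRY / EXPAND verdicts and the cap of 4 steps come from the table above.

```python
MAX_STEPS = 4  # safety cap from the feature table


def run_agent(query, plan, tools, reflect, max_steps=MAX_STEPS):
    """Plan -> tool -> reflect loop that stops on GOOD or at the step cap.

    Assumed interfaces (illustrative only):
      plan(query, trace)        -> name of the tool to run next
      tools[name](query, expand) -> answer string
      reflect(query, answer)    -> "GOOD", "RETRY", or "EXPAND"
    """
    trace, answer, expand = [], "", False
    for step in range(1, max_steps + 1):
        tool_name = plan(query, trace)
        answer = tools[tool_name](query, expand)
        verdict = reflect(query, answer)
        trace.append({"step": step, "tool": tool_name, "reflection": verdict})
        if verdict == "GOOD":
            break
        # EXPAND asks the next tool call to search wider;
        # RETRY simply lets the planner choose again.
        expand = verdict == "EXPAND"
    return {"answer": answer, "tool_trace": trace, "steps": len(trace)}
```

Passing the trace back into the planner lets it avoid repeating a tool that already failed, and the hard cap guarantees termination even if reflection never returns GOOD.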
| Feature | Detail |
|---|---|
| Streaming API | Token-by-token response via Server-Sent Events |
| Semantic Cache | Embedding similarity cache (threshold 0.92) — catches paraphrase queries |
| Redis Cache | Fast exact-match cache for repeated queries |
| Async Throughout | Fully async FastAPI with asyncio for non-blocking I/O |
| Prometheus Metrics | Request counts, latency histograms, cache hit rates, error rates |
| Structured Logging | structlog with JSON output for log aggregation |
| Cost Tracking | Token usage and estimated cost per query via /stats |
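For the streaming row, the Server-Sent Events wire format is simple enough to sketch with the standard library. The generator name `sse_events`, the `{"token": ...}` payload shape, and the `[DONE]` sentinel are assumptions for illustration, not the project's actual protocol.

```python
import json


def sse_events(tokens):
    """Format an iterable of tokens as Server-Sent Events frames.

    Each frame is a "data: ..." line followed by a blank line, which is
    how SSE delimits events. Payload shape and sentinel are hypothetical.
    """
    for token in tokens:
        yield f"data: {json.dumps({'token': token})}\n\n"
    # Sentinel frame so clients know the stream has ended.
    yield "data: [DONE]\n\n"
```

In a FastAPI handler, a generator like this could be returned through `StreamingResponse(..., media_type="text/event-stream")`.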
- Python 3.10+
- OpenAI API Key
- Docker & Docker Compose (for Qdrant + Redis)
git clone https://github.com/alihashim786/agentic-ai-codebase-assistant.git
cd agentic-ai-codebase-assistant
pip install -r requirements.txt
cp .env.example .env
Edit .env:
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini
QDRANT_URL=http://localhost:6333
REDIS_URL=redis://localhost:6379
cd docker && docker-compose up -d
# Starts Qdrant (vector DB) and Redis (cache)
python app/main.py
# API available at http://localhost:8000
# Docs at http://localhost:8000/docs
Point it at any Python project directory:
curl -X POST http://localhost:8000/ingest \
-H "Content-Type: application/json" \
-d '{"path": "/path/to/your/python/project"}'
The ingestion pipeline:
- Recursively loads all .py files
- Parses each file with Python's AST module
- Extracts functions, classes, and methods with full metadata
- Indexes into Qdrant (vector) + BM25 (keyword) stores
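The first two steps of the pipeline above can be sketched with only the standard library. This is an illustration rather than the contents of ingestion/loaders.py: the function name `load_python_modules` and the decision to skip unparsable files are assumptions.

```python
import ast
from pathlib import Path


def load_python_modules(root):
    """Yield (path, parsed AST) for every .py file under root.

    Hypothetical helper: the real loader may filter or report differently.
    """
    for path in sorted(Path(root).rglob("*.py")):
        source = path.read_text(encoding="utf-8", errors="replace")
        try:
            yield path, ast.parse(source, filename=str(path))
        except SyntaxError:
            # Skip files that do not parse instead of aborting ingestion.
            continue
```

Skipping on `SyntaxError` keeps one broken file from stopping ingestion of a large repository. A real pipeline would probably log the skipped path as well.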
Standard RAG query:
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"query": "How does the authentication flow work?"}'
Agentic query (planner + tools):
curl -X POST http://localhost:8000/agent \
-H "Content-Type: application/json" \
-d '{"query": "What classes depend on UserRepository?"}'
Streaming response:
curl -X POST http://localhost:8000/query/stream \
-H "Content-Type: application/json" \
-d '{"query": "Explain the payment processing module"}'
| Method | Endpoint | Description |
|---|---|---|
| POST | /ingest | Ingest a Python codebase from a local path |
| POST | /query | RAG query: full pipeline with self-healing |
| POST | /agent | Agentic query: planner routes to best tool |
| POST | /query/stream | Streaming RAG (token-by-token SSE) |
| GET | /health | Health check |
| GET | /metrics | Prometheus metrics endpoint |
| GET | /stats | Token usage and cost statistics |
Example /query response:
{
"answer": "The authentication flow starts with...",
"confidence": 87,
"grounding": 91,
"cached": false,
"sources": [
{
"file": "app/auth/middleware.py",
"name": "verify_token",
"type": "async_function",
"lines": "45-78"
}
],
"metadata": {
"docs_retrieved": 8,
"context_length": 3240,
"prompt_version": "v3"
}
}
Example /agent response:
{
"answer": "UserRepository is imported and used by...",
"tool_trace": [
{
"step": 1,
"tool": "dependency_finder",
"reason": "Query asks about dependencies on a class",
"reflection": "GOOD",
"result_preview": "UserRepository is used in..."
}
],
"steps": 1,
"memory_size": 3
}
Text splitters destroy code structure: they cut a function in the middle, strip context, and discard metadata. AST parsing extracts each function as a natural semantic unit with its full signature, docstring, call graph, and type annotations. This enables function-level retrieval precision that text splitting fundamentally cannot achieve.
Vector search is excellent at semantic similarity but fails on exact identifiers: a name such as verify_jwt_token carries little meaning for an embedding model and is hard to match semantically. BM25 is excellent at exact keyword matching but misses paraphrases. Combined with Reciprocal Rank Fusion (the industry standard, used by Microsoft and Meta), both signals merge into a single ranked list. Because RRF uses only each document's rank, not its raw score, it avoids score normalization problems.
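Reciprocal Rank Fusion itself is only a few lines. In this sketch each document scores the sum of 1 / (k + rank) over every list it appears in. The function name and the tie-breaking rule are assumptions, and k = 60 is the commonly used default rather than a value confirmed for this project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranked list.

    score(d) = sum over lists of 1 / (k + rank of d in that list),
    with ranks starting at 1. Only ranks are used, never raw scores.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

A document that ranks well in both the BM25 list and the vector list beats one that tops only a single list, which is the behaviour hybrid search relies on.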
Embedding models are trained for recall (find all relevant docs). Cross-encoders are trained for precision (rank the best at the top). Two-stage retrieval (get 20, rerank to 5) typically improves answer quality by 20–40% with minimal latency overhead.
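The two-stage shape can be sketched independently of any particular model. Here `retrieve` and `score_pair` are hypothetical callables supplied by the caller. In the real system the second would be backed by the BGE cross-encoder, which this sketch does not load.

```python
def retrieve_then_rerank(query, retrieve, score_pair, fetch_k=20, top_n=5):
    """Stage 1: cheap, high-recall retrieval. Stage 2: precise re-scoring.

    Assumed interfaces (illustrative only):
      retrieve(query, k)     -> list of up to k candidate documents
      score_pair(query, doc) -> relevance score, higher is better
    """
    candidates = retrieve(query, fetch_k)
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    # Stable sort: equally scored candidates keep their stage-1 order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]
```

The expensive scorer runs on at most `fetch_k` pairs per query, which is what keeps the added latency bounded.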
Exact-match Redis caching misses functionally identical queries: "What is RAG?" and "What is retrieval augmented generation?" are the same question. An embedding similarity cache with threshold 0.92 catches these, significantly reducing API costs for repeated question patterns.
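An embedding similarity cache can be sketched in pure Python. The class below is an assumption-laden illustration, not app/cache.py: `SemanticCache`, its linear scan, and the injected `embed` callable are hypothetical, and only the 0.92 threshold comes from the text.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a cached answer when a new query is close enough to an old one."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed          # callable: text -> list of floats (assumed)
        self.threshold = threshold  # minimum cosine similarity for a hit
        self.entries = []           # list of (embedding, answer) pairs

    def get(self, query):
        vector = self.embed(query)
        best_score, best_answer = 0.0, None
        for cached_vector, answer in self.entries:
            score = cosine_similarity(vector, cached_vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))
```

A linear scan is fine for a sketch. At scale the lookup would more plausibly go through the vector store that is already deployed.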
Single-pass RAG fails silently — if the retrieved context is poor, the LLM generates a plausible-sounding but wrong answer. Self-healing explicitly scores both confidence and grounding after generation. Low scores trigger automatic retrieval expansion and regeneration, picking the best answer. This is how production RAG achieves reliability.
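The self-healing step reduces to a score check followed by one wider retry. This is a sketch, not the 11-step pipeline in app/rag_pipeline.py: the callables, the `min_score` threshold of 70, and combining confidence and grounding by taking the minimum are all assumptions, while k=8 and k=15 come from the feature table.

```python
def answer_with_self_healing(query, retrieve, generate, score,
                             k=8, expanded_k=15, min_score=70):
    """Generate once; on a low score, retry with deeper retrieval, keep the best.

    Assumed interfaces (illustrative only):
      retrieve(query, k)     -> list of documents
      generate(query, docs)  -> answer string
      score(answer, docs)    -> (confidence, grounding), each 0-100
    """
    docs = retrieve(query, k)
    best_answer = generate(query, docs)
    best_score = min(score(best_answer, docs))  # weakest signal decides
    retried = False
    if best_score < min_score:
        retried = True
        wider_docs = retrieve(query, expanded_k)
        retry_answer = generate(query, wider_docs)
        retry_score = min(score(retry_answer, wider_docs))
        if retry_score > best_score:
            best_answer, best_score = retry_answer, retry_score
    return {"answer": best_answer, "score": best_score, "retried": retried}
```

Keeping the first answer when the retry scores no better means expansion can never make the result worse than the single-pass baseline.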
agentic-ai-codebase-assistant/
├── app/
│ ├── main.py # Entry point, FastAPI app initialization
│ ├── api.py # All API route handlers
│ ├── agent.py # Agent loop: Planner → Tool → Reflect → Memory
│ ├── rag_pipeline.py # Core RAG with self-healing (11-step pipeline)
│ ├── cache.py # Redis + semantic embedding cache
│ ├── config.py # All configuration via env vars
│ ├── resilience.py # Retry, timeout, circuit breaker
│ ├── retrieval/
│ │ ├── hybrid_retriever.py # BM25 + Vector + RRF fusion
│ │ ├── reranker.py # BGE cross-encoder reranker
│ │ ├── multi_query.py # Query expansion and decomposition
│ │ ├── query_rewriter.py # LLM-powered query reformulation
│ │ └── retrieval_orchestrator.py # Full retrieval pipeline
│ ├── tools/
│ │ ├── tool_base.py # Tool interface and registry
│ │ ├── code_search.py # Targeted code search tool
│ │ ├── dependency_finder.py # Dependency/import analysis tool
│ │ └── function_explainer.py # Detailed function explanation tool
│ ├── prompts/
│ │ └── prompt_manager.py # Versioned prompt templates
│ ├── memory/
│ │ └── memory.py # Multi-turn conversation memory
│ └── observability/
│ ├── logger.py # structlog JSON logging
│ └── metrics.py # Prometheus counters and histograms
├── ingestion/
│ ├── loaders.py # Recursive Python file loader
│ ├── chunking.py # AST-based code chunker (key differentiator)
│ └── indexing.py # Qdrant + BM25 dual indexing
├── evaluation/
│ ├── datasets/testset.json # Evaluation test cases
│ ├── eval_runner.py # RAGAS evaluation runner
│ └── metrics.py # Faithfulness, relevancy, precision, recall
├── benchmarks/
│ ├── latency_test.py # End-to-end latency benchmarks
│ └── load_test.py # Locust load testing
├── docker/
│ ├── Dockerfile
│ └── docker-compose.yml # Qdrant + Redis + App
├── scripts/
│ └── ingest_data.py # CLI script for codebase ingestion
├── tests/
├── requirements.txt
├── .env.example
├── Makefile
└── architecture.md # Detailed design doc with all trade-offs
After ingesting a codebase:
make eval
Runs RAGAS evaluation across the test set and reports:
| Metric | Measures |
|---|---|
| Faithfulness | Is the answer supported by retrieved context? |
| Answer Relevancy | Does the answer address the question? |
| Context Precision | Are retrieved chunks actually relevant? |
| Context Recall | Did retrieval find all necessary information? |
# Run latency benchmarks
python benchmarks/latency_test.py
# Run load test (requires Locust)
locust -f benchmarks/load_test.py --host http://localhost:8000
| Component | Technology | Why |
|---|---|---|
| API Framework | FastAPI (async) | Non-blocking I/O, auto OpenAPI docs, streaming |
| LLM | OpenAI GPT-4o-mini | Best cost/quality ratio for code understanding |
| Embeddings | text-embedding-3-small | Fast, cost-effective, strong on code |
| Vector DB | Qdrant | Production-grade, filtering support, self-hostable |
| Keyword Search | BM25 (rank-bm25) | Exact identifier matching |
| Reranker | BGE (sentence-transformers) | Cross-encoder precision on top of recall |
| Cache | Redis + Semantic Cache | Exact + approximate query deduplication |
| Logging | structlog | JSON structured logs, easy aggregation |
| Metrics | Prometheus Client | Standard metrics, Grafana compatible |
| Evaluation | RAGAS | Industry-standard RAG evaluation framework |
| Load Testing | Locust | Python-native, async support |
| Container | Docker Compose | One-command local infrastructure |
- Multi-language support (JavaScript, TypeScript, Java, Go)
- Interactive call graph visualization
- Fine-tuned code embeddings (CodeBERT / UniXcoder)
- RAG evaluation dashboard with trend tracking
- Auto prompt optimization using DSPy
- Cost optimization engine (dynamic model routing)
- GitHub Actions CI with evaluation gate
This project is licensed under the MIT License — see LICENSE for details.