alihashim786/agentic-rag-codebase-assistant

Agentic AI Codebase Assistant

Python FastAPI OpenAI Qdrant Redis Docker Prometheus

A production-grade Agentic RAG system that lets you have intelligent conversations with any Python codebase — powered by hybrid search, AST-based code parsing, self-healing retrieval, and a multi-step agent with reflection.

Why This Project · Architecture · Features · Quick Start · API Reference · Design Decisions · Tech Stack


Why This Project?

Most RAG demos are toys. They do: embed → retrieve → generate and call it done. Real codebases break that pattern in seconds.

This system is built on what production AI systems (Cursor, Perplexity, GitHub Copilot) actually use:

  • Hybrid search — Vector search misses exact function names. BM25 misses semantic meaning. Combined with Reciprocal Rank Fusion, you get both.
  • AST-based chunking — Text splitters destroy code structure. This system parses Python at the AST level, extracting functions with their arguments, decorators, call graphs, and docstrings.
  • Self-healing RAG — When confidence is low, the system automatically expands retrieval, regenerates, and picks the best answer.
  • Agentic routing — Not every question needs retrieval. The planner decides: code search, dependency analysis, function explanation, or full RAG.
  • Hallucination detection — Every answer is verified against its source context before being returned.

For Recruiters: This project demonstrates production ML engineering skills — not just calling an LLM API, but building the full reliability and quality stack around it: observability, caching, self-healing, agent loops, evaluation, and load testing.


Architecture

                         User Query
                              │
                              ▼
                    ┌─────────────────┐
                    │   FastAPI       │  Async, Streaming, Health, Metrics
                    └────────┬────────┘
                             │
                             ▼
              ┌──────────────────────────────┐
              │       Agent Planner          │
              │  Decides HOW to answer:      │
              │  ┌────────────────────────┐  │
              │  │ • Code Search Tool     │  │
              │  │ • Dependency Finder    │  │
              │  │ • Function Explainer   │  │
              │  │ • Full RAG Pipeline    │  │
              │  └────────────────────────┘  │
              │  Memory │ Reflection Loop     │
              └──────────────┬───────────────┘
                             │
                    ┌────────▼────────┐
                    │  RAG Pipeline   │
                    └────────┬────────┘
                             │
              ┌──────────────▼──────────────────┐
              │         Retrieval Layer          │
              │  Query Rewrite → Decompose       │
              │  Multi-Query Expansion           │
              │  ┌───────────────────────────┐   │
              │  │  BM25 Keyword Search      │   │
              │  │  Vector Similarity Search │   │
              │  │  RRF Fusion + Reranking   │   │
              │  └───────────────────────────┘   │
              └──────────────┬──────────────────┘
                             │
              ┌──────────────▼──────────────────┐
              │       Quality Control            │
              │  Context Compression             │
              │  LLM Generation (versioned)      │
              │  Reflection → Improvement        │
              │  Verification (SUPPORTED check)  │
              │  Hallucination Detection         │
              │  Confidence Score (0–100)        │
              │  Grounding Score (0–100)         │
              │  Self-Healing Retry if low       │
              └──────────────┬──────────────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │    Response     │
                    │  answer +       │
                    │  confidence +   │
                    │  sources +      │
                    │  tool trace     │
                    └─────────────────┘

Features

Retrieval — Finding the Right Code

| Feature | Detail |
| --- | --- |
| AST-Based Chunking | Extracts functions, classes, and methods with full metadata: arguments, return types, decorators, call graphs, docstrings |
| Hybrid BM25 + Vector Search | Exact keyword matching + semantic similarity, run in parallel |
| Reciprocal Rank Fusion (RRF) | Industry-standard score-agnostic fusion. Used by Microsoft and Meta RAG systems |
| Multi-Query Expansion | Complex queries decomposed and expanded to maximize recall |
| Query Rewriting | LLM-powered query reformulation for better retrieval |
| Cross-Encoder Reranking | BGE reranker re-scores top candidates for precision (retrieve 20 → rerank to 5) |
| Adaptive Retrieval Depth | Automatically expands from k=8 to k=15 during self-healing |

Reliability — Answers You Can Trust

| Feature | Detail |
| --- | --- |
| Answer Verification | Checks if the answer is SUPPORTED by the retrieved context |
| Hallucination Detection | Explicit LLM check: does the answer make claims not in the source? |
| Confidence Scoring | 0–100 score quantifying answer certainty |
| Grounding Score | 0–100 score measuring faithfulness to retrieved context |
| Self-Healing Loop | Low score → expand retrieval → regenerate → pick best → retry |
| Circuit Breaker Pattern | Prevents cascading failures in the retrieval/generation pipeline |
| Retry with Backoff | Async retry logic for transient LLM API failures |
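
To make the "Retry with Backoff" row concrete, here is a minimal sketch of async retry with exponential backoff and jitter. It is an illustration only, not the code in app/resilience.py: the name retry_with_backoff, its parameters, and the choice of retryable exceptions are all assumptions.

```python
import asyncio
import random


async def retry_with_backoff(call, *, attempts=3, base_delay=0.5, max_delay=8.0,
                             retry_on=(ConnectionError, TimeoutError)):
    """Await call() until it succeeds or the attempts run out.

    `call` is a zero-argument coroutine function (hypothetical stand-in
    for an LLM API request). Only exceptions listed in `retry_on` are
    treated as transient; anything else propagates immediately.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff capped at max_delay, with full jitter
            # so concurrent callers do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            await asyncio.sleep(random.uniform(0, delay))
```

A caller would wrap the request in a small coroutine function and pass it in, for example `await retry_with_backoff(lambda: client_call(prompt))`, where client_call is whatever async function performs the request.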

Agent Capabilities — Intelligent Routing

| Feature | Detail |
| --- | --- |
| Planner Agent | Decides the best strategy for each query type |
| Code Search Tool | Targeted search for specific functions, classes, or patterns |
| Dependency Finder | AST-based analysis of what imports, calls, or depends on a given symbol |
| Function Explainer | Detailed breakdown: purpose, parameters, return value, side effects |
| Reflection Loop | Agent evaluates its own answers: GOOD / RETRY / EXPAND |
| Conversation Memory | Multi-turn context window with recent Q&A history |
| Step Limiter | Safety cap of 4 agent steps to prevent runaway loops |
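
The planner, reflection loop, and step limiter fit together roughly as in the sketch below. It is a simplified illustration under stated assumptions: plan, tools, and reflect are hypothetical callables standing in for the LLM-backed components, and the real loop in app/agent.py may differ.

```python
from dataclasses import dataclass, field

MAX_STEPS = 4  # safety cap from the feature table above


@dataclass
class AgentResult:
    answer: str
    tool_trace: list = field(default_factory=list)


def run_agent(query, plan, tools, reflect, max_steps=MAX_STEPS):
    """Plan -> run tool -> reflect, stopping on GOOD or at the step cap.

    plan(query, trace) returns a tool name, tools maps names to
    callables, and reflect(query, answer) returns "GOOD", "RETRY",
    or "EXPAND". All three are assumptions made for this sketch.
    """
    trace = []
    answer = ""
    for step in range(1, max_steps + 1):
        tool_name = plan(query, trace)       # planner sees earlier steps
        answer = tools[tool_name](query)
        verdict = reflect(query, answer)
        trace.append({"step": step, "tool": tool_name, "reflection": verdict})
        if verdict == "GOOD":
            break                            # accept the answer
    return AgentResult(answer=answer, tool_trace=trace)
```

Because the loop is bounded by max_steps, a reflection step that never returns GOOD still terminates after four tool calls and returns the last answer with its full trace.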

Performance & Observability

| Feature | Detail |
| --- | --- |
| Streaming API | Token-by-token response via Server-Sent Events |
| Semantic Cache | Embedding similarity cache (threshold 0.92): catches paraphrase queries |
| Redis Cache | Fast exact-match cache for repeated queries |
| Async Throughout | Fully async FastAPI with asyncio for non-blocking I/O |
| Prometheus Metrics | Request counts, latency histograms, cache hit rates, error rates |
| Structured Logging | structlog with JSON output for log aggregation |
| Cost Tracking | Token usage and estimated cost per query via /stats |

Quick Start

Prerequisites

  • Python 3.10+
  • OpenAI API Key
  • Docker & Docker Compose (for Qdrant + Redis)

1. Clone and Install

git clone https://github.com/alihashim786/agentic-ai-codebase-assistant.git
cd agentic-ai-codebase-assistant
pip install -r requirements.txt

2. Configure Environment

cp .env.example .env

Edit .env:

OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini
QDRANT_URL=http://localhost:6333
REDIS_URL=redis://localhost:6379

3. Start Infrastructure

cd docker && docker-compose up -d && cd ..
# Starts Qdrant (vector DB) and Redis (cache)

4. Start the API Server

python app/main.py
# API available at http://localhost:8000
# Docs at http://localhost:8000/docs

5. Ingest a Codebase

Point it at any Python project directory:

curl -X POST http://localhost:8000/ingest \
  -H "Content-Type: application/json" \
  -d '{"path": "/path/to/your/python/project"}'

The ingestion pipeline:

  1. Recursively loads all .py files
  2. Parses each file with Python's AST module
  3. Extracts functions, classes, and methods with full metadata
  4. Indexes into Qdrant (vector) + BM25 (keyword) stores

6. Ask Questions

Standard RAG query:

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "How does the authentication flow work?"}'

Agentic query (planner + tools):

curl -X POST http://localhost:8000/agent \
  -H "Content-Type: application/json" \
  -d '{"query": "What classes depend on UserRepository?"}'

Streaming response:

curl -X POST http://localhost:8000/query/stream \
  -H "Content-Type: application/json" \
  -d '{"query": "Explain the payment processing module"}'

API Reference

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /ingest | Ingest a Python codebase from a local path |
| POST | /query | RAG query: full pipeline with self-healing |
| POST | /agent | Agentic query: planner routes to best tool |
| POST | /query/stream | Streaming RAG (token-by-token SSE) |
| GET | /health | Health check |
| GET | /metrics | Prometheus metrics endpoint |
| GET | /stats | Token usage and cost statistics |

Query Response Shape

{
  "answer": "The authentication flow starts with...",
  "confidence": 87,
  "grounding": 91,
  "cached": false,
  "sources": [
    {
      "file": "app/auth/middleware.py",
      "name": "verify_token",
      "type": "async_function",
      "lines": "45-78"
    }
  ],
  "metadata": {
    "docs_retrieved": 8,
    "context_length": 3240,
    "prompt_version": "v3"
  }
}
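
On the client side, the sources array can be turned into compact citations. The helper below is a small sketch that assumes the response shape shown above; format_citations is a hypothetical name and not part of the API.

```python
import json


def format_citations(raw_response: str) -> list[str]:
    """Render each source as 'file:lines (name)' from a /query response body."""
    data = json.loads(raw_response)
    return [
        f'{src["file"]}:{src["lines"]} ({src["name"]})'
        for src in data.get("sources", [])
    ]
```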

Agent Response Shape

{
  "answer": "UserRepository is imported and used by...",
  "tool_trace": [
    {
      "step": 1,
      "tool": "dependency_finder",
      "reason": "Query asks about dependencies on a class",
      "reflection": "GOOD",
      "result_preview": "UserRepository is used in..."
    }
  ],
  "steps": 1,
  "memory_size": 3
}

Design Decisions

Why AST-Based Chunking Over Text Splitting?

Text splitters destroy code structure — they cut a function in the middle, strip context, and discard metadata. AST parsing extracts each function as a natural semantic unit with its full signature, docstring, call graph, and type annotations. This enables function-level retrieval precision that text splitting fundamentally cannot achieve.

Why Hybrid Search + RRF?

Vector search is excellent at semantic similarity but fails on exact identifiers (verify_jwt_token → hard to find semantically). BM25 is excellent at exact keyword matching but misses paraphrases. Combined with Reciprocal Rank Fusion (the industry standard, used by Microsoft and Meta), both signals merge into a single ranked list without score normalization problems.
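
As a rough illustration, Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) over every ranked list it appears in, so only rank positions matter and raw BM25 and cosine scores never need to be put on a common scale. The sketch below assumes the conventional constant k=60; the value used in app/retrieval/hybrid_retriever.py is not stated here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranked list.

    `rankings` is a list of lists, each ordered best-first.
    k=60 is the commonly used default and an assumption for this project.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

A document that ranks near the top of both lists beats one that is first in only one of them, which is the behavior hybrid search relies on.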

Why Cross-Encoder Reranking?

Embedding models are trained for recall (find all relevant docs). Cross-encoders are trained for precision (rank the best at the top). Two-stage retrieval (get 20, rerank to 5) typically improves answer quality by 20–40% with minimal latency overhead.
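
The second stage can be sketched as a plain re-scoring step. In the sketch below, score_pair is a hypothetical callable that returns a relevance score for one (query, passage) pair; in this project that role would be played by the BGE cross-encoder, which is deliberately left out so the example stays dependency-free.

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Re-score first-stage candidates and keep the best `top_n`.

    `candidates` is the wide, recall-oriented list (for example 20
    passages); `score_pair(query, passage)` is a stand-in for a
    cross-encoder and is an assumption of this sketch.
    """
    scored = [(score_pair(query, passage), passage) for passage in candidates]
    # Stable sort: passages with equal scores keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_n]]
```

The cross-encoder sees the query and passage together, which is slower per pair than comparing precomputed embeddings, so it is applied only to the short candidate list.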

Why Semantic Cache?

Exact-match Redis caching misses functionally identical queries: "What is RAG?" and "What is retrieval augmented generation?" are the same question. An embedding similarity cache with threshold 0.92 catches these, significantly reducing API costs for repeated question patterns.
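
A minimal in-memory sketch of the idea follows. The SemanticCache class, its embed parameter, and the linear scan are assumptions made for illustration; app/cache.py may store and search embeddings differently.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.92):
        self._embed = embed          # hypothetical: text -> list of floats
        self._threshold = threshold  # 0.92, as stated above
        self._entries = []           # (embedding, answer) pairs

    def put(self, query, answer):
        self._entries.append((self._embed(query), answer))

    def get(self, query):
        """Return the cached answer of the most similar query, or None."""
        query_vec = self._embed(query)
        best_score, best_answer = -1.0, None
        for vec, answer in self._entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self._threshold else None
```

The threshold is the main trade-off: a lower value raises the hit rate but risks returning an answer to a question that only looks similar.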

Why Self-Healing?

Single-pass RAG fails silently — if the retrieved context is poor, the LLM generates a plausible-sounding but wrong answer. Self-healing explicitly scores both confidence and grounding after generation. Low scores trigger automatic retrieval expansion and regeneration, picking the best answer. This is how production RAG achieves reliability.
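
The control flow can be sketched as follows. The callables retrieve, generate, and score are hypothetical stand-ins for the pipeline stages, the threshold of 70 is an assumed value, and combining the two scores with min() is one possible choice rather than the project's documented rule. The retrieval depths 8 and 15 come from the feature list.

```python
def self_healing_answer(query, retrieve, generate, score,
                        threshold=70, k_initial=8, k_expanded=15):
    """Generate, score, and retry once with deeper retrieval if scores are low.

    retrieve(query, k) -> context, generate(query, context) -> answer,
    score(answer, context) -> (confidence, grounding), both 0-100.
    """
    best = None
    for k in (k_initial, k_expanded):
        context = retrieve(query, k)
        answer = generate(query, context)
        confidence, grounding = score(answer, context)
        candidate = {
            "answer": answer,
            "confidence": confidence,
            "grounding": grounding,
            "docs_retrieved": k,
        }
        quality = min(confidence, grounding)  # weakest score decides
        if best is None or quality > min(best["confidence"], best["grounding"]):
            best = candidate
        if quality >= threshold:
            break  # good enough: skip the expanded pass
    return best
```

If both passes score below the threshold, the better of the two is still returned together with its scores, so the caller can decide how to present a low-confidence answer.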


Project Structure

agentic-ai-codebase-assistant/
├── app/
│   ├── main.py                     # Entry point, FastAPI app initialization
│   ├── api.py                      # All API route handlers
│   ├── agent.py                    # Agent loop: Planner → Tool → Reflect → Memory
│   ├── rag_pipeline.py             # Core RAG with self-healing (11-step pipeline)
│   ├── cache.py                    # Redis + semantic embedding cache
│   ├── config.py                   # All configuration via env vars
│   ├── resilience.py               # Retry, timeout, circuit breaker
│   ├── retrieval/
│   │   ├── hybrid_retriever.py     # BM25 + Vector + RRF fusion
│   │   ├── reranker.py             # BGE cross-encoder reranker
│   │   ├── multi_query.py          # Query expansion and decomposition
│   │   ├── query_rewriter.py       # LLM-powered query reformulation
│   │   └── retrieval_orchestrator.py  # Full retrieval pipeline
│   ├── tools/
│   │   ├── tool_base.py            # Tool interface and registry
│   │   ├── code_search.py          # Targeted code search tool
│   │   ├── dependency_finder.py    # Dependency/import analysis tool
│   │   └── function_explainer.py   # Detailed function explanation tool
│   ├── prompts/
│   │   └── prompt_manager.py       # Versioned prompt templates
│   ├── memory/
│   │   └── memory.py               # Multi-turn conversation memory
│   └── observability/
│       ├── logger.py               # structlog JSON logging
│       └── metrics.py              # Prometheus counters and histograms
├── ingestion/
│   ├── loaders.py                  # Recursive Python file loader
│   ├── chunking.py                 # AST-based code chunker (key differentiator)
│   └── indexing.py                 # Qdrant + BM25 dual indexing
├── evaluation/
│   ├── datasets/testset.json       # Evaluation test cases
│   ├── eval_runner.py              # RAGAS evaluation runner
│   └── metrics.py                  # Faithfulness, relevancy, precision, recall
├── benchmarks/
│   ├── latency_test.py             # End-to-end latency benchmarks
│   └── load_test.py                # Locust load testing
├── docker/
│   ├── Dockerfile
│   └── docker-compose.yml          # Qdrant + Redis + App
├── scripts/
│   └── ingest_data.py              # CLI script for codebase ingestion
├── tests/
├── requirements.txt
├── .env.example
├── Makefile
└── architecture.md                 # Detailed design doc with all trade-offs

Running Evaluations

After ingesting a codebase:

make eval

Runs RAGAS evaluation across the test set and reports:

| Metric | Measures |
| --- | --- |
| Faithfulness | Is the answer supported by retrieved context? |
| Answer Relevancy | Does the answer address the question? |
| Context Precision | Are retrieved chunks actually relevant? |
| Context Recall | Did retrieval find all necessary information? |

Load Testing

# Run latency benchmarks
python benchmarks/latency_test.py

# Run load test (requires Locust)
locust -f benchmarks/load_test.py --host http://localhost:8000

Tech Stack

| Component | Technology | Why |
| --- | --- | --- |
| API Framework | FastAPI (async) | Non-blocking I/O, auto OpenAPI docs, streaming |
| LLM | OpenAI GPT-4o-mini | Best cost/quality ratio for code understanding |
| Embeddings | text-embedding-3-small | Fast, cost-effective, strong on code |
| Vector DB | Qdrant | Production-grade, filtering support, self-hostable |
| Keyword Search | BM25 (rank-bm25) | Exact identifier matching |
| Reranker | BGE (sentence-transformers) | Cross-encoder precision on top of recall |
| Cache | Redis + Semantic Cache | Exact + approximate query deduplication |
| Logging | structlog | JSON structured logs, easy aggregation |
| Metrics | Prometheus Client | Standard metrics, Grafana compatible |
| Evaluation | RAGAS | Industry-standard RAG evaluation framework |
| Load Testing | Locust | Python-native, async support |
| Container | Docker Compose | One-command local infrastructure |

Future Roadmap

  • Multi-language support (JavaScript, TypeScript, Java, Go)
  • Interactive call graph visualization
  • Fine-tuned code embeddings (CodeBERT / UniXcoder)
  • RAG evaluation dashboard with trend tracking
  • Auto prompt optimization using DSPy
  • Cost optimization engine (dynamic model routing)
  • GitHub Actions CI with evaluation gate

License

This project is licensed under the MIT License — see LICENSE for details.
