AI Codebase Intelligence Platform

An AI-powered system that ingests any GitHub repository and lets you ask questions about it as you would ask a senior engineer who has worked on the codebase for years.

Features • Demo • Installation • Usage • API Docs • Architecture

What Is This?

The AI Codebase Intelligence Platform is a developer tool that combines Retrieval-Augmented Generation (RAG), static code analysis, call graph generation, and LLM reasoning to help developers understand large, complex codebases quickly.

Instead of spending hours reading through thousands of files, you simply paste a GitHub URL and ask questions in plain English:

"How does the authentication system work?"
"Trace the login execution flow"
"Which files are involved when a user signs up?"
"Explain the payment module"
"What services interact with the order service?"

The system answers like a senior engineer who has deeply studied the codebase.

Features

| Feature | Description |
| --- | --- |
| Codebase Q&A | Ask any natural language question about any GitHub repo |
| Execution Flow Tracing | Trace how execution flows from any entry function |
| Architecture Analysis | Get a high-level overview of the entire system design |
| Call Graph Builder | Visualize which functions call which other functions |
| Dependency Graph | See how files and modules depend on each other |
| Multi-Language Support | Python, JavaScript, TypeScript, Java, Go, Rust, C++, C#, Ruby, PHP, Kotlin, Swift |
| Persistent Indexing | Index once, query forever; no re-processing needed |
| Semantic Search | Finds relevant code even when exact keywords don't match |
| Session Recovery | Restore your session after server restart without re-embedding |
| Docker Support | One command to run everything |

How It Works

You paste a GitHub URL
        │
        ▼
┌─────────────────────────────────────────────────────┐
│                  INGESTION PIPELINE                  │
│                                                      │
│  Clone Repo → Scan Files → Parse AST → Chunk Code  │
│       → Generate Embeddings → Store in ChromaDB     │
│       → Build Call Graph → Build Dependency Graph   │
└─────────────────────────────────────────────────────┘
        │
        ▼
You ask a question
        │
        ▼
┌─────────────────────────────────────────────────────┐
│                   QUERY PIPELINE                     │
│                                                      │
│  Embed Question → Search ChromaDB → Retrieve Top-K  │
│    Chunks → Enrich with Graph Data → Build Prompt   │
│             → Send to Claude AI → Get Answer        │
└─────────────────────────────────────────────────────┘
        │
        ▼
You get a precise, developer-quality answer with source file references
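
The "Search → Retrieve Top-K" step of the query pipeline can be pictured with a small sketch. This is not the project's retrieval code (ChromaDB performs the search there); it only illustrates ranking stored vectors by cosine similarity to a query vector. The names `cosine_similarity` and `top_k_chunks` and the toy 3-dimensional vectors are assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vector, indexed_chunks, k=3):
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vector, vector), chunk_id)
        for chunk_id, vector in indexed_chunks.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]

# Toy "embeddings"; real ones have hundreds of dimensions.
index = {
    "auth.py::login": [0.9, 0.1, 0.0],
    "db.py::connect": [0.0, 0.2, 0.9],
    "auth.py::logout": [0.7, 0.3, 0.1],
}
print(top_k_chunks([1.0, 0.0, 0.0], index, k=2))
```

With the toy data above, the two `auth.py` chunks rank ahead of `db.py::connect` because their vectors point in nearly the same direction as the query.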

Tech Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| LLM | Anthropic Claude (claude-sonnet-4) | Code reasoning & explanation |
| Embeddings | OpenAI text-embedding-3-small | Semantic code search |
| Vector Database | ChromaDB | Store & retrieve code embeddings |
| Parsing | Python AST + Regex | Extract functions, classes, imports |
| Graph Engine | NetworkX | Call graphs & dependency graphs |
| Backend API | FastAPI + Uvicorn | REST API endpoints |
| Frontend | Streamlit + Plotly | Interactive UI & graph visualization |
| Version Control | Git | Source control |
| Containerization | Docker + Docker Compose | Deployment |

Project Structure

ai-codebase-intelligence/
│
├──  backend/                          # FastAPI backend
│   ├── main.py                        # App entry point
│   ├── config.py                      # All configuration via .env
│   │
│   ├──  ingestion/
│   │   ├── repo_cloner.py             # Clone GitHub repos
│   │   ├── file_scanner.py            # Walk dirs, detect code files
│   │   └── ingestion_orchestrator.py  # Full pipeline coordinator
│   │
│   ├──  parsing/
│   │   ├── code_unit.py               # Core data model (CodeUnit)
│   │   ├── python_parser.py           # Python AST parser
│   │   ├── generic_parser.py          # Regex parser for other languages
│   │   └── parser_dispatcher.py       # Routes files to correct parser
│   │
│   ├──  chunking/
│   │   └── smart_chunker.py           # Function/class-level chunking
│   │
│   ├──  embeddings/
│   │   ├── embedding_model.py         # OpenAI + local fallback model
│   │   └── embedding_pipeline.py      # Batch embedding processor
│   │
│   ├──  vectordb/
│   │   └── chroma_store.py            # ChromaDB store & search
│   │
│   ├──  retrieval/
│   │   └── retrieval_engine.py        # Semantic search + re-ranking
│   │
│   ├──  graphs/
│   │   ├── call_graph.py              # Function call relationships
│   │   └── dependency_graph.py        # File/module dependencies
│   │
│   ├──  flow_tracer/
│   │   └── execution_tracer.py        # BFS execution path tracer
│   │
│   ├──  llm/
│   │   ├── llm_client.py              # Claude API wrapper
│   │   ├── prompt_builder.py          # Structured prompt templates
│   │   └── reasoning_engine.py        # Orchestrates RAG + LLM
│   │
│   └──  api/
│       ├── session_cache.py           # In-memory session store
│       ├──  routes/
│       │   ├── ingest.py              # POST /ingest-repo
│       │   ├── query.py               # POST /ask-question, /trace-flow
│       │   └── architecture.py        # GET /get-architecture, /repo-status
│       └──  models/
│           └── schemas.py             # Pydantic request/response models
│
├──  frontend/
│   └── app.py                         # Streamlit UI (4 tabs)
│
├──  tests/                          # 50 tests — all passing
│   ├── conftest.py                    # Shared fixtures
│   ├── test_api.py                    # FastAPI endpoint tests
│   ├── test_parsing.py                # Parser tests
│   ├── test_graphs.py                 # Graph builder tests
│   ├── test_embeddings.py             # Embedding model tests
│   ├── test_vectordb.py               # ChromaDB tests
│   ├── test_retrieval.py              # Retrieval engine tests
│   ├── test_ingestion.py              # Full pipeline integration tests
│   └── test_flow_tracer.py            # Execution tracer tests
│
├──  docker/
│   ├── Dockerfile
│   └── docker-compose.yml
│
├──  .vscode/                        # VS Code launch configs
│   ├── settings.json
│   ├── launch.json                    # One-click run backend/frontend/tests
│   └── extensions.json                # Recommended extensions
│
├── api.http                           # VS Code REST Client test file
├── setup.bat                          # Windows one-click setup
├── start_backend.bat                  # Start FastAPI server
├── start_frontend.bat                 # Start Streamlit UI
├── run_tests.bat                      # Run all 50 tests
├── requirements.txt                   # All Python dependencies
├── pyproject.toml                     # Project metadata + pytest config
└── .env.example                       # Environment variable template

Prerequisites

Before you begin, make sure you have the following installed:

| Requirement | Version | Download |
| --- | --- | --- |
| Python | 3.10 or higher | python.org |
| Git | Any recent version | git-scm.com |
| VS Code | Latest | code.visualstudio.com |

Check if Python and Git are installed:

Open Command Prompt (Win + R → type cmd → Enter) and run:

python --version
git --version

You should see version numbers. If not, install them from the links above.

API Keys Required

Choose one of the following setups:

Option A — Full Quality (Recommended)

| Key | Where to Get | Cost |
| --- | --- | --- |
| `ANTHROPIC_API_KEY` | console.anthropic.com | Free trial available |
| `OPENAI_API_KEY` | platform.openai.com | Free $5 credits on signup |

Option B — No API Keys (Offline Mode)

Set EMBEDDING_PROVIDER=local in your .env file to use the built-in TF-IDF embedding model, which needs no OpenAI key and runs offline. Retrieval quality is lower than with OpenAI embeddings. Only the embedding step is offline: answering questions still requires an ANTHROPIC_API_KEY.
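
To give a feel for what a TF-IDF embedding does, here is a minimal standard-library sketch. It is not the code in embedding_model.py; the function names (`tokenize`, `build_vocabulary`, `embed`), the tokenizer regex, and the smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and extract identifier-like tokens."""
    return re.findall(r"[a-z_][a-z0-9_]*", text.lower())

def build_vocabulary(documents):
    """Map each token to a column index and a smoothed inverse document frequency."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))  # count each token once per document
    n_docs = len(documents)
    vocab = {token: i for i, token in enumerate(sorted(doc_freq))}
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return vocab, idf

def embed(text, vocab, idf):
    """Return an L2-normalised TF-IDF vector; tokens outside the vocabulary are ignored."""
    counts = Counter(t for t in tokenize(text) if t in vocab)
    vector = [0.0] * len(vocab)
    for token, count in counts.items():
        vector[vocab[token]] = count * idf[token]
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector

docs = ["def login(user): check_password(user)", "def connect(): open_socket()"]
vocab, idf = build_vocabulary(docs)
print(len(embed("login user", vocab, idf)))  # one dimension per vocabulary token
```

Because the vector only has weight on tokens that literally appear in the text, this kind of embedding matches shared identifiers rather than meaning, which is one reason to expect lower quality than a learned embedding model.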

Installation

Step 1 — Clone the Repository

git clone https://github.com/MayurS23/ai-codebase-intelligence.git
cd ai-codebase-intelligence

Step 2 — Run the Setup Script (Windows)

setup.bat

This will automatically:

Create a Python virtual environment (.venv)
Install all dependencies from requirements.txt
Create your .env file from the template
Create required data directories

Step 3 — Configure Your API Keys

Open the .env file in VS Code and fill in your keys:

# ── LLM (Required for answering questions) ──────────────
ANTHROPIC_API_KEY=sk-ant-your-key-here

# ── Embeddings (Required for semantic search) ────────────
EMBEDDING_PROVIDER=openai
OPENAI_API_KEY=sk-your-key-here

# ── Or use local embeddings (no OpenAI key needed) ───────
# EMBEDDING_PROVIDER=local

# ── These defaults work fine, no need to change ──────────
LLM_MODEL=claude-sonnet-4-20250514
EMBEDDING_MODEL=text-embedding-3-small
CHROMA_PERSIST_DIR=./data/chromadb
REPOS_DIR=./data/repos
MAX_FILE_SIZE_KB=500
API_HOST=0.0.0.0
API_PORT=8000

Save the file with Ctrl+S.
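
For orientation, the following sketch shows one way settings like these could be read and validated in Python. It is an illustration, not the contents of config.py: the `Settings` class, the `load_settings` function, and the choice of `openai` as the default provider are assumptions, while the variable names and default values are taken from the template above.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    embedding_provider: str
    llm_model: str
    embedding_model: str
    chroma_persist_dir: str
    repos_dir: str
    max_file_size_kb: int
    api_port: int

def load_settings(env=None):
    """Build Settings from a mapping (os.environ by default), applying the template defaults."""
    env = os.environ if env is None else env
    provider = env.get("EMBEDDING_PROVIDER", "openai")  # default is an assumption
    if provider not in ("openai", "local"):
        raise ValueError(f"EMBEDDING_PROVIDER must be 'openai' or 'local', got {provider!r}")
    if provider == "openai" and not env.get("OPENAI_API_KEY"):
        raise ValueError("OPENAI_API_KEY is required when EMBEDDING_PROVIDER=openai")
    return Settings(
        embedding_provider=provider,
        llm_model=env.get("LLM_MODEL", "claude-sonnet-4-20250514"),
        embedding_model=env.get("EMBEDDING_MODEL", "text-embedding-3-small"),
        chroma_persist_dir=env.get("CHROMA_PERSIST_DIR", "./data/chromadb"),
        repos_dir=env.get("REPOS_DIR", "./data/repos"),
        max_file_size_kb=int(env.get("MAX_FILE_SIZE_KB", "500")),
        api_port=int(env.get("API_PORT", "8000")),
    )

settings = load_settings({"EMBEDDING_PROVIDER": "local"})
print(settings.api_port, settings.chroma_persist_dir)
```

Failing early on a missing key, as this sketch does, surfaces a misconfigured .env at startup rather than on the first embedding request.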

Step 4 — Verify Installation (Run Tests)

run_tests.bat

You should see:

50 passed in ~11s

If all 50 tests pass, your installation is complete.

Running the Application

You need two terminals open at the same time.

Terminal 1 — Start the Backend API

start_backend.bat

You should see:

INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Application startup complete.

Visit http://localhost:8000/docs to see the interactive API documentation.

Terminal 2 — Start the Frontend UI

start_frontend.bat

You should see:

You can now view your Streamlit app in your browser.
Local URL: http://localhost:8501

Open the App

Go to http://localhost:8501 in your browser. 🎉

Using the Application

1. Ingest a Repository

In the sidebar on the left:

Paste any GitHub URL, for example:
```
https://github.com/pallets/flask
```
Click **Ingest Repository**
Wait for processing (1–5 minutes depending on repo size)
You'll see metrics: files scanned, units parsed, chunks stored

2. Ask Questions (Tab 1 — Ask Questions)

Type any question about the codebase:

How does routing work?
Where is the database connection handled?
What does the application factory do?
Which files are involved in request handling?
Explain the templating system

The system will return:

A detailed answer with code references
The source files it used to generate the answer
Call graph context

3. Trace Execution Flow (Tab 2 — Trace Flow)

Enter a function name like full_dispatch_request and click Trace Flow.

You'll see the complete execution path:

→ full_dispatch_request  [app.py]
  → dispatch_request  [app.py]
    → ensure_sync  [app.py]
      → view_function  [views.py]
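
A breadth-first trace over a call graph can be sketched in a few lines. The version below is an assumption-laden illustration, not execution_tracer.py: it represents the call graph as a plain dict of caller to callees (the project uses NetworkX), and the function name `trace_flow` and its return shape are invented for the example. The `max_depth` parameter mirrors the one accepted by the trace-flow endpoint.

```python
from collections import deque

def trace_flow(call_graph, entry_function, max_depth=5):
    """Breadth-first walk from entry_function; returns (function, depth) in visit order."""
    if entry_function not in call_graph:
        return []  # unknown entry point: nothing to trace
    visited = {entry_function}
    order = []
    queue = deque([(entry_function, 0)])
    while queue:
        name, depth = queue.popleft()
        order.append((name, depth))
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for callee in call_graph.get(name, []):
            if callee not in visited:  # guards against recursion and cycles
                visited.add(callee)
                queue.append((callee, depth + 1))
    return order

call_graph = {
    "full_dispatch_request": ["dispatch_request"],
    "dispatch_request": ["ensure_sync"],
    "ensure_sync": ["view_function"],
    "view_function": [],
}
for name, depth in trace_flow(call_graph, "full_dispatch_request"):
    print("  " * depth + "→ " + name)
```

The visited set is what keeps the trace finite on recursive or mutually recursive functions; the depth limit keeps the output readable on very deep call chains.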

4. Architecture Overview (Tab 3 — Architecture)

Click Generate Architecture Analysis to get a full system overview including:

What kind of application it is
Main modules and their responsibilities
How modules interact
Entry points
Design patterns used

5. Graph Explorer (Tab 4 — Graph Explorer)

Visualize the codebase as an interactive graph:

Dependency Graph — which files import which files
Call Graph — which functions call which functions
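
As a rough picture of where dependency edges come from, the sketch below extracts import statements from Python sources with the standard `ast` module and keeps only the imports that point at other files in the same set. It is a simplified illustration, not dependency_graph.py: the function names and the `{module_name: source}` input shape are assumptions, and relative imports are skipped.

```python
import ast

def imported_modules(source):
    """Return the sorted module names imported by a Python source string."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)  # "from . import x" has module=None and is skipped
    return sorted(modules)

def dependency_edges(files):
    """files maps module name to source; returns (importer, imported) pairs within that set."""
    edges = []
    for name, source in files.items():
        for module in imported_modules(source):
            if module in files and module != name:
                edges.append((name, module))
    return sorted(edges)

files = {
    "app": "import routing\nfrom helpers import url_for\nimport os\n",
    "routing": "import helpers\n",
    "helpers": "",
}
print(dependency_edges(files))
```

Standard-library and third-party imports such as `os` drop out because they are not part of the scanned file set, leaving only edges between files in the repository.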

API Reference

The full interactive API docs are available at http://localhost:8000/docs when the backend is running.

Endpoints

`POST /api/ingest-repo`

Ingest a GitHub repository.

Request:
{
  "repo_url": "https://github.com/pallets/flask",
  "force": false
}

Response:
{
  "repo_id": "pallets__flask",
  "files_scanned": 45,
  "units_parsed": 312,
  "units_embedded": 389,
  "duration_seconds": 42.3,
  "message": "Repository indexed successfully."
}
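
The example response suggests that a `repo_id` is formed from the owner and repository name joined by a double underscore. Assuming that rule holds (it is inferred from the single example above, not confirmed by the source), a client could predict the id with a helper like this hypothetical `repo_id_from_url`:

```python
from urllib.parse import urlparse

def repo_id_from_url(repo_url):
    """Derive '<owner>__<repo>' from a GitHub URL (rule assumed from the example response)."""
    path = urlparse(repo_url).path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    parts = path.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected https://github.com/<owner>/<repo>, got {repo_url!r}")
    owner, name = parts
    return f"{owner}__{name}"

print(repo_id_from_url("https://github.com/pallets/flask"))
```

In practice it is safer to read `repo_id` from the ingest response than to compute it, since the server is the authority on how ids are formed.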

`POST /api/ask-question`

Ask a natural language question.

Request:
{
  "repo_id": "pallets__flask",
  "question": "How does routing work?"
}

Response:
{
  "answer": "Flask routing works through...",
  "source_files": ["src/flask/routing.py", "src/flask/app.py"],
  "call_graph": { "nodes": [...], "edges": [...] }
}

`POST /api/trace-flow`

Trace execution from an entry function.

Request:
{
  "repo_id": "pallets__flask",
  "entry_function": "full_dispatch_request",
  "max_depth": 5
}

`GET /api/get-architecture/{repo_id}`

Get architecture overview.

`GET /api/repo-status/{repo_id}`

Check if a repo is indexed and session is loaded.

`POST /api/reload-session/{repo_id}`

Restore session after server restart (no re-embedding needed).

`GET /api/list-functions/{repo_id}`

List all known functions in the indexed repo.
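
The endpoints above can be called from any HTTP client. The sketch below uses only the Python standard library and assumes the backend is reachable at http://localhost:8000, as in the run instructions. The helper names `build_post` and `call` are invented for the example, and the 300-second timeout is an arbitrary choice to allow for slow ingestion.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes the default API_PORT

def build_post(path, payload):
    """Create a JSON POST request for one of the /api endpoints."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def call(request, timeout=300):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

ask = build_post(
    "/api/ask-question",
    {"repo_id": "pallets__flask", "question": "How does routing work?"},
)
print(ask.get_method(), ask.full_url)
# With the backend running and the repo ingested:
# result = call(ask)
# print(result["answer"])
# print(result["source_files"])
```

The network call is left commented out so the snippet runs without a server; uncomment it once the backend is up and the repository has been ingested.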

Running with Docker

If you have Docker installed, you can run the entire application with one command:

cd docker
docker-compose up --build

This starts:

Backend API at http://localhost:8000
Frontend UI at http://localhost:8501

To stop:

docker-compose down

Running Tests

run_tests.bat

Or manually:

.venv\Scripts\activate
set PYTHONPATH=%CD%
set EMBEDDING_PROVIDER=local
python -m pytest tests/ -v

Test Coverage

| Test File | What It Tests | Tests |
| --- | --- | --- |
| `test_parsing.py` | Python AST parser, chunker | 8 |
| `test_graphs.py` | Call graph, dependency graph | 5 |
| `test_embeddings.py` | Embedding models, batch pipeline | 5 |
| `test_vectordb.py` | ChromaDB store, search, empty handling | 6 |
| `test_retrieval.py` | Semantic search, re-ranking | 4 |
| `test_flow_tracer.py` | Execution tracing, depth limits | 5 |
| `test_ingestion.py` | File scanner, full pipeline integration | 5 |
| `test_api.py` | All 7 FastAPI endpoints | 12 |
| Total | | 50 |

VS Code Integration

This project includes full VS Code configuration.

One-Click Run (Press F5)

Open the Run and Debug panel (Ctrl+Shift+D) and select:

▶ Run FastAPI Backend — starts the API server with debugger
▶ Run Streamlit Frontend — starts the UI
Run All Tests — runs all 50 tests

Recommended Extensions

When you open the project in VS Code, it will suggest installing:

Python — Python language support
Pylance — Fast Python type checking
REST Client — Test API endpoints from api.http file
GitLens — Enhanced Git capabilities

Test API from VS Code

Open api.http in VS Code and click Send Request above any endpoint to test it directly.

Troubleshooting

`ModuleNotFoundError`

.venv\Scripts\activate
pip install -r requirements.txt

`Connection refused on port 8000`

The backend isn't running. Run start_backend.bat in a separate terminal.

`Repo not found` error when asking questions

You need to ingest the repo first via the UI or POST /api/ingest-repo.

`API key invalid`

Double-check your .env file. Make sure there are no spaces around the = sign:

ANTHROPIC_API_KEY=sk-ant-...   ✅ correct
ANTHROPIC_API_KEY = sk-ant-... ❌ wrong
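
If you want to check a .env file for this mistake automatically, a small script along these lines can help. It is a hypothetical helper, not part of the project, and it is deliberately stricter than most .env parsers: any assignment that is not exactly KEY=value with no whitespace is reported.

```python
import re

# KEY=value with no whitespace anywhere in the assignment
ASSIGNMENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*=\S*$")

def find_bad_env_lines(text):
    """Return (line_number, line) for every assignment that is not strictly KEY=value."""
    bad = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are fine
        if not ASSIGNMENT.match(line):
            bad.append((number, raw))
    return bad

sample = (
    "# LLM\n"
    "ANTHROPIC_API_KEY=sk-ant-xxx\n"
    "OPENAI_API_KEY = sk-xxx\n"
    "\n"
    "EMBEDDING_PROVIDER=local\n"
)
print(find_bad_env_lines(sample))
```

To check a real file, pass its contents in, for example `find_bad_env_lines(open(".env", encoding="utf-8").read())`.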

Server restarted and questions stopped working

Run this to restore your session without re-indexing:

POST /api/reload-session/{your_repo_id}

Or just ingest the repo again (embeddings are cached).

Tests failing

Make sure you're running tests with local embeddings:

set EMBEDDING_PROVIDER=local
python -m pytest tests/ -v

Architecture Deep Dive

┌──────────────────────────────────────────────────────────────────┐
│                         LAYER 1: INGESTION                        │
│   GitHub URL → Clone → Scan Files → Parse AST → CodeUnit objects │
└──────────────────────────────────────┬───────────────────────────┘
                                        │
              ┌─────────────────────────┼────────────────────────┐
              │                         │                          │
              ▼                         ▼                          ▼
┌─────────────────────┐   ┌─────────────────────┐   ┌───────────────────┐
│   LAYER 2: CHUNKS   │   │  LAYER 2: CALL GRAPH │   │ LAYER 2: DEP GRAPH│
│  Split by function/ │   │  function → function │   │  file → file       │
│  class boundaries   │   │  (NetworkX DiGraph)  │   │  (NetworkX DiGraph)│
└──────────┬──────────┘   └─────────────────────┘   └───────────────────┘
           │
           ▼
┌─────────────────────┐
│  LAYER 3: EMBEDDINGS│
│  OpenAI / Local     │
│  text-embedding-3   │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   LAYER 4: STORAGE  │
│   ChromaDB          │
│   (vectors +        │
│    metadata)        │
└──────────┬──────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────────────┐
│                     LAYER 5: INTELLIGENCE                         │
│   Query → Embed → Search ChromaDB → Enrich with Graphs →         │
│   Build Prompt → Claude AI → Answer with source references       │
└──────────────────────────────────────────────────────────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────────────┐
│                      LAYER 6: API (FastAPI)                       │
│   /ingest-repo  /ask-question  /trace-flow  /get-architecture    │
└──────────────────────────────────────────────────────────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────────────┐
│                    LAYER 7: UI (Streamlit)                        │
│   Ask Questions │ Trace Flow │ Architecture │ Graph Explorer      │
└──────────────────────────────────────────────────────────────────┘

Key Design Decisions

Why function/class level chunking? Most RAG tutorials split code by token count (e.g. 512 tokens). A fixed-size split can cut a function in half, so a retrieved chunk may hold a body without its signature or a signature without its body. We chunk at function and class boundaries so the LLM always sees complete, meaningful units of code.
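
For Python sources, boundary-based chunking can be sketched with the standard `ast` module, which records the start and end line of every definition. This is an illustration under assumptions rather than the contents of smart_chunker.py: the function name `chunk_python_source` and the dict layout are invented, only top-level definitions are chunked, and decorators are not included in the extracted code.

```python
import ast

def chunk_python_source(source, file_path="<memory>"):
    """Split a Python source string into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:  # top-level statements only; methods stay inside their class
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "file_path": file_path,
                "name": node.name,
                "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                "start_line": node.lineno,
                "end_line": node.end_lineno,  # available on Python 3.8+
                "code": ast.get_source_segment(source, node),
            })
    return chunks

source = (
    "import os\n"
    "\n"
    "def login(user):\n"
    "    return check(user)\n"
    "\n"
    "class Session:\n"
    "    def close(self):\n"
    "        pass\n"
)
for chunk in chunk_python_source(source, "auth.py"):
    print(chunk["kind"], chunk["name"], chunk["start_line"], chunk["end_line"])
```

Each chunk is a syntactically complete definition, whatever its length, which is the property a fixed token window cannot guarantee.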

Why two indexes (vector + graph)? Vector search answers "what is this about?" but can't answer "how does A connect to B?" The call and dependency graphs answer those structural questions. Combining the two lets an answer cite both the relevant code and the functions that call it or are called by it, which neither index provides alone.
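
The combination step can be pictured as attaching graph neighbours to each chunk that vector search returns. The sketch below is an assumption about the general shape of that enrichment, not the project's reasoning_engine.py: the function name `enrich_with_graph`, the dict-based call graph, and the `calls` and `called_by` keys are all invented for illustration.

```python
def enrich_with_graph(retrieved, call_graph):
    """Attach callees and callers from the call graph to each retrieved chunk."""
    # Invert the caller -> callees mapping once so callers can be looked up directly.
    callers = {}
    for caller, callees in call_graph.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)

    enriched = []
    for chunk in retrieved:
        name = chunk["function"]
        enriched.append({
            **chunk,
            "calls": sorted(call_graph.get(name, [])),
            "called_by": sorted(callers.get(name, [])),
        })
    return enriched

call_graph = {
    "signup": ["validate", "create_user"],
    "create_user": ["hash_password"],
    "login": ["validate"],
}
retrieved = [{"function": "validate", "file": "auth.py"}]
print(enrich_with_graph(retrieved, call_graph))
```

Here vector search found `validate`, and the graph adds that `login` and `signup` both call it, context the embedding of `validate` alone does not carry.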

Why metadata-rich chunks? Every chunk stored in ChromaDB carries: file path, function name, start/end line number, language, parent class, docstring, and called functions. This enables precise attribution — the LLM can say "this is in auth.py line 42" rather than giving vague answers.
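
The fields listed above map naturally onto a small record type. The following is a sketch, not the project's CodeUnit model: the class name `ChunkMetadata`, its methods, and the decision to flatten the list of called functions into a comma-separated string are assumptions, the last one made on the premise that the vector store accepts only scalar metadata values.

```python
from dataclasses import asdict, dataclass, field

@dataclass
class ChunkMetadata:
    file_path: str
    function_name: str
    start_line: int
    end_line: int
    language: str
    parent_class: str = ""
    docstring: str = ""
    called_functions: list = field(default_factory=list)

    def to_flat_dict(self):
        """Return a dict of scalar values; the list becomes a comma-separated string."""
        flat = asdict(self)
        flat["called_functions"] = ",".join(self.called_functions)
        return flat

    def citation(self):
        """Human-readable source reference for use in an answer."""
        return f"{self.file_path} lines {self.start_line}-{self.end_line}"

meta = ChunkMetadata(
    file_path="auth.py",
    function_name="login",
    start_line=42,
    end_line=58,
    language="python",
    called_functions=["check_password", "create_session"],
)
print(meta.citation())
print(meta.to_flat_dict()["called_functions"])
```

Keeping line numbers alongside the file path is what makes an attribution such as "auth.py line 42" possible without re-reading the file at answer time.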

Contributing

Contributions are welcome! Here's how:

Fork the repository
Create a feature branch: git checkout -b feature/my-feature
Make your changes
Run tests: run_tests.bat
Commit: git commit -m "feat: add my feature"
Push: git push origin feature/my-feature
Open a Pull Request

📄 License

This project is licensed under the MIT License.

Author

Mayur S
GitHub: @MayurS23

If this project helped you, please give it a ⭐ on GitHub!

Built with ❤️ by MayurS23
