An AI-powered system that ingests any GitHub repository and lets you talk to it like a senior engineer who has worked on it for years.
The AI Codebase Intelligence Platform is a developer tool that combines Retrieval Augmented Generation (RAG), static code analysis, call graph generation, and LLM reasoning to help developers understand large, complex codebases instantly.
Instead of spending hours reading through thousands of files, you simply paste a GitHub URL and ask questions in plain English:
- "How does the authentication system work?"
- "Trace the login execution flow"
- "Which files are involved when a user signs up?"
- "Explain the payment module"
- "What services interact with the order service?"
The system answers like a senior engineer who has deeply studied the codebase.
| Feature | Description |
|---|---|
| Codebase Q&A | Ask any natural language question about any GitHub repo |
| Execution Flow Tracing | Trace how execution flows from any entry function |
| Architecture Analysis | Get a high-level overview of the entire system design |
| Call Graph Builder | Visualize which functions call which other functions |
| Dependency Graph | See how files and modules depend on each other |
| Multi-Language Support | Python, JavaScript, TypeScript, Java, Go, Rust, C++, C#, Ruby, PHP, Kotlin, Swift |
| Persistent Indexing | Index once, query forever — no re-processing needed |
| Semantic Search | Finds relevant code even when exact keywords don't match |
| Session Recovery | Restore your session after server restart without re-embedding |
| Docker Support | One command to run everything |
You paste a GitHub URL
│
▼
┌─────────────────────────────────────────────────────┐
│ INGESTION PIPELINE │
│ │
│ Clone Repo → Scan Files → Parse AST → Chunk Code │
│ → Generate Embeddings → Store in ChromaDB │
│ → Build Call Graph → Build Dependency Graph │
└─────────────────────────────────────────────────────┘
│
▼
You ask a question
│
▼
┌─────────────────────────────────────────────────────┐
│ QUERY PIPELINE │
│ │
│ Embed Question → Search ChromaDB → Retrieve Top-K │
│ Chunks → Enrich with Graph Data → Build Prompt │
│ → Send to Claude AI → Get Answer │
└─────────────────────────────────────────────────────┘
│
▼
You get a precise, developer-quality answer with source file references
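The "Search ChromaDB → Retrieve Top-K Chunks" step above comes down to ranking stored vectors by similarity to the question vector. Here is a minimal pure-Python sketch of that idea, assuming cosine similarity as the metric. The chunk ids and the 3-dimensional vectors are made up for illustration, and the real project delegates this search to ChromaDB rather than doing it by hand.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunks: dict[str, list[float]], k: int = 3):
    """Return the k (score, chunk_id) pairs most similar to the query vector.

    `chunks` maps a hypothetical chunk id to its embedding vector.
    """
    scored = [(cosine_similarity(query_vec, vec), cid) for cid, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real embeddings have far more dimensions.
index = {
    "auth.py::login": [0.9, 0.1, 0.0],
    "orders.py::create_order": [0.0, 0.2, 0.9],
    "auth.py::logout": [0.8, 0.3, 0.1],
}
results = top_k([1.0, 0.0, 0.0], index, k=2)
```

With this toy index, both authentication chunks rank above the order chunk, because their vectors point in nearly the same direction as the query.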
| Layer | Technology | Purpose |
|---|---|---|
| LLM | Anthropic Claude (claude-sonnet-4) | Code reasoning & explanation |
| Embeddings | OpenAI text-embedding-3-small | Semantic code search |
| Vector Database | ChromaDB | Store & retrieve code embeddings |
| Parsing | Python AST + Regex | Extract functions, classes, imports |
| Graph Engine | NetworkX | Call graphs & dependency graphs |
| Backend API | FastAPI + Uvicorn | REST API endpoints |
| Frontend | Streamlit + Plotly | Interactive UI & graph visualization |
| Version Control | Git | Source control |
| Containerization | Docker + Docker Compose | Deployment |
ai-codebase-intelligence/
│
├── backend/ # FastAPI backend
│ ├── main.py # App entry point
│ ├── config.py # All configuration via .env
│ │
│ ├── ingestion/
│ │ ├── repo_cloner.py # Clone GitHub repos
│ │ ├── file_scanner.py # Walk dirs, detect code files
│ │ └── ingestion_orchestrator.py # Full pipeline coordinator
│ │
│ ├── parsing/
│ │ ├── code_unit.py # Core data model (CodeUnit)
│ │ ├── python_parser.py # Python AST parser
│ │ ├── generic_parser.py # Regex parser for other languages
│ │ └── parser_dispatcher.py # Routes files to correct parser
│ │
│ ├── chunking/
│ │ └── smart_chunker.py # Function/class-level chunking
│ │
│ ├── embeddings/
│ │ ├── embedding_model.py # OpenAI + local fallback model
│ │ └── embedding_pipeline.py # Batch embedding processor
│ │
│ ├── vectordb/
│ │ └── chroma_store.py # ChromaDB store & search
│ │
│ ├── retrieval/
│ │ └── retrieval_engine.py # Semantic search + re-ranking
│ │
│ ├── graphs/
│ │ ├── call_graph.py # Function call relationships
│ │ └── dependency_graph.py # File/module dependencies
│ │
│ ├── flow_tracer/
│ │ └── execution_tracer.py # BFS execution path tracer
│ │
│ ├── llm/
│ │ ├── llm_client.py # Claude API wrapper
│ │ ├── prompt_builder.py # Structured prompt templates
│ │ └── reasoning_engine.py # Orchestrates RAG + LLM
│ │
│ └── api/
│ ├── session_cache.py # In-memory session store
│ ├── routes/
│ │ ├── ingest.py # POST /ingest-repo
│ │ ├── query.py # POST /ask-question, /trace-flow
│ │ └── architecture.py # GET /get-architecture, /repo-status
│ └── models/
│ └── schemas.py # Pydantic request/response models
│
├── frontend/
│ └── app.py # Streamlit UI (4 tabs)
│
├── tests/ # 50 tests — all passing
│ ├── conftest.py # Shared fixtures
│ ├── test_api.py # FastAPI endpoint tests
│ ├── test_parsing.py # Parser tests
│ ├── test_graphs.py # Graph builder tests
│ ├── test_embeddings.py # Embedding model tests
│ ├── test_vectordb.py # ChromaDB tests
│ ├── test_retrieval.py # Retrieval engine tests
│ ├── test_ingestion.py # Full pipeline integration tests
│ └── test_flow_tracer.py # Execution tracer tests
│
├── docker/
│ ├── Dockerfile
│ └── docker-compose.yml
│
├── .vscode/ # VS Code launch configs
│ ├── settings.json
│ ├── launch.json # One-click run backend/frontend/tests
│ └── extensions.json # Recommended extensions
│
├── api.http # VS Code REST Client test file
├── setup.bat # Windows one-click setup
├── start_backend.bat # Start FastAPI server
├── start_frontend.bat # Start Streamlit UI
├── run_tests.bat # Run all 50 tests
├── requirements.txt # All Python dependencies
├── pyproject.toml # Project metadata + pytest config
└── .env.example # Environment variable template
Before you begin, make sure you have the following installed:
| Requirement | Version | Download |
|---|---|---|
| Python | 3.10 or higher | python.org |
| Git | Any recent version | git-scm.com |
| VS Code | Latest | code.visualstudio.com |
Open Command Prompt (Win + R → type cmd → Enter) and run:
python --version
git --version
You should see version numbers. If not, install them from the links above.
You need an Anthropic key for LLM answers. The OpenAI key is only required if you use OpenAI embeddings:
| Key | Where to Get | Cost |
|---|---|---|
| ANTHROPIC_API_KEY | console.anthropic.com | Free trial available |
| OPENAI_API_KEY | platform.openai.com | Free $5 credits on signup |
No OpenAI key? Set EMBEDDING_PROVIDER=local in your .env file. The system will then use a built-in TF-IDF embedding model, so embedding runs fully offline, at the cost of lower retrieval quality. You still need an Anthropic key for LLM answers.
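To give a feel for what a TF-IDF embedding does, here is a small self-contained sketch. It is not the project's built-in model: the tokenizer, the smoothed IDF formula, and the function names are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract identifier-like tokens."""
    return re.findall(r"[a-z_][a-z0-9_]*", text.lower())


def tfidf_vectors(docs: list[str]) -> tuple[list[str], list[list[float]]]:
    """Return (vocabulary, one TF-IDF vector per document)."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        vec = [
            # Term frequency times a smoothed inverse document frequency.
            (counts[t] / total) * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t in vocab
        ]
        vectors.append(vec)
    return vocab, vectors


docs = [
    "def login(user): check_password(user)",
    "def create_order(cart): charge(cart)",
]
vocab, vecs = tfidf_vectors(docs)
```

Terms that appear in every document (such as `def`) get a lower weight than terms unique to one document (such as `login`), which is why TF-IDF can rank code by topic without any network call. It only matches shared tokens, though, which is one reason retrieval quality is lower than with a learned embedding model.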
git clone https://github.com/MayurS23/ai-codebase-intelligence.git
cd ai-codebase-intelligence
setup.bat
This will automatically:
- Create a Python virtual environment (.venv)
- Install all dependencies from requirements.txt
- Create your .env file from the template
- Create required data directories
Open the .env file in VS Code and fill in your keys:
# ── LLM (Required for answering questions) ──────────────
ANTHROPIC_API_KEY=sk-ant-your-key-here
# ── Embeddings (Required for semantic search) ────────────
EMBEDDING_PROVIDER=openai
OPENAI_API_KEY=sk-your-key-here
# ── Or use local embeddings (no OpenAI key needed) ───────
# EMBEDDING_PROVIDER=local
# ── These defaults work fine, no need to change ──────────
LLM_MODEL=claude-sonnet-4-20250514
EMBEDDING_MODEL=text-embedding-3-small
CHROMA_PERSIST_DIR=./data/chromadb
REPOS_DIR=./data/repos
MAX_FILE_SIZE_KB=500
API_HOST=0.0.0.0
API_PORT=8000
Save the file with Ctrl+S.
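As a rough illustration of how settings like these can be read with defaults, here is a sketch using only the standard library. The `Settings` class and `load_settings` function are hypothetical names; the project's actual config.py may be structured differently and may use another library to load the .env file.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    embedding_provider: str
    llm_model: str
    chroma_persist_dir: str
    max_file_size_kb: int
    api_port: int


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ).

    Default values mirror the .env template shown above.
    """
    env = os.environ if env is None else env
    return Settings(
        embedding_provider=env.get("EMBEDDING_PROVIDER", "openai"),
        llm_model=env.get("LLM_MODEL", "claude-sonnet-4-20250514"),
        chroma_persist_dir=env.get("CHROMA_PERSIST_DIR", "./data/chromadb"),
        max_file_size_kb=int(env.get("MAX_FILE_SIZE_KB", "500")),
        api_port=int(env.get("API_PORT", "8000")),
    )


# Override two values; everything else falls back to the defaults.
settings = load_settings({"EMBEDDING_PROVIDER": "local", "API_PORT": "9000"})
```

Passing a plain dict instead of reading `os.environ` directly keeps the function easy to test.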
run_tests.bat
You should see:
50 passed in ~11s
If all 50 tests pass, your installation is complete.
You need two terminals open at the same time.
start_backend.bat
You should see:
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: Application startup complete.
Visit http://localhost:8000/docs to see the interactive API documentation.
start_frontend.bat
You should see:
You can now view your Streamlit app in your browser.
Local URL: http://localhost:8501
Go to http://localhost:8501 in your browser. 🎉
In the sidebar on the left:
- Paste any GitHub URL, for example: https://github.com/pallets/flask
- Click Ingest Repository
- Wait for processing (1–5 minutes depending on repo size)
- You'll see metrics: files scanned, units parsed, chunks stored
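The "files scanned" metric comes from walking the cloned repository and keeping only code files under the size limit. The following is a hedged sketch of such a scanner, not the project's file_scanner.py: the extension set and skipped directory names are illustrative assumptions, and only the 500 KB default is taken from MAX_FILE_SIZE_KB.

```python
import os
import tempfile

CODE_EXTENSIONS = {".py", ".js", ".ts", ".java", ".go", ".rs"}  # illustrative subset
SKIP_DIRS = {".git", "node_modules", "__pycache__", ".venv"}    # assumed skip list


def scan_code_files(root: str, max_file_size_kb: int = 500) -> list[str]:
    """Return sorted relative paths of code files under root within the size limit."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            path = os.path.join(dirpath, name)
            ext = os.path.splitext(name)[1].lower()
            if ext not in CODE_EXTENSIONS:
                continue
            if os.path.getsize(path) > max_file_size_kb * 1024:
                continue
            found.append(os.path.relpath(path, root))
    return sorted(found)


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "src"))
    os.makedirs(os.path.join(root, ".git"))
    files = [("src/app.py", 100), ("README.md", 100), (".git/hook.py", 100), ("src/big.py", 2048)]
    for rel, size in files:
        with open(os.path.join(root, *rel.split("/")), "w") as fh:
            fh.write("x" * size)
    scanned = scan_code_files(root, max_file_size_kb=1)
```

In the demo only src/app.py survives: README.md has no code extension, .git is pruned, and src/big.py exceeds the 1 KB limit used for the example.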
Type any question about the codebase:
How does routing work?
Where is the database connection handled?
What does the application factory do?
Which files are involved in request handling?
Explain the templating system
The system will return:
- A detailed answer with code references
- The source files it used to generate the answer
- Call graph context
Enter a function name like full_dispatch_request and click Trace Flow.
You'll see the complete execution path:
→ full_dispatch_request [app.py]
→ dispatch_request [app.py]
→ ensure_sync [app.py]
→ view_function [views.py]
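A trace like the one above can be produced by a breadth-first walk over the call graph. The sketch below uses a plain dictionary instead of NetworkX to stay self-contained. The `trace_flow` function and the toy graph are assumptions for illustration and do not describe Flask's real call structure.

```python
from collections import deque


def trace_flow(call_graph: dict[str, list[str]], entry: str, max_depth: int = 5):
    """Breadth-first walk of the call graph starting at `entry`.

    Returns (function, depth) pairs in visit order; each function appears once.
    """
    if entry not in call_graph:
        return []
    visited = {entry}
    order = [(entry, 0)]
    queue = deque([(entry, 0)])
    while queue:
        func, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand callees beyond the depth limit
        for callee in call_graph.get(func, []):
            if callee not in visited:
                visited.add(callee)
                order.append((callee, depth + 1))
                queue.append((callee, depth + 1))
    return order


# Toy graph with a cycle; the `visited` set keeps the walk from looping.
graph = {
    "full_dispatch_request": ["dispatch_request"],
    "dispatch_request": ["ensure_sync", "view_function"],
    "ensure_sync": [],
    "view_function": ["full_dispatch_request"],
}
path = trace_flow(graph, "full_dispatch_request", max_depth=5)
shallow = trace_flow(graph, "full_dispatch_request", max_depth=1)
```

The `max_depth` parameter plays the same role as the `max_depth` field in the trace request: it bounds how far the walk expands from the entry function.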
Click Generate Architecture Analysis to get a full system overview including:
- What kind of application it is
- Main modules and their responsibilities
- How modules interact
- Entry points
- Design patterns used
Visualize the codebase as an interactive graph:
- Dependency Graph — which files import which files
- Call Graph — which functions call which functions
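For Python sources, the edges of a call graph can be recovered from the AST. The sketch below is one simple way to do it and is not taken from the project's call_graph.py. It matches calls by bare name only, so `obj.start()` is recorded as `start`, and calls inside nested functions are also attributed to the enclosing function.

```python
import ast


def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each function name in the source to the names it calls."""
    tree = ast.parse(source)
    graph: dict[str, set[str]] = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            calls = set()
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call):
                    func = inner.func
                    if isinstance(func, ast.Name):         # foo()
                        calls.add(func.id)
                    elif isinstance(func, ast.Attribute):  # obj.foo()
                        calls.add(func.attr)
            graph[node.name] = calls
    return graph


sample = '''
def login(user):
    if check_password(user):
        session.start(user)

def check_password(user):
    return hash_pw(user.password) == user.stored
'''
call_graph = build_call_graph(sample)
```

Name-based matching is cheap but ambiguous when two classes define methods with the same name, so a production graph builder would likely need some form of scope or import resolution.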
The full interactive API docs are available at http://localhost:8000/docs when the backend is running.
Ingest a GitHub repository.
Request:
{
"repo_url": "https://github.com/pallets/flask",
"force": false
}
Response:
{
"repo_id": "pallets__flask",
"files_scanned": 45,
"units_parsed": 312,
"units_embedded": 389,
"duration_seconds": 42.3,
"message": "Repository indexed successfully."
}
Ask a natural language question.
Request:
{
"repo_id": "pallets__flask",
"question": "How does routing work?"
}
Response:
{
"answer": "Flask routing works through...",
"source_files": ["src/flask/routing.py", "src/flask/app.py"],
"call_graph": { "nodes": [...], "edges": [...] }
}
Trace execution from an entry function.
Request:
{
"repo_id": "pallets__flask",
"entry_function": "full_dispatch_request",
"max_depth": 5
}
Get architecture overview.
Check if a repo is indexed and session is loaded.
Restore session after server restart (no re-embedding needed).
List all known functions in the indexed repo.
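The request bodies above can be sent from any HTTP client. Here is a standard-library sketch that builds the ask-question request without sending it. The `/api` prefix on this route and the `owner__name` rule for deriving `repo_id` are assumptions inferred from the examples in this README, so check http://localhost:8000/docs for the actual paths.

```python
import json
import urllib.request
from urllib.parse import urlparse

BASE_URL = "http://localhost:8000/api"  # assumed prefix, see /docs


def repo_id_from_url(repo_url: str) -> str:
    """Guess the repo id as owner__name, inferred from the 'pallets__flask' example."""
    parts = urlparse(repo_url).path.strip("/").split("/")
    owner, name = parts[0], parts[1]
    if name.endswith(".git"):
        name = name[:-4]
    return f"{owner}__{name}"


def build_ask_request(repo_url: str, question: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the ask-question endpoint."""
    payload = {"repo_id": repo_id_from_url(repo_url), "question": question}
    return urllib.request.Request(
        f"{BASE_URL}/ask-question",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_ask_request("https://github.com/pallets/flask", "How does routing work?")

# To actually send it (the backend must be running and the repo ingested):
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["answer"])
```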
If you have Docker installed, you can run the entire application with one command:
cd docker
docker-compose up --build
This starts:
- Backend API at http://localhost:8000
- Frontend UI at http://localhost:8501
To stop:
docker-compose down
To run the test suite:
run_tests.bat
Or manually:
.venv\Scripts\activate
set PYTHONPATH=%CD%
set EMBEDDING_PROVIDER=local
python -m pytest tests/ -v
| Test File | What It Tests | Tests |
|---|---|---|
| test_parsing.py | Python AST parser, chunker | 8 |
| test_graphs.py | Call graph, dependency graph | 5 |
| test_embeddings.py | Embedding models, batch pipeline | 5 |
| test_vectordb.py | ChromaDB store, search, empty handling | 6 |
| test_retrieval.py | Semantic search, re-ranking | 4 |
| test_flow_tracer.py | Execution tracing, depth limits | 5 |
| test_ingestion.py | File scanner, full pipeline integration | 5 |
| test_api.py | All 7 FastAPI endpoints | 12 |
| Total | | 50 |
This project includes full VS Code configuration.
Open the Run and Debug panel (Ctrl+Shift+D) and select:
- ▶ Run FastAPI Backend — starts the API server with debugger
- ▶ Run Streamlit Frontend — starts the UI
- Run All Tests — runs all 50 tests
When you open the project in VS Code, it will suggest installing:
- Python — Python language support
- Pylance — Fast Python type checking
- REST Client — Test API endpoints from the api.http file
- GitLens — Enhanced Git capabilities
Open api.http in VS Code and click Send Request above any endpoint to test it directly.
.venv\Scripts\activate
pip install -r requirements.txt
The backend isn't running. Run start_backend.bat in a separate terminal.
You need to ingest the repo first via the UI or POST /api/ingest-repo.
Double-check your .env file. Make sure there are no spaces around the = sign:
ANTHROPIC_API_KEY=sk-ant-... ✅ correct
ANTHROPIC_API_KEY = sk-ant-... ❌ wrong
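If you want to check a .env file for this mistake automatically, a few lines of Python are enough. The helper below is a hypothetical convenience script and is not part of the repository.

```python
def find_env_problems(text: str) -> list[tuple[int, str]]:
    """Return (line_number, message) for malformed KEY=value lines."""
    problems = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are fine
        if "=" not in line:
            problems.append((lineno, "missing '='"))
            continue
        key, value = line.split("=", 1)
        # Whitespace directly before or after '=' is the error described above.
        if key != key.rstrip() or value != value.lstrip():
            problems.append((lineno, "spaces around '='"))
    return problems


sample = "# keys\nANTHROPIC_API_KEY=sk-ant-abc\nOPENAI_API_KEY = sk-abc\n"
problems = find_env_problems(sample)
```

To check a real file, pass its contents, for example `find_env_problems(open(".env").read())`.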
Run this to restore your session without re-indexing:
POST /api/reload-session/{your_repo_id}
Or just ingest the repo again (embeddings are cached).
Make sure you're running tests with local embeddings:
set EMBEDDING_PROVIDER=local
python -m pytest tests/ -v
┌──────────────────────────────────────────────────────────────────┐
│ LAYER 1: INGESTION │
│ GitHub URL → Clone → Scan Files → Parse AST → CodeUnit objects │
└──────────────────────────────────────┬───────────────────────────┘
│
┌─────────────────────────┼────────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────────┐ ┌─────────────────────┐ ┌───────────────────┐
│ LAYER 2: CHUNKS │ │ LAYER 2: CALL GRAPH │ │ LAYER 2: DEP GRAPH│
│ Split by function/ │ │ function → function │ │ file → file │
│ class boundaries │ │ (NetworkX DiGraph) │ │ (NetworkX DiGraph)│
└──────────┬──────────┘ └─────────────────────┘ └───────────────────┘
│
▼
┌─────────────────────┐
│ LAYER 3: EMBEDDINGS│
│ OpenAI / Local │
│ text-embedding-3 │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ LAYER 4: STORAGE │
│ ChromaDB │
│ (vectors + │
│ metadata) │
└──────────┬──────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ LAYER 5: INTELLIGENCE │
│ Query → Embed → Search ChromaDB → Enrich with Graphs → │
│ Build Prompt → Claude AI → Answer with source references │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ LAYER 6: API (FastAPI) │
│ /ingest-repo /ask-question /trace-flow /get-architecture │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ LAYER 7: UI (Streamlit) │
│ Ask Questions │ Trace Flow │ Architecture │ Graph Explorer │
└──────────────────────────────────────────────────────────────────┘
Why function/class level chunking? Most RAG tutorials split code by token count (e.g. 512 tokens). This breaks functions in half, destroying semantic meaning. We chunk at function and class boundaries so the LLM always sees complete, meaningful units of code.
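For Python files, chunking at definition boundaries is straightforward with the standard `ast` module. The sketch below shows the idea for top-level functions and classes only. It is an illustration under that assumption and not the project's smart_chunker.py, which may also handle methods, nested definitions, and other languages.

```python
import ast


def chunk_by_definition(source: str) -> list[dict]:
    """Return one chunk per top-level function or class, with its line span."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start, end = node.lineno, node.end_lineno  # 1-based, inclusive
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": start,
                "end_line": end,
                "code": "\n".join(lines[start - 1:end]),
            })
    return chunks


sample = (
    "import os\n"
    "\n"
    "def a():\n"
    "    return 1\n"
    "\n"
    "class B:\n"
    "    def m(self):\n"
    "        return 2\n"
)
chunks = chunk_by_definition(sample)
```

Each chunk contains a whole definition, so a function body is never cut in the middle the way a fixed token window can cut it.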
Why two indexes (vector + graph)? Vector search answers "what is this about?" but can't answer "how does A connect to B?" Graph indexes answer structural questions. Combining both gives dramatically better answers than either alone.
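One way to picture the combination: take the function names returned by vector search and look up their direct neighbors in the call graph before building the prompt. The sketch below assumes the graph is a plain dictionary of caller to callees. The function names and the `enrich_with_graph` helper are hypothetical.

```python
def enrich_with_graph(hits: list[str], call_graph: dict[str, list[str]]) -> dict[str, dict]:
    """For each retrieved function, attach its direct callees and callers."""
    # Invert the graph once so caller lookups are cheap.
    callers: dict[str, list[str]] = {}
    for src, targets in call_graph.items():
        for dst in targets:
            callers.setdefault(dst, []).append(src)
    return {
        fn: {
            "calls": sorted(call_graph.get(fn, [])),
            "called_by": sorted(callers.get(fn, [])),
        }
        for fn in hits
    }


call_graph = {
    "signup": ["validate_email", "create_user"],
    "create_user": ["hash_password", "save"],
    "login": ["hash_password"],
}
# Pretend vector search returned these two functions for a signup question.
context = enrich_with_graph(["create_user", "hash_password"], call_graph)
```

The vector hits say which code is topically relevant, while the `calls` and `called_by` lists add the structural links that similarity search alone cannot provide.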
Why metadata-rich chunks?
Every chunk stored in ChromaDB carries: file path, function name, start/end line number, language, parent class, docstring, and called functions. This enables precise attribution — the LLM can say "this is in auth.py line 42" rather than giving vague answers.
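The metadata fields listed above map naturally onto a small record type. The following sketch is illustrative only: the `ChunkMetadata` class, its field names, and the flattening of the call list into a comma-separated string are assumptions (the flattening assumes the store accepts only scalar metadata values), and the project's CodeUnit model may look different.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ChunkMetadata:
    file_path: str
    function_name: str
    start_line: int
    end_line: int
    language: str
    parent_class: str = ""
    docstring: str = ""
    called_functions: list[str] = field(default_factory=list)

    def to_flat_dict(self) -> dict:
        """Flatten to scalar values; the call list is joined into one string."""
        data = asdict(self)
        data["called_functions"] = ",".join(self.called_functions)
        return data

    def citation(self) -> str:
        """Human-readable source reference for use in an answer."""
        return f"{self.file_path}:{self.start_line}-{self.end_line} ({self.function_name})"


meta = ChunkMetadata(
    file_path="auth.py",
    function_name="login",
    start_line=42,
    end_line=57,
    language="python",
    called_functions=["check_password", "start_session"],
)
flat = meta.to_flat_dict()
```

Because the line span travels with every chunk, an answer can cite `meta.citation()` directly instead of guessing where the code lives.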
Contributions are welcome! Here's how:
- Fork the repository
- Create a feature branch:
git checkout -b feature/my-feature - Make your changes
- Run tests:
run_tests.bat - Commit:
git commit -m "feat: add my feature" - Push:
git push origin feature/my-feature - Open a Pull Request
This project is licensed under the MIT License.
Mayur S GitHub: @MayurS23
If this project helped you, please give it a ⭐ on GitHub!