A FastMCP server for parsing, indexing, and semantically searching codebases using Tree-sitter and SpaCy embeddings.
- Multi-language Support: Parses 30+ programming languages using Tree-sitter WASM grammars
- Semantic Search: Vector-based search using SpaCy embeddings and sqlite-vec
- Incremental Updates: Only re-parses files that have changed
- FastMCP Integration: Native MCP server with async command support
- Structured Symbol Extraction: Extracts classes, functions, methods, and more
- Multi-Project Support: Index and manage multiple codebases with project isolation
- Progress Reporting: Real-time progress updates during long-running operations
- FastMCP Resources: Direct data access through REST-like resource URIs
- Code Relationship Mapping: Track function calls, inheritance, implementations, and dependencies
- Python, JavaScript, TypeScript, TSX
- Java, C#, C, C++
- Go, Rust, Ruby, PHP
- Swift, Kotlin, Scala
- Lua, HTML, CSS, JSON
- YAML, TOML, Vue
- Solidity, Zig, Elixir
- OCaml, Elm, Bash
- Elisp, SystemRDL, TLA+
- QL, ReScript
- Python 3.12+
- FastMCP CLI (pip install fastmcp)
- Clone and install dependencies:
  git clone <repository>
  cd codebase-analyzer
  pip install -e .
- Install sqlite-vec:
  pip install sqlite-vec
- Install SpaCy and download the embedding model:
  pip install spacy
  python -m spacy download en_core_web_md
  Note: You can also use en_core_web_trf for better quality (but slower) embeddings:
  python -m spacy download en_core_web_trf
- Verify installation:
  python -c "import spacy; nlp = spacy.load('en_core_web_md'); print('SpaCy ready!')"
# For development with MCP Inspector
fastmcp dev main.py --ui-port 3000 --server-port 8000
# For production
fastmcp run main.py

The server will start and register the following MCP commands:
# Index with default project ID
mcp call index_codebase "/path/to/your/codebase"
# Index with custom project ID
mcp call index_codebase "/path/to/your/codebase" "my-project"

Response:
{
"success": true,
"project_id": "my-project",
"processed_files": 45,
"total_symbols_added": 1234,
"errors": [],
"database_stats": {
"total_symbols": 1234,
"symbols_with_embeddings": 1234,
"languages": {
"python": 567,
"javascript": 234,
"typescript": 123
},
"total_files": 45,
"project_id": "my-project"
}
}

Progress Reporting: The indexing operation provides real-time progress updates:
- File Discovery (0-10%): Scanning for files to process
- File Processing (10-90%): Parsing files and generating embeddings
- Finalization (90-100%): Completing the indexing process
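
The percentage ranges above suggest that progress within each phase is scaled onto a single 0-100% value. A minimal sketch of that mapping follows; the phase keys and the `overall_progress` helper are hypothetical names chosen for illustration, and only the ranges themselves come from the list above.

```python
# Hypothetical helper: the phase names and this function are illustrative,
# not the server's actual implementation. Only the ranges are documented.
PHASE_RANGES = {
    "discovery": (0.0, 10.0),
    "processing": (10.0, 90.0),
    "finalization": (90.0, 100.0),
}


def overall_progress(phase: str, done: int, total: int) -> float:
    """Map progress within one phase onto the overall 0-100% scale."""
    start, end = PHASE_RANGES[phase]
    # Treat an empty phase as complete to avoid dividing by zero.
    fraction = 1.0 if total <= 0 else min(max(done / total, 0.0), 1.0)
    return start + (end - start) * fraction


if __name__ == "__main__":
    # 45 of 90 files processed: halfway through the 10-90% band.
    print(overall_progress("processing", 45, 90))
```

Because file processing spans the widest band, most of the visible movement in a progress bar would come from that phase.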
# Search across all projects
mcp call search_symbol_by_name "functionName"
# Search within a specific project
mcp call search_symbol_by_name "functionName" "python" "my-project"
# Search with language filter
mcp call search_symbol_by_name "functionName" "python"

# Search across all projects
mcp call search_symbol_semantic "find functions that handle user authentication"
# Search within a specific project
mcp call search_symbol_semantic "find functions that handle user authentication" 10 "my-project"

# List all projects
mcp call list_projects
# Get statistics for a specific project
mcp call get_stats "my-project"
# Get global statistics
mcp call get_stats
# Delete a project
mcp call delete_project "my-project"

mcp call health_check

The codebase analyzer now supports multiple projects with complete isolation:
- Each project has its own namespace in the database
- Symbols from different projects are completely separated
- You can search within a project or across all projects
- Project-specific statistics and management
- Use meaningful project IDs (e.g., "frontend-app", "backend-api", "shared-libs")
- Default project ID is "default" for backward compatibility
- Project IDs are case-sensitive and should be unique
# Index multiple projects
mcp call index_codebase "/path/to/frontend" "frontend-app"
mcp call index_codebase "/path/to/backend" "backend-api"
mcp call index_codebase "/path/to/shared" "shared-libs"
# List all projects
mcp call list_projects
# Search within a specific project
mcp call search_symbol_by_name "UserComponent" "typescript" "frontend-app"
# Search across all projects
mcp call search_symbol_semantic "database connection" 5
# Get project-specific stats
mcp call get_stats "frontend-app"
# Clean up a project
mcp call delete_project "old-project"

The server provides real-time progress updates for long-running operations.

During indexing:
- File Discovery: Shows how many files were found
- File Processing: Updates for each file being processed
- Completion: Final statistics and summary

During project deletion:
- Preparation: Counting symbols to delete
- Deletion: Removing symbols and embeddings
- Completion: Confirmation of deletion

Client requirements:
- Clients must support progress tokens to receive progress updates
- Progress updates are sent via the MCP context
- If progress tokens aren't supported, operations still work but without progress feedback

The codebase analyzer exposes indexed data through FastMCP resources, providing direct access to project information, symbols, and search results without requiring tool calls.

- URI: codebase://stats/{project_id}
- Description: Get database statistics for a specific project
- Example: codebase://stats/my-project

- URI: codebase://symbols/{project_id}
- Description: Get all symbols for a specific project
- Example: codebase://symbols/my-project

- URI: codebase://symbols/{project_id}/{language}
- Description: Get all symbols for a specific project and language
- Example: codebase://symbols/my-project/python

- URI: codebase://files/{project_id}
- Description: Get all files indexed for a specific project
- Example: codebase://files/my-project

- URI: codebase://languages/{project_id}
- Description: Get all languages used in a specific project with symbol counts
- Example: codebase://languages/my-project

- URI: codebase://search/{project_id}/{query}
- Description: Search for symbols by name within a specific project
- Example: codebase://search/my-project/UserService

- Direct Access: No need for tool calls to access indexed data
- REST-like Patterns: Familiar URI structure for easy integration
- Automatic Serialization: JSON responses with proper MIME types
- LLM-Friendly: Structured data that's easy for LLMs to consume
- Real-time: Always reflects the current state of indexed data
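
Since the resource URIs follow a fixed pattern, client code can build and parse them with the standard library. The sketch below is illustrative only: `build_resource_uri` and `parse_resource_uri` are hypothetical helpers, and it assumes the server accepts percent-encoded path segments, which this README does not confirm.

```python
from urllib.parse import quote, unquote, urlsplit


def build_resource_uri(kind: str, *parts: str) -> str:
    """Build a codebase:// URI, percent-encoding each path segment."""
    encoded = "/".join(quote(part, safe="") for part in parts)
    return f"codebase://{kind}/{encoded}"


def parse_resource_uri(uri: str) -> tuple[str, list[str]]:
    """Split a codebase:// URI into its kind and decoded path segments."""
    split = urlsplit(uri)
    if split.scheme != "codebase":
        raise ValueError(f"unexpected scheme: {split.scheme!r}")
    # The resource kind (stats, symbols, search, ...) lands in the netloc slot.
    segments = [unquote(segment) for segment in split.path.split("/") if segment]
    return split.netloc, segments


if __name__ == "__main__":
    uri = build_resource_uri("search", "my-project", "User Service")
    print(uri)
    print(parse_resource_uri(uri))
```

Encoding each segment separately keeps a query that contains a slash or space from being mistaken for extra path components.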
# Get project statistics
stats = await client.read_resource("codebase://stats/my-project")
# Get all Python symbols in a project
symbols = await client.read_resource("codebase://symbols/my-project/python")
# Search for a specific function
results = await client.read_resource("codebase://search/my-project/calculateTotal")
# Get all files in a project
files = await client.read_resource("codebase://files/my-project")

All resources return JSON data with consistent structure:
{
"project_id": "my-project",
"total_symbols": 150,
"symbols": [
{
"id": 1,
"project_id": "my-project",
"name": "calculateTotal",
"symbol_type": "function",
"language": "python",
"file_path": "/path/to/file.py",
"line_start": 10,
"line_end": 15,
"code_snippet": "def calculateTotal(items):\n return sum(items)"
}
]
}

The codebase analyzer now supports code relationship mapping, allowing AI to understand how code elements relate to each other. This enables powerful queries like "show me all callers of this function" or "find all implementations of this protocol".
The analyzer tracks several types of code relationships:
- Function Calls: Which functions call other functions
- Class Inheritance: Which classes inherit from other classes
- Interface Implementation: Which classes implement interfaces/protocols
- Method Calls: Which methods call other methods
- Dependencies: Cross-file and cross-module dependencies
# Find all callers of a specific function
result = await client.call_tool("find_function_callers", {
"function_name": "calculateTotal",
"project_id": "my-project"
})

# Find all implementations of an interface/protocol
result = await client.call_tool("find_interface_implementations", {
"interface_name": "DataProcessor",
"project_id": "my-project"
})

# Get all relationships for a specific symbol
result = await client.call_tool("get_symbol_relationships", {
"symbol_name": "UserService",
"relationship_type": "calls", # Optional filter
"direction": "both", # "incoming", "outgoing", or "both"
"project_id": "my-project"
})

# Get a complete dependency graph for the project
result = await client.call_tool("get_dependency_graph", {
"project_id": "my-project",
"max_depth": 3
})

# Analyze the call hierarchy for a specific function
result = await client.call_tool("analyze_call_hierarchy", {
"function_name": "main",
"project_id": "my-project",
"max_depth": 3
})

- URI: codebase://callers/{project_id}/{function_name}
- Description: Get all callers of a specific function
- Example: codebase://callers/my-project/calculateTotal

- URI: codebase://implementations/{project_id}/{interface_name}
- Description: Get all implementations of an interface/protocol
- Example: codebase://implementations/my-project/DataProcessor

- URI: codebase://relationships/{project_id}/{symbol_name}
- Description: Get all relationships for a specific symbol
- Example: codebase://relationships/my-project/UserService

- URI: codebase://dependencies/{project_id}
- Description: Get dependency graph for a project
- Example: codebase://dependencies/my-project

- URI: codebase://hierarchy/{project_id}/{function_name}
- Description: Get call hierarchy for a specific function
- Example: codebase://hierarchy/my-project/main

# Find all functions that would be affected if we change calculateTotal
callers = await client.read_resource("codebase://callers/my-project/calculateTotal")
print(f"Changing calculateTotal would affect {len(callers['callers'])} functions")

# Find all classes that implement a specific interface
implementations = await client.read_resource("codebase://implementations/my-project/DataProcessor")
for impl in implementations['implementations']:
    print(f"Found implementation: {impl['implementation']['name']}")

# Get the complete dependency graph
graph = await client.read_resource("codebase://dependencies/my-project")
print(f"Project has {graph['total_nodes']} symbols with {graph['total_edges']} relationships")

# Analyze the call hierarchy for the main function
hierarchy = await client.read_resource("codebase://hierarchy/my-project/main")
print(f"Main function calls {len(hierarchy['callees'])} other functions")
print(f"Main function is called by {len(hierarchy['callers'])} functions")

Relationships are stored with the following structure:
{
"source_symbol_id": 123,
"target_symbol_id": 456,
"relationship_type": "calls",
"relationship_data": {
"line": 15,
"target_type": "function"
}
}

- Impact Analysis: Understand what code would be affected by changes
- Refactoring Support: Find all usages before refactoring
- Architecture Understanding: Visualize code dependencies and relationships
- Code Navigation: Navigate through call chains and inheritance hierarchies
- Documentation: Automatically generate relationship documentation
- Testing: Identify which functions need testing based on usage
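
To make the impact-analysis idea concrete, the sketch below walks relationship records of the shape shown earlier to find direct and transitive callers up to a depth limit. It is an in-memory illustration under stated assumptions, not the analyzer's actual query logic: the sample IDs are made up, the "inherits" type string is a guess, and both helper functions are hypothetical.

```python
from collections import defaultdict, deque

# Sample records using the documented field names. The IDs are invented and
# the "inherits" type string is an assumption for illustration.
RELATIONSHIPS = [
    {"source_symbol_id": 1, "target_symbol_id": 2, "relationship_type": "calls"},
    {"source_symbol_id": 2, "target_symbol_id": 3, "relationship_type": "calls"},
    {"source_symbol_id": 4, "target_symbol_id": 3, "relationship_type": "calls"},
    {"source_symbol_id": 5, "target_symbol_id": 4, "relationship_type": "inherits"},
]


def build_caller_index(relationships):
    """Map each target symbol ID to the set of symbol IDs that call it."""
    callers = defaultdict(set)
    for rel in relationships:
        if rel["relationship_type"] == "calls":
            callers[rel["target_symbol_id"]].add(rel["source_symbol_id"])
    return callers


def transitive_callers(callers, symbol_id, max_depth=3):
    """Breadth-first walk up the call graph, returning {caller_id: depth}."""
    seen = {}
    queue = deque([(symbol_id, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue  # Do not expand beyond the requested depth.
        for caller in callers.get(current, ()):
            if caller not in seen and caller != symbol_id:
                seen[caller] = depth + 1
                queue.append((caller, depth + 1))
    return seen


if __name__ == "__main__":
    index = build_caller_index(RELATIONSHIPS)
    print(transitive_callers(index, 3))
```

The depth limit plays the same role as the max_depth parameter in the tool calls above: it bounds how far the walk travels from the starting symbol.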
# Index a Python project
await index_codebase("/path/to/python/app", "python-app")
# Search for a specific function
results = await search_symbol_by_name("calculate_total", "python", "python-app")
# Semantic search for authentication functions
results = await search_symbol_semantic("user authentication login", 5, "python-app")

# Index multiple related projects
await index_codebase("/path/to/monorepo/frontend", "frontend")
await index_codebase("/path/to/monorepo/backend", "backend")
await index_codebase("/path/to/monorepo/shared", "shared")
# Cross-project search
results = await search_symbol_semantic("API endpoint", 10) # Searches all projects
# Project-specific search
results = await search_symbol_by_name("UserService", "typescript", "frontend")

- Database Layer (db.py)
  - SQLite with sqlite-vec for vector storage
  - Symbol metadata and embeddings tables
  - Incremental update support via file hashing
- Embedding Manager (embeddings.py)
  - SpaCy integration for vector generation
  - Code-specific text preprocessing
  - Batch processing support
- Code Parser (parsers/code_parser.py)
  - Tree-sitter WASM grammar loading
  - Language detection and symbol extraction
  - Query-based pattern matching
- FastMCP Server (main.py)
  - Async command registration
  - Component orchestration
  - Error handling and logging
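
The database layer's incremental updates rely on file hashing. One way this could work is sketched below: hash each file and re-parse only those whose hash differs from the stored value. SHA-256 and the helper names `file_hash` and `files_to_reparse` are assumptions for illustration; the README does not say which hash db.py uses.

```python
import hashlib
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_reparse(paths, stored_hashes):
    """Return the paths that are new or whose hash differs from the stored one."""
    changed = []
    for path in paths:
        if stored_hashes.get(str(path)) != file_hash(path):
            changed.append(path)
    return changed
```

Here `stored_hashes` stands in for a lookup against the file_hash column; a file that was never indexed has no stored hash and is therefore always returned.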
-- Symbols table
CREATE TABLE symbols (
id INTEGER PRIMARY KEY,
language TEXT NOT NULL,
symbol_type TEXT NOT NULL,
name TEXT NOT NULL,
file_path TEXT NOT NULL,
line_start INTEGER NOT NULL,
line_end INTEGER NOT NULL,
code_snippet TEXT NOT NULL,
file_hash TEXT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
UNIQUE(language, name, file_path, line_start)
);
-- Vector embeddings table
CREATE VIRTUAL TABLE symbol_embeddings
USING vec0(
id INTEGER PRIMARY KEY,
embedding FLOAT[300]
);

- SPACY_MODEL: SpaCy model name (default: en_core_web_md)
- DB_PATH: Database file path (default: codebase_analyzer.db)
- LOG_LEVEL: Logging level (default: INFO)

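
The three environment variables and their defaults can be read in one place. The following is a minimal sketch; the `Settings` dataclass and `load_settings` function are hypothetical and may not match how main.py actually reads its configuration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented configuration values."""
    spacy_model: str
    db_path: str
    log_level: str


def load_settings(env=None) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        spacy_model=env.get("SPACY_MODEL", "en_core_web_md"),
        db_path=env.get("DB_PATH", "codebase_analyzer.db"),
        # Normalizing case here is a choice made by this sketch.
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dict as `env` makes the function easy to exercise without touching the real process environment.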
- Add the Tree-sitter WASM grammar to grammars/
- Create a query file in queries/ (see existing examples)
- Update the language mapping in parsers/code_parser.py

# In main.py, modify the EmbeddingManager initialization:
embedding_manager = EmbeddingManager(model_name="en_core_web_trf")

- Use SSD storage for better database performance
- Increase batch size for large codebases
- Use en_core_web_trf for better semantic search quality
- Monitor memory usage with large embedding models

- SpaCy model not found:
  python -m spacy download en_core_web_md
- sqlite-vec not available:
  pip install sqlite-vec
  Note: If you get "no such module: vec0" errors, the server will automatically fall back to a simpler storage method. Semantic search functionality will be limited but the server will still work.
- Tree-sitter grammar errors:
  - Check that WASM files are in grammars/
  - Verify query files exist in queries/
- Memory issues with large codebases:
  - Process in smaller batches
  - Use smaller embedding models
  - Monitor system resources
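
As an illustration of processing in smaller batches, the sketch below splits any iterable of files or symbols into fixed-size chunks so that only one chunk is held in memory at a time. The `in_batches` helper is hypothetical and is not part of the analyzer's API.

```python
from itertools import islice
from typing import Iterable, Iterator


def in_batches(items: Iterable, batch_size: int) -> Iterator[list]:
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls the next batch_size items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


if __name__ == "__main__":
    for batch in in_batches(range(5), 2):
        print(batch)
```

A smaller batch size lowers peak memory at the cost of more round trips to the embedding model and database.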
Enable debug logging:
export LOG_LEVEL=DEBUG
fastmcp run main.py

codebase-analyzer/
├── main.py # FastMCP server entrypoint
├── db.py # Database layer
├── embeddings.py # SpaCy embedding manager
├── parsers/ # Modular language parsers
├── grammars/ # Tree-sitter WASM grammars
├── queries/ # Language-specific queries
├── pyproject.toml # Dependencies and metadata
└── README.md # This file

- New MCP Commands: Add @mcp_command decorators in main.py
- Database Schema: Modify db.py and run migrations
- Language Support: Add grammars and queries
- Embedding Models: Extend embeddings.py

[Add your license here]
[Add contribution guidelines here]