This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Semcode is a semantic code search tool written in Rust that indexes C/C++ codebases using machine learning embeddings. It consists of several binaries:
- `semcode-index`: Analyzes and indexes codebases using the CodeBERT model
- `semcode`: Interactive query tool for searching the indexed code
- `semcode-mcp`: Model Context Protocol server for Claude Desktop integration
- `semcode-lsp`: Language Server Protocol server for editor integration (see docs/lsp-server.md)
Install required system dependencies:

```sh
# Ubuntu/Debian
sudo apt-get install build-essential libclang-dev protobuf-compiler libprotobuf-dev

# Fedora
sudo dnf install gcc-c++ clang-devel protobuf-compiler protobuf-devel
```

Note: The C++ compiler (g++/gcc-c++) is required for building dependencies like esaxx-rs.
```sh
# Build release binaries (recommended)
cargo build --release

# Use build script (creates symlinks in ./bin/)
./build.sh

# Build with test code samples
./build.sh --with-test
```

MANDATORY before committing:
```sh
# Format code (ALWAYS run this before committing)
cargo fmt

# Run clippy linter (treat warnings as errors)
cargo clippy --all-targets -- -D warnings
```

Clippy must pass with zero warnings before code can be committed. All clippy warnings should be fixed, not silenced with `allow` attributes unless there's a very good reason.
This repository uses shared git hooks to automatically enforce code formatting and linting. The hooks are tracked in the hooks/ directory and shared with all developers.
### Setup for New Developers
After cloning the repository, run the setup script to enable the hooks:
```sh
./setup-hooks.sh
```

Alternatively, you can manually configure the hooks:

```sh
git config core.hooksPath hooks
```

### Active Hooks
- pre-commit: Runs comprehensive checks before each commit:
  - `cargo fmt --check` - Verifies code formatting
  - `cargo check --all-targets` - Verifies code compiles
  - `cargo test` - Runs all tests
- pre-push: Runs additional quality checks before each push:
  - `cargo fmt --check` - Verifies code formatting
  - `cargo clippy --all-targets -- -D warnings` - Checks for code quality issues
  - `cargo test` - Runs all tests
These hooks ensure that improperly formatted, non-compiling, or failing code cannot be committed or pushed to the repository.
If a hook fails:
- For formatting issues: Run `cargo fmt` to format your code
- For compilation errors: Run `cargo check --all-targets` to see the errors and fix them
- For clippy issues: Fix the warnings/errors reported by clippy
- For test failures: Fix the failing tests
- Stage your changes with `git add` (if needed)
- Try your commit or push again
### Hook Location
The git hooks are stored in the hooks/ directory (tracked by git) and are shared across all developers. This ensures consistent code quality enforcement for everyone working on the project.
Semcode uses the following search order to locate the .semcode.db database directory:
For semcode-index:
- `-d` flag: If provided, use the specified path (direct database path or parent directory containing `.semcode.db`)
- `SEMCODE_DB` environment variable: Same path semantics as `-d`
- Source directory: Look for `.semcode.db` in the source directory specified by `-s`
- Current directory: Fall back to `./.semcode.db` in the current working directory
For semcode (query tool), semcode-mcp, and semcode-lsp:
- `-d` flag / configuration: If provided, use the specified path (direct database path or parent directory containing `.semcode.db`)
- `SEMCODE_DB` environment variable: Same path semantics as `-d`
- Workspace/Current directory: Use `./.semcode.db` in the workspace or current working directory
The `-d` flag can specify either:
- A direct path to the database directory (e.g., `./my-custom.db`)
- A parent directory containing `.semcode.db` (e.g., `-d /path/to/project` will use `/path/to/project/.semcode.db`)
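As a rough illustration, the search order above could be sketched like this. This is a std-only sketch; `resolve_db_path` and its parameter names are hypothetical, not the real semcode API:

```rust
use std::path::{Path, PathBuf};

/// Hypothetical sketch of the semcode-index search order described above.
/// `cli_db` stands for the -d flag, `env_db` for SEMCODE_DB, and
/// `source_dir` for the -s source directory.
fn resolve_db_path(
    cli_db: Option<&Path>,
    env_db: Option<&Path>,
    source_dir: Option<&Path>,
) -> PathBuf {
    // A given path may point directly at the database directory, or at a
    // parent directory that contains .semcode.db.
    fn normalize(p: &Path) -> PathBuf {
        let nested = p.join(".semcode.db");
        if nested.is_dir() {
            nested
        } else {
            p.to_path_buf()
        }
    }
    if let Some(p) = cli_db {
        return normalize(p); // 1. -d flag wins
    }
    if let Some(p) = env_db {
        return normalize(p); // 2. SEMCODE_DB, same semantics
    }
    if let Some(src) = source_dir {
        return src.join(".semcode.db"); // 3. source directory
    }
    PathBuf::from("./.semcode.db") // 4. current working directory
}
```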
Typical workflow:
```sh
# Index a codebase - creates /path/to/code/.semcode.db
./bin/semcode-index --source /path/to/code

# Query from within the indexed directory
cd /path/to/code
semcode  # Automatically finds ./.semcode.db

# Or query from elsewhere
semcode --database /path/to/code  # Uses /path/to/code/.semcode.db
```

Other indexing options:
```sh
# Basic analysis only (no vectors)
./bin/semcode-index --source /path/to/code

# Index files modified in git commit range
./bin/semcode-index --source /path/to/code --git HEAD~10..HEAD

# Custom database location (overrides search order)
./bin/semcode-index --source /path/to/code --database ./custom.db
```

All features that query the database MUST use git-aware lookup functions.
Semcode indexes codebases at specific git commits and stores multiple versions of functions, types, and macros as the codebase evolves. When implementing any feature that looks up code entities, always use git-aware functions to ensure you're finding the correct version that matches the user's current working directory.
- Obtain the current git SHA:

```rust
use semcode::git::get_git_sha;

let git_sha = get_git_sha(&repo_path)?
    .ok_or_else(|| anyhow::anyhow!("Not a git repository"))?;
```

- Use git-aware lookup functions:

```rust
// ✅ CORRECT: Git-aware function lookup
let function = db.find_function_git_aware(name, &git_sha).await?;

// ❌ WRONG: Non-git-aware lookup (may return wrong version)
let function = db.find_function(name).await?;
```

- Pass git_repo_path to DatabaseManager:

```rust
// DatabaseManager needs git_repo_path for git-aware resolution
let db = DatabaseManager::new(&db_path, git_repo_path).await?;
```
In DatabaseManager (src/database/connection.rs):
- `find_function_git_aware(name: &str, git_sha: &str)` - Find function at specific commit
- `find_macro_git_aware(name: &str, git_sha: &str)` - Find macro at specific commit
- `get_function_callees_git_aware(name: &str, git_sha: &str)` - Get callees at specific commit
- `build_callchain_with_manifest()` - Call chain analysis with git manifest
Non-git-aware functions like `find_function()` should only be used:
- As a fallback when git SHA cannot be determined (not in a git repo)
- For administrative/debugging operations that need to see all versions
- When the operation explicitly requires seeing historical data across commits
```rust
// ✅ CORRECT IMPLEMENTATION
async fn my_new_feature(db: &DatabaseManager, repo_path: &str) -> Result<()> {
    // 1. Get current git SHA
    let git_sha = semcode::git::get_git_sha(repo_path)?
        .ok_or_else(|| anyhow::anyhow!("Not a git repository"))?;

    // 2. Use git-aware lookup
    if let Some(func) = db.find_function_git_aware("my_func", &git_sha).await? {
        println!("Found function at current commit: {}", func.name);
    }
    Ok(())
}

// ❌ WRONG IMPLEMENTATION (may return wrong version)
async fn my_bad_feature(db: &DatabaseManager) -> Result<()> {
    if let Some(func) = db.find_function("my_func").await? {
        println!("Found function (but which version?): {}", func.name);
    }
    Ok(())
}
```

Without git-aware lookups:
- Users may jump to outdated function definitions
- Call chains may include deleted or renamed functions
- Type information may not match current code structure
- Results are confusing and incorrect for active development
Remember: When in doubt, use git-aware functions!
- The database is very large. No operation should be implemented via a full table scan unless that operation is effectively a full table scan by nature.
- If you're tempted to do a full table scan, remember that LanceDB queries support an `.only_if()` filter that lets you search without scanning the full table. Example: `.only_if(format!("name = '{}'", escaped_name))`.
Example `only_if()` usage for regex searches in `grep_function_bodies()`:

```rust
// Escape special characters in the pattern for a SQL string literal
let escaped_pattern = pattern
    .replace("\\", "\\\\") // Escape backslashes first
    .replace("'", "''"); // Escape single quotes for SQL

let where_clause = format!("regexp_match(content, '{}')", escaped_pattern);
let content_results = content_table
    .query()
    .only_if(&where_clause)
    .execute()
    .await?;
```

Example bulk insertion with duplicate removal:
```rust
let tmpdir = tempfile::tempdir().unwrap();
let db = lancedb::connect(tmpdir.path().to_str().unwrap())
    .execute()
    .await
    .unwrap();

let new_data = RecordBatchIterator::new(
    vec![RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int32Array::from_iter_values(0..10)),
            Arc::new(
                FixedSizeListArray::from_iter_primitive::<Float32Type, _, _>(
                    (0..10).map(|_| Some(vec![Some(1.0); 128])),
                    128,
                ),
            ),
        ],
    )
    .unwrap()]
    .into_iter()
    .map(Ok),
    schema.clone(),
);

// Perform an upsert operation
let mut merge_insert = tbl.merge_insert(&["id"]);
merge_insert
    .when_matched_update_all(None)
    .when_not_matched_insert_all();
merge_insert.execute(Box::new(new_data)).await.unwrap();
```
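The pattern-escaping step from the `grep_function_bodies()` example above can be isolated and sanity-checked on its own. This is a minimal std-only sketch (not the actual semcode helper), showing why backslashes must be doubled before quotes:

```rust
/// Escape a pattern for embedding in a SQL single-quoted string literal.
/// Backslashes are escaped first; doing it after quote-escaping would
/// corrupt the quote escapes.
fn escape_for_sql_literal(pattern: &str) -> String {
    pattern
        .replace('\\', "\\\\") // escape backslashes first
        .replace('\'', "''") // then escape single quotes for SQL
}
```

The escaped pattern can then be interpolated into a `where` clause such as `format!("regexp_match(content, '{}')", escaped)`.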
- `treesitter_analyzer.rs`: C/C++ code analysis using Tree-sitter; extracts functions, types, macros
- `vectorizer.rs`: CodeBERT-based semantic embedding generation using ONNX Runtime
- `database/`: LanceDB vector database operations and schema management
- `types.rs`: Core data structures (FunctionInfo, TypeInfo, MacroInfo, etc.)
- `pipeline.rs`: Multi-stage pipeline processing for optimal CPU utilization
- `src/bin/index.rs`: Indexing tool with parallel processing and batch operations
- `src/bin/query.rs`: Interactive query interface with semantic and exact search
- `src/pipeline.rs`: Pipeline processing implementation for continuous CPU utilization
- Pipeline processing: Continuous processing with optimal CPU utilization (always enabled)
- Parallel processing: Configurable via the `-j threads[:sessions[:batch]]` format
- Selective macro indexing: Only indexes function-like macros (reduces noise by 95%+)
- GPU acceleration: Automatic CUDA detection when the `--gpu` flag is used
- Quantized models: INT8 quantization for 2-4x faster CPU inference
- Embedded relationships: Call and type relationships stored as JSON arrays for ~10-15x performance
All indexing uses pipeline processing for optimal performance:
```sh
# Standard indexing with pipeline processing (creates /path/to/code/.semcode.db)
semcode-index --source /path/to/code
```

Pipeline processing provides:
- Continuous processing without gaps between batches
- Optimal CPU utilization (no idle periods)
- Parallel file parsing, deduplication, and database insertion
- Adaptive batch sizing based on processing speed
- Progress monitoring and performance statistics
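The pipeline shape described above — each stage streaming work to the next so no stage idles between batches — can be illustrated with a std-only sketch. This is a generic toy (channels and threads standing in for semcode's actual parse/dedup/insert stages), not the real implementation:

```rust
use std::collections::HashSet;
use std::sync::mpsc;
use std::thread;

// Toy three-stage pipeline: "parse" -> dedup -> "insert".
// Each stage runs on its own thread and streams items over a channel,
// so downstream stages start working before upstream stages finish.
fn run_pipeline(files: Vec<String>) -> usize {
    let (parsed_tx, parsed_rx) = mpsc::channel();
    let (unique_tx, unique_rx) = mpsc::channel();

    // Stage 1: "parse" each file (stand-in for Tree-sitter analysis).
    let parser = thread::spawn(move || {
        for f in files {
            parsed_tx.send(f).unwrap();
        }
    });

    // Stage 2: deduplicate by content as items arrive.
    let dedup = thread::spawn(move || {
        let mut seen = HashSet::new();
        for item in parsed_rx {
            if seen.insert(item.clone()) {
                unique_tx.send(item).unwrap();
            }
        }
    });

    // Stage 3: "insert" into the database (stand-in: count records).
    let inserted = unique_rx.into_iter().count();
    parser.join().unwrap();
    dedup.join().unwrap();
    inserted
}
```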
Use the `-j` option to control parallelism:

```sh
# 16 analysis threads, 4 ORT sessions, batch size 32
semcode-index --vectors -j 16:4:32 --source /path/to/code
```

Semcode uses Tree-sitter for C/C++ code analysis:
- Uses Tree-sitter's fast incremental parser for AST analysis
- Fast processing suitable for large codebases
- Does not require compilation flags or headers to be present
- Extracts functions, types, macros, and call relationships
- Embedded JSON approach stores relationships directly in records
- Exclude `.git` and `target` directories from your source code grep commands
- Uses standard Rust formatting and conventions
- Extensive use of async/await for database operations
- Error handling via `anyhow::Result`
- Parallel processing with the `rayon` crate
- Progress reporting with `indicatif`
- Use gitoxide (`gix`) for all git access, never command-line git
Semcode uses anstream + owo-colors for automatic color handling:
- Automatic TTY detection: `anstream::stdout()` automatically strips ANSI codes when output isn't a terminal
- Consistent coloring: `owo-colors` provides a clean API for colorized text
- No manual conditionals: Code always outputs colored text; anstream handles compatibility
Example colorized output:

```rust
use anstream::stdout;
use owo_colors::OwoColorize as _;
use std::io::Write;

fn display_function_info(name: &str, file: &str, line: u32, writer: &mut dyn Write) -> Result<()> {
    writeln!(writer, "Function: {}", name.yellow())?;
    writeln!(
        writer,
        "File: {} ({}:{})",
        file.cyan(),
        "line".bright_black(),
        line.to_string().bright_white()
    )?;
    Ok(())
}

// Usage: always outputs colors, anstream handles TTY detection
display_function_info("main", "src/main.rs", 42, &mut stdout())?;
```

- Tree-sitter: Fast incremental parsing for C/C++ code analysis
- LanceDB: Vector database for embeddings storage
- ONNX Runtime: CodeBERT model inference
- Tokenizers: Text preprocessing for the ML model
- Rayon: Parallel processing framework
```sh
# Create test code samples
./build.sh --with-test

# Test indexing on sample code (creates ./test_code/.semcode.db)
./bin/semcode-index --source ./test_code --vectors

# Test querying (from test_code directory)
cd test_code && ../bin/semcode

# Or query from current directory
./bin/semcode --database ./test_code
```

Model cache location:
- Linux/macOS: `~/.cache/semcode/models/`
Uses LanceDB with an embedded JSON relationship approach. For complete schema details, see docs/schema.md.
Summary of key tables:
- See docs/schema.md
- Note that the function and macro tables have different schemas. Many queries are expected to search both, and must verify they are using the correct schema for each table.
- Embedded JSON Relationships: Call and type relationships are stored as JSON arrays within each entity's record rather than separate mapping tables
- Git SHA-based Deduplication: Each record includes a git file hash for content-based uniqueness and incremental processing
- Complete Source Storage: Full function bodies, type definitions, and macro expansions are stored for context
- Optimized Indexing: BTree indices on names, file paths, and git hashes for fast queries
- Vector Integration: Separate vectors table allows optional semantic search capabilities
- Git Integration: Expected to run in a git repository, storing information on a per-commit basis
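As a rough illustration of the embedded-JSON idea above (hypothetical struct and field names; the real schema lives in docs/schema.md):

```rust
// Hypothetical record shape for illustration only.
struct FunctionRecord {
    name: String,
    git_file_hash: String, // content hash used for deduplication
    calls_json: String,    // embedded relationships, e.g. ["kmalloc","memset"]
}

fn make_record(name: &str, git_file_hash: &str, calls: &[&str]) -> FunctionRecord {
    // Minimal JSON-array encoding; a real implementation would use serde_json.
    let calls_json = format!(
        "[{}]",
        calls
            .iter()
            .map(|c| format!("\"{}\"", c))
            .collect::<Vec<_>>()
            .join(",")
    );
    FunctionRecord {
        name: name.to_string(),
        git_file_hash: git_file_hash.to_string(),
        calls_json,
    }
}
```

Because the call edges live inside the row itself, fetching a function also fetches its relationships without joining against a separate mapping table.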
Honors standard proxy environment variables (HTTP_PROXY, HTTPS_PROXY, NO_PROXY) for model downloads.
- Regex Pattern Support: Full regex syntax supported via LanceDB's `regexp_match()` function
- Path Filtering: Optional secondary regex filter on file paths
- Performance Optimized: Uses content deduplication and efficient batching for large codebases
- Limit Control: Configurable result limits with limit-hit detection
- Parallel Processing: Automatically uses parallel queries for large result sets
The search leverages Semcode's content deduplication system:
- Function bodies are stored in a separate `content` table with Blake3 hashes
- Regex search first finds matching content entries
- Content hashes are then used to locate corresponding functions
- Path filtering is applied as a post-processing step for refined results
This approach provides excellent performance even on large codebases by avoiding full table scans and utilizing LanceDB's efficient filtering capabilities.
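The two-step lookup above can be sketched with std-only code. Here `DefaultHasher` stands in for Blake3, substring matching stands in for `regexp_match()`, and the maps stand in for the `content` and function tables — none of this is the real semcode implementation:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Stand-in for Blake3 content hashing.
fn hash_content(body: &str) -> u64 {
    let mut h = DefaultHasher::new();
    body.hash(&mut h);
    h.finish()
}

// Content is stored once per unique hash; each function references its body
// by hash. Searching the smaller deduplicated content table first, then
// fanning back out to functions, avoids scanning every function row.
fn search_bodies<'a>(
    content: &HashMap<u64, String>,      // hash -> unique body text
    functions: &'a HashMap<String, u64>, // function name -> content hash
    needle: &str,                        // substring stand-in for a regex
) -> Vec<&'a str> {
    // Step 1: find matching content entries.
    let hits: Vec<u64> = content
        .iter()
        .filter(|(_, body)| body.contains(needle))
        .map(|(h, _)| *h)
        .collect();
    // Step 2: map content hashes back to functions.
    let mut names: Vec<&str> = functions
        .iter()
        .filter(|(_, h)| hits.contains(h))
        .map(|(name, _)| name.as_str())
        .collect();
    names.sort();
    names
}
```

Note how two functions with identical bodies share one content entry, so the pattern is matched once but both functions are reported.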
- Don't use `tracing::debug!()` for debug logging; there's too much other verbosity at that level. Use `tracing::info!()` instead.