CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Semcode is a semantic code search tool written in Rust that indexes C/C++ codebases using machine learning embeddings. It consists of several binaries:

semcode-index: Analyzes and indexes codebases using the CodeBERT model
semcode: Interactive query tool for searching the indexed code
semcode-mcp: Model Context Protocol server for Claude Desktop integration
semcode-lsp: Language Server Protocol server for editor integration (see docs/lsp-server.md)

Build and Development Commands

Prerequisites

Install required system dependencies:

# Ubuntu/Debian
sudo apt-get install build-essential libclang-dev protobuf-compiler libprotobuf-dev

# Fedora
sudo dnf install gcc-c++ clang-devel protobuf-compiler protobuf-devel

Note: The C++ compiler (g++/gcc-c++) is required for building dependencies like esaxx-rs.

Build Commands

# Build release binaries (recommended)
cargo build --release

# Use build script (creates symlinks in ./bin/)
./build.sh

# Build with test code samples
./build.sh --with-test

Code Quality Commands

MANDATORY before committing:

# Format code (ALWAYS run this before committing)
cargo fmt

# Run clippy linter (treat warnings as errors)
cargo clippy --all-targets -- -D warnings

Clippy must pass with zero warnings before code can be committed. All clippy warnings should be fixed, not silenced with allow attributes unless there's a very good reason.

Git Hooks

This repository uses shared git hooks to automatically enforce code formatting and linting. The hooks are tracked in the hooks/ directory and shared with all developers.

Setup for New Developers

After cloning the repository, run the setup script to enable the hooks:

./setup-hooks.sh

Alternatively, you can manually configure the hooks:

git config core.hooksPath hooks

Active Hooks

pre-commit: Runs comprehensive checks before each commit:
- cargo fmt --check - Verifies code formatting
- cargo check --all-targets - Verifies code compiles
- cargo test - Runs all tests
pre-push: Runs additional quality checks before each push:
- cargo fmt --check - Verifies code formatting
- cargo clippy --all-targets -- -D warnings - Checks for code quality issues
- cargo test - Runs all tests

These hooks ensure that improperly formatted, non-compiling, or failing code cannot be committed or pushed to the repository.

If a hook fails:

For formatting issues: Run cargo fmt to format your code
For compilation errors: Run cargo check --all-targets to see the errors and fix them
For clippy issues: Fix the warnings/errors reported by clippy
For test failures: Fix the failing tests
Stage your changes with git add (if needed)
Try your commit or push again

Hook Location

The git hooks are stored in the hooks/ directory (tracked by git) and are shared across all developers. This ensures consistent code quality enforcement for everyone working on the project.

Database Location

Semcode uses the following search order to locate the .semcode.db database directory:

For semcode-index:

-d flag: If provided, use the specified path (direct database path or parent directory containing .semcode.db)
SEMCODE_DB environment variable: Same path semantics as -d
Source directory: Look for .semcode.db in the source directory specified by -s
Current directory: Fall back to ./.semcode.db in the current working directory

For semcode (query tool), semcode-mcp, and semcode-lsp:

-d flag / configuration: If provided, use the specified path (direct database path or parent directory containing .semcode.db)
SEMCODE_DB environment variable: Same path semantics as -d
Workspace/Current directory: Use ./.semcode.db in the workspace or current working directory

The -d flag can specify either:

A direct path to the database directory (e.g., ./my-custom.db)
A parent directory containing .semcode.db (e.g., -d /path/to/project will use /path/to/project/.semcode.db)

Running the Tools

Typical workflow:

# Index a codebase - creates /path/to/code/.semcode.db
./bin/semcode-index --source /path/to/code

# Query from within the indexed directory
cd /path/to/code
semcode  # Automatically finds ./.semcode.db

# Or query from elsewhere
semcode --database /path/to/code  # Uses /path/to/code/.semcode.db

Other indexing options:

# Basic analysis only (no vectors)
./bin/semcode-index --source /path/to/code

# Index files modified in git commit range
./bin/semcode-index --source /path/to/code --git HEAD~10..HEAD

# Custom database location (overrides search order)
./bin/semcode-index --source /path/to/code --database ./custom.db

Architecture

Git-Aware Operations (IMPORTANT)

All features that query the database MUST use git-aware lookup functions.

Semcode indexes codebases at specific git commits and stores multiple versions of functions, types, and macros as the codebase evolves. When implementing any feature that looks up code entities, always use git-aware functions to ensure you're finding the correct version that matches the user's current working directory.

Required Approach

Obtain the current git SHA:

use semcode::git::get_git_sha;

let git_sha = get_git_sha(&repo_path)?
    .ok_or_else(|| anyhow::anyhow!("Not a git repository"))?;

Use git-aware lookup functions:

// ✅ CORRECT: Git-aware function lookup
let function = db.find_function_git_aware(name, &git_sha).await?;

// ❌ WRONG: Non-git-aware lookup (may return wrong version)
let function = db.find_function(name).await?;

Pass git_repo_path to DatabaseManager:

// DatabaseManager needs git_repo_path for git-aware resolution
let db = DatabaseManager::new(&db_path, git_repo_path).await?;

Available Git-Aware Functions

In DatabaseManager (src/database/connection.rs):

find_function_git_aware(name: &str, git_sha: &str) - Find function at specific commit
find_macro_git_aware(name: &str, git_sha: &str) - Find macro at specific commit
get_function_callees_git_aware(name: &str, git_sha: &str) - Get callees at specific commit
build_callchain_with_manifest() - Call chain analysis with git manifest

When to Use Non-Git-Aware Functions

Non-git-aware functions like find_function() should only be used:

As a fallback when git SHA cannot be determined (not in a git repo)
For administrative/debugging operations that need to see all versions
When the operation explicitly requires seeing historical data across commits

Example: Implementing a New Feature

// ✅ CORRECT IMPLEMENTATION
async fn my_new_feature(db: &DatabaseManager, repo_path: &str) -> Result<()> {
    // 1. Get current git SHA
    let git_sha = semcode::git::get_git_sha(repo_path)?
        .ok_or_else(|| anyhow::anyhow!("Not a git repository"))?;

    // 2. Use git-aware lookup
    if let Some(func) = db.find_function_git_aware("my_func", &git_sha).await? {
        println!("Found function at current commit: {}", func.name);
    }

    Ok(())
}

// ❌ WRONG IMPLEMENTATION (may return wrong version)
async fn my_bad_feature(db: &DatabaseManager) -> Result<()> {
    if let Some(func) = db.find_function("my_func").await? {
        println!("Found function (but which version?): {}", func.name);
    }
    Ok(())
}

Why This Matters

Without git-aware lookups:

Users may jump to outdated function definitions
Call chains may include deleted or renamed functions
Type information may not match current code structure
Results are confusing and incorrect for active development

Remember: When in doubt, use git-aware functions!

Scalability

The database is very large. No operation should be implemented via full table scans unless that operation is a effectively a full table scan.
If you're tempted to do full table scans, remember that LanceDB has an .only_if() parameter to queries allowing you to search without doing full table scans. Example: .only_if(format!("name = '{}'", escaped_name)).

Example only_if() usage for regex searches in grep_function_bodies():

// Escape special characters in the pattern for SQL string literal
let escaped_pattern = pattern
    .replace("\\", "\\\\") // Escape backslashes first
    .replace("'", "''"); // Escape single quotes for SQL

let where_clause = format!("regexp_match(content, '{}')", escaped_pattern);
let content_results = content_table
    .query()
    .only_if(&where_clause)
    .execute()
    .await?

Example bulk insertion with duplicate removal:

let tmpdir = tempfile::tempdir().unwrap();
let db = lancedb::connect(tmpdir.path().to_str().unwrap())
    .execute()
    .await
    .unwrap();
let new_data = RecordBatchIterator::new(
    vec![RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int32Array::from_iter_values(0..10)),
            Arc::new(
                FixedSizeListArray::from_iter_primitive::<Float32Type, _, _>(
                    (0..10).map(|_| Some(vec![Some(1.0); 128])),
                    128,
                ),
            ),
        ],
    )
    .unwrap()]
    .into_iter()
    .map(Ok),
    schema.clone(),
);
// Perform an upsert operation
let mut merge_insert = tbl.merge_insert(&["id"]);
merge_insert
    .when_matched_update_all(None)
    .when_not_matched_insert_all();
merge_insert.execute(Box::new(new_data)).await.unwrap();

Core Components

treesitter_analyzer.rs: C/C++ code analysis using Tree-sitter, extracts functions, types, macros
vectorizer.rs: CodeBERT-based semantic embedding generation using ONNX Runtime
database/: LanceDB vector database operations and schema management
types.rs: Core data structures (FunctionInfo, TypeInfo, MacroInfo, etc.)
pipeline.rs: Multi-stage pipeline processing for optimal CPU utilization

Binary Structure

src/bin/index.rs: Indexing tool with parallel processing and batch operations
src/bin/query.rs: Interactive query interface with semantic and exact search
src/pipeline.rs: Pipeline processing implementation for continuous CPU utilization

Key Features

Pipeline processing: Continuous processing with optimal CPU utilization (always enabled)
Parallel processing: Configurable via -j threads[:sessions[:batch]] format
Selective macro indexing: Only indexes function-like macros (reduces noise by 95%+)
GPU acceleration: Automatic CUDA detection when --gpu flag is used
Quantized models: INT8 quantization for 2-4x faster CPU inference
Embedded relationships: Call and type relationships stored as JSON arrays for ~10-15x performance

Performance Tuning

Pipeline Processing (Always Enabled)

All indexing uses pipeline processing for optimal performance:

# Standard indexing with pipeline processing (creates /path/to/code/.semcode.db)
semcode-index --source /path/to/code

Pipeline processing provides:

Continuous processing without gaps between batches
Optimal CPU utilization (no idle periods)
Parallel file parsing, deduplication, and database insertion
Adaptive batch sizing based on processing speed
Progress monitoring and performance statistics

Parallelism Configuration

Use the -j option to control parallelism:

# 16 analysis threads, 4 ORT sessions, batch size 32
semcode-index --vectors -j 16:4:32 --source /path/to/code

Analysis Method

Semcode uses Tree-sitter for C/C++ code analysis:

Uses Tree-sitter's fast incremental parser for AST analysis
Fast processing suitable for large codebases
Does not require compilation flags or headers to be present
Extracts functions, types, macros, and call relationships
Embedded JSON approach stores relationships directly in records

Development Notes

Grepping

exclude .git and target directories from your source code grep commands

Code Conventions

Uses standard Rust formatting and conventions
Extensive use of async/await for database operations
Error handling via anyhow::Result
Parallel processing with rayon crate
Progress reporting with indicatif
Use gitoxide (gix) for all git access, never command line git

Color Output System

Semcode uses anstream + owo-colors for automatic color handling:

Automatic TTY detection: anstream::stdout() automatically strips ANSI codes when output isn't a terminal
Consistent coloring: owo-colors provides a clean API for colorized text
No manual conditionals: Code always outputs colored text; anstream handles compatibility

Example colorized output:

use anstream::stdout;
use owo_colors::OwoColorize as _;
use std::io::Write;

fn display_function_info(name: &str, file: &str, line: u32, writer: &mut dyn Write) -> Result<()> {
    writeln!(writer, "Function: {}", name.yellow())?;
    writeln!(writer, "File: {} ({}:{})",
        file.cyan(),
        "line".bright_black(),
        line.to_string().bright_white()
    )?;
    Ok(())
}

// Usage: always outputs colors, anstream handles TTY detection
display_function_info("main", "src/main.rs", 42, &mut stdout())?;

Dependencies

Tree-sitter: Fast incremental parsing for C/C++ code analysis
LanceDB: Vector database for embeddings storage
ONNX Runtime: CodeBERT model inference
Tokenizers: Text preprocessing for the ML model
Rayon: Parallel processing framework

Testing

# Create test code samples
./build.sh --with-test

# Test indexing on sample code (creates ./test_code/.semcode.db)
./bin/semcode-index --source ./test_code --vectors

# Test querying (from test_code directory)
cd test_code && ../bin/semcode

# Or query from current directory
./bin/semcode --database ./test_code

Model and Data Storage

Model Location

Linux/macOS: ~/.cache/semcode/models/

Database Format

Uses LanceDB with an embedded JSON relationship approach. For complete schema details, see docs/schema.md.

Summary of key tables:

See docs/schema.md
Note that function and macro tables have different schemas. Many of the queries are expected to search both, and need to verify they are using the correct schema for the correct table.

Key Design Features

Embedded JSON Relationships: Call and type relationships are stored as JSON arrays within each entity's record rather than separate mapping tables
Git SHA-based Deduplication: Each record includes a git file hash for content-based uniqueness and incremental processing
Complete Source Storage: Full function bodies, type definitions, and macro expansions are stored for context
Optimized Indexing: BTree indices on names, file paths, and git hashes for fast queries
Vector Integration: Separate vectors table allows optional semantic search capabilities
GIT Integration: Expected to run in a git repository, and store information on a per-commit basis

Proxy Support

Honors standard proxy environment variables (HTTP_PROXY, HTTPS_PROXY, NO_PROXY) for model downloads.

Key Features

Regex Pattern Support: Full regex syntax supported via LanceDB's regexp_match() function
Path Filtering: Optional secondary regex filter on file paths
Performance Optimized: Uses content deduplication and efficient batching for large codebases
Limit Control: Configurable result limits with limit-hit detection
Parallel Processing: Automatically uses parallel queries for large result sets

Implementation Details

The search leverages Semcode's content deduplication system:

Function bodies are stored in a separate content table with Blake3 hashes
Regex search first finds matching content entries
Content hashes are then used to locate corresponding functions
Path filtering is applied as a post-processing step for refined results

This approach provides excellent performance even on large codebases by avoiding full table scans and utilizing LanceDB's efficient filtering capabilities.

Misc

Don't use tracing::debug!() for debugging logging, there's too much other verbosity. Use tracing::info!()

FilesExpand file tree

CLAUDE.md

Latest commit

History

CLAUDE.md

File metadata and controls

CLAUDE.md

Project Overview

Build and Development Commands

Prerequisites

Build Commands

Code Quality Commands

Git Hooks

Database Location

Running the Tools

Architecture

Git-Aware Operations (IMPORTANT)

Required Approach

Available Git-Aware Functions

When to Use Non-Git-Aware Functions

Example: Implementing a New Feature

Why This Matters

Scalability

Core Components

Binary Structure

Key Features

Performance Tuning

Pipeline Processing (Always Enabled)

Parallelism Configuration

Analysis Method

Development Notes

Grepping

Code Conventions

Color Output System

Dependencies

Testing

Model and Data Storage

Model Location

Database Format

Key Design Features

Proxy Support

Key Features

Implementation Details

Misc