This project is a high-performance code indexer designed to provide deep, contextual code intelligence for large codebases. It combines semantic search with rich metadata extraction to power advanced AI-driven development tools. The primary use case is to run on a schedule (e.g., a cron job) to keep an Elasticsearch index up-to-date with a git repository.
- High-Throughput Indexing: Utilizes a multi-threaded, streaming architecture to efficiently parse and index thousands of files in parallel.
- Semantic Search: Uses Elasticsearch's ELSER model to generate vector embeddings for code chunks, enabling powerful natural language search.
- Incremental Updates: Can efficiently update the index by only processing files that have changed since the last indexed commit.
- OpenTelemetry Integration: Built-in support for structured logging via OpenTelemetry, enabling integration with modern observability platforms.
- Efficient .gitignore Handling: Correctly applies .gitignore rules to exclude irrelevant files and directories.
- Node.js v20+ (check with node -v)
- Elasticsearch 8.0+ with the ELSER model deployed (critical - indexing will fail without this; a quick check is shown below)
- Elasticsearch credentials (username/password or API key)
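If you want to confirm the ELSER prerequisite before indexing, one quick check (assuming basic auth and the default .elser_model_2 model id; swap in your own endpoint and credentials) is to query the trained model stats and look for a deployment in the started state:

```bash
# A deployment_stats entry with "state": "started" means ELSER is ready.
curl -s -u "elastic:changeme" \
  "https://localhost:9200/_ml/trained_models/.elser_model_2/_stats?pretty"
```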
# 1. Install dependencies
npm install
# 2. Configure Elasticsearch connection
cp .env.example .env
# Edit .env with your Elasticsearch URL, username, and password
# 3. Deploy ELSER in Kibana (if not already done)
# Go to Stack Management → Machine Learning → Trained Models
# Find .elser_model_2 and click "Deploy"
# 4. (Optional) Add .indexerignore to your repository
# Copy .indexerignore.example to your repo as .indexerignore to exclude files
# This reduces indexing time and improves relevance by excluding tests, build artifacts, etc.
# 5. Index your repository
npm run index -- /path/to/your/repo --clean --watch --concurrency 8

The indexer respects both .gitignore and .indexerignore files in your repository. Create a .indexerignore file in the root of the repository you're indexing to exclude additional files beyond what .gitignore already covers.
Example use cases:
- Exclude test files (**/*.test.ts, **/*.spec.js)
- Skip build artifacts (target/, dist/, build/)
- Ignore large generated files or documentation
See .indexerignore.example in this repository for a complete example tailored for large repositories like Kibana.
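For reference, a small .indexerignore covering those cases could be created like this (the patterns are illustrative; gitignore-style syntax is assumed, matching the examples above):

```bash
cat > /path/to/your/repo/.indexerignore <<'EOF'
# Tests
**/*.test.ts
**/*.spec.js

# Build artifacts
target/
dist/
build/

# Large generated files and docs
**/*.generated.*
docs/generated/
EOF
```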
Clones a target repository into the ./.repos/ directory to prepare it for indexing.
Arguments:
- <repo_url> - The URL of the git repository to clone
- --token <token> - GitHub Personal Access Token for private repositories
Examples:
npm run setup -- https://github.com/elastic/kibana.git
# With a token for a private repository
npm run setup -- https://github.com/my-org/my-private-repo.git --token ghp_YourTokenHere

Indexes one or more repositories by scanning the codebase, enqueuing code chunks, and processing them into Elasticsearch. This unified command handles both scanning and indexing in a single operation.
Arguments:
- [repos...] - One or more repository paths, names, or URLs (format: repo[:index]). Optional if the REPOSITORIES_TO_INDEX env var is set.
- --clean - Delete the existing Elasticsearch index before starting (full rebuild)
- --pull - Run git pull before indexing
- --watch - Keep the indexer running after the queue is processed (for continuous indexing)
- --concurrency <number> - Number of parallel workers (default: 1, recommended: CPU core count)
- --token <token> - GitHub token for private repositories
- --branch <branch> - Branch name for logging/metadata (default: auto-detect)
Examples:
# Basic usage - index a local repository
npm run index -- /path/to/repo
# Full clean reindex with parallel workers
npm run index -- /path/to/repo --clean --concurrency 8
# Index with watch mode (keeps running for continuous updates)
npm run index -- /path/to/repo --watch --concurrency 8
# Index a remote repository (clones automatically)
npm run index -- https://github.com/elastic/kibana.git --clean
# Index with custom Elasticsearch index name
npm run index -- /path/to/repo:my-custom-index
# Index multiple repositories sequentially
npm run index -- /path/to/repo1 /path/to/repo2 --concurrency 4
# Incremental update (only changed files)
npm run index -- /path/to/repo --pull
# Private repository with token
npm run index -- https://github.com/org/private-repo.git --token ghp_YourToken
# Using REPOSITORIES_TO_INDEX env var (backward compatibility)
export REPOSITORIES_TO_INDEX="/path/to/repo1 /path/to/repo2"
npm run index -- --concurrency 4

How It Works:
- Scan Phase: Parses files and enqueues code chunks to a SQLite queue
- Index Phase: Worker processes the queue and sends documents to Elasticsearch
- Watch Mode (optional): Worker continues running to process new items as they arrive
Incremental vs. Full Indexing:
- Without --clean: Automatically detects whether this is a first-time index or an incremental update
  - If no previous index exists, performs a full index
  - If a previous index exists, only processes files changed since the last indexed commit (see the scheduling sketch below)
- With --clean: Always performs a full rebuild, deleting the existing index first
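For the scheduled, cron-style use case described in the introduction, a common pattern is one full --clean run followed by periodic incremental runs with --pull. A minimal sketch (paths, schedule, and log location are placeholders):

```bash
# One-time full index
npm run index -- /srv/repos/kibana --clean --concurrency 8

# crontab entry: hourly incremental update; only changed files are reprocessed
0 * * * * cd /opt/semantic-code-search-indexer && npm run index -- /srv/repos/kibana --pull >> /var/log/code-indexer.log 2>&1
```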
Generates a new language configuration file from templates. This command simplifies adding new language support by automatically creating properly formatted configuration files and optionally registering them in the language index.
Arguments:
- --name <name> - Language name (lowercase, no spaces, alphanumeric with underscores)
- --extensions <extensions> - File extensions (comma-separated, e.g., ".rs,.rlib")
- --parser <parser> - Tree-sitter package name (e.g., tree-sitter-rust)
- --custom - Use a custom parser (no tree-sitter)
- --no-register - Skip auto-registration in index.ts
Examples:
# Create a new tree-sitter language
npx ts-node src/index.ts scaffold-language --name rust --extensions ".rs,.rlib" --parser tree-sitter-rust
# Create a custom parser language (for markup/template languages)
npx ts-node src/index.ts scaffold-language --name toml --extensions ".toml" --custom
# Skip auto-registration in index.ts
npx ts-node src/index.ts scaffold-language --name proto --extensions ".proto" --parser tree-sitter-proto --no-register

The command will:
- Generate a language configuration file in src/languages/
- Validate the configuration for common errors
- Optionally register the language in src/languages/index.ts
- Provide clear next steps for completing the language setup
To index private GitHub repositories, you need to provide a Personal Access Token (PAT).
The recommended and most secure method is to use a fine-grained PAT with read-only permissions for the specific repositories you want to index.
- Go to your GitHub Settings > Developer settings > Personal access tokens > Fine-grained tokens.
- Click Generate new token.
- Give the token a descriptive name (e.g., "Code Indexer Token").
- Under Repository access, select Only select repositories and choose the private repository (or repositories) you need to index.
- Under Permissions, go to Repository permissions.
- Find the Contents permission and select Read-only from the dropdown.
- Click Generate token.
You can provide the token in two ways:
- As a command-line argument (for setup): Use the --token option when running the setup command.
  npm run setup -- <private-repo-url> --token <your-token>

- As a command-line argument (for index): Use the --token option when running the index command.
  npm run index -- <repo-url> --token <your-token>

You can also set a global GITHUB_TOKEN in your .env file as a fallback:

# .env file
GITHUB_TOKEN=ghp_YourGlobalToken
These commands help you inspect and manage the document processing queues. For multi-repository deployments, you must specify which repository's queue you want to operate on.
Important Note on --repo-name:
The --repo-name argument should be the simple name of the repository's directory (e.g., kibana), not the full path to it.
Check queue status - how many documents are pending, processing, or failed.
Options:
- --repo-name <repoName> - Repository name (auto-detects if only one repo exists)
Examples:
# Auto-detect repository (if only one exists)
npm run queue:monitor
# Specify repository
npm run queue:monitor -- --repo-name=elasticsearch-js

Delete all documents from the queue (useful for starting fresh).
Options:
- --repo-name <repoName> - Repository name (auto-detects if only one repo exists)
Examples:
# Auto-detect repository (if only one exists)
npm run queue:clear
# Specify repository
npm run queue:clear -- --repo-name=elasticsearch-js

Pro tip: Run watch -n 5 'npm run queue:monitor' to continuously monitor the queue.
Resets all documents in a queue with a failed status back to pending. This is useful for retrying documents that may have failed due to transient errors like network timeouts.
Options:
- --repo-name <repoName> - Repository name (auto-detects if only one repo exists)
Examples:
# Auto-detect repository (if only one exists)
npm run queue:retry-failed
# Specify repository
npm run queue:retry-failed -- --repo-name=elasticsearch-js

Lists all documents in a queue that have a failed status, showing their ID, content size, and file path. This is useful for diagnosing "poison pill" documents that consistently fail to process.
Options:
- --repo-name <repoName> - Repository name (auto-detects if only one repo exists)
Examples:
# Auto-detect repository (if only one exists)
npm run queue:list-failed
# Specify repository
npm run queue:list-failed -- --repo-name=elasticsearch-js

This indexer is designed to work with a Model Context Protocol (MCP) server, which exposes the indexed data through a standardized set of tools for AI coding agents. The official MCP server for this project is located in a separate repository.
For information on how to set up and run the server, please visit: https://github.com/elastic/semantic-code-search-mcp-server
This indexer is designed to be deployed on a server (e.g., a GCP Compute Engine VM) and run on a schedule. For detailed instructions on how to set up the indexer with systemd timers for a multi-repository environment, please see the GCP Deployment Guide.
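As a rough sketch of that pattern (the GCP Deployment Guide is the authoritative reference; paths, repository list, and concurrency below are placeholders), the scheduled unit can simply invoke a wrapper script that updates each repository incrementally:

```bash
#!/usr/bin/env bash
# index-all.sh - hypothetical wrapper invoked by a systemd timer or cron job.
set -euo pipefail

cd /opt/semantic-code-search-indexer

# Incrementally update each repository; --pull fetches the latest commits first.
for repo in /srv/repos/kibana /srv/repos/elasticsearch-js; do
  npm run index -- "$repo" --pull --concurrency 4
done
```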
Configuration is managed via environment variables in a .env file.
| Variable | Description | Default |
|---|---|---|
| ELASTICSEARCH_ENDPOINT | The endpoint URL for your Elasticsearch instance. | |
| ELASTICSEARCH_CLOUD_ID | The Cloud ID for your Elastic Cloud instance. | |
| ELASTICSEARCH_USER | The username for Elasticsearch authentication. | |
| ELASTICSEARCH_PASSWORD | The password for Elasticsearch authentication. | |
| ELASTICSEARCH_API_KEY | An API key for Elasticsearch authentication. | |
| ELASTICSEARCH_INDEX | The name of the Elasticsearch index to use. This is often set dynamically by the deployment scripts. | code-chunks |
| ELASTICSEARCH_INFERENCE_ID | The Elasticsearch inference endpoint ID for the ELSER model to use. Note: ELASTICSEARCH_MODEL is still supported for backward compatibility. | .elser-2-elastic |
| OTEL_LOGGING_ENABLED | Enable OpenTelemetry logging. | false |
| OTEL_METRICS_ENABLED | Enable OpenTelemetry metrics. | Same as OTEL_LOGGING_ENABLED |
| OTEL_SERVICE_NAME | Service name for OpenTelemetry logs and metrics. | semantic-code-search-indexer |
| OTEL_EXPORTER_OTLP_ENDPOINT | OpenTelemetry collector endpoint for both logs and metrics. | http://localhost:4318 |
| OTEL_EXPORTER_OTLP_LOGS_ENDPOINT | Logs-specific OTLP endpoint (overrides OTEL_EXPORTER_OTLP_ENDPOINT). | |
| OTEL_EXPORTER_OTLP_METRICS_ENDPOINT | Metrics-specific OTLP endpoint (overrides OTEL_EXPORTER_OTLP_ENDPOINT). | |
| OTEL_EXPORTER_OTLP_HEADERS | Headers for the OTLP exporter (e.g., authorization=Bearer token). | |
| OTEL_METRIC_EXPORT_INTERVAL_MILLIS | Interval in milliseconds between metric exports. | 60000 (60 seconds) |
| QUEUE_BASE_DIR | The base directory for all repository queue databases. Each repository gets its own SQLite queue at QUEUE_BASE_DIR/<repo-name>/queue.db. | .queues |
| REPOSITORIES_TO_INDEX | Optional: space-separated list of repositories to index, used as a fallback when no repositories are provided as CLI arguments. Format: "repo1 repo2" or "repo1:index1 repo2:index2". | |
| BATCH_SIZE | The number of chunks to index in a single bulk request. | 500 |
| MAX_QUEUE_SIZE | The maximum number of items to keep in the queue. | 1000 |
| CPU_CORES | The number of CPU cores to use for file parsing. | Half of the available cores |
| MAX_CHUNK_SIZE_BYTES | The maximum size of a code chunk in bytes. | 1000000 |
| DEFAULT_CHUNK_LINES | Number of lines per chunk for line-based parsing (JSON, YAML, text without paragraphs). | 15 |
| CHUNK_OVERLAP_LINES | Number of overlapping lines between chunks in line-based parsing. | 3 |
| MARKDOWN_CHUNK_DELIMITER | Regular expression pattern for splitting markdown files into chunks. | \n\s*\n |
| ENABLE_DENSE_VECTORS | Whether to enable dense vectors for code similarity search. | false |
| GIT_PATH | The path to the git executable. | git |
| NODE_ENV | The Node environment. | development |
| SEMANTIC_CODE_INDEXER_LANGUAGES | A comma-separated list of languages to index. | typescript,javascript,markdown,yaml,java,go,python |
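For orientation, a minimal .env using username/password authentication might look like the following; every value here is a placeholder, and any variable left unset falls back to the defaults in the table above:

```bash
# .env (illustrative values only)
ELASTICSEARCH_ENDPOINT=https://your-cluster.example.com:9243
ELASTICSEARCH_USER=elastic
ELASTICSEARCH_PASSWORD=changeme
ELASTICSEARCH_INDEX=code-chunks
BATCH_SIZE=500
SEMANTIC_CODE_INDEXER_LANGUAGES=typescript,javascript,markdown,yaml,java,go,python
```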
When using the default inference ID .elser-2-elastic, your deployment uses the Elastic Inference Service (EIS), which is GPU-backed and has specific rate limits:
- Rate limits: 6,000 documents/minute OR 6,000,000 tokens/minute (whichever is reached first)
- Recommended settings: BATCH_SIZE=14 with a concurrency of 2
- Important consideration: Chunks larger than 512K may generate additional chunks in ELSER, potentially causing some batches to be rejected
- Monitoring requirement: Setting up an OTEL Collector is critical to monitor logs for errors when using these settings
These settings help avoid rate limit issues while maintaining good indexing throughput. Monitor your deployment logs closely when operating near these limits.
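Concretely, a configuration aimed at staying under the EIS limits could look like this sketch (tune the numbers against your own deployment's headroom):

```bash
# .env
BATCH_SIZE=14

# Index with two parallel workers
npm run index -- /path/to/repo --concurrency 2
```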
The indexer uses different chunking strategies depending on file type to optimize for both semantic search quality and LLM context window limits:
- JSON: Always uses line-based chunking with configurable chunk size (DEFAULT_CHUNK_LINES) and overlap (CHUNK_OVERLAP_LINES). This prevents large JSON values from creating oversized chunks (see the worked sketch after this list).
- YAML: Always uses line-based chunking with the same configuration. This provides more context than single-line chunks while maintaining manageable sizes.
- Text files: Uses paragraph-based chunking (splitting on double newlines) when paragraphs are detected. Falls back to line-based chunking for continuous text without paragraph breaks.
- Markdown: Uses configurable delimiter-based chunking to preserve logical document structure. See MARKDOWN_CHUNK_DELIMITER below for customization options.
- Code files (TypeScript, JavaScript, Python, Java, Go, etc.): Uses tree-sitter based parsing to extract functions, classes, and other semantic units.
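As a rough worked example of the line-based strategy, assuming each new chunk simply re-includes the last CHUNK_OVERLAP_LINES lines of the previous one (the exact boundary handling is an implementation detail):

```bash
# With the defaults DEFAULT_CHUNK_LINES=15 and CHUNK_OVERLAP_LINES=3,
# a 40-line JSON or YAML file would split roughly into:
#   chunk 1: lines  1-15
#   chunk 2: lines 13-27   (re-includes lines 13-15)
#   chunk 3: lines 25-39
#   chunk 4: lines 37-40
# Larger chunks with a bigger overlap:
DEFAULT_CHUNK_LINES=30
CHUNK_OVERLAP_LINES=5
```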
The markdown chunking behavior can be customized via the MARKDOWN_CHUNK_DELIMITER environment variable:
- MARKDOWN_CHUNK_DELIMITER: Regular expression pattern for splitting markdown files into chunks
  - Default: \n\s*\n (splits by paragraphs - double newlines)
  - Example for section separators: \n---\n
  - Example for a custom delimiter: \n===\n
  - The delimiter is converted to a RegExp, so escape special characters appropriately
Use Cases:
- Default (paragraphs): Best for general markdown documents
- Section separators (\n---\n): Best for markdown with explicit section dividers
- Custom delimiters: Use any pattern that makes sense for your document structure
Example:
export MARKDOWN_CHUNK_DELIMITER='\n---\n'
npm run index -- /path/to/repo
This indexer supports comprehensive OpenTelemetry integration for both logs and metrics, enabling deep observability into indexing operations. Telemetry data is sent via OTLP/HTTP protocol to an OpenTelemetry Collector, which routes it to various backends (Elasticsearch, Prometheus, etc.).
By default, the indexer outputs text-format logs to the console (except when NODE_ENV=test):
[2025-10-16T10:30:45.123Z] [INFO] Successfully indexed 500 files
[2025-10-16T10:30:45.234Z] [ERROR] Failed to parse file: syntax error
To enable OpenTelemetry log and metrics export:
OTEL_LOGGING_ENABLED=true
OTEL_METRICS_ENABLED=true # Optional, defaults to same as OTEL_LOGGING_ENABLED
OTEL_SERVICE_NAME=my-indexer # Optional, defaults to 'semantic-code-search-indexer'
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318

For authentication to the collector:
OTEL_EXPORTER_OTLP_HEADERS="authorization=Bearer your-token"

You can also configure separate endpoints for logs and metrics:
OTEL_EXPORTER_OTLP_LOGS_ENDPOINT=http://otel-collector:4318/v1/logs
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://otel-collector:4318/v1/metrics

The following resource attributes are automatically attached to all logs and metrics:
- service.name: Service name (from OTEL_SERVICE_NAME)
- service.version: Version from package.json
- deployment.environment: From NODE_ENV
- host.name, host.arch, host.type, os.type: Host information
Each log entry and metric includes attributes based on context:
- repo.name: Repository being indexed (e.g., "kibana", "elasticsearch")
- repo.branch: Branch being indexed (e.g., "main", "feature/metrics")
- Custom metadata passed to logging calls or metric recordings
The indexer exports the following metrics for monitoring indexing operations:
| Metric | Type | Description | Attributes |
|---|---|---|---|
| parser.files.processed | Counter | Total files processed | language, status, repo.name, repo.branch |
| parser.files.failed | Counter | Files that failed to parse | language, status, repo.name, repo.branch |
| parser.chunks.created | Counter | Total chunks created | language, parser_type, repo.name, repo.branch |
| parser.chunks.skipped | Counter | Chunks skipped due to exceeding maxChunkSizeBytes | language, parser_type, size, repo.name, repo.branch |
| parser.chunks.size | Histogram | Distribution of chunk sizes (bytes) | language, parser_type, repo.name, repo.branch |
| Metric | Type | Description | Attributes |
|---|---|---|---|
| queue.documents.enqueued | Counter | Documents added to queue | repo.name, repo.branch |
| queue.documents.dequeued | Counter | Documents removed from queue | repo.name, repo.branch |
| queue.documents.committed | Counter | Successfully indexed documents | repo.name, repo.branch |
| queue.documents.requeued | Counter | Documents requeued after failure | repo.name, repo.branch |
| queue.documents.failed | Counter | Documents marked as failed | repo.name, repo.branch |
| queue.documents.deleted | Counter | Documents deleted from queue | repo.name, repo.branch |
| queue.size.pending | Gauge | Current pending documents | repo.name, repo.branch, status |
| queue.size.processing | Gauge | Current processing documents | repo.name, repo.branch, status |
| queue.size.failed | Gauge | Current failed documents | repo.name, repo.branch, status |
| Metric | Type | Description | Attributes |
|---|---|---|---|
| indexer.batch.processed | Counter | Successful batches indexed | repo.name, repo.branch, concurrency |
| indexer.batch.failed | Counter | Failed batches | repo.name, repo.branch, concurrency |
| indexer.batch.duration | Histogram | Batch processing time (ms) | repo.name, repo.branch, concurrency |
| indexer.batch.size | Histogram | Distribution of batch sizes | repo.name, repo.branch, concurrency |
All metrics and logs include repo.name and repo.branch attributes, enabling you to:
- Filter telemetry data by repository and branch
- Create repository-specific dashboards in Kibana
- Set up alerts for specific repositories
- Compare indexing performance across repositories
Example Kibana query to filter by repository:
repo.name: "kibana" AND repo.branch: "main"
A complete example collector configuration is provided in docs/otel-collector-config.yaml. This configuration:
- Receives logs and metrics via OTLP/HTTP
- Batches telemetry data for efficiency
- Adds host resource attributes
- Exports logs to the logs-semanticcode.otel-default index
- Exports metrics to the metrics-semanticcode.otel-default index
- Supports authentication to Elasticsearch
- Configured with mapping: mode: otel for proper histogram support (requires Elasticsearch 8.12+)
Important: The indexer configures metrics with Delta temporality, which is required by the Elasticsearch exporter for histogram metrics. Without this configuration, histograms (parser.chunks.size, indexer.batch.duration, indexer.batch.size) will be silently dropped.
Note on Histogram Visibility: OpenTelemetry histogram metrics are stored as complex nested structures in Elasticsearch and may not appear in Kibana's field list or be easily queryable via ES|QL. This is a known limitation of Kibana's histogram support. Histograms are still indexed and can be accessed via direct Elasticsearch queries or specialized visualizations.
To use the example configuration:
export ELASTICSEARCH_ENDPOINT=https://elasticsearch:9200
export ELASTICSEARCH_API_KEY=your-api-key
docker run -p 4318:4318 -p 4317:4317 -p 13133:13133 \
-e ELASTICSEARCH_ENDPOINT \
-e ELASTICSEARCH_API_KEY \
-v $(pwd)/docs/otel-collector-config.yaml:/etc/otelcol/config.yaml \
otel/opentelemetry-collector-contrib:latest

This indexer supports an optional, high-fidelity "find similar code" feature powered by the microsoft/codebert-base model. This model generates dense vector embeddings for code chunks, which enables more nuanced, semantic similarity searches than the default ELSER model.
Trade-offs: Enabling this feature has a significant performance cost. Indexing will be substantially slower and the Elasticsearch index will require more disk space. It is recommended to only enable this feature if the "find similar code" capability is a critical requirement for your use case.
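Note that the ingest pipeline below references a model_id of microsoft__codebert-base, so that model must already be imported and deployed in your cluster; this project does not automate that step. One common approach is Eland's model importer, sketched here under the assumption that you authenticate with an API key (flag names may vary by Eland version):

```bash
# Import microsoft/codebert-base from Hugging Face and start a deployment.
# Eland stores it in Elasticsearch under the id microsoft__codebert-base.
eland_import_hub_model \
  --url "$ELASTICSEARCH_ENDPOINT" \
  --es-api-key "$ELASTICSEARCH_API_KEY" \
  --hub-model-id microsoft/codebert-base \
  --task-type text_embedding \
  --start
```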
To enable this feature, you must perform the following manual setup steps:
1. Install the Ingest Pipeline
You must install a dedicated ingest pipeline in your Elasticsearch cluster. Run the following command in the Kibana Dev Console:
PUT _ingest/pipeline/code-similarity-pipeline
{
"description": "Pipeline to selectively generate dense vector embeddings for substantive code chunks.",
"processors": [
{
"grok": {
"field": "kind",
"patterns": ["^(call_expression|import_statement|lexical_declaration)$"],
"on_failure": [
{
"inference": {
"model_id": "microsoft__codebert-base",
"target_field": "code_vector",
"field_map": {
"content": "text_field"
}
}
},
{
"set": {
"field": "code_vector",
"copy_from": "code_vector.predicted_value"
}
}
]
}
}
]
}

2. Enable the Feature Flag
Set the following environment variable in your .env file:
ENABLE_DENSE_VECTORS=true
3. Re-index Your Data
To generate the dense vectors for your codebase, you must run a full, clean index. This will apply the ingest pipeline to all of your documents.
npm run index -- .repos/your-repo --clean