Arcaneum

CLI tools and Claude Code plugins for semantic and full-text search.

pipx install "https://github.com/cwensel/arcaneum/releases/download/v0.8.2/arcaneum-0.8.2-py3-none-any.whl"

Overview

Arcaneum helps you discover and understand project dependencies, documentation, and reference implementations. Once libraries, frameworks, and technical papers are indexed, you can search them semantically for patterns, APIs, and concepts when building new projects. It works especially well with the RDR (Recommendation Data Record) model for AI-assisted development planning.

The system supports PDF documents, Markdown files, and source code (with git-aware, AST-based chunking).

Currently Available:

Semantic search with Qdrant (vector embeddings)
Full-text search with MeiliSearch (exact phrase matching)
Dual indexing workflow for comprehensive search

Features

Search Capabilities

Semantic Search (Qdrant): Find conceptually similar content using vector embeddings
Full-Text Search (MeiliSearch): Exact phrase matching, keyword search, and typo-tolerant queries

Indexing

PDF Indexing: OCR support for scanned documents, page-level metadata, parallel processing
Source Code Indexing: Git-aware with AST chunking, multi-branch support, 165+ languages
Markdown Indexing: YAML frontmatter extraction, semantic chunking, incremental sync
Dual Indexing: Single command to index to both search engines
Performance Tuning: Granular control over workers, batch sizes, and process priority. arc corpus sync accepts --max-embedding-batch, --text-workers, and --cpu-workers; the single-system indexing commands accept flags such as --embedding-batch-size and --process-priority

Multiple Embedding Models

arctic-m (768D) - DEFAULT for PDFs/markdown - stable FastEmbed retrieval model
stella (1024D) - High-quality opt-in document model, requires arcaneum[sentence-transformers]
mxbai-large (1024D) - High-quality FastEmbed document model
jina-code (768D) - DEFAULT for code - stable FastEmbed code model
jina-code-st (768D) - Legacy SentenceTransformers code path, requires arcaneum[sentence-transformers]
jina-code-0.5b (896D) - Higher-quality opt-in code model, 32K context, requires arcaneum[sentence-transformers]
jina-code-1.5b (1536D) - Highest quality code embeddings, SOTA Sept 2025, requires arcaneum[sentence-transformers]
codesage-large (1024D) - CodeSage V2, 9 programming languages, requires arcaneum[sentence-transformers]
bge-large (1024D) - BGE large embeddings, balanced performance
jina-v3 (1024D) - Multilingual embeddings with extended 8K context
bge-base (768D) - BGE base embeddings, balanced performance and speed
bge-small (384D) - BGE small embeddings, fastest for size-constrained scenarios

See arc models list for complete model information and recommendations. Use arc models list --json for the LLM-readable catalog: it includes backend, recommended and default corpus uses, support/risk tier, prompt policy, context limit, hardware support, runtime-aligned batch guidance, and reindex warnings.

Arcaneum records each collection's embedding prompt policy when it is indexed. Reindex a corpus after changing model query/document prompts, tasks, or prompt-aware model defaults; semantic search rejects collections whose stored prompt policy no longer matches the current model registry.
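
The check described above can be pictured with a small sketch. It is illustrative only and does not reflect Arcaneum's actual code: the PromptPolicy class, its fields, and check_policy are hypothetical names. The general idea is that vectors embedded under one set of prompts are not reliably comparable with queries embedded under another, so a mismatch is treated as an error rather than silently returning poor results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptPolicy:
    """Hypothetical record of how text is prefixed before embedding."""

    model: str
    query_prompt: str
    document_prompt: str


def check_policy(stored: PromptPolicy, current: PromptPolicy) -> None:
    """Raise if a collection was indexed under a different prompt policy."""
    if stored != current:
        raise ValueError(
            f"prompt policy for {stored.model!r} changed since indexing; "
            "reindex the corpus before searching"
        )
```

In this sketch, stored would come from the collection's metadata and current from the model registry; any difference in the fields means the corpus needs a reindex.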

GPU Acceleration

CPU is the default for the most stable indexing behavior
Supports Apple Silicon (MPS) and NVIDIA GPUs (CUDA)
Use --gpu to opt into accelerator embedding
FastEmbed/CoreML on Apple Silicon is experimental and requires ARC_EXPERIMENTAL_COREML=1

CLI-First Design

All operations via command-line interface
JSON output mode for automation
Structured error messages with exit codes
Python >= 3.12 required
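
For automation, a script can check the exit code first and only then parse the JSON output. The sketch below is a minimal Python wrapper and not part of Arcaneum: run_json is a hypothetical helper, and because the JSON shape of each arc command is not documented in this README, it simply returns whatever the command printed.

```python
import json
import subprocess


def run_json(cmd: list[str]) -> object:
    """Run a command, fail on a non-zero exit code, and parse stdout as JSON."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        # Surface the exit code and stderr instead of parsing partial output.
        raise RuntimeError(
            f"{cmd[0]} exited with {proc.returncode}: {proc.stderr.strip()}"
        )
    return json.loads(proc.stdout)
```

For example, run_json(["arc", "models", "list", "--json"]) would return the parsed model catalog, assuming arc is on PATH.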

Claude Code Integration

Slash commands for all operations (/arc:search, /arc:index, /arc:collection, etc.)
Discoverable via /help or /commands in Claude Code
No MCP overhead - direct CLI execution

Quick Start

Get started with Arcaneum in just a few commands:

# 1. Install
pipx install "https://github.com/cwensel/arcaneum/releases/download/v0.8.2/arcaneum-0.8.2-py3-none-any.whl"

# 2. Install Claude Code plugin (optional, in Claude Code)
# /plugin install cwensel/arcaneum

# 3. Verify and start services
arc doctor
arc container start

# 4. Create a corpus and sync content (indexes to both Qdrant and MeiliSearch)
arc corpus create Frameworks --type code
arc corpus sync Frameworks ~/libs/fastapi ~/libs/sqlalchemy

# 5. Search with semantic or full-text queries
arc search semantic "dependency injection pattern" --corpus Frameworks
arc search text "async def" --corpus Frameworks

First time? Run arc doctor to check prerequisites and get setup guidance.

👉 Full Quick Start Guide - Detailed walkthrough with troubleshooting

Quick Reference

# Service Management
arc container start          # Start Qdrant and MeiliSearch
arc container status         # Check service health
arc container backup         # Back up Qdrant and MeiliSearch data
arc container restore DIR    # Restore a backup
arc doctor                   # Verify setup

# Corpus (Recommended - Dual Indexing to Both Systems)
arc corpus create NAME --type TYPE              # pdf, code, or markdown
arc corpus list                                 # List all corpora
arc corpus sync NAME PATH [PATH...]             # Sync one or more directories
arc corpus sync NAME PATH --parity              # Also detect renames, remove files no longer on disk
arc corpus items NAME                           # List items with parity status
arc corpus verify NAME                          # Verify corpus health across both systems
arc corpus parity NAME                          # Check/restore parity between systems
arc corpus repair NAME                          # Re-index incomplete or garbled files
arc corpus update NAME --description "..."      # Update corpus metadata
arc corpus delete NAME                          # Delete both collection and index

# Search (Works with corpus, collection, or index)
arc search semantic "query" --corpus NAME              # Conceptual similarity
arc search semantic "query" --corpus N1 --corpus N2    # Multi-corpus
arc search text "query" --corpus NAME                  # Exact phrase matching

# --- Advanced: single-system only (prefer `arc corpus` above for normal use) ---

# Collections (Qdrant Only - Semantic Search)
arc collection create NAME --type TYPE   # When you only need semantic search
arc collection list
arc collection items NAME
arc index pdf PATH --collection NAME
arc index code PATH --collection NAME

# Indexes (MeiliSearch Only - Full-Text Search)
arc indexes create NAME --type TYPE      # When you only need full-text search
arc indexes list
arc index text pdf PATH --index NAME
arc index text code PATH --index NAME
arc index text markdown PATH --index NAME

Common Workflows

Search Dependencies and Libraries (Recommended)

# Create a corpus for framework source code
arc corpus create Frameworks --type code

# Sync framework directories (indexes to both Qdrant and MeiliSearch)
arc corpus sync Frameworks ~/libs/fastapi ~/libs/sqlalchemy

# List what's indexed
arc corpus items Frameworks

# Semantic search for patterns and APIs
arc search semantic "dependency injection pattern" --corpus Frameworks --limit 10

# Full-text search for exact code
arc search text "async def create_app" --corpus Frameworks

Search Technical Documentation

# Create a corpus for PDF documents
arc corpus create Papers --type pdf

# Sync documentation directories
arc corpus sync Papers ~/Documents/papers ~/Documents/specs

# Semantic search for concepts
arc search semantic "distributed consensus algorithms" --corpus Papers

# Full-text search for exact phrases
arc search text '"rate limiting"' --corpus Papers

Index Markdown Files

# Create a corpus for notes and documentation
arc corpus create Notes --type markdown

# Sync your notes directory
arc corpus sync Notes ~/obsidian-vault

# Semantic search
arc search semantic "project planning" --corpus Notes

# Full-text search
arc search text "meeting notes" --corpus Notes

Features:

YAML frontmatter extraction (title, tags, category, etc.)
Semantic chunking preserving document structure
Incremental sync (SHA256 content hashing)
Custom exclude patterns
Supports .md, .markdown, .mdown extensions
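
Incremental sync by content hash can be sketched as follows. This is a minimal illustration under stated assumptions, not Arcaneum's implementation: plan_sync and the known mapping (relative path to SHA256 digest from the previous sync) are hypothetical, and the real bookkeeping may differ. Only files whose hash is new or different would need re-indexing.

```python
import hashlib
from pathlib import Path

MARKDOWN_SUFFIXES = {".md", ".markdown", ".mdown"}


def sha256_of(path: Path) -> str:
    """Hash file contents in chunks so large files are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_sync(root: Path, known: dict[str, str]) -> dict[str, list[str]]:
    """Classify markdown files under root against previously recorded hashes."""
    current = {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.suffix in MARKDOWN_SUFFIXES
    }
    return {
        "new": [p for p in current if p not in known],
        "changed": [p for p in current if p in known and current[p] != known[p]],
        "unchanged": [p for p in current if known.get(p) == current[p]],
        "removed": [p for p in known if p not in current],
    }
```

Hashing content rather than comparing modification times means a file that was touched but not edited is still classified as unchanged.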

Single-System Indexing (Advanced)

Prefer arc corpus sync for normal use. Use collections or indexes directly only when you explicitly need one type of search without the other:

# Semantic search only (Qdrant collection)
arc collection create MyCollection --type code
arc index code ~/project --collection MyCollection
arc search semantic "query" --corpus MyCollection

# Full-text search only (MeiliSearch index)
arc indexes create MyIndex --type pdf
arc index text pdf ~/docs --index MyIndex
arc search text "query" --corpus MyIndex

Store Agent Memory

# Create a corpus for agent-generated content
arc corpus create Memory --type markdown

# Store from file with metadata
arc store analysis.md --collection Memory \
  --title "Security Analysis" \
  --category "security" \
  --tags "audit,findings"

# Store from stdin (agent workflow)
printf '# Research\n\nFindings...\n' | arc store - --collection Memory

# Search agent memory
arc search semantic "security vulnerabilities" --corpus Memory
arc search text "SQL injection" --corpus Memory

# Content persisted to: ~/.local/share/arcaneum/agent-memory/{collection}/
# Enables re-indexing and full-text retrieval

Use Case: Designed for AI agents to store research, analysis, and synthesized information with rich metadata. Content is automatically persisted for durability.

Manage Services

arc container start    # Start Qdrant and MeiliSearch
arc container status   # Check health
arc container backup   # Create a timestamped backup
arc container logs     # View logs
arc container stop     # Stop services

Installation

Prerequisites

Python 3.12+ - Check with python --version
pipx - Recommended for global CLI install
Docker - Install Docker Desktop (Mac/Windows) or Docker Engine (Linux)

Install

# Recommended: Install via pipx from latest release
pipx install "https://github.com/cwensel/arcaneum/releases/download/v0.8.2/arcaneum-0.8.2-py3-none-any.whl"

# Or install via Homebrew (macOS/Linux)
brew install cwensel/arcaneum/arcaneum

# Or install latest from source
pipx install "git+https://github.com/cwensel/arcaneum.git"

# Development install (from cloned repo)
git clone https://github.com/cwensel/arcaneum
cd arcaneum
pip install -e ".[dev]"

Claude Code Plugin

After installing the CLI globally, install the plugin in Claude Code:

/plugin install cwensel/arcaneum

The plugin assumes arc is available in PATH. Slash commands execute arc directly.

Verify Setup

arc doctor

The doctor command checks your environment and guides you through any issues.

👉 Full Installation Guide - Complete walkthrough with troubleshooting

Data Storage

Arcaneum stores data in XDG-compliant locations:

Cache (Re-downloadable):

~/.cache/arcaneum/models/     # Embedding models, ~1-2GB per model

Data (User-created):

~/.local/share/arcaneum/      # Local databases and indexed content
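
The two paths above are the defaults defined by the XDG Base Directory specification. The sketch below shows how a script might resolve them; it assumes Arcaneum honors XDG_CACHE_HOME and XDG_DATA_HOME when they are set, which this README does not confirm, and the function names are hypothetical.

```python
import os
from pathlib import Path


def xdg_dir(env_var: str, fallback: str, app: str = "arcaneum") -> Path:
    """Resolve an XDG base directory, falling back to the spec's default."""
    base = os.environ.get(env_var, "")
    # The XDG spec says empty or relative values must be ignored.
    root = Path(base) if base and os.path.isabs(base) else Path.home() / fallback
    return root / app


def models_cache_dir() -> Path:
    """Location of downloaded embedding models (safe to delete and re-download)."""
    return xdg_dir("XDG_CACHE_HOME", ".cache") / "models"


def data_dir() -> Path:
    """Location of user-created data such as local databases."""
    return xdg_dir("XDG_DATA_HOME", ".local/share")
```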

Vector Database (Docker):

Qdrant uses Docker named volumes for data persistence and safety:

qdrant-arcaneum-storage    # Main vector database storage
qdrant-arcaneum-snapshots  # Backup snapshots

Named volumes store data on a Linux ext4 filesystem inside Docker, providing better reliability and performance than bind mounts.

Legacy Migration:

If upgrading from an older version with ~/.arcaneum/, the directory will be automatically migrated to XDG-compliant locations on first run. Qdrant client configuration is read from ~/.config/arcaneum/config.yaml; an existing legacy ~/.arcaneum/config.yaml is copied there the first time Qdrant configuration is loaded.

Migration Note: If you're upgrading from bind mounts to named volumes, see the Qdrant Migration Guide for detailed migration steps.

Benefits:

Reliable data persistence across container restarts
Better performance compared to bind mounts
Easy backup via arc container backup
Native Linux filesystem (ext4) for data safety

Backup and Restore

Use arc container backup before upgrades or migrations. It creates a timestamped directory under ~/.local/share/arcaneum/backups/ with Qdrant collection snapshots plus MeiliSearch index settings and JSONL document exports. Restore with arc container restore <backup-directory> while the container services are running. Restore recreates same-named MeiliSearch indexes from the backup.

Run backups when indexing is idle. arc container backup checks MeiliSearch for active tasks before and after export and aborts if any MeiliSearch task appears during the backup window.

Backups protect Arcaneum's indexed data and corpus metadata stored in Qdrant and MeiliSearch. They do not include source files referenced by indexes, cached embedding models, Docker images, or local configuration secrets.

Corporate Networks

Behind a VPN with SSL issues? See Corporate Network Setup for:

Offline mode setup
SSL certificate workarounds
Model pre-downloading

Claude Code Plugin

Install the CLI globally first (see Installation), then in Claude Code:

/plugin install cwensel/arcaneum

Available Commands

All commands use the arc: namespace prefix:

Command	Description
`/arc:corpus`	Recommended - Manage dual-index corpora (Qdrant + MeiliSearch)
`/arc:search`	Semantic or full-text search
`/arc:index`	Index PDF, code, or markdown content
`/arc:store`	Store agent-generated content for memory
`/arc:container`	Manage Docker services (start, stop, status)
`/arc:doctor`	Verify setup and prerequisites
`/arc:models`	List available embedding models
`/arc:config`	Manage configuration and cache
`/arc:collection`	Manage Qdrant collections (semantic search only)
`/arc:indexes`	Manage MeiliSearch indexes (full-text search only)

Usage Examples:

/arc:corpus create my-docs --type pdf
/arc:corpus sync my-docs ~/Documents
/arc:search semantic "example query" --corpus my-docs
/arc:search text "exact phrase" --corpus my-docs
/arc:models list

Use /help in Claude Code to see all available commands, or /arc:doctor to check your setup.

For Developers: See Claude Code Plugin Testing Guide for local testing instructions.

Development

Architecture Principles

CLI-First: All functionality as CLI tools (RDR-001, RDR-006)
Slash Commands: Thin wrappers calling CLI via Bash (RDR-006)
No MCP (v1): Avoid MCP overhead, use direct CLI execution (RDR-006)
Local Docker: Databases run locally with volume persistence (RDR-002, RDR-008)
RDR-Based Planning: Detailed design before implementation (docs/rdr/)

Implementation Status

✅ RDR-001: Project structure (COMPLETED)
✅ RDR-002: Qdrant server setup (COMPLETED)
✅ RDR-003: Collection management (COMPLETED)
✅ RDR-004: PDF bulk indexing (COMPLETED)
✅ RDR-005: Source code indexing (COMPLETED)
✅ RDR-006: Claude Code integration (COMPLETED)
✅ RDR-007: Semantic search (COMPLETED)
✅ RDR-008: MeiliSearch setup (COMPLETED)
✅ RDR-009: Dual indexing strategy (COMPLETED)
✅ RDR-010: PDF full-text indexing (COMPLETED)
✅ RDR-011: Source code full-text indexing (COMPLETED)
✅ RDR-012: Full-text search integration (COMPLETED)
✅ RDR-014: Markdown indexing (COMPLETED)

Testing

# Run unit tests
pytest tests/unit tests/fulltext tests/schema -v

# Run with coverage
pytest tests/unit tests/fulltext tests/schema --cov=src/arcaneum -v

# Run integration tests (requires Qdrant and MeiliSearch running)
arc container start
pytest tests/integration tests/indexing tests/cli -v

Documentation

User Guides

Quick Start Guide - Installation, setup, and your first search
CLI Reference - Complete command documentation and options
PDF Indexing Guide - Advanced PDF indexing with OCR support, performance tuning, and troubleshooting
Qdrant Migration Guide - Migrate from bind mounts to Docker named volumes
Corporate Network Setup - Setup for VPN, SSL certificates, and offline mode
Claude Code Plugin Testing Guide - Local development and testing for plugin developers

Development

RDR Process - Recommendation Data Records workflow for complex features
Individual RDRs - Technical specifications and design decisions for each feature
Slash Commands - Claude Code plugin command implementations

Contributing

We welcome contributions! See CONTRIBUTING.md for detailed guidelines on:

Development setup and workflow
When to create RDRs
Code and documentation standards
Pull request process

Quick Start for Contributors:

Read docs/rdr/README.md for RDR-based development workflow
Create an RDR for complex features before implementation
Follow CLI-first architecture pattern
Add tests for new functionality
Update this README with implementation status

License

MIT - See LICENSE file for details

Acknowledgments

Built on Qdrant and MeiliSearch
