CLI tools and Claude Code plugins for semantic and full-text search.
pipx install "https://github.com/cwensel/arcaneum/releases/download/v0.8.2/arcaneum-0.8.2-py3-none-any.whl"
Arcaneum helps you discover and understand project dependencies, documentation, and reference implementations. By indexing libraries, frameworks, and technical papers, you can semantically search for patterns, APIs, and concepts when building new projects. Works especially well with the RDR (Recommendation Data Record) model for AI-assisted development planning.
The system supports PDF documents and source code with git-aware, AST-based chunking.
Currently Available:
- Semantic search with Qdrant (vector embeddings)
- Full-text search with MeiliSearch (exact phrase matching)
- Dual indexing workflow for comprehensive search
- Semantic Search (Qdrant): Find conceptually similar content using vector embeddings
- Full-Text Search (MeiliSearch): Exact phrase matching, keyword search, and typo-tolerant queries
- PDF Indexing: OCR support for scanned documents, page-level metadata, parallel processing
- Source Code Indexing: Git-aware with AST chunking, multi-branch support, 165+ languages
- Markdown Indexing: YAML frontmatter extraction, semantic chunking, incremental sync
- Dual Indexing: Single command to index to both search engines
- Performance Tuning: Granular control over workers, batch sizes, and process priority via arc corpus sync --max-embedding-batch, --text-workers, --cpu-workers, and single-system indexing flags such as --embedding-batch-size and --process-priority
- arctic-m (768D) - DEFAULT for PDFs/markdown - stable FastEmbed retrieval model
- stella (1024D) - High-quality opt-in document model, requires arcaneum[sentence-transformers]
- mxbai-large (1024D) - High-quality FastEmbed document model
- jina-code (768D) - DEFAULT for code - stable FastEmbed code model
- jina-code-st (768D) - Legacy SentenceTransformers code path, requires arcaneum[sentence-transformers]
- jina-code-0.5b (896D) - Higher-quality opt-in code model, 32K context, requires arcaneum[sentence-transformers]
- jina-code-1.5b (1536D) - Highest quality code embeddings, SOTA Sept 2025, requires arcaneum[sentence-transformers]
- codesage-large (1024D) - CodeSage V2, 9 programming languages, requires arcaneum[sentence-transformers]
- bge-large (1024D) - BGE large embeddings, balanced performance
- jina-v3 (1024D) - Multilingual embeddings with extended 8K context
- bge-base (768D) - BGE base embeddings, balanced performance and speed
- bge-small (384D) - BGE small embeddings, fastest for size-constrained scenarios
See arc models list for complete model information and recommendations.
Use arc models list --json for the LLM-readable catalog: it includes backend,
recommended and default corpus uses, support/risk tier, prompt policy, context
limit, hardware support, runtime-aligned batch guidance, and reindex warnings.
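For automation, the JSON catalog can be read from a script. The following is a minimal sketch, not part of Arcaneum: the helper name run_json is hypothetical, and it assumes only that the command prints one JSON document to stdout and exits non-zero on failure. The catalog's field names are not documented here, so the sketch does not assume any.

```python
import json
import subprocess
from typing import Any, Sequence


def run_json(argv: Sequence[str], timeout: float = 60.0) -> Any:
    """Run a command and parse its stdout as JSON.

    Hypothetical helper for illustration. Raises
    subprocess.CalledProcessError on a non-zero exit code and
    json.JSONDecodeError if stdout is not valid JSON.
    """
    completed = subprocess.run(
        list(argv),
        capture_output=True,  # keep stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        check=True,           # turn non-zero exit codes into exceptions
        timeout=timeout,
    )
    return json.loads(completed.stdout)


# Example call (assumes `arc` is on PATH; the catalog shape is not assumed):
# catalog = run_json(["arc", "models", "list", "--json"])
```

Raising on a non-zero exit code means a failed arc invocation cannot be mistaken for an empty catalog.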
Arcaneum records each collection's embedding prompt policy when it is indexed. Reindex a corpus after changing model query/document prompts, tasks, or prompt-aware model defaults; semantic search rejects collections whose stored prompt policy no longer matches the current model registry.
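The reason for the rejection is that query vectors embedded under one prompt are not reliably comparable with document vectors embedded under another. The sketch below illustrates the idea only; PromptPolicy, PromptPolicyMismatch, and ensure_policy_matches are hypothetical names, and the fields Arcaneum actually stores may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptPolicy:
    """Hypothetical record of the prompts a model applies at embed time."""
    model: str
    query_prompt: str
    document_prompt: str


class PromptPolicyMismatch(RuntimeError):
    """Raised when a collection was indexed under a different policy."""


def ensure_policy_matches(stored: PromptPolicy, current: PromptPolicy) -> None:
    """Refuse to search when the stored and current policies differ."""
    if stored != current:
        # Report which fields changed so the user knows why a reindex is needed.
        changed = [
            name
            for name in ("model", "query_prompt", "document_prompt")
            if getattr(stored, name) != getattr(current, name)
        ]
        raise PromptPolicyMismatch(
            "reindex required; changed fields: " + ", ".join(changed)
        )
```

Failing loudly here is a deliberate choice: a silent fallback would return results ranked against mismatched embeddings.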
- CPU is the default for the most stable indexing behavior
- Supports Apple Silicon (MPS) and NVIDIA GPUs (CUDA)
- Use --gpu to opt into accelerator embedding
- FastEmbed/CoreML on Apple Silicon is experimental and requires ARC_EXPERIMENTAL_COREML=1
- All operations via command-line interface
- JSON output mode for automation
- Structured error messages with exit codes
- Python >= 3.12 required
- Slash commands for all operations (/arc:search, /arc:index, /arc:collection, etc.)
- Discoverable via /help or /commands in Claude Code
- No MCP overhead - direct CLI execution
Get started with Arcaneum in just a few commands:
# 1. Install
pipx install "https://github.com/cwensel/arcaneum/releases/download/v0.8.2/arcaneum-0.8.2-py3-none-any.whl"
# 2. Install Claude Code plugin (optional, in Claude Code)
# /plugin install cwensel/arcaneum
# 3. Verify and start services
arc doctor
arc container start
# 4. Create a corpus and sync content (indexes to both Qdrant and MeiliSearch)
arc corpus create Frameworks --type code
arc corpus sync Frameworks ~/libs/fastapi ~/libs/sqlalchemy
# 5. Search with semantic or full-text queries
arc search semantic "dependency injection pattern" --corpus Frameworks
arc search text "async def" --corpus Frameworks
First time? Run arc doctor to check prerequisites and get setup guidance.
Full Quick Start Guide - Detailed walkthrough with troubleshooting
# Service Management
arc container start # Start Qdrant and MeiliSearch
arc container status # Check service health
arc container backup # Back up Qdrant and MeiliSearch data
arc container restore DIR # Restore a backup
arc doctor # Verify setup
# Corpus (Recommended - Dual Indexing to Both Systems)
arc corpus create NAME --type TYPE # pdf, code, or markdown
arc corpus list # List all corpora
arc corpus sync NAME PATH [PATH...] # Sync one or more directories
arc corpus sync NAME PATH --parity # Also detect renames, remove files no longer on disk
arc corpus items NAME # List items with parity status
arc corpus verify NAME # Verify corpus health across both systems
arc corpus parity NAME # Check/restore parity between systems
arc corpus repair NAME # Re-index incomplete or garbled files
arc corpus update NAME --description "..." # Update corpus metadata
arc corpus delete NAME # Delete both collection and index
# Search (Works with corpus, collection, or index)
arc search semantic "query" --corpus NAME # Conceptual similarity
arc search semantic "query" --corpus N1 --corpus N2 # Multi-corpus
arc search text "query" --corpus NAME # Exact phrase matching
# --- Advanced: single-system only (prefer `arc corpus` above for normal use) ---
# Collections (Qdrant Only - Semantic Search)
arc collection create NAME --type TYPE # When you only need semantic search
arc collection list
arc collection items NAME
arc index pdf PATH --collection NAME
arc index code PATH --collection NAME
# Indexes (MeiliSearch Only - Full-Text Search)
arc indexes create NAME --type TYPE # When you only need full-text search
arc indexes list
arc index text pdf PATH --index NAME
arc index text code PATH --index NAME
arc index text markdown PATH --index NAME
# Create a corpus for framework source code
arc corpus create Frameworks --type code
# Sync framework directories (indexes to both Qdrant and MeiliSearch)
arc corpus sync Frameworks ~/libs/fastapi ~/libs/sqlalchemy
# List what's indexed
arc corpus items Frameworks
# Semantic search for patterns and APIs
arc search semantic "dependency injection pattern" --corpus Frameworks --limit 10
# Full-text search for exact code
arc search text "async def create_app" --corpus Frameworks
# Create a corpus for PDF documents
arc corpus create Papers --type pdf
# Sync documentation directories
arc corpus sync Papers ~/Documents/papers ~/Documents/specs
# Semantic search for concepts
arc search semantic "distributed consensus algorithms" --corpus Papers
# Full-text search for exact phrases
arc search text '"rate limiting"' --corpus Papers
# Create a corpus for notes and documentation
arc corpus create Notes --type markdown
# Sync your notes directory
arc corpus sync Notes ~/obsidian-vault
# Semantic search
arc search semantic "project planning" --corpus Notes
# Full-text search
arc search text "meeting notes" --corpus Notes
Features:
- YAML frontmatter extraction (title, tags, category, etc.)
- Semantic chunking preserving document structure
- Incremental sync (SHA256 content hashing)
- Custom exclude patterns
- Supports .md, .markdown, .mdown extensions
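Incremental sync by content hash can be sketched as follows. This is an illustration under stated assumptions, not Arcaneum's implementation: the function names and the manifest format (a mapping of relative path to SHA256 hex digest) are hypothetical, and only the three extensions listed above are considered.

```python
import hashlib
from pathlib import Path

MARKDOWN_SUFFIXES = {".md", ".markdown", ".mdown"}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash file contents in chunks so large files are not read in one go."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_sync(
    root: Path, manifest: dict[str, str]
) -> tuple[list[str], list[str], dict[str, str]]:
    """Compare files under root with a previous {relative_path: sha256} manifest.

    Returns (to_index, to_remove, new_manifest). Files whose hash is
    unchanged are skipped, which is what makes the sync incremental.
    """
    current = {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file() and path.suffix.lower() in MARKDOWN_SUFFIXES
    }
    # New or modified files: no manifest entry, or a different digest.
    to_index = [rel for rel, digest in current.items() if manifest.get(rel) != digest]
    # Files recorded previously that no longer exist on disk.
    to_remove = [rel for rel in manifest if rel not in current]
    return to_index, to_remove, current
```

Hashing content rather than comparing modification times means that touching a file without changing it does not trigger re-embedding.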
Prefer arc corpus sync for normal use. Use collections or indexes directly
only when you explicitly need one type of search without the other:
# Semantic search only (Qdrant collection)
arc collection create MyCollection --type code
arc index code ~/project --collection MyCollection
arc search semantic "query" --corpus MyCollection
# Full-text search only (MeiliSearch index)
arc indexes create MyIndex --type pdf
arc index text pdf ~/docs --index MyIndex
arc search text "query" --corpus MyIndex
# Create a corpus for agent-generated content
arc corpus create Memory --type markdown
# Store from file with metadata
arc store analysis.md --collection Memory \
--title "Security Analysis" \
--category "security" \
--tags "audit,findings"
# Store from stdin (agent workflow)
printf '# Research\n\nFindings...\n' | arc store - --collection Memory
# Search agent memory
arc search semantic "security vulnerabilities" --corpus Memory
arc search text "SQL injection" --corpus Memory
# Content persisted to: ~/.local/share/arcaneum/agent-memory/{collection}/
# Enables re-indexing and full-text retrieval
Use Case: Designed for AI agents to store research, analysis, and synthesized information with rich metadata. Content is automatically persisted for durability.
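An agent written in Python could drive the same store workflow through the CLI. The sketch below is a hedged example: build_store_argv and store_text are hypothetical helpers, they use only the flags shown in the commands above, and they assume the metadata flags can be combined with stdin input (the examples above show them only with a file).

```python
import subprocess
from typing import Optional, Sequence


def build_store_argv(
    collection: str,
    source: str = "-",
    title: Optional[str] = None,
    category: Optional[str] = None,
    tags: Sequence[str] = (),
) -> list[str]:
    """Build an `arc store` command line; "-" means read from stdin."""
    argv = ["arc", "store", source, "--collection", collection]
    if title:
        argv += ["--title", title]
    if category:
        argv += ["--category", category]
    if tags:
        argv += ["--tags", ",".join(tags)]  # comma-separated, as in the example
    return argv


def store_text(markdown: str, collection: str, **metadata) -> None:
    """Pipe markdown to `arc store -`; raises CalledProcessError on failure."""
    subprocess.run(
        build_store_argv(collection, **metadata),
        input=markdown,  # sent to the child process on stdin
        text=True,
        check=True,
    )
```

Passing the arguments as a list avoids shell quoting problems when titles or tags contain spaces.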
arc container start # Start Qdrant and MeiliSearch
arc container status # Check health
arc container backup # Create a timestamped backup
arc container logs # View logs
arc container stop # Stop services
- Python 3.12+ - Check with python --version
- pipx - Recommended for global CLI install
- Docker - Install Docker Desktop (Mac/Windows) or Docker Engine (Linux)
# Recommended: Install via pipx from latest release
pipx install "https://github.com/cwensel/arcaneum/releases/download/v0.8.2/arcaneum-0.8.2-py3-none-any.whl"
# Or install via Homebrew (macOS/Linux)
brew install cwensel/arcaneum/arcaneum
# Or install latest from source
pipx install "git+https://github.com/cwensel/arcaneum.git"
# Development install (from cloned repo)
git clone https://github.com/cwensel/arcaneum
cd arcaneum
pip install -e ".[dev]"
After installing the CLI globally, install the plugin in Claude Code:
/plugin install cwensel/arcaneum
The plugin assumes arc is available in PATH. Slash commands execute arc directly.
arc doctor
The doctor command checks your environment and guides you through any issues.
Full Installation Guide - Complete walkthrough with troubleshooting
Arcaneum stores data in XDG-compliant locations:
Cache (Re-downloadable):
~/.cache/arcaneum/models/ # Embedding models, ~1-2GB per model
Data (User-created):
~/.local/share/arcaneum/ # Local databases and indexed content
Vector Database (Docker):
Qdrant uses Docker named volumes for data persistence and safety:
qdrant-arcaneum-storage # Main vector database storage
qdrant-arcaneum-snapshots # Backup snapshots
Named volumes store data on a Linux ext4 filesystem inside Docker, providing better reliability and performance than bind mounts.
Legacy Migration:
If upgrading from an older version with ~/.arcaneum/, the directory will be
automatically migrated to XDG-compliant locations on first run. Qdrant client
configuration is read from ~/.config/arcaneum/config.yaml; an existing legacy
~/.arcaneum/config.yaml is copied there the first time Qdrant configuration is
loaded.
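The default locations above follow the XDG Base Directory layout. As a rough sketch, a script that needs to find these directories could resolve them as shown below. The helper names are hypothetical, and whether Arcaneum itself honors the XDG_*_HOME environment overrides is an assumption; the paths documented above are the defaults.

```python
import os
from pathlib import Path


def xdg_dir(env_var: str, default: str, app: str = "arcaneum") -> Path:
    """Resolve an XDG base directory for app.

    Per the XDG spec, an unset, empty, or relative value is ignored
    and the default under the home directory is used instead.
    """
    raw = os.environ.get(env_var, "")
    if raw and Path(raw).is_absolute():
        base = Path(raw)
    else:
        base = Path.home() / default
    return base / app


def arcaneum_dirs() -> dict[str, Path]:
    """Return the cache, data, and config directories (sketch only)."""
    return {
        "cache": xdg_dir("XDG_CACHE_HOME", ".cache"),
        "data": xdg_dir("XDG_DATA_HOME", ".local/share"),
        "config": xdg_dir("XDG_CONFIG_HOME", ".config"),
    }
```

With no overrides set, arcaneum_dirs()["config"] / "config.yaml" corresponds to the ~/.config/arcaneum/config.yaml path mentioned above.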
Migration Note: If you're upgrading from bind mounts to named volumes, see Qdrant Migration Guide for detailed migration steps.
Benefits:
- Reliable data persistence across container restarts
- Better performance compared to bind mounts
- Easy backup via arc container backup
- Native Linux filesystem (ext4) for data safety
Use arc container backup before upgrades or migrations. It creates a
timestamped directory under ~/.local/share/arcaneum/backups/ with Qdrant
collection snapshots plus MeiliSearch index settings and JSONL document exports.
Restore with arc container restore <backup-directory> while the container
services are running. Restore recreates same-named MeiliSearch indexes from the
backup.
Run backups when indexing is idle. arc container backup checks MeiliSearch for
active tasks before and after export and aborts if any MeiliSearch task appears
during the backup window.
Backups protect Arcaneum's indexed data and corpus metadata stored in Qdrant and MeiliSearch. They do not include source files referenced by indexes, cached embedding models, Docker images, or local configuration secrets.
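When scripting a restore, it can help to locate the most recent backup directory first. The sketch below is illustrative only: latest_backup is a hypothetical helper, and because the exact naming of the timestamped directories is not documented here, it picks the newest directory by modification time rather than by parsing names.

```python
from pathlib import Path
from typing import Optional


def latest_backup(backups_root: Path) -> Optional[Path]:
    """Return the most recently modified backup directory, or None."""
    if not backups_root.is_dir():
        return None
    # Only directories count as backups; stray files are ignored.
    candidates = [entry for entry in backups_root.iterdir() if entry.is_dir()]
    return max(candidates, key=lambda entry: entry.stat().st_mtime, default=None)


# Example (path taken from the text above):
# newest = latest_backup(Path.home() / ".local/share/arcaneum/backups")
# then run: arc container restore <newest>
```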
Behind a VPN with SSL issues? See Corporate Network Setup for:
- Offline mode setup
- SSL certificate workarounds
- Model pre-downloading
Install the CLI globally first (see Installation), then in Claude Code:
/plugin install cwensel/arcaneum
All commands use the arc: namespace prefix:
| Command | Description |
|---|---|
| /arc:corpus | Recommended - Manage dual-index corpora (Qdrant + MeiliSearch) |
| /arc:search | Semantic or full-text search |
| /arc:index | Index PDF, code, or markdown content |
| /arc:store | Store agent-generated content for memory |
| /arc:container | Manage Docker services (start, stop, status) |
| /arc:doctor | Verify setup and prerequisites |
| /arc:models | List available embedding models |
| /arc:config | Manage configuration and cache |
| /arc:collection | Manage Qdrant collections (semantic search only) |
| /arc:indexes | Manage MeiliSearch indexes (full-text search only) |
Usage Examples:
/arc:corpus create my-docs --type pdf
/arc:corpus sync my-docs ~/Documents
/arc:search semantic "example query" --corpus my-docs
/arc:search text "exact phrase" --corpus my-docs
/arc:models list
Use /help in Claude Code to see all available commands or /arc:doctor to check your setup.
For Developers: See Claude Code Plugin Testing Guide for local testing instructions.
- CLI-First: All functionality as CLI tools (RDR-001, RDR-006)
- Slash Commands: Thin wrappers calling CLI via Bash (RDR-006)
- No MCP (v1): Avoid MCP overhead, use direct CLI execution (RDR-006)
- Local Docker: Databases run locally with volume persistence (RDR-002, RDR-008)
- RDR-Based Planning: Detailed design before implementation (docs/rdr/)
- RDR-001: Project structure (COMPLETED)
- RDR-002: Qdrant server setup (COMPLETED)
- RDR-003: Collection management (COMPLETED)
- RDR-004: PDF bulk indexing (COMPLETED)
- RDR-005: Source code indexing (COMPLETED)
- RDR-006: Claude Code integration (COMPLETED)
- RDR-007: Semantic search (COMPLETED)
- RDR-008: MeiliSearch setup (COMPLETED)
- RDR-009: Dual indexing strategy (COMPLETED)
- RDR-010: PDF full-text indexing (COMPLETED)
- RDR-011: Source code full-text indexing (COMPLETED)
- RDR-012: Full-text search integration (COMPLETED)
- RDR-014: Markdown indexing (COMPLETED)
# Run unit tests
pytest tests/unit tests/fulltext tests/schema -v
# Run with coverage
pytest tests/unit tests/fulltext tests/schema --cov=src/arcaneum -v
# Run integration tests (requires Qdrant and MeiliSearch running)
arc container start
pytest tests/integration tests/indexing tests/cli -v
- Quick Start Guide - Installation, setup, and your first search
- CLI Reference - Complete command documentation and options
- PDF Indexing Guide - Advanced PDF indexing with OCR support, performance tuning, and troubleshooting
- Qdrant Migration Guide - Migrate from bind mounts to Docker named volumes
- Corporate Network Setup - Setup for VPN, SSL certificates, and offline mode
- Claude Code Plugin Testing Guide - Local development and testing for plugin developers
- RDR Process - Recommendation Data Records workflow for complex features
- Individual RDRs - Technical specifications and design decisions for each feature
- Slash Commands - Claude Code plugin command implementations
We welcome contributions! See CONTRIBUTING.md for detailed guidelines on:
- Development setup and workflow
- When to create RDRs
- Code and documentation standards
- Pull request process
Quick Start for Contributors:
- Read docs/rdr/README.md for RDR-based development workflow
- Create an RDR for complex features before implementation
- Follow CLI-first architecture pattern
- Add tests for new functionality
- Update this README with implementation status
MIT - See LICENSE file for details
- Built on Qdrant and MeiliSearch