Arcaneum

CLI tools and Claude Code plugins for semantic and full-text search.

pipx install "https://github.com/cwensel/arcaneum/releases/download/v0.8.2/arcaneum-0.8.2-py3-none-any.whl"

Overview

Arcaneum helps you discover and understand project dependencies, documentation, and reference implementations. By indexing libraries, frameworks, and technical papers, you can semantically search for patterns, APIs, and concepts when building new projects. It works especially well with the RDR (Recommendation Data Record) model for AI-assisted development planning.

The system supports PDF documents and source code with git-aware, AST-based chunking.

Currently Available:

  • Semantic search with Qdrant (vector embeddings)
  • Full-text search with MeiliSearch (exact phrase matching)
  • Dual indexing workflow for comprehensive search

Features

Search Capabilities

  • Semantic Search (Qdrant): Find conceptually similar content using vector embeddings
  • Full-Text Search (MeiliSearch): Exact phrase matching, keyword search, and typo-tolerant queries

Indexing

  • PDF Indexing: OCR support for scanned documents, page-level metadata, parallel processing
  • Source Code Indexing: Git-aware with AST chunking, multi-branch support, 165+ languages
  • Markdown Indexing: YAML frontmatter extraction, semantic chunking, incremental sync
  • Dual Indexing: Single command to index to both search engines
  • Performance Tuning: Granular control over workers, batch sizes, and process priority. arc corpus sync accepts --max-embedding-batch, --text-workers, and --cpu-workers; the single-system indexing commands accept flags such as --embedding-batch-size and --process-priority
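As an illustration of the tuning flags, here is a minimal sketch of a wrapper that sizes worker counts from the machine's core count. The flag names come from this README; the integer values, the "cores minus one" heuristic, and the helper names cpu_count and tuned_sync are assumptions made for the example, not Arcaneum defaults.

```shell
# Sketch only: flag names are from the README; integer values and the
# "cores minus one" heuristic are assumptions, not Arcaneum defaults.
cpu_count() {
  # nproc covers most Linux systems, sysctl covers macOS; fall back to 1.
  nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1
}

tuned_sync() {
  # $1 = corpus name, remaining arguments = paths to sync
  corpus=$1; shift
  cores=$(cpu_count)
  # Leave one core free for the rest of the system, but never go below 1.
  if [ "$cores" -gt 1 ]; then workers=$((cores - 1)); else workers=1; fi
  arc corpus sync "$corpus" "$@" --cpu-workers "$workers" --text-workers "$workers"
}

# Usage: tuned_sync Frameworks ~/libs/fastapi ~/libs/sqlalchemy
```

Check arc corpus sync --help for the accepted value ranges before relying on a wrapper like this.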

Multiple Embedding Models

  • arctic-m (768D) - DEFAULT for PDFs/markdown - stable FastEmbed retrieval model
  • stella (1024D) - High-quality opt-in document model, requires arcaneum[sentence-transformers]
  • mxbai-large (1024D) - High-quality FastEmbed document model
  • jina-code (768D) - DEFAULT for code - stable FastEmbed code model
  • jina-code-st (768D) - Legacy SentenceTransformers code path, requires arcaneum[sentence-transformers]
  • jina-code-0.5b (896D) - Higher-quality opt-in code model, 32K context, requires arcaneum[sentence-transformers]
  • jina-code-1.5b (1536D) - Highest quality code embeddings, SOTA Sept 2025, requires arcaneum[sentence-transformers]
  • codesage-large (1024D) - CodeSage V2, 9 programming languages, requires arcaneum[sentence-transformers]
  • bge-large (1024D) - BGE large embeddings, balanced performance
  • jina-v3 (1024D) - Multilingual embeddings with extended 8K context
  • bge-base (768D) - BGE base embeddings, balanced performance and speed
  • bge-small (384D) - BGE small embeddings, fastest for size-constrained scenarios

See arc models list for complete model information and recommendations. Use arc models list --json for the LLM-readable catalog: it includes backend, recommended and default corpus uses, support/risk tier, prompt policy, context limit, hardware support, runtime-aligned batch guidance, and reindex warnings.

Arcaneum records each collection's embedding prompt policy when it is indexed. Reindex a corpus after changing model query/document prompts, tasks, or prompt-aware model defaults; semantic search rejects collections whose stored prompt policy no longer matches the current model registry.

GPU Acceleration

  • CPU is the default for the most stable indexing behavior
  • Supports Apple Silicon (MPS) and NVIDIA GPUs (CUDA)
  • Use --gpu to opt into accelerator embedding
  • FastEmbed/CoreML on Apple Silicon is experimental and requires ARC_EXPERIMENTAL_COREML=1
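Because CPU is the default, a script may want to pass --gpu only where an accelerator is likely present. The sketch below is one way to do that. It assumes --gpu is accepted by arc corpus sync (the README names the flag but not the commands that take it), and the helper names has_accelerator and sync_maybe_gpu are hypothetical.

```shell
# Sketch only: assumes `arc corpus sync` accepts --gpu.
has_accelerator() {
  # NVIDIA drivers ship nvidia-smi; Apple Silicon reports Darwin on arm64.
  if command -v nvidia-smi >/dev/null 2>&1; then return 0; fi
  [ "$(uname -sm)" = "Darwin arm64" ]
}

sync_maybe_gpu() {
  # $1 = corpus name, remaining arguments = paths to sync
  corpus=$1; shift
  if has_accelerator; then
    arc corpus sync "$corpus" "$@" --gpu
  else
    arc corpus sync "$corpus" "$@"
  fi
}

# Usage: sync_maybe_gpu Papers ~/Documents/papers
```

The detection is deliberately coarse: finding nvidia-smi on PATH does not prove CUDA works, so falling back to the CPU default remains the safe choice when in doubt.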

CLI-First Design

  • All operations via command-line interface
  • JSON output mode for automation
  • Structured error messages with exit codes
  • Python >= 3.12 required
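Exit codes make the CLI easy to chain in scripts. The following is a minimal sketch of a step runner that stops a pipeline at the first failing command. It treats any non-zero status as failure, since this README does not list specific exit code meanings, and run_step is a hypothetical helper name.

```shell
run_step() {
  # Run a command; on failure report its exit status on stderr and pass it on.
  "$@"
  status=$?
  if [ "$status" -ne 0 ]; then
    echo "step failed (exit $status): $*" >&2
  fi
  return "$status"
}

# Usage: stop at the first failing step.
# run_step arc doctor && run_step arc container start
```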

Claude Code Integration

  • Slash commands for all operations (/arc:search, /arc:index, /arc:collection, etc.)
  • Discoverable via /help or /commands in Claude Code
  • No MCP overhead - direct CLI execution

Quick Start

Get started with Arcaneum in just a few commands:

# 1. Install
pipx install "https://github.com/cwensel/arcaneum/releases/download/v0.8.2/arcaneum-0.8.2-py3-none-any.whl"

# 2. Install Claude Code plugin (optional, in Claude Code)
# /plugin install cwensel/arcaneum

# 3. Verify and start services
arc doctor
arc container start

# 4. Create a corpus and sync content (indexes to both Qdrant and MeiliSearch)
arc corpus create Frameworks --type code
arc corpus sync Frameworks ~/libs/fastapi ~/libs/sqlalchemy

# 5. Search with semantic or full-text queries
arc search semantic "dependency injection pattern" --corpus Frameworks
arc search text "async def" --corpus Frameworks

First time? Run arc doctor to check prerequisites and get setup guidance.

👉 Full Quick Start Guide - Detailed walkthrough with troubleshooting

Quick Reference

# Service Management
arc container start          # Start Qdrant and MeiliSearch
arc container status         # Check service health
arc container backup         # Back up Qdrant and MeiliSearch data
arc container restore DIR    # Restore a backup
arc doctor                   # Verify setup

# Corpus (Recommended - Dual Indexing to Both Systems)
arc corpus create NAME --type TYPE              # pdf, code, or markdown
arc corpus list                                 # List all corpora
arc corpus sync NAME PATH [PATH...]             # Sync one or more directories
arc corpus sync NAME PATH --parity              # Also detect renames, remove files no longer on disk
arc corpus items NAME                           # List items with parity status
arc corpus verify NAME                          # Verify corpus health across both systems
arc corpus parity NAME                          # Check/restore parity between systems
arc corpus repair NAME                          # Re-index incomplete or garbled files
arc corpus update NAME --description "..."      # Update corpus metadata
arc corpus delete NAME                          # Delete both collection and index

# Search (Works with corpus, collection, or index)
arc search semantic "query" --corpus NAME              # Conceptual similarity
arc search semantic "query" --corpus N1 --corpus N2    # Multi-corpus
arc search text "query" --corpus NAME                  # Exact phrase matching

# --- Advanced: single-system only (prefer `arc corpus` above for normal use) ---

# Collections (Qdrant Only - Semantic Search)
arc collection create NAME --type TYPE   # When you only need semantic search
arc collection list
arc collection items NAME
arc index pdf PATH --collection NAME
arc index code PATH --collection NAME

# Indexes (MeiliSearch Only - Full-Text Search)
arc indexes create NAME --type TYPE      # When you only need full-text search
arc indexes list
arc index text pdf PATH --index NAME
arc index text code PATH --index NAME
arc index text markdown PATH --index NAME
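The multi-corpus form repeats --corpus once per corpus. When the corpus list lives in a script, the flags can be built from positional arguments. This is a small sketch in plain POSIX shell, with search_corpora as a hypothetical helper name.

```shell
search_corpora() {
  # $1 = query, remaining arguments = corpus names
  query=$1; shift
  n=$#
  # Rotate through the arguments, replacing each NAME with "--corpus NAME".
  while [ "$n" -gt 0 ]; do
    set -- "$@" --corpus "$1"
    shift
    n=$((n - 1))
  done
  arc search semantic "$query" "$@"
}

# Usage: search_corpora "dependency injection pattern" Frameworks Papers
```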

Common Workflows

Search Dependencies and Libraries (Recommended)

# Create a corpus for framework source code
arc corpus create Frameworks --type code

# Sync framework directories (indexes to both Qdrant and MeiliSearch)
arc corpus sync Frameworks ~/libs/fastapi ~/libs/sqlalchemy

# List what's indexed
arc corpus items Frameworks

# Semantic search for patterns and APIs
arc search semantic "dependency injection pattern" --corpus Frameworks --limit 10

# Full-text search for exact code
arc search text "async def create_app" --corpus Frameworks

Search Technical Documentation

# Create a corpus for PDF documents
arc corpus create Papers --type pdf

# Sync documentation directories
arc corpus sync Papers ~/Documents/papers ~/Documents/specs

# Semantic search for concepts
arc search semantic "distributed consensus algorithms" --corpus Papers

# Full-text search for exact phrases
arc search text '"rate limiting"' --corpus Papers

Index Markdown Files

# Create a corpus for notes and documentation
arc corpus create Notes --type markdown

# Sync your notes directory
arc corpus sync Notes ~/obsidian-vault

# Semantic search
arc search semantic "project planning" --corpus Notes

# Full-text search
arc search text "meeting notes" --corpus Notes

Features:

  • YAML frontmatter extraction (title, tags, category, etc.)
  • Semantic chunking preserving document structure
  • Incremental sync (SHA256 content hashing)
  • Custom exclude patterns
  • Supports .md, .markdown, .mdown extensions
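The idea behind hash-based incremental sync can be sketched in a few lines: hash each file, compare it with the hash recorded at the previous sync, and re-index only on a mismatch. This illustrates the general technique and is not Arcaneum's implementation; file_sha256 and needs_reindex are hypothetical names.

```shell
file_sha256() {
  # sha256sum is common on Linux; shasum ships with macOS.
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d ' ' -f 1
  else
    shasum -a 256 "$1" | cut -d ' ' -f 1
  fi
}

needs_reindex() {
  # $1 = file, $2 = hash recorded at the previous sync (may be empty)
  [ "$(file_sha256 "$1")" != "$2" ]
}

# Usage:
# recorded=$(file_sha256 notes/todo.md)
# ...later...
# if needs_reindex notes/todo.md "$recorded"; then echo "changed"; fi
```

Hashing content rather than comparing modification times means a file that is touched or re-checked-out without changes is not re-indexed.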

Single-System Indexing (Advanced)

Prefer arc corpus sync for normal use. Use collections or indexes directly only when you explicitly need one type of search without the other:

# Semantic search only (Qdrant collection)
arc collection create MyCollection --type code
arc index code ~/project --collection MyCollection
arc search semantic "query" --corpus MyCollection

# Full-text search only (MeiliSearch index)
arc indexes create MyIndex --type pdf
arc index text pdf ~/docs --index MyIndex
arc search text "query" --corpus MyIndex

Store Agent Memory

# Create a corpus for agent-generated content
arc corpus create Memory --type markdown

# Store from file with metadata
arc store analysis.md --collection Memory \
  --title "Security Analysis" \
  --category "security" \
  --tags "audit,findings"

# Store from stdin (agent workflow)
printf '# Research\n\nFindings...\n' | arc store - --collection Memory

# Search agent memory
arc search semantic "security vulnerabilities" --corpus Memory
arc search text "SQL injection" --corpus Memory

# Content persisted to: ~/.local/share/arcaneum/agent-memory/{collection}/
# Enables re-indexing and full-text retrieval

Use Case: Designed for AI agents to store research, analysis, and synthesized information with rich metadata. Content is automatically persisted for durability.
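For multi-line notes, a here-document is often easier to read than an inline string. The sketch below wraps the stdin form in a small helper. It assumes the --title and --tags flags shown for file input are also accepted with stdin input, which this README does not state; store_note is a hypothetical name.

```shell
store_note() {
  # $1 = collection, $2 = title, $3 = comma-separated tags
  # The note body is read from stdin and passed through to `arc store -`.
  arc store - --collection "$1" --title "$2" --tags "$3"
}

# Usage:
# store_note Memory "Security Analysis" "audit,findings" <<'EOF'
# # Research
#
# Findings...
# EOF
```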

Manage Services

arc container start    # Start Qdrant and MeiliSearch
arc container status   # Check health
arc container backup   # Create a timestamped backup
arc container logs     # View logs
arc container stop     # Stop services
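Scripts that start the services and index right away may want to wait until the services respond. The sketch below polls arc container status with a bounded number of attempts. It assumes the command exits non-zero while services are unhealthy, which this README does not confirm; wait_for_services is a hypothetical name.

```shell
wait_for_services() {
  # $1 = max attempts (default 10), $2 = seconds between attempts (default 3)
  attempts=${1:-10}
  delay=${2:-3}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Assumption: a zero exit status means both services are healthy.
    if arc container status >/dev/null 2>&1; then return 0; fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage: arc container start && wait_for_services 20 2 && arc corpus list
```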

Installation

Prerequisites

  • Python 3.12+ - Check with python --version
  • pipx - Recommended for global CLI install
  • Docker - Install Docker Desktop (Mac/Windows) or Docker Engine (Linux)

Install

# Recommended: Install via pipx from latest release
pipx install "https://github.com/cwensel/arcaneum/releases/download/v0.8.2/arcaneum-0.8.2-py3-none-any.whl"

# Or install via Homebrew (macOS/Linux)
brew install cwensel/arcaneum/arcaneum

# Or install latest from source
pipx install "git+https://github.com/cwensel/arcaneum.git"

# Development install (from cloned repo)
git clone https://github.com/cwensel/arcaneum
cd arcaneum
pip install -e ".[dev]"

Claude Code Plugin

After installing the CLI globally, install the plugin in Claude Code:

/plugin install cwensel/arcaneum

The plugin assumes arc is available in PATH. Slash commands execute arc directly.
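Since the slash commands call arc directly, a quick PATH check before installing the plugin can save a confusing failure later. This is a minimal sketch; require_arc is a hypothetical helper name.

```shell
require_arc() {
  # command -v succeeds when arc resolves to an executable on PATH.
  if command -v arc >/dev/null 2>&1; then
    return 0
  fi
  echo "arc not found on PATH; if it was installed with pipx, 'pipx ensurepath' may help" >&2
  return 1
}

# Usage: require_arc && arc doctor
```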

Verify Setup

arc doctor

The doctor command checks your environment and guides you through any issues.

👉 Full Installation Guide - Complete walkthrough with troubleshooting

Data Storage

Arcaneum stores data in XDG-compliant locations:

Cache (Re-downloadable):

~/.cache/arcaneum/models/     # Embedding models, ~1-2GB per model

Data (User-created):

~/.local/share/arcaneum/      # Local databases and indexed content
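To see how much disk space these locations use, the paths can be resolved the way the XDG Base Directory convention describes. The sketch assumes Arcaneum honors the XDG_CACHE_HOME and XDG_DATA_HOME overrides; this README shows only the default paths, and the function names are hypothetical.

```shell
# Assumption: Arcaneum follows XDG_CACHE_HOME / XDG_DATA_HOME when they are set.
arcaneum_cache_dir() {
  echo "${XDG_CACHE_HOME:-$HOME/.cache}/arcaneum/models"
}

arcaneum_data_dir() {
  echo "${XDG_DATA_HOME:-$HOME/.local/share}/arcaneum"
}

# Usage: du -sh "$(arcaneum_cache_dir)" "$(arcaneum_data_dir)"
```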

Vector Database (Docker):

Qdrant uses Docker named volumes for data persistence and safety:

qdrant-arcaneum-storage    # Main vector database storage
qdrant-arcaneum-snapshots  # Backup snapshots

Named volumes store data on a Linux ext4 filesystem inside Docker, providing better reliability and performance than bind mounts.

Legacy Migration:

If upgrading from an older version with ~/.arcaneum/, the directory will be automatically migrated to XDG-compliant locations on first run. Qdrant client configuration is read from ~/.config/arcaneum/config.yaml; an existing legacy ~/.arcaneum/config.yaml is copied there the first time Qdrant configuration is loaded.

Migration Note: If you're upgrading from bind mounts to named volumes, see the Qdrant Migration Guide for detailed migration steps.

Benefits:

  • Reliable data persistence across container restarts
  • Better performance compared to bind mounts
  • Easy backup via arc container backup
  • Native Linux filesystem (ext4) for data safety

Backup and Restore

Use arc container backup before upgrades or migrations. It creates a timestamped directory under ~/.local/share/arcaneum/backups/ with Qdrant collection snapshots plus MeiliSearch index settings and JSONL document exports. Restore with arc container restore <backup-directory> while the container services are running. Restore recreates same-named MeiliSearch indexes from the backup.

Run backups when indexing is idle. arc container backup checks MeiliSearch for active tasks before and after export and aborts if any MeiliSearch task appears during the backup window.

Backups protect Arcaneum's indexed data and corpus metadata stored in Qdrant and MeiliSearch. They do not include source files referenced by indexes, cached embedding models, Docker images, or local configuration secrets.
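When restoring, the most recent backup directory usually has to be located first. The sketch below picks the last subdirectory of the backups folder in sorted order. It assumes the timestamped directory names sort chronologically, which this README does not specify; latest_backup is a hypothetical name.

```shell
latest_backup() {
  # $1 = backups directory (defaults to the location given in the README)
  backups=${1:-"$HOME/.local/share/arcaneum/backups"}
  latest=""
  # Glob results are sorted, so the last match is the highest-sorting name.
  for d in "$backups"/*/; do
    [ -d "$d" ] || continue   # skips the unexpanded pattern when nothing matches
    latest=${d%/}
  done
  [ -n "$latest" ] && echo "$latest"
}

# Usage (with services running):
# arc container restore "$(latest_backup)"
```

Review the printed path before restoring, since a restore recreates same-named MeiliSearch indexes from the backup.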

Corporate Networks

Behind a VPN with SSL issues? See Corporate Network Setup for:

  • Offline mode setup
  • SSL certificate workarounds
  • Model pre-downloading

Claude Code Plugin

Install the CLI globally first (see Installation), then in Claude Code:

/plugin install cwensel/arcaneum

Available Commands

All commands use the arc: namespace prefix:

Command           Description
/arc:corpus       Recommended. Manage dual-index corpora (Qdrant + MeiliSearch)
/arc:search       Semantic or full-text search
/arc:index        Index PDF, code, or markdown content
/arc:store        Store agent-generated content for memory
/arc:container    Manage Docker services (start, stop, status)
/arc:doctor       Verify setup and prerequisites
/arc:models       List available embedding models
/arc:config       Manage configuration and cache
/arc:collection   Manage Qdrant collections (semantic search only)
/arc:indexes      Manage MeiliSearch indexes (full-text search only)

Usage Examples:

/arc:corpus create my-docs --type pdf
/arc:corpus sync my-docs ~/Documents
/arc:search semantic "example query" --corpus my-docs
/arc:search text "exact phrase" --corpus my-docs
/arc:models list

Use /help in Claude Code to see all available commands or /arc:doctor to check your setup.

For Developers: See the Claude Code Plugin Testing Guide for local testing instructions.

Development

Architecture Principles

  1. CLI-First: All functionality as CLI tools (RDR-001, RDR-006)
  2. Slash Commands: Thin wrappers calling CLI via Bash (RDR-006)
  3. No MCP (v1): Avoid MCP overhead, use direct CLI execution (RDR-006)
  4. Local Docker: Databases run locally with volume persistence (RDR-002, RDR-008)
  5. RDR-Based Planning: Detailed design before implementation (docs/rdr/)

Implementation Status

  • ✅ RDR-001: Project structure (COMPLETED)
  • ✅ RDR-002: Qdrant server setup (COMPLETED)
  • ✅ RDR-003: Collection management (COMPLETED)
  • ✅ RDR-004: PDF bulk indexing (COMPLETED)
  • ✅ RDR-005: Source code indexing (COMPLETED)
  • ✅ RDR-006: Claude Code integration (COMPLETED)
  • ✅ RDR-007: Semantic search (COMPLETED)
  • ✅ RDR-008: MeiliSearch setup (COMPLETED)
  • ✅ RDR-009: Dual indexing strategy (COMPLETED)
  • ✅ RDR-010: PDF full-text indexing (COMPLETED)
  • ✅ RDR-011: Source code full-text indexing (COMPLETED)
  • ✅ RDR-012: Full-text search integration (COMPLETED)
  • ✅ RDR-014: Markdown indexing (COMPLETED)

Testing

# Run unit tests
pytest tests/unit tests/fulltext tests/schema -v

# Run with coverage
pytest tests/unit tests/fulltext tests/schema --cov=src/arcaneum -v

# Run integration tests (requires Qdrant and MeiliSearch running)
arc container start
pytest tests/integration tests/indexing tests/cli -v

Documentation

User Guides

Development

  • RDR Process - Recommendation Data Records workflow for complex features
  • Individual RDRs - Technical specifications and design decisions for each feature
  • Slash Commands - Claude Code plugin command implementations

Contributing

We welcome contributions! See CONTRIBUTING.md for detailed guidelines on:

  • Development setup and workflow
  • When to create RDRs
  • Code and documentation standards
  • Pull request process

Quick Start for Contributors:

  1. Read docs/rdr/README.md for RDR-based development workflow
  2. Create an RDR for complex features before implementation
  3. Follow CLI-first architecture pattern
  4. Add tests for new functionality
  5. Update this README with implementation status

License

MIT - See LICENSE file for details

Acknowledgments
