VERSION: 1.0.0 DATE: 2025-12-11
┌─────────────────────────────────────────────────────────────────────────┐
│                       REASONKIT-CORE ARCHITECTURE                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                    LAYER 5: RETRIEVAL & QUERY                     │  │
│  │    ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     │  │
│  │    │ HYBRID SEARCH│     │    RAPTOR    │     │  RERANKING   │     │  │
│  │    │ (BM25+Vector)│     │ (Tree Query) │     │  (ColBERT)   │     │  │
│  │    └──────────────┘     └──────────────┘     └──────────────┘     │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                    ▲                                    │
│  ┌─────────────────────────────────┴─────────────────────────────────┐  │
│  │                         LAYER 4: INDEXING                         │  │
│  │    ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     │  │
│  │    │  HNSW INDEX  │     │  BM25 INDEX  │     │ RAPTOR TREE  │     │  │
│  │    │ (hnswlib-rs) │     │  (tantivy)   │     │(hierarchical)│     │  │
│  │    └──────────────┘     └──────────────┘     └──────────────┘     │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                    ▲                                    │
│  ┌─────────────────────────────────┴─────────────────────────────────┐  │
│  │                        LAYER 3: EMBEDDING                         │  │
│  │    ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     │  │
│  │    │ DENSE EMBED  │     │ SPARSE EMBED │     │   COLBERT    │     │  │
│  │    │   (E5/BGE)   │     │   (SPLADE)   │     │ (Late Inter.)│     │  │
│  │    └──────────────┘     └──────────────┘     └──────────────┘     │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                    ▲                                    │
│  ┌─────────────────────────────────┴─────────────────────────────────┐  │
│  │                        LAYER 2: PROCESSING                        │  │
│  │    ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     │  │
│  │    │   CHUNKING   │     │   CLEANING   │     │   METADATA   │     │  │
│  │    │  (semantic)  │     │ (normalize)  │     │ (extraction) │     │  │
│  │    └──────────────┘     └──────────────┘     └──────────────┘     │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                    ▲                                    │
│  ┌─────────────────────────────────┴─────────────────────────────────┐  │
│  │                        LAYER 1: INGESTION                         │  │
│  │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │  │
│  │ │     PDF      │ │   HTML/MD    │ │  JSON/JSONL  │ │   GITHUB   │ │  │
│  │ │ (pdf_oxide)  │ │  (pulldown)  │ │   (serde)    │ │ (octocrab) │ │  │
│  │ └──────────────┘ └──────────────┘ └──────────────┘ └────────────┘ │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │    STORAGE: Qdrant (Primary) | DuckDB (Metadata) | JSONL (Raw)    │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
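The layered flow in the diagram maps naturally onto a chain of stage functions. The sketch below wires stub versions of Layers 1-3 end to end; every function name is illustrative, not the actual reasonkit-core API, and the embedding stage is a placeholder rather than a real model call.

```rust
// Illustrative stage chain for Layers 1-3; a real pipeline would
// call pdf_oxide, the semantic chunker, and an embedding model.
fn ingest(raw: &str) -> String {
    // Layer 1: ingestion (identity stub)
    raw.to_string()
}

fn process(doc: &str) -> Vec<String> {
    // Layer 2: processing (line-based chunking stub)
    doc.lines().map(str::to_string).collect()
}

fn embed(chunks: &[String]) -> Vec<Vec<f32>> {
    // Layer 3: embedding (length-based placeholder vector)
    chunks.iter().map(|c| vec![c.len() as f32]).collect()
}

fn main() {
    let doc = ingest("RAPTOR builds trees.\nColBERT reranks.");
    let chunks = process(&doc);
    let vectors = embed(&chunks);
    // One vector per chunk, as Layer 4 indexing expects.
    assert_eq!(chunks.len(), vectors.len());
    println!("{} chunks embedded", vectors.len());
}
```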
Rationale (Qdrant as primary vector store):
- Written in Rust (aligns with Rust-First philosophy)
- Highest QPS (1,200+) and lowest latency (1.6ms) in benchmarks
- Hybrid search support (dense + sparse vectors)
- Excellent filtering capabilities
- 24x compression with asymmetric quantization
Alternatives Considered:
| Option | Verdict | Reason |
|---|---|---|
| Milvus | DEFER | Better for billion-scale, overkill for our needs |
| LanceDB | KEEP AS OPTION | Good for edge/embedded use cases |
| ChromaDB | REJECT | Python-first, slower |
Rationale (Tantivy for full-text search):
- Rust-native full-text search engine
- BM25 implementation for hybrid search
- 10x+ faster than Lucene in some benchmarks
- Apache 2.0 license
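Hybrid search needs a fusion step to merge the BM25 and dense-vector result lists into one ranking. A common choice is Reciprocal Rank Fusion; the sketch below uses only the standard library and k = 60 (the constant from the original RRF paper), and is not the actual reasonkit-core API.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over any number of ranked id lists.
fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (rank, id) in ranking.iter().enumerate() {
            // rank is 0-based; RRF scores use 1-based ranks.
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + rank as f64 + 1.0);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    let bm25 = vec!["doc_a", "doc_b", "doc_c"]; // lexical ranking
    let dense = vec!["doc_a", "doc_c", "doc_d"]; // vector ranking
    let fused = rrf_fuse(&[bm25, dense], 60.0);
    // doc_a ranks first in both lists, so it must fuse to the top.
    assert_eq!(fused[0].0, "doc_a");
    println!("top: {}", fused[0].0);
}
```

RRF is attractive here because it needs no score normalization between the BM25 and cosine-similarity scales; only ranks matter.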
dense_embedding:
  model: "BAAI/bge-m3"            # or "intfloat/e5-large-v2"
  dimensions: 1024
  multilingual: true
  use_case: "Semantic similarity"
sparse_embedding:
  model: "naver/splade-v3"
  use_case: "Keyword/exact match"
late_interaction:
  model: "jina-colbert-v2"
  use_case: "High-precision reranking"
  dimensions: "128 per token"

| Format | Library | Notes |
|---|---|---|
| PDF | pdf_oxide | 47.9x faster than alternatives |
| HTML | scraper + html5ever | Rust-native |
| Markdown | pulldown-cmark | CommonMark compliant |
| JSON/JSONL | serde_json | Standard |
| EPUB | epub-rs | For documentation |
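Layer 1 can route each input file to the right parser by extension. A minimal dispatch sketch (the `Format` enum and `detect_format` are illustrative names; the actual parser calls from the table are stubbed out):

```rust
/// Ingestion formats from the table above.
#[derive(Debug, PartialEq)]
enum Format {
    Pdf,
    Html,
    Markdown,
    Jsonl,
    Epub,
}

/// Pick the parser family for a path by file extension.
fn detect_format(path: &str) -> Option<Format> {
    let ext = path.rsplit('.').next()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some(Format::Pdf),
        "html" | "htm" => Some(Format::Html),
        "md" | "markdown" => Some(Format::Markdown),
        "json" | "jsonl" => Some(Format::Jsonl),
        "epub" => Some(Format::Epub),
        _ => None,
    }
}

fn main() {
    assert_eq!(detect_format("arxiv_2401.18059.pdf"), Some(Format::Pdf));
    assert_eq!(detect_format("claude-code.jsonl"), Some(Format::Jsonl));
    assert_eq!(detect_format("README"), None);
}
```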
Based on RAPTOR paper (ICLR 2024):
RAPTOR Tree Structure:

                  [ROOT SUMMARY]
                 /              \
        [CLUSTER A]          [CLUSTER B]
        /    |    \          /    |    \
     [C1]  [C2]  [C3]     [C4]  [C5]  [C6]
     /|\   /|\   /|\      /|\   /|\   /|\
    chunks...            chunks...
Benefits:
- +20% absolute accuracy on QuALITY benchmark
- Captures both fine-grained and high-level understanding
- State-of-the-art on multi-hop reasoning tasks
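At query time, the RAPTOR paper's best-reported strategy is the collapsed tree: score leaf chunks and summary nodes together in one flat pool rather than descending level by level. A minimal cosine-similarity sketch, assuming precomputed node embeddings (`Node` and `collapsed_tree_search` are illustrative names, not the real crate API):

```rust
/// A tree node flattened into the retrieval pool.
struct Node {
    text: String,
    level: u8, // 0 = leaf chunk, 1+ = cluster/root summaries
    embedding: Vec<f32>,
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Collapsed-tree retrieval: rank all nodes, any level, take top-k.
fn collapsed_tree_search<'a>(nodes: &'a [Node], query: &[f32], k: usize) -> Vec<&'a Node> {
    let mut scored: Vec<(&Node, f32)> = nodes
        .iter()
        .map(|n| (n, cosine(&n.embedding, query)))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().take(k).map(|(n, _)| n).collect()
}

fn main() {
    let nodes = vec![
        Node { text: "leaf chunk".into(), level: 0, embedding: vec![1.0, 0.0] },
        Node { text: "cluster summary".into(), level: 1, embedding: vec![0.0, 1.0] },
    ];
    // Query vector sits near the leaf chunk, so it wins.
    let hits = collapsed_tree_search(&nodes, &[0.9, 0.1], 1);
    assert_eq!(hits[0].text, "leaf chunk");
}
```

Because summaries compete with leaves in the same pool, broad questions naturally surface high-level nodes while detail questions surface chunks.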
Rationale (JSON/JSONL as canonical storage format):
- Rust has excellent JSON support (serde_json)
- Human-readable for debugging
- Schema validation with JSON Schema
- Easy to process incrementally (JSONL)
- Widely supported by all tools
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"required": ["id", "type", "content", "metadata"],
"properties": {
"id": {
"type": "string",
"description": "Unique document identifier (UUID)"
},
"type": {
"type": "string",
"enum": ["paper", "documentation", "code", "note"]
},
"source": {
"type": "object",
"properties": {
"url": { "type": "string" },
"path": { "type": "string" },
"retrieved_at": { "type": "string", "format": "date-time" }
}
},
"content": {
"type": "object",
"properties": {
"raw": { "type": "string" },
"chunks": {
"type": "array",
"items": { "$ref": "#/definitions/chunk" }
}
}
},
"metadata": {
"type": "object",
"properties": {
"title": { "type": "string" },
"authors": { "type": "array", "items": { "type": "string" } },
"date": { "type": "string" },
"tags": { "type": "array", "items": { "type": "string" } },
"citations": { "type": "integer" },
"venue": { "type": "string" }
}
}
},
"definitions": {
"chunk": {
"type": "object",
"properties": {
"id": { "type": "string" },
"text": { "type": "string" },
"start_char": { "type": "integer" },
"end_char": { "type": "integer" },
"embedding_id": { "type": "string" }
}
}
}
}

{
"type": "object",
"required": ["id", "chunk_id", "vector", "model"],
"properties": {
"id": { "type": "string" },
"chunk_id": { "type": "string" },
"document_id": { "type": "string" },
"vector": {
"type": "array",
"items": { "type": "number" }
},
"model": { "type": "string" },
"dimensions": { "type": "integer" },
"created_at": { "type": "string", "format": "date-time" }
}
}

data/
├── papers/
│ ├── raw/ # Original PDFs
│ │ └── arxiv_2401.18059.pdf
│ └── processed/ # Extracted JSON
│ └── arxiv_2401.18059.json
├── docs/
│ ├── raw/ # Original HTML/MD
│ │ └── claude-code/
│ └── processed/ # Extracted JSON
│ └── claude-code.jsonl
├── embeddings/
│ ├── dense/ # Dense vector embeddings
│ │ └── bge-m3/
│ └── sparse/ # Sparse embeddings
│ └── splade/
├── indexes/
│ ├── hnsw/ # HNSW index files
│ ├── bm25/ # Tantivy index
│ └── raptor/ # RAPTOR tree structure
└── metadata/
└── catalog.json # Master document catalog
| Paper | arXiv |
|---|---|
| Chain-of-Thought Prompting (Wei et al.) | 2201.11903 |
| Self-Consistency (Wang et al.) | 2203.11171 |
| Tree of Thoughts (Yao et al.) | 2305.10601 |
| RAPTOR (Sarthi et al.) | 2401.18059 |
| Let's Verify Step by Step (OpenAI) | - |
| Reflexion (Shinn et al.) | 2303.11366 |
| Constitutional AI (Anthropic) | 2212.08073 |
| Paper | arXiv |
|---|---|
| ColBERT (Khattab & Zaharia) | 2004.12832 |
| E5 Embeddings | 2212.03533 |
| BGE-M3 | 2402.03216 |
| Semantic Entropy (Nature) | - |
| Paper | arXiv |
|---|---|
| GSM8K | 2110.14168 |
| MATH Benchmark | 2103.03874 |
| MMLU | 2009.03300 |
| Tool | Source |
|---|---|
| Claude Code | https://github.com/anthropics/claude-code |
| Gemini CLI | https://github.com/google-gemini/gemini-cli |
| OpenAI Codex | https://github.com/openai/codex |
| MCP Servers | https://github.com/modelcontextprotocol/servers |
| Sequential Thinking | modelcontextprotocol/servers/src/sequentialthinking |
| API | Source |
|---|---|
| Anthropic API | https://docs.anthropic.com |
| OpenAI API | https://platform.openai.com/docs |
| Google AI | https://ai.google.dev/docs |
| OpenRouter | https://openrouter.ai/docs |
| Framework | Source |
|---|---|
| LangChain | https://python.langchain.com/docs |
| LlamaIndex | https://docs.llamaindex.ai |
| DSPy | https://dspy-docs.vercel.app |
□ Set up Cargo workspace
□ Implement PDF ingestion (pdf_oxide)
□ Implement JSON serialization (serde)
□ Create document schema validation
□ Download first batch of papers
□ Implement semantic chunking
□ Create metadata extraction
□ Set up Tantivy for BM25
□ Implement basic retrieval
□ Integrate embedding model (ONNX or API)
□ Set up Qdrant (local mode)
□ Implement HNSW indexing
□ Create hybrid search
□ Implement clustering (GMM)
□ Create summarization pipeline
□ Build hierarchical tree
□ Implement tree-based retrieval
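The "Implement semantic chunking" item above can be prototyped as a greedy sentence-merge pass. True semantic chunking would also split where embedding similarity between adjacent sentences drops; this sketch only enforces a character budget (`chunk_text` is an illustrative name):

```rust
/// Size-bounded sentence-merge chunker: accumulate sentences until
/// adding the next one would exceed `max_chars`, then start a new
/// chunk. Sentence boundaries are naive (., !, ?).
fn chunk_text(text: &str, max_chars: usize) -> Vec<String> {
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();
    for sentence in text.split_inclusive(['.', '!', '?']) {
        if !current.is_empty() && current.len() + sentence.len() > max_chars {
            chunks.push(current.trim().to_string());
            current.clear();
        }
        current.push_str(sentence);
    }
    if !current.trim().is_empty() {
        chunks.push(current.trim().to_string());
    }
    chunks
}

fn main() {
    let chunks = chunk_text("One. Two. Three.", 8);
    assert_eq!(chunks, vec!["One.", "Two.", "Three."]);
}
```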
[package]
name = "reasonkit-core"
version = "1.0.0"
edition = "2024"
[dependencies]
# Serialization
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
# PDF Processing
pdf_oxide = "0.1" # or lopdf for lower-level control
# Text Processing
pulldown-cmark = "0.9" # Markdown
scraper = "0.17" # HTML
regex = "1.10"
# Vector DB
qdrant-client = "1.8"
# Full-text Search
tantivy = "0.21"
# HNSW Index
hnsw_rs = "0.3" # crates.io name for the hnswlib-rs repo
# Async Runtime
tokio = { version = "1", features = ["full"] }
# HTTP Client
reqwest = { version = "0.11", features = ["json"] }
# CLI
clap = { version = "4", features = ["derive"] }
# Error Handling
anyhow = "1.0"
thiserror = "1.0"
# Logging
tracing = "0.1"
tracing-subscriber = "0.3"
# UUID
uuid = { version = "1", features = ["v4", "serde"] }
# Date/Time
chrono = { version = "0.4", features = ["serde"] }
[dev-dependencies]
criterion = "0.5"

| Operation | Target | Notes |
|---|---|---|
| PDF extraction | <100ms per page | pdf_oxide claims 53ms/PDF |
| Chunking | <10ms per 1000 tokens | |
| Embedding (API) | <500ms per chunk | Network bound |
| HNSW search | <10ms top-100 | hnswlib benchmark |
| BM25 search | <5ms | Tantivy benchmark |
| Hybrid search | <20ms total | Combined |
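Before reaching for criterion, these per-stage targets can be spot-checked with a tiny std::time harness (illustrative helper, not a substitute for real benchmarks):

```rust
use std::time::Instant;

/// Run `f` `iters` times and report the mean latency in ms,
/// for quick comparison against the stage targets above.
fn time_stage<F: FnMut()>(label: &str, iters: u32, mut f: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    let ms = start.elapsed().as_secs_f64() * 1000.0 / iters as f64;
    println!("{label}: {ms:.3} ms/iter");
    ms
}

fn main() {
    // Stand-in workload: a chunk-sized byte scan.
    let text = "x".repeat(10_000);
    let ms = time_stage("scan", 100, || {
        let _ = text.bytes().filter(|&b| b == b'x').count();
    });
    assert!(ms >= 0.0);
}
```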
- Embedding model hosting: Local ONNX or API?
  - API: Simpler; strong hosted models (e.g., OpenAI or Voyage; note Anthropic does not offer an embeddings API)
  - Local: Faster, no per-call cost, works offline
  - DECISION NEEDED
- Qdrant deployment: Embedded or server mode?
  - Embedded: Simpler, single binary
  - Server: More scalable, includes a dashboard
  - RECOMMENDATION: Start embedded, upgrade if needed
- RAPTOR summarization: Which LLM?
  - GPT-4o: High quality, higher cost
  - Claude Haiku: Good quality, lower cost
  - Local (Llama): Free, lower quality
  - RECOMMENDATION: Claude Haiku for balance
END OF DOCUMENT