ReasonKit Core - Knowledge Base Architecture

The Reasoning Engine — Auditable AI Reasoning Architecture

VERSION: 1.0.0 DATE: 2025-12-11

1. ARCHITECTURE OVERVIEW

┌─────────────────────────────────────────────────────────────────────────────┐
│                        REASONKIT-CORE ARCHITECTURE                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ LAYER 5: RETRIEVAL & QUERY                                          │   │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                 │   │
│  │  │ HYBRID SEARCH│ │ RAPTOR       │ │ RERANKING    │                 │   │
│  │  │ (BM25+Vector)│ │ (Tree Query) │ │ (ColBERT)    │                 │   │
│  │  └──────────────┘ └──────────────┘ └──────────────┘                 │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                    ▲                                        │
│  ┌─────────────────────────────────┴───────────────────────────────────┐   │
│  │ LAYER 4: INDEXING                                                    │   │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                 │   │
│  │  │ HNSW INDEX   │ │ BM25 INDEX   │ │ RAPTOR TREE  │                 │   │
│  │  │ (hnswlib-rs) │ │ (tantivy)    │ │ (hierarchical)│                │   │
│  │  └──────────────┘ └──────────────┘ └──────────────┘                 │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                    ▲                                        │
│  ┌─────────────────────────────────┴───────────────────────────────────┐   │
│  │ LAYER 3: EMBEDDING                                                   │   │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                 │   │
│  │  │ DENSE EMBED  │ │ SPARSE EMBED │ │ COLBERT      │                 │   │
│  │  │ (E5/BGE)     │ │ (SPLADE)     │ │ (Late Inter.)│                 │   │
│  │  └──────────────┘ └──────────────┘ └──────────────┘                 │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                    ▲                                        │
│  ┌─────────────────────────────────┴───────────────────────────────────┐   │
│  │ LAYER 2: PROCESSING                                                  │   │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                 │   │
│  │  │ CHUNKING     │ │ CLEANING     │ │ METADATA     │                 │   │
│  │  │ (semantic)   │ │ (normalize)  │ │ (extraction) │                 │   │
│  │  └──────────────┘ └──────────────┘ └──────────────┘                 │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                    ▲                                        │
│  ┌─────────────────────────────────┴───────────────────────────────────┐   │
│  │ LAYER 1: INGESTION                                                   │   │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌────────────┐  │   │
│  │  │ PDF          │ │ HTML/MD      │ │ JSON/JSONL   │ │ GITHUB     │  │   │
│  │  │ (pdf_oxide)  │ │ (pulldown)   │ │ (serde)      │ │ (octocrab) │  │   │
│  │  └──────────────┘ └──────────────┘ └──────────────┘ └────────────┘  │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ STORAGE: Qdrant (Primary) | DuckDB (Metadata) | JSONL (Raw)         │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

2. TECHNOLOGY DECISIONS

2.1 Vector Database: Qdrant (PRIMARY)

Rationale:

Written in Rust (aligns with Rust-First philosophy)
Highest QPS (1,200+) and lowest latency (1.6ms) in benchmarks
Hybrid search support (dense + sparse vectors)
Excellent filtering capabilities
24x compression with asymmetric quantization

Alternatives Considered:

Option	Verdict	Reason
Milvus	DEFER	Better for billion-scale, overkill for our needs
LanceDB	KEEP AS OPTION	Good for edge/embedded use cases
ChromaDB	REJECT	Python-first, slower

2.2 Full-Text Search: Tantivy

Rationale:

Rust-native full-text search engine
BM25 implementation for hybrid search
10x+ faster than Lucene in some benchmarks
Apache 2.0 license

2.3 Embedding Strategy: Hybrid

dense_embedding:
  model: "BAAI/bge-m3" # or "intfloat/e5-large-v2"
  dimensions: 1024
  multilingual: true
  use_case: "Semantic similarity"

sparse_embedding:
  model: "naver/splade-v3"
  use_case: "Keyword/exact match"

late_interaction:
  model: "jina-colbert-v2"
  use_case: "High-precision reranking"
  dimensions: "128 per token"

2.4 Document Processing: Rust Libraries

Format	Library	Notes
PDF	`pdf_oxide`	47.9x faster than alternatives
HTML	`scraper` + `html5ever`	Rust-native
Markdown	`pulldown-cmark`	CommonMark compliant
JSON/JSONL	`serde_json`	Standard
EPUB	`epub-rs`	For documentation

2.5 RAPTOR Implementation

Based on RAPTOR paper (ICLR 2024):

RAPTOR Tree Structure:
                    [ROOT SUMMARY]
                    /            \
         [CLUSTER A]              [CLUSTER B]
         /    |    \              /    |    \
      [C1]  [C2]  [C3]         [C4]  [C5]  [C6]
      /|\   /|\   /|\          /|\   /|\   /|\
    chunks...                chunks...

Benefits:

+20% absolute accuracy on QuALITY benchmark
Captures both fine-grained and high-level understanding
State-of-the-art on multi-hop reasoning tasks

3. FILE FORMAT STRATEGY

3.1 JSON as Primary Format (Confirmed)

Rationale:

Rust has excellent JSON support (serde_json)
Human-readable for debugging
Schema validation with JSON Schema
Easy to process incrementally (JSONL)
Widely supported by all tools

3.2 Data Schemas

3.2.1 Document Schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["id", "type", "content", "metadata"],
  "properties": {
    "id": {
      "type": "string",
      "description": "Unique document identifier (UUID)"
    },
    "type": {
      "type": "string",
      "enum": ["paper", "documentation", "code", "note"]
    },
    "source": {
      "type": "object",
      "properties": {
        "url": { "type": "string" },
        "path": { "type": "string" },
        "retrieved_at": { "type": "string", "format": "date-time" }
      }
    },
    "content": {
      "type": "object",
      "properties": {
        "raw": { "type": "string" },
        "chunks": {
          "type": "array",
          "items": { "$ref": "#/definitions/chunk" }
        }
      }
    },
    "metadata": {
      "type": "object",
      "properties": {
        "title": { "type": "string" },
        "authors": { "type": "array", "items": { "type": "string" } },
        "date": { "type": "string" },
        "tags": { "type": "array", "items": { "type": "string" } },
        "citations": { "type": "integer" },
        "venue": { "type": "string" }
      }
    }
  },
  "definitions": {
    "chunk": {
      "type": "object",
      "properties": {
        "id": { "type": "string" },
        "text": { "type": "string" },
        "start_char": { "type": "integer" },
        "end_char": { "type": "integer" },
        "embedding_id": { "type": "string" }
      }
    }
  }
}

3.2.2 Embedding Schema

{
  "type": "object",
  "required": ["id", "chunk_id", "vector", "model"],
  "properties": {
    "id": { "type": "string" },
    "chunk_id": { "type": "string" },
    "document_id": { "type": "string" },
    "vector": {
      "type": "array",
      "items": { "type": "number" }
    },
    "model": { "type": "string" },
    "dimensions": { "type": "integer" },
    "created_at": { "type": "string", "format": "date-time" }
  }
}

3.3 Storage Layout

data/
├── papers/
│   ├── raw/              # Original PDFs
│   │   └── arxiv_2401.18059.pdf
│   └── processed/        # Extracted JSON
│       └── arxiv_2401.18059.json
├── docs/
│   ├── raw/              # Original HTML/MD
│   │   └── claude-code/
│   └── processed/        # Extracted JSON
│       └── claude-code.jsonl
├── embeddings/
│   ├── dense/            # Dense vector embeddings
│   │   └── bge-m3/
│   └── sparse/           # Sparse embeddings
│       └── splade/
├── indexes/
│   ├── hnsw/             # HNSW index files
│   ├── bm25/             # Tantivy index
│   └── raptor/           # RAPTOR tree structure
└── metadata/
    └── catalog.json      # Master document catalog

4. ACADEMIC PAPERS TO DOWNLOAD

4.1 Core Reasoning Papers

Paper	arXiv
Chain-of-Thought Prompting (Wei et al.)	2201.11903
Self-Consistency (Wang et al.)	2203.11171
Tree of Thoughts (Yao et al.)	2305.10601
RAPTOR (Sarthi et al.)	2401.18059
Let's Verify Step by Step (OpenAI)	-
Reflexion (Shinn et al.)	2303.11366
Constitutional AI (Anthropic)	2212.08073

4.2 Retrieval & Embedding Papers

Paper	arXiv
ColBERT (Khattab & Zaharia)	2004.12832
E5 Embeddings	2212.03533
BGE-M3	2402.03216
Semantic Entropy (Nature)	-

4.3 Benchmark Papers

Paper	arXiv
GSM8K	2110.14168
MATH Benchmark	2103.03874
MMLU	2009.03300

5. REFERENCE DOCUMENTATION

5.1 CLI Tools

Tool	Source
Claude Code	https://github.com/anthropics/claude-code
Gemini CLI	https://github.com/google-gemini/gemini-cli
OpenAI Codex	https://github.com/openai/codex
MCP Servers	https://github.com/modelcontextprotocol/servers
Sequential Thinking	modelcontextprotocol/servers/src/sequentialthinking

5.2 APIs & SDKs

API	Source
Anthropic API	https://docs.anthropic.com
OpenAI API	https://platform.openai.com/docs
Google AI	https://ai.google.dev/docs
OpenRouter	https://openrouter.ai/docs

5.3 Frameworks

Framework	Source
LangChain	https://python.langchain.com/docs
LlamaIndex	https://docs.llamaindex.ai
DSPy	https://dspy-docs.vercel.app

6. IMPLEMENTATION PLAN

Phase 1: Foundation (Week 1)

□ Set up Cargo workspace
□ Implement PDF ingestion (pdf_oxide)
□ Implement JSON serialization (serde)
□ Create document schema validation
□ Download first batch of papers

Phase 2: Processing (Week 2)

□ Implement semantic chunking
□ Create metadata extraction
□ Set up Tantivy for BM25
□ Implement basic retrieval

Phase 3: Embedding (Week 3)

□ Integrate embedding model (ONNX or API)
□ Set up Qdrant (local mode)
□ Implement HNSW indexing
□ Create hybrid search

Phase 4: RAPTOR (Week 4)

□ Implement clustering (GMM)
□ Create summarization pipeline
□ Build hierarchical tree
□ Implement tree-based retrieval

7. Rust DEPENDENCIES (Cargo.toml)

[package]
name = "reasonkit-core"
version = "1.0.0"
edition = "2024"

[dependencies]
# Serialization
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"

# PDF Processing
pdf_oxide = "0.1"  # or lopdf for control

# Text Processing
pulldown-cmark = "0.9"  # Markdown
scraper = "0.17"        # HTML
regex = "1.10"

# Vector DB
qdrant-client = "1.8"

# Full-text Search
tantivy = "0.21"

# HNSW Index
hnswlib-rs = "0.3"

# Async Runtime
tokio = { version = "1", features = ["full"] }

# HTTP Client
reqwest = { version = "0.11", features = ["json"] }

# CLI
clap = { version = "4", features = ["derive"] }

# Error Handling
anyhow = "1.0"
thiserror = "1.0"

# Logging
tracing = "0.1"
tracing-subscriber = "0.3"

# UUID
uuid = { version = "1", features = ["v4", "serde"] }

# Date/Time
chrono = { version = "0.4", features = ["serde"] }

[dev-dependencies]
criterion = "0.5"

8. BENCHMARKING TARGETS

Operation	Target	Notes
PDF extraction	<100ms per page	pdf_oxide claims 53ms/PDF
Chunking	<10ms per 1000 tokens
Embedding (API)	<500ms per chunk	Network bound
HNSW search	<10ms top-100	hnswlib benchmark
BM25 search	<5ms	Tantivy benchmark
Hybrid search	<20ms total	Combined

9. OPEN QUESTIONS

Embedding model hosting: Local ONNX or API?
- API: Simpler, more accurate (GPT/Claude embeddings)
- Local: Faster, no cost, works offline
- DECISION NEEDED
Qdrant deployment: Embedded or server mode?
- Embedded: Simpler, single binary
- Server: More scalable, dashboard
- RECOMMENDATION: Start embedded, upgrade if needed
RAPTOR summarization: Which LLM?
- GPT-4o: High quality, cost
- Claude Haiku: Good quality, lower cost
- Local (Llama): Free, lower quality
- RECOMMENDATION: Claude Haiku for balance

END OF DOCUMENT

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ReasonKit Core - Knowledge Base Architecture

The Reasoning Engine — Auditable AI Reasoning Architecture

1. ARCHITECTURE OVERVIEW

2. TECHNOLOGY DECISIONS

2.1 Vector Database: Qdrant (PRIMARY)

2.2 Full-Text Search: Tantivy

2.3 Embedding Strategy: Hybrid

2.4 Document Processing: Rust Libraries

2.5 RAPTOR Implementation

3. FILE FORMAT STRATEGY

3.1 JSON as Primary Format (Confirmed)

3.2 Data Schemas

3.2.1 Document Schema

3.2.2 Embedding Schema

3.3 Storage Layout

4. ACADEMIC PAPERS TO DOWNLOAD

4.1 Core Reasoning Papers

4.2 Retrieval & Embedding Papers

4.3 Benchmark Papers

5. REFERENCE DOCUMENTATION

5.1 CLI Tools

5.2 APIs & SDKs

5.3 Frameworks

6. IMPLEMENTATION PLAN

Phase 1: Foundation (Week 1)

Phase 2: Processing (Week 2)

Phase 3: Embedding (Week 3)

Phase 4: RAPTOR (Week 4)

7. Rust DEPENDENCIES (Cargo.toml)

8. BENCHMARKING TARGETS

9. OPEN QUESTIONS

FilesExpand file tree

ARCHITECTURE.md

Latest commit

History

ARCHITECTURE.md

File metadata and controls

ReasonKit Core - Knowledge Base Architecture

The Reasoning Engine — Auditable AI Reasoning Architecture

1. ARCHITECTURE OVERVIEW

2. TECHNOLOGY DECISIONS

2.1 Vector Database: Qdrant (PRIMARY)

2.2 Full-Text Search: Tantivy

2.3 Embedding Strategy: Hybrid

2.4 Document Processing: Rust Libraries

2.5 RAPTOR Implementation

3. FILE FORMAT STRATEGY

3.1 JSON as Primary Format (Confirmed)

3.2 Data Schemas

3.2.1 Document Schema

3.2.2 Embedding Schema

3.3 Storage Layout

4. ACADEMIC PAPERS TO DOWNLOAD

4.1 Core Reasoning Papers

4.2 Retrieval & Embedding Papers

4.3 Benchmark Papers

5. REFERENCE DOCUMENTATION

5.1 CLI Tools

5.2 APIs & SDKs

5.3 Frameworks

6. IMPLEMENTATION PLAN

Phase 1: Foundation (Week 1)

Phase 2: Processing (Week 2)

Phase 3: Embedding (Week 3)

Phase 4: RAPTOR (Week 4)

7. Rust DEPENDENCIES (Cargo.toml)

8. BENCHMARKING TARGETS

9. OPEN QUESTIONS