Data Storage

Overview

The storage system uses SQLite with a normalized schema design for efficient document storage, retrieval, and version management. The schema supports page-level metadata tracking, ETag-based change detection, hierarchical document chunking, and embedding model identity tracking.

Database Schema

The database consists of four core tables with normalized relationships, plus a metadata table for system-level configuration tracking:

erDiagram
    libraries ||--o{ versions : has
    versions ||--o{ pages : contains
    pages ||--o{ documents : has_chunks
    
    libraries {
        int id PK
        text name UK
        datetime created_at
        datetime updated_at
    }
    
    versions {
        int id PK
        int library_id FK
        text name
        text status
        int progress_pages
        int progress_max_pages
        text error_message
        datetime started_at
        datetime created_at
        datetime updated_at
        text source_url
        json scraper_options
    }
    
    pages {
        int id PK
        int version_id FK
        text url UK
        text title
        text etag
        text last_modified
        text content_type
        int depth
        datetime created_at
        datetime updated_at
    }
    
    documents {
        int id PK
        int page_id FK
        text content
        json metadata
        int sort_order
        blob embedding
        datetime created_at
    }

    metadata {
        text key PK
        text value
    }

Libraries Table

Core library metadata and organization.

Schema:

id (INTEGER PRIMARY KEY): Auto-increment identifier
name (TEXT UNIQUE): Library name (case-insensitive)
created_at (DATETIME): Creation timestamp
updated_at (DATETIME): Last update timestamp

Purpose: Library name normalization and metadata storage.

Code Reference: src/store/types.ts - Type definitions used throughout DocumentManagementService

Versions Table

Version tracking with comprehensive status and configuration.

Schema:

id (INTEGER PRIMARY KEY): Auto-increment identifier
library_id (INTEGER FK): References libraries(id)
name (TEXT): Version name (NULL for unversioned content)
status (TEXT): Version indexing status (not_indexed, queued, running, completed, failed, cancelled, updating)
progress_pages (INTEGER): Current page count during indexing
progress_max_pages (INTEGER): Maximum pages to index
error_message (TEXT): Error details if indexing fails
started_at (DATETIME): When indexing job started
created_at (DATETIME): Creation timestamp
updated_at (DATETIME): Last update timestamp
source_url (TEXT): Original scraping URL
scraper_options (JSON): Stored scraper configuration for reproducibility

Purpose: Job state management, progress tracking, and scraper configuration persistence.

Code Reference: src/store/types.ts lines 184-201 (DbVersion interface)

Pages Table

Page-level metadata for each unique URL within a version.

Schema:

id (INTEGER PRIMARY KEY): Auto-increment identifier
version_id (INTEGER FK): References versions(id)
url (TEXT): Page URL (unique per version)
title (TEXT): Page title extracted from content
etag (TEXT): HTTP ETag for change detection
last_modified (TEXT): HTTP Last-Modified header
source_content_type (TEXT): Original MIME type of the fetched resource before processing
content_type (TEXT): MIME type of the stored chunk content after processing
depth (INTEGER): Crawl depth from source URL (0 = root page)
created_at (DATETIME): Creation timestamp
updated_at (DATETIME): Last update timestamp

Purpose: Page-level metadata tracking, original-versus-processed content type tracking, ETag-based refresh support, and depth tracking for scoping.

Code Reference:

db/migrations/009-add-pages-table.sql - Initial pages table creation
db/migrations/010-add-depth-to-pages.sql - Depth column addition
src/store/types.ts lines 9-20 (DbPage interface)

Documents Table

Document chunks with embeddings and hierarchical metadata.

Schema:

id (INTEGER PRIMARY KEY): Auto-increment identifier
page_id (INTEGER FK): References pages(id)
content (TEXT): Chunk content text
metadata (JSON): Chunk-specific metadata (level, path, types)
sort_order (INTEGER): Ordering within page
embedding (BLOB): Vector embedding as binary data
created_at (DATETIME): Creation timestamp

Purpose: Content storage with vector embeddings, hierarchical metadata, and search optimization.

Code Reference: src/store/types.ts lines 39-48 (DbChunk interface)

Metadata Table

Key-value store for system-level configuration tracking, independent of library/version data.

Schema:

key (TEXT PRIMARY KEY): Configuration key name
value (TEXT NOT NULL): Configuration value

Purpose: Tracks the active embedding model identity (embedding_model and embedding_dimension keys) to detect incompatible configuration changes between server restarts. When a model or dimension change is detected, the server prompts the user to confirm vector invalidation before proceeding.

Code Reference: db/migrations/013-create-metadata-table.sql, src/store/DocumentStore.ts - getEmbeddingMetadata(), setEmbeddingMetadata(), checkEmbeddingModelChange()

Schema Evolution

Migration System

Sequential SQL migrations in db/migrations/:

000-initial-schema.sql - Base schema with documents, FTS, and vector tables
001-add-indexed-at-column.sql - Indexing timestamp tracking
002-normalize-library-table.sql - Library and version normalization
003-normalize-vector-table.sql - Vector storage optimization
004-complete-normalization.sql - Remove redundant columns, finalize schema
005-add-status-tracking.sql - Job status and progress tracking
006-add-scraper-options.sql - Configuration persistence for reproducibility
007-dedupe-unversioned-versions.sql - Enforce unique unversioned content
008-case-insensitive-names.sql - Case-insensitive library name handling
009-add-pages-table.sql - Page-level metadata normalization
010-add-depth-to-pages.sql - Crawl depth tracking for refresh operations
011-add-vector-triggers.sql - FTS and vector table trigger maintenance
012-add-source-content-type.sql - Source content type tracking on pages
013-create-metadata-table.sql - Key-value metadata table for embedding model tracking

Code Reference: All migration files in db/migrations/ directory

Migration Application

Automatic migration execution on startup:

Check current schema version against available migrations
Apply pending migrations sequentially
Validate schema integrity after each migration
Handle migration failures with detailed error messages
Trigger-based FTS index maintenance

Data Location

Storage Directory Resolution

Database location determined by priority:

Project-local .store directory (development)
OS-specific application data directory (production)
Temporary directory as fallback

Default Storage Location

On macOS: ~/Library/Application Support/docs-mcp-server/

Code Reference: src/utils/paths.ts - resolveStorePath() function

Document Management

DocumentManagementService

Handles document lifecycle operations with normalized schema access.

Core Operations:

Document chunk insertion via pages table
Version management and cleanup
Library organization with case-insensitive handling
Page-level metadata management
Duplicate detection using unique constraints

Version Resolution:

Exact version name matching
Semantic version range queries
Latest version fallback logic
Unversioned content handling (NULL version name)

Code Reference: src/store/DocumentManagementService.ts

Document Storage Flow

Create or resolve library record (case-insensitive name)
Create version record with job configuration
Create page records for each unique URL
Process and store document chunks linked to pages
Generate and store embeddings as binary BLOB
Update version status and progress

Embedding Management

Vector Storage

Embeddings stored as BLOB in documents table:

1536-dimensional vectors by default (configurable via embeddings.vectorDimension)
Provider-agnostic binary serialization
NULL handling for documents without embeddings
Direct storage eliminates need for separate vector table

Code Reference: src/store/types.ts line 4 (EMBEDDINGS_VECTOR_DIMENSION constant)

EmbeddingFactory

Centralized embedding generation supporting multiple providers.

Supported Providers:

OpenAI: text-embedding-3-small (default), text-embedding-3-large, custom endpoints
Google: Gemini embedding models, Vertex AI with service account auth
Azure: Azure OpenAI service with custom deployments
AWS: Bedrock embedding models with IAM authentication

Code Reference: src/store/embeddings/EmbeddingFactory.ts

Search Implementation

DocumentRetrieverService

Handles search and retrieval operations with hybrid ranking.

Search Methods:

Vector similarity search using sqlite-vec extension
Full-text search using FTS5 virtual table
Hybrid search with Reciprocal Rank Fusion (RRF)
Context-aware result assembly

Search Architecture:

Query embeddings generated for vector search
FTS5 query for keyword matching
Results combined using RRF algorithm
Chunks assembled with hierarchical context
Results ranked by combined score

Code Reference: src/store/DocumentRetrieverService.ts

FTS Implementation

Full-text search using SQLite FTS5:

Porter stemmer for English language
Unicode61 tokenizer for international support
Trigger-based index maintenance (automatic updates)
External content mode (FTS references documents table)

Indexed Fields: content, title, url, path (from metadata)

Data Consistency

Write-Through Architecture

Immediate persistence of state changes via database transactions:

Job status updates to versions table
Progress tracking during indexing
Configuration changes with full audit trail
Error information for debugging

Transaction Management

Database transactions ensure consistency:

Atomic page and document insertions
Version state transitions with validation
Batch operations for performance
Automatic rollback on errors

Concurrent Access

Safe concurrent database access:

Better-sqlite3 with synchronous API
Transaction-based locking
Read operations don't block each other
Write operations serialize automatically

Code Reference: src/store/ - All service classes use transaction blocks

Performance Considerations

Index Strategy

Database indexes optimize query performance:

Primary keys on all tables (automatic)
Foreign key indexes for join performance
FTS5 indexes for text search
Composite indexes for common query patterns (library_id + status)

Code Reference: Index creation statements in migration files

Query Optimization

Efficient query patterns throughout the codebase:

Prepared statements for repeated queries
Batch operations for bulk inserts
JOIN queries to minimize round trips
Query result pagination for large result sets

Storage Efficiency

Space-efficient data storage:

Binary embedding storage (BLOB format)
JSON metadata for flexible chunk properties
Normalized schema eliminates redundant data
SQLite VACUUM operations for space reclamation

Backup and Recovery

Data Export

Export functionality through DocumentManagementService:

Complete database export via SQLite backup API
Library-specific export using filtered queries
Version-specific export for portability
Metadata preservation in JSON format

Data Import

Import from external sources:

Database restoration from backups
Configuration-based re-indexing
Duplicate detection during import (unique constraints)
Automatic migration application

Disaster Recovery

Recovery mechanisms:

Database integrity checks on startup
Transaction log for crash recovery (SQLite WAL mode)
Schema validation after migration
Automatic repair for corrupted indexes

Monitoring and Maintenance

Database Health

Health monitoring capabilities:

Storage space utilization tracking
Query performance metrics via logging
Connection status monitoring
Error rate tracking in version records

Maintenance Operations

Regular maintenance tasks:

VACUUM operations for space recovery
Index rebuilding via REINDEX
Orphaned record cleanup via foreign key constraints
Performance analysis using EXPLAIN QUERY PLAN

Diagnostics

Debugging and diagnostic capabilities:

Query execution analysis
Storage space breakdown by table
Relationship integrity checks via PRAGMA foreign_key_check
Performance bottleneck identification

FilesExpand file tree

data-storage.md

Latest commit

History

data-storage.md

File metadata and controls

Data Storage

Overview

Database Schema

Libraries Table

Versions Table

Pages Table

Documents Table

Metadata Table

Schema Evolution

Migration System

Migration Application

Data Location

Storage Directory Resolution

Default Storage Location

Document Management

DocumentManagementService

Document Storage Flow

Embedding Management

Vector Storage

EmbeddingFactory

Search Implementation

DocumentRetrieverService

FTS Implementation

Data Consistency

Write-Through Architecture

Transaction Management

Concurrent Access

Performance Considerations

Index Strategy

Query Optimization

Storage Efficiency

Backup and Recovery

Data Export

Data Import

Disaster Recovery

Monitoring and Maintenance

Database Health

Maintenance Operations

Diagnostics