The storage system uses SQLite with a normalized schema design for efficient document storage, retrieval, and version management. The schema supports page-level metadata tracking, ETag-based change detection, hierarchical document chunking, and embedding model identity tracking.
The database consists of four core tables with normalized relationships, plus a metadata table for system-level configuration tracking:
erDiagram
libraries ||--o{ versions : has
versions ||--o{ pages : contains
pages ||--o{ documents : has_chunks
libraries {
int id PK
text name UK
datetime created_at
datetime updated_at
}
versions {
int id PK
int library_id FK
text name
text status
int progress_pages
int progress_max_pages
text error_message
datetime started_at
datetime created_at
datetime updated_at
text source_url
json scraper_options
}
pages {
int id PK
int version_id FK
text url UK
text title
text etag
text last_modified
text source_content_type
text content_type
int depth
datetime created_at
datetime updated_at
}
documents {
int id PK
int page_id FK
text content
json metadata
int sort_order
blob embedding
datetime created_at
}
metadata {
text key PK
text value
}
The libraries table holds core library metadata and organization.
Schema:
- id (INTEGER PRIMARY KEY): Auto-increment identifier
- name (TEXT UNIQUE): Library name (case-insensitive)
- created_at (DATETIME): Creation timestamp
- updated_at (DATETIME): Last update timestamp
Purpose: Library name normalization and metadata storage.
Code Reference: src/store/types.ts - Type definitions used throughout DocumentManagementService
The versions table tracks each library version along with its indexing status and scraper configuration.
Schema:
- id (INTEGER PRIMARY KEY): Auto-increment identifier
- library_id (INTEGER FK): References libraries(id)
- name (TEXT): Version name (NULL for unversioned content)
- status (TEXT): Version indexing status (not_indexed, queued, running, completed, failed, cancelled, updating)
- progress_pages (INTEGER): Current page count during indexing
- progress_max_pages (INTEGER): Maximum pages to index
- error_message (TEXT): Error details if indexing fails
- started_at (DATETIME): When the indexing job started
- created_at (DATETIME): Creation timestamp
- updated_at (DATETIME): Last update timestamp
- source_url (TEXT): Original scraping URL
- scraper_options (JSON): Stored scraper configuration for reproducibility
Purpose: Job state management, progress tracking, and scraper configuration persistence.
Code Reference: src/store/types.ts lines 184-201 (DbVersion interface)
The pages table holds page-level metadata for each unique URL within a version.
Schema:
- id (INTEGER PRIMARY KEY): Auto-increment identifier
- version_id (INTEGER FK): References versions(id)
- url (TEXT): Page URL (unique per version)
- title (TEXT): Page title extracted from content
- etag (TEXT): HTTP ETag for change detection
- last_modified (TEXT): HTTP Last-Modified header
- source_content_type (TEXT): Original MIME type of the fetched resource before processing
- content_type (TEXT): MIME type of the stored chunk content after processing
- depth (INTEGER): Crawl depth from source URL (0 = root page)
- created_at (DATETIME): Creation timestamp
- updated_at (DATETIME): Last update timestamp
Purpose: Page-level metadata tracking, original-versus-processed content type tracking, ETag-based refresh support, and depth tracking for scoping.
Code Reference:
- db/migrations/009-add-pages-table.sql - Initial pages table creation
- db/migrations/010-add-depth-to-pages.sql - Depth column addition
- src/store/types.ts lines 9-20 (DbPage interface)
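The ETag-based refresh support can be illustrated with a minimal sketch. It is not the project's implementation: `StoredPage`, `buildConditionalHeaders`, and `classifyRefresh` are hypothetical names, and the mapping from HTTP status to action is an assumption based on standard conditional-request semantics.

```typescript
// Hypothetical subset of a pages row; field names follow the schema above.
interface StoredPage {
  url: string;
  etag: string | null;
  last_modified: string | null;
}

type RefreshAction = "unchanged" | "reindex" | "delete";

// Build conditional request headers from the validators stored for a page.
function buildConditionalHeaders(page: StoredPage): Record<string, string> {
  const headers: Record<string, string> = {};
  if (page.etag) headers["If-None-Match"] = page.etag;
  if (page.last_modified) headers["If-Modified-Since"] = page.last_modified;
  return headers;
}

// Decide what to do with a page based on the status of the conditional fetch.
function classifyRefresh(status: number): RefreshAction {
  if (status === 304) return "unchanged"; // validators matched, keep existing chunks
  if (status === 404 || status === 410) return "delete"; // page no longer exists
  if (status >= 200 && status < 300) return "reindex"; // new content was returned
  throw new Error(`Unexpected HTTP status during refresh: ${status}`);
}
```

A 304 response lets a refresh skip re-chunking and re-embedding for that page, which is the main saving this kind of check provides.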
The documents table stores document chunks with embeddings and hierarchical metadata.
Schema:
- id (INTEGER PRIMARY KEY): Auto-increment identifier
- page_id (INTEGER FK): References pages(id)
- content (TEXT): Chunk content text
- metadata (JSON): Chunk-specific metadata (level, path, types)
- sort_order (INTEGER): Ordering within page
- embedding (BLOB): Vector embedding as binary data
- created_at (DATETIME): Creation timestamp
Purpose: Content storage with vector embeddings, hierarchical metadata, and search optimization.
Code Reference: src/store/types.ts lines 39-48 (DbChunk interface)
The metadata table is a key-value store for system-level configuration tracking, independent of library/version data.
Schema:
- key (TEXT PRIMARY KEY): Configuration key name
- value (TEXT NOT NULL): Configuration value
Purpose: Tracks the active embedding model identity (embedding_model and embedding_dimension keys) to detect incompatible configuration changes between server restarts. When a model or dimension change is detected, the server prompts the user to confirm vector invalidation before proceeding.
Code Reference: db/migrations/013-create-metadata-table.sql, src/store/DocumentStore.ts - getEmbeddingMetadata(), setEmbeddingMetadata(), checkEmbeddingModelChange()
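The model-change check can be sketched as a pure comparison between the stored metadata rows and the active configuration. This is an assumption-laden illustration, not the body of checkEmbeddingModelChange(): `EmbeddingIdentity`, `ModelChange`, and `detectModelChange` are hypothetical names, and only the two key names come from the description above.

```typescript
// Hypothetical description of the embedding model currently configured.
interface EmbeddingIdentity {
  model: string;
  dimension: number;
}

type ModelChange =
  | { kind: "first-run" }
  | { kind: "unchanged" }
  | { kind: "changed"; reasons: string[] };

// Compare rows read from the metadata table (key -> value) with the active model.
function detectModelChange(
  stored: Record<string, string | undefined>,
  active: EmbeddingIdentity,
): ModelChange {
  const storedModel = stored["embedding_model"];
  const storedDimension = stored["embedding_dimension"];
  // Nothing recorded yet: nothing to invalidate, just record the identity.
  if (storedModel === undefined || storedDimension === undefined) {
    return { kind: "first-run" };
  }
  const reasons: string[] = [];
  if (storedModel !== active.model) {
    reasons.push(`model: ${storedModel} -> ${active.model}`);
  }
  if (Number(storedDimension) !== active.dimension) {
    reasons.push(`dimension: ${storedDimension} -> ${active.dimension}`);
  }
  return reasons.length === 0 ? { kind: "unchanged" } : { kind: "changed", reasons };
}
```

Returning the reasons separately makes it straightforward to show the user what changed before asking for confirmation of vector invalidation.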
Sequential SQL migrations in db/migrations/:
- 000-initial-schema.sql - Base schema with documents, FTS, and vector tables
- 001-add-indexed-at-column.sql - Indexing timestamp tracking
- 002-normalize-library-table.sql - Library and version normalization
- 003-normalize-vector-table.sql - Vector storage optimization
- 004-complete-normalization.sql - Remove redundant columns, finalize schema
- 005-add-status-tracking.sql - Job status and progress tracking
- 006-add-scraper-options.sql - Configuration persistence for reproducibility
- 007-dedupe-unversioned-versions.sql - Enforce unique unversioned content
- 008-case-insensitive-names.sql - Case-insensitive library name handling
- 009-add-pages-table.sql - Page-level metadata normalization
- 010-add-depth-to-pages.sql - Crawl depth tracking for refresh operations
- 011-add-vector-triggers.sql - FTS and vector table trigger maintenance
- 012-add-source-content-type.sql - Source content type tracking on pages
- 013-create-metadata-table.sql - Key-value metadata table for embedding model tracking
Code Reference: All migration files in db/migrations/ directory
Automatic migration execution on startup:
- Check current schema version against available migrations
- Apply pending migrations sequentially
- Validate schema integrity after each migration
- Handle migration failures with detailed error messages
- Trigger-based FTS index maintenance
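The first two steps of the startup sequence (compare, then apply in order) can be sketched as a pure function. `pendingMigrations` is a hypothetical helper, and how applied migrations are recorded is an assumption; only the numbered file naming scheme comes from the migration list above.

```typescript
// Return the migration files that still need to run, in numeric order.
// `available` is a directory listing; `applied` holds names already recorded as applied.
function pendingMigrations(available: string[], applied: string[]): string[] {
  return available
    .filter((name) => /^\d{3}-.+\.sql$/.test(name)) // e.g. 009-add-pages-table.sql
    .filter((name) => applied.indexOf(name) === -1)
    .sort((a, b) => Number(a.slice(0, 3)) - Number(b.slice(0, 3)));
}
```

Sorting on the numeric prefix rather than relying on directory order keeps the sequence deterministic across file systems.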
Database location determined by priority:
- Project-local .store directory (development)
- OS-specific application data directory (production)
- Temporary directory as fallback
On macOS: ~/Library/Application Support/docs-mcp-server/
Code Reference: src/utils/paths.ts - resolveStorePath() function
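The priority order can be expressed as a simple first-match selection. The sketch below is not resolveStorePath() itself, which may consider inputs not described here; `StorePathCandidates` and `pickStorePath` are hypothetical, and the caller is assumed to have already checked which directories exist.

```typescript
// Hypothetical inputs: each candidate is null when it is missing or unusable.
interface StorePathCandidates {
  projectLocalDir: string | null; // a .store directory in the project, if present
  appDataDir: string | null; // OS-specific application data directory
  tempDir: string; // always-available fallback
}

// Pick the first usable location in priority order.
function pickStorePath(candidates: StorePathCandidates): string {
  return candidates.projectLocalDir ?? candidates.appDataDir ?? candidates.tempDir;
}
```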
DocumentManagementService handles document lifecycle operations with normalized schema access.
Core Operations:
- Document chunk insertion via pages table
- Version management and cleanup
- Library organization with case-insensitive handling
- Page-level metadata management
- Duplicate detection using unique constraints
Version Resolution:
- Exact version name matching
- Semantic version range queries
- Latest version fallback logic
- Unversioned content handling (NULL version name)
Code Reference: src/store/DocumentManagementService.ts
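The version resolution order can be sketched as follows. This is a simplified illustration under stated assumptions: version names are purely numeric dotted strings, only "N.x"-style ranges are handled (not full semantic version range syntax), and an explicit request that matches nothing yields `undefined` instead of falling back. `resolveVersion` and `compareVersions` are hypothetical names.

```typescript
type VersionName = string | null; // null represents unversioned content

// Compare dotted numeric versions part by part ("18.10.1" > "18.2.0").
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Resolve a request: exact match, then "N.x" range, then latest, then unversioned.
// Returns undefined when nothing suitable is stored.
function resolveVersion(available: VersionName[], requested?: string): VersionName | undefined {
  const named = available.filter((v): v is string => v !== null).sort(compareVersions);
  if (requested !== undefined && requested !== "latest") {
    if (named.indexOf(requested) !== -1) return requested; // exact match
    const prefix = requested.replace(/\.x$/, "") + "."; // "18.x" -> "18."
    const inRange = named.filter((v) => v.indexOf(prefix) === 0);
    return inRange.length > 0 ? inRange[inRange.length - 1] : undefined;
  }
  if (named.length > 0) return named[named.length - 1]; // latest named version
  return available.indexOf(null) !== -1 ? null : undefined; // unversioned fallback
}
```

Keeping `null` distinct from `undefined` preserves the difference between "resolved to unversioned content" and "nothing matched".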
Storing documents for a library version proceeds in the following steps:
- Create or resolve library record (case-insensitive name)
- Create version record with job configuration
- Create page records for each unique URL
- Process and store document chunks linked to pages
- Generate and store embeddings as binary BLOB
- Update version status and progress
Embeddings stored as BLOB in documents table:
- 1536-dimensional vectors by default (configurable via embeddings.vectorDimension)
- Provider-agnostic binary serialization
- NULL handling for documents without embeddings
- Direct storage eliminates need for separate vector table
Code Reference: src/store/types.ts line 4 (EMBEDDINGS_VECTOR_DIMENSION constant)
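To make the BLOB storage concrete, here is a minimal round-trip sketch. The byte layout (raw little-endian 32-bit floats) is an assumption chosen for illustration, since the exact serialization is not specified here, and `embeddingToBlob` and `blobToEmbedding` are hypothetical helpers.

```typescript
// Serialize a vector as raw little-endian float32 bytes (assumed layout).
function embeddingToBlob(vector: number[]): Uint8Array {
  const view = new DataView(new ArrayBuffer(vector.length * 4));
  vector.forEach((value, i) => view.setFloat32(i * 4, value, true));
  return new Uint8Array(view.buffer);
}

// Deserialize a BLOB back into a vector; NULL embeddings pass through as null.
function blobToEmbedding(blob: Uint8Array | null): number[] | null {
  if (blob === null) return null; // document stored without an embedding
  if (blob.byteLength % 4 !== 0) {
    throw new Error(`Embedding BLOB length ${blob.byteLength} is not a multiple of 4`);
  }
  const view = new DataView(blob.buffer, blob.byteOffset, blob.byteLength);
  const vector: number[] = [];
  for (let offset = 0; offset < blob.byteLength; offset += 4) {
    vector.push(view.getFloat32(offset, true));
  }
  return vector;
}
```

With this layout a 1536-dimensional vector occupies 6144 bytes per chunk, and values lose precision beyond what a 32-bit float can represent.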
The embedding factory centralizes embedding generation across multiple providers.
Supported Providers:
- OpenAI: text-embedding-3-small (default), text-embedding-3-large, custom endpoints
- Google: Gemini embedding models, Vertex AI with service account auth
- Azure: Azure OpenAI service with custom deployments
- AWS: Bedrock embedding models with IAM authentication
Code Reference: src/store/embeddings/EmbeddingFactory.ts
DocumentRetrieverService handles search and retrieval operations with hybrid ranking.
Search Methods:
- Vector similarity search using sqlite-vec extension
- Full-text search using FTS5 virtual table
- Hybrid search with Reciprocal Rank Fusion (RRF)
- Context-aware result assembly
Search Architecture:
- Query embeddings generated for vector search
- FTS5 query for keyword matching
- Results combined using RRF algorithm
- Chunks assembled with hierarchical context
- Results ranked by combined score
Code Reference: src/store/DocumentRetrieverService.ts
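Reciprocal Rank Fusion combines ranked lists without needing their raw scores to be comparable: each chunk receives 1 / (k + rank) from every list it appears in, and the contributions are summed. The sketch below assumes ranks start at 1 and uses k = 60, a conventional default; the constant and any weighting used by the retriever are not stated here, and `reciprocalRankFusion` is a hypothetical name.

```typescript
interface FusedResult {
  id: number; // chunk id
  score: number; // summed reciprocal-rank contributions
}

// Fuse several ranked lists of chunk ids (best first) into one ranking.
function reciprocalRankFusion(rankings: number[][], k = 60): FusedResult[] {
  const scores: Record<number, number> = {};
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      // index is zero-based, so the rank is index + 1
      scores[id] = (scores[id] ?? 0) + 1 / (k + index + 1);
    });
  }
  return Object.keys(scores)
    .map((key) => ({ id: Number(key), score: scores[Number(key)] ?? 0 }))
    .sort((a, b) => b.score - a.score);
}
```

For example, fusing a vector ranking `[1, 2, 3]` with a full-text ranking `[3, 1, 4]` places chunk 1 first, because it ranks near the top of both lists, followed by chunks 3, 2, and 4.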
Full-text search using SQLite FTS5:
- Porter stemmer for English-language text
- unicode61 tokenizer for international support
- Trigger-based index maintenance (automatic updates)
- External content mode (FTS references documents table)
Indexed Fields: content, title, url, path (from metadata)
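Free-form user input should not reach an FTS5 MATCH clause unmodified, because characters such as `-`, `:`, and `*` carry meaning in the FTS5 query syntax. One common approach, sketched below with a hypothetical `toFtsQuery` helper, is to quote each term as an FTS5 string and double any embedded quotes. Whether the retriever builds its queries this way is an assumption.

```typescript
// Convert user input into an FTS5 query in which every whitespace-separated
// term is a quoted string. Adjacent quoted terms are combined with implicit AND.
function toFtsQuery(input: string): string {
  return input
    .split(/\s+/)
    .filter((term) => term.length > 0)
    .map((term) => `"${term.replace(/"/g, '""')}"`) // FTS5 escapes " as ""
    .join(" ");
}
```

An empty or whitespace-only input produces an empty string, so the caller should skip the full-text branch in that case instead of issuing the query.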
Immediate persistence of state changes via database transactions:
- Job status updates to versions table
- Progress tracking during indexing
- Configuration changes with full audit trail
- Error information for debugging
Database transactions ensure consistency:
- Atomic page and document insertions
- Version state transitions with validation
- Batch operations for performance
- Automatic rollback on errors
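Validating a version state transition before writing it can be sketched as a lookup in a transition table. The status names come from the versions table; the set of allowed transitions below is an assumption for illustration and may differ from the rules the services actually enforce. `assertTransition` is a hypothetical helper.

```typescript
type VersionStatus =
  | "not_indexed"
  | "queued"
  | "running"
  | "completed"
  | "failed"
  | "cancelled"
  | "updating";

// Assumed transition table: which statuses may follow each status.
const ALLOWED_TRANSITIONS: Record<VersionStatus, VersionStatus[]> = {
  not_indexed: ["queued"],
  queued: ["running", "cancelled"],
  running: ["completed", "failed", "cancelled"],
  completed: ["queued", "updating"],
  failed: ["queued"],
  cancelled: ["queued"],
  updating: ["completed", "failed", "cancelled"],
};

// Throw before the write so the surrounding transaction can roll back.
function assertTransition(from: VersionStatus, to: VersionStatus): void {
  if (ALLOWED_TRANSITIONS[from].indexOf(to) === -1) {
    throw new Error(`Invalid version status transition: ${from} -> ${to}`);
  }
}
```

Throwing inside the transaction callback is what allows the automatic rollback described above to discard any partial writes.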
Safe concurrent database access:
- better-sqlite3 with its synchronous API
- Transaction-based locking
- Read operations don't block each other
- Write operations serialize automatically
Code Reference: src/store/ - All service classes use transaction blocks
Database indexes optimize query performance:
- Primary keys on all tables (automatic)
- Foreign key indexes for join performance
- FTS5 indexes for text search
- Composite indexes for common query patterns (library_id + status)
Code Reference: Index creation statements in migration files
Efficient query patterns throughout the codebase:
- Prepared statements for repeated queries
- Batch operations for bulk inserts
- JOIN queries to minimize round trips
- Query result pagination for large result sets
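Batching for bulk inserts can be as simple as slicing the input into fixed-size groups and writing each group inside one transaction. The helper below is a hypothetical sketch; the batch size that works well depends on chunk size and is not specified here.

```typescript
// Split items into consecutive batches of at most batchSize elements.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1 || Math.floor(batchSize) !== batchSize) {
    throw new RangeError(`batchSize must be a positive integer, got ${batchSize}`);
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    batches.push(items.slice(start, start + batchSize));
  }
  return batches;
}
```

Each batch can then be passed to a single prepared insert statement executed repeatedly within one transaction, which avoids paying the commit cost per row.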
Space-efficient data storage:
- Binary embedding storage (BLOB format)
- JSON metadata for flexible chunk properties
- Normalized schema eliminates redundant data
- SQLite VACUUM operations for space reclamation
Export functionality through DocumentManagementService:
- Complete database export via SQLite backup API
- Library-specific export using filtered queries
- Version-specific export for portability
- Metadata preservation in JSON format
Import from external sources:
- Database restoration from backups
- Configuration-based re-indexing
- Duplicate detection during import (unique constraints)
- Automatic migration application
Recovery mechanisms:
- Database integrity checks on startup
- Transaction log for crash recovery (SQLite WAL mode)
- Schema validation after migration
- Automatic repair for corrupted indexes
Health monitoring capabilities:
- Storage space utilization tracking
- Query performance metrics via logging
- Connection status monitoring
- Error rate tracking in version records
Regular maintenance tasks:
- VACUUM operations for space recovery
- Index rebuilding via REINDEX
- Orphaned record cleanup via foreign key constraints
- Performance analysis using EXPLAIN QUERY PLAN
Debugging and diagnostic capabilities:
- Query execution analysis
- Storage space breakdown by table
- Relationship integrity checks via PRAGMA foreign_key_check
- Performance bottleneck identification