⚡ Instantly index, deduplicate, and search your code, docs, and web content in a blazing-fast Qdrant vector DB for AI & RAG.

BjornMelin/vector-vault

VectorVault

Python Docker License Qdrant FastEmbed Firecrawl

🚀 Empower AI developers with VectorVault – the lightning-fast CLI tool that transforms your scattered files, code snippets, and web content into a smart, searchable vector database powered by Qdrant. Eliminate duplicates, secure your data, and unlock instant insights with hybrid search and AI-powered processing. Seamlessly integrate with your AI agents and RAG systems to build intelligent applications in minutes!

Key Capabilities

📋 Transform scattered knowledge into a searchable vector database with these features:

| Capability | Description | Emoji |
| --- | --- | --- |
| Multi-format Ingestion | Process various file types with type-specific chunking | 📁 |
| Web Scraping | Free and premium options for URL and site crawling | 🌐 |
| Deduplication | Intelligent similarity detection with auto-delete | 🔍 |
| Security Scanning | Detect and redact sensitive information | 🔒 |
| Optimization | Auto-optimize collections for performance | |

Overview Diagram

graph TB
    A[CLI Commands] --> B[Content Processors]
    B --> C[Type-Specific Chunking]
    C --> D[Hybrid Embedding]
    D --> E[Qdrant Storage]
    E --> F[Deduplication & Security]
    F --> G[Optimized Knowledge Base]
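
The "Content Processors" and "Type-Specific Chunking" stages in the diagram can be pictured as a dispatch on file extension. The sketch below is illustrative only and is not taken from VectorVault's source: the function names (`chunk_text`, `chunk_code`, `chunk_file_content`), the extension map, the blank-line splitting rule for code, and the choice to measure chunk size in characters are all assumptions.

```python
from pathlib import Path


def chunk_text(text: str, size: int = 2048) -> list[str]:
    """Split plain text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_code(text: str, size: int = 2048) -> list[str]:
    """Split source code on blank lines, packing blocks up to `size` characters."""
    chunks: list[str] = []
    current = ""
    for block in text.split("\n\n"):
        candidate = f"{current}\n\n{block}" if current else block
        if len(candidate) <= size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(block) > size:
            # A single oversized block falls back to fixed-size splitting.
            chunks.extend(chunk_text(block, size))
            current = ""
        else:
            current = block
    if current:
        chunks.append(current)
    return chunks


# Hypothetical extension-to-chunker map; the tool's real supported types may differ.
CHUNKERS = {".py": chunk_code, ".js": chunk_code}


def chunk_file_content(filename: str, text: str, size: int = 2048) -> list[str]:
    """Pick a chunker from the file extension, defaulting to plain-text chunking."""
    chunker = CHUNKERS.get(Path(filename).suffix.lower(), chunk_text)
    return chunker(text, size)
```

Splitting code on blank lines keeps small functions and statements together; a parser-based approach (the Tech Stack lists Tree-sitter) would respect syntax boundaries more precisely.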

Setup Instructions

  1. Prerequisites:

    • Python 3.13 installed.
    • Docker for running Qdrant.
    • Run Qdrant Docker container: docker run -d -p 7000:6333 qdrant/qdrant:latest
  2. Project Setup:

    • Install uv if it is not already installed: pip install uv
    • Create the project: uv init vectorvault --python 3.13
    • Change into the project directory: cd vectorvault
    • Add the dependencies listed in pyproject.toml.
    • Copy .env.example to .env and fill in values (e.g., FIRECRAWL_API_KEY)
  3. Directories:

    • logs/: For logs (created automatically if needed).
    • data/: For backups (created automatically).

Configuration Guide

Edit .env based on .env.example:

  • QDRANT_HOST and PORT: Host and port of the local Qdrant Docker instance.
  • FIRECRAWL_API_KEY: Required for premium web features.
  • Processing params: Chunk sizes (default 2048) and the similarity threshold.
  • Embedding params: Dense, sparse, and reranker models; hybrid and rerank toggles; GPU usage.
  • Optimization: Quantization type (scalar by default), on-disk storage, optimization threshold, and HNSW parameters.
  • Chunking: Toggles for code (parser-based) and document (section-based) chunking.
  • Code: Model used for code embeddings.
  • Precision/Async: Toggles for FP16 precision and async processing.
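
As a rough illustration of how such settings might be read, the sketch below parses `.env`-style text into a typed settings object using only the standard library. It is not the project's loader (the Tech Stack lists Pydantic for configuration). The variable names `QDRANT_PORT`, `CHUNK_SIZE`, and `SIMILARITY_THRESHOLD`, and the 0.9 threshold default, are assumptions; the authoritative names and defaults are in .env.example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    qdrant_host: str = "localhost"
    qdrant_port: int = 7000             # host port from the docker run command
    chunk_size: int = 2048              # default stated in this guide
    similarity_threshold: float = 0.9   # assumed value, not from the project


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(env_text: str) -> Settings:
    """Build Settings from .env text, falling back to defaults for missing keys."""
    env = parse_env(env_text)
    d = Settings()
    return Settings(
        qdrant_host=env.get("QDRANT_HOST", d.qdrant_host),
        qdrant_port=int(env.get("QDRANT_PORT", d.qdrant_port)),
        chunk_size=int(env.get("CHUNK_SIZE", d.chunk_size)),
        similarity_threshold=float(env.get("SIMILARITY_THRESHOLD", d.similarity_threshold)),
    )
```

With a real file this could be called as `load_settings(Path(".env").read_text())`. Converting values with `int` and `float` makes a malformed entry fail at startup rather than during ingestion.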

Usage Examples

# Add a single file
python main.py add-file /path/to/file.txt

# Add a directory
python main.py add-dir /path/to/directory

# Add a URL (free scraping)
python main.py add-url https://example.com

# Add a URL (Firecrawl premium)
python main.py add-url https://example.com --firecrawl

# Crawl an entire site
python main.py crawl-site https://example.com

# Extract structured data from URL
python main.py extract-url https://example.com

# Find duplicates (auto-delete with flag)
python main.py find-dupes --auto-dedup

# Scan for secrets (auto-redact with flag)
python main.py scan-secrets --auto-redact

# Optimize collection
python main.py optimize
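
To make the idea behind find-dupes and its similarity threshold concrete, here is a minimal in-memory sketch of threshold-based duplicate detection over embedding vectors. It is an assumption about the general technique, not VectorVault's code: the names `cosine_similarity` and `find_duplicates` and the 0.95 default are hypothetical, and the real tool may query Qdrant for neighbours instead of comparing vectors pairwise.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicates(
    vectors: dict[str, list[float]], threshold: float = 0.95
) -> list[tuple[str, str]]:
    """Return (kept_id, duplicate_id) pairs.

    Items are visited in insertion order; each one is compared against the
    items kept so far and flagged if any similarity reaches the threshold.
    """
    kept: list[str] = []
    duplicates: list[tuple[str, str]] = []
    for item_id, vec in vectors.items():
        match = next(
            (k for k in kept if cosine_similarity(vectors[k], vec) >= threshold),
            None,
        )
        if match is None:
            kept.append(item_id)
        else:
            duplicates.append((match, item_id))
    return duplicates
```

In this sketch the first occurrence is always kept, so an auto-delete step would remove only the second element of each pair.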

Troubleshooting

  • Qdrant Connection Issues: Ensure the Qdrant Docker container is running and reachable on host port 7000 (the port mapped by the docker run command in Setup Instructions). Check the logs for errors.
  • Firecrawl Errors: Verify the FIRECRAWL_API_KEY value in .env. If problems persist, fall back to free scraping.
  • Encoding Problems: The tool detects file encodings automatically; if detection fails, check the file's encoding.
  • Performance: For large datasets, increase limits or tune chunk sizes in .env. Enable GPU for faster embeddings.
  • No Content Added: Check that the file type is supported or that the URL is accessible.
  • Embeddings: If sparse or hybrid embedding is slow, toggle it off in .env. Models run locally, so embeddings incur no API cost.
  • Auto Modes: Use --auto-dedup and --auto-redact for automation; the logs show the actions taken.
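
For the connection check in the first item, a quick way to confirm that something is listening on the mapped port is a plain TCP probe. The helper below is a hypothetical sketch using only the standard library; the name `qdrant_port_open` is not part of VectorVault, and the defaults assume the 7000:6333 mapping from the setup step.

```python
import socket


def qdrant_port_open(host: str = "localhost", port: int = 7000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False
```

A successful probe only shows that the port accepts connections, not that the listener is Qdrant, so the container logs remain the place to look for errors.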

MVP Validation Steps

  1. Start Qdrant Docker.
  2. Create test.txt with some content, then run python main.py add-file test.txt.
  3. Verify the result in Qdrant (use the admin UI or a query).
  4. Run add-url, crawl-site, and extract-url against https://example.com.
  5. Run find-dupes --auto-dedup; check the logs for deletions.
  6. Run scan-secrets --auto-redact; check the logs and backups.
  7. Run optimize; check the logs.
  8. Test type-specific chunking: add .py, .pdf, and .json files, then check the resulting chunks in the logs.
  9. Time add-dir with 1000 files to validate throughput.
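
The throughput check in the last step can be scripted. The sketch below generates a synthetic corpus and times the add-dir command shown under Usage Examples; the helper names (`make_test_files`, `time_add_dir`) and the file contents are assumptions, and `time_add_dir` expects to be run from the project directory with Qdrant already started.

```python
import subprocess
import sys
import time
from pathlib import Path


def make_test_files(directory: Path, count: int = 1000) -> list[Path]:
    """Create `count` small, distinct text files under `directory`."""
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        path = directory / f"doc_{i:04d}.txt"
        path.write_text(f"Test document {i}\n" + "lorem ipsum " * 50, encoding="utf-8")
        paths.append(path)
    return paths


def time_add_dir(directory: Path) -> float:
    """Run `python main.py add-dir <directory>` and return the elapsed seconds."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "main.py", "add-dir", str(directory)], check=True)
    return time.perf_counter() - start
```

Dividing the number of generated files by the value returned from `time_add_dir` gives files per second. Each file starts with a unique line so that the corpus is not made of byte-identical documents.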

Tech Stack

Up-to-date with 2025 technologies:

  • Python: 3.13 for async and performance features
  • Qdrant: 1.14+ for vector storage with hybrid search
  • FastEmbed: For zero-cost local embeddings with GPU support
  • Firecrawl: For premium web scraping
  • Pydantic: 2.11+ for type-safe configuration
  • Other: HTTPX, BeautifulSoup, Tree-sitter, etc.

For full dependencies, see pyproject.toml.

How to Reference

If you use VectorVault in your work, please cite it as follows:

BibTeX

@software{VectorVault2024,
  author = {Your Name or Organization},
  title = {VectorVault: Lightweight CLI for Qdrant-based Knowledge Management},
  year = {2024},
  url = {https://github.com/yourusername/vectorvault},
  version = {1.0.0},
  note = {Supports file/web ingestion, deduplication, security scanning}
}

For more project details, see docs/prd.md.
