VectorVault

🚀 VectorVault is a CLI tool for AI developers that turns scattered files, code snippets, and web content into a searchable vector database backed by Qdrant. It detects and removes duplicates, scans for and redacts sensitive information, and supports hybrid search, so the resulting knowledge base can plug directly into AI agents and RAG systems.

Key Capabilities

📋 Transform scattered knowledge into a searchable vector database with these features:

Capability	Description	Emoji
Multi-format Ingestion	Process various file types with type-specific chunking	📁
Web Scraping	Free and premium options for URL and site crawling	🌐
Deduplication	Intelligent similarity detection with auto-delete	🔍
Security Scanning	Detect and redact sensitive information	🔒
Optimization	Auto-optimize collections for performance	⚡
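
The deduplication capability is described as similarity detection. As a rough illustration only (not VectorVault's actual implementation), similarity-based duplicate detection can be sketched as a cosine-similarity comparison against a threshold. The function names and the 0.95 threshold are assumptions made for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def find_duplicate_pairs(
    vectors: dict[str, list[float]], threshold: float = 0.95
) -> list[tuple[str, str]]:
    """Return ID pairs whose similarity meets the (hypothetical) threshold."""
    ids = sorted(vectors)
    pairs = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            if cosine_similarity(vectors[first], vectors[second]) >= threshold:
                pairs.append((first, second))
    return pairs
```

The pairwise loop is quadratic in the number of vectors; a vector database would normally replace it with nearest-neighbour queries.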

Overview Diagram

graph TB
    A[CLI Commands] --> B[Content Processors]
    B --> C[Type-Specific Chunking]
    C --> D[Hybrid Embedding]
    D --> E[Qdrant Storage]
    E --> F[Deduplication & Security]
    F --> G[Optimized Knowledge Base]
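
The "Type-Specific Chunking" stage in the diagram can be pictured as a dispatch on file extension. The sketch below is an illustration under assumptions: the chunker functions, the extension mapping, and the splitting rules are hypothetical, and only the 2048 default chunk size comes from the configuration guide.

```python
from pathlib import Path

DEFAULT_CHUNK_SIZE = 2048  # default chunk size named in the configuration guide


def chunk_fixed(text: str, size: int = DEFAULT_CHUNK_SIZE) -> list[str]:
    """Fallback: split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_python(text: str, size: int = DEFAULT_CHUNK_SIZE) -> list[str]:
    """Hypothetical code chunker: start a new chunk at each top-level def/class."""
    chunks, current = [], []
    for line in text.splitlines(keepends=True):
        if line.startswith(("def ", "class ")) and current:
            chunks.append("".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))
    # Oversized blocks are split again so that no chunk exceeds `size`.
    return [piece for chunk in chunks for piece in chunk_fixed(chunk, size)]


# Assumed mapping; every other extension falls back to chunk_fixed.
CHUNKERS = {".py": chunk_python}


def chunk_file_text(path: str, text: str, size: int = DEFAULT_CHUNK_SIZE) -> list[str]:
    """Pick a chunker by file extension and apply it to the file's text."""
    chunker = CHUNKERS.get(Path(path).suffix.lower(), chunk_fixed)
    return chunker(text, size)
```

Line-prefix matching is a simplification (decorators, for instance, end up in the preceding chunk); a parser-based approach such as Tree-sitter, which the tech stack lists, would be more robust.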

Setup Instructions

Prerequisites:
- Python 3.13 installed.
- Docker for running Qdrant.
- Start the Qdrant container, mapping host port 7000 to Qdrant's HTTP port 6333: docker run -d -p 7000:6333 qdrant/qdrant:latest
Project Setup:
- Install uv if it is not already installed: pip install uv
- Create project: uv init vectorvault --python 3.13
- Navigate: cd vectorvault
- Add dependencies (as in pyproject.toml)
- Copy .env.example to .env and fill in values (e.g., FIRECRAWL_API_KEY)
Directories:
- logs/: For logs (created automatically if needed).
- data/: For backups (created automatically).
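
Before ingesting anything, it can help to confirm that something is listening on the mapped port. The following is a minimal standard-library sketch; the helper name is hypothetical, and the default host and port assume the Docker command above (host port 7000).

```python
import socket


def port_is_open(host: str = "localhost", port: int = 7000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling port_is_open() with the defaults checks localhost:7000. A successful TCP connection only shows that a process is listening there, not that it is Qdrant, so treat it as a first check rather than a health probe.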

Configuration Guide

Edit .env based on .env.example:

QDRANT_HOST and PORT: Host and port of the local Qdrant instance (host port 7000 with the Docker command in Setup Instructions).
FIRECRAWL_API_KEY: Required only for the premium (Firecrawl) web features.
Processing params: Chunk sizes (2048 default) and the similarity threshold.
Embedding params: Dense, sparse, and reranker models; hybrid and rerank toggles; GPU usage.
Optimization: Quantization type (scalar by default), on-disk storage, optimize threshold, HNSW parameters.
Chunking: Toggles for code (parser-based) and document (section-based) chunking.
Code: Model used for code embeddings.
Precision/Async: FP16 precision and async processing toggles.

Usage Examples

# Add a single file
python main.py add-file /path/to/file.txt

# Add a directory
python main.py add-dir /path/to/directory

# Add a URL (free scraping)
python main.py add-url https://example.com

# Add a URL (Firecrawl premium)
python main.py add-url https://example.com --firecrawl

# Crawl an entire site
python main.py crawl-site https://example.com

# Extract structured data from URL
python main.py extract-url https://example.com

# Find duplicates (auto-delete with flag)
python main.py find-dupes --auto-dedup

# Scan for secrets (auto-redact with flag)
python main.py scan-secrets --auto-redact

# Optimize collection
python main.py optimize
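
When several sources need ingesting, the commands above can be scripted. The sketch below assumes main.py sits in the current directory and uses only the subcommands and flags listed above; build_command and run_all are hypothetical helper names, not part of VectorVault.

```python
import subprocess
import sys


def build_command(subcommand: str, *args: str) -> list[str]:
    """Build the argument list for one VectorVault CLI call."""
    return [sys.executable, "main.py", subcommand, *args]


def run_all(commands: list[list[str]], dry_run: bool = True) -> list[int]:
    """Run commands in order; with dry_run, only print them and return zeros."""
    codes = []
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
            codes.append(0)
        else:
            codes.append(subprocess.run(cmd, check=False).returncode)
    return codes


JOBS = [
    build_command("add-dir", "/path/to/directory"),
    build_command("add-url", "https://example.com", "--firecrawl"),
    build_command("find-dupes", "--auto-dedup"),
]
```

run_all(JOBS) prints the three commands without executing them; passing dry_run=False runs them and collects the exit codes.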

Troubleshooting

Qdrant Connection Issues: Ensure the Qdrant Docker container is running and reachable on port 7000. Check the logs for errors.
Firecrawl Errors: Verify the API key in .env. Fall back to free scraping if problems persist.
Encoding Problems: The tool detects file encoding automatically; if detection fails, check the file's encoding.
Performance: For large datasets, increase limits or tune chunk sizes in .env. Enable GPU for faster embeddings.
No Content Added: Check that the file type is supported or that the URL is accessible.
Embeddings: If sparse or hybrid embedding is slow, toggle it off in .env. Models run locally, so embeddings incur no API cost.
Auto Modes: Use --auto-dedup or --auto-redact for automation; the logs record the actions taken.
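
For the encoding case, a quick way to see how a problem file decodes is to try a few candidate encodings in order. This is a generic sketch, not VectorVault's detection logic; the candidate list and the function name are assumptions.

```python
def decode_with_fallback(
    data: bytes, encodings: tuple[str, ...] = ("utf-8", "cp1252")
) -> tuple[str, str]:
    """Try each encoding in order and return (text, encoding_used)."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character, so it cannot fail.
    return data.decode("latin-1"), "latin-1"
```

Reading the file with open(path, "rb").read() and passing the bytes in shows which encoding was accepted. A successful decode does not prove the encoding is the right one, only that the bytes are valid in it.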

MVP Validation Steps

Start the Qdrant Docker container.
Create test.txt with some content, then run python main.py add-file test.txt.
Verify the result in Qdrant (use the admin UI or a query).
Run add-url, crawl-site, and extract-url with example.com.
Run find-dupes --auto-dedup; check the logs for deletions.
Run scan-secrets --auto-redact; check the logs and backups.
Run optimize; check the logs.
Test type-specific chunking: add .py, .pdf, and .json files, then check the chunks in the logs.
Time add-dir with 1000 files to validate throughput.
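
The final step asks for a throughput measurement. Below is a sketch of one way to take it, assuming main.py is in the current directory. The helper names are hypothetical, the figure is wall-clock time for the whole command, and counting files recursively assumes that add-dir also descends into subdirectories.

```python
import subprocess
import sys
import time
from pathlib import Path


def count_files(directory: str) -> int:
    """Count regular files under `directory`, recursively."""
    return sum(1 for p in Path(directory).rglob("*") if p.is_file())


def files_per_second(file_count: int, elapsed_seconds: float) -> float:
    """Convert a file count and elapsed time into a throughput figure."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return file_count / elapsed_seconds


def time_add_dir(directory: str) -> float:
    """Run `python main.py add-dir <directory>` and return files per second."""
    n = count_files(directory)
    start = time.perf_counter()
    subprocess.run([sys.executable, "main.py", "add-dir", directory], check=True)
    return files_per_second(n, time.perf_counter() - start)
```

Because check=True is set, a failing run raises subprocess.CalledProcessError instead of producing a misleading throughput number.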

Tech Stack

Up-to-date with 2025 technologies:

Python: 3.13 for async and performance features
Qdrant: 1.14+ for vector storage with hybrid search
FastEmbed: For zero-cost local embeddings with GPU support
Firecrawl: For premium web scraping
Pydantic: 2.11+ for type-safe configuration
Other: HTTPX, BeautifulSoup, Tree-sitter, etc.

For full dependencies, see pyproject.toml.

How to Reference

If you use VectorVault in your work, please cite it as follows:

BibTeX

@software{VectorVault2024,
  author = {Your Name or Organization},
  title = {VectorVault: Lightweight CLI for Qdrant-based Knowledge Management},
  year = {2024},
  url = {https://github.com/yourusername/vectorvault},
  version = {1.0.0},
  note = {Supports file/web ingestion, deduplication, security scanning}
}

For more project details, see docs/prd.md.

BjornMelin/vector-vault