🚀 Empower AI developers with VectorVault – the lightning-fast CLI tool that transforms your scattered files, code snippets, and web content into a smart, searchable vector database powered by Qdrant. Eliminate duplicates, secure your data, and unlock instant insights with hybrid search and AI-powered processing. Seamlessly integrate with your AI agents and RAG systems to build intelligent applications in minutes!
📋 Transform scattered knowledge into a searchable vector database with these features:
| Capability | Description | Emoji |
|---|---|---|
| Multi-format Ingestion | Process various file types with type-specific chunking | 📁 |
| Web Scraping | Free and premium options for URL and site crawling | 🌐 |
| Deduplication | Intelligent similarity detection with auto-delete | 🔍 |
| Security Scanning | Detect and redact sensitive information | 🔒 |
| Optimization | Auto-optimize collections for performance | ⚡ |
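The table describes deduplication only as similarity detection with an optional auto-delete. As a rough illustration of the idea, here is a minimal sketch that assumes cosine similarity over embedding vectors and a hypothetical threshold of 0.95. The function names, the metric, and the threshold are assumptions for illustration, not VectorVault's documented behavior.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicate_pairs(vectors, threshold=0.95):
    """Return (id, id) pairs whose similarity meets the (assumed) threshold.

    `vectors` maps a point id to its embedding. The pairwise loop is O(n^2),
    which is fine for a sketch but not for a large collection.
    """
    ids = list(vectors)
    pairs = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            if cosine_similarity(vectors[first], vectors[second]) >= threshold:
                pairs.append((first, second))
    return pairs
```

With an auto-delete mode, one member of each returned pair would presumably be removed; which one is kept is not specified in this README.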
graph TB
A[CLI Commands] --> B[Content Processors]
B --> C[Type-Specific Chunking]
C --> D[Hybrid Embedding]
D --> E[Qdrant Storage]
E --> F[Deduplication & Security]
F --> G[Optimized Knowledge Base]
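The "Type-Specific Chunking" stage in the diagram can be pictured as a dispatch on file extension. The sketch below is an assumption-laden illustration: it treats the 2048 default as a character count, and it splits Python source on top-level def/class lines as a simple stand-in for a real parser such as Tree-sitter. All function names here are hypothetical.

```python
from pathlib import Path


def chunk_text(text, chunk_size=2048):
    """Fixed-size chunking: the fallback for types without a dedicated chunker."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunk_python(source, chunk_size=2048):
    """Split at top-level def/class lines; oversized blocks are split by size."""
    blocks, current = [], []
    for line in source.splitlines(keepends=True):
        if line.startswith(("def ", "class ")) and current:
            blocks.append("".join(current))
            current = []
        current.append(line)
    if current:
        blocks.append("".join(current))
    chunks = []
    for block in blocks:
        chunks.extend(chunk_text(block, chunk_size))
    return chunks


# Hypothetical registry: extension -> chunker. Other types would be added here.
CHUNKERS = {".py": chunk_python}


def chunk_file_content(filename, content, chunk_size=2048):
    """Pick a chunker by extension, falling back to fixed-size chunks."""
    chunker = CHUNKERS.get(Path(filename).suffix.lower(), chunk_text)
    return chunker(content, chunk_size)
```

Keeping the registry as a plain dictionary makes it easy to see which file types get special handling and which fall through to the default.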
- Prerequisites:
  - Python 3.13 installed.
  - Docker for running Qdrant.
  - Run Qdrant Docker container:
    docker run -d -p 7000:6333 qdrant/qdrant:latest
- Project Setup:
  - Install uv if not already installed:
    pip install uv
  - Create project:
    uv init vectorvault --python 3.13
  - Navigate:
    cd vectorvault
  - Add dependencies (as in pyproject.toml).
  - Copy .env.example to .env and fill in values (e.g., FIRECRAWL_API_KEY).
- Directories:
  - logs/: For logs (created automatically if needed).
  - data/: For backups (created automatically).
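Before ingesting anything, it can help to confirm that the Qdrant container answers on the mapped host port. The following is a minimal sketch using only the standard library; it assumes Qdrant is on localhost at port 7000 (per the docker run mapping) and queries Qdrant's REST endpoint for listing collections. The helper names are hypothetical and not part of VectorVault.

```python
import json
import urllib.error
import urllib.request


def qdrant_url(host="localhost", port=7000, path="/collections"):
    """Build the REST URL; port 7000 is the host side of the 7000:6333 mapping."""
    return f"http://{host}:{port}{path}"


def qdrant_is_reachable(host="localhost", port=7000, timeout=2.0):
    """Return True if Qdrant answers the collections listing with valid JSON."""
    try:
        with urllib.request.urlopen(qdrant_url(host, port), timeout=timeout) as response:
            json.load(response)  # raises ValueError if the body is not JSON
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

If this returns False, checking that the container is running and that the port mapping matches the values in .env is a reasonable first step.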
Edit .env based on .env.example:
- QDRANT_HOST and PORT: Match the local Docker setup (host port 7000 with the docker run mapping shown earlier).
- FIRECRAWL_API_KEY: Required for premium web features.
- Processing params: Chunk sizes (2048 default), similarity threshold.
- Embedding params: Dense/sparse/reranker models, hybrid/rerank toggles, GPU usage.
- Optimization: Quant type (scalar default), on-disk, optimize threshold, HNSW params.
- Chunking: Toggles for code/parser, doc/section.
- Code: Model for code embeds.
- Precision/Async: FP16/async toggles.
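To make the shape of this configuration concrete, here is a minimal sketch of reading a few of these values from the environment with defaults. It uses only the standard library rather than the Pydantic settings the project lists, and the variable names QDRANT_PORT and CHUNK_SIZE are assumptions; only QDRANT_HOST and FIRECRAWL_API_KEY appear in this README. Check .env.example for the real names.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    qdrant_host: str
    qdrant_port: int
    chunk_size: int
    firecrawl_api_key: Optional[str]


def _as_int(env, name, default):
    """Read an integer variable, falling back to the default when unset or blank."""
    raw = env.get(name, "").strip()
    if not raw:
        return default
    try:
        return int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None


def load_settings(env=None):
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        qdrant_host=env.get("QDRANT_HOST", "localhost"),
        qdrant_port=_as_int(env, "QDRANT_PORT", 7000),  # assumed variable name
        chunk_size=_as_int(env, "CHUNK_SIZE", 2048),    # assumed variable name
        firecrawl_api_key=env.get("FIRECRAWL_API_KEY") or None,
    )
```

Failing early on a malformed value, as `_as_int` does, tends to be easier to debug than a connection error later in the pipeline.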
# Add a single file
python main.py add-file /path/to/file.txt
# Add a directory
python main.py add-dir /path/to/directory
# Add a URL (free scraping)
python main.py add-url https://example.com
# Add a URL (Firecrawl premium)
python main.py add-url https://example.com --firecrawl
# Crawl an entire site
python main.py crawl-site https://example.com
# Extract structured data from URL
python main.py extract-url https://example.com
# Find duplicates (auto-delete with flag)
python main.py find-dupes --auto-dedup
# Scan for secrets (auto-redact with flag)
python main.py scan-secrets --auto-redact
# Optimize collection
python main.py optimize
- Qdrant Connection Issues: Ensure Docker is running on port 7000. Check logs for errors.
- Firecrawl Errors: Verify the API key in .env. Fall back to free scraping if problems persist.
- Encoding Problems: The tool detects encodings automatically; if detection fails, check the file's encoding.
- Performance: For large datasets, increase limits or tune chunk sizes in .env. Enable GPU for faster embeddings.
- No Content Added: Check supported file types or URL accessibility.
- Embeddings: If sparse or hybrid embedding is slow, toggle it off in .env; models run locally, so there is no per-call cost.
- Auto Modes: Use --auto-dedup or --auto-redact for automation; the logs show the actions taken.
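For the encoding case, a quick way to check a problem file by hand is to try a short list of encodings in order. This is a generic sketch, not the tool's own detection logic (which this README does not describe); the function name and the candidate list are assumptions.

```python
def decode_with_fallback(data, encodings=("utf-8-sig", "cp1252", "latin-1")):
    """Try each encoding in order and return (text, encoding_used).

    "utf-8-sig" also strips a UTF-8 byte-order mark; "latin-1" maps every
    byte to a character, so it works as a last resort that never fails.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")
```

Reading the file with `Path(...).read_bytes()` and passing the result through this helper shows which encoding the content is likely in, after which the file can be re-saved as UTF-8 if needed.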
- Start Qdrant Docker.
- Run python main.py add-file test.txt (create test.txt with content first).
- Verify in Qdrant (use admin UI or query).
- Run add-url, crawl-site, extract-url with example.com.
- Run find-dupes --auto-dedup; check logs for deletions.
- Run scan-secrets --auto-redact; check logs/backups.
- Run optimize; check logs.
- Test type-specific: Add .py/.pdf/.json, check chunks via logs.
- Time add-dir with 1000 files to validate throughput.
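The last step can be scripted. The sketch below generates small text files in a temporary directory and times the add-dir command from this README; the helper names are hypothetical, and it assumes it is run from the project root where main.py lives.

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def make_test_files(directory, count=1000):
    """Write `count` small, distinct text files and return how many were created."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for index in range(count):
        path = directory / f"doc_{index:04d}.txt"
        path.write_text(f"Test document number {index}\n", encoding="utf-8")
    return count


def time_command(command):
    """Run a command and return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(command, check=False)
    return completed.returncode, time.perf_counter() - start


def measure_add_dir(count=1000):
    """Time `python main.py add-dir` over generated files; returns files per second."""
    with tempfile.TemporaryDirectory() as tmp:
        total = make_test_files(tmp, count)
        code, seconds = time_command([sys.executable, "main.py", "add-dir", tmp])
        if code != 0:
            raise RuntimeError(f"add-dir exited with status {code}")
        return total / seconds
```

Calling `measure_add_dir()` with Qdrant running gives a single throughput figure; running it a few times and comparing the results helps separate model warm-up from steady-state speed.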
Up-to-date with 2025 technologies:
- Python: 3.13 for async and performance features
- Qdrant: 1.14+ for vector storage with hybrid search
- FastEmbed: For zero-cost local embeddings with GPU support
- Firecrawl: For premium web scraping
- Pydantic: 2.11+ for type-safe configuration
- Other: HTTPX, BeautifulSoup, Tree-sitter, etc.
For full dependencies, see pyproject.toml.
If you use VectorVault in your work, please cite it as follows:
@software{VectorVault2024,
author = {Your Name or Organization},
title = {VectorVault: Lightweight CLI for Qdrant-based Knowledge Management},
year = {2024},
url = {https://github.com/yourusername/vectorvault},
version = {1.0.0},
note = {Supports file/web ingestion, deduplication, security scanning}
}
For more project details, see docs/prd.md.