VectorVault-AI is a local-first semantic research and document exploration engine. Built for the Artificial Intelligence course in my 4th semester of BS Artificial Intelligence (DUET), it moves beyond traditional keyword searching to provide a "Neural Search" experience. It utilizes high-dimensional vector embeddings to understand the context and meaning of academic documents, offering advanced features like semantic clustering and automated knowledge mapping.
The project demonstrates the application of modern AI retrieval systems: extracting text from unstructured PDFs, transforming language into mathematical vectors, and performing efficient similarity searches. By utilizing a local vector database and transformer models, VectorVault-AI ensures that all academic data remains private and is processed entirely on the user's machine, eliminating the need for expensive or privacy-invasive cloud APIs.
- Neural Search Engine: Semantic similarity search using Cosine Distance. Finds information based on concept even if exact keywords aren't present.
- Automated Knowledge Map: A 2D visualization of the "Document Landscape" using Principal Component Analysis (PCA) to project 384-dimensional embeddings onto a scatter plot.
- Multi-Document Ingestion: Supports drag-and-drop uploading of multiple PDFs simultaneously with parallel asynchronous processing.
- Intelligent Chunking: Automatically splits documents into 500-character segments with overlapping windows to preserve semantic context across boundaries.
- Dynamic Theming: Interactive Glassmorphism UI with a real-time Light/Dark mode toggle that persists across sessions.
- Local-First Architecture: Powered by local Sentence-Transformers and ChromaDB, so embedding and search run without network round trips and documents never leave the user's machine.
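The Intelligent Chunking feature above can be illustrated with a minimal sketch. The 500-character window comes from this README; the 50-character overlap and the function name `chunk_text` are assumptions made for illustration and may not match what `engine.py` actually does.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters.

    chunk_size=500 follows the README; overlap=50 is an assumed value.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "".join(chr(97 + i % 26) for i in range(1200))
    for i, chunk in enumerate(chunk_text(sample)):
        print(i, len(chunk))
```

Because consecutive windows share their boundary characters, a sentence that straddles a cut still appears whole in at least one chunk as long as it is shorter than the overlap.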
| Component | Purpose |
|---|---|
| Python / FastAPI | High-performance backend and API orchestration |
| Sentence-Transformers | Local embedding generation (all-MiniLM-L6-v2) |
| ChromaDB | Vector database for persistent embedding storage and retrieval |
| scikit-learn | Dimensionality reduction (PCA) for the Knowledge Map |
| PyMuPDF (fitz) | High-fidelity PDF text extraction |
| Chart.js | Interactive 2D scatter plot rendering for the map |
| HTML5 / CSS3 / JS | Custom Glassmorphism UI and asynchronous frontend logic |
| Docker | Containerization for reproducible local and cloud deployment |
VectorVault-AI/
├── app/
│   ├── main.py          # FastAPI routes, file handling, and API logic
│   ├── engine.py        # AI logic: embedding, chunking, PCA, and ChromaDB ops
│   ├── static/          # Frontend assets (index.html, style.css, app.js)
│   ├── uploads/         # Temporary storage for ingested PDFs
│   └── db/              # Persistent ChromaDB vector storage
├── Dockerfile           # GCR-optimized container config
├── requirements.txt     # Python dependencies (pinned versions)
├── .gitignore           # Excludes venv, __pycache__, and local DB files
└── README.md            # Project documentation
- Clone the repository:
git clone https://github.com/abdulhayykhan/VectorVault-AI.git
cd VectorVault-AI
- Install C++ Build Tools (Windows only):
Required for `chromadb` compilation. Install Microsoft C++ Build Tools with the "Desktop development with C++" workload.
- Create and activate a virtualenv:
python -m venv .venv
.venv\Scripts\activate # Windows
source .venv/bin/activate # macOS/Linux
- Install dependencies:
pip install -r requirements.txt
- Run the development server:
cd app
uvicorn main:app --reload --host 127.0.0.1 --port 8000
- Dashboard: Access the UI at http://127.0.0.1:8000. Use the toggle in the top right to switch between Light and Dark mode.
- Ingestion: Drop one or more PDF files into the "Document Vault" zone. Wait for the system to extract text and generate embeddings.
- Neural Search: Type a concept-based question (e.g., "How do neural networks learn?") into the search bar.
- Explore the Map: Switch to the Knowledge Map to see how the AI has clustered your documents. Hover over points to see text snippets from specific files.
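To give a feel for what the Knowledge Map computes, here is a dependency-free sketch of projecting vectors onto their top two principal components. The project itself uses scikit-learn's PCA on 384-dimensional embeddings; this stand-in uses power iteration with deflation on tiny toy vectors, and the names `pca_2d` and `_top_direction` are hypothetical.

```python
import math
import random


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _top_direction(rows, iters=300, seed=0):
    """Power iteration for the leading eigenvector of X^T X (rows are centered)."""
    dim = len(rows[0])
    rng = random.Random(seed)  # fixed seed keeps the sketch deterministic
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(_dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iters):
        scores = [_dot(r, v) for r in rows]  # X v
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(dim)]  # X^T (X v)
        norm = math.sqrt(_dot(w, w))
        if norm < 1e-12:  # no variance left in the data
            break
        v = [x / norm for x in w]
    return v


def pca_2d(vectors):
    """Return one [x, y] pair per input vector, using the top two components."""
    n, dim = len(vectors), len(vectors[0])
    means = [sum(v[j] for v in vectors) / n for j in range(dim)]
    rows = [[v[j] - means[j] for j in range(dim)] for v in vectors]
    coords = [[] for _ in range(n)]
    for _ in range(2):
        direction = _top_direction(rows)
        scores = [_dot(r, direction) for r in rows]
        for c, s in zip(coords, scores):
            c.append(s)
        # Deflate: remove the component just found so the next pass finds a new axis.
        rows = [[x - s * d for x, d in zip(r, direction)] for r, s in zip(rows, scores)]
    return coords


if __name__ == "__main__":
    toy = [[t, 2.0 * t, 0.0] for t in range(5)]  # points on a line in 3D
    for point in pca_2d(toy):
        print([round(p, 6) for p in point])
```

The sign of each axis is arbitrary, so a map produced this way may appear mirrored relative to scikit-learn's output while preserving the same distances between points.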
| Concept | Implementation in VectorVault-AI |
|---|---|
| Vector Embeddings | Transforming text into 384-dimensional numerical arrays representing "meaning." |
| Semantic Similarity | Using Cosine Distance to rank document segments by conceptual relevance to a query. |
| Dimensionality Reduction | Applying PCA to project high-dimensional data into a human-readable 2D space. |
| RAG (Retrieval Logic) | Creating a foundational pipeline for Retrieval-Augmented Generation without cloud dependencies. |
| Unsupervised Clustering | Visualizing how the AI naturally groups related lecture notes together without human labels. |
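The Semantic Similarity row can be made concrete with a small sketch of ranking by cosine distance. In the project this step is handled by ChromaDB over 384-dimensional embeddings; the toy 3-dimensional vectors and the helper names `cosine_distance` and `rank_chunks` below are assumptions for illustration only.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat a zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunk_vecs, top_k=3):
    """Return (chunk_index, distance) pairs, closest first."""
    scored = [(cosine_distance(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort()
    return [(i, dist) for dist, i in scored[:top_k]]


if __name__ == "__main__":
    query = [1.0, 0.0, 0.0]
    chunks = [
        [0.0, 1.0, 0.0],  # orthogonal to the query
        [2.0, 0.0, 0.0],  # same direction, different length
        [1.0, 1.0, 0.0],  # partially aligned
    ]
    for index, distance in rank_chunks(query, chunks):
        print(index, round(distance, 4))
```

Cosine distance ignores vector length, which is why the second chunk ranks first even though its magnitude differs from the query's.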
The project is configured for deployment to Google Cloud Run, using an image pushed to Google Container Registry (gcr.io):
- Build the image:
gcloud builds submit --tag gcr.io/[PROJECT_ID]/vectorvault-ai
- Deploy:
gcloud run deploy vectorvault-ai --image gcr.io/[PROJECT_ID]/vectorvault-ai --memory 2Gi --cpu 2
This project is open-source and available for educational use under the MIT License.