This project is a sophisticated, local-first Retrieval-Augmented Generation (RAG) platform designed to transform a diverse range of organizational content—including PDFs, Office documents, emails, and images—into an interactive and searchable knowledge base. It is built for anyone who needs to derive reliable answers from in-house documentation without compromising data privacy.
- Multi-Source Ingestion: Supports a wide variety of file formats, including PDF, DOCX, PPTX, CSV, and TXT. EML support is currently experimental. XLSX ingestion is under development.
- Configurable Chunking: Employs a rule-based chunking system that allows different strategies (e.g., by paragraph, by slide) to be applied to different document types, ensuring optimal data segmentation; a brief sketch follows this list.
- Flexible Embedding Models: Easily switch between local, open-source embedding models (via `sentence-transformers`) and powerful API-based models like OpenAI's.
- Multi-Modal Retrieval: Capable of retrieving both text and image-based information. The system includes an `enrich-images` command to generate textual descriptions for images using an agentic workflow, making visual content fully searchable.
- Advanced Retrieval Strategies: Uses a late-fusion approach to combine results from multiple sources, ensuring comprehensive and relevant context for every query.
- Streamlit UI: An intuitive user interface for creating and managing projects, uploading documents, and editing configurations.
- Command-Line Interface: A powerful CLI for interacting with the platform, allowing you to ingest documents, generate embeddings, and ask questions directly from your terminal.
- Local-First and Secure: All your data, including raw files, indexes, and logs, is stored locally on your machine, ensuring complete privacy and control.
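To make the rule-based chunking concrete, here is a minimal, hypothetical sketch of per-document-type strategy dispatch. The actual rules live in `configs/chunk_rules.yaml`; the function and rule names below are illustrative, not the platform's API.

```python
# Hypothetical sketch only; the real strategies are declared in
# configs/chunk_rules.yaml and applied by the platform's chunking pipeline.

def chunk_by_paragraph(text: str) -> list[str]:
    # Split on blank lines and drop empty fragments.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def chunk_by_slide(slides: list[str]) -> list[str]:
    # Treat each slide's text as a single chunk.
    return [s.strip() for s in slides if s.strip()]

# Map document types to strategies, mirroring what a rule file expresses.
CHUNK_RULES = {
    "pdf": chunk_by_paragraph,
    "docx": chunk_by_paragraph,
    "pptx": chunk_by_slide,
}

def chunk(doc_type: str, content):
    strategy = CHUNK_RULES.get(doc_type, chunk_by_paragraph)
    return strategy(content)
```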
- Python 3.10 or higher
- Poetry for dependency management
- An API key for your chosen LLM and embedding providers (e.g., `OPENAI_API_KEY`)
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd <repository-name>
  ```

- Install the dependencies using Poetry:

  ```bash
  poetry install
  ```
The platform can be operated through the Streamlit UI or the command-line interface.
To launch the user interface, run the following command:
```bash
poetry run streamlit run scripts/ui/ui_project_manager.py
```

The UI allows you to:
- Create new projects.
- Upload documents.
- View and edit project configurations.
The CLI provides a powerful way to interact with the platform. Here is a typical workflow:
- Ingest and Chunk Documents:

  ```bash
  python -m app.cli ingest /path/to/your/project --chunk
  ```

- Generate Embeddings:

  ```bash
  python -m app.cli embed /path/to/your/project
  ```

  Optional: Use `--with-image-index` to run image enrichment and indexing immediately after embedding.

- Enrich and Index Images (Standalone): If you didn't use `--with-image-index` during embedding, you can run these steps separately:

  ```bash
  python -m app.cli enrich-images /path/to/your/project --doc-type pptx
  python -m app.cli index-images /path/to/your/project --doc-type pptx
  ```

- Retrieve Context: Test the retrieval system directly:

  ```bash
  python -m app.cli retrieve /path/to/your/project "Your search query" --top-k 5
  ```

- Ask a Question: Generate an answer using the RAG pipeline:

  ```bash
  python -m app.cli ask /path/to/your/project "Your question here"
  ```

- View Configuration: Check the current project configuration:

  ```bash
  python -m app.cli config /path/to/your/project
  ```
For more detailed information on the available commands and their options, please refer to the `app/README.md` file.
The platform is built around a modular pipeline that processes your data in several stages:
- Ingestion: The first step is to ingest your raw documents. The platform provides a suite of loaders that can handle a wide variety of file formats.
- Chunking: Once ingested, the documents are split into smaller, more manageable chunks. This process is highly configurable and can be tailored to the specific characteristics of each document type.
- Enrichment: The platform includes an `ImageInsightAgent` that can analyze images and generate textual descriptions for them. This makes visual content searchable and adds another layer of context to your knowledge base.
- Embedding: The text and image chunks are then converted into numerical representations (embeddings) using a chosen embedding model; a minimal sketch of this stage and the next appears after this list.
- Indexing: The embeddings are stored in a local FAISS index, which allows for efficient similarity searches.
- Retrieval: When you ask a question, the platform uses a late-fusion retrieval strategy to find the most relevant text and image chunks from the index; a sketch of this merge step also appears below.
- Generation: The retrieved context is then used to construct a detailed prompt (sketched below), which is sent to a large language model to generate a final answer.
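As a concrete illustration of the Embedding and Indexing stages, here is a minimal sketch using `sentence-transformers` and FAISS. The model name and index filename are assumptions for the example, not the platform's defaults.

```python
# Minimal embed-then-index sketch; assumes the sentence-transformers and
# faiss-cpu packages. Model name and file path are illustrative choices.
import faiss
from sentence_transformers import SentenceTransformer

chunks = ["First text chunk...", "Second text chunk..."]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any local model works
vectors = model.encode(chunks, normalize_embeddings=True)  # shape (n, dim)

index = faiss.IndexFlatIP(vectors.shape[1])  # inner product = cosine on unit vectors
index.add(vectors)
faiss.write_index(index, "text.index")  # everything stays on local disk
```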
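The late-fusion Retrieval step can then be pictured as querying the text and image indexes independently and merging the ranked hits into one pool. The weighting scheme below is a plain illustration, not the platform's exact one.

```python
# Illustrative late fusion: search each index separately, then merge.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
text_index = faiss.read_index("text.index")
image_index = faiss.read_index("image.index")  # built from enriched image descriptions

def late_fusion(query: str, top_k: int = 5, image_weight: float = 0.8):
    q = model.encode([query], normalize_embeddings=True)
    hits = []
    for index, source, weight in [(text_index, "text", 1.0),
                                  (image_index, "image", image_weight)]:
        scores, ids = index.search(q, top_k)
        hits += [(weight * s, source, int(i))
                 for s, i in zip(scores[0], ids[0]) if i != -1]
    # Rank the combined pool and keep the best top_k across both sources.
    return sorted(hits, reverse=True)[:top_k]
```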
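Finally, the Generation stage boils down to packing the fused context into a prompt for the LLM. The template below is only an assumption about the general shape, not the platform's actual prompt.

```python
# Illustrative prompt assembly for the Generation stage.
def build_prompt(question: str, contexts: list[str]) -> str:
    context_block = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )
```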
The project is organized into the following key directories:
- `app/`: Contains the command-line interface for the platform. See `app/README.md`.
- `assets/`: A place for static assets. See `assets/README.md`.
- `configs/`: Home to the `chunk_rules.yaml` file, which defines the chunking strategies for different document types. See `configs/README.md`.
- `docs/`: Contains project-related documentation, including architecture diagrams and planning documents. See `docs/README.md`.
- `scripts/`: The heart of the platform, containing the core logic for ingestion, chunking, embedding, retrieval, and more. See `scripts/README.md` for a high-level overview.
- `tests/`: Contains the test suite for the project.
Contributions are welcome! Please feel free to open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.
