Multi-Source RAG Platform

This project is a sophisticated, local-first Retrieval-Augmented Generation (RAG) platform designed to transform a diverse range of organizational content—including PDFs, Office documents, emails, and images—into an interactive and searchable knowledge base. It is built for anyone who needs to derive reliable answers from in-house documentation without compromising data privacy.

Figure: High-level architecture of the RAG platform


✨ Key Features

  • Multi-Source Ingestion: Supports a wide variety of file formats, including PDF, DOCX, PPTX, CSV, and TXT. EML support is currently experimental. XLSX ingestion is under development.
  • Configurable Chunking: Employs a rule-based chunking system that allows different strategies (e.g., by paragraph, by slide) to be applied to different document types, ensuring optimal data segmentation (see the sketch after this list).
  • Flexible Embedding Models: Easily switch between local, open-source embedding models (via sentence-transformers) and powerful API-based models like OpenAI's.
  • Multi-Modal Retrieval: Capable of retrieving both text and image-based information. The system includes an enrich-images command to generate textual descriptions for images using an agentic workflow, making visual content fully searchable.
  • Advanced Retrieval Strategies: Uses a late-fusion approach to combine results from multiple sources, so each query draws on relevant context from everything you have indexed.
  • Streamlit UI: An intuitive user interface for creating and managing projects, uploading documents, and editing configurations.
  • Command-Line Interface: A powerful CLI for interacting with the platform, allowing you to ingest documents, generate embeddings, and ask questions directly from your terminal.
  • Local-First and Secure: All your data, including raw files, indexes, and logs, is stored locally on your machine, ensuring complete privacy and control.
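
To make the configurable chunking concrete, here is a minimal sketch of what per-document-type rules might look like and how they could be loaded. The field names (strategy, min_tokens, rows_per_chunk) and values are hypothetical illustrations; the actual schema lives in configs/chunk_rules.yaml and may differ.

    # Hypothetical per-doc-type chunk rules; the real schema lives in
    # configs/chunk_rules.yaml and may differ.
    import yaml  # requires PyYAML

    RULES = """
    docx:
      strategy: by_paragraph   # split on paragraph boundaries
      min_tokens: 50
    pptx:
      strategy: by_slide       # one chunk per slide
    csv:
      strategy: by_row_group
      rows_per_chunk: 25
    """

    rules = yaml.safe_load(RULES)
    print(rules["pptx"]["strategy"])  # -> "by_slide"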

🚀 Getting Started

Prerequisites

  • Python 3.10 or higher
  • Poetry for dependency management
  • An API key for your chosen LLM and embedding providers if you use API-based models (e.g., OPENAI_API_KEY); local embedding models via sentence-transformers need no key

Installation

  1. Clone the repository:

    git clone https://github.com/hagaybar/Multi-Source_RAG_Platform.git
    cd Multi-Source_RAG_Platform
  2. Install the dependencies using Poetry:

    poetry install

Usage

The platform can be operated through the Streamlit UI or the command-line interface.

Streamlit UI

To launch the user interface, run the following command:

poetry run streamlit run scripts/ui/ui_project_manager.py

The UI allows you to:

  • Create new projects.
  • Upload documents.
  • View and edit project configurations.

Command-Line Interface

The CLI provides a powerful way to interact with the platform. Run the commands below inside the Poetry environment (for example, by prefixing them with poetry run). Here is a typical workflow:

  1. Ingest and Chunk Documents:

    python -m app.cli ingest /path/to/your/project --chunk
  2. Generate Embeddings:

    python -m app.cli embed /path/to/your/project

    Optional: Use --with-image-index to run image enrichment and indexing immediately after embedding.

  3. Enrich and Index Images (Standalone): If you didn't use --with-image-index during embedding, you can run these steps separately:

    python -m app.cli enrich-images /path/to/your/project --doc-type pptx
    python -m app.cli index-images /path/to/your/project --doc-type pptx
  4. Retrieve Context: Test the retrieval system directly:

    python -m app.cli retrieve /path/to/your/project "Your search query" --top-k 5
  5. Ask a Question: Generate an answer using the RAG pipeline:

    python -m app.cli ask /path/to/your/project "Your question here"
  6. View Configuration: Check the current project configuration:

    python -m app.cli config /path/to/your/project

For more detailed information on the available commands and their options, please refer to the app/README.md file.


🧠 Core Concepts

The platform is built around a modular pipeline that processes your data in several stages:

  1. Ingestion: The first step is to ingest your raw documents. The platform provides a suite of loaders that can handle a wide variety of file formats.
  2. Chunking: Once ingested, the documents are split into smaller, more manageable chunks. This process is highly configurable and can be tailored to the specific characteristics of each document type.
  3. Enrichment: The platform includes an ImageInsightAgent that can analyze images and generate textual descriptions for them (a sketch of this step follows the list). This makes visual content searchable and adds another layer of context to your knowledge base.
  4. Embedding: The text and image chunks are then converted into numerical representations (embeddings) using a chosen embedding model.
  5. Indexing: The embeddings are stored in a local FAISS index, which allows for efficient similarity searches (see the indexing sketch below).
  6. Retrieval: When you ask a question, the platform uses a late-fusion retrieval strategy to find the most relevant text and image chunks from the indexes (see the fusion sketch below).
  7. Generation: The retrieved context is then used to construct a detailed prompt, which is sent to a large language model to generate a final answer.
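
To make these stages concrete, here are three short sketches. First, the enrichment stage (step 3): the snippet below sends an image to a vision-capable OpenAI model and asks for a description suitable for indexing. The model name, prompt, and file name are illustrative assumptions; the platform's ImageInsightAgent may be wired quite differently.

    # Sketch of step 3 (enrichment): describe an image so it becomes searchable.
    # Model, prompt, and file name are assumptions, not the platform's actual setup.
    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("slide4.png", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image for a search index."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)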
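Next, steps 4 and 5 (embedding and indexing) come down to a few lines of sentence-transformers and FAISS. The model name and index type are choices made for this sketch, not necessarily the platform's configuration.

    # Sketch of steps 4-5: embed text chunks and store them in a local FAISS index.
    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    chunks = ["First chunk of text...", "Second chunk of text..."]
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed local model
    vectors = model.encode(chunks, normalize_embeddings=True)

    index = faiss.IndexFlatIP(vectors.shape[1])  # inner product = cosine on normalized vectors
    index.add(np.asarray(vectors, dtype="float32"))
    faiss.write_index(index, "chunks.faiss")     # persisted locally, in keeping with the local-first design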
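Finally, the late-fusion strategy in step 6 amounts to searching each source index separately and merging the ranked result lists afterwards. Reciprocal-rank fusion, shown below, is one common way to merge such lists; whether the platform uses this exact formula is an assumption.

    # Sketch of step 6: merge ranked hit lists from separate indexes
    # (e.g., text and image) with reciprocal-rank fusion (RRF).
    from collections import defaultdict

    def fuse(ranked_lists, k=60, top_k=5):
        """Merge ranked lists of chunk ids; earlier ranks earn higher scores."""
        scores = defaultdict(float)
        for ranking in ranked_lists:
            for rank, chunk_id in enumerate(ranking):
                scores[chunk_id] += 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)[:top_k]

    text_hits  = ["doc1#3", "doc2#1", "doc1#7"]          # ids from the text index
    image_hits = ["slide4.png", "doc1#3", "slide9.png"]  # ids from the image index
    print(fuse([text_hits, image_hits]))  # "doc1#3" ranks first: it appears in both lists

The fused list is what gets packed into the prompt in step 7.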

🗂️ Project Structure

The project is organized into the following key directories:

  • app/: Contains the command-line interface for the platform. See app/README.md.
  • assets/: A place for static assets. See assets/README.md.
  • configs/: Home to the chunk_rules.yaml file, which defines the chunking strategies for different document types. See configs/README.md.
  • docs/: Contains project-related documentation, including architecture diagrams and planning documents. See docs/README.md.
  • scripts/: The heart of the platform, containing the core logic for ingestion, chunking, embedding, retrieval, and more. See scripts/README.md for a high-level overview.
  • tests/: Contains the test suite for the project.

🤝 Contributing

Contributions are welcome! Please feel free to open an issue or submit a pull request.

📜 License

This project is licensed under the MIT License. See the LICENSE file for details.
