This project is a sophisticated, local-first Retrieval-Augmented Generation (RAG) platform designed to transform a diverse range of organizational content—including PDFs, Office documents, emails, and images—into an interactive and searchable knowledge base. It is built for anyone who needs to derive reliable answers from in-house documentation without compromising data privacy.
- Multi-Source Ingestion: Supports a wide variety of file formats, including PDF, DOCX, PPTX, CSV, and TXT. EML support is currently experimental. XLSX ingestion is under development.
- Configurable Chunking: Employs a rule-based chunking system that allows different strategies (e.g., by paragraph, by slide) to be applied to different document types, ensuring optimal data segmentation; a brief sketch follows this list.
- Flexible Embedding Models: Easily switch between local, open-source embedding models (via `sentence-transformers`) and powerful API-based models like OpenAI's.
- Multi-Modal Retrieval: Capable of retrieving both text and image-based information. The system includes an `enrich-images` command to generate textual descriptions for images using an agentic workflow, making visual content fully searchable.
- Advanced Retrieval Strategies: Uses a late-fusion approach to combine results from multiple sources, ensuring comprehensive and relevant context for every query.
- Streamlit UI: An intuitive user interface for creating and managing projects, uploading documents, and editing configurations.
- Command-Line Interface: A powerful CLI for interacting with the platform, allowing you to ingest documents, generate embeddings, and ask questions directly from your terminal.
- Local-First and Secure: All your data, including raw files, indexes, and logs, is stored locally on your machine, ensuring complete privacy and control.
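To make the rule-based chunking concrete, here is a minimal, hypothetical sketch of per-document-type strategy dispatch. The actual rules live in `configs/chunk_rules.yaml`; the function and rule names below are illustrative, not the platform's API.

```python
# Hypothetical sketch only; the real strategies are declared in
# configs/chunk_rules.yaml and applied by the platform's chunking pipeline.

def chunk_by_paragraph(text: str) -> list[str]:
    # Split on blank lines and drop empty fragments.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def chunk_by_slide(slides: list[str]) -> list[str]:
    # Treat each slide's text as a single chunk.
    return [s.strip() for s in slides if s.strip()]

# Map document types to strategies, mirroring what a rule file expresses.
CHUNK_RULES = {
    "pdf": chunk_by_paragraph,
    "docx": chunk_by_paragraph,
    "pptx": chunk_by_slide,
}

def chunk(doc_type: str, content):
    strategy = CHUNK_RULES.get(doc_type, chunk_by_paragraph)
    return strategy(content)
```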
- Python 3.10 or higher
- Poetry for dependency management
- An API key for your chosen LLM and embedding providers (e.g., `OPENAI_API_KEY`)
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd <repository-name>
  ```

- Install the dependencies using Poetry:

  ```bash
  poetry install
  ```
The platform can be operated through the Streamlit UI or the command-line interface.
To launch the user interface, run the following command:
```bash
poetry run streamlit run scripts/ui/ui_project_manager.py
```

The UI allows you to:
- Create new projects.
- Upload documents.
- View and edit project configurations.
The CLI provides a powerful way to interact with the platform. Here is a typical workflow:
- Ingest and Chunk Documents:

  ```bash
  python -m app.cli ingest /path/to/your/project --chunk
  ```

- Generate Embeddings:

  ```bash
  python -m app.cli embed /path/to/your/project
  ```

  Optional: Use `--with-image-index` to run image enrichment and indexing immediately after embedding.

- Enrich and Index Images (Standalone): If you didn't use `--with-image-index` during embedding, you can run these steps separately:

  ```bash
  python -m app.cli enrich-images /path/to/your/project --doc-type pptx
  python -m app.cli index-images /path/to/your/project --doc-type pptx
  ```

- Retrieve Context: Test the retrieval system directly:

  ```bash
  python -m app.cli retrieve /path/to/your/project "Your search query" --top-k 5
  ```

- Ask a Question: Generate an answer using the RAG pipeline:

  ```bash
  python -m app.cli ask /path/to/your/project "Your question here"
  ```

- View Configuration: Check the current project configuration:

  ```bash
  python -m app.cli config /path/to/your/project
  ```
For more detailed information on the available commands and their options, please refer to the `app/README.md` file.
The platform is built around a modular pipeline that processes your data in several stages:
- Ingestion: The first step is to ingest your raw documents. The platform provides a suite of loaders that can handle a wide variety of file formats.
- Chunking: Once ingested, the documents are split into smaller, more manageable chunks. This process is highly configurable and can be tailored to the specific characteristics of each document type.
- Enrichment: The platform includes an `ImageInsightAgent` that can analyze images and generate textual descriptions for them. This makes visual content searchable and adds another layer of context to your knowledge base.
- Embedding: The text and image chunks are then converted into numerical representations (embeddings) using a chosen embedding model; a minimal sketch of this stage and the next appears after this list.
- Indexing: The embeddings are stored in a local FAISS index, which allows for efficient similarity searches.
- Retrieval: When you ask a question, the platform uses a late-fusion retrieval strategy to find the most relevant text and image chunks from the index; a sketch of this merge step also appears below.
- Generation: The retrieved context is then used to construct a detailed prompt (sketched below), which is sent to a large language model to generate a final answer.
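As a concrete illustration of the Embedding and Indexing stages, here is a minimal sketch using `sentence-transformers` and FAISS. The model name and index filename are assumptions for the example, not the platform's defaults.

```python
# Minimal embed-then-index sketch; assumes the sentence-transformers and
# faiss-cpu packages. Model name and file path are illustrative choices.
import faiss
from sentence_transformers import SentenceTransformer

chunks = ["First text chunk...", "Second text chunk..."]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any local model works
vectors = model.encode(chunks, normalize_embeddings=True)  # shape (n, dim)

index = faiss.IndexFlatIP(vectors.shape[1])  # inner product = cosine on unit vectors
index.add(vectors)
faiss.write_index(index, "text.index")  # everything stays on local disk
```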
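The late-fusion Retrieval step can then be pictured as querying the text and image indexes independently and merging the ranked hits into one pool. The weighting scheme below is a plain illustration, not the platform's exact one.

```python
# Illustrative late fusion: search each index separately, then merge.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
text_index = faiss.read_index("text.index")
image_index = faiss.read_index("image.index")  # built from enriched image descriptions

def late_fusion(query: str, top_k: int = 5, image_weight: float = 0.8):
    q = model.encode([query], normalize_embeddings=True)
    hits = []
    for index, source, weight in [(text_index, "text", 1.0),
                                  (image_index, "image", image_weight)]:
        scores, ids = index.search(q, top_k)
        hits += [(weight * s, source, int(i))
                 for s, i in zip(scores[0], ids[0]) if i != -1]
    # Rank the combined pool and keep the best top_k across both sources.
    return sorted(hits, reverse=True)[:top_k]
```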
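Finally, the Generation stage boils down to packing the fused context into a prompt for the LLM. The template below is only an assumption about the general shape, not the platform's actual prompt.

```python
# Illustrative prompt assembly for the Generation stage.
def build_prompt(question: str, contexts: list[str]) -> str:
    context_block = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )
```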
The project is organized into the following key directories:
- `app/`: Contains the command-line interface for the platform. See `app/README.md`.
- `assets/`: A place for static assets. See `assets/README.md`.
- `configs/`: Home to the `chunk_rules.yaml` file, which defines the chunking strategies for different document types. See `configs/README.md`.
- `docs/`: Contains project-related documentation, including architecture diagrams and planning documents. See `docs/README.md`.
- `scripts/`: The heart of the platform, containing the core logic for ingestion, chunking, embedding, retrieval, and more. See `scripts/README.md` for a high-level overview.
- `tests/`: Contains the test suite for the project.
Contributions are welcome! Please feel free to open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.
