Data Pipeline Overview

Introduction

The data pipeline in RAGit transforms raw documents into the indexed, searchable form that a Retrieval-Augmented Generation (RAG) solution requires. This document provides a high-level overview of each stage in the pipeline and the key processes involved.

Document Ingestion

Documents can be ingested simply by placing them into the designated documents directory for the RAG collection within the shared directory. The system currently supports PDF, DOCX, Python, and Markdown formats, with additional data types to be incorporated as the project evolves.
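As a minimal sketch, ingestion amounts to copying a file into the collection's documents directory. The path layout below (/srv/ragit/shared/&lt;collection&gt;/documents) is an assumption for illustration, not RAGit's actual convention; substitute whatever your deployment designates.

```python
from pathlib import Path
import shutil

# Assumed layout for illustration only -- substitute your deployment's
# actual shared directory and collection name.
shared_dir = Path("/srv/ragit/shared")
docs_dir = shared_dir / "my_collection" / "documents"

docs_dir.mkdir(parents=True, exist_ok=True)
shutil.copy("quarterly_report.pdf", docs_dir)  # picked up by the ETL process
```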

Document ETL Process

After documents are placed in the designated directory, they are split into chunks, embedded, and stored in the vector database, which makes them available for RAG querying.

PDF File Processing

PDF files are initially converted to images, with each page represented as a separate image. These images are then transformed into Markdown format. Following this conversion, the subsequent steps—embedding calculation and storage in the vector database—proceed just as they do for other Markdown documents.
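A sketch of that flow under stated assumptions: pdf2image renders each page to an image, and plain OCR via pytesseract stands in for the image-to-Markdown conversion (the converter RAGit actually uses is not specified here and may be a vision model that preserves more structure).

```python
from pathlib import Path

from pdf2image import convert_from_path  # requires the poppler utilities
import pytesseract                       # requires the tesseract binary

def pdf_to_markdown(pdf_path: Path) -> str:
    # Step 1: render each PDF page as a separate image.
    pages = convert_from_path(str(pdf_path), dpi=200)
    parts = []
    for page_number, image in enumerate(pages, start=1):
        # Step 2: convert each page image to text. Plain OCR stands in
        # here for the image-to-Markdown conversion; the real pipeline
        # may use a model that preserves Markdown structure.
        text = pytesseract.image_to_string(image)
        parts.append(f"<!-- page {page_number} -->\n{text.strip()}")
    # From here on, the result is treated like any other Markdown document.
    return "\n\n".join(parts)
```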

Document Splitting

Once collected, documents are divided into smaller chunks to improve processing efficiency and searchability. Splitting follows structural criteria such as paragraph breaks or sentence boundaries.
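A minimal sketch of paragraph-based splitting with a size cap; the criteria and the limit here are assumptions for illustration, not RAGit's actual splitter settings.

```python
def split_into_chunks(markdown_text: str, max_chars: int = 1000) -> list[str]:
    # Split on blank lines (paragraph breaks), then pack paragraphs
    # into chunks no longer than max_chars.
    paragraphs = [p.strip() for p in markdown_text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```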

Database Insertion

The resulting chunks are stored incrementally in a relational database, so new documents can be added at any time without re-ingesting the entire collection.
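A sketch of incremental insertion using sqlite3 with a content-hash primary key, so re-running the ETL skips chunks that are already stored. The schema and database file are hypothetical, not RAGit's actual ones.

```python
import hashlib
import sqlite3

conn = sqlite3.connect("ragit.db")  # hypothetical database file
conn.execute(
    """CREATE TABLE IF NOT EXISTS chunks (
           id TEXT PRIMARY KEY,   -- content hash, so re-runs are idempotent
           source TEXT NOT NULL,  -- originating document
           body TEXT NOT NULL,
           embedding BLOB         -- NULL until the embedding step fills it in
       )"""
)

def insert_chunks(source: str, chunks: list[str]) -> None:
    for body in chunks:
        chunk_id = hashlib.sha256(body.encode()).hexdigest()
        # INSERT OR IGNORE keeps insertion incremental: already-stored
        # chunks are skipped, so documents can be added at any time.
        conn.execute(
            "INSERT OR IGNORE INTO chunks (id, source, body) VALUES (?, ?, ?)",
            (chunk_id, source, body),
        )
    conn.commit()
```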

Embedding Calculation and Storage

To facilitate vector-based search, embeddings—numerical representations capturing the semantic meaning of text—are computed for each chunk in the database.

  • Embedding Calculation: A dedicated process identifies chunks that do not yet have embeddings and computes them.
  • Embedding Storage: The calculated embeddings are written back to the database (both steps are sketched below).
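Continuing the hypothetical schema above, a sketch of both steps: select rows whose embedding column is still NULL, compute vectors, and write them back. The sentence-transformers model named here is an assumption for illustration; RAGit's actual embedding model is not specified in this document.

```python
import sqlite3

import numpy as np
from sentence_transformers import SentenceTransformer

# Model choice is an assumption for illustration only.
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_missing(conn: sqlite3.Connection) -> None:
    # Identify chunks whose embedding column is still NULL ...
    rows = conn.execute(
        "SELECT id, body FROM chunks WHERE embedding IS NULL"
    ).fetchall()
    # ... then compute each embedding and store it back in the same row.
    for chunk_id, body in rows:
        vector = model.encode(body).astype(np.float32)
        conn.execute(
            "UPDATE chunks SET embedding = ? WHERE id = ?",
            (vector.tobytes(), chunk_id),
        )
    conn.commit()
```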

Vector Database Construction

Using the stored embeddings, the vector database is either constructed or updated, enabling efficient retrieval of relevant chunks based on semantic similarity.

  • Vector Database Update: Embeddings are indexed, allowing for search operations that return similar chunks in response to a query.
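A sketch of building and querying such an index with FAISS (the vector store RAGit actually uses is not specified here): embeddings are loaded from the relational database, normalized, and indexed so that inner-product search behaves as cosine similarity.

```python
import sqlite3

import faiss
import numpy as np

def build_index(conn: sqlite3.Connection) -> tuple[faiss.Index, list[str]]:
    # Load every stored embedding from the relational database
    # (assumes at least one chunk has been embedded).
    rows = conn.execute(
        "SELECT id, embedding FROM chunks WHERE embedding IS NOT NULL"
    ).fetchall()
    ids = [chunk_id for chunk_id, _ in rows]
    matrix = np.vstack(
        [np.frombuffer(blob, dtype=np.float32) for _, blob in rows]
    )
    # Normalizing and using an inner-product index makes the
    # similarity measure cosine similarity.
    faiss.normalize_L2(matrix)
    index = faiss.IndexFlatIP(matrix.shape[1])
    index.add(matrix)
    return index, ids

def search(index: faiss.Index, ids: list[str],
           query_vector: np.ndarray, k: int = 5) -> list[str]:
    # Return the ids of the k chunks most similar to the query.
    q = query_vector.astype(np.float32).reshape(1, -1)
    faiss.normalize_L2(q)
    _, neighbors = index.search(q, k)
    return [ids[i] for i in neighbors[0]]
```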

Frontend Deployment

The vector database and web service frontend are deployed on a web server, making the RAG solution accessible to users.

  • Web Service: This service interfaces with the vector database to fetch relevant chunks based on user queries.
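A minimal sketch of such a service using FastAPI, reusing `model`, `build_index`, and `search` from the sketches above; the endpoint shape and names are assumptions, not RAGit's actual API.

```python
import sqlite3

from fastapi import FastAPI
from pydantic import BaseModel

# `model`, `build_index`, and `search` come from the earlier sketches.
with sqlite3.connect("ragit.db") as conn:
    index, ids = build_index(conn)

app = FastAPI()

class Query(BaseModel):
    question: str
    top_k: int = 5

@app.post("/query")
def query(q: Query) -> dict:
    # Embed the question and retrieve the most similar chunks.
    vector = model.encode(q.question)
    chunk_ids = search(index, ids, vector, k=q.top_k)
    # Use a short-lived connection per request: sqlite3 connections
    # must not be shared across threads by default.
    with sqlite3.connect("ragit.db") as req_conn:
        bodies = [
            req_conn.execute(
                "SELECT body FROM chunks WHERE id = ?", (cid,)
            ).fetchone()[0]
            for cid in chunk_ids
        ]
    return {"chunks": bodies}
```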

Evaluation and Enhancement

User interactions with the frontend are monitored to gather feedback and evaluate the RAG solution’s performance. This feedback loop informs regular updates and enhancements to the data pipeline.

  • User Feedback: Captures user assessments (e.g., thumbs up or down) to evaluate response quality; a storage sketch follows this list.
  • Periodic Updates: Utilizes feedback to regularly update the vector database, refine prompts, and improve other solution components.
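As one possible sketch, feedback can be appended to a table in the same relational database for later analysis; the feedback schema here is hypothetical, not RAGit's actual one.

```python
import sqlite3
import time

def record_feedback(question: str, answer: str, thumbs_up: bool) -> None:
    # Hypothetical feedback table; RAGit's actual schema may differ.
    with sqlite3.connect("ragit.db") as conn:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS feedback (
                   ts REAL, question TEXT, answer TEXT, thumbs_up INTEGER
               )"""
        )
        conn.execute(
            "INSERT INTO feedback VALUES (?, ?, ?, ?)",
            (time.time(), question, answer, int(thumbs_up)),
        )
```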

High-Level Data Pipeline Workflow

  1. Document Collection
    • Gather supported documents (PDF, DOCX, Python, Markdown).
  2. Document Splitting and Database Insertion
    • Split documents into chunks.
    • Insert chunks into the relational database.
  3. Embedding Calculation and Storage
    • Identify chunks lacking embeddings.
    • Calculate and store embeddings in the database.
  4. Vector Database Construction
    • Index embeddings within the vector database.
  5. Frontend Deployment
    • Deploy the vector database and web service frontend on a web server.
  6. Evaluation and Enhancement
    • Gather user feedback for performance evaluation.
    • Regularly update and improve the data pipeline based on feedback.

Conclusion

The data pipeline in RAGit is an iterative process that turns raw documents into a searchable knowledge base for retrieval-augmented generation. The stages above describe how data is processed, indexed, and served, and how user feedback drives ongoing improvement of the solution.
