Data Pipeline Overview
The data pipeline in RAGit transforms raw documents into content that a Retrieval Augmented Generation (RAG) solution can retrieve and answer from. This document gives a high-level overview of each stage of the pipeline and the key processes involved.
Documents can be ingested simply by placing them into the designated documents directory for the RAG collection within the shared directory. The system currently supports PDF, DOCX, Python, and Markdown formats, with additional data types to be incorporated as the project evolves.
After documents are placed in the designated directory, they must undergo processing to facilitate splitting, embedding, and storage into the vector database, enabling RAG querying capabilities.
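The ingestion step above can be sketched as a simple directory scan. The `shared/<collection>/documents` layout and the helper name are assumptions for illustration; only the supported file types come from the text.

```python
from pathlib import Path

# Extensions the document lists as currently supported.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".py", ".md"}

def find_ingestable_documents(collection_dir: str) -> list[Path]:
    """Return supported documents found in a collection's documents directory.

    The "documents" subdirectory name is an assumption about the layout.
    """
    documents_dir = Path(collection_dir) / "documents"
    return sorted(
        p for p in documents_dir.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )
```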
PDF files are initially converted to images, with each page represented as a separate image. These images are then transformed into Markdown format. Following this conversion, the subsequent steps—embedding calculation and storage in the vector database—proceed just as they do for other Markdown documents.
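The PDF path can be sketched as a small composition: render pages to images, transcribe each image to Markdown, and join the results. Since RAGit's concrete rasterizer and image-to-Markdown step are not specified here, both are injected as callables rather than named.

```python
from typing import Callable, Iterable

def pdf_to_markdown(
    page_images: Iterable[object],
    image_to_markdown: Callable[[object], str],
) -> str:
    """Assemble per-page images into one Markdown document.

    How pages are rendered (e.g. a PDF rasterizer) and how an image is
    transcribed to Markdown (e.g. a vision model) are assumptions left
    to the caller; this only shows the shape of the conversion.
    """
    return "\n\n".join(image_to_markdown(img) for img in page_images)
```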
Once collected, documents are divided into smaller chunks to enhance processing efficiency and searchability. Splitting is based on specific criteria, such as paragraph breaks or sentence boundaries.
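A minimal paragraph-based splitter, along the lines described above, might look like this. The `max_chars` budget is an illustrative parameter, not a documented RAGit setting.

```python
import re

def split_into_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Split text on paragraph breaks, packing paragraphs into chunks of
    at most max_chars (an oversized paragraph becomes its own chunk)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```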
The resultant chunks are stored incrementally in a relational database, allowing for updates without the necessity of upfront ingestion of all documents.
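Incremental storage can be made idempotent by keying each chunk on a content hash, so re-ingesting a document inserts nothing new. This sketch uses SQLite; the table layout and hashing scheme are assumptions, not RAGit's actual schema.

```python
import hashlib
import sqlite3

def ensure_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE IF NOT EXISTS chunks (
               chunk_id TEXT PRIMARY KEY,  -- content hash, stable across re-runs
               document TEXT NOT NULL,
               content  TEXT NOT NULL
           )"""
    )

def insert_chunks(conn: sqlite3.Connection, document: str, chunks: list[str]) -> int:
    """Insert chunks idempotently; returns how many were actually new."""
    inserted = 0
    for content in chunks:
        chunk_id = hashlib.sha256(f"{document}\x00{content}".encode()).hexdigest()
        cur = conn.execute(
            "INSERT OR IGNORE INTO chunks (chunk_id, document, content) VALUES (?, ?, ?)",
            (chunk_id, document, content),
        )
        inserted += cur.rowcount  # 0 when the chunk already existed
    conn.commit()
    return inserted
```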
To facilitate vector-based search, embeddings—numerical representations capturing the semantic meaning of text—are computed for each chunk in the database.
- Embedding Calculation: A dedicated process identifies chunks lacking embeddings.
- Embedding Storage: Calculated embeddings are stored back in the database.
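The two steps above — find chunks lacking embeddings, then store the results back — can be sketched as a backfill pass over the database. RAGit's actual embedding model is not specified here, so a toy stand-in is used; the `embedding` column on the chunks table is likewise an assumption.

```python
import json
import sqlite3

def embed(text: str) -> list[float]:
    """Toy stand-in for a real embedding model: counts character classes."""
    return [
        float(sum(c.isalpha() for c in text)),
        float(sum(c.isdigit() for c in text)),
        float(len(text)),
    ]

def backfill_embeddings(conn: sqlite3.Connection) -> int:
    """Compute and store embeddings for chunks that lack one; returns the count."""
    rows = conn.execute(
        "SELECT chunk_id, content FROM chunks WHERE embedding IS NULL"
    ).fetchall()
    for chunk_id, content in rows:
        conn.execute(
            "UPDATE chunks SET embedding = ? WHERE chunk_id = ?",
            (json.dumps(embed(content)), chunk_id),
        )
    conn.commit()
    return len(rows)
```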
Using the stored embeddings, the vector database is either constructed or updated, enabling efficient retrieval of relevant chunks based on semantic similarity.
- Vector Database Update: Embeddings are indexed, allowing for search operations that return similar chunks in response to a query.
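Semantic-similarity retrieval over the indexed embeddings reduces to a nearest-neighbor search. A production vector database would use an approximate index; this brute-force cosine-similarity sketch just illustrates the operation.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_embedding: list[float], index, k: int = 3) -> list[str]:
    """index: list of (chunk_id, embedding) pairs; returns the k most similar ids."""
    scored = sorted(
        index,
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]
```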
The vector database and web service frontend are deployed on a web server, making the RAG solution accessible to users.
- Web Service: This service interfaces with the vector database to fetch relevant chunks based on user queries.
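The web service's retrieval step — embed the user query, then fetch similar chunks from the vector database — can be sketched as a pure function. The embedding and search backends are injected, and the response shape is an assumption for illustration.

```python
from typing import Callable

def handle_query(
    query: str,
    embed: Callable[[str], list[float]],
    search: Callable[[list[float]], list[str]],
) -> dict:
    """Embed the query, fetch similar chunk ids, return a JSON-able payload.

    The real service's response format is not specified in the text; this
    shape is an assumption.
    """
    chunk_ids = search(embed(query))
    return {"query": query, "chunks": chunk_ids}
```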
User interactions with the frontend are monitored to gather feedback and evaluate the RAG solution’s performance. This feedback loop informs regular updates and enhancements to the data pipeline.
- User Feedback: Captures user assessments (e.g., thumbs up or down) to evaluate response quality.
- Periodic Updates: Utilizes feedback to regularly update the vector database, refine prompts, and improve other solution components.
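Capturing thumbs up/down ratings and summarizing them for evaluation could look like the following SQLite sketch; the table layout and metric are assumptions, not RAGit's actual feedback schema.

```python
import sqlite3
import time

def record_feedback(conn: sqlite3.Connection, query: str, thumbs_up: bool) -> None:
    """Capture a thumbs up/down rating for a served response."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS feedback (
               ts REAL NOT NULL, query TEXT NOT NULL, thumbs_up INTEGER NOT NULL
           )"""
    )
    conn.execute(
        "INSERT INTO feedback (ts, query, thumbs_up) VALUES (?, ?, ?)",
        (time.time(), query, int(thumbs_up)),
    )
    conn.commit()

def approval_rate(conn: sqlite3.Connection) -> float:
    """Fraction of positive ratings, one possible input to periodic updates."""
    total, ups = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(thumbs_up), 0) FROM feedback"
    ).fetchone()
    return ups / total if total else 0.0
```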
In summary, the pipeline proceeds through the following stages:

1. Document Collection
   - Gather supported documents (PDF, DOCX, Python, Markdown).
2. Document Splitting and Database Insertion
   - Split documents into chunks.
   - Insert the chunks into the relational database.
3. Embedding Calculation and Storage
   - Identify chunks lacking embeddings.
   - Calculate and store embeddings in the database.
4. Vector Database Construction
   - Index the embeddings in the vector database.
5. Frontend Deployment
   - Deploy the vector database and web service frontend on a web server.
6. Evaluation and Enhancement
   - Gather user feedback for performance evaluation.
   - Regularly update and improve the data pipeline based on feedback.
The data pipeline in RAGit is an iterative process that turns raw documents into a searchable, up-to-date RAG solution. Following these stages ensures that data is correctly processed, indexed, and accessible for retrieval-augmented generation tasks.