RAG PDF QA

RAG-PDF-QA is a Streamlit-based application that allows users to ask questions about the content of PDF documents. It leverages vector stores and large language models to provide accurate, reference-backed answers based on the provided documents.

Features

PDF Document Ingestion: Load and process PDF files from a specified directory.
Document Splitting: Automatically splits PDF pages into overlapping text chunks for improved semantic search and retrieval accuracy.
Vector Store Creation: Embed and store document pages for efficient semantic search.
Question Answering: Ask questions about the ingested documents and receive answers with references to the source document and page.
Streamlit UI: Simple web interface for user interaction.
(In Progress) Multi-Document Support: Backend code for managing and querying multiple PDF documents is present, but not yet integrated into the main UI.

How it Works

PDF Loading: All PDF files in data/Documents/ are loaded and each page is extracted.
Splitting: Each page is split into overlapping text chunks (default: 1000 characters per chunk, 250 character overlap) using a recursive character splitter. This ensures that answers can be found even if the relevant information spans across page boundaries or is located in the middle of a long page.
Embedding & Vector Store: The text chunks are embedded and stored in a vector store for fast semantic search.
Question Answering: When a user asks a question, the app searches the vector store for the most relevant chunks and uses an LLM to generate an answer, citing the source document and page.

Directory Structure

├── app/
│   ├── main.py                # Streamlit app entry point
│   ├── pdfLoader.py           # PDF loading utilities
│   ├── vectorStore.py         # Single vector store management
│   ├── multiVectorStore.py    # Multi-document vector store backend (not yet in UI)
│   ├── chain.py               # LLM chain and prompt logic
├── data/
│   ├── Documents/             # Place your PDF files here
│   │   └── document.txt       # Directory guide ("Put the pdf file in this directory.")
│   ├── VectorStore/           # Stores vector store pickle files
│   │   ├── vector_store.pkl   # Main vector store file
│   │   └── vector_store.txt   # Directory guide ("Vector store will be created in this directory.")
│   └── Diagram/               # Project workflow diagrams
│       ├── diagram-dark.png
│       └── diagram-light.png
├── README.md                  # Project documentation

Diagram

Setup Instructions

Clone the repository:

git clone <repo-url>
cd <repo-directory>

Install dependencies: Create a requirements.txt with the following (or use your preferred environment manager):

streamlit
langchain
langchain-core
langchain-community
langchain-huggingface
langchain-groq
python-dotenv
sentence-transformers

Then install:

pip install -r requirements.txt

Set up environment variables:
- Create a .env file in the root or app/ directory with your Groq API key:
```
GROQ_API_KEY=your_groq_api_key_here
```
Add PDF documents:
- Place your PDF files in data/Documents/.

Usage

Run the Streamlit app:
```
streamlit run app/main.py
```
Interact via the web UI:
- Enter your question in the input box and click "Submit".
- The app will search the ingested documents and return an answer with references.

Note:

On first run or when adding new documents, the app will process all PDFs and create/update the vector store. This may take some time depending on the number and size of documents.
The data/Documents/document.txt and data/VectorStore/vector_store.txt files are directory guides and can be safely ignored or removed if not needed.

Multi-Document Support (Backend Only)

The file app/multiVectorStore.py contains backend logic for managing and querying multiple PDF documents, each with its own vector store.
This feature is not yet integrated into the main Streamlit UI. If you wish to experiment, you can use the class directly in your own scripts/notebooks.
Vector stores for multiple documents are intended to be stored in a dedicated directory (see code for details), but this is not yet active in the main workflow.

Enhancement

In Progress

Integration of robust support for managing and querying multiple PDF documents simultaneously.

Planned Enhancements

Enable users to upload their own PDF files directly through the web interface.
Extend document processing capabilities to support a wider range of file formats beyond PDFs.

Dependencies

Environment Variables

GROQ_API_KEY: Your API key for Groq LLM access (required).

License

MIT License

Acknowledgements

Built with LangChain and Streamlit.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

RAG PDF QA

Features

How it Works

Directory Structure

Diagram

Setup Instructions

Usage

Multi-Document Support (Backend Only)

Enhancement

Dependencies

Environment Variables

License

Acknowledgements

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
app		app
data		data
.env		.env
LICENSE		LICENSE
README.md		README.md

License

yash-meshram/rag-pdf-qa

Folders and files

Latest commit

History

Repository files navigation

RAG PDF QA

Features

How it Works

Directory Structure

Diagram

Setup Instructions

Usage

Multi-Document Support (Backend Only)

Enhancement

Dependencies

Environment Variables

License

Acknowledgements

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages