Chat With PDF is a Streamlit-based application that lets users upload a PDF file and ask questions about its content through a chatbot powered by large language models (LLMs). It combines embeddings, a vector store, and a question-answering pipeline to extract text from the PDF, retrieve the passages relevant to a question, and return an answer.
- PDF Upload and Parsing: Users can upload a PDF file, which is processed to extract text from all pages.
- Text Chunking: Text from the PDF is split into manageable chunks using LangChain's `RecursiveCharacterTextSplitter` for efficient processing.
- Embeddings: The application uses HuggingFace embeddings (via the `sentence-transformers/all-MiniLM-L6-v2` model) to create vector representations of text chunks.
- Vector Storage: The project uses FAISS for similarity search to retrieve relevant chunks based on user queries.
- Question Answering: A transformer-based pipeline (using HuggingFace's `deepset/roberta-base-squad2` model) answers user questions by analyzing the relevant text chunks.
- Streamlit Interface: A simple and interactive UI for uploading PDFs, asking questions, and viewing results.
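
The features above can be read as one small pipeline: parse, chunk, index, answer. The sketch below is a minimal illustration and not the code in app.py. The function names, the `chunk_size`, `chunk_overlap`, and `k` values are assumptions, and LangChain import paths vary by version (older releases expose the splitter as `langchain.text_splitter`). Third-party imports are kept inside each function so the steps can be read independently.

```python
from typing import Dict, List


def extract_text(pdf_file) -> str:
    """Concatenate the text of every page in a PDF (path or file-like object)."""
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_file)
    # extract_text() may yield nothing for image-only pages, hence the fallback.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def split_into_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into overlapping chunks (sizes here are assumed, not from the project)."""
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    return splitter.split_text(text)


def build_index(chunks: List[str]):
    """Embed the chunks and store them in a FAISS vector store."""
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import FAISS

    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    return FAISS.from_texts(chunks, embeddings)


def join_context(passages: List[str]) -> str:
    """Merge retrieved passages into one context string, skipping empty ones."""
    return "\n\n".join(p.strip() for p in passages if p.strip())


def answer_question(index, question: str, k: int = 3) -> Dict[str, object]:
    """Retrieve the k most similar chunks and run extractive QA over them."""
    from transformers import pipeline

    qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
    docs = index.similarity_search(question, k=k)
    context = join_context([doc.page_content for doc in docs])
    result = qa(question=question, context=context)
    return {"answer": result["answer"], "score": result["score"], "context": context}
```

A typical call chain would be `answer_question(build_index(split_into_chunks(extract_text("paper.pdf"))), "What method is used?")`, where `paper.pdf` is a placeholder file name. In a real app the QA pipeline and index would likely be created once and reused rather than rebuilt per question.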
- Streamlit: Frontend framework for building interactive web applications.
- LangChain: For text processing and chunking.
- HuggingFace: Provides pre-trained models for embeddings and question answering.
- FAISS: Vector storage and similarity search.
- PyPDF2: For reading and extracting text from PDF files.
- dotenv: For environment variable management.
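
To make "similarity search" concrete, here is a brute-force version in plain Python. It is not the FAISS API: FAISS performs the same kind of nearest-neighbour lookup with optimized index structures, and the metric it uses depends on how the index is configured. Cosine similarity and the toy two-dimensional vectors below are chosen purely for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "chunk embeddings"; real embeddings from all-MiniLM-L6-v2 have 384 dimensions.
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_vector = [1.0, 0.1]

# Chunks 0 and 2 point in nearly the same direction as the query, so they rank first.
print(top_k(query_vector, chunk_vectors, k=2))
```

The brute-force scan compares the query against every stored vector, which is fine for a single PDF but grows linearly with the number of chunks.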
- app.py:
  - The main script that implements the application's logic.
  - Defines the UI, PDF processing pipeline, embeddings generation, and QA pipeline.
- requirements.txt:
  - Contains the list of dependencies required to run the application.
Follow these steps to set up and run the project locally:
- Clone the repository:

      git clone https://github.com/mahesh-diwan/chat-with-pdf.git
      cd chat-with-pdf

- Create a virtual environment:

      python -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install the required packages:

      pip install -r requirements.txt

- Run the application:

      streamlit run app.py

- Open the app in your browser at http://localhost:8501.
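
If the app fails to start after installation, a quick check like the one below can show which dependencies did not install into the active virtual environment. The module names are assumptions inferred from the libraries this README mentions (for example, `faiss` is the import name typically provided by the `faiss-cpu` package); the actual contents of requirements.txt may differ.

```python
import importlib.util
from typing import Iterable, List

# Assumed import names, based on the libraries named in this README.
ASSUMED_MODULES = ["streamlit", "langchain", "transformers", "faiss", "PyPDF2", "dotenv"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(ASSUMED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All assumed dependencies are importable.")
```

Run it with the virtual environment activated so that it inspects the same interpreter that `streamlit run app.py` will use.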
- Upload a PDF File:
  - Use the "Upload your PDF file" button to upload a document.
- Ask Questions:
  - Enter your question in the text box, and the app will retrieve relevant text chunks and provide an answer.
- View Results:
  - The answer will be displayed along with any supporting context.
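
The upload, ask, and view flow above maps onto a handful of Streamlit widgets. The following is a hedged sketch of how such a page could be wired, not the project's actual app.py: `render_app`, `format_answer`, and the `answer_fn` callback are hypothetical names, and every label except "Upload your PDF file" is an assumption.

```python
from typing import Callable, Dict


def format_answer(result: Dict[str, str]) -> str:
    """Render the answer and its supporting context as a Markdown string."""
    return (
        f"**Answer:** {result['answer']}\n\n"
        f"**Supporting context:**\n\n{result['context']}"
    )


def render_app(answer_fn: Callable[[object, str], Dict[str, str]]) -> None:
    """Draw the page; answer_fn(uploaded_file, question) is a hypothetical QA callback."""
    import streamlit as st

    st.title("Chat With PDF")
    uploaded = st.file_uploader("Upload your PDF file", type="pdf")
    question = st.text_input("Ask a question about the PDF")

    if uploaded is None:
        st.info("Upload a PDF to get started.")
        return

    if question:
        # answer_fn is expected to return a dict with "answer" and "context" keys.
        result = answer_fn(uploaded, question)
        st.markdown(format_answer(result))
```

Streamlit reruns the script from top to bottom on every interaction, so a script using this sketch would call `render_app(...)` at module level and would likely cache expensive objects such as models between reruns.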
Example use cases:
- Upload a research paper and ask specific questions about methodologies or results.
- Analyze long reports by querying particular sections without manually searching through the document.
Possible future enhancements:
- Add support for multiple file formats (e.g., Word, Excel).
- Optimize embeddings for faster performance with larger documents.
- Enhance UI/UX with additional features like visualization of search results.
Author: Mahesh
Contact: Feel free to reach out for any queries or suggestions.