This project implements a document question-answering system that allows users to ask questions about the content of PDF documents. The system uses LangChain, OpenAI embeddings, and a Pinecone vector database to enable semantic search and intelligent responses.
- PDF document loading and processing
- Text chunking for efficient processing
- Semantic search capabilities
- Question-answering using GPT-3.5 Turbo
- Vector storage using Pinecone
- Support for both local and online PDF files
- Python 3.8+
- OpenAI API key
- Pinecone API key and environment
- Required Python packages (see Requirements section)
- Clone this repository:

```bash
git clone https://github.com/frrobledo/BookGPT.git
cd BookGPT
```

- Install the required packages:

```bash
pip install langchain openai pinecone-client unstructured pyyaml
```
- Create a `config.yaml` file in the root directory with your API credentials:

```yaml
openai_key: "your-openai-api-key"
pinecone_key: "your-pinecone-api-key"
pinecone_env: "your-pinecone-environment"
index_name: "your-pinecone-index-name"
```
- Place your PDF document in the `data/` directory, or use an online PDF URL.
- Run the Jupyter notebook:

```bash
jupyter notebook "Ask A Book Questions.ipynb"
```
- Follow the notebook cells to:
- Load and process your PDF document
- Create text chunks
- Generate embeddings
- Store vectors in Pinecone
- Ask questions about your document
- Document Loading: The system uses `UnstructuredPDFLoader` to load PDF documents, supporting both local and online files.
- Text Processing: The document is split into smaller chunks using `RecursiveCharacterTextSplitter` for more efficient processing.
- Embedding Generation: OpenAI's embedding model converts each text chunk into a vector representation.
- Vector Storage: The embeddings are stored in a Pinecone vector database for efficient similarity search.
- Question Answering: When a question is asked, the system:
  - Finds relevant text chunks using semantic search
  - Passes those chunks, together with the question, to GPT-3.5 Turbo
  - Returns a coherent answer based on the document content
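The retrieval step above can be illustrated with a toy, self-contained sketch: embed each chunk and the query as vectors, then rank chunks by cosine similarity. (The real system uses OpenAI embeddings and Pinecone; the bag-of-words embedding here is just a stand-in to show the mechanics.)

```python
import math

def embed(text, vocab):
    # Toy bag-of-words embedding over a fixed vocabulary.
    # (The real system uses OpenAI's embedding model instead.)
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "monotonicity is preserved under composition of monotone maps",
    "the dataset was collected over a six month period",
    "we prove monotonicity of the loss with respect to the margin",
]
# Shared vocabulary so every vector has the same dimensions.
vocab = sorted({w for c in chunks for w in c.split()})
store = [(c, embed(c, vocab)) for c in chunks]  # stand-in for the vector store

def similarity_search(query, k=2):
    q = embed(query, vocab)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

top = similarity_search("what does the paper say about monotonicity")
```

The two chunks that mention monotonicity score highest, so only they are passed on to the language model, keeping the prompt small and on-topic.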
Key configuration parameters in the notebook:

- `chunk_size`: Size of each text chunk (default: 1000 characters)
- `chunk_overlap`: Overlap between consecutive chunks (default: 0 characters)
- `temperature`: Controls randomness in GPT-3.5 responses (default: 0)
- `max_tokens`: Maximum length of generated responses (default: 2104)
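To see how `chunk_size` and `chunk_overlap` interact, here is a minimal fixed-size character splitter. This is a simplification, not LangChain's implementation: `RecursiveCharacterTextSplitter` also tries to break on paragraph and sentence boundaries, which this sketch ignores.

```python
def split_text(text, chunk_size=1000, chunk_overlap=0):
    # Each chunk starts chunk_size - chunk_overlap characters after the
    # previous one, so consecutive chunks share chunk_overlap characters.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "abcdefghij" * 30  # 300-character stand-in for a PDF's extracted text
chunks = split_text(doc, chunk_size=100, chunk_overlap=20)
```

With a 300-character document this yields four chunks, and the last 20 characters of each chunk repeat at the start of the next. A non-zero overlap helps when an answer spans a chunk boundary, at the cost of embedding some text twice.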
```python
query = "Enumerate all that is said about monotonicity in this paper"
docs = docsearch.similarity_search(query, include_metadata=True)
chain.run(input_documents=docs, question=query)
```
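Conceptually, a "stuff"-style QA chain like the one above concatenates the retrieved chunks and the question into a single prompt for the model. A simplified sketch of that prompt assembly (the exact template LangChain uses differs):

```python
def build_prompt(docs, question):
    # Concatenate retrieved chunks into the context, then append the question:
    # a simplified version of a "stuff"-style QA prompt.
    context = "\n\n".join(docs)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

docs = ["Chunk one about monotonicity.", "Chunk two with more detail."]
prompt = build_prompt(docs, "Enumerate all that is said about monotonicity")
```

Because only the top-matching chunks are stuffed into the prompt, the model answers from the document's content rather than from its general training data.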
- langchain
- openai
- pinecone-client
- unstructured
- pyyaml
- jupyter
- Keep your `config.yaml` file secure and never commit it to version control
- Add `config.yaml` to your `.gitignore` file
- Monitor your API usage to manage costs
Contributions are welcome! Please feel free to submit a Pull Request.
- LangChain for the document processing framework
- OpenAI for embeddings and language model
- Pinecone for vector storage capabilities