This project implements a document question-answering system that allows users to ask questions about the content of PDF documents. The system uses LangChain, OpenAI embeddings, and a Pinecone vector database to enable semantic search and intelligent responses.
- PDF document loading and processing
- Text chunking for efficient processing
- Semantic search capabilities
- Question-answering using GPT-3.5 Turbo
- Vector storage using Pinecone
- Support for both local and online PDF files
- Python 3.8+
- OpenAI API key
- Pinecone API key and environment
- Required Python packages (see Requirements section)
- Clone this repository:

```bash
git clone https://github.com/frrobledo/BookGPT.git
cd BookGPT
```

- Install the required packages:

```bash
pip install langchain openai pinecone-client unstructured pyyaml
```
- Create a `config.yaml` file in the root directory with your API credentials:

```yaml
openai_key: "your-openai-api-key"
pinecone_key: "your-pinecone-api-key"
pinecone_env: "your-pinecone-environment"
index_name: "your-pinecone-index-name"
```
- Place your PDF document in the `data/` directory, or use an online PDF URL.
- Run the Jupyter notebook:

```bash
jupyter notebook "Ask A Book Questions.ipynb"
```
- Follow the notebook cells to:
- Load and process your PDF document
- Create text chunks
- Generate embeddings
- Store vectors in Pinecone
- Ask questions about your document
- Document Loading: The system uses `UnstructuredPDFLoader` to load PDF documents, supporting both local and online files.
- Text Processing: The document is split into smaller chunks using `RecursiveCharacterTextSplitter` for more efficient processing.
- Embedding Generation: OpenAI's embedding model converts each text chunk into a vector representation.
- Vector Storage: The embeddings are stored in a Pinecone vector database for efficient similarity search.
- Question Answering: When a question is asked, the system:
  - Finds relevant text chunks using semantic search
  - Passes those chunks, together with the question, to GPT-3.5 Turbo
  - Returns a coherent answer based on the document content
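The retrieval step above can be illustrated with a toy, self-contained sketch: embed each chunk and the query as vectors, then rank chunks by cosine similarity. (The real system uses OpenAI embeddings and Pinecone; the bag-of-words embedding here is just a stand-in to show the mechanics.)

```python
import math

def embed(text, vocab):
    # Toy bag-of-words embedding over a fixed vocabulary.
    # (The real system uses OpenAI's embedding model instead.)
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "monotonicity is preserved under composition of monotone maps",
    "the dataset was collected over a six month period",
    "we prove monotonicity of the loss with respect to the margin",
]
# Shared vocabulary so every vector has the same dimensions.
vocab = sorted({w for c in chunks for w in c.split()})
store = [(c, embed(c, vocab)) for c in chunks]  # stand-in for the vector store

def similarity_search(query, k=2):
    q = embed(query, vocab)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

top = similarity_search("what does the paper say about monotonicity")
```

The two chunks that mention monotonicity score highest, so only they are passed on to the language model, keeping the prompt small and on-topic.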
Key configuration parameters in the notebook:

- `chunk_size`: Size of each text chunk (default: 1000 characters)
- `chunk_overlap`: Overlap between consecutive chunks (default: 0 characters)
- `temperature`: Controls randomness in GPT-3.5 responses (default: 0)
- `max_tokens`: Maximum length of generated responses (default: 2104)
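To see how `chunk_size` and `chunk_overlap` interact, here is a minimal fixed-size character splitter. This is a simplification, not LangChain's implementation: `RecursiveCharacterTextSplitter` also tries to break on paragraph and sentence boundaries, which this sketch ignores.

```python
def split_text(text, chunk_size=1000, chunk_overlap=0):
    # Each chunk starts chunk_size - chunk_overlap characters after the
    # previous one, so consecutive chunks share chunk_overlap characters.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "abcdefghij" * 30  # 300-character stand-in for a PDF's extracted text
chunks = split_text(doc, chunk_size=100, chunk_overlap=20)
```

With a 300-character document this yields four chunks, and the last 20 characters of each chunk repeat at the start of the next. A non-zero overlap helps when an answer spans a chunk boundary, at the cost of embedding some text twice.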
```python
query = "Enumerate all that is said about monotonicity in this paper"
docs = docsearch.similarity_search(query, include_metadata=True)
chain.run(input_documents=docs, question=query)
```
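Conceptually, a "stuff"-style QA chain like the one above concatenates the retrieved chunks and the question into a single prompt for the model. A simplified sketch of that prompt assembly (the exact template LangChain uses differs):

```python
def build_prompt(docs, question):
    # Concatenate retrieved chunks into the context, then append the question:
    # a simplified version of a "stuff"-style QA prompt.
    context = "\n\n".join(docs)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

docs = ["Chunk one about monotonicity.", "Chunk two with more detail."]
prompt = build_prompt(docs, "Enumerate all that is said about monotonicity")
```

Because only the top-matching chunks are stuffed into the prompt, the model answers from the document's content rather than from its general training data.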
- langchain
- openai
- pinecone-client
- unstructured
- pyyaml
- jupyter
- Keep your `config.yaml` file secure and never commit it to version control
- Add `config.yaml` to your `.gitignore` file
- Monitor your API usage to manage costs
Contributions are welcome! Please feel free to submit a Pull Request.
- LangChain for the document processing framework
- OpenAI for embeddings and language model
- Pinecone for vector storage capabilities