**Security Warning:** Never share your `.env` file or API keys. The `.env` file is gitignored by default, and sensitive credentials should always be kept private.
This project is a modular, well-documented implementation of the LangChain "Chat With Your Data" tutorial. Each step is a separate script, so you can learn and experiment with each concept locally.
## Features

- **Document Loading:** Load data from PDFs, web pages, and (optionally) YouTube. See `src/load_documents.py`.
- **Text Splitting:** Break documents into manageable chunks using different splitters. See `src/split_text.py`.
- **Embeddings:** Convert text to vector representations and compare semantic similarity. See `src/embeddings.py`.
- **Vector Stores:** Store and retrieve document embeddings efficiently with Chroma. See `src/vector_store.py`.
- **Question Answering:** Build QA chains to answer questions about your documents, with a custom prompt. See `src/qa_chain.py`.
## Prerequisites

- Python 3.9+
- An OpenAI API key (add to `.env`)
- (Optional) A PDF file at `data/test.pdf` for the PDF loading demo
- LangChain v0.1+ and `langchain_community`
## Setup

- Clone this repo and `cd` into it.
- Copy `.env.example` to `.env` and add your OpenAI API key and any other required environment variables.
- (Optional) Place a PDF at `data/test.pdf` for PDF loading.
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- (Recommended) Install and run ruff for linting and uv for dependency management:

  ```bash
  pip install ruff uv
  ruff check src/

  # (Optional) Compile requirements.txt from requirements.in
  uv pip compile requirements.in --output-file requirements.txt
  ```
## Usage

Scripts must be run in order, as each step saves output for the next. All scripts use `utils.py` to load environment variables from `.env` (via `python-dotenv`).
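For reference, the environment loading boils down to a few lines of `python-dotenv`. A minimal sketch (the helper name `load_env` is an assumption; see `src/utils.py` for the actual implementation):

```python
# Sketch of an env-loading helper; assumes python-dotenv is installed.
import os

from dotenv import load_dotenv


def load_env() -> str:
    """Load variables from .env and return the OpenAI API key."""
    load_dotenv()  # reads .env from the project root
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY not set in .env file.")
    return api_key
```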
```bash
python src/load_documents.py   # Load and preview documents (saves pickles/docs.pkl)
```

- Loads PDF and web documents and prints a preview (sketched below).
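A rough sketch of the loading step, assuming the `langchain_community` loaders and a placeholder URL (the real sources live in the script):

```python
import pickle
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader

# Load a local PDF (one Document per page) plus a web page.
docs = PyPDFLoader("data/test.pdf").load()
docs += WebBaseLoader("https://example.com").load()  # placeholder URL

print(docs[0].page_content[:200])  # preview

# Persist for the next step (split_text.py).
Path("pickles").mkdir(exist_ok=True)
with open("pickles/docs.pkl", "wb") as f:
    pickle.dump(docs, f)
```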
```bash
python src/split_text.py   # Split documents into chunks (saves pickles/splits.pkl)
```

- Splits documents using `RecursiveCharacterTextSplitter` and saves the result (sketched below).
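The split step follows the standard LangChain pattern; the chunk sizes here are illustrative assumptions, not necessarily the script's values:

```python
import pickle

from langchain.text_splitter import RecursiveCharacterTextSplitter

with open("pickles/docs.pkl", "rb") as f:
    docs = pickle.load(f)

# Recursively split on paragraphs, then sentences, then words.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
splits = splitter.split_documents(docs)
print(f"{len(docs)} documents -> {len(splits)} chunks")

with open("pickles/splits.pkl", "wb") as f:
    pickle.dump(splits, f)
```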
```bash
python src/embeddings.py   # Generate and compare embeddings
```

- Generates OpenAI embeddings, compares semantic similarity, and embeds document chunks (sketched below).
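Similarity between embeddings is typically measured with a dot product (OpenAI embeddings are unit-length, so this equals cosine similarity). A sketch with illustrative sentences, assuming `numpy` is available:

```python
import numpy as np
from langchain_community.embeddings import OpenAIEmbeddings

embedding = OpenAIEmbeddings()

v1 = np.array(embedding.embed_query("i like dogs"))
v2 = np.array(embedding.embed_query("i like canines"))
v3 = np.array(embedding.embed_query("the weather is ugly outside"))

# Unit-length vectors: the dot product is the cosine similarity.
print(np.dot(v1, v2))  # semantically close -> higher score
print(np.dot(v1, v3))  # unrelated -> lower score
```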
```bash
python src/vector_store.py   # Create and query a vector store (saves to database/)
```

- Loads splits, creates a Chroma vector store, runs a sample query, and persists the DB automatically in the `database/` directory (sketched below).
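The core of this step is a single constructor call; a sketch with a hypothetical sample query:

```python
import pickle

from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

with open("pickles/splits.pkl", "rb") as f:
    splits = pickle.load(f)

# Embeds every chunk and persists the index under database/.
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=OpenAIEmbeddings(),
    persist_directory="database/",
)

# Retrieve the k most similar chunks to a query.
for doc in vectordb.similarity_search("What is this about?", k=3):
    print(doc.page_content[:100])
```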
```bash
python src/qa_chain.py   # Run a question-answering chain
```

- Loads the Chroma vector store from `database/`, sets up a custom prompt, and answers a sample question using a RetrievalQA chain (sketched below).
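Wiring this up follows the standard `RetrievalQA` pattern. A sketch in which the prompt wording and sample question are assumptions (see the script for the real ones):

```python
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.chat_models import ChatOpenAI
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Reopen the vector store persisted by the previous step.
vectordb = Chroma(
    persist_directory="database/",
    embedding_function=OpenAIEmbeddings(),
)

# A custom prompt constraining answer style (illustrative wording).
template = """Use the following context to answer the question.
If you don't know the answer, say so; don't make one up.

{context}

Question: {question}
Helpful Answer:"""

qa_chain = RetrievalQA.from_chain_type(
    ChatOpenAI(temperature=0),
    retriever=vectordb.as_retriever(),
    chain_type_kwargs={"prompt": PromptTemplate.from_template(template)},
)

print(qa_chain.invoke({"query": "What is this document about?"})["result"])
```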
Each script is commented for learning. See the source for details and experiment with your own data!
## Outputs

- Intermediate outputs are saved in the `pickles/` directory (e.g., `docs.pkl`, `splits.pkl`).
- The persistent vector store is saved in the `database/` directory (Chroma DB and related files).
- Both `pickles/` and `database/` are gitignored and safe to delete if you want to reset the workflow.
## Customization

- **Prompts:** You can edit the prompt in `src/qa_chain.py` to change the style or constraints of the answers.
- **Document Sources:** Add more loaders in `src/load_documents.py` as needed (see the LangChain docs for options; one example follows this list).
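For instance, the optional YouTube loading mentioned under Features could be wired in with `YoutubeLoader` (a sketch; the video URL is a placeholder and the `youtube-transcript-api` package is assumed to be installed):

```python
from langchain_community.document_loaders import YoutubeLoader

# Placeholder URL; fetches the video's transcript as Documents.
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=XXXXXXXXXXX",
)
docs = loader.load()
```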
## Troubleshooting

- If you see `OPENAI_API_KEY not set in .env file.`, check your `.env` file.
- If you get file-not-found errors, ensure you ran the previous step and the required files exist.
- For PDF loading, make sure `data/test.pdf` exists.
- The Chroma DB is automatically persisted on any change (no need to call `persist` manually).
## Credits

Inspired by DeepLearning.AI's *LangChain: Chat With Your Data* course.