- Create a copy of `.env.example`, rename it to `.env`, and fill in the values (see the example below).
- Make sure you have Poetry installed: `pip install poetry`
- Run `poetry install` to install the dependencies.
- Install the `shell` plugin: `poetry self add poetry-plugin-shell`
- Start the shell: `poetry shell`
- Run the app:

```bash
python ./src/agent.py --query "what can you tell me about the data?" --excel-path "./src/assets/AVERAGE EXPENDITURES FOR AUTO INSURANCE, UNITED STATES, 1998-2007.xls" --debug
```
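A minimal `.env` might look like the following. The variable names come from this project's code and prerequisites (`NEON_CONNECTION_STRING` is read in the vector store task below; `OPENAI_API_KEY` is only needed if you use OpenAI embeddings); the values are placeholders.

```
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
NEON_CONNECTION_STRING=postgresql+psycopg://user:password@host/dbname?sslmode=require
```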
In this session, you'll build an AI-powered data analysis pipeline that processes and analyzes insurance data from Excel files. The system will use LangChain for orchestration, pgvector for document storage, and Claude for advanced analysis.
- Session Duration: 1.5 hours (90 minutes)
- Additional Work Time Required: 4-6 hours
- Difficulty Level: Intermediate
- Python 3.12
- Basic understanding of:
- LangChain framework
- Vector databases
- RAG (Retrieval Augmented Generation)
- Agents and Tools in LangChain
- Completed previous session on data collection
- Python packages:
```bash
pip install langchain langchain-core langchain-community langchain-anthropic langchain-postgres python-dotenv "psycopg[binary]" openai langchain-openai anthropic
```
(The quotes around `psycopg[binary]` keep shells like zsh from expanding the brackets.)
- API Keys & Services:
- Anthropic API key (for Claude)
- neon.tech account (free tier works)
- Your Excel files from previous session
Instructor-Led Discussion:
Independent Work: Build a document processing pipeline using LangChain's built-in tools:
- Load Excel files using CSVLoader or UnstructuredExcelLoader
- Split documents using RecursiveCharacterTextSplitter
- Create embeddings (note: Anthropic does not offer an embeddings API, so use another provider, e.g. OpenAI's via the installed `langchain-openai` package; see the sketch below)
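A minimal sketch of this pipeline, assuming `OpenAIEmbeddings` as the embedding model and a placeholder file path (`UnstructuredExcelLoader` also requires the `unstructured` package, which is not in the install list above):

```python
# Sketch: load one Excel file, split it into chunks, and embed the chunks.
from dotenv import load_dotenv
from langchain_community.document_loaders import UnstructuredExcelLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

load_dotenv()  # picks up OPENAI_API_KEY from .env

docs = UnstructuredExcelLoader("./src/assets/your_file.xls").load()  # path: placeholder
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
vectors = OpenAIEmbeddings().embed_documents([c.page_content for c in chunks])
print(f"{len(chunks)} chunks, {len(vectors[0])} dimensions per embedding")
```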
Checkpoint Discussion (5 mins):
- Common issues with document processing
- Best practices for chunking data
Independent Work:
- Set up neon.tech database (a quick connectivity check follows this list)
- Implement vector store integration
- Test document storage and retrieval
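Before wiring up LangChain, it can help to confirm the database is reachable and has pgvector enabled. A sketch, assuming `NEON_CONNECTION_STRING` is set as in the `.env` example above:

```python
# Sketch: verify Neon connectivity and enable the pgvector extension.
import os

import psycopg
from dotenv import load_dotenv

load_dotenv()

# psycopg wants a plain postgresql:// URL, so strip any SQLAlchemy driver suffix.
url = os.environ["NEON_CONNECTION_STRING"].replace("postgresql+psycopg://", "postgresql://")
with psycopg.connect(url) as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    print(conn.execute("SELECT version();").fetchone()[0])
```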
Final Discussion (10 mins):
- Progress review
- Introduction to homework tasks
- Q&A
Implement the `DocumentProcessor` class (the imports below use the current split packages; the old `langchain.document_loaders` and `langchain.schema` paths are deprecated):

```python
from langchain_community.document_loaders import UnstructuredExcelLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter


class DocumentProcessor:
    def __init__(self):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
        )

    def load_excel(self, file_path: str) -> list[Document]:
        """
        TODO: Implement Excel loading using UnstructuredExcelLoader.
        Expected output: list of Document objects.
        """
        pass

    def process_documents(self, documents: list[Document]) -> list[Document]:
        """
        TODO: Implement document processing:
        1. Split documents using self.text_splitter
        2. Extract metadata (dates, regions, insurance types)
        3. Return processed documents
        """
        pass
```
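For orientation, `process_documents` might look something like this sketch. The year-tagging heuristic is an assumption for illustration; real metadata extraction will depend on your files:

```python
import re

def process_documents(self, documents: list[Document]) -> list[Document]:
    # Split first, then attach metadata to each chunk.
    chunks = self.text_splitter.split_documents(documents)
    for chunk in chunks:
        # Hypothetical heuristic: record any 4-digit years found in the text.
        years = re.findall(r"\b(?:19|20)\d{2}\b", chunk.page_content)
        if years:
            chunk.metadata["years"] = sorted(set(years))
    return chunks
```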
Build the vector store integration (note the `os` import, which the original snippet relied on, and the `langchain_postgres` package, which replaces the deprecated `langchain.vectorstores.PGVector`):

```python
import os

from langchain_core.documents import Document
from langchain_postgres import PGVector


class VectorStore:
    def __init__(self):
        self.connection_string = os.getenv("NEON_CONNECTION_STRING")

    def init_store(self):
        """
        TODO: Initialize the PGVector store.
        Use LangChain's built-in PGVector implementation.
        """
        pass

    def add_documents(self, documents: list[Document]):
        """
        TODO: Add documents to the vector store.
        Include proper error handling.
        """
        pass
```
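A sketch of what `init_store` could look like. The embedding model and collection name are assumptions; `connection` expects the `postgresql+psycopg://` form of the URL:

```python
from langchain_openai import OpenAIEmbeddings
from langchain_postgres import PGVector

def init_store(self):
    # Collection name is arbitrary; PGVector creates its tables on first use.
    self.store = PGVector(
        embeddings=OpenAIEmbeddings(),     # assumption: OpenAI as embedding provider
        collection_name="insurance_docs",  # assumption: any name works
        connection=self.connection_string,
    )
    return self.store
```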
Build an agent that can analyze the insurance data:

```python
from langchain.agents import initialize_agent, Tool
from langchain.chains import RetrievalQAWithSourcesChain


class InsuranceAnalysisAgent:
    def __init__(self, vector_store):
        """
        TODO: Initialize the agent with:
        1. RetrievalQAWithSourcesChain for document querying
        2. Tools for data analysis
        3. Claude as the base LLM
        """
        pass

    def analyze_trends(self, query: str) -> str:
        """
        TODO: Implement trend analysis using the agent.
        Should handle queries like:
        - Compare regional insurance costs
        - Analyze year-over-year changes
        - Identify cost factors
        """
        pass
```
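One possible shape for `__init__`, sketched under a few assumptions: the Claude model name is a placeholder, `vector_store` is assumed to expose `as_retriever()` (e.g., the PGVector instance from the previous task), and `initialize_agent` is deprecated in newer LangChain releases in favor of LangGraph, though it still works:

```python
from langchain.agents import AgentType, initialize_agent, Tool
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_anthropic import ChatAnthropic

def __init__(self, vector_store):
    llm = ChatAnthropic(model="claude-3-5-sonnet-latest")  # model name: assumption
    # Retrieval chain that returns answers along with their source documents.
    qa_chain = RetrievalQAWithSourcesChain.from_chain_type(
        llm=llm,
        retriever=vector_store.as_retriever(),
    )
    tools = [
        Tool(
            name="insurance_docs",
            func=lambda q: qa_chain.invoke({"question": q}),
            description="Answer questions about the insurance dataset, with sources.",
        )
    ]
    self.agent = initialize_agent(
        tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
    )
```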
Document Processing:
- Successfully loads all Excel files
- Creates appropriate chunk sizes
- Extracts relevant metadata
- Handles errors gracefully
Performance metric: Process files in under 5 minutes

Vector Store:
- Properly stores and retrieves documents
- Maintains document metadata
- Performs efficient similarity search
Performance metric: Retrieves results in under 2 seconds

Analysis Agent:
- Implements all required tools
- Provides sourced responses
- Handles complex analytical queries
Performance metric: Response generation under 10 seconds
Code Quality:
- Clear documentation
- Proper error handling
- Follows Python best practices
- Efficient use of LangChain tools

Accuracy:
- Correct interpretation of data
- Accurate trend identification
- Proper source attribution
- Factual correctness in responses

Depth of Analysis:
- Thorough analysis of trends
- Consideration of multiple factors
- Detailed explanations
- Meaningful insights

Creativity:
- Novel approaches to analysis
- Interesting connections in data
- Unique insights
- Creative visualization suggestions

User Experience:
- Clear response format
- Intuitive query handling
- Appropriate level of detail
- Helpful error messages

Flexibility:
- Handles various query types
- Adapts to different analysis needs
- Flexible response formats
- Graceful handling of edge cases
Test your system with these queries:

```python
queries = [
    "What's the trend in auto insurance costs over the last 3 years?",
    "Compare insurance costs between different regions",
    "What factors most influence insurance costs?",
    "Generate a summary of key findings from the data",
]
```
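To run them end to end (a sketch; assumes the `InsuranceAnalysisAgent` and vector store from the tasks above):

```python
agent = InsuranceAnalysisAgent(vector_store)  # vector_store: your PGVector setup
for q in queries:
    print(f"Q: {q}")
    print(f"A: {agent.analyze_trends(q)}\n")
```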
- Create a GitHub repository with your implementation
- Include:
  - All code files
  - requirements.txt (or your Poetry files: pyproject.toml and poetry.lock)
  - Setup instructions
  - Example queries and outputs
- Submit by: [Date]