- Create a copy of `.env.example`, rename it to `.env`, and fill in the values (see the example below).
- Make sure you have Poetry installed: `pip install poetry`
- Run `poetry install` to install the dependencies.
- Install the `shell` plugin: `poetry self add poetry-plugin-shell`
- Start the shell: `poetry shell`
- Run the app:

```bash
python ./src/agent.py --query "what can you tell me about the data?" --excel-path "./src/assets/AVERAGE EXPENDITURES FOR AUTO INSURANCE, UNITED STATES, 1998-2007.xls" --debug
```
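A minimal `.env` might look like the following. The variable names come from this project's code and prerequisites (`NEON_CONNECTION_STRING` is read in the vector store task below; `OPENAI_API_KEY` is only needed if you use OpenAI embeddings); the values are placeholders.

```
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
NEON_CONNECTION_STRING=postgresql+psycopg://user:password@host/dbname?sslmode=require
```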
In this session, you'll build an AI-powered data analysis pipeline that processes and analyzes insurance data from Excel files. The system will use LangChain for orchestration, pgvector for document storage, and Claude for advanced analysis.
- Session Duration: 1.5 hours (90 minutes)
- Additional Work Time Required: 4-6 hours
- Difficulty Level: Intermediate
- Python 3.12
- Basic understanding of:
- LangChain framework
- Vector databases
- RAG (Retrieval Augmented Generation)
- Agents and Tools in LangChain
- Completed previous session on data collection
- Python packages:
```bash
pip install langchain langchain-core langchain-community langchain-anthropic langchain-postgres python-dotenv "psycopg[binary]" openai langchain-openai anthropic
```
(The quotes around `psycopg[binary]` keep shells like zsh from expanding the brackets.)
- API Keys & Services:
- Anthropic API key (for Claude)
- neon.tech account (free tier works)
- Your Excel files from previous session
Instructor-Led Discussion:
Independent Work: Build a document processing pipeline using LangChain's built-in tools:
- Load Excel files using CSVLoader or UnstructuredExcelLoader
- Split documents using RecursiveCharacterTextSplitter
- Create embeddings (note: Anthropic does not offer an embeddings API, so use another provider, e.g. OpenAI's via the installed `langchain-openai` package; see the sketch below)
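A minimal sketch of this pipeline, assuming `OpenAIEmbeddings` as the embedding model and a placeholder file path (`UnstructuredExcelLoader` also requires the `unstructured` package, which is not in the install list above):

```python
# Sketch: load one Excel file, split it into chunks, and embed the chunks.
from dotenv import load_dotenv
from langchain_community.document_loaders import UnstructuredExcelLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

load_dotenv()  # picks up OPENAI_API_KEY from .env

docs = UnstructuredExcelLoader("./src/assets/your_file.xls").load()  # path: placeholder
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
vectors = OpenAIEmbeddings().embed_documents([c.page_content for c in chunks])
print(f"{len(chunks)} chunks, {len(vectors[0])} dimensions per embedding")
```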
Checkpoint Discussion (5 mins):
- Common issues with document processing
- Best practices for chunking data
Independent Work:
- Set up neon.tech database (a quick connectivity check follows this list)
- Implement vector store integration
- Test document storage and retrieval
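Before wiring up LangChain, it can help to confirm the database is reachable and has pgvector enabled. A sketch, assuming `NEON_CONNECTION_STRING` is set as in the `.env` example above:

```python
# Sketch: verify Neon connectivity and enable the pgvector extension.
import os

import psycopg
from dotenv import load_dotenv

load_dotenv()

# psycopg wants a plain postgresql:// URL, so strip any SQLAlchemy driver suffix.
url = os.environ["NEON_CONNECTION_STRING"].replace("postgresql+psycopg://", "postgresql://")
with psycopg.connect(url) as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    print(conn.execute("SELECT version();").fetchone()[0])
```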
Final Discussion (10 mins):
- Progress review
- Introduction to homework tasks
- Q&A
Implement the `DocumentProcessor` class (the imports below use the current split packages; the old `langchain.document_loaders` and `langchain.schema` paths are deprecated):

```python
from langchain_community.document_loaders import UnstructuredExcelLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter


class DocumentProcessor:
    def __init__(self):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
        )

    def load_excel(self, file_path: str) -> list[Document]:
        """
        TODO: Implement Excel loading using UnstructuredExcelLoader.
        Expected output: list of Document objects.
        """
        pass

    def process_documents(self, documents: list[Document]) -> list[Document]:
        """
        TODO: Implement document processing:
        1. Split documents using self.text_splitter
        2. Extract metadata (dates, regions, insurance types)
        3. Return processed documents
        """
        pass
```
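For orientation, `process_documents` might look something like this sketch. The year-tagging heuristic is an assumption for illustration; real metadata extraction will depend on your files:

```python
import re

def process_documents(self, documents: list[Document]) -> list[Document]:
    # Split first, then attach metadata to each chunk.
    chunks = self.text_splitter.split_documents(documents)
    for chunk in chunks:
        # Hypothetical heuristic: record any 4-digit years found in the text.
        years = re.findall(r"\b(?:19|20)\d{2}\b", chunk.page_content)
        if years:
            chunk.metadata["years"] = sorted(set(years))
    return chunks
```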
Build the vector store integration (note the `os` import, which the original snippet relied on, and the `langchain_postgres` package, which replaces the deprecated `langchain.vectorstores.PGVector`):

```python
import os

from langchain_core.documents import Document
from langchain_postgres import PGVector


class VectorStore:
    def __init__(self):
        self.connection_string = os.getenv("NEON_CONNECTION_STRING")

    def init_store(self):
        """
        TODO: Initialize the PGVector store.
        Use LangChain's built-in PGVector implementation.
        """
        pass

    def add_documents(self, documents: list[Document]):
        """
        TODO: Add documents to the vector store.
        Include proper error handling.
        """
        pass
```
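A sketch of what `init_store` could look like. The embedding model and collection name are assumptions; `connection` expects the `postgresql+psycopg://` form of the URL:

```python
from langchain_openai import OpenAIEmbeddings
from langchain_postgres import PGVector

def init_store(self):
    # Collection name is arbitrary; PGVector creates its tables on first use.
    self.store = PGVector(
        embeddings=OpenAIEmbeddings(),     # assumption: OpenAI as embedding provider
        collection_name="insurance_docs",  # assumption: any name works
        connection=self.connection_string,
    )
    return self.store
```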
Build an agent that can analyze the insurance data:

```python
from langchain.agents import initialize_agent, Tool
from langchain.chains import RetrievalQAWithSourcesChain


class InsuranceAnalysisAgent:
    def __init__(self, vector_store):
        """
        TODO: Initialize the agent with:
        1. RetrievalQAWithSourcesChain for document querying
        2. Tools for data analysis
        3. Claude as the base LLM
        """
        pass

    def analyze_trends(self, query: str) -> str:
        """
        TODO: Implement trend analysis using the agent.
        Should handle queries like:
        - Compare regional insurance costs
        - Analyze year-over-year changes
        - Identify cost factors
        """
        pass
```
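One possible shape for `__init__`, sketched under a few assumptions: the Claude model name is a placeholder, `vector_store` is assumed to expose `as_retriever()` (e.g., the PGVector instance from the previous task), and `initialize_agent` is deprecated in newer LangChain releases in favor of LangGraph, though it still works:

```python
from langchain.agents import AgentType, initialize_agent, Tool
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_anthropic import ChatAnthropic

def __init__(self, vector_store):
    llm = ChatAnthropic(model="claude-3-5-sonnet-latest")  # model name: assumption
    # Retrieval chain that returns answers along with their source documents.
    qa_chain = RetrievalQAWithSourcesChain.from_chain_type(
        llm=llm,
        retriever=vector_store.as_retriever(),
    )
    tools = [
        Tool(
            name="insurance_docs",
            func=lambda q: qa_chain.invoke({"question": q}),
            description="Answer questions about the insurance dataset, with sources.",
        )
    ]
    self.agent = initialize_agent(
        tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
    )
```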
Document Processing:
- Successfully loads all Excel files
- Creates appropriate chunk sizes
- Extracts relevant metadata
- Handles errors gracefully
Performance metric: Process files in under 5 minutes

Vector Store:
- Properly stores and retrieves documents
- Maintains document metadata
- Performs efficient similarity search
Performance metric: Retrieves results in under 2 seconds

Analysis Agent:
- Implements all required tools
- Provides sourced responses
- Handles complex analytical queries
Performance metric: Response generation under 10 seconds
Code Quality:
- Clear documentation
- Proper error handling
- Follows Python best practices
- Efficient use of LangChain tools

Accuracy:
- Correct interpretation of data
- Accurate trend identification
- Proper source attribution
- Factual correctness in responses

Depth of Analysis:
- Thorough analysis of trends
- Consideration of multiple factors
- Detailed explanations
- Meaningful insights

Creativity:
- Novel approaches to analysis
- Interesting connections in data
- Unique insights
- Creative visualization suggestions

User Experience:
- Clear response format
- Intuitive query handling
- Appropriate level of detail
- Helpful error messages

Flexibility:
- Handles various query types
- Adapts to different analysis needs
- Flexible response formats
- Graceful handling of edge cases
Test your system with these queries:

```python
queries = [
    "What's the trend in auto insurance costs over the last 3 years?",
    "Compare insurance costs between different regions",
    "What factors most influence insurance costs?",
    "Generate a summary of key findings from the data",
]
```
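To run them end to end (a sketch; assumes the `InsuranceAnalysisAgent` and vector store from the tasks above):

```python
agent = InsuranceAnalysisAgent(vector_store)  # vector_store: your PGVector setup
for q in queries:
    print(f"Q: {q}")
    print(f"A: {agent.analyze_trends(q)}\n")
```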
- Create a GitHub repository with your implementation
- Include:
  - All code files
  - requirements.txt (or your Poetry files: pyproject.toml and poetry.lock)
  - Setup instructions
  - Example queries and outputs
- Submit by: [Date]