
Building an Insurance Data Analysis Pipeline with LangChain

Running the Code

  • Create a copy of .env.example and rename it to .env, then fill in the values (a sample .env sketch follows this list).
  • Make sure you have Poetry installed: pip install poetry
  • Run poetry install to install the dependencies.
  • Install the shell plugin: poetry self add poetry-plugin-shell
  • Start the shell: poetry shell
  • Run the app: python ./src/agent.py --query "what can you tell me about the data?" --excel-path "./src/assets/AVERAGE EXPENDITURES FOR AUTO INSURANCE, UNITED STATES, 1998-2007.xls" --debug
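
The exact variable names come from .env.example; a typical .env for this setup, assuming an Anthropic key for Claude, an OpenAI key for embeddings, and the neon.tech connection string read by the code (NEON_CONNECTION_STRING), might look like:

ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
NEON_CONNECTION_STRING=postgresql+psycopg://USER:PASSWORD@YOUR-NEON-HOST/DBNAME

The placeholder values above are illustrative only; check .env.example for the exact names your copy expects.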

Session Guide & Homework Assignment

Overview

In this session, you'll build an AI-powered data analysis pipeline that processes and analyzes insurance data from Excel files. The system will use LangChain for orchestration, pgvector for document storage, and Claude for advanced analysis.

Session Duration: 1.5 hours (90 minutes)
Additional Work Time Required: 4-6 hours
Difficulty Level: Intermediate

Prerequisites

  • Python 3.12
  • Basic understanding of:
    • LangChain framework
    • Vector databases
    • RAG (Retrieval Augmented Generation)
    • Agents and Tools in LangChain
  • Completed previous session on data collection

Required Tools & Services

  1. Python packages:
pip install langchain langchain-core langchain-community langchain-anthropic langchain-postgres python-dotenv psycopg[binary] openai langchain-openai anthropic
  2. API Keys & Services:
  • Anthropic API key (for Claude)
  • neon.tech account (free tier works)
  • Your Excel files from previous session

Session Breakdown

Part 1: Setup & Overview (15 minutes)

Instructor-Led Discussion:

  • Architecture overview

    graph TD
        A[Excel Files] --> B[Document Processor]
        B --> C[Vector Database]
        C --> D[Analysis Agent]
        D --> E[Results/Analysis]
        F[Human Query] --> D
        B --> G[(Document Store)]
        G --> D

        classDef component fill:#f9f,stroke:#333,stroke-width:4px;
        class B,C,D component

  • Component interactions

  • Setting up the development environment

Part 2: Building Document Pipeline (30 minutes)

Independent Work: Build a document processing pipeline using LangChain's built-in tools:

  1. Load Excel files using UnstructuredExcelLoader (or CSVLoader, if you first export the sheets to CSV)
  2. Split documents using RecursiveCharacterTextSplitter
  3. Create embeddings for the chunks (Anthropic does not offer an embeddings API; the dependency list above includes langchain-openai, so OpenAI embeddings are a natural choice); a minimal sketch follows this list
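
A minimal sketch of these three steps, assuming the unstructured package is installed (UnstructuredExcelLoader depends on it) and that OpenAI embeddings stand in for the embedding step; the file path and chunk sizes are illustrative:

from langchain_community.document_loaders import UnstructuredExcelLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

# 1. Load one Excel file into Document objects (mode="elements" keeps table structure)
docs = UnstructuredExcelLoader("./src/assets/example.xls", mode="elements").load()

# 2. Split into overlapping chunks sized for retrieval
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# 3. Embed the chunk text; these vectors are what the vector store will index
embeddings = OpenAIEmbeddings()
vectors = embeddings.embed_documents([c.page_content for c in chunks])

In practice the vector store computes embeddings for you when you add documents; step 3 is shown only to make the embedding step explicit.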

Checkpoint Discussion (5 mins):

  • Common issues with document processing
  • Best practices for chunking data

Part 3: Vector Store Setup (30 minutes)

Independent Work:

  1. Set up neon.tech database
  2. Implement vector store integration (see the sketch after this list)
  3. Test document storage and retrieval
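
Once the neon.tech database exists, a sketch of the integration using the langchain_postgres PGVector class (the collection name is arbitrary; chunks and OpenAIEmbeddings come from the Part 2 sketch above):

import os
from langchain_postgres import PGVector
from langchain_openai import OpenAIEmbeddings

# Connect the vector store to the neon.tech Postgres instance (pgvector extension enabled)
store = PGVector(
    embeddings=OpenAIEmbeddings(),
    collection_name="insurance_docs",
    connection=os.getenv("NEON_CONNECTION_STRING"),
)

# Store the chunks, then verify retrieval with a quick similarity search
store.add_documents(chunks)
print(store.similarity_search("average auto insurance expenditure by year", k=3))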

Final Discussion (10 mins):

  • Progress review
  • Introduction to homework tasks
  • Q&A

Tasks (Continue at home)

Task 1: Complete the Document Pipeline

Implement the DocumentProcessor class:

from langchain_community.document_loaders import UnstructuredExcelLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

class DocumentProcessor:
    def __init__(self):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200
        )

    def load_excel(self, file_path: str) -> list[Document]:
        """
        TODO: Implement Excel loading using UnstructuredExcelLoader
        Expected output: List of Document objects
        """
        pass

    def process_documents(self, documents: list[Document]) -> list[Document]:
        """
        TODO: Implement document processing:
        1. Split documents using self.text_splitter
        2. Extract metadata (dates, regions, insurance types)
        3. Return processed documents
        """
        pass
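
One possible shape for these two methods, reusing the imports from the skeleton above (a sketch only; the metadata step uses a simple year regex as a stand-in for whatever dates, regions, or insurance types your files actually contain):

import re

def load_excel(self, file_path: str) -> list[Document]:
    # mode="elements" yields one Document per sheet element, which splits more cleanly
    return UnstructuredExcelLoader(file_path, mode="elements").load()

def process_documents(self, documents: list[Document]) -> list[Document]:
    chunks = self.text_splitter.split_documents(documents)
    for chunk in chunks:
        # Hypothetical metadata extraction: tag each chunk with the years it mentions
        years = re.findall(r"\b(?:19|20)\d{2}\b", chunk.page_content)
        if years:
            chunk.metadata["years"] = sorted(set(years))
    return chunks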

Task 2: Implement Vector Store

Build the vector store integration:

import os

from langchain_core.documents import Document
from langchain_postgres import PGVector

class VectorStore:
    def __init__(self):
        self.connection_string = os.getenv("NEON_CONNECTION_STRING")

    def init_store(self):
        """
        TODO: Initialize PGVector store
        Use LangChain's built-in PGVector implementation
        """
        pass

    def add_documents(self, documents: list[Document]):
        """
        TODO: Add documents to vector store
        Include proper error handling
        """
        pass
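
A sketch of the two methods, reusing the langchain_postgres PGVector class and the OpenAI embeddings from earlier (the collection name is an arbitrary choice):

from langchain_openai import OpenAIEmbeddings

def init_store(self):
    self.store = PGVector(
        embeddings=OpenAIEmbeddings(),
        collection_name="insurance_docs",
        connection=self.connection_string,
    )

def add_documents(self, documents: list[Document]):
    try:
        self.store.add_documents(documents)
    except Exception as exc:
        # Surface storage failures with context rather than failing silently
        raise RuntimeError(f"Failed to store {len(documents)} documents in PGVector") from exc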

Task 3: Create Analysis Agent

Build an agent that can analyze the insurance data:

from langchain.agents import initialize_agent, Tool
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_anthropic import ChatAnthropic  # Claude chat model for the agent LLM

class InsuranceAnalysisAgent:
    def __init__(self, vector_store):
        """
        TODO: Initialize agent with:
        1. RetrievalQAWithSourcesChain for document querying
        2. Tools for data analysis
        3. Claude as the base LLM
        """
        pass

    def analyze_trends(self, query: str) -> str:
        """
        TODO: Implement trend analysis using the agent
        Should handle queries like:
        - Compare regional insurance costs
        - Analyze year-over-year changes
        - Identify cost factors
        """
        pass
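
One way to wire this together, assuming ChatAnthropic from langchain-anthropic, the retriever exposed by the PGVector store from Task 2, and the legacy initialize_agent API shown in the imports above (the model name is an assumption; use whichever Claude model your API key allows):

from langchain_anthropic import ChatAnthropic
from langchain.agents import AgentType, initialize_agent, Tool
from langchain.chains import RetrievalQAWithSourcesChain

llm = ChatAnthropic(model="claude-3-5-sonnet-latest")

# vector_store is the VectorStore from Task 2; .store is the PGVector instance from its sketch
qa_chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    retriever=vector_store.store.as_retriever(),
)

tools = [
    Tool(
        name="insurance_docs_qa",
        func=lambda q: qa_chain.invoke({"question": q}),
        description="Answers questions about the insurance Excel data and cites sources.",
    )
]

agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)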

Things to notice:

1. Technical Implementation

Document Processing

  • Successfully loads all Excel files
  • Creates appropriate chunk sizes
  • Extracts relevant metadata
  • Handles errors gracefully

Performance metric: Process files in under 5 minutes

Vector Store

  • Properly stores and retrieves documents
  • Maintains document metadata
  • Performs efficient similarity search

Performance metric: Retrieves results in under 2 seconds

Analysis Agent

  • Implements all required tools
  • Provides sourced responses
  • Handles complex analytical queries

Performance metric: Response generation under 10 seconds

Code Quality

  • Clear documentation
  • Proper error handling
  • Follows Python best practices
  • Efficient use of LangChain tools

2. Analysis Quality

Accuracy

  • Correct interpretation of data
  • Accurate trend identification
  • Proper source attribution
  • Factual correctness in responses

Depth

  • Thorough analysis of trends
  • Consideration of multiple factors
  • Detailed explanations
  • Meaningful insights

Creativity

  • Novel approaches to analysis
  • Interesting connections in data
  • Unique insights
  • Creative visualization suggestions

3. User Experience

Usability

  • Clear response format
  • Intuitive query handling
  • Appropriate level of detail
  • Helpful error messages

Adaptability

  • Handles various query types
  • Adapts to different analysis needs
  • Flexible response formats
  • Graceful handling of edge cases

Testing Your Implementation

Test your system with these queries:

queries = [
    "What's the trend in auto insurance costs over the last 3 years?",
    "Compare insurance costs between different regions",
    "What factors most influence insurance costs?",
    "Generate a summary of key findings from the data"
]
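
Once the classes above are implemented, one simple way to run the test queries (analyze_trends is the method name from the Task 3 skeleton):

agent = InsuranceAnalysisAgent(vector_store)
for query in queries:
    print(f"Q: {query}")
    print(agent.analyze_trends(query))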

Submission Guidelines

  1. Create a GitHub repository with your implementation
  2. Include:
    • All code files
    • requirements.txt (or pyproject.toml if you use Poetry)
    • Setup instructions
    • Example queries and outputs
  3. Submit by: [Date]

Resources

About

Take-home exercise from the GenAI Essentials course.
