This is the code repository for Building Natural Language Pipelines, published by Packt.
Author: Laura Funderburk
- What You'll Learn to Build
- Setting Up
- Chapter Breakdown
- Chapter 1: Introduction to natural language processing pipelines (no required code exercises)
- Chapter 2: Diving Deep into Large Language Models
- Chapter 3: Introduction to Haystack
- Chapter 4: Bringing components together: Haystack pipelines for different use cases
- Chapter 5: Haystack pipeline development with custom components
- Chapter 6: Setting up a reproducible project: naive vs hybrid RAG with reranking and evaluation
- Chapter 7: Production deployment strategies
- Chapter 8: Hands-on projects
- Chapter 9: Future trends and beyond (no required code exercises)
- Optional: Advanced multi-agent architecture for production
This book guides you through building advanced Retrieval-Augmented Generation (RAG) systems and multi-agent applications with Haystack 2.0, Ragas, and LangGraph. Beginning with state-based agent development in LangGraph, you'll build intelligent agents with tool integration, middleware patterns, and multi-agent coordination. You'll then master Haystack's component architecture: creating intelligent search systems with semantic and hybrid retrieval, building custom components for specialized tasks, and implementing comprehensive evaluation frameworks. The journey advances through production deployment strategies with Docker and REST APIs, and culminates in hands-on projects: named entity recognition systems, zero-shot text classification pipelines, sentiment analysis tools, and multi-agent orchestration systems that coordinate specialized Haystack pipelines through supervisor-worker patterns in LangGraph.
Chapter 2: Single and multi-agent systems with LangChain and LangGraph
This chapter contains optional LangGraph demonstrations that introduce state-based agents at a conceptual level. These examples are previews intended to build intuition. The full, practical use of LangGraph for multi-agent orchestration appears later in Chapter 8 and the epilogue, once the Haystack tool layer has been fully developed.
_Figures: Agent with one tool · Agent calling supervisor_
Chapter 3: Building robust agent tools with Haystack
_Figures: Supercomponents and pipeline · Prompt template pipeline_
Chapter 5: Build custom components: synthetic data generation with Ragas
_Figures: Knowledge graph and synthetic data generation (SDG) pipeline · SDG applied to websites and PDFs_
Chapter 6: Reproducible evaluation of hybrid and naive RAG with Ragas and Weights and Biases
Chapter 7: Deploy pipelines as an API with FastAPI and Hayhooks
Chapter 8 and Optional Advanced Modules: Capstone and Agentic Patterns for Production
_Figures: Microservice architecture · Multi-agent system using microservices_
📝 Sovereign-Friendly & Local Execution: Most exercises in this book let you choose between OpenAI APIs and local models served via Ollama (such as Mistral NeMo, GPT-OSS, DeepSeek-R1, or Qwen3); the exception is the cost-tracking exercises in Chapter 6, which specifically demonstrate OpenAI API usage monitoring. Each notebook provides model recommendations to help you choose the most suitable option for that exercise. The frameworks explored are extensible, and models from other providers can be substituted for OpenAI or local models. No US cloud, external APIs, or proprietary services are required for the majority of the book, making it easy to run in EU-regulated or air-gapped environments. The epilogue-advanced folder includes an optional prototype-to-production multi-agent implementation with LangGraph using LangSmith Studio. These exercises require a free LangSmith Studio API key, but they can also be run entirely locally with the tracer disabled:

```bash
export LANGCHAIN_TRACING_V2="false"
```

Scripts are provided so you can run the agent from your terminal; if you choose not to use LangSmith Studio, you simply won't see the studio traces or the agent visualization.
Clone the repository:

```bash
git clone https://github.com/PacktPublishing/Building-Natural-Language-Pipelines.git
cd Building-Natural-Language-Pipelines/
```
Each chapter folder contains a `pyproject.toml` file with that folder's dependencies. (Recommended) Open each chapter folder in a new VS Code window.
- Install uv:

  ```bash
  pip install uv
  ```

- Change directories into the chapter folder.
- Install dependencies:

  ```bash
  uv sync
  ```

- Activate the virtual environment:

  ```bash
  source .venv/bin/activate
  ```

- Select the virtual environment as the Jupyter kernel:
  - Open any notebook.
  - Click the kernel picker (top right) and select the `.venv` environment.
Agent Foundations & State Management
- LangGraph Fundamentals: Understanding state-based agent frameworks and graph architecture
- Building Simple Agents: Creating agents with state management using MessagesState and reducers
- Tool Integration: Connecting agents with external tools (search APIs, databases, custom functions)
- Multi-Agent Systems: Designing and coordinating multiple specialized agents in workflows
- Middleware Patterns: Implementing logging, authentication, and monitoring layers for agent systems
- Local vs Cloud LLMs: Running agents with OpenAI APIs or locally with Ollama (Qwen2, Llama, Mistral)
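To build intuition for the state management idea above, here is a minimal sketch in plain Python (not the actual LangGraph API): state is a dictionary, and a per-key "reducer" decides how a node's partial update is merged in. LangGraph's `MessagesState` uses an append-style reducer much like the one shown here.

```python
from typing import Callable

def append_messages(current: list, update: list) -> list:
    """Reducer: new messages are appended instead of overwriting history."""
    return current + update

def apply_update(state: dict, update: dict,
                 reducers: dict[str, Callable]) -> dict:
    """Merge a node's partial update into the state via per-key reducers."""
    merged = dict(state)
    for key, value in update.items():
        reducer = reducers.get(key)
        merged[key] = reducer(state.get(key, []), value) if reducer else value
    return merged

state = {"messages": [{"role": "user", "content": "hi"}]}
update = {"messages": [{"role": "assistant", "content": "hello!"}]}
state = apply_update(state, update, {"messages": append_messages})
# The history now holds both messages rather than only the latest one.
```

The key design point is that nodes return *partial* updates and never mutate shared state directly; the reducer makes the merge behavior explicit and predictable.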
Core Concepts & Foundation
- Component Architecture: Understanding Haystack's modular design patterns
- Pipeline Construction: Building linear and branching data flow pipelines
- Document Processing: Text extraction, cleaning, and preprocessing workflows
- Prompting LLMs: Learn to build prompt templates and guide how an LLM responds
- Package pipelines as Supercomponents: Abstract a pipeline as a Haystack component
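The component/pipeline idea can be sketched in plain Python (this mimics the pattern conceptually and is not Haystack's actual API): each component exposes a `run()` method returning a dict, and a linear pipeline feeds one component's outputs into the next component's inputs.

```python
# Conceptual sketch of component-based pipelines (not the Haystack API):
# components are small, single-purpose objects with a run() contract.

class Cleaner:
    def run(self, text: str) -> dict:
        # Normalize whitespace in the incoming text.
        return {"text": " ".join(text.split())}

class PromptBuilder:
    def __init__(self, template: str):
        self.template = template
    def run(self, text: str) -> dict:
        # Fill the prompt template with the cleaned text.
        return {"prompt": self.template.format(text=text)}

class Pipeline:
    def __init__(self):
        self.steps = []
    def add_component(self, component):
        self.steps.append(component)
    def run(self, **inputs) -> dict:
        data = inputs
        for step in self.steps:
            data = step.run(**data)  # outputs become the next step's inputs
        return data

pipe = Pipeline()
pipe.add_component(Cleaner())
pipe.add_component(PromptBuilder("Summarize the following text:\n{text}"))
result = pipe.run(text="  Haystack   pipelines   are   modular.  ")
```

Because each component only depends on its declared inputs and outputs, components can be swapped, tested in isolation, or wrapped as a single supercomponent.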
Scaling & Optimization
- Indexing Pipelines: Automated document ingestion and preprocessing workflows
- Naive RAG: Semantic search using sentence transformers and embedding models
- Hybrid RAG: Combining keyword (BM25) and semantic (vector) search strategies
- Reranking: Advanced retrieval techniques using ranker models
- Pipelines as tools for an Agent: Package advanced RAG as a tool for an autonomous Agent
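The hybrid RAG idea above combines two rankings of the same corpus; a common way to merge them is reciprocal rank fusion (RRF). The sketch below uses invented document ids for illustration; `k=60` is the constant commonly used in the RRF literature.

```python
# Hybrid retrieval sketch: fuse a keyword (BM25) ranking and a semantic
# (vector) ranking with reciprocal rank fusion.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score each doc by sum of 1/(k + rank) over all input rankings."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]    # keyword ranking (illustrative)
vector_hits = ["doc1", "doc5", "doc3"]  # semantic ranking (illustrative)
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
# Documents appearing high in both rankings rise to the top.
```

RRF only needs ranks, not raw scores, so it sidesteps the problem of BM25 and cosine-similarity scores living on incompatible scales.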
Extensibility & Testing
- Component SDK: Creating custom Haystack components with proper interfaces
- Knowledge Graph Integration: Building components for structured knowledge representation
- Synthetic Data Generation: Automated test data creation for pipeline validation
- Quality Control Systems: Implementing automated evaluation and monitoring components
- Unit Testing Frameworks: Comprehensive testing strategies for custom components
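As a flavor of the testing strategy above, here is a minimal sketch (a plain-Python stand-in for a custom component, not Haystack's component SDK): cover normal input, edge cases, and the output contract.

```python
class WordCounter:
    """Toy custom component: counts words in each document string."""
    def run(self, documents: list[str]) -> dict:
        return {"counts": [len(doc.split()) for doc in documents]}

def test_word_counter_normal_input():
    result = WordCounter().run(documents=["two words", "three short words"])
    assert result["counts"] == [2, 3]

def test_word_counter_empty_document():
    # Edge case: empty strings should yield zero, not raise.
    assert WordCounter().run(documents=[""])["counts"] == [0]

def test_word_counter_output_contract():
    # A component should always return its declared output key.
    assert "counts" in WordCounter().run(documents=[])

for test in (test_word_counter_normal_input,
             test_word_counter_empty_document,
             test_word_counter_output_contract):
    test()
```

In practice these would live in a `tests/` folder and run under pytest; the point is that each custom component gets exercised independently of any pipeline.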
Reproducible Workflows & Evaluation
- Reproducible Workflow Building Blocks: Setting up consistent environments with Docker, Elasticsearch, and dependency management
- Naive RAG Implementation: Building basic retrieval-augmented generation with semantic search
- Hybrid RAG with Reranking: Advanced retrieval combining keyword (BM25) and semantic search with rank fusion strategies
- Evaluation with RAGAS: Using the RAGAS framework to assess and compare naive vs hybrid RAG system quality across multiple dimensions
- Observability with Weights and Biases: Implementing monitoring and tracking for RAG system performance comparison and experiment management
- Performance Optimization through Feedback Loops: Creating iterative improvement cycles using evaluation results to enhance retrieval and generation performance
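The comparison workflow above can be illustrated with a deliberately simplified stand-in for what Ragas automates: score two retrieval configurations on the same queries. The query/relevance data below is invented, and hit rate and MRR are much cruder than Ragas's multi-dimensional metrics, but the compare-and-iterate loop is the same.

```python
def hit_rate(results: list[list[str]], relevant: list[str]) -> float:
    """Fraction of queries whose gold document appears in the results."""
    hits = sum(1 for docs, gold in zip(results, relevant) if gold in docs)
    return hits / len(relevant)

def mean_reciprocal_rank(results: list[list[str]], relevant: list[str]) -> float:
    """Average of 1/rank of the gold document (0 when it is missing)."""
    total = 0.0
    for docs, gold in zip(results, relevant):
        if gold in docs:
            total += 1.0 / (docs.index(gold) + 1)
    return total / len(relevant)

relevant = ["doc1", "doc4"]                   # gold document per query
naive = [["doc2", "doc1"], ["doc3", "doc9"]]  # naive RAG retrievals
hybrid = [["doc1", "doc2"], ["doc4", "doc3"]] # hybrid RAG retrievals

naive_mrr = mean_reciprocal_rank(naive, relevant)    # 0.25
hybrid_mrr = mean_reciprocal_rank(hybrid, relevant)  # 1.0
```

Logging these per-experiment numbers to a tracker such as Weights and Biases is what turns one-off scores into the feedback loop described above.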
Deployment & Scaling
- FastAPI REST API: Building production-ready APIs with clean documentation and error handling
- Docker Containerization: Full containerization with Docker Compose for scalable deployments
- Elasticsearch Integration: Production-grade document storage and hybrid search capabilities
- Local Development Workflows: Script-based development environment setup and testing
- Hayhooks Framework: Multi-pipeline deployment using Haystack's native REST API framework
- Pipeline Orchestration: Managing multiple RAG pipelines (indexing + querying) as microservices
- Service Discovery: Automated API endpoint generation and pipeline management
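The containerized setup above typically pairs a document store with the pipeline API. The fragment below is purely illustrative (service names, ports, and image tags are examples, not the book's exact files):

```yaml
# Illustrative docker-compose fragment: Elasticsearch plus a pipeline API.
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.1
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
  pipeline-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - ELASTICSEARCH_URL=http://elasticsearch:9200
    depends_on:
      - elasticsearch
```

The API container reaches the store by service name (`elasticsearch`) on the compose network, which keeps the local and deployed configurations identical.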
Real-World Applications & Multi-Agent Systems
Hands-on projects that progress from beginner to advanced complexity, focusing on Named Entity Recognition, Text Classification, and Multi-Agent Systems. Projects include complete notebooks with custom component definitions, pipeline definitions, and pipeline serialization.
- Haystack Pipeline Fundamentals: Building basic pipelines for entity extraction workflows
- Pre-trained NER Models: Using transformer models to identify people, organizations, and locations
- Custom Component Creation: Developing reusable components for text processing
- Web Content Processing: Building pipelines that extract entities from web search results
- SuperComponents and Agents: Wrapping pipelines as tools and building agents for natural language interaction
- Zero-Shot Classification: Categorizing content without training data using LLMs
- External API Integration: Connecting Haystack pipelines with the Yelp API
- Model Performance Evaluation: Assessing classification accuracy on labeled datasets
- Sentiment Analysis Pipelines: Building custom components for analyzing review sentiment
- Haystack Agent Mini Project: Hands-on exercise combining NER and classification pipelines with agent orchestration and Hayhooks deployment
- Pipeline Chaining: Connecting multiple specialized Haystack pipelines into complex workflows
- Hayhooks Deployment: Deploying pipelines as REST API endpoints for agent consumption
- LangGraph Multi-Agent Orchestration: Building intelligent supervisor systems that coordinate specialized agents
- Modular Pipeline Architecture: Creating 4 specialized pipelines (business search, details, sentiment, reporting)
- Ambiguous Input Handling: Using NER and intelligent routing to process natural language queries
- Distributed Data Aggregation: Generating comprehensive reports from multiple data sources
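The supervisor idea behind these projects can be sketched in plain Python (this is a conceptual illustration, not LangGraph code): route a natural-language query to one of the four specialized pipelines. The pipeline names match the ones listed above; the keyword rules are invented and far simpler than the NER-based routing used in the book.

```python
# Toy supervisor: pick a worker pipeline from keywords in the query.
PIPELINES = ["business_search", "business_details", "sentiment", "reporting"]

def route(query: str) -> str:
    q = query.lower()
    if any(word in q for word in ("report", "summary", "summarize")):
        return "reporting"
    if any(word in q for word in ("review", "sentiment", "feel")):
        return "sentiment"
    if any(word in q for word in ("hours", "address", "phone", "details")):
        return "business_details"
    return "business_search"  # default: find candidate businesses first

assert route("Find me sushi places in Vancouver") == "business_search"
assert route("What do reviews say about this cafe?") == "sentiment"
```

In the full system an LLM-backed supervisor replaces the keyword rules, and each route dispatches to a Hayhooks-deployed pipeline endpoint rather than returning a string.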
The epilogue-advanced folder contains an extended, production-grade implementation of the agentic supervisor described in Chapter 8.
- Three Agent Architectures: Progressive implementations from learning (V1 monolithic) to production-ready (V3 with checkpointing)
- State Management Patterns: Understanding how architectural decisions impact token usage and cost (16-50% reduction)
- Monolithic vs Supervisor Patterns: Comparing design approaches with automated token measurement tools
- Production Features: Error handling with retry policies, conversation persistence with checkpointing, and graceful degradation
- Guardrails: Input validation with prompt injection detection and PII sanitization for secure agent interactions
- Checkpointing Systems: Thread-based session management with both in-memory (development) and SQLite (production) persistence options
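The SQLite persistence option above can be illustrated with a minimal stdlib sketch (the pattern only, not LangGraph's checkpointer API): each conversation thread keeps its latest serialized state, keyed by thread id.

```python
import json
import sqlite3

def connect(path: str = ":memory:") -> sqlite3.Connection:
    """Open the checkpoint store; use a file path for real persistence."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints "
        "(thread_id TEXT PRIMARY KEY, state TEXT)"
    )
    return conn

def save_checkpoint(conn, thread_id: str, state: dict) -> None:
    # Upsert: a new checkpoint for a thread replaces the previous one.
    conn.execute(
        "INSERT INTO checkpoints (thread_id, state) VALUES (?, ?) "
        "ON CONFLICT(thread_id) DO UPDATE SET state = excluded.state",
        (thread_id, json.dumps(state)),
    )
    conn.commit()

def load_checkpoint(conn, thread_id: str):
    row = conn.execute(
        "SELECT state FROM checkpoints WHERE thread_id = ?", (thread_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None

conn = connect()  # ":memory:" mirrors the in-memory development option
save_checkpoint(conn, "thread-1", {"messages": ["hi"]})
save_checkpoint(conn, "thread-1", {"messages": ["hi", "hello!"]})
restored = load_checkpoint(conn, "thread-1")
```

Swapping `":memory:"` for a file path is all it takes to move from the development setup to durable, restart-surviving sessions.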