Buy the Book on Amazon: https://a.co/d/eaTeURV
This repository contains code examples, experiments, and implementations for the Building Agentic AI book by Sinan Ozdemir. The book offers a practical, durable foundation for understanding how modern AI systems are built, why they behave the way they do, and how to push them to their limits.
Building Agentic AI is a practical guide for builders. Whether you're a developer deploying your first model, a data scientist making sense of embeddings and agents, or a founder exploring how AI workflows can reshape your product, this repository provides the code and examples to accompany your learning journey.
The book is organized in three acts:
- Act I: Foundations — LLMs, embeddings, retrieval, and workflows for reliable, cost-effective, scalable systems
- Act II: Agents — Designing, deploying, and evaluating systems that don't just respond, but act
- Act III: Optimization — Fine-tuning, quantization, distillation, and tools to push performance while maintaining efficiency
- Prerequisites
- Case Studies
  - Case Study 1: Text to SQL Workflow
  - Case Study 2: LLM Evaluation
  - Case Study 3: LLM Experimentation
  - Case Study 4: "Simple" Summary Prompt
  - Case Study 5: From RAG to Agents
  - Case Study 6: AI Rubrics for Grading
  - Case Study 7: AI SDR with MCP
  - Case Study 8: Prompt Engineering Agents
  - Case Study 9: Deep Research + Agentic Workflows
  - Case Study 10: Agentic Tool Selection Performance
  - Case Study 11: Benchmarking Reasoning Models
  - Case Study 12: Computer Use
  - Case Study 13: Classification vs Multiple Choice
  - Case Study 14: Domain Adaptation
  - Case Study 15: Speculative Decoding
  - Case Study 16: Voice Bot
  - Case Study 17: Fine-Tuning Matryoshka Embeddings
- Common Patterns
- Troubleshooting
- Contributing
- Python 3.8 or higher recommended
- Python 3.10+ for optimal compatibility with all libraries
Most case studies use these core libraries:
- `langchain` and `langgraph` - Agent frameworks
- `langchain-openai` - OpenAI integration
- `openai` - OpenAI API client
- `chromadb` - Vector database
- `pydantic` - Structured outputs and data validation
- `fastapi` / `flask` - Web frameworks
- `streamlit` - Interactive UIs
- `playwright` - Browser automation
- `pandas` / `numpy` - Data manipulation
IMPORTANT: Never commit your actual API keys to the repository. Always use environment variables or .env files (which are gitignored).
You'll need API keys for various services. We've provided .env.example files in key directories as templates:
- Copy the example file to create your own `.env` file:

  ```bash
  # Root directory
  cp .env.example .env

  # Or for specific case studies
  cp sdr_multi_agent/.env.example sdr_multi_agent/.env
  cp text_to_sql/.env.example text_to_sql/.env
  cp codeact_browser/.env.example codeact_browser/.env
  ```
- Edit the `.env` file and add your actual API keys:

  ```bash
  # Required for most case studies
  OPENAI_API_KEY=your_actual_openai_api_key_here
  OPENROUTER_API_KEY=your_actual_openrouter_api_key_here

  # For specific case studies
  GROQ_API_KEY=your_actual_groq_api_key          # Case Study 16: Voice Bot
  SERPAPI_API_KEY=your_actual_serpapi_key        # Case Study 9: Deep Research
  FIRECRAWL_API_KEY=your_actual_firecrawl_key    # Case Study 9: Deep Research
  RESEND_API_KEY=your_actual_resend_key          # Case Study 7: AI SDR
  TWILIO_API_KEY=your_actual_twilio_key          # Case Study 16: Voice Bot
  ANTHROPIC_API_KEY=your_actual_anthropic_key    # For Claude models
  LANGSMITH_API_KEY=your_actual_langsmith_key    # Optional: for tracing
  ```
- For Jupyter notebooks: some notebooks use `%env` magic commands. You can either:
  - Set environment variables before starting Jupyter:

    ```bash
    export OPENROUTER_API_KEY=your_key
    ```

  - Or update the notebook's first cell to use your actual key (but remember not to commit it!)
Security Note: All .env files are automatically ignored by git. Never commit actual API keys to the repository.
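For Python scripts run outside Jupyter, here is a minimal sketch of loading keys from a `.env` file. It assumes `python-dotenv` is installed; adapt it to however the individual case study loads its configuration.

```python
# Minimal sketch: load API keys from a .env file instead of hard-coding them.
# Assumes python-dotenv is installed (pip install python-dotenv).
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

openai_key = os.environ.get("OPENAI_API_KEY")
if not openai_key:
    raise RuntimeError("OPENAI_API_KEY is not set; copy .env.example to .env first.")
```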
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd applied-ai-book
  ```

- Create a virtual environment (recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies for specific case studies (see individual case study sections)
Description: Build a system that converts natural language questions to SQL queries using RAG (Retrieval-Augmented Generation). This implementation achieves 30% better SQL accuracy than raw LLMs with half the token cost.
Key Concepts: LangGraph workflows, RAG pipelines, ChromaDB vector storage, document retrieval, SQL generation
Code Location: text_to_sql/
Setup:
```bash
cd text_to_sql/prototype
pip install -r requirements.txt
# Set OPENAI_API_KEY in .env file
python app.py
```

Files:
- Main implementation: `text_to_sql/src/text_to_sql_rag/sql_rag_langgraph.ipynb` - LangGraph RAG workflow notebook
- Core workflow: `text_to_sql/src/text_to_sql_rag/rag_graph.py` - RAG workflow implementation
- Database management: `text_to_sql/src/text_to_sql_rag/rag_db.py` - Database and vector store setup
- Web interface: `text_to_sql/prototype/app.py` - Flask web application
- Utilities: `text_to_sql/src/text_to_sql_rag/utils.py`
Dependencies: See text_to_sql/prototype/requirements.txt
Key Features:
- LangGraph-based RAG workflow
- ChromaDB for vector storage
- Multiple database support (Formula 1, Superhero, Financial, etc.)
- Interactive Flask web interface
- Evidence retrieval with similarity scores
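To make the evidence-retrieval step concrete, here is a rough sketch of querying schema documentation in ChromaDB and reading back similarity scores. The collection name, documents, and client setup are illustrative assumptions, not the repository's exact code.

```python
# Illustrative sketch of evidence retrieval with similarity scores in ChromaDB.
import chromadb

client = chromadb.Client()  # in-memory client; a real setup would persist to disk
collection = client.create_collection("schema_docs")  # hypothetical collection name

collection.add(
    ids=["drivers", "races"],
    documents=[
        "Table drivers(driver_id, forename, surname, nationality)",
        "Table races(race_id, year, round, circuit_id, name)",
    ],
)

results = collection.query(
    query_texts=["Which driver won the most races in 2019?"],
    n_results=2,
)
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"{dist:.3f}  {doc}")  # lower distance = more relevant evidence
```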
Description: An exploration of how well the SQL system from Case Study 1 actually works. This includes comprehensive evaluation metrics, accuracy measurements, and performance analysis.
Key Concepts: Evaluation methodologies, accuracy metrics, performance benchmarking, SQL correctness validation
Code Location: text_to_sql/src/choosing_generator/
Setup:
```bash
cd text_to_sql/src/choosing_generator
# Install dependencies from parent directory
python run_batch_evaluation.py
python visualize_model_performance.py
```

Files:
- Batch evaluation: `text_to_sql/src/choosing_generator/run_batch_evaluation.py`
- Performance visualization: `text_to_sql/src/choosing_generator/visualize_model_performance.py`
- Multi-model evaluation: `text_to_sql/src/choosing_generator/run_multi_model_evaluation.py`
- Evaluation results: multiple CSV files in `text_to_sql/src/choosing_generator/` with batch evaluation results
Key Features:
- SQL accuracy evaluation across multiple models
- Performance metrics and visualization
- Batch processing for large-scale evaluation
- Model comparison and analysis
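One simple way to think about SQL correctness is execution accuracy: run the predicted and gold queries against the same database and compare the result sets. The sketch below illustrates that idea only; it is not the repository's exact evaluation metric.

```python
# Simplified execution-accuracy check: a prediction counts as correct when it
# returns the same rows as the gold query (order-insensitive).
import sqlite3


def execution_match(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = set(conn.execute(predicted_sql).fetchall())
        gold_rows = set(conn.execute(gold_sql).fetchall())
    except sqlite3.Error:
        return False  # queries that fail to execute are scored as incorrect
    finally:
        conn.close()
    return pred_rows == gold_rows
```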
Description: Run systematic tests on prompts, models, and embedding models. Learn how even the smallest changes can 2x your performance when you experiment efficiently.
Key Concepts: Systematic experimentation, prompt optimization, embedding model comparison, A/B testing, performance optimization
Code Location: text_to_sql/src/choosing_generator/ and prompting/
Setup:
```bash
cd text_to_sql/src/choosing_generator
# Run prompt engineering experiments
jupyter notebook prompt_engineering_generator.ipynb
# Or run multi-model evaluation
python run_multi_model_evaluation.py
```

Files:
- Prompt engineering: `text_to_sql/src/choosing_generator/prompt_engineering_generator.ipynb`
- Multi-model evaluation: `text_to_sql/src/choosing_generator/run_multi_model_evaluation.py`
- Prompt caching: `prompting/prompt_cache_haystack.ipynb` - Prompt caching experiments
- Model definitions: `text_to_sql/src/choosing_generator/models.py`
Key Features:
- Systematic testing framework
- Prompt optimization techniques
- Embedding model experiments
- Performance comparison across models
- Cost and latency analysis
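The general shape of these experiments is a grid over prompts and models, with accuracy, tokens, and latency recorded for each combination. The sketch below is illustrative: the model names are placeholders and `evaluate` stands in for a real evaluation loop.

```python
# Sketch of a systematic experiment grid: every (prompt, model) pair is scored
# on the same test set and the results land in one DataFrame for comparison.
import time

import pandas as pd


def evaluate(prompt: str, model: str) -> tuple[float, int]:
    """Placeholder: run the test set through the model and return (accuracy, tokens)."""
    return 0.0, 0


prompts = {
    "baseline": "Translate the question to SQL.",
    "schema_hints": "Translate the question to SQL. Use only the listed columns.",
}
models = ["gpt-4o-mini", "gpt-4o"]  # placeholder model names

records = []
for prompt_name, prompt in prompts.items():
    for model in models:
        start = time.perf_counter()
        accuracy, tokens = evaluate(prompt, model)
        records.append({
            "prompt": prompt_name,
            "model": model,
            "accuracy": accuracy,
            "tokens": tokens,
            "seconds": time.perf_counter() - start,
        })

print(pd.DataFrame(records).sort_values("accuracy", ascending=False))
```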
Description: Discover why LLMs favor content at the start and end of prompts. This positional bias is breaking your RAG systems and chatbots, and you may not even know it.
Key Concepts: Positional bias, embedding similarity, prompt engineering, RAG system optimization
Code Location: prompting/
Setup:
```bash
cd prompting
jupyter notebook summary_positional_bias.ipynb
```

Files:
- Main analysis: `prompting/summary_positional_bias.ipynb` - Positional bias discovery and analysis
- Chunked analysis: `prompting/summary_positional_bias_chunk.ipynb` - Chunked document analysis
- MMLU dataset: `prompting/mmlu_positional_bias.ipynb` - Positional bias in the MMLU dataset
- Results: CSV files with positional bias metrics
Key Features:
- Embedding similarity analysis
- Positional bias discovery
- Impact on RAG system performance
- Visualization of bias patterns
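A rough sketch of the measurement idea: embed each source chunk and the generated summary, then check whether similarity falls off with the chunk's position in the prompt. The embedding model here is an assumption, not necessarily what the notebooks use.

```python
# Sketch: does the summary resemble early chunks more than middle or late ones?
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

chunks = [
    "First section of the document ...",
    "A middle section with key details ...",
    "The final section and conclusion ...",
]
summary = "Generated summary text goes here ..."

chunk_vecs = model.encode(chunks, convert_to_tensor=True)
summary_vec = model.encode(summary, convert_to_tensor=True)

scores = util.cos_sim(summary_vec, chunk_vecs)[0]
for position, score in enumerate(scores):
    print(f"chunk {position}: cosine similarity {float(score):.3f}")
```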
Description: Convert a workflow into an agent that makes its own decisions using tools. Agents handle the weird edge cases your workflow never imagined, but at what cost (literally)?
Key Concepts: ReAct agents, LangGraph, tool usage, agent decision-making, cost analysis
Code Location: text_to_sql/agent/
Setup:
```bash
cd text_to_sql/agent
jupyter notebook react_agent_sql.ipynb
```

Files:
- ReAct agent: `text_to_sql/agent/react_agent_sql.ipynb` - Main ReAct agent implementation
- Agent utilities: `text_to_sql/agent/utils.py` - Helper functions
- Long-term memory: `text_to_sql/agent/long_term_memory_experiment.ipynb` - Memory experiments
- Memory experiment part 2: `text_to_sql/agent/long_term_memory_experiment_part_deux.ipynb`
- Evaluation results: `text_to_sql/agent/eval_results.csv`
Key Features:
- LangGraph ReAct agent implementation
- Tool-based decision making
- Edge case handling
- Long-term memory experiments
- Cost and performance comparison with workflows
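The key move when converting the workflow into an agent is exposing database operations as tools the agent can choose to call. Below is a hypothetical tool definition in the LangChain style; the repository's real helpers live in `text_to_sql/agent/utils.py` and will differ.

```python
# Hypothetical SQL tool a ReAct agent can call; the database path and error
# handling are illustrative, not the repository's implementation.
import sqlite3

from langchain_core.tools import tool


@tool
def run_sql(query: str) -> str:
    """Execute a read-only SQL query against the demo database and return rows."""
    conn = sqlite3.connect("formula_1.sqlite")  # hypothetical database file
    try:
        rows = conn.execute(query).fetchmany(20)
    except sqlite3.Error as exc:
        return f"SQL error: {exc}"  # the agent can read the error and retry
    finally:
        conn.close()
    return "\n".join(str(row) for row in rows)


# The tool is then passed to the agent constructor shown under "Common Patterns"
# below, e.g. tools = [run_sql].
```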
Description: Create scoring systems that evaluate AI outputs consistently while mitigating bias. Less arguing about quality, more clear and measurable criteria.
Key Concepts: Structured outputs, Pydantic models, evaluation rubrics, bias mitigation, consistent scoring
Code Location: policy_bot/
Setup:
```bash
cd policy_bot
pip install -r requirements.txt
# Set OPENROUTER_API_KEY in environment
python -c "from ai.rubric import get_structured_scorer; scorer = get_structured_scorer()"
```

Files:
- Rubric system: `policy_bot/ai/rubric.py` - Structured scoring system with Pydantic
- Policy agent: `policy_bot/ai/agent.py` - Policy agent using rubrics
- Evaluation notebooks: `policy_bot/rubric_grade_domain_adapt.ipynb`
- Results: CSV files with scoring results
Key Features:
- Pydantic structured outputs for consistent scoring
- 0-3 scoring scale with detailed reasoning
- Bias mitigation in evaluation
- Automated rubric-based grading
- Integration with policy agents
Description: Build multiple agents that research contacts and send emails. Your outreach can finally sound human at scale.
Key Concepts: Multi-agent systems, MCP (Model Context Protocol), Flask applications, Celery async tasks, agent orchestration
Code Location: sdr_multi_agent/
Setup:
```bash
cd sdr_multi_agent
# Start Docker services (RabbitMQ, etc.)
docker-compose up -d
# Install dependencies
cd flask_app
pip install -r requirements.txt
# Run Flask app
python app.py
```

Files:
- Flask application: `sdr_multi_agent/flask_app/app.py` - Main Flask app with sync/async endpoints
- Agent builder: `sdr_multi_agent/flask_app/agent_builder.py` - Generic agent builder with MCP integration
- System prompts: `sdr_multi_agent/flask_app/prompts.py`
- Celery tasks: `sdr_multi_agent/flask_app/celery_app.py`
- MCP servers:
- Agent configs: JSON files in `sdr_multi_agent/flask_app/` (e.g., `email_agent.json`, `lead_gen_config.json`)
Dependencies: See sdr_multi_agent/flask_app/requirements.txt
Key Features:
- Multi-agent system architecture
- MCP server integration
- Celery for async task processing
- Configurable agent system with JSON configs
- Persistent conversation memory
- Research and email automation
Description: Create agents that follow company policies using synthetic test data as a measuring stick. See how one single sentence in a prompt can move accuracy from 44% to 70%.
Key Concepts: Prompt engineering, policy compliance, accuracy optimization, synthetic test data, agent evaluation
Code Location: policy_bot/
Setup:
```bash
cd policy_bot
pip install -r requirements.txt
jupyter notebook agent_prompting_test.ipynb
```

Files:
- Main experiment: `policy_bot/agent_prompting_test.ipynb` - Prompt engineering experiments
- Mini version: `policy_bot/agent_prompting_test_mini.ipynb` - Smaller test version
- Policy agent: `policy_bot/ai/agent.py` - Agent implementation
- Dataset building: `policy_bot/build_dataset.ipynb` - Test dataset creation
- Results:
Dependencies: See policy_bot/requirements.txt
Key Features:
- Prompt optimization techniques
- Accuracy improvements (44% to 70%)
- Synthetic test data generation
- Policy compliance evaluation
- Tool call tracking and analysis
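The experiment boils down to scoring each prompt variant against the same synthetic test set. The sketch below shows only the loop; `ask_agent`, the variants, and the test cases are placeholders rather than the notebook's actual data.

```python
# Sketch of the prompt-variant experiment: identical test set, one sentence changed.
def ask_agent(system_prompt: str, question: str) -> str:
    """Placeholder: call the policy agent and return its final answer."""
    return "yes"  # canned answer so the sketch runs; swap in the real agent call


test_set = [
    {"question": "Can a guest cancel 2 days before check-in?", "expected": "yes"},
    # ... more synthetic questions with expected answers
]

variants = {
    "baseline": "You are a policy assistant.",
    "plus_one_sentence": (
        "You are a policy assistant. "
        "Always cite the specific policy section you relied on."
    ),
}

for name, system_prompt in variants.items():
    correct = sum(
        ask_agent(system_prompt, case["question"]).strip().lower() == case["expected"]
        for case in test_set
    )
    print(f"{name}: {correct}/{len(test_set)} correct")
```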
Description: Combine structured workflows with agent flexibility for research tasks. Get reliability without sacrificing adaptability.
Key Concepts: LangGraph workflows, planning and replanning, step execution, structured workflows, agent flexibility
Code Location: deep_research/
Setup:
```bash
cd deep_research
pip install -r requirements.txt
# For Streamlit UI
pip install -r streamlit_requirements.txt
streamlit run streamlit_app.py
# Or run notebook
jupyter notebook deep_research_langgraph_demo.ipynb
```

Files:
- Main workflow: `deep_research/deep_research_graph.py` - LangGraph workflow implementation
- Planning logic: `deep_research/planning.py` - Plan generation and replanning
- Step executor: `deep_research/step_executor.py` - Step execution with ReAct agents
- Streamlit UI: `deep_research/streamlit_app.py` - Interactive web interface
- Demo notebook: `deep_research/deep_research_langgraph_demo.ipynb`
- Tests: `deep_research/tests/test_planning.py`, `deep_research/tests/test_step_executor.py`
Dependencies:
Key Features:
- Structured workflows with agent flexibility
- Planning and replanning system
- Step-by-step research execution
- Real-time streaming events
- Performance analytics
- Web scraping and search integration
Environment Variables Needed:
- `OPENROUTER_API_KEY` - For LLM access
- `SERPAPI_API_KEY` - For Google search
- `FIRECRAWL_API_KEY` - For web scraping
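Conceptually, the workflow alternates between a planner that emits a list of steps and an executor that runs them one at a time, replanning as results come in. The stripped-down loop below is illustrative only; the real graph in `deep_research/deep_research_graph.py` is built from LangGraph nodes.

```python
# Illustrative plan / execute / replan loop (not the module's real API).
from pydantic import BaseModel


class Plan(BaseModel):
    steps: list[str]


def make_plan(question: str) -> Plan:
    """Placeholder planner; the real one calls an LLM with a planning prompt."""
    return Plan(steps=[f"Search the web for: {question}", "Summarize the findings"])


def execute_step(step: str) -> str:
    """Placeholder executor; the real one runs a ReAct agent with search tools."""
    return f"(result of '{step}')"


def replan(plan: Plan, last_result: str) -> Plan:
    """Placeholder replanner; here it simply drops the finished step."""
    return Plan(steps=plan.steps[1:])


plan = make_plan("What changed in EU AI regulation this year?")
results = []
while plan.steps:
    result = execute_step(plan.steps[0])
    results.append(result)
    plan = replan(plan, result)
print(results)
```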
Description: Test how well different LLMs choose the right tools. Tool order in prompts can shift accuracy by 40%.
Key Concepts: Tool selection, positional bias, MCP servers, agent performance evaluation, accuracy analysis
Code Location: agent_positional_bias/
Setup:
```bash
cd agent_positional_bias
jupyter notebook "LangGraph_React - MCP + Tool Selection.ipynb"
```

Files:
- Main experiment: `agent_positional_bias/LangGraph_React - MCP + Tool Selection.ipynb` - Tool selection experiments
- Prompt-only version: `agent_positional_bias/PROMPT ONLY LangGraph_React - MCP + Tool Selection.ipynb`
- Reasoning edition: `agent_positional_bias/positional bias reasoning edition.ipynb`
- MCP server: `agent_positional_bias/mcp_server.py` - Basic MCP server example
- Similarity-based: `agent_positional_bias/similarity_based_mcp.py` - Similarity-based tool selection
- Random MCP: `agent_positional_bias/random_mcp_server.py`
- Results:
Key Features:
- Tool selection accuracy testing
- Positional bias analysis (40% accuracy shifts)
- Comparison across multiple LLMs
- Reasoning vs non-reasoning model comparison
- Visualization of tool selection patterns
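The experiment can be pictured as repeatedly shuffling the tool list before handing it to the agent and tallying which tool it picks. In the sketch below, `select_tool` is a placeholder for building the agent and inspecting its first tool call.

```python
# Sketch of the tool-order experiment: same tools, same question, shuffled order.
import random
from collections import Counter

TOOLS = ["get_weather", "search_flights", "currency_convert", "book_hotel"]


def select_tool(question: str, tool_order: list[str]) -> str:
    """Placeholder: build an agent with tools in this order and return the tool
    it calls first. Here it naively picks the first tool so the sketch runs."""
    return tool_order[0]


question = "What's the weather in Tokyo tomorrow?"
picks = Counter()
for _ in range(20):
    order = random.sample(TOOLS, k=len(TOOLS))  # new tool ordering each trial
    picks[select_tool(question, order)] += 1

print(picks)  # a position-insensitive agent would almost always pick get_weather
```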
Description: Compare reasoning models like o1 and Claude against standard LLMs. They may even lose to cheaper models on real tasks—we'll see!
Key Concepts: Reasoning models, chain-of-thought, model benchmarking, cost/performance analysis, o1, Claude reasoning
Code Location: reasoning_llms/
Setup:
```bash
cd reasoning_llms
jupyter notebook benchmarking_reasoning_models.ipynb
```

Files:
- Main benchmarking: `reasoning_llms/benchmarking_reasoning_models.ipynb` - Comprehensive benchmarking notebook
- Reasoning agents: `reasoning_llms/reasoning_llm_agents.ipynb` - Agent implementations
- Introduction: `reasoning_llms/intro_to_reasoning_models.ipynb` - Introduction to reasoning models
- Tree of Thoughts: `reasoning_llms/tot.ipynb` - Tree of Thoughts implementation
- Results: `reasoning_llms/reasoning_results_math_qa.csv`
Key Features:
- Reasoning model comparison
- Cost and performance analysis
- Tree of Thoughts implementation
- Real-world task evaluation
- Performance visualization
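The benchmarking loop itself is mostly bookkeeping: send the same question to each model and log latency, tokens, and correctness. The sketch below routes through OpenRouter; the model IDs and the correctness check are illustrative, not the notebook's configuration.

```python
# Sketch of a benchmarking loop over OpenRouter-hosted models.
import os
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

models = ["openai/o1-mini", "anthropic/claude-3.5-sonnet"]  # illustrative IDs
question = "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"

for model in models:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    elapsed = time.perf_counter() - start
    answer = response.choices[0].message.content
    print(f"{model}: {elapsed:.1f}s, {response.usage.total_tokens} tokens, "
          f"contains '80': {'80' in answer}")
```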
Description: Build agents that control browsers and applications through screenshots. Your agent can finally use software you can't API into.
Key Concepts: Screenshot-based automation, GUI control, browser automation, computer vision, agent control
Code Location: reasoning_llms/computer_use/
Setup:
```bash
cd reasoning_llms/computer_use
jupyter notebook using_computer_use.ipynb
```

Files:
- Core implementation: `reasoning_llms/computer_use/simple_computer_use.py` - Main computer use implementation
- Usage notebook: `reasoning_llms/computer_use/using_computer_use.ipynb` - How to use the system
- Benchmarking: `reasoning_llms/computer_use/benchmarking reasoning_model_computer_use.ipynb`
- Screenshot app: `reasoning_llms/computer_use/screenshot_app/` - Electron-based screenshot application
- Dataset: `reasoning_llms/computer_use/uipad_dataset/` - UI interaction dataset
Key Features:
- Screenshot-based UI automation
- Browser and application control
- GUI interaction through vision
- Benchmarking and evaluation
- Real-world application control
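At its core, the loop sends a screenshot plus an instruction to a vision-capable model and parses the suggested action. The stripped-down sketch below covers only the vision call; the model name and prompt are assumptions, and the real `simple_computer_use.py` also performs the clicking and typing.

```python
# Minimal sketch of the screenshot -> vision-model step; action parsing and
# actual mouse/keyboard control are omitted.
import base64
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

with open("screenshot.png", "rb") as f:  # hypothetical screenshot file
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model; an assumption, not the repo's choice
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Here is the current screen. What should I click to open "
                     "Settings? Answer with approximate x,y pixel coordinates."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```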
Description: Compare fine-tuning against multiple choice prompting for classification. The winner might depend on whether you have 100 or 10,000 examples.
Key Concepts: Fine-tuning, classification, multiple choice prompting, model comparison, data efficiency
Code Location: finetuning/app_review_clf/ and root clf.ipynb
Colab Notebooks:
- Fine-tuning BERT for app reviews: Google Colab
- App review calibration: Google Colab
Setup:
```bash
# For fine-tuning approach
cd finetuning/app_review_clf
jupyter notebook openai_app_review_ft.ipynb

# For classification comparison
cd ../..
jupyter notebook clf.ipynb
```

Files:
- Fine-tuning notebook: `finetuning/app_review_clf/openai_app_review_ft.ipynb` - OpenAI fine-tuning implementation
- Classification comparison: `clf.ipynb` - Comparison between fine-tuning and prompting
- Training data: `finetuning/app_review_clf/openai_training_data/` - JSONL training files
- Results: PNG files with performance visualizations
Key Features:
- Fine-tuning vs prompting comparison
- Classification accuracy analysis
- Data efficiency evaluation
- Cost and performance trade-offs
- App review sentiment classification
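The multiple-choice side of the comparison is essentially a constrained prompt. A hedged sketch follows; the labels and model are illustrative, not the notebook's exact setup.

```python
# Sketch of the multiple-choice prompting baseline for app-review classification.
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

LABELS = ["A) bug report", "B) feature request", "C) praise", "D) complaint"]
review = "The app crashes every time I try to upload a photo."

prompt = (
    "Classify the app review into exactly one category. "
    "Reply with the single letter only.\n\n"
    f"Review: {review}\n\nOptions:\n" + "\n".join(LABELS)
)

response = client.chat.completions.create(
    model="gpt-4o-mini",        # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
    max_tokens=1,               # force a one-letter answer
    temperature=0,
)
print(response.choices[0].message.content)  # e.g. "A"
```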
Description: Fine-tune Qwen on domain-specific documents. Generic models become experts in your exact business rules.
Key Concepts: Domain adaptation, fine-tuning, Qwen model, policy compliance, business rule learning
Colab Notebook:
- Fine-tuning Qwen for domain adaptation: Google Colab
Setup:
```bash
cd policy_bot
jupyter notebook rubric_grade_domain_adapt.ipynb
```

Files:
- Domain adaptation evaluation: `policy_bot/rubric_grade_domain_adapt.ipynb` - Evaluation of domain-adapted models
- Fine-tuning checkpoints: `finetuning/domain_adaptation/` - Model checkpoints and configs
- Policy agent: `policy_bot/ai/agent.py` - Agent using domain-adapted models
- Results: CSV files with before/after fine-tuning comparisons
Key Features:
- Qwen model fine-tuning
- Airbnb policy domain adaptation
- Domain-specific expertise
- Before/after performance comparison
- Business rule learning
Description: Speed up inference by having a small model draft for a large model. Same exact outputs, 2-3x faster, sometimes.
Key Concepts: Speculative decoding, inference acceleration, draft models, performance optimization
Colab Notebook:
- Qwen speculative decoding: Google Colab
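In Hugging Face `transformers`, assisted generation exposes this idea directly: a small draft model proposes tokens and the large model verifies them. Below is a sketch under the assumption of a Qwen pairing; the Colab notebook's exact checkpoints may differ.

```python
# Sketch of speculative (assisted) decoding: the small model drafts tokens and
# the large model verifies them. Model names are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "Qwen/Qwen2.5-7B-Instruct"   # assumed large model
draft_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed small draft model

tokenizer = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(
    target_name, torch_dtype=torch.bfloat16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(
    draft_name, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Explain speculative decoding in one sentence.",
                   return_tensors="pt").to(target.device)
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```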
Description: Even as newer voice-to-voice models mature, real-time voice bots built on streaming audio can still perform well, with sub-500ms responses that make conversations feel natural.
Key Concepts: Real-time voice streaming, WebSockets, Twilio integration, Groq API, low-latency responses
Code Location: multimodal/twilio/
Setup:
```bash
cd multimodal/twilio

# Option 1: Docker (recommended)
docker-compose up --build

# Option 2: Local setup
pip install -r requirements.txt
# Create .env file with GROQ_API_KEY, NGROK_AUTHTOKEN
python twilio_app.py
```

Files:
- Main application: `multimodal/twilio/twilio_app.py` - FastAPI application with WebSocket support
- Simple version: `multimodal/twilio/twilio_app_simple.py` - Simplified implementation
- Docker setup: `multimodal/twilio/README_DOCKER.md` - Detailed Docker instructions
- Audio analysis: `multimodal/twilio/audio_analysis.ipynb` - Audio processing notebook
- Cross-provider test: `multimodal/twilio/cross_provider_audio_test.py`
Dependencies: See multimodal/twilio/requirements.txt
Key Features:
- Real-time voice streaming via WebSockets
- Sub-500ms response times
- Twilio voice call integration
- Groq API for fast inference
- Audio recording and storage
- Docker support with ngrok integration
Environment Variables Needed:
- `GROQ_API_KEY` - For Groq API access
- `NGROK_AUTHTOKEN` - For public tunnel (optional)
- `PORT` - Server port (default: 5015)
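At a high level, the app accepts Twilio's media stream over a WebSocket, buffers the caller's audio, and streams a spoken reply back. The heavily simplified sketch below shows only the receiving side; the route name is an assumption, and the real `twilio_app.py` also handles transcription, the LLM call, and text-to-speech.

```python
# Heavily simplified receiving side of a Twilio media-stream WebSocket.
import json

from fastapi import FastAPI, WebSocket

app = FastAPI()


@app.websocket("/media")  # assumed route; Twilio would be pointed at this URL
async def media_stream(websocket: WebSocket):
    await websocket.accept()
    async for message in websocket.iter_text():
        frame = json.loads(message)
        if frame.get("event") == "media":
            # frame["media"]["payload"] holds base64 mu-law audio from the caller;
            # buffer it here, then transcribe and reply when the caller pauses.
            pass
```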
Description: Train embeddings that work at multiple dimensions. Dynamically trade speed for accuracy based on each query's needs.
Key Concepts: Matryoshka embeddings, multi-dimensional embeddings, dynamic dimension selection, speed/accuracy trade-offs
Colab Notebook:
- Matryoshka embeddings: Google Colab
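The payoff of Matryoshka training is that an embedding can be truncated to a shorter prefix at query time and renormalized, trading a little accuracy for a lot of speed. A sketch with sentence-transformers follows; the model name is a stand-in, so substitute the fine-tuned model from the Colab notebook.

```python
# Sketch: truncate an embedding to fewer dimensions and renormalize before search.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for a Matryoshka-trained model

full = model.encode(["How do I reset my password?"])  # shape (1, 384) for this model

for dim in (384, 128, 64):
    truncated = full[:, :dim]
    truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)
    print(dim, truncated.shape)  # smaller vectors -> faster search, some accuracy loss
```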
Most case studies use LangGraph for building agent workflows:
```python
from typing import Annotated, List

from pydantic import BaseModel
from langchain_core.messages import BaseMessage
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages


class WorkflowState(BaseModel):
    messages: Annotated[List[BaseMessage], add_messages]
    # ... other state fields


def process_node(state: WorkflowState) -> dict:
    # Each node returns a partial state update; add_messages merges message lists.
    return {"messages": []}


workflow = StateGraph(WorkflowState)
workflow.add_node("process", process_node)
workflow.add_edge(START, "process")  # entry point so the graph can compile
workflow.add_edge("process", END)
app = workflow.compile()
```

For case studies using MCP (Model Context Protocol):
```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("Server Name")


@mcp.tool()
def my_tool(param: str) -> str:
    """Tool description"""
    return f"Processed: {param}"  # replace with real tool logic


if __name__ == "__main__":
    mcp.run(transport="stdio")
```

Creating ReAct agents with LangGraph:
```python
from langchain.agents import create_agent
from langgraph.checkpoint.memory import MemorySaver
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-5.1")
tools = [tool1, tool2, tool3]  # your tool functions, e.g. decorated with @tool
checkpointer = MemorySaver()   # in-memory conversation state across turns

agent = create_agent(
    llm,
    tools,
    prompt=system_prompt,      # your system prompt string
    checkpointer=checkpointer,
)
```

Structured evaluation using Pydantic:
```python
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI


class ScoreResponse(BaseModel):
    reasoning: str = Field(description="Evaluation reasoning")
    score: int = Field(description="Score from 0-3")


llm = ChatOpenAI(model="gpt-4")
structured_llm = llm.with_structured_output(ScoreResponse)
```

API Key Errors:
- Ensure all required API keys are set in your `.env` file
- Check that environment variables are loaded correctly
- Verify API key format and permissions
Import Errors:
- Install dependencies from the case study's `requirements.txt`
- Ensure you're using the correct Python version (3.8+)
- Try reinstalling packages: `pip install --upgrade -r requirements.txt`
ChromaDB Issues:
- Clear the ChromaDB directory if you encounter corruption
- Ensure write permissions for the database directory
- Check disk space availability
Docker Issues:
- Ensure Docker is running: `docker ps`
- Check Docker Compose version: `docker-compose --version`
- Review logs: `docker-compose logs`
Notebook Issues:
- Restart kernel if cells hang
- Clear output and re-run cells in order
- Check that all required files are in the correct directories
- Check the specific case study's directory for additional README files
- Review error messages carefully—they often point to missing dependencies
- Ensure all environment variables are set correctly
- Verify Python version compatibility
- Follow PEP 8 for Python code
- Use type hints where possible
- Include docstrings for functions and classes
- Keep notebooks organized with clear markdown cells
- Create a new directory following the naming convention
- Include a `requirements.txt` file
- Add a README.md in the directory with setup instructions
- Update this main README.md with the new case study
- Include example code and test cases
- Run notebooks from top to bottom to ensure they work
- Test with different API keys/models where applicable
- Verify that results match expected outputs
- Include error handling in production code
Sinan Ozdemir is an AI entrepreneur, educator, and advisor. He holds a master's degree in pure mathematics and has founded startups, written multiple textbooks on AI, and guided venture-backed companies through deploying AI at scale. He currently serves as CTO at LoopGenius, where he leads teams building AI-driven automation systems, and continues to teach, write, and share knowledge on applied AI.
This repository contains code examples and implementations for the Building Agentic AI book. Please refer to the book for detailed explanations and context.
Last updated: 2025
