<p><strong>You can't improve what you can't measure—and you can measure before you have users.</strong> Synthetic data isn't just a stopgap until real users arrive. It's a powerful tool for establishing baselines, testing edge cases, and building the evaluation infrastructure that will power continuous improvement. Start with retrieval metrics (precision and recall), not generation quality, because they're faster, cheaper, and more objective.</p>
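<p>To make that concrete, here's a minimal sketch of how precision@k and recall@k can be computed against a labeled evaluation set. The function names and data shapes are illustrative, not tied to any particular framework.</p>
<div class="highlight"><pre><code># Minimal sketch: precision@k and recall@k for a retrieval eval set.
# Data shapes are illustrative; adapt them to your retriever's output.

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that show up in the top-k results."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / max(len(relevant_ids), 1)

# One synthetic question with known relevant chunk ids:
retrieved = ["c7", "c2", "c9", "c4", "c1"]   # what your retriever returned
relevant = {"c2", "c4", "c8"}                # what the label says it should return

print(precision_at_k(retrieved, relevant, k=5))  # 0.4
print(recall_at_k(retrieved, relevant, k=5))     # roughly 0.67
</code></pre></div>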
<div class="admonition info">
<p class="admonition-title">Learn the Complete RAG Playbook</p>
<p>All of this content comes from my <a href="https://maven.com/applied-llms/rag-playbook?promoCode=EBOOK">Systematically Improving RAG Applications</a> course. Readers get <strong>20% off</strong> with code EBOOK.</p>
<p><strong>Join 500+ engineers</strong> who've transformed their RAG systems from demos to production-ready applications. Previous cohort participants work at companies like HubSpot, Zapier, and numerous AI startups - from seed stage to $100M+ valuations.</p>
</div>
<p>Alright, let's talk about making RAG applications actually work. Most teams I work with are stuck in this weird loop where they keep tweaking things randomly and hoping something sticks. Sound familiar?</p>
<p>Here's what we're going to cover: how to set up evaluations that actually tell you something useful, common ways teams shoot themselves in the foot (and how to avoid them), and how to use synthetic data to test your system before you even have users.</p>
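<p>As a preview of the synthetic-data idea, here's a rough sketch of generating evaluation questions from your own document chunks with an LLM. The model name and prompt wording are assumptions you would adapt, not a prescribed setup.</p>
<div class="highlight"><pre><code># Rough sketch: generate synthetic eval questions from your own chunks.
# The model name and prompt wording are assumptions; adapt to your stack.
from openai import OpenAI

client = OpenAI()

def synthetic_questions(chunk_text, n=3):
    """Ask an LLM for questions that this chunk should be able to answer."""
    prompt = (
        f"Here is a passage from our documentation:\n\n{chunk_text}\n\n"
        f"Write {n} realistic questions a user might ask that this passage answers. "
        "Return one question per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap in whatever you use
        messages=[{"role": "user", "content": prompt}],
    )
    return [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]

# Each (question, chunk_id) pair becomes an evaluation example:
# the retriever should return chunk_id when asked the question.
</code></pre></div>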
<h2id="common-pitfalls-in-ai-development">Common Pitfalls in AI Development</h2>
<p>After consulting with dozens of companies - from seed-stage AI startups to businesses valued at $100M+ - I keep seeing the same patterns. I've seen teams hire ML engineers only to realize they weren't logging any data, then have to wait 3-6 months to collect it. Let me walk you through these patterns so you don't make the same mistakes.</p>
<p>I can't tell you how many times I hear "we need more complex reasoning" or "the model isn't smart enough." Nine times out of ten, that's not the problem. The real issue? You don't actually know what your users want.</p>
<p>Think about it - when was the last time you actually sat down and looked at what your users are asking for?</p>
<p>If you're like most teams, the answer is "uhh..." And that's the problem. You end up building these generic tools that don't solve any specific problem particularly well.</p>
<p>Here's another one that drives me crazy. Teams will spend weeks changing things and then evaluate success by asking "does it look better?" or "does it feel right?"</p>
<p><strong>Real Example</strong>: I've worked with companies valued at $100 million that had fewer than 30 evaluation examples in total. When something broke or improved, they had no idea what actually changed or why.</p>
<p>Without concrete metrics, you get stuck in this loop:</p>
<ol>
<li>Make random changes based on gut feeling</li>
<h3id="the-1-leading-metric-experiment-velocity">The #1 Leading Metric: Experiment Velocity</h3>
<p>If I had to pick one metric for early-stage RAG applications, it's this: how many experiments are you running?</p>
<p>Instead of asking "did the last change improve things?" in standup, ask "how can we run twice as many experiments next week?" What infrastructure would help? What's blocking us from testing more ideas?</p>
<p><strong>Real Impact</strong>: Teams that focus on experiment velocity often see 6-10% improvements in recall with just hundreds of dollars in API calls - work that previously required tens of thousands in data labeling costs.</p>
<p>This shift from outcomes to velocity changes everything.</p>
<h2id="absence-blindness-and-intervention-bias">Absence Blindness and Intervention Bias</h2>
<p>These two biases kill more RAG projects than anything else.</p>
<h3 id="case-study-1-report-generation-from-expert-interviews">Case Study 1: Report Generation from Expert Interviews</h3>
<p>A client generates reports from user research interviews. Consultants do 15-30 interviews and want AI-generated summaries.</p>
<p><strong>Problem</strong>: Reports were missing quotes. A consultant knew 6 experts said something similar, but the report only cited 3. That 50% recall rate killed trust.</p>
<p><strong>Solution</strong>: We built manual evaluation sets from problematic examples. Turns out, better text chunking fixed most issues.</p>
<p><strong>Result</strong>: Recall went from 50% to 90% in a few iterations - a 40 percentage point improvement that customers noticed immediately. This kind of measurable improvement builds trust and enables continued partnership.</p>
<p><strong>Lesson</strong>: Pre-processing that matches how users query can dramatically improve retrieval.</p>
<h3id="case-study-2-blueprint-search-for-construction">Case Study 2: Blueprint Search for Construction</h3>
<p>Another client needed AI search for construction blueprints - workers asking questions about building plans.</p>
<p><strong>Problem</strong>: Only 27% recall when finding the right blueprint for questions.</p>
<p><strong>Solution</strong>: We used a vision model to create detailed captions for blueprints, including hypothetical questions users might ask.</p>
<p><strong>Result</strong>: Four days later, recall jumped from 27% to 85% - a 58 percentage point improvement. Once live, we discovered 20% of queries involved counting objects, which justified investing in bounding box models for those specific use cases.</p>
<p><strong>Lesson</strong>: Test subsystems independently for rapid improvements. Synthetic data for specific use cases works great.</p>
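<p>For the blueprint case, the captioning step might look something like the sketch below. The model name and prompt are assumptions for illustration, not the client's actual pipeline; the point is that the caption (a detailed description plus hypothetical questions) is what gets embedded and searched.</p>
<div class="highlight"><pre><code># Hedged sketch of the captioning idea from Case Study 2: use a vision model
# to describe each blueprint and to list hypothetical questions it could answer.
# Model name and prompt are assumptions, not the client's actual pipeline.
import base64
from openai import OpenAI

client = OpenAI()

def caption_blueprint(image_path):
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable vision model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Describe this construction blueprint in detail: rooms, dimensions, "
                    "materials, annotations. Then list 5 questions a site worker might "
                    "ask that this sheet answers."
                )},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# The returned caption is indexed alongside the blueprint, so text queries
# can match documents that are otherwise image-only.
</code></pre></div>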
<p><strong>Chunk Size Best Practices:</strong></p>
<p>Start with 800 tokens and 50% overlap. This works for most use cases.</p>
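<p>In code, 800-token chunks with 50% overlap might look like this minimal sketch. The tiktoken encoding name is an assumption; use whatever tokenizer matches your embedding model.</p>
<div class="highlight"><pre><code># Minimal sketch: 800-token chunks with 50% overlap using tiktoken.
import tiktoken

def chunk_text(text, chunk_tokens=800, overlap=0.5):
    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
    tokens = enc.encode(text)
    step = int(chunk_tokens * (1 - overlap))    # 400-token stride = 50% overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_tokens]
        chunks.append(enc.decode(window))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
</code></pre></div>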
<p>This dramatic difference explains why embedding fine-tuning should be your first focus.</p>
<h2id="introduction">Introduction</h2>
<p>Remember in Chapter 1 where we talked about that $100M company with only 30 evaluation examples? Well, here's the good news: once you have those evaluation examples, you can multiply their value. The synthetic data and evaluation framework from Chapter 1 becomes your training data in this chapter.</p>
<p><strong>Building on Chapter 1's Foundation:</strong>
Your evaluation examples (synthetic questions + ground truth) now become few-shot examples and training data. We're turning that evaluation flywheel into a fine-tuning flywheel.</p>
<p>Here's the thing: the data you collect for evaluation shouldn't just sit there. Every question, every relevance judgment, every piece of feedback—it can all be used to improve your system. That's what we'll cover here.</p>
<p><strong>Key Philosophy:</strong> "This is the 'wax on, wax off' moment: 20 examples become evals (Chapter 1), 30 examples become few-shot prompts, 1,000 examples let you start fine-tuning. Remember that $100M company with 30 evals? Once you have that data, this is how you turn it into actual improvements. It's never done; it just keeps getting better."</p>
<p>The process is straightforward: you start with evaluation examples, turn them into few-shot prompts, then eventually use them to fine-tune your embedding models and re-rankers. Each step builds on the last.</p>
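<p>Here's a hedged sketch of that last step: turning evaluation pairs into fine-tuning data for an embedding model with sentence-transformers. The base model name, batch size, and other hyperparameters are assumptions for illustration, not recommendations.</p>
<div class="highlight"><pre><code># Sketch of the flywheel step: evaluation pairs become fine-tuning data.
# Uses sentence-transformers' fit API; the base model name is an assumption.
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

# Each eval example is (synthetic question, the chunk it should retrieve).
eval_pairs = [
    ("How do I reset my password?", "To reset your password, open Settings and ..."),
    ("What is the refund window?", "Refunds are accepted within 30 days of purchase ..."),
    # ... a few hundred to a few thousand of these
]

train_examples = [InputExample(texts=[q, chunk]) for q, chunk in eval_pairs]
loader = DataLoader(train_examples, shuffle=True, batch_size=32)

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # assumed base model
loss = losses.MultipleNegativesRankingLoss(model)     # in-batch negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("fine-tuned-retriever")
</code></pre></div>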
<h2id="why-generic-embeddings-fall-short">Why Generic Embeddings Fall Short</h2>
<p>Let me start with something that trips up a lot of teams: generic embeddings from providers like OpenAI often don't work great for specialized applications. They're good models, don't get me wrong. But they're built to handle everything, which means they don't handle your specific thing particularly well.</p>
<h3 id="the-elusive-nature-of-similarity">The Elusive Nature of "Similarity"</h3>
<p>Embedding models seem simple enough: they turn text into numbers, and similar text should end up with similar numbers. Measure the distance between vectors and you know how similar things are.</p>
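<p>In code, that premise looks something like this tiny sketch (the model name is just an example; any embedding model shows the same idea):</p>
<div class="highlight"><pre><code># Tiny illustration of the premise: similar text should get nearby vectors.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
sentences = [
    "How do I reset my password?",
    "I forgot my login credentials",
    "The weather is nice today",
]
vectors = model.encode(sentences)

print(util.cos_sim(vectors[0], vectors[1]))  # related intents: relatively high
print(util.cos_sim(vectors[0], vectors[2]))  # unrelated topics: low
</code></pre></div>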
<p><strong>Domain-Specific Similarity Example:</strong> In e-commerce, what makes two products "similar"? Are they substitutes (different brands of red shirts) or complements (a shirt and matching pants)? Depends on what you're trying to do.</p>
<p>Take music recommendations. Songs might be similar because they're the same genre, or because they show up in the same playlists, or because the same people like them. If you're adding songs to a playlist, you want one kind of similarity. If you're building Spotify's Discovery Weekly, you want something else entirely.</p>
<p>Take dating apps - should "I love coffee" and "I hate coffee" be similar? The sentiments are opposite, but both people care enough about coffee to mention it. A generic embedding only sees the words; it has no way of knowing which signal actually matters for matching people. This is exactly the kind of nuance you miss without domain-specific fine-tuning.</p>
<p>Here's the thing: <strong>What actually matters for a dating app is whether two people will like each other</strong>, not whether their profiles use similar words. Generic embeddings trained on web text have no idea about this.</p>
<p>The problem is that "similarity" means different things in different contexts. There's no universal right answer—it depends on what you're trying to do.</p>
<h3id="the-hidden-assumptions-in-provider-models">The Hidden Assumptions in Provider Models</h3>