<p><strong>You can't improve what you can't measure—and you can measure before you have users.</strong> Synthetic data isn't just a stopgap until real users arrive. It's a powerful tool for establishing baselines, testing edge cases, and building the evaluation infrastructure that will power continuous improvement. Start with retrieval metrics (precision and recall), not generation quality, because they're faster, cheaper, and more objective.</p>
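<p>To make that concrete, here's a minimal sketch of how precision@k and recall@k can be computed against a labeled evaluation set. The function names and data shapes are illustrative, not tied to any particular framework.</p>
<div class="highlight"><pre><code># Minimal sketch: precision@k and recall@k for a retrieval eval set.
# Data shapes are illustrative; adapt them to your retriever's output.

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that show up in the top-k results."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / max(len(relevant_ids), 1)

# One synthetic question with known relevant chunk ids:
retrieved = ["c7", "c2", "c9", "c4", "c1"]   # what your retriever returned
relevant = {"c2", "c4", "c8"}                # what the label says it should return

print(precision_at_k(retrieved, relevant, k=5))  # 0.4
print(recall_at_k(retrieved, relevant, k=5))     # roughly 0.67
</code></pre></div>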
<div class="admonition info">
<p class="admonition-title">Learn the Complete RAG Playbook</p>
<p>All of this content comes from my <a href="https://maven.com/applied-llms/rag-playbook?promoCode=EBOOK">Systematically Improving RAG Applications</a> course. Readers get <strong>20% off</strong> with code EBOOK.</p>
<p><strong>Join 500+ engineers</strong> who've transformed their RAG systems from demos to production-ready applications. Previous cohort participants work at companies like HubSpot, Zapier, and numerous AI startups - from seed stage to $100M+ valuations.</p>
</div>
<p>Alright, let's talk about making RAG applications actually work. Most teams I work with are stuck in this weird loop where they keep tweaking things randomly and hoping something sticks. Sound familiar?</p>
<p>Here's what we're going to cover: how to set up evaluations that actually tell you something useful, common ways teams shoot themselves in the foot (and how to avoid them), and how to use synthetic data to test your system before you even have users.</p>
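<p>As a preview of the synthetic-data idea, here's a rough sketch of generating evaluation questions from your own document chunks with an LLM. The model name and prompt wording are assumptions you would adapt, not a prescribed setup.</p>
<div class="highlight"><pre><code># Rough sketch: generate synthetic eval questions from your own chunks.
# The model name and prompt wording are assumptions; adapt to your stack.
from openai import OpenAI

client = OpenAI()

def synthetic_questions(chunk_text, n=3):
    """Ask an LLM for questions that this chunk should be able to answer."""
    prompt = (
        f"Here is a passage from our documentation:\n\n{chunk_text}\n\n"
        f"Write {n} realistic questions a user might ask that this passage answers. "
        "Return one question per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap in whatever you use
        messages=[{"role": "user", "content": prompt}],
    )
    return [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]

# Each (question, chunk_id) pair becomes an evaluation example:
# the retriever should return chunk_id when asked the question.
</code></pre></div>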
<h2id="common-pitfalls-in-ai-development">Common Pitfalls in AI Development</h2>
<p>After consulting with dozens of companies - from seed-stage AI startups to businesses valued at $100M+ - I keep seeing the same patterns. I've seen teams hire ML engineers only to realize they weren't logging any data, then have to wait 3-6 months to collect it. Let me walk you through these patterns so you don't make the same mistakes.</p>
<p>I can't tell you how many times I hear "we need more complex reasoning" or "the model isn't smart enough." Nine times out of ten, that's not the problem. The real issue? You don't actually know what your users want.</p>
<p>Think about it - when was the last time you actually sat down and looked at what your users are asking for?</p>
<p>If you're like most teams, the answer is "uhh..." And that's the problem. You end up building these generic tools that don't solve any specific problem particularly well.</p>
<p>Here's another one that drives me crazy. Teams will spend weeks changing things and then evaluate success by asking "does it look better?" or "does it feel right?"</p>
<p><strong>Real Example</strong>: I've worked with companies valued at $100 million that had fewer than 30 evaluation examples in total. When something broke or improved, they had no idea what actually changed or why.</p>
<p>Without concrete metrics, you get stuck in this loop:</p>
<ol>
<li>Make random changes based on gut feeling</li>
<h3id="the-1-leading-metric-experiment-velocity">The #1 Leading Metric: Experiment Velocity</h3>
<p>If I had to pick one metric for early-stage RAG applications, it's this: how many experiments are you running?</p>
<p>Instead of asking "did the last change improve things?" in standup, ask "how can we run twice as many experiments next week?" What infrastructure would help? What's blocking us from testing more ideas?</p>
<p><strong>Real Impact</strong>: Teams that focus on experiment velocity often see 6-10% improvements in recall with just hundreds of dollars in API calls - work that previously required tens of thousands in data labeling costs.</p>
<p>This shift from outcomes to velocity changes everything.</p>
<h2id="absence-blindness-and-intervention-bias">Absence Blindness and Intervention Bias</h2>
<p>These two biases kill more RAG projects than anything else.</p>
<h3 id="case-study-1-report-generation-from-expert-interviews">Case Study 1: Report Generation from Expert Interviews</h3>
<p>A client generates reports from user research interviews. Consultants do 15-30 interviews and want AI-generated summaries.</p>
<p><strong>Problem</strong>: Reports were missing quotes. A consultant knew 6 experts said something similar, but the report only cited 3. That 50% recall rate killed trust.</p>
<p><strong>Solution</strong>: We built manual evaluation sets from problematic examples. Turns out, better text chunking fixed most issues.</p>
<p><strong>Result</strong>: Recall went from 50% to 90% in a few iterations - a 40 percentage point improvement that customers noticed immediately. This kind of measurable improvement builds trust and enables continued partnership.</p>
<p><strong>Lesson</strong>: Pre-processing that matches how users query can dramatically improve retrieval.</p>
<h3id="case-study-2-blueprint-search-for-construction">Case Study 2: Blueprint Search for Construction</h3>
<p>Another client needed AI search for construction blueprints - workers asking questions about building plans.</p>
<p><strong>Problem</strong>: Only 27% recall when finding the right blueprint for questions.</p>
<p><strong>Solution</strong>: We used a vision model to create detailed captions for blueprints, including hypothetical questions users might ask.</p>
<p><strong>Result</strong>: Four days later, recall jumped from 27% to 85% - a 58 percentage point improvement. Once live, we discovered 20% of queries involved counting objects, which justified investing in bounding box models for those specific use cases.</p>
<p><strong>Lesson</strong>: Test subsystems independently for rapid improvements. Synthetic data for specific use cases works great.</p>
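<p>For the blueprint case, the captioning step might look something like the sketch below. The model name and prompt are assumptions for illustration, not the client's actual pipeline; the point is that the caption (a detailed description plus hypothetical questions) is what gets embedded and searched.</p>
<div class="highlight"><pre><code># Hedged sketch of the captioning idea from Case Study 2: use a vision model
# to describe each blueprint and to list hypothetical questions it could answer.
# Model name and prompt are assumptions, not the client's actual pipeline.
import base64
from openai import OpenAI

client = OpenAI()

def caption_blueprint(image_path):
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable vision model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Describe this construction blueprint in detail: rooms, dimensions, "
                    "materials, annotations. Then list 5 questions a site worker might "
                    "ask that this sheet answers."
                )},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# The returned caption is indexed alongside the blueprint, so text queries
# can match documents that are otherwise image-only.
</code></pre></div>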
<p><strong>Chunk Size Best Practices:</strong></p>
<p>Start with 800 tokens and 50% overlap. This works for most use cases.</p>
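<p>In code, 800-token chunks with 50% overlap might look like this minimal sketch. The tiktoken encoding name is an assumption; use whatever tokenizer matches your embedding model.</p>
<div class="highlight"><pre><code># Minimal sketch: 800-token chunks with 50% overlap using tiktoken.
import tiktoken

def chunk_text(text, chunk_tokens=800, overlap=0.5):
    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
    tokens = enc.encode(text)
    step = int(chunk_tokens * (1 - overlap))    # 400-token stride = 50% overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_tokens]
        chunks.append(enc.decode(window))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
</code></pre></div>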
<p>This dramatic difference explains why embedding fine-tuning should be your first focus.</p>
<h2id="introduction">Introduction</h2>
<p>Remember in Chapter 1 where we talked about that $100M company with only 30 evaluation examples? Well, here's the good news: once you have those evaluation examples, you can multiply their value. The synthetic data and evaluation framework from Chapter 1 becomes your training data in this chapter.</p>
<p><strong>Building on Chapter 1's Foundation:</strong>
Your evaluation examples (synthetic questions + ground truth) now become few-shot examples and training data. We're turning that evaluation flywheel into a fine-tuning flywheel.</p>
<p>Here's the thing: the data you collect for evaluation shouldn't just sit there. Every question, every relevance judgment, every piece of feedback—it can all be used to improve your system. That's what we'll cover here.</p>
<p><strong>Key Philosophy:</strong> "This is the 'wax on, wax off' moment: 20 examples become evals (Chapter 1), 30 examples become few-shot prompts, 1,000 examples let you start fine-tuning. Remember that $100M company with 30 evals? Once you have that data, this is how you turn it into actual improvements. It's never done; it just keeps getting better."</p>
<p>The process is straightforward: you start with evaluation examples, turn them into few-shot prompts, then eventually use them to fine-tune your embedding models and re-rankers. Each step builds on the last.</p>
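<p>Here's a hedged sketch of that last step: turning evaluation pairs into fine-tuning data for an embedding model with sentence-transformers. The base model name, batch size, and other hyperparameters are assumptions for illustration, not recommendations.</p>
<div class="highlight"><pre><code># Sketch of the flywheel step: evaluation pairs become fine-tuning data.
# Uses sentence-transformers' fit API; the base model name is an assumption.
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

# Each eval example is (synthetic question, the chunk it should retrieve).
eval_pairs = [
    ("How do I reset my password?", "To reset your password, open Settings and ..."),
    ("What is the refund window?", "Refunds are accepted within 30 days of purchase ..."),
    # ... a few hundred to a few thousand of these
]

train_examples = [InputExample(texts=[q, chunk]) for q, chunk in eval_pairs]
loader = DataLoader(train_examples, shuffle=True, batch_size=32)

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # assumed base model
loss = losses.MultipleNegativesRankingLoss(model)     # in-batch negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("fine-tuned-retriever")
</code></pre></div>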
<h2id="why-generic-embeddings-fall-short">Why Generic Embeddings Fall Short</h2>
<p>Let me start with something that trips up a lot of teams: generic embeddings from providers like OpenAI often don't work great for specialized applications. They're good models, don't get me wrong. But they're built to handle everything, which means they don't handle your specific thing particularly well.</p>
<h3 id="the-elusive-nature-of-similarity">The Elusive Nature of "Similarity"</h3>
<p>Embedding models seem simple enough: they turn text into numbers, and similar text should end up with similar numbers. Measure the distance between vectors and you know how similar things are.</p>
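<p>In code, that premise looks something like this tiny sketch (the model name is just an example; any embedding model shows the same idea):</p>
<div class="highlight"><pre><code># Tiny illustration of the premise: similar text should get nearby vectors.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
sentences = [
    "How do I reset my password?",
    "I forgot my login credentials",
    "The weather is nice today",
]
vectors = model.encode(sentences)

print(util.cos_sim(vectors[0], vectors[1]))  # related intents: relatively high
print(util.cos_sim(vectors[0], vectors[2]))  # unrelated topics: low
</code></pre></div>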
<p><strong>Domain-Specific Similarity Example:</strong> In e-commerce, what makes two products "similar"? Are they substitutes (different brands of red shirts) or complements (a shirt and matching pants)? Depends on what you're trying to do.</p>
<p>Take music recommendations. Songs might be similar because they're the same genre, or because they show up in the same playlists, or because the same people like them. If you're adding songs to a playlist, you want one kind of similarity. If you're building Spotify's Discovery Weekly, you want something else entirely.</p>
<p>Take dating apps - should "I love coffee" and "I hate coffee" be similar? The sentiments are opposite, but both people care enough about coffee to mention it. A generic embedding only sees the words; it has no way of knowing which signal actually matters for matching people. This is exactly the kind of nuance you miss without domain-specific fine-tuning.</p>
<p>Here's the thing: <strong>What actually matters for a dating app is whether two people will like each other</strong>, not whether their profiles use similar words. Generic embeddings trained on web text have no idea about this.</p>
<p>The problem is that "similarity" means different things in different contexts. There's no universal right answer—it depends on what you're trying to do.</p>
<h3id="the-hidden-assumptions-in-provider-models">The Hidden Assumptions in Provider Models</h3>