Enhance workshop chapters with new code examples and insights
- Added Python code examples for blueprint extraction and searching, improving practical application for users.
- Revised chapter titles and content for clarity, emphasizing the importance of modular API development and team collaboration.
- Incorporated discussions on synthetic data and evaluation metrics to drive better understanding of system improvements.
- Concluded the course with a focus on the significance of evaluations and user feedback in machine learning projects.
docs/workshops/chapter6-1.md (64 additions, 6 deletions)
@@ -45,6 +45,46 @@ Once we extract that and put into a database, now we can think about querying th
In the first example, we define a blueprint extractor, which saves a date and a description. From there we can build a search blueprint model that searches the description and potentially takes start and end dates. Then we define an execute method that builds the query and sends it off to the database.
```python
from pydantic import BaseModel

import instructor
from openai import OpenAI

# Assumed setup: an instructor-patched OpenAI client so we can request structured output.
client = instructor.from_openai(OpenAI())


class BlueprintExtractor(BaseModel):
    description: str
    date: str | None = None


def extract_blueprint_description(image: str):
    """
    Extract the description of the blueprint from the image.
    """
    return client.create(
        model="gpt-4o",  # illustrative model name
        response_model=BlueprintExtractor,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Extract the description of the blueprint from the image.",
                    },
                    # Assumes `image` is a URL or base64 data URI.
                    {"type": "image_url", "image_url": {"url": image}},
                ],
            }
        ],
    )
```
With this simple tool, we can start testing whether or not the document we're looking for is returned for the arguments that we specified.
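The search side described above might look something like this minimal sketch; the `SearchBlueprint` fields and the `db.search_blueprints` helper are illustrative assumptions, not the course's actual code:

```python
from datetime import date

from pydantic import BaseModel


class SearchBlueprint(BaseModel):
    """Search tool the language model fills in when it wants to query blueprints."""

    description: str
    start_date: date | None = None
    end_date: date | None = None

    def execute(self, db):
        # Build the query from the extracted arguments and send it off to the database.
        # `db.search_blueprints` is a hypothetical helper; swap in your own store.
        return db.search_blueprints(
            text=self.description,
            start_date=self.start_date,
            end_date=self.end_date,
        )
```

The language model only fills in the arguments; `execute` owns how the query is actually built and sent to the database.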
@@ -63,13 +103,15 @@ Each of these retrieval requests should feel very much like a get request or a p
This is something that Model Context Protocol could support in the future: it would allow you to build this tool once, expose it over the protocol once, and make it available to many interfaces, whether that's the Claude API or potentially Cursor.
-## Modular API Development Benefits
+## You're a Framework Developer Now
By defining these APIs, we separate our concerns, make the system much more modular, and allow bigger teams to work together.
-Individual teams can work on specific APIs, whether it's our ability to search emails versus blueprints or schedules or something else. And it allows us to have a bigger team work together. You realize you're effectively a framework developer for the language model.
+Individual teams can work on specific APIs, whether it's our ability to search emails versus blueprints or schedules or something else. And it allows us to have a bigger team work together.
+
+**You realize you're effectively a framework developer for the language model.**
-From my own experience, I spent many years developing multiple microservices to do retrieval for other teams, and I think moving forward it's gonna feel a lot like building distributed microservices.
+From my own experience, I spent many years developing multiple microservices to do retrieval for other teams, and I think moving forward it's gonna feel a lot like building distributed microservices. The patterns are the same - clear interfaces, separation of concerns, team ownership. But now instead of serving other engineers, you're serving an AI that calls your functions.
## Adding More Capabilities
@@ -91,20 +133,36 @@ An answer could contain not only the response, but citations and sources and fol
And now when we execute a search query, we can send it to the search function, get back a list of queries, and then gather all the results. We can then pass these results back into a language model that answers the question, and you can go forward from here.
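A rough sketch of that flow, reusing the hedged `client` from earlier; `generate_queries` and the `Answer` shape are assumptions for illustration:

```python
from pydantic import BaseModel


class Answer(BaseModel):
    response: str
    citations: list[str] = []


def answer_question(question: str, db) -> Answer:
    # Hypothetical router step: ask the model which searches to run and get back
    # a list of SearchBlueprint-style query objects.
    queries = generate_queries(question)

    # Gather all the results from every tool call.
    results = [result for query in queries for result in query.execute(db)]

    # Pass the results back into a language model that answers the question.
    return client.create(
        model="gpt-4o",  # illustrative
        response_model=Answer,
        messages=[
            {
                "role": "user",
                "content": f"Answer the question using these results.\n"
                f"Question: {question}\n"
                f"Results: {results}",
            }
        ],
    )
```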
-## The Classic Architecture Pattern
+## The Classic Architecture Pattern (Interface, Implementation, Gateway)
This might harken back to the old-school way of doing things with interfaces that we can experiment with. They can define the interactions of the tools with our client and with our backend. Then we have implementations of individual tools. And lastly, we have a gateway that puts it all together.
-And these boundaries will ultimately help you figure out how to split your team and your resources. Each team can experiment with a different aspect of the interface, the implementation, and the gateway. One team could explore the segmentation of the tools and figure out what the right interfaces are. Another can run experiments to improve the implementation of each one, improving the per tool recall. And then the last team, for example, can test the tools and see how they can be connected and put together through the gateway router system.
+**And these boundaries will ultimately help you figure out how to split your team and your resources.**
+
+Each team can experiment with a different aspect:
+
+- **Interface team**: Explores the segmentation of the tools and figures out what the right interfaces are
+- **Implementation team**: Runs experiments to improve the implementation of each one, improving the per-tool recall
+- **Gateway team**: Tests the tools and sees how they can be connected and put together through the gateway router system
+
+This separation of concerns is critical. One team could be improving SQL search while another is working on document retrieval, and they don't step on each other's toes. They can ship independently.
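To make the three layers concrete, here is a minimal sketch under the assumption that every tool implements a shared protocol; all class and method names are illustrative:

```python
from typing import Protocol

from pydantic import BaseModel


class SearchResult(BaseModel):
    source: str
    content: str


# Interface: the contract each tool team agrees on.
class SearchTool(Protocol):
    name: str

    def execute(self, query: str) -> list[SearchResult]: ...


# Implementation: one team owns blueprint search, another owns email search, and so on.
class BlueprintSearch:
    name = "search_blueprint"

    def execute(self, query: str) -> list[SearchResult]:
        # Hypothetical database call; this is where a team tunes per-tool recall.
        return [SearchResult(source="blueprints", content=f"results for {query!r}")]


# Gateway: the router that puts it all together and decides which tool handles a call.
class Gateway:
    def __init__(self, tools: list[SearchTool]):
        self.tools = {tool.name: tool for tool in tools}

    def route(self, tool_name: str, query: str) -> list[SearchResult]:
        return self.tools[tool_name].execute(query)
```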
And obviously we talked about the first two in sessions four and five.
-## What's Next
+## What's Next: The System Remains the Same
So this week we're mostly gonna be talking about how we can think about testing. And again, we're gonna go back to the same concepts of precision and recall. You can imagine creating a simple data set in the beginning that is just: for a certain question, what were the tools being called?
Once we have a data set that looks like this, we can go back to just doing precision and recall of tool selection.
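A minimal sketch of that evaluation, treating tool selection as a set comparison per question; the dataset rows and tool names are made up for illustration:

```python
def tool_selection_scores(expected: set[str], predicted: set[str]) -> tuple[float, float]:
    """Precision and recall of the tools chosen for a single question."""
    true_positives = len(expected & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical labeled dataset: for a certain question, which tools should be called?
dataset = [
    {"question": "Find the blueprint from last March", "expected": {"search_blueprint"}},
    {"question": "Any emails about the permit?", "expected": {"search_email"}},
]

for row in dataset:
    predicted = {"search_blueprint"}  # stand-in for the router's actual tool calls
    precision, recall = tool_selection_scores(row["expected"], predicted)
    print(row["question"], f"precision={precision:.2f}", f"recall={recall:.2f}")
```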
And if this sounds similar, it's because it is. **The reason I called this course Systematically Improving RAG Applications is because we are applying this system over and over again.** I really want you to pause here and internalize this concept, because this is the system we repeat:
1. Start with synthetic data to produce query-to-tool or tool-to-query data
2. Create recall metrics
3. Iterate on the few-shot examples of each tool to improve recall or tool selection
4. Build data flywheels that continuously improve the system
Throughout this whole course, I'm just teaching the same thing over and over again. The system remains the same.
docs/workshops/chapter6-2.md (26 additions, 1 deletion)
@@ -146,6 +146,10 @@ The kicker? When we added new tools, we just updated the prompt. No retraining,
Here's what makes routers actually work: good examples. Not many examples - GOOD examples.
And again, synthetic data can help you dramatically. If you have good descriptions of what these tools are, then you can potentially randomly sample them to create queries that might trigger those tools. And if you feel you can't do that, chances are you don't have detailed enough prompts on what these tools are supposed to do.
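A sketch of that idea, assuming the same instructor-style `client` as before and a hypothetical dictionary of tool descriptions:

```python
import random

from pydantic import BaseModel

# Hypothetical tool descriptions; in practice, reuse the docstrings from your tool definitions.
TOOL_DESCRIPTIONS = {
    "search_blueprint": "Search construction blueprints by description and date range.",
    "search_email": "Search project emails by sender, subject, and date.",
}


class SyntheticQuery(BaseModel):
    query: str


def generate_synthetic_queries(n: int = 20) -> list[dict]:
    samples = []
    for _ in range(n):
        tool, description = random.choice(list(TOOL_DESCRIPTIONS.items()))
        result = client.create(
            model="gpt-4o",  # illustrative
            response_model=SyntheticQuery,
            messages=[
                {
                    "role": "user",
                    "content": (
                        f"The tool `{tool}` does the following: {description}\n"
                        "Write one realistic user question that should trigger this tool."
                    ),
                }
            ],
        )
        samples.append({"query": result.query, "expected_tool": tool})
    return samples
```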
Don't be surprised if you see yourself making prompts with 10 to 40 examples per tool. Prompt caching makes this very tractable, and in production cases I've often seen tons of examples used.
### The Examples That Matter
After analyzing thousands of routing decisions, I found three types of examples that dramatically improve accuracy:
@@ -178,7 +182,11 @@ Focus your examples on that 19%. The 80% will work anyway, and the 1% isn't wort
## Dynamic Example Selection (When You Have Data)
-Once you have real usage data, you can get fancy. Here's the pattern that worked for us:
+Once you have real usage data, you can get fancy. This is the same approach we use in Text-to-SQL.
+
+Initially, we might just want to hard code 10 to 40 examples to describe how each individual tool should be used, and include examples of using tools in tandem. As we get more complex, we can apply the same approach used in Text-to-SQL, where we can use search to fill in the ideal few-shot examples per tool.
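One way that could look, assuming you keep an embedding index of past (query, tool) routing decisions; `index.search` is a stand-in for whatever vector store you use:

```python
from pydantic import BaseModel


class RoutingExample(BaseModel):
    query: str
    tool: str


def select_few_shot_examples(question: str, index, k: int = 5) -> list[RoutingExample]:
    """Pull the most similar past routing decisions to use as few-shot examples."""
    hits = index.search(question, limit=k)  # hypothetical vector store call
    return [RoutingExample(query=hit["query"], tool=hit["tool"]) for hit in hits]


def build_router_prompt(question: str, examples: list[RoutingExample]) -> str:
    shots = "\n".join(f"Q: {example.query}\nTool: {example.tool}" for example in examples)
    return f"{shots}\n\nQ: {question}\nTool:"
```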
@@ -225,6 +233,23 @@ I've seen both work brilliantly and both fail spectacularly. Pick based on your
- Routing precision/recall and per-tool recall
- Interface stability (breaking changes avoided over releases)
### The Per-Class Recall Problem
You can effectively just consider each tool as some kind of class in a classification task. It's not enough just to look at recall of the entire system - you need to evaluate whether specific tools are having challenges.
For example, imagine your overall recall is 65%, but when you compute the per-tool recall:
- SearchText: 90% recall (doing great!)
- SearchBlueprint: 20% recall (massive problem!)
Now you know where the problem is, and you can make a targeted intervention. Our job is to figure out whether more examples of SearchBlueprint will help the model figure out when it should be called.
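A minimal sketch of that per-tool breakdown, treating each tool as a class; the rows here are hypothetical:

```python
from collections import defaultdict

# Each row records which tools should have been called and which the router actually called.
rows = [
    {"expected": {"SearchText"}, "predicted": {"SearchText"}},
    {"expected": {"SearchBlueprint"}, "predicted": {"SearchText"}},
    {"expected": {"SearchBlueprint", "SearchText"}, "predicted": {"SearchText"}},
]

hits = defaultdict(int)
totals = defaultdict(int)
for row in rows:
    for tool in row["expected"]:
        totals[tool] += 1
        hits[tool] += int(tool in row["predicted"])

for tool, total in totals.items():
    print(f"{tool}: recall={hits[tool] / total:.0%} over {total} examples")
```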
### Using Confusion Matrices
Once we have the confusion matrix, we can filter for these failure modes, pull that data out of the database, and just look at those examples. A lot of it is just figuring out what to look at, where to look, and fixing those individual cases.
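For single-tool routing decisions, a pandas crosstab is often enough to surface where the confusion is; the column names below are assumptions about how you log routing decisions:

```python
import pandas as pd

# Hypothetical log of routing decisions pulled from the database.
df = pd.DataFrame(
    {
        "expected_tool": ["SearchText", "SearchBlueprint", "SearchBlueprint"],
        "predicted_tool": ["SearchText", "SearchText", "SearchBlueprint"],
    }
)

# Rows: what should have been called. Columns: what was actually called.
confusion = pd.crosstab(df["expected_tool"], df["predicted_tool"])
print(confusion)

# Pull out just the failure cases so you can read through them one by one.
failures = df[df["expected_tool"] != df["predicted_tool"]]
print(failures)
```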
!!! warning "Data Leakage Alert"
    I also have to call out that once you use your test data to create few-shot examples, you have to be very cautious of data leakage, especially since many teams I work with only start with a couple dozen examples. The questions you evaluate on should not also appear as few-shot examples; otherwise you're gonna dramatically overestimate your ability to figure out whether or not the tools are actually working.
docs/workshops/chapter6-3.md (26 additions, 1 deletion)
@@ -316,6 +316,12 @@ That's success.
## The Path Forward
This generally concludes the course. Obviously there's gonna be many more office hours, and I'll still be on Slack for the remainder of the year to answer any questions. But what I hope to instill in you is that **you need evaluations**.
Way too many teams I work with have either no evaluations or a tiny set of evaluations, like 10 or 20 examples. But evaluations are critical to understanding how to improve your system. Evals represent the dataset you can use to inform your decision making.
Ideally you can change the way you run meetings so that your conversations are not just about how to make the AI better, but **how to move specific metrics**.
You now have all the pieces:
- Specialized retrievers for different content types
@@ -326,7 +332,18 @@ You now have all the pieces:
The secret isn't in any single component. It's in connecting them so the system gets better every week.
-Next week in Chapter 7, we'll talk about taking this to production. But first, get your measurement in place. You can't improve what you don't measure.
+## The Fundamental Truth About Machine Learning
One of the biggest lessons I hope you take away is the value of synthetic data. Synthetic data and customer feedback are ultimately what you need to take your applications to the next level. They are the fundamental building blocks of creating good, successful machine learning products.
And if you refuse to believe this, you're ultimately condemning yourself to being lost and confused in this very hyped up space of machine learning. And it's been the same every single time. **There are always going to be new companies, new technologies, and new frameworks with new names. But we're all more or less doing the same thing we have been doing for the past 20 years.**
Ultimately the process has been the same:
- A good product, with strong user experience, good UI, and good expectation setting, generates better evaluations
- Better evaluations allow you to train and fine-tune models to create a better product
- Data analysis over your users (especially segmentation) tells you where to focus your product development efforts
And that process is ultimately what building and deploying a machine learning based project is all about.
Remember: The goal isn't perfection. It's building something that improves faster than user needs grow. Nail that, and you've won.
@@ -340,6 +357,14 @@ Continue to [7. Production Considerations](chapter7.md)
- Use dual-mode interfaces (chat + direct tools) to improve training signals
- Instrument, automate, and close the loop so the system improves weekly
## Course Conclusion
This marks the end of our course. Please don't hesitate to give me any feedback, because my goal is to effectively convey the importance of having these strong fundamentals. A lot of this is gonna be product-oriented because technology will always be changing.
If you think there's any way this course can be made better for future iterations, please let me know. If there are topics you wish I had covered but didn't, let me know, and I'll work on some additional videos for everyone for the remainder of the year.
Thank you everyone, and as always, we'll see you on Slack and at office hours.
---
If you want to get discounts and a 6-day email course on the topic, make sure to subscribe to