
Commit 7020f91

Enhance workshop chapters with new code examples and insights
- Added Python code examples for blueprint extraction and searching, improving practical application for users.
- Revised chapter titles and content for clarity, emphasizing the importance of modular API development and team collaboration.
- Incorporated discussions on synthetic data and evaluation metrics to drive better understanding of system improvements.
- Concluded the course with a focus on the significance of evaluations and user feedback in machine learning projects.
1 parent fc65cfc commit 7020f91

File tree

3 files changed: +116 −8 lines

docs/workshops/chapter6-1.md

Lines changed: 64 additions & 6 deletions
@@ -45,6 +45,46 @@ Once we extract that and put into a database, now we can think about querying th
In the first example, we define a blueprint extractor, which saves a date and a description. Then we can build a search blueprint model that searches over the description and optionally takes start and end dates. Finally, we define an execute method that builds the query and sends it off to the database.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

# An instructor-patched client lets us request structured outputs
# via `response_model`.
client = instructor.from_openai(OpenAI())


class BlueprintExtractor(BaseModel):
    description: str
    date: str | None = None


def extract_blueprint_description(image):
    """
    Extract the description of the blueprint from the image.
    """
    return client.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Extract the description of the blueprint from the image.",
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image}"},
                    },
                ],
            }
        ],
        response_model=BlueprintExtractor,
    )


def search_blueprint(description, start_date=None, end_date=None):
    """
    Search the blueprint database for a description,
    optionally constrained to a date range.
    """
    # `db` and `date` are stand-ins for your database client and date column.
    query = db.search(description)
    if start_date:
        query = query.where(date >= start_date)
    if end_date:
        query = query.where(date <= end_date)
    return query
```

## Building APIs for Your Language Model

With this simple tool, we can start testing whether the document we're looking for is actually returned for the arguments we specified.
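
As an illustration, a minimal pytest-style check might look like the sketch below. The query string, document ID, and `.all()` materialization are hypothetical stand-ins for your own data and database client.

```python
def test_search_blueprint_returns_expected_doc():
    # Does a known query surface the blueprint we expect?
    results = search_blueprint(
        "electrical layout for the east wing",
        start_date="2024-01-01",
    ).all()  # assuming a lazy query object with an .all() method
    assert "blueprint-142" in [doc.id for doc in results]
```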
@@ -63,13 +103,15 @@ Each of these retrieval requests should feel very much like a get request or a p
This is something that Model Context Protocol could support in the future: you build the tool once, expose it through the protocol once, and make it available to many interfaces, whether that's the Claude API or potentially Cursor.
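
As a sketch of what that could look like, here's a hypothetical MCP server exposing the blueprint search tool, using the `FastMCP` helper from the official `mcp` Python SDK. The server name, parameters, and delegation are illustrative, not a prescribed design.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("blueprint-search")


@mcp.tool()
def search_blueprint(
    description: str,
    start_date: str | None = None,
    end_date: str | None = None,
) -> list[dict]:
    """Search the blueprint database for a description, optionally by date range."""
    ...  # delegate to the search_blueprint implementation shown earlier


if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```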

## You're a Framework Developer Now

By defining these APIs, we separate our concerns, make the system far more modular, and allow bigger teams to work together.

Individual teams can work on specific APIs, whether it's our ability to search emails versus blueprints or schedules or something else, and that allows a bigger team to work together.

**You realize you're effectively a framework developer for the language model.**

From my own experience, I spent many years developing multiple microservices to do retrieval for other teams, and I think moving forward it's gonna feel a lot like building distributed microservices. The patterns are the same: clear interfaces, separation of concerns, team ownership. But now instead of serving other engineers, you're serving an AI that calls your functions.

## Adding More Capabilities

@@ -91,20 +133,36 @@ An answer could contain not only the response, but citations and sources and fol

And now when we execute a search query, we can send it to the search function, get back a list of queries, and then gather all the results. Then we can pass those results back into a language model that answers the question, and you can go forward from there.
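
A minimal sketch of that fan-out pattern; `run_tool` is a hypothetical async function that executes one routed query.

```python
import asyncio


async def execute_search(queries: list[str]) -> list:
    """Fan out: run every routed query concurrently, then flatten results."""
    # `run_tool` is a hypothetical async function that executes one
    # tool call and returns a list of results.
    batches = await asyncio.gather(*(run_tool(q) for q in queries))
    return [item for batch in batches for item in batch]
```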

## The Classic Architecture Pattern (Interface, Implementation, Gateway)

This might hark back to the old-school way of doing things, with interfaces that we can experiment with. They define the interactions of the tools with our client and with our backend. Then we have implementations of individual tools. And lastly, we have a gateway that puts it all together.

**And these boundaries will ultimately help you figure out how to split your team and your resources.**

Each team can experiment with a different aspect:

- **Interface team**: Explores the segmentation of the tools and figures out what the right interfaces are
- **Implementation team**: Runs experiments to improve the implementation of each one, improving the per-tool recall
- **Gateway team**: Tests the tools and sees how they can be connected and put together through the gateway router system

This separation of concerns is critical. One team could be improving SQL search while another is working on document retrieval, and they don't step on each other's toes. They can ship independently. A sketch of the three layers in code follows below.
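
To make the three layers concrete, here's a minimal, hypothetical sketch using Python protocols; the tool names, the `db` client, and the router's dispatch logic are all illustrative.

```python
from typing import Protocol


# Interface: the contract each tool exposes to the client and the router.
class SearchTool(Protocol):
    name: str

    def execute(self, query: str) -> list[dict]: ...


# Implementation: one team owns each concrete tool.
class BlueprintSearch:
    name = "search_blueprint"

    def execute(self, query: str) -> list[dict]:
        return db.search(query)  # `db` is a stand-in database client


# Gateway: routes a request to the right tool and aggregates results.
class Gateway:
    def __init__(self, tools: list[SearchTool]):
        self.tools = {tool.name: tool for tool in tools}

    def route(self, tool_name: str, query: str) -> list[dict]:
        return self.tools[tool_name].execute(query)
```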

And obviously we talked about the first two in sessions four and five.

## What's Next: The System Remains the Same

So this week we're mostly gonna be talking about how we can think about testing. And again, we're gonna go back to the same concepts of precision and recall. You can imagine creating a simple dataset in the beginning that records, for a given question, which tools were called.

Once we have a dataset that looks like this, we can go back to just doing precision and recall of tool selection.
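
A minimal sketch of that computation; the dataset format (expected vs. called tools per question) is assumed.

```python
def tool_selection_scores(expected: set[str], called: set[str]) -> tuple[float, float]:
    """Precision and recall of tool selection for one question."""
    if not called or not expected:
        return 0.0, 0.0
    hits = expected & called
    return len(hits) / len(called), len(hits) / len(expected)


# Example: we expected only search_blueprint, but the model called two tools.
precision, recall = tool_selection_scores(
    {"search_blueprint"}, {"search_blueprint", "search_text"}
)  # precision = 0.5, recall = 1.0
```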

And if this sounds similar, it's because it is. **The reason I called this course Systematically Improving RAG Applications is because we are applying this system over and over again.** I really want you to pause here and internalize this concept, because this system is what we repeat over and over again:

1. Start with synthetic data to produce query-to-tool or tool-to-query data
2. Create recall metrics
3. Iterate on the few-shot examples of each tool to improve recall or tool selection
4. Build data flywheels that continuously improve the system

Throughout this whole course, I'm just teaching the same thing over and over again. The system remains the same.
165+
## Metrics and evaluation

- Routing precision/recall and per-tool recall

docs/workshops/chapter6-2.md

Lines changed: 26 additions & 1 deletion
@@ -146,6 +146,10 @@ The kicker? When we added new tools, we just updated the prompt. No retraining,
Here's what makes routers actually work: good examples. Not many examples, GOOD examples.

And again, synthetic data can help you dramatically. If you have good descriptions of what these tools are, you can sample them to create queries that might trigger those tools. And if you feel you can't do that, chances are you don't have detailed enough prompts describing what these tools are supposed to do.

Don't be surprised if you find yourself writing prompts with 10 to 40 examples per tool. Prompt caching makes this very tractable, and in production I've often seen prompts carry large numbers of examples.
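
A hedged sketch of that synthetic-data loop, again using instructor for structured output; the prompt wording and model are illustrative, and `client` is the instructor-patched client from earlier.

```python
from pydantic import BaseModel


class SyntheticQueries(BaseModel):
    queries: list[str]


def queries_for_tool(tool_name: str, tool_description: str) -> list[str]:
    """Sample plausible user queries that should trigger a given tool."""
    return client.create(
        model="gpt-4o",  # illustrative
        response_model=SyntheticQueries,
        messages=[
            {
                "role": "user",
                "content": (
                    f"Tool `{tool_name}`: {tool_description}\n"
                    "Write 10 realistic user queries that this tool should handle."
                ),
            }
        ],
    ).queries
```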

### The Examples That Matter

After analyzing thousands of routing decisions, I found three types of examples that dramatically improve accuracy:
@@ -178,7 +182,11 @@ Focus your examples on that 19%. The 80% will work anyway, and the 1% isn't wort

## Dynamic Example Selection (When You Have Data)

Once you have real usage data, you can get fancy. This is the same approach we use in Text-to-SQL.

Initially, we might just want to hard-code 10 to 40 examples describing how each individual tool should be used, including examples of using tools in tandem. As things get more complex, we can apply the same approach used in Text-to-SQL, where we use search to fill in the ideal few-shot examples per tool.

Here's the pattern that worked for us:

```python
def get_relevant_examples(query: str, num_examples: int = 5):
    # (body truncated in this diff hunk)
    ...
```
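
The diff cuts the function off here. A minimal sketch of what the body might look like, assuming an embedding index over past successful tool calls; `index` and `embed` are hypothetical, not part of the original.

```python
def get_relevant_examples(query: str, num_examples: int = 5) -> list[dict]:
    """Retrieve the most similar past examples to use as few-shot prompts."""
    # `index` and `embed` are hypothetical: a vector index of past
    # (query, tool_call) pairs and an embedding function.
    matches = index.query(embed(query), top_k=num_examples)
    return [m.example for m in matches]
```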
@@ -225,6 +233,23 @@ I've seen both work brilliantly and both fail spectacularly. Pick based on your
- Routing precision/recall and per-tool recall
- Interface stability (breaking changes avoided over releases)

### The Per-Class Recall Problem

You can effectively treat each tool as a class in a classification task. It's not enough to look at the recall of the entire system; you need to evaluate whether specific tools are having challenges.

For example, imagine your overall recall is 65%, but when you compute the per-tool recall:

- SearchText: 90% recall (doing great!)
- SearchBlueprint: 20% recall (massive problem!)

Now you know where the problem is, and you can make a targeted intervention. Our job is to figure out whether giving more examples of SearchBlueprint helps the router learn when it should be called. A short sketch of the per-tool computation follows below.
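
A minimal sketch, assuming a list of labeled routing decisions pairing the expected tool with the tools actually called.

```python
from collections import defaultdict


def per_tool_recall(records: list[tuple[str, set[str]]]) -> dict[str, float]:
    """records: (expected_tool, tools_actually_called) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for expected, called in records:
        totals[expected] += 1
        if expected in called:
            hits[expected] += 1
    return {tool: hits[tool] / totals[tool] for tool in totals}


# e.g. {"SearchText": 0.9, "SearchBlueprint": 0.2}
```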

### Using Confusion Matrices

Once we have the confusion matrix, we can filter for these failure modes, pull that data out of the database, and look at those examples. A lot of the work is figuring out what to look at and where to look, then fixing those individual cases.
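
For instance, with scikit-learn, assuming `y_expected` and `y_predicted` are parallel lists of tool names:

```python
from sklearn.metrics import confusion_matrix

labels = ["SearchText", "SearchBlueprint"]  # your tool/class names
cm = confusion_matrix(y_expected, y_predicted, labels=labels)

# Rows are expected tools, columns are predicted tools; off-diagonal
# cells show which tool the router confused for which. Pull those
# examples out of the database and read them.
misrouted = [
    (exp, pred) for exp, pred in zip(y_expected, y_predicted) if exp != pred
]
```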

!!! warning "Data Leakage Alert"

    I also have to call out that once you use your test data to create few-shot examples, you have to be very cautious about data leakage, especially since many teams I work with start with only a couple dozen examples. The questions you evaluate on should not appear in the few-shot examples. Otherwise you'll dramatically overestimate whether the tools are actually working.
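
A cheap guardrail against that, assuming your few-shot examples and eval set both carry the raw query text (the structures here are hypothetical):

```python
# Fail fast if any evaluation question leaked into the few-shot examples.
overlap = {ex["query"] for ex in few_shot_examples} & {row["query"] for row in eval_set}
assert not overlap, f"Eval questions leaked into few-shot examples: {overlap}"
```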

## Implementation Patterns That Scale

Here are the patterns I use in every project now:

docs/workshops/chapter6-3.md

Lines changed: 26 additions & 1 deletion
@@ -316,6 +316,12 @@ That's success.

## The Path Forward

This generally concludes the course. Obviously there are gonna be many more office hours, and I'll still be on Slack for the remainder of the year to answer any questions. But what I hope to instill in you is that **you need evaluations**.

Way too many teams I work with have either no evaluations or a tiny set of evaluations, like 10 or 20 examples. But evaluations are critical to understanding how to improve your system. Evals are the dataset you use to inform your decision making.

Ideally you can change the way you run meetings so that your conversations are not just about how to make the AI better, but **how to move specific metrics**.

You now have all the pieces:

- Specialized retrievers for different content types

@@ -326,7 +332,18 @@ You now have all the pieces:

The secret isn't in any single component. It's in connecting them so the system gets better every week.

## The Fundamental Truth About Machine Learning

One of the biggest lessons I hope you take away is the value of synthetic data. Synthetic data and customer feedback are ultimately what you need to take your applications to the next level. They are the fundamental building blocks of creating good and successful machine learning products.

And if you refuse to believe this, you're ultimately condemning yourself to being lost and confused in this very hyped-up space of machine learning. And it's been the same every single time. **There are always going to be new companies, new technologies, and new frameworks with new names. But we're all more or less doing the same thing we have been doing for the past 20 years.**

Ultimately the process has been the same:

- A good product generates better evaluations, with strong user experience, good UI, and good expectation setting
- Better evaluations allow you to train and fine-tune models to create a better product
- Data analysis over your users (especially segmentation) tells you where to focus your product development efforts

And that process is ultimately what building and deploying a machine learning project is all about.

Remember: The goal isn't perfection. It's building something that improves faster than user needs grow. Nail that, and you've won.

@@ -340,6 +357,14 @@ Continue to [7. Production Considerations](chapter7.md)

- Use dual-mode interfaces (chat + direct tools) to improve training signals
- Instrument, automate, and close the loop so the system improves weekly

## Course Conclusion

This marks the end of our course. Please don't hesitate to give me any feedback, because my goal is to effectively convey the importance of having these strong fundamentals. A lot of this is product-oriented, because technology will always be changing.

If you think there's any way this course can be made better for future iterations, please let me know. If there are topics you wish I had covered but didn't, let me know, and I'll work on some additional videos for everyone for the remainder of the year.

Thank you everyone, and as always, we'll see you on Slack and at office hours.

---

If you want to get discounts and a 6-day email course on the topic, make sure to subscribe to
