Enhance workshop chapters with new code examples and insights
- Added Python code examples for blueprint extraction and searching, improving practical application for users.
- Revised chapter titles and content for clarity, emphasizing the importance of modular API development and team collaboration.
- Incorporated discussions on synthetic data and evaluation metrics to drive better understanding of system improvements.
- Concluded the course with a focus on the significance of evaluations and user feedback in machine learning projects.
docs/workshops/chapter6-1.md (64 additions, 6 deletions)
@@ -45,6 +45,46 @@ Once we extract that and put into a database, now we can think about querying th
In the first example, we define a blueprint extractor, which saves a date and a description. From there we can build a search blueprint model that searches the description and potentially takes start and end dates. Then we define an execute method that builds the query and sends it off to the database.
```python
from pydantic import BaseModel

import instructor
from openai import OpenAI

# Assumed setup: an instructor-patched OpenAI client so we can request structured output.
client = instructor.from_openai(OpenAI())


class BlueprintExtractor(BaseModel):
    description: str
    date: str | None = None


def extract_blueprint_description(image: str):
    """
    Extract the description of the blueprint from the image.
    """
    return client.create(
        model="gpt-4o",  # illustrative model name
        response_model=BlueprintExtractor,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Extract the description of the blueprint from the image.",
                    },
                    # Assumes `image` is a URL or base64 data URI.
                    {"type": "image_url", "image_url": {"url": image}},
                ],
            }
        ],
    )
```
With this simple tool, we can start testing whether or not the document we're looking for is returned for the arguments that we specified.
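The search side described above might look something like this minimal sketch; the `SearchBlueprint` fields and the `db.search_blueprints` helper are illustrative assumptions, not the course's actual code:

```python
from datetime import date

from pydantic import BaseModel


class SearchBlueprint(BaseModel):
    """Search tool the language model fills in when it wants to query blueprints."""

    description: str
    start_date: date | None = None
    end_date: date | None = None

    def execute(self, db):
        # Build the query from the extracted arguments and send it off to the database.
        # `db.search_blueprints` is a hypothetical helper; swap in your own store.
        return db.search_blueprints(
            text=self.description,
            start_date=self.start_date,
            end_date=self.end_date,
        )
```

The language model only fills in the arguments; `execute` owns how the query is actually built and sent to the database.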
@@ -63,13 +103,15 @@ Each of these retrieval requests should feel very much like a get request or a p
This is something that Model Context Protocol could support in the future: it would allow you to build this tool once, expose it over the protocol once, and make it available to many interfaces, whether that's the Claude API or potentially Cursor.
-## Modular API Development Benefits
+## You're a Framework Developer Now
By defining these APIs, we separate our concerns, make the system much more modular, and allow bigger teams to work together.
-Individual teams can work on specific APIs, whether it's our ability to search emails versus blueprints or schedules or something else. And it allows us to have a bigger team work together. You realize you're effectively a framework developer for the language model.
+Individual teams can work on specific APIs, whether it's our ability to search emails versus blueprints or schedules or something else. And it allows us to have a bigger team work together.
+
+**You realize you're effectively a framework developer for the language model.**
-From my own experience, I spent many years developing multiple microservices to do retrieval for other teams, and I think moving forward it's gonna feel a lot like building distributed microservices.
+From my own experience, I spent many years developing multiple microservices to do retrieval for other teams, and I think moving forward it's gonna feel a lot like building distributed microservices. The patterns are the same - clear interfaces, separation of concerns, team ownership. But now instead of serving other engineers, you're serving an AI that calls your functions.
## Adding More Capabilities
@@ -91,20 +133,36 @@ An answer could contain not only the response, but citations and sources and fol
And now when we execute a search query, we can send it to the search function, get back a list of queries, and then gather all the results. We can then pass these results back into a language model that answers the question, and you can go forward from here.
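A rough sketch of that flow, reusing the hedged `client` from earlier; `generate_queries` and the `Answer` shape are assumptions for illustration:

```python
from pydantic import BaseModel


class Answer(BaseModel):
    response: str
    citations: list[str] = []


def answer_question(question: str, db) -> Answer:
    # Hypothetical router step: ask the model which searches to run and get back
    # a list of SearchBlueprint-style query objects.
    queries = generate_queries(question)

    # Gather all the results from every tool call.
    results = [result for query in queries for result in query.execute(db)]

    # Pass the results back into a language model that answers the question.
    return client.create(
        model="gpt-4o",  # illustrative
        response_model=Answer,
        messages=[
            {
                "role": "user",
                "content": f"Answer the question using these results.\n"
                f"Question: {question}\n"
                f"Results: {results}",
            }
        ],
    )
```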
-## The Classic Architecture Pattern
+## The Classic Architecture Pattern (Interface, Implementation, Gateway)
This might harken back to the old-school way of doing things with interfaces that we can experiment with. They can define the interactions of the tools with our client and with our backend. Then we have implementations of individual tools. And lastly, we have a gateway that puts it all together.
-And these boundaries will ultimately help you figure out how to split your team and your resources. Each team can experiment with a different aspect of the interface, the implementation, and the gateway. One team could explore the segmentation of the tools and figure out what the right interfaces are. Another can run experiments to improve the implementation of each one, improving the per tool recall. And then the last team, for example, can test the tools and see how they can be connected and put together through the gateway router system.
+**And these boundaries will ultimately help you figure out how to split your team and your resources.**
+
+Each team can experiment with a different aspect:
+
+- **Interface team**: Explores the segmentation of the tools and figures out what the right interfaces are
+- **Implementation team**: Runs experiments to improve the implementation of each one, improving the per-tool recall
+- **Gateway team**: Tests the tools and sees how they can be connected and put together through the gateway router system
+
+This separation of concerns is critical. One team could be improving SQL search while another is working on document retrieval, and they don't step on each other's toes. They can ship independently.
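To make the three layers concrete, here is a minimal sketch under the assumption that every tool implements a shared protocol; all class and method names are illustrative:

```python
from typing import Protocol

from pydantic import BaseModel


class SearchResult(BaseModel):
    source: str
    content: str


# Interface: the contract each tool team agrees on.
class SearchTool(Protocol):
    name: str

    def execute(self, query: str) -> list[SearchResult]: ...


# Implementation: one team owns blueprint search, another owns email search, and so on.
class BlueprintSearch:
    name = "search_blueprint"

    def execute(self, query: str) -> list[SearchResult]:
        # Hypothetical database call; this is where a team tunes per-tool recall.
        return [SearchResult(source="blueprints", content=f"results for {query!r}")]


# Gateway: the router that puts it all together and decides which tool handles a call.
class Gateway:
    def __init__(self, tools: list[SearchTool]):
        self.tools = {tool.name: tool for tool in tools}

    def route(self, tool_name: str, query: str) -> list[SearchResult]:
        return self.tools[tool_name].execute(query)
```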
And obviously we talked about the first two in sessions four and five.
-## What's Next
+## What's Next: The System Remains the Same
So this week we're mostly gonna be talking about how we can think about testing. And again, we're gonna go back to the same concepts of precision and recall. You can imagine creating a simple data set in the beginning that is just: for a certain question, what were the tools being called?
Once we have a data set that looks like this, we can go back to just doing precision and recall of tool selection.
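A minimal sketch of that evaluation, treating tool selection as a set comparison per question; the dataset rows and tool names are made up for illustration:

```python
def tool_selection_scores(expected: set[str], predicted: set[str]) -> tuple[float, float]:
    """Precision and recall of the tools chosen for a single question."""
    true_positives = len(expected & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical labeled dataset: for a certain question, which tools should be called?
dataset = [
    {"question": "Find the blueprint from last March", "expected": {"search_blueprint"}},
    {"question": "Any emails about the permit?", "expected": {"search_email"}},
]

for row in dataset:
    predicted = {"search_blueprint"}  # stand-in for the router's actual tool calls
    precision, recall = tool_selection_scores(row["expected"], predicted)
    print(row["question"], f"precision={precision:.2f}", f"recall={recall:.2f}")
```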
And if this sounds similar, it's because it is. **The reason I called this course Systematically Improving RAG Applications is because we are applying this system over and over again.** I really want you to pause here and internalize this concept, because this is the system we repeat:
1. Start with synthetic data to produce query-to-tool or tool-to-query data
2. Create recall metrics
3. Iterate on the few-shot examples of each tool to improve recall or tool selection
4. Build data flywheels that continuously improve the system
Throughout this whole course, I'm just teaching the same thing over and over again. The system remains the same.
docs/workshops/chapter6-2.md (26 additions, 1 deletion)
@@ -146,6 +146,10 @@ The kicker? When we added new tools, we just updated the prompt. No retraining,
Here's what makes routers actually work: good examples. Not many examples - GOOD examples.
And again, synthetic data can help you dramatically. If you have good descriptions of what these tools are, then you can potentially randomly sample them to create queries that might trigger those tools. And if you feel you can't do that, chances are you don't have detailed enough prompts on what these tools are supposed to do.
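A sketch of that idea, assuming the same instructor-style `client` as before and a hypothetical dictionary of tool descriptions:

```python
import random

from pydantic import BaseModel

# Hypothetical tool descriptions; in practice, reuse the docstrings from your tool definitions.
TOOL_DESCRIPTIONS = {
    "search_blueprint": "Search construction blueprints by description and date range.",
    "search_email": "Search project emails by sender, subject, and date.",
}


class SyntheticQuery(BaseModel):
    query: str


def generate_synthetic_queries(n: int = 20) -> list[dict]:
    samples = []
    for _ in range(n):
        tool, description = random.choice(list(TOOL_DESCRIPTIONS.items()))
        result = client.create(
            model="gpt-4o",  # illustrative
            response_model=SyntheticQuery,
            messages=[
                {
                    "role": "user",
                    "content": (
                        f"The tool `{tool}` does the following: {description}\n"
                        "Write one realistic user question that should trigger this tool."
                    ),
                }
            ],
        )
        samples.append({"query": result.query, "expected_tool": tool})
    return samples
```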
Don't be surprised if you see yourself making prompts with 10 to 40 examples per tool. Prompt caching makes this very tractable, and in production cases I've often seen tons of examples used.
### The Examples That Matter
After analyzing thousands of routing decisions, I found three types of examples that dramatically improve accuracy:
@@ -178,7 +182,11 @@ Focus your examples on that 19%. The 80% will work anyway, and the 1% isn't wort
## Dynamic Example Selection (When You Have Data)
-Once you have real usage data, you can get fancy. Here's the pattern that worked for us:
+Once you have real usage data, you can get fancy. This is the same approach we use in Text-to-SQL.
+
+Initially, we might just want to hard code 10 to 40 examples to describe how each individual tool should be used, and include examples of using tools in tandem. As we get more complex, we can apply the same approach used in Text-to-SQL, where we can use search to fill in the ideal few-shot examples per tool.
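One way that could look, assuming you keep an embedding index of past (query, tool) routing decisions; `index.search` is a stand-in for whatever vector store you use:

```python
from pydantic import BaseModel


class RoutingExample(BaseModel):
    query: str
    tool: str


def select_few_shot_examples(question: str, index, k: int = 5) -> list[RoutingExample]:
    """Pull the most similar past routing decisions to use as few-shot examples."""
    hits = index.search(question, limit=k)  # hypothetical vector store call
    return [RoutingExample(query=hit["query"], tool=hit["tool"]) for hit in hits]


def build_router_prompt(question: str, examples: list[RoutingExample]) -> str:
    shots = "\n".join(f"Q: {example.query}\nTool: {example.tool}" for example in examples)
    return f"{shots}\n\nQ: {question}\nTool:"
```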
@@ -225,6 +233,23 @@ I've seen both work brilliantly and both fail spectacularly. Pick based on your
- Routing precision/recall and per-tool recall
- Interface stability (breaking changes avoided over releases)
### The Per-Class Recall Problem
You can effectively just consider each tool as some kind of class in a classification task. It's not enough just to look at recall of the entire system - you need to evaluate whether specific tools are having challenges.
For example, imagine your overall recall is 65%, but when you compute the per-tool recall:
- SearchText: 90% recall (doing great!)
- SearchBlueprint: 20% recall (massive problem!)
Now you know where the problem is, and you can make a targeted intervention. Our job is to figure out whether more examples of SearchBlueprint will help the model figure out when it should be called.
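A minimal sketch of that per-tool breakdown, treating each tool as a class; the rows here are hypothetical:

```python
from collections import defaultdict

# Each row records which tools should have been called and which the router actually called.
rows = [
    {"expected": {"SearchText"}, "predicted": {"SearchText"}},
    {"expected": {"SearchBlueprint"}, "predicted": {"SearchText"}},
    {"expected": {"SearchBlueprint", "SearchText"}, "predicted": {"SearchText"}},
]

hits = defaultdict(int)
totals = defaultdict(int)
for row in rows:
    for tool in row["expected"]:
        totals[tool] += 1
        hits[tool] += int(tool in row["predicted"])

for tool, total in totals.items():
    print(f"{tool}: recall={hits[tool] / total:.0%} over {total} examples")
```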
### Using Confusion Matrices
Once we have the confusion matrix, we can filter for these failure modes, pull that data out of the database, and just look at those examples. A lot of it is just figuring out what to look at, where to look, and fixing those individual cases.
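For single-tool routing decisions, a pandas crosstab is often enough to surface where the confusion is; the column names below are assumptions about how you log routing decisions:

```python
import pandas as pd

# Hypothetical log of routing decisions pulled from the database.
df = pd.DataFrame(
    {
        "expected_tool": ["SearchText", "SearchBlueprint", "SearchBlueprint"],
        "predicted_tool": ["SearchText", "SearchText", "SearchBlueprint"],
    }
)

# Rows: what should have been called. Columns: what was actually called.
confusion = pd.crosstab(df["expected_tool"], df["predicted_tool"])
print(confusion)

# Pull out just the failure cases so you can read through them one by one.
failures = df[df["expected_tool"] != df["predicted_tool"]]
print(failures)
```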
!!! warning "Data Leakage Alert"
    I also have to call out that once you use your test data to create few-shot examples, you have to be very cautious of data leakage, especially since many teams I work with only start with a couple dozen examples. The questions you evaluate on should not also appear as few-shot examples; otherwise you're gonna dramatically overestimate your ability to figure out whether or not the tools are actually working.
docs/workshops/chapter6-3.md (26 additions, 1 deletion)
@@ -316,6 +316,12 @@ That's success.
## The Path Forward
This generally concludes the course. Obviously there's gonna be many more office hours, and I'll still be on Slack for the remainder of the year to answer any questions. But what I hope to instill in you is that **you need evaluations**.
Way too many teams I work with have either no evaluations or a tiny set of evaluations, like 10 or 20 examples. But evaluations are critical to understanding how to improve your system. Evals represent the dataset you can use to inform your decision making.
Ideally you can change the way you run meetings so that your conversations are not just about how to make the AI better, but **how to move specific metrics**.
You now have all the pieces:
- Specialized retrievers for different content types
@@ -326,7 +332,18 @@ You now have all the pieces:
The secret isn't in any single component. It's in connecting them so the system gets better every week.
-Next week in Chapter 7, we'll talk about taking this to production. But first, get your measurement in place. You can't improve what you don't measure.
+## The Fundamental Truth About Machine Learning
One of the biggest lessons I hope you take away is the value of synthetic data. Synthetic data and customer feedback are ultimately what you need to take your applications to the next level. They are the fundamental building blocks of creating good, successful machine learning products.
And if you refuse to believe this, you're ultimately condemning yourself to being lost and confused in this very hyped up space of machine learning. And it's been the same every single time. **There are always going to be new companies, new technologies, and new frameworks with new names. But we're all more or less doing the same thing we have been doing for the past 20 years.**
Ultimately the process has been the same:
- A good product, with strong user experience, good UI, and good expectation setting, generates better evaluations
- Better evaluations allow you to train and fine-tune models to create a better product
- Data analysis over your users (especially segmentation) tells you where to focus your product development efforts
And that process is ultimately what building and deploying a machine learning based project is all about.
Remember: The goal isn't perfection. It's building something that improves faster than user needs grow. Nail that, and you've won.
@@ -340,6 +357,14 @@ Continue to [7. Production Considerations](chapter7.md)
- Use dual-mode interfaces (chat + direct tools) to improve training signals
- Instrument, automate, and close the loop so the system improves weekly
## Course Conclusion
This marks the end of our course. Please don't hesitate to give me any feedback, because my goal is to effectively convey the importance of having these strong fundamentals. A lot of this is gonna be product-oriented because technology will always be changing.
If you think there's any way this course can be made better for future iterations, please let me know. If there are topics you wish I had covered but didn't, let me know, and I'll work on some additional videos for everyone for the remainder of the year.
Thank you everyone, and as always, we'll see you on Slack and at office hours.
---
If you want to get discounts and a 6-day email course on the topic, make sure to subscribe to