Update documentation for improved clarity and consistency
- Added missing newlines and improved formatting in multiple README and summary files to enhance readability.
- Standardized the use of markdown headers and emphasized key points for better organization across various documents.
- Ensured consistent phrasing and structure in FAQs and instructional content to align with established style guidelines.
- Enhanced the overall clarity of the documentation, making it easier for users to navigate and understand the material.
cohort_2/office-hours/week1-summary.md (9 additions, 5 deletions)
@@ -16,7 +16,7 @@ Another good use of DSpy is around using LLMs as judges. If you have a tonality
## Is it useful to prompt language models with an understanding of structure and rationale for their actions?

-Yes, absolutely. Understanding structure and rationale is critical because your product includes the ways you collect feedback, set expectations in the UI, perform data extraction, and represent chunks in the context.
+Yes, absolutely. Understanding structure and rationale is critical because your product includes the ways you collect feedback, set expectations in the UI, perform data extraction, and represent chunks in the context.

It's not just about the prompt—it's a whole system. And if you can spend time looking at how the model makes mistakes and what users are asking for, you'll make much more progress in improving the product holistically.
@@ -35,6 +35,7 @@ The general idea is to use structured extraction to identify start and end dates
In my 10 years of doing data science and machine learning, I generally stay away from any kind of graph modeling. The reason is that every time I've seen a company go into this graph-based world, within 4-5 years they decide to move back to a PostgreSQL database.

There are several issues with graph databases:
+
1. They're really hard to learn - it's much easier to hire talent that knows PostgreSQL than graph databases.
2. Defining schemas and joins in PostgreSQL is well-defined, whereas in graph databases there's often too much debate and not enough best practices.
3. Most cases don't require more than one or two traversals of your graph.
@@ -180,6 +181,7 @@ Another advantage is that LanceDB can be hosted on S3 and is easy to set up for
## Which industry or application domain do you think is most difficult for LLMs?

It's hard to say definitively, but generally:
+
1. Tasks with complex images are difficult
2. Highly regulated industries like legal and healthcare contexts present challenges
3. Financial services, especially ratings agencies, face enormous regulatory hurdles
@@ -217,6 +219,7 @@ For visual content like photographs, CLIP embeddings work well since they're inh
For instructional manuals with images, I'd pass the images to a language model and ask for a detailed summary of what the image shows, including all text in the image. Then embed that summary instead. This creates a text representation that points to the original image.

The approach has two steps:
+
1. Given an image, create a synthetic question that would retrieve it
2. Create a summary that would be retrieved for that question
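A minimal sketch of that two-step idea using the OpenAI Python SDK. The model names, prompts, and image URL are illustrative assumptions, not part of the course material:

```python
from openai import OpenAI

client = OpenAI()
image_url = "https://example.com/manual-page-3.png"  # hypothetical manual page

def ask_about_image(prompt: str) -> str:
    # Send the image plus an instruction to a vision-capable chat model.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content

# Step 1: a synthetic question this image should be retrieved for.
question = ask_about_image("Write one question a user might ask that this image answers.")

# Step 2: a detailed summary (including all visible text) to embed instead of the image.
summary = ask_about_image("Describe this image in detail and transcribe any text in it.")

embedding = client.embeddings.create(model="text-embedding-3-small", input=summary)
# Index (embedding, summary, image_url); retrieval returns the summary, which points back to the image.
```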
@@ -262,6 +265,7 @@ Over a 12-year career, we kept trying different technologies (Hadoop, Spark, etc
Prompt caching is a technique where language models can avoid reprocessing the beginning of prompts that are often identical.

Different providers handle this differently:
+
- Anthropic caches prompts for 5 minutes; if you make the same request within that time, the entire message is cached
- OpenAI figures out the optimal prefix to cache automatically
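As a rough illustration of the Anthropic behavior described above, a long shared system prompt can be marked cacheable with `cache_control`; the model name and prompt text below are placeholder assumptions:

```python
import anthropic

client = anthropic.Anthropic()
long_system_prompt = "...several thousand tokens of shared instructions, schemas, and examples..."

def ask(question: str):
    return client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed model name
        max_tokens=512,
        system=[{
            "type": "text",
            "text": long_system_prompt,
            # Mark this block as a cacheable prefix; identical requests within the
            # cache window reuse it instead of reprocessing those tokens.
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": question}],
    )

first = ask("Summarize the onboarding steps.")    # writes the cache
second = ask("List the supported file formats.")  # reuses the cached prefix
```

On the OpenAI side no flag is needed, since the prefix to cache is chosen automatically.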
@@ -279,7 +283,7 @@ Reducto's performance comes from having people manually label thousands of PDFs,
## How does Brain Trust work with the notebooks in this course?

-Brain Trust just saves the results that your laptop is running locally. It's not executing anything or using a better database—it's more like an observability tool (similar to Datadog).
+Brain Trust just saves the results that your laptop is running locally. It's not executing anything or using a better database—it's more like an observability tool (similar to Datadog).

When we run the notebooks, everything is running on your laptop in LanceDB. The only thing Brain Trust sees is row IDs and scores. Think of it as a powerful UI over a database that's saving your logs, not as a computation service.
@@ -306,19 +310,19 @@ This is especially useful when you need embeddings to understand domain-specific
## How do you understand metrics like precision and recall in one-to-one answer scenarios?

-For questions with exactly one correct answer, these metrics behave somewhat differently. Recall will be either 0% or 100% depending on whether K is large enough to include the correct answer.
+For questions with exactly one correct answer, these metrics behave somewhat differently. Recall will be either 0% or 100% depending on whether K is large enough to include the correct answer.

For example, if we want to retrieve exactly one document and there's only one correct answer, precision could be either 0% or 100%, and the same for recall.

The metrics become more meaningful when:
+
1. There are multiple relevant documents
2. We're analyzing trends across many queries
3. We're comparing different retrieval methods

Even with one-to-one mappings, MRR (Mean Reciprocal Rank) is still useful to see where the correct answer appears in your results.

-What really matters isn't the absolute number but whether we can move these metrics in a positive direction with our interventions. It's like weighing yourself—the absolute number may vary by scale, but if you've gained two pounds, you've definitely gained two pounds.
----
+## What really matters isn't the absolute number but whether we can move these metrics in a positive direction with our interventions. It's like weighing yourself—the absolute number may vary by scale, but if you've gained two pounds, you've definitely gained two pounds.
IF you want to get discounts and 6 day email source on the topic make sure to subscribe to
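To make the one-correct-answer case above concrete, here is a small plain-Python sketch of recall@K and MRR over a hypothetical evaluation set:

```python
def recall_at_k(retrieved_ids, gold_id, k):
    # With exactly one correct answer, recall@K is 0 or 1:
    # did the gold document appear in the top K results?
    return 1.0 if gold_id in retrieved_ids[:k] else 0.0

def reciprocal_rank(retrieved_ids, gold_id):
    # 1/rank of the gold document, or 0 if it was never retrieved.
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0

# Hypothetical runs: (retrieved ids in ranked order, gold id).
runs = [
    (["d7", "d2", "d9"], "d2"),
    (["d1", "d4", "d8"], "d8"),
    (["d3", "d5", "d6"], "d0"),  # the correct answer was missed entirely
]

avg_recall_at_3 = sum(recall_at_k(r, g, 3) for r, g in runs) / len(runs)  # ~0.67
mrr = sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)             # ~0.28
print(avg_recall_at_3, mrr)
```

The absolute values matter less than whether an intervention moves them, which is the weighing-yourself point made above.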
cohort_2/office-hours/week2-summary.md (5 additions, 2 deletions)
@@ -110,6 +110,7 @@ Over a 12-year career, we kept trying different technologies (Hadoop, Spark, etc
## What have you learned about prompt caching?

Prompt caching is a technique where language models can avoid reprocessing the beginning of prompts that are often identical:
+
- Anthropic caches prompts for 5 minutes; if you make the same request within that time, the entire message is cached
- OpenAI figures out the optimal prefix to cache automatically
@@ -184,11 +185,12 @@ If you can reorganize text chunks by clustering and bringing related information
## How do you understand metrics like precision and recall in one-to-one answer scenarios?

-For questions with exactly one correct answer, these metrics behave somewhat differently. Recall will be either 0% or 100% depending on whether K is large enough to include the correct answer.
+For questions with exactly one correct answer, these metrics behave somewhat differently. Recall will be either 0% or 100% depending on whether K is large enough to include the correct answer.

For example, if we want to retrieve exactly one document and there's only one correct answer, precision could be either 0% or 100%, and the same for recall.

The metrics become more meaningful when:
+
1. There are multiple relevant documents
2. We're analyzing trends across many queries
3. We're comparing different retrieval methods
@@ -326,7 +328,8 @@ This approach helps ensure reliability across different types of function callin
9. **Research pragmatism**: Focus on solving specific problems with your data rather than chasing the latest research papers, which often reinvent existing techniques.

-10. **Cross-encoders vs. bi-encoders**: Cross-encoders (re-rankers) understand semantic distinctions better but are slower; bi-encoders (embedding models) are faster but less nuanced. Use both for optimal performance.
+10. **Cross-encoders vs. bi-encoders**: Cross-encoders (re-rankers) understand semantic distinctions better but are slower; bi-encoders (embedding models) are faster but less nuanced. Use both for optimal performance.
+

---
IF you want to get discounts and 6 day email source on the topic make sure to subscribe to
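A rough sketch of the two-stage pattern in point 10, using the sentence-transformers library; the model names and toy corpus are assumptions for illustration:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Bi-encoder: embeds query and documents independently, so it scales to large corpora.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
# Cross-encoder: reads (query, document) pairs jointly; slower but more precise.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = ["How to rotate API keys", "Billing and invoices", "Resetting your password"]
query = "I lost access to my account"

corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
candidates = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]  # fast first pass

pairs = [(query, corpus[hit["corpus_id"]]) for hit in candidates]
scores = cross_encoder.predict(pairs)  # precise second pass over a handful of candidates
reranked = sorted(zip(pairs, scores), key=lambda x: x[1], reverse=True)
for (q, doc), score in reranked:
    print(f"{score:.3f}  {doc}")
```

The bi-encoder narrows a large corpus to a handful of candidates cheaply; the cross-encoder then spends its extra compute only on those candidates.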
cohort_2/office-hours/week3-summary.md (10 additions, 4 deletions)
@@ -27,6 +27,7 @@ The real goal isn't to get a number right - it's to figure out what to do next.
## Can you elaborate on your view on RAG versus recommendations? How would you approach the use case of friend suggestions?

When you build a recommendation system, there are several steps:
+
1. **Sourcing** - What inventory can I show my customer? In the friends case, this would be all users on the platform.
2. **Query** - Either your user ID or a question embedding.
3. **Scoring** - For simple RAG, this is cosine distance of embeddings and maybe re-ranker distance. For friends, it might include mutual connections, location, etc.
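A toy sketch of that sourcing / query / scoring split for friend suggestions; the embeddings, features, and weights are all made up for illustration:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sourcing: the inventory of users we could show (hypothetical profile embeddings).
rng = np.random.default_rng(0)
candidates = {name: rng.normal(size=8) for name in ["ana", "bo", "cam", "dee"]}
mutual_friends = {"ana": 5, "bo": 0, "cam": 2, "dee": 9}

# Query: the current user, represented here by their own profile embedding.
user_embedding = rng.normal(size=8)

# Scoring: embedding similarity blended with hand-picked features (weights are arbitrary).
def score(name):
    return 0.7 * cosine(user_embedding, candidates[name]) + 0.3 * min(mutual_friends[name], 10) / 10

ranked = sorted(candidates, key=score, reverse=True)
print(ranked)  # the top of this list becomes the friend suggestions
```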
@@ -66,6 +67,7 @@ For embedding models specifically, I'd typically include everything, as more dat
Yes and no. Thumbs up/down is super useful, and it would be hard to convince me not to use these binary labels. Going to a 5-star scale creates issues where you don't know if users consider 3 or 4 stars to be "average."

With free text feedback, you'll face two issues:
+
1. Probably less than 10% of users will give a text response. If only 1% of users leave feedback at all, and only 10% of those leave text, you get very little text data, and you don't know how biased that sample is.
2. You likely won't be able to read all the free text, so you'll build clustering models to analyze the feedback - in which case, you might as well just have 5 buttons for the most common issues (too slow, answer too long, format incorrect, etc.).
@@ -79,7 +81,7 @@ But think about how often you've thumbs-downed a ChatGPT response, let alone wri
This is challenging, especially with something like a large software product knowledge base (44,000+ documents) where many people have been adding content, creating overlap and interstitial hub pages.

-One approach is to build a system where if you retrieve a subset of pages, you can reference the connections. Similar to how e-commerce sites show "people who viewed this also viewed" suggestions.
+One approach is to build a system where if you retrieve a subset of pages, you can reference the connections. Similar to how e-commerce sites show "people who viewed this also viewed" suggestions.

As context windows get larger, you could implement a system where if you pull in a page that references other documents, you traverse one level and bring in those referenced documents too.
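A minimal sketch of that one-level traversal, assuming a hypothetical `links` map from each page ID to the page IDs it references:

```python
def expand_one_hop(retrieved_ids, links, max_extra=5):
    """Add pages referenced by the retrieved pages, one level deep."""
    seen = set(retrieved_ids)
    extras = []
    for page_id in retrieved_ids:
        for ref in links.get(page_id, []):
            if ref not in seen and len(extras) < max_extra:
                seen.add(ref)
                extras.append(ref)
    return list(retrieved_ids) + extras

# Hypothetical link graph: page -> pages it references.
links = {"setup": ["install", "auth"], "auth": ["sso"], "billing": []}
print(expand_one_hop(["setup", "billing"], links))
# ['setup', 'billing', 'install', 'auth']
```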
@@ -92,6 +94,7 @@ The more fundamental question is about how you define relevance. Do you have a p
I prefer not to think about systems as "classical" versus "agent-based" RAG systems. Most RAG systems are essentially function calling in a for-loop or while-loop.

The goal is to provide the language model with two things:
+
1. Good functions
2. Good indices for each function to query that are well-defined
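To illustrate "function calling in a while-loop", here is a minimal sketch with the OpenAI SDK; the `search_docs` tool, its index, and the model name are assumptions for the example:

```python
import json
from openai import OpenAI

client = OpenAI()

def search_docs(query: str) -> str:
    """Hypothetical retrieval function backed by a well-defined index."""
    return "...retrieved chunks for: " + query

tools = [{
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search the documentation index.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "How do I rotate my API keys?"}]
while True:
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:          # no more tool requests: this is the final answer
        print(msg.content)
        break
    messages.append(msg)
    for call in msg.tool_calls:     # run each requested function and feed the result back
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": search_docs(**args),
        })
```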
@@ -202,6 +205,7 @@ With these tools, the implementation is fairly straightforward. The bigger chall
This is an interesting historical shift. Around 2018, data labeling was a huge focus because the biggest models were vision models that required massive amounts of labeled data. Vision models aren't very data-efficient - training ImageNet required labeling a million JPEGs. Companies like Scale AI won by excelling at tasks like self-driving car LiDAR labeling.

As we've moved to LLMs, two things have changed:
+
1. The big winners (like Scale AI) have already established themselves and now focus on large contracts. Smaller players either grew or struggled to find viable business models on smaller contracts.
2. LLMs are much more data-efficient.
@@ -220,6 +224,7 @@ In e-commerce, we have additional rankers for things like price sensitivity, sea
As AI systems accumulate multiple years of memories about users, figuring out what information to put in context will become much more interesting. Re-rankers won't just measure string similarity between a question and document - they'll likely incorporate user features, environmental features, and contextual information to determine relevance.

For example:
+
- Security constraints (only searching documents you have access to)
- Time/recency components for memories
- Domain authority when sources disagree
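One way to picture such a feature-aware re-ranker is a toy scoring function; the weights, half-life, and document fields here are invented for illustration:

```python
import math
import time

def rerank_score(doc, user, similarity, now=None, half_life_days=30.0):
    """Blend semantic similarity with access, recency, and authority signals."""
    now = now or time.time()
    # Security constraint: documents the user cannot access are dropped outright.
    if doc["acl"] and user["id"] not in doc["acl"]:
        return float("-inf")
    # Recency: exponential decay with a ~30-day half-life.
    age_days = (now - doc["updated_at"]) / 86400
    recency = 0.5 ** (age_days / half_life_days)
    # Authority: a precomputed 0-1 trust score for the source.
    return 0.6 * similarity + 0.25 * recency + 0.15 * doc["authority"]

doc = {"acl": ["u42"], "updated_at": time.time() - 7 * 86400, "authority": 0.8}
user = {"id": "u42"}
print(rerank_score(doc, user, similarity=0.71))
```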
@@ -232,6 +237,7 @@ Even systems like Deep Research might evolve to pull from sources you tend to ag
## Key Takeaways and Additional Resources

### Key Takeaways:
+
- Data quality is becoming more important than ever - good models make data quality the differentiator
- When collecting feedback, be specific with your questions to increase response rates
- Focus on economically valuable workflows, not just answering questions
@@ -242,15 +248,15 @@ Even systems like Deep Research might evolve to pull from sources you tend to ag
- Focus on impact (economic value) rather than just query volume

### Additional Resources:
-- Google Search Relevancy document/policy is a good reference for defining relevance
+
+- Google Search Relevancy document/policy is a good reference for defining relevance
- RAPTOR paper for document summarization approaches
- Week 3-4 content in the course covers more on these topics
- For prompt rewriting, Claude's prompt rewriter is highly recommended
- When dealing with streaming UIs and latencies, Notion's approach of showing steps visually is a good reference
- For friends example in recommendation systems, consider platforms like Facebook's friend recommendation system as reference implementations

-*Note: I'll continue to add resources and notes from future office hours sessions*
----
+## _Note: I'll continue to add resources and notes from future office hours sessions_
IF you want to get discounts and 6 day email source on the topic make sure to subscribe to