Commit 7c557b6

Update documentation for improved clarity and consistency
- Added missing newlines and improved formatting in multiple README and summary files to enhance readability.
- Standardized the use of markdown headers and emphasized key points for better organization across various documents.
- Ensured consistent phrasing and structure in FAQs and instructional content to align with established style guidelines.
- Enhanced the overall clarity of the documentation, making it easier for users to navigate and understand the material.
1 parent ad40fa9 commit 7c557b6

70 files changed: 1544 additions, 1678 deletions

cohort_1/README.md

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 # systematically-improving-rag
+
 ---
 
 IF you want to get discounts and 6 day email source on the topic make sure to subscribe to

cohort_2/office-hours/README.md

Lines changed: 11 additions & 3 deletions
@@ -30,15 +30,18 @@ You can use Cursor's AI capabilities to generate summary files from the transcri
 
 4. **Use Cursor's AI composer (CTRL+K or CMD+K)**
 - In the composer, enter a prompt like:
+
 ```
 @[transcript-file-1] @[transcript-file-2] @[transcript-file-3]
-
+
 Create a summary file that extracts all Q&A pairs from these transcripts in the same format as the existing summary files.
 ```
+
 - For example:
+
 ```
 @02-04-2025-1353-merged.txt @02-04-2025-1744-merged.txt @02-06-2025-1854-merged.txt
-
+
 Create a summary file that extracts all Q&A pairs from these transcripts in the same format as week3-summary.md
 ```
 
@@ -65,12 +68,14 @@ MM-DD-YYYY-HHMM-type.ext
 ```
 
 Where:
+
 - `MM-DD-YYYY`: Month, day, and year of the session
 - `HHMM`: Hour and minute of the session (24-hour format)
 - `type`: Type of file (session, merged, chat)
 - `ext`: File extension (.vtt, .txt, etc.)
 
 Examples:
+
 - `02-04-2025-1353-session.vtt`: Transcript from February 4, 2025 at 1:53 PM
 - `02-18-2025-1349-merged.txt`: Merged transcript from February 18, 2025 at 1:49 PM
 - `02-20-2025-1857-session.txt`: Transcript from February 20, 2025 at 6:57 PM
@@ -99,6 +104,7 @@ python3 move-files.py
 ```
 
 The script will:
+
 - Show which files were found and where they were moved
 - Remove duplicate files
 - Print a summary of files in each week folder
@@ -112,9 +118,11 @@ When new transcripts are downloaded:
 3. The script will automatically organize and rename all new transcript files
 
 The script handles various transcript file formats and naming patterns, including:
+
 - Files with "transcript" in the name
 - Files with "recording" in the name (ending with .vtt, .txt, .srt)
-- Files starting with "GMT" followed by a date (ending with .vtt, .txt, .srt)
+- Files starting with "GMT" followed by a date (ending with .vtt, .txt, .srt)
+
 ---
 
 IF you want to get discounts and 6 day email source on the topic make sure to subscribe to
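The MM-DD-YYYY-HHMM-type.ext convention documented in the hunks above maps to a small amount of parsing code. Below is a minimal illustrative sketch (it is not the repository's move-files.py, whose source is not part of this diff) that validates a filename against that convention and recovers the session timestamp:

```python
import re
from datetime import datetime

# Illustrative only: parses the MM-DD-YYYY-HHMM-type.ext convention described
# in the README; this is not the repository's move-files.py.
PATTERN = re.compile(
    r"^(?P<month>\d{2})-(?P<day>\d{2})-(?P<year>\d{4})-(?P<hhmm>\d{4})"
    r"-(?P<type>session|merged|chat)\.(?P<ext>vtt|txt|srt)$"
)

def parse_transcript_name(filename: str) -> dict:
    """Return the session timestamp, file type, and extension encoded in a transcript filename."""
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not follow MM-DD-YYYY-HHMM-type.ext")
    parts = match.groupdict()
    timestamp = datetime.strptime(
        f"{parts['month']}-{parts['day']}-{parts['year']} {parts['hhmm']}",
        "%m-%d-%Y %H%M",
    )
    return {"timestamp": timestamp, "type": parts["type"], "ext": parts["ext"]}

# Example from the README: merged transcript from February 4, 2025 at 1:53 PM.
print(parse_transcript_name("02-04-2025-1353-merged.txt"))
```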

cohort_2/office-hours/week1-summary.md

Lines changed: 9 additions & 5 deletions
@@ -16,7 +16,7 @@ Another good use of DSpy is around using LLMs as judges. If you have a tonality
 
 ## Is it useful to prompt language models with an understanding of structure and rationale for their actions?
 
-Yes, absolutely. Understanding structure and rationale is critical because your product includes the ways you collect feedback, set expectations in the UI, perform data extraction, and represent chunks in the context.
+Yes, absolutely. Understanding structure and rationale is critical because your product includes the ways you collect feedback, set expectations in the UI, perform data extraction, and represent chunks in the context.
 
 It's not just about the prompt—it's a whole system. And if you can spend time looking at how the model makes mistakes and what users are asking for, you'll make much more progress in improving the product holistically.
 
@@ -35,6 +35,7 @@ The general idea is to use structured extraction to identify start and end dates
 In my 10 years of doing data science and machine learning, I generally stay away from any kind of graph modeling. The reason is that every time I've seen a company go into this graph-based world, within 4-5 years they decide to move back to a PostgreSQL database.
 
 There are several issues with graph databases:
+
 1. They're really hard to learn - it's much easier to hire talent that knows PostgreSQL than graph databases.
 2. Defining schemas in PostgreSQL and joins is well-defined, whereas in graph databases there's often too much debate and not enough best practices.
 3. Most cases don't require more than one or two traversals of your graph.
@@ -180,6 +181,7 @@ Another advantage is that LanceDB can be hosted on S3 and is easy to set up for
 ## Which industry or application domain do you think is most difficult for LLMs?
 
 It's hard to say definitively, but generally:
+
 1. Tasks with complex images are difficult
 2. Highly regulated industries like legal and healthcare contexts present challenges
 3. Financial services, especially ratings agencies, face enormous regulatory hurdles
@@ -217,6 +219,7 @@ For visual content like photographs, CLIP embeddings work well since they're inh
 For instructional manuals with images, I'd pass the images to a language model and ask for a detailed summary of what the image shows, including all text in the image. Then embed that summary instead. This creates a text representation that points to the original image.
 
 The approach has two steps:
+
 1. Given an image, create a synthetic question that would retrieve it
 2. Create a summary that would be retrieved for that question
 
@@ -262,6 +265,7 @@ Over a 12-year career, we kept trying different technologies (Hadoop, Spark, etc
 Prompt caching is a technique where language models can avoid reprocessing the beginning of prompts that are often identical.
 
 Different providers handle this differently:
+
 - Anthropic caches prompts for 5 minutes; if you make the same request within that time, the entire message is cached
 - OpenAI figures out the optimal prefix to cache automatically
 
@@ -279,7 +283,7 @@ Reducto's performance comes from having people manually label thousands of PDFs,
 
 ## How does Brain Trust work with the notebooks in this course?
 
-Brain Trust just saves the results that your laptop is running locally. It's not executing anything or using a better database—it's more like an observability tool (similar to Datadog).
+Brain Trust just saves the results that your laptop is running locally. It's not executing anything or using a better database—it's more like an observability tool (similar to Datadog).
 
 When we run the notebooks, everything is running on your laptop in LanceDB. The only thing Brain Trust sees is row IDs and scores. Think of it as a powerful UI over a database that's saving your logs, not as a computation service.
 
@@ -306,19 +310,19 @@ This is especially useful when you need embeddings to understand domain-specific
 
 ## How do you understand metrics like precision and recall in one-to-one answer scenarios?
 
-For questions with exactly one correct answer, these metrics behave somewhat differently. Recall will be either 0% or 100% depending on whether K is large enough to include the correct answer.
+For questions with exactly one correct answer, these metrics behave somewhat differently. Recall will be either 0% or 100% depending on whether K is large enough to include the correct answer.
 
 For example, if we want to retrieve exactly one document and there's only one correct answer, precision could be either 0% or 100%, and the same for recall.
 
 The metrics become more meaningful when:
+
 1. There are multiple relevant documents
 2. We're analyzing trends across many queries
 3. We're comparing different retrieval methods
 
 Even with one-to-one mappings, MRR (Mean Reciprocal Rank) is still useful to see where the correct answer appears in your results.
 
-What really matters isn't the absolute number but whether we can move these metrics in a positive direction with our interventions. It's like weighing yourself—the absolute number may vary by scale, but if you've gained two pounds, you've definitely gained two pounds.
----
+## What really matters isn't the absolute number but whether we can move these metrics in a positive direction with our interventions. It's like weighing yourself—the absolute number may vary by scale, but if you've gained two pounds, you've definitely gained two pounds.
 
 IF you want to get discounts and 6 day email source on the topic make sure to subscribe to
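For the one-correct-answer case discussed in the hunk above, recall@K collapses to a 0-or-1 value per query, and MRR simply tracks where that single answer lands. A minimal sketch of both metrics, assuming each query is labeled with exactly one correct document ID (illustrative only, not course code):

```python
def recall_at_k(ranked_ids: list[str], correct_id: str, k: int) -> float:
    """With a single relevant document, recall@K is 1.0 if it appears in the top K, else 0.0."""
    return 1.0 if correct_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids: list[str], correct_id: str) -> float:
    """1/rank of the correct document, or 0.0 if it was not retrieved at all."""
    try:
        return 1.0 / (ranked_ids.index(correct_id) + 1)
    except ValueError:
        return 0.0

# Toy example: three queries, each with exactly one correct answer.
results = [
    (["d3", "d7", "d1"], "d7"),   # correct doc ranked 2nd
    (["d2", "d4", "d9"], "d2"),   # correct doc ranked 1st
    (["d5", "d6", "d8"], "d1"),   # correct doc missed entirely
]
k = 3
avg_recall = sum(recall_at_k(r, c, k) for r, c in results) / len(results)
mrr = sum(reciprocal_rank(r, c) for r, c in results) / len(results)
print(f"recall@{k}: {avg_recall:.2f}, MRR: {mrr:.2f}")  # recall@3: 0.67, MRR: 0.50
```

Per query the numbers jump between 0 and 1, which is why the averages across many queries (and their movement after an intervention) are the meaningful signal.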

cohort_2/office-hours/week2-summary.md

Lines changed: 5 additions & 2 deletions
@@ -110,6 +110,7 @@ Over a 12-year career, we kept trying different technologies (Hadoop, Spark, etc
 ## What have you learned about prompt caching?
 
 Prompt caching is a technique where language models can avoid reprocessing the beginning of prompts that are often identical:
+
 - Anthropic caches prompts for 5 minutes; if you make the same request within that time, the entire message is cached
 - OpenAI figures out the optimal prefix to cache automatically
 
@@ -184,11 +185,12 @@ If you can reorganize text chunks by clustering and bringing related information
 
 ## How do you understand metrics like precision and recall in one-to-one answer scenarios?
 
-For questions with exactly one correct answer, these metrics behave somewhat differently. Recall will be either 0% or 100% depending on whether K is large enough to include the correct answer.
+For questions with exactly one correct answer, these metrics behave somewhat differently. Recall will be either 0% or 100% depending on whether K is large enough to include the correct answer.
 
 For example, if we want to retrieve exactly one document and there's only one correct answer, precision could be either 0% or 100%, and the same for recall.
 
 The metrics become more meaningful when:
+
 1. There are multiple relevant documents
 2. We're analyzing trends across many queries
 3. We're comparing different retrieval methods
@@ -326,7 +328,8 @@ This approach helps ensure reliability across different types of function callin
 
 9. **Research pragmatism**: Focus on solving specific problems with your data rather than chasing the latest research papers, which often reinvent existing techniques.
 
-10. **Cross-encoders vs. bi-encoders**: Cross-encoders (re-rankers) understand semantic distinctions better but are slower; bi-encoders (embedding models) are faster but less nuanced. Use both for optimal performance.
+10. **Cross-encoders vs. bi-encoders**: Cross-encoders (re-rankers) understand semantic distinctions better but are slower; bi-encoders (embedding models) are faster but less nuanced. Use both for optimal performance.
+
 ---
 
 IF you want to get discounts and 6 day email source on the topic make sure to subscribe to
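The cross-encoder/bi-encoder trade-off in takeaway 10 above is easy to see in code. The sketch below assumes the sentence-transformers package and two public checkpoints (all-MiniLM-L6-v2 as the bi-encoder, cross-encoder/ms-marco-MiniLM-L-6-v2 as the cross-encoder), chosen here only for illustration; the bi-encoder scores come from independently computed (and therefore precomputable) embeddings, while the cross-encoder scores each query-document pair jointly:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "How do I cache prompts to cut latency?"
docs = [
    "Prompt caching lets providers skip reprocessing identical prompt prefixes.",
    "Re-rankers score query-document pairs jointly and are slower than embeddings.",
]

# Bi-encoder: embed query and documents separately, then compare with cosine similarity.
# Fast, because document embeddings can be precomputed and stored in an index.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = bi_encoder.encode(docs, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
bi_scores = util.cos_sim(query_emb, doc_emb)[0]

# Cross-encoder: score each (query, document) pair jointly.
# Slower, but captures finer-grained distinctions; typically used to re-rank the top K.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
cross_scores = cross_encoder.predict([(query, d) for d in docs])

for doc, b, c in zip(docs, bi_scores.tolist(), cross_scores.tolist()):
    print(f"bi={b:.3f}  cross={c:.3f}  {doc[:60]}")
```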

cohort_2/office-hours/week3-summary.md

Lines changed: 10 additions & 4 deletions
@@ -27,6 +27,7 @@ The real goal isn't to get a number right - it's to figure out what to do next.
 ## Can you elaborate on your view on RAG versus recommendations? How would you approach the use case of friend suggestions?
 
 When you build a recommendation system, there are several steps:
+
 1. **Sourcing** - What inventory can I show my customer? In the friends case, this would be all users on the platform.
 2. **Query** - Either your user ID or a question embedding.
 3. **Scoring** - For simple RAG, this is cosine distance of embeddings and maybe re-ranker distance. For friends, it might include mutual connections, location, etc.
@@ -66,6 +67,7 @@ For embedding models specifically, I'd typically include everything, as more dat
 Yes and no. Thumbs up/down is super useful, and it would be hard to convince me not to use these binary labels. Going to a 5-star scale creates issues where you don't know if users consider 3 or 4 stars to be "average."
 
 With free text feedback, you'll face two issues:
+
 1. Probably less than 10% of users will give a text response. If only 1% of users leave feedback at all, and only 10% of those leave text, you get very little text data, and you don't know how biased that sample is.
 2. You likely won't be able to read all the free text, so you'll build clustering models to analyze the feedback - in which case, you might as well just have 5 buttons for the most common issues (too slow, answer too long, format incorrect, etc.).
 
@@ -79,7 +81,7 @@ But think about how often you've thumbs-downed a ChatGPT response, let alone wri
 
 This is challenging, especially with something like a large software product knowledge base (44,000+ documents) where many people have been adding content, creating overlap and interstitial hub pages.
 
-One approach is to build a system where if you retrieve a subset of pages, you can reference the connections. Similar to how e-commerce sites show "people who viewed this also viewed" suggestions.
+One approach is to build a system where if you retrieve a subset of pages, you can reference the connections. Similar to how e-commerce sites show "people who viewed this also viewed" suggestions.
 
 As context windows get larger, you could implement a system where if you pull in a page that references other documents, you traverse one level and bring in those referenced documents too.
 
@@ -92,6 +94,7 @@ The more fundamental question is about how you define relevance. Do you have a p
 I prefer not to think about systems as "classical" versus "agent-based" RAG systems. Most RAG systems are essentially function calling in a for-loop or while-loop.
 
 The goal is to provide the language model with two things:
+
 1. Good functions
 2. Good indices for each function to query that are well-defined
 
@@ -202,6 +205,7 @@ With these tools, the implementation is fairly straightforward. The bigger chall
 This is an interesting historical shift. Around 2018, data labeling was a huge focus because the biggest models were vision models that required massive amounts of labeled data. Vision models aren't very data-efficient - training ImageNet required labeling a million JPEGs. Companies like Scale AI won by excelling at tasks like self-driving car LiDAR labeling.
 
 As we've moved to LLMs, two things have changed:
+
 1. The big winners (like Scale AI) have already established themselves and now focus on large contracts. Smaller players either grew or struggled to find viable business models on smaller contracts.
 2. LLMs are much more data-efficient.
 
@@ -220,6 +224,7 @@ In e-commerce, we have additional rankers for things like price sensitivity, sea
 As AI systems accumulate multiple years of memories about users, figuring out what information to put in context will become much more interesting. Re-rankers won't just measure string similarity between a question and document - they'll likely incorporate user features, environmental features, and contextual information to determine relevance.
 
 For example:
+
 - Security constraints (only searching documents you have access to)
 - Time/recency components for memories
 - Domain authority when sources disagree
@@ -232,6 +237,7 @@ Even systems like Deep Research might evolve to pull from sources you tend to ag
 ## Key Takeaways and Additional Resources
 
 ### Key Takeaways:
+
 - Data quality is becoming more important than ever - good models make data quality the differentiator
 - When collecting feedback, be specific with your questions to increase response rates
 - Focus on economically valuable workflows, not just answering questions
@@ -242,15 +248,15 @@ Even systems like Deep Research might evolve to pull from sources you tend to ag
 - Focus on impact (economic value) rather than just query volume
 
 ### Additional Resources:
-- Google Search Relevancy document/policy is a good reference for defining relevance
+
+- Google Search Relevancy document/policy is a good reference for defining relevance
 - RAPTOR paper for document summarization approaches
 - Week 3-4 content in the course covers more on these topics
 - For prompt rewriting, Claude's prompt rewriter is highly recommended
 - When dealing with streaming UIs and latencies, Notion's approach of showing steps visually is a good reference
 - For friends example in recommendation systems, consider platforms like Facebook's friend recommendation system as reference implementations
 
-*Note: I'll continue to add resources and notes from future office hours sessions*
----
+## _Note: I'll continue to add resources and notes from future office hours sessions_
 
 IF you want to get discounts and 6 day email source on the topic make sure to subscribe to
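The re-ranking discussion in the hunks above (re-rankers measuring more than string similarity) can be made concrete with a toy scoring function. The sketch below is an assumption about how such signals could be blended, not anything prescribed in the course material: it mixes embedding similarity with an exponential recency decay and hard-filters documents the user cannot access:

```python
import math
from datetime import datetime, timezone

def rerank_score(
    similarity: float,          # e.g. cosine similarity from the embedding index, in [0, 1]
    last_updated: datetime,     # when the memory/document was written
    user_can_access: bool,      # security constraint: filter, don't just down-weight
    half_life_days: float = 90.0,
    recency_weight: float = 0.3,
) -> float:
    """Blend string similarity with a recency signal; hard-filter inaccessible documents."""
    if not user_can_access:
        return float("-inf")  # never surface documents the user cannot see
    age_days = (datetime.now(timezone.utc) - last_updated).total_seconds() / 86400
    recency = math.exp(-math.log(2) * age_days / half_life_days)  # 1.0 now, 0.5 after one half-life
    return (1 - recency_weight) * similarity + recency_weight * recency

# Example: an older but highly similar memory vs. a fresher, moderately similar one.
old = rerank_score(0.82, datetime(2024, 1, 1, tzinfo=timezone.utc), True)
new = rerank_score(0.70, datetime(2025, 2, 1, tzinfo=timezone.utc), True)
print(f"old memory: {old:.3f}, recent memory: {new:.3f}")
```

The weights and half-life here are arbitrary placeholders; in practice they would be tuned against the same retrieval metrics discussed in the week 1 and week 2 summaries.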
