Update documentation for improved clarity and consistency
- Added missing newlines and improved formatting in multiple README and summary files to enhance readability.
- Standardized the use of markdown headers and emphasized key points for better organization across various documents.
- Ensured consistent phrasing and structure in FAQs and instructional content to align with established style guidelines.
- Enhanced the overall clarity of the documentation, making it easier for users to navigate and understand the material.
cohort_2/office-hours/week1-summary.md (9 additions, 5 deletions)
@@ -16,7 +16,7 @@ Another good use of DSpy is around using LLMs as judges. If you have a tonality
## Is it useful to prompt language models with an understanding of structure and rationale for their actions?

-Yes, absolutely. Understanding structure and rationale is critical because your product includes the ways you collect feedback, set expectations in the UI, perform data extraction, and represent chunks in the context.
+Yes, absolutely. Understanding structure and rationale is critical because your product includes the ways you collect feedback, set expectations in the UI, perform data extraction, and represent chunks in the context.

It's not just about the prompt—it's a whole system. And if you can spend time looking at how the model makes mistakes and what users are asking for, you'll make much more progress in improving the product holistically.
@@ -35,6 +35,7 @@ The general idea is to use structured extraction to identify start and end dates
In my 10 years of doing data science and machine learning, I generally stay away from any kind of graph modeling. The reason is that every time I've seen a company go into this graph-based world, within 4-5 years they decide to move back to a PostgreSQL database.

There are several issues with graph databases:
+
1. They're really hard to learn - it's much easier to hire talent that knows PostgreSQL than graph databases.
2. Defining schemas and joins in PostgreSQL is well-defined, whereas in graph databases there's often too much debate and not enough best practices.
3. Most cases don't require more than one or two traversals of your graph.
@@ -180,6 +181,7 @@ Another advantage is that LanceDB can be hosted on S3 and is easy to set up for
## Which industry or application domain do you think is most difficult for LLMs?

It's hard to say definitively, but generally:
+
1. Tasks with complex images are difficult
2. Highly regulated industries like legal and healthcare contexts present challenges
3. Financial services, especially ratings agencies, face enormous regulatory hurdles
@@ -217,6 +219,7 @@ For visual content like photographs, CLIP embeddings work well since they're inh
For instructional manuals with images, I'd pass the images to a language model and ask for a detailed summary of what the image shows, including all text in the image. Then embed that summary instead. This creates a text representation that points to the original image.

The approach has two steps:
+
1. Given an image, create a synthetic question that would retrieve it
2. Create a summary that would be retrieved for that question
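A minimal sketch of that two-step idea using the OpenAI Python SDK. The model names, prompts, and image URL are illustrative assumptions, not part of the course material:

```python
from openai import OpenAI

client = OpenAI()
image_url = "https://example.com/manual-page-3.png"  # hypothetical manual page

def ask_about_image(prompt: str) -> str:
    # Send the image plus an instruction to a vision-capable chat model.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content

# Step 1: a synthetic question this image should be retrieved for.
question = ask_about_image("Write one question a user might ask that this image answers.")

# Step 2: a detailed summary (including all visible text) to embed instead of the image.
summary = ask_about_image("Describe this image in detail and transcribe any text in it.")

embedding = client.embeddings.create(model="text-embedding-3-small", input=summary)
# Index (embedding, summary, image_url); retrieval returns the summary, which points back to the image.
```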
@@ -262,6 +265,7 @@ Over a 12-year career, we kept trying different technologies (Hadoop, Spark, etc
Prompt caching is a technique where language models can avoid reprocessing the beginning of prompts that are often identical.

Different providers handle this differently:
+
- Anthropic caches prompts for 5 minutes; if you make the same request within that time, the entire message is cached
- OpenAI figures out the optimal prefix to cache automatically
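As a rough illustration of the Anthropic behavior described above, a long shared system prompt can be marked cacheable with `cache_control`; the model name and prompt text below are placeholder assumptions:

```python
import anthropic

client = anthropic.Anthropic()
long_system_prompt = "...several thousand tokens of shared instructions, schemas, and examples..."

def ask(question: str):
    return client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed model name
        max_tokens=512,
        system=[{
            "type": "text",
            "text": long_system_prompt,
            # Mark this block as a cacheable prefix; identical requests within the
            # cache window reuse it instead of reprocessing those tokens.
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": question}],
    )

first = ask("Summarize the onboarding steps.")    # writes the cache
second = ask("List the supported file formats.")  # reuses the cached prefix
```

On the OpenAI side no flag is needed, since the prefix to cache is chosen automatically.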
@@ -279,7 +283,7 @@ Reducto's performance comes from having people manually label thousands of PDFs,
## How does Brain Trust work with the notebooks in this course?

-Brain Trust just saves the results that your laptop is running locally. It's not executing anything or using a better database—it's more like an observability tool (similar to Datadog).
+Brain Trust just saves the results that your laptop is running locally. It's not executing anything or using a better database—it's more like an observability tool (similar to Datadog).

When we run the notebooks, everything is running on your laptop in LanceDB. The only thing Brain Trust sees is row IDs and scores. Think of it as a powerful UI over a database that's saving your logs, not as a computation service.
@@ -306,19 +310,19 @@ This is especially useful when you need embeddings to understand domain-specific
## How do you understand metrics like precision and recall in one-to-one answer scenarios?

-For questions with exactly one correct answer, these metrics behave somewhat differently. Recall will be either 0% or 100% depending on whether K is large enough to include the correct answer.
+For questions with exactly one correct answer, these metrics behave somewhat differently. Recall will be either 0% or 100% depending on whether K is large enough to include the correct answer.

For example, if we want to retrieve exactly one document and there's only one correct answer, precision could be either 0% or 100%, and the same for recall.

The metrics become more meaningful when:
+
1. There are multiple relevant documents
2. We're analyzing trends across many queries
3. We're comparing different retrieval methods

Even with one-to-one mappings, MRR (Mean Reciprocal Rank) is still useful to see where the correct answer appears in your results.

-What really matters isn't the absolute number but whether we can move these metrics in a positive direction with our interventions. It's like weighing yourself—the absolute number may vary by scale, but if you've gained two pounds, you've definitely gained two pounds.
----
+## What really matters isn't the absolute number but whether we can move these metrics in a positive direction with our interventions. It's like weighing yourself—the absolute number may vary by scale, but if you've gained two pounds, you've definitely gained two pounds.
IF you want to get discounts and 6 day email source on the topic make sure to subscribe to
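To make the one-correct-answer case above concrete, here is a small plain-Python sketch of recall@K and MRR over a hypothetical evaluation set:

```python
def recall_at_k(retrieved_ids, gold_id, k):
    # With exactly one correct answer, recall@K is 0 or 1:
    # did the gold document appear in the top K results?
    return 1.0 if gold_id in retrieved_ids[:k] else 0.0

def reciprocal_rank(retrieved_ids, gold_id):
    # 1/rank of the gold document, or 0 if it was never retrieved.
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0

# Hypothetical runs: (retrieved ids in ranked order, gold id).
runs = [
    (["d7", "d2", "d9"], "d2"),
    (["d1", "d4", "d8"], "d8"),
    (["d3", "d5", "d6"], "d0"),  # the correct answer was missed entirely
]

avg_recall_at_3 = sum(recall_at_k(r, g, 3) for r, g in runs) / len(runs)  # ~0.67
mrr = sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)             # ~0.28
print(avg_recall_at_3, mrr)
```

The absolute values matter less than whether an intervention moves them, which is the weighing-yourself point made above.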
cohort_2/office-hours/week2-summary.md (5 additions, 2 deletions)
@@ -110,6 +110,7 @@ Over a 12-year career, we kept trying different technologies (Hadoop, Spark, etc
## What have you learned about prompt caching?

Prompt caching is a technique where language models can avoid reprocessing the beginning of prompts that are often identical:
+
- Anthropic caches prompts for 5 minutes; if you make the same request within that time, the entire message is cached
- OpenAI figures out the optimal prefix to cache automatically
@@ -184,11 +185,12 @@ If you can reorganize text chunks by clustering and bringing related information
## How do you understand metrics like precision and recall in one-to-one answer scenarios?

-For questions with exactly one correct answer, these metrics behave somewhat differently. Recall will be either 0% or 100% depending on whether K is large enough to include the correct answer.
+For questions with exactly one correct answer, these metrics behave somewhat differently. Recall will be either 0% or 100% depending on whether K is large enough to include the correct answer.

For example, if we want to retrieve exactly one document and there's only one correct answer, precision could be either 0% or 100%, and the same for recall.

The metrics become more meaningful when:
+
1. There are multiple relevant documents
2. We're analyzing trends across many queries
3. We're comparing different retrieval methods
@@ -326,7 +328,8 @@ This approach helps ensure reliability across different types of function callin
9. **Research pragmatism**: Focus on solving specific problems with your data rather than chasing the latest research papers, which often reinvent existing techniques.

-10. **Cross-encoders vs. bi-encoders**: Cross-encoders (re-rankers) understand semantic distinctions better but are slower; bi-encoders (embedding models) are faster but less nuanced. Use both for optimal performance.
+10. **Cross-encoders vs. bi-encoders**: Cross-encoders (re-rankers) understand semantic distinctions better but are slower; bi-encoders (embedding models) are faster but less nuanced. Use both for optimal performance.
+

---
IF you want to get discounts and 6 day email source on the topic make sure to subscribe to
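A rough sketch of the two-stage pattern in point 10, using the sentence-transformers library; the model names and toy corpus are assumptions for illustration:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Bi-encoder: embeds query and documents independently, so it scales to large corpora.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
# Cross-encoder: reads (query, document) pairs jointly; slower but more precise.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = ["How to rotate API keys", "Billing and invoices", "Resetting your password"]
query = "I lost access to my account"

corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
candidates = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]  # fast first pass

pairs = [(query, corpus[hit["corpus_id"]]) for hit in candidates]
scores = cross_encoder.predict(pairs)  # precise second pass over a handful of candidates
reranked = sorted(zip(pairs, scores), key=lambda x: x[1], reverse=True)
for (q, doc), score in reranked:
    print(f"{score:.3f}  {doc}")
```

The bi-encoder narrows a large corpus to a handful of candidates cheaply; the cross-encoder then spends its extra compute only on those candidates.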
cohort_2/office-hours/week3-summary.md (10 additions, 4 deletions)
@@ -27,6 +27,7 @@ The real goal isn't to get a number right - it's to figure out what to do next.
## Can you elaborate on your view on RAG versus recommendations? How would you approach the use case of friend suggestions?

When you build a recommendation system, there are several steps:
+
1. **Sourcing** - What inventory can I show my customer? In the friends case, this would be all users on the platform.
2. **Query** - Either your user ID or a question embedding.
3. **Scoring** - For simple RAG, this is cosine distance of embeddings and maybe re-ranker distance. For friends, it might include mutual connections, location, etc.
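A toy sketch of that sourcing / query / scoring split for friend suggestions; the embeddings, features, and weights are all made up for illustration:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sourcing: the inventory of users we could show (hypothetical profile embeddings).
rng = np.random.default_rng(0)
candidates = {name: rng.normal(size=8) for name in ["ana", "bo", "cam", "dee"]}
mutual_friends = {"ana": 5, "bo": 0, "cam": 2, "dee": 9}

# Query: the current user, represented here by their own profile embedding.
user_embedding = rng.normal(size=8)

# Scoring: embedding similarity blended with hand-picked features (weights are arbitrary).
def score(name):
    return 0.7 * cosine(user_embedding, candidates[name]) + 0.3 * min(mutual_friends[name], 10) / 10

ranked = sorted(candidates, key=score, reverse=True)
print(ranked)  # the top of this list becomes the friend suggestions
```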
@@ -66,6 +67,7 @@ For embedding models specifically, I'd typically include everything, as more dat
Yes and no. Thumbs up/down is super useful, and it would be hard to convince me not to use these binary labels. Going to a 5-star scale creates issues where you don't know if users consider 3 or 4 stars to be "average."

With free text feedback, you'll face two issues:
+
1. Probably less than 10% of users will give a text response. If only 1% of users leave feedback at all, and only 10% of those leave text, you get very little text data, and you don't know how biased that sample is.
2. You likely won't be able to read all the free text, so you'll build clustering models to analyze the feedback - in which case, you might as well just have 5 buttons for the most common issues (too slow, answer too long, format incorrect, etc.).
@@ -79,7 +81,7 @@ But think about how often you've thumbs-downed a ChatGPT response, let alone wri
This is challenging, especially with something like a large software product knowledge base (44,000+ documents) where many people have been adding content, creating overlap and interstitial hub pages.

-One approach is to build a system where if you retrieve a subset of pages, you can reference the connections. Similar to how e-commerce sites show "people who viewed this also viewed" suggestions.
+One approach is to build a system where if you retrieve a subset of pages, you can reference the connections. Similar to how e-commerce sites show "people who viewed this also viewed" suggestions.

As context windows get larger, you could implement a system where if you pull in a page that references other documents, you traverse one level and bring in those referenced documents too.
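A minimal sketch of that one-level traversal, assuming a hypothetical `links` map from each page ID to the page IDs it references:

```python
def expand_one_hop(retrieved_ids, links, max_extra=5):
    """Add pages referenced by the retrieved pages, one level deep."""
    seen = set(retrieved_ids)
    extras = []
    for page_id in retrieved_ids:
        for ref in links.get(page_id, []):
            if ref not in seen and len(extras) < max_extra:
                seen.add(ref)
                extras.append(ref)
    return list(retrieved_ids) + extras

# Hypothetical link graph: page -> pages it references.
links = {"setup": ["install", "auth"], "auth": ["sso"], "billing": []}
print(expand_one_hop(["setup", "billing"], links))
# ['setup', 'billing', 'install', 'auth']
```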
@@ -92,6 +94,7 @@ The more fundamental question is about how you define relevance. Do you have a p
I prefer not to think about systems as "classical" versus "agent-based" RAG systems. Most RAG systems are essentially function calling in a for-loop or while-loop.

The goal is to provide the language model with two things:
+
1. Good functions
2. Good indices for each function to query that are well-defined
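To illustrate "function calling in a while-loop", here is a minimal sketch with the OpenAI SDK; the `search_docs` tool, its index, and the model name are assumptions for the example:

```python
import json
from openai import OpenAI

client = OpenAI()

def search_docs(query: str) -> str:
    """Hypothetical retrieval function backed by a well-defined index."""
    return "...retrieved chunks for: " + query

tools = [{
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search the documentation index.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "How do I rotate my API keys?"}]
while True:
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:          # no more tool requests: this is the final answer
        print(msg.content)
        break
    messages.append(msg)
    for call in msg.tool_calls:     # run each requested function and feed the result back
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": search_docs(**args),
        })
```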
@@ -202,6 +205,7 @@ With these tools, the implementation is fairly straightforward. The bigger chall
This is an interesting historical shift. Around 2018, data labeling was a huge focus because the biggest models were vision models that required massive amounts of labeled data. Vision models aren't very data-efficient - training ImageNet required labeling a million JPEGs. Companies like Scale AI won by excelling at tasks like self-driving car LiDAR labeling.

As we've moved to LLMs, two things have changed:
+
1. The big winners (like Scale AI) have already established themselves and now focus on large contracts. Smaller players either grew or struggled to find viable business models on smaller contracts.
2. LLMs are much more data-efficient.
@@ -220,6 +224,7 @@ In e-commerce, we have additional rankers for things like price sensitivity, sea
As AI systems accumulate multiple years of memories about users, figuring out what information to put in context will become much more interesting. Re-rankers won't just measure string similarity between a question and document - they'll likely incorporate user features, environmental features, and contextual information to determine relevance.

For example:
+
- Security constraints (only searching documents you have access to)
- Time/recency components for memories
- Domain authority when sources disagree
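One way to picture such a feature-aware re-ranker is a toy scoring function; the weights, half-life, and document fields here are invented for illustration:

```python
import math
import time

def rerank_score(doc, user, similarity, now=None, half_life_days=30.0):
    """Blend semantic similarity with access, recency, and authority signals."""
    now = now or time.time()
    # Security constraint: documents the user cannot access are dropped outright.
    if doc["acl"] and user["id"] not in doc["acl"]:
        return float("-inf")
    # Recency: exponential decay with a ~30-day half-life.
    age_days = (now - doc["updated_at"]) / 86400
    recency = 0.5 ** (age_days / half_life_days)
    # Authority: a precomputed 0-1 trust score for the source.
    return 0.6 * similarity + 0.25 * recency + 0.15 * doc["authority"]

doc = {"acl": ["u42"], "updated_at": time.time() - 7 * 86400, "authority": 0.8}
user = {"id": "u42"}
print(rerank_score(doc, user, similarity=0.71))
```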
@@ -232,6 +237,7 @@ Even systems like Deep Research might evolve to pull from sources you tend to ag
## Key Takeaways and Additional Resources

### Key Takeaways:
+
- Data quality is becoming more important than ever - good models make data quality the differentiator
- When collecting feedback, be specific with your questions to increase response rates
- Focus on economically valuable workflows, not just answering questions
@@ -242,15 +248,15 @@ Even systems like Deep Research might evolve to pull from sources you tend to ag
- Focus on impact (economic value) rather than just query volume

### Additional Resources:
-- Google Search Relevancy document/policy is a good reference for defining relevance
+
+- Google Search Relevancy document/policy is a good reference for defining relevance
- RAPTOR paper for document summarization approaches
- Week 3-4 content in the course covers more on these topics
- For prompt rewriting, Claude's prompt rewriter is highly recommended
- When dealing with streaming UIs and latencies, Notion's approach of showing steps visually is a good reference
- For friends example in recommendation systems, consider platforms like Facebook's friend recommendation system as reference implementations

-*Note: I'll continue to add resources and notes from future office hours sessions*
----
+## _Note: I'll continue to add resources and notes from future office hours sessions_
IF you want to get discounts and 6 day email source on the topic make sure to subscribe to