Refactor documentation for clarity and consistency in talks
- Updated headings in multiple talk documents to use consistent formatting, changing from bold to markdown headers for improved readability.
- Enhanced key takeaways and structured content for better organization and flow.
- Ensured all talks maintain a uniform style, making it easier for users to navigate and understand the material.
docs/talks/chromadb-anton-chunking.md (5 additions, 5 deletions)
@@ -11,7 +11,7 @@ date: 2025-01-01
I hosted a special session with Anton from ChromaDB to discuss their latest technical research on text chunking for RAG applications. This session covers the fundamentals of chunking strategies, evaluation methods, and practical tips for improving retrieval performance in your AI systems.
-**What is chunking and why is it important for RAG systems?**
+## What is chunking and why is it important for RAG systems?
Chunking is the process of splitting documents into smaller components to enable effective retrieval of relevant information. Despite what many believe, chunking remains critical even as LLM context windows grow larger.
The fundamental purpose of chunking is to find the relevant text for a given query among all the divisions we've created from our documents. This becomes especially important when the information needed to answer a query spans multiple documents.
@@ -25,7 +25,7 @@ There are several compelling reasons why chunking matters regardless of context
***Key Takeaway:*** Chunking will remain important regardless of how large context windows become because it addresses fundamental challenges in retrieval efficiency, accuracy, and cost management.
-**What approaches exist for text chunking?**
+## What approaches exist for text chunking?
There are two broad categories of chunking approaches that are currently being used:
Heuristic approaches rely on separator characters (like newlines, question marks, periods) to divide documents based on their existing structure. The most widely used implementation is the recursive character text splitter, which uses a hierarchy of splitting characters to subdivide documents into pieces not exceeding a specified maximum length.
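To make the recursive idea concrete, here is a minimal sketch (in Python) of a recursive character splitter. The separator hierarchy and maximum length are illustrative defaults, not the settings discussed in the talk or any particular library's implementation:

```python
# Illustrative sketch of recursive character splitting.
# Separators are tried coarsest-to-finest; oversized pieces are re-split with the next one.
SEPARATORS = ["\n\n", "\n", ". ", " "]  # assumed hierarchy

def recursive_split(text: str, max_len: int = 500, level: int = 0) -> list[str]:
    if len(text) <= max_len:
        return [text]
    if level >= len(SEPARATORS):
        # No separators left: fall back to hard cuts at max_len.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    chunks, current = [], ""
    for piece in text.split(SEPARATORS[level]):
        candidate = piece if not current else current + SEPARATORS[level] + piece
        if len(candidate) <= max_len:
            current = candidate
        elif len(piece) <= max_len:
            if current:
                chunks.append(current)
            current = piece
        else:
            if current:
                chunks.append(current)
                current = ""
            # The piece itself is too long; recurse with a finer separator.
            chunks.extend(recursive_split(piece, max_len, level + 1))
    if current:
        chunks.append(current)
    return chunks
```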
@@ -38,7 +38,7 @@ What's particularly interesting is that you can use the same embedding model for
***Key Takeaway:*** While heuristic approaches like recursive character text splitters are most common today, semantic chunking methods that identify natural topic boundaries show promise for more robust performance across diverse document types.
-**Does chunking strategy actually matter for performance?**
+## Does chunking strategy actually matter for performance?
According to Anton's research, chunking strategy matters tremendously. Their technical report demonstrates significant performance variations based solely on chunking approach, even when using the same embedding model and retrieval system.
They discovered two fundamental rules of thumb that exist in tension with each other:
@@ -52,7 +52,7 @@ By looking at your actual chunks, you can develop intuition about how your chunk
***Key Takeaway:*** There's no one-size-fits-all chunking strategy. The best approach depends on your specific data and task, which is why examining your actual chunks is essential for diagnosing retrieval problems.
-**How should we evaluate chunking strategies?**
+## How should we evaluate chunking strategies?
When evaluating chunking strategies, focus on the retriever itself rather than the generative output. This differs from traditional information retrieval benchmarks in several important ways:
Recall is the single most important metric. Modern models are increasingly good at ignoring irrelevant information, but they cannot complete a task if you haven't retrieved all the relevant information in the first place.
@@ -65,7 +65,7 @@ The ChromaDB team has released code for their generative benchmark, which can he
***Key Takeaway:*** Focus on passage-level recall rather than document-level metrics or ranking-sensitive measures. The model can handle irrelevant information, but it can't work with information that wasn't retrieved.
-**What practical advice can improve our chunking implementation?**
+## What practical advice can improve our chunking implementation?
The most emphatic advice from Anton was: "Always, always, always look at your data." This point was stressed repeatedly throughout the presentation.
Many retrieval problems stem from poor chunking that isn't apparent until you actually examine the chunks being produced. Default settings in popular libraries often produce surprisingly poor results for specific datasets.
docs/talks/colin-rag-agents.md (12 additions, 12 deletions)
@@ -11,7 +11,7 @@ date: 2025-06-30
I hosted Colin Flaherty, previously a founding engineer at Augment and co-author of Meta's Cicero AI, to discuss autonomous coding agents and retrieval systems. This session explores how agentic approaches are transforming traditional RAG systems, what we can learn from state-of-the-art coding agents, and how these insights might apply to other domains.
-**Do agents make traditional RAG obsolete?**
+## Do agents make traditional RAG obsolete?
Colin shared his experience building an agent for SWE-Bench Verified, a canonical AI coding evaluation where agents implement code changes based on problem descriptions. His team's agent reached the top of the leaderboard with a surprising discovery: embedding-based retrieval wasn't the bottleneck they expected.
"We explored adding various embedding-based retrieval tools, but found that for SweeBench tasks this was not the bottleneck - grep and find were sufficient," Colin explained. This initially surprised him, as he expected embedding models to be significantly more powerful.
@@ -95,7 +95,7 @@ To enhance agentic retrieval, Colin recommends:
One particularly effective technique is asynchronous pre-processing: "I've taken songs and used an LLM to create a dossier about each one. This simple pre-processing step took a totally non-working search system and turned it into something that works really well."
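As a rough illustration of that pre-processing step, the sketch below builds an LLM "dossier" per item ahead of time and returns them for later indexing. The OpenAI-style client, model name, and prompt are assumptions for the example, not Colin's actual pipeline:

```python
# Hedged sketch of asynchronous pre-processing: build an LLM "dossier" per item offline,
# then index the dossiers for search. Model name and prompt are illustrative assumptions.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def build_dossier(item_id: str, raw_text: str) -> dict:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap for whatever you use
        messages=[
            {"role": "system", "content": "Summarize this item into a searchable dossier: "
                                          "themes, entities, and likely user questions."},
            {"role": "user", "content": raw_text},
        ],
    )
    return {"id": item_id, "dossier": resp.choices[0].message.content}

async def preprocess(items: dict[str, str]) -> list[dict]:
    # Run the per-item calls concurrently; the dossiers are what you embed or grep later.
    return await asyncio.gather(*(build_dossier(i, t) for i, t in items.items()))

# dossiers = asyncio.run(preprocess({"song-1": "lyrics and metadata for song 1"}))
```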
-**Why aren't more people training great embedding models?**
+## Why aren't more people training great embedding models?
When asked what question people aren't asking enough, Colin highlighted the lack of expertise in training embedding models: "Very few people understand how to build and train good retrieval systems. It just confuses me why no one knows how to fine-tune really good embedding models."
He attributed this partly to the specialized nature of the skill and partly to data availability. For code, there's abundant data on GitHub, but most domains lack comparable resources. Additionally, the most talented engineers often prefer working on LLMs rather than embedding models.
@@ -113,15 +113,15 @@ As these systems evolve, we'll likely see more specialized tools emerging for di
**FAQs**
-**What is agentic retrieval and how does it differ from traditional RAG?**
+## What is agentic retrieval and how does it differ from traditional RAG?
Agentic retrieval is an approach where AI agents use tools like grep, find, or embedding models to search through code and other content. Unlike traditional RAG (Retrieval-Augmented Generation), which typically uses embedding databases and vector searches, agentic retrieval gives the agent direct control over the search process. This allows the agent to be persistent, try multiple search strategies, and course-correct when initial attempts fail. Traditional RAG is more rigid but can be faster and more efficient for certain use cases.
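A minimal sketch of what "giving the agent direct control" can look like: grep wrapped as a callable tool, plus an OpenAI-style function-calling spec an agent loop could expose. The tool name, schema, and defaults are illustrative assumptions, not a specific framework's API:

```python
# Minimal sketch of exposing grep as an agent tool. The agent picks the pattern and can
# retry with different ones when a search returns nothing useful.
import subprocess

def grep_tool(pattern: str, path: str = ".", max_lines: int = 50) -> str:
    """Run a recursive, case-insensitive grep and return matching lines with file names."""
    result = subprocess.run(
        ["grep", "-rni", pattern, path],
        capture_output=True, text=True,
    )
    lines = result.stdout.splitlines()[:max_lines]
    return "\n".join(lines) if lines else f"No matches for {pattern!r}"

# Tool spec the agent loop would hand to the LLM (OpenAI-style function calling, assumed):
GREP_TOOL_SPEC = {
    "type": "function",
    "function": {
        "name": "grep_tool",
        "description": "Search the repository for a regex pattern and return matching lines.",
        "parameters": {
            "type": "object",
            "properties": {
                "pattern": {"type": "string"},
                "path": {"type": "string"},
            },
            "required": ["pattern"],
        },
    },
}
```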
-**Do agents make traditional RAG obsolete?**
+## Do agents make traditional RAG obsolete?
No, agents don't make traditional RAG obsolete—they complement it. The best approach is often to build agentic retrieval on top of your existing retrieval system by exposing your embedding models and search capabilities as tools that an agent can use. This combines the strengths of both approaches: the persistence and flexibility of agents with the efficiency and scalability of well-tuned embedding models.
-**What are the benefits of using grep and find tools with agents?**
+## What are the benefits of using grep and find tools with agents?
Using simple tools like grep and find with agents offers several advantages:
@@ -130,7 +130,7 @@ Using simple tools like grep and find with agents offers several advantages:
- The system is easier to build and maintain without complex vector database dependencies
- Agents can course-correct when searches don't yield useful results by trying different approaches
-**What are the limitations of using grep and find for retrieval?**
+## What are the limitations of using grep and find for retrieval?
While grep and find work well for certain scenarios, they have significant limitations:
@@ -139,7 +139,7 @@ While grep and find work well for certain scenarios, they have significant limit
- They work best with highly structured content like code that contains distinctive keywords
- They can be slower than optimized embedding-based searches for large datasets
-**What's the ideal approach to retrieval for coding agents?**
+## What's the ideal approach to retrieval for coding agents?
144
@@ -148,15 +148,15 @@ The best approach is often a hybrid system that combines:
3. The ability to choose the most appropriate tool based on the specific search task
-**How should I evaluate agentic retrieval systems?**
+## How should I evaluate agentic retrieval systems?
Start with a qualitative "vibe check" using 5-10 examples to understand how the system performs. Observe the agent's behavior, identify patterns in successes and failures, and develop an intuition for where improvements are needed. Only after this initial assessment should you move to quantitative end-to-end evaluations or specific evaluations of individual components like embedding tools. Remember that improving a single component (like an embedding model) may not necessarily improve the end-to-end performance if the agent is already persistent enough to overcome limitations.
-**I already built a retrieval system with custom-trained embedding models. Should I replace it with agentic retrieval?**
+## I already built a retrieval system with custom-trained embedding models. Should I replace it with agentic retrieval?
No, don't replace it—enhance it. Build agentic retrieval on top of your existing system by exposing your embedding models and search capabilities as tools that an agent can use. This gives you the best of both worlds: the quality and efficiency of your custom embeddings plus the persistence and flexibility of an agent.
-**How can I improve my agentic retrieval system?**
+## How can I improve my agentic retrieval system?
Focus on building better tools for your agent:
@@ -166,11 +166,11 @@ Focus on building better tools for your agent:
- Consider hierarchical retrieval approaches like creating summaries of files or directories
- Add specialized tools for specific retrieval tasks (like searching commit history)
-**How do memories work with agentic retrieval systems?**
+## How do memories work with agentic retrieval systems?
Memories in agentic systems can be implemented by adding tools that save and read memories. These memories can serve as a semantic cache that speeds up future searches by storing information about the codebase structure, relevant interfaces, or other insights gained during previous searches. This can significantly improve performance on similar tasks in the future.
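As a rough sketch of such memory tools, the example below persists notes to a JSON file and reads them back. The file path, tool names, and keyword-matching lookup are illustrative, not a specific agent framework's API:

```python
# Hedged sketch of memory tools for an agent: a simple JSON-backed store used as a cache
# of things learned in earlier searches. Names and storage format are illustrative.
import json
from pathlib import Path

MEMORY_FILE = Path("agent_memories.json")

def save_memory(topic: str, note: str) -> str:
    """Persist a note the agent wants to remember (e.g. 'auth logic lives in src/auth/')."""
    memories = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []
    memories.append({"topic": topic, "note": note})
    MEMORY_FILE.write_text(json.dumps(memories, indent=2))
    return f"Saved memory about {topic!r}"

def read_memories(query: str) -> list[dict]:
    """Return previously saved notes whose topic or text mentions the query term."""
    if not MEMORY_FILE.exists():
        return []
    memories = json.loads(MEMORY_FILE.read_text())
    q = query.lower()
    return [m for m in memories if q in m["topic"].lower() or q in m["note"].lower()]
```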
-**Why did embedding models not improve performance on SWE-Bench?**
+## Why did embedding models not improve performance on SWE-Bench?
For the SWE-Bench coding evaluation, embedding models didn't significantly improve performance because:
Generative benchmarking is a method to create custom evaluation sets from your own data to test AI retrieval systems. It involves generating realistic queries from your document corpus and using these query-document pairs to evaluate how well different embedding models and retrieval systems perform with your specific data. Unlike public benchmarks, this approach gives you insights directly relevant to your use case.
-**Why are custom benchmarks better than public benchmarks like MTEB?**
+## Why are custom benchmarks better than public benchmarks like MTEB?
Custom benchmarks address several limitations of public benchmarks like MTEB (Massive Text Embedding Benchmark). While MTEB is widely used for comparing embedding models, it uses generic data that may not reflect your specific domain, contains artificially clean query-document pairs, and may have been seen by models during training. Good performance on MTEB doesn't guarantee good performance on your specific data and use case.
-**How does the generative benchmarking process work?**
+## How does the generative benchmarking process work?
The process involves two main steps. First, chunk filtering identifies document chunks that users would realistically query about, filtering out irrelevant content. Second, query generation creates realistic user queries from these filtered chunks. The resulting query-document pairs form your evaluation set, which you can use to test different embedding models and retrieval components.
-**What's involved in the chunk filtering step?**
+## What's involved in the chunk filtering step?
Chunk filtering uses an aligned LLM judge to identify document chunks that contain information users would actually query. This involves creating criteria for relevance, providing a small set of human-labeled examples, and iterating on the LLM judge to improve alignment with human judgment. This step helps filter out irrelevant content like news articles or marketing material that wouldn't be useful in a support context.
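A minimal sketch of what an LLM judge for chunk filtering might look like. The prompt, model, and YES/NO convention are assumptions and would need to be aligned against the human-labeled examples described above:

```python
# Hedged sketch of chunk filtering with an LLM judge; criteria and model are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You label documentation chunks for a support assistant. "
    "Answer YES if a real user might ask a question answered by this chunk, "
    "NO if it is marketing copy, news, or otherwise not query-worthy."
)

def is_query_worthy(chunk: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": chunk},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

# filtered_chunks = [c for c in chunks if is_query_worthy(c)]
```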
-**How do you generate realistic queries?**
+## How do you generate realistic queries?
Query generation uses an LLM with specific context about your application and example queries. Providing this context helps the LLM focus on topics users would ask about, while example queries guide the style of generated queries. This approach creates more realistic, sometimes ambiguous queries that better reflect how users actually search, rather than perfectly formed questions that match document content exactly.
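A hedged sketch of query generation along these lines. The application context, example queries, prompt, and model are placeholders, not the report's exact setup:

```python
# Hedged sketch: give the LLM application context plus a few real example queries to steer
# tone, then generate one realistic query per filtered chunk.
from openai import OpenAI

client = OpenAI()

APP_CONTEXT = "Users are developers asking a support bot about our vector database product."
EXAMPLE_QUERIES = ["how do i filter by metadata", "collection won't persist after restart?"]

def generate_query(chunk: str) -> str:
    prompt = (
        f"{APP_CONTEXT}\n"
        "Example queries (match their informal style):\n- " + "\n- ".join(EXAMPLE_QUERIES) +
        "\n\nWrite one realistic user query answerable by this chunk:\n" + chunk
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

# eval_set = [(generate_query(c), c) for c in filtered_chunks]
```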
-**How do you evaluate retrieval performance with the generated benchmark?**
+## How do you evaluate retrieval performance with the generated benchmark?
Once you have your evaluation set with query-document pairs, you can test different embedding models by embedding each document chunk, storing them in a vector database, and then embedding each query to retrieve the top K document chunks. If the matching document is in the top K results, that counts as a success. This gives you metrics like recall@K and NDCG that you can compare across different models and configurations.
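A minimal sketch of that recall@K loop using Chroma's Python client with its default embedding function (swap in whichever embedding model you are testing). The pair format and collection name are assumptions:

```python
# Hedged sketch of recall@K over generated (query, source_chunk) pairs using Chroma.
import chromadb

def recall_at_k(pairs: list[tuple[str, str]], k: int = 10) -> float:
    """A hit means the query's source chunk appears in the top-K retrieved results."""
    client = chromadb.Client()
    collection = client.create_collection("eval_chunks")
    chunks = [chunk for _, chunk in pairs]
    ids = [f"chunk-{i}" for i in range(len(chunks))]
    collection.add(documents=chunks, ids=ids)

    hits = 0
    for i, (query, _) in enumerate(pairs):
        results = collection.query(query_texts=[query], n_results=k)
        if f"chunk-{i}" in results["ids"][0]:
            hits += 1
    return hits / len(pairs)

# print(recall_at_k(eval_set, k=10))
```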
-**What insights can generative benchmarking provide?**
+## What insights can generative benchmarking provide?
Generative benchmarking can help you select the best embedding model for your specific data, identify irrelevant content in your document corpus, and evaluate changes to your retrieval pipeline like adding re-ranking or chunk rewriting. It can also reveal when public benchmark rankings don't align with performance on your data, as demonstrated in a case study where model rankings differed from MTEB rankings.
-**Do I need production data to use generative benchmarking?**
+## Do I need production data to use generative benchmarking?
No, you can use generative benchmarking even if you don't have production data yet. All you need is a document corpus to generate an evaluation set. However, if you do have production queries, you can use them to further align your generated queries to real user behavior, identify knowledge gaps in your document corpus, and make your evaluation set even more representative.
-**Is generative benchmarking fully automated?**
+## Is generative benchmarking fully automated?
No, generative benchmarking isn't 100% automated. It requires human involvement to get good results. You'll need to align your LLM judge, provide context and example queries to steer query generation, and manually review data throughout the process. The human-in-the-loop aspect is critical for creating evaluation sets that truly reflect your use case.
-**How can I try generative benchmarking on my own data?**
+## How can I try generative benchmarking on my own data?
You can try generative benchmarking on your own data by using Chroma's open-source tools. The full technical report is available at research.trychroma.com, and you can run the process with just a few lines of code. Chroma Cloud is also available if you want to use their hosted vector database solution.
-**How does contextual chunk rewriting fit into retrieval evaluation?**
+## How does contextual chunk rewriting fit into retrieval evaluation?
Contextual chunk rewriting involves adding context to document chunks to improve retrieval. While it can be effective, especially for content like tables or technical information that lacks context, it's also expensive since it requires running an LLM on every chunk. A more efficient approach might be to only rewrite chunks that need additional context, which you can identify during the filtering process. The value of this approach can be quantified through your evaluation metrics.
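As a rough sketch of the "only rewrite what needs it" idea, the example below rewrites only chunks flagged during filtering as lacking context. The `needs_context` flag, prompt, and model are hypothetical:

```python
# Hedged sketch of selective contextual rewriting: only flagged chunks get an LLM rewrite,
# keeping cost down compared with rewriting every chunk.
from openai import OpenAI

client = OpenAI()

def add_context(chunk: str, doc_title: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{
            "role": "user",
            "content": f"This chunk comes from '{doc_title}'. Rewrite it so it is "
                       f"understandable on its own, preserving all facts:\n\n{chunk}",
        }],
    )
    return resp.choices[0].message.content

def rewrite_where_needed(chunks: list[dict]) -> list[str]:
    # Each chunk dict is assumed to carry a 'needs_context' flag from the filtering step.
    return [
        add_context(c["text"], c["doc_title"]) if c["needs_context"] else c["text"]
        for c in chunks
    ]
```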