Refactor documentation for clarity and consistency in talks
- Updated headings in multiple talk documents to use consistent formatting, changing from bold to markdown headers for improved readability.
- Enhanced key takeaways and structured content for better organization and flow.
- Ensured all talks maintain a uniform style, making it easier for users to navigate and understand the material.
docs/talks/chromadb-anton-chunking.md (5 additions, 5 deletions)
@@ -11,7 +11,7 @@ date: 2025-01-01
I hosted a special session with Anton from ChromaDB to discuss their latest technical research on text chunking for RAG applications. This session covers the fundamentals of chunking strategies, evaluation methods, and practical tips for improving retrieval performance in your AI systems.
-**What is chunking and why is it important for RAG systems?**
+## What is chunking and why is it important for RAG systems?
Chunking is the process of splitting documents into smaller components to enable effective retrieval of relevant information. Despite what many believe, chunking remains critical even as LLM context windows grow larger.
The fundamental purpose of chunking is to find the relevant text for a given query among all the divisions we've created from our documents. This becomes especially important when the information needed to answer a query spans multiple documents.
@@ -25,7 +25,7 @@ There are several compelling reasons why chunking matters regardless of context
***Key Takeaway:*** Chunking will remain important regardless of how large context windows become because it addresses fundamental challenges in retrieval efficiency, accuracy, and cost management.
-**What approaches exist for text chunking?**
+## What approaches exist for text chunking?
There are two broad categories of chunking approaches that are currently being used:
Heuristic approaches rely on separator characters (like newlines, question marks, periods) to divide documents based on their existing structure. The most widely used implementation is the recursive character text splitter, which uses a hierarchy of splitting characters to subdivide documents into pieces not exceeding a specified maximum length.
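To make the recursive idea concrete, here is a minimal sketch (in Python) of a recursive character splitter. The separator hierarchy and maximum length are illustrative defaults, not the settings discussed in the talk or any particular library's implementation:

```python
# Illustrative sketch of recursive character splitting.
# Separators are tried coarsest-to-finest; oversized pieces are re-split with the next one.
SEPARATORS = ["\n\n", "\n", ". ", " "]  # assumed hierarchy

def recursive_split(text: str, max_len: int = 500, level: int = 0) -> list[str]:
    if len(text) <= max_len:
        return [text]
    if level >= len(SEPARATORS):
        # No separators left: fall back to hard cuts at max_len.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    chunks, current = [], ""
    for piece in text.split(SEPARATORS[level]):
        candidate = piece if not current else current + SEPARATORS[level] + piece
        if len(candidate) <= max_len:
            current = candidate
        elif len(piece) <= max_len:
            if current:
                chunks.append(current)
            current = piece
        else:
            if current:
                chunks.append(current)
                current = ""
            # The piece itself is too long; recurse with a finer separator.
            chunks.extend(recursive_split(piece, max_len, level + 1))
    if current:
        chunks.append(current)
    return chunks
```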
@@ -38,7 +38,7 @@ What's particularly interesting is that you can use the same embedding model for
***Key Takeaway:*** While heuristic approaches like recursive character text splitters are most common today, semantic chunking methods that identify natural topic boundaries show promise for more robust performance across diverse document types.
-**Does chunking strategy actually matter for performance?**
+## Does chunking strategy actually matter for performance?
According to Anton's research, chunking strategy matters tremendously. Their technical report demonstrates significant performance variations based solely on chunking approach, even when using the same embedding model and retrieval system.
They discovered two fundamental rules of thumb that exist in tension with each other:
@@ -52,7 +52,7 @@ By looking at your actual chunks, you can develop intuition about how your chunk
***Key Takeaway:*** There's no one-size-fits-all chunking strategy. The best approach depends on your specific data and task, which is why examining your actual chunks is essential for diagnosing retrieval problems.
-**How should we evaluate chunking strategies?**
+## How should we evaluate chunking strategies?
When evaluating chunking strategies, focus on the retriever itself rather than the generative output. This differs from traditional information retrieval benchmarks in several important ways:
Recall is the single most important metric. Modern models are increasingly good at ignoring irrelevant information, but they cannot complete a task if you haven't retrieved all the relevant information in the first place.
@@ -65,7 +65,7 @@ The ChromaDB team has released code for their generative benchmark, which can he
***Key Takeaway:*** Focus on passage-level recall rather than document-level metrics or ranking-sensitive measures. The model can handle irrelevant information, but it can't work with information that wasn't retrieved.
-**What practical advice can improve our chunking implementation?**
+## What practical advice can improve our chunking implementation?
The most emphatic advice from Anton was: "Always, always, always look at your data." This point was stressed repeatedly throughout the presentation.
Many retrieval problems stem from poor chunking that isn't apparent until you actually examine the chunks being produced. Default settings in popular libraries often produce surprisingly poor results for specific datasets.
docs/talks/colin-rag-agents.md (12 additions, 12 deletions)
@@ -11,7 +11,7 @@ date: 2025-06-30
I hosted Colin Flaherty, previously a founding engineer at Augment and co-author of Meta's Cicero AI, to discuss autonomous coding agents and retrieval systems. This session explores how agentic approaches are transforming traditional RAG systems, what we can learn from state-of-the-art coding agents, and how these insights might apply to other domains.
-**Do agents make traditional RAG obsolete?**
+## Do agents make traditional RAG obsolete?
Colin shared his experience building an agent for SWE-Bench Verified, a canonical AI coding evaluation where agents implement code changes based on problem descriptions. His team's agent reached the top of the leaderboard with a surprising discovery: embedding-based retrieval wasn't the bottleneck they expected.
"We explored adding various embedding-based retrieval tools, but found that for SweeBench tasks this was not the bottleneck - grep and find were sufficient," Colin explained. This initially surprised him, as he expected embedding models to be significantly more powerful.
@@ -95,7 +95,7 @@ To enhance agentic retrieval, Colin recommends:
One particularly effective technique is asynchronous pre-processing: "I've taken songs and used an LLM to create a dossier about each one. This simple pre-processing step took a totally non-working search system and turned it into something that works really well."
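As a rough illustration of that pre-processing step, the sketch below builds an LLM "dossier" per item ahead of time and returns them for later indexing. The OpenAI-style client, model name, and prompt are assumptions for the example, not Colin's actual pipeline:

```python
# Hedged sketch of asynchronous pre-processing: build an LLM "dossier" per item offline,
# then index the dossiers for search. Model name and prompt are illustrative assumptions.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def build_dossier(item_id: str, raw_text: str) -> dict:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap for whatever you use
        messages=[
            {"role": "system", "content": "Summarize this item into a searchable dossier: "
                                          "themes, entities, and likely user questions."},
            {"role": "user", "content": raw_text},
        ],
    )
    return {"id": item_id, "dossier": resp.choices[0].message.content}

async def preprocess(items: dict[str, str]) -> list[dict]:
    # Run the per-item calls concurrently; the dossiers are what you embed or grep later.
    return await asyncio.gather(*(build_dossier(i, t) for i, t in items.items()))

# dossiers = asyncio.run(preprocess({"song-1": "lyrics and metadata for song 1"}))
```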
-**Why aren't more people training great embedding models?**
+## Why aren't more people training great embedding models?
When asked what question people aren't asking enough, Colin highlighted the lack of expertise in training embedding models: "Very few people understand how to build and train good retrieval systems. It just confuses me why no one knows how to fine-tune really good embedding models."
He attributed this partly to the specialized nature of the skill and partly to data availability. For code, there's abundant data on GitHub, but most domains lack comparable resources. Additionally, the most talented engineers often prefer working on LLMs rather than embedding models.
@@ -113,15 +113,15 @@ As these systems evolve, we'll likely see more specialized tools emerging for di
**FAQs**
-**What is agentic retrieval and how does it differ from traditional RAG?**
+## What is agentic retrieval and how does it differ from traditional RAG?
Agentic retrieval is an approach where AI agents use tools like grep, find, or embedding models to search through code and other content. Unlike traditional RAG (Retrieval-Augmented Generation), which typically uses embedding databases and vector searches, agentic retrieval gives the agent direct control over the search process. This allows the agent to be persistent, try multiple search strategies, and course-correct when initial attempts fail. Traditional RAG is more rigid but can be faster and more efficient for certain use cases.
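A minimal sketch of what "giving the agent direct control" can look like: grep wrapped as a callable tool, plus an OpenAI-style function-calling spec an agent loop could expose. The tool name, schema, and defaults are illustrative assumptions, not a specific framework's API:

```python
# Minimal sketch of exposing grep as an agent tool. The agent picks the pattern and can
# retry with different ones when a search returns nothing useful.
import subprocess

def grep_tool(pattern: str, path: str = ".", max_lines: int = 50) -> str:
    """Run a recursive, case-insensitive grep and return matching lines with file names."""
    result = subprocess.run(
        ["grep", "-rni", pattern, path],
        capture_output=True, text=True,
    )
    lines = result.stdout.splitlines()[:max_lines]
    return "\n".join(lines) if lines else f"No matches for {pattern!r}"

# Tool spec the agent loop would hand to the LLM (OpenAI-style function calling, assumed):
GREP_TOOL_SPEC = {
    "type": "function",
    "function": {
        "name": "grep_tool",
        "description": "Search the repository for a regex pattern and return matching lines.",
        "parameters": {
            "type": "object",
            "properties": {
                "pattern": {"type": "string"},
                "path": {"type": "string"},
            },
            "required": ["pattern"],
        },
    },
}
```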
-**Do agents make traditional RAG obsolete?**
+## Do agents make traditional RAG obsolete?
No, agents don't make traditional RAG obsolete—they complement it. The best approach is often to build agentic retrieval on top of your existing retrieval system by exposing your embedding models and search capabilities as tools that an agent can use. This combines the strengths of both approaches: the persistence and flexibility of agents with the efficiency and scalability of well-tuned embedding models.
-**What are the benefits of using grep and find tools with agents?**
+## What are the benefits of using grep and find tools with agents?
Using simple tools like grep and find with agents offers several advantages:
@@ -130,7 +130,7 @@ Using simple tools like grep and find with agents offers several advantages:
- The system is easier to build and maintain without complex vector database dependencies
- Agents can course-correct when searches don't yield useful results by trying different approaches
-**What are the limitations of using grep and find for retrieval?**
+## What are the limitations of using grep and find for retrieval?
While grep and find work well for certain scenarios, they have significant limitations:
@@ -139,7 +139,7 @@ While grep and find work well for certain scenarios, they have significant limit
- They work best with highly structured content like code that contains distinctive keywords
- They can be slower than optimized embedding-based searches for large datasets
-**What's the ideal approach to retrieval for coding agents?**
+## What's the ideal approach to retrieval for coding agents?
144
@@ -148,15 +148,15 @@ The best approach is often a hybrid system that combines:
3. The ability to choose the most appropriate tool based on the specific search task
-**How should I evaluate agentic retrieval systems?**
+## How should I evaluate agentic retrieval systems?
Start with a qualitative "vibe check" using 5-10 examples to understand how the system performs. Observe the agent's behavior, identify patterns in successes and failures, and develop an intuition for where improvements are needed. Only after this initial assessment should you move to quantitative end-to-end evaluations or specific evaluations of individual components like embedding tools. Remember that improving a single component (like an embedding model) may not necessarily improve the end-to-end performance if the agent is already persistent enough to overcome limitations.
-**I already built a retrieval system with custom-trained embedding models. Should I replace it with agentic retrieval?**
+## I already built a retrieval system with custom-trained embedding models. Should I replace it with agentic retrieval?
No, don't replace it—enhance it. Build agentic retrieval on top of your existing system by exposing your embedding models and search capabilities as tools that an agent can use. This gives you the best of both worlds: the quality and efficiency of your custom embeddings plus the persistence and flexibility of an agent.
-**How can I improve my agentic retrieval system?**
+## How can I improve my agentic retrieval system?
Focus on building better tools for your agent:
@@ -166,11 +166,11 @@ Focus on building better tools for your agent:
- Consider hierarchical retrieval approaches like creating summaries of files or directories
- Add specialized tools for specific retrieval tasks (like searching commit history)
-**How do memories work with agentic retrieval systems?**
+## How do memories work with agentic retrieval systems?
Memories in agentic systems can be implemented by adding tools that save and read memories. These memories can serve as a semantic cache that speeds up future searches by storing information about the codebase structure, relevant interfaces, or other insights gained during previous searches. This can significantly improve performance on similar tasks in the future.
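As a rough sketch of such memory tools, the example below persists notes to a JSON file and reads them back. The file path, tool names, and keyword-matching lookup are illustrative, not a specific agent framework's API:

```python
# Hedged sketch of memory tools for an agent: a simple JSON-backed store used as a cache
# of things learned in earlier searches. Names and storage format are illustrative.
import json
from pathlib import Path

MEMORY_FILE = Path("agent_memories.json")

def save_memory(topic: str, note: str) -> str:
    """Persist a note the agent wants to remember (e.g. 'auth logic lives in src/auth/')."""
    memories = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []
    memories.append({"topic": topic, "note": note})
    MEMORY_FILE.write_text(json.dumps(memories, indent=2))
    return f"Saved memory about {topic!r}"

def read_memories(query: str) -> list[dict]:
    """Return previously saved notes whose topic or text mentions the query term."""
    if not MEMORY_FILE.exists():
        return []
    memories = json.loads(MEMORY_FILE.read_text())
    q = query.lower()
    return [m for m in memories if q in m["topic"].lower() or q in m["note"].lower()]
```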
-**Why did embedding models not improve performance on SWE-Bench?**
+## Why did embedding models not improve performance on SWE-Bench?
For the SWE-Bench coding evaluation, embedding models didn't significantly improve performance because:
Generative benchmarking is a method to create custom evaluation sets from your own data to test AI retrieval systems. It involves generating realistic queries from your document corpus and using these query-document pairs to evaluate how well different embedding models and retrieval systems perform with your specific data. Unlike public benchmarks, this approach gives you insights directly relevant to your use case.
-**Why are custom benchmarks better than public benchmarks like MTEB?**
+## Why are custom benchmarks better than public benchmarks like MTEB?
Custom benchmarks address several limitations of public benchmarks like MTEB (Massive Text Embedding Benchmark). While MTEB is widely used for comparing embedding models, it uses generic data that may not reflect your specific domain, contains artificially clean query-document pairs, and may have been seen by models during training. Good performance on MTEB doesn't guarantee good performance on your specific data and use case.
-**How does the generative benchmarking process work?**
+## How does the generative benchmarking process work?
The process involves two main steps. First, chunk filtering identifies document chunks that users would realistically query about, filtering out irrelevant content. Second, query generation creates realistic user queries from these filtered chunks. The resulting query-document pairs form your evaluation set, which you can use to test different embedding models and retrieval components.
-**What's involved in the chunk filtering step?**
+## What's involved in the chunk filtering step?
Chunk filtering uses an aligned LLM judge to identify document chunks that contain information users would actually query. This involves creating criteria for relevance, providing a small set of human-labeled examples, and iterating on the LLM judge to improve alignment with human judgment. This step helps filter out irrelevant content like news articles or marketing material that wouldn't be useful in a support context.
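A minimal sketch of what an LLM judge for chunk filtering might look like. The prompt, model, and YES/NO convention are assumptions and would need to be aligned against the human-labeled examples described above:

```python
# Hedged sketch of chunk filtering with an LLM judge; criteria and model are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You label documentation chunks for a support assistant. "
    "Answer YES if a real user might ask a question answered by this chunk, "
    "NO if it is marketing copy, news, or otherwise not query-worthy."
)

def is_query_worthy(chunk: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": chunk},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

# filtered_chunks = [c for c in chunks if is_query_worthy(c)]
```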
-**How do you generate realistic queries?**
+## How do you generate realistic queries?
Query generation uses an LLM with specific context about your application and example queries. Providing this context helps the LLM focus on topics users would ask about, while example queries guide the style of generated queries. This approach creates more realistic, sometimes ambiguous queries that better reflect how users actually search, rather than perfectly formed questions that match document content exactly.
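A hedged sketch of query generation along these lines. The application context, example queries, prompt, and model are placeholders, not the report's exact setup:

```python
# Hedged sketch: give the LLM application context plus a few real example queries to steer
# tone, then generate one realistic query per filtered chunk.
from openai import OpenAI

client = OpenAI()

APP_CONTEXT = "Users are developers asking a support bot about our vector database product."
EXAMPLE_QUERIES = ["how do i filter by metadata", "collection won't persist after restart?"]

def generate_query(chunk: str) -> str:
    prompt = (
        f"{APP_CONTEXT}\n"
        "Example queries (match their informal style):\n- " + "\n- ".join(EXAMPLE_QUERIES) +
        "\n\nWrite one realistic user query answerable by this chunk:\n" + chunk
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

# eval_set = [(generate_query(c), c) for c in filtered_chunks]
```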
-**How do you evaluate retrieval performance with the generated benchmark?**
+## How do you evaluate retrieval performance with the generated benchmark?
Once you have your evaluation set with query-document pairs, you can test different embedding models by embedding each document chunk, storing them in a vector database, and then embedding each query to retrieve the top K document chunks. If the matching document is in the top K results, that counts as a success. This gives you metrics like recall@K and NDCG that you can compare across different models and configurations.
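A minimal sketch of that recall@K loop using Chroma's Python client with its default embedding function (swap in whichever embedding model you are testing). The pair format and collection name are assumptions:

```python
# Hedged sketch of recall@K over generated (query, source_chunk) pairs using Chroma.
import chromadb

def recall_at_k(pairs: list[tuple[str, str]], k: int = 10) -> float:
    """A hit means the query's source chunk appears in the top-K retrieved results."""
    client = chromadb.Client()
    collection = client.create_collection("eval_chunks")
    chunks = [chunk for _, chunk in pairs]
    ids = [f"chunk-{i}" for i in range(len(chunks))]
    collection.add(documents=chunks, ids=ids)

    hits = 0
    for i, (query, _) in enumerate(pairs):
        results = collection.query(query_texts=[query], n_results=k)
        if f"chunk-{i}" in results["ids"][0]:
            hits += 1
    return hits / len(pairs)

# print(recall_at_k(eval_set, k=10))
```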
-**What insights can generative benchmarking provide?**
+## What insights can generative benchmarking provide?
Generative benchmarking can help you select the best embedding model for your specific data, identify irrelevant content in your document corpus, and evaluate changes to your retrieval pipeline like adding re-ranking or chunk rewriting. It can also reveal when public benchmark rankings don't align with performance on your data, as demonstrated in a case study where model rankings differed from MTEB rankings.
-**Do I need production data to use generative benchmarking?**
+## Do I need production data to use generative benchmarking?
No, you can use generative benchmarking even if you don't have production data yet. All you need is a document corpus to generate an evaluation set. However, if you do have production queries, you can use them to further align your generated queries to real user behavior, identify knowledge gaps in your document corpus, and make your evaluation set even more representative.
-**Is generative benchmarking fully automated?**
+## Is generative benchmarking fully automated?
No, generative benchmarking isn't 100% automated. It requires human involvement to get good results. You'll need to align your LLM judge, provide context and example queries to steer query generation, and manually review data throughout the process. The human-in-the-loop aspect is critical for creating evaluation sets that truly reflect your use case.
-**How can I try generative benchmarking on my own data?**
+## How can I try generative benchmarking on my own data?
You can try generative benchmarking on your own data by using Chroma's open-source tools. The full technical report is available at research.trychroma.com, and you can run the process with just a few lines of code. Chroma Cloud is also available if you want to use their hosted vector database solution.
-**How does contextual chunk rewriting fit into retrieval evaluation?**
+## How does contextual chunk rewriting fit into retrieval evaluation?
Contextual chunk rewriting involves adding context to document chunks to improve retrieval. While it can be effective, especially for content like tables or technical information that lacks context, it's also expensive since it requires running an LLM on every chunk. A more efficient approach might be to only rewrite chunks that need additional context, which you can identify during the filtering process. The value of this approach can be quantified through your evaluation metrics.
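As a rough sketch of the "only rewrite what needs it" idea, the example below rewrites only chunks flagged during filtering as lacking context. The `needs_context` flag, prompt, and model are hypothetical:

```python
# Hedged sketch of selective contextual rewriting: only flagged chunks get an LLM rewrite,
# keeping cost down compared with rewriting every chunk.
from openai import OpenAI

client = OpenAI()

def add_context(chunk: str, doc_title: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{
            "role": "user",
            "content": f"This chunk comes from '{doc_title}'. Rewrite it so it is "
                       f"understandable on its own, preserving all facts:\n\n{chunk}",
        }],
    )
    return resp.choices[0].message.content

def rewrite_where_needed(chunks: list[dict]) -> list[str]:
    # Each chunk dict is assumed to carry a 'needs_context' flag from the filtering step.
    return [
        add_context(c["text"], c["doc_title"]) if c["needs_context"] else c["text"]
        for c in chunks
    ]
```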