Enhance documentation clarity and consistency across multiple files
- Updated AGENT.md and CLAUDE.md to emphasize the use of `uv` for package management and installation commands, ensuring consistency in dependency management instructions.
- Improved formatting in mkdocs.yml by streamlining plugin and markdown extension listings for better readability.
- Added new sections and improved existing content in various talks to enhance clarity and user engagement, including better organization of tags and descriptions.
- Ensured consistent use of spacing and formatting across all documentation files to align with established style guidelines.
docs/talks/AGENTS.md (+18 -2: 18 additions, 2 deletions)
> **Project Setup Note:**
> To install all dependencies and extras for building and working with this documentation, always use:
>
> ```sh
> uv sync --all-extras
> ```
## Overview
This directory contains industry talks and presentations from the Systematically Improving RAG Applications series. Each talk provides insights from experts at companies like ChromaDB, Zapier, Glean, Exa, and others, covering practical RAG implementation strategies and lessons learned.
description: "Technical session with Anton from ChromaDB on text chunking fundamentals, evaluation methods, and practical tips for improving retrieval performance"
I hosted a special session with Anton from ChromaDB to discuss their latest technical research on text chunking for RAG applications. This session covers the fundamentals of chunking strategies, evaluation methods, and practical tips for improving retrieval performance in your AI systems.
## What is chunking and why is it important for RAG systems?
Chunking is the process of splitting documents into smaller components to enable effective retrieval of relevant information. Despite what many believe, chunking remains critical even as LLM context windows grow larger.
The fundamental purpose of chunking is to find the relevant text for a given query among all the divisions we've created from our documents. This becomes especially important when the information needed to answer a query spans multiple documents.
There are several compelling reasons why chunking matters regardless of context window size:
3. Information accuracy - Effective chunking eliminates distractors that could confuse the model
4. Retrieval performance - Proper chunking significantly improves your system's ability to find all relevant information
**_Key Takeaway:_** Chunking will remain important regardless of how large context windows become because it addresses fundamental challenges in retrieval efficiency, accuracy, and cost management.
## What approaches exist for text chunking?
There are two broad categories of chunking approaches in use today:
Heuristic approaches rely on separator characters (like newlines, question marks, periods) to divide documents based on their existing structure. The most widely used implementation is the recursive character text splitter, which uses a hierarchy of splitting characters to subdivide documents into pieces not exceeding a specified maximum length.
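As a rough sketch of the idea behind recursive character text splitters (not the implementation of any particular library), a splitter of this kind might look roughly like the Python below; `max_len` and the separator hierarchy are illustrative defaults, not recommended values:

```python
def recursive_split(text, max_len=500, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most max_len characters, trying
    separators from coarsest to finest."""
    if len(text) <= max_len:
        return [text]
    if not separators:
        # Nothing left to split on: fall back to a hard character split.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        # Current separator does not appear; try the next, finer one.
        return recursive_split(text, max_len, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_len:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) > max_len:
            # A single piece can still exceed the limit; recurse on it.
            chunks.extend(recursive_split(piece, max_len, finer))
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Illustrative usage (file name is hypothetical):
# chunks = recursive_split(open("my_document.txt").read(), max_len=200)
```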
Semantic approaches are more experimental but promising. These use embedding models to identify natural topic boundaries rather than relying on fixed separator characters.
What's particularly interesting is that you can use the same embedding model for both chunking and retrieval, potentially finding an embedding-optimal chunking strategy. Since embeddings are relatively cheap, this approach is becoming more viable.
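As a minimal illustration of that idea (assuming an `embed` function that maps a list of sentences to embedding vectors, for example the same model you use at retrieval time), one could start a new chunk wherever the cosine similarity between adjacent sentences drops; the `threshold` value here is purely illustrative:

```python
import numpy as np

def semantic_chunks(sentences, embed, threshold=0.75):
    """Group consecutive sentences into chunks, starting a new chunk when
    the cosine similarity between adjacent sentence embeddings falls below
    `threshold`. `sentences` is a non-empty list of strings; `embed` maps a
    list of strings to a 2-D array of embedding vectors."""
    vectors = np.asarray(embed(sentences), dtype=float)
    # Normalise so the dot product of neighbours equals cosine similarity.
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(vectors[i - 1] @ vectors[i])
        if similarity < threshold:
            chunks.append(" ".join(current))  # topic shift: close the chunk
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

Because the boundaries are chosen by the same notion of similarity the retriever will later use, this is one way to search for an embedding-optimal chunking strategy.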
**_Key Takeaway:_** While heuristic approaches like recursive character text splitters are most common today, semantic chunking methods that identify natural topic boundaries show promise for more robust performance across diverse document types.
## Does chunking strategy actually matter for performance?
According to Anton's research, chunking strategy matters tremendously. ChromaDB's technical report demonstrates significant performance variations based solely on chunking approach, even when using the same embedding model and retrieval system.
They discovered two fundamental rules of thumb that exist in tension with each other:
The most important insight, however, is that you must always examine your data.
By looking at your actual chunks, you can develop intuition about how your chunking strategy is working for your specific use case. This is critical because there's likely no universal "best" chunking strategy - the optimal approach depends on your data and task.
**_Key Takeaway:_** There's no one-size-fits-all chunking strategy. The best approach depends on your specific data and task, which is why examining your actual chunks is essential for diagnosing retrieval problems.
## How should we evaluate chunking strategies?
When evaluating chunking strategies, focus on the retriever itself rather than the generative output. This differs from traditional information retrieval benchmarks in several important ways:
Recall is the single most important metric. Modern models are increasingly good at ignoring irrelevant information, but they cannot complete a task if you haven't retrieved all the relevant information in the first place.
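For example, passage-level recall for a single query can be computed as the fraction of labelled-relevant passages that appear in the retriever's top-k results; the sketch below uses made-up passage IDs:

```python
def passage_recall(retrieved_ids, relevant_ids):
    """Fraction of the relevant passages that appear in the retrieved set.

    retrieved_ids: passage IDs returned by the retriever (top-k).
    relevant_ids:  passage IDs labelled relevant for the query.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(relevant & set(retrieved_ids)) / len(relevant)


# Averaged over a small labelled evaluation set (IDs here are illustrative):
eval_set = [
    {"retrieved": ["p1", "p7", "p9"], "relevant": ["p1", "p2"]},  # recall 0.5
    {"retrieved": ["p4", "p2", "p3"], "relevant": ["p2"]},        # recall 1.0
]
mean_recall = sum(passage_recall(q["retrieved"], q["relevant"]) for q in eval_set) / len(eval_set)
print(f"mean passage recall = {mean_recall:.2f}")  # 0.75
```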
Ranking metrics like NDCG (which consider the order of retrieved documents) are less important here than passage-level recall.
The ChromaDB team has released code for their generative benchmark, which can help evaluate chunking strategies against your specific data.
**_Key Takeaway:_** Focus on passage-level recall rather than document-level metrics or ranking-sensitive measures. The model can handle irrelevant information, but it can't work with information that wasn't retrieved.
## What practical advice can improve our chunking implementation?
The most emphatic advice from Anton was: "Always, always, always look at your data." This point was stressed repeatedly throughout the presentation.
Many retrieval problems stem from poor chunking that isn't apparent until you actually examine the chunks being produced. Default settings in popular libraries often produce surprisingly poor results for specific datasets.
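In practice, "looking at your data" can start with something as simple as printing length statistics and a random sample of chunks; the helper below is a hypothetical sketch (function and file names are illustrative):

```python
import random
import statistics

def inspect_chunks(chunks, sample_size=10, seed=0):
    """Print length statistics and a random sample of chunks so you can spot
    truncated sentences, merged topics, and boilerplate by eye."""
    lengths = [len(c) for c in chunks]
    print(f"{len(chunks)} chunks | min/median/max length: "
          f"{min(lengths)}/{int(statistics.median(lengths))}/{max(lengths)}")
    rng = random.Random(seed)
    for chunk in rng.sample(chunks, min(sample_size, len(chunks))):
        print("-" * 60)
        print(chunk[:500])  # truncate very long chunks for readability

# inspect_chunks(recursive_split(open("my_document.txt").read()))
```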
While better tooling is being developed to help with this process, in the meantime the most reliable approach is to examine your chunks manually against representative queries.
This approach acknowledges that we're in an interesting era of software development where AI application builders are being forced to learn machine learning best practices that have evolved over decades.
**_Key Takeaway:_** No amount of sophisticated algorithms can compensate for not understanding your data. Examining your chunks and evaluating them against representative queries is the most reliable path to improving retrieval performance.
**Final thoughts on chunking for RAG applications**
The fundamental tension in chunking is between maximizing the use of the embedding model's context window and avoiding the grouping of unrelated information. Finding the right balance requires understanding your specific data and use case.
As Anton emphasized, retrieval is not a general system but a task-specific one.
The ChromaDB team is developing better tooling to help with this process, but in the meantime, the most reliable approach is to manually examine your chunks and measure passage-level recall against representative queries.
By focusing on these fundamentals rather than blindly applying frameworks or following defaults, you can significantly improve the performance of your RAG applications and deliver better results to your users.
If you want to get discounts and a 6-day email course on the topic, make sure to subscribe.