docs/zh/docs/en/blogs/2025/kvcache-wins-you-can-see.md (10 lines changed: 0 additions & 10 deletions)
@@ -70,8 +70,6 @@ AI agents represent the most extreme case of prefix dominance. These systems ope
 
 <small>*__FIGURE 2__: A visual of an agent loop, showing the massive, static context (tools, step-history) as the cached prefix and the new observation/action as the small suffix.*</small>
 
-<br/><br/>
-
 Reusing this massive context on each turn is essential for complex agents to be computationally viable and cost-effective.
 
 !!! tip "What about RAG?"
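The loop in FIGURE 2 is easy to make concrete. Below is a minimal, hypothetical sketch (the helpers `call_llm` and `run_tool` are illustrative stand-ins, not code from the post): the history is append-only, so the prompt for turn N is exactly the prompt for turn N-1 plus a small suffix, and a prefix-cache-aware server only has to prefill that suffix.

```python
# Minimal agent-loop sketch (illustrative, not code from the post).
# Key property: the context is append-only, so every turn reuses the
# previous turn's prompt as a cached prefix.

SYSTEM_AND_TOOLS = "...thousands of tokens of system prompt and tool schemas..."

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return "action: ..."

def run_tool(action: str) -> str:
    """Hypothetical stand-in for tool execution."""
    return "observation: ..."

def agent_loop(task: str, max_turns: int = 10) -> list[str]:
    history = [SYSTEM_AND_TOOLS, f"task: {task}"]
    for _ in range(max_turns):
        prompt = "\n".join(history)       # turn N's prompt = turn N-1's prompt + suffix
        action = call_llm(prompt)         # only the new suffix needs fresh prefill
        history.append(action)            # append-only: earlier KV blocks stay valid
        history.append(run_tool(action))
    return history
```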
@@ -88,8 +86,6 @@ Let's revisit our agentic workflow example to see the direct impact of being bli
 
 <small>*__FIGURE 3__: A heartbreaking KV-cache miss scenario.*</small>
 
-<br/><br/>
-
 This single routing decision triggers a cascade of failures:
 
 * **Cache Miss:** The warm cache benefit on Pod A is completely lost
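The failure cascade above is exactly what a cache-aware router is meant to prevent. As a hedged illustration (the 16-token block size and the `block_hashes` / `pick_pod` helpers are assumptions, not the scheduler's real code), a router can score each pod by how many leading blocks of the prompt it already caches:

```python
# Hypothetical prefix-affinity router (sketch, not the actual scheduler).
import hashlib

BLOCK_TOKENS = 16  # assumed KV-block granularity

def block_hashes(tokens: list[int]) -> list[str]:
    """Chain-hash the prompt block by block, as typical prefix caches do."""
    hashes, prev = [], ""
    usable = len(tokens) - len(tokens) % BLOCK_TOKENS
    for i in range(0, usable, BLOCK_TOKENS):
        prev = hashlib.sha256((prev + str(tokens[i:i + BLOCK_TOKENS])).encode()).hexdigest()
        hashes.append(prev)
    return hashes

def pick_pod(tokens: list[int], index: dict[str, set[str]], pods: list[str]) -> str:
    """Send the request to the pod holding the longest cached prefix."""
    scores = {pod: 0 for pod in pods}
    for h in block_hashes(tokens):
        holders = index.get(h, set())
        if not holders:
            break                 # prefix chain broken: no later block can hit
        for pod in holders:
            scores[pod] += 1
    return max(pods, key=lambda p: scores[p])
```

A production scheduler would blend this prefix score with load signals so that one hot prefix cannot saturate a single pod.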
@@ -120,8 +116,6 @@ This two-layered architecture provides a continuously updated, scalable view of
 
 <small>*__FIGURE 4__: Simplified architecture diagram. (1) - (3) show the read path, while (A) - (B) show the write pipeline.*</small>
 
-<br/><br/>
-
 **What about the overhead?** The memory overhead for this global index is negligible - see **Appendix A.3** for the scaling analysis showing a **1,000,000:1** data-to-metadata ratio.
 
 !!! info "High availability support"
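As a sanity check on that ratio, here is a back-of-the-envelope sketch. Every number in it (model shape, block size, bytes per index entry) is an assumption chosen for illustration, not a figure from Appendix A.3:

```python
# KV data vs. index metadata, back of the envelope (all numbers assumed).

layers, kv_heads, head_dim = 80, 8, 128                  # Llama-70B-like shape
bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # K and V, fp16
block_tokens = 16
kv_bytes_per_block = block_tokens * bytes_per_token      # ~5.2 MB of KV data

metadata_bytes_per_block = 48                            # assumed: block hash + pod id

print(kv_bytes_per_block / metadata_bytes_per_block)     # ~1.1e5
# With these assumptions the ratio is ~100,000:1; bigger models, longer
# blocks, or leaner entries push it toward the post's 1,000,000:1 figure.
# Either way, indexing terabytes of KV data costs only megabytes of metadata.
```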
@@ -192,8 +186,6 @@ This allows you to handle significantly more traffic on the exact same hardware,
 
 <small>*__FIGURE 5__: A tri-panel of TTFT, TPoT and Throughput measured through progressively rising QPS rates.*</small>
 
-<br/><br/>
-
 The charts above clearly illustrate these wins. The blue line (`precise-scheduling`) maintains the lowest Mean TTFT and achieves the highest Total Throughput as the request rate increases.
 
 #### The "Why": From Saved Work to System Throughput
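For readers reproducing the tri-panel, the two latency metrics are conventional; the sketch below shows how they fall out of per-request timestamps (`RequestTrace` is a hypothetical stand-in for whatever the benchmark harness actually logs):

```python
# Conventional definitions of TTFT and TPoT from request timestamps (sketch).
from dataclasses import dataclass

@dataclass
class RequestTrace:            # hypothetical trace record
    sent_at: float             # when the request was issued
    first_token_at: float      # when the first output token arrived
    done_at: float             # when the last output token arrived
    output_tokens: int

def ttft(r: RequestTrace) -> float:
    """Time To First Token: dominated by prefill, so prefix-cache hits shrink it."""
    return r.first_token_at - r.sent_at

def tpot(r: RequestTrace) -> float:
    """Time Per Output Token: the steady decode pace after the first token."""
    return (r.done_at - r.first_token_at) / max(r.output_tokens - 1, 1)
```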
@@ -210,8 +202,6 @@ First, we measure the **Effective Cache Throughput** - the number of prompt **to
 
 <small>*__FIGURE 6__: The total computational work **saved** by the KV-cache across the cluster, over the course of the benchmarks.*</small>
 
-<br/><br/>
-
 The chart clearly shows that `precise-scheduling` sustains a massive and stable throughput of saved work by hitting the prefixes effectively. In the middle, we see `approximate-scheduling` with good but lower efficiency, and on the right, `random-scheduling` saving almost no work.
 
 ##### 2. System State: The Consequence of Efficiency
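The metric in this hunk has a one-line definition; a hedged sketch follows (the function and argument names are ours, not the benchmark's API):

```python
# Effective Cache Throughput: prompt tokens served from cache per second (sketch).

def effective_cache_throughput(cached_prompt_tokens: int, window_seconds: float) -> float:
    """Prefill work *saved* by KV-cache hits, in tokens per second.

    cached_prompt_tokens: prompt tokens across all requests in the window
    that matched an existing prefix and therefore skipped prefill.
    """
    return cached_prompt_tokens / window_seconds

# Example: 12M cached prompt tokens over a 60 s window means the cluster
# skipped 200,000 tokens/s of prefill compute.
print(effective_cache_throughput(12_000_000, 60.0))  # 200000.0
```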