
Commit a5fd64d

Update we-benchmarked-20-embedding-apis-with-milvus-7-insights-that-will-surprise-you.md
1 parent e819218 commit a5fd64d

1 file changed: +23 -19 lines changed

blog/en/we-benchmarked-20-embedding-apis-with-milvus-7-insights-that-will-surprise-you.md

Lines changed: 23 additions & 19 deletions
@@ -28,14 +28,14 @@ You've nailed the retrieval accuracy, optimized your vector database, and your d

What happened? Here's the performance killer no one benchmarks: **embedding API latency**.

- While MTEB rankings obsess over recall scores and model sizes, they ignore the metric your users actually feel—how long they wait before seeing any response. We tested every major embedding provider across real-world conditions and discovered latency differences so extreme they'll make you question your entire provider selection strategy.
+ While MTEB rankings obsess over recall scores and model sizes, they ignore the metric your users feel—how long they wait before seeing any response. We tested every major embedding provider across real-world conditions and discovered extreme latency differences that will make you question your entire provider selection strategy.

**_Spoiler: The most popular embedding APIs aren't the fastest. Geography matters more than model architecture. And sometimes a $20/month CPU beats a $200/month API call._**

## Why Embedding API Latency Is the Hidden Bottleneck in RAG

- When building RAG systems, e-commerce search, or recommendation engines, embedding models serve as the core component that transforms text into vectors, enabling machines to understand semantics and perform efficient similarity searches. While we typically pre-compute embeddings for document libraries, user queries still require real-time embedding API calls to convert questions into vectors before retrieval, and this real-time latency often becomes the performance bottleneck in the entire application chain.
+ When building RAG systems, e-commerce search, or recommendation engines, embedding models are the core component that transforms text into vectors, enabling machines to understand semantics and perform efficient similarity searches. While we typically pre-compute embeddings for document libraries, user queries still require real-time embedding API calls to convert questions into vectors before retrieval, and this real-time latency often becomes the performance bottleneck in the entire application chain.

Popular embedding benchmarks like MTEB focus on recall accuracy or model size, often overlooking the crucial performance metric—API latency. Using Milvus's `TextEmbedding` Function, we conducted comprehensive real-world tests on major embedding service providers in North America and Asia.

@@ -50,7 +50,7 @@ In a typical RAG workflow, when a user asks a question, the system must:

- Search for similar vectors in Milvus

- - Feed results and original question to an LLM
+ - Feed results and the original question to an LLM

- Generate and return the answer

@@ -61,17 +61,18 @@ Many developers assume the LLM's answer generation is the slowest part. However,

Whether building an index from scratch or performing routine updates, bulk ingestion requires vectorizing thousands or millions of text chunks. If each embedding call experiences high latency, your entire data pipeline slows dramatically, delaying product releases and knowledge base updates.

- Both scenarios make embedding API latency a non-negotiable performance metric for production RAG systems.
+ Both situations make embedding API latency a non-negotiable performance metric for production RAG systems.

## Measuring Real-World Embedding API Latency with Milvus

- Milvus is an open-source, high-performance vector database that offers a new `TextEmbedding` Function interface. This feature integrates popular embedding models from OpenAI, Cohere, AWS Bedrock, Google Vertex AI, Voyage AI, and many more providers directly into your data pipeline, streamlining your vector search pipeline with a single call.
+ Milvus is an open-source, high-performance vector database that offers a new `TextEmbedding` Function interface. This feature directly integrates popular embedding models from OpenAI, Cohere, AWS Bedrock, Google Vertex AI, Voyage AI, and many more providers into your data pipeline, streamlining your vector search pipeline with a single call.

- Using this new function interface, we tested and benchmarked various popular embedding APIs from American model providers like OpenAI and Cohere, as well as Asian providers like AliCloud and SiliconFlow, measuring their end-to-end latency in realistic deployment scenarios.
+ Using this new function interface, we tested and benchmarked popular embedding APIs from well-known providers like OpenAI and Cohere, as well as others like AliCloud and SiliconFlow, measuring their end-to-end latency in realistic deployment scenarios.

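To make the setup concrete, here is a minimal sketch of what declaring a collection with a `TextEmbedding` Function can look like in pymilvus (Milvus 2.6+). The URI, collection name, provider, and model are illustrative choices rather than the exact benchmark configuration, and the provider's API key is assumed to be configured on the Milvus server side:

```python
from pymilvus import MilvusClient, DataType, Function, FunctionType

client = MilvusClient(uri="http://localhost:19530")

# Raw text goes into "text"; the server-side function fills "dense".
schema = client.create_schema()
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=8192)
schema.add_field(field_name="dense", datatype=DataType.FLOAT_VECTOR, dim=1536)

# The TextEmbedding Function makes Milvus call the provider's embedding API
# during inserts and for raw-text queries at search time.
schema.add_function(Function(
    name="openai_embedding",
    function_type=FunctionType.TEXTEMBEDDING,
    input_field_names=["text"],
    output_field_names=["dense"],
    params={"provider": "openai", "model_name": "text-embedding-ada-002"},
))

index_params = client.prepare_index_params()
index_params.add_index(field_name="dense", index_type="AUTOINDEX", metric_type="COSINE")
client.create_collection("embedding_bench", schema=schema, index_params=index_params)

# Timing this insert captures the embedding API round trip plus Milvus overhead.
client.insert("embedding_bench", [{"text": "What is vector search?"}])
```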
Our comprehensive test suite covered various model configurations:

+
| **Provider** | **Model** | **Dimensions** |
| ---------------- | ------------------------------------- | -------------- |
| OpenAI | text-embedding-ada-002 | 1536 |
@@ -101,18 +102,18 @@ Our comprehensive test suite covered various model configurations:

## 7 Key Findings from Our Benchmarking Results

- We tested renowned embedding models from North America and Asia under different batch sizes, token lengths, and network conditions, measuring median latency across all scenarios. Our findings reveal critical insights that will reshape how you think about embedding API selection and optimization. Let’s take a look.
+ We tested leading embedding models under varying batch sizes, token lengths, and network conditions, measuring median latency across all scenarios. The results uncover key insights that could reshape how you choose and optimize embedding APIs. Let’s take a look.

### 1. Global Network Effects Are More Significant Than You Think

- Network environment is perhaps the most critical factor affecting embedding API performance. The same embedding API service provider can perform dramatically differently across network environments.
+ The network environment is perhaps the most critical factor affecting embedding API performance. The same embedding API service provider can perform dramatically differently across network environments.

![](https://assets.zilliz.com/latency_in_Asia_vs_in_US_cb4b5a425a.png)

When your application is deployed in Asia and accesses services like OpenAI, Cohere, or VoyageAI deployed in North America, network latency increases significantly. Our real-world tests show API call latency universally increased by **3 to 4 times**!

- Conversely, when your application is deployed North America and accesses Asian services like AliCloud Dashscope or SiliconFlow, performance degradation is even more severe. SiliconFlow, in particular, showed latency increases of **nearly 100 times** in cross-region scenarios!
+ Conversely, when your application is deployed in North America and accesses Asian services like AliCloud Dashscope or SiliconFlow, performance degradation is even more severe. SiliconFlow, in particular, showed latency increases of **nearly 100 times** in cross-region scenarios!

This means you must always select embedding providers based on your deployment location and user geography—performance claims without network context are meaningless.

@@ -121,9 +122,9 @@ This means you must always select embedding providers based on your deployment l

Our comprehensive latency testing revealed clear performance hierarchies:

- - **North American Models (median latency)**: Cohere > Google Vertex AI > VoyageAI > OpenAI > AWS Bedrock
+ - **North America-based Models (median latency)**: Cohere > Google Vertex AI > VoyageAI > OpenAI > AWS Bedrock

- - **Asian Models (median latency)**: SiliconFlow > AliCloud Dashscope
+ - **Asia-based Models (median latency)**: SiliconFlow > AliCloud Dashscope

These rankings challenge conventional wisdom about provider selection.

@@ -135,7 +136,7 @@ These rankings challenge conventional wisdom about provider selection. 

![](https://assets.zilliz.com/all_model_latency_vstoken_lengthwhen_batch_size_is_10_4dcf0d549a.png)

- Note: Due to the significant impact of network environment and server geographic regions on real-time embedding API latency, we compared North American and Asian model latencies separately.
+ Note: Due to the significant impact of network environment and server geographic regions on real-time embedding API latency, we compared North America and Asia-based model latencies separately.

### 3. Model Size Impact Varies Dramatically by Provider
@@ -160,7 +161,7 @@ This indicates that API response time depends on multiple factors beyond model a

### 4. Token Length and Batch Size Create Complex Trade-offs

- Depending on backend implementation, particularly batching strategies, token length might not significantly affect latency until batch sizes increase. Our testing revealed distinct patterns:
+ Depending on your backend implementation, especially your batching strategy, token length may have little impact on latency until batch sizes grow. Our testing revealed some clear patterns:

- **OpenAI's latency** remained fairly consistent between small and large batches, suggesting generous backend batching capabilities

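As a rough illustration of how these batch-size effects can be measured for any single provider, the sketch below times repeated calls to OpenAI's embeddings endpoint at a few batch sizes; the model name, batch sizes, and run count are arbitrary choices rather than the article's exact methodology:

```python
import statistics
import time

from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def median_latency_ms(batch_size: int, runs: int = 5) -> float:
    """Median wall-clock latency of one embeddings call at a given batch size."""
    texts = ["What is the capital of France?"] * batch_size
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        client.embeddings.create(model="text-embedding-3-small", input=texts)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


for batch_size in (1, 10, 100):
    print(f"batch={batch_size:>3}  median latency: {median_latency_ms(batch_size):.0f} ms")
```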
@@ -200,17 +201,16 @@ TEI Latency

### 7. Milvus Overhead Is Negligible

- Since we used Milvus to test embedding API latency, we validated that the additional overhead introduced by Milvus's TextEmbedding Function is extremely small and virtually negligible. Our measurements show Milvus operations add only 20-40ms total, while embedding API calls take hundreds of milliseconds to several seconds—meaning **Milvus adds less than 5% overhead** to the total operation time. The performance bottleneck primarily lies in network transmission and the embedding API service providers' own processing capabilities, not in the Milvus server layer.
-
+ Since we used Milvus to test embedding API latency, we validated that the additional overhead introduced by Milvus's TextEmbedding Function is minimal and virtually negligible. Our measurements show Milvus operations add only 20-40ms in total while embedding API calls take hundreds of milliseconds to several seconds, meaning Milvus adds less than 5% overhead to the total operation time. The performance bottleneck primarily lies in network transmission and the embedding API service providers' processing capabilities, not in the Milvus server layer.

## Tips: How to Optimize Your RAG Embedding Performance

- Based on our comprehensive benchmarks, we recommend the following strategies to optimize your RAG system's embedding performance:
+ Based on our benchmarks, we recommend the following strategies to optimize your RAG system's embedding performance:

### 1. Always Localize Your Testing

- Don't blindly trust any generic benchmark reports (including this one!). You should always test models within your actual deployment environment rather than relying solely on published benchmarks. Network conditions, geographic proximity, and infrastructure differences can dramatically impact real-world performance.
+ Don't trust any generic benchmark reports (including this one!). You should always test models within your actual deployment environment rather than relying solely on published benchmarks. Network conditions, geographic proximity, and infrastructure differences can dramatically impact real-world performance.

### 2. Geo-Match Your Providers Strategically
@@ -234,14 +234,16 @@ One configuration doesn't fit all models or use cases. The optimal batch size an

### 5. Implement Strategic Caching

- For high-frequency queries, cache both the query text and its generated embeddings (using solutions like Redis). Subsequent identical queries can directly hit the cache, reducing latency to milliseconds. This represents one of the most cost-effective and impactful query latency optimization techniques available.
+ For high-frequency queries, cache both the query text and its generated embeddings (using solutions like Redis). Subsequent identical queries can directly hit the cache, reducing latency to milliseconds. This represents one of the most cost-effective and impactful query latency optimization techniques.

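A minimal sketch of that caching pattern with Redis is shown below; the key scheme and TTL are arbitrary, and `embed_query` is a placeholder for whichever embedding API call your application actually makes:

```python
import hashlib
import json

import redis  # pip install redis

cache = redis.Redis(host="localhost", port=6379)


def embed_query(text: str) -> list[float]:
    """Placeholder: call your embedding provider here."""
    raise NotImplementedError


def cached_embedding(text: str, model: str, ttl_seconds: int = 3600) -> list[float]:
    # Key on model + exact query text so different models never share entries.
    key = "emb:" + hashlib.sha256(f"{model}:{text}".encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)  # cache hit: millisecond-level latency
    vector = embed_query(text)  # cache miss: pay the API latency once
    cache.set(key, json.dumps(vector), ex=ttl_seconds)
    return vector
```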

### 6. Consider Local Inference Deployment

If you have extremely high requirements for data ingestion latency, query latency, and data privacy, or if API call costs are prohibitive, consider deploying embedding models locally for inference. Standard API plans often come with QPS limitations, unstable latency, and lack of SLA guarantees—constraints that can be problematic for production environments.

- For many individual developers or small teams, the lack of enterprise-grade GPUs might seem like a barrier to local deployment of high-performance embedding models. However, this doesn't mean abandoning local inference entirely. Combined with high-performance inference engines like [Hugging Face's text-embeddings-inference](https://github.com/huggingface/text-embeddings-inference), even running small to medium-sized embedding models on CPU can achieve decent performance that may outperform high-latency API calls, especially for large-scale offline embedding generation.
+
+ For many individual developers or small teams, the lack of enterprise-grade GPUs is a barrier to local deployment of high-performance embedding models. However, this doesn't mean abandoning local inference entirely. With high-performance inference engines like [Hugging Face's text-embeddings-inference](https://github.com/huggingface/text-embeddings-inference), even running small to medium-sized embedding models on a CPU can achieve decent performance that may outperform high-latency API calls, especially for large-scale offline embedding generation.
+

This approach requires careful consideration of trade-offs between cost, performance, and maintenance complexity.
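For reference, here is a sketch of what querying a locally hosted text-embeddings-inference (TEI) server can look like, assuming a TEI container is already running on localhost:8080 with a small model such as BAAI/bge-small-en-v1.5 (see the TEI README for the exact launch command):

```python
import requests  # pip install requests

TEI_URL = "http://localhost:8080/embed"  # TEI's embed route


def embed_locally(texts: list[str]) -> list[list[float]]:
    """Send a batch of texts to the local TEI server and return their vectors."""
    response = requests.post(TEI_URL, json={"inputs": texts}, timeout=30)
    response.raise_for_status()
    return response.json()


vectors = embed_locally(["local inference avoids cross-region round trips"])
print(len(vectors[0]))  # embedding dimension, e.g. 384 for bge-small-en-v1.5
```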

@@ -274,3 +276,5 @@ The silent killer of RAG performance isn't where most developers are looking. Wh
These findings highlight a crucial blind spot in RAG optimization. Cross-region latency penalties, unexpected provider performance rankings, and the surprising competitiveness of local inference aren't edge cases—they're production realities affecting real applications. Understanding and measuring embedding API performance is essential for delivering responsive user experiences.

Your embedding provider choice is one critical piece of your RAG performance puzzle. By testing in your actual deployment environment, selecting geographically appropriate providers, and considering alternatives like local inference, you can eliminate a major source of user-facing delays and build truly responsive AI applications.
+
+ For more details on how we did this benchmarking, check [this notebook](https://github.com/zhuwenxing/text-embedding-bench).
