feat(tools): add embedding latency benchmarking suite by devin-ai-integration[bot] · Pull Request #57 · jxnl/systematically-improving-rag

devin-ai-integration · 2025-08-28T18:34:58Z

Add embedding latency benchmarking suite with real API integration

Summary

This PR adds a comprehensive standalone benchmarking tool for measuring embedding latency across multiple providers (OpenAI, Cohere, Gemini, Voyager) to demonstrate that embedding models are the primary bottleneck in RAG systems. The script loads real MTEB datasets from Hugging Face, makes actual API calls to embedding providers, and outputs P50/P95/P99 latency statistics grouped by provider and model.

Key components:

Real API integration: Uses official client libraries for all providers with proper async/await patterns and retry logic
MTEB dataset loading: Loads real text samples from Hugging Face datasets with fallback to mock data when datasets fail
Statistical analysis: Calculates and displays P50, P95, P99 latency percentiles
Caching system: Enables restartability for long-running benchmarks
Graceful error handling: Skips providers when API keys are missing, handles dataset loading failures

Review & Testing Checklist for Human

This is a medium-risk PR with 4 critical items to verify:

End-to-end testing with real API keys: Run python scripts/embedding-benchmarks/run.py benchmark --samples-per-category 5 with actual API keys to verify all providers work correctly and produce realistic latency measurements
Dependency impact verification: Run uv sync and ensure the new dependencies (google-generativeai>=0.8.0, voyageai>=0.2.3) don't conflict with existing packages or break the project build
Statistical accuracy validation: Verify the P50/P95/P99 calculations in print_latency_statistics() are mathematically correct by spot-checking with known test data
Error handling robustness: Test the script behavior when API keys are missing, when network calls fail, and when Hugging Face datasets are inaccessible to ensure graceful degradation

Notes

Session: Requested by @jxnl in Devin session: https://app.devin.ai/sessions/2f152d90f9c9472b934a696736c97b13
API Key Management: The script detects API keys through environment variables but gracefully handles missing keys by skipping those providers
Dataset Loading: Successfully loads real samples from banking77, emotion, imdb, tweet_sentiment_extraction datasets; uses fallback mock data for datasets that require config selection or are inaccessible
GitHub Pages: Minor workflow update adds environment configuration (should not affect functionality)
Mock vs Real: Despite some "mock" references in comments, the implementation uses real API calls with proper client libraries

Important

Adds a comprehensive benchmarking suite for measuring embedding latency across multiple providers with real API integration and statistical analysis.

Behavior:
- Adds run.py for benchmarking embedding latency across OpenAI, Cohere, Gemini, and Voyager.
- Integrates real API calls with async/await and retry logic.
- Loads MTEB datasets from Hugging Face, with mock data fallback.
- Outputs P50/P95/P99 latency statistics.
- Caches results for restartability.
- Skips providers if API keys are missing.
Files Added:
- run.py, test_api_integration.py, modal_deployment.py for benchmarking and testing.
- README.md and MULTI_DATACENTER_DEPLOYMENT.md for documentation.
- .envrc for environment variable management.
Dependencies:
- Adds google-generativeai and voyageai to pyproject.toml.
- Updates uv.lock with new dependencies.
Misc:
- Updates gh-pages.yml for environment configuration.

^{This description was created by}^{for 66218ae. You can customize this summary. It will automatically update as commits are pushed.}

- Add standalone Python script for benchmarking embedding providers - Support for OpenAI, Cohere, Gemini, and Voyager APIs (mock implementation) - Mock MTEB dataset integration for realistic testing scenarios - Statistical analysis with P50, P95, P99 latency percentiles - Caching system for restartability during long benchmarks - Database co-location impact analysis - Configurable CLI with argparse for provider selection and batch sizes This tool demonstrates that embedding models are the primary bottleneck in RAG systems (100-500ms) compared to database reads (8-20ms). Co-Authored-By: Jason Liu <jason@jxnl.co>

gitnotebooks · 2025-08-28T18:35:02Z

Review these changes at https://app.gitnotebooks.com/567-labs/systematically-improving-rag/pull/57

devin-ai-integration · 2025-08-28T18:35:03Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

Disable automatic comment and CI monitoring

ellipsis-dev · 2025-08-28T18:37:54Z

scripts/embedding-benchmarks/run_mock.py

+    def _get_cache_key(
+        self, provider: str, texts: list[str], model: str, batch_size: int
+    ) -> str:
+        content = f"{provider}:{model}:{batch_size}:{hash(tuple(texts))}"


Using Python's built-in hash on texts (hash(tuple(texts))) is unstable across sessions due to hash randomization. Consider using a stable serialization (e.g. JSON dump) for consistent cache keys.

Suggested change

content = f"{provider}:{model}:{batch_size}:{hash(tuple(texts))}"

content = f"{provider}:{model}:{batch_size}:{json.dumps(texts, sort_keys=True)}"

ellipsis-dev · 2025-08-28T18:37:54Z

scripts/embedding-benchmarks/run_mock.py

+        default="./data/cache",
+        help="Directory for caching results (enables restartability)",
+    )
+    benchmark_parser.add_argument(


The '--output-dir' argument is defined but never used. Remove it or implement its functionality.

ellipsis-dev · 2025-08-28T18:37:54Z

scripts/embedding-benchmarks/run_mock.py

+            ]
+            batch_results = []
+
+            for batch in batches:


Batches are processed sequentially in a for-loop; consider using asyncio.gather to run batch requests concurrently for improved performance if applicable.

ellipsis-dev · 2025-08-28T18:37:54Z

scripts/embedding-benchmarks/run_mock.py

+        if provider_results:
+            results[provider] = {
+                "success": True,
+                "latencies": provider_results,


Aggregating latencies across different batch sizes may obscure individual batch performance trends. Consider reporting stats per batch size.

…yment Co-Authored-By: Jason Liu <jason@jxnl.co>

- Update DEFAULT_MODELS to support multiple models per provider - Modify benchmarking engine to test all models for each provider - Update output table to show Provider/Model combinations - Expand table width to accommodate longer provider/model names - Test both small and large models for OpenAI, English and multilingual for Cohere Co-Authored-By: Jason Liu <jason@jxnl.co>

- Update model names to current versions (embed-v4.0, gemini-embedding-001, voyage-3-large) - Add google-generativeai and voyageai dependencies to pyproject.toml - Implement real API calls for all providers with proper error handling - Add retry logic using tenacity for robust API interactions - Rename run_mock.py to run.py to reflect real implementation - Maintain existing async patterns and response format - Remove unused random import after replacing mock implementations Co-Authored-By: Jason Liu <jason@jxnl.co>

…bug messages - Remove misleading 'Mock client initialized' debug message from EmbeddingProvider - Replace MTEBDataLoader mock implementation with real Hugging Face dataset loading - Add _get_text_field method to handle varying dataset schemas - Include fallback to mock data when datasets fail to load - Successfully loads real samples from banking77, emotion, imdb, tweet_sentiment_extraction - Maintains graceful error handling for datasets with config requirements Co-Authored-By: Jason Liu <jason@jxnl.co>

ellipsis-dev · 2025-08-29T19:48:34Z

scripts/embedding-benchmarks/run.py

+    def __init__(self):
+        self.available_datasets = MTEB_DATASETS
+
+    def load_samples(self, samples_per_category: int = 10) -> dict[str, list[str]]:


The load_samples method now loads datasets via Hugging Face. Consider catching more specific exceptions and ensure that 'trust_remote_code=True' is safe. Also, the docstring should reflect that real dataset samples are used with a fallback.

ellipsis-dev · 2025-08-29T19:48:35Z

scripts/embedding-benchmarks/run.py

+
+        return samples
+
+    def _get_text_field(self, dataset) -> str:


The _get_text_field method returns None when no suitable text field is found, yet its return type is annotated as 'str'. Consider updating the return type to Optional[str] for accuracy.

Suggested change

def _get_text_field(self, dataset) -> str:

def _get_text_field(self, dataset) -> Optional[str]:

- Add comprehensive test results section showing 140 real MTEB samples loaded - Include API integration test results demonstrating correct client implementations - Document that script correctly detects API keys and initializes all providers - Add test_api_integration.py script for verifying API implementations - Show expected output format with real API keys - Prove script functionality is working correctly (pending real API key access) Co-Authored-By: Jason Liu <jason@jxnl.co>

…-datacenter deployment strategy - Add matplotlib/seaborn histogram generation with P50/P95/P99 markers - Increase default samples to 100 per category for statistical significance - Generate comprehensive analysis plots (distribution, box plots, throughput comparison) - Save individual provider histograms and combined analysis - Export detailed statistics to CSV format - Add multi-datacenter deployment strategy documentation - Include Modal Labs deployment script for regional benchmarking - Successfully tested with Cohere: P50=111ms, P95=151ms, P99=210ms Co-Authored-By: Jason Liu <jason@jxnl.co>

Co-Authored-By: Jason Liu <jason@jxnl.co>

devin-ai-integration · 2025-09-07T15:26:20Z

Closing due to inactivity for more than 7 days. Configure here.

devin-ai-integration bot assigned jxnl Aug 28, 2025

devin-ai-integration bot requested a review from jxnl August 28, 2025 18:35

ellipsis-dev bot reviewed Aug 28, 2025

View reviewed changes

fix(ci): add missing environment configuration for GitHub Pages deplo…

cb7ac0a

…yment Co-Authored-By: Jason Liu <jason@jxnl.co>

devin-ai-integration bot had a problem deploying to github-pages August 28, 2025 18:45 Failure

devin-ai-integration bot had a problem deploying to github-pages August 28, 2025 18:54 Failure

devin-ai-integration bot had a problem deploying to github-pages August 28, 2025 21:17 Failure

devin-ai-integration bot had a problem deploying to github-pages August 29, 2025 19:46 Failure

ellipsis-dev bot reviewed Aug 29, 2025

View reviewed changes

devin-ai-integration bot had a problem deploying to github-pages August 29, 2025 23:02 Failure

devin-ai-integration bot and others added 2 commits August 30, 2025 02:56

fix(benchmarks): update Modal deployment script formatting

66218ae

Co-Authored-By: Jason Liu <jason@jxnl.co>

devin-ai-integration bot had a problem deploying to github-pages August 30, 2025 02:56 Failure

devin-ai-integration bot closed this Sep 7, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(tools): add embedding latency benchmarking suite#57

feat(tools): add embedding latency benchmarking suite#57
devin-ai-integration[bot] wants to merge 8 commits intomainfrom
devin/1756405876-embedding-benchmarks

devin-ai-integration bot commented Aug 28, 2025 •

edited by ellipsis-dev bot

Loading

Uh oh!

gitnotebooks bot commented Aug 28, 2025

Uh oh!

devin-ai-integration bot commented Aug 28, 2025

Uh oh!

ellipsis-dev bot Aug 28, 2025

Uh oh!

ellipsis-dev bot Aug 28, 2025

Uh oh!

ellipsis-dev bot Aug 28, 2025

Uh oh!

ellipsis-dev bot Aug 28, 2025

Uh oh!

ellipsis-dev bot Aug 29, 2025

Uh oh!

ellipsis-dev bot Aug 29, 2025

Uh oh!

devin-ai-integration bot commented Sep 7, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	content = f"{provider}:{model}:{batch_size}:{hash(tuple(texts))}"
	content = f"{provider}:{model}:{batch_size}:{json.dumps(texts, sort_keys=True)}"

	def _get_text_field(self, dataset) -> str:
	def _get_text_field(self, dataset) -> Optional[str]:

Conversation

devin-ai-integration bot commented Aug 28, 2025 • edited by ellipsis-dev bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Add embedding latency benchmarking suite with real API integration

Summary

Review & Testing Checklist for Human

Notes

Uh oh!

gitnotebooks bot commented Aug 28, 2025

Uh oh!

devin-ai-integration bot commented Aug 28, 2025

🤖 Devin AI Engineer

Uh oh!

ellipsis-dev bot Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev bot Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev bot Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev bot Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev bot Aug 29, 2025

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev bot Aug 29, 2025

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration bot commented Sep 7, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

devin-ai-integration bot commented Aug 28, 2025 •

edited by ellipsis-dev bot

Loading