Add embeddings endpoint testing support #501
maryamtahhan wants to merge 7 commits into vllm-project:main from
Conversation
Sorry for the delay on review; I am hung up on some performance regression work, so this will probably be waiting a bit longer, possibly into next year. One note: we will be merging #478 first, which will affect this PR.
No problem.
I will update this PR since #478 has been merged.
Still working on this - will post an update soon |
Some high-level comments that need to be addressed before a full review:

- "quality" testing should be stripped out of this PR and put in a follow-up PR.
- Start with just the console and json/yaml output. Move the rest to a follow-up.
- Try to remove the random formatting changes to unrelated code, or submit a separate PR that this one builds on which cleans up formatting.
- There is a lot of duplicated code that is going to lead to a maintenance headache down the line. For classes that contain "Generative AI"-specific code where some of the functionality is needed for embeddings, move the shared functionality to a base class and have both versions inherit from it.
- Don't lazy or late import. If you're importing something non-default, import it in extras/embeddings and do something like this (without quality, this is probably unnecessary in this PR).

Basically, strip this PR down to an MVP. It's too many misc changes to meaningfully review.
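The base-class refactor the reviewer asks for could be sketched roughly like this. The class and method names below are illustrative only, not the actual guidellm classes:

```python
class BaseBenchmarkerOutput:
    """Functionality shared by generative and embeddings outputs (illustrative)."""

    def format_latency(self, seconds: float) -> str:
        # Shared formatting helper inherited by both subclasses
        return f"{seconds * 1000:.1f} ms"


class GenerativeBenchmarkerOutput(BaseBenchmarkerOutput):
    def summarize(self, latency: float, output_tokens: int) -> str:
        # Generative benchmarks also report output tokens
        return f"{self.format_latency(latency)}, {output_tokens} output tokens"


class EmbeddingsBenchmarkerOutput(BaseBenchmarkerOutput):
    def summarize(self, latency: float) -> str:
        # Embeddings produce no output tokens, so only latency is reported
        return self.format_latency(latency)
```

Both variants keep their endpoint-specific reporting while the shared helper lives in one place, which is the maintenance win the reviewer is after.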
Ok - I will refactor this to an MVP and push incremental PRs.
This pull request has merge conflicts that must be resolved before it can be merged.
@maryamtahhan, this project requires a linear history on feature branches. You can do this by running: |
Implements embeddings benchmarking functionality with a streamlined MVP scope, removing non-essential features to focus on core functionality. Also refactors common code patterns to eliminate duplication between embeddings and generative benchmarks.

MVP Scope Changes:
- Remove CSV and HTML output formats (keeping JSON only for embeddings)
- Remove quality validation (MTEB integration) from embeddings
- Simplify embeddings output to focus on performance metrics only
- Update tests to reflect MVP scope

Code Refactoring:
- Extract shared entrypoint utility functions (resolve_output_formats_generic, resolve_transient_phases) to entrypoints_utils.py
- Create BaseBenchmarkArgs base class with ~30 common configuration fields shared between BenchmarkEmbeddingsArgs and BenchmarkGenerativeTextArgs
- Consolidate benchmark orchestration into unified run_benchmark_workflow() function with customization points via modifier functions
- Reduce embeddings_entrypoints.py
- Add setup_backend_kwargs() modifier to configure embeddings-specific request_format (/v1/embeddings) and encoding_format
- Add setup_profile_kwargs() modifiers to customize profile constraints

Core Features:
- Full embeddings benchmark support with JSON output
- Support for all profile types (constant, sweep, poisson, synchronous, throughput)
- Request latency metrics (mean, median, p95, p99)
- Server throughput metrics (requests/sec, tokens/sec)
- Concurrency tracking
- Compatible with OpenAI-compatible embeddings endpoints
- vLLM embeddings server support

Testing:
- All unit tests pass (1940/1940 embeddings tests)
- All integration tests pass (2/2)
- E2E tests pass
- Live testing against AWS EC2 vLLM servers validates both embeddings (granite-embedding-english-r2) and generative (Qwen3-0.6B) workflows
- Profile testing (constant, sweep, poisson, synchronous, throughput)
- 140/140 requests successful across all profiles

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Integrated embeddings benchmarking into the main CLI after rebasing on origin/main. Fixed multiple issues to enable functional benchmarking of embedding models.

Key Changes:
- Added 'run-embeddings' CLI command to benchmark group
- Fixed EmbeddingsColumnMapper to handle correct data pipeline format (list[dict] instead of dict)
- Fixed EmbeddingsRequestCollator to return list for scheduler compatibility (scheduler expects Iterable[Iterable[Request]])
- Added UsageMetrics import and default value in accumulator
- Changed data_num_workers default to 0 for macOS compatibility
- Added progress tracker to embeddings entrypoint
- Enhanced error logging in data loaders (exception vs error)
- Added 'prompt_0' to embeddings column mapper defaults for synthetic data support

Bug Fixes:
- Fixed Console import paths (guidellm.utils.console)
- Removed MultiTurnRequestT usage (doesn't exist in current codebase)
- Fixed validation error when input_metrics is None
- Fixed AttributeError when scheduler iterates single requests

Testing:
- Verified with Python 3.12 (required for macOS stability)
- Tested with constant profile at 10 req/s
- Confirmed metrics collection and console output

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
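The turn-based pipeline fix in this commit can be sketched as follows. These are simplified stand-ins for EmbeddingsColumnMapper and EmbeddingsRequestCollator; the real classes operate on richer request objects:

```python
def map_embeddings_columns(row: dict) -> list[dict]:
    """Map a dataset row into the pipeline's turn-based format.

    The data pipeline expects a list of turns (list[dict]) per request,
    so even single-turn embeddings requests are wrapped in a list.
    """
    # Synthetic datasets expose the text under the prompt_0 column
    text = row.get("text") or row.get("prompt_0")
    return [{"input": text}]


def collate_requests(turns: list[dict]) -> list[list[dict]]:
    """Wrap each request in a list: the scheduler iterates
    Iterable[Iterable[Request]], never bare requests."""
    return [turns]
```

Returning a bare dict (or a bare request) is what triggered the AttributeError noted above once the scheduler tried to iterate it.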
Changed validation from gt=0 to ge=0 for output_tokens fields in SyntheticTextDatasetConfig. This allows embeddings benchmarks to use synthetic data with output_tokens=0, matching the embeddings paradigm where there are no output tokens. This enables users to run embeddings benchmarks with the same default synthetic data parameters as generative benchmarks, improving UX parity.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
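The effect of relaxing the constraint can be illustrated with a plain-Python stand-in (the real SyntheticTextDatasetConfig uses Pydantic field constraints; the fields shown here are simplified):

```python
from dataclasses import dataclass


@dataclass
class SyntheticTextDatasetConfig:
    """Simplified stand-in for the real Pydantic model."""

    prompt_tokens: int = 256
    output_tokens: int = 0  # embeddings generate no output tokens

    def __post_init__(self) -> None:
        # ge=0 rather than gt=0: zero is now valid, so embeddings benchmarks
        # can reuse the same synthetic-data defaults as generative ones
        if self.output_tokens < 0:
            raise ValueError("output_tokens must be >= 0")
```

With the old gt=0 rule, output_tokens=0 would have been rejected and embeddings runs would need special-cased synthetic data.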
Add extensive unit tests for embeddings functionality:
- EmbeddingsColumnMapper: 23 tests covering auto-detection, custom mappings, case insensitivity, and synthetic data support
- EmbeddingsRequestFinalizer: 20 tests for text processing, metrics, and edge cases
- Embeddings schemas: 22 tests for benchmark, report, args, and metadata

Add embeddings benchmarking guide (docs/guides/embeddings.md):
- Quick start with synthetic and real datasets
- Encoding format documentation (float vs base64)
- Load profiles (constant, sweep, synchronous)
- Metrics explanation (no output tokens, input-only processing)
- Advanced usage examples and troubleshooting

Update README.md with embeddings example using correct CLI command (benchmark run-embeddings)
Add embeddings guide to docs/guides/index.md navigation

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Remove unused uvloop code from run_embeddings command
- Fix missing imports in generative/entrypoints.py (BackendArgs, ValidationError)
- Remove unused 'cast' import from embeddings_mapper.py
- Format embeddings entrypoints to comply with line length limits
- Apply mdformat changes to documentation

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
The mteb package was causing uv.lock validation failures due to a dependency on torch==2.9.1, which is not in the lock file. Since mteb is not currently used in the embeddings implementation, removing it resolves the issue.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
The embeddings implementation was not conforming to the data pipeline's turn-based architecture, causing type checking failures and test errors. This commit aligns the embeddings preprocessor and finalizer with the existing protocol definitions.

Changes:
- EmbeddingsColumnMapper now returns list[dict] (list of turns) instead of dict
- EmbeddingsRequestFinalizer accepts list[dict] parameter matching protocol
- Removed non-existent list_set_env() call from CLI
- Updated validation tests to allow output_tokens=0 for embeddings
- Fixed all embeddings-related unit tests to use correct signatures

All type checks and tests now pass.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Add Embeddings Benchmark Support (MVP)
Summary
This PR adds comprehensive support for benchmarking the /v1/embeddings endpoint with a streamlined MVP scope focused on performance testing. The implementation also refactors common code patterns between embeddings and generative benchmarks, eliminating ~360 lines of duplication.

Key Features
1. Embeddings Performance Benchmarking
- Benchmarks the /v1/embeddings endpoint

2. Code Refactoring
- Shared utilities extracted to entrypoints_utils.py:
  - resolve_output_formats_generic() - Generic output format resolver
  - resolve_transient_phases() - Warmup/cooldown configuration
- Unified orchestration via run_benchmark_workflow()

3. Mock Server Support
- Mock embeddings endpoint support (/v1/embeddings)

MVP Scope Decisions
To deliver core functionality quickly, this PR excludes the following features (can be added in future PRs):
This allows us to:
Implementation Details
Core Embeddings Support
- New schemas: EmbeddingsBenchmark, EmbeddingsBenchmarkAccumulator, EmbeddingsMetrics, EmbeddingsBenchmarksReport
- EmbeddingsRequestFinalizer for preparing embedding requests
- EmbeddingsRequestCollator for batching
- EmbeddingsColumnMapper preprocessor
- Embeddings endpoint support (/v1/embeddings)
- benchmark_embeddings() entrypoint

Output Formats
- EmbeddingsBenchmarkerConsole for terminal display
- EmbeddingsBenchmarkerOutput for serialized outputs

Code Refactoring
- BaseBenchmarkArgs with 30+ common fields
- run_benchmark_workflow() for unified orchestration
- setup_backend_kwargs() - Sets request_format to /v1/embeddings and encoding_format
- setup_profile_kwargs() - Configures embeddings constraints (no rampup, uses max_duration)

CLI Integration
- New guidellm benchmark embeddings command

Testing
Code Quality
- Quality checks pass (tox -e quality)
- Type checks pass (tox -e types)

Example Usage
Basic Embeddings Benchmark
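The command for this example did not survive in the description; based on the run-embeddings command this PR adds, an invocation might look like the following. All flags shown are illustrative, not confirmed CLI options:

```bash
# Hypothetical invocation: a constant-rate embeddings benchmark
# against a local OpenAI-compatible server (flag names are assumptions)
guidellm benchmark run-embeddings \
  --target "http://localhost:8000" \
  --rate-type constant \
  --rate 10 \
  --max-requests 20
```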
Sweep Profile (Multiple Rates)
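A sweep run exercises multiple rates in one invocation; a sketch in the same hypothetical flag style as above (the actual options may differ):

```bash
# Hypothetical invocation: sweep across rates to find the server's
# saturation point (flag names are assumptions)
guidellm benchmark run-embeddings \
  --target "http://localhost:8000" \
  --rate-type sweep \
  --max-seconds 30
```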
Example Output
Console Output
JSON Output
```json
{
  "benchmarks": [
    {
      "mode": "embeddings",
      "config": {
        "strategy": { "type_": "constant", "rate": 10.0 }
      },
      "metrics": {
        "request_latency": { "mean": 0.110, "median": 0.110, "p95": 0.111, "p99": 0.111 },
        "server_ttft": { "mean": 0.0, "median": 0.0 },
        "server_throughput_request": { "mean": 9.1, "median": 9.09 },
        "server_throughput_token_input": { "mean": 524.8, "median": 475.2 }
      }
    }
  ]
}
```

Test Plan
Automated Tests
Manual Testing (Completed)
1. Mock Server Embeddings ✅
Result: 20/20 requests successful
2. AWS EC2 vLLM (granite-embedding-english-r2) ✅
Tested all profile types:
Total: 70/70 requests successful
3. AWS EC2 Qwen3-0.6B (generative) ✅
Verified refactored code works for generative benchmarks:
Total: 70/70 requests successful
Grand Total: 140/140 requests successful across all profiles and both model types! 🎉
Breaking Changes
None. This PR is fully backward compatible. All existing generative text benchmarking functionality remains unchanged and passes all tests.
Future Enhancements
The following features were scoped out of this MVP and can be added in future PRs:
Dependencies
No new required dependencies. All embeddings functionality works with existing dependencies.
Use of AI