
AIPerf v0.3.0 Release


Released by @saturley-hall on 20 Nov 22:11 · commit 3626694


Summary

AIPerf 0.3.0 focuses on advanced metrics and analytics, endpoint ecosystem expansion, and developer experience improvements. In this release, timeslice metrics enable fine-grained temporal analysis of LLM performance, multi-turn conversation support reflects real-world chat patterns, and GPU telemetry provides comprehensive observability. The endpoint ecosystem expands with Hugging Face, Cohere, and Solido integrations, while infrastructure improvements enhance reproducibility, cross-platform support, and extensibility. AIPerf seamlessly supports benchmarking across all major LLM serving platforms including OpenAI-compatible endpoints, custom HTTP APIs via Jinja templates, and specialized endpoints for embeddings, rankings, and RAG systems.

Advanced Metrics & Analytics

AIPerf 0.3.0 introduces timeslice metrics for temporal performance analysis, allowing users to slice benchmark results by time duration for identifying performance degradation and anomalies. The new time-to-first-output (non-reasoning) metric provides accurate measurement of user-perceived latency by excluding reasoning tokens. Enhanced server token count parsing enables direct comparison with client-side measurements, while raw request/response export facilitates debugging and analysis of LLM interactions.

Endpoint Ecosystem Expansion

This release expands AIPerf's compatibility with major LLM serving platforms through native support for Hugging Face TEI (Text Embeddings Inference), Hugging Face TGI (Text Generation Inference), Cohere Rankings API, and Solido RAG endpoints. The new custom payload template system with Jinja support enables benchmarking of arbitrary HTTP APIs, while the decoupled endpoint/transport architecture accelerates plugin development for new platforms.

Reproducibility & Developer Experience

AIPerf 0.3.0 strengthens reproducibility with an order-independent RandomGenerator system that ensures consistent results across runs regardless of execution order. Infrastructure modernization includes moving to a src/ directory layout, comprehensive e2e integration tests with a built-in mock server, and cross-platform support for Python 3.10-3.13 on Ubuntu, macOS, and Windows. Dataset flexibility improves with new sampler implementations and separation of dataset entries from conversation count configuration.

Major Features & Improvements

Timeslice Metrics

  • Timeslice Duration Option: Added --slice-duration option for time-sliced metric analysis (#300), enabling performance monitoring over configurable time windows for detecting degradation patterns and anomalies.
  • Timeslice Export Formats: Implemented JSON and CSV output formats for timeslice metrics (#411), providing flexible data export for visualization and analysis tools.
  • Timeslice Calculation Pipeline: Added timeslice metric result calculation and handover to ExportManager (#378), integrating temporal analysis into the core metrics pipeline.
  • Timeslice Documentation: Comprehensive tutorial documentation for timeslice metrics feature (#420), including usage examples and interpretation guidance.
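
The core idea behind time-sliced metrics can be sketched in a few lines of stdlib Python. This is a simplified illustration of the technique, not AIPerf's implementation; the record shape and the 5-second slice duration are invented for the example:

```python
from collections import defaultdict
from statistics import mean

def slice_latencies(records, slice_duration):
    """Group (timestamp, latency) records into fixed-duration time
    slices and report the mean latency per slice."""
    slices = defaultdict(list)
    for timestamp, latency in records:
        slices[int(timestamp // slice_duration)].append(latency)
    return {idx: mean(vals) for idx, vals in sorted(slices.items())}

# Requests arriving over 12 seconds, sliced into 5-second windows:
# slice 0 covers [0, 5), slice 1 covers [5, 10), slice 2 covers [10, 15).
records = [(0.5, 0.10), (2.0, 0.12), (6.0, 0.30), (11.0, 0.11)]
print(slice_latencies(records, slice_duration=5.0))
```

A latency spike that a whole-run average would smooth over (like the 0.30 s request in slice 1 above) stands out immediately in the per-slice view, which is what makes this useful for spotting degradation over the course of a benchmark.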

Multi-Turn Conversations

  • Multi-Turn Support: Full implementation of multi-turn conversation benchmarking (#360), enabling realistic evaluation of chatbot and assistant workloads with conversation context and state management.
  • Inter-Turn Delays: Configurable delays between conversation turns (#452, #455), simulating realistic user think time and typing patterns for accurate throughput modeling.

Custom Endpoint Integration

  • Jinja Template Payloads: Fully custom Jinja template support for endpoint payloads (#406) with autoescape security (#461), enabling benchmarking of arbitrary HTTP APIs and custom LLM serving frameworks.
  • Endpoint/Transport Decoupling: Refactored architecture to decouple endpoints and transports (#389), accelerating development of new endpoint plugins and improving code maintainability.
  • URL Flexibility: Support for /v1 suffix in URLs (#349), simplifying endpoint configuration for OpenAI-compatible servers.
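
Custom payload templates use standard Jinja syntax. As a purely illustrative sketch of the shape such a template might take (the variable names `model`, `text`, and `max_tokens` are invented for this example and are not AIPerf's actual template context; see the project documentation for the real variables):

```jinja
{# Variable names here are illustrative, not AIPerf's actual context. #}
{
  "model": "{{ model }}",
  "prompt": {{ text | tojson }},
  "max_tokens": {{ max_tokens }}
}
```

Rendering request bodies through a template like this is what lets AIPerf target arbitrary HTTP APIs whose payload schema differs from the OpenAI format, and the autoescape fix (#461) ensures user-supplied values cannot inject markup into the rendered output.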

Endpoint Integrations

  • Hugging Face TEI: Added support for Hugging Face Text Embeddings Inference endpoints with rankings API (#398, #419), enabling benchmarking of embedding and ranking workloads.
  • Hugging Face TGI: Native support for Hugging Face Text Generation Inference generate endpoints (#412, #419), expanding compatibility with popular open-source serving frameworks.
  • Cohere Rankings API: Integration with Cohere Rankings API (#398, #419) for benchmarking reranking and retrieval-augmented generation pipelines.
  • Solido RAG Endpoints: Support for Solido RAG endpoints (#396), enabling evaluation of retrieval-augmented generation systems.

GPU Telemetry & Observability

  • Real-Time Dashboard: GPU telemetry real-time dashboard display (#370) with live metrics visualization for monitoring GPU utilization, memory, power, and temperature during benchmarks.
  • DCGM Simulator: Realistic DCGM metrics simulator (#361) for testing telemetry pipelines without physical GPUs, improving development workflows.
  • Endpoint Reachability: Improved GPU telemetry endpoint reachability logging (#397) with better error messages when DCGM endpoints are unavailable.
  • Default Endpoints: Added http://localhost:9400/metrics to default telemetry endpoints (#369) for easier local development.

Video Generation

  • WebM/VP9 Support: Added WebM container and VP9 codec support to video generator (#460), enabling efficient video compression for multimodal benchmarking.
  • Video Tutorial: Comprehensive video generation tutorial documentation (#409), covering configuration and usage patterns.

Metrics & Accuracy

  • Time to First Output (Non-Reasoning): New metric excluding reasoning tokens (#359) with migration guide (#365), providing accurate measurement of user-perceived latency for reasoning models.
  • Server Token Counts: Parse and report server-provided usage data (#405), enabling validation of client-side token counting and detecting discrepancies.
  • Error Record Conversion: Convert invalid parsed responses to error records (#416), ensuring proper tracking of malformed responses in metrics calculations.
  • SSE Error Parsing: Enhanced SSE parsing to detect and handle error events from Dynamo and other servers (#385), improving error attribution.
  • Nested Input Parsing: Fixed parsing of nested lists/tuples for extra inputs (#318), enabling complex structured inputs in benchmarks.
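
The value of server-reported token counts is that they can be checked against what the client measured. A minimal sketch of that validation (the field names follow the OpenAI-style `usage` object; the helper itself is invented for illustration and is not AIPerf's API):

```python
def token_count_discrepancy(client_count, server_usage, field="completion_tokens"):
    """Compare a client-side token count against server-reported usage
    data. Returns the signed difference, or None if the server did not
    report a count for this field."""
    server_count = server_usage.get(field)
    if server_count is None:
        return None
    return server_count - client_count

# A client-side tokenizer counted 128 completion tokens; the server
# reported 130 in its usage object:
usage = {"prompt_tokens": 42, "completion_tokens": 130}
diff = token_count_discrepancy(128, usage)
print(f"server - client = {diff}")  # prints "server - client = 2"
```

A persistent nonzero discrepancy usually points at a tokenizer mismatch between the client and the serving stack, which is exactly the kind of issue this comparison is meant to surface.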

Reproducibility

  • Order-Independent RNG: Hardened reproducibility with order-independent RandomGenerator system (#415), ensuring consistent results across runs regardless of async execution order and message arrival timing.
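
A common way to achieve order-independent randomness (a sketch of the general technique, not AIPerf's actual RandomGenerator) is to derive each item's RNG from the base seed plus a stable key, so an item's random stream never depends on when it is processed:

```python
import hashlib
import random

def rng_for(base_seed, key):
    """Derive an independent RNG from a base seed and a stable key.
    The same (seed, key) pair always yields the same random stream,
    regardless of the order in which items are processed."""
    digest = hashlib.sha256(f"{base_seed}:{key}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

# The value drawn for "conversation-7" is identical whether that
# conversation is generated first, last, or concurrently with others:
a = rng_for(42, "conversation-7").random()
b = rng_for(42, "conversation-7").random()
print(a == b)  # prints "True"
```

Contrast this with a single shared RNG, where the value each consumer draws depends on how many draws happened before it, so async scheduling and message arrival order change the results from run to run.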

Dataset & Configuration

  • Dataset Samplers: New dataset sampler implementations (#395) for flexible sampling strategies including random, sequential, and weighted selection.
  • Dataset Entries Option: Separated --dataset-entries CLI option from --num-conversations (#421) with updated documentation (#430), clarifying configuration semantics and enabling independent control.
  • Environment Settings: Moved constants to Pydantic environment settings (#390), improving configurability and enabling environment-based overrides.

Developer Experience

  • Project Structure: Moved aiperf into src/ directory (#387) following Python community conventions, improving packaging and import semantics.
  • Mock Server Auto-Install: make install auto-installs mock server (#382), streamlining local development setup.
  • E2E Integration Tests: Comprehensive e2e integration tests with mock server covering all endpoints (#377), improving test coverage and catching integration regressions.
  • Cross-Platform Support:
    • Auto-detect and disable uvloop on Windows (#413) for seamless Windows development
    • macOS semaphore cleanup fixes (#379) preventing resource leaks
    • Fixed spurious test errors on macOS due to incorrect patching (#422)
  • Python Version Support: Unit tests on Python 3.10, 3.11, 3.12 across Ubuntu and macOS (#356), ensuring broad compatibility.
  • Docker Compliance: Dockerfile OSRB compliance (#337) and Python 3.13 support (#454, #478).
  • Verbose Logging: -v and -vv flags auto-enable simple UI mode (#401), and the UI mode can still be overridden explicitly.

User Experience

  • Fullscreen Logs: Show logs fullscreen until first progress messages (#402), improving visibility of startup diagnostics and errors.
  • Dashboard Screenshot: Added dashboard screenshot to README (#371), helping users understand telemetry capabilities.
  • Request-Rate Documentation: Comprehensive documentation on request-rate with max concurrency (#380), clarifying load generation behavior.

Performance & Stability

  • Goodput Calculation: Fixed goodput release calculation issues (#373), ensuring accurate reporting of successful request throughput.
  • SSE Chunk Parsing: Fixed SSE parsing when multiple messages arrive in a single buffered chunk (#368), preventing message loss and corruption.
  • Task Cancellation: Wait for flush tasks to finish before cancelling (#404), preventing data loss during shutdown.
  • Log Queue Cleanup: Added timeout for log queue cleanup (#393), preventing deadlocks during service shutdown.
  • ZMQ Context Termination: Fixed ZMQ context termination and TimeoutError issues (#474), improving clean shutdown behavior.
  • GPU Telemetry Timing: Fixed Telemetry Manager shutdown race condition (#367), preventing profile start failures.
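
The buffered-chunk SSE bug class is worth illustrating: a single network read can contain several complete events plus the start of another, so the parser must split on the event delimiter (a blank line) and keep any trailing partial event buffered for the next read. A minimal stdlib sketch of that pattern (not AIPerf's parser):

```python
def split_sse_events(buffer, chunk):
    """Append a newly received chunk to the buffer and split out every
    complete SSE event (events are delimited by a blank line).
    Returns (complete_events, remaining_buffer)."""
    buffer += chunk
    *events, remainder = buffer.split("\n\n")
    return [e for e in events if e], remainder

# Two complete events arriving in one buffered chunk, plus the start
# of a third that stays buffered until more data arrives:
events, buf = split_sse_events("", "data: one\n\ndata: two\n\ndata: th")
print(events)  # prints "['data: one', 'data: two']"
print(buf)     # prints "data: th"
```

A parser that assumes one event per read drops or corrupts the second event in a chunk like this, which is the failure mode #368 fixed.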

Documentation

  • Timeslice Tutorial: Tutorial documentation for timeslice metrics (#420) with interpretation guidance.
  • Video Generation Tutorial: Comprehensive video generation tutorial (#409) covering configuration options.
  • API Documentation: Documentation for HF TEI, HF TGI, Cohere API support (#419) with example configurations.
  • Migration Guide: Notes about reasoning tokens in migration guide (#365), helping users upgrade from 0.2.0.
  • GAP Comparison: Updated AIPerf vs GenAI-Perf comparison document (#408), clarifying differences and use cases.
  • Dataset Entries: Updated docs to use dataset entries instead of num conversations (#430), improving clarity.

Bug Fixes

  • ZMQ Context Termination: Fixed ZMQ context termination and TimeoutError issues (#474) for clean shutdown.
  • Jinja Autoescape: Fixed autoescape in Jinja templating (#461) to prevent XSS vulnerabilities in custom templates.
  • SSE Multiple Messages: Fixed SSE parsing when multiple messages arrive in a single buffered chunk (#368).
  • Task Cancellation: Wait for flush tasks to finish before cancelling (#404) to prevent data loss.
  • Log Queue Timeout: Added timeout for log queue cleanup (#393) to prevent shutdown deadlocks.
  • GPU Telemetry Shutdown: Fixed Telemetry Manager shutdown timing issue (#367) preventing profile start failures.
  • GPU Telemetry Deprecated Field: Fixed System Controller using deprecated endpoints_tested field (#376).
  • Nested Input Parsing: Fixed parsing of nested lists/tuples for extra inputs (#318).
  • macOS Test Errors: Fixed spurious test errors on macOS due to incorrect patching (#422).
  • Timeslice Tutorial Typo: Fixed typo in timeslice tutorial documentation (#436).

Known Issues

Breaking Change: Rankings Endpoint Type

  • The generic --endpoint-type rankings has been removed in v0.3.0.
  • Migration required: Use provider-specific types instead:
    • --endpoint-type nim_rankings (NVIDIA NIM)
    • --endpoint-type hf_tei_rankings (Hugging Face TEI)
    • --endpoint-type cohere_rankings (Cohere)

What's Next

Full Changelog

What's Changed

🚀 Features & Improvements

  • feat: add --slice-duration option for time slicing mode in #300
  • feat: Add multi turn support in #360
  • feat: add time to first output (non reasoning) metric in #359
  • feat: add realistic DCGM metrics simulator in #361
  • feat: GPU Telemetry Realtime Dashboard Display in #370
  • feat: e2e integration tests with new mock server all endpoints in #377
  • feat: Add support for timeslice metric result calculation and handover to ExportManager in #378
  • feat: make install will auto-install mock server in #382
  • feat: chore: move root aiperf directory into new root src directory for community convention in #387
  • feat: decouple endpoints and transports for faster development and better plugin experience in #389
  • feat: move constants to environment pydantic settings in #390
  • feat: support raw request and response payload export files in #392
  • feat: add dataset samplers implementations in #395
  • feat: support for Solido RAG endpoints in #396
  • feat: Add Huggingface TEI Rankings API and Cohere Rankings API support in #398
  • feat: -v and -vv auto enable simple ui, can be overridden in #401
  • feat: show logs fullscreen until first progress messages in #402
  • feat: parse server reported usage data (server token counts) in #405
  • feat: fully custom template support for endpoint payloads in #406
  • feat: Add support for timeslice metrics JSON and CSV outputs in #411
  • feat: Add huggingface tgi generate endpoint support in #412
  • feat: automatically detect and disable uvloop on windows in #413
  • feat: Harden reproducibility with order-independent RandomGenerator system in #415
  • feat: Separate the dataset entry cli option from the num in #421
  • feat: bring dockerfile into OSRB compliance in #337
  • feat: support /v1 suffix in the url for simplicity in #349
  • feat: Add delay to multi turn conversations in #452
  • feat: Add delay to multi turn conversations in #455
  • feat: Add WebM and VP9 support to video generator + libvpx9 in #460

🐛 Bug Fixes

  • fix: issue with parsing nested lists/tuples for extra inputs in #318
  • fix: bring dockerfile into OSRB compliance in #337
  • fix: SSE doesn't correctly parse multiple messages in a single buffered chunk in #368
  • fix: Fix goodput release issue in #373
  • fix: GPU Telemetry System Controller Using Deprecated 'endpoints_tested' in #376
  • fix: fix for semaphore not cleaned up errors on macOS in #379
  • fix: parse and detect sse error event data from dynamo in #385
  • fix: add timeout for log queue cleanup in #393
  • fix: GPU telemetry endpoint reachability logging + fix tests that weren't asserting properly in #397
  • fix: wait for flush tasks to finish before cancelling them in #404
  • fix: convert invalid records to error records for proper tracking in #416
  • fix: spurious test errors on macos due to incorrect patching in #422
  • fix: Typo in timeslice tutorial in #436
  • fix: autoescape jinja templating in #461
  • fix: zmq context termination + timeouterror in #474

📚 Documentation

  • docs: add notes about reasoning tokens in migration guide in #365
  • docs: Add screenshot of dashboard to README.md in #371
  • docs: add comprehensive docs on request-rate with max concurrency in #380
  • docs: video tutorial documentation in #409
  • docs: Update GAP AIPerf comparison doc in #408
  • docs: Add docs for HF TEI, HF TGI, Cohere API support in #419
  • docs: Add tutorial documentation for timeslice metrics feature in #420
  • docs: Update docs to use dataset entries instead of num conversations in #430

🛠️ Build, CI and Test

  • chore: run unit tests on python 3.10, 3.11, 3.12 ubuntu and macOS in #356
  • chore: remove unused openai dependency in #358
  • chore: mark additional sse test as @pytest.mark.performance in #362
  • chore: ignore changes to launch.json to enable local development ease in #366
  • chore: Add coderabbit yaml mirroring dynamo in #374
  • chore: delete unused test fixtures for sse tests in #400
  • chore: update mkinit paths in #388
  • chore: Update versions in #433
  • chore: dockerfile uses py3.13 in #454
  • chore: dockerfile uses py3.13 in #478

Full Changelog: release/0.2.0...release/0.3.0