
AIPerf v0.3.0 Release


Released by @saturley-hall on 20 Nov 22:11 · commit 3626694


Summary

AIPerf 0.3.0 focuses on advanced metrics and analytics, endpoint ecosystem expansion, and developer experience improvements. In this release, timeslice metrics enable fine-grained temporal analysis of LLM performance, multi-turn conversation support reflects real-world chat patterns, and GPU telemetry provides comprehensive observability. The endpoint ecosystem expands with Hugging Face, Cohere, and Solido integrations, while infrastructure improvements enhance reproducibility, cross-platform support, and extensibility. AIPerf seamlessly supports benchmarking across all major LLM serving platforms including OpenAI-compatible endpoints, custom HTTP APIs via Jinja templates, and specialized endpoints for embeddings, rankings, and RAG systems.

Advanced Metrics & Analytics

AIPerf 0.3.0 introduces timeslice metrics for temporal performance analysis, allowing users to slice benchmark results by time duration for identifying performance degradation and anomalies. The new time-to-first-output (non-reasoning) metric provides accurate measurement of user-perceived latency by excluding reasoning tokens. Enhanced server token count parsing enables direct comparison with client-side measurements, while raw request/response export facilitates debugging and analysis of LLM interactions.

Endpoint Ecosystem Expansion

This release expands AIPerf's compatibility with major LLM serving platforms through native support for Hugging Face TEI (Text Embeddings Inference), Hugging Face TGI (Text Generation Inference), Cohere Rankings API, and Solido RAG endpoints. The new custom payload template system with Jinja support enables benchmarking of arbitrary HTTP APIs, while the decoupled endpoint/transport architecture accelerates plugin development for new platforms.

Reproducibility & Developer Experience

AIPerf 0.3.0 strengthens reproducibility with an order-independent RandomGenerator system that ensures consistent results across runs regardless of execution order. Infrastructure modernization includes moving to a src/ directory layout, comprehensive e2e integration tests with a built-in mock server, and cross-platform support for Python 3.10-3.13 on Ubuntu, macOS, and Windows. Dataset flexibility improves with new sampler implementations and separation of dataset entries from conversation count configuration.

Major Features & Improvements

Timeslice Metrics

  • Timeslice Duration Option: Added --slice-duration option for time-sliced metric analysis (#300), enabling performance monitoring over configurable time windows for detecting degradation patterns and anomalies.
  • Timeslice Export Formats: Implemented JSON and CSV output formats for timeslice metrics (#411), providing flexible data export for visualization and analysis tools.
  • Timeslice Calculation Pipeline: Added timeslice metric result calculation and handover to ExportManager (#378), integrating temporal analysis into the core metrics pipeline.
  • Timeslice Documentation: Comprehensive tutorial documentation for timeslice metrics feature (#420), including usage examples and interpretation guidance.
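
The core idea behind time-sliced metrics can be sketched in a few lines of stdlib Python. This is a simplified illustration of the technique, not AIPerf's implementation; the record shape and the 5-second slice duration are invented for the example:

```python
from collections import defaultdict
from statistics import mean

def slice_latencies(records, slice_duration):
    """Group (timestamp, latency) records into fixed-duration time
    slices and report the mean latency per slice."""
    slices = defaultdict(list)
    for timestamp, latency in records:
        slices[int(timestamp // slice_duration)].append(latency)
    return {idx: mean(vals) for idx, vals in sorted(slices.items())}

# Requests arriving over 12 seconds, sliced into 5-second windows:
# slice 0 covers [0, 5), slice 1 covers [5, 10), slice 2 covers [10, 15).
records = [(0.5, 0.10), (2.0, 0.12), (6.0, 0.30), (11.0, 0.11)]
print(slice_latencies(records, slice_duration=5.0))
```

A latency spike that a whole-run average would smooth over (like the 0.30 s request in slice 1 above) stands out immediately in the per-slice view, which is what makes this useful for spotting degradation over the course of a benchmark.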

Multi-Turn Conversations

  • Multi-Turn Support: Full implementation of multi-turn conversation benchmarking (#360), enabling realistic evaluation of chatbot and assistant workloads with conversation context and state management.
  • Inter-Turn Delays: Configurable delays between conversation turns (#452, #455), simulating realistic user think time and typing patterns for accurate throughput modeling.

Custom Endpoint Integration

  • Jinja Template Payloads: Fully custom Jinja template support for endpoint payloads (#406) with autoescape security (#461), enabling benchmarking of arbitrary HTTP APIs and custom LLM serving frameworks.
  • Endpoint/Transport Decoupling: Refactored architecture to decouple endpoints and transports (#389), accelerating development of new endpoint plugins and improving code maintainability.
  • URL Flexibility: Support for /v1 suffix in URLs (#349), simplifying endpoint configuration for OpenAI-compatible servers.
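
Custom payload templates use standard Jinja syntax. As a purely illustrative sketch of the shape such a template might take (the variable names `model`, `text`, and `max_tokens` are invented for this example and are not AIPerf's actual template context; see the project documentation for the real variables):

```jinja
{# Variable names here are illustrative, not AIPerf's actual context. #}
{
  "model": "{{ model }}",
  "prompt": {{ text | tojson }},
  "max_tokens": {{ max_tokens }}
}
```

Rendering request bodies through a template like this is what lets AIPerf target arbitrary HTTP APIs whose payload schema differs from the OpenAI format, and the autoescape fix (#461) ensures user-supplied values cannot inject markup into the rendered output.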

Endpoint Integrations

  • Hugging Face TEI: Added support for Hugging Face Text Embeddings Inference endpoints with rankings API (#398, #419), enabling benchmarking of embedding and ranking workloads.
  • Hugging Face TGI: Native support for Hugging Face Text Generation Inference generate endpoints (#412, #419), expanding compatibility with popular open-source serving frameworks.
  • Cohere Rankings API: Integration with Cohere Rankings API (#398, #419) for benchmarking reranking and retrieval-augmented generation pipelines.
  • Solido RAG Endpoints: Support for Solido RAG endpoints (#396), enabling evaluation of retrieval-augmented generation systems.

GPU Telemetry & Observability

  • Real-Time Dashboard: GPU telemetry real-time dashboard display (#370) with live metrics visualization for monitoring GPU utilization, memory, power, and temperature during benchmarks.
  • DCGM Simulator: Realistic DCGM metrics simulator (#361) for testing telemetry pipelines without physical GPUs, improving development workflows.
  • Endpoint Reachability: Improved GPU telemetry endpoint reachability logging (#397) with better error messages when DCGM endpoints are unavailable.
  • Default Endpoints: Added http://localhost:9400/metrics to default telemetry endpoints (#369) for easier local development.

Video Generation

  • WebM/VP9 Support: Added WebM container and VP9 codec support to video generator (#460), enabling efficient video compression for multimodal benchmarking.
  • Video Tutorial: Comprehensive video generation tutorial documentation (#409), covering configuration and usage patterns.

Metrics & Accuracy

  • Time to First Output (Non-Reasoning): New metric excluding reasoning tokens (#359) with migration guide (#365), providing accurate measurement of user-perceived latency for reasoning models.
  • Server Token Counts: Parse and report server-provided usage data (#405), enabling validation of client-side token counting and detecting discrepancies.
  • Error Record Conversion: Convert invalid parsed responses to error records (#416), ensuring proper tracking of malformed responses in metrics calculations.
  • SSE Error Parsing: Enhanced SSE parsing to detect and handle error events from Dynamo and other servers (#385), improving error attribution.
  • Nested Input Parsing: Fixed parsing of nested lists/tuples for extra inputs (#318), enabling complex structured inputs in benchmarks.
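
The value of server-reported token counts is that they can be checked against what the client measured. A minimal sketch of that validation (the field names follow the OpenAI-style `usage` object; the helper itself is invented for illustration and is not AIPerf's API):

```python
def token_count_discrepancy(client_count, server_usage, field="completion_tokens"):
    """Compare a client-side token count against server-reported usage
    data. Returns the signed difference, or None if the server did not
    report a count for this field."""
    server_count = server_usage.get(field)
    if server_count is None:
        return None
    return server_count - client_count

# A client-side tokenizer counted 128 completion tokens; the server
# reported 130 in its usage object:
usage = {"prompt_tokens": 42, "completion_tokens": 130}
diff = token_count_discrepancy(128, usage)
print(f"server - client = {diff}")  # prints "server - client = 2"
```

A persistent nonzero discrepancy usually points at a tokenizer mismatch between the client and the serving stack, which is exactly the kind of issue this comparison is meant to surface.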

Reproducibility

  • Order-Independent RNG: Hardened reproducibility with order-independent RandomGenerator system (#415), ensuring consistent results across runs regardless of async execution order and message arrival timing.
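
A common way to achieve order-independent randomness (a sketch of the general technique, not AIPerf's actual RandomGenerator) is to derive each item's RNG from the base seed plus a stable key, so an item's random stream never depends on when it is processed:

```python
import hashlib
import random

def rng_for(base_seed, key):
    """Derive an independent RNG from a base seed and a stable key.
    The same (seed, key) pair always yields the same random stream,
    regardless of the order in which items are processed."""
    digest = hashlib.sha256(f"{base_seed}:{key}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

# The value drawn for "conversation-7" is identical whether that
# conversation is generated first, last, or concurrently with others:
a = rng_for(42, "conversation-7").random()
b = rng_for(42, "conversation-7").random()
print(a == b)  # prints "True"
```

Contrast this with a single shared RNG, where the value each consumer draws depends on how many draws happened before it, so async scheduling and message arrival order change the results from run to run.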

Dataset & Configuration

  • Dataset Samplers: New dataset sampler implementations (#395) for flexible sampling strategies including random, sequential, and weighted selection.
  • Dataset Entries Option: Separated --dataset-entries CLI option from --num-conversations (#421) with updated documentation (#430), clarifying configuration semantics and enabling independent control.
  • Environment Settings: Moved constants to Pydantic environment settings (#390), improving configurability and enabling environment-based overrides.

Developer Experience

  • Project Structure: Moved aiperf into src/ directory (#387) following Python community conventions, improving packaging and import semantics.
  • Mock Server Auto-Install: make install auto-installs mock server (#382), streamlining local development setup.
  • E2E Integration Tests: Comprehensive e2e integration tests with mock server covering all endpoints (#377), improving test coverage and catching integration regressions.
  • Cross-Platform Support:
    • Auto-detect and disable uvloop on Windows (#413) for seamless Windows development
    • macOS semaphore cleanup fixes (#379) preventing resource leaks
    • Fixed spurious test errors on macOS due to incorrect patching (#422)
  • Python Version Support: Unit tests on Python 3.10, 3.11, 3.12 across Ubuntu and macOS (#356), ensuring broad compatibility.
  • Docker Compliance: Dockerfile OSRB compliance (#337) and Python 3.13 support (#454, #478).
  • Verbose Logging: -v and -vv flags auto-enable simple UI mode (#401), and the UI mode can still be overridden explicitly.

User Experience

  • Fullscreen Logs: Show logs fullscreen until first progress messages (#402), improving visibility of startup diagnostics and errors.
  • Dashboard Screenshot: Added dashboard screenshot to README (#371), helping users understand telemetry capabilities.
  • Request-Rate Documentation: Comprehensive documentation on request-rate with max concurrency (#380), clarifying load generation behavior.

Performance & Stability

  • Goodput Calculation: Fixed goodput release calculation issues (#373), ensuring accurate reporting of successful request throughput.
  • SSE Chunk Parsing: Fixed SSE parsing when multiple messages arrive in a single buffered chunk (#368), preventing message loss and corruption.
  • Task Cancellation: Wait for flush tasks to finish before cancelling (#404), preventing data loss during shutdown.
  • Log Queue Cleanup: Added timeout for log queue cleanup (#393), preventing deadlocks during service shutdown.
  • ZMQ Context Termination: Fixed ZMQ context termination and TimeoutError issues (#474), improving clean shutdown behavior.
  • GPU Telemetry Timing: Fixed Telemetry Manager shutdown race condition (#367), preventing profile start failures.
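
The buffered-chunk SSE bug class is worth illustrating: a single network read can contain several complete events plus the start of another, so the parser must split on the event delimiter (a blank line) and keep any trailing partial event buffered for the next read. A minimal stdlib sketch of that pattern (not AIPerf's parser):

```python
def split_sse_events(buffer, chunk):
    """Append a newly received chunk to the buffer and split out every
    complete SSE event (events are delimited by a blank line).
    Returns (complete_events, remaining_buffer)."""
    buffer += chunk
    *events, remainder = buffer.split("\n\n")
    return [e for e in events if e], remainder

# Two complete events arriving in one buffered chunk, plus the start
# of a third that stays buffered until more data arrives:
events, buf = split_sse_events("", "data: one\n\ndata: two\n\ndata: th")
print(events)  # prints "['data: one', 'data: two']"
print(buf)     # prints "data: th"
```

A parser that assumes one event per read drops or corrupts the second event in a chunk like this, which is the failure mode #368 fixed.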

Documentation

  • Timeslice Tutorial: Tutorial documentation for timeslice metrics (#420) with interpretation guidance.
  • Video Generation Tutorial: Comprehensive video generation tutorial (#409) covering configuration options.
  • API Documentation: Documentation for HF TEI, HF TGI, Cohere API support (#419) with example configurations.
  • Migration Guide: Notes about reasoning tokens in migration guide (#365), helping users upgrade from 0.2.0.
  • GAP Comparison: Updated AIPerf vs GenAI-Perf comparison document (#408), clarifying differences and use cases.
  • Dataset Entries: Updated docs to use dataset entries instead of num conversations (#430), improving clarity.

Bug Fixes

  • ZMQ Context Termination: Fixed ZMQ context termination and TimeoutError issues (#474) for clean shutdown.
  • Jinja Autoescape: Fixed autoescape in Jinja templating (#461) to prevent XSS vulnerabilities in custom templates.
  • SSE Multiple Messages: Fixed SSE parsing when multiple messages arrive in a single buffered chunk (#368).
  • Task Cancellation: Wait for flush tasks to finish before cancelling (#404) to prevent data loss.
  • Log Queue Timeout: Added timeout for log queue cleanup (#393) to prevent shutdown deadlocks.
  • GPU Telemetry Shutdown: Fixed Telemetry Manager shutdown timing issue (#367) preventing profile start failures.
  • GPU Telemetry Deprecated Field: Fixed System Controller using deprecated endpoints_tested field (#376).
  • Nested Input Parsing: Fixed parsing of nested lists/tuples for extra inputs (#318).
  • macOS Test Errors: Fixed spurious test errors on macOS due to incorrect patching (#422).
  • Timeslice Tutorial Typo: Fixed typo in timeslice tutorial documentation (#436).

Known Issues

Breaking Change: Rankings Endpoint Type

  • The generic --endpoint-type rankings has been removed in v0.3.0.
  • Migration required: Use provider-specific types instead:
    • --endpoint-type nim_rankings (NVIDIA NIM)
    • --endpoint-type hf_tei_rankings (Hugging Face TEI)
    • --endpoint-type cohere_rankings (Cohere)

What's Next

Full Changelog

What's Changed

🚀 Features & Improvements

  • feat: add --slice-duration option for time slicing mode in #300
  • feat: Add multi turn support in #360
  • feat: add time to first output (non reasoning) metric in #359
  • feat: add realistic DCGM metrics simulator in #361
  • feat: GPU Telemetry Realtime Dashboard Display in #370
  • feat: e2e integration tests with new mock server all endpoints in #377
  • feat: Add support for timeslice metric result calculation and handover to ExportManager in #378
  • feat: make install will auto-install mock server in #382
  • feat: chore: move root aiperf directory into new root src directory for community convention in #387
  • feat: decouple endpoints and transports for faster development and better plugin experience in #389
  • feat: move constants to environment pydantic settings in #390
  • feat: support raw request and response payload export files in #392
  • feat: add dataset samplers implementations in #395
  • feat: support for Solido RAG endpoints in #396
  • feat: Add Huggingface TEI Rankings API and Cohere Rankings API support in #398
  • feat: -v and -vv auto enable simple ui, can be overridden in #401
  • feat: show logs fullscreen until first progress messages in #402
  • feat: parse server reported usage data (server token counts) in #405
  • feat: fully custom template support for endpoint payloads in #406
  • feat: Add support for timeslice metrics JSON and CSV outputs in #411
  • feat: Add huggingface tgi generate endpoint support in #412
  • feat: automatically detect and disable uvloop on windows in #413
  • feat: Harden reproducibility with order-independent RandomGenerator system in #415
  • feat: Separate the dataset entry cli option from the num in #421
  • feat: bring dockerfile into OSRB compliance in #337
  • feat: support /v1 suffix in the url for simplicity in #349
  • feat: Add delay to multi turn conversations in #452
  • feat: Add delay to multi turn conversations in #455
  • feat: Add WebM and VP9 support to video generator + libvpx9 in #460

🐛 Bug Fixes

  • fix: issue with parsing nested lists/tuples for extra inputs in #318
  • fix: bring dockerfile into OSRB compliance in #337
  • fix: SSE doesn't correctly parse multiple messages in a single buffered chunk in #368
  • fix: Fix goodput release issue in #373
  • fix: GPU Telemetry System Controller Using Deprecated 'endpoints_tested' in #376
  • fix: fix for semaphore not cleaned up errors on macOS in #379
  • fix: parse and detect sse error event data from dynamo in #385
  • fix: add timeout for log queue cleanup in #393
  • fix: GPU telemetry endpoint reachability logging + fix tests that weren't asserting properly in #397
  • fix: wait for flush tasks to finish before cancelling them in #404
  • fix: convert invalid records to error records for proper tracking in #416
  • fix: spurious test errors on macos due to incorrect patching in #422
  • fix: Typo in timeslice tutorial in #436
  • fix: autoescape jinja templating in #461
  • fix: zmq context termination + timeouterror in #474

📚 Documentation

  • docs: add notes about reasoning tokens in migration guide in #365
  • docs: Add screenshot of dashboard to README.md in #371
  • docs: add comprehensive docs on request-rate with max concurrency in #380
  • docs: video tutorial documentation in #409
  • docs: Update GAP AIPerf comparison doc in #408
  • docs: Add docs for HF TEI, HF TGI, Cohere API support in #419
  • docs: Add tutorial documentation for timeslice metrics feature in #420
  • docs: Update docs to use dataset entries instead of num conversations in #430

🛠️ Build, CI and Test

  • chore: run unit tests on python 3.10, 3.11, 3.12 ubuntu and macOS in #356
  • chore: remove unused openai dependency in #358
  • chore: mark additional sse test as @pytest.mark.performance in #362
  • chore: ignore changes to launch.json to enable local development ease in #366
  • chore: Add coderabbit yaml mirroring dynamo in #374
  • chore: delete unused test fixtures for sse tests in #400
  • chore: update mkinit paths in #388
  • chore: Update versions in #433
  • chore: dockerfile uses py3.13 in #454
  • chore: dockerfile uses py3.13 in #478

Full Changelog: release/0.2.0...release/0.3.0