Releases: ai-dynamo/aiperf
AIPerf v0.4.0 Release
AIPerf 0.4.0 Release Notes
Major Features
AIPerf Plot Feature (#511)
New `aiperf plot` command for visualizing benchmark results; an example invocation follows the feature list below. See the Visualization & Plotting Tutorial.
- Interactive dashboard via `--dashboard` flag with dynamic metric switching, run filtering, and plot customization
- Static PNG export with NVIDIA brand styling (default behavior)
- Automatic mode detection distinguishes single-run time-series from multi-run comparisons
- GPU telemetry integration displaying power, utilization, memory, and throughput correlation
- Timeslice analysis for performance evolution across time windows
- Experiment classification for baseline/treatment color assignment in A/B testing
- Theme support with `--theme dark` option
- YAML configuration via `~/.aiperf/plot_config.yaml`
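A minimal sketch of both modes, assuming results from prior profile runs live in the artifact directories shown (the paths are placeholders, and positional-argument handling may differ from this sketch):

```bash
# Static PNG export with NVIDIA brand styling (default behavior)
aiperf plot ./artifacts/run-a ./artifacts/run-b

# Interactive dashboard with dynamic metric switching, using the dark theme
aiperf plot ./artifacts/run-a --dashboard --theme dark
```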
Server-Side Prometheus Metrics Collection (#488)
Collect server-side metrics from LLM inference server Prometheus endpoints (vLLM, SGLang, TRT-LLM, Dynamo). See Server Metrics Documentation.
- Automatic endpoint discovery from inference server base URL + `/metrics`
- Collection at configurable intervals (default 333 ms) with reachability testing before profiling
- Multiple export formats: JSON (aggregated stats), CSV (tabular), JSONL (time-series), Parquet (with delta calculations)
- Supports Prometheus counter, gauge, and histogram types, with percentile estimation from histogram buckets
- Timesliced statistics when used with `--slice-duration` for windowed analysis
- CLI usage: `--server-metrics URL [URL...]` for additional endpoints (example below)
- Disable with `--no-server-metrics`
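A hedged example of wiring this up during a profile run; `aiperf profile`, `--model`, and `--url` are assumed here as typical invocation options, and the Prometheus URL is a placeholder:

```bash
# Scrape an extra Prometheus endpoint alongside the auto-discovered <base URL>/metrics
aiperf profile --model my-model --url http://localhost:8000 \
  --server-metrics http://localhost:8001/metrics

# Or disable server-side collection entirely (e.g., if the endpoint is unreachable)
aiperf profile --model my-model --url http://localhost:8000 --no-server-metrics
```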
Shared System Prompt and User Context Prompt (#506)
New CLI options for prompt composition (example below):
- `--shared-system-prompt-length`: Single system prompt shared across all conversations
- `--user-context-prompt-length`: Per-conversation user context prompts for KV-Cache testing
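A sketch combining the two options; the token counts are illustrative and the surrounding flags are assumed:

```bash
# One 512-token system prompt shared by every conversation, plus a distinct
# 1024-token user context per conversation for KV-Cache testing
aiperf profile --model my-model --url http://localhost:8000 \
  --shared-system-prompt-length 512 \
  --user-context-prompt-length 1024
```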
OpenAI Image Generation Endpoint Support (#468)
Native support for benchmarking OpenAI-compatible image generation endpoints (e.g., SGLang). New `image_generation` endpoint type with response parsing for both base64 and URL-based image outputs. See SGLang Image Generation Tutorial.
Vision Endpoint and Image Metrics (#450)
New metrics for vision and multimodal workloads:
- `num_images`: Total images processed across conversation turns
- `image_throughput`: Images processed per unit time
- `image_latency`: Latency per individual image
Rankings Enhancements
See Rankings Tutorial.
- Synthetic Data for Rankings (#440): Generate synthetic ranking workloads without external datasets
- Rankings Prompt and Query Token Options (#498): Configure prompt and query token lengths for ranking benchmarks
GPU Telemetry Improvements
- JSONL Export (#441): Export GPU telemetry data to JSONL format for external analysis
- Custom Metrics via CSV (#424): Define custom GPU metric configurations using CSV files
Video Generation Enhancements
- WebM and VP9 Support (#434): Generate synthetic video in WebM format with VP9 codec for video benchmarking workloads
Auto-Detect Custom Dataset Type (#399)
Automatic inference of `--custom-dataset-type` from `--input-file` content. See Benchmark Datasets.
- Examines file structure to classify datasets without explicit user specification
- Auto-selects appropriate `--dataset-sampling-strategy` based on loader capabilities
- Supports single-turn, multi-turn, Mooncake trace, and random pool formats
- Manual override still available when needed (see the sketch below)
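A sketch of both paths, with an assumed dataset path and an assumed `multi-turn` type value (the exact accepted values are documented in Benchmark Datasets):

```bash
# Dataset type inferred automatically from the file's structure
aiperf profile --model my-model --url http://localhost:8000 \
  --input-file ./my_dataset.jsonl

# Explicit override when auto-detection should be bypassed
aiperf profile --model my-model --url http://localhost:8000 \
  --input-file ./my_dataset.jsonl --custom-dataset-type multi-turn
```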
New Metrics
- Total Token Throughput (#500): New aggregate throughput metric across all tokens
- Time to First Output (TTFO) added to dashboard (#446)
- Renamed `prefill_throughput` to `prefill_throughput_per_user` for clarity
Usability Improvements
- Legacy Max Tokens Option (#481): CLI flag `--use-legacy-max-tokens` for compatibility with older API versions (example below)
- API Error Parsing (#482): Parse and display helpful error messages when `max_completion_tokens` is not supported by the server
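A hedged sketch, assuming the flag is a plain boolean switch:

```bash
# Send the legacy max_tokens field for servers that reject max_completion_tokens
aiperf profile --model my-model --url http://localhost:8000 \
  --use-legacy-max-tokens
```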
Bug Fixes
- Set default dataset sampling strategy (#519, #521)
- Fixed stack trace when error message is not JSON string (#505)
- Removed reasoning content from multi-turn conversations (#499)
- Fixed concurrency validation when `request_count` is not set (#480)
- Fixed invalid parsed response records conversion to error records (#477)
- Fixed error when concurrency exceeds request count (#475)
- Fixed timeout when dataset configuration takes too long (#471)
- Fixed duplicate requests in fixed schedule for multi-turn conversations (#444)
- Fixed ZMQ context termination deadlock issue (#469)
Documentation
- Auto-generated CLI Options Reference from cyclopts app (#476)
- Auto-generated Environment Variables documentation (#487)
Infrastructure & Maintenance
- Upgraded Python base container to 3.13.11
- Added Python 3.13 support to GitHub Actions
- Updated dependencies: matplotlib 3.10.0+, aiohttp 3.13.3+, pydantic 2.10+, cyclopts v4
- Updated container ffmpeg to 8.0.1
- Added Contributor License Agreement
- Optimized base pydantic models with
exclude_none(#426)
New Contributors
- Lei Gao (@leigao97) - First external community contributor! (#444)
- Anant Sharma (@nv-anants)
Known Issues
- Server metrics timeout on unreachable endpoints: When a server metrics endpoint is not reachable, the benchmark may time out instead of gracefully continuing. Workaround: use `--no-server-metrics` to disable server metrics collection if the Prometheus endpoint is unavailable.
- Shared/user context prompts not included in ISL: Tokens from `--shared-system-prompt-length` and `--user-context-prompt-length` are not included in input sequence length (ISL) metric calculations.
- MP4 video generation incompatible with NVIDIA NIM: Generated MP4 videos use the fragmented `empty_moov` format, which is incompatible with NVIDIA NIM video endpoints. Workaround: use `--video-format webm` instead.
AIPerf v0.3.0 Release
AIPerf - Release 0.3.0
Summary
AIPerf 0.3.0 focuses on advanced metrics and analytics, endpoint ecosystem expansion, and developer experience improvements. In this release, timeslice metrics enable fine-grained temporal analysis of LLM performance, multi-turn conversation support reflects real-world chat patterns, and GPU telemetry provides comprehensive observability. The endpoint ecosystem expands with Hugging Face, Cohere, and Solido integrations, while infrastructure improvements enhance reproducibility, cross-platform support, and extensibility. AIPerf seamlessly supports benchmarking across all major LLM serving platforms including OpenAI-compatible endpoints, custom HTTP APIs via Jinja templates, and specialized endpoints for embeddings, rankings, and RAG systems.
Advanced Metrics & Analytics
AIPerf 0.3.0 introduces timeslice metrics for temporal performance analysis, allowing users to slice benchmark results by time duration for identifying performance degradation and anomalies. The new time-to-first-output (non-reasoning) metric provides accurate measurement of user-perceived latency by excluding reasoning tokens. Enhanced server token count parsing enables direct comparison with client-side measurements, while raw request/response export facilitates debugging and analysis of LLM interactions.
Endpoint Ecosystem Expansion
This release expands AIPerf's compatibility with major LLM serving platforms through native support for Hugging Face TEI (Text Embeddings Inference), Hugging Face TGI (Text Generation Inference), Cohere Rankings API, and Solido RAG endpoints. The new custom payload template system with Jinja support enables benchmarking of arbitrary HTTP APIs, while the decoupled endpoint/transport architecture accelerates plugin development for new platforms.
Reproducibility & Developer Experience
AIPerf 0.3.0 strengthens reproducibility with an order-independent RandomGenerator system that ensures consistent results across runs regardless of execution order. Infrastructure modernization includes moving to a src/ directory layout, comprehensive e2e integration tests with a built-in mock server, and cross-platform support for Python 3.10-3.13 on Ubuntu, macOS, and Windows. Dataset flexibility improves with new sampler implementations and separation of dataset entries from conversation count configuration.
Major Features & Improvements
Timeslice Metrics
- Timeslice Duration Option: Added `--slice-duration` option for time-sliced metric analysis (#300), enabling performance monitoring over configurable time windows for detecting degradation patterns and anomalies (see the sketch after this list).
- Timeslice Export Formats: Implemented JSON and CSV output formats for timeslice metrics (#411), providing flexible data export for visualization and analysis tools.
- Timeslice Calculation Pipeline: Added timeslice metric result calculation and handover to ExportManager (#378), integrating temporal analysis into the core metrics pipeline.
- Timeslice Documentation: Comprehensive tutorial documentation for timeslice metrics feature (#420), including usage examples and interpretation guidance.
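A minimal sketch of enabling timeslices, assuming the duration is given in seconds (the 30-second window is illustrative and the surrounding flags are assumed):

```bash
# Emit per-window metrics every 30 seconds alongside the full-run summary
aiperf profile --model my-model --url http://localhost:8000 \
  --slice-duration 30
```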
Multi-Turn Conversations
- Multi-Turn Support: Full implementation of multi-turn conversation benchmarking (#360), enabling realistic evaluation of chatbot and assistant workloads with conversation context and state management.
- Inter-Turn Delays: Configurable delays between conversation turns (#452, #455), simulating realistic user think time and typing patterns for accurate throughput modeling.
Custom Endpoint Integration
- Jinja Template Payloads: Fully custom Jinja template support for endpoint payloads (#406) with autoescape security (#461), enabling benchmarking of arbitrary HTTP APIs and custom LLM serving frameworks (illustrative sketch after this list).
- Endpoint/Transport Decoupling: Refactored architecture to decouple endpoints and transports (#389), accelerating development of new endpoint plugins and improving code maintainability.
- URL Flexibility: Support for `/v1` suffix in URLs (#349), simplifying endpoint configuration for OpenAI-compatible servers.
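As a purely illustrative sketch of the idea, a Jinja template might render a custom JSON payload like the one below; the variable names and the mechanism for pointing aiperf at the template are hypothetical here, not the documented schema:

```bash
# Hypothetical template file; {{ model }}, {{ text }}, and {{ max_tokens }} are
# illustrative variable names, not aiperf's documented template context
cat > payload_template.j2 <<'EOF'
{
  "model": "{{ model }}",
  "prompt": {{ text | tojson }},
  "max_tokens": {{ max_tokens }}
}
EOF
```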
Endpoint Integrations
- Hugging Face TEI: Added support for Hugging Face Text Embeddings Inference endpoints with rankings API (#398, #419), enabling benchmarking of embedding and ranking workloads.
- Hugging Face TGI: Native support for Hugging Face Text Generation Inference generate endpoints (#412, #419), expanding compatibility with popular open-source serving stacks.
- Cohere Rankings API: Integration with Cohere Rankings API (#398, #419) for benchmarking reranking and retrieval-augmented generation pipelines.
- Solido RAG Endpoints: Support for Solido RAG endpoints (#396), enabling evaluation of retrieval-augmented generation systems.
GPU Telemetry & Observability
- Real-Time Dashboard: GPU telemetry real-time dashboard display (#370) with live metrics visualization for monitoring GPU utilization, memory, power, and temperature during benchmarks.
- DCGM Simulator: Realistic DCGM metrics simulator (#361) for testing telemetry pipelines without physical GPUs, improving development workflows.
- Endpoint Reachability: Improved GPU telemetry endpoint reachability logging (#397) with better error messages when DCGM endpoints are unavailable.
- Default Endpoints: Added `http://localhost:9400/metrics` to default telemetry endpoints (#369) for easier local development.
Video Generation
- WebM/VP9 Support: Added WebM container and VP9 codec support to video generator (#460), enabling efficient video compression for multimodal benchmarking.
- Video Tutorial: Comprehensive video generation tutorial documentation (#409), covering configuration and usage patterns.
Metrics & Accuracy
- Time to First Output (Non-Reasoning): New metric excluding reasoning tokens (#359) with migration guide (#365), providing accurate measurement of user-perceived latency for reasoning models.
- Server Token Counts: Parse and report server-provided usage data (#405), enabling validation of client-side token counting and detecting discrepancies.
- Error Record Conversion: Convert invalid parsed responses to error records (#416), ensuring proper tracking of malformed responses in metrics calculations.
- SSE Error Parsing: Enhanced SSE parsing to detect and handle error events from Dynamo and other servers (#385), improving error attribution.
- Nested Input Parsing: Fixed parsing of nested lists/tuples for extra inputs (#318), enabling complex structured inputs in benchmarks.
Reproducibility
- Order-Independent RNG: Hardened reproducibility with order-independent RandomGenerator system (#415), ensuring consistent results across runs regardless of async execution order and message arrival timing.
Dataset & Configuration
- Dataset Samplers: New dataset sampler implementations (#395) for flexible sampling strategies including random, sequential, and weighted selection.
- Dataset Entries Option: Separated `--dataset-entries` CLI option from `--num-conversations` (#421) with updated documentation (#430), clarifying configuration semantics and enabling independent control (example below).
- Environment Settings: Moved constants to Pydantic environment settings (#390), improving configurability and enabling environment-based overrides.
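A sketch of the decoupled options; the counts and surrounding flags are illustrative:

```bash
# Draw 1,000 entries from the dataset but run only 200 conversations
aiperf profile --model my-model --url http://localhost:8000 \
  --input-file ./my_dataset.jsonl \
  --dataset-entries 1000 --num-conversations 200
```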
Developer Experience
- Project Structure: Moved aiperf into `src/` directory (#387) following Python community conventions, improving packaging and import semantics.
- Mock Server Auto-Install: `make install` auto-installs mock server (#382), streamlining local development setup.
- E2E Integration Tests: Comprehensive e2e integration tests with mock server covering all endpoints (#377), improving test coverage and catching integration regressions.
- Cross-Platform Support:
  - Python Version Support: Unit tests on Python 3.10, 3.11, 3.12 across Ubuntu and macOS (#356), ensuring broad compatibility.
  - Docker Compliance: Dockerfile OSRB compliance (#337) and Python 3.13 support (#454, #478).
- Verbose Logging: `-v` and `-vv` flags auto-enable simple UI mode (#401), with override capability for customization.
User Experience
- Fullscreen Logs: Show logs fullscreen until first progress messages (#402), improving visibility of startup diagnostics and errors.
- Dashboard Screenshot: Added dashboard screenshot to README (#371), helping users understand telemetry capabilities.
- Request-Rate Documentation: Comprehensive documentation on request-rate with max concurrency (#380), clarifying load generation behavior.
Performance & Stability
- Goodput Calculation: Fixed goodput release calculation issues (#373), ensuring accurate reporting of successful request throughput.
- SSE Chunk Parsing: Fixed SSE parsing when multiple messages arrive in a single buffered chunk (#368), preventing message loss and corruption.
- Task Cancellation: Wait for flush tasks to finish before cancelling (#404), preventing data loss during shutdown.
- Log Queue Cleanup: Added timeout for log queue cleanup (#393), preventing deadlocks during service shutdown.
- ZMQ Context Termination: Fixed ZMQ context termination and TimeoutError issues (#474), improving clean shutdown behavior.
- GPU Telemetry Timing: Fixed Telemetry Manager shutdown race condition (#367), preventing profile start failures.
Documentation
- Timeslice Tutorial: Tutoria...
AIPerf v0.2.0
AIPerf Release Notes
Summary
AIPerf v0.2.0 introduces time-based benchmarking with configurable grace periods and request cancellation capabilities. The release adds advanced metrics including goodput measurement for SLO compliance, GPU telemetry, and inter-chunk latency tracking.
New Features
Time-Based Benchmarking
- Time-based benchmarking support - Run benchmarks for a specified duration with configurable grace periods for more realistic testing scenarios
- Benchmark grace period - Added grace period functionality to allow for proper warmup and cooldown phases during benchmarking
Request Management & Control
- Request cancellation - Added ability to cancel requests during benchmarking to test timeout behavior and service resilience
- Fixed-schedule for Mooncake traces - Enhanced trace replay with fixed-schedule detection and support for non-fixed-schedule trace formats
- Request rate with concurrency limits - Added ability to limit HTTP connections and control request concurrency for more realistic load testing
Advanced Metrics & Monitoring
- Goodput metric - Added goodput metric to measure throughput of requests meeting user-defined SLOs, with comprehensive tutorial support (see the sketch after this list)
- GPU Telemetry - Integrated GPU monitoring and telemetry collection for comprehensive performance analysis
- Inter-chunk-latency metric - Added inter-chunk latency tracking using raw value lists for detailed streaming performance analysis
- Total ISL/OSL metrics - Added total input/output sequence length metrics with improved CSV/JSON export support
- Per-record metrics export - Enhanced profile export with per-record metrics in `profile_export.jsonl`
- Mixed ISL/OSL distributions - Support for mixed input/output sequence length distributions in benchmarking
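A hedged sketch of the goodput flag, assuming it accepts space-separated metric:threshold SLO pairs as in GenAI-Perf (the metric names and millisecond units here are illustrative; see the goodput tutorial for the authoritative syntax):

```bash
# Only requests meeting both SLOs count toward goodput
aiperf profile --model my-model --url http://localhost:8000 \
  --goodput "time_to_first_token:250 inter_token_latency:20"
```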
Video & Multimedia Support
- Synthetic video support - Added support for video benchmarking and synthetic video generation
Enhanced Data Management
- Inputs.json file for dataset traceability - Added dataset traceability through inputs.json file generation
- Request traceability headers - Added X-Request-Id and X-Correlation-Id headers for improved request tracking
Bug Fixes
Core Functionality
- ZMQ graceful termination - Fixed graceful termination of ZMQ context to prevent hanging processes
- Worker count limits - Capped default maximum workers to 32 to prevent resource exhaustion
- Race conditions in credit issuing - Fixed race conditions in credit issuing strategy for more stable performance
- Startup error handling - Improved error handling during startup with clear error messages and proper process exit
Request Processing
- Empty choices array handling - Fixed IndexError when OpenAI choices array is empty
- Request metadata validation - Fixed bug with request metadata validation for failed requests
Export & Data Handling
- CSV export logic - Fixed CSV export parsing to ensure correct data formatting
- JSONL file writing - Resolved issues with writing to JSONL files
- GenAI-Perf JSON format compatibility - Fixed JSON summary export to match GenAI-Perf format for better compatibility
Platform-Specific Fixes
- macOS Textual UI Dashboard - Fixed compatibility issues with Textual UI Dashboard on macOS systems
- Image test random seed - Set proper random seeds for image tests to fix sporadic test failures
Telemetry & Performance
- Telemetry Manager shutdown timing - Fixed issue where Telemetry Manager shuts down before profile configuration finishes
- Goodput release issues - Cherry-picked fix for goodput-related release problems
- CPU usage warnings - Added warnings when worker CPU usage exceeds 85% to help identify performance bottlenecks
Documentation & Tutorials
New Documentation
- Goodput tutorial - Complete tutorial on using the goodput metric for SLO validation
- Advanced features tutorials - Tutorials covering advanced benchmarking features
- Trace replay tutorial with real data - Updated trace replay tutorial with real Mooncake data examples
- Feature comparison with GenAI-Perf - Added detailed feature comparison matrix between AIPerf and GenAI-Perf
Infrastructure & Development
Build & Dependencies
- Flexible dependencies - Made package dependencies more flexible for better compatibility
- PyProject.toml cleanup - Cleaned up and organized pyproject.toml configuration
- License field compliance - Updated pyproject.toml license field for wheeltamer compliance
- Dependency updates - Removed pandas dependency (now using numpy only) and updated numpy to 1.26.4
Refactoring & Performance
Core Components
- Credit processor refactoring - Refactored credit processing system for better performance and maintainability
- Console output enhancements - Added median values to console output for better statistical insight
Performance Optimizations
- Performance test marking - Properly marked SSE tests as performance tests for better test organization
Known Issues
- InvalidStateError - Logs show an InvalidStateError during benchmarking. This is handled gracefully and will not impact benchmark results.
Initial release of AIPerf v0.1.1
Release Highlights:
The initial release of AIPerf, the successor to GenAI-Perf, delivers extensive benchmarking capabilities.
AIPerf is written entirely in Python, offering easy installation and a modular design for user extensibility.
Major Features of AIPerf
Comprehensive Benchmarking
- Detailed Performance Metrics: Measures throughput, latency, and comprehensive token-level metrics for generative AI models
- Flexible Data Sources: Supports both synthetic and dataset-driven input modes
Scalable Load Generation
- Parallel Processing: Multiprocess support for local scaling
- Configurable Load Patterns: High concurrency and request-rate modes with configurable patterns
Trace Replay
- Production Workload Simulation: Reproduce real-world or synthetic workload traces for validation and stress testing
- Industry Standard Formats: Supports Mooncake trace format and custom JSONL datasets when using the `--fixed-schedule` option (example after this list)
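A hedged replay sketch, with an assumed trace path and assumed typical flags:

```bash
# Replay a Mooncake-format trace on its recorded timestamps
aiperf profile --model my-model --url http://localhost:8000 \
  --input-file ./mooncake_trace.jsonl --fixed-schedule
```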
Flexible Model and Endpoint Support
- Universal Compatibility: Works with OpenAI-compatible APIs, including vLLM, Dynamo, and other compatible services
- OpenAI APIs: Chat completions, completions, and embeddings supported
Advanced Input and Output Configuration
- Granular Token Control: Fine-grained control over input/output token counts and streaming
- Extended Request Support: Pass extra inputs and custom payloads to endpoints
Rich Reporting and Export
- Multiple Export Options: Console, CSV, and JSON output formats for results
- Artifact Management: Artifact directory support for saving logs and metrics
Automation and Integration
- CLI-First Design: CLI-first workflow for scripting and automation
- Deployment Flexibility: Compatible with containerized and cloud environments
Security and Customization
- Security and Authentication: Support for custom headers, authentication, and advanced API options
- Deterministic Testing: Random seed and reproducibility controls
Console UI Options
- Real-Time Monitoring: Real-time metrics dashboard with live progress tracking and worker status monitoring
- Multiple UI Modes: Simple UI mode for streamlined monitoring and headless mode for automated environments
Key Improvements Over GenAI-Perf
AIPerf introduces several enhancements over GenAI-Perf:
Performance & Scaling
- Distributed Architecture: Scalable service-oriented design built for horizontal scalability
- Python Multiprocessing: Native multiprocessing implementation with automatic worker provisioning and lifecycle management, enabling true parallel load generation from a single node
- Request-Rate with Max Concurrency: Combine request-rate control with concurrency limits to throttle requests or provide controlled ramp-up to prevent burst traffic
User Experience
- Live Dashboard: Interactive terminal-based UI with real-time metrics visualization, progress tracking, and worker status monitoring
- Multiple UI Modes: Dashboard mode for interactive use, simple mode for streamlined monitoring, and headless mode for automation
Observability & Control
- API Error Analytics: Comprehensive tracking and categorization of request failures with detailed error summaries grouped by failure reason
- Early Termination Support: Cancel benchmarks mid-run while preserving all completed results and metrics
Extensibility & Integration
- Pure Python Architecture: Eliminates complex mixed-language dependencies for simpler installation, deployment, and customization
- ShareGPT Integration: Automatic download, caching, and conversation processing of public datasets
Installation
`pip install aiperf`
Migration from GenAI-Perf
AIPerf is designed to be a drop-in replacement for GenAI-Perf for currently supported features. To migrate your existing GenAI-Perf commands, please refer to the Migrating from GenAI-Perf documentation.
Getting Started
Please refer to the Tutorials documentation for information on how to use AIPerf.
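As a starting point, a minimal profiling run might look like the sketch below; every flag shown is an assumed typical option rather than a verbatim excerpt from the docs:

```bash
aiperf profile \
  --model my-model \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --concurrency 8 \
  --request-count 100
```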
Additional Information
Known Issues
- Output sequence length constraints (`--output-tokens-mean`) cannot be guaranteed unless you pass `ignore_eos` and/or `min_tokens` via `--extra-inputs` to an inference server that supports them (see the sketch after this list).
- A few options in the CLI help text inconsistently use underscores instead of hyphens.
- Very high concurrency settings (typically >15,000 concurrency) may lead to port exhaustion on some systems, causing connection failures during benchmarking. If encountered, consider adjusting system limits or reducing concurrency.
- Startup errors caused by invalid configuration settings can cause AIPerf to hang indefinitely. If AIPerf appears to freeze during initialization, terminate the process and check configuration settings.
- Mooncake trace format currently requires the `--fixed-schedule` option to be set.
- Dashboard UI may emit corrupted ANSI sequences on macOS or certain terminal environments, making the terminal unusable. Run the `reset` command to restore normal terminal functionality, or switch to `--ui simple` for a lightweight progress bar interface.
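For the first known issue above, a hedged example of the `ignore_eos`/`min_tokens` workaround, assuming the repeated `key:value` form of `--extra-inputs` carried over from GenAI-Perf:

```bash
# Ask the server to keep generating to the target length; the server must
# actually support ignore_eos and min_tokens for this to take effect
aiperf profile --model my-model --url http://localhost:8000 \
  --output-tokens-mean 256 \
  --extra-inputs ignore_eos:true \
  --extra-inputs min_tokens:256
```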