AIPerf

PyPI version License Codecov Discord Ask DeepWiki

Architecture | Design Proposals | Migrating from Genai-Perf | CLI Options | Metrics Reference

AIPerf is a comprehensive benchmarking tool that measures the performance of generative AI models served by your preferred inference solution. It provides detailed metrics through a command-line display as well as extensive benchmark performance reports.

AIPerf provides multiprocess support out of the box, giving you a single, scalable benchmarking solution.

AIPerf UI Dashboard

Features


Tutorials & Advanced Features

Getting Started

  • Basic Tutorial - Learn the fundamentals with Dynamo and vLLM examples

Load Control & Timing

| Feature | Description | Use Cases |
|---|---|---|
| Request Rate with Max Concurrency | Dual control of request timing and concurrent connection ceiling (Poisson or constant modes) | Testing API rate/concurrency limits, avoiding thundering herd, realistic client simulation |
| Arrival Patterns | Configure traffic patterns (constant, Poisson, gamma) with tunable burstiness | Realistic traffic simulation, stress testing, vLLM-compatible benchmarks |
| Prefill Concurrency | Limit concurrent prefill operations to prevent memory exhaustion with long-context workloads | Long-context benchmarking, OOM prevention, memory-safe stress testing |
| Gradual Ramping | Smooth ramp-up of concurrency and request rate over time | Capacity discovery, avoiding cold-start spikes, server warm-up |
| Warmup Phase | Configure pre-benchmark warmup to eliminate cold-start effects | Accurate measurements, JIT warm-up, cache priming |
| User-Centric Timing | Per-user rate limiting with precise timing for KV cache benchmarking | KV cache effectiveness, multi-user simulation, cache TTL testing |
| Request Cancellation | Test timeout behavior and service resilience | SLA validation, cancellation modeling |

Workloads & Data

| Feature | Description | Use Cases |
|---|---|---|
| Trace Benchmarking | Deterministic workload replay with custom datasets | Regression testing, A/B testing |
| Custom Prompt Benchmarking | Send each prompt from your file exactly as-is, without sampling or generation | Regression testing, A/B testing, debugging specific prompts |
| Fixed Schedule | Precise timestamp-based request execution | Traffic replay, temporal analysis, burst testing |
| Time-based Benchmarking | Duration-based testing with grace period control | Stability testing, sustained performance |
| Sequence Distributions | Mixed ISL/OSL pairings | Benchmarking mixed use cases |
| Random Number Generation & Reproducibility | Deterministic dataset generation with --random-seed | Debugging, regression testing, controlled experiments |
| Template Endpoint | Benchmark custom APIs with flexible Jinja2 request templates | Custom API formats, rapid prototyping, non-standard endpoints |
| SGLang Image Generation | Benchmark image generation APIs using SGLang with the FLUX.1-dev model | Image generation testing, text-to-image benchmarking, extracting generated images |

Analysis & Monitoring

| Feature | Description | Use Cases |
|---|---|---|
| Timeslice Metrics | Split the benchmark into timeslices and compute metrics for each slice | Load pattern impact, detecting warm-up effects, performance degradation analysis |
| Goodput | Throughput of requests meeting user-defined SLOs | SLO validation, capacity planning, runtime/model comparisons |
| HTTP Trace Metrics | Detailed HTTP request lifecycle timing (DNS, TCP/TLS, TTFB) following k6 and HAR conventions | Connection debugging, latency breakdown, transport-layer analysis |
| Profile Exports | Parse and analyze profile_export.jsonl with Pydantic models, custom metrics, and async processing (see the sketch after this table) | Custom analysis, data pipelines, post-processing |
| Visualization & Plotting | Generate PNG visualizations with automatic mode detection (single-run analysis or multi-run comparison) | Parameter sweep analysis, performance debugging, model comparison |
| GPU Telemetry | Real-time GPU metrics collection via DCGM (power, utilization, memory, temperature, etc.) | Performance optimization, resource monitoring, multi-node telemetry |
| Server Metrics | Collect Prometheus-compatible server metrics during benchmarking | Performance optimization, resource monitoring, multi-node telemetry |
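
As a quick example of post-processing the export file, the Python sketch below reads profile_export.jsonl line by line with the standard json module. This is a minimal sketch, not AIPerf's own export API: the file name is the default mentioned above, and the record field name request_latency is an assumption you should verify against your own export.

import json

latencies = []
with open("profile_export.jsonl") as f:       # default export file name; adjust the path as needed
    for line in f:
        record = json.loads(line)             # one JSON object per benchmarked request
        # "request_latency" is an assumed field name -- inspect a record to confirm it
        if "request_latency" in record:
            latencies.append(record["request_latency"])

if latencies:
    print(f"requests: {len(latencies)}, mean latency: {sum(latencies) / len(latencies):.2f}")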

Quick Navigation

# Basic profiling
aiperf profile --model Qwen/Qwen3-0.6B --url localhost:8000 --endpoint-type chat

# Request timeout testing
aiperf profile --request-timeout-seconds 30.0 [other options...]

# Trace-based benchmarking
aiperf profile --input-file trace.jsonl --custom-dataset-type single_turn [other options...]

# Fixed schedule execution
aiperf profile --input-file schedule.jsonl --fixed-schedule --fixed-schedule-auto-offset [other options...]

# Time-based benchmarking
aiperf profile --benchmark-duration 300.0 --benchmark-grace-period 30.0 [other options...]

Supported APIs

  • OpenAI chat completions
  • OpenAI completions
  • OpenAI embeddings
  • OpenAI audio: request throughput and latency
  • OpenAI images: request throughput and latency
  • NIM embeddings
  • NIM rankings

Installation

pip install aiperf

Quick Start

Basic Usage

Run a simple benchmark against a model:

aiperf profile \
  --model your_model_name \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --streaming

Example with Custom Configuration

aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --concurrency 10 \
  --request-count 100 \
  --streaming

Example output:

NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃                               Metric ┃       avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p75 ┃   std ┃
┑━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│             Time to First Token (ms) │     18.26 │  11.22 │ 106.32 │  68.82 │  27.76 │  16.62 │ 12.07 │
│            Time to Second Token (ms) │     11.40 │   0.02 │  85.91 │  34.54 │  12.59 │  11.65 │  7.01 │
│                 Request Latency (ms) │    487.30 │ 267.07 │ 769.57 │ 715.99 │ 580.83 │ 536.17 │ 79.60 │
│             Inter Token Latency (ms) │     11.23 │   8.80 │  13.17 │  12.48 │  11.73 │  11.37 │  0.45 │
│     Output Token Throughput Per User │     89.23 │  75.93 │ 113.60 │ 102.28 │  90.91 │  90.29 │  3.70 │
│                    (tokens/sec/user) │           │        │        │        │        │        │       │
│      Output Sequence Length (tokens) │     42.83 │  24.00 │  65.00 │  64.00 │  52.00 │  47.00 │  7.21 │
│       Input Sequence Length (tokens) │     10.00 │  10.00 │  10.00 │  10.00 │  10.00 │  10.00 │  0.00 │
│ Output Token Throughput (tokens/sec) │ 10,944.03 │    N/A │    N/A │    N/A │    N/A │    N/A │   N/A │
│    Request Throughput (requests/sec) │    255.54 │    N/A │    N/A │    N/A │    N/A │    N/A │   N/A │
│             Request Count (requests) │    711.00 │    N/A │    N/A │    N/A │    N/A │    N/A │   N/A │
└──────────────────────────────────────┴───────────┴────────┴────────┴────────┴────────┴────────┴───────┘

Metrics Reference

AIPerf provides comprehensive metrics organized into multiple functional categories. For detailed descriptions, requirements, and nuances of each metric, see the Complete Metrics Reference.

Streaming Metrics

Metrics specific to streaming requests that measure real-time token generation characteristics. Requires --streaming flag.

| Metric | Tag | Formula | Unit |
|---|---|---|---|
| Time to First Token (TTFT) | `time_to_first_token` | `content_responses[0].perf_ns - request.start_perf_ns` | ms |
| Time to Second Token (TTST) | `time_to_second_token` | `content_responses[1].perf_ns - content_responses[0].perf_ns` | ms |
| Inter Token Latency (ITL) | `inter_token_latency` | `(request_latency - time_to_first_token) / (output_sequence_length - 1)` | ms |
| Inter Chunk Latency (ICL) | `inter_chunk_latency` | `[content_responses[i].perf_ns - content_responses[i-1].perf_ns for i in range(1, len(content_responses))]` | ms |
| Output Token Throughput Per User | `output_token_throughput_per_user` | `1.0 / inter_token_latency_seconds` | tokens/sec/user |
| Time to First Output Token (TTFO) | `time_to_first_output_token` | `first_non_reasoning_token_perf_ns - request.start_perf_ns` | ms |
| Prefill Throughput Per User | `prefill_throughput_per_user` | `input_sequence_length / time_to_first_token_seconds` | tokens/sec/user |
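
To make the streaming formulas concrete, here is a minimal Python sketch that recomputes TTFT, TTST, ITL, and per-user throughput for a single hypothetical request. The timestamps and data layout are invented for illustration and are not AIPerf internals.

NS_PER_MS = 1_000_000

# Hypothetical per-request data: request start time and per-chunk arrival times, in nanoseconds.
request_start_perf_ns = 1_000_000_000
content_response_perf_ns = [1_018_000_000, 1_029_000_000, 1_040_500_000, 1_052_000_000]
output_sequence_length = 4  # tokens produced by this request

request_latency_ms = (content_response_perf_ns[-1] - request_start_perf_ns) / NS_PER_MS
ttft_ms = (content_response_perf_ns[0] - request_start_perf_ns) / NS_PER_MS
ttst_ms = (content_response_perf_ns[1] - content_response_perf_ns[0]) / NS_PER_MS
itl_ms = (request_latency_ms - ttft_ms) / (output_sequence_length - 1)
throughput_per_user = 1000.0 / itl_ms  # tokens/sec/user, i.e. 1 / ITL expressed in seconds

print(f"TTFT={ttft_ms:.2f} ms  TTST={ttst_ms:.2f} ms  ITL={itl_ms:.2f} ms  {throughput_per_user:.1f} tokens/sec/user")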

Token Based Metrics

Metrics for token-producing endpoints that track token counts and throughput. Requires text-generating endpoints (chat, completion, etc.).

| Metric | Tag | Formula | Unit |
|---|---|---|---|
| Output Token Count | `output_token_count` | `len(tokenizer.encode(content, add_special_tokens=False))` | tokens |
| Output Sequence Length (OSL) | `output_sequence_length` | `(output_token_count or 0) + (reasoning_token_count or 0)` | tokens |
| Input Sequence Length (ISL) | `input_sequence_length` | `len(tokenizer.encode(prompt, add_special_tokens=False))` | tokens |
| Total Output Tokens | `total_output_tokens` | `sum(r.output_token_count for r in records if r.valid)` | tokens |
| Total Output Sequence Length | `total_osl` | `sum(r.output_sequence_length for r in records if r.valid)` | tokens |
| Total Input Sequence Length | `total_isl` | `sum(r.input_sequence_length for r in records if r.valid)` | tokens |
| Output Token Throughput | `output_token_throughput` | `total_osl / benchmark_duration_seconds` | tokens/sec |
| Total Token Throughput | `total_token_throughput` | `(total_isl + total_osl) / benchmark_duration_seconds` | tokens/sec |
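
The token counts above follow a simple convention: encode the text without special tokens and count the resulting IDs. The sketch below shows that convention with a Hugging Face tokenizer; the model name mirrors the Quick Start example, and the tokenizer choice here is just for illustration rather than a statement about how AIPerf loads tokenizers internally.

from transformers import AutoTokenizer

# Use a tokenizer that matches the benchmarked model; this one mirrors the Quick Start example.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

prompt = "Explain KV caching in one sentence."
completion = "KV caching stores attention keys and values so earlier tokens are not recomputed."

# Same counting convention as the formulas above: no special tokens added.
input_sequence_length = len(tokenizer.encode(prompt, add_special_tokens=False))
output_token_count = len(tokenizer.encode(completion, add_special_tokens=False))

print(f"ISL={input_sequence_length} tokens, OSL={output_token_count} tokens")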

Image Metrics

Metrics for image processing endpoints. Requires image-capable endpoints.

| Metric | Tag | Formula | Unit |
|---|---|---|---|
| Image Throughput | `image_throughput` | `num_images / request_latency_seconds` | images/sec |
| Image Latency | `image_latency` | `request_latency_ms / num_images` | ms/image |

Reasoning Metrics

Metrics specific to models that support reasoning/thinking tokens. Requires models with separate reasoning_content field.

| Metric | Tag | Formula | Unit |
|---|---|---|---|
| Reasoning Token Count | `reasoning_token_count` | `len(tokenizer.encode(reasoning_content, add_special_tokens=False))` | tokens |
| Total Reasoning Tokens | `total_reasoning_tokens` | `sum(r.reasoning_token_count for r in records if r.valid)` | tokens |

Usage Field Metrics

Metrics tracking API-reported token counts from the usage field in responses. Useful for comparing client-side vs server-side token counts.

| Metric | Tag | Formula | Unit |
|---|---|---|---|
| Usage Prompt Tokens | `usage_prompt_tokens` | `response.usage.prompt_tokens` | tokens |
| Usage Completion Tokens | `usage_completion_tokens` | `response.usage.completion_tokens` | tokens |
| Usage Total Tokens | `usage_total_tokens` | `response.usage.total_tokens` | tokens |
| Usage Reasoning Tokens | `usage_reasoning_tokens` | `response.usage.completion_tokens_details.reasoning_tokens` | tokens |
| Total Usage Prompt Tokens | `total_usage_prompt_tokens` | `sum(r.usage_prompt_tokens for r in records if r.valid)` | tokens |
| Total Usage Completion Tokens | `total_usage_completion_tokens` | `sum(r.usage_completion_tokens for r in records if r.valid)` | tokens |
| Total Usage Total Tokens | `total_usage_total_tokens` | `sum(r.usage_total_tokens for r in records if r.valid)` | tokens |

Usage Discrepancy Metrics

Metrics measuring differences between API-reported and client-computed token counts.

| Metric | Tag | Formula | Unit |
|---|---|---|---|
| Usage Prompt Tokens Diff % | `usage_prompt_tokens_diff_pct` | `abs((usage_prompt_tokens - input_sequence_length) / input_sequence_length) * 100` | % |
| Usage Completion Tokens Diff % | `usage_completion_tokens_diff_pct` | `abs((usage_completion_tokens - output_sequence_length) / output_sequence_length) * 100` | % |
| Usage Reasoning Tokens Diff % | `usage_reasoning_tokens_diff_pct` | `abs((usage_reasoning_tokens - reasoning_token_count) / reasoning_token_count) * 100` | % |
| Usage Discrepancy Count | `usage_discrepancy_count` | `sum(1 for r in records if r.any_diff > threshold)` | requests |
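
A worked example of the percentage formulas above, using made-up token counts (the threshold-based discrepancy count is omitted since it depends on a threshold not shown here):

# Made-up counts: what the server's usage field reported vs. what the client-side tokenizer measured.
usage_prompt_tokens = 1024
input_sequence_length = 1019

usage_prompt_tokens_diff_pct = abs(
    (usage_prompt_tokens - input_sequence_length) / input_sequence_length
) * 100
print(f"prompt token diff: {usage_prompt_tokens_diff_pct:.2f}%")  # ~0.49%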

Goodput Metrics

Metrics measuring throughput of requests meeting user-defined Service Level Objectives (SLOs).

| Metric | Tag | Formula | Unit |
|---|---|---|---|
| Good Request Count | `good_request_count` | `sum(1 for r in records if r.all_slos_met)` | requests |
| Goodput | `goodput` | `good_request_count / benchmark_duration_seconds` | requests/sec |
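
A short sketch of how goodput falls out of these two definitions, using invented per-request measurements and SLO thresholds (the record layout is hypothetical, not AIPerf's internal representation):

# Hypothetical per-request measurements (ms) and SLO thresholds (ms).
records = [
    {"time_to_first_token": 42.0, "inter_token_latency": 9.8, "valid": True},
    {"time_to_first_token": 180.0, "inter_token_latency": 11.2, "valid": True},
    {"time_to_first_token": 55.0, "inter_token_latency": 25.0, "valid": True},
]
slos = {"time_to_first_token": 100.0, "inter_token_latency": 15.0}
benchmark_duration_seconds = 2.0

good_request_count = sum(
    1 for r in records
    if r["valid"] and all(r[metric] <= limit for metric, limit in slos.items())
)
goodput = good_request_count / benchmark_duration_seconds
print(f"{good_request_count} good request(s) -> goodput {goodput:.2f} requests/sec")  # 1 -> 0.50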

Error Metrics

Metrics computed for failed/error requests.

| Metric | Tag | Formula | Unit |
|---|---|---|---|
| Error Input Sequence Length | `error_isl` | `input_sequence_length` (for error requests) | tokens |
| Total Error Input Sequence Length | `total_error_isl` | `sum(r.input_sequence_length for r in records if not r.valid)` | tokens |
| Error Request Count | `error_request_count` | `sum(1 for r in records if not r.valid)` | requests |

General Metrics

Metrics available for all benchmark runs with no special requirements.

| Metric | Tag | Formula | Unit |
|---|---|---|---|
| Request Latency | `request_latency` | `content_responses[-1].perf_ns - request.start_perf_ns` | ms |
| Request Throughput | `request_throughput` | `request_count / benchmark_duration_seconds` | requests/sec |
| Request Count | `request_count` | `sum(1 for r in records if r.valid)` | requests |
| Minimum Request Timestamp | `min_request_timestamp` | `min(r.timestamp_ns for r in records)` | datetime |
| Maximum Response Timestamp | `max_response_timestamp` | `max(r.timestamp_ns + r.request_latency for r in records)` | datetime |
| Benchmark Duration | `benchmark_duration` | `max_response_timestamp - min_request_timestamp` | sec |
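
The duration and throughput definitions chain together: the benchmark window runs from the first request sent to the last response received, and request throughput divides the valid request count by that window. A tiny sketch with invented timestamps:

# Hypothetical (timestamp_ns, request_latency_ns) pairs for three valid requests.
records = [(1_000_000_000, 400_000_000), (1_200_000_000, 500_000_000), (1_900_000_000, 350_000_000)]

min_request_timestamp = min(ts for ts, _ in records)
max_response_timestamp = max(ts + latency for ts, latency in records)
benchmark_duration_seconds = (max_response_timestamp - min_request_timestamp) / 1e9

request_throughput = len(records) / benchmark_duration_seconds
print(f"duration={benchmark_duration_seconds:.3f} s, throughput={request_throughput:.2f} requests/sec")  # 1.250 s, 2.40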

HTTP Trace Metrics

Low-level HTTP timing metrics following k6 and HAR conventions. Requires HTTP trace data collection enabled.

| Metric | Tag | Formula | Unit |
|---|---|---|---|
| HTTP Request Blocked | `http_req_blocked` | `connection_pool_wait_end_perf_ns - connection_pool_wait_start_perf_ns` | ms |
| HTTP Request DNS Lookup | `http_req_dns_lookup` | `dns_lookup_end_perf_ns - dns_lookup_start_perf_ns` | ms |
| HTTP Request Connecting | `http_req_connecting` | `tcp_connect_end_perf_ns - tcp_connect_start_perf_ns` | ms |
| HTTP Request Sending | `http_req_sending` | `request_send_end_perf_ns - request_send_start_perf_ns` | ms |
| HTTP Request Waiting | `http_req_waiting` | `response_chunks[0][0] - request_send_end_perf_ns` | ms |
| HTTP Request Receiving | `http_req_receiving` | `response_chunks[-1][0] - response_chunks[0][0]` | ms |
| HTTP Request Duration | `http_req_duration` | `response_receive_end_perf_ns - request_send_start_perf_ns` | ms |
| HTTP Request Connection Overhead | `http_req_connection_overhead` | `http_req_blocked + http_req_dns_lookup + http_req_connecting` | ms |
| HTTP Request Total | `http_req_total` | `http_req_blocked + http_req_dns_lookup + http_req_connecting + http_req_sending + http_req_waiting + http_req_receiving` | ms |
| HTTP Request Data Sent | `http_req_data_sent` | `sum(size for _, size in request_chunks)` | bytes |
| HTTP Request Data Received | `http_req_data_received` | `sum(size for _, size in response_chunks)` | bytes |
| HTTP Request Connection Reused | `http_req_connection_reused` | `1 if connection_reused_perf_ns is not None else 0` | boolean |
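
The aggregate trace metrics are simple sums of the phase timings above. A small sketch with invented per-phase durations for one request:

# Hypothetical per-phase durations (ms) for a single request, following the breakdown above.
phases = {
    "http_req_blocked": 0.1,      # waiting for a connection from the pool
    "http_req_dns_lookup": 1.2,
    "http_req_connecting": 3.4,   # TCP/TLS connect
    "http_req_sending": 0.2,
    "http_req_waiting": 18.3,     # time to first byte
    "http_req_receiving": 412.0,  # streaming the response body
}

http_req_connection_overhead = (
    phases["http_req_blocked"] + phases["http_req_dns_lookup"] + phases["http_req_connecting"]
)
http_req_total = sum(phases.values())
print(f"connection overhead={http_req_connection_overhead:.1f} ms, total={http_req_total:.1f} ms")  # 4.7 ms, 435.2 ms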

Known Issues

  • Output sequence length constraints (--output-tokens-mean) cannot be guaranteed unless you pass ignore_eos and/or min_tokens via --extra-inputs to an inference server that supports them.
  • Very high concurrency settings (typically >15,000 concurrency) may lead to port exhaustion on some systems, causing connection failures during benchmarking. If encountered, consider adjusting system limits or reducing concurrency.
  • Startup errors caused by invalid configuration settings can cause AIPerf to hang indefinitely. If AIPerf appears to freeze during initialization, terminate the process and check configuration settings.
