GenAI Benchmark

A high-performance load testing tool for OpenAI-compatible LLM APIs, written in Rust.

Features

  • OpenAI-compatible API support: Works with any OpenAI-compatible endpoint (vLLM, TGI, Ollama, etc.)
  • Streaming metrics: Measures Time to First Token (TTFT), Inter-token Latency (ITL), and Time per Output Token (TPOT)
  • TEE signature verification: Verifies that chat completions come from a genuine Trusted Execution Environment (TEE) and tracks verification latency
  • Audio input/output testing: Test multimodal models with audio input (transcription) and audio output (TTS)
  • Image generation testing: Benchmark image generation endpoints, tracking generation time, throughput, and data size
  • Built-in scenarios: Pre-configured benchmarks included in the binary
  • Provider comparison: Test the same model across multiple providers and compare results
  • Detailed statistics: P50, P90, P95, P99, and P100 (maximum) percentiles for all metrics
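
The streaming metrics listed above can be illustrated with a short sketch. It assumes the commonly used definitions (TTFT is the delay from sending the request to the first token, ITL is the gap between consecutive tokens, TPOT is the generation time after the first token divided by the number of remaining tokens) and nearest-rank percentiles; the tool's exact formulas may differ. `StreamMetrics`, `stream_metrics`, and `percentile` are hypothetical names, not part of the genai-benchmark API.

```rust
use std::time::Duration;

/// Latency metrics for one streamed response (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
pub struct StreamMetrics {
    /// Time to First Token: delay between sending the request and the first token.
    pub ttft: Duration,
    /// Inter-token Latency: gaps between consecutive tokens.
    pub itl: Vec<Duration>,
    /// Time per Output Token: average time per token after the first one.
    /// `None` when only one token arrived.
    pub tpot: Option<Duration>,
}

/// `arrivals` holds each token's arrival time, measured from the moment the
/// request was sent, in ascending order. Returns `None` if no token arrived.
pub fn stream_metrics(arrivals: &[Duration]) -> Option<StreamMetrics> {
    let first = *arrivals.first()?;
    let last = *arrivals.last()?;
    let itl: Vec<Duration> = arrivals
        .windows(2)
        .map(|pair| pair[1].saturating_sub(pair[0]))
        .collect();
    let tpot = if arrivals.len() > 1 {
        Some(last.saturating_sub(first) / (arrivals.len() as u32 - 1))
    } else {
        None
    };
    Some(StreamMetrics { ttft: first, itl, tpot })
}

/// Nearest-rank percentile: `p = 100.0` yields the maximum sample.
pub fn percentile(samples: &[Duration], p: f64) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}
```

For example, tokens arriving 100 ms, 150 ms, and 250 ms after the request give a TTFT of 100 ms, ITL samples of 50 ms and 100 ms, and a TPOT of 75 ms.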

Quick Start

List available scenarios

genai-benchmark list

Describe a scenario (shows its configuration and the environment variables it requires)

genai-benchmark describe near-vs-bedrock

Run a scenario

export NEARAI_API_KEY=your-key
export AWS_BEARER_TOKEN_BEDROCK=your-token
genai-benchmark run near-vs-bedrock

You can also use a .env file in the current directory to set environment variables:

# .env
NEARAI_API_KEY=your-key
AWS_BEARER_TOKEN_BEDROCK=your-token
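
To make the `.env` format concrete, here is a minimal sketch of how such a file could be parsed and how its values could be substituted into `${NAME}` placeholders like the `${NEARAI_API_KEY}` used in scenario files. This is not the tool's actual loader: `parse_dotenv` and `expand_vars` are hypothetical helpers, and quoting, escapes, and `export` prefixes that real dotenv parsers handle are left out.

```rust
use std::collections::HashMap;

/// Parses `.env`-style text: one `KEY=value` pair per line.
/// Blank lines and lines starting with `#` are ignored.
pub fn parse_dotenv(text: &str) -> HashMap<String, String> {
    text.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .filter_map(|line| line.split_once('='))
        .map(|(key, value)| (key.trim().to_string(), value.trim().to_string()))
        .collect()
}

/// Replaces each `${NAME}` in `input` with the matching value from `vars`.
/// Unknown names and unterminated placeholders are left untouched.
pub fn expand_vars(input: &str, vars: &HashMap<String, String>) -> String {
    let mut out = String::new();
    let mut rest = input;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        match after.find('}') {
            Some(end) => {
                match vars.get(&after[..end]) {
                    Some(value) => out.push_str(value),
                    // Unknown variable: keep the placeholder text as written.
                    None => out.push_str(&rest[start..start + 2 + end + 1]),
                }
                rest = &after[end + 1..];
            }
            None => {
                // No closing brace: copy the remainder verbatim and stop.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

With the `.env` file shown above, expanding `api_key: "${NEARAI_API_KEY}"` would yield `api_key: "your-key"`.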

Export and customize a scenario

genai-benchmark export near-vs-bedrock > my-benchmark.yaml
# Edit my-benchmark.yaml
genai-benchmark scenario my-benchmark.yaml

TEE Signature Verification

To enable TEE signature verification for a provider, add verify: true to the provider configuration in your scenario YAML:

providers:
  - name: "NEAR AI"
    base_url: "https://cloud-api.near.ai/v1"
    api_key: "${NEARAI_API_KEY}"
    verify: true  # Enable TEE signature verification

When enabled, the benchmark will:

  • Fetch the TEE signature for each chat completion from /signature/{chat_id}
  • Track verification success/failure rates
  • Measure and report verification latency separately
  • Include verification time in the total request duration metrics

The verification results are displayed in the benchmark output with separate latency statistics.
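
The bookkeeping described above can be sketched as follows. This is an illustration only: `VerificationStats`, `signature_url`, and `total_duration` are hypothetical names, and resolving `/signature/{chat_id}` relative to the provider's `base_url` is an assumption rather than something this README states.

```rust
use std::time::Duration;

/// Running totals for TEE signature checks (hypothetical type for illustration).
#[derive(Debug, Default)]
pub struct VerificationStats {
    pub verified: u64,
    pub failed: u64,
    latencies: Vec<Duration>,
}

impl VerificationStats {
    /// Records one verification attempt and how long it took.
    pub fn record(&mut self, ok: bool, latency: Duration) {
        if ok {
            self.verified += 1;
        } else {
            self.failed += 1;
        }
        self.latencies.push(latency);
    }

    /// Fraction of attempts that verified successfully, if any were made.
    pub fn success_rate(&self) -> Option<f64> {
        let total = self.verified + self.failed;
        if total == 0 {
            None
        } else {
            Some(self.verified as f64 / total as f64)
        }
    }

    /// Mean verification latency, reported separately from completion latency.
    pub fn mean_latency(&self) -> Option<Duration> {
        if self.latencies.is_empty() {
            return None;
        }
        let sum: Duration = self.latencies.iter().sum();
        Some(sum / self.latencies.len() as u32)
    }
}

/// Assumption: the signature endpoint lives under the provider's base URL.
pub fn signature_url(base_url: &str, chat_id: &str) -> String {
    format!("{}/signature/{}", base_url.trim_end_matches('/'), chat_id)
}

/// Total request duration with verification time included.
pub fn total_duration(completion: Duration, verification: Duration) -> Duration {
    completion + verification
}
```

Keeping the verification latencies in their own collection is what allows them to be reported separately while still being added to each request's total duration.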

Audio Input/Output Testing

Test multimodal models like Qwen3-Omni with audio:

# Test audio input (transcription)
genai-benchmark run audio-input

# Test audio output (text-to-speech)
genai-benchmark run audio-output

# Test both audio input and output
genai-benchmark run multimodal

Or use CLI flags:

# Add test audio to chat requests
genai-benchmark --base-url https://cloud-api.near.ai/v1 --model Qwen/Qwen3-Omni-30B-A3B-Instruct --audio-input --verify

# Enable audio output (sets modalities: ["text", "audio"])
genai-benchmark --base-url https://cloud-api.near.ai/v1 --model Qwen/Qwen3-Omni-30B-A3B-Instruct --audio-output --verify

Image Generation Testing

Benchmark image generation endpoints:

# Run the built-in image generation scenario
genai-benchmark run image-generation

# Or use CLI flags
genai-benchmark --base-url https://cloud-api.near.ai/v1 --model Qwen/Qwen-Image-2512 --image-generation --image-size 1024x1024 --verify

Image Generation Performance Scenarios

Multiple scenarios are provided to test different aspects of image generation throughput:

# Quick test (5 images, basic metrics)
genai-benchmark run image-generation

# High-throughput stress test (100 images, high concurrency)
genai-benchmark run image-generation-stress

# Sustained load test (200 images over extended period)
genai-benchmark run image-generation-sustained

# Smaller images for performance comparison (512x512)
genai-benchmark run image-generation-512

# Batch generation (4 images per request)
genai-benchmark run image-generation-batch

Image generation metrics include:

  • Total images generated
  • Total/average image data size
  • Mean and P95 generation time
  • Images per second (throughput)
  • Data throughput (MB/s)
  • TEE signature verification status
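
As a rough illustration of how the numeric metrics above relate to each other, the sketch below derives them from per-image samples and the wall-clock duration of the run. The names `ImageSample`, `ImageSummary`, and `summarize` are hypothetical, the P95 uses the nearest-rank method, and 1 MB is taken as 1,000,000 bytes; the tool itself may compute these differently.

```rust
use std::time::Duration;

/// One generated image (hypothetical type for illustration).
pub struct ImageSample {
    pub bytes: u64,
    pub generation_time: Duration,
}

#[derive(Debug, PartialEq)]
pub struct ImageSummary {
    pub total_images: usize,
    pub total_bytes: u64,
    pub mean_bytes: f64,
    pub mean_time: Duration,
    pub p95_time: Duration,
    pub images_per_sec: f64,
    pub megabytes_per_sec: f64,
}

/// Summarizes a run. `wall_clock` is the elapsed time of the whole run, which
/// is shorter than the sum of generation times when requests run concurrently.
pub fn summarize(samples: &[ImageSample], wall_clock: Duration) -> Option<ImageSummary> {
    if samples.is_empty() || wall_clock.is_zero() {
        return None;
    }
    let n = samples.len();
    let total_bytes: u64 = samples.iter().map(|s| s.bytes).sum();

    let mut times: Vec<Duration> = samples.iter().map(|s| s.generation_time).collect();
    times.sort();
    let total_time: Duration = times.iter().sum();
    // Nearest-rank P95: the smallest sample with at least 95% of samples at or below it.
    let p95_rank = ((0.95 * n as f64).ceil() as usize).clamp(1, n);

    let secs = wall_clock.as_secs_f64();
    Some(ImageSummary {
        total_images: n,
        total_bytes,
        mean_bytes: total_bytes as f64 / n as f64,
        mean_time: total_time / n as u32,
        p95_time: times[p95_rank - 1],
        images_per_sec: n as f64 / secs,
        megabytes_per_sec: total_bytes as f64 / 1_000_000.0 / secs,
    })
}
```

Throughput is computed against wall-clock time rather than summed generation time, so higher concurrency shows up as more images per second even when individual generation times stay the same.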

Multi-Phase Benchmarks

Multi-phase benchmarks test cache effectiveness by running a warmup phase followed by a query phase:

# List available multi-phase scenarios
genai-benchmark list

# Run built-in multi-phase scenarios
genai-benchmark run same-doc-qa      # Same document QA benchmark
genai-benchmark run multi-round-qa   # Multi-round conversation QA
genai-benchmark run rag              # RAG with quality metrics
genai-benchmark run long-doc-qa      # Long document QA
genai-benchmark run multi-doc-qa     # Multi-document QA

# Run a custom multi-phase scenario file
genai-benchmark multi-phase-scenario my-scenario.yaml

Multi-phase scenarios support:

  • Warmup phase: Prime the cache with initial requests
  • Query phase: Measure cache hit rates and performance
  • Cache metrics: Track cache effectiveness across providers
  • Quality metrics: F1 and ROUGE-L scores for answer quality
  • Provider comparison: Compare providers with and without caching systems
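
For readers unfamiliar with the quality metrics named above, here is a minimal sketch of token-level F1 and ROUGE-L. It assumes lowercase whitespace tokenization and reports ROUGE-L as the balanced F-measure over the longest common subsequence; the benchmark's own tokenization and weighting may differ. `token_f1` and `rouge_l` are hypothetical function names.

```rust
use std::collections::HashMap;

/// Lowercased whitespace tokens (a simplifying assumption).
fn tokens(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

/// Harmonic mean of precision and recall for a given overlap count.
fn f_measure(overlap: usize, pred_len: usize, gold_len: usize) -> f64 {
    if overlap == 0 || pred_len == 0 || gold_len == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / pred_len as f64;
    let recall = overlap as f64 / gold_len as f64;
    2.0 * precision * recall / (precision + recall)
}

/// Token-level F1: overlap is the multiset intersection, so word order is ignored.
pub fn token_f1(prediction: &str, reference: &str) -> f64 {
    let (pred, gold) = (tokens(prediction), tokens(reference));
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for t in &gold {
        *counts.entry(t.as_str()).or_insert(0) += 1;
    }
    let mut overlap = 0;
    for t in &pred {
        if let Some(c) = counts.get_mut(t.as_str()) {
            if *c > 0 {
                *c -= 1;
                overlap += 1;
            }
        }
    }
    f_measure(overlap, pred.len(), gold.len())
}

/// ROUGE-L: overlap is the longest common subsequence, so word order matters.
pub fn rouge_l(prediction: &str, reference: &str) -> f64 {
    let (pred, gold) = (tokens(prediction), tokens(reference));
    // Row-by-row dynamic programming over the LCS table.
    let mut prev = vec![0usize; gold.len() + 1];
    for p in &pred {
        let mut cur = vec![0usize; gold.len() + 1];
        for (j, g) in gold.iter().enumerate() {
            cur[j + 1] = if p == g { prev[j] + 1 } else { prev[j + 1].max(cur[j]) };
        }
        prev = cur;
    }
    f_measure(prev[gold.len()], pred.len(), gold.len())
}
```

The two scores diverge when word order changes: for the prediction "sat cat the" against the reference "the cat sat", token F1 is 1.0 while ROUGE-L drops to 1/3, because the longest common subsequence is a single word.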

Installation

One-liner install:

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/nearai/genai-benchmark/releases/latest/download/genai-benchmark-installer.sh | sh

Alternatively, download a pre-built binary from the Releases page, or build from source:

cargo install --path .

Library Usage

The benchmark can also be called from Rust code as a library:

use genai_benchmark::{BenchmarkConfig, run_benchmark, load_dataset, DatasetConfig};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
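    let config = BenchmarkConfig {
        name: Some("My Test".to_string()),
        // Any OpenAI-compatible endpoint
        base_url: "https://api.example.com/v1".to_string(),
        api_key: "your-key".to_string(),
        model: "gpt-4".to_string(),
        max_tokens: 256,   // output token limit
        concurrency: 5,    // concurrent requests
        rps: 10.0,         // requests per second
        timeout_secs: 300, // timeout, in seconds
    };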

    let dataset = DatasetConfig::Synthetic { seed: Some(42) };
    let prompts = load_dataset(&dataset, 100).await?;

    let result = run_benchmark(&config, prompts, 100).await?;
    genai_benchmark::print_result(&result);

    Ok(())
}
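
How `rps` is enforced is not spelled out here. One common convention for load generators is open-loop pacing, where request start times are spaced at a fixed interval of `1 / rps` regardless of how long earlier requests take, with `concurrency` acting as a separate cap on in-flight requests. The sketch below shows only the pacing part; `send_offsets` is a hypothetical helper, not a function exported by this crate.

```rust
use std::time::Duration;

/// Start offsets for `count` requests paced at a fixed `rps`.
/// The first request starts at offset zero.
pub fn send_offsets(rps: f64, count: usize) -> Vec<Duration> {
    assert!(rps > 0.0 && rps.is_finite(), "rps must be positive and finite");
    let interval_secs = 1.0 / rps;
    (0..count)
        .map(|i| Duration::from_secs_f64(i as f64 * interval_secs))
        .collect()
}
```

Under this convention, `rps: 10.0` spaces request starts 100 ms apart, so the 100 requests in the example above would all be issued within roughly 10 seconds.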

License

MIT
