
MLPerf® Inference Endpoint Benchmarking System

A high-performance benchmarking tool for LLM endpoints.

Quick Start

Installation

Requirements: Python 3.12 or newer. Python 3.12 is recommended for optimal performance; free-threaded (GIL-less) builds of newer Python versions are not yet supported.

# Clone the repository
# Note: This repo will be migrated to https://github.com/mlcommons/endpoints
git clone https://github.com/mlcommons/endpoints.git
cd endpoints

This project uses uv for dependency management. All dependencies are pinned in uv.lock. Install uv first: curl -LsSf https://astral.sh/uv/install.sh | sh (see uv installation docs for other methods).

# Install dependencies
uv sync

# For development (includes linting, testing, and type-checking tools)
uv sync --extra dev --extra test
uv run pre-commit install

# Run project commands with `uv run ...`, or activate the venv directly:
# source .venv/bin/activate
Using pip + venv instead (backward-compatible)

Note: pip installs from pyproject.toml directly and does not use uv.lock. Dependency versions may differ.

python3.12 -m venv venv && source venv/bin/activate
pip install -e ".[dev,test]"
pre-commit install

After activating the venv, all commands below work without the uv run prefix.

Basic Usage

# Show help
uv run inference-endpoint --help

# Show system information
uv run inference-endpoint -v info

# Test endpoint connectivity
uv run inference-endpoint probe \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B

# Run offline benchmark (max throughput - uses all dataset samples)
uv run inference-endpoint benchmark offline \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl

# Run online benchmark (sustained QPS - requires --target-qps, --load-pattern)
uv run inference-endpoint benchmark online \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl \
  --load-pattern poisson \
  --target-qps 100

# With explicit sample count
uv run inference-endpoint benchmark offline \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl \
  --num-samples 5000

# ... or activate the venv to skip the `uv run` prefix:
# source .venv/bin/activate
# inference-endpoint --help
# inference-endpoint benchmark offline --endpoints URL --model NAME --dataset PATH
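The poisson load pattern used in the online benchmark schedules requests with exponentially distributed inter-arrival times, so arrivals average the target QPS while preserving realistic burstiness. A minimal illustrative sketch (not the tool's actual scheduler):

```python
import random

def poisson_arrival_times(target_qps: float, num_requests: int, seed: int = 0) -> list[float]:
    """Generate absolute send times (seconds from start) for a Poisson
    arrival process averaging `target_qps` requests per second."""
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(num_requests):
        # Exponential inter-arrival time with rate = target_qps
        t += rng.expovariate(target_qps)
        times.append(t)
    return times

# At --target-qps 100, inter-arrival gaps average 1 / 100 = 10 ms
schedule = poisson_arrival_times(target_qps=100, num_requests=1000)
```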

Running Locally

# Start local echo server
uv run python -m inference_endpoint.testing.echo_server --port 8765 &

# Test with dummy dataset (included in repo)
uv run inference-endpoint benchmark offline \
  --endpoints http://localhost:8765 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl

# Stop echo server
pkill -f echo_server

# ... or with an activated venv (source .venv/bin/activate):
# python -m inference_endpoint.testing.echo_server --port 8765 &
# inference-endpoint benchmark offline --endpoints http://localhost:8765 --model Qwen/Qwen3-8B --dataset tests/datasets/dummy_1k.jsonl

See Local Testing Guide for detailed instructions.

Running Tests and Examples

uv run pytest -m "not performance and not run_explicitly"
uv run pytest -m unit
uv run pytest --cov=src --cov-report=html

# ... or with an activated venv (source .venv/bin/activate):
# pytest -m "not performance and not run_explicitly"
# pytest -m unit

# Run examples: follow instructions in examples/*/README.md

📚 Documentation

Component Design Specs

Each top-level component under src/inference_endpoint/ has a corresponding spec:

| Component | Spec |
| --- | --- |
| Core types | docs/core/DESIGN.md |
| Load generator | docs/load_generator/DESIGN.md |
| Endpoint client | docs/endpoint_client/DESIGN.md |
| Metrics | docs/metrics/DESIGN.md |
| Config | docs/config/DESIGN.md |
| Async utils | docs/async_utils/DESIGN.md |
| Dataset manager | docs/dataset_manager/DESIGN.md |
| Commands (CLI) | docs/commands/DESIGN.md |
| OpenAI adapter | docs/openai/DESIGN.md |
| SGLang adapter | docs/sglang/DESIGN.md |
| Evaluation | docs/evaluation/DESIGN.md |
| Testing utilities | docs/testing/DESIGN.md |
| Profiling | docs/profiling/DESIGN.md |
| Plugins | docs/plugins/DESIGN.md |
| Utils | docs/utils/DESIGN.md |

🎯 Architecture

The system follows a modular, event-driven architecture:

Dataset Manager ──► Load Generator ──► Endpoint Client ──► External Endpoint
                          │
                          ▼
                  Metrics Collector
           (event logging + reporting)
  • Dataset Manager: Loads benchmark datasets and applies transform pipelines
  • Load Generator: Central orchestrator — controls timing (scheduler), issues queries, and emits sample events
  • Endpoint Client: Multi-process HTTP worker pool communicating over ZMQ IPC
  • Metrics Collector: Receives sample events from Load Generator; writes to SQLite (EventRecorder), aggregates after the run (MetricsReporter)
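As an illustration of this event flow only (the class and method names below are hypothetical, not the project's actual API), the orchestrator-emits-events pattern can be sketched as:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class SampleEvent:
    sample_id: int
    latency_s: float

@dataclass
class MetricsCollector:
    """Receives sample events during the run and aggregates afterwards."""
    events: list = field(default_factory=list)

    def on_sample(self, event: SampleEvent) -> None:
        self.events.append(event)

    def report(self) -> dict:
        n = len(self.events)
        total = sum(e.latency_s for e in self.events)
        return {"samples": n, "mean_latency_s": total / n if n else 0.0}

@dataclass
class LoadGenerator:
    """Central orchestrator: issues queries, emits one event per sample."""
    send: Callable[[int], float]              # endpoint client: id -> latency
    on_event: Callable[[SampleEvent], None]   # metrics collector callback

    def run(self, sample_ids: list[int]) -> None:
        for sid in sample_ids:
            latency = self.send(sid)
            self.on_event(SampleEvent(sid, latency))

collector = MetricsCollector()
gen = LoadGenerator(send=lambda sid: 0.05, on_event=collector.on_sample)
gen.run(list(range(10)))
summary = collector.report()
```

In the real system the `send` side is a multi-process HTTP worker pool behind ZMQ IPC and the collector persists events to SQLite, but the decoupling via emitted events is the same.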

Accuracy Evaluation

You can run accuracy evaluation with Pass@1 scoring by specifying accuracy datasets in the benchmark configuration. Currently, Inference Endpoints provides the following pre-defined accuracy benchmarks:

  • GPQA (default: GPQA Diamond)
  • AIME (default: AIME 2025)
  • LiveCodeBench (default: lite, release_v6)

Note that LiveCodeBench does not work out-of-the-box and requires additional setup; see the LiveCodeBench documentation for details.
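Pass@1 is the fraction of problems solved on the first sampled attempt. When multiple generations are sampled per problem, the standard unbiased pass@k estimator is 1 − C(n−c, k)/C(n, k) for n generations with c correct, which for k=1 reduces to c/n. A minimal sketch of the scoring (illustrative; not this project's evaluation code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations (c correct) passes.
    For k=1 this reduces to c / n."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark-level pass@1 is the mean per-problem pass@1:
# (n generations, c correct) per problem
results = [(4, 2), (4, 0), (4, 4)]
score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```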

🚧 Pending Features

The following features are planned for future releases:

  • Submission Ruleset Integration - Full MLPerf submission workflow support
  • Documentation Generation and Hosting - Sphinx-based API documentation with GitHub Pages

🤝 Contributing

We welcome contributions! Please see our Development Guide for details on:

  • Setting up your development environment
  • Code style and quality standards
  • Testing requirements
  • Pull request process

🙏 Acknowledgements

This project draws inspiration from the following excellent projects:

We are grateful to these communities for their contributions to LLM benchmarking and performance analysis.

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🔗 Links

👥 Contributors

Credits to core contributors of the project:

  • MLCommons Committee
  • NVIDIA: Zhihan Jiang, Rashid Kaleem, Viraat Chandra, Alice Cheng
  • ...

See ATTRIBUTION for detailed attribution information.

📞 Support