
MLPerf® Inference Endpoint Benchmarking System

A high-performance benchmarking tool for LLM endpoints.

Quick Start

Installation

Requirements: Python 3.12 or newer. Python 3.12 is recommended for optimal performance; free-threaded (GIL-less) builds of newer Python versions are not yet supported.

# Clone the repository
# Note: This repo will be migrated to https://github.com/mlcommons/endpoints
git clone https://github.com/mlcommons/endpoints.git
cd endpoints

This project uses uv for dependency management. All dependencies are pinned in uv.lock. Install uv first: curl -LsSf https://astral.sh/uv/install.sh | sh (see uv installation docs for other methods).

# Install dependencies
uv sync

# For development (includes linting, testing, and type-checking tools)
uv sync --extra dev --extra test
uv run pre-commit install

# Run project commands with `uv run ...`, or activate the venv directly:
# source .venv/bin/activate
Using pip + venv instead (backward-compatible)

Note: pip installs from pyproject.toml directly and does not use uv.lock. Dependency versions may differ.

python3.12 -m venv venv && source venv/bin/activate
pip install -e ".[dev,test]"
pre-commit install

After activating the venv, all commands below work without the uv run prefix.

Basic Usage

# Show help
uv run inference-endpoint --help

# Show system information
uv run inference-endpoint -v info

# Test endpoint connectivity
uv run inference-endpoint probe \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B

# Run offline benchmark (max throughput - uses all dataset samples)
uv run inference-endpoint benchmark offline \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl

# Run online benchmark (sustained QPS - requires --target-qps, --load-pattern)
uv run inference-endpoint benchmark online \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl \
  --load-pattern poisson \
  --target-qps 100

# With explicit sample count
uv run inference-endpoint benchmark offline \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl \
  --num-samples 5000

# ... or activate the venv to skip the `uv run` prefix:
# source .venv/bin/activate
# inference-endpoint --help
# inference-endpoint benchmark offline --endpoints URL --model NAME --dataset PATH
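The poisson load pattern used in the online benchmark schedules requests with exponentially distributed inter-arrival times, so arrivals average the target QPS while preserving realistic burstiness. A minimal illustrative sketch (not the tool's actual scheduler):

```python
import random

def poisson_arrival_times(target_qps: float, num_requests: int, seed: int = 0) -> list[float]:
    """Generate absolute send times (seconds from start) for a Poisson
    arrival process averaging `target_qps` requests per second."""
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(num_requests):
        # Exponential inter-arrival time with rate = target_qps
        t += rng.expovariate(target_qps)
        times.append(t)
    return times

# At --target-qps 100, inter-arrival gaps average 1 / 100 = 10 ms
schedule = poisson_arrival_times(target_qps=100, num_requests=1000)
```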

Running Locally

# Start local echo server
uv run python -m inference_endpoint.testing.echo_server --port 8765 &

# Test with dummy dataset (included in repo)
uv run inference-endpoint benchmark offline \
  --endpoints http://localhost:8765 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl

# Stop echo server
pkill -f echo_server

# ... or with an activated venv (source .venv/bin/activate):
# python -m inference_endpoint.testing.echo_server --port 8765 &
# inference-endpoint benchmark offline --endpoints http://localhost:8765 --model Qwen/Qwen3-8B --dataset tests/datasets/dummy_1k.jsonl

See Local Testing Guide for detailed instructions.

Running Tests and Examples

uv run pytest -m "not performance and not run_explicitly"
uv run pytest -m unit
uv run pytest --cov=src --cov-report=html

# ... or with an activated venv (source .venv/bin/activate):
# pytest -m "not performance and not run_explicitly"
# pytest -m unit

# Run examples: follow instructions in examples/*/README.md

📚 Documentation

Component Design Specs

Each top-level component under src/inference_endpoint/ has a corresponding spec:

| Component | Spec |
| --- | --- |
| Core types | docs/core/DESIGN.md |
| Load generator | docs/load_generator/DESIGN.md |
| Endpoint client | docs/endpoint_client/DESIGN.md |
| Metrics | docs/metrics/DESIGN.md |
| Config | docs/config/DESIGN.md |
| Async utils | docs/async_utils/DESIGN.md |
| Dataset manager | docs/dataset_manager/DESIGN.md |
| Commands (CLI) | docs/commands/DESIGN.md |
| OpenAI adapter | docs/openai/DESIGN.md |
| SGLang adapter | docs/sglang/DESIGN.md |
| Evaluation | docs/evaluation/DESIGN.md |
| Testing utilities | docs/testing/DESIGN.md |
| Profiling | docs/profiling/DESIGN.md |
| Plugins | docs/plugins/DESIGN.md |
| Utils | docs/utils/DESIGN.md |

🎯 Architecture

The system follows a modular, event-driven architecture:

Dataset Manager ──► Load Generator ──► Endpoint Client ──► External Endpoint
                          │
                          ▼
                  Metrics Collector
           (event logging + reporting)
  • Dataset Manager: Loads benchmark datasets and applies transform pipelines
  • Load Generator: Central orchestrator — controls timing (scheduler), issues queries, and emits sample events
  • Endpoint Client: Multi-process HTTP worker pool communicating over ZMQ IPC
  • Metrics Collector: Receives sample events from Load Generator; writes to SQLite (EventRecorder), aggregates after the run (MetricsReporter)
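As an illustration of this event flow only (the class and method names below are hypothetical, not the project's actual API), the orchestrator-emits-events pattern can be sketched as:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class SampleEvent:
    sample_id: int
    latency_s: float

@dataclass
class MetricsCollector:
    """Receives sample events during the run and aggregates afterwards."""
    events: list = field(default_factory=list)

    def on_sample(self, event: SampleEvent) -> None:
        self.events.append(event)

    def report(self) -> dict:
        n = len(self.events)
        total = sum(e.latency_s for e in self.events)
        return {"samples": n, "mean_latency_s": total / n if n else 0.0}

@dataclass
class LoadGenerator:
    """Central orchestrator: issues queries, emits one event per sample."""
    send: Callable[[int], float]              # endpoint client: id -> latency
    on_event: Callable[[SampleEvent], None]   # metrics collector callback

    def run(self, sample_ids: list[int]) -> None:
        for sid in sample_ids:
            latency = self.send(sid)
            self.on_event(SampleEvent(sid, latency))

collector = MetricsCollector()
gen = LoadGenerator(send=lambda sid: 0.05, on_event=collector.on_sample)
gen.run(list(range(10)))
summary = collector.report()
```

In the real system the `send` side is a multi-process HTTP worker pool behind ZMQ IPC and the collector persists events to SQLite, but the decoupling via emitted events is the same.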

Accuracy Evaluation

You can run accuracy evaluation with Pass@1 scoring by specifying accuracy datasets in the benchmark configuration. Currently, Inference Endpoints provides the following pre-defined accuracy benchmarks:

  • GPQA (default: GPQA Diamond)
  • AIME (default: AIME 2025)
  • LiveCodeBench (default: lite, release_v6)

Note that LiveCodeBench does not work out-of-the-box and requires additional setup; see the LiveCodeBench documentation for details.
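Pass@1 is the fraction of problems solved on the first sampled attempt. When multiple generations are sampled per problem, the standard unbiased pass@k estimator is 1 − C(n−c, k)/C(n, k) for n generations with c correct, which for k=1 reduces to c/n. A minimal sketch of the scoring (illustrative; not this project's evaluation code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations (c correct) passes.
    For k=1 this reduces to c / n."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark-level pass@1 is the mean per-problem pass@1:
# (n generations, c correct) per problem
results = [(4, 2), (4, 0), (4, 4)]
score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```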

🚧 Pending Features

The following features are planned for future releases:

  • Submission Ruleset Integration - Full MLPerf submission workflow support
  • Documentation Generation and Hosting - Sphinx-based API documentation with GitHub Pages

🤝 Contributing

We welcome contributions! Please see our Development Guide for details on:

  • Setting up your development environment
  • Code style and quality standards
  • Testing requirements
  • Pull request process

🙏 Acknowledgements

This project draws inspiration from the following excellent projects:

We are grateful to these communities for their contributions to LLM benchmarking and performance analysis.

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🔗 Links

👥 Contributors

Credits to core contributors of the project:

  • MLCommons Committee
  • NVIDIA: Zhihan Jiang, Rashid Kaleem, Viraat Chandra, Alice Cheng
  • ...

See ATTRIBUTION for detailed attribution information.

📞 Support