
MLPerf® Inference Endpoint Benchmarking System

A high-performance benchmarking tool for LLM endpoints.

Quick Start

Installation

Requirements: Python 3.12 or newer. Python 3.12 is recommended for optimal performance; free-threaded (GIL-less) builds of newer Python versions are not yet supported.

# Clone the repository
# Note: This repo will be migrated to https://github.com/mlcommons/endpoints
git clone https://github.com/mlcommons/endpoints.git
cd endpoints

# Create virtual environment
python3.12 -m venv venv
source venv/bin/activate

# As a user
pip install .

# As a developer (with development and test extras)
pip install -e ".[dev,test]"
pre-commit install

Basic Usage

# Show help
inference-endpoint --help

# Show system information
inference-endpoint -v info

# Test endpoint connectivity
inference-endpoint probe \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B

# Run offline benchmark (max throughput - uses all dataset samples)
inference-endpoint benchmark offline \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl

# Run online benchmark (sustained QPS - requires --target-qps, --load-pattern)
inference-endpoint benchmark online \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl \
  --load-pattern poisson \
  --target-qps 100

# With explicit sample count
inference-endpoint benchmark offline \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl \
  --num-samples 5000
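With --load-pattern poisson, request send times are drawn so that arrivals form a Poisson process at the target QPS, meaning inter-arrival gaps are exponentially distributed with mean 1/target_qps. A minimal illustrative sketch of that scheduling idea (not the tool's actual scheduler code):

```python
import random

def poisson_arrival_times(target_qps: float, num_requests: int, seed: int = 0):
    """Generate request send times (in seconds) with exponentially
    distributed inter-arrival gaps, so arrivals form a Poisson process
    at the target rate."""
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(num_requests):
        t += rng.expovariate(target_qps)  # mean gap = 1 / target_qps
        times.append(t)
    return times

times = poisson_arrival_times(target_qps=100, num_requests=1000)
print(f"mean observed QPS: {len(times) / times[-1]:.1f}")
```

Over a long run the observed rate converges to the target, but individual gaps vary widely, which is what makes a Poisson pattern a realistic stress test compared to evenly spaced requests.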

Running Locally

# Start local echo server
python3 -m inference_endpoint.testing.echo_server --port 8765 &

# Test with dummy dataset (included in repo)
inference-endpoint benchmark offline \
  --endpoints http://localhost:8765 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl

# Stop echo server
pkill -f echo_server

See Local Testing Guide for detailed instructions.
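The echo server gives the benchmark a local target to hit without a real model behind it. As an illustration of what such a stand-in endpoint does (this is a hypothetical sketch, not the repo's inference_endpoint.testing.echo_server), a few lines of stdlib Python suffice to echo JSON requests back:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class EchoHandler(BaseHTTPRequestHandler):
    """Echo the request body back as a JSON response (illustration only)."""
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        payload = json.loads(body or b"{}")
        reply = json.dumps({"echo": payload}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)
    def log_message(self, *args):
        pass  # silence per-request logging

# Port 0 asks the OS for any free port.
server = HTTPServer(("127.0.0.1", 0), EchoHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

req = urllib.request.Request(
    f"http://127.0.0.1:{port}",
    data=json.dumps({"prompt": "hi"}).encode(),
    headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
print(result)  # {'echo': {'prompt': 'hi'}}
server.shutdown()
```

Because the reply is a deterministic function of the request, a stand-in like this lets you exercise the full client and metrics pipeline without GPU inference.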

Running Tests and Examples

# Install test dependencies
pip install ".[test]"

# Run tests (excluding performance and explicit-run tests)
pytest -m "not performance and not run_explicitly"

# Run examples: follow instructions in examples/*/README.md

📚 Documentation

Component Design Specs

Each top-level component under src/inference_endpoint/ has a corresponding spec:

| Component | Spec |
| --- | --- |
| Core types | docs/core/DESIGN.md |
| Load generator | docs/load_generator/DESIGN.md |
| Endpoint client | docs/endpoint_client/DESIGN.md |
| Metrics | docs/metrics/DESIGN.md |
| Config | docs/config/DESIGN.md |
| Async utils | docs/async_utils/DESIGN.md |
| Dataset manager | docs/dataset_manager/DESIGN.md |
| Commands (CLI) | docs/commands/DESIGN.md |
| OpenAI adapter | docs/openai/DESIGN.md |
| SGLang adapter | docs/sglang/DESIGN.md |
| Evaluation | docs/evaluation/DESIGN.md |
| Testing utilities | docs/testing/DESIGN.md |
| Profiling | docs/profiling/DESIGN.md |
| Plugins | docs/plugins/DESIGN.md |
| Utils | docs/utils/DESIGN.md |

🎯 Architecture

The system follows a modular, event-driven architecture:

Dataset Manager ──► Load Generator ──► Endpoint Client ──► External Endpoint
                          │
                    Metrics Collector
                 (event logging + reporting)
  • Dataset Manager: Loads benchmark datasets and applies transform pipelines
  • Load Generator: Central orchestrator — controls timing (scheduler), issues queries, and emits sample events
  • Endpoint Client: Multi-process HTTP worker pool communicating over ZMQ IPC
  • Metrics Collector: Receives sample events from Load Generator; writes to SQLite (EventRecorder), aggregates after the run (MetricsReporter)
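The record-then-aggregate half of this flow can be sketched in a few lines. The table name, schema, and metrics below are illustrative assumptions, not the project's actual EventRecorder or MetricsReporter internals:

```python
import sqlite3

# In-memory DB for illustration; a real recorder would write to a file.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sample_events (
    sample_id INTEGER, sent_at REAL, first_token_at REAL, done_at REAL)""")

# Record a few fake per-sample events (timestamps in seconds).
events = [(1, 0.00, 0.12, 0.90), (2, 0.01, 0.25, 1.40), (3, 0.02, 0.10, 0.80)]
conn.executemany("INSERT INTO sample_events VALUES (?, ?, ?, ?)", events)

# Aggregate after the run, as a reporter-style pass would.
ttft, latency = conn.execute("""
    SELECT AVG(first_token_at - sent_at), AVG(done_at - sent_at)
    FROM sample_events""").fetchone()
print(f"mean TTFT: {ttft:.3f}s  mean e2e latency: {latency:.3f}s")
```

Writing raw events during the run and aggregating only afterwards keeps the hot path cheap: the load generator never blocks on percentile math while queries are in flight.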

Accuracy Evaluation

You can run accuracy evaluation with Pass@1 scoring by specifying accuracy datasets in the benchmark configuration. Currently, Inference Endpoints provides the following pre-defined accuracy benchmarks:

  • GPQA (default: GPQA Diamond)
  • AIME (default: AIME 2025)
  • LiveCodeBench (default: lite, release_v6)

However, LiveCodeBench does not work out of the box and requires additional setup. See the LiveCodeBench documentation for details.
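Pass@1 is commonly computed with the unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k) for c correct out of n generations, which for k = 1 reduces to the fraction of correct generations per problem, averaged over problems. A sketch under the assumption that this project follows that convention:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n generations is correct,
    given that c of the n are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# One generation per problem (k = 1): pass@1 is just c / n per problem,
# averaged over problems.
per_problem = [(1, 1), (1, 0), (1, 1), (1, 1)]  # (n, c) for 4 problems
score = sum(pass_at_k(n, c, 1) for n, c in per_problem) / len(per_problem)
print(f"pass@1 = {score:.2f}")  # 0.75
```

Sampling more than one generation per problem and still reporting k = 1 reduces variance in the score without changing its expected value, which is why the general estimator is worth having even for Pass@1.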

🚧 Pending Features

The following features are planned for future releases:

  • Submission Ruleset Integration - Full MLPerf submission workflow support
  • Documentation Generation and Hosting - Sphinx-based API documentation with GitHub Pages

🤝 Contributing

We welcome contributions! Please see our Development Guide for details on:

  • Setting up your development environment
  • Code style and quality standards
  • Testing requirements
  • Pull request process

🙏 Acknowledgements

This project draws inspiration from the following excellent projects:

We are grateful to these communities for their contributions to LLM benchmarking and performance analysis.

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🔗 Links

👥 Contributors

Credits to core contributors of the project:

  • MLCommons Committee
  • NVIDIA: Zhihan Jiang, Rashid Kaleem, Viraat Chandra, Alice Cheng
  • ...

See ATTRIBUTION for detailed attribution information.

📞 Support