Guidelines for AI coding agents working in the MLPerf Inference Endpoint Benchmarking System repository.
High-performance benchmarking tool for LLM inference endpoints targeting 50k+ QPS. Python 3.12+, Apache 2.0 licensed.
```bash
# Development setup
uv sync --extra dev --extra test
uv run pre-commit install

# Testing
uv run pytest                                      # All tests (excludes slow/performance)
uv run pytest -m unit                              # Unit tests only
uv run pytest -m integration                       # Integration tests only
uv run pytest --cov=src --cov-report=html          # With coverage
uv run pytest -xvs tests/unit/path/to/test_file.py # Single test file

# Code quality (run before commits)
uv run pre-commit run --all-files

# Local testing with echo server
uv run python -m inference_endpoint.testing.echo_server --port 8765
uv run inference-endpoint probe --endpoints http://localhost:8765 --model test-model

# CLI usage
uv run inference-endpoint benchmark offline --endpoints URL --model NAME --dataset PATH
uv run inference-endpoint benchmark online --endpoints URL --model NAME --dataset PATH --load-pattern poisson --target-qps 100
```
```bash
uv run inference-endpoint benchmark from-config --config config.yaml
```

Alternatively, set up without uv (note: this does not use `uv.lock`, so dependency versions may differ from the lockfile):

```bash
python3.12 -m venv venv && source venv/bin/activate
pip install -e ".[dev,test]"
pre-commit install
```

After activating the venv, commands run without the `uv run` prefix:

```bash
pytest -m unit
pre-commit run --all-files
inference-endpoint benchmark offline --endpoints URL --model NAME --dataset PATH
```

Data flow:

```
Dataset Manager --> Load Generator --> Endpoint Client --> External Endpoint
                           |
                   Metrics Collector (EventRecorder + MetricsReporter)
```
| Component | Location | Purpose |
|---|---|---|
| Load Generator | `src/inference_endpoint/load_generator/` | Central orchestrator: `BenchmarkSession` owns the lifecycle, `Scheduler` controls timing, `LoadGenerator` issues queries |
| Endpoint Client | `src/inference_endpoint/endpoint_client/` | Multi-process HTTP workers communicating via ZMQ IPC. `HTTPEndpointClient` is the main entry point |
| Dataset Manager | `src/inference_endpoint/dataset_manager/` | Loads JSONL, HuggingFace, CSV, JSON, Parquet datasets. `Dataset` base class with `load_sample()`/`num_samples()` interface |
| Metrics | `src/inference_endpoint/metrics/` | `EventRecorder` writes to SQLite, `MetricsReporter` reads and aggregates (QPS, latency, TTFT, TPOT) |
| Config | `src/inference_endpoint/config/`, `endpoint_client/config.py` | Pydantic-based YAML schema (`schema.py`), `HTTPClientConfig` (single Pydantic model for CLI/YAML/runtime), `RuntimeSettings` |
| CLI | `src/inference_endpoint/main.py`, `commands/benchmark/cli.py` | cyclopts-based, auto-generated from `schema.py` and `HTTPClientConfig` Pydantic models. Flat shorthands via `cyclopts.Parameter(alias=...)` |
| Async Utils | `src/inference_endpoint/async_utils/` | `LoopManager` (uvloop + `eager_task_factory`), ZMQ transport layer, event publisher |
| OpenAI/SGLang | `src/inference_endpoint/openai/`, `sglang/` | Protocol adapters and response accumulators for different API formats |
Multi-process, event-loop design optimized for throughput:

- `BenchmarkSession` thread schedules samples with busy-wait timing
- Worker processes (N instances) handle HTTP requests via ZMQ IPC
- Uses `eager_task_factory` and `uvloop` for minimal async overhead
- CPU affinity support (`cpu_affinity.py`) for performance tuning
- Custom HTTP connection pooling (`http.py`) with `httptools` parser
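The busy-wait timing used by the scheduler thread can be sketched with stdlib code only. This is an illustrative sketch, not the project's API: the real logic lives in `load_generator/scheduler.py`, and `busy_wait_until` is a hypothetical helper name.

```python
import time

def busy_wait_until(deadline: float, spin_threshold: float = 0.002) -> None:
    """Sleep coarsely until close to the deadline, then spin for precision.

    Coarse time.sleep() is cheap but imprecise; spinning on perf_counter()
    for the last ~2 ms gives sub-millisecond scheduling accuracy.
    """
    while True:
        remaining = deadline - time.perf_counter()
        if remaining <= 0:
            return
        if remaining > spin_threshold:
            time.sleep(remaining - spin_threshold)  # coarse sleep phase
        # otherwise: loop again without sleeping (busy-wait / spin phase)

start = time.perf_counter()
busy_wait_until(start + 0.01)  # wait ~10 ms
elapsed = time.perf_counter() - start
assert elapsed >= 0.01  # never returns early
```

The trade-off is CPU usage for timing precision, which matters when issuing queries at tens of thousands of QPS.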
CLI is auto-generated from `config/schema.py` Pydantic models via cyclopts. Fields annotated with `cyclopts.Parameter(alias="--flag")` get flat shorthands; all other fields get auto-generated dotted flags (kebab-case).

- CLI mode (`offline`/`online`): cyclopts constructs `OfflineBenchmarkConfig`/`OnlineBenchmarkConfig` (subclasses in `config/schema.py`) directly from CLI args. Type locked via `Literal`. `--dataset` is repeatable with the TOML-style format `[perf|acc:]<path>[,key=value...]` (e.g. `--dataset data.csv,samples=500,parser.prompt=article`). Full accuracy support via `accuracy_config.eval_method=pass_at_1` etc.
- YAML mode (`from-config`): `BenchmarkConfig.from_yaml_file()` loads YAML, resolves env vars, and auto-selects the right subclass via Pydantic discriminated union. Optional `--timeout`/`--mode` overrides via `config.with_updates()`.
- `eval`: Not yet implemented (raises `CLIError` with a tracking-issue link)
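The `--dataset` format can be illustrated with a small parser. `parse_dataset_spec` and its return shape are hypothetical and exist only to show the grammar; the project's actual parsing lives elsewhere:

```python
def parse_dataset_spec(spec: str) -> dict:
    """Parse '[perf|acc:]<path>[,key=value...]' (illustrative only)."""
    role = "perf"  # default role when no prefix is given
    if spec.startswith(("perf:", "acc:")):
        role, spec = spec.split(":", 1)
    path, *pairs = spec.split(",")
    options = dict(pair.split("=", 1) for pair in pairs)
    return {"role": role, "path": path, "options": options}

parsed = parse_dataset_spec("acc:data.csv,samples=500,parser.prompt=article")
assert parsed == {
    "role": "acc",
    "path": "data.csv",
    "options": {"samples": "500", "parser.prompt": "article"},
}
```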
Both CLI and YAML produce the same subclass via Pydantic discriminated union on `type`:

- CLI `offline`/`online`: cyclopts → `OfflineBenchmarkConfig`/`OnlineBenchmarkConfig` → `with_updates(datasets)` → `run_benchmark`
- YAML `from-config`: `from_yaml_file(path)` → discriminated union → same subclass → `run_benchmark`

`OfflineBenchmarkConfig` and `OnlineBenchmarkConfig` (in `config/schema.py`) inherit `BenchmarkConfig`:

- `type`: locked via `Literal[TestType.OFFLINE]` / `Literal[TestType.ONLINE]`
- `settings`: `OfflineSettings` (hides load pattern) / `OnlineSettings`
- `submission_ref`, `benchmark_mode`: `show=False` on base class
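Pydantic's discriminated union selects the subclass by the `type` field. Conceptually it behaves like this stdlib sketch, where `OfflineSketch`/`OnlineSketch` are hypothetical stand-ins, not the real `config/schema.py` models:

```python
from dataclasses import dataclass

@dataclass
class OfflineSketch:  # stand-in for OfflineBenchmarkConfig
    type: str = "offline"

@dataclass
class OnlineSketch:   # stand-in for OnlineBenchmarkConfig
    type: str = "online"
    target_qps: float = 100.0

# The discriminator maps the `type` value to a concrete subclass:
_SUBCLASSES = {"offline": OfflineSketch, "online": OnlineSketch}

def from_dict(raw: dict):
    cls = _SUBCLASSES[raw["type"]]
    return cls(**raw)

cfg = from_dict({"type": "online", "target_qps": 50.0})
assert isinstance(cfg, OnlineSketch)
assert cfg.target_qps == 50.0
```

In the real code, Pydantic does this dispatch during validation, so both CLI args and YAML land on the identical config type.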
Validation is layered:

- Field-level (Pydantic): `Field(ge=0)` on durations, `Field(ge=-1)` on workers, `Literal` on `benchmark_mode`
- Field validators: `workers != 0` check
- Model validator (`_resolve_and_validate`): streaming AUTO resolution, model name from `submission_ref`, load pattern vs test type, cross-field duration check, duplicate datasets
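The layering can be emulated with a stdlib sketch. `OnlineSettingsSketch` is hypothetical, not the real schema; in the project, Pydantic performs the field-level checks via `Field(ge=...)` and the cross-field checks in validators:

```python
from dataclasses import dataclass

@dataclass
class OnlineSettingsSketch:  # hypothetical stand-in for the real settings model
    duration_s: float = 60.0
    workers: int = -1  # -1 meaning "auto" is an assumption for illustration

    def __post_init__(self):
        # Layer 1: field-level constraints (Pydantic: Field(ge=0), Field(ge=-1))
        if self.duration_s < 0:
            raise ValueError("duration_s must be >= 0")
        if self.workers < -1:
            raise ValueError("workers must be >= -1")
        # Layer 2: field-validator analogue (workers != 0 check)
        if self.workers == 0:
            raise ValueError("workers must not be 0")

OnlineSettingsSketch()  # defaults pass all layers
caught = ""
try:
    OnlineSettingsSketch(workers=0)
except ValueError as e:
    caught = str(e)
assert caught == "workers must not be 0"
```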
- `max_throughput`: Offline burst (all queries at t=0)
- `poisson`: Fixed QPS with Poisson arrival distribution
- `concurrency`: Fixed concurrent requests
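A Poisson arrival process at a target QPS means exponentially distributed inter-arrival gaps. A minimal stdlib sketch of how such a schedule can be generated (illustrative only; `poisson_arrival_times` is not the project's `Scheduler` API):

```python
import random

def poisson_arrival_times(target_qps: float, n: int, seed: int = 0) -> list[float]:
    """Generate n arrival timestamps whose gaps are drawn from an
    exponential distribution with mean 1/target_qps (a Poisson process)."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(target_qps)  # mean gap = 1 / target_qps
        times.append(t)
    return times

times = poisson_arrival_times(target_qps=100.0, n=1000)
mean_gap = times[-1] / len(times)
assert 0.005 < mean_gap < 0.02  # roughly the expected 10 ms mean gap at 100 QPS
```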
src/inference_endpoint/
├── main.py # Entry point + CLI app: cyclopts app, commands, error formatter, run()
├── exceptions.py # CLIError, ExecutionError, InputValidationError, SetupError
├── commands/ # Command execution logic
│ ├── benchmark/
│ │ ├── __init__.py
│ │ ├── cli.py # benchmark_app: offline, online, from-config subcommands
│ │ └── execute.py # Phased execution: setup/run_threaded/finalize + BenchmarkContext
│ ├── probe.py # ProbeConfig + execute_probe()
│ ├── info.py # execute_info()
│ ├── validate.py # execute_validate()
│ └── init.py # execute_init()
├── core/types.py # APIType, Query, QueryResult, StreamChunk, QueryStatus (msgspec Structs)
├── load_generator/
│ ├── session.py # BenchmarkSession - top-level orchestrator
│ ├── load_generator.py # LoadGenerator, SchedulerBasedLoadGenerator
│ ├── scheduler.py # Scheduler, timing strategies
│ ├── sample.py # SampleEventHandler
│ └── events.py # SessionEvent, SampleEvent enums
├── endpoint_client/
│ ├── http_client.py # HTTPEndpointClient - main client interface
│ ├── worker.py # Worker process implementation
│ ├── worker_manager.py # Manages worker lifecycle
│ ├── http.py # ConnectionPool, HttpRequestTemplate, raw HTTP
│ ├── http_sample_issuer.py # Bridges load generator to HTTP client
│ ├── config.py # HTTPClientConfig (single Pydantic model — CLI/YAML/runtime)
│ ├── adapter_protocol.py # HttpRequestAdapter protocol
│ ├── accumulator_protocol.py # Response accumulation protocol
│ ├── cpu_affinity.py # CPU pinning
│ └── utils.py # Port range helpers
├── async_utils/
│ ├── loop_manager.py # LoopManager (uvloop + eager_task_factory)
│ ├── runner.py # run_async() — uvloop + eager_task_factory entry point for CLI commands
│ ├── event_publisher.py # Async event pub/sub
│ ├── services/
│ │ ├── event_logger/ # EventLoggerService: writes EventRecords to JSONL/SQLite
│ │ └── metrics_aggregator/ # MetricsAggregatorService: real-time metrics (TTFT, TPOT, ISL, OSL)
│ └── transport/ # ZMQ-based IPC transport layer
│ ├── protocol.py # Transport protocols + TransportConfig base
│ ├── record.py # Transport records
│ └── zmq/ # ZMQ implementation (context, pubsub, transport, ZMQTransportConfig)
├── dataset_manager/
│ ├── dataset.py # Dataset base class, DatasetFormat enum
│ ├── factory.py # Dataset factory
│ ├── transforms.py # ColumnRemap and other transforms
│ └── predefined/ # Built-in datasets (aime25, cnndailymail, gpqa, etc.)
├── metrics/
│ ├── recorder.py # EventRecorder (SQLite-backed)
│ ├── reporter.py # MetricsReporter (aggregation)
│ └── metric.py # Metric types (Throughput, etc.)
├── config/
│ ├── schema.py # Single source of truth: Pydantic models + cyclopts annotations
│ ├── runtime_settings.py # RuntimeSettings dataclass
│ ├── ruleset_base.py # BenchmarkSuiteRuleset base
│ ├── ruleset_registry.py # Ruleset registry
│ ├── user_config.py # UserConfig dataclass for ruleset user overrides
│ ├── rulesets/mlcommons/ # MLCommons-specific rules, datasets, models
│ └── templates/ # YAML config templates (offline, online, eval, etc.)
├── openai/ # OpenAI-compatible API types and adapters
│ ├── types.py # OpenAI response types
│ ├── openai_adapter.py # Request/response adapter
│ ├── openai_msgspec_adapter.py # msgspec-based adapter (fast path)
│ ├── accumulator.py # Streaming response accumulator
│ └── harmony.py # openai_harmony integration
├── sglang/ # SGLang API adapter
├── evaluation/ # Accuracy evaluation (extractor, scoring, livecodebench)
├── plugins/ # Plugin system
├── profiling/ # line_profiler integration, pytest plugin
├── testing/
│ ├── echo_server.py # Local echo server for testing
│ ├── max_throughput_server.py # Max throughput test server
│ └── docker_server.py # Docker-based server management
└── utils/
├── logging.py # Logging setup
├── version.py # Version info
├── dataset_utils.py # Dataset utilities
└── benchmark_httpclient.py # HTTP client throughput benchmarking utility
tests/
├── conftest.py # Shared fixtures (echo/oracle servers, datasets, settings)
├── test_helpers.py # Test utility functions
├── unit/ # Unit tests (mirror src/ structure)
├── integration/ # Integration tests (real servers, end-to-end)
│ ├── endpoint_client/ # HTTP client integration tests
│ └── commands/ # CLI command integration tests
├── performance/ # Performance benchmarks (pytest-benchmark)
└── datasets/ # Test data (dummy_1k.jsonl, squad_pruned/)
- Formatter/Linter: `ruff` (line-length 88, target Python 3.12)
- Type checking: `mypy` (via pre-commit)
- Formatting: `ruff-format` (double quotes, space indent)
- License headers: required on all Python files (enforced by the pre-commit hook `scripts/add_license_header.py`)
- Conventional commits: `feat:`, `fix:`, `docs:`, `test:`, `chore:`
All of these hooks run automatically on commit: trailing-whitespace, end-of-file-fixer, check-yaml, check-merge-conflict, debug-statements, ruff (lint + autofix), ruff-format, mypy, prettier (YAML/JSON/Markdown), license header enforcement.
Always run `pre-commit run --all-files` before committing.
See Development Guide for full setup and workflow details.
- Core types (`Query`, `QueryResult`, `StreamChunk`): `msgspec.Struct` with `frozen=True`, `array_like=True`, `gc=False`, `omit_defaults=True`
- Config types: `pydantic.BaseModel` for validation
- Enums: `str, Enum` pattern for serializable enums (e.g., `LoadPatternType`, `APIType`)
- Serialization: `msgspec.json` for the hot path (ZMQ transport), `pydantic` for config
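The `str, Enum` pattern keeps enums directly comparable to strings and JSON-serializable with no custom encoder. A sketch using the load-pattern values from above (member names are assumed for illustration and may not match the real `LoadPatternType`):

```python
import json
from enum import Enum

class LoadPatternType(str, Enum):
    MAX_THROUGHPUT = "max_throughput"
    POISSON = "poisson"
    CONCURRENCY = "concurrency"

# Subclassing str makes members behave as their string values:
assert LoadPatternType.POISSON == "poisson"
assert json.dumps({"pattern": LoadPatternType.POISSON}) == '{"pattern": "poisson"}'
# Round-trips cleanly from config strings:
assert LoadPatternType("concurrency") is LoadPatternType.CONCURRENCY
```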
Coverage target: >90% for all new code.
Test markers:

```python
@pytest.mark.unit           # Unit tests
@pytest.mark.integration    # Integration tests (may need servers)
@pytest.mark.slow           # Skip in CI
@pytest.mark.performance    # No timeout, skip in CI
@pytest.mark.run_explicitly # Only run when explicitly selected
```

Async tests: use `@pytest.mark.asyncio(mode="strict")` — the project uses strict asyncio mode.
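A marked test might look like the following sketch. The test names are hypothetical and shown only to illustrate the marker and strict-asyncio conventions:

```python
import inspect
import pytest

@pytest.mark.unit
def test_addition():  # hypothetical unit test — every test carries a marker
    assert 1 + 1 == 2

@pytest.mark.integration
@pytest.mark.asyncio(mode="strict")
async def test_echo_roundtrip():  # hypothetical; a real one would use a server fixture
    ...

# Marks attach metadata; the functions themselves are unchanged:
assert inspect.iscoroutinefunction(test_echo_roundtrip)
```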
Key fixtures (defined in `tests/conftest.py`):

- `mock_http_echo_server` — real HTTP echo server on dynamic port
- `mock_http_oracle_server` — dataset-driven response server
- `dummy_dataset` — in-memory test dataset
- `hf_squad_dataset` — HuggingFace squad dataset
- `events_db` — pre-populated SQLite events database
- `max_throughput_runtime_settings`, `poisson_runtime_settings`, `concurrency_runtime_settings` — preset configs
- `clean_sample_event_hooks` — ensures event hooks are cleared between tests
Test data: `tests/datasets/dummy_1k.jsonl` (1000 samples), `tests/datasets/squad_pruned/`
These apply especially to code in the hot path (load generator, endpoint client, transport):
- No `match` statements in hot paths — use dict dispatch instead
- Use `dataclass(slots=True)` for frequently instantiated classes (or `msgspec.Struct`)
- Prefer generators over list comprehensions for large datasets
- Minimize async suspends in hot-path code
- Use `msgspec` over `json`/`pydantic` for serialization in the data path
- Connection pooling: the HTTP client uses a custom `ConnectionPool` with the `httptools` parser — not `aiohttp`/`requests`
- Event loop: `uvloop` with `eager_task_factory` via `LoopManager`
- IPC: ZMQ-based transport between main process and worker processes
| Package | Purpose |
|---|---|
| `uvloop` | Performance-optimized event loop |
| `httptools` | Fast HTTP parser for custom connection pool |
| `msgspec` | Fast serialization for core types and ZMQ transport |
| `pyzmq` | ZMQ IPC between main process and workers |
| `pydantic` | Configuration validation |
| `cyclopts` | CLI framework — auto-generates flags from Pydantic |
| `duckdb` | Data aggregation |
| `transformers` | Tokenization for OSL reporting |
- `src/inference_endpoint/openai/openai_types_gen.py` — auto-generated, excluded from ruff/pre-commit
- `src/inference_endpoint/openai/openapi.yaml` — OpenAI API spec, excluded from pre-commit
This file is the source of truth for AI agents working in this repo. If it is stale or wrong, every AI-assisted session starts from a broken foundation.
Update AGENTS.md as part of any PR that includes a significant refactor, meaning:
- Moved, renamed, or deleted modules/packages — update the Code Organization tree and Key Components table
- Changed architectural boundaries (e.g., new IPC transport, replaced pydantic with msgspec for config) — update Architecture and Data Types sections
- Added or removed CLI commands/subcommands — update CLI Modes and Common Commands
- Changed test infrastructure (new fixtures, changed markers, new test directories) — update Testing section
- Added or removed key dependencies — update Key Dependencies table
- Changed build/tooling (new pre-commit hooks, changed ruff config, new CI steps) — update docs/DEVELOPMENT.md
- Changed hot-path patterns (new transport, changed serialization, new performance constraints) — update Performance Guidelines
- Treat AGENTS.md changes as part of the refactor itself — include them in the same PR, not as a follow-up
- Keep descriptions factual and concise — what exists and where, not aspirational design docs
- If you add a new top-level module under `src/inference_endpoint/`, add it to both the Key Components table and the Code Organization tree
- If you remove something, remove it from AGENTS.md — stale entries are worse than missing ones
When reviewing PRs with significant structural changes, verify:
- Code Organization tree matches the actual directory structure post-merge
- Key Components table reflects any moved/renamed/new components
- No references to deleted files, classes, or modules remain
Known failure modes when AI tools generate code for this project. Reference these during code review of AI-assisted PRs.
- Inventing abstractions that don't exist: AI may introduce new base classes, registries, or factory patterns that don't match the existing architecture. This project uses concrete types and explicit wiring — check that new code follows existing patterns rather than imposing unfamiliar frameworks.
- Misunderstanding the multi-process boundary: The endpoint client uses separate worker processes (not threads) communicating over ZMQ. AI-generated code often assumes shared memory, passes unpicklable objects across processes, or adds synchronization primitives (locks, semaphores) that don't work cross-process.
- Confusing hot-path vs. cold-path: AI tends to treat all code uniformly. Code in `load_generator/`, `endpoint_client/worker.py`, and `async_utils/transport/` is latency-critical. Pydantic validation, excessive logging, or try/except blocks in these paths will degrade throughput.
- Using the wrong serialization library: This project uses `msgspec` for hot-path data and `pydantic` for config. AI frequently defaults to `json.dumps`/`json.loads`, `dataclasses`, or applies pydantic where msgspec is required. If the type is a `msgspec.Struct`, encode/decode with `msgspec.json`, not stdlib json.
- Breaking msgspec Struct constraints: Core types (`Query`, `QueryResult`, `StreamChunk`) are `frozen=True` with `array_like=True`. AI may try to mutate fields directly (must use `force_setattr`), add mutable default fields, or assume dict-like serialization when the wire format is array-based.
- Adding `dataclass` where `msgspec.Struct` is expected: If neighboring types use msgspec, new types in the same module should too. AI defaults to `@dataclass` out of habit.
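The frozen-Struct constraint behaves much like a frozen dataclass, so it can be illustrated with a stdlib analogue. `QuerySketch` is a hypothetical stand-in for the real `Query`; for actual msgspec Structs the deliberate escape hatch is `msgspec.structs.force_setattr`, not `object.__setattr__`:

```python
from dataclasses import FrozenInstanceError, dataclass

@dataclass(frozen=True)  # stdlib analogue of a frozen msgspec.Struct
class QuerySketch:
    query_id: int
    prompt: str

q = QuerySketch(1, "hello")

mutation_blocked = False
try:
    q.prompt = "mutated"  # direct mutation is rejected on frozen types
except FrozenInstanceError:
    mutation_blocked = True
assert mutation_blocked

# The explicit escape hatch (analogous to msgspec.structs.force_setattr):
object.__setattr__(q, "prompt", "mutated")
assert q.prompt == "mutated"
```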
- Mixing sync and async incorrectly: AI may `await` in a sync context, call blocking I/O inside an `async def`, or use `asyncio.run()` when a loop is already running (this project manages its own loops via `LoopManager`).
- Creating new event loops: Workers and the main process already have managed event loops. AI may call `asyncio.new_event_loop()` or `asyncio.run()`, which conflicts with the existing `LoopManager`/`uvloop` setup.
- Ignoring `eager_task_factory`: The project uses Python 3.12's `eager_task_factory` for performance. AI-generated code that creates coroutines expecting lazy scheduling will behave differently than expected.
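The `asyncio.run()` pitfall can be made concrete with a small helper. This is an illustrative sketch, not part of `LoopManager`; `run_or_schedule` and `fetch` are hypothetical names:

```python
import asyncio

async def fetch():  # placeholder coroutine standing in for real async work
    await asyncio.sleep(0)
    return 42

def run_or_schedule(coro):
    """Run coro at top level, or schedule it if a loop is already running.

    Calling asyncio.run() inside a running loop raises RuntimeError, so
    check for a running loop first instead of blindly creating one.
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(coro)        # no loop running: safe to create one
    return asyncio.ensure_future(coro)  # loop running: schedule, never nest run()

result = run_or_schedule(fetch())
assert result == 42  # called from sync top level, so asyncio.run path was taken
```

In this codebase, the correct fix is usually to route work through the managed loop rather than to create a new one.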
- Generating mock-heavy tests for integration scenarios: This project has real echo/oracle server fixtures. AI tends to mock HTTP calls even when the `mock_http_echo_server` or `mock_http_oracle_server` fixtures exist and should be used.
- Missing test markers: Every test function needs `@pytest.mark.unit`, `@pytest.mark.integration`, or another marker. AI-generated tests almost always omit markers, which breaks CI filtering.
- Wrong asyncio mode: Tests must use `@pytest.mark.asyncio(mode="strict")` — AI often writes bare `@pytest.mark.asyncio` or forgets it entirely, causing silent test skips or failures.
- Fabricating fixture names: AI may invent fixtures that don't exist in `conftest.py`. Always check that referenced fixtures actually exist before using them.
- Missing license headers: Every Python file needs the Apache 2.0 SPDX header. AI never generates these — the pre-commit hook will add them, but be aware of this when reviewing diffs.
- Importing removed or renamed modules: After refactors, AI (working from stale context) may import old module paths. Always verify imports resolve to actual files.
- Over-documenting: AI generates verbose docstrings, inline comments explaining obvious code, and type annotations on trivial variables. This project prefers minimal comments — only where the why isn't obvious from the code.
- Adding backwards-compatibility shims: If something was renamed or removed, AI may add re-exports, aliases, or deprecation wrappers. In this project, just delete the old thing and update all call sites.
- Adding new dependencies without justification: AI may `pip install` or add imports for packages not in `pyproject.toml`. Any new runtime, dev, or test dependency must be justified, added to the correct optional group, and pinned to an exact version (`==`). Use `uv add <package>==<version>` so that `pyproject.toml` and `uv.lock` are updated atomically, then run `uv run pip-audit` (included in the `dev` extras) to verify the package has no known vulnerabilities. Note: `[build-system] requires` is also pinned to exact versions for reproducibility.
- Using `requests`/`aiohttp` for HTTP: This project has its own HTTP client (`endpoint_client/http.py`) using `httptools`. AI defaults to `requests` or `aiohttp` — these should not appear in production code (test dependencies are fine).