Guidelines for AI coding agents working in the MLPerf Inference Endpoint Benchmarking System repository.
High-performance benchmarking tool for LLM inference endpoints targeting 50k+ QPS. Python 3.12+, Apache 2.0 licensed.
```bash
# Development setup
python3.12 -m venv venv && source venv/bin/activate
pip install -e ".[dev,test]"
pre-commit install

# Testing
pytest                                        # All tests (excludes slow/performance)
pytest -m unit                                # Unit tests only
pytest -m integration                         # Integration tests only
pytest --cov=src --cov-report=html            # With coverage
pytest -xvs tests/unit/path/to/test_file.py   # Single test file

# Code quality (run before commits)
pre-commit run --all-files

# Local testing with echo server
python -m inference_endpoint.testing.echo_server --port 8765
inference-endpoint probe --endpoints http://localhost:8765 --model test-model

# CLI usage
inference-endpoint benchmark offline --endpoints URL --model NAME --dataset PATH
inference-endpoint benchmark online --endpoints URL --model NAME --dataset PATH --load-pattern poisson --target-qps 100
inference-endpoint benchmark from-config --config config.yaml
```

```
Dataset Manager --> Load Generator --> Endpoint Client --> External Endpoint
                           |
              Metrics Collector (EventRecorder + MetricsReporter)
```
| Component | Location | Purpose |
|---|---|---|
| Load Generator | `src/inference_endpoint/load_generator/` | Central orchestrator: `BenchmarkSession` owns the lifecycle, `Scheduler` controls timing, `LoadGenerator` issues queries |
| Endpoint Client | `src/inference_endpoint/endpoint_client/` | Multi-process HTTP workers communicating via ZMQ IPC. `HTTPEndpointClient` is the main entry point |
| Dataset Manager | `src/inference_endpoint/dataset_manager/` | Loads JSONL, HuggingFace, CSV, JSON, Parquet datasets. `Dataset` base class with `load_sample()`/`num_samples()` interface |
| Metrics | `src/inference_endpoint/metrics/` | `EventRecorder` writes to SQLite, `MetricsReporter` reads and aggregates (QPS, latency, TTFT, TPOT) |
| Config | `src/inference_endpoint/config/`, `endpoint_client/config.py` | Pydantic-based YAML schema (`schema.py`), `HTTPClientConfig` (single Pydantic model for CLI/YAML/runtime), `RuntimeSettings` |
| CLI | `src/inference_endpoint/main.py`, `commands/benchmark/cli.py` | cyclopts-based, auto-generated from `schema.py` and `HTTPClientConfig` Pydantic models. Flat shorthands via `cyclopts.Parameter(alias=...)` |
| Async Utils | `src/inference_endpoint/async_utils/` | `LoopManager` (uvloop + `eager_task_factory`), ZMQ transport layer, event publisher |
| OpenAI/SGLang | `src/inference_endpoint/openai/`, `sglang/` | Protocol adapters and response accumulators for different API formats |
Multi-process, event-loop design optimized for throughput:
- `BenchmarkSession` thread schedules samples with busy-wait timing
- Worker processes (N instances) handle HTTP requests via ZMQ IPC
- Uses `eager_task_factory` and `uvloop` for minimal async overhead
- CPU affinity support (`cpu_affinity.py`) for performance tuning
- Custom HTTP connection pooling (`http.py`) with `httptools` parser
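The busy-wait timing can be sketched like this (an illustrative stand-in, not the actual `Scheduler`; `busy_wait_until` and `schedule` are invented names):

```python
import time

def busy_wait_until(deadline_ns: int, spin_threshold_ns: int = 200_000) -> None:
    """Sleep coarsely, then busy-spin the final stretch for a precise wakeup."""
    while True:
        remaining = deadline_ns - time.perf_counter_ns()
        if remaining <= 0:
            return
        if remaining > spin_threshold_ns:
            # Coarse sleep, leaving spin_threshold_ns of slack for the spin phase.
            time.sleep((remaining - spin_threshold_ns) / 1e9)
        # else: spin until the deadline passes

def schedule(offsets_ns: list[int]) -> list[int]:
    """Issue one sample per offset; returns the actual issue times (ns from start)."""
    start = time.perf_counter_ns()
    issued = []
    for off in offsets_ns:
        busy_wait_until(start + off)
        issued.append(time.perf_counter_ns() - start)
    return issued
```

The coarse-sleep/spin split is the usual trade-off: plain `time.sleep()` alone has millisecond-scale jitter, while spinning the whole wait would burn a core.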
CLI is auto-generated from `config/schema.py` Pydantic models via cyclopts. Fields annotated with `cyclopts.Parameter(alias="--flag")` get flat shorthands; all other fields get auto-generated dotted flags (kebab-case).
- CLI mode (`offline`/`online`): cyclopts constructs `OfflineBenchmarkConfig`/`OnlineBenchmarkConfig` (subclasses in `config/schema.py`) directly from CLI args. Type locked via `Literal`. `--dataset` is repeatable with TOML-style format `[perf|acc:]<path>[,key=value...]` (e.g. `--dataset data.csv,samples=500,parser.prompt=article`). Full accuracy support via `accuracy_config.eval_method=pass_at_1` etc.
- YAML mode (`from-config`): `BenchmarkConfig.from_yaml_file()` loads YAML, resolves env vars, and auto-selects the right subclass via Pydantic discriminated union. Optional `--timeout`/`--mode` overrides via `config.with_updates()`.
- eval: Not yet implemented (raises `CLIError` with a tracking issue link)
Both CLI and YAML produce the same subclass via Pydantic discriminated union on `type`:
- CLI `offline`/`online`: cyclopts → `OfflineBenchmarkConfig`/`OnlineBenchmarkConfig` → `with_updates(datasets)` → `run_benchmark`
- YAML `from-config`: `from_yaml_file(path)` → discriminated union → same subclass → `run_benchmark`
`OfflineBenchmarkConfig` and `OnlineBenchmarkConfig` (in `config/schema.py`) inherit `BenchmarkConfig`:
- `type`: locked via `Literal[TestType.OFFLINE]` / `Literal[TestType.ONLINE]`
- `settings`: `OfflineSettings` (hides load pattern) / `OnlineSettings`
- `submission_ref`, `benchmark_mode`: `show=False` on base class
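A minimal sketch of the discriminated-union pattern described above, using toy models rather than the real `config/schema.py` classes:

```python
from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field

class OfflineCfg(BaseModel):
    type: Literal["offline"] = "offline"   # locks the subclass to this tag
    duration_s: int = Field(ge=0, default=60)

class OnlineCfg(BaseModel):
    type: Literal["online"] = "online"
    target_qps: float = Field(gt=0, default=100.0)

# Pydantic selects the subclass from the "type" field at validation time.
AnyCfg = Annotated[Union[OfflineCfg, OnlineCfg], Field(discriminator="type")]

class Root(BaseModel):
    benchmark: AnyCfg

cfg = Root.model_validate({"benchmark": {"type": "online", "target_qps": 50}})
assert isinstance(cfg.benchmark, OnlineCfg)
```

The same mechanism works whether the dict comes from CLI args or parsed YAML, which is why both entry paths converge on one subclass.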
Validation is layered:
- Field-level (Pydantic): `Field(ge=0)` on durations, `Field(ge=-1)` on workers, `Literal` on `benchmark_mode`
- Field validators: `workers != 0` check
- Model validator (`_resolve_and_validate`): streaming AUTO resolution, model name from `submission_ref`, load pattern vs test type, cross-field duration check, duplicate datasets
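The three layers can be illustrated with a toy model (field names and checks are invented for illustration; the real rules live in `config/schema.py`):

```python
from pydantic import BaseModel, Field, ValidationError, field_validator, model_validator

class Settings(BaseModel):
    # Layer 1: field-level constraints expressed directly in Field()
    min_duration_s: int = Field(ge=0, default=0)
    max_duration_s: int = Field(ge=0, default=60)
    workers: int = Field(ge=-1, default=-1)

    # Layer 2: a single-field rule Field() alone can't express
    @field_validator("workers")
    @classmethod
    def _workers_nonzero(cls, v: int) -> int:
        if v == 0:
            raise ValueError("workers must be -1 (auto) or positive")
        return v

    # Layer 3: cross-field check, runs after all fields validate
    @model_validator(mode="after")
    def _check_durations(self) -> "Settings":
        if self.min_duration_s > self.max_duration_s:
            raise ValueError("min_duration_s exceeds max_duration_s")
        return self
```

Each layer fails with a `ValidationError` that names the offending field, so errors surface at config-load time rather than mid-benchmark.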
- `max_throughput`: Offline burst (all queries at t=0)
- `poisson`: Fixed QPS with Poisson arrival distribution
- `concurrency`: Fixed concurrent requests
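The `poisson` pattern draws exponential inter-arrival gaps, which is easy to sketch (an illustrative helper, not the project's scheduler code):

```python
import random

def poisson_offsets_s(target_qps: float, n: int, seed: int = 0) -> list[float]:
    """Cumulative send times: exponential inter-arrival gaps give Poisson arrivals."""
    rng = random.Random(seed)
    t, out = 0.0, []
    for _ in range(n):
        t += rng.expovariate(target_qps)  # mean gap = 1 / QPS
        out.append(t)
    return out

offsets = poisson_offsets_s(target_qps=100.0, n=1000)
```

Over many samples the empirical rate (`n / offsets[-1]`) converges to the target QPS, while individual gaps stay bursty, which is what makes Poisson load a better model of real traffic than evenly spaced requests.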
```
src/inference_endpoint/
├── main.py                  # Entry point + CLI app: cyclopts app, commands, error formatter, run()
├── exceptions.py            # CLIError, ExecutionError, InputValidationError, SetupError
├── commands/                # Command execution logic
│   ├── benchmark/
│   │   ├── __init__.py
│   │   ├── cli.py           # benchmark_app: offline, online, from-config subcommands
│   │   └── execute.py       # Phased execution: setup/run_threaded/finalize + BenchmarkContext
│   ├── probe.py             # ProbeConfig + execute_probe()
│   ├── info.py              # execute_info()
│   ├── validate.py          # execute_validate()
│   └── init.py              # execute_init()
├── core/types.py            # APIType, Query, QueryResult, StreamChunk, QueryStatus (msgspec Structs)
├── load_generator/
│   ├── session.py           # BenchmarkSession - top-level orchestrator
│   ├── load_generator.py    # LoadGenerator, SchedulerBasedLoadGenerator
│   ├── scheduler.py         # Scheduler, timing strategies
│   ├── sample.py            # SampleEventHandler
│   └── events.py            # SessionEvent, SampleEvent enums
├── endpoint_client/
│   ├── http_client.py       # HTTPEndpointClient - main client interface
│   ├── worker.py            # Worker process implementation
│   ├── worker_manager.py    # Manages worker lifecycle
│   ├── http.py              # ConnectionPool, HttpRequestTemplate, raw HTTP
│   ├── http_sample_issuer.py # Bridges load generator to HTTP client
│   ├── config.py            # HTTPClientConfig (single Pydantic model — CLI/YAML/runtime)
│   ├── adapter_protocol.py  # HttpRequestAdapter protocol
│   ├── accumulator_protocol.py # Response accumulation protocol
│   ├── cpu_affinity.py      # CPU pinning
│   └── utils.py             # Port range helpers
├── async_utils/
│   ├── loop_manager.py      # LoopManager (uvloop + eager_task_factory)
│   ├── runner.py            # run_async() — uvloop + eager_task_factory entry point for CLI commands
│   ├── event_publisher.py   # Async event pub/sub
│   ├── services/
│   │   ├── event_logger/    # EventLoggerService: writes EventRecords to JSONL/SQLite
│   │   └── metrics_aggregator/ # MetricsAggregatorService: real-time metrics (TTFT, TPOT, ISL, OSL)
│   └── transport/           # ZMQ-based IPC transport layer
│       ├── protocol.py      # Transport protocols + TransportConfig base
│       ├── record.py        # Transport records
│       └── zmq/             # ZMQ implementation (context, pubsub, transport, ZMQTransportConfig)
├── dataset_manager/
│   ├── dataset.py           # Dataset base class, DatasetFormat enum
│   ├── factory.py           # Dataset factory
│   ├── transforms.py        # ColumnRemap and other transforms
│   └── predefined/          # Built-in datasets (aime25, cnndailymail, gpqa, etc.)
├── metrics/
│   ├── recorder.py          # EventRecorder (SQLite-backed)
│   ├── reporter.py          # MetricsReporter (aggregation)
│   └── metric.py            # Metric types (Throughput, etc.)
├── config/
│   ├── schema.py            # Single source of truth: Pydantic models + cyclopts annotations
│   ├── runtime_settings.py  # RuntimeSettings dataclass
│   ├── ruleset_base.py      # BenchmarkSuiteRuleset base
│   ├── ruleset_registry.py  # Ruleset registry
│   ├── user_config.py       # UserConfig dataclass for ruleset user overrides
│   ├── rulesets/mlcommons/  # MLCommons-specific rules, datasets, models
│   └── templates/           # YAML config templates (_template.yaml minimal, _template_full.yaml all defaults)
├── openai/                  # OpenAI-compatible API types and adapters
│   ├── types.py             # OpenAI response types
│   ├── openai_adapter.py    # Request/response adapter
│   ├── openai_msgspec_adapter.py # msgspec-based adapter (fast path)
│   ├── accumulator.py       # Streaming response accumulator
│   └── harmony.py           # openai_harmony integration
├── sglang/                  # SGLang API adapter
├── evaluation/              # Accuracy evaluation (extractor, scoring, livecodebench)
├── plugins/                 # Plugin system
├── profiling/               # line_profiler integration, pytest plugin
├── testing/
│   ├── echo_server.py       # Local echo server for testing
│   ├── max_throughput_server.py # Max throughput test server
│   └── docker_server.py     # Docker-based server management
└── utils/
    ├── logging.py           # Logging setup
    ├── version.py           # Version info
    ├── dataset_utils.py     # Dataset utilities
    └── benchmark_httpclient.py # HTTP client throughput benchmarking utility
```
```
tests/
├── conftest.py              # Shared fixtures (echo/oracle servers, datasets, settings)
├── test_helpers.py          # Test utility functions
├── unit/                    # Unit tests (mirror src/ structure)
├── integration/             # Integration tests (real servers, end-to-end)
│   ├── endpoint_client/     # HTTP client integration tests
│   └── commands/            # CLI command integration tests
├── performance/             # Performance benchmarks (pytest-benchmark)
└── datasets/                # Test data (dummy_1k.jsonl, squad_pruned/)
```
- Formatter/Linter: `ruff` (line-length 88, target Python 3.12)
- Type checking: `mypy` (via pre-commit)
- Formatting: `ruff-format` (double quotes, space indent)
- License headers: Required on all Python files (enforced by pre-commit hook `scripts/add_license_header.py`)
- Conventional commits: `feat:`, `fix:`, `docs:`, `test:`, `chore:`
All of these run automatically on commit:
- trailing-whitespace, end-of-file-fixer, check-yaml, check-merge-conflict, debug-statements
- `ruff` (lint + autofix) and `ruff-format`
- `mypy` type checking
- `prettier` for YAML/JSON/Markdown
- License header enforcement
- `regenerate-templates`: auto-regenerates YAML config templates from schema defaults when `schema.py`, `config.py`, or `regenerate_templates.py` change
Always run `pre-commit run --all-files` before committing.
See Development Guide for full setup and workflow details.
- Core types (`Query`, `QueryResult`, `StreamChunk`): `msgspec.Struct` with `frozen=True`, `array_like=True`, `gc=False`, `omit_defaults=True`
- Config types: `pydantic.BaseModel` for validation
- Enums: `str, Enum` pattern for serializable enums (e.g., `LoadPatternType`, `APIType`)
- Serialization: `msgspec.json` for hot-path (ZMQ transport), `pydantic` for config
Coverage target: >90% for all new code.
Test markers:

```python
@pytest.mark.unit           # Unit tests
@pytest.mark.integration    # Integration tests (may need servers)
@pytest.mark.slow           # Skip in CI
@pytest.mark.performance    # No timeout, skip in CI
@pytest.mark.run_explicitly # Only run when explicitly selected
```

Async tests: Use `@pytest.mark.asyncio(mode="strict")` — the project uses strict asyncio mode.
Key fixtures (defined in `tests/conftest.py`):
- `mock_http_echo_server` — real HTTP echo server on dynamic port
- `mock_http_oracle_server` — dataset-driven response server
- `dummy_dataset` — in-memory test dataset
- `hf_squad_dataset` — HuggingFace squad dataset
- `events_db` — pre-populated SQLite events database
- `max_throughput_runtime_settings`, `poisson_runtime_settings`, `concurrency_runtime_settings` — preset configs
- `clean_sample_event_hooks` — ensures event hooks are cleared between tests

Test data: `tests/datasets/dummy_1k.jsonl` (1000 samples), `tests/datasets/squad_pruned/`
These apply especially to code in the hot path (load generator, endpoint client, transport):
- No `match` statements in hot paths — use dict dispatch instead
- Use `dataclass(slots=True)` for frequently instantiated classes (or `msgspec.Struct`)
- Prefer generators over list comprehensions for large datasets
- Minimize async suspends in hot path code
- Use `msgspec` over `json`/`pydantic` for serialization in the data path
- Connection pooling: The HTTP client uses custom `ConnectionPool` with `httptools` parser — not `aiohttp`/`requests`
- Event loop: `uvloop` with `eager_task_factory` via `LoopManager`
- IPC: ZMQ-based transport between main process and worker processes
| Package | Purpose |
|---|---|
| `uvloop` | Performance-optimized event loop |
| `httptools` | Fast HTTP parser for custom connection pool |
| `msgspec` | Fast serialization for core types and ZMQ transport |
| `pyzmq` | ZMQ IPC between main process and workers |
| `pydantic` | Configuration validation |
| `cyclopts` | CLI framework — auto-generates flags from Pydantic |
| `duckdb` | Data aggregation |
| `transformers` | Tokenization for OSL reporting |
- `src/inference_endpoint/openai/openai_types_gen.py` — auto-generated, excluded from ruff/pre-commit
- `src/inference_endpoint/openai/openapi.yaml` — OpenAI API spec, excluded from pre-commit
This file is the source of truth for AI agents working in this repo. If it is stale or wrong, every AI-assisted session starts from a broken foundation.
Update AGENTS.md as part of any PR that includes a significant refactor, meaning:
- Moved, renamed, or deleted modules/packages — update the Code Organization tree and Key Components table
- Changed architectural boundaries (e.g., new IPC transport, replaced pydantic with msgspec for config) — update Architecture and Data Types sections
- Added or removed CLI commands/subcommands — update CLI Modes and Common Commands
- Changed test infrastructure (new fixtures, changed markers, new test directories) — update Testing section
- Added or removed key dependencies — update Key Dependencies table
- Changed build/tooling (new pre-commit hooks, changed ruff config, new CI steps) — update docs/DEVELOPMENT.md
- Changed hot-path patterns (new transport, changed serialization, new performance constraints) — update Performance Guidelines
- Treat AGENTS.md changes as part of the refactor itself — include them in the same PR, not as a follow-up
- Keep descriptions factual and concise — what exists and where, not aspirational design docs
- If you add a new top-level module under `src/inference_endpoint/`, add it to both the Key Components table and the Code Organization tree
- If you remove something, remove it from AGENTS.md — stale entries are worse than missing ones
When reviewing PRs with significant structural changes, verify:
- Code Organization tree matches the actual directory structure post-merge
- Key Components table reflects any moved/renamed/new components
- No references to deleted files, classes, or modules remain
Known failure modes when AI tools generate code for this project. Reference these during code review of AI-assisted PRs.
- Inventing abstractions that don't exist: AI may introduce new base classes, registries, or factory patterns that don't match the existing architecture. This project uses concrete types and explicit wiring — check that new code follows existing patterns rather than imposing unfamiliar frameworks.
- Misunderstanding the multi-process boundary: The endpoint client uses separate worker processes (not threads) communicating over ZMQ. AI-generated code often assumes shared memory, passes unpicklable objects across processes, or adds synchronization primitives (locks, semaphores) that don't work cross-process.
- Confusing hot-path vs. cold-path: AI tends to treat all code uniformly. Code in `load_generator/`, `endpoint_client/worker.py`, and `async_utils/transport/` is latency-critical. Pydantic validation, excessive logging, or try/except blocks in these paths will degrade throughput.
- Using the wrong serialization library: This project uses `msgspec` for hot-path data and `pydantic` for config. AI frequently defaults to `json.dumps`/`json.loads`, `dataclasses`, or applies pydantic where msgspec is required. If the type is a `msgspec.Struct`, encode/decode with `msgspec.json`, not stdlib json.
- Breaking msgspec Struct constraints: Core types (`Query`, `QueryResult`, `StreamChunk`) are `frozen=True` with `array_like=True`. AI may try to mutate fields directly (must use `force_setattr`), add mutable default fields, or assume dict-like serialization when the wire format is array-based.
- Adding `dataclass` where `msgspec.Struct` is expected: If neighboring types use msgspec, new types in the same module should too. AI defaults to `@dataclass` out of habit.
- Mixing sync and async incorrectly: AI may `await` in a sync context, call blocking I/O inside an `async def`, or use `asyncio.run()` when a loop is already running (this project manages its own loops via `LoopManager`).
- Creating new event loops: Workers and the main process already have managed event loops. AI may call `asyncio.new_event_loop()` or `asyncio.run()` which conflicts with the existing `LoopManager`/`uvloop` setup.
- Ignoring `eager_task_factory`: The project uses Python 3.12's `eager_task_factory` for performance. AI-generated code that creates coroutines expecting lazy scheduling will behave differently than expected.
- Generating mock-heavy tests for integration scenarios: This project has real echo/oracle server fixtures. AI tends to mock HTTP calls even when `mock_http_echo_server` or `mock_http_oracle_server` fixtures exist and should be used.
- Missing test markers: Every test function needs `@pytest.mark.unit`, `@pytest.mark.integration`, or another marker. AI-generated tests almost always omit markers, which breaks CI filtering.
- Wrong asyncio mode: Tests must use `@pytest.mark.asyncio(mode="strict")` — AI often writes bare `@pytest.mark.asyncio` or forgets it entirely, causing silent test skips or failures.
- Fabricating fixture names: AI may invent fixtures that don't exist in `conftest.py`. Always check that referenced fixtures actually exist before using them.
- Missing license headers: Every Python file needs the Apache 2.0 SPDX header. AI never generates these — the pre-commit hook will add them, but be aware of this when reviewing diffs.
- Importing removed or renamed modules: After refactors, AI (working from stale context) may import old module paths. Always verify imports resolve to actual files.
- Over-documenting: AI generates verbose docstrings, inline comments explaining obvious code, and type annotations on trivial variables. This project prefers minimal comments — only where the why isn't obvious from the code.
- Adding backwards-compatibility shims: If something was renamed or removed, AI may add re-exports, aliases, or deprecation wrappers. In this project, just delete the old thing and update all call sites.
- Adding new dependencies without justification: AI may `pip install` or add imports for packages not in `pyproject.toml`. Any new dependency must be justified, added to the correct optional group, and pinned to an exact version (`==`). After adding a dependency, run `pip-audit` (included in `dev` extras) to verify it has no known vulnerabilities.
- Using `requests`/`aiohttp` for HTTP: This project has its own HTTP client (`endpoint_client/http.py`) using `httptools`. AI defaults to `requests` or `aiohttp` — these should not appear in production code (test dependencies are fine).