Guidelines for AI coding agents working in the MLPerf Inference Endpoint Benchmarking System repository.
High-performance benchmarking tool for LLM inference endpoints targeting 50k+ QPS. Python 3.12+, Apache 2.0 licensed.
```bash
# Development setup
uv sync --extra dev --extra test
uv run pre-commit install

# Testing
uv run pytest                                      # All tests (excludes slow/performance)
uv run pytest -m unit                              # Unit tests only
uv run pytest -m integration                       # Integration tests only
uv run pytest --cov=src --cov-report=html          # With coverage
uv run pytest -xvs tests/unit/path/to/test_file.py # Single test file

# Code quality (run before commits)
uv run pre-commit run --all-files

# Local testing with echo server
uv run python -m inference_endpoint.testing.echo_server --port 8765
uv run inference-endpoint probe --endpoints http://localhost:8765 --model test-model

# CLI usage
uv run inference-endpoint benchmark offline --endpoints URL --model NAME --dataset PATH
uv run inference-endpoint benchmark online --endpoints URL --model NAME --dataset PATH --load-pattern poisson --target-qps 100
```
```bash
uv run inference-endpoint benchmark from-config --config config.yaml
```

Alternatively, set up without uv (note: this does not use `uv.lock`, so dependency versions may differ from the lockfile):

```bash
python3.12 -m venv venv && source venv/bin/activate
pip install -e ".[dev,test]"
pre-commit install
```

After activating the venv, commands run without the `uv run` prefix:

```bash
pytest -m unit
pre-commit run --all-files
inference-endpoint benchmark offline --endpoints URL --model NAME --dataset PATH
```

Data flow:

```
Dataset Manager --> Load Generator --> Endpoint Client --> External Endpoint
                           |
                   Metrics Collector (EventRecorder + MetricsReporter)
```
| Component | Location | Purpose |
|---|---|---|
| Load Generator | `src/inference_endpoint/load_generator/` | Central orchestrator: `BenchmarkSession` owns the lifecycle, `Scheduler` controls timing, `LoadGenerator` issues queries |
| Endpoint Client | `src/inference_endpoint/endpoint_client/` | Multi-process HTTP workers communicating via ZMQ IPC. `HTTPEndpointClient` is the main entry point |
| Dataset Manager | `src/inference_endpoint/dataset_manager/` | Loads JSONL, HuggingFace, CSV, JSON, Parquet datasets. `Dataset` base class with `load_sample()`/`num_samples()` interface |
| Metrics | `src/inference_endpoint/metrics/` | `EventRecorder` writes to SQLite, `MetricsReporter` reads and aggregates (QPS, latency, TTFT, TPOT) |
| Config | `src/inference_endpoint/config/`, `endpoint_client/config.py` | Pydantic-based YAML schema (`schema.py`), `HTTPClientConfig` (single Pydantic model for CLI/YAML/runtime), `RuntimeSettings` |
| CLI | `src/inference_endpoint/main.py`, `commands/benchmark/cli.py` | cyclopts-based, auto-generated from `schema.py` and `HTTPClientConfig` Pydantic models. Flat shorthands via `cyclopts.Parameter(alias=...)` |
| Async Utils | `src/inference_endpoint/async_utils/` | `LoopManager` (uvloop + `eager_task_factory`), ZMQ transport layer, event publisher |
| OpenAI/SGLang | `src/inference_endpoint/openai/`, `sglang/` | Protocol adapters and response accumulators for different API formats |
Multi-process, event-loop design optimized for throughput:

- `BenchmarkSession` thread schedules samples with busy-wait timing
- Worker processes (N instances) handle HTTP requests via ZMQ IPC
- Uses `eager_task_factory` and `uvloop` for minimal async overhead
- CPU affinity support (`cpu_affinity.py`) for performance tuning
- Custom HTTP connection pooling (`http.py`) with `httptools` parser
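The busy-wait timing used by the scheduler thread can be sketched with stdlib code only. This is an illustrative sketch, not the project's API: the real logic lives in `load_generator/scheduler.py`, and `busy_wait_until` is a hypothetical helper name.

```python
import time

def busy_wait_until(deadline: float, spin_threshold: float = 0.002) -> None:
    """Sleep coarsely until close to the deadline, then spin for precision.

    Coarse time.sleep() is cheap but imprecise; spinning on perf_counter()
    for the last ~2 ms gives sub-millisecond scheduling accuracy.
    """
    while True:
        remaining = deadline - time.perf_counter()
        if remaining <= 0:
            return
        if remaining > spin_threshold:
            time.sleep(remaining - spin_threshold)  # coarse sleep phase
        # otherwise: loop again without sleeping (busy-wait / spin phase)

start = time.perf_counter()
busy_wait_until(start + 0.01)  # wait ~10 ms
elapsed = time.perf_counter() - start
assert elapsed >= 0.01  # never returns early
```

The trade-off is CPU usage for timing precision, which matters when issuing queries at tens of thousands of QPS.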
CLI is auto-generated from `config/schema.py` Pydantic models via cyclopts. Fields annotated with `cyclopts.Parameter(alias="--flag")` get flat shorthands; all other fields get auto-generated dotted flags (kebab-case).

- CLI mode (`offline`/`online`): cyclopts constructs `OfflineBenchmarkConfig`/`OnlineBenchmarkConfig` (subclasses in `config/schema.py`) directly from CLI args. Type locked via `Literal`. `--dataset` is repeatable with the TOML-style format `[perf|acc:]<path>[,key=value...]` (e.g. `--dataset data.csv,samples=500,parser.prompt=article`). Full accuracy support via `accuracy_config.eval_method=pass_at_1` etc.
- YAML mode (`from-config`): `BenchmarkConfig.from_yaml_file()` loads YAML, resolves env vars, and auto-selects the right subclass via Pydantic discriminated union. Optional `--timeout`/`--mode` overrides via `config.with_updates()`.
- `eval`: Not yet implemented (raises `CLIError` with a tracking-issue link)
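The `--dataset` format can be illustrated with a small parser. `parse_dataset_spec` and its return shape are hypothetical and exist only to show the grammar; the project's actual parsing lives elsewhere:

```python
def parse_dataset_spec(spec: str) -> dict:
    """Parse '[perf|acc:]<path>[,key=value...]' (illustrative only)."""
    role = "perf"  # default role when no prefix is given
    if spec.startswith(("perf:", "acc:")):
        role, spec = spec.split(":", 1)
    path, *pairs = spec.split(",")
    options = dict(pair.split("=", 1) for pair in pairs)
    return {"role": role, "path": path, "options": options}

parsed = parse_dataset_spec("acc:data.csv,samples=500,parser.prompt=article")
assert parsed == {
    "role": "acc",
    "path": "data.csv",
    "options": {"samples": "500", "parser.prompt": "article"},
}
```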
Both CLI and YAML produce the same subclass via Pydantic discriminated union on `type`:

- CLI `offline`/`online`: cyclopts → `OfflineBenchmarkConfig`/`OnlineBenchmarkConfig` → `with_updates(datasets)` → `run_benchmark`
- YAML `from-config`: `from_yaml_file(path)` → discriminated union → same subclass → `run_benchmark`

`OfflineBenchmarkConfig` and `OnlineBenchmarkConfig` (in `config/schema.py`) inherit `BenchmarkConfig`:

- `type`: locked via `Literal[TestType.OFFLINE]` / `Literal[TestType.ONLINE]`
- `settings`: `OfflineSettings` (hides load pattern) / `OnlineSettings`
- `submission_ref`, `benchmark_mode`: `show=False` on base class
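Pydantic's discriminated union selects the subclass by the `type` field. Conceptually it behaves like this stdlib sketch, where `OfflineSketch`/`OnlineSketch` are hypothetical stand-ins, not the real `config/schema.py` models:

```python
from dataclasses import dataclass

@dataclass
class OfflineSketch:  # stand-in for OfflineBenchmarkConfig
    type: str = "offline"

@dataclass
class OnlineSketch:   # stand-in for OnlineBenchmarkConfig
    type: str = "online"
    target_qps: float = 100.0

# The discriminator maps the `type` value to a concrete subclass:
_SUBCLASSES = {"offline": OfflineSketch, "online": OnlineSketch}

def from_dict(raw: dict):
    cls = _SUBCLASSES[raw["type"]]
    return cls(**raw)

cfg = from_dict({"type": "online", "target_qps": 50.0})
assert isinstance(cfg, OnlineSketch)
assert cfg.target_qps == 50.0
```

In the real code, Pydantic does this dispatch during validation, so both CLI args and YAML land on the identical config type.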
Validation is layered:

- Field-level (Pydantic): `Field(ge=0)` on durations, `Field(ge=-1)` on workers, `Literal` on `benchmark_mode`
- Field validators: `workers != 0` check
- Model validator (`_resolve_and_validate`): streaming AUTO resolution, model name from `submission_ref`, load pattern vs test type, cross-field duration check, duplicate datasets
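The layering can be emulated with a stdlib sketch. `OnlineSettingsSketch` is hypothetical, not the real schema; in the project, Pydantic performs the field-level checks via `Field(ge=...)` and the cross-field checks in validators:

```python
from dataclasses import dataclass

@dataclass
class OnlineSettingsSketch:  # hypothetical stand-in for the real settings model
    duration_s: float = 60.0
    workers: int = -1  # -1 meaning "auto" is an assumption for illustration

    def __post_init__(self):
        # Layer 1: field-level constraints (Pydantic: Field(ge=0), Field(ge=-1))
        if self.duration_s < 0:
            raise ValueError("duration_s must be >= 0")
        if self.workers < -1:
            raise ValueError("workers must be >= -1")
        # Layer 2: field-validator analogue (workers != 0 check)
        if self.workers == 0:
            raise ValueError("workers must not be 0")

OnlineSettingsSketch()  # defaults pass all layers
caught = ""
try:
    OnlineSettingsSketch(workers=0)
except ValueError as e:
    caught = str(e)
assert caught == "workers must not be 0"
```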
- `max_throughput`: Offline burst (all queries at t=0)
- `poisson`: Fixed QPS with Poisson arrival distribution
- `concurrency`: Fixed concurrent requests
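A Poisson arrival process at a target QPS means exponentially distributed inter-arrival gaps. A minimal stdlib sketch of how such a schedule can be generated (illustrative only; `poisson_arrival_times` is not the project's `Scheduler` API):

```python
import random

def poisson_arrival_times(target_qps: float, n: int, seed: int = 0) -> list[float]:
    """Generate n arrival timestamps whose gaps are drawn from an
    exponential distribution with mean 1/target_qps (a Poisson process)."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(target_qps)  # mean gap = 1 / target_qps
        times.append(t)
    return times

times = poisson_arrival_times(target_qps=100.0, n=1000)
mean_gap = times[-1] / len(times)
assert 0.005 < mean_gap < 0.02  # roughly the expected 10 ms mean gap at 100 QPS
```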
src/inference_endpoint/
├── main.py # Entry point + CLI app: cyclopts app, commands, error formatter, run()
├── exceptions.py # CLIError, ExecutionError, InputValidationError, SetupError
├── commands/ # Command execution logic
│ ├── benchmark/
│ │ ├── __init__.py
│ │ ├── cli.py # benchmark_app: offline, online, from-config subcommands
│ │ └── execute.py # Phased execution: setup/run_threaded/finalize + BenchmarkContext
│ ├── probe.py # ProbeConfig + execute_probe()
│ ├── info.py # execute_info()
│ ├── validate.py # execute_validate()
│ └── init.py # execute_init()
├── core/types.py # APIType, Query, QueryResult, StreamChunk, QueryStatus (msgspec Structs)
├── load_generator/
│ ├── session.py # BenchmarkSession - top-level orchestrator
│ ├── load_generator.py # LoadGenerator, SchedulerBasedLoadGenerator
│ ├── scheduler.py # Scheduler, timing strategies
│ ├── sample.py # SampleEventHandler
│ └── events.py # SessionEvent, SampleEvent enums
├── endpoint_client/
│ ├── http_client.py # HTTPEndpointClient - main client interface
│ ├── worker.py # Worker process implementation
│ ├── worker_manager.py # Manages worker lifecycle
│ ├── http.py # ConnectionPool, HttpRequestTemplate, raw HTTP
│ ├── http_sample_issuer.py # Bridges load generator to HTTP client
│ ├── config.py # HTTPClientConfig (single Pydantic model — CLI/YAML/runtime)
│ ├── adapter_protocol.py # HttpRequestAdapter protocol
│ ├── accumulator_protocol.py # Response accumulation protocol
│ ├── cpu_affinity.py # CPU pinning
│ └── utils.py # Port range helpers
├── async_utils/
│ ├── loop_manager.py # LoopManager (uvloop + eager_task_factory)
│ ├── runner.py # run_async() — uvloop + eager_task_factory entry point for CLI commands
│ ├── event_publisher.py # Async event pub/sub
│ ├── services/
│ │ ├── event_logger/ # EventLoggerService: writes EventRecords to JSONL/SQLite
│ │ └── metrics_aggregator/ # MetricsAggregatorService: real-time metrics (TTFT, TPOT, ISL, OSL)
│ └── transport/ # ZMQ-based IPC transport layer
│ ├── protocol.py # Transport protocols + TransportConfig base
│ ├── record.py # Transport records
│ └── zmq/ # ZMQ implementation (context, pubsub, transport, ZMQTransportConfig)
├── dataset_manager/
│ ├── dataset.py # Dataset base class, DatasetFormat enum
│ ├── factory.py # Dataset factory
│ ├── transforms.py # ColumnRemap and other transforms
│ └── predefined/ # Built-in datasets (aime25, cnndailymail, gpqa, etc.)
├── metrics/
│ ├── recorder.py # EventRecorder (SQLite-backed)
│ ├── reporter.py # MetricsReporter (aggregation)
│ └── metric.py # Metric types (Throughput, etc.)
├── config/
│ ├── schema.py # Single source of truth: Pydantic models + cyclopts annotations
│ ├── runtime_settings.py # RuntimeSettings dataclass
│ ├── ruleset_base.py # BenchmarkSuiteRuleset base
│ ├── ruleset_registry.py # Ruleset registry
│ ├── user_config.py # UserConfig dataclass for ruleset user overrides
│ ├── rulesets/mlcommons/ # MLCommons-specific rules, datasets, models
│ └── templates/ # YAML config templates (offline, online, eval, etc.)
├── openai/ # OpenAI-compatible API types and adapters
│ ├── types.py # OpenAI response types
│ ├── openai_adapter.py # Request/response adapter
│ ├── openai_msgspec_adapter.py # msgspec-based adapter (fast path)
│ ├── accumulator.py # Streaming response accumulator
│ └── harmony.py # openai_harmony integration
├── sglang/ # SGLang API adapter
├── evaluation/ # Accuracy evaluation (extractor, scoring, livecodebench)
├── plugins/ # Plugin system
├── profiling/ # line_profiler integration, pytest plugin
├── testing/
│ ├── echo_server.py # Local echo server for testing
│ ├── max_throughput_server.py # Max throughput test server
│ └── docker_server.py # Docker-based server management
└── utils/
├── logging.py # Logging setup
├── version.py # Version info
├── dataset_utils.py # Dataset utilities
└── benchmark_httpclient.py # HTTP client throughput benchmarking utility
tests/
├── conftest.py # Shared fixtures (echo/oracle servers, datasets, settings)
├── test_helpers.py # Test utility functions
├── unit/ # Unit tests (mirror src/ structure)
├── integration/ # Integration tests (real servers, end-to-end)
│ ├── endpoint_client/ # HTTP client integration tests
│ └── commands/ # CLI command integration tests
├── performance/ # Performance benchmarks (pytest-benchmark)
└── datasets/ # Test data (dummy_1k.jsonl, squad_pruned/)
- Formatter/Linter: `ruff` (line-length 88, target Python 3.12)
- Type checking: `mypy` (via pre-commit)
- Formatting: `ruff-format` (double quotes, space indent)
- License headers: required on all Python files (enforced by the pre-commit hook `scripts/add_license_header.py`)
- Conventional commits: `feat:`, `fix:`, `docs:`, `test:`, `chore:`
All of these hooks run automatically on commit: trailing-whitespace, end-of-file-fixer, check-yaml, check-merge-conflict, debug-statements, ruff (lint + autofix), ruff-format, mypy, prettier (YAML/JSON/Markdown), license header enforcement.
Always run `pre-commit run --all-files` before committing.
See Development Guide for full setup and workflow details.
- Core types (`Query`, `QueryResult`, `StreamChunk`): `msgspec.Struct` with `frozen=True`, `array_like=True`, `gc=False`, `omit_defaults=True`
- Config types: `pydantic.BaseModel` for validation
- Enums: `str, Enum` pattern for serializable enums (e.g., `LoadPatternType`, `APIType`)
- Serialization: `msgspec.json` for the hot path (ZMQ transport), `pydantic` for config
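The `str, Enum` pattern keeps enums directly comparable to strings and JSON-serializable with no custom encoder. A sketch using the load-pattern values from above (member names are assumed for illustration and may not match the real `LoadPatternType`):

```python
import json
from enum import Enum

class LoadPatternType(str, Enum):
    MAX_THROUGHPUT = "max_throughput"
    POISSON = "poisson"
    CONCURRENCY = "concurrency"

# Subclassing str makes members behave as their string values:
assert LoadPatternType.POISSON == "poisson"
assert json.dumps({"pattern": LoadPatternType.POISSON}) == '{"pattern": "poisson"}'
# Round-trips cleanly from config strings:
assert LoadPatternType("concurrency") is LoadPatternType.CONCURRENCY
```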
Coverage target: >90% for all new code.
Test markers:

```python
@pytest.mark.unit           # Unit tests
@pytest.mark.integration    # Integration tests (may need servers)
@pytest.mark.slow           # Skip in CI
@pytest.mark.performance    # No timeout, skip in CI
@pytest.mark.run_explicitly # Only run when explicitly selected
```

Async tests: use `@pytest.mark.asyncio(mode="strict")` — the project uses strict asyncio mode.
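A marked test might look like the following sketch. The test names are hypothetical and shown only to illustrate the marker and strict-asyncio conventions:

```python
import inspect
import pytest

@pytest.mark.unit
def test_addition():  # hypothetical unit test — every test carries a marker
    assert 1 + 1 == 2

@pytest.mark.integration
@pytest.mark.asyncio(mode="strict")
async def test_echo_roundtrip():  # hypothetical; a real one would use a server fixture
    ...

# Marks attach metadata; the functions themselves are unchanged:
assert inspect.iscoroutinefunction(test_echo_roundtrip)
```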
Key fixtures (defined in `tests/conftest.py`):

- `mock_http_echo_server` — real HTTP echo server on dynamic port
- `mock_http_oracle_server` — dataset-driven response server
- `dummy_dataset` — in-memory test dataset
- `hf_squad_dataset` — HuggingFace squad dataset
- `events_db` — pre-populated SQLite events database
- `max_throughput_runtime_settings`, `poisson_runtime_settings`, `concurrency_runtime_settings` — preset configs
- `clean_sample_event_hooks` — ensures event hooks are cleared between tests
Test data: `tests/datasets/dummy_1k.jsonl` (1000 samples), `tests/datasets/squad_pruned/`
These apply especially to code in the hot path (load generator, endpoint client, transport):
- No `match` statements in hot paths — use dict dispatch instead
- Use `dataclass(slots=True)` for frequently instantiated classes (or `msgspec.Struct`)
- Prefer generators over list comprehensions for large datasets
- Minimize async suspends in hot-path code
- Use `msgspec` over `json`/`pydantic` for serialization in the data path
- Connection pooling: the HTTP client uses a custom `ConnectionPool` with the `httptools` parser — not `aiohttp`/`requests`
- Event loop: `uvloop` with `eager_task_factory` via `LoopManager`
- IPC: ZMQ-based transport between main process and worker processes
| Package | Purpose |
|---|---|
| `uvloop` | Performance-optimized event loop |
| `httptools` | Fast HTTP parser for custom connection pool |
| `msgspec` | Fast serialization for core types and ZMQ transport |
| `pyzmq` | ZMQ IPC between main process and workers |
| `pydantic` | Configuration validation |
| `cyclopts` | CLI framework — auto-generates flags from Pydantic |
| `duckdb` | Data aggregation |
| `transformers` | Tokenization for OSL reporting |
- `src/inference_endpoint/openai/openai_types_gen.py` — auto-generated, excluded from ruff/pre-commit
- `src/inference_endpoint/openai/openapi.yaml` — OpenAI API spec, excluded from pre-commit
This file is the source of truth for AI agents working in this repo. If it is stale or wrong, every AI-assisted session starts from a broken foundation.
Update AGENTS.md as part of any PR that includes a significant refactor, meaning:
- Moved, renamed, or deleted modules/packages — update the Code Organization tree and Key Components table
- Changed architectural boundaries (e.g., new IPC transport, replaced pydantic with msgspec for config) — update Architecture and Data Types sections
- Added or removed CLI commands/subcommands — update CLI Modes and Common Commands
- Changed test infrastructure (new fixtures, changed markers, new test directories) — update Testing section
- Added or removed key dependencies — update Key Dependencies table
- Changed build/tooling (new pre-commit hooks, changed ruff config, new CI steps) — update docs/DEVELOPMENT.md
- Changed hot-path patterns (new transport, changed serialization, new performance constraints) — update Performance Guidelines
- Treat AGENTS.md changes as part of the refactor itself — include them in the same PR, not as a follow-up
- Keep descriptions factual and concise — what exists and where, not aspirational design docs
- If you add a new top-level module under `src/inference_endpoint/`, add it to both the Key Components table and the Code Organization tree
- If you remove something, remove it from AGENTS.md — stale entries are worse than missing ones
When reviewing PRs with significant structural changes, verify:
- Code Organization tree matches the actual directory structure post-merge
- Key Components table reflects any moved/renamed/new components
- No references to deleted files, classes, or modules remain
Known failure modes when AI tools generate code for this project. Reference these during code review of AI-assisted PRs.
- Inventing abstractions that don't exist: AI may introduce new base classes, registries, or factory patterns that don't match the existing architecture. This project uses concrete types and explicit wiring — check that new code follows existing patterns rather than imposing unfamiliar frameworks.
- Misunderstanding the multi-process boundary: The endpoint client uses separate worker processes (not threads) communicating over ZMQ. AI-generated code often assumes shared memory, passes unpicklable objects across processes, or adds synchronization primitives (locks, semaphores) that don't work cross-process.
- Confusing hot-path vs. cold-path: AI tends to treat all code uniformly. Code in `load_generator/`, `endpoint_client/worker.py`, and `async_utils/transport/` is latency-critical. Pydantic validation, excessive logging, or try/except blocks in these paths will degrade throughput.
- Using the wrong serialization library: This project uses `msgspec` for hot-path data and `pydantic` for config. AI frequently defaults to `json.dumps`/`json.loads`, `dataclasses`, or applies pydantic where msgspec is required. If the type is a `msgspec.Struct`, encode/decode with `msgspec.json`, not stdlib json.
- Breaking msgspec Struct constraints: Core types (`Query`, `QueryResult`, `StreamChunk`) are `frozen=True` with `array_like=True`. AI may try to mutate fields directly (must use `force_setattr`), add mutable default fields, or assume dict-like serialization when the wire format is array-based.
- Adding `dataclass` where `msgspec.Struct` is expected: If neighboring types use msgspec, new types in the same module should too. AI defaults to `@dataclass` out of habit.
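The frozen-Struct constraint behaves much like a frozen dataclass, so it can be illustrated with a stdlib analogue. `QuerySketch` is a hypothetical stand-in for the real `Query`; for actual msgspec Structs the deliberate escape hatch is `msgspec.structs.force_setattr`, not `object.__setattr__`:

```python
from dataclasses import FrozenInstanceError, dataclass

@dataclass(frozen=True)  # stdlib analogue of a frozen msgspec.Struct
class QuerySketch:
    query_id: int
    prompt: str

q = QuerySketch(1, "hello")

mutation_blocked = False
try:
    q.prompt = "mutated"  # direct mutation is rejected on frozen types
except FrozenInstanceError:
    mutation_blocked = True
assert mutation_blocked

# The explicit escape hatch (analogous to msgspec.structs.force_setattr):
object.__setattr__(q, "prompt", "mutated")
assert q.prompt == "mutated"
```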
- Mixing sync and async incorrectly: AI may `await` in a sync context, call blocking I/O inside an `async def`, or use `asyncio.run()` when a loop is already running (this project manages its own loops via `LoopManager`).
- Creating new event loops: Workers and the main process already have managed event loops. AI may call `asyncio.new_event_loop()` or `asyncio.run()`, which conflicts with the existing `LoopManager`/`uvloop` setup.
- Ignoring `eager_task_factory`: The project uses Python 3.12's `eager_task_factory` for performance. AI-generated code that creates coroutines expecting lazy scheduling will behave differently than expected.
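The `asyncio.run()` pitfall can be made concrete with a small helper. This is an illustrative sketch, not part of `LoopManager`; `run_or_schedule` and `fetch` are hypothetical names:

```python
import asyncio

async def fetch():  # placeholder coroutine standing in for real async work
    await asyncio.sleep(0)
    return 42

def run_or_schedule(coro):
    """Run coro at top level, or schedule it if a loop is already running.

    Calling asyncio.run() inside a running loop raises RuntimeError, so
    check for a running loop first instead of blindly creating one.
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(coro)        # no loop running: safe to create one
    return asyncio.ensure_future(coro)  # loop running: schedule, never nest run()

result = run_or_schedule(fetch())
assert result == 42  # called from sync top level, so asyncio.run path was taken
```

In this codebase, the correct fix is usually to route work through the managed loop rather than to create a new one.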
- Generating mock-heavy tests for integration scenarios: This project has real echo/oracle server fixtures. AI tends to mock HTTP calls even when the `mock_http_echo_server` or `mock_http_oracle_server` fixtures exist and should be used.
- Missing test markers: Every test function needs `@pytest.mark.unit`, `@pytest.mark.integration`, or another marker. AI-generated tests almost always omit markers, which breaks CI filtering.
- Wrong asyncio mode: Tests must use `@pytest.mark.asyncio(mode="strict")` — AI often writes bare `@pytest.mark.asyncio` or forgets it entirely, causing silent test skips or failures.
- Fabricating fixture names: AI may invent fixtures that don't exist in `conftest.py`. Always check that referenced fixtures actually exist before using them.
- Missing license headers: Every Python file needs the Apache 2.0 SPDX header. AI never generates these — the pre-commit hook will add them, but be aware of this when reviewing diffs.
- Importing removed or renamed modules: After refactors, AI (working from stale context) may import old module paths. Always verify imports resolve to actual files.
- Over-documenting: AI generates verbose docstrings, inline comments explaining obvious code, and type annotations on trivial variables. This project prefers minimal comments — only where the why isn't obvious from the code.
- Adding backwards-compatibility shims: If something was renamed or removed, AI may add re-exports, aliases, or deprecation wrappers. In this project, just delete the old thing and update all call sites.
- Adding new dependencies without justification: AI may `pip install` or add imports for packages not in `pyproject.toml`. Any new runtime, dev, or test dependency must be justified, added to the correct optional group, and pinned to an exact version (`==`). Use `uv add <package>==<version>` so that `pyproject.toml` and `uv.lock` are updated atomically, then run `uv run pip-audit` (included in the `dev` extras) to verify the package has no known vulnerabilities. Note: `[build-system] requires` is also pinned to exact versions for reproducibility.
- Using `requests`/`aiohttp` for HTTP: This project has its own HTTP client (`endpoint_client/http.py`) using `httptools`. AI defaults to `requests` or `aiohttp` — these should not appear in production code (test dependencies are fine).