Commit 281b2e9

Introduce the mlperf-inf-mm-q3vl benchmark plugin system
1 parent 27db053 commit 281b2e9

File tree

4 files changed (+255, -22 lines)

multimodal/qwen3-vl/README.md

Lines changed: 190 additions & 0 deletions
@@ -268,6 +268,196 @@ bash submit.sh --help
- Testing duration $\ge$ 10 mins.
- Sample concatenation permutation is enabled.

## Plugin System for `mlperf-inf-mm-q3vl benchmark`

The `mlperf-inf-mm-q3vl` package supports a plugin system that allows third-party
packages to register additional subcommands under `mlperf-inf-mm-q3vl benchmark`. This
uses Python's standard entry points mechanism.

The purpose of this feature is to allow benchmark result submitters to customize and fit
`mlperf-inf-mm-q3vl` to the inference system that they would like to benchmark,
**without** directly modifying the source code of `mlperf-inf-mm-q3vl`, which is frozen
once the benchmark is finalized.

### How it works

1. **Plugin Discovery**: When the CLI starts, it automatically discovers all registered
   plugins via the `mlperf_inf_mm_q3vl.benchmark_plugins` entry point group (sketched
   below).
2. **Plugin Loading**: Each plugin's entry point function is called to retrieve either a
   single command or a Typer app.
3. **Command Registration**: The plugin's commands are automatically added to the
   `benchmark` subcommand group.

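Roughly what the discovery and registration steps do at CLI startup, condensed from the
`_load_benchmark_plugins` helper added to `cli.py` in this commit (`benchmark_app` is the
existing Typer sub-app behind `mlperf-inf-mm-q3vl benchmark`):

```python
from importlib.metadata import entry_points

# Discover every plugin registered under the entry point group.
for entry_point in entry_points(group="mlperf_inf_mm_q3vl.benchmark_plugins"):
    plugin_result = entry_point.load()()  # call the plugin's registration function
    if callable(plugin_result):
        # A bare command function becomes `mlperf-inf-mm-q3vl benchmark <entry-point-name>`.
        benchmark_app.command(name=entry_point.name)(plugin_result)
    else:
        # A (Typer app, name) tuple becomes a nested subcommand group.
        plugin_app, plugin_name = plugin_result
        benchmark_app.add_typer(plugin_app, name=plugin_name)
```
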
### Example: creating a `mlperf-inf-mm-q3vl-foo` plugin package for `mlperf-inf-mm-q3vl benchmark foo`

#### Step 1: Package Structure

Create a new Python package with the following structure:

```
mlperf-inf-mm-q3vl-foo/
├── pyproject.toml
└── src/
    └── mlperf_inf_mm_q3vl_foo/
        ├── __init__.py
        ├── plugin.py
        ├── schema.py
        └── deploy.py
```

`schema.py` and `deploy.py` hold the plugin's own `FooEndpoint` schema and `FooDeployer`
helper that Step 2 imports.

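A minimal sketch of what that `FooEndpoint` model in `schema.py` could look like; every
field below is an illustrative assumption, not part of `mlperf-inf-mm-q3vl`:

```python
"""Hypothetical schema for the Foo inference system (illustrative only)."""

from pydantic import BaseModel


class FooEndpoint(BaseModel):
    """Connection details for a Foo deployment; every field here is an assumption."""

    url: str = "http://localhost:8000/v1"
    api_key: str = "EMPTY"
    num_replicas: int = 1
```
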
#### Step 2: Implement the `mlperf-inf-mm-q3vl-foo` plugin

Create your plugin entry point function in `plugin.py`:

```python
"""Plugin to support benchmarking the Foo inference system."""

from collections.abc import Callable
from typing import Annotated

from loguru import logger
from typer import Option

from mlperf_inf_mm_q3vl.cli import run_benchmark
from mlperf_inf_mm_q3vl.log import setup_loguru_for_benchmark
from mlperf_inf_mm_q3vl.schema import Dataset, Settings, Verbosity

from .schema import FooEndpoint


def register_foo_benchmark() -> Callable[..., None]:
    """Entry point for the plugin to benchmark the Foo inference system.

    This function is called when the CLI discovers the plugin.
    It should return either:
    - A single command function (decorated with appropriate options)
    - A tuple of (Typer app, command name) for more complex hierarchies
    """

    def benchmark_foo(
        *,
        settings: Settings,
        dataset: Dataset,
        # Add your foo-specific parameters here
        foo: FooEndpoint,
        custom_param: Annotated[
            int,
            Option(help="Custom parameter for foo backend"),
        ] = 2,
        random_seed: Annotated[
            int,
            Option(help="The seed for the random number generator."),
        ] = 12345,
        verbosity: Annotated[
            Verbosity,
            Option(help="The verbosity level of the logger."),
        ] = Verbosity.INFO,
    ) -> None:
        """Deploy and benchmark using the Foo backend.

        This command deploys a model using the Foo backend
        and runs the MLPerf benchmark against it.
        """
        from .deploy import FooDeployer

        setup_loguru_for_benchmark(settings=settings, verbosity=verbosity)
        logger.info(
            "Starting to benchmark the Foo inference system with endpoint spec {} and custom param {}",
            foo,
            custom_param,
        )
        # Your implementation here
        with FooDeployer(endpoint=foo, settings=settings, custom_param=custom_param):
            # FooDeployer makes sure that Foo is deployed and currently healthy.
            # Run the benchmark using the core run_benchmark function.
            run_benchmark(
                settings=settings,
                dataset=dataset,
                endpoint=foo,
                random_seed=random_seed,
            )

    # Return the command function.
    # The entry point name will be used as the subcommand name.
    return benchmark_foo
```

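Similarly, `FooDeployer` in `deploy.py` is plugin-specific code you would write yourself.
A minimal sketch of the context-manager shape the example relies on, with the launch and
shutdown details (including the placeholder `foo-server` executable) as assumptions:

```python
"""Hypothetical deployer for the Foo inference system (illustrative only)."""

import subprocess

from mlperf_inf_mm_q3vl.schema import Settings

from .schema import FooEndpoint


class FooDeployer:
    """Start a Foo server on __enter__ and tear it down on __exit__ (sketch)."""

    def __init__(self, *, endpoint: FooEndpoint, settings: Settings, custom_param: int) -> None:
        self.endpoint = endpoint
        self.settings = settings
        self.custom_param = custom_param
        self._process: subprocess.Popen[bytes] | None = None

    def __enter__(self) -> "FooDeployer":
        # Launch the placeholder Foo server; a real deployer would also poll a
        # health endpoint here before handing control back to the benchmark.
        self._process = subprocess.Popen(["foo-server", "--port", "8000"])
        return self

    def __exit__(self, *exc_info: object) -> None:
        # Shut the server down so the benchmark leaves no stray processes behind.
        if self._process is not None:
            self._process.terminate()
            self._process.wait()
```
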
#### Step 3: Configure `pyproject.toml`

Register the plugin in its package's `pyproject.toml`:

```toml
[project]
name = "mlperf-inf-mm-q3vl-foo"
version = "0.1.0"
description = "Enable mlperf-inf-mm-q3vl to benchmark the Foo inference system."
requires-python = ">=3.12"
dependencies = [
    "mlperf-inf-mm-q3vl @ git+https://github.com/mlcommons/inference.git#subdirectory=multimodal/qwen3-vl/",
    # Add your backend-specific dependencies here
]

[project.entry-points."mlperf_inf_mm_q3vl.benchmark_plugins"]
# The key here becomes the subcommand name.
foo = "mlperf_inf_mm_q3vl_foo.plugin:register_foo_benchmark"

[build-system]
requires = ["setuptools>=80"]
build-backend = "setuptools.build_meta"
```

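Once the plugin is installed, you can confirm that the entry point is visible without
invoking the CLI at all; this quick check uses only the standard library:

```python
from importlib.metadata import entry_points

# Lists every benchmark plugin the CLI will discover at startup, e.g.
# "foo -> mlperf_inf_mm_q3vl_foo.plugin:register_foo_benchmark".
for ep in entry_points(group="mlperf_inf_mm_q3vl.benchmark_plugins"):
    print(ep.name, "->", ep.value)
```
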
#### Step 4: Install and use `mlperf-inf-mm-q3vl benchmark foo`

```bash
# Install your plugin package
pip install mlperf-inf-mm-q3vl-foo

# The new subcommand is now available
mlperf-inf-mm-q3vl benchmark foo --help
mlperf-inf-mm-q3vl benchmark foo \
    --settings-file settings.toml \
    --dataset shopify-global-catalogue \
    --custom-param 3
```

#### Advanced: Nested Subcommands

If you want to create multiple subcommands under a single plugin (e.g.,
`mlperf-inf-mm-q3vl benchmark foo standard` and
`mlperf-inf-mm-q3vl benchmark foo optimized`), return a tuple of `(Typer app, name)`:

```python
from pydantic_typer import Typer


def register_foo_benchmark() -> tuple[Typer, str]:
    """Entry point that creates nested subcommands."""
    # Create a Typer app for your plugin
    foo_app = Typer(help="Benchmarking options for the Foo inference systems.")

    @foo_app.command(name="standard")
    def foo_standard() -> None:  # add your parameters here
        """Run the standard Foo benchmark."""
        # Implementation
        ...

    @foo_app.command(name="optimized")
    def foo_optimized() -> None:  # add your parameters here
        """Run the optimized Foo benchmark with max performance."""
        # Implementation
        ...

    # Return a tuple of (app, command_name)
    return (foo_app, "foo")
```

This will create:

- `mlperf-inf-mm-q3vl benchmark foo standard`
- `mlperf-inf-mm-q3vl benchmark foo optimized`

### Best Practices

1. **Dependencies**: Declare `mlperf-inf-mm-q3vl` as a dependency in your plugin package.
2. **Documentation**: Provide clear docstrings for your plugin commands; they appear in
   the `--help` output.
3. **Schema Reuse**: Reuse the core `Settings`, `Dataset`, and other schemas from
   `mlperf_inf_mm_q3vl.schema` for consistency and to minimize boilerplate code.
4. **Lazy Imports**: If your plugin has heavy dependencies, import them inside functions
   rather than at module level (as the Step 2 example does with `FooDeployer`) to avoid
   slowing down CLI startup.

## Developer Guide

multimodal/qwen3-vl/scripts/slurm/submit.sh

Lines changed: 2 additions & 2 deletions
@@ -99,12 +99,12 @@ while [[ $# -gt 0 ]]; do
     shift
     ;;
   -seq | --server-expected-qps)
-    server_expected_qps=$2
+    server_target_qps=$2
     shift
     shift
     ;;
   -seq=* | --server-expected-qps=*)
-    server_expected_qps=${1#*=}
+    server_target_qps=${1#*=}
     shift
     ;;
   -tps | --tensor-parallel-size)

multimodal/qwen3-vl/src/mlperf_inf_mm_q3vl/cli.py

Lines changed: 57 additions & 9 deletions
@@ -2,6 +2,8 @@

 from __future__ import annotations

+from collections.abc import Sequence
+from importlib.metadata import entry_points
 from typing import Annotated

 import mlperf_loadgen as lg
@@ -24,6 +26,56 @@
     help="Main CLI for running the Qwen3-VL (Q3VL) benchmark.",
 )

+_PLUGIN_RESULT_APP_AND_NAME = 2
+
+
+def _load_benchmark_plugins() -> None:
+    """Load and register benchmark plugins from third-party packages."""
+    # Discover plugins from the entry point group
+    discovered_plugins = entry_points(group="mlperf_inf_mm_q3vl.benchmark_plugins")
+
+    for entry_point in discovered_plugins:
+        try:
+            # Load the plugin function
+            plugin_func = entry_point.load()
+
+            # Call the plugin function to get the command/typer app
+            plugin_result = plugin_func()
+
+            # Register it with the benchmark app
+            if (
+                isinstance(plugin_result, Sequence)
+                and len(plugin_result) == _PLUGIN_RESULT_APP_AND_NAME
+            ):
+                # Plugin returns (typer_app, name)
+                plugin_app, plugin_name = plugin_result
+                benchmark_app.add_typer(plugin_app, name=plugin_name)
+                logger.debug(
+                    "Loaded benchmark plugin: {} from {}",
+                    plugin_name,
+                    entry_point.name,
+                )
+            elif callable(plugin_result):
+                # Plugin returns just a command function
+                benchmark_app.command(name=entry_point.name)(plugin_result)
+                logger.debug("Loaded benchmark command: {}", entry_point.name)
+            else:
+                logger.warning(
+                    "Unsupported plugin function return type {} for plugin {}",
+                    type(plugin_result),
+                    entry_point.name,
+                )
+        except Exception as e:  # noqa: BLE001
+            logger.warning(
+                "Failed to load benchmark plugin {} with error: {}",
+                entry_point.name,
+                e,
+            )
+
+
+# Load plugins when the module is imported
+_load_benchmark_plugins()
+

 @app.command()
 def evaluate(
@@ -66,32 +118,28 @@ def benchmark_endpoint(
     accessible via a URL (and an API key, if applicable).
     """
     setup_loguru_for_benchmark(settings=settings, verbosity=verbosity)
-    _run_benchmark(
+    run_benchmark(
         settings=settings,
         dataset=dataset,
         endpoint=endpoint,
         random_seed=random_seed,
     )


-def _run_benchmark(
+def run_benchmark(
     settings: Settings,
     dataset: Dataset,
     endpoint: Endpoint,
     random_seed: int,
 ) -> None:
     """Run the Qwen3-VL (Q3VL) benchmark."""
-    logger.info(
-        "Running Qwen3-VL (Q3VL) benchmark with settings: {}",
-        settings)
+    logger.info("Running Qwen3-VL (Q3VL) benchmark with settings: {}", settings)
     logger.info("Running Qwen3-VL (Q3VL) benchmark with dataset: {}", dataset)
     logger.info(
         "Running Qwen3-VL (Q3VL) benchmark with OpenAI API endpoint: {}",
         endpoint,
     )
-    logger.info(
-        "Running Qwen3-VL (Q3VL) benchmark with random seed: {}",
-        random_seed)
+    logger.info("Running Qwen3-VL (Q3VL) benchmark with random seed: {}", random_seed)
     test_settings, log_settings = settings.to_lgtype()
     task = ShopifyGlobalCatalogue(
         dataset=dataset,
@@ -130,7 +178,7 @@ def benchmark_vllm(
     """
     setup_loguru_for_benchmark(settings=settings, verbosity=verbosity)
     with LocalVllmDeployer(endpoint=vllm, settings=settings):
-        _run_benchmark(
+        run_benchmark(
             settings=settings,
             dataset=dataset,
             endpoint=vllm,

multimodal/qwen3-vl/src/mlperf_inf_mm_q3vl/task.py

Lines changed: 6 additions & 11 deletions
@@ -67,8 +67,7 @@ def __init__(
         self.openai_api_client = AsyncOpenAI(
             base_url=endpoint.url,
             http_client=DefaultAioHttpClient(
-                timeout=httpx.Timeout(
-                    timeout=request_timeout_seconds, connect=5.0),
+                timeout=httpx.Timeout(timeout=request_timeout_seconds, connect=5.0),
             ),
             api_key=endpoint.api_key,
             timeout=request_timeout_seconds,
@@ -188,9 +187,7 @@ def estimated_num_performance_samples(self) -> int:
         """
         estimation_indices = random.sample(
             range(self.total_num_samples),
-            k=min(
-                MAX_NUM_ESTIMATION_PERFORMANCE_SAMPLES,
-                self.total_num_samples),
+            k=min(MAX_NUM_ESTIMATION_PERFORMANCE_SAMPLES, self.total_num_samples),
         )
         estimation_samples = [
             self.formulate_loaded_sample(
@@ -277,8 +274,7 @@ def _unload_samples_from_ram(query_sample_indices: list[int]) -> None:
             _unload_samples_from_ram,
         )

-    async def _query_endpoint_async_batch(
-            self, query_sample: lg.QuerySample) -> None:
+    async def _query_endpoint_async_batch(self, query_sample: lg.QuerySample) -> None:
         """Query the endpoint through the async OpenAI API client."""
         try:
             sample = self.loaded_samples[query_sample.index]
@@ -295,7 +291,7 @@ async def _query_endpoint_async_batch(
                 sample,
             )
             tic = time.perf_counter()
-            response = await self.openai_api_client.chat.completions.create(  # type: ignore[call-overload]
+            response = await self.openai_api_client.chat.completions.create(  # type: ignore[call-overload, misc]
                 model=self.endpoint.model.repo_id,
                 messages=sample.messages,
                 response_format=(
@@ -364,8 +360,7 @@ async def _query_endpoint_async_batch(
                 ],
             )

-    async def _query_endpoint_async_stream(
-            self, query_sample: lg.QuerySample) -> None:
+    async def _query_endpoint_async_stream(self, query_sample: lg.QuerySample) -> None:
         """Query the endpoint through the async OpenAI API client."""
         ttft_set = False
         try:
@@ -383,7 +378,7 @@ async def _query_endpoint_async_stream(
                 sample,
             )
             word_array = []
-            stream = await self.openai_api_client.chat.completions.create(  # type: ignore[call-overload]
+            stream = await self.openai_api_client.chat.completions.create(  # type: ignore[call-overload, misc]
                 stream=True,
                 model=self.endpoint.model.repo_id,
                 messages=sample.messages,
