
[diffusion] benchmark: Add SLO metric for SGL-Diffusion #18907

Open

yyy1000 wants to merge 7 commits into sgl-project:main from yyy1000:slo-metric

Conversation


@yyy1000 yyy1000 commented Feb 16, 2026

Motivation

Closes #18722

Modifications

Add SLO metrics to the SGL-Diffusion bench_serving.py script.

Accuracy Tests

N/A

Benchmarking and Profiling

root@a20053551888:/data/junhao/sglang# python3 /data/junhao/sglang/python/sglang/multimodal_gen/benchmarks/bench_serving.py --dataset vbench --num-prompts 10 --port 1231 --slo --slo-scale 3.0 --warmup-requests 2
[02-17 22:42:06] Waiting for service at http://localhost:1231...
[02-17 22:42:06] Service is ready.
[02-17 22:42:06] Updated model name from server: Wan-AI/Wan2.2-T2V-A14B-Diffusers
[02-17 22:42:06] Task from args None is different from huggingface pipeline_tag text-to-video, args.task will be ignored!
[02-17 22:42:06] Loading requests...
[02-17 22:42:06] Prepared 10 requests from vbench dataset.
[02-17 22:42:06] Running 2 warmup request(s) with num_inference_steps=1...
[02-17 23:10:35] Warmup 1/2: latency=1708.93s, success=True
[02-17 23:34:56] Warmup 2/2: latency=1461.22s, success=True
100%|█████████████████████████████████████████| 10/10 [4:03:46<00:00, 1462.62s/it]

================= Serving Benchmark Result =================
Task:                                    text-to-video  
Model:                                   Wan-AI/Wan2.2-T2V-A14B-Diffusers
Dataset:                                 vbench         
--------------------------------------------------
Benchmark duration (s):                  14626.17       
Request rate:                            inf            
Max request concurrency:                 1              
Successful requests:                     10/10             
--------------------------------------------------
Request throughput (req/s):              0.00           
Latency Mean (s):                        1462.6169      
Latency Median (s):                      1461.2150      
Latency P99 (s):                         1474.0264      
--------------------------------------------------
Peak Memory Max (MB):                    23764.88       
Peak Memory Mean (MB):                   23764.68       
Peak Memory Median (MB):                 23764.88       
--------------------------------------------------
SLO Attainment Rate:                     0.00%          
SLO Met (Success):                       0              
SLO Scale:                               3.00           
============================================================

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions bot added the diffusion (SGLang Diffusion) label Feb 16, 2026
@yyy1000 yyy1000 marked this pull request as draft February 16, 2026 22:19
@gemini-code-assist
Contributor

Summary of Changes

Hello @yyy1000, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the bench_serving.py script by introducing Service Level Objective (SLO) metrics for SGL-Diffusion benchmarks. The changes enable the system to define and track performance targets, dynamically calculate expected latencies based on request characteristics and warmup runs, and report on the attainment rate of these objectives. This provides a more comprehensive and robust way to evaluate the performance and reliability of multimodal generation services.

Highlights

  • SLO Metric Integration: Introduced Service Level Objective (SLO) metric calculation into the bench_serving.py script for SGL-Diffusion benchmarks.
  • Data Model Enhancement: Added slo_ms and num_inference_steps to RequestFuncInput and slo_achieved to RequestFuncOutput to support SLO tracking.
  • Dynamic SLO Target Calculation: Implemented functions to infer base latency from warmup requests and dynamically calculate slo_ms targets for subsequent requests based on image dimensions, frame count, and inference steps (see the sketch after this list).
  • Benchmarking Workflow Update: Integrated warmup request execution and SLO population into the main benchmarking loop.
  • Metric Reporting: Extended the calculate_metrics function to report SLO attainment rate, successful SLOs, and the configured SLO scale.
  • CLI Argument Expansion: Added new command-line arguments (--slo, --slo-scale, --warmup-requests, --warmup-num-inference-steps, --num-inference-steps) to control SLO behavior and warmup phases.
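
For intuition, here is a minimal sketch of the dynamic target calculation described in the highlights, assuming the expected latency scales linearly with resolution (in 16x16 pixel patch units), frame count, and inference steps; the function name and exact formula are illustrative assumptions, not the actual bench_serving.py implementation:

def compute_slo_ms(base_ms_per_unit, width, height, frames, steps, slo_scale):
    # Fall back to a scale of 1 when frame/step metadata is missing.
    frame_scale = frames if isinstance(frames, int) and frames > 0 else 1
    step_scale = steps if isinstance(steps, int) and steps > 0 else 1
    # Normalize resolution into 16x16 patch units.
    area_units = max((float(width) * float(height)) / float(16 * 16), 1.0)
    expected_ms = base_ms_per_unit * area_units * frame_scale * step_scale
    # The SLO budget is the expected latency relaxed by --slo-scale.
    return expected_ms * slo_scale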


Changelog
  • python/sglang/multimodal_gen/benchmarks/bench_serving.py
    • Added slo_ms and num_inference_steps fields to RequestFuncInput.
    • Added slo_achieved field to RequestFuncOutput.
    • Implemented _compute_expected_latency_ms_from_base, _infer_slo_base_time_ms_from_warmups, and _populate_slo_ms_from_warmups functions for dynamic SLO target calculation.
    • Modified async_request_image_sglang and async_request_video_sglang to check and record slo_achieved status.
    • Updated calculate_metrics to include SLO attainment rate and related statistics in the benchmark results (see the sketch after this list).
    • Introduced a warmup request phase before the main benchmark run.
    • Added command-line arguments: --slo, --slo-scale, --warmup-requests, --warmup-num-inference-steps, and --num-inference-steps.
    • Updated example usage in the docstring to include SLO options.
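
As referenced in the calculate_metrics item above, the attainment aggregation is conceptually simple: count the requests that both succeeded and finished within their slo_ms budget. A hedged sketch (field names follow the changelog; the real aggregation may differ, e.g. in how it treats failed requests):

def slo_attainment(outputs, num_requests):
    # slo_achieved is set per request by the async request functions.
    met = sum(1 for o in outputs if o.success and o.slo_achieved)
    rate = 100.0 * met / num_requests if num_requests else 0.0
    return met, rate

In the run from the PR description, 0 of 10 requests met their budget, hence the reported 0.00% attainment rate.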
Activity
  • No human activity (comments, reviews, etc.) has occurred on this pull request yet.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces SLO (Service Level Objective) metrics to the multimodal serving benchmark. The changes include adding new command-line arguments for SLO configuration, implementing logic for warmup runs to establish a baseline latency, calculating expected latencies for requests, and reporting SLO attainment rates. The overall implementation is well-structured. I've provided a few suggestions to improve maintainability and conciseness by reducing code duplication and simplifying some conditional logic. Please take a look at the detailed comments.

frame_scale = frames if isinstance(frames, int) and frames > 0 else 1
step_scale = steps if isinstance(steps, int) and steps > 0 else 1

area_units = max((float(width) * float(height)) / float(16 * 16), 1.0)
Contributor


medium

The magic number 16 * 16 is used here and also in _infer_slo_base_time_ms_from_warmups at line 400. To improve readability and maintainability, consider defining it as a module-level constant, for example: _AREA_UNIT_DIVISOR = 16 * 16. This likely represents the patch area for a vision model.
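
A sketch of the suggested refactor, using the constant name proposed in this comment (illustrative only):

# Module-level constant; 16 * 16 presumably corresponds to the patch area.
_AREA_UNIT_DIVISOR = 16 * 16

area_units = max((float(width) * float(height)) / float(_AREA_UNIT_DIVISOR), 1.0)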

Comment on lines +683 to +684
if input.slo_ms is not None and output.success:
    output.slo_achieved = (output.latency * 1000.0) <= input.slo_ms
Contributor


medium

This SLO check logic is identical to the one in async_request_image_sglang (lines 526-527). To avoid code duplication and improve maintainability, consider extracting this logic into a small helper function. For example, a function like _update_slo_achievement(input: RequestFuncInput, output: RequestFuncOutput) could be defined and called from both places.
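
A sketch of the suggested helper, with the name taken from this comment (illustrative only):

def _update_slo_achievement(input: RequestFuncInput, output: RequestFuncOutput) -> None:
    # Shared by async_request_image_sglang and async_request_video_sglang.
    if input.slo_ms is not None and output.success:
        output.slo_achieved = (output.latency * 1000.0) <= input.slo_ms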

Author

yyy1000 commented Feb 16, 2026

My previous dev machine no longer has quota, and I'm planning to rent a new one to test this PR.

@ping1jing2
Collaborator

Hi @yyy1000, thanks for your contribution! Please:

  1. Update the title to [diffusion] <scope>: <subject>
  2. Run pre-commit run --all-files before pushing commits
  3. See details in docs/diffusion/contributing.md and docs/developer_guide/contribution_guide.md

@yyy1000 yyy1000 changed the title Add SLO metric for SGL-Diffusion [diffusion] benchmark: Add SLO metric for SGL-Diffusion Feb 17, 2026
Author

yyy1000 commented Feb 17, 2026

Hi @ping1jing2, thank you so much for the review! I have done 1 & 2. Due to the lack of a personal GPU machine, I can't run the benchmark test right now, but I will figure out a way to do it soon.

@ping1jing2
Collaborator

> Hi @ping1jing2, thank you so much for the review! I have done 1 & 2. Due to the lack of a personal GPU machine, I can't run the benchmark test right now, but I will figure out a way to do it soon.

Please let me know when your PR is ready for review.

@yyy1000 yyy1000 marked this pull request as ready for review February 18, 2026 03:44
@yyy1000 yyy1000 requested a review from ping1jing2 as a code owner February 18, 2026 03:44
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Author

yyy1000 commented Feb 18, 2026

Hi @ping1jing2, thanks for the note. I have uploaded the test results to the PR description; could you help review it again? Thank you!

Author

yyy1000 commented Feb 18, 2026

Hi @ping1jing2, the PR is ready for review now, thanks!

Author

yyy1000 commented Feb 18, 2026

Hi @ping1jing2, thank you for your review; I have resolved your comments. Could you review again when you're available and let me know what I should fix? Thank you!

help="Number of warmup requests to run before measurement.",
)
parser.add_argument(
"--warmup-num-inference-steps",
Collaborator


this should always be 1, iiuc

Author


Thank you for the review. I just checked that the server sets the warmup inference steps to 1, so this should be 1 too, and I made the change.

Author

yyy1000 commented Feb 19, 2026

/tag-and-rerun-ci

1 similar comment
@ping1jing2
Collaborator

/tag-and-rerun-ci


Labels

diffusion (SGLang Diffusion), run-ci


Development

Successfully merging this pull request may close these issues.

[Feature] [diffusion] Add SLO metric for SGL-Diffusion

3 participants
