
[diffusion] benchmark: Add SLO metric for SGL-Diffusion #18907

Open

yyy1000 wants to merge 7 commits into sgl-project:main from yyy1000:slo-metric

Conversation


@yyy1000 yyy1000 commented Feb 16, 2026

Motivation

Closes #18722

Modifications

Add SLO metrics to the SGL-Diffusion bench_serving.py script.

Accuracy Tests

N/A

Benchmarking and Profiling

root@a20053551888:/data/junhao/sglang# python3 /data/junhao/sglang/python/sglang/multimodal_gen/benchmarks/bench_serving.py --dataset vbench --num-prompts 10 --port 1231 --slo --slo-scale 3.0 --warmup-requests 2
[02-17 22:42:06] Waiting for service at http://localhost:1231...
[02-17 22:42:06] Service is ready.
[02-17 22:42:06] Updated model name from server: Wan-AI/Wan2.2-T2V-A14B-Diffusers
[02-17 22:42:06] Task from args None is different from huggingface pipeline_tag text-to-video, args.task will be ignored!
[02-17 22:42:06] Loading requests...
[02-17 22:42:06] Prepared 10 requests from vbench dataset.
[02-17 22:42:06] Running 2 warmup request(s) with num_inference_steps=1...
[02-17 23:10:35] Warmup 1/2: latency=1708.93s, success=True
[02-17 23:34:56] Warmup 2/2: latency=1461.22s, success=True
100%|█████████████████████████████████████████| 10/10 [4:03:46<00:00, 1462.62s/it]

================= Serving Benchmark Result =================
Task:                                    text-to-video  
Model:                                   Wan-AI/Wan2.2-T2V-A14B-Diffusers
Dataset:                                 vbench         
--------------------------------------------------
Benchmark duration (s):                  14626.17       
Request rate:                            inf            
Max request concurrency:                 1              
Successful requests:                     10/10             
--------------------------------------------------
Request throughput (req/s):              0.00           
Latency Mean (s):                        1462.6169      
Latency Median (s):                      1461.2150      
Latency P99 (s):                         1474.0264      
--------------------------------------------------
Peak Memory Max (MB):                    23764.88       
Peak Memory Mean (MB):                   23764.68       
Peak Memory Median (MB):                 23764.88       
--------------------------------------------------
SLO Attainment Rate:                     0.00%          
SLO Met (Success):                       0              
SLO Scale:                               3.00           
============================================================

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions bot added the diffusion (SGLang Diffusion) label Feb 16, 2026
@yyy1000 yyy1000 marked this pull request as draft February 16, 2026 22:19
@gemini-code-assist
Contributor

Summary of Changes

Hello @yyy1000, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the bench_serving.py script by introducing Service Level Objective (SLO) metrics for SGL-Diffusion benchmarks. The changes enable the system to define and track performance targets, dynamically calculate expected latencies based on request characteristics and warmup runs, and report on the attainment rate of these objectives. This provides a more comprehensive and robust way to evaluate the performance and reliability of multimodal generation services.

Highlights

  • SLO Metric Integration: Introduced Service Level Objective (SLO) metric calculation into the bench_serving.py script for SGL-Diffusion benchmarks.
  • Data Model Enhancement: Added slo_ms and num_inference_steps to RequestFuncInput and slo_achieved to RequestFuncOutput to support SLO tracking.
  • Dynamic SLO Target Calculation: Implemented functions to infer base latency from warmup requests and dynamically calculate slo_ms targets for subsequent requests based on image dimensions, frame count, and inference steps (see the sketch after this list).
  • Benchmarking Workflow Update: Integrated warmup request execution and SLO population into the main benchmarking loop.
  • Metric Reporting: Extended the calculate_metrics function to report SLO attainment rate, successful SLOs, and the configured SLO scale.
  • CLI Argument Expansion: Added new command-line arguments (--slo, --slo-scale, --warmup-requests, --warmup-num-inference-steps, --num-inference-steps) to control SLO behavior and warmup phases.
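
For intuition, here is a minimal sketch of the dynamic target calculation described in the highlights, assuming the expected latency scales linearly with resolution (in 16x16 pixel patch units), frame count, and inference steps; the function name and exact formula are illustrative assumptions, not the actual bench_serving.py implementation:

def compute_slo_ms(base_ms_per_unit, width, height, frames, steps, slo_scale):
    # Fall back to a scale of 1 when frame/step metadata is missing.
    frame_scale = frames if isinstance(frames, int) and frames > 0 else 1
    step_scale = steps if isinstance(steps, int) and steps > 0 else 1
    # Normalize resolution into 16x16 patch units.
    area_units = max((float(width) * float(height)) / float(16 * 16), 1.0)
    expected_ms = base_ms_per_unit * area_units * frame_scale * step_scale
    # The SLO budget is the expected latency relaxed by --slo-scale.
    return expected_ms * slo_scale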


Changelog
  • python/sglang/multimodal_gen/benchmarks/bench_serving.py
    • Added slo_ms and num_inference_steps fields to RequestFuncInput.
    • Added slo_achieved field to RequestFuncOutput.
    • Implemented _compute_expected_latency_ms_from_base, _infer_slo_base_time_ms_from_warmups, and _populate_slo_ms_from_warmups functions for dynamic SLO target calculation.
    • Modified async_request_image_sglang and async_request_video_sglang to check and record slo_achieved status.
    • Updated calculate_metrics to include SLO attainment rate and related statistics in the benchmark results (see the sketch after this list).
    • Introduced a warmup request phase before the main benchmark run.
    • Added command-line arguments: --slo, --slo-scale, --warmup-requests, --warmup-num-inference-steps, and --num-inference-steps.
    • Updated example usage in the docstring to include SLO options.
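
As referenced in the calculate_metrics item above, the attainment aggregation is conceptually simple: count the requests that both succeeded and finished within their slo_ms budget. A hedged sketch (field names follow the changelog; the real aggregation may differ, e.g. in how it treats failed requests):

def slo_attainment(outputs, num_requests):
    # slo_achieved is set per request by the async request functions.
    met = sum(1 for o in outputs if o.success and o.slo_achieved)
    rate = 100.0 * met / num_requests if num_requests else 0.0
    return met, rate

In the run from the PR description, 0 of 10 requests met their budget, hence the reported 0.00% attainment rate.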
Activity
  • No human activity (comments, reviews, etc.) has occurred on this pull request yet.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces SLO (Service Level Objective) metrics to the multimodal serving benchmark. The changes include adding new command-line arguments for SLO configuration, implementing logic for warmup runs to establish a baseline latency, calculating expected latencies for requests, and reporting SLO attainment rates. The overall implementation is well-structured. I've provided a few suggestions to improve maintainability and conciseness by reducing code duplication and simplifying some conditional logic. Please take a look at the detailed comments.

frame_scale = frames if isinstance(frames, int) and frames > 0 else 1
step_scale = steps if isinstance(steps, int) and steps > 0 else 1

area_units = max((float(width) * float(height)) / float(16 * 16), 1.0)
Contributor


medium

The magic number 16 * 16 is used here and also in _infer_slo_base_time_ms_from_warmups at line 400. To improve readability and maintainability, consider defining it as a module-level constant, for example: _AREA_UNIT_DIVISOR = 16 * 16. This likely represents the patch area for a vision model.
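
A sketch of the suggested refactor, using the constant name proposed in this comment (illustrative only):

# Module-level constant; 16 * 16 presumably corresponds to the patch area.
_AREA_UNIT_DIVISOR = 16 * 16

area_units = max((float(width) * float(height)) / float(_AREA_UNIT_DIVISOR), 1.0)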

Comment on lines +683 to +684
if input.slo_ms is not None and output.success:
    output.slo_achieved = (output.latency * 1000.0) <= input.slo_ms
Contributor


medium

This SLO check logic is identical to the one in async_request_image_sglang (lines 526-527). To avoid code duplication and improve maintainability, consider extracting this logic into a small helper function. For example, a function like _update_slo_achievement(input: RequestFuncInput, output: RequestFuncOutput) could be defined and called from both places.
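
A sketch of the suggested helper, with the name taken from this comment (illustrative only):

def _update_slo_achievement(input: RequestFuncInput, output: RequestFuncOutput) -> None:
    # Shared by async_request_image_sglang and async_request_video_sglang.
    if input.slo_ms is not None and output.success:
        output.slo_achieved = (output.latency * 1000.0) <= input.slo_ms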

Author

yyy1000 commented Feb 16, 2026

My previous dev machine no longer has quota, and I'm planning to rent a new one to test this PR.

@ping1jing2
Collaborator

Hi @yyy1000, thanks for your contribution! Please:

  1. Update the title to [diffusion] <scope>: <subject>
  2. Run pre-commit run --all-files before pushing commits
  3. See details in docs/diffusion/contributing.md and docs/developer_guide/contribution_guide.md

@yyy1000 yyy1000 changed the title Add SLO metric for SGL-Diffusion [diffusion] benchmark: Add SLO metric for SGL-Diffusion Feb 17, 2026
Author

yyy1000 commented Feb 17, 2026

Hi @ping1jing2, thank you so much for the review! I have done 1 & 2. Due to the lack of a personal GPU machine, I can't run the benchmark test right now, but I will figure out a way to do it soon.

@ping1jing2
Collaborator

> Hi @ping1jing2, thank you so much for the review! I have done 1 & 2. Due to the lack of a personal GPU machine, I can't run the benchmark test right now, but I will figure out a way to do it soon.

Please let me know when your PR is ready for review.

@yyy1000 yyy1000 marked this pull request as ready for review February 18, 2026 03:44
@yyy1000 yyy1000 requested a review from ping1jing2 as a code owner February 18, 2026 03:44
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Author

yyy1000 commented Feb 18, 2026

Hi @ping1jing2, thanks for the note. I have uploaded the test results to the PR description; could you help review it again? Thank you!

Author

yyy1000 commented Feb 18, 2026

Hi @ping1jing2, the PR is ready for review now, thanks!

Author

yyy1000 commented Feb 18, 2026

Hi @ping1jing2, thank you for your review; I have resolved your comments. Could you review again when you're available and let me know what I should fix? Thank you!

help="Number of warmup requests to run before measurement.",
)
parser.add_argument(
"--warmup-num-inference-steps",
Collaborator


this should always be 1, iiuc

Author


Thank you for the review. I just checked that the server sets the warmup inference steps to 1, so this should be 1 too, and I made the change.

Author

yyy1000 commented Feb 19, 2026

/tag-and-rerun-ci

1 similar comment
@ping1jing2
Collaborator

/tag-and-rerun-ci


Labels

diffusion (SGLang Diffusion), run-ci


Development

Successfully merging this pull request may close these issues.

[Feature] [diffusion] Add SLO metric for SGL-Diffusion

3 participants
