[Diffusion] [Profiling] Add end-to-end profiling support for diffusion serving pipeline #18367

zhyajie wants to merge 10 commits into sgl-project:main
Conversation
Summary of Changes

Hello @zhyajie, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request improves the observability of diffusion serving pipelines by introducing an end-to-end profiling system. It allows developers to capture synchronized performance traces across the entire stack, from HTTP request handling to GPU computation, using a unified API. This capability is crucial for identifying and resolving performance bottlenecks in multimodal generation models.
Code Review
This pull request introduces end-to-end profiling for the diffusion serving pipeline, which is a great addition for performance analysis. The implementation correctly synchronizes profiling between the HTTP server and GPU workers. The code is well-structured, introducing new API endpoints and leveraging a dedicated profiler utility class. My review includes a few suggestions to improve code clarity, consistency, and efficiency. Specifically, I've pointed out opportunities to refactor parameter handling in the benchmark script, remove duplicated logic for parsing environment variables, and simplify a function call by removing unused parameters. Overall, this is a solid contribution.
```diff
  # Add profiling parameter if set
  if input.extra_body.get("profile"):
      data.add_field("profile", "true")

  # Merge extra parameters (exclude profile param already handled)
  for key, value in input.extra_body.items():
-     data.add_field(key, str(value))
+     if key != "profile":
+         data.add_field(key, str(value))
```
The current logic for handling extra_body in multipart requests is slightly inefficient: it checks for the "profile" key up front and then re-tests that key on every iteration while adding the remaining parameters. This can be made cleaner by copying the dictionary and using pop to handle the special "profile" key.
Current:

```python
# Add profiling parameter if set
if input.extra_body.get("profile"):
    data.add_field("profile", "true")

# Merge extra parameters (exclude profile param already handled)
for key, value in input.extra_body.items():
    if key != "profile":
        data.add_field(key, str(value))
```

Suggested change:

```python
# Add profiling and other extra parameters
extra_params = input.extra_body.copy()
if extra_params.pop("profile", None):
    data.add_field("profile", "true")
for key, value in extra_params.items():
    data.add_field(key, str(value))
```
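Besides saving the repeated key comparison, pop removes "profile" from the copy outright, so the loop can never re-add it by accident.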
```python
env_with_stack = os.getenv("SGLANG_PROFILE_WITH_STACK", "false").lower() in (
    "true",
    "1",
    "yes",
)
env_record_shapes = os.getenv(
    "SGLANG_PROFILE_RECORD_SHAPES", "false"
).lower() in ("true", "1", "yes")
```
The logic to parse boolean values from environment variables is duplicated here. The SGLDiffusionProfiler class already uses sglang.srt.utils.get_bool_env_var for the same purpose. To improve consistency and reduce code duplication, you should use get_bool_env_var here as well.
You'll need to add from sglang.srt.utils import get_bool_env_var at the top of the file.
```python
env_with_stack = get_bool_env_var("SGLANG_PROFILE_WITH_STACK", "false")
env_record_shapes = get_bool_env_var("SGLANG_PROFILE_RECORD_SHAPES", "false")
```

```python
profiler = SGLDiffusionProfiler(
    request_id=profile_id,
    rank=self.rank,
    full_profile=True,  # Always profile all stages
    activities=activities,
    num_steps=-1,  # Profile all steps
    num_inference_steps=100,  # Large number to capture all
    log_dir=output_dir,
    with_stack=with_stack,
    record_shapes=record_shapes,
)
```
When full_profile=True is passed to SGLDiffusionProfiler, the num_steps and num_inference_steps parameters are ignored because a continuous profiler is used instead of a scheduled one. Passing these parameters can be confusing for future readers. It would be cleaner to omit them when they are not used.
```python
profiler = SGLDiffusionProfiler(
    request_id=profile_id,
    rank=self.rank,
    full_profile=True,  # Always profile all stages
    activities=activities,
    log_dir=output_dir,
    with_stack=with_stack,
    record_shapes=record_shapes,
)
```
(Two review comments on python/sglang/multimodal_gen/runtime/entrypoints/http_server.py: outdated, resolved.)
@Makcum888e please help to review this PR if you are free.
(force-pushed: f3cc742 → 3f2a858)
@ping1jing2 Thank you for your careful review. I have revised the code according to the code review feedback.
(force-pushed: 0f1cc88 → 94ad30d)
I rebased on upstream/main and resolved some conflicts.
@zhyajie please ping me to trigger the CI after you resolve the conflicts.
(force-pushed: 94ad30d → 9469e49)
@ping1jing2 I've rebased to the latest main branch, fixed the conflicts, and tested the features locally. Please help me trigger the CI. Thanks a lot!
(force-pushed: 2c784a5 → 8dc9eed)
https://github.com/sgl-project/sglang/actions/runs/22137625096/job/63993329620?pr=18367

Please resolve the lint issue:

```bash
pip3 install pre-commit
pre-commit install
pre-commit run --all-files
```
Done. Please try again, thank you.
@zhyajie sorry for the late reply, and sorry that you need to resolve the code conflicts. BTW, I already added the …
@ping1jing2 I fixed the conflicts and finished the CI test. Please take a look when you have time. Thanks!
@ping1jing2 There was a new conflict today; I have resolved it and verified the profiling feature locally.
This PR is OK for me, but I can't merge it. Please ping Mick or bbuf later. Thanks for your contribution!
@ping1jing2 Thank you for your reply. @mickqian Could you please review this PR when you have time? This PR touches many files, so it frequently runs into merge conflicts.
[Diffusion] [Profiling] Add end-to-end profiling support for diffusion serving pipeline
Summary
This PR adds end-to-end profiling support for the diffusion serving pipeline, enabling developers to capture synchronized torch.profiler traces from both the HTTP server (host) process and the GPU worker process with a single command. The profiling workflow is designed to be consistent with the existing LLM profiling API in SGLang.

Motivation
Profiling diffusion serving pipelines requires visibility into both the host-side overhead (HTTP request handling, scheduling, data encoding/decoding) and the GPU-side computation (denoising steps, attention, VAE). Previously, there was no unified way to capture traces from both processes simultaneously. This PR provides that capability, making it easy to identify bottlenecks across the full serving stack.
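Because the workflow mirrors SGLang's LLM profiling API, the control flow can be sketched as below. /start_profile and /stop_profile are the LLM-side routes; whether the diffusion server exposes exactly these paths, and on which port, is an assumption here.

```bash
# Hedged sketch: assumes LLM-style profiling routes on port 30000.
curl -X POST http://localhost:30000/start_profile   # begin tracing on host and GPU workers
# ... send diffusion generation requests to be captured ...
curl -X POST http://localhost:30000/stop_profile    # stop and flush traces to the output dir
```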
Usage
1. Launch the server with profiling environment variables:
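A minimal sketch of this step: the two SGLANG_PROFILE_* variables come from this PR's review thread, while SGLANG_TORCH_PROFILER_DIR, the entrypoint module, and the flags are illustrative assumptions.

```bash
export SGLANG_TORCH_PROFILER_DIR=/tmp/sglang_profile  # trace output dir (assumed name)
export SGLANG_PROFILE_WITH_STACK=true                 # record Python stack traces
export SGLANG_PROFILE_RECORD_SHAPES=true              # record operator input shapes
# Entrypoint and flags below are hypothetical, for illustration only.
python -m sglang.multimodal_gen.runtime.entrypoints.http_server \
    --model-path <your-diffusion-model>
```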
2. Run the benchmark with profiling:
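A sketch of the request side: the multipart profile=true field mirrors data.add_field("profile", "true") in the client diff reviewed above; the endpoint path and the prompt field are placeholders.

```bash
# Route and prompt field are placeholders; only the "profile" field
# is grounded in this PR's client diff.
curl -X POST http://localhost:30000/v1/images/generations \
  -F "prompt=a corgi astronaut, watercolor" \
  -F "profile=true"
```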
3. Inspect traces in Chrome (chrome://tracing) or the Perfetto UI. The profiler generates two trace files per profiling session:
- {profile_id}-host.trace.json.gz: host process timeline
- {profile_id}-rank-{rank}.trace.json.gz: GPU worker timeline
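The traces are gzipped Chrome-trace JSON. The Perfetto UI (https://ui.perfetto.dev) opens them as-is; for viewers that expect plain JSON, decompress first:

```bash
# -k keeps the original .gz and writes a plain .json next to it.
gunzip -k '{profile_id}-host.trace.json.gz'
```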
Profiling Results

Host Process Timeline: (screenshot)

GPU Worker Timeline: (screenshot)