[Diffusion] [Profiling] Add end-to-end profiling support for diffusion serving pipeline #18367

zhyajie wants to merge 10 commits into sgl-project:main
Conversation
Summary of Changes

Hello @zhyajie, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request improves the observability of diffusion serving pipelines by introducing an end-to-end profiling system. It allows developers to capture synchronized performance traces across the entire stack, from HTTP request handling to GPU computation, using a unified API. This capability is crucial for identifying and resolving performance bottlenecks in multimodal generation models.
Code Review
This pull request introduces end-to-end profiling for the diffusion serving pipeline, which is a great addition for performance analysis. The implementation correctly synchronizes profiling between the HTTP server and GPU workers. The code is well-structured, introducing new API endpoints and leveraging a dedicated profiler utility class. My review includes a few suggestions to improve code clarity, consistency, and efficiency. Specifically, I've pointed out opportunities to refactor parameter handling in the benchmark script, remove duplicated logic for parsing environment variables, and simplify a function call by removing unused parameters. Overall, this is a solid contribution.
```diff
  # Add profiling parameter if set
  if input.extra_body.get("profile"):
      data.add_field("profile", "true")

  # Merge extra parameters (exclude profile param already handled)
  for key, value in input.extra_body.items():
-     data.add_field(key, str(value))
+     if key != "profile":
+         data.add_field(key, str(value))
```
The current logic for handling extra_body in multipart requests is slightly inefficient: it checks for the "profile" key up front and then re-tests that key on every iteration while adding the remaining parameters. This can be made cleaner by copying the dictionary and using pop to handle the special "profile" key.
Current:

```python
# Add profiling parameter if set
if input.extra_body.get("profile"):
    data.add_field("profile", "true")

# Merge extra parameters (exclude profile param already handled)
for key, value in input.extra_body.items():
    if key != "profile":
        data.add_field(key, str(value))
```

Suggested change:

```python
# Add profiling and other extra parameters
extra_params = input.extra_body.copy()
if extra_params.pop("profile", None):
    data.add_field("profile", "true")
for key, value in extra_params.items():
    data.add_field(key, str(value))
```
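Besides saving the repeated key comparison, pop removes "profile" from the copy outright, so the loop can never re-add it by accident.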
```python
env_with_stack = os.getenv("SGLANG_PROFILE_WITH_STACK", "false").lower() in (
    "true",
    "1",
    "yes",
)
env_record_shapes = os.getenv(
    "SGLANG_PROFILE_RECORD_SHAPES", "false"
).lower() in ("true", "1", "yes")
```
The logic to parse boolean values from environment variables is duplicated here. The SGLDiffusionProfiler class already uses sglang.srt.utils.get_bool_env_var for the same purpose. To improve consistency and reduce code duplication, you should use get_bool_env_var here as well.
You'll need to add from sglang.srt.utils import get_bool_env_var at the top of the file.
```python
env_with_stack = get_bool_env_var("SGLANG_PROFILE_WITH_STACK", "false")
env_record_shapes = get_bool_env_var("SGLANG_PROFILE_RECORD_SHAPES", "false")
```

```python
profiler = SGLDiffusionProfiler(
    request_id=profile_id,
    rank=self.rank,
    full_profile=True,  # Always profile all stages
    activities=activities,
    num_steps=-1,  # Profile all steps
    num_inference_steps=100,  # Large number to capture all
    log_dir=output_dir,
    with_stack=with_stack,
    record_shapes=record_shapes,
)
```
When full_profile=True is passed to SGLDiffusionProfiler, the num_steps and num_inference_steps parameters are ignored because a continuous profiler is used instead of a scheduled one. Passing these parameters can be confusing for future readers. It would be cleaner to omit them when they are not used.
```python
profiler = SGLDiffusionProfiler(
    request_id=profile_id,
    rank=self.rank,
    full_profile=True,  # Always profile all stages
    activities=activities,
    log_dir=output_dir,
    with_stack=with_stack,
    record_shapes=record_shapes,
)
```
(Two review comments on python/sglang/multimodal_gen/runtime/entrypoints/http_server.py: outdated, resolved.)
@Makcum888e please help to review this PR if you are free.
(force-pushed: f3cc742 → 3f2a858)
@ping1jing2 Thank you for your careful review. I have revised the code according to the code review feedback.
(force-pushed: 0f1cc88 → 94ad30d)
I rebased on upstream/main and resolved some conflicts.
@zhyajie please ping me to trigger the CI after you resolve the conflicts.
(force-pushed: 94ad30d → 9469e49)
@ping1jing2 I've rebased to the latest main branch, fixed the conflicts, and tested the features locally. Please help me trigger the CI. Thanks a lot!
(force-pushed: 2c784a5 → 8dc9eed)
https://github.com/sgl-project/sglang/actions/runs/22137625096/job/63993329620?pr=18367

Please resolve the lint issue:

```bash
pip3 install pre-commit
pre-commit install
pre-commit run --all-files
```
Done. Please try again, thank you.
@zhyajie sorry for the late reply, and sorry that you need to resolve the code conflicts. BTW, I already added the …
@ping1jing2 I fixed the conflicts and finished the CI test. Please take a look when you have time. Thanks!
@ping1jing2 There was a new conflict today; I have resolved it and verified the profiling feature locally.
This PR is OK for me, but I can't merge it. Please ping Mick or bbuf later. Thanks for your contribution!
@ping1jing2 Thank you for your reply. @mickqian Could you please review this PR when you have time? This PR touches many files, so it frequently runs into merge conflicts.
[Diffusion] [Profiling] Add end-to-end profiling support for diffusion serving pipeline
Summary
This PR adds end-to-end profiling support for the diffusion serving pipeline, enabling developers to capture synchronized torch.profiler traces from both the HTTP server (host) process and the GPU worker process with a single command. The profiling workflow is designed to be consistent with the existing LLM profiling API in SGLang.

Motivation
Profiling diffusion serving pipelines requires visibility into both the host-side overhead (HTTP request handling, scheduling, data encoding/decoding) and the GPU-side computation (denoising steps, attention, VAE). Previously, there was no unified way to capture traces from both processes simultaneously. This PR provides that capability, making it easy to identify bottlenecks across the full serving stack.
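Because the workflow mirrors SGLang's LLM profiling API, the control flow can be sketched as below. /start_profile and /stop_profile are the LLM-side routes; whether the diffusion server exposes exactly these paths, and on which port, is an assumption here.

```bash
# Hedged sketch: assumes LLM-style profiling routes on port 30000.
curl -X POST http://localhost:30000/start_profile   # begin tracing on host and GPU workers
# ... send diffusion generation requests to be captured ...
curl -X POST http://localhost:30000/stop_profile    # stop and flush traces to the output dir
```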
Usage
1. Launch the server with profiling environment variables:
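A minimal sketch of this step: the two SGLANG_PROFILE_* variables come from this PR's review thread, while SGLANG_TORCH_PROFILER_DIR, the entrypoint module, and the flags are illustrative assumptions.

```bash
export SGLANG_TORCH_PROFILER_DIR=/tmp/sglang_profile  # trace output dir (assumed name)
export SGLANG_PROFILE_WITH_STACK=true                 # record Python stack traces
export SGLANG_PROFILE_RECORD_SHAPES=true              # record operator input shapes
# Entrypoint and flags below are hypothetical, for illustration only.
python -m sglang.multimodal_gen.runtime.entrypoints.http_server \
    --model-path <your-diffusion-model>
```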
2. Run the benchmark with profiling:
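A sketch of the request side: the multipart profile=true field mirrors data.add_field("profile", "true") in the client diff reviewed above; the endpoint path and the prompt field are placeholders.

```bash
# Route and prompt field are placeholders; only the "profile" field
# is grounded in this PR's client diff.
curl -X POST http://localhost:30000/v1/images/generations \
  -F "prompt=a corgi astronaut, watercolor" \
  -F "profile=true"
```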
3. Inspect traces in Chrome (chrome://tracing) or the Perfetto UI. The profiler generates two trace files per profiling session:
- {profile_id}-host.trace.json.gz: host process timeline
- {profile_id}-rank-{rank}.trace.json.gz: GPU worker timeline
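The traces are gzipped Chrome-trace JSON. The Perfetto UI (https://ui.perfetto.dev) opens them as-is; for viewers that expect plain JSON, decompress first:

```bash
# -k keeps the original .gz and writes a plain .json next to it.
gunzip -k '{profile_id}-host.trace.json.gz'
```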
Profiling Results

Host Process Timeline: (screenshot)

GPU Worker Timeline: (screenshot)