[Diffusion] [Profiling] Add end-to-end profiling support for diffusion serving pipeline#18367

Open
zhyajie wants to merge 10 commits into sgl-project:main from zhyajie:qwen_image_profling

Conversation

@zhyajie commented Feb 6, 2026

Summary

This PR adds end-to-end profiling support for the diffusion serving pipeline, enabling developers to capture synchronized torch.profiler traces from both the HTTP Server (host) process and the GPU Worker process with a single command. The profiling workflow is designed to be consistent with the existing LLM profiling API in SGLang.

Motivation

Profiling diffusion serving pipelines requires visibility into both the host-side overhead (HTTP request handling, scheduling, data encoding/decoding) and the GPU-side computation (denoising steps, attention, VAE). Previously, there was no unified way to capture traces from both processes simultaneously. This PR provides that capability, making it easy to identify bottlenecks across the full serving stack.

Usage

1. Launch the server with profiling environment variables:

export SGLANG_TORCH_PROFILER_DIR=./sglang_qwen_profiling
export SGLANG_PROFILE_WITH_STACK=1
export SGLANG_PROFILE_RECORD_SHAPES=1

sglang serve \
    --model-path <model_path> \
    --num-gpus 2 \
    --ulysses-degree 2 \
    --host 0.0.0.0 \
    --port 40000

2. Run the benchmark with profiling:

# Warm up (no profiling)
python3 -m sglang.multimodal_gen.benchmarks.bench_serving \
    --backend sglang-image --task image-to-image \
    --dataset vbench --dataset-path <dataset_path> \
    --max-concurrency 1 --num-prompts 1

# Profile run
python3 -m sglang.multimodal_gen.benchmarks.bench_serving \
    --backend sglang-image --task image-to-image \
    --dataset vbench --dataset-path <dataset_path> \
    --max-concurrency 1 --num-prompts 2 \
    --num-inference-steps 8 --guidance-scale 1 \
    --profile

3. Inspect traces in Chrome (chrome://tracing) or Perfetto UI:

The profiler generates two trace files per profiling session:

  • {profile_id}-host.trace.json.gz — Host process timeline
  • {profile_id}-rank-{rank}.trace.json.gz — GPU Worker timeline
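
For reference, the same profiling session can be driven without the benchmark script by calling the two endpoints directly. Below is a minimal sketch; the endpoint paths come from this PR, but the request fields are assumptions mirroring the options it exposes (output directory, activities, stack tracing, shape recording), so check ProfileReqInput in http_server.py for the exact schema:

    # Sketch only: field names are assumptions; see ProfileReqInput for
    # the real schema of the /start_profile request body.
    import requests

    base_url = "http://0.0.0.0:40000"

    # Begin a profiling session on both the host and the GPU workers.
    requests.post(f"{base_url}/start_profile", json={
        "output_dir": "./sglang_qwen_profiling",
        "activities": ["CPU", "GPU"],
        "with_stack": True,
        "record_shapes": True,
    }).raise_for_status()

    # ... issue the generation requests you want to capture ...

    # End the session; traces are written to the output directory.
    requests.post(f"{base_url}/stop_profile").raise_for_status()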

Profiling Results

Host Process Timeline: [screenshot in original PR]

GPU Worker Timeline: [screenshot in original PR]

@github-actions bot added the diffusion (SGLang Diffusion) label on Feb 6, 2026
@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @zhyajie, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the observability of diffusion serving pipelines by introducing a robust end-to-end profiling system. It allows developers to capture synchronized performance traces across the entire stack, from HTTP request handling to GPU computation, using a unified API. This capability is crucial for identifying and resolving performance bottlenecks, ensuring efficient operation of multimodal generation models.

Highlights

  • End-to-End Profiling: Introduced comprehensive end-to-end profiling support for diffusion serving pipelines, allowing synchronized torch.profiler traces from both the HTTP Server (host) and GPU Worker processes.
  • Unified Profiling API: Implemented new API endpoints (/start_profile and /stop_profile) in the HTTP server to manage profiling sessions, consistent with existing LLM profiling APIs in SGLang.
  • Benchmarking Tool Integration: Updated the bench_serving.py script to include a --profile command-line argument, enabling easy activation of the full pipeline profiling during benchmark runs. Also added num_inference_steps and guidance_scale parameters to the benchmark requests.
  • Configurable Profiler: Enhanced the SGLDiffusionProfiler to be configurable via environment variables (SGLANG_TORCH_PROFILER_DIR, SGLANG_PROFILE_WITH_STACK, SGLANG_PROFILE_RECORD_SHAPES) and API parameters, offering control over output directory, activities (CPU/GPU), stack tracing, and shape recording.
  • Distinct Trace Files: Modified trace file naming conventions to clearly distinguish between host process traces ({profile_id}-host.trace.json.gz) and GPU worker traces ({profile_id}-rank-{rank}.trace.json.gz), facilitating easier analysis.


Changelog
  • python/sglang/multimodal_gen/benchmarks/bench_serving.py
    • Added profiling example usage in comments.
    • Introduced num_inference_steps and guidance_scale fields to RequestFuncInput.
    • Modified __getitem__ to pass new inference parameters.
    • Updated async_request_image_sglang to include inference parameters and a 'profile' flag in requests.
    • Added ProfileReqOutput dataclass and async_request_profile function for API interaction.
    • Integrated start_profile and stop_profile calls into the benchmark function based on the --profile argument.
    • Added --num-inference-steps, --guidance-scale, and --profile command-line arguments.
  • python/sglang/multimodal_gen/runtime/entrypoints/http_server.py
    • Imported necessary types (Optional, Response, BaseModel) and utilities (init_logger, SGLDiffusionProfiler).
    • Defined ProfileReqInput and ProfileReqOutput Pydantic models for profiling API.
    • Implemented /start_profile and /stop_profile API endpoints to control profiling on the host and communicate with GPU workers.
    • Introduced _global_profiler_state to manage the profiler's lifecycle.
  • python/sglang/multimodal_gen/runtime/entrypoints/openai/utils.py
    • Added StartProfileReq and StopProfileReq dataclasses for inter-process communication with GPU workers regarding profiling.
  • python/sglang/multimodal_gen/runtime/managers/gpu_worker.py
    • Implemented start_profile and stop_profile methods to manage SGLDiffusionProfiler instances within the GPU worker process, handling initialization, starting, stopping, and trace saving.
  • python/sglang/multimodal_gen/runtime/managers/scheduler.py
    • Imported StartProfileReq and StopProfileReq.
    • Mapped StartProfileReq and StopProfileReq to new handler methods (_handle_start_profile, _handle_stop_profile).
    • Implemented _handle_start_profile and _handle_stop_profile to delegate profiling control to the gpu_worker.
  • python/sglang/multimodal_gen/runtime/pipelines_core/executors/pipeline_executor.py
    • Modified execute_with_profiling to set dump_rank=None for profile_execution, ensuring all ranks save their own trace files.
  • python/sglang/multimodal_gen/runtime/utils/profiler.py
    • Imported get_bool_env_var for environment variable parsing.
    • Updated SGLDiffusionProfiler constructor to accept activities, with_stack, record_shapes, and is_host parameters for flexible configuration.
    • Modified log_dir resolution to prioritize parameters, then environment variables, then default value.
    • Added _resolve_activities static method to parse activity strings (e.g., 'CPU', 'GPU') into torch.profiler.ProfilerActivity enums (a rough sketch follows this changelog).
    • Updated profiler configuration to use configurable with_stack and record_shapes.
    • Adjusted trace filename generation to differentiate between host and GPU worker traces.
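
As a rough illustration of the _resolve_activities item above, a parser along these lines would map the strings onto torch enums. This is a sketch under assumptions, not the actual profiler.py code; note that torch names the GPU activity CUDA:

    from torch.profiler import ProfilerActivity

    def resolve_activities(names: list[str]) -> list[ProfilerActivity]:
        # Map user-facing strings onto torch.profiler enums; torch uses
        # CUDA as the name of the GPU-side activity.
        mapping = {"CPU": ProfilerActivity.CPU, "GPU": ProfilerActivity.CUDA}
        try:
            return [mapping[name.upper()] for name in names]
        except KeyError as exc:
            raise ValueError(f"Unknown profiler activity: {exc.args[0]}") from exc
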
Activity
  • The pull request introduces a new profiling feature, requiring changes across multiple components of the SGLang multimodal generation system.
  • The author has provided detailed instructions on how to use the new profiling capabilities, including environment variable setup and benchmark execution commands.
  • Screenshots of host and GPU worker timelines are included in the PR description, demonstrating the output of the new profiling system.

@gemini-code-assist (Contributor) left a comment

Code Review

This pull request introduces end-to-end profiling for the diffusion serving pipeline, which is a great addition for performance analysis. The implementation correctly synchronizes profiling between the HTTP server and GPU workers. The code is well-structured, introducing new API endpoints and leveraging a dedicated profiler utility class. My review includes a few suggestions to improve code clarity, consistency, and efficiency. Specifically, I've pointed out opportunities to refactor parameter handling in the benchmark script, remove duplicated logic for parsing environment variables, and simplify a function call by removing unused parameters. Overall, this is a solid contribution.

Comment on lines 376 to 383:

    # Add profiling parameter if set
    if input.extra_body.get("profile"):
        data.add_field("profile", "true")

    # Merge extra parameters (exclude profile param already handled)
    for key, value in input.extra_body.items():
        if key != "profile":
            data.add_field(key, str(value))
Severity: medium

The current logic for handling extra_body in multipart requests is a bit inefficient as it involves checking for the "profile" key and then iterating over the dictionary again to add other parameters. This can be made cleaner and more efficient by creating a copy of the dictionary and using pop to handle the special "profile" key.

Suggested change:

    # Add profiling and other extra parameters
    extra_params = input.extra_body.copy()
    if extra_params.pop("profile", None):
        data.add_field("profile", "true")
    for key, value in extra_params.items():
        data.add_field(key, str(value))
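
As a quick standalone check of the suggested pattern (with a hypothetical payload, not taken from the PR):

    # copy() avoids mutating the caller's dict, pop() both removes and
    # tests the special key, and the remaining loop needs no per-key guard.
    extra_body = {"profile": True, "guidance_scale": 1}

    extra_params = extra_body.copy()
    fields = []
    if extra_params.pop("profile", None):
        fields.append(("profile", "true"))
    fields.extend((key, str(value)) for key, value in extra_params.items())

    assert fields == [("profile", "true"), ("guidance_scale", "1")]
    assert "profile" in extra_body  # the original dict is untouched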

Comment on lines 157 to 164:

    env_with_stack = os.getenv("SGLANG_PROFILE_WITH_STACK", "false").lower() in (
        "true",
        "1",
        "yes",
    )
    env_record_shapes = os.getenv(
        "SGLANG_PROFILE_RECORD_SHAPES", "false"
    ).lower() in ("true", "1", "yes")
Severity: medium

The logic to parse boolean values from environment variables is duplicated here. The SGLDiffusionProfiler class already uses sglang.srt.utils.get_bool_env_var for the same purpose. To improve consistency and reduce code duplication, you should use get_bool_env_var here as well.

You'll need to add from sglang.srt.utils import get_bool_env_var at the top of the file.

        env_with_stack = get_bool_env_var("SGLANG_PROFILE_WITH_STACK", "false")
        env_record_shapes = get_bool_env_var(
            "SGLANG_PROFILE_RECORD_SHAPES", "false"
        )
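
For readers unfamiliar with the helper, its assumed semantics boil down to roughly the following sketch (the real implementation lives in sglang.srt.utils and may accept a slightly different set of truthy values):

    import os

    def get_bool_env_var(name: str, default: str = "false") -> bool:
        # Assumed behaviour: common truthy spellings map to True,
        # everything else to False.
        return os.getenv(name, default).strip().lower() in ("true", "1")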

Comment on lines 320 to 330:

    profiler = SGLDiffusionProfiler(
        request_id=profile_id,
        rank=self.rank,
        full_profile=True,  # Always profile all stages
        activities=activities,
        num_steps=-1,  # Profile all steps
        num_inference_steps=100,  # Large number to capture all
        log_dir=output_dir,
        with_stack=with_stack,
        record_shapes=record_shapes,
    )
Severity: medium

When full_profile=True is passed to SGLDiffusionProfiler, the num_steps and num_inference_steps parameters are ignored because a continuous profiler is used instead of a scheduled one. Passing these parameters can be confusing for future readers. It would be cleaner to omit them when they are not used.

            profiler = SGLDiffusionProfiler(
                request_id=profile_id,
                rank=self.rank,
                full_profile=True,  # Always profile all stages
                activities=activities,
                log_dir=output_dir,
                with_stack=with_stack,
                record_shapes=record_shapes,
            )
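
For context on why those parameters are ignored: torch.profiler can either run continuously for the whole session or follow a step schedule, and the step-count parameters presumably only matter in the scheduled mode. A minimal scheduled-profiler sketch (CPU-only so it runs anywhere; a continuous profiler simply omits the schedule):

    import torch
    from torch.profiler import ProfilerActivity, profile, schedule

    # Hypothetical step budget: skip 1 step, warm up for 1, record 4.
    sched = schedule(wait=1, warmup=1, active=4)

    with profile(activities=[ProfilerActivity.CPU], schedule=sched) as prof:
        for _ in range(8):
            torch.randn(64, 64) @ torch.randn(64, 64)  # stand-in workload
            prof.step()  # advance the schedule; only "active" steps are recorded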

@ping1jing2 (Collaborator) commented

@Makcum888e please help to review this PR if you are free

@zhyajie force-pushed the qwen_image_profling branch from f3cc742 to 3f2a858 on February 8, 2026 12:52
@zhyajie (Author) commented Feb 8, 2026

@ping1jing2 Thank you for your careful review. I have revised the code according to the code review feedback.

@zhyajie force-pushed the qwen_image_profling branch from 0f1cc88 to 94ad30d on February 10, 2026 09:09
@zhyajie (Author) commented Feb 10, 2026

I rebased on upstream/main and resolved some conflicts.

@ping1jing2 (Collaborator) commented

@zhyajie please ping me to trigger the CI after you resolve the conflicts.

@ping1jing2 self-assigned this on Feb 17, 2026
@github-actions bot added the documentation (Improvements or additions to documentation), amd, lora, speculative-decoding, and npu labels on Feb 18, 2026
@zhyajie (Author) commented Feb 18, 2026

@ping1jing2 I've rebased onto the latest main branch, fixed the conflicts, and tested the features locally. Please help me trigger the CI. Thanks a lot!

@zhyajie force-pushed the qwen_image_profling branch from 2c784a5 to 8dc9eed on February 18, 2026 10:17
@ping1jing2 (Collaborator) commented Feb 18, 2026

> @ping1jing2 I've rebased onto the latest main branch, fixed the conflicts, and tested the features locally. Please help me trigger the CI. Thanks a lot!

https://github.com/sgl-project/sglang/actions/runs/22137625096/job/63993329620?pr=18367

Please resolve the lint issue:

pip3 install pre-commit
pre-commit install
pre-commit run --all-files

@zhyajie (Author) commented Feb 18, 2026

> > @ping1jing2 I've rebased onto the latest main branch, fixed the conflicts, and tested the features locally. Please help me trigger the CI. Thanks a lot!
>
> https://github.com/sgl-project/sglang/actions/runs/22137625096/job/63993329620?pr=18367
>
> Please resolve the lint issue:
>
> pip3 install pre-commit
> pre-commit install
> pre-commit run --all-files

Done. Please try again, thank you.

@ping1jing2 (Collaborator) commented Feb 18, 2026

> Done. Please try again, thank you.

@zhyajie Sorry for replying late, and sorry that you need to resolve the code conflicts again. BTW, I already added the run-ci label, so the CI will be triggered after you push a commit.

@zhyajie (Author) commented Feb 19, 2026

@ping1jing2 I fixed the conflicts and finished the CI test. Please take a look when you have time. Thanks!

@zhyajie (Author) commented Feb 19, 2026

@ping1jing2 There was a new conflict today; I have resolved it and verified the profiling feature locally.

@ping1jing2 (Collaborator) commented

> @ping1jing2 There was a new conflict today; I have resolved it and verified the profiling feature locally.

This PR is OK for me, but I can't merge it. Please ping Mick or bbuf later. Thanks for your contribution!

@zhyajie (Author) commented Feb 19, 2026

> > @ping1jing2 There was a new conflict today; I have resolved it and verified the profiling feature locally.
>
> This PR is OK for me, but I can't merge it. Please ping Mick or bbuf later. Thanks for your contribution!

@ping1jing2 Thank you for your reply. @mickqian Could you please review this PR when you have time? It modifies numerous files, so it frequently runs into merge conflicts.


Labels

amd, diffusion (SGLang Diffusion), documentation (Improvements or additions to documentation), lora, npu, run-ci, speculative-decoding

3 participants