
[diffusion] chore: improve memory usage on consumer-level GPU #18997

Merged: mickqian merged 5 commits into main from diffusion-refactor on Feb 19, 2026

Conversation

@mickqian (Collaborator) commented Feb 19, 2026

Previously, --dit-cpu-offload was disabled automatically whenever --dit-layerwise-offload was set, making Flux.1-dev unable to run on a 4090.

Motivation

Set --dit-cpu-offload to false only when it is not explicitly set.

sglang generate --pin-cpu-memory --prompt='A curious raccoon' --log-level=debug --width=720 --height=720 --output-path=outputs --model-path=black-forest-labs/FLUX.1-dev --save-output --output-file-name="flux-original" --text-encoder-cpu-offload --dit-layerwise-offload --dit-cpu-offload --dit-offload-prefetch-size=5
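
For context, the fix hinges on treating dit_cpu_offload as a tri-state option, where None means "not explicitly set by the user". Below is a minimal, self-contained sketch of that pattern; the flag names match the CLI above, but the class and helper are illustrative, not the actual server_args.py code:

from dataclasses import dataclass
from typing import Optional

@dataclass
class OffloadArgs:
    dit_layerwise_offload: bool = False
    dit_cpu_offload: Optional[bool] = None  # None = not explicitly set

    def resolve(self) -> None:
        # Before this PR: dit_cpu_offload was forced off whenever
        # layerwise offload was on, even if the user passed the flag.
        # After this PR: only the unset default (None) is overridden.
        if self.dit_layerwise_offload and self.dit_cpu_offload is None:
            self.dit_cpu_offload = False

args = OffloadArgs(dit_layerwise_offload=True, dit_cpu_offload=True)
args.resolve()
assert args.dit_cpu_offload is True  # explicit user choice survives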

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

github-actions bot added the diffusion (SGLang Diffusion) label on Feb 19, 2026
@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @mickqian, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing memory efficiency and performance, particularly for systems with consumer-grade GPUs. It introduces intelligent offloading mechanisms to prevent out-of-memory issues during model loading and execution, and automates parallelism settings for better resource utilization. Additionally, it improves debugging capabilities by logging GPU memory usage at critical stages.

Highlights

  • Memory Optimization for Low-Memory GPUs: Implemented automatic CPU offloading for all model components (DIT, text encoder, image encoder, VAE) when the detected GPU has less than 30GB of total memory, significantly improving memory usage on consumer-level GPUs.
  • Improved Model Loading with Layerwise Offload: Modified the model loading process to ensure CPU offloading is always active during initial weight loading when dit_layerwise_offload is enabled. This prevents GPU out-of-memory errors by loading weights to CPU first, even if dit_cpu_offload is otherwise disabled at runtime.
  • Automatic Parallelism Configuration: Introduced logic to automatically enable enable_cfg_parallel for optimal performance if it's not explicitly set and there are at least two GPU groups available after considering data and tensor parallelism (see the sketch after this list).
  • Enhanced Performance Logging: Added logging of available GPU memory at the start of each performance stage when debug logging is enabled, providing better insights into memory utilization during execution.
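
A hypothetical sketch of the cfg-parallel auto-enable check described above; the grouping arithmetic and the names world_size, dp_size, and tp_size are assumptions for illustration, not the PR's actual code:

from typing import Optional

def maybe_enable_cfg_parallel(
    enable_cfg_parallel: Optional[bool],
    world_size: int,
    dp_size: int,
    tp_size: int,
) -> bool:
    if enable_cfg_parallel is not None:
        return enable_cfg_parallel  # respect an explicit user setting
    # GPU groups left over after data and tensor parallelism claim theirs.
    remaining_groups = world_size // (dp_size * tp_size)
    return remaining_groups >= 2

print(maybe_enable_cfg_parallel(None, world_size=4, dp_size=1, tp_size=2))  # True
print(maybe_enable_cfg_parallel(None, world_size=2, dp_size=1, tp_size=2))  # False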


Changelog
  • python/sglang/multimodal_gen/runtime/loader/component_loaders/transformer_loader.py
    • Updated the cpu_offload parameter in maybe_load_fsdp_model to ensure weights are loaded to CPU first when dit_layerwise_offload is enabled, preventing GPU OOM during loading.
  • python/sglang/multimodal_gen/runtime/server_args.py
    • Imported BYTES_PER_GB constant.
    • Implemented automatic CPU offloading for all components when running on GPUs with less than 30GB of memory.
    • Introduced automatic enabling of enable_cfg_parallel for optimal performance when sufficient GPU groups are available.
    • Adjusted the condition for automatically disabling dit_cpu_offload when dit_layerwise_offload is active, now only if dit_cpu_offload is not explicitly set.
  • python/sglang/multimodal_gen/runtime/utils/perf_logger.py
    • Imported the logging module.
    • Imported current_platform for platform-specific information.
    • Enhanced performance logging to include available GPU memory at the start of each stage when debug logging is active (sketched below).
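
A hedged sketch of what the perf_logger.py change might look like: free GPU memory is logged at the start of each stage only when debug logging is on. torch.cuda.mem_get_info is a real PyTorch API; the log_stage_start hook and the logger name are assumptions:

import logging
import torch

logger = logging.getLogger("perf_logger")

def log_stage_start(stage_name: str) -> None:
    # Query memory only when the debug level is active and CUDA exists.
    if logger.isEnabledFor(logging.DEBUG) and torch.cuda.is_available():
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        logger.debug(
            "stage %s: %.2f / %.2f GB GPU memory available",
            stage_name,
            free_bytes / 2**30,
            total_bytes / 2**30,
        )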

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist (Contributor) left a comment:


Code Review

This pull request aims to improve memory usage on consumer-level GPUs by adjusting offloading strategies. The changes in transformer_loader.py and perf_logger.py are beneficial. However, I've identified a critical issue in server_args.py where conflicting settings can be enabled, and a separate maintainability issue due to code duplication. I have provided suggestions to address these points.

Comment on lines +416 to +425
if current_platform.get_device_total_memory() / BYTES_PER_GB < 30:
    logger.info("Enabling all offloading for GPU with low device memory")
    if self.dit_cpu_offload is None:
        self.dit_cpu_offload = True
    if self.text_encoder_cpu_offload is None:
        self.text_encoder_cpu_offload = True
    if self.image_encoder_cpu_offload is None:
        self.image_encoder_cpu_offload = True
    if self.vae_cpu_offload is None:
        self.vae_cpu_offload = True

Severity: medium

The added logic for low-memory GPUs is nearly identical to the else block on lines 438-447, which introduces code duplication. This can make future modifications more error-prone.

Additionally, the value 30 is a magic number. It would be better to define it as a constant, for example LOW_GPU_MEMORY_THRESHOLD_GB = 30, to improve readability and make it easier to change.

Please consider refactoring the _adjust_offload method to eliminate this duplication.
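
One way to apply both suggestions at once (a sketch only: the constant and helper names are illustrative, while current_platform, BYTES_PER_GB, and logger are the names already used in the PR):

LOW_GPU_MEMORY_THRESHOLD_GB = 30  # replaces the magic number

_OFFLOAD_ATTRS = (
    "dit_cpu_offload",
    "text_encoder_cpu_offload",
    "image_encoder_cpu_offload",
    "vae_cpu_offload",
)

def _maybe_enable_all_offloading(self) -> None:
    total_gb = current_platform.get_device_total_memory() / BYTES_PER_GB
    if total_gb < LOW_GPU_MEMORY_THRESHOLD_GB:
        logger.info("Enabling all offloading for GPU with low device memory")
        for attr in _OFFLOAD_ATTRS:
            if getattr(self, attr) is None:  # fill only unset defaults
                setattr(self, attr, True)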

@mickqian (Collaborator, Author) commented:

/tag-and-rerun-ci

@mickqian (Collaborator, Author) commented:

/gemini review

@gemini-code-assist (Contributor) left a comment:


Code Review

This pull request introduces a mechanism to automatically enable CPU offloading on GPUs with low memory, which is a good step towards improving usability on consumer-level hardware. My review uncovered a critical issue in the validation logic that could lead to conflicting configurations, and I've also provided a suggestion to refactor the new offloading logic for better maintainability.

  self.use_fsdp_inference = False

- if self.dit_cpu_offload:
+ if self.dit_cpu_offload is None:

Severity: critical

This change from if self.dit_cpu_offload: to if self.dit_cpu_offload is None: appears to introduce a bug. When dit_layerwise_offload is enabled, dit_cpu_offload must be disabled to prevent conflicts. However, with this change, if _adjust_offload sets dit_cpu_offload=True on a low-memory GPU, this condition will be false, and dit_cpu_offload will incorrectly remain True, leading to a configuration conflict.

The previous logic correctly handled this by disabling dit_cpu_offload whenever it was enabled. Please revert this change to ensure the conflict is always resolved.

Suggested change:
- if self.dit_cpu_offload is None:
+ if self.dit_cpu_offload:
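
To make the failure mode concrete, here is a small self-contained illustration of the ordering the reviewer describes (the attribute names mirror the PR; the class is a stand-in for the real server-args object):

class Args:  # stand-in for the real server args
    dit_layerwise_offload = True
    dit_cpu_offload = None  # not explicitly set by the user

args = Args()

# Step 1: the low-memory auto-offload fills the unset default with True.
if args.dit_cpu_offload is None:
    args.dit_cpu_offload = True

# Step 2: the layerwise-offload guard from this PR no longer fires,
# because the value is now True rather than None.
if args.dit_layerwise_offload and args.dit_cpu_offload is None:
    args.dit_cpu_offload = False

assert args.dit_cpu_offload is True  # the conflicting config persists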

Comment on lines +416 to +425
if current_platform.get_device_total_memory() / BYTES_PER_GB < 30:
    logger.info("Enabling all offloading for GPU with low device memory")
    if self.dit_cpu_offload is None:
        self.dit_cpu_offload = True
    if self.text_encoder_cpu_offload is None:
        self.text_encoder_cpu_offload = True
    if self.image_encoder_cpu_offload is None:
        self.image_encoder_cpu_offload = True
    if self.vae_cpu_offload is None:
        self.vae_cpu_offload = True

Severity: medium

The hardcoded value 30 for the memory threshold can be extracted into a named constant to improve readability. Additionally, the series of if statements to enable offloading for different components is repetitive. This block can be refactored into a loop to make the code more concise and easier to maintain.

Suggested change:
- if current_platform.get_device_total_memory() / BYTES_PER_GB < 30:
-     logger.info("Enabling all offloading for GPU with low device memory")
-     if self.dit_cpu_offload is None:
-         self.dit_cpu_offload = True
-     if self.text_encoder_cpu_offload is None:
-         self.text_encoder_cpu_offload = True
-     if self.image_encoder_cpu_offload is None:
-         self.image_encoder_cpu_offload = True
-     if self.vae_cpu_offload is None:
-         self.vae_cpu_offload = True
+ if current_platform.get_device_total_memory() / BYTES_PER_GB < 30:
+     logger.info("Enabling all offloading for GPU with low device memory")
+     offload_attrs = [
+         "dit_cpu_offload",
+         "text_encoder_cpu_offload",
+         "image_encoder_cpu_offload",
+         "vae_cpu_offload",
+     ]
+     for attr in offload_attrs:
+         if getattr(self, attr) is None:
+             setattr(self, attr, True)

mickqian merged commit d73f06f into main on Feb 19, 2026 (146 of 152 checks passed).
mickqian deleted the diffusion-refactor branch on February 19, 2026 at 13:59.

Labels

diffusion (SGLang Diffusion), run-ci