
[diffusion] chore: improve memory usage on consumer-level GPU #18997

Merged: mickqian merged 5 commits into main from diffusion-refactor on Feb 19, 2026

Conversation

@mickqian (Collaborator) commented Feb 19, 2026

Previously, --dit-cpu-offload was disabled automatically whenever --dit-layerwise-offload was set, making Flux.1-dev unable to run on a 4090.

Motivation

Set --dit-cpu-offload to false only when it is not explicitly set.

sglang generate --pin-cpu-memory --prompt='A curious raccoon' --log-level=debug --width=720 --height=720 --output-path=outputs --model-path=black-forest-labs/FLUX.1-dev --save-output --output-file-name="flux-original" --text-encoder-cpu-offload --dit-layerwise-offload --dit-cpu-offload --dit-offload-prefetch-size=5
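
For context, the fix hinges on treating dit_cpu_offload as a tri-state option, where None means "not explicitly set by the user". Below is a minimal, self-contained sketch of that pattern; the flag names match the CLI above, but the class and helper are illustrative, not the actual server_args.py code:

from dataclasses import dataclass
from typing import Optional

@dataclass
class OffloadArgs:
    dit_layerwise_offload: bool = False
    dit_cpu_offload: Optional[bool] = None  # None = not explicitly set

    def resolve(self) -> None:
        # Before this PR: dit_cpu_offload was forced off whenever
        # layerwise offload was on, even if the user passed the flag.
        # After this PR: only the unset default (None) is overridden.
        if self.dit_layerwise_offload and self.dit_cpu_offload is None:
            self.dit_cpu_offload = False

args = OffloadArgs(dit_layerwise_offload=True, dit_cpu_offload=True)
args.resolve()
assert args.dit_cpu_offload is True  # explicit user choice survives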

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

github-actions bot added the diffusion (SGLang Diffusion) label on Feb 19, 2026
@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @mickqian, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing memory efficiency and performance, particularly for systems with consumer-grade GPUs. It introduces intelligent offloading mechanisms to prevent out-of-memory issues during model loading and execution, and automates parallelism settings for better resource utilization. Additionally, it improves debugging capabilities by logging GPU memory usage at critical stages.

Highlights

  • Memory Optimization for Low-Memory GPUs: Implemented automatic CPU offloading for all model components (DIT, text encoder, image encoder, VAE) when the detected GPU has less than 30GB of total memory, significantly improving memory usage on consumer-level GPUs.
  • Improved Model Loading with Layerwise Offload: Modified the model loading process to ensure CPU offloading is always active during initial weight loading when dit_layerwise_offload is enabled. This prevents GPU out-of-memory errors by loading weights to CPU first, even if dit_cpu_offload is otherwise disabled at runtime.
  • Automatic Parallelism Configuration: Introduced logic to automatically enable enable_cfg_parallel for optimal performance if it's not explicitly set and there are at least two GPU groups available after considering data and tensor parallelism (see the sketch after this list).
  • Enhanced Performance Logging: Added logging of available GPU memory at the start of each performance stage when debug logging is enabled, providing better insights into memory utilization during execution.
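
A hypothetical sketch of the cfg-parallel auto-enable check described above; the grouping arithmetic and the names world_size, dp_size, and tp_size are assumptions for illustration, not the PR's actual code:

from typing import Optional

def maybe_enable_cfg_parallel(
    enable_cfg_parallel: Optional[bool],
    world_size: int,
    dp_size: int,
    tp_size: int,
) -> bool:
    if enable_cfg_parallel is not None:
        return enable_cfg_parallel  # respect an explicit user setting
    # GPU groups left over after data and tensor parallelism claim theirs.
    remaining_groups = world_size // (dp_size * tp_size)
    return remaining_groups >= 2

print(maybe_enable_cfg_parallel(None, world_size=4, dp_size=1, tp_size=2))  # True
print(maybe_enable_cfg_parallel(None, world_size=2, dp_size=1, tp_size=2))  # False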


Changelog
  • python/sglang/multimodal_gen/runtime/loader/component_loaders/transformer_loader.py
    • Updated the cpu_offload parameter in maybe_load_fsdp_model to ensure weights are loaded to CPU first when dit_layerwise_offload is enabled, preventing GPU OOM during loading.
  • python/sglang/multimodal_gen/runtime/server_args.py
    • Imported BYTES_PER_GB constant.
    • Implemented automatic CPU offloading for all components when running on GPUs with less than 30GB of memory.
    • Introduced automatic enabling of enable_cfg_parallel for optimal performance when sufficient GPU groups are available.
    • Adjusted the condition for automatically disabling dit_cpu_offload when dit_layerwise_offload is active, now only if dit_cpu_offload is not explicitly set.
  • python/sglang/multimodal_gen/runtime/utils/perf_logger.py
    • Imported the logging module.
    • Imported current_platform for platform-specific information.
    • Enhanced performance logging to include available GPU memory at the start of each stage when debug logging is active (sketched below).
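
A hedged sketch of what the perf_logger.py change might look like: free GPU memory is logged at the start of each stage only when debug logging is on. torch.cuda.mem_get_info is a real PyTorch API; the log_stage_start hook and the logger name are assumptions:

import logging
import torch

logger = logging.getLogger("perf_logger")

def log_stage_start(stage_name: str) -> None:
    # Query memory only when the debug level is active and CUDA exists.
    if logger.isEnabledFor(logging.DEBUG) and torch.cuda.is_available():
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        logger.debug(
            "stage %s: %.2f / %.2f GB GPU memory available",
            stage_name,
            free_bytes / 2**30,
            total_bytes / 2**30,
        )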

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist (Contributor) left a comment:


Code Review

This pull request aims to improve memory usage on consumer-level GPUs by adjusting offloading strategies. The changes in transformer_loader.py and perf_logger.py are beneficial. However, I've identified a critical issue in server_args.py where conflicting settings can be enabled, and a separate maintainability issue due to code duplication. I have provided suggestions to address these points.

Comment on lines +416 to +425
if current_platform.get_device_total_memory() / BYTES_PER_GB < 30:
    logger.info("Enabling all offloading for GPU with low device memory")
    if self.dit_cpu_offload is None:
        self.dit_cpu_offload = True
    if self.text_encoder_cpu_offload is None:
        self.text_encoder_cpu_offload = True
    if self.image_encoder_cpu_offload is None:
        self.image_encoder_cpu_offload = True
    if self.vae_cpu_offload is None:
        self.vae_cpu_offload = True

Severity: medium

The added logic for low-memory GPUs is nearly identical to the else block on lines 438-447, which introduces code duplication. This can make future modifications more error-prone.

Additionally, the value 30 is a magic number. It would be better to define it as a constant, for example LOW_GPU_MEMORY_THRESHOLD_GB = 30, to improve readability and make it easier to change.

Please consider refactoring the _adjust_offload method to eliminate this duplication.
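
One way to apply both suggestions at once (a sketch only: the constant and helper names are illustrative, while current_platform, BYTES_PER_GB, and logger are the names already used in the PR):

LOW_GPU_MEMORY_THRESHOLD_GB = 30  # replaces the magic number

_OFFLOAD_ATTRS = (
    "dit_cpu_offload",
    "text_encoder_cpu_offload",
    "image_encoder_cpu_offload",
    "vae_cpu_offload",
)

def _maybe_enable_all_offloading(self) -> None:
    total_gb = current_platform.get_device_total_memory() / BYTES_PER_GB
    if total_gb < LOW_GPU_MEMORY_THRESHOLD_GB:
        logger.info("Enabling all offloading for GPU with low device memory")
        for attr in _OFFLOAD_ATTRS:
            if getattr(self, attr) is None:  # fill only unset defaults
                setattr(self, attr, True)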

@mickqian (Collaborator, Author) commented:

/tag-and-rerun-ci

@mickqian (Collaborator, Author) commented:

/gemini review

@gemini-code-assist (Contributor) left a comment:


Code Review

This pull request introduces a mechanism to automatically enable CPU offloading on GPUs with low memory, which is a good step towards improving usability on consumer-level hardware. My review uncovered a critical issue in the validation logic that could lead to conflicting configurations, and I've also provided a suggestion to refactor the new offloading logic for better maintainability.

  self.use_fsdp_inference = False

- if self.dit_cpu_offload:
+ if self.dit_cpu_offload is None:

Severity: critical

This change from if self.dit_cpu_offload: to if self.dit_cpu_offload is None: appears to introduce a bug. When dit_layerwise_offload is enabled, dit_cpu_offload must be disabled to prevent conflicts. However, with this change, if _adjust_offload sets dit_cpu_offload=True on a low-memory GPU, this condition will be false, and dit_cpu_offload will incorrectly remain True, leading to a configuration conflict.

The previous logic correctly handled this by disabling dit_cpu_offload whenever it was enabled. Please revert this change to ensure the conflict is always resolved.

Suggested change:
- if self.dit_cpu_offload is None:
+ if self.dit_cpu_offload:
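
To make the failure mode concrete, here is a small self-contained illustration of the ordering the reviewer describes (the attribute names mirror the PR; the class is a stand-in for the real server-args object):

class Args:  # stand-in for the real server args
    dit_layerwise_offload = True
    dit_cpu_offload = None  # not explicitly set by the user

args = Args()

# Step 1: the low-memory auto-offload fills the unset default with True.
if args.dit_cpu_offload is None:
    args.dit_cpu_offload = True

# Step 2: the layerwise-offload guard from this PR no longer fires,
# because the value is now True rather than None.
if args.dit_layerwise_offload and args.dit_cpu_offload is None:
    args.dit_cpu_offload = False

assert args.dit_cpu_offload is True  # the conflicting config persists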

Comment on lines +416 to +425
if current_platform.get_device_total_memory() / BYTES_PER_GB < 30:
    logger.info("Enabling all offloading for GPU with low device memory")
    if self.dit_cpu_offload is None:
        self.dit_cpu_offload = True
    if self.text_encoder_cpu_offload is None:
        self.text_encoder_cpu_offload = True
    if self.image_encoder_cpu_offload is None:
        self.image_encoder_cpu_offload = True
    if self.vae_cpu_offload is None:
        self.vae_cpu_offload = True

Severity: medium

The hardcoded value 30 for the memory threshold can be extracted into a named constant to improve readability. Additionally, the series of if statements to enable offloading for different components is repetitive. This block can be refactored into a loop to make the code more concise and easier to maintain.

Suggested change:
- if current_platform.get_device_total_memory() / BYTES_PER_GB < 30:
-     logger.info("Enabling all offloading for GPU with low device memory")
-     if self.dit_cpu_offload is None:
-         self.dit_cpu_offload = True
-     if self.text_encoder_cpu_offload is None:
-         self.text_encoder_cpu_offload = True
-     if self.image_encoder_cpu_offload is None:
-         self.image_encoder_cpu_offload = True
-     if self.vae_cpu_offload is None:
-         self.vae_cpu_offload = True
+ if current_platform.get_device_total_memory() / BYTES_PER_GB < 30:
+     logger.info("Enabling all offloading for GPU with low device memory")
+     offload_attrs = [
+         "dit_cpu_offload",
+         "text_encoder_cpu_offload",
+         "image_encoder_cpu_offload",
+         "vae_cpu_offload",
+     ]
+     for attr in offload_attrs:
+         if getattr(self, attr) is None:
+             setattr(self, attr, True)

mickqian merged commit d73f06f into main on Feb 19, 2026 (146 of 152 checks passed).
mickqian deleted the diffusion-refactor branch on February 19, 2026 at 13:59.

Labels

diffusion (SGLang Diffusion), run-ci