Description
I've noticed that when using Flux models, the model transfer time (the "Moving model(s)" step) keeps getting longer with each new generation.
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: f2.0.1v1.10.1-previous-644-gde1670a4
Commit hash: de1670a
Launching Web UI with arguments:
Total VRAM 8191 MB, total RAM 16335 MB
pytorch version: 2.4.0+cu124
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3050 : native
Hint: your device supports --cuda-malloc for potential speed improvements.
VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16
CUDA Using Stream: False
D:\AI\Forge\system\python\lib\site-packages\transformers\utils\hub.py:128: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
Using pytorch cross attention
Using pytorch attention for VAE
ControlNet preprocessor location: D:\AI\Forge\webui\models\ControlNetPreprocessor
2025-02-13 20:27:02,860 - ControlNet - INFO - ControlNet UI callback registered.
Model selected: {'checkpoint_info': {'filename': 'D:\AI\Forge\webui\models\Stable-diffusion\flux1-schnell-bnb-nf4.safetensors', 'hash': '7d3d1873'}, 'additional_modules': ['D:\AI\Forge\webui\models\text_encoder\t5xxl_fp8_e4m3fn.safetensors', 'D:\AI\Forge\webui\models\text_encoder\clip_l.safetensors', 'D:\AI\Forge\webui\models\VAE\ae.safetensors'], 'unet_storage_dtype': None}
Using online LoRAs in FP16: False
Running on local URL: http://127.0.0.1:7860
To create a public link, set share=True in launch()
Startup time: 78.8s (prepare environment: 19.2s, launcher: 1.8s, import torch: 37.3s, initialize shared: 1.8s, other imports: 2.0s, setup gfpgan: 0.2s, list SD models: 0.6s, load scripts: 9.0s, create ui: 3.5s, gradio launch: 3.9s).
Model selected: {'checkpoint_info': {'filename': 'D:\AI\Forge\webui\models\Stable-diffusion\flux1-dev-bnb-nf4-v2.safetensors', 'hash': 'f0770152'}, 'additional_modules': ['D:\AI\Forge\webui\models\text_encoder\t5xxl_fp8_e4m3fn.safetensors', 'D:\AI\Forge\webui\models\text_encoder\clip_l.safetensors', 'D:\AI\Forge\webui\models\VAE\ae.safetensors'], 'unet_storage_dtype': None}
Using online LoRAs in FP16: False
Loading Model: {'checkpoint_info': {'filename': 'D:\AI\Forge\webui\models\Stable-diffusion\flux1-dev-bnb-nf4-v2.safetensors', 'hash': 'f0770152'}, 'additional_modules': ['D:\AI\Forge\webui\models\text_encoder\t5xxl_fp8_e4m3fn.safetensors', 'D:\AI\Forge\webui\models\text_encoder\clip_l.safetensors', 'D:\AI\Forge\webui\models\VAE\ae.safetensors'], 'unet_storage_dtype': None}
[Unload] Trying to free all memory for cuda:0 with 0 models keep loaded ... Done.
StateDict Keys: {'transformer': 1722, 'vae': 244, 'text_encoder': 196, 'text_encoder_2': 220, 'ignore': 0}
Using Detected T5 Data Type: torch.float8_e4m3fn
Using Detected UNet Type: nf4
Using pre-quant state dict!
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
K-Model Created: {'storage_dtype': 'nf4', 'computation_dtype': torch.bfloat16}
Model loaded in 22.3s (unload existing model: 0.2s, forge model load: 22.1s).
[LORA] Loaded D:\AI\Forge\webui\models\Lora\Anime_Furry_Style_Flux.safetensors for KModel-UNet with 304 keys at weight 0.7 (skipped 0 keys) with on_the_fly = False
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
[Unload] Trying to free 7723.54 MB for cuda:0 with 0 models keep loaded ... Done.
[Memory Management] Target: JointTextEncoder, Free GPU: 7184.00 MB, Model Require: 5153.49 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: 1006.51 MB, All loaded to GPU.
Moving model(s) has taken 24.04 seconds
Distilled CFG Scale: 3.5
[Unload] Trying to free 9411.13 MB for cuda:0 with 0 models keep loaded ... Current free memory is 1911.42 MB ... Unload model JointTextEncoder Done.
[Memory Management] Target: KModel, Free GPU: 7144.03 MB, Model Require: 6246.84 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: -126.81 MB, CPU Swap Loaded (blocked method): 1435.50 MB, GPU Loaded: 4811.34 MB
Moving model(s) has taken 148.36 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [01:12<00:00, 7.28s/it]
[Unload] Trying to free 4495.77 MB for cuda:0 with 0 models keep loaded ... Current free memory is 2125.69 MB ... Unload model KModel Done.
[Memory Management] Target: IntegratedAutoencoderKL, Free GPU: 7134.06 MB, Model Require: 159.87 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: 5950.19 MB, All loaded to GPU.
Moving model(s) has taken 54.55 seconds
Total progress: 100%|██████████████████████████████████████████████████████████████████| 10/10 [02:20<00:00, 14.03s/it]
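The [Memory Management] lines above follow a simple budget: Remaining = Free GPU − Model Require − Inference Require, and a negative remainder forces part of the model into CPU swap. A minimal sketch of that arithmetic (the function name is mine, not Forge's; note Forge swaps whole blocks, which is why its actual 1435.50 MB swap exceeds the raw 126.81 MB deficit):

```python
def plan_load(free_gpu_mb, model_require_mb, inference_require_mb):
    """Mimic the budgeting shown in the [Memory Management] log lines."""
    remaining = free_gpu_mb - model_require_mb - inference_require_mb
    if remaining >= 0:
        # Everything fits: "All loaded to GPU."
        return {"gpu_loaded_mb": model_require_mb, "cpu_swap_mb": 0.0}
    # Shortfall goes to CPU swap ("blocked method" in the log rounds this
    # up to whole transformer blocks, so real swap is larger).
    cpu_swap = -remaining
    return {"gpu_loaded_mb": model_require_mb - cpu_swap, "cpu_swap_mb": cpu_swap}

# JointTextEncoder line: 7184.00 - 5153.49 - 1024.00 = 1006.51 MB remaining
print(plan_load(7184.00, 5153.49, 1024.00))
# KModel line: 7144.03 - 6246.84 - 1024.00 = -126.81 MB -> swap needed
print(plan_load(7144.03, 6246.84, 1024.00))
```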
Environment vars changed: {'stream': True, 'inference_memory': 1024.0, 'pin_shared_memory': False}
[GPU Setting] You will use 87.50% GPU memory (7167.00 MB) to load weights, and use 12.50% GPU memory (1024.00 MB) to do matrix computation.
Environment vars changed: {'stream': False, 'inference_memory': 1024.0, 'pin_shared_memory': False}
[GPU Setting] You will use 87.50% GPU memory (7167.00 MB) to load weights, and use 12.50% GPU memory (1024.00 MB) to do matrix computation.
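The [GPU Setting] lines split total VRAM 87.50% / 12.50% between weight storage and matrix computation, which matches the 8191 MB reported at startup (a sketch of the arithmetic, not Forge's actual code):

```python
total_vram_mb = 8191  # from "Total VRAM 8191 MB" in the startup log

weights_mb = int(total_vram_mb * 0.875)   # 87.50% for loading weights
compute_mb = total_vram_mb - weights_mb   # remainder for matrix computation

print(weights_mb, compute_mb)  # matches the 7167.00 MB / 1024.00 MB in the log
```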
[Unload] Trying to free all memory for cuda:0 with 0 models keep loaded ... Current free memory is 6974.41 MB ... Unload model IntegratedAutoencoderKL Done.
[LORA] Loaded D:\AI\Forge\webui\models\Lora\Anime_Furry_Style_Flux.safetensors for KModel-UNet with 304 keys at weight 0.9 (skipped 0 keys) with on_the_fly = False
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
[Unload] Trying to free 7817.77 MB for cuda:0 with 0 models keep loaded ... Done.
[Memory Management] Target: JointTextEncoder, Free GPU: 7135.05 MB, Model Require: 5225.98 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: 885.07 MB, All loaded to GPU.
Moving model(s) has taken 244.25 seconds
Distilled CFG Scale: 3.5
[Unload] Trying to free 9411.08 MB for cuda:0 with 0 models keep loaded ... Current free memory is 1900.58 MB ... Unload model JointTextEncoder Done.
[Memory Management] Target: KModel, Free GPU: 7130.06 MB, Model Require: 6246.80 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: -140.74 MB, CPU Swap Loaded (blocked method): 1435.50 MB, GPU Loaded: 4811.30 MB
Moving model(s) has taken 687.08 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [01:06<00:00, 6.69s/it]
[Unload] Trying to free 4495.77 MB for cuda:0 with 0 models keep loaded ... Current free memory is 2127.72 MB ... Unload model KModel Done.
[Memory Management] Target: IntegratedAutoencoderKL, Free GPU: 7128.09 MB, Model Require: 159.87 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: 5944.22 MB, All loaded to GPU.
Moving model(s) has taken 426.07 seconds
Total progress: 100%|██████████████████████████████████████████████████████████████████| 10/10 [09:14<00:00, 55.42s/it]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 10/10 [09:13<00:00, 5.14s/it]
I had a similar problem a few months ago (slow model moving) and solved it with the GPU_For_T5 extension (https://github.com/Juqowel/GPU_For_T5) by assigning T5 to the CPU. After a while the problem resolved itself, and the extension no longer made any difference.
I have now tried the extension again and it helped, but I want to understand what causes such a large difference in speed, as I didn't find a direct answer (or didn't understand it) here: #1591
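One way to quantify the slowdown from the log: the KModel move shifts the same 6246.84 MB of weights (4811.34 MB to GPU plus 1435.50 MB CPU swap) each time, so each move time implies an effective transfer rate. A small helper (purely illustrative, my own code):

```python
def effective_bandwidth_mb_s(moved_mb, seconds):
    """Effective transfer rate implied by a 'Moving model(s)' log line."""
    return moved_mb / seconds

moved = 4811.34 + 1435.50  # GPU Loaded + CPU Swap Loaded, in MB

# First generation: moved in 148.36 s
print(round(effective_bandwidth_mb_s(moved, 148.36), 1))  # ~42 MB/s
# Later generation: same amount in 687.08 s
print(round(effective_bandwidth_mb_s(moved, 687.08), 1))  # ~9 MB/s
```

Rates in the tens of MB/s, let alone single digits, are far below what a PCIe link sustains, which would be consistent with the 16 GB of system RAM filling up and weights paging through the disk, though the log alone can't confirm that reading.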