Description
Hello there!
I used to generate Wan 2.1 (First Last Frame to Video 720p (FLF2V) 14B) videos with SageAttention 2 + Compile Transformer Model on an old laptop with an RTX 3060 (6 GB VRAM). It was slow, but there were no OOMs; compiling took ages, but it always completed.
Now, running the exact same process (I took the rendering settings from a video rendered on that laptop) on a brand-new desktop with an RTX 5060 Ti (16 GB VRAM) and 32 GB of RAM, I'm plagued with OOMs and errors. Inference is only possible with "Compile Transformer Model" turned off, and no more than 7 GB of VRAM is used at any given time.
I get these OOMs while the VRAM usage stays flat:
Lora 'loras\wan_i2v\Wan2.1_I2V_14B_FusionX_LoRA.safetensors' was loaded in model 'models.wan.modules.model'
Unable to pin data of 'loras\wan_i2v\Wan2.1_I2V_14B_FusionX_LoRA.safetensors' to reserved RAM as there is no reserved RAM left. Transfer speed from RAM to VRAM will may be slower.
Traceback (most recent call last):
File "C:\Users\user\Desktop\pourwangp\ComfyUI_windows_portable\Wan2GP-main\wgp.py", line 5624, in generate_video
samples = wan_model.generate(
input_prompt = prompt,
...<80 lines>...
temperature=temperature,
)
File "C:\Users\user\Desktop\pourwangp\ComfyUI_windows_portable\Wan2GP-main\models\wan\any2video.py", line 493, in generate
context = self.text_encoder([input_prompt], self.device)[0].to(self.dtype)
~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\Desktop\pourwangp\ComfyUI_windows_portable\Wan2GP-main\models\wan\modules\t5.py", line 674, in call
seq_lens = mask.gt(0).sum(dim=1).long()
~~~~~~~^^^
File "C:\Users\user\Desktop\comfy\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\utils_device.py", line 103, in torch_function
return func(*args, **kwargs)
torch.AcceleratorError: CUDA error: out of memory
Search for `cudaErrorMemoryAllocation' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
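If it helps with reproducing, this is roughly how I plan to re-run with the debug flag the error message suggests (just a minimal sketch on my side, not anything from Wan2GP; the allocator setting is only my guess that fragmentation could be involved):

```python
# Sketch only: set the env vars *before* torch is imported, then hand off to wgp.py.
# CUDA_LAUNCH_BLOCKING comes straight from the error message above;
# expandable_segments is my own assumption to reduce fragmentation-related OOMs.
import os, runpy

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

runpy.run_path("wgp.py", run_name="__main__")
```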
Of course, inference is still possible and quite fast, but I was wondering if there's a bottleneck somewhere.
Here is my configuration:
Total VRAM 16310 MB, total RAM 32691 MB
pytorch version: 2.9.1+cu130
Device: cuda:0 NVIDIA GeForce RTX 5060 Ti : cudaMallocAsync
Python version: 3.13.9 (tags/v3.13.9:8183fa5, Oct 14 2025, 14:09:13) [MSC v.1944 64 bit (AMD64)]
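For what it's worth, here is a quick way to double-check that PyTorch really sees the full 16 GB and to watch allocation during generation (generic torch snippet, not Wan2GP code, assuming a single device at cuda:0):

```python
# Minimal VRAM sanity check / monitor for cuda:0.
import torch

props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**2:.0f} MB total VRAM")
print(f"allocated: {torch.cuda.memory_allocated(0) / 1024**2:.0f} MB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 1024**2:.0f} MB")
```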