
Extremely Slow Inference when running Wan2.1 T2V-1.3B locally on RTX 4060 Laptop GPU (16GB) #555

@wjkuser

Description


Hi, thank you for the great work on Wan2.1!
I’m currently deploying the Wan2.1 T2V-1.3B text-to-video model locally on my Windows machine, but the inference speed is far slower than expected. I would like to ask for help diagnosing the issue.

Environment

OS: Windows 11

GPU: NVIDIA GeForce RTX 4060 Laptop GPU (16GB)

CUDA: cu128

PyTorch: 2.7.0+cu128

FlashAttention: Installed from precompiled wheels:
https://github.com/PLISGOOD/flash-attention-windows-wheels
Verified working via:

python -c "import flash_attn; print('Flash Attention installed successfully!')"
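A bare import only proves the Python wrapper loads. A slightly stronger check (a sketch, degrading gracefully when torch, flash-attn, or a GPU is absent) is to run one tiny forward pass so the CUDA kernel itself is exercised; flash-attn v2's `flash_attn_func` expects fp16/bf16 tensors of shape `(batch, seqlen, nheads, headdim)`:

```python
# Verify the flash-attn CUDA kernel actually runs, not just that it imports.
status = "kernel-ok"
try:
    import torch
    from flash_attn import flash_attn_func

    if torch.cuda.is_available():
        # Tiny self-attention call: batch=1, seqlen=16, 4 heads, headdim=64.
        q = torch.randn(1, 16, 4, 64, dtype=torch.float16, device="cuda")
        out = flash_attn_func(q, q, q)  # q doubles as k and v here
        assert out.shape == q.shape
    else:
        status = "no-gpu"
except ImportError:
    status = "missing-deps"

print(status)
```

If this prints `kernel-ok`, the wheel's kernel is at least functional on the GPU, independent of whether Wan2.1 routes its attention through it.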

Model: Wan2.1-T2V-1.3B

Command used (from README example):

python generate.py  --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --offload_model True --t5_cpu --sample_shift 8 --sample_guide_scale 6 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."

Problem Description

During local generation:

Each inference step takes ~400 seconds, even though FlashAttention is installed.

A full 50-step video generation takes roughly 5–6 hours at this rate.

GPU memory usage stays at ~11 GB.

This is significantly slower than expected for a 1.3B-parameter model running on an RTX 4060 Laptop GPU.
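As a back-of-envelope check, the ~424 s/it figure from the progress bar in the log below extrapolates to nearly six hours for the full run:

```python
# Extrapolate total generation time from the per-step time in the tqdm line.
SECONDS_PER_STEP = 424.46  # "424.46s/it" from the log
STEPS = 50                 # sample_steps=50

total_hours = SECONDS_PER_STEP * STEPS / 3600
print(f"{total_hours:.1f} h")  # 5.9 h
```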

python generate.py  --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --offload_model True --t5_cpu --sample_shift 8 --sample_guide_scale 6 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."

[2025-12-04 20:57:33,337] INFO: Generation job args: Namespace(task='t2v-1.3B', size='832*480', frame_num=81, ckpt_dir='./Wan2.1-T2V-1.3B', offload_model=True, ulysses_size=1, ring_size=1, t5_fsdp=False, t5_cpu=True, dit_fsdp=False, save_file=None, src_video=None, src_mask=None, src_ref_images=None, prompt='Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.', use_prompt_extend=False, prompt_extend_method='local_qwen', prompt_extend_model=None, prompt_extend_target_lang='zh', base_seed=1697777428839187584, image=None, first_frame=None, last_frame=None, sample_solver='unipc', sample_steps=50, sample_shift=8.0, sample_guide_scale=6.0)
[2025-12-04 20:57:33,337] INFO: Generation model config: {'__name__': 'Config: Wan T2V 1.3B', 't5_model': 'umt5_xxl', 't5_dtype': torch.bfloat16, 'text_len': 512, 'param_dtype': torch.bfloat16, 'num_train_timesteps': 1000, 'sample_fps': 16, 'sample_neg_prompt': '色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走', 't5_checkpoint': 'models_t5_umt5-xxl-enc-bf16.pth', 't5_tokenizer': 'google/umt5-xxl', 'vae_checkpoint': 'Wan2.1_VAE.pth', 'vae_stride': (4, 8, 8), 'patch_size': (1, 2, 2), 'dim': 1536, 'ffn_dim': 8960, 'freq_dim': 256, 'num_heads': 12, 'num_layers': 30, 'window_size': (-1, -1), 'qk_norm': True, 'cross_attn_norm': True, 'eps': 1e-06}
[2025-12-04 20:57:33,337] INFO: Input prompt: Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.
[2025-12-04 20:57:33,337] INFO: Creating WanT2V pipeline.
[2025-12-04 21:01:51,412] INFO: loading ./Wan2.1-T2V-1.3B\models_t5_umt5-xxl-enc-bf16.pth
[2025-12-04 21:03:44,969] INFO: loading ./Wan2.1-T2V-1.3B\Wan2.1_VAE.pth
[2025-12-04 21:03:46,098] INFO: Creating WanModel from ./Wan2.1-T2V-1.3B
[2025-12-04 21:03:55,388] INFO: Generating video ...
 22%|█████████████████████████████▉                                                                                                          | 11/50 [1:10:02<4:35:54, 424.46s/it]

Questions

Is this slow generation time expected on a 4060 Laptop GPU, or is something misconfigured?

Does Wan2.1 actually use FlashAttention on Windows builds?
Is the FlashAttention wheel I installed compatible with the attention kernels used in Wan2.1?

Could the Windows environment itself be causing the slowdown?

Additional Notes

I’m not using WSL; everything runs directly on Windows.

FlashAttention loads successfully but I’m unsure whether Wan2.1 is actually utilizing it.
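One way to answer that last point empirically (a sketch, not something Wan2.1 provides) is to wrap `flash_attn.flash_attn_func` with a call counter before the pipeline is built, then run a step and inspect the count. The wrapping itself is plain Python, demonstrated below on a stand-in function since the real kernel needs a CUDA build; note the patch must be applied before Wan2.1's modules import the function, or their already-bound reference will bypass it.

```python
import functools

def count_calls(fn):
    """Wrap fn so every invocation increments fn.calls."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return fn(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

# In a real session, before importing Wan2.1's modules, you would do e.g.:
#   import flash_attn
#   flash_attn.flash_attn_func = count_calls(flash_attn.flash_attn_func)
# then run one denoising step and read flash_attn.flash_attn_func.calls.

def fake_attention(q, k, v):  # stand-in for flash_attn_func
    return q

fake_attention = count_calls(fake_attention)
fake_attention(1, 2, 3)
fake_attention(4, 5, 6)
print(fake_attention.calls)  # 2
```

A count that stays at zero during generation would mean the model is falling back to another attention path despite the wheel importing cleanly.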

Any guidance would be greatly appreciated. Thank you!
