
Extremely Slow Inference when running Wan2.1 T2V-1.3B locally on RTX 4060 Laptop GPU (16GB) #555

@wjkuser

Description


Hi, thank you for the great work on Wan2.1!
I’m currently deploying the Wan2.1 T2V-1.3B text-to-video model locally on my Windows machine, but the inference speed is far slower than expected. I would like to ask for help diagnosing the issue.

Environment

OS: Windows 11

GPU: NVIDIA GeForce RTX 4060 Laptop GPU (16GB)

CUDA: cu128

PyTorch: 2.7.0+cu128

FlashAttention: Installed from precompiled wheels:
https://github.com/PLISGOOD/flash-attention-windows-wheels
Verified working via:

python -c "import flash_attn; print('Flash Attention installed successfully!')"
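A bare import only proves the Python wrapper loads. A slightly stronger check (a sketch, degrading gracefully when torch, flash-attn, or a GPU is absent) is to run one tiny forward pass so the CUDA kernel itself is exercised; flash-attn v2's `flash_attn_func` expects fp16/bf16 tensors of shape `(batch, seqlen, nheads, headdim)`:

```python
# Verify the flash-attn CUDA kernel actually runs, not just that it imports.
status = "kernel-ok"
try:
    import torch
    from flash_attn import flash_attn_func

    if torch.cuda.is_available():
        # Tiny self-attention call: batch=1, seqlen=16, 4 heads, headdim=64.
        q = torch.randn(1, 16, 4, 64, dtype=torch.float16, device="cuda")
        out = flash_attn_func(q, q, q)  # q doubles as k and v here
        assert out.shape == q.shape
    else:
        status = "no-gpu"
except ImportError:
    status = "missing-deps"

print(status)
```

If this prints `kernel-ok`, the wheel's kernel is at least functional on the GPU, independent of whether Wan2.1 routes its attention through it.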

Model: Wan2.1-T2V-1.3B

Command used (from README example):

python generate.py  --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --offload_model True --t5_cpu --sample_shift 8 --sample_guide_scale 6 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."

Problem Description

During local generation:

Each inference step takes ~400 seconds, even though FlashAttention is installed.

A full 50-step video generation takes roughly 5–6 hours at this rate.

GPU memory usage stays at ~11 GB.

This is significantly slower than expected for a 1.3B-parameter model running on an RTX 4060 Laptop GPU.
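As a back-of-envelope check, the ~424 s/it figure from the progress bar in the log below extrapolates to nearly six hours for the full run:

```python
# Extrapolate total generation time from the per-step time in the tqdm line.
SECONDS_PER_STEP = 424.46  # "424.46s/it" from the log
STEPS = 50                 # sample_steps=50

total_hours = SECONDS_PER_STEP * STEPS / 3600
print(f"{total_hours:.1f} h")  # 5.9 h
```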

python generate.py  --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --offload_model True --t5_cpu --sample_shift 8 --sample_guide_scale 6 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."

[2025-12-04 20:57:33,337] INFO: Generation job args: Namespace(task='t2v-1.3B', size='832*480', frame_num=81, ckpt_dir='./Wan2.1-T2V-1.3B', offload_model=True, ulysses_size=1, ring_size=1, t5_fsdp=False, t5_cpu=True, dit_fsdp=False, save_file=None, src_video=None, src_mask=None, src_ref_images=None, prompt='Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.', use_prompt_extend=False, prompt_extend_method='local_qwen', prompt_extend_model=None, prompt_extend_target_lang='zh', base_seed=1697777428839187584, image=None, first_frame=None, last_frame=None, sample_solver='unipc', sample_steps=50, sample_shift=8.0, sample_guide_scale=6.0)
[2025-12-04 20:57:33,337] INFO: Generation model config: {'__name__': 'Config: Wan T2V 1.3B', 't5_model': 'umt5_xxl', 't5_dtype': torch.bfloat16, 'text_len': 512, 'param_dtype': torch.bfloat16, 'num_train_timesteps': 1000, 'sample_fps': 16, 'sample_neg_prompt': '色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走', 't5_checkpoint': 'models_t5_umt5-xxl-enc-bf16.pth', 't5_tokenizer': 'google/umt5-xxl', 'vae_checkpoint': 'Wan2.1_VAE.pth', 'vae_stride': (4, 8, 8), 'patch_size': (1, 2, 2), 'dim': 1536, 'ffn_dim': 8960, 'freq_dim': 256, 'num_heads': 12, 'num_layers': 30, 'window_size': (-1, -1), 'qk_norm': True, 'cross_attn_norm': True, 'eps': 1e-06}
[2025-12-04 20:57:33,337] INFO: Input prompt: Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.
[2025-12-04 20:57:33,337] INFO: Creating WanT2V pipeline.
[2025-12-04 21:01:51,412] INFO: loading ./Wan2.1-T2V-1.3B\models_t5_umt5-xxl-enc-bf16.pth
[2025-12-04 21:03:44,969] INFO: loading ./Wan2.1-T2V-1.3B\Wan2.1_VAE.pth
[2025-12-04 21:03:46,098] INFO: Creating WanModel from ./Wan2.1-T2V-1.3B
[2025-12-04 21:03:55,388] INFO: Generating video ...
 22%|█████████████████████████████▉                                                                                                          | 11/50 [1:10:02<4:35:54, 424.46s/it]

Questions

Is this slow generation time expected on a 4060 Laptop GPU, or is something misconfigured?

Does Wan2.1 actually use FlashAttention on Windows builds?
Is the FlashAttention wheel I installed compatible with the attention kernels used in Wan2.1?

Could the Windows environment itself be causing the slowdown?

Additional Notes

I’m not using WSL; everything runs directly on Windows.

FlashAttention loads successfully but I’m unsure whether Wan2.1 is actually utilizing it.
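One way to answer that last point empirically (a sketch, not something Wan2.1 provides) is to wrap `flash_attn.flash_attn_func` with a call counter before the pipeline is built, then run a step and inspect the count. The wrapping itself is plain Python, demonstrated below on a stand-in function since the real kernel needs a CUDA build; note the patch must be applied before Wan2.1's modules import the function, or their already-bound reference will bypass it.

```python
import functools

def count_calls(fn):
    """Wrap fn so every invocation increments fn.calls."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return fn(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

# In a real session, before importing Wan2.1's modules, you would do e.g.:
#   import flash_attn
#   flash_attn.flash_attn_func = count_calls(flash_attn.flash_attn_func)
# then run one denoising step and read flash_attn.flash_attn_func.calls.

def fake_attention(q, k, v):  # stand-in for flash_attn_func
    return q

fake_attention = count_calls(fake_attention)
fake_attention(1, 2, 3)
fake_attention(4, 5, 6)
print(fake_attention.calls)  # 2
```

A count that stays at zero during generation would mean the model is falling back to another attention path despite the wheel importing cleanly.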

Any guidance would be greatly appreciated. Thank you!
