Flash Attention v2 is 44% faster than xformers/pytorch_sdp_attention on large images #11884
-
Flash Attention 2 has landed in xformers, but my initial tests aren't showing any significant speed difference between xformers 0.0.20 and 0.0.21-dev571.
-
I made a draft PR for testing: #11902
-
I think xformers is also making updates; the latest dev version is not working on the free Google Colab GPU.
-
I compiled and installed xformers (mainline) and ran a benchmark with the same settings. The performance is close to
-
Nice. What we install via pip (the dev version) still doesn't have this version, right? Because I didn't see a dramatic increase.
-
Why did I get the opposite result 🤣 Speed went down about 10% between xformers 0.0.20 and 0.0.21-dev on a 2080 Ti on Win11.
-
FlashAttention-2 (repo: https://github.com/Dao-AILab/flash-attention, package: flash-attn, paper) was released this week. It provides "Better Parallelism and Work Partitioning" compared to existing works, leading to much higher performance (up to 2x, according to their paper) than the old FlashAttention (which performs similarly to xformers/pytorch_sdp_attention). I wrote a small patch for sd webui and ran a quick benchmark. I used "Hires. fix" in the tests, which is popular and lets you see the performance on both small and large images.

It gives a massive improvement on large images: total time is 44% faster than pytorch_sdp_attention (38s vs 55s), while xformers performs about the same as pytorch_sdp_attention on my device. The output image may have very small differences from the baseline, just like with xformers or pytorch_sdp_attention.
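The patch itself is not posted in this thread. As a rough illustration, here is a minimal sketch of what a FlashAttention-2 attention override for ldm's CrossAttention could look like, modeled on the shape of the existing overrides in webui's modules/sd_hijack_optimizations.py; the name flash_attn_attention_forward is hypothetical, and flash_attn_func is assumed to come from the flash-attn>=2.0 package.

```python
# Hypothetical FlashAttention-2 forward for ldm's CrossAttention modules.
# flash_attn_func expects (batch, seqlen, heads, head_dim) tensors in fp16/bf16.
from einops import rearrange
from flash_attn import flash_attn_func


def flash_attn_attention_forward(self, x, context=None, mask=None):
    h = self.heads
    context = context if context is not None else x

    q = self.to_q(x)
    k = self.to_k(context)
    v = self.to_v(context)

    # (batch, seqlen, heads * head_dim) -> (batch, seqlen, heads, head_dim)
    q, k, v = (rearrange(t, "b n (h d) -> b n h d", h=h) for t in (q, k, v))

    # FlashAttention-2 kernel (non-causal, no dropout at inference time).
    out = flash_attn_func(q, k, v, dropout_p=0.0, causal=False)

    out = rearrange(out, "b n h d -> b n (h d)")
    return self.to_out(out)
```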
I think it's worth adding to stable diffusion webui, and I will soon make a draft PR about this. However, there are still some problems to discuss:

- The flash-attn 2.0 package breaks transformers, which is a dependency of sd webui, and prevents it from running. This is because flash-attn 2.0 renamed some functions. We may need to wait for transformers to fix it. (A workaround is to set aliases in the flash-attn package files; see the first sketch after this list.)
- Even if xformers integrates FlashAttention-2 so that webui doesn't need the flash-attn package directly, I think making a PR to support flash-attn may still be useful.
- flash-attn only supports fp16 and bfloat16, so it needs to fall back to other attention implementations (like sdp attention) when attention is run in another precision; see the second sketch after this list. This won't affect performance much.
- flash-attn doesn't support Windows yet, and installing it takes a lot of memory (it has to compile its kernels), so we may leave it to advanced users to install it themselves.