Commit de00c63

Document sequential CPU offload method on Stable Diffusion pipeline (apple#1024)
* document cpu offloading method
* address review comments

Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
1 parent a6314a8 commit de00c63

File tree

2 files changed: +62 -7 lines

docs/source/optimization/fp16.mdx

Lines changed: 57 additions & 7 deletions
@@ -14,17 +14,20 @@ specific language governing permissions and limitations under the License.
 
 We present some techniques and ideas to optimize 🤗 Diffusers _inference_ for memory or speed.
 
-
 | | Latency | Speedup |
-|------------------|---------|---------|
+| ---------------- | ------- | ------- |
 | original | 9.50s | x1 |
 | cuDNN auto-tuner | 9.37s | x1.01 |
 | autocast (fp16) | 5.47s | x1.91 |
 | fp16 | 3.61s | x2.91 |
 | channels last | 3.30s | x2.87 |
 | traced UNet | 3.21s | x2.96 |
 
-<em>obtained on NVIDIA TITAN RTX by generating a single image of size 512x512 from the prompt "a photo of an astronaut riding a horse on mars" with 50 DDIM steps.</em>
+<em>
+obtained on NVIDIA TITAN RTX by generating a single image of size 512x512 from
+the prompt "a photo of an astronaut riding a horse on mars" with 50 DDIM
+steps.
+</em>
 
 ## Enable cuDNN auto-tuner
 
@@ -61,7 +64,7 @@ pipe = pipe.to("cuda")
 
 prompt = "a photo of an astronaut riding a horse on mars"
 with autocast("cuda"):
-    image = pipe(prompt).images[0]
+    image = pipe(prompt).images[0]
 ```
 
 Despite the precision loss, in our experience the final image results look the same as the `float32` versions. Feel free to experiment and report back!
@@ -79,15 +82,18 @@ pipe = StableDiffusionPipeline.from_pretrained(
 pipe = pipe.to("cuda")
 
 prompt = "a photo of an astronaut riding a horse on mars"
-image = pipe(prompt).images[0]
+image = pipe(prompt).images[0]
 ```
 
 ## Sliced attention for additional memory savings
 
 For even additional memory savings, you can use a sliced version of attention that performs the computation in steps instead of all at once.
 
 <Tip>
-Attention slicing is useful even if a batch size of just 1 is used - as long as the model uses more than one attention head. If there is more than one attention head the *QK^T* attention matrix can be computed sequentially for each head which can save a significant amount of memory.
+Attention slicing is useful even if a batch size of just 1 is used - as long
+as the model uses more than one attention head. If there is more than one
+attention head the *QK^T* attention matrix can be computed sequentially for
+each head which can save a significant amount of memory.
 </Tip>
 
 To perform the attention computation sequentially over each head, you only need to invoke [`~StableDiffusionPipeline.enable_attention_slicing`] in your pipeline before inference, like here:
@@ -105,11 +111,55 @@ pipe = pipe.to("cuda")
 
 prompt = "a photo of an astronaut riding a horse on mars"
 pipe.enable_attention_slicing()
-image = pipe(prompt).images[0]
+image = pipe(prompt).images[0]
 ```
 
 There's a small performance penalty of about 10% slower inference times, but this method allows you to use Stable Diffusion in as little as 3.2 GB of VRAM!
 
+## Offloading to CPU with accelerate for memory savings
+
+For additional memory savings, you can offload the weights to CPU and load them to GPU when performing the forward pass.
+
+To perform CPU offloading, all you have to do is invoke [`~StableDiffusionPipeline.enable_sequential_cpu_offload`]:
+
+```Python
+import torch
+from diffusers import StableDiffusionPipeline
+
+pipe = StableDiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5",
+    revision="fp16",
+    torch_dtype=torch.float16,
+)
+pipe = pipe.to("cuda")
+
+prompt = "a photo of an astronaut riding a horse on mars"
+pipe.enable_sequential_cpu_offload()
+image = pipe(prompt).images[0]
+```
+
+And you can get the memory consumption down to less than 2 GB.
+
+It is also possible to chain it with attention slicing for minimal memory consumption, running it in as little as < 800 MB of GPU VRAM:
+
+```Python
+import torch
+from diffusers import StableDiffusionPipeline
+
+pipe = StableDiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5",
+    revision="fp16",
+    torch_dtype=torch.float16,
+)
+pipe = pipe.to("cuda")
+
+prompt = "a photo of an astronaut riding a horse on mars"
+pipe.enable_sequential_cpu_offload()
+pipe.enable_attention_slicing(1)
+
+image = pipe(prompt).images[0]
+```
+
 ## Using Channels Last memory format
 
 Channels last memory format is an alternative way of ordering NCHW tensors in memory while preserving the dimension ordering. Channels last tensors are ordered in such a way that channels become the densest dimension (i.e. images are stored pixel by pixel). Since not all operators currently support the channels last format, it may result in worse performance, so it's better to try it and see if it works for your model.
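The hunk ends at the channels-last paragraph without an example. As an illustrative sketch only (not part of this commit), switching the pipeline's UNet to the channels-last memory format might look like this:

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda")

# Reorder the UNet's NCHW tensors so that channels become the densest dimension;
# shapes and the public API stay the same, only the in-memory layout changes.
pipe.unet.to(memory_format=torch.channels_last)

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
```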

src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py

Lines changed: 5 additions & 0 deletions
@@ -120,6 +120,11 @@ def disable_attention_slicing(self):
         self.enable_attention_slicing(None)
 
     def enable_sequential_cpu_offload(self):
+        r"""
+        Offloads all models to CPU using accelerate, significantly reducing memory usage. When called, unet,
+        text_encoder, vae and safety checker have their state dicts saved to CPU and then are moved to a
+        `torch.device('meta')` and loaded to GPU only when their specific submodule has its `forward` method called.
+        """
         if is_accelerate_available():
             from accelerate import cpu_offload
         else:
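The method body is truncated in this hunk. As a sketch only (not the commit's actual implementation), the behavior the new docstring describes could be reproduced by applying accelerate's `cpu_offload` to each of the pipeline's submodules:

```Python
import torch
from accelerate import cpu_offload
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    revision="fp16",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Sketch of sequential offloading: each submodule keeps its state dict on the CPU
# (the module itself only holds meta-device placeholders) and parameters are copied
# to the GPU just for the duration of that submodule's forward call.
execution_device = torch.device("cuda")
for submodule in [pipe.unet, pipe.text_encoder, pipe.vae, pipe.safety_checker]:
    if submodule is not None:  # the safety checker may be disabled
        cpu_offload(submodule, execution_device)

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]  # weights are streamed to the GPU as they are needed
```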
