We present some techniques and ideas to optimize 🤗 Diffusers _inference_ for memory or speed.

|                  | Latency | Speedup |
| ---------------- | ------- | ------- |
| original         | 9.50s   | x1      |
| cuDNN auto-tuner | 9.37s   | x1.01   |
| autocast (fp16)  | 5.47s   | x1.91   |
| fp16             | 3.61s   | x2.91   |
| channels last    | 3.30s   | x2.87   |
| traced UNet      | 3.21s   | x2.96   |

<em>obtained on NVIDIA TITAN RTX by generating a single image of size 512x512 from the prompt "a photo of an astronaut riding a horse on mars" with 50 DDIM steps.</em>

## Enable cuDNN auto-tuner

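The auto-tuner lets cuDNN benchmark the available convolution algorithms and pick the fastest one for your input shapes, which helps most when those shapes stay constant across calls. A minimal sketch of turning it on before running the pipeline (this is PyTorch's standard `torch.backends.cudnn.benchmark` flag, not a Diffusers-specific API):

```Python
import torch

# Ask cuDNN to benchmark candidate convolution kernels on the first call and
# cache the fastest one for subsequent calls with the same input shapes.
torch.backends.cudnn.benchmark = True
```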
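
## Automatic mixed precision (autocast)

If you use a CUDA GPU, you can take advantage of `torch.autocast` to perform inference roughly twice as fast at the cost of a slight loss in precision: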

```Python
from torch import autocast
from diffusers import StableDiffusionPipeline

# Load the pipeline in full precision; autocast runs the forward pass in mixed precision.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
with autocast("cuda"):
    image = pipe(prompt).images[0]
```

Despite the precision loss, in our experience the final image results look the same as the `float32` versions. Feel free to experiment and report back!
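
## Half precision weights

To save more GPU memory and get more speed, you can load and run the model weights directly in half precision: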

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    revision="fp16",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
```

## Sliced attention for additional memory savings

For even more memory savings, you can use a sliced version of attention that performs the computation in steps instead of all at once.

<Tip>
Attention slicing is useful even with a batch size of just 1, as long as the model uses more than one attention head. If there is more than one attention head, the *QK^T* attention matrix can be computed sequentially for each head, which can save a significant amount of memory.
</Tip>
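
As a rough illustration of why this helps (this is not the implementation used by 🤗 Diffusers; the tensor names and shapes below are assumed), computing attention one head at a time keeps only a single *QK^T* matrix in memory at any moment:

```Python
import torch

def per_head_attention(q, k, v):
    # q, k, v: assumed shape (heads, seq_len, head_dim)
    out = torch.empty_like(q)
    scale = q.shape[-1] ** -0.5
    for h in range(q.shape[0]):
        # Only one head's (seq_len x seq_len) QK^T matrix is materialized at a time.
        attn = torch.softmax((q[h] @ k[h].transpose(0, 1)) * scale, dim=-1)
        out[h] = attn @ v[h]
    return out
```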

To perform the attention computation sequentially over each head, you only need to invoke [`~StableDiffusionPipeline.enable_attention_slicing`] in your pipeline before inference, like here:

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    revision="fp16",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_attention_slicing()
image = pipe(prompt).images[0]
```

There's a small performance penalty of about 10% slower inference times, but this method allows you to use Stable Diffusion in as little as 3.2 GB of VRAM!

## Offloading to CPU with accelerate for memory savings

For additional memory savings, you can offload the weights to CPU and only load them to the GPU when performing the forward pass.

To perform CPU offloading, all you have to do is invoke [`~StableDiffusionPipeline.enable_sequential_cpu_offload`]:

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    revision="fp16",
    torch_dtype=torch.float16,
)
# Don't move the pipeline to CUDA beforehand - sequential CPU offloading moves
# each submodule to the GPU only for its forward pass, so moving everything to
# CUDA first would defeat the memory savings.

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_sequential_cpu_offload()
image = pipe(prompt).images[0]
```

With this, memory consumption can be brought below 2 GB.

It is also possible to chain it with attention slicing for minimal memory consumption, running inference in under 800 MB of GPU VRAM:

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    revision="fp16",
    torch_dtype=torch.float16,
)
# As above, don't call pipe.to("cuda") when using sequential CPU offloading.

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing(1)

image = pipe(prompt).images[0]
```

## Using Channels Last memory format

Channels last memory format is an alternative way of ordering NCHW tensors in memory that preserves the dimension ordering. Channels last tensors are ordered in such a way that the channels become the densest dimension (i.e., storing images pixel-per-pixel). Since not all operators currently support the channels last format, using it may result in worse performance, so it's better to try it and see if it works for your model.
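
The UNet is where most of the convolution work happens, so a minimal sketch of switching it to channels last (assuming the `pipe` object from the snippets above) looks like this:

```Python
import torch

# Reorder the UNet's parameters to the channels last (NHWC) memory layout in place.
pipe.unet.to(memory_format=torch.channels_last)

# A stride of 1 in the channel dimension of a conv weight confirms the new layout.
print(pipe.unet.conv_out.weight.stride())
```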