When I set `pn: str = '1_2_3_4'`, xformers and flash_attention are not used, and training fails.
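For context, here is how I read the derived values in the args dump, and why I suspect the shorter schedule matters: the total token sequence length is the sum of the squared patch sizes, which is not a multiple of 8 for this schedule (this is only my reading of the printed values, not the repository's code):

```python
# Derived values as they appear in the args dump and in the attention error below.
pn = '1_2_3_4_5_6'                                  # the schedule printed in this run's log
patch_size = 16

patch_nums = tuple(int(x) for x in pn.split('_'))   # (1, 2, 3, 4, 5, 6)
resos = tuple(patch_size * p for p in patch_nums)   # (16, 32, 48, 64, 80, 96)
L = sum(p * p for p in patch_nums)                  # 1 + 4 + 9 + 16 + 25 + 36 = 91
print(patch_nums, resos, L)                         # 91 is the sequence length in the
                                                    # xformers error, and 91 % 8 != 0
```

The full log is below: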
```
[dist initialize] mp method=spawn
[lrk=0, rk=0]
[rank0]:[W1014 15:12:21.376079326 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[10-14 15:12:21] (nt/VAR/utils/arg_util.py, line 183)=> [tf32] [precis] torch.get_float32_matmul_precision(): high
[10-14 15:12:21] (nt/VAR/utils/arg_util.py, line 184)=> [tf32] [ conv ] torch.backends.cudnn.allow_tf32: True
[10-14 15:12:21] (nt/VAR/utils/arg_util.py, line 185)=> [tf32] [matmul] torch.backends.cuda.matmul.allow_tf32: True
[10-14 15:12:21] (29/Document/VAR/train.py, line 36)=> global bs=32, local bs=32
[10-14 15:12:21] (29/Document/VAR/train.py, line 37)=> initial args:
{
data_path : ~/Document/datasets/artbench_c2i
exp_name : text
vfast : 0
tfast : 0
depth : 20
ini : -1
hd : 0.02
aln : 0.5
alng : 0.001
fp16 : 1
tblr : 0.0001
tlr : 1.25e-05
twd : 0
twde : 0
tclip : 2.0
ls : 0.0
bs : 32
batch_size : 32
glb_batch_size : 32
ac : 1
ep : 100
wp : 2.0
wp0 : 0.005
wpe : 0.0
sche : const
opt : adamw
afuse : True
saln : False
anorm : True
fuse : True
pn : 1_2_3_4_5_6
patch_size : 16
patch_nums : (1, 2, 3, 4, 5, 6)
resos : (16, 32, 48, 64, 80, 96)
data_load_reso : 96
mid_reso : 1.125
hflip : False
workers : 0
pg : 0.0
pg0 : 4
pgwp : 0.3333333333333333
cmd : --depth=20 --bs=32 --ep=100 --fp16=1 --alng=1e-3 --wpe=0.0 --data_path=~/Document/datasets/artbench_c2i --pretrained_model_path=./var_d20.pth
acc_mean : None
acc_tail : None
L_mean : None
L_tail : None
vacc_mean : None
vacc_tail : None
vL_mean : None
vL_tail : None
grad_norm : None
cur_lr : None
cur_wd : None
cur_it :
cur_ep :
remain_time :
finish_time :
local_out_dir_path : /public/home/cs029/Document/VAR/local_output
tb_log_dir_path : /public/home/cs029/Document/VAR/local_output/tb-VARd20__pn1_2_3_4_5_6__b32ep100adamlr0.0001wd0
log_txt_path : /public/home/cs029/Document/VAR/local_output/log.txt
last_ckpt_path : /public/home/cs029/Document/VAR/local_output/ar-ckpt-last.pth
pretrained_model_path: ./var_d20.pth
auto_grow : False
num_stages : 4
r_scale : 0.25
l_scale : 0.25
search_epochs : 1
tf32 : True
seed : None
same_seed_for_all_ranks: 0
local_debug : False
dbg_nan : False
}
[10-14 15:12:21] (29/Document/VAR/train.py, line 41)=> [build PT data] ...
[10-14 15:12:21] (cument/VAR/utils/data.py, line 34)=> [Dataset] len(train_set)=49999, len(val_set)=10000, num_classes=1000
[10-14 15:12:21] (cument/VAR/utils/data.py, line 48)=> Transform [train] =
[10-14 15:12:21] (cument/VAR/utils/data.py, line 51)=> Resize(size=108, interpolation=lanczos, max_size=None, antialias=True)
[10-14 15:12:21] (cument/VAR/utils/data.py, line 51)=> RandomCrop(size=(96, 96), padding=None)
[10-14 15:12:21] (cument/VAR/utils/data.py, line 51)=> ToTensor()
[10-14 15:12:21] (cument/VAR/utils/data.py, line 51)=> <function normalize_01_into_pm1 at 0x15080f529ab0>
[10-14 15:12:21] (cument/VAR/utils/data.py, line 54)=> ---------------------------
[10-14 15:12:21] (cument/VAR/utils/data.py, line 48)=> Transform [val] =
[10-14 15:12:21] (cument/VAR/utils/data.py, line 51)=> Resize(size=108, interpolation=lanczos, max_size=None, antialias=True)
[10-14 15:12:21] (cument/VAR/utils/data.py, line 51)=> CenterCrop(size=(96, 96))
[10-14 15:12:21] (cument/VAR/utils/data.py, line 51)=> ToTensor()
[10-14 15:12:21] (cument/VAR/utils/data.py, line 51)=> <function normalize_01_into_pm1 at 0x15080f529ab0>
[10-14 15:12:21] (cument/VAR/utils/data.py, line 54)=> ---------------------------
[10-14 15:12:21] (29/Document/VAR/train.py, line 64)=> [auto_resume] no ckpt found @ /public/home/cs029/Document/VAR/local_output/ar-ckpt*.pth
[10-14 15:12:21] (29/Document/VAR/train.py, line 64)=> [auto_resume quit]
[10-14 15:12:21] (29/Document/VAR/train.py, line 65)=> [dataloader multi processing] ... [dataloader multi processing](*) finished! (0.00s)
[10-14 15:12:21] (29/Document/VAR/train.py, line 71)=> [dataloader] gbs=32, lbs=32, iters_train=1563, types(tr, va)=('DatasetFolder', 'DatasetFolder')
[10-14 15:12:22] (cument/VAR/models/var.py, line 98)=>
[constructor] ==== flash_if_available=True (0/20), fused_if_available=True (fusing_add_ln=0/20, fusing_mlp=0/20) ====
[VAR config ] embed_dim=1280, num_heads=20, depth=20, mlp_ratio=4.0
[drop ratios ] drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0833333 (tensor([0.0000, 0.0044, 0.0088, 0.0132, 0.0175, 0.0219, 0.0263, 0.0307, 0.0351,
0.0395, 0.0439, 0.0482, 0.0526, 0.0570, 0.0614, 0.0658, 0.0702, 0.0746,
0.0789, 0.0833]))
[10-14 15:12:22] (cument/VAR/models/var.py, line 257)=> [init_weights] VAR with init_std=0.0161374
/public/home/cs029/Document/VAR/train.py:98: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
vae_local.load_state_dict(torch.load(vae_ckpt, map_location='cpu'), strict=True)
[10-14 15:12:23] (29/Document/VAR/train.py, line 104)=> [INIT] VAR model = VAR(
drop_path_rate=0.0833333
(word_embed): Linear(in_features=32, out_features=1280, bias=True)
(class_emb): Embedding(1001, 1280)
(lvl_embed): Embedding(6, 1280)
(shared_ada_lin): Identity()
(blocks): ModuleList(
(0): AdaLNSelfAttn(
shared_aln=False
(drop_path): Identity()
(attn): SelfAttention(
using_flash=False, using_xform=True, attn_l2_norm=True
(mat_qkv): Linear(in_features=1280, out_features=3840, bias=False)
(proj): Linear(in_features=1280, out_features=1280, bias=True)
(proj_drop): Identity()
)
(ffn): FFN(
fused_mlp_func=False
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='tanh')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Identity()
)
(ln_wo_grad): LayerNorm((1280,), eps=1e-06, elementwise_affine=False)
(ada_lin): Sequential(
(0): SiLU()
(1): Linear(in_features=1280, out_features=7680, bias=True)
)
)
(1-19): 19 x AdaLNSelfAttn(
shared_aln=False
(drop_path): DropPath((drop_prob=...))
(attn): SelfAttention(
using_flash=False, using_xform=True, attn_l2_norm=True
(mat_qkv): Linear(in_features=1280, out_features=3840, bias=False)
(proj): Linear(in_features=1280, out_features=1280, bias=True)
(proj_drop): Identity()
)
(ffn): FFN(
fused_mlp_func=False
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='tanh')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Identity()
)
(ln_wo_grad): LayerNorm((1280,), eps=1e-06, elementwise_affine=False)
(ada_lin): Sequential(
(0): SiLU()
(1): Linear(in_features=1280, out_features=7680, bias=True)
)
)
)
(head_nm): AdaLNBeforeHead(
(ln_wo_grad): LayerNorm((1280,), eps=1e-06, elementwise_affine=False)
(ada_lin): Sequential(
(0): SiLU()
(1): Linear(in_features=1280, out_features=2560, bias=True)
)
)
(head): Linear(in_features=1280, out_features=4096, bias=True)
)
[10-14 15:12:23] (29/Document/VAR/train.py, line 106)=> [INIT][#para] VAE=108.95, VAE.enc=44.11, VAE.dec=64.65, VAE.quant=0.17
[10-14 15:12:23] (29/Document/VAR/train.py, line 107)=> [INIT][#para] VAR=600.16
[10-14 15:12:23] (/VAR/utils/lr_control.py, line 106)=> [get_param_groups][rank0] type(model).__name__='VAR' count=250, numel=600158096
[10-14 15:12:23] (/VAR/utils/lr_control.py, line 107)=>
[10-14 15:12:23] (29/Document/VAR/train.py, line 122)=> [INIT] optim=functools.partial(<class 'torch.optim.adamw.AdamW'>, betas=(0.9, 0.95), fused=True), opt_kw={'lr': 1.25e-05, 'weight_decay': 0}
/public/home/cs029/Document/VAR/utils/amp_sc.py:27: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
self.scaler = torch.cuda.amp.GradScaler(init_scale=2. ** 11, growth_interval=1000) if self.using_fp16_rather_bf16 else None # only fp16 needs a scaler
/public/home/cs029/Document/VAR/models/var.py:200: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
[rank0]: Traceback (most recent call last):
[rank0]: File "/public/home/cs029/Document/VAR/train.py", line 331, in <module>
[rank0]: try: main_training()
[rank0]: File "/public/home/cs029/Document/VAR/train.py", line 196, in main_training
[rank0]: stats, (sec, remain_time, finish_time) = train_one_ep(
[rank0]: File "/public/home/cs029/Document/VAR/train.py", line 299, in train_one_ep
[rank0]: grad_norm, scale_log2 = trainer.train_step(
[rank0]: File "/public/home/cs029/Document/VAR/trainer.py", line 111, in train_step
[rank0]: logits_BLV = self.var(label_B, x_BLCv_wo_first_l)
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1643, in forward
[rank0]: else self._run_ddp_forward(*inputs, **kwargs)
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1459, in _run_ddp_forward
[rank0]: return self.module(*inputs, **kwargs) # type: ignore[index]
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/public/home/cs029/Document/VAR/models/var.py", line 222, in forward
[rank0]: x_BLC = b(x=x_BLC, cond_BD=cond_BD_or_gss, attn_bias=attn_bias)
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/public/home/cs029/Document/VAR/models/basic_var.py", line 157, in forward
[rank0]: x = x + self.drop_path(self.attn( self.ln_wo_grad(x).mul(scale1.add(1)).add_(shift1), attn_bias=attn_bias ).mul_(gamma1))
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/public/home/cs029/Document/VAR/models/basic_var.py", line 115, in forward
[rank0]: oup = memory_efficient_attention(q.to(dtype=main_type), k.to(dtype=main_type), v.to(dtype=main_type), attn_bias=None if attn_bias is None else attn_bias.to(dtype=main_type).expand(B, self.num_heads, -1, -1), p=dropout_p, scale=self.scale).view(B, L, C)
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 301, in memory_efficient_attention
[rank0]: return _memory_efficient_attention(
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 470, in _memory_efficient_attention
[rank0]: return _fMHA.apply(
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
[rank0]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 84, in forward
[rank0]: out, op_ctx = _memory_efficient_attention_forward_requires_grad(
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 495, in _memory_efficient_attention_forward_requires_grad
[rank0]: op = _dispatch_fw(inp, True)
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py", line 135, in _dispatch_fw
[rank0]: return _run_priority_list(
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py", line 76, in _run_priority_list
[rank0]: raise NotImplementedError(msg)
[rank0]: NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs:
[rank0]: query : shape=(32, 91, 20, 64) (torch.float16)
[rank0]: key : shape=(32, 91, 20, 64) (torch.float16)
[rank0]: value : shape=(32, 91, 20, 64) (torch.float16)
[rank0]: attn_bias : <class 'torch.Tensor'>
[rank0]: p : 0.0
[rank0]: `[email protected]` is not supported because:
[rank0]: attn_bias type is <class 'torch.Tensor'>
[rank0]: `cutlassF-pt` is not supported because:
[rank0]: attn_bias.stride(-2) % 8 != 0 (attn_bias.stride() = (0, 0, 91, 1))
[rank0]: HINT: To use an `attn_bias` with a sequence length that is not a multiple of 8, you need to ensure memory is aligned by slicing a bigger tensor. Example: use `attn_bias = torch.zeros([1, 1, 5, 8])[:,:,:,:5]` instead of `torch.zeros([1, 1, 5, 5])`
```
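If I read the xformers HINT at the end of the traceback correctly, the attention bias would need to be allocated with a padded trailing dimension and then sliced back, roughly like the sketch below (the sequence length is taken from the error message; this is only my understanding of the hint, not a patch to models/basic_var.py):

```python
import torch

# Sequence length from the error above (query/key/value shape (32, 91, 20, 64)).
L = 91
L_pad = (L + 7) // 8 * 8   # round up to a multiple of 8 -> 96

# Allocate the bias with a padded last dimension, then slice back to L,
# so that attn_bias.stride(-2) is a multiple of 8 (96 instead of the rejected 91).
attn_bias = torch.zeros(1, 1, L, L_pad, dtype=torch.float16)[:, :, :, :L]
print(attn_bias.shape, attn_bias.stride())  # torch.Size([1, 1, 91, 91]) (8736, 8736, 96, 1)
```

I am not sure where the bias is constructed in the VAR code, or whether padding it this way is the intended fix.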
What can I do to solve this problem?