When I set `pn: str = '1_2_3_4'`, xformers and flash_attention are not used, and training fails.
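For context, here is how I read the derived values in the args dump, and why I suspect the shorter schedule matters: the total token sequence length is the sum of the squared patch sizes, which is not a multiple of 8 for this schedule (this is only my reading of the printed values, not the repository's code):

```python
# Derived values as they appear in the args dump and in the attention error below.
pn = '1_2_3_4_5_6'                                  # the schedule printed in this run's log
patch_size = 16

patch_nums = tuple(int(x) for x in pn.split('_'))   # (1, 2, 3, 4, 5, 6)
resos = tuple(patch_size * p for p in patch_nums)   # (16, 32, 48, 64, 80, 96)
L = sum(p * p for p in patch_nums)                  # 1 + 4 + 9 + 16 + 25 + 36 = 91
print(patch_nums, resos, L)                         # 91 is the sequence length in the
                                                    # xformers error, and 91 % 8 != 0
```

The full log is below: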
```
[dist initialize] mp method=spawn
[lrk=0, rk=0]
[rank0]:[W1014 15:12:21.376079326 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[10-14 15:12:21] (nt/VAR/utils/arg_util.py, line 183)=> [tf32] [precis] torch.get_float32_matmul_precision(): high
[10-14 15:12:21] (nt/VAR/utils/arg_util.py, line 184)=> [tf32] [ conv ] torch.backends.cudnn.allow_tf32: True
[10-14 15:12:21] (nt/VAR/utils/arg_util.py, line 185)=> [tf32] [matmul] torch.backends.cuda.matmul.allow_tf32: True
[10-14 15:12:21] (29/Document/VAR/train.py, line 36)=> global bs=32, local bs=32
[10-14 15:12:21] (29/Document/VAR/train.py, line 37)=> initial args:
{
data_path : ~/Document/datasets/artbench_c2i
exp_name : text
vfast : 0
tfast : 0
depth : 20
ini : -1
hd : 0.02
aln : 0.5
alng : 0.001
fp16 : 1
tblr : 0.0001
tlr : 1.25e-05
twd : 0
twde : 0
tclip : 2.0
ls : 0.0
bs : 32
batch_size : 32
glb_batch_size : 32
ac : 1
ep : 100
wp : 2.0
wp0 : 0.005
wpe : 0.0
sche : const
opt : adamw
afuse : True
saln : False
anorm : True
fuse : True
pn : 1_2_3_4_5_6
patch_size : 16
patch_nums : (1, 2, 3, 4, 5, 6)
resos : (16, 32, 48, 64, 80, 96)
data_load_reso : 96
mid_reso : 1.125
hflip : False
workers : 0
pg : 0.0
pg0 : 4
pgwp : 0.3333333333333333
cmd : --depth=20 --bs=32 --ep=100 --fp16=1 --alng=1e-3 --wpe=0.0 --data_path=~/Document/datasets/artbench_c2i --pretrained_model_path=./var_d20.pth
acc_mean : None
acc_tail : None
L_mean : None
L_tail : None
vacc_mean : None
vacc_tail : None
vL_mean : None
vL_tail : None
grad_norm : None
cur_lr : None
cur_wd : None
cur_it :
cur_ep :
remain_time :
finish_time :
local_out_dir_path : /public/home/cs029/Document/VAR/local_output
tb_log_dir_path : /public/home/cs029/Document/VAR/local_output/tb-VARd20__pn1_2_3_4_5_6__b32ep100adamlr0.0001wd0
log_txt_path : /public/home/cs029/Document/VAR/local_output/log.txt
last_ckpt_path : /public/home/cs029/Document/VAR/local_output/ar-ckpt-last.pth
pretrained_model_path: ./var_d20.pth
auto_grow : False
num_stages : 4
r_scale : 0.25
l_scale : 0.25
search_epochs : 1
tf32 : True
seed : None
same_seed_for_all_ranks: 0
local_debug : False
dbg_nan : False
}
[10-14 15:12:21] (29/Document/VAR/train.py, line 41)=> [build PT data] ...
[10-14 15:12:21] (cument/VAR/utils/data.py, line 34)=> [Dataset] len(train_set)=49999, len(val_set)=10000, num_classes=1000
[10-14 15:12:21] (cument/VAR/utils/data.py, line 48)=> Transform [train] =
[10-14 15:12:21] (cument/VAR/utils/data.py, line 51)=> Resize(size=108, interpolation=lanczos, max_size=None, antialias=True)
[10-14 15:12:21] (cument/VAR/utils/data.py, line 51)=> RandomCrop(size=(96, 96), padding=None)
[10-14 15:12:21] (cument/VAR/utils/data.py, line 51)=> ToTensor()
[10-14 15:12:21] (cument/VAR/utils/data.py, line 51)=> <function normalize_01_into_pm1 at 0x15080f529ab0>
[10-14 15:12:21] (cument/VAR/utils/data.py, line 54)=> ---------------------------
[10-14 15:12:21] (cument/VAR/utils/data.py, line 48)=> Transform [val] =
[10-14 15:12:21] (cument/VAR/utils/data.py, line 51)=> Resize(size=108, interpolation=lanczos, max_size=None, antialias=True)
[10-14 15:12:21] (cument/VAR/utils/data.py, line 51)=> CenterCrop(size=(96, 96))
[10-14 15:12:21] (cument/VAR/utils/data.py, line 51)=> ToTensor()
[10-14 15:12:21] (cument/VAR/utils/data.py, line 51)=> <function normalize_01_into_pm1 at 0x15080f529ab0>
[10-14 15:12:21] (cument/VAR/utils/data.py, line 54)=> ---------------------------
[10-14 15:12:21] (29/Document/VAR/train.py, line 64)=> [auto_resume] no ckpt found @ /public/home/cs029/Document/VAR/local_output/ar-ckpt*.pth
[10-14 15:12:21] (29/Document/VAR/train.py, line 64)=> [auto_resume quit]
[10-14 15:12:21] (29/Document/VAR/train.py, line 65)=> [dataloader multi processing] ... [dataloader multi processing](*) finished! (0.00s)
[10-14 15:12:21] (29/Document/VAR/train.py, line 71)=> [dataloader] gbs=32, lbs=32, iters_train=1563, types(tr, va)=('DatasetFolder', 'DatasetFolder')
[10-14 15:12:22] (cument/VAR/models/var.py, line 98)=>
[constructor] ==== flash_if_available=True (0/20), fused_if_available=True (fusing_add_ln=0/20, fusing_mlp=0/20) ====
[VAR config ] embed_dim=1280, num_heads=20, depth=20, mlp_ratio=4.0
[drop ratios ] drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0833333 (tensor([0.0000, 0.0044, 0.0088, 0.0132, 0.0175, 0.0219, 0.0263, 0.0307, 0.0351,
0.0395, 0.0439, 0.0482, 0.0526, 0.0570, 0.0614, 0.0658, 0.0702, 0.0746,
0.0789, 0.0833]))
[10-14 15:12:22] (cument/VAR/models/var.py, line 257)=> [init_weights] VAR with init_std=0.0161374
/public/home/cs029/Document/VAR/train.py:98: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
vae_local.load_state_dict(torch.load(vae_ckpt, map_location='cpu'), strict=True)
[10-14 15:12:23] (29/Document/VAR/train.py, line 104)=> [INIT] VAR model = VAR(
drop_path_rate=0.0833333
(word_embed): Linear(in_features=32, out_features=1280, bias=True)
(class_emb): Embedding(1001, 1280)
(lvl_embed): Embedding(6, 1280)
(shared_ada_lin): Identity()
(blocks): ModuleList(
(0): AdaLNSelfAttn(
shared_aln=False
(drop_path): Identity()
(attn): SelfAttention(
using_flash=False, using_xform=True, attn_l2_norm=True
(mat_qkv): Linear(in_features=1280, out_features=3840, bias=False)
(proj): Linear(in_features=1280, out_features=1280, bias=True)
(proj_drop): Identity()
)
(ffn): FFN(
fused_mlp_func=False
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='tanh')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Identity()
)
(ln_wo_grad): LayerNorm((1280,), eps=1e-06, elementwise_affine=False)
(ada_lin): Sequential(
(0): SiLU()
(1): Linear(in_features=1280, out_features=7680, bias=True)
)
)
(1-19): 19 x AdaLNSelfAttn(
shared_aln=False
(drop_path): DropPath((drop_prob=...))
(attn): SelfAttention(
using_flash=False, using_xform=True, attn_l2_norm=True
(mat_qkv): Linear(in_features=1280, out_features=3840, bias=False)
(proj): Linear(in_features=1280, out_features=1280, bias=True)
(proj_drop): Identity()
)
(ffn): FFN(
fused_mlp_func=False
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): GELU(approximate='tanh')
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
(drop): Identity()
)
(ln_wo_grad): LayerNorm((1280,), eps=1e-06, elementwise_affine=False)
(ada_lin): Sequential(
(0): SiLU()
(1): Linear(in_features=1280, out_features=7680, bias=True)
)
)
)
(head_nm): AdaLNBeforeHead(
(ln_wo_grad): LayerNorm((1280,), eps=1e-06, elementwise_affine=False)
(ada_lin): Sequential(
(0): SiLU()
(1): Linear(in_features=1280, out_features=2560, bias=True)
)
)
(head): Linear(in_features=1280, out_features=4096, bias=True)
)
[10-14 15:12:23] (29/Document/VAR/train.py, line 106)=> [INIT][#para] VAE=108.95, VAE.enc=44.11, VAE.dec=64.65, VAE.quant=0.17
[10-14 15:12:23] (29/Document/VAR/train.py, line 107)=> [INIT][#para] VAR=600.16
[10-14 15:12:23] (/VAR/utils/lr_control.py, line 106)=> [get_param_groups][rank0] type(model).__name__='VAR' count=250, numel=600158096
[10-14 15:12:23] (/VAR/utils/lr_control.py, line 107)=>
[10-14 15:12:23] (29/Document/VAR/train.py, line 122)=> [INIT] optim=functools.partial(<class 'torch.optim.adamw.AdamW'>, betas=(0.9, 0.95), fused=True), opt_kw={'lr': 1.25e-05, 'weight_decay': 0}
/public/home/cs029/Document/VAR/utils/amp_sc.py:27: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
self.scaler = torch.cuda.amp.GradScaler(init_scale=2. ** 11, growth_interval=1000) if self.using_fp16_rather_bf16 else None # only fp16 needs a scaler
/public/home/cs029/Document/VAR/models/var.py:200: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
[rank0]: Traceback (most recent call last):
[rank0]: File "/public/home/cs029/Document/VAR/train.py", line 331, in <module>
[rank0]: try: main_training()
[rank0]: File "/public/home/cs029/Document/VAR/train.py", line 196, in main_training
[rank0]: stats, (sec, remain_time, finish_time) = train_one_ep(
[rank0]: File "/public/home/cs029/Document/VAR/train.py", line 299, in train_one_ep
[rank0]: grad_norm, scale_log2 = trainer.train_step(
[rank0]: File "/public/home/cs029/Document/VAR/trainer.py", line 111, in train_step
[rank0]: logits_BLV = self.var(label_B, x_BLCv_wo_first_l)
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1643, in forward
[rank0]: else self._run_ddp_forward(*inputs, **kwargs)
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1459, in _run_ddp_forward
[rank0]: return self.module(*inputs, **kwargs) # type: ignore[index]
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/public/home/cs029/Document/VAR/models/var.py", line 222, in forward
[rank0]: x_BLC = b(x=x_BLC, cond_BD=cond_BD_or_gss, attn_bias=attn_bias)
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/public/home/cs029/Document/VAR/models/basic_var.py", line 157, in forward
[rank0]: x = x + self.drop_path(self.attn( self.ln_wo_grad(x).mul(scale1.add(1)).add_(shift1), attn_bias=attn_bias ).mul_(gamma1))
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/public/home/cs029/Document/VAR/models/basic_var.py", line 115, in forward
[rank0]: oup = memory_efficient_attention(q.to(dtype=main_type), k.to(dtype=main_type), v.to(dtype=main_type), attn_bias=None if attn_bias is None else attn_bias.to(dtype=main_type).expand(B, self.num_heads, -1, -1), p=dropout_p, scale=self.scale).view(B, L, C)
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 301, in memory_efficient_attention
[rank0]: return _memory_efficient_attention(
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 470, in _memory_efficient_attention
[rank0]: return _fMHA.apply(
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
[rank0]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 84, in forward
[rank0]: out, op_ctx = _memory_efficient_attention_forward_requires_grad(
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 495, in _memory_efficient_attention_forward_requires_grad
[rank0]: op = _dispatch_fw(inp, True)
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py", line 135, in _dispatch_fw
[rank0]: return _run_priority_list(
[rank0]: File "/public/home/cs029/miniconda3/envs/var/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py", line 76, in _run_priority_list
[rank0]: raise NotImplementedError(msg)
[rank0]: NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs:
[rank0]: query : shape=(32, 91, 20, 64) (torch.float16)
[rank0]: key : shape=(32, 91, 20, 64) (torch.float16)
[rank0]: value : shape=(32, 91, 20, 64) (torch.float16)
[rank0]: attn_bias : <class 'torch.Tensor'>
[rank0]: p : 0.0
[rank0]: `[email protected]` is not supported because:
[rank0]: attn_bias type is <class 'torch.Tensor'>
[rank0]: `cutlassF-pt` is not supported because:
[rank0]: attn_bias.stride(-2) % 8 != 0 (attn_bias.stride() = (0, 0, 91, 1))
[rank0]: HINT: To use an `attn_bias` with a sequence length that is not a multiple of 8, you need to ensure memory is aligned by slicing a bigger tensor. Example: use `attn_bias = torch.zeros([1, 1, 5, 8])[:,:,:,:5]` instead of `torch.zeros([1, 1, 5, 5])`
```
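If I read the xformers HINT at the end of the traceback correctly, the attention bias would need to be allocated with a padded trailing dimension and then sliced back, roughly like the sketch below (the sequence length is taken from the error message; this is only my understanding of the hint, not a patch to models/basic_var.py):

```python
import torch

# Sequence length from the error above (query/key/value shape (32, 91, 20, 64)).
L = 91
L_pad = (L + 7) // 8 * 8   # round up to a multiple of 8 -> 96

# Allocate the bias with a padded last dimension, then slice back to L,
# so that attn_bias.stride(-2) is a multiple of 8 (96 instead of the rejected 91).
attn_bias = torch.zeros(1, 1, L, L_pad, dtype=torch.float16)[:, :, :, :L]
print(attn_bias.shape, attn_bias.stride())  # torch.Size([1, 1, 91, 91]) (8736, 8736, 96, 1)
```

I am not sure where the bias is constructed in the VAR code, or whether padding it this way is the intended fix.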
What can I do to solve this problem?