temporally causal performance worse than full attention for natten3d #306

@coconutruben

Description

Hi there,

We were benchmarking natten3d for our needs and noticed that enabling temporal causality is slower than running full attention (vs. SDPA on cuDNN).

Is this something you have noticed too? We see it for inputs like this:

  • [1, t, h, w]
  • num heads: 24
  • head dim: 128
  • for height/width we're trying 32 and 64
  • for t (temporal) we're trying 30 and 60
| Prefix     | GPUs | Median (ms) | time | height | width |
|------------|------|-------------|------|--------|-------|
| attention_ | 1    | 8.582       | 30   | 32     | 32    |
| natten_ima | 1    | 17.338      | 30   | 32     | 32    |
| attention_ | 1    | 33.817      | 60   | 32     | 32    |
| natten_ima | 1    | 61.646      | 60   | 32     | 32    |
| attention_ | 1    | 134.908     | 30   | 64     | 64    |
| natten_ima | 1    | 261.328     | 30   | 64     | 64    |
| attention_ | 1    | 548.231     | 60   | 64     | 64    |
| natten_ima | 1    | 965.665     | 60   | 64     | 64    |
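For reference, a minimal CPU-runnable sketch of how the full-attention baseline can be timed with SDPA over the flattened [t * h * w] sequence. The `bench_sdpa` helper name, the float32/CPU setup, and the tiny default iteration count are illustrative assumptions, not the actual GB200/bf16 harness used for the numbers above.

```python
import time
import torch
import torch.nn.functional as F


def bench_sdpa(t, h, w, heads=24, head_dim=128, iters=10, dtype=torch.float32):
    """Median wall-clock time (ms) of full SDPA over a flattened [t*h*w] sequence.

    The issue's configuration was heads=24, head_dim=128, t in {30, 60},
    height = width in {32, 64}, bf16 on a Blackwell GB200; smaller sizes
    and float32 are used here so the sketch also runs on CPU.
    """
    seq = t * h * w
    q = torch.randn(1, heads, seq, head_dim, dtype=dtype)
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    # Warm-up once, then take the median of `iters` timed runs.
    F.scaled_dot_product_attention(q, k, v)
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        F.scaled_dot_product_attention(q, k, v)
        times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2] * 1e3
```

On GPU one would additionally call `torch.cuda.synchronize()` around the timed region (or use CUDA events), since kernel launches are asynchronous; the natten side of the comparison would time the corresponding natten3d call with and without temporal causality under the same harness.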

Other information:

  • Blackwell GB200
  • bf16 dtype
  • NATTEN 0.21.1
  • GPU driver 580.126.09
  • PyTorch 2.9.0+cu130
  • CUDA 13.0
  • cuDNN 9.13.00
  • NCCL 2.27.7
  • Triton 3.5.0
  • Python 3.12.3
