
Implementation of Flash Attention with OpenCL #3454

@dkjung

Description

Is your feature request related to a problem? Please describe.
The Attention block is not yet implemented as an OpenCL kernel. As a result, the Attention computation runs on the CPU even when the GPU is active, which incurs additional synchronization overhead between the GPU and the CPU.

Describe the solution you'd like

  • Implement a Flash Attention OpenCL kernel.
  • Flash Attention is a technique that speeds up attention execution on GPUs while producing the same result as vanilla Attention. A not-working, unfinished draft exists: Add kernel code string for flash attention #3238.
  • You may start with vanilla Attention for testing; a minimal sketch follows this list.
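
As a possible starting point, here is a minimal sketch of a vanilla (non-flash) attention kernel, written as a C++ kernel code string in the style the draft in #3238 suggests. The kernel name, the single-head row-major buffer layout, and the two-pass numerically stable softmax are assumptions for illustration, not NNTrainer's actual API:

```cpp
// Hypothetical sketch of a vanilla attention kernel code string; names,
// buffer layout, and launch convention are assumptions for illustration.
// Q, K, V, O: row-major [seq_len x head_dim] float buffers, single head.
// One work-item per query row; softmax uses a running max in two passes,
// so the full score row never has to be materialized in memory.
static const char *vanilla_attention_cl_kernel = R"(
__kernel void vanilla_attention(__global const float *Q,
                                __global const float *K,
                                __global const float *V,
                                __global float *O,
                                const int seq_len,
                                const int head_dim) {
  const int i = get_global_id(0); /* query row for this work-item */
  if (i >= seq_len)
    return;

  const float scale = rsqrt((float)head_dim);

  /* Pass 1: running max and exp-sum for a numerically stable softmax. */
  float row_max = -INFINITY;
  float row_sum = 0.0f;
  for (int j = 0; j < seq_len; ++j) {
    float s = 0.0f;
    for (int k = 0; k < head_dim; ++k)
      s += Q[i * head_dim + k] * K[j * head_dim + k];
    s *= scale;
    const float new_max = fmax(row_max, s);
    row_sum = row_sum * exp(row_max - new_max) + exp(s - new_max);
    row_max = new_max;
  }

  /* Pass 2: output row is the softmax-weighted sum of V rows. */
  for (int k = 0; k < head_dim; ++k)
    O[i * head_dim + k] = 0.0f;
  for (int j = 0; j < seq_len; ++j) {
    float s = 0.0f;
    for (int k = 0; k < head_dim; ++k)
      s += Q[i * head_dim + k] * K[j * head_dim + k];
    s *= scale;
    const float w = exp(s - row_max) / row_sum;
    for (int k = 0; k < head_dim; ++k)
      O[i * head_dim + k] += w * V[j * head_dim + k];
  }
}
)";
```

Note that the running-max update in pass 1 is the same online-softmax recurrence that Flash Attention extends to tiles of K and V, so a version like this can double as a correctness reference for the flash kernel.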

Additional context

  • A unit test that compares the output with that of NNTrainer's Attention layer is necessary.
  • Report the execution time and compare it with the CPU implementation's; a harness sketch follows this list.
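
For the comparison and timing, a minimal host-side sketch is below. The two lambdas are placeholders standing in for NNTrainer's Attention layer forward pass and the new OpenCL kernel dispatch; the sizes and the fp32 tolerance are assumptions, not project conventions:

```cpp
// Hypothetical test/benchmark harness; attention_cpu and attention_cl are
// placeholders for NNTrainer's Attention layer forward pass and the new
// OpenCL kernel dispatch. Sizes and the fp32 tolerance are assumptions.
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <vector>

// Wall-clock time of a callable in milliseconds.
static double time_ms(const std::function<void()> &fn) {
  const auto t0 = std::chrono::steady_clock::now();
  fn();
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Largest element-wise deviation between two equally sized buffers.
static float max_abs_diff(const std::vector<float> &a,
                          const std::vector<float> &b) {
  float m = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i)
    m = std::fmax(m, std::fabs(a[i] - b[i]));
  return m;
}

int main() {
  const int seq_len = 128, head_dim = 64;
  std::vector<float> out_cpu(seq_len * head_dim), out_cl(seq_len * head_dim);

  // Placeholders: wire these to the CPU Attention layer and the OpenCL
  // kernel launch, each writing its result into the buffer it captures.
  auto attention_cpu = [&]() { /* run CPU attention into out_cpu */ };
  auto attention_cl = [&]() { /* run OpenCL attention into out_cl */ };

  const double cpu_ms = time_ms(attention_cpu);
  const double cl_ms = time_ms(attention_cl);
  const float diff = max_abs_diff(out_cpu, out_cl);

  std::printf("CPU %.3f ms | OpenCL %.3f ms | max |diff| %g\n",
              cpu_ms, cl_ms, diff);
  return diff < 1e-5f ? 0 : 1; // fp32 tolerance: an assumed starting point
}
```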
