Is your feature request related to a problem? Please describe.
The Attention block is not yet implemented as an OpenCL kernel. As a result, the Attention computation is performed on the CPU even when the GPU is enabled, which incurs additional synchronization overhead between the CPU and the GPU.
Describe the solution you'd like
- A Flash Attention OpenCL kernel should be implemented.
- Flash Attention is a technique that speeds up attention execution on GPUs while producing the same result as vanilla Attention. An unfinished, non-working draft is available in: Add kernel code string for flash attention #3238.
- You may start with vanilla Attention for testing; see the sketch after this list.
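To make the starting point concrete, below is a minimal sketch of a vanilla scaled dot-product attention kernel in OpenCL C, assuming row-major float buffers, a single head, and one work-item per query row. The kernel name and argument layout are illustrative and are not NNTrainer's actual kernel interface; in NNTrainer this would likely be embedded as a kernel code string on the host side, as in the draft PR. A Flash Attention kernel would replace the two passes over K/V with a single tiled pass using an online softmax.

```c
// Vanilla scaled dot-product attention: O = softmax(Q K^T / sqrt(d)) V
// Q: [seq_len_q x d], K/V: [seq_len_kv x d], O: [seq_len_q x d], row-major.
// One work-item handles one query row; illustrative only, not tuned.
__kernel void vanilla_attention(__global const float *Q,
                                __global const float *K,
                                __global const float *V,
                                __global float *O,
                                const int seq_len_kv,
                                const int d) {
  const int row = get_global_id(0);
  const float scale = 1.0f / sqrt((float)d);

  /* First pass: maximum score, for numerical stability of the softmax. */
  float max_score = -INFINITY;
  for (int j = 0; j < seq_len_kv; ++j) {
    float s = 0.0f;
    for (int k = 0; k < d; ++k)
      s += Q[row * d + k] * K[j * d + k];
    max_score = fmax(max_score, s * scale);
  }

  /* Second pass: accumulate exp-weighted V rows and the softmax denominator. */
  float denom = 0.0f;
  for (int k = 0; k < d; ++k)
    O[row * d + k] = 0.0f;
  for (int j = 0; j < seq_len_kv; ++j) {
    float s = 0.0f;
    for (int k = 0; k < d; ++k)
      s += Q[row * d + k] * K[j * d + k];
    const float p = exp(s * scale - max_score);
    denom += p;
    for (int k = 0; k < d; ++k)
      O[row * d + k] += p * V[j * d + k];
  }

  /* Normalize by the softmax denominator. */
  for (int k = 0; k < d; ++k)
    O[row * d + k] /= denom;
}
```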
Additional context
- A unit test that compares the kernel's output with that of NNTrainer's Attention layer is necessary; a CPU reference sketch follows this list.
- Report the execution time and compare it with the CPU implementation's.
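As one possible shape for that test, here is a C++ sketch of a host-side CPU reference plus a tolerance comparison. How the OpenCL kernel is launched and how its result is read back depend on NNTrainer's OpenCL plumbing, so the GPU side is only indicated in comments; all names here are illustrative rather than existing NNTrainer APIs.

```cpp
// CPU reference for the unit test plus a tolerance check. The GPU output
// would come from the OpenCL kernel above via NNTrainer's OpenCL path,
// which is not shown here.
#include <algorithm>
#include <chrono>
#include <cmath>
#include <vector>

// Vanilla attention on the host: O = softmax(Q K^T / sqrt(d)) V, row-major.
void attention_cpu(const std::vector<float> &Q, const std::vector<float> &K,
                   const std::vector<float> &V, std::vector<float> &O,
                   int seq_q, int seq_kv, int d) {
  const float scale = 1.0f / std::sqrt(static_cast<float>(d));
  for (int i = 0; i < seq_q; ++i) {
    std::vector<float> score(seq_kv);
    float max_s = -INFINITY;
    for (int j = 0; j < seq_kv; ++j) {
      float s = 0.0f;
      for (int k = 0; k < d; ++k)
        s += Q[i * d + k] * K[j * d + k];
      score[j] = s * scale;
      max_s = std::max(max_s, score[j]);
    }
    float denom = 0.0f;
    for (int j = 0; j < seq_kv; ++j) {
      score[j] = std::exp(score[j] - max_s);
      denom += score[j];
    }
    for (int k = 0; k < d; ++k) {
      float acc = 0.0f;
      for (int j = 0; j < seq_kv; ++j)
        acc += score[j] * V[j * d + k];
      O[i * d + k] = acc / denom;
    }
  }
}

// Element-wise comparison within a tolerance, for the GPU-vs-CPU check.
bool allclose(const std::vector<float> &a, const std::vector<float> &b,
              float tol = 1e-4f) {
  for (size_t i = 0; i < a.size(); ++i)
    if (std::fabs(a[i] - b[i]) > tol)
      return false;
  return true;
}

// Execution time can be reported with std::chrono around each path, e.g.:
//   auto t0 = std::chrono::steady_clock::now();
//   attention_cpu(Q, K, V, O_cpu, seq_q, seq_kv, d);
//   auto t1 = std::chrono::steady_clock::now();
//   /* launch the OpenCL kernel and read the result back into O_gpu */
//   auto t2 = std::chrono::steady_clock::now();
//   // compare (t1 - t0) with (t2 - t1) and check allclose(O_cpu, O_gpu).
```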