Is your feature request related to a problem? Please describe.
The Attention block is not yet implemented as an OpenCL kernel. As a result, the Attention computation is performed on the CPU even when the GPU is enabled, which incurs additional synchronization overhead between the CPU and the GPU.
Describe the solution you'd like
- A Flash Attention OpenCL kernel should be implemented.
- Flash Attention is a technique that speeds up attention execution on GPUs while producing the same result as vanilla Attention. An unfinished, non-working draft is available in: Add kernel code string for flash attention #3238.
- You may start with vanilla Attention for testing; see the sketch after this list.
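To make the starting point concrete, below is a minimal sketch of a vanilla scaled dot-product attention kernel in OpenCL C, assuming row-major float buffers, a single head, and one work-item per query row. The kernel name and argument layout are illustrative and are not NNTrainer's actual kernel interface; in NNTrainer this would likely be embedded as a kernel code string on the host side, as in the draft PR. A Flash Attention kernel would replace the two passes over K/V with a single tiled pass using an online softmax.

```c
// Vanilla scaled dot-product attention: O = softmax(Q K^T / sqrt(d)) V
// Q: [seq_len_q x d], K/V: [seq_len_kv x d], O: [seq_len_q x d], row-major.
// One work-item handles one query row; illustrative only, not tuned.
__kernel void vanilla_attention(__global const float *Q,
                                __global const float *K,
                                __global const float *V,
                                __global float *O,
                                const int seq_len_kv,
                                const int d) {
  const int row = get_global_id(0);
  const float scale = 1.0f / sqrt((float)d);

  /* First pass: maximum score, for numerical stability of the softmax. */
  float max_score = -INFINITY;
  for (int j = 0; j < seq_len_kv; ++j) {
    float s = 0.0f;
    for (int k = 0; k < d; ++k)
      s += Q[row * d + k] * K[j * d + k];
    max_score = fmax(max_score, s * scale);
  }

  /* Second pass: accumulate exp-weighted V rows and the softmax denominator. */
  float denom = 0.0f;
  for (int k = 0; k < d; ++k)
    O[row * d + k] = 0.0f;
  for (int j = 0; j < seq_len_kv; ++j) {
    float s = 0.0f;
    for (int k = 0; k < d; ++k)
      s += Q[row * d + k] * K[j * d + k];
    const float p = exp(s * scale - max_score);
    denom += p;
    for (int k = 0; k < d; ++k)
      O[row * d + k] += p * V[j * d + k];
  }

  /* Normalize by the softmax denominator. */
  for (int k = 0; k < d; ++k)
    O[row * d + k] /= denom;
}
```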
Additional context
- A unit test that compares the kernel's output with that of NNTrainer's Attention layer is necessary; a CPU reference sketch follows this list.
- Report the execution time and compare it with the CPU implementation's.
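As one possible shape for that test, here is a C++ sketch of a host-side CPU reference plus a tolerance comparison. How the OpenCL kernel is launched and how its result is read back depend on NNTrainer's OpenCL plumbing, so the GPU side is only indicated in comments; all names here are illustrative rather than existing NNTrainer APIs.

```cpp
// CPU reference for the unit test plus a tolerance check. The GPU output
// would come from the OpenCL kernel above via NNTrainer's OpenCL path,
// which is not shown here.
#include <algorithm>
#include <chrono>
#include <cmath>
#include <vector>

// Vanilla attention on the host: O = softmax(Q K^T / sqrt(d)) V, row-major.
void attention_cpu(const std::vector<float> &Q, const std::vector<float> &K,
                   const std::vector<float> &V, std::vector<float> &O,
                   int seq_q, int seq_kv, int d) {
  const float scale = 1.0f / std::sqrt(static_cast<float>(d));
  for (int i = 0; i < seq_q; ++i) {
    std::vector<float> score(seq_kv);
    float max_s = -INFINITY;
    for (int j = 0; j < seq_kv; ++j) {
      float s = 0.0f;
      for (int k = 0; k < d; ++k)
        s += Q[i * d + k] * K[j * d + k];
      score[j] = s * scale;
      max_s = std::max(max_s, score[j]);
    }
    float denom = 0.0f;
    for (int j = 0; j < seq_kv; ++j) {
      score[j] = std::exp(score[j] - max_s);
      denom += score[j];
    }
    for (int k = 0; k < d; ++k) {
      float acc = 0.0f;
      for (int j = 0; j < seq_kv; ++j)
        acc += score[j] * V[j * d + k];
      O[i * d + k] = acc / denom;
    }
  }
}

// Element-wise comparison within a tolerance, for the GPU-vs-CPU check.
bool allclose(const std::vector<float> &a, const std::vector<float> &b,
              float tol = 1e-4f) {
  for (size_t i = 0; i < a.size(); ++i)
    if (std::fabs(a[i] - b[i]) > tol)
      return false;
  return true;
}

// Execution time can be reported with std::chrono around each path, e.g.:
//   auto t0 = std::chrono::steady_clock::now();
//   attention_cpu(Q, K, V, O_cpu, seq_q, seq_kv, d);
//   auto t1 = std::chrono::steady_clock::now();
//   /* launch the OpenCL kernel and read the result back into O_gpu */
//   auto t2 = std::chrono::steady_clock::now();
//   // compare (t1 - t0) with (t2 - t1) and check allclose(O_cpu, O_gpu).
```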