Q8 Kernels


Q8 Kernels is an efficient implementation of 8-bit kernels (FP8 and INT8).

Features:

- 8-bit GEMM (with fused GELU and bias): 2x faster than cuBLAS FP8 and 3.5x faster than torch.mm
- FP8 Flash Attention 2 with Fast Hadamard Transform (also supports cross-attention masks): 2x faster than Flash Attention 2
- Mixed Precision Fast Hadamard Transform
- RMSNorm
- Mixed Precision FMA
- RoPE Layer
- Quantizers

All operations are implemented in CUDA. The current version supports the Ada architecture (Ampere optimizations are coming soon!).
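
As a point of reference for what two of these building blocks compute, here is a minimal PyTorch sketch of a per-tensor FP8 quantizer and an (unnormalized) Hadamard transform. It is an illustrative reference only, not the library's CUDA implementation or its actual API; the function names are hypothetical.

import torch

def quantize_fp8_e4m3(x: torch.Tensor):
    # Hypothetical reference quantizer: per-tensor scale into the float8_e4m3
    # range, then cast. Returns the FP8 tensor and the scale for dequantization.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = x.abs().max().clamp(min=1e-12) / fp8_max
    x_q = (x / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return x_q, scale

def hadamard_transform(x: torch.Tensor) -> torch.Tensor:
    # Reference (unnormalized) Hadamard transform over the last dimension,
    # written as the standard butterfly recursion. The length of the last
    # dimension must be a power of two; the CUDA kernel fuses this loop.
    n = x.shape[-1]
    assert n & (n - 1) == 0, "last dim must be a power of two"
    h = 1
    while h < n:
        x = x.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = x[..., 0, :], x[..., 1, :]
        x = torch.stack((a + b, a - b), dim=-2).reshape(*x.shape[:-3], n)
        h *= 2
    return x

if __name__ == "__main__":
    t = torch.randn(4, 64)
    t_q, s = quantize_fp8_e4m3(t)
    print("max quantization error:", (t_q.to(torch.float32) * s - t).abs().max().item())
    print(hadamard_transform(torch.eye(8)))  # prints the 8x8 Hadamard matrix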

Installation

q8_kernels requires CUDA >= 12.4 and PyTorch >= 2.4. It has been tested on a Windows machine; building on Linux systems should work as well. Install ninja (pip install ninja) and make sure it works correctly (e.g. ninja --version); without ninja, installation is very slow.

git clone https://github.com/KONAKONA666/q8_kernels
cd q8_kernels 
git submodule init
git submodule update

python setup.py install
pip install . # for utility

It takes ~10-15 minutes to compile and install all modules.
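
After installation, a quick sanity check along these lines confirms that the package imports and that CUDA is visible to PyTorch. The import name below is assumed from the repository name and may differ.

import torch
import q8_kernels  # module name assumed from the repo name

print("CUDA available:", torch.cuda.is_available())
print("q8_kernels loaded from:", q8_kernels.__file__)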

Supported models

Speed-ups are measured relative to 16-bit inference with transformers and Flash Attention 2.

Model name    Speed up
LTXVideo      up to 2.5x

Acknowledgement

Thanks to:

- Flash Attention
- @66RING
- fast-hadamard-transform
- cutlass
- @weishengying: check his CuTe exercises and flash attention implementations

Authors

KONAKONA666

License

MIT License
