Supports NVIDIA Hopper GPUs and implements all components exactly as described in the arXiv paper: https://arxiv.org/pdf/2512.14080
A preliminary version for Blackwell GPUs can be run by setting the USE_QUACK_GEMM=1 environment variable. For example: USE_QUACK_GEMM=1 python benchmarks/moe-cute.py. The Blackwell version is not yet fully optimized; we plan to improve it in the next release.
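If you launch the benchmark from a Python driver rather than a shell, the same flag can be passed through the child process environment. The following is a minimal sketch, not part of the project: the helper names `quack_env` and `benchmark_command` are hypothetical, and it assumes you run it from the repository root so that benchmarks/moe-cute.py resolves.

```python
import os
import subprocess
import sys


def quack_env(base_env=None):
    """Return a copy of the environment with USE_QUACK_GEMM=1 set.

    Hypothetical helper: copies the given mapping (or os.environ)
    so the parent process environment is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["USE_QUACK_GEMM"] = "1"
    return env


def benchmark_command(script="benchmarks/moe-cute.py"):
    """Build the command line, reusing the current Python interpreter."""
    return [sys.executable, script]


if __name__ == "__main__":
    # Equivalent to: USE_QUACK_GEMM=1 python benchmarks/moe-cute.py
    # check=True raises CalledProcessError if the benchmark exits non-zero.
    subprocess.run(benchmark_command(), env=quack_env(), check=True)
```

Setting the variable in the child's environment before launch means it is already in place when the benchmark process starts, regardless of when the flag is read.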