1. Int8 dynamic quantization (weight only) support
2. Make UNet run as fast as possible without Triton (Triton has significant CPU overhead and is not yet stable enough)
3. Faster convolution kernel with an FP16 accumulator
4. Demonstrate the ability to optimize LLMs (I have already used stable-fast to do this internally)
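
For readers unfamiliar with the first item, here is a minimal sketch of what weight-only int8 quantization means in general. It is not stable-fast's implementation. The function names (`quantize_row`, `quantize_weight`, `linear_int8_weight`) and the symmetric per-row scaling scheme are assumptions chosen for illustration, and plain Python lists stand in for tensors.

```python
# Hypothetical illustration of weight-only int8 quantization.
# Not stable-fast code; names and the per-row symmetric scheme are assumptions.

def quantize_row(row):
    """Symmetric quantization of one weight row: returns (int8 values, scale)."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale

def quantize_weight(weight):
    """Quantize each output row of a 2-D weight matrix independently."""
    return [quantize_row(row) for row in weight]

def linear_int8_weight(x, qweight):
    """Compute y = W x with int8 weights; activations stay in floating point."""
    return [scale * sum(q * xi for q, xi in zip(qrow, x))
            for qrow, scale in qweight]

weight = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
x = [1.0, 2.0, 3.0]

qweight = quantize_weight(weight)
y_ref = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
y_q = linear_int8_weight(x, qweight)
```

Only the weights are stored in 8 bits here, so the saving is in weight memory and bandwidth. `y_q` should stay close to `y_ref`, with a small error from rounding.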