We should create a kernel that is as fast as cuBLAS on all shapes. This is really important for LLM inference.
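One way to make "as fast as cuBLAS on all shapes" checkable is to time both implementations per shape and flag every shape where ours falls short. The sketch below assumes the kernel in question is a GEMM, which the note does not actually say, and that timings are collected elsewhere. `Timing`, `gemm_tflops`, `relative_speed`, `failing_shapes`, and the `tolerance` parameter are all hypothetical names for illustration, not an existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    """One measured shape. All field names are hypothetical."""
    m: int
    n: int
    k: int
    kernel_ms: float  # runtime of the custom kernel, in milliseconds
    cublas_ms: float  # runtime of the cuBLAS call on the same shape


def gemm_tflops(m: int, n: int, k: int, ms: float) -> float:
    """Throughput of an (m x k) @ (k x n) GEMM, counting 2*m*n*k FLOPs."""
    return 2.0 * m * n * k / (ms * 1e-3) / 1e12


def relative_speed(t: Timing) -> float:
    """Speed relative to cuBLAS: 1.0 is parity, below 1.0 is slower."""
    return t.cublas_ms / t.kernel_ms


def failing_shapes(timings, tolerance=0.0):
    """Return (m, n, k) for each shape where the kernel falls short of cuBLAS.

    tolerance is the allowed shortfall: 0.02 accepts 98% of cuBLAS speed.
    """
    return [
        (t.m, t.n, t.k)
        for t in timings
        if relative_speed(t) < 1.0 - tolerance
    ]
```

An empty list from `failing_shapes` would mean the goal is met for the shapes that were measured. Which shapes to sweep is a separate decision, and for LLM inference it would presumably need to include skinny, small-batch cases as well as large square ones.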