Using FlashInfer CUTLASS Backend for vLLM is Slow on SM120/121

vLLM has reported that NVFP4 inference is generally slow on SM12x, especially on DGX Spark (SM121).

This issue tracks ongoing efforts to improve NVFP4 inference performance on SM120 and SM121.

Benchmarking:
* #3002 

Kernel Improvements:
* #3008
* #3014
* #3026
* #3051 
* #3066
* #3080
* #3193

Misc. fixes:
* #3152 
* #3191