[CUDA] GroupQueryAttention with XQA and Quantized KV Cache Support #3785

Job	Run time
Windows x64 QNN CI Pipeline (static_lib)	18m 5s
Windows x64 QNN CI Pipeline (shared_lib)	17m 30s
Total	35m 35s
