[attention] Investigate overlapping matmul and softmax #91

@antiagainst

Description

Flash Attention 3 introduces a technique that overlaps the matmul and softmax phases of different waves to maximize MFMA utilization. We should evaluate how to apply it to the current attention implementation. This requires understanding the hardware scheduler and how to work with or around it, e.g., via s_setprio instructions.
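As a rough illustration of why the overlap helps, here is a small, hypothetical Python simulation of the FA3-style "ping-pong" schedule. It is not GPU code: the phase names, slot durations, and two-wave offset are all made-up modeling assumptions, chosen only to show that staggering two waves by one phase keeps the MFMA unit busy in every slot, whereas a single wave leaves it idle during softmax.

```python
# Hypothetical model of the ping-pong schedule: two waves alternate between
# a matmul phase (occupies the MFMA unit) and a softmax phase (occupies the
# VALU). Equal-length phases are an assumption for illustration only; on
# real hardware, something like s_setprio would be used to nudge the
# scheduler toward this interleaving.

MATMUL, SOFTMAX = "matmul", "softmax"

def one_wave(tiles):
    # A single wave serializes matmul and softmax per tile.
    timeline = []
    for _ in range(tiles):
        timeline += [MATMUL, SOFTMAX]
    return timeline

def ping_pong(tiles):
    # Two waves offset by one phase: while wave 0 runs softmax,
    # wave 1 occupies the MFMA unit, and vice versa.
    w0 = one_wave(tiles)
    w1 = [SOFTMAX] + one_wave(tiles)[:-1]  # shift wave 1 by one slot
    return list(zip(w0, w1))

def mfma_busy_fraction(schedule):
    # schedule: list of tuples, one phase name per wave per time slot.
    busy = sum(1 for slot in schedule if MATMUL in slot)
    return busy / len(schedule)

single = [(p,) for p in one_wave(4)]
print(mfma_busy_fraction(single))        # 0.5 -- MFMA idle during softmax
print(mfma_busy_fraction(ping_pong(4)))  # 1.0 -- MFMA busy every slot
```

Under these toy assumptions the two-wave schedule doubles MFMA occupancy; the real win on hardware depends on the actual matmul/softmax latency ratio and on how the scheduler arbitrates between waves.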
