I was testing GMRES on the Spock machine with AMD MI100 GPUs. (The code is from the /examples/gmres directory.)
The fp64 GEMV was significantly slower on the MI100 than on the V100.
Timing breakdown for fp64 GMRES (BentPipe2D1500 matrix):
V100:
SpMV: 7.33 s
Gemv no trans: 19.01 s (cuBLAS here, I think)
Gemv Trans: 20.20 s (Seher’s dot-based Gemm, I think?)
Norm: 1.72 s
MI100:
SpMV: 7.38 s
Gemv no trans: 48.5 s (again, this is gemv[twoLevel])
Gemv Trans: 29.82 s (gemv[twoLevelTranspose])
Norm: 1.29 s
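
To make the gap easier to read, here is a small sketch that computes the MI100/V100 time ratio per kernel from the numbers above. The dictionary layout and the `slowdown` helper are just illustrative names, not part of the GMRES example code.

```python
# Timings in seconds, copied from the breakdown above.
timings = {
    "SpMV": {"V100": 7.33, "MI100": 7.38},
    "Gemv no trans": {"V100": 19.01, "MI100": 48.5},
    "Gemv Trans": {"V100": 20.20, "MI100": 29.82},
    "Norm": {"V100": 1.72, "MI100": 1.29},
}


def slowdown(table):
    """Return the MI100/V100 time ratio per kernel (hypothetical helper)."""
    return {kernel: t["MI100"] / t["V100"] for kernel, t in table.items()}


ratios = slowdown(timings)
for kernel, ratio in ratios.items():
    # A ratio above 1 means the MI100 run took longer than the V100 run.
    print(f"{kernel:14s} MI100/V100 = {ratio:.2f}x")
```

By this measure the non-transposed GEMV is roughly 2.5x slower on the MI100 and the transposed GEMV roughly 1.5x slower, while SpMV is about even and the norm is faster.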
@brian-kelley discussed this a bit in the #1081 issue, suggesting that the row block length for GEMV should be 64 for AMD GPUs.
Here we are calling only Kokkos-native GEMV kernels. There is no use of rocBLAS at this time.
Timings were similarly slow for GEMV when I ran related experiments with fp32.