Slow GEMV AMD MI100 vs V100 #1083

I was testing GMRES on the Spock machine with AMD MI100 GPUs.  (The code is from the /examples/gmres directory.)

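The fp64 GEMV was significantly slower on the MI100 than on the V100: about 2.6x slower without transpose (48.5 s vs. 19.01 s) and about 1.5x slower with transpose (29.82 s vs. 20.20 s), while SpMV and norm times were comparable. Full breakdown: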

```
Timing breakdown for fp64 GMRES (BentPipe2D1500 matrix):

V100:
SpMV: 7.33 s
Gemv no trans: 19.01 s (cuBLAS here, I think)
Gemv trans: 20.20 s (Seher’s dot-based Gemm, I think?)
Norm: 1.72 s

MI100:
SpMV: 7.38 s
Gemv no trans: 48.5 s (again, this is gemv[twoLevel])
Gemv trans: 29.82 s (gemv[twoLevelTranspose])
Norm: 1.29 s
```
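
For context on why both GEMV variants appear in the breakdown: in a GMRES iteration with classical Gram-Schmidt orthogonalization, the new Krylov vector is projected against the basis with a transposed GEMV and then corrected with a non-transposed GEMV. The following is a serial, plain C++ sketch of that pattern, assuming a column-major basis. `Basis`, `gemv_trans`, `gemv_notrans_update`, and `orthogonalize` are hypothetical names, and this is not the code in /examples/gmres, which may use a different orthogonalization variant.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: column-major dense matrix with n rows, k columns.
struct Basis {
  std::size_t n, k;
  std::vector<double> data;  // entry (i, j) is stored at data[i + j * n]
  double at(std::size_t i, std::size_t j) const { return data[i + j * n]; }
};

// Transposed GEMV: h = V^T * w (k dot products, each of length n).
std::vector<double> gemv_trans(const Basis& V, const std::vector<double>& w) {
  std::vector<double> h(V.k, 0.0);
  for (std::size_t j = 0; j < V.k; ++j)
    for (std::size_t i = 0; i < V.n; ++i) h[j] += V.at(i, j) * w[i];
  return h;
}

// Non-transposed GEMV update: w = w - V * h.
void gemv_notrans_update(const Basis& V, const std::vector<double>& h,
                         std::vector<double>& w) {
  for (std::size_t j = 0; j < V.k; ++j)
    for (std::size_t i = 0; i < V.n; ++i) w[i] -= V.at(i, j) * h[j];
}

// One classical Gram-Schmidt pass: project w onto the basis, then subtract
// the projection. Returns the projection coefficients h.
std::vector<double> orthogonalize(const Basis& V, std::vector<double>& w) {
  assert(w.size() == V.n);
  std::vector<double> h = gemv_trans(V, w);
  gemv_notrans_update(V, h, w);
  return h;
}
```

Both calls touch every entry of the basis once per iteration, which is consistent with GEMV dominating the timings above once the basis has many columns.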

@brian-kelley discussed this a bit in issue #1081, suggesting that the row block length for GEMV should be 64 on AMD GPUs.
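
To make the "row block length" knob concrete, here is a minimal serial sketch of a row-blocked, non-transposed GEMV in plain C++. It is an illustration only, not the Kokkos Kernels gemv[twoLevel] kernel: the function name `gemv_row_blocked`, the column-major layout, and the mapping of one row block to one team of GPU threads are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// y = A * x for a column-major n-by-k matrix A, processing rows in blocks.
// row_block_len is the tuning parameter (for example, 64 as suggested for
// AMD GPUs). In a GPU kernel, each block would be handled by one team of
// threads (assumed mapping); here the blocks are simply visited in order.
std::vector<double> gemv_row_blocked(const std::vector<double>& A,
                                     std::size_t n, std::size_t k,
                                     const std::vector<double>& x,
                                     std::size_t row_block_len) {
  assert(row_block_len > 0 && A.size() == n * k && x.size() == k);
  std::vector<double> y(n, 0.0);
  for (std::size_t begin = 0; begin < n; begin += row_block_len) {
    const std::size_t end = std::min(begin + row_block_len, n);
    // Sweep the columns once per block, so that each column segment
    // A(begin:end, j) is read contiguously.
    for (std::size_t j = 0; j < k; ++j)
      for (std::size_t i = begin; i < end; ++i) y[i] += A[i + j * n] * x[j];
  }
  return y;
}
```

The result does not depend on `row_block_len`; only the work partitioning changes, which is why the parameter can be tuned per architecture without affecting correctness.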

Here we are calling only the Kokkos-native GEMV kernels; rocBLAS is not used at this time.

GEMV was similarly slow when I ran related experiments in fp32.
