`batched_mul` launches one kernel per matrix when eltypes differ rather than promoting

Small bug that can lead to massive slowdowns.

Minimal example:

``` 
A = CUDA.rand(TA, 2, 2, 100)
B = CUDA.rand(TB, 2, 2, 100)
CUDA.@profile NNlib.batched_mul(A, B)
```

For TA = TB = Float32, we have expected results

```

Profiler ran for 86.78 µs, capturing 32 events.

Host-side activity: calling CUDA APIs took 27.18 µs (31.32% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                   │
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────┤
│   20.33% │   17.64 µs │     1 │                                      │ cudaLaunchKernel                                       │
│    6.04% │    5.25 µs │     3 │   1.75 µs ± 1.85   (  0.24 ‥ 3.81)   │ cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│    3.57% │     3.1 µs │     1 │                                      │ cuMemAllocFromPoolAsync                                │
│    0.27% │  238.42 ns │     2 │ 119.21 ns ± 168.59 (   0.0 ‥ 238.42) │ cudaGetLastError                                       │
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────────────────────────────────────────────┘

Device-side activity: GPU was busy for 1.91 µs (2.20% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Name                                                                                                                                                                                                                                                        ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│    2.20% │    1.91 µs │     1 │ void gemmSN_NN_kernel<double, 256, 4, 2, 8, 2, 4, false, cublasGemvTensorStridedBatched<double const>, cublasGemvTensorStridedBatched<double const>, cublasGemvTensorStridedBatched<double>>(cublasGemmSmallNParams<cublasGemvTensorStridedBatched<double c ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
         
```

When the types differ, rather than promoting, the generic method is called which launches one kernel per matmul and is incredibly slow.

```
Profiler ran for 6.41 ms, capturing 3310 events.

Host-side activity: calling CUDA APIs took 354.53 µs (5.53% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                   │ Name                    │
├──────────┼────────────┼───────┼─────────────────────────────────────┼─────────────────────────┤
│    3.64% │  233.65 µs │   100 │   2.34 µs ± 2.66   (  1.67 ‥ 28.13) │ cuLaunchKernel          │
│    0.11% │    6.91 µs │     1 │                                     │ cuMemAllocFromPoolAsync │
└──────────┴────────────┴───────┴─────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 162.6 µs (2.54% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                  │ Name                                                                                                                                                                                                                   ⋯
├──────────┼────────────┼───────┼────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│    2.54% │   162.6 µs │   100 │   1.63 µs ± 0.17   (  1.43 ‥ 1.91) │ gpu_matmatmul_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, NDRange<2, DynamicSize, DynamicSize, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int ⋯
└──────────┴────────────┴───────┴────────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
    
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`batched_mul` launches one kernel per matrix when eltypes differ rather than promoting #621

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

batched_mul launches one kernel per matrix when eltypes differ rather than promoting #621

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

`batched_mul` launches one kernel per matrix when eltypes differ rather than promoting #621