Description
The specialized softmax backward pass is significantly slower than the generic implementation from NNlib.jl. The effect seems especially pronounced when the batch dimension is large.
Here is the code to reproduce the issue:
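A minimal sketch of the kind of benchmark I ran (the array sizes are illustrative; the three-argument `NNlib.∇softmax(dy, x, y; dims)` method and the specialized `CuArray` dispatch provided via NNlibCUDA are assumed):

```julia
using CUDA, NNlib, NNlibCUDA, BenchmarkTools

# Generic softmax backward rule written as a plain broadcast, so it
# bypasses the specialized CuArray method (formula as in NNlib's fallback):
generic_∇softmax(dy, y; dims=1) = y .* (dy .- sum(dy .* y; dims=dims))

x  = CUDA.randn(Float32, 128, 4096)   # features × batch (batch dimension large)
y  = softmax(x; dims=1)
dy = CUDA.randn(Float32, size(x)...)

# Specialized backward pass (dispatches to the CuArray method):
@btime CUDA.@sync NNlib.∇softmax($dy, $x, $y; dims=1)

# Generic broadcast implementation:
@btime CUDA.@sync generic_∇softmax($dy, $y; dims=1)
```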
Below are the results of my benchmarks:
My CUDA versioninfo:
CUDA toolkit 11.4.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 461.9.0
Libraries:
- CUBLAS: 11.5.4
- CURAND: 10.2.5
- CUFFT: 10.5.1
- CUSOLVER: 11.2.0
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+461.9
- CUDNN: 8.20.2 (for CUDA 11.4.0)
- CUTENSOR: 1.3.0 (for CUDA 11.2.0)
Toolchain:
- Julia: 1.6.1
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
1 device:
0: GeForce RTX 2070 with Max-Q Design (sm_75, 5.873 GiB / 8.000 GiB available)