Slow ∇softmax! compared with generic version. #513

Open · FluxML/NNlibCUDA.jl #44 · @jumerckx

The specialized backward pass for softmax takes a lot longer than the generic implementation from NNlib.jl.
The effect seems especially pronounced when the batch dimension is large.
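
For context, the generic NNlib fallback computes the softmax pullback with a couple of broadcasts and a reduction, essentially as follows (a sketch of the formula, not the exact NNlib source; the helper name is illustrative):

```julia
# With y = softmax(x; dims), the pullback of dy is
# dx = (dy .- sum(dy .* y; dims=dims)) .* y
∇softmax_generic(dy, y; dims=1) = (dy .- sum(dy .* y; dims=dims)) .* y
```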

Here's the code to reproduce this issue.

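A minimal sketch of such a benchmark, assuming the ∇softmax!(dx, dy, x, y; dims=1) method that NNlib exposed at the time (array sizes and the helper are illustrative, not the exact original script):

```julia
using CUDA, NNlib, NNlibCUDA, BenchmarkTools

# Generic broadcast formula, for comparison against the specialized CuArray method.
∇softmax_generic(dy, y; dims=1) = (dy .- sum(dy .* y; dims=dims)) .* y

x  = CUDA.randn(Float32, 128, 16_384)   # small feature dim, large batch dim
y  = softmax(x; dims=1)
dy = CUDA.randn(Float32, size(x)...)
dx = similar(x)

# Specialized CuArray method from NNlibCUDA (signature assumed, see note above).
@btime CUDA.@sync NNlib.∇softmax!($dx, $dy, $x, $y; dims=1)

# Generic broadcast version running as plain CUDA.jl kernels.
@btime CUDA.@sync ∇softmax_generic($dy, $y; dims=1)
```
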
Below are the results of my benchmarks:

My CUDA.versioninfo() output:

CUDA toolkit 11.4.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 461.9.0

Libraries: 
- CUBLAS: 11.5.4
- CURAND: 10.2.5
- CUFFT: 10.5.1
- CUSOLVER: 11.2.0
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+461.9
- CUDNN: 8.20.2 (for CUDA 11.4.0)
- CUTENSOR: 1.3.0 (for CUDA 11.2.0)

Toolchain:
- Julia: 1.6.1
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: GeForce RTX 2070 with Max-Q Design (sm_75, 5.873 GiB / 8.000 GiB available)
