I have been trying to take the derivative of a Flux model in testmode, and noticed that the BatchNorm layer behaves incorrectly for 2D, 4D, and 5D CUDA arrays. Here is a minimal example of this behaviour, computing the gradient of a BatchNorm layer for differently reshaped inputs:
```julia
using Flux, CUDA, Zygote

# Gradient of sum(m(x).^2) w.r.t. x, with x reshaped to an all-singleton
# shape of `n_dims` dimensions, and model and input moved to `device`.
function gradient_varying_shape(m, x, n_dims, device)
    m = m |> device
    Flux.testmode!(m)  # use running statistics instead of batch statistics
    x = reshape(x, ntuple(i -> 1, n_dims)) |> device
    return gradient(input -> sum(m(input).^2), x)[1] |> cpu
end

model = BatchNorm(1)
x = [1f0]

for i = 2:7
    cpu_gradient = gradient_varying_shape(model, x, i, cpu)
    gpu_gradient = gradient_varying_shape(model, x, i, gpu)
    println("n_dim=$i, cpu: $(cpu_gradient[1]), gpu: $(gpu_gradient[1])")
end
```
This gives the following output for me:
```
n_dim=2, cpu: 1.99998, gpu: 0.0
n_dim=3, cpu: 1.99998, gpu: 1.99998
n_dim=4, cpu: 1.99998, gpu: 0.0
n_dim=5, cpu: 1.99998, gpu: 0.0
n_dim=6, cpu: 1.99998, gpu: 1.99998
n_dim=7, cpu: 1.99998, gpu: 1.99998
```
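For reference, the CPU value matches the analytic testmode gradient. With the default `BatchNorm(1)` state (γ = 1, β = 0, running mean μ = 0, running variance σ² = 1, ϵ = 1f-5), testmode computes y = γ(x − μ)/√(σ² + ϵ) + β, so the gradient of `sum(y.^2)` with respect to x is 2γy/√(σ² + ϵ). A quick sanity check, assuming those defaults:

```julia
# Sanity check, assuming Flux's default BatchNorm(1) state:
# γ = 1, β = 0, running mean μ = 0, running variance σ² = 1, ϵ = 1f-5.
ϵ = 1f-5
y = (1f0 - 0f0) / sqrt(1f0 + ϵ)     # testmode forward pass for x = 1f0
expected = 2f0 * y / sqrt(1f0 + ϵ)  # d/dx sum(y^2) = 2γy / √(σ² + ϵ)
println(expected)                   # ≈ 1.99998, matching the CPU column
```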
Looking through the code, I found that the implementation of the CUDA backward batchnorm here ignores the `training` argument. Could this be the origin of this behaviour? Notably, the failing cases (2D, 4D, and 5D) appear to be exactly the input shapes that are dispatched to the cuDNN batchnorm kernels, while 3D, 6D, and 7D inputs fall back to the generic implementation.
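Until this is fixed, a possible workaround (only a sketch, relying on the dispatch pattern above) is to reshape a 4D input to 6D before applying the layer, which keeps the channels in the second-to-last dimension that `BatchNorm` normalizes over but avoids the cuDNN path, and to reshape the result back. `batchnorm_generic` here is a hypothetical helper, not part of Flux:

```julia
# Hypothetical helper (not part of Flux): apply a BatchNorm to a 4D array
# through an equivalent 6D reshape. The channel dimension stays second-to-last,
# so the normalization is unchanged, but the 6D input takes the generic
# (non-cuDNN) code path, which produced the correct testmode gradient above.
function batchnorm_generic(m::BatchNorm, x::AbstractArray{<:Any,4})
    w, h, c, n = size(x)
    y = m(reshape(x, w, h, 1, 1, c, n))
    return reshape(y, w, h, c, n)
end
```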
I'm using Julia 1.9.3 with NNlib version 0.9.7 and this environment:
```
[052768ef] CUDA v5.0.0
[587475ba] Flux v0.14.6
[e88e6eb3] Zygote v0.6.66
[02a925ec] cuDNN v1.2.0
```