I have been trying to take the derivative of a Flux model in testmode, and noticed that the BatchNorm layer behaves incorrectly for 2D, 4D, and 5D CUDA arrays. Here is a minimal example of this behaviour, computing the gradient of a BatchNorm layer for differently reshaped inputs:
```julia
using Flux, CUDA, Zygote

# Gradient of sum(m(x).^2) w.r.t. x, with x reshaped to an all-singleton
# shape of `n_dims` dimensions, and model and input moved to `device`.
function gradient_varying_shape(m, x, n_dims, device)
    m = m |> device
    Flux.testmode!(m)  # use running statistics instead of batch statistics
    x = reshape(x, ntuple(i -> 1, n_dims)) |> device
    return gradient(input -> sum(m(input).^2), x)[1] |> cpu
end

model = BatchNorm(1)
x = [1f0]

for i = 2:7
    cpu_gradient = gradient_varying_shape(model, x, i, cpu)
    gpu_gradient = gradient_varying_shape(model, x, i, gpu)
    println("n_dim=$i, cpu: $(cpu_gradient[1]), gpu: $(gpu_gradient[1])")
end
```
This gives the following output for me:
```
n_dim=2, cpu: 1.99998, gpu: 0.0
n_dim=3, cpu: 1.99998, gpu: 1.99998
n_dim=4, cpu: 1.99998, gpu: 0.0
n_dim=5, cpu: 1.99998, gpu: 0.0
n_dim=6, cpu: 1.99998, gpu: 1.99998
n_dim=7, cpu: 1.99998, gpu: 1.99998
```
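For reference, the CPU value matches the analytic testmode gradient. With the default `BatchNorm(1)` state (γ = 1, β = 0, running mean μ = 0, running variance σ² = 1, ϵ = 1f-5), testmode computes y = γ(x − μ)/√(σ² + ϵ) + β, so the gradient of `sum(y.^2)` with respect to x is 2γy/√(σ² + ϵ). A quick sanity check, assuming those defaults:

```julia
# Sanity check, assuming Flux's default BatchNorm(1) state:
# γ = 1, β = 0, running mean μ = 0, running variance σ² = 1, ϵ = 1f-5.
ϵ = 1f-5
y = (1f0 - 0f0) / sqrt(1f0 + ϵ)     # testmode forward pass for x = 1f0
expected = 2f0 * y / sqrt(1f0 + ϵ)  # d/dx sum(y^2) = 2γy / √(σ² + ϵ)
println(expected)                   # ≈ 1.99998, matching the CPU column
```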
Looking through the code, I found that the implementation of the CUDA backward batchnorm here ignores the `training` argument. Could this be the origin of this behaviour? Notably, the failing cases (2D, 4D, and 5D) appear to be exactly the input shapes that are dispatched to the cuDNN batchnorm kernels, while 3D, 6D, and 7D inputs fall back to the generic implementation.
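Until this is fixed, a possible workaround (only a sketch, relying on the dispatch pattern above) is to reshape a 4D input to 6D before applying the layer, which keeps the channels in the second-to-last dimension that `BatchNorm` normalizes over but avoids the cuDNN path, and to reshape the result back. `batchnorm_generic` here is a hypothetical helper, not part of Flux:

```julia
# Hypothetical helper (not part of Flux): apply a BatchNorm to a 4D array
# through an equivalent 6D reshape. The channel dimension stays second-to-last,
# so the normalization is unchanged, but the 6D input takes the generic
# (non-cuDNN) code path, which produced the correct testmode gradient above.
function batchnorm_generic(m::BatchNorm, x::AbstractArray{<:Any,4})
    w, h, c, n = size(x)
    y = m(reshape(x, w, h, 1, 1, c, n))
    return reshape(y, w, h, c, n)
end
```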
I'm using Julia 1.9.3 with NNlib version 0.9.7 and this environment:
```
[052768ef] CUDA v5.0.0
[587475ba] Flux v0.14.6
[e88e6eb3] Zygote v0.6.66
[02a925ec] cuDNN v1.2.0
```