
DepthwiseConv significantly slower than Conv during training #2626

@a-r-n-o-l-d

Description:

During training, `DepthwiseConv` appears to be much slower than standard `Conv`, especially when using anisotropic kernels. This behavior is unexpected, as depthwise convolutions are generally considered more efficient in terms of parameter count and computation.

Key Observations:

- Standard `Conv` is consistently faster than `DepthwiseConv` in my benchmarks (see the single-layer timing sketch below).
- Anisotropic depthwise convolutions (both sequential and parallel) are the slowest, despite having fewer trainable parameters.
- Manual memory management in CUDA (commented out in the example) does not seem to help.
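
For reference, a minimal single-layer timing sketch along these lines could isolate the forward and backward cost (sizes match the MWE below, i.e. nchannels = 4 and batchsize = 8; output not included here):

using Flux
using CUDA

x  = CUDA.rand(Float32, 512, 512, 4, 8)
c  = Conv((3,3), 4=>4, pad=(1,1)) |> gpu
dw = DepthwiseConv((3,3), 4=>4, pad=(1,1)) |> gpu

loss(layer) = sum(abs2, layer(x))

for (name, layer) in (("Conv", c), ("DepthwiseConv", dw))
    layer(x); Flux.gradient(loss, layer)           # warm-up so compilation is excluded
    tf = CUDA.@elapsed layer(x)                    # forward pass only
    tb = CUDA.@elapsed Flux.gradient(loss, layer)  # forward + backward pass
    println("$name: forward $(round(1e3 * tf, digits=2)) ms, forward+backward $(round(1e3 * tb, digits=2)) ms")
end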

Question:

Am I missing something in my implementation or usage? Should I be doing manual memory management on the CUDA side (as sketched below) to improve performance?
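
For context, this is roughly what I mean by manual memory management: the MWE's inner loop with the commented-out lines enabled, plus an explicit pool reclaim (a sketch only; `X`, `Y`, `model`, `myloss` and `opt_state` are as in the training loop below):

for (x, y) in zip(X, Y)
    xb, yb = gpu(x), gpu(y)
    grads = Flux.gradient(m -> myloss(m(xb), yb), model)
    Flux.update!(opt_state, model, grads[1])
    CUDA.unsafe_free!(xb)   # release the device copies once they are no longer needed
    CUDA.unsafe_free!(yb)
end
GC.gc(false)     # incremental GC pass to drop unreferenced arrays
CUDA.reclaim()   # return cached, unused device memory to the pool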

Minimal working example:

using Flux
using CUDA
using Pkg
using InteractiveUtils  # for versioninfo()

nchannels = 4
batchsize = 8

# Fake data
X = [rand(Float32, 512, 512, nchannels, batchsize) for _ in 1:500]
Y = [rand(Float32, 512, 512, nchannels, batchsize) for _ in 1:500]

# Classic conv
conv() = Conv((3,3), nchannels=>nchannels, pad=(1,1))

# Depthwise conv
dwconv() = DepthwiseConv((3,3), nchannels=>nchannels, pad=(1,1))

# Depthwise conv with anisotropic kernel
adwconv() = [
    DepthwiseConv((1,3), nchannels=>nchannels, pad=(0,1)),
    DepthwiseConv((3,1), nchannels=>nchannels, pad=(1,0))
]

# Parallel depthwise conv with anisotropic kernel; idea from this paper: https://arxiv.org/abs/2306.16103
padwconv() = Parallel(+,
    identity,
    DepthwiseConv((1,3), nchannels=>nchannels, pad=(0,1)),
    DepthwiseConv((3,1), nchannels=>nchannels, pad=(1,0))
)

# Check number of trainable parameters
npars(model) = sum(length, Flux.trainables(model))

# Fake training loop
function training_loop!(model, nepochs = 5)
    println("Number of trainable parameters: $(npars(model))")
    myloss(ŷ, y) = Flux.Losses.mse(ŷ, y)
    opt = Adam()
    opt_state = Flux.setup(opt, model)
    for epoch in 1:nepochs
        tepoch = time()
        trainmode!(model)
        for (x, y) in zip(X, Y)
            xb, yb = gpu(x), gpu(y)
            grads = Flux.gradient(m -> myloss(m(xb), yb), model)
            Flux.update!(opt_state, model, grads[1])
            # Try to do manual memory management
            #CUDA.unsafe_free!(xb)
            #CUDA.unsafe_free!(yb)
        end
        elapsed_time = time() - tepoch
        println("Epoch: $epoch. Elapsed time: $(elapsed_time / 60) minutes")
    end
end

# Number of layers 
nlayers = 30

println("EXPERIMENTS:")
println()

println("Training model with classic conv:")
m1 = Chain([conv() for _ in 1:nlayers]...) |> gpu
training_loop!(m1)
println()

println("Training model with dw conv:")
m2 = Chain([dwconv() for _ in 1:nlayers]...) |> gpu
training_loop!(m2)
println()

println("Training model with dw anisotropic conv:")
m3 = Chain(reduce(vcat, [adwconv() for _ in 1:nlayers])...) |> gpu  # flatten the (1,3)/(3,1) pairs into one chain
training_loop!(m3)
println()

println("Training model with parallel dw anisotropic conv:")
m4 = Chain([padwconv() for _ in 1:nlayers]...) |> gpu
training_loop!(m4)
println()

println("ENVIRONMENT:")
versioninfo() |> display
println()
CUDA.device(CUDA.context()) |> display
println()
Pkg.status() |> display

Output:

EXPERIMENTS:

Training model with classic conv:
Number of trainable parameters: 4440
Epoch: 1. Elapsed time: 1.7558403174082438 minutes
Epoch: 2. Elapsed time: 0.8149041493733724 minutes
Epoch: 3. Elapsed time: 0.8191895643870036 minutes
Epoch: 4. Elapsed time: 0.8240762670834859 minutes
Epoch: 5. Elapsed time: 0.8252299666404724 minutes

Training model with dw conv:
Number of trainable parameters: 1200
Epoch: 1. Elapsed time: 2.28381826877594 minutes
Epoch: 2. Elapsed time: 2.2780550479888917 minutes
Epoch: 3. Elapsed time: 2.2785532514254254 minutes
Epoch: 4. Elapsed time: 2.2790298024813334 minutes
Epoch: 5. Elapsed time: 2.279213802019755 minutes

Training model with dw anisotropic conv:
Number of trainable parameters: 960
Epoch: 1. Elapsed time: 3.666351866722107 minutes
Epoch: 2. Elapsed time: 2.933931132157644 minutes
Epoch: 3. Elapsed time: 2.933724248409271 minutes
Epoch: 4. Elapsed time: 2.9334154526392617 minutes
Epoch: 5. Elapsed time: 2.933711850643158 minutes

Training model with parallel dw anisotropic conv:
Number of trainable parameters: 960
Epoch: 1. Elapsed time: 3.8321314493815106 minutes
Epoch: 2. Elapsed time: 3.2672752340634665 minutes
Epoch: 3. Elapsed time: 3.26718141635259 minutes
Epoch: 4. Elapsed time: 3.2664114316304524 minutes
Epoch: 5. Elapsed time: 3.266859567165375 minutes
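
As a possible next step, one could inspect which GPU kernels a single forward pass actually launches, e.g. with CUDA.jl's integrated profiler (a sketch; output not included here):

x = CUDA.rand(Float32, 512, 512, nchannels, batchsize)
layer = DepthwiseConv((3,3), nchannels=>nchannels, pad=(1,1)) |> gpu
layer(x)                 # warm-up so compilation does not show up in the profile
CUDA.@profile layer(x)   # prints host/device time and the kernels that were launched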

Environment

ENVIRONMENT:
Julia Version 1.11.7
Commit f2b3dbda30a (2025-09-08 12:10 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × Intel(R) Xeon(R) W-2265 CPU @ 3.50GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, cascadelake)
Threads: 4 default, 0 interactive, 2 GC (on 24 virtual cores)
Environment:
  JULIA_NUM_THREADS = 4

CuDevice(0): NVIDIA RTX A5000

Status `~/.julia/environments/v1.11/Project.toml`
⌅ [052768ef] CUDA v5.8.5
  [3da002f7] ColorTypes v0.12.1
  [5ae59095] Colors v0.13.1
  [8f4d0f93] Conda v1.10.2
  [992eb4ea] CondaPkg v0.2.33
  [5789e2e9] FileIO v1.17.1
  [587475ba] Flux v0.16.5
  [7073ff75] IJulia v1.31.0
  [4e3cecfd] ImageShow v0.3.8
  [916415d5] Images v0.26.2
  [eb30cadb] MLDatasets v0.7.18
  [f1d291b0] MLUtils v0.4.8
  [5fb14364] OhMyREPL v0.5.31
  [d7d3b36b] ParameterSchedulers v0.4.3
  [91a5bcdd] Plots v1.41.1
  [92933f4c] ProgressMeter v1.11.0
  [02a925ec] cuDNN v1.4.5
Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use `status --outdated`
