Unrelated try-catch causes CUDA arrays to not be freed #52533

Open
@IanButterworth

Originally posted here: JuliaGPU/CUDA.jl#2197

Take a GPU training loop like this:

for epoch in 1:epochs
    for (x, y) in train_loader
        x = x |> gpu; y = y |> gpu
        gs, _ = gradient(model, x) do m, _x
            logitcrossentropy(m(_x), y)
        end
        state, model = Optimisers.update(state, model, gs)
    end
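    # this try-catch does nothing useful; its mere presence triggers the OOM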
    try
        true
    catch
    end
end

With the try-catch, the GPU runs out of memory very quickly; without the try-catch there is no issue.
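
For what it's worth, a quick way to check whether the device arrays are genuinely still rooted (rather than the memory pool just lagging behind) is to force a full collection at the end of each epoch and print CUDA.jl's memory status. This is only a diagnostic sketch; the helper name is mine and not part of the original report.

using CUDA

# Diagnostic sketch (helper name is made up): call at the end of each
# epoch to see whether device memory is actually being released.
function report_gpu_memory()
    GC.gc(true)            # full (non-incremental) garbage collection
    CUDA.reclaim()         # return freed pool blocks to the driver
    CUDA.memory_status()   # prints used / available device memory
end

If the phic slots really are rooting the arrays, usage should stay high across epochs even after the explicit collection; without the try-catch it should drop back down.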

Approximately quoting @gbaraldi from Slack:


try-catches introduce some phic nodes to store variables in case we error and still need their values.
(from the example above)

   store volatile {}* %value_phi61, {}** %phic, align 8
   store volatile {}* %value_phi62, {}** %phic1, align 16
   store volatile {}* %value_phi46, {}** %phic2, align 8
   store volatile {}* %value_phi47, {}** %phic3, align 16
   store volatile i64 %value_phi48, i64* %phic4, align 8
   store volatile i64 %value_phi49, i64* %phic5, align 8
   store volatile i8 0, i8* %phic6, align 1
   store volatile {}* null, {}** %phic7, align 8
   store volatile i8 0, i8* %phic8, align 1
   store volatile {}* %278, {}** %phic9, align 16
   store volatile {}* %267, {}** %phic10, align 8
   store volatile {}* inttoptr (i64 140366834286144 to {}*), {}** %phic11, align 16

I have the suspicion some of them are holding our CUDA arrays.
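
For reference, a stand-alone way to look for these volatile stores without the full Flux setup is something like the following; the function f is a made-up stand-in, and whether (and under what names) the stores show up will depend on the Julia version and optimization level.

using InteractiveUtils

# Made-up minimal shape: a heap-allocated value (`a`) that is live across
# an otherwise pointless try-catch, like the arrays in the loop above.
function f(n)
    a = zeros(n)       # stands in for the big device arrays
    try
        true
    catch
    end
    return sum(a)
end

# Look for `store volatile` into stack slots (the %phic* slots quoted above).
@code_llvm debuginfo=:none f(1024)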


This is especially nasty because the logging macros introduce try-catch blocks whenever the log message cannot be proven not to error, i.e. whenever it is more than a simple string literal.

So an @info call with string interpolation, as in the loop below, introduces the issue, while one without interpolation, like @info "completed epoch", does not.

for epoch in 1:epochs
    for (x, y) in train_loader
        x = x |> gpu; y = y |> gpu
        gs, _ = gradient(model, x) do m, _x
            logitcrossentropy(m(_x), y)
        end
        state, model = Optimisers.update(state, model, gs)
    end
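    # interpolating $epoch here is what makes @info wrap message construction in a try-catch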
    @info "completed epoch $epoch"
end
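
One way to see the difference (just a sketch; the exact expansion differs between Julia versions) is to macroexpand both forms and check whether a try block shows up:

# Sketch: compare the macro expansions of the two @info forms. Per the
# behaviour described above, the interpolated message gets wrapped in a
# try ... catch (so a failure while building the message can be reported),
# while a plain string literal should not, since it cannot throw.
interp = macroexpand(Main, :(@info "completed epoch $epoch"))
plain  = macroexpand(Main, :(@info "completed epoch"))

# Recursively look for a try expression anywhere in an expansion.
has_try(ex) = ex isa Expr && (ex.head === :try || any(has_try, ex.args))

println("interpolated form contains try: ", has_try(interp))
println("plain string form contains try:  ", has_try(plain))

On 1.9.4 I would expect the first to print true and the second false.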

However, @Keno's automatic try-catch elision on 1.11 might fix that?
Note that this is on 1.9.4; CUDA.jl has issues on Julia master, so I haven't been able to test this yet.
