Unrelated try-catch causes CUDA arrays to not be freed #52533

Open
@IanButterworth

Originally posted here: JuliaGPU/CUDA.jl#2197

Take a GPU training loop like this:

for epoch in 1:epochs
    for (x, y) in train_loader
        x = x |> gpu; y = y |> gpu
        gs, _ = gradient(model, x) do m, _x
            logitcrossentropy(m(_x), y)
        end
        state, model = Optimisers.update(state, model, gs)
    end
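    # this try-catch does nothing useful; its mere presence triggers the OOM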
    try
        true
    catch
    end
end

With the try-catch, the GPU runs out of memory very quickly; without the try-catch there is no issue.
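
For what it's worth, a quick way to check whether the device arrays are genuinely still rooted (rather than the memory pool just lagging behind) is to force a full collection at the end of each epoch and print CUDA.jl's memory status. This is only a diagnostic sketch; the helper name is mine and not part of the original report.

using CUDA

# Diagnostic sketch (helper name is made up): call at the end of each
# epoch to see whether device memory is actually being released.
function report_gpu_memory()
    GC.gc(true)            # full (non-incremental) garbage collection
    CUDA.reclaim()         # return freed pool blocks to the driver
    CUDA.memory_status()   # prints used / available device memory
end

If the phic slots really are rooting the arrays, usage should stay high across epochs even after the explicit collection; without the try-catch it should drop back down.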

Approximately quoting @gbaraldi from Slack:


try-catches introduce some phic nodes to store variables in case we error and still need their values.
(from the example above)

   store volatile {}* %value_phi61, {}** %phic, align 8
   store volatile {}* %value_phi62, {}** %phic1, align 16
   store volatile {}* %value_phi46, {}** %phic2, align 8
   store volatile {}* %value_phi47, {}** %phic3, align 16
   store volatile i64 %value_phi48, i64* %phic4, align 8
   store volatile i64 %value_phi49, i64* %phic5, align 8
   store volatile i8 0, i8* %phic6, align 1
   store volatile {}* null, {}** %phic7, align 8
   store volatile i8 0, i8* %phic8, align 1
   store volatile {}* %278, {}** %phic9, align 16
   store volatile {}* %267, {}** %phic10, align 8
   store volatile {}* inttoptr (i64 140366834286144 to {}*), {}** %phic11, align 16

I have the suspicion some of them are holding our CUDA arrays.
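
For reference, a stand-alone way to look for these volatile stores without the full Flux setup is something like the following; the function f is a made-up stand-in, and whether (and under what names) the stores show up will depend on the Julia version and optimization level.

using InteractiveUtils

# Made-up minimal shape: a heap-allocated value (`a`) that is live across
# an otherwise pointless try-catch, like the arrays in the loop above.
function f(n)
    a = zeros(n)       # stands in for the big device arrays
    try
        true
    catch
    end
    return sum(a)
end

# Look for `store volatile` into stack slots (the %phic* slots quoted above).
@code_llvm debuginfo=:none f(1024)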


This is especially nasty because the logging macros introduce try-catch blocks whenever the log message cannot be proven not to error, i.e. whenever it is more than a simple string literal.

So an @info call with string interpolation, as in the loop below, introduces the issue, while one without interpolation, like @info "completed epoch", does not.

for epoch in 1:epochs
    for (x, y) in train_loader
        x = x |> gpu; y = y |> gpu
        gs, _ = gradient(model, x) do m, _x
            logitcrossentropy(m(_x), y)
        end
        state, model = Optimisers.update(state, model, gs)
    end
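    # interpolating $epoch here is what makes @info wrap message construction in a try-catch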
    @info "completed epoch $epoch"
end
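
One way to see the difference (just a sketch; the exact expansion differs between Julia versions) is to macroexpand both forms and check whether a try block shows up:

# Sketch: compare the macro expansions of the two @info forms. Per the
# behaviour described above, the interpolated message gets wrapped in a
# try ... catch (so a failure while building the message can be reported),
# while a plain string literal should not, since it cannot throw.
interp = macroexpand(Main, :(@info "completed epoch $epoch"))
plain  = macroexpand(Main, :(@info "completed epoch"))

# Recursively look for a try expression anywhere in an expansion.
has_try(ex) = ex isa Expr && (ex.head === :try || any(has_try, ex.args))

println("interpolated form contains try: ", has_try(interp))
println("plain string form contains try:  ", has_try(plain))

On 1.9.4 I would expect the first to print true and the second false.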

However, @Keno's automatic try-catch elision on 1.11 might fix that?
Note that this is on 1.9.4; CUDA.jl has issues on Julia master, so I haven't been able to test this yet.
