
Add a hostcall interface #1140


Draft · maleadt wants to merge 5 commits into master from tb/hostcall

Conversation

maleadt (Member) commented Sep 9, 2021

Fixes #440

An initial, simple implementation. I still need to steal ideas from AMDGPU.jl and optimizations from #567, but the goal for now is a simple but correct implementation that we can use for unlikely code paths such as error reporting.

Demo:

julia> using CUDA

julia> function test(x)
         println("This is a hostcall from thread $x")
         x+1
       end
test (generic function with 1 method)

julia> function kernel()
         rv = hostcall(test, Int, Tuple{Int}, threadIdx().x)
         @cuprintln("Hostcall returned $rv")
         return
       end
kernel (generic function with 1 method)

julia> @cuda threads=2 kernel();
This is a hostcall from thread 1
Hostcall returned 2
This is a hostcall from thread 2
Hostcall returned 3
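
For context, the handshake behind a blocking hostcall is essentially a state flag plus an argument/return buffer in host-visible memory: the device writes its arguments, flips the flag, and spins; a host task polls the flag, runs the requested Julia function, and writes the result back. Below is a CPU-only sketch of that protocol, with the device side played by an ordinary task (it uses Julia 1.7's per-field @atomic); the names and layout are made up for illustration and are not the API added in this PR.

const EMPTY = 0        # no request pending
const READY = 1        # "device" wrote its argument, waiting for the host
const DONE  = 2        # host wrote the return value
const SHUTDOWN = -1    # tells the watcher to exit

mutable struct HostcallSlot
    @atomic state::Int
    arg::Int
    ret::Int
end

function host_watcher(slot::HostcallSlot, f)
    while true
        s = @atomic slot.state
        if s == READY
            slot.ret = f(slot.arg)       # run the Julia function on the host
            @atomic slot.state = DONE    # publish the result
        elseif s == SHUTDOWN
            return
        end
        yield()                          # let other tasks run while polling
    end
end

# Stands in for the device side; a real kernel would busy-wait on mapped host memory.
function device_side(slot::HostcallSlot, x)
    slot.arg = x
    @atomic slot.state = READY           # signal the host
    while (@atomic slot.state) != DONE
        yield()
    end
    ret = slot.ret
    @atomic slot.state = EMPTY
    return ret
end

slot = HostcallSlot(EMPTY, 0, 0)
watcher = Threads.@spawn host_watcher(slot, x -> x + 1)
@show device_side(slot, 41)              # device_side(slot, 41) = 42
@atomic slot.state = SHUTDOWN
wait(watcher)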

Depends on #1110.

Probably requires Base support like JuliaLang/julia#42302

cc @jpsamaroo

@maleadt added the cuda kernels (Stuff about writing CUDA kernels.) and enhancement (New feature or request) labels Sep 9, 2021
@maleadt marked this pull request as draft September 9, 2021 12:56

codecov bot commented Sep 9, 2021

Codecov Report

Merging #1140 (35026b9) into master (5b74388) will increase coverage by 8.97%.
The diff coverage is 86.07%.

❗ Current head 35026b9 differs from pull request most recent head 1fe2b4c. Consider uploading reports for the commit 1fe2b4c to get more accurate results.

@@            Coverage Diff             @@
##           master    #1140      +/-   ##
==========================================
+ Coverage   66.97%   75.94%   +8.97%     
==========================================
  Files         118      119       +1     
  Lines        7955     7737     -218     
==========================================
+ Hits         5328     5876     +548     
+ Misses       2627     1861     -766     
Impacted Files Coverage Δ
lib/cudadrv/types.jl 83.33% <0.00%> (-16.67%) ⬇️
src/CUDA.jl 100.00% <ø> (ø)
src/compiler/hostcall.jl 85.48% <85.48%> (ø)
src/compiler/execution.jl 84.61% <85.71%> (+0.54%) ⬆️
lib/cudadrv/execution.jl 100.00% <100.00%> (+3.44%) ⬆️
src/compiler/exceptions.jl 64.28% <100.00%> (-29.84%) ⬇️
src/compiler/gpucompiler.jl 82.14% <100.00%> (-1.73%) ⬇️
examples/wmma/low-level.jl 0.00% <0.00%> (-100.00%) ⬇️
examples/wmma/high-level.jl 0.00% <0.00%> (-100.00%) ⬇️
src/linalg.jl 36.36% <0.00%> (-50.01%) ⬇️
... and 72 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

maleadt (Member, Author) commented Sep 9, 2021

Hmm, one problem is that the following deadlocks:

# hostcall watcher task/thread
Threads.@spawn begin
    while true
        println(1)
        sleep(1)
    end
end

# the application, possibly getting stuck in a CUDA API call that needs the kernel to finish
while true
    ccall(:sleep, Cuint, (Cuint,), 1)
end

I had expected this when running with a single thread, because the main task isn't preemptible, but even with multiple threads the main task getting stuck apparently blocks the scheduler, keeping the hostcall watcher thread from making progress. That would cause a deadlock. @vchuravy any thoughts? How does AMDGPU.jl solve this?
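
One possible direction (not what this PR does): Base's @threadcall executes a ccall on a worker thread from libuv's pool and only suspends the calling task, so a blocking foreign call stops starving the scheduler. Whether CUDA's blocking API calls could be routed through something like that is exactly the open question, but for the toy example above it would look like this:

# Watcher as before.
Threads.@spawn begin
    while true
        println(1)
        sleep(1)
    end
end

# The "application": the foreign call now runs on a libuv worker thread,
# so the watcher task keeps making progress.
while true
    @threadcall(:sleep, Cuint, (Cuint,), 1)
end

(@threadcall comes with its own caveats, notably a small fixed worker pool and the requirement that the callee never calls back into Julia, so treat this as a sketch of the direction rather than a drop-in fix.)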

@maleadt force-pushed the tb/hostcall branch 2 times, most recently from 97b3ad8 to 5538c0b on September 10, 2021 12:57

maleadt (Member, Author) commented Sep 10, 2021

And some preliminary time measurements:

julia> kernel() = hostcall(identity, Nothing, Tuple{Nothing}, nothing)

julia> @benchmark CUDA.@sync @cuda threads=1024 blocks=10 kernel()
BenchmarkTools.Trial: 79 samples with 1 evaluation.
 Range (min … max):  23.918 ms … 103.041 ms  ┊ GC (min … max): 0.00% … 2.35%
 Time  (median):     82.768 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   64.525 ms ±  31.968 ms  ┊ GC (mean ± σ):  0.74% ± 1.92%

  █                                                             
  █▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▂▄▁▆▅▂▁▅▁▁▂▂▄ ▁
  23.9 ms         Histogram: frequency by time          101 ms <

So about 2.25 µs 'per' hostcall (uncontended, and effectively non-blocking since the call doesn't return anything). That's not great, but it's a start. I also don't want to build on this before I'm sure it won't deadlock applications.

And for reference, @cuprint and malloc (two calls that could be replaced by hostcall-based alternatives) are both an order of magnitude faster, but that's somewhat expected, as neither actually needs to communicate with the CPU: printf uses a ring buffer and is happy to trample over unprocessed output, while malloc uses a fixed-size, preallocated buffer as the source for a bump allocator. Still, in the uncontended case (which is essentially also a ring buffer) we should be able to do much better.
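
For comparison, the uncontended fast path of such a ring buffer needs no CPU round trip at all: a producer claims a slot with one atomic increment and writes into it, and whoever drains the buffer deals with it later. A rough CPU-side sketch of that pattern (illustrative names only; this is neither CUDA's printf machinery nor code from this PR):

# Fixed-size buffer where producers claim slots with a single atomic
# fetch-and-add; if the consumer lags behind, old entries are overwritten,
# which is the trade-off device-side printf makes.
struct RingBuffer{T}
    slots::Vector{T}
    head::Threads.Atomic{Int}
end
RingBuffer{T}(n::Integer) where {T} =
    RingBuffer{T}(Vector{T}(undef, n), Threads.Atomic{Int}(0))

function push_record!(rb::RingBuffer{T}, x::T) where {T}
    i = Threads.atomic_add!(rb.head, 1)          # claim a slot: one atomic op, no waiting
    rb.slots[mod1(i + 1, length(rb.slots))] = x  # may trample an unread entry
    return nothing
end

rb = RingBuffer{Int}(8)
Threads.@threads for i in 1:32
    push_record!(rb, i)
end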

@maleadt force-pushed the tb/hostcall branch 2 times, most recently from 1d35604 to bf93220 on September 15, 2021 16:21
@maleadt marked this pull request as ready for review September 15, 2021 16:22
Base automatically changed from tb/kernel_state to master September 17, 2021 12:53

vchuravy (Member) commented:
I had expected this when running with a single thread, because the main task isn't preemptible, but even with multiple threads the main task getting stuck apparently blocks the scheduler, keeping the hostcall watcher thread from making progress. That would cause a deadlock.

Are you sure you are blocking the scheduler, or are you blocking GC? You need at least a safepoint in the loop.
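
For reference, an explicit safepoint in the second loop would look like the following (illustrative only; GC.safepoint() lets a pending garbage collection proceed at that point, it does not yield to the task scheduler):

while true
    GC.safepoint()                     # allow GC to run between iterations
    ccall(:sleep, Cuint, (Cuint,), 1)  # still blocks this thread for the full second
end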

maleadt (Member, Author) commented Sep 30, 2021

You need at least a safepoint in the loop

In which loop? The first does a sleep, so that's a yield point. The second loop doesn't need to be a loop; it could just as well be an API call that blocks 'indefinitely'.

maleadt (Member, Author) commented Oct 4, 2021

Seems to deadlock regularly on CI, so I guess this will have to wait until we either have application threads, or a way to make CUDA's blocking API calls yield.
