
Add a hostcall interface #1140


Draft · maleadt wants to merge 5 commits into master from tb/hostcall

Conversation

maleadt (Member) commented Sep 9, 2021

Fixes #440

An initial, simple implementation. I still need to steal ideas from AMDGPU.jl and optimizations from #567, but the goal for now is a simple but correct implementation that we can use for unlikely code paths such as error reporting.

Demo:

julia> using CUDA

julia> function test(x)
         println("This is a hostcall from thread $x")
         x+1
       end
test (generic function with 1 method)

julia> function kernel()
         rv = hostcall(test, Int, Tuple{Int}, threadIdx().x)
         @cuprintln("Hostcall returned $rv")
         return
       end
kernel (generic function with 1 method)

julia> @cuda threads=2 kernel();
This is a hostcall from thread 1
Hostcall returned 2
This is a hostcall from thread 2
Hostcall returned 3
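
For context, the handshake behind a blocking hostcall is essentially a state flag plus an argument/return buffer in host-visible memory: the device writes its arguments, flips the flag, and spins; a host task polls the flag, runs the requested Julia function, and writes the result back. Below is a CPU-only sketch of that protocol, with the device side played by an ordinary task (it uses Julia 1.7's per-field @atomic); the names and layout are made up for illustration and are not the API added in this PR.

const EMPTY = 0        # no request pending
const READY = 1        # "device" wrote its argument, waiting for the host
const DONE  = 2        # host wrote the return value
const SHUTDOWN = -1    # tells the watcher to exit

mutable struct HostcallSlot
    @atomic state::Int
    arg::Int
    ret::Int
end

function host_watcher(slot::HostcallSlot, f)
    while true
        s = @atomic slot.state
        if s == READY
            slot.ret = f(slot.arg)       # run the Julia function on the host
            @atomic slot.state = DONE    # publish the result
        elseif s == SHUTDOWN
            return
        end
        yield()                          # let other tasks run while polling
    end
end

# Stands in for the device side; a real kernel would busy-wait on mapped host memory.
function device_side(slot::HostcallSlot, x)
    slot.arg = x
    @atomic slot.state = READY           # signal the host
    while (@atomic slot.state) != DONE
        yield()
    end
    ret = slot.ret
    @atomic slot.state = EMPTY
    return ret
end

slot = HostcallSlot(EMPTY, 0, 0)
watcher = Threads.@spawn host_watcher(slot, x -> x + 1)
@show device_side(slot, 41)              # device_side(slot, 41) = 42
@atomic slot.state = SHUTDOWN
wait(watcher)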

Depends on #1110.

Probably requires Base support like JuliaLang/julia#42302

cc @jpsamaroo

@maleadt added the cuda kernels (Stuff about writing CUDA kernels.) and enhancement (New feature or request) labels Sep 9, 2021
@maleadt marked this pull request as draft September 9, 2021 12:56

codecov bot commented Sep 9, 2021

Codecov Report

Merging #1140 (35026b9) into master (5b74388) will increase coverage by 8.97%.
The diff coverage is 86.07%.

❗ Current head 35026b9 differs from pull request most recent head 1fe2b4c. Consider uploading reports for the commit 1fe2b4c to get more accurate results.

@@            Coverage Diff             @@
##           master    #1140      +/-   ##
==========================================
+ Coverage   66.97%   75.94%   +8.97%     
==========================================
  Files         118      119       +1     
  Lines        7955     7737     -218     
==========================================
+ Hits         5328     5876     +548     
+ Misses       2627     1861     -766     
Impacted Files Coverage Δ
lib/cudadrv/types.jl 83.33% <0.00%> (-16.67%) ⬇️
src/CUDA.jl 100.00% <ø> (ø)
src/compiler/hostcall.jl 85.48% <85.48%> (ø)
src/compiler/execution.jl 84.61% <85.71%> (+0.54%) ⬆️
lib/cudadrv/execution.jl 100.00% <100.00%> (+3.44%) ⬆️
src/compiler/exceptions.jl 64.28% <100.00%> (-29.84%) ⬇️
src/compiler/gpucompiler.jl 82.14% <100.00%> (-1.73%) ⬇️
examples/wmma/low-level.jl 0.00% <0.00%> (-100.00%) ⬇️
examples/wmma/high-level.jl 0.00% <0.00%> (-100.00%) ⬇️
src/linalg.jl 36.36% <0.00%> (-50.01%) ⬇️
... and 72 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

maleadt (Member, Author) commented Sep 9, 2021

Hmm, one problem is that the following deadlocks:

# hostcall watcher task/thread
Threads.@spawn begin
    while true
        println(1)
        sleep(1)
    end
end

# the application, possibly getting stuck in a CUDA API call that needs the kernel to finish
while true
    ccall(:sleep, Cuint, (Cuint,), 1)
end

I had expected this when running with a single thread, because the main task isn't preemptible, but even with multiple threads the main task getting stuck apparently blocks the scheduler, keeping the hostcall watcher thread from making progress. That would cause a deadlock. @vchuravy any thoughts? How does AMDGPU.jl solve this?
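
One possible direction (not what this PR does): Base's @threadcall executes a ccall on a worker thread from libuv's pool and only suspends the calling task, so a blocking foreign call stops starving the scheduler. Whether CUDA's blocking API calls could be routed through something like that is exactly the open question, but for the toy example above it would look like this:

# Watcher as before.
Threads.@spawn begin
    while true
        println(1)
        sleep(1)
    end
end

# The "application": the foreign call now runs on a libuv worker thread,
# so the watcher task keeps making progress.
while true
    @threadcall(:sleep, Cuint, (Cuint,), 1)
end

(@threadcall comes with its own caveats, notably a small fixed worker pool and the requirement that the callee never calls back into Julia, so treat this as a sketch of the direction rather than a drop-in fix.)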

@maleadt force-pushed the tb/hostcall branch 2 times, most recently from 97b3ad8 to 5538c0b on September 10, 2021 12:57

maleadt (Member, Author) commented Sep 10, 2021

And some preliminary time measurements:

julia> kernel() = hostcall(identity, Nothing, Tuple{Nothing}, nothing)

julia> @benchmark CUDA.@sync @cuda threads=1024 blocks=10 kernel()
BenchmarkTools.Trial: 79 samples with 1 evaluation.
 Range (min … max):  23.918 ms … 103.041 ms  ┊ GC (min … max): 0.00% … 2.35%
 Time  (median):     82.768 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   64.525 ms ±  31.968 ms  ┊ GC (mean ± σ):  0.74% ± 1.92%

  █                                                             
  █▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▂▄▁▆▅▂▁▅▁▁▂▂▄ ▁
  23.9 ms         Histogram: frequency by time          101 ms <

So about 2.25 µs 'per' hostcall (uncontended, and effectively non-blocking since the call doesn't return anything). That's not great, but it's a start. I also don't want to build on this before I'm sure it won't deadlock applications.

And for reference, @cuprint and malloc (two calls that could be replaced by hostcall-based alternatives) are both an order of magnitude faster, but that's somewhat expected, as neither actually needs to communicate with the CPU: printf uses a ring buffer and is happy to trample over unprocessed output, while malloc uses a fixed-size, preallocated buffer as the source for a bump allocator. Still, in the uncontended case (which is essentially also a ring buffer) we should be able to do much better.
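
For comparison, the uncontended fast path of such a ring buffer needs no CPU round trip at all: a producer claims a slot with one atomic increment and writes into it, and whoever drains the buffer deals with it later. A rough CPU-side sketch of that pattern (illustrative names only; this is neither CUDA's printf machinery nor code from this PR):

# Fixed-size buffer where producers claim slots with a single atomic
# fetch-and-add; if the consumer lags behind, old entries are overwritten,
# which is the trade-off device-side printf makes.
struct RingBuffer{T}
    slots::Vector{T}
    head::Threads.Atomic{Int}
end
RingBuffer{T}(n::Integer) where {T} =
    RingBuffer{T}(Vector{T}(undef, n), Threads.Atomic{Int}(0))

function push_record!(rb::RingBuffer{T}, x::T) where {T}
    i = Threads.atomic_add!(rb.head, 1)          # claim a slot: one atomic op, no waiting
    rb.slots[mod1(i + 1, length(rb.slots))] = x  # may trample an unread entry
    return nothing
end

rb = RingBuffer{Int}(8)
Threads.@threads for i in 1:32
    push_record!(rb, i)
end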

@maleadt force-pushed the tb/hostcall branch 2 times, most recently from 1d35604 to bf93220 on September 15, 2021 16:21
@maleadt marked this pull request as ready for review September 15, 2021 16:22
Base automatically changed from tb/kernel_state to master September 17, 2021 12:53

vchuravy (Member) commented:
I had expected this when running with a single thread, because the main task isn't preemptible, but even with multiple threads the main task getting stuck apparently blocks the scheduler, keeping the hostcall watcher thread from making progress. That would cause a deadlock.

Are you sure you are blocking the scheduler, or are you blocking GC? You need at least a safepoint in the loop.
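
For reference, an explicit safepoint in the second loop would look like the following (illustrative only; GC.safepoint() lets a pending garbage collection proceed at that point, it does not yield to the task scheduler):

while true
    GC.safepoint()                     # allow GC to run between iterations
    ccall(:sleep, Cuint, (Cuint,), 1)  # still blocks this thread for the full second
end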

maleadt (Member, Author) commented Sep 30, 2021

You need at least a safepoint in the loop

In which loop? The first does a sleep, so that's a yield point. The second loop doesn't need to be a loop; it could just as well be an API call that blocks 'indefinitely'.

maleadt (Member, Author) commented Oct 4, 2021

Seems to deadlock regularly on CI, so I guess this will have to wait until we either have application threads, or a way to make CUDA's blocking API calls yield.
