RFC: Add a hook for detecting task switches. #39994

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open: maleadt wants to merge 2 commits into master from tb/task_switch_hook

Conversation

@maleadt (Member) commented on Mar 12, 2021

Certain libraries are configured using global or thread-local state
instead of passing handles to every function. CUDA, for example, has a
cudaSetDevice function that binds a device to the current thread for
all future API calls. This is at odds with Julia's task-based
concurrency, which presents an execution environment that's local to the
current task (e.g., in the case of CUDA, using a different device per task).

This PR adds a hook mechanism that can be used to detect task switches,
and synchronize Julia's task-local environment with the library's global
or thread-local state.

TODO/questions:

  • support for multiple callbacks?

Intended use:

function task_switch(to::Task)
    Core.println("Thread $(Threads.threadid()): switching from $(current_task()) to $to")
    return
end

cb = @cfunction(task_switch, Nothing, (Any,))
ccall(:jl_hook_task_switch, Nothing, (Ptr{Nothing},), cb)

FWIW, the overhead of calling a no-op hook is around 5ns (I compared against saving the function and doing a jl_call1, which took around 30ns).
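
Purely as an illustration of the pattern this enables (not code from this PR or from CUDA.jl), a library with thread-local state could register a hook along these lines; set_library_device and the :device task-local key are hypothetical names:

# Sketch: lazily synchronize a library's thread-local state with the
# switched-to task. `set_library_device` stands in for something like a
# cudaSetDevice wrapper; `:device` is a hypothetical task-local key.
function sync_library_state(to::Task)
    tls = to.storage  # task-local storage of the destination task (may be `nothing`)
    if tls !== nothing && haskey(tls, :device)
        set_library_device(tls[:device])  # only touch global state when the task opted in
    end
    return
end

ccall(:jl_hook_task_switch, Nothing, (Ptr{Nothing},),
      @cfunction(sync_library_state, Nothing, (Any,)))

Unrelated task switches then only pay for the hook call and a dictionary lookup, not for reconfiguring the library.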

@maleadt added the gpu (Affects running Julia on a GPU) label on Mar 12, 2021
@maleadt requested a review from vtjnash on March 12, 2021 09:03
@KristofferC (Member)

support for multiple callbacks?

In TimerOutputs.jl there's been some discussion about thread safety and the ability to time different tasks (KristofferC/TimerOutputs.jl#80). Perhaps something like this could be used for that, and in that case allowing multiple callbacks seems desirable.

@maleadt force-pushed the tb/task_switch_hook branch from 64df01d to 2a3b42b on March 12, 2021 15:48
@maleadt (Member, Author) commented on Mar 12, 2021

OK, I made it a list of hooks.

@maleadt force-pushed the tb/task_switch_hook branch from 2a3b42b to c82e339 on March 12, 2021 16:02
@maleadt force-pushed the tb/task_switch_hook branch from c82e339 to 0c2fb63 on March 12, 2021 16:11
@JeffBezanson (Member)

This seems quite intrusive to me. Would you rather have this, or a very-fast-access task-local pointer? And/or, we could make this hook task-specific rather than global. As it is, this could be called for millions of unrelated task switches.

@maleadt (Member, Author) commented on Mar 12, 2021

Yeah, it's pretty intrusive... The alternative (and current approach) seems even worse though: before every CUDA-related API call or operation, check whether the task-local state matches the global one. Furthermore, querying CUDA's global state takes tens of nanoseconds, which is too slow to perform before every API call, so we cache it in a thread-local buffer. All of that is what leads to the hot mess that is https://github.com/JuliaGPU/CUDA.jl/blob/master/src/state.jl.

With the task switch hook, at least we only pay that cost when switching tasks. And in that hook we can check whether the switched-to task has any CUDA state in its task-local storage, and only conditionally set up CUDA's global state.

a very-fast-access task-local pointer

I'd very much like that, but it would still require comparing the task-local state to the global CUDA state (or its per-thread cached counterpart) on every API call & operation, which is pretty expensive and fragile.

we could make this hook task-specific rather than global

I considered that, but we generally don't know which tasks are going to be performing GPU computations. A user can just fire up a task, import CUDA.jl, and perform API calls.

Maybe there's another solution; I've been staring at this approach for a while now, so feel free to suggest other ideas.
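
For contrast, a rough sketch of the per-call check described above, using the same hypothetical names (set_library_device, :device); this is the cost that the hook would restrict to task switches:

# Sketch: before every API call, compare the task's chosen device against a
# per-thread cache of the library's global state, and reconfigure only on a
# mismatch. Ignores thread-safety details for brevity.
const THREAD_DEVICE = Vector{Union{Nothing,Int}}(nothing, Threads.nthreads())

function prepare_api_call()
    dev = get(task_local_storage(), :device, 0)  # the task's device, defaulting to 0
    tid = Threads.threadid()
    if THREAD_DEVICE[tid] !== dev
        set_library_device(dev)  # e.g. a cudaSetDevice wrapper
        THREAD_DEVICE[tid] = dev
    end
    return dev
end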

@vtjnash (Member) commented on Mar 12, 2021

I don't think we'll promise to maintain a single-threaded, cooperative, non-migrating scheduler, so this seems like an infeasible approach. I agree we should improve the performance of task-local-storage access. Let me see if there are some easy things we can do to improve that anyway.

@@ -507,6 +518,17 @@ JL_DLLEXPORT void jl_switch(void)
jl_error("cannot switch to task running on another thread");
}

if (jl_task_switch_hooks) {
Inline review comment (Member):

Do we need a lock (or some atomics) here? If so, wouldn't it increase the overhead of the task switch even if there are no hooks?

@tkf (Member) commented on Mar 13, 2021

If CUDA.jl wants to manage global state coupled to OS threads, we need an ("unsafe") API for asking not to task switch/migrate within a given scope, don't we? But then, if we have such an API, I guess I'm missing what's necessary beyond STATES[threadid()]. Having said that, this PR sounds like it's halfway towards what Kotlin does for JVM threads, IIUC.

@maleadt (Member, Author) commented on Mar 15, 2021

@JeffBezanson

As it is, this could be called for millions of unrelated task switches.

Looking at some semi-realistic applications, the number of API calls is orders of magnitude larger than the number of task switches, so from a performance PoV it seems better to pay that cost when switching tasks rather than checking the state on every API call.

EDIT: concrete example, doing an AlphaZero.jl run:

Task switches:  1670895
CUDA API calls: 30137249

@vtjnash

I don't think we'll promise to maintain a single-thread cooperative non-migrating scheduler, so this seems like an unfeasible approach.

The hook could be extended to also inform about thread migration, so why is this infeasible?

@tkf

we need an ("unsafe") API asking to not task switch/migrate within a given scope

But I do want to switch tasks, since that's useful for overlapping computation on a GPU (by asynchronously submitting work from different tasks), or for working with multiple devices.

@tkf (Member) commented on Mar 26, 2021

IIUC, it sounds like you need two things. One is thread-local storage for interacting with external libraries, and the other is what I call context variables (#35833) for tracking asynchronous events across tasks. I don't know if that's enough for GPUs, but, in general, it'd be great if we could orthogonalize the API.

@jpsamaroo (Member)

Adding another use case that this PR could help with:

Dagger.jl estimates the cost of scheduled tasks by timing them with a form of time_ns(), and uses that information to inform future scheduling of tasks with similar function call signatures; this generally works well. However, one case where it doesn't work well is when multiple Dagger tasks are executing (and yielding) on the same thread of the same processor: we'd like to isolate the execution time of each task, but instead we only get an aggregate. Being able to start and stop timing at task switches would potentially alleviate this issue by making it possible to accumulate time on a per-task basis.
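
A minimal sketch of such per-task accounting built on this hook (illustrative only; a real implementation would need per-thread clocks and locking):

# Sketch: charge the wall-clock time since the last switch to the task being
# switched away from, then restart the clock for the destination task.
const TASK_TIME = IdDict{Task,UInt64}()   # accumulated nanoseconds per task
const LAST_SWITCH = Ref(time_ns())

function account_task_time(to::Task)
    now = time_ns()
    from = current_task()
    TASK_TIME[from] = get(TASK_TIME, from, UInt64(0)) + (now - LAST_SWITCH[])
    LAST_SWITCH[] = now  # `to` starts accumulating from here
    return
end

ccall(:jl_hook_task_switch, Nothing, (Ptr{Nothing},),
      @cfunction(account_task_time, Nothing, (Any,)))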

Labels: gpu (Affects running Julia on a GPU)
Projects: none yet
6 participants