RFC: Add a hook for detecting task switches. #39994

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open: maleadt wants to merge 2 commits into master from tb/task_switch_hook

Conversation

@maleadt (Member) commented on Mar 12, 2021

Certain libraries are configured using global or thread-local state
instead of passing handles to every function. CUDA, for example, has a
cudaSetDevice function that binds a device to the current thread for
all future API calls. This is at odds with Julia's task-based
concurrency, which presents an execution environment that's local to the
current task (e.g., in the case of CUDA, using a different device per task).

This PR adds a hook mechanism that can be used to detect task switches,
and synchronize Julia's task-local environment with the library's global
or thread-local state.

TODO/questions:

  • support for multiple callbacks?

Intended use:

function task_switch(to::Task)
    Core.println("Thread $(Threads.threadid()): switching from $(current_task()) to $to")
    return
end

cb = @cfunction(task_switch, Nothing, (Any,))
ccall(:jl_hook_task_switch, Nothing, (Ptr{Nothing},), cb)

FWIW, the overhead of calling a no-op hook is around 5ns (I compared against saving the function and doing a jl_call1, which took around 30ns).
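
Purely as an illustration of the pattern this enables (not code from this PR or from CUDA.jl), a library with thread-local state could register a hook along these lines; set_library_device and the :device task-local key are hypothetical names:

# Sketch: lazily synchronize a library's thread-local state with the
# switched-to task. `set_library_device` stands in for something like a
# cudaSetDevice wrapper; `:device` is a hypothetical task-local key.
function sync_library_state(to::Task)
    tls = to.storage  # task-local storage of the destination task (may be `nothing`)
    if tls !== nothing && haskey(tls, :device)
        set_library_device(tls[:device])  # only touch global state when the task opted in
    end
    return
end

ccall(:jl_hook_task_switch, Nothing, (Ptr{Nothing},),
      @cfunction(sync_library_state, Nothing, (Any,)))

Unrelated task switches then only pay for the hook call and a dictionary lookup, not for reconfiguring the library.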

@maleadt added the gpu (Affects running Julia on a GPU) label on Mar 12, 2021
@maleadt requested a review from vtjnash on March 12, 2021 09:03
@KristofferC (Member)

support for multiple callbacks?

In TimerOutputs.jl there's been some discussion about thread safety and the ability to time different tasks (KristofferC/TimerOutputs.jl#80). Perhaps something like this could be used for that, and in that case allowing multiple callbacks seems desirable.

@maleadt force-pushed the tb/task_switch_hook branch from 64df01d to 2a3b42b on March 12, 2021 15:48
@maleadt (Member, Author) commented on Mar 12, 2021

OK, I made it a list of hooks.

@maleadt force-pushed the tb/task_switch_hook branch from 2a3b42b to c82e339 on March 12, 2021 16:02
@maleadt force-pushed the tb/task_switch_hook branch from c82e339 to 0c2fb63 on March 12, 2021 16:11
@JeffBezanson (Member)

This seems quite intrusive to me. Would you rather have this, or a very-fast-access task-local pointer? And/or, we could make this hook task-specific rather than global. As it is, this could be called for millions of unrelated task switches.

@maleadt (Member, Author) commented on Mar 12, 2021

Yeah, it's pretty intrusive... The alternative (and current approach) seems even worse though: before every CUDA-related API call or operation, check whether the task-local state matches the global one. Furthermore, querying CUDA's global state takes tens of nanoseconds, which is too slow to perform before every API call, so we cache it in a thread-local buffer. All of that is what leads to the hot mess that is https://github.com/JuliaGPU/CUDA.jl/blob/master/src/state.jl.

With the task switch hook, at least we only pay that cost when switching tasks. And in that hook we can check whether the switched-to task has any CUDA state in its task-local storage, and only conditionally set up CUDA's global state.

a very-fast-access task-local pointer

I'd very much like that, but it would still require comparing the task-local state to the global CUDA state (or its per-thread cached counterpart) on every API call & operation, which is pretty expensive and fragile.

we could make this hook task-specific rather than global

I considered that, but we generally don't know which tasks are going to be performing GPU computations. A user can just fire up a task, import CUDA.jl, and perform API calls.

Maybe there's another solution; I've been staring at this approach for a while now, so feel free to suggest other ideas.
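
For contrast, a rough sketch of the per-call check described above, using the same hypothetical names (set_library_device, :device); this is the cost that the hook would restrict to task switches:

# Sketch: before every API call, compare the task's chosen device against a
# per-thread cache of the library's global state, and reconfigure only on a
# mismatch. Ignores thread-safety details for brevity.
const THREAD_DEVICE = Vector{Union{Nothing,Int}}(nothing, Threads.nthreads())

function prepare_api_call()
    dev = get(task_local_storage(), :device, 0)  # the task's device, defaulting to 0
    tid = Threads.threadid()
    if THREAD_DEVICE[tid] !== dev
        set_library_device(dev)  # e.g. a cudaSetDevice wrapper
        THREAD_DEVICE[tid] = dev
    end
    return dev
end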

@vtjnash (Member) commented on Mar 12, 2021

I don't think we'll promise to maintain a single-threaded, cooperative, non-migrating scheduler, so this seems like an infeasible approach. I agree we should improve the performance of task-local-storage access. Let me see if there are some easy things we can do to improve that anyway.

@@ -507,6 +518,17 @@ JL_DLLEXPORT void jl_switch(void)
jl_error("cannot switch to task running on another thread");
}

if (jl_task_switch_hooks) {
Inline review comment (Member):

Do we need a lock (or some atomics) here? If so, wouldn't it increase the overhead of the task switch even if there are no hooks?

@tkf (Member) commented on Mar 13, 2021

If CUDA.jl wants to manage global state coupled to OS threads, we need an ("unsafe") API for asking not to task switch/migrate within a given scope, don't we? But then, if we have such an API, I guess I'm missing what's necessary beyond STATES[threadid()]. Having said that, this PR sounds like it's halfway towards what Kotlin does for JVM threads, IIUC.

@maleadt (Member, Author) commented on Mar 15, 2021

@JeffBezanson

As it is, this could be called for millions of unrelated task switches.

Looking at some semi-realistic applications, the number of API calls is orders of magnitude larger than the number of task switches, so from a performance PoV it seems better to pay that cost when switching tasks rather than checking the state on every API call.

EDIT: concrete example, doing an AlphaZero.jl run:

Task switches:  1670895
CUDA API calls: 30137249

@vtjnash

I don't think we'll promise to maintain a single-thread cooperative non-migrating scheduler, so this seems like an unfeasible approach.

The hook could be extended to also inform about thread migration, so why is this infeasible?

@tkf

we need an ("unsafe") API asking to not task switch/migrate within a given scope

But I do want to switch tasks, since that's useful for overlapping computation on a GPU (by asynchronously submitting work from different tasks), or for working with multiple devices.

@tkf (Member) commented on Mar 26, 2021

IIUC, it sounds like you need two things. One is thread-local storage for interacting with external libraries, and the other is what I call context variables (#35833) for tracking asynchronous events across tasks. I don't know if that's enough for GPUs, but, in general, it'd be great if we could orthogonalize the API.

@jpsamaroo (Member)

Adding another use case that this PR could help with:

Dagger.jl estimates the cost of scheduled tasks by timing them with a form of time_ns(), and uses that information to inform future scheduling of tasks with similar function call signatures; this generally works well. However, one case where it doesn't work well is when multiple Dagger tasks are executing (and yielding) on the same thread of the same processor: we'd like to isolate the execution time of each task, but instead we only get an aggregate. Being able to start and stop timing at task switches would potentially alleviate this issue by making it possible to accumulate time on a per-task basis.
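
A minimal sketch of such per-task accounting built on this hook (illustrative only; a real implementation would need per-thread clocks and locking):

# Sketch: charge the wall-clock time since the last switch to the task being
# switched away from, then restart the clock for the destination task.
const TASK_TIME = IdDict{Task,UInt64}()   # accumulated nanoseconds per task
const LAST_SWITCH = Ref(time_ns())

function account_task_time(to::Task)
    now = time_ns()
    from = current_task()
    TASK_TIME[from] = get(TASK_TIME, from, UInt64(0)) + (now - LAST_SWITCH[])
    LAST_SWITCH[] = now  # `to` starts accumulating from here
    return
end

ccall(:jl_hook_task_switch, Nothing, (Ptr{Nothing},),
      @cfunction(account_task_time, Nothing, (Any,)))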

Labels: gpu (Affects running Julia on a GPU)
Projects: none yet
6 participants