Description
For certain low-level tasks, it is necessary to call a function exactly once on each hardware thread, sometimes even concurrently. For example, I might want to check that the hardware threads' CPU bindings are correctly set up by calling the respective hwloc
function, or I might want to initialize PAPI
on each thread.
(Why do I suspect problems with CPU bindings? Because I used both OpenMP and Qthreads in an application, and I didn't realize that both set the CPU binding for the main thread, but they do it differently, leading to conflicts and 50% performance loss even if OpenMP is not used, and the OpenMP threads are all sleeping on the OS. These kinds of issues are more easily debugged if one has access to certain low-level primitives in Qthreads.)
I have currently a mechanism to work around this, by starting many threads that busy-loop for a certain amount of time, and this often succeeds. However, a direct implementation and an official API would be convenient.