Description
As part of transitioning from the deprecated tbb::task API to tbb::task_group, I have been measuring the performance of our applications. I have found that using a single tbb::task_group gives greatly diminished thread scaling. To illustrate the problem, I created four highly simplified versions of the main processing loop of our applications. The code for these simple applications can be found here: https://github.com/Dr15Jones/tbb_group_scaling. Each application does the same processing but uses TBB in a different way. The differences are:
- using tbb::tasks directly, all created with allocate_root (this is how our application typically works).
- using 1 tbb::task_group to launch all the needed work.
- using N tbb::task_groups, with one task_group per requested thread.
- using tbb::tasks directly but created with allocate_additional_child_of (this variant was written after studying the performance of the other three cases).
When testing on either an Intel or an AMD CPU, the single tbb::task_group version either did not scale at all as the number of threads increased or scaled far worse than the other options. The tbb::task version using allocate_additional_child_of had the best performance, followed closely by the N tbb::task_groups case.
My question is: are there plans to improve the performance when using a single tbb::task_group? If not, is the use of multiple tbb::task_groups working together to share the load of creating tasks a supported use case? Alternatively, could a new API for creating a performant hierarchy of task_groups be developed, in order to avoid doing a 'spin' loop over the task_group::wait calls?