Turn task_queue into singly-linked list #3125
base: master
Conversation
Pull request overview
This PR converts the task queue from a circular_buffer<task*> to a singly-linked list (task_slist) to address exception safety, memory efficiency, and cache locality concerns. The refactoring makes task scheduling operations truly noexcept by eliminating the need for dynamic buffer allocation, reduces memory overhead by embedding the list structure within task objects, and improves cache performance by only touching task cache lines during queue traversal.
Key changes:
- Introduced `task_slist`, a singly-linked list implementation that stores next pointers in the task objects themselves
- Modified the `task` class to use a dual-purpose field `_scheduling_group_id_or_next_task` that encodes either the scheduling group ID (when not queued) or the next task pointer (when queued), using bit 0 as a discriminator
- Updated the shuffle functionality to work with the new list-based queue structure
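For illustration, here is a minimal sketch of that bit-0 encoding, with hypothetical names (the actual field and helpers live in `include/seastar/core/task.hh` and may differ): task objects are at least 2-byte aligned, so bit 0 of a real pointer is always clear and can mark the "not queued, holds a scheduling group id" state.

```cpp
// Illustrative sketch only -- names and exact layout are assumptions, not the PR's code.
#include <cassert>
#include <cstdint>

struct task;

struct sg_or_next {
    uintptr_t _v;  // either (sg_id << 1) | 1, or a task* with bit 0 clear

    void set_group(unsigned sg_id) noexcept {
        _v = (uintptr_t(sg_id) << 1) | 1;        // bit 0 set => scheduling group id
    }
    void set_next(task* t) noexcept {
        auto p = reinterpret_cast<uintptr_t>(t);
        assert((p & 1) == 0);                    // task alignment keeps bit 0 clear
        _v = p;                                  // bit 0 clear => next-task pointer
    }
    bool queued() const noexcept { return (_v & 1) == 0; }
    unsigned group() const noexcept { assert(!queued()); return unsigned(_v >> 1); }
    task* next() const noexcept { assert(queued()); return reinterpret_cast<task*>(_v); }
};
```

With such an encoding a task that is not queued still knows its scheduling group, and sizeof(task) does not grow.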
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `include/seastar/core/task.hh` | Adds the `task_slist` class and modifies `task` to store the scheduling group/next pointer in a single field using bit manipulation |
| `include/seastar/core/reactor.hh` | Changes task queue storage from `circular_buffer<task*>` to `task_slist` |
| `include/seastar/core/coroutine.hh` | Updates `set_scheduling_group` to delegate to the base class method instead of directly accessing the field |
| `src/core/reactor.cc` | Updates queue operations to use the `task_slist` API and refactors the shuffle logic to work with the new structure |
| `tests/unit/task_queue_test.cc` | Adds a comprehensive fuzz test for `task_slist` operations (push_back, push_front, pop_front) |
| `tests/unit/CMakeLists.txt` | Registers the new task queue test |
Force-pushed from abf69d6 to cf45e72.
upd:
Nice. Just found this today after hitting a large allocation warning on the task queue.
@dotnwat, how many bytes was it? And once it happened, was there a "too long queue accumulated" message in the logs, or were the tasks short enough to drain within a single task quota?
I'm sure there's a hard-to-prove performance regression in there. We execute hundreds of thousands of tasks per second. With a circular buffer, if the tasks are small (like converting from one future type to another) and the CPU can guess the task's vptr (possible if many similar task-sets are in the queue), it can execute several tasks in parallel. With a linked list, there's a memory dependency, and the next task is likely not to sit in cache due to FIFO execution.
I acknowledge the problems with circular_buffer. But removing the allocation doesn't fix anything. Coroutine frames and continuations must still be allocated, and there is no reasonable way to deal with an allocation failure.
We can do something like chunked_fifo (but optimize it a little, so it aligns itself to a power of two). Another approach is chunked-fifo-like, but laying out tasks instead of task pointers. It requires that we schedule tasks either by sending a final type (most eventually end up final) or via a base pointer, and adding a size() member.
This doesn't work; often the tasks are embedded in immovable objects like coroutine frames.
Some more light shed on this effect, or "further reading" links, would be very welcome here :)
I read https://people.csail.mit.edu/delimitrou/papers/2024.asplos.memory.pdf some time ago and thought it was an interesting read. Dunno if it helps for this specific discussion though.
Here's an article showing some of the points: https://douglasrumbaugh.com/post/list-secrets/ Append() corresponds to schedule(), which I expect to run equally fast, because the end() of the list will be in cache. Average() corresponds to run_some_tasks(), which I expect to be slower for the list due to memory latency fetching the next pointer. It's a little more complicated: the CPU will likely learn that ->next is followed and will prefetch it ahead of time. But if the task is short, prefetching won't get it in time. If the task is short (in instruction count) but long (due to cache misses), we can have instructions from multiple tasks executed at the same time in an out-of-order CPU. I expect to have data cache misses, again due to FIFO. FIFO is bad in that it temporally separates tasks operating on the same data, giving them time to drop out of cache. That's why we have schedule_urgent_task().
I would say it doesn't really apply here. The malloc overhead from the article is reversed here: the task is allocated anyway, and to enqueue it the array might need an amortized allocation, while the list needs none at all. Next, the article compares two access patterns: array vs the very same list. Respectively, run_some_tasks() will fetch the
I was interested more about this:
What is it?
Note that
Why can't we throw a normal bad_alloc exception when we can't allocate a coroutine frame?
Context switch perf test results
Array, while smaller in overall size than a list, I assume (no need for pointers), doesn't solve the above issue, no?
Currently we have an array of pointers to tasks. Both the tasks and the array itself need to be allocated. After this PR only tasks are allocated; the array is gone on its own. And there's no change in sizeof(task) here, the "next" pointer is unioned with a pre-existing member. So the memory footprint with this PR is reduced.
Interesting. Run-queue processing times (#3134)
LIST takes fewer instructions, but more cycles (and more time)
☝️
This is a good point. It's not a separate list like std::forward_list or our chunked_fifo - it's an intrusive list, stored inside the tasks. |
The unsafe thing here is when the union contains the next pointer instead of the sched group id. In this case there's no way to get the sched group at that time, and std::move-ing the task is fatal. The former is, well, yes, no excuse. But the latter is as safe as it is today -- if one std::moves a task while there's a pointer to it from the circular-buffer run-queue, it's just as fatal.
I'm not sure about how many bytes; I just noticed this in the logs: Is there something else to look for to answer the bytes question?
I just noticed this in the logs:
The allocation we saw was around 220K. I don't think that this particular instance of the log message was associated with a large allocation--we've been hitting this and the large allocation recently and not always at the same time--so 1207 tasks might not be consistent with an allocation that large. |
Ah, 1.2k tasks. OK, that's ~4k bytes for the buffer with task* pointers. And the runqueue allocation for 220k should be ~60k tasks 🙀, but if they are short, they can fit into the task quota
If shuffling the tasks in #3134 (once when allocating, "wakeups" produce the same sequence all the time)
For future reference. With
Disassembled release-mode binary: ARRAY vs LIST
Optimized-out the

```cpp
task* task_slist::snap() noexcept {
    auto ret = _first;
    _first = nullptr;
    _last_p = &_first;
    _size = 0;
    return ret;
}

bool reactor::task_queue::run_tasks() {
    // Make sure new tasks will inherit our scheduling group
    auto current_sg = scheduling_group(_id);
    task* n = _q.snap();
    while (n != nullptr) {
        auto tsk = n;
        n = tsk->_next;
        tsk->_scheduling_group_id = task::disguise_sched_group(current_sg);
        tsk->run_and_dispose();
    }
    return !_q.empty();
}
```

decoded run_tasks
To emphasize the ARRAY vs LIST-opt cases: main loop in assembly (ARRAY / LIST-opt) and perf test
Updated the ARRAY case to also modify task->_sg (as if it was LIST updating the task's sg/next union):

```cpp
bool reactor::task_queue::run_tasks() {
    auto current = scheduling_group(_id);
    while (!_q.empty()) {
        auto tsk = _q.front();
        _q.pop_front();
        tsk->_sg = current;
        tsk->run_and_dispose();
    }
    return !_q.empty();
}
```
In the LIST case, de-unioned next and sched-group-id on task to drop updating the sched group in the loop:

```cpp
bool reactor::task_queue::run_tasks() {
    task* n = _q.snap();
    while (n != nullptr) {
        auto tsk = n;
        n = tsk->_next;
        tsk->run_and_dispose();
    }
    return !_q.empty();
}
```
https://gist.github.com/xemul/d5f83adac66d34b70fa4fef49a861d96
☝️ a perf test that allocates 10k dispersed integers and sums those values by dereferencing each via an array of pointers or an intrusive list:
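For readers who don't open the gist, here is a rough stand-in for that kind of test (a sketch; it assumes "dispersed" simply means traversal order differs from allocation order, which the real gist may arrange differently):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

struct node {
    long value;
    node* next = nullptr;
};

int main() {
    constexpr int N = 10000;
    std::vector<node*> ptrs;
    ptrs.reserve(N);
    for (int i = 0; i < N; i++) {
        ptrs.push_back(new node{i});
    }
    // Shuffle so that traversal order does not match allocation order.
    std::shuffle(ptrs.begin(), ptrs.end(), std::mt19937{42});

    // Build the intrusive list in the same (shuffled) order.
    node* head = nullptr;
    node** tail = &head;
    for (node* n : ptrs) {
        *tail = n;
        tail = &n->next;
    }

    auto time = [](const char* name, auto fn) {
        auto t0 = std::chrono::steady_clock::now();
        long s = fn();
        auto t1 = std::chrono::steady_clock::now();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        std::printf("%s: sum=%ld, %lld ns\n", name, s, (long long)ns);
    };

    // Array of pointers: the address of element i+1 is known without loading
    // element i, so the CPU can issue the loads ahead of time.
    time("array", [&] { long s = 0; for (node* n : ptrs) s += n->value; return s; });

    // Intrusive list: every ->next load depends on the previous element.
    time("list", [&] { long s = 0; for (node* n = head; n; n = n->next) s += n->value; return s; });

    for (node* n : ptrs) delete n;
}
```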
Split lists help a lot
Is "split list" another name for chunked_fifo? This looks like another opportunity to refer to https://github.com/avikivity/seastar/commits/sort-tasks/.
No, it's not chunked fifo, it's literally orthogonal. Here's how elements are stored in
This is
And this is
Currently the task queue is
and using chunked fifo would make it
Using chunked fifo will still require potentially growing the pointer storage on wakeup, but, of course, in smaller chunks. The split-list is ~=
it uses a fixed-size array and doesn't need to allocate memory on wake-up. And also note the interleaving -- when scanning elements in the displayed order, it scans two parallel tracks "in parallel", as if they were pointers in a plain array (this requires consuming the "next" pointer, so that the Nth wave of the scan still references pointers from the root-level array, but task runqueue scanning does it).
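A hedged sketch of the split-list shape described here (illustrative only, not the code measured above): F independent intrusive chains populated round-robin on push and drained round-robin on pop, so a forward scan has F independent next pointers the CPU can chase concurrently.

```cpp
// Sketch under assumed names; the real split list may differ.
#include <array>
#include <cstddef>

struct item {
    item* next = nullptr;
};

template <size_t F>
class split_list {
    std::array<item*, F> _first{};   // F "sharded" first pointers
    std::array<item**, F> _last;     // F cached last pointers
    size_t _push_idx = 0;
    size_t _pop_idx = 0;
public:
    split_list() noexcept {
        for (size_t i = 0; i < F; i++) {
            _last[i] = &_first[i];
        }
    }
    // Append never allocates: link the item into the current track and
    // advance to the next track round-robin.
    void push_back(item* it) noexcept {
        *_last[_push_idx] = it;
        _last[_push_idx] = &it->next;
        _push_idx = (_push_idx + 1) % F;
    }
    // Pop round-robins over the tracks in the same order, preserving FIFO;
    // the heads of the other tracks are independent loads that can be
    // prefetched while the current one is being consumed.
    item* pop_front() noexcept {
        item* it = _first[_pop_idx];
        if (it != nullptr) {
            _first[_pop_idx] = it->next;
            if (_first[_pop_idx] == nullptr) {
                _last[_pop_idx] = &_first[_pop_idx];
            }
            it->next = nullptr;
            _pop_idx = (_pop_idx + 1) % F;
        }
        return it;
    }
    bool empty() const noexcept {
        for (auto p : _first) {
            if (p != nullptr) {
                return false;
            }
        }
        return true;
    }
};
```

With F=1 this degenerates into the plain singly-linked FIFO; larger F trades a slightly bigger queue header for scans the prefetcher can overlap.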
But what's the problem? Isn't the goal to eliminate large allocations? chunked_fifo does this. We can't completely eliminate allocations (the tasks themselves must be allocated), so we don't win from removing allocations if they are well amortized.
Seems complicated.
The goal was to check if it's possible to eliminate wakeup allocations at all
There are two paths (sometimes they immediately follow one another, but not always) -- allocating a task and waking the task up. Currently we allocate on both, but allocating on the latter is not necessary. AFAIU, if there's a chain of yet-unresolved futures at hand, the task is allocated as part of continuation_base. When later one does promise::set_value(), the task is woken up. Not having (even amortized) allocations in set_value() is the goal.
Why do we allocate on wakeup? Usually the queue is not full. |
Well, a little bit, but at the end of the day it's pretty compact code: #3138
Because
Add a test that populates the run-queue with 10, 1k and 1M no-op tasks and measures the time it takes for the scheduler to process one. No allocations/frees happen at measure time; tasks are pre-populated. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two coroutine_traits_base::promise_task templates; both inherit from task, but one sets task::_sg by hand while the other uses task's protected implementation. Make both use it. This allows moving task::_sg into the private section, protecting it reliably from future changes. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Rename the task::_sg member into a uintptr and keep the scheduling group index in there. The next patch will keep a task* pointer unioned with it, so, to tell which value is in there, disguise the sched group ID by shifting it left one bit and setting the zeroth one. Also add a static assertion that when a task pointer is put there, this bit will not be set. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The split list is a collection of objects that's optimized for scanning from begin to end, with the ability to add entries from two ends -- front and back -- like std::deque does. Splitting is needed to let the CPU prefetch the next element while scanning the current one. With classical linked lists this is difficult, because in order to prefetch the next element the CPU needs to read the next pointer from the current element, which is itself still being read. Thus the forward scan of a plain list is serialized. With split lists several adjacent elements can be prefetched in parallel. To achieve that, the "first" pointer is "sharded" -- there are F first pointers (as well as F cached last pointers) and the list is populated in a round-robin manner -- first the 0th list is appended, then the 1st, then the 2nd, ... then the last one, and population wraps around and starts over. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The element (class task) shares the "next" pointer with the sched group index, thus living in two states -- queued and inactive. When created it is in the latter state and carries the sched group index on board. When add_task is called, the task becomes queued and the sched group index is thus lost. Later, when the task is picked for execution, the task_queue restores this index into "current", naturally. Probably the restoration is not needed, but it's better to be on the safe side. task::set_scheduling_group() is no longer callable for queued tasks, but the method is protected and is only used in two places. The first is coroutine switch-to, which happens on a running (i.e. not queued) task. The second is shared_future tweaking an unresolved shared_state (i.e. inactive too). Debug-mode task-queue shuffling in constant time is no longer possible and is temporarily removed; the next patch will restore it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently shuffling is performed by swapping a newly-activated task with some random queued one. The previous patch turned the task queue into a singly-linked list, and picking a random task from it in constant time is impossible. The new shuffling works at run time -- when a queue picks the next front task to execute, it rolls the dice and optionally re-queues it at the back of the queue. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
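To make the last point concrete, a tiny sketch of what "rolls the dice and optionally re-queues" could look like (the probability, names and queue interface are assumptions, not the actual debug-mode code):

```cpp
#include <random>

// Assumed queue interface: pop_front() returns a task pointer (or nullptr),
// push_back() re-links an already-allocated task; both noexcept.
template <typename Queue>
auto pick_next_task_shuffled(Queue& q, std::mt19937& rng) {
    std::bernoulli_distribution requeue(0.1);   // probability chosen arbitrarily here
    auto t = q.pop_front();
    if (t != nullptr && requeue(rng) && !q.empty()) {
        q.push_back(t);     // send the would-be-next task to the back...
        t = q.pop_front();  // ...and execute the new front instead
    }
    return t;
}
```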
Force-pushed from cf45e72 to a5455c6.
upd:
Overhead is ~57% on 1k tasks, ~17% on 1M, and ~8.5% on 10
☝️ with
☝️
The above is on a Threadripper processor; will measure on i7i's Xeon
i7i.2xlarge, Intel Xeon Platinum 8559C, clang-20.1.2,
Currently the task queue is a circular_buffer with pointers to task-s, and it has several concerns.
First, add_task() and schedule() are not exception-safe, and consequently neither is promise::set_value(). They are marked noexcept, but the circular buffer may need to grow, and that allocation may throw.
Next, even if the circular buffer manages to grow to its de-facto maximum size, that size can be huge. We've seen task queues grow to tens of thousands of entries, occupying contiguous memory of the corresponding size.
Finally, walking the run-list touches more cache lines than just the tasks'. And, as per the above, the queue buffer can span quite a few of those.
This PR converts the task queue into a singly-linked list of tasks, solving all the above difficulties.
Adding a task becomes truly noexcept.
No extra memory is needed for the queue, and sizeof(task) isn't changed either.
Walking the run-queue only touches tasks' cache lines.
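To illustrate why these properties follow, here is a minimal sketch of an intrusive FIFO of this shape (illustrative names; the real task_slist in this PR may differ in details such as size accounting and the unioned sched-group field):

```cpp
// Sketch: the queue is two words; enqueue writes one pointer inside the task
// itself and one inside the queue, so it cannot allocate and cannot throw.
struct task_like {
    task_like* _next = nullptr;   // in the real task this is unioned with the sched-group id
};

class intrusive_fifo {
    task_like* _first = nullptr;
    task_like** _last_p = &_first;   // points at the pointer to update on append
public:
    void push_back(task_like* t) noexcept {
        *_last_p = t;
        _last_p = &t->_next;
    }
    task_like* pop_front() noexcept {
        task_like* t = _first;
        if (t != nullptr) {
            _first = t->_next;
            if (_first == nullptr) {
                _last_p = &_first;   // queue became empty, reset the tail
            }
            t->_next = nullptr;
        }
        return t;
    }
    bool empty() const noexcept { return _first == nullptr; }
};
```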
refs: #84, #254