Conversation

@xemul
Contributor

@xemul xemul commented Nov 28, 2025

Currently the task queue is a circular_buffer with pointers to task-s, and it raises several concerns.

First, add_task() and schedule() are not exception-safe, and consequently neither is promise::set_value(). They are marked noexcept, but the circular buffer may need to grow, and that allocation may throw.

Next, even once the circular buffer grows to its de-facto maximum size, that size can be huge. We've seen task queues grow to tens of thousands of entries, occupying a contiguous allocation of corresponding size.

Finally, walking the run-list touches more cache lines than just the tasks' own. And as per above -- the queue buffer can span several of those.

This PR converts the task queue into a singly-linked list of tasks, solving all the above difficulties.

  • Adding a task becomes truly noexcept.
  • No extra memory is needed for the queue; sizeof(task) isn't changed either.
  • Walking the run-queue only touches the tasks' cache lines.

refs: #84, #254
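[Editor's note] For readers unfamiliar with the approach, here is a minimal illustrative sketch of an intrusive singly-linked FIFO (hypothetical names, not the PR's actual task_slist): the "next" pointer lives inside the element, so enqueueing only rewires memory that already exists and can be noexcept.

// Illustrative sketch only -- not the PR's task_slist.
// The "next" pointer lives inside the element, so push_back()
// never allocates and can be marked noexcept.
struct node {
    node* next = nullptr;
};

class intrusive_fifo {
    node* _first = nullptr;
    node** _last_p = &_first;    // points at the last "next" slot
public:
    bool empty() const noexcept { return _first == nullptr; }

    void push_back(node* n) noexcept {
        n->next = nullptr;
        *_last_p = n;            // link after the current tail
        _last_p = &n->next;      // the tail slot is now n's next pointer
    }

    // precondition: !empty()
    node* pop_front() noexcept {
        node* n = _first;
        _first = n->next;
        if (_first == nullptr) { // queue became empty, reset the tail slot
            _last_p = &_first;
        }
        return n;
    }
};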


Copilot AI left a comment


Pull request overview

This PR converts the task queue from a circular_buffer<task*> to a singly-linked list (task_slist) to address exception safety, memory efficiency, and cache locality concerns. The refactoring makes task scheduling operations truly noexcept by eliminating the need for dynamic buffer allocation, reduces memory overhead by embedding the list structure within task objects, and improves cache performance by only touching task cache lines during queue traversal.

Key changes:

  • Introduced task_slist, a singly-linked list implementation that stores next pointers in the task objects themselves
  • Modified the task class to use a dual-purpose field _scheduling_group_id_or_next_task that encodes either the scheduling group ID (when not queued) or the next task pointer (when queued) using bit 0 as a discriminator
  • Updated the shuffle functionality to work with the new list-based queue structure

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Summary per file:

  • include/seastar/core/task.hh -- adds the task_slist class and modifies task to store the scheduling group / next pointer in a single field using bit manipulation
  • include/seastar/core/reactor.hh -- changes task queue storage from circular_buffer<task*> to task_slist
  • include/seastar/core/coroutine.hh -- updates set_scheduling_group to delegate to the base class method instead of directly accessing the field
  • src/core/reactor.cc -- updates queue operations to use the task_slist API and refactors the shuffle logic to work with the new structure
  • tests/unit/task_queue_test.cc -- adds a comprehensive fuzz test for task_slist operations (push_back, push_front, pop_front)
  • tests/unit/CMakeLists.txt -- registers the new task queue test


@xemul xemul force-pushed the br-task-queue-make-slist branch 2 times, most recently from abf69d6 to cf45e72 on December 1, 2025 05:50
@xemul
Contributor Author

xemul commented Dec 1, 2025

upd:

  • replaced the uintptr on task with a union of unsigned and task*

@dotnwat
Contributor

dotnwat commented Dec 1, 2025

Nice. Just found this today after hitting a large allocation warning on the task queue.

@xemul
Contributor Author

xemul commented Dec 2, 2025

@dotnwat , how many bytes was it? And once it happened, was there a "too long queue accumulated" message in the logs, or were the tasks short enough to drain within a single task quota?

@avikivity
Member

I'm sure there's a hard-to-prove performance regression in there. We execute hundreds of thousands of tasks per second. With a circular buffer, if the tasks are small (like doing conversion from one future type to another) and the CPU can guess the vptr of the task (possible if many similar task-sets are in the queue) the CPU can execute several tasks in parallel.

With a linked list, there's a memory dependency, and it's likely not to sit in cache due to FIFO execution.

@avikivity
Member

I acknowledge the problems with circular_buffer. But removing the allocation doesn't fix anything. Coroutine frames and continuations must still be allocated, and there is no reasonable way to deal with an allocation failure.

@avikivity
Member

We can do something like chunked_fifo (but optimize it a little, so it aligns itself to a power-of-two).

Another approach is chunked-fifo-like, but lay out tasks instead of task pointers. It requires that we schedule tasks by either sending a final type (most eventually end up final), or via a base pointer, and add size() member.

@avikivity
Member

Another approach is chunked-fifo-like, but lay out tasks instead of task pointers. It requires that we schedule tasks by either sending a final type (most eventually end up final), or via a base pointer, and add size() member.

This doesn't work, often the tasks are embedded in immovable objects like coroutine frames.

@xemul
Contributor Author

xemul commented Dec 2, 2025

With a circular buffer, if the tasks are small ... and the CPU can guess the vptr of the task ... the CPU can execute several tasks in parallel.
With a linked list, there's a memory dependency, and it's likely not to sit in cache due to FIFO execution.

Some more light shed on this effect, or "further reading" links are very welcome here :)

@mykaul
Contributor

mykaul commented Dec 2, 2025

With a circular buffer, if the tasks are small ... and the CPU can guess the vptr of the task ... the CPU can execute several tasks in parallel.
With a linked list, there's a memory dependency, and it's likely not to sit in cache due to FIFO execution.

Some more light shed on this effect, or "further reading" links are very welcome here :)

I read some time ago https://people.csail.mit.edu/delimitrou/papers/2024.asplos.memory.pdf and thought it was an interesting read. Donno if it helps for this specific discussion though.

@avikivity
Member

With a circular buffer, if the tasks are small ... and the CPU can guess the vptr of the task ... the CPU can execute several tasks in parallel.
With a linked list, there's a memory dependency, and it's likely not to sit in cache due to FIFO execution.

Some more light shed on this effect, or "further reading" links are very welcome here :)

Here's an article showing some of the points: https://douglasrumbaugh.com/post/list-secrets/

Append() corresponds to schedule(), which I expect to run equally fast, because the end() of the list will be in cache.

Average() corresponds to run_some_tasks(), which I expect to be slower for the list due to memory latency fetching the next pointer.

It's a little more complicated: the cpu will likely learn that ->next is followed and will prefetch it ahead of time. But if the task is short, prefetching won't get it in time.

If the task is short (in instruction count) but long (due to cache misses), we can have instructions from multiple tasks executed at the same time in an out-of-order CPU. I expect to have data cache misses, again due to FIFO. FIFO is bad in that it temporally separates tasks operating on the same data, giving them time to drop out of cache. That's why we have schedule_urgent_task().

@xemul
Contributor Author

xemul commented Dec 3, 2025

Here's an article showing some of the points: https://douglasrumbaugh.com/post/list-secrets/

I would say it doesn't really apply here. The malloc overhead from the article is reversed here: the task is allocated anyway, and to enqueue it the array might need an amortized allocation, while the list needs none at all.

Next, the article compares two access patterns: array [0|1|2|3|...|N] vs list [0] -> [1] -> [2] -> [3] -...-> [N]. And the array definitely wins in cache efficiency, because its elements are packed and list nodes are not. Here the access pattern is different. Namely, an array of pointers

[0|1|2|...|N]
 | | |     |
 0 1 2     N

vs the very same list [0] -> [1] -> [2] -...-> [N]. And scanning the array while dereferencing each pointer doesn't necessarily win (though, of course, I didn't measure it).

Accordingly,

Average() corresponds to run_some_tasks(), which I expect to be slower for the list due to memory latency fetching the next pointer.

run_some_tasks() will fetch the run_and_dispose pointer from the vtable, and for that it will fetch the vtable pointer from the task object, for both the array and the list. So fetching the next pointer will come from cache (for short continuation chains), just like fetching the (N+1)th pointer from the array.

I was more interested in this:

... if the tasks are small (like doing conversion from one future type to another) and the CPU can guess the vptr of the task (possible if many similar task-sets are in the queue) the CPU can execute several tasks in parallel.
With a linked list, there's a memory dependency, and it's likely not to sit in cache due to FIFO execution.

What is it?

@nyh
Contributor

nyh commented Dec 3, 2025

I acknowledge the problems with circular_buffer.

Note that chunked_fifo<> was developed exactly to address the shortcomings of circular_buffer<>, especially how it was used in semaphore (see #140). So it's rather natural to switch to chunked_fifo<> here too. Perhaps I made a mistake when I decided to parametrize this template with items_per_chunk and not bytes_per_chunk, but this can easily be solved (I think you can just calculate the right items_per_chunk as something like 128*1024/sizeof(item)).
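[Editor's note] For illustration, the parametrization described above might look like the following sketch. The 128 KiB target and the alias name are assumptions, not existing Seastar code; only chunked_fifo's items_per_chunk template parameter is taken from the library.

#include <seastar/core/chunked_fifo.hh>
#include <seastar/core/task.hh>

// Hypothetical alias: size each chunk to roughly 128 KiB worth of task
// pointers, as suggested above (the 128 KiB target is just an example).
using task_queue_buffer = seastar::chunked_fifo<seastar::task*, 128 * 1024 / sizeof(seastar::task*)>;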

But removing the allocation doesn't fix anything. Coroutine frames and continuations must still be allocated, and there is no reasonable way to deal with an allocation failure.

Why can't we throw a normal bad_alloc exception when we can't allocate a coroutine frame?

@xemul
Contributor Author

xemul commented Dec 3, 2025

Context switch perf test results

  test iters runtime   inst cycles
ARRAY sched.context_switch 7979000 122.73ns ± 0.37% 723.11 488.9
  sched.context_switch_x2 3686000 269.11ns ± 0.12% 1481.36 1081.7
  sched.context_switch_x1_5 5029000 197.39ns ± 1.11% 1102.05 782.5
LIST sched.context_switch 7961000 122.98ns ± 0.05% 717.11 489.4
  sched.context_switch_x2 3714000 266.72ns ± 0.05% 1475.36 1072.7
  sched.context_switch_x1_5 5140000 191.57ns ± 0.18% 1096.03 765.3

@mykaul
Contributor

mykaul commented Dec 3, 2025

Next, even if circular buffer manages to grow to its de-facto maximum size, it can be huge. We've seen that task queues grow up to tens of thousands of entries and it occupies contiguous memory of relevant size.

Array, while smaller in overall size than a list I assume (no need for pointers), doesn't solve the above issue, no?

@xemul
Contributor Author

xemul commented Dec 3, 2025

Currently we have an array of pointers to tasks. Both the tasks and the array itself need to be allocated. After this PR only the tasks are allocated; the array is gone altogether. And there's no change in sizeof(task) here, since the "next" pointer is unioned with a pre-existing member. So the memory footprint with this PR is reduced.

@xemul
Contributor Author

xemul commented Dec 3, 2025

Interesting. Run-queue processing times (#3134)

  test iters   runtime inst cycles
ARRAY sched.runqueue_x 1738090 291.65ns ± 0.31% 136.04 146.7
  sched.runqueue_m 95997000 5.43ns ± 0.09% 26.13 11.6
  sched.runqueue_M 133000000 2.57ns ± 0.04% 25.01 10.3
LIST sched.runqueue_x 1723850 286.79ns ± 0.13% 135.25 132.7
  sched.runqueue_m 93316000 6.06ns ± 0.17% 25.13 14.2
  sched.runqueue_M 134000000 3.23ns ± 0.62% 24.02 13.1

LIST takes fewer instructions, but more cycles (and more time)

@xemul
Contributor Author

xemul commented Dec 3, 2025

☝️ _x is 10 tasks, _m is 1'000 tasks and _M is 1'000'000 tasks

@nyh
Contributor

nyh commented Dec 3, 2025

Currently we have array of pointers to tasks. Both, tasks and array itself, need to be allocated. After this PR only tasks are allocated, array is gone on its own. And no change in sizeof(task) here, the "next" pointer is unioned with pre-existing member. So memory footprint with this PR is reduced.

This is a good point. It's not a separate list like std::forward_list or our chunked_fifo - it's an intrusive list, stored inside the tasks.
I'm not sure what to say about the union of the pointer - on one hand it's really cool you found a way to do this. On the other hand it's kind of worrying ;-)

@xemul
Contributor Author

xemul commented Dec 4, 2025

I'm not sure what to say about the union of the pointer - on one hand it's really cool you found a way to do this. On the other hand it's kind of worrying ;-)

The unsafe case is when the union contains the next pointer instead of the sched group id. In that state there's no way to get the sched group, and std::move-ing the task is fatal. The former is, well, yes, no excuse. But the latter is as safe as it is today -- if one std::moves a task while the circular-buffer run-queue still holds a pointer to it, it's just as fatal.
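[Editor's note] To make the encoding concrete, here is a rough sketch of the bit-0 "disguise" trick discussed in this thread (field and helper names are illustrative, not the PR's exact ones). Task objects are aligned to more than one byte, so bit 0 of a valid task pointer is always clear and can discriminate a disguised scheduling-group id from a next pointer.

#include <cstdint>

// Illustrative sketch of the sched-group-id / next-pointer union.
// A disguised group id always has bit 0 set; a valid task* never does.
class task_sketch {
    uintptr_t _sg_or_next;   // discriminated by bit 0
public:
    static uintptr_t disguise(unsigned sg_id) noexcept {
        return (uintptr_t(sg_id) << 1) | 1u;   // shift left, set bit 0
    }
    bool queued() const noexcept { return (_sg_or_next & 1u) == 0; }
    unsigned sched_group_id() const noexcept {
        // only meaningful while not queued
        return unsigned(_sg_or_next >> 1);
    }
    task_sketch* next() const noexcept {
        // only meaningful while queued
        return reinterpret_cast<task_sketch*>(_sg_or_next);
    }
    void set_next(task_sketch* n) noexcept {
        _sg_or_next = reinterpret_cast<uintptr_t>(n);
    }
};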

@dotnwat
Contributor

dotnwat commented Dec 4, 2025

@dotnwat , how many bytes was it? And once it happened, was there a "too long queue accumulated" message in the logs, or were the tasks short enough to drain within a single task quota?

I'm not sure about how many bytes, I just noticed this in the logs:

WARN  2025-10-14 03:07:52,962 [shard 1:main] seastar - Too long queue accumulated for main (1207 tasks)

Is there something else to look for to answer the bytes question?

@dotnwat
Contributor

dotnwat commented Dec 4, 2025

@dotnwat , how many bytes was it? And once it happened, was there a "too long queue accumulated" message in the logs, or were the tasks short enough to drain within a single task quota?

I just noticed this in the logs:

WARN  2025-10-14 03:07:52,962 [shard 1:main] seastar - Too long queue accumulated for main (1207 tasks)

The allocation we saw was around 220K. I don't think that this particular instance of the log message was associated with a large allocation--we've been hitting this and the large allocation recently and not always at the same time--so 1207 tasks might not be consistent with an allocation that large.

@xemul
Contributor Author

xemul commented Dec 4, 2025

Ah, 1.2k tasks. OK, it's ~4k bytes for the buffer with task* pointers. And the allocation of runqueue for 220k should be ~60k tasks 🙀 , but if they are short, they can fit into task-quota

@xemul
Contributor Author

xemul commented Dec 4, 2025

With the tasks in #3134 shuffled (shuffled once at allocation time; "wakeups" produce the same sequence every time):

  test iters   runtime inst cycles
ARRAY sched.runqueue_x 1728050 291.75ns ± 0.08% 136.04 148.8
  sched.runqueue_m 94524000 5.46ns ± 0.08% 26.13 11.6
  sched.runqueue_M 14000000 13.65ns ± 0.08% 25.07 55.5
LIST  sched.runqueue_x 1749190 288.53ns ± 0.14% 134.78 136.1
  sched.runqueue_m 89083000 5.99ns ± 0.23% 25.13 13.8
  sched.runqueue_M 10000000 52.29ns ± 0.35% 24.24 213.4

@xemul
Contributor Author

xemul commented Dec 4, 2025

For future reference.

With task_queue::run_tasks() stripped down to the bare minimum:

// ARRAY (current code)
bool reactor::task_queue::run_tasks() {
    while (!_q.empty()) {
        auto tsk = _q.front();
        _q.pop_front();
        tsk->run_and_dispose();
    }

    return !_q.empty();
}

// LIST (this PR)
bool reactor::task_queue::run_tasks() {
    auto current_sg = scheduling_group(_id);
    while (!_q.empty()) {
        auto tsk = _q.pop_front(current_sg);
        tsk->run_and_dispose();
    }

    return !_q.empty();
}
  test iters   runtime inst cycles
ARRAY sched.runqueue_x 1717290 297.01ns ± 0.97% 118.39 141.8
  sched.runqueue_m 96838000 5.03ns ± 0.01% 14.07 9.7
  sched.runqueue_M 14000000 8.94ns ± 0.14% 13.01 36.1
LIST  sched.runqueue_x 1673950 291.55ns ± 0.69% 118.96 130
  sched.runqueue_m 77171000 5.92ns ± 0.17% 14.09 13.4
  sched.runqueue_M 10000000 51.46ns ± 0.73% 13.01 210.3

Disassembled release-mode binary

ARRAY

0000000000000090 <_ZN7seastar7reactor10task_queue9run_tasksEv>:
      90:	48 8b 47 70          	mov    0x70(%rdi),%rax
      94:	48 3b 47 78          	cmp    0x78(%rdi),%rax
      98:	74 3e                	je     d8 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x48>
      9a:	53                   	push   %rbx
      9b:	48 89 fb             	mov    %rdi,%rbx
      9e:	66 90                	xchg   %ax,%ax

      a0:	48 8b b3 80 00 00 00 	mov    0x80(%rbx),%rsi
      a7:	48 8b 4b 68          	mov    0x68(%rbx),%rcx
      ab:	48 8d 56 ff          	lea    -0x1(%rsi),%rdx
      af:	48 21 c2             	and    %rax,%rdx
      b2:	48 83 c0 01          	add    $0x1,%rax
      b6:	48 8b 3c d1          	mov    (%rcx,%rdx,8),%rdi
      ba:	48 89 43 70          	mov    %rax,0x70(%rbx)
      be:	48 8b 07             	mov    (%rdi),%rax
      c1:	ff 10                	call   *(%rax)
      c3:	48 8b 43 70          	mov    0x70(%rbx),%rax
      c7:	48 3b 43 78          	cmp    0x78(%rbx),%rax
      cb:	75 d3                	jne    a0 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x10>

      cd:	31 c0                	xor    %eax,%eax
      cf:	5b                   	pop    %rbx
      d0:	c3                   	ret
      d1:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
      d8:	31 c0                	xor    %eax,%eax
      da:	c3                   	ret
      db:	90                   	nop
      dc:	0f 1f 40 00          	nopl   0x0(%rax)

LIST

0000000000000090 <_ZN7seastar7reactor10task_queue9run_tasksEv>:
      90:	41 54                	push   %r12
      92:	55                   	push   %rbp
      93:	53                   	push   %rbx
      94:	48 89 fb             	mov    %rdi,%rbx
      97:	48 8b 7f 68          	mov    0x68(%rdi),%rdi
      9b:	48 85 ff             	test   %rdi,%rdi
      9e:	74 40                	je     e0 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x50>
      a0:	0f b6 6b 58          	movzbl 0x58(%rbx),%ebp
      a4:	4c 8d 63 68          	lea    0x68(%rbx),%r12
      a8:	8d 6c 2d 01          	lea    0x1(%rbp,%rbp,1),%ebp
      ac:	eb 18                	jmp    c6 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x36>
      ae:	66 90                	xchg   %ax,%ax

      b0:	48 8b 07             	mov    (%rdi),%rax
      b3:	89 6f 08             	mov    %ebp,0x8(%rdi)
      b6:	48 83 6b 78 01       	subq   $0x1,0x78(%rbx)
      bb:	ff 10                	call   *(%rax)
      bd:	48 8b 7b 68          	mov    0x68(%rbx),%rdi
      c1:	48 85 ff             	test   %rdi,%rdi
      c4:	74 1a                	je     e0 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x50>
      c6:	48 8b 47 08          	mov    0x8(%rdi),%rax
      ca:	48 89 43 68          	mov    %rax,0x68(%rbx)
      ce:	48 8d 47 08          	lea    0x8(%rdi),%rax
      d2:	48 39 43 70          	cmp    %rax,0x70(%rbx)
      d6:	75 d8                	jne    b0 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x20>
      d8:	4c 89 63 70          	mov    %r12,0x70(%rbx)
      dc:	eb d2                	jmp    b0 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x20>

      de:	66 90                	xchg   %ax,%ax
      e0:	5b                   	pop    %rbx
      e1:	31 c0                	xor    %eax,%eax
      e3:	5d                   	pop    %rbp
      e4:	41 5c                	pop    %r12
      e6:	c3                   	ret
      e7:	90                   	nop
      e8:	0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
      ef:	00 

@xemul xemul marked this pull request as draft December 4, 2025 10:55
@xemul
Contributor Author

xemul commented Dec 4, 2025

Optimized out the _last_p checks in task_slist::pop_front():

task* task_slist::snap() noexcept {
    auto ret = _first;
    _first = nullptr;
    _last_p = &_first; 
    _size = 0;
    return ret;
}

bool reactor::task_queue::run_tasks() {
    // Make sure new tasks will inherit our scheduling group
    auto current_sg = scheduling_group(_id);

    task* n = _q.snap();
    while (n != nullptr) {
        auto tsk = n;
        n = tsk->_next;
        tsk->_scheduling_group_id = task::disguise_sched_group(current_sg);
        tsk->run_and_dispose();
    }

    return !_q.empty();
}
  test iters   runtime inst cycles
LIST sched.runqueue_x 1673950 291.55ns ± 0.69% 118.96 130
  sched.runqueue_m 77171000 5.92ns ± 0.17% 14.09 13.4
  sched.runqueue_M 10000000 51.46ns ± 0.73% 13.01 210.3
LIST-o sched.runqueue_x 1732710 292.69ns ± 0.16% 112.03 138.8
  sched.runqueue_m 109276000 4.25ns ± 0.09% 9.05 6.6
  sched.runqueue_M 10000000 49.72ns ± 0.90% 8.01 202.6

decoded run_tasks

0000000000000090 <_ZN7seastar7reactor10task_queue9run_tasksEv>:
      90:	41 54                	push   %r12
      92:	48 8d 47 68          	lea    0x68(%rdi),%rax
      96:	55                   	push   %rbp
      97:	53                   	push   %rbx
      98:	48 8b 5f 68          	mov    0x68(%rdi),%rbx
      9c:	48 89 47 70          	mov    %rax,0x70(%rdi)
      a0:	0f b6 6f 58          	movzbl 0x58(%rdi),%ebp
      a4:	48 c7 47 68 00 00 00 	movq   $0x0,0x68(%rdi)
      ab:	00 
      ac:	48 c7 47 78 00 00 00 	movq   $0x0,0x78(%rdi)
      b3:	00 
      b4:	48 85 db             	test   %rbx,%rbx
      b7:	74 2f                	je     e8 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x58>
      b9:	49 89 fc             	mov    %rdi,%r12
      bc:	8d 6c 2d 01          	lea    0x1(%rbp,%rbp,1),%ebp

      c0:	48 89 df             	mov    %rbx,%rdi
      c3:	48 8b 5b 08          	mov    0x8(%rbx),%rbx
      c7:	48 8b 07             	mov    (%rdi),%rax
      ca:	89 6f 08             	mov    %ebp,0x8(%rdi)
      cd:	ff 10                	call   *(%rax)
      cf:	48 85 db             	test   %rbx,%rbx
      d2:	75 ec                	jne    c0 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x30>

      d4:	49 83 7c 24 68 00    	cmpq   $0x0,0x68(%r12)
      da:	5b                   	pop    %rbx
      db:	0f 95 c0             	setne  %al
      de:	5d                   	pop    %rbp
      df:	41 5c                	pop    %r12
      e1:	c3                   	ret
      e2:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
      e8:	5b                   	pop    %rbx
      e9:	31 c0                	xor    %eax,%eax
      eb:	5d                   	pop    %rbp
      ec:	41 5c                	pop    %r12
      ee:	c3                   	ret
      ef:	90                   	nop

@xemul
Contributor Author

xemul commented Dec 4, 2025

To emphasize the ARRAY vs LIST-opt cases:

Main loop in assembly

ARRAY

      a0:	48 8b b3 80 00 00 00 	mov    0x80(%rbx),%rsi
      a7:	48 8b 4b 68          	mov    0x68(%rbx),%rcx
      ab:	48 8d 56 ff          	lea    -0x1(%rsi),%rdx
      af:	48 21 c2             	and    %rax,%rdx
      b2:	48 83 c0 01          	add    $0x1,%rax
      b6:	48 8b 3c d1          	mov    (%rcx,%rdx,8),%rdi
      ba:	48 89 43 70          	mov    %rax,0x70(%rbx)
      be:	48 8b 07             	mov    (%rdi),%rax
      c1:	ff 10                	call   *(%rax)
      c3:	48 8b 43 70          	mov    0x70(%rbx),%rax
      c7:	48 3b 43 78          	cmp    0x78(%rbx),%rax
      cb:	75 d3                	jne    a0 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x10>

LIST-opt

      c0:	48 89 df             	mov    %rbx,%rdi
      c3:	48 8b 5b 08          	mov    0x8(%rbx),%rbx
      c7:	48 8b 07             	mov    (%rdi),%rax
      ca:	89 6f 08             	mov    %ebp,0x8(%rdi)
      cd:	ff 10                	call   *(%rax)
      cf:	48 85 db             	test   %rbx,%rbx
      d2:	75 ec                	jne    c0 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x30>

Perf test

  test iters   runtime inst cycles
ARRAY sched.runqueue_x 1717290 297.01ns ± 0.97% 118.39 141.8
  sched.runqueue_m 96838000 5.03ns ± 0.01% 14.07 9.7
  sched.runqueue_M 14000000 8.94ns ± 0.14% 13.01 36.1
LIST-o sched.runqueue_x 1732710 292.69ns ± 0.16% 🟢 112.03 🟢 138.8
  sched.runqueue_m 109276000 4.25ns ± 0.09% 🟢 9.05 🟢 6.6
  sched.runqueue_M 10000000 49.72ns ± 0.90% 🟢 8.01 🔴 202.6

_x -- 10 tasks, _m -- 1'000 tasks, _M -- 1'000'000 tasks

@xemul
Contributor Author

xemul commented Dec 4, 2025

Updated the ARRAY case to also modify task->_sg (as if it were the LIST case updating the task's sg/next union):

bool reactor::task_queue::run_tasks() {
    auto current = scheduling_group(_id);
    while (!_q.empty()) {
        auto tsk = _q.front();
        _q.pop_front();
        tsk->_sg = current;
        tsk->run_and_dispose();
    }

    return !_q.empty();
}
test iters   runtime inst cycles
sched.runqueue_x 1710960 297.74ns ± 0.08% 120.61 149.6
sched.runqueue_m 82919000 6.51ns ± 0.12% 15.09 15.6
sched.runqueue_M 16000000 10.86ns ± 0.83% 14.01 43.6

@xemul
Contributor Author

xemul commented Dec 4, 2025

In the LIST case, de-unioned next and the sched-group-id on task, to drop the sched group update in the loop (tsk->_scheduling_group_id = task::disguise_sched_group(current_sg);):

bool reactor::task_queue::run_tasks() {

    task* n = _q.snap();
    while (n != nullptr) {
        auto tsk = n;
        n = tsk->_next;
        tsk->run_and_dispose();
    }

    return !_q.empty();
}
      c0:   48 89 df                mov    %rbx,%rdi
      c3:   48 8b 5b 10             mov    0x10(%rbx),%rbx
      c7:   48 8b 07                mov    (%rdi),%rax
      ca:   ff 10                   call   *(%rax)
      cc:   48 85 db                test   %rbx,%rbx
      cf:   75 ef                   jne    c0 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x30>
test iters   runtime inst cycles
sched.runqueue_x 1730610 297.71ns ± 0.34% 110.65 139.1
sched.runqueue_m 94810000 4.91ns ± 1.41% 8.05 9.1
sched.runqueue_M 7000000 71.07ns ± 0.67% 7.01 285.6

@xemul
Contributor Author

xemul commented Dec 4, 2025

https://gist.github.com/xemul/d5f83adac66d34b70fa4fef49a861d96

☝️ a perf test that allocates 10k dispersed integers and sums those values by dereferencing each via array of pointers or intrusive list:

test                    iters            runtime     allocs      tasks       inst     cycles
sum_perf.sum_array  651810000     1.24ns ± 1.16%      0.000      0.000       7.04        3.9
sum_perf.sum_list   135080000     6.65ns ± 0.20%      0.000      0.000       5.04       26.0
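[Editor's note] For reference, a self-contained sketch of the two access patterns being compared (not the gist's code; names and types are arbitrary): both variants touch the same dispersed objects, but the list variant must read each node before it knows where the next one lives.

#include <vector>

// Sketch of the array-of-pointers vs intrusive-list access patterns.
struct item {
    long value;
    item* next = nullptr;   // intrusive link, unused by the array variant
};

long sum_via_pointer_array(const std::vector<item*>& v) {
    long s = 0;
    for (item* p : v) {      // the next pointer comes from the packed array
        s += p->value;
    }
    return s;
}

long sum_via_intrusive_list(item* head) {
    long s = 0;
    for (item* p = head; p != nullptr; p = p->next) {  // the next pointer comes from the node itself
        s += p->value;
    }
    return s;
}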

@xemul
Contributor Author

xemul commented Dec 6, 2025

Split lists help a lot

elements   Array   Array of pointers   Intrusive list   Split list [2]   Split list [16]
10            65                  63               63               64                63
100            9                   8               11                9                 9
1000           3                   3               13                7                 3
10000          3                   3               26               13                 5
100000         3                   5               35               18                 5
1000000       10                  22              327              165                26

@avikivity
Member

Is "split list" another name for chunked_fifo?

This looks like another opportunity to refer to https://github.com/avikivity/seastar/commits/sort-tasks/.

@xemul
Contributor Author

xemul commented Dec 8, 2025

No, it's not chunked fifo, it's literally orthogonal.

Here's how elements are stored in std::array<T>: [ 0 1 2 3 4 5 ... ]. And the same is true for std::vector<T> and circular_buffer<T>

This is intrusive_list<T>: [0] -> [1] -> [2] -> [3] -> [4] -> [5] -> ...

And this is chunked_fifo<T> (~= intrusive_list<array<T>>) : [ 0 1 2 ] -> [ 3 4 5 ] -> ...

Currently task queue is circular_buffer<task*>:

 [0] [2] [4]
  |   |   |
[ . . . . . .   ... ]
    |   |   |
   [1] [3] [5]

and using chunked fifo would make it chunked_fifo<task*>:

 [0] [2]        [4]
  |   |          |
[ . . . ] -> [ . . . ] -> ...
    |          |   |
   [1]        [3] [5]

Using chunked fifo will still require potentially growing the pointers storage on wakeup, but, of course, in smaller chunks

The split-list is ~= std::array<intrusive_list<task>>, like this:

 [0] -> [2] -> [4] -> ...
  |
[ . . ]
    |
   [1] -> [3] -> [5] -> ...

It uses a fixed-size array and doesn't need to allocate memory on wake-up. And also note the interleaving -- when scanning elements in the displayed order, it scans the two tracks "in parallel", as if they were pointers in a plain array (this requires consuming the "next" pointer, so that the Nth wave of the scan still references pointers from the root-level array, but the task run-queue scanning does exactly that).
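[Editor's note] A minimal sketch of the round-robin split-list idea (illustrative only, not the code in #3138; F and all names are arbitrary):

// F independent intrusive lists ("tracks"), appended to in round-robin
// order, so consecutive elements land on different tracks and the CPU
// can chase several next pointers independently.
struct snode {
    snode* next = nullptr;
    long value = 0;
};

template <unsigned F>
class split_list_sketch {
    snode* _first[F] = {};
    snode** _last_p[F];
    unsigned _push_idx = 0;    // next track to append to
public:
    split_list_sketch() {
        for (unsigned i = 0; i < F; i++) {
            _last_p[i] = &_first[i];
        }
    }

    void push_back(snode* n) noexcept {
        n->next = nullptr;
        *_last_p[_push_idx] = n;
        _last_p[_push_idx] = &n->next;
        _push_idx = (_push_idx + 1) % F;   // interleave across tracks
    }

    // Scan in insertion order: element i lives in track i % F.
    long sum() const noexcept {
        long s = 0;
        const snode* cur[F];
        for (unsigned i = 0; i < F; i++) {
            cur[i] = _first[i];
        }
        for (unsigned long i = 0; cur[i % F] != nullptr; i++) {
            s += cur[i % F]->value;
            cur[i % F] = cur[i % F]->next;
        }
        return s;
    }
};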

@avikivity
Member

No, it's not chunked fifo, it's literally orthogonal.

Here's how elements are stored in std::array<T>: [ 0 1 2 3 4 5 ... ]. And the same is true for std::vector<T> and circular_buffer<T>

This is intrusive_list<T>: [0] -> [1] -> [2] -> [3] -> [4] -> [5] -> ...

And this is chunked_fifo<T> (~= intrusive_list<array<T>>) : [ 0 1 2 ] -> [ 3 4 5 ] -> ...

Currently task queue is circular_buffer<task*>:

 [0] [2] [4]
  |   |   |
[ . . . . . .   ... ]
    |   |   |
   [1] [3] [5]

and using chunked fifo would make it chunked_fifo<task*>:

 [0] [2]        [4]
  |   |          |
[ . . . ] -> [ . . . ] -> ...
    |          |   |
   [1]        [3] [5]

Using chunked fifo will still require potentially growing the pointers storage on wakeup, but, of course, in smaller chunks

But what's the problem? Isn't the goal to eliminate large allocations? chunked_fifo does this.

We can't completely eliminate allocations (the tasks themselves must be allocated), so we don't win from removing allocations if they are well amortized.

The split-list is ~= std::array<intrusive_list<task>>, like this:

 [0] -> [2] -> [4] -> ...
  |
[ . . ]
    |
   [1] -> [3] -> [5] -> ...

It uses a fixed-size array and doesn't need to allocate memory on wake-up. And also note the interleaving -- when scanning elements in the displayed order, it scans the two tracks "in parallel", as if they were pointers in a plain array (this requires consuming the "next" pointer, so that the Nth wave of the scan still references pointers from the root-level array, but the task run-queue scanning does exactly that).

Seems complicated.

@xemul
Contributor Author

xemul commented Dec 8, 2025

But what's the problem? Isn't the goal to eliminate large allocations? chunked_fifo does this.

The goal was to check if it's possible to eliminate wakeup allocations at all

We can't completely eliminate allocations (the tasks themselves must be allocated), so we don't win from removing allocations if they are well amortized.

There are two paths (sometimes they immediately follow one another, but not always) -- allocating a task and waking the task up. Currently we allocate on both, but allocating on the latter is not necessary. AFAIU, if there's a chain of yet-unresolved futures at hand, the task is allocated as part of a continuation_base. When someone later calls promise::set_value(), the task is woken up. Not having (even amortized) allocations in set_value() is the goal.

@avikivity
Member

But what's the problem? Isn't the goal to eliminate large allocations? chunked_fifo does this.

The goal was to check if it's possible to eliminate wakeup allocations at all

We can't completely eliminate allocations (the tasks themselves must be allocated), so we don't win from removing allocations if they are well amortized.

There are two paths (sometimes they immediately follow one another, but not always) -- allocating a task and waking the task up. Currently we allocate on both, but allocating on the latter is not necessary. AFAIU, if there's a chain of yet unresolved futures at hand, the task is allocated as a part of continuation_base. When later one does promise::set_value() the task is woken up. Not having (even amortized) allocations in set_value() is the goal.

Why do we allocate on wakeup? Usually the queue is not full.

@xemul
Contributor Author

xemul commented Dec 8, 2025

Seems complicated

Well, a little bit, but at the end of the day it's pretty compact code: #3138

Why do we allocate on wakeup? Usually the queue is not full.

Because circular_buffer<>::push_back() is a potentially allocating operation. Nonetheless, I do agree that "usually" the queue has already grown to a large-enough size and then stops growing until the next wakeup storm (if it ever happens).
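[Editor's note] To illustrate why a ring-buffer push can allocate, here is a generic sketch of the grow-when-full step. This is a sketch of the general technique, not Seastar's circular_buffer code.

#include <algorithm>
#include <cstddef>
#include <vector>

// Generic sketch: a ring that doubles its storage when full. The growth
// step is the potentially throwing allocation that makes an unconditional
// noexcept push_back() impossible.
template <typename T>
void ring_push_back(std::vector<T>& storage, std::size_t& begin, std::size_t& count, const T& v) {
    if (count == storage.size()) {
        // Grow the backing storage: may throw std::bad_alloc.
        std::vector<T> bigger(std::max<std::size_t>(storage.size() * 2, 8));
        for (std::size_t i = 0; i < count; i++) {
            bigger[i] = storage[(begin + i) % storage.size()];
        }
        storage.swap(bigger);
        begin = 0;
    }
    storage[(begin + count) % storage.size()] = v;
    count++;
}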

Add a test that populates run-queue with 10, 1k and 1M no-op tasks and
measures the time it takes for the scheduler to process one.

No allocations/frees happen at measure time; tasks are pre-populated.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two coroutine_traits_base::promise_task templates; both inherit from task, but one sets task::_sg by hand while the other uses task's protected implementation. Make both use the latter.

This allows moving task::_sg into the private section, protecting it reliably from future changes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Rename the task::_sg member to a uintptr and keep the scheduling group index in there. The next patch will union a task* pointer with it, so to tell which value is stored, disguise the sched group ID by shifting it left one bit and setting bit zero. Also add a static assertion that when a task pointer is put there, this bit will not be set.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The split list is a collection of objects optimized for scanning from beginning to end, with the ability to add entries at both ends -- front and back, like std::deque does.

Splitting is needed to let the CPU prefetch the next element while scanning the current one. With classical linked lists this is difficult, because in order to prefetch the next element the CPU needs to read the next pointer from the current element, which is still being read. Thus the forward scan of a plain list is serialized. With split lists several adjacent elements can be prefetched in parallel.

To achieve that, the "first" pointer is "sharded" -- there are F first pointers (as well as F cached last pointers) and the list is populated in a round-robin manner -- first the 0th list is appended to, then the 1st, then the 2nd, ... then the (F-1)th, and population wraps around and starts over.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The element (class task) shares the "next" pointer with sched group index,
thus living in two states -- queued and inactive. When created it gets into
the latter state and carries sched group index on-board. When add_task is
called, the task becomes queued and the sched group index is thus lost. Later,
when the task is picked for execution, the task_queue restores this index
into "current", naturally. Probably the restoration is not needed, but it's
better to be on the safe side.

task::set_scheduling_group() is no longer callable for queued tasks, but the method is protected and is only used in two places. The first is the coroutine switch-to, which happens on a running (i.e. not queued) task. The second is shared_future tweaking an unresolved shared_state (i.e. an inactive task too).

Debug-mode task queue shuffling in constant time is no longer possible and is temporarily removed; the next patch will restore it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently shuffling is performed by swapping a newly-activated task with some random queued one. The previous patch turned the task queue into a singly-linked list, and picking a random task from it in constant time is impossible.

The new shuffling works at run time -- when a queue picks up the next front task to execute, it rolls the dice and optionally re-queues it at the back of the queue.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
@xemul xemul force-pushed the br-task-queue-make-slist branch from cf45e72 to a5455c6 on December 10, 2025 14:40
@xemul
Contributor Author

xemul commented Dec 10, 2025

upd:

test iters runtime allocs tasks inst cycles
master, 10 tasks 3228650 159.54ns ± 0.15% 0.000 1.200 135.50 143.1
master, 1k tasks 123110000 3.97ns ± 1.45% 0.000 1.002 26.13 10.8
master, 1M tasks 14000000 13.81ns ± 1.29% 0.000 1.000 25.07 55.2
this PR, 10 tasks 3263140 157.05ns ± 0.09% 0.000 1.200 147.27 140.7
this PR, 1k tasks 94840000 5.45ns ± 0.88% 0.000 1.002 33.20 17.0
this PR, 1M tasks 14000000 16.03ns ± 0.64% 0.000 1.000 32.08 64.6

@xemul
Contributor Author

xemul commented Dec 10, 2025

Overhead is ~57% on 1k, ~17% on 1M tasks, ~8.5% on 10 tasks

@xemul
Contributor Author

xemul commented Dec 10, 2025

test iters runtime allocs tasks inst cycles
this PR, 10 tasks 3191910 158.38ns ± 0.12% 0.000 1.200 147.28 141.4
this PR, 1k tasks 94409000 5.48ns ± 0.24% 0.000 1.002 33.25 17.0
this PR, 1M tasks 14000000 16.07ns ± 0.62% 0.000 1.000 32.08 64.7

☝️ with intrusive_split_list<32>

@xemul
Contributor Author

xemul commented Dec 10, 2025

test iters runtime allocs tasks inst cycles
sched.runqueue_x 3134860 157.53ns ± 0.48% 0.000 1.200 146.40 131.9
sched.runqueue_m 96570000 4.82ns ± 2.81% 0.000 1.002 33.17 13.8
sched.runqueue_M 13000000 18.30ns ± 2.46% 0.000 1.000 32.09 71.2

☝️ intrusive_split_list<8>

@xemul
Contributor Author

xemul commented Dec 10, 2025

The above is on a Threadripper processor; will measure on i7i's Xeon.

@xemul
Contributor Author

xemul commented Dec 10, 2025

i7i.2xlarge, Intel Xeon Platinum 8559C, clang-20.1.2, intrusive_split_list<16>

test iters runtime allocs tasks inst cycles
master, 10 tasks 1076840 418.24ns ± 0.02% 0.000 1.200 137.80 157.1
master, 1k tasks 75366000 6.39ns ± 0.03% 0.000 1.002 26.15 8.3
master, 1M tasks 119000000 3.49ns ± 0.02% 0.000 1.000 25.02 11.1
this PR, 10 tasks 1080530 415.89ns ± 0.02% 0.000 1.200 145.26 153.9
this PR, 1k tasks 74981000 6.71ns ± 0.05% 0.000 1.002 31.18 9.5
this PR, 1M tasks 117000000 3.85ns ± 0.01% 0.000 1.000 30.02 12.2
