Conversation

@xemul
Contributor

@xemul xemul commented Nov 28, 2025

Currently the task queue is a circular_buffer with pointers to task-s, and it raises several concerns.

First, add_task() and schedule() are not exception-safe, and consequently neither is promise::set_value(). They are marked noexcept, but the circular buffer may need to grow, and that allocation may throw.

Next, even once the circular buffer grows to its de-facto maximum size, that size can be huge. We've seen task queues grow to tens of thousands of entries, occupying a contiguous allocation of corresponding size.

Finally, walking the run-list touches more cache lines than just the tasks' own. And as per above -- the queue buffer can span several of those.

This PR converts the task queue into a singly-linked list of tasks, solving all the above difficulties.

  • Adding a task becomes truly noexcept.
  • No extra memory is needed for the queue; sizeof(task) isn't changed either.
  • Walking the run-queue only touches the tasks' cache lines.

refs: #84, #254
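[Editor's note] For readers unfamiliar with the approach, here is a minimal illustrative sketch of an intrusive singly-linked FIFO (hypothetical names, not the PR's actual task_slist): the "next" pointer lives inside the element, so enqueueing only rewires memory that already exists and can be noexcept.

// Illustrative sketch only -- not the PR's task_slist.
// The "next" pointer lives inside the element, so push_back()
// never allocates and can be marked noexcept.
struct node {
    node* next = nullptr;
};

class intrusive_fifo {
    node* _first = nullptr;
    node** _last_p = &_first;    // points at the last "next" slot
public:
    bool empty() const noexcept { return _first == nullptr; }

    void push_back(node* n) noexcept {
        n->next = nullptr;
        *_last_p = n;            // link after the current tail
        _last_p = &n->next;      // the tail slot is now n's next pointer
    }

    // precondition: !empty()
    node* pop_front() noexcept {
        node* n = _first;
        _first = n->next;
        if (_first == nullptr) { // queue became empty, reset the tail slot
            _last_p = &_first;
        }
        return n;
    }
};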


Copilot AI left a comment


Pull request overview

This PR converts the task queue from a circular_buffer<task*> to a singly-linked list (task_slist) to address exception safety, memory efficiency, and cache locality concerns. The refactoring makes task scheduling operations truly noexcept by eliminating the need for dynamic buffer allocation, reduces memory overhead by embedding the list structure within task objects, and improves cache performance by only touching task cache lines during queue traversal.

Key changes:

  • Introduced task_slist, a singly-linked list implementation that stores next pointers in the task objects themselves
  • Modified the task class to use a dual-purpose field _scheduling_group_id_or_next_task that encodes either the scheduling group ID (when not queued) or the next task pointer (when queued) using bit 0 as a discriminator
  • Updated the shuffle functionality to work with the new list-based queue structure

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Summary per file:

  • include/seastar/core/task.hh -- adds the task_slist class and modifies task to store the scheduling group / next pointer in a single field using bit manipulation
  • include/seastar/core/reactor.hh -- changes task queue storage from circular_buffer<task*> to task_slist
  • include/seastar/core/coroutine.hh -- updates set_scheduling_group to delegate to the base class method instead of directly accessing the field
  • src/core/reactor.cc -- updates queue operations to use the task_slist API and refactors the shuffle logic to work with the new structure
  • tests/unit/task_queue_test.cc -- adds a comprehensive fuzz test for task_slist operations (push_back, push_front, pop_front)
  • tests/unit/CMakeLists.txt -- registers the new task queue test


@xemul xemul force-pushed the br-task-queue-make-slist branch 2 times, most recently from abf69d6 to cf45e72 on December 1, 2025 05:50
@xemul
Contributor Author

xemul commented Dec 1, 2025

upd:

  • replaced the uintptr on task with a union of unsigned and task*

@dotnwat
Contributor

dotnwat commented Dec 1, 2025

Nice. Just found this today after hitting a large allocation warning on the task queue.

@xemul
Contributor Author

xemul commented Dec 2, 2025

@dotnwat , how many bytes was it? And once it happened, was there a "too long queue accumulated" message in the logs, or were the tasks short enough to drain within a single task quota?

@avikivity
Member

I'm sure there's a hard-to-prove performance regression in there. We execute hundreds of thousands of tasks per second. With a circular buffer, if the tasks are small (like doing conversion from one future type to another) and the CPU can guess the vptr of the task (possible if many similar task-sets are in the queue) the CPU can execute several tasks in parallel.

With a linked list, there's a memory dependency, and it's likely not to sit in cache due to FIFO execution.

@avikivity
Member

I acknowledge the problems with circular_buffer. But removing the allocation doesn't fix anything. Coroutine frames and continuations must still be allocated, and there is no reasonable way to deal with an allocation failure.

@avikivity
Member

We can do something like chunked_fifo (but optimize it a little, so it aligns itself to a power-of-two).

Another approach is chunked-fifo-like, but lay out tasks instead of task pointers. It requires that we schedule tasks by either sending a final type (most eventually end up final), or via a base pointer, and add size() member.

@avikivity
Member

Another approach is chunked-fifo-like, but lay out tasks instead of task pointers. It requires that we schedule tasks by either sending a final type (most eventually end up final), or via a base pointer, and add size() member.

This doesn't work, often the tasks are embedded in immovable objects like coroutine frames.

@xemul
Contributor Author

xemul commented Dec 2, 2025

With a circular buffer, if the tasks are small ... and the CPU can guess the vptr of the task ... the CPU can execute several tasks in parallel.
With a linked list, there's a memory dependency, and it's likely not to sit in cache due to FIFO execution.

Some more light shed on this effect, or "further reading" links are very welcome here :)

@mykaul
Contributor

mykaul commented Dec 2, 2025

With a circular buffer, if the tasks are small ... and the CPU can guess the vptr of the task ... the CPU can execute several tasks in parallel.
With a linked list, there's a memory dependency, and it's likely not to sit in cache due to FIFO execution.

Some more light shed on this effect, or "further reading" links are very welcome here :)

I read some time ago https://people.csail.mit.edu/delimitrou/papers/2024.asplos.memory.pdf and thought it was an interesting read. Donno if it helps for this specific discussion though.

@avikivity
Member

With a circular buffer, if the tasks are small ... and the CPU can guess the vptr of the task ... the CPU can execute several tasks in parallel.
With a linked list, there's a memory dependency, and it's likely not to sit in cache due to FIFO execution.

Some more light shed on this effect, or "further reading" links are very welcome here :)

Here's an article showing some of the points: https://douglasrumbaugh.com/post/list-secrets/

Append() corresponds to schedule(), which I expect to run equally fast, because the end() of the list will be in cache.

Average() corresponds to run_some_tasks(), which I expect to be slower for the list due to memory latency fetching the next pointer.

It's a little more complicated: the cpu will likely learn that ->next is followed and will prefetch it ahead of time. But if the task is short, prefetching won't get it in time.

If the task is short (in instruction count) but long (due to cache misses), we can have instructions from multiple tasks executed at the same time in an out-of-order CPU. I expect to have data cache misses, again due to FIFO. FIFO is bad in that it temporally separates tasks operating on the same data, giving them time to drop out of cache. That's why we have schedule_urgent_task().

@xemul
Contributor Author

xemul commented Dec 3, 2025

Here's an article showing some of the points: https://douglasrumbaugh.com/post/list-secrets/

I would say it doesn't really apply here. The malloc overhead from the article is reversed here: the task is allocated anyway, and to enqueue it the array might need an amortized allocation, while the list needs none at all.

Next, the article compares two access patterns: array [0|1|2|3|...|N] vs list [0] -> [1] -> [2] -> [3] -...-> [N]. And the array definitely wins in cache efficiency, because its elements are packed and list nodes are not. Here the access pattern is different. Namely, an array of pointers

[0|1|2|...|N]
 | | |     |
 0 1 2     N

vs the very same list [0] -> [1] -> [2] -...-> [N]. And scanning the array while dereferencing each pointer doesn't necessarily win (though, of course, I didn't measure it).

Accordingly,

Average() corresponds to run_some_tasks(), which I expect to be slower for the list due to memory latency fetching the next pointer.

run_some_tasks() will fetch the run_and_dispose pointer from the vtable, and for that it will fetch the vtable pointer from the task object, for both the array and the list. So fetching the next pointer will come from cache (for short continuation chains), just like fetching the (N+1)th pointer from the array.

I was more interested in this:

... if the tasks are small (like doing conversion from one future type to another) and the CPU can guess the vptr of the task (possible if many similar task-sets are in the queue) the CPU can execute several tasks in parallel.
With a linked list, there's a memory dependency, and it's likely not to sit in cache due to FIFO execution.

What is it?

@nyh
Contributor

nyh commented Dec 3, 2025

I acknowledge the problems with circular_buffer.

Note that chunked_fifo<> was developed exactly to address the shortcomings of circular_buffer<>, especially how it was used in semaphore (see #140). So it's rather natural to switch to chunked_fifo<> here too. Perhaps I made a mistake when I decided to parametrize this template with items_per_chunk and not bytes_per_chunk, but this can easily be solved (I think you can just calculate the right items_per_chunk as something like 128*1024/sizeof(item)).
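[Editor's note] For illustration, the parametrization described above might look like the following sketch. The 128 KiB target and the alias name are assumptions, not existing Seastar code; only chunked_fifo's items_per_chunk template parameter is taken from the library.

#include <seastar/core/chunked_fifo.hh>
#include <seastar/core/task.hh>

// Hypothetical alias: size each chunk to roughly 128 KiB worth of task
// pointers, as suggested above (the 128 KiB target is just an example).
using task_queue_buffer = seastar::chunked_fifo<seastar::task*, 128 * 1024 / sizeof(seastar::task*)>;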

But removing the allocation doesn't fix anything. Coroutine frames and continuations must still be allocated, and there is no reasonable way to deal with an allocation failure.

Why can't we throw a normal bad_alloc exception when we can't allocate a coroutine frame?

@xemul
Contributor Author

xemul commented Dec 3, 2025

Context switch perf test results

  test iters runtime   inst cycles
ARRAY sched.context_switch 7979000 122.73ns ± 0.37% 723.11 488.9
  sched.context_switch_x2 3686000 269.11ns ± 0.12% 1481.36 1081.7
  sched.context_switch_x1_5 5029000 197.39ns ± 1.11% 1102.05 782.5
LIST sched.context_switch 7961000 122.98ns ± 0.05% 717.11 489.4
  sched.context_switch_x2 3714000 266.72ns ± 0.05% 1475.36 1072.7
  sched.context_switch_x1_5 5140000 191.57ns ± 0.18% 1096.03 765.3

@mykaul
Contributor

mykaul commented Dec 3, 2025

Next, even if circular buffer manages to grow to its de-facto maximum size, it can be huge. We've seen that task queues grow up to tens of thousands of entries and it occupies contiguous memory of relevant size.

Array, while smaller in overall size than a list I assume (no need for pointers), doesn't solve the above issue, no?

@xemul
Contributor Author

xemul commented Dec 3, 2025

Currently we have an array of pointers to tasks. Both the tasks and the array itself need to be allocated. After this PR only the tasks are allocated; the array is gone altogether. And there's no change in sizeof(task) here, since the "next" pointer is unioned with a pre-existing member. So the memory footprint with this PR is reduced.

@xemul
Contributor Author

xemul commented Dec 3, 2025

Interesting. Run-queue processing times (#3134)

  test iters   runtime inst cycles
ARRAY sched.runqueue_x 1738090 291.65ns ± 0.31% 136.04 146.7
  sched.runqueue_m 95997000 5.43ns ± 0.09% 26.13 11.6
  sched.runqueue_M 133000000 2.57ns ± 0.04% 25.01 10.3
LIST sched.runqueue_x 1723850 286.79ns ± 0.13% 135.25 132.7
  sched.runqueue_m 93316000 6.06ns ± 0.17% 25.13 14.2
  sched.runqueue_M 134000000 3.23ns ± 0.62% 24.02 13.1

LIST takes fewer instructions, but more cycles (and more time)

@xemul
Contributor Author

xemul commented Dec 3, 2025

☝️ _x is 10 tasks, _m is 1'000 tasks and _M is 1'000'000 tasks

@nyh
Contributor

nyh commented Dec 3, 2025

Currently we have array of pointers to tasks. Both, tasks and array itself, need to be allocated. After this PR only tasks are allocated, array is gone on its own. And no change in sizeof(task) here, the "next" pointer is unioned with pre-existing member. So memory footprint with this PR is reduced.

This is a good point. It's not a separate list like std::forward_list or our chunked_fifo - it's an intrusive list, stored inside the tasks.
I'm not sure what to say about the union of the pointer - on one hand it's really cool you found a way to do this. On the other hand it's kind of worrying ;-)

@xemul
Contributor Author

xemul commented Dec 4, 2025

I'm not sure what to say about the union of the pointer - on one hand it's really cool you found a way to do this. On the other hand it's kind of worrying ;-)

The unsafe case is when the union contains the next pointer instead of the sched group id. In that state there's no way to get the sched group, and std::move-ing the task is fatal. The former is, well, yes, no excuse. But the latter is as safe as it is today -- if one std::moves a task while the circular-buffer run-queue still holds a pointer to it, it's just as fatal.
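[Editor's note] To make the encoding concrete, here is a rough sketch of the bit-0 "disguise" trick discussed in this thread (field and helper names are illustrative, not the PR's exact ones). Task objects are aligned to more than one byte, so bit 0 of a valid task pointer is always clear and can discriminate a disguised scheduling-group id from a next pointer.

#include <cstdint>

// Illustrative sketch of the sched-group-id / next-pointer union.
// A disguised group id always has bit 0 set; a valid task* never does.
class task_sketch {
    uintptr_t _sg_or_next;   // discriminated by bit 0
public:
    static uintptr_t disguise(unsigned sg_id) noexcept {
        return (uintptr_t(sg_id) << 1) | 1u;   // shift left, set bit 0
    }
    bool queued() const noexcept { return (_sg_or_next & 1u) == 0; }
    unsigned sched_group_id() const noexcept {
        // only meaningful while not queued
        return unsigned(_sg_or_next >> 1);
    }
    task_sketch* next() const noexcept {
        // only meaningful while queued
        return reinterpret_cast<task_sketch*>(_sg_or_next);
    }
    void set_next(task_sketch* n) noexcept {
        _sg_or_next = reinterpret_cast<uintptr_t>(n);
    }
};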

@dotnwat
Contributor

dotnwat commented Dec 4, 2025

@dotnwat , how many bytes was it? And once it happened, was there a "too long queue accumulated" message in the logs, or were the tasks short enough to drain within a single task quota?

I'm not sure about how many bytes, I just noticed this in the logs:

WARN  2025-10-14 03:07:52,962 [shard 1:main] seastar - Too long queue accumulated for main (1207 tasks)

Is there something else to look for to answer the bytes question?

@dotnwat
Contributor

dotnwat commented Dec 4, 2025

@dotnwat , how many bytes was it? And once it happened, was there a "too long queue accumulated" message in the logs, or were the tasks short enough to drain within a single task quota?

I just noticed this in the logs:

WARN  2025-10-14 03:07:52,962 [shard 1:main] seastar - Too long queue accumulated for main (1207 tasks)

The allocation we saw was around 220K. I don't think that this particular instance of the log message was associated with a large allocation--we've been hitting this and the large allocation recently and not always at the same time--so 1207 tasks might not be consistent with an allocation that large.

@xemul
Contributor Author

xemul commented Dec 4, 2025

Ah, 1.2k tasks. OK, it's ~4k bytes for the buffer with task* pointers. And the allocation of runqueue for 220k should be ~60k tasks 🙀 , but if they are short, they can fit into task-quota

@xemul
Contributor Author

xemul commented Dec 4, 2025

With the tasks in #3134 shuffled (shuffled once at allocation time; "wakeups" produce the same sequence every time):

  test iters   runtime inst cycles
ARRAY sched.runqueue_x 1728050 291.75ns ± 0.08% 136.04 148.8
  sched.runqueue_m 94524000 5.46ns ± 0.08% 26.13 11.6
  sched.runqueue_M 14000000 13.65ns ± 0.08% 25.07 55.5
LIST  sched.runqueue_x 1749190 288.53ns ± 0.14% 134.78 136.1
  sched.runqueue_m 89083000 5.99ns ± 0.23% 25.13 13.8
  sched.runqueue_M 10000000 52.29ns ± 0.35% 24.24 213.4

@xemul
Contributor Author

xemul commented Dec 4, 2025

For future reference.

With task_queue::run_tasks() stripped down to the bare minimum:

// ARRAY (current code)
bool reactor::task_queue::run_tasks() {
    while (!_q.empty()) {
        auto tsk = _q.front();
        _q.pop_front();
        tsk->run_and_dispose();
    }

    return !_q.empty();
}

// LIST (this PR)
bool reactor::task_queue::run_tasks() {
    auto current_sg = scheduling_group(_id);
    while (!_q.empty()) {
        auto tsk = _q.pop_front(current_sg);
        tsk->run_and_dispose();
    }

    return !_q.empty();
}
  test iters   runtime inst cycles
ARRAY sched.runqueue_x 1717290 297.01ns ± 0.97% 118.39 141.8
  sched.runqueue_m 96838000 5.03ns ± 0.01% 14.07 9.7
  sched.runqueue_M 14000000 8.94ns ± 0.14% 13.01 36.1
LIST  sched.runqueue_x 1673950 291.55ns ± 0.69% 118.96 130
  sched.runqueue_m 77171000 5.92ns ± 0.17% 14.09 13.4
  sched.runqueue_M 10000000 51.46ns ± 0.73% 13.01 210.3

Disassembled release-mode binary

ARRAY

0000000000000090 <_ZN7seastar7reactor10task_queue9run_tasksEv>:
      90:	48 8b 47 70          	mov    0x70(%rdi),%rax
      94:	48 3b 47 78          	cmp    0x78(%rdi),%rax
      98:	74 3e                	je     d8 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x48>
      9a:	53                   	push   %rbx
      9b:	48 89 fb             	mov    %rdi,%rbx
      9e:	66 90                	xchg   %ax,%ax

      a0:	48 8b b3 80 00 00 00 	mov    0x80(%rbx),%rsi
      a7:	48 8b 4b 68          	mov    0x68(%rbx),%rcx
      ab:	48 8d 56 ff          	lea    -0x1(%rsi),%rdx
      af:	48 21 c2             	and    %rax,%rdx
      b2:	48 83 c0 01          	add    $0x1,%rax
      b6:	48 8b 3c d1          	mov    (%rcx,%rdx,8),%rdi
      ba:	48 89 43 70          	mov    %rax,0x70(%rbx)
      be:	48 8b 07             	mov    (%rdi),%rax
      c1:	ff 10                	call   *(%rax)
      c3:	48 8b 43 70          	mov    0x70(%rbx),%rax
      c7:	48 3b 43 78          	cmp    0x78(%rbx),%rax
      cb:	75 d3                	jne    a0 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x10>

      cd:	31 c0                	xor    %eax,%eax
      cf:	5b                   	pop    %rbx
      d0:	c3                   	ret
      d1:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
      d8:	31 c0                	xor    %eax,%eax
      da:	c3                   	ret
      db:	90                   	nop
      dc:	0f 1f 40 00          	nopl   0x0(%rax)

LIST

0000000000000090 <_ZN7seastar7reactor10task_queue9run_tasksEv>:
      90:	41 54                	push   %r12
      92:	55                   	push   %rbp
      93:	53                   	push   %rbx
      94:	48 89 fb             	mov    %rdi,%rbx
      97:	48 8b 7f 68          	mov    0x68(%rdi),%rdi
      9b:	48 85 ff             	test   %rdi,%rdi
      9e:	74 40                	je     e0 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x50>
      a0:	0f b6 6b 58          	movzbl 0x58(%rbx),%ebp
      a4:	4c 8d 63 68          	lea    0x68(%rbx),%r12
      a8:	8d 6c 2d 01          	lea    0x1(%rbp,%rbp,1),%ebp
      ac:	eb 18                	jmp    c6 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x36>
      ae:	66 90                	xchg   %ax,%ax

      b0:	48 8b 07             	mov    (%rdi),%rax
      b3:	89 6f 08             	mov    %ebp,0x8(%rdi)
      b6:	48 83 6b 78 01       	subq   $0x1,0x78(%rbx)
      bb:	ff 10                	call   *(%rax)
      bd:	48 8b 7b 68          	mov    0x68(%rbx),%rdi
      c1:	48 85 ff             	test   %rdi,%rdi
      c4:	74 1a                	je     e0 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x50>
      c6:	48 8b 47 08          	mov    0x8(%rdi),%rax
      ca:	48 89 43 68          	mov    %rax,0x68(%rbx)
      ce:	48 8d 47 08          	lea    0x8(%rdi),%rax
      d2:	48 39 43 70          	cmp    %rax,0x70(%rbx)
      d6:	75 d8                	jne    b0 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x20>
      d8:	4c 89 63 70          	mov    %r12,0x70(%rbx)
      dc:	eb d2                	jmp    b0 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x20>

      de:	66 90                	xchg   %ax,%ax
      e0:	5b                   	pop    %rbx
      e1:	31 c0                	xor    %eax,%eax
      e3:	5d                   	pop    %rbp
      e4:	41 5c                	pop    %r12
      e6:	c3                   	ret
      e7:	90                   	nop
      e8:	0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
      ef:	00 

@xemul xemul marked this pull request as draft December 4, 2025 10:55
@xemul
Contributor Author

xemul commented Dec 4, 2025

Optimized out the _last_p checks in task_slist::pop_front():

task* task_slist::snap() noexcept {
    auto ret = _first;
    _first = nullptr;
    _last_p = &_first; 
    _size = 0;
    return ret;
}

bool reactor::task_queue::run_tasks() {
    // Make sure new tasks will inherit our scheduling group
    auto current_sg = scheduling_group(_id);

    task* n = _q.snap();
    while (n != nullptr) {
        auto tsk = n;
        n = tsk->_next;
        tsk->_scheduling_group_id = task::disguise_sched_group(current_sg);
        tsk->run_and_dispose();
    }

    return !_q.empty();
}
  test iters   runtime inst cycles
LIST sched.runqueue_x 1673950 291.55ns ± 0.69% 118.96 130
  sched.runqueue_m 77171000 5.92ns ± 0.17% 14.09 13.4
  sched.runqueue_M 10000000 51.46ns ± 0.73% 13.01 210.3
LIST-o sched.runqueue_x 1732710 292.69ns ± 0.16% 112.03 138.8
  sched.runqueue_m 109276000 4.25ns ± 0.09% 9.05 6.6
  sched.runqueue_M 10000000 49.72ns ± 0.90% 8.01 202.6

decoded run_tasks

0000000000000090 <_ZN7seastar7reactor10task_queue9run_tasksEv>:
      90:	41 54                	push   %r12
      92:	48 8d 47 68          	lea    0x68(%rdi),%rax
      96:	55                   	push   %rbp
      97:	53                   	push   %rbx
      98:	48 8b 5f 68          	mov    0x68(%rdi),%rbx
      9c:	48 89 47 70          	mov    %rax,0x70(%rdi)
      a0:	0f b6 6f 58          	movzbl 0x58(%rdi),%ebp
      a4:	48 c7 47 68 00 00 00 	movq   $0x0,0x68(%rdi)
      ab:	00 
      ac:	48 c7 47 78 00 00 00 	movq   $0x0,0x78(%rdi)
      b3:	00 
      b4:	48 85 db             	test   %rbx,%rbx
      b7:	74 2f                	je     e8 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x58>
      b9:	49 89 fc             	mov    %rdi,%r12
      bc:	8d 6c 2d 01          	lea    0x1(%rbp,%rbp,1),%ebp

      c0:	48 89 df             	mov    %rbx,%rdi
      c3:	48 8b 5b 08          	mov    0x8(%rbx),%rbx
      c7:	48 8b 07             	mov    (%rdi),%rax
      ca:	89 6f 08             	mov    %ebp,0x8(%rdi)
      cd:	ff 10                	call   *(%rax)
      cf:	48 85 db             	test   %rbx,%rbx
      d2:	75 ec                	jne    c0 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x30>

      d4:	49 83 7c 24 68 00    	cmpq   $0x0,0x68(%r12)
      da:	5b                   	pop    %rbx
      db:	0f 95 c0             	setne  %al
      de:	5d                   	pop    %rbp
      df:	41 5c                	pop    %r12
      e1:	c3                   	ret
      e2:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
      e8:	5b                   	pop    %rbx
      e9:	31 c0                	xor    %eax,%eax
      eb:	5d                   	pop    %rbp
      ec:	41 5c                	pop    %r12
      ee:	c3                   	ret
      ef:	90                   	nop

@xemul
Contributor Author

xemul commented Dec 4, 2025

To emphasize the ARRAY vs LIST-opt cases:

Main loop in assembly

ARRAY

      a0:	48 8b b3 80 00 00 00 	mov    0x80(%rbx),%rsi
      a7:	48 8b 4b 68          	mov    0x68(%rbx),%rcx
      ab:	48 8d 56 ff          	lea    -0x1(%rsi),%rdx
      af:	48 21 c2             	and    %rax,%rdx
      b2:	48 83 c0 01          	add    $0x1,%rax
      b6:	48 8b 3c d1          	mov    (%rcx,%rdx,8),%rdi
      ba:	48 89 43 70          	mov    %rax,0x70(%rbx)
      be:	48 8b 07             	mov    (%rdi),%rax
      c1:	ff 10                	call   *(%rax)
      c3:	48 8b 43 70          	mov    0x70(%rbx),%rax
      c7:	48 3b 43 78          	cmp    0x78(%rbx),%rax
      cb:	75 d3                	jne    a0 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x10>

LIST-opt

      c0:	48 89 df             	mov    %rbx,%rdi
      c3:	48 8b 5b 08          	mov    0x8(%rbx),%rbx
      c7:	48 8b 07             	mov    (%rdi),%rax
      ca:	89 6f 08             	mov    %ebp,0x8(%rdi)
      cd:	ff 10                	call   *(%rax)
      cf:	48 85 db             	test   %rbx,%rbx
      d2:	75 ec                	jne    c0 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x30>

Perf test

  test iters   runtime inst cycles
ARRAY sched.runqueue_x 1717290 297.01ns ± 0.97% 118.39 141.8
  sched.runqueue_m 96838000 5.03ns ± 0.01% 14.07 9.7
  sched.runqueue_M 14000000 8.94ns ± 0.14% 13.01 36.1
LIST-o sched.runqueue_x 1732710 292.69ns ± 0.16% 🟢 112.03 🟢 138.8
  sched.runqueue_m 109276000 4.25ns ± 0.09% 🟢 9.05 🟢 6.6
  sched.runqueue_M 10000000 49.72ns ± 0.90% 🟢 8.01 🔴 202.6

_x -- 10 tasks, _m -- 1'000 tasks, _M -- 1'000'000 tasks

@xemul
Contributor Author

xemul commented Dec 4, 2025

Updated the ARRAY case to also modify task->_sg (as if it were the LIST case updating the task's sg/next union):

bool reactor::task_queue::run_tasks() {
    auto current = scheduling_group(_id);
    while (!_q.empty()) {
        auto tsk = _q.front();
        _q.pop_front();
        tsk->_sg = current;
        tsk->run_and_dispose();
    }

    return !_q.empty();
}
test iters   runtime inst cycles
sched.runqueue_x 1710960 297.74ns ± 0.08% 120.61 149.6
sched.runqueue_m 82919000 6.51ns ± 0.12% 15.09 15.6
sched.runqueue_M 16000000 10.86ns ± 0.83% 14.01 43.6

@xemul
Contributor Author

xemul commented Dec 4, 2025

In the LIST case, de-unioned next and the sched-group-id on task, to drop the sched group update in the loop (tsk->_scheduling_group_id = task::disguise_sched_group(current_sg);):

bool reactor::task_queue::run_tasks() {

    task* n = _q.snap();
    while (n != nullptr) {
        auto tsk = n;
        n = tsk->_next;
        tsk->run_and_dispose();
    }

    return !_q.empty();
}
      c0:   48 89 df                mov    %rbx,%rdi
      c3:   48 8b 5b 10             mov    0x10(%rbx),%rbx
      c7:   48 8b 07                mov    (%rdi),%rax
      ca:   ff 10                   call   *(%rax)
      cc:   48 85 db                test   %rbx,%rbx
      cf:   75 ef                   jne    c0 <_ZN7seastar7reactor10task_queue9run_tasksEv+0x30>
test iters   runtime inst cycles
sched.runqueue_x 1730610 297.71ns ± 0.34% 110.65 139.1
sched.runqueue_m 94810000 4.91ns ± 1.41% 8.05 9.1
sched.runqueue_M 7000000 71.07ns ± 0.67% 7.01 285.6

@xemul
Contributor Author

xemul commented Dec 4, 2025

https://gist.github.com/xemul/d5f83adac66d34b70fa4fef49a861d96

☝️ a perf test that allocates 10k dispersed integers and sums those values by dereferencing each via array of pointers or intrusive list:

test                    iters            runtime     allocs      tasks       inst     cycles
sum_perf.sum_array  651810000     1.24ns ± 1.16%      0.000      0.000       7.04        3.9
sum_perf.sum_list   135080000     6.65ns ± 0.20%      0.000      0.000       5.04       26.0
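[Editor's note] For reference, a self-contained sketch of the two access patterns being compared (not the gist's code; names and types are arbitrary): both variants touch the same dispersed objects, but the list variant must read each node before it knows where the next one lives.

#include <vector>

// Sketch of the array-of-pointers vs intrusive-list access patterns.
struct item {
    long value;
    item* next = nullptr;   // intrusive link, unused by the array variant
};

long sum_via_pointer_array(const std::vector<item*>& v) {
    long s = 0;
    for (item* p : v) {      // the next pointer comes from the packed array
        s += p->value;
    }
    return s;
}

long sum_via_intrusive_list(item* head) {
    long s = 0;
    for (item* p = head; p != nullptr; p = p->next) {  // the next pointer comes from the node itself
        s += p->value;
    }
    return s;
}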

@xemul
Contributor Author

xemul commented Dec 6, 2025

Split lists help a lot

elements   Array   Array of pointers   Intrusive list   Split list [2]   Split list [16]
10            65                  63               63               64                63
100            9                   8               11                9                 9
1000           3                   3               13                7                 3
10000          3                   3               26               13                 5
100000         3                   5               35               18                 5
1000000       10                  22              327              165                26

@avikivity
Member

Is "split list" another name for chunked_fifo?

This looks like another opportunity to refer to https://github.com/avikivity/seastar/commits/sort-tasks/.

@xemul
Contributor Author

xemul commented Dec 8, 2025

No, it's not chunked fifo, it's literally orthogonal.

Here's how elements are stored in std::array<T>: [ 0 1 2 3 4 5 ... ]. And the same is true for std::vector<T> and circular_buffer<T>

This is intrusive_list<T>: [0] -> [1] -> [2] -> [3] -> [4] -> [5] -> ...

And this is chunked_fifo<T> (~= intrusive_list<array<T>>) : [ 0 1 2 ] -> [ 3 4 5 ] -> ...

Currently task queue is circular_buffer<task*>:

 [0] [2] [4]
  |   |   |
[ . . . . . .   ... ]
    |   |   |
   [1] [3] [5]

and using chunked fifo would make it chunked_fifo<task*>:

 [0] [2]        [4]
  |   |          |
[ . . . ] -> [ . . . ] -> ...
    |          |   |
   [1]        [3] [5]

Using chunked fifo will still require potentially growing the pointers storage on wakeup, but, of course, in smaller chunks

The split-list is ~= std::array<intrusive_list<task>>, like this:

 [0] -> [2] -> [4] -> ...
  |
[ . . ]
    |
   [1] -> [3] -> [5] -> ...

It uses a fixed-size array and doesn't need to allocate memory on wake-up. And also note the interleaving -- when scanning elements in the displayed order, it scans the two tracks "in parallel", as if they were pointers in a plain array (this requires consuming the "next" pointer, so that the Nth wave of the scan still references pointers from the root-level array, but the task run-queue scanning does exactly that).
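[Editor's note] A minimal sketch of the round-robin split-list idea (illustrative only, not the code in #3138; F and all names are arbitrary):

// F independent intrusive lists ("tracks"), appended to in round-robin
// order, so consecutive elements land on different tracks and the CPU
// can chase several next pointers independently.
struct snode {
    snode* next = nullptr;
    long value = 0;
};

template <unsigned F>
class split_list_sketch {
    snode* _first[F] = {};
    snode** _last_p[F];
    unsigned _push_idx = 0;    // next track to append to
public:
    split_list_sketch() {
        for (unsigned i = 0; i < F; i++) {
            _last_p[i] = &_first[i];
        }
    }

    void push_back(snode* n) noexcept {
        n->next = nullptr;
        *_last_p[_push_idx] = n;
        _last_p[_push_idx] = &n->next;
        _push_idx = (_push_idx + 1) % F;   // interleave across tracks
    }

    // Scan in insertion order: element i lives in track i % F.
    long sum() const noexcept {
        long s = 0;
        const snode* cur[F];
        for (unsigned i = 0; i < F; i++) {
            cur[i] = _first[i];
        }
        for (unsigned long i = 0; cur[i % F] != nullptr; i++) {
            s += cur[i % F]->value;
            cur[i % F] = cur[i % F]->next;
        }
        return s;
    }
};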

@avikivity
Member

No, it's not chunked fifo, it's literally orthogonal.

Here's how elements are stored in std::array<T>: [ 0 1 2 3 4 5 ... ]. And the same is true for std::vector<T> and circular_buffer<T>

This is intrusive_list<T>: [0] -> [1] -> [2] -> [3] -> [4] -> [5] -> ...

And this is chunked_fifo<T> (~= intrusive_list<array<T>>) : [ 0 1 2 ] -> [ 3 4 5 ] -> ...

Currently task queue is circular_buffer<task*>:

 [0] [2] [4]
  |   |   |
[ . . . . . .   ... ]
    |   |   |
   [1] [3] [5]

and using chunked fifo would make it chunked_fifo<task*>:

 [0] [2]        [4]
  |   |          |
[ . . . ] -> [ . . . ] -> ...
    |          |   |
   [1]        [3] [5]

Using chunked fifo will still require potentially growing the pointers storage on wakeup, but, of course, in smaller chunks

But what's the problem? Isn't the goal to eliminate large allocations? chunked_fifo does this.

We can't completely eliminate allocations (the tasks themselves must be allocated), so we don't win from removing allocations if they are well amortized.

The split-list is ~= std::array<intrusive_list<task>>, like this:

 [0] -> [2] -> [4] -> ...
  |
[ . . ]
    |
   [1] -> [3] -> [5] -> ...

It uses a fixed-size array and doesn't need to allocate memory on wake-up. And also note the interleaving -- when scanning elements in the displayed order, it scans the two tracks "in parallel", as if they were pointers in a plain array (this requires consuming the "next" pointer, so that the Nth wave of the scan still references pointers from the root-level array, but the task run-queue scanning does exactly that).

Seems complicated.

@xemul
Contributor Author

xemul commented Dec 8, 2025

But what's the problem? Isn't the goal to eliminate large allocations? chunked_fifo does this.

The goal was to check if it's possible to eliminate wakeup allocations at all

We can't completely eliminate allocations (the tasks themselves must be allocated), so we don't win from removing allocations if they are well amortized.

There are two paths (sometimes they immediately follow one another, but not always) -- allocating a task and waking the task up. Currently we allocate on both, but allocating on the latter is not necessary. AFAIU, if there's a chain of yet-unresolved futures at hand, the task is allocated as part of a continuation_base. When someone later calls promise::set_value(), the task is woken up. Not having (even amortized) allocations in set_value() is the goal.

@avikivity
Member

But what's the problem? Isn't the goal to eliminate large allocations? chunked_fifo does this.

The goal was to check if it's possible to eliminate wakeup allocations at all

We can't completely eliminate allocations (the tasks themselves must be allocated), so we don't win from removing allocations if they are well amortized.

There are two paths (sometimes they immediately follow one another, but not always) -- allocating a task and waking the task up. Currently we allocate on both, but allocating on the latter is not necessary. AFAIU, if there's a chain of yet unresolved futures at hand, the task is allocated as a part of continuation_base. When later one does promise::set_value() the task is woken up. Not having (even amortized) allocations in set_value() is the goal.

Why do we allocate on wakeup? Usually the queue is not full.

@xemul
Contributor Author

xemul commented Dec 8, 2025

Seems complicated

Well, a little bit, but at the end of the day it's pretty compact code: #3138

Why do we allocate on wakeup? Usually the queue is not full.

Because circular_buffer<>::push_back() is a potentially allocating operation. Nonetheless, I do agree that "usually" the queue has already grown to a large-enough size and then stops growing until the next wakeup storm (if it ever happens).
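[Editor's note] To illustrate why a ring-buffer push can allocate, here is a generic sketch of the grow-when-full step. This is a sketch of the general technique, not Seastar's circular_buffer code.

#include <algorithm>
#include <cstddef>
#include <vector>

// Generic sketch: a ring that doubles its storage when full. The growth
// step is the potentially throwing allocation that makes an unconditional
// noexcept push_back() impossible.
template <typename T>
void ring_push_back(std::vector<T>& storage, std::size_t& begin, std::size_t& count, const T& v) {
    if (count == storage.size()) {
        // Grow the backing storage: may throw std::bad_alloc.
        std::vector<T> bigger(std::max<std::size_t>(storage.size() * 2, 8));
        for (std::size_t i = 0; i < count; i++) {
            bigger[i] = storage[(begin + i) % storage.size()];
        }
        storage.swap(bigger);
        begin = 0;
    }
    storage[(begin + count) % storage.size()] = v;
    count++;
}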

Add a test that populates run-queue with 10, 1k and 1M no-op tasks and
measures the time it takes for the scheduler to process one.

No allocations/frees happen at measure time; tasks are pre-populated.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two coroutine_traits_base::promise_task templates; both inherit from task, but one sets task::_sg by hand while the other uses task's protected implementation. Make both use the latter.

This allows moving task::_sg into the private section, protecting it reliably from future changes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Rename the task::_sg member to a uintptr and keep the scheduling group index in there. The next patch will union a task* pointer with it, so to tell which value is stored, disguise the sched group ID by shifting it left one bit and setting bit zero. Also add a static assertion that when a task pointer is put there, this bit will not be set.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The split list is a collection of objects optimized for scanning from beginning to end, with the ability to add entries at both ends -- front and back, like std::deque does.

Splitting is needed to let the CPU prefetch the next element while scanning the current one. With classical linked lists this is difficult, because in order to prefetch the next element the CPU needs to read the next pointer from the current element, which is still being read. Thus the forward scan of a plain list is serialized. With split lists several adjacent elements can be prefetched in parallel.

To achieve that, the "first" pointer is "sharded" -- there are F first pointers (as well as F cached last pointers) and the list is populated in a round-robin manner -- first the 0th list is appended to, then the 1st, then the 2nd, ... then the (F-1)th, and population wraps around and starts over.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The element (class task) shares the "next" pointer with sched group index,
thus living in two states -- queued and inactive. When created it gets into
the latter state and carries sched group index on-board. When add_task is
called, the task becomes queued and the sched group index is thus lost. Later,
when the task is picked for execution, the task_queue restores this index
into "current", naturally. Probably the restoration is not needed, but it's
better to be on the safe side.

task::set_scheduling_group() is no longer callable for queued tasks, but the method is protected and is only used in two places. The first is the coroutine switch-to, which happens on a running (i.e. not queued) task. The second is shared_future tweaking an unresolved shared_state (i.e. an inactive task too).

Debug-mode task queue shuffling in constant time is no longer possible and is temporarily removed; the next patch will restore it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently shuffling is performed by swapping a newly-activated task with some random queued one. The previous patch turned the task queue into a singly-linked list, and picking a random task from it in constant time is impossible.

The new shuffling works at run time -- when a queue picks up the next front task to execute, it rolls the dice and optionally re-queues it at the back of the queue.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
@xemul xemul force-pushed the br-task-queue-make-slist branch from cf45e72 to a5455c6 on December 10, 2025 14:40
@xemul
Contributor Author

xemul commented Dec 10, 2025

upd:

test iters runtime allocs tasks inst cycles
master, 10 tasks 3228650 159.54ns ± 0.15% 0.000 1.200 135.50 143.1
master, 1k tasks 123110000 3.97ns ± 1.45% 0.000 1.002 26.13 10.8
master, 1M tasks 14000000 13.81ns ± 1.29% 0.000 1.000 25.07 55.2
this PR, 10 tasks 3263140 157.05ns ± 0.09% 0.000 1.200 147.27 140.7
this PR, 1k tasks 94840000 5.45ns ± 0.88% 0.000 1.002 33.20 17.0
this PR, 1M tasks 14000000 16.03ns ± 0.64% 0.000 1.000 32.08 64.6

@xemul
Contributor Author

xemul commented Dec 10, 2025

Overhead is ~57% on 1k, ~17% on 1M tasks, ~8.5% on 10 tasks

@xemul
Contributor Author

xemul commented Dec 10, 2025

test iters runtime allocs tasks inst cycles
this PR, 10 tasks 3191910 158.38ns ± 0.12% 0.000 1.200 147.28 141.4
this PR, 1k tasks 94409000 5.48ns ± 0.24% 0.000 1.002 33.25 17.0
this PR, 1M tasks 14000000 16.07ns ± 0.62% 0.000 1.000 32.08 64.7

☝️ with intrusive_split_list<32>

@xemul
Contributor Author

xemul commented Dec 10, 2025

test iters runtime allocs tasks inst cycles
sched.runqueue_x 3134860 157.53ns ± 0.48% 0.000 1.200 146.40 131.9
sched.runqueue_m 96570000 4.82ns ± 2.81% 0.000 1.002 33.17 13.8
sched.runqueue_M 13000000 18.30ns ± 2.46% 0.000 1.000 32.09 71.2

☝️ intrusive_split_list<8>

@xemul
Contributor Author

xemul commented Dec 10, 2025

The above is on a Threadripper processor; will measure on i7i's Xeon.

@xemul
Contributor Author

xemul commented Dec 10, 2025

i7i.2xlarge, Intel Xeon Platinum 8559C, clang-20.1.2, intrusive_split_list<16>

test iters runtime allocs tasks inst cycles
master, 10 tasks 1076840 418.24ns ± 0.02% 0.000 1.200 137.80 157.1
master, 1k tasks 75366000 6.39ns ± 0.03% 0.000 1.002 26.15 8.3
master, 1M tasks 119000000 3.49ns ± 0.02% 0.000 1.000 25.02 11.1
this PR, 10 tasks 1080530 415.89ns ± 0.02% 0.000 1.200 145.26 153.9
this PR, 1k tasks 74981000 6.71ns ± 0.05% 0.000 1.002 31.18 9.5
this PR, 1M tasks 117000000 3.85ns ± 0.01% 0.000 1.000 30.02 12.2
