
Conversation

@xemul
Contributor

@xemul xemul commented Dec 8, 2025

The purpose of this PR is to demonstrate how different "queues" (see below) perform in a full-scan scenario.

Here a "queue" is a collection of objects that

  • has a std::deque-like ability to push_back and push_front elements
  • can be scanned from start to end
  • stores objects that can be manipulated independently, in particular objects of different types inheriting from some base class

Such a "queue" first appeared as the CPU scheduler run-queue. There, the stored element type is task* and the actual elements are classes that inherit from class task. The run-queue is implemented as a circular_buffer<task*> container.

Scan performance comparison of different collections, each holding 10k elements:

On AMD Ryzen Threadripper 2970WX processor

GCC-15.2.1

test                   iters       runtime          allocs  tasks  inst   cycles
vector<T>              815590000   0.96ns ± 0.56%    0.000  0.000   7.04     2.9
vector<T*>             655960000   1.19ns ± 0.09%    0.000  0.000   7.04     3.8
chunked_fifo<T*, 128>  573820000   1.40ns ± 0.26%    0.000  0.000   8.08     4.6
split_list<T, 16>      504530000   1.55ns ± 1.45%    0.000  0.000  10.04     5.2
intrusive_list<T>      147000000   6.48ns ± 0.07%    0.000  0.000   5.04    25.8

Clang-20.1.8

test                   iters       runtime          allocs  tasks  inst   cycles
vector<T>              1e+09       0.83ns ± 6.96%    0.000  0.000   7.00     3.2
vector<T*>             1e+09       0.93ns ± 0.25%    0.000  0.000   7.00     3.7
split_list<T, 16>      808970000   1.24ns ± 0.54%    0.000  0.000   8.50     4.9
chunked_fifo<T*, 128>  576560000   1.73ns ± 0.26%    0.000  0.000  13.04     6.9
intrusive_list<T>      153940000   6.20ns ± 0.75%    0.000  0.000   5.00    24.7

On Intel i7-10750H processor
GCC-13.3.0

test                   iters       runtime          allocs  tasks  inst   cycles
vector<T>              920610000   0.95ns ± 0.19%    0.000  0.000   7.00     4.2
vector<T*>             832120000   1.18ns ± 0.33%    0.000  0.000   7.00     5.3
split_list<T, 16>      845490000   1.26ns ± 1.74%    0.000  0.000  10.00     5.4
chunked_fifo<T*, 128>  784010000   1.26ns ± 0.37%    0.000  0.000   8.04     5.8
intrusive_list<T>      124030000   8.16ns ± 1.11%    0.000  0.000   5.00    38.1

On Intel(R) Xeon(R) Platinum 8559C processor (i7i.2xlarge instance)
Clang-20.1.2

test                   iters       runtime          allocs  tasks  inst   cycles
vector<T>              2e+09       0.43ns ± 0.01%    0.000  0.000   7.00     1.4
vector<T*>             2e+09       0.53ns ± 0.06%    0.000  0.000   7.00     1.8
split_list<T, 16>      2e+09       0.65ns ± 0.19%    0.000  0.000   8.50     2.1
chunked_fifo<T*, 128>  1e+09       0.72ns ± 0.37%    0.000  0.000  13.04     2.4
intrusive_list<T>      188810000   5.33ns ± 0.10%    0.000  0.000   5.00    17.5

@xemul
Contributor Author

xemul commented Dec 8, 2025

TODO list for this PR

  • documentation
  • iterator instead of split_list::snap() method
  • rename to intrusive_split_list
  • add chunked_fifo

@xemul xemul force-pushed the br-container-scan-perf-test branch from a8fcbd6 to 8acacc4 Compare December 8, 2025 13:03
@xemul
Contributor Author

xemul commented Dec 8, 2025

upd:

  • reimplemented iteration with the "standard" iterators approach. This saved one instruction in the assembly and improved the cycle count a bit (5.8 -> 5.2)

@xemul
Contributor Author

xemul commented Dec 8, 2025

split-list main loop decoded

  43b540:   89 c2                   mov    %eax,%edx                 # edx = idx
  43b542:   83 c0 01                add    $0x1,%eax                 # ++idx
  43b545:   83 e2 0f                and    $0xf,%edx                 # edx = idx & 0xf (sub-list selector, F = 16)
  43b548:   48 8b 4c d4 88          mov    -0x78(%rsp,%rdx,8),%rcx   # rcx = cursor[edx] (on-stack cursor array)
  43b54d:   8b 79 10                mov    0x10(%rcx),%edi           # load the element's value
  43b550:   48 8b 49 08             mov    0x8(%rcx),%rcx            # rcx = element->next
  43b554:   48 01 fe                add    %rdi,%rsi                 # accumulate the sum
  43b557:   48 89 4c d4 88          mov    %rcx,-0x78(%rsp,%rdx,8)   # cursor[edx] = next
  43b55c:   44 39 c0                cmp    %r8d,%eax                 # loop until idx reaches the element count
  43b55f:   75 df                   jne    43b540 <_ZNK8sum_perf8sum_sl16Ev+0x80>

bi::cache_last<true>,
bi::member_hook<element, bi::slist_member_hook<>, &element::l_next>>;
static_assert(sizeof(slist) <= 2 * sizeof(void*));
slist _singly_linked_list;
Contributor

Curious if the intrusive list was what you reached for first, but you found the performance overhead to be too much? We had encountered this with a doubly-linked intrusive list (even with all the safe/auto stuff turned off), hand-coded a simple version of effectively the same thing, and saw something like a 10x performance difference. Not sure if slist has the same issues.

Contributor Author

Yes, I first played with an intrusive list in #3125 and saw that it's several times worse than circular_buffer<T*>. And then came this split-list thing.

}

T* pop_front() noexcept {
auto i = _head_idx++ % F;
Contributor

I wonder if there's any performance gain to be had by limiting F to be a power of two, and using & instead of %. chunked_fifo does this, and has

    static_assert((items_per_chunk & (items_per_chunk - 1)) == 0,
            "chunked_fifo chunk size must be power of two");

Alternatively, maybe the optimizer will already convert "%" to "&" if it's a power of two? Did you try your benchmark with different choices of F, power of two or not?

Contributor Author

Yes, I did. When it's a power of two, the compiler (in a release build) uses & : #3138 (comment)

Contributor

If this has a noticeable impact on performance, I think a comment recommending a power-of-two F is in order. If the impact is very noticeable, perhaps it makes sense to just force it with a static_assert?

Contributor Author

Yes, using power-of-two F is very welcome

clang-20

F    instructions  cycles
17   15.5          6.6
16    8.5          4.8
15   14.5          6.2

GCC-15

F    instructions  cycles
17   16.0          6.0
16   10.0          5.1
15   16.0          6.7


/*
* Copyright (C) 2025 ScyllaDB
*/
Contributor

@nyh nyh Dec 10, 2025

This file desperately needs a fairly long comment explaining the purpose of this new container, some overview of how it works, and a comparison to other similar containers (list, deque, circular_buffer, chunked_fifo, boost::intrusive::slist) helping a reader understand when to choose this one over the alternatives.

See an example in chunked_fifo.hh.

By the way, any reason why the older chunked_fifo.hh should be in core/ but this new one is in util/?

Contributor Author

Yes, absolutely. The "add documentation" item is in the TODO list

By the way, any reason why the older chunked_fifo.hh should be in core/ but this new one is in util/?

No specific reason, I just thought it was an auxiliary container, not a "core" one, whatever that means.

Contributor

I think chunked_fifo.hh and circular_buffer.hh predate the separate directory for all the include files (include/seastar/), so they were in core/ because they were used in the core's implementation - and not really intended to be used by end-users. The placement in include/core is not because the header file is any more "core" or important for end-users than other header files.

In that sense, the new list is also used in the core's implementation so could be in core/ instead of util/ - like its other friends, chunked_fifo and circular_buffer.

But maybe this discussion is actually a sign that when we created the new include/ directory, we shouldn't have kept the distinct "core" vs "util" subdirectories :-(

@nyh
Contributor

nyh commented Dec 10, 2025

This is a really interesting and clever implementation. I think its benefit over regular intrusive singly-linked lists is closely tied to how prefetching works on contemporary CPUs - they actually prefetch data pointed to by any pointer they see, so it is helpful to let them see a few (F) pointers in advance instead of just the next one. I wonder how much of this advantage remains if the scanning code does something more than just advance, but actually does a bit more work on each item - in that case the prefetching of F items in advance might be too early, as we still have a lot of work to do before we need the Fth next item. Also, in that case the speedup of just the scan might be negligible compared to the real amount of work we need to do on the items.

Do you know if something like your "split list" was ever done before? I searched a bit and got to a concept called a "chain" for each individual linked list, and "multi-chain" for the idea of having multiple independent lists, each allowing its own prefetching - but most discussions I read about these ideas were about how to write a better prefetcher, not how to write a linked list that works better with the existing prefetcher. Maybe you can write a paper - or at least a blog post - on this new (?) technique :-)

You mentioned that you were looking for the ability to "push_back and push_front" - does your intended use case (run-queue) actually use both? chunked_fifo<> doesn't do push_front, but this was a decision to simplify the API - it's not that it was algorithmically impossible to implement. Of course if your goal is to be an intrusive collection, then obviously chunked_fifo is not what you're looking for and you needed to create a new type. But in that case I think the word intrusive should be part of the new collection's name, and part of the pull request's title. If, on the other hand, "intrusive" was not very important to you - it appears that the existing chunked_fifo is actually a tiny bit more efficient than the new split_list, right?

@xemul
Contributor Author

xemul commented Dec 10, 2025

I wonder how much this advantage remains if the scanning code does something more than just advance but actually do a bit more work on each item - in this case the pre-fetching of F items in advance might be too early - as we still have a lot of work to do before we're ready to need the Fth next item. Also in this case the speedup of just the scan might be negligible compared to the real amount of work we need to do on the items.

It's a perfectly valid concern. I will be able to answer it more reliably after I rebase #3125 onto this new container and see a more real-life-ish example of task queue processing. There's a perf test for run-queue scan in that PR too. It runs no-op tasks, so it's still a simplification.

Do you know if something like your "split list" was ever done before?

I'm pretty sure someone has done this. At least, Redpanda says they coded something similar. I also tried to find more info about this concern of Avi's, and found some "related" discussions, but overall no, I didn't find any deep research on the topic.

Maybe you can write a paper - or at least a blog post - on this new (?) technique :-)

I already did :) I'll send you a link

You mentioned that you were looking for the ability to "push_back and push_front" - does your intended use case (run-queue) actually use both?

Yes. reactor::add_task() does push_back and reactor::add_urgent_task() does push_front.

Of course if your goal is to be an intrusive collection, then obviously chunked_fifo is not what you're looking for and you needed to create a new type. But in that case I think the word intrusive should be part of the new collection's name, and part of the pull request's title.

Yes, definitely. Renaming of the collection is in the TODO list as well.

If, on the other hand, "intrusive" was not very important to you - it appears that the existing chunked_fifo is actually a tiny bit more efficient than the new split_list, right?

Yes, that's correct. And chunked_fifo is actually faster than split-list, so we have a choice of three options:

  • circular_buffer -- the fastest (3.8 cycles per element), but allocates, and potentially performs a "large allocation"
  • chunked_fifo -- a bit slower (4.6 cycles per element); also allocates, but doesn't suffer from the large-allocation problem
  • split-list -- the slowest (5.2 cycles per element), but doesn't need extra memory to maintain the elements (tasks)

(and a reminder, to give those "cycles" numbers some reference -- a plain intrusive list takes 25.8 cycles per element; all numbers are for 10k elements in the collection)

@xemul
Contributor Author

xemul commented Dec 10, 2025

upd:

  • I updated the PR description with test results on an i7 processor (previously I ran it on the AMD Threadripper). On the i7, split-list beats chunked_fifo!

@xemul xemul force-pushed the br-container-scan-perf-test branch from 8acacc4 to af69879 Compare December 10, 2025 08:40
@xemul
Contributor Author

xemul commented Dec 10, 2025

upd:

  • renamed the collection to be intrusive_split_list
  • properly re-implemented operator-> and operator* for the iterator
  • initialize objects in the base class constructor. This allows avoiding the ugly init_once() call and dropping the manual start_measuring/stop_measuring tuning
  • introduced generic test parametrization. This allows having a single scanning test function and changing the collection size from the command line

@xemul
Contributor Author

xemul commented Dec 10, 2025

Documentation is yet to be written 😕

@xemul xemul changed the title Introduce split-list and show its scanning performance test Introduce intrusive split-list and compare its scanning performance with some other collections Dec 10, 2025
@avikivity
Member

Looks like chunked_fifo wins. And again I need to re-measure https://github.com/avikivity/seastar/commits/sort-tasks/, I found some bugs in it.

@xemul
Contributor Author

xemul commented Dec 10, 2025

Looks like chunked_fifo wins

Looks like it doesn't :) I measured on an i7i.2xlarge instance with a Xeon processor -- split-list 2.1 vs chunked-fifo 2.4

@xemul
Contributor Author

xemul commented Dec 10, 2025

I updated the PR description with full table

@xemul
Contributor Author

xemul commented Dec 10, 2025

Added (to the PR description) results for the Threadripper, but with a different compiler (clang-20). Split-list 4.9 vs chunked-fifo 6.9

The split list is a collection of objects that's optimized for
scanning from begin to end, with the ability to add entries at both
ends -- front and back, like std::deque does.

Splitting is needed to let the CPU prefetch the next element while
scanning the current one. With classical linked lists this is difficult,
because in order to prefetch the next element the CPU needs to read the
next pointer from the current element, which is itself still being read.
Thus the forward scan of a plain list is serialized. With split lists,
several adjacent elements can be prefetched in parallel.

To achieve that, the "first" pointer is "sharded" -- there are F first
pointers (as well as F cached last pointers) and the list is populated
in a round-robin manner -- first the 0th list is appended, then the 1st,
then the 2nd, ... then the (F-1)th, and population wraps around and
starts over.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add the `--parameter name=value` vector option. Tests may then do
`perf_tests::get_parameter("foo")` to get the user-configured parameter
and update themselves accordingly.

It will be useful for future scanning tests, as they show very different
results on collections of different sizes, and hard-coding a test for
each interesting collection size is too much code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This test allocates a vector of elements containing an integer, and
measures the time it takes to scan the collection sequentially. Elements
are artificially enlarged to occupy 64 bytes, so that this collection
doesn't benefit from caching several values on a single read.

This is the basic reference test for further testing (next patches).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Extend the vector<element> scanning test with a vector<element*> scanning
one. The pointers are randomized.

This test works almost as fast as the base reference test does, because
at most one element fits in a cache line, so CPU caching is effectively
off here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Extend the test with a boost::intrusive_list<element> scan. This is to
show that such scanning, despite occupying less memory, works _slower_
than the base reference test. This is because the CPU cannot prefetch
adjacent elements in parallel -- the next pointer needs to be read first
anyway.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Use the split-list with a factor of 16, as it shows the nicest results on
my CPU (Ryzen Threadripper 2970WX).

This test beats the singly-linked list greatly, and works ~2x slower on
10k elements than the array of pointers does. But unlike the array of
pointers, it doesn't require memory allocation on append operations.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Here the fifo stores pointers to elements, to behave like the array of
pointers does, while avoiding large (re)allocations on append.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
@xemul xemul force-pushed the br-container-scan-perf-test branch from af69879 to ed08c21 Compare December 10, 2025 14:38
@xemul
Contributor Author

xemul commented Dec 10, 2025

upd:

  • added size_t intrusive_split_list::size() needed to implement task queue using this collection

}

size_t size() const noexcept {
return _tail_idx - _head_idx;
Contributor

I just realized that in your implementation _tail_idx and _head_idx can keep growing as the queue is written and read over a long period of time, and wrap around the 32-bit unsigned range. This is fine, almost, except for one thing: the "%" operation doesn't wrap around as expected when F is not a power of two. For example, consider F=5.
If _tail_idx reaches the last unsigned integer, _tail_idx=4294967295, then _tail_idx%5 = 0. Then we wrap around to _tail_idx=0, and _tail_idx%5 = 0 again!

So I think you need the static_assert that F is a power of two - not just for performance but also for correctness.

Contributor Author

Wow, what a catch! Thanks!
I assume circular_buffer rounds up its size to a power of two for the same reason.

Contributor

By the way, with push_front setting head_idx to -1 immediately, it might be easy to write a test that reproduces this bug when F is not a power of two.

std::array<T*, F> _first;
std::array<T**, F> _last_p;
unsigned _tail_idx;
unsigned _head_idx;
Contributor

This "unsigned" limits the number of elements in the list to 2^32 - and pushing more than this will, if I understand correctly, fail silently. I don't think this limit will bother us, but perhaps it should be documented at least.


public:
iterator(unsigned idx) noexcept
: _cur({nullptr})
Contributor

By the way, I think the {nullptr} here is confusing as to what it really does -
I think (but correct me if I'm wrong) that it only initializes the first entry to nullptr explicitly, and then value-initializes the other ones to nullptr as well. So I think just {} would do exactly the same thing.

Contributor Author

Yes, it initializes the zeroth element with nullptr and the rest with their defaults (nullptrs as well).
It's an old C way of zeroing out an array.
I'll change it to {}

@mitso23 mitso23 left a comment

Looking at the performance metrics, it appears that split_list and chunked_fifo show very similar performance - with chunked_fifo actually being slightly faster in some configurations (e.g., AMD Ryzen + GCC-15).

Would it be worth considering chunked_fifo as a first iteration? It's already well-established in Seastar and might solve the performance problem you're targeting without introducing a new data structure.

Additionally, it would be helpful to include circular_buffer in the benchmarks, since that's the current implementation for the scheduler run-queue. This would provide a clearer picture of the actual improvement.

I assume the main concern with circular_buffer is the potential reallocation when it reaches max capacity - is that correct?

@xemul
Contributor Author

xemul commented Dec 11, 2025

Would it be worth considering chunked_fifo as a first iteration?

It's better to discuss it in #3125, because that's the place where the new collection is actually used. This PR just adds some benchmarks. However ...

I assume the main concern with circular_buffer is the potential reallocation when it reaches max capacity - is that correct?

Not only that. The other main concern is that it allocates at all, thus making task wakeup a potentially failing operation. However, I agree that the problem is amortized: the buffer is never shrunk back, so once it passes a wake-up storm, it survives until the next, larger tsunami.

Additionally, it would be helpful to include circular_buffer in the benchmarks, since that's the current implementation for the scheduler run-queue. This would provide a clearer picture of the actual improvement.

Agreed. Actually, the purpose of this test was to show the difference in handling different memory layouts, and circular_buffer is laid out like a vector. But I think that testing different implementations of the same layout is worthwhile.

@avikivity
Member

My feeling is that this thing is more sensitive to different workloads than chunked_fifo.

A run with very small+fast tasks will prefer a large F, so that prefetch can happen in time.
A run with large+slow tasks will prefer a small F, so there is a smaller cache footprint and less chance we lose the data structure roots from the cache.

I believe the second effect is much less important, so we'd choose a reasonably large F (say, 8 -- what did you measure with?).

A different objection is that I'd like to pursue my sort-tasks thing which un-FIFOs the task queue. But it needs more work.

@avikivity
Member

Would it be worth considering chunked_fifo as a first iteration?

It's better to discuss it in #3125, because that's the place where the new collection is actually used. This PR just adds some benchmarks. However ...

I assume the main concern with circular_buffer is the potential reallocation when it reaches max capacity - is that correct?

Not only that. The other main concern is that it allocates at all, thus making task wakeup a potentially failing operation. However, I agree that the problem is amortized: the buffer is never shrunk back, so once it passes a wake-up storm, it survives until the next, larger tsunami.

The tasks themselves are allocated; if we can allocate a task, we can allocate a chunk for chunked_fifo.

@xemul
Contributor Author

xemul commented Dec 11, 2025

what did you measure with?

16

class intrusive_split_list {
std::array<T*, F> _first;
std::array<T**, F> _last_p;
unsigned _tail_idx;

Why not use a 64-bit unsigned int? Then you don't really need to worry about the tail wrapping.

Contributor Author

The head wraps anyway (immediately after a push_front() that happens before any pop_front()). And 32 bits is more than enough for typical use cases: a seastar shard may have at most 64GB of memory, so a 32-bit index is enough even for 16-byte objects. I don't think it's a problem.

template <typename T, unsigned F, T* T::*Member>
class intrusive_split_list {
std::array<T*, F> _first;
std::array<T**, F> _last_p;

Wouldn't it be more readable to use dummy nodes as the heads of the linked lists? Then you wouldn't need to use T**.

Contributor Author

T can be an abstract type, like class task


I see

Contributor Author

Also, it's crucial to have the first pointers packed as tightly as possible so they're cached on read; with dummy nodes it's not going to work as nicely.

}
};

PERF_TEST_F(sum_perf, sum_plain) {

Out of curiosity, why are you not using Google Benchmark? What are the benefits of using the seastar benchmark?

Contributor Author

Just because there's seastar perf test suite in the repo :)

// checks that forward scanning reveals consistent results
BOOST_AUTO_TEST_CASE(test_split_list_grow_and_shrink) {
const auto seed = std::random_device()();
auto rnd_engine = std::mt19937(seed);

You should also test the iterator logic, and maybe the wrap-around logic if you want to use 32-bit unsigned integers for head/tail.

Contributor Author

Iterator logic is checked with

    } else if (op == 2) { // scan
        unsigned scanned = 0;
        fmt::print("= {} ... {}\n", first, last);
        for (auto it = sl.begin(); it != sl.end(); ++it) {
            fmt::print("  {}\n", it->value);
            BOOST_REQUIRE_EQUAL(it->value, first + 1 + scanned);
            scanned++;
        }
        BOOST_REQUIRE_EQUAL(scanned, last - first);
