
Noticeable overhead in crossbeam-epoch under many Locals #852

Open

@ryoqun

Description

Hi, nice to meet you and thanks for maintaining this crate. :)

I found a performance issue spanning cross-crate code interactions, ultimately resulting in many CPU cycles being wasted in crossbeam-epoch. I'm wondering where the proper fix should live among the related crates.

In short, there are circumstances where crossbeam-epoch's epoch bookkeeping incurs significant overhead, and our production code was hit by it.

The dirtiest hack would be to reduce the frequency of garbage collection (i.e. global().collect(&guard);) by increasing PINNINGS_BETWEEN_COLLECT, like this:

$ git diff
diff --git a/crossbeam-epoch/src/internal.rs b/crossbeam-epoch/src/internal.rs
index de208b1..0394c03 100644
--- a/crossbeam-epoch/src/internal.rs
+++ b/crossbeam-epoch/src/internal.rs
@@ -387,7 +387,7 @@ fn local_size() {
 impl Local {
     /// Number of pinnings after which a participant will execute some deferred functions from the
     /// global queue.
-    const PINNINGS_BETWEEN_COLLECT: usize = 128;
+    const PINNINGS_BETWEEN_COLLECT: usize = 128 * 128;
 
     /// Registers a new `Local` in the provided `Global`.
     pub(crate) fn register(collector: &Collector) -> LocalHandle {

However, I'm not fully sure this is the right fix (especially regarding the ramifications of collecting garbage less often).

Let me explain the rather complicated context a bit.

Firstly, crossbeam-epoch is used by crossbeam-deque (duh), which in turn is used by rayon (the task-scheduler library) for its task queues, which is then used by solana-validator (which experienced the performance issue; solana-labs/solana#22603).

So far, so good; this is just a normal use of rayon by application code to exploit multiple cores.

The twist is that solana-validator holds many rayon thread pools, managed by its internal subsystems. So the total thread count far exceeds the system's core count, by a large factor: roughly 2000 threads on a 64-core machine.

(We know this is a bit of a silly setup. But no subsystem runs at 100% CPU persistently; they're mostly idling. On the other hand, we want to maximize processing throughput and minimize latency at peak load. Also, casual top -H introspection and granular kernel thread-priority tuning are handy. Lastly, sharing a single thread pool (or a few of them) would introduce unneeded synchronization cost across subsystems and implementation complexity in the solana-validator code. All in all, independent per-subsystem thread pools make sense to us, at least for now.)

So those whopping 2000 (rayon) threads all register as Locals with crossbeam-epoch's singleton Global. Then global().collect() suddenly becomes very slow, because it does a linear scan over the Locals (= O(n))...

(This then affects all independent rayon pools inside the process, because they all share the singleton Global. Performance degradation in a seemingly-unrelated subsystem was hard to debug, by the way...)
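To make the scan cost concrete, here is a minimal self-contained toy model of the epoch-advance check. This is not crossbeam-epoch's actual internals (the types and fields are made up for illustration), but it shows why the work grows linearly with the number of registered Locals: the global epoch can only advance after every pinned participant has been confirmed to be in the current epoch.

// Toy model of the epoch-advance scan; not crossbeam-epoch's real code.
struct LocalModel {
    pinned: bool,
    epoch: u64,
}

struct GlobalModel {
    epoch: u64,
    // In crossbeam-epoch this is a lock-free intrusive list of Locals.
    locals: Vec<LocalModel>,
}

impl GlobalModel {
    /// O(n) in registered locals: the epoch can only advance after checking
    /// that no participant is still pinned in an older epoch. With ~2000
    /// threads registered against the one default Global, every periodic
    /// collect() pays for this full walk.
    fn try_advance(&mut self) -> u64 {
        for local in &self.locals {
            if local.pinned && local.epoch != self.epoch {
                return self.epoch; // someone lags behind; can't advance yet
            }
        }
        self.epoch += 1;
        self.epoch
    }
}

fn main() {
    let mut global = GlobalModel {
        epoch: 0,
        locals: (0..2000).map(|_| LocalModel { pinned: false, epoch: 0 }).collect(),
    };
    println!("advanced to epoch {}", global.try_advance());
}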

Regarding the extent of the overhead: I observed that a certain rayon-heavy subsystem is ~5% faster in wall time with the above 1-line hack alone, and I also saw a ~100x reduction of the overhead as measured with Linux's perf.

Possible solutions:

  • Accept this dirty hack as is?
  • ...or polish it so that the gc frequency adapts to the number of Locals?
  • Adjust rayon to use a separate Global for each thread pool, as a kind of scope? (see the sketch after this list)
  • Introduce a background gc thread, like jemalloc does?
  • Reduce the number of thread pools in solana-validator?
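For the third option, crossbeam-epoch already exposes a public Collector API that can act as an independent epoch domain. Below is a rough sketch of what per-pool collectors could look like; as far as I can tell, rayon and crossbeam-deque currently pin through the implicit default collector, so they would need to be adjusted to carry such a handle, and the pool names here are purely illustrative.

use crossbeam_epoch::{Collector, LocalHandle};

fn main() {
    // One Collector per (hypothetical) thread pool, instead of the single
    // process-wide default used by crossbeam_epoch::pin().
    let pool_a = Collector::new();
    let pool_b = Collector::new();

    // Each worker thread of pool A registers only with pool A's collector,
    // so pool B's threads never appear in pool A's Local list.
    let handle: LocalHandle = pool_a.register();
    let guard = handle.pin();
    // ... epoch-protected work for pool A ...
    drop(guard);

    let _ = pool_b; // pool B's collection never scans pool A's Locals
}

This would scope the O(n) scan to each pool's own participants, at the cost of plumbing a collector handle through rayon and crossbeam-deque.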
