
Noticeable overhead in crossbeam-epoch under many Locals #852

Open

@ryoqun

Description

Hi, nice to meet you and thanks for maintaining this crate. :)

I found a performance issue spanning cross-crate code interactions, ultimately resulting in many CPU cycles being wasted in crossbeam-epoch. I'm wondering where the proper fix should live among the related crates.

In short, there are circumstances where crossbeam-epoch's epoch bookkeeping incurs significant overhead, and our production code was hit by it.

The dirtiest hack would be to reduce the frequency of garbage collection (i.e. global().collect(&guard);) by increasing PINNINGS_BETWEEN_COLLECT, like this:

$ git diff
diff --git a/crossbeam-epoch/src/internal.rs b/crossbeam-epoch/src/internal.rs
index de208b1..0394c03 100644
--- a/crossbeam-epoch/src/internal.rs
+++ b/crossbeam-epoch/src/internal.rs
@@ -387,7 +387,7 @@ fn local_size() {
 impl Local {
     /// Number of pinnings after which a participant will execute some deferred functions from the
     /// global queue.
-    const PINNINGS_BETWEEN_COLLECT: usize = 128;
+    const PINNINGS_BETWEEN_COLLECT: usize = 128 * 128;
 
     /// Registers a new `Local` in the provided `Global`.
     pub(crate) fn register(collector: &Collector) -> LocalHandle {

However, I'm not fully sure this is the right fix (especially regarding the ramifications of collecting garbage less often).

Let me explain the rather complicated context a bit.

Firstly, crossbeam-epoch is used by crossbeam-deque (duh), which in turn is used by rayon (the task-scheduler library) for its task queues, which is then used by solana-validator (which experienced the performance issue; solana-labs/solana#22603).

So far, so good; this is just a normal use of rayon by application code to exploit multiple cores.

The twist is that solana-validator holds many rayon thread pools, managed by its internal subsystems. So the total thread count far exceeds the system's core count, by a large factor: roughly 2000 threads on a 64-core machine.

(We know this is a bit of a silly setup. But no subsystem runs at 100% CPU persistently; they're mostly idling. On the other hand, we want to maximize processing throughput and minimize latency at peak load. Also, casual top -H introspection and granular kernel thread-priority tuning are handy. Lastly, sharing a single thread pool (or a few of them) would introduce unneeded synchronization cost across subsystems and implementation complexity in the solana-validator code. All in all, independent per-subsystem thread pools make sense to us, at least for now.)

So those whopping 2000 (rayon) threads all register as Locals with crossbeam-epoch's singleton Global. Then global().collect() suddenly becomes very slow, because it does a linear scan over the Locals (= O(n))...

(This then affects all independent rayon pools inside the process, because they all share the singleton Global. Performance degradation in a seemingly-unrelated subsystem was hard to debug, by the way...)
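To make the scan cost concrete, here is a minimal self-contained toy model of the epoch-advance check. This is not crossbeam-epoch's actual internals (the types and fields are made up for illustration), but it shows why the work grows linearly with the number of registered Locals: the global epoch can only advance after every pinned participant has been confirmed to be in the current epoch.

// Toy model of the epoch-advance scan; not crossbeam-epoch's real code.
struct LocalModel {
    pinned: bool,
    epoch: u64,
}

struct GlobalModel {
    epoch: u64,
    // In crossbeam-epoch this is a lock-free intrusive list of Locals.
    locals: Vec<LocalModel>,
}

impl GlobalModel {
    /// O(n) in registered locals: the epoch can only advance after checking
    /// that no participant is still pinned in an older epoch. With ~2000
    /// threads registered against the one default Global, every periodic
    /// collect() pays for this full walk.
    fn try_advance(&mut self) -> u64 {
        for local in &self.locals {
            if local.pinned && local.epoch != self.epoch {
                return self.epoch; // someone lags behind; can't advance yet
            }
        }
        self.epoch += 1;
        self.epoch
    }
}

fn main() {
    let mut global = GlobalModel {
        epoch: 0,
        locals: (0..2000).map(|_| LocalModel { pinned: false, epoch: 0 }).collect(),
    };
    println!("advanced to epoch {}", global.try_advance());
}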

Regarding the extent of the overhead: I observed that a certain rayon-heavy subsystem is ~5% faster in wall time with the above 1-line hack alone, and I also saw a ~100x reduction of the overhead as measured with Linux's perf.

Possible solutions:

  • Accept this dirty hack as is?
  • ...or polish it so that the gc frequency adapts to the number of Locals?
  • Adjust rayon to use a separate Global for each thread pool, as a kind of scope? (see the sketch after this list)
  • Introduce a background gc thread, like jemalloc does?
  • Reduce the number of thread pools in solana-validator?
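For the third option, crossbeam-epoch already exposes a public Collector API that can act as an independent epoch domain. Below is a rough sketch of what per-pool collectors could look like; as far as I can tell, rayon and crossbeam-deque currently pin through the implicit default collector, so they would need to be adjusted to carry such a handle, and the pool names here are purely illustrative.

use crossbeam_epoch::{Collector, LocalHandle};

fn main() {
    // One Collector per (hypothetical) thread pool, instead of the single
    // process-wide default used by crossbeam_epoch::pin().
    let pool_a = Collector::new();
    let pool_b = Collector::new();

    // Each worker thread of pool A registers only with pool A's collector,
    // so pool B's threads never appear in pool A's Local list.
    let handle: LocalHandle = pool_a.register();
    let guard = handle.pin();
    // ... epoch-protected work for pool A ...
    drop(guard);

    let _ = pool_b; // pool B's collection never scans pool A's Locals
}

This would scope the O(n) scan to each pool's own participants, at the cost of plumbing a collector handle through rayon and crossbeam-deque.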
