ThreadManager: store Weak instead of Arc pool #5826

Open · apfitzge wants to merge 2 commits into master

Conversation


@apfitzge apfitzge commented Apr 15, 2025

Problems

Problem 1 - Circular Reference of SchedulerPool
  • solScCleaner holds a weak reference to the SchedulerPool and only breaks out of its loop once all Arcs of SchedulerPool are dropped
  • Arc<SchedulerPool> is held in two places: BankForks and ThreadManager
  • ThreadManager ownership chain:
    • owned by PooledSchedulerInner
      • owned by PooledScheduler
        • owned by SchedulerPool (circular reference)
  • ThreadManager uses the pool to check status and possibly to return the scheduler to the pool; in either case, ThreadManager can be modified to handle the pool no longer existing (see the sketch below)
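
A minimal sketch of the cycle and the fix, with the real types collapsed into stand-ins (PooledScheduler/PooledSchedulerInner are merged into one level and the fields are invented for illustration):

use std::sync::{Arc, Mutex, Weak};

// Stand-ins only: the ownership edges are what matters, not the fields.
struct SchedulerPool {
    schedulers: Mutex<Vec<ThreadManager>>, // pool -> scheduler -> ThreadManager
}

struct ThreadManager {
    // Before this PR this was Arc<SchedulerPool>, which closes the cycle and
    // keeps the pool's strong count above zero even after BankForks drops its
    // Arc. With Weak, the cleaner's upgrade() starts failing and it can exit.
    _pool: Weak<SchedulerPool>,
}

fn main() {
    let pool = Arc::new(SchedulerPool { schedulers: Mutex::new(Vec::new()) });
    let cleaner_view = Arc::downgrade(&pool); // what solScCleaner holds

    // A scheduler returned to the pool; its ThreadManager only holds a Weak.
    pool.schedulers
        .lock()
        .unwrap()
        .push(ThreadManager { _pool: Arc::downgrade(&pool) });

    drop(pool); // the last strong owner (e.g. BankForks) goes away
    assert!(cleaner_view.upgrade().is_none()); // cleaner can now break out
}
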
Problem 2 - Circular Ownership Model of new_task_sender
  • new_task_sender was already owned by ThreadManager
  • new_task_sender also became owned via BankingStageHelper
    • BankingStageHelper is held in the scheduler thread.
    • The scheduler thread will hang because it never receives an Err on channel disconnect (since it holds a sender itself!)
  • When ThreadManager is dropped, it will attempt to join all threads,
    • but the scheduler thread cannot be joined because it will never exit its loop (illustrated in the sketch below)
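
A self-contained illustration of the hang, using plain std::sync::mpsc in place of the scheduler's actual channel (Helper and the u64 payload are stand-ins, not the real types): because the receiving side owns a clone of its own Sender, the channel can never report disconnection.

use std::sync::mpsc::{channel, RecvTimeoutError, Sender};
use std::time::Duration;

struct Helper {
    // Owning a clone of our own channel's sender is what keeps the
    // scheduler thread alive: the channel can never disconnect.
    _new_task_sender: Sender<u64>,
}

fn main() {
    let (sender, receiver) = channel::<u64>();
    // Stand-in for BankingStageHelper living inside the scheduler thread.
    let _helper = Helper { _new_task_sender: sender.clone() };

    drop(sender); // "ThreadManager" drops its sender...

    // ...but recv never reports Disconnected, because _helper still holds one.
    // The real loop uses recv(), which would block forever here; recv_timeout
    // keeps this sketch terminating.
    assert_eq!(
        receiver.recv_timeout(Duration::from_millis(10)),
        Err(RecvTimeoutError::Timeout)
    );
}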

Summary of Changes

  • ThreadManager modified to hold a Weak reference to the SchedulerPool instead of a strong reference.
  • BankingStageHelper modified to hold a Weak<Sender<..>> so that the scheduler can exit if the actual new_task_sender is dropped (not the handler threads, which send retryable transactions).
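
And a sketch of the fixed wiring under the same simplifications (Helper, inside/outside, and the u64 payload are invented stand-ins): the single strong Arc<Sender> stays with the manager, the helper inside the scheduler thread only holds a Weak, so dropping the Arc disconnects the channel and the receive loop ends.

use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Weak};
use std::thread;

struct Helper {
    // Only a Weak handle, so the scheduler thread no longer keeps its own channel alive.
    new_task_sender: Weak<Sender<u64>>,
}

impl Helper {
    fn send_new_task(&self, task: u64) {
        // Mirrors the PR: panic if the sender is already gone.
        self.new_task_sender.upgrade().unwrap().send(task).unwrap();
    }
}

fn main() {
    let (sender, receiver) = channel::<u64>();
    let sender = Arc::new(sender); // single strong owner, standing in for ThreadManager
    let inside = Helper { new_task_sender: Arc::downgrade(&sender) };
    let outside = Helper { new_task_sender: Arc::downgrade(&sender) };

    let scheduler = thread::spawn(move || {
        let _helper = inside; // the helper lives inside the scheduler thread, as before
        while let Ok(task) = receiver.recv() {
            println!("scheduler got task {task}");
        }
        // recv() returns Err(Disconnected) once the Arc<Sender> is dropped,
        // so this loop ends and the thread can be joined.
    });

    outside.send_new_task(42); // sending still works while the Arc is alive
    drop(sender);              // "ThreadManager" drops its strong handle
    scheduler.join().unwrap(); // joins instead of hanging
}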

Fixes #5435
Fixes #4211

Testing

Ran cargo test --package solana-local-cluster --test local_cluster -- test_snapshot_restart_tower --exact --show-output.
Added some additional error logging for when the ReplayStage loop exits, BankForks is dropped, the unified-scheduler cleaner thread exits, etc.
At the end of the test, waited 10s after everything is dropped. If the cleaner is still logging at that point, it has stayed alive when it should not have.

Last few log lines from the test (with the extra logging described above):

master:
[2025-04-15T14:10:04.993411000Z ERROR solana_unified_scheduler_pool] [ThreadId(67)] Scheduler pool(2) cleaner: dropped 0 idle inners, 0 trashed inners, triggered 1 timeout listeners
[2025-04-15T14:10:05.073398000Z ERROR solana_unified_scheduler_pool] [ThreadId(711)] Scheduler pool(2) cleaner: dropped 0 idle inners, 0 trashed inners, triggered 1 timeout listeners
[2025-04-15T14:10:05.274471000Z ERROR solana_unified_scheduler_pool] [ThreadId(376)] Scheduler pool(2) cleaner: dropped 0 idle inners, 0 trashed inners, triggered 0 timeout listeners

w/ patch:
[2025-04-15T14:12:01.921534000Z ERROR solana_unified_scheduler_pool] [ThreadId(67)] Scheduler pool(2) cleaner: dropped 0 idle inners, 0 trashed inners, triggered 1 timeout listeners
[2025-04-15T14:12:01.950369000Z ERROR solana_runtime::bank_forks] [ThreadId(764)] BankForks dropped
[2025-04-15T14:12:02.210603000Z ERROR solana_runtime::bank_forks] [ThreadId(120)] BankForks dropped
[2025-04-15T14:12:02.286298000Z ERROR solana_unified_scheduler_pool] [ThreadId(710)] Cleaner exited
[2025-04-15T14:12:02.326595000Z ERROR solana_unified_scheduler_pool] [ThreadId(67)] Cleaner exited
[2025-04-15T14:12:02.341763000Z ERROR local_cluster] everything is dropped at this point...if anything is still alive it's a naughty boy

Note that in the master logs, pool(STRONG_COUNT) still reads 2, indicating there are still 2 remaining strong references. One is the upgraded pool used in the cleaning loop itself; the other is the stuck ThreadManager.
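
For reference, the cleaner's own Weak::upgrade accounts for one of those strong counts while a cleaning iteration runs; a toy illustration of that accounting (not the agave logging code):

use std::sync::Arc;

fn main() {
    let pool = Arc::new("SchedulerPool"); // the stuck ThreadManager's Arc on master
    let weak = Arc::downgrade(&pool);     // what the cleaner holds

    assert_eq!(Arc::strong_count(&pool), 1);
    {
        // Inside one cleaning iteration: upgrading adds a temporary strong
        // count, so the count reads 2 (the upgrade + the stuck owner).
        let upgraded = weak.upgrade().unwrap();
        assert_eq!(Arc::strong_count(&upgraded), 2);
    }
    assert_eq!(Arc::strong_count(&pool), 1); // back to just the stuck owner
}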

.pool
.upgrade()
.map(|pool| self.usage_queue_loader.count() > pool.max_usage_queue_count)
.unwrap_or_default()

apfitzge (author):
AFAICT this value is used by the cleaner to decide whether to drop or return to the pool, so the return value here doesn't matter much if the pool no longer exists.

Reviewer:
Might be good to call that out in a comment here or function header

Reviewer:
I'm trying to think what we want to have happen here if we lose the handle to the pool... Do we want to consider it overgrown? Maybe this should be unwrap_or(true)?

apfitzge (author):
It's an odd setup, and it just can't really happen afaict.
Maybe we should just panic?

The current setup:

  • SchedulerPool::return_scheduler
    • is_trashed
      • is_overgrown

so is_overgrown is never called (in non-test code at least) unless we have a SchedulerPool.

This makes me wonder...could we just pass the pool into is_trashed?
The only other use of the pool is in return_to_pool; can we just pass it there somehow? Ultimately it seems to be called by timeout listeners, which iirc are called by the cleaner loop, which has a weak reference to the pool itself and could pass it in.

That's a bit more restructuring, and I'm not 100% sure it would work, but it would certainly simplify the ownership and reference model: ThreadManager would just no longer have a pool reference at all.
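
A rough, hypothetical sketch of that restructuring (stub types with invented fields, not the current agave API): the cleaner upgrades its own Weak once and passes the pool down, so the scheduler side would not need any pool handle.

use std::sync::{Arc, Weak};

// Stub types; only the proposed signatures matter.
struct SchedulerPool {
    max_usage_queue_count: usize,
}

struct UsageQueueLoader {
    count: usize,
}

struct PooledSchedulerInner {
    usage_queue_loader: UsageQueueLoader,
    aborted: bool,
}

impl PooledSchedulerInner {
    // Proposed: take the pool as a parameter instead of holding a reference to it.
    fn is_trashed(&self, pool: &SchedulerPool) -> bool {
        self.aborted || self.is_overgrown(pool)
    }

    fn is_overgrown(&self, pool: &SchedulerPool) -> bool {
        self.usage_queue_loader.count > pool.max_usage_queue_count
    }
}

// The cleaner loop already holds a Weak<SchedulerPool>; if the upgrade fails
// there is no pool to return anything to, so it can simply skip or exit.
fn clean_one(weak_pool: &Weak<SchedulerPool>, inner: &PooledSchedulerInner) -> bool {
    match weak_pool.upgrade() {
        Some(pool) => inner.is_trashed(&pool),
        None => false,
    }
}

fn main() {
    let pool = Arc::new(SchedulerPool { max_usage_queue_count: 4 });
    let inner = PooledSchedulerInner {
        usage_queue_loader: UsageQueueLoader { count: 10 },
        aborted: false,
    };
    assert!(clean_one(&Arc::downgrade(&pool), &inner)); // 10 > 4 => trashed
}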

@@ -262,11 +262,11 @@ clone_trait_object!(BankingPacketHandler);
 pub struct BankingStageHelper {
     usage_queue_loader: UsageQueueLoader,
     next_task_id: AtomicUsize,
-    new_task_sender: Sender<NewTaskPayload>,
+    new_task_sender: Weak<Sender<NewTaskPayload>>,

apfitzge (author), Apr 15, 2025:
this is the least invasive change to fix the circular dependency on the sender/recv connection in the scheduler.
We may want to restructure the scheduler thread to not hold the entire HandlerContext

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.0%. Comparing base (c5edcfb) to head (6b670c3).
Report is 15 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #5826     +/-   ##
=========================================
- Coverage    83.0%    83.0%   -0.1%     
=========================================
  Files         828      828             
  Lines      375510   375520     +10     
=========================================
- Hits       311857   311848      -9     
- Misses      63653    63672     +19     

@apfitzge apfitzge marked this pull request as ready for review April 15, 2025 16:58

@@ -296,6 +296,8 @@ impl BankingStageHelper {
 
     pub fn send_new_task(&self, task: Task) {
         self.new_task_sender
+            .upgrade()
+            .unwrap()

Reviewer:
should we instead just drop the task if the upgrade fails?

apfitzge (author):
I preserved the current behavior of panicking on failure to send; realistically we probably want to return an error and break whatever loops we're in wherever we send these

Reviewer:
yeah, I'm okay w/ following up on that separately
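
One possible shape for that follow-up, sketched with hypothetical names (SchedulerDisconnected and the simplified Task are illustration only, not from the PR):

use std::sync::mpsc::Sender;
use std::sync::Weak;

struct Task;

#[derive(Debug)]
struct SchedulerDisconnected;

struct BankingStageHelper {
    new_task_sender: Weak<Sender<Task>>,
}

impl BankingStageHelper {
    // Instead of panicking when the scheduler side is gone, surface an error
    // so the calling handler loop can break out cleanly.
    fn send_new_task(&self, task: Task) -> Result<(), SchedulerDisconnected> {
        let sender = self.new_task_sender.upgrade().ok_or(SchedulerDisconnected)?;
        sender.send(task).map_err(|_| SchedulerDisconnected)
    }
}

fn main() {
    // With no strong Arc<Sender> alive, sending now fails instead of panicking.
    let helper = BankingStageHelper { new_task_sender: Weak::new() };
    assert!(helper.send_new_task(Task).is_err());
}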
