ThreadManager: store Weak instead of Arc pool #5826

Open · apfitzge wants to merge 2 commits into master

Conversation


@apfitzge apfitzge commented Apr 15, 2025

Problems

Problem 1 - Circular Reference of SchedulerPool
  • solScCleaner holds a weak reference to the SchedulerPool and only breaks out of its loop once all Arcs of SchedulerPool are dropped
  • Arc<SchedulerPool> is held in two places: BankForks and ThreadManager
  • ThreadManager ownership chain:
    • owned by PooledSchedulerInner
      • owned by PooledScheduler
        • owned by SchedulerPool (circular reference)
  • ThreadManager uses the pool to check status and possibly to return the scheduler to the pool; in either case, ThreadManager can be modified to handle the pool no longer existing (see the sketch below)
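
A minimal sketch of the cycle and the fix, with the real types collapsed into stand-ins (PooledScheduler/PooledSchedulerInner are merged into one level and the fields are invented for illustration):

use std::sync::{Arc, Mutex, Weak};

// Stand-ins only: the ownership edges are what matters, not the fields.
struct SchedulerPool {
    schedulers: Mutex<Vec<ThreadManager>>, // pool -> scheduler -> ThreadManager
}

struct ThreadManager {
    // Before this PR this was Arc<SchedulerPool>, which closes the cycle and
    // keeps the pool's strong count above zero even after BankForks drops its
    // Arc. With Weak, the cleaner's upgrade() starts failing and it can exit.
    _pool: Weak<SchedulerPool>,
}

fn main() {
    let pool = Arc::new(SchedulerPool { schedulers: Mutex::new(Vec::new()) });
    let cleaner_view = Arc::downgrade(&pool); // what solScCleaner holds

    // A scheduler returned to the pool; its ThreadManager only holds a Weak.
    pool.schedulers
        .lock()
        .unwrap()
        .push(ThreadManager { _pool: Arc::downgrade(&pool) });

    drop(pool); // the last strong owner (e.g. BankForks) goes away
    assert!(cleaner_view.upgrade().is_none()); // cleaner can now break out
}
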
Problem 2 - Circular Ownership Model of new_task_sender
  • new_task_sender was already owned by ThreadManager
  • new_task_sender also became owned via BankingStageHelper
    • BankingStageHelper is held in the scheduler thread.
    • The scheduler thread will hang because it never receives an Err on channel disconnect (since it holds a sender itself!)
  • When ThreadManager is dropped, it will attempt to join all threads,
    • but the scheduler thread cannot be joined because it will never exit its loop (illustrated in the sketch below)
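
A self-contained illustration of the hang, using plain std::sync::mpsc in place of the scheduler's actual channel (Helper and the u64 payload are stand-ins, not the real types): because the receiving side owns a clone of its own Sender, the channel can never report disconnection.

use std::sync::mpsc::{channel, RecvTimeoutError, Sender};
use std::time::Duration;

struct Helper {
    // Owning a clone of our own channel's sender is what keeps the
    // scheduler thread alive: the channel can never disconnect.
    _new_task_sender: Sender<u64>,
}

fn main() {
    let (sender, receiver) = channel::<u64>();
    // Stand-in for BankingStageHelper living inside the scheduler thread.
    let _helper = Helper { _new_task_sender: sender.clone() };

    drop(sender); // "ThreadManager" drops its sender...

    // ...but recv never reports Disconnected, because _helper still holds one.
    // The real loop uses recv(), which would block forever here; recv_timeout
    // keeps this sketch terminating.
    assert_eq!(
        receiver.recv_timeout(Duration::from_millis(10)),
        Err(RecvTimeoutError::Timeout)
    );
}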

Summary of Changes

  • ThreadManager modified to hold a Weak reference to the SchedulerPool instead of a strong reference.
  • BankingStageHelper modified to hold a Weak<Sender<..>> so that the scheduler can exit if the actual new_task_sender is dropped (not the handler threads, which send retryable transactions).
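
And a sketch of the fixed wiring under the same simplifications (Helper, inside/outside, and the u64 payload are invented stand-ins): the single strong Arc<Sender> stays with the manager, the helper inside the scheduler thread only holds a Weak, so dropping the Arc disconnects the channel and the receive loop ends.

use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Weak};
use std::thread;

struct Helper {
    // Only a Weak handle, so the scheduler thread no longer keeps its own channel alive.
    new_task_sender: Weak<Sender<u64>>,
}

impl Helper {
    fn send_new_task(&self, task: u64) {
        // Mirrors the PR: panic if the sender is already gone.
        self.new_task_sender.upgrade().unwrap().send(task).unwrap();
    }
}

fn main() {
    let (sender, receiver) = channel::<u64>();
    let sender = Arc::new(sender); // single strong owner, standing in for ThreadManager
    let inside = Helper { new_task_sender: Arc::downgrade(&sender) };
    let outside = Helper { new_task_sender: Arc::downgrade(&sender) };

    let scheduler = thread::spawn(move || {
        let _helper = inside; // the helper lives inside the scheduler thread, as before
        while let Ok(task) = receiver.recv() {
            println!("scheduler got task {task}");
        }
        // recv() returns Err(Disconnected) once the Arc<Sender> is dropped,
        // so this loop ends and the thread can be joined.
    });

    outside.send_new_task(42); // sending still works while the Arc is alive
    drop(sender);              // "ThreadManager" drops its strong handle
    scheduler.join().unwrap(); // joins instead of hanging
}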

Fixes #5435
Fixes #4211

Testing

Ran cargo test --package solana-local-cluster --test local_cluster -- test_snapshot_restart_tower --exact --show-output.
Added some additional error logging for when the ReplayStage loop exits, BankForks is dropped, the unified-scheduler cleaner thread exits, etc.
At the end of the test, waited 10s after everything is dropped. If the cleaner is still logging at that point, it has stayed alive when it should not have.

Last few log lines from the test (with the extra logging described above):

master:
[2025-04-15T14:10:04.993411000Z ERROR solana_unified_scheduler_pool] [ThreadId(67)] Scheduler pool(2) cleaner: dropped 0 idle inners, 0 trashed inners, triggered 1 timeout listeners
[2025-04-15T14:10:05.073398000Z ERROR solana_unified_scheduler_pool] [ThreadId(711)] Scheduler pool(2) cleaner: dropped 0 idle inners, 0 trashed inners, triggered 1 timeout listeners
[2025-04-15T14:10:05.274471000Z ERROR solana_unified_scheduler_pool] [ThreadId(376)] Scheduler pool(2) cleaner: dropped 0 idle inners, 0 trashed inners, triggered 0 timeout listeners

w/ patch:
[2025-04-15T14:12:01.921534000Z ERROR solana_unified_scheduler_pool] [ThreadId(67)] Scheduler pool(2) cleaner: dropped 0 idle inners, 0 trashed inners, triggered 1 timeout listeners
[2025-04-15T14:12:01.950369000Z ERROR solana_runtime::bank_forks] [ThreadId(764)] BankForks dropped
[2025-04-15T14:12:02.210603000Z ERROR solana_runtime::bank_forks] [ThreadId(120)] BankForks dropped
[2025-04-15T14:12:02.286298000Z ERROR solana_unified_scheduler_pool] [ThreadId(710)] Cleaner exited
[2025-04-15T14:12:02.326595000Z ERROR solana_unified_scheduler_pool] [ThreadId(67)] Cleaner exited
[2025-04-15T14:12:02.341763000Z ERROR local_cluster] everything is dropped at this point...if anything is still alive it's a naughty boy

Note that in the master logs, pool(STRONG_COUNT) still reads 2, indicating there are still 2 remaining strong references. One is the upgraded pool used in the cleaning loop itself; the other is the stuck ThreadManager.
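
For reference, the cleaner's own Weak::upgrade accounts for one of those strong counts while a cleaning iteration runs; a toy illustration of that accounting (not the agave logging code):

use std::sync::Arc;

fn main() {
    let pool = Arc::new("SchedulerPool"); // the stuck ThreadManager's Arc on master
    let weak = Arc::downgrade(&pool);     // what the cleaner holds

    assert_eq!(Arc::strong_count(&pool), 1);
    {
        // Inside one cleaning iteration: upgrading adds a temporary strong
        // count, so the count reads 2 (the upgrade + the stuck owner).
        let upgraded = weak.upgrade().unwrap();
        assert_eq!(Arc::strong_count(&upgraded), 2);
    }
    assert_eq!(Arc::strong_count(&pool), 1); // back to just the stuck owner
}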

.pool
.upgrade()
.map(|pool| self.usage_queue_loader.count() > pool.max_usage_queue_count)
.unwrap_or_default()

apfitzge (author):
AFAICT this value is used by the cleaner to decide whether to drop or return to the pool, so the return value here doesn't matter much if the pool no longer exists.

Reviewer:
Might be good to call that out in a comment here or function header

Reviewer:
I'm trying to think what we want to have happen here if we lose the handle to the pool... Do we want to consider it overgrown? Maybe this should be unwrap_or(true)?

apfitzge (author):
It's an odd setup, and it just can't really happen afaict.
Maybe we should just panic?

The current setup:

  • SchedulerPool::return_scheduler
    • is_trashed
      • is_overgrown

so is_overgrown is never called (in non-test code at least) unless we have a SchedulerPool.

This makes me wonder...could we just pass the pool into is_trashed?
The only other use of the pool is in return_to_pool; can we just pass it there somehow? Ultimately it seems to be called by timeout listeners, which iirc are called by the cleaner loop, which has a weak reference to the pool itself and could pass it in.

That's a bit more restructuring, and I'm not 100% sure it would work, but it would certainly simplify the ownership and reference model: ThreadManager would just no longer have a pool reference at all.
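
A rough, hypothetical sketch of that restructuring (stub types with invented fields, not the current agave API): the cleaner upgrades its own Weak once and passes the pool down, so the scheduler side would not need any pool handle.

use std::sync::{Arc, Weak};

// Stub types; only the proposed signatures matter.
struct SchedulerPool {
    max_usage_queue_count: usize,
}

struct UsageQueueLoader {
    count: usize,
}

struct PooledSchedulerInner {
    usage_queue_loader: UsageQueueLoader,
    aborted: bool,
}

impl PooledSchedulerInner {
    // Proposed: take the pool as a parameter instead of holding a reference to it.
    fn is_trashed(&self, pool: &SchedulerPool) -> bool {
        self.aborted || self.is_overgrown(pool)
    }

    fn is_overgrown(&self, pool: &SchedulerPool) -> bool {
        self.usage_queue_loader.count > pool.max_usage_queue_count
    }
}

// The cleaner loop already holds a Weak<SchedulerPool>; if the upgrade fails
// there is no pool to return anything to, so it can simply skip or exit.
fn clean_one(weak_pool: &Weak<SchedulerPool>, inner: &PooledSchedulerInner) -> bool {
    match weak_pool.upgrade() {
        Some(pool) => inner.is_trashed(&pool),
        None => false,
    }
}

fn main() {
    let pool = Arc::new(SchedulerPool { max_usage_queue_count: 4 });
    let inner = PooledSchedulerInner {
        usage_queue_loader: UsageQueueLoader { count: 10 },
        aborted: false,
    };
    assert!(clean_one(&Arc::downgrade(&pool), &inner)); // 10 > 4 => trashed
}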

@@ -262,11 +262,11 @@ clone_trait_object!(BankingPacketHandler);
 pub struct BankingStageHelper {
     usage_queue_loader: UsageQueueLoader,
     next_task_id: AtomicUsize,
-    new_task_sender: Sender<NewTaskPayload>,
+    new_task_sender: Weak<Sender<NewTaskPayload>>,

apfitzge (author), Apr 15, 2025:
this is the least invasive change to fix the circular dependency on the sender/recv connection in the scheduler.
We may want to restructure the scheduler thread to not hold the entire HandlerContext

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.0%. Comparing base (c5edcfb) to head (6b670c3).
Report is 15 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #5826     +/-   ##
=========================================
- Coverage    83.0%    83.0%   -0.1%     
=========================================
  Files         828      828             
  Lines      375510   375520     +10     
=========================================
- Hits       311857   311848      -9     
- Misses      63653    63672     +19     

@apfitzge apfitzge marked this pull request as ready for review April 15, 2025 16:58

@@ -296,6 +296,8 @@ impl BankingStageHelper {
 
     pub fn send_new_task(&self, task: Task) {
         self.new_task_sender
+            .upgrade()
+            .unwrap()

Reviewer:
should we instead just drop the task if the upgrade fails?

apfitzge (author):
I preserved the current behavior of panicking on failure to send; realistically we probably want to return an error and break whatever loops we're in wherever we send these

Reviewer:
yeah, I'm okay w/ following up on that separately
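
One possible shape for that follow-up, sketched with hypothetical names (SchedulerDisconnected and the simplified Task are illustration only, not from the PR):

use std::sync::mpsc::Sender;
use std::sync::Weak;

struct Task;

#[derive(Debug)]
struct SchedulerDisconnected;

struct BankingStageHelper {
    new_task_sender: Weak<Sender<Task>>,
}

impl BankingStageHelper {
    // Instead of panicking when the scheduler side is gone, surface an error
    // so the calling handler loop can break out cleanly.
    fn send_new_task(&self, task: Task) -> Result<(), SchedulerDisconnected> {
        let sender = self.new_task_sender.upgrade().ok_or(SchedulerDisconnected)?;
        sender.send(task).map_err(|_| SchedulerDisconnected)
    }
}

fn main() {
    // With no strong Arc<Sender> alive, sending now fails instead of panicking.
    let helper = BankingStageHelper { new_task_sender: Weak::new() };
    assert!(helper.send_new_task(Task).is_err());
}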
