ThreadManager: store Weak instead of Arc pool #5826
base: master
Conversation
.pool
    .upgrade()
    .map(|pool| self.usage_queue_loader.count() > pool.max_usage_queue_count)
    .unwrap_or_default()
AFAICT this value is used by the cleaner to decide whether to drop or return to the pool, so the return value here doesn't matter much if the pool no longer exists.
Might be good to call that out in a comment here or function header
I'm trying to think what we want to have happen here if we lose the handle to the pool... Do we want to consider it overgrown? Maybe this should be `unwrap_or(true)`?
It's an odd setup, and it just can't really happen afaict. Maybe we should just panic?

The current setup: `SchedulerPool::return_scheduler` -> `is_trashed` -> `is_overgrown`, so `is_overgrown` is never called (in non-test code at least) unless we have a `SchedulerPool`.

This makes me wonder... could we just pass the pool into `is_trashed`?

The only other use of the pool is in `return_to_pool`; can we just pass it there somehow? Ultimately it seems called by timeout listeners, which iirc are called by the cleaner loop, which has a weak reference to the pool itself and could pass it in.

That's a bit more restructuring, and I'm not 100% sure it would work, but it would certainly simplify the ownership and reference model: `ThreadManager` would just no longer have a `Pool` reference at all.
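For reference, a minimal self-contained sketch of the shape being discussed: a `Weak` pool handle with `unwrap_or_default()`, plus a header comment spelling out what happens when the pool is gone, as suggested above. The struct layout (everything collapsed onto one placeholder `ThreadManager`) and the `UsageQueueLoader` stub are illustrative, not the actual agave types.

```rust
use std::sync::Weak;

// Placeholder types standing in for the real pool and loader.
struct SchedulerPool {
    max_usage_queue_count: usize,
}

struct UsageQueueLoader;

impl UsageQueueLoader {
    fn count(&self) -> usize {
        0 // stub
    }
}

struct ThreadManager {
    pool: Weak<SchedulerPool>,
    usage_queue_loader: UsageQueueLoader,
}

impl ThreadManager {
    /// Returns whether the usage-queue loader has outgrown the pool's limit.
    ///
    /// If the pool has already been dropped, this returns `false`
    /// (`unwrap_or_default()`). The caller only uses the result to decide
    /// whether to drop or return to the pool, so the value is moot once the
    /// pool is gone; `unwrap_or(true)` would also be defensible.
    fn is_overgrown(&self) -> bool {
        self.pool
            .upgrade()
            .map(|pool| self.usage_queue_loader.count() > pool.max_usage_queue_count)
            .unwrap_or_default()
    }
}

fn main() {
    // `Weak::new()` never upgrades, simulating a pool that is already gone.
    let manager = ThreadManager {
        pool: Weak::new(),
        usage_queue_loader: UsageQueueLoader,
    };
    assert!(!manager.is_overgrown());
}
```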
@@ -262,11 +262,11 @@ clone_trait_object!(BankingPacketHandler);
 pub struct BankingStageHelper {
     usage_queue_loader: UsageQueueLoader,
     next_task_id: AtomicUsize,
-    new_task_sender: Sender<NewTaskPayload>,
+    new_task_sender: Weak<Sender<NewTaskPayload>>,
This is the least invasive change to fix the circular dependency on the sender/recv connection in the scheduler. We may want to restructure the scheduler thread to not hold the entire `HandlerContext`.
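A rough sketch of the ownership this change aims for, assuming the scheduler side keeps the only strong `Arc<Sender<..>>` and hands the helper a `Weak` via `Arc::downgrade`; `NewTaskPayload` is reduced to a unit struct and crossbeam-channel is assumed for the channel type, so this is not the real wiring:

```rust
use std::sync::{Arc, Weak};

use crossbeam_channel::{unbounded, Sender};

struct NewTaskPayload; // stand-in for the real payload enum

struct BankingStageHelper {
    new_task_sender: Weak<Sender<NewTaskPayload>>,
}

fn main() {
    let (sender, receiver) = unbounded::<NewTaskPayload>();

    // The scheduler side owns the only strong handle to the sender.
    let strong_sender = Arc::new(sender);

    // The helper only gets a Weak, so it does not keep the channel open.
    let helper = BankingStageHelper {
        new_task_sender: Arc::downgrade(&strong_sender),
    };

    // While the strong Arc is alive, the helper can upgrade and send.
    let upgraded = helper.new_task_sender.upgrade().expect("sender alive");
    upgraded.send(NewTaskPayload).unwrap();
    drop(upgraded);
    assert!(receiver.try_recv().is_ok());

    // Dropping the scheduler's strong Arc disconnects the channel even
    // though the helper's Weak still exists, so the receive loop can
    // observe the disconnect and exit.
    drop(strong_sender);
    assert!(helper.new_task_sender.upgrade().is_none());
    assert!(receiver.recv().is_err());
}
```

The point is that the thread holding `BankingStageHelper` no longer keeps its own channel alive, which is what breaks the cycle described in the PR description below.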
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff            @@
##           master    #5826     +/-   ##
=========================================
- Coverage    83.0%    83.0%    -0.1%
=========================================
  Files         828      828
  Lines      375510   375520      +10
=========================================
- Hits       311857   311848       -9
- Misses      63653    63672      +19
@@ -296,6 +296,8 @@ impl BankingStageHelper {

     pub fn send_new_task(&self, task: Task) {
         self.new_task_sender
+            .upgrade()
+            .unwrap()
should we instead just drop the task if the upgrade fails?
I preserved the current behavior of panicking on failure to send; realistically we probably want to return an error and break whatever loops we're in wherever we send these.
yeah, I'm okay w/ following up on that separately
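As a possible follow-up (explicitly not part of this PR), a fallible variant of `send_new_task` might look roughly like this; `try_send_new_task`, `SchedulerError`, and the simplified payload types are hypothetical names used only for illustration:

```rust
use std::sync::Weak;

use crossbeam_channel::Sender;

struct Task; // stand-in for the real task type
struct NewTaskPayload(Task); // stand-in for the real payload enum

#[derive(Debug)]
enum SchedulerError {
    // The Weak<Sender<..>> could not be upgraded: the scheduler is gone.
    SenderGone,
    // The channel itself was disconnected while sending.
    Disconnected,
}

struct BankingStageHelper {
    new_task_sender: Weak<Sender<NewTaskPayload>>,
}

impl BankingStageHelper {
    /// Fallible alternative to a panicking `send_new_task`: callers get an
    /// error they can use to break out of their send loops.
    fn try_send_new_task(&self, task: Task) -> Result<(), SchedulerError> {
        let sender = self
            .new_task_sender
            .upgrade()
            .ok_or(SchedulerError::SenderGone)?;
        sender
            .send(NewTaskPayload(task))
            .map_err(|_| SchedulerError::Disconnected)
    }
}

fn main() {
    // A helper whose sender is already gone: the call fails instead of panicking.
    let helper = BankingStageHelper {
        new_task_sender: Weak::new(),
    };
    assert!(matches!(
        helper.try_send_new_task(Task),
        Err(SchedulerError::SenderGone)
    ));
}
```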
Problems

Problem 1 - Circular Reference of `SchedulerPool`

- `solScCleaner` holds a weak reference to the `SchedulerPool` and will break out of its loop only if all `Arc`s of `SchedulerPool` are dropped
- `Arc<SchedulerPool>` is held in two places: `BankForks` and `ThreadManager`
- `ThreadManager` ownership chain: `PooledSchedulerInner` -> `PooledScheduler` -> `SchedulerPool` (circular reference)
- `ThreadManager` uses `pool` to check status and possibly return to the pool; in either case the `ThreadManager` can be modified to handle the pool no longer existing (see the toy sketch after the summary below)

Problem 2 - Circular Ownership model of `new_task_sender`

- `new_task_sender` has been owned by `ThreadManager`
- `new_task_sender` became owned via `BankingStageHelper`
- `BankingStageHelper` is held in the scheduler thread, so the scheduler thread never sees the `Err` received on channel disconnect (since it holds the sender itself!)
- When `ThreadManager` is `dropped`, it will attempt to join all threads, which never completes because the scheduler thread never exits

Summary of Changes

- `ThreadManager` modified to hold a `Weak` reference to the `SchedulerPool` instead of a strong reference.
- `BankingStageHelper` modified to hold a `Weak<Sender<..>>` so that the scheduler can exit if the actual `new_task_sender` is dropped (not the handler threads, which send retryable transactions).

Fixes #5435
Fixes #4211
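For illustration only, here is a stand-alone toy example (deliberately not the agave structs) of the mechanics behind Problem 1: an `Arc` cycle keeps the strong count from ever reaching zero, while a `Weak` back-reference lets the pool drop once its external owners are gone, which is what allows the cleaner's `upgrade()` to fail and its loop to exit.

```rust
use std::sync::{Arc, Mutex, Weak};

struct Member {
    // Weak back-reference to the pool: does not keep the pool alive.
    // If this were an Arc<Pool>, pool and member would keep each other
    // alive forever, which is the shape of the bug being fixed.
    pool: Weak<Pool>,
}

struct Pool {
    members: Mutex<Vec<Arc<Member>>>,
}

fn main() {
    let pool = Arc::new(Pool {
        members: Mutex::new(Vec::new()),
    });

    let member = Arc::new(Member {
        pool: Arc::downgrade(&pool),
    });
    pool.members.lock().unwrap().push(member);

    // Only the external handle is strong; the back-reference is weak,
    // so the strong count is not inflated by a cycle.
    assert_eq!(Arc::strong_count(&pool), 1);

    // Once the last external Arc is dropped, the pool really is freed,
    // and anything polling a Weak<Pool> (like the cleaner) sees None.
    let observer = Arc::downgrade(&pool);
    drop(pool);
    assert!(observer.upgrade().is_none());
}
```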
Testing

Ran `cargo test --package solana-local-cluster --test local_cluster -- test_snapshot_restart_tower --exact --show-output`.

Added some additional error logging for when the ReplayStage loop exits, BankForks is dropped, the unified-scheduler cleaner thread exits, etc.

At the end of the test, wait 10s after everything is dropped. If the cleaner is still logging at that point, it has stayed alive when it should not have.

Last few log lines from patched test:
Note: in the `master` logs, the `pool(STONG_COUNT)` still reads 2, indicating there are still 2 remaining strong counts. One is the upgraded pool used in the cleaning loop itself; the other is the stuck `ThreadManager`.