Skip to content

Finally introduce sane unified scheduler shutdown #5866

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

ryoqun
Copy link
Member

@ryoqun ryoqun commented Apr 17, 2025

Problem

I was lazy. There was wip shutdown impl buried in an ancient pr..

Summary of Changes

An alternative approach to #5826

fyi: salvaged from #593

@ryoqun ryoqun marked this pull request as draft April 17, 2025 14:34
@@ -644,6 +651,12 @@ impl ForkGraph for BankForks {
}
}

impl Drop for BankForks {
fn drop(&mut self) {
error!("BankForks::drop(): successfully dropped");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note to self: now that, ProgramCache's BankForks circular-referenc is gone. maybe i can wire prepare_to_drop from here to exercise the shutdown code-path (esp in tests!)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is done: 163015a

@ryoqun ryoqun requested a review from apfitzge April 17, 2025 14:47
@apfitzge apfitzge added the CI Pull Request is ready to enter CI label Apr 18, 2025
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Apr 18, 2025
Copy link

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit biased toward my solution in #5826 because I think have the strong circular references adds quite a bit of room for us to introduce a similar bug in the future whereas weak references make that harder to do.

However I think most of these changes make sense in addition to my PR, and give us stronger confidence in shutdown logic. AFAICT these two PRs are not conflicting, please let me know if you see a reason why this would not be the case.

drop(self.cluster_info);

self.poh_service.join().expect("poh_service");
macro_rules! join_then_log {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a separate change to assist in debugging thread-joining issues. Probably a good change to make, but not related to fixing this issue afiact?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. another piggy-bucked salvation from the ancient pr #593 by procrastinating me.

@@ -259,6 +259,13 @@ impl BankForks {
bank
}

pub fn prepare_to_drop(&mut self) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just do a custom drop?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is done: #5866 (comment)

@codecov-commenter
Copy link

codecov-commenter commented Apr 18, 2025

Codecov Report

Attention: Patch coverage is 94.40559% with 8 lines in your changes missing coverage. Please review.

Project coverage is 83.0%. Comparing base (0df4a2d) to head (163015a).
Report is 79 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5866   +/-   ##
=======================================
  Coverage    83.0%    83.0%           
=======================================
  Files         828      828           
  Lines      375691   375715   +24     
=======================================
+ Hits       311874   311921   +47     
+ Misses      63817    63794   -23     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@ryoqun
Copy link
Member Author

ryoqun commented Apr 21, 2025

I'm a bit biased toward my solution in #5826 because I think have the strong circular references adds quite a bit of room for us to introduce a similar bug in the future whereas weak references make that harder to do.

First of all, thanks for working on unified scheduler code.

To be clear, i have a strong preference regarding strong circular references and relaxing it by weak ones. but I'm not planning to push my idea to the team so hard.

So, if i cannot change your mind with this comment, I'm fine to merge #5826. it's written after good understanding of unified scheduler and it's indeed a good solution. I'm quite happy to see these kinds of prs from someone other than me.. :) And it reduces my work load for now. however, I might revisit it in the future, as i said at bottom of this wall of text (due to real perf reasons)... xD

Maybe, the difference of approaches in the two prs boils down to different coding stances.

I agree that a strong circular references should be avoided if possible and that it's error-prone. However, at the same time, i think this downside can be under control with strict coding of full of assert!()s and comments. Maybe, this stance is coming from my high-demanding assumption that code should be written with full and precise understanding of all of surrounding subsystems. ;)

Fyi, this bug is known (at least by me!) since start, and i still don't think local-cluster test (#5435) and production agave-validator binary is experiencing the reported shutdown problems directly due to these strong circular references. That's my justification i significantly lowered working on shipping this shutdown impl. For example, scheduler won't prevent BankFork from being Drop-ed as far as i run the local-cluster test locally on the base commit of this pr.

On the other hand, I agree weak circular references does indeed fix the pressing problem (if any!). However, i think it clouds system responsibilities. Namely, guarantee of ThreadManager and new_task_sender outlive relationship to others if upgrade() error condition (returned None) is squashed in some way. From my experience, tracking down bugs from these relaxed runtime invariant is very hard and time-consuming when this is happening outside unit tests.

Lastly, there's a rationale for strong circular references other than introduction of strict runtime invariant: performance. For new_task_sender, Weak::upgrade is similar to Arc::clone in terms of perf impact. It introduces another source of cache-line bouncing and memory indirection under many threads. and it's done per task, meaning quite large overhead (I don't have concrete synesthetic bench numbers off hand right now, though).

However I think most of these changes make sense in addition to my PR, and give us stronger confidence in shutdown logic. AFAICT these two PRs are not conflicting, please let me know if you see a reason why this would not be the case.

Yeah, we can mix the two technically. however, i think merging either one should be enough. I think we should stick to one philosophy and be consistent with it.

@ryoqun ryoqun requested a review from apfitzge April 22, 2025 01:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants