Start next ledger trigger timer after nomination accept #4688
base: release/v22.3.0
Conversation
This looks good to me! We get a lot of testing of this change "for free" with the vnext build.
I think supercluster testing would be valuable here, if you haven't already done so. Specifically to:
- Verify this actually reduces block times (I believe you've done this already), and
- Test the upgrade from protocol 22 to 23. I can't think of any potential issues, but it doesn't hurt to get more assurance.
std::optional<VirtualClock::time_point>
getNominationAccept(uint64_t slotIndex);
Nitpick: please add docs to this function.
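For illustration, a hedged sketch of what the documented declaration could look like. The surrounding class, the `recordNominationAccept` helper, and the use of `std::chrono::steady_clock` in place of stellar-core's `VirtualClock` are all assumptions made so the snippet is self-contained; only `getNominationAccept(uint64_t)` comes from the diff above.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <optional>

// Stand-in for VirtualClock::time_point, for illustration only.
using TimePoint = std::chrono::steady_clock::time_point;

class HerderSketch
{
  public:
    // Returns the time at which this node first voted to accept a
    // nomination for `slotIndex`, or std::nullopt if no nomination
    // has been accepted for that slot yet.
    std::optional<TimePoint>
    getNominationAccept(uint64_t slotIndex) const
    {
        auto it = mNominationAccept.find(slotIndex);
        if (it == mNominationAccept.end())
        {
            return std::nullopt;
        }
        return it->second;
    }

    // Hypothetical helper: records the first vote-to-accept for a slot.
    void
    recordNominationAccept(uint64_t slotIndex)
    {
        mNominationAccept.emplace(slotIndex,
                                  std::chrono::steady_clock::now());
    }

  private:
    std::map<uint64_t, TimePoint> mNominationAccept;
};
```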
- std::optional<VirtualClock::time_point> mNominationStart;
- std::optional<VirtualClock::time_point> mPrepareStart;
+ std::optional<VirtualClock::time_point> mNominationStart{};
+ std::optional<VirtualClock::time_point> mNominationAccept{};
It would probably be useful to export a metric for nominate accept - this way we know the delay to trigger.
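A minimal sketch of the quantity such a metric would report: the delay between first nomination and the vote to accept, which is now what gates the next-ledger trigger timer. The `NominationTiming` struct and `acceptDelay` name are hypothetical; stellar-core would export this through its medida metrics registry rather than a bare struct.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

// Hypothetical container for the two timestamps discussed above.
struct NominationTiming
{
    std::optional<std::chrono::steady_clock::time_point> start;
    std::optional<std::chrono::steady_clock::time_point> accept;

    // Delay from first nomination to first vote-to-accept, if both
    // timestamps have been recorded; this is the value worth exporting
    // as a metric so operators can see the delay to trigger.
    std::optional<std::chrono::milliseconds>
    acceptDelay() const
    {
        if (!start || !accept)
        {
            return std::nullopt;
        }
        return std::chrono::duration_cast<std::chrono::milliseconds>(
            *accept - *start);
    }
};
```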
Fixed
I've run some more tests on larger topologies of about 100 nodes (tier 1 + some watchers) and saw a modest improvement of about 200 ms in average block time. The test compared current release/v22.3 with this commit @ 900 TPS: control on left, changes on right.

[Chart: average ledger age across all pods with the acceptance timer change]

[Chart: average ledger age across all pods before the change]

It looks like we have slightly more timeouts with this change (probably because we start nomination earlier, so there's less "free" time before starting our timeout timer), but overall nomination latency and block time decrease. Once SSC is stable again, I'll run a pubnet simulation with the full topology.
Can you expand on this? I imagine you're talking about timeouts during nomination? Is it that you have large variance in the time it takes for the ballot protocol (between nodes), or that the time between "first nomination" and "ballot protocol starts" has a lot of variance?
The variance is between "first nomination" and "ballot protocol starts", mostly due to TX set flooding, which is the most expensive part of consensus from both a bandwidth and compute standpoint. The additional timeouts were rare, and only experienced by a few nodes, not the whole network. I think this probably happens because our timer logic has moved to being based a little less on global timing and a little more on local node performance. Before, we started the timer at ballot prepare, which is more strongly synchronized. Now, a node starts its timer when it votes to accept for the first time. This is still a rough synchronization point, since a node won't vote to accept until it's heard nominations from the network, but I think there is more variance in "first vote to accept" than there is in starting the ballot phase. Also, since we're moving up the point at which we start the timer, this might let fast nodes drift more than before, since there's less "dead time" spent spinning while waiting for the next ledger trigger.
Then it sounds like this change should probably be done after we ensure that SCP traffic does not get impacted by other traffic. Like I said above, if this ends up triggering more timeouts in nomination, it may have a pretty bad impact on the network in periods of high activity.
Description
Helps alleviate https://github.com/stellar/stellar-core-internal/issues/343.
This change makes validators base the next ledger trigger timer on nomination accept instead of prepare. Specifically, validators start the next ledger timer when they accept the first nomination message for the given ledger. Because we trigger at acceptance, there's still a rough synchronization point for the timer. Moving the timer trigger earlier in consensus should bring block times closer to the target 5s value.
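The scheduling described above can be sketched as follows. This is an illustrative model, not stellar-core's implementation: the function name `nextTriggerDelay` and the free-function shape are assumptions; the 5s target corresponds to the block-time target mentioned in the description.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Stand-in for stellar-core's VirtualClock, for illustration.
using Clock = std::chrono::steady_clock;

// Once a validator accepts the first nomination for a slot, it arms the
// next-ledger trigger timer so that roughly the target ledger time (5s)
// elapses from that acceptance point.
Clock::duration
nextTriggerDelay(Clock::time_point nominationAccept, Clock::time_point now,
                 Clock::duration targetLedgerTime)
{
    // Time already spent since the synchronization point (nomination accept).
    auto elapsed = now - nominationAccept;
    // Fire immediately if the target has already passed; otherwise wait
    // out the remainder of the target ledger time.
    return std::max(Clock::duration::zero(), targetLedgerTime - elapsed);
}
```

The earlier the synchronization point, the more of the ledger's work (TX set flooding, ballot protocol) overlaps with the timer countdown, which is why moving it from prepare to nomination accept shortens block times.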
Checklist
- clang-format v8.0.0 (via `make format` or the Visual Studio extension)