Start next ledger trigger timer after nomination accept #4688

Open

SirTyson wants to merge 1 commit into release/v22.3.0

Conversation

SirTyson
Contributor

Description

Helps alleviate https://github.com/stellar/stellar-core-internal/issues/343.

This change makes validators base the next ledger trigger timer on nomination accept instead of prepare. Specifically, validators start the next ledger timer when they accept the first nomination message for the given ledger. Because we trigger at acceptance, there's still a rough synchronization point for the timer. Moving the timer trigger earlier in consensus should bring block times closer to the target 5s value.
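
A minimal sketch of the idea (names and structure here are illustrative only, not the actual stellar-core code paths): the trigger timer is armed on the *first* nomination accept for a slot, and later accepts are ignored.

```cpp
#include <chrono>
#include <iostream>
#include <optional>

using Clock = std::chrono::steady_clock;

struct SlotTiming
{
    std::optional<Clock::time_point> nominationAccept;
};

// Arm the next-ledger trigger on the first nomination accept for the
// slot; subsequent accepts are no-ops. Returns the trigger deadline
// when the timer is armed. (Hypothetical helper, for illustration.)
std::optional<Clock::time_point>
onNominationAccept(SlotTiming& slot, std::chrono::seconds targetBlockTime)
{
    if (slot.nominationAccept)
    {
        return std::nullopt; // timer already armed for this slot
    }
    slot.nominationAccept = Clock::now();
    // Arming here (at accept) rather than at ballot prepare moves the
    // trigger earlier in consensus, pulling block times toward 5s.
    return *slot.nominationAccept + targetBlockTime;
}

int
main()
{
    SlotTiming slot;
    auto first = onNominationAccept(slot, std::chrono::seconds(5));
    auto second = onNominationAccept(slot, std::chrono::seconds(5));
    std::cout << first.has_value() << " " << second.has_value() << "\n"; // 1 0
}
```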

Checklist

  • Reviewed the contributing document
  • Rebased on top of master (no merge commits)
  • Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
  • Compiles
  • Ran all tests
  • If change impacts performance, include supporting evidence per the performance document

@anupsdf
Contributor

anupsdf commented Apr 11, 2025

No unit tests? Maybe AI can help write some.

Contributor

@bboston7 left a comment

This looks good to me! We get a lot of testing of this change "for free" with the vnext build.

I think supercluster testing would be valuable here, if you haven't already done so. Specifically to:

  1. Verify this actually reduces block times (I believe you've done this already), and
  2. Test the upgrade from protocol 22 to 23. I can't think of any potential issues, but it doesn't hurt to get more assurance.

Comment on lines +118 to +122
std::optional<VirtualClock::time_point>
getNominationAccept(uint64_t slotIndex);
Contributor

Nitpick: please add docs to this function.

-    std::optional<VirtualClock::time_point> mNominationStart;
-    std::optional<VirtualClock::time_point> mPrepareStart;
+    std::optional<VirtualClock::time_point> mNominationStart{};
+    std::optional<VirtualClock::time_point> mNominationAccept{};
Contributor

It would probably be useful to export a metric for nominate accept - this way we know the delay to trigger.
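
Something along these lines might work (a sketch only; the metric name "scp.timing.nomination-accept" and this helper are assumptions, modeled on the existing scp.timing.* medida timers):

```cpp
#include "medida/metrics_registry.h"
#include "medida/timer.h"
#include <chrono>

// Record the nomination-start -> first-accept delay as a medida timer.
// This delay approximates how long the next-ledger trigger timer waits
// relative to the start of nomination on this node.
void
recordNominationAcceptDelay(
    medida::MetricsRegistry& metrics,
    std::chrono::steady_clock::time_point nominationStart,
    std::chrono::steady_clock::time_point nominationAccept)
{
    auto& timer = metrics.NewTimer({"scp", "timing", "nomination-accept"});
    timer.Update(std::chrono::duration_cast<std::chrono::nanoseconds>(
        nominationAccept - nominationStart));
}
```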

Contributor Author

Fixed

@SirTyson
Contributor Author

I've run some more tests on larger topologies of about 100 nodes (tier 1 + some watchers) and saw a modest improvement of about 200 ms in average block time. The test compared the current release/v22.3 with this commit @ 900 TPS:

Control on left, changes on right.

[image]

[image]

Average ledger age across all pods with acceptance timer change:

[image]

@SirTyson
Contributor Author

SirTyson commented Apr 28, 2025

Average ledger age across all pods before change:

[screenshot]

It looks like we have slightly more timeouts with this change (probably because we start nomination earlier, so there's less "free" time before starting our timeout timer), but overall nomination latency and block time decrease. Once SSC is stable again, I'll run pubnet simulation with the full topology.

@MonsieurNicolas
Contributor

> It looks like we have slightly more timeouts with this change (probably because we start nomination earlier, so there's less "free" time before starting our timeout timer), but overall nomination latency and block time decrease. Once SSC is stable again, I'll run pubnet simulation with the full topology.

Can you expand on this? I imagine you're talking about timeouts during nomination?
We need to understand this better: timeouts during nomination are the worst kind of timeouts, as they imply picking a new leader (and therefore flooding more transaction sets, which is very expensive).

Is it that you have large variance in the time it takes for the ballot protocol (between nodes) or that the time between "first nomination" and "ballot protocol starts" has a lot of variance?
The timeout could also be observed only on a very small number of nodes, which should not be too much of a problem.

@SirTyson
Contributor Author

> Is it that you have large variance in the time it takes for the ballot protocol (between nodes) or that the time between "first nomination" and "ballot protocol starts" has a lot of variance?

The variance is between "first nomination" and "ballot protocol starts", mostly due to TX set flooding, which is the most expensive part of consensus from both a bandwidth and a compute standpoint. The additional timeouts were rare and experienced by only a few nodes, not the whole network.

I think this happens because our timer logic is now based a little less on global timing and a little more on local node performance. Before, we started the timer at ballot prepare, which is more strongly synchronized across the network. Now, a node starts its timer when it votes to accept for the first time. This is still a rough synchronization point, since a node won't vote to accept until it has heard nominations from the network, but I think there is more variance in "first vote to accept" than in entering the ballot phase. Also, since we've moved up the point at which the timer starts, fast nodes may drift more than before, because there's less "dead time" spent spinning while waiting for the next ledger trigger.

@MonsieurNicolas
Contributor

Then it sounds like this change should probably be done after we ensure that SCP traffic does not get impacted by other traffic. Like I said above, if this ends up triggering more timeouts in nomination, this may have some pretty bad impact on the network in periods of high activity.
