Start next ledger trigger timer after nomination accept #4688

Open

SirTyson wants to merge 1 commit into release/v22.3.0

Conversation

SirTyson
Contributor

Description

Helps alleviate https://github.com/stellar/stellar-core-internal/issues/343.

This change makes validators base the next ledger trigger timer on nomination accept instead of prepare. Specifically, validators start the next ledger timer when they accept the first nomination message for the given ledger. Because we trigger at acceptance, there's still a rough synchronization point for the timer. Moving the timer trigger earlier in consensus should bring block times closer to the target 5s value.
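
A minimal sketch of the idea (names and structure here are illustrative only, not the actual stellar-core code paths): the trigger timer is armed on the *first* nomination accept for a slot, and later accepts are ignored.

```cpp
#include <chrono>
#include <iostream>
#include <optional>

using Clock = std::chrono::steady_clock;

struct SlotTiming
{
    std::optional<Clock::time_point> nominationAccept;
};

// Arm the next-ledger trigger on the first nomination accept for the
// slot; subsequent accepts are no-ops. Returns the trigger deadline
// when the timer is armed. (Hypothetical helper, for illustration.)
std::optional<Clock::time_point>
onNominationAccept(SlotTiming& slot, std::chrono::seconds targetBlockTime)
{
    if (slot.nominationAccept)
    {
        return std::nullopt; // timer already armed for this slot
    }
    slot.nominationAccept = Clock::now();
    // Arming here (at accept) rather than at ballot prepare moves the
    // trigger earlier in consensus, pulling block times toward 5s.
    return *slot.nominationAccept + targetBlockTime;
}

int
main()
{
    SlotTiming slot;
    auto first = onNominationAccept(slot, std::chrono::seconds(5));
    auto second = onNominationAccept(slot, std::chrono::seconds(5));
    std::cout << first.has_value() << " " << second.has_value() << "\n"; // 1 0
}
```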

Checklist

  • Reviewed the contributing document
  • Rebased on top of master (no merge commits)
  • Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
  • Compiles
  • Ran all tests
  • If change impacts performance, include supporting evidence per the performance document

@anupsdf
Contributor

anupsdf commented Apr 11, 2025

No unit tests? Maybe AI can help write some.

Contributor

@bboston7 left a comment

This looks good to me! We get a lot of testing of this change "for free" with the vnext build.

I think supercluster testing would be valuable here, if you haven't already done so. Specifically to:

  1. Verify this actually reduces block times (I believe you've done this already), and
  2. Test the upgrade from protocol 22 to 23. I can't think of any potential issues, but it doesn't hurt to get more assurance.

Comment on lines +118 to +122
std::optional<VirtualClock::time_point>
getNominationAccept(uint64_t slotIndex);
Contributor

Nitpick: please add docs to this function.

-    std::optional<VirtualClock::time_point> mNominationStart;
-    std::optional<VirtualClock::time_point> mPrepareStart;
+    std::optional<VirtualClock::time_point> mNominationStart{};
+    std::optional<VirtualClock::time_point> mNominationAccept{};
Contributor

It would probably be useful to export a metric for nominate accept - this way we know the delay to trigger.
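
Something along these lines might work (a sketch only; the metric name "scp.timing.nomination-accept" and this helper are assumptions, modeled on the existing scp.timing.* medida timers):

```cpp
#include "medida/metrics_registry.h"
#include "medida/timer.h"
#include <chrono>

// Record the nomination-start -> first-accept delay as a medida timer.
// This delay approximates how long the next-ledger trigger timer waits
// relative to the start of nomination on this node.
void
recordNominationAcceptDelay(
    medida::MetricsRegistry& metrics,
    std::chrono::steady_clock::time_point nominationStart,
    std::chrono::steady_clock::time_point nominationAccept)
{
    auto& timer = metrics.NewTimer({"scp", "timing", "nomination-accept"});
    timer.Update(std::chrono::duration_cast<std::chrono::nanoseconds>(
        nominationAccept - nominationStart));
}
```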

Contributor Author

Fixed

@SirTyson
Contributor Author

I've run some more tests on larger topologies of about 100 nodes (tier 1 + some watchers) and saw a modest improvement of about 200 ms in average block time. The test compared the current release/v22.3 with this commit @ 900 TPS:

Control on left, changes on right.

[image]

[image]

Average ledger age across all pods with acceptance timer change:

[image]

@SirTyson
Contributor Author

SirTyson commented Apr 28, 2025

Average ledger age across all pods before change:

[screenshot]

It looks like we have slightly more timeouts with this change (probably because we start nomination earlier, so there's less "free" time before starting our timeout timer), but overall nomination latency and block time decrease. Once SSC is stable again, I'll run pubnet simulation with the full topology.

@MonsieurNicolas
Contributor

> It looks like we have slightly more timeouts with this change (probably because we start nomination earlier, so there's less "free" time before starting our timeout timer), but overall nomination latency and block time decrease. Once SSC is stable again, I'll run pubnet simulation with the full topology.

Can you expand on this? I imagine you're talking about timeouts during nomination?
We need to understand this better: timeouts during nomination are the worst kind of timeouts, as they imply picking a new leader (and therefore flooding more transaction sets, which is very expensive).

Is it that you have large variance in the time it takes for the ballot protocol (between nodes) or that the time between "first nomination" and "ballot protocol starts" has a lot of variance?
The timeout could also be observed only on a very small number of nodes, which should not be too much of a problem.

@SirTyson
Contributor Author

> Is it that you have large variance in the time it takes for the ballot protocol (between nodes) or that the time between "first nomination" and "ballot protocol starts" has a lot of variance?

The variance is between "first nomination" and "ballot protocol starts", mostly due to TX set flooding, which is the most expensive part of consensus from both a bandwidth and a compute standpoint. The additional timeouts were rare and experienced by only a few nodes, not the whole network.

I think this happens because our timer logic is now based a little less on global timing and a little more on local node performance. Before, we started the timer at ballot prepare, which is more strongly synchronized across the network. Now, a node starts its timer when it votes to accept for the first time. This is still a rough synchronization point, since a node won't vote to accept until it has heard nominations from the network, but I think there is more variance in "first vote to accept" than in entering the ballot phase. Also, since we've moved up the point at which the timer starts, fast nodes may drift more than before, because there's less "dead time" spent spinning while waiting for the next ledger trigger.

@MonsieurNicolas
Contributor

Then it sounds like this change should probably be done after we ensure that SCP traffic does not get impacted by other traffic. Like I said above, if this ends up triggering more timeouts in nomination, this may have some pretty bad impact on the network in periods of high activity.
