
Conversation

@alindima (Contributor) commented on May 15, 2025

Implements the CollationManager and the new collator protocol (validator side) subsystem.

Related issues: #8182 and #7752.

These are the big remaining parts which would enable us to test the entire implementation.

TODO:

After merging:

  • versi testing

Uses a slightly modified version of the ClaimQueueState written by @tdimitrov in #7114.

@sandreim (Contributor)

I agree, we discussed it somewhere here but there are too many comments and I can't find it anymore. Long story short - right now we use 300ms for the stopgap, I say we use the same value here.

Yeah, 300 sounds good to me, I had the same concern.

MAX_STORED_SCORES_PER_PARA .. looks unnecessarily low?

Atm this is 150 collators per parachain. Isn't that enough? Keeping too many scores sounds wasteful to me.

No chains have that many now, but who knows. If it happens, it would be a hassle to upgrade since you need to upgrade validators. Having a higher value here does not cost us anything now.

INVALID_COLLATION_SLASH ... is there any reason this can happen honestly? If not, I think we can punish more - essentially bring the score back to 0 immediately.

Not sure we can make a distinction between a bug and malicious behaviour. I'd go for maximum punishment, vs FAILED_FETCH_SLASH which should be less of a punishment.

@tdimitrov (Contributor)

Please have a look at 5505d51

What it does:

  • VALID_INCLUDED_CANDIDATE_BUMP: 50 -> 100
  • MAX_STORED_SCORES_PER_PARA: 150 -> 1000
  • INSTANT_FETCH_REP_THRESHOLD: 1000 -> 1800
  • UNDER_THRESHOLD_FETCH_DELAY: 1000ms -> 300ms
  • MIN_FETCH_TIMER_DELAY: 500ms -> 150ms

The main motivation for the parameter changes is to make sure the setup can handle 100 collators (here I've shown why 50 is the max).
The most straightforward way was to make INACTIVITY_DECAY 0.5, but I wanted to avoid playing with rational numbers, so instead I've bumped the score a collator gets for submitting a valid collation. This gave a nice "Time to reach threshold" curve similar to the one with INACTIVITY_DECAY=0.5 but shifted downwards, meaning that a collator would reach the instant fetch threshold two times faster (in ~2 minutes instead of 4). So to keep the initial behaviour (~4 minutes) I've bumped INSTANT_FETCH_REP_THRESHOLD to 1800.
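For reference, here is a minimal sketch of the changed values as Rust constants. Only VALID_INCLUDED_CANDIDATE_BUMP and MAX_SCORE appear with these exact declarations further down in the review; the remaining types are assumptions for illustration.

```rust
use std::time::Duration;

pub const MAX_SCORE: u16 = 35_000;
pub const VALID_INCLUDED_CANDIDATE_BUMP: u16 = 100;
pub const MAX_STORED_SCORES_PER_PARA: usize = 1_000;
pub const INSTANT_FETCH_REP_THRESHOLD: u16 = 1_800;
pub const UNDER_THRESHOLD_FETCH_DELAY: Duration = Duration::from_millis(300);
pub const MIN_FETCH_TIMER_DELAY: Duration = Duration::from_millis(150);

fn main() {
    // Ignoring the inactivity decay, a collator needs at least 18 included
    // candidates (1800 / 100) before it crosses the instant-fetch threshold.
    let included_needed = INSTANT_FETCH_REP_THRESHOLD / VALID_INCLUDED_CANDIDATE_BUMP;
    assert_eq!(included_needed, 18);
}
```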

The results:

[Chart: time to reach threshold score]

The yellow line is the implemented behaviour. Blue - the current one. Red - the one where the only change is decay=0.5.

And for completeness, the 'net score' - in theory we should be able to handle 100 collators, but the time to reach the threshold score is 6 hours:

[Chart: net score]

@eskimor (Member) left a comment

There is an issue with the fetch logic: We prioritize time over reputation. If an anonymous peer sent us a collation and the fetch delay elapsed, we will fetch the anonymous peer collation, instead of any that might have arrived which already have some reputation.

The idea of the delay was to give higher rep collators a chance to make the validator aware, before it is already busy with processing a garbage collation from an anonymous dude. What we have instead: Yes we already learned about a better peer, but we ignore him regardless.

It also seems meaningless to track reputation among peers < instant_fetch_threshold, as we will fetch from the earliest peer regardless. I think the logic can be significantly simplified and improved.

E.g. instead of a delay from receiving the first advertisement, we wait for e.g. 300ms from the time we received the scheduling/relay parent and then simply sieve through the advertisements and pick the highest rep. If we want to be fancy, we could adjust the delay dynamically, something like: total wait = 300ms * (1 - highest received rep / max rep).

So if we are, for example, 150ms in when we receive a max_rep/2 advertisement, we would stop immediately and take that. If we receive a highest_possible_rep advertisement immediately after the leaf, there is no point in waiting at all - nothing better will come.

The idea would be that by just adding some delay after having seen the scheduling parent/leaf, we account for global network latency - so high-rep but far-away collators have the same chances. We just make it an even playing field for everybody, but we would not add additional latency for a single advertisement that already came in quite late.
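A minimal sketch of the dynamic-delay idea above, assuming the 300ms base delay under discussion and the MAX_SCORE constant from this PR; the function name and integer-millisecond arithmetic are illustrative, not the actual implementation:

```rust
use std::time::Duration;

/// Maximum possible collator reputation (MAX_SCORE from the constants in this PR).
const MAX_SCORE: u64 = 35_000;
/// Base wait after seeing the scheduling relay parent/leaf (the 300ms discussed above).
const BASE_FETCH_DELAY_MS: u64 = 300;

/// total_wait = 300ms * (1 - highest_received_rep / max_rep), in integer milliseconds.
fn dynamic_fetch_delay(highest_received_rep: u64) -> Duration {
    let rep = highest_received_rep.min(MAX_SCORE);
    Duration::from_millis(BASE_FETCH_DELAY_MS * (MAX_SCORE - rep) / MAX_SCORE)
}

fn main() {
    // A max_rep/2 advertisement halves the wait: stop after 150ms instead of 300ms.
    assert_eq!(dynamic_fetch_delay(MAX_SCORE / 2), Duration::from_millis(150));
    // A max-reputation advertisement removes the delay entirely.
    assert_eq!(dynamic_fetch_delay(MAX_SCORE), Duration::ZERO);
    // An unknown (zero-reputation) collator waits the full base delay.
    assert_eq!(dynamic_fetch_delay(0), Duration::from_millis(300));
}
```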

@eskimor (Member) commented on Jan 14, 2026

I think I found another issue, a race: If a collator provides a collation for a leaf that is about to go out of scope, it would not receive a punishment for providing an invalid collation, because by the time we receive the invalid message from backing, we already removed the peer state.

@sandreim (Contributor)

There is an issue with the fetch logic: We prioritize time over reputation. If an anonymous peer sent us a collation and the fetch delay elapsed, we will fetch the anonymous peer collation, instead of any that might have arrived which already have some reputation.

I had a suggestion on this on a different PR. I think parametrising it correctly would solve the problem you describe while also making the code simpler. WDYT @eskimor?

@eskimor (Member) commented on Jan 15, 2026

There is an issue with the fetch logic: We prioritize time over reputation. If an anonymous peer sent us a collation and the fetch delay elapsed, we will fetch the anonymous peer collation, instead of any that might have arrived which already have some reputation.

I had a suggestion on this on a different PR. I think parametrising it correctly would solve the problem you describe while also making the code simpler. WDYT @eskimor?

Yep, looks very similar to what I had in mind too. 👍

timestamp: adv_timestamp,
})
})
.collect::<BTreeSet<_>>();
Contributor


Can we just call min() here?
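A minimal sketch of the suggestion, assuming the collected set is only needed for its earliest element; the Advertisement struct and field names here are hypothetical:

```rust
use std::time::Instant;

// Hypothetical advertisement type; ordering by timestamp first means `min()` yields
// the earliest advertisement, i.e. the same element a BTreeSet would put first.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Advertisement {
    timestamp: Instant,
    collator_id: u32,
}

fn main() {
    let now = Instant::now();
    let advertisements = vec![
        Advertisement { timestamp: now + std::time::Duration::from_millis(5), collator_id: 2 },
        Advertisement { timestamp: now, collator_id: 1 },
    ];
    // Instead of `.collect::<BTreeSet<_>>()` and then taking the first element,
    // a single `min()` pass avoids allocating the whole set.
    let earliest = advertisements.into_iter().min();
    assert_eq!(earliest.map(|adv| adv.collator_id), Some(1));
}
```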

@paritytech-workflow-stopper

All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/21281850629
Failed job name: cargo-clippy

@alindima (Contributor, Author) left a comment

LGTM, I'm happy with this. Some more metrics and tests could be added, but those can be follow-ups.

Cannot approve since I'm the original author, but consider this my approval :D

mod tests;

pub use metrics::Metrics;
pub use crate::validator_side_metrics::Metrics;
Contributor Author


we will probably need to add different metrics depending on the subsystem variant. It's good to have the common ones deduplicated, but I'd also add some more for the new version

Contributor Author


I'd like us to have finer-grained metrics, including timers for all operations. But this can be a follow-up, to avoid keeping this monster open forever
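For illustration, a minimal sketch of what one such per-operation timer could look like, using the prometheus crate types already present in this file; the metric names, help texts, and registration helper here are hypothetical and not part of this PR:

```rust
use prometheus::{Histogram, HistogramOpts, Registry};

/// Metrics struct extended with a hypothetical per-operation timer.
struct MetricsInner {
    collation_request_duration: Histogram,
    // Hypothetical new timer: time spent handling a single collation advertisement.
    handle_advertisement_duration: Histogram,
}

fn register_metrics(registry: &Registry) -> prometheus::Result<MetricsInner> {
    let collation_request_duration = Histogram::with_opts(HistogramOpts::new(
        "polkadot_parachain_collation_request_duration",
        "Time spent fetching a collation from a collator (hypothetical help text).",
    ))?;
    registry.register(Box::new(collation_request_duration.clone()))?;

    let handle_advertisement_duration = Histogram::with_opts(HistogramOpts::new(
        "polkadot_parachain_handle_advertisement_duration",
        "Time spent handling a single collation advertisement (hypothetical).",
    ))?;
    registry.register(Box::new(handle_advertisement_duration.clone()))?;

    Ok(MetricsInner { collation_request_duration, handle_advertisement_duration })
}

fn main() -> prometheus::Result<()> {
    let registry = Registry::new();
    let metrics = register_metrics(&registry)?;
    // Usage: the timer records the elapsed seconds into the histogram on drop.
    let _timer = metrics.handle_advertisement_duration.start_timer();
    Ok(())
}
```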

Contributor


I preferred not to add metrics which we won't use, so I just transferred what we already have.
I agree that in tough situations extra metrics will be useful, so it can be a follow-up - I didn't close #10402 for this reason.

handle_collation_request_result: prometheus::Histogram,
collator_peer_count: prometheus::Gauge<prometheus::U64>,
collation_request_duration: prometheus::Histogram,
// TODO: Not available for the new implementation. Remove with the old implementation.
Contributor Author


conceptually, requesting unblocked collations is present in both variants

doc:
  - audience: Node Operator
    description: |-
      This PR adds a new collator protocol (validator side) subsystem.
Contributor Author


Probably deserves a bit more :D

@eskimor (Member) left a comment

The following improvements can be follow-ups, but the constants should really be fixed, or properly argued for why they make sense as they are.

Follow up improvements (not for this PR, but so that they are noted somewhere):

  • Parallel fetch after some small timeout: this is implemented in the legacy implementation and should be brought back & improved. It is a fix for us not having proper streaming and can help a great deal to mitigate the impact of fetch attacks - @tdimitrov knows the details.
  • Negative reputation bump on fetch problems: we might not be able to punish hard for single issues (as network issues can happen to honest nodes - measurements would be good), but if implemented properly (e.g. on top of the parallel fetch above) any real harm will only come from coordinated attacks, so we should look into punishing harder on coordination.
  • Race condition fix
  • Timer should start at leaf activation - parallel fetches should likely alter that behaviour even more (we can likely be more aggressive in fetching if we have parallel fetches): instead of waiting and doing nothing, we might as well fetch what is there, since we can fetch more if it arrives.
  • Possibly others from my chat with @tdimitrov

Fix for this PR: Get constants in order - or have a proper argument why they are good as is.

/// saturated to this value.
pub const MAX_SCORE: u16 = 35_000;
/// Reputation bump for getting a valid candidate included in a finalized block.
pub const VALID_INCLUDED_CANDIDATE_BUMP: u16 = 100;
Member


This is not what we agreed on. OK, I just found the reasoning above for the value - it only argues along one dimension (against the inactivity decay, which seems to be causing more problems than it solves), but not along the other, more important axis: the relationship to negative reputation changes. With regard to those, this value is completely off.

@eskimor (Member) left a comment

On second thought, let's finally get this PR merged. Any further fixes can come in a follow-up.


Labels

T8-polkadot This PR/Issue is related to/affects the Polkadot network. T18-zombienet_tests Trigger zombienet CI tests.
