
Conversation

@alindima (Contributor) commented on May 15, 2025

Implements the CollationManager and the new collator protocol (validator side) subsystem.

Related issues: #8182 and #7752.

These are the big remaining parts which would enable us to test the entire implementation.

TODO:

After merging:

  • versi testing

Uses a slightly modified version of the ClaimQueueState written by @tdimitrov in #7114.

@sandreim (Contributor)

I agree, we discussed it somewhere here but there are too many comments and I can't find it anymore. Long story short - right now we use 300ms for the stopgap, I say we use the same value here.

Yeah, 300 sounds good to me, I had the same concern.

MAX_STORED_SCORES_PER_PARA .. looks unnecessarily low?

Atm this is 150 collators per parachain. Isn't that enough? Keeping too many scores sounds wasteful to me.

No chains have that many now, but who knows. If it happens, it would be a hassle to upgrade since you need to upgrade validators. Having a higher value here does not cost us anything now.

INVALID_COLLATION_SLASH ... is there any reason this can happen honestly? If not, I think we can punish more - essentially bring the score back to 0 immediately.

Not sure we can make a distinction between a bug and malicious behaviour. I'd go for maximum punishment, vs FAILED_FETCH_SLASH which should be less of a punishment.

@tdimitrov (Contributor)

Please have a look at 5505d51

What it does:

  • VALID_INCLUDED_CANDIDATE_BUMP: 50 -> 100
  • MAX_STORED_SCORES_PER_PARA: 150 -> 1000
  • INSTANT_FETCH_REP_THRESHOLD: 1000 -> 1800
  • UNDER_THRESHOLD_FETCH_DELAY: 1000ms -> 300ms
  • MIN_FETCH_TIMER_DELAY: 500ms -> 150ms

The main motivation for the parameter changes is to make sure the setup can handle 100 collators (here I've shown why 50 is the max).
The most straightforward way was to make INACTIVITY_DECAY 0.5, but I wanted to avoid playing with rational numbers, so instead I've bumped the score a collator gets for submitting a valid collation. This gave a nice "Time to reach threshold" curve similar to the one with INACTIVITY_DECAY=0.5 but shifted downwards, meaning that a collator would reach the instant fetch threshold two times faster (in ~2 minutes instead of 4). So to keep the initial behaviour (~4 minutes) I've bumped INSTANT_FETCH_REP_THRESHOLD to 1800.
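For reference, here is a minimal sketch of the changed values as Rust constants. Only VALID_INCLUDED_CANDIDATE_BUMP and MAX_SCORE appear with these exact declarations further down in the review; the remaining types are assumptions for illustration.

```rust
use std::time::Duration;

pub const MAX_SCORE: u16 = 35_000;
pub const VALID_INCLUDED_CANDIDATE_BUMP: u16 = 100;
pub const MAX_STORED_SCORES_PER_PARA: usize = 1_000;
pub const INSTANT_FETCH_REP_THRESHOLD: u16 = 1_800;
pub const UNDER_THRESHOLD_FETCH_DELAY: Duration = Duration::from_millis(300);
pub const MIN_FETCH_TIMER_DELAY: Duration = Duration::from_millis(150);

fn main() {
    // Ignoring the inactivity decay, a collator needs at least 18 included
    // candidates (1800 / 100) before it crosses the instant-fetch threshold.
    let included_needed = INSTANT_FETCH_REP_THRESHOLD / VALID_INCLUDED_CANDIDATE_BUMP;
    assert_eq!(included_needed, 18);
}
```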

The results:

[Chart: time to reach threshold score]

The yellow line is the implemented behaviour. Blue - the current one. Red - the one where the only change is decay=0.5.

And for completeness, the 'net score' - in theory we should be able to handle 100 collators, but the time to reach the threshold score is 6 hours:

[Chart: net score]

@eskimor (Member) left a comment

There is an issue with the fetch logic: We prioritize time over reputation. If an anonymous peer sent us a collation and the fetch delay elapsed, we will fetch the anonymous peer collation, instead of any that might have arrived which already have some reputation.

The idea of the delay was to give higher rep collators a chance to make the validator aware, before it is already busy with processing a garbage collation from an anonymous dude. What we have instead: Yes we already learned about a better peer, but we ignore him regardless.

It also seems meaningless to track reputation among peers < instant_fetch_threshold, as we will fetch from the earliest peer regardless. I think the logic can be significantly simplified and improved.

E.g. instead of a delay from receiving the first advertisement, we wait for e.g. 300ms from the time we received the scheduling/relay parent and then simply sieve through the advertisements and pick the highest rep. If we want to be fancy, we could adjust the delay dynamically, something like: total wait = 300ms * (1 - highest received rep / max rep).

So if we are, for example, 150ms in when we receive a max_rep/2 advertisement, we would stop immediately and take that. If we receive a highest_possible_rep advertisement immediately after the leaf, there is no point in waiting at all - nothing better will come.

The idea would be that by just adding some delay after having seen the scheduling parent/leaf, we account for global network latency - so high-rep but far-away collators have the same chances. We just make it an even playing field for everybody, but we would not add additional latency for a single advertisement that already came in quite late.
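A minimal sketch of the dynamic-delay idea above, assuming the 300ms base delay under discussion and the MAX_SCORE constant from this PR; the function name and integer-millisecond arithmetic are illustrative, not the actual implementation:

```rust
use std::time::Duration;

/// Maximum possible collator reputation (MAX_SCORE from the constants in this PR).
const MAX_SCORE: u64 = 35_000;
/// Base wait after seeing the scheduling relay parent/leaf (the 300ms discussed above).
const BASE_FETCH_DELAY_MS: u64 = 300;

/// total_wait = 300ms * (1 - highest_received_rep / max_rep), in integer milliseconds.
fn dynamic_fetch_delay(highest_received_rep: u64) -> Duration {
    let rep = highest_received_rep.min(MAX_SCORE);
    Duration::from_millis(BASE_FETCH_DELAY_MS * (MAX_SCORE - rep) / MAX_SCORE)
}

fn main() {
    // A max_rep/2 advertisement halves the wait: stop after 150ms instead of 300ms.
    assert_eq!(dynamic_fetch_delay(MAX_SCORE / 2), Duration::from_millis(150));
    // A max-reputation advertisement removes the delay entirely.
    assert_eq!(dynamic_fetch_delay(MAX_SCORE), Duration::ZERO);
    // An unknown (zero-reputation) collator waits the full base delay.
    assert_eq!(dynamic_fetch_delay(0), Duration::from_millis(300));
}
```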

@eskimor (Member) commented on Jan 14, 2026

I think I found another issue, a race: If a collator provides a collation for a leaf that is about to go out of scope, it would not receive a punishment for providing an invalid collation, because by the time we receive the invalid message from backing, we already removed the peer state.

@sandreim (Contributor)

There is an issue with the fetch logic: We prioritize time over reputation. If an anonymous peer sent us a collation and the fetch delay elapsed, we will fetch the anonymous peer collation, instead of any that might have arrived which already have some reputation.

I had a suggestion on this on a different PR. I think parametrising it correctly would solve the problem you describe while also making the code simpler. WDYT @eskimor?

@eskimor (Member) commented on Jan 15, 2026

There is an issue with the fetch logic: We prioritize time over reputation. If an anonymous peer sent us a collation and the fetch delay elapsed, we will fetch the anonymous peer collation, instead of any that might have arrived which already have some reputation.

I had a suggestion on this on a different PR. I think parametrising it correctly would solve the problem you describe while also making the code simpler. WDYT @eskimor?

Yep, looks very similar to what I had in mind too. 👍

timestamp: adv_timestamp,
})
})
.collect::<BTreeSet<_>>();
Contributor


Can we just call min() here?
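A minimal sketch of the suggestion, assuming the collected set is only needed for its earliest element; the Advertisement struct and field names here are hypothetical:

```rust
use std::time::Instant;

// Hypothetical advertisement type; ordering by timestamp first means `min()` yields
// the earliest advertisement, i.e. the same element a BTreeSet would put first.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Advertisement {
    timestamp: Instant,
    collator_id: u32,
}

fn main() {
    let now = Instant::now();
    let advertisements = vec![
        Advertisement { timestamp: now + std::time::Duration::from_millis(5), collator_id: 2 },
        Advertisement { timestamp: now, collator_id: 1 },
    ];
    // Instead of `.collect::<BTreeSet<_>>()` and then taking the first element,
    // a single `min()` pass avoids allocating the whole set.
    let earliest = advertisements.into_iter().min();
    assert_eq!(earliest.map(|adv| adv.collator_id), Some(1));
}
```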

@paritytech-workflow-stopper

All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/21281850629
Failed job name: cargo-clippy

@alindima (Contributor, Author) left a comment

LGTM, I'm happy with this. Some more metrics and tests could be added, but those can be follow-ups.

Cannot approve since I'm the original author, but consider this my approval :D

mod tests;

pub use metrics::Metrics;
pub use crate::validator_side_metrics::Metrics;
Contributor Author


we will probably need to add different metrics depending on the subsystem variant. It's good to have the common ones deduplicated, but I'd also add some more for the new version

Contributor Author


I'd like us to have finer-grained metrics, including timers for all operations. But this can be a follow-up, to avoid keeping this monster open forever
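For illustration, a minimal sketch of what one such per-operation timer could look like, using the prometheus crate types already present in this file; the metric names, help texts, and registration helper here are hypothetical and not part of this PR:

```rust
use prometheus::{Histogram, HistogramOpts, Registry};

/// Metrics struct extended with a hypothetical per-operation timer.
struct MetricsInner {
    collation_request_duration: Histogram,
    // Hypothetical new timer: time spent handling a single collation advertisement.
    handle_advertisement_duration: Histogram,
}

fn register_metrics(registry: &Registry) -> prometheus::Result<MetricsInner> {
    let collation_request_duration = Histogram::with_opts(HistogramOpts::new(
        "polkadot_parachain_collation_request_duration",
        "Time spent fetching a collation from a collator (hypothetical help text).",
    ))?;
    registry.register(Box::new(collation_request_duration.clone()))?;

    let handle_advertisement_duration = Histogram::with_opts(HistogramOpts::new(
        "polkadot_parachain_handle_advertisement_duration",
        "Time spent handling a single collation advertisement (hypothetical).",
    ))?;
    registry.register(Box::new(handle_advertisement_duration.clone()))?;

    Ok(MetricsInner { collation_request_duration, handle_advertisement_duration })
}

fn main() -> prometheus::Result<()> {
    let registry = Registry::new();
    let metrics = register_metrics(&registry)?;
    // Usage: the timer records the elapsed seconds into the histogram on drop.
    let _timer = metrics.handle_advertisement_duration.start_timer();
    Ok(())
}
```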

Contributor


I preferred not to add metrics which we won't use, so I just transferred what we already have.
I agree that in tough situations extra metrics will be useful, so it can be a follow-up - I didn't close #10402 for this reason.

handle_collation_request_result: prometheus::Histogram,
collator_peer_count: prometheus::Gauge<prometheus::U64>,
collation_request_duration: prometheus::Histogram,
// TODO: Not available for the new implementation. Remove with the old implementation.
Contributor Author


conceptually, requesting unblocked collations is present in both variants

doc:
  - audience: Node Operator
    description: |-
      This PR adds a new collator protocol (validator side) subsystem.
Contributor Author


Probably deserves a bit more :D

@eskimor (Member) left a comment

The following improvements can be follow-ups, but the constants should really be fixed, or properly argued for why they make sense as they are.

Follow up improvements (not for this PR, but so that they are noted somewhere):

  • Parallel fetch after some small timeout: this is implemented in the legacy implementation and should be brought back & improved. It is a fix for us not having proper streaming and can help a great deal to mitigate the impact of fetch attacks - @tdimitrov knows the details.
  • Negative reputation bump on fetch problems: we might not be able to punish hard for single issues (as network issues can happen to honest nodes - measurements would be good), but if implemented properly (e.g. on top of the parallel fetch above) any real harm will only come from coordinated attacks, so we should look into punishing harder on coordination.
  • Race condition fix
  • Timer should start at leaf activation - parallel fetches should likely alter that behaviour even more (we can likely be more aggressive in fetching if we have parallel fetches): instead of waiting and doing nothing, we might as well fetch what is there, since we can fetch more if it arrives.
  • Possibly others from my chat with @tdimitrov

Fix for this PR: Get constants in order - or have a proper argument why they are good as is.

/// saturated to this value.
pub const MAX_SCORE: u16 = 35_000;
/// Reputation bump for getting a valid candidate included in a finalized block.
pub const VALID_INCLUDED_CANDIDATE_BUMP: u16 = 100;
Member


This is not what we agreed on. OK, I just found the reasoning above for the value - it only argues along one dimension (against the inactivity decay, which seems to be causing more problems than it solves), but not along the other, more important axis: the relationship to negative reputation changes. With regard to those, this value is completely off.

@eskimor (Member) left a comment

On second thought, let's finally get this PR merged. Any further fixes can come in a follow-up.


Labels

T8-polkadot This PR/Issue is related to/affects the Polkadot network. T18-zombienet_tests Trigger zombienet CI tests.
