collator-protocol-revamp: CollationManager and subsystem impl #8541
base: master
Conversation
…variant for the validator side
…rotocol-revamp-peer-manager
…rotocol-revamp-peer-manager
…rotocol-revamp-peer-manager
…rotocol-revamp-peer-manager
…p-peer-manager' into alindima/collator-protocol-revamp-reputation-db-draft
…p-reputation-db-draft' into alindima/collator-protocol-revamp-collation-manager
Yeah, 300 sounds good to me, I had the same concern.
No chains have that many now, but who knows. If it happens, it would be a hassle to upgrade since you need to upgrade validators. Having a higher value here does not cost us anything now.
Not sure we can make a distinction between a bug and malicious behaviour. I'd go for maximum punishment.
Please have a look at 5505d51. What it does:
The main motivation for the parameters change is to make sure the setup can handle 100 collators (here I've shown why 50 is the max). The results:
The yellow line is the implemented behaviour. Blue - the current one. Red - the one where the only change is decay=0.5. And for completeness the 'net score' - in theory we should be able to handle 100 collators, but the time to reach the threshold score is 6 hours.
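For intuition, here is a minimal sketch of how such a decayed 'net score' evolves round by round. Only decay = 0.5 comes from the discussion above; the bump and threshold values are illustrative assumptions, not the numbers from 5505d51:

```rust
fn main() {
    let decay = 0.5_f64;       // the decay factor discussed above
    let bump = 100.0_f64;      // assumed per-round reputation bump
    let threshold = 190.0_f64; // assumed target score

    // With this update rule the score converges to bump / (1 - decay),
    // so both the achievable ceiling and the time needed to reach the
    // threshold bound how many collators the setup can handle.
    let mut score = 0.0_f64;
    let mut rounds = 0u32;
    while score < threshold {
        score = score * decay + bump;
        rounds += 1;
    }
    println!("rounds to reach threshold: {rounds}"); // 5 with these numbers
}
```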
eskimor
left a comment
There is an issue with the fetch logic: we prioritize time over reputation. If an anonymous peer sent us a collation and the fetch delay elapsed, we will fetch the anonymous peer's collation instead of any that might have arrived from collators which already have some reputation.
The idea of the delay was to give higher-rep collators a chance to make the validator aware of their collations before it is already busy processing a garbage collation from an anonymous dude. What we have instead: yes, we already learned about a better peer, but we ignore him regardless.
It also seems meaningless to track reputation among peers below instant_fetch_threshold, as we will fetch the earliest peer regardless anyway. I think the logic can be significantly simplified and improved.
E.g. instead of a delay from receiving the first advertisement, we wait for e.g. 300ms from the time we received the scheduling/relay parent and then simply sieve through the advertisements and pick the highest rep. If we want to be fancy, we could adjust the delay dynamically, something like: total wait = 300ms * (1 - highest received rep / max rep).
So if we are, for example, 150ms in when we receive a max_rep/2 advertisement, we would stop immediately and take that. If we receive a highest-possible-rep advertisement immediately after the leaf, there is no point in waiting at all: nothing better will come.
The idea is that by adding some delay after having seen the scheduling parent/leaf, we account for global network latency, so high-rep but far-away collators have the same chances. We make it an even playing field for everybody, but we do not add additional latency for a single advertisement that came in already quite late.
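A minimal sketch of that dynamic delay, assuming a hypothetical `BASE_FETCH_DELAY` constant and `u16` reputation scores; the formula is the one suggested above:

```rust
use std::time::Duration;

/// The 300ms base wait after seeing the scheduling/relay parent
/// (hypothetical constant name).
const BASE_FETCH_DELAY: Duration = Duration::from_millis(300);

/// Shrink the wait as better-reputed advertisements arrive:
/// total_wait = 300ms * (1 - highest_received_rep / max_rep).
fn dynamic_fetch_delay(highest_received_rep: u16, max_rep: u16) -> Duration {
    let fraction = 1.0 - f64::from(highest_received_rep) / f64::from(max_rep);
    BASE_FETCH_DELAY.mul_f64(fraction.clamp(0.0, 1.0))
}

fn main() {
    // A max_rep/2 advertisement halves the wait to 150ms, matching the
    // worked example above; a max-rep one removes the wait entirely.
    assert_eq!(dynamic_fetch_delay(500, 1_000), Duration::from_millis(150));
    assert_eq!(dynamic_fetch_delay(1_000, 1_000), Duration::ZERO);
}
```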
I think I found another issue, a race: if a collator provides a collation for a leaf that is about to go out of scope, it would not receive a punishment for providing an invalid collation, because by the time we receive the invalid message from backing, we have already removed the peer state.
I had a suggestion about this on a different PR. I think parametrising it correctly would solve the problem you describe while also making the code simpler. WDYT @eskimor
Yep, looks very similar to what I had in mind too. 👍
timestamp: adv_timestamp,
})
})
.collect::<BTreeSet<_>>();
Can we just call min() here?
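A minimal illustration of the suggestion, with plain integers standing in for the advertisements collected above:

```rust
use std::collections::BTreeSet;

fn main() {
    // Stand-ins for the `adv_timestamp` values from the snippet above.
    let timestamps = [30u64, 10, 20];

    // Collecting into a BTreeSet just to look at its smallest element...
    let set: BTreeSet<_> = timestamps.iter().collect();
    let earliest_via_set = set.iter().next().copied();

    // ...is equivalent to taking the minimum directly:
    let earliest = timestamps.iter().min();
    assert_eq!(earliest_via_set, earliest);
}
```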
All GitHub workflows were cancelled due to the failure of one of the required jobs.
alindima
left a comment
LGTM, I'm happy with this. Some more metrics and tests could be added, but those can be follow-ups.
Cannot approve since I'm the original author, but consider this my approval :D
mod tests;
pub use metrics::Metrics;
pub use crate::validator_side_metrics::Metrics;
It will probably be necessary to add different metrics depending on the subsystem variant. It's good to have the common ones deduplicated, but I'd also add some more for the new version.
I'd like us to have finer-grained metrics, including timers for all operations. But this can be a follow-up, to avoid keeping this monster open forever.
I preferred not to add metrics which we won't use, so I just transferred what we already have.
I agree that in tough situations extra metrics will be useful, so it can be a follow-up - I didn't close #10402 for this reason.
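For what such an addition could look like, here is a sketch using the plain `prometheus` crate API; the metric name and helper function are hypothetical, not the subsystem's actual Metrics type:

```rust
use prometheus::{Histogram, HistogramOpts, Registry};

// Hypothetical finer-grained timer for the new subsystem variant.
fn register_fetch_timer(registry: &Registry) -> prometheus::Result<Histogram> {
    let histogram = Histogram::with_opts(HistogramOpts::new(
        "polkadot_parachain_collation_fetch_time",
        "Time spent fetching a single collation (hypothetical metric).",
    ))?;
    // Register a clone so we can keep the handle for observations.
    registry.register(Box::new(histogram.clone()))?;
    Ok(histogram)
}
```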
handle_collation_request_result: prometheus::Histogram,
collator_peer_count: prometheus::Gauge<prometheus::U64>,
collation_request_duration: prometheus::Histogram,
// TODO: Not available for the new implementation. Remove with the old implementation.
Conceptually, requesting unblocked collations is present in both variants.
prdoc/pr_8541.prdoc
doc:
  - audience: Node Operator
    description: |-
      This PR adds a new collator protocol (validator side) subsystem.
Probably deserves a bit more :D
…erimental/collation_manager/mod.rs Co-authored-by: Alin Dima <[email protected]>
eskimor
left a comment
The following improvements can be follow-ups, but the constants should really be fixed, or it should be properly argued why they make sense as is.
Follow up improvements (not for this PR, but so that they are noted somewhere):
- Parallel fetch after some small timeout: implemented in the legacy implementation, should be brought back & improved (see the sketch at the end of this comment). This is a fix for us not having proper streaming and can help a great deal to mitigate the impact of fetch attacks - @tdimitrov knows the details.
- Negative reputation bump on fetch problems: we might not be able to punish hard for single issues (as network issues can happen to honest nodes - measurements would be good), but if implemented properly (e.g. on top of the parallel fetch above), any real harm will only come from coordinated attacks, thus we should look into punishing coordination harder.
- Race condition fix
- Timer should start at leaf activation - parallel fetches should likely alter that behavior even more (we can likely be more aggressive in fetching if we have parallel fetches): instead of waiting and doing nothing, we might as well fetch what is there, and fetch more if it arrives.
- Possibly others from my chat with @tdimitrov
Fix for this PR: Get constants in order - or have a proper argument why they are good as is.
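A minimal sketch of the parallel-fetch idea from the first bullet, using generic futures rather than the subsystem's actual request types (`futures` and `futures-timer` crates assumed):

```rust
use std::time::Duration;
use futures::{future::FutureExt, pin_mut, select};
use futures_timer::Delay;

/// Start fetching from the best peer; if it has not finished within
/// `backup_delay`, race it against a second fetch from the next-best
/// peer and take whichever completes first.
async fn fetch_with_backup<F1, F2, T>(primary: F1, backup: F2, backup_delay: Duration) -> T
where
    F1: std::future::Future<Output = T>,
    F2: std::future::Future<Output = T>,
{
    let primary = primary.fuse();
    let backup = backup.fuse();
    let timeout = Delay::new(backup_delay).fuse();
    pin_mut!(primary, backup, timeout);

    select! {
        res = primary => res,
        _ = timeout => select! {
            // The primary is slow: keep polling it, but also start the
            // backup fetch and return whichever yields first.
            res = primary => res,
            res = backup => res,
        },
    }
}
```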
/// saturated to this value.
pub const MAX_SCORE: u16 = 35_000;
/// Reputation bump for getting a valid candidate included in a finalized block.
pub const VALID_INCLUDED_CANDIDATE_BUMP: u16 = 100;
This is not what we agreed on. OK, I just found the reasoning above for the value, but it only argues in one dimension (against the inactivity decay, which seems to be causing more problems than it solves), not along the other, more important axis: the relationship to negative reputation changes. With regard to those, this value is completely off.
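For a concrete sense of the missing axis, a back-of-the-envelope check relating the bump to a hypothetical score-wiping punishment (the negative reputation changes are not defined in this diff):

```rust
// Constants from the diff above.
const MAX_SCORE: u16 = 35_000;
const VALID_INCLUDED_CANDIDATE_BUMP: u16 = 100;

fn main() {
    // If a maximum punishment wiped the whole score, a collator at
    // MAX_SCORE would need this many finalized inclusions to recover:
    let bumps_to_recover = MAX_SCORE / VALID_INCLUDED_CANDIDATE_BUMP;
    assert_eq!(bumps_to_recover, 350);

    // Bumps saturate at MAX_SCORE rather than overflowing the u16.
    let score: u16 = 34_950;
    let bumped = score
        .saturating_add(VALID_INCLUDED_CANDIDATE_BUMP)
        .min(MAX_SCORE);
    assert_eq!(bumped, MAX_SCORE);
}
```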
eskimor
left a comment
On second thought, let's get this PR merged finally. Any further fixes can come in a followup.


Implements the CollationManager and the new collator protocol (validator side) subsystem. Issues #8182 and #7752.
These are the big remaining parts which would enable us to test the entire implementation.
TODO:
- ClaimQueueState cosmetics #10334
- --experimental-collator-protocol cli argument to enable the new collator protocol implementation #10285
- para_ids Runtime API #9055

These commits were added just to run the CI tests for this PR with the new experimental protocol.
After merging:
Uses a slightly modified version of the ClaimQueueState written by @tdimitrov in #7114.