Description
TL;DR: Allow explicit no-show messages to be sent during approval checking when a node is certain it will no-show (it is severely overworked, or some unexpected error prevents it from completing the work).
This issue is an early exploration of a slight alteration to the Approval Checking Process. It is a minor departure from the ELVES paper, but remains in the same spirit. An RFC will most likely follow once the design is a bit clearer.
No-Shows Reminder
A no-show occurs when a relay chain validator publishes their assignment for approval checking but then cannot fulfill it in a timely fashion. A no-show prompts the system to request even more approval checkers (2-3 more per no-show). This is a soft escalation method.
No-shows exist as a countermeasure against DoS attacks, so they should trigger when there is a DoS attack. There is one other intended cause, although it is a heavy defense-in-depth scenario: when a validator has been slashed and disabled but still participates in approval checking, they can no longer cause disputes (hard escalations) but can still cause intentional no-shows (small escalations).
Ideally, outside of the two cases above, no-shows should occur as rarely as possible.
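To make the soft escalation concrete, here is a minimal sketch of how the number of requested checkers grows as no-shows pile up. The function name, the base checker count and the choice of 3 extra checkers per no-show (the upper end of the 2-3 mentioned above) are assumptions for illustration, not the actual node logic.

```python
# Toy model of soft escalation. All names and numbers are hypothetical.
EXTRA_CHECKERS_PER_NO_SHOW = 3  # the text says 2-3; 3 is picked for the example


def checkers_requested(base_checkers, no_shows_per_round,
                       extra_per_no_show=EXTRA_CHECKERS_PER_NO_SHOW):
    """Total checkers asked for after a sequence of escalation rounds.

    `no_shows_per_round[i]` is how many assignments timed out in round i.
    Each no-show pulls in `extra_per_no_show` additional checkers.
    """
    if base_checkers < 0 or extra_per_no_show < 0:
        raise ValueError("counts must be non-negative")
    total = base_checkers
    for no_shows in no_shows_per_round:
        if no_shows < 0:
            raise ValueError("no-show counts must be non-negative")
        total += no_shows * extra_per_no_show
    return total


# Two no-shows in the first round, then one more among the replacements.
print(checkers_requested(10, [2, 1]))
```

Even in this tiny example, three no-shows nearly double the checking work for a single candidate.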
Issue
Unfortunately, as it stands, no-shows are still triggered quite frequently on Kusama and Polkadot, and we suspect this has nothing to do with a DoS attack. Sometimes nodes are overworked (more often nodes that are below or only just on spec) and they simply do not complete the assignment in time, resulting in a no-show.
No-shows without a good reason start a vicious cycle. Some nodes are overworked and no-show, which soft escalates by adding a new tranche of checkers, which adds even more work to the system and potentially causes more nodes to become overworked.
As of today, a node that is so overworked that it knows for sure it will not complete an assignment will still publish that assignment, and the system cannot move on from the candidate until the assignment times out or completes. The timeout window is rather large, so every time we get a slow validator everything stalls for at least the timeout, even assuming the next tranche is perfectly competent.
This leads to inconsistent finality times and random short spikes in finality lag.
Solution
- When a validator knows it cannot fulfill an assignment, it could skip publishing the assignment altogether, but that is quite risky. We need the approval checkers, and completely skipping the checks can give a noticeable edge to attackers.
- Alternatively, instead of waiting out the full timeout window, the validator can issue a new statement type: an explicit no-show message. Rather than waiting for the timeout to elapse before a no-show is counted, this triggers the next tranche immediately. The next tranche, if it is not also overworked, will complete the check and everything can proceed as normal, probably before the original no-show timeout would even have been reached.
- Pushing this idea even further: an overworked validator issuing an explicit no-show message is not as severe a situation as someone being straight up DoS'ed, so it might warrant a smaller escalation (less than a tranche of 2-3). We might opt for a 1:1 replacement policy. The overworked validator would issue the explicit no-show statement, which would most likely use a VRF to randomly select another validator to take their place.
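The difference between waiting for the timeout and sending an explicit no-show can be sketched with a toy timing model. Everything here is an assumption for illustration: time is in arbitrary units measured from when the assignment was published, the replacement checkers are assumed to finish in a fixed `check_duration`, and none of the numbers are real Kusama or Polkadot parameters.

```python
# Toy timing model. Names and numbers are hypothetical, not protocol values.

def approval_time(no_show_timeout, check_duration, explicit_at=None):
    """Time at which the replacement checkers finish.

    Without an explicit no-show, the replacements are only triggered once
    `no_show_timeout` has elapsed. With one, they are triggered at
    `explicit_at` (never later than the timeout itself).
    """
    if explicit_at is None:
        trigger = no_show_timeout
    else:
        trigger = min(explicit_at, no_show_timeout)
    return trigger + check_duration


# Validator realises after 1 unit that it cannot finish; timeout is 12 units.
print(approval_time(12.0, 4.0))                   # waits for the timeout
print(approval_time(12.0, 4.0, explicit_at=1.0))  # explicit no-show
```

In this model the explicit message can never make things slower than the timeout path; the saving is simply the part of the timeout window that no longer has to be waited out.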
There is an open question as to how to select the replacement. Maybe the replacement can be freely chosen. Maybe some heuristic is used that takes past approval checking performance into account, so that better validators are chosen more often. Maybe the replacement is chosen completely at random using a VRF.
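As one possible shape for the "completely random" option, here is a rough sketch of a deterministic replacement pick. Note the big assumption: a plain SHA-256 hash over a shared seed stands in for the VRF, so unlike a real VRF it involves no secret key and produces no proof. The function name, the string validator IDs and the `seed` input are all hypothetical.

```python
import hashlib

# Sketch only: a hash is used as a stand-in for a VRF output.
# A real design would need a verifiable, key-bound source of randomness.


def pick_replacement(validators, no_shower, already_assigned, seed):
    """Pick one replacement checker for `no_shower`, or None if nobody is left.

    `validators` and `already_assigned` hold validator IDs (strings here),
    and `seed` is some shared randomness as bytes (hypothetical input).
    """
    eligible = sorted(
        v for v in validators
        if v != no_shower and v not in already_assigned
    )
    if not eligible:
        return None
    digest = hashlib.sha256(seed + no_shower.encode("utf-8")).digest()
    index = int.from_bytes(digest, "big") % len(eligible)
    return eligible[index]


validators = ["alice", "bob", "charlie", "dave", "eve"]
replacement = pick_replacement(validators, "bob", {"alice"}, b"example-seed")
print(replacement)
```

Because the pick depends only on public inputs, every node would arrive at the same replacement. A performance-weighted heuristic would change how `eligible` is built or indexed, not the overall flow.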
Further Details
Whichever strategy we choose, this also further improves the new disabling strategy, which added a new reason for no-shows. Disabled validators issuing explicit no-shows (small escalations) makes sense because they can NEVER raise a dispute anyway. This is a straight up upgrade, as we get to skip the timeout window in those cases. The benefit is a bit of a defense in depth and will only materialize when some honest nodes are disabled on chain while an invalid candidate is being checked. Unlikely to happen, but possible.
This is loosely connected to the issue of being more explicit when raising disputes (hard escalations) discussed here: #872
As of today we don't have approval checking rewards, but once those are implemented the intention would be for the replacement approval checker to receive them, to further incentivise hard-working validators (on or above spec).
Additionally, if security or 1:1 replacements are a concern we could potentially opt for replacements only in tranche 0 or early tranches. The policy does not have to be uniform over tranches.
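A non-uniform policy like this could be as simple as the sketch below. The cutoff tranche, the full escalation size of 3 and the function name are placeholders picked for illustration; the right values would have to come out of the security analysis.

```python
# Hypothetical per-tranche escalation policy. All parameters are placeholders.

def escalation_size(tranche, explicit, replacement_max_tranche=0,
                    full_escalation=3):
    """How many new checkers a no-show in `tranche` pulls in.

    Explicit no-shows in early tranches get a 1:1 replacement. Everything
    else (timeout no-shows, or explicit ones in later tranches) gets the
    full escalation.
    """
    if tranche < 0:
        raise ValueError("tranche must be non-negative")
    if explicit and tranche <= replacement_max_tranche:
        return 1
    return full_escalation


print(escalation_size(0, explicit=True))   # early tranche, explicit no-show
print(escalation_size(2, explicit=True))   # later tranche, explicit no-show
print(escalation_size(0, explicit=False))  # ordinary timeout no-show
```

In this sketch, explicit no-shows in later tranches would still skip the timeout window; only the size of the escalation changes.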
Concerns
This is not a straight up upgrade (except for the disabled node scenario). The 1:1 replacement policy in particular can pose some risks and needs to be looked into more.