feat: implement two-step MC hash verification at block import #1139
ozgb wants to merge 2 commits into input-output-hk:release-v-1-8-1 from
Conversation
Move MC hash verification from VerifierCIDP to a new McHashVerifyingBlockImport wrapper to implement DOS-resistant validation:

Step 1 - Existence check: Query get_block_by_hash() to verify the MC block exists in Cardano. If not found, return an error (triggers a peer penalty for fabricated hashes).
Step 2 - Stability check: Query get_stable_block_for() to verify the block has enough confirmations. If not yet stable, return MissingState (no penalty; retry later when db-sync catches up).

This prevents the bidirectional banning cascade where nodes would penalize each other for blocks that are valid but not yet confirmed in db-sync.

Changes:
- Add McHashVerifyingBlockImport wrapper in demo/node
- Wire the block import wrapper in service.rs
- Add a new_deferred() constructor to McHashInherentDataProvider
- Update VerifierCIDP to skip the db-sync query (deferred to block import)
My proposal for the fix is to not introduce additional
And re-posting an analysis of this issue appearing in boot-node-01 in qanet:

Midnight Boot Node Sync Stall Analysis

Problem Summary
Boot node (non-validator) on qanet experiences repeated sync stalls followed by rapid catch-up bursts. The node falls behind the network, stays idle with connected peers, then bulk-syncs to catch up. The network was confirmed to be producing blocks normally during the stalls.

Observed Behavior
Short Stalls (~45-60 seconds)
Long Stall (30+ minutes)
Root Cause Analysis

Initial Trigger: Main-Chain Reference Verification Failure
When db-sync performs a rollback or lags behind the tip, block verification fails. Code path: the verification queries db-sync for a "stable" block, which requires:
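As a minimal sketch of what that stability requirement amounts to: the referenced MC block must both be indexed by db-sync and sit far enough behind db-sync's tip. The names below (`McBlockInfo`, `STABILITY_MARGIN`, `is_stable`) are hypothetical, not the actual partner-chains API; the margin of 10 matches the BLOCK_STABILITY_MARGIN value mentioned later in this thread.

```rust
// Hypothetical model of the stability check performed against db-sync.
const STABILITY_MARGIN: u64 = 10; // BLOCK_STABILITY_MARGIN on Midnight

struct McBlockInfo {
    block_number: u64,
}

/// A referenced MC block is "stable" only if db-sync has indexed it AND
/// db-sync's tip is at least STABILITY_MARGIN blocks past it. If db-sync
/// lags or rolls back, either condition can fail for a perfectly valid
/// reference - which is exactly the failure mode analysed here.
fn is_stable(referenced: Option<&McBlockInfo>, db_sync_tip: u64) -> bool {
    match referenced {
        None => false, // db-sync has not indexed the block (yet)
        Some(b) => b.block_number + STABILITY_MARGIN <= db_sync_tip,
    }
}

fn main() {
    let block = McBlockInfo { block_number: 100 };
    assert!(!is_stable(Some(&block), 109)); // tip too close: not stable yet
    assert!(is_stable(Some(&block), 110));  // margin reached: stable
    assert!(!is_stable(None, 110));         // db-sync hasn't seen it at all
    println!("ok");
}
```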
Cascade Effect: Peer Penalties
When verification fails, Substrate penalizes the peer who sent the block:

```rust
// chain_sync.rs:738-752
Err(BlockImportError::VerificationFailed(peer_id, e)) => {
    self.actions.push(SyncingAction::DropPeer(BadPeer(peer_id, rep::VERIFICATION_FAIL)));
    self.restart();
}
```

Penalties applied:
Mechanism 1: DisconnectedPeers Backoff (explains ~60s stalls)

```rust
// disconnected_peers.rs
DISCONNECTED_PEER_BACKOFF_SECONDS = 60
MAX_NUM_DISCONNECTS = 3
```

When a peer is dropped during an active request, block requests to that peer are skipped for the backoff window. Code path:

```rust
if !peer.state.is_available() ||
   !allowed_requests.contains(&id) ||
   !disconnected_peers.is_peer_available(&id) // <-- blocks requests
```

Mechanism 2: Reputation System (explains extended stalls)

```rust
// peer_store.rs
BANNED_THRESHOLD = -1_524_713_356 // 71% of i32::MIN
INVERSE_DECREMENT = 200           // decays ~0.5% per second
```
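To put numbers on those constants, here is a back-of-the-envelope recovery-time calculation. The decay model (reputation moves toward zero by 1/200 of its value each second, ~0.5%/s) is an approximation of Substrate's behaviour for illustration, not a verbatim reimplementation of peer_store.rs:

```rust
// How long a fully banned peer stays below the ban threshold, assuming
// a ~0.5%-per-second decay toward zero (approximation of peer_store.rs).
const BANNED_THRESHOLD: i64 = -1_524_713_356; // 71% of i32::MIN
const INVERSE_DECREMENT: i64 = 200;           // rep -= rep / 200 each second

fn seconds_until_unbanned(mut rep: i64) -> u64 {
    let mut secs = 0u64;
    while rep <= BANNED_THRESHOLD {
        rep -= rep / INVERSE_DECREMENT; // decay toward zero
        secs += 1;
    }
    secs
}

fn main() {
    // A peer pushed all the way to i32::MIN needs roughly a minute to
    // climb back above the ban threshold - and each fresh
    // VerificationFailed penalty resets that clock, which is how the
    // short ~60s stalls compound into much longer ones.
    let secs = seconds_until_unbanned(i32::MIN as i64);
    assert!((60..=80).contains(&secs));
    println!("fully banned peer recovers in ~{secs}s");
}
```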
Mechanism 3: Block Request Logic

```rust
// chain_sync.rs:2174 - peer_block_request()
if best_num >= peer.best_number {
    return None; // No blocks requested
}
```

If the node's best block number is at or above the peer's reported best, no blocks are requested from that peer.

Mechanism 4: Bidirectional Banning (CONFIRMED - explains 30-minute stall)
Key Finding: The boot node wasn't just penalizing peers - peers were also banning the boot node. When verification fails, the boot node calls the drop-peer path shown above.

The cascade:
Why boot node appears connected but gets no blocks:
Evidence from logs (2026-01-16 ~11:26:24-25):
What We Know
@LGLO I like the idea of enriching the observability layer.
Alternative Approach: Voting-Based MC Hash Consensus

Stepping back from the immediate fix - I've been thinking about the root cause here. The current design assumes all validators have synchronized views of Cardano via db-sync. Repeated incidents show this assumption is fragile.

Option 1: Extrinsic-Based Voting
Validators submit explicit vote extrinsics when they observe a new stable MC hash.

Pros:
Cons:
Option 2: Inherent-Based Voting (Simpler)
Block producers include their MC hash votes as part of the block inherent. They would need to vote on all the previous MC hashes in one inherent.

Pros:
Cons:
On the delay: We already wait for Cardano block finalization (the security parameter, typically 2160 blocks on mainnet). An additional ~5 minutes for MC hash consensus might be acceptable given we're already operating with significant MC confirmation delays. The simplicity gain could outweigh the latency cost.
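The inherent-based voting idea above could be sketched roughly as follows. Everything here is hypothetical (the `McHash` alias, `agreed_mc_hash`, and the quorum rule are illustration, not an existing partner-chains API): each producer's vote is tallied, and a hash is only adopted once it clears a quorum, otherwise the decision is deferred rather than failing verification.

```rust
use std::collections::HashMap;

// Illustrative vote-tallying for the inherent-based proposal: adopt the
// most-voted MC hash, but only if it reaches the quorum; otherwise defer.
type McHash = [u8; 32];

fn agreed_mc_hash(votes: &[McHash], quorum: usize) -> Option<McHash> {
    let mut tally: HashMap<McHash, usize> = HashMap::new();
    for v in votes {
        *tally.entry(*v).or_insert(0) += 1;
    }
    tally
        .into_iter()
        .filter(|(_, n)| *n >= quorum)
        .max_by_key(|(_, n)| *n)
        .map(|(h, _)| h)
}

fn main() {
    let a = [1u8; 32];
    let b = [2u8; 32];
    // 3 of 4 validators saw hash `a`: with a quorum of 3 it is adopted.
    assert_eq!(agreed_mc_hash(&[a, a, a, b], 3), Some(a));
    // Split view, no hash reaches quorum: defer instead of penalising anyone.
    assert_eq!(agreed_mc_hash(&[a, a, b, b], 3), None);
    println!("ok");
}
```

The point of the sketch is the failure mode: a lagging validator produces a missing quorum, not a verification failure, so no peer gets banned.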
This is true, and this PR changes that assumption to: enough validators have synchronized views of db-sync. I like the voting proposals - I recommend we implement the solution in this PR as a first step, and consider the alternative architectural solutions later, as those would require a fork.
@ozgb there are many solutions proposed and discussed on this ticket. Can you break it down into the different options and which ones the PR on this ticket is addressing, please? Also - for the currently proposed PR, what is the impact on Node Operators: would this change require a Runtime Upgrade, a Binary Upgrade only, etc.?
I've updated the PR description to reduce noise and reflect the current state of the code - the description includes comparisons between the different options. These changes would be a soft fork, binary update only. Nodes running this updated block import function would be fully compatible with the existing node network. These changes are implemented in the partner-chains repo, but could be re-implemented in the midnight-node repo to avoid creating a new partner-chains release.


Problem
We've run into several issues with nodes when db-sync falls behind the tip - this can be due to rollbacks or infra-outages. In this scenario, the node with the outdated db-sync will be unable to validate main-chain hash references in imported blocks.
As a result, the BlockImport function returns VerificationFailed - this causes the node with the lagging db-sync to penalise and disconnect from the peers that sent it the block. When reputation recovers and the peers reconnect, the node requests the same block range, resulting in another ban. This bi-directional banning has caused outages recently on qanet (Midnight Slack link), with the most recent outage happening 16/01/2026 (last Friday).
Related tickets:
Logs and technical details
And the banning mechanism:
Existing validation
Proposed Solution (this PR)
Move MC hash verification from VerifierCIDP to a new McHashVerifyingBlockImport wrapper to implement DOS-resistant validation:
The key part of this solution is to return MissingState in the non-malicious cases 2 and 3. Unlike VerificationFailed, there is no penalty given to peers when reporting MissingState (polkadot-sdk source link). This prevents the bidirectional banning cascade where nodes would penalize each other for blocks that are valid but not yet confirmed in db-sync.
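The two-step check described above could look roughly like this inside a block-import wrapper. This is a hedged sketch, not the PR's actual code: the `DataSource` trait, method signatures, and `ImportOutcome` enum are illustrative stand-ins mirroring the get_block_by_hash / get_stable_block_for queries named in the description.

```rust
// Illustrative two-step MC hash verification at block import.
#[derive(Debug, PartialEq)]
enum ImportOutcome {
    Accept,
    VerificationFailed, // fabricated hash -> peer is penalised
    MissingState,       // valid but not yet stable -> no penalty, retry later
}

// Hypothetical stand-in for the db-sync data source used by the wrapper.
trait DataSource {
    fn get_block_by_hash(&self, mc_hash: &[u8; 32]) -> Option<u64>;
    fn get_stable_block_for(&self, mc_hash: &[u8; 32]) -> Option<u64>;
}

fn verify_mc_hash(ds: &impl DataSource, mc_hash: &[u8; 32]) -> ImportOutcome {
    // Step 1 - existence: a hash db-sync has never seen is treated as
    // fabricated, so the sending peer is penalised.
    if ds.get_block_by_hash(mc_hash).is_none() {
        return ImportOutcome::VerificationFailed;
    }
    // Step 2 - stability: the block exists but may lack confirmations;
    // MissingState defers the import without banning anyone.
    if ds.get_stable_block_for(mc_hash).is_none() {
        return ImportOutcome::MissingState;
    }
    ImportOutcome::Accept
}

// Tiny mock so the sketch is self-contained and runnable.
struct Mock { exists: bool, stable: bool }
impl DataSource for Mock {
    fn get_block_by_hash(&self, _: &[u8; 32]) -> Option<u64> {
        self.exists.then_some(100)
    }
    fn get_stable_block_for(&self, _: &[u8; 32]) -> Option<u64> {
        self.stable.then_some(100)
    }
}

fn main() {
    let h = [0u8; 32];
    assert_eq!(verify_mc_hash(&Mock { exists: false, stable: false }, &h),
               ImportOutcome::VerificationFailed);
    assert_eq!(verify_mc_hash(&Mock { exists: true, stable: false }, &h),
               ImportOutcome::MissingState);
    assert_eq!(verify_mc_hash(&Mock { exists: true, stable: true }, &h),
               ImportOutcome::Accept);
    println!("ok");
}
```

The design point is the split: only the existence check can penalise a peer, while a lagging db-sync can only ever produce MissingState.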
Backwards Compatible
This solution is a soft-fork, and updates the node binary only. Nodes running this improved import function are compatible with the existing network.
It can be implemented entirely in the midnightntwrk/midnight-node repo to avoid requiring a new partner-chains version.

Other possible solutions:
Retry Loop in Verifier
Pros:
Cons:
Location: partner-chains/toolkit/sidechain/sidechain-mc-hash/src/lib.rs:379-422

Increase BLOCK_STABILITY_MARGIN further
Currently set to 10 on Midnight - the issue with this setting is that it's not a protocol parameter; it's up to the node operators what they'd like to set it to.
Add jitter or exponential backoff to block re-requests (Polkadot SDK fork required)
Prevents rapid-fire duplicate requests that trigger peer bans
Reduce VERIFICATION_FAIL penalty severity (Polkadot SDK fork required)

Test plan
Checklist
changelog.md updated for affected crate