feat: implement two-step MC hash verification at block import #1139

Open
ozgb wants to merge 2 commits into input-output-hk:release-v-1-8-1 from ozgb:feat/soft-db-sync-validation

Conversation

@ozgb (Contributor) commented Jan 19, 2026

Problem

We've run into several issues with nodes when db-sync falls behind the tip - this can be due to rollbacks or infra outages. In this scenario, a node with an outdated db-sync is unable to validate main-chain hash references in imported blocks.

As a result, the BlockImport function returns VerificationFailed - this causes the node with the lagging db-sync to penalise the peers that sent it the block and disconnect from them. When reputation recovers and the peers reconnect, the node requests the same block range, resulting in a ban.

This bi-directional banning has recently caused outages on qanet (Midnight Slack link), with the most recent outage happening on 16/01/2026 (last Friday).

Related tickets:

Logs and technical details

When verification fails, Substrate penalizes the peer who sent the block:

// chain_sync.rs:738-752 (abridged)
Err(BlockImportError::VerificationFailed(peer_id, e)) => {
    self.actions.push(SyncingAction::DropPeer(BadPeer(peer_id, rep::VERIFICATION_FAIL)));
    self.restart();
}

Penalties applied:

  • VERIFICATION_FAIL = -536,870,912 reputation
  • Peer is disconnected (DropPeer)
  • Peer enters DisconnectedPeers backoff

And the banning mechanism:

When verification fails, the boot node calls restart() and re-requests the same block range. Peers interpret this as spam and ban the requester:

node-06: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648
         Reason: Same block request multiple times. Banned, disconnecting.

node-07: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648
         Reason: Same block request multiple times. Banned, disconnecting.

node-08: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648
         Reason: Same block request multiple times. Banned, disconnecting.

node-09: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648
         Reason: Same block request multiple times. Banned, disconnecting.

boot-03: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648
         Reason: Same block request multiple times. Banned, disconnecting.

The cascade:

| Step | Boot Node | Peers |
|------|-----------|-------|
| 1 | Requests block range from peer | Receives request, sends blocks |
| 2 | Verification fails → penalizes peer | |
| 3 | Calls restart() → re-requests same blocks | Sees duplicate request |
| 4 | Repeats steps 2-3 | Bans boot node (i32::MIN penalty) |
| 5 | Shows "8 peers connected" | Won't serve blocks to banned peer |

Existing validation

  1. Check if the block is STABLE → proceed with import
  2. Else → error (block doesn't exist, peer penalty)

Proposed Solution (this PR)

Move MC hash verification from VerifierCIDP to a new McHashVerifyingBlockImport wrapper to implement DoS-resistant validation:

  1. Check if the block is STABLE → proceed with import
  2. Else, if the block EXISTS in Cardano → MissingState (wait for stability) (new)
  3. Else, if our Cardano tip is stale → MissingState (db-sync might be lagging) (new)
  4. Else → error (block doesn't exist, peer penalty)

The key part of this solution is to return MissingState on non-malicious cases 2 and 3. Unlike VerificationFailed, there is no penalty given to peers when reporting MissingState (polkadot-sdk source link).

This prevents the bidirectional banning cascade where nodes would penalize each other for blocks that are valid but not yet confirmed in db-sync.
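The four-step decision above can be sketched as a pure function (a hypothetical sketch — `Verdict` and `McView` are illustrative stand-ins for the real db-sync queries and the `sc_consensus` import outcomes `MissingState`/`VerificationFailed`):

```rust
// Hypothetical sketch of the decision made by McHashVerifyingBlockImport.
// Field and type names are illustrative, not the actual implementation.
#[derive(Debug, PartialEq)]
enum Verdict {
    Import,       // step 1: MC hash is stable, proceed with import
    MissingState, // steps 2-3: wait for db-sync, no peer penalty
    Reject,       // step 4: hash doesn't exist, peer penalty
}

struct McView {
    is_stable: bool,    // hash resolves to a STABLE Cardano block
    exists: bool,       // hash resolves to some Cardano block
    tip_is_stale: bool, // our observed Cardano tip looks outdated (db-sync lag)
}

fn verify_mc_hash(view: &McView) -> Verdict {
    if view.is_stable {
        Verdict::Import // step 1
    } else if view.exists {
        Verdict::MissingState // step 2 (new): wait for stability
    } else if view.tip_is_stale {
        Verdict::MissingState // step 3 (new): db-sync might be lagging
    } else {
        Verdict::Reject // step 4: block doesn't exist
    }
}

fn main() {
    let lagging = McView { is_stable: false, exists: false, tip_is_stale: true };
    // A lagging node holds the block instead of penalising the peer.
    println!("{:?}", verify_mc_hash(&lagging));
}
```

The important property is that only the final branch triggers a peer penalty; both non-malicious cases map to the penalty-free MissingState.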

Backwards Compatible

This solution is a soft-fork, and updates the node binary only. Nodes running this improved import function are compatible with the existing network.

It can be implemented entirely in the midnightntwrk/midnight-node repo to avoid requiring a new partner-chains version.

Other possible solutions:

Retry Loop in Verifier

Add a retry loop with timeout in the verifier's main-chain reference check. Since Verifier::verify() is async, it can sleep without blocking threads.

async fn verify(&self, block: BlockImportParams<B>) -> Result<BlockImportParams<B>, String> {
    let max_wait = Duration::from_secs(300); // 5 minutes
    let retry_interval = Duration::from_secs(10);
    let start = Instant::now();

    loop {
        match self.check_main_chain_ref(&block).await {
            Ok(()) => return Ok(block),
            Err(e) if e.is_db_sync_not_ready() && start.elapsed() < max_wait => {
                warn!(target: LOG_TARGET, "db-sync not ready, retrying in {:?}...", retry_interval);
                tokio::time::sleep(retry_interval).await;
                continue;
            }
            Err(e) => return Err(e.to_string()),
        }
    }
}

Pros:

  • Simple implementation
  • No architectural changes
  • Self-healing without operator intervention

Cons:

  • Holds block import hostage for up to N minutes
  • Other blocks behind this one queue up
  • Need to tune timeout based on observed db-sync lag

Location: partner-chains/toolkit/sidechain/sidechain-mc-hash/src/lib.rs:379-422

Increase BLOCK_STABILITY_MARGIN further

Currently set to 10 on Midnight - the issue with this setting is that it's not a protocol parameter; it's up to node operators what they'd like to set it to.

Add jitter or exponential backoff to block re-requests (Polkadot SDK fork required)

Prevents rapid-fire duplicate requests that trigger peer bans

Reduce VERIFICATION_FAIL penalty severity (Polkadot SDK fork required)

  • Current: -536,870,912
  • Consider: -100,000,000 (allows more retries before ban)

Test plan

  • Verify node syncs correctly when db-sync is caught up
  • Verify node handles db-sync lag gracefully (MissingState, no peer penalty)
  • Verify fabricated mc_hash values are rejected with peer penalty
  • Verify DB connection errors don't cause peer penalties

Checklist

  • Commit sequence broadly makes sense and commits have useful messages.
  • The size limit of 400 LOC isn't needlessly exceeded
  • The PR refers to a JIRA ticket (if one exists)
  • New tests are added if needed and existing tests are updated.
  • New code is documented and existing documentation is updated.
  • Relevant logging and metrics added
  • Any changes are noted in the changelog.md for affected crate
  • Self-reviewed the diff

Move MC hash verification from VerifierCIDP to a new McHashVerifyingBlockImport
wrapper to implement DoS-resistant validation:

Step 1 - Existence check: Query get_block_by_hash() to verify the MC block
exists in Cardano. If not found, return error (triggers peer penalty for
fabricated hashes).

Step 2 - Stability check: Query get_stable_block_for() to verify the block
has enough confirmations. If not stable yet, return MissingState (no penalty,
retry later when db-sync catches up).

This prevents the bidirectional banning cascade where nodes would penalize
each other for blocks that are valid but not yet confirmed in db-sync.

Changes:
- Add McHashVerifyingBlockImport wrapper in demo/node
- Wire block import wrapper in service.rs
- Add new_deferred() constructor to McHashInherentDataProvider
- Update VerifierCIDP to skip db-sync query (deferred to block import)
@LGLO (Contributor) commented Jan 19, 2026

My proposal for the fix is not to introduce an additional BlockImport, but to enrich the observability layer and make the decision based on its result.
Namely, make pub async fn get_stable_block_for(&self, hash: McBlockHash, reference_timestamp: Timestamp)
return an enum that signals the different situations:

  • StableBlockFound => everything is okay; we can proceed with the block validation
  • CardanoIsNotTrustworthy => hold operations. We know that there is no stable block on Cardano: the block that has k blocks on top is outside the allowed timestamps.
  • BlockIsNotFound => because this is not the CardanoIsNotTrustworthy case, we conclude that someone is trying to scam us (however, I have no idea how useful a hash to a non-existent block is)
  • BlockIsFoundButNotStable => this one requires additional data; we have to look at the Cardano tip. If our Cardano tip is "recent", then we are sure Cardano observability isn't lagging:
    • if the Cardano tip is recent, then someone is trying to push us an unstable block, perhaps containing data in their favour; what is more, such a block could be rolled back => don't accept it
    • if the Cardano tip is not recent => hold operations until a new Cardano tip (with a recent timestamp) appears; at that point we can re-evaluate the given hash, and we won't end up in this branch of the decision tree again
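The enum and decision tree above could be sketched as follows (a hypothetical sketch — the variant names follow the proposal, but `Decision`, the `cardano_tip_is_recent` field, and the exact mapping are illustrative):

```rust
// Hypothetical sketch of an enriched get_stable_block_for return type.
// Variant names follow the proposal above; everything else is illustrative.
enum StableBlockStatus {
    StableBlockFound,
    CardanoIsNotTrustworthy,
    BlockIsNotFound,
    BlockIsFoundButNotStable { cardano_tip_is_recent: bool },
}

#[derive(Debug, PartialEq)]
enum Decision {
    Proceed, // validate and import the block
    Hold,    // wait until observability catches up, then re-evaluate
    Reject,  // likely a scam hash or an unstable-block push
}

fn decide(status: StableBlockStatus) -> Decision {
    use StableBlockStatus::*;
    match status {
        StableBlockFound => Decision::Proceed,
        CardanoIsNotTrustworthy => Decision::Hold,
        BlockIsNotFound => Decision::Reject,
        // Tip is fresh, so observability isn't lagging: the sender pushed
        // an unstable (rollback-prone) block on purpose.
        BlockIsFoundButNotStable { cardano_tip_is_recent: true } => Decision::Reject,
        // Tip is stale: hold until a recent tip appears, then re-evaluate.
        BlockIsFoundButNotStable { cardano_tip_is_recent: false } => Decision::Hold,
    }
}

fn main() {
    let status = StableBlockStatus::BlockIsFoundButNotStable { cardano_tip_is_recent: false };
    println!("{:?}", decide(status));
}
```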

@ozgb (Contributor, Author) commented Jan 19, 2026

Further evidence of this issue - a newly-reset environment shows the same problem (screenshots omitted).

@ozgb (Contributor, Author) commented Jan 19, 2026

And re-posting an analysis of this issue appearing in boot-node-01 in qanet:

Midnight Boot Node Sync Stall Analysis

Problem Summary

Boot node (non-validator) on qanet experiences repeated sync stalls followed by rapid catch-up bursts. The node falls behind the network, stays idle with connected peers, then bulk-syncs to catch up. Network confirmed to be producing blocks normally during stalls.


Observed Behavior

Short Stalls (~45-60 seconds)

  • Node imports blocks normally
  • Verification failure occurs (db-sync temporarily behind)
  • Node goes idle with 7-8 peers connected
  • After ~45-60 seconds, rapid burst of imports
  • Pattern repeats

Long Stall (30+ minutes)

  • Node at #78673 at 09:35
  • Node at #78680 at 10:29 (only 7 blocks in 54 minutes)
  • 8 peers connected throughout
  • No verification errors in logs
  • Network producing blocks normally

Root Cause Analysis

Initial Trigger: Main-Chain Reference Verification Failure

When db-sync performs a rollback or lags behind, block verification fails:

"Main chain state [hash] referenced in imported block at slot [X] not found"

Code path: partner-chains/toolkit/sidechain/sidechain-mc-hash/src/lib.rs:379-422

The verification queries db-sync for a "stable" block, which requires:

  • block.block_no + security_parameter <= latest_block.block_no
  • Timestamp within allowable range (k/f to 3k/f window)
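The stability requirements above can be sketched as a predicate (all names are illustrative; `k` is Cardano's security parameter and the timestamp bounds correspond to the k/f to 3k/f window):

```rust
// Illustrative sketch of the "stable block" predicate described above.
// Names and the flat timestamp bounds are simplifications, not db-sync's API.
struct McBlock {
    block_no: u32,
    timestamp: u64,
}

fn is_stable(candidate: &McBlock, latest: &McBlock, k: u32, min_ts: u64, max_ts: u64) -> bool {
    // Requirement 1: candidate has at least k (security parameter) blocks on top.
    let enough_confirmations = candidate.block_no + k <= latest.block_no;
    // Requirement 2: timestamp falls inside the allowed (k/f to 3k/f) window.
    let in_window = (min_ts..=max_ts).contains(&candidate.timestamp);
    enough_confirmations && in_window
}

fn main() {
    let latest = McBlock { block_no: 1000, timestamp: 0 };
    let candidate = McBlock { block_no: 990, timestamp: 500 };
    println!("stable: {}", is_stable(&candidate, &latest, 10, 100, 900));
}
```

Note that both conditions depend on db-sync's view of `latest`: when db-sync lags, a genuinely stable block can fail this check, which is exactly the failure mode analysed here.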

Cascade Effect: Peer Penalties

When verification fails, Substrate penalizes the peer who sent the block:

// chain_sync.rs:738-752 (abridged)
Err(BlockImportError::VerificationFailed(peer_id, e)) => {
    self.actions.push(SyncingAction::DropPeer(BadPeer(peer_id, rep::VERIFICATION_FAIL)));
    self.restart();
}

Penalties applied:

  • VERIFICATION_FAIL = -536,870,912 reputation
  • Peer is disconnected (DropPeer)
  • Peer enters DisconnectedPeers backoff

Mechanism 1: DisconnectedPeers Backoff (explains ~60s stalls)

// disconnected_peers.rs
DISCONNECTED_PEER_BACKOFF_SECONDS = 60
MAX_NUM_DISCONNECTS = 3

When peer is dropped during an active request:

  1. First disconnect: 60 second backoff
  2. Second disconnect: 120 second backoff
  3. Third disconnect: BANNED (fatal reputation)

During backoff, is_peer_available() returns false, preventing block requests.

Code path: chain_sync.rs:1836-1838

if !peer.state.is_available() ||
    !allowed_requests.contains(&id) ||
    !disconnected_peers.is_peer_available(&id)  // <-- blocks requests
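Mechanism 1's escalating backoff can be sketched as a small function (the constants appear in `disconnected_peers.rs` as quoted above, but the exact escalation formula here is an assumption, not the SDK source):

```rust
// Sketch of the escalating disconnect backoff described above.
// Constants come from the text; the doubling formula is an assumption.
const DISCONNECTED_PEER_BACKOFF_SECONDS: u64 = 60;
const MAX_NUM_DISCONNECTS: u32 = 3;

/// Backoff after the n-th disconnect, or None once the peer is banned.
fn backoff_secs(num_disconnects: u32) -> Option<u64> {
    match num_disconnects {
        0 => Some(0),                          // never disconnected: no backoff
        n if n >= MAX_NUM_DISCONNECTS => None, // third disconnect: banned
        n => Some(DISCONNECTED_PEER_BACKOFF_SECONDS << (n - 1)), // 60s, 120s
    }
}

fn main() {
    for n in 0..4 {
        println!("disconnect #{n}: {:?}", backoff_secs(n));
    }
}
```

This matches the observed ~60-second short stalls: one verification failure puts the serving peer into a 60-second backoff, during which no block requests reach it.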

Mechanism 2: Reputation System (explains extended stalls)

// peer_store.rs
BANNED_THRESHOLD = -1,524,713,356 (71% of i32::MIN)
INVERSE_DECREMENT = 200 (decays ~0.5% per second)

  • After 3 verification failures: reputation ~-1.6B (below BANNED_THRESHOLD)
  • Banned peers are disconnected from all protocols
  • Recovery time from i32::MIN to threshold: ~69 seconds
  • Full decay to 0: ~59 minutes
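These decay numbers can be sanity-checked with a small simulation (an illustrative model, assuming reputation decays toward zero by `rep/INVERSE_DECREMENT`, at least 1 point, per second — not a quote of the `peer_store` source):

```rust
// Illustrative model of peer reputation decay, using the constants quoted
// above. The "at least 1 point per second" floor is an assumption.
const BANNED_THRESHOLD: i32 = -1_524_713_356;
const INVERSE_DECREMENT: i32 = 200;

fn seconds_until(mut rep: i32, target: i32) -> u32 {
    let mut secs = 0;
    while rep < target {
        // Decay ~0.5% of current reputation toward zero, at least 1 point.
        let delta = (rep / INVERSE_DECREMENT).abs().max(1);
        rep += delta;
        secs += 1;
    }
    secs
}

fn main() {
    // Recovery from an instant ban (i32::MIN) back above the ban threshold.
    let to_threshold = seconds_until(i32::MIN, BANNED_THRESHOLD);
    // Full decay back to neutral reputation.
    let to_zero = seconds_until(i32::MIN, 0);
    println!("ban -> threshold: ~{to_threshold}s; ban -> 0: ~{} min", to_zero / 60);
}
```

Under this model recovery above the threshold takes roughly a minute and full decay takes on the order of an hour, consistent with the ~69-second and ~59-minute figures above.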

Mechanism 3: Block Request Logic

// chain_sync.rs:2174 - peer_block_request()
if best_num >= peer.best_number {
    return None;  // No blocks requested
}

If the node's best_queued_number >= all peers' best_number, no blocks are requested.

Mechanism 4: Bidirectional Banning (CONFIRMED - explains 30-minute stall)

Key Finding: The boot node wasn't just penalizing peers—peers were also banning the boot node.

When verification fails, the boot node calls restart() and re-requests the same block range. Peers interpret this as spam and ban the requester:

node-06: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648
         Reason: Same block request multiple times. Banned, disconnecting.

node-07: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648
         Reason: Same block request multiple times. Banned, disconnecting.

node-08: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648
         Reason: Same block request multiple times. Banned, disconnecting.

node-09: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648
         Reason: Same block request multiple times. Banned, disconnecting.

boot-03: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648
         Reason: Same block request multiple times. Banned, disconnecting.

The cascade:

| Step | Boot Node | Peers |
|------|-----------|-------|
| 1 | Requests block range from peer | Receives request, sends blocks |
| 2 | Verification fails → penalizes peer | |
| 3 | Calls restart() → re-requests same blocks | Sees duplicate request |
| 4 | Repeats steps 2-3 | Bans boot node (i32::MIN penalty) |
| 5 | Shows "8 peers connected" | Won't serve blocks to banned peer |

Why boot node appears connected but gets no blocks:

  • TCP connections remain open (libp2p layer)
  • Peers have banned boot node from sync protocol (application layer)
  • Boot node is effectively isolated while appearing "connected"

Evidence from logs (2026-01-16 ~11:26:24-25):

  • Multiple validators (node-06, 07, 08, 09) banned boot-01/02 simultaneously
  • Penalty applied: -2147483648 (i32::MIN) = instant ban
  • Reason: "Same block request multiple times"

What We Know

| Observation | Implication |
|-------------|-------------|
| 7-8 peers connected during stalls | Not a connectivity issue (TCP level) |
| No verification errors in later stalls | Initial trigger doesn't repeat |
| Very low bandwidth during stall (12 B/s) | No block data being transferred |
| Rapid sync when recovery happens | Blocks are available, just not requested |
| Network producing blocks normally | Boot-node-specific issue |
| 7 blocks in 54 minutes (not zero) | Occasional sync happening (reputation decay) |
| Peers banned boot-01/02 for repeated requests | Boot node isolated at protocol level despite TCP connections |

@mpskowron (Contributor) commented:

My proposal for the fix is to not introduce additional BlockImport, but enrich observability layer and make decision based on result of this layer...

@LGLO I like the idea of enriching the observability layer.
A couple of thoughts:

  1. Where to query the tip?
    If we're querying db-sync for the tip to diagnose db-sync lag, we're somewhat circular. What if we queried a different source - for example, the local cardano-node directly via Ogmios, or another service?
  2. Should we query the tip in all failure cases as a heuristic of db-sync state?

@mpskowron (Contributor) commented:

Alternative Approach: Voting-Based MC Hash Consensus

Stepping back from the immediate fix - I've been thinking about the root cause here. The current design assumes all validators have synchronized views of Cardano via db-sync. Repeated incidents show this assumption is fragile.
What if validators voted on the MC hash and reached consensus, rather than independently verifying against their local db-sync?
This approach has been proven in production by other Substrate-based chains (e.g. Chainflip).
If the MC hash is approved but our db-sync doesn't see the block as finalized, we can be sure that something is wrong with our own db-sync/cardano-node setup.


Option 1: Extrinsic-Based Voting

Validators submit explicit vote extrinsics when they observe a new stable MC hash.
Pallets that can be used as a template:

Pros:

  • Fast convergence (1-2 blocks to reach threshold)
  • Battle-tested pattern - run in production by other substrate-based chains
  • Clear punishment mechanism for non-participation
  • Node with lagging db-sync knows it's behind - it sees 2/3+1 voting for a hash it can't verify locally
  • Complete decoupling from db-sync synchronization issues

Cons:

  • Significant implementation effort
  • 0.3 * N vote extrinsics per block (with N validators, MC hash updating every ~20s, PC blocks every ~6s)
  • Changes security model from "objective verification" to "subjective consensus"

Option 2: Inherent-Based Voting (Simpler)

Block producers include their MC hash votes as part of the block inherent. Need to vote on all the previous MC hashes in one inherent.

Pros:

  • Much simpler implementation in comparison to option 1
  • Zero extrinsic overhead - votes piggyback on block production
  • Fits existing inherent data provider pattern
  • Clear punishment mechanism for non-participation
  • Node with lagging db-sync knows it's behind - it sees 2/3+1 voting for a hash it can't verify locally
  • Complete decoupling from db-sync synchronization issues

Cons:

  • Slower convergence: need 2/3n+1 blocks to confirm (~67 blocks with 100 validators, ~7 minutes at 6s block time)
  • Changes security model from "objective verification" to "subjective consensus"

On the delay: We already wait for Cardano block finalization (security parameter, typically 2160 blocks on mainnet). An additional ~5 minutes for MC hash consensus might be acceptable given we're already operating with significant MC confirmation delays. The simplicity gain could outweigh the latency cost.
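As a quick sanity check of the Option 2 numbers (assuming N = 100 validators, 6 s PC blocks, one vote per produced block, and a 2/3·N + 1 confirmation threshold — all taken from the text above):

```rust
// Sanity check of the Option 2 convergence estimate. All parameters are
// the illustrative values stated in the discussion, not protocol constants.
fn blocks_to_confirm(n_validators: u32) -> u32 {
    // Need votes from 2/3·N + 1 distinct producers; with one vote per block
    // (and no repeated producers) that equals the number of blocks.
    2 * n_validators / 3 + 1
}

fn main() {
    let blocks = blocks_to_confirm(100);
    let secs = blocks * 6; // 6 s PC block time
    println!("{blocks} blocks ≈ {} min {} s", secs / 60, secs % 60);
}
```

This gives 67 blocks, about 6.7 minutes, consistent with the "~7 minutes" convergence figure in the Cons list.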

@ozgb (Contributor, Author) commented Mar 23, 2026

The current design assumes all validators have synchronized views of Cardano via db-sync.

This is true, and this PR changes that assumption to: enough validators have synchronized views of Cardano via db-sync.

I like the voting proposals - recommend we implement the solution in this PR as a first step, and consider alternative architectural solutions later, as these would require a fork.

@sineadplunkett commented Mar 24, 2026

@ozgb there are many solutions proposed and discussed on this ticket.

Can you please break it down into the different options and indicate which ones the PR on this ticket addresses?

Also - for the currently proposed PR, what is the impact on Node Operators - would this change require a Runtime Upgrade, a Binary Upgrade only, etc.?

@ozgb (Contributor, Author) commented Mar 24, 2026

@ozgb there are many solutions proposed and discussed on this ticket.

Can you please break it down into the different options and indicate which ones the PR on this ticket addresses?

Also - for the currently proposed PR, what is the impact on Node Operators - would this change require a Runtime Upgrade, a Binary Upgrade only, etc.?

I've updated the PR description to reduce noise and reflect the current state of the code - the description includes comparisons between different options.

These changes would be a soft-fork, binary update only. Nodes running this updated block import function would be fully compatible with the existing node network.

These changes are implemented in the partner-chains repo, but could be re-implemented in the midnight-node repo to avoid creating a new partner-chains release.
