feat: implement two-step MC hash verification at block import #1139

Open
ozgb wants to merge 2 commits into input-output-hk:release-v-1-8-1 from ozgb:feat/soft-db-sync-validation

Conversation

@ozgb (Contributor) commented Jan 19, 2026

Problem

We've run into several issues with nodes when db-sync falls behind the tip - this can be due to rollbacks or infra outages. In this scenario, a node with an outdated db-sync is unable to validate main-chain hash references in imported blocks.

As a result, the BlockImport function returns VerificationFailed - this causes the node with the lagging db-sync to penalise the peers that sent it the block and disconnect from them. When reputation recovers and the peers reconnect, the node requests the same block range, resulting in a ban.

This bi-directional banning has recently caused outages on qanet (Midnight Slack link), with the most recent outage happening on 16/01/2026 (last Friday).

Related tickets:

Logs and technical details

When verification fails, Substrate penalizes the peer who sent the block:

// chain_sync.rs:738-752 (abridged)
Err(BlockImportError::VerificationFailed(peer_id, e)) => {
    self.actions.push(SyncingAction::DropPeer(BadPeer(peer_id, rep::VERIFICATION_FAIL)));
    self.restart();
}

Penalties applied:

  • VERIFICATION_FAIL = -536,870,912 reputation
  • Peer is disconnected (DropPeer)
  • Peer enters DisconnectedPeers backoff

And the banning mechanism:

When verification fails, the boot node calls restart() and re-requests the same block range. Peers interpret this as spam and ban the requester:

node-06: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648
         Reason: Same block request multiple times. Banned, disconnecting.

node-07: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648
         Reason: Same block request multiple times. Banned, disconnecting.

node-08: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648
         Reason: Same block request multiple times. Banned, disconnecting.

node-09: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648
         Reason: Same block request multiple times. Banned, disconnecting.

boot-03: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648
         Reason: Same block request multiple times. Banned, disconnecting.

The cascade:

| Step | Boot Node | Peers |
|------|-----------|-------|
| 1 | Requests block range from peer | Receives request, sends blocks |
| 2 | Verification fails → penalizes peer | |
| 3 | Calls restart() → re-requests same blocks | Sees duplicate request |
| 4 | Repeats steps 2-3 | Bans boot node (i32::MIN penalty) |
| 5 | Shows "8 peers connected" | Won't serve blocks to banned peer |

Existing validation

  1. Check if the block is STABLE → proceed with import
  2. Else → error (block doesn't exist, peer penalty)

Proposed Solution (this PR)

Move MC hash verification from VerifierCIDP to a new McHashVerifyingBlockImport wrapper to implement DoS-resistant validation:

  1. Check if the block is STABLE → proceed with import
  2. Else, if the block EXISTS in Cardano → MissingState (wait for stability) (new)
  3. Else, if our Cardano tip is stale → MissingState (db-sync might be lagging) (new)
  4. Else → error (block doesn't exist, peer penalty)

The key part of this solution is to return MissingState on non-malicious cases 2 and 3. Unlike VerificationFailed, there is no penalty given to peers when reporting MissingState (polkadot-sdk source link).

This prevents the bidirectional banning cascade where nodes would penalize each other for blocks that are valid but not yet confirmed in db-sync.
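The four-step decision above can be sketched as a pure function (a hypothetical sketch — `Verdict` and `McView` are illustrative stand-ins for the real db-sync queries and the `sc_consensus` import outcomes `MissingState`/`VerificationFailed`):

```rust
// Hypothetical sketch of the decision made by McHashVerifyingBlockImport.
// Field and type names are illustrative, not the actual implementation.
#[derive(Debug, PartialEq)]
enum Verdict {
    Import,       // step 1: MC hash is stable, proceed with import
    MissingState, // steps 2-3: wait for db-sync, no peer penalty
    Reject,       // step 4: hash doesn't exist, peer penalty
}

struct McView {
    is_stable: bool,    // hash resolves to a STABLE Cardano block
    exists: bool,       // hash resolves to some Cardano block
    tip_is_stale: bool, // our observed Cardano tip looks outdated (db-sync lag)
}

fn verify_mc_hash(view: &McView) -> Verdict {
    if view.is_stable {
        Verdict::Import // step 1
    } else if view.exists {
        Verdict::MissingState // step 2 (new): wait for stability
    } else if view.tip_is_stale {
        Verdict::MissingState // step 3 (new): db-sync might be lagging
    } else {
        Verdict::Reject // step 4: block doesn't exist
    }
}

fn main() {
    let lagging = McView { is_stable: false, exists: false, tip_is_stale: true };
    // A lagging node holds the block instead of penalising the peer.
    println!("{:?}", verify_mc_hash(&lagging));
}
```

The important property is that only the final branch triggers a peer penalty; both non-malicious cases map to the penalty-free MissingState.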

Backwards Compatible

This solution is a soft-fork, and updates the node binary only. Nodes running this improved import function are compatible with the existing network.

It can be implemented entirely in the midnightntwrk/midnight-node repo to avoid requiring a new partner-chains version.

Other possible solutions:

Retry Loop in Verifier

Add a retry loop with timeout in the verifier's main-chain reference check. Since Verifier::verify() is async, it can sleep without blocking threads.

async fn verify(&self, block: BlockImportParams<B>) -> Result<BlockImportParams<B>, String> {
    let max_wait = Duration::from_secs(300); // 5 minutes
    let retry_interval = Duration::from_secs(10);
    let start = Instant::now();

    loop {
        match self.check_main_chain_ref(&block).await {
            Ok(()) => return Ok(block),
            Err(e) if e.is_db_sync_not_ready() && start.elapsed() < max_wait => {
                warn!(target: LOG_TARGET, "db-sync not ready, retrying in {:?}...", retry_interval);
                tokio::time::sleep(retry_interval).await;
                continue;
            }
            Err(e) => return Err(e.to_string()),
        }
    }
}

Pros:

  • Simple implementation
  • No architectural changes
  • Self-healing without operator intervention

Cons:

  • Holds block import hostage for up to N minutes
  • Other blocks behind this one queue up
  • Need to tune timeout based on observed db-sync lag

Location: partner-chains/toolkit/sidechain/sidechain-mc-hash/src/lib.rs:379-422

Increase BLOCK_STABILITY_MARGIN further

Currently set to 10 on Midnight - the issue with this setting is that it's not a protocol parameter; it's up to node operators what they'd like to set it to.

Add jitter or exponential backoff to block re-requests (Polkadot SDK fork required)

Prevents rapid-fire duplicate requests that trigger peer bans

Reduce VERIFICATION_FAIL penalty severity (Polkadot SDK fork required)

  • Current: -536,870,912
  • Consider: -100,000,000 (allows more retries before ban)

Test plan

  • Verify node syncs correctly when db-sync is caught up
  • Verify node handles db-sync lag gracefully (MissingState, no peer penalty)
  • Verify fabricated mc_hash values are rejected with peer penalty
  • Verify DB connection errors don't cause peer penalties

Checklist

  • Commit sequence broadly makes sense and commits have useful messages.
  • The size limit of 400 LOC isn't needlessly exceeded
  • The PR refers to a JIRA ticket (if one exists)
  • New tests are added if needed and existing tests are updated.
  • New code is documented and existing documentation is updated.
  • Relevant logging and metrics added
  • Any changes are noted in the changelog.md for affected crate
  • Self-reviewed the diff

Move MC hash verification from VerifierCIDP to a new McHashVerifyingBlockImport
wrapper to implement DoS-resistant validation:

Step 1 - Existence check: Query get_block_by_hash() to verify the MC block
exists in Cardano. If not found, return error (triggers peer penalty for
fabricated hashes).

Step 2 - Stability check: Query get_stable_block_for() to verify the block
has enough confirmations. If not stable yet, return MissingState (no penalty,
retry later when db-sync catches up).

This prevents the bidirectional banning cascade where nodes would penalize
each other for blocks that are valid but not yet confirmed in db-sync.

Changes:
- Add McHashVerifyingBlockImport wrapper in demo/node
- Wire block import wrapper in service.rs
- Add new_deferred() constructor to McHashInherentDataProvider
- Update VerifierCIDP to skip db-sync query (deferred to block import)
@LGLO (Contributor) commented Jan 19, 2026

My proposal for the fix is not to introduce an additional BlockImport, but to enrich the observability layer and make the decision based on its result.
Namely, make pub async fn get_stable_block_for(&self, hash: McBlockHash, reference_timestamp: Timestamp)
return an enum that signals the different situations:

  • StableBlockFound => everything is okay; we can proceed with the block validation
  • CardanoIsNotTrustworthy => hold operations. We know that there is no stable block on Cardano: the block that has k blocks on top is outside the allowed timestamps.
  • BlockIsNotFound => because this is not the CardanoIsNotTrustworthy case, we conclude that someone is trying to scam us (however, I have no idea how useful a hash to a non-existent block is)
  • BlockIsFoundButNotStable => this one requires additional data; we have to look at the Cardano tip. If our Cardano tip is "recent", then we are sure Cardano observability isn't lagging:
    • if the Cardano tip is recent, then someone is trying to push us an unstable block, perhaps containing data in their favour; what is more, such a block could be rolled back => don't accept it
    • if the Cardano tip is not recent => hold operations until a new Cardano tip (with a recent timestamp) appears; at that point we can re-evaluate the given hash, and we won't end up in this branch of the decision tree again
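The enum and decision tree above could be sketched as follows (a hypothetical sketch — the variant names follow the proposal, but `Decision`, the `cardano_tip_is_recent` field, and the exact mapping are illustrative):

```rust
// Hypothetical sketch of an enriched get_stable_block_for return type.
// Variant names follow the proposal above; everything else is illustrative.
enum StableBlockStatus {
    StableBlockFound,
    CardanoIsNotTrustworthy,
    BlockIsNotFound,
    BlockIsFoundButNotStable { cardano_tip_is_recent: bool },
}

#[derive(Debug, PartialEq)]
enum Decision {
    Proceed, // validate and import the block
    Hold,    // wait until observability catches up, then re-evaluate
    Reject,  // likely a scam hash or an unstable-block push
}

fn decide(status: StableBlockStatus) -> Decision {
    use StableBlockStatus::*;
    match status {
        StableBlockFound => Decision::Proceed,
        CardanoIsNotTrustworthy => Decision::Hold,
        BlockIsNotFound => Decision::Reject,
        // Tip is fresh, so observability isn't lagging: the sender pushed
        // an unstable (rollback-prone) block on purpose.
        BlockIsFoundButNotStable { cardano_tip_is_recent: true } => Decision::Reject,
        // Tip is stale: hold until a recent tip appears, then re-evaluate.
        BlockIsFoundButNotStable { cardano_tip_is_recent: false } => Decision::Hold,
    }
}

fn main() {
    let status = StableBlockStatus::BlockIsFoundButNotStable { cardano_tip_is_recent: false };
    println!("{:?}", decide(status));
}
```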

@ozgb (Contributor, Author) commented Jan 19, 2026

Further evidence of this issue - a newly-reset environment shows the same problem (screenshots omitted).

@ozgb (Contributor, Author) commented Jan 19, 2026

And re-posting an analysis of this issue appearing in boot-node-01 in qanet:

Midnight Boot Node Sync Stall Analysis

Problem Summary

Boot node (non-validator) on qanet experiences repeated sync stalls followed by rapid catch-up bursts. The node falls behind the network, stays idle with connected peers, then bulk-syncs to catch up. Network confirmed to be producing blocks normally during stalls.


Observed Behavior

Short Stalls (~45-60 seconds)

  • Node imports blocks normally
  • Verification failure occurs (db-sync temporarily behind)
  • Node goes idle with 7-8 peers connected
  • After ~45-60 seconds, rapid burst of imports
  • Pattern repeats

Long Stall (30+ minutes)

  • Node at #78673 at 09:35
  • Node at #78680 at 10:29 (only 7 blocks in 54 minutes)
  • 8 peers connected throughout
  • No verification errors in logs
  • Network producing blocks normally

Root Cause Analysis

Initial Trigger: Main-Chain Reference Verification Failure

When db-sync performs a rollback or lags behind, block verification fails:

"Main chain state [hash] referenced in imported block at slot [X] not found"

Code path: partner-chains/toolkit/sidechain/sidechain-mc-hash/src/lib.rs:379-422

The verification queries db-sync for a "stable" block, which requires:

  • block.block_no + security_parameter <= latest_block.block_no
  • Timestamp within allowable range (k/f to 3k/f window)
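The stability requirements above can be sketched as a predicate (all names are illustrative; `k` is Cardano's security parameter and the timestamp bounds correspond to the k/f to 3k/f window):

```rust
// Illustrative sketch of the "stable block" predicate described above.
// Names and the flat timestamp bounds are simplifications, not db-sync's API.
struct McBlock {
    block_no: u32,
    timestamp: u64,
}

fn is_stable(candidate: &McBlock, latest: &McBlock, k: u32, min_ts: u64, max_ts: u64) -> bool {
    // Requirement 1: candidate has at least k (security parameter) blocks on top.
    let enough_confirmations = candidate.block_no + k <= latest.block_no;
    // Requirement 2: timestamp falls inside the allowed (k/f to 3k/f) window.
    let in_window = (min_ts..=max_ts).contains(&candidate.timestamp);
    enough_confirmations && in_window
}

fn main() {
    let latest = McBlock { block_no: 1000, timestamp: 0 };
    let candidate = McBlock { block_no: 990, timestamp: 500 };
    println!("stable: {}", is_stable(&candidate, &latest, 10, 100, 900));
}
```

Note that both conditions depend on db-sync's view of `latest`: when db-sync lags, a genuinely stable block can fail this check, which is exactly the failure mode analysed here.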

Cascade Effect: Peer Penalties

When verification fails, Substrate penalizes the peer who sent the block:

// chain_sync.rs:738-752 (abridged)
Err(BlockImportError::VerificationFailed(peer_id, e)) => {
    self.actions.push(SyncingAction::DropPeer(BadPeer(peer_id, rep::VERIFICATION_FAIL)));
    self.restart();
}

Penalties applied:

  • VERIFICATION_FAIL = -536,870,912 reputation
  • Peer is disconnected (DropPeer)
  • Peer enters DisconnectedPeers backoff

Mechanism 1: DisconnectedPeers Backoff (explains ~60s stalls)

// disconnected_peers.rs
DISCONNECTED_PEER_BACKOFF_SECONDS = 60
MAX_NUM_DISCONNECTS = 3

When peer is dropped during an active request:

  1. First disconnect: 60 second backoff
  2. Second disconnect: 120 second backoff
  3. Third disconnect: BANNED (fatal reputation)

During backoff, is_peer_available() returns false, preventing block requests.

Code path: chain_sync.rs:1836-1838

if !peer.state.is_available() ||
    !allowed_requests.contains(&id) ||
    !disconnected_peers.is_peer_available(&id)  // <-- blocks requests
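Mechanism 1's escalating backoff can be sketched as a small function (the constants appear in `disconnected_peers.rs` as quoted above, but the exact escalation formula here is an assumption, not the SDK source):

```rust
// Sketch of the escalating disconnect backoff described above.
// Constants come from the text; the doubling formula is an assumption.
const DISCONNECTED_PEER_BACKOFF_SECONDS: u64 = 60;
const MAX_NUM_DISCONNECTS: u32 = 3;

/// Backoff after the n-th disconnect, or None once the peer is banned.
fn backoff_secs(num_disconnects: u32) -> Option<u64> {
    match num_disconnects {
        0 => Some(0),                          // never disconnected: no backoff
        n if n >= MAX_NUM_DISCONNECTS => None, // third disconnect: banned
        n => Some(DISCONNECTED_PEER_BACKOFF_SECONDS << (n - 1)), // 60s, 120s
    }
}

fn main() {
    for n in 0..4 {
        println!("disconnect #{n}: {:?}", backoff_secs(n));
    }
}
```

This matches the observed ~60-second short stalls: one verification failure puts the serving peer into a 60-second backoff, during which no block requests reach it.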

Mechanism 2: Reputation System (explains extended stalls)

// peer_store.rs
BANNED_THRESHOLD = -1,524,713,356 (71% of i32::MIN)
INVERSE_DECREMENT = 200 (decays ~0.5% per second)

  • After 3 verification failures: reputation ~-1.6B (below BANNED_THRESHOLD)
  • Banned peers are disconnected from all protocols
  • Recovery time from i32::MIN to threshold: ~69 seconds
  • Full decay to 0: ~59 minutes
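These decay numbers can be sanity-checked with a small simulation (an illustrative model, assuming reputation decays toward zero by `rep/INVERSE_DECREMENT`, at least 1 point, per second — not a quote of the `peer_store` source):

```rust
// Illustrative model of peer reputation decay, using the constants quoted
// above. The "at least 1 point per second" floor is an assumption.
const BANNED_THRESHOLD: i32 = -1_524_713_356;
const INVERSE_DECREMENT: i32 = 200;

fn seconds_until(mut rep: i32, target: i32) -> u32 {
    let mut secs = 0;
    while rep < target {
        // Decay ~0.5% of current reputation toward zero, at least 1 point.
        let delta = (rep / INVERSE_DECREMENT).abs().max(1);
        rep += delta;
        secs += 1;
    }
    secs
}

fn main() {
    // Recovery from an instant ban (i32::MIN) back above the ban threshold.
    let to_threshold = seconds_until(i32::MIN, BANNED_THRESHOLD);
    // Full decay back to neutral reputation.
    let to_zero = seconds_until(i32::MIN, 0);
    println!("ban -> threshold: ~{to_threshold}s; ban -> 0: ~{} min", to_zero / 60);
}
```

Under this model recovery above the threshold takes roughly a minute and full decay takes on the order of an hour, consistent with the ~69-second and ~59-minute figures above.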

Mechanism 3: Block Request Logic

// chain_sync.rs:2174 - peer_block_request()
if best_num >= peer.best_number {
    return None;  // No blocks requested
}

If the node's best_queued_number >= all peers' best_number, no blocks are requested.

Mechanism 4: Bidirectional Banning (CONFIRMED - explains 30-minute stall)

Key Finding: The boot node wasn't just penalizing peers—peers were also banning the boot node.

When verification fails, the boot node calls restart() and re-requests the same block range. Peers interpret this as spam and ban the requester:

node-06: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648
         Reason: Same block request multiple times. Banned, disconnecting.

node-07: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648
         Reason: Same block request multiple times. Banned, disconnecting.

node-08: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648
         Reason: Same block request multiple times. Banned, disconnecting.

node-09: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648
         Reason: Same block request multiple times. Banned, disconnecting.

boot-03: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648
         Reason: Same block request multiple times. Banned, disconnecting.

The cascade:

| Step | Boot Node | Peers |
|------|-----------|-------|
| 1 | Requests block range from peer | Receives request, sends blocks |
| 2 | Verification fails → penalizes peer | |
| 3 | Calls restart() → re-requests same blocks | Sees duplicate request |
| 4 | Repeats steps 2-3 | Bans boot node (i32::MIN penalty) |
| 5 | Shows "8 peers connected" | Won't serve blocks to banned peer |

Why boot node appears connected but gets no blocks:

  • TCP connections remain open (libp2p layer)
  • Peers have banned boot node from sync protocol (application layer)
  • Boot node is effectively isolated while appearing "connected"

Evidence from logs (2026-01-16 ~11:26:24-25):

  • Multiple validators (node-06, 07, 08, 09) banned boot-01/02 simultaneously
  • Penalty applied: -2147483648 (i32::MIN) = instant ban
  • Reason: "Same block request multiple times"

What We Know

| Observation | Implication |
|-------------|-------------|
| 7-8 peers connected during stalls | Not a connectivity issue (TCP level) |
| No verification errors in later stalls | Initial trigger doesn't repeat |
| Very low bandwidth during stall (12 B/s) | No block data being transferred |
| Rapid sync when recovery happens | Blocks are available, just not requested |
| Network producing blocks normally | Boot-node-specific issue |
| 7 blocks in 54 minutes (not zero) | Occasional sync happening (reputation decay) |
| Peers banned boot-01/02 for repeated requests | Boot node isolated at protocol level despite TCP connections |

@mpskowron (Contributor) commented:

My proposal for the fix is to not introduce additional BlockImport, but enrich observability layer and make decision based on result of this layer...

@LGLO I like the idea of enriching the observability layer.
A couple of thoughts:

  1. Where to query the tip?
    If we're querying db-sync for the tip to diagnose db-sync lag, we're somewhat circular. What if we queried a different source - for example, the local cardano-node directly via Ogmios, or another service?
  2. Should we query the tip in all failure cases as a heuristic of db-sync state?

@mpskowron (Contributor) commented:

Alternative Approach: Voting-Based MC Hash Consensus

Stepping back from the immediate fix - I've been thinking about the root cause here. The current design assumes all validators have synchronized views of Cardano via db-sync. Repeated incidents show this assumption is fragile.
What if validators voted on the MC hash and reached consensus, rather than independently verifying against their local db-sync?
This approach has been proven in production by other Substrate-based chains (e.g. Chainflip).
If the MC hash is approved but our db-sync doesn't see the block as finalized, we can be sure that something is wrong with our own db-sync/cardano-node setup.


Option 1: Extrinsic-Based Voting

Validators submit explicit vote extrinsics when they observe a new stable MC hash.
Pallets that can be used as a template:

Pros:

  • Fast convergence (1-2 blocks to reach threshold)
  • Battle-tested pattern - run in production by other substrate-based chains
  • Clear punishment mechanism for non-participation
  • Node with lagging db-sync knows it's behind - it sees 2/3+1 voting for a hash it can't verify locally
  • Complete decoupling from db-sync synchronization issues

Cons:

  • Significant implementation effort
  • 0.3 * N vote extrinsics per block (with N validators, MC hash updating every ~20s, PC blocks every ~6s)
  • Changes security model from "objective verification" to "subjective consensus"

Option 2: Inherent-Based Voting (Simpler)

Block producers include their MC hash votes as part of the block inherent. Need to vote on all the previous MC hashes in one inherent.

Pros:

  • Much simpler implementation in comparison to option 1
  • Zero extrinsic overhead - votes piggyback on block production
  • Fits existing inherent data provider pattern
  • Clear punishment mechanism for non-participation
  • Node with lagging db-sync knows it's behind - it sees 2/3+1 voting for a hash it can't verify locally
  • Complete decoupling from db-sync synchronization issues

Cons:

  • Slower convergence: need 2/3n+1 blocks to confirm (~67 blocks with 100 validators, ~7 minutes at 6s block time)
  • Changes security model from "objective verification" to "subjective consensus"

On the delay: We already wait for Cardano block finalization (security parameter, typically 2160 blocks on mainnet). An additional ~5 minutes for MC hash consensus might be acceptable given we're already operating with significant MC confirmation delays. The simplicity gain could outweigh the latency cost.
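As a quick sanity check of the Option 2 numbers (assuming N = 100 validators, 6 s PC blocks, one vote per produced block, and a 2/3·N + 1 confirmation threshold — all taken from the text above):

```rust
// Sanity check of the Option 2 convergence estimate. All parameters are
// the illustrative values stated in the discussion, not protocol constants.
fn blocks_to_confirm(n_validators: u32) -> u32 {
    // Need votes from 2/3·N + 1 distinct producers; with one vote per block
    // (and no repeated producers) that equals the number of blocks.
    2 * n_validators / 3 + 1
}

fn main() {
    let blocks = blocks_to_confirm(100);
    let secs = blocks * 6; // 6 s PC block time
    println!("{blocks} blocks ≈ {} min {} s", secs / 60, secs % 60);
}
```

This gives 67 blocks, about 6.7 minutes, consistent with the "~7 minutes" convergence figure in the Cons list.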

@ozgb (Contributor, Author) commented Mar 23, 2026

The current design assumes all validators have synchronized views of Cardano via db-sync.

This is true, and this PR changes that assumption to: enough validators have synchronized views of Cardano via db-sync.

I like the voting proposals - recommend we implement the solution in this PR as a first step, and consider alternative architectural solutions later, as these would require a fork.

@sineadplunkett commented Mar 24, 2026

@ozgb there are many solutions proposed and discussed on this ticket.

Can you please break it down into the different options and indicate which ones the PR on this ticket addresses?

Also - for the currently proposed PR, what is the impact on Node Operators - would this change require a Runtime Upgrade, a Binary Upgrade only, etc.?

@ozgb (Contributor, Author) commented Mar 24, 2026

@ozgb there are many solutions proposed and discussed on this ticket.

Can you please break it down into the different options and indicate which ones the PR on this ticket addresses?

Also - for the currently proposed PR, what is the impact on Node Operators - would this change require a Runtime Upgrade, a Binary Upgrade only, etc.?

I've updated the PR description to reduce noise and reflect the current state of the code - the description includes comparisons between different options.

These changes would be a soft-fork, binary update only. Nodes running this updated block import function would be fully compatible with the existing node network.

These changes are implemented in the partner-chains repo, but could be re-implemented in the midnight-node repo to avoid creating a new partner-chains release.
