
Debugging Consensus Failures


This page is meant to provide some steps to identify and narrow down the cause of consensus failures on nodes.

Identifying a Consensus Failure

There are several symptoms that could be indicative of a consensus failure; these include:

  • No new roots being made
  • Sharp increase in memory usage

That being said, logs provide the ultimate source of truth. The section below lists log statements that confirm whether a node hit a consensus failure.

Finding the Offending Slot

Scanning the validator logs is the best route to finding the slot where consensus broke. Finding the latest new root is a good starting point:

[TIMESTAMP INFO  solana_core::replay_stage] new root {}
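A quick grep pulls the most recent root out of the log; a minimal sketch, assuming the log file is named validator.log:

$ grep "new root" validator.log | tail -n 1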

With the last new root known, search the surrounding "area" (both before and after) for the following log statements:

[TIMESTAMP WARN  solana_core::cluster_slot_state_verifier] Cluster duplicate confirmed slot {} with hash {}, but our version has hash {}
[TIMESTAMP WARN  solana_vote_program::vote_state] {} dropped vote slots {:?} failed to match hash {} {}
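Both warnings can be scanned for in one pass; a sketch, again assuming validator.log as the filename:

$ grep -E "duplicate confirmed slot|dropped vote slots" validator.log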

Sometimes, a node that has experienced a consensus failure will get into a loop of checking / failing / purging / repairing a slot. Seeing the bank frozen log line for the same slot repeatedly is a quick way to identify this (and further inspection should show the repeated purges/repairs as well):

[TIMESTAMP INFO  solana_runtime::bank] bank frozen: {} hash: {} accounts_delta: {} signature_count: {} last_blockhash: {} capitalization: {}
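Counting how many times a given slot was frozen is a quick check for this loop; a sketch, with SUSPECT_SLOT standing in for the slot in question:

$ grep -c "bank frozen: SUSPECT_SLOT" validator.log

A count greater than one suggests the node is stuck purging and repairing that slot.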

One other thing to check is that the blockhash reported in the above log statement matches what the cluster reports; compare against a known-good set of logs or against https://explorer.solana.com/.

  • If the blockhash is different, the node somehow got bad shreds and computed the wrong result from them. This could point to an issue with transport and/or storage (e.g. the blockstore corrupting shreds on insert).
  • If the blockhash is the same, the node had the correct shreds, but computed the wrong hash. This likely points to a change in behavior in transaction processing between the node's and cluster's software versions.
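If a browser is inconvenient, the cluster's blockhash for a slot can also be fetched over JSON RPC with getBlock; a sketch against the public mainnet endpoint (substitute the literal slot number for FAIL_SLOT so the JSON stays valid):

$ curl -s https://api.mainnet-beta.solana.com -X POST -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","id":1,"method":"getBlock","params":[FAIL_SLOT,{"transactionDetails":"none","rewards":false}]}'

The blockhash field of the response is the value to compare against the bank frozen log line.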

For the rest of this guide, FAIL_SLOT will refer to the slot where the failure was identified.

Finding the Offending Transaction

Collect a Ledger

Once the offending slot is known, the first step is collecting a ledger to reliably and quickly reproduce the issue. This process can be a little complicated in itself, and more details on this can be found TODO. Namely, a ledger with the following is needed to replay:

  • A snapshot at slot SNAPSHOT_SLOT where SNAPSHOT_SLOT < FAIL_SLOT.
  • The shreds (transaction data) for slots [SNAPSHOT_SLOT, FAIL_SLOT].
    • For good measure and if available, grab the shreds for slots SNAPSHOT_SLOT - 1 and FAIL_SLOT + 1 as well
  • The cluster specific genesis.bin file.
    • Running solana-validator syncs this file, so it can be copied from the node where the break was found.
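For reference, the assembled ledger directory should look roughly like the following sketch (the exact snapshot filename and compression suffix vary by version):

LEDGER_DIR/
├── genesis.bin
├── rocksdb/                          (shreds for [SNAPSHOT_SLOT, FAIL_SLOT])
└── snapshot-SNAPSHOT_SLOT-SNAPSHOT_HASH.tar.zst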

Reproduce Locally

With a ledger in-hand, solana-ledger-tool verify can be used to replay FAIL_SLOT with the following command:

$ solana-ledger-tool verify --ledger LEDGER_DIR --halt-at-slot FAIL_SLOT

Replaying will log bank hashes; the log statement to look for is:

[TIMESTAMP INFO  solana_runtime::bank] bank frozen: SLOT hash: BANK_HASH accounts_delta: ACCOUNT_HASH signature_count: M last_blockhash: BLOCK_HASH capitalization: N

Perform the following steps to confirm that the correct & incorrect behavior can be reproduced locally:

  1. Back up the rocksdb folder somewhere, just in case.
  2. Using the command above, replay with the cluster majority software version and confirm the correct bank hash is calculated for FAIL_SLOT.
  3. Using the command above, replay with the version that broke and confirm the incorrect bank hash is calculated for FAIL_SLOT.
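A small shell sketch of these steps, with good/ and bad/ as placeholder paths to the two builds (solana-ledger-tool logs to stderr, hence the 2>&1):

$ cp -r LEDGER_DIR/rocksdb rocksdb.bak
$ good/solana-ledger-tool verify --ledger LEDGER_DIR --halt-at-slot FAIL_SLOT 2>&1 | grep "bank frozen: FAIL_SLOT"
$ bad/solana-ledger-tool verify --ledger LEDGER_DIR --halt-at-slot FAIL_SLOT 2>&1 | grep "bank frozen: FAIL_SLOT"

The two grep'd lines should match everywhere except the hashes that diverge.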

Log Updated Accounts

Looking at the list of updated accounts will shed light on which transaction produced a different result between the two software versions. The following patch is a simple change that will log all updated accounts.

Patch to Print Updated Accounts
diff --git a/runtime/src/accounts_db.rs b/runtime/src/accounts_db.rs
index 4893f3104c..34a9f02d33 100644
--- a/runtime/src/accounts_db.rs
+++ b/runtime/src/accounts_db.rs
@@ -5937,7 +5937,7 @@ impl AccountsDb {
             hashes.retain(|(pubkey, _hash)| !self.is_filler_account(pubkey));
         }
 
-        let ret = AccountsHash::accumulate_account_hashes(hashes);
+        let ret = AccountsHash::accumulate_account_hashes(slot, hashes);
         accumulate.stop();
         let mut uncleaned_time = Measure::start("uncleaned_index");
         self.uncleaned_pubkeys.insert(slot, dirty_keys);
diff --git a/runtime/src/accounts_hash.rs b/runtime/src/accounts_hash.rs
index 54aaa5c475..252dd44833 100644
--- a/runtime/src/accounts_hash.rs
+++ b/runtime/src/accounts_hash.rs
@@ -3,6 +3,7 @@ use {
     rayon::prelude::*,
     solana_measure::measure::Measure,
     solana_sdk::{
+        clock::Slot,
         hash::{Hash, Hasher},
         pubkey::Pubkey,
     },
@@ -487,9 +488,13 @@ impl AccountsHash {
         )
     }
 
-    pub fn accumulate_account_hashes(mut hashes: Vec<(Pubkey, Hash)>) -> Hash {
+    pub fn accumulate_account_hashes(slot: Slot, mut hashes: Vec<(Pubkey, Hash)>) -> Hash {
         Self::sort_hashes_by_pubkey(&mut hashes);
-
+        if slot == FAIL_SLOT {
+            for hash in hashes.iter() {
+                warn!("Updated Account: {:?}", hash);
+            }
+        }
         Self::compute_merkle_root_loop(hashes, MERKLE_FANOUT, |i| i.1)
     }

Perform the following steps to narrow in on the divergent transaction:

  1. Apply the above patch - make sure to plug in the actual value of FAIL_SLOT.
  2. Replay with both the good and bad software versions using the same command as above; however, make the following tweaks to get more verbose output and write it to a file (make sure to change the output log filename between the good and bad runs):
$ RUST_LOG=solana=debug solana-ledger-tool verify --ledger LEDGER_DIR --halt-at-slot FAIL_SLOT 2> {good|bad}_output.log
  3. Run the following command on each output file to parse out the "Updated Account" statements:
$ < {good|bad}_output.log grep "Updated Account: .*" > {good|bad}_updated_accounts.txt
  4. Diff the files from the previous step, which contain all updated accounts for each software version, and note any differing accounts (the pubkey is the first element in the tuple); a diff sketch follows this list. Suppose that DIFF_PUBKEY is one of the accounts that has different state.
  5. Run solana-ledger-tool slot to display all transaction information for that slot, and search for the account from the previous step:
$ solana-ledger-tool slot FAIL_SLOT -vv | grep DIFF_PUBKEY
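For step 4, a plain diff of the two dumps is enough to surface the divergent accounts (filenames from step 3 above):

$ diff good_updated_accounts.txt bad_updated_accounts.txt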

Finding the Root Cause

At this point, the transaction that caused the consensus break should be known. However, that alone may not provide enough insight to actually solve the underlying problem. Here are some tips:

Examine Transaction Details

Look for things that might be "special" about this transaction, such as the slot, involved programs, last deploy date, read/write, etc.
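If the cluster still retains the transaction, the solana CLI can pretty-print it, which makes these details easier to eyeball; a sketch with TX_SIGNATURE as a placeholder for one of the transaction's signatures:

$ solana confirm -v TX_SIGNATURE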

Lookup Transaction Error

If a transaction had a different result between versions, it is quite possible that it failed on one of them. Log output might contain some more information, such as an error for why the transaction failed. To find this log output:

  1. Note one of the signatures for the transaction in question from solana-ledger-tool slot output.
  2. Search the log output from replay for this signature (the log level must be at least debug to get anything useful here):
$ < output.log grep TX_SIGNATURE
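If surrounding lines help, grep's -B and -A flags add context before and after each match:

$ < output.log grep -B 5 -A 5 TX_SIGNATURE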

Depending on the situation, the context flags shown above might be helpful. In any case, there will be a line of log output that includes the signature, with the error right at the start:

[TIMESTAMP DEBUG solana_runtime::bank] tx error: Err(...

This error will provide a good starting point to search the codebase and determine what is causing the logic to fall through to the error case.

Git Bisect

Given that a replay should only take several minutes if the snapshot is at a slot shortly before FAIL_SLOT, running a git bisect to find the breaking commit is manageable.
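A sketch of the bisect loop, with GOOD_COMMIT and BAD_COMMIT as placeholders for the last-known-good and first-known-bad revisions (the cargo package name is assumed to be solana-ledger-tool, per the monorepo layout):

$ git bisect start
$ git bisect bad BAD_COMMIT
$ git bisect good GOOD_COMMIT
# at each step, git checks out a candidate commit; build, replay, then mark it:
$ cargo build --release --package solana-ledger-tool
$ ./target/release/solana-ledger-tool verify --ledger LEDGER_DIR --halt-at-slot FAIL_SLOT
$ git bisect good    # or: git bisect bad, based on the frozen bank hash for FAIL_SLOT

git bisect run with a small script that exits 0 when the hash is correct and 1 when it is not can automate the loop.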