Fix amnesia faults vulnerability #308 #310

Open
ranchalp wants to merge 8 commits into strangelove-ventures:main from ranchalp:main

@ranchalp ranchalp commented Oct 8, 2025

Problem Description

This implementation addresses a critical safety vulnerability (#308) in Horcrux that can lead to chain forks in Tendermint-based networks. The vulnerability, known as the "amnesia fault," occurs when Horcrux nodes in a distributed validator setup lose track of their consensus locks due to network partitions, allowing them to sign conflicting blocks.

The Amnesia Fault Scenario

  1. Network Partition: The network splits into groups A, B, and C
  2. Partial Lock: Group A sees enough prevotes to lock on Block V at Round R
  3. Missing Lock: Group B doesn't see the prevotes and remains unlocked
  4. Network Recovery: When the network recovers, Group B can sign a conflicting Block W
  5. Chain Fork: This leads to a chain fork as different parts of the network have committed different blocks

Root Cause

The root cause is that Horcrux is stateless with respect to Tendermint consensus locks. The remote signer only tracks what it has signed (high watermark) but doesn't maintain the consensus-critical state of what the validator is locked on. This allows the validator to "forget" its lock and sign conflicting blocks.

Solution Overview

This implementation adds consensus lock tracking to Horcrux, making the remote signer consensus-aware. The solution ensures that:

  1. Lock Tracking: The system tracks what block/value the validator is locked on
  2. Lock Validation: Before signing any block, the system checks if it would violate an existing lock
  3. Lock Updates: Locks are updated when PRECOMMIT messages are signed according to Tendermint rules
  4. Lock Persistence: Locks persist for all future rounds within the same height

Tendermint Locking Rules Implementation

The implementation follows the correct Tendermint consensus locking rules:

If a PRECOMMIT for value V is signed in round R for Tendermint's consensus instance Id, then:

  1. Rule 1.1 [lock on V]: For PROPOSAL and PREVOTE messages, allow only signing for V in rounds R' ≥ R
  2. Rule 1.2 [lock on V']: If a PRECOMMIT message for V' is requested and signed in round R' > R such that V' ≠ V, then lock on V' instead for all rounds R'' > R'

Key Points:

  • Locks persist for ALL future rounds within the same height
  • Locks are only cleared when moving to a different height
  • PRECOMMIT messages in later rounds can release and set new locks
  • PROPOSAL and PREVOTE messages in later rounds must respect the existing lock

Implementation Details

Data Structures

ConsensusLock

type ConsensusLock struct {
    Height    int64  `json:"height"`
    Round     int64  `json:"round"`
    Value     []byte `json:"value,omitempty"`     // The locked block hash/value
    ValueType string `json:"value_type"`         // "block" or "nil"
}

Updated SignState

The SignState structure now includes a ConsensusLock field to track the current consensus lock state.

Key Functions

ValidateConsensusLock

func (signState *SignState) ValidateConsensusLock(hrs HRSKey, signBytes []byte) error
  • Checks if signing the given block would violate an existing consensus lock
  • Returns an error if the lock would be violated
  • Allows signing if no lock exists or if the lock is not applicable

Lock Update Logic

The lock update logic distinguishes three scenarios:

  1. First lock for height: When no lock exists, set the lock based on PRECOMMIT
  2. Lock release/update: When PRECOMMIT for different value V' is signed in higher round R' > R
  3. Lock preservation: When PRECOMMIT for same value V is signed in higher round (no change needed)
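
The three scenarios can be sketched as a pure function that returns the lock to store after a PRECOMMIT is signed. This is an illustration under assumptions: `lock` and `nextLock` are hypothetical names, and treating a lower height as "keep the current lock" is a choice made here, not something the description specifies.

```go
package main

import (
	"bytes"
	"fmt"
)

// lock is a simplified, illustrative stand-in for the PR's ConsensusLock.
type lock struct {
	Height int64
	Round  int64
	Value  []byte
}

// nextLock returns the lock to persist after signing a PRECOMMIT for value
// at (height, round). It never mutates cur.
func nextLock(cur *lock, height, round int64, value []byte) *lock {
	switch {
	case cur == nil || height > cur.Height:
		// Scenario 1: first lock for this height.
		return &lock{Height: height, Round: round, Value: value}
	case height == cur.Height && round > cur.Round && !bytes.Equal(value, cur.Value):
		// Scenario 2: a later round precommits a different value V'.
		return &lock{Height: height, Round: round, Value: value}
	default:
		// Scenario 3: same value in a later round (or nothing newer): keep the lock.
		return cur
	}
}

func main() {
	first := nextLock(nil, 5, 0, []byte("V"))
	moved := nextLock(first, 5, 2, []byte("W"))
	kept := nextLock(moved, 5, 3, []byte("W"))
	fmt.Println(first.Round, string(first.Value)) // 0 V
	fmt.Println(moved.Round, string(moved.Value)) // 2 W
	fmt.Println(kept == moved)                    // true
}
```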

Integration Points

ThresholdValidator.Sign()

  • Added consensus lock validation before proceeding with signing
  • Updates consensus lock when PRECOMMIT is signed
  • Preserves existing lock state in sign state consensus

LocalCosigner.sign()

  • Added consensus lock validation before local signing
  • Updates consensus lock when PRECOMMIT is signed locally

SignState.blockDoubleSign()

  • Added consensus lock validation as the first check
  • Ensures lock violations are caught early in the signing process

Comment thread signer/sign_state.go Outdated
VoteExtensionSignature []byte `json:"vote_ext_signature,omitempty"`

// Consensus lock tracking to prevent amnesia faults
ConsensusLock ConsensusLock `json:"consensus_lock,omitempty"`

omitzero? omitempty does not apply to custom types: https://pkg.go.dev/encoding/json

Author

It will show up as an empty struct {} when not locked now.

exactly, with omitzero it won't show up at all.

Comment thread signer/sign_state.go Outdated

// IsLocked returns true if there is an active consensus lock
func (lock *ConsensusLock) IsLocked() bool {
return lock.Height > 0 && lock.Round >= 0

it should be possible to lock at height 0 and round 0, no?

Author

yes, thanks fixed.

Comment thread signer/sign_state.go Outdated
Height int64 `json:"height"`
Round int64 `json:"round"`
Value []byte `json:"value,omitempty"` // The locked block hash/value
ValueType string `json:"value_type"` // "block" or "nil"

how about having just Value? nil for nil locked, non-nil for block locked.

Author

fair, done. Thanks

Author

Fair, done thanks

Comment thread signer/sign_state.go Outdated

// ConsensusLockViolationError represents an error when trying to sign a block that violates a consensus lock
type ConsensusLockViolationError struct {
msg string

nit: you can quickly define a custom error type as type ErrName struct { error } and then construct it using ErrName{fmt.Errorf(...)}. The most proper way of defining structured errors is type ErrName struct { <custom fields> }, postponing the fmt.Sprintf to func (e *ErrName) Error() string { return fmt.Sprintf(...) }, so that message rendering can be skipped altogether if not needed.

Author

Done thanks.

Comment thread signer/sign_state.go
func IsConsensusLockStepViolationError(err error) bool {
_, ok := err.(*ConsensusLockStepViolationError)
return ok
}

nit: the canonical way to cast errors is errors.As().

Author

Done thanks

Comment thread signer/sign_state.go
// For now, we'll use a placeholder that returns the first 32 bytes as a hash
if len(signBytes) < 32 {
return nil
}

that needs to be fixed before merging, no?

and we need an e2e test that it actually works

Author

Done thanks.

@ranchalp
Author

Thanks @pompon0 I addressed your comments

Comment thread signer/local_cosigner.go Outdated
}

// Handle consensus lock updates according to Tendermint rules
signStateConsensus.ConsensusLock = updateConsensusLock(ccs.lastSignState.ConsensusLock, hrst.HRSKey(), req.SignBytes)

it is misleading that you are setting ConsensusLock to the old value and then immediately overwriting it. inline the updateConsensusLock call instead.

Comment thread signer/local_cosigner.go Outdated
}

// Handle consensus lock updates according to Tendermint rules
signStateConsensus.ConsensusLock = updateConsensusLock(ccs.lastSignState.ConsensusLock, hrst.HRSKey(), req.SignBytes)

the "update" prefix implies that updateConsensusLock mutates its argument(s). Consider something like "nextConsensusLock" or some other name which would imply that it is a pure function.

Comment thread signer/sign_state.go

// If we're signing for a different height, the lock is no longer relevant
if hrs.Height != signState.ConsensusLock.Height {
return nil

nit: hrs.Height > ... for defense-in-depth

and return error if <

(same for round)
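
One possible reading of the three nits above, sketched with hypothetical names (`lock`, `lockApplies`): only a strictly higher height makes the lock irrelevant, and a request that moves backwards in height or round is rejected rather than silently allowed.

```go
package main

import "fmt"

// lock holds only the fields this check needs; it is an illustrative type.
type lock struct {
	Height int64
	Round  int64
}

// lockApplies reports whether the lock must be enforced for a request at
// (height, round). Requests that move backwards are rejected outright.
func lockApplies(l lock, height, round int64) (bool, error) {
	switch {
	case height > l.Height:
		return false, nil // new height: the lock is no longer relevant
	case height < l.Height:
		return false, fmt.Errorf("height regression: request %d, lock %d", height, l.Height)
	case round < l.Round:
		return false, fmt.Errorf("round regression: request %d, lock %d", round, l.Round)
	default:
		return true, nil
	}
}

func main() {
	l := lock{Height: 10, Round: 2}
	fmt.Println(lockApplies(l, 11, 0)) // false <nil>
	fmt.Println(lockApplies(l, 10, 3)) // true <nil>
	fmt.Println(lockApplies(l, 9, 0))  // false plus a height regression error
}
```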

Comment thread signer/sign_state.go Outdated
// For PROPOSAL and PREVOTE messages in rounds R' >= R, only allow signing for the locked value V
if (hrs.Step == stepPropose || hrs.Step == stepPrevote) && hrs.Round >= signState.ConsensusLock.Round {
// Extract the block hash from the sign bytes to compare with the locked value
blockHash := extractBlockHashFromSignBytes(signBytes, hrs.Step)

make it return an error if parsing fails

Comment thread signer/sign_state.go Outdated
blockHash := extractBlockHashFromSignBytes(signBytes, hrs.Step)
if blockHash == nil {
// If we can't extract the block hash, allow signing (fallback to existing behavior)
return nil

@pompon0 pompon0 Oct 21, 2025

imo, it is more desirable to fail when we don't know what we are signing.

Comment thread signer/sign_state.go
}

// For PROPOSAL and PREVOTE messages in rounds R' >= R, only allow signing for the locked value V
if (hrs.Step == stepPropose || hrs.Step == stepPrevote) && hrs.Round >= signState.ConsensusLock.Round {

I think we have a problem here - we should compare pol_round against ConsensusLock.Round here, not current round. For proposal signing we can extract it, but prevotes do not include pol round.

Author

This is Tendermint type of stuff. What matters for Horcrux signing is consistency, which is only relative to the round of the latest signed precommit. That is, the highest round in which the Tendermint state sees a POL is one thing. But for signing in Horcrux, what matters is that whatever is proposed for signing next is consistent with what has been proposed for signing before, not the POL. And the only time we need to update and carry over locks when it comes to signing is after a Precommit message.

This implementation of Horcrux should not conflict with a correct Tendermint signer, since nodes lock on a value only when sending a Precommit message, and will only sign Prevotes or send Proposals for the locked value until the next Precommit message (if the Tendermint implementation follows the Tendermint spec from this paper, as the official documentation says).

discussed offline. We need to amend API to handle branch on line 28 of https://arxiv.org/pdf/1807.04938
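
For context, the guard attached to that branch in the linked paper is `lockedRound ≤ vr ∨ lockedValue = v`, where `vr` is the proposal's POL round and the branch only fires for `0 ≤ vr < round`. A hedged sketch of that check follows; `mayPrevote` and its parameters are hypothetical, and it assumes the amended API would hand the signer a POL round it cannot itself verify against 2f+1 prevotes.

```go
package main

import (
	"bytes"
	"fmt"
)

// mayPrevote mirrors the guard of the paper's line-28 branch:
// lockedRound <= vr || lockedValue == v, with vr the proposal's POL round.
// lockedRound == -1 means "not locked", as in the paper.
func mayPrevote(lockedRound int64, lockedValue []byte, round, polRound int64, value []byte) bool {
	if lockedRound < 0 || bytes.Equal(value, lockedValue) {
		return true
	}
	// The branch only applies when 0 <= vr < round.
	return polRound >= 0 && polRound < round && lockedRound <= polRound
}

func main() {
	locked := []byte("V")
	fmt.Println(mayPrevote(1, locked, 3, 2, []byte("W")))  // true: POL round 2 is newer than the lock
	fmt.Println(mayPrevote(1, locked, 3, 0, []byte("W")))  // false: POL predates the lock
	fmt.Println(mayPrevote(1, locked, 3, -1, []byte("W"))) // false: no POL at all
}
```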

@ranchalp
Author

ranchalp commented Oct 23, 2025

Thanks for the comments, just addressed them, except for pushback on one of them (see the conversation in the related comment).

@ranchalp ranchalp force-pushed the main branch 5 times, most recently from cf1df5f to 800e8e7, on October 23, 2025 10:52
Comment thread signer/sign_state.go Outdated

// Prevote quorums tracking for Tendermint Algorithm Line 28
// Map: height -> (value -> oldest round that received 2f+1 prevotes)
PrevoteQuorums map[string]map[string]int64 `json:"prevote_quorums"` // height -> (value -> round)

why do you use string to index by height?

@ranchalp
Author

Reverted to the version we discussed in sync that uses the POL round.

@ericjohncarlson
Contributor

Been watching this PR closely and it seems like this has stalled out. Is the original concern still a problem and is this expected to be merged?
