This repository was archived by the owner on Jan 5, 2026. It is now read-only.

bug: gap in epoch history with dynamic validator set #104

@l-monninger

Description

Summary

The current implementation does not appear to handle Aptos epochs gracefully during Avalanche bootstrap syncing. The error below, which typically appears after changes to the validator set or during periods of asynchrony, indicates as much.

[02-08|09:41:50.142] FATAL <rLgK4miC2cHSdc84z8H9iNBZTkqmyRD7xSPJnkkpHc34yhRLY Chain> handler/handler.go:339 shutting down chain {"reason": "received an unexpected error", "error": "rpc error: code = Unknown desc = failed to build block: Internal error: \"Gap in epoch history. Trying to put in LedgerInfo in epoch: 1, current epoch: 4\" while processing sync message: NodeID-EzN4q9mU6TVFkND6oghbdLAUqDacE9Czp Op: chits Message: chain_id:\"p\\x07\\xe5\\xfeﲼ\\x87v\\xb8\\x18h\\xae\\xd8E\\x80/\\x19^\\x98\\x83\\x7fI\\xaf\\xdb\\xe99\\xac\\xe5\\xd4k\\x8a\"  request_id:8941  preferred_id:\"\\x0b\\x9adj\\xa0R\\xb1\\x85\\x18\\x04\\x8d~\\xb5\\xed\\x8dG\\xd8\\xccT\\x07h\\xaf\\xaaO8<\\xd1\\xe2\\xc0\\x18\\x15\\xde\"  accepted_id:\"\\x0b\\x9adj\\xa0R\\xb1\\x85\\x18\\x04\\x8d~\\xb5\\xed\\x8dG\\xd8\\xccT\\x07h\\xaf\\xaaO8<\\xd1\\xe2\\xc0\\x18\\x15\\xde\"  preferred_id_at_height:\"\\xc1\\xddAX\\x92\\x06\\xe1ĄE\\xbe\\x9ar\\x15l\\xf6YEr\\x9f\\xf2ԟ\\x1bL\\x87߯\\xb1\\x14\\xea\\xed\""}

Steps to Reproduce

This error is somewhat challenging to reproduce. It emerges more reliably over longer-running periods and when adding and removing several validators. The simplest procedure I've determined so far is:

  1. Start an M1 subnet on fuji with one validator.
  2. Submit several transactions to this subnet, for example by calling movement aptos init repeatedly.
  3. Add a second validator.
  4. If you do not see the error above while the validator is bootstrapping, remove the second validator, submit more transactions, and try again.
  5. If you still do not see the error above, add a third validator and attempt once more.
  6. Repeat.

You will need to inspect the logs for your chain to view the error. These are stored at ~/.avalanchego/logs/<chain-id>.log
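
Searching a long-running chain log by hand is tedious, so a small script can help. The following is a minimal sketch that scans log lines for the message shown in the Summary and extracts the two epoch numbers. The regular expression is based only on the wording of the error above, and the helper names (find_epoch_gaps, scan_log) are hypothetical, not part of any Movement or AvalancheGo tooling.

```python
import re
from pathlib import Path

# Pattern taken from the wording of the error message quoted in this issue.
GAP_RE = re.compile(
    r"Gap in epoch history\. Trying to put in LedgerInfo in epoch: (\d+), "
    r"current epoch: (\d+)"
)


def find_epoch_gaps(lines):
    """Yield (line_number, attempted_epoch, current_epoch) for each matching line."""
    for number, line in enumerate(lines, start=1):
        match = GAP_RE.search(line)
        if match:
            yield number, int(match.group(1)), int(match.group(2))


def scan_log(path):
    """Scan one log file; `path` may start with ~ and is expanded first."""
    with Path(path).expanduser().open(errors="replace") as handle:
        return list(find_epoch_gaps(handle))
```

Calling scan_log on the chain log path given above (with your own chain ID substituted) returns one tuple per occurrence, which makes it easier to see which epochs were involved each time the chain shut down.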

Possible Solutions

It seems likely to me that there are three possibilities:

  1. The block execution ordering is currently faulty, and we should order incoming blocks in a queue by epoch before they are sent to AptosBlockExecutor, popping them off only when the ledger epoch matches.
  2. This occurs owing to periods of asynchrony, in which case the blocks are simply missing/not being disseminated. This would largely be a quality of the network on the whole and may not be something around which we can engineer without significantly more re-design.
  3. This occurs owing to a bad reorg strategy, in which case a fuller redesign is necessary.
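
To make the first possibility more concrete, here is a minimal sketch of an epoch-ordered buffer. It is written in Python for brevity and is not the project's code: the class name EpochOrderedQueue, the stale list, and the idea that the caller updates ledger_epoch after the executor commits an epoch change are all assumptions made for illustration. How blocks would actually be handed to AptosBlockExecutor is not shown.

```python
import heapq
from itertools import count


class EpochOrderedQueue:
    """Buffer blocks by epoch and release only those matching the ledger epoch.

    Hypothetical illustration: `ledger_epoch` is assumed to be updated by the
    caller whenever the executor reports a new epoch.
    """

    def __init__(self, ledger_epoch):
        self.ledger_epoch = ledger_epoch
        self.stale = []          # (epoch, block) pairs older than the ledger epoch
        self._heap = []          # min-heap keyed on (epoch, arrival order)
        self._arrival = count()  # tie-breaker that preserves arrival order

    def push(self, epoch, block):
        heapq.heappush(self._heap, (epoch, next(self._arrival), block))

    def pop_ready(self):
        """Return blocks for the current ledger epoch, in arrival order.

        Blocks from earlier epochs are set aside in `self.stale` rather than
        executed; blocks from later epochs stay buffered.
        """
        ready = []
        while self._heap and self._heap[0][0] <= self.ledger_epoch:
            epoch, _, block = heapq.heappop(self._heap)
            if epoch == self.ledger_epoch:
                ready.append(block)
            else:
                self.stale.append((epoch, block))
        return ready
```

In this sketch, a block tagged with epoch 1 arriving while the ledger is at epoch 4 would land in stale instead of reaching the executor. Whether such blocks should be dropped, logged, or re-requested is a policy decision this example does not settle.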

Another alternative would be to remove epochs altogether. This is non-trivial to introduce into the current implementation, as simply setting the same epoch for every block will cause an invalid epoch history error similar to the one above.

Metadata

Labels: bug (Something isn't working)
