Description
There's an issue with the global pubkey cache.
We update the cache in memory before writing the new keys to disk. As a result, changes can be made in memory but never reach disk (e.g. if a block fails to be imported after the changes are made). Later blocks that would import the same keys then skip over them (because they are already in memory) and never write them to disk.
The result is that the node continues running OK until it is restarted. Until then the in-memory cache is correct, but it is out of sync with the on-disk cache, which is missing the skipped keys.
I think we can solve this redundantly in two ways:
- Maintain the following invariant: `key in memory -> key on disk`. To do this we will need to write keys to disk before ever updating the in-memory store. This has some performance challenges associated with holding the lock while performing IO (previous issue: Optimize against ValidatorPubkeyCache timeouts #2327).
- We can also defensively import keys from the head state upon restart.
The second change has lower impact, although it feels like a bit of a hack.
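
To make the two proposals concrete, here is a minimal sketch of both. It is not the actual `ValidatorPubkeyCache` code: `Pubkey`, `PubkeyStore`, `MockDisk`, `PubkeyCache`, `import` and `load_from_disk` are all hypothetical names, the "disk" is just a `HashMap`, and locking is left out entirely. It assumes keys are stored by validator index and that the head state lists keys in index order.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a 48-byte BLS public key.
type Pubkey = [u8; 48];

/// Hypothetical abstraction over the on-disk key store.
trait PubkeyStore {
    fn put(&mut self, index: usize, key: &Pubkey) -> Result<(), String>;
    fn get(&self, index: usize) -> Option<Pubkey>;
}

/// In-memory map standing in for the database. `fail_writes` simulates a
/// write that never reaches disk.
#[derive(Default)]
struct MockDisk {
    keys: HashMap<usize, Pubkey>,
    fail_writes: bool,
}

impl PubkeyStore for MockDisk {
    fn put(&mut self, index: usize, key: &Pubkey) -> Result<(), String> {
        if self.fail_writes {
            return Err("simulated disk failure".to_string());
        }
        self.keys.insert(index, *key);
        Ok(())
    }

    fn get(&self, index: usize) -> Option<Pubkey> {
        self.keys.get(&index).copied()
    }
}

/// Simplified cache: keys by validator index, plus a reverse lookup.
#[derive(Default)]
struct PubkeyCache {
    pubkeys: Vec<Pubkey>,
    indices: HashMap<Pubkey, usize>,
}

impl PubkeyCache {
    /// Rebuild the in-memory cache from disk, as would happen on restart.
    fn load_from_disk<S: PubkeyStore>(store: &S) -> Self {
        let mut cache = Self::default();
        while let Some(key) = store.get(cache.pubkeys.len()) {
            let index = cache.pubkeys.len();
            cache.indices.insert(key, index);
            cache.pubkeys.push(key);
        }
        cache
    }

    /// Import keys, writing each one to disk *before* updating memory, so
    /// that `key in memory -> key on disk` holds even if a write fails.
    fn import<S: PubkeyStore>(&mut self, store: &mut S, keys: &[Pubkey]) -> Result<(), String> {
        for key in keys {
            if self.indices.contains_key(key) {
                continue; // Safe to skip: in memory implies on disk.
            }
            let index = self.pubkeys.len();
            store.put(index, key)?; // On failure, memory is left untouched.
            self.indices.insert(*key, index);
            self.pubkeys.push(*key);
        }
        Ok(())
    }
}
```

In this sketch the defensive import on restart is `load_from_disk` followed by `import` with the head state's keys: any key that is missing on disk is also missing from the reloaded cache, so it gets written rather than skipped. A real implementation would still have to decide how long the cache lock is held around `store.put`, which is the performance concern mentioned above.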