Description
With large numbers of validators and heavy disk congestion, we've observed the slashing protection database timing out with errors like this:
: Nov 16 00:50:31.056 INFO Successfully published attestation type: unaggregated, slot: 39850, committee_index: 2, head_block: 0x3bd75ebc65de718b36911eaab7dad3a9ef7ca44c7ba03d32c7092195c43bcf43, service: attestation
: Nov 16 00:50:32.838 CRIT Not signing slashable block error: SQLPoolError("Error(None)")
: Nov 16 00:50:32.838 CRIT Error whilst producing block message: Unable to sign block, service: block
The default timeout is 5 seconds.
Presently, we sign attestations one at a time and broadcast them, which means that each attester requires a new SQLite database transaction.
https://github.com/sigp/lighthouse/blob/master/validator_client/src/attestation_service.rs#L332-L404
To alleviate the congestion slightly, we could switch to an algorithm like:
- Begin an SQLite transaction `txn`
- Check and sign all attestations as part of `txn`
- Commit `txn`
- Broadcast attestations
That way we preserve the property that an attestation is only broadcast if its signature has been persisted to disk (which is crash-safe). Broadcasting after each check but before the transaction commits would violate this property.
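As a rough illustration of the batched flow proposed above, here is a minimal sketch using only the standard library. It is not the Lighthouse implementation: `SlashingDb`, `Txn`, `Attestation`, `SignedAttestation` and `sign_batch` are hypothetical names, the database is an in-memory stand-in for SQLite, the slashability check is reduced to "target epoch must strictly increase per validator", and the signature is a placeholder string.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the slashing protection database.
/// Maps validator index -> highest target epoch already signed.
#[derive(Default)]
struct SlashingDb {
    committed: HashMap<u64, u64>,
    /// Number of committed transactions, tracked only for illustration.
    commits: usize,
}

/// Hypothetical transaction: staged writes become durable only on `commit`.
/// Dropping a `Txn` without committing discards the staged writes.
struct Txn<'a> {
    db: &'a mut SlashingDb,
    staged: HashMap<u64, u64>,
}

#[derive(Clone, Debug, PartialEq)]
struct Attestation {
    validator: u64,
    target_epoch: u64,
}

#[derive(Debug, PartialEq)]
struct SignedAttestation {
    attestation: Attestation,
    signature: String,
}

impl SlashingDb {
    fn begin(&mut self) -> Txn<'_> {
        Txn { db: self, staged: HashMap::new() }
    }
}

impl<'a> Txn<'a> {
    /// Simplified check (an assumption, not the real rule set): refuse to
    /// sign unless the target epoch is strictly greater than anything
    /// already recorded for this validator, staged or committed.
    fn check_and_record(&mut self, att: &Attestation) -> Result<(), String> {
        let prev = self
            .staged
            .get(&att.validator)
            .or_else(|| self.db.committed.get(&att.validator))
            .copied();
        if let Some(prev) = prev {
            if att.target_epoch <= prev {
                return Err(format!("slashable: validator {}", att.validator));
            }
        }
        self.staged.insert(att.validator, att.target_epoch);
        Ok(())
    }

    /// Persist all staged records in one step.
    fn commit(self) {
        self.db.committed.extend(self.staged);
        self.db.commits += 1;
    }
}

/// Check and sign every attestation inside a single transaction, commit,
/// and only then return the signed attestations for broadcasting.
fn sign_batch(db: &mut SlashingDb, atts: &[Attestation]) -> Vec<SignedAttestation> {
    let mut txn = db.begin();
    let mut signed = Vec::new();
    for att in atts {
        // Skip slashable attestations but keep processing the rest.
        if txn.check_and_record(att).is_ok() {
            signed.push(SignedAttestation {
                attestation: att.clone(),
                // Placeholder for a real BLS signature.
                signature: format!("sig({},{})", att.validator, att.target_epoch),
            });
        }
    }
    txn.commit();
    signed
}
```

The caller broadcasts whatever `sign_batch` returns, so nothing leaves the process before the single commit has completed. One trade-off of this sketch is that every attestation in the batch waits for the slowest check before any of them can be broadcast.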
Version
Lighthouse v0.3.4