feat(coprocessor): add ability to revert to a previous state #2122
dartdart26 wants to merge 2 commits into main
Conversation
@Eikix Please don't review yet, still working on it.
Diff coverage (origin/main...HEAD, staged and unstaged changes): 100.0% of added/modified lines.
🧪 CI Insights: Here's what we observed from your CI run for 191db12. 🟢 All jobs passed! But CI Insights is watching 👀
```sql
-- Usage:
--   psql -v chain_id=<CHAIN_ID> -v to_block_number=<BLOCK_NUMBER> -f revert_coprocessor_db_state.sql
--
-- WARNING: For now, do not revert across a key rotation boundary. The tfhe-worker always
```
Should there be a way to detect this & fail fast? We can just treat it as a case we won't code for, for a long time. But it can be safer to fail fast so as to not forget about it (if it's possible to fail fast here).
Let me think about how exactly - I guess we could try to see if a ciphertext for the last block is from key X and if a ciphertext from the first one (the to block) is from key Y. The issue is hosts are not tied to the GW, so that might be a bit tricky.
@Eikix I think this goes back to the offline discussion we had about the GW chain not being in sync with the host chains. AFAICT, there is no reliable way to say a key rotation hasn't happened for a given host chain C in the range we are about to remove.
We could look at the ciphertext_digest table, but we don't know if it has been fully populated with all the digests that would detect a different key (it's a race between the sns-worker and the operator stopping the system to run the reversal script). Maybe still worth adding that check, but keep the warning and be explicit that we can't currently detect it.
I added a best-effort check.
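To illustrate the best-effort idea discussed above, here is a minimal, hypothetical sqlite3 simulation: compare the key of the newest ciphertext at or before the revert point with the key of the oldest ciphertext after it, and fail fast if they differ. The table/column names (`ciphertext_digest`, `key_id`, `txn_block_number`) are illustrative only; the real check is SQL inside revert_coprocessor_db_state.sql against Postgres, and, as noted, the table may not yet hold every digest, so a pass does not guarantee no rotation.

```python
import sqlite3

def check_no_key_rotation(conn, to_block_number):
    """Best-effort fail-fast: raise if ciphertexts on either side of the
    revert point were produced under different keys. The table may not be
    fully populated, so this can miss a rotation but never falsely blocks."""
    cur = conn.cursor()
    cur.execute(
        "SELECT key_id FROM ciphertext_digest "
        "WHERE txn_block_number <= ? ORDER BY txn_block_number DESC LIMIT 1",
        (to_block_number,),
    )
    before = cur.fetchone()
    cur.execute(
        "SELECT key_id FROM ciphertext_digest "
        "WHERE txn_block_number > ? ORDER BY txn_block_number ASC LIMIT 1",
        (to_block_number,),
    )
    after = cur.fetchone()
    if before and after and before[0] != after[0]:
        raise RuntimeError("refusing to revert across a key rotation boundary")

# Hypothetical data: key rotated between blocks 20 and 30.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ciphertext_digest (txn_block_number INTEGER, key_id TEXT)")
conn.executemany(
    "INSERT INTO ciphertext_digest VALUES (?, ?)",
    [(10, "keyA"), (20, "keyA"), (30, "keyB")],
)
try:
    check_no_key_rotation(conn, 25)  # revert point straddles the rotation
except RuntimeError as e:
    print(e)
```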
```sql
  AND host_chain_id = :'chain_id';

DELETE FROM ciphertext_digest
WHERE txn_block_number > :'to_block_number'
```
Is txn_block_number the gateway block number of the txn where the digest is posted? If so, is it being compared to a host block number?
Great catch! I missed that - I think we need to delete by handle here, let me look into it.
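The fix suggested above can be sketched as follows: since ciphertext_digest is keyed by a gateway block number, it cannot be filtered by a host block number directly, so instead the digests to remove are found via their handles, collected from a host-chain-indexed table. This is a hypothetical sqlite3 simulation; the table names (`computations`, `host_block_number`) and schema are illustrative, not the real coprocessor schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE computations (handle TEXT, host_block_number INTEGER);
CREATE TABLE ciphertext_digest (handle TEXT, txn_block_number INTEGER);
-- txn_block_number here is a GATEWAY block number, unrelated to host blocks.
INSERT INTO computations VALUES ('h1', 10), ('h2', 30), ('h3', 40);
INSERT INTO ciphertext_digest VALUES ('h1', 100), ('h2', 101), ('h3', 102);
""")

to_block_number = 25
# Delete digests whose handles belong to reverted host blocks, rather than
# comparing the gateway txn_block_number against a host block number.
conn.execute(
    "DELETE FROM ciphertext_digest WHERE handle IN "
    "(SELECT handle FROM computations WHERE host_block_number > ?)",
    (to_block_number,),
)
remaining = [r[0] for r in conn.execute("SELECT handle FROM ciphertext_digest")]
print(remaining)  # ['h1']
```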
If a coprocessor is in a bad state, it can be useful to revert to a previous state and redo the computations and other operations with the hope that it reaches the correct state. Reversal can be initiated by providing a host chain ID and a block number to revert to. The reversal process expects that all coprocessor components are stopped before running it.

In order to delete data by block number and chain ID, we have to add a DB migration to keep track of them for all relevant tables. For blocks before this migration, we won't be able to revert, but we assume this is acceptable, also given that there isn't much we can do.

The reasoning behind the way we implemented reversal is that we assume we would be able to detect a state drift with ciphertext granularity. Consequently, the first such drift that we observe means that no prior state was affected. That is why we delete all data for blocks including the offending one and the ones after it.

Note that the above assumes all ciphertexts have been computed up to the offending block, but in reality, due to out-of-order processing, that might not be the case. That means some prior ciphertexts might also drift. Therefore, operators might choose to go a bit further back than the offending block to make it even more likely to capture the root cause of the drift.

For now, we cannot revert across a key rotation boundary. The tfhe-worker always uses the latest key from the `keys` table. If a new key was activated after to_block_number, re-processed computations will use the new key instead of the original one, producing incorrect ciphertexts.

This commit adds a test that checks the DB after the reversal script has run, an integration test for the host listener, and an E2E test. Finally, we add the `db-state-revert` docker image that can be used to run the reversal script in a containerized environment.
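The flow described in the commit message can be simulated in miniature: given a host chain ID and a target block number, delete every row for later blocks across the relevant tables, inside a single transaction so the revert is all-or-nothing. This is a hypothetical sqlite3 sketch; the table names (`computations`, `allowed_handles`) and columns are illustrative, and the real implementation is the SQL script revert_coprocessor_db_state.sql run via psql.

```python
import sqlite3

def revert_to_block(conn, chain_id, to_block_number, tables):
    """Delete rows for blocks after to_block_number on the given host chain,
    across all tracked tables, in one transaction (all-or-nothing)."""
    with conn:  # sqlite3: commits on success, rolls back on exception
        for table in tables:
            conn.execute(
                f"DELETE FROM {table} "
                "WHERE host_chain_id = ? AND block_number > ?",
                (chain_id, to_block_number),
            )

conn = sqlite3.connect(":memory:")
for t in ("computations", "allowed_handles"):
    conn.execute(f"CREATE TABLE {t} (host_chain_id INTEGER, block_number INTEGER)")
    conn.executemany(f"INSERT INTO {t} VALUES (?, ?)",
                     [(1, 10), (1, 20), (1, 30), (2, 30)])

# Revert host chain 1 back to block 15; chain 2 must be untouched.
revert_to_block(conn, chain_id=1, to_block_number=15,
                tables=["computations", "allowed_handles"])
rows = list(conn.execute("SELECT * FROM computations ORDER BY block_number"))
print(rows)  # [(1, 10), (2, 30)]
```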
antoniupop left a comment:
Not sure it's relevant/needed, but is this handled on the GW side to account for multiple commitments for the same handle and same copro? These could be the same or different CTs, whether drifted or not. I guess ACL propagation (and delegation) too, since you drop from allowed_handles, so it will be re-sent on catch-up; or is that flow already removed?
I am not exactly sure I get it, but if you refer to sending different ciphertext digests to the GW for the same handle, the assumption here is that a minority of coprocessors drift, with the majority being able to reach consensus. Therefore, if CoprocessorX is being reverted and then submits digests again, it will get … ACL propagation is no longer happening on main.
So we assume we don't actually break out of consensus here? I think this is risky; we would not be able to restore consensus if it happens. I can't quite see the benefit of keeping the old discarded digest in the GW if a copro has re-computed and updated its DB (what about S3? re-upload, or stay consistent with the GW and keep the old one?). I'd be in favour of updating, I think. Am I missing something?
We can't change the GW contracts. The main assumption of the whole system is that a majority is always honest. We can't break that and expect the system to behave nicely. Another problem related to that is that we have only one implementation, leading to correlated bugs on all instances. Re S3, the data is keyed on the digest, so we will keep the old ciphertexts hanging for now. If we want garbage collection on S3, I suggest we do it in another PR.
I'm talking about errors, not malicious instances. These can accumulate (contamination), so we could end up there if the reconciliation process is not fast enough (or we're very unlucky).
Makes sense, yes. In any case, I would still upload corrected CTs as well, even if not GCing old ones.
Re GW - not sure what we can do. Any suggestions? Re upload of corrected CTs to S3 - yes, that will happen automatically.
I'm no expert on GW contracts, but I'd think we want to:
edit: assuming indeed that a majority is honest, this should not break any guarantees, just allow correcting errors after a restore |