
feat(coprocessor): add ability to revert to a previous state#2122

Open
dartdart26 wants to merge 2 commits into main from petar/copro-state-revert-2

Conversation


@dartdart26 dartdart26 commented Mar 17, 2026

If a coprocessor is in a bad state, it can be useful to revert to a previous state and redo the computations and other operations with the hope that it reaches the correct state.

Reversal can be initiated by providing a host chain ID and a block number to revert to. The reversal process expects that all coprocessor components are stopped before running it.

In order to delete data by block number and chain ID, we have to add a DB migration to keep track of them for all relevant tables. For blocks before this migration, we won't be able to revert, but we assume this is acceptable, especially since there isn't much else we could do.
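As a rough illustration of what such a tracking migration could look like (the table and column names below are hypothetical, not necessarily the ones used in this PR):

```sql
-- Hypothetical migration sketch: record which host chain and block
-- produced each row, so the reversal script can later delete by range.
-- Real table/column names in the PR may differ.
ALTER TABLE computations
    ADD COLUMN IF NOT EXISTS host_chain_id BIGINT,
    ADD COLUMN IF NOT EXISTS block_number BIGINT;

-- Rows written before this migration keep NULLs here, which is why
-- blocks from before the migration cannot be reverted.
CREATE INDEX IF NOT EXISTS idx_computations_chain_block
    ON computations (host_chain_id, block_number);
```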

The reasoning behind the way we implemented reversal is that we assume we would be able to detect state drift at ciphertext granularity. Consequently, the first such drift that we observe means that no prior state was affected. That is why we delete all data for the offending block and the blocks after it.

Note that the above assumes all ciphertexts have been computed up to the offending block, but in reality, due to out-of-order processing, that might not be the case. That means some prior ciphertexts might also have drifted. Therefore, operators might choose to go a bit further back than the offending block to make it more likely to capture the root cause of the drift.
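Concretely, the per-table deletion can be sketched as follows, mirroring the `psql -v` variable style of the actual script; the table name here is illustrative, and `to_block_number` is the last block to keep:

```sql
-- Illustrative per-table delete: revert to :to_block_number on :chain_id,
-- i.e. keep that block and remove everything after it (the offending
-- block and later ones). Table name is hypothetical.
DELETE FROM computations
WHERE block_number > :'to_block_number'
  AND host_chain_id = :'chain_id';
```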

For now, we cannot revert across a key rotation boundary. The tfhe-worker always uses the latest key from the `keys` table. If a new key was activated after `to_block_number`, re-processed computations will use the new key instead of the original one, producing incorrect ciphertexts.

This commit adds a test that checks the DB after the reversal script has run, an integration test for the host listener, and an E2E test.

Finally, we add the db-state-revert docker image that can be used to run the reversal script in a containerized environment.

@dartdart26 dartdart26 requested a review from Eikix March 17, 2026 13:56
@dartdart26 dartdart26 self-assigned this Mar 17, 2026
@dartdart26 dartdart26 requested review from a team as code owners March 17, 2026 13:57
@cla-bot cla-bot bot added the cla-signed label Mar 17, 2026
@dartdart26 dartdart26 marked this pull request as draft March 17, 2026 13:57
@dartdart26
Collaborator Author

@Eikix Please don't review yet, still working on it.

@dartdart26 dartdart26 force-pushed the petar/copro-state-revert-2 branch 2 times, most recently from 3ba3c9b to 191db12 on March 17, 2026 at 14:16

github-actions bot commented Mar 17, 2026

Changed Lines Coverage

Coverage of added/modified lines: 100.0%

Per-file breakdown

Diff Coverage

Diff: origin/main...HEAD, staged and unstaged changes

  • coprocessor/fhevm-engine/host-listener/src/database/tfhe_event_propagate.rs (100%)
  • coprocessor/fhevm-engine/test-harness/src/db_utils.rs (100%)

Summary

  • Total: 20 lines
  • Missing: 0 lines
  • Coverage: 100%


mergify bot commented Mar 17, 2026

🧪 CI Insights

Here's what we observed from your CI run for 191db12.

🟢 All jobs passed!

But CI Insights is watching 👀

@dartdart26 dartdart26 force-pushed the petar/copro-state-revert-2 branch 3 times, most recently from 01db65c to 628bfb5 on March 18, 2026 at 15:01

@dartdart26 dartdart26 force-pushed the petar/copro-state-revert-2 branch 4 times, most recently from 07bdd52 to 1e7f041 on March 19, 2026 at 10:19
@dartdart26 dartdart26 force-pushed the petar/copro-state-revert-2 branch 3 times, most recently from f4c8071 to 6375720 on March 19, 2026 at 12:54

@dartdart26 dartdart26 force-pushed the petar/copro-state-revert-2 branch from 6375720 to f120441 on March 19, 2026 at 13:48
@dartdart26 dartdart26 marked this pull request as ready for review March 19, 2026 13:49
-- Usage:
-- psql -v chain_id=<CHAIN_ID> -v to_block_number=<BLOCK_NUMBER> -f revert_coprocessor_db_state.sql
--
-- WARNING: For now, do not revert across a key rotation boundary. The tfhe-worker always

@Eikix Eikix Mar 19, 2026


Should there be a way to detect this & fail fast? We can just treat it as a case we won't code for, for a long time. But it can be safer to fail fast so as to not forget about it (if it's possible to fail fast here).

Collaborator Author

Let me think about how exactly - I guess we could try to see if a ciphertext for the last block is from key X and if a ciphertext from the first one (the to block) is from key Y. The issue is that hosts are not tied to the GW, so that might be a bit tricky.

Collaborator Author

@Eikix I think this goes back to the offline discussion we had about the GW chain not being in sync with the host chains. AFAICT, there is no reliable way to say a key rotation hasn't happened for a given host chain C in the range we are about to remove.

We could look at the ciphertext_digest table, but we don't know if it has been fully populated with all the digests that would detect a different key (it's a race between the sns-worker and the operator stopping the system to run the reversal script). Maybe it's still worth adding that check, while keeping the warning and being explicit that we can't currently detect it.

@dartdart26 dartdart26 Mar 20, 2026

I added a best-effort check.
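A best-effort check of this kind could look roughly like the following (column names are assumed for illustration; the actual check in the PR may differ):

```sql
-- Hypothetical best-effort guard (column names assumed): if the digests
-- stored for this chain reference more than one key, a rotation has
-- happened at some point and the revert range may cross it. This is
-- deliberately conservative; it cannot prove the rotation lies outside
-- the range being removed.
SELECT COUNT(DISTINCT key_id) AS keys_seen
FROM ciphertext_digest
WHERE host_chain_id = :'chain_id';
-- If keys_seen > 1, consider refusing to run the reversal script.
```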

AND host_chain_id = :'chain_id';

DELETE FROM ciphertext_digest
WHERE txn_block_number > :'to_block_number'
Contributor

is txn_block_number the gateway block number of the txn where the digest is posted? if so, is it being compared to a host block number?

Collaborator Author

Great catch! I missed that - I think we need to delete by handle here, let me look into it.

Collaborator Author

Done.

@dartdart26 dartdart26 force-pushed the petar/copro-state-revert-2 branch from f120441 to e078976 on March 20, 2026 at 10:55
@dartdart26 dartdart26 requested a review from antoniupop March 20, 2026 10:55
@dartdart26 dartdart26 force-pushed the petar/copro-state-revert-2 branch from e078976 to 2100490 on March 20, 2026 at 11:16
@antoniupop antoniupop left a comment

Not sure it's relevant/needed, but is this handled on GW side to account for multiple commitments for the same handle and same copro? These could be the same or different CTs whether drifted or not. I guess ACL propagation (and delegation) too since you drop from allowed_handles so will be re-sent on catch-up or is that flow already removed?


dartdart26 commented Mar 20, 2026

> Not sure it's relevant/needed, but is this handled on GW side to account for multiple commitments for the same handle and same copro? These could be the same or different CTs whether drifted or not. I guess ACL propagation (and delegation) too since you drop from allowed_handles so will be re-sent on catch-up or is that flow already removed?

I am not exactly sure I get it, but if you are referring to sending different ciphertext digests to the GW for the same handle, the assumption here is that a minority of coprocessors drift, with the majority being able to reach consensus. Therefore, if CoprocessorX is being reverted and then submits digests again, it will get a CoprocessorAlreadyAdded revert irrespective of whether consensus has been reached or not. CoprocessorX will move on, even if the bad digest it originally sent stays in the GW contract - arguably we want that.

ACL propagation is no longer happening on main.

@antoniupop
Collaborator

> I am not exactly sure if I get it, but if you refer to sending different ciphertext digests to the GW for the same handle, the assumption here is that a minority of coprocessors drift, with the majority being able to reach consensus. Therefore, if CoprocessorX is being reverted and then submits digests again, it will get CoprocessorAlreadyAdded revert irrespective of whether consensus has been reached or not. CoprocessorX will move on, even if the bad digest it originally sent stays in the GW contract - arguably we want that.

So we assume we don't actually break out of consensus here? I think this is risky; we would not be able to restore consensus if it happens. I can't quite see the benefit of keeping the old discarded digest in the GW if a copro has re-computed and updated its DB (what about S3? re-upload, or stay consistent with the GW and keep the old one?). I'd be in favour of updating, I think. Am I missing something?


dartdart26 commented Mar 20, 2026

> I am not exactly sure if I get it, but if you refer to sending different ciphertext digests to the GW for the same handle, the assumption here is that a minority of coprocessors drift, with the majority being able to reach consensus. Therefore, if CoprocessorX is being reverted and then submits digests again, it will get CoprocessorAlreadyAdded revert irrespective of whether consensus has been reached or not. CoprocessorX will move on, even if the bad digest it originally sent stays in the GW contract - arguably we want that.
>
> So we assume we don't actually break out of consensus here? I think this is risky, we would not be able to restore consensus if it happens. I can't quite see the benefit of keeping the old discarded digest in GW if a copro has re-computed and updated its DB (what bout S3? re-upload or stay consistent with GW and keep old one?). I'd be in favour of updating I think. Am I missing something?

We can't change the GW contracts. The main assumption of the whole system is that a majority is always honest. We can't break that and expect the system to behave nicely. Another problem related to that is we have one implementation only, leading to correlated bugs on all instances.

Re S3, the data is keyed on the digest, so we will keep the old ciphertexts hanging for now. If we want garbage collection on S3, I suggest we do it in another PR.

@antoniupop
Collaborator

> We can't change the GW contracts. The main assumption of the whole system is that a majority is always honest. We can't break that and expect the system to behave nicely. Another problem related to that is we have one implementation only, leading to correlated bugs on all instances.

I'm talking about errors, not malicious instances. These can accumulate (contamination) so could end up there if the reconciliation process is not fast enough (or we're very unlucky).

> Re S3, the data is keyed on the digest, so we will keep the old ciphertexts hanging for now. If we want garbage collection on S3, I suggest we do it in another PR.

Makes sense yes. In any case I would still upload corrected CTs as well even if not GCing old ones.

@dartdart26
Collaborator Author

> We can't change the GW contracts. The main assumption of the whole system is that a majority is always honest. We can't break that and expect the system to behave nicely. Another problem related to that is we have one implementation only, leading to correlated bugs on all instances.
>
> I'm talking about errors, not malicious instances. These can accumulate (contamination) so could end up there if the reconciliation process is not fast enough (or we're very unlucky).
>
> Re S3, the data is keyed on the digest, so we will keep the old ciphertexts hanging for now. If we want garbage collection on S3, I suggest we do it in another PR.
>
> Makes sense yes. In any case I would still upload corrected CTs as well even if not GCing old ones.

Re GW - not sure what we can do. Any suggestions?

Re upload of corrected CT to S3 - yes, that will happen automatically.


antoniupop commented Mar 20, 2026

> Re GW - not sure what we can do. Any suggestions?

I'm no expert on GW contracts, I'd think we want to:

  • keep any achieved consensus (so if it was reached already, we don't change anything)
  • update a coprocessor's "vote" if consensus was not yet reached

edit: assuming indeed that a majority is honest, this should not break any guarantees, just allow correcting errors after a restore

@antoniupop antoniupop left a comment

LGTM!
