
feat(coprocessor): add ability to revert to a previous state#2122

Open
dartdart26 wants to merge 2 commits into main from petar/copro-state-revert-2

Conversation


@dartdart26 dartdart26 commented Mar 17, 2026

If a coprocessor is in a bad state, it can be useful to revert to a previous state and redo the computations and other operations with the hope that it reaches the correct state.

Reversal can be initiated by providing a host chain ID and a block number to revert to. The reversal process expects that all coprocessor components are stopped before running it.

In order to delete data by block number and chain ID, we have to add a DB migration to keep track of them for all relevant tables. For blocks before this migration, we won't be able to revert, but we assume this is acceptable, especially since there isn't much else we could do.
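As a rough illustration of what such a tracking migration could look like (the table and column names below are hypothetical, not necessarily the ones used in this PR):

```sql
-- Hypothetical migration sketch: record which host chain and block
-- produced each row, so the reversal script can later delete by range.
-- Real table/column names in the PR may differ.
ALTER TABLE computations
    ADD COLUMN IF NOT EXISTS host_chain_id BIGINT,
    ADD COLUMN IF NOT EXISTS block_number BIGINT;

-- Rows written before this migration keep NULLs here, which is why
-- blocks from before the migration cannot be reverted.
CREATE INDEX IF NOT EXISTS idx_computations_chain_block
    ON computations (host_chain_id, block_number);
```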

The reasoning behind the way we implemented reversal is that we assume we would be able to detect state drift at ciphertext granularity. Consequently, the first such drift that we observe means that no prior state was affected. That is why we delete all data for the offending block and the blocks after it.

Note that the above assumes all ciphertexts have been computed up to the offending block, but in reality, due to out-of-order processing, that might not be the case. That means some prior ciphertexts might also have drifted. Therefore, operators might choose to go a bit further back than the offending block to make it more likely to capture the root cause of the drift.
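Concretely, the per-table deletion can be sketched as follows, mirroring the `psql -v` variable style of the actual script; the table name here is illustrative, and `to_block_number` is the last block to keep:

```sql
-- Illustrative per-table delete: revert to :to_block_number on :chain_id,
-- i.e. keep that block and remove everything after it (the offending
-- block and later ones). Table name is hypothetical.
DELETE FROM computations
WHERE block_number > :'to_block_number'
  AND host_chain_id = :'chain_id';
```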

For now, we cannot revert across a key rotation boundary. The tfhe-worker always uses the latest key from the `keys` table. If a new key was activated after `to_block_number`, re-processed computations will use the new key instead of the original one, producing incorrect ciphertexts.

This commit adds a test that checks the DB after the reversal script has run, an integration test for the host listener, and an E2E test.

Finally, we add the db-state-revert docker image that can be used to run the reversal script in a containerized environment.

@dartdart26 dartdart26 requested a review from Eikix March 17, 2026 13:56
@dartdart26 dartdart26 self-assigned this Mar 17, 2026
@dartdart26 dartdart26 requested review from a team as code owners March 17, 2026 13:57
@cla-bot cla-bot bot added the cla-signed label Mar 17, 2026
@dartdart26 dartdart26 marked this pull request as draft March 17, 2026 13:57
@dartdart26
Collaborator Author

@Eikix Please don't review yet, still working on it.

@dartdart26 dartdart26 force-pushed the petar/copro-state-revert-2 branch 2 times, most recently from 3ba3c9b to 191db12 on March 17, 2026 at 14:16

github-actions bot commented Mar 17, 2026

Changed Lines Coverage

Coverage of added/modified lines: 100.0%

Per-file breakdown

Diff Coverage

Diff: origin/main...HEAD, staged and unstaged changes

  • coprocessor/fhevm-engine/host-listener/src/database/tfhe_event_propagate.rs (100%)
  • coprocessor/fhevm-engine/test-harness/src/db_utils.rs (100%)

Summary

  • Total: 20 lines
  • Missing: 0 lines
  • Coverage: 100%


mergify bot commented Mar 17, 2026

🧪 CI Insights

Here's what we observed from your CI run for 191db12.

🟢 All jobs passed!

But CI Insights is watching 👀

@dartdart26 dartdart26 force-pushed the petar/copro-state-revert-2 branch 3 times, most recently from 01db65c to 628bfb5 on March 18, 2026 at 15:01

@dartdart26 dartdart26 force-pushed the petar/copro-state-revert-2 branch 4 times, most recently from 07bdd52 to 1e7f041 on March 19, 2026 at 10:19
@dartdart26 dartdart26 force-pushed the petar/copro-state-revert-2 branch 3 times, most recently from f4c8071 to 6375720 on March 19, 2026 at 12:54

@dartdart26 dartdart26 force-pushed the petar/copro-state-revert-2 branch from 6375720 to f120441 on March 19, 2026 at 13:48
@dartdart26 dartdart26 marked this pull request as ready for review March 19, 2026 13:49
-- Usage:
-- psql -v chain_id=<CHAIN_ID> -v to_block_number=<BLOCK_NUMBER> -f revert_coprocessor_db_state.sql
--
-- WARNING: For now, do not revert across a key rotation boundary. The tfhe-worker always

@Eikix Eikix Mar 19, 2026


Should there be a way to detect this & fail fast? We can just treat it as a case we won't code for, for a long time. But it can be safer to fail fast so as to not forget about it (if it's possible to fail fast here).

Collaborator Author

Let me think about how exactly - I guess we could try to see if a ciphertext for the last block is from key X and if a ciphertext from the first one (the to block) is from key Y. The issue is that hosts are not tied to the GW, so that might be a bit tricky.

Collaborator Author

@Eikix I think this goes back to the offline discussion we had about the GW chain not being in sync with the host chains. AFAICT, there is no reliable way to say a key rotation hasn't happened for a given host chain C in the range we are about to remove.

We could look at the ciphertext_digest table, but we don't know if it has been fully populated with all the digests that would detect a different key (it's a race between the sns-worker and the operator stopping the system to run the reversal script). Maybe it's still worth adding that check, while keeping the warning and being explicit that we can't currently detect it.

@dartdart26 dartdart26 Mar 20, 2026

I added a best-effort check.
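A best-effort check of this kind could look roughly like the following (column names are assumed for illustration; the actual check in the PR may differ):

```sql
-- Hypothetical best-effort guard (column names assumed): if the digests
-- stored for this chain reference more than one key, a rotation has
-- happened at some point and the revert range may cross it. This is
-- deliberately conservative; it cannot prove the rotation lies outside
-- the range being removed.
SELECT COUNT(DISTINCT key_id) AS keys_seen
FROM ciphertext_digest
WHERE host_chain_id = :'chain_id';
-- If keys_seen > 1, consider refusing to run the reversal script.
```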

AND host_chain_id = :'chain_id';

DELETE FROM ciphertext_digest
WHERE txn_block_number > :'to_block_number'
Contributor

is txn_block_number the gateway block number of the txn where the digest is posted? if so, is it being compared to a host block number?

Collaborator Author

Great catch! I missed that - I think we need to delete by handle here, let me look into it.

Collaborator Author

Done.

@dartdart26 dartdart26 force-pushed the petar/copro-state-revert-2 branch from f120441 to e078976 on March 20, 2026 at 10:55
@dartdart26 dartdart26 requested a review from antoniupop March 20, 2026 10:55
@dartdart26 dartdart26 force-pushed the petar/copro-state-revert-2 branch from e078976 to 2100490 on March 20, 2026 at 11:16
@antoniupop antoniupop left a comment

Not sure it's relevant/needed, but is this handled on GW side to account for multiple commitments for the same handle and same copro? These could be the same or different CTs whether drifted or not. I guess ACL propagation (and delegation) too since you drop from allowed_handles so will be re-sent on catch-up or is that flow already removed?


dartdart26 commented Mar 20, 2026

> Not sure it's relevant/needed, but is this handled on GW side to account for multiple commitments for the same handle and same copro? These could be the same or different CTs whether drifted or not. I guess ACL propagation (and delegation) too since you drop from allowed_handles so will be re-sent on catch-up or is that flow already removed?

I am not exactly sure I get it, but if you are referring to sending different ciphertext digests to the GW for the same handle, the assumption here is that a minority of coprocessors drift, with the majority being able to reach consensus. Therefore, if CoprocessorX is being reverted and then submits digests again, it will get a CoprocessorAlreadyAdded revert irrespective of whether consensus has been reached or not. CoprocessorX will move on, even if the bad digest it originally sent stays in the GW contract - arguably we want that.

ACL propagation is no longer happening on main.

@antoniupop
Collaborator

> I am not exactly sure if I get it, but if you refer to sending different ciphertext digests to the GW for the same handle, the assumption here is that a minority of coprocessors drift, with the majority being able to reach consensus. Therefore, if CoprocessorX is being reverted and then submits digests again, it will get CoprocessorAlreadyAdded revert irrespective of whether consensus has been reached or not. CoprocessorX will move on, even if the bad digest it originally sent stays in the GW contract - arguably we want that.

So we assume we don't actually break out of consensus here? I think this is risky; we would not be able to restore consensus if it happens. I can't quite see the benefit of keeping the old discarded digest in the GW if a copro has re-computed and updated its DB (what about S3? re-upload, or stay consistent with the GW and keep the old one?). I'd be in favour of updating, I think. Am I missing something?


dartdart26 commented Mar 20, 2026

> I am not exactly sure if I get it, but if you refer to sending different ciphertext digests to the GW for the same handle, the assumption here is that a minority of coprocessors drift, with the majority being able to reach consensus. Therefore, if CoprocessorX is being reverted and then submits digests again, it will get CoprocessorAlreadyAdded revert irrespective of whether consensus has been reached or not. CoprocessorX will move on, even if the bad digest it originally sent stays in the GW contract - arguably we want that.
>
> So we assume we don't actually break out of consensus here? I think this is risky, we would not be able to restore consensus if it happens. I can't quite see the benefit of keeping the old discarded digest in GW if a copro has re-computed and updated its DB (what bout S3? re-upload or stay consistent with GW and keep old one?). I'd be in favour of updating I think. Am I missing something?

We can't change the GW contracts. The main assumption of the whole system is that a majority is always honest. We can't break that and expect the system to behave nicely. Another problem related to that is we have one implementation only, leading to correlated bugs on all instances.

Re S3, the data is keyed on the digest, so we will keep the old ciphertexts hanging for now. If we want garbage collection on S3, I suggest we do it in another PR.

@antoniupop
Collaborator

> We can't change the GW contracts. The main assumption of the whole system is that a majority is always honest. We can't break that and expect the system to behave nicely. Another problem related to that is we have one implementation only, leading to correlated bugs on all instances.

I'm talking about errors, not malicious instances. These can accumulate (contamination) so could end up there if the reconciliation process is not fast enough (or we're very unlucky).

> Re S3, the data is keyed on the digest, so we will keep the old ciphertexts hanging for now. If we want garbage collection on S3, I suggest we do it in another PR.

Makes sense yes. In any case I would still upload corrected CTs as well even if not GCing old ones.

@dartdart26
Collaborator Author

> We can't change the GW contracts. The main assumption of the whole system is that a majority is always honest. We can't break that and expect the system to behave nicely. Another problem related to that is we have one implementation only, leading to correlated bugs on all instances.
>
> I'm talking about errors, not malicious instances. These can accumulate (contamination) so could end up there if the reconciliation process is not fast enough (or we're very unlucky).
>
> Re S3, the data is keyed on the digest, so we will keep the old ciphertexts hanging for now. If we want garbage collection on S3, I suggest we do it in another PR.
>
> Makes sense yes. In any case I would still upload corrected CTs as well even if not GCing old ones.

Re GW - not sure what we can do. Any suggestions?

Re upload of corrected CT to S3 - yes, that will happen automatically.


antoniupop commented Mar 20, 2026

> Re GW - not sure what we can do. Any suggestions?

I'm no expert on GW contracts, I'd think we want to:

  • keep any achieved consensus (so if it was reached already, we don't change anything)
  • update a coprocessor's "vote" if consensus was not yet reached

edit: assuming indeed that a majority is honest, this should not break any guarantees, just allow correcting errors after a restore

@antoniupop antoniupop left a comment

LGTM!
