Describe the issue
Description
The exponential backoff mechanism for retrying uncomputable handles doubles uncomputable_counter on every failure and caps it at 32,000. The counter is also the retry delay in seconds. After approximately 15 failures (2^15 = 32,768), every persistently failing item has the same counter value, so the system can no longer distinguish between them when prioritizing retries.
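The saturation point can be checked with a small model of the doubling rule. This is a minimal sketch, not the worker's code: the function names are hypothetical, and it assumes a fresh handle starts with a counter of 1.

```rust
/// Mirrors `LEAST(uncomputable_counter * 2, cap)` from the SQL update.
/// `saturating_mul` keeps the doubling from overflowing for large caps.
fn next_counter(counter: i32, cap: i32) -> i32 {
    counter.saturating_mul(2).min(cap)
}

/// Counts how many consecutive failures it takes for the counter to reach `cap`.
/// Assumption: `start` is the counter value of a handle that has never failed.
fn failures_until_cap(start: i32, cap: i32) -> u32 {
    assert!(start > 0, "a non-positive counter would never grow");
    let mut counter = start;
    let mut failures = 0;
    while counter < cap {
        counter = next_counter(counter, cap);
        failures += 1;
    }
    failures
}
```

Under these assumptions, `failures_until_cap(1, 32_000)` returns 15: the counter reaches 16,384 after 14 failures and is clamped to 32,000 on the next one.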
Location
- File: coprocessor/fhevm-engine/tfhe-worker/src/tfhe_worker.rs
- Function: update_uncomputable_handles()
Code Reference
UPDATE computations
SET schedule_order = CURRENT_TIMESTAMP + INTERVAL '1 second' * uncomputable_counter,
uncomputable_counter = LEAST(uncomputable_counter * 2, 32000)::SMALLINT
Impact
- The system cannot differentiate between an item that has failed for 15 cycles and one that has failed for 100 cycles
- Recently failed items (more likely to succeed) are not prioritized over long-standing failures
- Performance degrades as the worker wastes cycles on items that are unlikely to succeed
Why It Matters
Effective backoff and retry strategies are crucial for efficiency and resilience of distributed computation systems. A flawed strategy leads to wasted resources and slower processing.
Suggested Fix
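Implement a more robust backoff strategy:
- Increase the cap: raise it to a larger value, or remove it if the data type allows. The SMALLINT cast limits the counter to 32,767, so a meaningfully higher cap requires a wider column type
- Add jitter: introduce randomness into the delay to prevent thundering herds
- Incorporate time: record the timestamp of the last failure for better prioritization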
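-- Assumes uncomputable_counter has been widened to INTEGER (65535 exceeds the SMALLINT maximum of 32,767)
-- and that a last_failed_at column has been added
UPDATE computations
SET schedule_order = ...,
uncomputable_counter = LEAST(uncomputable_counter * 2, 65535)::INTEGER,
last_failed_at = CURRENT_TIMESTAMP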
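The three suggestions can be modelled outside the database to see how they interact. The following is a rough sketch under stated assumptions, not the worker's implementation. The `Retry` struct, the function names, and the representation of `last_failed_at` as seconds are all hypothetical. The jitter formula (a symmetric spread around the base delay) is one common choice, not something the codebase prescribes.

```rust
use std::cmp::Ordering;

/// Hypothetical in-memory view of a failing row; field names follow the issue.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Retry {
    uncomputable_counter: u32,
    last_failed_at: u64, // assumed: seconds since some fixed epoch
}

/// Doubles the counter up to `cap`, using a type wider than SMALLINT.
fn next_counter(counter: u32, cap: u32) -> u32 {
    counter.saturating_mul(2).min(cap)
}

/// Spreads a base delay over [base * (1 - fraction), base * (1 + fraction)).
/// `unit_random` is a sample in [0, 1) supplied by whatever RNG the caller uses.
fn jittered_delay_secs(base: u32, fraction: f64, unit_random: f64) -> u64 {
    let offset = (unit_random * 2.0 - 1.0) * fraction; // in [-fraction, +fraction)
    let delay = f64::from(base) * (1.0 + offset);
    delay.max(0.0).round() as u64
}

/// Orders retries: lower counter first; among equal counters,
/// the most recent failure first, so it is retried before older failures.
fn retry_priority(a: &Retry, b: &Retry) -> Ordering {
    a.uncomputable_counter
        .cmp(&b.uncomputable_counter)
        .then_with(|| b.last_failed_at.cmp(&a.last_failed_at))
}
```

With the timestamp as a tie-breaker, two items that have both reached the cap are still ordered by how recently they failed, which restores the distinction the counter alone loses.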
Reproduction Steps
- Create a computation handle designed to always fail
- Allow the worker to attempt processing it more than 15 times
- Observe uncomputable_counter capped at 32,000
- Create a second handle that fails for the first time
- After about 15 further failures, it also reaches 32,000
- Both items now have the same retry priority despite different failure histories
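The expected outcome of these steps can be previewed without a database. The sketch below is a hypothetical in-memory stand-in for a row in `computations`. It models only the counter, and it assumes the counter starts at 1.

```rust
/// Hypothetical stand-in for a row in `computations`; only the counter is modelled.
struct Handle {
    uncomputable_counter: i16, // matches the SMALLINT cast in the query
}

impl Handle {
    /// Assumption: a handle that has never failed starts with a counter of 1.
    fn new() -> Self {
        Handle { uncomputable_counter: 1 }
    }

    /// Applies the capped doubling from the UPDATE statement for one failed attempt.
    fn record_failure(&mut self) {
        // Widen to i32 before doubling, as the SQL expression does before its cast.
        let doubled = i32::from(self.uncomputable_counter) * 2;
        self.uncomputable_counter = doubled.min(32_000) as i16;
    }
}

/// Returns the counter of a fresh handle after `failures` consecutive failures.
fn counter_after(failures: u32) -> i16 {
    let mut handle = Handle::new();
    for _ in 0..failures {
        handle.record_failure();
    }
    handle.uncomputable_counter
}
```

Here `counter_after(15)` and `counter_after(100)` return the same value, which is the loss of distinction described in the last step.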
Context
No response
Steps to Reproduce or Propose
No response