Dear dalet-oss team,
I'm opening this issue to report a critical problem I encountered while using the ArangoBee tool, and to suggest a possible solution to improve its robustness.
-Problem Description:
Currently, a service using ArangoBee may fail to start properly if, during a previous deployment, the lock record was not removed from the `arangolock` collection due to unexpected errors or interruptions.
As a result, the lock remains and blocks future migration attempts, causing the following error:
"Caused by: com.github.arangobee.exception.ArangobeeException: Arangobee did not acquire process lock. Exiting."
This can happen in situations such as:
• Loss of connection to ArangoDB during the migration process
• Network issues or unexpected termination
• Other unexpected failures
-Suggested Solution: Stale Lock Detection and Cleanup.
To address this issue, I suggest implementing logic for stale lock detection and automatic removal. Here's how it could work (a rough code sketch follows the list below):
- Stale Lock Detection:
Add an `acquiredAt` timestamp to the lock record.
If a lock exists but is older than a configurable `lockExpirationTime`, it is considered "stale".
Stale locks should be automatically removed before trying to acquire a new lock.
- Improved Lock Acquisition Logic:
On startup, first attempt to acquire the lock as usual.
If lock acquisition fails, check the existing lock's `acquiredAt` timestamp.
If the lock is stale (older than `lockExpirationTime`), remove it and acquire a new one before proceeding with migrations.
- Background Heartbeat Process:
To prevent race conditions, implement a background heartbeat that periodically updates the `acquiredAt` timestamp (for example, every `lockExpirationTime / 3`).
This ensures that as long as migrations are running, the lock stays fresh and other instances won’t mistakenly consider it stale.
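For illustration, here is a rough Java sketch of what this could look like on top of the ArangoDB Java driver. All class, method, and configuration names (`StaleAwareLockManager`, `lockExpirationTime`, the `"lock"` document key, etc.) are hypothetical and not the existing ArangoBee internals; a production version would also want an atomic check-and-insert (e.g. an AQL upsert) instead of the separate read/delete/insert shown here:

```java
import com.arangodb.ArangoCollection;
import com.arangodb.ArangoDatabase;
import com.arangodb.entity.BaseDocument;

import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only: names and structure are hypothetical,
// not the actual ArangoBee lock implementation.
public class StaleAwareLockManager {

    private static final String LOCK_COLLECTION = "arangolock";
    private static final String LOCK_KEY = "lock";        // assumed single lock document
    private static final String ACQUIRED_AT = "acquiredAt";

    private final ArangoCollection locks;
    private final Duration lockExpirationTime;
    private final ScheduledExecutorService heartbeat =
            Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> heartbeatTask;

    public StaleAwareLockManager(ArangoDatabase db, Duration lockExpirationTime) {
        this.locks = db.collection(LOCK_COLLECTION);
        this.lockExpirationTime = lockExpirationTime;
    }

    /** Tries to acquire the lock, removing an existing lock first if it is stale. */
    public synchronized boolean acquireLock() {
        BaseDocument existing = locks.getDocument(LOCK_KEY, BaseDocument.class);
        if (existing != null) {
            Instant acquiredAt = Instant.parse((String) existing.getAttribute(ACQUIRED_AT));
            boolean stale = acquiredAt.plus(lockExpirationTime).isBefore(Instant.now());
            if (!stale) {
                return false; // another instance is actively running migrations
            }
            locks.deleteDocument(LOCK_KEY); // stale lock left by a crashed run: clean it up
        }
        BaseDocument lock = new BaseDocument(LOCK_KEY);
        lock.addAttribute(ACQUIRED_AT, Instant.now().toString());
        locks.insertDocument(lock);
        startHeartbeat();
        return true;
    }

    /** Periodically refreshes acquiredAt so the lock never looks stale while we hold it. */
    private void startHeartbeat() {
        long periodMillis = lockExpirationTime.toMillis() / 3;
        heartbeatTask = heartbeat.scheduleAtFixedRate(() -> {
            BaseDocument refresh = new BaseDocument(LOCK_KEY);
            refresh.addAttribute(ACQUIRED_AT, Instant.now().toString());
            locks.updateDocument(LOCK_KEY, refresh);
        }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Releases the lock and stops the heartbeat after migrations finish. */
    public synchronized void releaseLock() {
        if (heartbeatTask != null) {
            heartbeatTask.cancel(false);
        }
        heartbeat.shutdown();
        locks.deleteDocument(LOCK_KEY);
    }
}
```

Refreshing at `lockExpirationTime / 3` leaves room for a missed heartbeat or two before other instances would consider the lock stale.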
-Why This Matters:
Without this mechanism, deployments can be blocked by stale locks, requiring manual deletion. This is especially problematic in distributed environments with multiple replicas (e.g., Kubernetes).
I’d be happy to submit a pull request if you think this feature aligns with the project's goals.
Best regards,
Gleb