Dear dalet-oss team,
I'm opening this issue to report a critical problem I encountered while using the ArangoBee tool, and to suggest a possible solution to improve its robustness.
-Problem Description:
Currently, a service using ArangoBee may fail to start properly if, during a previous deployment, the lock record was not removed from the `arangolock` collection due to unexpected errors or interruptions.
As a result, the lock remains and blocks future migration attempts, causing the following error:
"Caused by: com.github.arangobee.exception.ArangobeeException: Arangobee did not acquire process lock. Exiting."
This can happen in situations such as:
• Loss of connection to ArangoDB during the migration process
• Network issues or unexpected termination
• Other unexpected failures
-Suggested Solution: Stale Lock Detection and Cleanup.
To address this issue, I suggest implementing logic for stale lock detection and automatic removal. Here's how it could work (a rough code sketch follows the list below):
- Stale Lock Detection:
Add an `acquiredAt` timestamp to the lock record.
If a lock exists but is older than a configurable `lockExpirationTime`, it is considered "stale".
Stale locks should be automatically removed before trying to acquire a new lock.
- Improved Lock Acquisition Logic:
On startup, first attempt to acquire the lock as usual.
If lock acquisition fails, check the existing lock's `acquiredAt` timestamp.
If the lock is stale (older than `lockExpirationTime`), remove it and acquire a new one before proceeding with migrations.
- Background Heartbeat Process:
To prevent race conditions, implement a background heartbeat that periodically updates the `acquiredAt` timestamp (for example, every `lockExpirationTime / 3`).
This ensures that as long as migrations are running, the lock stays fresh and other instances won’t mistakenly consider it stale.
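For illustration, here is a rough Java sketch of what this could look like on top of the ArangoDB Java driver. All class, method, and configuration names (`StaleAwareLockManager`, `lockExpirationTime`, the `"lock"` document key, etc.) are hypothetical and not the existing ArangoBee internals; a production version would also want an atomic check-and-insert (e.g. an AQL upsert) instead of the separate read/delete/insert shown here:

```java
import com.arangodb.ArangoCollection;
import com.arangodb.ArangoDatabase;
import com.arangodb.entity.BaseDocument;

import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only: names and structure are hypothetical,
// not the actual ArangoBee lock implementation.
public class StaleAwareLockManager {

    private static final String LOCK_COLLECTION = "arangolock";
    private static final String LOCK_KEY = "lock";        // assumed single lock document
    private static final String ACQUIRED_AT = "acquiredAt";

    private final ArangoCollection locks;
    private final Duration lockExpirationTime;
    private final ScheduledExecutorService heartbeat =
            Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> heartbeatTask;

    public StaleAwareLockManager(ArangoDatabase db, Duration lockExpirationTime) {
        this.locks = db.collection(LOCK_COLLECTION);
        this.lockExpirationTime = lockExpirationTime;
    }

    /** Tries to acquire the lock, removing an existing lock first if it is stale. */
    public synchronized boolean acquireLock() {
        BaseDocument existing = locks.getDocument(LOCK_KEY, BaseDocument.class);
        if (existing != null) {
            Instant acquiredAt = Instant.parse((String) existing.getAttribute(ACQUIRED_AT));
            boolean stale = acquiredAt.plus(lockExpirationTime).isBefore(Instant.now());
            if (!stale) {
                return false; // another instance is actively running migrations
            }
            locks.deleteDocument(LOCK_KEY); // stale lock left by a crashed run: clean it up
        }
        BaseDocument lock = new BaseDocument(LOCK_KEY);
        lock.addAttribute(ACQUIRED_AT, Instant.now().toString());
        locks.insertDocument(lock);
        startHeartbeat();
        return true;
    }

    /** Periodically refreshes acquiredAt so the lock never looks stale while we hold it. */
    private void startHeartbeat() {
        long periodMillis = lockExpirationTime.toMillis() / 3;
        heartbeatTask = heartbeat.scheduleAtFixedRate(() -> {
            BaseDocument refresh = new BaseDocument(LOCK_KEY);
            refresh.addAttribute(ACQUIRED_AT, Instant.now().toString());
            locks.updateDocument(LOCK_KEY, refresh);
        }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Releases the lock and stops the heartbeat after migrations finish. */
    public synchronized void releaseLock() {
        if (heartbeatTask != null) {
            heartbeatTask.cancel(false);
        }
        heartbeat.shutdown();
        locks.deleteDocument(LOCK_KEY);
    }
}
```

Refreshing at `lockExpirationTime / 3` leaves room for a missed heartbeat or two before other instances would consider the lock stale.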
-Why This Matters:
Without this mechanism, deployments can be blocked by stale locks, requiring manual deletion. This is especially problematic in distributed environments with multiple replicas (e.g., Kubernetes).
I’d be happy to submit a pull request if you think this feature aligns with the project's goals.
Best regards,
Gleb