
Handle stale locks to avoid application startup failure caused by an unreleased lock record #5

@Gleb-Y


Dear dalet-oss team,
I'm opening this issue to report a critical problem I encountered while using the ArangoBee tool, and to suggest a possible solution to improve its robustness.

-Problem Description:
Currently, a service using ArangoBee may fail to start properly if, during a previous deployment, the lock record was not removed from the arangolock collection due to unexpected errors or interruptions.
As a result, the lock remains and blocks future migration attempts, causing the following error:
"Caused by: com.github.arangobee.exception.ArangobeeException: Arangobee did not acquire process lock. Exiting."
This can happen in situations such as:
• Loss of connection to ArangoDB during the migration process
• Network issues or unexpected termination
• Other unexpected failures
 
-Suggested Solution: Stale Lock Detection and Cleanup.
To address this issue, I suggest implementing logic for stale lock detection and automatic removal. Here's how it could work:
 

  1. Stale Lock Detection:
    Add an acquiredAt timestamp to the lock record.
    If a lock exists but is older than a configurable lockExpirationTime, it is considered "stale".
    Stale locks should be automatically removed before trying to acquire a new lock.
     
  2. Improved Lock Acquisition Logic:
    On startup, first attempt to acquire the lock as usual.
    If lock acquisition fails, check the existing lock's acquiredAt timestamp.
    If the lock is stale (older than lockExpirationTime), remove it and acquire a new one before proceeding with migrations.
     
  3. Background Heartbeat Process:
    To keep a lock that is still in use from being treated as stale, implement a background heartbeat that periodically updates the acquiredAt timestamp (for example, every lockExpirationTime / 3).
    This ensures that as long as migrations are running, the lock stays fresh and other instances won’t mistakenly consider it stale.
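
A minimal Java sketch of the three steps above, to make the proposal concrete. It is not ArangoBee code: StaleLockSketch, LockRecord, InMemoryLockStore and every method name are hypothetical, and the in-memory store only stands in for the arangolock collection. A real implementation would need equivalent conditional insert, remove and update operations against ArangoDB.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch; none of these names exist in ArangoBee. */
class StaleLockSketch {

    /** Lock record carrying the proposed acquiredAt timestamp. */
    static final class LockRecord {
        final String owner;
        final Instant acquiredAt;
        LockRecord(String owner, Instant acquiredAt) {
            this.owner = owner;
            this.acquiredAt = acquiredAt;
        }
    }

    /** In-memory stand-in for the arangolock collection (assumption). */
    static final class InMemoryLockStore {
        private final AtomicReference<LockRecord> current = new AtomicReference<>();

        /** Inserts the lock only if no lock record exists. */
        boolean tryInsert(LockRecord candidate) {
            return current.compareAndSet(null, candidate);
        }

        LockRecord read() {
            return current.get();
        }

        /** Removes the lock only if it is still the record that was inspected. */
        boolean removeIfSame(LockRecord expected) {
            return current.compareAndSet(expected, null);
        }

        /** Heartbeat update: refreshes acquiredAt, but only for the current owner. */
        boolean touch(String owner, Instant now) {
            LockRecord existing = current.get();
            return existing != null
                    && existing.owner.equals(owner)
                    && current.compareAndSet(existing, new LockRecord(owner, now));
        }
    }

    /** Step 1: a lock is stale when it is older than lockExpirationTime. */
    static boolean isStale(LockRecord lock, Instant now, Duration lockExpirationTime) {
        return Duration.between(lock.acquiredAt, now).compareTo(lockExpirationTime) > 0;
    }

    /** Step 2: acquire as usual; on failure, replace the lock only if it is stale. */
    static boolean acquire(InMemoryLockStore store, String owner,
                           Instant now, Duration lockExpirationTime) {
        if (store.tryInsert(new LockRecord(owner, now))) {
            return true;
        }
        LockRecord existing = store.read();
        if (existing != null && !isStale(existing, now, lockExpirationTime)) {
            return false; // another instance holds a fresh lock
        }
        if (existing != null) {
            // Conditional removal: a lock refreshed or replaced in the meantime is kept.
            store.removeIfSame(existing);
        }
        return store.tryInsert(new LockRecord(owner, now));
    }

    /** Step 3: refresh acquiredAt every lockExpirationTime / 3 while migrations run. */
    static ScheduledExecutorService startHeartbeat(InMemoryLockStore store, String owner,
                                                   Duration lockExpirationTime) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        long periodMillis = Math.max(1, lockExpirationTime.toMillis() / 3);
        scheduler.scheduleAtFixedRate(() -> store.touch(owner, Instant.now()),
                periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        return scheduler; // the caller must call shutdownNow() once migrations finish
    }
}
```

The removal is conditional on the lock being the same record that was judged stale, so two instances that start at the same moment cannot both delete and re-acquire it. The sketch also compares timestamps taken on different machines, which assumes their clocks are reasonably in sync; lockExpirationTime would need to be large enough to absorb any skew.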

-Why This Matters:
Without this mechanism, deployments can be blocked by stale locks, requiring manual deletion. This is especially problematic in distributed environments with multiple replicas (e.g., Kubernetes).
 
I’d be happy to submit a pull request if you think this feature aligns with the project's goals.
 
Best regards,
Gleb
