Master lock getting out of sync in case the LUA script gets interrupted #553

Open
@valo

Description

We hit a production problem yesterday: scheduled tasks were not being executed even though the scheduler was running. After investigating, we found that the master lock key in Redis was set to a value but had no TTL, which led us to this function: https://github.com/resque/resque-scheduler/blob/master/lib/resque/scheduler/lock/resilient.rb#L54

Because of this inconsistency, no master node was elected (even though we don't run multiple schedulers) and all scheduled jobs were blocked.
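For anyone who wants to check whether they are in the same state, here is a minimal sketch (not code from resque-scheduler). It relies on the Redis TTL command returning -1 for a key that exists without an expiry and -2 for a key that does not exist. `lock_state` and the `key` argument are hypothetical names, and `redis` is assumed to be a redis-rb style client that responds to `ttl`.

```ruby
# Classify a lock key by its TTL.
# Assumption: `redis` responds to #ttl like a redis-rb client,
# and `key` is whatever key your scheduler uses for the master lock.
def lock_state(redis, key)
  case redis.ttl(key)
  when -2 then :missing # key does not exist, so the lock is free
  when -1 then :stuck   # key exists but will never expire
  else :held            # key exists and has a TTL
  end
end
```

A result of `:stuck` matches what we saw: the key is present, nothing will ever expire it, and it has to be deleted by hand.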

I believe that setting this lock with two separate operations, SETNX and EXPIRE, is not atomic, even though both are executed in a Lua script: if the script is interrupted between them, the key is left without a TTL. The two operations need to be atomic, which can be achieved with a single SET using the NX and PX options. An even better solution would be to use a lock implementation that has been reviewed by the community, for example one that follows the Redis guidelines for distributed locks: http://redis.io/topics/distlock
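To illustrate the single-command approach, here is a rough sketch, not a patch for resilient.rb. It assumes a redis-rb style client whose `set` accepts `nx:` and `px:` and returns true or false. `acquire_lock`, `key`, and `ttl_ms` are hypothetical names, and the random token follows the distlock guideline of storing a unique value per holder.

```ruby
require "securerandom"

# Try to take the lock with one command: SET key token NX PX ttl_ms.
# The value and the expiry are written together, so the key can never
# exist without a TTL.
# Returns the token on success, or nil if the lock is already held.
def acquire_lock(redis, key, ttl_ms)
  token = SecureRandom.hex(16) # unique value identifying this holder
  redis.set(key, token, nx: true, px: ttl_ms) ? token : nil
end
```

With a real client this would be called as `acquire_lock(Redis.new, "some:lock:key", 30_000)`, where the key name is just a placeholder.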
