Description
Yesterday we hit a production problem: scheduled tasks were not being executed even though the scheduler was running. On investigation we found that the master lock key in Redis was set to a value but had no TTL, which led us to this function: https://github.com/resque/resque-scheduler/blob/master/lib/resque/scheduler/lock/resilient.rb#L54
Because of this inconsistency, no master node was elected (even though we do not run multiple schedulers), and all scheduled jobs were blocked.
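For anyone who wants to check whether they are in the same state, here is a minimal sketch of the test we effectively did by hand. The helper name is hypothetical and not part of resque-scheduler; it only assumes a client that responds to `ttl(key)` the way redis-rb does, and you need to pass in whatever lock key your setup uses.

```ruby
# Hypothetical diagnostic helper, not part of resque-scheduler.
# `redis` is assumed to be any client that responds to `ttl(key)`
# like redis-rb does.
def lock_stuck_without_ttl?(redis, key)
  # The Redis TTL command returns -1 when the key exists but has no
  # expiry, and -2 when the key does not exist at all.
  redis.ttl(key) == -1
end
```

If this returns true for the master lock key, the lock will never expire on its own, which matches what we observed.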
I believe that setting this lock with two separate operations, SETNX and EXPIRE, is not atomic, even though both are executed in a Lua script. The two operations need to be atomic, and this can be achieved with a single SET command using the NX and PX options. An even better solution would be to use a lock implementation that has been reviewed by the community, for example one that follows the Redis guidelines for distributed locks: http://redis.io/topics/distlock
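As a rough illustration of the SET NX PX suggestion, here is a minimal sketch, not a patch against resilient.rb. The helper name and its parameters are hypothetical; it assumes a client whose `set` accepts the `nx:` and `px:` options, as redis-rb does.

```ruby
# Hypothetical helper, not part of resque-scheduler.
# `redis` is assumed to be any client whose `set` accepts the
# `nx:` and `px:` options, as redis-rb does.
def acquire_master_lock(redis, key, value, ttl_ms)
  # Issues SET key value NX PX ttl_ms: the value and the expiry are
  # written by one command, so the key cannot be left without a TTL.
  # With nx: true the call is truthy only if the key did not exist.
  redis.set(key, value, nx: true, px: ttl_ms) ? true : false
end
```

With redis-rb this could be called as `acquire_master_lock(Redis.new, lock_key, identity, 180_000)`, where `lock_key`, `identity`, and the timeout are placeholders for whatever the scheduler actually uses. This covers only acquisition; releasing and renewing the lock safely would still need the compare-and-delete approach described in the distlock guidelines.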