
back-pressure / back-off / circuit breaker #56

@johnnagro

I am relatively new to Sidekiq and to this extension, so please excuse me if there is already a known pattern for this; I was unable to find one elsewhere. This library has a lot of similar, useful functionality and you seem open to new feature ideas, so feel free to point me somewhere else or to debate the merits of this idea.

One of the ways we use Sidekiq is to deliver webhooks, but this idea can apply to other tasks as well. In addition to using the throttle configuration parameters, such as concurrency and threshold, we would also find it useful to be able to 'trip' the throttle/breaker explicitly from inside the perform(...) method, for example after receiving back-pressure from the web service (e.g. getting a 429 response code). For the sake of this description, I'm suggesting you would be able to call a method like back_off(...) inside perform(...) (I am not married to that name).

When you receive a 429 you often get a hint about when the web service will start accepting your requests again (perhaps via a Retry-After header). So I imagine the back_off(...) method would optionally accept a timestamp for when to begin processing again; otherwise the default would be some sort of exponential back-off or (bonus) a configured Proc, akin to how Sidekiq allows users to override the default back-off schedule for job retries.
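
To make the shape of that concrete, here is a rough sketch. The back_off method and the :back_off option are hypothetical names for the proposed API (nothing like them exists in sidekiq-throttled today); the Proc is modeled loosely on how Sidekiq lets you override the retry schedule with sidekiq_retry_in:

class WebhookWorker
  include Sidekiq::Worker
  include Sidekiq::Throttled::Worker

  # Hypothetical :back_off option: a Proc that receives the number of
  # consecutive back-offs for this key and returns a delay in seconds,
  # similar in spirit to overriding sidekiq_retry_in for job retries.
  sidekiq_throttle(
    concurrency: { limit: 10 },
    back_off:    ->(count) { (count**4) + 15 }
  )

  def perform(url, payload)
    # ... on back-pressure, either pass an explicit resume time:
    back_off(resume_at: Time.now + 300)
    # ... or call back_off with no arguments to use the schedule above.
  end
end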

Just like normal usage of these throttles, I imagine this would remain key-aware, so backing off would throttle all the jobs with the same key, allowing back-off for specific job parameters rather than for all jobs of that worker type. Once the back-off period elapses, jobs would process as normal under the configuration provided.
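
For illustration, using the existing key_suffix option (which is real sidekiq-throttled functionality) to key the throttle by destination host, the hypothetical back_off call would only pause jobs that resolve to the same key:

require 'uri'

class WebhookWorker
  include Sidekiq::Worker
  include Sidekiq::Throttled::Worker

  # Throttle per destination host instead of per worker class
  # (key_suffix receives the job arguments).
  sidekiq_throttle(
    concurrency: { limit: 5, key_suffix: ->(url, _payload) { URI(url).host } }
  )

  def perform(url, payload)
    # ... on a 429 from api.example.com, a hypothetical back_off call here
    # would pause only jobs keyed to api.example.com; webhooks to other
    # hosts keep flowing under the normal concurrency limit.
  end
end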

Given that resuming a high-concurrency worker type all at once after a back-off period might not be the most graceful thing to do, some circuit-breaker implementations provide flexibility/configuration around resuming. For example, they might ramp up the concurrency of the resumed worker type (key, in this case) starting with just one job to 'test' the breaker. I am not convinced you would need this right away (or at all), and I am worried the added complexity might not have a good cost/benefit ratio; perhaps it's a nice-to-have feature. In the case where you're getting nice hints about when to resume, you probably wouldn't need this. However, in the case where you're getting back-pressure because the resource is overloaded, unhealthy, or otherwise unavailable (e.g. you're getting timeouts or 5xxs back), a more gradual resume would be more appropriate.
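
If a gradual resume ever seems worth the complexity, it could be expressed as a few extra options on the same throttle. Purely a sketch of hypothetical option names, just to show the shape of the idea:

sidekiq_throttle(
  concurrency: { limit: 20 },
  back_off: {
    resume:      :gradual, # vs. :immediate
    probe_limit: 1,        # "half-open": let a single job through to test the breaker
    ramp_step:   5         # then raise the effective limit step by step
  }
)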

Anyway, I appreciate the consideration. I'm happy to help implement this if you think it's a good idea; your advice on the approach would be appreciated before I submit a PR.

Here is a simple example:

require 'rest-client'

class WebhookWorker
  include Sidekiq::Worker
  include Sidekiq::Throttled::Worker

  def perform(url, payload)
    RestClient::Request.execute(
      method:  :post,
      url:     url,
      payload: payload.to_json,
      headers: { content_type: :json, accept: :json }
    )
  rescue RestClient::TooManyRequests => err # HTTP 429 response
    # rest-client exposes response headers as symbolized, underscored keys
    reset = err.response && err.response.headers[:x_rate_limit_reset]
    if reset
      # https://developer.twitter.com/en/docs/basics/rate-limiting.html
      # x-rate-limit-reset: the remaining window before the rate limit resets,
      # in UTC epoch seconds
      back_off(resume_at: Time.at(reset.to_i))
    else
      back_off
    end
  end
end
