Report and/or gracefully handle rate limiter connection issues #491
Open
Description
Worker depends on an external Redis instance for rate limiting. Because this Redis runs on Heroku, its endpoint can occasionally change.
If worker cannot connect to this Redis, it fails to start, so no jobs can run.
We should:
- make worker report this issue (and add an alert)
- make worker handle this failure more gracefully
Open question: How should worker handle a situation where it cannot connect to its Redis? Should it:
- fail to start (as currently)
- start, but disregard rate limits completely
- start, but in some different fallback mode where it makes fewer requests
- something else?
Alternatively, it would not hurt to reevaluate worker's behavior towards the GCE API. Is there any way we can get rid of the dependency on an external rate-limit-checker completely?
References
- Related incident
- travis-ci/reliability#113 (worker: make connection strings runtime configurable)
- GCE rate limit info