Manage Porter threads and backpressure better #106
Pull request overview
Adds request-level backpressure to Porter’s Flask/Hendrix web controller by limiting concurrent in-flight requests, while exempting health-check style endpoints.
Changes:
- Introduces a global in-flight request semaphore with an acquire timeout and 429 on overload.
- Exempts `/get_ursulas` and `/bucket_sampling` from the in-flight cap.
- Lowers Hendrix deployer default max threadpool size.
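The gating described in the list above can be sketched without Flask, using only the standard library. This is a minimal illustration rather than the code in `porter/controllers.py`: the `InFlightGate` class, the `handle` helper, and the `/some_endpoint` path are hypothetical names, and only the two exempt paths and the 429 response shape come from this PR. In a Flask app, `enter` would correspond to a before-request hook and `leave` to a teardown hook.

```python
import threading

# Paths named in this PR as exempt from the in-flight cap.
EXEMPT_PATHS = frozenset({"/get_ursulas", "/bucket_sampling"})


class InFlightGate:
    """Hypothetical helper: caps concurrent requests, exempt paths bypass the cap."""

    def __init__(self, max_in_flight, acquire_timeout_s, exempt_paths=EXEMPT_PATHS):
        self._semaphore = threading.BoundedSemaphore(max_in_flight)
        self._timeout = acquire_timeout_s
        self._exempt = exempt_paths

    def enter(self, path):
        """Call before handling a request. Returns (admitted, holds_slot)."""
        if path in self._exempt:
            return True, False  # admitted without consuming a slot
        acquired = self._semaphore.acquire(timeout=self._timeout)
        return acquired, acquired

    def leave(self, holds_slot):
        """Call after the request finishes, whether it succeeded or failed."""
        if holds_slot:
            self._semaphore.release()


def handle(gate, path, handler):
    """Hypothetical request wrapper returning (status, headers, body)."""
    admitted, holds_slot = gate.enter(path)
    if not admitted:
        body = {
            "error": "too_many_requests",
            "message": "Too many in-flight requests.",
        }
        return 429, {"Retry-After": "3"}, body
    try:
        return 200, {}, handler()
    finally:
        gate.leave(holds_slot)
```

Releasing only when a slot was actually taken matters here: a `BoundedSemaphore` raises `ValueError` if it is released more times than it was acquired, so exempt and rejected requests must not call `release`.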
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| porter/main.py | Defines the non-capped health-check paths and passes them into the WebController. |
| porter/controllers.py | Implements the in-flight semaphore gating in Flask request hooks and adjusts deployer defaults. |
Codecov Report
✅ All modified and coverable lines are covered by tests.

| | development | #106 | +/- |
|---|---|---|---|
| Coverage | 90.91% | 91.19% | +0.27% |
| Files | 18 | 18 | |
| Lines | 969 | 999 | +30 |
| Hits | 881 | 911 | +30 |
| Misses | 88 | 88 | |
Force-pushed from 8244738 to 40224ed.
Pull request overview
Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.
    self.log.warn("Too many in-flight requests.")
    response_data = {
        "error": "too_many_requests",
        "message": "Too many in-flight requests.",
    }
    response = make_response(
        jsonify(response_data), 429, {"Retry-After": "3"}
    )
Retry-After is hardcoded to "3" even though the backpressure wait time is configurable via PORTER_DEPLOYER_IN_FLIGHT_ACQUIRE_TIMEOUT_S. If operators change the timeout, clients will receive an inaccurate retry hint. Consider deriving Retry-After from _DEPLOYER_IN_FLIGHT_ACQUIRE_TIMEOUT_S (e.g., ceil to seconds) or making it separately configurable, and keep the header and timeout consistent.
This is an inexact science. There is no guarantee about what the best value is because it depends on load. It's not necessarily related to the acquire timeout: the requester has already waited for that timeout without success, and how much longer to wait before trying again is uncertain.
I think 3s is a good balance between not too short and not too long, so I'll keep the constant for now for simplicity.
Perhaps reviewers have thoughts here?
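For reviewers weighing the two options in this thread, here is a small sketch of both side by side: the fixed 3-second hint and a hint derived from the acquire timeout by rounding up to whole seconds. The function names and the default timeout value are assumptions made for illustration; only the environment variable name and the constant `"3"` appear in this PR.

```python
import math
import os

# Environment variable named in the review comment above.
ENV_VAR = "PORTER_DEPLOYER_IN_FLIGHT_ACQUIRE_TIMEOUT_S"
# Assumption for illustration only: the real default is not stated in this thread.
ASSUMED_DEFAULT_TIMEOUT_S = 3.0
# The constant the author chose to keep.
FIXED_RETRY_AFTER_S = 3


def acquire_timeout_s(environ=os.environ):
    """Read the acquire timeout in seconds, falling back to the assumed default."""
    return float(environ.get(ENV_VAR, ASSUMED_DEFAULT_TIMEOUT_S))


def retry_after_header(timeout_s=None):
    """Return a Retry-After value in whole seconds, as a string.

    With no argument, return the fixed constant (the author's choice).
    With a timeout, round it up to whole seconds, never below 1
    (the reviewer's suggestion).
    """
    if timeout_s is None:
        return str(FIXED_RETRY_AFTER_S)
    return str(max(1, math.ceil(timeout_s)))
```

Either variant yields a valid header, since `Retry-After` accepts a non-negative integer number of seconds. As noted above, neither is guaranteed to match actual load, so the derived value is a consistency convenience rather than a better estimate.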
…h multiple nodes in a cohort; we don't limit /get_ursulas and /bucket_sampling since those can be used for health checks. Use a global concurrent requests semaphore for absorbing some back-pressure from too many requests; if the semaphore can't be acquired after a timeout then return 429. Reduce Hendrix max threads since it can overwhelm the server.
Force-pushed from 6b9ee68 to 9b7f5cc.
Type of PR:
Required reviews:
What this does:
- Depends on changes in "Additional safety valves when using `NetworkRequestClient` for making threshold requests" (nucypher#3722).
- Use a global concurrent requests semaphore for absorbing some back-pressure from too many requests; if the semaphore can't be acquired after a timeout then return 429. Reduce Hendrix max threads since it can overwhelm the server.
- Configure max worker threads used per request, and allow overwriting via env variable.
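As a rough illustration of the last point, a per-request worker cap with an environment override might look like the sketch below. The variable name `EXAMPLE_MAX_WORKER_THREADS`, the default of 8, and both function names are hypothetical and are not taken from this PR or from nucypher#3722.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical names and values, chosen only for this example.
HYPOTHETICAL_ENV_VAR = "EXAMPLE_MAX_WORKER_THREADS"
ILLUSTRATIVE_DEFAULT_MAX_WORKERS = 8


def max_worker_threads(environ=os.environ):
    """Return the per-request worker cap, overridable via the environment."""
    raw = environ.get(HYPOTHETICAL_ENV_VAR)
    if raw is None:
        return ILLUSTRATIVE_DEFAULT_MAX_WORKERS
    value = int(raw)
    if value < 1:
        raise ValueError(f"{HYPOTHETICAL_ENV_VAR} must be at least 1, got {value}")
    return value


def fan_out(tasks, environ=os.environ):
    """Run zero-argument callables on a pool bounded by the configured cap."""
    with ThreadPoolExecutor(max_workers=max_worker_threads(environ)) as pool:
        # map() preserves input order regardless of completion order.
        return list(pool.map(lambda task: task(), tasks))
```

Validating the value at read time keeps a misconfigured deployment from failing later inside the pool, where the error would be harder to trace back to the environment.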
Issues fixed/closed:
Why it's needed:
Notes for reviewers: