cerebrium/scaling/scaling-apps.mdx
The **number of requests** currently waiting for processing in the queue indicates … See below for more information.

</Info>

As traffic decreases, instances enter a cooldown period at reduced concurrency. If reduced concurrency is maintained for the cooldown duration, instances scale down to optimize resource usage. This automatic cycle ensures apps remain responsive while managing costs effectively.
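The scale-up/scale-down cycle described above is configured per app. As a minimal sketch, assuming the `[cerebrium.scaling]` section of `cerebrium.toml` (values here are illustrative, not recommendations):

```toml
# Illustrative scaling settings; tune for your own traffic pattern.
[cerebrium.scaling]
min_replicas = 0   # allow scale-to-zero when idle
max_replicas = 5   # upper bound on concurrent instances
cooldown = 60      # seconds of reduced concurrency before scaling down
```

With `min_replicas = 0`, all instances terminate after the cooldown window, trading occasional cold starts for lower cost.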
## Scaling Configuration
The `max_replicas` parameter sets an upper limit on concurrent instances, controlling …

### Cooldown Period

The `cooldown` parameter specifies the time window (in seconds) that must pass at reduced concurrency before an instance scales down. This prevents premature scale-down during brief traffic dips that might be followed by more requests. A longer cooldown period helps handle bursty traffic patterns but increases instance running time and cost.
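For example, a sketch assuming the `[cerebrium.scaling]` section of `cerebrium.toml` (the value is illustrative):

```toml
[cerebrium.scaling]
cooldown = 120  # wait 120s at reduced concurrency before scaling an instance down
```

A higher value like this keeps instances warm through short lulls in bursty traffic, at the cost of longer billed idle time.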
### Replica Concurrency
Once this request has completed, the usual `cooldown` period will apply, and the …

## Evaluation Interval

<Warning>
Requires CLI version 2.1.5 or higher.
</Warning>

The `evaluation_interval` parameter controls the time window (in seconds) over which the autoscaler evaluates metrics before making scaling decisions. The default is 30 seconds, with a valid range of 6-300 seconds.
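As a sketch, assuming the `[cerebrium.scaling]` section of `cerebrium.toml` (30 is the documented default):

```toml
[cerebrium.scaling]
evaluation_interval = 30  # seconds of metrics considered per scaling decision (valid: 6-300)
```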
A shorter interval makes the autoscaler more responsive to traffic spikes but may …

## Load Balancing

<Warning>
Requires CLI version 2.1.5 or higher.
</Warning>
The `load_balancing` parameter controls how incoming requests are distributed across your replicas. When not specified, the system automatically selects the best algorithm based on your `replica_concurrency` setting.
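To pin the algorithm explicitly rather than rely on auto-selection, a sketch assuming the `[cerebrium.scaling]` section of `cerebrium.toml`:

```toml
[cerebrium.scaling]
replica_concurrency = 8
load_balancing = "round-robin"  # one of: round-robin, first-available, min-connections, random-choice-2
```

Left unset, the system would pick round-robin here, since `replica_concurrency` exceeds 3.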
| response_grace_period | integer | 3600 | 2.1.2+ | Grace period in seconds |
| cooldown | integer | 1800 | 2.1.2+ | Time window (seconds) that must pass at reduced concurrency before scaling down. Helps avoid cold starts from brief traffic dips. |
| scaling_target | integer | 100 | 2.1.2+ | Target value for scaling metric (percentage for utilization metrics, absolute value for requests_per_second) |
| evaluation_interval | integer | 30 | 2.1.5+ | Time window in seconds over which metrics are evaluated before scaling decisions (6-300s) |
| load_balancing | string | "" | 2.1.5+ | Algorithm for distributing traffic across replicas. Default: round-robin if replica_concurrency > 3, first-available otherwise. Options: round-robin, first-available, min-connections, random-choice-2 |
| roll_out_duration_seconds | integer | 0 | 2.1.2+ | Gradually send traffic to new revision after successful build. Max 600s. Keep at 0 during development. |
<Warning>
Setting min_replicas > 0 maintains warm instances for immediate response but