
Commit ae3ef9f

Merge pull request #259 from CerebriumAI/eli/cer-4656-add-evaluation-interval-to-the-clibackend
2 parents 74e168d + 5676ea1

6 files changed: +117, -17 lines

cerebrium/partner-services/deepgram.mdx

Lines changed: 1 addition & 1 deletion
@@ -315,7 +315,7 @@ Deepgram services support independent scaling configurations:
 - **min_replicas**: Minimum number of instances to maintain (0 for scale-to-zero)
 - **max_replicas**: Maximum number of instances that can be created during high load
 - **replica_concurrency**: Number of concurrent requests each instance can handle
-- **cooldown**: Time in seconds that an instance remains active after processing its last request
+- **cooldown**: Time window (in seconds) that must pass at reduced concurrency before scaling down
 
 Adjust these parameters based on expected traffic patterns and latency requirements.
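Taken together, these options live in the app's TOML config. A minimal sketch with illustrative values (the values mirror the migration example later in this diff; the exact section for a Deepgram partner service may differ):

```toml
[cerebrium.scaling]
min_replicas = 0        # scale to zero when idle
max_replicas = 2        # upper bound during high load
replica_concurrency = 1 # concurrent requests per instance
cooldown = 60           # seconds at reduced concurrency before scale-down
```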

cerebrium/partner-services/index.mdx

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ Partner Services support independent scaling configurations:
 - Use the `min_replicas` and `max_replicas` parameters to control the number of instances
 - The `replica_concurrency` parameter determines how many concurrent requests each instance can handle
-- Adjust the `cooldown` parameter to control how long instances remain active after processing requests
+- Adjust the `cooldown` parameter to control the time window that must pass at reduced concurrency before scaling down
 - Adjust the `hardware` section to control the instance type which affects performance and/or cost
 
 For more information on specific Partner Services, see:

cerebrium/partner-services/rime.mdx

Lines changed: 1 addition & 1 deletion
@@ -99,7 +99,7 @@ Rime services support independent scaling configurations:
 - **min_replicas**: Minimum instances to maintain (0 for scale-to-zero). Recommended: 1.
 - **max_replicas**: Maximum instances during high load.
 - **replica_concurrency**: Concurrent requests per instance. Recommended: 3.
-- **cooldown**: Seconds an instance remains active after last request. Recommended: 50.
+- **cooldown**: Time window (in seconds) that must pass at reduced concurrency before scaling down. Recommended: 50.
 - **compute**: Instance type. Recommended: `AMPERE_A10`.
 
 Adjust these parameters based on traffic patterns and latency requirements. Best would be to consult the Rime team
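As a sketch, the recommended values above would translate into config along these lines (hypothetical app; `compute` is assumed to sit in the `[cerebrium.hardware]` section described in `toml-reference.mdx` below, and `max_replicas` is illustrative):

```toml
[cerebrium.hardware]
compute = "AMPERE_A10"  # recommended instance type

[cerebrium.scaling]
min_replicas = 1        # recommended: keep one warm instance
max_replicas = 5        # illustrative ceiling for high load
replica_concurrency = 3 # recommended concurrent requests per instance
cooldown = 50           # recommended cooldown window in seconds
```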

cerebrium/scaling/scaling-apps.mdx

Lines changed: 98 additions & 2 deletions
@@ -16,7 +16,7 @@ The **number of requests** currently waiting for processing in the queue indicat
 See below for more information.
 </Info>
 
-As traffic decreases, instances enter a cooldown period after processing their last request. When no new requests arrive during cooldown, instances terminate to optimize resource usage. This automatic cycle ensures apps remain responsive while managing costs effectively.
+As traffic decreases, instances enter a cooldown period at reduced concurrency. If reduced concurrency is maintained for the cooldown duration, instances scale down to optimize resource usage. This automatic cycle ensures apps remain responsive while managing costs effectively.
 
 ## Scaling Configuration
@@ -40,7 +40,7 @@ The `max_replicas` parameter sets an upper limit on concurrent instances, contro
 ### Cooldown Period
 
-After processing a request, instances remain available for the duration specified by `cooldown`. Each new request resets this timer. A longer cooldown period helps handle bursty traffic patterns but increases instance running time and cost.
+The `cooldown` parameter specifies the time window (in seconds) that must pass at reduced concurrency before an instance scales down. This prevents premature scale-down during brief traffic dips that might be followed by more requests. A longer cooldown period helps handle bursty traffic patterns but increases instance running time and cost.
 
 ### Replica Concurrency
@@ -184,3 +184,99 @@ Since the config has specified `100` as a target for `concurrency_utilization` a
 the autoscaler will suggest a value of 1 replica for scale out. Since however, we have `scale_buffer=3`, the application will actually scale one more replica to **(1+3)=4**.
 In other words, the scale buffer will simply add a static amount of replicas to the number of replicas the autoscaler suggests using the scale target.
 Once this request has completed, the usual `cooldown` period will apply, and the app replica count will scale down back to the baseline of **1 replica**.
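For illustration, the worked example above corresponds to a config roughly like this (a sketch; the reference table below spells the option `scaling_buffer`, which the prose abbreviates as `scale_buffer`):

```toml
[cerebrium.scaling]
min_replicas = 1                           # baseline of 1 replica
scaling_metric = "concurrency_utilization" # scale on concurrency usage
scaling_target = 100                       # target 100% utilization
scaling_buffer = 3                         # static buffer added to the autoscaler's suggestion
```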
+
+## Evaluation Interval
+
+<Warning>Requires CLI version 2.1.5 or higher.</Warning>
+
+The `evaluation_interval` parameter controls the time window (in seconds) over which the autoscaler evaluates metrics before making scaling decisions. The default is 30 seconds, with a valid range of 6-300 seconds.
+
+```toml
+[cerebrium.scaling]
+evaluation_interval = 30 # Evaluate metrics over 30-second windows
+```
+
+A shorter interval makes the autoscaler more responsive to traffic spikes but may cause more frequent scaling events. A longer interval smooths out transient spikes but may delay scaling responses.
+
+<Info>
+  For bursty workloads, a shorter `evaluation_interval` (e.g., 10-15 seconds)
+  helps the system respond quickly to demand. For steady workloads, a longer
+  interval (e.g., 60 seconds) reduces unnecessary scaling churn.
+</Info>
+
+## Load Balancing
+
+<Warning>Requires CLI version 2.1.5 or higher.</Warning>
+
+The `load_balancing` parameter controls how incoming requests are distributed across your replicas. When not specified, the system automatically selects the best algorithm based on your `replica_concurrency` setting.
+
+```toml
+[cerebrium.scaling]
+load_balancing = "min-connections" # Explicitly set load balancing algorithm
+```
+
+**Default behavior**: When `load_balancing` is not set, the system uses `first-available` for `replica_concurrency <= 3` (typical for GPU workloads) and `round-robin` for higher concurrency.
+
+### Available Algorithms
+
+#### round-robin
+
+Cycles through replicas starting from the last successful target. Each replica's concurrency limit is respected - if a replica is at capacity, the algorithm proceeds to the next one in rotation.
+
+| Characteristic       | Value                                                                   |
+| -------------------- | ----------------------------------------------------------------------- |
+| Selection complexity | O(1) typical, O(N) worst case when scanning for available capacity      |
+| Latency profile      | Consistent p50, good p90 under uniform load                             |
+| Strategy             | Stateful index rotation with mutex synchronization; skips full replicas |
+
+**Best for**: Workloads with predictable request times where you want even distribution across replicas over time.
+
+#### first-available
+
+Scans replicas from the start of the list and selects the first one with available capacity.
+
+| Characteristic       | Value                                                                         |
+| -------------------- | ----------------------------------------------------------------------------- |
+| Selection complexity | O(1) typical, O(N) worst case                                                 |
+| Latency profile      | Optimal p50 when load is light, may degrade p90 under high load               |
+| Strategy             | Linear scan from list start; returns first replica that accepts via Reserve() |
+
+**Best for**: GPU workloads with low concurrency (`replica_concurrency <= 3`). Maximizes utilization of warm replicas before spreading load, reducing cold starts and keeping models in VRAM.
+
+**Tradeoff**: Earlier replicas in the list handle more traffic. This is desirable for GPU workloads but may cause uneven distribution for CPU workloads.
+
+#### min-connections
+
+Linear scan to find the replica with the fewest in-flight requests, then attempts to reserve it. If that replica cannot accept (at capacity), falls back to trying other replicas in iteration order.
+
+| Characteristic       | Value                                                              |
+| -------------------- | ------------------------------------------------------------------ |
+| Selection complexity | Θ(N) - always scans all replicas to find minimum                   |
+| Latency profile      | Best p90/p99 tail latency                                          |
+| Strategy             | Single pass to find minimum in-flight; fallback in iteration order |
+
+**Best for**: Workloads with variable request times (e.g., LLM inference where output length varies). Routes new requests to the least busy replica, preventing fast requests from queuing behind slow ones.
+
+#### random-choice-2
+
+Implements the "Power of Two Choices" algorithm: randomly samples two replicas and routes to the one with lower weight (based on active request tracking). Ties are broken randomly.
+
+| Characteristic       | Value                                                       |
+| -------------------- | ----------------------------------------------------------- |
+| Selection complexity | Θ(1) - constant time regardless of replica count            |
+| Latency profile      | Good balance of p50 and p90                                 |
+| Strategy             | Sample 2 random replicas, compare weights, pick lighter one |
+
+**Best for**: High-throughput scenarios with many replicas where selection overhead matters. Research shows this achieves exponentially better load distribution than pure random selection.
+
+**Note**: Uses weight-based tracking rather than reservation-based concurrency limiting, making it suitable for unlimited concurrency scenarios.
+
+### Choosing an Algorithm
+
+| Scenario                                 | Recommended                 | Reason                                             |
+| ---------------------------------------- | --------------------------- | -------------------------------------------------- |
+| GPU inference, `replica_concurrency=1`   | `first-available` (default) | Maximizes GPU utilization, keeps models warm       |
+| LLMs with variable output lengths        | `min-connections`           | Prevents head-of-line blocking, best tail latency  |
+| High-throughput, many replicas           | `random-choice-2`           | Θ(1) selection with near-optimal distribution      |
+| Uniform request times, even distribution | `round-robin`               | Predictable rotation, no hot spots over time       |
+| Latency-sensitive with variable load     | `min-connections`           | Minimizes p90/p99 by routing to least busy replica |
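Applying this table, an LLM app with variable output lengths might override the default explicitly (a sketch with illustrative values):

```toml
[cerebrium.scaling]
replica_concurrency = 10           # > 3, so the default would be round-robin
load_balancing = "min-connections" # route to the least busy replica instead
evaluation_interval = 15           # shorter window to react to bursty traffic
```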

migrations/mystic.mdx

Lines changed: 1 addition & 1 deletion
@@ -78,7 +78,7 @@ gpu_count = 1 # Number of GPUs
 [cerebrium.scaling]
 min_replicas = 0 # Save costs when inactive and scale down your app
 max_replicas = 2 # Handle increased traffic and scale up where necessary
-cooldown = 60 # Time to wait before scaling down an idle instance
+cooldown = 60 # Time window at reduced concurrency before scaling down
 replica_concurrency = 1 # The number of requests a single container can support
 
 [cerebrium.dependencies.pip]

toml-reference/toml-reference.mdx

Lines changed: 15 additions & 11 deletions
@@ -128,17 +128,19 @@ The `[cerebrium.hardware]` section defines compute resources.
 
 The `[cerebrium.scaling]` section controls auto-scaling behavior.
 
-| Option                    | Type    | Default                   | Description                                                                                                        |
-| ------------------------- | ------- | ------------------------- | ------------------------------------------------------------------------------------------------------------------ |
-| min_replicas              | integer | 0                         | Minimum running instances                                                                                          |
-| max_replicas              | integer | 2                         | Maximum running instances                                                                                          |
-| replica_concurrency       | integer | 10                        | Concurrent requests per replica                                                                                    |
-| response_grace_period     | integer | 3600                      | Grace period in seconds                                                                                            |
-| cooldown                  | integer | 1800                      | Time to wait before scaling down an idle container                                                                 |
-| scaling_metric            | string  | "concurrency_utilization" | Metric for scaling decisions (concurrency_utilization, requests_per_second, cpu_utilization, memory_utilization)   |
-| scaling_target            | integer | 100                       | Target value for scaling metric (percentage for utilization metrics, absolute value for requests_per_second)       |
-| scaling_buffer            | integer | optional                  | Additional replica capacity above what scaling metric suggests                                                     |
-| roll_out_duration_seconds | integer | 0                         | Gradually send traffic to new revision after successful build. Max 600s. Keep at 0 during development.             |
+| Option                    | Type    | Default                   | CLI Requirement | Description                                                                                                                                                                                              |
+| ------------------------- | ------- | ------------------------- | --------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| min_replicas              | integer | 0                         | 2.1.2+          | Minimum running instances                                                                                                                                                                                  |
+| max_replicas              | integer | 2                         | 2.1.2+          | Maximum running instances                                                                                                                                                                                  |
+| replica_concurrency       | integer | 10                        | 2.1.2+          | Concurrent requests per replica                                                                                                                                                                            |
+| response_grace_period     | integer | 3600                      | 2.1.2+          | Grace period in seconds                                                                                                                                                                                    |
+| cooldown                  | integer | 1800                      | 2.1.2+          | Time window (seconds) that must pass at reduced concurrency before scaling down. Helps avoid cold starts from brief traffic dips.                                                                          |
+| scaling_metric            | string  | "concurrency_utilization" | 2.1.2+          | Metric for scaling decisions (concurrency_utilization, requests_per_second, cpu_utilization, memory_utilization)                                                                                           |
+| scaling_target            | integer | 100                       | 2.1.2+          | Target value for scaling metric (percentage for utilization metrics, absolute value for requests_per_second)                                                                                               |
+| scaling_buffer            | integer | optional                  | 2.1.2+          | Additional replica capacity above what scaling metric suggests                                                                                                                                             |
+| evaluation_interval       | integer | 30                        | 2.1.5+          | Time window in seconds over which metrics are evaluated before scaling decisions (6-300s)                                                                                                                  |
+| load_balancing            | string  | ""                        | 2.1.5+          | Algorithm for distributing traffic across replicas. Default: round-robin if replica_concurrency > 3, first-available otherwise. Options: round-robin, first-available, min-connections, random-choice-2    |
+| roll_out_duration_seconds | integer | 0                         | 2.1.2+          | Gradually send traffic to new revision after successful build. Max 600s. Keep at 0 during development.                                                                                                     |
 
 <Warning>
 Setting min_replicas > 0 maintains warm instances for immediate response but
@@ -241,6 +243,8 @@ response_grace_period = 3600
 cooldown = 1800
 scaling_metric = "concurrency_utilization"
 scaling_target = 100
+evaluation_interval = 30
+# load_balancing = "" # Auto-selects based on replica_concurrency
 roll_out_duration_seconds = 0
 
 [cerebrium.dependencies.pip]
