
Commit 9b0f7d7

new cooldown descriptions
1 parent d690b9e commit 9b0f7d7

6 files changed (+27 −19 lines)


cerebrium/partner-services/deepgram.mdx

Lines changed: 1 addition & 1 deletion
@@ -315,7 +315,7 @@ Deepgram services support independent scaling configurations:
 - **min_replicas**: Minimum number of instances to maintain (0 for scale-to-zero)
 - **max_replicas**: Maximum number of instances that can be created during high load
 - **replica_concurrency**: Number of concurrent requests each instance can handle
-- **cooldown**: Time in seconds that an instance remains active after processing its last request
+- **cooldown**: Time window (in seconds) that must pass at reduced concurrency before scaling down
 
 Adjust these parameters based on expected traffic patterns and latency requirements.

cerebrium/partner-services/index.mdx

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ Partner Services support independent scaling configurations:
 
 - Use the `min_replicas` and `max_replicas` parameters to control the number of instances
 - The `replica_concurrency` parameter determines how many concurrent requests each instance can handle
-- Adjust the `cooldown` parameter to control how long instances remain active after processing requests
+- Adjust the `cooldown` parameter to control the time window that must pass at reduced concurrency before scaling down
 - Adjust the `hardware` section to control the instance type which affects performance and/or cost
 
 For more information on specific Partner Services, see:

cerebrium/partner-services/rime.mdx

Lines changed: 1 addition & 1 deletion
@@ -99,7 +99,7 @@ Rime services support independent scaling configurations:
 - **min_replicas**: Minimum instances to maintain (0 for scale-to-zero). Recommended: 1.
 - **max_replicas**: Maximum instances during high load.
 - **replica_concurrency**: Concurrent requests per instance. Recommended: 3.
-- **cooldown**: Seconds an instance remains active after last request. Recommended: 50.
+- **cooldown**: Time window (in seconds) that must pass at reduced concurrency before scaling down. Recommended: 50.
 - **compute**: Instance type. Recommended: `AMPERE_A10`.
 
 Adjust these parameters based on traffic patterns and latency requirements. Best would be to consult the Rime team
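
The recommended Rime values listed above could be written out roughly as follows. This is a sketch only: the `[cerebrium.scaling]` and `[cerebrium.hardware]` section names are assumed from the other files touched in this commit, and `max_replicas` is an illustrative value.

```toml
[cerebrium.scaling]
min_replicas = 1        # recommended: keep one warm instance
max_replicas = 2        # illustrative; size for expected peak load
replica_concurrency = 3 # recommended concurrent requests per instance
cooldown = 50           # recommended window (seconds) at reduced concurrency before scaling down

[cerebrium.hardware]
compute = "AMPERE_A10"  # recommended instance type
```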

cerebrium/scaling/scaling-apps.mdx

Lines changed: 10 additions & 2 deletions
@@ -16,7 +16,7 @@ The **number of requests** currently waiting for processing in the queue indicat
 See below for more information.
 </Info>
 
-As traffic decreases, instances enter a cooldown period after processing their last request. When no new requests arrive during cooldown, instances terminate to optimize resource usage. This automatic cycle ensures apps remain responsive while managing costs effectively.
+As traffic decreases, instances enter a cooldown period at reduced concurrency. If reduced concurrency is maintained for the cooldown duration, instances scale down to optimize resource usage. This automatic cycle ensures apps remain responsive while managing costs effectively.
 
 ## Scaling Configuration
 
@@ -40,7 +40,7 @@ The `max_replicas` parameter sets an upper limit on concurrent instances, contro
 
 ### Cooldown Period
 
-After processing a request, instances remain available for the duration specified by `cooldown`. Each new request resets this timer. A longer cooldown period helps handle bursty traffic patterns but increases instance running time and cost.
+The `cooldown` parameter specifies the time window (in seconds) that must pass at reduced concurrency before an instance scales down. This prevents premature scale-down during brief traffic dips that might be followed by more requests. A longer cooldown period helps handle bursty traffic patterns but increases instance running time and cost.
 
 ### Replica Concurrency
 
@@ -187,6 +187,10 @@ Once this request has completed, the usual `cooldown` period will apply, and the
 
 ## Evaluation Interval
 
+<Warning>
+Requires CLI version 2.1.5 or higher.
+</Warning>
+
 The `evaluation_interval` parameter controls the time window (in seconds) over which the autoscaler evaluates metrics before making scaling decisions. The default is 30 seconds, with a valid range of 6-300 seconds.
 
 ```toml
@@ -204,6 +208,10 @@ A shorter interval makes the autoscaler more responsive to traffic spikes but ma
 
 ## Load Balancing
 
+<Warning>
+Requires CLI version 2.1.5 or higher.
+</Warning>
+
 The `load_balancing` parameter controls how incoming requests are distributed across your replicas. When not specified, the system automatically selects the best algorithm based on your `replica_concurrency` setting.
 
 ```toml
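
Taken together, the scaling settings this file documents might be combined in a `cerebrium.toml` roughly as follows. Values are illustrative, using the defaults from the toml-reference table; `evaluation_interval` and `load_balancing` require CLI 2.1.5+.

```toml
[cerebrium.scaling]
min_replicas = 0               # scale to zero when idle
max_replicas = 2
replica_concurrency = 10
cooldown = 1800                # seconds at reduced concurrency before scaling down
evaluation_interval = 30       # metric window in seconds (valid range 6-300); CLI 2.1.5+
load_balancing = "round-robin" # or first-available, min-connections, random-choice-2; CLI 2.1.5+
```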

migrations/mystic.mdx

Lines changed: 1 addition & 1 deletion
@@ -78,7 +78,7 @@ gpu_count = 1 # Number of GPUs
 [cerebrium.scaling]
 min_replicas = 0 # Save costs when inactive and scale down your app
 max_replicas = 2 # Handle increased traffic and scale up where necessary
-cooldown = 60 # Time to wait before scaling down an idle instance
+cooldown = 60 # Time window at reduced concurrency before scaling down
 replica_concurrency = 1 # The number of requests a single container can support
 
 [cerebrium.dependencies.pip]

toml-reference/toml-reference.mdx

Lines changed: 13 additions & 13 deletions
@@ -128,19 +128,19 @@ The `[cerebrium.hardware]` section defines compute resources.
 
 The `[cerebrium.scaling]` section controls auto-scaling behavior.
 
-| Option | Type | Default | Description |
-| ------ | ---- | ------- | ----------- |
-| min_replicas | integer | 0 | Minimum running instances |
-| max_replicas | integer | 2 | Maximum running instances |
-| replica_concurrency | integer | 10 | Concurrent requests per replica |
-| response_grace_period | integer | 3600 | Grace period in seconds |
-| cooldown | integer | 1800 | Time to wait before scaling down an idle container |
-| scaling_metric | string | "concurrency_utilization" | Metric for scaling decisions (concurrency_utilization, requests_per_second, cpu_utilization, memory_utilization) |
-| scaling_target | integer | 100 | Target value for scaling metric (percentage for utilization metrics, absolute value for requests_per_second) |
-| scaling_buffer | integer | optional | Additional replica capacity above what scaling metric suggests |
-| evaluation_interval | integer | 30 | Time window in seconds over which metrics are evaluated before scaling decisions (6-300s) |
-| load_balancing | string | "" | Algorithm for distributing traffic across replicas. Default: round-robin if replica_concurrency > 3, first-available otherwise. Options: round-robin, first-available, min-connections, random-choice-2 |
-| roll_out_duration_seconds | integer | 0 | Gradually send traffic to new revision after successful build. Max 600s. Keep at 0 during development. |
+| Option | Type | Default | CLI Requirement | Description |
+| ------ | ---- | ------- | --------------- | ----------- |
+| min_replicas | integer | 0 | 2.1.2+ | Minimum running instances |
+| max_replicas | integer | 2 | 2.1.2+ | Maximum running instances |
+| replica_concurrency | integer | 10 | 2.1.2+ | Concurrent requests per replica |
+| response_grace_period | integer | 3600 | 2.1.2+ | Grace period in seconds |
+| cooldown | integer | 1800 | 2.1.2+ | Time window (seconds) that must pass at reduced concurrency before scaling down. Helps avoid cold starts from brief traffic dips. |
+| scaling_metric | string | "concurrency_utilization" | 2.1.2+ | Metric for scaling decisions (concurrency_utilization, requests_per_second, cpu_utilization, memory_utilization) |
+| scaling_target | integer | 100 | 2.1.2+ | Target value for scaling metric (percentage for utilization metrics, absolute value for requests_per_second) |
+| scaling_buffer | integer | optional | 2.1.2+ | Additional replica capacity above what scaling metric suggests |
+| evaluation_interval | integer | 30 | 2.1.5+ | Time window in seconds over which metrics are evaluated before scaling decisions (6-300s) |
+| load_balancing | string | "" | 2.1.5+ | Algorithm for distributing traffic across replicas. Default: round-robin if replica_concurrency > 3, first-available otherwise. Options: round-robin, first-available, min-connections, random-choice-2 |
+| roll_out_duration_seconds | integer | 0 | 2.1.2+ | Gradually send traffic to new revision after successful build. Max 600s. Keep at 0 during development. |
 
 <Warning>
 Setting min_replicas > 0 maintains warm instances for immediate response but
