Scaler provides an experimental auto-scaling feature that allows the system to dynamically adjust the number of workers based on workload. Scaling policies determine when to add or remove workers, while Worker Managers handle the actual provisioning of resources.
The scaling system consists of two main components:
- Scaling Policy: A policy that monitors task queues and worker availability to make scaling suggestions.
- Worker Manager: A component that handles the actual creation and destruction of worker groups (e.g., starting containers, launching processes).
The Scaling Policy runs within the Scheduler and communicates with Worker Managers via Cap'n Proto messages. Worker Managers connect to the Scheduler and receive scaling commands directly.
The scaling policy is configured via the `policy_content` setting in the scheduler configuration:
```bash
scaler_scheduler tcp://127.0.0.1:8516 --policy-content "allocate=capability; scaling=vanilla"
```

Or in a TOML configuration file:

```toml
[scheduler]
policy_content = "allocate=capability; scaling=capability"
```

Scaler provides several built-in scaling policies:
| Policy | Description |
|---|---|
| `no` | No automatic scaling. Workers are managed manually or statically provisioned. |
| `vanilla` | Basic task-to-worker ratio scaling. Scales up when the task ratio exceeds a threshold, scales down when idle. |
| `capability` | Capability-aware scaling. Scales worker groups based on task-required capabilities (e.g., GPU, memory). |
| `fixed_elastic` | Hybrid scaling using primary and secondary worker managers with configurable limits. |
| `waterfall_v1` | Priority-based cascading across multiple worker managers. Higher-priority managers fill first; overflow goes to lower-priority. |
The `no` policy is the simplest: it performs no automatic scaling. Use it when:
- Workers are statically provisioned
- External orchestration handles scaling (e.g., Kubernetes HPA)
- You want full manual control over worker count
```bash
scaler_scheduler tcp://127.0.0.1:8516 --policy-content "allocate=even_load; scaling=no"
```

The `vanilla` scaling policy uses a simple task-to-worker ratio to make scaling suggestions:
- Scale up: when `tasks / workers > upper_task_ratio` (default: 10)
- Scale down: when `tasks / workers < lower_task_ratio` (default: 1)
This policy is straightforward and works well for homogeneous workloads where all workers can handle all tasks.
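As a rough sketch of this rule (illustrative only; the function name and signature are hypothetical, not Scaler's implementation):

```python
def vanilla_suggestion(num_tasks: int, num_workers: int,
                       upper_task_ratio: float = 10.0,
                       lower_task_ratio: float = 1.0) -> str:
    """Sketch of the vanilla task-to-worker ratio rule."""
    if num_workers == 0:
        # With no workers at all, any pending task suggests scaling up.
        return "scale_up" if num_tasks > 0 else "no_change"
    ratio = num_tasks / num_workers
    if ratio > upper_task_ratio:
        return "scale_up"
    if ratio < lower_task_ratio:
        return "scale_down"
    return "no_change"
```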
```bash
scaler_scheduler tcp://127.0.0.1:8516 \
    --policy-content "allocate=even_load; scaling=vanilla"
```

The `capability` scaling policy is designed for heterogeneous workloads where tasks require specific capabilities (e.g., GPU, high memory, specialized hardware).
Key Features:
- Groups tasks by their required capability sets
- Groups workers by their provided capability sets
- Scales worker groups per capability set independently
- Ensures tasks are matched to workers that can handle them
- Prevents scaling down the last worker group capable of handling pending tasks
- Prevents thrashing by checking if scale-down would immediately trigger scale-up
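The last two guards can be pictured roughly as follows (a hedged sketch with hypothetical names; the real checks live inside the policy):

```python
def safe_to_scale_down(num_pending_tasks: int, num_capable_workers: int,
                       workers_per_group: int,
                       upper_task_ratio: float = 5.0) -> bool:
    """Illustrative sketch of the scale-down guards described above."""
    remaining = num_capable_workers - workers_per_group
    if num_pending_tasks > 0 and remaining <= 0:
        # Never remove the last group capable of handling pending tasks.
        return False
    if remaining > 0 and num_pending_tasks / remaining > upper_task_ratio:
        # Avoid thrashing: removing a group would immediately trigger scale-up.
        return False
    return True
```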
How It Works:
- Task Grouping: Tasks are grouped by their required capability keys (e.g., `{"gpu"}`, `{"gpu", "high_memory"}`).
- Worker Matching: Workers are grouped by their provided capabilities. A worker can handle a task if the task's required capabilities are a subset of the worker's capabilities (see the sketch after this list).
- Per-Capability Scaling: The policy applies the task-to-worker ratio logic independently for each capability set:
  - Scale up: when `tasks / capable_workers > upper_task_ratio` (default: 5)
  - Scale down: when `tasks / capable_workers < lower_task_ratio` (default: 0.5)
- Capability Request: When scaling up, the policy requests worker groups with specific capabilities from the worker manager.
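The grouping and subset-matching steps can be illustrated with a small sketch (a hypothetical helper that takes plain dicts of capability-key sets; not Scaler's internal representation):

```python
from collections import defaultdict

def group_and_match(task_requirements: dict, worker_capabilities: dict):
    """task_requirements: task_id -> set of required capability keys
    worker_capabilities: worker_id -> set of provided capability keys"""
    # 1. Group tasks by their exact required capability set.
    tasks_by_caps = defaultdict(list)
    for task_id, required in task_requirements.items():
        tasks_by_caps[frozenset(required)].append(task_id)

    # 2. For each capability set, find workers whose capabilities are a superset.
    capable_workers = {
        caps: [w for w, provided in worker_capabilities.items() if caps <= provided]
        for caps in tasks_by_caps
    }
    return tasks_by_caps, capable_workers
```

For example, a worker providing `{"gpu", "high_memory"}` can serve a task that requires only `{"gpu"}`, since the required set is a subset of the provided set; the ratio check then runs separately for each task group.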
Configuration:
```bash
scaler_scheduler tcp://127.0.0.1:8516 \
    --policy-content "allocate=capability; scaling=capability"
```

Example Scenario:
Consider a workload with both CPU-only and GPU tasks:
```python
from scaler import Client

with Client(address="tcp://127.0.0.1:8516") as client:
    # Submit CPU tasks (no special capabilities required)
    cpu_futures = [
        client.submit_verbose(cpu_task, args=(i,), capabilities={})
        for i in range(100)
    ]

    # Submit GPU tasks (require GPU capability)
    gpu_futures = [
        client.submit_verbose(gpu_task, args=(i,), capabilities={"gpu": 1})
        for i in range(50)
    ]
```

With the capability scaling policy:
- If no GPU workers exist, the policy requests a worker group with `{"gpu": 1}` from the worker manager.
- CPU and GPU worker groups are scaled independently based on their respective task queues.
- Idle GPU workers can be shut down without affecting CPU task processing.
The fixed elastic scaling policy supports hybrid scaling with multiple worker managers:
- Primary Manager: A single worker group (identified by `max_worker_groups == 1`) that starts once and never shuts down
- Secondary Manager: Elastic capacity (`max_worker_groups > 1`) that scales based on demand
This is useful for scenarios where you have a fixed pool of dedicated resources but want to burst to additional resources during peak demand.
```bash
scaler_scheduler tcp://127.0.0.1:8516 \
    --policy-content "allocate=even_load; scaling=fixed_elastic"
```

Behavior:
- The primary manager's worker group is started once and never shut down
- Secondary manager groups are created when demand exceeds primary capacity
- When scaling down, only secondary manager groups are shut down
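A minimal sketch of this scale-down rule, assuming hypothetical group and manager objects (not the actual policy code):

```python
def pick_group_to_shut_down(worker_groups):
    """worker_groups: list of (group_id, manager) pairs, where
    manager.max_worker_groups == 1 marks the primary manager."""
    for group_id, manager in worker_groups:
        if manager.max_worker_groups > 1:  # secondary / elastic capacity
            return group_id
    return None  # only the primary group remains; do not scale down
```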
The waterfall scaling policy cascades worker scaling across prioritized worker managers. Higher-priority managers fill first; when they reach capacity, overflow goes to the next priority tier. When scaling down, the lowest-priority managers drain first.
This is useful for hybrid deployments where you want to prefer cheaper or lower-latency resources (e.g., local bare-metal) and only burst to more expensive resources (e.g., cloud) when needed.
Configuration:
The waterfall policy uses `policy_engine_type = "waterfall_v1"` and a newline-separated rule format for `policy_content`. Each rule is a comma-separated line with three fields: `priority`, `manager_id_prefix`, `max_workers`. Lines starting with `#` are comments.
```toml
[scheduler]
policy_engine_type = "waterfall_v1"
policy_content = """
# priority, manager_id_prefix, max_workers
# Use local workers first (cheap, low latency)
1, NAT, 8
# Overflow to ECS when local capacity is exhausted
2, ECS, 50
"""
```

Rules reference worker manager ID prefixes. At runtime, each worker manager generates a full ID like `NAT|<pid>`; the prefix `NAT` matches any manager whose ID starts with `NAT`. Multiple managers can share the same prefix and are governed by the same rule.
Scaling policies run within the scheduler process and exchange Cap'n Proto messages with worker managers over the same connection that worker managers use to talk to the scheduler. The protocol uses the following message types:
WorkerManagerHeartbeat (Manager -> Scheduler):
Worker managers periodically send heartbeats to the scheduler containing their capacity information:
- `max_worker_groups`: Maximum number of worker groups this manager can manage
- `workers_per_group`: Number of workers in each group
- `capabilities`: Default capabilities for workers from this manager
WorkerManagerCommand (Scheduler -> Manager):
The scheduler sends commands to worker managers:
- `StartWorkerGroup`: Request to start a new worker group
  - `worker_group_id`: Empty for new groups (manager assigns ID)
  - `capabilities`: Required capabilities for the worker group
- `ShutdownWorkerGroup`: Request to shut down an existing worker group
  - `worker_group_id`: ID of the group to shut down
WorkerManagerCommandResponse (Manager -> Scheduler):
Worker managers respond to commands with status and details:
- `worker_group_id`: ID of the affected worker group
- `command`: The command type this response is for
- `status`: Result status (`Success`, `WorkerGroupTooMuch`, `WorkerGroupIDNotFound`)
- `worker_ids`: List of worker IDs in the group (for start commands)
- `capabilities`: Actual capabilities of the started workers
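To make the exchange concrete, the messages can be pictured as the following plain-Python stand-ins (an illustrative sketch inferred from the descriptions above; the authoritative definitions are Scaler's Cap'n Proto schema):

```python
from dataclasses import dataclass, field
from enum import Enum, auto

@dataclass
class WorkerManagerHeartbeat:           # Manager -> Scheduler
    max_worker_groups: int
    workers_per_group: int
    capabilities: dict

@dataclass
class StartWorkerGroup:                 # Scheduler -> Manager
    worker_group_id: str                # empty for new groups; the manager assigns an ID
    capabilities: dict

@dataclass
class ShutdownWorkerGroup:              # Scheduler -> Manager
    worker_group_id: str

class Status(Enum):
    Success = auto()
    WorkerGroupTooMuch = auto()
    WorkerGroupIDNotFound = auto()

@dataclass
class WorkerManagerCommandResponse:     # Manager -> Scheduler
    worker_group_id: str
    command: str                        # which command this response answers
    status: Status
    worker_ids: list = field(default_factory=list)    # for start commands
    capabilities: dict = field(default_factory=dict)  # actual capabilities of started workers
```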
Here is an example of a worker manager using the ECS (Amazon Elastic Container Service) integration:
.. literalinclude:: ../../../src/scaler/worker_manager_adapter/aws_raw/ecs.py
   :language: python
- Match allocation and scaling policies: Use `allocate=capability` with `scaling=capability` to ensure tasks are routed to workers with the right capabilities.
- Set appropriate thresholds: Adjust task ratio thresholds based on your task duration and scaling latency:
  - Short tasks: Use a higher `upper_task_ratio` to avoid thrashing
  - Long startup time: Use a lower `lower_task_ratio` to avoid premature scale-down
- Monitor scaling events: Use Scaler's monitoring tools (`scaler_top`) to observe scaling behavior and tune policies.
- Worker Manager Placement: Run worker managers on machines that can provision the required resources (e.g., run the ECS worker manager where it has AWS credentials, run the native worker manager on the target machine).