Merged
24 changes: 12 additions & 12 deletions README.md
Original file line number Diff line number Diff line change
@@ -247,9 +247,9 @@ The following table maps each Scaler command to its corresponding section name i
| `scaler_object_storage_server` | `[object_storage_server]` |
| `scaler_ui` | `[webui]` |
| `scaler_top` | `[top]` |
-| `scaler_worker_adapter_native` | `[native_worker_adapter]` |
-| `scaler_worker_adapter_fixed_native` | `[fixed_native_worker_adapter]` |
-| `scaler_worker_adapter_symphony` | `[symphony_worker_adapter]` |
+| `scaler_worker_manager_baremetal_native` | `[native_worker_manager]` |
+| `scaler_worker_manager_baremetal_fixed_native` | `[fixed_native_worker_manager]` |
+| `scaler_worker_manager_symphony` | `[symphony_worker_manager]` |

### Practical Scenarios & Examples

@@ -381,7 +381,7 @@ might be added in the future.
A Scaler scheduler can interface with IBM Spectrum Symphony to provide distributed computing across Symphony clusters.

```bash
-$ scaler_worker_adapter_symphony tcp://127.0.0.1:2345 --service-name ScalerService --base-concurrency 4
+$ scaler_worker_manager_symphony tcp://127.0.0.1:2345 --service-name ScalerService --base-concurrency 4
```

This will start a Scaler worker that connects to the Scaler scheduler at `tcp://127.0.0.1:2345` and uses the Symphony
@@ -466,25 +466,25 @@ where `deepest_nesting_level` is the deepest nesting level a task has in your wo
workload that has
a base task that calls a nested task that calls another nested task, then the deepest nesting level is 2.
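For illustration, the nesting rule above can be sketched in plain Python (the `deepest_nesting_level` helper and the dict model here are illustrative, not part of the Scaler API):

```python
def deepest_nesting_level(task: dict) -> int:
    """Depth of nested task submissions; a task with no subtasks is level 0."""
    subtasks = task.get("subtasks", [])
    if not subtasks:
        return 0
    return 1 + max(deepest_nesting_level(t) for t in subtasks)

# A base task that calls a nested task that calls another nested task:
workload = {"subtasks": [{"subtasks": [{"subtasks": []}]}]}
print(deepest_nesting_level(workload))  # 2
```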

-## Worker Adapter usage
+## Worker Manager usage

> **Note**: This feature is experimental and may change in future releases.

-Scaler provides a Worker Adapter webhook interface to integrate with other job schedulers or resource managers. The
-Worker Adapter allows external systems to request the creation and termination of Scaler workers dynamically.
+Scaler provides a Worker Manager webhook interface to integrate with other job schedulers or resource managers. The
+Worker Manager allows external systems to request the creation and termination of Scaler workers dynamically.

-Please check the OpenGRIS standard for more details on the Worker Adapter
+Please check the OpenGRIS standard for more details on the Worker Manager
specification [here](https://github.com/finos/opengris).

-### Starting the Native Worker Adapter
+### Starting the Native Worker Manager

-Start a Native Worker Adapter and connect it to the scheduler:
+Start a Native Worker Manager and connect it to the scheduler:

```bash
-$ scaler_worker_adapter_native tcp://127.0.0.1:2345
+$ scaler_worker_manager_baremetal_native tcp://127.0.0.1:2345
```

-To check that the Worker Adapter is working, you can bring up `scaler_top` to see workers spawning and terminating as
+To check that the Worker Manager is working, you can bring up `scaler_top` to see workers spawning and terminating as
the task load changes.

## Performance
10 changes: 5 additions & 5 deletions docs/source/index.rst
@@ -27,11 +27,11 @@ Content
tutorials/quickstart
tutorials/features
tutorials/scaling
-tutorials/worker_adapters/index
-tutorials/worker_adapters/native
-tutorials/worker_adapters/fixed_native
-tutorials/worker_adapters/aws_hpc/index
-tutorials/worker_adapters/common_parameters
+tutorials/worker_manager_adapter/index
+tutorials/worker_manager_adapter/native
+tutorials/worker_manager_adapter/fixed_native
+tutorials/worker_manager_adapter/aws_hpc/index
+tutorials/worker_manager_adapter/common_parameters
tutorials/compatibility/ray
tutorials/configuration
tutorials/examples
24 changes: 12 additions & 12 deletions docs/source/tutorials/configuration.rst
@@ -140,12 +140,12 @@ Or through the programmatic API:
death_timeout_seconds=300,
)

-Worker Adapter Settings
+Worker Manager Settings
-----------------------

-Worker adapters share many common configuration settings for networking, worker behavior, and logging.
+Worker managers share many common configuration settings for networking, worker behavior, and logging.

-For a full list of these settings, see the :doc:`Worker Adapter Common Parameters <worker_adapters/common_parameters>` documentation.
+For a full list of these settings, see the :doc:`Worker Manager Common Parameters <worker_manager_adapter/common_parameters>` documentation.

Configuring with TOML Files
---------------------------
@@ -193,16 +193,16 @@ The following table maps each Scaler command to its corresponding section name i
- ``[webui]``
* - ``scaler_top``
- ``[top]``
-* - ``scaler_worker_adapter_native``
-  - ``[native_worker_adapter]``
-* - ``scaler_worker_adapter_fixed_native``
-  - ``[fixed_native_worker_adapter]``
-* - ``scaler_worker_adapter_symphony``
-  - ``[symphony_worker_adapter]``
-* - ``scaler_worker_adapter_ecs``
-  - ``[ecs_worker_adapter]``
+* - ``scaler_worker_manager_baremetal_native``
+  - ``[native_worker_manager]``
+* - ``scaler_worker_manager_baremetal_fixed_native``
+  - ``[fixed_native_worker_manager]``
+* - ``scaler_worker_manager_symphony``
+  - ``[symphony_worker_manager]``
+* - ``scaler_worker_manager_aws_raw_ecs``
+  - ``[ecs_worker_manager]``
* - ``python -m scaler.entry_points.worker_manager_aws_hpc_batch``
-  - ``[aws_hpc_worker_adapter]``
+  - ``[aws_hpc_worker_manager]``


Practical Scenarios & Examples
6 changes: 3 additions & 3 deletions docs/source/tutorials/features.rst
@@ -157,7 +157,7 @@ requirements for tasks and allocate them to workers supporting these.
.. literalinclude:: ../../../examples/task_capabilities.py
:language: python

-Scaling Control and Worker Adapter
+Scaling Control and Worker Manager
----------------------------------

Scaler offers an *experimental* auto-scaling feature based on policies, enabling you to scale workers up or down
@@ -169,9 +169,9 @@ Available scaling policies include:
* **no**: No automatic scaling (static workers)
* **vanilla**: Basic task-to-worker ratio scaling
* **capability**: Capability-aware scaling for heterogeneous workloads (e.g., GPU tasks)
-* **fixed_elastic**: Hybrid scaling with primary and secondary worker adapters
+* **fixed_elastic**: Hybrid scaling with primary and secondary worker managers

-For detailed documentation on scaling policies, including the capability-aware scaling controller,
+For detailed documentation on scaling policies, including the capability-aware scaling policy,
see the :doc:`scaling` guide.

Client Disconnect and Shutdown
60 changes: 30 additions & 30 deletions docs/source/tutorials/scaling.rst
@@ -1,17 +1,17 @@
Scaling Policies
================

-Scaler provides an *experimental* auto-scaling feature that allows the system to dynamically adjust the number of workers based on workload. Scaling policies determine when to add or remove workers, while Worker Adapters handle the actual provisioning of resources.
+Scaler provides an *experimental* auto-scaling feature that allows the system to dynamically adjust the number of workers based on workload. Scaling policies determine when to add or remove workers, while Worker Managers handle the actual provisioning of resources.

Overview
--------

The scaling system consists of two main components:

-1. **Scaling Controller**: A policy that monitors task queues and worker availability to make scaling decisions.
-2. **Worker Adapter**: A component that handles the actual creation and destruction of worker groups (e.g., starting containers, launching processes).
+1. **Scaling Policy**: A policy that monitors task queues and worker availability to make scaling suggestions.
+2. **Worker Manager**: A component that handles the actual creation and destruction of worker groups (e.g., starting containers, launching processes).

-The Scaling Controller runs within the Scheduler and communicates with Worker Adapters via Cap'n Proto messages. Worker Adapters connect to the Scheduler and receive scaling commands directly.
+The Scaling Policy runs within the Scheduler and communicates with Worker Managers via Cap'n Proto messages. Worker Managers connect to the Scheduler and receive scaling commands directly.

The scaling policy is configured via the ``policy_content`` setting in the scheduler configuration:

@@ -44,7 +44,7 @@ Scaler provides several built-in scaling policies:
* - ``capability``
- Capability-aware scaling. Scales worker groups based on task-required capabilities (e.g., GPU, memory).
* - ``fixed_elastic``
-  - Hybrid scaling using primary and secondary worker adapters with configurable limits.
+  - Hybrid scaling using primary and secondary worker managers with configurable limits.


No Scaling (``no``)
@@ -64,7 +64,7 @@ The simplest policy that performs no automatic scaling. Use this when:
Vanilla Scaling (``vanilla``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-The vanilla scaling controller uses a simple task-to-worker ratio to make scaling decisions:
+The vanilla scaling policy uses a simple task-to-worker ratio to make scaling suggestions:

* **Scale up**: When ``tasks / workers > upper_task_ratio`` (default: 10)
* **Scale down**: When ``tasks / workers < lower_task_ratio`` (default: 1)
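The ratio rule above can be sketched in a few lines of Python (a simplified stand-in for the policy's decision logic, not the scheduler's actual implementation):

```python
def vanilla_scaling_suggestion(tasks: int, workers: int,
                               upper_task_ratio: float = 10.0,
                               lower_task_ratio: float = 1.0) -> str:
    """Suggest a scaling action from the simple task-to-worker ratio."""
    if workers == 0:
        # No workers yet: any queued task suggests scaling up.
        return "scale_up" if tasks > 0 else "hold"
    ratio = tasks / workers
    if ratio > upper_task_ratio:
        return "scale_up"
    if ratio < lower_task_ratio:
        return "scale_down"
    return "hold"

print(vanilla_scaling_suggestion(tasks=50, workers=4))  # 12.5 > 10 -> scale_up
```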
@@ -80,7 +80,7 @@ This policy is straightforward and works well for homogeneous workloads where al
Capability Scaling (``capability``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-The capability scaling controller is designed for heterogeneous workloads where tasks require specific capabilities (e.g., GPU, high memory, specialized hardware).
+The capability scaling policy is designed for heterogeneous workloads where tasks require specific capabilities (e.g., GPU, high memory, specialized hardware).

**Key Features:**

@@ -97,12 +97,12 @@ The capability scaling controller is designed for heterogeneous workloads where

2. **Worker Matching**: Workers are grouped by their provided capabilities. A worker can handle a task if the task's required capabilities are a subset of the worker's capabilities.

-3. **Per-Capability Scaling**: The controller applies the task-to-worker ratio logic independently for each capability set:
+3. **Per-Capability Scaling**: The policy applies the task-to-worker ratio logic independently for each capability set:

* **Scale up**: When ``tasks / capable_workers > upper_task_ratio`` (default: 5)
* **Scale down**: When ``tasks / capable_workers < lower_task_ratio`` (default: 0.5)

-4. **Capability Request**: When scaling up, the controller requests worker groups with specific capabilities from the worker adapter.
+4. **Capability Request**: When scaling up, the policy requests worker groups with specific capabilities from the worker manager.
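The matching and per-capability ratio logic can be sketched as follows (a simplified model; the real policy also tracks worker groups, pending requests, and manager capacity):

```python
def capability_scaling_suggestions(queued: dict, workers: dict,
                                   upper_task_ratio: float = 5.0,
                                   lower_task_ratio: float = 0.5) -> dict:
    """queued maps a frozenset of required capabilities to queued task counts;
    workers maps a frozenset of provided capabilities to worker counts."""
    suggestions = {}
    for caps, n_tasks in queued.items():
        # A worker is capable if the task's required capabilities are a
        # subset of the worker's provided capabilities.
        capable = sum(n for wcaps, n in workers.items() if caps <= wcaps)
        if capable == 0 or n_tasks / capable > upper_task_ratio:
            suggestions[caps] = "scale_up"
        elif n_tasks / capable < lower_task_ratio:
            suggestions[caps] = "scale_down"
        else:
            suggestions[caps] = "hold"
    return suggestions

# GPU tasks queued but only CPU-only workers available -> request a GPU group:
print(capability_scaling_suggestions(
    queued={frozenset({"gpu"}): 3},
    workers={frozenset(): 4},
))  # {frozenset({'gpu'}): 'scale_up'}
```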

**Configuration:**

@@ -134,18 +134,18 @@ Consider a workload with both CPU-only and GPU tasks:

With the capability scaling policy:

-1. If no GPU workers exist, the controller requests a worker group with ``{"gpu": 1}`` from the adapter.
+1. If no GPU workers exist, the policy requests a worker group with ``{"gpu": 1}`` from the worker manager.
2. CPU and GPU worker groups are scaled independently based on their respective task queues.
3. Idle GPU workers can be shut down without affecting CPU task processing.


Fixed Elastic Scaling (``fixed_elastic``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-The fixed elastic scaling controller supports hybrid scaling with multiple worker adapters:
+The fixed elastic scaling policy supports hybrid scaling with multiple worker managers:

-* **Primary Adapter**: A single worker group (identified by ``max_worker_groups == 1``) that starts once and never shuts down
-* **Secondary Adapter**: Elastic capacity (``max_worker_groups > 1``) that scales based on demand
+* **Primary Manager**: A single worker group (identified by ``max_worker_groups == 1``) that starts once and never shuts down
+* **Secondary Manager**: Elastic capacity (``max_worker_groups > 1``) that scales based on demand

This is useful for scenarios where you have a fixed pool of dedicated resources but want to burst to additional resources during peak demand.
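The primary/secondary split can be illustrated with a small sketch (manager records here are plain dicts built from heartbeat capacity data; the `name` field is illustrative):

```python
def split_managers(managers: list) -> tuple:
    """Classify worker managers by the capacity reported in their heartbeats:
    a manager limited to a single worker group is the fixed primary pool,
    anything larger is elastic secondary capacity."""
    primary = [m for m in managers if m["max_worker_groups"] == 1]
    secondary = [m for m in managers if m["max_worker_groups"] > 1]
    return primary, secondary

managers = [
    {"name": "dedicated", "max_worker_groups": 1},  # fixed pool, never shut down
    {"name": "burst", "max_worker_groups": 8},      # elastic burst capacity
]
primary, secondary = split_managers(managers)
print([m["name"] for m in primary], [m["name"] for m in secondary])
```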

@@ -156,40 +156,40 @@ This is useful for scenarios where you have a fixed pool of dedicated resources

**Behavior:**

-* The primary adapter's worker group is started once and never shut down
-* Secondary adapter groups are created when demand exceeds primary capacity
-* When scaling down, only secondary adapter groups are shut down
+* The primary manager's worker group is started once and never shut down
+* Secondary manager groups are created when demand exceeds primary capacity
+* When scaling down, only secondary manager groups are shut down


-Worker Adapter Protocol
+Worker Manager Protocol
-----------------------

-Scaling controllers, running within the scheduler process, communicate with worker adapters using Cap'n Proto messages through the connection that worker adapters use to communicate with the scheduler. The protocol uses the following message types:
+Scaling policies, running within the scheduler process, communicate with worker managers using Cap'n Proto messages through the connection that worker managers use to communicate with the scheduler. The protocol uses the following message types:

-**WorkerAdapterHeartbeat (Adapter -> Scheduler):**
+**WorkerManagerHeartbeat (Manager -> Scheduler):**

-Worker adapters periodically send heartbeats to the scheduler containing their capacity information:
+Worker managers periodically send heartbeats to the scheduler containing their capacity information:

-* ``max_worker_groups``: Maximum number of worker groups this adapter can manage
+* ``max_worker_groups``: Maximum number of worker groups this manager can manage
* ``workers_per_group``: Number of workers in each group
-* ``capabilities``: Default capabilities for workers from this adapter
+* ``capabilities``: Default capabilities for workers from this manager

-**WorkerAdapterCommand (Scheduler -> Adapter):**
+**WorkerManagerCommand (Scheduler -> Manager):**

-The scheduler sends commands to worker adapters:
+The scheduler sends commands to worker managers:

* ``StartWorkerGroup``: Request to start a new worker group

-* ``worker_group_id``: Empty for new groups (adapter assigns ID)
+* ``worker_group_id``: Empty for new groups (manager assigns ID)
* ``capabilities``: Required capabilities for the worker group

* ``ShutdownWorkerGroup``: Request to shut down an existing worker group

* ``worker_group_id``: ID of the group to shut down

-**WorkerAdapterCommandResponse (Adapter -> Scheduler):**
+**WorkerManagerCommandResponse (Manager -> Scheduler):**

-Worker adapters respond to commands with status and details:
+Worker managers respond to commands with status and details:

* ``worker_group_id``: ID of the affected worker group
* ``command``: The command type this response is for
@@ -198,10 +198,10 @@ Worker adapters respond to commands with status and details:
* ``capabilities``: Actual capabilities of the started workers
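As a plain-Python sketch of these message shapes (field names follow the descriptions above; the actual wire format is Cap'n Proto, and the real schema lives in Scaler itself):

```python
from dataclasses import dataclass, field

@dataclass
class WorkerManagerHeartbeat:        # Manager -> Scheduler
    max_worker_groups: int
    workers_per_group: int
    capabilities: dict = field(default_factory=dict)

@dataclass
class StartWorkerGroup:              # Scheduler -> Manager
    worker_group_id: str = ""        # empty for new groups; manager assigns the ID
    capabilities: dict = field(default_factory=dict)

@dataclass
class ShutdownWorkerGroup:           # Scheduler -> Manager
    worker_group_id: str = ""

@dataclass
class WorkerManagerCommandResponse:  # Manager -> Scheduler
    worker_group_id: str
    command: str
    capabilities: dict = field(default_factory=dict)

hb = WorkerManagerHeartbeat(max_worker_groups=8, workers_per_group=4)
print(hb.max_worker_groups * hb.workers_per_group)  # 32 workers at full capacity
```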


-Example Worker Adapter
+Example Worker Manager
----------------------

-Here is an example of a worker adapter using the ECS (Amazon Elastic Container Service) integration:
+Here is an example of a worker manager using the ECS (Amazon Elastic Container Service) integration:

.. literalinclude:: ../../../src/scaler/worker_manager_adapter/aws_raw/ecs.py
:language: python
Expand All @@ -219,4 +219,4 @@ Tips

3. **Monitor scaling events**: Use Scaler's monitoring tools (``scaler_top``) to observe scaling behavior and tune policies.

-4. **Worker Adapter Placement**: Run worker adapters on machines that can provision the required resources (e.g., run the ECS adapter where it has AWS credentials, run the native adapter on the target machine).
+4. **Worker Manager Placement**: Run worker managers on machines that can provision the required resources (e.g., run the ECS worker manager where it has AWS credentials, run the native worker manager on the target machine).
@@ -1,15 +1,15 @@
-AWS HPC Worker Adapter
+AWS HPC Worker Manager
======================

-The AWS HPC worker adapter offloads task execution to AWS Batch, running each task as a containerized job on managed EC2 compute. This adapter is particularly useful for bursting workloads to the cloud or running tasks that require specific hardware (e.g., GPUs, high memory) not available locally. It currently supports the AWS Batch backend.
+The AWS HPC worker manager offloads task execution to AWS Batch, running each task as a containerized job on managed EC2 compute. This manager is particularly useful for bursting workloads to the cloud or running tasks that require specific hardware (e.g., GPUs, high memory) not available locally. It currently supports the AWS Batch backend.

.. seealso::
For a comprehensive, step-by-step walkthrough of setting up AWS infrastructure, building Docker images, and troubleshooting, see the :doc:`AWS Batch Setup Guide <setup>`.

Prerequisites
-------------

-To use the AWS HPC worker adapter, you need:
+To use the AWS HPC worker manager, you need:

* An **AWS Account** with appropriate permissions.
* The ``boto3`` Python library installed in your Scaler environment.
@@ -24,7 +24,7 @@ To use the AWS HPC worker adapter, you need:
Getting Started
---------------

-To start the AWS HPC worker adapter from the command line:
+To start the AWS HPC worker manager from the command line:

.. code-block:: bash

@@ -45,7 +45,7 @@ Equivalent configuration using a TOML file:

# config.toml

-[aws_hpc_worker_adapter]
+[aws_hpc_worker_manager]
job_queue = "my-scaler-queue"
job_definition = "my-scaler-job-def"
s3_bucket = "my-scaler-tasks-bucket"
@@ -79,7 +79,7 @@ How it Works

* **Payload Handling**: Task payloads are serialized using ``cloudpickle``. If the compressed payload is larger than 28KB, it is uploaded to S3. Smaller payloads are passed directly to the AWS Batch job as parameters.
* **Execution**: The Batch container runs a specialized runner script (``batch_job_runner.py``) that deserializes the task, executes it, and writes the result (and any errors) back to S3.
-* **Concurrency**: The adapter manages a semaphore to limit the number of concurrent Batch jobs (``--max-concurrent-jobs``), preventing accidental resource exhaustion or exceeding AWS service quotas.
+* **Concurrency**: The manager maintains a semaphore to limit the number of concurrent Batch jobs (``--max-concurrent-jobs``), preventing accidental resource exhaustion or exceeding AWS service quotas.
* **Efficiency**: Payloads > 4KB are automatically compressed with gzip to minimize S3 usage and data transfer.
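The payload-handling rules above can be sketched like this (threshold values are taken from the description; the function name and return shape are illustrative, not Scaler's actual implementation):

```python
import gzip

COMPRESS_THRESHOLD = 4 * 1024    # payloads above ~4KB are gzip-compressed
INLINE_LIMIT = 28 * 1024         # anything larger goes to S3, not job parameters

def route_payload(raw: bytes) -> tuple:
    """Decide how a cloudpickle-serialized task payload reaches the Batch job."""
    body = gzip.compress(raw) if len(raw) > COMPRESS_THRESHOLD else raw
    transport = "s3" if len(body) > INLINE_LIMIT else "inline"
    return transport, body

print(route_payload(b"x" * 100)[0])        # small payload, passed inline
print(route_payload(b"y" * 1_000_000)[0])  # compresses well, still inline
```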

Supported Parameters
@@ -88,7 +88,7 @@ Supported Parameters
.. note::
For more details on how to configure Scaler, see the :doc:`../../configuration` section.

-The AWS HPC worker adapter supports the following specific configuration parameters.
+The AWS HPC worker manager supports the following specific configuration parameters.

AWS HPC Configuration
~~~~~~~~~~~~~~~~~~~~~
Expand All @@ -101,7 +101,7 @@ AWS HPC Configuration
* ``--max-concurrent-jobs`` (``-mcj``): Maximum number of concurrent Batch jobs (default: ``100``).
* ``--job-timeout-minutes``: Maximum time a Batch job is allowed to run before being terminated (default: ``60``).
* ``--backend`` (``-b``): AWS HPC backend to use (default: ``batch``).
-* ``--name`` (``-n``): A custom name for the worker adapter instance.
+* ``--name`` (``-n``): A custom name for the worker manager instance.

Common Parameters
~~~~~~~~~~~~~~~~~
@@ -112,7 +112,7 @@ Setup Guide
-----------

.. important::
-Setting up AWS infrastructure and ensuring correct container configuration is critical for the AWS HPC adapter to function correctly.
+Setting up AWS infrastructure and ensuring correct container configuration is critical for the AWS HPC manager to function correctly.

For a comprehensive walkthrough of setting up AWS infrastructure, building Docker images, and troubleshooting, please refer to our detailed :doc:`AWS Batch Setup Guide <setup>`.
