Implement ORB Worker Adapter #525

Open
magniloquency wants to merge 44 commits into finos:main from magniloquency:orb


@magniloquency magniloquency commented Jan 22, 2026

This pull request implements the ORB (Open Resource Broker) Worker Manager, enabling Scaler to dynamically scale worker instances on AWS using the ORB CLI.

Caveats

  • Only one worker is started per EC2 instance. This is likely inefficient and may need improvement in the future.
  • I may not be using ORB in the most efficient way, but I did my best given the resources available to me.
  • ORB creates launch templates and does not clean them up; I do not know whether ORB provides a way to do this.
  • ORB requires the AWS CLI to be configured with profiles; other forms of auth do not seem to work (e.g. EC2 instance profiles on the running instance).

Key Changes

  • New Worker Manager: Added ORBWorkerAdapter in src/scaler/worker_manager_adapter/orb/, which handles:
    • Dynamic provisioning of EC2 instances via ORB machine requests
    • Automatic security group (outbound-only) and key pair management
    • Deterministic worker ID generation to match instance tags
    • User data generation for automatic worker cluster startup on newly provisioned instances
  • ORB CLI Integration: Added ORBHelper to interface with the orb command-line tool, including support for templates, machines, and provisioning requests. AWS region is injected into ORB config at runtime.
  • AMI Building: Introduced ami/ directory with Packer configuration (opengris-scaler.pkr.hcl) and a build script (build.sh) to create AMIs pre-configured with opengris-scaler.
  • Configuration: Added ORBWorkerAdapterConfig for detailed adapter settings, with ORB config templates bundled in worker_manager_adapter/orb/config/.
  • Entry Points: Added scaler_worker_manager_orb entry point and corresponding run script.
  • Infrastructure: Added support for no_random_worker_ids in Cluster and Worker classes to facilitate deterministic worker identification in cloud environments.
  • Scheduler Fix: The WorkerAdapterController now waits for a pending command to complete before sending a new one. This prevents duplicate StartWorkerGroup commands being sent during the long ORB polling period, which caused WorkerGroupTooMuch errors and spurious "no pending command found" warnings.
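The deterministic worker identification listed under Key Changes can be illustrated with a minimal sketch. The helper `make_worker_id` and its naming scheme are hypothetical and are not taken from the Scaler code base; only the `no_random_worker_ids` flag name comes from this PR. The idea shown is that, when the flag is set, the worker ID is exactly the name the adapter chose, so it can be matched against a tag on the EC2 instance.

```python
import uuid


def make_worker_id(name: str, no_random_worker_ids: bool = False) -> str:
    """Hypothetical helper: build a worker ID from a base name.

    With no_random_worker_ids=True the ID is exactly `name`, so an adapter
    that tagged an instance with `name` can later recognise the worker.
    Otherwise a random suffix is appended to keep IDs unique.
    """
    if no_random_worker_ids:
        return name
    # Assumed suffix format, for illustration only.
    return f"{name}-{uuid.uuid4().hex[:8]}"
```

With the flag set, two calls with the same name return the same ID; without it, the IDs differ in their random suffix.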

Implementation Details

  • The adapter uses a temporary execution environment for ORB to avoid configuration conflicts
  • Implements a webhook handler to respond to Scaler's scaling requests (start_worker_group, shutdown_worker_group)
  • ORB polling runs in a thread executor to avoid blocking the asyncio event loop
  • Boto3 is used for auxiliary AWS operations like discovering the default subnet and managing temporary security groups
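The point about running ORB polling in a thread executor can be sketched as follows. This is a minimal illustration using only the standard library, not the adapter's actual code: `poll_for_instance_id`, `start_worker_group`, and `make_fake_fetch` are illustrative names, and the `fetch` callable stands in for whatever blocking ORB call reports the instance ID.

```python
import asyncio
import time
from typing import Callable, Optional


def poll_for_instance_id(
    fetch: Callable[[], Optional[str]], interval: float = 0.01, timeout: float = 1.0
) -> Optional[str]:
    """Blocking poll: call `fetch` until it yields an instance ID or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        instance_id = fetch()
        if instance_id is not None:
            return instance_id
        # A blocking sleep is acceptable here because this runs off the event loop.
        time.sleep(interval)
    return None


async def start_worker_group(fetch: Callable[[], Optional[str]]) -> Optional[str]:
    """Run the blocking poll in the default thread pool so the loop stays responsive."""
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default ThreadPoolExecutor.
    return await loop.run_in_executor(None, poll_for_instance_id, fetch)


def make_fake_fetch(ready_after: int, instance_id: str = "i-0123456789abcdef0"):
    """Stand-in for an ORB status query: returns None until the Nth call."""
    calls = {"n": 0}

    def fetch() -> Optional[str]:
        calls["n"] += 1
        return instance_id if calls["n"] >= ready_after else None

    return fetch
```

While the poll runs in a worker thread, other coroutines on the same loop (heartbeats, for example) can continue to be scheduled.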

Dependencies

  • Added orb-py and boto3 to the orb and all extra dependency groups in pyproject.toml

@magniloquency magniloquency force-pushed the orb branch 13 times, most recently from 1339a57 to 1ebb8d9 Compare January 29, 2026 04:09
@magniloquency magniloquency force-pushed the orb branch 9 times, most recently from 0182b36 to 2c2573a Compare February 10, 2026 00:18
- Include submit_tasks.py in examples readme and documentation.
- Implement skip_examples.txt for top-level examples in CI.
- Add submit_tasks.py to skip_examples.txt as it requires a running scheduler.
gxuu previously approved these changes Feb 13, 2026
- Remove aiohttp dependency and RESTful API implementation.
- Rename ORBAdapter to ORBWorkerAdapter.
- Implement ZMQ DEALER connection to scheduler for commands and heartbeats.
- Update start/shutdown logic to return status codes consistent with the new protocol.
- Clean up configuration by removing now-unused WebConfig.
rafa-be previously approved these changes Feb 18, 2026
gxuu previously approved these changes Feb 19, 2026
@magniloquency magniloquency dismissed stale reviews from gxuu and rafa-be via 81a2822 February 20, 2026 02:07
linux-foundation-easycla bot commented Feb 20, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

Replace camelcase_dict with direct use of asdict output, which already
produces snake_case keys matching the ORBTemplate dataclass field names.
Remove unused sections (server, metrics, performance, events, naming,
circuit_breaker, etc.) to reduce config to the minimal required structure.
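The commit message above relies on the fact that `dataclasses.asdict` returns a dictionary keyed by the dataclass field names, so snake_case fields produce snake_case keys with no conversion step. A small sketch follows; `ExampleTemplate` and its fields are placeholders chosen for illustration and do not reproduce the real `ORBTemplate` definition.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ExampleTemplate:
    """Placeholder dataclass; field names are assumptions, not ORBTemplate's."""

    template_id: str
    image_id: str
    instance_types: List[str] = field(default_factory=list)
    max_instances: int = 1


template = ExampleTemplate(
    template_id="tpl-1", image_id="ami-00000000", instance_types=["t3.micro"]
)

# asdict uses the field names verbatim as keys and preserves declaration order.
payload = asdict(template)
```

Because the keys already match the field names, a separate camelCase conversion helper is unnecessary when the consumer expects snake_case.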
magniloquency and others added 14 commits February 20, 2026 10:40
Resolves merge conflicts with origin/main, taking main's desired state.
Updates orb adapter to match the worker_manager_* naming convention
introduced in main: renames entry point file and CLI command, and
updates orb documentation to reflect the new command name and adapter
overview structure.
…ove initialization

- Rename worker_adapter/orb/ to worker_manager_adapter/orb/ and worker_adapter.py to worker_manager.py
- Extract AWS/ORB setup into a lazy __initialize() method called at runtime
- Add proper Optional type annotations for deferred fields
- Add assert guards before connector usage
- Fix unlimited workers check (max_workers == -1)
- Condense multi-line imports in uv_ymq __init__, .pyi, and test file
- Move dict_utils (camelcase/snakecase) out of formatter into its own module
- Move ORB config files to worker_manager_adapter/orb/config/ and delete from drivers/orb/config/
- Inject AWS region into ORB config at runtime rather than requiring pre-configured files
- Remove allowed_ip config field; drop ingress security group rules (workers connect outbound only)
- Extract _poll_for_instance_id helper and run it in executor to avoid blocking the event loop
- Fix orb_config_path default to use package-relative path
- Update docs and entry point references to scaler_worker_manager_orb
Populate instance_types in the generated template so ORB can resolve
the EC2 instance type when requesting machines. Also fix the region
injection in ORBHelper, which was iterating the wrong key ("providers"
instead of "provider.providers") and silently leaving the region as
us-east-1 regardless of config.
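The region bug described above is a nested-key lookup error: reading a top-level "providers" key finds nothing, the loop body never runs, and the default region silently survives. A hedged sketch of the corrected lookup follows. The config shape (a list under `provider.providers`, each entry holding a `config` mapping with a `region`) is an assumption made for illustration, and `inject_region` is a hypothetical name, not the ORBHelper method.

```python
import copy
from typing import Any, Dict


def inject_region(config: Dict[str, Any], region: str) -> Dict[str, Any]:
    """Return a copy of `config` with `region` set on every provider entry.

    Assumed layout: config["provider"]["providers"] is a list of dicts,
    each with an optional "config" mapping.
    """
    updated = copy.deepcopy(config)
    providers = updated.get("provider", {}).get("providers", [])
    if not providers:
        # Fail loudly instead of silently keeping the default region.
        raise ValueError("no entries under provider.providers; region not injected")
    for provider in providers:
        provider.setdefault("config", {})["region"] = region
    return updated


example_config = {
    "provider": {"providers": [{"name": "aws", "config": {"region": "us-east-1"}}]}
}
patched = inject_region(example_config, "eu-west-1")
```

Raising when no provider entries are found is a design choice of this sketch; it turns the silent failure mode from the commit message into an explicit error.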
…light

When an adapter takes a long time to fulfill a command (e.g. ORB polling
for instance IDs), repeated heartbeats caused the scheduler to send new
commands before the previous response arrived. This resulted in duplicate
StartWorkerGroup commands, WorkerGroupTooMuch errors, and spurious "no
pending command found" warnings.
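The fix described in this commit message amounts to allowing at most one command in flight per adapter. The sketch below shows that gating logic in isolation. `CommandGate` and its method names are hypothetical and do not mirror the real `WorkerAdapterController`; commands are represented as plain strings for brevity.

```python
from typing import List, Optional


class CommandGate:
    """Hypothetical single-in-flight gate: one pending command at a time."""

    def __init__(self) -> None:
        self._pending: Optional[str] = None
        self.sent: List[str] = []  # record of commands actually sent

    def on_heartbeat(self, desired_command: Optional[str]) -> bool:
        """Send `desired_command` only if nothing is pending. Returns True if sent."""
        if desired_command is None or self._pending is not None:
            return False
        self._pending = desired_command
        self.sent.append(desired_command)
        return True

    def on_response(self, command: str) -> bool:
        """Clear the pending slot if the response matches it. Returns True on match."""
        if self._pending != command:
            # Corresponds to the "no pending command found" situation.
            return False
        self._pending = None
        return True
```

With this gate, repeated heartbeats that arrive while the adapter is still polling do not trigger further sends; the next command goes out only after the matching response has cleared the pending slot.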
5 participants