Skip to content

Waterfall Engine and Scaling Policy#579

Merged
sharpener6 merged 8 commits intofinos:mainfrom
gxuu:feb-waterfall
Mar 11, 2026
Merged

Waterfall Engine and Scaling Policy#579
sharpener6 merged 8 commits intofinos:mainfrom
gxuu:feb-waterfall

Conversation

@gxuu
Copy link
Contributor

@gxuu gxuu commented Feb 26, 2026

READY FOR REVIEW.

Waterfall_v1 engine and Waterfall Scaling Policy

Summary

Adds a new waterfall scaling policy that cascades worker scaling across prioritized worker managers. Priority-1 managers fill first; overflow goes to priority-2, then priority-3. Shutdown is reversed — least-preferred drains first. Offline managers are skipped automatically.

Configuration

policy_engine_type = "waterfall_v1"
policy_content = """
# priority, worker_type, max_task_concurrency
         1,      native,                   10
         1,     native1,                   20
         2,         ecs,                   64
"""

Each line is priority,worker_type,max_task_concurrency. Comments (#) and blank lines are supported. Worker types match worker manager IDs by prefix (e.g. worker_type native matches manager ID native|abc123). Effective capacity per manager is min(config max_task_concurrency, heartbeat max_worker_groups).

Changes

Protocol:

  • Added workerManagerID (Data) to WorkerManagerHeartbeat in Cap'n Proto schema (required, no default)
  • Each manager generates its own ID as <worker_type>|<uuid>: NAT|<pid>, ECS|<pid>, SYM|<pid>

Scaling interface:

  • Extended ScalingPolicy.get_scaling_commands() with a worker_manager_snapshots parameter (Dict[bytes, WorkerManagerSnapshot]) providing cross-manager state to all policies
  • WorkerManagerController builds these snapshots from existing heartbeat tracking state via _build_manager_snapshots()
  • Added WorkerManagerSnapshot dataclass to simple_policy/scaling/types.py

Policy layer:

  • WaterfallV1Policy implements ScalerPolicy (same as SimplePolicy), routed through VanillaPolicyController via create_policy()
  • VanillaPolicyController remains the only PolicyController implementation
  • Added WATERFALL_V1 case to create_policy() factory in library/utility.py

New files:

  • waterfall_v1/waterfall_v1_policy.pyWaterfallV1Policy(ScalerPolicy) with even_load allocation
  • waterfall_v1/scaling/waterfall.py — Stateless WaterfallScalingPolicy(ScalingPolicy) with worker_type-based manager ID matching
  • waterfall_v1/scaling/types.pyWaterfallRule (fields: priority, worker_type, max_task_concurrency)
  • waterfall_v1/scaling/utility.pyparse_waterfall_rules() config parser
  • tests/scheduler/test_waterfall_scaling.py — 25 unit tests

@gxuu gxuu force-pushed the feb-waterfall branch 8 times, most recently from 5ed23c2 to 0cf85fe Compare February 27, 2026 21:55
@gxuu gxuu marked this pull request as ready for review February 27, 2026 22:25
@yzard
Copy link

yzard commented Mar 2, 2026

@gxuu, the name should be WaterfallPolicyEngine, it should not be controller level

Copy link
Contributor

@magniloquency magniloquency left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor

@gxuu gxuu force-pushed the feb-waterfall branch 4 times, most recently from 983560e to 53486f2 Compare March 4, 2026 21:06
@gxuu gxuu requested review from magniloquency and sharpener6 March 4, 2026 21:27
@gxuu gxuu force-pushed the feb-waterfall branch 2 times, most recently from 5ee75f9 to 77b69ac Compare March 5, 2026 21:59
@gxuu gxuu force-pushed the feb-waterfall branch 5 times, most recently from 4f294c7 to 211d8b2 Compare March 10, 2026 20:13
@gxuu gxuu marked this pull request as draft March 10, 2026 20:56
@gxuu gxuu marked this pull request as ready for review March 11, 2026 01:34
gxuu added 7 commits March 11, 2026 23:41
Introduces an "advance" policy engine with a waterfall scaling strategy that
cascades worker scaling across prioritized adapters. Priority-1 adapters fill
first; overflow goes to priority-2, then priority-3. Shutdown is reversed.

- Add workerAdapterID (Data) to WorkerAdapterHeartbeat protocol
- Extend ScalingController interface with worker_adapter_snapshots parameter
- WorkerAdapterController builds cross-adapter snapshots from heartbeat state
- Add WaterfallScalingController, AdvancePolicy, and supporting types/factories

Signed-off-by: gxu <georgexu420@163.com>
- Route waterfall through VanillaPolicyController like all other policies
- Remove create_policy_controller; add WATERFALL_V1 case to create_policy
- Remove default value for worker_manager_id in WorkerManagerHeartbeat.new_msg
- Rename WaterfallRule.worker_manager_id to adapter_id_prefix for clarity

Signed-off-by: gxu <georgexu420@163.com>
- WaterfallRule.adapter_id_prefix -> worker_type (matches Worker Manager ID spec)
- WaterfallRule.max_workers -> max_task_concurrency
- Config format: priority,worker_type,max_task_concurrency
- Update docstrings and comments to use worker_type terminology

Signed-off-by: gxu <georgexu420@163.com>
Signed-off-by: gxu <georgexu420@163.com>
Signed-off-by: gxu <georgexu420@163.com>
Signed-off-by: gxu <georgexu420@163.com>
Signed-off-by: gxu <georgexu420@163.com>
@gxuu gxuu changed the title Waterfall Scaling Controller Waterfall Engine and Scaling Policy Mar 11, 2026
Signed-off-by: gxu <georgexu420@163.com>
@gxuu gxuu requested a review from sharpener6 March 11, 2026 19:52
@sharpener6 sharpener6 merged commit 503ca16 into finos:main Mar 11, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants