fix(flotilla): wire min_cpu_per_task into TaskResourceRequest#7125
fix(flotilla): wire min_cpu_per_task into TaskResourceRequest#7125XiaoHongbo-Hope wants to merge 19 commits into
Conversation
Greptile SummaryThis PR fixes a long-standing bug where
Confidence Score: 5/5Safe to merge; the change correctly restores the pre-existing 1.0 CPU default and wires the config field that was previously silently ignored. The fix is narrow and well-tested: three new unit tests cover all branches of the floor logic, the default is restored to 1.0 (matching the literal that was there before), Python and env-var paths both validate the > 0 constraint, and MockTaskBuilder::with_resource_request now correctly preserves min_cpu_per_task. SwordfishTaskBuilder is the only production site constructing TaskResourceRequest, and it already carried the config it needed. No files require special attention. Important Files Changed
Reviews (2): Last reviewed commit: "fix(clippy): collapse let-chain in min_c..." | Re-trigger Greptile |
| pub fn num_cpus(&self) -> f64 { | ||
| self.resource_request.num_cpus().unwrap_or(1.0) | ||
| self.resource_request | ||
| .num_cpus() | ||
| .unwrap_or(self.min_cpu_per_task) | ||
| } |
There was a problem hiding this comment.
Silent default-CPU reduction from 1.0 → 0.5
The old fallback was the literal 1.0; DaftExecutionConfig::default().min_cpu_per_task is 0.5. Any deployment that relied on the previous implicit 1.0-CPU bundle request will now submit 0.5-CPU bundles without any config change. On an autoscaler that uses CPU headroom to decide when to launch new nodes, halving the per-task CPU request could double the number of tasks packed per node before a scale-out fires, increasing the likelihood of OOM or throttled execution. The PR description notes this trade-off, but the change in default observable behaviour is worth confirming with the team before merging.
| pub fn with_resource_request(mut self, resource_request: ResourceRequest) -> Self { | ||
| self.resource_request = TaskResourceRequest::new(resource_request); | ||
| self.resource_request = TaskResourceRequest::new( | ||
| resource_request, | ||
| DaftExecutionConfig::default().min_cpu_per_task, | ||
| ); | ||
| self |
There was a problem hiding this comment.
with_resource_request ignores the builder's own config
MockTaskBuilder::with_resource_request constructs a fresh DaftExecutionConfig::default() to obtain the min_cpu_per_task fallback instead of reusing whatever min_cpu_per_task the calling test may have injected into self.resource_request already. Any test that constructs a MockTask with a non-default min_cpu_per_task and later calls with_resource_request will silently reset the fallback to 0.5. This won't break current tests (none set a custom value), but it makes the helper subtly inconsistent and could mask future test regressions. Consider storing the min_cpu_per_task on the builder and reusing it here.
The min_cpu_per_task execution config field had no readers in the distributed scheduler: TaskResourceRequest::num_cpus() returned a hardcoded 1.0 when the plan's ResourceRequest had no num_cpus. This was wired up in Eventual-Inc#4506 for the legacy ray runner only; Eventual-Inc#5375 removed that runner along with its lone reader, and the flotilla scheduler was never wired in. Result: setting min_cpu_per_task via daft.set_execution_config or DAFT_MIN_CPU_PER_TASK had zero effect on autoscaler bundle requests. Plumb the value from DaftExecutionConfig (already in scope at SwordfishTaskBuilder::build) into TaskResourceRequest, and use it as the fallback in num_cpus() instead of the literal 1.0. Closes Eventual-Inc#7123
ca66e01 to
1d05985
Compare
- Comment said 'Floor' but the implementation is 'Default-when-None' (explicit num_cpus is honored as-is). Reword to match. - Add two unit tests: * num_cpus falls back to min_cpu_per_task when ResourceRequest is empty * explicit num_cpus passes through unchanged
Before this PR, the flotilla scheduler fell back to a hardcoded 1.0 CPU when ResourceRequest had no num_cpus; the configured min_cpu_per_task default of 0.5 was inert. Wiring the field through (1d05985) without changing the default would silently halve the per-task CPU floor for every existing user, which Greptile's review flagged as a behaviour change risk on capacity-tuned clusters (e.g. KubeRay packs 2x more tasks before scale-out, increasing OOM risk). Move the default to 1.0 so the no-explicit-num_cpus path matches the pre-wiring behaviour exactly. Users who want a smaller floor set it explicitly via daft.set_execution_config(min_cpu_per_task=...) or DAFT_MIN_CPU_PER_TASK, which is the original purpose of the knob.
…uest MockTaskBuilder.with_resource_request was constructing a fresh DaftExecutionConfig::default() to fetch the min_cpu_per_task fallback, which silently reset any non-default value already set on the builder. Reuse self.resource_request.min_cpu_per_task instead so the fallback threads through chained .with_* calls correctly. Test-helper-only; no production behavior change. Addresses greptile P2 review comment.
…inite min_cpu_per_task
Two follow-ups from review of the min_cpu_per_task wiring:
- The Ray autoscaler bundle still rounded CPU up via `num_cpus().ceil()`,
so a configured `min_cpu_per_task=0.1` reached `request_resources` as
`{"CPU": 1}` (issue Eventual-Inc#7123). Add `TaskResourceRequest::autoscale_bundle`
which keeps CPU fractional (GPU/memory stay integer, zero values dropped)
and build the Ray bundle as a `PyDict` with a float CPU. Update the
`try_autoscale` type hint accordingly.
- `min_cpu_per_task` validation let NaN/inf through (`nan <= 0` and
`inf <= 0` are both false). Use `is_finite() && > 0` on both the Python
setter and the env path.
Add Rust unit tests for the fractional bundle and env validation, and a
Python test covering acceptance of fractional values and rejection of
0/-0.5/nan/inf.
…undles
Ray's request_resources (<= 2.55; Daft pins ray==2.55.1) rejects non-integer
bundle values: `isinstance(bundle[key], int)` raises `TypeError: each bundle
key should be str and value as int.`. Sending a float CPU therefore crashes at
runtime — even a whole 1.0, since pyo3 emits a Python float.
Replace the per-task float bundle with `aggregate_ray_bundles`: CPU-only tasks
have their fractional CPU summed and emitted as ceil(sum) unit {"CPU": 1}
bundles, so N tasks at 0.1 CPU request ceil(0.1*N) CPUs instead of N
(issue Eventual-Inc#7123), while never sending a non-integer value. Tasks carrying GPU or
memory keep an individual bundle (CPU rounded up) since those resources pin
placement to a node.
Revert the try_autoscale type hint to dict[str, int]. Replace the fractional
bundle unit tests with aggregation tests.
aggregate_ray_bundles packed every CPU-only task into unit {"CPU": 1}
bundles. That is wrong for a task requesting num_cpus >= 1: a 4-CPU task
runs on one worker, so splitting it into 4 spread bundles lets the
autoscaler provision 4 single-CPU nodes and leaves the task unschedulable.
It also turned CPU magnitude into the loop count, so a huge or non-finite
explicit num_cpus (inf as i64 == i64::MAX) could hang/OOM, and a NaN
poisoned the running sum and zeroed the batch's CPU request.
Only pack sub-1.0 CPU-only tasks now; tasks with GPU, memory, or
num_cpus >= 1 keep an individual bundle (CPU rounded up to at least 1).
Non-finite / non-positive CPU contributes nothing. The packed sum is now
bounded by task count, so the loop can no longer blow up.
…r mark The high-water mark recorded the fractional cpu_sum, but the request actually sent to Ray is the integer-aggregated bundle total. With min_cpu_per_task=0.1 the mark grew ~0.1 per cycle while ceil() only bumped the real CPU request every ~10 cycles, so scale-up for many pending tasks stalled for ~1/min_cpu_per_task cycles (≈50s at the default 5s interval) per extra CPU. Record the aggregated integer bundle totals (what Ray actually receives) as the mark instead. Because each cycle selects bundles until the fractional cpu_sum exceeds the integer mark, ceil() now bumps by at least one CPU every cycle, restoring the intended one-unit-per-cycle ramp while still never requesting less than before. Convergence is unchanged: once pending demand can no longer exceed the mark, the cycle is skipped. Verified: cargo test -p daft-distributed --lib (8 task tests pass), cargo check/clippy -p daft-distributed --features python clean.
try_new_internal only checked num_gpus for negativity, so an explicit
num_cpus = inf/NaN (or negative) flowed through. Downstream that became a
bundle of {"CPU": i64::MAX} (inf as i64 saturates) in the autoscaler. Require
both num_cpus and num_gpus to be finite and nonnegative; this also catches a
NaN num_gpus that the existing >1-must-be-integer check missed.
aggregate_ray_bundles only packed fractional CPU; fractional-GPU tasks (Daft
supports num_gpus<1) were each emitted as a full {"GPU":1} bundle. Combined
with the high-water mark comparing raw fractional gpu_sum against the integer
mark, 11 tasks at 0.1 GPU would request 11 GPUs instead of ceil(1.1)=2 —
likely tripping "requested bundles exceed max capacity, autoscaler refuses
all". Sum sub-GPU demand into ceil(sum) {"CPU":1,"GPU":1} bundles (their small
CPU rides on the GPU node, not double-counted); memory/multi-unit tasks still
keep an individual placement-pinned bundle.
Extract the select-then-aggregate ramp into a pure next_autoscale_request() in
task.rs so it's unit-testable outside the python-gated worker manager; add
tests for whole-unit CPU/GPU escalation, GPU packing, and the skip-below-mark
case. Tighten comments.
Verified: cargo test -p daft-distributed --lib (78 pass) and
-p common-resource-request; cargo check/clippy --features python clean.
Packing sub-GPU tasks dropped their CPU demand entirely: a task with
num_gpus<1 and the min_cpu_per_task fallback (e.g. @daft.cls(gpus=0.5) ->
num_cpus=None -> 1.0 CPU) only contributed to fractional_gpu_sum, so the
{CPU:1,GPU:1} bundles under-counted CPU. Worse, next_autoscale_request selects
on raw cpu_sum while the high-water mark records the post-aggregation totals,
so two such tasks at mark 1/1 kept re-requesting {CPU:1,GPU:1} without the mark
ever growing — the cluster could stall and never provision the second CPU,
leaving tasks unschedulable.
Accumulate the packed tasks' CPU (gpu_cpu_sum) and spread it across the GPU
bundles as ceil(gpu_cpu_sum / gpu_bundles). The request now reflects real CPU
demand, so the recorded mark grows whenever selection triggers and the ramp
converges. Add a regression test for the gpus=0.5 + fallback-CPU stall.
Verified: cargo test -p daft-distributed --lib (13 task tests) and
cargo check/clippy --features python clean.
The "finite and > 0" check for min_cpu_per_task was duplicated in the env path and the Python setter, which could drift. Extract it into a single DaftExecutionConfig::is_valid_min_cpu_per_task and call it from both.
Carrying GPU tasks' CPU as ceil(gpu_cpu_sum / gpu_bundles) produced bundles
like {CPU:2, GPU:1}. As a single Ray request_resources shape that fits no
standard 1-CPU/1-GPU node, so the autoscaler can't scale up — and the value is
recorded as the high-water mark, stalling further attempts.
Emit unit {CPU:1, GPU:1} bundles instead (a sub-GPU task's cpu and gpu are each
<= 1, so one always fits a standard GPU node), with the count covering both
dimensions: ceil(max(gpu_sum, gpu_cpu_sum)). Two 1-CPU/0.5-GPU tasks now request
two {CPU:1,GPU:1} shapes (2 CPU / 2 GPU) rather than one unschedulable {CPU:2}.
Assert the schedulable shape in the regression test.
Changes Made
Fixes
min_cpu_per_task(DAFT_MIN_CPU_PER_TASK) being a no-op in the flotilla scheduler — it had no readers.TaskResourceRequest::num_cpus()falls back to the configuredmin_cpu_per_taskinstead of a hardcoded1.0, andLocalPhysicalPlan::resource_request()leavesnum_cpusunset so the fallback applies. Default bumped0.5 → 1.0to keep existingbehavior.
request_resources(pinnedray==2.55.1) only accepts integer bundles, so fractional CPU/GPU demand isaggregated into
ceil(sum)integer bundles — N tasks at0.1requestceil(0.1*N)units, not N (issue min_cpu_per_task config has no effect (unwired in flotilla scheduler) #7123). Sub-GPU tasks packinto
{"CPU","GPU":1}bundles that carry their co-located CPU; sub-CPU tasks into{"CPU":1}bundles; memory / whole-unit (>= 1)tasks keep an individual bundle. The high-water mark tracks the post-aggregation integer request, so each cycle escalates by a whole
unit and the ramp converges instead of stalling.
min_cpu_per_task(Python + env, shared rule) and non-finite / negativenum_cpus/num_gpusinResourceRequest.Related Issues
Closes #7123