[CELEBORN-1917] Support celeborn.client.push.maxBytesSizeInFlight#3248
DDDominik wants to merge 11 commits into
+CC @venkata91, @rmcyang
@DDDominik, any update?
5590ef0 to 0dffcf6
Please update the docs with the command below:
I have not seen this error before; it should be related to this PR. https://github.com/apache/celeborn/actions/runs/15391457221/job/43306794218?pr=3248
This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.
b906d44 to 8e67ae6
e8622e1 to d09b424
…kup - Modified reader logic to use latest shuffle ID when no valid finished stages available - Fixes barrier stage resubmission across jobs test failure
8e67ae6 to 0a3f223
5401626 to a46773d
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3248      +/-   ##
==========================================
- Coverage   63.57%   63.56%   -0.00%
==========================================
  Files         348      351       +3
  Lines       21300    21571     +271
  Branches     1879     1914      +35
==========================================
+ Hits        13539    13710     +171
- Misses       6781     6861      +80
- Partials      980     1000      +20
mridulm
left a comment
Just one query, rest looks good - thanks !
    || (maxInFlightBytesSizeEnabled
        && totalInflightBytes.sum() <= maxInFlightBytesSizeTotal
        && batchBytesSize.sum() <= maxInFlightBytesSizePerWorker)) {
My earlier query was, should this be || ?
As in, either total inflight is high, or inflight to a specific worker is high (assuming per worker threshold is lower than total !)
-    || (maxInFlightBytesSizeEnabled
-        && totalInflightBytes.sum() <= maxInFlightBytesSizeTotal
-        && batchBytesSize.sum() <= maxInFlightBytesSizePerWorker)) {
+    || (maxInFlightBytesSizeEnabled && (
+        totalInflightBytes.sum() <= maxInFlightBytesSizeTotal ||
+        batchBytesSize.sum() <= maxInFlightBytesSizePerWorker))) {
Looks like the PR got merged before this was addressed.
If this is not a concern, that should be fine - else let us do a follow up.
To avoid this comment being ignored, I will submit a PR first
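The difference between the two forms under discussion can be sketched in isolation. The snippet below is only an illustration, not the Celeborn implementation: the class and method names are hypothetical, plain `long` values stand in for `totalInflightBytes.sum()` and `batchBytesSize.sum()`, and the limits are made-up numbers. It shows a case where one worker is over its per-worker byte limit while the total is still under the total limit, which is exactly where `&&` and `||` disagree.

```java
// Hypothetical stand-alone sketch; names and values are illustrative assumptions.
final class InFlightBytesCheck {

    // Form in the merged diff: true only when BOTH byte counts are within their limits.
    static boolean bothWithinLimit(long totalBytes, long workerBytes,
                                   long maxTotal, long maxPerWorker) {
        return totalBytes <= maxTotal && workerBytes <= maxPerWorker;
    }

    // Form in the review suggestion: true when EITHER byte count is within its limit.
    static boolean eitherWithinLimit(long totalBytes, long workerBytes,
                                     long maxTotal, long maxPerWorker) {
        return totalBytes <= maxTotal || workerBytes <= maxPerWorker;
    }

    public static void main(String[] args) {
        long maxTotal = 1024;     // assumed total byte limit
        long maxPerWorker = 256;  // assumed per-worker byte limit

        long totalBytes = 512;    // under the total limit
        long workerBytes = 300;   // over the per-worker limit

        System.out.println("&& form: "
                + bothWithinLimit(totalBytes, workerBytes, maxTotal, maxPerWorker));
        System.out.println("|| form: "
                + eitherWithinLimit(totalBytes, workerBytes, maxTotal, maxPerWorker));
    }
}
```

Which form is appropriate depends on how the enclosing condition is used (for example, whether it means "safe to proceed" or "must keep waiting"); the surrounding code is not shown in this thread, so the sketch only makes the disagreement between the two expressions concrete.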
FMX
left a comment
LGTM. Thanks. Merged into main(v0.7.0).
  val CLIENT_PUSH_MAX_BYTES_SIZE_IN_FLIGHT_PERWORKER: OptionalConfigEntry[Long] =
    buildConf("celeborn.client.push.maxBytesSizeInFlight.perWorker")
      .categories("client")
      .version("0.6.1")
### What changes were proposed in this pull request?

Add a data size limit for in-flight data by introducing new configurations, `celeborn.client.push.maxBytesSizeInFlight.perWorker/total`, which default to `celeborn.client.push.buffer.max.size * celeborn.client.push.maxReqsInFlight.perWorker/total`. For backward compatibility, also add a switch: `celeborn.client.push.maxReqsInFlight.enabled`.

### Why are the changes needed?

Celeborn supports limiting the number of in-flight push requests via `celeborn.client.push.maxReqsInFlight.perWorker/total`. This is a good constraint on memory usage when most requests do not exceed `celeborn.client.push.buffer.max.size`. However, in a vectorized shuffle (such as Blaze or Gluten), a request may be much larger than the max buffer size, leading to too much in-flight data and resulting in OOM.

### Does this PR introduce _any_ user-facing change?

Yes, it adds new client configs.

### How was this patch tested?

Tested in a local environment.

Closes #3248 from DDDominik/CELEBORN-1917.

Lead-authored-by: DDDominik <1015545832@qq.com>
Co-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: DDDominik <zhuangxian@kuaishou.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
(cherry picked from commit 0ed590d)
Signed-off-by: Wang, Fei <fwang12@ebay.com>
Merged into branch-0.6(0.6.1)
What changes were proposed in this pull request?
Add a data size limit for in-flight data by introducing new configurations, celeborn.client.push.maxBytesSizeInFlight.perWorker/total, which default to celeborn.client.push.buffer.max.size * celeborn.client.push.maxReqsInFlight.perWorker/total.
For backward compatibility, also add a switch: celeborn.client.push.maxReqsInFlight.enabled.
Why are the changes needed?
Celeborn supports limiting the number of in-flight push requests via celeborn.client.push.maxReqsInFlight.perWorker/total. This is a good constraint on memory usage when most requests do not exceed celeborn.client.push.buffer.max.size. However, in a vectorized shuffle (such as Blaze or Gluten), a request may be much larger than the max buffer size, leading to too much in-flight data and resulting in OOM.
Does this PR introduce any user-facing change?
Yes, it adds new client configs.
How was this patch tested?
Tested in a local environment.
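To make the motivation concrete, the default described above (buffer max size multiplied by the request limit) can be worked through with numbers. This is a rough sketch under stated assumptions: the class and method names are hypothetical, and the values used (a 64 KiB push buffer, 32 requests per worker, 256 requests in total, 8 MiB vectorized requests) are illustrative inputs chosen for the arithmetic, not values taken from this PR.

```java
// Hypothetical sketch of the byte-budget arithmetic; all inputs are assumed values.
final class InFlightByteBudget {

    // Default byte limit as described in the PR text:
    // buffer.max.size * maxReqsInFlight (per worker or total).
    static long defaultMaxBytes(long bufferMaxSizeBytes, int maxReqsInFlight) {
        return Math.multiplyExact(bufferMaxSizeBytes, (long) maxReqsInFlight);
    }

    // Bytes that can be in flight when only the request COUNT is limited
    // and every request has the given size.
    static long bytesUnderCountLimit(long requestSizeBytes, int maxReqsInFlight) {
        return Math.multiplyExact(requestSizeBytes, (long) maxReqsInFlight);
    }

    public static void main(String[] args) {
        long bufferMaxSize = 64L * 1024;          // assumed 64 KiB push buffer
        int maxReqsPerWorker = 32;                // assumed per-worker request limit
        int maxReqsTotal = 256;                   // assumed total request limit
        long vectorizedRequest = 8L * 1024 * 1024; // assumed 8 MiB vectorized request

        long perWorkerBudget = defaultMaxBytes(bufferMaxSize, maxReqsPerWorker);
        long totalBudget = defaultMaxBytes(bufferMaxSize, maxReqsTotal);
        long countOnlyExposure = bytesUnderCountLimit(vectorizedRequest, maxReqsPerWorker);

        System.out.println("default per-worker byte budget: " + perWorkerBudget);   // 2097152 (2 MiB)
        System.out.println("default total byte budget:      " + totalBudget);       // 16777216 (16 MiB)
        System.out.println("per-worker bytes, count limit only: " + countOnlyExposure); // 268435456 (256 MiB)
    }
}
```

With these assumed inputs, a count-only limit would allow 256 MiB in flight to a single worker, against a 2 MiB byte budget derived from the buffer size, which is the kind of gap the byte-size limit is meant to close.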