[CELEBORN-2312] Support committing uncommitted partitions for graceful shutdown by SteNicholas · Pull Request #3668 · apache/celeborn

SteNicholas · 2026-04-20T19:11:00Z

What changes were proposed in this pull request?

Allow the worker to proactively commit uncommitted partitions during graceful shutdown, controlled by a new configuration `celeborn.worker.graceful.shutdown.commitUncommittedPartitions.enabled` (default `false`).

Key changes:

- WorkerPartitionLocationInfo#snapshotUncommittedUniqueIds: Takes a weakly consistent snapshot of uncommitted partition unique IDs grouped by shuffle key (primary + replica). It relies on ConcurrentHashMap iteration semantics: mutations that race with the iteration may or may not be captured, and mutations made after the snapshot returns are not visible in it.
- Controller#commitUncommittedPartitions(): Snapshots all uncommitted partitions, commits them in parallel via the existing commitFiles thread pool, waits up to shuffleCommitTimeout, then removes successfully committed partitions and releases their slots. Failed partitions are intentionally retained so that the existing passive LifecycleManager CommitFiles retry path can still handle them.
- Worker#shutdownGracefully(): Invokes Controller#commitUncommittedPartitions() after shutdown.set(true) when the config is enabled.
- CelebornConf: New config `celeborn.worker.graceful.shutdown.commitUncommittedPartitions.enabled` (version 0.7.0, default `false`).
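
To illustrate the snapshot semantics described in the key changes, here is a minimal Java sketch. It is not the PR's implementation: `SnapshotSketch`, `snapshotUniqueIds`, and the use of `String` for locations are hypothetical simplifications. It only assumes that the worker tracks locations in nested `ConcurrentHashMap`s keyed by shuffle key and unique ID.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class SnapshotSketch {
    // Hypothetical stand-in for the worker's bookkeeping:
    // shuffleKey -> (uniqueId -> location). Locations are plain Strings here.
    static Map<String, Set<String>> snapshotUniqueIds(
            ConcurrentHashMap<String, ConcurrentHashMap<String, String>> locations) {
        Map<String, Set<String>> snapshot = new HashMap<>();
        // ConcurrentHashMap iterators are weakly consistent: they never throw
        // ConcurrentModificationException and may or may not reflect updates
        // that happen while the iteration is in progress.
        for (Map.Entry<String, ConcurrentHashMap<String, String>> e : locations.entrySet()) {
            Set<String> ids = new HashSet<>(e.getValue().keySet()); // copy, not a live view
            if (!ids.isEmpty()) { // skip shuffle keys with nothing left to commit
                snapshot.put(e.getKey(), ids);
            }
        }
        return snapshot;
    }

    public static void main(String[] args) {
        ConcurrentHashMap<String, ConcurrentHashMap<String, String>> primary =
                new ConcurrentHashMap<>();
        primary.computeIfAbsent("app-1-0", k -> new ConcurrentHashMap<>()).put("0-0", "loc-a");
        primary.computeIfAbsent("app-1-1", k -> new ConcurrentHashMap<>()); // empty shuffle key
        Map<String, Set<String>> snap = snapshotUniqueIds(primary);
        primary.get("app-1-0").put("1-0", "loc-b"); // added after the snapshot was taken
        System.out.println(snap);
    }
}
```

In this sketch the returned map contains only `0-0` under `app-1-0`: the empty shuffle key is filtered out and the later insertion is not reflected, because the result is a copy and not a view.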

Why are the changes needed?

During graceful shutdown, the worker currently waits passively for LifecycleManager to send CommitFiles RPCs. This introduces unnecessary shutdown latency in scenarios where:

- The LifecycleManager is slow to react (e.g., under GC pressure or network delays).
- The LifecycleManager has already deregistered the worker and will not send CommitFiles.
- Multiple applications have uncommitted partitions, amplifying the wait time.

By allowing the worker to proactively commit its own partitions, the graceful shutdown window can be significantly shortened while maintaining backward compatibility (opt-in, default off).

Does this PR resolve a correctness bug?

No.

Does this PR introduce any user-facing change?

Yes. A new configuration is introduced:

| Config Key | Default Value |
| --- | --- |
| `celeborn.worker.graceful.shutdown.commitUncommittedPartitions.enabled` | `false` |

How was this patch tested?

WorkerPartitionLocationInfoSuite
- snapshotUncommittedUniqueIds - empty info returns empty maps
- snapshotUncommittedUniqueIds - captures correct IDs across shuffles
- snapshotUncommittedUniqueIds - filters empty shuffle keys
- snapshotUncommittedUniqueIds - snapshot is a point-in-time copy
WorkerSuite
- commitUncommittedPartitions - commits primary and replica partitions
- commitUncommittedPartitions - no-op when no partitions
- commitUncommittedPartitions - idempotent on double call
- commitUncommittedPartitions - retains failed partitions for passive wait
- commitUncommittedPartitions - commits across multiple shuffle keys
- commitUncommittedPartitions - no cross-shuffle uniqueId collision
- commitUncommittedPartitions - cross-shuffle collision with partial failure

Copilot

Pull request overview

Adds an opt-in mechanism for Celeborn workers to proactively commit uncommitted partitions during graceful shutdown (to reduce shutdown latency), controlled by a new worker configuration flag.

Changes:

- Add snapshotUncommittedUniqueIds to snapshot uncommitted partition unique IDs (primary + replica) by shuffle key.
- Add Controller.commitUncommittedPartitions() and invoke it from Worker.shutdownGracefully() when enabled.
- Add new config `celeborn.worker.graceful.shutdown.commitUncommittedPartitions.enabled` and document it; add unit tests for the new behavior.
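
The proactive commit flow summarized above (commit in parallel, wait with a timeout, drop only what succeeded) can be sketched roughly as follows. This is an illustration under assumptions, not the code in Controller.scala: `ProactiveCommitSketch`, `commitAll`, and the `commitOne` predicate are hypothetical names, and the real flow works per shuffle key and also releases slots.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Predicate;

final class ProactiveCommitSketch {
    // Commits every id currently in `uncommitted` on `pool`, waits up to timeoutMs,
    // then removes only the ids whose commit succeeded. Ids that failed or did not
    // finish in time stay in the set so a later passive commit can retry them.
    static Set<String> commitAll(Set<String> uncommitted, Predicate<String> commitOne,
                                 ExecutorService pool, long timeoutMs) {
        Set<String> committed = ConcurrentHashMap.newKeySet();
        List<String> snapshot = List.copyOf(uncommitted);
        CompletableFuture<?>[] tasks = snapshot.stream()
                .map(id -> CompletableFuture.runAsync(() -> {
                    if (commitOne.test(id)) {
                        committed.add(id);
                    }
                }, pool))
                .toArray(CompletableFuture[]::new);
        try {
            CompletableFuture.allOf(tasks).get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException | TimeoutException e) {
            // A failed or timed-out commit simply leaves its id uncommitted.
        }
        uncommitted.removeAll(committed);
        return uncommitted;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Set<String> ids = ConcurrentHashMap.newKeySet();
        ids.addAll(List.of("0-0", "1-0", "2-0"));
        // Hypothetical commit function: pretend that "1-0" fails to commit.
        Set<String> left = commitAll(ids, id -> !id.equals("1-0"), pool, 5_000L);
        pool.shutdown();
        System.out.println("still uncommitted: " + left);
    }
}
```

Keeping failed ids in the set mirrors the PR's stated intent that failed partitions are retained for the passive LifecycleManager CommitFiles path.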

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.


| File | Description |
| --- | --- |
| `worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Worker.scala` | Calls proactive commit during graceful shutdown when the new config is enabled. |
| `worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Controller.scala` | Implements proactive commit flow using existing `commitFiles` infrastructure and then removes/releases committed partitions. |
| `common/src/main/scala/org/apache/celeborn/common/meta/WorkerPartitionLocationInfo.scala` | Adds snapshot API for uncommitted partition IDs grouped by shuffle key. |
| `common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala` | Introduces the new configuration entry and accessor. |
| `docs/configuration/worker.md` | Documents the new worker config flag. |
| `common/src/test/scala/org/apache/celeborn/common/meta/WorkerPartitionLocationInfoSuite.scala` | Adds tests for the new snapshot behavior. |
| `worker/src/test/scala/org/apache/celeborn/service/deploy/worker/WorkerSuite.scala` | Adds tests for proactive commit behavior and idempotency/failure retention. |

Copilot reviewed the pull request three more times, each time covering 7 out of 7 changed files, and generated 3, 3, and 2 comments respectively.

FMX · 2026-04-21T08:55:32Z

@SteNicholas In this PR, are the committed files stored in the RocksDB by the storage manager?

SteNicholas · 2026-04-21T09:29:50Z

@FMX, both of the following steps are already covered by existing code, so Controller#commitUncommittedPartitions does not need to do any extra store operations.

commitUncommittedPartitions()
  → commitFiles() → fileWriter.close()
    → TierWriter.close() (line 120): notifyFileCommitted()
      → storageManager.notifyFileInfoCommitted()  ← Write to committedFileInfos
...
StorageManager.close(WORKER_GRACEFUL_SHUTDOWN)
  → saveAllCommittedFileInfosToDB()                ← Persist to RocksDB
FMX · 2026-04-23T07:27:27Z

After some investigation, I think there is something wrong with this PR.

FMX · 2026-04-23T08:59:53Z

Scenario: no restart, but the worker proactively commits and clears partitions

This is the more subtle case introduced by the PR.

During graceful shutdown, Controller.commitUncommittedPartitions():

- snapshots uncommitted uniqueIds from partitionLocationInfo (best-effort snapshot),
- commits them via the existing commitFiles helper,
- removes successfully committed (or empty) uniqueIds from partitionLocationInfo and releases slots.

After this proactive commit, the client-side LifecycleManager/CommitManager might still send CommitFiles RPCs for the same shuffle/uniqueIds (e.g., due to timing, retries, or delayed reactions).

When Controller.commitFiles() runs for those ids, it calls partitionLocationInfo.getPrimaryLocation/getReplicaLocation. If the location is missing (because the proactive flow already removed it), the worker logs an error and treats the id as failed: location == null → failedIds.add(uniqueId).

The final CommitFiles response becomes PARTIAL_SUCCESS (or similar), which the client treats as a commit failure for those ids.
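
The failure mode in this scenario can be reproduced with a small model. The sketch below is hypothetical: `MissingLocationSketch`, `CommitResult`, and the `Status` values are simplified stand-ins for the worker's real types, and the only behavior it models is "missing location means failed id".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MissingLocationSketch {
    enum Status { SUCCESS, PARTIAL_SUCCESS }

    static final class CommitResult {
        final Status status;
        final List<String> committedIds;
        final List<String> failedIds;

        CommitResult(Status status, List<String> committedIds, List<String> failedIds) {
            this.status = status;
            this.committedIds = committedIds;
            this.failedIds = failedIds;
        }
    }

    // Simplified model of a passive commit: an id whose location is no longer
    // tracked is reported as failed, even if its data was committed earlier.
    static CommitResult commitFiles(Map<String, String> locations, List<String> requestedIds) {
        List<String> committed = new ArrayList<>();
        List<String> failed = new ArrayList<>();
        for (String id : requestedIds) {
            if (locations.get(id) == null) {
                failed.add(id);
            } else {
                committed.add(id);
            }
        }
        Status status = failed.isEmpty() ? Status.SUCCESS : Status.PARTIAL_SUCCESS;
        return new CommitResult(status, committed, failed);
    }

    public static void main(String[] args) {
        Map<String, String> locations = new HashMap<>();
        locations.put("0-0", "loc-a");
        locations.put("1-0", "loc-b");
        locations.remove("0-0"); // the proactive flow committed and removed this id
        CommitResult r = commitFiles(locations, List.of("0-0", "1-0"));
        System.out.println(r.status + " failed=" + r.failedIds);
    }
}
```

Here the late request for `0-0` ends up in `failedIds` although its data was already committed, which is the mismatch the scenario describes.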

FMX · 2026-04-23T09:01:24Z

I think you'll need to extend the shuffleCommitInfos and persist it to make sure subsequent CommitFiles requests can be recognized as already committed.
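
One possible reading of this suggestion, sketched in Java under stated assumptions: the worker remembers which unique IDs it has already committed, so a later CommitFiles request for the same id is answered as a success and not as a failure. `CommittedIdsSketch` and its `committedIds` set are hypothetical; they stand in for an extended shuffleCommitInfos, and persistence (e.g. to RocksDB) is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class CommittedIdsSketch {
    private final Map<String, String> locations = new ConcurrentHashMap<>();
    // Hypothetical record of ids that were already committed. In a real worker
    // this would have to be persisted to survive a restart.
    private final Set<String> committedIds = ConcurrentHashMap.newKeySet();

    void register(String uniqueId, String location) {
        locations.put(uniqueId, location);
    }

    // Proactive commit during shutdown: record the id as committed before
    // dropping its location, so it is never absent from both structures.
    void proactiveCommit(String uniqueId) {
        if (locations.containsKey(uniqueId)) {
            committedIds.add(uniqueId);
            locations.remove(uniqueId);
        }
    }

    // Passive commit: returns the ids that failed. Ids already recorded as
    // committed are treated as successes and are not looked up again.
    List<String> commitFiles(List<String> requestedIds) {
        List<String> failed = new ArrayList<>();
        for (String id : requestedIds) {
            if (committedIds.contains(id)) {
                continue;
            }
            if (locations.containsKey(id)) {
                committedIds.add(id);
                locations.remove(id);
            } else {
                failed.add(id); // genuinely unknown id
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        CommittedIdsSketch worker = new CommittedIdsSketch();
        worker.register("0-0", "loc-a");
        worker.register("1-0", "loc-b");
        worker.proactiveCommit("0-0");
        System.out.println("failed: " + worker.commitFiles(List.of("0-0", "1-0")));
    }
}
```

With this bookkeeping a repeated request for `0-0` no longer fails, while an id the worker has never seen is still reported as failed.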

FMX

After careful review, I think this PR is not ready.

SteNicholas requested a review from Copilot April 20, 2026 19:11

github-actions Bot added kind:documentation module:common module:worker labels Apr 20, 2026

Copilot started reviewing on behalf of SteNicholas April 20, 2026 19:11

Copilot AI reviewed Apr 20, 2026

SteNicholas marked this pull request as draft April 20, 2026 19:26

SteNicholas force-pushed the CELEBORN-2312 branch from da4ba4c to 4104e87 April 21, 2026 04:44

SteNicholas requested a review from Copilot April 21, 2026 04:44

Copilot started reviewing on behalf of SteNicholas April 21, 2026 04:44

Copilot AI reviewed Apr 21, 2026

SteNicholas force-pushed the CELEBORN-2312 branch from 4104e87 to 629e0c0 April 21, 2026 05:28

SteNicholas requested a review from Copilot April 21, 2026 05:30

Copilot started reviewing on behalf of SteNicholas April 21, 2026 05:37

Copilot AI reviewed Apr 21, 2026

Commit 5ff02d4: [CELEBORN-2312] Support committing uncommitted partitions for graceful shutdown

SteNicholas force-pushed the CELEBORN-2312 branch from 629e0c0 to 5ff02d4 April 21, 2026 06:10

SteNicholas requested a review from Copilot April 21, 2026 06:11

Copilot started reviewing on behalf of SteNicholas April 21, 2026 06:18

Copilot AI reviewed Apr 21, 2026

Comment threads (2): worker/src/test/scala/org/apache/celeborn/service/deploy/worker/WorkerSuite.scala

SteNicholas marked this pull request as ready for review April 21, 2026 06:23

FMX requested changes Apr 24, 2026
