
Conversation

@kfaraz (Contributor) commented Dec 8, 2025

Description

Based on discussions with @jtuglu1 in #18764, there seem to be some race conditions in segment balancing between the duty run thread and the segment drop callback thread (where the drop itself is triggered by segment balancing).

Race condition 1

The duty takes a snapshot of the peon while a segment is in the middle of being transitioned from state MOVING_FROM to DROPPING.

  • The result is that the ServerHolder assumes this segment actually has state LOADED (when in fact it should be DROPPING).
  • This leads the StrategicSegmentAssigner to assume that the segment is over-replicated (loaded on both server A and server B) and to trigger a drop from either A (a no-op, since we were about to start a drop from A anyway) or B (the segment becomes briefly unavailable). A simplified sketch of this interleaving follows.
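
A deliberately simplified sketch of this interleaving, using hypothetical class and field names based on the ones mentioned in this PR (not the actual peon code): because the two-step transition is not atomic, a concurrent reader can observe the segment in neither set and conclude it is simply LOADED.

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    class LoadQueuePeonSketch
    {
      // Segments queued to be dropped because they are being moved away (MOVING_FROM)
      private final Set<String> segmentsMarkedToDrop = ConcurrentHashMap.newKeySet();
      // Segments for which an actual DROP request has been issued (DROPPING)
      private final Set<String> segmentsToDrop = ConcurrentHashMap.newKeySet();

      // Callback thread: the MOVING_FROM -> DROPPING transition happens in two steps.
      void startDrop(String segmentId)
      {
        segmentsMarkedToDrop.remove(segmentId);
        // A duty-thread snapshot taken right here finds the segment in neither set,
        // so the ServerHolder treats it as plainly LOADED.
        segmentsToDrop.add(segmentId);
      }

      // Duty thread: reads both sets without holding any common lock, so it can
      // observe the inconsistent intermediate state above.
      boolean isDroppingOrMarkedToDrop(String segmentId)
      {
        return segmentsToDrop.contains(segmentId) || segmentsMarkedToDrop.contains(segmentId);
      }
    }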

Race condition 2

The duty takes a snapshot of the peon after the DROP operation has succeeded, but the drop is not yet reflected in the inventory (since the inventory was snapshotted earlier in the flow).

  • This would again cause the StrategicSegmentAssigner to consider this segment over-replicated (a rough interleaving is sketched after this list).
  • Something similar may happen with a load operation as well.
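
A rough interleaving for race condition 2, written in the same style as the trace later in this thread (the step labels here are illustrative, not captured logs):

    T0[Coordinator]: Snapshot of server inventory shows segment[S] loaded on Server[A] and Server[B]
    T1[Coordinator-callback]: Server[A] completed request[DROP] on segment[S] with status[SUCCESS]
    T2[Coordinator]: Snapshot of peon queues shows no pending request for segment[S]
    T3[Coordinator]: StrategicSegmentAssigner sees loaded=2 with nothing dropping, treats segment[S] as over-replicated and queues another drop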

Changes

  • Use a synchronized block to ensure that a segment is moved from the set segmentsMarkedToDrop to segmentsToDrop atomically (sketched after this list).
  • Use @GuardedBy annotations to document which lock guards these fields and keep access to them thread-safe.
  • Pass the list of segments already loaded on a server to LoadQueuePeon.getSegmentsInQueue() so that completed requests remain tracked inside the peon until the inventory confirms them.
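
A minimal sketch of the first two changes, reusing the hypothetical class from the race condition 1 sketch and assuming the JSR-305 @GuardedBy annotation; the actual peon code differs, but the idea is that the MOVING_FROM -> DROPPING transition becomes atomic with respect to any reader that takes the same lock.

    import java.util.HashSet;
    import java.util.Set;
    import javax.annotation.concurrent.GuardedBy;

    class LoadQueuePeonSketch
    {
      private final Object lock = new Object();

      @GuardedBy("lock")
      private final Set<String> segmentsMarkedToDrop = new HashSet<>();

      @GuardedBy("lock")
      private final Set<String> segmentsToDrop = new HashSet<>();

      // Callback thread: moves the segment between the two sets atomically, so a
      // reader holding the same lock can never find it in neither set.
      void startDrop(String segmentId)
      {
        synchronized (lock) {
          segmentsMarkedToDrop.remove(segmentId);
          segmentsToDrop.add(segmentId);
        }
      }

      // Duty thread: takes a consistent view of both sets under the same lock.
      Set<String> getSegmentsBeingDropped()
      {
        synchronized (lock) {
          final Set<String> dropping = new HashSet<>(segmentsToDrop);
          dropping.addAll(segmentsMarkedToDrop);
          return dropping;
        }
      }
    }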

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@jtuglu1 (Contributor) commented Dec 8, 2025

Hi @kfaraz, thanks for the patch. While I think this does potentially fix some other issues (and is just generally good to have), I don't think this fully fixes the aforementioned issue:

  • The server inventory view is updated to show loaded=2 (the load on B is now showing in the inventory).
  • We take a copy of the server inventory view in prepareServers(), which then increments loaded here.
  • Then the load-on-B and drop-on-A callbacks complete.
  • We take a copy of the queuedSegments of every HttpLoadQueuePeon in the prepareCluster() routine. Since the callbacks have already completed, this copy doesn't show any dropping or loading (it is effectively empty for that segment), but the older, stale copy of the server inventory view still shows loaded=2.

While this is still an improvement, I don't think it fixes the core issue: the server inventory and the HttpLoadQueuePeons can be snapshotted non-atomically, which lets the SegmentReplicaCount we build from them end up inconsistent. We need a way to ensure that the ServerHolder and the HttpLoadQueuePeon can be snapshotted atomically, without any callback interleaving in between.
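
One way to get this kind of atomicity, sketched with hypothetical names (this is not Druid's actual structure, just an illustration of the idea): the duty thread copies the inventory view and the peon queues inside a single critical section that the load/drop callbacks also have to enter, so no callback can land between the two copies.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    class ClusterStateSketch
    {
      static class Snapshot
      {
        final Map<String, Set<String>> loadedSegments;
        final Map<String, Set<String>> queuedSegments;

        Snapshot(Map<String, Set<String>> loadedSegments, Map<String, Set<String>> queuedSegments)
        {
          this.loadedSegments = loadedSegments;
          this.queuedSegments = queuedSegments;
        }
      }

      private final Object stateLock = new Object();

      // serverName -> segments the inventory view says are fully loaded there
      private final Map<String, Set<String>> loadedSegments = new HashMap<>();
      // serverName -> segments with an in-flight load/drop in that server's peon
      private final Map<String, Set<String>> queuedSegments = new HashMap<>();

      // Callback thread: a completed load updates both views while holding the lock.
      void onLoadCompleted(String server, String segmentId)
      {
        synchronized (stateLock) {
          queuedSegments.computeIfAbsent(server, s -> new HashSet<>()).remove(segmentId);
          loadedSegments.computeIfAbsent(server, s -> new HashSet<>()).add(segmentId);
        }
      }

      // Duty thread: both copies come from the same critical section, so replica
      // counts derived from them are mutually consistent.
      Snapshot takeSnapshot()
      {
        synchronized (stateLock) {
          return new Snapshot(deepCopy(loadedSegments), deepCopy(queuedSegments));
        }
      }

      private static Map<String, Set<String>> deepCopy(Map<String, Set<String>> source)
      {
        final Map<String, Set<String>> copy = new HashMap<>();
        source.forEach((server, segments) -> copy.put(server, new HashSet<>(segments)));
        return copy;
      }
    }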

@kfaraz (Contributor, Author) commented Dec 8, 2025

Thanks for the clarification, @jtuglu1 !
It makes sense to have the inventory view be consistent with the peons.
I also encountered another possible race condition related to segmentsMarkedToDrop.
In ServerHolder.initializeQueuedSegments(), we call peon.getSegmentsInQueue() and peon.getSegmentsMarkedToDrop() in quick succession, but the state of the peon may change between these two calls.

I have tried to address this race condition and the reconciliation with the server inventory snapshot in the latest commit.
Let me know if it makes sense and addresses your issue.
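
For the segmentsMarkedToDrop race specifically, one possible shape of a fix is to return both pieces of peon state from a single synchronized call instead of two back-to-back getters; the names and signatures below are illustrative, not the actual Druid API or the code in this PR.

    import java.util.HashSet;
    import java.util.Set;
    import javax.annotation.concurrent.GuardedBy;

    class PeonQueueSnapshotSketch
    {
      static class QueueSnapshot
      {
        final Set<String> segmentsInQueue;
        final Set<String> segmentsMarkedToDrop;

        QueueSnapshot(Set<String> segmentsInQueue, Set<String> segmentsMarkedToDrop)
        {
          this.segmentsInQueue = segmentsInQueue;
          this.segmentsMarkedToDrop = segmentsMarkedToDrop;
        }
      }

      private final Object lock = new Object();

      @GuardedBy("lock")
      private final Set<String> segmentsInQueue = new HashSet<>();

      @GuardedBy("lock")
      private final Set<String> segmentsMarkedToDrop = new HashSet<>();

      // Returns both views from one synchronized block, so no callback can change
      // the peon state between reading the queue and reading the marked-to-drop set.
      QueueSnapshot getQueueSnapshot()
      {
        synchronized (lock) {
          return new QueueSnapshot(new HashSet<>(segmentsInQueue), new HashSet<>(segmentsMarkedToDrop));
        }
      }
    }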

@jtuglu1 self-requested a review on December 8, 2025 at 16:41
@kfaraz (Contributor, Author) commented Dec 9, 2025

Currently squashing some more potential race conditions in the flow. Will update this PR soon.

@jtuglu1 (Contributor) commented Dec 9, 2025

Thanks. I'm doing a bit of simulation testing to see if I can squeeze some more out. I've added the other potential one I uncovered here:

T0[Coordinator]: Start DutyGroup[HistoricalManagementDuties]
T1[Coordinator]: Start PrepareBalancerAndLoadQueues::run()
T2[Coordinator]: Move initiated of segment[S] from Server[A] to Server[B] (no callbacks yet)
T3[Coordinator]: End PrepareBalancerAndLoadQueues::run()
T4[Coordinator]: End DutyGroup[HistoricalManagementDuties]
T5[Coordinator]: Start DutyGroup[HistoricalManagementDuties]
T6[Coordinator]: Start PrepareBalancerAndLoadQueues::run()
T7[Coordinator]: enter prepareCurrentServers()
T8[Coordinator]: exit prepareCurrentServers()
T9[Coordinator]: enter prepareCluster()
T10[Coordinator-callback]: Server[B] completed request[MOVE_TO] on segment[S] with status[SUCCESS]
T11[Coordinator-callback]: Dropping segment [S] from server[A]
T12[A]: Completely removing segment[S] in [30,000]ms.
T13[Coordinator-callback]: Server[A] completed request[DROP] on segment[S] with status[SUCCESS].
T14[Coordinator]: exit prepareCluster()
// at this point, I think the SegmentReplicaCount somehow ends up with loaded=0, which causes the extra load below
T15[Coordinator-callback]: Server[C] completed request[LOAD] on segment[S] with status[SUCCESS]
T16[Coordinator]: End DutyGroup[HistoricalManagementDuties]

This causes a second load of the segment.
