Fix race condition in Manager.partitionMigrations #5531

dlmarion · 2025-05-02T16:55:21Z

PR #5416 modified the Manager to move the migrations from an in-memory data structure in the Manager to new columns in the root and metadata tables. The Manager.partitionMigrations method was changed to scan the migration column of the root and metadata tables to gather the current migrations. However, if the root and metadata tables are not hosted, then this scan will hang until the tablet locations can be resolved.

ScanServerUpgrade11to12TestIT.testScanRefTableCreation has been failing since #5416 was merged. The test deletes the scanref table, shuts down the TabletServers and Manager, then restarts them. This leaves the root tablet in a state where it needs to perform recovery. The Manager.partitionMigrations method is called via the StatusThread, and it appears that the Manager starts up and the StatusThread gets hung trying to scan the root tablet (it's waiting for a location). Meanwhile, the root tablet can't be assigned a location because the Manager.tserverStatus map is not populated, which is done from the StatusThread as well.

Also modified the IT to set the filesystem to the RawLocalFileSystem so that warnings about missing checksum files were not in the logs.
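The circular wait described above (the status thread blocks on a scan of the root tablet, while the root tablet cannot be assigned until that same thread populates the tserver status map) can be illustrated with a toy model. This is only a sketch and not Accumulo code: the class name, the latches, and the `scanBeforeStatusUpdate` flag are all hypothetical stand-ins for the real components.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Toy model of the circular wait; every name here is hypothetical.
class StatusThreadDeadlockSketch {

  /**
   * Runs one pass of a simplified status loop and reports whether it
   * finished within the timeout (true) or was still blocked (false).
   */
  static boolean runStatusPass(boolean scanBeforeStatusUpdate, long timeoutMs) {
    // Released once the simulated root tablet has a location.
    CountDownLatch rootHosted = new CountDownLatch(1);
    // Released once the simulated tserver status map is populated.
    CountDownLatch statusPopulated = new CountDownLatch(1);
    CountDownLatch passDone = new CountDownLatch(1);

    // Stand-in for assignment: it can host the root tablet only after
    // it knows which tablet servers are alive.
    Thread assigner = new Thread(() -> {
      try {
        statusPopulated.await();
        rootHosted.countDown();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    // Stand-in for the status thread.
    Thread status = new Thread(() -> {
      try {
        if (scanBeforeStatusUpdate) {
          rootHosted.await();          // "scan" blocks until the root tablet is hosted
          statusPopulated.countDown(); // never reached, so the assigner never proceeds
        } else {
          statusPopulated.countDown(); // publish status first
          rootHosted.await();          // the "scan" can now complete
        }
        passDone.countDown();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    assigner.setDaemon(true);
    status.setDaemon(true);
    assigner.start();
    status.start();

    boolean finished = false;
    try {
      finished = passDone.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    // Release any thread that is still blocked.
    assigner.interrupt();
    status.interrupt();
    return finished;
  }
}
```

In this model `runStatusPass(true, 200)` times out, while `runStatusPass(false, 2000)` completes, because the ordering inside a single thread decides whether the two waits form a cycle.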


keith-turner · 2025-05-02T18:38:07Z

"Meanwhile, the root tablet can't be assigned a location because the Manager.tserverStatus map is not populated, which is done from the StatusThread as well."

I have seen problems with the way this code works before. It would probably be best to move balancing out of the status thread and give it its own thread. We could have a balancing thread for each data level. With this setup, each balancing thread would only read the migrations for its level, which would avoid trying to read all data levels at once. It would also be good to minimize the dependencies between balancing and TGW as much as possible.

Because of fundamental problems in the existing code, the fix in this PR may avoid the problem in some situations, but the status thread could still get stuck if things change after it's done its checks.
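The per-level balancing idea suggested above could look roughly like the following. This is a minimal sketch under stated assumptions, not the design that was eventually implemented: `Level` is a hypothetical stand-in for the data level enum, and `readMigrations` is a hypothetical function representing whatever scan returns the migrations for one level.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch of one balancing thread per data level; all names are hypothetical.
class PerLevelBalancerSketch {

  // Hypothetical stand-in for the three data levels.
  enum Level { ROOT, METADATA, USER }

  /**
   * Starts one daemon thread per level. Each thread reads only its own
   * level's migrations and stores them in the supplied (thread-safe) map.
   */
  static Map<Level, Thread> startBalancers(Function<Level, List<String>> readMigrations,
      Map<Level, List<String>> results) {
    Map<Level, Thread> threads = new EnumMap<>(Level.class);
    for (Level level : Level.values()) {
      Thread t = new Thread(() -> {
        // This call may block if the level's metadata is not hosted yet.
        // Only this level's thread waits; the other levels keep going.
        results.put(level, readMigrations.apply(level));
      }, "balancer-" + level);
      t.setDaemon(true);
      t.start();
      threads.put(level, t);
    }
    return threads;
  }

  /** Waits up to timeoutMs for the thread to exit; returns true if it has exited. */
  static boolean awaitDone(Thread t, long timeoutMs) {
    try {
      t.join(timeoutMs);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return !t.isAlive();
  }
}
```

The point of the structure is isolation: a read that hangs for one level leaves that level's thread parked without stalling status gathering or the other levels.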

keith-turner · 2025-05-02T18:48:06Z

Saw #5533 recently and it's related to the overall problems w/ this general code.

kevinrr888 · 2025-05-02T18:42:54Z

server/manager/src/main/java/org/apache/accumulo/manager/Manager.java

+    if (watchers.size() != 3) {
+      log.debug("Skipping migration check, not all TabletGroupWatchers are started");
+      skipMigrationCheck = true;
+    } else {
+      for (TabletGroupWatcher watcher : watchers) {
+        if (!watcher.isAlive()) {
+          log.debug("Skipping migration check, not all TabletGroupWatchers are started");
+          skipMigrationCheck = true;
+          break;
+        }


I had a problem with one of the ITs in this case where the TGWs wouldn't be started and it would hang on the scan. It hung every time then suddenly after a few changes it stopped hanging, so I assumed it was fixed. Looking back, I should have still been checking that the TGWs were started...

This is a good catch and looks good to me
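The guard in the hunk above boils down to "the expected number of watchers exist and every one of them is alive". A compact way to express that condition is sketched below. The helper name `allWatchersRunning` is hypothetical, and plain `Thread` objects stand in for `TabletGroupWatcher`, on which the hunk calls `isAlive()`.

```java
import java.util.Collection;

// Sketch of the "are all watchers started" check; names are hypothetical.
class WatcherGuardSketch {

  /**
   * Returns true only when exactly expectedCount watchers are present
   * and each of them is currently alive.
   */
  static boolean allWatchersRunning(Collection<? extends Thread> watchers, int expectedCount) {
    if (watchers.size() != expectedCount) {
      return false;
    }
    for (Thread watcher : watchers) {
      if (!watcher.isAlive()) {
        return false;
      }
    }
    return true;
  }
}
```

A caller would then write something like `boolean skipMigrationCheck = !allWatchersRunning(watchers, 3);`. This is still a check made at one instant, so it can only narrow the window in which a later scan blocks.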

kevinrr888 · 2025-05-02T18:50:55Z

server/manager/src/main/java/org/apache/accumulo/manager/Manager.java

+    // Don't try to check migrations if the Root and Metadata
+    // tables are not hosted.
+    boolean skipMigrationCheck = false;


I'm thinking we want to do this check everywhere we scan for migrations. This is done in several places in Manager.

Would be better to structure the code such that it is ok if the thread gets stuck trying to read migrations.
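One possible way to make a stuck read harmless, offered only as an illustration and not as what the reviewers proposed, is to run the scan on a separate daemon thread and give the caller a bounded wait. In this sketch `tryRead` and the `scan` callable are hypothetical names; only standard `java.util.concurrent` classes are used.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of a time-bounded migration read; all names are hypothetical.
class BoundedMigrationReadSketch {

  // Daemon threads so that a scan that never returns cannot keep the JVM alive.
  private static final ExecutorService READERS = Executors.newCachedThreadPool(r -> {
    Thread t = new Thread(r, "migration-reader");
    t.setDaemon(true);
    return t;
  });

  /**
   * Runs the scan on a reader thread and waits at most timeoutMs for it.
   * Returns the migrations if the scan finished in time, otherwise empty.
   */
  static Optional<List<String>> tryRead(Callable<List<String>> scan, long timeoutMs) {
    Future<List<String>> future = READERS.submit(scan);
    try {
      return Optional.of(future.get(timeoutMs, TimeUnit.MILLISECONDS));
    } catch (TimeoutException e) {
      future.cancel(true); // interrupt the reader; the caller moves on
      return Optional.empty();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      future.cancel(true);
      return Optional.empty();
    } catch (ExecutionException e) {
      return Optional.empty(); // the scan itself failed
    }
  }
}
```

With this shape the calling thread treats an empty result as "migrations unknown for now" and continues with its other work, retrying on a later pass.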

kevinrr888 · 2025-05-02T18:59:47Z

server/manager/src/main/java/org/apache/accumulo/manager/Manager.java

+        if (level == DataLevel.ROOT || level == DataLevel.METADATA) {
+          final TableId tid = level == DataLevel.ROOT ? SystemTables.ROOT.tableId()
+              : SystemTables.METADATA.tableId();


This use of DataLevel doesn't seem correct. See:

accumulo/core/src/main/java/org/apache/accumulo/core/metadata/schema/Ample.java

Lines 82 to 85 in f676217

public enum DataLevel {
  ROOT(null, null),
  METADATA(SystemTables.ROOT.tableName(), SystemTables.ROOT.tableId()),
  USER(SystemTables.METADATA.tableName(), SystemTables.METADATA.tableId());
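The review comment does not spell out the concern, but one distinction the quoted enum makes visible is that the values attached to each level name the table that stores that level's metadata, not a table that belongs to the level. The sketch below mirrors the quoted constructor arguments with placeholder strings. The `levelOf` lookup is an assumption added for illustration, and `ROOT_TABLE` and `METADATA_TABLE` are placeholders, not real Accumulo identifiers.

```java
// Sketch mirroring the quoted enum; identifiers are placeholders.
class DataLevelSketch {

  static final String ROOT_TABLE = "root-table";         // placeholder name
  static final String METADATA_TABLE = "metadata-table"; // placeholder name

  enum Level {
    ROOT(null),               // no table stores the root level's metadata
    METADATA(ROOT_TABLE),     // metadata-level tablets are tracked in the root table
    USER(METADATA_TABLE);     // user-level tablets are tracked in the metadata table

    private final String metaTable;

    Level(String metaTable) {
      this.metaTable = metaTable;
    }

    /** The table that stores metadata for tablets at this level, or null. */
    String metaTable() {
      return metaTable;
    }
  }

  /** Assumed lookup: which level do the tablets of the given table belong to? */
  static Level levelOf(String tableName) {
    if (ROOT_TABLE.equals(tableName)) {
      return Level.ROOT;
    } else if (METADATA_TABLE.equals(tableName)) {
      return Level.METADATA;
    }
    return Level.USER;
  }
}
```

Under these assumptions, `levelOf(METADATA_TABLE)` is `METADATA` while `Level.METADATA.metaTable()` is the root table, so "the table at a level" and "the table that holds a level's metadata" are different questions and need to be kept apart when choosing which table to scan.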

dlmarion · 2025-05-06T14:45:56Z

Closing this in favor of a different solution. @keith-turner created #5533 and suggested above a different change in the Manager where balancing is done in its own thread.

keith-turner · 2025-05-06T17:21:39Z

Opened #5537 as a first step in cleaning up some of the code and thread dependencies in the balancing code.

dlmarion added this to the 4.0.0 milestone May 2, 2025

dlmarion requested review from keith-turner and kevinrr888 May 2, 2025 16:55

dlmarion self-assigned this May 2, 2025

keith-turner mentioned this pull request May 2, 2025

Dependency between balancing and tablet group watcher can leave system in unworkable state. #5533

Closed

kevinrr888 reviewed May 2, 2025


dlmarion closed this May 6, 2025

dlmarion mentioned this pull request May 12, 2025

Shared mini test suite #5536

Merged

ctubbsii removed this from the 4.0.0 milestone May 14, 2025

dlmarion mentioned this pull request May 21, 2025

Broken or Flaky test: ScanServerUpgrade11to12TestIT.testScanRefTableCreation #5566

Closed

ctubbsii mentioned this pull request May 22, 2025

Create an IT for the race condition identified in #5531 #5577

Closed
