Skip to content

Fix race condition in Manager.partitionMigrations #5531

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

dlmarion
Copy link
Contributor

@dlmarion dlmarion commented May 2, 2025

PR #5416 modified the Manager to move the migrations from an in-memory data structure
in the Manager to new columns in the root and metadata tables. The Manager.partitionMigrations method was changed to scan the root and metadata tables migration column to gather the current migrations. However, if the root and metadata tables are not hosted, then this scan will hang until the tablet locations can be resolved.

ScanServerUpgrade11to12TestIT.testScanRefTableCreation has been failing since #5416 was merged. The test deletes the scanref table, shuts down the TabletServers and Manager, then restarts them. This leaves the root tablet in a state where it needs to perform recovery. The Manager.partitionMigrations method is called via the StatusThread, and it appears that the Manager starts up and the StatusThread gets hung trying to scan the root tablet (it's waiting for a location). Meanwhile, the root tablet can't be assigned a location because the Manager.tserverStatus map is not populated, which is done from the StatusThread as well.

Also modified the IT to set the filesystem to the RawLocalFileSystem so that warnings about missing checksum files were not in the logs.

in the Manager to new columns in the root and metadata tables.
The Manager.partitionMigrations method was changed to scan the
root and metadata tables migration column to gather the current
migrations. However, if the root and metadata tables are not
hosted, then this scan will hang until the tablet locations
can be resolved.

ScanServerUpgrade11to12TestIT.testScanRefTableCreation has been
failing since apache#5416 was merged. The test deletes the scanref
table, shuts down the TabletServers and Manager, then restarts
them. This leaves the root tablet in a state where it needs to
perform recovery. The Manager.partitionMigrations method is
called via the StatusThread, and it appears that the Manager
starts up and the StatusThread gets hung trying to scan the
root tablet (it's waiting for a location). Meanwhile, the root
tablet can't be assigned a location because the Manager.tserverStatus
map is not populated, which is done from the StatusThread as well.

Also modified the IT to set the filesystem to the RawLocalFileSystem
so that warnings about missing checksum files were not in the logs.
@dlmarion dlmarion added this to the 4.0.0 milestone May 2, 2025
@dlmarion dlmarion self-assigned this May 2, 2025
@keith-turner
Copy link
Contributor

keith-turner commented May 2, 2025

Meanwhile, the root tablet can't be assigned a location because the Manager.tserverStatus map is not populated, which is done from the StatusThread as well.

Seen problems with the way this code works before. Would probably be best to move balancing out of the status thread and give its own thread. Could have a balancing thread for each data level, with this setup each balancing thread would only read the migrations for its level which would avoid trying to read all data levels at once. Also would be good to minimize the dependencies between balancing and TGW as much as possible.

Because of fundamental problems in the existing code the fix in this PR may avoid the problem in some situations, but the status thread could still get stuck if things changes after its done its checks.

@keith-turner
Copy link
Contributor

Saw #5533 recently and its related to the overall problems w/ this general code.

Comment on lines +646 to +655
if (watchers.size() != 3) {
log.debug("Skipping migration check, not all TabletGroupWatchers are started");
skipMigrationCheck = true;
} else {
for (TabletGroupWatcher watcher : watchers) {
if (!watcher.isAlive()) {
log.debug("Skipping migration check, not all TabletGroupWatchers are started");
skipMigrationCheck = true;
break;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a problem with one of the ITs in this case where the TGWs wouldn't be started and it would hang on the scan. It hung every time then suddenly after a few changes it stopped hanging, so I assumed it was fixed. Looking back, I should have still been checking that the TGWs were started...

This is a good catch and looks good to me

Comment on lines +642 to +644
// Don't try to check migrations if the Root and Metadata
// tables are not hosted.
boolean skipMigrationCheck = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking we want to do this check everywhere we scan for migrations. This is done in several places in Manager.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be better to structure the code such that it ok if the thread gets stuck trying to read migrations.

Comment on lines +657 to +659
if (level == DataLevel.ROOT || level == DataLevel.METADATA) {
final TableId tid = level == DataLevel.ROOT ? SystemTables.ROOT.tableId()
: SystemTables.METADATA.tableId();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This use of DataLevel doesn't seem correct. See:

public enum DataLevel {
ROOT(null, null),
METADATA(SystemTables.ROOT.tableName(), SystemTables.ROOT.tableId()),
USER(SystemTables.METADATA.tableName(), SystemTables.METADATA.tableId());

@dlmarion
Copy link
Contributor Author

dlmarion commented May 6, 2025

Closing this in favor of a different solution. @keith-turner created #5533 and suggested above a different change in the Manager where balancing is done in its own thread.

@dlmarion dlmarion closed this May 6, 2025
@keith-turner
Copy link
Contributor

Opened #5537 as a first step in cleaning up some of the code and thread dependencies in the balancing code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants