Fix race condition in Manager.partitionMigrations #5531

Closed · wants to merge 1 commit

@@ -638,6 +638,43 @@ public void run() {
*/
private Map<DataLevel,Set<KeyExtent>> partitionMigrations() {
final Map<DataLevel,Set<KeyExtent>> partitionedMigrations = new EnumMap<>(DataLevel.class);

// Don't try to check migrations if the Root and Metadata
// tables are not hosted.
boolean skipMigrationCheck = false;
Comment on lines +642 to +644
Member:
I'm thinking we want to do this check everywhere we scan for migrations. This is done in several places in Manager.

Contributor:
It would be better to structure the code such that it is okay if the thread gets stuck trying to read migrations.
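One way to realize the contributor's suggestion is to bound the wait on the migration read rather than guard it up front. The sketch below is an assumption about the shape of such a change, not Accumulo's actual API: `scanMigrations` is a hypothetical stand-in for the real metadata scan, and the timeout value is arbitrary.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedMigrationScan {

  // Hypothetical stand-in for the real migration scan, which can block
  // indefinitely while the system tables are unhosted.
  static Set<String> scanMigrations() throws Exception {
    return Set.of("extent-1", "extent-2");
  }

  // Run the scan on a worker thread and bound the wait: if the scan is
  // stuck, return an empty set instead of blocking the caller forever.
  static Set<String> scanWithTimeout(long timeoutMs) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      Callable<Set<String>> task = TimedMigrationScan::scanMigrations;
      Future<Set<String>> result = pool.submit(task);
      return result.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException | InterruptedException | ExecutionException e) {
      return Collections.emptySet();
    } finally {
      pool.shutdownNow(); // interrupt a stuck scan rather than leak the thread
    }
  }
}
```

With this structure the caller degrades to "no migrations seen this round" when the read hangs, which is safe because the check runs periodically.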


if (watchers.size() != 3) {
log.debug("Skipping migration check, not all TabletGroupWatchers are started");
skipMigrationCheck = true;
} else {
for (TabletGroupWatcher watcher : watchers) {
if (!watcher.isAlive()) {
log.debug("Skipping migration check, not all TabletGroupWatchers are started");
skipMigrationCheck = true;
break;
}
Comment on lines +646 to +655
Member:
I had a problem with one of the ITs in this case where the TGWs wouldn't be started and it would hang on the scan. It hung every time, then suddenly after a few changes it stopped hanging, so I assumed it was fixed. Looking back, I should have still been checking that the TGWs were started...

This is a good catch and looks good to me

final DataLevel level = watcher.getLevel();
if (level == DataLevel.ROOT || level == DataLevel.METADATA) {
final TableId tid = level == DataLevel.ROOT ? SystemTables.ROOT.tableId()
: SystemTables.METADATA.tableId();
Comment on lines +657 to +659
Member:
This use of DataLevel doesn't seem correct. See:

public enum DataLevel {
ROOT(null, null),
METADATA(SystemTables.ROOT.tableName(), SystemTables.ROOT.tableId()),
USER(SystemTables.METADATA.tableName(), SystemTables.METADATA.tableId());
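The reviewer's point can be made concrete with a minimal stand-in (these are not Accumulo's real classes, and the switch mapping is an illustration of the intended lookup, not the PR's final fix): in `DataLevel`, each constant's constructor arguments name the table that *stores* that level's tablet metadata, not the table the level itself represents, so mapping a watcher's level to "its" table id needs a different association.

```java
// Minimal stand-in illustrating the reviewer's point about DataLevel.
public class DataLevelSketch {

  enum DataLevel {
    ROOT(null),     // root tablet metadata lives in ZooKeeper, not a table
    METADATA("+r"), // metadata tablets are described in the root table
    USER("!0");     // user tablets are described in the metadata table

    // Id of the table holding THIS level's metadata (may be null for ROOT).
    final String metaTableId;

    DataLevel(String metaTableId) {
      this.metaTableId = metaTableId;
    }
  }

  // The table a watcher at this level actually manages is the level's own
  // table, which is a different mapping than the enum's constructor fields.
  static String watchedTableId(DataLevel level) {
    switch (level) {
      case ROOT: return "+r";     // accumulo.root table id
      case METADATA: return "!0"; // accumulo.metadata table id
      default: return null;       // the user level spans many tables
    }
  }
}
```

Note that `DataLevel.METADATA.metaTableId` is the root table's id, while the table a METADATA-level watcher manages is the metadata table itself; conflating the two is the bug the reviewer is flagging.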

final TableCounts count = watcher.getStats(tid);
if (count.hosted() == 0 || count.assigned() > 0 || count.assignedToDeadServers() > 0
|| count.suspended() > 0 || count.unassigned() > 0) {
log.debug("Table {} has tablets that are not hosted, {}", tid, count);
log.debug("Skipping migration check due to unhosted system tables");
skipMigrationCheck = true;
break;
}
}
}
}
if (skipMigrationCheck) {
partitionedMigrations.put(DataLevel.ROOT, new HashSet<>());
partitionedMigrations.put(DataLevel.METADATA, new HashSet<>());
partitionedMigrations.put(DataLevel.USER, new HashSet<>());
return partitionedMigrations;
}

for (DataLevel dl : DataLevel.values()) {
Set<KeyExtent> extents = new HashSet<>();
// prev row needed for the extent
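The overall guard the patch adds can be sketched in isolation. The types below (`Watcher`, the tablet counts, the migration placeholder) are simplified stand-ins for the Manager's real classes, assumed for illustration only:

```java
import java.util.EnumMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MigrationGuardSketch {

  enum DataLevel { ROOT, METADATA, USER }

  // Simplified stand-in for a TabletGroupWatcher and its table stats.
  record Watcher(DataLevel level, boolean alive, int hosted, int unassigned) {}

  // Mirrors the guard in the patch: only scan for migrations once every
  // watcher is running and the system tables are fully hosted; otherwise
  // return empty sets so the caller never blocks on a metadata scan.
  static Map<DataLevel,Set<String>> partitionMigrations(List<Watcher> watchers) {
    Map<DataLevel,Set<String>> partitioned = new EnumMap<>(DataLevel.class);
    for (DataLevel dl : DataLevel.values()) {
      partitioned.put(dl, new HashSet<>());
    }

    boolean skip = watchers.size() != 3; // one watcher per DataLevel
    for (Watcher w : watchers) {
      if (!w.alive()) {
        skip = true;
      }
      if ((w.level() == DataLevel.ROOT || w.level() == DataLevel.METADATA)
          && (w.hosted() == 0 || w.unassigned() > 0)) {
        skip = true; // system tablets are not fully hosted yet
      }
    }
    if (skip) {
      return partitioned; // every level maps to an empty set
    }

    // Placeholder for the real metadata scan that fills in migrations.
    partitioned.get(DataLevel.USER).add("migration-1");
    return partitioned;
  }
}
```

Returning empty sets for every level (rather than blocking or throwing) keeps the periodic balancing loop alive until the system tables come up.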
@@ -56,6 +56,7 @@
import org.apache.accumulo.minicluster.ServerType;
import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
import org.apache.accumulo.server.ServerContext;
import org.apache.hadoop.fs.RawLocalFileSystem;
import org.apache.hadoop.io.Text;
import org.apache.zookeeper.KeeperException;
import org.junit.jupiter.api.AfterAll;
@@ -79,6 +80,7 @@ private static class ScanServerUpgradeITConfiguration
@Override
public void configureMiniCluster(MiniAccumuloConfigImpl cfg,
org.apache.hadoop.conf.Configuration coreSite) {
coreSite.set("fs.file.impl", RawLocalFileSystem.class.getName());
cfg.getClusterServerConfiguration().setNumDefaultScanServers(0);
}
}