FileSystemContext.reinit may cause a deadlock #16210

@wubiaoi

Description

Alluxio Version:
2.8.1

Describe the bug
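blockReinit() can be invoked from other methods, e.g. getCachedWorkers(). getCachedWorkers() holds the FileSystemContext lock while it calls blockReinit(), and reinit() also needs the FileSystemContext lock (in closeContext()). If reinit() is already running, blockReinit() waits for it to finish while reinit() waits for the FileSystemContext lock, so neither thread can proceed.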

  public void reinit(boolean updateClusterConf, boolean updatePathConf)
      throws UnavailableException, IOException {
    try (Closeable r = mReinitializer.allow()) {
      InetSocketAddress masterAddr;
      try {
        masterAddr = getMasterAddress();
      } catch (IOException e) {
        throw new UnavailableException("Failed to get master address during reinitialization", e);
      }
      try {
        getClientContext().loadConf(masterAddr, updateClusterConf, updatePathConf);
      } catch (AlluxioStatusException e) {
        // Failed to load configuration from meta master, maybe master is being restarted,
        // or there is a temporary network problem, give up reinitialization. The heartbeat thread
        // will try to reinitialize in the next heartbeat.
        throw new UnavailableException(String.format("Failed to load configuration from "
            + "meta master (%s) during reinitialization", masterAddr), e);
      }
      LOG.debug("Reinitializing FileSystemContext: update cluster conf: {}, update path conf:"
          + " {}", updateClusterConf, updatePathConf);
      closeContext();
      ReconfigurableRegistry.update();
      initContext(getClientContext(), MasterInquireClient.Factory.create(getClusterConf(),
          getClientContext().getUserState()));
      LOG.debug("FileSystemContext re-initialized");
      mReinitializer.onSuccess();
    }
  }

jstack:

"task-execution-service-5" #1131 daemon prio=5 os_prio=0 tid=0x00007f9938008800 nid=0x153bc waiting on condition [0x00007fa21fdfe000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000006c011b7c0> (a alluxio.concurrent.CountingLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at alluxio.concurrent.CountingLatch.inc(CountingLatch.java:108)
        at alluxio.client.file.FileSystemContextReinitializer$ReinitBlockerResource.<init>(FileSystemContextReinitializer.java:104)
        at alluxio.client.file.FileSystemContextReinitializer.block(FileSystemContextReinitializer.java:155)
        at alluxio.client.file.FileSystemContext.blockReinit(FileSystemContext.java:350)
        at alluxio.client.file.FileSystemContext.acquireBlockMasterClientResource(FileSystemContext.java:477)
        at alluxio.client.file.FileSystemContext.getAllWorkers(FileSystemContext.java:650)
        at alluxio.client.file.FileSystemContext.getCachedWorkers(FileSystemContext.java:636)
        - locked <0x00000006c03b8680> (a alluxio.client.file.FileSystemContext)
        at alluxio.job.util.JobUtils.loadBlock(JobUtils.java:128)
        at alluxio.job.plan.load.LoadDefinition.runTask(LoadDefinition.java:189)
        at alluxio.job.plan.load.LoadDefinition.runTask(LoadDefinition.java:54)
        at alluxio.job.plan.batch.BatchedJobDefinition.runTask(BatchedJobDefinition.java:81)
        at alluxio.job.plan.batch.BatchedJobDefinition.runTask(BatchedJobDefinition.java:42)
        at alluxio.worker.job.task.TaskExecutor.run(TaskExecutor.java:88)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

"config-hash-master-heartbeat-0" #292 daemon prio=5 os_prio=0 tid=0x00007fa46d7ae800 nid=0x145af waiting for monitor entry [0x00007f9e3aeee000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at alluxio.client.file.FileSystemContext.closeContext(FileSystemContext.java:298)
        - waiting to lock <0x00000006c03b8680> (a alluxio.client.file.FileSystemContext)
        at alluxio.client.file.FileSystemContext.reinit(FileSystemContext.java:393)
        at alluxio.client.file.ConfigHashSync.heartbeat(ConfigHashSync.java:94)
        - locked <0x00000006c03cf088> (a alluxio.client.file.ConfigHashSync)
        at alluxio.client.file.FileSystemContextReinitializer.lambda$new$0(FileSystemContextReinitializer.java:69)
        at alluxio.client.file.FileSystemContextReinitializer$$Lambda$98/1029472813.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
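
The wait cycle in the two stacks above can be sketched in isolation. This is only an illustration, not Alluxio code: `contextLock` is a hypothetical `ReentrantLock` standing in for the `FileSystemContext` monitor, and `gate` is a hypothetical `ReentrantReadWriteLock` standing in for `CountingLatch` (write side = `reinit` in progress, read side = `blockReinit`). Timed `tryLock` calls are used so the sketch terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReinitDeadlockSketch {

  /** Starts a thread that holds {@code first} and then tries {@code second} with a timeout. */
  private static Thread start(Lock first, Lock second, CountDownLatch bothHoldFirst,
      CountDownLatch bothTried, AtomicBoolean timedOut, String name) {
    Thread t = new Thread(() -> {
      first.lock();
      try {
        // Wait until the other thread also holds its first lock.
        bothHoldFirst.countDown();
        bothHoldFirst.await();
        boolean acquired = second.tryLock(200, TimeUnit.MILLISECONDS);
        timedOut.set(!acquired);
        if (acquired) {
          second.unlock();
        }
        // Keep holding the first lock until the other thread's attempt has finished too.
        bothTried.countDown();
        bothTried.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        first.unlock();
      }
    }, name);
    t.start();
    return t;
  }

  /** Returns true if neither thread could take its second lock, i.e. the wait cycle exists. */
  public static boolean cycleObserved() {
    // Hypothetical stand-ins, not the real Alluxio objects.
    Lock contextLock = new ReentrantLock();                 // FileSystemContext monitor
    ReentrantReadWriteLock gate = new ReentrantReadWriteLock(); // CountingLatch
    CountDownLatch bothHoldFirst = new CountDownLatch(2);
    CountDownLatch bothTried = new CountDownLatch(2);
    AtomicBoolean taskTimedOut = new AtomicBoolean();
    AtomicBoolean heartbeatTimedOut = new AtomicBoolean();

    // Like getCachedWorkers(): context lock first, then blockReinit().
    Thread task = start(contextLock, gate.readLock(), bothHoldFirst, bothTried,
        taskTimedOut, "task");
    // Like reinit(): reinit gate first, then the context lock in closeContext().
    Thread heartbeat = start(gate.writeLock(), contextLock, bothHoldFirst, bothTried,
        heartbeatTimedOut, "heartbeat");
    try {
      task.join();
      heartbeat.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return taskTimedOut.get() && heartbeatTimedOut.get();
  }

  public static void main(String[] args) {
    System.out.println("wait cycle observed: " + cycleObserved());
  }
}
```

In this sketch the cycle goes away if both threads take the two locks in the same order, for example if the task thread passes the gate before taking the context lock. Whether such a reordering is practical inside FileSystemContext is not something this report establishes.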

Labels: stale (no recent activity; will be closed automatically), type-bug
