AssertionError in AppendEntriesRequestProcessor  #1091

Open
@Cyrill

I managed to hit an AssertionError in AppendEntriesRequestProcessor. Apparently, there is a race.
The crash was observed on a custom branch, though the code in master is the same.

First, the code:

AppendEntriesRequestProcessor.PeerExecutorSelector has the following code (unrelated lines intentionally removed):

public Executor select(final String reqClass, final Object reqHeader) {
    // ...

    final Node node = NodeManager.getInstance().get(groupId, peer);

    if (node == null || !node.getRaftOptions().isReplicatorPipeline()) {
        return executor();
    }

    // The node enable pipeline, we should ensure bolt support it.
    RpcFactoryHelper.rpcFactory().ensurePipeline();

    final PeerRequestContext ctx = getOrCreatePeerRequestContext(groupId, pairOf(peerId, serverId), null);

    return ctx.executor;
}

getOrCreatePeerRequestContext looks as follows:

PeerRequestContext getOrCreatePeerRequestContext(final String groupId, final PeerPair pair, final Connection conn) {
    ConcurrentMap<PeerPair, PeerRequestContext> groupContexts = this.peerRequestContexts.get(groupId);
    // ....

    PeerRequestContext peerCtx = groupContexts.get(pair);
    if (peerCtx == null) {
        synchronized (Utils.withLockObject(groupContexts)) {
            peerCtx = groupContexts.get(pair);
            // double check in lock
            if (peerCtx == null) {
                // only one thread to process append entries for every jraft node
                final PeerId peer = new PeerId();
                final boolean parsed = peer.parse(pair.local);
                assert (parsed);
                final Node node = NodeManager.getInstance().get(groupId, peer);
                assert (node != null); // <<<<<<<<<<<<<<AssertionError here!
                peerCtx = new PeerRequestContext(groupId, pair, node.getRaftOptions()
                    .getMaxReplicatorInflightMsgs());
                groupContexts.put(pair, peerCtx);
            }
        }
    }
    // ...

    return peerCtx;
}

Execution flow

I don't have specific code to reproduce this issue, but the flow is simple. I observed a slight delay in messaging/threads, which ended in an error.

My assumptions regarding the execution flow are:

  • select is called. NodeManager.getInstance().get(groupId, peer) returns a non-null result, so execution continues to getOrCreatePeerRequestContext.
  • Another thread stops the app, and NodeManager.getInstance().remove() is called for this node.
  • Inside getOrCreatePeerRequestContext, final Node node = NodeManager.getInstance().get(groupId, peer); returns null, since the node was removed moments ago.
  • Execution crashes on the following line: assert (node != null);
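
The interleaving described above can be replayed in a single thread. The following is only a minimal sketch, not the jraft code: REGISTRY is a hypothetical stand-in for NodeManager, the peer, node, and context values are plain strings, and an IllegalStateException replaces the assert so the outcome does not depend on the -ea flag. It also shows one possible way to avoid the second lookup, by reusing the reference that the caller already checked. Whether it is acceptable to build a context for a node that is shutting down is a separate question this sketch does not answer.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class NodeLookupRaceSketch {
    // Hypothetical stand-in for NodeManager: maps a peer id to a "node".
    static final ConcurrentMap<String, String> REGISTRY = new ConcurrentHashMap<>();

    /** Mirrors the reported pattern: looks the node up again after the caller's check. */
    static String createContextWithSecondLookup(String peer) {
        String node = REGISTRY.get(peer); // null if the node was removed in between
        if (node == null) {
            throw new IllegalStateException("node removed between lookups: " + peer);
        }
        return "ctx-for-" + node;
    }

    /** Possible alternative: reuse the reference the caller already validated. */
    static String createContextFromNode(String node) {
        return "ctx-for-" + node;
    }

    /** Replays the interleaving from the list above, step by step. */
    static String replay(boolean reuseReference) {
        REGISTRY.put("peer-1", "node-1");
        String node = REGISTRY.get("peer-1"); // step 1: select() sees a non-null node
        if (node == null) {
            return "default-executor";
        }
        REGISTRY.remove("peer-1");            // step 2: shutdown removes the node
        try {
            // steps 3-4: context creation, with or without a second lookup
            return reuseReference
                ? createContextFromNode(node)
                : createContextWithSecondLookup("peer-1");
        } catch (IllegalStateException e) {
            return "failed";
        }
    }

    public static void main(String[] args) {
        System.out.println(replay(false)); // prints "failed"
        System.out.println(replay(true));  // prints "ctx-for-node-1"
    }
}
```

Another option, under the same assumptions, would be to keep the second lookup but treat a null result as "node is gone" and fall back to the default executor instead of asserting.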
