When a scanner lease expires, the client endlessly retries the request against the same regionserver, leaving the RS too busy #198

Open
@xuming01

Issue description:
When an RS has a hot region, tsdb's scanner leases may expire. Once many scanners have expired, a large number of handlers on the regionserver side end up handling scan requests, each of which throws "UnknownScannerException" and logs a "missing scanner" warning like this:

2018-11-02 16:46:40,580 WARN  [RpcServer.default.RWQ.Fifo.scan.handler=380,queue=38,port=60020] regionserver.RSRpcServices: Client tried to access missing scanner 5816065332938628527

Furthermore, the client keeps resending the RPC with the same scanner_id to the RS indefinitely. The RS then becomes even busier handling these endless "missing scanner" requests.

These are the debug logs on the tsdb side (note that attempt stays at 0 from one retry to the next):

17:27:10.561 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.decode] - ------------------>> ENTERING DECODE >>------------------
17:27:20.561 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.decode] - rpcid=1335, response size=1126 bytes, 0 readable bytes left, rpc=CloseScannerRequest(scanner_id=0x00B6D09B00455DAF, attempt=0)
17:27:30.980 DEBUG [AsyncHBase Timer HBaseClient #1] [RegionClient.encode] - [id: 0xfbba7e26, /xxx:33732 => /xxx:60020] Sending RPC #1336, payload=BigEndianHeapChannelBuffer(ridx=11, widx=42, cap=42) [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 27, 11, 8, -72, 10, 26, 4, 83, 99, 97, 110, 32, 1, 14, 24, -81, -69, -107, -126, -80, -109, -76, -37, 80, 32, 0, 40, 1]
17:27:30.982 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.handleUpstream] - handleUpstream [id: 0xfbba7e26, /xxx:33732 => /xxx:60020] WRITTEN_AMOUNT: 31
17:27:30.983 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.handleUpstream] - handleUpstream [id: 0xfbba7e26, /xxx:33732 => /xxx:60020] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=1126, cap=1126)
17:27:30.987 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.decode] - ------------------>> ENTERING DECODE >>------------------
17:27:40.988 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.decode] - rpcid=1336, response size=1126 bytes, 0 readable bytes left, rpc=CloseScannerRequest(scanner_id=0x00B6D09B00455DAF, attempt=0)
17:27:51.400 DEBUG [AsyncHBase Timer HBaseClient #1] [RegionClient.encode] - [id: 0xfbba7e26, /xxx:33732 => /xxx:60020] Sending RPC #1337, payload=BigEndianHeapChannelBuffer(ridx=11, widx=42, cap=42) [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 27, 11, 8, -71, 10, 26, 4, 83, 99, 97, 110, 32, 1, 14, 24, -81, -69, -107, -126, -80, -109, -76, -37, 80, 32, 0, 40, 1]
17:27:51.401 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.handleUpstream] - handleUpstream [id: 0xfbba7e26, /xxx:33732 => /xxx:60020] WRITTEN_AMOUNT: 31
17:27:51.402 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.handleUpstream] - handleUpstream [id: 0xfbba7e26, /xxx:33732 => /xxx:60020] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=1126, cap=1126)

Moreover, this bug reproduces reliably when "Thread.sleep(61000)" is added to the Scanner class's nextRow() function, presumably because the 61-second pause outlasts the scanner lease (60 seconds by default in HBase).

This is my bug fix:
Increment the rpc's attempt counter by 1 just before invoking sendRpc(). Once the rpc has been retried more than hbase.client.retries.number times, it gives up.

diff --git a/src/RegionClient.java b/src/RegionClient.java
index ad83aa1..59c0d8e 100644
--- a/src/RegionClient.java
+++ b/src/RegionClient.java
@@ -1547,6 +1547,7 @@ final class RegionClient extends ReplayingDecoder<VoidEnum> {
       final class RetryTimer implements TimerTask {
         public void run(final Timeout timeout) {
           if (isAlive()) {
+            rpc.attempt++;
             sendRpc(rpc);
           } else {
             if (rpc instanceof MultiAction) {
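
To illustrate why the one-line change bounds the retries, here is a minimal toy model, not asynchbase code. The class RetrySketch, its Rpc stand-in, the method sendsUntilGiveUp and the safetyCap parameter are all hypothetical names introduced for this sketch. It also assumes that the retry limit is checked before each send and that the server rejects every send with "missing scanner".

```java
/** Toy model of the retry path; these names are illustrative, not asynchbase's API. */
final class RetrySketch {
  /** Stand-in for an RPC that carries its own attempt counter. */
  static final class Rpc {
    int attempt = 0;
  }

  /**
   * Re-sends an RPC that the server always rejects and returns how many
   * sends happened before the client gave up (or safetyCap if it never did).
   */
  static int sendsUntilGiveUp(int maxRetries, boolean incrementOnRetry, int safetyCap) {
    Rpc rpc = new Rpc();
    int sends = 0;
    while (sends < safetyCap) {
      if (rpc.attempt > maxRetries) {
        return sends;  // retry budget exhausted: stop retrying
      }
      sends++;  // models sendRpc(rpc); the server answers "missing scanner"
      if (incrementOnRetry) {
        rpc.attempt++;  // models the one-line change proposed in the diff
      }
    }
    return sends;  // cap reached: without the cap this loop would never end
  }

  public static void main(String[] args) {
    // 10 is only an example value for hbase.client.retries.number.
    System.out.println(sendsUntilGiveUp(10, true, 1000));   // with the increment
    System.out.println(sendsUntilGiveUp(10, false, 1000));  // without the increment
  }
}
```

With the increment, the model stops after 11 sends (attempts 0 through 10). Without it, attempt stays at 0, matching the attempt=0 seen in the debug logs, and only the artificial safetyCap ends the loop.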

One more thing: would it be OK to change UnknownScannerException to a NonRecoverableException?
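
As a rough sketch of what that second option could mean, the snippet below declares local stand-in exception classes that only mirror the names used in this issue. They are not the real asynchbase classes, and shouldRetry is a hypothetical policy hook. The idea is that an error classified as non-recoverable is surfaced to the caller at once instead of being re-sent.

```java
/** Local stand-ins that only mirror the exception names from the issue. */
final class ScannerErrorSketch {
  static class RecoverableException extends RuntimeException {
    RecoverableException(String msg) { super(msg); }
  }

  static class NonRecoverableException extends RuntimeException {
    NonRecoverableException(String msg) { super(msg); }
  }

  /** Modeled as non-recoverable, which is the change the issue asks about. */
  static final class UnknownScannerException extends NonRecoverableException {
    UnknownScannerException(String msg) { super(msg); }
  }

  /** Hypothetical policy hook: only recoverable errors are worth re-sending. */
  static boolean shouldRetry(RuntimeException e) {
    return e instanceof RecoverableException;
  }
}
```

Under this classification a "missing scanner" reply would fail the scan immediately, and the caller would have to open a new scanner if it wants to continue.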
