Description
Issue description:
When a RegionServer (RS) has a hot region, tsdb's scanner leases may expire. Once many scanners have expired, the RegionServer ends up with too many handlers busy serving scan requests, and it throws "UnknownScannerException" with "missing scanner" logs like this:
2018-11-02 16:46:40,580 WARN [RpcServer.default.RWQ.Fifo.scan.handler=380,queue=38,port=60020] regionserver.RSRpcServices: Client tried to access missing scanner 5816065332938628527
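Furthermore, the client keeps resending the RPC for the same scanner_id to the RS indefinitely, so the RS becomes even busier handling these endless "missing scanner" requests.
These are the debug logs on the tsdb side (note that the same CloseScannerRequest is resent with attempt=0 every time):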
17:27:10.561 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.decode] - ------------------>> ENTERING DECODE >>------------------
17:27:20.561 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.decode] - rpcid=1335, response size=1126 bytes, 0 readable bytes left, rpc=CloseScannerRequest(scanner_id=0x00B6D09B00455DAF, attempt=0)
17:27:30.980 DEBUG [AsyncHBase Timer HBaseClient #1] [RegionClient.encode] - [id: 0xfbba7e26, /xxx:33732 => /xxx:60020] Sending RPC #1336, payload=BigEndianHeapChannelBuffer(ridx=11, widx=42, cap=42) [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 27, 11, 8, -72, 10, 26, 4, 83, 99, 97, 110, 32, 1, 14, 24, -81, -69, -107, -126, -80, -109, -76, -37, 80, 32, 0, 40, 1]
17:27:30.982 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.handleUpstream] - handleUpstream [id: 0xfbba7e26, /xxx:33732 => /xxx:60020] WRITTEN_AMOUNT: 31
17:27:30.983 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.handleUpstream] - handleUpstream [id: 0xfbba7e26, /xxx:33732 => /xxx:60020] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=1126, cap=1126)
17:27:30.987 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.decode] - ------------------>> ENTERING DECODE >>------------------
17:27:40.988 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.decode] - rpcid=1336, response size=1126 bytes, 0 readable bytes left, rpc=CloseScannerRequest(scanner_id=0x00B6D09B00455DAF, attempt=0)
17:27:51.400 DEBUG [AsyncHBase Timer HBaseClient #1] [RegionClient.encode] - [id: 0xfbba7e26, /xxx:33732 => /xxx:60020] Sending RPC #1337, payload=BigEndianHeapChannelBuffer(ridx=11, widx=42, cap=42) [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 27, 11, 8, -71, 10, 26, 4, 83, 99, 97, 110, 32, 1, 14, 24, -81, -69, -107, -126, -80, -109, -76, -37, 80, 32, 0, 40, 1]
17:27:51.401 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.handleUpstream] - handleUpstream [id: 0xfbba7e26, /xxx:33732 => /xxx:60020] WRITTEN_AMOUNT: 31
17:27:51.402 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.handleUpstream] - handleUpstream [id: 0xfbba7e26, /xxx:33732 => /xxx:60020] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=1126, cap=1126)
Moreover, the bug can be reproduced reliably by adding "Thread.sleep(61000)" to the nextRow() function of the Scanner class, which presumably makes the client sleep past the scanner lease.
This is my bug fix:
Simply increment the RPC's attempt counter by 1 before invoking sendRpc(). Once the RPC's retry count exceeds hbase.client.retries.number, it gives up instead of retrying forever.
diff --git a/src/RegionClient.java b/src/RegionClient.java
index ad83aa1..59c0d8e 100644
--- a/src/RegionClient.java
+++ b/src/RegionClient.java
@@ -1547,6 +1547,7 @@ final class RegionClient extends ReplayingDecoder<VoidEnum> {
     final class RetryTimer implements TimerTask {
       public void run(final Timeout timeout) {
         if (isAlive()) {
+          rpc.attempt++;
           sendRpc(rpc);
         } else {
           if (rpc instanceof MultiAction) {
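To illustrate why the one-line change matters, here is a minimal, self-contained sketch of the retry loop. It is not the asynchbase code: SketchRpc, RetrySketch, countSends and the MAX_RETRIES value are hypothetical names and values made up for this example, and the place where the retry limit is checked is an assumption. It only shows that a retry budget can never run out if the attempt counter is never incremented.

```java
/** Hypothetical stand-in for an RPC that carries an attempt counter. */
final class SketchRpc {
  int attempt = 0;
}

final class RetrySketch {
  /** Stands in for hbase.client.retries.number; the value is illustrative. */
  static final int MAX_RETRIES = 10;

  /**
   * Simulates a retry timer firing against a server that always answers
   * "missing scanner". Returns the number of times the RPC was sent before
   * the client gave up, or -1 if the safety cap was reached, meaning the
   * retries would have continued forever.
   */
  static int countSends(final boolean incrementOnRetry, final int safetyCap) {
    final SketchRpc rpc = new SketchRpc();
    int sends = 1;  // initial send, with attempt == 0
    while (sends < safetyCap) {
      // The retry timer fires because the previous send failed.
      if (incrementOnRetry) {
        rpc.attempt++;  // the one-line change from the patch
      }
      if (rpc.attempt > MAX_RETRIES) {
        return sends;  // retry budget exhausted: give up
      }
      sends++;  // stands in for sendRpc(rpc)
    }
    return -1;
  }

  public static void main(final String[] args) {
    System.out.println("with increment:    " + countSends(true, 1000));
    System.out.println("without increment: " + countSends(false, 1000));
  }
}
```

In this sketch, the run with the increment stops after 11 sends (one initial send plus 10 retries), while the run without it only stops because of the artificial safety cap, which matches the attempt=0 seen on every retry in the logs above.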
Another thought: would it be OK to change UnknownScannerException to a NonRecoverableException?