
rate-limiting error Redis timeout #13473

Open
liukaiyuliukaiyu opened this issue Aug 8, 2024 · 18 comments
Labels: pending author feedback (waiting for the issue author to get back to a maintainer with findings, more details, etc.), stale

Comments

@liukaiyuliukaiyu commented Aug 8, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

3.4.0

Current Behavior

Kong uses Redis for rate limiting.
Network topology: Kong is deployed in elastic cloud containers, and the Redis cluster is called within the same data center (no cross-data-center or leased-line hops).
rate-limiting plugin config = {"redis_database":0,"policy":"redis","redis_host":"xx.xx.xx.xx","redis_timeout":50,"limit_by":"server","second":500,"redis_port":xx,"redis_password":"","fault_tolerant":true}
Redis cluster configuration: 12 GB memory, 3 proxies, 12 Redis instances, each group being one master and one slave, i.e. 6 Redis groups.
Kong machine specification: 4C8G. The container environment carries only test traffic (no other requests), and the wrk tool is used to send requests in parallel.
[screenshot: Redis timeout error in the Kong error log]
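For reference, the running plugin configuration can be read back from the Admin API; note that redis_timeout is expressed in milliseconds, so 50 here is a 50 ms budget, which matches the lua tcp socket read timeout: 50 lines in the debug logs later in this thread. A minimal sketch, assuming the Admin API listens on the default localhost:8001:

    # Dump all plugin configs and inspect the rate-limiting entry (Admin API address assumed)
    curl -s http://localhost:8001/plugins | python3 -m json.tool | grep -B5 -A15 '"rate-limiting"'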

Expected Behavior

Kong should read Redis responses normally instead of timing out.

Steps To Reproduce

  1. Network topology: Kong is deployed in elastic cloud containers, and the Redis cluster is called within the same data center (no cross-data-center or leased-line hops).
    Redis cluster configuration: 12 GB memory, 3 proxies, 12 Redis instances, each group being one master and one slave, i.e. 6 Redis groups.
    Kong machine specification: 4C8G. The container environment carries only test traffic (no other requests), and the wrk tool is used to send requests in parallel.
  2. rate-limiting plugin config = {"redis_database":0,"policy":"redis","redis_host":"xx.xx.xx.xx","redis_timeout":50,"limit_by":"server","second":500,"redis_port":xx,"redis_password":"","fault_tolerant":true}
  3. ./wrk -t12 -c50 -d1s http://127.0.0.1:8000/xxx (see the sketch after this list for counting the resulting errors)
  4. Redis: timeout
    [screenshot: Redis timeout error in the Kong error log]
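A minimal way to quantify the failures while reproducing, assuming the default Kong prefix /usr/local/kong and the failed to get usage error message seen in the logs below:

    # Run the load, then count rate-limiting failures in the proxy error log
    ./wrk -t12 -c50 -d1s http://127.0.0.1:8000/xxx
    grep -c "failed to get usage" /usr/local/kong/logs/error.log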

Anything else?

no

@chobits (Contributor) commented Aug 15, 2024

Could you use a Redis client tool to check if Redis is operational during the time you're benchmarking Kong?

Based on the information you provided, we can't determine where the timeout issue is occurring—it could be on the Kong side or the Redis side.

(BTW, providing the full text of the Redis error log would help the community debug this; your screenshot cuts off part of one line and is incomplete.)
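A minimal sketch of such a check from the Kong container, assuming redis-cli is installed there and substituting the real proxy host and port:

    # Continuously measure round-trip latency to the Redis proxy while wrk is running
    redis-cli -h <redis_host> -p <redis_port> --latency
    # One-off liveness check
    redis-cli -h <redis_host> -p <redis_port> ping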

@chobits (Contributor) commented Aug 15, 2024

Please note that to effectively benchmark Kong, you need to identify and reach a bottleneck in your system. Only then can you obtain meaningful results, such as QPS from Kong. We'd also like to know what the bottleneck is during your benchmark—it could be 100% CPU usage by Kong, full network card utilization, or 100% CPU usage by Redis.
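For example, those three candidates can be watched during the run with standard tools (host placeholders and tool availability are assumptions):

    # Per-thread CPU of the Kong (nginx) worker processes
    top -H -p "$(pgrep -d, -f 'nginx: worker')"
    # NIC throughput, 1-second samples
    sar -n DEV 1
    # Redis-side load
    redis-cli -h <redis_host> -p <redis_port> info stats | grep instantaneous_ops_per_sec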

@git-hulk (Contributor) commented
I believe it's impossible to overload the Redis instance with several hundred RPS. Perhaps you should check the number of workers in your Kong instance as @chobits mentioned.
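A quick way to check the worker count, assuming the default Kong prefix /usr/local/kong:

    # Number of running Kong worker processes
    ps -ef | grep -c "[n]ginx: worker process"
    # Configured value in the generated nginx.conf
    grep worker_processes /usr/local/kong/nginx.conf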

@bungle added the pending author feedback label Aug 19, 2024
@liukaiyuliukaiyu (Author) commented Aug 19, 2024

Please note that to effectively benchmark Kong, you need to identify and reach a bottleneck in your system. Only then can you obtain meaningful results, such as QPS from Kong. We'd also like to know what the bottleneck is during your benchmark—it could be 100% CPU usage by Kong, full network card utilization, or 100% CPU usage by Redis.

The Kong container's NIC has a maximum bandwidth of 10 Gb.
The Kong container's CPU idle rate is above 85%.
A tcpdump capture shows that the Redis server responds within 5 ms, yet the timeout is reported only after waiting more than 50 ms.
With rate limiting enabled, every request makes nginx talk to the Redis server: it looks up the value of a key in Redis and increments it with INCRBY.
The captured data is as follows; the Redis server responded within 1.5 ms:
[screenshot: tcpdump capture of the Redis response]

I printed the traceId of each request and used it as the Redis key. For a request whose client read timed out, I can find its key in the packet capture, and Redis responded to it within 1.5 ms.

My suspicion is that the kernel is not delivering the read event to the nginx application layer in time; the delay between epoll events can be confirmed at the kernel level.
If the nginx application cannot handle it in time, how can it be tuned, for example by adjusting the nginx configuration or Linux kernel parameters?
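For context, the per-request Redis work described above is roughly equivalent to the following redis-cli call (an illustrative approximation only; the plugin's actual EVAL script and key layout differ):

    # Increment the per-window counter and set its expiry on the first hit
    redis-cli -h <redis_host> -p <redis_port> EVAL \
      "local v = redis.call('incrby', KEYS[1], 1) \
       if v == 1 then redis.call('expire', KEYS[1], ARGV[1]) end \
       return v" \
      1 "ratelimit:<identifier>:<window>" 1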

@chobits (Contributor) commented Aug 20, 2024

The Kong container's CPU idle rate is above 85%.

You mean that when you benchmarked Kong, it experienced Redis timeouts while the Kong worker processes used only about 15% CPU?

If that's the case, I guess something might be stuck inside Kong.

Did you see any other error logs besides the Redis timeout? For example, any logs related to timer-ng, DNS, connection timeouts, etc.?

My suspicion is that the kernel is not delivering the read event to the nginx application layer in time; the delay between epoll events can be confirmed at the kernel level.

I think this should be the last thing to consider. In my experience, I haven't encountered any bugs related to kernel epoll/events.

If the nginx application cannot handle it in time, how can it be tuned, for example by adjusting the nginx configuration or Linux kernel parameters?

We can't say yet, because we haven't located the root cause of this situation.

@chobits (Contributor) commented Aug 20, 2024

As for the error log in your screenshot, such as failed to eval Redis: timeout===, was this log added by you? I can't find any error log like this in the Kong source.

@liukaiyuliukaiyu (Author) commented
As for the error log in your screenshot, such as failed to eval Redis: timeout===, was this log added by you? I can't find any error log like this in the Kong source.

More debug logs are listed below:
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua access handler, uri:"/test_kylin_upstream_timeout" c:1
2024/08/14 18:08:39 [debug] 7021#7021: *201070 looking up Lua code cache with key '=access_by_lua(nginx-kong.conf:141)nhli_cd92897f6c733fbbef91fc813412a06e'
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua creating new thread
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua reset ctx
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua run thread, top:0 c:1
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua resume returned 1
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua thread yielded
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua coroutine: resume
2024/08/14 18:08:39 [debug] 7021#7021: *201070 [lua] base_plugin.lua:24: access(): executing plugin "rate-limiting": access
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket pool get keepalive peer
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket get keepalive peer: using connection 00007F89D76CE538, fd:171
2024/08/14 18:08:39 [debug] 7021#7021: *201070 calling ngx_http_lua_get_keepalive_peer ngx_del_timer
2024/08/14 18:08:39 [debug] 7021#7021: *201070 event timer del: 171: 4346592003
2024/08/14 18:08:39 [debug] 7021#7021: *201070 http cleanup add: 00000000026E5708
2024/08/14 18:08:39 [error] 7021#7021: *201070 [lua] redis.lua:219: _gen_req(): ======nargs======5, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/14 18:08:39 [error] 7021#7021: *201070 [lua] redis.lua:230: _gen_req(): ======_gen_req=====eval=====i=====1, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/14 18:08:39 [error] 7021#7021: *201070 [lua] redis.lua:230: _gen_req(): ======_gen_req===== local cache_key, expiration = KEYS[1], ARGV[1]
2024/08/14 18:08:39 [error] 7021#7021: *201070 [lua] redis.lua:230: _gen_req(): ======_gen_req=====1=====i=====3, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/14 18:08:39 [error] 7021#7021: *201070 [lua] redis.lua:230: _gen_req(): ======_gen_req=====ratelimit:d7af6063-037c-47d0-934a-1f19c412ea94:d7af6063-037c-47d0-934a-1f19c412ea94:1723630119000:second:0aa0512166bc8227b5ae1b6d04a47402=====i=====4, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/14 18:08:39 [error] 7021#7021: *201070 [lua] redis.lua:230: _gen_req(): ======_gen_req=====1=====i=====5, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket send timeout: 50
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua allocate new chainlink and new buf of size 431, cl:00000000026E5750
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket send data
2024/08/14 18:08:39 [debug] 7021#7021: *201070 send: fd:171 431 of 431
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket sent all the data
2024/08/14 18:08:39 [error] 7021#7021: *201070 [lua] redis.lua:128: eval(): before_receive_time timestamp: 1723630119011, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket calling receive() method
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket read timeout: 50
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua allocate new chainlink and new buf of size 4096, cl:00000000026E5960
2024/08/14 18:08:39 [debug] 7021#7021: *201070 malloc: 00000000026E5AA0:4096
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket read timeout: 50
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket read data: wait:0
2024/08/14 18:08:39 [debug] 7021#7021: *201070 event timer add: 171: 50:4346582060
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua resume returned 1
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua thread yielded
2024/08/14 18:08:39 [debug] 7021#7021: *201070 http run request: "/test_kylin_upstream_timeout?"
2024/08/14 18:08:39 [debug] 7021#7021: *201070 calling write_event_handler
2024/08/14 18:08:39 [debug] 7021#7021: *201070 access phase: 14
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua access handler, uri:"/test_kylin_upstream_timeout" c:1
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua run write event handler: timedout:0, ready:1, writing_raw_req_socket:0
2024/08/14 18:08:39 [debug] 7021#7021: *201070 useless lua write event handler
2024/08/14 18:08:39 [debug] 7021#7021: *201070 write_event_handler executed in 0 ms
2024/08/14 18:08:39 [debug] 7021#7021: *201070 event timer del: 171: 4346582060
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket handler for "/test_kylin_upstream_timeout?", wev 0
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket read handler
2024/08/14 18:08:39 [error] 7021#7021: *201070 lua tcp socket read timed out, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket handle read error
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket waking up the current request
2024/08/14 18:08:39 [debug] 7021#7021: *201070 access phase: 14
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua access handler, uri:"/test_kylin_upstream_timeout" c:1
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp operation done, resuming lua thread
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket calling prepare retvals handler 0000000000663E34, u:00007F89D0185818
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket receive return value handler
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua run thread, top:0 c:1
2024/08/14 18:08:39 [error] 7021#7021: *201070 [lua] redis.lua:132: eval(): =========receive time===== 50ms, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/14 18:08:39 [error] 7021#7021: *201070 [lua] redis.lua:133: eval(): =======line=======nil=====err=====timeout=====partial=====, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua finalize socket
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua http cleanup free: 00000000026E5708
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua close socket connection
2024/08/14 18:08:39 [debug] 7021#7021: *201070 free: 0000000002357AF0, unused: 48
2024/08/14 18:08:39 [debug] 7021#7021: *201070 reusable connection: 0
2024/08/14 18:08:39 [error] 7021#7021: *201070 [lua] init.lua:171: usage(): failed to eval Redis: timeout=====operationid=====0aa0512166bc8227b5ae1b6d04a47402, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/14 18:08:39 [error] 7021#7021: *201070 [lua] handler.lua:101: failed to get usage: timeout, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"

@chobits (Contributor) commented Aug 20, 2024

So what is your custom debug code behind log lines like =======line=======nil=====err=====timeout=====partial=====? We can't find it in Kong's source code.

Also note that we cannot pinpoint the reason for the Redis timeout, even though you have shown the timeout error and the source code.

If I were you, I would add more debug/error logging or use the stapxx tool to trace the stack of the Redis timeout. It looks like an interesting debugging journey.

Also, please confirm that the tcpdump capture you provided is definitely associated with the timeout situation; if it is, then you can debug into the Kong source code (the rate-limiting Redis Lua code) or OpenResty's lua-nginx-module tcp socket code.
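If stapxx is not available in the container, a rougher alternative along the same lines is to time the worker's syscalls with strace while reproducing (the pid is a placeholder; strace adds noticeable overhead, so only do this on a test box):

    # -tt: wall-clock timestamps, -T: time spent inside each syscall
    strace -f -tt -T -e trace=epoll_wait,recvfrom,read,write,sendto \
      -p <kong_worker_pid> -o /tmp/kong-worker.strace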

@liukaiyuliukaiyu (Author) commented
So what is your custom debug code behind log lines like =======line=======nil=====err=====timeout=====partial=====? We can't find it in Kong's source code.

Also note that we cannot pinpoint the reason for the Redis timeout, even though you have shown the timeout error and the source code.

If I were you, I would add more debug/error logging or use the stapxx tool to trace the stack of the Redis timeout. It looks like an interesting debugging journey.

Also, please confirm that the tcpdump capture you provided is definitely associated with the timeout situation; if it is, then you can debug into the Kong source code (the rate-limiting Redis Lua code) or OpenResty's lua-nginx-module tcp socket code.

The log lines containing ======= were all added by myself.
Setting the custom logging aside, how should I debug the code further? Comparing the debug log of the timeout case with a normal request, we can see that in the timeout case the 50 ms receive timer fires, the timer is deleted, and then lua tcp socket read timed out is reported:
2024/08/14 18:08:39 [debug] 7021#7021: *201070 useless lua write event handler
2024/08/14 18:08:39 [debug] 7021#7021: *201070 write_event_handler executed in 0 ms
2024/08/14 18:08:39 [debug] 7021#7021: *201070 event timer del: 171: 4346582060
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket handler for "/test_kylin_upstream_timeout?" , wev 0
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket read handler
2024/08/14 18:08:39 [error] 7021#7021: *201070 lua tcp socket read timed out, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket handle read error
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket waking up the current request
2024/08/14 18:08:39 [debug] 7021#7021: *201070 access phase: 14

For comparison, the following is the debug log of a normal request that does not time out:

2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua access handler, uri:"/test_kylin_upstream_timeout" c:1
2024/08/14 18:08:38 [debug] 7021#7021: *201070 looking up Lua code cache with key '=access_by_lua(nginx-kong.conf:141)nhli_cd92897f6c733fbbef91fc813412a06e'
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua creating new thread
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua reset ctx
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua run thread, top:0 c:1
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua resume returned 1
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua thread yielded
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua coroutine: resume
2024/08/14 18:08:38 [debug] 7021#7021: *201070 [lua] base_plugin.lua:24: access(): executing plugin "rate-limiting": access
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket pool get keepalive peer
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket get keepalive peer: using connection 00007F89D76CDB60, fd:162
2024/08/14 18:08:38 [debug] 7021#7021: *201070 calling ngx_http_lua_get_keepalive_peer ngx_del_timer
2024/08/14 18:08:38 [debug] 7021#7021: *201070 event timer del: 162: 4346591918
2024/08/14 18:08:38 [debug] 7021#7021: *201070 http cleanup add: 00000000026E6718
2024/08/14 18:08:38 [error] 7021#7021: *201070 [lua] redis.lua:219: _gen_req(): ======nargs======5, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/14 18:08:38 [error] 7021#7021: *201070 [lua] redis.lua:230: _gen_req(): ======_gen_req=====eval=====i=====1, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/14 18:08:38 [error] 7021#7021: *201070 [lua] redis.lua:230: _gen_req(): ======_gen_req===== local cache_key, expiration = KEYS[1], ARGV[1]
2024/08/14 18:08:38 [error] 7021#7021: *201070 [lua] redis.lua:230: _gen_req(): ======_gen_req=====1=====i=====3, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/14 18:08:38 [error] 7021#7021: *201070 [lua] redis.lua:230: _gen_req(): ======_gen_req=====ratelimit:d7af6063-037c-47d0-934a-1f19c412ea94:d7af6063-037c-47d0-934a-1f19c412ea94:1723630118000:second:0aa0512166bc8226b5ae1b6d04a43d02=====i=====4, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/14 18:08:38 [error] 7021#7021: *201070 [lua] redis.lua:230: _gen_req(): ======_gen_req=====1=====i=====5, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket send timeout: 50
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua allocate new chainlink and new buf of size 431, cl:00000000026E6760
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket send data
2024/08/14 18:08:38 [debug] 7021#7021: *201070 send: fd:162 431 of 431
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket sent all the data
2024/08/14 18:08:38 [error] 7021#7021: *201070 [lua] redis.lua:128: eval(): before_receive_time timestamp: 1723630118926, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket calling receive() method
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket read timeout: 50
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua allocate new chainlink and new buf of size 4096, cl:00000000026E6970
2024/08/14 18:08:38 [debug] 7021#7021: *201070 malloc: 00000000026E6AB0:4096
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket read timeout: 50
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket read data: wait:0
2024/08/14 18:08:38 [debug] 7021#7021: *201070 event timer add: 162: 50:4346581975
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua resume returned 1
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua thread yielded
2024/08/14 18:08:38 [debug] 7021#7021: *201070 http run request: "/test_kylin_upstream_timeout?"
2024/08/14 18:08:38 [debug] 7021#7021: *201070 calling write_event_handler
2024/08/14 18:08:38 [debug] 7021#7021: *201070 access phase: 14
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua access handler, uri:"/test_kylin_upstream_timeout" c:1
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua run write event handler: timedout:0, ready:1, writing_raw_req_socket:0
2024/08/14 18:08:38 [debug] 7021#7021: *201070 useless lua write event handler
2024/08/14 18:08:38 [debug] 7021#7021: *201070 write_event_handler executed in 0 ms
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket handler for "/test_kylin_upstream_timeout?", wev 0
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket read handler
2024/08/14 18:08:38 [debug] 7021#7021: *201070 calling ngx_http_lua_socket_read_handler ngx_del_timer
2024/08/14 18:08:38 [debug] 7021#7021: *201070 event timer del: 162: 4346581975
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket read data: wait:1
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket try to recv data 4096: "/test_kylin_upstream_timeout?"
2024/08/14 18:08:38 [debug] 7021#7021: *201070 recv: eof:0, avail:1
2024/08/14 18:08:38 [debug] 7021#7021: *201070 recv: fd:162 4 of 4096
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket recv returned 4: "/test_kylin_upstream_timeout?"
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket read line
"024/08/14 18:08:38 [debug] 7021#7021: *201070 lua read the final line part: ":0
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket receive done: wait:1, eof:0, uri:"/test_kylin_upstream_timeout?"
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket waking up the current request (read)
2024/08/14 18:08:38 [debug] 7021#7021: *201070 access phase: 14
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua access handler, uri:"/test_kylin_upstream_timeout" c:1
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp operation done, resuming lua thread
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket calling prepare retvals handler 0000000000663E34, u:00007F89CB123010
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket receive return value handler
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua run thread, top:0 c:1
2024/08/14 18:08:38 [error] 7021#7021: *201070 [lua] redis.lua:132: eval(): =========receive time===== 36ms, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/14 18:08:38 [error] 7021#7021: *201070 [lua] redis.lua:133: eval(): =======line=======:0=====err=====nil=====partial=====nil, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket set keepalive: saving connection 00007F89D76CDB60
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket clear current socket connection
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua tcp socket keepalive timeout: 10000 ms
2024/08/14 18:08:38 [debug] 7021#7021: *201070 event timer add: 162: 10000:4346591961
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua finalize socket
2024/08/14 18:08:38 [debug] 7021#7021: *201070 lua http cleanup free: 00000000026E6718
2024/08/14 18:08:38 [error] 7021#7021: *201070 [lua] init.lua:182: usage(): =========Redis: after set_keepalive=======, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"

@liukaiyuliukaiyu (Author) commented
So what is your custom debug code behind log lines like =======line=======nil=====err=====timeout=====partial=====? We can't find it in Kong's source code.

Also note that we cannot pinpoint the reason for the Redis timeout, even though you have shown the timeout error and the source code.

If I were you, I would add more debug/error logging or use the stapxx tool to trace the stack of the Redis timeout. It looks like an interesting debugging journey.

Also, please confirm that the tcpdump capture you provided is definitely associated with the timeout situation; if it is, then you can debug into the Kong source code (the rate-limiting Redis Lua code) or OpenResty's lua-nginx-module tcp socket code.

From the error case in the nginx debug log:
timer delta: 54 ms
This makes it possible to check whether there is a delay in the ngx_process_events function while it processes events.

2024/08/20 19:55:42 [debug] 7019#7019: *1338394 http run request: "/test_kylin_upstream_timeout?"
2024/08/20 19:55:42 [debug] 7019#7019: *1338394 calling write_event_handler
2024/08/20 19:55:42 [debug] 7019#7019: *1338394 access phase: 14
2024/08/20 19:55:42 [debug] 7019#7019: *1338394 lua access handler, uri:"/test_kylin_upstream_timeout" c:1
2024/08/20 19:55:42 [debug] 7019#7019: *1338394 lua run write event handler: timedout:0, ready:1, writing_raw_req_socket:0
2024/08/20 19:55:42 [debug] 7019#7019: *1338394 useless lua write event handler
2024/08/20 19:55:42 [debug] 7019#7019: *1338394 write_event_handler executed in 0 ms
2024/08/20 19:55:42 [debug] 7019#7019: timer delta: 54
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 event timer del: 75: 4871405350
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua tcp socket handler for "/test_kylin_upstream_timeout?", wev 0
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua tcp socket read handler
2024/08/20 19:55:42 [error] 7019#7019: *1338482 lua tcp socket read timed out, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua tcp socket handle read error
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua tcp socket waking up the current request
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 access phase: 14
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua access handler, uri:"/test_kylin_upstream_timeout" c:1
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua tcp operation done, resuming lua thread
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua tcp socket calling prepare retvals handler 0000000000663E34, u:00007F89CB1C03A8
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua tcp socket receive return value handler
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua run thread, top:0 c:1
2024/08/20 19:55:42 [error] 7019#7019: *1338482 [lua] redis.lua:132: eval(): =========receive time===== 55ms, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/20 19:55:42 [error] 7019#7019: *1338482 [lua] redis.lua:133: eval(): =======line=======nil=====err=====timeout=====partial=====, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua finalize socket
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua http cleanup free: 0000000002579E88
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua close socket connection
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 free: 0000000002356020, unused: 48
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 reusable connection: 0
2024/08/20 19:55:42 [error] 7019#7019: *1338482 [lua] init.lua:171: usage(): failed to eval Redis: timeout=====operationid=====0aa0512166c4843eb6041b6b04bce802, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"

@liukaiyuliukaiyu (Author) commented Aug 20, 2024

The relevant nginx source (ngx_process_events_and_timers() in src/event/ngx_event.c):

    delta = ngx_current_msec;

    (void) ngx_process_events(cycle, timer, flags);   /* one epoll_wait() plus event handlers */

    delta = ngx_current_msec - delta;                 /* wall time spent in that call */

    ngx_log_debug1(NGX_LOG_DEBUG_EVENT, cycle->log, 0,
                   "timer delta: %M", delta);

@chobits (Contributor) commented Aug 21, 2024

From the error case in the nginx debug log:
timer delta: 54 ms
This makes it possible to check whether there is a delay in the ngx_process_events function while it processes events.

I don’t think 54ms is excessively long. During one cycle of epoll events processing (ngx_process_events), there aren't any special I/O operations happening aside from writing debug logs. Therefore, it seems likely that the 54ms is primarily consumed by CPU-intensive operations and the time taken for debug log writing.
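One quick way to test that hypothesis, assuming the environment-variable style of Kong configuration, is to rerun the benchmark without debug logging and see whether the large timer deltas and the timeouts disappear:

    # Raise the log level so the per-event debug writes go away, then re-run wrk
    KONG_LOG_LEVEL=notice kong restart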

@chobits (Contributor) commented Aug 21, 2024

From your debug log

2024/08/20 19:55:42 [error] 7019#7019: *1338482 [lua] redis.lua:132: eval(): =========receive time===== 55ms, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/20 19:55:42 [error] 7019#7019: *1338482 [lua] redis.lua:133: eval(): =======line=======nil=====err=====timeout=====partial=====, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"

These two error log entries are puzzling: one indicates a timeout, while the other reports a latency of only 55 ms. In my opinion, 55 ms is not a significant latency.


First, it's best to capture a case where the delay exceeds a second or more (> 1 s would indicate a significant timeout). Make sure you have the corresponding tcpdump packets and debug log information, and in particular that the debug log covers the span from when the request is sent on the Redis connection to when the response is received (or the timeout occurs). Then you can analyze the timeout by correlating the tcpdump capture with the debug logs.
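A capture command along those lines, assuming the Redis proxy listens on 6379 (substitute the real port):

    # Microsecond timestamps, full packets, Redis traffic only
    tcpdump -i any -nn -tttt -s0 'tcp port 6379' -w kong-redis.pcap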

@liukaiyuliukaiyu (Author) commented Aug 27, 2024

From the error case in the nginx debug log:
timer delta: 54 ms
This makes it possible to check whether there is a delay in the ngx_process_events function while it processes events.

I don’t think 54ms is excessively long. During one cycle of epoll events processing (ngx_process_events), there aren't any special I/O operations happening aside from writing debug logs. Therefore, it seems likely that the 54ms is primarily consumed by CPU-intensive operations and the time taken for debug log writing.

@chobits We checked with our colleagues in the kernel group and determined that the 50 ms delay is in user mode, that is, in Lua + nginx. We need to keep looking into where things get stuck in user mode and whether any Lua or nginx parameters can be adjusted.
At least the direction is clear now.

@liukaiyuliukaiyu (Author) commented Aug 30, 2024

Please note that to effectively benchmark Kong, you need to identify and reach a bottleneck in your system. Only then can you obtain meaningful results, such as QPS from Kong. We'd also like to know what the bottleneck is during your benchmark—it could be 100% CPU usage by Kong, full network card utilization, or 100% CPU usage by Redis.

@chobits Looking at the container's basic metrics, we can see that the Redis timeout is triggered once the number of net.sockets.tp.inuse connections on the machine reaches 6000.
How should we tune for this? Do you have similar experience?

github-actions (bot) commented
This issue is marked as stale because it has been open for 14 days with no activity.

@github-actions github-actions bot added the stale label Sep 14, 2024
@chobits (Contributor) commented Sep 14, 2024

Please note that to effectively benchmark Kong, you need to identify and reach a bottleneck in your system. Only then can you obtain meaningful results, such as QPS from Kong. We'd also like to know what the bottleneck is during your benchmark—it could be 100% CPU usage by Kong, full network card utilization, or 100% CPU usage by Redis.

@chobits Looking at the container's basic metrics, we can see that the Redis timeout is triggered once the number of net.sockets.tp.inuse connections on the machine reaches 6000. How should we tune for this? Do you have similar experience?

I think we need to know what these 6000 connections are, e.g. by using netstat -anp. But for your issue, I think more in-depth troubleshooting on your side is needed.
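For instance, with standard iproute2/net-tools commands:

    # Overall socket summary
    ss -s
    # TCP sockets broken down by state
    ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
    # Which peers hold the established connections
    netstat -anp | awk '/ESTABLISHED/ {print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head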

determined that the 50ms delay was in user mode

For this, you might need a debugging tool such as a SystemTap or gdb script to hook into the relevant C functions and see why a delay occurs in user mode (Kong + Lua). For nginx/Kong/Lua, 50 ms of pure computation is rare and easy to spot (e.g. as 100% CPU usage), so I suspect there is some I/O wait involved.
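Short of a full SystemTap/gdb session, a lighter first pass (assuming perf is installed and a worker pid is known) is to sample the worker's user-space stacks during the benchmark and see where the time goes:

    # 99 Hz call-graph sampling of one Kong worker for 30 s, then a text report
    perf record -F 99 -g -p <kong_worker_pid> -- sleep 30
    perf report --stdio | head -n 50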

@github-actions github-actions bot removed the stale label Sep 15, 2024
github-actions (bot) commented
This issue is marked as stale because it has been open for 14 days with no activity.

@github-actions github-actions bot added the stale label Sep 29, 2024