Fix the long-blocking read for Valkey RDMA #233
Conversation
Hi @michael-grunder @ruihong123, I have another option for this issue:
If this is acceptable, I suggest using
I had the same thought honestly. What do you think @bjosv?
Thank you for the thoughtful feedback. I agree that your proposed approach offers a better solution. If the reviewers consider the zero-copy read approach necessary, I am comfortable either updating this PR with the new solution or having a separate PR opened that links back here for context and discussion.
Keeping this PR would be fine; we will fix
Hi @pizhenwei, I am wondering when we should update the internal offset. Thanks
There are two jobs in the current
So I suggest:
Hi @michael-grunder, do you have any suggestions?
@pizhenwei You know much more about the RDMA feature than I do. I'm fine with your suggestion assuming it's safe.
Hi, I will rewrite the commit based on zhenwei's advice by the end of this week.
Force-pushed from 2fcbe12 to 2f9dcbf
Hi everyone, the update is submitted. Please check it. Thanks.
Hi @ruihong123
Generally LGTM, thanks!
By the way, would you please add a few test results comparing the two versions? I'm looking forward to seeing the improvement.
For some reason, I lost access to the testbed where I obtained the results for the first version. If you deem it necessary for merging, I can create a cloudlab cluster and compare the performance results there. However, it may take some time.
The test report will not block merging the code, but this improvement may be a highlight in the release notes.
Okay, got it. I will test this new commit on cloudlab in the next few days. We can merge this PR after I get the experiment results.
Hi, here are the benchmark results on the cloudlab C6220 nodes. SET and GET command throughputs (ops/sec) are shown with and without the optimization. The zero-copy implementation functions properly with varied data sizes, which fixes the blocking issue for this PR. At 16KB, the zero-copy optimization brings a ~1.41X throughput boost, and the gain could be higher as the data size grows. However, the data points beyond 16KB are not available for GET before the optimization. By the way, there is a huge throughput drop for the SET operation at 1MB data size. I recall seeing a similar issue with TCP connections, so I think this issue primarily results from memory allocation.
16K–32K KV sizes are quite common in real workloads, great job! Thanks!
The SET drop should be another issue, so I think we can merge this PR, then fix it in another PR.
src/rdma.c
Outdated
struct rdma_cm_id *cm_id = ctx->cm_id;

if (ctx->recv_offset == ctx->recv_length) {
    connRdmaRegisterRx(c, cm_id);
I don't know the RDMA feature, but connRdmaRegisterRx seems to be able to fail when it calls rdmaSendCommand ("no empty command buffer") or if ibv_post_send fails.
Could the client get stuck/hang in this case?
Is there a need for a return code in read_zc_done to indicate an unrecoverable error?
It should not happen in a real scenario, but we should still handle this error in the code.
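To illustrate the error-handling pattern being discussed, here is a minimal, self-contained sketch. The types and functions below (`RdmaCtxSketch`, `registerRxSketch`, `readZcDoneSketch`) are simplified stand-ins, not the real Valkey API: the real `connRdmaRegisterRx` can fail inside `rdmaSendCommand` ("no empty command buffer") or when `ibv_post_send` fails, and the idea is to surface that failure to the caller rather than swallow it.

```c
/* Simplified stand-ins for the real Valkey RDMA types; this is a sketch
 * of the error-propagation idea, not the actual implementation. */
typedef struct RdmaCtxSketch {
    int recv_offset;
    int recv_length;
    int fail_post_send; /* test hook simulating an ibv_post_send failure */
} RdmaCtxSketch;

/* Stand-in for connRdmaRegisterRx: 0 on success, -1 on failure. */
static int registerRxSketch(RdmaCtxSketch *ctx) {
    if (ctx->fail_post_send) return -1; /* e.g. no empty command buffer */
    ctx->recv_offset = 0; /* re-arm the receive buffer */
    return 0;
}

/* The point of the review comment: let read_zc_done return an error code
 * so the caller can tear down the connection instead of leaving the
 * client hanging on a silently failed re-registration. */
static int readZcDoneSketch(RdmaCtxSketch *ctx) {
    if (ctx->recv_offset == ctx->recv_length) {
        if (registerRxSketch(ctx) != 0) return -1; /* unrecoverable */
    }
    return 0;
}
```

The caller can then treat a non-zero return the same way other fatal connection errors are treated.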
src/rdma.c
Outdated
@@ -991,6 +1031,8 @@ static valkeyContextFuncs valkeyContextRdmaFuncs = {
    .async_read = valkeyRdmaAsyncRead,
    .async_write = valkeyRdmaAsyncWrite,
    .read = valkeyRdmaRead,
Should we replace the handling within valkeyRdmaRead with an assert to avoid having unused code?
Good idea, no need to keep the dead code.
Hi @ruihong123
I will do it later today. I will push in this PR, as it has not been merged yet.
During the benchmark, when we set the data size to 16KB or more, the benchmark blocks on GET. The reason is that the RDMA event is edge-triggered, so every incoming payload has to be read completely instead of partially. To solve this problem, we add 'read_zc' in valkey.h to enable reading with zero copy. This function enables valkeyBufferRead to feed all available data from the RDMA buffer into c->reader on every incoming RDMA event. Signed-off-by: wang4996 <wang4996@purdue.edu>
During the valkey benchmark, when we set the data size to 16KB, the benchmark blocks on GET. The reason is that the RDMA event is edge-triggered, and hence every incoming payload has to be read completely instead of partially; otherwise, the benchmark blocks at the event loop. We modify valkeyBufferRead to perform a total read over the RDMA buffer when the connection type is RDMA. To realize this logic we need to keep reading the buffer until the amount of data read is smaller than the attempted read size.
In addition, we need to deal with a corner case in the logic above. If the message happens to be exactly 16KB and the first read consumes all the available data, a second read will be triggered and block until it returns an error. To solve this problem, we make every read after the first one non-blocking.
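The drain-until-short-read rule, including the exactly-16KB corner case, can be sketched with plain POSIX file descriptors (this is a stand-in for the RDMA receive path, not the Valkey code; `drain_fd` and `RECV_CHUNK` are illustrative names):

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

#define RECV_CHUNK 16384 /* matches the 16KB stack buffer in the description */

/* Sketch of the "total read" rule for an edge-triggered event source:
 * everything that arrived must be drained before returning to the event
 * loop. The first read may use the fd as-is; every later read is made
 * non-blocking so that an exactly-16KB message does not leave us blocked
 * waiting for data that will never come. Stop on a short read, EAGAIN,
 * or EOF. */
static ssize_t drain_fd(int fd, char *out, size_t cap) {
    size_t total = 0;
    int first = 1;
    while (total < cap) {
        size_t want = cap - total < RECV_CHUNK ? cap - total : RECV_CHUNK;
        if (!first) fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);
        ssize_t n = read(fd, out + total, want);
        first = 0;
        if (n < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) break; /* drained */
            return -1; /* real error */
        }
        if (n == 0) break; /* EOF */
        total += (size_t)n;
        if ((size_t)n < want) break; /* short read: nothing left pending */
    }
    return (ssize_t)total;
}
```

The same shape applies to the RDMA buffer: a full-sized read means more data may be pending, a short read (or would-block) means the event has been fully consumed.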
An alternative approach could be to make the message sender chunk the message into 16KB blocks (equal to the stack buffer size) and generate a PollIn signal on every 16KB arrival. This approach also seems to improve the performance of RDMA commands with very large data sizes; according to our benchmark results, this pipelined RDMA transfer can improve SET command throughput by 1.4X. I am okay with both solutions.
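The sender-side alternative can be sketched as follows. This is hypothetical illustration, not code from this PR: `send_chunked` and `on_chunk` are made-up names, where `on_chunk` stands in for "post one RDMA write and raise one PollIn-style notification", letting the receiver overlap reading with the rest of the transfer.

```c
#include <stddef.h>

#define SEND_CHUNK 16384 /* one piece per 16KB, matching the stack buffer */

typedef void (*chunk_fn)(const char *data, size_t len, void *arg);

/* Counter used by the example callback below. */
static size_t pieces_seen;
static void count_piece(const char *data, size_t len, void *arg) {
    (void)data; (void)len; (void)arg;
    pieces_seen++;
}

/* Split a large payload into SEND_CHUNK-sized pieces; on_chunk is invoked
 * once per piece, modeling one RDMA write plus one PollIn notification.
 * Returns the number of pieces (== notifications generated). */
static size_t send_chunked(const char *payload, size_t len,
                           chunk_fn on_chunk, void *arg) {
    size_t sent = 0, pieces = 0;
    while (sent < len) {
        size_t n = len - sent < SEND_CHUNK ? len - sent : SEND_CHUNK;
        on_chunk(payload + sent, n, arg);
        sent += n;
        pieces++;
    }
    return pieces;
}
```

Compared to the zero-copy read, this trades one event per message for one event per 16KB piece, which is where the pipelining benefit for very large values would come from.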
Please take a look at this PR @pizhenwei.