
Fix the long-blocking read for Valkey RDMA.#233

Merged
bjosv merged 1 commit into valkey-io:main from ruihong123:main
Oct 23, 2025

Conversation

@ruihong123
Contributor

@ruihong123 ruihong123 commented Aug 21, 2025

During the valkey benchmark, when we set the data size to 16KB, the benchmark blocks on GET. The reason is that the RDMA event is edge-triggered, so every incoming message has to be read in full rather than partially; otherwise, the benchmark blocks in the event loop. We modify valkeyBufferRead to perform a full read over the RDMA buffer when the connection type is RDMA. To realize this logic, we keep reading the buffer until the returned data size is smaller than the attempted read size.

In addition, we need to handle a corner case in the logic above. If the message happens to be exactly 16KB and the first read consumes all the available data, a second read will be triggered and block until it returns an error. To solve this problem, we make every read after the first one non-blocking.
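The drain logic described here can be sketched with a small, self-contained mock; `mock_conn`, `mock_read`, and `drain_all` are illustrative stand-ins, not the actual valkey RDMA code:

```c
#include <stddef.h>

/* Sketch of the drain loop described above: with an edge-triggered
 * RDMA event, keep reading until a short read shows the receive
 * buffer is empty. All names here are illustrative mocks. */
#define READ_CHUNK 16384

typedef struct {
    size_t pending; /* bytes already delivered by the peer */
} mock_conn;

/* Non-blocking read: returns up to `want` bytes, 0 when nothing is left. */
static size_t mock_read(mock_conn *c, size_t want) {
    size_t n = c->pending < want ? c->pending : want;
    c->pending -= n;
    return n;
}

/* A full READ_CHUNK result may mean more data is pending, so only a
 * short read (n < READ_CHUNK) proves the buffer is drained. When the
 * message is exactly one chunk, the second (non-blocking) read
 * returns 0 and the loop exits instead of blocking. */
static size_t drain_all(mock_conn *c) {
    size_t total = 0, n;
    do {
        n = mock_read(c, READ_CHUNK);
        total += n;
    } while (n == READ_CHUNK);
    return total;
}
```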

An alternative approach would be to make the message sender chunk the message into 16KB blocks (equal to the stack buffer size) and generate a PollIn signal on every 16KB of arriving data. This approach also seems to improve the performance of RDMA commands with very large data sizes: according to our benchmark results, this pipelined RDMA transfer can improve SET command throughput by 1.4X. I am okay with both solutions.

Please take a look at this PR @pizhenwei.

@pizhenwei
Contributor


Hi,
Thanks for your work! I'm out of cycles this week; I'll dive into this issue next week.

@pizhenwei
Contributor

Hi @michael-grunder @ruihong123

I have another option for this issue:

```c
char *zc_buf = NULL;
ssize_t nread;

if (c->funcs->read_zc) {
    nread = c->funcs->read_zc(c, &zc_buf);
    if (nread < 0) {
        ...
    }
}
...
```

zc means zero copy here. Because RDMA uses a user-space buffer, c->funcs->read_zc can return a pointer to the available read buffer together with its size. This will:

  • reduce memory copies: feed zc_buf and nread to the reader directly, instead of copying into an additional stack buffer.
  • consume all the available read buffer in a single call, which also fixes this issue.

If this is acceptable, I suggest using c->funcs->read_zc instead of if (c->connection_type != VALKEY_CONN_RDMA). The zero-copy method offers higher performance, and I guess TCP zero-copy receive may be introduced in the future.
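As a rough, self-contained illustration of this idea (`zc_conn` and `read_zc_impl` are hypothetical stand-ins, not the real connection type or RDMA implementation):

```c
#include <stddef.h>

/* Illustrative sketch of the zero-copy idea: read_zc lends out a
 * pointer into the connection's own receive buffer instead of copying
 * into a caller-supplied stack buffer. */
typedef struct {
    char recv_buf[65536];
    size_t recv_offset; /* bytes already consumed */
    size_t recv_length; /* valid bytes in recv_buf */
} zc_conn;

/* Return a pointer to, and the size of, all currently available data;
 * offset bookkeeping is deliberately left out of this sketch. */
static long read_zc_impl(zc_conn *c, char **buf) {
    *buf = c->recv_buf + c->recv_offset;
    return (long)(c->recv_length - c->recv_offset);
}
```

The caller can then feed the returned pointer and length straight into the protocol reader, with no intermediate memcpy.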

@michael-grunder
Collaborator

> If this is acceptable, I suggest use c->funcs->read_zc instead of if (c->connection_type != VALKEY_CONN_RDMA)

I had the same thought, honestly. What do you think, @bjosv?

@ruihong123
Contributor Author


Thank you for the thoughtful feedback. I agree that your proposed approach offers a better solution. If the reviewers consider the zero-copy read approach necessary, I am comfortable either updating this PR with the new solution or having a separate PR opened that links back here for context and discussion.

@pizhenwei
Contributor


> Thank you for the thoughtful feedback. I agree that your proposed approach offers a better solution. […]

Keeping this PR would be fine; we will fix the long-blocking read for Valkey RDMA together in this PR.

@ruihong123
Contributor Author

ruihong123 commented Aug 24, 2025

Hi @pizhenwei, I am wondering when we should update the internal offset (ctx->recv_offset) for the RDMA buffer in c->funcs->read_zc. It does not seem 100% safe to update ctx->recv_offset within c->funcs->read_zc unless the invoker is guaranteed to copy/consume the data before the next call to connRdmaHandleCq. Do you think it is okay to advance ctx->recv_offset within c->funcs->read_zc?

Thanks

@pizhenwei
Contributor

pizhenwei commented Aug 25, 2025


There are two jobs in the current valkeyRdmaRead:

  • moving ctx->recv_offset. This is safe to do in c->funcs->read_zc.
  • possibly calling connRdmaRegisterRx. This tells the remote side to continue writing into a new buffer, which is generally the original one, so the remote side may overwrite this RX buffer while the local side is still reading it.

So I suggest:

  • just return the read buffer pointer and length in c->funcs->read_zc
  • update ctx->recv_offset and call connRdmaRegisterRx (if necessary) in c->funcs->read_zc_done
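The suggested split can be sketched as follows; `zc_ctx`, `zc_read`, `zc_read_done`, and `rx_rearmed` are illustrative names, with `rx_rearmed` standing in for the connRdmaRegisterRx side effect:

```c
#include <stddef.h>

/* Two-step zero-copy scheme: zc_read only lends the buffer out;
 * zc_read_done later advances recv_offset and, once the buffer is
 * fully consumed, re-arms the receive side. */
typedef struct {
    size_t recv_offset;
    size_t recv_length;
    int rx_rearmed; /* records the connRdmaRegisterRx stand-in firing */
} zc_ctx;

/* Lend the available span without moving recv_offset: the peer must
 * not be invited to reuse this buffer while the caller is reading. */
static size_t zc_read(const zc_ctx *ctx) {
    return ctx->recv_length - ctx->recv_offset;
}

/* Called once the caller has finished with the borrowed span. */
static void zc_read_done(zc_ctx *ctx, size_t consumed) {
    ctx->recv_offset += consumed;
    if (ctx->recv_offset == ctx->recv_length)
        ctx->rx_rearmed = 1; /* safe point to re-register the RX buffer */
}
```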

Hi @michael-grunder do you have any suggestions?

@michael-grunder
Collaborator

@pizhenwei You know much more about the RDMA feature than I do. I'm fine with your suggestion assuming it's safe.

@ruihong123
Contributor Author

> @pizhenwei You know much more about the RDMA feature than I do. I'm fine with your suggestion assuming it's safe.

Hi, I will rework the commit based on zhenwei's advice by the end of this week.

@ruihong123 ruihong123 force-pushed the main branch 2 times, most recently from 2fcbe12 to 2f9dcbf Compare August 30, 2025 17:13
@ruihong123
Contributor Author

Hi everyone, the update is submitted. Please check it. Thanks.

Contributor

@pizhenwei pizhenwei left a comment


Hi @ruihong123
Generally LGTM, thanks!

By the way, would you please add a few test results comparing the two versions? I'm looking forward to seeing the improvement.

@ruihong123
Contributor Author


For some reason, I lost access to the testbed where I produced the first version's results. If you deem it necessary for merging, I can create a CloudLab cluster and compare the performance results there. However, it may take some time.

Contributor

@pizhenwei pizhenwei left a comment


LGTM, thanks.

@pizhenwei
Contributor

pizhenwei commented Aug 31, 2025

> For some reasons, I lost the accesses for the testbed that I had the results of the first version. […]

The test report will not block merging the code, but this improvement may be a highlight in the release notes.
Have you tested this code? This PR should at least be tested in your development environment.

@ruihong123
Contributor Author

Okay, I got it. I will test this new commit in cloudlab in the next few days. We can merge this PR after I get the experiment result.

@ruihong123
Contributor Author

ruihong123 commented Sep 3, 2025

Hi, here is the benchmark result over the CloudLab C6220 nodes. SET and GET command throughputs (ops/sec) are shown with/without the optimization.
+-------------+-------------------------+------------------------+-------------------------+------------------------+
| Data Size   | SET Before Optimization | SET After Optimization | GET Before Optimization | GET After Optimization |
+-------------+-------------------------+------------------------+-------------------------+------------------------+
| 3 B         | 199,362.05              | 191,644.31             | 212,269.16              | 218,197.67             |
| 1,024 B     | 152,718.39              | 150,943.39             | 189,825.36              | 198,846.70             |
| 16,384 B    | 36,064.63               | 36,709.37              | 84,076.01               | 118,652.11             |
| 1,048,576 B | 418.88                  | 413.33                 | N/A                     | 74,626.87              |
+-------------+-------------------------+------------------------+-------------------------+------------------------+

The zero-copy implementation functions properly across varied data sizes, which fixes the blocking issue targeted by this PR. At 16KB, the zero-copy optimization brings a ~1.41X GET throughput boost, and the performance gain could grow with the data size. However, GET data points beyond 16KB are not available before the optimization.

By the way, there is a huge throughput drop for SET with 1MB data size. I recall seeing a similar issue with TCP connections, so I think this primarily results from memory allocation.

@pizhenwei
Contributor

> Hi here is the benchmark result over the cloudlab C6220 nodes. SET and GET command throughputs (ops/sec) are shown with/without the optimization. […]
>
> With 16KB, the zero-copy optimization can bring in ~1.41X throughput boost, and the performance gain could be higher as the data size grows. […]

16K–32K KV sizes are quite common in real workloads. Great job, thanks!

> By the way, there is a huge throughput drop for the SET operation with 1MB data size. […]

This should be a separate issue, so I think we can merge this PR and fix it in another one.

src/rdma.c Outdated
struct rdma_cm_id *cm_id = ctx->cm_id;

if (ctx->recv_offset == ctx->recv_length) {
connRdmaRegisterRx(c, cm_id);
Collaborator


I don't know the RDMA feature, but connRdmaRegisterRx seems to be able to fail when it calls rdmaSendCommand ("no empty command buffer" or if ibv_post_send fails).

Could the client get stuck or hang in this case?
Is there a need for a return code in read_zc_done to indicate an unrecoverable error?

Contributor


It should not happen in a real scenario, but we should still handle this error in the code.
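A minimal sketch of what that error handling could look like, assuming read_zc_done gains a return code; `zcd_ctx` and `register_rx` are hypothetical stand-ins for the real context and connRdmaRegisterRx:

```c
#include <stddef.h>

/* If re-registering the RX buffer can fail, read_zc_done should
 * report it so the caller can close the connection instead of
 * hanging. All names here are illustrative. */
typedef struct {
    size_t recv_offset, recv_length;
    int fail_register; /* test hook simulating an ibv_post_send failure */
} zcd_ctx;

static int register_rx(zcd_ctx *ctx) {
    return ctx->fail_register ? -1 : 0;
}

/* Returns 0 on success, -1 on an unrecoverable re-register failure. */
static int zc_read_done(zcd_ctx *ctx, size_t consumed) {
    ctx->recv_offset += consumed;
    if (ctx->recv_offset == ctx->recv_length && register_rx(ctx) != 0)
        return -1; /* caller should tear the connection down */
    return 0;
}
```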

src/rdma.c Outdated
@@ -991,6 +1031,8 @@ static valkeyContextFuncs valkeyContextRdmaFuncs = {
.async_read = valkeyRdmaAsyncRead,
.async_write = valkeyRdmaAsyncWrite,
.read = valkeyRdmaRead,
Collaborator


Should we replace the handling within valkeyRdmaRead with an assert to avoid having unused code?

Contributor


Good idea, no need to keep the dead code.

@pizhenwei
Contributor

Hi @ruihong123
Would you please append a commit in another new PR to drop the dead code?
Or I will do it if you don't have such plans...

@ruihong123
Contributor Author

ruihong123 commented Oct 21, 2025

> Hi @ruihong123 Would you please append a commit in another new PR to drop the dead code? […]

I will do it later today and push to this PR, since it has not been merged yet.

During the benchmark, when we set the data size to 16KB or more, the
benchmark blocks on GET. The reason is that the RDMA event is
edge-triggered, so every incoming message has to be read in full
rather than partially. To solve this problem, we add 'read_zc' in
valkey.h to enable zero-copy reads. This function enables
valkeyBufferRead to feed all available data from the RDMA buffer into
c->reader on every incoming RDMA event.

Signed-off-by: wang4996 <wang4996@purdue.edu>
Contributor

@pizhenwei pizhenwei left a comment


LGTM, thanks!

@bjosv bjosv merged commit 36f6e22 into valkey-io:main Oct 23, 2025
47 checks passed
