
Fix the long-blocking read for Valkey RDMA.#233

Merged
bjosv merged 1 commit into valkey-io:main from ruihong123:main
Oct 23, 2025

Conversation

@ruihong123
Contributor

@ruihong123 ruihong123 commented Aug 21, 2025

During the valkey benchmark, when we set the data size to 16KB, the benchmark blocks on GET. The reason is that the RDMA event is edge-triggered, so every incoming message has to be read in full rather than partially; otherwise, the benchmark blocks in the event loop. We modify valkeyBufferRead to perform a full read over the RDMA buffer when the connection type is RDMA. To realize this logic, we keep reading the buffer until the returned data size is smaller than the attempted read size.

In addition, we need to handle a corner case in the logic above. If the message happens to be exactly 16KB and the first read consumes all the available data, a second read will be triggered and block until it returns an error. To solve this problem, we make every read after the first one non-blocking.
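The drain logic described here can be sketched with a small, self-contained mock; `mock_conn`, `mock_read`, and `drain_all` are illustrative stand-ins, not the actual valkey RDMA code:

```c
#include <stddef.h>

/* Sketch of the drain loop described above: with an edge-triggered
 * RDMA event, keep reading until a short read shows the receive
 * buffer is empty. All names here are illustrative mocks. */
#define READ_CHUNK 16384

typedef struct {
    size_t pending; /* bytes already delivered by the peer */
} mock_conn;

/* Non-blocking read: returns up to `want` bytes, 0 when nothing is left. */
static size_t mock_read(mock_conn *c, size_t want) {
    size_t n = c->pending < want ? c->pending : want;
    c->pending -= n;
    return n;
}

/* A full READ_CHUNK result may mean more data is pending, so only a
 * short read (n < READ_CHUNK) proves the buffer is drained. When the
 * message is exactly one chunk, the second (non-blocking) read
 * returns 0 and the loop exits instead of blocking. */
static size_t drain_all(mock_conn *c) {
    size_t total = 0, n;
    do {
        n = mock_read(c, READ_CHUNK);
        total += n;
    } while (n == READ_CHUNK);
    return total;
}
```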

An alternative approach would be to make the message sender chunk the message into 16KB blocks (equal to the stack buffer size) and generate a PollIn signal on every 16KB of arriving data. This approach also seems to improve the performance of RDMA commands with very large data sizes: according to our benchmark results, this pipelined RDMA transfer can improve SET command throughput by 1.4X. I am okay with both solutions.

Please take a look at this PR @pizhenwei.

@pizhenwei
Contributor


Hi,
Thanks for your work! I'm out of cycles this week; I'll dive into this issue next week.

@pizhenwei
Contributor

Hi @michael-grunder @ruihong123

I have another option for this issue:

```c
char *zc_buf = NULL;
ssize_t nread;

if (c->funcs->read_zc) {
    nread = c->funcs->read_zc(c, &zc_buf);
    if (nread < 0) {
        ...
    }
}
...
```

zc means zero copy here. Because RDMA uses a user-space buffer, c->funcs->read_zc can return a pointer to the available read buffer together with its size. This will:

  • reduce memory copies: feed zc_buf and nread to the reader directly, instead of copying into an additional stack buffer.
  • consume all the available read buffer in a single call, which also fixes this issue.

If this is acceptable, I suggest using c->funcs->read_zc instead of if (c->connection_type != VALKEY_CONN_RDMA). The zero-copy method offers higher performance, and I guess TCP zero-copy receive may be introduced in the future.
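As a rough, self-contained illustration of this idea (`zc_conn` and `read_zc_impl` are hypothetical stand-ins, not the real connection type or RDMA implementation):

```c
#include <stddef.h>

/* Illustrative sketch of the zero-copy idea: read_zc lends out a
 * pointer into the connection's own receive buffer instead of copying
 * into a caller-supplied stack buffer. */
typedef struct {
    char recv_buf[65536];
    size_t recv_offset; /* bytes already consumed */
    size_t recv_length; /* valid bytes in recv_buf */
} zc_conn;

/* Return a pointer to, and the size of, all currently available data;
 * offset bookkeeping is deliberately left out of this sketch. */
static long read_zc_impl(zc_conn *c, char **buf) {
    *buf = c->recv_buf + c->recv_offset;
    return (long)(c->recv_length - c->recv_offset);
}
```

The caller can then feed the returned pointer and length straight into the protocol reader, with no intermediate memcpy.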

@michael-grunder
Collaborator

> If this is acceptable, I suggest use c->funcs->read_zc instead of if (c->connection_type != VALKEY_CONN_RDMA)

I had the same thought, honestly. What do you think, @bjosv?

@ruihong123
Contributor Author


Thank you for the thoughtful feedback. I agree that your proposed approach offers a better solution. If the reviewers consider the zero-copy read approach necessary, I am comfortable either updating this PR with the new solution or having a separate PR opened that links back here for context and discussion.

@pizhenwei
Contributor


> Thank you for the thoughtful feedback. I agree that your proposed approach offers a better solution. […]

Keeping this PR would be fine; we will fix the long-blocking read for Valkey RDMA together in this PR.

@ruihong123
Contributor Author

ruihong123 commented Aug 24, 2025

Hi @pizhenwei, I am wondering when we should update the internal offset (ctx->recv_offset) for the RDMA buffer in c->funcs->read_zc. It does not seem 100% safe to update ctx->recv_offset within c->funcs->read_zc unless the invoker is guaranteed to copy/consume the data before the next call to connRdmaHandleCq. Do you think it is okay to advance ctx->recv_offset within c->funcs->read_zc?

Thanks

@pizhenwei
Contributor

pizhenwei commented Aug 25, 2025


There are two jobs in the current valkeyRdmaRead:

  • moving ctx->recv_offset. This is safe to do in c->funcs->read_zc.
  • possibly calling connRdmaRegisterRx. This tells the remote side to continue writing into a new buffer, which is generally the original one, so the remote side may overwrite this RX buffer while the local side is still reading it.

So I suggest:

  • just return the read buffer pointer and length in c->funcs->read_zc
  • update ctx->recv_offset and call connRdmaRegisterRx (if necessary) in c->funcs->read_zc_done
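The suggested split can be sketched as follows; `zc_ctx`, `zc_read`, `zc_read_done`, and `rx_rearmed` are illustrative names, with `rx_rearmed` standing in for the connRdmaRegisterRx side effect:

```c
#include <stddef.h>

/* Two-step zero-copy scheme: zc_read only lends the buffer out;
 * zc_read_done later advances recv_offset and, once the buffer is
 * fully consumed, re-arms the receive side. */
typedef struct {
    size_t recv_offset;
    size_t recv_length;
    int rx_rearmed; /* records the connRdmaRegisterRx stand-in firing */
} zc_ctx;

/* Lend the available span without moving recv_offset: the peer must
 * not be invited to reuse this buffer while the caller is reading. */
static size_t zc_read(const zc_ctx *ctx) {
    return ctx->recv_length - ctx->recv_offset;
}

/* Called once the caller has finished with the borrowed span. */
static void zc_read_done(zc_ctx *ctx, size_t consumed) {
    ctx->recv_offset += consumed;
    if (ctx->recv_offset == ctx->recv_length)
        ctx->rx_rearmed = 1; /* safe point to re-register the RX buffer */
}
```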

Hi @michael-grunder do you have any suggestions?

@michael-grunder
Collaborator

@pizhenwei You know much more about the RDMA feature than I do. I'm fine with your suggestion assuming it's safe.

@ruihong123
Contributor Author

> @pizhenwei You know much more about the RDMA feature than I do. I'm fine with your suggestion assuming it's safe.

Hi, I will rework the commit based on zhenwei's advice by the end of this week.

@ruihong123 ruihong123 force-pushed the main branch 2 times, most recently from 2fcbe12 to 2f9dcbf Compare August 30, 2025 17:13
@ruihong123
Contributor Author

Hi everyone, the update is submitted. Please check it. Thanks.

Contributor

@pizhenwei pizhenwei left a comment


Hi @ruihong123
Generally LGTM, thanks!

By the way, would you please add a few test results comparing the two versions? I'm looking forward to seeing the improvement.

@ruihong123
Contributor Author


For some reason, I lost access to the testbed where I produced the first version's results. If you deem it necessary for merging, I can create a CloudLab cluster and compare the performance results there. However, it may take some time.

Contributor

@pizhenwei pizhenwei left a comment


LGTM, thanks.

@pizhenwei
Contributor

pizhenwei commented Aug 31, 2025

> For some reasons, I lost the accesses for the testbed that I had the results of the first version. […]

The test report will not block merging the code, but this improvement may be a highlight in the release notes.
Have you tested this code? This PR should at least be tested in your development environment.

@ruihong123
Contributor Author

Okay, I got it. I will test this new commit in cloudlab in the next few days. We can merge this PR after I get the experiment result.

@ruihong123
Contributor Author

ruihong123 commented Sep 3, 2025

Hi, here is the benchmark result over the CloudLab C6220 nodes. SET and GET command throughputs (ops/sec) are shown with/without the optimization.
+-------------+-------------------------+------------------------+-------------------------+------------------------+
| Data Size   | SET Before Optimization | SET After Optimization | GET Before Optimization | GET After Optimization |
+-------------+-------------------------+------------------------+-------------------------+------------------------+
| 3 B         | 199,362.05              | 191,644.31             | 212,269.16              | 218,197.67             |
| 1,024 B     | 152,718.39              | 150,943.39             | 189,825.36              | 198,846.70             |
| 16,384 B    | 36,064.63               | 36,709.37              | 84,076.01               | 118,652.11             |
| 1,048,576 B | 418.88                  | 413.33                 | N/A                     | 74,626.87              |
+-------------+-------------------------+------------------------+-------------------------+------------------------+

The zero-copy implementation functions properly across varied data sizes, which fixes the blocking issue targeted by this PR. At 16KB, the zero-copy optimization brings a ~1.41X GET throughput boost, and the performance gain could grow with the data size. However, GET data points beyond 16KB are not available before the optimization.

By the way, there is a huge throughput drop for SET with 1MB data size. I recall seeing a similar issue with TCP connections, so I think this primarily results from memory allocation.

@pizhenwei
Contributor

> Hi here is the benchmark result over the cloudlab C6220 nodes. SET and GET command throughputs (ops/sec) are shown with/without the optimization. […]
>
> With 16KB, the zero-copy optimization can bring in ~1.41X throughput boost, and the performance gain could be higher as the data size grows. […]

16K–32K KV sizes are quite common in real workloads. Great job, thanks!

> By the way, there is a huge throughput drop for the SET operation with 1MB data size. […]

This should be a separate issue, so I think we can merge this PR and fix it in another one.

src/rdma.c Outdated
struct rdma_cm_id *cm_id = ctx->cm_id;

if (ctx->recv_offset == ctx->recv_length) {
connRdmaRegisterRx(c, cm_id);
Collaborator


I don't know the RDMA feature, but connRdmaRegisterRx seems to be able to fail when it calls rdmaSendCommand ("no empty command buffer" or if ibv_post_send fails).

Could the client get stuck or hang in this case?
Is there a need for a return code in read_zc_done to indicate an unrecoverable error?

Contributor


It should not happen in a real scenario, but we should still handle this error in the code.
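A minimal sketch of what that error handling could look like, assuming read_zc_done gains a return code; `zcd_ctx` and `register_rx` are hypothetical stand-ins for the real context and connRdmaRegisterRx:

```c
#include <stddef.h>

/* If re-registering the RX buffer can fail, read_zc_done should
 * report it so the caller can close the connection instead of
 * hanging. All names here are illustrative. */
typedef struct {
    size_t recv_offset, recv_length;
    int fail_register; /* test hook simulating an ibv_post_send failure */
} zcd_ctx;

static int register_rx(zcd_ctx *ctx) {
    return ctx->fail_register ? -1 : 0;
}

/* Returns 0 on success, -1 on an unrecoverable re-register failure. */
static int zc_read_done(zcd_ctx *ctx, size_t consumed) {
    ctx->recv_offset += consumed;
    if (ctx->recv_offset == ctx->recv_length && register_rx(ctx) != 0)
        return -1; /* caller should tear the connection down */
    return 0;
}
```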

src/rdma.c Outdated
@@ -991,6 +1031,8 @@ static valkeyContextFuncs valkeyContextRdmaFuncs = {
.async_read = valkeyRdmaAsyncRead,
.async_write = valkeyRdmaAsyncWrite,
.read = valkeyRdmaRead,
Collaborator


Should we replace the handling within valkeyRdmaRead with an assert to avoid having unused code?

Contributor


Good idea, no need to keep the dead code.

@pizhenwei
Contributor

Hi @ruihong123
Would you please append a commit in another new PR to drop the dead code?
Or I will do it if you don't have such plans...

@ruihong123
Contributor Author

ruihong123 commented Oct 21, 2025

> Hi @ruihong123 Would you please append a commit in another new PR to drop the dead code? […]

I will do it later today and push to this PR, since it has not been merged yet.

During the benchmark, when we set the data size to 16KB or more, the
benchmark blocks on GET. The reason is that the RDMA event is
edge-triggered, so every incoming message has to be read in full
rather than partially. To solve this problem, we add 'read_zc' in
valkey.h to enable zero-copy reads. This function enables
valkeyBufferRead to feed all available data from the RDMA buffer into
c->reader on every incoming RDMA event.

Signed-off-by: wang4996 <wang4996@purdue.edu>
Contributor

@pizhenwei pizhenwei left a comment


LGTM, thanks!

@bjosv bjosv merged commit 36f6e22 into valkey-io:main Oct 23, 2025
47 checks passed
