Fix recv EINVAL/corruption when source has smaller recordsize setting by tuxoko · Pull Request #18663 · openzfs/zfs

tuxoko · 2026-06-11T21:31:34Z

If source and target has different recordsize setting, it's possible to
have same file to have different block size when reverse replication is
allowed. This has two consequences in the current
receive_handle_existing_object.

If target side has more than 1 block, it will fail with EINVAL.

$ sudo zfs create -o recordsize=64k pp/fs0
$ sudo zfs create -o recordsize=128k pp/fs1

$ sudo dd if=/dev/urandom of=/pp/fs0/test bs=1M count=1     # src created with 64k
$ sudo zfs snap pp/fs0@s00
$ sudo zfs send pp/fs0@s00 | sudo zfs recv -F pp/fs1        # tgt received with 64k
$ sudo dd if=/dev/urandom of=/pp/fs1/test bs=1M count=1     # tgt truncated and rewritten with 128k
$ sudo zfs snap pp/fs1@s01
$ sudo zfs send -i @s00 pp/fs1@s01 | sudo zfs recv pp/fs0   # reverse send, src remains 64k
$ sudo dd if=/dev/urandom of=/pp/fs0/test bs=1M count=1     # src modified, remains 64k
$ sudo zfs snap pp/fs0@s02
$ sudo zfs send -i @s01 pp/fs0@s02 | sudo zfs recv pp/fs1   # forward send failed
cannot receive incremental stream: invalid backup stream

If target side has 1 block, and source modifies file partially, it will
cause target to truncate everything and only left with the modified
part, causing corruption.

$ sudo zfs create -o recordsize=64k pp/fs0
$ sudo zfs create -o recordsize=128k pp/fs1

$ sudo dd if=/dev/urandom of=/pp/fs0/test bs=128k count=1   # src created with 64k
$ sudo zfs snap pp/fs0@s00
$ sudo zfs send pp/fs0@s00 | sudo zfs recv -F pp/fs1        # tgt received with 64k
$ sudo dd if=/dev/urandom of=/pp/fs1/test bs=128k count=1   # tgt truncated and rewritten with 128k
$ sudo zfs snap pp/fs1@s01
$ sudo zfs send -i @s00 pp/fs1@s01 | sudo zfs recv pp/fs0   # reverse send, src remains 64k
$ sudo dd if=/dev/urandom of=/pp/fs0/test bs=64k count=1 conv=notrunc    # src modified single 64k block
$ sudo zfs snap pp/fs0@s02
$ sudo zfs send -i @s01 pp/fs0@s02 | sudo zfs recv pp/fs1   # forward send, tgt will get corrupted
$ md5sum /pp/fs0/test /pp/fs1/test
7b8ab0f9b9bded4bb1cfe60b8d2ad925  /pp/fs0/test
229c5798dcf15440afafa734f7e42769  /pp/fs1/test

To fix this issue, when we encounter files with same gen but different
blksz. We look ahead into following records to see if WRITEs and FREEs
will completely overwrite existing content. If so, then it mean we are
safe to do truncate and take on new blksz, otherwise we don't truncate
and the blksz will remain different.

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Quality assurance (non-breaking change which makes the code more robust against bugs)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

amotin · 2026-06-12T13:39:27Z

I'd think if the block size was increased on fs1 by the write, backward replication to fs1 should increase the block size there too. The complications here I guess are coming from workarounds from the large blocks support being optional.

tuxoko · 2026-06-15T22:38:08Z

@amotin
Yeah I agree. That seems to be the behavior before the large block change.

Given the comment saying we disallow --large-block to switch to off in recv_check_large_blocks(),
then shouldn't we just always truncate and take on new size whenever there's block size mismatch like old version?
Instead of keeping current block size for same gen, which is what caused this problem in the first place?

However, if we blindly do that now, anyone with already mismatched file will lose data, so I'm not sure what's the best way going forward.

Signed-off-by: Chunwei Chen <david.chen@nutanix.com>

tuxoko · 2026-06-18T02:40:08Z

update to fix the corruption issue.

behlendorf · 2026-06-22T23:22:03Z

Adopting the new recordsize is what I'd expect, but we'll have to think that though carefully. With the latest push I see the rsend tests are now hitting this:

  [ 8043.578156] PANIC: zfs: accessing past end of object 541/aa (size=89600 access=65536+65536)
  [ 8043.582364] Showing stack for process 747172
  [ 8043.584715] CPU: 1 PID: 747172 Comm: xxh128sum Tainted: P           OEL    -------- -  - 4.18.0-553.134.1.el8_10.x86_64 #1
  [ 8043.590263] Hardware name: QEMU Ubuntu 24.04 PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
  [ 8043.595121] Call Trace:
  [ 8043.597581]  dump_stack+0x41/0x60
  [ 8043.599977]  vcmn_err.cold.0+0x68/0x80 [spl]
  [ 8043.630295]  zfs_panic_recover+0x6f/0x90 [zfs]
  [ 8043.633837]  dmu_buf_hold_array_by_dnode+0x226/0x750 [zfs]
  [ 8043.640224]  dmu_read_uio_dnode+0x5f/0x180 [zfs]
  [ 8043.648085]  zfs_read+0x2e0/0x510 [zfs]
  [ 8043.650993]  zpl_iter_read+0xa1/0x110 [zfs]
  [ 8043.653885]  new_sync_read+0x10f/0x160
  [ 8043.656442]  vfs_read+0x91/0x150
  [ 8043.658404]  ksys_read+0x4f/0xb0
  [ 8043.660213]  do_syscall_64+0x5b/0x1d0
  [ 8043.662145]  entry_SYSCALL_64_after_hwframe+0x66/0xcb
  [ 8043.664817] RIP: 0033:0x7f40aa5eb335

tuxoko · 2026-06-23T02:42:07Z

@behlendorf
Yes, I saw that. I think the problem is if we are single block and not power of 2. So we can't keep our block size and grow file at the same time. Before this change, we always truncate and take on new block size, but this is precisely what causes the corruption...

I'm not sure if there's a good way around this. Or maybe we should go back to always truncate with single block and just rewrite whatever we have...

tuxoko · 2026-06-24T06:17:05Z

Update to fix the above non P2 block size issue.
When we can't truncate and block size isn't P2, grow to the nearest P2 in case WRITE will go pass block size.

Also update test pool to include such file.

@s00

If source and target has different recordsize setting, it's possible to have same file to have different block size when reverse replication is allowed. This has two consequences in the current receive_handle_existing_object. If target side has more than 1 block, it will fail with EINVAL. ``` $ sudo zfs create -o recordsize=64k pp/fs0 $ sudo zfs create -o recordsize=128k pp/fs1 $ sudo dd if=/dev/urandom of=/pp/fs0/test bs=1M count=1 # src created with 64k $ sudo zfs snap pp/fs0@s00 $ sudo zfs send pp/fs0@s00 | sudo zfs recv -F pp/fs1 # tgt received with 64k $ sudo dd if=/dev/urandom of=/pp/fs1/test bs=1M count=1 # tgt truncated and rewritten with 128k $ sudo zfs snap pp/fs1@s01 $ sudo zfs send -i @s00 pp/fs1@s01 | sudo zfs recv pp/fs0 # reverse send, src remains 64k $ sudo dd if=/dev/urandom of=/pp/fs0/test bs=1M count=1 # src modified, remains 64k $ sudo zfs snap pp/fs0@s02 $ sudo zfs send -i @s01 pp/fs0@s02 | sudo zfs recv pp/fs1 # forward send failed cannot receive incremental stream: invalid backup stream ``` If target side has 1 block, and source modifies file partially, it will cause target to truncate everything and only left with the modified part, causing corruption. ``` $ sudo zfs create -o recordsize=64k pp/fs0 $ sudo zfs create -o recordsize=128k pp/fs1 $ sudo dd if=/dev/urandom of=/pp/fs0/test bs=128k count=1 # src created with 64k $ sudo zfs snap pp/fs0@s00 $ sudo zfs send pp/fs0@s00 | sudo zfs recv -F pp/fs1 # tgt received with 64k $ sudo dd if=/dev/urandom of=/pp/fs1/test bs=128k count=1 # tgt truncated and rewritten with 128k $ sudo zfs snap pp/fs1@s01 $ sudo zfs send -i @s00 pp/fs1@s01 | sudo zfs recv pp/fs0 # reverse send, src remains 64k $ sudo dd if=/dev/urandom of=/pp/fs0/test bs=64k count=1 conv=notrunc # src modified single 64k block $ sudo zfs snap pp/fs0@s02 $ sudo zfs send -i @s01 pp/fs0@s02 | sudo zfs recv pp/fs1 # forward send, tgt will get corrupted $ md5sum /pp/fs0/test /pp/fs1/test 7b8ab0f9b9bded4bb1cfe60b8d2ad925 /pp/fs0/test 229c5798dcf15440afafa734f7e42769 /pp/fs1/test ``` To fix this issue, when we encounter files with same gen but different blksz. We look ahead into following records to see if WRITEs and FREEs will completely overwrite existing content. If so, then it mean we are safe to do truncate and take on new blksz, otherwise we don't truncate and the blksz will remain different. Signed-off-by: Chunwei Chen <david.chen@nutanix.com>

Signed-off-by: Chunwei Chen <david.chen@nutanix.com>

tuxoko commented Jun 11, 2026

View reviewed changes

Comment thread module/zfs/dmu_recv.c Outdated

Add function to peek into bqueue

572dfa6

Signed-off-by: Chunwei Chen <david.chen@nutanix.com>

tuxoko force-pushed the recv_blksz branch from f3cc6cf to 082ec02 Compare June 18, 2026 02:38

tuxoko changed the title ~~Fix recv failure when source has smaller recordsize setting~~ Fix recv EINVAL/corruption when source has smaller recordsize setting Jun 18, 2026

tuxoko requested review from amotin and behlendorf June 18, 2026 02:40

tuxoko force-pushed the recv_blksz branch from 082ec02 to 5403a56 Compare June 19, 2026 01:36

behlendorf added the Status: Code Review Needed Ready for review and testing label Jun 22, 2026

tuxoko force-pushed the recv_blksz branch from 807ce46 to f1f20eb Compare June 24, 2026 06:15

davidchenntnx added 2 commits June 24, 2026 19:50

Add testcase for recv different blksz issue

d99bc17

Signed-off-by: Chunwei Chen <david.chen@nutanix.com>

tuxoko force-pushed the recv_blksz branch from f1f20eb to d99bc17 Compare June 24, 2026 19:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix recv EINVAL/corruption when source has smaller recordsize setting#18663

Fix recv EINVAL/corruption when source has smaller recordsize setting#18663
tuxoko wants to merge 3 commits into
openzfs:masterfrom
tuxoko:recv_blksz

tuxoko commented Jun 11, 2026 •

edited

Loading

Uh oh!

Uh oh!

amotin commented Jun 12, 2026

Uh oh!

tuxoko commented Jun 15, 2026

Uh oh!

tuxoko commented Jun 18, 2026

Uh oh!

behlendorf commented Jun 22, 2026

Uh oh!

tuxoko commented Jun 23, 2026

Uh oh!

tuxoko commented Jun 24, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

tuxoko commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

Uh oh!

Uh oh!

amotin commented Jun 12, 2026

Uh oh!

tuxoko commented Jun 15, 2026

Uh oh!

tuxoko commented Jun 18, 2026

Uh oh!

behlendorf commented Jun 22, 2026

Uh oh!

tuxoko commented Jun 23, 2026

Uh oh!

tuxoko commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

tuxoko commented Jun 11, 2026 •

edited

Loading

tuxoko commented Jun 24, 2026 •

edited

Loading