Skip to content

Fix recv EINVAL/corruption when source has smaller recordsize setting#18663

Open
tuxoko wants to merge 3 commits into
openzfs:masterfrom
tuxoko:recv_blksz
Open

Fix recv EINVAL/corruption when source has smaller recordsize setting#18663
tuxoko wants to merge 3 commits into
openzfs:masterfrom
tuxoko:recv_blksz

Conversation

@tuxoko

@tuxoko tuxoko commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

If source and target has different recordsize setting, it's possible to
have same file to have different block size when reverse replication is
allowed. This has two consequences in the current
receive_handle_existing_object.

If target side has more than 1 block, it will fail with EINVAL.

$ sudo zfs create -o recordsize=64k pp/fs0
$ sudo zfs create -o recordsize=128k pp/fs1

$ sudo dd if=/dev/urandom of=/pp/fs0/test bs=1M count=1     # src created with 64k
$ sudo zfs snap pp/fs0@s00
$ sudo zfs send pp/fs0@s00 | sudo zfs recv -F pp/fs1        # tgt received with 64k
$ sudo dd if=/dev/urandom of=/pp/fs1/test bs=1M count=1     # tgt truncated and rewritten with 128k
$ sudo zfs snap pp/fs1@s01
$ sudo zfs send -i @s00 pp/fs1@s01 | sudo zfs recv pp/fs0   # reverse send, src remains 64k
$ sudo dd if=/dev/urandom of=/pp/fs0/test bs=1M count=1     # src modified, remains 64k
$ sudo zfs snap pp/fs0@s02
$ sudo zfs send -i @s01 pp/fs0@s02 | sudo zfs recv pp/fs1   # forward send failed
cannot receive incremental stream: invalid backup stream

If target side has 1 block, and source modifies file partially, it will
cause target to truncate everything and only left with the modified
part, causing corruption.

$ sudo zfs create -o recordsize=64k pp/fs0
$ sudo zfs create -o recordsize=128k pp/fs1

$ sudo dd if=/dev/urandom of=/pp/fs0/test bs=128k count=1   # src created with 64k
$ sudo zfs snap pp/fs0@s00
$ sudo zfs send pp/fs0@s00 | sudo zfs recv -F pp/fs1        # tgt received with 64k
$ sudo dd if=/dev/urandom of=/pp/fs1/test bs=128k count=1   # tgt truncated and rewritten with 128k
$ sudo zfs snap pp/fs1@s01
$ sudo zfs send -i @s00 pp/fs1@s01 | sudo zfs recv pp/fs0   # reverse send, src remains 64k
$ sudo dd if=/dev/urandom of=/pp/fs0/test bs=64k count=1 conv=notrunc    # src modified single 64k block
$ sudo zfs snap pp/fs0@s02
$ sudo zfs send -i @s01 pp/fs0@s02 | sudo zfs recv pp/fs1   # forward send, tgt will get corrupted
$ md5sum /pp/fs0/test /pp/fs1/test
7b8ab0f9b9bded4bb1cfe60b8d2ad925  /pp/fs0/test
229c5798dcf15440afafa734f7e42769  /pp/fs1/test

To fix this issue, when we encounter files with same gen but different
blksz. We look ahead into following records to see if WRITEs and FREEs
will completely overwrite existing content. If so, then it mean we are
safe to do truncate and take on new blksz, otherwise we don't truncate
and the blksz will remain different.

Motivation and Context

Description

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Comment thread module/zfs/dmu_recv.c Outdated
@amotin

amotin commented Jun 12, 2026

Copy link
Copy Markdown
Member

I'd think if the block size was increased on fs1 by the write, backward replication to fs1 should increase the block size there too. The complications here I guess are coming from workarounds from the large blocks support being optional.

@tuxoko

tuxoko commented Jun 15, 2026

Copy link
Copy Markdown
Contributor Author

@amotin
Yeah I agree. That seems to be the behavior before the large block change.

Given the comment saying we disallow --large-block to switch to off in recv_check_large_blocks(),
then shouldn't we just always truncate and take on new size whenever there's block size mismatch like old version?
Instead of keeping current block size for same gen, which is what caused this problem in the first place?

However, if we blindly do that now, anyone with already mismatched file will lose data, so I'm not sure what's the best way going forward.

Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
@tuxoko tuxoko changed the title Fix recv failure when source has smaller recordsize setting Fix recv EINVAL/corruption when source has smaller recordsize setting Jun 18, 2026
@tuxoko

tuxoko commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

update to fix the corruption issue.

@tuxoko tuxoko requested review from amotin and behlendorf June 18, 2026 02:40
@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Jun 22, 2026
@behlendorf

Copy link
Copy Markdown
Contributor

Adopting the new recordsize is what I'd expect, but we'll have to think that though carefully. With the latest push I see the rsend tests are now hitting this:

  [ 8043.578156] PANIC: zfs: accessing past end of object 541/aa (size=89600 access=65536+65536)
  [ 8043.582364] Showing stack for process 747172
  [ 8043.584715] CPU: 1 PID: 747172 Comm: xxh128sum Tainted: P           OEL    -------- -  - 4.18.0-553.134.1.el8_10.x86_64 #1
  [ 8043.590263] Hardware name: QEMU Ubuntu 24.04 PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
  [ 8043.595121] Call Trace:
  [ 8043.597581]  dump_stack+0x41/0x60
  [ 8043.599977]  vcmn_err.cold.0+0x68/0x80 [spl]
  [ 8043.630295]  zfs_panic_recover+0x6f/0x90 [zfs]
  [ 8043.633837]  dmu_buf_hold_array_by_dnode+0x226/0x750 [zfs]
  [ 8043.640224]  dmu_read_uio_dnode+0x5f/0x180 [zfs]
  [ 8043.648085]  zfs_read+0x2e0/0x510 [zfs]
  [ 8043.650993]  zpl_iter_read+0xa1/0x110 [zfs]
  [ 8043.653885]  new_sync_read+0x10f/0x160
  [ 8043.656442]  vfs_read+0x91/0x150
  [ 8043.658404]  ksys_read+0x4f/0xb0
  [ 8043.660213]  do_syscall_64+0x5b/0x1d0
  [ 8043.662145]  entry_SYSCALL_64_after_hwframe+0x66/0xcb
  [ 8043.664817] RIP: 0033:0x7f40aa5eb335

@tuxoko

tuxoko commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

@behlendorf
Yes, I saw that. I think the problem is if we are single block and not power of 2. So we can't keep our block size and grow file at the same time. Before this change, we always truncate and take on new block size, but this is precisely what causes the corruption...

I'm not sure if there's a good way around this. Or maybe we should go back to always truncate with single block and just rewrite whatever we have...

@tuxoko

tuxoko commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Update to fix the above non P2 block size issue.
When we can't truncate and block size isn't P2, grow to the nearest P2 in case WRITE will go pass block size.

Also update test pool to include such file.

If source and target has different recordsize setting, it's possible to
have same file to have different block size when reverse replication is
allowed. This has two consequences in the current
receive_handle_existing_object.

If target side has more than 1 block, it will fail with EINVAL.

```
$ sudo zfs create -o recordsize=64k pp/fs0
$ sudo zfs create -o recordsize=128k pp/fs1

$ sudo dd if=/dev/urandom of=/pp/fs0/test bs=1M count=1     # src created with 64k
$ sudo zfs snap pp/fs0@s00
$ sudo zfs send pp/fs0@s00 | sudo zfs recv -F pp/fs1        # tgt received with 64k
$ sudo dd if=/dev/urandom of=/pp/fs1/test bs=1M count=1     # tgt truncated and rewritten with 128k
$ sudo zfs snap pp/fs1@s01
$ sudo zfs send -i @s00 pp/fs1@s01 | sudo zfs recv pp/fs0   # reverse send, src remains 64k
$ sudo dd if=/dev/urandom of=/pp/fs0/test bs=1M count=1     # src modified, remains 64k
$ sudo zfs snap pp/fs0@s02
$ sudo zfs send -i @s01 pp/fs0@s02 | sudo zfs recv pp/fs1   # forward send failed
cannot receive incremental stream: invalid backup stream
```

If target side has 1 block, and source modifies file partially, it will
cause target to truncate everything and only left with the modified
part, causing corruption.

```
$ sudo zfs create -o recordsize=64k pp/fs0
$ sudo zfs create -o recordsize=128k pp/fs1

$ sudo dd if=/dev/urandom of=/pp/fs0/test bs=128k count=1   # src created with 64k
$ sudo zfs snap pp/fs0@s00
$ sudo zfs send pp/fs0@s00 | sudo zfs recv -F pp/fs1        # tgt received with 64k
$ sudo dd if=/dev/urandom of=/pp/fs1/test bs=128k count=1   # tgt truncated and rewritten with 128k
$ sudo zfs snap pp/fs1@s01
$ sudo zfs send -i @s00 pp/fs1@s01 | sudo zfs recv pp/fs0   # reverse send, src remains 64k
$ sudo dd if=/dev/urandom of=/pp/fs0/test bs=64k count=1 conv=notrunc    # src modified single 64k block
$ sudo zfs snap pp/fs0@s02
$ sudo zfs send -i @s01 pp/fs0@s02 | sudo zfs recv pp/fs1   # forward send, tgt will get corrupted
$ md5sum /pp/fs0/test /pp/fs1/test
7b8ab0f9b9bded4bb1cfe60b8d2ad925  /pp/fs0/test
229c5798dcf15440afafa734f7e42769  /pp/fs1/test
```

To fix this issue, when we encounter files with same gen but different
blksz. We look ahead into following records to see if WRITEs and FREEs
will completely overwrite existing content. If so, then it mean we are
safe to do truncate and take on new blksz, otherwise we don't truncate
and the blksz will remain different.

Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Status: Code Review Needed Ready for review and testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants