ensure chunk is not out of bound, which will bring data corruption. by tiehexue · Pull Request #18620 · openzfs/zfs

tiehexue · 2026-06-03T10:41:15Z

Motivation and Context

This PR fixes #18572.

Production builds of zap_leaf.c perform no bounds checking, because ASSERT is a no-op there. If a field becomes corrupted, for example through a memory hardware error or a bug in other software, the corruption of the ZAP leaf goes undetected. Refer to #18572 for details.
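
To illustrate the point about ASSERT, here is a minimal sketch, not the actual ZFS macro. It assumes a hypothetical `MY_ASSERT` that compiles to nothing unless `MY_DEBUG` is defined, and puts it next to an explicit check that survives in every build. All names (`MY_ASSERT`, `MY_DEBUG`, `NUM_CHUNKS`, `get_chunk_debug_only`, `get_chunk_checked`) are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define	NUM_CHUNKS	8	/* hypothetical leaf size, illustration only */

/* Hypothetical debug-only assertion: expands to nothing without MY_DEBUG. */
#ifdef MY_DEBUG
#define	MY_ASSERT(cond)	assert(cond)
#else
#define	MY_ASSERT(cond)	((void)0)
#endif

static int chunks[NUM_CHUNKS] = { 10, 11, 12, 13, 14, 15, 16, 17 };

/*
 * Relies on the assertion alone. Without MY_DEBUG an out-of-range idx is
 * not caught and the array access goes out of bounds, so this must only
 * be called with idx < NUM_CHUNKS.
 */
static int
get_chunk_debug_only(uint16_t idx)
{
	MY_ASSERT(idx < NUM_CHUNKS);
	return (chunks[idx]);
}

/*
 * Explicit bound check that is compiled into every build. Returns 0 and
 * stores the value on success, or -1 if idx is out of range.
 */
static int
get_chunk_checked(uint16_t idx, int *out)
{
	if (idx >= NUM_CHUNKS)
		return (-1);
	*out = chunks[idx];
	return (0);
}
```

With this shape, a bit-flipped index such as 0xffef is rejected by `get_chunk_checked()` in any build, while `get_chunk_debug_only()` would only notice it when `MY_DEBUG` is defined.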

Description

Assume the root cause is a single bit-flip in a field that holds CHAIN_END. This PR does the following to avoid and recover from data corruption:

We only care about fields that can hold CHAIN_END, e.g. la_next, lf_next, le_next and l_hash; other values cannot be checked. We basically stop where the corruption happens.
Most comparisons against CHAIN_END are replaced with comparisons against l_chunk_count. l_chunk_count protects against a single bit-flip of CHAIN_END, and it also serves as an in-bounds check.
When freeing a chunk, we ignore it if it is invalid.
We do not check a chunk when allocating, because the check for available chunks is done before the alloc function is called.
l_chunk_count needs additional memory, but it does not degrade performance.
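
The list above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in the PR: `toy_leaf_t`, `NCHUNKS`, `toy_chain_length` and `toy_chunk_free` are hypothetical, and only `CHAIN_END` and `l_chunk_count` are names taken from the description. The free function here reports the invalid chunk through its return value, whereas the description only says the chunk is ignored.

```c
#include <assert.h>
#include <stdint.h>

#define	CHAIN_END	0xffff	/* end-of-chain marker, as named in the PR */
#define	NCHUNKS		16	/* hypothetical chunk count for this sketch */

/* Simplified stand-in for a ZAP leaf; not the real zap_leaf_t layout. */
typedef struct toy_leaf {
	uint16_t l_chunk_count;		/* cached number of valid chunks */
	uint16_t next[NCHUNKS];		/* next[i]: chunk following chunk i */
	uint8_t is_free[NCHUNKS];	/* 1 if chunk i has been freed */
} toy_leaf_t;

/*
 * Walk a chain starting at head. Any index >= l_chunk_count ends the
 * walk, so CHAIN_END (0xffff) and a bit-flipped 0xffef both terminate it
 * without indexing outside next[]. Returns the number of chunks visited.
 * Assumes l_chunk_count <= NCHUNKS and a chain without cycles.
 */
static int
toy_chain_length(const toy_leaf_t *l, uint16_t head)
{
	int n = 0;
	uint16_t chunk = head;

	while (chunk < l->l_chunk_count) {
		n++;
		chunk = l->next[chunk];
	}
	return (n);
}

/* Free a chunk. An out-of-range index is left alone and -1 is returned. */
static int
toy_chunk_free(toy_leaf_t *l, uint16_t chunk)
{
	if (chunk >= l->l_chunk_count)
		return (-1);
	l->is_free[chunk] = 1;
	return (0);
}
```

A loop written as `while (chunk != CHAIN_END)` would treat 0xffef as a valid chunk index and read past the array; comparing against the chunk count handles both values the same way.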

How Has This Been Tested?

Tested the normal cases, creating/deleting.

I also used test code to create a corrupted directory, then ran "ls -la" and "cp" against it with the new code. Both work, with no soft lockup. However, some data is still lost.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Quality assurance (non-breaking change which makes the code more robust against bugs)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

tiehexue · 2026-06-04T07:01:40Z

Should we be concerned that the zloop test failed in the GitHub checks? A specific test such as "ztest -G -VVVVV -K draid -m 0 -r 27 -D 9 -S 2 -R 2 -v 0 -a 9 -C special=random -s 512m -f /mnt/zloop/zloop-run -T 120 -P 60" regularly fails for varying reasons in my local dev environment, on both the master branch and this PR, and sometimes it passes.

tiehexue · 2026-06-04T14:40:45Z

> Should we be concerned that the zloop test failed in the GitHub checks? A specific test such as "ztest -G -VVVVV -K draid -m 0 -r 27 -D 9 -S 2 -R 2 -v 0 -a 9 -C special=random -s 512m -f /mnt/zloop/zloop-run -T 120 -P 60" regularly fails for varying reasons in my local dev environment, on both the master branch and this PR, and sometimes it passes.

I did another force-push to run zloop again, and this time it succeeded.

amotin · 2026-06-04T16:08:24Z

I have doubts about productivity of this. We can't catch all possible bitflips.

tiehexue · 2026-06-04T16:46:23Z

> I have doubts about productivity of this. We can't catch all possible bitflips.

Yes, but this bug has been around for decades. This PR does not add much overhead: it only replaces comparisons against CHAIN_END with comparisons against l_chunk_count, and it even uses l_chunk_count in place of the macro that computes the count, which should be good for performance. The cost is one new in-memory field.

ryao · 2026-06-18T20:30:52Z

> I have doubts about productivity of this. We can't catch all possible bitflips.

I have the same doubts. There are countless places in on-disk structures where faults cause horrible things to happen if they were written with a good checksum. Doing something about this particular case requires a reason to think it is common, such as a bug in older releases that enables it. I am not aware of one.

tiehexue · 2026-06-20T15:43:11Z

@ryao @amotin Hi, would you take a look at #18572? There I described how I reproduced the bug and how to test it. @robn also mentioned that many similar bugs and issues have been reported, and that he wrote a patch which was not merged.

So let me add a few points for your attention:

We do not need to take care of "countless places"; this is a single issue that has occurred repeatedly over decades. The root cause may be a bit-flip, or simply an out-of-bounds access that damages a fatzap (there is no segmentation fault in a kernel module).
Or think of it another way: la_next, lf_next and le_next need some value to mark the end. It could be NULL, CHAIN_END, or just ZAP_LEAF_NUMCHUNKS(l). ZAP_LEAF_NUMCHUNKS(l) has to be computed every time, so I added a member l_chunk_count to zap_leaf_t. With l_chunk_count the boundary check is more concise, and it has a nice side effect: a bit-flip in CHAIN_END (such as the 0xffef in issue #18572, "Reproducible ZAP-leaf chunk-chain corruption causes soft lockup in zfs_readdir / zap_lookup") no longer causes chaos.
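
The "any value that is not a valid chunk marks the end" idea, together with computing the count once and caching it, might look like the sketch below. The constants and the formula in `toy_numchunks()` are illustrative assumptions and are not taken from zap_leaf.h; `toy_leaf_t`, `toy_leaf_init` and `toy_is_chain_end` are likewise hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative constants only. The real ZAP_LEAF_NUMCHUNKS(l) macro has
 * its own definition; this stand-in just needs to be "some arithmetic
 * on the block size".
 */
#define	TOY_CHUNK_SIZE	24
#define	TOY_HEADER_SIZE	1024

typedef struct toy_leaf {
	int l_bs;			/* log2 of the block size */
	uint16_t l_chunk_count;		/* cached result of toy_numchunks() */
} toy_leaf_t;

/* Number of chunks that fit in a block of (1 << bs) bytes. */
static uint16_t
toy_numchunks(int bs)
{
	return ((uint16_t)(((1u << bs) - TOY_HEADER_SIZE) / TOY_CHUNK_SIZE));
}

/* Compute the chunk count once, when the leaf is set up. */
static void
toy_leaf_init(toy_leaf_t *l, int bs)
{
	l->l_bs = bs;
	l->l_chunk_count = toy_numchunks(bs);
}

/* A link value ends the chain if it does not name a valid chunk. */
static int
toy_is_chain_end(const toy_leaf_t *l, uint16_t link)
{
	return (link >= l->l_chunk_count);
}
```

Under these assumptions a 16 KiB block (`bs` of 14) yields 640 chunks, so 0xffff, 0xffef and 640 all count as end markers while 639 is still a valid link.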

behlendorf · 2026-06-23T00:29:01Z

We can't catch all possible bit flips, but if we can efficiently detect and handle internally inconsistent on-disk state we should do so. We already do something similar with zfs_blkptr_verify() to check for obvious damage in block pointers even when the checksum is valid. It's not perfect, but it acknowledges the reality that rare events do happen at scale.

I'm not aware of any existing or previously fixed bug which would explain this, but this has been reported often enough over the years I think it's reasonable to include a check for it.

behlendorf

I'm still working my way through this and will pick it up tomorrow, but generally speaking we should return an error wherever possible instead of logging a debug message which will never be seen.

tiehexue · 2026-06-23T02:57:02Z

> I'm still working my way through this and will pick it up tomorrow, but generally speaking we should return an error wherever possible instead of logging a debug message which will never be seen.

Thanks for your review.

Adding a member to a structure makes me nervous; luckily it is not an on-disk one.

About the debug message, I have two thoughts: 1) when something bad happens, the user would find that a directory cannot be listed, and would then enable and check the debug messages; 2) these are void functions, and I am not sure how to report errors other than by panicking. But a panic is not what I want; staying silent and keeping the system running as normally as possible may be better.

Let me know if there is a better way.

amotin · 2026-06-24T17:55:07Z

> But a panic is not what I want; staying silent and keeping the system running as normally as possible may be better.

Nope, because that is unpredictable. If we assume errors are possible there, then either make the functions return a status that gets verified, or panic. A silent return with an uninitialized buffer is asking for trouble that will be impossible to debug.
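
As a rough illustration of "return a status that gets verified", the sketch below turns a chain-reading helper into a function that returns an errno value instead of returning silently. It is an assumption-laden toy, not code from the PR: `toy_leaf_t`, `toy_array_read`, `NCHUNKS` and `CHUNK_BYTES` are hypothetical, and plain `EIO` is used only as a placeholder for whatever error the real code would choose.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define	NCHUNKS		4	/* hypothetical chunk count */
#define	CHUNK_BYTES	8	/* hypothetical payload bytes per chunk */

typedef struct toy_leaf {
	uint16_t l_chunk_count;
	uint16_t next[NCHUNKS];
	uint8_t data[NCHUNKS][CHUNK_BYTES];
} toy_leaf_t;

/*
 * Copy buflen bytes from the chain starting at chunk into buf.
 * Returns 0 on success. Returns EIO if the chain ends, or leaves the
 * valid range, before buflen bytes were copied, so the caller never
 * mistakes a partially filled buffer for valid data.
 */
static int
toy_array_read(const toy_leaf_t *l, uint16_t chunk, uint8_t *buf,
    size_t buflen)
{
	size_t done = 0;

	while (done < buflen) {
		if (chunk >= l->l_chunk_count)
			return (EIO);
		size_t n = buflen - done;
		if (n > CHUNK_BYTES)
			n = CHUNK_BYTES;
		memcpy(buf + done, l->data[chunk], n);
		done += n;
		chunk = l->next[chunk];
	}
	return (0);
}
```

A caller would then check the result and propagate the error, rather than continuing with whatever happened to be in `buf`.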

tiehexue force-pushed the zap-leaf-bound-checking branch from 67b1ce8 to c1d6078 Compare June 3, 2026 11:45

behlendorf added the Status: Code Review Needed Ready for review and testing label Jun 4, 2026

tiehexue force-pushed the zap-leaf-bound-checking branch from c1d6078 to a618c82 Compare June 4, 2026 06:14

ensure chunk in zap leaf is not out of bound.

1c9117f

Signed-off-by: tiehexue <tiehexue@hotmail.com>

tiehexue force-pushed the zap-leaf-bound-checking branch from a618c82 to 1c9117f Compare June 4, 2026 12:08

tiehexue mentioned this pull request Jun 18, 2026

true async io in linux #18684

behlendorf reviewed Jun 23, 2026

Comment thread include/sys/zap_leaf.h

avoid increasing the structure size

7067de8

There was a 4-byte hole before; now 2 bytes are left, checked with pahole. Signed-off-by: tiehexue <tiehexue@hotmail.com>
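
The effect described in this commit message can be reproduced with a small, hypothetical pair of structs. This is only a sketch of the general padding behaviour; the field names are invented and do not reflect the layout of zap_leaf_t. It assumes an ABI where `uint64_t` is 8-byte aligned, as on common 64-bit platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structs; the fields are not those of zap_leaf_t. */
struct before {
	uint32_t a;
	/* with 8-byte alignment of b, the compiler pads 4 bytes here */
	uint64_t b;
};

struct after {
	uint32_t a;
	uint16_t chunk_count;	/* new field, placed in the former hole */
	uint64_t b;
};

/* Bytes of padding between the end of a and the start of b. */
static size_t
hole_before(void)
{
	return (offsetof(struct before, b) -
	    (offsetof(struct before, a) + sizeof (uint32_t)));
}

/* Bytes of padding between the end of chunk_count and the start of b. */
static size_t
hole_after(void)
{
	return (offsetof(struct after, b) -
	    (offsetof(struct after, chunk_count) + sizeof (uint16_t)));
}
```

Where the hole exists, `sizeof (struct after)` equals `sizeof (struct before)` and the remaining hole shrinks by two bytes, which is the kind of layout information pahole reports.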

tiehexue wants to merge 2 commits into openzfs:master from tiehexue:zap-leaf-bound-checking

4 participants
