Skip to content

ensure chunk is not out of bound, which will bring data corruption.#18620

Open
tiehexue wants to merge 2 commits into
openzfs:masterfrom
tiehexue:zap-leaf-bound-checking
Open

ensure chunk is not out of bound, which will bring data corruption.#18620
tiehexue wants to merge 2 commits into
openzfs:masterfrom
tiehexue:zap-leaf-bound-checking

Conversation

@tiehexue

@tiehexue tiehexue commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Motivation and Context

This PR is to fix #18572.

There is no bound checking in production build in zap_leaf.c, ASSERT is no-op. So if some field corrupted, e.g. memory hardware error, other software bug, zap leaf will go into silently corruption. Refer to #18572 for details.

Description

Assume that the root cause is one bit-flip in fields in CHAIN_END. This is PR do following to avoid and recover from data corruption:

  1. we only care about fileds with CHAIN_END, e.g. la_next, lf_next, le_next, l_hash, other values are not able to check. We basically stop at where corruption happens.
  2. most of comparison to CHAIN_END are replaced with l_chunk_count. l_chunk_count does prevent from one big-flip of CHAIN_END, it is also an in-bound checking.
  3. when free a chunk, we will ignore it if it is invalid.
  4. we do not check a chunk when alloc, because it is checked before the alloc function for available chunks.
  5. l_chunk_count need additional memory, however, it did not downgrade performance.

How Has This Been Tested?

Tested in normal cases, creating/deleting.

And use test code to create a corrupted directory, then use new code to do "ls -la", "cp", it works, no soft lockup. But also, there is data lost.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@tiehexue tiehexue force-pushed the zap-leaf-bound-checking branch from 67b1ce8 to c1d6078 Compare June 3, 2026 11:45
@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Jun 4, 2026
@tiehexue tiehexue force-pushed the zap-leaf-bound-checking branch from c1d6078 to a618c82 Compare June 4, 2026 06:14
@tiehexue

tiehexue commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

should we concern the zloop test failed in github checks. For a specific test like "ztest -G -VVVVV -K draid -m 0 -r 27 -D 9 -S 2 -R 2 -v 0 -a 9 -C special=random -s 512m -f /mnt/zloop/zloop-run -T 120 -P 60", it is normally failed in different reason in my local dev in both master branch or this PR. And sometimes tests pass.

Signed-off-by: tiehexue <tiehexue@hotmail.com>
@tiehexue tiehexue force-pushed the zap-leaf-bound-checking branch from a618c82 to 1c9117f Compare June 4, 2026 12:08
@tiehexue

tiehexue commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

should we concern the zloop test failed in github checks. For a specific test like "ztest -G -VVVVV -K draid -m 0 -r 27 -D 9 -S 2 -R 2 -v 0 -a 9 -C special=random -s 512m -f /mnt/zloop/zloop-run -T 120 -P 60", it is normally failed in different reason in my local dev in both master branch or this PR. And sometimes tests pass.

I did another force-push to run the zloop again, and it just succeed.

@amotin

amotin commented Jun 4, 2026

Copy link
Copy Markdown
Member

I have doubts about productivity of this. We can't catch all possible bitflips.

@tiehexue

tiehexue commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

I have doubts about productivity of this. We can't catch all possible bitflips.

Yes. But this is a bug for decades. This PR does not bring too much overhead, just replacing comparison against CHAIN_END with l_chunk_count, and even using l_chunk_count to replace the macro for counting which should be good for performance. The cost is a new field in memory.

@tiehexue tiehexue mentioned this pull request Jun 18, 2026
14 tasks
@ryao

ryao commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

I have doubts about productivity of this. We can't catch all possible bitflips.

I have the same doubts. There are countless places in on-disk structures where faults cause horrible things to happen if they were written with a good checksum. Doing something about this particular case requires a reason to think it is common, such as a bug in older releases that enables it. I am not aware of one.

@tiehexue

Copy link
Copy Markdown
Contributor Author

@ryao @amotin hi, would you like to look at #18572 , where I stated how I reproduced the bug, how to test this, and @robn also mentioned that there were a lot similar bugs/issues reported, and he also made a patch but did not merge.

So I have to say more for your attention:

  1. we do not need to take care of "countless places", it is just one issue/bug which happens repeatedly in decades. The root cause may be bit-flip, or just out-of-boundary access which damages a fatzap (no segmentation fault in kernel module).

  2. Or, we just thinking in the other way: la_next, lf_next, le_next need a value as ending, it could be NULL, CHAIN_END or just ZAP_LEAF_NUMCHUNKS(l). ZAP_LEAF_NUMCHUNKS(l) will need computation every time, so, I add a member in zap_leaf_t as l_chunk_count. With l_chunk_count, the boundary check is more concise, and it has a good side effect: a bit-flip in CHAIN_END (like 0xffef in the issue Reproducible ZAP-leaf chunk-chain corruption causes soft lockup in zfs_readdir / zap_lookup #18572 ) will not bring chaos.

@behlendorf

behlendorf commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

We can't catch all possible bit flips, but if we can efficiently detect and handle internally inconsistent on-disk state we should do so. We already do something similar with zfs_blkptr_verify() to check for obvious damage in block pointers even when the checksum is valid. It's not perfect, but it acknowledges the reality that rare events do happen at scale.

I'm not aware of any existing or previously fixed bug which would explain this, but this has been reported often enough over the years I think it's reasonable to include a check for it.

@behlendorf behlendorf left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still working my way through this and will pick it up tomorrow, but generally speaking we should return an error wherever possible instead of logging a debug message which will never be seen.

Comment thread include/sys/zap_leaf.h
there is 4-bytes hole before, now
2 left, checked with pahole.

Signed-off-by: tiehexue <tiehexue@hotmail.com>
@tiehexue

tiehexue commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

I'm still working my way through this and will pick it up tomorrow, but generally speaking we should return an error wherever possible instead of logging a debug message which will never be seen.

Thanks for your review.

Adding a member to a structure is nervous, and luckily it is not a on-disk one.

For the debug message, there are two thought: 1) I think when bad things happens, the user would find a directory is not listable, he would check debug message after enable it; 2) these are void methods, I am not sure how to post out errors rather than panic. But panic is not what I want, keep silent, keep the system acting normally as much as possible, may be better.

Let me know a better way.

@amotin

amotin commented Jun 24, 2026

Copy link
Copy Markdown
Member

But panic is not what I want, keep silent, keep the system acting normally as much as possible, may be better.

Nope. Because it is unpredictable. If we assume the errors are possible there, then either make functions return status that will be verified, or panic. Silent return with uninitialized buffer is a request for troubles, that will be impossible to debug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Status: Code Review Needed Ready for review and testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reproducible ZAP-leaf chunk-chain corruption causes soft lockup in zfs_readdir / zap_lookup

4 participants