Be more careful with locking db.db_mtx by asomers · Pull Request #17418 · openzfs/zfs

asomers · 2025-06-03T21:13:58Z

Lock db->db_mtx in some places that access db->db_data. But don't lock it in free_children, even though it does access db->db_data, because that leads to a recurse-on-non-recursive panic.

Lock db->db_rwlock in some places that access db->db.db_data's contents.

Closes #16626
Sponsored by: ConnectWise

Motivation and Context

Fixes occasional in-memory corruption, which usually manifests as a panic with a message like "blkptr XXX has invalid XXX" or "blkptr XXX has no valid DVAs". I suspect that some on-disk corruption bugs share this root cause, too.

Description

Always lock dmu_buf_impl_t.db_mtx in places that access the value of dmu_buf_impl_t.db.db_data. And always lock dmu_buf_impl_t.db_rwlock in places that access the contents of dmu_buf_impl_t.db.db_data.

Note that free_children still violates these rules. It can't easily be fixed without causing other problems. A proper fix is left for the future.

How Has This Been Tested?

I cannot reproduce the bug on command, so I had to rely on statistics to validate the patch.

Since the beginning of 2025, servers running the vulnerable workload on FreeBSD 14.1 without this patch have crashed with a probability of 0.34% per server per day. The distribution of crashes fits a Poisson distribution, suggesting that each crash is random and independent. That is, a server that's already crashed once is no more likely to crash in the future than one which hasn't crashed yet.
Servers running the vulnerable workload on FreeBSD 14.2 with this patch have accumulated a total of 1301 days of uptime with no crashes. So I conclude with 98.8% confidence that the 14.2 upgrade combined with the patch is effective.
Servers running the vulnerable workload on FreeBSD 14.2 without the patch are too few to draw conclusions about. But I don't see any related changes in the diff between 14.1 and 14.2. So I think that the patch is responsible for the cessation of crashes, not the upgrade.
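
The 98.8% figure is consistent with a simple Poisson calculation. The sketch below assumes (the description does not say so explicitly) that the unpatched rate of 0.34% per server per day would otherwise apply to all 1301 patched server-days, and asks how likely zero crashes would then be:

```latex
\begin{align*}
\lambda &= 0.0034 \ \text{crashes per server-day}, \qquad T = 1301 \ \text{server-days} \\
P(\text{no crashes in } T) &= e^{-\lambda T} = e^{-0.0034 \times 1301} = e^{-4.4234} \approx 0.012 \\
1 - P(\text{no crashes in } T) &\approx 0.988
\end{align*}
```

In other words, if the patched servers were still crashing at the old rate, a crash-free run this long would be expected only about 1.2% of the time.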

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Quality assurance (non-breaking change which makes the code more robust against bugs)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

alek-p

I've already reviewed this internally, and as the PR description states, we've had a good experience running with this patch for the last couple of months

amotin · 2025-06-04T18:18:37Z

As far as I can see, in most cases (I've spotted only one exception) when you take db_rwlock, you also take db_mtx. That makes no sense to me, unless the few exceptions are enormously expensive or otherwise don't allow db_mtx to be taken. I feel like we need a better understanding of the locking strategy. At least I do.

snajpa · 2025-06-04T18:39:09Z

FWIW, as we're discussing here, I even think - after all the staring at the code - that the locking itself is actually fine, it seems to be a result of optimizations exactly because things don't need to be overlocked if it's guaranteed to be OK via other logical dependencies.

I think I have actually nailed where the problem is, but @asomers says he can't try it :)

asomers · 2025-06-04T20:03:42Z

> As far as I can see, in most cases (I've spotted only one exception) when you take db_rwlock, you also take db_mtx. That makes no sense to me, unless the few exceptions are enormously expensive or otherwise don't allow db_mtx to be taken. I feel like we need a better understanding of the locking strategy. At least I do.

That's because of this comment from @pcd1193182: "So the subtlety here is that the value of the db.db_data and db_buf fields are, I believe, still protected by the db_mtx plus the db_holds refcount. The contents of the buffers are protected by the db_rwlock." So many places need both db_mtx and db_rwlock. Some need only the former. I don't know of any cases where code would only need the latter.
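
The split described in that quote (db_mtx for the value of the pointer, db_rwlock for the bytes it points at) can be illustrated with a small user-space model. This is only a sketch: `toy_mutex_t`, `toy_rwlock_t`, `toy_dbuf_t` and `toy_read_word` are hypothetical names, the "locks" are plain flags that work in a single thread only, and the acquisition order shown is illustrative rather than a statement about the real lock ordering in dbuf.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy single-threaded stand-ins for the kernel lock types (assumption). */
typedef struct { bool held; } toy_mutex_t;
typedef struct { int readers; bool writer; } toy_rwlock_t;

static void toy_mutex_enter(toy_mutex_t *m) { assert(!m->held); m->held = true; }
static void toy_mutex_exit(toy_mutex_t *m) { assert(m->held); m->held = false; }
static void toy_rw_enter_reader(toy_rwlock_t *l) { assert(!l->writer); l->readers++; }
static void toy_rw_exit_reader(toy_rwlock_t *l) { assert(l->readers > 0); l->readers--; }

/* Hypothetical, heavily reduced model of a dbuf. */
typedef struct {
	toy_mutex_t db_mtx;	/* guards the value of db_data (the pointer) */
	toy_rwlock_t db_rwlock;	/* guards the bytes that db_data points at */
	uint64_t *db_data;
} toy_dbuf_t;

/*
 * Read one word of the buffer. The mutex keeps the pointer from being
 * swapped out from under us; the rwlock keeps a writer from changing the
 * contents while we read them.
 */
static uint64_t
toy_read_word(toy_dbuf_t *db, int idx)
{
	uint64_t val;

	toy_mutex_enter(&db->db_mtx);
	toy_rw_enter_reader(&db->db_rwlock);
	val = db->db_data[idx];
	toy_rw_exit_reader(&db->db_rwlock);
	toy_mutex_exit(&db->db_mtx);
	return (val);
}
```

A caller that only needs to know where the buffer lives, without looking inside it, would take just the mutex in this model, which matches the "some need only the former" case.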

snajpa · 2025-06-04T20:35:17Z

I'm sorry, I mixed it up. This is definitely needed and then there's a bug with dbuf resize. Two different things.

satmandu · 2025-08-12T16:34:39Z

@asomers Are you still awaiting reviewers on this? I've been running with the changes from this PR without any issues for a while now. It would be nice to get in all the "prevents corruption" PRs before 2.4.0.

clhedrick · 2025-08-12T18:43:45Z

Does this apply to 2.2.8 also?

amotin

OK. I went through all this, and I believe most of the locking is not needed -- see below. I've left only a few uncommented.

module/zfs/dbuf.c

module/zfs/dnode.c

module/zfs/dnode_sync.c

asomers · 2025-08-18T23:11:29Z

Though I see your comments, @amotin, I still struggle to understand the right thing to do in general, because the locking requirements aren't well documented, nor are they enforced by the compiler or at runtime. Here are the different descriptions I've seen:

From dbuf.h:

db.db_data, which is protected by db_mtx
...
[db_rwlock] Protects db_buf's contents if they contain an indirect block or data block of the meta-dnode

And here's what @pcd1193182 said in #17118:

The value of the `db.db_data` and `db_buf` fields
are protected by `db_mtx` plus the `db_holds` refcount.  The contents are
protected by `db_rwlock`.  `db_mtx` is also responsible for protecting some of
the other parts of the dbuf state.

And later

dbufs have different states, and when they are in these different states, they can only be accessed in
certain ways.

But I don't see any list of what the various states are, nor how to tell which state a dbuf is in.

@amotin added the following in that same discussion thread:

db_rwlock protect content of buffers that are parent (indirect or dnode) of
some other buffer, and we need to either write or read the block pointer of the
buffer, either directly or via de-referencing the pointer of db_blkptr pointing
inside it. All the parent buffers permanently referenced so can not be evicted,
and have only one copy, so their memory should never be reallocated, so db_mtx
protection is not required in this case.

And @amotin added some more detail in this PR:

"If the db_dirtycnt below is zero (and it should be protected by db_mtx), then the buffer must be empty."
"Indirects don't relocate."
"meta-dnode dbufs are not relocatable"
"db_rwlock didn't promise to protect [L0 blocks]"

I can't confidently make any changes here without a complete and accurate description of the locking requirements. What I need are:

Complete and accurate documentation in dbuf.h
A way to enforce those requirements at runtime. Perhaps a macro that asserts that a db_buf is locked, or else doesn't need to be locked based on other data in the dmu_buf_impl, and can be called everywhere that db_buf is accessed. And a similar macro for db.db_data.

@amotin can you please help with that? At least with the first part?

amotin · 2025-08-19T00:13:10Z

@asomers Let me rephrase the key points:

Indirects and L0 dnode dbufs are special in only ever having one data copy. They are always decompressed in memory, and if they need to be decrypted (only the bonus parts of dnode L0 blocks can be encrypted; indirects are only signed), it is done in place. This means they are never relocated in memory, so we don't need db_mtx to protect their db.db_data. And as long as we hold a reference on those dbufs, they cannot be evicted and so cannot change state. This removes most of the db_mtx acquisitions you've added.
db_rwlock is designed specifically to protect indirects and L0 dnode blocks from torn writes when they are modified in sync context but read by anything else. db_rwlock is not intended to protect any user data dbufs, which are modified only in open context. For those we have range locks, etc. This removes most of the db_rwlock acquisitions you've added.
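
Read as predicates, the two points above might look like the following sketch. It is a reduced model, not code from the PR: `toy_dbuf_t` and every function name are hypothetical, and the real dmu_buf_impl_t tracks level, owning object and holds through different fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of the dbuf properties the two rules use. */
typedef struct {
	uint8_t level;		/* 0 = L0 block, >0 = indirect block */
	bool is_meta_dnode;	/* block belongs to the meta-dnode object */
	uint64_t holds;		/* references currently held on this dbuf */
} toy_dbuf_t;

/* Both rules single out the same class: indirects and L0 dnode blocks. */
static bool
toy_is_single_copy(const toy_dbuf_t *db)
{
	return (db->level > 0 || db->is_meta_dnode);
}

/* Rule 1: is db_mtx needed before reading the db_data pointer? */
static bool
toy_addr_needs_mtx(const toy_dbuf_t *db)
{
	/* A held single-copy buffer is neither relocated nor evicted. */
	return (!(toy_is_single_copy(db) && db->holds > 0));
}

/* Rule 2: is db_rwlock the lock that covers this buffer's contents? */
static bool
toy_contents_use_rwlock(const toy_dbuf_t *db)
{
	return (toy_is_single_copy(db));
}

/* Small self-check covering the three kinds of buffer discussed. */
static void
toy_self_check(void)
{
	toy_dbuf_t indirect = { .level = 1, .is_meta_dnode = false, .holds = 1 };
	toy_dbuf_t dnode_l0 = { .level = 0, .is_meta_dnode = true, .holds = 1 };
	toy_dbuf_t user_l0 = { .level = 0, .is_meta_dnode = false, .holds = 1 };

	assert(!toy_addr_needs_mtx(&indirect));
	assert(!toy_addr_needs_mtx(&dnode_l0));
	assert(toy_addr_needs_mtx(&user_l0));
	assert(toy_contents_use_rwlock(&indirect));
	assert(!toy_contents_use_rwlock(&user_l0));
}
```

The model deliberately ignores bonus and spill blocks, which the later commits in this PR treat as a separate case.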

IvanVolosyuk · 2025-08-23T01:18:26Z

My humble opinion: I think it is a reasonable request to:

accurately document what each lock is responsible for and in which states locking is required; enumerate the possible states that require different approaches.
add debug assertions that make it clear which code paths already hold the lock.
in places where locking is not needed because there is a single user, somehow poison the locks in debug mode so that any unexpected use crashes.
in places where the object is not relocatable, add a macro that makes it clear locking is not needed and checks that the object is indeed not relocatable.

It is good to have optimizations, but it is not healthy that knowledge of the locking scheme is limited to a small group of people, with poor documentation and no way to examine the code for correctness.

amotin · 2025-09-12T16:07:59Z

@asomers Despite my comments on many of the changes here, IIRC there were some that could be useful. Do you plan to clean this up, document it, etc., or will I have to take it over?

asomers · 2025-09-12T16:10:12Z

> @asomers Despite my comments on many of the changes here, IIRC there were some that could be useful. Do you plan to clean this up, document it, etc., or will I have to take it over?

Yes. My approach is to create some assertion functions which check that either db_data is locked, or is in a state where it doesn't need to be. The WIP is here, but it isn't ready for review yet. Probably next week. https://github.com/asomers/zfs/tree/db_data_elide .
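
As a rough idea of what such an assertion could look like, here is a user-space sketch. It is not the code from the WIP branch: `toy_mutex_t`, `toy_dbuf_t`, `toy_db_data_addr_stable`, `TOY_ASSERT_DB_DATA_ADDR_STABLE` and `toy_get_data` are all hypothetical, and the exemption is collapsed into a single `never_relocates` flag for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a kernel mutex that remembers whether it is held. */
typedef struct { bool held; } toy_mutex_t;
#define	TOY_MUTEX_HELD(m)	((m)->held)

/* Hypothetical, reduced dbuf. */
typedef struct {
	toy_mutex_t db_mtx;
	bool never_relocates;	/* e.g. an indirect block, per the discussion */
	unsigned holds;		/* references currently held */
	void *db_data;
} toy_dbuf_t;

/*
 * True if the db_data pointer cannot change under the caller: either the
 * mutex is held, or the dbuf is in a state where locking is unnecessary.
 */
static bool
toy_db_data_addr_stable(const toy_dbuf_t *db)
{
	if (TOY_MUTEX_HELD(&db->db_mtx))
		return (true);
	return (db->never_relocates && db->holds > 0);
}

#define	TOY_ASSERT_DB_DATA_ADDR_STABLE(db) \
	assert(toy_db_data_addr_stable(db))

/* Every access to the pointer goes through the assertion first. */
static void *
toy_get_data(toy_dbuf_t *db)
{
	TOY_ASSERT_DB_DATA_ADDR_STABLE(db);
	return (db->db_data);
}
```

Putting the check in one place means a call site that neither holds the lock nor qualifies for the exemption fails loudly in a debug build instead of racing silently.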

asomers · 2025-09-18T22:05:46Z

@amotin I've eliminated the lock acquisitions as you requested. Please review. Note that while I've run the ZFS test suite with this round of changes, I don't know whether they suffice to solve the original corruption bug. The only way to know that is to run the code in production. But I'd like your review before I try that, because it takes quite a bit of time and effort to get sufficient production time. Not to mention the risk of corrupting customer data again.

module/zfs/dbuf.c

behlendorf · 2025-09-25T16:50:20Z

@asomers if you can rebase this on the latest commits in the master branch, that should resolve most of the CI build failures. While you're at it, please go ahead and squash the commits.

module/zfs/dbuf.c

scripts/zfs-tests.sh

asomers · 2025-12-04T01:35:49Z

There are suddenly a lot of "Wrong value for OS variable!" failures. I think that virt-install on the CI server must've suddenly changed versions.

amotin · 2025-12-04T01:53:05Z

> There are suddenly a lot of "Wrong value for OS variable!" failures.

I am not sure what it means, but when did you last rebase?

asomers · 2025-12-04T14:54:33Z

> > There are suddenly a lot of "Wrong value for OS variable!" failures.
>
> I am not sure what it means, but when did you last rebase?

Not since September. I've avoided doing that, since it can make the review confusing. But I'll do it now.

asomers · 2026-01-17T16:28:40Z

@amotin I rebased the changes and fixed the two new panics that have appeared since the last rebase. It's easier now that #18131 is finished. The freebsd16-0c CI failure is not the result of this PR, and the checkstyle failure will resolve itself after I squash. Could you please review again?

Lock db_mtx in some places that access db->db_data. But in some places, add assertions that the dbuf is in a state where it will not be copied, rather than locking it.

Lock db_rwlock in some places that access db->db.db_data's contents. But in some places, add assertions that should guarantee the buffer is being accessed by one thread only, rather than locking it.

Closes openzfs#16626
Sponsored by: ConnectWise
Signed-off-by: Alan Somers <asomers@gmail.com>

amotin · 2026-01-29T20:27:28Z

module/zfs/dbuf.c

+	if (dr->dr_dnode->dn_phys->dn_nlevels != 1) {
+		parent_db = dr->dr_parent->dr_dbuf;
+		assert_db_data_addr_locked(parent_db);
+		rw_enter(&parent_db->db_rwlock, RW_READER);


A couple of chunks below, in dbuf_write_ready(), you take db_rwlock on the dnode buffer. Both cases are reads in sync context, though, and I would not expect them to race.

amotin · 2026-01-29T21:05:19Z

module/zfs/dnode.c

+		mutex_enter(&db->db_mtx);
+		if (db->db_level != 1 || db->db_blkid >= end_blkid) {
+			mutex_exit(&db->db_mtx);


I am not sure why we would need locking here. db_level and db_blkid should be constant, I think.

asomers · Jan 29, 2026

db_state and db_dirtycnt certainly need to be protected by db_mtx. I could move the mutex_enter down until after the db_level check if you insist, though.

amotin · Feb 2, 2026

I haven't looked at the bigger picture here, but I'd say yeah, between later and never.
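
The reordering being discussed (test the fields that never change first, and take db_mtx only for the mutable state) can be sketched as follows. This is a toy model and not the dnode.c code: the lock is a single-threaded flag, `toy_is_clean_l1` is a hypothetical function, and what it does under the lock (checking db_dirtycnt) is invented purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy single-threaded mutex that also counts acquisitions (assumption). */
typedef struct { bool held; int acquisitions; } toy_mutex_t;

static void
toy_mutex_enter(toy_mutex_t *m)
{
	assert(!m->held);
	m->held = true;
	m->acquisitions++;
}

static void
toy_mutex_exit(toy_mutex_t *m)
{
	assert(m->held);
	m->held = false;
}

/* Hypothetical, reduced dbuf. */
typedef struct {
	toy_mutex_t db_mtx;
	uint8_t db_level;	/* treated as constant, per the review */
	uint64_t db_blkid;	/* likewise treated as constant */
	uint8_t db_dirtycnt;	/* mutable: read only under db_mtx */
} toy_dbuf_t;

/* Return true if db is a clean level-1 dbuf with a blkid below end_blkid. */
static bool
toy_is_clean_l1(toy_dbuf_t *db, uint64_t end_blkid)
{
	bool clean;

	/* Constant fields: no lock needed to filter out uninteresting dbufs. */
	if (db->db_level != 1 || db->db_blkid >= end_blkid)
		return (false);

	/* Mutable state: only now pay for the mutex. */
	toy_mutex_enter(&db->db_mtx);
	clean = (db->db_dirtycnt == 0);
	toy_mutex_exit(&db->db_mtx);
	return (clean);
}
```

In this model a dbuf that fails the level or blkid test never touches the mutex at all, which is the saving the reordering is after.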

amotin

A couple of small nits, but please also review the earlier comments that are still not marked resolved.

amotin · 2026-03-04T16:12:18Z

module/zfs/dbuf.c

+	}
+
+	assert_db_data_addr_locked(parent_db);
+	rw_enter(&parent_db->db_rwlock, RW_WRITER);


I wonder if we could move the dn_maxblkid update below to before the lock acquisition (or after the release?), so that we don't have to think about lock ordering. They seem unrelated.

amotin · 2026-03-04T16:33:13Z

module/zfs/dnode_sync.c

 		ASSERT(list_head(&db->db_dirty_records) == dr);
 		list_remove_head(&db->db_dirty_records);
 		ASSERT(list_is_empty(&db->db_dirty_records));
+		ASSERT(MUTEX_HELD(&db->db_mtx));


We took this lock just 6 lines above and have done nothing to it since.

asomers mentioned this pull request Jun 3, 2025

2.3.2 causing kernel panic and I/O hangs, 2.3.1 works on same dataset #17307

Open

asomers force-pushed the db_data branch from 05077e2 to e8c8b5a Compare June 3, 2025 21:23

behlendorf self-requested a review June 4, 2025 00:12

alek-p approved these changes Jun 4, 2025

snajpa approved these changes Jun 13, 2025

asomers force-pushed the db_data branch from e8c8b5a to a359e6c Compare June 25, 2025 18:47

snajpa mentioned this pull request Jul 31, 2025

ZFS 2.3.3 crash: kernel panic in scrub path (zio_*) #17559

Open

satmandu mentioned this pull request Aug 12, 2025

2.3.4 staging prep #17595

Merged

amotin reviewed Aug 12, 2025

behlendorf added the Status: Code Review Needed Ready for review and testing label Aug 13, 2025

amotin added the Status: Revision Needed Changes are required for the PR to be accepted label Sep 12, 2025

asomers requested a review from amotin September 18, 2025 22:03

github-actions bot removed the Status: Revision Needed Changes are required for the PR to be accepted label Sep 18, 2025

amotin reviewed Sep 22, 2025

asomers force-pushed the db_data branch from 8ce6b17 to 8d81c8d Compare September 23, 2025 14:51

asomers requested a review from amotin September 24, 2025 18:49

asomers force-pushed the db_data branch from 567e263 to 59ee0bb Compare September 25, 2025 17:08

amotin reviewed Oct 23, 2025

asomers requested a review from amotin October 24, 2025 17:52

pcd1193182 suggested changes Nov 10, 2025

asomers force-pushed the db_data branch from e69fa63 to 3033dd4 Compare December 4, 2025 15:54

asomers force-pushed the db_data branch from 3033dd4 to c0d6119 Compare January 16, 2026 22:07

asomers added 13 commits January 29, 2026 08:35

Better const-correctness on Linux

cf3e275

Signed-off-by: Alan Somers <asomers@gmail.com>

Fix two logic errors in assert_db_data_contents_locked

35f6251

1) It wasn't actually checking the rwlock for indirect blocks
2) Per @amotin, "DMU_BONUS_BLKID and DMU_SPILL_BLKID can only exist at level 0", it was redundantly checking the blkid.

fixup: don't short-circuit assert_db_data_contents_locked if db_dirtycnt==0

4bd5771

Fixup: Restore locking around dnode_create in dnode_hold_impl

5abd43f

The assertion was no longer true after removing the check for db_dirtycnt in the previous commit.

Restore the db_rwlock locking in dbuf_verify

17652c4

Either this function needs to acquire db_rwlock, or we need some guarantee that no other thread can modify db_data while db_dirtycnt == 0. From what @amotin said, it sounds like there is no guarantee.

Restore the db_rwlock locking in the other leg of dbuf_verify

5df93db

fixup to "Fix two logic errors in assert_db_data_contents_locked"

57e3efe

the meta dnode may have bonus or spill blocks, but we don't need to lock db_data for those.

fixup to Restore the db_rwlock locking in the other leg of dbuf_verify

a360141

Delete unintended change

Always use RW_READER or RW_WRITER with rw_enter

f86c4e9

Reported by: @pcd1193182

Always use db_mtx to protect accesses to db_dirtycnt

d4dce09

According to @amotin that was always the intention. But it wasn't documented, and in practice wasn't always done. Also, don't lock db_rwlock during dbuf_verify. Since db_dirtycnt == 0, we don't need to.

fixup to "Be more careful with locking around db.db_data": style

5144c4b

Add two more db_mtx acquisitions

2c310ad

These weren't necessary originally, but after rebasing they are.

asomers force-pushed the db_data branch from c0d6119 to 2c310ad Compare January 29, 2026 18:56

amotin reviewed Jan 29, 2026

asomers added 3 commits January 29, 2026 16:13

fixup to "Restore the db_rwlock locking in the other leg of dbuf_verify"

aa46aa5

fixup to Be more careful with locking around db.db_data

935c307

Respond to @amotin's latest comments

d10ddf3

asomers force-pushed the db_data branch from 505d3c1 to d10ddf3 Compare February 12, 2026 14:23

Squash: fix another kernel panic that arose after the last rebase.

af3c9b1

amotin reviewed Mar 4, 2026
