Replies: 8 comments 25 replies
-
Some possibilities for the writer that is racing with ...
-
So the subtlety here is that the value of the ...
-
@pcd1193182 so, if I understand it correctly, one needs to hold the lock while accessing the field, because if we don't hold it, the contents can change underneath us. If that's the case, then here are the places which don't do this properly:
I mean, ZFS is obviously overwriting random memory in our deployment on Linux currently, and this would be a nice explanation of why. It seems that the code has historically been optimized to take locks only where strictly necessary, but as the code evolves the context can change to contradict the original assumptions. The level of complexity here is, umm... something like RCU would be handy, at least for the ...
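Just to illustrate what I mean by that (purely a sketch of Linux-style RCU pointer publication, not anything ZFS does today, and every name other than the RCU primitives themselves is made up): readers would dereference the buffer pointer inside an RCU read-side section, and the writer would publish a replacement and only free the old buffer after a grace period.

```c
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/types.h>

/* Hypothetical RCU-managed buffer pointer -- NOT a real dbuf field. */
struct demo_dbuf {
	void __rcu *db_data_rcu;
};

/* Reader: no mutex, just an RCU read-side critical section. */
static u64
demo_read_first_word(struct demo_dbuf *db)
{
	u64 v;

	rcu_read_lock();
	v = *(u64 *)rcu_dereference(db->db_data_rcu);
	rcu_read_unlock();

	return (v);
}

/* Writer: publish the new buffer, free the old one after a grace period. */
static void
demo_replace_buffer(struct demo_dbuf *db, void *new_buf)
{
	void *old = rcu_dereference_protected(db->db_data_rcu, 1);

	rcu_assign_pointer(db->db_data_rcu, new_buf);
	synchronize_rcu();	/* wait until no reader can still see 'old' */
	kfree(old);
}
```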
-
So it sounds like there are a few problems:
Based on @pcd1193182's explanation from two days ago, but not considering dbuf state, I've made the following table of functions. Included is every function that directly accesses ...
-
#17209 adds some assertions to functions that need db_mtx and do hold it, but don't currently assert that. I've also got a local branch that builds on that and adds lock acquisitions where needed; I'll open a PR after #17209 merges.
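For reference, a minimal sketch of the kind of assertion being talked about (this is not the actual #17209 patch, and dbuf_peek_data() is a made-up helper for illustration): any function that dereferences db->db_data first asserts that the caller holds db_mtx, so a missing lock trips immediately in debug builds instead of racing silently.

```c
/*
 * Illustration only -- not code from #17209.  The helper name is
 * hypothetical; the ASSERT/MUTEX_HELD pattern is the same one that
 * dbuf.c already uses elsewhere.
 */
static void *
dbuf_peek_data(dmu_buf_impl_t *db)
{
	/* Caller must hold db_mtx for db->db.db_data to be stable. */
	ASSERT(MUTEX_HELD(&db->db_mtx));
	return (db->db.db_data);
}
```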
-
My local patch, which adds additional lock acquisitions similar to @snajpa's, is now running in production on one machine. If it doesn't introduce any new problems for one week, then I'll deploy it more widely.
-
@asomers can you please take a look at the current state of the code here? -> module/zfs/dbuf.c, line 3111 in a497c5f. That's the loop that waits on db_changed while the dbuf is in DB_READ or DB_FILL. Assuming I haven't made a mistake and have this right, I would try two things:

```diff
diff --git a/module/zfs/dbuf.c b/module/zfs/dbuf.c
index 1a7274968..68b1a0aab 100644
--- a/module/zfs/dbuf.c
+++ b/module/zfs/dbuf.c
@@ -3116,11 +3116,10 @@ dbuf_assign_arcbuf(dmu_buf_impl_t *db, arc_buf_t *buf, dmu_tx_t *tx)
 	ASSERT3U(arc_buf_lsize(buf), ==, db->db.db_size);
 	ASSERT(tx->tx_txg != 0);
 
+	mutex_enter(&db->db_mtx);
 	arc_return_buf(buf, db);
 	ASSERT(arc_released(buf));
 
-	mutex_enter(&db->db_mtx);
-
 	while (db->db_state == DB_READ || db->db_state == DB_FILL)
 		cv_wait(&db->db_changed, &db->db_mtx);
 
@@ -3252,9 +3251,9 @@ dbuf_destroy(dmu_buf_impl_t *db)
 	 * the hash table. We can now drop db_mtx, which allows us to
 	 * acquire the dn_dbufs_mtx.
 	 */
+	DB_DNODE_ENTER(db);
 	mutex_exit(&db->db_mtx);
 
-	DB_DNODE_ENTER(db);
 	dn = DB_DNODE(db);
 	dndb = dn->dn_dbuf;
 	if (db->db_blkid != DMU_BONUS_BLKID) {
```

The first is to take db_mtx before arc_return_buf() in dbuf_assign_arcbuf(), and the second is to do DB_DNODE_ENTER(db) before dropping db_mtx in dbuf_destroy(). The second part, if this works, could maaaaybe mitigate #12078?
-
So even with all the patches that we've both been floating around (incl. current #17209 as it is now, 3460c6c), I'm still able to reproduce it. The easiest reproducer I have is running Docker on top of a zpool, in a VM with 6 GB of RAM: ...

The traces seem to differ a bit depending on which attempts are applied; now with all the patches it's down to ...

However, on its own it's not reliable, it doesn't trigger on every such run. Essentially in all ... Now I have to dig through why kdump doesn't work on that dev Debian setup; I'm hoping that I'll be able to see more from a memory dump, hoping to see those line numbers instead of zeroes.
-
According to the comments in dbuf.h, db.db_data is protected by db_mtx, while db_buf is protected by db_rwlock. That's been the case ever since f664f1e. However, the code seems a little bit inconsistent. Here's a table of a few selected functions, and which fields they protect with which locks.

I can't find any examples of a function that takes db_rwlock just to protect db_mtx. I think that what should be happening is that db_buf is protected by db_mtx, and db.db_data should be protected by db_rwlock (or perhaps protected by both?). Can anybody please help me understand this?

It's not merely an academic exercise either. I suspect that locking mistakes on db.db_data are responsible for #16626 and the corruption described in #17077. cc @pcd1193182
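To make the question concrete, here is a minimal sketch of the access pattern I'd expect if the dbuf.h comments as quoted above are taken literally (db_mtx pinning db.db_data). The helper name dbuf_read_u64() is made up for illustration and is not in the tree, and whether db_rwlock should be held here too is exactly the open question:

```c
/*
 * Illustration only -- a hypothetical reader that follows the dbuf.h
 * comment literally: db->db.db_data is only dereferenced while db_mtx
 * is held, so the pointer cannot be swapped or freed underneath us.
 */
static uint64_t
dbuf_read_u64(dmu_buf_impl_t *db, size_t idx)
{
	uint64_t val;

	mutex_enter(&db->db_mtx);
	val = ((uint64_t *)db->db.db_data)[idx];
	mutex_exit(&db->db_mtx);

	return (val);
}
```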