Skip to content

release range lock after freed in dmu.#18647

Open
tiehexue wants to merge 1 commit into
openzfs:masterfrom
tiehexue:dead-lock-in-dmu-free
Open

release range lock after freed in dmu.#18647
tiehexue wants to merge 1 commit into
openzfs:masterfrom
tiehexue:dead-lock-in-dmu-free

Conversation

@tiehexue

@tiehexue tiehexue commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Motivation and Context

While working on issue #18454 , a test c program is written to reproduce the issue, however, failed. But a dead lock problem is found. And this is the fix. This deadlock is very similar to the one reported in #18454 , relates to PG_writeback, and triggered by unlink/fsync/ftruncate, and my testing env is aligned with it. It will be good if the issuer can verify this PR.

Description

After free a range in dmu, we should immediately release the range lock. In the original code, the comments stress that following:
"Zero partial page cache entries. This must be done under a range lock in order to keep the ARC and page cache in sync."

The test progam indeed shows deadlock here. The program and deadlock trace will be appended in this PR. The root cause is that we have double pass in fsync/msync, the second pass try to acquire range lock while still holds PG_writeback, then it is blocked by range lock in another thread. The thread has range lock is block by PG_writeback in folio_wait_writeback.

We may need more tests here to ensure ARC and page cache has no problem.

How Has This Been Tested?

run the test program, and watch dmesg.

ubuntu 26.04 amd64/arm64 ext4: fine
debian 13.4 amd64/arm64 ext4: fine

amd64:

ubuntu 26.04 7.0.0-22-generic + 2.3.7 + writeback_iter: deadlock
--- this configuration is to verify if writeback_iter is the root cause, refer to PR #18625

ubuntu 26.04 7.0.0-22-generic + 2.3.7: deadlock
ubuntu 26.04 7.0.0-22-generic + 2.4.99-1: deadlock
ubuntu 26.04 7.0.0-22-generic + 2.4.99-1 + this PR: fixed

6.12.74+deb13-amd64 + zfs-2.3.2-1 : deadlock
6.12.74+deb13-amd64 + zfs-2.3.2-1 + this PR: fixed

arm64:

6.12.74+deb13-arm64 + zfs-2.3.2-1: hang
6.12.74+deb13-arm64 + zfs-2.3.2-1 + this PR: fixed
ubuntu 26.04 7.0.0-22-generic + 2.4.99-1: deadlock
ubuntu 26.04 7.0.0-22-generic + 2.4.99-1 + this PR: fixed

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@tiehexue

tiehexue commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Hi, @ryao , did you notice I had uploaded a test program and trace log. The trace log clearly showed an ABBA deadlock. And as I stated in the PR, I had tested in a lot environment settings with the hang, and the fix.

Also I have doubts if we should re-acquire range lock before zfs_zero_partial_page after we release range lock before truncate_inode_pages_range.

------------- below is original comment -------------

The test program and its dmesg.

dead_lock.c
trace.log

Signed-off-by: tiehexue <tiehexue@hotmail.com>
@tiehexue

tiehexue commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

This PR, release range block before zero partial page, should bring concern about data corruption or correctness. I was trying to verify this by some small programs, however, those programs did not make distinct difference.

A such program or test case that can prove releasing early is bad are more than welcome!

@tiehexue tiehexue force-pushed the dead-lock-in-dmu-free branch from 5332a4e to 5195704 Compare June 8, 2026 16:13
@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Jun 8, 2026
@tiehexue

tiehexue commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

This PR, release range block before zero partial page, should bring concern about data corruption or correctness. I was trying to verify this by some small programs, however, those programs did not make distinct difference.

A such program or test case that can prove releasing early is bad are more than welcome!

OK, I guess this is the whole picture: 1) for data path go to ARC, it is fine; 2) only mmap bypass ARC.

For mmap read, it will read dirty data, which is fine, and same as other filesystem, which should be handled in application level.

For mmap write, this PR is a must, because there is chance of dead-lock. And if no dead-lock, there is very very little chance after our thread releasing range lock and before starting page truncating, a mmap write may be done in another thread. I have to add msleep(100) between to see a dirty page (by another thread) is truncated. And this should be handled in application level too, at least a file (range) lock is enough. We could see this not a bug, but a reasonable concurrent visit result.

Any way, kernel module shall not be in dead-lock while administrator has to come to the machine to do a reset.

@tiehexue tiehexue mentioned this pull request Jun 18, 2026
14 tasks

@ryao ryao left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you provide a better description of the lock inversion that causes the deadlock? I would like to see stacks for the two threads showing where lock X is grabbed by thread A, where lock Y would be grabbed by thread A, where lock Y is grabbed by thread B and where lock X would be grabbed by thread B.

I know you are saying this is between PG_writeback and zfs_rangelock_enter()/zfs_rangelock_enter(), but I am not seeing an inverted relationship in a second thread when I look at the code. Seeing stack traces of the events where each thread takes locks in an inverse order would make this much easier to review. You don't need to capture the full true stack traces with a tool. That usually only works for the second lock in each thread. Making a stack trace for the first lock in each thread by hand would work for enabling others to see the problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Status: Code Review Needed Ready for review and testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants