Fix unnecessary writes of transaction database by Bronek · Pull Request #18120 · openzfs/zfs

Bronek · 2026-01-08T09:01:03Z

This makes the periodic flush of the TXG timestamp database conditional on the database having been updated since the last flush.

No change to documentation, since arguably this is also the most intuitive behaviour.

Motivation and Context

This fixes #18082

Description

How Has This Been Tested?

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Quality assurance (non-breaking change which makes the code more robust against bugs)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

oshogbo · 2026-01-08T09:13:00Z

This doesn't seem right. If the database is dirty and the time has come to flush, we want to flush the database, not wait for the next TXG to arrive, no?

oshogbo · 2026-01-08T09:15:19Z

I guess we just need to check whether there is something to flush, and skip the flush if no modification was made to the database.

Bronek · 2026-01-08T09:48:39Z

> I guess we just need to check whether there is something to flush, and skip the flush if no modification was made to the database.

Yes, that's done by the newly added

} else {
  return;
}

However with this change alone, I am not certain that updating spa->spa_last_noted_txg_time = curtime; inside if (txg > spa->spa_last_noted_txg) section (i.e. whether or not the transaction database has been flushed) is still correct. Which is why I moved it below.

If spa->spa_last_noted_txg_time = curtime; is kept in the original location, I can imagine a situation where we keep pushing "last noted" timestamp forward but the actual flush is unnecessarily delayed or perhaps even does not happen. However my imagination is known to push some silly ideas so I am not 100% certain about it.

oshogbo · 2026-01-08T09:56:52Z

> Yes, that's done by the newly added
> } else {
>   return;
> }

In your case, the database is flushed not whenever the timeout expires, but only when a new TXG has been issued and the timeout has expired. As a result, if the database is dirty and no new TXG is created by the time the flush deadline is reached, the function returns early instead of flushing the current TXG.

Bronek · 2026-01-08T09:59:00Z

> > Yes, that's done by the newly added
> > } else {
> >   return;
> > }
>
> In your case, the database is flushed not whenever the timeout expires, but only when a new TXG has been issued and the timeout has expired. As a result, if the database is dirty and no new TXG is created by the time the flush deadline is reached, the function returns early instead of flushing the current TXG.

Excellent point. I think both conditions need to be merged into a single if ... else if; commit to follow.

Bronek · 2026-01-08T11:07:45Z

> In your case, the database is flushed not whenever the timeout expires, but only when a new TXG has been issued and the timeout has expired. As a result, if the database is dirty and no new TXG is created by the time the flush deadline is reached, the function returns early instead of flushing the current TXG.

I don't know how to fix that. From your comment above, the current behaviour is correct, since it flushes the database on timeout whether or not there is a new TXG. But that is exactly what wakes the disks at 10-minute intervals (default config). I guess this needs a new "is the database dirty" check to supplement the new-TXG check, but I don't know how to implement it.

Here is a comparison of the different behaviours, as truth tables:

A = new TXG ?
B = flush interval passed ?

current behaviour:

A	B	flush
0	0	0
0	1	1
1	0	0
1	1	1

proposed behaviour:

A	B	flush
0	0	0
0	1	0
1	0	0
1	1	1

additional check:

C = is database dirty ?

improved fix:

A	B	C	flush
0	0	0	0
0	0	1	0
0	1	0	0
0	1	1	1
1	0	0	0
1	0	1	0
1	1	0	1
1	1	1	1

In particular, this row 0 | 1 | 0 | 0 will prevent unnecessary flushes. So we are looking at (A || C) && B but I do not know how to do the "is database dirty" check.
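As a minimal sketch, the three truth tables above can be written as plain boolean predicates. The function names are hypothetical and exist only for illustration; they are not part of the OpenZFS source.

```c
#include <assert.h>
#include <stdbool.h>

/* Current behaviour: flush whenever the interval has passed (B). */
static bool
flush_current(bool new_txg, bool interval_passed)
{
	(void) new_txg;	/* A is ignored by the current logic */
	return (interval_passed);
}

/* First proposal: flush only when a new TXG coincides with the deadline. */
static bool
flush_proposed(bool new_txg, bool interval_passed)
{
	return (new_txg && interval_passed);
}

/* Improved fix: (A || C) && B, where C means "database is dirty". */
static bool
flush_improved(bool new_txg, bool interval_passed, bool db_dirty)
{
	return ((new_txg || db_dirty) && interval_passed);
}

/* Row 0 | 1 | 0 is the idle case that should not cause a write. */
static void
check_idle_row(void)
{
	assert(flush_current(false, true) == true);
	assert(flush_improved(false, true, false) == false);
}
```

The sketch leaves open how C would be computed, which is the question the rest of the thread works through.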

oshogbo · 2026-01-08T11:14:33Z

Hmm, don't we simply need to add a check like:

	if (curtime < spa->spa_last_flush_txg_time + spa_flush_txg_time) {
		return;
	}
	if (spa->spa_last_noted_txg_time < spa->spa_last_flush_txg_time) {
		return;
	}

Basically saying that if we have not noted any TXG since the last flush, let's not flush the database?
Will that work?

Bronek · 2026-01-08T11:22:05Z

> spa->spa_last_noted_txg_time < spa->spa_last_flush_txg_time

You mean like in a2781dd? https://github.com/Bronek/zfs/blob/a2781ddc3f619c852ddbc5b8532f7a0c91e944c8/module/zfs/spa.c#L2167-L2170

oshogbo · 2026-01-08T11:26:08Z

Yep, I think this should work.

Now we basically have a way to tell whether the database is actually dirty. If the time of the last flush is earlier than the time of the last noted TXG, we have something to write. If the time of the last flush is later, we have already flushed everything, so there is no need to update the ZAP.
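A small sketch of this idea, using the two timestamps as a dirty indicator. The struct and function names are hypothetical stand-ins for the `spa_t` fields mentioned in the thread, and the logic is a simplified model rather than the code in the commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant spa_t fields; times in seconds. */
typedef struct txg_log_times {
	uint64_t last_noted_txg_time;	/* when a TXG was last recorded */
	uint64_t last_flush_txg_time;	/* when the database was last flushed */
} txg_log_times_t;

/* Returns true when the periodic flush should be skipped. */
static bool
skip_flush(const txg_log_times_t *t, uint64_t curtime, uint64_t flush_interval)
{
	assert(t != NULL);
	if (curtime < t->last_flush_txg_time + flush_interval)
		return (true);	/* flush deadline not reached yet */
	if (t->last_noted_txg_time < t->last_flush_txg_time)
		return (true);	/* nothing noted since the last flush */
	return (false);
}
```

With a strict `<`, two equal timestamps count as "something to write", since only a flush time strictly later than the noted time is treated as clean.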

oshogbo · 2026-01-08T11:37:18Z

Thinking about this in a bit more detail, I believe there may still be some corner cases. In particular, there is a scenario where we could perform one unnecessary extra write.

Because curtime is tracked in seconds, consider the following:
A TXG is noted and the time is saved as 00:00:01.
We flush the data and also record the flush time as 00:00:01.
No additional writes occur.
Time advances to 00:00:01 + interval, so we decide it is time to flush again.
However, there is no new TXG, and the recorded flush time (00:00:01) is equal to the saved TXG time (00:00:01). As a result, we end up performing one extra write.

To avoid that, we could add 1s to the flush time, for example, just to make sure the two values are never the same.
Or we need an extra flag that is set when we store a TXG and cleared when the data is flushed.
The first option is rather magic.
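The tie described above, and the flag-based alternative, can be sketched as follows. All names here are hypothetical; the flag is the "set when we store a TXG, cleared when the data is flushed" idea, not code taken from the pull request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Timestamp test with strict '<': a tie is not recognised as clean. */
static bool
clean_by_timestamps(uint64_t noted_time, uint64_t flush_time)
{
	return (noted_time < flush_time);
}

/* Hypothetical state for the flag-based alternative. */
typedef struct txg_log_flag {
	uint64_t last_flush_time;	/* seconds */
	bool dirty;			/* set on note, cleared on flush */
} txg_log_flag_t;

/* Record that a TXG was stored in the database. */
static void
note_txg(txg_log_flag_t *s)
{
	assert(s != NULL);
	s->dirty = true;
}

/* Returns true if a flush was performed. */
static bool
maybe_flush(txg_log_flag_t *s, uint64_t curtime, uint64_t flush_interval)
{
	assert(s != NULL);
	if (curtime < s->last_flush_time + flush_interval || !s->dirty)
		return (false);
	s->dirty = false;
	s->last_flush_time = curtime;
	return (true);
}
```

The flag does not depend on clock granularity, at the cost of one more field to keep consistent with the database.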

Bronek · 2026-01-08T11:59:27Z

> Thinking about this in a bit more detail, I believe there may still be some corner cases. In particular, there is a scenario where we could perform one unnecessary extra write.
>
> Because curtime is tracked in seconds, consider the following: A TXG is noted and the time is saved as 00:00:01. We flush the data and also record the flush time as 00:00:01. No additional writes occur. Time advances to 00:00:01 + interval, so we decide it is time to flush again. However, there is no new TXG, and the recorded flush time (00:00:01) is equal to the saved TXG time (00:00:01). As a result, we end up performing one extra write.
>
> To avoid that, we could add 1s to the flush time, for example, just to make sure the two values are never the same. Or we need an extra flag that is set when we store a TXG and cleared when the data is flushed. The first option is rather magic.

How about this: 89ba5c7?

oshogbo · 2026-01-08T12:04:13Z

IMHO this is wrong :)

Let me explain (I have rewritten the previous example):
Interval is: 00:00:04.
Last noted is 00:00:00.
A TXG is noted and the time is saved as 00:00:05.
Adjustment is set to 1.
We flush the data (as 00:00:00 + 1 is lower than 00:00:05) and also record the flush time as 00:00:05.
No additional writes occur.
Time advances to 00:00:09, so we decide it is time to flush again.
However, there is no new TXG, and the recorded flush time (00:00:05) is equal to the saved TXG time (00:00:05). As a result, we end up performing one extra write.

Bronek · 2026-01-08T12:49:03Z

> IMHO this is wrong :)
>
> Let me explain (I have rewritten the previous example): Interval is: 00:00:04. Last noted is 00:00:00. A TXG is noted and the time is saved as 00:00:05. Adjustment is set to 1. We flush the data (as 00:00:00 + 1 is lower than 00:00:05) and also record the flush time as 00:00:05. No additional writes occur. Time advances to 00:00:09, so we decide it is time to flush again. However, there is no new TXG, and the recorded flush time (00:00:05) is equal to the saved TXG time (00:00:05). As a result, we end up performing one extra write.

Please note this commit also changes the comparison (for the early return;) from < to <=, so if both noted and flushed are equal and there is no new transaction (i.e. adjustment remains 0), then this condition will be true and we return early. If there is a new transaction then adjustment will be 1, meaning we bump "flushed" for the purpose of comparison and the condition will be false, so we proceed to flush.

oshogbo · 2026-01-08T13:06:36Z

> Please note this commit also changes the comparison (for the early return;) from < to <=

I see, I missed that.

oshogbo · 2026-01-09T22:04:54Z

After looking at it once more, I don’t think we need this special variable adj. If both times are the same, it means we’ve already flushed the data, so <= should be sufficient.

Bronek · 2026-01-10T20:43:51Z

> After looking at it once more, I don’t think we need this special variable adj. If both times are the same, it means we’ve already flushed the data, so <= should be sufficient.

Could this function be called more than once before one second has elapsed?

oshogbo · 2026-01-11T07:54:28Z

It can be, but I don’t see a problem even if spa_flush_txg_time is set to 0 (which probably doesn’t make sense); we will still flush every second.
Also, even if spa_last_noted_txg is set to 0, the database will reject repeated records for the same second, so there is nothing to flush.

Bronek · 2026-01-11T10:00:19Z

> It can be, but I don’t see a problem even if spa_flush_txg_time is set to 0 (which probably doesn’t make sense); we will still flush every second. Also, even if spa_last_noted_txg is set to 0, the database will reject repeated records for the same second, so there is nothing to flush.

With the early return condition like this

	if (curtime < spa->spa_last_flush_txg_time + spa_flush_txg_time ||
	    spa->spa_last_noted_txg_time <= spa->spa_last_flush_txg_time) {
		return;
	}

... and assuming neither spa_last_noted_txg_time nor spa_last_flush_txg_time is updated elsewhere (I am ignoring spa_unload_sync_time_logger because it's outside of the regular sync schedule), ~~once these two converge on the same value then (without adj) we would always return early. That would be a pretty bad bug IMHO. Or, very likely, I might be missing something.~~

EDIT: I see how this would work. We bump spa_last_noted_txg_time every time there's a new transaction, so the condition will be false and we will flush the transaction, as long as enough time has passed from the previous flush.
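The reasoning in the EDIT can be checked with a simplified model of the logger. This is a sketch built only from the conditions quoted in this thread; the struct, the function name and the `note_interval` parameter (standing in for `spa_note_txg_time`) are assumptions, and the real `spa_sync_time_logger()` may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the fields discussed above; times in seconds. */
typedef struct txg_log_model {
	uint64_t last_noted_txg;
	uint64_t last_noted_txg_time;
	uint64_t last_flush_txg_time;
} txg_log_model_t;

/* Returns true if this call would flush the database. */
static bool
model_sync_time_logger(txg_log_model_t *m, uint64_t txg, uint64_t curtime,
    uint64_t note_interval, uint64_t flush_interval)
{
	assert(m != NULL);

	/* Note a new TXG once the note interval has elapsed. */
	if (txg > m->last_noted_txg &&
	    curtime >= m->last_noted_txg_time + note_interval) {
		m->last_noted_txg_time = curtime;
		m->last_noted_txg = txg;
	}

	/* Early return: deadline not reached, or nothing noted since flush. */
	if (curtime < m->last_flush_txg_time + flush_interval ||
	    m->last_noted_txg_time <= m->last_flush_txg_time)
		return (false);

	m->last_flush_txg_time = curtime;
	return (true);
}
```

In this model, once the two timestamps are equal the function keeps returning early only until the next TXG is noted, which moves the noted time past the flush time again.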

oshogbo · 2026-01-11T11:29:06Z

I think you can squash these commits and fix the checkstyle issues, and then they are ready to go.

Bronek · 2026-01-11T14:21:37Z

@oshogbo technically all of the fix comes from your comments. Ok to add this line ?

Co-authored-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>

oshogbo · 2026-01-11T14:39:11Z

Sure, I appreciate it. Thanks!

This makes the periodical flushes of transaction timestamps database conditional to any updates in the database since the last flush.

Co-authored-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Signed-off-by: Bronek Kozicki <brok@incorrekt.com>

amotin

I'm sorry if I missed something from the earlier discussion, TLDR, but I am not getting how this change should fix anything. spa_sync_time_logger() is normally called only once per TXG, so the condition txg > spa->spa_last_noted_txg should always be true. After that, if curtime >= spa->spa_last_noted_txg_time + spa_note_txg_time so that we could get to the later condition, it means we update spa_last_noted_txg_time. Which means the added spa->spa_last_noted_txg_time <= spa->spa_last_flush_txg_time will be false.

I haven't looked deeper and may be wrong, but I suspect that the actual problem here might be that ZFS kicks TXGs and calls this function on each TXG timeout, even if nothing has changed. Those empty TXGs in turn get logged into the DB, and accordingly produce pool writes that would otherwise not exist.
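A rough simulation of this objection, under stated assumptions: the logger is modelled with only the conditions quoted in this thread, the intervals are made-up illustrative values, and `count_flushes` with all of its parameters is hypothetical. It compares flush counts with and without the added check, both when every call carries a new TXG and when none does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified model, not the OpenZFS source: counts flushes over `duration`
 * seconds when the logger is called every `call_period` seconds.
 */
static unsigned
count_flushes(bool dirty_check, bool new_txg_each_call, uint64_t duration,
    uint64_t call_period, uint64_t note_interval, uint64_t flush_interval)
{
	uint64_t noted_txg = 0, noted_time = 0, flush_time = 0, txg = 1;
	unsigned flushes = 0;

	assert(call_period > 0);
	for (uint64_t now = call_period; now <= duration; now += call_period) {
		/* Note the TXG if it is new and the note interval elapsed. */
		if (txg > noted_txg && now >= noted_time + note_interval) {
			noted_txg = txg;
			noted_time = now;
		}
		if (new_txg_each_call)
			txg++;	/* empty TXGs still advance the counter */

		if (now < flush_time + flush_interval)
			continue;	/* flush deadline not reached */
		if (dirty_check && noted_time <= flush_time)
			continue;	/* nothing noted since last flush */
		flush_time = now;
		flushes++;
	}
	return (flushes);
}
```

In this model, with a call every 5 seconds, a 10-second note interval and a 60-second flush interval over 600 seconds, the added check changes nothing when every call brings a new TXG, and only reduces flushes when no new TXGs arrive. That matches the suspicion that the empty TXGs themselves are the source of the writes.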

mbartosi · 2026-01-14T05:51:12Z

You're right @amotin, I used this patch for zfs 2.4.0 under Gentoo with 6.18.4 kernel and it seems that it did not help -- disks are being woken up just after spinning down. I have 2 hdd pools with nvme cache devices.

Bronek · 2026-01-17T18:19:42Z

Superseded by #18138

Bronek marked this pull request as draft January 8, 2026 09:01

github-actions bot added the Status: Work in Progress Not yet ready for general review label Jan 8, 2026

Bronek mentioned this pull request Jan 8, 2026

[2.4] TXG timestamp DB sync if idle causes unnecessary disk access/prevent spin down #18082

Closed

Bronek closed this Jan 8, 2026

Bronek reopened this Jan 8, 2026

Bronek force-pushed the bronek/fix_frequent_spa_txg_writes branch from a2781dd to 89ba5c7 Compare January 8, 2026 11:58

Bronek force-pushed the bronek/fix_frequent_spa_txg_writes branch from 89ba5c7 to b38932e Compare January 8, 2026 12:51

alek-p mentioned this pull request Jan 10, 2026

AI Rollup - Power Management Issues alek-p/openzfs#22

Open

Bronek force-pushed the bronek/fix_frequent_spa_txg_writes branch 3 times, most recently from 798e23d to a5fe3d8 Compare January 11, 2026 13:47

Bronek changed the title ~~DRAFT: Fix unnecessary writes of transaction database~~ Fix unnecessary writes of transaction database Jan 11, 2026

Bronek marked this pull request as ready for review January 11, 2026 14:11

github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Jan 11, 2026

Bronek force-pushed the bronek/fix_frequent_spa_txg_writes branch from a5fe3d8 to 6704e19 Compare January 11, 2026 14:46

amotin reviewed Jan 12, 2026

oshogbo mentioned this pull request Jan 16, 2026

Flush RRD only when TXGs contain data #18138

Merged

14 tasks

Bronek closed this Jan 17, 2026
