@Bronek

@Bronek Bronek commented Jan 8, 2026

This makes the periodic flush of the TXG timestamp database conditional on the database having been updated since the last flush.

No change to documentation, since arguably this is also the most intuitive behaviour.

Motivation and Context

This fixes #18082

Description

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@oshogbo
Contributor

oshogbo commented Jan 8, 2026

This doesn't seem right. If the database is dirty and the time has come to flush, we want to flush the database rather than wait for the next TXG to arrive, no?

@oshogbo
Contributor

oshogbo commented Jan 8, 2026

I guess we just need to check whether there is something to flush, and skip the flush if the database has not been modified.

@Bronek
Author

Bronek commented Jan 8, 2026

> I guess we just need to check whether there is something to flush, and skip the flush if the database has not been modified.

Yes, that's done by the newly added

} else {
  return;
}

However, with this change alone, I am not certain that updating spa->spa_last_noted_txg_time = curtime; inside the if (txg > spa->spa_last_noted_txg) section (i.e. regardless of whether the transaction database has been flushed) is still correct, which is why I moved it below.

If spa->spa_last_noted_txg_time = curtime; is kept in its original location, I can imagine a situation where we keep pushing the "last noted" timestamp forward while the actual flush is unnecessarily delayed, or perhaps never happens. However, my imagination is known to produce some silly ideas, so I am not 100% certain about it.

@oshogbo
Contributor

oshogbo commented Jan 8, 2026

> Yes, that's done by the newly added
>
> } else {
>   return;
> }

In your case, the database is flushed not whenever the timeout expires, but only when a new TXG has been issued and the timeout has expired. As a result, if the database is dirty and no new TXG is created by the time the flush deadline is reached, the function returns early instead of flushing the current TXG.

@Bronek
Author

Bronek commented Jan 8, 2026

> > Yes, that's done by the newly added
> >
> > } else {
> >   return;
> > }
>
> In your case, the database is flushed not whenever the timeout expires, but only when a new TXG has been issued and the timeout has expired. As a result, if the database is dirty and no new TXG is created by the time the flush deadline is reached, the function returns early instead of flushing the current TXG.

Excellent point. I think both conditions need to be merged into a single if ... else if; commit to follow.

@Bronek
Author

Bronek commented Jan 8, 2026

> In your case, the database is flushed not whenever the timeout expires, but only when a new TXG has been issued and the timeout has expired. As a result, if the database is dirty and no new TXG is created by the time the flush deadline is reached, the function returns early instead of flushing the current TXG.

I don't know how to fix that. From your comment above, the current behaviour is correct, since it flushes the database on timeout whether or not there is a new TXG. But that is exactly what wakes the disks at 10-minute intervals (in the default configuration). I guess this needs a new "is the database dirty" check to supplement the new TXG check, but I don't know how to implement it.

Here is a comparison of the different behaviours, as truth tables:

A = new TXG ?
B = flush interval passed ?

current behaviour:

A B flush
0 0 0
0 1 1
1 0 0
1 1 1

proposed behaviour:

A B flush
0 0 0
0 1 0
1 0 0
1 1 1

additional check:

C = is database dirty ?

improved fix:

A B C flush
0 0 0 0
0 0 1 0
0 1 0 0
0 1 1 1
1 0 0 0
1 0 1 0
1 1 0 1
1 1 1 1

In particular, the row 0 | 1 | 0 | 0 prevents unnecessary flushes. So we are looking at (A || C) && B, but I do not know how to do the "is database dirty" check.
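The "improved fix" table reduces to a single boolean expression. A minimal sketch in C follows; the helper `should_flush` and its parameter names are hypothetical and exist only to illustrate the table, they are not taken from the ZFS source.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not part of ZFS. It models the "improved fix"
 * truth table from the comment above:
 *   new_txg          (A) a new TXG has been noted
 *   interval_passed  (B) the flush interval has elapsed
 *   db_dirty         (C) the database changed since the last flush
 */
static bool
should_flush(bool new_txg, bool interval_passed, bool db_dirty)
{
	return ((new_txg || db_dirty) && interval_passed);
}
```

For instance, `should_flush(false, true, false)` evaluates to false, which corresponds to the 0 | 1 | 0 | 0 row.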

@Bronek Bronek closed this Jan 8, 2026
@oshogbo
Contributor

oshogbo commented Jan 8, 2026

Hmm, well, don't we simply need to add a check like:

if (curtime < spa->spa_last_flush_txg_time + spa_flush_txg_time) {
	return;
}
if (spa->spa_last_noted_txg_time < spa->spa_last_flush_txg_time) {
	return;
}

Basically, this says that if we have not noted any TXG since the last flush, let's not flush the database.
Will that work?
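To make the two early returns easier to reason about, here is a small user-space model of them. It is only a sketch: `txg_log_model_t` and `flush_would_proceed` are hypothetical names, and the struct merely stands in for the two `spa_t` timestamps mentioned in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two spa_t timestamps discussed here. */
typedef struct txg_log_model {
	uint64_t last_noted_txg_time;	/* second in which a TXG was last noted */
	uint64_t last_flush_txg_time;	/* second in which the DB was last flushed */
} txg_log_model_t;

/* Returns true when neither early return fires, i.e. a flush would proceed. */
static bool
flush_would_proceed(const txg_log_model_t *m, uint64_t curtime,
    uint64_t flush_interval)
{
	if (curtime < m->last_flush_txg_time + flush_interval)
		return (false);	/* flush interval has not elapsed yet */
	if (m->last_noted_txg_time < m->last_flush_txg_time)
		return (false);	/* nothing noted since the last flush */
	return (true);
}
```

Note that with the strict `<` comparison, equal timestamps do not trigger the second early return, so the model still proceeds to flush in that case.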

@Bronek Bronek reopened this Jan 8, 2026
@Bronek
Author

Bronek commented Jan 8, 2026

> spa->spa_last_noted_txg_time < spa->spa_last_flush_txg_time

You mean like in a2781dd? https://github.com/Bronek/zfs/blob/a2781ddc3f619c852ddbc5b8532f7a0c91e944c8/module/zfs/spa.c#L2167-L2170

@oshogbo
Contributor

oshogbo commented Jan 8, 2026

Yep, I think this should work.

Now we basically have a way to tell whether the database is actually dirty. If the time of the last flush is earlier than the time of the last noted TXG, we have something to write. If the time of the last flush is later, we have already flushed everything, so there is no need to update the ZAP.

@oshogbo
Contributor

oshogbo commented Jan 8, 2026

Thinking about this in a bit more detail, I believe there may still be some corner cases. In particular, there is a scenario where we could perform one unnecessary extra write.

Because curtime is tracked in seconds, consider the following:
A TXG is noted and the time is saved as 00:00:01.
We flush the data and also record the flush time as 00:00:01.
No additional writes occur.
Time advances to 00:00:01 + interval, so we decide it is time to flush again.
However, there is no new TXG, and the recorded flush time (00:00:01) is equal to the saved TXG time (00:00:01). As a result, we end up performing one extra write.

To avoid that, we could add 1s to the flush time, for example, just to make sure the two are never the same.
Or we need an extra flag that is set when we store a TXG and cleared when the data is flushed.
The first option is rather magic.
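The second option, an explicit flag, could look roughly like the sketch below. This is an illustration under stated assumptions, not the code from the PR: `flagged_log_t`, `note_txg` and `maybe_flush` are hypothetical names, and the actual flush is reduced to updating two fields.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state: a flush timestamp plus an explicit dirty flag. */
typedef struct flagged_log {
	uint64_t last_flush_time;	/* second of the last flush */
	bool dirty;			/* set on store, cleared on flush */
} flagged_log_t;

/* Called when a TXG is stored in the (modelled) database. */
static void
note_txg(flagged_log_t *s)
{
	s->dirty = true;
}

/* Returns true if a flush is performed at curtime. */
static bool
maybe_flush(flagged_log_t *s, uint64_t curtime, uint64_t flush_interval)
{
	if (curtime < s->last_flush_time + flush_interval || !s->dirty)
		return (false);
	s->last_flush_time = curtime;
	s->dirty = false;
	return (true);
}
```

Because the flag does not depend on comparing two second-granularity timestamps, the equal-timestamps case described above cannot cause an extra write in this model.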

@Bronek Bronek force-pushed the bronek/fix_frequent_spa_txg_writes branch from a2781dd to 89ba5c7 Compare January 8, 2026 11:58
@Bronek
Author

Bronek commented Jan 8, 2026

> Thinking about this in a bit more detail, I believe there may still be some corner cases. In particular, there is a scenario where we could perform one unnecessary extra write.
>
> Because curtime is tracked in seconds, consider the following: A TXG is noted and the time is saved as 00:00:01. We flush the data and also record the flush time as 00:00:01. No additional writes occur. Time advances to 00:00:01 + interval, so we decide it is time to flush again. However, there is no new TXG, and the recorded flush time (00:00:01) is equal to the saved TXG time (00:00:01). As a result, we end up performing one extra write.
>
> To avoid that, we could add 1s to the flush time, for example, just to make sure the two are never the same. Or we need an extra flag that is set when we store a TXG and cleared when the data is flushed. The first option is rather magic.

How about this: 89ba5c7?

@oshogbo
Contributor

oshogbo commented Jan 8, 2026

IMHO this is wrong :)

Let me explain (I have rewritten the previous example):
Interval is: 00:00:04.
Last noted is 00:00:00.
A TXG is noted and the time is saved as 00:00:05.
Adjustment is set to 1.
We flush the data (as 00:00:00 + 1 is lower than 00:00:05) and also record the flush time as 00:00:05.
No additional writes occur.
Time advances to 00:00:09, so we decide it is time to flush again.
However, there is no new TXG, and the recorded flush time (00:00:05) is equal to the saved TXG time (00:00:05). As a result, we end up performing one extra write.

@Bronek
Author

Bronek commented Jan 8, 2026

> IMHO this is wrong :)
>
> Let me explain (I have rewritten the previous example): Interval is: 00:00:04. Last noted is 00:00:00. A TXG is noted and the time is saved as 00:00:05. Adjustment is set to 1. We flush the data (as 00:00:00 + 1 is lower than 00:00:05) and also record the flush time as 00:00:05. No additional writes occur. Time advances to 00:00:09, so we decide it is time to flush again. However, there is no new TXG, and the recorded flush time (00:00:05) is equal to the saved TXG time (00:00:05). As a result, we end up performing one extra write.

Please note this commit also changes the comparison (for the early return;) from < to <=, so if both noted and flushed are equal and there is no new transaction (i.e. adjustment remains 0), then this condition will be true and we return early. If there is a new transaction then adjustment will be 1, meaning we bump "flushed" for the purpose of comparison and the condition will be false, so we proceed to flush.

@Bronek Bronek force-pushed the bronek/fix_frequent_spa_txg_writes branch from 89ba5c7 to b38932e Compare January 8, 2026 12:51
@oshogbo
Contributor

oshogbo commented Jan 8, 2026

> Please note this commit also changes the comparison (for the early return;) from < to <=

I see, I have missed that.

@oshogbo
Contributor

oshogbo commented Jan 9, 2026

After looking at it once more, I don’t think we need this special variable adj. If both times are the same, it means we’ve already flushed the data, so <= should be sufficient.

@Bronek
Author

Bronek commented Jan 10, 2026

> After looking at it once more, I don’t think we need this special variable adj. If both times are the same, it means we’ve already flushed the data, so <= should be sufficient.

Could this function be called more than once before one second has elapsed?

@oshogbo
Contributor

oshogbo commented Jan 11, 2026

It can be, but I don’t see a problem. Even if spa_flush_txg_time is set to 0 (which probably doesn’t make sense), we will still flush every second.
Also, even if spa_last_noted_txg is set to 0, the database will reject duplicate records for the same second, so there is nothing to flush.

@Bronek
Author

Bronek commented Jan 11, 2026

> It can be, but I don’t see a problem. Even if spa_flush_txg_time is set to 0 (which probably doesn’t make sense), we will still flush every second. Also, even if spa_last_noted_txg is set to 0, the database will reject duplicate records for the same second, so there is nothing to flush.

With the early return condition like this

	if (curtime < spa->spa_last_flush_txg_time + spa_flush_txg_time ||
		spa->spa_last_noted_txg_time <= spa->spa_last_flush_txg_time) {
		return;
	}

... and assuming that neither spa_last_noted_txg_time nor spa_last_flush_txg_time is updated elsewhere (I am ignoring spa_unload_sync_time_logger because it is outside the regular sync schedule), once these two converge on the same value we would (without adj) always return early. That would be a pretty bad bug IMHO. Or, very likely, I might be missing something.

EDIT: I see how this would work. We bump spa_last_noted_txg_time every time there is a new transaction, so the condition will be false and we will flush, as long as enough time has passed since the previous flush.
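The reasoning in the EDIT can be checked with a small user-space model of the combined early-return condition. This is a simplified sketch, not the actual function: `logger_model_t` and `logger_step` are hypothetical names, the spa_note_txg_time gate is left out, and the flush itself is reduced to recording the timestamp.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the spa_t fields named in this thread. */
typedef struct logger_model {
	uint64_t last_noted_txg;
	uint64_t last_noted_txg_time;
	uint64_t last_flush_txg_time;
} logger_model_t;

/* One call of the modelled logger; returns true if it would flush. */
static bool
logger_step(logger_model_t *m, uint64_t txg, uint64_t curtime,
    uint64_t flush_interval)
{
	/* A new TXG bumps the "last noted" timestamp. */
	if (txg > m->last_noted_txg) {
		m->last_noted_txg = txg;
		m->last_noted_txg_time = curtime;
	}
	/* Early return: too soon, or nothing noted since the last flush. */
	if (curtime < m->last_flush_txg_time + flush_interval ||
	    m->last_noted_txg_time <= m->last_flush_txg_time)
		return (false);
	m->last_flush_txg_time = curtime;
	return (true);
}
```

In this model, once the two timestamps converge the function keeps returning early until a newer TXG is noted, at which point the next call after the interval flushes again.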

@oshogbo
Contributor

oshogbo commented Jan 11, 2026

I think you can squash these commits and fix the checkstyle issues, and then it is ready to go.

@Bronek Bronek force-pushed the bronek/fix_frequent_spa_txg_writes branch 3 times, most recently from 798e23d to a5fe3d8 Compare January 11, 2026 13:47
@Bronek Bronek changed the title from "DRAFT: Fix unnecessary writes of transaction database" to "Fix unnecessary writes of transaction database" Jan 11, 2026
@Bronek Bronek marked this pull request as ready for review January 11, 2026 14:11
@github-actions github-actions bot added the "Status: Code Review Needed" label (ready for review and testing) and removed the "Status: Work in Progress" label (not yet ready for general review) Jan 11, 2026
@Bronek
Author

Bronek commented Jan 11, 2026

@oshogbo technically all of the fix comes from your comments. OK to add this line?

Co-authored-by: Mariusz Zaborski <[email protected]>

@oshogbo
Contributor

oshogbo commented Jan 11, 2026

Sure, I appreciate it. Thanks!

This makes the periodical flushes of transaction timestamps database
conditional to any updates in the database since the last flush.

Co-authored-by: Mariusz Zaborski <[email protected]>
Signed-off-by: Bronek Kozicki <[email protected]>
@Bronek Bronek force-pushed the bronek/fix_frequent_spa_txg_writes branch from a5fe3d8 to 6704e19 Compare January 11, 2026 14:46
Member

@amotin amotin left a comment

I'm sorry if I missed something from the earlier discussion (TL;DR), but I don't see how this change should fix anything. spa_sync_time_logger() is normally called only once per TXG, so the condition txg > spa->spa_last_noted_txg should always be true. After that, if curtime >= spa->spa_last_noted_txg_time + spa_note_txg_time, so that we can reach the later condition, we update spa_last_noted_txg_time. That means the added spa->spa_last_noted_txg_time <= spa->spa_last_flush_txg_time will be false.

I haven't looked deeper and may be wrong, but I suspect the actual problem is that ZFS kicks TXGs and calls this function on each TXG timeout, even if nothing has changed. Those empty TXGs in turn get logged into the DB and produce pool writes that would otherwise not have existed.
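The reviewer's argument can be illustrated with a toy simulation of an idle pool in which every TXG timeout still produces a new (empty) TXG. Everything here is an assumption made for illustration: the names `idle_model_t`, `idle_logger_step` and `count_idle_flushes` are hypothetical, the ordering of the checks is a guess based on the comment above, and the interval values used in the example are arbitrary rather than ZFS defaults.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the spa_t fields named in this thread. */
typedef struct idle_model {
	uint64_t noted_txg;
	uint64_t noted_time;
	uint64_t flush_time;
} idle_model_t;

/* One modelled call per TXG; returns true if it would flush. */
static bool
idle_logger_step(idle_model_t *m, uint64_t txg, uint64_t curtime,
    uint64_t note_interval, uint64_t flush_interval)
{
	if (curtime < m->noted_time + note_interval)
		return (false);	/* too soon to note another TXG */
	if (txg > m->noted_txg) {
		m->noted_txg = txg;
		m->noted_time = curtime;
	}
	if (curtime < m->flush_time + flush_interval ||
	    m->noted_time <= m->flush_time)
		return (false);
	m->flush_time = curtime;
	return (true);
}

/* Count flushes over `seconds` when an empty TXG arrives every txg_timeout. */
static unsigned int
count_idle_flushes(uint64_t seconds, uint64_t txg_timeout,
    uint64_t note_interval, uint64_t flush_interval)
{
	idle_model_t m = { 0 };
	unsigned int flushes = 0;
	uint64_t txg = 1;

	for (uint64_t t = txg_timeout; t <= seconds; t += txg_timeout) {
		if (idle_logger_step(&m, txg++, t, note_interval,
		    flush_interval))
			flushes++;
	}
	return (flushes);
}
```

Under these assumptions the `<=` check never suppresses a flush, because each flush attempt is preceded by a freshly noted TXG. The model therefore keeps flushing once per flush interval even though nothing was written, which is the behaviour the reviewer suspects.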

Development

Successfully merging this pull request may close these issues.

[2.4] TXG timestamp DB sync if idle causes unnecessary disk access/prevent spin down

3 participants