Implement new label format for large disks by pcd1193182 · Pull Request #17573 · openzfs/zfs

pcd1193182 · 2025-07-28T17:58:37Z

Sponsored by: [Wasabi, Inc.; Klara, Inc.]

Motivation and Context

As disk sector sizes increase, we are able to store fewer and fewer uberblocks on a disk. This makes it increasingly difficult to recover from issues by rolling back to earlier TXGs. Eventually, sector sizes may become large enough that not even a single uberblock can be stored without having to do a partial write. In addition, new ZFS features often need space to store metadata (see, for example, the buffer used by RAIDZ expansion). This space is highly limited with the current disk layout.

Description

This patch contains the logic for a new larger label format. This format is intended to support disks with large sector sizes. By using a larger label we can store more uberblocks and other critical pool metadata. We can also use the extra space to enable new features in ZFS going forwards. This initial commit does not add new capabilities, but provides the framework for them going forwards.

It also contains zdb and zhack support for the new label type, as well as tests that verify basic functionality of the new label. Currently, the size of the disk is used as a rubric for whether or not to enable the new label type, but that is open to change.

How Has This Been Tested?

In addition to the tests added in this PR, I also ran the ZFS test suite with the tunable turned below the size of the disks in use. Some tests failed, but only for space estimation reasons, which could have been corrected with fixes to the tests. Similarly, I ran some ztest runs with the new label format.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Quality assurance (non-breaking change which makes the code more robust against bugs)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

include/sys/vdev_impl.h

allanjude · 2025-08-22T14:30:15Z

include/sys/vdev_impl.h

 * Size of embedded boot loader region on each label.
 * The total size of the first two labels plus the boot area is 4MB.
- * On RAIDZ, this space is overwritten during RAIDZ expansion.
+ * On RAIDZ, this space is overwritten durinvg RAIDZ expansion.


typo durinvg

This typo was marked resolved, but looks like it's still present

robn

This seems to be a bit more polished version of the series I saw a few months ago, which I think bodes well - nothing bad or surprising seen since!

So what's still needed to move this forward? And what's the plan from here? Is the intent to get this merged and onto real pools before there's a new feature that requires it, or hold it until we need it? I'm guessing/hoping the former, to get operational experience and shake out any issues before we actually need it.

Any thoughts or guidance on how to use all this new space? I don't really at this stage, and I'm don't think there's a big long line of things waiting to use it. Regardless, as with most of our on-disk formats, if we're upgrading them to throw of limitations of the past, I would like this to be the last time we ever have to, and that in part means making sure we know how to use it and never break it!

robn · 2025-09-26T05:16:56Z

tests/zfs-tests/tests/functional/large_label/large_label_001_pos.ksh

+
+log_must create_pool -f $TESTPOOL "$DSK"0
+
+log_must zdb -l "$DSK"0


Since user_large_label and uses_old_label just call zdb -l anyway, I assume this is just here (and in large_label_002_pos) to get the output into the log for debugging?

Yeah, if the test fails, it's nice to not have to modify and rerun just to see what the zdb output actually looked like, or if that was the command that failed directly, or we just failed to find the string we wanted.

pcd1193182 · 2025-10-01T20:54:40Z

So what's still needed to move this forward? And what's the plan from here? Is the intent to get this merged and onto real pools before there's a new feature that requires it, or hold it until we need it? I'm guessing/hoping the former, to get operational experience and shake out any issues before we actually need it.

My hope was to get this integrated sooner rather than later. As you said, it would be good to have time to find any issues or make improvements before there's a new feature that needs it, and drives a lot of sudden new adoption of something that hasn't had as much time to mature. Plus, while there aren't any new headline features in this patch, we do still get the benefit of having a longer uberblock history. For a lot of data recovery jobs, that alone can prove quite helpful, since it provides a much longer window of TXGs to roll back to to try and recover specific data.

The thing that's needed to move it forward is reviews, pretty much. I think the bones are good, though I'm sure there are tweaks to be made, and I think it's ready for more eyes on it.

Any thoughts or guidance on how to use all this new space? I don't really at this stage, and I'm don't think there's a big long line of things waiting to use it. Regardless, as with most of our on-disk formats, if we're upgrading them to throw of limitations of the past, I would like this to be the last time we ever have to, and that in part means making sure we know how to use it and never break it!

Right now, we store the checkpoint uberblock in the MOS. This works mostly fine for the intended use cases. However, if your pool is rendered totally unimportable (a bug in the import code related to a new feature that causes it to panic, or really specifically timed courrption, for example) can make it impossible to roll back. Storing a copy of the checkpoint uberblock in the label as a backup and having a new import flag, or zhack or other way to do the rollback might be useful. One thing this PR does include is storing a copy of the pool config in the label. That isn't currently used for anything except debugging, but it could be handy for importing pool with badly damaged or missing top-level vdevs.

Another idea that came to me is the idea of storing a compression dictionary for use with zstd; zstd has a dictionary mode, where rather than storing the dictionary inline with the data, it can use an external pre-programmed dictionary. It might be helpful (especially for smaller recordsizes) to generate dictionaries based ZFS metadata, or allow users to generate them based on their own data, and then use them for compression and decompression. If they're used to store metadata, they may need to be accessed before the MOS is readable, so storing them in the label might help.

We currently use the 3.5 MiB reserved space to do raidz expansion; that space is sufficient because eventually the raidz expansion can use its own previously allocated space as working room. If we ever wanted to try to implement raidz width increases (increasing the parity of existing blocks, for example), we would need more space; the larger labels might provide enough scratch space for that.

I agree it would be nice if we never had to do this again; we don't want to come back in ten years and say "hey actually now we need 64GiB labels, whoops!". One thing that I think works well in the current design to prevent that is the table of contents; that structure can contain not only information about the different sections in the label, but about label extensions or other features that the label is using.

almereyda · 2025-10-27T23:19:23Z

As a curious question, would this be able to help with #11408? Else feel free to mark my comment as off-topic.

pcd1193182 · 2025-11-04T00:57:59Z

As a curious question, would this be able to help with #11408? Else feel free to mark my comment as off-topic.

I took a quick look and I think the answer is no? That feature proposal would require that there be nothing for the first 128MiB of the disk, while this feature leaves the old label format in place so that ZFS-aware utilities know there is ZFS data stored on this device. I think it neither solves nor prevents that idea, it's just sort of orthogonal.

Now, one could use some other functionality that we've discussed related to this around configuring a "data offset" for the pool that would replace the current starting location of the data in ZFS (at the end of the start labels + reserved space); that would allow room to be saved for non-zfs partitions, sort of similar to how I did in cursedfs (though that just stashed the data in the reserved space).

pcd1193182 · 2025-12-19T23:43:39Z

This is a gentle reminder/request for review. The latest push addressed most of the comments that have been offered thus far, and switched to a multi-ring design so that imports stay fast while still providing a deep reserve of TXGs to attempt rolling back to.

tonyhutter · 2025-12-20T00:56:08Z

I'll take a look. We just released 2.4.0, so now is a good time to get some of these bigger features merged. Also, we had some CI fixes go in recently, so please rebase to see a lot of the test failures go away.

tonyhutter · 2025-12-23T01:36:45Z

Sorry again for not taking a look at this earlier. Here's my first pass comments:

man/man7/zpool-features.7 should be updated to document com.klarasystems:large_label

zfs_vdev_large_label_min_size should be documented in man/man4/zfs.4.

We should mention in the man pages that >1024GB gets the large label by default, and less than that gets the small label. If the intention is for "1TB disks and above" to use the large label, then we sould make the limit 1000GB rather than 1024GB, since HDDs don't use power-of-2 capacities.

Please use file_write -d 'R' ... instead of dd if=/dev/urandom ... in the test cases, since /dev/urandom is pretty slow:

13:24:55.79 SUCCESS: create_pool -f testpool /var/tmp/testdir1/dsk0 mirror /var/tmp/testdir1/dsk1 /var/tmp/testdir1/dsk2 raidz1 /var/tmp/testdir1/dsk3 /var/tmp/testdir1/dsk4 /var/tmp/testdir1/dsk5 log /var/tmp/testdir1/dsk6 special /var/tmp/testdir1/dsk7
13:25:20.18 SUCCESS: dd if=/dev/urandom of=/testpool/f1 bs=1M count=1k

We should catch extermely pathological cases like:

$ echo 1 | sudo tee /sys/module/zfs/parameters/zfs_vdev_large_label_min_size
1

$ truncate -s 100M file
$ sudo ./zpool create tank ./file
<hang>


[  245.936606] VERIFY3U(offset + size, <=, vd->vdev_psize) failed (285245440 <= 100663296)
[  245.936660] PANIC at zio.c:1566:zio_read_phys()
[  245.936673] Showing stack for process 1653
[  245.936690] CPU: 5 PID: 1653 Comm: vdev_open Kdump: loaded Tainted: P           OE     -------  ---  5.14.0-570.12.1.el9_6.x86_64 #1
[  245.936716] Hardware name: Red Hat KVM/RHEL, BIOS 1.16.3-4.el9 04/01/2014
[  245.936734] Call Trace:
[  245.936743]  <TASK>
[  245.936755]  dump_stack_lvl+0x34/0x48
[  245.936776]  spl_panic+0xd1/0xe9 [spl]s
[  245.936810]  ? sg_init_table+0x11/0x40
[  245.936823]  ? __sg_alloc_table+0x6e/0x170
[  245.936836]  ? sg_alloc_table+0x20/0x90
[  245.936848]  ? __pfx_sg_kmalloc+0x10/0x10
[  245.936860]  ? abd_alloc_chunks+0x276/0x500 [zfs]
[  245.937265]  zio_read_phys+0x122/0x190 [zfs]
[  245.937605]  vdev_probe+0x18a/0x320 [zfs]
[  245.937938]  ? __pfx_vdev_probe_done+0x10/0x10 [zfs]
[  245.938288]  vdev_open+0x8ce/0xb90 [zfs]
[  245.938618]  vdev_open_child+0x1e/0x40 [zfs]
[  245.938947]  taskq_thread+0x355/0x8c0 [spl]
[  245.938979]  ? __pfx_default_wake_function+0x10/0x10
[  245.938999]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[  245.939026]  kthread+0xdd/0x100
[  245.939037]  ? __pfx_kthread+0x10/0x10
[  245.939049]  ret_from_fork+0x29/0x50
[  245.939064]  </TASK>

zpool create does take longer with large labels. It's to be expected, but still worth pointing out. Here's what I see on my 10 year old laptop running zfs in a VM:

# Small labels
$ time sudo ./zpool create tank mirror ./file{1..10}

real	0m0.571s
user	0m0.037s
sys	0m0.110s

# Large labels
$ time sudo ./zpool create tank mirror ./file{1..10}

real	0m51.515s
user	0m0.035s
sys	0m1.186s

pcd1193182 · 2026-01-05T20:20:22Z

Sorry again for not taking a look at this earlier. Here's my first pass comments:

man/man7/zpool-features.7 should be updated to document com.klarasystems:large_label

zfs_vdev_large_label_min_size should be documented in man/man4/zfs.4.

We should mention in the man pages that >1024GB gets the large label by default, and less than that gets the small label. If the intention is for "1TB disks and above" to use the large label, then we sould make the limit 1000GB rather than 1024GB, since HDDs don't use power-of-2 capacities.

Please use file_write -d 'R' ... instead of dd if=/dev/urandom ... in the test cases, since /dev/urandom is pretty slow:

All done.

We should catch extermely pathological cases like:

Handled, good catch. I'll add a test for this one too; I resolved it by just saying we refuse to use the large label if the usable size post-label is less than the usable size of a 64MiB (SPA_MINDEVSIZE) disk with the small label.

zpool create does take longer with large labels. It's to be expected, but still worth pointing out. Here's what I see on my 10 year old laptop running zfs in a VM:
# Small labels
$ time sudo ./zpool create tank mirror ./file{1..10}

real	0m0.571s
user	0m0.037s
sys	0m0.110s

# Large labels
$ time sudo ./zpool create tank mirror ./file{1..10}

real	0m51.515s
user	0m0.035s
sys	0m1.186s

Yeah this is something I noticed. It makes sense, since we're now zeroing out a gig per disk instead of a meg, but it's not ideal. I can see if there's any improvements to be made here in terms of parallelizing the IO, but I'm not optimistic. I can also see if there's any way we can get away with not zeroing all of the space, but I'm not sure if that will work. We don't want to try to use old uberblocks from earlier pools stored on this disk.

pcd1193182 · 2026-01-05T21:28:50Z

So there's something a little interesting going on with the performance of zpool creation. Right now, we create all the IOs for a given vdev and execute them in parallel, but each vdev is sequential. That can be fixed relatively easily by passing around a common parent ZIO and executing it at the end.

The interesting thing is that that doesn't improve performance at all. I looked at the code in some more detail, and it looks like we actually don't issue the IOs in parallel. And I think this happens a lot throughout the label code. The problem is that we pass ZIO_FLAG_CONFIG_WRITER to the ZIO creation code all over vdev_label.c. That's not inherently bad, that doesn't necessarily imply serialization, but in zio_taskq_dispatch we dispatch all IOs with CONFIG_WRITER or PROBE to the NULL taskq. The NULL taskq is configured to only have one thread. As a result, we effectively serialize all of these writes.

I'm not clear on if we should be calling zio_taskq_dispatch for these IOs, but we do seem to be. Increasing the number of threads in the NULL taskq to 8 improved performance significantly, and I was able to verify with iostat that the IOs were happening to all the disks at once, instead of sequentially like before.

The obvious avenue for improvement is to increase the number of threads in the NULL taskq, but it's a little unfortunate to fix a transient issue like pool creation with a permanent fix. We could also not dispatch to the NULL taskq if we're doing CONFIG_WRITER IOs during initialization; that might be better. Currently spa_is_initializing isn't true around vdev_create in spa_create but that could possibly change?

EDIT: That seems to work, and the tests pass. I had to serialize some of the IOs again, because we actually do rely on the config being written out for each disk before we consider the next one; this is one of the ways we prevent a disk from being used multiple times in the same pool. If we do all the IOs at the end, we can end up with something like zpool add test mirror disk1 disk1 succeeding, which is an issue.

cmd/zhack.c

tonyhutter · 2026-01-26T23:49:13Z

Other than the tiny nit I mentioned, I don't see any big surface-level issues. I was happy to see that you got the zpool create times way down. I re-ran my old test, and saw the creation time go from 51s -> 4s!

cmd/zdb/zdb.c

tonyhutter

Approved, but please fix the minor typo in vdev_impl.h

module/zfs/vdev_label.c

This patch contains the logic for a new larger label format. This format is intended to support disks with large sector sizes. By using a larger label we can store more uberblocks and other critical pool metadata. We can also use the extra space to enable new features in ZFS going forwards. This initial commit does not add new capabilities, but provides the framework for them going forwards. Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Sponsored-by: Wasabi, Inc. Sponsored-by: Klara, Inc.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

This provides a balance between frequent UBs at new txgs, and sparse ones for historical purposes Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

alek-p

Thanks for working on this Paul. Hope you don't mind that I'm reviewing this piecemeal

alek-p · 2026-02-26T20:30:33Z

cmd/zhack.c

+{
+	zio_eck_t *ub_eck =
+	    (zio_eck_t *)
+	    ((char *)(ub_data) + (ASHIFT_UBERBLOCK_SIZE(ashift, B_FALSE))) - 1;


should this be ASHIFT_UBERBLOCK_SIZE(ashift, B_TRUE) here?

alek-p · 2026-02-26T20:32:44Z

cmd/zhack.c

+		return;
+	}
+
+	ub = malloc(UBERBLOCK_SHIFT);


= malloc(ASHIFT_UBERBLOCK_SIZE(ashift, B_TRUE))?

alek-p · 2026-02-26T20:35:01Z

module/zfs/vdev_raidz.c

-			    cvd2->vdev_psize - VDEV_LABEL_END_SIZE);
+			ASSERT3U(rc->rc_shadow_offset +
+			    abd_get_size(rc->rc_abd), <, cvd2->vdev_psize -
+			    VDEV_LABEL_END_SIZE(cvd));


s/cvd/cvd2/ ?

alek-p · 2026-02-26T20:36:48Z

module/zfs/vdev_label.c

+				if (!vdev_toc_get_secinfo(toc,
+				    VDEV_TOC_VDEV_CONFIG,
+				    &vp_size[l], &off))
+					continue;


think we need to fnvlist_free(toc) before this continue

alek-p · 2026-02-26T20:45:34Z

module/zfs/vdev_label.c

+			for (int u = 0;
+			    u < VDEV_LARGE_UBERBLOCK_RING / SPA_MAXBLOCKSIZE;
+			    u++) {
+				vdev_label_write(pio, vd, l, B_TRUE, ub_abd2,


is there a reason to attach this to the pio instead of the local zio we're waiting on later in this function?
Since this function will zero_off + free the ub_abd2, and the vdev_label_write() call will zio_nowait(), it seems like it might be dangerous to use the parent ZIO instead of the local one that we do zio_wait() for.

pcd1193182 force-pushed the new_label branch 4 times, most recently from e039970 to c20fcf4 Compare July 31, 2025 19:27

behlendorf added the Status: Design Review Needed Architecture or design is under discussion label Jul 31, 2025

amotin reviewed Aug 8, 2025

View reviewed changes

include/sys/vdev_impl.h Outdated Show resolved Hide resolved

include/sys/vdev_impl.h Outdated Show resolved Hide resolved

include/sys/vdev_impl.h Outdated Show resolved Hide resolved

allanjude reviewed Aug 22, 2025

View reviewed changes

pcd1193182 force-pushed the new_label branch 3 times, most recently from 821000e to 8c65661 Compare August 28, 2025 23:55

pcd1193182 force-pushed the new_label branch 3 times, most recently from 8567011 to 6718ba5 Compare September 15, 2025 17:07

pcd1193182 force-pushed the new_label branch from 6718ba5 to f2adf61 Compare September 25, 2025 20:07

robn reviewed Sep 26, 2025

View reviewed changes

pcd1193182 force-pushed the new_label branch from f2adf61 to 9b03467 Compare November 3, 2025 22:42

pcd1193182 force-pushed the new_label branch 3 times, most recently from d612a90 to 6f816b2 Compare November 5, 2025 21:43

pcd1193182 force-pushed the new_label branch from 6f816b2 to 018448a Compare January 6, 2026 00:20

tonyhutter reviewed Jan 26, 2026

View reviewed changes

cmd/zhack.c Show resolved Hide resolved

pcd1193182 force-pushed the new_label branch from 018448a to fb80dac Compare January 27, 2026 19:15

kithrup reviewed Jan 27, 2026

View reviewed changes

cmd/zdb/zdb.c Show resolved Hide resolved

tonyhutter approved these changes Feb 2, 2026

View reviewed changes

pcd1193182 force-pushed the new_label branch from fb80dac to ccd5e5c Compare February 3, 2026 18:13

alek-p reviewed Feb 25, 2026

View reviewed changes

module/zfs/vdev_label.c Outdated Show resolved Hide resolved

module/zfs/vdev_label.c Outdated Show resolved Hide resolved

module/zfs/vdev_label.c Outdated Show resolved Hide resolved

pcd1193182 added 12 commits February 26, 2026 10:24

Rob's feedback

7cf8d2c

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

mav feedback

a3531d6

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

Try multi-ring setup for large labels

38fff00

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

Use three rings for storing Uberblocks

b3c4798

This provides a balance between frequent UBs at new txgs, and sparse ones for historical purposes Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

Add entry in man page

6440037

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

tony feedback

fed531e

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

write all disks in parallel

60a0431

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

Add warning for sector size > 8k

cfcc429

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

feedback

29cbd3f

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

alek's feedback

2c908e3

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

mmp fix

4940a31

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

pcd1193182 force-pushed the new_label branch from ccd5e5c to 4940a31 Compare February 26, 2026 19:57

alek-p reviewed Feb 26, 2026

View reviewed changes


		log_must create_pool -f $TESTPOOL "$DSK"0

		log_must zdb -l "$DSK"0

Conversation

pcd1193182 commented Jul 28, 2025

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

robn left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pcd1193182 commented Oct 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

almereyda commented Oct 27, 2025

Uh oh!

pcd1193182 commented Nov 4, 2025

Uh oh!

pcd1193182 commented Dec 19, 2025

Uh oh!

tonyhutter commented Dec 20, 2025

Uh oh!

tonyhutter commented Dec 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pcd1193182 commented Jan 5, 2026

Uh oh!

pcd1193182 commented Jan 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

tonyhutter commented Jan 26, 2026

Uh oh!

Uh oh!

tonyhutter left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

alek-p left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

pcd1193182 commented Oct 1, 2025 •

edited

Loading

tonyhutter commented Dec 23, 2025 •

edited

Loading

pcd1193182 commented Jan 5, 2026 •

edited

Loading