Label Redesign Proposal #17225

pcd1193182 · 2025-04-07T20:27:11Z

pcd1193182
Apr 7, 2025
Collaborator

Motivation and Context

The current label design in ZFS has served us well for a long time now, but it is starting to show its age. Because we always use a full disk sector for each uberblock, and we have a limited ring size of 128k, the number of uberblocks we can store on new large-sector devices is getting dangerously low. There are other friction points too: not having enough room to store the pool config, limited room to support new features, and no dedicated space for extreme-rollback uberblocks. The time has come to figure out what a new label layout should look like. @allanjude started this discussion at the 2024 OpenZFS Developer Summit with his presentation, and the next step is to present a proposal and make sure we get the full range of community feedback on it.
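To make the squeeze concrete, here is a rough back-of-the-envelope sketch. It takes the 128 KiB ring and the one-uberblock-per-sector rule from the paragraph above; the 1 KiB minimum slot size and the function name are assumptions for illustration, and the sketch ignores any clamping the current code applies.

```python
RING_SIZE = 128 * 1024  # uberblock ring size in bytes, per the text above
MIN_SLOT = 1024         # assumed minimum uberblock slot size (illustrative)

def uberblocks_in_ring(sector_size: int, ring_size: int = RING_SIZE) -> int:
    """Number of uberblock slots if each uberblock occupies a full sector."""
    slot = max(sector_size, MIN_SLOT)
    return ring_size // slot

for sector in (512, 4096, 16384, 131072):
    print(f"{sector:>7} B sectors -> {uberblocks_in_ring(sector)} uberblocks")
```

Under these assumptions a 128 KiB sector leaves room for a single uberblock in the ring, which is the failure mode the proposal is trying to get ahead of.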

Considerations

We don't want to have an irreconcilable break with the past, here; old versions of ZFS should know that there is a pool on the disks, even if that pool is one using the new label type and the old software can't read it.
NVMe devices with underlying physical sector sizes as high as 128KiB are currently available; we don't know how far this trend is going to go, but it could persist for some time. We want to plan for this new design to last for at least another 5-10 years, so we shouldn't bake in any current assumptions.
ZFS is developing a wider and more diverse array of features than ever before, and some of these require new capabilities to work to their fullest. For example, raidz expansion uses the 3.5MB scratch space that was previously reserved for boot data. We want to make sure that this label design enables future features that might need interesting capabilities.
Reliability is still paramount. An upgraded label design that doesn't properly protect the uberblock ring against various kinds of write failures is no upgrade at all.

High-level Proposal

As discussed in Allan's presentation, the core of the proposal is to switch to a larger label size. The initial proposal is a 256MiB label, with a large chunk of reserved space. In total, the start-of-disk reserved area would be 1 GiB. There was some additional discussion at the ZFS dev summit, so the other parts of the proposal have changed somewhat. The layout of the label would be dynamic; the label would have a "Table of Contents" that would store the start and size of each section. This allows flexibility and makes it easier to change things in the future. Some sections would include the boot information block, the vdev config, a whole-pool config, and special Uberblocks (MMP, checkpoint). The ToC could also store information about the use of the reserved space, e.g. if some of it is being used for raidz expansion or other features. The uberblock ring itself would take up a large chunk of the space, probably the last N MiB of the label.
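As a purely illustrative sketch of what a "Table of Contents" could look like, the following models each section as a start and size and checks that the sections fit inside the label without overlapping. Only the 256 MiB label size comes from the proposal; the section names, the sizes, and the `TocEntry` and `validate_toc` names are hypothetical.

```python
from dataclasses import dataclass

MIB = 1024 * 1024
LABEL_SIZE = 256 * MIB  # proposed label size from the text above

@dataclass(frozen=True)
class TocEntry:
    name: str    # hypothetical section name
    start: int   # byte offset from the start of the label
    size: int    # section length in bytes

def validate_toc(entries, label_size=LABEL_SIZE):
    """Check that sections stay inside the label and do not overlap."""
    end_of_previous = 0
    for e in sorted(entries, key=lambda e: e.start):
        if e.size <= 0 or e.start < end_of_previous:
            return False
        end_of_previous = e.start + e.size
    return end_of_previous <= label_size

# Entirely made-up sizes; only the 256 MiB total comes from the proposal.
toc = [
    TocEntry("toc", 0, 1 * MIB),
    TocEntry("boot_info", 1 * MIB, 1 * MIB),
    TocEntry("vdev_config", 2 * MIB, 16 * MIB),
    TocEntry("pool_config", 18 * MIB, 64 * MIB),
    TocEntry("special_uberblocks", 82 * MIB, 46 * MIB),
    TocEntry("uberblock_ring", 128 * MIB, 128 * MIB),  # the "last N MiB"
]
print(validate_toc(toc))
```

Keeping the layout in data rather than in compile-time constants is what would let a later release resize or add a section without another format break.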

There is also the question of how ZFS will know whether to use the new label design or not, and relatedly, how old systems will interact with these new pools. So far, it seems that the most workable solution is to put dummy data in the old label that will get old versions of ZFS far enough to try to pick an uberblock, and then only have a single dummy uberblock with version=5001 in it. This will let old systems know that the pool is not openable. We could also have a special key in the vdev nvlist config that indicates to newer versions of ZFS that this pool is using the v2 label format. That would allow them to short-circuit the import logic there and go use the new layout. The problem with this approach is that it might require a nontrivial amount of work to come up with appropriate "dummy" data for the old label. Unfortunately, the version number is only stored in the uberblock itself, so the import process has to get that far before we could leverage it.
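The decision flow described above can be sketched roughly as follows. This is not the real import path: the function names are invented, `label_v2` is a hypothetical nvlist key, and 5000 is used as the highest pool version existing software accepts.

```python
SPA_VERSION = 5000          # highest pool version existing software accepts
DUMMY_UB_VERSION = 5001     # version proposed above for the dummy uberblock
V2_LABEL_KEY = "label_v2"   # hypothetical nvlist key, not an existing one

def old_software_import(ub_version: int) -> str:
    """What a pre-v2 release would conclude from the uberblock it finds."""
    if ub_version > SPA_VERSION:
        return "unsupported version"
    return "importable"

def new_software_import(vdev_config: dict, ub_version: int) -> str:
    """New releases check the config key first and skip the old ring."""
    if vdev_config.get(V2_LABEL_KEY):
        return "read v2 label"
    return old_software_import(ub_version)

print(old_software_import(DUMMY_UB_VERSION))
print(new_software_import({V2_LABEL_KEY: True}, DUMMY_UB_VERSION))
```

The point of the sketch is the ordering: old software only learns the pool is unusable once it reaches the uberblock, while new software can branch as soon as it has parsed the config.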

Questions

Does this design meet the current needs of ZFS? What about future needs that are still in the pipeline, or haven't even been properly started yet?
How many uberblocks do we want to keep around? If we're limited to 1 per sector, 4k sectors can store plenty, but 128k or 16M sectors will have a harder time.
What do we do about old versions of ZFS with new pools? How do we indicate that we're using the new layout?
When do we use this new layout? Only on new pools? Only on large pools? Only on large pools with large sector sizes?
Can a pool mix old and new label layouts on different disks?
Do we need any special considerations for the end-of-disk label?
Should we reserve space at the end of the disk that can be returned to the metaslab layer if we run out of storage for future extensions?
How big are sector sizes really going to get, anyway?

pcd1193182 · 2025-04-07T20:30:55Z

pcd1193182
Apr 7, 2025
Collaborator Author

One idea that I had for storing uberblocks more efficiently is to not just store one per sector. We still don't want to store uberblocks from similar TXGs in the same sector, though, to prevent shorn writes or other disk issues from destroying multiple important uberblocks at once. My proposal was that uberblocks would be written in a rotating fashion, just like now. But when building the write for that sector, we would start by reading in the current sector, and then shifting the contents back by one uberblock. Then, we put the new uberblock at the front. That way, each sector can store multiple uberblocks, but the latest one is always at the front, where it is more likely to get written out safely before something happens, and where it is easy to locate. Any damaged sector would result in the loss of every Nth uberblock, where N is the number of sectors in use. This could allow us to store large numbers of old uberblocks safely and reliably, giving us much better rewind capabilities.
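A toy simulation of the read-shift-prepend idea, with TXG numbers standing in for whole uberblocks. The sector count, the slots per sector, and the function names are illustrative assumptions rather than proposed values.

```python
def write_uberblock(sectors, txg, slots_per_sector):
    """Write txg's uberblock using the read-shift-prepend scheme."""
    index = txg % len(sectors)   # rotate through sectors, as today
    current = sectors[index]     # "read in the current sector"
    # Shift existing contents back by one slot, dropping the oldest,
    # and put the new uberblock at the front.
    sectors[index] = [txg] + current[:slots_per_sector - 1]

def surviving_txgs(sectors, damaged_index):
    """All uberblocks still readable if one sector is lost."""
    return sorted(t for i, s in enumerate(sectors)
                  if i != damaged_index for t in s)

sectors = [[] for _ in range(4)]   # N = 4 sectors, illustrative
for txg in range(1, 13):           # write TXGs 1..12
    write_uberblock(sectors, txg, slots_per_sector=3)

print(sectors[0])                  # newest uberblock sits at the front
print(surviving_txgs(sectors, damaged_index=0))
```

Losing sector 0 here removes TXGs 4, 8 and 12, i.e. every 4th uberblock, while the neighbouring TXGs on either side of each lost one remain readable.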

0 replies

pcd1193182 · 2025-04-09T18:43:35Z

pcd1193182
Apr 9, 2025
Collaborator Author

I've been trying to determine how/to what extent we can keep compatibility with old versions of ZFS. After more closely looking at the code, I think we are going to have to keep pretty much the whole old label around other than the uberblocks. In order for old versions of ZFS to know that there's a pool on the disks at all, we need to have the old label config set up properly. We can have an additional tag in there that old versions won't understand that tells new versions to read the new label format, but the old label still has to basically work, or libzfs won't find the config to pass to tryimport, which won't be able to understand them well enough to even tell there's a pool anyway.

Given that, we still have two options: a version number bump (5001) in the uberblock, or a feature flag. The advantage of a version number bump is simplicity. The advantage of the feature flag is that if we're going to all the effort to write out this uberblock, we could just put a real BP in there. The new feature flag could be READONLY, so old versions of ZFS would still be able to import the pool R/O, which could be useful for recovery. It gets us additional capabilities for very low cost.

1 reply

pcd1193182 May 5, 2025
Collaborator Author

That old BP isn't going to be useful to old versions of ZFS unless we specifically edit it to make it so. The problem is that the offset on disk is interpreted from the end of the label, which will be different between old disks and new ones.
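A small worked example of the mismatch, assuming the current layout reserves two 256 KiB labels plus the 3.5 MiB boot region (4 MiB in total) ahead of the data, and the new layout reserves 1 GiB as proposed. The example offset and the function name are arbitrary.

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3

# Current layout: two 256 KiB labels plus the 3.5 MiB boot region.
OLD_DATA_START = 2 * 256 * KIB + 3 * MIB + 512 * KIB   # 4 MiB
# Proposed layout: 1 GiB reserved at the start of the disk.
NEW_DATA_START = 1 * GIB

def physical_offset(bp_offset: int, data_start: int) -> int:
    """Translate a block pointer's vdev offset into a disk byte offset."""
    return data_start + bp_offset

bp_offset = 8 * MIB   # arbitrary example offset
where_old_code_looks = physical_offset(bp_offset, OLD_DATA_START)
where_data_lives = physical_offset(bp_offset, NEW_DATA_START)
print(where_data_lives - where_old_code_looks)
```

With these numbers old software would read 1020 MiB too early for every block, so an unmodified BP would point it into the new reserved area rather than at the data.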

tonyhutter · 2025-04-09T20:58:48Z

tonyhutter
Apr 9, 2025
Maintainer

I don't have any strong opinions on this, as I've never really messed with the label code.

> As discussed in Allan's presentation, the core of the proposal is to switch to a larger label size. The initial proposal is a 256MiB label, with a large chunk of reserved space. In total, the start-of-disk reserved area would be 1 GiB.

So would the new minimum vdev size be ~1 GiB? I think the current minimum is effectively ~80MB or something.

2 replies

no-usernames-left Apr 9, 2025

This probably makes sense given the evolution in disk size over the last 20-25 years. While we're at it, we should probably tweak ZFS for solid-state media, such as updates to caching behaviour, for all-flash vdevs.

pcd1193182 Apr 10, 2025
Collaborator Author

We also don't have to enable the new-label feature by default for smaller disks. We could easily only apply this to disks larger than 1TB, where using 1GB for the label is small enough to not matter too much.
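For a sense of scale, a quick sketch of that policy using the 1GB and 1TB figures from this comment. The function names are hypothetical, and decimal units are assumed.

```python
GB = 10**9
TB = 10**12
LABEL_OVERHEAD = 1 * GB    # label cost used in the comment above
SIZE_THRESHOLD = 1 * TB    # cutoff suggested in the comment above

def use_large_label(disk_size: int) -> bool:
    """Only disks above the threshold would get the new label by default."""
    return disk_size > SIZE_THRESHOLD

def overhead_fraction(disk_size: int) -> float:
    """Share of the disk consumed by a large label."""
    return LABEL_OVERHEAD / disk_size

for size in (100 * GB, 1 * TB, 20 * TB):
    print(size // GB, "GB:", use_large_label(size),
          f"{overhead_fraction(size):.3%}")
```

At the 1TB cutoff the label costs 0.1% of the disk, and the share only shrinks from there.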

pcd1193182 · 2025-04-10T20:54:48Z

pcd1193182
Apr 10, 2025
Collaborator Author

Another interesting question: Do all the vdevs of a given pool need to use the same label type? Could some vdevs be using the large labels, and others the small ones? There's no particular reason why not; the vdev config itself could store whether that vdev uses the large label. The pool-level detection would still be important for old versions of ZFS, so they could know the pool isn't compatible.

5 replies

tonyhutter Apr 10, 2025
Maintainer

Along those lines - if you had an existing mirrored pool with small labels, and you replaced some of the disks with larger drives, would the new drives get large labels?

tonyhutter Apr 10, 2025
Maintainer

If the answer is yes, what happens if the pool is 99.99999% full and doesn't have the free space for even a 1GB large label? Would the mirror replacement just fail?

pcd1193182 Apr 10, 2025
Collaborator Author

Updating existing drives to use the new labels is pretty fraught. The simplest problem is that we might already have data written in the first GB (or last 512MiB) where we want the new labels to go. Moving it is not really an option. Because of the embedded slog, it might actually be possible in practice to get the free space, but even if we did, now the offsets for every block stored on that disk would be wrong.

Resilvering with larger drives is one of the only ways you could get away with it; you could shift the start of the data region back when we do the resilver, leaving the free GB at the start for the new label. It wouldn't matter how full the pool is, as long as the new drive was at least ~1.5GB larger than the old drive, since the allocatable space (the space between the labels) would be going up.
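The ~1.5GB figure falls out of comparing the space between the labels before and after. A rough sketch, assuming the current layout uses 4 MiB at the front and two 256 KiB labels at the back, and the new layout uses 1 GiB at the front and 512 MiB at the back. It ignores alignment and metaslab rounding, and the function names are made up.

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3

OLD_FRONT = 4 * MIB          # two 256 KiB labels + 3.5 MiB boot region
OLD_BACK = 2 * 256 * KIB     # two trailing 256 KiB labels
NEW_FRONT = 1 * GIB          # proposed start-of-disk reserved area
NEW_BACK = 512 * MIB         # end-of-disk figure mentioned above

def allocatable(disk_size: int, front: int, back: int) -> int:
    """Space between the labels that the pool can allocate from."""
    return disk_size - front - back

def can_replace_with_large_label(old_size: int, new_size: int) -> bool:
    """True if the new disk still offers at least as much usable space."""
    return (allocatable(new_size, NEW_FRONT, NEW_BACK)
            >= allocatable(old_size, OLD_FRONT, OLD_BACK))

old = 100 * GIB
print(can_replace_with_large_label(old, old + 1 * GIB))
print(can_replace_with_large_label(old, old + 2 * GIB))
```

How full the pool is never enters the calculation; only the size difference between the two drives does.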

tonyhutter Apr 11, 2025
Maintainer

> Resilvering with larger drives is one of the only ways you could get away with it; you could shift the start of the data region back when we do the resilver, leaving the free GB at the start for the new label. It wouldn't matter how full the pool is, as long as the new drive was at least ~1.5GB larger than the old drive, since the allocatable space (the space between the labels) would be going up.

I was just talking to @behlendorf about this case. I'm concerned about what would happen if you were doing a sequential resilver (mirror) instead of a healing resilver (raid). From vdev_rebuild.c:

 * For mirrored devices it's possible to implement an alternate sequential
 * reconstruction strategy when resilvering.  Sequential reconstruction
 * behaves like a traditional RAID rebuild and reconstructs a device in LBA
 * order without verifying the checksum.  After this phase completes a second
 * scrub phase is started to verify all of the checksums.  This two phase
 * process will take longer than the healing reconstruction described above.
 * However, it has that advantage that after the reconstruction first phase
 * completes redundancy has been restored.  At this point the pool can incur
 * another device failure without risking data loss.

Looking at vdev_rebuild_thread() it seems for a sequential resilver we just walk the metaslabs and spacemaps and issue rebuild IOs. If the spacemap allocations overlap with the new large label, it's going to get stepped on.

tonyhutter Apr 11, 2025
Maintainer

Never mind - I see you already accounted for this in your comment:

> you could shift the start of the data region back when we do the resilver, leaving the free GB at the start for the new label.

gmelikov · 2025-04-11T08:42:32Z

gmelikov
Apr 11, 2025
Collaborator

I wanted to highlight the case of prepared images, e.g. for clouds. These usually have a small partition that is grown on first boot and/or later. Making labels 1G would force us to add 1G to such an image in RAW format (the whole Ubuntu cloud image with root-on-zfs is ~800MB now, and 1/3 of it is the /boot and EFI partitions, not root-on-zfs itself). It may be OK to store a compressed qcow2, but the label will still add 1G after initial decompression and disk allocation.

Leaving <1T disks with the old label would address this case, but then such disks would always keep the old labels.

1 reply

pcd1193182 Apr 11, 2025
Collaborator Author

That's an interesting use case to highlight. That procedure has other downsides (e.g. extremely small metaslabs), but it would be nice if we could somehow support expanding to large labels for that case. The only ways I can think to solve it right now are device removal (add a larger disk with the new label, remove the original small disk, accept the remapping cost) or replacement (attach a larger disk with the new label to the small disk, let it resilver, detach and destroy the small disk).

The only other way would be actually moving the data on the disk when we expand it, which is getting into pretty fraught territory. For an 800MB image, sure, that's not too bad, but in the general case it's quite complex.
