Label Redesign Proposal #17225
Replies: 5 comments 9 replies
-
One idea that I had for storing uberblocks more efficiently is to not just store one per sector. We still don't want to store uberblocks from similar TXGs in the same sector, though, to prevent shorn writes or other disk issues from destroying multiple important uberblocks at once. My proposal was that uberblocks would be written in a rotating fashion, just like now. But when building the write for that sector, we would start by reading in the current sector, and then shifting the contents back by one uberblock. Then, we put the new uberblock at the front. That way, each sector can store multiple uberblocks, but the latest one is always at the front, where it is more likely to get written out safely before something happens, and where it is easy to locate. Any damaged sector would result in the loss of every Nth uberblock, where N is the number of sectors in use. This could allow us to store large numbers of old uberblocks safely and reliably, giving us much better rewind capabilities.
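To make the shift-and-prepend idea concrete, here is a minimal sketch in Python. It is only an illustration: the 1 KiB uberblock size, the 4 KiB sector size, the in-memory `ring` list standing in for the on-disk sectors, and the `fake_ub` helper are all assumptions, not anything from the existing label code.

```python
from typing import List

UB_SIZE = 1024       # assumed size of one serialized uberblock, in bytes
SECTOR_SIZE = 4096   # assumed physical sector size, in bytes
SLOTS_PER_SECTOR = SECTOR_SIZE // UB_SIZE


def push_uberblock(sector: bytes, new_ub: bytes) -> bytes:
    """Build the new sector image: new_ub at the front, the existing
    entries shifted back by one slot, and the oldest entry dropped."""
    if len(sector) != SECTOR_SIZE or len(new_ub) != UB_SIZE:
        raise ValueError("unexpected sector or uberblock size")
    return new_ub + sector[:SECTOR_SIZE - UB_SIZE]


def write_txg(ring: List[bytes], txg: int, ub: bytes) -> int:
    """Choose the sector in rotating fashion (txg modulo ring length),
    read it, shift, prepend, and store it back. Returns the sector index."""
    idx = txg % len(ring)
    ring[idx] = push_uberblock(ring[idx], ub)
    return idx


def fake_ub(txg: int) -> bytes:
    """Hypothetical stand-in for a serialized uberblock: the TXG number
    in the first 8 bytes, zero padding after that."""
    return txg.to_bytes(8, "little").ljust(UB_SIZE, b"\0")


def txgs_in_sector(sector: bytes) -> List[int]:
    """Decode the TXG stored at the start of every slot, front to back."""
    return [int.from_bytes(sector[i * UB_SIZE:i * UB_SIZE + 8], "little")
            for i in range(SLOTS_PER_SECTOR)]


if __name__ == "__main__":
    ring = [bytes(SECTOR_SIZE) for _ in range(4)]
    for txg in range(1, 13):
        write_txg(ring, txg, fake_ub(txg))
    for i, sector in enumerate(ring):
        print(i, txgs_in_sector(sector))
```

With four sectors, each sector ends up holding every fourth TXG with the newest at the front, so losing one sector costs every Nth uberblock rather than a run of consecutive ones.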
-
I've been trying to determine how, and to what extent, we can keep compatibility with old versions of ZFS. After looking more closely at the code, I think we are going to have to keep pretty much the whole old label around other than the uberblocks. In order for old versions of ZFS to know that there's a pool on the disks at all, we need to have the old label config set up properly. We can have an additional tag in there that old versions won't understand that tells new versions to read the new label format, but the old label still has to basically work, or libzfs won't find the config to pass to tryimport, which won't be able to understand them well enough to even tell there's a pool anyway. Given that, we still have two options: a version number bump (5001) in the uberblock, or a feature flag. The advantage of a version number bump is simplicity. The advantage of the feature flag is that if we're going to all the effort to write out this uberblock, we could just put a real BP in there. The new feature flag could be READONLY, so old versions of ZFS would still be able to import the pool R/O, which could be useful for recovery. It gets us additional capabilities for very low cost.
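The read-only behaviour of the feature flag option can be sketched as a small decision function. This is a simplified model of how an importer might treat unsupported features, not the actual import code; the `Feature` type, the `import_mode` function, and the feature name `example:large_label` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Feature:
    name: str
    readonly_compat: bool  # True if software lacking the feature may still import R/O


def import_mode(active: Iterable[Feature], supported: Set[str]) -> str:
    """Return "read-write", "read-only", or "refuse" for a pool whose
    active features are `active`, opened by software that supports
    only the feature names in `supported`."""
    mode = "read-write"
    for feature in active:
        if feature.name in supported:
            continue
        if not feature.readonly_compat:
            return "refuse"   # unknown feature that affects reading
        mode = "read-only"    # unknown feature that only affects writing
    return mode


if __name__ == "__main__":
    # Hypothetical feature name, used only for illustration.
    large_label = Feature("example:large_label", readonly_compat=True)
    print(import_mode([large_label], supported=set()))
    print(import_mode([large_label], supported={"example:large_label"}))
```

Under this model an old version would get a read-only import, whereas the version bump to 5001 gives old versions no way in at all.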
-
I don't have any strong opinions on this, as I've never really messed with the label code.
So would the new minimum vdev size be ~1 GiB? I think the current minimum is effectively ~80 MB or something.
-
Another interesting question: Do all the vdevs of a given pool need to use the same label type? Could some vdevs be using the large labels, and others the small ones? There's no particular reason why not; the vdev config itself could store whether that vdev uses the large label. The pool-level detection would still be important for old versions of ZFS, so they could know the pool isn't compatible.
-
I wanted to highlight the case of prepared images, e.g. for clouds. In them we usually have a small partition which will be grown on first boot and/or later. Making labels 1G would force us to add 1G to such an image in RAW format (the whole Ubuntu cloud image with root-on-zfs is ~800MB now, and 1/3 of it is the /boot and EFI partitions, not root-on-zfs itself). It may be OK to store the image as qcow2 with compression, but it will still add 1G after initial decompression and disk allocation. Leaving <1T disks with the old label would be an answer to this case, but then such disks would always have old labels.
-
Motivation and Context
The current label design in ZFS has served us well for a long time now, but is starting to show its age. Because we always use a full disk sector for each uberblock, and we have a limited ring size of 128 KiB, the number of uberblocks we can store on new large-sector devices is getting dangerously low. There are other friction points too: not having enough room to store the pool config, limited room to support new features, and no dedicated space for extreme-rollback uberblocks. The time has come to figure out what a new label layout should look like. @allanjude started this discussion at the 2024 OpenZFS Developer Summit with his presentation, and the next step is to present a proposal and make sure we get the full range of community feedback on it.
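The ring-size arithmetic is easy to check. The sketch below takes the description above at face value (one uberblock per sector, 128 KiB ring) and ignores any minimum or maximum slot size the real implementation may apply; the function name and the sector sizes chosen are only for illustration.

```python
RING_BYTES = 128 * 1024  # ring size stated above


def uberblock_slots(sector_bytes: int, ring_bytes: int = RING_BYTES) -> int:
    """Number of uberblocks that fit if each one occupies a full sector."""
    if sector_bytes <= 0 or ring_bytes % sector_bytes != 0:
        raise ValueError("sector size must be positive and divide the ring size")
    return ring_bytes // sector_bytes


if __name__ == "__main__":
    for sector in (4096, 8192, 16384):
        print(sector, uberblock_slots(sector))
```

Under these assumptions a 4 KiB sector leaves 32 slots, 8 KiB leaves 16, and 16 KiB leaves only 8.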
Considerations
High-level Proposal
As discussed in Allan's presentation, the core of the proposal is to switch to a larger label size. The initial proposal is a 256MiB label, with a large chunk of reserved space. In total, the start-of-disk reserved area would be 1 GiB. There was some additional discussion at the ZFS dev summit, so the other parts of the proposal have changed somewhat. The layout of the label would be dynamic; the label would have a "Table of Contents" that would store the start and size of each section. This allows flexibility and makes it easier to change things in the future. Some sections would include the boot information block, the vdev config, a whole-pool config, and special Uberblocks (MMP, checkpoint). The ToC could also store information about the use of the reserved space, e.g. if some of it is being used for raidz expansion or other features. The uberblock ring itself would take up a large chunk of the space, probably the last N MiB of the label.
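As a rough illustration of the "Table of Contents" idea, the sketch below lays out named sections from the front of a 256 MiB label and places the uberblock ring at the end. Every section name, every size other than the 256 MiB label, and the `Section` and `build_toc` names are hypothetical; the real on-disk encoding is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

MiB = 1024 * 1024
LABEL_SIZE = 256 * MiB  # label size from the proposal


@dataclass(frozen=True)
class Section:
    name: str
    offset: int  # byte offset from the start of the label
    size: int    # length in bytes


def build_toc(front: List[Tuple[str, int]], ring_size: int,
              toc_size: int = 4096,
              label_size: int = LABEL_SIZE) -> Dict[str, Section]:
    """Place the ToC first, pack the `front` sections after it in order,
    and put the uberblock ring in the last `ring_size` bytes of the label."""
    toc = {"toc": Section("toc", 0, toc_size)}
    cursor = toc_size
    for name, size in front:
        toc[name] = Section(name, cursor, size)
        cursor += size
    ring_offset = label_size - ring_size
    if cursor > ring_offset:
        raise ValueError("sections do not fit in front of the uberblock ring")
    toc["uberblock_ring"] = Section("uberblock_ring", ring_offset, ring_size)
    return toc


if __name__ == "__main__":
    layout = build_toc(
        [("boot_info", 1 * MiB), ("vdev_config", 1 * MiB),
         ("pool_config", 16 * MiB), ("special_uberblocks", 1 * MiB)],
        ring_size=128 * MiB,
    )
    for section in layout.values():
        print(f"{section.name:20s} offset={section.offset:>10d} size={section.size}")
```

Because readers locate each section through the table rather than through fixed offsets, a later version could resize or add sections without changing the lookup logic.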
There is also the question of how ZFS will know whether to use the new label design or not, and relatedly, how old systems will interact with these new pools. So far, it seems that the most workable solution is to put dummy data in the old label that will get old versions of ZFS far enough to try to pick an uberblock, and then only have a single dummy uberblock with version=5001 in it. This will let old systems know that the pool is not openable. We could also have a special key in the vdev nvlist config that indicates to newer versions of ZFS that this pool is using the v2 label format. That would allow them to short-circuit the import logic there and go use the new layout. The problem with this approach is that it might require a nontrivial amount of work to come up with appropriate "dummy" data for the old label. Unfortunately, the version number is only stored in the uberblock itself, so the import process has to get that far before we could leverage it.
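The two detection paths can be sketched as follows. This is a simplified model, not the import code: the key name `label_v2`, both function names, and the use of a plain dict in place of an nvlist are assumptions made for illustration.

```python
MAX_SUPPORTED_VERSION = 5000  # pool version used by feature-flag pools today
DUMMY_VERSION = 5001          # version proposed for the dummy uberblock
V2_LABEL_KEY = "label_v2"     # hypothetical key name in the vdev config


def old_zfs_can_open(uberblock_version: int,
                     max_supported: int = MAX_SUPPORTED_VERSION) -> bool:
    """Old software refuses any pool whose best uberblock reports a
    version newer than it understands."""
    return uberblock_version <= max_supported


def label_format_for_new_zfs(vdev_config: dict) -> str:
    """New software checks the config for the v2 marker before it ever
    looks at the old uberblock ring."""
    return "v2" if vdev_config.get(V2_LABEL_KEY) else "v1"


if __name__ == "__main__":
    print(old_zfs_can_open(DUMMY_VERSION))               # old ZFS, new pool
    print(label_format_for_new_zfs({V2_LABEL_KEY: True}))  # new ZFS, new pool
    print(label_format_for_new_zfs({}))                  # new ZFS, old pool
```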
Questions