Implement new label format for large disks #17573
pcd1193182 wants to merge 12 commits into openzfs:master from
Conversation
e039970 to c20fcf4
include/sys/vdev_impl.h
```diff
 * Size of embedded boot loader region on each label.
 * The total size of the first two labels plus the boot area is 4MB.
-* On RAIDZ, this space is overwritten during RAIDZ expansion.
+* On RAIDZ, this space is overwritten durinvg RAIDZ expansion.
```
This typo was marked resolved, but looks like it's still present
821000e to 8c65661
8567011 to 6718ba5
6718ba5 to f2adf61
robn
left a comment
This seems to be a bit more polished version of the series I saw a few months ago, which I think bodes well - nothing bad or surprising seen since!
So what's still needed to move this forward? And what's the plan from here? Is the intent to get this merged and onto real pools before there's a new feature that requires it, or hold it until we need it? I'm guessing/hoping the former, to get operational experience and shake out any issues before we actually need it.
Any thoughts or guidance on how to use all this new space? I don't really have any at this stage, and I don't think there's a big long line of things waiting to use it. Regardless, as with most of our on-disk formats, if we're upgrading them to throw off limitations of the past, I would like this to be the last time we ever have to, and that in part means making sure we know how to use it and never break it!
```sh
log_must create_pool -f $TESTPOOL "$DSK"0
log_must zdb -l "$DSK"0
```
Since user_large_label and uses_old_label just call zdb -l anyway, I assume this is just here (and in large_label_002_pos) to get the output into the log for debugging?
Yeah, if the test fails, it's nice to not have to modify and rerun just to see what the zdb output actually looked like, or if that was the command that failed directly, or we just failed to find the string we wanted.
My hope was to get this integrated sooner rather than later. As you said, it would be good to have time to find any issues or make improvements before there's a new feature that needs it and drives a lot of sudden new adoption of something that hasn't had as much time to mature. Plus, while there aren't any new headline features in this patch, we do still get the benefit of having a longer uberblock history. For a lot of data recovery jobs, that alone can prove quite helpful, since it provides a much longer window of TXGs to roll back to when trying to recover specific data.

The thing that's needed to move it forward is reviews, pretty much. I think the bones are good, though I'm sure there are tweaks to be made, and I think it's ready for more eyes on it.
Right now, we store the checkpoint uberblock in the MOS. This works mostly fine for the intended use cases. However, if your pool is rendered totally unimportable (by a bug in the import code related to a new feature that causes it to panic, or really specifically timed corruption, for example), it can be impossible to roll back. Storing a copy of the checkpoint uberblock in the label as a backup, along with a new import flag, zhack support, or some other way to do the rollback, might be useful.

One thing this PR does include is storing a copy of the pool config in the label. That isn't currently used for anything except debugging, but it could be handy for importing pools with badly damaged or missing top-level vdevs.

Another idea that came to me is storing a compression dictionary for use with zstd; zstd has a dictionary mode where, rather than storing the dictionary inline with the data, it can use an external pre-programmed dictionary. It might be helpful (especially for smaller recordsizes) to generate dictionaries based on ZFS metadata, or allow users to generate them based on their own data, and then use them for compression and decompression. If they're used to store metadata, they may need to be accessed before the MOS is readable, so storing them in the label might help.

We currently use the 3.5 MiB reserved space to do raidz expansion; that space is sufficient because eventually the raidz expansion can use its own previously allocated space as working room. If we ever wanted to implement raidz width increases (increasing the parity of existing blocks, for example), we would need more space; the larger labels might provide enough scratch space for that.

I agree it would be nice if we never had to do this again; we don't want to come back in ten years and say "hey, actually now we need 64GiB labels, whoops!" One thing that I think works well in the current design to prevent that is the table of contents; that structure can contain not only information about the different sections in the label, but also about label extensions or other features that the label is using.
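To make the table-of-contents idea a little more concrete, here is a toy sketch in Python (the field names and the feature string below are invented for illustration; the real structure in the patch is an nvlist):

```python
# Toy model of a label table of contents: named sections with offsets
# and sizes, plus a list of label features a reader must understand.
# All names below are invented for illustration.
toc = {
    "sections": {
        "vdev_config":    {"offset": 0x4000,  "size": 0x20000},
        "uberblock_ring": {"offset": 0x24000, "size": 0x100000},
    },
    "features": ["com.example:large_label"],
}

def get_secinfo(toc, name):
    """Return (offset, size) for a named section, or None if absent."""
    sec = toc["sections"].get(name)
    return (sec["offset"], sec["size"]) if sec else None

print(get_secinfo(toc, "vdev_config"))  # (16384, 131072)
```

The point of the indirection is that a reader only needs to understand the TOC itself; unknown sections or unrecognized feature strings can be skipped or rejected cleanly rather than forcing a new label format.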
As a curious question, would this be able to help with #11408? If not, feel free to mark my comment as off-topic.
I took a quick look and I think the answer is no? That feature proposal would require that there be nothing in the first 128MiB of the disk, while this feature leaves the old label format in place so that ZFS-aware utilities know there is ZFS data stored on this device. I think it neither solves nor prevents that idea; it's just sort of orthogonal. Now, one could combine this with some other functionality we've discussed around configuring a "data offset" for the pool that would replace the current starting location of the data in ZFS (at the end of the start labels plus reserved space); that would allow room to be saved for non-ZFS partitions, sort of similar to what I did in cursedfs (though that just stashed the data in the reserved space).
d612a90 to 6f816b2
This is a gentle reminder/request for review. The latest push addressed most of the comments that have been offered thus far, and switched to a multi-ring design so that imports stay fast while still providing a deep reserve of TXGs to attempt rolling back to.
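To sketch what a multi-ring design buys (the slot counts and stride here are invented for illustration, not values from the patch): a dense ring covers recent TXGs so import only scans a small window, while a sparse ring retains much older TXGs for rollback:

```python
# Hypothetical two-ring uberblock scheme: ring 0 keeps every txg,
# ring 1 keeps every 64th. Slot counts and stride are invented.
DENSE_SLOTS = 128
SPARSE_SLOTS = 128
SPARSE_STRIDE = 64

def slots_for(txg):
    """Rings/slots where the uberblock for this txg would be written."""
    out = [(0, txg % DENSE_SLOTS)]
    if txg % SPARSE_STRIDE == 0:
        out.append((1, (txg // SPARSE_STRIDE) % SPARSE_SLOTS))
    return out

dense_window = DENSE_SLOTS                    # rollback reach, dense ring only
sparse_window = SPARSE_SLOTS * SPARSE_STRIDE  # reach added by the sparse ring
print(dense_window, sparse_window)            # 128 8192
```

Import only has to scan the dense ring to find the newest uberblock, while recovery tooling can dig through the sparse ring for a far older TXG.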
I'll take a look. We just released 2.4.0, so now is a good time to get some of these bigger features merged. Also, we had some CI fixes go in recently, so please rebase to see a lot of the test failures go away. |
Sorry again for not taking a look at this earlier. Here are my first-pass comments:

- We should mention in the man pages that >1024GB gets the large label by default, and less than that gets the small label. If the intention is for "1TB disks and above" to use the large label, then we should make the limit 1000GB rather than 1024GB, since HDDs don't use power-of-2 capacities.
- Please use
- We should catch extremely pathological cases like:
All done.
Handled, good catch. I'll add a test for this one too; I resolved it by just saying we refuse to use the large label if the usable size post-label is less than the usable size of a 64MiB (
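On the units point from the review (1000GB vs 1024GB): a marketed "1TB" HDD holds 10^12 bytes, which falls short of 2^40, so a power-of-two cutoff would leave such disks on the small label. A quick arithmetic check (my numbers, just illustrating the review comment, not code from the patch):

```python
# A "1TB" HDD is decimal (10**12 bytes); a 1024GiB cutoff is binary (2**40).
TB_DECIMAL = 10**12   # what a "1TB" HDD actually provides
GIB_1024 = 1 << 40    # a 1024GiB (power-of-two) threshold

hdd_1tb = TB_DECIMAL
print(hdd_1tb >= GIB_1024)    # False: a 1024GiB cutoff misses 1TB HDDs
print(hdd_1tb >= TB_DECIMAL)  # True: a 1000GB cutoff catches them
```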
Yeah, this is something I noticed. It makes sense, since we're now zeroing out a gig per disk instead of a meg, but it's not ideal. I can see if there are any improvements to be made here in terms of parallelizing the IO, but I'm not optimistic. I can also see if there's any way we can get away with not zeroing all of the space, but I'm not sure if that will work. We don't want to try to use old uberblocks from earlier pools stored on this disk.
So there's something a little interesting going on with the performance of zpool creation. Right now, we create all the IOs for a given vdev and execute them in parallel, but each vdev is sequential. That can be fixed relatively easily by passing around a common parent ZIO and executing it at the end. The interesting thing is that that doesn't improve performance at all. I looked at the code in some more detail, and it looks like we actually don't issue the IOs in parallel. And I think this happens a lot throughout the label code. The problem is that we pass

I'm not clear on if we should be calling

The obvious avenue for improvement is to increase the number of threads in the NULL taskq, but it's a little unfortunate to fix a transient issue like pool creation with a permanent fix. We could also not dispatch to the NULL taskq if we're doing CONFIG_WRITER IOs during initialization; that might be better. Currently

EDIT: That seems to work, and the tests pass. I had to serialize some of the IOs again, because we actually do rely on the config being written out for each disk before we consider the next one; this is one of the ways we prevent a disk from being used multiple times in the same pool. If we do all the IOs at the end, we can end up with something like
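The shape of the fix described above (attach all the per-disk label writes to one parent and wait once, rather than waiting per vdev) is the standard fan-out/join pattern. As a rough userland analogy only, not the actual zio code:

```python
# Userland analogy for the parent-ZIO pattern: submit all per-disk
# "label writes" up front (children of one logical parent), then join
# once at the end instead of waiting after each disk.
from concurrent.futures import ThreadPoolExecutor, wait
import time

def label_write(disk):
    time.sleep(0.05)  # stand-in for one disk's label IO latency
    return disk

disks = [f"disk{i}" for i in range(8)]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(label_write, d) for d in disks]  # fan out
    wait(futures)                                           # single join
elapsed = time.monotonic() - start
print(f"{elapsed:.2f}s")  # well below the ~0.4s a serial loop would take
```

Of course, as the EDIT above notes, the real code can't fully fan out: some writes must be ordered (config written per disk before the next disk is considered), so only the independent IOs get to share a parent.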
Other than the tiny nit I mentioned, I don't see any big surface-level issues. I was happy to see that you got the zpool create times way down. I re-ran my old test, and saw the creation time go from 51s -> 4s! |
tonyhutter
left a comment
Approved, but please fix the minor typo in vdev_impl.h
This patch contains the logic for a new larger label format. This format is intended to support disks with large sector sizes. By using a larger label we can store more uberblocks and other critical pool metadata. We can also use the extra space to enable new features in ZFS going forwards. This initial commit does not add new capabilities, but provides the framework for them going forwards.
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Wasabi, Inc.
Sponsored-by: Klara, Inc.
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
This provides a balance between frequent UBs at new txgs, and sparse ones for historical purposes.
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
ccd5e5c to 4940a31
alek-p
left a comment
Thanks for working on this, Paul. Hope you don't mind that I'm reviewing this piecemeal.
```c
{
	zio_eck_t *ub_eck =
	    (zio_eck_t *)
	    ((char *)(ub_data) + (ASHIFT_UBERBLOCK_SIZE(ashift, B_FALSE))) - 1;
```
should this be ASHIFT_UBERBLOCK_SIZE(ashift, B_TRUE) here?
```c
	return;
}

ub = malloc(UBERBLOCK_SHIFT);
```
= malloc(ASHIFT_UBERBLOCK_SIZE(ashift, B_TRUE))?
```c
	    cvd2->vdev_psize - VDEV_LABEL_END_SIZE);
	ASSERT3U(rc->rc_shadow_offset +
	    abd_get_size(rc->rc_abd), <, cvd2->vdev_psize -
	    VDEV_LABEL_END_SIZE(cvd));
```
```c
	if (!vdev_toc_get_secinfo(toc,
	    VDEV_TOC_VDEV_CONFIG,
	    &vp_size[l], &off))
		continue;
```
think we need to fnvlist_free(toc) before this continue
```c
	for (int u = 0;
	    u < VDEV_LARGE_UBERBLOCK_RING / SPA_MAXBLOCKSIZE;
	    u++) {
		vdev_label_write(pio, vd, l, B_TRUE, ub_abd2,
```
is there a reason to attach this to the pio instead of the local zio we're waiting on later in this function?
Since this function will zero_off + free the ub_abd2, and the vdev_label_write() call will zio_nowait(), it seems like it might be dangerous to use the parent ZIO instead of the local one that we do zio_wait() for.
Sponsored by: [Wasabi, Inc.; Klara, Inc.]
Motivation and Context
As disk sector sizes increase, we are able to store fewer and fewer uberblocks on a disk. This makes it increasingly difficult to recover from issues by rolling back to earlier TXGs. Eventually, sector sizes may become large enough that not even a single uberblock can be stored without having to do a partial write. In addition, new ZFS features often need space to store metadata (see, for example, the buffer used by RAIDZ expansion). This space is highly limited with the current disk layout.
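The shrinking uberblock count can be illustrated with the legacy label's constants: the uberblock ring is 128 KiB, and each slot is at least 1 KiB but grows to one full sector as ashift increases. A back-of-the-envelope sketch (not code from the patch):

```python
# Legacy-label arithmetic: slots in the 128 KiB uberblock ring shrink
# as the sector size (2**ashift) grows past 1 KiB.
RING = 128 * 1024  # legacy uberblock ring size per label

def uberblock_count(ashift):
    slot_shift = max(ashift, 10)  # each slot is max(1 KiB, one sector)
    return RING >> slot_shift

for ashift in (9, 12, 14, 16, 17):
    print(ashift, uberblock_count(ashift))  # 128, 32, 8, 2, 1 slots
```

At ashift=17 (128 KiB sectors), a single uberblock fills the entire ring, which is the degenerate case this PR is designed to escape.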
Description
This patch contains the logic for a new larger label format. This format is intended to support disks with large sector sizes. By using a larger label we can store more uberblocks and other critical pool metadata. We can also use the extra space to enable new features in ZFS going forwards. This initial commit does not add new capabilities, but provides the framework for them going forwards.
It also contains zdb and zhack support for the new label type, as well as tests that verify basic functionality of the new label. Currently, the size of the disk is used as a rubric for whether or not to enable the new label type, but that is open to change.
How Has This Been Tested?
In addition to the tests added in this PR, I also ran the ZFS test suite with the tunable set below the size of the disks in use. Some tests failed, but only for space estimation reasons, which could have been corrected with fixes to the tests. Similarly, I ran some ztest runs with the new label format.
Types of changes
Checklist:
Signed-off-by.