While this change should allow failures of any one (-two-three) enclosures with all their disks, it will do nothing for a failure of two (-three-four) disks from different enclosures. So having two groups per row we still reduce reliability to half of RAIDZ's, while hopefully doubling the spare activation rate. Having too many groups per row still sounds questionable though. Not sure I'd go beyond 3-4. So in real life, with realistic enclosure sizes, many dRAID vdevs per pool might still be needed. And I'd say at least 6 enclosures, but better 10-12. To me a realistic configuration would look like 11x 64-disk enclosures and a pool with 16x 44-disk draid2 vdevs. Not many people have setups like this, but those that do may benefit indeed. But we may need to better document all those intricacies. I only briefly looked at the code, but what I haven't noticed there is a new pool feature; as I understand it, this is an incompatible change, since we don't store the map on disks.
Thinking again, I think I missed one more positive relation here. When the number of groups (disks per enclosure/slice) goes to and beyond the number of enclosures/slices, there will be a chance that two or more failed drives end up in the same enclosure, counting only as one. But to reach that point with a reasonably wide group we'd need to already accept a pretty bad probability of multiple failures. Would be interesting to see that in numbers instead of guesses.
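It can indeed be put in numbers. Below is an illustrative back-of-the-envelope sketch (not part of the patch; the function name and the uniform, independent failure model are my assumptions) of the chance that at least two of k simultaneously failed drives land in the same enclosure:

```python
from math import comb

def prob_shared_enclosure(enclosures, disks_per_enc, failures):
    # Probability that at least two of `failures` uniformly random
    # failed drives land in the same enclosure (so they count as a
    # single domain failure). Exact combinatorial count.
    total = enclosures * disks_per_enc
    if failures > enclosures:
        return 1.0  # pigeonhole: some enclosure must repeat
    # Ways to pick `failures` drives in pairwise-distinct enclosures:
    distinct = comb(enclosures, failures) * disks_per_enc ** failures
    return 1.0 - distinct / comb(total, failures)

# e.g. 11 enclosures of 64 disks, 3 concurrent failures:
p = prob_shared_enclosure(11, 64, 3)
```

For the 11x64 example above this comes out around a quarter, so the effect is not negligible once several concurrent failures are already assumed.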
@amotin, I'm not sure I understand. Why? The failure of disks from different enclosures is handled the same way as it is handled today without this change; there is no regression in this functionality. I.e. disks can fail, but the number of them that can fail at the same time is limited by the draid1/2/3 configuration (up to three). The whole purpose and idea of this change is to have the same features we normally have with dRAID on many disks (i.e. fast sequential resilvering of any disk failure and shared spares) but at the same time be able to tolerate failures of enclosures or even servers/racks (if disks are shared via iSCSI). Without this patch, the only way to configure this is to have a separate draid vdev per failure group, but this won't give you fast resilvering. And the faster we resilver and restore data redundancy, the less chance we hit another disk failure during the resilver, which is better for overall data durability. Moreover, spare disks won't be shared among draid vdevs. So, for example, if you have only one spare (1s) in such a vdev and it is already used, the next disk failure in this draid vdev won't be resilvered. With this patch, all spares are shared between the groups, so that even if all disks from some group have failed, they all can be resilvered using the spares from other groups, as long as we have them available. By the way, the change is fully compatible. The old pools created with the previous ZFS versions will still work with this change.
Yeap, and you can configure it like this, for example: Yes, the number of spare disks in such configurations is big, but there is no way to avoid this - you must put spare disks in your failure groups, at least one per group, so that when the whole enclosure fails each group would have a spare to resilver to. The only way to alleviate it is to have bigger failure groups (that is, more enclosures). For example, if there were 16 enclosures, we could configure it as three
On second thought, maybe we can do better and allow the user to specify fewer spares than there are failure groups. Maybe it's not worth repairing all disks when the whole enclosure fails after all, since it will generate a lot of IOPS and consume a lot of resources, essentially dragging the system down, and the enclosure might be replaced faster than such a resilver would even complete. I will think about how it can be implemented... Thanks for looking at it, @amotin!
Right. I wasn't saying that it is worse, just that it is not better. And the current state I would not call particularly great on wider vdevs.
Sure. And a failure of 3 disks out of 253 will be game over, unless you are lucky enough for some of them to happen in the same enclosure.
In the case of draid3, it won't be game over even if they fail at the same time. Moreover, as I already mentioned, sequential resilvering of one disk on a 253-disk draid will be 10 times faster than on a 25-disk draid, which decreases the chance of another disk failure while resilvering is in progress, so the risk of having more disks lost at the same time is lower. This is the point of dRAID, isn't it? And again, if the disks are failing not at the same time (which is normally the case), you can lose as many disks as you have spares in your draid vdev. You can even lose all disks in one failure group, if there are enough spares from other groups to resilver them to. Hope it helps.
Hi @andriytk, I'm not entirely sure that this works as intended, or I might just be missing some draid quirk... I cloned https://github.com/andriytk/zfs/tree/fdomains and built successfully. However, doing some basic IO tests and filling the pool with data (4G isos), it starts out all right writing to all devices, but after a while it degrades into only writing to the first device. In the beginning: After a short while: All spinning disks, no separate ZIL or somesuch. But if it was a ZIL issue I'd expect some IO to other devices at some point after this, but no. Also, stopping the test and letting the system go idle, and then restarting the test in another directory, immediately starts writing to only the first device, so I don't think it's ZIL related... So, real bug or am I doing something wrong?
Right. But backwards compatible it is not -- a previous ZFS version won't be able to import the new pools correctly. That is why a new pool feature is required.
Right. Except a raidz3:8d:1s vdev would already have 50% raw space efficiency and significant write overhead, and I doubt anybody will really have 20 enclosures to allow raidz3:16d:1s, even if it makes sense for the payload. Never mind. My criticism here was not so much pointed at this change as at dRAID concepts in general. This change does improve the situation a bit, and I appreciate it.
@ZNikke, thanks for giving it a try. It's interesting; I didn't see this problem on my setup. A quick attempt to reproduce it with the same configuration on my laptop didn't reproduce it either, but I will try again. I cannot think of anything in the patch that could cause such behaviour, tbh... Which commit did you pick up to build?
@amotin, I didn't know it's required that previous ZFS versions must be able to import pools created by the new ZFS version, but this patch does not break that. I just tested it: 1) created a pool without failure domains in the new version with the patch; 2) imported it successfully using the old version without the patch.
It is not required between major versions. But it is required to be controllable. I.e. no new features should be used until explicitly enabled, and presence of any new/unknown active feature should reliably prevent import, or limit it to read-only.
Would you care to explain how, if there is no code to handle the new parameters, permutations or vdev status? |
I think what was meant is to test: create a pool with failure domains in the new version, then try to import it in the old version. Old code should gracefully deny import of any non-supported pools.
@andriytk It's not the latest now, I see, so I'll refresh and rebuild, but the changes seem very unrelated. Something related to sizes? My test setup has 2T drives (actually a mix of 2T and 8T, but it seems to pick the smallest size with -f), so my pool size is 54.5T. Oh, and the drive enc0d2 is a 2T drive, so it's not that it wants to "fill up" an 8T drive... Is there some layout dump or somesuch that you might find helpful? Or should we look into arranging access for you to our test setup?
@amotin, when failure domains are not configured in a draid vdev, the code works exactly the same as without the patch; no new permutations or parameters (only one, actually - nslice) are used by the new code in a way that would make it incompatible with the old one. (There is no new vdev status.)
@gmelikov, thank you, that makes sense. Any hints on how to implement it? What needs to be updated in the new code so that the old code would recognise it and gracefully deny import?
We have feature flags for that: https://openzfs.github.io/openzfs-docs/Basic%20Concepts/Feature%20Flags.html . There's a read-only-compatible kind if the old code can still read the pool (it forces read-only import) and the usual kind (which prevents import on old code entirely). I think this patch may be a good starting point: f70c850 (sorry, I personally didn't implement feature flags, so I can give only basic examples)
@andriytk Now rebuilt with the latest. I've built native deb packages as per https://openzfs.github.io/openzfs-docs/Developer%20Resources/Custom%20Packages.html#dkms-1 on the same Ubuntu 24.04/noble machine. Rebuilt the pool and ran a new test with the same result. And letting it calm down and restarting the test in another directory immediately resumes writing to only one drive. zpool scrub gets REALLY upset about the state of the pool. The broken files are the lately written ones from test1 and all from test2, i.e. when all writes ended up on the same drive.
Having said that, enclosure failures won't work if some spare is backing/replacing a disk which is not from its native failure group, and I've just added checks/restrictions for that. It is because if the disk backing that spare happens to be part of the enclosure failure, it can introduce more failures into that group than it can tolerate. UPD: actually, on second thought, the situation is even worse than that. It doesn't matter whether the spare is native or not, because the disks they are mapped to are distributed among the failure groups anyway, and the backing disk of this spare can be in any failure group. But what does matter is how much parity is configured and how many failed disks we already have in the failure group at the time of the enclosure failure. For example, in draid1, if we already have some failed disk, we cannot tolerate an enclosure failure which doesn't include that disk, even if that disk is backed by some spare. It's because any disk from that failed enclosure can be mapped to the spare that is backing that failed disk, which would make it two failed disks in the failure group, and draid1 cannot tolerate that. In other words, if we want to support enclosure failure, we cannot have any failed disks with draid1 that don't belong to the failed enclosure. With draid2, we can have no more than one failed disk in each failure group. With draid3, we can have no more than two failed disks in each failure group to support enclosure failure. Again, it doesn't matter whether those failed disks are backed by spares or not, because any, or even all, of those spares can be mapped to the disks belonging to the failed enclosure. I will update the code and add more restrictions to the allowed failures.
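The rule above boils down to a tiny predicate. This is only a sketch of the constraint as described, not the actual patch code; the function name is my invention, and `failed_per_group` counts already-failed disks outside the enclosure in question:

```python
def can_tolerate_enclosure_failure(nparity, failed_per_group):
    # An enclosure failure adds (at most) one more failed disk to
    # every failure group, so each group may already hold at most
    # nparity - 1 failed disks (draid1: none, draid2: one,
    # draid3: two) -- regardless of whether those failed disks are
    # currently backed by distributed spares.
    return all(failed <= nparity - 1 for failed in failed_per_group)
```

For example, with draid2 and one prior failure in each of two groups the predicate still holds, but a second prior failure in any single group breaks it.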
Updated the code; the redundancy_draid_spare4 test has now passed 100% in 100 iterations:
behlendorf left a comment:
Very clever! This is great work; now that you point it out, it's something I wish I'd included in the original dRAID implementation. It is absolutely functionality we'd make use of on our systems. I've done some light testing with this change and so far it's held up well. I'll kick the tires a bit more with some more interesting large layouts.
akashb-22 left a comment:
LGTM. Tested on a smaller setup without issues. Will verify with a larger configuration of drives and enclosures.
One observation: I manually corrupted a vdev member (file5) in a draid, and noticed that another member from a different fault group is reporting checksum errors as well. Additionally, the events show non-empty ranges fields. Not sure if this is expected behavior or a concern.
@akashb-22, first of all - thanks for giving it such thorough testing! The disks are shuffled between failure groups, so I guess it's normal that you can see checksum errors in different groups when you inject corruption into only one disk of some group. Because, if I remember correctly, cksum error counters are incremented for all disks in the parity group for which the problem is detected. Having said that, I'm not sure about the bad ranges though. Do you normally see them in the same test scenario on draids without the failure domains feature?
tests/zfs-tests/tests/functional/cli_root/zpool_create/zpool_create_draid_005_pos.ksh
@andriytk thanks for the quick iteration on this, it's shaping up nicely. I'll put it through some additional testing over the weekend, but so far it's holding up well. When you get a chance, can you squash the patch stack and update the PR? I don't think there's a need to keep the separate commits any longer.
module/zfs/spa.c
    if (!spa_feature_is_enabled(spa, SPA_FEATURE_DRAID_FAIL_DOMAINS) &&
        draid_nfgroup > 0)
            return (SET_ERROR(ENOTSUP));
You are exiting here in the middle of pool creation, with a transaction group open and, I am sure, a dozen other things allocated. I haven't looked at what's wrong with what I proposed before, but this makes me shiver.
Well spotted! I dropped this check entirely since the feature is always enabled on new pool creation anyway. Thank you!
No. A pool can be created for an older pre-OpenZFS version, or with an arbitrary set of features, through the compatibility property.
Yes, and if you create such pools on a new software version, the feature is enabled automatically. It's absolutely similar to the dRAID feature, there is no difference. If you look at the code where SPA_FEATURE_DRAID is checked, at vdev_alloc(), you can see that it's checked only when spa->spa_load_state != SPA_LOAD_CREATE.
Here's how I tested it last night just to double-check (thanks to your comment):
- Create a pool on an old OpenZFS version.
- Upgrade OpenZFS version.
- Import the pool created with the old version.
- Try to add draid with failure domains to the pool - it fails (due to the check at spa_vdev_add()), as expected.
- Try to create a new pool with a draid vdev with the failure domains feature - it works, as expected.
- Try to add draid with failure domains to the new pool - it works, as expected.
Btw, there was a bug in adding draid with failure domains to the old pool (step 4), which I somehow missed during my previous testing - the error was printed, but the vdev was being added nevertheless. I fixed that in commit 74eaf3d last night. Again, thanks to your comment which prompted me to double-check it!
It seems to work now after commit 8b0d785:
$ sudo zpool create -d -f -m /var/tmp/testdir testpool draid2:5c /var/tmp/basedir.2400/vdev{0..4}
cannot create 'testpool': operation not supported on this type of pool
$ sudo zpool create -d -o 'feature@draid=enabled' -f -m /var/tmp/testdir testpool draid2:5c /var/tmp/basedir.2400/vdev{0..4}
$ sudo zpool create -d -o 'feature@draid=enabled' -f -m /var/tmp/testdir testpool draid2:5c:10w /var/tmp/basedir.2400/vdev{0..9}
cannot create 'testpool': operation not supported on this type of pool
$ sudo zpool create -d -o 'feature@draid_failure_domains=enabled' -f -m /var/tmp/testdir testpool draid2:5c:10w /var/tmp/basedir.2400/vdev{0..9}
$
@amotin, is it good now? I fixed it for both features draid and draid with failure domains.
I guess it's better than nothing, but I am not too deep into that code. I see SPA_FEATURE_ALLOCATION_CLASSES is checked earlier, so I'd try to do something similar. Otherwise I worry that later cleanup may leak something somewhere.
I can see your point. OK, let me try something different...
@amotin, please check commit a113c4d:
$ sudo zpool create -d -f -m /var/tmp/testdir testpool draid2:5c /var/tmp/basedir.2400/vdev{0..4}
cannot create 'testpool': operation not supported on this type of pool
$ sudo zpool create -d -o 'feature@draid=enabled' -f -m /var/tmp/testdir testpool draid2:5c /var/tmp/basedir.2400/vdev{0..4}
$ sudo zpool destroy testpool
$ sudo zpool create -d -o 'feature@draid=enabled' -f -m /var/tmp/testdir testpool draid2:5c:10w /var/tmp/basedir.2400/vdev{0..9}
cannot create 'testpool': operation not supported on this type of pool
$ sudo zpool create -d -o 'feature@draid_failure_domains=enabled' -f -m /var/tmp/testdir testpool draid2:5c:10w /var/tmp/basedir.2400/vdev{0..9}
cannot create 'testpool': operation not supported on this type of pool
$ sudo zpool create -d -o 'feature@draid=enabled' -o 'feature@draid_failure_domains=enabled' -f -m /var/tmp/testdir testpool draid2:5c:10w /var/tmp/basedir.2400/vdev{0..9}
$
Thanks for the hint!
@behlendorf, done. Thank you!
@behlendorf, I was thinking: when a failure domain fails, it doesn't seem to make much sense to start a resilver automatically via zed; it would take a lot of computing and I/O bandwidth resources only to be wasted when the failed domain component is replaced. What do you think? Should we disable it for domain failures? I'd suggest adding simple logic to zed's retire agent when it handles a device failure event: if in each failure group the device at the same i-th index is faulted, which means we are facing an i-th domain failure, and the device from the event is one of those faulted - don't attach a hot spare to this device and don't start resilvering. I've added this logic in recent commits ddaf183...a509459.
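For illustration, the detection step could look roughly like this (a sketch of the logic described above, not the actual zed agent code; the name and matrix representation are my assumptions). `faulted` is a groups x children boolean matrix of faulted devices:

```python
def is_domain_failure(faulted, group, idx):
    # The device at position `idx` of `group` is treated as part of
    # a domain (enclosure) failure when the device at the same index
    # is faulted in every failure group; in that case the retire
    # agent would skip spare attachment and resilvering.
    return faulted[group][idx] and all(row[idx] for row in faulted)
```

The `all(...)` over rows is what distinguishes a whole-domain failure (same slot faulted everywhere) from an ordinary single-disk failure, which should still trigger a spare.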
Yeah, as long as we can reliably detect domain failures this seems like a reasonable compromise for the default behavior. This is another case where I wish we had better existing mechanisms to fine-tune the ZED behavior. For some environments I can alternatively imagine it being preferable to rebuild as quickly as possible to restore redundancy.
Rebased on the latest master.
Currently, the only way to tolerate the failure of the whole
enclosure is to configure several draid vdevs in the pool, each
vdev having disks from different enclosures. But this essentially
degrades draid to raidz and defeats the purpose of having fast
sequential resilvering on wide pools with draid.
This patch allows configuring several children groups in the same
row in one draid vdev. In each such group - let's call it a failure
group - the user can configure disks belonging to different
enclosures (failure domains). For example, in the case of 10
enclosures with 10 disks each, the user can put the 1st disk from
each enclosure into the 1st group, the 2nd disk from each enclosure
into the 2nd group, and so on. If one enclosure fails, only one
disk from each group would fail, which won't affect draid
operation, and each group would have enough redundancy to recover
the stored data. Of course, in the case of draid2 two enclosures
can fail at a time, and in the case of draid3 - three enclosures
(provided there are no other disk failures in each group).
In order to preserve fast sequential resilvering in case of a disk
failure, the groups must share all disks between themselves, and
this is achieved by shuffling the disks between the groups. But
only the i-th disks in each group are shuffled between themselves,
i.e. the disks from the same enclosures; after that they are
shuffled within each group, like it is done today in an ordinary
draid. Thus, no more than one disk from any enclosure can appear
in any failure group as a result of this shuffling.
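The two-stage shuffle can be modelled as follows (an illustrative sketch, not the actual dRAID permutation code; the disk labels and the seeded RNG are my assumptions). Position i in every group holds a disk from enclosure i, so permuting each column across groups first, and then each group internally, keeps exactly one disk per enclosure in every group:

```python
import random

def shuffle_failure_groups(groups, seed=0):
    # `groups` is a list of equal-length lists; position i of every
    # group holds a disk from enclosure i.
    rng = random.Random(seed)
    ngroups, width = len(groups), len(groups[0])
    out = [list(g) for g in groups]
    for i in range(width):            # stage 1: i-th disks across groups
        column = [out[g][i] for g in range(ngroups)]
        rng.shuffle(column)
        for g in range(ngroups):
            out[g][i] = column[g]
    for group in out:                 # stage 2: within each group
        rng.shuffle(group)
    return out

# Two groups of four disks, labelled (enclosure, original group):
groups = [[(enc, grp) for enc in range(4)] for grp in range(2)]
shuffled = shuffle_failure_groups(groups, seed=1)
```

After the shuffle, every group still contains one disk from each of the four enclosures, which is exactly the invariant that makes an enclosure failure cost each group at most one disk.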
For example, here's what the pool status output looks like in
the case of two `draid1:2d:4c:1s` groups:
NAME STATE READ WRITE CKSUM
pool1 ONLINE 0 0 0
draid1:2d:4c:1s:8w-0 ONLINE 0 0 0
enc0d0 ONLINE 0 0 0
enc1d0 ONLINE 0 0 0
enc2d0 ONLINE 0 0 0
enc3d0 ONLINE 0 0 0
enc0d1 ONLINE 0 0 0
enc1d1 ONLINE 0 0 0
enc2d1 ONLINE 0 0 0
enc3d1 ONLINE 0 0 0
spares
draid1-0-0 AVAIL
draid1-0-1 AVAIL
The number of failure groups is specified indirectly via the new
width parameter in the draid vdev configuration descriptor, which
is the total number of disks and which is a multiple of the number
of children in each group. This multiple is the number of groups
(width / children). Doing it this way allows the user to see at a
glance how many disks the draid has.
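In other words, the group count is derived rather than stated. A minimal sketch of that derivation (a hypothetical helper, not the actual parsing code):

```python
def ngroups_from_config(children, width):
    # width is the total number of disks in the draid vdev and must
    # be a multiple of the per-group children count; the quotient is
    # the number of failure groups.
    if width % children != 0:
        raise ValueError("width must be a multiple of children")
    return width // children

# draid1:2d:4c:1s:8w -> children=4, width=8 -> 2 failure groups
```

When width equals children the quotient is 1, which degenerates to today's single-group draid behaviour.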
Spare disks are evenly distributed among failure groups, so the
number of spares should be a multiple of the number of groups, and
they are shared by all groups. However, to support domain failure,
we cannot have more than nparity - 1 failed disks in any group, no
matter whether they are rebuilt to draid spares or not (the blocks
of those spares can be mapped to the disks from the failed domain
(enclosure), and we cannot tolerate more than nparity failures in
any failure group).
The retire agent in zed is updated to not start resilvering when
a domain failure happens. Otherwise, it might take a lot of
computing and I/O bandwidth resources, only to be wasted when the
failed domain component is replaced.
Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Closes openzfs#11969.
How Has This Been Tested?
The following automation tests were added:

- `zpool_create_draid_005_pos.ksh`, which covers the creation of pools with different stripe configurations and a random number of groups in the big-width row (from 2 to 7).
- `redundancy_draid_width.ksh`, which checks that data is intact when any n disks are failed/corrupted at the same offset in each of n failure groups (n being random from 2 to 4).
- `redundancy_draid_spare4.ksh`, based on `redundancy_draid_spare1.ksh`, uses failure groups (random from 2 to 4) and fails disks in the groups at the same random offset, making sure resilvering is successful and the data is intact. It also checks that no more than nparity failures are allowed in each group.
- `suspend_draid_fgroups.ksh`, which checks that the pool gets into a suspended state if more than 3 devices are failed in its draid3 vdev with failure groups (random from 2 to 6).

Types of changes
Checklist:

- Signed-off-by.

Closes #11969.