
Introduce failure domains to dRAID#18148

Open
andriytk wants to merge 1 commit into openzfs:master from andriytk:fdomains

Conversation

@andriytk
Contributor

@andriytk andriytk commented Jan 22, 2026

Motivation and Context

Currently, the only way to tolerate the failure of a whole enclosure is to configure several draid vdevs in the pool, each vdev having its disks spread across different enclosures. But this essentially degrades draid to raidz and defeats the purpose of having fast sequential resilvering on wide pools with draid.

Description

This patch allows configuring several groups of children in the same row of one draid vdev. In each such group (let's call it a failure group) the user can place disks belonging to different enclosures - failure domains. For example, with 10 enclosures of 10 disks each, the user can put the 1st disk from each enclosure into the 1st group, the 2nd disk from each enclosure into the 2nd group, and so on. If one enclosure fails, only one disk in each group fails, which won't affect draid operation, and each group retains enough redundancy to recover the stored data. Of course, with draid2 two enclosures can fail at a time, and with draid3 three enclosures (provided there are no other disk failures in each group).
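The placement just described can be modeled with a small sketch (Python, illustrative only - the enc*d* names follow the naming used in the status outputs in this PR; nothing here is actual ZFS code):

```python
# Illustrative model (not ZFS code): E enclosures of D disks each; the
# i-th disk of every enclosure is placed into failure group i.
E, D = 10, 10

groups = {i: [f"enc{e}d{i}" for e in range(E)] for i in range(D)}

# A whole-enclosure failure removes exactly one disk from every group,
# which each failure group's own redundancy can absorb.
failed_enclosure = 3
for disks in groups.values():
    lost = [d for d in disks if d.startswith(f"enc{failed_enclosure}d")]
    assert len(lost) == 1
```

With draid2 or draid3, two or three such enclosure failures respectively still leave every group recoverable, per the description above.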

In order to preserve fast sequential resilvering in case of a disk failure, the groups must share all disks between themselves, and this is achieved by shuffling the disks between the groups. But only the i-th disks of each group are shuffled between themselves, i.e. the disks from the same enclosure; after that, the disks are shuffled within each group, as is done today in an ordinary draid. Thus, no more than one disk from any enclosure can appear in any failure group as a result of this shuffling.
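The two-stage shuffle can be sketched like this (a Python model under the layout assumptions above, not the actual ZFS permutation code):

```python
import random

E = 4  # enclosures == disks per failure group (children)
G = 3  # failure groups; disk "enc{e}d{g}" starts at position e of group g
groups = [[f"enc{e}d{g}" for e in range(E)] for g in range(G)]
rng = random.Random(1)

# Stage 1: permute the i-th disks between the groups -- these all come
# from enclosure i, so each position keeps one disk per enclosure.
for i in range(E):
    column = [grp[i] for grp in groups]
    rng.shuffle(column)
    for g, disk in enumerate(column):
        groups[g][i] = disk

# Stage 2: shuffle within each group, as an ordinary draid does today.
for grp in groups:
    rng.shuffle(grp)

# Invariant: every group still holds exactly one disk per enclosure.
for grp in groups:
    assert len({d.split("d")[0] for d in grp}) == E
```

Because stage 1 only permutes disks of the same enclosure between groups, and stage 2 never moves disks across groups, the one-disk-per-enclosure invariant survives any amount of shuffling.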

For example, here's how the pool status output looks in case of two draid1:2d:4c:1s failure groups:

    NAME                        STATE     READ WRITE CKSUM
    pool1                       ONLINE       0     0     0
      draid1:2d:4c:8w:2s-0      ONLINE       0     0     0
        enc0d0                  ONLINE       0     0     0
        enc1d0                  ONLINE       0     0     0
        enc2d0                  ONLINE       0     0     0
        enc3d0                  ONLINE       0     0     0
        enc0d1                  ONLINE       0     0     0
        enc1d1                  ONLINE       0     0     0
        enc2d1                  ONLINE       0     0     0
        enc3d1                  ONLINE       0     0     0
    spares
      draid1-0-0                AVAIL
      draid1-0-1                AVAIL

The number of failure groups is specified indirectly via the new width parameter in the draid vdev configuration descriptor, which is the total number of disks and must be a multiple of the children count in each group. That multiple (width / children) is the number of groups. Doing it this way lets the user see at a glance how many disks the draid has.
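The width/children arithmetic can be written out as follows (hypothetical helper, not the actual zpool parser):

```python
# Hypothetical helper mirroring the arithmetic above: width is the total
# disk count and must be a multiple of children (disks per group).
def fgroup_count(children: int, width: int) -> int:
    if width % children != 0:
        raise ValueError("width must be a multiple of children")
    return width // children

# The draid1:2d:4c:8w:2s example above: 8 disks, 4 children per group.
assert fgroup_count(children=4, width=8) == 2
```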

Spare disks are evenly distributed among the failure groups, so the number of spares should be a multiple of the number of groups, and they are shared by all groups. However, to support domain failure, we cannot have more than nparity - 1 failed disks in any group, no matter whether they are rebuilt to draid spares or not (the blocks of those spares can be mapped to the disks from the failed domain (enclosure), and we cannot tolerate more than nparity failures in any failure group).

How Has This Been Tested?

The following automation tests were added:

  1. zpool_create_draid_005_pos.ksh test, which covers the creation of pools with different stripe configurations and a random number of groups (from 2 to 7) in the big-width row.
  2. redundancy_draid_width.ksh test, which checks that data is intact when any n disks are failed/corrupted at the same offset in each of n failure groups (n being random from 2 to 4).
  3. redundancy_draid_spare4.ksh test, based on redundancy_draid_spare1.ksh, which uses failure groups (random from 2 to 4) and fails disks in the groups at the same random offset, making sure resilvering is successful and the data is intact. It also checks that no more than nparity failures are allowed in each group.
  4. suspend_draid_fgroups.ksh test, which checks that the pool enters the suspended state if more than 3 devices fail in its draid3 vdev with failure groups (random from 2 to 6).
$ ./scripts/zfs-tests.sh -T redundancy -I 100
...
Test: functional/redundancy/redundancy_draid_spare1 (run as root) [00:11] [FAIL]
...
Results Summary
PASS	 1899
FAIL	   1

Running Time:	14:43:59
Percent passed:	99.9%

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Closes #11969.

@andriytk andriytk force-pushed the fdomains branch 3 times, most recently from 9830c7d to 645755d Compare January 23, 2026 09:46
@andriytk andriytk changed the title Introduce failure groups/domains to draid vdev Introduce failure groups/domains to draid Jan 23, 2026
@andriytk andriytk changed the title Introduce failure groups/domains to draid Introduce failure groups/domains to dRAID Jan 23, 2026
@andriytk andriytk force-pushed the fdomains branch 4 times, most recently from 4e7ecb0 to dd73eff Compare January 25, 2026 00:26
@andriytk

This comment was marked as outdated.

@amotin
Member

amotin commented Jan 28, 2026

While this change should allow failures of any one (-two-three) enclosure with all its disks, it will do nothing for a failure of two (-three-four) disks from different enclosures. So having two groups per row we still reduce reliability to half of RAIDZ's, while hopefully doubling the spare activation rate. Having too many groups per row still sounds questionable though; I'm not sure I'd go beyond 3-4. So in real life, with realistic enclosure sizes, many dRAID vdevs per pool might still be needed. And I'd say at least 6 enclosures, but better 10-12. To me a realistic configuration would look like 11x 64-disk enclosures and a pool with 16x 44-disk draid2 vdevs. Not many people have setups like this, but those that do may indeed benefit. But we may need to better document all these intricacies.

I only briefly looked at the code, but what I haven't noticed there is a new pool feature. As I understand it, this is an incompatible change, since we don't store the map on disks.

@amotin amotin added the Status: Code Review Needed Ready for review and testing label Jan 28, 2026
@amotin
Member

amotin commented Jan 28, 2026

Thinking again, I think I missed one more positive relation here. When the number of groups (disks per enclosure/slice) goes to and beyond the number of enclosures/slices, there will be a chance that two or more failed drives end up in the same enclosure, counting only as one. But to reach that point with a reasonably wide group, we'd need to already accept a pretty bad probability of multiple failures. It would be interesting to see that in numbers instead of guesses.

@andriytk
Contributor Author

andriytk commented Jan 28, 2026

While this change should allow failures of any one (-two-three) enclosure with all its disks, it will do nothing for a failure of two (-three-four) disks from different enclosures.

@amotin, I'm not sure I understand. Why?

Failures of disks from different enclosures are handled the same way as they are handled today without this change; there is no regression in this functionality. I.e. disks can fail, but the number of them that can fail at the same time is limited by the draid1/2/3 configuration (up to three).

The whole purpose and idea of this change is to keep the features we normally have with dRAID on many disks (i.e. fast sequential resilvering after any disk failure and shared spares) while at the same time being able to tolerate failures of enclosures or even servers/racks (if disks are shared via iSCSI).

Without this patch, the only way to configure this is to have a separate draid vdev per failure group, but this won't give you fast resilvering. And the faster we resilver and restore data redundancy, the lower the chance of hitting another disk failure during the resilver, which is better for overall data durability.

Moreover, spare disks won't be shared among draid vdevs. So, for example, if you have only 1s in such a vdev and it is already used, the next disk failure in this draid vdev won't be resilvered. With this patch, all spares are shared between the groups, so even if all disks from some group fail, they can all be resilvered using the spares from other groups, as long as spares are available.

By the way, the change is fully compatible. The old pools created with the previous ZFS versions will still work with this change.

To me a realistic configuration would look like 11x 64-disk enclosures...

Yeap, and you can configure it like this, for example:

    NAME                        STATE     READ WRITE CKSUM
    pool1                       ONLINE       0     0     0
      draid2:8d:11c:253w:23s-0  ONLINE       0     0     0
        enc00d00                ONLINE       0     0     0
        enc01d00                ONLINE       0     0     0
        enc02d00                ONLINE       0     0     0
        ...
        enc10d00                ONLINE       0     0     0
        enc00d01                ONLINE       0     0     0
        enc01d01                ONLINE       0     0     0
        enc02d01                ONLINE       0     0     0
        ...
        enc10d21                ONLINE       0     0     0
        enc00d22                ONLINE       0     0     0
        enc01d22                ONLINE       0     0     0
        enc02d22                ONLINE       0     0     0
        ...
        enc10d22                ONLINE       0     0     0
      draid2:8d:11c:253w:23s-1  ONLINE       0     0     0
        enc00d23                ONLINE       0     0     0
        enc01d23                ONLINE       0     0     0
        enc02d23                ONLINE       0     0     0
        ...
        enc10d45                ONLINE       0     0     0
      draid2:8d:11c:198w:18s-2  ONLINE       0     0     0
        enc00d46                ONLINE       0     0     0
        ...
        enc10d63                ONLINE       0     0     0
    spares
      draid2-0-0                AVAIL
      draid2-0-1                AVAIL
      ...
      draid2-0-22               AVAIL
      draid2-1-0                AVAIL
      draid2-1-1                AVAIL
      ...
      draid2-2-17               AVAIL

Yes, the number of spare disks in such configurations is big, but there is no way to avoid this - you must put spare disks in your failure groups, at least one per group, so that when a whole enclosure fails each group has a spare to resilver to. The only way to alleviate it is to have bigger failure groups (that is, more enclosures). For example, if there were 16 enclosures, we could configure it as three draid2:8d:16c:192w:12s vdevs, the last one being draid2:8d:16c:128w:8s, and having 44 groups and 44 spares instead of 64.
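The spare sizing in these layouts follows from the same width/children arithmetic (hypothetical helper; only the 11-enclosure layout shown above is checked here):

```python
# Hypothetical helper: one shared spare per failure group is the minimum,
# so nspares per vdev is (at least) width / children.
def fgroups(children: int, width: int) -> int:
    assert width % children == 0
    return width // children

# The 11x 64-disk enclosure layout above: 253w + 253w + 198w = 704 disks.
vdevs = [(11, 253), (11, 253), (11, 198)]
assert [fgroups(c, w) for c, w in vdevs] == [23, 23, 18]
assert sum(fgroups(c, w) for c, w in vdevs) == 64  # 64 groups -> 64 spares
```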

@andriytk
Contributor Author

andriytk commented Jan 28, 2026

On second thought, maybe we can do better and allow the user to specify fewer spares than there are failure groups. Maybe it's not worth repairing all disks when a whole enclosure fails, after all, since it will generate a lot of IOPS and consume a lot of resources, essentially dragging the system down, and the enclosure might be replaced faster than such a resilver would even complete.

I will think about how it can be implemented...

Thanks for looking at it, @amotin!

@amotin
Member

amotin commented Jan 29, 2026

there is no regression in this functionality

Right. I didn't say it is worse, just that it is not better. And the current state I would not call particularly great on wider vdevs.

Yeap, and you can configure it like this, for example:

Sure. And a failure of 3 disks out of 253 will be game over, unless you are lucky enough for some of them to happen in the same enclosure.

@andriytk
Contributor Author

andriytk commented Jan 29, 2026

In case of draid3, it won't be game over even if they fail at the same time. Moreover, as I already mentioned, sequential resilvering of one disk on a 253-disk draid will be 10 times faster than on a 25-disk draid, which decreases the chance of another disk failure while resilvering is in progress, so the risk of losing more disks at the same time is lower. That is the point of dRAID, isn't it?

And again, if the disks are not failing at the same time (which is normally the case), you can lose as many disks as you have spares in your draid vdev. You can even lose all the disks in one failure group, if there are enough spares from other groups to resilver them to.

Hope it helps.

@ZNikke

ZNikke commented Jan 29, 2026

Hi @andriytk, I'm not entirely sure that this works as intended, or I might just be missing some draid quirk...

I cloned https://github.com/andriytk/zfs/tree/fdomains and built successfully.

However, doing some basic IO tests and filling the pool with data (4G isos) it starts out all right writing to all devices, but after a while it degrades into only writing to the first device.

In the beginning:

                           capacity     operations     bandwidth
pool                     alloc   free   read  write   read  write
-----------------------  -----  -----  -----  -----  -----  -----
dpool                    31.6G  54.5T      0  3.82K      0  2.19G
  draid2:4d:6c:0s:30w-0  31.6G  54.5T      0  3.82K      0  2.19G
    enc0d2                   -      -      0    129      0  74.1M
    enc1d2                   -      -      0    123      0  75.7M
    enc2d2                   -      -      0    134      0  75.7M
    enc0d3                   -      -      0    138      0  78.1M
    enc1d3                   -      -      0    132      0  71.8M
    enc2d3                   -      -      0    111      0  75.8M
...
    enc2d11                  -      -      0    106      0  71.8M

After a short while:

                           capacity     operations     bandwidth
pool                     alloc   free   read  write   read  write
-----------------------  -----  -----  -----  -----  -----  -----
dpool                     113G  54.5T      0  2.46K      0  86.1M
  draid2:4d:6c:0s:30w-0   113G  54.5T      0  2.46K      0  86.1M
    enc0d2                   -      -      0  2.46K      0  86.1M
    enc1d2                   -      -      0      0      0      0
    enc2d2                   -      -      0      0      0      0
...
    enc2d11                  -      -      0      0      0      0

All spinning disks, no separate ZIL or somesuch. But if it was a ZIL issue I'd expect some IO to other devices at some point after this, but no.

Also, stopping the test and letting the system go idle, and then restarting the test in another directory immediately starts writing to only the first device, so I don't think it's ZIL related...

So, real bug or am I doing something wrong?

@amotin
Member

amotin commented Jan 29, 2026

By the way, the change is fully compatible. The old pools created with the previous ZFS versions will still work with this change.

Right. But backwards it is not -- previous ZFS version won't be able to import the new pools correctly. That is why a new pool feature is required.

In case of draid3, it won't be game over even if they fail at the same time.

Right. Except a raidz3:8d:1s vdev would already have 50% raw space efficiency and significant write overhead, and I doubt anybody will really have 20 enclosures to allow raidz3:16d:1s, even if it makes sense for the payload. Never mind. My criticism here was not so much pointed at this change as at dRAID concepts in general. This change does improve the situation a bit, and I appreciate it.

@andriytk
Contributor Author

andriytk commented Jan 29, 2026

@ZNikke, thanks for giving it a try. It's interesting; I didn't see this problem on my setup, and a quick attempt with the same configuration on my laptop didn't reproduce it either, but I will try again. I can't think of anything in the patch that could cause such behaviour, tbh...

Which commit did you pick up to build?

@andriytk
Contributor Author

But backwards it is not -- previous ZFS version won't be able to import the new pools correctly. That is why a new pool feature is required.

@amotin, I didn't know it's required that previous ZFS versions must be able to import the pools created in the new ZFS version, but this patch does not break it. I just tested it: 1) created pool without failure domains in the new version with the patch; 2) imported it successfully using the old version without the patch.

@amotin
Member

amotin commented Jan 30, 2026

I didn't know it's required that previous ZFS versions must be able to import the pools created in the new ZFS version, but this patch does not break it.

It is not required between major versions. But it is required to be controllable, i.e. no new features should be used until explicitly enabled, and the presence of any new/unknown active feature should reliably prevent import, or limit it to read-only.

I just tested it: 1) created pool without failure domains in the new version with the patch; 2) imported it successfully using the old version without the patch.

Would you care to explain how, if there is no code to handle the new parameters, permutations or vdev status?

@gmelikov
Member

I just tested it: 1) created pool without failure domains in the new version with the patch; 2) imported it successfully using the old version without the patch.

I think what was meant is to test: create a pool with failure domains in the new version, then try to import it in the old version. Old code should gracefully deny import of any unsupported pools.

@ZNikke

ZNikke commented Jan 30, 2026

Which commit did you pick up to build?

@andriytk commit e085dc06986a15755e59f6e1d255616440897f25 (HEAD -> fdomains, origin/fdomains)

It's not the latest now, I see, so I'll refresh and rebuild, but the changes seem very unrelated.

Something related to sizes? My test setup has 2T drives (actually a mix of 2T and 8T, but it seems to pick the smallest size with -f), so my pool size is 54.5T.

Oh, and the drive enc0d2 is a 2T drive so it's not that it wants to "fill up" an 8T drive...

Is there some layout dump or somesuch that you might find helpful? Or should we look into arranging access for you to our test setup?

@andriytk
Contributor Author

andriytk commented Jan 30, 2026

Would you care to explain how, if there is no code to handle the new parameters, permutations or vdev status?

@amotin, when failure domains are not configured in a draid vdev, the code works exactly the same as without the patch; no new permutations or parameters (only one, actually - nslice) are used by the new code in a way that would make it incompatible with the old one. (There is no new vdev status.)

I think what was meant is to test: create a pool with failure domains in the new version, then try to import it in the old version. Old code should gracefully deny import of any unsupported pools.

@gmelikov, thank you, it makes sense. Any hints on how to implement it? What needs to be updated in the new code so that old code would recognise it and gracefully deny import?

@gmelikov
Member

gmelikov commented Jan 30, 2026

We have feature flags for that: https://openzfs.github.io/openzfs-docs/Basic%20Concepts/Feature%20Flags.html . There's a read-only-compatible kind if old code can read the pool (it forces read-only import) and the usual kind (which prevents import entirely on old code).

I think this patch may be a good starting point f70c850 (sorry, I personally didn't implement feature flags so I can give only basic examples)

@ZNikke

ZNikke commented Jan 30, 2026

@andriytk Now rebuilt with commit b084a5a44cbe13f54ea10a0ff89eef5337ba01f7 (HEAD -> fdomains, origin/fdomains) - same behavior.

I've built native deb packages as per https://openzfs.github.io/openzfs-docs/Developer%20Resources/Custom%20Packages.html#dkms-1 on the same Ubuntu 24.04/noble machine.

Rebuilt the pool and ran a new test with the same result, a zpool iostat -vy -T u dpool 1 starts off as expected, and then suddenly switches behavior from one second to the next.

And letting it calm down and restarting the test in another directory immediately resumes writing to only one drive.

zpool scrub gets REALLY upset about the state of the pool, this is a zpool status -v after it's done:

  pool: dpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 101M in 00:05:44 with 112171 errors on Fri Jan 30 11:22:06 2026
config: 

        NAME                     STATE     READ WRITE CKSUM
        dpool                    ONLINE       0     0     0
          draid2:4d:6c:30w:0s-0  ONLINE       0     0     0
            enc0d2               ONLINE       0     0 1.29M
            enc1d2               ONLINE       0     0    14
            enc2d2               ONLINE       0     0    14
            enc0d3               ONLINE       0     0    14
            enc1d3               ONLINE       0     0    14
            enc2d3               ONLINE       0     0    14
            enc0d4               ONLINE       0     0     0
            enc1d4               ONLINE       0     0     0
            enc2d4               ONLINE       0     0     0
            enc0d5               ONLINE       0     0     0
            enc1d5               ONLINE       0     0     0
            enc2d5               ONLINE       0     0     0
            enc0d6               ONLINE       0     0     0
            enc1d6               ONLINE       0     0     0
            enc2d6               ONLINE       0     0     0
            enc0d7               ONLINE       0     0     0
            enc1d7               ONLINE       0     0     0
            enc2d7               ONLINE       0     0     0
            enc0d8               ONLINE       0     0     0
            enc1d8               ONLINE       0     0     0
            enc2d8               ONLINE       0     0     0
            enc0d9               ONLINE       0     0     0
            enc1d9               ONLINE       0     0     0
            enc2d9               ONLINE       0     0     0
            enc0d10              ONLINE       0     0     0
            enc1d10              ONLINE       0     0     0
            enc2d10              ONLINE       0     0     0
            enc0d11              ONLINE       0     0     0
            enc1d11              ONLINE       0     0     0   
            enc2d11              ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /dpool/test1/21
        /dpool/test2/MD5SUMS
        /dpool/test1/22
        /dpool/test1/24
        /dpool/test1/23
        /dpool/test2/revelations-1.iso
        /dpool/test2/0

The broken files are the lately written ones from test1 and all from test2, ie when all writes ended up on the same drive.

@andriytk
Contributor Author

andriytk commented Feb 11, 2026

With this patch, all spares are shared between the groups, so even if all disks from some group fail, they can all be resilvered using the spares from other groups, as long as spares are available.

Having said that, enclosure failures won't work if some spare is backing/replacing a disk which is not from its native failure group, and I've just added checks/restrictions for that. This is because, if the disk backing that spare happens to be part of the enclosure failure, it can introduce more failures into that group than it can tolerate.

UPD: actually, on second thought, the situation is even worse than that. It doesn't matter whether the spare is native or not, because the disks the spares are mapped to are distributed among the failure groups anyway, and the backing disk of a spare can be in any failure group. What does matter is how much parity is configured and how many failed disks we already have in the failure group at the time of the enclosure failure.

For example, in draid1, if we already have some failed disk, we cannot tolerate the failure of an enclosure which doesn't contain that disk, even if that disk is backed by some spare. That's because any disk from the failed enclosure can be mapped to the spare backing that failed disk, which would make two failed disks in the failure group, and draid1 cannot tolerate that.

In other words, if we want to support enclosure failure, with draid1 we cannot have any failed disk that doesn't belong to the failed enclosure. With draid2, we can have no more than one failed disk in each failure group, and with draid3 no more than two. Again, it doesn't matter whether those failed disks are backed by spares or not, because any, or even all, of those spares can be mapped to disks belonging to the failed enclosure.

I will update the code and add more restrictions to the allowed failures.
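The restriction above could be sketched as follows (hypothetical helper, not the patch's actual check):

```python
# Hypothetical helper for the rule above: an enclosure failure can add up
# to one more failed disk to every failure group, so draid{p} can only
# carry p - 1 prior failures per group (spare-backed or not).
def enclosure_failure_survivable(nparity: int, failed_per_group: list[int]) -> bool:
    return all(failed + 1 <= nparity for failed in failed_per_group)

assert not enclosure_failure_survivable(1, [1, 0])  # draid1: no prior failures
assert enclosure_failure_survivable(2, [1, 1, 0])   # draid2: one per group is fine
assert enclosure_failure_survivable(3, [2, 1, 0])   # draid3: up to two per group
```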

@andriytk
Contributor Author

Updated the code; the redundancy_draid_spare4 test has now passed 100% over 100 iterations:

$ ./scripts/zfs-tests.sh -t redundancy_draid_spare4 -I 100
...
Results Summary
PASS	 300

Running Time:	01:01:14
Percent passed:	100.0%

@behlendorf behlendorf self-requested a review February 12, 2026 19:38
Contributor

@behlendorf behlendorf left a comment


Very clever! This is great work; now that you point it out, it's something I wish I'd included in the original dRAID implementation. It is absolutely functionality we'd make use of on our systems. I've done some light testing with this change and so far it's held up well. I'll kick the tires a bit more with some more interesting large layouts.

Contributor

@akashb-22 akashb-22 left a comment


LGTM. Tested on a smaller setup without issues. Will verify with a larger configuration of drives and enclosures.

@andriytk andriytk changed the title Introduce failure groups/domains to dRAID Introduce failure domains to dRAID Feb 16, 2026
@akashb-22
Contributor

An observation I've made: when I manually corrupted a vdev member (file5) in a draid, another member from a different fault group reported checksum errors as well. Additionally, the events show non-empty ranges fields. Not sure if this is expected behavior or a concern.

Feb 16 2026 03:29:40.495069098 ereport.fs.zfs.checksum
        class = "ereport.fs.zfs.checksum"
        ena = 0xe16a61314410a801
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0xa002480625c78e12
                vdev = 0xa7a24523262f86f3
        (end detector)
        pool = "pool-oss0"
        pool_guid = 0xa002480625c78e12
        pool_state = 0x0
        pool_context = 0x0
        pool_failmode = "panic"
        vdev_guid = 0xa7a24523262f86f3
        vdev_type = "file"
        vdev_path = "/root/akash/lustre-zfs/tests/files/fdomain/file15"
        vdev_ashift = 0x9
        vdev_complete_ts = 0x22e16a5b1fbf4
        vdev_delta_ts = 0x23fd39f
        vdev_read_errors = 0x0
        vdev_write_errors = 0x0
        vdev_cksum_errors = 0x4
        vdev_delays = 0x0
        dio_verify_errors = 0x0
        parent_guid = 0xa7c81fbfc4a000e1
        parent_type = "draid"
        vdev_spare_paths = "draid1-0-0" "draid1-0-1"
        vdev_spare_guids = 0x645929e27f4918a2 0x88de4722356cd97b
        zio_err = 0x0
        zio_flags = 0x2000b0 [SCRUB SCAN_THREAD CANFAIL DONT_PROPAGATE]
        zio_stage = 0x400000 [VDEV_IO_DONE]
        zio_pipeline = 0x5e00000 [VDEV_IO_START VDEV_IO_DONE VDEV_IO_ASSESS CHECKSUM_VERIFY DONE]
        zio_delay = 0x0
        zio_timestamp = 0x0
        zio_delta = 0x0
        zio_type = 0x1 [READ]
        zio_priority = 0x4 [SCRUB]
        zio_offset = 0x27f72a00
        zio_size = 0x8d600
        zio_objset = 0xaa
        zio_object = 0x2
        zio_level = 0x0
        zio_blkid = 0x35e
        bad_ranges =
        bad_ranges_min_gap = 0x0
        bad_range_sets =
        bad_range_clears =
        bad_set_bits =
        bad_cleared_bits =
        time = 0x6992f194 0x1d8227aa
        eid = 0x313

Feb 16 2026 03:29:40.659067352 ereport.fs.zfs.checksum
        class = "ereport.fs.zfs.checksum"
        ena = 0xe16af78bbfe09801
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0xa002480625c78e12
                vdev = 0x973aed68a1b22e45
        (end detector)
        pool = "pool-oss0"
        pool_guid = 0xa002480625c78e12
        pool_state = 0x0
        pool_context = 0x0
        pool_failmode = "panic"
        vdev_guid = 0x973aed68a1b22e45
        vdev_type = "file"
        vdev_path = "/root/akash/lustre-zfs/tests/files/fdomain/file5"
        vdev_ashift = 0x9
        vdev_complete_ts = 0x22e16af02faa8
        vdev_delta_ts = 0xa8a2fc
        vdev_read_errors = 0x0
        vdev_write_errors = 0x0
        vdev_cksum_errors = 0x332
        vdev_delays = 0x0
        dio_verify_errors = 0x0
        parent_guid = 0xa7c81fbfc4a000e1
        parent_type = "draid"
        vdev_spare_paths = "draid1-0-0" "draid1-0-1"
        vdev_spare_guids = 0x645929e27f4918a2 0x88de4722356cd97b
        zio_err = 0x0
        zio_flags = 0x2000b0 [SCRUB SCAN_THREAD CANFAIL DONT_PROPAGATE]
        zio_stage = 0x400000 [VDEV_IO_DONE]
        zio_pipeline = 0x5e00000 [VDEV_IO_START VDEV_IO_DONE VDEV_IO_ASSESS CHECKSUM_VERIFY DONE]
        zio_delay = 0x0
        zio_timestamp = 0x0
        zio_delta = 0x0
        zio_type = 0x1 [READ]
        zio_priority = 0x4 [SCRUB]
        zio_offset = 0x30528200
        zio_size = 0x92600
        zio_objset = 0xaa
        zio_object = 0x2
        zio_level = 0x0
        zio_blkid = 0x938
        bad_ranges = 0x0 0x92400
        bad_ranges_min_gap = 0x8
        bad_range_sets = 0x0
        bad_range_clears = 0x249297
        time = 0x6992f194 0x274891d8
        eid = 0x314

@andriytk
Contributor Author

@akashb-22, first of all - thanks for giving it such thorough testing!

The disks are shuffled between failure groups, so I guess it's normal that you see checksum errors in different groups when you inject corruption into only one disk of some group. Because, if I remember correctly, cksum error counters are incremented for all disks in the parity group in which the problem is detected.

Having said that, I'm not sure about the bad ranges though. Don't you normally see them in the same test scenario on draids without the failure domains feature?

@behlendorf
Contributor

@andriytk thanks for the quick iteration on this, it's shaping up nicely. I'll put it through some additional testing over the weekend, but so far it's holding up well. When you get a chance, can you squash the patch stack and update the PR? I don't think there's a need to keep the separate commits any longer.

module/zfs/spa.c Outdated

if (!spa_feature_is_enabled(spa, SPA_FEATURE_DRAID_FAIL_DOMAINS) &&
    draid_nfgroup > 0)
        return (SET_ERROR(ENOTSUP));
Member

You are exiting here in the middle of pool creation with the transaction group open and, I am sure, a dozen other things allocated. I haven't looked at what's wrong with what I proposed before, but this makes me shiver.

Contributor Author

Well spotted! I dropped this check entirely since the feature is always enabled on new pool creation anyway. Thank you!

Member

No. A pool can be created for an older OpenZFS version or with an arbitrary set of features through the compatibility property.

Contributor Author

@andriytk andriytk Feb 21, 2026

Yes, and if you create such pools on a new software version, the feature is enabled automatically. It's absolutely similar to the dRAID feature, there is no difference. If you look at the code where SPA_FEATURE_DRAID is checked, in vdev_alloc(), you can see that it's checked only when spa->spa_load_state != SPA_LOAD_CREATE.
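That gating can be sketched roughly like this (an illustrative Python model with invented names; the real check is C code in spa.c and vdev_alloc()):

```python
# Illustrative sketch, not the actual OpenZFS code: state strings and
# the function name are invented stand-ins for the spa.c logic.
def draid_feature_ok(load_state, fdomains_requested, feature_enabled):
    # During pool creation (SPA_LOAD_CREATE) the feature is enabled
    # automatically, so the check only matters when adding a vdev to
    # an existing, possibly older, pool.
    if load_state == "SPA_LOAD_CREATE":
        return True
    return (not fdomains_requested) or feature_enabled

assert draid_feature_ok("SPA_LOAD_CREATE", True, False)      # create: ok
assert not draid_feature_ok("SPA_LOAD_IMPORT", True, False)  # old pool: rejected
assert draid_feature_ok("SPA_LOAD_IMPORT", True, True)       # feature on: ok
```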

Here's how I tested it last night just to double-check (thanks to your comment):

  1. Create a pool on an old OpenZFS version.
  2. Upgrade OpenZFS version.
  3. Import the pool created on the old version.
  4. Try to add draid with failure domains to the pool - it fails (due to the check in spa_vdev_add()), as expected.
  5. Try to create new pool with draid vdev with failure domains feature - it works, as expected.
  6. Try to add draid with failure domains to the new pool - it works, as expected.

Btw, there was a bug on adding draid with failure domains to the old pool (step 4), which I missed somehow during my previous testing - the error was printed, but the vdev was still added nevertheless. I fixed that in commit 74eaf3d last night. Again, thanks to your comment which prompted me to double-check it!

Contributor Author

@amotin, does it resolve your concern?

Contributor Author

@andriytk andriytk Feb 25, 2026

It seems to work now after commit 8b0d785:

$ sudo zpool create -d -f -m /var/tmp/testdir testpool draid2:5c /var/tmp/basedir.2400/vdev{0..4}
cannot create 'testpool': operation not supported on this type of pool
$ sudo zpool create -d -o 'feature@draid=enabled' -f -m /var/tmp/testdir testpool draid2:5c /var/tmp/basedir.2400/vdev{0..4}
$ sudo zpool create -d -o 'feature@draid=enabled' -f -m /var/tmp/testdir testpool draid2:5c:10w /var/tmp/basedir.2400/vdev{0..9}
cannot create 'testpool': operation not supported on this type of pool
$ sudo zpool create -d -o 'feature@draid_failure_domains=enabled' -f -m /var/tmp/testdir testpool draid2:5c:10w /var/tmp/basedir.2400/vdev{0..9}
$

Contributor Author

@amotin, is it good now? I fixed it for both features: draid and draid with failure domains.

Member

I guess better than nothing, but I am not too deep into that code. I see SPA_FEATURE_ALLOCATION_CLASSES is checked earlier, so I'd try to do something similar. Otherwise I worry that later cleanup may leak something somewhere.

Contributor Author

I can see your point. OK, let me try something different...

Contributor Author

@andriytk andriytk Feb 26, 2026

@amotin, please check commit a113c4d:

$ sudo zpool create -d -f -m /var/tmp/testdir testpool draid2:5c /var/tmp/basedir.2400/vdev{0..4}
cannot create 'testpool': operation not supported on this type of pool
$ sudo zpool create -d -o 'feature@draid=enabled' -f -m /var/tmp/testdir testpool draid2:5c /var/tmp/basedir.2400/vdev{0..4}
$ sudo zpool destroy testpool
$ sudo zpool create -d -o 'feature@draid=enabled' -f -m /var/tmp/testdir testpool draid2:5c:10w /var/tmp/basedir.2400/vdev{0..9}
cannot create 'testpool': operation not supported on this type of pool
$ sudo zpool create -d -o 'feature@draid_failure_domains=enabled' -f -m /var/tmp/testdir testpool draid2:5c:10w /var/tmp/basedir.2400/vdev{0..9}
cannot create 'testpool': operation not supported on this type of pool
$ sudo zpool create -d -o 'feature@draid=enabled' -o 'feature@draid_failure_domains=enabled' -f -m /var/tmp/testdir testpool draid2:5c:10w /var/tmp/basedir.2400/vdev{0..9}
$

Thanks for the hint!

@andriytk
Contributor Author

can you squash the patch stack and update the PR

@behlendorf, done. Thank you!

@andriytk
Contributor Author

andriytk commented Feb 21, 2026

@behlendorf, I was thinking: when a failure domain fails, it doesn't seem to make much sense for zed to start a resilver automatically, since it would take a lot of computing and I/O bandwidth resources only to be wasted when the failed domain component is replaced. What do you think? Should we disable it for domain failures?

I'd suggest adding simple logic to zed's retire agent when it handles a device failure event: if in each failure group the device at the same i-th index is faulted, which means we are facing an i-th domain failure, and the device from the event is one of those faulted - don't attach a hot spare to this device and don't start resilvering.
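The heuristic can be sketched as follows (an illustrative Python model; zed itself is C, and these names and state strings are invented for the sketch):

```python
# Illustrative sketch of the suggested retire-agent heuristic; not
# zed's actual API or data structures.
def is_domain_failure(group_states, dev_group, dev_index):
    # A domain failure: the device at the same i-th index is faulted
    # in every failure group, and the event's device is one of them.
    faulted_everywhere = all(states[dev_index] == "FAULTED"
                             for states in group_states)
    return (faulted_everywhere and
            group_states[dev_group][dev_index] == "FAULTED")

# Two groups of four disks; index 2 is faulted in both groups, so the
# retire agent would skip hot-sparing and resilvering for this event.
states = [["ONLINE", "ONLINE", "FAULTED", "ONLINE"],
          ["ONLINE", "ONLINE", "FAULTED", "ONLINE"]]
assert is_domain_failure(states, dev_group=0, dev_index=2)
```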

I've added this logic in recent commits ddaf183...a509459.

@behlendorf
Contributor

Should we disable it [hot sparing] for domain failures?

Yeah, as long as we can reliably detect domain failures this seems like a reasonable compromise for the default behavior. This is another case where I wish we had better existing mechanisms to fine-tune the ZED behavior. For some environments I can alternatively imagine it being preferable to rebuild as quickly as possible to restore redundancy.

@andriytk andriytk force-pushed the fdomains branch 4 times, most recently from fbb362d to d32bc5e Compare February 26, 2026 18:13

@andriytk
Contributor Author

Rebased on the latest master.

Currently, the only way to tolerate the failure of the whole
enclosure is to configure several draid vdevs in the pool, each
vdev having disks from different enclosures. But this essentially
degrades draid to raidz and defeats the purpose of having fast
sequential resilvering on wide pools with draid.

This patch allows configuring several children groups in the same
row in one draid vdev. In each such group, let's call it a failure
group, the user can configure disks belonging to different
enclosures - failure domains. For example, in case of 10
enclosures with 10 disks each, the user can put 1st disk from each
enclosure into 1st group, 2nd disk from each enclosure into 2nd
group, and so on. If one enclosure fails, only one disk from each
group would fail, which won't affect draid operation, and each
group would have enough redundancy to recover the stored data. Of
course, in case of draid2 - two enclosures can fail at a time, in
case of draid3 - three enclosures (provided there are no other
disk failures in each group).

In order to preserve fast sequential resilvering in case of a disk
failure, the groups must share all disks between themselves, and
this is achieved by shuffling the disks between the groups. But
only i-th disks in each group are shuffled between themselves,
i.e. the disks from the same enclosures, after that they are
shuffled within each group, like it is done today in an ordinary
draid. Thus, no more than one disk from any enclosure can appear
in any failure group as a result of this shuffling.
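The two-stage shuffle can be sketched like this (an illustrative Python model of the constraint; the actual dRAID permutation code is C and works differently in detail):

```python
import random

def shuffle_groups(groups, seed):
    """Sketch of the two-stage constrained shuffle described above.

    groups[g][i] is the disk at position i of failure group g;
    initially position i holds a disk from enclosure i.
    """
    rng = random.Random(seed)
    ngroups = len(groups)
    ndisks = len(groups[0])
    # Stage 1: shuffle the i-th disks *between* groups. These all come
    # from the same enclosure, so each group still ends up with exactly
    # one disk per enclosure.
    for i in range(ndisks):
        col = [groups[g][i] for g in range(ngroups)]
        rng.shuffle(col)
        for g in range(ngroups):
            groups[g][i] = col[g]
    # Stage 2: shuffle disks *within* each group, as ordinary dRAID does.
    for g in range(ngroups):
        rng.shuffle(groups[g])
    return groups

# 2 groups x 4 enclosures; a disk is a tuple (enclosure, slot).
groups = [[(enc, g) for enc in range(4)] for g in range(2)]
shuffled = shuffle_groups(groups, seed=42)
# Invariant: every group still holds exactly one disk from each enclosure.
for g in shuffled:
    assert sorted(enc for enc, _ in g) == [0, 1, 2, 3]
```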

For example, here's what the pool status output looks like in
case of two `draid1:2d:4c:1s` groups:

    NAME                        STATE     READ WRITE CKSUM
    pool1                       ONLINE       0     0     0
      draid1:2d:4c:1s:8w-0      ONLINE       0     0     0
        enc0d0                  ONLINE       0     0     0
        enc1d0                  ONLINE       0     0     0
        enc2d0                  ONLINE       0     0     0
        enc3d0                  ONLINE       0     0     0
        enc0d1                  ONLINE       0     0     0
        enc1d1                  ONLINE       0     0     0
        enc2d1                  ONLINE       0     0     0
        enc3d1                  ONLINE       0     0     0
    spares
      draid1-0-0                AVAIL
      draid1-0-1                AVAIL

The number of failure groups is specified indirectly via the new
width parameter in the draid vdev configuration descriptor, which
is the total number of disks and must be a multiple of the
children count of each group. This multiple is the number of
groups (width / children). Doing it this way lets the user see at
a glance how many disks the draid has.
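The arithmetic is simple enough to state as code (a hypothetical helper mirroring the rule above, not the actual config parser):

```python
def draid_ngroups(width, children):
    # width is the total disk count; it must be a multiple of the
    # per-group children count (hypothetical helper for illustration).
    if width % children != 0:
        raise ValueError("width must be a multiple of children")
    return width // children

# draid1:2d:4c:1s:8w from the status example above: 8 disks total,
# 4 children per group, hence 2 failure groups.
assert draid_ngroups(8, 4) == 2
```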

Spare disks are evenly distributed among failure groups, so the
number of spares should be a multiple of the number of groups,
and they are shared by all groups. However, to support domain
failure, we cannot have more than nparity - 1 failed disks in any
group, regardless of whether they are rebuilt to draid spares or
not (the blocks of those spares can be mapped to disks from the
failed domain (enclosure), and we cannot tolerate more than
nparity failures in any failure group).
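The constraint can be sketched as a predicate (illustrative Python, not the in-kernel check):

```python
def domain_failure_tolerated(nparity, faulted_per_group):
    # To survive a further whole-domain failure, no failure group may
    # already carry more than nparity - 1 failed disks (sketch of the
    # rule stated above).
    return all(f <= nparity - 1 for f in faulted_per_group)

assert domain_failure_tolerated(2, [1, 0, 1])      # draid2: still safe
assert not domain_failure_tolerated(2, [2, 0, 0])  # domain loss would exceed parity
```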

The retire agent in zed is updated to not start resilvering when
a domain failure happens. Otherwise, it might take a lot of
computing and I/O bandwidth resources, only to be wasted when the
failed domain component is replaced.

Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Closes openzfs#11969.

Labels

Status: Code Review Needed (ready for review and testing)


Development

Successfully merging this pull request may close these issues.

Allow specifying disk fault domains/groups to dRAID setup enabling enclosure loss resistant setup

6 participants