Commit a20ca0f

Add support for raid10
This removes the wait block for raid resync for two reasons:

1) raid0 does not have redundancy and therefore has no initial resync[1]
2) with raid10, the initial resync for 4x 1.9TB disks takes from tens of minutes to multiple hours, depending on the sysctl params `dev.raid.speed_limit_min` and `dev.raid.speed_limit_max` and the speed of the disks. An initial resync for raid10 is not strictly needed[1]

Filesystem creation: by default `mkfs.xfs` attempts to TRIM the drive. This can also take tens of minutes or hours, depending on the size of the drives. TRIM can be skipped, as instances are delivered with disks fully trimmed[2].

[1] https://raid.wiki.kernel.org/index.php/Initial_Array_Creation
[2] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html#InstanceStoreTrimSupport
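To see why the resync time varies so widely, a rough estimate can help. The sketch below is not part of this commit: `estimate_resync_seconds` is a hypothetical helper, and it assumes the resync covers about one disk's worth of data at a constant per-device rate in KiB/s (the unit used by the `dev.raid.speed_limit_*` sysctls). The two rates in the example are illustrative assumptions, not values measured on any instance.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the commit): estimates how long a resync
# would take, in whole seconds, at a constant per-device rate.
#   $1: amount of data to resync per device, in KiB
#   $2: resync rate in KiB/s (e.g. a value read from dev.raid.speed_limit_min)
estimate_resync_seconds() {
  local size_kib="$1"
  local rate_kib_per_sec="$2"
  if (( rate_kib_per_sec <= 0 )); then
    echo "rate must be positive" >&2
    return 1
  fi
  echo $(( size_kib / rate_kib_per_sec ))
}

# Example: a 1.9 TB (1.9 * 10^12 bytes) disk expressed in KiB.
disk_kib=$(( 1900000000000 / 1024 ))

# The rates below are illustrative assumptions; check the real values with
#   sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
echo "at   1000 KiB/s: $(estimate_resync_seconds "${disk_kib}" 1000) s"
echo "at 200000 KiB/s: $(estimate_resync_seconds "${disk_kib}" 200000) s"
```

Under these assumed rates the estimate ranges from a few hours to roughly three weeks, which is why blocking node bootstrap on the resync is impractical.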
1 parent 813af95 commit a20ca0f

7 files changed, +41 -16 lines changed

doc/usage/al2.md

+6
@@ -171,6 +171,12 @@ A RAID-0 array is setup that includes all ephemeral NVMe instance storage disks.
 
 Another way of utilizing the ephemeral disks is to format and mount the individual disks. Mounting individual disks allows the [local-static-provisioner](https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner) DaemonSet to create Persistent Volume Claims that pods can utilize.
 
+### Experimental: RAID-10 Kubelet and Containerd (raid10)
+
+Similar to the RAID-0 array, a RAID-10 array can be used on instance types with four or more ephemeral NVMe instance storage disks. RAID-10 tolerates the failure of at most 2 disks. However, individual ephemeral disks cannot be replaced, so the purpose of the redundancy is to make graceful decommissioning of a node possible.
+
+RAID-10 can be enabled by passing the `--local-disks raid10` flag to the bootstrap script.
+
 ---
 
 ## Version-locked packages
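
As a usage illustration of the `--local-disks raid10` flag documented in the hunk above, the sketch below picks a `--local-disks` value from the number of instance store disks. It is not part of this commit: the helper `local_disks_args` and its fallback policy are hypothetical, and the bootstrap script path and cluster name in the trailing comment are assumptions.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints the --local-disks arguments to pass to the
# bootstrap script for a given number of instance store disks.
# Prints nothing when there are no disks, so the flag is simply omitted.
local_disks_args() {
  local disk_count="$1"
  if (( disk_count >= 4 )); then
    # RAID-10 needs at least four disks.
    echo "--local-disks raid10"
  elif (( disk_count >= 1 )); then
    # Assumed fallback policy: use RAID-0 when there are too few disks.
    echo "--local-disks raid0"
  fi
}

# Assumed invocation (path and cluster name are placeholders):
#   /etc/eks/bootstrap.sh my-cluster $(local_disks_args "${disk_count}")
local_disks_args 4
```

Falling back to RAID-0 is only one possible policy. A stricter setup could refuse to start the node when RAID-10 was requested but fewer than four disks exist.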

nodeadm/api/v1alpha1/nodeconfig_types.go

+4-1
@@ -94,13 +94,16 @@ type LocalStorageOptions struct {
 }
 
 // LocalStorageStrategy specifies how to handle an instance's local storage devices.
-// +kubebuilder:validation:Enum={RAID0, Mount}
+// +kubebuilder:validation:Enum={RAID0, RAID10, Mount}
 type LocalStorageStrategy string
 
 const (
 	// LocalStorageRAID0 will create a single raid0 volume from any local disks
 	LocalStorageRAID0 LocalStorageStrategy = "RAID0"
 
+	// LocalStorageRAID10 will create a single raid10 volume from any local disks. Minimum of 4.
+	LocalStorageRAID10 LocalStorageStrategy = "RAID10"
+
 	// LocalStorageMount will mount each local disk individually
 	LocalStorageMount LocalStorageStrategy = "Mount"
 )

nodeadm/crds/node.eks.aws_nodeconfigs.yaml

+1
@@ -107,6 +107,7 @@ spec:
 an instance's local storage devices.
 enum:
 - RAID0
+- RAID10
 - Mount
 type: string
 type: object

nodeadm/doc/api.md

+1-1
@@ -92,7 +92,7 @@ _Appears in:_
 - [LocalStorageOptions](#localstorageoptions)
 
 .Validation:
-- Enum: [RAID0 Mount]
+- Enum: [RAID0 RAID10 Mount]
 
 #### NodeConfig

nodeadm/internal/api/types.go

+1
@@ -101,6 +101,7 @@ type LocalStorageStrategy string
 
 const (
 	LocalStorageRAID0 LocalStorageStrategy = "RAID0"
+	LocalStorageRAID10 LocalStorageStrategy = "RAID10"
 	LocalStorageMount LocalStorageStrategy = "Mount"
 )

templates/al2/runtime/bootstrap.sh

+1-1
@@ -32,7 +32,7 @@ function print_help {
   echo "--enable-local-outpost Enable support for worker nodes to communicate with the local control plane when running on a disconnected Outpost. (true or false)"
   echo "--ip-family Specify ip family of the cluster"
   echo "--kubelet-extra-args Extra arguments to add to the kubelet. Useful for adding labels or taints."
-  echo "--local-disks Setup instance storage NVMe disks in raid0 or mount the individual disks for use by pods [mount | raid0]"
+  echo "--local-disks Setup instance storage NVMe disks in raid0 or mount the individual disks for use by pods <mount | raid0 | raid10>"
   echo "--mount-bpf-fs Mount a bpffs at /sys/fs/bpf (default: true)"
   echo "--pause-container-account The AWS account (number) to pull the pause container from"
   echo "--pause-container-version The tag of the pause container"

templates/shared/runtime/bin/setup-local-disks

+27-13
@@ -15,7 +15,7 @@ err_report() {
 trap 'err_report $LINENO' ERR
 
 print_help() {
-  echo "usage: $0 <raid0 | mount | none>"
+  echo "usage: $0 <raid0 | raid10 | mount | none>"
   echo "Sets up Amazon EC2 Instance Store NVMe disks"
   echo ""
   echo "-d, --dir directory to mount the filesystem(s) (default: /mnt/k8s-disks/)"
@@ -26,11 +26,18 @@ print_help() {
   echo "-h, --help print this help"
 }
 
-# Sets up a RAID-0 of NVMe instance storage disks, moves
-# the contents of /var/lib/kubelet and /var/lib/containerd
+# Sets up a RAID-0 or RAID-10 of NVMe instance storage disks,
+# moves the contents of /var/lib/kubelet and /var/lib/containerd
 # to the new mounted RAID, and bind mounts the kubelet and
 # containerd state directories.
-maybe_raid0() {
+#
+# Do not wait for initial resync: raid0 has no redundancy so there
+# is no initial resync. Raid10 does not strictly need a resync,
+# while the time taken for a raid10 of four 1.9TB disks would be in
+# the range of 20 minutes to 20 days, depending on the
+# dev.raid.speed_limit_min and dev.raid.speed_limit_max sysctl parameters.
+maybe_raid() {
+  local raid_level="$1"
   local md_name="kubernetes"
   local md_device="/dev/md/${md_name}"
   local md_config="/.aws/mdadm.conf"
@@ -40,14 +47,10 @@ maybe_raid0() {
   if [[ ! -s "${md_config}" ]]; then
     mdadm --create --force --verbose \
       "${md_device}" \
-      --level=0 \
+      --level="${raid_level}" \
       --name="${md_name}" \
       --raid-devices="${#EPHEMERAL_DISKS[@]}" \
       "${EPHEMERAL_DISKS[@]}"
-    while [ -n "$(mdadm --detail "${md_device}" | grep -ioE 'State :.*resyncing')" ]; do
-      echo "Raid is resyncing..."
-      sleep 1
-    done
     mdadm --detail --scan > "${md_config}"
   fi
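
With the wait loop removed, a RAID-10 array may still be resyncing in the background after the script returns. For operators who want to observe that, the check the removed loop performed can be kept as a small standalone predicate. This is a sketch and not part of the commit: `md_is_resyncing` is a hypothetical name, and it assumes `mdadm --detail` reports the array state on a `State :` line, as the removed code expected.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads `mdadm --detail <device>` output on stdin and
# succeeds (exit 0) if the State line mentions resyncing.
md_is_resyncing() {
  grep -qiE 'State :.*resyncing'
}

# Example with canned text instead of a real array:
sample_detail='          State : clean, resyncing'
if md_is_resyncing <<< "${sample_detail}"; then
  echo "array is still resyncing"
fi
```

Against a real array this would be used as `mdadm --detail /dev/md/kubernetes | md_is_resyncing`, where the device path is the one the script builds from `md_name`.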
@@ -63,7 +66,8 @@ maybe_raid0() {
     ## for the log stripe unit, but the max log stripe unit is 256k.
     ## So instead, we use 32k (8 blocks) to avoid a warning of breaching the max.
     ## mkfs.xfs defaults to 32k after logging the warning since the default log buffer size is 32k.
-    mkfs.xfs -l su=8b "${md_device}"
+    ## Instances are delivered with disks fully trimmed, so TRIM is skipped at creation time.
+    mkfs.xfs -K -l su=8b "${md_device}"
   fi
 
   ## Create the mount directory
@@ -231,8 +235,8 @@ set -- "${POSITIONAL[@]}" # restore positional parameters
 DISK_SETUP="$1"
 set -u
 
-if [[ "${DISK_SETUP}" != "raid0" && "${DISK_SETUP}" != "mount" && "${DISK_SETUP}" != "none" ]]; then
-  echo "Valid disk setup options are: raid0, mount, or none"
+if [[ "${DISK_SETUP}" != "raid0" && "${DISK_SETUP}" != "raid10" && "${DISK_SETUP}" != "mount" && "${DISK_SETUP}" != "none" ]]; then
+  echo "Valid disk setup options are: raid0, raid10, mount or none"
   exit 1
 fi
@@ -256,11 +260,21 @@ fi
 ## Get devices of NVMe instance storage ephemeral disks
 EPHEMERAL_DISKS=($(realpath "${disks[@]}" | sort -u))
 
+## Also bail early if there are not enough disks for raid10
+if [[ "${DISK_SETUP}" == "raid10" && "${#EPHEMERAL_DISKS[@]}" -lt 4 ]]; then
+  echo "raid10 requires at least 4 disks, but only ${#EPHEMERAL_DISKS[@]} found, can not continue!"
+  exit 1
+fi
+
 case "${DISK_SETUP}" in
   "raid0")
-    maybe_raid0
+    maybe_raid 0
     echo "Successfully setup RAID-0 consisting of ${EPHEMERAL_DISKS[@]}"
     ;;
+  "raid10")
+    maybe_raid 10
+    echo "Successfully setup RAID-10 consisting of ${EPHEMERAL_DISKS[@]}"
+    ;;
   "mount")
     maybe_mount
     echo "Successfully setup disk mounts consisting of ${EPHEMERAL_DISKS[@]}"
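
The two guards this file now applies before dispatching (a valid mode, and at least four disks for raid10) can be exercised in isolation. The sketch below combines them in one function so they can be tested without touching any disks. It is not part of the commit: `validate_disk_setup` is a hypothetical name, and it takes the disk count as an argument rather than reading `EPHEMERAL_DISKS`.

```shell
#!/usr/bin/env bash

# Hypothetical helper: returns 0 if the mode is valid for the given number
# of disks, otherwise prints a message to stderr and returns 1.
#   $1: disk setup mode (raid0, raid10, mount or none)
#   $2: number of ephemeral disks found
validate_disk_setup() {
  local mode="$1"
  local disk_count="$2"
  case "${mode}" in
    raid0 | raid10 | mount | none) ;;
    *)
      echo "Valid disk setup options are: raid0, raid10, mount or none" >&2
      return 1
      ;;
  esac
  # raid10 is the only mode with a minimum disk count in this sketch.
  if [[ "${mode}" == "raid10" ]] && (( disk_count < 4 )); then
    echo "raid10 requires at least 4 disks, but only ${disk_count} found" >&2
    return 1
  fi
}

validate_disk_setup raid10 4 && echo "raid10 with 4 disks: ok"
```

Returning a status instead of calling `exit 1` is a deliberate difference from the script: it lets the caller decide how to fail.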
