Commit 8c8ff63

Add support for raid10
This removes the wait block for raid resync for two reasons:

1) raid0 does not have redundancy and therefore has no initial resync[1].
2) With raid10, the initial resync of 4x 1.9TB disks takes from tens of minutes to multiple hours, depending on the sysctl params `dev.raid.speed_limit_min` and `dev.raid.speed_limit_max` and the speed of the disks. The initial resync for raid10 is not strictly needed[1].

Filesystem creation: by default `mkfs.xfs` attempts to TRIM the drive. This can also take tens of minutes or hours, depending on the size of the drives. TRIM can be skipped, as instances are delivered with disks fully trimmed[2].

[1] https://raid.wiki.kernel.org/index.php/Initial_Array_Creation
[2] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html#InstanceStoreTrimSupport
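The resync-time range quoted in the commit message can be sanity-checked with simple arithmetic. The sketch below is a rough order-of-magnitude estimate only. It assumes the md defaults of 1000 and 200000 KiB/s for `dev.raid.speed_limit_min` and `dev.raid.speed_limit_max`, and that the resync has to cover about 1.9 TB per member device. `estimate_resync_seconds` is a hypothetical helper, not part of this commit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rough resync duration in seconds for a given amount
# of data per device (KiB) and a sustained resync rate (KiB/s).
estimate_resync_seconds() {
  local size_kib="$1"
  local rate_kib_s="$2"
  echo $(( size_kib / rate_kib_s ))
}

# 1.9 TB is 1.9 * 10^12 bytes, which is 1855468750 KiB.
DISK_KIB=$(( 1900000000000 / 1024 ))

# Assumed kernel defaults; the live values can be read with
#   sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
SLOW=$(estimate_resync_seconds "${DISK_KIB}" 1000)
FAST=$(estimate_resync_seconds "${DISK_KIB}" 200000)

echo "at speed_limit_min: $(( SLOW / 86400 )) days"
echo "at speed_limit_max: $(( FAST / 60 )) minutes"
```

Under these assumptions the estimate spans roughly two and a half hours to about three weeks, which is why blocking node startup on the resync is impractical.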
1 parent 6b924b8 commit 8c8ff63

3 files changed, +34 -14 lines changed

doc/usage/al2.md

+6
@@ -172,6 +172,12 @@ A RAID-0 array is setup that includes all ephemeral NVMe instance storage disks.
 
 Another way of utilizing the ephemeral disks is to format and mount the individual disks. Mounting individual disks allows the [local-static-provisioner](https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner) DaemonSet to create Persistent Volume Claims that pods can utilize.

+### Experimental: RAID-10 Kubelet and Containerd (raid10)
+
+Similar to the RAID-0 array, it is possible to use a RAID-10 array on instance types with four or more ephemeral NVMe instance storage disks. RAID-10 tolerates the failure of up to 2 disks, provided they do not hold both copies of the same data. However, individual ephemeral disks cannot be replaced, so the purpose of the redundancy is to make graceful decommissioning of a node possible.
+
+RAID-10 can be enabled by passing the `--local-disks raid10` flag to the bootstrap script.
+
 ---

 ## Version-locked packages
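As a usage sketch, the flag would typically be passed to the bootstrap script from the node's user data. The script path `/etc/eks/bootstrap.sh` and the cluster name `my-cluster` are assumptions for illustration and do not appear in this commit. The command is printed rather than executed so it can be reviewed first.

```shell
#!/usr/bin/env bash
# Placeholder values (assumptions): adjust for the actual cluster and AMI.
CLUSTER_NAME="my-cluster"
BOOTSTRAP_SCRIPT="/etc/eks/bootstrap.sh"

# Arguments for the bootstrap script, selecting the RAID-10 disk setup.
BOOTSTRAP_ARGS=("${CLUSTER_NAME}" --local-disks raid10)

# Print the command instead of running it.
echo "${BOOTSTRAP_SCRIPT}" "${BOOTSTRAP_ARGS[@]}"
```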

templates/al2/runtime/bootstrap.sh

+1-1
@@ -32,7 +32,7 @@ function print_help {
   echo "--enable-local-outpost Enable support for worker nodes to communicate with the local control plane when running on a disconnected Outpost. (true or false)"
   echo "--ip-family Specify ip family of the cluster"
   echo "--kubelet-extra-args Extra arguments to add to the kubelet. Useful for adding labels or taints."
-  echo "--local-disks Setup instance storage NVMe disks in raid0 or mount the individual disks for use by pods [mount | raid0]"
+  echo "--local-disks Setup instance storage NVMe disks in raid0 or mount the individual disks for use by pods <mount | raid0 | raid10>"
   echo "--mount-bpf-fs Mount a bpffs at /sys/fs/bpf (default: true)"
   echo "--pause-container-account The AWS account (number) to pull the pause container from"
   echo "--pause-container-version The tag of the pause container"

templates/shared/runtime/bin/setup-local-disks

+27-13
@@ -15,7 +15,7 @@ err_report() {
 trap 'err_report $LINENO' ERR

 print_help() {
-  echo "usage: $0 <raid0 | mount | none>"
+  echo "usage: $0 <raid0 | raid10 | mount | none>"
   echo "Sets up Amazon EC2 Instance Store NVMe disks"
   echo ""
   echo "-d, --dir directory to mount the filesystem(s) (default: /mnt/k8s-disks/)"
@@ -26,11 +26,18 @@ print_help() {
   echo "-h, --help print this help"
 }

-# Sets up a RAID-0 of NVMe instance storage disks, moves
-# the contents of /var/lib/kubelet and /var/lib/containerd
+# Sets up a RAID-0 or RAID-10 of NVMe instance storage disks,
+# moves the contents of /var/lib/kubelet and /var/lib/containerd
 # to the new mounted RAID, and bind mounts the kubelet and
 # containerd state directories.
-maybe_raid0() {
+#
+# Do not wait for initial resync: raid0 has no redundancy so there
+# is no initial resync. Raid10 does not strictly need a resync,
+# while the time taken for 4 1.9TB disk raid10 would be in range of
+# 20 minutes to 20 days, depending on dev.raid.speed_limit_min and
+# dev.raid.speed_limit_max sysctl parameters.
+maybe_raid() {
+  local raid_level="$1"
   local md_name="kubernetes"
   local md_device="/dev/md/${md_name}"
   local md_config="/.aws/mdadm.conf"
@@ -40,14 +47,10 @@ maybe_raid0() {
   if [[ ! -s "${md_config}" ]]; then
     mdadm --create --force --verbose \
       "${md_device}" \
-      --level=0 \
+      --level="${raid_level}" \
       --name="${md_name}" \
       --raid-devices="${#EPHEMERAL_DISKS[@]}" \
       "${EPHEMERAL_DISKS[@]}"
-    while [ -n "$(mdadm --detail "${md_device}" | grep -ioE 'State :.*resyncing')" ]; do
-      echo "Raid is resyncing..."
-      sleep 1
-    done
     mdadm --detail --scan > "${md_config}"
   fi
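Since the script no longer waits, resync progress can still be checked by hand when needed. A minimal sketch that reuses the pattern from the removed loop. `is_resyncing` is a hypothetical helper, and the sample input line is illustrative rather than captured output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if `mdadm --detail` output on stdin
# reports a resync in progress. Uses the same pattern as the removed loop.
is_resyncing() {
  grep -qiE 'State :.*resyncing'
}

# Illustrative input only. On a node, pipe in the real output instead:
#   mdadm --detail /dev/md/kubernetes | is_resyncing
if printf 'State : clean, resyncing\n' | is_resyncing; then
  echo "resync in progress"
else
  echo "no resync reported"
fi
```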

@@ -63,7 +66,8 @@ maybe_raid0() {
     ## for the log stripe unit, but the max log stripe unit is 256k.
     ## So instead, we use 32k (8 blocks) to avoid a warning of breaching the max.
     ## mkfs.xfs defaults to 32k after logging the warning since the default log buffer size is 32k.
-    mkfs.xfs -l su=8b "${md_device}"
+    ## Instances are delivered with disks fully trimmed, so TRIM is skipped at creation time.
+    mkfs.xfs -K -l su=8b "${md_device}"
   fi

   ## Create the mount directory
@@ -231,8 +235,8 @@ set -- "${POSITIONAL[@]}" # restore positional parameters
 DISK_SETUP="$1"
 set -u

-if [[ "${DISK_SETUP}" != "raid0" && "${DISK_SETUP}" != "mount" && "${DISK_SETUP}" != "none" ]]; then
-  echo "Valid disk setup options are: raid0, mount, or none"
+if [[ "${DISK_SETUP}" != "raid0" && "${DISK_SETUP}" != "raid10" && "${DISK_SETUP}" != "mount" && "${DISK_SETUP}" != "none" ]]; then
+  echo "Valid disk setup options are: raid0, raid10, mount or none"
   exit 1
 fi

@@ -256,11 +260,21 @@ fi
 ## Get devices of NVMe instance storage ephemeral disks
 EPHEMERAL_DISKS=($(realpath "${disks[@]}" | sort -u))

+## Also bail early if there are not enough disks for raid10
+if [[ "${DISK_SETUP}" == "raid10" && "${#EPHEMERAL_DISKS[@]}" -lt 4 ]]; then
+  echo "raid10 requires at least 4 disks, but only ${#EPHEMERAL_DISKS[@]} found, skipping disk setup"
+  exit 0
+fi
+
 case "${DISK_SETUP}" in
   "raid0")
-    maybe_raid0
+    maybe_raid 0
     echo "Successfully setup RAID-0 consisting of ${EPHEMERAL_DISKS[@]}"
     ;;
+  "raid10")
+    maybe_raid 10
+    echo "Successfully setup RAID-10 consisting of ${EPHEMERAL_DISKS[@]}"
+    ;;
   "mount")
     maybe_mount
     echo "Successfully setup disk mounts consisting of ${EPHEMERAL_DISKS[@]}"
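The mode validation and the raid10 disk-count check in this diff can be condensed into one decision. The sketch below only illustrates that logic. `check_disk_setup` is a hypothetical helper that does not exist in the script, and it prints a verdict instead of exiting.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the checks in the diff above.
# Prints "ok", "skip" or "invalid" for a setup mode and a disk count.
check_disk_setup() {
  local mode="$1"
  local disk_count="$2"
  case "${mode}" in
    raid0 | mount | none)
      echo "ok"
      ;;
    raid10)
      # raid10 needs at least 4 disks; with fewer, setup is skipped.
      if [[ "${disk_count}" -lt 4 ]]; then
        echo "skip"
      else
        echo "ok"
      fi
      ;;
    *)
      echo "invalid"
      ;;
  esac
}

check_disk_setup raid10 2 # prints "skip"
check_disk_setup raid10 4 # prints "ok"
check_disk_setup raid5 4  # prints "invalid"
```

Note that in the actual script the skip case exits with status 0, so a node with fewer than four disks continues booting without local disk setup rather than failing.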
