---
title: Beating $3k SSD with $2k HDD?
tagline: Practical ZFS application on USTC Mirrors
tags: linux server zfs
redirect_from: /p/74
header:
  actions:
    - label: '<i class="fas fa-presentation-screen"></i> View slides'
      url: /p/72
---

A.K.A. Practical ZFS application on USTC Mirrors. A writeup of the talk I gave at Nanjing University this August.

{% assign image_base = "https://image.ibugone.com" %}

## Background

[USTC Open-Source Software Mirrors](https://mirrors.ustc.edu.cn/) is one of the largest public mirror sites in China. Over May and June 2024, we served an average daily egress traffic of some 36 TiB, which breaks down as follows:

- 19 TiB over HTTP/HTTPS, across 17M requests
- 10.3 TiB over rsync, across 21.8k requests (if we count one absurd client in, the number goes up to 147.8k)

Over the years, as mirror repositories have grown and new ones have been added, we have been running tight on disk space. Both servers responsible for the mirror service have reached unhealthy levels of disk usage:

- HTTP server (XFS): 63.3 TiB used out of 66.0 TiB (96%, reached on December 18, 2023)
- Rsync server (ZFS): 42.4 TiB used out of 43.2 TiB (98%, reached on November 21, 2023)

The servers have the following configurations:

<dl>
<dt>HTTP server</dt>
<dd markdown="1">
- Set up in Fall 2020
- Twelve 10 TB HDDs + one 2 TB SSD
- XFS on LVM on hardware RAID
- Free PEs reserved at the LVM VG level, as XFS cannot be shrunk
</dd>

<dt>Rsync server</dt>
<dd markdown="1">
- Set up in Winter 2016
- Twelve 6 TB HDDs + some smaller SSDs for the OS and cache
- RAID-Z3 on ZFS: 8 data disks + 3 parity disks + 1 hot spare
- All default parameters (except `zfs_arc_max`)
</dd>
</dl>

These servers constantly run at an I/O utilization of over 90%, which results in download speeds below 50 MB/s even from within the USTC campus. Clearly this is not the performance one expects from dedicated storage servers.

{% include figure
image_path="https://image.ibugone.com/grafana/mirrors-io-utilization-may-2024.png"
popup=true
alt="I/O load of two servers from USTC Mirrors in May 2024"
caption="I/O load of two servers from USTC Mirrors in May 2024" %}

## ZFS

ZFS is usually known as the ultimate single-node storage solution. It combines RAID, volume management, and filesystem in one, and provides advanced features like snapshots, clones and send/receive. Everything in ZFS is checksummed, ensuring data integrity. For servers dedicated to storage, ZFS appears to be a "fire and forget" solution, although that impression is easily challenged by its tremendous number of tunables and parameters.

For preliminary learning and experiments, I sourced some drives for my own workstation and set up two ZFS pools on them. I then signed up for a few private tracker (PT) sites to get an I/O load to tune for. The results were quite satisfying: in two and a half years, my single-node PT station has generated 1.20 PiB of uploads.

Over the years, these have been my most important sources for learning ZFS:

- Chris's Wiki: <https://utcc.utoronto.ca/~cks/space/blog/>
- OpenZFS Documentation: <https://openzfs.github.io/openzfs-docs/>
- My own blog: [Understanding ZFS block sizes]({{ "/p/62" | relative_url }})
  - Plus all references in that article

{% include figure
image_path="https://image.ibugone.com/grafana/qb/2024-06-05.png"
popup=true
alt="A Grafana dashboard for qBittorrent"
caption="A byproduct of my ZFS learning: A Grafana dashboard for qBittorrent (lol...)" %}

After these years of learning ZFS, I realized that there was substantial room for improvement in our mirror servers by embracing ZFS and tuning it properly.

## Mirrors

Before we move on to rebuilding the ZFS pool, we need to understand our I/O workload. In essence, a mirror site:

- Provides file downloads
- Also (begrudgingly) serves as a speed test
- Sees mostly reads, and almost all reads are whole-file sequential reads
- Can tolerate minor data loss, as mirror content can easily be re-synced

{% include figure
image_path="https://image.ibugone.com/server/mirrors-file-size-distribution-2024-08.png"
popup=true
alt="File size distribution of USTC Mirrors in August 2024"
caption="File size distribution of USTC Mirrors in August 2024" %}

With this in mind, we analyzed our mirror content. As can be seen from the graph above, half of the 40M files are less than 10 KiB in size, and 90% of the files are less than 1 MiB. Still, the average file size is 1.6 MiB.
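
The distribution itself is easy to reproduce. A minimal sketch, assuming the repositories live under `/srv/repo` (the path is an assumption, not our actual layout):

```shell
# Dump every file size, bucket them into powers of two, and print the share
# of files falling into each bucket.
find /srv/repo -type f -printf '%s\n' |
  awk '{ b = 1; while (b < $1) b *= 2; hist[b]++; total++ }
       END { for (b in hist) printf "%14d %6.2f%%\n", b, 100 * hist[b] / total }' |
  sort -n
```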

## Rebuilding the Rsync server {#mirrors2}

In June, we set out to rebuild the Rsync server, as it had lower traffic and importance, yet disproportionately higher disk usage. We laid out the following plan:

- First, the RAID overhead of RAID-Z3 was too high (reiterating: half of the files are less than 10 KiB, and the disks have 4 KiB sectors), so we decided to switch to RAID-Z2 and split the RAID group into two. Two RAID-Z vdevs also mean double the IOPS, as each "block" is stored on only one vdev.
- We then carefully selected dataset properties to optimize for our workload (see the sketch after this list):
  - `recordsize=1M` to maximize sequential throughput and minimize fragmentation.
  - `compression=zstd` to (try to) save some disk space.

    Since OpenZFS 2.2, a mechanism called "early abort" has been extended to Zstd compression (level 3+), which saves CPU cycles by testing data compressibility with LZ4 and then Zstd-1 before actually compressing with Zstd.

    We know that most of our mirror content is already compressed (software packages, ISOs and the like), so early abort nudged us toward Zstd.
  - `xattr=off`, as we don't need extended attributes for mirror content.
  - `atime=off`, as we don't need access times. This also cuts off a lot of writes.
  - `setuid=off`, `exec=off`, `devices=off` to disable what we don't need.
  - `secondarycache=metadata` to cache metadata only, as this Rsync server has a much more uniform access pattern than the HTTP server. We would like to save our SSDs from unnecessary writes.
  - Some slightly dangerous properties:
    - `sync=disabled` to disable synchronous writes. This allows ZFS to buffer writes for up to `zfs_txg_timeout` seconds and make better allocation decisions.
    - `redundant_metadata=some` to trade some metadata redundancy for better write performance.

    We believe these changes are in line with our assessment of data safety and loss tolerance.

- For the ZFS module parameters, the sheer number of 290+ tunables is overwhelming. Thanks to @happyaron, the current ZFS maintainer in Debian and administrator of the BFSU Mirror, we settled on a handful of them:

  ```shell
  # Set ARC size to 160-200 GiB, keep 16 GiB free for the OS
  options zfs zfs_arc_max=214748364800
  options zfs zfs_arc_min=171798691840
  options zfs zfs_arc_sys_free=17179869184

  # Favor metadata over data by 20x (OpenZFS 2.2+)
  options zfs zfs_arc_meta_balance=2000

  # Allow up to 80% of the ARC to be used for dnodes
  options zfs zfs_arc_dnode_limit_percent=80

  # See the man page section "ZFS I/O Scheduler"
  options zfs zfs_vdev_async_read_max_active=8
  options zfs zfs_vdev_async_read_min_active=2
  options zfs zfs_vdev_scrub_max_active=5
  options zfs zfs_vdev_max_active=20000

  # Never throttle the ARC
  options zfs zfs_arc_lotsfree_percent=0

  # Tune L2ARC
  options zfs l2arc_headroom=8
  options zfs l2arc_write_max=67108864
  options zfs l2arc_noprefetch=0
  ```

  We also looked at `zfs_dmu_offset_next_sync`, but it has been enabled by default since OpenZFS 2.1.5, so it's omitted from our list.
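
Put together, the plan amounts to something like the following sketch. The disk names and mountpoint are hypothetical, the property values follow the list above, and cache/spare devices are left out:

```shell
# Two 6-disk RAID-Z2 vdevs in one pool (disk names are placeholders)
zpool create pool0 \
  raidz2 sda sdb sdc sdd sde sdf \
  raidz2 sdg sdh sdi sdj sdk sdl

# One dataset for the mirror content, with the properties discussed above
zfs create -o mountpoint=/srv/repo \
  -o recordsize=1M -o compression=zstd \
  -o xattr=off -o atime=off \
  -o setuid=off -o exec=off -o devices=off \
  -o secondarycache=metadata \
  -o sync=disabled -o redundant_metadata=some \
  pool0/repo
```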

After relocating the Rsync service to our primary (HTTP) server, we broke up the existing ZFS pool and rebuilt it anew, before syncing the previous repositories back from external sources. To our surprise, the restoration took only 3 days, much faster than we had anticipated. Other numbers also looked promising:

- Compression ratio: 39.5T / 37.1T (1.07x)

  We'd like to point out that ZFS only reports two digits after the decimal point for the compression ratio, so if you want higher precision, you need to take the raw numbers and calculate it yourself:

  ```shell
  zfs list -po name,logicalused,used
  ```

  Our actual number was 1 + 6.57%, with 2.67 TB (2.43 TiB) saved, which is equivalent to 9 copies of WeChat data [as advertised by Lenovo Legion]({{ image_base }}/teaser/lenovo-legion-wechat-data.jpg).
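
  For example, a minimal sketch of that calculation (the dataset name is ours; the math is simply `logicalused / used`):

  ```shell
  zfs list -Hpo logicalused,used pool0/repo |
    awk '{ printf "ratio %.4f, %.2f TiB (%.2f TB) saved\n",
           $1 / $2, ($1 - $2) / 2^40, ($1 - $2) / 1e12 }'
  ```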

- And most importantly, a much saner I/O load:

  {% include figure
  image_path="https://image.ibugone.com/grafana/mirrors2-io-utilization-and-free-space-june-july-2024.png"
  popup=true
  alt="I/O load of server mirrors2 before and after the rebuild"
  caption="I/O load of server \"mirrors2\" before and after the rebuild" %}

We can see that, after a few days of warm-up, the I/O load has settled at around 20%, whereas it was constantly at 90% before the rebuild.

## Rebuilding the HTTP server {#mirrors4}

Our HTTP server was set up in late 2020, under a different background.
When we first decided on the technology stack, we were not confident in ZFS and were discouraged by the abysmal performance of our Rsync server.
So we opted for an entirely different stack for this server: hardware RAID, LVM (because the RAID controller didn't allow RAID groups across two controllers), and XFS.
For memory caching we relied on the kernel's page cache, and for SSD caching we tried LVMcache, which was quite new at the time and rather immature.

These unfamiliar technologies have, without a doubt, ended up being a pain.

- XFS cannot be shrunk, so we had to reserve free PEs at the LVM VG level. We also cannot fill the FS completely, so there are two levels of free space reservation. Double the waste.
- We initially allocated 1.5 TB of SSD cache, but given LVMcache's recommendation of no more than 1 million chunks, we opted for just 1 TiB (1 MiB chunk size &times; 1 Mi chunks).
- There was no option for the cache eviction policy, so we later dug into the kernel source code and found that it was a 64-level LRU.
- The first thing to die was GRUB2. Due to its parsing of LVM metadata, GRUB was unable to boot from a VG where a cached volume was present. We had to [patch](https://github.com/taoky/grub/commit/85b260baec91aa4f7db85d7592f6be92d549a0ae) GRUB for it to handle this case.
- With an incorrect understanding of chunk size and the number of chunks, our SSD ran severely over its write endurance in under 2 years, and we had to replace it with a new one.

Even after understanding the algorithm and going for a 128 KiB chunk size with over 8 Mi chunks, LVMcache still didn't offer a competitive hit rate:

{% include figure
image_path="https://image.ibugone.com/grafana/mirrors4-dmcache-may-june-2024.png"
popup=true
alt="LVMcache hit rate over May to June 2024"
caption="LVMcache hit rate over May to June 2024" %}

We had been fed up with these troubles for years, and the success of the Rsync server rebuild gave us great confidence in ZFS.
So in less than a month, we laid out a similar plan for our HTTP server, while also trying a few new things:

- We updated the kernel to `6.8.8-3-pve`, which bundles the latest `zfs.ko` for us. This means we don't have to waste time on DKMS.
- Since the number of disks is the same (12), we again went for two RAID-Z2 vdevs with 6 disks each.
- As this server provides HTTP service to end users, the access pattern has a much greater hot/cold distinction than on the Rsync server, so we kept `secondarycache=all` (the default) for this server.
- This newer server has a better CPU, so we raised the compression level to `zstd-8` in hope of a better compression ratio.
- Since we already had the Rsync server running ZFS with the desired parameters, we could use `zfs send -Lcp` when syncing the data back (see the sketch after this list). This allowed us to restore 50+ TiB of data in just 36 hours.
- Due to a slightly different set of repositories, the compression ratio is slightly lower at 1 + 3.93%, with 2.42 TB (2.20 TiB) saved.
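
For reference, the transfer is essentially a replication stream between the two servers. A minimal sketch, with the hostname and snapshot name as placeholders (our actual invocation also had to deal with per-repository datasets and snapshots):

```shell
# -L: allow large (>128 KiB) record blocks, -c: send blocks compressed as-is,
# -p: include dataset properties in the stream
zfs snapshot pool0/repo@migration
zfs send -Lcp pool0/repo@migration | ssh mirrors4 zfs receive -F pool0/repo
```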

We put the I/O loads of both servers together for comparison:

{% include figure
image_path="https://image.ibugone.com/grafana/mirrors2-4-io-utilization-june-july-2024.png"
popup=true
alt="I/O load of two servers from USTC Mirrors before and after rebuild"
caption="I/O load of two servers from USTC Mirrors before and after rebuild" %}

The graph starts with the initial state; the first server was rebuilt at about one third of the time span, and the second at about two thirds.

The hit rate of the ZFS ARC is also quite satisfying:

{% include figure
image_path="https://image.ibugone.com/grafana/mirrors2-4-zfs-arc-hit-rate.png"
popup=true
alt="ZFS ARC hit rate of two servers"
caption="ZFS ARC hit rate of two servers" %}

The stabilized I/O load is even lower after both servers were rebuilt.

{% include figure
image_path="https://image.ibugone.com/grafana/mirrors2-4-disk-io-after-rebuild.png"
popup=true
alt="Sustained disk I/O of two servers after rebuild"
caption="Sustained disk I/O of two servers after rebuild" %}

## Misc

### ZFS compression

We were slightly surprised to see how many repositories compress well:

| NAME                       | LUSED | USED  | RATIO |
| :------------------------- | ----: | ----: | ----: |
| pool0/repo/crates.io-index | 2.19G | 1.65G | 3.01x |
| pool0/repo/elpa            | 3.35G | 2.32G | 1.67x |
| pool0/repo/rfc             | 4.37G | 3.01G | 1.56x |
| pool0/repo/debian-cdimage  | 1.58T | 1.04T | 1.54x |
| pool0/repo/tldp            | 4.89G | 3.78G | 1.48x |
| pool0/repo/loongnix        | 438G  | 332G  | 1.34x |
| pool0/repo/rosdistro       | 32.2M | 26.6M | 1.31x |

A few numbers (notably the first one) don't make sense, which we attribute to [<i class="fab fa-github"></i> openzfs/zfs#7639](https://github.com/openzfs/zfs/issues/7639).

If we sort the table by absolute difference instead, it looks like this:

| NAME                      | LUSED  | USED   | DIFF   |
| :------------------------ | -----: | -----: | -----: |
| pool0/repo                | 58.3T  | 56.1T  | 2.2T   |
| pool0/repo/debian-cdimage | 1.6T   | 1.0T   | 549.6G |
| pool0/repo/opensuse       | 2.5T   | 2.3T   | 279.7G |
| pool0/repo/turnkeylinux   | 1.2T   | 1.0T   | 155.2G |
| pool0/repo/loongnix       | 438.2G | 331.9G | 106.3G |
| pool0/repo/alpine         | 3.0T   | 2.9T   | 103.9G |
| pool0/repo/openwrt        | 1.8T   | 1.7T   | 70.0G  |

`debian-cdimage` alone contributes a quarter of the saved space.
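
Both tables come from the same raw numbers. A sketch of how the second one can be produced (dataset names as in the tables above, difference printed in GiB):

```shell
# List logical vs. physical usage per dataset, append the difference,
# and sort by it in descending order.
zfs list -rHpo name,logicalused,used pool0/repo |
  awk '{ printf "%s\t%.1fG\n", $0, ($2 - $3) / 2^30 }' |
  sort -k4 -rn
```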

### Grafana for ZFS I/O

We also fixed a Grafana panel for ZFS I/O so that it displays the correct numbers.
Because ZFS I/O statistics are exported through `/proc/spl/kstat/zfs/$POOL/objset-$OBJSETID_HEX` and are cumulative per "object set" (i.e. dataset), we need to take the derivative of the numbers and *then* sum them by pool.
This makes a subquery inevitable.

```sql
SELECT
  non_negative_derivative(sum("reads"), 1s) AS "read",
  non_negative_derivative(sum("writes"), 1s) AS "write"
FROM (
  SELECT
    first("reads") AS "reads",
    first("writes") AS "writes"
  FROM "zfs_pool"
  WHERE ("host" = 'taokystrong' AND "pool" = 'pool0') AND $timeFilter
  GROUP BY time($interval), "host"::tag, "pool"::tag, "dataset"::tag fill(null)
)
WHERE $timeFilter
GROUP BY time($interval), "pool"::tag fill(linear)
```

This query is a bit slow (due to the subquery), and unfortunately there's not much we can do about it.

To display I/O bandwidth, simply replace `reads` and `writes` with `nread` and `nwritten` in the inner query.
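
The underlying counters can also be inspected directly on the server. A quick sketch (the pool name is ours, the objset ID is a made-up example):

```shell
# Each dataset exposes cumulative reads/writes/nread/nwritten counters here
cat /proc/spl/kstat/zfs/pool0/objset-0x36
```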

{% include figure
image_path="https://image.ibugone.com/grafana/mirrors2-4-zfs-io-count.png"
popup=true
alt="ZFS I/O count and bandwidth"
caption="ZFS I/O count and bandwidth" %}

We were astonished to see an HDD array sustain 15k IOPS, peaking at 50k IOPS.
It all made sense when we discovered that these numbers include ARC hits, and only a small proportion of requests actually hit the disks.

### AppArmor

It didn't take long before we noticed that all our sync tasks were failing.
We found `rsync` failing with `EPERM` on `socketpair(2)` calls, which had never happened before.
Interestingly, these calls were being denied by AppArmor.
We traced the cause to an Ubuntu addition to the kernel, `security/apparmor/af_unix.c`.
As Proxmox VE forks its kernel from Ubuntu, this change also made its way into our server.
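
Denials like this are visible in the kernel log. A minimal way to check (sketch):

```shell
# AppArmor denials are logged as audit messages
dmesg | grep -i 'apparmor="DENIED"'
```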

We also found that PVE packages its own copy of the AppArmor `features` file, so we decided to adopt the same approach:

```shell
dpkg-divert --package lxc-pve --rename --divert /usr/share/apparmor-features/features.stock --add /usr/share/apparmor-features/features
wget -O /usr/share/apparmor-features/features https://github.com/proxmox/lxc/raw/master/debian/features
```

### File deduplication

For a small set of repositories, possibly due to limitations of their syncing methods, we noticed a lot of identical-looking directories.

{% include figure
image_path="https://image.ibugone.com/server/ls-zerotier-redhar-el.png"
popup=true
alt="Some folders from ZeroTier repository"
caption="Some folders from the ZeroTier repository" %}

ZFS deduplication immediately came to mind, so we ran a preliminary test on the ZeroTier repository:

```shell
zfs create -o dedup=on pool0/repo/zerotier
# dump content into it
```

```console
# zdb -DDD pool0
dedup = 4.93, compress = 1.23, copies = 1.00, dedup * compress / copies = 6.04
```

The results looked promising, but we were still hesitant to enable deduplication due to the potential performance impact, even on these selected datasets.

Guess what we ended up with?

```shell
# post-sync.sh
# Do file-level deduplication for select repos
case "$NAME" in
  docker-ce|influxdata|nginx|openresty|proxmox|salt|tailscale|zerotier)
    jdupes -L -Q -r -q "$DIR" ;;
esac
```

Simple as it looks, this userspace file-level deduplication achieves about as much as ZFS dedup could for these repositories, but without the performance loss.

| Name       | Orig   | Dedup  | Diff   | Ratio |
| :--------- | -----: | -----: | -----: | ----: |
| proxmox    | 395.4G | 162.6G | 232.9G | 2.43x |
| docker-ce  | 539.6G | 318.2G | 221.4G | 1.70x |
| influxdata | 248.4G | 54.8G  | 193.6G | 4.54x |
| salt       | 139.0G | 87.2G  | 51.9G  | 1.59x |
| nginx      | 94.9G  | 59.7G  | 35.2G  | 1.59x |
| zerotier   | 29.8G  | 6.1G   | 23.7G  | 4.88x |
| mysql-repo | 647.8G | 632.5G | 15.2G  | 1.02x |
| openresty  | 65.1G  | 53.4G  | 11.7G  | 1.22x |
| tailscale  | 17.9G  | 9.0G   | 9.0G   | 2.00x |

We decided to exclude `mysql-repo`, as its deduplication ratio is too low to justify the I/O load after each sync.
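
The "Orig" and "Dedup" columns can be measured with plain `du`, since `jdupes -L` replaces duplicate files with hard links. A sketch, with a hypothetical repository path:

```shell
du -sh --count-links /srv/repo/zerotier   # "Orig": count every hard link
du -sh /srv/repo/zerotier                 # "Dedup": count each inode once
```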

## Conclusion

ZFS solved a number of problems we had with our mirror servers, and with the current setup, we are delighted to declare ZFS *the* best solution for mirror storage.

With ZFS:

- We no longer need to worry about partitioning, as ZFS datasets grow and shrink as needed.
- Our HDD arrays now run faster than SSDs. Amazing!
- We are the first to no longer **envy** TUNA's SSD server!
- We get extra capacity at no cost, thanks to ZFS compression.
- And even more with deduplication.

### Considerations

While our ZFS setup looks very promising, we're aware that ZFS is not known for long-term performance stability, due to fragmentation.
We'll continue to monitor our servers and see whether this performance is sustainable.
