
Commit 6301406

clickhouse gotchas
1 parent f2334f4 commit 6301406

2 files changed, +159 -0 lines changed

clickhouse/archictecture.md

Lines changed: 43 additions & 0 deletions
# Architecture

Sources:

- <https://clickhouse.com/docs/academic_overview>
- <https://blog.dataengineerthings.org/i-spent-8-hours-learning-the-clickhouse-mergetree-table-engine-511093777daa>

## Gotchas

### ClickHouse writes
In synchronous insert mode, each INSERT statement creates a new part and appends it to the table. This means ClickHouse writes data directly to the file system for every INSERT operation (unless specific buffering settings are used), whereas standard LSM-tree stores (like RocksDB, Cassandra, or HBase) first write to an in-memory MemTable and a sequential WAL (write-ahead log).

If you insert data into ClickHouse frequently in small chunks (e.g., 1 row at a time), performance degrades catastrophically. Here is the technical breakdown of why:
1. The "Too Many Parts" problem (Read & Write Amplification)
17+
- How it works: In ClickHouse, every INSERT creates a new directory containing data files (a "Part").
18+
- The Problem: If you send 1,000 individual inserts per second, ClickHouse creates 1,000 folders and files on your disk every second.
19+
- The Impact: The background merger (compaction process) cannot keep up. It has to merge these thousands of small files into larger ones. This causes massive Write Amplification (data is rewritten to disk multiple times) and chokes the CPU and Disk I/O. Eventually, ClickHouse will throw a [Too many parts](https://www.tinybird.co/docs/sql-reference/clickhouse-errors/TOO_MANY_PARTS) error and reject writes to protect itself.
20+
21+
2. Random I/O vs. Sequential I/O
   - Standard LSM: Writes go to a WAL (append-only file). This is a purely sequential operation, which is extremely fast on both HDDs and SSDs.
   - ClickHouse: Creating a new "Part" involves creating a directory, creating multiple files (one for each column + indexes), and writing metadata. This involves significantly more file system inode operations and random I/O overhead than simply appending to a single log file.
3. Read Latency Degradation
   - The Impact: Queries in LSM trees must check all "Parts" (SSTables) to find data. If you have thousands of small unmerged parts, a SELECT query has to open and read from thousands of files simultaneously. This destroys query latency. (A short SQL sketch of the anti-pattern follows this list.)
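
For illustration only, a minimal SQL sketch of the anti-pattern versus the bulk insert the paper recommends (the `events` table is hypothetical, not from the sources):

```sql
CREATE TABLE events (ts DateTime, user_id UInt64, value Float64)
ENGINE = MergeTree
ORDER BY ts;

-- Anti-pattern: 1,000 of these per second creates 1,000 new parts per second.
INSERT INTO events VALUES (now(), 42, 1.0);

-- Preferred: one INSERT carrying a large batch produces a single part.
INSERT INTO events VALUES (now(), 42, 1.0), (now(), 43, 2.0); -- ... ~20,000 rows
```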
**How ClickHouse handles this**: basically, the client must build the "MemTable" itself on its side (or use a buffering layer).

- _To minimize the overhead of merges, database clients are encouraged to insert tuples in bulk, e.g. 20,000 rows at once_.
- ClickHouse buffers rows from multiple incoming INSERTs into the same table and creates a new part only after the buffer size exceeds a configurable threshold or a timeout expires. This is _async insert_: basically, we now have a pseudo-MemTable.

![](https://clickhouse.com/docs/assets/ideal-img/_vldb2024_2_Figure_5.8af62c4.2048.png)

ClickHouse buffers (async inserts or the Buffer engine) act exactly like an LSM "MemTable" (in-memory storage), but without a WAL (write-ahead log). This means that if the server dies (crash, power loss, OOM kill) while data is in that buffer, that data is gone forever.
We won't cover the [Buffer engine](https://clickhouse.com/docs/engines/table-engines/special/buffer) because it is legacy and not recommended. Let's focus on how [async inserts handle this](https://clickhouse.com/docs/optimize/asynchronous-inserts). There is a `wait_for_async_insert` setting (both modes are sketched after this list):

- When set to 1 (the default), ClickHouse only acknowledges the insert after the data is successfully flushed to disk. This ensures strong durability guarantees and makes error handling straightforward: if something goes wrong during the flush, the error is returned to the client. This mode is recommended for most production scenarios, especially when insert failures must be tracked reliably -> _the client has to handle the retry logic_.
- Setting `wait_for_async_insert = 0` enables "fire-and-forget" mode. Here, the server acknowledges the insert as soon as the data is buffered, without waiting for it to reach storage.
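
As a sketch against the hypothetical `events` table from above (`async_insert` and `wait_for_async_insert` are the actual ClickHouse setting names):

```sql
-- Durable mode: acknowledged only after the buffer is flushed into a part.
INSERT INTO events SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (now(), 42, 1.0);

-- Fire-and-forget: acknowledged once the row is in the in-memory buffer;
-- a crash before the flush silently loses it.
INSERT INTO events SETTINGS async_insert = 1, wait_for_async_insert = 0
VALUES (now(), 42, 1.0);
```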
**Why does ClickHouse do this?**
Sheer speed. Writing a WAL requires a sequential disk write (and an `fsync`) for every batch. Even though sequential writes are fast, they are strictly slower than writing to RAM. ClickHouse is designed for analytics, where ingesting 100 million rows/sec sometimes matters more than the risk of losing the last 0.5 seconds of data in a catastrophic crash.

linux/fsync.md

Lines changed: 116 additions & 0 deletions
# `fsync` in Linux

## Overview

`fsync()` is a POSIX system call used to ensure that all modified data and metadata associated with a file descriptor are flushed from volatile memory (the page cache) to stable storage (e.g., disk). It is commonly used in durability-critical software such as databases, filesystems, and transactional systems.

```c
#include <unistd.h>

int fsync(int fd);
```

On success, `fsync()` returns `0`. On failure, it returns `-1` and sets `errno`.
## What `fsync()` Guarantees
When `fsync(fd)` returns successfully:

- All dirty **file data** for `fd` is written to disk
- All necessary **metadata** (e.g., file size, timestamps, allocation info) is committed
- Data is guaranteed to survive a **power loss or system crash**

This guarantee applies only to the specific file referenced by the file descriptor.
## What `fsync()` Does _Not_ Guarantee
- It does **not** guarantee persistence of directory entries unless the directory itself is also synced
- It does **not** guarantee ordering between multiple file descriptors unless explicitly controlled
- It does **not** ensure durability of memory-mapped (`mmap`) writes unless paired with `msync()`
## Common Usage Pattern
```c
#include <fcntl.h>
#include <unistd.h>

int fd = open("data.log", O_WRONLY | O_CREAT | O_TRUNC, 0644);
write(fd, buffer, len);   /* check the return value in production code */
fsync(fd);                /* data and metadata now on stable storage */
close(fd);
```

To ensure the file **name and existence** are durable, sync the containing directory as well:

```c
int dfd = open(".", O_DIRECTORY | O_RDONLY);
fsync(dfd);   /* makes the new directory entry durable */
close(dfd);
```
## `fsync()` vs Related Calls
### `fdatasync()`

- Flushes **data** only
- May skip metadata that is not needed to retrieve the data (e.g., timestamps)
- Often faster than `fsync()`

```c
fdatasync(fd);
```
### `sync()`
- Flushes **all** dirty buffers system-wide
- Asynchronous on many systems (POSIX allows it to return before writeback completes)
- Not suitable for per-file durability guarantees
### `msync()`
- Used for `mmap()`-based I/O
- Required to flush memory-mapped changes (see the sketch below)
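
A minimal sketch of the pattern, reusing the hypothetical `fd`, `buffer`, and `len` from the snippets above (error checks omitted):

```c
#include <string.h>
#include <sys/mman.h>

char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
memcpy(map, buffer, len);   /* modify the file through the mapping */
msync(map, len, MS_SYNC);   /* block until the dirty pages reach disk */
munmap(map, len);
```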
## Filesystem-Specific Behavior
- **ext4**
  - Honors `fsync()` fully
  - Behavior affected by mount options (`data=ordered`, `data=journal`)

- **XFS**
  - Strong `fsync()` guarantees
  - Directory `fsync()` often required for durability

- **btrfs**
  - Copy-on-write semantics
  - `fsync()` may trigger more extensive metadata writes
## Performance Characteristics
- `fsync()` is expensive:
  - Forces cache flushes
  - Often triggers disk barriers or cache flush commands (e.g., `FLUSH CACHE`)
- High-frequency `fsync()` calls can dominate latency
- Common optimizations (a batching sketch follows):
  - Group commit
  - Batched writes
  - Asynchronous I/O with explicit durability boundaries
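
As an illustrative sketch (the `records` array is hypothetical), batched writes amortize the flush cost by creating one durability point per batch instead of one per record:

```c
for (int i = 0; i < n_records; i++)
    write(fd, records[i].buf, records[i].len);
fsync(fd);   /* one cache flush covers the whole batch */
```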
## Error Handling
Common `errno` values:

- `EBADF` – Invalid file descriptor
- `EIO` – I/O error during writeback
- `EINVAL` – Descriptor does not support syncing

Errors may indicate **data loss risk** and should be treated as fatal in durability-sensitive applications.
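
A sketch of that advice, assuming `fd` from the earlier snippets. Note that on Linux a failed `fsync()` may mean the kernel has already dropped the affected dirty pages, so simply retrying `fsync()` can appear to succeed without the data being durable:

```c
#include <stdio.h>
#include <stdlib.h>

if (fsync(fd) == -1) {
    perror("fsync");   /* EIO here usually means writeback already failed */
    abort();           /* fail hard and recover; do not silently retry */
}
```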
## Best Practices
- Call `fsync()` only at well-defined durability points
- Sync directories when creating, deleting, or renaming files
- Prefer `fdatasync()` when metadata durability is unnecessary
- Avoid relying on `close()` for durability guarantees
- Test behavior under crash/power-failure scenarios
