
Commit 6301406

clickhouse gotchas
1 parent f2334f4 commit 6301406

2 files changed, +159 -0 lines changed

clickhouse/archictecture.md

Lines changed: 43 additions & 0 deletions
# Architecture

Sources:

- <https://clickhouse.com/docs/academic_overview>
- <https://blog.dataengineerthings.org/i-spent-8-hours-learning-the-clickhouse-mergetree-table-engine-511093777daa>

## Gotchas

### ClickHouse writes
In synchronous insert mode, each INSERT statement creates a new part and appends it to the table. This means ClickHouse writes data directly to the file system for every INSERT operation (unless specific buffering settings are used), whereas standard LSM-tree stores (like RocksDB, Cassandra, or HBase) first write to an in-memory MemTable and a sequential WAL (write-ahead log).

If you insert data into ClickHouse frequently in small chunks (e.g., 1 row at a time), performance degrades catastrophically. Here is the technical breakdown of why:
1. The "Too Many Parts" problem (Read & Write Amplification)
17+
- How it works: In ClickHouse, every INSERT creates a new directory containing data files (a "Part").
18+
- The Problem: If you send 1,000 individual inserts per second, ClickHouse creates 1,000 folders and files on your disk every second.
19+
- The Impact: The background merger (compaction process) cannot keep up. It has to merge these thousands of small files into larger ones. This causes massive Write Amplification (data is rewritten to disk multiple times) and chokes the CPU and Disk I/O. Eventually, ClickHouse will throw a [Too many parts](https://www.tinybird.co/docs/sql-reference/clickhouse-errors/TOO_MANY_PARTS) error and reject writes to protect itself.
20+
21+
2. Random I/O vs. Sequential I/O
   - Standard LSM: Writes go to a WAL (append-only file). This is a purely sequential operation, which is extremely fast on both HDDs and SSDs.
   - ClickHouse: Creating a new "Part" involves creating a directory, creating multiple files (one for each column + indexes), and writing metadata. This involves significantly more file system inode operations and random I/O overhead than simply appending to a single log file.
3. Read Latency Degradation
   - The Impact: Queries in LSM trees must check all "Parts" (SSTables) to find data. If you have thousands of small unmerged parts, a SELECT query has to open and read from thousands of files simultaneously. This destroys query latency. (A short SQL sketch of the anti-pattern follows this list.)
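
For illustration only, a minimal SQL sketch of the anti-pattern versus the bulk insert the paper recommends (the `events` table is hypothetical, not from the sources):

```sql
CREATE TABLE events (ts DateTime, user_id UInt64, value Float64)
ENGINE = MergeTree
ORDER BY ts;

-- Anti-pattern: 1,000 of these per second creates 1,000 new parts per second.
INSERT INTO events VALUES (now(), 42, 1.0);

-- Preferred: one INSERT carrying a large batch produces a single part.
INSERT INTO events VALUES (now(), 42, 1.0), (now(), 43, 2.0); -- ... ~20,000 rows
```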
**How ClickHouse handles this**: basically, the client must build the "MemTable" itself on its side (or use a buffering layer).

- _To minimize the overhead of merges, database clients are encouraged to insert tuples in bulk, e.g. 20,000 rows at once_.
- ClickHouse buffers rows from multiple incoming INSERTs into the same table and creates a new part only after the buffer size exceeds a configurable threshold or a timeout expires. This is _async insert_: basically, we now have a pseudo-MemTable.

![](https://clickhouse.com/docs/assets/ideal-img/_vldb2024_2_Figure_5.8af62c4.2048.png)

ClickHouse buffers (async inserts or the Buffer engine) act exactly like an LSM "MemTable" (in-memory storage), but without a WAL (write-ahead log). This means that if the server dies (crash, power loss, OOM kill) while data is in that buffer, that data is gone forever.
We won't cover the [Buffer engine](https://clickhouse.com/docs/engines/table-engines/special/buffer) because it is legacy and not recommended. Let's focus on how [async inserts handle this](https://clickhouse.com/docs/optimize/asynchronous-inserts). There is a `wait_for_async_insert` setting (both modes are sketched after this list):

- When set to 1 (the default), ClickHouse only acknowledges the insert after the data is successfully flushed to disk. This ensures strong durability guarantees and makes error handling straightforward: if something goes wrong during the flush, the error is returned to the client. This mode is recommended for most production scenarios, especially when insert failures must be tracked reliably -> _the client has to handle the retry logic_.
- Setting `wait_for_async_insert = 0` enables "fire-and-forget" mode. Here, the server acknowledges the insert as soon as the data is buffered, without waiting for it to reach storage.
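
As a sketch against the hypothetical `events` table from above (`async_insert` and `wait_for_async_insert` are the actual ClickHouse setting names):

```sql
-- Durable mode: acknowledged only after the buffer is flushed into a part.
INSERT INTO events SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (now(), 42, 1.0);

-- Fire-and-forget: acknowledged once the row is in the in-memory buffer;
-- a crash before the flush silently loses it.
INSERT INTO events SETTINGS async_insert = 1, wait_for_async_insert = 0
VALUES (now(), 42, 1.0);
```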
**Why does ClickHouse do this?**
Sheer speed. Writing a WAL requires a sequential disk write (and an `fsync`) for every batch. Even though sequential writes are fast, they are strictly slower than writing to RAM. ClickHouse is designed for analytics, where ingesting 100 million rows/sec sometimes matters more than the risk of losing the last 0.5 seconds of data in a catastrophic crash.

linux/fsync.md

Lines changed: 116 additions & 0 deletions
# `fsync` in Linux

## Overview

`fsync()` is a POSIX system call used to ensure that all modified data and metadata associated with a file descriptor are flushed from volatile memory (the page cache) to stable storage (e.g., disk). It is commonly used in durability-critical software such as databases, filesystems, and transactional systems.

```c
#include <unistd.h>

int fsync(int fd);
```

On success, `fsync()` returns `0`. On failure, it returns `-1` and sets `errno`.
## What `fsync()` Guarantees
When `fsync(fd)` returns successfully:

- All dirty **file data** for `fd` is written to disk
- All necessary **metadata** (e.g., file size, timestamps, allocation info) is committed
- Data is guaranteed to survive a **power loss or system crash**

This guarantee applies only to the specific file referenced by the file descriptor.
## What `fsync()` Does _Not_ Guarantee
- It does **not** guarantee persistence of directory entries unless the directory itself is also synced
- It does **not** guarantee ordering between multiple file descriptors unless explicitly controlled
- It does **not** ensure durability of memory-mapped (`mmap`) writes unless paired with `msync()`
## Common Usage Pattern
```c
#include <fcntl.h>
#include <unistd.h>

int fd = open("data.log", O_WRONLY | O_CREAT | O_TRUNC, 0644);
write(fd, buffer, len);   /* check the return value in production code */
fsync(fd);                /* data and metadata now on stable storage */
close(fd);
```

To ensure the file **name and existence** are durable, sync the containing directory as well:

```c
int dfd = open(".", O_DIRECTORY | O_RDONLY);
fsync(dfd);   /* makes the new directory entry durable */
close(dfd);
```
## `fsync()` vs Related Calls
### `fdatasync()`

- Flushes **data** only
- May skip metadata that is not needed to retrieve the data (e.g., timestamps)
- Often faster than `fsync()`

```c
fdatasync(fd);
```
### `sync()`
- Flushes **all** dirty buffers system-wide
- Asynchronous on many systems (POSIX allows it to return before writeback completes)
- Not suitable for per-file durability guarantees
### `msync()`
- Used for `mmap()`-based I/O
- Required to flush memory-mapped changes (see the sketch below)
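
A minimal sketch of the pattern, reusing the hypothetical `fd`, `buffer`, and `len` from the snippets above (error checks omitted):

```c
#include <string.h>
#include <sys/mman.h>

char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
memcpy(map, buffer, len);   /* modify the file through the mapping */
msync(map, len, MS_SYNC);   /* block until the dirty pages reach disk */
munmap(map, len);
```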
## Filesystem-Specific Behavior
- **ext4**
  - Honors `fsync()` fully
  - Behavior affected by mount options (`data=ordered`, `data=journal`)

- **XFS**
  - Strong `fsync()` guarantees
  - Directory `fsync()` often required for durability

- **btrfs**
  - Copy-on-write semantics
  - `fsync()` may trigger more extensive metadata writes
## Performance Characteristics
- `fsync()` is expensive:
  - Forces cache flushes
  - Often triggers disk barriers or cache flush commands (e.g., `FLUSH CACHE`)
- High-frequency `fsync()` calls can dominate latency
- Common optimizations (a batching sketch follows):
  - Group commit
  - Batched writes
  - Asynchronous I/O with explicit durability boundaries
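
As an illustrative sketch (the `records` array is hypothetical), batched writes amortize the flush cost by creating one durability point per batch instead of one per record:

```c
for (int i = 0; i < n_records; i++)
    write(fd, records[i].buf, records[i].len);
fsync(fd);   /* one cache flush covers the whole batch */
```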
## Error Handling
Common `errno` values:

- `EBADF` – Invalid file descriptor
- `EIO` – I/O error during writeback
- `EINVAL` – Descriptor does not support syncing

Errors may indicate **data loss risk** and should be treated as fatal in durability-sensitive applications.
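
A sketch of that advice, assuming `fd` from the earlier snippets. Note that on Linux a failed `fsync()` may mean the kernel has already dropped the affected dirty pages, so simply retrying `fsync()` can appear to succeed without the data being durable:

```c
#include <stdio.h>
#include <stdlib.h>

if (fsync(fd) == -1) {
    perror("fsync");   /* EIO here usually means writeback already failed */
    abort();           /* fail hard and recover; do not silently retry */
}
```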
## Best Practices
- Call `fsync()` only at well-defined durability points
- Sync directories when creating, deleting, or renaming files
- Prefer `fdatasync()` when metadata durability is unnecessary
- Avoid relying on `close()` for durability guarantees
- Test behavior under crash/power-failure scenarios
