|
2 | 2 |
|
3 | 3 | Features and optimizations to add on top of the current minimal implementation. |
4 | 4 |
|
| 5 | +## Reference implementations |
| 6 | + |
| 7 | +When picking up any of the items in this document, cross-reference these |
| 8 | +ublk implementations: |
| 9 | + |
| 10 | +- [`semistrict/go-ublk`](https://github.com/semistrict/go-ublk) — another |
| 11 | + pure-Go ublk library. Covers multi-queue, USER_COPY mode, zero-copy with |
| 12 | + auto-buf-reg, queue affinity pinning, GET_PARAMS / GET_FEATURES / |
| 13 | + GET_QUEUE_AFFINITY, and an "easy" handler that adapts `io.ReaderAt` |
| 14 | + plus optional `Flusher`/`Discarder`/`Zeroer` interfaces. |
| 15 | +- [SPDK ublk target](https://spdk.io/doc/ublk.html) — production C |
| 16 | + implementation used by [Longhorn v2 data |
| 17 | + engine](https://github.com/longhorn/longhorn/issues/9456). Validates |
| 18 | + our `runtime.LockOSThread`-per-queue model: SPDK explicitly notes the |
| 19 | + kernel constraint that "one ublk device queue can be only processed |
| 20 | + in the context of system thread which initialized it". They pin one |
| 21 | + spdk_thread per reactor and assign queues round-robin — same model |
| 22 | + we'd land on for multi-queue. |
| 23 | +- [`PhanLe1010/libublk-rs (longhorn branch)`](https://github.com/PhanLe1010/libublk-rs/tree/longhorn) — |
| 24 | + Rust PoC the Longhorn team built before settling on SPDK. Useful for |
| 25 | + comparing how the Rust ecosystem models the FETCH/COMMIT lifecycle. |
| 26 | +- [`ublk-org/ublksrv`](https://github.com/ublk-org/ublksrv) — the |
| 27 | + reference C implementation maintained alongside the kernel. Source of |
| 28 | + truth for kernel-feature interaction patterns (BATCH_IO, |
| 29 | + DEFER_TASKRUN, idle-buffer madvise, etc.). |
| 30 | + |
| 31 | +### Reference io_uring libraries (Go) |
| 32 | + |
| 33 | +We ship our own minimal `io_uring` wrapper in `ublk/uring/` rather than |
| 34 | +depending on a generic Go library. The reason is that ublk requires |
| 35 | +**128-byte SQEs (`IORING_SETUP_SQE128`)** and **`IORING_OP_URING_CMD` |
| 36 | +(opcode 46)** to carry the embedded `ublksrv_io_cmd` payload, and none |
| 37 | +of the popular Go ports expose both first-class. `semistrict/go-ublk` |
| 38 | +reached the same conclusion and also hand-rolled its own. |
| 39 | + |
| 40 | +That said, two Go ports are still worth using as **reference for ABI |
| 41 | +correctness** when binding new kernel features: |
| 42 | + |
| 43 | +- [`pawelgaczynski/giouring`](https://github.com/pawelgaczynski/giouring) |
| 44 | + — most complete pure-Go port of `liburing`, naming maps almost 1:1 to |
| 45 | + the C API. Best place to cross-check struct layouts and constants |
| 46 | + (`io_uring_buf_ring`, `Probe`, `RsrcRegister`, etc.) before we add |
| 47 | + things like zero-copy provided buffers. |
| 48 | +- [`godzie44/go-uring`](https://github.com/godzie44/go-uring) — closer |
| 49 | + to a `liburing` port with reactor pattern; useful reference for the |
| 50 | + Go memory-model discussion (`amd64_atomic` build tag, `seq_cst` vs |
| 51 | + `acq_rel` analysis) if we ever profile and find atomics on the SQ/CQ |
| 52 | + head/tail are a bottleneck. |
| 53 | + |
| 54 | +Other libraries we evaluated and ruled out for ublk: |
| 55 | +[`Iceber/iouring-go`](https://github.com/iceber/iouring-go) (most |
| 56 | +popular but networking-focused, 64-byte SQE only), |
| 57 | +[`ii64/gouring`](https://github.com/ii64/gouring) (also 64-byte SQE, |
| 58 | +no URING_CMD). |
| 59 | + |
| 60 | +**Action item:** whenever we touch `ublk/uring/uring.go` or the SQE128 |
| 61 | +layout, cross-check struct offsets and any new opcodes against |
| 62 | +`giouring`'s definitions to catch ABI drift. Cheap insurance. |
| 63 | + |
5 | 64 | ## Ungraceful-exit device leaks |
6 | 65 |
|
7 | 66 | When our test harnesses (or any user program using the library) is |
@@ -44,6 +103,98 @@ cleanup path may or may not be among them. |
44 | 103 | - [ ] Optionally: warn in `ublk.New` if `/sys/class/ublk-char/` is |
45 | 104 | approaching `ublks_max` (default 64). Gives users a signal before |
46 | 105 | `New()` starts failing silently due to accumulated orphans. |
| 106 | +- [ ] Expose a `ublk.CleanupOrphans()` (or `ublkctl` CLI) that scans |
| 107 | + `/sys/class/ublk-char/` and force-deletes leaked devices via |
| 108 | + `/dev/ublk-control` `DEL_DEV[_ASYNC]`. This is exactly the pain point |
| 109 | + Longhorn flags in |
| 110 | + [issue #10738](https://github.com/longhorn/longhorn/issues/10738): |
| 111 | + *"When instance manager pod crash, it might leave the orphan UBLK |
| 112 | + device. Currently, it is difficult to remove these orphan UBLK devices |
| 113 | + and sometime a reboot is needed."* — they have no userspace cleanup |
| 114 | + path either, so a small Go helper that wraps DEL_DEV by ID would be |
| 115 | + genuinely novel value vs. existing libraries (semistrict/SPDK don't |
| 116 | + ship this either). |
| 117 | + |
| 118 | +## Robustness / kernel-ABI hygiene |
| 119 | + |
| 120 | +### Compute ioctl-encoded opcodes from `_IOR`/`_IOWR` macros |
| 121 | + |
| 122 | +We currently hardcode the ioctl-encoded ublk command opcodes in |
| 123 | +`ublk/types.go` (`uCmdAddDev = 0xC0207504`, etc.). The constant `0x20` in |
| 124 | +each value is `sizeof(struct ublksrv_ctrl_cmd) = 32`, baked into the |
| 125 | +ioctl encoding. If the kernel ever extends `ublksrv_ctrl_cmd` (the |
| 126 | +explicit `Reserved1`/`Reserved2` fields exist precisely so it can grow), |
| 127 | +the size shifts and our hardcoded opcodes silently become wrong while |
| 128 | +the kernel returns `EINVAL`. |
| 129 | + |
| 130 | +Mirror the pattern used by [`semistrict/go-ublk`](https://github.com/semistrict/go-ublk) |
| 131 | +in [`const.go`](https://github.com/semistrict/go-ublk/blob/main/const.go): |
| 132 | +define `ioc(dir, type, nr, size)` and derive each `uCmd*` / `uIO*` |
| 133 | +opcode from `iowr(ublkType, nr, unsafe.Sizeof(ctrlCmd{}))`. Pure |
| 134 | +mechanical refactor; no behavior change today, future-proofs against |
| 135 | +struct extensions. |
| 136 | + |
| 137 | +### Handle non-read/non-write IO ops explicitly |
| 138 | + |
| 139 | +`worker.handleIO` currently returns `EOPNOTSUPP` for everything except |
| 140 | +`OP_READ`/`OP_WRITE`. We don't currently set `UBLK_ATTR_VOLATILE_CACHE` |
| 141 | +or `MaxDiscardSectors`, so the kernel block layer shouldn't emit |
| 142 | +`OP_FLUSH` / `OP_DISCARD` / `OP_WRITE_ZEROES` against us — but if a |
| 143 | +user touches the sysfs knobs (`echo 1 > .../discard_max_bytes`) or |
| 144 | +mounts a filesystem that issues unconditional flushes, they'll get |
| 145 | +unexplained EIO. Cheap fix when we add the optional backend interfaces: |
| 146 | + |
| 147 | +- `OP_FLUSH` (2) → `0` when no `Flusher`, otherwise call it |
| 148 | +- `OP_DISCARD` (3) → `0` (or call `Discarder` once exposed) |
| 149 | +- `OP_WRITE_ZEROES` (5) → `0` (or call `Zeroer`) |
| 150 | + |
| 151 | +Cross-ref: semistrict's `IOOp` enum in |
| 152 | +[`const.go`](https://github.com/semistrict/go-ublk/blob/main/const.go) |
| 153 | +and the `ReaderAtHandler` adapter in |
| 154 | +[`easy.go`](https://github.com/semistrict/go-ublk/blob/main/easy.go). |
| 155 | + |
| 156 | +### Async STOP_DEV pattern |
| 157 | + |
| 158 | +We currently issue STOP_DEV synchronously on the same control ring used |
| 159 | +for ADD/SET/DEL. That works because the kernel responds with the CQE |
| 160 | +once the worker thread releases the char fd (which we drive ourselves). |
| 161 | +Once we go multi-queue, the cleaner pattern (used by semistrict in |
| 162 | +[`ctrl.go`](https://github.com/semistrict/go-ublk/blob/main/ctrl.go)) is: |
| 163 | + |
| 164 | +1. Submit STOP_DEV without waiting → kernel injects ENODEV into pending |
| 165 | + FETCHes for every queue. |
| 166 | +2. Wait for all per-queue serve goroutines to drain. |
| 167 | +3. *Then* wait for the STOP_DEV CQE. |
| 168 | + |
| 169 | +Keeps the abort signal symmetric across queues and avoids relying on |
| 170 | +char-fd close to drive the abort path. |
| 171 | + |
| 172 | +### Friendlier error when `ublk_drv` isn't loaded |
| 173 | + |
| 174 | +Today, opening `/dev/ublk-control` on a host without the module |
| 175 | +returns a bare `ENOENT` from `unix.Open`, which surfaces as a |
| 176 | +confusing `"open /dev/ublk-control: no such file or directory"`. |
| 177 | +Detect this case in `openDevice` and wrap it with a hint: |
| 178 | + |
| 179 | +> ublk control device not found; load the kernel module with |
| 180 | +> `modprobe ublk_drv` (requires Linux 6.0+ and CAP_SYS_ADMIN) |
| 181 | +
|
| 182 | +Longhorn went so far as to ship dedicated CLI tooling |
| 183 | +([`longhornctl ... --enable-spdk`](https://github.com/longhorn/longhorn/issues/11803), |
| 184 | +[longhorn/cli PR #321](https://github.com/longhorn/cli/pull/321)) |
| 185 | +just to auto-`modprobe` this on cluster nodes. We can't (and |
| 186 | +shouldn't) call modprobe from a library, but a clear pointer in |
| 187 | +the error message saves users a Google search. |
| 188 | + |
| 189 | +### Safety review note for contributors |
| 190 | + |
| 191 | +Document the unsafe-pointer discipline in the package: any address |
| 192 | +embedded in a SQE (`cmd.Addr`, buffer pointers in IO commands) must be |
| 193 | +either heap-anchored through a long-lived field or pinned via |
| 194 | +`runtime.Pinner` for the duration of the syscall. The bug we just fixed |
| 195 | +in `addDev`/`setParams` is the canonical example. See |
| 196 | +[semistrict's `ctrl.go`](https://github.com/semistrict/go-ublk/blob/main/ctrl.go) |
| 197 | +which uses `runtime.Pinner` consistently. |
47 | 198 |
|
48 | 199 | ## Build |
49 | 200 |
|
@@ -82,6 +233,39 @@ type WriteZeroer interface { WriteZeroes(off, length int64) error } |
82 | 233 | Set `UBLK_ATTR_VOLATILE_CACHE` when backend implements `Flusher`, declare |
83 | 234 | `MaxDiscardSectors` when it implements `Discarder`, etc. |
84 | 235 |
|
| 236 | +Cross-ref: semistrict's [`ReaderAtHandler`](https://github.com/semistrict/go-ublk/blob/main/easy.go) |
| 237 | +already implements this exact pattern (`Flusher`, `Syncer`, `Discarder`, |
| 238 | +`Zeroer`) with optional-interface assertion. Worth mirroring the API |
| 239 | +shape so users have a familiar surface. |
| 240 | + |
| 241 | +### Diagnostic / introspection commands |
| 242 | + |
| 243 | +Wire up the read-only control commands so callers can inspect device |
| 244 | +state without round-tripping through sysfs: |
| 245 | + |
| 246 | +- `UBLK_CMD_GET_DEV_INFO` / `GET_DEV_INFO2` |
| 247 | +- `UBLK_CMD_GET_PARAMS` (we currently only have SET_PARAMS) |
| 248 | +- `UBLK_CMD_GET_QUEUE_AFFINITY` |
| 249 | +- `UBLK_CMD_GET_FEATURES` (lets the library detect kernel capability |
| 250 | + matrix at runtime instead of trial-and-error like our current |
| 251 | + ioctl-encoded → legacy fallback in `addDev`) |
| 252 | + |
| 253 | +Cross-ref: semistrict implements all of these in |
| 254 | +[`ctrl.go`](https://github.com/semistrict/go-ublk/blob/main/ctrl.go) and |
| 255 | +[`affinity.go`](https://github.com/semistrict/go-ublk/blob/main/affinity.go). |
| 256 | + |
| 257 | +### Update size / quiesce / try-stop |
| 258 | + |
| 259 | +Newer ublk control commands worth surfacing once we have a stable |
| 260 | +multi-queue base: |
| 261 | + |
| 262 | +- `UBLK_U_CMD_UPDATE_SIZE` (`UBLK_F_UPDATE_SIZE`) — live device resize. |
| 263 | +- `UBLK_U_CMD_QUIESCE_DEV` (`UBLK_F_QUIESCE`) — pause IO for live |
| 264 | + server upgrade. |
| 265 | +- `UBLK_U_CMD_TRY_STOP_DEV` (`UBLK_F_SAFE_STOP_DEV`) — only stop if no |
| 266 | + openers (avoids the documented "leaked fd → Close hangs forever" |
| 267 | + footgun without changing semantics the way DEL_DEV_ASYNC does). |
| 268 | + |
85 | 269 | ### User Copy Mode |
86 | 270 |
|
87 | 271 | Data transfer via `pread`/`pwrite` on the char device instead of mmap buffers. |
@@ -128,6 +312,7 @@ both submits SQEs and waits for the next CQE. The reference C implementation |
128 | 312 | (ublksrv) uses `io_uring_submit_and_wait_timeout()` which does exactly this. |
129 | 313 |
|
130 | 314 | Benefits: |
| 315 | + |
131 | 316 | - Saves one syscall per IO iteration (currently Submit → epoll_wait → two |
132 | 317 | kernel transitions) |
133 | 318 | - Explicitly processes io_uring task work via GETEVENTS (currently relies on |
|
0 commit comments