Skip to content

Commit 170aefa

Browse files
authored
Pin control-command buffers; expand TODO with cross-references (#19)
## Summary - **Correctness fix in `ublk/device.go`:** wrap `&info` in `addDev` and `&params` in `setParams` with `runtime.Pinner` so the kernel-visible address embedded in `cmd.Addr` stays anchored across the io_uring submit/CQE window. The `unsafe.Pointer → uintptr` conversion makes the struct invisible to the GC, which is in principle free to move or reclaim it while the kernel is still reading. Mirrors the pinning discipline used by [`semistrict/go-ublk`](https://github.com/semistrict/go-ublk/blob/main/ctrl.go). - **Expanded `TODO.md`** with findings from a comparison pass against several reference implementations: - [`semistrict/go-ublk`](https://github.com/semistrict/go-ublk) (the only other Go ublk library) - [SPDK ublk target](https://spdk.io/doc/ublk.html) and [Longhorn](longhorn/longhorn#5159) (production users) - [`PhanLe1010/libublk-rs`](https://github.com/PhanLe1010/libublk-rs/tree/longhorn) (Longhorn's Rust PoC) - [`ublk-org/ublksrv`](https://github.com/ublk-org/ublksrv) (the C reference) - The major pure-Go io_uring libraries: [`giouring`](https://github.com/pawelgaczynski/giouring), [`go-uring`](https://github.com/godzie44/go-uring), [`iouring-go`](https://github.com/iceber/iouring-go), [`gouring`](https://github.com/ii64/gouring) — none expose SQE128 + URING_CMD first-class, which is why we ship our own minimal wrapper. - New TODO items captured from those comparisons: - `ublk.CleanupOrphans()` helper — exact pain point from [longhorn/longhorn#10738](longhorn/longhorn#10738) that no existing library ships. - Compute ioctl-encoded opcodes from `_IOWR` macros instead of hardcoding (future-proofs against `ublksrv_ctrl_cmd` extensions). - Explicit handling of `OP_FLUSH` / `OP_DISCARD` / `OP_WRITE_ZEROES` (currently `EOPNOTSUPP`). - Async `STOP_DEV` pattern for multi-queue. - Friendlier error when `ublk_drv` isn't loaded. - Diagnostic commands (`GET_DEV_INFO`, `GET_PARAMS`, `GET_QUEUE_AFFINITY`, `GET_FEATURES`). - Update-size / quiesce / try-stop control commands. - Safety-review note documenting the unsafe-pointer / `runtime.Pinner` discipline for contributors. ## Test plan - [ ] `go vet ./...` clean (verified locally) - [ ] `go build ./...` clean (verified locally) - [ ] Existing test matrix still green on a host with `ublk_drv` loaded - [ ] Spot-check that `addDev` and `setParams` still succeed under stress (the pinning is purely additive — same address goes into `cmd.Addr`, just with the GC told not to touch the backing memory)
1 parent 17424c5 commit 170aefa

2 files changed

Lines changed: 203 additions & 0 deletions

File tree

TODO.md

Lines changed: 185 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,65 @@
22

33
Features and optimizations to add on top of the current minimal implementation.
44

5+
## Reference implementations
6+
7+
When picking up any of the items in this document, cross-reference these
8+
ublk implementations:
9+
10+
- [`semistrict/go-ublk`](https://github.com/semistrict/go-ublk) — another
11+
pure-Go ublk library. Covers multi-queue, USER_COPY mode, zero-copy with
12+
auto-buf-reg, queue affinity pinning, GET_PARAMS / GET_FEATURES /
13+
GET_QUEUE_AFFINITY, and an "easy" handler that adapts `io.ReaderAt`
14+
plus optional `Flusher`/`Discarder`/`Zeroer` interfaces.
15+
- [SPDK ublk target](https://spdk.io/doc/ublk.html) — production C
16+
implementation used by [Longhorn v2 data
17+
engine](https://github.com/longhorn/longhorn/issues/9456). Validates
18+
our `runtime.LockOSThread`-per-queue model: SPDK explicitly notes the
19+
kernel constraint that "one ublk device queue can be only processed
20+
in the context of system thread which initialized it". They pin one
21+
spdk_thread per reactor and assign queues round-robin — same model
22+
we'd land on for multi-queue.
23+
- [`PhanLe1010/libublk-rs (longhorn branch)`](https://github.com/PhanLe1010/libublk-rs/tree/longhorn)
24+
Rust PoC the Longhorn team built before settling on SPDK. Useful for
25+
comparing how the Rust ecosystem models the FETCH/COMMIT lifecycle.
26+
- [`ublk-org/ublksrv`](https://github.com/ublk-org/ublksrv) — the
27+
reference C implementation maintained alongside the kernel. Source of
28+
truth for kernel-feature interaction patterns (BATCH_IO,
29+
DEFER_TASKRUN, idle-buffer madvise, etc.).
30+
31+
### Reference io_uring libraries (Go)
32+
33+
We ship our own minimal `io_uring` wrapper in `ublk/uring/` rather than
34+
depending on a generic Go library. The reason is that ublk requires
35+
**128-byte SQEs (`IORING_SETUP_SQE128`)** and **`IORING_OP_URING_CMD`
36+
(opcode 46)** to carry the embedded `ublksrv_io_cmd` payload, and none
37+
of the popular Go ports expose both first-class. `semistrict/go-ublk`
38+
reached the same conclusion and also hand-rolled its own.
39+
40+
That said, two Go ports are still worth using as **reference for ABI
41+
correctness** when binding new kernel features:
42+
43+
- [`pawelgaczynski/giouring`](https://github.com/pawelgaczynski/giouring)
44+
— most complete pure-Go port of `liburing`, naming maps almost 1:1 to
45+
the C API. Best place to cross-check struct layouts and constants
46+
(`io_uring_buf_ring`, `Probe`, `RsrcRegister`, etc.) before we add
47+
things like zero-copy provided buffers.
48+
- [`godzie44/go-uring`](https://github.com/godzie44/go-uring) — closer
49+
to a `liburing` port with reactor pattern; useful reference for the
50+
Go memory-model discussion (`amd64_atomic` build tag, `seq_cst` vs
51+
`acq_rel` analysis) if we ever profile and find atomics on the SQ/CQ
52+
head/tail are a bottleneck.
53+
54+
Other libraries we evaluated and ruled out for ublk:
55+
[`Iceber/iouring-go`](https://github.com/iceber/iouring-go) (most
56+
popular but networking-focused, 64-byte SQE only),
57+
[`ii64/gouring`](https://github.com/ii64/gouring) (also 64-byte SQE,
58+
no URING_CMD).
59+
60+
**Action item:** whenever we touch `ublk/uring/uring.go` or the SQE128
61+
layout, cross-check struct offsets and any new opcodes against
62+
`giouring`'s definitions to catch ABI drift. Cheap insurance.
63+
564
## Ungraceful-exit device leaks
665

766
When our test harnesses (or any user program using the library) is
@@ -44,6 +103,98 @@ cleanup path may or may not be among them.
44103
- [ ] Optionally: warn in `ublk.New` if `/sys/class/ublk-char/` is
45104
approaching `ublks_max` (default 64). Gives users a signal before
46105
`New()` starts failing silently due to accumulated orphans.
106+
- [ ] Expose a `ublk.CleanupOrphans()` (or `ublkctl` CLI) that scans
107+
`/sys/class/ublk-char/` and force-deletes leaked devices via
108+
`/dev/ublk-control` `DEL_DEV[_ASYNC]`. This is exactly the pain point
109+
Longhorn flags in
110+
[issue #10738](https://github.com/longhorn/longhorn/issues/10738):
111+
*"When instance manager pod crash, it might leave the orphan UBLK
112+
device. Currently, it is difficult to remove these orphan UBLK devices
113+
and sometime a reboot is needed."* — they have no userspace cleanup
114+
path either, so a small Go helper that wraps DEL_DEV by ID would be
115+
genuinely novel value vs. existing libraries (semistrict/SPDK don't
116+
ship this either).
117+
118+
## Robustness / kernel-ABI hygiene
119+
120+
### Compute ioctl-encoded opcodes from `_IOR`/`_IOWR` macros
121+
122+
We currently hardcode the ioctl-encoded ublk command opcodes in
123+
`ublk/types.go` (`uCmdAddDev = 0xC0207504`, etc.). The constant `0x20` in
124+
each value is `sizeof(struct ublksrv_ctrl_cmd) = 32`, baked into the
125+
ioctl encoding. If the kernel ever extends `ublksrv_ctrl_cmd` (the
126+
explicit `Reserved1`/`Reserved2` fields exist precisely so it can grow),
127+
the size shifts and our hardcoded opcodes silently become wrong while
128+
the kernel returns `EINVAL`.
129+
130+
Mirror the pattern used by [`semistrict/go-ublk`](https://github.com/semistrict/go-ublk)
131+
in [`const.go`](https://github.com/semistrict/go-ublk/blob/main/const.go):
132+
define `ioc(dir, type, nr, size)` and derive each `uCmd*` / `uIO*`
133+
opcode from `iowr(ublkType, nr, unsafe.Sizeof(ctrlCmd{}))`. Pure
134+
mechanical refactor; no behavior change today, future-proofs against
135+
struct extensions.
136+
137+
### Handle non-read/non-write IO ops explicitly
138+
139+
`worker.handleIO` currently returns `EOPNOTSUPP` for everything except
140+
`OP_READ`/`OP_WRITE`. We don't currently set `UBLK_ATTR_VOLATILE_CACHE`
141+
or `MaxDiscardSectors`, so the kernel block layer shouldn't emit
142+
`OP_FLUSH` / `OP_DISCARD` / `OP_WRITE_ZEROES` against us — but if a
143+
user touches the sysfs knobs (`echo 1 > .../discard_max_bytes`) or
144+
mounts a filesystem that issues unconditional flushes, they'll get
145+
unexplained EIO. Cheap fix when we add the optional backend interfaces:
146+
147+
- `OP_FLUSH` (2) → `0` when no `Flusher`, otherwise call it
148+
- `OP_DISCARD` (3) → `0` (or call `Discarder` once exposed)
149+
- `OP_WRITE_ZEROES` (5) → `0` (or call `Zeroer`)
150+
151+
Cross-ref: semistrict's `IOOp` enum in
152+
[`const.go`](https://github.com/semistrict/go-ublk/blob/main/const.go)
153+
and the `ReaderAtHandler` adapter in
154+
[`easy.go`](https://github.com/semistrict/go-ublk/blob/main/easy.go).
155+
156+
### Async STOP_DEV pattern
157+
158+
We currently issue STOP_DEV synchronously on the same control ring used
159+
for ADD/SET/DEL. That works because the kernel responds with the CQE
160+
once the worker thread releases the char fd (which we drive ourselves).
161+
Once we go multi-queue, the cleaner pattern (used by semistrict in
162+
[`ctrl.go`](https://github.com/semistrict/go-ublk/blob/main/ctrl.go)) is:
163+
164+
1. Submit STOP_DEV without waiting → kernel injects ENODEV into pending
165+
FETCHes for every queue.
166+
2. Wait for all per-queue serve goroutines to drain.
167+
3. *Then* wait for the STOP_DEV CQE.
168+
169+
Keeps the abort signal symmetric across queues and avoids relying on
170+
char-fd close to drive the abort path.
171+
172+
### Friendlier error when `ublk_drv` isn't loaded
173+
174+
Today, opening `/dev/ublk-control` on a host without the module
175+
returns a bare `ENOENT` from `unix.Open`, which surfaces as a
176+
confusing `"open /dev/ublk-control: no such file or directory"`.
177+
Detect this case in `openDevice` and wrap it with a hint:
178+
179+
> ublk control device not found; load the kernel module with
180+
> `modprobe ublk_drv` (requires Linux 6.0+ and CAP_SYS_ADMIN)
181+
182+
Longhorn went so far as to ship dedicated CLI tooling
183+
([`longhornctl ... --enable-spdk`](https://github.com/longhorn/longhorn/issues/11803),
184+
[longhorn/cli PR #321](https://github.com/longhorn/cli/pull/321))
185+
just to auto-`modprobe` this on cluster nodes. We can't (and
186+
shouldn't) call modprobe from a library, but a clear pointer in
187+
the error message saves users a Google search.
188+
189+
### Safety review note for contributors
190+
191+
Document the unsafe-pointer discipline in the package: any address
192+
embedded in a SQE (`cmd.Addr`, buffer pointers in IO commands) must be
193+
either heap-anchored through a long-lived field or pinned via
194+
`runtime.Pinner` for the duration of the syscall. The bug we just fixed
195+
in `addDev`/`setParams` is the canonical example. See
196+
[semistrict's `ctrl.go`](https://github.com/semistrict/go-ublk/blob/main/ctrl.go)
197+
which uses `runtime.Pinner` consistently.
47198

48199
## Build
49200

@@ -82,6 +233,39 @@ type WriteZeroer interface { WriteZeroes(off, length int64) error }
82233
Set `UBLK_ATTR_VOLATILE_CACHE` when backend implements `Flusher`, declare
83234
`MaxDiscardSectors` when it implements `Discarder`, etc.
84235

236+
Cross-ref: semistrict's [`ReaderAtHandler`](https://github.com/semistrict/go-ublk/blob/main/easy.go)
237+
already implements this exact pattern (`Flusher`, `Syncer`, `Discarder`,
238+
`Zeroer`) with optional-interface assertion. Worth mirroring the API
239+
shape so users have a familiar surface.
240+
241+
### Diagnostic / introspection commands
242+
243+
Wire up the read-only control commands so callers can inspect device
244+
state without round-tripping through sysfs:
245+
246+
- `UBLK_CMD_GET_DEV_INFO` / `GET_DEV_INFO2`
247+
- `UBLK_CMD_GET_PARAMS` (we currently only have SET_PARAMS)
248+
- `UBLK_CMD_GET_QUEUE_AFFINITY`
249+
- `UBLK_CMD_GET_FEATURES` (lets the library detect kernel capability
250+
matrix at runtime instead of trial-and-error like our current
251+
ioctl-encoded → legacy fallback in `addDev`)
252+
253+
Cross-ref: semistrict implements all of these in
254+
[`ctrl.go`](https://github.com/semistrict/go-ublk/blob/main/ctrl.go) and
255+
[`affinity.go`](https://github.com/semistrict/go-ublk/blob/main/affinity.go).
256+
257+
### Update size / quiesce / try-stop
258+
259+
Newer ublk control commands worth surfacing once we have a stable
260+
multi-queue base:
261+
262+
- `UBLK_U_CMD_UPDATE_SIZE` (`UBLK_F_UPDATE_SIZE`) — live device resize.
263+
- `UBLK_U_CMD_QUIESCE_DEV` (`UBLK_F_QUIESCE`) — pause IO for live
264+
server upgrade.
265+
- `UBLK_U_CMD_TRY_STOP_DEV` (`UBLK_F_SAFE_STOP_DEV`) — only stop if no
266+
openers (avoids the documented "leaked fd → Close hangs forever"
267+
footgun without changing semantics the way DEL_DEV_ASYNC does).
268+
85269
### User Copy Mode
86270

87271
Data transfer via `pread`/`pwrite` on the char device instead of mmap buffers.
@@ -128,6 +312,7 @@ both submits SQEs and waits for the next CQE. The reference C implementation
128312
(ublksrv) uses `io_uring_submit_and_wait_timeout()` which does exactly this.
129313

130314
Benefits:
315+
131316
- Saves one syscall per IO iteration (currently Submit → epoll_wait → two
132317
kernel transitions)
133318
- Explicitly processes io_uring task work via GETEVENTS (currently relies on

ublk/device.go

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,7 @@ package ublk
33
import (
44
"fmt"
55
"os"
6+
"runtime"
67
"sync"
78
"syscall"
89
"time"
@@ -57,6 +58,15 @@ func (d *Device) addDev(queues, depth uint16, maxIOBuf uint32) error {
5758
Flags: flagCmdIoctlEncode,
5859
}
5960

61+
// Pin info: the kernel reads from this address between Submit and
62+
// the CQE returning. Going through unsafe.Pointer→uintptr in cmd.Addr
63+
// loses pointer-ness from the GC's view, so without pinning the Go
64+
// runtime would be free (in theory) to move or collect the backing
65+
// memory while the kernel is still reading from it.
66+
var pinner runtime.Pinner
67+
pinner.Pin(&info)
68+
defer pinner.Unpin()
69+
6070
cmd := ctrlCmd{
6171
DevID: ^uint32(0),
6272
QueueID: ^uint16(0),
@@ -114,6 +124,14 @@ func (d *Device) setParams(size uint64, maxSectors uint32) error {
114124
},
115125
}
116126

127+
// Pin params: same rationale as in addDev. The kernel reads from
128+
// the address embedded in cmd.Addr between Submit and the CQE
129+
// returning; the uintptr conversion makes the pointer invisible to
130+
// the GC, so we must keep the backing memory anchored explicitly.
131+
var pinner runtime.Pinner
132+
pinner.Pin(&params)
133+
defer pinner.Unpin()
134+
117135
cmd := ctrlCmd{
118136
DevID: uint32(d.id),
119137
QueueID: ^uint16(0),

0 commit comments

Comments
 (0)