@@ -6,9 +6,114 @@ on this list as being implementation requests. Some of the ideas on this list
6 6 are rather rough and unrefined. They serve as entry points for exploring the
7 7 associated problem space.
8 8
9- ** When implementing ideas on this list or ideas inspired by this list please
10- point that out explicitly and clearly in the associated patches and Cc
11- ` Christian Brauner <brauner (at) kernel (dot) org ` .**
9+ * ** When implementing ideas on this list or ideas inspired by this list
10+ please point that out explicitly and clearly in the associated patches
11+ and Cc ` Christian Brauner <brauner (at) kernel (dot) org> ` .**
12+
13+ * Move the item you are working on to the In-Progress section.
14+ Please add your GitHub handle or email address to the issue so we can
15+ ping you.
16+
17+ ## In-Progress
18+
19+ ### Ability to put user xattrs on ` S_IFSOCK ` socket entrypoint inodes in the file system
20+
21+ Currently, the kernel only allows extended attributes in the
22+ ` user.* ` namespace to be attached to directory and regular file
23+ inodes. It would be tremendously useful to allow them to be
24+ associated with socket inodes, too.
25+
26+ ** Usecase:** There are two syslog RFCs in use today: RFC3164 and
27+ RFC5424. ` glibc ` 's ` syslog() ` API generates events close to the
28+ former, but there are programs which would like to generate the
29+ latter instead (as it supports structured logging). The two formats
30+ are not backwards compatible: a client sending RFC5424 messages to a
31+ server only understanding RFC3164 will cause an ugly mess. On Linux
32+ there's only a single ` /dev/log ` AF_UNIX/SOCK_DGRAM socket backing
33+ ` syslog() ` , which is used in a one-way, fire-and-forget style. This
34+ means that feature negotiation is not really possible within the
35+ protocol. Various tools bind mount the socket inode into ` chroot() `
36+ and container environments, hence it would be fantastic to associate
37+ supported feature information directly with the inode (and thus
38+ outside of the protocol) to make it easy for clients to determine
39+ which features are spoken on a socket, in a way that survives bind
40+ mounts. One implementation idea is that syslog daemons
41+ implementing RFC5424 could simply set an xattr ` user.rfc5424 ` to ` 1 `
42+ (or something like that) on the socket inode, and clearly inform
43+ clients in a natural and simple way that they'd be happy to parse
44+ the newer format. Also see:
45+ https://github.com/systemd/systemd/issues/19251 – This idea could
46+ also be extended to other sockets and other protocols: by setting
47+ some extended attribute on socket inodes, services could advertise
48+ which protocols they support on them. For example D-Bus sockets
49+ could carry ` user.dbus ` set to ` 1 ` , and Varlink sockets
50+ ` user.varlink ` set to ` 1 ` and so on.
51+
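A client-side check for such an attribute could look roughly like the sketch below. It uses only the existing `getxattr()` call; the helper name `socket_advertises()` is hypothetical, and the attribute name `user.rfc5424` with the value `1` is simply the example from the text above, not an agreed-upon convention. A missing attribute is treated as "feature not advertised".

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/xattr.h>

/*
 * Hypothetical helper: returns 1 if the xattr `name` on `path` is set to
 * the single character "1", 0 if the attribute is absent, unsupported, or
 * has another value, and -1 on any other error (errno is left set).
 */
int socket_advertises(const char *path, const char *name)
{
    char value[16];
    ssize_t n = getxattr(path, name, value, sizeof(value));

    if (n < 0) {
        /* No such attribute, or xattrs unsupported here: not advertised. */
        if (errno == ENODATA || errno == ENOTSUP)
            return 0;
        return -1;
    }

    return n == 1 && value[0] == '1';
}
```

A syslog client might call `socket_advertises("/dev/log", "user.rfc5424")` and fall back to the RFC3164 format whenever the result is not `1`.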
52+ ### Support detached mounts with ` pivot_root() `
53+
54+ The new rootfs must currently refer to an attached mount. This restriction
55+ seems unnecessary. We should allow the new rootfs to refer to a detached
56+ mount.
57+
58+ This will allow a service- or container manager to create a new rootfs as
59+ a detached, private mount that isn't exposed anywhere in the filesystem and
60+ then ` pivot_root() ` into it.
61+
62+ Since ` pivot_root() ` only takes path arguments the new rootfs would need to
63+ be passed via ` /proc/<pid>/fd/<nr> ` . In the long run we should add a new
64+ ` pivot_root() ` syscall operating on file descriptors instead of paths.
65+
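The path-based workaround described above can be sketched as follows. This is only an illustration: the helper names `fd_to_proc_path()` and `pivot_root_fd()` are hypothetical, `/proc/self/fd/<nr>` is used as the calling process's own form of `/proc/<pid>/fd/<nr>`, and whether the kernel accepts a detached mount as the new root depends on the restriction discussed here being lifted.

```c
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Format "/proc/self/fd/<fd>" into buf; returns 0 on success, -1 if it does not fit. */
int fd_to_proc_path(int fd, char *buf, size_t len)
{
    int n = snprintf(buf, len, "/proc/self/fd/%d", fd);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical wrapper: pivot_root() into the mount referred to by
 * fd_new_root by passing it as a /proc path. glibc has no pivot_root()
 * wrapper, so the raw syscall is used.
 */
int pivot_root_fd(int fd_new_root, const char *put_old)
{
    char path[64];

    if (fd_to_proc_path(fd_new_root, path, sizeof(path)) < 0)
        return -1;

    return (int)syscall(SYS_pivot_root, path, put_old);
}
```

A file-descriptor-based variant of the syscall would make the `/proc` indirection unnecessary.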
66+ ### Create mount namespace with custom rootfs via ` open_tree() ` and ` fsmount() `
67+
68+ Add an ` OPEN_TREE_NAMESPACE ` flag to ` open_tree() ` and an ` FSMOUNT_NAMESPACE ` flag
69+ to ` fsmount() ` that create a new mount namespace with the specified mount tree
70+ as the rootfs mounted on top of a copy of the real rootfs. These return a
71+ namespace file descriptor instead of a mount file descriptor.
72+
73+ This allows ` OPEN_TREE_NAMESPACE ` to function as a combined
74+ ` unshare(CLONE_NEWNS) ` and ` pivot_root() ` .
75+
76+ When creating containers the setup usually involves using ` CLONE_NEWNS ` via
77+ ` clone3() ` or ` unshare() ` . This copies the caller's complete mount namespace.
78+ The runtime will also assemble a new rootfs and then use ` pivot_root() ` to
79+ switch the old mount tree with the new rootfs. Afterward it will recursively
80+ unmount the old mount tree thereby getting rid of all mounts.
81+
82+ Copying all of these mounts only to get rid of them later is wasteful. With a
83+ large mount table and a system where thousands of containers are spawned in
84+ parallel this quickly becomes a bottleneck increasing contention on the
85+ namespace semaphore.
86+
87+ ** Use-Case:** Container runtimes can create an extremely minimal rootfs
88+ directly:
89+
90+ ``` c
91+ fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);
92+ ```
93+
94+ This creates a mount namespace where "wootwoot" has become the rootfs. The
95+ caller can ` setns() ` into this new mount namespace and assemble additional
96+ mounts without copying and destroying the entire parent mount table.
97+
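Combining the `open_tree()` call with the `setns()` step might look like the sketch below. The function name `enter_rootfs_namespace()` is hypothetical. Because the value of `OPEN_TREE_NAMESPACE` is not assumed here, the code only issues the call when the installed kernel headers define the flag and otherwise fails with `ENOSYS`. Raw syscalls are used to avoid depending on libc wrappers.

```c
#include <errno.h>
#include <linux/mount.h>   /* OPEN_TREE_* flags, if the headers provide them */
#include <linux/sched.h>   /* CLONE_NEWNS */
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a mount namespace whose rootfs is `rootfs`
 * and switch the caller into it. Returns 0 on success, -1 on error.
 */
int enter_rootfs_namespace(const char *rootfs)
{
#if defined(SYS_open_tree) && defined(OPEN_TREE_NAMESPACE)
    /* Returns a namespace fd rather than a mount fd, as proposed above. */
    int fd_mntns = (int)syscall(SYS_open_tree, -EBADF, rootfs,
                                OPEN_TREE_NAMESPACE);
    if (fd_mntns < 0)
        return -1;

    /* Join the new mount namespace; additional mounts can be set up afterward. */
    int ret = (int)syscall(SYS_setns, fd_mntns, CLONE_NEWNS);
    close(fd_mntns);
    return ret;
#else
    /* Headers without the proposed flag: report the feature as unavailable. */
    (void)rootfs;
    errno = ENOSYS;
    return -1;
#endif
}
```

Compared with `unshare(CLONE_NEWNS)` followed by `pivot_root()`, no copy of the caller's mount table is created and then torn down.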
98+ ### Query mount information via file descriptor with ` statmount() `
99+
100+ Extend ` struct mnt_id_req ` to accept a file descriptor and introduce a
101+ ` STATMOUNT_BY_FD ` flag. When a valid fd is provided and ` STATMOUNT_BY_FD `
102+ is set, ` statmount() ` returns information about the mount the fd resides on.
103+
104+ This works even for "unmounted" mounts (mounts that have been unmounted using
105+ ` umount2(mnt, MNT_DETACH) ` ), if you have access to a file descriptor on that
106+ mount. These unmounted mounts will have no mountpoint and no valid mount
107+ namespace, so ` STATMOUNT_MNT_POINT ` and ` STATMOUNT_MNT_NS_ID ` are unset in
108+ ` statmount.mask ` for such mounts.
109+
110+ ** Use-Case:** Query mount information directly from a file descriptor without
111+ needing the mount ID, which is particularly useful for detached or unmounted
112+ mounts.
113+
114+ ---
115+
116+ ### TODO
12 117
13 118 ### xattrs for pidfd
14 119
@@ -376,20 +481,6 @@ Namespace-able loop and block devices, usable inside user namespaces.
376 481 ** Use-Case:** Allow mounting images inside nspawn containers, and using
377 482 RootImage= and friends in the systemd user manager.
378 483
379- ### Support detached mounts with ` pivot_root() `
380-
381- The new rootfs must currently refer to an attached mount. This restriction
382- seems unnecessary. We should allow the new rootfs to refer to a detached
383- mount.
384-
385- This will allow a service- or container manager to create a new rootfs as
386- a detached, private mount that isn't exposed anywhere in the filesystem and
387- then ` pivot_root() ` into it.
388-
389- Since ` pivot_root() ` only takes path arguments the new rootfs would need to
390- be passed via ` /proc/<pid>/fd/<nr> ` . In the long run we should add a new
391- ` pivot_root() ` syscall operating on file descriptors instead of paths.
392-
393 484 ### Device cgroup guard to allow ` mknod() ` in non-initial userns
394 485
395 486 If a container manager restricts its unprivileged (user namespaced)
@@ -532,39 +623,6 @@ in case the process dies and its PID is quickly recycled. (This
532 623 assumes systemd can acquire a pidfd of the foreign process without
533 624 races, for example via ` SCM_PIDFD ` and ` SO_PEERPIDFD ` or similar.)
534 625
535- ### Ability to put user xattrs on ` S_IFSOCK ` socket entrypoint inodes in the file system
536-
537- Currently, the kernel only allows extended attributes in the
538- ` user.* ` namespace to be attached to directory and regular file
539- inodes. It would be tremendously useful to allow them to be
540- associated with socket inodes, too.
541-
542- ** Usecase:** There are two syslog RFCs in use today: RFC3164 and
543- RFC5424. ` glibc ` 's ` syslog() ` API generates events close to the
544- former, but there are programs which would like to generate the
545- latter instead (as it supports structured logging). The two formats
546- are not backwards compatible: a client sending RFC5424 messages to a
547- server only understanding RFC3164 will cause an ugly mess. On Linux
548- there's only a single ` /dev/log ` AF_UNIX/SOCK_DGRAM socket backing
549- ` syslog() ` , which is used in a one-way, fire-and-forget style. This
550- means that feature negotation is not really possible within the
551- protocol. Various tools bind mount the socket inode into ` chroot() `
552- and container environments, hence it would be fantastic to associate
553- supported feature information directly with the inode (and thus
554- outside of the protocol) to make it easy for clients to determine
555- which features are spoken on a socket, in a way that survives bind
556- mounts. Implementation idea would be that syslog daemons
557- implementing RFC5425 could simply set an xattr ` user.rfc5424 ` to ` 1 `
558- (or something like that) on the socket inode, and clearly inform
559- clients in a natural and simple way that they'd be happy to parse
560- the newer format. Also see:
561- https://github.com/systemd/systemd/issues/19251 – This idea could
562- also be extended to other sockets and other protocols: by setting
563- some extended attribute on a socket inodes, services could advertise
564- which protocols they support on them. For example D-Bus sockets
565- could carry ` user.dbus ` set to ` 1 ` , and Varlink sockets
566- ` user.varlink ` set to ` 1 ` and so on.
567-
568 626 ### Open thread-group leader via ` pidfd_open() `
569 627
570 628 Extend ` pidfd_open() ` to allow opening the thread-group leader based on the