Commit 336afba

Merge pull request #46 from brauner/work
wishlist: update the document with a bunch of new in-progress things.
2 parents 0df7593 + 55503c9 commit 336afba

1 file changed: +108 -50 lines changed

README.md

Lines changed: 108 additions & 50 deletions
@@ -6,9 +6,114 @@ on this list as being implementation requests. Some of the ideas on this list
 are rather rough and unrefined. They serve as entry points for exploring the
 associated problem space.
 
-**When implementing ideas on this list or ideas inspired by this list please
-point that out explicitly and clearly in the associated patches and Cc
-`Christian Brauner <brauner (at) kernel (dot) org>`.**
+* **When implementing ideas on this list or ideas inspired by this list
+please point that out explicitly and clearly in the associated patches
+and Cc `Christian Brauner <brauner (at) kernel (dot) org>`.**
+
+* Move the item you are working on to the In-Progress section.
+Please add your GitHub handle or mail address to the issue so we can
+ping you.
+
+## In-Progress
+
+### Ability to put user xattrs on `S_IFSOCK` socket entrypoint inodes in the file system
+
+Currently, the kernel only allows extended attributes in the
+`user.*` namespace to be attached to directory and regular file
+inodes. It would be tremendously useful to allow them to be
+associated with socket inodes, too.
+
+**Use-Case:** There are two syslog RFCs in use today: RFC3164 and
+RFC5424. `glibc`'s `syslog()` API generates events close to the
+former, but there are programs which would like to generate the
+latter instead (as it supports structured logging). The two formats
+are not backwards compatible: a client sending RFC5424 messages to a
+server only understanding RFC3164 will cause an ugly mess. On Linux
+there's only a single `/dev/log` AF_UNIX/SOCK_DGRAM socket backing
+`syslog()`, which is used in a one-way, fire-and-forget style. This
+means that feature negotiation is not really possible within the
+protocol. Various tools bind mount the socket inode into `chroot()`
+and container environments, hence it would be fantastic to associate
+supported feature information directly with the inode (and thus
+outside of the protocol) to make it easy for clients to determine
+which features are spoken on a socket, in a way that survives bind
+mounts. The implementation idea is that syslog daemons
+implementing RFC5424 could simply set an xattr `user.rfc5424` to `1`
+(or something like that) on the socket inode, and clearly inform
+clients in a natural and simple way that they'd be happy to parse
+the newer format. Also see:
+https://github.com/systemd/systemd/issues/19251 – This idea could
+also be extended to other sockets and other protocols: by setting
+some extended attribute on socket inodes, services could advertise
+which protocols they support on them. For example D-Bus sockets
+could carry `user.dbus` set to `1`, and Varlink sockets
+`user.varlink` set to `1` and so on.
+
+### Support detached mounts with `pivot_root()`
+
+The new rootfs must currently refer to an attached mount. This restriction
+seems unnecessary. We should allow the new rootfs to refer to a detached
+mount.
+
+This will allow a service- or container manager to create a new rootfs as
+a detached, private mount that isn't exposed anywhere in the filesystem and
+then `pivot_root()` into it.
+
+Since `pivot_root()` only takes path arguments the new rootfs would need to
+be passed via `/proc/<pid>/fd/<nr>`. In the long run we should add a new
+`pivot_root()` syscall operating on file descriptors instead of paths.
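
The flow described above could look roughly like the sketch below. It assumes a kernel that has lifted the attached-mount restriction, which is exactly what this item proposes, so `pivot_root_detached()` is expected to fail on current kernels. The helper names are hypothetical, and the raw `syscall()` wrappers and fallback defines are only there to keep the example independent of the libc version.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers; the values match the Linux UAPI. */
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef AT_RECURSIVE
#define AT_RECURSIVE 0x8000
#endif

/* Build the /proc path through which a mount fd can be named as a path. */
int proc_fd_path(char *buf, size_t len, int fd)
{
	int n;

	assert(buf != NULL);
	n = snprintf(buf, len, "/proc/self/fd/%d", fd);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Clone the tree at `rootfs` into a detached, private mount. */
int open_detached_rootfs(const char *rootfs)
{
	assert(rootfs != NULL);
#ifdef SYS_open_tree
	return (int)syscall(SYS_open_tree, AT_FDCWD, rootfs,
			    OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC | AT_RECURSIVE);
#else
	errno = ENOSYS;
	return -1;
#endif
}

/* Pivot into the detached mount, stacking the old root on the same path. */
int pivot_root_detached(int fd_rootfs)
{
	char path[64];

	if (proc_fd_path(path, sizeof(path), fd_rootfs) < 0)
		return -1;
	return (int)syscall(SYS_pivot_root, path, path);
}
```

A caller would pass the descriptor from `open_detached_rootfs()` to `pivot_root_detached()`. A file-descriptor-based `pivot_root()` variant, as suggested for the long run, would make the `/proc` detour unnecessary.
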
+
+### Create mount namespace with custom rootfs via `open_tree()` and `fsmount()`
+
+Add `OPEN_TREE_NAMESPACE` flag to `open_tree()` and `FSMOUNT_NAMESPACE` flag
+to `fsmount()` that create a new mount namespace with the specified mount tree
+as the rootfs mounted on top of a copy of the real rootfs. These return a
+namespace file descriptor instead of a mount file descriptor.
+
+This allows `OPEN_TREE_NAMESPACE` to function as a combined
+`unshare(CLONE_NEWNS)` and `pivot_root()`.
+
+When creating containers the setup usually involves using `CLONE_NEWNS` via
+`clone3()` or `unshare()`. This copies the caller's complete mount namespace.
+The runtime will also assemble a new rootfs and then use `pivot_root()` to
+switch the old mount tree with the new rootfs. Afterward it will recursively
+unmount the old mount tree thereby getting rid of all mounts.
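
For comparison, the conventional sequence described above can be sketched as follows. This is one common way runtimes do it today, not code from any particular runtime. The function name is hypothetical, and the `pivot_root(".", ".")` idiom is the one documented in the `pivot_root(2)` manual page.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Classic rootfs switch: copy every mount, pivot, then throw the copies away. */
int switch_root_classic(const char *new_root)
{
	assert(new_root != NULL);

	/* Fail early, before paying for the namespace copy. */
	if (access(new_root, X_OK) < 0)
		return -1;

	/* 1. Copy the caller's complete mount namespace. */
	if (unshare(CLONE_NEWNS) < 0)
		return -1;

	/* 2. Keep later mount changes from propagating back to the parent. */
	if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0)
		return -1;

	/* 3. pivot_root() needs new_root to be a mount point: bind it onto itself. */
	if (mount(new_root, new_root, NULL, MS_BIND | MS_REC, NULL) < 0)
		return -1;
	if (chdir(new_root) < 0)
		return -1;

	/* 4. Swap the roots; the old root ends up stacked on the new one. */
	if (syscall(SYS_pivot_root, ".", ".") < 0)
		return -1;

	/* 5. Recursively detach the old tree, i.e. everything step 1 copied. */
	return umount2(".", MNT_DETACH);
}
```

Steps 1 and 5 are the wasteful pair: every mount copied in step 1 is torn down again in step 5.
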
+
+Copying all of these mounts only to get rid of them later is wasteful. With a
+large mount table and a system where thousands of containers are spawned in
+parallel this quickly becomes a bottleneck increasing contention on the
+semaphore.
+
+**Use-Case:** Container runtimes can create an extremely minimal rootfs
+directly:
+
+```c
+fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);
+```
+
+This creates a mount namespace where "wootwoot" has become the rootfs. The
+caller can `setns()` into this new mount namespace and assemble additional
+mounts without copying and destroying the entire parent mount table.
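
The follow-up step might look like the sketch below, assuming `fd_mntns` is a mount namespace file descriptor such as the one the proposed flag would return. The proposed flag itself is deliberately not used in the code. The helper name and the choice of mounting procfs as the "additional mount" are illustrative only.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/mount.h>

/*
 * Join the mount namespace behind fd_mntns and add one extra mount to it.
 * proc_dir is resolved inside the new namespace, e.g. "/proc".
 */
int enter_and_mount_proc(int fd_mntns, const char *proc_dir)
{
	assert(proc_dir != NULL);

	/* Switch the calling process into the prepared mount namespace. */
	if (setns(fd_mntns, CLONE_NEWNS) < 0)
		return -1;

	/* Assemble additional mounts on top of the minimal rootfs. */
	return mount("proc", proc_dir, "proc",
		     MS_NOSUID | MS_NODEV | MS_NOEXEC, NULL);
}
```
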
+
+### Query mount information via file descriptor with `statmount()`
+
+Extend `struct mnt_id_req` to accept a file descriptor and introduce
+`STATMOUNT_BY_FD` flag. When a valid fd is provided and `STATMOUNT_BY_FD`
+is set, `statmount()` returns mount info about the mount the fd is on.
+
+This works even for "unmounted" mounts (mounts that have been unmounted using
+`umount2(mnt, MNT_DETACH)`), if you have access to a file descriptor on that
+mount. These unmounted mounts will have no mountpoint and no valid mount
+namespace, so `STATMOUNT_MNT_POINT` and `STATMOUNT_MNT_NS_ID` are unset in
+`statmount.mask` for such mounts.
+
+**Use-Case:** Query mount information directly from a file descriptor without
+needing the mount ID, which is particularly useful for detached or unmounted
+mounts.
+
+---
+
+### TODO
 
 ### xattrs for pidfd
 
@@ -376,20 +481,6 @@ Namespace-able loop and block devices, usable inside user namespaces.
 **Use-Case:** Allow mounting images inside nspawn containers, and using
 RootImage= and friends in the systemd user manager.
 
-### Support detached mounts with `pivot_root()`
-
-The new rootfs must currently refer to an attached mount. This restriction
-seems unnecessary. We should allow the new rootfs to refer to a detached
-mount.
-
-This will allow a service- or container manager to create a new rootfs as
-a detached, private mount that isn't exposed anywhere in the filesystem and
-then `pivot_root()` into it.
-
-Since `pivot_root()` only takes path arguments the new rootfs would need to
-be passed via `/proc/<pid>/fd/<nr>`. In the long run we should add a new
-`pivot_root()` syscall operating on file descriptors instead of paths.
-
 ### Device cgroup guard to allow `mknod()` in non-initial userns
 
 If a container manager restricts its unprivileged (user namespaced)
@@ -532,39 +623,6 @@ in case the process dies and its PID is quickly recycled. (This
 assumes systemd can acquire a pidfd of the foreign process without
 races, for example via `SCM_PIDFD` and `SO_PEERPIDFD` or similar.)
 
-### Ability to put user xattrs on `S_IFSOCK` socket entrypoint inodes in the file system
-
-Currently, the kernel only allows extended attributes in the
-`user.*` namespace to be attached to directory and regular file
-inodes. It would be tremendously useful to allow them to be
-associated with socket inodes, too.
-
-**Use-Case:** There are two syslog RFCs in use today: RFC3164 and
-RFC5424. `glibc`'s `syslog()` API generates events close to the
-former, but there are programs which would like to generate the
-latter instead (as it supports structured logging). The two formats
-are not backwards compatible: a client sending RFC5424 messages to a
-server only understanding RFC3164 will cause an ugly mess. On Linux
-there's only a single `/dev/log` AF_UNIX/SOCK_DGRAM socket backing
-`syslog()`, which is used in a one-way, fire-and-forget style. This
-means that feature negotiation is not really possible within the
-protocol. Various tools bind mount the socket inode into `chroot()`
-and container environments, hence it would be fantastic to associate
-supported feature information directly with the inode (and thus
-outside of the protocol) to make it easy for clients to determine
-which features are spoken on a socket, in a way that survives bind
-mounts. The implementation idea is that syslog daemons
-implementing RFC5424 could simply set an xattr `user.rfc5424` to `1`
-(or something like that) on the socket inode, and clearly inform
-clients in a natural and simple way that they'd be happy to parse
-the newer format. Also see:
-https://github.com/systemd/systemd/issues/19251 – This idea could
-also be extended to other sockets and other protocols: by setting
-some extended attribute on socket inodes, services could advertise
-which protocols they support on them. For example D-Bus sockets
-could carry `user.dbus` set to `1`, and Varlink sockets
-`user.varlink` set to `1` and so on.
-
 ### Open thread-group leader via `pidfd_open()`
 
 Extend `pidfd_open()` to allow opening the thread-group leader based on the
