feature: PoC to mknod devices for user namespace containers#5138
feature: PoC to mknod devices for user namespace containers#5138everzakov wants to merge 1 commit intoopencontainers:mainfrom
Conversation
Signed-off-by: Efim Verzakov <efimverzakov@gmail.com>
cyphar
left a comment
There was a problem hiding this comment.
I don't think this is the right approach at all (see my comment in the issue you opened), but here are the most obvious issues I spotted. I'm really not a fan of this idea.
| destPath, err := securejoin.SecureJoin(rootfs, node.Path) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
|
|
||
| err = createDeviceNode(rootfs, node, false) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
|
|
||
| mountFile, err := os.OpenFile(destPath, unix.O_PATH|unix.O_CLOEXEC, 0) | ||
| if err != nil { | ||
| return nil, err | ||
| } |
There was a problem hiding this comment.
This is all insecure. You need to use mknodat with OpenInRoot and then after doing the re-open (again with OpenInRoot) you need to do VerifyInode. libpathrs has helpers for tne first bit.
| for _, node := range config.Devices { | ||
| node.Uid = uint32(rootUID) | ||
| node.Gid = uint32(rootGID) | ||
| hostUID, err := config.HostUID(int(node.Uid)) | ||
| if err == nil { | ||
| node.Uid = uint32(hostUID) | ||
| } | ||
| hostGID, err := config.HostGID(int(node.Gid)) | ||
| if err == nil { | ||
| node.Gid = uint32(hostGID) | ||
| } | ||
| } |
There was a problem hiding this comment.
Not sure about this in general, but in the err != nil case you need to use rootUID/rootGID otherwise the devices will be unmapped.
| if m.Device == "usernsMknod" { | ||
| // Create device in initial user ns | ||
| for _, device := range p.config.Config.Devices { | ||
| if device.Path == m.Source { | ||
| src, err = usernsMknod(p.config.Config.Rootfs, device) | ||
| break | ||
| } | ||
| } | ||
| if src == nil && err == nil { | ||
| err = fmt.Errorf("can not find device node") | ||
| } |
There was a problem hiding this comment.
A magical mount type really feels like the wrong way of doing this. It would also need a runtime-spec change.
| // Mount tmpfs in initial user ns and chown it | ||
| entry := mountEntry{Mount: m} | ||
| mountConfig := &mountConfig{ | ||
| root: p.config.Config.Rootfs, | ||
| label: p.config.Config.MountLabel, | ||
| rootlessCgroups: p.config.Config.RootlessCgroups, | ||
| cgroupns: p.config.Config.Namespaces.Contains(configs.NEWCGROUP), | ||
| } | ||
| err := mountToRootfs(mountConfig, entry) | ||
| if err == nil { | ||
| destPath, _ := securejoin.SecureJoin(mountConfig.root, m.Destination) | ||
| mountFile, err := os.OpenFile(destPath, unix.O_PATH|unix.O_CLOEXEC, 0) | ||
| uid, _ := p.config.Config.HostRootUID() | ||
| gid, _ := p.config.Config.HostRootGID() | ||
| err = sys.FchownFile(mountFile, uid, gid) | ||
| if err == nil { | ||
| src = &mountSource{ | ||
| file: mountFile, | ||
| Type: mountSourcePlain, | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
This should be done with fsopen, this is all very wrong and pollutes the host mount table AFAICS?
| destPath, _ := securejoin.SecureJoin(mountConfig.root, m.Destination) | ||
| mountFile, err := os.OpenFile(destPath, unix.O_PATH|unix.O_CLOEXEC, 0) |
There was a problem hiding this comment.
This is also insecure for the same reason as above, this should've been OpenInRoot. (But I think this code shouldn't exist.)
The pull request is a PoC to call mknod to create devices for user namespace containers. This feature removes the limitation that the device should be created before container creation.
The main idea is to call mknod in the initial user namespace and container mount namespace. Also, runc should chown the created device to set the right uid/gid in the user namespace.
The main Kernel restrictions for this solution:
To solve these problems we can reuse the existing mechanism - goCreateMountSources function https://github.com/opencontainers/runc/blob/v1.4.0/libcontainer/process_linux.go#L604
We will call mknod and mount tmpfs in the initial user namespace and container mount namespace. Also, we will chown them for the right uid/gid.
The main change for criu is that such devices won't be listed in the /proc/$pid/mountinfo (because we use mknod not bind).
Also, now i have some problems (Operation not permitted for remount) with restoring masked pathes (like /proc/kcore) that's why /dev/null is mount binded.
/dev before:
/dev after:
mountinfo before:
mountinfo after:
Current PoC limitations are all about criu:
Ps: now in my env i have some problems that dump / restore phases can be stuck but from restore log it should be ok. Maybe this is a misconfiguration in my env :=(.
CRIU changes: link.
Closes: #5137