Hi @AkihiroSuda - I'm trying to get the setup working on an actual HPC cluster, and I build the Podman container as follows (it doesn't work with `podman compose build`):
```bash
podman build --userns-uid-map=0:0:1 --userns-uid-map=1:1:1999 --userns-uid-map=65534:2000:2 \
  -f ./Dockerfile -t usernetes_node .
```
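(For what it's worth, this is roughly how I've been double-checking what mapping the container actually receives at runtime. Note that `podman run` takes `--uidmap`/`--gidmap` rather than the build-time `--userns-uid-map` flags; the image tag is just the one from the build above, and the `--gidmap` triplets are my guess at what the GID side should look like:)

```bash
# Sanity check: print the UID/GID maps as seen from inside the container.
# The --uidmap/--gidmap triplets mirror the build-time mapping above;
# an empty or narrower gid_map would be suspicious.
podman run --rm \
  --uidmap 0:0:1 --uidmap 1:1:1999 --uidmap 65534:2000:2 \
  --gidmap 0:0:1 --gidmap 1:1:1999 --gidmap 65534:2000:2 \
  usernetes_node cat /proc/self/uid_map /proc/self/gid_map
```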
That builds and brings up the container OK, but when I run kubeadm-init, it times out, and this is what I see in the logs:
```
Mar 04 16:05:15 u7s-corona173 kubelet[1183]: E0304 16:05:15.675877 1183 kuberuntime_sandbox.go:72] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to setup user: setgroups: invalid argument: unknown" pod="kube-system/etcd-u7s-corona173"
Mar 04 16:05:15 u7s-corona173 kubelet[1183]: E0304 16:05:15.675911 1183 kuberuntime_manager.go:1166] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to setup user: setgroups: invalid argument: unknown" pod="kube-system/etcd-u7s-corona173"
Mar 04 16:05:15 u7s-corona173 kubelet[1183]: E0304 16:05:15.675971 1183 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"etcd-u7s-corona173_kube-system(5890a635964013b0836c119ab878b4ac)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"etcd-u7s-corona173_kube-system(5890a635964013b0836c119ab878b4ac)\\\": rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to setup user: setgroups: invalid argument: unknown\"" pod="kube-system/etcd-u7s-corona173" podUID="5890a635964013b0836c119ab878b4ac"
Mar 04 16:05:15 u7s-corona173 kubelet[1183]: E0304 16:05:15.971510 1183 event.go:368] "Unable to write event (may retry after sleeping)" err="Post \"https://u7s-corona173:6443/api/v1/namespaces/default/events\": dial tcp 10.100.171.100:6443: connect: connection refused" event="&Event{ObjectMeta:{u7s-corona173.1829a50dbdf86d87 default 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Node,Namespace:,Name:u7s-corona173,UID:u7s-corona173,APIVersion:,ResourceVersion:,FieldPath:,},Reason:Starting,Message:Starting kubelet.,Source:EventSource{Component:kubelet,Host:u7s-corona173,},FirstTimestamp:2025-03-04 16:03:29.395740039 +0000 UTC m=+0.316362060,LastTimestamp:2025-03-04 16:03:29.395740039 +0000 UTC m=+0.316362060,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:kubelet,ReportingInstance:u7s-corona173,}"
Mar 04 16:05:15 u7s-corona173 kubelet[1183]: E0304 16:05:15.971611 1183 event.go:307] "Unable to write event (retry limit exceeded!)" event="&Event{ObjectMeta:{u7s-corona173.1829a50dbdf86d87 default 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Node,Namespace:,Name:u7s-corona173,UID:u7s-corona173,APIVersion:,ResourceVersion:,FieldPath:,},Reason:Starting,Message:Starting kubelet.,Source:EventSource{Component:kubelet,Host:u7s-corona173,},FirstTimestamp:2025-03-04 16:03:29.395740039 +0000 UTC m=+0.316362060,LastTimestamp:2025-03-04 16:03:29.395740039 +0000 UTC m=+0.316362060,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:kubelet,ReportingInstance:u7s-corona173,}"
Mar 04 16:05:15 u7s-corona173 kubelet[1183]: E0304 16:05:15.971944 1183 event.go:368] "Unable to write event (may retry after sleeping)" err="Post \"https://u7s-corona173:6443/api/v1/namespaces/default/events\": dial tcp 10.100.171.100:6443: connect: connection refused" event="&Event{ObjectMeta:{u7s-corona173.1829a50dbe1ad4d4 default 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Node,Namespace:,Name:u7s-corona173,UID:u7s-corona173,APIVersion:,ResourceVersion:,FieldPath:,},Reason:InvalidDiskCapacity,Message:invalid capacity 0 on image filesystem,Source:EventSource{Component:kubelet,Host:u7s-corona173,},FirstTimestamp:2025-03-04 16:03:29.397994708 +0000 UTC m=+0.318616729,LastTimestamp:2025-03-04 16:03:29.397994708 +0000 UTC m=+0.318616729,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:kubelet,ReportingInstance:u7s-corona173,}"
```
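From what I understand, `setgroups: invalid argument` (EINVAL) usually means runc was asked to set a supplementary group whose GID isn't covered by the user namespace's gid_map. Here's a rough sketch of how I've been checking that from a shell inside the node container (plain procfs reads, nothing Usernetes-specific):

```bash
# Compare the UID and GID maps of this user namespace.
# Each line reads: <id-inside-ns> <id-outside-ns> <range-length>.
cat /proc/self/uid_map
cat /proc/self/gid_map

# Supplementary groups held by this shell; any GID listed here that
# falls outside gid_map is a candidate for the setgroups EINVAL.
grep '^Groups:' /proc/self/status
```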
Also note that the kernel is slightly old, so I ignore preflight errors:
```
Linux u7s-corona173 4.18.0-553.34.1.1toss.t4.x86_64 #1 SMP Mon Jan 13 14:19:40 PST 2025 x86_64 GNU/Linux
```
I'm trying to distinguish UID-mapping errors from issues that aren't resolvable because of the kernel version. I tried removing the Kubernetes part and just doing a basic pull with crictl, and got more insight into the "unknown" error:
"lchown /var/lib/containerd/tmpmounts/containerd-mount192687706/home: invalid argument (Hint: try increasing the number of subordinate IDs in /etc/subuid and /etc/subgid): unknown"
Do you have any insights or suggestions? Sorry for asking for help so much; I feel a bit alone working on this. I also tested nerdctl, but I seem to have less control over UID mappings there - as far as I can tell I can only use `--user` (see the quick check below). Then I hit trouble with the higher IDs, and on a cluster with a ton of users it's non-trivial to get more IDs allocated to me. Thank you!
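(Quick check mentioned above - as far as I can tell, rootless nerdctl/containerd just take the whole subuid/subgid range via RootlessKit, with no per-container remap flag, which is why `--user` is all I have:)

```bash
# Show the mapping a rootless nerdctl container actually gets; it is
# derived from /etc/subuid and /etc/subgid through RootlessKit.
nerdctl run --rm alpine cat /proc/self/uid_map /proc/self/gid_map
```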