Commit 6fc5349

kep-127: address some review comments
Signed-off-by: Giuseppe Scrivano <[email protected]>
1 parent 2de2a32 commit 6fc5349

1 file changed, +75 -26 lines changed

keps/sig-node/127-user-namespaces/README.md

+75-26
@@ -126,8 +126,8 @@ Here we use UIDs, but the same applies for GIDs.
 inside the container to different IDs in the host. In particular, mapping root
 inside the container to unprivileged user and group IDs in the node.
 - Increase pod to pod isolation by allowing to use non-overlapping mappings
-(UIDs/GIDs) whenever possible. IOW, if two containers runs as user X, they run
-as different UIDs in the node and therefore are more isolated than today.
+(UIDs/GIDs) whenever possible. In other words: if two containers runs as user
+X, they run as different UIDs in the node and therefore are more isolated than today.
 - Allow pods to have capabilities (e.g. `CAP_SYS_ADMIN`) that are only valid in
 the pod (not valid in the host).
 - Benefit from the security hardening that user namespaces provide against some
@@ -288,10 +288,47 @@ message Mount {
 }
 ```

+The CRI runtime reports what runtime handlers have support for user
+namespaces through the `StatusResponse` message, that gains a new
+field `runtime_handlers`:
+
+```
+message StatusResponse {
+    // Status of the Runtime.
+    RuntimeStatus status = 1;
+    // Info is extra information of the Runtime. The key could be arbitrary string, and
+    // value should be in json format. The information could include anything useful for
+    // debug, e.g. plugins used by the container runtime.
+    // It should only be returned non-empty when Verbose is true.
+    map<string, string> info = 2;
+
+    // Runtime handlers.
+    repeated RuntimeHandler runtime_handlers = 3;
+}
+```
+
+Where RuntimeHandler is defined as below:
+
+```
+message RuntimeHandlerFeatures {
+    // supports_user_namespaces is set to true if the runtime handler supports
+    // user namespaces.
+    bool supports_user_namespaces = 1;
+}
+
+message RuntimeHandler {
+    // Name must be unique in StatusResponse.
+    // An empty string denotes the default handler.
+    string name = 1;
+    // Supported features.
+    RuntimeHandlerFeatures features = 2;
+}
+```
+
 ### Support for pods

 Make pods work with user namespaces. This is activated via the
-bool `pod.spec.HostUsers`.
+bool `pod.spec.hostUsers`.

 The mapping length will be 65536, mapping the range 0-65535 to the pod. This wide
 range makes sure most workloads will work fine. Additionally, we don't need to
@@ -403,7 +440,7 @@ If the pod wants to read who is the owner of file `/vol/configmap/foo`, now it
 will see the owner is root inside the container. This is due to the IDs
 transformations that the idmap mount does for us.

-In other words, we can make sure the pod can read files instead of chowning them
+In other words: we can make sure the pod can read files instead of chowning them
 all using the host IDs the pod is mapped to, by just using an idmap mount that
 has the same mapping that we use for the pod user namespace.

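Note on the hunk above: the ID transformation can be modelled as applying one mapping in two directions. The Go sketch below is a simplified model, not kernel or kubelet code. `IDMapping` and its methods are hypothetical names, and the host range start 131072 is an arbitrary illustrative value.

```go
package main

import "fmt"

// IDMapping is a hypothetical type for this sketch: one contiguous range that
// maps container IDs [ContainerID, ContainerID+Length) to host IDs
// [HostID, HostID+Length).
type IDMapping struct {
	ContainerID uint32
	HostID      uint32
	Length      uint32
}

// ToHost translates an ID as seen inside the container to a host ID.
// The second return value is false when the ID falls outside the mapping.
func (m IDMapping) ToHost(id uint32) (uint32, bool) {
	if id < m.ContainerID || id-m.ContainerID >= m.Length {
		return 0, false
	}
	return m.HostID + (id - m.ContainerID), true
}

// ToContainer is the inverse translation, from a host ID to the ID seen
// inside the container.
func (m IDMapping) ToContainer(id uint32) (uint32, bool) {
	if id < m.HostID || id-m.HostID >= m.Length {
		return 0, false
	}
	return m.ContainerID + (id - m.HostID), true
}

func main() {
	// The same mapping is assumed for the idmap mount and the pod user namespace.
	m := IDMapping{ContainerID: 0, HostID: 131072, Length: 65536}

	onDisk := uint32(0)                  // file owned by root on the host filesystem
	viaMount, _ := m.ToHost(onDisk)      // owner as presented through the idmap mount
	inPod, _ := m.ToContainer(viaMount)  // owner as seen inside the pod
	fmt.Println(onDisk, viaMount, inPod) // the pod sees the original owner again
}
```

Because both steps use the same mapping, they cancel out, which is why no chown of the volume is needed in this model.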
@@ -469,7 +506,7 @@ something else to this list:
 - What about windows or VM container runtimes, that don't use linux namespaces?
 We need a review from windows maintainers once we have a more clear proposal.
 We can then adjust the needed details, we don't expect the changes (if any) to be big.
-IOW, in my head this looks like this: we merge this KEP in provisional state if
+In my head this looks like this: we merge this KEP in provisional state if
 we agree on the high level idea, with @giuseppe we do a PoC so we can fill-in
 more details to the KEP (like CRI changes, changes to container runtimes, how to
 configure kubelet ranges, etc.), and then the Windows folks can review and we
@@ -593,6 +630,7 @@ use container runtime versions that have the needed changes.

 - Gather and address feedback from the community
 - Be able to configure UID/GID ranges to use for pods
+- This feature is not supported on Windows.
 - Get review from VM container runtimes maintainers (not blocker, as VM runtimes should just ignore
 the field, but nice to have)

@@ -603,6 +641,15 @@ use container runtime versions that have the needed changes.

 ### Upgrade / Downgrade Strategy

+Existing pods will still work as intended, as the new field is missing there.
+
+Upgrade will not change any current behaviors.
+
+When the new functionality wasn't yet used, downgrade will not be affected.
+
+Versions of Kubernetes that doesn't have this feature implemented will
+ignore and strip out the new field `pod.spec.hostUsers`.
+
 ### Version Skew Strategy

 <!--
@@ -635,11 +682,12 @@ doesn't create them. The runtime can detect this situation as the `user` field
 in the `NamespaceOption` will be seen as nil, [thanks to
 protobuf][proto3-defaults]. We already tested this with real code.

-Old runtime and new kubelet: containers are created without userns. As the
-`user` field of the `NamespaceOption` message is not part of the runtime
-protofiles, that part is ignored by the runtime and pods are created using the
-host userns.
+Old runtime and new kubelet: the runtime won't report that it supports
+user namespaces through the `StatusResponse` message, so the kubelet
+will detect it and fail when such a request is done.

+We added unit tests for the feature gate disabled, and integration
+tests for the feature gate enabled and disabled.

 [proto3-defaults]: https://developers.google.com/protocol-buffers/docs/proto3#default

@@ -686,7 +734,7 @@ well as the [existing list] of feature gates.
 -->

 - [x] Feature gate (also fill in values in `kep.yaml`)
-- Feature gate name: UserNamespacesPodsSupport
+- Feature gate name: UserNamespacesSupport
 - Components depending on the feature gate: kubelet, kube-apiserver

 ###### Does enabling the feature change any default behavior?
@@ -733,7 +781,7 @@ Pods will have to be re-created to use the feature.

 We will add.

-We will test for when the field pod.spec.HostUsers is set to true, false
+We will test for when the field pod.spec.hostUsers is set to true, false
 and not set. All of this with and without the feature gate enabled.

 We will also unit test that, if pods were created with the new field
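Note on the hunk above: the test plan covers three field states (true, false, not set) crossed with the feature gate on and off. A minimal Go sketch of that matrix follows. `usesUserNamespace` is a hypothetical helper that encodes the behaviour described in this KEP (a user namespace is used only when the gate is enabled and `hostUsers` is explicitly false); it ignores runtime and node requirements.

```go
package main

import "fmt"

// usesUserNamespace is a hypothetical helper for this sketch. hostUsers is a
// pointer so that "not set" (nil) can be told apart from true and false.
func usesUserNamespace(hostUsers *bool, gateEnabled bool) bool {
	if !gateEnabled {
		// With the gate disabled the field is ignored and the host
		// user namespace is used.
		return false
	}
	return hostUsers != nil && !*hostUsers
}

func main() {
	yes, no := true, false
	cases := []struct {
		name      string
		hostUsers *bool
	}{
		{"unset", nil},
		{"true", &yes},
		{"false", &no},
	}
	// Enumerate all six combinations of gate state and field value.
	for _, gate := range []bool{false, true} {
		for _, c := range cases {
			fmt.Printf("gate=%v hostUsers=%s -> userns=%v\n",
				gate, c.name, usesUserNamespace(c.hostUsers, gate))
		}
	}
}
```

Only one of the six combinations yields a user namespace, which is the property a table-driven unit test would assert.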
@@ -766,7 +814,7 @@ The rollout is just a feature flag on the kubelet and the kube-apiserver.
 If one API server is upgraded while others aren't, the pod will be accepted (if the apiserver is >=
 1.25). If it is scheduled to a node that the kubelet has the feature flag activated and the node
 meets the requirements to use user namespaces, then the pod will be created with the namespace. If
-it is scheduled to a node that has the feature disabled, it will be scheduled without the user
+it is scheduled to a node that has the feature disabled, it will be created without the user
 namespace.

 On a rollback, pods created while the feature was active (created with user namespaces) will have to
@@ -787,7 +835,7 @@ will rollout across nodes.

 On Kubernetes side, the kubelet should start correctly.

-On the node runtime side, a pod created with pod.spec.HostUsers=false should be on RUNNING state if
+On the node runtime side, a pod created with pod.spec.hostUsers=false should be on RUNNING state if
 all node requirements are met.
 <!--
 What signals should users be paying attention to when the feature is young
@@ -798,7 +846,7 @@ that might indicate a serious problem?

 Yes.

-We tested to enable the feature flag, create a deployment with pod.spec.HostUsers=false, and then disable
+We tested to enable the feature flag, create a deployment with pod.spec.hostUsers=false, and then disable
 the feature flag and restart the kubelet and kube-apiserver.

 After that, we deleted the deployment pods (not the deployment object), the pods were re-created
@@ -830,7 +878,7 @@ previous answers based on experience in the field.

 ###### How can an operator determine if the feature is in use by workloads?

-Check if any pod has the pod.spec.HostUsers field set to false.
+Check if any pod has the pod.spec.hostUsers field set to false.
 <!--
 Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
 checking if there are objects with field X set) may be a last resort. Avoid
@@ -839,7 +887,7 @@ logs or events for this purpose.

 ###### How can someone using this feature know that it is working for their instance?

-Check if any pod has the pod.spec.HostUsers field set to false and is on RUNNING state on a node
+Check if any pod has the pod.spec.hostUsers field set to false and is on RUNNING state on a node
 that meets all the requirements.

 There are step-by-step examples in the Kubernetes documentation too.
@@ -859,7 +907,7 @@ Recall that end users cannot usually observe component logs or access metrics.
 - Condition name:
 - Other field:
 - [x] Other (treat as last resort)
-- Details: check pods with pod.spec.HostUsers field set to false, and see if they are in RUNNING
+- Details: check pods with pod.spec.hostUsers field set to false, and see if they are in RUNNING
 state.

 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
@@ -1135,7 +1183,7 @@ No changes to current kubelet behaviors. The feature only uses kubelet-local inf
 - Mitigations: What can be done to stop the bleeding, especially for already
 running user workloads?

-Remove the pod.spec.HostUsers field or disable the feature gate.
+Remove the pod.spec.hostUsers field or disable the feature gate.

 - Diagnostics: What are the useful log messages and their required logging
 levels that could help debug the issue?
@@ -1183,7 +1231,7 @@ No changes to current kubelet behaviors. The feature only uses kubelet-local inf
 - Mitigations: What can be done to stop the bleeding, especially for already
 running user workloads?

-Remove the pod.spec.HostUsers field or disable the feature gate.
+Remove the pod.spec.hostUsers field or disable the feature gate.

 - Diagnostics: What are the useful log messages and their required logging
 levels that could help debug the issue?
@@ -1217,7 +1265,7 @@ writing to this file.
 - Mitigations: What can be done to stop the bleeding, especially for already
 running user workloads?

-Remove the pod.spec.HostUsers field or disable the feature gate.
+Remove the pod.spec.hostUsers field or disable the feature gate.

 - Diagnostics: What are the useful log messages and their required logging
 levels that could help debug the issue?
@@ -1233,12 +1281,11 @@ writing to this file.
 There are no tests for failures to read or write the file, the code-paths just return the errors
 in those cases.

-
 - Error getting the kubelet IDs range configuration
 - Detection: How can it be detected via metrics? Stated another way:
 how can an operator troubleshoot without logging into a master or worker node?

-In this case the Kubelet will fail to start with a clear error message.
+In this case the kubelet will fail to start with a clear error message.

 - Mitigations: What can be done to stop the bleeding, especially for already
 running user workloads?
@@ -1369,21 +1416,23 @@ The issues without idmap mounts in previous iterations of this KEP, is that the
 pod had to be unique for every pod in the cluster, easily reaching a limit when the cluster is "big
 enough" and the UID space runs out. However, with idmap mounts the IDs assigned to a pod just needs
 to be unique within the node (and with 64k ranges we have 64k pods possible in the node, so not
-really an issue). IOW, by using idmap mounts, we changed the IDs limit to be node-scoped instead of
-cluster-wide/cluster-scoped.
+really an issue). In other words: by using idmap mounts, we changed the IDs limit to be node-scoped
+instead of cluster-wide/cluster-scoped.
+
+Some use cases for longer mappings include:

 There are no known use cases for longer mappings that we know of. The 16bit range (0-65535) is what
 is assumed by all POSIX tools that we are aware of. If the need arises, longer mapping can be
 considered in a future KEP.

-### Allow runtimes to pick the mapping?
+### Allow runtimes to pick the mapping

 Tim suggested that we might want to allow the container runtimes to choose the
 mapping and have different runtimes pick different mappings. While KEP authors
 disagree on this, we still need to discuss it and settle on something. This was
 [raised here](https://github.com/kubernetes/enhancements/pull/3065#discussion_r798760382)

-Furthermore, the reasons mentioned by Tim (some nodes having CRIO, some others having containerd,
+Furthermore, the reasons mentioned by Tim Hockin (some nodes having CRIO, some others having containerd,
 etc.) are handled correctly now. Different nodes can use different container runtimes, if a custom
 range needs to be used by the kubelet, that can be configured per-node.

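Note on the hunk above: the "64k ranges, 64k pods per node" figure follows from dividing the 32-bit ID space into non-overlapping blocks of 65536 IDs. The Go sketch below works through that arithmetic. `rangesPerNode` and `hostRangeStart` are hypothetical helpers, and the sketch assumes every block is available to pods, which does not model any range reserved for the host itself.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	mappingLength = 65536   // IDs per pod: the range 0-65535 stated in the KEP
	idSpace       = 1 << 32 // size of the 32-bit UID/GID space
)

// rangesPerNode returns how many non-overlapping pod ranges fit in the ID space.
func rangesPerNode() uint64 {
	return idSpace / mappingLength
}

// hostRangeStart (hypothetical) returns the first host ID of the block with
// the given index. Blocks never overlap because each one starts at a multiple
// of mappingLength.
func hostRangeStart(slot uint32) (uint32, error) {
	if uint64(slot) >= rangesPerNode() {
		return 0, errors.New("slot is outside the 32-bit ID space")
	}
	return slot * mappingLength, nil
}

func main() {
	fmt.Println("ranges per node:", rangesPerNode())
	for _, slot := range []uint32{1, 2, 65535} {
		start, err := hostRangeStart(slot)
		if err != nil {
			fmt.Println("slot", slot, "error:", err)
			continue
		}
		fmt.Printf("slot %d -> host IDs %d-%d\n", slot, start, start+mappingLength-1)
	}
}
```

Since the blocks only have to be unique on one node, two nodes can reuse the same slot for different pods, which is the node-scoped limit the text describes.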