Commit 6b4562b

Merge pull request #5 from giuseppe/userns-followup
userns KEP followup
2 parents 72618fc + 384f677 commit 6b4562b

1 file changed

+24
-87
lines changed

keps/sig-node/127-user-namespaces/README.md

@@ -22,18 +22,13 @@
2222
- [Phases](#phases)
2323
- [Phase 1: pods "without" volumes](#phase-1-pods-without-volumes)
2424
- [Phase 2: pods with volumes](#phase-2-pods-with-volumes)
25-
- [Phase 3: pod to pod isolation](#phase-3-pod-to-pod-isolation)
25+
- [Phase 3: TBD](#phase-3-tbd)
2626
- [Summary of the Proposed Changes](#summary-of-the-proposed-changes)
2727
- [Test Plan](#test-plan)
2828
- [Graduation Criteria](#graduation-criteria)
29-
- [pod.spec.useHostUsers graduation](#podspecusehostusers-graduation)
3029
- [Alpha](#alpha)
3130
- [Beta](#beta)
3231
- [GA](#ga)
33-
- [pod.spec.securityContext.userns.pod2podIsolation graduation](#podspecsecuritycontextusernspod2podisolation-graduation)
34-
- [Alpha](#alpha-1)
35-
- [Beta](#beta-1)
36-
- [GA](#ga-1)
3732
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
3833
- [Version Skew Strategy](#version-skew-strategy)
3934
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -71,14 +66,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
7166

7267
## Summary
7368

74-
This KEP adds a new `hostUsers` field to `pod.Spec` to allow to enable/disable
75-
using user namespaces for pods. Furthermore, it allows increased pod to pod
76-
isolation by means of `pod.spec.securityContext.userns.pod2podIsolation` field.
77-
78-
It allows users to place pods in different user namespaces increasing the
79-
pod-to-pod and pod-to-host isolation. This extra isolation increases the cluster
80-
security as it protects the host and other pods from malicious or compromised
81-
processes inside containers that are able to break into the host.
69+
This KEP adds support for using user namespaces in pods.
8270

8371
## Motivation
8472

@@ -149,6 +137,9 @@ Here we use UIDs, but the same applies for GIDs.
149137

150138
## Proposal
151139

140+
This KEP adds a new `hostUsers` field to `pod.Spec` to enable/disable
141+
using user namespaces for pods.
142+
152143
This proposal aims to support running pods inside user namespaces. This will
153144
improve the pod to node isolation (phase 1 and 2) and pod to pod isolation
154145
(phase 3) we currently have.
@@ -173,7 +164,7 @@ kernel module with `CAP_SYS_MODULE`.
173164
#### Story 3
174165

175166
As a cluster admin, I want to allow users to run their container as root
176-
without that process having root privileged on the host, so I can mitigate the
167+
without that process having root privileges on the host, so I can mitigate the
177168
impact of a compromised container.
178169

179170
#### Story 4
@@ -185,7 +176,7 @@ host files).
185176

186177
#### Story 5
187178

188-
As a cluster admin, I want to use different host UIDs/GIDs for pods running in
179+
As a cluster admin, I want to use different host UIDs/GIDs for pods running on
189180
the same node (whenever kernel/kube features allow it), so I can mitigate the
190181
impact a compromised pod can have on other pods and the node itself.
191182

@@ -199,29 +190,30 @@ impact a compromised pod can have on other pods and the node itself.
199190

200191
## Design Details
201192

202-
Note: Names are preliminary yet, I'm using field names to simplify explanations.
203-
204193
### Pod.spec changes
205194

206195
The following changes will be done to the pod.spec:
207196

208-
- `pod.spec.useHostUsers`: bool.
197+
- `pod.spec.hostUsers`: bool.
209198
If true or not present, uses the host user namespace (as today)
210199
If false, a new userns is created for the pod.
211-
This field will be used for phase 1, 2 and 3.
212-
213-
- `pod.spec.securityContext.userns.pod2podIsolation`: Enum
214-
If enabled, we will make the userns mappings be non-overlapping as much as possible.
215-
This field will be used in phase 3.
200+
By default it is set to `true`.
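
The defaulting rule for `pod.spec.hostUsers` described above can be sketched as follows. This is a minimal illustration only, assuming the pod spec is available as a plain dictionary; `uses_host_userns` and `describe` are hypothetical helper names and are not part of Kubernetes, the kubelet, or the CRI.

```python
from typing import Optional


def uses_host_userns(host_users: Optional[bool]) -> bool:
    """Return True when the pod shares the host user namespace.

    Mirrors the field description: an unset field (None) or True keeps
    today's behaviour; only an explicit False asks for a new userns.
    """
    return host_users is None or host_users


def describe(pod: dict) -> str:
    """Summarise which user namespace a (hypothetical) pod dict would get."""
    host_users = pod.get("spec", {}).get("hostUsers")
    if uses_host_userns(host_users):
        return "host user namespace"
    return "new user namespace for the pod"


if __name__ == "__main__":
    # Illustrative pod specs reduced to the single field of interest.
    for pod in ({"spec": {}},
                {"spec": {"hostUsers": True}},
                {"spec": {"hostUsers": False}}):
        print(pod["spec"], "->", describe(pod))
```

Treating "not present" the same as `true` means existing pod manifests keep their current behaviour, and only pods that explicitly set `hostUsers: false` opt in to a new user namespace.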
216201

217202
### Phases
218203

219204
We propose to divide the work into 3 phases. Each phase makes this work with
220205
either more isolation or more workloads. When no support is yet added to handle
221206
some workload, a clear error will be shown.
222207

208+
PLEASE note that only phase 1 is targeted for alpha. Also note that the missing
209+
details (CRI changes, changes needed in container runtimes, etc.) will be added
210+
in follow-up PRs.
211+
223212
Please note the last sub-section here is a table with the summary of the changes
224-
proposed on each phase.
213+
proposed on each phase. That table is out of date (it is from the initial
214+
proposal, doesn't have all the feedback and adjustments we discussed) but can
215+
still be useful as a general overview.
216+
225217

226218
#### Phase 1: pods "without" volumes
227219

@@ -267,60 +259,11 @@ listed vulnerabilities (as the host is protected from the container). It is also
267259
a trivial next-step to take, given that we have phase 1 implemented: just return
268260
the same mapping if the pod has other volumes.
269261

270-
#### Phase 3: pod to pod isolation
271-
272-
This phase will provide more isolation between pods that use volumes (as in
273-
phase 2) and requires another opt-in field:
274-
`pod.spec.securityContext.pod2podIsolation`.
275-
276-
This phase will try to not share the same mapping for all pods with volumes, as
277-
phase 2 does, but to achieve it some trade off needs to be made. This phase
278-
builds on the work of the previous phases and more details will be defined while
279-
the other phases evolve.
280-
281-
Here are some ideas so far:
282-
283-
One idea is to give different mappings to pods in different k8s namespaces or
284-
that use a different service account. This needs to be explored in further
285-
detail, but will probably impose limits to which workloads can run this (we need
286-
to expose a shorter mapping, less than 65535).
287-
288-
Another idea is to use id mapped mounts. This probably needs changes to the
289-
OCI runtime-spec, only works with certain filesystems and kernels that may take
290-
too long for some users to get (like managed services). Giuseppe started to
291-
experiment in crun with this
292-
[here](https://github.com/containers/crun/pull/780).
293-
294-
The value for `pod.spec.securityContext.pod2podIsolation` will be an enum, to
295-
select different strategies and allow room for future improvements.
296-
297-
It is being considered having a value that is "auto" for this fields, that
298-
will select the best strategy that your node supports. However, as different
299-
strategies will change the effective UID a container uses, if we add such an
300-
option the documentation will be VERY clear about the implications and
301-
automatizations will be provided whenever possible (we have some ideas on this
302-
front).
303-
304-
Another improvement suggested by @ddebroy to do here is:
305-
* Pods using also only [local ephemeral CSI volumes][csi-ephemeral-vol], as
306-
they share the same lifecycle of the pod, can be moved to use non-overlapping
307-
mappings.
308-
309-
This change can probably be done under the hood without the user noticing, to
310-
achieve more pod 2 pod isolation, and might not need the user to use
311-
`pod.spec.securityContext.pod2podIsolation`. However, some changes for the CSI
312-
vol to use the effective UID/GID might be needed and not trivial. @ddebroy has
313-
[kindly offered to help][csi-help] with this improvement
314-
315-
[csi-ephemeral-vol]: https://kubernetes-csi.github.io/docs/ephemeral-local-volumes.html#overview
316-
[csi-help]: https://github.com/kubernetes/enhancements/pull/3065/files#r762046107
317-
318-
If this phase turns out to be a lot of work, it will be left out as future work
319-
for other KEPs.
262+
#### Phase 3: TBD
320263

321264
### Summary of the Proposed Changes
322265

323-
[This table](https://docs.google.com/presentation/d/1z4oiZ7v4DjWpZQI2kbFbI8Q6botFaA07KJYaKA-vZpg/edit#slide=id.gfd10976c8b_1_41) gives you a quick overview of each phase.
266+
[This table](https://docs.google.com/presentation/d/1z4oiZ7v4DjWpZQI2kbFbI8Q6botFaA07KJYaKA-vZpg/edit#slide=id.gfd10976c8b_1_41) gives you a quick overview of each phase (note it is outdated, but still useful for a general overview).
324267

325268

326269
### Test Plan
@@ -347,23 +290,17 @@ TBD
347290

348291
### Graduation Criteria
349292

350-
Graduation for each pod.spec field we introduce will be separate.
351-
352-
#### pod.spec.useHostUsers graduation
353-
354293
##### Alpha
294+
- Phase 1 implemented
355295

356296
##### Beta
357297

358298
##### GA
359299

360-
#### pod.spec.securityContext.userns.pod2podIsolation graduation
361-
362-
##### Alpha
363-
364-
##### Beta
365-
366-
##### GA
300+
- Make plans on whether, when, and how to enable by default
301+
- Should we reconsider making the mappings smaller by default?
302+
- Should we allow any way for users to ask for "more" IDs to be mapped? If yes, how many more and how?
303+
- Should we allow the user to ask for specific mappings?
367304

368305
### Upgrade / Downgrade Strategy
369306
