- [Phases](#phases)
- [Phase 1: pods "without" volumes](#phase-1-pods-without-volumes)
- [Phase 2: pods with volumes](#phase-2-pods-with-volumes)
- - [Phase 3: pod to pod isolation](#phase-3-pod-to-pod-isolation)
+ - [Phase 3: TBD](#phase-3-tbd)
- [Summary of the Proposed Changes](#summary-of-the-proposed-changes)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- - [pod.spec.useHostUsers graduation](#podspecusehostusers-graduation)
- [Alpha](#alpha)
- [Beta](#beta)
- [GA](#ga)
- - [pod.spec.securityContext.userns.pod2podIsolation graduation](#podspecsecuritycontextusernspod2podisolation-graduation)
- - [Alpha](#alpha-1)
- - [Beta](#beta-1)
- - [GA](#ga-1)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -71,14 +66,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*

## Summary

- This KEP adds a new `hostUsers` field to `pod.Spec` to allow to enable/disable
- using user namespaces for pods. Furthermore, it allows increased pod to pod
- isolation by means of `pod.spec.securityContext.userns.pod2podIsolation` field.
-
- It allows users to place pods in different user namespaces increasing the
- pod-to-pod and pod-to-host isolation. This extra isolation increases the cluster
- security as it protects the host and other pods from malicious or compromised
- processes inside containers that are able to break into the host.
+ This KEP adds support to use user-namespaces in pods.

## Motivation

@@ -149,6 +137,9 @@ Here we use UIDs, but the same applies for GIDs.

## Proposal

+ This KEP adds a new `hostUsers` field to `pod.Spec` to allow to enable/disable
+ using user namespaces for pods.
+
This proposal aims to support running pods inside user namespaces. This will
improve the pod to node isolation (phase 1 and 2) and pod to pod isolation
(phase 3) we currently have.
@@ -173,7 +164,7 @@ kernel module with `CAP_SYS_MODULE`.
#### Story 3

As a cluster admin, I want to allow users to run their container as root
- without that process having root privileged on the host, so I can mitigate the
+ without that process having root privileges on the host, so I can mitigate the
impact of a compromised container.

#### Story 4
@@ -185,7 +176,7 @@ host files).

#### Story 5

- As a cluster admin, I want to use different host UIDs/GIDs for pods running in
+ As a cluster admin, I want to use different host UIDs/GIDs for pods running on
the same node (whenever kernel/kube features allow it), so I can mitigate the
impact a compromised pod can have on other pods and the node itself.

@@ -199,29 +190,30 @@ impact a compromised pod can have on other pods and the node itself.

## Design Details

- Note: Names are preliminary yet, I'm using field names to simplify explanations.
-
### Pod.spec changes

The following changes will be done to the pod.spec:

- - `pod.spec.useHostUsers`: bool.
+ - `pod.spec.hostUsers`: bool.
If true or not present, uses the host user namespace (as today)
If false, a new userns is created for the pod.
- This field will be used for phase 1, 2 and 3.
-
- - `pod.spec.securityContext.userns.pod2podIsolation`: Enum
- If enabled, we will make the userns mappings be non-overlapping as much as possible.
- This field will be used in phase 3.
+ By default it is set to `true`.
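
For illustration, a pod that opts in to a new user namespace under this proposal might look like the manifest below. This is only a sketch based on the `pod.spec.hostUsers` field described above; the pod name, container name, and image are placeholders and not part of the KEP.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-example                  # hypothetical name
spec:
  hostUsers: false                      # false: create a new user namespace for this pod
  containers:
  - name: app                           # hypothetical container name
    image: registry.example/app:latest  # placeholder image
```

Omitting `hostUsers`, or setting it to `true`, keeps the pod in the host user namespace, which matches today's behavior.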

### Phases

We propose to divide the work in 3 phases. Each phase makes this work with
either more isolation or more workloads. When no support is yet added to handle
some workload, a clear error will be shown.

+ PLEASE note that only phase 1 is targeted for alpha. Also note that the missing
+ details (CRI changes, changes needed in container runtimes, etc.) will be added
+ in follow-up PRs.
+
Please note the last sub-section here is a table with the summary of the changes
- proposed on each phase.
+ proposed on each phase. That table is not updated (it is from the initial
+ proposal, doesn't have all the feedback and adjustments we discussed) but can
+ still be useful as a general overview.
+

#### Phase 1: pods "without" volumes

@@ -267,60 +259,11 @@ listed vulnerabilities (as the host is protected from the container). It is also
a trivial next-step to take, given that we have phase 1 implemented: just return
the same mapping if the pod has other volumes.

- #### Phase 3: pod to pod isolation
-
- This phase will provide more isolation between pods that use volumes (as in
- phase 2) and requires another opt-in field:
- `pod.spec.securityContext.pod2podIsolation`.
-
- This phase will try to not share the same mapping for all pods with volumes, as
- phase 2 does, but to achieve it some trade off needs to be made. This phase
- builds on the work of the previous phases and more details will be defined while
- the other phases evolve.
-
- Here are some ideas so far:
-
- One idea is to give different mappings to pods in different k8s namespaces or
- that use a different service account. This needs to be explored in further
- detail, but will probably impose limits to which workloads can run this (we need
- to expose a shorter mapping, less than 65535).
-
- Another idea is to use id mapped mounts. This probably needs changes to the
- OCI runtime-spec, only works with certain filesystems and kernels that may take
- too long for some users to get (like managed services). Giuseppe started to
- experiment in crun with this
- [here](https://github.com/containers/crun/pull/780).
-
- The value for `pod.spec.securityContext.pod2podIsolation` will be an enum, to
- select different strategies and allow room for future improvements.
-
- It is being considered having a value that is "auto" for this fields, that
- will select the best strategy that your node supports. However, as different
- strategies will change the effective UID a container uses, if we add such an
- option the documentation will be VERY clear about the implications and
- automatizations will be provided whenever possible (we have some ideas on this
- front).
-
- Another improvement suggested by @ddebroy to do here is:
- * Pods using also only [local ephemeral CSI volumes][csi-ephemeral-vol], as
- they share the same lifecycle of the pod, can be moved to use non-overlapping
- mappings.
-
- This change can probably be done under the hood without the user noticing, to
- achieve more pod 2 pod isolation, and might not need the user to use
- `pod.spec.securityContext.pod2podIsolation`. However, some changes for the CSI
- vol to use the effective UID/GID might be needed and not trivial. @ddebroy has
- [kindly offered to help][csi-help] with this improvement
-
- [csi-ephemeral-vol]: https://kubernetes-csi.github.io/docs/ephemeral-local-volumes.html#overview
- [csi-help]: https://github.com/kubernetes/enhancements/pull/3065/files#r762046107
-
- If this phase turns out to be a lot of work, it will be left out as future work
- for other KEPs.
+ #### Phase 3: TBD

### Summary of the Proposed Changes

- [This table](https://docs.google.com/presentation/d/1z4oiZ7v4DjWpZQI2kbFbI8Q6botFaA07KJYaKA-vZpg/edit#slide=id.gfd10976c8b_1_41) gives you a quick overview of each phase.
+ [This table](https://docs.google.com/presentation/d/1z4oiZ7v4DjWpZQI2kbFbI8Q6botFaA07KJYaKA-vZpg/edit#slide=id.gfd10976c8b_1_41) gives you a quick overview of each phase (note it is outdated, but still useful for a general overview).


### Test Plan
@@ -347,23 +290,17 @@ TBD

### Graduation Criteria

- Graduation for each pod.spec field we introduce will be separate.
-
- #### pod.spec.useHostUsers graduation
-
##### Alpha
+ - Phase 1 implemented

##### Beta

##### GA

- #### pod.spec.securityContext.userns.pod2podIsolation graduation
-
- ##### Alpha
-
- ##### Beta
-
- ##### GA
+ - Make plans on whether, when, and how to enable by default
+ - Should we reconsider making the mappings smaller by default?
+ - Should we allow any way for users to ask for "more" IDs mapped? If yes, how many more and how?
+ - Should we allow the user to ask for specific mappings?

### Upgrade / Downgrade Strategy
