This repository was archived by the owner on Nov 16, 2023. It is now read-only.

Commit df63d60

Support PodFailureSpec to classify and summarize Pod failures (#41)
1 parent 54a4554 commit df63d60

16 files changed: 1458 additions and 294 deletions

README.md

Lines changed: 10 additions & 6 deletions
@@ -47,11 +47,15 @@ A Framework represents an application with a set of Tasks:
 ### Controller Feature
 1. Highly generalized as it is built for all kinds of applications
 2. Light-weight as it is only responsible for Pod orchestration
-3. Tolerate Pod/ConfigMap unexpected deletion, Node/Network/FrameworkController/Kubernetes failure
-4. Well-defined Framework consistency, state machine and failure model
-5. Idiomatic with Kubernetes official controllers, such as [Pod Spec](https://kubernetes.io/docs/concepts/workloads/pods/pod-overview/#pod-templates)
-6. Compatible with other Kubernetes features, such as Kubernetes [Service](https://kubernetes.io/docs/concepts/services-networking/service), [Gpu Scheduling](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus), [Volume](https://kubernetes.io/docs/concepts/storage/volumes/), [Logging](https://kubernetes.io/docs/concepts/cluster-administration/logging)
-7. Aligned with Kubernetes [Controller Design Guidelines](https://github.com/kubernetes/community/blob/master/contributors/devel/controllers.md) and [API Conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md)
+3. Well-defined Framework consistency, state machine and failure model
+4. Tolerate Pod/ConfigMap unexpected deletion, Node/Network/FrameworkController/Kubernetes failure
+5. Support to specify how to [classify and summarize Pod failures](doc/user-manual.md#PodFailureClassification)
+6. Support to expose [Framework and Pod history snapshots](doc/user-manual.md#FrameworkPodHistory) to external systems
+7. Easy to leverage [FrameworkBarrier](doc/user-manual.md#FrameworkBarrier) to achieve light-weight Gang Execution and Service Discovery
+8. Easy to leverage [HivedScheduler](doc/user-manual.md#HivedScheduler) to achieve GPU Multi-Tenant, Topology-Aware, Priority and Gang Scheduling
+9. Compatible with other Kubernetes features, such as Kubernetes [Service](https://kubernetes.io/docs/concepts/services-networking/service), [Gpu Scheduling](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus), [Volume](https://kubernetes.io/docs/concepts/storage/volumes/), [Logging](https://kubernetes.io/docs/concepts/cluster-administration/logging)
+10. Idiomatic with Kubernetes official controllers, such as [Pod Spec](https://kubernetes.io/docs/concepts/workloads/pods/pod-overview/#pod-templates)
+11. Aligned with Kubernetes [Controller Design Guidelines](https://github.com/kubernetes/community/blob/master/contributors/devel/controllers.md) and [API Conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md)
 
 ## Prerequisite
 1. A Kubernetes cluster, v1.14.2 or above, on-cloud or on-premise.
@@ -78,7 +82,7 @@ A specialized wrapper can be built on top of FrameworkController to optimize for
 ### Recommended Kubernetes Scheduler
 FrameworkController can directly leverage many [Kubernetes Schedulers](https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers) and among them we recommend these best fits:
 * [Kubernetes Default Scheduler](https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/#kube-scheduler): A General-Purpose Kubernetes Scheduler
-* [HivedScheduler](https://github.com/microsoft/pai/tree/master/subprojects/hivedscheduler): A Kubernetes Scheduler Extender optimized for GPUs ([Example](example/framework/scenario/tensorflow/gpu/tensorflowdistributedtrainingwithhivedscheduledgpu.yaml))
+* [HivedScheduler](doc/user-manual.md#HivedScheduler): A Kubernetes Scheduler Extender optimized for GPUs
 
 ### Similar Offering On Other Cluster Manager
 * [YARN FrameworkLauncher](https://github.com/Microsoft/pai/blob/master/subprojects/frameworklauncher/yarn): Similar offering natively supports [Apache YARN](http://hadoop.apache.org)

doc/user-manual.md

Lines changed: 27 additions & 5 deletions
@@ -3,10 +3,15 @@
 ## <a name="Index">Index</a>
 - [Framework Interop](#FrameworkInterop)
 - [Container EnvironmentVariable](#ContainerEnvironmentVariable)
-- [CompletionCode Convention](#CompletionCodeConvention)
+- [Pod Failure Classification](#PodFailureClassification)
+- [Predefined CompletionCode](#PredefinedCompletionCode)
+- [CompletionStatus](#CompletionStatus)
 - [RetryPolicy](#RetryPolicy)
 - [FrameworkAttemptCompletionPolicy](#FrameworkAttemptCompletionPolicy)
+- [Framework and Pod History](#FrameworkPodHistory)
 - [Controller Extension](#ControllerExtension)
+- [FrameworkBarrier](#FrameworkBarrier)
+- [HivedScheduler](#HivedScheduler)
 - [Best Practice](#BestPractice)
 
 ## <a name="FrameworkInterop">Framework Interop</a>
@@ -111,7 +116,7 @@ Type: application/json or application/yaml
 Delete the specified Framework.
 
 Notes:
-* If you need to ensure at most one instance of a specific Framework (identified by the FrameworkName) is running at any point in time, you should always use and only use the [Foreground Deletion](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion) in the provided body, see [Framework Notes](../pkg/apis/frameworkcontroller/v1/types.go). However, `kubectl delete` does not support to specify the Foreground Deletion at least for [Kubernetes v1.10](https://github.com/kubernetes/kubernetes/issues/66110#issuecomment-413761559), so you may have to use other [Supported Client](#SupportedClient).
+* If you need to ensure at most one instance of a specific Framework (identified by the FrameworkName) is running at any point in time, you should always use and only use the [Foreground Deletion](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion) in the provided body, see [Framework Notes](../pkg/apis/frameworkcontroller/v1/types.go). However, `kubectl delete` does not support to specify the Foreground Deletion at least for [Kubernetes v1.14.2](https://github.com/kubernetes/kubernetes/issues/66110#issuecomment-413761559), so you may have to use other [Supported Client](#SupportedClient).
 
 **Response**
 
@@ -194,8 +199,18 @@ Watch the change events of all Frameworks (in the specified FrameworkNamespace).
 ## <a name="ContainerEnvironmentVariable">Container EnvironmentVariable</a>
 [Container EnvironmentVariable](../pkg/apis/frameworkcontroller/v1/constants.go)
 
-## <a name="CompletionCodeConvention">CompletionCode Convention</a>
-[CompletionCode Convention](../pkg/apis/frameworkcontroller/v1/constants.go)
+## <a name="PodFailureClassification">Pod Failure Classification</a>
+You can specify how to classify and summarize Pod failures by [PodFailureSpec](../pkg/apis/frameworkcontroller/v1/config.go).
+
+## <a name="PredefinedCompletionCode">Predefined CompletionCode</a>
+You can leverage the [Predefined CompletionCode](../pkg/apis/frameworkcontroller/v1/completion.go) to instruct your [RetryPolicy](#RetryPolicy) and identify a certain predefined CompletionCode, regardless of different [PodFailureSpec](../pkg/apis/frameworkcontroller/v1/config.go) may be configured in different clusters.
+
+## <a name="CompletionStatus">CompletionStatus</a>
+[CompletionStatus](../pkg/apis/frameworkcontroller/v1/types.go): It is generated from [Predefined CompletionCode](#PredefinedCompletionCode) or [PodPattern matching](#PodFailureClassification). For a Pod, if no PodPattern is matched and failed Container exists, the CompletionCode is the same as the last failed Container ExitCode.
+
+[TaskAttemptCompletionStatus](../pkg/apis/frameworkcontroller/v1/types.go): Besides the [CompletionStatus](../pkg/apis/frameworkcontroller/v1/types.go), it also provides more detailed and structured diagnostic information about the completion of a TaskAttempt.
+
+[FrameworkAttemptCompletionStatus](../pkg/apis/frameworkcontroller/v1/types.go): Besides the [CompletionStatus](../pkg/apis/frameworkcontroller/v1/types.go), it also provides more detailed and structured diagnostic information about the completion of a FrameworkAttempt.
 
 ## <a name="RetryPolicy">RetryPolicy</a>
 ### <a name="RetryPolicy_Spec">Spec</a>
@@ -210,7 +225,7 @@ Notes:
 
 *You still need to specify them explicitly, as we have not supported the Framework Spec Defaulting yet.*
 
-2. For the definition of each CompletionType, such as Transient Failed, see [CompletionCode Convention](#CompletionCodeConvention).
+2. For the definition of each [CompletionType](../pkg/apis/frameworkcontroller/v1/types.go), such as Transient Failed, see [CompletionStatus](#CompletionStatus).
 
 <table>
 <tbody>
@@ -350,10 +365,17 @@ Notes:
 </tbody>
 </table>
 
+## <a name="FrameworkPodHistory">Framework and Pod History</a>
+By leveraging [LogObjectSnapshot](../pkg/apis/frameworkcontroller/v1/config.go), external systems, such as [Fluentd](https://www.fluentd.org) and [ElasticSearch](https://www.elastic.co/products/elasticsearch), can collect and process Framework and Pod history snapshots even if it was retried or deleted, such as persistence, metrics conversion, visualization, alerting, acting, analysis, etc.
+
 ## <a name="ControllerExtension">Controller Extension</a>
 ### <a name="FrameworkBarrier">FrameworkBarrier</a>
 1. [Usage](../pkg/barrier/barrier.go)
 2. Example: [FrameworkBarrier Example](../example/framework/extension/frameworkbarrier.yaml), [TensorFlow Example](../example/framework/scenario/tensorflow), [etc](../example/framework/scenario).
 
+### <a name="HivedScheduler">HivedScheduler</a>
+1. [Usage](https://github.com/microsoft/pai/tree/master/subprojects/hivedscheduler)
+2. Example: [TensorFlow Example](../example/framework/scenario/tensorflow/gpu/tensorflowdistributedtrainingwithhivedscheduledgpu.yaml), [etc](https://github.com/microsoft/pai/blob/master/subprojects/GOPATH/src/github.com/microsoft/hivedscheduler/example/request/design/request.yaml).
+
 ## <a name="BestPractice">Best Practice</a>
 [Best Practice](../pkg/apis/frameworkcontroller/v1/types.go)
Lines changed: 264 additions & 2 deletions
@@ -1,8 +1,270 @@
 # Put it directly under frameworkcontroller's current working directory.
 # For the full config setting and usage, see ./pkg/apis/frameworkcontroller/v1/config.go
 
-# This is the default config for frameworkcontroller, so all settings are commented out.
+# This is the default config for frameworkcontroller, so most settings are commented out.
 
 #kubeApiServerAddress: http://10.10.10.10:8080
-#kubeConfigFilePath: ""
+#kubeConfigFilePath: ''
 #workerNumber: 20
+
+podFailureSpec:
+################################################################################
+# [-1199, -1000]: K8S issued failures
+################################################################################
+- code: -1000
+  phrase: PodEvicted
+  type:
+    attributes: [Transient]
+  podPatterns:
+  - reasonRegex: '(?i)^Evicted$'
+    messageRegex: '(?ms).*'
+- code: -1001
+  phrase: PodNodeLost
+  type:
+    attributes: [Transient]
+  podPatterns:
+  - reasonRegex: '(?i)^NodeLost$'
+    messageRegex: '(?ms).*'
+- code: -1002
+  phrase: PodScheduledToInsufficientResourceNode
+  type:
+    attributes: [Transient]
+  podPatterns:
+  - reasonRegex: '(?i)^OutOf\S+$'
+    messageRegex: '(?ms).*'
+- code: -1003
+  phrase: PodPreemptedForCriticalPod
+  type:
+    attributes: [Transient]
+  podPatterns:
+  - reasonRegex: '(?i)^Preempting$'
+    messageRegex: '(?ms).*'
+- code: -1004
+  phrase: PodDeadlineExceeded
+  type:
+    attributes: [Permanent]
+  podPatterns:
+  - reasonRegex: '(?i)^DeadlineExceeded$'
+    messageRegex: '(?ms).*'
+- code: -1005
+  phrase: PodNodeAdmissionForbidden
+  type:
+    attributes: [Permanent]
+  podPatterns:
+  - reasonRegex: '(?i)^Forbidden$'
+    messageRegex: '(?ms).*'
+- code: -1006
+  phrase: PodNodeAdmissionUnexpectedError
+  type:
+    attributes: [Transient]
+  podPatterns:
+  - reasonRegex: '(?i)^UnexpectedAdmissionError$'
+    messageRegex: '(?ms).*'
+  - reasonRegex: '(?i)^UnknownReason$'
+    messageRegex: '(?ms).*'
+  - reasonRegex: '(?i)^InvalidNodeInfo$'
+    messageRegex: '(?ms).*'
+  - reasonRegex: '(?i)^UnexpectedPredicateFailureType$'
+    messageRegex: '(?ms).*'
+
+################################################################################
+# [-1399, -1200]: Docker issued failures
+################################################################################
+- code: -1200
+  phrase: ContainerDockerOOMKilled
+  type:
+    attributes: [Permanent]
+  podPatterns:
+  - containers:
+    - reasonRegex: '(?i)^OOMKilled$'
+      codeRange: {min: 1}
+      nameRegex: '(?ms).*'
+      messageRegex: '(?ms).*'
+- code: -1201
+  phrase: ContainerDockerRunFlagInvalid
+  type:
+    attributes: [Permanent]
+  podPatterns:
+  - containers:
+    - reasonRegex: '(?i)^ContainerCannotRun$'
+      codeRange: {min: 125, max: 125}
+      nameRegex: '(?ms).*'
+      messageRegex: '(?ms).*'
+- code: -1202
+  phrase: ContainerDockerRunPermissionDenied
+  type:
+    attributes: [Permanent]
+  podPatterns:
+  - containers:
+    - reasonRegex: '(?i)^ContainerCannotRun$'
+      codeRange: {min: 126, max: 126}
+      nameRegex: '(?ms).*'
+      messageRegex: '(?ms).*'
+- code: -1203
+  phrase: ContainerDockerRunCmdNotFound
+  type:
+    attributes: [Permanent]
+  podPatterns:
+  - containers:
+    - reasonRegex: '(?i)^ContainerCannotRun$'
+      codeRange: {min: 127, max: 127}
+      nameRegex: '(?ms).*'
+      messageRegex: '(?ms).*'
+  - containers:
+    - reasonRegex: '(?i)^ContainerCannotRun$'
+      codeRange: {min: 128, max: 128}
+      nameRegex: '(?ms).*'
+      messageRegex: '(?msi).*(not found|cannot find|no such).*'
+- code: -1204
+  phrase: ContainerDockerRunUnknownError
+  podPatterns:
+  - containers:
+    - reasonRegex: '(?i)^ContainerCannotRun$'
+      codeRange: {min: 128, max: 128}
+      nameRegex: '(?ms).*'
+      messageRegex: '(?ms).*'
+
+################################################################################
+# [1, 255]: User Container issued failures
+################################################################################
+# [129, 192]: Involuntary failures caused by OS Signal
+- code: 130
+  phrase: ContainerSigIntReceived
+  type:
+    attributes: [Transient]
+  podPatterns:
+  - containers:
+    - codeRange: {min: 130, max: 130}
+      nameRegex: '(?ms).*'
+      messageRegex: '(?ms).*'
+- code: 131
+  phrase: ContainerSigQuitReceived
+  type:
+    attributes: [Transient]
+  podPatterns:
+  - containers:
+    - codeRange: {min: 131, max: 131}
+      nameRegex: '(?ms).*'
+      messageRegex: '(?ms).*'
+- code: 132
+  phrase: ContainerSigIllReceived
+  type:
+    attributes: [Permanent]
+  podPatterns:
+  - containers:
+    - codeRange: {min: 132, max: 132}
+      nameRegex: '(?ms).*'
+      messageRegex: '(?ms).*'
+- code: 134
+  phrase: ContainerSigAbrtReceived
+  podPatterns:
+  - containers:
+    - codeRange: {min: 134, max: 134}
+      nameRegex: '(?ms).*'
+      messageRegex: '(?ms).*'
+- code: 135
+  phrase: ContainerSigBusReceived
+  type:
+    attributes: [Permanent]
+  podPatterns:
+  - containers:
+    - codeRange: {min: 135, max: 135}
+      nameRegex: '(?ms).*'
+      messageRegex: '(?ms).*'
+- code: 136
+  phrase: ContainerSigFpeReceived
+  type:
+    attributes: [Permanent]
+  podPatterns:
+  - containers:
+    - codeRange: {min: 136, max: 136}
+      nameRegex: '(?ms).*'
+      messageRegex: '(?ms).*'
+- code: 137
+  phrase: ContainerSigKillReceived
+  type:
+    attributes: [Transient]
+  podPatterns:
+  - containers:
+    - codeRange: {min: 137, max: 137}
+      nameRegex: '(?ms).*'
+      messageRegex: '(?ms).*'
+- code: 139
+  phrase: ContainerSigSegvReceived
+  type:
+    attributes: [Permanent]
+  podPatterns:
+  - containers:
+    - codeRange: {min: 139, max: 139}
+      nameRegex: '(?ms).*'
+      messageRegex: '(?ms).*'
+- code: 141
+  phrase: ContainerSigPipeReceived
+  type:
+    attributes: [Permanent]
+  podPatterns:
+  - containers:
+    - codeRange: {min: 141, max: 141}
+      nameRegex: '(?ms).*'
+      messageRegex: '(?ms).*'
+- code: 143
+  phrase: ContainerSigTermReceived
+  type:
+    attributes: [Transient]
+  podPatterns:
+  - containers:
+    - codeRange: {min: 143, max: 143}
+      nameRegex: '(?ms).*'
+      messageRegex: '(?ms).*'
+
+# [1, 255] - [129, 192]: Voluntary failures caused by Container itself
+# [200, 219]: Reserved Codes
+# [1, 255] - [129, 192] - [200, 219]: Custom Codes
+# Example: Directly forwarding Container code and just adding type info.
+#- code: 220
+#  phrase: Container220Failed
+#  type:
+#    attributes: [Permanent]
+#  podPatterns:
+#  - containers:
+#    - codeRange: {min: 220, max: 220}
+#      nameRegex: '(?ms).*'
+#      messageRegex: '(?ms).*'
+
+# Example: Classification only based on Container termination message.
+#- code: 221
+#  phrase: ContainerTensorflowOOMKilled
+#  type:
+#    attributes: [Permanent]
+#  podPatterns:
+#  - containers:
+#    - messageRegex: '(?msi)tensorflow.*ResourceExhaustedError.*OOM.*'
+#      codeRange: {min: 1}
+#      nameRegex: '(?ms).*'
+#- code: 222
+#  phrase: ContainerMPISegvFault
+#  type:
+#    attributes: [Permanent]
+#  podPatterns:
+#  - containers:
+#    - messageRegex: '(?msi)Signal code: Address not mapped.*'
+#      codeRange: {min: 1}
+#      nameRegex: '(?ms).*'
+#- code: 223
+#  phrase: ContainerCudaUncorrectableECCError
+#  type:
+#    attributes: [Transient]
+#  podPatterns:
+#  - containers:
+#    - messageRegex: '(?msi)CUDA_ERROR_ECC_UNCORRECTABLE.*'
+#      codeRange: {min: 1}
+#      nameRegex: '(?ms).*'
+
+# Example: Redirect all unknown failures to a single comparable code.
+#- code: 255
+#  phrase: ContainerUnknownFailed
+#  podPatterns:
+#  - containers:
+#    - codeRange: {min: 1}
+#      nameRegex: '(?ms).*'
+#      messageRegex: '(?ms).*'

example/framework/basic/batchfailedpermanent.yaml

Lines changed: 2 additions & 2 deletions
@@ -24,8 +24,8 @@ spec:
 containers:
 - name: ubuntu
   image: ubuntu:trusty
-  # See CompletionCode Convention in
-  # ./pkg/apis/frameworkcontroller/v1/constants.go
+  # See Predefined CompletionCode in
+  # ./pkg/apis/frameworkcontroller/v1/completion.go
   command: [
   "sh", "-c",
   "sleep 10 &&

example/framework/basic/batchfailedtransient.yaml

Lines changed: 2 additions & 2 deletions
@@ -24,8 +24,8 @@ spec:
 containers:
 - name: ubuntu
   image: ubuntu:trusty
-  # See CompletionCode Convention in
-  # ./pkg/apis/frameworkcontroller/v1/constants.go
+  # See Predefined CompletionCode in
+  # ./pkg/apis/frameworkcontroller/v1/completion.go
   command: [
   "sh", "-c",
   "sleep 10 &&
