This repository was archived by the owner on Nov 16, 2023. It is now read-only.

Commit d67fc76

Support Create ExecutionType: Just create without start (#58)
1 parent 7669288 commit d67fc76

File tree

10 files changed, +169 -21 lines changed

README.md

Lines changed: 5 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -54,10 +54,11 @@ A Framework represents an application with a set of Tasks:
5454
2. Partitioned to different heterogeneous TaskRoles which share the same lifecycle
5555
3. Ordered in the same homogeneous TaskRole by TaskIndex
5656
4. With consistent identity {FrameworkName}-{TaskRoleName}-{TaskIndex} as PodName
57-
5. With fine grained [RetryPolicy](doc/user-manual.md#RetryPolicy) for each Task and the whole Framework
58-
6. With fine grained [FrameworkAttemptCompletionPolicy](doc/user-manual.md#FrameworkAttemptCompletionPolicy) for each TaskRole
59-
7. With PodGracefulDeletionTimeoutSec for each Task to [tune Consistency vs Availability](doc/user-manual.md#FrameworkConsistencyAvailability)
60-
8. With fine grained [Status](pkg/apis/frameworkcontroller/v1/types.go) for each TaskAttempt/Task, each TaskRole and the whole FrameworkAttempt/Framework
57+
5. With fine grained [ExecutionType](doc/user-manual.md#FrameworkExecutionType) to Start/Stop the whole Framework
58+
6. With fine grained [RetryPolicy](doc/user-manual.md#RetryPolicy) for each Task and the whole Framework
59+
7. With fine grained [FrameworkAttemptCompletionPolicy](doc/user-manual.md#FrameworkAttemptCompletionPolicy) for each TaskRole
60+
8. With PodGracefulDeletionTimeoutSec for each Task to [tune Consistency vs Availability](doc/user-manual.md#FrameworkConsistencyAvailability)
61+
9. With fine grained [Status](pkg/apis/frameworkcontroller/v1/types.go) for each TaskAttempt/Task, each TaskRole and the whole FrameworkAttempt/Framework
6162

6263
### Controller Feature
6364
1. Highly generalized as it is built for all kinds of applications

doc/user-manual.md

Lines changed: 133 additions & 4 deletions
@@ -2,6 +2,7 @@
22

33
## <a name="Index">Index</a>
44
- [Framework Interop](#FrameworkInterop)
5+
- [Framework ExecutionType](#FrameworkExecutionType)
56
- [Container EnvironmentVariable](#ContainerEnvironmentVariable)
67
- [Pod Failure Classification](#PodFailureClassification)
78
- [Predefined CompletionCode](#PredefinedCompletionCode)
@@ -38,7 +39,7 @@ As Framework is actually a [Kubernetes CRD](https://kubernetes.io/docs/concepts/
3839
### <a name="SupportedInteroperation">Supported Interoperation</a>
3940
| API Kind | Operations |
4041
|:---- |:---- |
41-
| Framework | [CREATE](#CREATE_Framework) [DELETE](#DELETE_Framework) [GET](#GET_Framework) [LIST](#LIST_Frameworks) [WATCH](#WATCH_Framework) [WATCH_LIST](#WATCH_LIST_Frameworks)<br>[PATCH](#PATCH_Framework) ([Stop](#Stop_Framework), [Add TaskRole](#Add_TaskRole), [Delete TaskRole](#Delete_TaskRole), [Add/Delete Task](#Add_Delete_Task)) |
42+
| Framework | [CREATE](#CREATE_Framework) [DELETE](#DELETE_Framework) [GET](#GET_Framework) [LIST](#LIST_Frameworks) [WATCH](#WATCH_Framework) [WATCH_LIST](#WATCH_LIST_Frameworks)<br>[PATCH](#PATCH_Framework) ([Start](#Start_Framework), [Stop](#Stop_Framework), [Add TaskRole](#Add_TaskRole), [Delete TaskRole](#Delete_TaskRole), [Add/Delete Task](#Add_Delete_Task)) |
4243
| [ConfigMap](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.14/#configmap-v1-core) | All operations except for [CREATE](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.14/#create-configmap-v1-core) [PUT](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.14/#replace-configmap-v1-core) [PATCH](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.14/#patch-configmap-v1-core) |
4344
| [Pod](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.14/#pod-v1-core) | All operations except for [CREATE](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.14/#create-pod-v1-core) [PUT](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.14/#replace-pod-v1-core) [PATCH](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.14/#patch-pod-v1-core) |
4445

@@ -55,6 +56,8 @@ Type: application/json or application/yaml
5556

5657
Create the specified Framework.
5758

59+
Any [ExecutionType](#FrameworkExecutionType) can be specified to create the Framework.
60+
5861
**Response**
5962

6063
| Code | Body | Description |
@@ -65,6 +68,38 @@ Create the specified Framework.
6568
| Conflict(409) | [Status](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.14/#status-v1-meta) | The specified Framework already exists. |
6669

6770
#### <a name="PATCH_Framework">PATCH Framework</a>
71+
##### <a name="Start_Framework">Start Framework</a>
72+
**Request**
73+
74+
PATCH /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
75+
76+
Body:
77+
78+
```json
79+
[
80+
{
81+
"op": "replace",
82+
"path": "/spec/executionType",
83+
"value": "Start"
84+
}
85+
]
86+
```
87+
88+
Type: application/json-patch+json
89+
90+
**Description**
91+
92+
Start the specified Framework whose [ExecutionType](#FrameworkExecutionType) should be `Create`.
93+
94+
Before the Start, the Framework will not start to run or complete, but the object of the Framework is created, see [Framework PreStart Example](#FrameworkExecutionTypePreStartExample).
95+
96+
**Response**
97+
98+
| Code | Body | Description |
99+
|:---- |:---- |:---- |
100+
| OK(200) | [Framework](../pkg/apis/frameworkcontroller/v1/types.go) | Return current Framework. |
101+
| NotFound(404) | [Status](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.14/#status-v1-meta) | The specified Framework is not found. |
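
As a minimal sketch, a client could assemble the Start request above in Go using only the standard library. The helper names `BuildExecutionTypePatch` and `FrameworkPath` are hypothetical and not part of FrameworkController; actually sending the request (for example with `net/http` or a Kubernetes client) is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is one JSON Patch operation, as used in the request body above.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// BuildExecutionTypePatch (hypothetical helper) returns the JSON Patch body
// that replaces /spec/executionType with the given value, e.g. "Start".
func BuildExecutionTypePatch(executionType string) ([]byte, error) {
	ops := []patchOp{{Op: "replace", Path: "/spec/executionType", Value: executionType}}
	return json.Marshal(ops)
}

// FrameworkPath (hypothetical helper) fills in the PATCH URL path shown above.
func FrameworkPath(namespace, name string) string {
	return fmt.Sprintf(
		"/apis/frameworkcontroller.microsoft.com/v1/namespaces/%s/frameworks/%s",
		namespace, name)
}

func main() {
	body, err := BuildExecutionTypePatch("Start")
	if err != nil {
		panic(err)
	}
	// The body must be sent with Content-Type: application/json-patch+json.
	fmt.Println("PATCH", FrameworkPath("default", "prestart"))
	fmt.Println(string(body))
}
```

The same helper would produce the Stop body when called with `"Stop"`, since both operations only replace `/spec/executionType`.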
102+
68103
##### <a name="Stop_Framework">Stop Framework</a>
69104
**Request**
70105

@@ -86,9 +121,9 @@ Type: application/json-patch+json
86121

87122
**Description**
88123

89-
Stop the specified Framework:
124+
Stop the specified Framework whose [ExecutionType](#FrameworkExecutionType) should be `Create` or `Start`.
90125

91-
All running containers of the Framework will be stopped while the object of the Framework is still kept.
126+
After the Stop, the Framework will start to complete, but the object of the Framework will not be deleted, see [Framework PostStop Example](#FrameworkExecutionTypePostStopExample).
92127

93128
**Response**
94129

@@ -346,6 +381,100 @@ Watch the change events of all Frameworks (in the specified FrameworkNamespace).
346381
|:---- |:---- |:---- |
347382
| OK(200) | [WatchEvent](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.14/#watchevent-v1-meta) | Streaming the change events of all Frameworks (in the specified FrameworkNamespace). |
348383

384+
## <a name="FrameworkExecutionType">Framework ExecutionType</a>
385+
Framework [ExecutionType](../pkg/apis/frameworkcontroller/v1/types.go) can be specified to control the execution of the Framework:
386+
1. You can [Create Framework](#CREATE_Framework) with the `Create` ExecutionType, which creates the Framework object without starting it.
387+
- This is useful when you need to perform PreStart actions that depend on the Framework object, see [Framework PreStart Example](#FrameworkExecutionTypePreStartExample). Once these actions are done, you can safely [Start Framework](#Start_Framework).
388+
2. You can [Stop Framework](#Stop_Framework), which stops the Framework without deleting its object.
389+
- This is useful when you need to perform PostStop actions that depend on the Framework object, see [Framework PostStop Example](#FrameworkExecutionTypePostStopExample). Once these actions are done, you can safely [Delete Framework](#DELETE_Framework).
390+
391+
### <a name="FrameworkExecutionTypeExample">Example</a>
392+
#### <a name="FrameworkExecutionTypePreStartExample">Framework PreStart Example</a>
393+
In this example, you need to run a [Framework which depends on a ServiceAccount](https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account), but the ServiceAccount must in turn list the Framework object in its [OwnerReferences](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents). Because the Framework's `metadata.uid` is known only after the object is created, you cannot directly [Create Framework](#CREATE_Framework) with ExecutionType `Start`.
394+
1. [Create Framework](#CREATE_Framework) with `Create` ExecutionType and a ServiceAccount reference as below, then the Framework will stay as AttemptCreationPending:
395+
```yaml
396+
apiVersion: frameworkcontroller.microsoft.com/v1
397+
kind: Framework
398+
metadata:
399+
name: prestart
400+
spec:
401+
executionType: Create
402+
retryPolicy:
403+
fancyRetryPolicy: false
404+
maxRetryCount: 0
405+
taskRoles:
406+
- name: a
407+
taskNumber: 4
408+
frameworkAttemptCompletionPolicy:
409+
minFailedTaskCount: 4
410+
minSucceededTaskCount: 1
411+
task:
412+
retryPolicy:
413+
fancyRetryPolicy: false
414+
maxRetryCount: 0
415+
podGracefulDeletionTimeoutSec: 600
416+
pod:
417+
spec:
418+
restartPolicy: Never
419+
serviceAccountName: prestart
420+
containers:
421+
- name: ubuntu
422+
image: ubuntu:trusty
423+
command: ["sh", "-c", "printenv && sleep infinity"]
424+
```
425+
2. Use the `metadata.uid` from the above creation response to replace {{FrameworkUID}} below, and [Create ServiceAccount](https://v1-14.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.14/#create-serviceaccount-v1-core) with a reference to the above Framework:
426+
```yaml
427+
apiVersion: v1
428+
kind: ServiceAccount
429+
metadata:
430+
name: prestart
431+
ownerReferences:
432+
- apiVersion: frameworkcontroller.microsoft.com/v1
433+
kind: Framework
434+
name: prestart
435+
uid: {{FrameworkUID}}
436+
controller: true
437+
blockOwnerDeletion: true
438+
```
439+
3. [Start Framework](#Start_Framework), then the Framework will start to run successfully.
440+
4. [Delete Framework](#DELETE_Framework), then both the Framework and above ServiceAccount will be deleted.
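
Step 2 of the list above can be sketched in Go as follows. This is an illustration only: `FrameworkUID` and `ServiceAccountManifest` are hypothetical helpers, the sample response is trimmed to the two fields used, and the UID in `main` is a placeholder rather than a real value.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// createdFramework holds only the fields of the creation response used here.
type createdFramework struct {
	Metadata struct {
		Name string `json:"name"`
		UID  string `json:"uid"`
	} `json:"metadata"`
}

// FrameworkUID (hypothetical helper) extracts metadata.uid from the JSON
// body returned by the Create Framework request.
func FrameworkUID(response []byte) (string, error) {
	var f createdFramework
	if err := json.Unmarshal(response, &f); err != nil {
		return "", err
	}
	if f.Metadata.UID == "" {
		return "", fmt.Errorf("creation response has no metadata.uid")
	}
	return f.Metadata.UID, nil
}

// ServiceAccountManifest (hypothetical helper) renders the ServiceAccount
// from step 2 with {{FrameworkUID}} filled in. As in the example, the
// ServiceAccount reuses the Framework's name.
func ServiceAccountManifest(name, uid string) string {
	return fmt.Sprintf(`apiVersion: v1
kind: ServiceAccount
metadata:
  name: %[1]s
  ownerReferences:
  - apiVersion: frameworkcontroller.microsoft.com/v1
    kind: Framework
    name: %[1]s
    uid: %[2]s
    controller: true
    blockOwnerDeletion: true
`, name, uid)
}

func main() {
	// Placeholder response; a real one comes from the Create Framework call.
	response := []byte(`{"metadata":{"name":"prestart","uid":"00000000-0000-0000-0000-000000000000"}}`)
	uid, err := FrameworkUID(response)
	if err != nil {
		panic(err)
	}
	fmt.Print(ServiceAccountManifest("prestart", uid))
}
```

Failing early on an empty `metadata.uid` avoids creating a ServiceAccount whose owner reference can never match the Framework.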
441+
442+
#### <a name="FrameworkExecutionTypePostStopExample">Framework PostStop Example</a>
443+
In this example, you need to stop a Framework whose final stopped Framework object needs to be [pushed to/pulled by external systems](#FrameworkPodHistory), so you cannot directly [Delete Framework](#DELETE_Framework).
444+
1. [Create Framework](#CREATE_Framework) as below:
445+
```yaml
446+
apiVersion: frameworkcontroller.microsoft.com/v1
447+
kind: Framework
448+
metadata:
449+
name: poststop
450+
spec:
451+
executionType: Start
452+
retryPolicy:
453+
fancyRetryPolicy: false
454+
maxRetryCount: 0
455+
taskRoles:
456+
- name: a
457+
taskNumber: 4
458+
frameworkAttemptCompletionPolicy:
459+
minFailedTaskCount: 4
460+
minSucceededTaskCount: 1
461+
task:
462+
retryPolicy:
463+
fancyRetryPolicy: false
464+
maxRetryCount: 0
465+
podGracefulDeletionTimeoutSec: 600
466+
pod:
467+
spec:
468+
restartPolicy: Never
469+
containers:
470+
- name: ubuntu
471+
image: ubuntu:trusty
472+
command: ["sh", "-c", "printenv && sleep infinity"]
473+
```
474+
2. [Stop Framework](#Stop_Framework), then the Framework will be stopped, i.e. FrameworkCompleted.
475+
3. [Get Framework](#GET_Framework), and archive it into a database first.
476+
4. [Delete Framework](#DELETE_Framework), then the Framework will be deleted.
477+
349478
## <a name="ContainerEnvironmentVariable">Container EnvironmentVariable</a>
350479
[Container EnvironmentVariable](../pkg/apis/frameworkcontroller/v1/constants.go)
351480

@@ -713,7 +842,7 @@ Besides these general [Framework ConsistencyGuarantees](#ConsistencyGuarantees),
713842
To safely run large scale Framework, i.e. the total task number in a single Framework is greater than 300, you just need to enable the [LargeFrameworkCompression](../pkg/apis/frameworkcontroller/v1/config.go). However, you may also need to decompress the Framework by yourself.
714843

715844
## <a name="FrameworkPodHistory">Framework and Pod History</a>
716-
By leveraging the [LogObjectSnapshot](../pkg/apis/frameworkcontroller/v1/config.go), external systems, such as [Fluentd](https://www.fluentd.org) and [ElasticSearch](https://www.elastic.co/products/elasticsearch), can collect and process Framework and Pod history snapshots even if it was retried or deleted, such as persistence, metrics conversion, visualization, alerting, acting, analysis, etc.
845+
By leveraging the [LogObjectSnapshot](../pkg/apis/frameworkcontroller/v1/config.go), external systems, such as [Fluentd](https://www.fluentd.org) and [ElasticSearch](https://www.elastic.co/products/elasticsearch), can collect and process Framework and Pod history snapshots even if it was retried or deleted, such as for persistence, metrics conversion, visualization, alerting, acting, analysis, etc.
717846

718847
## <a name="FrameworkTaskStateMachine">Framework and Task State Machine</a>
719848
### <a name="FrameworkStateMachine">Framework State Machine</a>

example/framework/extension/frameworkbarrier.yaml

Lines changed: 1 addition & 1 deletion
@@ -59,7 +59,7 @@ spec:
5959
- name: frameworkbarrier-volume
6060
mountPath: /mnt/frameworkbarrier
6161
# [PREREQUISITE]
62-
# User needs to create a service account in the same namespace of this
62+
# User needs to create a ServiceAccount in the same namespace of this
6363
# Framework with granted permission for frameworkbarrier, if the k8s
6464
# cluster enforces authorization.
6565
# For example, if the cluster enforces RBAC:

example/framework/scenario/tensorflow/ps/cpu/tensorflowdistributedtrainingwithcpu.yaml

Lines changed: 1 addition & 1 deletion
@@ -69,7 +69,7 @@ spec:
6969
- name: data-volume
7070
mountPath: /mnt/data
7171
# [PREREQUISITE]
72-
# User needs to create a service account for frameworkbarrier, if the
72+
# User needs to create a ServiceAccount for frameworkbarrier, if the
7373
# k8s cluster enforces authorization.
7474
# See more in ./example/framework/extension/frameworkbarrier.yaml
7575
serviceAccountName: frameworkbarrier

example/framework/scenario/tensorflow/ps/gpu/tensorflowdistributedtrainingwithdefaultscheduledgpu.yaml

Lines changed: 1 addition & 1 deletion
@@ -75,7 +75,7 @@ spec:
7575
- name: data-volume
7676
mountPath: /mnt/data
7777
# [PREREQUISITE]
78-
# User needs to create a service account for frameworkbarrier, if the
78+
# User needs to create a ServiceAccount for frameworkbarrier, if the
7979
# k8s cluster enforces authorization.
8080
# See more in ./example/framework/extension/frameworkbarrier.yaml
8181
serviceAccountName: frameworkbarrier

example/framework/scenario/tensorflow/ps/gpu/tensorflowdistributedtrainingwithhivedscheduledgpu.yaml

Lines changed: 1 addition & 1 deletion
@@ -95,7 +95,7 @@ spec:
9595
- name: data-volume
9696
mountPath: /mnt/data
9797
# [PREREQUISITE]
98-
# User needs to create a service account for frameworkbarrier, if the
98+
# User needs to create a ServiceAccount for frameworkbarrier, if the
9999
# k8s cluster enforces authorization.
100100
# See more in ./example/framework/extension/frameworkbarrier.yaml
101101
serviceAccountName: frameworkbarrier

example/run/README.md

Lines changed: 6 additions & 6 deletions
@@ -16,7 +16,7 @@ Notes:
1616

1717
### Prerequisite
1818

19-
If the k8s cluster enforces [Authorization](https://kubernetes.io/docs/reference/access-authn-authz/authorization/#using-flags-for-your-authorization-module), you need to first create a [Service Account](https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account) with granted permission for FrameworkController. For example, if the cluster enforces [RBAC](https://kubernetes.io/docs/reference/access-authn-authz/rbac/#kubectl-create-clusterrolebinding):
19+
If the k8s cluster enforces [Authorization](https://kubernetes.io/docs/reference/access-authn-authz/authorization/#using-flags-for-your-authorization-module), you need to first create a [ServiceAccount](https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account) with granted permission for FrameworkController. For example, if the cluster enforces [RBAC](https://kubernetes.io/docs/reference/access-authn-authz/rbac/#kubectl-create-clusterrolebinding):
2020
```shell
2121
kubectl create serviceaccount frameworkcontroller --namespace default
2222
kubectl create clusterrolebinding frameworkcontroller \
@@ -26,7 +26,7 @@ kubectl create clusterrolebinding frameworkcontroller \
2626

2727
### Run
2828

29-
Run FrameworkController with above Service Account and the [k8s inClusterConfig](https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#accessing-the-api-from-a-pod):
29+
Run FrameworkController with above ServiceAccount and the [k8s inClusterConfig](https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#accessing-the-api-from-a-pod):
3030

3131
#### Run with [default config](../../example/config/default/frameworkcontroller.yaml)
3232
```shell
@@ -51,7 +51,7 @@ spec:
5151
labels:
5252
app: frameworkcontroller
5353
spec:
54-
# Using the service account with granted permission
54+
# Using the ServiceAccount with granted permission
5555
# if the k8s cluster enforces authorization.
5656
serviceAccountName: frameworkcontroller
5757
containers:
@@ -115,7 +115,7 @@ spec:
115115
labels:
116116
app: frameworkcontroller
117117
spec:
118-
# Using the service account with granted permission
118+
# Using the ServiceAccount with granted permission
119119
# if the k8s cluster enforces authorization.
120120
serviceAccountName: frameworkcontroller
121121
containers:
@@ -133,8 +133,8 @@ spec:
133133
"cp /frameworkcontroller-config/frameworkcontroller.yaml . &&
134134
./start.sh"]
135135
volumeMounts:
136-
- name: frameworkcontroller-config
137-
mountPath: /frameworkcontroller-config
136+
- name: frameworkcontroller-config
137+
mountPath: /frameworkcontroller-config
138138
volumes:
139139
- name: frameworkcontroller-config
140140
configMap:

pkg/apis/frameworkcontroller/v1/crd.go

Lines changed: 1 addition & 0 deletions
@@ -74,6 +74,7 @@ func buildFrameworkValidation() *apiExtensions.CustomResourceValidation {
7474
Properties: map[string]apiExtensions.JSONSchemaProps{
7575
"executionType": {
7676
Enum: []apiExtensions.JSON{
77+
{Raw: []byte(common.Quote(string(ExecutionCreate)))},
7778
{Raw: []byte(common.Quote(string(ExecutionStart)))},
7879
{Raw: []byte(common.Quote(string(ExecutionStop)))},
7980
},

pkg/apis/frameworkcontroller/v1/types.go

Lines changed: 14 additions & 3 deletions
@@ -70,8 +70,7 @@ type Framework struct {
7070
// Spec
7171
//////////////////////////////////////////////////////////////////////////////////////////////////
7272
type FrameworkSpec struct {
73-
Description string `json:"description"`
74-
// Only support to update from ExecutionStart to ExecutionStop
73+
Description string `json:"description"`
7574
ExecutionType ExecutionType `json:"executionType"`
7675
RetryPolicy RetryPolicySpec `json:"retryPolicy"`
7776
TaskRoles []*TaskRoleSpec `json:"taskRoles"`
@@ -115,11 +114,23 @@ type TaskSpec struct {
115114
Pod core.PodTemplateSpec `json:"pod"`
116115
}
117116

117+
// User can set any ExecutionType when creating a Framework, and can then choose
118+
// whether to change the ExecutionType.
119+
// However, only below changes are supported:
120+
// 1. ExecutionCreate -> ExecutionStart/ExecutionStop
121+
// 2. ExecutionStart -> ExecutionStop
118122
type ExecutionType string
119123

120124
const (
125+
// The Framework will be kept in FrameworkAttemptCreationPending.
126+
// So it will never start to run or complete.
127+
ExecutionCreate ExecutionType = "Create"
128+
// The Framework will be transitioned from FrameworkAttemptCreationPending.
129+
// So it will immediately start to run.
121130
ExecutionStart ExecutionType = "Start"
122-
ExecutionStop ExecutionType = "Stop"
131+
// The Framework will be transitioned to FrameworkCompleted.
132+
// So it will immediately start to complete.
133+
ExecutionStop ExecutionType = "Stop"
123134
)
124135

125136
// RetryPolicySpec can be configured for the whole Framework and each TaskRole
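
The supported ExecutionType changes listed in the comment above can be expressed as a small lookup table. The sketch below is illustrative and is not code from this commit: `allowedTransitions` and `CanTransition` are hypothetical names, and the commit itself does not show where, or whether, such a check is enforced.

```go
package main

import "fmt"

// ExecutionType and its values mirror the declarations in the diff above.
type ExecutionType string

const (
	ExecutionCreate ExecutionType = "Create"
	ExecutionStart  ExecutionType = "Start"
	ExecutionStop   ExecutionType = "Stop"
)

// allowedTransitions (hypothetical) encodes the two supported changes:
//  1. ExecutionCreate -> ExecutionStart/ExecutionStop
//  2. ExecutionStart  -> ExecutionStop
var allowedTransitions = map[ExecutionType][]ExecutionType{
	ExecutionCreate: {ExecutionStart, ExecutionStop},
	ExecutionStart:  {ExecutionStop},
}

// CanTransition (hypothetical helper) reports whether updating the
// ExecutionType from one value to another is supported. Leaving the value
// unchanged is always allowed.
func CanTransition(from, to ExecutionType) bool {
	if from == to {
		return true
	}
	for _, next := range allowedTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(ExecutionCreate, ExecutionStart)) // true
	fmt.Println(CanTransition(ExecutionStart, ExecutionStop))   // true
	fmt.Println(CanTransition(ExecutionStop, ExecutionStart))   // false
}
```

Since `ExecutionStop` has no entry in the table, a stopped Framework cannot be moved back to `Create` or `Start`.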

pkg/controller/controller.go

Lines changed: 6 additions & 0 deletions
@@ -1214,6 +1214,12 @@ func (c *FrameworkController) syncFrameworkState(f *ci.Framework) (err error) {
12141214
return nil
12151215
}
12161216

1217+
if f.Spec.ExecutionType == ci.ExecutionCreate {
1218+
klog.Infof(logPfx + "Skip to createFrameworkAttempt: " +
1219+
"User has requested to just create the Framework without starting it")
1220+
return nil
1221+
}
1222+
12171223
if f.Spec.ExecutionType == ci.ExecutionStop {
12181224
diag := "User has requested to stop the Framework"
12191225
klog.Info(logPfx + diag)
