This repository was archived by the owner on Nov 16, 2023. It is now read-only.

Commit 4b5707f

Expose Task History (#62)
1 parent 959722c commit 4b5707f

38 files changed, with 2089 additions and 160 deletions.

Gopkg.lock

Lines changed: 20 additions & 1 deletion
Some generated files are not rendered by default.

doc/user-manual.md

Lines changed: 8 additions & 6 deletions
@@ -3,7 +3,7 @@
  ## <a name="Index">Index</a>
  - [Framework Interop](#FrameworkInterop)
  - [Framework ExecutionType](#FrameworkExecutionType)
- - [Container EnvironmentVariable](#ContainerEnvironmentVariable)
+ - [Predefined Container EnvironmentVariable](#PredefinedContainerEnvironmentVariable)
  - [Pod Failure Classification](#PodFailureClassification)
  - [Predefined CompletionCode](#PredefinedCompletionCode)
  - [CompletionStatus](#CompletionStatus)
@@ -475,8 +475,10 @@ spec:
  3. [Get Framework](#GET_Framework), and archive it into a DataBase first.
  4. [Delete Framework](#DELETE_Framework), then the Framework will be deleted.

- ## <a name="ContainerEnvironmentVariable">Container EnvironmentVariable</a>
- [Container EnvironmentVariable](../pkg/apis/frameworkcontroller/v1/constants.go)
+ ## <a name="PredefinedContainerEnvironmentVariable">Predefined Container EnvironmentVariable</a>
+ [Predefined Container EnvironmentVariable](../pkg/apis/frameworkcontroller/v1/constants.go)
+
+ [Framework Example](../example/framework/basic/batchstatefulfailed.yaml)

  ## <a name="PodFailureClassification">Pod Failure Classification</a>
  You can specify how to classify and summarize Pod failures by the [PodFailureSpec](../pkg/apis/frameworkcontroller/v1/config.go).
@@ -842,7 +844,7 @@ Besides these general [Framework ConsistencyGuarantees](#ConsistencyGuarantees),
  To safely run large scale Framework, i.e. the total task number in a single Framework is greater than 300, you just need to enable the [LargeFrameworkCompression](../pkg/apis/frameworkcontroller/v1/config.go). However, you may also need to decompress the Framework by yourself.

  ## <a name="FrameworkPodHistory">Framework and Pod History</a>
- By leveraging the [LogObjectSnapshot](../pkg/apis/frameworkcontroller/v1/config.go), external systems, such as [Fluentd](https://www.fluentd.org) and [ElasticSearch](https://www.elastic.co/products/elasticsearch), can collect and process Framework and Pod history snapshots even if it was retried or deleted, such as for persistence, metrics conversion, visualization, alerting, acting, analysis, etc.
+ By leveraging the [LogObjectSnapshot](../pkg/apis/frameworkcontroller/v1/config.go), external systems, such as [Fluentd](https://www.fluentd.org) and [ElasticSearch](https://www.elastic.co/products/elasticsearch), can collect and process Framework, Task and Pod history snapshots even if it was retried or deleted, such as for persistence, metrics conversion, visualization, alerting, acting, analysis, etc.

  ## <a name="FrameworkTaskStateMachine">Framework and Task State Machine</a>
  ### <a name="FrameworkStateMachine">Framework State Machine</a>
@@ -894,7 +896,7 @@ The default behavior is to achieve all the [ConsistencyGuarantees](#ConsistencyG

  For example, [drain the Node](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node) before delete it is acceptable.

- *The Task instance can be universally located by its [TaskAttemptInstanceUID](../pkg/apis/frameworkcontroller/v1/types.go) or [PodUID](../pkg/apis/frameworkcontroller/v1/types.go).*
+ *The Task running instance can be universally located by its [TaskAttemptInstanceUID](../pkg/apis/frameworkcontroller/v1/types.go) or [PodUID](../pkg/apis/frameworkcontroller/v1/types.go).*

  *To avoid the Pod is stuck in deleting forever, such as if its Node is down forever, leverage the same approach as [Delete StatefulSet Pod only after the Pod termination has been confirmed](https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/#delete-pods) manually or by your [Cloud Controller Manager](https://kubernetes.io/docs/tasks/administer-cluster/running-cloud-controller/#running-cloud-controller-manager).*

@@ -911,7 +913,7 @@ The default behavior is to achieve all the [ConsistencyGuarantees](#ConsistencyG

  4. Do not change the [OwnerReferences](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents) of the managed ConfigMap and Pods.

- *The Framework instance can be universally located by its [FrameworkAttemptInstanceUID](../pkg/apis/frameworkcontroller/v1/types.go) or [ConfigMapUID](../pkg/apis/frameworkcontroller/v1/types.go).*
+ *The Framework running instance can be universally located by its [FrameworkAttemptInstanceUID](../pkg/apis/frameworkcontroller/v1/types.go) or [ConfigMapUID](../pkg/apis/frameworkcontroller/v1/types.go).*

  ### <a name="FrameworkAvailability">Framework Availability</a>
  According to the [CAP theorem](https://en.wikipedia.org/wiki/CAP_theorem), in the presence of a network partition, you cannot achieve both consistency and availability at the same time in any distributed system. So you have to make a trade-off between the [Framework Consistency](#FrameworkConsistency) and the [Framework Availability](#FrameworkAvailability).

example/framework/basic/batchstatefulfailed.yaml

Lines changed: 22 additions & 13 deletions
@@ -27,21 +27,30 @@ spec:
  - name: ubuntu
  image: ubuntu:trusty
  # To locate a specific Task during its whole lifecycle regardless of
- # any retry:
+ # any retry and rescale:
  # Consistent Identity:
- # PodNamespace = {FrameworkNamespace}
- # PodName = {FrameworkName}-{TaskRoleName}-{TaskIndex}
+ # PodNamespace = {FrameworkNamespace}
+ # PodName = {FrameworkName}-{TaskRoleName}-{TaskIndex}
  # Consistent Environment Variable Value:
- # ${FC_FRAMEWORK_NAMESPACE},
- # ${FC_FRAMEWORK_NAME}, ${FC_TASKROLE_NAME}, ${FC_TASK_INDEX},
- # ${FC_CONFIGMAP_NAME}, ${FC_POD_NAME}
+ # ${FC_FRAMEWORK_NAMESPACE}
+ # ${FC_FRAMEWORK_NAME}
+ # ${FC_TASKROLE_NAME}
+ # ${FC_TASK_INDEX}
  #
- # To locate a specific execution attempt of a specific Task:
- # Attempt Specific Environment Variable Value:
- # ${FC_FRAMEWORK_ATTEMPT_ID}, ${FC_TASK_ATTEMPT_ID}
+ # To locate a specific Task instance, in case the Task is deleted then
+ # added by rescale with a different Task instance:
+ # Environment Variable Value:
+ # ${FC_TASK_UID}
  #
- # To locate a specific execution attempt instance of a specific Task:
- # Attempt Instance Specific Environment Variable Value:
- # ${FC_FRAMEWORK_ATTEMPT_INSTANCE_UID}, ${FC_CONFIGMAP_UID}
- # ${FC_TASK_ATTEMPT_INSTANCE_UID}, ${FC_POD_UID}
+ # To locate a specific execution attempt of a specific Task instance:
+ # Environment Variable Value:
+ # ${FC_TASK_UID}
+ # ${FC_TASK_ATTEMPT_ID}
+ #
+ # To locate a specific execution attempt instance of a specific Task
+ # instance, in case the attempt instance, i.e. the Pod instance is
+ # created but not observed by FrameworkController, then it is deleted
+ # and created later with a different attempt instance:
+ # Environment Variable Value:
+ # ${FC_TASK_ATTEMPT_INSTANCE_UID}
  command: ["sh", "-c", "printenv && sleep 60 && exit 1"]
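
The identity levels described in the comments above could be consumed from inside a container roughly as follows. This is a minimal sketch, not code from this commit: the `TaskIdentity` type, `LoadTaskIdentity`, and `PodName` are hypothetical names, and only the `FC_*` variable names and the `{FrameworkName}-{TaskRoleName}-{TaskIndex}` pattern come from the example file.

```go
package main

import (
	"fmt"
	"os"
)

// TaskIdentity is a hypothetical grouping of the predefined environment
// variables named in the example comments.
type TaskIdentity struct {
	FrameworkNamespace     string // FC_FRAMEWORK_NAMESPACE: stable across retry and rescale
	FrameworkName          string // FC_FRAMEWORK_NAME
	TaskRoleName           string // FC_TASKROLE_NAME
	TaskIndex              string // FC_TASK_INDEX
	TaskUID                string // FC_TASK_UID: identifies one Task instance
	TaskAttemptID          string // FC_TASK_ATTEMPT_ID: identifies one attempt of that instance
	TaskAttemptInstanceUID string // FC_TASK_ATTEMPT_INSTANCE_UID: identifies one attempt instance
}

// LoadTaskIdentity reads the variables through getenv so that callers can
// pass os.Getenv in a container or a stub elsewhere.
func LoadTaskIdentity(getenv func(string) string) TaskIdentity {
	return TaskIdentity{
		FrameworkNamespace:     getenv("FC_FRAMEWORK_NAMESPACE"),
		FrameworkName:          getenv("FC_FRAMEWORK_NAME"),
		TaskRoleName:           getenv("FC_TASKROLE_NAME"),
		TaskIndex:              getenv("FC_TASK_INDEX"),
		TaskUID:                getenv("FC_TASK_UID"),
		TaskAttemptID:          getenv("FC_TASK_ATTEMPT_ID"),
		TaskAttemptInstanceUID: getenv("FC_TASK_ATTEMPT_INSTANCE_UID"),
	}
}

// PodName rebuilds the consistent Pod name from the pattern
// {FrameworkName}-{TaskRoleName}-{TaskIndex} given in the example comments.
func (t TaskIdentity) PodName() string {
	return fmt.Sprintf("%s-%s-%s", t.FrameworkName, t.TaskRoleName, t.TaskIndex)
}

func main() {
	id := LoadTaskIdentity(os.Getenv)
	fmt.Printf("pod=%s/%s taskUID=%s attempt=%s attemptInstance=%s\n",
		id.FrameworkNamespace, id.PodName(), id.TaskUID,
		id.TaskAttemptID, id.TaskAttemptInstanceUID)
}
```

Taking the lookup function as a parameter keeps the sketch testable outside a Pod, where the `FC_*` variables are not set.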

pkg/apis/frameworkcontroller/v1/config.go

Lines changed: 13 additions & 9 deletions
@@ -117,7 +117,7 @@ type Config struct {
  // analysis, etc.
  // Notes:
  // 1. The snapshot is logged to stderr and can be extracted by the regular
- // expression ": ObjectSnapshot: (.+)".
+ // expression ": ObjectSnapshot: (.+)", see LogMarkerObjectSnapshot.
  // 2. To determine the type of the snapshot, using object.apiVersion and
  // object.kind.
  // 3. The same snapshot may be logged more than once in some rare cases, so
@@ -149,16 +149,20 @@ type Config struct {

  type LogObjectSnapshot struct {
  Framework LogFrameworkSnapshot `yaml:"framework"`
+ Task LogTaskSnapshot `yaml:"task"`
  Pod LogPodSnapshot `yaml:"pod"`
  }

  type LogFrameworkSnapshot struct {
- OnTaskRetry *bool `yaml:"onTaskRetry"`
  OnFrameworkRetry *bool `yaml:"onFrameworkRetry"`
- OnFrameworkRescale *bool `yaml:"onFrameworkRescale"`
  OnFrameworkDeletion *bool `yaml:"onFrameworkDeletion"`
  }

+ type LogTaskSnapshot struct {
+ OnTaskRetry *bool `yaml:"onTaskRetry"`
+ OnTaskDeletion *bool `yaml:"onTaskDeletion"`
+ }
+
  type LogPodSnapshot struct {
  OnPodDeletion *bool `yaml:"onPodDeletion"`
  }
@@ -254,18 +258,18 @@ func NewConfig() *Config {
  if c.FrameworkMaxRetryDelaySecForTransientConflictFailed == nil {
  c.FrameworkMaxRetryDelaySecForTransientConflictFailed = common.PtrInt64(15 * 60)
  }
- if c.LogObjectSnapshot.Framework.OnTaskRetry == nil {
- c.LogObjectSnapshot.Framework.OnTaskRetry = common.PtrBool(true)
- }
  if c.LogObjectSnapshot.Framework.OnFrameworkRetry == nil {
  c.LogObjectSnapshot.Framework.OnFrameworkRetry = common.PtrBool(true)
  }
- if c.LogObjectSnapshot.Framework.OnFrameworkRescale == nil {
- c.LogObjectSnapshot.Framework.OnFrameworkRescale = common.PtrBool(true)
- }
  if c.LogObjectSnapshot.Framework.OnFrameworkDeletion == nil {
  c.LogObjectSnapshot.Framework.OnFrameworkDeletion = common.PtrBool(true)
  }
+ if c.LogObjectSnapshot.Task.OnTaskRetry == nil {
+ c.LogObjectSnapshot.Task.OnTaskRetry = common.PtrBool(true)
+ }
+ if c.LogObjectSnapshot.Task.OnTaskDeletion == nil {
+ c.LogObjectSnapshot.Task.OnTaskDeletion = common.PtrBool(true)
+ }
  if c.LogObjectSnapshot.Pod.OnPodDeletion == nil {
  c.LogObjectSnapshot.Pod.OnPodDeletion = common.PtrBool(true)
  }
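
The defaulting in `NewConfig` relies on `*bool` fields so that "not set" (nil) can be told apart from an explicit `false`. A self-contained sketch of that pattern follows; `ptrBool` and `defaultTaskSnapshot` are hypothetical stand-ins for `common.PtrBool` and the relevant part of `NewConfig`, and the struct is a local copy of `LogTaskSnapshot` from the hunk above.

```go
package main

import "fmt"

// LogTaskSnapshot is a local copy of the struct added in this commit.
// Pointer fields let nil mean "not configured".
type LogTaskSnapshot struct {
	OnTaskRetry    *bool `yaml:"onTaskRetry"`
	OnTaskDeletion *bool `yaml:"onTaskDeletion"`
}

// ptrBool is a hypothetical stand-in for common.PtrBool.
func ptrBool(b bool) *bool { return &b }

// defaultTaskSnapshot fills unset fields with true and leaves explicit
// values, including false, untouched.
func defaultTaskSnapshot(s *LogTaskSnapshot) {
	if s.OnTaskRetry == nil {
		s.OnTaskRetry = ptrBool(true)
	}
	if s.OnTaskDeletion == nil {
		s.OnTaskDeletion = ptrBool(true)
	}
}

func main() {
	// The user disabled OnTaskDeletion and left OnTaskRetry unset.
	s := LogTaskSnapshot{OnTaskDeletion: ptrBool(false)}
	defaultTaskSnapshot(&s)
	fmt.Println(*s.OnTaskRetry, *s.OnTaskDeletion) // prints "true false"
}
```

With plain `bool` fields, an explicit `false` would be indistinguishable from the zero value and would be overwritten by the default.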

pkg/apis/frameworkcontroller/v1/constants.go

Lines changed: 20 additions & 0 deletions
@@ -38,6 +38,7 @@ const (
  FrameworkPlural = "frameworks"
  FrameworkCRDName = FrameworkPlural + "." + GroupName
  FrameworkKind = "Framework"
+ TaskKind = "Task"
  ConfigMapKind = "ConfigMap"
  PodKind = "Pod"
  ObjectUIDFieldPath = "metadata.uid"
@@ -56,9 +57,12 @@ const (
  AnnotationKeyConfigMapName = "FC_CONFIGMAP_NAME"
  AnnotationKeyPodName = "FC_POD_NAME"

+ AnnotationKeyFrameworkUID = "FC_FRAMEWORK_UID"
  AnnotationKeyFrameworkAttemptID = "FC_FRAMEWORK_ATTEMPT_ID"
  AnnotationKeyFrameworkAttemptInstanceUID = "FC_FRAMEWORK_ATTEMPT_INSTANCE_UID"
  AnnotationKeyConfigMapUID = "FC_CONFIGMAP_UID"
+ AnnotationKeyTaskRoleUID = "FC_TASKROLE_UID"
+ AnnotationKeyTaskUID = "FC_TASK_UID"
  AnnotationKeyTaskAttemptID = "FC_TASK_ATTEMPT_ID"

  // Predefined Labels
@@ -79,9 +83,12 @@ const (
  EnvNameConfigMapName = AnnotationKeyConfigMapName
  EnvNamePodName = AnnotationKeyPodName

+ EnvNameFrameworkUID = AnnotationKeyFrameworkUID
  EnvNameFrameworkAttemptID = AnnotationKeyFrameworkAttemptID
  EnvNameFrameworkAttemptInstanceUID = AnnotationKeyFrameworkAttemptInstanceUID
  EnvNameConfigMapUID = AnnotationKeyConfigMapUID
+ EnvNameTaskRoleUID = AnnotationKeyTaskRoleUID
+ EnvNameTaskUID = AnnotationKeyTaskUID
  EnvNameTaskAttemptID = AnnotationKeyTaskAttemptID
  EnvNameTaskAttemptInstanceUID = "FC_TASK_ATTEMPT_INSTANCE_UID"
  EnvNamePodUID = "FC_POD_UID"
@@ -98,9 +105,22 @@ const (
  PlaceholderTaskIndex = AnnotationKeyTaskIndex
  PlaceholderConfigMapName = AnnotationKeyConfigMapName
  PlaceholderPodName = AnnotationKeyPodName
+
+ // For LogObjectSnapshot
+ // All snapshots are logged in format:
+ // {AnyLogMessage}{ObjectSnapshotTrigger}{LogMarkerObjectSnapshot}{JsonObjectSnapshot}
+ LogMarkerObjectSnapshot = ": ObjectSnapshot: "
+ LogMarkerOnFrameworkRetry ObjectSnapshotTrigger = ": OnFrameworkRetry"
+ LogMarkerOnFrameworkDeletion ObjectSnapshotTrigger = ": OnFrameworkDeletion"
+ LogMarkerOnTaskRetry ObjectSnapshotTrigger = ": OnTaskRetry"
+ LogMarkerOnTaskDeletion ObjectSnapshotTrigger = ": OnTaskDeletion"
+ LogMarkerOnPodDeletion ObjectSnapshotTrigger = ": OnPodDeletion"
  )

+ type ObjectSnapshotTrigger string
+
  var FrameworkGroupVersionKind = SchemeGroupVersion.WithKind(FrameworkKind)
+ var TaskGroupVersionKind = SchemeGroupVersion.WithKind(TaskKind)
  var ConfigMapGroupVersionKind = core.SchemeGroupVersion.WithKind(ConfigMapKind)
  var PodGroupVersionKind = core.SchemeGroupVersion.WithKind(PodKind)
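
A log consumer could use the documented format `{AnyLogMessage}{ObjectSnapshotTrigger}{LogMarkerObjectSnapshot}{JsonObjectSnapshot}` to pull out both the trigger and the snapshot. The sketch below is an illustration under assumptions, not part of this commit: `ParseSnapshotLine` and `objectMeta` are hypothetical names, the regular expression is one possible extension of the documented `": ObjectSnapshot: (.+)"` that also captures the trigger, and the sample log line and its `apiVersion` value are invented.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// snapshotLine captures the trigger name (without its ": " prefix) and the
// JSON payload that follows the ": ObjectSnapshot: " marker.
var snapshotLine = regexp.MustCompile(`: (On[A-Za-z]+): ObjectSnapshot: (.+)`)

// objectMeta holds only the fields needed to determine the snapshot type,
// i.e. object.apiVersion and object.kind.
type objectMeta struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
}

// ParseSnapshotLine returns the trigger and type of the snapshot in a log
// line, or ok=false if the line carries no well-formed snapshot.
func ParseSnapshotLine(line string) (trigger string, meta objectMeta, ok bool) {
	m := snapshotLine.FindStringSubmatch(line)
	if m == nil {
		return "", objectMeta{}, false
	}
	if err := json.Unmarshal([]byte(m[2]), &meta); err != nil {
		return "", objectMeta{}, false
	}
	return m[1], meta, true
}

func main() {
	// Invented sample line; real log prefixes and apiVersion values may differ.
	line := `[default/fw][worker][0]: OnTaskRetry: ObjectSnapshot: {"apiVersion":"example/v1","kind":"Task"}`
	if trigger, meta, ok := ParseSnapshotLine(line); ok {
		fmt.Println(trigger, meta.APIVersion, meta.Kind)
	}
}
```

Because the same snapshot may occasionally be logged more than once, a real consumer would likely also need to deduplicate on a stable key from the snapshot body, which this sketch does not attempt.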
