Commit 3fb88ba

Merge pull request #39 from pfnet-research/infinite-max-retry
Support infinite retry. Task.Status.History now records only a limited number of recent processing records.
2 parents 5599568 + 3f3abd8 commit 3fb88ba

File tree: 4 files changed (+58, -46 lines)


README.md

Lines changed: 42 additions & 35 deletions

@@ -319,13 +319,15 @@ name: "this is just an display name"
 # redis: 1KB
 payload: |
 You can define any task information in payload
-# retryLimit max value varies on backend type to prevent from overloading backend.
-# redis: 10
+# retryLimit is the maximum number of retry (negative number means infinite)
+# NOTE: only the limited number of recent task records will be recorded in its status.
+# so, if you set large value or infinite here, you will loose old task records.
+# please see the description of status.history field in the next section.
 retryLimit: 3
 # timeoutSeconds is for task handler timeout.
 # If set positive value, task handler timeout for processing this task.
 # Otherwise, worker's default timeout will be applied. (See 'Task Queue Worker' section)
-timeoutSeconds: 600
+timeoutSeconds: 600
 ```
 
 #### `Task`
@@ -342,8 +344,8 @@ spec:
 rertryLimit: 3
 timeoutSeconds: 100
 status:
-# Phase of the task.
-# See below section for task lifecycle
+# Phase of the task.
+# See below section for task lifecycle
 phase: Processing
 createdAt: 2020-02-12T20:20:29.350631+09:00
 # Failure count of the task
@@ -356,19 +358,24 @@ status:
 processUID: 7b7b39f5-da66-4380-8002-033dff0e0f26
 # worker name received the task
 workerName: everpeace-macbookpro-2018.local
-# This value is unique among pftaskqueue worker processes
+# This value is unique among pftaskqueue worker processes
 workerUID: 15bfe424-889a-49ca-88d7-fb0fc51f68d
 # timestamps
 receivedAt: 2020-02-12T20:20:39.350631+09:00
 startedAt: 2020-02-12T20:20:39.351479+09:00
-# history of processing the task.
+# history of recent records of processing the task.
+# the limited number of recent records are recorded in this field.
+# the value varies on backend types to prvent overloading backends:
+# - redis: 10 entries
+# NOTE: so, if you set larger value than this limit in spec.rertryLimit,
+# you will loose old task records.
 history:
-# TaskRecord:
-# this represents a record of task handler invocation
+# TaskRecord:
+# this represents a record of task handler invocation
 # in a specific worker.
 # UID of process(generated by pftaskqueue)
 - processUID: 194b8ad9-543b-4d01-a571-f8d1db2e74e6
-# worker name & UID which received the task
+# worker name & UID which received the task
 workerName: everpeace-macbookpro-2018.local
 workerUID: 06c333f8-aeab-4a3d-8624-064a305c53ff
 # timestamps
@@ -387,8 +394,8 @@ status:
 # - Signaled: pftaskqueue worker signaled and the task handler was interrupted
 # - InternalError: pftaskqueue worker faced some error processing a task
 reason: Succeeded
-# Returned values from the task handlers.
-# See 'Task Handler Specification' section for how pftaskqueue worker communicates
+# Returned values from the task handlers.
+# See 'Task Handler Specification' section for how pftaskqueue worker communicates
 # with its task handler processes.
 # payload max size varies on backend type to prevent from overloading backend.
 # redis: 1KB
@@ -398,7 +405,7 @@ status:
 # redis: 1KB
 # If size exceeded, the contents will be truncated automatically
 message: ""
-# Below two fields will be set if the worker which processes the task was
+# Below two fields will be set if the worker which processes the task was
 # lost and salvaged by the other worker.
 # See "Worker lifecycle" section below for details.
 # salvagedBy: <workerUID>
@@ -421,10 +428,10 @@ status:
 
 If you queued your `TaskSpec`, `pftaskqueue` assign UID to it and generate `Task` with `Pending` phase for it. Some worker pulled a `Pending` task from the queue, `Task` transits to `Received` phase. When `Task` actually stared to be processed by task handler process, it transits to `Processing` phase.
 
-Once task handler process succeeded, `Task` transits to `Succeeded` phase. If task handler process failed, `pftaskqueue` can handle automatic retry feature with respect to `TaskSpec.retryLimit`. If the task handler process failed and it didn't reach at its retry limit, `pftaskqueue` re-queue the task with setting `Pending` phase again. Otherwise `pftaskqueue` will give up retry and mark it `Failed` phase. You can see all the process record of the `Task` status.
+Once task handler process succeeded, `Task` transits to `Succeeded` phase. If task handler process failed, `pftaskqueue` can handle automatic retry feature with respect to `TaskSpec.retryLimit`. If the task handler process failed and it didn't reach at its retry limit, `pftaskqueue` re-queue the task with setting `Pending` phase again. Otherwise `pftaskqueue` will give up retry and mark it `Failed` phase. You can see all the process record of the `Task` status.
 
 If worker was signaled, tasks in `Received` or `Processing` phase will be treated as failure and `pftaskqueue` will handle automatic retry feature.
-
+
 ```yaml
 $ pftaskqueue get-task [queue] --state=failed -o yaml
 ...
@@ -510,13 +517,13 @@ worker:
 # This value will be used when TaskSpec.timeoutSeconds is not set or 0.
 defaultTimeout: 30m0s
 # Task Handler Command
-# A Worker spawns a process with the command for each received tasks
+# A Worker spawns a process with the command for each received tasks
 commands:
 - cat
 # Worker heartbeat configuration to detect worker process existence
 # Please see "Worker lifecycle" section
 heartBeat:
-# A Worker process tries to update its Worker.Status.lastHeartBeatAt field
+# A Worker process tries to update its Worker.Status.lastHeartBeatAt field
 # stored in queue backend in this interval
 interval: 2s
 # A Worker.Status.lastHeartBeatAt will be determined "expired"
@@ -531,7 +538,7 @@ worker:
 exitOnEmpty: false
 # If exitOnEmpty is true, worker waits for exit in the grace period
 exitOnEmptyGracePeriod: 10s
-# If the value was positive, worker will exit
+# If the value was positive, worker will exit
 # after processing the number of tasks
 numTasks: 1000
 # Base directory to create workspace for task handler processes
@@ -590,10 +597,10 @@ status:
 +------------+
 ```
 
-Once worker started, it starts with `Running` phase. In the startup, a worker register self to the queue and get its UID. The UID becomes the identifier of workers. If worker exited normally (with `exit-code=0`), it transits `Succeeded` phase. If `exit-code` was not 0, it transits to `Failed` phase.
+Once worker started, it starts with `Running` phase. In the startup, a worker register self to the queue and get its UID. The UID becomes the identifier of workers. If worker exited normally (with `exit-code=0`), it transits `Succeeded` phase. If `exit-code` was not 0, it transits to `Failed` phase.
 
 However, worker process was go away by various reasons (`SIGKILL`-ed, `OOMKiller`, etc.). Then, how to detect those worker's sate? `pftaskquue` applies simple timeout based heuristics. A worker process keeps sending heartbeat during it runs, with configured interval, to the queue by updating its `Status.lastHeartBeatAt` field. If the heartbeat became older then configured expiration duration, the worker was determined as 'Lost' state (`phase=Failed, reason=Lost`). Moreover when a worker detects their own heartbeat expired, they exited by their selves to wait they will be salvaged by other workers.
-
+
 On every worker startup, a worker tries to find `Lost` workers which are safe to be salvaged. `pftaskqueue` also used simple timeout-based heuristics in salvation, too. If time passed `Worker.HeartBeat.SalvagedDuration` after its heartbeat expiration, the worker is determined as a salvation target. Once the worker finds some salvation target workers, it will salvage the worker. "Salvation" means
 
 - marks the target `Salvaged` phase (`phase=Failed, reason=Salvaged`)
@@ -626,26 +633,26 @@ pftaskqueue get-worker [queue] --state=[all,running,succeeded,failed,lost,tosalv
 ```
 {workspace direcoty}
 
-│ # pftaskqueue prepares whole the contents
-├── input
+│ # pftaskqueue prepares whole the contents
+├── input
 │   ├── payload # TaskSpec.payload in text format
 │   ├── retryLimit # TaskSpec.retryLimit in text format
 │   ├── timeoutSeconds # TaskSpec.timeoutSeconds in text format
 │   └── meta
 │      ├── taskUID # taskUID of the task in text format
-│      ├── processUID # prrocessUID of the task handler process
+│      ├── processUID # prrocessUID of the task handler process
 │      ├── task.json # whole task information in JSON format
-│      ├── workerName # workerName of the worker process
-│      ├── workerUID # workerUID of the worker process
-│      └── workerConfig.json # whole workerConfig information in JSON format
+│      ├── workerName # workerName of the worker process
+│      ├── workerUID # workerUID of the worker process
+│      └── workerConfig.json # whole workerConfig information in JSON format
 
 │ # pftaskqueue just creates the directory
 │ # If any error happened in reading files in the directory, the task fails with the TaskResult below.
 │ # type: "Failure"
 │ # reason: "InternalError"
 │ # message: "...error message..."
 │ # payload: null
-└── output
+└── output
 ├── payload # If the file exists, the contents will record in TaskResult.payload. Null otherwise.
 │ # Max size of the payload varies on backend type to avoid from overloading backend
 │ # redis: 1KB
@@ -659,7 +666,7 @@ pftaskqueue get-worker [queue] --state=[all,running,succeeded,failed,lost,tosalv
 # e.g. [{"payload": "foo", "retryLimit": "3", "timeout": "10m"}]
 
 3 directories, 12 files
-```
+```
 
 ## Dead letters
 
@@ -681,7 +688,7 @@ $ pftaskqueue get-task [queue] --state=deadletter --output yaml
 ...
 ```
 
-## Managing configurations
+## Managing configurations
 
 `pftaskqueue` has a lot of configuration parameters. `pftaskqueue` provides multiple ways to configure them. `pftaskqueue` reads configuraton parameter in the following precedence order. Each item takes precedence over the item below it:
 
@@ -738,16 +745,16 @@ redis:
 # key prefix of redis database
 # all the key used pftaskqueue was prefixed by '_pftaskqueue:{keyPrefix}:`
 keyPrefix: omura
-
+
 # redis server information(addr, password, db)
 addr: ""
 password: ""
 db: 0
-
+
 #
 # timeout/connection pool setting
 # see also: https://github.com/go-redis/redis/blob/a579d58c59af2f8cefbb7f90b8adc4df97f4fd8f/options.go#L59-L95
-#
+#
 dialTimeout: 5s
 readTimeout: 3s
 writeTimeout: 3s
@@ -757,9 +764,9 @@ redis:
 poolTimeout: 4s
 idleTimeout: 5m0s
 idleCheckFrequency: 1m0s
-
+
 #
-# pftaskqueue will retry when redis operation failed
+# pftaskqueue will retry when redis operation failed
 # in exponential backoff manner.
 # you can configure backoff parameters below
 #
@@ -771,7 +778,7 @@ redis:
 maxElapsedTime: 1m0s
 # max retry count. -1 means no limit.
 maxRetry: -1
-```
+```
 
 ## Bash/Zsh completion
 
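The retryLimit semantics documented in the README diff above (a negative value means retry forever; otherwise the task fails once its failure count exceeds the limit) can be sketched in Go. `shouldRequeue` is a hypothetical stand-alone helper for illustration, not pftaskqueue's actual API:

```go
package main

import "fmt"

// shouldRequeue is a hypothetical helper mirroring the documented
// semantics: a negative retryLimit means infinite retry; otherwise a
// task is re-queued only while its failure count stays within the limit.
func shouldRequeue(retryLimit, failureCount int) bool {
	if retryLimit < 0 {
		return true // negative limit: retry forever
	}
	return failureCount <= retryLimit
}

func main() {
	fmt.Println(shouldRequeue(3, 3))    // true: 3rd failure is still within retryLimit=3
	fmt.Println(shouldRequeue(3, 4))    // false: limit exceeded, task becomes Failed
	fmt.Println(shouldRequeue(-1, 100)) // true: infinite retry
}
```

Note that with a large or infinite retryLimit, only the most recent processing records survive in `status.history`, as the README warning above says.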
pkg/apis/task/task.go

Lines changed: 11 additions & 4 deletions

@@ -178,7 +178,7 @@ func (t *Task) IsWorkerLost(defaultTimeoutSeconds int) bool {
 	return t.Status.CurrentRecord.ReceivedAt.Add(timeout).Before(time.Now())
 }
 
-func (t *Task) SetSuccess(payload *string, message *string) error {
+func (t *Task) SetSuccess(payload *string, message *string, historyLengthLimit int) error {
 	if t.Status.Phase != TaskPhaseProcessing {
 		return errors.Errorf("invalid status: actual=%s expected=%s", t.Status.Phase, TaskPhaseProcessing)
 	}
@@ -205,10 +205,14 @@ func (t *Task) SetSuccess(payload *string, message *string) error {
 	}
 	t.Status.History = append(t.Status.History, *current)
 
+	if len(t.Status.History) > historyLengthLimit {
+		t.Status.History = t.Status.History[len(t.Status.History)-historyLengthLimit:]
+	}
+
 	return nil
 }
 
-func (t *Task) RecordFailure(reason TaskResultReason, payload *string, message *string) (bool, error) {
+func (t *Task) RecordFailure(reason TaskResultReason, payload *string, message *string, historyLengthLimit int) (bool, error) {
 	if t.Status.Phase != TaskPhaseProcessing && t.Status.Phase != TaskPhaseReceived {
 		return false, errors.Errorf("invalid status: actual=%s expected=[%s,%s]", t.Status.Phase, TaskPhaseProcessing, TaskPhaseReceived)
 	}
@@ -233,13 +237,16 @@ func (t *Task) RecordFailure(reason TaskResultReason, payload *string, message *
 		t.Status.History = []TaskRecord{}
 	}
 	t.Status.History = append(t.Status.History, *current)
+	if len(t.Status.History) > historyLengthLimit {
+		t.Status.History = t.Status.History[len(t.Status.History)-historyLengthLimit:]
+	}
+
 	t.Status.FailureCount = t.Status.FailureCount + 1
 
 	requeue := true
 	t.Status.Phase = TaskPhasePending
-
 	// no requeue because retry exceeded
-	if t.Status.FailureCount > t.Spec.RetryLimit {
+	if t.Spec.RetryLimit >= 0 && t.Status.FailureCount > t.Spec.RetryLimit {
 		requeue = false
 		t.Status.Phase = TaskPhaseFailed
 	}

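The slice trimming added to `SetSuccess` and `RecordFailure` above keeps only the newest `historyLengthLimit` records. Here is a minimal stand-alone sketch of that logic, using plain `string` records in place of pftaskqueue's `TaskRecord` type:

```go
package main

import "fmt"

// trimHistory drops the oldest entries so that at most
// historyLengthLimit records remain, mirroring the trimming this
// commit adds to Task.SetSuccess and Task.RecordFailure.
func trimHistory(history []string, historyLengthLimit int) []string {
	if len(history) > historyLengthLimit {
		// keep the tail: the most recent historyLengthLimit records
		history = history[len(history)-historyLengthLimit:]
	}
	return history
}

func main() {
	records := []string{"r1", "r2", "r3", "r4", "r5"}
	fmt.Println(trimHistory(records, 3)) // [r3 r4 r5]: only the 3 newest records survive
}
```

With the redis backend's `HistoryLengthMax = 10` (see pkg/backend/redis/task.go below), a task retried more than 10 times keeps only its 10 most recent processing records.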
pkg/backend/redis/redis_test.go

Lines changed: 2 additions & 1 deletion

@@ -89,6 +89,7 @@ var (
 		TimeoutSeconds: 60,
 	}
 	SampleInvalidTaskSpec = task.TaskSpec{
+		Name:           strings.Repeat("a", MaxNameLength+1),
 		Payload:        strings.Repeat("x", PayloadMaxSizeInKB*KB+1),
 		RetryLimit:     100,
 		TimeoutSeconds: 0,
@@ -712,8 +713,8 @@ var _ = Describe("Backend", func() {
 			vErr, ok := err.(*util.ValidationError)
 			Expect(ok).To(Equal(true))
 			Expect(len(vErr.Errors)).To(Equal(2))
+			Expect(vErr.Error()).To(ContainSubstring("TaskSpec.Name max length"))
 			Expect(vErr.Error()).To(ContainSubstring("TaskSpec.Payload max size is"))
-			Expect(vErr.Error()).To(ContainSubstring("TaskSpec.retryLimit max is"))
 		})
 	})
 	When("Spec is valid", func() {

pkg/backend/redis/task.go

Lines changed: 3 additions & 6 deletions

@@ -41,7 +41,7 @@ const (
 	KB                 = 1 << 10
 	PayloadMaxSizeInKB = 1
 	MessageMaxSizeInKB = 1
-	RetryLimitMax      = 10
+	HistoryLengthMax   = 10
 	MaxNameLength      = 1024
 )
 
@@ -650,6 +650,7 @@ func (b *Backend) SetSucceeded(ctx context.Context, queueUID, workerUID uuid.UUI
 	err = t.SetSuccess(
 		util.Truncate(resultPayload, PayloadMaxSizeInKB*KB),
 		util.Truncate(message, MessageMaxSizeInKB*KB),
+		HistoryLengthMax,
 	)
 	if err != nil {
 		dlerr := b.invalidMessageDLError(
@@ -791,6 +792,7 @@ func (b *Backend) RecordFailure(ctx context.Context, queueUID, workerUID uuid.UU
 		reason,
 		util.Truncate(resultPayload, PayloadMaxSizeInKB*KB),
 		util.Truncate(message, MessageMaxSizeInKB*KB),
+		HistoryLengthMax,
 	)
 	if err != nil {
 		dlerr := b.invalidMessageDLError(
@@ -924,11 +926,6 @@ func (b *Backend) validateTaskSpec(s task.TaskSpec) error {
 			errors.Errorf("TaskSpec.Payload max size is %d Bytes (actual=%d Bytes)", maxBytes, len(s.Payload)),
 		)
 	}
-	if s.RetryLimit > RetryLimitMax {
-		vErrors = multierror.Append(vErrors,
-			errors.Errorf("TaskSpec.retryLimit max is %d (actual=%d)", RetryLimitMax, s.RetryLimit),
-		)
-	}
 	if vErrors != nil {
 		return (*util.ValidationError)(vErrors)
 	}
