Commit 713cf68

Merge pull request #49982 from mimowo/job-limitperindex-blogport
BlogPost: Job's Backoff Limit Per Index Goes GA
2 parents 7ccdda1 + 481f4c3 commit 713cf68

1 file changed: +108 -0 lines changed

@@ -0,0 +1,108 @@
+---
+layout: blog
+title: "Kubernetes v1.33: Job's Backoff Limit Per Index Goes GA"
+date: 2025-04-23
+draft: true
+slug: kubernetes-v1-33-jobs-backoff-limit-per-index-goes-ga
+author: >
+  [Michał Woźniak](https://github.com/mimowo) (Google)
+---
+
+In Kubernetes v1.33, the _Backoff Limit Per Index_ feature reaches general
+availability (GA). This blog post describes the Backoff Limit Per Index feature and
+its benefits.
+
+## About Backoff Limit Per Index
+
+When you run workloads on Kubernetes, you must consider scenarios where Pod
+failures can affect the completion of your workloads. Ideally, your workload
+should tolerate transient failures and continue running.
+
+To achieve failure tolerance in a Kubernetes Job, you can set the
+`spec.backoffLimit` field. This field specifies the total number of tolerated
+failures across the whole Job.
+
+However, for workloads where every index is considered independent, such as
+[embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel)
+workloads, the `spec.backoffLimit` field is often not flexible enough.
+For example, you may choose to run multiple suites of integration tests by
+representing each suite as an index within an [Indexed Job](/docs/tasks/job/indexed-parallel-processing-static/).
+In that setup, a fast-failing index (test suite) is likely to consume your
+entire budget for tolerating Pod failures, and you might not be able to run the
+other indexes.
+
+To address this limitation, Kubernetes introduced _Backoff Limit Per Index_,
+which allows you to control the number of retries per index.
+
+## How Backoff Limit Per Index works
+
+To use Backoff Limit Per Index for Indexed Jobs, specify the number of tolerated
+Pod failures per index with the `spec.backoffLimitPerIndex` field. When you set
+this field, the Job executes all indexes by default.
+
+Additionally, to fine-tune the error handling:
+* Specify the cap on the total number of failed indexes by setting the
+  `spec.maxFailedIndexes` field. When the limit is exceeded, the entire Job is
+  terminated.
+* Define a short-circuit to detect a failed index by using the `FailIndex` action in the
+  [Pod Failure Policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy)
+  feature.
+
+When the number of tolerated failures for an index is exceeded, the Job marks that index as
+failed and lists it in the Job's `status.failedIndexes` field.
+
+### Example
+
+The following Job spec snippet is an example of how to combine Backoff Limit Per
+Index with the _Pod Failure Policy_ feature:
+
+```yaml
+completions: 10
+parallelism: 10
+completionMode: Indexed
+backoffLimitPerIndex: 1
+maxFailedIndexes: 5
+podFailurePolicy:
+  rules:
+  - action: Ignore
+    onPodConditions:
+    - type: DisruptionTarget
+  - action: FailIndex
+    onExitCodes:
+      operator: In
+      values: [ 42 ]
+```
+
+In this example, the Job handles Pod failures as follows:
+
+- Ignores any failed Pods that have the built-in
+  [disruption condition](/docs/concepts/workloads/pods/disruptions/#pod-disruption-conditions),
+  called `DisruptionTarget`. These Pods don't count towards Job backoff limits.
+- Fails the index corresponding to the failed Pod if any of the failed Pod's
+  containers finished with the exit code 42, based on the matching `FailIndex`
+  rule.
+- Retries the first failure of any index, unless the index failed due to the
+  matching `FailIndex` rule.
+- Fails the entire Job if the number of failed indexes exceeds 5 (set by the
+  `spec.maxFailedIndexes` field).
+
+## Learn more
+
+- Read the blog post on the closely related feature of Pod Failure Policy: [Kubernetes 1.31: Pod Failure Policy for Jobs Goes GA](/blog/2024/08/19/kubernetes-1-31-pod-failure-policy-for-jobs-goes-ga/)
+- For a hands-on guide to using Pod failure policy, including the use of `FailIndex`, see
+  [Handling retriable and non-retriable pod failures with Pod failure policy](/docs/tasks/job/pod-failure-policy/)
+- Read the documentation for
+  [Backoff limit per index](/docs/concepts/workloads/controllers/job/#backoff-limit-per-index) and
+  [Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy)
+- Read the KEP for [Backoff Limits Per Index For Indexed Jobs](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs)
+
+## Get involved
+
+This work was sponsored by the
+[batch working group](https://github.com/kubernetes/community/tree/master/wg-batch)
+in close collaboration with the
+[SIG Apps](https://github.com/kubernetes/community/tree/master/sig-apps) community.
+
+If you are interested in working on new features in this space, we recommend
+subscribing to our [Slack](https://kubernetes.slack.com/messages/wg-batch)
+channel and attending the regular community meetings.

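The blog post added in this commit shows only a fragment of a Job spec. For readers who want to try it, the following is a minimal sketch of a complete manifest built around that fragment. It is not part of the commit: the Job name, container name, image, and command are hypothetical placeholders. `restartPolicy: Never` is included because both `backoffLimitPerIndex` and `podFailurePolicy` require it.

```yaml
# Sketch only: metadata.name, the container name, image, and command are
# hypothetical placeholders, not taken from the commit.
apiVersion: batch/v1
kind: Job
metadata:
  name: integration-suites
spec:
  completions: 10
  parallelism: 10
  completionMode: Indexed
  backoffLimitPerIndex: 1   # tolerate one failure (one retry) per index
  maxFailedIndexes: 5       # fail the whole Job once more than 5 indexes have failed
  podFailurePolicy:
    rules:
    - action: Ignore        # disrupted Pods do not count against the limits
      onPodConditions:
      - type: DisruptionTarget
    - action: FailIndex     # exit code 42 fails the index without a retry
      onExitCodes:
        operator: In
        values: [ 42 ]
  template:
    spec:
      restartPolicy: Never  # required by backoffLimitPerIndex and podFailurePolicy
      containers:
      - name: suite-runner
        image: registry.example/suite-runner:latest
        # Indexed Jobs expose the Pod's index in the JOB_COMPLETION_INDEX
        # environment variable.
        command: ["/bin/sh", "-c", "./run-suite.sh $JOB_COMPLETION_INDEX"]
```

Once such a Job finishes, the failed indexes can be read with `kubectl get job integration-suites -o jsonpath='{.status.failedIndexes}'`. Assuming, purely for illustration, that indexes 2 and 7 exhausted their retries, the relevant part of the status might look roughly like this, with `failedIndexes` using the same comma-separated text format as `completedIndexes`:

```yaml
# Illustrative values only; not output from a real cluster.
status:
  completedIndexes: "0,1,3-6,8,9"
  failedIndexes: "2,7"
```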