testing strategy: add policy for presubmits

pohly · pohly · commit e687f0a85dc4 · 2024-12-17T17:35:49.000+01:00
This was motivated in part by kubernetes/test-infra#33463 (comment) and is part of an effort to document best practices. The part about blocking presubmits and running them always is https://kubernetes.slack.com/archives/C2C40FMNF/p1734418617113169?thread_ts=1734417601.687079&cid=C2C40FMNF
diff --git a/contributors/devel/sig-testing/testing-strategy.md b/contributors/devel/sig-testing/testing-strategy.md
@@ -26,6 +26,55 @@ The Kubernetes job uses [prow](https://prow.k8s.io) to implement the CI system.
 - **Postsubmit:** Runs after code is merged. Useful for building artifacts.
 - **Periodic:** Runs at scheduled intervals. Ideal for monitoring trends and catching regressions.
 
+#### Presubmit jobs
+
+Blocking pre-submit jobs must always run. If they didn't, it could happen that
+a pull request where they didn't run gets merged and introduces a regression
+which breaks such a job. Then when it runs in a different pull request, that
+pull request cannot be merged, even if that was desirable, until the regression
+is fixed.
+
+Usually, non-blocking jobs don't run by default. The `/test` command has to be
+used explicitly for such informing jobs. It is possible to configure them so that they
+[run automatically when certain paths are modified](https://github.com/kubernetes/test-infra/blob/ee70308f09c10f7cd933c26c98acc7ebf785d436/config/jobs/kubernetes/sig-node/sig-node-presubmit.yaml#L3201-L3202).
+
+Non-blocking jobs cannot detect all regressions. A test flake might succeed
+when tested only once during presubmit. When defining the path trigger, it's
+impossible to list everything that might cause a need to run tests
+(e.g. tool changes, updates in packages that a feature depends on). Therefore
+it is required to have a periodic job which runs the same tests regularly.
+
+The advantage of also having a non-blocking job is that problems get discovered
+sooner. Without it, maintainers are forced to diagnose regressions in a
+periodic job and then have to ping the contributor who caused the problem.  If
+that contributor is unresponsive, maintainers may have to fix the problem
+themselves. Instead, the burden is on the contributor whose pull request fails
+the tests. If they are unresponsive, their change doesn't get merged and
+there's no regression.
+
+The advantage of running a non-blocking job automatically for some changes is
+that reviewers don't need to remember which jobs currently should be tested
+before merging and that the results are available immediately when a reviewer
+looks at a pull request for the first time, assuming that the contributor is a
+community member and tests run automatically.
+
+> :warning: **Warning:** A non-blocking job that fails confuses other
+> contributors who are not familiar with the job or the failures. If it runs
+> too often, it wastes CI resources.
+
+To avoid those negative consequences for the project, the guidelines for
+setting up such a job are:
+
+* The job owners are responsive and react to problems with the job.
+* The job must have a low failure rate to avoid confusion in drive-by pull requests.
+* The importance of the feature must justify the extra CI resources (depends
+  on how often it gets triggered).
+* The `run_if_changed` regular expression must be narrow enough that
+  the job doesn't run for unrelated changes. A good practice is to
+  limit it to code changes, for example with:
+
+     /(directory_a|directory_b)/.*\.go
+
 #### SIG Release Blocking and Informing jobs
 
 SIG Release maintains two sets of jobs that decide whether the release is