Speed up post-test cleanup: region-scope and gate LB scans#509
Conversation
cleanerupper.CleanLoadBalancerResources now takes an optional regions slice and only iterates those regions; when empty it falls back to ListRegions (existing behavior). cleanTestWorkflow derives the regions from the workflow's primary zone plus every per-VM zone override (handles suites like shapevalidation that pin VMs to non-default zones), cutting the per-region listing loop from ~40 regions to 1. A new TestWorkflow.CreatesLoadBalancers opt-in flag gates the LB cleanup call entirely. Only loadbalancer.TestSetup sets it true, so other suites skip the scan since they never create LB resources at runtime. In CIT's typical single-region case the per-test cleanup pass drops from ~3m47s to ~2-3s.
|
Hi @jonathanspw. Thanks for your PR. I'm waiting for a GoogleCloudPlatform member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/ok-to-test |
|
/retest |
|
/ok-to-test |
daisy can populate vm.Zone with either a short name ("us-central1-a"),
a path ("zones/us-central1-a"), or a full URL
("https://www.googleapis.com/compute/v1/projects/.../zones/us-central1-a")
depending on workflow state. The previous strings.LastIndex on the raw
input split the trailing zone suffix off a URL but left the URL prefix
intact, so workflowRegions produced strings like
"https://www.googleapis.com/.../zones/us-west1" which GCE rejected with
"Invalid value for field 'region'".
Path.Base the zone first so the same trim works for any of the three
forms. Adds a TestWorkflowRegions case that pins this behavior.
|
Here's a real-world impact sample: Before (1:24): After (0:52): |
-- d894462 by Jonathan Wright <jonathan@almalinux.org>: Speed up post-test cleanup: region-scope and gate LB scans cleanerupper.CleanLoadBalancerResources now takes an optional regions slice and only iterates those regions; when empty it falls back to ListRegions (existing behavior). cleanTestWorkflow derives the regions from the workflow's primary zone plus every per-VM zone override (handles suites like shapevalidation that pin VMs to non-default zones), cutting the per-region listing loop from ~40 regions to 1. A new TestWorkflow.CreatesLoadBalancers opt-in flag gates the LB cleanup call entirely. Only loadbalancer.TestSetup sets it true, so other suites skip the scan since they never create LB resources at runtime. In CIT's typical single-region case the per-test cleanup pass drops from ~3m47s to ~2-3s. -- f1b669a by Jonathan Wright <jonathan@almalinux.org>: gofmt: realign TestWorkflowRegions testcases struct -- 9bacfae by Jonathan Wright <jonathan@almalinux.org>: Fix regionFromZone to handle URL- and path-form zones daisy can populate vm.Zone with either a short name ("us-central1-a"), a path ("zones/us-central1-a"), or a full URL ("https://www.googleapis.com/compute/v1/projects/.../zones/us-central1-a") depending on workflow state. The previous strings.LastIndex on the raw input split the trailing zone suffix off a URL but left the URL prefix intact, so workflowRegions produced strings like "https://www.googleapis.com/.../zones/us-west1" which GCE rejected with "Invalid value for field 'region'". Path.Base the zone first so the same trim works for any of the three forms. Adds a TestWorkflowRegions case that pins this behavior. FUTURE_COPYBARA_INTEGRATE_REVIEW=#509 from jonathanspw:region-scope-lb-cleanup 9bacfae PiperOrigin-RevId: 926845123
-- d894462 by Jonathan Wright <jonathan@almalinux.org>: Speed up post-test cleanup: region-scope and gate LB scans cleanerupper.CleanLoadBalancerResources now takes an optional regions slice and only iterates those regions; when empty it falls back to ListRegions (existing behavior). cleanTestWorkflow derives the regions from the workflow's primary zone plus every per-VM zone override (handles suites like shapevalidation that pin VMs to non-default zones), cutting the per-region listing loop from ~40 regions to 1. A new TestWorkflow.CreatesLoadBalancers opt-in flag gates the LB cleanup call entirely. Only loadbalancer.TestSetup sets it true, so other suites skip the scan since they never create LB resources at runtime. In CIT's typical single-region case the per-test cleanup pass drops from ~3m47s to ~2-3s. -- f1b669a by Jonathan Wright <jonathan@almalinux.org>: gofmt: realign TestWorkflowRegions testcases struct -- 9bacfae by Jonathan Wright <jonathan@almalinux.org>: Fix regionFromZone to handle URL- and path-form zones daisy can populate vm.Zone with either a short name ("us-central1-a"), a path ("zones/us-central1-a"), or a full URL ("https://www.googleapis.com/compute/v1/projects/.../zones/us-central1-a") depending on workflow state. The previous strings.LastIndex on the raw input split the trailing zone suffix off a URL but left the URL prefix intact, so workflowRegions produced strings like "https://www.googleapis.com/.../zones/us-west1" which GCE rejected with "Invalid value for field 'region'". Path.Base the zone first so the same trim works for any of the three forms. Adds a TestWorkflowRegions case that pins this behavior. FUTURE_COPYBARA_INTEGRATE_REVIEW=#509 from jonathanspw:region-scope-lb-cleanup 9bacfae PiperOrigin-RevId: 926845123
e72f2bf
into
GoogleCloudPlatform:main
Speeds up CIT's per-test cleanup phase by ~98% in typical single-region runs.
cleanerupper.CleanLoadBalancerResourcesnow takes an optionalregions []stringparameter; when set, it iterates only those regions instead of every GCP region returned byListRegions(~40+).workflowRegionshelper derives the relevant regions from the workflow's primary zone plus per-VM zone overrides (so suites likeshapevalidationthat pin VMs toeurope-west4-aare still covered correctly).TestWorkflow.CreatesLoadBalancersfield gates the LB cleanup call entirely. Onlyloadbalancer.TestSetupsets it true; every other suite skips the call since they never create LB resources at runtime (verified by grep — onlytest_suites/loadbalancer/loadbalancer_test.goever callscompute.NewForwardingRulesRESTClient,InsertBackendService, etc.).The pre-existing failure modes (zombie VMs, networks, disks) are still covered — those cleanups are unchanged. Only the LB-resource scan is scoped/gated.
Problem
CIT's post-test cleanup is a backup for daisy's own teardown hooks.
CleanLoadBalancerResourceswalks every GCP region listing forwarding rules, backend services, URL maps, target HTTP proxies, NEGs and health checks. Each region is ~5 list calls; ~40 regions adds up to ~200 calls per test. On a single-region test that's all wasted work, since regional resources only exist where VMs are.Compounding that: only the
loadbalancersuite actually creates LB resources. The other ~30 suites were paying for the scan to find nothing.Changes
cleanerupper/cleanerupper.goCleanLoadBalancerResourcesaddsregions []stringparam; nil = scan all (unchanged), non-empty = only thosecleanerupper/cleanerupper_test.gonil; newTestCleanLoadBalancerResourcesRegionScopeassertsListRegionsis skipped and no out-of-scope region is hit when a scope is settestworkflow.goCreatesLoadBalancers boolonTestWorkflow;regionFromZoneandworkflowRegionshelpers;cleanTestWorkflowderivesregionsand gates the LB cleanup call on the flagtestworkflow_test.goTestWorkflowRegions(table-driven, covers mixed GA/Beta VM zones, dedup, malformed zones); newTestCleanTestWorkflowSkipsLBWhenNotMarked(asserts no LB endpoints are hit when the flag is false)test_suites/loadbalancer/setup.got.CreatesLoadBalancers = trueafter the COS skipBenchmark results
Test setup:
almalinux-9-v20260501inus-central1-a, projectalmalinux-image-testing-469421.Per-test cleanup duration
hostnamevalidationimagebootpackagevalidationsshloadbalancerWall-clock per scenario (parallel)
-parallel_count=2-parallel_count=5The 5-test wall-clock savings are smaller in proportion because the slowest test (
loadbalancerat ~9 min of test execution) dominates wall clock — and that test still does the LB cleanup (correctly).Why this matters more at low parallelism
Cleanup phases are per-test and run after each test's workflow finishes. When
-parallel_countis high relative to the number of tests, cleanups overlap with other tests still running — much of the cleanup cost disappears into the wall-clock shadow of the slowest test. When parallelism is low relative to total tests, cleanups serialize and every saved second flows directly to wall clock.Worked example with the measured ~3m45s saved per non-LB test:
-parallel_count=1(fully sequential)-parallel_count=5(matches our benchmark)-parallel_count=40(one slot per test)So the case that benefits the most is a developer running
-parallel_count=1to debug or run a representative subset of suites locally: the cleanup-time savings are linear in test count and fully visible. CI runs with high parallelism still save the same total API calls and project quota, but the wall-clock improvement is bounded by the slowest test on the critical path.