operations: add per-zone config for enabling "ingester multi-az"#15000

Merged
ldufr merged 1 commit into main from ldufresne/enabled-multi-az-per-ingester-zone
Apr 13, 2026

@ldufr ldufr commented Apr 13, 2026

What this PR does

Enabling multi_zone_ingester_multi_az_enabled in step 3 of the migration changes the spec of ingester-zone-a to add the nodeAffinity, which causes a rollout. Since zone-b isn't restarted at this point, this can cause a period during which an ingester shard isn't available. This only applies when ingester-zone-c isn't used for the migration.

We add new configs that allow finer-grained control, and update the migration process to set those specs before we stop zone-b. This prevents a zone-a rollout when we enable multi_zone_ingester_multi_az_enabled.

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]. If changelog entry is not needed, please add the changelog-not-needed label to the PR.
  • about-versioning.md updated with experimental features.

Note

Medium Risk
Changes multi-zone ingester deployment configuration and scheduling behavior (node affinity/tolerations), which can affect rollout timing and shard availability during migrations if misconfigured.

Overview
Adds per-zone control for ingester multi-AZ. Introduces $._config.multi_zone_ingester_zone_(a|b|c)_multi_az_enabled (defaulting to multi_zone_ingester_multi_az_enabled) and switches multi-zone ingester AZ enablement logic to be driven per zone, with validation requiring multi-zone ingesters when any zone is enabled.
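
The defaulting described above can be sketched in Jsonnet. This is a minimal standalone illustration, not the library code: the `base` and `overridden` names are hypothetical, and the global flag is hard-coded to `false` here rather than derived from `multi_zone_read_path_multi_az_enabled`.

```jsonnet
// Standalone sketch; `base` and `overridden` are illustrative names only.
local base = {
  _config:: {
    // Hard-coded for the example; the library derives this from another flag.
    multi_zone_ingester_multi_az_enabled: false,
    // Each per-zone flag follows the global flag unless explicitly overridden.
    multi_zone_ingester_zone_a_multi_az_enabled: self.multi_zone_ingester_multi_az_enabled,
    multi_zone_ingester_zone_b_multi_az_enabled: self.multi_zone_ingester_multi_az_enabled,
    multi_zone_ingester_zone_c_multi_az_enabled: self.multi_zone_ingester_multi_az_enabled,
  },
};

// Override only zone-a, leaving the global flag untouched.
local overridden = base {
  _config+:: {
    multi_zone_ingester_zone_a_multi_az_enabled: true,
  },
};

{
  zone_a: overridden._config.multi_zone_ingester_zone_a_multi_az_enabled,  // true
  zone_b: overridden._config.multi_zone_ingester_zone_b_multi_az_enabled,  // false
  zone_c: overridden._config.multi_zone_ingester_zone_c_multi_az_enabled,  // false
}
```

Because the per-zone fields reference `self`, they are late-bound: overriding one zone does not affect the others, and overriding only the global flag still propagates to every zone that was not set explicitly.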

Updates multi-AZ read-path migration tests/manifests. Migration step-1 now pre-enables zone-a’s multi-AZ setting, and generated test YAMLs add the resulting nodeAffinity/tolerations on ingester-zone-a so enabling global ingester multi-AZ later doesn’t trigger an unexpected rollout.
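
To illustrate why pre-enabling zone-a helps, the sketch below compares the effective flags at two points in the migration. It assumes, as the PR description states, that the global flag is flipped in step 3. The `base`, `step1`, and `step3` objects are hypothetical stand-ins for the real test files, and an unchanged flag only suggests an unchanged zone-a spec; the actual manifests depend on the rest of the library.

```jsonnet
// Hypothetical stand-ins for the migration steps; not the real test files.
local base = {
  _config:: {
    multi_zone_ingester_multi_az_enabled: false,
    multi_zone_ingester_zone_a_multi_az_enabled: self.multi_zone_ingester_multi_az_enabled,
    multi_zone_ingester_zone_b_multi_az_enabled: self.multi_zone_ingester_multi_az_enabled,
  },
};

// Step 1: pre-enable multi-AZ for zone-a only.
local step1 = base {
  _config+:: { multi_zone_ingester_zone_a_multi_az_enabled: true },
};

// Step 3: enable the global flag on top of step 1.
local step3 = step1 {
  _config+:: { multi_zone_ingester_multi_az_enabled: true },
};

{
  // zone-a was already enabled, so its flag does not change between steps.
  zone_a_flag_unchanged:
    step1._config.multi_zone_ingester_zone_a_multi_az_enabled ==
    step3._config.multi_zone_ingester_zone_a_multi_az_enabled,  // true
  // zone-b follows the global flag, so its flag does change.
  zone_b_flag_changed:
    step1._config.multi_zone_ingester_zone_b_multi_az_enabled !=
    step3._config.multi_zone_ingester_zone_b_multi_az_enabled,  // true
}
```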

Docs/notes. Adds a CHANGELOG.md enhancement entry describing the new per-zone ingester multi-AZ config.

Reviewed by Cursor Bugbot for commit c865d59. Bugbot is set up for automated code reviews on this repo.

@ldufr ldufr requested a review from a team as a code owner April 13, 2026 07:18
@ldufr ldufr force-pushed the ldufresne/enabled-multi-az-per-ingester-zone branch from f3fee7d to 4f8ebf6 Compare April 13, 2026 07:21

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Assertion bypass when per-zone flags used without global flag
    • Updated ingester multi-AZ detection to OR the three per-zone flags so the assertion triggers whenever any zone multi-AZ flag is enabled.
  • ✅ Fixed: Naming convention inconsistent with store-gateway per-zone configs
    • Renamed ingester per-zone multi-AZ config keys to the zone_x_multi_az_enabled pattern and updated the migration test override accordingly.
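
The assertion bypass in the first fix can be reproduced with a small standalone sketch. The `config` object and the two `*_check_passes` fields are hypothetical names used only for illustration. The checks are evaluated as booleans instead of real `assert` statements so the example evaluates without aborting.

```jsonnet
// Misconfiguration: zone-a multi-AZ is on, but multi-zone ingesters are off.
local config = {
  multi_zone_ingester_enabled: false,
  multi_zone_ingester_multi_az_enabled: false,
  multi_zone_ingester_zone_a_multi_az_enabled: true,
  multi_zone_ingester_zone_b_multi_az_enabled: false,
  multi_zone_ingester_zone_c_multi_az_enabled: false,
};

// Check driven by the global flag only.
local globalOnly = config.multi_zone_ingester_multi_az_enabled;

// Check driven by the OR of the three per-zone flags.
local anyZone =
  config.multi_zone_ingester_zone_a_multi_az_enabled ||
  config.multi_zone_ingester_zone_b_multi_az_enabled ||
  config.multi_zone_ingester_zone_c_multi_az_enabled;

{
  // true: the global-only check passes, so the misconfiguration goes unnoticed.
  global_only_check_passes: !globalOnly || config.multi_zone_ingester_enabled,
  // false: the per-zone OR makes the check fail, as intended.
  any_zone_check_passes: !anyZone || config.multi_zone_ingester_enabled,
}
```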

Preview (1778e09841)
diff --git a/operations/mimir-tests/test-multi-az-read-path-migration-step-1.jsonnet b/operations/mimir-tests/test-multi-az-read-path-migration-step-1.jsonnet
--- a/operations/mimir-tests/test-multi-az-read-path-migration-step-1.jsonnet
+++ b/operations/mimir-tests/test-multi-az-read-path-migration-step-1.jsonnet
@@ -22,6 +22,6 @@
     // Enable multi-az config for the ingester zone-a to prevent a restart when
     // enabling multi_zone_ingester_multi_az_enabled, but before a second zone
     // is created.
-    multi_zone_ingester_multi_az_zone_a_enabled: true,
+    multi_zone_ingester_zone_a_multi_az_enabled: true,
   },
 }

diff --git a/operations/mimir/multi-zone-ingester.libsonnet b/operations/mimir/multi-zone-ingester.libsonnet
--- a/operations/mimir/multi-zone-ingester.libsonnet
+++ b/operations/mimir/multi-zone-ingester.libsonnet
@@ -19,9 +19,9 @@
 
     // Controls whether the multi (virtual) zone ingester should also be deployed multi-AZ.
     multi_zone_ingester_multi_az_enabled: $._config.multi_zone_read_path_multi_az_enabled,
-    multi_zone_ingester_multi_az_zone_a_enabled: self.multi_zone_ingester_multi_az_enabled,
-    multi_zone_ingester_multi_az_zone_b_enabled: self.multi_zone_ingester_multi_az_enabled,
-    multi_zone_ingester_multi_az_zone_c_enabled: self.multi_zone_ingester_multi_az_enabled,
+    multi_zone_ingester_zone_a_multi_az_enabled: self.multi_zone_ingester_multi_az_enabled,
+    multi_zone_ingester_zone_b_multi_az_enabled: self.multi_zone_ingester_multi_az_enabled,
+    multi_zone_ingester_zone_c_multi_az_enabled: self.multi_zone_ingester_multi_az_enabled,
   },
 
   local container = $.core.v1.container,
@@ -30,10 +30,10 @@
   local service = $.core.v1.service,
   local podAntiAffinity = $.apps.v1.deployment.mixin.spec.template.spec.affinity.podAntiAffinity,
 
-  local isMultiAZEnabled = $._config.multi_zone_ingester_multi_az_enabled,
-  local isZoneAEnabled = $._config.multi_zone_ingester_multi_az_zone_a_enabled && std.length($._config.multi_zone_availability_zones) >= 1,
-  local isZoneBEnabled = $._config.multi_zone_ingester_multi_az_zone_b_enabled && std.length($._config.multi_zone_availability_zones) >= 2,
-  local isZoneCEnabled = $._config.multi_zone_ingester_multi_az_zone_c_enabled && std.length($._config.multi_zone_availability_zones) >= 3,
+  local isMultiAZEnabled = $._config.multi_zone_ingester_zone_a_multi_az_enabled || $._config.multi_zone_ingester_zone_b_multi_az_enabled || $._config.multi_zone_ingester_zone_c_multi_az_enabled,
+  local isZoneAEnabled = $._config.multi_zone_ingester_zone_a_multi_az_enabled && std.length($._config.multi_zone_availability_zones) >= 1,
+  local isZoneBEnabled = $._config.multi_zone_ingester_zone_b_multi_az_enabled && std.length($._config.multi_zone_availability_zones) >= 2,
+  local isZoneCEnabled = $._config.multi_zone_ingester_zone_c_multi_az_enabled && std.length($._config.multi_zone_availability_zones) >= 3,
 
   assert !isMultiAZEnabled || $._config.multi_zone_ingester_enabled : 'ingester multi-AZ deployment requires ingester multi-zone to be enabled',
   assert !$._config.multi_zone_ingester_zpdb_enabled || $._config.rollout_operator_webhooks_enabled : 'zpdb configuration requires rollout_operator_webhooks_enabled=true',

@pracucci pracucci left a comment

Overall LGTM, but please see my comments, thanks

@ldufr ldufr force-pushed the ldufresne/enabled-multi-az-per-ingester-zone branch from 4f8ebf6 to 45bac9f Compare April 13, 2026 07:35
Enabling `multi_zone_ingester_multi_az_enabled` in step 3 of the migration
changes the spec of ingester-zone-a to add the `nodeAffinity`, which causes
a rollout. Since zone-b isn't restarted at this point, this can cause a
period during which an ingester shard isn't available. This only applies
when ingester-zone-c isn't used for the migration.

We add new configs that allow finer-grained control, and update the
migration process to set those specs before we stop zone-b. This prevents
a zone-a rollout when we enable `multi_zone_ingester_multi_az_enabled`.

Signed-off-by: Laurent Dufresne <laurent.dufresne@grafana.com>
@ldufr ldufr force-pushed the ldufresne/enabled-multi-az-per-ingester-zone branch from 45bac9f to c865d59 Compare April 13, 2026 07:36

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Assertion uses computed zone values instead of raw configs
    • Updated the ingester assertion to use a new raw-config isMultiAZEnabled OR of per-zone flags so misconfigurations are caught even when availability zones are missing.
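
The difference between the raw and computed checks can be seen in a small standalone sketch. It assumes an empty `multi_zone_availability_zones` list as the "missing availability zones" case. The `config` object and the output field names are hypothetical. The checks are plain booleans rather than `assert` statements so the example evaluates to a result.

```jsonnet
// Misconfiguration: zone-a multi-AZ is on, multi-zone ingesters are off,
// and no availability zones are configured.
local config = {
  multi_zone_ingester_enabled: false,
  multi_zone_availability_zones: [],
  multi_zone_ingester_zone_a_multi_az_enabled: true,
  multi_zone_ingester_zone_b_multi_az_enabled: false,
  multi_zone_ingester_zone_c_multi_az_enabled: false,
};

local zoneCount = std.length(config.multi_zone_availability_zones);

// Computed values: each flag is gated on the number of availability zones.
local isZoneAEnabled = config.multi_zone_ingester_zone_a_multi_az_enabled && zoneCount >= 1;
local isZoneBEnabled = config.multi_zone_ingester_zone_b_multi_az_enabled && zoneCount >= 2;
local isZoneCEnabled = config.multi_zone_ingester_zone_c_multi_az_enabled && zoneCount >= 3;
local computed = isZoneAEnabled || isZoneBEnabled || isZoneCEnabled;

// Raw values: only the configured flags, ignoring availability zones.
local raw =
  config.multi_zone_ingester_zone_a_multi_az_enabled ||
  config.multi_zone_ingester_zone_b_multi_az_enabled ||
  config.multi_zone_ingester_zone_c_multi_az_enabled;

{
  // true: with no zones configured, the computed check hides the problem.
  computed_check_passes: !computed || config.multi_zone_ingester_enabled,
  // false: the raw check still flags the misconfiguration.
  raw_check_passes: !raw || config.multi_zone_ingester_enabled,
}
```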

Preview (6654b8630b)
diff --git a/operations/mimir/multi-zone-ingester.libsonnet b/operations/mimir/multi-zone-ingester.libsonnet
--- a/operations/mimir/multi-zone-ingester.libsonnet
+++ b/operations/mimir/multi-zone-ingester.libsonnet
@@ -30,12 +30,12 @@
   local service = $.core.v1.service,
   local podAntiAffinity = $.apps.v1.deployment.mixin.spec.template.spec.affinity.podAntiAffinity,
 
+  local isMultiAZEnabled = $._config.multi_zone_ingester_zone_a_multi_az_enabled || $._config.multi_zone_ingester_zone_b_multi_az_enabled || $._config.multi_zone_ingester_zone_c_multi_az_enabled,
   local isZoneAEnabled = $._config.multi_zone_ingester_zone_a_multi_az_enabled && std.length($._config.multi_zone_availability_zones) >= 1,
   local isZoneBEnabled = $._config.multi_zone_ingester_zone_b_multi_az_enabled && std.length($._config.multi_zone_availability_zones) >= 2,
   local isZoneCEnabled = $._config.multi_zone_ingester_zone_c_multi_az_enabled && std.length($._config.multi_zone_availability_zones) >= 3,
 
-  local isMultiAZAtLeastOnceEnabled = isZoneAEnabled || isZoneBEnabled || isZoneCEnabled,
-  assert !isMultiAZAtLeastOnceEnabled || $._config.multi_zone_ingester_enabled : 'ingester multi-AZ deployment requires ingester multi-zone to be enabled',
+  assert !isMultiAZEnabled || $._config.multi_zone_ingester_enabled : 'ingester multi-AZ deployment requires ingester multi-zone to be enabled',
   assert !$._config.multi_zone_ingester_zpdb_enabled || $._config.rollout_operator_webhooks_enabled : 'zpdb configuration requires rollout_operator_webhooks_enabled=true',
 
   //

@ldufr ldufr merged commit c3bca0c into main Apr 13, 2026
78 checks passed
@ldufr ldufr deleted the ldufresne/enabled-multi-az-per-ingester-zone branch April 13, 2026 08:11