
[ANSIENG-5796] Add KRaft migration rollback playbook with auto-phase detection#2492

Open
Ishika Paliwal (ishikaa-p) wants to merge 3 commits into 7.7.x from 7.7.x-kraft-migration-rollback-ANSIENG-5796

@ishikaa-p (Contributor)

Summary

  • Adds playbooks/kraft_migration_rollback.yml: a single-entry-point rollback playbook that auto-detects the migration phase (PREMIGRATION / HYBRID_DUAL_WRITE / PURE_DUAL_WRITE) via Jolokia and config inspection, then runs the matching rollback steps without requiring manual --tags selection
  • Adds playbooks/tasks/detect_migration_phase.yml: a reusable phase-detection task file that also works when Jolokia is unavailable, which is the case when a rollback is re-run mid-way (controllers are already stopped after Phase 4 Steps 1–4)
  • Adds playbooks/validations/kraft_migration_rollback_validations.yml: pre-flight checks (ZK health, inventory structure, phase detection, rollback feasibility)
  • Integrates the rollback as a --tags rollback entry point in ZKtoKraftMigration.yml; the never tag prevents accidental execution during normal migration runs
  • Adds three Molecule scenarios covering full migration → rollback round-trips for all three reversible phases
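As a rough illustration of the config-inspection half of phase detection, the shell sketch below classifies a single broker's server.properties. It is hypothetical (not the code in detect_migration_phase.yml) and assumes the standard Apache Kafka migration keys zookeeper.connect, zookeeper.metadata.migration.enable, and process.roles. Broker config alone cannot tell PREMIGRATION apart from a cluster where migration never started, nor PURE_DUAL_WRITE apart from a finalized (Phase 5) cluster, which is why controller state from Jolokia still matters.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: classify ONE broker's migration phase from its
# server.properties. Not the logic in detect_migration_phase.yml.

# prop_value FILE KEY: print the trimmed value of KEY (last occurrence wins).
prop_value() {
  awk -F= -v k="$2" '
    /^[ \t]*#/ { next }                     # skip comment lines
    {
      key = $1; gsub(/^[ \t]+|[ \t]+$/, "", key)
      if (key == k && index($0, "=") > 0) {
        val = substr($0, index($0, "=") + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", val)
        last = val; found = 1
      }
    }
    END { if (found) print last }
  ' "$1"
}

# has_prop FILE KEY: succeed if KEY is set on an uncommented line.
has_prop() {
  awk -F= -v k="$2" '
    /^[ \t]*#/ { next }
    { key = $1; gsub(/^[ \t]+|[ \t]+$/, "", key)
      if (key == k && index($0, "=") > 0) found = 1 }
    END { exit(found ? 0 : 1) }
  ' "$1"
}

# classify_broker_phase FILE: print PREMIGRATION, HYBRID_DUAL_WRITE,
# PURE_DUAL_WRITE or UNKNOWN, based only on broker config.
classify_broker_phase() {
  local props="$1" roles migration
  roles="$(prop_value "$props" process.roles)"
  migration="$(prop_value "$props" zookeeper.metadata.migration.enable)"

  if [ "$roles" = "broker" ] && ! has_prop "$props" zookeeper.connect; then
    echo "PURE_DUAL_WRITE"     # broker already runs in KRaft mode
  elif has_prop "$props" zookeeper.connect && [ "$migration" = "true" ]; then
    echo "HYBRID_DUAL_WRITE"   # ZK-mode broker with migration enabled
  elif has_prop "$props" zookeeper.connect; then
    echo "PREMIGRATION"        # broker untouched; controllers may exist
  else
    echo "UNKNOWN"
  fi
}
```

In a real run this would be applied to every broker, and disagreement between brokers is itself a signal that a previous rollback stopped part-way.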

What's tested by the Molecule scenarios

| Scenario | State before rollback | Rollback complexity |
|---|---|---|
| plaintext-rhel-rollback-phase2 | PREMIGRATION: controllers provisioned, brokers untouched | LOW: stop controllers only |
| plaintext-rhel-rollback-phase3 | HYBRID_DUAL_WRITE: dual-write active, brokers still have zookeeper.connect | MEDIUM: 1 rolling broker restart |
| plaintext-rhel-rollback-phase4 | PURE_DUAL_WRITE: brokers have process.roles=broker and no zookeeper.connect | HIGH: 3 rolling broker restarts |
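The table maps each detected phase to a different amount of work. A hypothetical routing function that mirrors it, and refuses the irreversible FINALIZED (Phase 5) state mentioned in the commit message, might look like this; the function name is illustrative and the step strings are paraphrased from this PR rather than taken from the playbook.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: map a detected phase to its rollback path.
# Step text paraphrases the PR description; the real playbook runs Ansible plays.
rollback_plan() {
  case "$1" in
    PREMIGRATION)
      echo "LOW: stop controllers only" ;;
    HYBRID_DUAL_WRITE)
      echo "MEDIUM: stop controllers, clean ZK znodes, clean broker metadata, 1 rolling broker restart" ;;
    PURE_DUAL_WRITE)
      echo "HIGH: 3 rolling broker restarts" ;;
    FINALIZED)
      echo "Phase 5 (FINALIZED) is irreversible; refusing to roll back" >&2
      return 1 ;;
    *)
      echo "unknown phase '$1'; refusing to roll back" >&2
      return 1 ;;
  esac
}
```

Returning non-zero for anything outside the three reversible phases lets a caller abort before touching the cluster.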

Each scenario's converge mirrors the ZKtoKraftMigration.yml play order exactly: migration_precheck → kafka_controller → Jolokia validation → kafka_broker (serial) → wait for ZkMigrationState=1 → [broker KRaft lineinfile + restart, Phase 4 only] → rollback.
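The "wait for ZkMigrationState=1" step is essentially a polling loop. The sketch below assumes a Jolokia agent on the controller that answers GET /jolokia/read/<mbean>/Value, and assumes the MBean name kafka.controller:type=KafkaController,name=ZkMigrationState; the helper names, host, and port are hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: poll ZkMigrationState through Jolokia until it is 1.
# Assumed MBean name and Jolokia URL layout; adjust to your deployment.
MBEAN='kafka.controller:type=KafkaController,name=ZkMigrationState'

# parse_value: read a Jolokia JSON response on stdin, print the integer "value".
parse_value() {
  sed -n 's/.*"value":[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# read_state HOST PORT: print the controller's current ZkMigrationState (or nothing).
read_state() {
  curl -fsS "http://$1:$2/jolokia/read/${MBEAN}/Value" 2>/dev/null | parse_value
}

# wait_for_value EXPECTED RETRIES DELAY CMD [ARGS...]: rerun CMD until it prints EXPECTED.
wait_for_value() {
  local expected="$1" retries="$2" delay="$3" i value=""
  shift 3
  for ((i = 1; i <= retries; i++)); do
    value="$("$@")"
    [ "$value" = "$expected" ] && return 0
    sleep "$delay"
  done
  echo "gave up after ${retries} attempts (last value: '${value}')" >&2
  return 1
}
```

Example: `wait_for_value 1 30 10 read_state controller-1 "$JOLOKIA_PORT"` tries 30 times, 10 seconds apart, where JOLOKIA_PORT is a placeholder for your controller's Jolokia port. Keeping the retry loop separate from the HTTP call makes it easy to test with a stub command.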

Usage

# Auto-detect phase and rollback:
ansible-playbook -i <inventory> playbooks/kraft_migration_rollback.yml

# Or via the migration playbook:
ansible-playbook -i <inventory> playbooks/ZKtoKraftMigration.yml --tags rollback

# Dry run (phase detection only):
ansible-playbook -i <inventory> playbooks/kraft_migration_rollback.yml --tags validate --check

Test plan

  • Run molecule test -s plaintext-rhel-rollback-phase2
  • Run molecule test -s plaintext-rhel-rollback-phase3
  • Run molecule test -s plaintext-rhel-rollback-phase4
  • Verify existing migration scenarios (plaintext-basic-rhel with MIGRATION_CONVERGE) are unaffected by the never tag change on the rollback import
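The first three test-plan items can be run back to back with a small loop. This is a sketch that stops at the first failing scenario; MOLECULE is a hypothetical override that defaults to the molecule CLI.

```shell
#!/usr/bin/env bash
# Sketch: run the three rollback scenarios in order, stopping at the first failure.
# MOLECULE is a hypothetical override; by default the real molecule CLI is used.
run_rollback_scenarios() {
  local molecule="${MOLECULE:-molecule}" scenario
  for scenario in plaintext-rhel-rollback-phase2 \
                  plaintext-rhel-rollback-phase3 \
                  plaintext-rhel-rollback-phase4; do
    echo "==> molecule test -s ${scenario}"
    if ! "$molecule" test -s "$scenario"; then
      echo "scenario ${scenario} failed" >&2
      return 1
    fi
  done
}
```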

Related

JIRA: https://confluentinc.atlassian.net/browse/ANSIENG-5796

🤖 Generated with Claude Code

…o-phase detection

Implements automated rollback of a ZooKeeper-to-KRaft migration back to
ZooKeeper-only mode, covering Phases 2, 3, and 4. Phase 5 is detected
and blocked because it is irreversible.

New files:
- playbooks/kraft_migration_rollback.yml: entry point, auto-routes by
  detected phase, callable via ZKtoKraftMigration.yml --tags rollback
- playbooks/tasks/detect_migration_phase.yml: Jolokia and config-based
  detection, no kafka-migration-check dependency, handles Jolokia-
  unavailable fallback for mid-rollback re-runs
- playbooks/validations/kraft_migration_rollback_validations.yml:
  pre-flight checks including ZK health and FINALIZED guard

Modified:
- playbooks/ZKtoKraftMigration.yml: adds rollback entry point via
  import_playbook with tags rollback

Phase 3 rollback follows the order of the tested implementation: stop
the controllers first, clean up the ZK znodes, clean up broker metadata
in parallel, then do a rolling broker restart. Phase 4 uses
template-based config regeneration so the authorizer is handled
correctly, with a two-run workflow for non-RBAC ACL clusters.
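The Phase 3 ordering above can be written out as a printed plan. This is a hypothetical sketch, not the playbook's tasks: the znode paths (/controller, /migration) follow the upstream Apache Kafka rollback procedure, the __cluster_metadata-0 directory name is an assumption based on Kafka's log-directory layout, and the zookeeper-shell command, log directory, and controller service name are placeholders to adapt.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: PRINT (not execute) the Phase 3 rollback steps in order.
# phase3_rollback_plan ZK_CONNECT LOG_DIR CONTROLLER_SERVICE
phase3_rollback_plan() {
  local zk="$1" log_dir="$2" controller_service="$3"
  cat <<EOF
1. [controllers] systemctl stop ${controller_service}
2. [zookeeper] zookeeper-shell ${zk} deleteall /controller
3. [zookeeper] zookeeper-shell ${zk} deleteall /migration
4. [all brokers, in parallel] rm -rf ${log_dir}/__cluster_metadata-0
5. [brokers, one at a time] restart with ZK-only config (migration flag removed)
EOF
}
```

Stopping the controllers before deleting /controller matters: otherwise a still-running KRaft controller could keep reasserting its claim while a ZK-mode broker tries to become the old-style controller.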
…os for all three migration phases

Add three Molecule scenarios that run migration to a target phase then
execute the rollback playbook to verify full round-trip correctness:

  plaintext-rhel-rollback-phase2 — PREMIGRATION state
    Controllers provisioned, brokers untouched. Verifies controller
    service stops cleanly without any broker changes.

  plaintext-rhel-rollback-phase3 — HYBRID_DUAL_WRITE state
    Full migrate_to_dual_write (precheck → controller → Jolokia
    validation → broker serial restart → wait ZkMigrationState=1).
    Verifies one rolling broker restart restores ZK-only mode.

  plaintext-rhel-rollback-phase4 — PURE_DUAL_WRITE state
    Dual-write + broker KRaft migration (process.roles=broker set,
    zookeeper.connect removed). Controllers intentionally left in
    dual-write mode. Verifies three-restart rollback path.

Each scenario includes molecule.yml (a minimal ZK + broker + dedicated
controller platform), prepare.yml (an unconditional ZK cluster install),
and verify.yml (asserts that broker.id and zookeeper.connect are present,
that process.roles and the migration flag are absent, and that the
controller service is stopped).

Three converge files at molecule/ root follow ZKtoKraftMigration.yml
play order exactly: migration_precheck → kafka_controller → Jolokia
validation → kafka_broker (serial) → wait → [broker KRaft steps for
phase4] → kraft_migration_rollback.

Also fix ZKtoKraftMigration.yml rollback import: add `never` tag so the
rollback playbook does not execute during normal migration runs when no
--tags filter is passed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
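The broker-side verify.yml assertions described in the commit above translate roughly into the following property-file check. It is a hypothetical sketch (the scenarios themselves use Ansible assertions); it assumes the migration flag is zookeeper.metadata.migration.enable and leaves out the controller-service check.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the broker-side verify.yml checks on a properties file.

# has_prop FILE KEY: succeed if KEY is set on an uncommented line.
has_prop() {
  awk -F= -v k="$2" '
    /^[ \t]*#/ { next }
    { key = $1; gsub(/^[ \t]+|[ \t]+$/, "", key)
      if (key == k && index($0, "=") > 0) found = 1 }
    END { exit(found ? 0 : 1) }
  ' "$1"
}

# check_zk_only_broker FILE: report every violation, fail if there is any.
check_zk_only_broker() {
  local props="$1" key rc=0
  for key in broker.id zookeeper.connect; do                        # must be present
    if ! has_prop "$props" "$key"; then
      echo "missing required key: ${key}" >&2; rc=1
    fi
  done
  for key in process.roles zookeeper.metadata.migration.enable; do  # must be gone
    if has_prop "$props" "$key"; then
      echo "unexpected key still set: ${key}" >&2; rc=1
    fi
  done
  return "$rc"
}
```

Collecting all violations before failing, rather than stopping at the first, gives a complete picture of a half-rolled-back broker in one pass.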
Ishika Paliwal (ishikaa-p) requested a review from a team as a code owner on May 11, 2026 at 20:10
…ack molecule scenarios

Remove the ${JOB_BASE_NAME}${BUILD_NUMBER} env var suffixes from container
names. Only 3 of ~60 scenarios use them (plaintext-basic-rhel,
custom-user-plaintext-rhel, plaintext-rhel-customrepo); all other
scenarios, including every migration-related one, use static names.
Use static names to match the majority pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>