[Feature] Federated Stateful Rollout: Coordinated Blue-Green Migration for Stateful Multi-Component Workloads (Flink/Spark) #7291

@liwang0513

What would you like to be added:

I propose adding Federated Blue/Green Deployment Orchestration for stateful, multi-component workloads, specifically targeting Apache Flink.

This feature would extend Karmada's existing ResourceInterpreter and StatefulFailoverInjection frameworks to coordinate zero-downtime migrations between member clusters. The core components would be:

  • State Discovery (Scraping): A new StatusReflection operation in the Resource Interpreter that identifies and extracts state metadata (e.g., savepointPath or checkpointId) from the resource status in the source cluster.
  • Cross-Cluster Handoff Controller: A lifecycle controller that ensures the "Blue" instance (e.g., in Cluster A) captures its final state before the "Green" instance (e.g., in Cluster B) is started.
  • Metadata Injection Bridge: Uses StatePreservationRules to automatically inject the extracted state paths into the spec of the resource newly propagated to the target cluster.
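
To make the discovery and injection steps concrete, here is a minimal Python sketch of the data flow only. It is not Karmada code and does not use any Karmada or Flink Operator API: the helper names (`get_path`, `set_path`, `discover_state`, `inject_state`) are hypothetical, plain dotted paths stand in for whatever path syntax the real rules would use, and the two field paths are assumptions that should be checked against the FlinkDeployment version in use.

```python
from copy import deepcopy
from typing import Any, Optional

# Assumed field locations; verify against your FlinkDeployment CRD version.
STATUS_PATH = "jobStatus.savepointInfo.lastSavepoint.location"  # relative to .status
SPEC_PATH = "spec.job.initialSavepointPath"                     # relative to the manifest


def get_path(obj: dict, dotted: str) -> Optional[Any]:
    """Walk a dotted path through nested dicts; return None if any key is missing."""
    current: Any = obj
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def set_path(obj: dict, dotted: str, value: Any) -> None:
    """Set a value at a dotted path, creating intermediate dicts as needed."""
    keys = dotted.split(".")
    current = obj
    for key in keys[:-1]:
        current = current.setdefault(key, {})
    current[keys[-1]] = value


def discover_state(blue_status: dict, status_path: str = STATUS_PATH) -> Optional[str]:
    """State Discovery: read the savepoint location from the Blue resource status."""
    return get_path(blue_status, status_path)


def inject_state(green_manifest: dict, savepoint: str, spec_path: str = SPEC_PATH) -> dict:
    """Metadata Injection: return a copy of the Green manifest with the savepoint set."""
    patched = deepcopy(green_manifest)  # leave the caller's manifest untouched
    set_path(patched, spec_path, savepoint)
    return patched
```

In this sketch the control plane would call `discover_state` on the status collected from Cluster A and pass the result to `inject_state` before propagating the manifest to Cluster B. Returning a patched copy rather than mutating the input keeps the original template reusable for later rollouts.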

Why is this needed:

As of early 2026, the Apache Flink Kubernetes Operator (v1.14+) provides robust Blue/Green logic within a single cluster. In a federated setup, however, member clusters are isolated from one another, so users face a knowledge gap when performing multi-cluster upgrades or migrations:

  • State Discontinuity: A Flink Operator in Cluster B has no visibility into the savepoints created by an Operator in Cluster A. This forces manual cold-start migrations, which cause significant data processing lag.
  • Split-Brain Risk: Without a central orchestrator such as Karmada managing the handoff, the Blue and Green instances may both write to the same sink at the same time, causing data corruption or duplication.
  • Operational Complexity: Regional failover for Flink is currently a manual, multi-step process. Automating Federated Blue/Green turns a regional disaster-recovery event into a standard, low-risk deployment pattern.
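
The split-brain concern amounts to an ordering rule: Green must not start until Blue has stopped and recorded its final state. The Python sketch below illustrates that rule under stated assumptions. The `BlueObservation` type, the `"RUNNING"`/`"SUSPENDED"` labels, and the action names are hypothetical and are not taken from Karmada or the Flink Operator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BlueObservation:
    """What the orchestrator last observed about the Blue instance (hypothetical shape)."""
    job_state: str                  # illustrative labels: "RUNNING" or "SUSPENDED"
    final_savepoint: Optional[str]  # None until a final savepoint has been reported


def may_start_green(blue: BlueObservation) -> bool:
    """Green may start only once Blue is suspended AND a final savepoint is known."""
    return blue.job_state == "SUSPENDED" and bool(blue.final_savepoint)


def next_action(blue: BlueObservation) -> str:
    """Pick the next handoff step; never start Green while Blue could still write."""
    if blue.job_state == "RUNNING":
        return "suspend-blue-with-savepoint"
    if not may_start_green(blue):
        return "wait-for-savepoint"
    return "start-green"
```

A reconcile loop built this way would re-evaluate `next_action` on every status update from the source cluster, so a lost or delayed status report delays the Green start instead of letting both instances run.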

This enhancement would allow enterprises to move massive stateful streaming jobs across cloud providers with the same ease as a stateless microservice.

Labels: kind/feature (categorizes issue or PR as related to a new feature)