[Feature] Federated Stateful Rollout: Coordinated Blue-Green Migration for Stateful Multi-Component Workloads (Flink/Spark) #7291

**What would you like to be added**:

I propose adding **Federated Blue/Green Deployment Orchestration** for stateful, multi-component workloads, with Apache Flink as the initial target.

This feature would extend Karmada's existing `ResourceInterpreter` and `StatefulFailoverInjection` frameworks to coordinate zero-downtime migrations between member clusters. The core components would be:

- **State Discovery (Scraping)**: A new `StatusReflection` operation in the Resource Interpreter that identifies and extracts state metadata (e.g., `savepointPath` or `checkpointId`) from the resource status in the source cluster.
- **Cross-Cluster Handoff Controller**: A lifecycle controller that ensures the "Blue" instance (e.g., in Cluster A) has captured a final state before the "Green" instance (e.g., in Cluster B) is started.
- **Metadata Injection Bridge**: A bridge that uses `StatePreservationRules` to automatically inject the extracted state paths into the spec of the newly propagated resource in the target cluster.

**Why is this needed**:

As of early 2026, the **Apache Flink Kubernetes Operator (v1.14+)** provides robust [Blue/Green logic](https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/bluegreen-controller-flow/) within a single cluster. In a federated setup, however, member clusters are isolated from one another, so no component has the cross-cluster knowledge needed for a multi-cluster upgrade or migration. Users currently run into the following problems:

- **State Discontinuity**: A Flink Operator in Cluster B has no visibility into the savepoints created by an Operator in Cluster A. This forces manual cold-start migrations, which cause significant data processing lag.
- **Split-Brain Risk**: Without a central orchestrator such as Karmada managing the handoff, both the Blue and Green instances may write to the same sink simultaneously, leading to data corruption or duplication.
- **Operational Complexity**: Regional failover for Flink is currently a manual, multi-step process. Automating Federated Blue/Green would turn a regional disaster recovery event into a standard, low-risk deployment pattern.

This enhancement would allow enterprises to move large stateful streaming jobs across clusters and cloud providers with nearly the same ease as a stateless microservice.
