What feature would you like to be added?
A selective restart strategy for SparkApplication updates that avoids unnecessary full application restarts when only specific components (executors) or configurations (services/ingress) are modified.
Currently, any change to the SparkApplication spec triggers a complete application restart by setting the status to ApplicationStateInvalidating. This proposal introduces intelligent change detection and selective update mechanisms that respect Spark's Driver-Executor architecture constraints.
Why is this needed?
Current Pain Points
- Inefficient restarts: Changing `executor.priorityClassName` restarts the entire application, including the driver
- Production downtime: Long-running Spark Streaming applications lose all state during unnecessary restarts
- Resource waste: Healthy driver pods are terminated when only executor changes are needed
- Operational friction: Simple scaling or configuration updates require full application cycles
Business Impact
- Streaming applications: Critical for 24/7 data pipelines that cannot afford frequent restarts
- Cost optimization: Avoid terminating expensive, long-running computations
- DevOps efficiency: Enable faster configuration updates and scaling operations
- Resource utilization: Keep healthy components running during partial updates
Technical Context
From internal/controller/sparkapplication/event_filter.go:
```go
if !equality.Semantic.DeepEqual(oldApp.Spec, newApp.Spec) {
	// Force-set the application status to Invalidating which handles clean-up and application re-run.
	newApp.Status.AppState.State = v1beta2.ApplicationStateInvalidating
	// ... complete restart for ANY spec change
}
```

Describe the solution you would like
🚨 Critical Architectural Constraint
Spark Driver cannot be restarted independently because:
- Driver maintains application state (SparkContext, task scheduling, RDD lineage)
- Driver-Executor connections are stateful and cannot be reestablished
- Executors detect Driver disconnection and terminate automatically
- New Driver instance loses all knowledge of existing Executors
Proposed Change Detection Categories
1. Full Restart Required (Driver-Related Changes)
- Driver Configuration: `driver.priorityClassName`, `driver.cores`, `driver.memory`, `driver.javaOptions`
- Application Logic: `type`, `mainClass`, `mainApplicationFile`, `arguments`
- Global Configuration: `sparkConf`, `hadoopConf` (most keys)
- Core Infrastructure: `sparkVersion`, `mode`, `deps`, `volumes`
- Shared Resources: `image`, `imagePullPolicy` (if the change affects both driver and executors; a detection sketch follows this list)
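As a rough illustration of how this driver-scoped category could be detected: the helper name and field grouping below are proposals from this issue, not existing operator code, and hot-updatable fields such as `driver.serviceAnnotations` would need to be masked out before the comparison. Imports (`v1beta2`, `equality`) are the same as in the event filter snippet above.

```go
// Proposed helper, sketched for this issue: true when any field that can only
// take effect through a new driver has changed, forcing a full restart.
func (f *EventFilter) hasDriverChanges(oldApp, newApp *v1beta2.SparkApplication) bool {
	// NOTE: hot-updatable fields (e.g. serviceAnnotations, serviceLabels) must be
	// stripped from the driver specs before this comparison, otherwise they would
	// incorrectly escalate to a full restart.
	if !equality.Semantic.DeepEqual(oldApp.Spec.Driver, newApp.Spec.Driver) {
		return true // driver cores, memory, priorityClassName, javaOptions, ...
	}
	// Application logic and global configuration also require a new driver.
	return oldApp.Spec.Type != newApp.Spec.Type ||
		!equality.Semantic.DeepEqual(oldApp.Spec.MainClass, newApp.Spec.MainClass) ||
		!equality.Semantic.DeepEqual(oldApp.Spec.MainApplicationFile, newApp.Spec.MainApplicationFile) ||
		!equality.Semantic.DeepEqual(oldApp.Spec.Arguments, newApp.Spec.Arguments) ||
		!equality.Semantic.DeepEqual(oldApp.Spec.SparkConf, newApp.Spec.SparkConf) ||
		!equality.Semantic.DeepEqual(oldApp.Spec.HadoopConf, newApp.Spec.HadoopConf)
}
```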
2. Executor Dynamic Replacement (Spark 3.0+ Dynamic Allocation)
- Scaling: `executor.instances`
- Resource Updates: `executor.cores`, `executor.memory`, `executor.coreLimit`
- Pod Configuration: `executor.priorityClassName`, `executor.nodeSelector`
- Lifecycle: `executor.deleteOnTermination`, `executor.javaOptions`
- Dynamic Allocation Settings: `dynamicAllocation.*` (prerequisites sketched after this list)
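This path only works if dynamic allocation is already active in the running application, since a driver with static allocation cannot pick up new executor settings. A minimal sketch of the standard Spark properties involved follows; the values are illustrative, and in practice the CRD's `dynamicAllocation` block would be what sets them.

```go
// Standard Spark dynamic allocation properties; values are illustrative.
// On Kubernetes, shuffle tracking replaces the external shuffle service.
var dynamicAllocationConf = map[string]string{
	"spark.dynamicAllocation.enabled":                 "true",
	"spark.dynamicAllocation.shuffleTracking.enabled": "true",
	"spark.dynamicAllocation.minExecutors":            "1",
	"spark.dynamicAllocation.maxExecutors":            "10",
}
```

If these properties are not already in effect, an executor-only change would have to fall back to the full-restart path.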
3. Hot Updates (No Pod Restart Required)
- Service Configuration: `driver.serviceAnnotations`, `driver.serviceLabels`
- Ingress Configuration: `sparkUIOptions`, `driverIngressOptions`
- Monitoring: `monitoring.*` (if external)
- Metadata: `labels`, `annotations` (on the SparkApplication resource; a detection sketch follows this list)
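For the hot-update category, a minimal detection sketch: the helper name is a proposal from this issue, and the field set simply mirrors the list above as I read the current v1beta2 spec.

```go
// Proposed helper, sketched for this issue: true when only Service/Ingress-facing
// fields differ, i.e. nothing that is baked into a running pod spec.
func (f *EventFilter) hasServiceChanges(oldApp, newApp *v1beta2.SparkApplication) bool {
	return !equality.Semantic.DeepEqual(oldApp.Spec.Driver.ServiceAnnotations, newApp.Spec.Driver.ServiceAnnotations) ||
		!equality.Semantic.DeepEqual(oldApp.Spec.Driver.ServiceLabels, newApp.Spec.Driver.ServiceLabels) ||
		!equality.Semantic.DeepEqual(oldApp.Spec.SparkUIOptions, newApp.Spec.SparkUIOptions) ||
		!equality.Semantic.DeepEqual(oldApp.Spec.DriverIngressOptions, newApp.Spec.DriverIngressOptions)
}
```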
Implementation Approach
Enhanced Event Filter
```go
func (f *EventFilter) detectSpecChanges(oldApp, newApp *v1beta2.SparkApplication) UpdateScope {
	if f.hasDriverChanges(oldApp, newApp) {
		return UpdateScopeFull // Driver changes require full restart
	}
	if f.hasExecutorChanges(oldApp, newApp) {
		return UpdateScopeExecutorDynamic // Use Dynamic Allocation
	}
	if f.hasServiceChanges(oldApp, newApp) {
		return UpdateScopeHot // Update K8s resources only
	}
	return UpdateScopeNone
}
```

New Application States
```go
const (
	ApplicationStateExecutorScaling ApplicationStateType = "EXECUTOR_SCALING"
	ApplicationStateHotUpdating     ApplicationStateType = "HOT_UPDATING"
)
```

Example Use Cases
✅ Executor Scaling/Replacement
```yaml
spec:
  executor:
    instances: 5                        # 3→5 scale up
    priorityClassName: "high-priority"  # Requires replacement
```

Result: Dynamic Allocation gradually replaces executors; the driver keeps running.
✅ Service Hot Update
```yaml
spec:
  driver:
    serviceAnnotations:
      monitoring: "enabled"
  sparkUIOptions:
    servicePort: 8080
```

Result: Update the Kubernetes Service/Ingress directly; no pod restart.
⚠️ Driver Change (Full Restart)
```yaml
spec:
  driver:
    priorityClassName: "critical"  # Driver config change
```

Result: Complete application restart (current behavior).
Describe alternatives you have considered
Alternative 1: External Scaling Controller
Create a separate controller that handles Dynamic Allocation outside the SparkApplication CRD.
- Pros: Cleaner separation of concerns
- Cons: Additional complexity, fragmented user experience
Alternative 2: Annotation-Based Hints
Use annotations to hint which changes should trigger selective updates.
- Pros: User control over restart behavior
- Cons: Requires user understanding of Spark internals, error-prone
Alternative 3: Immutable SparkApplication with Separate Scaling Resource
Make SparkApplication immutable and create separate ExecutorScale resource.
- Pros: Clear API boundaries
- Cons: Breaking change, increased API surface
Alternative 4: Keep Current Behavior
Continue with full restarts for all changes.
- Pros: Simple, predictable
- Cons: Poor operational experience for production workloads
Additional context
Technical Implementation Details
Executor Dynamic Allocation Mechanism
```go
// ExecutorChanges is a proposed type summarizing the diff between the old and new executor specs.
func (r *Reconciler) updateExecutorConfiguration(ctx context.Context, app *v1beta2.SparkApplication, changes ExecutorChanges) error {
	// Scale down old executors
	if err := r.requestExecutorRemoval(ctx, app, changes.ExecutorsToRemove); err != nil {
		return err
	}
	// Scale up with new configuration
	if err := r.requestExecutorAddition(ctx, app, changes.NewExecutorConfig); err != nil {
		return err
	}
	return nil
}
```

Hot Update Mechanism
```go
func (r *Reconciler) hotUpdateServices(ctx context.Context, app *v1beta2.SparkApplication) error {
	// serviceKey is assumed to identify the driver service owned by this application (lookup omitted here).
	service := &corev1.Service{}
	if err := r.client.Get(ctx, serviceKey, service); err != nil {
		return err
	}
	service.Annotations = app.Spec.Driver.ServiceAnnotations
	return r.client.Update(ctx, service)
}
```

Limitations and Constraints
What CANNOT be done:
- Driver restart with executor preservation: Architecturally impossible
- Live executor pod configuration changes: Pod specs are immutable
- Cross-pod network reconfiguration: Requires full restart
What CAN be done:
- Executor replacement via Dynamic Allocation: Safe and supported by Spark
- Service/Ingress hot updates: No impact on running pods
- Metadata updates: Labels and annotations on owned K8s resources (a minimal sketch follows this list)
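For the metadata case, a minimal sketch of a hot update that never touches pods, assuming a controller-runtime client and a hypothetical helper name; a merge patch keeps the change to the owned object as small as possible.

```go
// Hypothetical sketch: propagate label changes to an already-created object
// (e.g. the driver Service) with a merge patch; running pods are never touched.
func (r *Reconciler) hotUpdateLabels(ctx context.Context, obj client.Object, labels map[string]string) error {
	patch := client.MergeFrom(obj.DeepCopyObject().(client.Object))
	if obj.GetLabels() == nil {
		obj.SetLabels(map[string]string{})
	}
	for k, v := range labels {
		obj.GetLabels()[k] = v
	}
	return r.client.Patch(ctx, obj, patch)
}
```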
Implementation Phases
- Phase 1: Change detection and categorization logic
- Phase 2: Dynamic Allocation integration for executor updates
- Phase 3: Hot update mechanism for service configurations
- Phase 4: Configuration options and rollback mechanisms
- Phase 5: Comprehensive testing and documentation
Backward Compatibility
- Default behavior remains unchanged (full restart)
- Selective restart enabled via annotation: `spark-operator.kubeflow.org/restart-strategy: "selective"` (opt-in check sketched below)
- All existing applications continue working without modification
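A sketch of the opt-in check, using the annotation key proposed above; the helper name is illustrative and nothing here exists in the operator today.

```go
// Annotation key proposed in this issue; not an existing operator annotation.
const restartStrategyAnnotation = "spark-operator.kubeflow.org/restart-strategy"

// selectiveRestartEnabled is a hypothetical helper: unless the user explicitly
// opts in, the event filter keeps today's full-restart behavior.
func selectiveRestartEnabled(app *v1beta2.SparkApplication) bool {
	return app.Annotations[restartStrategyAnnotation] == "selective"
}
```

Anything other than `"selective"`, including a missing annotation, keeps the current behavior, so existing applications are unaffected.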
Requirements
- Spark Version: 3.5.2 (for Dynamic Allocation)
- Kubernetes Version: 1.31
- Spark Operator: v2.2.0 (current main branch)
Related Work
- Kubernetes StatefulSet Rolling Updates
- Apache Spark Dynamic Allocation
- Similar selective restart patterns in Kafka Operator, Elasticsearch Operator
This feature would significantly improve the operational experience for production Spark workloads by providing fine-grained control over restart behavior while respecting Spark's architectural constraints.