Description
Search before asking
- I had searched in the issues and found no similar issues.
What happened
Background
I am running 25 real-time streaming synchronization jobs on the SeaTunnel Zeta engine in a production environment. Each job has a checkpoint interval of 5000 ms (5 seconds). The jobs primarily synchronize data from MySQL CDC to Iceberg.
Symptom Timeline
Phase 1 - Normal Operation (Day 1-2):
- All jobs started successfully and ran stably
- Heap memory usage remained low and stable
- ZGC (Z Garbage Collector) worked efficiently with:
  - ~50 GC cycles
  - ~4 minutes cycle duration
  - Heap memory close to 0
Phase 2 - Sudden Memory Spike (Around Day 3):
- Heap memory suddenly spiked from near 0 to approximately 18.6 GiB
- ZGC cycles increased dramatically from ~50 to ~150 cycles
- ZGC cycle duration increased from ~4 minutes to ~12.5 minutes
- ZGC trigger reasons shifted to "Allocation Rate" and "High Usage"
- The system eventually crashed with OutOfMemoryError
This pattern repeated after restarting the jobs: stable for a few days, then a sudden memory explosion.
Investigation
I captured a heap dump and analyzed it using the Alibaba Cloud JVM Analysis Tool. The analysis revealed:
- Largest Memory Consumer: SinkAggregatedCommitterTask instances
- GC Root Path:
  TaskExecutionService$BlockingWorker (Thread)
  └── TaskTracker
      └── SinkAggregatedCommitterTask
          ├── commitInfoCache (ConcurrentHashMap) - continuously growing
          └── checkpointBarrierCounter (ConcurrentHashMap) - continuously growing
Root Cause Analysis (Code Deep Dive)
After analyzing the source code of SinkAggregatedCommitterTask.java, I identified the root cause:
Two internal maps are never cleaned up after checkpoint completion:
- commitInfoCache (ConcurrentMap<Long, List<CommandInfoT>>):

  // Data is ADDED here (receivedWriterCommitInfo method, line 284-291):
  public void receivedWriterCommitInfo(long checkpointID, CommandInfoT commitInfos) {
      commitInfoCache.computeIfAbsent(checkpointID, id -> new CopyOnWriteArrayList<>());
      commitInfoCache.get(checkpointID).add(commitInfos); // ← Added but NEVER removed
  }

- checkpointBarrierCounter (Map<Long, Integer>):

  // Counter is INCREMENTED here (triggerBarrier method, line 211-213):
  Integer count = checkpointBarrierCounter.compute(
          barrier.getId(), (id, num) -> num == null ? 1 : ++num); // ← Incremented but NEVER removed

- checkpointCommitInfoMap is properly cleaned in notifyCheckpointComplete() - but the other two maps are NOT:

  // notifyCheckpointComplete method (line 304-320):
  checkpointCommitInfoMap.forEach(
          (key, value) -> {
              if (key > checkpointId) {
                  return;
              }
              aggregatedCommitInfo.addAll(value);
              checkpointCommitInfoMap.remove(key); // ✅ This is cleaned
              // ❌ commitInfoCache is NOT cleaned!
              // ❌ checkpointBarrierCounter is NOT cleaned!
          });
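A minimal, self-contained sketch of one possible fix is shown below: purge both leftover maps at the same point where checkpointCommitInfoMap is already cleaned. This is an illustration only; field and method names mirror the snippets above, but the payload type is simplified to String and everything else in the real SinkAggregatedCommitterTask is omitted.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Standalone sketch only: payload type simplified, class/method bodies trimmed.
public class CommitterMapCleanupSketch {
    private final Map<Long, List<String>> commitInfoCache = new ConcurrentHashMap<>();
    private final Map<Long, Integer> checkpointBarrierCounter = new ConcurrentHashMap<>();

    void receivedWriterCommitInfo(long checkpointId, String commitInfo) {
        commitInfoCache
                .computeIfAbsent(checkpointId, id -> new CopyOnWriteArrayList<>())
                .add(commitInfo);
    }

    void triggerBarrier(long checkpointId) {
        checkpointBarrierCounter.compute(checkpointId, (id, num) -> num == null ? 1 : num + 1);
    }

    // Proposed addition: when a checkpoint completes, drop entries for that checkpoint
    // and any older ones, alongside the existing checkpointCommitInfoMap.remove(key) call.
    void notifyCheckpointComplete(long checkpointId) {
        commitInfoCache.keySet().removeIf(id -> id <= checkpointId);
        checkpointBarrierCounter.keySet().removeIf(id -> id <= checkpointId);
    }
}

Removing every entry whose id is <= the completed checkpoint id (rather than only the exact id) would also release entries left behind by aborted or skipped checkpoints.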
Why the Memory Spike Appears "Sudden"
The memory leak accumulates silently over time:
| Metric | Calculation |
|---|---|
| Checkpoint frequency | every 5 seconds |
| Checkpoints per hour | 3,600 s / 5 s = 720 per job |
| Checkpoints per day | 720 × 24 = 17,280 per job |
| Total (25 jobs) per day | 17,280 × 25 = 432,000 uncleaned map entries |
| After 3 days | 432,000 × 3 ≈ 1.3 million uncleaned entries |
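To sanity-check that this entry count is consistent with the observed ~18.6 GiB spike, here is a rough estimate. The per-entry retained size is a hypothetical figure chosen only for illustration (Iceberg commit info carries per-file metadata, so kilobytes per entry is plausible); it is not measured from the heap dump.

// Rough consistency check: how much heap would ~1.3 million uncleaned entries retain?
// ASSUMPTION: ~15 KiB retained per entry (hypothetical, not measured).
public class LeakEstimate {
    public static void main(String[] args) {
        long checkpointsPerDayPerJob = (3_600 / 5) * 24;               // 17,280
        long entriesAfterThreeDays = checkpointsPerDayPerJob * 25 * 3; // ~1.3 million
        long assumedBytesPerEntry = 15 * 1024;                         // hypothetical average
        double retainedGiB = entriesAfterThreeDays * (double) assumedBytesPerEntry / (1L << 30);
        System.out.printf("uncleaned entries: %,d%n", entriesAfterThreeDays);
        System.out.printf("retained heap at ~15 KiB/entry: %.1f GiB%n", retainedGiB);
    }
}

At roughly 15 KiB retained per entry, ~1.3 million uncleaned entries would account for about 18-19 GiB, in the same range as the observed spike.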
The "sudden" spike occurs because:
- Initially, the leaked memory is small and ZGC handles it easily
- As entries accumulate, they reach a critical threshold
- ZGC cannot reclaim these objects (they're still referenced)
- This triggers a cascade: more frequent GC → longer GC pauses → allocation stalls → memory explosion
This explains why monitoring shows stable memory for days, then a sudden vertical spike.
SeaTunnel Version
2.3.12
SeaTunnel Config
env {
parallelism = 1
job.mode = "STREAMING"
checkpoint.interval = 5000
}
source {
MySQL-CDC {
# CDC configuration
}
}
sink {
Iceberg {
# Iceberg sink configuration
}
}
Running Command
seatunnel.sh -c mysql_to_iceberg.conf --async
Error Exception
java.lang.OutOfMemoryError: Java heap space
=== Heap Dump Analysis (Alibaba Cloud JVM Tool) ===
Largest Object by Retained Size:
- Class: org.apache.seatunnel.engine.server.task.SinkAggregatedCommitterTask
- Count: 25 instances (one per job)
- Retained Size: Majority of heap
GC Root Path:
java.lang.Thread @ TaskExecutionService$BlockingWorker
└── org.apache.seatunnel.engine.server.execution.TaskTracker
└── org.apache.seatunnel.engine.server.task.SinkAggregatedCommitterTask
├── commitInfoCache: java.util.concurrent.ConcurrentHashMap
│ └── Size: continuously growing (never cleaned)
└── checkpointBarrierCounter: java.util.concurrent.ConcurrentHashMap
└── Size: continuously growing (never cleaned)
Code Search Verification:
- "commitInfoCache.remove" → NOT FOUND in entire codebase
- "commitInfoCache.clear" → NOT FOUND in entire codebase
- "checkpointBarrierCounter.remove" → NOT FOUND in entire codebase
- "checkpointBarrierCounter.clear" → NOT FOUND in entire codebase
Zeta or Flink or Spark Version
Zeta Engine (SeaTunnel native engine)
Java or Scala Version
JDK 17 (with ZGC enabled)
Screenshots
ZGC Monitoring Dashboard (6-hour intervals):
(Attach ZGC monitoring screenshot here showing:)
- ZGC Cycles: Increased from ~50 to ~150
- ZGC Heap Memory: Sudden spike from ~0 to ~18.6 GiB around day 3
- ZGC Cycle Duration: Increased from ~4 min to ~12.5 min
- Trigger Reasons: "Allocation Rate" and "High Usage" increased significantly
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct