[Feature][Zeta] STIP-23 Phase 1: Add engine timer flush core flow and FlushSignal handling #10800
nzw921rx wants to merge 32 commits into apache:dev
Conversation
Here is the review result from GPT; I think you can refer to it.

What This PR Solves

The user-facing pain point is that some sinks buffer writes, and in low-throughput or idle-source scenarios, if no data keeps flowing to trigger a flush, users may wait a long time before seeing data land. The fix introduces sink.flush.interval: the Source side injects FlushSignal periodically via a timer, Transforms pass it through transparently, and the Sink executes the registered flush action upon receiving the signal. In one sentence: this is Phase 1 of STIP-23, wiring up the "engine-level periodic flush signal" main path for the Zeta engine.

A simple example: if a Sink only flushes after buffering 1,000 rows, but the source only produces 10 rows per minute, the old logic could leave data stuck in the buffer indefinitely. This PR makes the engine send a FlushSignal at regular intervals to prompt the Sink to flush its buffered data.

Execution Flow

timerFlushWorker tick

Key Findings

The main path is triggered when sink.flush.interval > 0, but no connector in the current repository calls registerFlushAction, so existing connectors will receive and silently ignore the signal for now. The non-blocking delivery logic for Signal in the BlockingQueue branch is broadly reasonable — dropping flush signals on a full queue prevents the timer thread from being back-pressured. However, the Disruptor branch introduces a serious regression: checkpoint barriers are no longer published to the ring buffer, which breaks the downstream checkpoint chain. The Source timer's lifecycle cleanup is also fragile: if reader.close() throws, closeFlushTimer() will never execute. Finally, the documentation and Option description reference enable_timer_flush, but no such config key exists in the code — the actual opt-in mechanism is SinkWriter.Context.registerFlushAction(...).

Issue 1: Disruptor Branch Swallows Checkpoint Barriers

Location: RecordEventProducer.java:37

In the baseline, after processing a Barrier, execution falls through to the unified ringBuffer.next() / publish() path. After this PR, the barrier branch only calls ack() and setPrepareClose() without invoking publishRecord(record, ringBuffer). Since RecordEventHandler only forwards records to collector.collect(record) after consuming them from the ring buffer, checkpoint barriers in Source/Transform/Sink chains using INTERMEDIATE_DISRUPTOR_QUEUE will stall on the producer side and never propagate to downstream Transforms or Sinks. This can result in incomplete checkpoints, broken final-checkpoint/prepare-close semantics, and an unreliable recovery path.

Recommended fix: Barriers must still go through the blocking publish path. tryPublishEvent can be kept for Signal, but after the barrier branch handles ack/prepareClose, it must call publishRecord(record, ringBuffer). Tests should be added to assert that the cursor advances and that the handler can collect the barrier.

Issue 2: Flush Timer Leaks When reader.close() Throws During Source Shutdown

Location: SourceFlowLifeCycle.java:204

The current close() sequence is reader.close(); closeFlushTimer(); super.close();. If the connector's reader.close() throws an IOException, the newly registered timer is never cancelled. This means that during an abnormal shutdown or task cancellation, the timer thread may continue holding references to SourceFlowLifeCycle, the collector, and the task context, and keep attempting to send FlushSignal, resulting in incomplete resource cleanup and misleading WARN log noise.

Recommended fix: Move timer cleanup into a finally block. More robustly: cancel the timer first, then close the reader, then call super.close(). At minimum, closeFlushTimer() must execute regardless of the outcome of reader.close().
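To make the suggested ordering concrete, here is a minimal, self-contained sketch of the cleanup sequencing; it is not the actual SourceFlowLifeCycle code, and java.util.Timer merely stands in for the engine's timerFlushWorker scheduling:

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Timer;

// Sketch only: cancel the flush timer before the close call that can throw,
// and keep any remaining cleanup in a finally block so it always runs.
class TimerFlushCloseSketch implements Closeable {
    private final Timer flushTimer = new Timer("flush-timer-sketch", true);
    private final Closeable reader;

    TimerFlushCloseSketch(Closeable reader) {
        this.reader = reader;
    }

    @Override
    public void close() throws IOException {
        flushTimer.cancel(); // cannot throw, so the timer can never be leaked
        try {
            reader.close();
        } finally {
            // the real class would call super.close() (and any other cleanup) here
        }
    }
}
```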
Issue 3: enable_timer_flush Is a Phantom Config in the Documentation

Location: EnvCommonOptions.java:88, JobEnvConfig.md:58

Both the Option description and the English/Chinese documentation state that the Sink connector must enable enable_timer_flush. However, a codebase-wide search finds this string only in descriptions and docs — there is no corresponding Option definition or parsing logic. The real enablement mechanism is for the writer to call context.registerFlushAction(...) during initialization. Users will attempt to configure a parameter that does not exist, and connector authors may mistakenly believe they need to add a new connector-level config rather than registering a flush callback.

Recommended fix: Update the documentation to read: "the connector must register a flush action via SinkWriter.Context.registerFlushAction." If a connector-level toggle is needed in the future, it should be defined in a dedicated connector PR with a documented default value.

Test Coverage

The new tests cover timer registration/cancellation, Signal delivery, SinkWriterContext, BlockingQueue Signal, and Disruptor Signal. However, RecordEventProducerTest.barrierIsAlwaysPublishedAndFlipsPrepareCloseForFinalCheckpoint only asserts ack and setPrepareClose — it does not verify that the barrier is actually published to the ring buffer, which is why Issue 1 was missed. It is recommended to add assertions on barrier cursor advancement and handler collection, as well as a lifecycle test verifying that the timer is cancelled when reader.close() throws.

Compatibility and Side Effects

API compatibility is maintained via a default method on SinkWriter.Context, preserving binary compatibility. The new env option defaults to 0, so existing jobs are unaffected. There is no additional performance overhead by default; when enabled, each source subtask registers a periodic task, and signals are dropped when the queue is full to prevent the timer thread from being stalled by downstream back-pressure. One point worth confirming: whether Signal as a Record payload should implement Serializable, consistent with other control-plane payloads such as CheckpointBarrier and SchemaChangeEvent.

Conclusion: Merge After Fixes
Overall assessment: the direction is right, and the abstraction is appropriately restrained. The Signal → Transform passthrough → Sink flushAction design is a solid foundation for the long-term solution. The current implementation should not be merged as-is; fixing the Disruptor barrier regression will make this PR substantially more stable.
Hi @nzw921rx, I rechecked the current PR head locally.

This PR wires a real Zeta timer-flush control path. The feature direction is good, but I found two blockers in the current head:
There is also a documentation/API mismatch:

Conclusion: can merge after fixes

Blocking items:
CI is currently still running/in progress in the fetched metadata, so this also needs a green build after the fixes.
Force-pushed from 33fbc02 to f9268bc
@davidzollo Thank you for your review. The suggestions were great and have been fixed.
@zhangshenghang @corgy-w @liunaijie @dybyte PTAL when you have time. Thanks
…shSignal handling
…ception in SourceFlowLifeCycle
- RecordEventProducer: barrier branch was missing publishRecord call, causing checkpoint barriers to be silently dropped
- SourceFlowLifeCycle#close: wrap in try/finally so closeFlushTimer always runs even if reader.close or super.close throws
- RecordEventProducerTest: add gating sequence so remainingCapacity correctly reflects ring buffer fullness
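For readers following the barrier fix above, here is a rough sketch of a blocking publish helper in the spirit of publishRecord. RecordHolder and its setter are hypothetical stand-ins for the project's event type; the RingBuffer calls are the standard LMAX Disruptor API, and this is not the actual RecordEventProducer code:

```java
import com.lmax.disruptor.RingBuffer;

// Sketch only: barriers take the blocking claim/publish path so they are never
// dropped, unlike best-effort signals which may use tryPublishEvent and be
// dropped under backpressure.
final class BarrierPublishSketch {
    static <T> void publishRecord(T record, RingBuffer<RecordHolder<T>> ringBuffer) {
        long sequence = ringBuffer.next(); // blocks until a slot is available
        try {
            ringBuffer.get(sequence).setRecord(record);
        } finally {
            ringBuffer.publish(sequence); // always publish the claimed sequence
        }
    }

    // Hypothetical event holder standing in for the project's RecordEvent.
    static final class RecordHolder<T> {
        private T record;

        void setRecord(T record) {
            this.record = record;
        }
    }
}
```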
Force-pushed from bffc467 to 86384b5
…xception
- Replace Runnable with RunnableWithException in SinkWriter.Context#registerFlushAction / getFlushAction
- Update SinkWriterContext field and implementation accordingly
- Remove the need for connector-level try-catch wrapping of IOException in timerFlush()
- Fix SinkWriterContextTest to declare throws Exception on affected test method
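For context, a RunnableWithException-style hook is typically just a single-method functional interface, which is why connectors no longer need their own try-catch around IOException. The sketch below is illustrative only; the exact type introduced by this commit may differ in package and declared exception:

```java
// Illustrative only; the actual interface used by registerFlushAction may differ.
@FunctionalInterface
public interface RunnableWithException {
    void run() throws Exception;
}
```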
Hi @nzw921rx, thanks for the careful update here. I re-reviewed the latest head locally.

What This PR Fixes
Core Flow Reviewed

The two earlier risks I cared about are now addressed:
Findings

Issue 1: Phase 1 documents a connector option that is not actually available in this PR
The engine contract introduced by this PR is

Suggested fix: in Phase 1, describe only the engine-level contract and say sinks must opt in by registering a flush action. If you want to mention

Compatibility
Performance And Side Effects
Tests And Docs

The unit coverage around timer registration, signal forwarding, sink writer context, blocking queue, and Disruptor publication is meaningful. The remaining gap is connector adoption, which belongs to the follow-up PR. The docs need the wording fix above.

Merge Conclusion

Conclusion: merge after fixes
Overall, the engine-side design is solid and the previous concurrency/resource-lifecycle concerns are in much better shape now. Once the documentation contract is tightened and CI is green, this should be in good shape.
Hi @nzw921rx, I rechecked the latest head locally.

What changed after Daniel's previous review
Runtime chain I rechecked

Findings
Merge conclusion

Conclusion: can merge

Blocking items:
Non-blocking note:
Thanks for the follow-up iterations here. The current head looks ready to merge from my side.
Thank you very much for implementing so many improvements so quickly; the results are impressive. The E2E environment uses a single-node setup with a dedicated test sink. It can verify that the new registerFlushAction hook is triggered, but it cannot guarantee effective coverage across task-group and Hazelcast serializer boundaries, nor confirm the stability of MultiTable scenarios.

Potential Risks: Given the substantial scope of this change, I propose the following suggestions:
Hi @nzw921rx, I rechecked the latest head locally after the latest follow-up comment. The earlier issues that needed blocking review are now fixed on the current head:
Runtime chain I rechecked

Current findings:
Conclusion: can merge

Blocking items:
Non-blocking note:
Thanks for the follow-up iterations here. The current head looks ready to merge from my side.
…task state machine
@davidzollo Thanks for the thorough review — all three risk areas are spot on. I've added the corresponding tests:
Added
Created
Added
Hi @nzw921rx, I pulled the latest head locally again and rechecked the new follow-up after your latest comment.

What this PR solves
What I rechecked in this follow-up

I do not see a reopened code blocker from these additions.

Conclusion: merge after fixes
From Daniel's side, the current head still looks technically sound; the remaining gate is CI on the newest head.
I found some new issues as follows:

1. Critical Concurrency Risk: Missing
@davidzollo Thank you for your detailed suggestions
```java
public void sendRecordToNext(Record<?> record) throws IOException {
    synchronized (checkpointLock) {
        for (OneInputFlowLifeCycle<Record<?>> output : outputs) {
            output.received(record);
        }
    }
}
```
```java
if (signal instanceof FlushSignal && writerContext.getFlushAction() != null) {
    writerContext.getFlushAction().run(); // <-- No exception handling
}
```
@dybyte Please help review it. Any suggestions would be greatly appreciated.
DanielLeens left a comment
Hi @gabby1996, first on the earlier thread: on the newest head, I agree the checkpointLock concern is already covered indirectly because SeaTunnelSourceCollector.sendRecordToNext() synchronizes on that same lock. I also would keep flush-action failures task-failing rather than swallowing them, because a failed flush is still a real sink write/visibility failure, not something the engine should silently treat as success.
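To make that concrete, here is a sketch of the kind of handling I have in mind on the sink side. It mirrors the fragment quoted above, but it is not the actual SinkFlowLifeCycle code, and the wrapping exception type is just an example:

```java
if (signal instanceof FlushSignal && writerContext.getFlushAction() != null) {
    try {
        writerContext.getFlushAction().run();
    } catch (Exception e) {
        // Propagate so the task fails and normal failure/recovery handling kicks in,
        // instead of treating a failed flush as success.
        throw new IOException("Timer flush action failed", e);
    }
}
```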
I pulled the latest head locally on seatunnel-review-10800, rechecked the full source -> transform -> sink signal path against upstream/dev, and reviewed the current head commit 213912224292031cf7e9e1f73a712f35525e95f5. I did not run local Maven in this batch; this is a source-level review plus current GitHub check metadata.
What this PR fixes
- User pain: buffered sinks can leave data sitting too long in low-throughput or idle-source cases.
- Fix approach: add sink.flush.interval, emit FlushSignal from the source side on a timer, pass it through transforms, and let the sink execute the registered flush action.
- One-line summary: the signal path is much healthier now, but the current flush cadence semantics are still wrong on multi-source fan-in topologies.
Runtime chain I checked
job startup
-> SourceSeaTunnelTask.createSourceFlowLifeCycle() [119-135]
-> read sink.flush.interval
-> create one SourceFlowLifeCycle per source subtask
timer registration
-> SourceFlowLifeCycle.startFlushTimer() [344-358]
-> TaskExecutionService.registerTimerFlushTask(...) [689-714]
-> one scheduled task per source subtask
timer tick
-> SourceFlowLifeCycle.onTimerTick() [184-196]
-> collector.sendRecordToNext(new Record<>(FlushSignal))
-> SeaTunnelSourceCollector.sendRecordToNext() [191-196]
-> broadcast to every downstream output under checkpointLock
middle stages
-> TransformFlowLifeCycle.received(...) [81-93,115-120]
-> Signal is passed through unchanged
sink execution
-> SinkFlowLifeCycle.received(...) [260-267]
-> every received FlushSignal runs writerContext.getFlushAction().run()
Key review findings
- The latest head fixed the earlier barrier-publish concern and tightened timer cleanup, which is good progress.
- The remaining blocker is the trigger dimension: the timer is attached to each source subtask, not to each sink writer.
- That means a sink task receives one flush signal per upstream source timer tick, so the real flush rate scales with source parallelism.
- Example: with sink.flush.interval=1000ms, 10 source subtasks feeding 1 sink task can drive roughly 10 sink flushes per second instead of the user-expected ~1.
- Current GitHub Build is still pending on the latest head, but my blocking conclusion here is code-level and does not depend on that check result.
Blocking issue
Issue 1: sink.flush.interval does not currently mean “one sink flush cadence”
- Location: SourceSeaTunnelTask.java:119, SourceFlowLifeCycle.java:184, SinkFlowLifeCycle.java:260
- Why this is a real problem: the current design makes the source side define the sink flush frequency. In fan-in topologies, the sink sees N independent periodic signals rather than one cadence.
- Risk: excessive flush frequency, more small-batch I/O, more sink overhead, more timeout/limit exposure, and user-visible behavior that does not match the config meaning.
- Suggested fix: option A is the cleaner one: move the timer ownership to the sink writer / sink task side so one sink writer owns one flush cadence. Option B would be to deduplicate signals at the sink side within a time window, but that is more of a mitigation than a clean design.
Conclusion
Conclusion: merge after fixes
- Blocking items
- Issue 1 must be fixed first. The core semantics of sink.flush.interval are not right yet on the normal multi-source path.
- Suggested follow-up
- Please add a regression test for multi-source fan-in to one sink, and assert that one interval produces only one effective sink flush.
The direction here is promising, and the latest head is clearly stronger than the earlier revisions. The remaining blocker is just important enough that I would not merge before it is corrected.
@DanielLeens Thanks for the detailed review — I rechecked this with a targeted physical-plan test and the current engine topology does not match the assumed fan-in shape. I added a
This indicates that in the current execution model, sink tasks are expanded along the same per-subtask task-chain pattern, i.e., we do not get a
That said, if we later introduce a true cross-chain fan-in topology (multiple upstream subtasks converging on a single sink writer), this concern would become valid and should be re-evaluated then.

case:

```java
@Test
@SetEnvironmentVariable(key = SKIP_CHECK_JAR, value = "true")
public void testSource10Transform4Sink4PhysicalExpand() throws MalformedURLException {
IdGenerator idGenerator = new IdGenerator();
Action sourceAction =
new SourceAction<>(
idGenerator.getNextId(),
"fake-source",
createFakeSource(),
Sets.newHashSet(new URL("file:///fake.jar")),
Collections.emptySet());
LogicalVertex sourceVertex = new LogicalVertex(sourceAction.getId(), sourceAction, 10);
CatalogTable table = createSimpleCatalogTable("default_table");
Action transformAction =
new TransformAction(
idGenerator.getNextId(),
"noop-transform",
new ArrayList<>(Collections.singleton(sourceAction)),
createNoopTransform(table),
Sets.newHashSet(new URL("file:///transform.jar")),
Collections.emptySet());
LogicalVertex transformVertex =
new LogicalVertex(transformAction.getId(), transformAction, 4);
Action sinkAction =
new SinkAction<>(
idGenerator.getNextId(),
"console-sink",
new ArrayList<>(Collections.singleton(transformAction)),
new ConsoleSink(table, ReadonlyConfig.fromMap(new HashMap<>())),
Sets.newHashSet(new URL("file:///console.jar")),
Collections.emptySet());
LogicalVertex sinkVertex = new LogicalVertex(sinkAction.getId(), sinkAction, 4);
JobConfig config = new JobConfig();
config.setName("source10-transform4-sink4");
LogicalDag logicalDag = new LogicalDag(config, idGenerator);
logicalDag.addLogicalVertex(sourceVertex);
logicalDag.addLogicalVertex(transformVertex);
logicalDag.addLogicalVertex(sinkVertex);
logicalDag.addEdge(new LogicalEdge(sourceVertex, transformVertex));
logicalDag.addEdge(new LogicalEdge(transformVertex, sinkVertex));
JobImmutableInformation jobImmutableInformation =
new JobImmutableInformation(
2L,
"source10-transform4-sink4",
nodeEngine.getSerializationService(),
logicalDag,
Collections.emptyList(),
Collections.emptyList());
IMap<Object, Object> runningJobState =
nodeEngine.getHazelcastInstance().getMap("testRunningJobState_source10_t4");
IMap<Object, Long[]> runningJobStateTimestamp =
nodeEngine
.getHazelcastInstance()
.getMap("testRunningJobStateTimestamp_source10_t4");
PhysicalPlan physicalPlan =
PlanUtils.fromLogicalDAG(
logicalDag,
nodeEngine,
jobImmutableInformation,
System.currentTimeMillis(),
Executors.newCachedThreadPool(),
server.getClassLoaderService(),
instance.getFlakeIdGenerator(Constant.SEATUNNEL_ID_GENERATOR_NAME),
runningJobState,
runningJobStateTimestamp,
QueueType.BLOCKINGQUEUE,
new EngineConfig())
.f0();
SubPlan subPlan = physicalPlan.getPipelineList().get(0);
Assertions.assertEquals(10, subPlan.getPhysicalVertexList().size());
Assertions.assertTrue(
subPlan.getPhysicalVertexList().stream()
.allMatch(v -> v.getTaskGroupImmutableInformation().getTasksData().size() == 2));
int taskCount = 0;
int sourceTaskCount = 0;
int transformTaskCount = 0;
int sinkTaskCount = 0;
for (PhysicalVertex physicalVertex : subPlan.getPhysicalVertexList()) {
for (Task task : physicalVertex.getTaskGroup().getTasks()) {
taskCount++;
if (task instanceof SeaTunnelTask) {
SeaTunnelTask seaTunnelTask = (SeaTunnelTask) task;
boolean containsSource = false;
boolean containsTransform = false;
boolean containsSink = false;
for (ActionStateKey actionStateKey : seaTunnelTask.getActionStateKeys()) {
String keyName = actionStateKey.getName();
containsSource = containsSource || keyName.contains("fake-source");
containsTransform = containsTransform || keyName.contains("noop-transform");
containsSink = containsSink || keyName.contains("console-sink");
}
if (containsSource) {
sourceTaskCount++;
}
if (containsTransform) {
transformTaskCount++;
}
if (containsSink) {
sinkTaskCount++;
}
}
}
}
System.out.println(
"[source=10, transform=4, sink=4] generated counts => tasks="
+ taskCount
+ ", sourceTasks="
+ sourceTaskCount
+ ", transformTasks="
+ transformTaskCount
+ ", sinkTasks="
+ sinkTaskCount);
}
```

result:
You're right! Good job.
@davidzollo Thanks a lot for the thorough review and approval! |
Thanks for the follow-up discussion. I re-reviewed the latest head locally and traced the timer-flush path end to end.

What this PR solves
Runtime path

Review findings

Issue 1: Flush cadence is multiplied by source parallelism on the normal fan-in path
Merge conclusion

Conclusion: Merge after fixes

Blocking items:
Non-blocking suggestions:
CI status:
@DanielLeens The core design of the source injection remains unchanged. Please refer to #10717 for specific design details.
DanielLeens left a comment
Hi @nzw921rx, thanks for the follow-up. I also went back and read #10717 before re-checking the current head locally, so I’m responding with the design background in mind rather than only the current diff.
I agree with the core goal in #10717: Zeta should provide a task-thread timer callback so sink connectors do not have to create their own background scheduler threads, and flush failures can still fail fast on the task path.
However, after re-tracing the current implementation, I still do not think the remaining blocker is resolved by that design note.
What this PR solves:
- User pain: buffered sinks can hold data too long in low-throughput / idle scenarios.
- Fix approach: source-side timers emit FlushSignal, transforms pass it through, and sinks run writerContext.getFlushAction() when the signal arrives.
- One-line summary: the direction is good, but the current cadence owner is still the source side, so the runtime semantics of sink.flush.interval are wrong on normal fan-in topologies.
Runtime chain I re-verified:
task startup
-> SourceSeaTunnelTask.createSourceFlowLifeCycle() [SourceSeaTunnelTask.java:113-135]
-> read sink.flush.interval [119]
-> create one SourceFlowLifeCycle per source subtask
timer registration
-> SourceFlowLifeCycle.startFlushTimer() [344-355]
-> register one scheduled task per source subtask
timer tick
-> SourceFlowLifeCycle.onTimerTick() [184-193]
-> build FlushSignal
-> collector.sendRecordToNext(...)
sink side
-> SinkFlowLifeCycle.received() [260-267]
-> each received FlushSignal runs writerContext.getFlushAction().run()
The remaining blocker is still:
- sink.flush.interval is currently multiplied by source parallelism on the normal fan-in path.
- Every source subtask owns its own timer.
- Every timer emits its own FlushSignal.
- The sink executes one flush per signal.
- So with N source subtasks feeding one sink writer, the sink can flush roughly N times per interval instead of once per sink-writer cadence.
Why I do not think #10717 removes this concern:
- #10717 explains why the engine should provide timed callbacks on the task thread.
- It does not prove that the timer owner must be the source side.
- The current code still makes the sink flush cadence depend on source topology, not sink-writer semantics.
That is why I still consider this a code-level blocker rather than just a design-style disagreement.
Suggested fix order:
- Preferred: move timer ownership to the sink writer / sink task boundary, so each sink writer owns one flush cadence.
- Smaller mitigation: keep source-side signal injection, but add sink-side dedup/throttling so each writer flushes at most once per interval window.
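To illustrate the smaller mitigation, here is a rough, self-contained sketch of a per-writer throttle. The class and the idea of consulting it from the sink-side signal handling are hypothetical, not code from this PR:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not part of this PR) sketching option B: throttle incoming
// flush signals so one sink writer flushes at most once per configured interval,
// no matter how many upstream source timers emit FlushSignal.
public class FlushThrottle {
    private final long intervalNanos;
    private long lastFlushNanos;

    public FlushThrottle(long intervalMillis) {
        this.intervalNanos = TimeUnit.MILLISECONDS.toNanos(intervalMillis);
        // Start one interval in the past so the very first signal is allowed through.
        this.lastFlushNanos = System.nanoTime() - this.intervalNanos;
    }

    /** Returns true if a flush should run now; signals inside the window are dropped. */
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        if (now - lastFlushNanos < intervalNanos) {
            return false;
        }
        lastFlushNanos = now;
        return true;
    }
}
```

The sink-side handler would check such a throttle before running the registered flush action, so N upstream signals per interval still produce at most one effective flush.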
Test gap that still matters:
- I still do not see a regression test for multiple source subtasks -> one sink writer, proving that one interval causes only one effective sink flush.
CI note:
- The current Apache-side Build is red, and the linked fork-run metadata was still moving when I checked. That said, my blocking conclusion here is code semantics, not CI status.
Conclusion: merge after fixes
Blocking items:
- Fix the cadence ownership problem so sink.flush.interval really means one sink-writer flush cadence, not one source-subtask signal cadence.
Non-blocking suggestion:
- Add the multi-source fan-in regression test once the cadence semantics are corrected.
Overall, I still like the direction and the STIP motivation. The remaining gap is specifically about where the timer semantics are anchored in the runtime path.
In the current operational topology of Zeta, [source=10, transform=4, sink=4] will generate source tasks=10, transform tasks=10, sink tasks=10, and there is no "multiple source subtasks -> one sink writer" fan-in relationship of the kind you assume. Therefore, sink.flush.interval will not be linearly amplified by the source parallelism. If there is a reproducible fan-in-to-single-writer DAG, please provide the specific job configuration before continuing the evaluation.
DanielLeens left a comment
Hi @nzw921rx, thanks for pushing back with the concrete topology example. I went back through the current Zeta planner/runtime chain end to end instead of continuing from my earlier assumption, and your latest point is correct: on the current topology I cannot reproduce the multiple source subtasks -> one shared sink writer fan-in path that my previous blocker relied on. So I am withdrawing that blocker.
I re-reviewed the latest head locally against upstream/dev as a source-level review. I did not run local Maven in this batch.
What this PR solves
- User pain: buffered sinks can hold visible data for too long in low-throughput or briefly idle streaming jobs.
- Fix approach: add an engine-level timer-flush capability. A source-side timer emits FlushSignal, transforms pass it through, and sinks execute writerContext.getFlushAction() when the writer explicitly registers one.
- One-line summary: this PR lands the Zeta core path for STIP-23 Phase 1 and moves periodic flush handling onto the engine task path instead of connector-owned background threads.
Runtime chain I re-verified
union / fan-in planning
-> PipelineGenerator.checkCanSplit() [PipelineGenerator.java:136-138]
-> union inputs enter split logic
-> PipelineGenerator.splitUnionVertex() [PipelineGenerator.java:140-183]
-> rebuilds downstream vertices per source pipeline
-> downstream parallelism follows the upstream pipeline branch
source -> sink physical plan
-> PhysicalPlanGenerator.getSourceTask() [PhysicalPlanGenerator.java:378-508]
-> creates one task group per source parallelism index
-> if sourceWithSink(flow)=true, calls splitSinkFromFlow(flow)
-> PhysicalPlanGenerator.splitSinkFromFlow() [PhysicalPlanGenerator.java:569-600]
-> rewrites sink behind IntermediateQueue
-> puts the queue -> sink subflow into the same source task group
-> TaskGroupWithIntermediateBlockingQueue.getQueueCache() [TaskGroupWithIntermediateBlockingQueue.java:49-70]
-> TaskGroupWithIntermediateDisruptor.getQueueCache() [TaskGroupWithIntermediateDisruptor.java:49-73]
-> queue/disruptor cache is task-group local, not shared across task groups
task runtime
-> SeaTunnelTask.stateProcess() STARTING -> RUNNING [SeaTunnelTask.java:217-221]
-> calls hook() on all cycles
-> SourceFlowLifeCycle.hook() [SourceFlowLifeCycle.java:218-220]
-> startFlushTimer() [344-355]
-> timer tick
-> SourceFlowLifeCycle.onTimerTick() [184-195]
-> collector.sendRecordToNext(new Record<>(FlushSignal))
-> SeaTunnelSourceCollector.sendRecordToNext() [SeaTunnelSourceCollector.java:191-196]
-> broadcasts under checkpointLock within the current task-group outputs
-> SeaTunnelTask.convertFlowToActionLifeCycle() [SeaTunnelTask.java:279-337]
-> queue -> sink subflow becomes IntermediateQueueFlowLifeCycle -> SinkFlowLifeCycle
-> TransformFlowLifeCycle.received(Signal) [TransformFlowLifeCycle.java:118-122]
-> passes Signal through unchanged
-> SinkFlowLifeCycle.received() [SinkFlowLifeCycle.java:260-267]
-> runs writerContext.getFlushAction().run() only for FlushSignal and only when a flush action is registered
Key findings
- After re-tracing the current planner/runtime chain, the earlier cadence-multiplied-by-source-parallelism blocker does not hit the reachable runtime topology of the current Zeta implementation.
- The sink flow is cloned into source task groups through splitSinkFromFlow(...), and the intermediate queue/disruptor instances are task-group local, so the current runtime is a paired path rather than a shared sink-writer fan-in path.
- The remaining issue I see is non-blocking and documentation-facing: the option/docs wording currently reads like sink.flush.interval is immediately effective for sinks in general, while the runtime only flushes when a sink writer explicitly registers flushAction.
Review conclusion
Conclusion: can merge
- Blocking items
- No code-level blocking issue remains in the current head after re-checking the real runtime topology.
- Suggested follow-up
- Please tighten the wording in:
- docs/en/introduction/configuration/JobEnvConfig.md:58-62
- docs/zh/introduction/configuration/JobEnvConfig.md:59-63
- seatunnel-api/src/main/java/org/apache/seatunnel/api/options/EnvCommonOptions.java:88-94
- Right now the runtime only flushes when SinkFlowLifeCycle.received() finds writerContext.getFlushAction() != null (SinkFlowLifeCycle.java:260-267), and the opt-in happens through SinkWriter.Context.registerFlushAction(...) (SinkWriter.java:122-143). Clarifying that this PR introduces the engine capability first, while connector adoption is follow-up work, would make the user-facing contract much clearer.
- CI note
- The current Apache-side Build check is red. The linked fork workflow run failed in connector-jdbc-e2e-ddl on JDK 8, specifically org.apache.seatunnel.connectors.jdbc.SqlServerSchemaChangeIT with expected: <hawk9821> but was: <scooter>.
- I checked the current upstream/dev changes in that JDBC E2E module path from this PR's merge-base to the latest head and did not find a corresponding fix there, so this does not currently look like a simple "rebase to pick up an existing dev-side fix" case.
- Because of that, I treat the red CI as a separate follow-up rather than a blocker on the code-review conclusion above.
Overall, the direction here is good, and after re-checking the real task-planning/runtime topology, I am comfortable with the core engine path in this PR. Thanks again for the detailed clarification in the discussion; it helped narrow the review back to the code paths that are actually reachable today.
@DanielLeens Thanks for the suggestion. Please see the Phase 3 (#10802) sink capability tables where timer flush support is listed per connector.
Thanks for the follow-up. Pointing me to

I rechecked the current head locally again before replying. There is still no new source delta after my earlier approval, so this is a follow-up discussion review rather than a re-review of changed code.

What this PR solves
Runtime chain I rechecked

Current conclusion after this follow-up
One remaining practical note
Conclusion: merge after fixes

Blocking item:
Non-blocking follow-up:
Overall, thanks for the clarification. The Phase 3 docs pointer answers the follow-up I had, and my code-review conclusion on this head stays the same.
Purpose of this pull request
Related: #10717
Phase 1 engine-core implementation for STIP-23 (Engine-Level Timer Flush for Sink Connectors). Full design: issue comment.
This PR builds the engine-side data path: timer scheduling → FlushSignal propagation → sink-side flush action invocation. No connector is modified — connector adoption (JDBC) follows in Phase 2.

Changes:
- Signal / FlushSignal API in seatunnel-api — control-plane signal abstraction
- sink.flush.interval (default 0 = disabled)
- SinkWriter.Context added registerFlushAction(Runnable) / getFlushAction() (default methods, zero impact on existing connectors)
- TaskExecutionService — timerFlushWorker pool + timer lifecycle management
- SourceFlowLifeCycle — timer registration/callback, injects FlushSignal via collector.sendFlushSignal() under checkpointLock
- TransformFlowLifeCycle — signal passthrough
- SinkFlowLifeCycle — invokes flushAction.run() on consume thread when FlushSignal arrives
- IntermediateBlockingQueue / RecordEventProducer — non-blocking delivery for signals (offer() / tryPublishEvent()), drop on backpressure
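As a rough sketch of the change listed above for SinkWriter.Context, the interface below shows what the new default methods could look like. It is illustrative only, not the exact code in this PR (a later commit reportedly switches the Runnable hook to a RunnableWithException-style type):

```java
import java.io.Serializable;

// Illustrative only; the real SinkWriter.Context in seatunnel-api has more methods,
// and the flush-action type may differ from plain Runnable.
public interface SinkWriterContextSketch extends Serializable {

    /** Optional opt-in hook; the default no-op keeps existing connectors source-compatible. */
    default void registerFlushAction(Runnable flushAction) {}

    /** The engine runs this when a FlushSignal reaches the sink; null means "not opted in". */
    default Runnable getFlushAction() {
        return null;
    }
}
```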
Does this PR introduce any user-facing change?

Yes, but no impact on existing jobs.
- sink.flush.interval (Long, default 0). Disabled by default. Only takes effect in the Zeta engine when a connector also registers a flush action.
- default methods on SinkWriter.Context — existing connectors require zero changes.
- docs/en and docs/zh JobEnvConfig pages.

How was this patch tested?
5 new unit test classes:
- TaskExecutionServiceTest — timer register/close/re-register lifecycle, parameter validation
- SeaTunnelSourceCollectorFlushSignalTest — signal broadcast to multiple outputs
- SinkWriterContextTest — flush action register/replace/null-check
- IntermediateBlockingQueueSignalTest — signal enqueue, backpressure drop, prepareClose drop, counter accuracy
- RecordEventProducerTest — Disruptor path: signal publish, RingBuffer-full drop, prepareClose behavior

E2E tests and the connector-level option enable_timer_flush will both be introduced in Phase 2 alongside JDBC connector adoption; neither is part of this PR.