Fix stage write resource leak on speculative execution kill by Zouxxyy · Pull Request #102 · aliyun/alibabacloud-hologres-connectors

Zouxxyy · 2026-06-25T09:47:11Z

Summary

Stage write abort() no longer flushes the broken Arrow channel, avoiding ClosedChannelException cascades
abort() actively drops the orphaned stage instead of relying on 7200s TTL expiration
Fix CopyInStageWrapper connectionHolder leak (never closed in any code path before)
HoloBatchWriter.abort() now cleans up stages on job-level failure

Background

When speculative execution kills a stage write task, the current abort() attempts to flush an already-interrupted Arrow channel, causing cascading exceptions. The created stage is never dropped, and the connectionHolder is never closed.

Executor log (sanitized):

16:40:40 INFO  Executor - Got assigned task 64411
16:40:40 INFO  Executor - Running task 21.1 in stage 166.0 (TID 64411)
16:40:40 INFO  JDBCUtil - create connection to holo
16:40:40 INFO  JDBCUtil - execute sql success: call hologres.hg_create_internal_stage(...)
16:40:40 INFO  Executor - Executor is trying to kill task 21.1 in stage 166.0 (TID 64411), reason: another attempt succeeded
16:40:40 ERROR Utils - Aborting task
java.nio.channels.ClosedByInterruptException
    at AbstractInterruptibleChannel.end(...)
    at ArrowWriter.writeBatch(...)
    at CopyInStageWrapper.putRecord(...)
    at BaseHoloDataCopyStageWriter.write(...)
16:40:40 ERROR DataWritingSparkTask - Aborting commit for partition 21 (task 64411, attempt 1, stage 166.0)
16:40:40 WARN  Utils - Suppressing exception in catch:
java.nio.channels.ClosedChannelException    ← abort() flush triggers again
    at CopyInStageWrapper.flush(...)
    at BaseHoloDataCopyStageWriter.close(...)
    at BaseHoloDataCopyStageWriter.abort(...)
16:40:41 WARN  Utils - Suppressing exception in finally:
java.nio.channels.ClosedChannelException    ← close() triggers again
    at CopyInStageWrapper.flush(...)
    at BaseHoloDataCopyStageWriter.close(...)
16:40:41 INFO  Executor - Executor interrupted and killed task 21.1, reason: another attempt succeeded

After fix:

abort() → stageWrapper.abort() (no flush, close resources) → dropStages (cleanup)
close() → stageWrapper already closed, return immediately (idempotent)

Test plan

Compile: mvn compile -pl hologres-connector-spark-3.x -am -DskipTests
Enable spark.speculation=true, run a skewed write-to-Hologres job
Verify no ClosedChannelException cascades in executor logs
Verify stage cleanup: SELECT * FROM hologres.hg_internal_stage_info shows no orphaned stages

🤖 Generated with Claude Code

When a speculative task is killed, abort() no longer flushes the broken Arrow channel. Instead it aborts the stageWrapper (close resources without flush), drops the orphaned stage, and fixes the connectionHolder leak. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…nabled In copyStageOnly mode, stages are intentionally preserved for external consumers. Batch-level abort should not drop successfully committed stages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Zouxxyy and others added 2 commits June 25, 2026 17:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix stage write resource leak on speculative execution kill#102

Fix stage write resource leak on speculative execution kill#102
Zouxxyy wants to merge 2 commits into
aliyun:masterfrom
Zouxxyy:fix/stage-write-abort-cleanup

Zouxxyy commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Zouxxyy commented Jun 25, 2026

Summary

Background

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant