Skip to content

Fix stage write resource leak on speculative execution kill#102

Open
Zouxxyy wants to merge 2 commits into
aliyun:masterfrom
Zouxxyy:fix/stage-write-abort-cleanup
Open

Fix stage write resource leak on speculative execution kill#102
Zouxxyy wants to merge 2 commits into
aliyun:masterfrom
Zouxxyy:fix/stage-write-abort-cleanup

Conversation

@Zouxxyy

@Zouxxyy Zouxxyy commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Stage write abort() no longer flushes the broken Arrow channel, avoiding ClosedChannelException cascades
  • abort() actively drops the orphaned stage instead of relying on 7200s TTL expiration
  • Fix CopyInStageWrapper connectionHolder leak (never closed in any code path before)
  • HoloBatchWriter.abort() now cleans up stages on job-level failure

Background

When speculative execution kills a stage write task, the current abort() attempts to flush an already-interrupted Arrow channel, causing cascading exceptions. The created stage is never dropped, and the connectionHolder is never closed.

Executor log (sanitized):

16:40:40 INFO  Executor - Got assigned task 64411
16:40:40 INFO  Executor - Running task 21.1 in stage 166.0 (TID 64411)
16:40:40 INFO  JDBCUtil - create connection to holo
16:40:40 INFO  JDBCUtil - execute sql success: call hologres.hg_create_internal_stage(...)
16:40:40 INFO  Executor - Executor is trying to kill task 21.1 in stage 166.0 (TID 64411), reason: another attempt succeeded
16:40:40 ERROR Utils - Aborting task
java.nio.channels.ClosedByInterruptException
    at AbstractInterruptibleChannel.end(...)
    at ArrowWriter.writeBatch(...)
    at CopyInStageWrapper.putRecord(...)
    at BaseHoloDataCopyStageWriter.write(...)
16:40:40 ERROR DataWritingSparkTask - Aborting commit for partition 21 (task 64411, attempt 1, stage 166.0)
16:40:40 WARN  Utils - Suppressing exception in catch:
java.nio.channels.ClosedChannelException    ← abort() flush triggers again
    at CopyInStageWrapper.flush(...)
    at BaseHoloDataCopyStageWriter.close(...)
    at BaseHoloDataCopyStageWriter.abort(...)
16:40:41 WARN  Utils - Suppressing exception in finally:
java.nio.channels.ClosedChannelException    ← close() triggers again
    at CopyInStageWrapper.flush(...)
    at BaseHoloDataCopyStageWriter.close(...)
16:40:41 INFO  Executor - Executor interrupted and killed task 21.1, reason: another attempt succeeded

After fix:

abort() → stageWrapper.abort() (no flush, close resources) → dropStages (cleanup)
close() → stageWrapper already closed, return immediately (idempotent)

Test plan

  • Compile: mvn compile -pl hologres-connector-spark-3.x -am -DskipTests
  • Enable spark.speculation=true, run a skewed write-to-Hologres job
  • Verify no ClosedChannelException cascades in executor logs
  • Verify stage cleanup: SELECT * FROM hologres.hg_internal_stage_info shows no orphaned stages

🤖 Generated with Claude Code

Zouxxyy and others added 2 commits June 25, 2026 17:39
When a speculative task is killed, abort() no longer flushes the broken
Arrow channel. Instead it aborts the stageWrapper (close resources without
flush), drops the orphaned stage, and fixes the connectionHolder leak.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nabled

In copyStageOnly mode, stages are intentionally preserved for external
consumers. Batch-level abort should not drop successfully committed stages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant