Skip to content

executor/join: fix HashJoinV2 Close hang on cancelled queries#68478

Draft
joechenrh wants to merge 1 commit into
pingcap:masterfrom
joechenrh:fix/hash-join-v2-close-hang
Draft

executor/join: fix HashJoinV2 Close hang on cancelled queries#68478
joechenrh wants to merge 1 commit into
pingcap:masterfrom
joechenrh:fix/hash-join-v2-close-hang

Conversation

@joechenrh
Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Issue Number: ref #TODO (will follow up with an issue link)

Problem Summary:

When a query that uses hash join v2 is killed or its connection is dropped while the build/probe goroutines are still running, HashJoinV2Exec.Close can hang forever. Pprof shows Close parked inside channel.Clear (the for range ch drain helper), waiting for a producer-side channel to be closed.

The root cause is that three send sites in the build/probe pipeline did not honor closeCh. When Close runs:

  1. close(closeCh) fires.
  2. Close serially drains buildFinished, joinResultCh, probeChkResourceCh, and each probeResultChs[i] via channel.Clear. Each drain only completes once the corresponding channel is closed.
  3. The channels are closed in producer goroutines' RunWithRecover defers — but the producer goroutines themselves were stuck on a send that did not select on closeCh. So Close never observes the close signal and parks indefinitely.

The three sites:

  • createTasks (pkg/executor/join/hash_join_v2.go): select on buildTaskCh <- task only watched doneCh. In TPC-DS-scale builds the channel (buffer size Concurrency) filled while workers were busy, so the send blocked indefinitely after close(closeCh).
  • BuildWorkerV2.buildHashTable (pkg/executor/join/hash_join_v2.go): for task := range taskCh had no closeCh short-circuit. A slow per-task build kept each worker alive through every queued task, delaying range's exit until the producer's deferred close(buildTaskCh) ran.
  • fetchProbeSideChunks (pkg/executor/join/hash_join_base.go): the final probeSideResource.dest <- probeSideResult had no closeCh case. Probe workers do honor closeCh on their receive select and exit eagerly, so the fetcher's send could deadlock on a probeResultChs[i] whose buffer (size 1) was full and whose consumer had already gone away.

What changed and how does it work?

For each of the three send sites, add a case <-closeCh branch (and, for the build worker, a closeCh short-circuit at the top of the per-task loop) so that producers exit promptly after Close is called. The existing sites that already honored closeCh (e.g. fetchBuildSideRows, controlWorkersForRestore, getProbeSideResource, wait4BuildSide) follow the same pattern; this PR brings the remaining three in line.

No behavioral change on the happy path — the new branches only fire after Close runs.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Added pkg/executor/join/close_cancel_test.go:

  • TestBuildWorkerExitsOnCloseCh — focused unit test for the build-worker fix. Pre-closes closeCh and queues several invalid tasks; with the fix the worker returns before touching hashTableContext (which would nil-deref without it).
  • TestHashJoinV2CloseUnblocksBuildPhase — drives a 50 K-row hash join with a new slowBuildTask failpoint set to 1 s per task; asserts Close returns within 2.5 s instead of waiting for every queued task.
  • TestHashJoinV2CloseUnblocksProbeFetcher — drives a probe pass with holdProbeFetcherBeforeSend set to 800 ms; asserts Close returns within 2.5 s instead of stalling on channel.Clear(probeResultChs[i]).

Two new failpoints (slowBuildTask, holdProbeFetcherBeforeSend) are test-only and no-op when not enabled, matching the surrounding slowWorkers/buildHashTablePanic/createTasksPanic style.

Local validation (./tools/check/failpoint-go-test.sh equivalent flow):

  • new tests: TestBuildWorkerExitsOnCloseCh, TestHashJoinV2CloseUnblocks* — PASS
  • existing closely related tests (no regression): TestKillDuringBuild, TestKillDuringProbe, TestCreateTasksPanic, TestBuildHashTablePanic, TestSplitPartitionPanic, TestProcessOneProbeChunkPanic — PASS

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

```release-note
Fix a deadlock where killing or dropping a connection running a hash-join-v2 query could hang in Close because some build/probe goroutines did not honor the executor's cancel signal.
```

HashJoinV2Exec.Close serially drains buildFinished, joinResultCh and
each probeResultChs[i] via channel.Clear. Each drain waits for the
producer goroutine to close that channel, which only happens once the
producer exits. Three send sites missed the closeCh signal and could
keep a producer alive forever after close(closeCh) fired:

1. createTasks (hash_join_v2.go:1228, 1251) only selected on doneCh
   while sending to a buildTaskCh whose buffer == Concurrency. Under a
   large build (e.g. TPC-DS) the channel filled while workers were busy,
   and Close could not unblock the send.
2. BuildWorkerV2.buildHashTable looped over taskCh with no closeCh
   short-circuit, so a slow per-task build kept the worker alive through
   every queued task before the deferred close on buildTaskCh fired.
3. fetchProbeSideChunks dropped its result into probeSideResource.dest
   with no closeCh case. Probe workers honor closeCh on their receive
   select and may exit before draining, leaving the send to block on a
   probeResultChs[i] that nobody will close.

Add a closeCh branch at each send (and an early-exit check at the top
of buildHashTable's loop) so Close can interrupt build and probe
producers reliably.

Tests:
- close_cancel_test.go::TestBuildWorkerExitsOnCloseCh is a focused
  regression for fix (2): with closeCh pre-closed it exits without
  touching hashTableContext, which would nil-deref without the check.
- close_cancel_test.go::TestHashJoinV2CloseUnblocksBuildPhase and
  TestHashJoinV2CloseUnblocksProbeFetcher cover fixes (1)+(2) and (3)
  end-to-end via two new test-only failpoints (slowBuildTask and
  holdProbeFetcherBeforeSend) and assert Close returns within a bounded
  time even when producers would otherwise hang.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented May 19, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented May 19, 2026

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented May 19, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.

@ti-chi-bot ti-chi-bot Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 19, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a99dcc66-03cc-4c95-9a8c-25583ec85bb6

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 19, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented May 19, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign xuhuaiyu for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tiprow
Copy link
Copy Markdown

tiprow Bot commented May 19, 2026

Hi @joechenrh. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/needs-linked-issue do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant