pkg/executor: fix concurrent SessionVars.systems map access during ANALYZE#68465
pkg/executor: fix concurrent SessionVars.systems map access during ANALYZE#68465mjonss wants to merge 1 commit into
Conversation
…rs fan out Partition-analyze workers shared one SessionVars and each independently called GetSessionOrGlobalSystemVar(tidb_build_sampling_stats_concurrency) inside analyzeColumnsPushDown. The session map populated on first read is unsynchronised, so the lazy-populate from multiple workers raced and the Go runtime reported "concurrent map read and map write" as a fatal, killing the TiDB process during a normal auto-analyze. Resolve the value once on the main goroutine in AnalyzeExec.Next before workers spawn and thread it through AnalyzeColumnsExec.samplingStatsConcurrency. Workers now read a plain struct field. Add a regression test that fails if the resolution moves back onto a worker goroutine: a failpoint inside getBuildSamplingStatsConcurrency records the calling stack and the test asserts no call originates from (*AnalyzeExec).analyzeWorker, plus that the resolution still happens exactly once per ANALYZE statement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@mjonss I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details. ⏳ This process typically takes 10-30 minutes depending on the complexity of the changes. ℹ️ Learn more details on Pantheon AI. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughThis PR moves the computation of ChangesANALYZE sampling concurrency race fix
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Hi @mjonss. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
Pull request overview
Fixes a TiDB server crash ("fatal error: concurrent map read and map write") that occurred during ANALYZE on partitioned tables. The root cause was that multiple partition-analyze worker goroutines shared one SessionVars and each independently invoked GetSessionOrGlobalSystemVar(tidb_build_sampling_stats_concurrency) from analyzeColumnsPushDown, racing on the lazily-populated, unsynchronised SessionVars.systems map. The fix resolves the sysvar once on the main goroutine in AnalyzeExec.Next before workers fan out, and threads the value to workers via a new AnalyzeColumnsExec.samplingStatsConcurrency field.
Changes:
- Resolve
tidb_build_sampling_stats_concurrencyonce inAnalyzeExec.Nextand assign it to each task'scolExecbefore starting workers. - Add
samplingStatsConcurrencyfield toAnalyzeColumnsExecand consume it inanalyzeColumnsPushDowninstead of callinggetBuildSamplingStatsConcurrencyper worker. - Add failpoint
getBuildSamplingStatsConcurrencyCalledand a regression test asserting the sysvar is resolved exactly once and never from ananalyzeWorkergoroutine.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/executor/analyze.go | Resolves sampling concurrency once on the main goroutine and propagates it to each colExec task before worker fan-out. |
| pkg/executor/analyze_col.go | Adds samplingStatsConcurrency field on AnalyzeColumnsExec to carry the pre-resolved value to workers. |
| pkg/executor/analyze_col_sampling.go | Replaces in-worker sysvar lookup with the pre-resolved e.samplingStatsConcurrency field. |
| pkg/executor/analyze_utils.go | Adds a failpoint.InjectCall in getBuildSamplingStatsConcurrency to enable the regression test. |
| pkg/executor/analyze_test.go | Adds TestAnalyzeSamplingConcurrencyResolvedOffWorker verifying the sysvar is resolved exactly once and not from an analyzeWorker stack. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #68465 +/- ##
================================================
- Coverage 77.2761% 75.7178% -1.5584%
================================================
Files 2010 2008 -2
Lines 555473 563343 +7870
================================================
- Hits 429248 426551 -2697
- Misses 125305 136741 +11436
+ Partials 920 51 -869
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
/retest |
|
@mjonss: PRs from untrusted users cannot be marked as trusted with DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
| // one SessionVars and would race on its unsynchronised `systems` map. The | ||
| // value is resolved on the main goroutine in AnalyzeExec.Next before | ||
| // workers fan out and threaded through AnalyzeColumnsExec.samplingStatsConcurrency. | ||
| func TestAnalyzeSamplingConcurrencyResolvedOffWorker(t *testing.T) { |
There was a problem hiding this comment.
The change is pretty straightforward. I think this test is overkill. I guess the only thing we need to do is add an intest.Assert() in analyzeColumnsPushDown to make sure samplingStatsConcurrency is initialized.
What problem does this PR solve?
Issue Number: close #68457
Problem Summary:
Partition-analyze workers share one
SessionVarsand each independentlycalled
GetSessionOrGlobalSystemVar(tidb_build_sampling_stats_concurrency)inside
analyzeColumnsPushDown. The session'ssystemsmap is lazilypopulated on first read and is not protected by any lock, so when multiple
workers raced to fill the same entry the Go runtime reported
fatal error: concurrent map read and map writeand killed the TiDB server. The originalcrash was observed during auto-analyze of an 8-partition table.
What changed and how does it work?
Resolve
tidb_build_sampling_stats_concurrencyonce on the main goroutinein
AnalyzeExec.Next, before the partition workers fan out, and thread thevalue through a new
AnalyzeColumnsExec.samplingStatsConcurrencyfield.Workers now read a plain struct field, so they never touch
SessionVars.systems. The number of sysvar lookups per ANALYZE drops fromone per partition to one per statement.
Regression test: a failpoint inside
getBuildSamplingStatsConcurrencyrecords the calling goroutine's stack on every invocation. The test asserts
the call still happens (resolution is not silently dropped) and that no
call originates from
(*AnalyzeExec).analyzeWorker. Reintroducing theworker-side call would put
analyzeWorkeron the captured stack and failthe test.
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.