pkg/executor: fix concurrent SessionVars.systems map access during ANALYZE #68465

Open
mjonss wants to merge 1 commit into pingcap:master from mjonss:analyze-concurrent-sysvar-68457

@mjonss
Contributor

@mjonss mjonss commented May 18, 2026

What problem does this PR solve?

Issue Number: close #68457

Problem Summary:

Partition-analyze workers share one SessionVars and each independently
called GetSessionOrGlobalSystemVar(tidb_build_sampling_stats_concurrency)
inside analyzeColumnsPushDown. The session's systems map is lazily
populated on first read and is not protected by any lock, so when multiple
workers raced to fill the same entry the Go runtime reported
"fatal error: concurrent map read and map write" and killed the TiDB server.
The original crash was observed during auto-analyze of an 8-partition table.
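
A minimal sketch of the lazy-populate pattern described above, assuming a simplified cache. The names `lazyVars`, `get`, and `loadGlobal` are hypothetical and are not TiDB's actual types or functions. It shows why a lookup that looks like a read can write to the map on a miss:

```go
package main

import "fmt"

// lazyVars is a hypothetical stand-in for a session-variable cache.
// It is not TiDB's SessionVars type.
type lazyVars struct {
	systems map[string]string // filled on first read, guarded by no lock
	loads   int               // number of fallbacks to the "global" source
}

// loadGlobal is a hypothetical global lookup that returns a fixed value.
func loadGlobal(name string) string {
	return "2"
}

// get returns the cached value, loading and caching it on a miss.
// The miss path writes to the map, so a "read" can mutate shared state.
func (v *lazyVars) get(name string) string {
	if val, ok := v.systems[name]; ok {
		return val
	}
	v.loads++
	val := loadGlobal(name)
	// Unsynchronised write: unsafe if another goroutine is inside get.
	v.systems[name] = val
	return val
}

func main() {
	v := &lazyVars{systems: map[string]string{}}
	// Sequential calls from one goroutine are safe: one load, then a hit.
	first := v.get("tidb_build_sampling_stats_concurrency")
	second := v.get("tidb_build_sampling_stats_concurrency")
	fmt.Println(first, second, v.loads)
}
```

Calling `get` from several goroutines at once without a lock is exactly the situation the Go runtime can abort on, which is why the sketch only calls it sequentially.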

What changed and how does it work?

Resolve tidb_build_sampling_stats_concurrency once on the main goroutine
in AnalyzeExec.Next, before the partition workers fan out, and thread the
value through a new AnalyzeColumnsExec.samplingStatsConcurrency field.
Workers now read a plain struct field, so they never touch
SessionVars.systems. The number of sysvar lookups per ANALYZE drops from
one per partition to one per statement.
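
The resolve-before-fan-out approach can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `colTask`, `resolveConcurrency`, and `runAnalyze` are hypothetical names, and the lookup returns a fixed value.

```go
package main

import (
	"fmt"
	"sync"
)

// colTask is a hypothetical stand-in for a per-partition analyze task.
type colTask struct {
	partition   int
	concurrency int // pre-resolved; workers only read it
}

// resolveConcurrency is a hypothetical stand-in for the sysvar lookup.
// It counts calls so the one-lookup-per-statement property is visible.
func resolveConcurrency(calls *int) int {
	*calls++
	return 2
}

// runAnalyze resolves the value once on the calling goroutine, copies it
// into every task, and only then starts the workers.
func runAnalyze(partitions int) (lookups int, seen []int) {
	c := resolveConcurrency(&lookups)
	tasks := make([]colTask, partitions)
	for i := range tasks {
		tasks[i] = colTask{partition: i, concurrency: c}
	}

	seen = make([]int, partitions)
	var wg sync.WaitGroup
	for i := range tasks {
		wg.Add(1)
		go func(t colTask) {
			defer wg.Done()
			// Each worker reads its own copy and writes its own slot.
			seen[t.partition] = t.concurrency
		}(tasks[i])
	}
	wg.Wait()
	return lookups, seen
}

func main() {
	lookups, seen := runAnalyze(8)
	fmt.Println(lookups, seen) // prints: 1 [2 2 2 2 2 2 2 2]
}
```

Because each worker receives the value by copy, no shared map is touched after the fan-out.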

Regression test: a failpoint inside getBuildSamplingStatsConcurrency
records the calling goroutine's stack on every invocation. The test asserts
the call still happens (resolution is not silently dropped) and that no
call originates from (*AnalyzeExec).analyzeWorker. Reintroducing the
worker-side call would put analyzeWorker on the captured stack and fail
the test.
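
The stack-capture idea behind the regression test can be approximated with the standard library alone. In this sketch the hook variable `onResolve` stands in for the failpoint callback, and `getConcurrency`, `analyzeWorker`, `callerNames`, and `check` are hypothetical names rather than the functions in the PR:

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"sync"
)

// onResolve is a hypothetical test hook standing in for the failpoint.
var onResolve func()

// getConcurrency is a stand-in for the instrumented lookup.
func getConcurrency() int {
	if onResolve != nil {
		onResolve()
	}
	return 2
}

// analyzeWorker is a stand-in worker. Looking the value up here is the
// regression the check is meant to catch.
func analyzeWorker(lookup bool) {
	if lookup {
		_ = getConcurrency()
	}
}

// callerNames returns the function names on the current goroutine's
// stack, starting at the caller of callerNames.
func callerNames() []string {
	pcs := make([]uintptr, 32)
	n := runtime.Callers(2, pcs)
	frames := runtime.CallersFrames(pcs[:n])
	var names []string
	for {
		f, more := frames.Next()
		names = append(names, f.Function)
		if !more {
			break
		}
	}
	return names
}

// check resolves once on the calling goroutine, runs one worker, and
// reports how many hook calls happened and how many had analyzeWorker
// on their stack.
func check(workerLooksUp bool) (calls, fromWorker int) {
	var mu sync.Mutex
	onResolve = func() {
		stack := strings.Join(callerNames(), "\n")
		mu.Lock()
		defer mu.Unlock()
		calls++
		if strings.Contains(stack, "analyzeWorker") {
			fromWorker++
		}
	}
	defer func() { onResolve = nil }()

	_ = getConcurrency() // resolution before the fan-out
	done := make(chan struct{})
	go func() {
		defer close(done)
		analyzeWorker(workerLooksUp)
	}()
	<-done
	return calls, fromWorker
}

func main() {
	fmt.Println(check(false)) // fixed behaviour
	fmt.Println(check(true))  // regression reintroduced
}
```

A test built on this would require `calls` to be non-zero, so a silently dropped resolution is caught, and `fromWorker` to be zero.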

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Fixed a TiDB server crash ("fatal error: concurrent map read and map write")
that could occur during ANALYZE on partitioned tables, including auto-analyze,
when worker goroutines concurrently resolved
tidb_build_sampling_stats_concurrency.

…rs fan out

Partition-analyze workers shared one SessionVars and each independently
called GetSessionOrGlobalSystemVar(tidb_build_sampling_stats_concurrency)
inside analyzeColumnsPushDown. The session map populated on first read is
unsynchronised, so the lazy-populate from multiple workers raced and the
Go runtime reported "concurrent map read and map write" as a fatal error,
killing the TiDB process during a normal auto-analyze.

Resolve the value once on the main goroutine in AnalyzeExec.Next before
workers spawn and thread it through AnalyzeColumnsExec.samplingStatsConcurrency.
Workers now read a plain struct field.

Add a regression test that fails if the resolution moves back onto a
worker goroutine: a failpoint inside getBuildSamplingStatsConcurrency
records the calling stack and the test asserts no call originates from
(*AnalyzeExec).analyzeWorker, plus that the resolution still happens
exactly once per ANALYZE statement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-triage-completed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels May 18, 2026
@pantheon-ai

pantheon-ai Bot commented May 18, 2026

@mjonss I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.

@ti-chi-bot

ti-chi-bot Bot commented May 18, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign time-and-fate for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 18, 2026
@coderabbitai

coderabbitai Bot commented May 18, 2026

📝 Walkthrough

This PR moves the computation of samplingStatsConcurrency from concurrent worker goroutines to a single pre-fanout step on the main goroutine, eliminating unsafe concurrent access to SessionVars.systems. The value is then pre-assigned to each column executor task before workers start, and test instrumentation verifies the fix works correctly.

Changes

ANALYZE sampling concurrency race fix

| Layer / File(s) | Summary |
| --- | --- |
| Move concurrency resolution to main goroutine (pkg/executor/analyze_col.go, pkg/executor/analyze.go, pkg/executor/analyze_col_sampling.go) | AnalyzeColumnsExec gains a samplingStatsConcurrency field. AnalyzeExec.Next computes this value once on the main goroutine before workers fan out and assigns it to each task. analyzeColumnsPushDown uses the pre-set value directly instead of computing concurrency on workers, removing the error path from concurrent computation. |
| Test instrumentation and verification (pkg/executor/analyze_utils.go, pkg/executor/analyze_test.go) | Failpoint import and hook added to getBuildSamplingStatsConcurrency for test interception. New test TestAnalyzeSamplingConcurrencyResolvedOffWorker installs a failpoint callback to capture stack traces and verify the function is called exactly once from the main goroutine, not from worker threads. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

sig/planner, component/statistics, size/M

Suggested reviewers

  • henrybw
  • terry1purcell
  • guo-shaoge

Poem

🐰 A race upon the goroutine,
Where systems maps clashed unseen,
Now resolved before fans spread,
One computation, safely thread-led! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: fixing concurrent access to SessionVars.systems map during ANALYZE operations by resolving sampling stats concurrency on the main goroutine. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #68457 by resolving tidb_build_sampling_stats_concurrency once on the main goroutine before workers spawn, eliminating the concurrent SessionVars.systems map access that caused the panic. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to fixing the concurrent map access issue: adding samplingStatsConcurrency field to AnalyzeColumnsExec, resolving it early in AnalyzeExec.Next, and using pre-resolved values in workers. Changes are focused and within scope. |
| Description check | ✅ Passed | The PR description follows the required template structure with all essential sections completed: Issue Number, Problem Summary, What Changed and How It Works, Check List (with Unit test marked), Side effects assessment, and Release note. |

@tiprow

tiprow Bot commented May 18, 2026

Hi @mjonss. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mjonss mjonss changed the title from "pkg/executor: resolve sampling stats count, to avoid concurrent map use" to "pkg/executor: fix concurrent SessionVars.systems map access during ANALYZE" May 18, 2026
@ti-chi-bot ti-chi-bot Bot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 18, 2026
@mjonss mjonss requested review from 0xPoe and Copilot May 18, 2026 11:09

Copilot AI left a comment

Pull request overview

Fixes a TiDB server crash ("fatal error: concurrent map read and map write") that occurred during ANALYZE on partitioned tables. The root cause was that multiple partition-analyze worker goroutines shared one SessionVars and each independently invoked GetSessionOrGlobalSystemVar(tidb_build_sampling_stats_concurrency) from analyzeColumnsPushDown, racing on the lazily-populated, unsynchronised SessionVars.systems map. The fix resolves the sysvar once on the main goroutine in AnalyzeExec.Next before workers fan out, and threads the value to workers via a new AnalyzeColumnsExec.samplingStatsConcurrency field.

Changes:

  • Resolve tidb_build_sampling_stats_concurrency once in AnalyzeExec.Next and assign it to each task's colExec before starting workers.
  • Add samplingStatsConcurrency field to AnalyzeColumnsExec and consume it in analyzeColumnsPushDown instead of calling getBuildSamplingStatsConcurrency per worker.
  • Add failpoint getBuildSamplingStatsConcurrencyCalled and a regression test asserting the sysvar is resolved exactly once and never from an analyzeWorker goroutine.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| pkg/executor/analyze.go | Resolves sampling concurrency once on the main goroutine and propagates it to each colExec task before worker fan-out. |
| pkg/executor/analyze_col.go | Adds samplingStatsConcurrency field on AnalyzeColumnsExec to carry the pre-resolved value to workers. |
| pkg/executor/analyze_col_sampling.go | Replaces in-worker sysvar lookup with the pre-resolved e.samplingStatsConcurrency field. |
| pkg/executor/analyze_utils.go | Adds a failpoint.InjectCall in getBuildSamplingStatsConcurrency to enable the regression test. |
| pkg/executor/analyze_test.go | Adds TestAnalyzeSamplingConcurrencyResolvedOffWorker verifying the sysvar is resolved exactly once and not from an analyzeWorker stack. |

@codecov

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 76.92308% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.7178%. Comparing base (8c17ce1) to head (7272ae0).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #68465        +/-   ##
================================================
- Coverage   77.2761%   75.7178%   -1.5584%     
================================================
  Files          2010       2008         -2     
  Lines        555473     563343      +7870     
================================================
- Hits         429248     426551      -2697     
- Misses       125305     136741     +11436     
+ Partials        920         51       -869     
Flag Coverage Δ
integration 41.5522% <76.9230%> (+1.7581%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 60.4679% <ø> (ø)
parser ∅ <ø> (∅)
br 49.9725% <ø> (-13.0354%) ⬇️

@mjonss
Contributor Author

mjonss commented May 18, 2026

/retest

@tiprow

tiprow Bot commented May 18, 2026

@mjonss: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@0xPoe 0xPoe requested a review from winoros May 18, 2026 12:39
Member

@0xPoe 0xPoe left a comment

Thanks!

rest LGTM

// one SessionVars and would race on its unsynchronised `systems` map. The
// value is resolved on the main goroutine in AnalyzeExec.Next before
// workers fan out and threaded through AnalyzeColumnsExec.samplingStatsConcurrency.
func TestAnalyzeSamplingConcurrencyResolvedOffWorker(t *testing.T) {
Member

The change is pretty straightforward. I think this test is overkill. I guess the only thing we need to do is add an intest.Assert() in analyzeColumnsPushDown to make sure samplingStatsConcurrency is initialized.
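
The assertion-based alternative suggested here might look roughly like the sketch below. It does not use TiDB's intest package: `assertInTest`, `inTest`, `colExec`, and `pushDown` are hypothetical stand-ins, and treating a zero value as "not initialized" is an assumption made for illustration.

```go
package main

import "fmt"

// inTest is a hypothetical flag standing in for a test-build switch.
var inTest = true

// assertInTest is a hypothetical stand-in for an in-test assertion helper.
// It panics only when inTest is set and the condition is false.
func assertInTest(cond bool, msg string) {
	if inTest && !cond {
		panic(msg)
	}
}

// colExec is a hypothetical executor carrying the pre-resolved value.
type colExec struct {
	samplingStatsConcurrency int
}

// pushDown is a stand-in for the worker-side entry point. It checks that
// the caller filled in the field instead of resolving it here.
// Assumption: a zero value means the field was never set.
func (e *colExec) pushDown() int {
	assertInTest(e.samplingStatsConcurrency > 0,
		"samplingStatsConcurrency not initialized")
	return e.samplingStatsConcurrency
}

// panics reports whether f panicked.
func panics(f func()) (p bool) {
	defer func() { p = recover() != nil }()
	f()
	return false
}

func main() {
	set := &colExec{samplingStatsConcurrency: 2}
	unset := &colExec{}
	fmt.Println(panics(func() { set.pushDown() }))   // prints: false
	fmt.Println(panics(func() { unset.pushDown() })) // prints: true
}
```

The trade-off is scope: an assertion like this catches a task that reaches the worker without the field set, while the stack-based test also catches a lookup moved back into the worker.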

Labels

release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

all TiDB instance restart with panic due to analyze race

3 participants