Skip to content

*: refactor traceevent to make some concept clear#65562

Open
tiancaiamao wants to merge 27 commits into
pingcap:masterfrom
tiancaiamao:trace-buf
Open

*: refactor traceevent to make some concept clear#65562
tiancaiamao wants to merge 27 commits into
pingcap:masterfrom
tiancaiamao:trace-buf

Conversation

@tiancaiamao

@tiancaiamao tiancaiamao commented Jan 13, 2026

Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Issue Number: ref #64008

Problem Summary:

The code in pkg/util/traceevent are quite messy now. I promise the refactor and here it is.

What changed and how does it work?

In the past, there are a lot concepts like Sink, Trace, FlightRecorder ...
This refactor try to make it clearer.

The value binded with context is a TraceBuf, all the trace events are temporary stored inside it.
Whether to flush or discard it is decided by FlightRecorder's configuration, in CheckFlightRecorderDumpTrigger

  • traceevent.NewTrace is renamed to traceevent.NewTraceBuf
  • tracing.WithFlightRecorder is replaced with traceevent.WithTraceBuf (much more reasonable now)
  • TraceID is stored in TraceBuf directly (previously it has random bits + others, stored in both context.Context and Trace struct)
  • tracing.IsEnabled delegates to traceevent.IsEnabled (here a bug is fixed, there are two IsEnabled implementations and inconsistent after util/traceevent: make flaky test TestFlightRecorder stable #65317)

Check List

Tests

  • Unit test

  • Integration test

  • Manual test (add detailed scripts or steps below)

  • No need to test

    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Summary by CodeRabbit

  • Refactor
    • Switched internal tracing to a buffered trace mechanism for more consistent collection and flushing.
    • Improved trace ID propagation so DDL, scheduling, session, and transaction events carry and persist trace identifiers reliably.
    • Updated test and diagnostic tooling to use the new trace-buffer flow, enhancing trace visibility and debugging fidelity.

@ti-chi-bot ti-chi-bot Bot added the release-note-none Denotes a PR that doesn't merit a release note. label Jan 13, 2026
@tiancaiamao tiancaiamao requested a review from ekexium January 13, 2026 12:20
@ti-chi-bot ti-chi-bot Bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jan 13, 2026
@tiancaiamao tiancaiamao requested a review from Copilot January 13, 2026 12:20
@tiprow

tiprow Bot commented Jan 13, 2026

Copy link
Copy Markdown

Hi @tiancaiamao. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the traceevent package to clarify naming and relationships between trace event concepts. The refactoring addresses inconsistencies that were introduced in PR #65317 where two different IsEnabled implementations existed.

Changes:

  • Renamed Trace struct to TraceBuf to better reflect its purpose as a buffer for trace events
  • Replaced tracing.WithFlightRecorder with traceevent.WithTraceBuf for cleaner API semantics
  • Consolidated IsEnabled implementations by making tracing.IsEnabled delegate to traceevent.IsEnabled
  • Moved TraceID storage directly into TraceBuf instead of context and removed intermediary functions
  • Removed the rand32 field from TraceBuf, simplifying the trace ID generation logic

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
pkg/util/tracing/util.go Renamed Sink to TraceBuf interface, delegated IsEnabled to traceevent package, removed ExtractTraceID function
pkg/util/traceevent/traceevent.go Removed TraceIDFromContext and ContextWithTraceID functions, updated TraceEvent to use traceBuf.TraceID, simplified GenerateTraceID
pkg/util/traceevent/flightrecorder.go Renamed Trace to TraceBuf, added TraceID field, removed rand32 field, updated all methods accordingly
pkg/util/traceevent/adapter.go Updated to use getTraceBuf helper instead of tracing.GetSink
pkg/util/traceevent/test/integration_test.go Updated test calls from NewTrace/WithFlightRecorder to NewTraceBuf/WithTraceBuf
pkg/util/traceevent/adapter_test.go Updated all test cases to use NewTraceBuf and WithTraceBuf
pkg/session/session.go Updated ExecuteInternal to use GetTraceBuf, NewTraceBuf, and WithTraceBuf
pkg/server/conn.go Updated connection handling to use NewTraceBuf and WithTraceBuf
pkg/dxf/framework/taskexecutor/task_executor.go Updated to use NewTraceBuf and WithTraceBuf
pkg/dxf/framework/taskexecutor/manager.go Updated Manager to use TraceBuf type and WithTraceBuf
pkg/dxf/framework/scheduler/scheduler_manager.go Updated scheduler loops to use NewTraceBuf and WithTraceBuf
pkg/dxf/framework/scheduler/nodes.go Updated node refresh loop to use NewTraceBuf and WithTraceBuf
pkg/ddl/notifier/subscribe.go Updated DDL notifier to use NewTraceBuf and WithTraceBuf
pkg/ddl/job_scheduler.go Updated to use TraceBuf type and set TraceID directly instead of via context
pkg/ddl/executor.go Updated to extract TraceID directly from TraceBuf instead of using TraceIDFromContext

Comment thread pkg/util/tracing/util.go Outdated
Comment thread pkg/ddl/job_scheduler.go Outdated
Comment thread pkg/util/traceevent/traceevent.go Outdated
Comment thread pkg/ddl/executor.go Outdated
Comment thread pkg/util/traceevent/flightrecorder.go Outdated
@codecov

codecov Bot commented Jan 13, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 55.24476% with 64 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.2305%. Comparing base (ee3439d) to head (98d7c39).
⚠️ Report is 341 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #65562        +/-   ##
================================================
- Coverage   77.7488%   77.2305%   -0.5184%     
================================================
  Files          1959       1942        -17     
  Lines        543393     543407        +14     
================================================
- Hits         422482     419676      -2806     
- Misses       120070     123727      +3657     
+ Partials        841          4       -837     
Flag Coverage Δ
integration 40.9832% <55.2447%> (+4.8084%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 61.5065% <ø> (ø)
parser ∅ <ø> (∅)
br 48.8605% <ø> (-12.1534%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@tiancaiamao

Copy link
Copy Markdown
Contributor Author

/retest

@tiprow

tiprow Bot commented Jan 15, 2026

Copy link
Copy Markdown

@tiancaiamao: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copilot AI review requested due to automatic review settings January 19, 2026 06:46

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings January 19, 2026 07:51

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

Comment thread pkg/util/traceevent/flightrecorder.go Outdated
zap.Stringer("category", categories),
zap.Any("mapping", compiled.nameMapping),
zap.Uint64s("truthTable", ret.truthTable))

Copilot AI Jan 19, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This blank line appears to be an unintentional addition that creates inconsistent spacing in the code. The line immediately following is a function call that logically belongs together with the preceding log statement without extra spacing.

Copilot uses AI. Check for mistakes.
@tiancaiamao

Copy link
Copy Markdown
Contributor Author

/retest

@tiprow

tiprow Bot commented Jan 20, 2026

Copy link
Copy Markdown

@tiancaiamao: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

Comment thread pkg/util/traceevent/flightrecorder.go Outdated
@tiancaiamao

Copy link
Copy Markdown
Contributor Author

/retest

@tiprow

tiprow Bot commented Jan 22, 2026

Copy link
Copy Markdown

@tiancaiamao: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings January 23, 2026 02:30

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings January 23, 2026 04:03

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.

Comment thread pkg/util/tracing/util.go
traceBuf := tmp.(TraceBuf)
event := Event{
Category: General,
Name: regionType,

Copilot AI Jan 23, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Event created in StartRegion is missing the TraceID field. Unlike TraceEvent in traceevent.go which sets TraceID from traceBuf.GetTraceID(), this event has an uninitialized TraceID field. This creates inconsistency in event data - some events will have TraceID and others won't. Consider adding the TraceID field by getting it from the TraceBuf interface (if GetTraceID method is exposed) or passing it as a parameter.

Suggested change
Name: regionType,
Name: regionType,
TraceID: traceBuf.GetTraceID(),

Copilot uses AI. Check for mistakes.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

// Record to flight recorder if enabled (base or full mode).
if recorderEnabled.Load() {
// TODO: clean up here
if recorder := FlightRecorder(); recorder != nil {

Copilot AI Jan 23, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TODO comment says "clean up here" but doesn't provide context about what needs to be cleaned up. Given that this PR is about refactoring to make concepts clearer, consider either removing this TODO if it's no longer relevant, or expanding it to be more specific about what cleanup is needed (e.g., "TODO: consolidate duplicate recording to traceBuf and FlightRecorder" or "TODO: remove global FlightRecorder after TraceBuf consolidation is complete").

Copilot uses AI. Check for mistakes.
Comment thread pkg/util/tracing/util.go Outdated
if tmp := GetSink(ctx); tmp != nil {
sink := tmp.(Sink)
if tmp := GetTraceBuf(ctx); tmp != nil {
traceBuf := tmp.(TraceBuf)

Copilot AI Jan 23, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type assertion tmp.(TraceBuf) will panic if it fails. Consider using the comma-ok idiom for safer type assertion: traceBuf, ok := tmp.(TraceBuf); if !ok { return ret }. This pattern is already used in other places in the codebase (e.g., pkg/ddl/executor.go:6809).

Suggested change
traceBuf := tmp.(TraceBuf)
traceBuf, ok := tmp.(TraceBuf)
if !ok {
return ret
}

Copilot uses AI. Check for mistakes.
@tiancaiamao

Copy link
Copy Markdown
Contributor Author

/retest

@tiprow

tiprow Bot commented Jan 23, 2026

Copy link
Copy Markdown

@tiancaiamao: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Comment on lines 255 to 259
if recorderEnabled.Load() {
// TODO: clean up here
if recorder := FlightRecorder(); recorder != nil {
recorder.Record(ctx, event)
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we delete this part now?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is already big enough and we'd better clean up step by step

@ti-chi-bot

ti-chi-bot Bot commented Feb 4, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ekexium
Once this PR has been reviewed and has the lgtm label, please assign benjamin2037 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Feb 4, 2026
@ti-chi-bot

ti-chi-bot Bot commented Feb 4, 2026

Copy link
Copy Markdown

[LGTM Timeline notifier]

Timeline:

  • 2026-02-04 11:45:34.278583503 +0000 UTC m=+267405.379982224: ☑️ agreed by ekexium.

@coderabbitai

coderabbitai Bot commented Mar 2, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c683219c-833e-4edc-9828-b82727641632

📥 Commits

Reviewing files that changed from the base of the PR and between 0c9b064 and 98d7c39.

📒 Files selected for processing (3)
  • pkg/ddl/BUILD.bazel
  • pkg/ddl/executor.go
  • pkg/ddl/job_scheduler.go
✅ Files skipped from review due to trivial changes (2)
  • pkg/ddl/BUILD.bazel
  • pkg/ddl/executor.go

📝 Walkthrough

Walkthrough

Migrates tracing from a Flight Recorder-based system to a TraceBuf-based approach. Introduces TraceBuf, context binding (WithTraceBuf/GetTraceBuf), changes trace ID generation/propagation, and updates many components (DDL, session, server, DXF scheduler, task executor) and tests to use TraceBuf.

Changes

Cohort / File(s) Summary
Tracing Core
pkg/util/traceevent/flightrecorder.go, pkg/util/traceevent/traceevent.go
Add TraceBuf type, NewTraceBuf/WithTraceBuf/GetTraceBuf, Get/SetTraceID; replace global flight recorder sink usage with per-context TraceBuf and update TraceEvent flow and Sink handling.
Tracing Utilities
pkg/util/tracing/util.go, pkg/util/tracing/BUILD.bazel
Introduce TraceBuf abstraction in tracing util, replace sink/flight-recorder pathways with TraceBuf-based APIs, adjust Region/StartRegion flows, remove category globals; update build deps.
DDL & Scheduler
pkg/ddl/executor.go, pkg/ddl/job_scheduler.go, pkg/ddl/notifier/subscribe.go, pkg/ddl/BUILD.bazel
Switch DDL tracing to TraceBuf: extract/set trace IDs from TraceBuf, change getJobRunCtx signature to accept *TraceBuf, create/discard TraceBuf per job, add client-go trace propagation where needed.
DXF Framework & Task Executor
pkg/dxf/framework/scheduler/..., pkg/dxf/framework/taskexecutor/...
Replace NewTrace/WithFlightRecorder with NewTraceBuf/WithTraceBuf across scheduler and task executor; change Manager.trace field type to *TraceBuf.
Session & Transaction
pkg/session/session.go, pkg/session/txnmanager.go
Use GetTraceBuf in ExecuteInternal; remove ctx param from GenerateTraceID; add txnManager.traceCtx to carry per-transaction trace context and ensure safe fallbacks.
Server
pkg/server/conn.go
Replace flight-recorder trace init/wiring with TraceBuf-based init and use TraceBuf.DiscardOrFlush.
Adapter & Tests
pkg/util/traceevent/adapter.go, pkg/util/traceevent/*_test.go, pkg/util/traceevent/test/integration_test.go, pkg/util/tracing/util_test.go
Update adapter to use GetTraceBuf; rewrite/augment tests to use NewTraceBuf/WithTraceBuf and new helpers (waitForTraceID, drainEvents, etc.); add TestStartRegionCarriesTraceID and related test utilities.
Build files
pkg/util/tracing/BUILD.bazel, pkg/ddl/BUILD.bazel
Adjust dependencies: removed some tracing deps, added client-go trace dep for propagation in DDL.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Session
  participant TraceBuf as TraceBuf (ctx)
  participant DDL
  participant JobScheduler
  participant TiKV as TiKV client-go

  Client->>Session: Execute statement (ctx)
  Session->>TraceBuf: GetTraceBuf(ctx) / NewTraceBuf()
  Session->>TraceBuf: SetTraceID(...)
  Session->>DDL: schedule DDL job (ctx with TraceBuf)
  DDL->>JobScheduler: deliver job (attach TraceBuf)
  JobScheduler->>TraceBuf: trace.SetTraceID(...)
  JobScheduler->>TiKV: clientgotrace.ContextWithTraceID(ctx) -> TiKV request
  JobScheduler->>TraceBuf: DiscardOrFlush(ctx)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested labels

ok-to-test

Suggested reviewers

  • yudongusa

Poem

🐰 I hopped through code with whiskers bright,
Swapped recorders for buffers — what a sight!
TraceIDs tucked in soft TraceBuf beds,
From session to DDL my tiny feet sped.
Hop, celebrate — the traces flow just right!

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 46.34% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title is vague and generic, using non-descriptive terms like 'refactor' and 'make some concept clear' that don't convey specific meaningful information about the main changes. Consider a more specific title that highlights the primary change, such as 'traceevent: refactor Trace to TraceBuf and clarify API concepts' or similar.
✅ Passed checks (1 passed)
Check name Status Explanation
Description check ✅ Passed The description is largely complete with problem statement, changes explanation, and checklist, though it omits the Issue Number heading format and lacks detailed manual test steps despite claiming manual testing was performed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

Command failed


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/session/txnmanager.go (1)

157-253: ⚠️ Potential issue | 🟡 Minor

Please add/confirm targeted tests for txn lifecycle trace-context handoff.

This change alters session txn lifecycle behavior (EnterNewTxn/OnStmtStart/OnTxnEnd) and should be covered with targeted package tests and SQL integration scenarios.

# Suggested local verification commands
go test ./pkg/session -run 'Test.*Txn.*|Test.*TxnManager.*' -tags=intest,deadlock
# plus your repo's SQL integration suite for BEGIN/COMMIT/ROLLBACK and implicit txn paths

As per coding guidelines, pkg/session/**: "For session variables or protocol behavior changes, run targeted package tests plus SQL integration tests for user-visible behavior".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/session/txnmanager.go` around lines 157 - 253, Add targeted tests that
verify trace-context handoff across the txn lifecycle implemented in txnManager:
specifically test EnterNewTxn, OnStmtStart, and OnTxnEnd to ensure traceCtx set
in EnterNewTxn is propagated to traceevent.TraceEvent calls and that
OnStmtStart/OnTxnEnd behave correctly when ctxProvider is nil or present; create
unit tests in pkg/session that instantiate a txnManager (or use exported
helpers) and assert traceevent.TraceEvent is invoked with the expected
context/fields and that errors are returned when ctxProvider is missing, and add
integration SQL tests exercising BEGIN/COMMIT/ROLLBACK and implicit txn paths to
cover real session behavior (refer to txnManager.traceCtx,
txnManager.ctxProvider, EnterNewTxn, OnStmtStart, OnTxnEnd when writing
assertions).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/ddl/executor.go`:
- Around line 6767-6778: The code stores traceBuf.GetTraceID() directly into
job.TraceInfo.TraceID causing aliasing; change the assignment in the block that
builds job := jobW.Job so you copy/clone the returned byte slice from
traceBuf.GetTraceID() (traceBuf, GetTraceID, TraceBuf, SetTraceID, Reset,
job.TraceInfo.TraceID) into a new slice before assigning it to
job.TraceInfo.TraceID to avoid later mutations or resets of the TraceBuf
invalidating the stored value.

In `@pkg/ddl/job_scheduler.go`:
- Around line 569-575: The getJobRunCtx function currently dereferences
traceInfo.TraceID unconditionally; update getJobRunCtx to guard access by first
checking that traceInfo != nil before reading TraceID (e.g., change the
condition to if traceInfo != nil && len(traceInfo.TraceID) > 0) and only then
call trace.SetTraceID and clientgotrace.ContextWithTraceID to build newCtx; this
prevents panics when jobW.TraceInfo is nil (deliveryJob may pass nil) while
preserving the existing behavior of jobContext and trace handling.

In `@pkg/session/session.go`:
- Around line 1835-1838: The current guard uses tracing.GetTraceBuf(ctx) to
decide whether to create a fallback trace buffer, but that can be fooled by a
mismatched context value; change the check to ensure the value is the typed
*traceevent.TraceBuf. Specifically, replace the nil check with a typed
lookup/assertion (e.g. tb, ok :=
traceevent.GetTraceBuf(ctx).(*traceevent.TraceBuf); if !ok || tb == nil { ... })
so that NewTraceBuf/WithTraceBuf/DiscardOrFlush only run when no real
*traceevent.TraceBuf is present.

In `@pkg/util/traceevent/flightrecorder.go`:
- Around line 39-52: GetTraceID and SetTraceID currently expose and store the
same backing slice (t.traceID) allowing callers to mutate the slice across API
boundaries; update GetTraceID to return a copy of t.traceID and update
SetTraceID to store a copy of the incoming slice (while still using t.mu for
RLock/Lock around reads/writes). Locate the methods TraceBuf.GetTraceID and
TraceBuf.SetTraceID and ensure both create new slices and copy contents when
reading or assigning t.traceID to prevent external mutation and cross-goroutine
races.

In `@pkg/util/traceevent/traceevent.go`:
- Around line 237-253: The current instant event path still obtains a TraceBuf
and calls traceBuf.Record even when tracing is disabled (ModeOff), causing
unnecessary overhead; update the logic in the function that builds Event (using
GetTraceBuf, Event, copyFieldsWithCapacity) to check the global/ctx tracing mode
(ModeOff) before allocating/copying fields and before calling traceBuf.Record,
and short-circuit/return early when tracing is off so no Event is constructed or
recorded.
- Around line 537-540: The log currently prints malformed trace IDs as raw
string bytes using string(event.TraceID) (see the branch that calls
logutil.BgLogger().Info and the zap.String("trace_id", ...)); change it to log
the hex representation by using hex.EncodeToString(event.TraceID) instead so
non-printable bytes are rendered consistently, and add the encoding/hex import
if missing.

---

Outside diff comments:
In `@pkg/session/txnmanager.go`:
- Around line 157-253: Add targeted tests that verify trace-context handoff
across the txn lifecycle implemented in txnManager: specifically test
EnterNewTxn, OnStmtStart, and OnTxnEnd to ensure traceCtx set in EnterNewTxn is
propagated to traceevent.TraceEvent calls and that OnStmtStart/OnTxnEnd behave
correctly when ctxProvider is nil or present; create unit tests in pkg/session
that instantiate a txnManager (or use exported helpers) and assert
traceevent.TraceEvent is invoked with the expected context/fields and that
errors are returned when ctxProvider is missing, and add integration SQL tests
exercising BEGIN/COMMIT/ROLLBACK and implicit txn paths to cover real session
behavior (refer to txnManager.traceCtx, txnManager.ctxProvider, EnterNewTxn,
OnStmtStart, OnTxnEnd when writing assertions).

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a171feb and 0c9b064.

📒 Files selected for processing (20)
  • pkg/ddl/BUILD.bazel
  • pkg/ddl/executor.go
  • pkg/ddl/job_scheduler.go
  • pkg/ddl/notifier/subscribe.go
  • pkg/dxf/framework/scheduler/nodes.go
  • pkg/dxf/framework/scheduler/scheduler_manager.go
  • pkg/dxf/framework/taskexecutor/manager.go
  • pkg/dxf/framework/taskexecutor/task_executor.go
  • pkg/server/conn.go
  • pkg/session/session.go
  • pkg/session/txnmanager.go
  • pkg/util/traceevent/adapter.go
  • pkg/util/traceevent/adapter_test.go
  • pkg/util/traceevent/flightrecorder.go
  • pkg/util/traceevent/test/integration_test.go
  • pkg/util/traceevent/traceevent.go
  • pkg/util/traceevent/traceevent_test.go
  • pkg/util/tracing/BUILD.bazel
  • pkg/util/tracing/util.go
  • pkg/util/tracing/util_test.go
💤 Files with no reviewable changes (1)
  • pkg/util/tracing/BUILD.bazel

Comment thread pkg/ddl/executor.go
Comment thread pkg/ddl/job_scheduler.go
Comment thread pkg/session/session.go Outdated
Comment on lines 1835 to 1838
if sink := tracing.GetTraceBuf(ctx); sink == nil {
trace := traceevent.NewTraceBuf()
ctx = traceevent.WithTraceBuf(ctx, trace)
defer trace.DiscardOrFlush(ctx)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use typed TraceBuf lookup for the fallback guard.

Line 1835 checks raw context presence, not typed *traceevent.TraceBuf. If a mismatched value is present, fallback creation is skipped and events can be dropped.

🔧 Proposed fix
 func (s *session) ExecuteInternal(ctx context.Context, sql string, args ...any) (rs sqlexec.RecordSet, err error) {
-	if sink := tracing.GetTraceBuf(ctx); sink == nil {
+	if traceevent.GetTraceBuf(ctx) == nil {
 		trace := traceevent.NewTraceBuf()
 		ctx = traceevent.WithTraceBuf(ctx, trace)
 		defer trace.DiscardOrFlush(ctx)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if sink := tracing.GetTraceBuf(ctx); sink == nil {
trace := traceevent.NewTraceBuf()
ctx = traceevent.WithTraceBuf(ctx, trace)
defer trace.DiscardOrFlush(ctx)
if traceevent.GetTraceBuf(ctx) == nil {
trace := traceevent.NewTraceBuf()
ctx = traceevent.WithTraceBuf(ctx, trace)
defer trace.DiscardOrFlush(ctx)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/session/session.go` around lines 1835 - 1838, The current guard uses
tracing.GetTraceBuf(ctx) to decide whether to create a fallback trace buffer,
but that can be fooled by a mismatched context value; change the check to ensure
the value is the typed *traceevent.TraceBuf. Specifically, replace the nil check
with a typed lookup/assertion (e.g. tb, ok :=
traceevent.GetTraceBuf(ctx).(*traceevent.TraceBuf); if !ok || tb == nil { ... })
so that NewTraceBuf/WithTraceBuf/DiscardOrFlush only run when no real
*traceevent.TraceBuf is present.

Comment on lines +39 to +52
// GetTraceID gets the traceID field.
func (t *TraceBuf) GetTraceID() []byte {
t.mu.RLock()
traceID := t.traceID
t.mu.RUnlock()
return traceID
}

// SetTraceID sets the traceID field.
func (t *TraceBuf) SetTraceID(traceID []byte) {
t.mu.Lock()
t.traceID = traceID
t.mu.Unlock()
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid sharing mutable traceID slices across API boundaries.

Lines 42 and 50 expose/store the same backing slice. External mutation can corrupt buffered trace IDs and create cross-goroutine races.

🛡️ Proposed fix
 func (t *TraceBuf) GetTraceID() []byte {
 	t.mu.RLock()
-	traceID := t.traceID
+	traceID := slices.Clone(t.traceID)
 	t.mu.RUnlock()
 	return traceID
 }
 
 // SetTraceID sets the traceID field.
 func (t *TraceBuf) SetTraceID(traceID []byte) {
 	t.mu.Lock()
-	t.traceID = traceID
+	t.traceID = slices.Clone(traceID)
 	t.mu.Unlock()
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/util/traceevent/flightrecorder.go` around lines 39 - 52, GetTraceID and
SetTraceID currently expose and store the same backing slice (t.traceID)
allowing callers to mutate the slice across API boundaries; update GetTraceID to
return a copy of t.traceID and update SetTraceID to store a copy of the incoming
slice (while still using t.mu for RLock/Lock around reads/writes). Locate the
methods TraceBuf.GetTraceID and TraceBuf.SetTraceID and ensure both create new
slices and copy contents when reading or assigning t.traceID to prevent external
mutation and cross-goroutine races.

Comment on lines +237 to 253
traceBuf := GetTraceBuf(ctx)
if traceBuf == nil {
return
}

// Defensive copy prevents corruption from caller's buffer reuse.
// Reserve extra capacity for LogSink to append metadata without allocation.
event := Event{
Category: category,
Name: name,
Phase: tracing.PhaseInstant,
Timestamp: time.Now(),
TraceID: TraceIDFromContext(ctx),
TraceID: traceBuf.GetTraceID(),
Fields: copyFieldsWithCapacity(fields, 3),
}
traceBuf.Record(ctx, event)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

ModeOff can still buffer events into TraceBuf.

This path records into TraceBuf even when both recorder/logging are disabled, which conflicts with the ModeOff contract and still incurs tracing overhead.

Suggested fix
 func TraceEvent(ctx context.Context, category TraceCategory, name string, fields ...zap.Field) {
 	if !IsEnabled(category) {
 		return
 	}
+	if !recorderEnabled.Load() && !loggingEnabled.Load() {
+		return
+	}
 
 	traceBuf := GetTraceBuf(ctx)
 	if traceBuf == nil {
 		return
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
traceBuf := GetTraceBuf(ctx)
if traceBuf == nil {
return
}
// Defensive copy prevents corruption from caller's buffer reuse.
// Reserve extra capacity for LogSink to append metadata without allocation.
event := Event{
Category: category,
Name: name,
Phase: tracing.PhaseInstant,
Timestamp: time.Now(),
TraceID: TraceIDFromContext(ctx),
TraceID: traceBuf.GetTraceID(),
Fields: copyFieldsWithCapacity(fields, 3),
}
traceBuf.Record(ctx, event)
func TraceEvent(ctx context.Context, category TraceCategory, name string, fields ...zap.Field) {
if !IsEnabled(category) {
return
}
if !recorderEnabled.Load() && !loggingEnabled.Load() {
return
}
traceBuf := GetTraceBuf(ctx)
if traceBuf == nil {
return
}
// Defensive copy prevents corruption from caller's buffer reuse.
// Reserve extra capacity for LogSink to append metadata without allocation.
event := Event{
Category: category,
Name: name,
Phase: tracing.PhaseInstant,
Timestamp: time.Now(),
TraceID: traceBuf.GetTraceID(),
Fields: copyFieldsWithCapacity(fields, 3),
}
traceBuf.Record(ctx, event)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/util/traceevent/traceevent.go` around lines 237 - 253, The current
instant event path still obtains a TraceBuf and calls traceBuf.Record even when
tracing is disabled (ModeOff), causing unnecessary overhead; update the logic in
the function that builds Event (using GetTraceBuf, Event,
copyFieldsWithCapacity) to check the global/ctx tracing mode (ModeOff) before
allocating/copying fields and before calling traceBuf.Record, and
short-circuit/return early when tracing is off so no Event is constructed or
recorded.

Comment on lines +537 to 540
} else if len(event.TraceID) > 0 {
logutil.BgLogger().Info("wrong traceid format",
zap.String("trace_id", string(event.TraceID)))
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Log malformed trace IDs in hex, not raw string bytes.

string(event.TraceID) can emit non-printable bytes and make diagnostics noisy. Use hex.EncodeToString for consistency.

Suggested fix
 		} else if len(event.TraceID) > 0 {
 			logutil.BgLogger().Info("wrong traceid format",
-				zap.String("trace_id", string(event.TraceID)))
+				zap.String("trace_id", hex.EncodeToString(event.TraceID)))
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/util/traceevent/traceevent.go` around lines 537 - 540, The log currently
prints malformed trace IDs as raw string bytes using string(event.TraceID) (see
the branch that calls logutil.BgLogger().Info and the zap.String("trace_id",
...)); change it to log the hex representation by using
hex.EncodeToString(event.TraceID) instead so non-printable bytes are rendered
consistently, and add the encoding/hex import if missing.

@tiancaiamao

Copy link
Copy Markdown
Contributor Author

/retest

@tiprow

tiprow Bot commented Mar 11, 2026

Copy link
Copy Markdown

@tiancaiamao: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

tiancaiamao and others added 2 commits April 3, 2026 13:52
- Clone TraceID bytes before storing in executor.go to prevent aliasing
- Add nil check for traceInfo in job_scheduler.go to prevent panic
- Use typed TraceBuf lookup in session.go for proper type checking
- Make GetTraceID/SetTraceID clone slices to prevent data races
- Check mode before building Event to avoid unnecessary overhead
- Fix trace_id logging to use hex.EncodeToString for consistency
- Add global enabledCategories variable and SetEnabledCategories function
- Make GenerateTraceID a pure function by moving rand call out
- Remove unused math/rand/v2 import from traceevent.go

Also includes improvements from master merge to TraceBuf interface:
- Add GetTraceID() []byte to TraceBuf interface
- Remove intermediate traceIDProvider interface

Co-Authored-By: Claude <noreply@anthropic.com>
@ti-chi-bot

ti-chi-bot Bot commented Apr 3, 2026

Copy link
Copy Markdown

@tiancaiamao: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-unit-test-next-gen 98d7c39 link true /test pull-unit-test-next-gen
idc-jenkins-ci-tidb/unit-test 98d7c39 link true /test unit-test
pull-unit-test-ddlv1 98d7c39 link true /test pull-unit-test-ddlv1

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-1-more-lgtm Indicates a PR needs 1 more LGTM. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants