Skip to content

Conversation

@ti-chi-bot
Copy link
Member

This is an automated cherry-pick of #12123

What problem does this PR solve?

Issue Number: close #12114 close #12128

What is changed and how it works?

When a DDL operation takes a long time, the connection may be lost, even though the DDL is actually executed. Instead of throwing a failure state, we can query the downstream and wait for the DDL to complete.

Notice:

  • The add index DDL operates the same as before and does not require waiting.
  • Waiting DDL mechanism only works when downstream is TiDB and the user is at the root.

Check List

Tests

  • Unit test
  • Integration test
  • manual test
mysql> select count(*) from information_schema.ddl_jobs where CREATE_TIME >= now() - interval 10 MINUTE;
+----------+
| count(*) |
+----------+
|     9000 |
+----------+
1 row in set (0.90 sec)
mysql> SELECT JOB_ID, JOB_TYPE, SCHEMA_STATE, SCHEMA_ID, TABLE_ID, STATE, QUERY FROM information_schema.ddl_jobs where CREATE_TIME >= now()- interval 10 MINUTE;
...
...
...
9000 rows in set (0.91 sec)

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Wait for the DDL to synchronize when meeting MySQL Invalid connection error

@ti-chi-bot ti-chi-bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. type/cherry-pick-for-release-8.5 This PR is cherry-picked to release-8.5 from a source PR. labels Dec 12, 2025
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Dec 12, 2025

This cherry pick PR is for a release branch and has not yet been approved by triage owners.
Adding the do-not-merge/cherry-pick-not-approved label.

To merge this cherry pick:

  1. It must be approved by the approvers firstly.
  2. AFTER it has been approved by approvers, please wait for the cherry-pick merging approval from triage owners.
Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot
Copy link
Member Author

@wk989898 This PR has conflicts, I have hold it.
Please resolve them or ask others to resolve them, then comment /unhold to remove the hold label.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Dec 12, 2025

@ti-chi-bot: ## If you want to know how to resolve it, please read the guide in TiDB Dev Guide.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Dec 12, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign liuzix for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist
Copy link

Summary of Changes

Hello @ti-chi-bot, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness of DDL (Data Definition Language) replication in TiCDC. It addresses a critical issue where long-running DDL operations, particularly in TiDB environments, could lead to connection timeouts and perceived failures, even if the DDL was successfully applied downstream. The changes introduce a proactive waiting mechanism that queries the downstream TiDB's DDL job status, ensuring that TiCDC accurately reflects the completion of DDLs and maintains data consistency, especially under network instability or during prolonged schema changes.

Highlights

  • Robust DDL Execution: Introduced a mechanism to wait for asynchronous DDL operations to complete in the downstream TiDB, preventing connection loss from causing false-negative DDL failures.
  • Conditional Waiting: The DDL waiting mechanism is specifically activated when the downstream is TiDB and an 'invalid connection' error occurs during DDL execution. It does not apply to 'add index' DDLs, which return immediately.
  • Simplified Async DDL Handling: Removed the previous asyncExecDDL logic and lastExecutedNormalDDLCache, streamlining how asynchronous DDLs are managed.
  • Enhanced Error Reporting: Modified ErrExecDDLFailed to include the DDL query in the error message for better debugging.
  • New Integration Test: Added a dedicated integration test (ddl_wait) to simulate long-running DDLs and verify the new waiting behavior.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the DDL sink logic to handle long-running DDLs that might cause connection losses. Instead of failing, it now waits for the DDL to complete by polling its status from the downstream TiDB. This is a solid improvement for the sink's robustness. The overall implementation is good, but I've found a critical issue with merge conflict markers in a configuration file, a high-severity SQL injection vulnerability, and a few medium-severity issues related to error handling and code clarity that should be addressed.

Comment on lines +229 to +241
<<<<<<< HEAD
=======
["CDC:ErrDDLStateNotFound"]
error = '''
DDL state not found %s
'''

["CDC:ErrDDLUnsupportType"]
error = '''
unsupport ddl type %s, query %s
'''

>>>>>>> d9a7e9f313 (sink(ticdc): waiting the asynchronous ddl to synchronize (#12123))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

There are merge conflict markers (<<<<<<< HEAD, =======, >>>>>>>) in this file. This will break the TOML parsing and cause the application to fail. Please remove them and keep the intended error definitions.

["CDC:ErrDDLStateNotFound"]
error = '''
DDL state not found %s
'''

["CDC:ErrDDLUnsupportType"]
error = '''
unsupport ddl type %s, query %s
'''

Comment on lines +347 to +350
showJobs := fmt.Sprintf(`SELECT JOB_ID, JOB_TYPE, SCHEMA_STATE, SCHEMA_ID, TABLE_ID, STATE, QUERY FROM information_schema.ddl_jobs
WHERE CREATE_TIME >= "%s" AND QUERY = "%s";`, createTime, ddl)
//nolint:rowserrcheck
jobsRows, err := db.QueryContext(ctx, showJobs)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Constructing SQL queries using fmt.Sprintf with external data (like the DDL query string) is vulnerable to SQL injection. Even if the ddl string is expected to be safe, it's a security best practice to always use parameterized queries to prevent potential vulnerabilities.

Please modify this to use query parameters.

	showJobs := `SELECT JOB_ID, JOB_TYPE, SCHEMA_STATE, SCHEMA_ID, TABLE_ID, STATE, QUERY FROM information_schema.ddl_jobs 
	WHERE CREATE_TIME >= ? AND QUERY = ?;`
	//nolint:rowserrcheck
	jobsRows, err := db.QueryContext(ctx, showJobs, createTime, ddl)


func (m *DDLSink) checkAsyncExecDDLDone(ctx context.Context, tables map[model.TableName]struct{}) bool {
ctx, cancel := context.WithTimeout(ctx, timeout)
ctx, cancel := context.WithTimeout(ctx, 5*time.Second)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The timeout value 5*time.Second is used as a magic number. It's better to define a constant for it at the top of the file to improve readability and maintainability, for example const checkAsyncExecDDLDoneTimeout = 5 * time.Second.

Comment on lines +144 to +145
log.Warn("Wait the asynchronous ddl to synchronize", zap.String("ddl", ddl.Query), zap.String("ddlCreateTime", ddlCreateTime),
zap.String("readTimeout", m.cfg.ReadTimeout), zap.Error(err))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The log message "Wait the asynchronous ddl to synchronize" is grammatically awkward. Consider rephrasing it for clarity, for example: "Waiting for asynchronous DDL to be synchronized".

Suggested change
log.Warn("Wait the asynchronous ddl to synchronize", zap.String("ddl", ddl.Query), zap.String("ddlCreateTime", ddlCreateTime),
zap.String("readTimeout", m.cfg.ReadTimeout), zap.Error(err))
log.Warn("Waiting for asynchronous DDL to be synchronized", zap.String("ddl", ddl.Query), zap.String("ddlCreateTime", ddlCreateTime),
zap.String("readTimeout", m.cfg.ReadTimeout), zap.Error(err))

Comment on lines +241 to +244
state, err := getDDLStateFromTiDB(ctx, m.db, ddl.Query, ddlCreateTime)
if err != nil {
log.Error("Error when getting DDL state from TiDB", zap.Error(err))
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The error returned from getDDLStateFromTiDB is logged but not handled. The function continues execution with a zero-value state, which leads to a generic ErrDDLStateNotFound being returned. This loses the original error context, making debugging more difficult. The original error should be returned to be handled by the retry mechanism.

		state, err := getDDLStateFromTiDB(ctx, m.db, ddl.Query, ddlCreateTime)
		if err != nil {
			log.Error("Error when getting DDL state from TiDB", zap.Error(err))
			return err
		}

Comment on lines +331 to +336
for row.Next() {
err = row.Scan(&ddlCreateTime)
if err != nil {
log.Warn("getting ddlCreateTime failed", zap.Error(err))
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The query is expected to return at most one row. Using a for loop is not idiomatic for this case. An if statement would be more appropriate and clearer.

		if row.Next() {
			err = row.Scan(&ddlCreateTime)
			if err != nil {
				log.Warn("getting ddlCreateTime failed", zap.Error(err))
			}
		}

@codecov
Copy link

codecov bot commented Dec 12, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release-8.5@df4a904). Learn more about missing BASE report.

Additional details and impacted files
Components Coverage Δ
cdc ∅ <0.0000%> (?)
dm 50.0385% <0.0000%> (?)
engine 53.2167% <0.0000%> (?)
Flag Coverage Δ
unit 50.8469% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

@@               Coverage Diff                @@
##             release-8.5     #12466   +/-   ##
================================================
  Coverage               ?   50.8469%           
================================================
  Files                  ?        484           
  Lines                  ?      69666           
  Branches               ?          0           
================================================
  Hits                   ?      35423           
  Misses                 ?      31487           
  Partials               ?       2756           
🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Dec 12, 2025

@ti-chi-bot: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-verify cf7a32e link true /test pull-verify
pull-dm-integration-test cf7a32e link true /test pull-dm-integration-test

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/cherry-pick-not-approved do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. type/cherry-pick-for-release-8.5 This PR is cherry-picked to release-8.5 from a source PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants