Skip to content

Conversation

@ti-chi-bot
Copy link
Member

This is an automated cherry-pick of #12183

What problem does this PR solve?

Issue Number: close #12179

What is changed and how it works?

  1. Expose param OutputFieldHeader.
  2. Endode the header as the message.Key when OutputFieldHeader=true since message.Key is never used before.
  3. Write the header only in the first row of the CSV file.
  4. Modify the decoder in the storage consumer and integration test for validating the header.

Header

column1 column2 column3 column4(optional) column5(optional) column6 ... columnX
ticdc-meta$operation ticdc-meta$table ticdc-meta$schema ticdc-meta$commit-ts ticdc-meta$is-update col1 xxx colX

Check List

Tests

  • Unit test
  • Integration test

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Support header line for CSV protocol.

@ti-chi-bot ti-chi-bot added affect-ticdc-config-docs Pull requests that affect TiCDC configuration docs. lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. type/cherry-pick-for-release-8.5 This PR is cherry-picked to release-8.5 from a source PR. labels Dec 3, 2025
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Dec 3, 2025

This cherry pick PR is for a release branch and has not yet been approved by triage owners.
Adding the do-not-merge/cherry-pick-not-approved label.

To merge this cherry pick:

  1. It must be approved by the approvers firstly.
  2. AFTER it has been approved by approvers, please wait for the cherry-pick merging approval from triage owners.
Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@gemini-code-assist
Copy link

Summary of Changes

Hello @ti-chi-bot, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances TiCDC's CSV protocol by adding support for an optional header line in the generated CSV files. This feature allows users to include column names as the first row, improving the readability and interoperability of the CSV output with various data processing tools. The changes involve exposing a new configuration parameter, implementing the logic to generate and write the header, and updating the CSV decoder to handle header-present files.

Highlights

  • CSV Header Support: Introduced a new configuration option output-field-header to enable or disable the output of a header row in CSV files generated by TiCDC.
  • Header Encoding & Output: When enabled, the header row is dynamically generated based on the table schema and encoded into the message.Key field, ensuring it's written as the first line of the CSV output.
  • Decoder Compatibility: The CSV decoder has been updated to recognize and correctly parse CSV files that include a header, including validation of column names.
  • Integration Testing: New integration tests and configuration updates confirm the correct generation and validation of CSV headers.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for header lines in the CSV protocol, which is a valuable feature. The implementation is mostly well-structured, propagating the new OutputFieldHeader configuration and adding logic to both encode and decode the headers. However, I've identified a few issues that should be addressed:

  • The CSV decoder uses log.Panic on a header mismatch, which could crash the TiCDC process. This should be changed to return an error for graceful failure.
  • The header encoding logic does not account for the existing OutputHandleKey configuration, which will cause a mismatch between header and data columns when it's enabled.
  • The integration test added to verify the header has a check that is not strict enough and will not fail the test upon header mismatch.

I've left specific comments with suggestions for each of these points. Once these are addressed, the PR should be in good shape.

Comment on lines 82 to 84
if col.Name.O != header[idx+i] {
log.Panic("check column name order failed", zap.Any("col", col.Name.O), zap.Any("header", header[idx+i]))
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using log.Panic for a data validation error like a CSV header mismatch is too aggressive, as it will crash the entire capture process. It's better to return an error to allow for graceful error handling, such as putting the changefeed into an error state.

            if col.Name.O != header[idx+i] {
                return nil, cerror.WrapError(cerror.ErrCSVDecodeFailed,
                    errors.Errorf("csv header column name mismatch, expected: %s, got: %s", col.Name.O, header[idx+i]))
            }

Comment on lines +511 to +512
}
for _, name := range colNames {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The encodeHeader function does not account for the OutputHandleKey configuration. When OutputHandleKey is true, encodeMeta adds the handle key as a column, but there is no corresponding header column. This will cause a mismatch between the header and the data rows. A header for the handle key should be added, for instance ticdc-meta$handle-key.

    }
    if config.OutputHandleKey {
        strBuilder.WriteString(config.Delimiter)
        strBuilder.WriteString("ticdc-meta$handle-key")
    }
    for _, name := range colNames {

buf := bytes.NewBuffer(make([]byte, 0, task.size))
rowsCnt := 0
bytesCnt := int64(0)
// There is always only one message here in task.msgs

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comment // There is always only one message here in task.msgs is an important assumption. While it might be true for the current implementation, it could become a source of bugs if the logic for batching messages into tasks changes in the future. If a task could contain multiple messages, this logic would only write the header for the very first message in the task, potentially missing headers for subsequent messages if they were intended to start new files. Consider adding a more detailed explanation or a link to where this assumption is guaranteed.

Comment on lines +32 to +34
if [[ "$first_line" != ticdc-meta* ]]; then
echo "check CSV header failed. header: $first_line"
fi

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The check for the CSV header is not strict enough. It only echoes a message on failure and does not cause the script to exit with an error. This could lead to test failures being missed. The check should be made stricter to exit on failure. Also, it's good practice to quote file paths (e.g., "$file" in head) to handle potential spaces or special characters.

Suggested change
if [[ "$first_line" != ticdc-meta* ]]; then
echo "check CSV header failed. header: $first_line"
fi
if [[ "$first_line" != ticdc-meta* ]]; then
echo "check CSV header failed. header: $first_line"
exit 1
fi

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Dec 3, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wk989898
Once this PR has been reviewed and has the lgtm label, please assign benmeadowcroft for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov
Copy link

codecov bot commented Dec 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release-8.5@ce07415). Learn more about missing BASE report.

Additional details and impacted files
Components Coverage Δ
cdc ∅ <0.0000%> (?)
dm ∅ <0.0000%> (?)
engine 53.2054% <0.0000%> (?)
Flag Coverage Δ
unit 53.2054% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

@@               Coverage Diff                @@
##             release-8.5     #12433   +/-   ##
================================================
  Coverage               ?   53.2054%           
================================================
  Files                  ?        213           
  Lines                  ?      17720           
  Branches               ?          0           
================================================
  Hits                   ?       9428           
  Misses                 ?       7689           
  Partials               ?        603           
🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Dec 3, 2025

@ti-chi-bot: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-dm-integration-test 9ed650b link true /test pull-dm-integration-test
pull-verify 9ed650b link true /test pull-verify
pull-cdc-integration-mysql-test 9ed650b link true /test pull-cdc-integration-mysql-test

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

affect-ticdc-config-docs Pull requests that affect TiCDC configuration docs. do-not-merge/cherry-pick-not-approved lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. type/cherry-pick-for-release-8.5 This PR is cherry-picked to release-8.5 from a source PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants