@wk989898
Collaborator

@wk989898 wk989898 commented May 23, 2025

What problem does this PR solve?

Issue Number: close #12179

What is changed and how it works?

  1. Expose the OutputFieldHeader parameter.
  2. Encode the header as message.Key when OutputFieldHeader=true, since message.Key was previously unused.
  3. Write the header only in the first row of the CSV file.
  4. Modify the decoder in the storage consumer and the integration test to validate the header.
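
Items 2 and 3 can be pictured with a minimal sketch. The `message` and `fileWriter` types below are hypothetical stand-ins, not TiCDC's actual types, and the comma delimiter is an assumption. The sketch only assumes that the header travels in `Key`, the rows travel in `Value`, and the writer emits the header once per file.

```go
package main

import (
	"bytes"
	"fmt"
)

// message is a hypothetical stand-in for an encoded CSV message:
// Key carries the header line (empty when the header option is off),
// Value carries the encoded data rows.
type message struct {
	Key   []byte
	Value []byte
}

// fileWriter is a hypothetical writer that accumulates one CSV file in memory.
type fileWriter struct {
	buf bytes.Buffer
}

// write appends a message to the current file. The header in Key is written
// only while the file is still empty, so it appears once, on the first line.
func (w *fileWriter) write(m message) {
	if w.buf.Len() == 0 && len(m.Key) > 0 {
		w.buf.Write(m.Key)
	}
	w.buf.Write(m.Value)
}

// writeFile writes all messages into a fresh file and returns its content.
func writeFile(msgs []message) string {
	var w fileWriter
	for _, m := range msgs {
		w.write(m)
	}
	return w.buf.String()
}

func main() {
	header := []byte("ticdc-meta$operation,ticdc-meta$table,ticdc-meta$schema,id\n")
	out := writeFile([]message{
		{Key: header, Value: []byte("\"I\",\"t\",\"test\",1\n")},
		{Key: header, Value: []byte("\"I\",\"t\",\"test\",2\n")},
	})
	fmt.Print(out)
}
```

Every message can carry the header in `Key`, yet only the first message written to a file contributes it, which keeps the encoder unaware of file boundaries.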

Header

| column1 | column2 | column3 | column4 (optional) | column5 (optional) | column6 | ... | columnX |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ticdc-meta$operation | ticdc-meta$table | ticdc-meta$schema | ticdc-meta$commit-ts | ticdc-meta$is-update | col1 | xxx | colX |

Check List

Tests

  • Unit test
  • Integration test

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Support header line for CSV protocol.

@ti-chi-bot
Contributor

ti-chi-bot bot commented May 23, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. affect-ticdc-config-docs Pull requests that affect TiCDC configuration docs. labels May 23, 2025

@gemini-code-assist gemini-code-assist bot left a comment

Hello @wk989898, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request to help everyone get up to speed quickly.

This PR, authored by wk989898, addresses issue #12179 by adding support for including a header line in the CSV output protocol. Previously, the CSV sink only output data rows. This change introduces a new configuration option that allows users to specify whether a header row containing column names should be included in the output.

Highlights

  • New Configuration Option: A new boolean configuration option, output-field-header, is added to the CSV sink configuration. When set to true, the CSV output will include a header row.
  • CSV Encoder Update: The CSV encoder logic is updated to check the new CSVOutputFieldHeader configuration. If enabled and it's the first batch being processed, it generates and writes a header row before the data rows.
  • Header Generation Logic: A new function encodeHeader is introduced to construct the header string. This function dynamically builds the header based on other CSV configuration options like include-commit-ts and output-old-value, ensuring the header matches the structure of the data rows (e.g., including 'commit-ts' or 'is-update' columns if those options are enabled).
  • Unit Tests: New unit tests are added to verify the behavior of the CSV encoder and the encodeHeader function when the header option is enabled and disabled.
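
The header generation described above can be sketched as follows. This is not the PR's `encodeHeader`; `headerConfig` and `buildHeader` are hypothetical names, and the sketch assumes a plain delimiter join with a newline terminator and no quoting. The meta column names follow the header layout in the PR description.

```go
package main

import (
	"fmt"
	"strings"
)

// headerConfig is a hypothetical subset of the CSV options that affect the header.
type headerConfig struct {
	Delimiter       string
	IncludeCommitTs bool // mirrors include-commit-ts
	OutputOldValue  bool // mirrors output-old-value
}

// buildHeader returns the header line: the fixed meta columns first, then the
// optional meta columns if enabled, then the table's column names.
func buildHeader(cfg headerConfig, columns []string) string {
	fields := []string{"ticdc-meta$operation", "ticdc-meta$table", "ticdc-meta$schema"}
	if cfg.IncludeCommitTs {
		fields = append(fields, "ticdc-meta$commit-ts")
	}
	if cfg.OutputOldValue {
		fields = append(fields, "ticdc-meta$is-update")
	}
	fields = append(fields, columns...)
	return strings.Join(fields, cfg.Delimiter) + "\n"
}

func main() {
	cfg := headerConfig{Delimiter: ",", IncludeCommitTs: true}
	fmt.Print(buildHeader(cfg, []string{"id", "name"}))
}
```

Because the optional columns are appended under the same conditions that add them to data rows, the header and the rows stay aligned for any combination of the two options.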

Changelog

Click here to see the changelog
  • pkg/config/sink.go
    • Added OutputFieldHeader boolean field to the CSVConfig struct (lines 263-264).
  • pkg/sink/codec/common/config.go
    • Added CSVOutputFieldHeader boolean field to the Config struct, specific to CSV (lines 92-93).
    • Initialized CSVOutputFieldHeader to false in NewConfig (line 134).
    • Applied the value from replicaConfig.Sink.CSVConfig.OutputFieldHeader to the common Config in the Apply method (line 239).
  • pkg/sink/codec/csv/csv_decoder.go
    • Passed the codecConfig.CSVOutputFieldHeader value to the Header field of the mydump.CSVConfig when creating the CSV parser (line 60).
  • pkg/sink/codec/csv/csv_encoder.go
    • Added logic in AppendTxnEvent to check b.config.CSVOutputFieldHeader and b.batchSize == 0. If true, it generates the header using encodeHeader and writes it to the buffer before processing rows (lines 39-52).
  • pkg/sink/codec/csv/csv_encoder_test.go
    • Imported the strings package (line 17).
    • Added TestCSVBatchCodecWithHeader to test the encoder's behavior with and without the header option, including checking the generated header content (lines 107-156).
  • pkg/sink/codec/csv/csv_message.go
    • Updated comments describing the column order in the CSV output format (line 103).
    • Changed encode() to return its result via common.UnsafeStringToBytes for a potential performance improvement (line 120).
    • Added the encodeHeader function to generate the CSV header row based on configuration and column names (lines 494-517).
  • pkg/sink/codec/csv/csv_message_test.go
    • Added TestEncodeHeader to specifically test the encodeHeader function with various configuration combinations (lines 1105-1134).
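
The first-batch check summarized for csv_encoder.go can be sketched like this. The `batchEncoder` type and its fields are hypothetical and only mirror the condition named in the changelog (`CSVOutputFieldHeader` enabled and `batchSize == 0`). Keeping the header apart from the row bytes follows the PR description's choice of carrying it in the message key and is an assumption about the final shape of the code.

```go
package main

import "fmt"

// batchEncoder is a hypothetical, simplified encoder that collects rows into a batch.
type batchEncoder struct {
	outputFieldHeader bool
	batchSize         int
	header            []byte
	value             []byte
}

// appendRows adds encoded rows to the batch. The header is recorded only when
// the option is on and the batch is still empty, i.e. once per batch.
func (b *batchEncoder) appendRows(header string, rows []string) {
	if b.outputFieldHeader && b.batchSize == 0 {
		b.header = []byte(header)
	}
	for _, r := range rows {
		b.value = append(b.value, r...)
		b.batchSize++
	}
}

// build returns the header and row bytes of the batch and resets the encoder.
func (b *batchEncoder) build() (header, value []byte) {
	header, value = b.header, b.value
	b.header, b.value, b.batchSize = nil, nil, 0
	return header, value
}

func main() {
	e := &batchEncoder{outputFieldHeader: true}
	e.appendRows("id,name\n", []string{"1,a\n", "2,b\n"})
	e.appendRows("id,name\n", []string{"3,c\n"})
	h, v := e.build()
	fmt.Printf("header=%q rows=%q\n", h, v)
}
```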

CSV, comma-split,
A header now can sit,
Data flows just right.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR introduces the output-field-header option for the CSV protocol, allowing a header line to be included in the output. The changes span configuration, encoding, and decoding aspects.

While the core idea is well-implemented and basic unit tests are in place, there's a critical issue with how header column names are derived, which can lead to incorrect headers when certain features like only-output-updated-columns are enabled. Additionally, integration test coverage for this new feature is currently missing.
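
The concern about header derivation can be illustrated with a small sketch. It does not reproduce the PR's code, and whether the real encoder behaves this way is not confirmed here; `column`, `rowEvent`, `namesFromEvent`, and `headerMatchesRow` are hypothetical. It only shows why a header taken from one event's columns may not describe a later event that carries a different column set.

```go
package main

import "fmt"

// column and rowEvent are hypothetical stand-ins for a change event.
// An event may carry only a subset of the table's columns.
type column struct {
	Name  string
	Value string
}

type rowEvent struct {
	Columns []column
}

// namesFromEvent derives header names from the columns present in one event.
func namesFromEvent(e rowEvent) []string {
	names := make([]string, 0, len(e.Columns))
	for _, c := range e.Columns {
		names = append(names, c.Name)
	}
	return names
}

// headerMatchesRow reports whether the header names line up with the event's columns.
func headerMatchesRow(header []string, e rowEvent) bool {
	if len(header) != len(e.Columns) {
		return false
	}
	for i, c := range e.Columns {
		if header[i] != c.Name {
			return false
		}
	}
	return true
}

func main() {
	// The first event carries two columns, the second carries three.
	first := rowEvent{Columns: []column{{Name: "id", Value: "1"}, {Name: "age", Value: "30"}}}
	second := rowEvent{Columns: []column{{Name: "id", Value: "2"}, {Name: "name", Value: "b"}, {Name: "age", Value: "41"}}}

	header := namesFromEvent(first)
	fmt.Println(headerMatchesRow(header, first), headerMatchesRow(header, second))
}
```

Deriving the names from the table schema instead of from an individual event would avoid this kind of mismatch.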

The use of common.UnsafeStringToBytes is noted for performance but should always be handled with care, ensuring the lifetime of the underlying data is managed correctly (which appears to be the case here). A minor potential optimization could be caching generated headers if schemas are stable and batches are small, but this is not a primary concern.

Summary of Findings

  • Header Column Derivation: The logic for generating header column names in pkg/sink/codec/csv/csv_encoder.go incorrectly bases them on the columns present in the first event of a batch. This can lead to headers that don't match the full schema or subsequent data rows, especially when only-output-updated-columns is true. This is a high-severity issue.
  • Integration Test Coverage: The new output-field-header feature is not currently covered by any integration tests. An existing test should be modified or a new one added to validate this feature end-to-end. This is a medium-severity issue.
  • PR Description: The 'What is changed and how it works?' section of the pull request description is empty. Filling this out would improve context for reviewers and future maintainers. (Severity: Low, not commented in-line)
  • Unsafe Code Usage: The use of common.UnsafeStringToBytes in pkg/sink/codec/csv/csv_message.go is noted. While it appears safe in the current context (as strBuilder.String() creates a new string whose data is immediately converted to a byte slice), unsafe operations always require careful consideration of memory and object lifetimes. (Severity: Low, not commented in-line)
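
For context on the last finding, here is a minimal sketch of a zero-copy string-to-bytes conversion. It is not the implementation of common.UnsafeStringToBytes, which is not shown in this conversation; `unsafeStringToBytes` and `encodeRow` are hypothetical, and the sketch assumes Go 1.20 or later for `unsafe.StringData`.

```go
package main

import (
	"fmt"
	"strings"
	"unsafe"
)

// unsafeStringToBytes reinterprets s as a []byte without copying.
// The returned slice shares memory with s and must never be modified.
func unsafeStringToBytes(s string) []byte {
	if len(s) == 0 {
		return nil
	}
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

// encodeRow joins the fields with commas and returns the bytes without an
// extra copy. The builder is not used after String(), so nothing else
// touches the returned memory.
func encodeRow(fields []string) []byte {
	var sb strings.Builder
	sb.WriteString(strings.Join(fields, ","))
	sb.WriteByte('\n')
	return unsafeStringToBytes(sb.String())
}

func main() {
	fmt.Printf("%q\n", encodeRow([]string{"I", "t", "test", "1"}))
}
```

The conversion saves one allocation and copy per row, at the cost of a rule the compiler cannot enforce: callers must treat the result as read-only.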

Merge Readiness

The pull request introduces a useful feature for CSV output. However, due to the high-severity issue identified in header column generation and the medium-severity concern regarding missing integration test coverage, I recommend that these changes not be merged until these points are addressed. Addressing the PR description would also be beneficial. I am unable to approve the pull request; please ensure these changes are reviewed and approved by others before merging.

@ti-chi-bot
Contributor

ti-chi-bot bot commented May 23, 2025

@gemini-code-assist[bot]: adding LGTM is restricted to approvers and reviewers in OWNERS files.

@wk989898 wk989898 marked this pull request as ready for review June 6, 2025 04:02
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 6, 2025
@codecov

codecov bot commented Jun 6, 2025

Codecov Report

❌ Patch coverage is 69.86301% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.0624%. Comparing base (8a7da93) to head (d028366).
⚠️ Report is 49 commits behind head on master.

Additional details and impacted files
| Components | Coverage Δ |
| --- | --- |
| cdc | 59.2690% <69.8630%> (-0.0096%) ⬇️ |
| dm | 50.0968% <ø> (+0.0191%) ⬆️ |
| engine | 53.2223% <ø> (ø) |

| Flag | Coverage Δ |
| --- | --- |
| unit | 55.0624% <69.8630%> (+0.0022%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

@@               Coverage Diff                @@
##             master     #12183        +/-   ##
================================================
+ Coverage   55.0602%   55.0624%   +0.0022%     
================================================
  Files          1030       1030                
  Lines        143225     143290        +65     
================================================
+ Hits          78860      78899        +39     
- Misses        58570      58592        +22     
- Partials       5795       5799         +4     
@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Jun 6, 2025
@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jun 9, 2025
@ti-chi-bot
Contributor

ti-chi-bot bot commented Jun 9, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-06-06 10:31:22.75879814 +0000 UTC m=+5460.987113405: ☑️ agreed by 3AceShowHand.
  • 2025-06-09 06:07:51.721634193 +0000 UTC m=+248849.949949456: ☑️ agreed by hongyunyan.

@wk989898
Collaborator Author

wk989898 commented Jun 9, 2025

/test pull-cdc-integration-storage-test

@wk989898
Collaborator Author

wk989898 commented Jun 9, 2025

/cc @benmeadowcroft

@ti-chi-bot
Contributor

ti-chi-bot bot commented Jun 9, 2025

@wk989898: GitHub didn't allow me to request PR reviews from the following users: benmeadowcroft.

Note that only pingcap members and repo collaborators can review this PR, and authors cannot review their own PRs.

@wk989898
Collaborator Author

@benmeadowcroft PTAL.

@ti-chi-bot
Contributor

ti-chi-bot bot commented Jun 10, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 3AceShowHand, benmeadowcroft, hongyunyan

@ti-chi-bot ti-chi-bot bot added the approved label Jun 10, 2025
@ti-chi-bot ti-chi-bot bot merged commit a6221a5 into pingcap:master Jun 10, 2025
28 checks passed
@wk989898 wk989898 added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Dec 3, 2025
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #12433.

Labels

affect-ticdc-config-docs Pull requests that affect TiCDC configuration docs. approved lgtm needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add header for CSV protocol

5 participants