[Feature][connector-doris] adds case insensitivity feature to the Doris connector #9273

Open
wants to merge 7 commits into
base: dev

yzeng1618

Purpose of this pull request

#9272

This PR adds a case-insensitivity feature to the Doris connector. During data synchronization, especially when migrating from Oracle to Doris, column names often fail to match because Oracle stores table and field names in uppercase by default, while Doris typically uses lowercase identifiers. The new case_sensitive configuration option lets users control whether column names are treated as case-sensitive, which resolves case mismatches during cross-database migration.

Does this PR introduce any user-facing change?

Yes. This PR introduces a new configuration option, case_sensitive, that controls whether the Doris connector treats column names as case-sensitive. When it is set to false, the connector automatically converts column names to lowercase, so column names match regardless of case. This is particularly useful when migrating data to Doris from databases such as Oracle that use uppercase identifiers by default.
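
The behavior described above can be illustrated with a minimal sketch. It is not the code from this PR: the class FieldNameNormalizer and its normalize method are hypothetical names used only for illustration, and the sketch assumes that "converting to lowercase" means a locale-independent String.toLowerCase call.

```java
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical helper for illustration; not a class from the SeaTunnel code base. */
final class FieldNameNormalizer {

    private FieldNameNormalizer() {}

    /**
     * Returns a copy of the field names: unchanged when caseSensitive is true,
     * lowercased otherwise. The input array is never modified.
     */
    static String[] normalize(String[] fieldNames, boolean caseSensitive) {
        if (caseSensitive) {
            return fieldNames.clone();
        }
        return Arrays.stream(fieldNames)
                // Locale.ROOT keeps the result independent of the JVM default locale.
                .map(name -> name.toLowerCase(Locale.ROOT))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Example column names as an Oracle source might report them (made-up values).
        String[] oracleColumns = {"ORDER_ID", "Customer_Name", "amount"};
        System.out.println(Arrays.toString(normalize(oracleColumns, false)));
        System.out.println(Arrays.toString(normalize(oracleColumns, true)));
    }
}
```

One consequence worth keeping in mind with any approach like this: two source columns whose names differ only by case would map to the same lowercase name, so a real implementation may need to detect that situation.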

How was this patch tested?

  • Integration tests: Tested actual data synchronization between Oracle and Doris environments to ensure that column name case differences do not affect synchronization.
  • Manual testing: Verified the stability and compatibility of the feature under different configuration combinations.
  • Scenario testing: Mainly tested a single table and multiple tables with the parameter set to false, as well as scenarios where the parameter is not set. All results met the requirements.

Check list

Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds a case insensitivity feature to the Doris connector by introducing the "case_sensitive" configuration option and propagating its effect across components.

  • Updated DorisStreamLoad to conditionally lowercase table names.
  • Refactored SeaTunnelRowSerializer and its factory to process field names based on case sensitivity.
  • Adjusted DorisTypeConverters, DorisTableConfig, and configuration files to support the new option.

Reviewed Changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| DorisStreamLoad.java | Updates table name assignment based on the case_sensitive flag. |
| DorisSinkWriter.java | Refactors serializer creation to improve abstraction via a factory. |
| SeaTunnelRowSerializerFactory.java | Passes the case_sensitive flag to the serializer. |
| SeaTunnelRowSerializer.java | Processes field names according to the case_sensitive flag. |
| DorisTypeConverterV2.java / DorisTypeConverterV1.java | Delegates column name handling based on case sensitivity. |
| AbstractDorisTypeConverter.java | Introduces a new method for building physical columns that applies case conversion if needed. |
| DorisTableConfig.java | Adjusts database and table names based on the case_sensitive flag. |
| DorisSinkOptions.java | Adds the CASE_SENSITIVE configuration option. |
| DorisSinkConfig.java | Reads and sets the case_sensitive configuration from user input. |
Files not reviewed (1)
  • .github/actions/get-workflow-origin: Language not supported
Comments suppressed due to low confidence (1)

seatunnel-connectors-v2/connector-doris/src/main/java/org/apache/seatunnel/connectors/doris/config/DorisTableConfig.java:87

  • [nitpick] Consider refactoring the lowercasing logic into a dedicated utility method to avoid duplication and improve maintainability.
if (!caseSensitive) {
    dorisTableConfig.setDatabase(dorisTableConfig.getDatabase().toLowerCase());
    dorisTableConfig.setTable(dorisTableConfig.getTable().toLowerCase());
}
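
A minimal sketch of the kind of utility method this nitpick has in mind, under stated assumptions: IdentifierCaseUtil and applyCase are hypothetical names, not part of the PR or of SeaTunnel, and the null pass-through is a design choice of this sketch rather than documented connector behavior.

```java
import java.util.Locale;

/** Hypothetical utility; the class and method names are illustrative only. */
final class IdentifierCaseUtil {

    private IdentifierCaseUtil() {}

    /**
     * Lowercases the identifier unless case sensitivity is requested.
     * A null identifier is passed through unchanged (assumption of this sketch).
     */
    static String applyCase(String identifier, boolean caseSensitive) {
        if (identifier == null || caseSensitive) {
            return identifier;
        }
        // Locale.ROOT avoids locale-specific lowercase mappings.
        return identifier.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        boolean caseSensitive = false;
        // Made-up database and table names standing in for the configured values.
        String database = applyCase("SALES_DB", caseSensitive);
        String table = applyCase("ORDERS", caseSensitive);
        System.out.println(database + "." + table);
    }
}
```

With a helper like this, the database, table, and field name handling could all share one code path instead of repeating the toLowerCase calls in each class.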

List<Object> fieldNames = new ArrayList<>(Arrays.asList(seaTunnelRowType.getFieldNames()));
this.caseSensitive = caseSensitive;

String[] fieldNames = seaTunnelRowType.getFieldNames();

Copilot AI May 7, 2025


[nitpick] Add a brief comment to explain why we are converting field names based on the case_sensitive flag to aid future maintainers.

Suggested change
String[] fieldNames = seaTunnelRowType.getFieldNames();
String[] fieldNames = seaTunnelRowType.getFieldNames();
// Normalize field names based on the caseSensitive flag.
// If caseSensitive is false, convert field names to lowercase to ensure consistent handling
// in case-insensitive environments.


@Hisoka-X
Member

Hisoka-X commented May 7, 2025

Thanks @yzeng1618. Please add a test case and enable CI on your fork repository. https://github.com/apache/seatunnel/pull/9273/checks?check_run_id=41710620403

Hisoka-X added the First-time contributor label May 7, 2025
@Hisoka-X
Member

Hisoka-X commented May 7, 2025

@yzeng1618
Author

yzeng1618 commented May 8, 2025

Why not use https://seatunnel.apache.org/docs/2.3.10/transform-v2/table-rename and https://seatunnel.apache.org/docs/2.3.10/transform-v2/field-rename to change the names to lowercase?

1. Complexity in Scenarios with Numerous Fields

For tables containing hundreds of fields, using transform plugins for field renaming becomes extremely cumbersome. Each field requires individual configuration, resulting in verbose and difficult-to-maintain configuration files. For example:

transform {
  FieldRename {
    source_table_name = "table1"
    result_table_name = "table1"
    field_name = {
      FIELD1 = "field1"
      FIELD2 = "field2"
      // ... potentially hundreds of fields
      FIELD500 = "field500"
    }
  }
}

2. Complexity in Multi-Table/Full Database Synchronization Scenarios

In multi-table or full database synchronization scenarios, using transform plugins becomes even more complex. Each table requires separate TableRename and FieldRename configurations, leading to extremely large configuration files. Additionally, if table structures change (e.g., adding new fields), the configuration files must be updated accordingly.

3. Flexibility of Code-Level Optimization

Handling case sensitivity at the code level provides greater flexibility and functional extensibility:

Enables automatic table creation functionality when case insensitivity parameters are set (currently developed, pending evaluation)

4. Consistency with Other Connectors

The Iceberg connector also provides parameters to control case sensitivity, providing a unified user experience.

@Hisoka-X
Member

Hisoka-X commented May 8, 2025

Complexity in Scenarios with Numerous Fields
Complexity in Multi-Table/Full Database Synchronization Scenarios

In fact, SeaTunnel supports using a single transform to solve the case conversion problem for all tables and all fields. For example:

  FieldRename {
    plugin_input = "transform1"
    plugin_output = "transform2"

    table_match_regex = ".*"
    convert_case = "LOWER"
  }

Please refer to https://seatunnel.apache.org/docs/2.3.10/transform-v2/transform-multi-table

Enables automatic table creation functionality when case insensitivity parameters are set

This shouldn't be a problem, since the field case follows the upstream source.
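
As a conceptual illustration only, the effect of combining table_match_regex with convert_case = "LOWER" can be sketched as below. This is not SeaTunnel's implementation: RegexCaseConversionSketch and lowerFieldsOfMatchingTables are hypothetical names, and the sketch assumes that the regex must match the whole table name.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Conceptual sketch; not the FieldRename transform's actual code. */
final class RegexCaseConversionSketch {

    private RegexCaseConversionSketch() {}

    /**
     * Lowercases the field names of every table whose name fully matches the regex.
     * Tables that do not match keep their field names as they are.
     */
    static Map<String, List<String>> lowerFieldsOfMatchingTables(
            Map<String, List<String>> schemas, String tableMatchRegex) {
        Pattern pattern = Pattern.compile(tableMatchRegex);
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : schemas.entrySet()) {
            List<String> fields = entry.getValue();
            if (pattern.matcher(entry.getKey()).matches()) {
                fields = fields.stream()
                        .map(field -> field.toLowerCase(Locale.ROOT))
                        .collect(Collectors.toList());
            }
            result.put(entry.getKey(), fields);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up table schemas standing in for an upstream multi-table source.
        Map<String, List<String>> schemas = new LinkedHashMap<>();
        schemas.put("ORDERS", Arrays.asList("ORDER_ID", "AMOUNT"));
        schemas.put("AUDIT_LOG", Arrays.asList("EVENT_TS"));

        // ".*" matches every table, so all field names are lowercased.
        System.out.println(lowerFieldsOfMatchingTables(schemas, ".*"));
        // A narrower regex converts only the matching table.
        System.out.println(lowerFieldsOfMatchingTables(schemas, "ORD.*"));
    }
}
```

The point of the sketch is that one rule keyed on a table-name pattern covers every table and field, so no per-field or per-table configuration is required.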

@yzeng1618
Author


Thank you for suggesting the use of transform-multi-table. We acknowledge that we hadn't fully considered this solution before, and it is indeed an excellent option.

However, we still believe that adding a case_sensitive parameter to the Doris connector has its own value and necessity:

  1. Simplified User Configuration Experience: While transform-multi-table is powerful, providing a parameter directly in the connector allows users to solve the issue with just one line of configuration, without needing to learn and configure additional transform plugins, which significantly lowers the barrier to use.
  2. Performance Optimization: Handling case sensitivity at the connector level is more efficient than at the transform level, as it avoids the overhead of data transfer and processing between different plugins. This optimization is particularly noticeable when processing large volumes of data.
  3. Consistency Across Connectors: As noted above, the Iceberg connector already provides similar parameters. Adding the same functionality to the Doris connector maintains consistency across SeaTunnel connectors, giving users a consistent experience when switching between connectors.

Therefore, for simple scenarios, using the connector parameter would be ideal; for more complex transformation requirements, the transform-multi-table functionality would be more appropriate. This approach provides users with maximum flexibility while keeping configuration simple. We kindly request your evaluation of the necessity of adding the case_sensitive parameter to the Doris connector.
