[Feature][Jdbc] Add String type column split Support by charset-based splitting algorithm #9002

jinkachy · 2025-03-18T11:05:23Z

Purpose of this pull request

This PR introduces a new character-based splitting algorithm for JDBC connectors when dealing with string-type columns. The traditional approach for splitting string-type data relies on database limit queries or mod hash operations, which can be inefficient for large datasets. The new algorithm uses character set ordering for more efficient splitting, eliminating the need for multiple database limit queries when MIN and MAX values are already known.

The core algorithm (org.apache.seatunnel.connectors.seatunnel.jdbc.source.CollationBasedSplitter) works as follows:

1. It treats strings as numbers in a numeral system whose base is the size of the character set plus 1 (the extra digit accounts for the null/empty character).
2. Each string is converted to a "numeral" in this system, with character positions representing place values.
3. These numerals are converted to decimal (BigInteger) values to create a numerical range.
4. The numerical range is split evenly using standard numeric splitting algorithms.
5. The resulting split points are converted back to their string representation.

This approach produces evenly distributed string splits without requiring additional database queries, significantly improving performance for large datasets.
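
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the class `CharsetSplitSketch` and its method names are hypothetical, characters are ordered by code point over ASCII 32-126 (no custom collation), shorter strings are padded with the reserved digit 0, and decoding truncates at the first padding digit.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and details are assumptions, not the PR's implementation.
final class CharsetSplitSketch {
    static final int MIN_CHAR = 32;   // first visible ASCII character (space)
    static final int MAX_CHAR = 126;  // last visible ASCII character (~)
    // 95 characters plus one digit (0) reserved for "no character".
    static final BigInteger BASE = BigInteger.valueOf(MAX_CHAR - MIN_CHAR + 2);

    // Steps 1-3: read the string as a fixed-width numeral in base 96.
    static BigInteger encode(String s, int width) {
        BigInteger value = BigInteger.ZERO;
        for (int i = 0; i < width; i++) {
            int digit = 0; // padding digit for positions past the end of the string
            if (i < s.length()) {
                char c = s.charAt(i);
                if (c < MIN_CHAR || c > MAX_CHAR) {
                    throw new IllegalArgumentException("Unsupported character at index " + i);
                }
                digit = c - MIN_CHAR + 1;
            }
            value = value.multiply(BASE).add(BigInteger.valueOf(digit));
        }
        return value;
    }

    // Step 5: turn a number back into a string, stopping at the first padding digit.
    static String decode(BigInteger value, int width) {
        int[] digits = new int[width];
        for (int i = width - 1; i >= 0; i--) {
            BigInteger[] qr = value.divideAndRemainder(BASE);
            digits[i] = qr[1].intValue();
            value = qr[0];
        }
        StringBuilder sb = new StringBuilder();
        for (int digit : digits) {
            if (digit == 0) {
                break;
            }
            sb.append((char) (digit - 1 + MIN_CHAR));
        }
        return sb.toString();
    }

    // Step 4: split [min, max] evenly in numeric space and return chunks + 1 boundaries.
    static List<String> splitPoints(String min, String max, int chunks) {
        int width = Math.max(min.length(), max.length());
        BigInteger lo = encode(min, width);
        BigInteger range = encode(max, width).subtract(lo);
        List<String> points = new ArrayList<>();
        for (int i = 0; i <= chunks; i++) {
            BigInteger offset = range.multiply(BigInteger.valueOf(i))
                    .divide(BigInteger.valueOf(chunks));
            points.add(decode(lo.add(offset), width));
        }
        return points;
    }
}
```

For example, `CharsetSplitSketch.splitPoints("a", "e", 4)` yields the boundaries a, b, c, d, e. Only the MIN and MAX values of the column are needed as input, which is why no further boundary queries are required.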

Does this PR introduce any user-facing change?

Yes, this PR introduces a new configuration option string_split_mode which can be set to charset_based to enable the new character set-based string splitting algorithm. Users can also specify the string_split_mode_collate parameter to define a specific character collation order. If not specified, the database system's default sorting rule will be used.

Currently, the implementation supports all visible ASCII characters (code points 32-126), which covers most common use cases for string fields typically composed of numbers and letters.
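
Whether a given split key fits this constraint can be checked with a small helper before enabling the mode. The sketch below is a hypothetical illustration only (the class `VisibleAsciiCheck` is not part of this PR); it simply tests that every character lies in the 32-126 range.

```java
// Hypothetical helper; not part of the PR.
final class VisibleAsciiCheck {
    // Returns true when every character of s is a visible ASCII character (32-126).
    static boolean isVisibleAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 32 || c > 126) {
                return false;
            }
        }
        return true;
    }
}
```

A value such as `user_0042` passes this check, while values containing tabs or non-ASCII letters do not.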

Recommendation: set string_split_mode=charset_based when dealing with large datasets that require many partitions and have only string fields available as split keys. In these scenarios, this mode significantly reduces the number of database queries and improves overall performance.

How was this patch tested?

The implementation has been tested with:

Unit tests for the CollationBasedSplitter class to verify the conversion between strings and numeric ranges
Tests with different database systems (MySQL, PostgreSQL, and others) to verify that string-based splitting works correctly
Performance comparison tests between the traditional approach and the new character-based approach

All tests confirm that the algorithm correctly splits string-type data into evenly distributed chunks and provides significant performance improvements for large datasets.

Check list

If any new Jar binary package is added in your PR, please add a License Notice according to the New License Guide
If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
If you are contributing the connector code, please check that the following files are updated:
1. Update plugin-mapping.properties and add new connector information in it
2. Update the pom file of seatunnel-dist
3. Add ci label in label-scope-conf
4. Add e2e testcase in seatunnel-e2e
5. Update connector plugin_config
Update the release-note.

Hisoka-X

Thanks @jinkachy! It's a great feature!

...ctor-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/source/SplitMode.java

...est/java/org/apache/seatunnel/connectors/seatunnel/jdbc/source/CharsetBasedSplitterTest.java

...or-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/config/JdbcOptions.java

.../src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/source/FixedChunkSplitter.java

...rc/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/source/DynamicChunkSplitter.java

...ctor-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/source/SplitMode.java

jinkachy · 2025-03-20T05:35:31Z

@Hisoka-X Thank you very much for your review. All the issues you pointed out have been corrected.

Hisoka-X · 2025-03-24T02:17:41Z

docs/en/connector-v2/source/Jdbc.md

+| string_split_mode | String | No | - | When set to "charset_based", enables charset-based string splitting algorithm. The algorithm assumes characters of partition_column are within ASCII range 32-126, which covers most character-based splitting scenarios. |
+| string_split_mode_collate | String | No | - | Specifies the collation to use when string_split_mode is set to "charset_based" and the table has a special collation. If not specified, the database's default collation will be used. |


Oh, it dawned on me that it might be better to add the split prefix, just like the other sharding parameters.

Suggested change

-| string_split_mode | String | No | - | When set to "charset_based", enables charset-based string splitting algorithm. The algorithm assumes characters of partition_column are within ASCII range 32-126, which covers most character-based splitting scenarios. |
-| string_split_mode_collate | String | No | - | Specifies the collation to use when string_split_mode is set to "charset_based" and the table has a special collation. If not specified, the database's default collation will be used. |
+| split.string_split_mode | String | No | sample | Supports different string splitting algorithms. By default, `sample` is used to determine the split by sampling the string value. You can switch to `charset_based` to enable charset-based string splitting algorithm. When set to `charset_based`, the algorithm assumes characters of partition_column are within ASCII range 32-126, which covers most character-based splitting scenarios. |
+| split.string_split_mode_collate | String | No | - | Specifies the collation to use when string_split_mode is set to `charset_based` and the table has a special collation. If not specified, the database's default collation will be used. |

you are right, fix done

Hisoka-X · 2025-03-24T02:18:08Z

...bc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/config/JdbcSourceConfig.java

@@ -44,6 +45,10 @@ public class JdbcSourceConfig implements Serializable {
    private int splitInverseSamplingRate;
    private boolean decimalTypeNarrowing;

+    private StringSplitMode stringSplitMode;
+
+    private String collate;


Suggested change

-    private String collate;
+    private String stringSplitModeCollate;

thanks, fix done

...or-jdbc-e2e/connector-jdbc-e2e-part-5/src/test/resources/jdbc_greenplum_source_and_sink.conf

Hisoka-X

Thanks @jinkachy

github-actions bot added document connectors-v2 e2e jdbc labels Mar 18, 2025

Hisoka-X reviewed Mar 20, 2025

jinkachy force-pushed the dev3 branch from 350cf4d to 9fb4fcb Compare March 20, 2025 15:18

Hisoka-X reviewed Mar 24, 2025

chenhongyu05 added 8 commits March 24, 2025 11:13

79ea2a6 [Feature][Jdbc] Add String type column split Support by charset-based splitting algorithm
e77b661 [Feature][Jdbc] Add String type column split Support by charset-based splitting algorithm
336c4d3 [Feature][Jdbc] Add String type column split Support by charset-based splitting algorithm
97698b0 [Feature][Jdbc] Add String type column split Support by charset-based splitting algorithm
c8e6f04 [Feature][Jdbc] Add String type column split Support by charset-based splitting algorithm
f245da8 [Feature][Jdbc] Add String type column split Support by charset-based splitting algorithm
cc32cad [Feature][Jdbc] Add String type column split Support by charset-based splitting algorithm
b78bdbd [Feature][Jdbc] Add String type column split Support by charset-based splitting algorithm

jinkachy force-pushed the dev3 branch from 31ead8e to b78bdbd Compare March 24, 2025 03:13

chenhongyu05 added 7 commits March 24, 2025 11:17

e7c438b [Feature][Jdbc] Add String type column split Support by charset-based splitting algorithm
a6de31e [Feature][Jdbc] Add String type column split Support by charset-based splitting algorithm
9cb25b8 [Feature][Jdbc] Add String type column split Support by charset-based splitting algorithm
d5ed2a5 [Feature][Jdbc] Add String type column split Support by charset-based splitting algorithm
54ed6d9 [Feature][Jdbc] Add String type column split Support by charset-based splitting algorithm
35b42a4 [Feature][Jdbc] Add String type column split Support by charset-based splitting algorithm
0eeec9f [Feature][Jdbc] Add String type column split Support by charset-based splitting algorithm

Hisoka-X reviewed Mar 25, 2025

...or-jdbc-e2e/connector-jdbc-e2e-part-5/src/test/resources/jdbc_greenplum_source_and_sink.conf (outdated, resolved)

ba9ecf1 [Feature][Jdbc] Add String type column split Support by charset-based splitting algorithm

Hisoka-X approved these changes Mar 26, 2025

github-actions bot added approved reviewed labels Mar 26, 2025

hailin0 approved these changes Mar 26, 2025

hailin0 merged commit dbe41e7 into apache:dev Mar 26, 2025
3 checks passed
