[Feature][Jdbc] Add String type column split Support by charset-based splitting algorithm #9002

Merged
hailin0 merged 16 commits into apache:dev on Mar 26, 2025

Conversation

jinkachy
Contributor

@jinkachy jinkachy commented Mar 18, 2025

Purpose of this pull request

This PR introduces a new character-based splitting algorithm for JDBC connectors when dealing with string-type columns. The traditional approach for splitting string-type data relies on database limit queries or mod hash operations, which can be inefficient for large datasets. The new algorithm uses character set ordering for more efficient splitting, eliminating the need for multiple database limit queries when MIN and MAX values are already known.

The core algorithm (org.apache.seatunnel.connectors.seatunnel.jdbc.source.CollationBasedSplitter) works as follows:

  1. It treats strings as numbers in a numeral system where the base is the size of the character set (plus 1 to account for null/empty character)
  2. Each string is converted to a "numeral" in this system, with positions representing place values
  3. These numerals are then converted to decimal (BigInteger) values to create a numerical range
  4. The numerical range is split evenly using standard numeric splitting algorithms
  5. The resulting split points are converted back to string representation

This approach produces evenly distributed string splits without requiring additional database queries, significantly improving performance for large datasets.
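
For illustration, the five steps above can be sketched in a few lines of Java. This is a minimal sketch under stated assumptions, not the CollationBasedSplitter code from this PR: the class and method names (CharsetSplitSketch, encode, decode, splitPoints) are hypothetical, characters are ordered by plain ASCII code point rather than by a database collation, and a split point that contains an empty position in the middle is simply cut off at that position.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names are hypothetical and not taken from the PR.
final class CharsetSplitSketch {
    // Assumption: visible ASCII (32-126) in code-point order, no collation applied.
    private static final int FIRST = 32;
    private static final int LAST = 126;
    // Base = charset size + 1; digit 0 stands for "no character at this position".
    private static final BigInteger BASE = BigInteger.valueOf(LAST - FIRST + 2);

    // Steps 1-3: read the string as a fixed-width numeral and return its value.
    static BigInteger encode(String s, int width) {
        BigInteger value = BigInteger.ZERO;
        for (int i = 0; i < width; i++) {
            int digit = 0; // padding for strings shorter than width
            if (i < s.length()) {
                char c = s.charAt(i);
                if (c < FIRST || c > LAST) {
                    throw new IllegalArgumentException("Character outside ASCII 32-126: " + (int) c);
                }
                digit = c - FIRST + 1;
            }
            value = value.multiply(BASE).add(BigInteger.valueOf(digit));
        }
        return value;
    }

    // Step 5: convert a numeric value back into a string of at most width characters.
    static String decode(BigInteger value, int width) {
        int[] digits = new int[width];
        for (int i = width - 1; i >= 0; i--) {
            BigInteger[] qr = value.divideAndRemainder(BASE);
            digits[i] = qr[1].intValue();
            value = qr[0];
        }
        StringBuilder sb = new StringBuilder();
        for (int d : digits) {
            if (d == 0) {
                break; // simplification: cut the string at the first empty position
            }
            sb.append((char) (d - 1 + FIRST));
        }
        return sb.toString();
    }

    // Step 4: divide [min, max] into equal numeric chunks; returns chunks + 1 boundaries.
    static List<String> splitPoints(String min, String max, int chunks) {
        if (chunks < 1) {
            throw new IllegalArgumentException("chunks must be >= 1");
        }
        int width = Math.max(min.length(), max.length());
        BigInteger lo = encode(min, width);
        BigInteger hi = encode(max, width);
        if (lo.compareTo(hi) > 0) {
            throw new IllegalArgumentException("min must not sort after max");
        }
        BigInteger range = hi.subtract(lo);
        List<String> points = new ArrayList<>();
        for (int i = 0; i <= chunks; i++) {
            BigInteger offset = range.multiply(BigInteger.valueOf(i))
                    .divide(BigInteger.valueOf(chunks));
            points.add(decode(lo.add(offset), width));
        }
        return points;
    }
}
```

With these assumptions, CharsetSplitSketch.splitPoints("a", "c", 2) returns the boundaries a, b, c using only the MIN and MAX values, with no further queries against the table.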

Does this PR introduce any user-facing change?

Yes, this PR introduces a new configuration option string_split_mode which can be set to charset_based to enable the new character set-based string splitting algorithm. Users can also specify the string_split_mode_collate parameter to define a specific character collation order. If not specified, the database system's default sorting rule will be used.

Currently, the implementation supports all visible ASCII characters (code points 32-126), which covers most common use cases for string fields typically composed of numbers and letters.
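
To make the supported range concrete, here is a small Java sketch of a check for whether a value falls inside code points 32-126. The helper AsciiRangeCheck.isSplittable is hypothetical and only illustrates the constraint; it is not part of the connector.

```java
// Hypothetical helper, shown only to illustrate the 32-126 constraint.
final class AsciiRangeCheck {
    // Returns true when every character is a visible ASCII character (32-126).
    static boolean isSplittable(String value) {
        if (value == null) {
            return false;
        }
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c < 32 || c > 126) {
                return false; // control character or non-ASCII character
            }
        }
        return true;
    }
}
```

A key such as order_2025-03 passes this check, while a value containing a tab or an accented letter does not.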

Recommendation: set string_split_mode=charset_based when dealing with large datasets that require many partitions and have only string fields available as split keys. In these scenarios, this mode significantly reduces the number of database queries and improves overall performance.

How was this patch tested?

The implementation has been tested with:

  1. Unit tests for the CollationBasedSplitter class to verify the conversion between strings and numeric ranges
  2. Tests with different database systems (MySQL, PostgreSQL, and so on) to verify string-based splitting works correctly
  3. Performance comparison tests between the traditional approach and the new character-based approach

All tests confirm that the algorithm correctly splits string-type data into evenly distributed chunks and provides significant performance improvements for large datasets.

Check list

Member

@Hisoka-X Hisoka-X left a comment

Thanks @jinkachy ! It's a great feature!

@jinkachy
Contributor Author

@Hisoka-X Thank you very much for your review. All the issues you pointed out have been corrected.

Comment on lines 71 to 72
| string_split_mode | String | No | - | When set to "charset_based", enables charset-based string splitting algorithm. The algorithm assumes characters of partition_column are within ASCII range 32-126, which covers most character-based splitting scenarios. |
| string_split_mode_collate | String | No | - | Specifies the collation to use when string_split_mode is set to "charset_based" and the table has a special collation. If not specified, the database's default collation will be used. |
Member

Oh, it dawned on me that it might be better to add the split prefix, just like the other sharding parameters.

Suggested change
- | string_split_mode | String | No | - | When set to "charset_based", enables charset-based string splitting algorithm. The algorithm assumes characters of partition_column are within ASCII range 32-126, which covers most character-based splitting scenarios. |
- | string_split_mode_collate | String | No | - | Specifies the collation to use when string_split_mode is set to "charset_based" and the table has a special collation. If not specified, the database's default collation will be used. |
+ | split.string_split_mode | String | No | sample | Supports different string splitting algorithms. By default, `sample` is used to determine the split by sampling the string value. You can switch to `charset_based` to enable charset-based string splitting algorithm. When set to `charset_based`, the algorithm assumes characters of partition_column are within ASCII range 32-126, which covers most character-based splitting scenarios. |
+ | split.string_split_mode_collate | String | No | - | Specifies the collation to use when string_split_mode is set to `charset_based` and the table has a special collation. If not specified, the database's default collation will be used. |

Contributor Author

you are right, fix done

@@ -44,6 +45,10 @@ public class JdbcSourceConfig implements Serializable {
private int splitInverseSamplingRate;
private boolean decimalTypeNarrowing;

private StringSplitMode stringSplitMode;

private String collate;
Member

Suggested change
- private String collate;
+ private String stringSplitModeCollate;

Contributor Author

thanks, fix done

Member

@Hisoka-X Hisoka-X left a comment

Thanks @jinkachy

@hailin0 hailin0 merged commit dbe41e7 into apache:dev Mar 26, 2025
3 checks passed
3 participants