[Feature][Jdbc] Add String type column split Support by charset-based splitting algorithm #9002
Thanks @jinkachy! It's a great feature!
...ctor-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/source/SplitMode.java
...est/java/org/apache/seatunnel/connectors/seatunnel/jdbc/source/CharsetBasedSplitterTest.java
...or-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/config/JdbcOptions.java
.../src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/source/FixedChunkSplitter.java
...rc/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/source/DynamicChunkSplitter.java
@Hisoka-X Thank you very much for your review. All the issues you pointed out have been fixed.
docs/en/connector-v2/source/Jdbc.md
| string_split_mode | String | No | - | When set to "charset_based", enables charset-based string splitting algorithm. The algorithm assumes characters of partition_column are within ASCII range 32-126, which covers most character-based splitting scenarios. |
| string_split_mode_collate | String | No | - | Specifies the collation to use when string_split_mode is set to "charset_based" and the table has a special collation. If not specified, the database's default collation will be used. |
Oh, it just occurred to me that it would be better to add the `split.` prefix, just like the other sharding parameters.
- | string_split_mode | String | No | - | When set to "charset_based", enables charset-based string splitting algorithm. The algorithm assumes characters of partition_column are within ASCII range 32-126, which covers most character-based splitting scenarios. |
- | string_split_mode_collate | String | No | - | Specifies the collation to use when string_split_mode is set to "charset_based" and the table has a special collation. If not specified, the database's default collation will be used. |
+ | split.string_split_mode | String | No | sample | Supports different string splitting algorithms. By default, `sample` is used to determine the split by sampling the string value. You can switch to `charset_based` to enable charset-based string splitting algorithm. When set to `charset_based`, the algorithm assumes characters of partition_column are within ASCII range 32-126, which covers most character-based splitting scenarios. |
+ | split.string_split_mode_collate | String | No | - | Specifies the collation to use when string_split_mode is set to `charset_based` and the table has a special collation. If not specified, the database's default collation will be used. |
You are right, fixed.
@@ -44,6 +45,10 @@ public class JdbcSourceConfig implements Serializable {
    private int splitInverseSamplingRate;
    private boolean decimalTypeNarrowing;

    private StringSplitMode stringSplitMode;

    private String collate;
-    private String collate;
+    private String stringSplitModeCollate;
Thanks, fixed.
...or-jdbc-e2e/connector-jdbc-e2e-part-5/src/test/resources/jdbc_greenplum_source_and_sink.conf
Thanks @jinkachy
Purpose of this pull request
This PR introduces a new character-based splitting algorithm for JDBC connectors when dealing with string-type columns. The traditional approach for splitting string-type data relies on database limit queries or mod hash operations, which can be inefficient for large datasets. The new algorithm uses character set ordering for more efficient splitting, eliminating the need for multiple database limit queries when MIN and MAX values are already known.
The core algorithm works as follows (org.apache.seatunnel.connectors.seatunnel.jdbc.source.CollationBasedSplitter):
This approach produces evenly distributed string splits without requiring additional database queries, significantly improving performance for large datasets.
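The idea described above can be illustrated with a minimal sketch. This is not the PR's `CollationBasedSplitter` code: the class and method names (`CharsetSplitSketch`, `encode`, `decode`, `boundaries`) are hypothetical, and the sketch assumes a binary (code-point) ordering, that every character lies in the ASCII range 32-126, and that MIN and MAX are already known. Each string is read as a base-95 number, the numeric range is divided evenly, and the boundary numbers are converted back to strings.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the implementation from this PR.
final class CharsetSplitSketch {
    static final int MIN_CHAR = 32;   // ' ' (lowest supported code point)
    static final int MAX_CHAR = 126;  // '~' (highest supported code point)
    static final BigInteger BASE = BigInteger.valueOf(MAX_CHAR - MIN_CHAR + 1); // 95

    // Reads s as a base-95 number with exactly `width` digits,
    // right-padding short strings with the smallest character.
    static BigInteger encode(String s, int width) {
        BigInteger value = BigInteger.ZERO;
        for (int i = 0; i < width; i++) {
            int c = i < s.length() ? s.charAt(i) : MIN_CHAR;
            if (c < MIN_CHAR || c > MAX_CHAR) {
                throw new IllegalArgumentException("character out of supported range: " + c);
            }
            value = value.multiply(BASE).add(BigInteger.valueOf(c - MIN_CHAR));
        }
        return value;
    }

    // Inverse of encode: turns a number back into a string of `width` characters.
    static String decode(BigInteger value, int width) {
        char[] chars = new char[width];
        for (int i = width - 1; i >= 0; i--) {
            BigInteger[] quotientAndRemainder = value.divideAndRemainder(BASE);
            chars[i] = (char) (quotientAndRemainder[1].intValue() + MIN_CHAR);
            value = quotientAndRemainder[0];
        }
        return new String(chars);
    }

    // Returns chunkCount + 1 boundary strings, evenly spaced in the numeric domain.
    static List<String> boundaries(String min, String max, int chunkCount) {
        if (chunkCount < 1) {
            throw new IllegalArgumentException("chunkCount must be at least 1");
        }
        int width = Math.max(min.length(), max.length());
        BigInteger low = encode(min, width);
        BigInteger span = encode(max, width).subtract(low);
        List<String> result = new ArrayList<>();
        for (int i = 0; i <= chunkCount; i++) {
            BigInteger offset = span.multiply(BigInteger.valueOf(i))
                    .divide(BigInteger.valueOf(chunkCount));
            result.add(decode(low.add(offset), width));
        }
        return result;
    }
}
```

For example, `CharsetSplitSketch.boundaries("aa", "zz", 5)` yields the boundaries `aa`, `ff`, `kk`, `pp`, `uu`, `zz` without issuing any query. The boundaries are only meaningful if the database compares strings in the same order as the sketch does, which is presumably why the PR also exposes a collation option.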
Does this PR introduce any user-facing change?
Yes. This PR introduces a new configuration option, `string_split_mode`, which can be set to `charset_based` to enable the new charset-based string splitting algorithm. Users can also set the `string_split_mode_collate` parameter to specify the collation to use. If it is not specified, the database's default collation is used.
Currently, the implementation supports all printable ASCII characters (code points 32-126), which covers the common case of string fields composed of digits and letters.
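Because the algorithm assumes code points 32-126, it can be useful to check a sample of split-key values before enabling the mode. The helper below is a hypothetical sketch for that purpose (the name `AsciiRangeCheck` is an assumption and it is not part of this PR); it only tests whether every character of a value falls inside the supported range.

```java
// Hypothetical pre-check; not part of this PR.
final class AsciiRangeCheck {
    // Returns true when every character of value is a printable ASCII
    // character, i.e. its code point lies in the range 32-126.
    static boolean isSupported(String value) {
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c < 32 || c > 126) {
                return false;
            }
        }
        return true;
    }
}
```

A key such as `order-2024_A1` passes the check, while a value containing a tab or an accented letter does not, so a column holding such values falls outside the assumption stated above.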
Recommendation: set `string_split_mode=charset_based` when dealing with large datasets that require many partitions and have only string fields available as split keys. In these scenarios, this mode significantly reduces the number of database queries and improves overall performance.
How was this patch tested?
The implementation has been tested with:
- Tests of the `CollationBasedSplitter` class that verify the conversion between strings and numeric ranges

All tests confirm that the algorithm correctly splits string-type data into evenly distributed chunks and provides significant performance improvements for large datasets.