feat: mysql chunking optimisation #797
Merged
saksham-datazip commented Jan 27, 2026
saksham-datazip (Collaborator, Author) commented Feb 3, 2026: self review
vikaxsh reviewed Apr 2, 2026
vikaxsh reviewed Apr 21, 2026
vaibhav-datazip approved these changes Apr 24, 2026
vikaxsh approved these changes Apr 24, 2026
Description
This PR improves the MySQL chunking strategy with the primary goal of significantly reducing chunk generation time for large tables during backfill.
To achieve this, two mathematical chunking strategies were introduced based on the primary key type, replacing repeated database-based chunk discovery.
Numeric Primary Keys
The numeric range [min, max] is divided using an arithmetic progression to generate evenly spaced chunk boundaries. This allows chunk boundaries to be computed mathematically instead of relying on repeated database lookups, significantly reducing chunking time.
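The arithmetic-progression idea can be illustrated with a minimal sketch. The `Chunk` type and `splitNumericRange` function below are hypothetical names chosen for illustration and are not taken from this PR; the sketch also assumes integer keys whose range does not overflow `int64`.

```go
package main

import "fmt"

// Chunk is a hypothetical half-open key range [Min, Max).
type Chunk struct {
	Min, Max int64
}

// splitNumericRange divides the inclusive range [minVal, maxVal] into at most
// n chunks whose boundaries form an arithmetic progression. No database
// lookups are needed: only minVal and maxVal must be known up front.
// Assumption: maxVal-minVal+1 and maxVal+1 do not overflow int64.
func splitNumericRange(minVal, maxVal, n int64) []Chunk {
	if n <= 0 || maxVal < minVal {
		return nil
	}
	span := maxVal - minVal + 1
	step := (span + n - 1) / n // ceil(span / n), so step >= 1
	chunks := make([]Chunk, 0, n)
	for lo := minVal; lo <= maxVal; lo += step {
		hi := lo + step
		if hi > maxVal+1 {
			hi = maxVal + 1 // clamp the last chunk to the end of the range
		}
		chunks = append(chunks, Chunk{Min: lo, Max: hi})
	}
	return chunks
}

func main() {
	for _, c := range splitNumericRange(1, 10, 3) {
		fmt.Printf("[%d, %d)\n", c.Min, c.Max)
	}
}
```

Because each chunk starts where the previous one ends, the ranges are contiguous and non-overlapping by construction. Evenly spaced boundaries only give evenly sized chunks when key values are roughly uniformly distributed.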
String Primary Keys
String values are mapped into a numeric space using Readable Unicode encoding (big.Int) and then split into balanced ranges. These candidate boundaries are then aligned with actual database values using distinct collation-aware queries to maintain correct ordering.
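As a rough illustration of mapping strings into a numeric space, the sketch below treats each rune as one digit in a base equal to the size of the Unicode code point space. This is an assumed encoding for illustration only; the PR's actual "Readable Unicode" encoding is not shown in this description, and `encode`, `decode`, and `candidates` are hypothetical names.

```go
package main

import (
	"fmt"
	"math/big"
	"strings"
)

// base is the number of Unicode code points; each rune is one "digit".
// Assumption: this is an illustrative encoding, not the one used in the PR.
var base = big.NewInt(0x110000)

// encode maps the first width runes of s to a big.Int, padding short strings
// with zero digits so that all values share the same number of digits.
func encode(s string, width int) *big.Int {
	runes := []rune(s)
	n := new(big.Int)
	for i := 0; i < width; i++ {
		n.Mul(n, base)
		if i < len(runes) {
			n.Add(n, big.NewInt(int64(runes[i])))
		}
	}
	return n
}

// decode is the inverse of encode; trailing zero digits are trimmed.
func decode(n *big.Int, width int) string {
	runes := make([]rune, width)
	rem := new(big.Int).Set(n)
	digit := new(big.Int)
	for i := width - 1; i >= 0; i-- {
		rem.DivMod(rem, base, digit)
		runes[i] = rune(digit.Int64())
	}
	return strings.TrimRight(string(runes), "\x00")
}

// candidates returns n-1 evenly spaced interior boundaries between minKey
// and maxKey in the numeric space.
func candidates(minKey, maxKey string, width, n int) []string {
	lo, hi := encode(minKey, width), encode(maxKey, width)
	diff := new(big.Int).Sub(hi, lo)
	out := make([]string, 0, n-1)
	for i := 1; i < n; i++ {
		t := new(big.Int).Mul(diff, big.NewInt(int64(i)))
		t.Quo(t, big.NewInt(int64(n)))
		t.Add(t, lo)
		out = append(out, decode(t, width))
	}
	return out
}

func main() {
	fmt.Printf("%q\n", candidates("a", "c", 1, 2))
}
```

The numeric order here follows code point order, which need not match the order MySQL uses under a given collation, and interpolated values may not be meaningful strings at all. Candidates like these are therefore only starting points that still have to be aligned with real key values.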
These strategies substantially reduce the number of database round trips required for chunk discovery, resulting in faster chunk generation and improved performance for large datasets.
As part of this work, several edge cases in chunk boundary calculation were also addressed, particularly around MySQL collation-aware ordering for string primary keys. The implementation aligns generated boundaries with actual database values using collation-aware queries, ensuring correct range generation and preventing missing or overlapping chunks.
Additionally, a small compatibility fix was introduced in refractor.go: an additional []uint8 case was added in ReformatInt64 so that values the driver returns as byte slices are properly parsed and converted to int64. This ensures consistent behavior regardless of how the query result is returned by the driver.
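A type switch with a `[]uint8` case might look like the following minimal sketch. The function `toInt64` is a hypothetical stand-in and does not reproduce the actual `ReformatInt64` implementation; it assumes the byte slice holds a base-10 integer in text form.

```go
package main

import (
	"fmt"
	"strconv"
)

// toInt64 converts a scanned column value to int64.
// Hypothetical helper for illustration; not the PR's ReformatInt64.
func toInt64(v any) (int64, error) {
	switch x := v.(type) {
	case int64:
		return x, nil
	case int32:
		return int64(x), nil
	case int:
		return int64(x), nil
	case []uint8:
		// Assumption: the bytes are the decimal text of the number,
		// e.g. []byte("12345").
		return strconv.ParseInt(string(x), 10, 64)
	case string:
		return strconv.ParseInt(x, 10, 64)
	default:
		return 0, fmt.Errorf("cannot convert %T to int64", v)
	}
}

func main() {
	n, err := toInt64([]uint8("12345"))
	fmt.Println(n, err)
}
```

Handling `[]uint8` explicitly means the caller gets the same `int64` whether a value arrives as a native integer or as raw bytes.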
Type of change
How Has This Been Tested?
Tested MySQL chunking with INT32 primary keys
Tested MySQL chunking with INT64 primary keys
Tested MySQL chunking with FLOAT / DOUBLE primary keys
Verified no data loss or overlap across chunk boundaries
Tested on different kinds of string PKs for full refresh and CDC
Confirmed performance improvement on large datasets
Performance Stats (Different PK Types)
The following stats.json outputs were collected from runs on different MySQL tables, each containing 10M records, using different primary key types.
🔢 Table with INT32 Primary Key
🔣 Table with FLOAT64 Primary Key
Screenshots or Recordings
https://datazip.atlassian.net/wiki/x/AYCVDg
Documentation
Related PR's (If Any):
N/A