refactor(chunking): refactor MySQL helpers for unit testing by vimla01 · Pull Request #975 · datazip-inc/olake

vimla01 · 2026-06-06T22:48:01Z

Description

Refactored the MySQL chunking flow in GetOrSplitChunks into smaller helper functions so the chunk sizing, chunk column selection, chunk bound selection, and split strategy routing can be unit tested independently.

This keeps the existing behavior the same, but makes the chunking code easier to test and reason about.

Fixes #928

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Ran focused MySQL backfill unit tests:
go test ./drivers/mysql/internal -run 'Test(Chunk|LimitOffset|SplitEvenly|PrimaryKey)'

Screenshots or Recordings

olake_928.webm

Documentation

N/A (bug fix, refactor, or test changes only)

Related PR's (If Any):

N/A

vimla01 · 2026-06-06T22:55:34Z

@nayanj98 could you please take a look at this PR?
I have attached a video showcasing the changes and a successful sync with Olake (both Iceberg and Parquet).

nayanj98 · 2026-06-08T06:06:06Z

@vimla01 Thanks a lot. I will assign a reviewer for this PR soon.

nayanj98 · 2026-06-10T05:54:10Z

@vimla01 Assigning @vaibhav-datazip to review your PR

vaibhav-datazip · 2026-06-13T21:37:38Z

-	tableStatsQuery := jdbc.MySQLTableStatsQuery()
-	err := m.client.QueryRowContext(ctx, tableStatsQuery, stream.Name()).Scan(&approxRowCount, &avgRowSize, &approxTableSize)
+	stats, err := m.fetchTableStats(ctx, stream)


I don't think we need to abstract every piece of logic into its own function, since most of these are very basic. Primarily, we need to move the chunking strategies into separate functions so that their logic can be tested.

You can go through the MSSQL driver code once; it has all the chunking strategies separated.

You can also keep some of the logic in separate functions. After the changes, add a note in the PR description or a separate comment explaining why you think that logic needs its own function.

Updated this to keep the basic metadata and orchestration logic inside GetOrSplitChunks. Only the actual chunking strategies and reusable boundary calculations remain separate.
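For illustration, here is a minimal sketch of why a chunking strategy is easier to test once it is a pure function. The name `splitPoints` and its signature are hypothetical and not taken from this PR; the sketch only assumes an arithmetic split over an integer key, and it treats a key range too wide for int64 as "not eligible" by returning nil.

```go
package main

import "fmt"

// splitPoints is a hypothetical helper (not from the PR). It returns the
// interior boundaries min+step, min+2*step, ... that are strictly below max.
// It touches no database, so it can be exercised directly in a unit test.
func splitPoints(min, max, step int64) []int64 {
	// max-min < 0 with min < max means the subtraction wrapped around,
	// i.e. the range does not fit in int64; report "no arithmetic split".
	if step <= 0 || min >= max || max-min < 0 {
		return nil
	}
	var points []int64
	// Loop invariant: p >= min, so max-p cannot overflow, and p+step < max.
	for p := min; step < max-p; p += step {
		points = append(points, p+step)
	}
	return points
}

func main() {
	fmt.Println(splitPoints(0, 10, 3)) // boundaries for [0,3) [3,6) [6,9) [9,10]
	fmt.Println(splitPoints(5, 5, 1))  // empty: nothing to split
}
```

A caller would turn consecutive boundaries into chunk ranges and fall back to another strategy when the result is nil.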

vaibhav-datazip · 2026-06-13T22:08:31Z

+	chunks := types.NewSet[types.Chunk]()
+	chunks.Insert(types.Chunk{
+		Min: nil,
+		Max: utils.ConvertToString(chunkSize),
+	})
+	lastChunk := chunkSize
+	for lastChunk < approxRowCount {


Earlier, the chunking process ran inside a transaction with an isolation level. It would be better to go through what was happening earlier and why you have removed it now.

I checked the previous flow and restored the isolation wrapper for the limit-offset chunking path, so the behavior remains the same as before.
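The loop shape visible in the diff above can be sketched in isolation as follows. This is a stand-alone approximation, not the driver code: the `chunk` type stands in for `types.Chunk`, `bound` stands in for `utils.ConvertToString`, and the open-ended final chunk is an assumption, since the diff excerpt stops at the loop header.

```go
package main

import (
	"fmt"
	"strconv"
)

// chunk is a hypothetical stand-in for types.Chunk; a nil bound is open-ended.
type chunk struct {
	Min, Max *string
}

// bound converts an offset to a string pointer (stand-in for utils.ConvertToString).
func bound(v int64) *string {
	s := strconv.FormatInt(v, 10)
	return &s
}

// limitOffsetChunks builds offset ranges of chunkSize rows. The first chunk has
// no lower bound; the last chunk has no upper bound (assumed here) so rows
// beyond the approximate row count are still covered.
func limitOffsetChunks(approxRowCount, chunkSize int64) []chunk {
	if chunkSize <= 0 {
		return nil
	}
	chunks := []chunk{{Min: nil, Max: bound(chunkSize)}}
	last := chunkSize
	for last < approxRowCount {
		next := last + chunkSize
		if next < last { // int64 overflow: stop and let the open-ended chunk cover the rest
			break
		}
		chunks = append(chunks, chunk{Min: bound(last), Max: bound(next)})
		last = next
	}
	return append(chunks, chunk{Min: bound(last), Max: nil})
}

func main() {
	for _, c := range limitOffsetChunks(25, 10) {
		min, max := "nil", "nil"
		if c.Min != nil {
			min = *c.Min
		}
		if c.Max != nil {
			max = *c.Max
		}
		fmt.Printf("[%s, %s)\n", min, max)
	}
}
```

Because the function only computes offsets, the isolation wrapper discussed here matters for the queries that later read each range, not for the arithmetic itself.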

vaibhav-datazip · 2026-06-13T22:16:32Z

@@ -0,0 +1,166 @@
+package driver
+


Unit tests for some functions are missing in this file. It would be better to think of edge cases for those, as well as for the functions that are already covered, and add them.

Added more focused unit tests for the extracted chunking helpers and existing utilities. I covered edge cases around numeric chunk ranges/overflow, limit-offset boundaries, composite primary key args, string boundary generation, charset encoding, padding, condensing, and strategy eligibility.
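As a rough illustration of the edge-case style described above, the following sketch runs a table of cases against a small helper. Both `chunkCount` and the case list are hypothetical and not part of this PR; in the repository such a table would normally live in a `_test.go` file and use `t.Run`, whereas this version runs as a plain program so it stays self-contained.

```go
package main

import (
	"fmt"
	"math"
)

// chunkCount is a hypothetical helper: the number of chunks of `size` rows
// needed to cover `rows` rows.
func chunkCount(rows, size int64) (int64, error) {
	if size <= 0 {
		return 0, fmt.Errorf("chunk size must be positive, got %d", size)
	}
	if rows <= 0 {
		return 0, nil
	}
	// Ceiling division written as (rows-1)/size+1 so rows+size-1 cannot overflow.
	return (rows-1)/size + 1, nil
}

type chunkCountCase struct {
	name    string
	rows    int64
	size    int64
	want    int64
	wantErr bool
}

func chunkCountCases() []chunkCountCase {
	return []chunkCountCase{
		{name: "empty table", rows: 0, size: 100, want: 0},
		{name: "exact multiple", rows: 200, size: 100, want: 2},
		{name: "one extra row", rows: 201, size: 100, want: 3},
		{name: "chunk larger than table", rows: 5, size: 100, want: 1},
		{name: "largest row count", rows: math.MaxInt64, size: 1, want: math.MaxInt64},
		{name: "zero chunk size", rows: 10, size: 0, wantErr: true},
	}
}

// failures runs every case and returns a description of each mismatch.
func failures() []string {
	var out []string
	for _, tc := range chunkCountCases() {
		got, err := chunkCount(tc.rows, tc.size)
		if (err != nil) != tc.wantErr {
			out = append(out, fmt.Sprintf("%s: unexpected error state: %v", tc.name, err))
			continue
		}
		if got != tc.want {
			out = append(out, fmt.Sprintf("%s: got %d, want %d", tc.name, got, tc.want))
		}
	}
	return out
}

func main() {
	if f := failures(); len(f) > 0 {
		for _, msg := range f {
			fmt.Println("FAIL:", msg)
		}
		return
	}
	fmt.Println("all cases passed")
}
```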

vaibhav-datazip · 2026-06-17T20:34:18Z

-		// 1. Try Numeric Strategy
-		numericChunkBounds = isNumericAndEvenDistributed(minVal, maxVal, approxRowCount, chunkSize, dataType)

-		// 2. If not numeric, check for supported String strategy
+		// Prefer an arithmetic split for evenly distributed numeric keys.
+		numericChunkBounds = isNumericAndEvenDistributed(minVal, maxVal, approxRowCount, chunkSize, dataType)


Can you restore this? The comments can be preserved, and there doesn't seem to be any structural or logical change.

vaibhav-datazip · 2026-06-17T20:40:58Z

-			})
+	switch {
+	case numericChunkBounds != nil:
+		logger.Infof("Using splitEvenlyForInt Method for stream %s", stream.ID())


Can you check the other drivers as well for where the chunking-strategy log is placed and whether debug is used instead of info, and make the strategy log consistent across drivers?

Checked the other drivers as well. Oracle already uses debug for chunking strategy selection, and I updated MySQL and MongoDB strategy-selection logs from Infof to Debugf for consistency.

vaibhav-datazip · 2026-06-17T20:50:32Z

+	default:
+		logger.Infof("Falling back to limit offset method for stream %s", stream.ID())
+		var chunks *types.Set[types.Chunk]
+		err := jdbc.WithIsolation(ctx, m.client, true, func(_ *sql.Tx) error {


Shouldn't the isolation level be handled inside the function?

Addressed. The isolation is kept inside splitViaPrimaryKey, where the DB reads happen.

vaibhav-datazip · 2026-06-17T20:53:55Z

-		stringChunkStepSize := new(big.Int).Sub(stringChunkBounds.maxEncodedBigIntValue, stringChunkBounds.minEncodedBigIntValue)
-		stringChunkStepSize.Add(stringChunkStepSize, new(big.Int).Sub(big.NewInt(expectedChunks), big.NewInt(1)))
-		stringChunkStepSize.Div(stringChunkStepSize, big.NewInt(expectedChunks))
-


Even stringChunkStepSize is used in only one place. Can we compute it inline there instead of defining a function for it?

Addressed. Removed the separate stringChunkStepSize helper and kept the step-size calculation inline at the single usage site.

vaibhav-datazip · 2026-06-18T06:12:13Z

@vimla01 once you are done with your changes after the review, please re-request the review from here so that I know the PR is ready again.

refactor mysql chunking helpers

e6001f9

vimla01 requested a deployment to integration_tests June 6, 2026 22:48 — with GitHub Actions Waiting

nayanj98 requested a review from vaibhav-datazip June 10, 2026 05:54

Merge branch 'staging' into issue-928-staging

b2c99e8

vaibhav-datazip temporarily deployed to integration_tests June 11, 2026 11:54 — with GitHub Actions Inactive

vaibhav-datazip reviewed Jun 13, 2026

Merge branch 'staging' into issue-928-staging

5ba5cd7

vaibhav-datazip requested a deployment to integration_tests June 13, 2026 23:00 — with GitHub Actions Waiting

vimla01 added 2 commits June 14, 2026 14:18

refine mysql chunking strategies

164d7c4

expand mysql chunking tests

7aa2236

vimla01 requested a deployment to integration_tests June 14, 2026 08:57 — with GitHub Actions Waiting

Merge branch 'staging' into issue-928-staging

2641f5c

vaibhav-datazip requested a deployment to integration_tests June 17, 2026 20:28 — with GitHub Actions Waiting

vaibhav-datazip reviewed Jun 17, 2026

fix: address chunking review comments

ec02ce2

vimla01 requested a deployment to integration_tests June 18, 2026 06:27 — with GitHub Actions Waiting

vimla01 requested a review from vaibhav-datazip June 18, 2026 06:40

3 participants