refactor(chunking): refactor MySQL helpers for unit testing #975

Open
vimla01 wants to merge 7 commits into datazip-inc:staging from vimla01:issue-928-staging

Conversation

vimla01 commented Jun 6, 2026
Contributor

Description

Refactored the MySQL chunking flow in GetOrSplitChunks into smaller helper functions so the chunk sizing, chunk column selection, chunk bound selection, and split strategy routing can be unit tested independently.

This keeps the existing behavior the same, but makes the chunking code easier to test and reason about.

Fixes #928

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Ran focused MySQL backfill unit tests:
    go test ./drivers/mysql/internal -run 'Test(Chunk|LimitOffset|SplitEvenly|PrimaryKey)'

Screenshots or Recordings

olake_928.webm

Documentation

  • N/A (bug fix, refactor, or test changes only)

Related PRs (If Any):

N/A

@vimla01 vimla01 requested a deployment to integration_tests June 6, 2026 22:48 — with GitHub Actions Waiting

vimla01 commented Jun 6, 2026

Contributor Author

@nayanj98 Could you please take a look at this PR?
I have attached a video showcasing the changes and a successful sync with Olake (both Iceberg and Parquet).

nayanj98 commented Jun 8, 2026

Collaborator

@vimla01 Thanks a lot. I will assign a reviewer for this PR soon.

@nayanj98

Collaborator

@vimla01 Assigning @vaibhav-datazip to review your PR

@nayanj98 nayanj98 requested a review from vaibhav-datazip June 10, 2026 05:54
Comment thread drivers/mysql/internal/backfill.go Outdated
Comment on lines +74 to +65
tableStatsQuery := jdbc.MySQLTableStatsQuery()
err := m.client.QueryRowContext(ctx, tableStatsQuery, stream.Name()).Scan(&approxRowCount, &avgRowSize, &approxTableSize)
stats, err := m.fetchTableStats(ctx, stream)

Collaborator

I don't think we need to abstract every piece of logic into its own function, as these are very basic implementations. Primarily, we need to extract the chunking strategies into separate functions so that their logic can be tested.

You can go through the MSSQL driver code once; it has all the chunking strategies separated.

Also, you can keep some of the logic in separate functions. After the changes, add a note in the PR description or in a separate comment explaining why you think that logic needs a separate function.

Contributor Author

Updated this to keep the basic metadata and orchestration logic inside GetOrSplitChunks. Only the actual chunking strategies and reusable boundary calculations remain separate.

Comment thread drivers/mysql/internal/backfill.go
Comment thread drivers/mysql/internal/backfill.go
Comment thread drivers/mysql/internal/backfill.go Outdated
Comment on lines +278 to +284
chunks := types.NewSet[types.Chunk]()
chunks.Insert(types.Chunk{
	Min: nil,
	Max: utils.ConvertToString(chunkSize),
})
lastChunk := chunkSize
for lastChunk < approxRowCount {

Collaborator

The chunking process previously ran under an isolation level. It would be better to go through what was happening earlier and explain why you have now removed it.

Contributor Author

I checked the previous flow and restored the isolation wrapper for the limit-offset chunking path, so the behavior remains the same as before.
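
For readers following this thread, the limit-offset fragment quoted above can be read roughly as the sketch below. This is only an illustration under assumptions: `offsetChunk` and `limitOffsetBounds` are hypothetical names (a stand-in for `types.Chunk` and for whatever the PR's helper is called), the loop body and the final open-ended chunk are guesses at the truncated part of the diff, and the isolation wrapper is left out because it needs a live database.

```go
package main

import (
	"fmt"
	"strconv"
)

// offsetChunk is a hypothetical stand-in for types.Chunk.
// An empty string means "unbounded" on that side.
type offsetChunk struct {
	Min, Max string
}

// limitOffsetBounds builds offset ranges of chunkSize rows until the
// approximate row count is covered, then appends an open-ended chunk
// so rows beyond the estimate are still read (an assumption).
// It does not guard against int64 overflow of the running offset.
func limitOffsetBounds(chunkSize, approxRowCount int64) []offsetChunk {
	if chunkSize <= 0 {
		return nil
	}
	chunks := []offsetChunk{{Min: "", Max: strconv.FormatInt(chunkSize, 10)}}
	last := chunkSize
	for last < approxRowCount {
		next := last + chunkSize
		chunks = append(chunks, offsetChunk{
			Min: strconv.FormatInt(last, 10),
			Max: strconv.FormatInt(next, 10),
		})
		last = next
	}
	return append(chunks, offsetChunk{Min: strconv.FormatInt(last, 10), Max: ""})
}

func main() {
	for _, c := range limitOffsetBounds(100, 250) {
		fmt.Printf("min=%q max=%q\n", c.Min, c.Max)
	}
}
```

Keeping the boundary arithmetic in a pure function like this is what makes it unit-testable without a database; the transaction and isolation handling can then wrap only the part that actually queries MySQL.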

@@ -0,0 +1,166 @@
package driver

Collaborator

Unit tests for some functions are missing in this file. It would be better if you could think of edge cases for those, as well as for the functions that are already covered, and add them.

Contributor Author

Added more focused unit tests for the extracted chunking helpers and existing utilities. I covered edge cases around numeric chunk ranges/overflow, limit-offset boundaries, composite primary key args, string boundary generation, charset encoding, padding, condensing, and strategy eligibility.

@vimla01 vimla01 requested a deployment to integration_tests June 14, 2026 08:57 — with GitHub Actions Waiting
Comment on lines -135 to +150
// 1. Try Numeric Strategy
numericChunkBounds = isNumericAndEvenDistributed(minVal, maxVal, approxRowCount, chunkSize, dataType)

// 2. If not numeric, check for supported String strategy
// Prefer an arithmetic split for evenly distributed numeric keys.
numericChunkBounds = isNumericAndEvenDistributed(minVal, maxVal, approxRowCount, chunkSize, dataType)

Collaborator

Can you restore this as well? The comments can be preserved, and there doesn't seem to be any structural or logical change.

Contributor Author

Restored.
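
As background for the numeric strategy mentioned in this thread, an arithmetic split over an evenly distributed integer key might look like the sketch below. It is not the body of `isNumericAndEvenDistributed` or of the PR's split helper; `evenSplitBounds` and its parameters are hypothetical, and the even-distribution check itself is not shown.

```go
package main

import "fmt"

// evenSplitBounds returns the interior boundaries that divide
// [minVal, maxVal] into roughly approxRowCount/chunkSize ranges.
// Hypothetical helper: it assumes the key is dense enough that
// equal-width ranges hold a similar number of rows.
func evenSplitBounds(minVal, maxVal, approxRowCount, chunkSize int64) []int64 {
	if chunkSize <= 0 || approxRowCount <= 0 || maxVal < minVal {
		return nil
	}
	span := maxVal - minVal + 1
	if span <= 0 {
		// max-min+1 wrapped around int64; a real helper would need big.Int here.
		return nil
	}
	parts := (approxRowCount + chunkSize - 1) / chunkSize // ceil(rows / chunkSize)
	step := (span + parts - 1) / parts                    // ceil(span / parts)
	var bounds []int64
	for b := minVal + step; b <= maxVal; b += step {
		bounds = append(bounds, b)
		if b > maxVal-step {
			break // the next boundary would pass maxVal (and could overflow)
		}
	}
	return bounds
}

func main() {
	fmt.Println(evenSplitBounds(1, 100, 100, 25)) // prints [26 51 76]
}
```

The three boundaries above describe four ranges: below 26, 26 to 50, 51 to 75, and 76 upward.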

Comment thread drivers/mysql/internal/backfill.go Outdated
})
switch {
case numericChunkBounds != nil:
logger.Infof("Using splitEvenlyForInt Method for stream %s", stream.ID())

Collaborator

Can you check the other drivers as well for where the chunking-strategy log is placed and whether debug is used instead of info, and make the strategy logging consistent?

Contributor Author

Checked the other drivers as well. Oracle already uses debug for chunking strategy selection, and I updated MySQL and MongoDB strategy-selection logs from Infof to Debugf for consistency.

Comment thread drivers/mysql/internal/backfill.go Outdated
default:
logger.Infof("Falling back to limit offset method for stream %s", stream.ID())
var chunks *types.Set[types.Chunk]
err := jdbc.WithIsolation(ctx, m.client, true, func(_ *sql.Tx) error {

Collaborator

Shouldn't the isolation level be handled inside the function?

Contributor Author

Addressed. The isolation is kept inside splitViaPrimaryKey, where the DB reads happen.

Comment on lines -297 to -300
stringChunkStepSize := new(big.Int).Sub(stringChunkBounds.maxEncodedBigIntValue, stringChunkBounds.minEncodedBigIntValue)
stringChunkStepSize.Add(stringChunkStepSize, new(big.Int).Sub(big.NewInt(expectedChunks), big.NewInt(1)))
stringChunkStepSize.Div(stringChunkStepSize, big.NewInt(expectedChunks))

Collaborator

Even stringChunkStepSize is used in only one place. Can we compute it there instead of defining a function for it?

Contributor Author

Addressed. Removed the separate stringChunkStepSize helper and kept the step-size calculation inline at the single usage site.
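
For reference, the three lines quoted in this thread are a ceiling division on `math/big` values: add `expectedChunks - 1` to the range before dividing. A minimal standalone sketch of that idiom follows; `ceilStep` and `ceilStepInt64` are hypothetical names used only for illustration, and the sketch assumes max is not less than min and the chunk count is at least 1.

```go
package main

import (
	"fmt"
	"math/big"
)

// ceilStep returns ceil((maxVal - minVal) / parts) by adding parts-1
// before the integer division. Assumes maxVal >= minVal and parts >= 1.
func ceilStep(minVal, maxVal *big.Int, parts int64) *big.Int {
	step := new(big.Int).Sub(maxVal, minVal)
	step.Add(step, big.NewInt(parts-1))
	return step.Div(step, big.NewInt(parts))
}

// ceilStepInt64 is a convenience wrapper for small inputs,
// returning the step as a decimal string.
func ceilStepInt64(minVal, maxVal, parts int64) string {
	return ceilStep(big.NewInt(minVal), big.NewInt(maxVal), parts).String()
}

func main() {
	fmt.Println(ceilStepInt64(0, 10, 3)) // prints 4
	fmt.Println(ceilStepInt64(0, 9, 3))  // prints 3
}
```

Rounding up means the computed step times the chunk count always covers the whole encoded range, at the cost of a slightly smaller final chunk.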

@vaibhav-datazip

Collaborator

@vimla01 Once you are done with your changes after the review, please re-request the review from here so that I know the PR is ready again.
Screenshot 2026-06-18 at 11 42 03 AM

@vimla01 vimla01 requested a deployment to integration_tests June 18, 2026 06:27 — with GitHub Actions Waiting
@vimla01 vimla01 requested a review from vaibhav-datazip June 18, 2026 06:40

3 participants