refactor(chunking): refactor MySQL helpers for unit testing#975
refactor(chunking): refactor MySQL helpers for unit testing#975vimla01 wants to merge 7 commits into
Conversation
|
@nayanj98 could you please take a look at this PR? |
|
@vimla01 Thanks a lot will assign a reviewer for this PR soon |
|
@vimla01 Assigning @vaibhav-datazip to review your PR |
| tableStatsQuery := jdbc.MySQLTableStatsQuery() | ||
| err := m.client.QueryRowContext(ctx, tableStatsQuery, stream.Name()).Scan(&approxRowCount, &avgRowSize, &approxTableSize) | ||
| stats, err := m.fetchTableStats(ctx, stream) |
There was a problem hiding this comment.
I don't think we need to abstract out every single logic into a function as these are too basic implementation , primarily we need to abstract out the chunking strategies into separate functions so that those logic be tested.
You can go through mssql driver code once, that has all the chunking strategies separated.
also, you can keep some of the logic in separate functions , after the changes you can add comment on pr description or separate comment, why you think the logic needs a separate function.
There was a problem hiding this comment.
Updated this to keep the basic metadata and orchestration logic inside GetOrSplitChunks. Only the actual chunking strategies and reusable boundary calculations remain separate
| chunks := types.NewSet[types.Chunk]() | ||
| chunks.Insert(types.Chunk{ | ||
| Min: nil, | ||
| Max: utils.ConvertToString(chunkSize), | ||
| }) | ||
| lastChunk := chunkSize | ||
| for lastChunk < approxRowCount { |
There was a problem hiding this comment.
the chunking process earlier used to take place in an isolation level , will be better if you go through what was happening earlier and why that has been remove by you now
There was a problem hiding this comment.
i checked the previous flow and restored the isolation wrapper for the limit-offset chunking path so the behavior remains the same as before
| @@ -0,0 +1,166 @@ | |||
| package driver | |||
|
|
|||
There was a problem hiding this comment.
in this file unit tests for some functions are missing , will be better if you can think of edge test cases in those as well as already present functions and add them .
There was a problem hiding this comment.
Added more focused unit tests for the extracted chunking helpers and existing utilities. I covered edge cases around numeric chunk ranges/overflow, limit-offset boundaries, composite primary key args, string boundary generation, charset encoding, padding, condensing, and strategy eligibility.
| // 1. Try Numeric Strategy | ||
| numericChunkBounds = isNumericAndEvenDistributed(minVal, maxVal, approxRowCount, chunkSize, dataType) | ||
|
|
||
| // 2. If not numeric, check for supported String strategy | ||
| // Prefer an arithmetic split for evenly distributed numeric keys. | ||
| numericChunkBounds = isNumericAndEvenDistributed(minVal, maxVal, approxRowCount, chunkSize, dataType) |
There was a problem hiding this comment.
can you restore this as well as the comments can be preserved and there dosent seem to be any structural/logical change
| }) | ||
| switch { | ||
| case numericChunkBounds != nil: | ||
| logger.Infof("Using splitEvenlyForInt Method for stream %s", stream.ID()) |
There was a problem hiding this comment.
can you check other drivers as well for the log placement of chunking strategy being used, and debug being used instead of info , and make it consistent for strategy being used
There was a problem hiding this comment.
Checked the other drivers as well. Oracle already uses debug for chunking strategy selection, and I updated MySQL and MongoDB strategy-selection logs from Infof to Debugf for consistency.
| default: | ||
| logger.Infof("Falling back to limit offset method for stream %s", stream.ID()) | ||
| var chunks *types.Set[types.Chunk] | ||
| err := jdbc.WithIsolation(ctx, m.client, true, func(_ *sql.Tx) error { |
There was a problem hiding this comment.
shouldn't the isolation level be done inside the function
There was a problem hiding this comment.
Addressed. The isolation is kept inside splitViaPrimaryKey, where the DB reads happen.
| stringChunkStepSize := new(big.Int).Sub(stringChunkBounds.maxEncodedBigIntValue, stringChunkBounds.minEncodedBigIntValue) | ||
| stringChunkStepSize.Add(stringChunkStepSize, new(big.Int).Sub(big.NewInt(expectedChunks), big.NewInt(1))) | ||
| stringChunkStepSize.Div(stringChunkStepSize, big.NewInt(expectedChunks)) | ||
|
|
There was a problem hiding this comment.
even the stringchunkstepsize is being used at one place only , can we use that only instead of defining a function for it
There was a problem hiding this comment.
Addressed. Removed the separate stringChunkStepSize helper and kept the step-size calculation inline at the single usage site.
|
@vimla01 once you are done with your changes after the review , please re-request the review from here so that I know the pr is ready again |

Description
Refactored the MySQL chunking flow in
GetOrSplitChunksinto smaller helper functions so the chunk sizing, chunk column selection, chunk bound selection, and split strategy routing can be unit tested independently.This keeps the existing behavior the same, but makes the chunking code easier to test and reason about.
Fixes #928
Type of change
How Has This Been Tested?
go test ./drivers/mysql/internal -run 'Test(Chunk|LimitOffset|SplitEvenly|PrimaryKey)'Screenshots or Recordings
olake_928.webm
Documentation
Related PR's (If Any):
N/A