feat: mysql chunking optimisation by saksham-datazip · Pull Request #797 · datazip-inc/olake

saksham-datazip · 2026-01-27T11:40:34Z

Description

This PR improves the MySQL chunking strategy with the primary goal of significantly reducing chunk generation time for large tables during backfill.

To achieve this, two mathematical chunking strategies were introduced based on the primary key type, replacing repeated database-based chunk discovery.

Numeric Primary Keys

The numeric range [min, max] is divided using an arithmetic progression to generate evenly spaced chunk boundaries. This allows chunk boundaries to be computed mathematically instead of relying on repeated database lookups, significantly reducing chunking time.
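The idea can be sketched as follows. This is a minimal illustration, not the code in this PR: `Chunk` and `splitEvenly` are hypothetical names, only integer keys are covered, and the range is assumed to fit comfortably in int64 (no overflow handling).

```go
package main

import "fmt"

// Chunk is a half-open range [Min, Max) over the primary key.
// Hypothetical type for illustration only.
type Chunk struct {
	Min, Max int64
}

// splitEvenly divides the inclusive range [minVal, maxVal] into at most n
// chunks whose lower bounds form an arithmetic progression:
// minVal, minVal+step, minVal+2*step, ...
// Assumption: maxVal-minVal+1 does not overflow int64.
func splitEvenly(minVal, maxVal int64, n int) []Chunk {
	if n < 1 || maxVal < minVal {
		return nil
	}
	span := maxVal - minVal + 1
	step := span / int64(n)
	if span%int64(n) != 0 {
		step++ // round up so that n chunks always cover the whole span
	}
	chunks := make([]Chunk, 0, n)
	for lo := minVal; lo <= maxVal; lo += step {
		hi := lo + step
		if hi > maxVal+1 {
			hi = maxVal + 1 // clamp the last chunk to the end of the range
		}
		chunks = append(chunks, Chunk{Min: lo, Max: hi})
	}
	return chunks
}

func main() {
	for _, c := range splitEvenly(1, 10, 3) {
		fmt.Printf("[%d, %d)\n", c.Min, c.Max)
	}
}
```

Only the min and max values have to come from the database; every boundary in between is computed locally.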

String Primary Keys

String values are mapped into a numeric space using Readable Unicode encoding (big.Int) and then split into balanced ranges. These candidate boundaries are then aligned with actual database values using distinct collation-aware queries to maintain correct ordering.
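A rough sketch of the mapping step is shown below. It is an assumption-laden simplification: it uses a plain base-256 byte encoding rather than the encoding used in this PR, and `encode`, `decode`, and `candidateBoundaries` are hypothetical names. The alignment of candidates with real database values through collation-aware queries is not shown.

```go
package main

import (
	"fmt"
	"math/big"
	"strings"
)

// encode treats the bytes of s, right-padded with zero bytes to width,
// as a big-endian base-256 number. Strings longer than width are truncated.
func encode(s string, width int) *big.Int {
	buf := make([]byte, width)
	copy(buf, s)
	return new(big.Int).SetBytes(buf)
}

// decode reverses encode and trims the zero padding.
// v must fit in width bytes, otherwise FillBytes panics.
func decode(v *big.Int, width int) string {
	buf := make([]byte, width)
	v.FillBytes(buf)
	return strings.TrimRight(string(buf), "\x00")
}

// candidateBoundaries returns n-1 evenly spaced candidate split points
// strictly inside the numeric image of [minKey, maxKey].
func candidateBoundaries(minKey, maxKey string, n int) []string {
	width := len(minKey)
	if len(maxKey) > width {
		width = len(maxKey)
	}
	lo, hi := encode(minKey, width), encode(maxKey, width)
	if n < 2 || lo.Cmp(hi) >= 0 {
		return nil
	}
	span := new(big.Int).Sub(hi, lo)
	out := make([]string, 0, n-1)
	for i := 1; i < n; i++ {
		// boundary = lo + span*i/n
		b := new(big.Int).Mul(span, big.NewInt(int64(i)))
		b.Quo(b, big.NewInt(int64(n)))
		b.Add(b, lo)
		out = append(out, decode(b, width))
	}
	return out
}

func main() {
	fmt.Println(candidateBoundaries("a", "e", 4))
}
```

Byte order only matches a binary collation, and a decoded candidate may not be valid UTF-8 or an existing key. This is why candidates need to be aligned with actual database values before being used as chunk boundaries.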

These strategies substantially reduce the number of database round trips required for chunk discovery, resulting in faster chunk generation and improved performance for large datasets.

As part of this work, several edge cases in chunk boundary calculation were also addressed, particularly around MySQL collation-aware ordering for string primary keys. The implementation aligns generated boundaries with actual database values using collation-aware queries, ensuring correct range generation and preventing missing or overlapping chunks.

Additionally, a small compatibility fix was introduced in refractor.go.

An additional []uint8 case was added in ReformatInt64 so that values the driver returns as raw bytes are parsed and converted to int64. This ensures consistent behavior regardless of how the driver returns the query result.
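The shape of such a conversion can be sketched like this. `toInt64` is a hypothetical stand-in written for illustration. It is not the actual ReformatInt64 function, whose signature and other cases are not shown in this PR description.

```go
package main

import (
	"fmt"
	"strconv"
)

// toInt64 converts a scanned database value to int64.
// Hypothetical helper; the set of handled types is an assumption.
func toInt64(v interface{}) (int64, error) {
	switch x := v.(type) {
	case int64:
		return x, nil
	case int:
		return int64(x), nil
	case []uint8:
		// []uint8 is the same type as []byte; a driver may hand back a
		// numeric column as its decimal text in raw bytes.
		return strconv.ParseInt(string(x), 10, 64)
	case string:
		return strconv.ParseInt(x, 10, 64)
	default:
		return 0, fmt.Errorf("cannot convert %T to int64", v)
	}
}

func main() {
	v, err := toInt64([]uint8("42"))
	fmt.Println(v, err)
}
```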

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Tested MySQL chunking with INT32 primary keys
Tested MySQL chunking with INT64 primary keys
Tested MySQL chunking with FLOAT / DOUBLE primary keys
Verified no data loss or overlap across chunk boundaries
Tested on different kinds of string primary keys for full refresh and CDC
Confirmed performance improvement on large datasets

Performance Stats (Different PK Types)

The following stats.json outputs were collected from runs on different MySQL tables, each containing 10M records, using different primary key types.

🔢 Table with `INT32` Primary Key

Seconds Elapsed: 184.00
Speed: 54,347.30 rps
Memory: 96 MB

🔣 Table with `FLOAT64` Primary Key

Seconds Elapsed: 54.00
Speed: 185,179.58 rps
Memory: 36 MB

Screenshots or Recordings

https://datazip.atlassian.net/wiki/x/AYCVDg

Documentation

Documentation Link: [link to README, olake.io/docs, or olake-docs]
N/A (bug fix, refactor, or test changes only)

Related PR's (If Any):

N/A

saksham-datazip

self review

feat: mysql chunking optimization

83ebf36

saksham-datazip commented Jan 27, 2026

Comment thread drivers/mysql/internal/backfill.go Outdated

saksham-datazip added 2 commits January 27, 2026 17:19

mysql optimization comment resolve

f5766f8

Merge branch 'staging' into feat/mysql-chunking-optimization

443cf94

vaibhav-datazip reviewed Jan 28, 2026

Merge branch 'staging' into feat/mysql-chunking-optimization

6fc574c

saksham-datazip had a problem deploying to integration_tests February 2, 2026 09:42 — with GitHub Actions Failure

chore: formatting fix

c09aee8

saksham-datazip had a problem deploying to integration_tests February 3, 2026 06:30 — with GitHub Actions Failure

my-sql-chunking-formatting-resolved

53520de

saksham-datazip had a problem deploying to integration_tests February 3, 2026 09:46 — with GitHub Actions Failure

saksham-datazip commented Feb 3, 2026

Comment thread drivers/mysql/internal/backfill.go Outdated

Comment thread drivers/mysql/internal/backfill.go Outdated

mysql-chunking-self-reviewed

3b9fbe7

saksham-datazip had a problem deploying to integration_tests February 3, 2026 09:53 — with GitHub Actions Failure

mysql-chunking-optimization-for-string-pk

8e4ba6a

saksham-datazip had a problem deploying to integration_tests February 7, 2026 15:19 — with GitHub Actions Failure

Merge branch 'staging' into feat/mysql-chunking-optimization

1707ae1

saksham-datazip had a problem deploying to integration_tests February 7, 2026 15:28 — with GitHub Actions Failure

Merge branch 'staging' into feat/mysql-chunking-optimization

feca5a0

vaibhav-datazip had a problem deploying to integration_tests February 9, 2026 08:01 — with GitHub Actions Failure

feat: solved lint issue

ccfb371

saksham-datazip temporarily deployed to integration_tests February 9, 2026 08:10 — with GitHub Actions Inactive

vaibhav-datazip reviewed Feb 9, 2026

Merge branch 'staging' into feat/mysql-chunking-optimization

fe4b4b2

saksham-datazip had a problem deploying to integration_tests February 10, 2026 09:13 — with GitHub Actions Failure

feat: mysql chunking optimization review resolved

910246a

saksham-datazip had a problem deploying to integration_tests February 10, 2026 09:52 — with GitHub Actions Failure

feat: resolving-lint-extra-spaces

1eacf5a

saksham-datazip had a problem deploying to integration_tests February 10, 2026 09:56 — with GitHub Actions Failure

feat: lint error resolved

964a2ee

saksham-datazip had a problem deploying to integration_tests April 2, 2026 07:40 — with GitHub Actions Error

vikaxsh reviewed Apr 2, 2026

Comment thread pkg/jdbc/jdbc.go Outdated

chore: removed-MySQLFirstPKAtOrAfterStringQuery

2c9eaa1

saksham-datazip temporarily deployed to integration_tests April 2, 2026 10:00 — with GitHub Actions Inactive

Merge branch 'staging' into feat/mysql-chunking-optimization

3c0c80e

saksham-datazip had a problem deploying to integration_tests April 8, 2026 02:22 — with GitHub Actions Failure

Merge branch 'staging' into feat/mysql-chunking-optimization

37ce840

saksham-datazip had a problem deploying to integration_tests April 9, 2026 07:25 — with GitHub Actions Failure

Merge branch 'staging' into feat/mysql-chunking-optimization

5819281

saksham-datazip had a problem deploying to integration_tests April 10, 2026 20:21 — with GitHub Actions Failure

Merge branch 'staging' into feat/mysql-chunking-optimization

031949d

saksham-datazip had a problem deploying to integration_tests April 14, 2026 12:29 — with GitHub Actions Failure

Merge branch 'staging' into feat/mysql-chunking-optimization

ec479bd

saksham-datazip had a problem deploying to integration_tests April 18, 2026 18:59 — with GitHub Actions Failure

chore: reduced unicode size

46ac722

saksham-datazip temporarily deployed to integration_tests April 19, 2026 17:29 — with GitHub Actions Inactive

Merge branch 'staging' into feat/mysql-chunking-optimization

1966fdf

saksham-datazip had a problem deploying to integration_tests April 20, 2026 11:32 — with GitHub Actions Failure

saksham-datazip had a problem deploying to integration_tests April 20, 2026 12:11 — with GitHub Actions Failure

vikaxsh reviewed Apr 21, 2026

Merge branch 'staging' into feat/mysql-chunking-optimization

51639bc

saksham-datazip had a problem deploying to integration_tests April 21, 2026 13:31 — with GitHub Actions Failure

chore: resolved-review-comments

b807117

saksham-datazip temporarily deployed to integration_tests April 22, 2026 15:00 — with GitHub Actions Inactive

Merge branch 'staging' into feat/mysql-chunking-optimization

9857e99

vaibhav-datazip temporarily deployed to integration_tests April 24, 2026 13:54 — with GitHub Actions Inactive

vaibhav-datazip approved these changes Apr 24, 2026

vikaxsh approved these changes Apr 24, 2026

vaibhav-datazip merged commit 06aeb92 into staging Apr 26, 2026
11 checks passed

vaibhav-datazip deleted the feat/mysql-chunking-optimization branch April 26, 2026 07:15

5 participants
