feat: mysql chunking optimisation#797

Merged
vaibhav-datazip merged 55 commits into staging from feat/mysql-chunking-optimization
Apr 26, 2026

Conversation

@saksham-datazip
Collaborator

@saksham-datazip saksham-datazip commented Jan 27, 2026

Description

This PR improves the MySQL chunking strategy with the primary goal of significantly reducing chunk generation time for large tables during backfill.

To achieve this, two mathematical chunking strategies were introduced based on the primary key type, replacing repeated database-based chunk discovery.

Numeric Primary Keys

The numeric range [min, max] is divided using an arithmetic progression to generate evenly spaced chunk boundaries. This allows chunk boundaries to be computed mathematically instead of relying on repeated database lookups, significantly reducing chunking time.
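The arithmetic-progression idea above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Chunk` type and the `splitEvenly` function are hypothetical names, the ranges are half-open, and the sketch assumes the key range sits well inside int64 so that `max + 1` does not overflow.

```go
package main

import "fmt"

// Chunk is a hypothetical half-open key range [Min, Max).
type Chunk struct {
	Min, Max int64
}

// splitEvenly divides the inclusive key range [min, max] into at most n
// chunks whose boundaries form an arithmetic progression. No database
// lookups are needed beyond the initial min and max.
// Assumption: max+1 and lo+step do not overflow int64.
func splitEvenly(min, max int64, n int) []Chunk {
	if n <= 0 || max < min {
		return nil
	}
	span := max - min + 1
	// Ceiling division so that n steps always cover the whole span.
	step := span / int64(n)
	if span%int64(n) != 0 {
		step++
	}
	chunks := make([]Chunk, 0, n)
	for lo := min; lo <= max; lo += step {
		hi := lo + step
		if hi > max+1 {
			hi = max + 1 // clamp the last chunk to the end of the range
		}
		chunks = append(chunks, Chunk{Min: lo, Max: hi})
	}
	return chunks
}

func main() {
	fmt.Println(splitEvenly(1, 10, 3)) // prints [{1 5} {5 9} {9 11}]
}
```

Because consecutive chunks share a boundary and the ranges are half-open, every key in [min, max] falls into exactly one chunk. Evenly spaced boundaries only give evenly sized chunks when the keys are roughly uniformly distributed.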

String Primary Keys

String values are mapped into a numeric space using Readable Unicode encoding (big.Int) and then split into balanced ranges. These candidate boundaries are then aligned with actual database values using distinct collation-aware queries to maintain correct ordering.
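As a rough illustration of mapping strings into a numeric space, the sketch below reads each string as a fixed-width base-256 number over its bytes and interpolates evenly between the two ends. This is an assumption made for the example only: the encoding actually used in this PR may differ, and `encode`, `decode`, and `candidateBoundaries` are hypothetical names. The returned strings are only candidates, since byte order is not the same as MySQL collation order, which is why the PR aligns them with actual database values afterwards.

```go
package main

import (
	"fmt"
	"math/big"
	"strings"
)

// encode maps s to an integer by reading its first `width` bytes as
// base-256 digits, right-padded with zero bytes. For strings without
// trailing NUL bytes this preserves byte-wise lexicographic order.
func encode(s string, width int) *big.Int {
	buf := make([]byte, width)
	copy(buf, s)
	return new(big.Int).SetBytes(buf)
}

// decode is the inverse of encode; zero padding is trimmed.
// v must fit into `width` bytes.
func decode(v *big.Int, width int) string {
	buf := make([]byte, width)
	v.FillBytes(buf)
	return strings.TrimRight(string(buf), "\x00")
}

// candidateBoundaries returns n-1 interior boundaries evenly spaced
// between lo and hi in the encoded numeric space.
func candidateBoundaries(lo, hi string, n int) []string {
	width := len(lo)
	if len(hi) > width {
		width = len(hi)
	}
	a, b := encode(lo, width), encode(hi, width)
	if n < 2 || a.Cmp(b) >= 0 {
		return nil
	}
	span := new(big.Int).Sub(b, a)
	out := make([]string, 0, n-1)
	for i := 1; i < n; i++ {
		// boundary_i = a + span*i/n
		v := new(big.Int).Mul(span, big.NewInt(int64(i)))
		v.Quo(v, big.NewInt(int64(n)))
		v.Add(v, a)
		out = append(out, decode(v, width))
	}
	return out
}

func main() {
	fmt.Println(candidateBoundaries("a", "e", 4)) // prints [b c d]
}
```

`big.Int` is used because even short strings exceed the range of a 64-bit integer once every byte becomes a digit. Interpolated candidates may not be valid UTF-8 in general, which is another reason to treat them as hints rather than final boundaries.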

These strategies substantially reduce the number of database round trips required for chunk discovery, resulting in faster chunk generation and improved performance for large datasets.

As part of this work, several edge cases in chunk boundary calculation were also addressed, particularly around MySQL collation-aware ordering for string primary keys. The implementation aligns generated boundaries with actual database values using collation-aware queries, ensuring correct range generation and preventing missing or overlapping chunks.

Additionally, a small compatibility fix was introduced in refractor.go: an additional []uint8 case was added in ReformatInt64 so that values the driver returns as []uint8 are properly parsed and converted to int64. This ensures consistent behavior regardless of how the query result is returned by the driver.
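The kind of []uint8 handling described here might look like the sketch below. It is not the actual ReformatInt64 from this repository: `toInt64` is a hypothetical stand-in, and it assumes the byte slice holds the decimal text of the number, which is one common way for a SQL driver to return a numeric column scanned into an untyped value.

```go
package main

import (
	"fmt"
	"strconv"
)

// toInt64 is a hypothetical stand-in for a ReformatInt64-style helper.
// It accepts the value in whichever Go type the driver produced.
func toInt64(v interface{}) (int64, error) {
	switch x := v.(type) {
	case int64:
		return x, nil
	case int:
		return int64(x), nil
	case []uint8:
		// Assumption: the bytes are the decimal text of the number,
		// e.g. []byte("42"), not a binary encoding.
		return strconv.ParseInt(string(x), 10, 64)
	case string:
		return strconv.ParseInt(x, 10, 64)
	default:
		return 0, fmt.Errorf("cannot convert %T to int64", v)
	}
}

func main() {
	n, err := toInt64([]uint8("42"))
	fmt.Println(n, err) // prints 42 <nil>
}
```

Without a []uint8 branch, such a value would fall through to the default case and be rejected even though it carries a perfectly valid integer.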

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Tested MySQL chunking with INT32 primary keys

  • Tested MySQL chunking with INT64 primary keys

  • Tested MySQL chunking with FLOAT / DOUBLE primary keys

  • Verified no data loss or overlap across chunk boundaries

  • Tested different kinds of string primary keys for full refresh and CDC

  • Confirmed performance improvement on large datasets

Performance Stats (Different PK Types)

The following stats.json outputs were collected from runs on different MySQL tables, each containing 10M records, using different primary key types.

🔢 Table with INT32 Primary Key

  • Seconds Elapsed: 184.00
  • Speed: 54,347.30 rps
  • Memory: 96 MB

🔣 Table with FLOAT64 Primary Key

  • Seconds Elapsed: 54.00
  • Speed: 185,179.58 rps
  • Memory: 36 MB

Screenshots or Recordings

https://datazip.atlassian.net/wiki/x/AYCVDg

Documentation

  • Documentation Link: [link to README, olake.io/docs, or olake-docs]
  • N/A (bug fix, refactor, or test changes only)

Related PRs (If Any):

N/A

Collaborator Author

@saksham-datazip saksham-datazip left a comment

self review

@vaibhav-datazip vaibhav-datazip merged commit 06aeb92 into staging Apr 26, 2026
11 checks passed
@vaibhav-datazip vaibhav-datazip deleted the feat/mysql-chunking-optimization branch April 26, 2026 07:15

5 participants