
mysql_cdc: chunk large tables across workers via PK-range splitting#4342

Open
ankit481 wants to merge 3 commits into redpanda-data:main from ankit481:feat/mysql-cdc-snapshot-chunking

Conversation

ankit481 (Contributor) commented Apr 23, 2026

Summary

Closes #4341. Builds on #4320.

Adds an opt-in snapshot_chunks_per_table field to mysql_cdc. When left at the default (1) the snapshot flow is unchanged from #4320. When set higher, each table's first primary-key column is probed for MIN/MAX under the shared consistent-snapshot transaction and the resulting integer range is split into N half-open chunks that are dispatched across the existing snapshot_max_parallel_tables worker pool.

This is the intra-table parallelism piece. #4320 unblocks pipelines with many tables; this PR unblocks pipelines dominated by a single very large table — the shape behind the 400M-row reference workload in #4341.

Motivation

Inter-table parallelism alone cannot accelerate a snapshot where one table holds the bulk of the rows. Splitting that table across the worker pool is what closes the gap to AWS DMS.

Target from #4341: 400M rows in ~45 min (1h acceptable). At 16 workers each reading a chunked slice of the PK space, the required per-worker throughput is ~25M rows/hr (400M rows / 16 workers / 1h), comfortably within the ~30M rows/hr per-worker baseline the existing code path already achieves on commodity RDS hardware.

Design

Consistency model is unchanged

Every worker transaction is still opened inside the single FLUSH TABLES WITH READ LOCK window established by prepareParallelSnapshotSet. MIN/MAX probing runs inside one of those worker transactions, so boundaries computed during planning agree exactly with the state every worker subsequently reads. The binlog position captured under the lock applies uniformly to every chunk.

No new lock acquisition, no relaxation of isolation, no new handoff with the binlog stream.
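
For reference, a minimal sketch of that window, assuming database/sql connections. The FLUSH/UNLOCK pair and the consistent-snapshot worker transactions are as described here; the function name, signature, and exact statement placement are illustrative:

package mysqlsnapshot

import (
    "context"
    "database/sql"
)

// openSnapshotSet sketches the lock window described above: every worker
// transaction starts while the coordinator still holds the table read
// locks, so all workers observe identical state at one binlog position.
func openSnapshotSet(ctx context.Context, coord *sql.Conn, workers []*sql.Conn, tableList string) error {
    if _, err := coord.ExecContext(ctx, "FLUSH TABLES "+tableList+" WITH READ LOCK"); err != nil {
        return err
    }
    // Hold the lock only long enough to open every worker transaction.
    defer coord.ExecContext(ctx, "UNLOCK TABLES")

    for _, w := range workers {
        if _, err := w.ExecContext(ctx, "SET TRANSACTION ISOLATION LEVEL REPEATABLE READ"); err != nil {
            return err
        }
        if _, err := w.ExecContext(ctx, "START TRANSACTION WITH CONSISTENT SNAPSHOT"); err != nil {
            return err
        }
    }
    // The binlog position is captured here, still under the lock, and
    // applies uniformly to every chunk dispatched afterwards.
    return nil
}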

Chunking math

For each table:

  • chunks_per_table <= 1: emit one whole-table unit (no planning query).
  • First PK column is a supported integer type: compute MIN(pk), MAX(pk), split [MIN, MAX] into N half-open [lo, hi) chunks.
  • First PK column is non-numeric: emit one whole-table unit and log the fallback reason.

Outermost chunks are open-ended — the first chunk has no lower bound and the last chunk has no upper bound. This guarantees every row in [MIN, MAX] is covered without off-by-one risk and that any row outside [MIN, MAX] under the snapshot is still picked up rather than silently dropped.
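
Concretely, a sketch of that math. The PR names splitIntRange and chunkBounds; the exact signatures below are assumptions:

// chunkBounds describes one half-open [lo, hi) slice of the PK space.
// A nil bound means that side of the chunk is open-ended.
type chunkBounds struct {
    lo, hi *int64
}

// splitIntRange splits [min, max] into at most n half-open chunks whose
// outermost bounds stay open, so the chunks partition all of int64.
func splitIntRange(min, max int64, n int) []chunkBounds {
    if n <= 1 || min >= max {
        return []chunkBounds{{}} // degenerate input: one fully-open chunk
    }
    // Unsigned subtraction guards against overflow when hi-lo is near
    // the int64 limits.
    span := uint64(max) - uint64(min)
    step := span / uint64(n)
    if step == 0 {
        step = 1 // n exceeds the span: never emit zero-width chunks
    }
    var chunks []chunkBounds
    var prev *int64 // nil lower bound on the first chunk
    for i := 1; i < n; i++ {
        b := int64(uint64(min) + uint64(i)*step)
        if b >= max {
            break // the open-ended last chunk covers the remainder
        }
        bound := b
        chunks = append(chunks, chunkBounds{lo: prev, hi: &bound})
        prev = &bound
    }
    return append(chunks, chunkBounds{lo: prev}) // no upper bound
}

Because the outermost bounds stay nil, the chunks partition the whole integer domain, which is what makes the coverage guarantee above hold without endpoint special cases.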

Composite primary keys

Chunking partitions on the leading PK column only. Per-chunk keyset pagination inside querySnapshotTable continues to use the full PK tuple, so ordering and pagination remain correct for composite PKs such as (tenant_id, id).

Tradeoff: a skewed leading column produces uneven chunks. Operators with that data shape should leave snapshot_chunks_per_table at 1 and rely on snapshot_max_parallel_tables alone. This is a documented limitation, not a correctness issue — no row is ever read twice, and no row is ever missed.

SQL shape

Example: chunks_per_table=4 on an INT PK with range [0, 100), 2nd chunk, mid-pagination:

SELECT * FROM t
WHERE id >= ? AND id < ? AND (id) > (?)
ORDER BY id
LIMIT ?

Bindings: [25, 50, lastSeenID, limit].

First chunk omits the lower bound. Last chunk omits the upper bound. Middle chunks have both.
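
A sketch of the predicate construction (buildChunkPredicate per the Files list below; the signature and backtick-quoting are assumptions, and chunkBounds is the type from the sketch above):

import (
    "fmt"
    "strings"
)

// buildChunkPredicate renders a chunk's bounds as a WHERE fragment plus
// bind arguments; a nil chunk (chunks_per_table=1) yields no predicate,
// preserving the pre-chunking query shape.
func buildChunkPredicate(pkCol string, b *chunkBounds) (string, []any) {
    if b == nil {
        return "", nil
    }
    var conds []string
    var args []any
    if b.lo != nil { // omitted for the first, open-ended chunk
        conds = append(conds, fmt.Sprintf("`%s` >= ?", pkCol))
        args = append(args, *b.lo)
    }
    if b.hi != nil { // omitted for the last, open-ended chunk
        conds = append(conds, fmt.Sprintf("`%s` < ?", pkCol))
        args = append(args, *b.hi)
    }
    return strings.Join(conds, " AND "), args
}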

Files

  • internal/impl/mysql/snapshot_chunking.go (new): planSnapshotWork, splitIntRange, buildChunkPredicate, numeric-PK detection via information_schema.columns.
  • internal/impl/mysql/snapshot.go: querySnapshotTable threads *chunkBounds through the WHERE clause. The existing buildOrderByClause and keyset pagination are untouched.
  • internal/impl/mysql/input_mysql_stream.go: new snapshot_chunks_per_table field with [1, 256] validation, renamed readSnapshotTable -> readSnapshotWorkUnit, chunking plan runs inside runParallelSnapshot.
  • internal/impl/mysql/parallel_snapshot.go: distributeTablesToWorkers generalised to distributeWorkToWorkers[T any] so work units of type snapshotWorkUnit use the same fan-out code path as tables did before. Removed the internal workerCount > len(tables) cap — the caller sizes the pool against the expected work-unit count.
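
A minimal sketch of what that generic fan-out could look like, using errgroup for the fail-halt behaviour described below; the real signature and dispatch strategy may differ:

import (
    "context"

    "golang.org/x/sync/errgroup"
)

// distributeWorkToWorkers fans units out to a fixed pool of workers and
// halts everything on the first error, mirroring the fail-halt mode the
// table-level fan-out already had.
func distributeWorkToWorkers[T any](ctx context.Context, workers int, units []T, run func(context.Context, T) error) error {
    ch := make(chan T)
    eg, ctx := errgroup.WithContext(ctx)
    for i := 0; i < workers; i++ {
        eg.Go(func() error {
            for u := range ch {
                if err := run(ctx, u); err != nil {
                    return err // cancels the group context
                }
            }
            return nil
        })
    }
    for _, u := range units {
        select {
        case ch <- u:
        case <-ctx.Done(): // a worker failed; stop feeding
            close(ch)
            return eg.Wait()
        }
    }
    close(ch)
    return eg.Wait()
}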

Dispatch

startMySQLSync now routes to runParallelSnapshot whenever either snapshot_max_parallel_tables > 1 or snapshot_chunks_per_table > 1. When both are 1 (default) the original sequential path runs unchanged.
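
For illustration, a config that takes the parallel path on both axes. The two snapshot fields are the ones named in this PR and #4320; the remaining field names and values are assumed for the sketch:

input:
  mysql_cdc:
    dsn: "user:pass@tcp(mysql:3306)/mydb"  # assumed DSN shape
    tables: [events]
    snapshot_max_parallel_tables: 4        # inter-table workers (#4320)
    snapshot_chunks_per_table: 8           # PK-range chunks per table (this PR)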

Backwards compatibility

Default snapshot_chunks_per_table: 1 produces byte-identical behaviour to #4320.

  • The config spec adds one Advanced() int field. Existing YAML is unaffected.
  • runSequentialSnapshot is untouched.
  • When the parallel path is taken with chunks_per_table=1, every work unit has bounds: nil, so querySnapshotTable emits the same WHERE-less query as before (just via a slightly different code path).
  • Existing integration tests (TestIntegrationMySQLSnapshotAndCDC, TestIntegrationMySQLSnapshotConsistency, TestIntegrationMySQLCDCWithCompositePrimaryKeys, TestIntegrationMySQLCDCSchemaMetadata, TestIntegrationMySQLParallelSnapshot) all pass unchanged.

Tests added

Unit (snapshot_chunking_test.go)

Pure-function coverage of the chunking math and SQL predicate:

  • SingleChunkWhenNLEOne — n of 0, 1, -3 all produce one fully-open chunk.
  • SingleChunkWhenRangeCollapsed — lo == hi and reversed ranges degenerate to one chunk.
  • OutermostChunksAreOpenEnded — first chunk lo==nil, last chunk hi==nil.
  • ChunksCoverAllIntegersExactlyOnce — enumerates every integer in [lo, hi] for several n and asserts single-chunk membership under half-open semantics.
  • WhenNExceedsSpanStepIsAtLeastOne — short ranges asked for many chunks still cover every value.
  • LargeSpanDoesNotOverflow — hi-lo near the int64 limits; guards the uint64 cast in splitIntRange.
  • BuildChunkPredicate_* — nil, both-bounds, lower-only, upper-only, fully-open variants produce the expected SQL fragment and arg list.
  • DistributeWorkToWorkers_SnapshotWorkUnitInstantiation — the generic fan-out helper accepts the new work-unit type and visits every item exactly once.

Existing distributeTablesToWorkers tests continue to pass — they now exercise distributeWorkToWorkers at T = string.
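
As a flavour of the core invariant, a condensed version of what ChunksCoverAllIntegersExactlyOnce asserts, reusing the splitIntRange sketch above (the actual test body may differ):

import "testing"

// Every integer in [lo, hi] must fall in exactly one half-open chunk,
// for several chunk counts, including n larger than the span.
func TestChunksCoverAllIntegersExactlyOnce(t *testing.T) {
    lo, hi := int64(-7), int64(40)
    for _, n := range []int{2, 3, 8, 64} {
        chunks := splitIntRange(lo, hi, n)
        for v := lo; v <= hi; v++ {
            hits := 0
            for _, c := range chunks {
                if (c.lo == nil || v >= *c.lo) && (c.hi == nil || v < *c.hi) {
                    hits++
                }
            }
            if hits != 1 {
                t.Fatalf("n=%d: value %d matched %d chunks", n, v, hits)
            }
        }
    }
}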

Config (config_test.go)

  • TestConfig_SnapshotChunksPerTable_DefaultAndExplicit — default of 1, explicit 16 round-trips through the spec.
  • TestConfig_SnapshotChunksPerTable_InvalidValuesRejected — zero, negative, above-cap, and absurdly-large values all violate the constructor's validation predicate.

Integration (integration_test.go)

  • TestIntegrationMySQLChunkedSnapshot — MySQL 8.0 via testcontainers. Creates one INT PK table and one composite (tenant_id, id) PK table, each loaded with 2000 rows. Runs mysql_cdc with snapshot_max_parallel_tables: 4, snapshot_chunks_per_table: 8. Asserts: every row emitted exactly once, no duplicates from overlapping chunk ranges (tracked via a sync.Map of observed PKs), post-snapshot inserts are picked up by the binlog stream.
  • TestIntegrationMySQLChunkedSnapshotNonNumericPKFallback — VARCHAR PK table with chunks_per_table: 8. Verifies the fallback path reads the whole table without error and emits every row.

Local test results

Unit (whole package, race + shuffle):

ok  internal/impl/mysql  7.170s   (-race -shuffle=on)

Integration — new tests:

--- PASS: TestIntegrationMySQLChunkedSnapshot                         (29.29s)
--- PASS: TestIntegrationMySQLChunkedSnapshotNonNumericPKFallback     (13.29s)
--- PASS: TestIntegrationMySQLParallelSnapshot                        (23.30s)
ok  internal/impl/mysql  32.223s

Integration — existing sequential-path regressions (backwards-compat sanity check):

--- PASS: TestIntegrationMySQLCDCSchemaMetadata                       (16.67s)
--- PASS: TestIntegrationMySQLSnapshotConsistency                     (20.84s)
--- PASS: TestIntegrationMySQLSnapshotAndCDC                          (28.12s)
--- PASS: TestIntegrationMySQLCDCWithCompositePrimaryKeys             (36.05s)
ok  internal/impl/mysql  39.084s

gofmt and go vet clean.

Log excerpt from TestIntegrationMySQLChunkedSnapshot confirming the planner emits 16 work units (2 tables x 8 chunks) across 4 workers, with correct open-ended outermost chunks and full-tuple keyset pagination for composite PKs:

Acquiring table-level read locks for parallel snapshot (4 workers): FLUSH TABLES `single_pk`, `composite_pk` WITH READ LOCK
Parallel snapshot planned: 2 tables -> 16 work units across 4 workers
Querying snapshot: SELECT * FROM single_pk WHERE `id` < ? ORDER BY id LIMIT ?                                                  (first chunk - no lower bound)
Querying snapshot: SELECT * FROM single_pk WHERE `id` >= ? AND `id` < ? ORDER BY id LIMIT ?                                    (middle chunk)
Querying snapshot: SELECT * FROM single_pk WHERE `id` >= ? ORDER BY id LIMIT ?                                                 (last chunk - no upper bound)
Querying snapshot: SELECT * FROM single_pk WHERE `id` >= ? AND `id` < ? AND (id) > (?) ORDER BY id LIMIT ?                     (mid-pagination within a chunk)
Querying snapshot: SELECT * FROM composite_pk WHERE `tenant_id` >= ? AND `tenant_id` < ? AND (tenant_id, id) > (?, ?) ORDER BY tenant_id, id LIMIT ?
starting MySQL CDC stream from binlog mysql-bin.000003 at offset 1218440

Out of scope / follow-ups

  • Non-numeric first-column PKs (UUID, VARCHAR, binary). Needs sampling-based or OFFSET-based boundary discovery; material complexity best kept behind its own config flag in a future PR.
  • Intra-table chunk skew handling. The documented workaround (leave chunks_per_table=1) is sufficient for the common case; adaptive partitioning is a separate feature.
  • Adaptive chunk sizing based on table size. Fixed N is simpler and predictable; adaptive can follow.

Test plan

  • Run unit tests for internal/impl/mysql with -race -shuffle=on
  • Run new integration tests for chunked snapshot (single PK, composite PK, non-numeric PK fallback)
  • Re-run existing sequential-path integration tests for regression
  • Verify gofmt/go vet cleanliness
  • Maintainer review, especially of the MIN/MAX + half-open chunk reasoning and the fallback for non-numeric first PK columns
  • CI integration matrix (MySQL 5.7, 8.0, 8.4 + MariaDB)
  • Production validation against the 400M-row reference workload from #4341 (mysql_cdc: single-table snapshots are not parallelised, making very large tables the bottleneck)

Commits

Adds an opt-in `snapshot_max_parallel_tables` field to the `mysql_cdc`
input. When left at the default (`1`) the snapshot flow is the existing
single-transaction, single-goroutine path: bit-for-bit unchanged.

When set above `1`, N REPEATABLE READ / CONSISTENT SNAPSHOT transactions
are opened on independent connections under a single brief FLUSH
TABLES ... WITH READ LOCK window. Every worker observes identical state
at the same binlog position, and the configured tables are fanned out
across the workers via an errgroup. This preserves the existing global
consistent-snapshot invariant and the existing fail-halt failure mode,
while removing the per-table serial bottleneck for pipelines with many
tables.

The inner per-table loop is extracted into readSnapshotTable so both
paths share identical semantics. The sequential path is moved into
runSequentialSnapshot (unchanged body); the parallel path lives in
runParallelSnapshot and parallel_snapshot.go.

Defense-in-depth against a mis-typed config value that would otherwise
try to open thousands of MySQL connections at snapshot time. 256 sits
well above any realistic pipeline (the existing cap at len(tables) is
the more common practical bound) and well below the range where a typo
(e.g. 10000) would cause a connection storm before MySQL's own
max_connections kicked in.

Surfaces as a clear configuration error at Connect time rather than a
runtime too-many-connections from the server.

Adds an opt-in snapshot_chunks_per_table field to mysql_cdc. When left at
its default (1) the snapshot flow is unchanged. When set higher, each
table's first primary-key column is probed for MIN and MAX under the
shared consistent-snapshot transaction and the resulting integer range is
split into N half-open chunks that are dispatched across the existing
snapshot_max_parallel_tables worker pool.

This is a follow-up to the inter-table parallelism introduced in the
mysql_cdc: parallelise snapshot reads across tables change. Inter-table
parallelism alone cannot accelerate a snapshot dominated by a single very
large table, which is the most common shape for message/event tables.
Chunking splits that single-table work across the worker pool instead.

Chunking is supported for tables whose first primary-key column is an
integer type (tinyint/smallint/mediumint/int/integer/bigint, signed or
unsigned). Composite primary keys are supported - chunking partitions on
the leading column only, and per-chunk keyset pagination continues to
respect the full PK ordering. Tables with non-numeric first PK columns
fall back to a whole-table read with an informational log line so mixed
workloads keep working.

Consistency model is unchanged. All worker transactions still begin
under one FLUSH TABLES WITH READ LOCK window so every chunk observes
identical state at the same binlog position. Planning runs inside one
worker's snapshot transaction so MIN/MAX agree with what every worker
subsequently reads.

The outermost chunks in each table are open-ended (no lower bound on
the first chunk, no upper bound on the last) so rows at the exact
MIN/MAX endpoints and any rows outside [MIN, MAX] are captured rather
than silently dropped.

The fan-out helper (previously distributeTablesToWorkers) is generalised
to a generic distributeWorkToWorkers so the parallel path can dispatch
chunk-typed work units while the existing fan-out tests keep passing
with string inputs.

Field cap: snapshot_chunks_per_table is validated at config time to be
within [1, 256], matching the pattern established for
snapshot_max_parallel_tables.

Tests added:

- snapshot_chunking_test.go: splitIntRange coverage and overflow,
  buildChunkPredicate shapes, and generic fan-out against
  snapshotWorkUnit.
- config_test.go: default, explicit, and out-of-range values for
  snapshot_chunks_per_table.
- integration_test.go: TestIntegrationMySQLChunkedSnapshot exercises an
  int PK table and a composite (int, int) PK table with chunks=8 and
  asserts no duplicates across overlapping chunk ranges;
  TestIntegrationMySQLChunkedSnapshotNonNumericPKFallback confirms the
  VARCHAR-PK fallback reads the whole table without error.
