feat: kafka driver change by saksham-datazip · Pull Request #958 · datazip-inc/olake

saksham-datazip · 2026-05-20T10:36:59Z

Description

Migrated the Kafka consumer implementation from Segment Kafka-Go to Franz-Go to leverage improved consumer group management, rebalance handling, and lifecycle APIs provided by Franz-Go.

As part of this migration:

Added support for static membership using instance.id to improve retry and reconnect behavior. This helps avoid unnecessary rebalances during transient failures or consumer restarts when the same instance rejoins the group.
Implemented rebalance detection using Franz-Go consumer group callbacks along with generation ID tracking stored in consumer metadata. During sync execution, the active generation is continuously validated against the latest assigned generation to detect stale consumers or lost partition ownership.
Added graceful shutdown handling on successful rebalance detection. Instead of continuing to process records with outdated assignments, the sync now exits cleanly to prevent duplicate processing and stale partition consumption during consumer group transitions.
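The generation ID check described above can be sketched with the standard library alone. `generationTracker` and its methods are hypothetical names used for illustration, not the types in this PR; the sketch assumes the generation is stored once the group is stable and compared on every later check.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// generationTracker is a hypothetical helper, not the PR's actual type.
// It remembers the consumer group generation observed when the sync started.
type generationTracker struct {
	expected atomic.Int32
}

// newGenerationTracker records the generation the sync started on.
func newGenerationTracker(initial int32) *generationTracker {
	t := &generationTracker{}
	t.expected.Store(initial)
	return t
}

// rebalanceDetected reports whether the group has moved to a different
// generation than the one recorded at startup.
func (t *generationTracker) rebalanceDetected(current int32) bool {
	return current != t.expected.Load()
}

func main() {
	t := newGenerationTracker(7)
	fmt.Println(t.rebalanceDetected(7)) // false
	fmt.Println(t.rebalanceDetected(8)) // true
}
```

A sync loop would call `rebalanceDetected` with the latest generation reported by the client and exit cleanly when it returns true.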

Fixes #794

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Tested retry scenarios by intentionally returning errors from StreamChanges. Verified that retries worked correctly and did not trigger unnecessary consumer group rebalances due to static membership support using instance.id.
Stopped the sync during active processing and triggered a consumer group rebalance. Verified that the sync detected the rebalance successfully and exited gracefully without processing duplicate data.

Screenshots or Recordings

N/A

Documentation

Documentation Link: [link to README, olake.io/docs, or olake-docs]
N/A (bug fix, refactor, or test changes only)

Related PR's (If Any):

hash-data

Nice Code

hash-data · 2026-06-11T04:46:49Z

+	joinCtx, cancel := context.WithTimeout(ctx, 2*time.Minute)
+	defer cancel()
+	for {
+		if joinCtx.Err() != nil {


The error here can have other causes as well, not only a timeout.

hash-data · 2026-06-11T07:58:34Z

+
+	// remove stale consumers before creating new readers
+	if err := k.readerManager.RemoveExistingConsumers(ctx, k.client); err != nil {
+		return fmt.Errorf("failed to remove existing consumers: %s", err)


Can we remove this?

I was not able to consistently reproduce this issue on my machine under normal conditions. However, after increasing the session timeout to 2 minutes, commenting out the RemoveExistingConsumers logic, and adding debug logging, I observed that extra consumers were accumulating in the consumer group during rebalancing.

One possible outcome of this situation is that all partitions get assigned to the pre-existing consumers, leaving the newly created consumer with no partitions. In that case, PollFetches eventually times out because there are no assigned partitions to fetch from, causing the sync to exit without processing any records. However, this is only one possible explanation and not a confirmed root cause. There may be other unexpected behaviors or outcomes resulting from the presence of these extra consumers as well.

To further validate the behavior, I re-enabled the RemoveExistingConsumers logic and added logging inside the function to track how many consumers were being removed. Following the same reproduction steps, the logs showed that the function was working as expected and successfully removing the stale consumers.
Reference:
https://datazip.atlassian.net/wiki/x/AQCsI

hash-data · 2026-06-11T09:19:01Z

+			continue
+		}
+
+		message := iter.Next()


How many records will there be in a batch?
Also, there was a case where a message can be larger than 1 MB.

Check the default timeout.

This batch size behavior already existed in Segment, where we had a 10 MB limit, so I kept the same limit in franz-go as well.

What this means is that once the broker has accumulated data up to this limit, it will return the batch. If a single message itself is larger than 10 MB, Kafka will still return that message in a single batch, ignoring the configured fetch limit for that request. I verified this by testing with a 61 MB message.

Regarding how these settings work:

MinFetchSize: Specifies the minimum amount of data the broker should try to accumulate before returning a fetch response.

MaxFetchSize: Sets an upper limit on the amount of data returned in a fetch response. However, if the first available message exceeds this limit, Kafka will still return that message in a single batch.

FetchMaxWaitMs: Specifies how long the broker should wait for data to become available. If the minimum fetch size is not reached within this timeout, the broker returns whatever data is available (or an empty response).
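The size limit described above can be illustrated with a simplified model. `nextBatch` is a hypothetical function written for this sketch; it works on individual message sizes, whereas a real broker works on record batches, so treat it as an approximation of the rule rather than broker code.

```go
package main

import "fmt"

// nextBatch is a simplified, hypothetical model of the fetch size limit.
// It returns how many queued messages fit in one fetch response.
// The first message is always included, even if it alone exceeds maxBytes.
func nextBatch(sizes []int, maxBytes int) int {
	total := 0
	for i, size := range sizes {
		if i > 0 && total+size > maxBytes {
			return i
		}
		total += size
	}
	return len(sizes)
}

func main() {
	const mb = 1 << 20
	fmt.Println(nextBatch([]int{4 * mb, 4 * mb, 4 * mb}, 10*mb)) // 2
	fmt.Println(nextBatch([]int{61 * mb, 1 * mb}, 10*mb))        // 1
}
```

The second call mirrors the 61 MB test mentioned above: the oversized message is still returned, on its own.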

Regarding the default poll timeout we discussed earlier, that behavior is not supported in franz-go. Unlike the Java consumer, franz-go does not enforce a maximum time between polls because heartbeats are handled independently in the background.

Reference: twmb/franz-go#140

hash-data · 2026-06-11T09:28:53Z

-func (b *CustomGroupBalancer) UserData() ([]byte, error) {
-	return nil, nil
+// IsCooperative returns false to indicate that the balancer is not cooperative.
+func (b *CustomGroupBalancer) IsCooperative() bool {


Add a comment explaining why we are not using it.

In Kafka, when isCooperative is set to false, any rebalance causes all consumers to first have their partitions revoked before receiving new assignments. However, franz-go provides three rebalance callbacks. During experimentation, I observed that OnPartitionsAssigned is triggered for all consumers on every rebalance, regardless of whether any partitions are assigned to that particular consumer, even when cooperative mode is set to true. Because of this behavior, either approach works for our use case, as the callback is still invoked during rebalances. So no comment is needed here, since this is a quirk on the franz-go side.

hash-data · 2026-06-11T09:36:29Z


 	// number of consumers to use
-	consumerIDCount := min(b.requiredConsumerIDs, len(members))
+	consumerCount := min(b.requiredConsumerIDs, len(members))


why this statement?

Yes, it is not required. It was only there because it came from staging, so I had not removed it. I have removed it now.

hash-data · 2026-06-12T04:38:42Z

+		// Exit gracefully when a rebalance is detected via assign/revoke callbacks.
+		onRebalance := func(_ context.Context, client *kgo.Client, _ map[string][]int32) {
+			if r.RebalanceDetected(client) {
+				r.exitMode.Store(gracefulExit)


This will create a problem in MO.

We discussed this. No change is needed now because the rebalance detection callbacks are only active in StreamChanges, not in preCDC.

hash-data · 2026-06-12T04:42:10Z

+			// generation id -1 means not yet joined
+			// mismatch means readers are on different generations, partition assignment not yet completed
+			if currentReaderGenerationID < 0 || (expectedGenerationID >= 0 && expectedGenerationID != currentReaderGenerationID) {
+				allReadersJoined = false
+				break


need to understand again

// waitForPartitionAssignment blocks until Kafka completes partition assignment
// for all readers in the consumer group.
func (r *ReaderManager) waitForPartitionAssignment(ctx context.Context) error {
	joinCtx, cancel := context.WithTimeout(ctx, 2*time.Minute)
	defer cancel()
	for {
		select {
		case <-joinCtx.Done():
			return fmt.Errorf("timed out waiting for partition assignment on consumer group %s: %s", r.config.ConsumerGroupID, joinCtx.Err())
		case <-time.After(2 * time.Second):
			var (
				allReadersJoined           = true
				expectedGenerationID int32 = -1
			)
			for _, kafkaReader := range r.readers {
				_, currentReaderGenerationID := kafkaReader.reader.GroupMetadata()
				// generation id -1 means not yet joined
				// mismatch means readers are on different generations, partition assignment not yet completed
				if currentReaderGenerationID < 0 || (expectedGenerationID >= 0 && expectedGenerationID != currentReaderGenerationID) {
					allReadersJoined = false
					break
				}
				if expectedGenerationID < 0 {
					expectedGenerationID = currentReaderGenerationID
				}
			}
			if allReadersJoined {
				r.generationID.Store(expectedGenerationID)
				// brief wait to let partition assignment fully propagate before fetching starts.
				time.Sleep(2 * time.Second)
				logger.Infof("consumer group %s stable: all readers assigned, generation id: %d", r.config.ConsumerGroupID, expectedGenerationID)
				return nil
			}
		}
	}
}

Previously, on the first run it would first check every reader's readerId and then wait for 500 ms. I have changed it so that it now waits 500 ms first and then starts checking every reader's readerId. Since the same pattern is used in other parts of the code and this is not a big change, I went ahead with it.
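To make the generation check in the loop above easier to follow, it can be read as a pure function over the generation IDs reported by the readers. `commonGeneration` is a hypothetical extraction written for this explanation, not code from the PR.

```go
package main

import "fmt"

// commonGeneration is a hypothetical extraction of the per-reader check.
// It returns the shared generation ID and true only when every reader has
// joined (ID >= 0) and all readers report the same generation.
// With no readers it returns (-1, true), matching the loop it mirrors.
func commonGeneration(generationIDs []int32) (int32, bool) {
	expected := int32(-1)
	for _, id := range generationIDs {
		if id < 0 || (expected >= 0 && id != expected) {
			return -1, false
		}
		expected = id
	}
	return expected, true
}

func main() {
	fmt.Println(commonGeneration([]int32{3, 3, 3})) // 3 true
	fmt.Println(commonGeneration([]int32{3, -1}))   // -1 false: one reader has not joined
	fmt.Println(commonGeneration([]int32{3, 4}))    // -1 false: readers disagree
}
```

The wait loop keeps polling until this condition holds for every reader, then stores the shared generation.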

hash-data · 2026-06-12T04:51:08Z

 		// get current partition metadata and key
 		currentPartitionKey := types.PartitionKey{Topic: record.Message.Topic, Partition: record.Message.Partition}
-		currentPartitionMeta, exists := k.readerManager.GetPartitionIndex(fmt.Sprintf("%s:%d", record.Message.Topic, record.Message.Partition))
+		currentPartitionMeta, exists := k.readerManager.GetPartitionIndex(kafkapkg.PartitionIndexKey(record.Message.Topic, record.Message.Partition))


The function name does not seem to convey what it actually does.
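For context, the removed line built the key with `fmt.Sprintf("%s:%d", ...)`, so the helper presumably does the same. The sketch below is a hypothetical stand-in for `kafkapkg.PartitionIndexKey`; the `int32` partition type is an assumption.

```go
package main

import "fmt"

// partitionIndexKey is a hypothetical stand-in for kafkapkg.PartitionIndexKey.
// It builds the "topic:partition" lookup key that the removed fmt.Sprintf
// call produced. The int32 partition type is an assumption.
func partitionIndexKey(topic string, partition int32) string {
	return fmt.Sprintf("%s:%d", topic, partition)
}

func main() {
	fmt.Println(partitionIndexKey("orders", 3)) // orders:3
}
```

Seen this way, the function builds a key rather than looking up an index, which may be the source of the naming concern.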

hash-data · 2026-06-12T04:53:11Z

-		if err != nil {
-			return fmt.Errorf("error reading message in Kafka CDC sync: %s", err)
+		// checked before every poll and every record so a rebalance signal is never delayed by full batch processing.
+		if stopProcessing, err := k.readerManager.FetchExitState(); stopProcessing {


We should use the context here as well.

hash-data · 2026-06-12T04:55:17Z


 		// Type assert and validate messages
-		lastMessages, isValid := lastMessagesMeta.(map[types.PartitionKey]kafka.Message)
+		lastMessages, isValid := lastMessagesMeta.(map[types.PartitionKey]*kgo.Record)


Shouldn't it fail?
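If the question is whether a value stored with the old message type breaks this assertion, the comma-ok form returns `false` instead of panicking. The sketch below uses hypothetical stand-in types (`oldMessage`, `newRecord`) and string keys in place of `kafka.Message`, `kgo.Record`, and `types.PartitionKey`.

```go
package main

import "fmt"

// oldMessage and newRecord are hypothetical stand-ins for kafka.Message
// and kgo.Record.
type oldMessage struct{ Offset int64 }
type newRecord struct{ Offset int64 }

// asNewRecords uses the comma-ok form, so a value of the wrong dynamic type
// yields ok == false rather than a panic.
func asNewRecords(meta any) (map[string]*newRecord, bool) {
	records, ok := meta.(map[string]*newRecord)
	return records, ok
}

func main() {
	var stale any = map[string]oldMessage{"orders:0": {Offset: 42}}
	var fresh any = map[string]*newRecord{"orders:0": {Offset: 42}}

	_, ok := asNewRecords(stale)
	fmt.Println(ok) // false
	_, ok = asNewRecords(fresh)
	fmt.Println(ok) // true
}
```

Whether the caller should treat `ok == false` as an error is a separate decision that this sketch does not make.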

saksham-datazip · 2026-06-12T05:53:05Z

Nice Code

Thanks

saksham-datazip added 2 commits May 18, 2026 23:51

chore: custom-balancer

a2e63f9

chore: added rebalancing detection

8450f96

saksham-datazip requested a deployment to integration_tests May 20, 2026 10:37 — with GitHub Actions Waiting

Merge branch 'staging' into feat/kafka-driver-change

f75bf6e

saksham-datazip temporarily deployed to integration_tests May 20, 2026 10:37 — with GitHub Actions Inactive

saksham-datazip had a problem deploying to integration_tests May 20, 2026 10:37 — with GitHub Actions Failure

saksham-datazip changed the title ~~Feat/kafka driver change~~ feat/kafka driver change May 20, 2026

saksham-datazip changed the title ~~feat/kafka driver change~~ feat/ kafka driver change May 20, 2026

saksham-datazip changed the title ~~feat/ kafka driver change~~ feat: kafka driver change May 20, 2026

saksham-datazip had a problem deploying to integration_tests May 20, 2026 13:26 — with GitHub Actions Failure

chore: added-rebalancing-detection

507d220

saksham-datazip requested a deployment to integration_tests May 20, 2026 18:50 — with GitHub Actions Waiting

chore: self-reviewed-1

4b2fd9e

saksham-datazip had a problem deploying to integration_tests May 21, 2026 19:14 — with GitHub Actions Failure

saksham-datazip temporarily deployed to integration_tests May 21, 2026 19:14 — with GitHub Actions Inactive

chore: resolved-lint-error

8ea52d2

saksham-datazip temporarily deployed to integration_tests May 21, 2026 19:30 — with GitHub Actions Inactive

saksham-datazip had a problem deploying to integration_tests May 21, 2026 19:30 — with GitHub Actions Failure

chore: refractored-warmupConsumerGroup-function

5bb0008

saksham-datazip had a problem deploying to integration_tests May 22, 2026 04:40 — with GitHub Actions Failure

saksham-datazip temporarily deployed to integration_tests May 22, 2026 04:40 — with GitHub Actions Inactive

chore: improved-comment

9e30f71

saksham-datazip requested a deployment to integration_tests May 22, 2026 04:48 — with GitHub Actions Waiting

chore: resolved-2

bcecada

saksham-datazip had a problem deploying to integration_tests May 22, 2026 09:16 — with GitHub Actions Error

saksham-datazip temporarily deployed to integration_tests May 22, 2026 09:16 — with GitHub Actions Inactive

saksham-datazip had a problem deploying to integration_tests May 22, 2026 11:10 — with GitHub Actions Failure

saksham-datazip temporarily deployed to integration_tests June 5, 2026 10:33 — with GitHub Actions Inactive

Merge branch 'staging' into feat/kafka-driver-change

28f87ff

saksham-datazip temporarily deployed to integration_tests June 5, 2026 11:58 — with GitHub Actions Inactive

saksham-datazip requested a deployment to integration_tests June 5, 2026 11:58 — with GitHub Actions Waiting

Merge branch 'staging' into feat/kafka-driver-change

b80c8ff

saksham-datazip requested a deployment to integration_tests June 5, 2026 12:17 — with GitHub Actions Waiting

saksham-datazip temporarily deployed to integration_tests June 5, 2026 12:17 — with GitHub Actions Inactive

chore: resolved-comments-3

f2012cb

saksham-datazip temporarily deployed to integration_tests June 6, 2026 09:31 — with GitHub Actions Inactive

vikaxsh reviewed Jun 8, 2026

Comment thread pkg/kafka/reader.go

Comment thread drivers/kafka/internal/cdc.go

chore: skipped-GroupIDNotFound

964e19b

saksham-datazip had a problem deploying to integration_tests June 8, 2026 12:47 — with GitHub Actions Error

saksham-datazip temporarily deployed to integration_tests June 8, 2026 12:47 — with GitHub Actions Inactive

saksham-datazip requested a deployment to integration_tests June 9, 2026 06:39 — with GitHub Actions Waiting

Merge branch 'staging' into feat/kafka-driver-change

1a574e8

saksham-datazip had a problem deploying to integration_tests June 9, 2026 11:47 — with GitHub Actions Error

saksham-datazip temporarily deployed to integration_tests June 9, 2026 11:47 — with GitHub Actions Inactive

saksham-datazip temporarily deployed to integration_tests June 9, 2026 13:37 — with GitHub Actions Inactive

hash-data reviewed Jun 12, 2026

Merge branch 'staging' into feat/kafka-driver-change

f280872

saksham-datazip requested a deployment to integration_tests June 13, 2026 07:42 — with GitHub Actions Waiting

Merge branch 'staging' into feat/kafka-driver-change

83106a9

saksham-datazip requested a deployment to integration_tests June 15, 2026 07:46 — with GitHub Actions Waiting

chore: refractored-code

d0cfcf9

saksham-datazip temporarily deployed to integration_tests June 15, 2026 08:32 — with GitHub Actions Inactive

