
Conversation

@jshearer
Contributor

@jshearer jshearer commented Nov 11, 2025

Problem

Dekaf does not currently handle materialization backfills correctly. This means that when a collection is reset via dataflow reset (which is now the default capture backfill mode), the corresponding topic gets into a broken state. Specifically, the consumer doesn't realize that its offsets have been invalidated and keeps trying to read from the previously committed offset, which is likely beyond the new collection's write head. As a result, no more data will show up in the Dekaf topic until the new collection sees at least as much data as was written to the previous collection.

In order to fix this, as well as allow for "regular" Dekaf bindings (topics) to be backfilled, we need to figure out a way to signal to consumers that their committed offsets are invalid, and they should start over from the beginning of the new collection.

Background: Kafka leader epochs

Kafka uses leader epochs to handle partition leadership changes and log truncation. When leadership changes, the epoch increments. Consumers include their expected epoch, which they discover as part of topic metadata, in fetch requests. If the provided epoch is stale, the broker returns FENCED_LEADER_EPOCH. Consumers then call OffsetForLeaderEpoch with the previous epoch value to determine where that epoch ended, compare it to their position, and reset if needed.

We already have a value that conveniently maps to this concept: a materialization binding's backfill counter. The control-plane already ensures that when a collection is reset, all bindings referring to it also get backfilled, so "all" we have to do is map the binding's backfill counter to Kafka's leader epoch!
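
In code, that mapping is essentially a cast. A minimal sketch, using an illustrative Binding type rather than the real spec struct:

// Illustrative stand-in for a materialization binding spec; not the real Flow type.
struct Binding {
    // Incremented by the control plane whenever the binding is backfilled,
    // including when its source collection is reset.
    backfill: u32,
}

// Kafka expresses leader epochs as i32, so the backfill counter maps across directly.
fn leader_epoch(binding: &Binding) -> i32 {
    binding.backfill as i32
}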

Implementation

Metadata

In order for us to be able to compare the consumer's leader epoch against the current spec's binding backfill counter, the consumer needs to be told the epoch that corresponds to the metadata it discovers. As such, we emit the current_leader_epoch in Metadata and ListOffsets responses.

Consumers cache the epoch alongside offsets as (offset, leader_epoch) pairs. ListOffsets also validates the current_leader_epoch from the request and returns FENCED_LEADER_EPOCH if it is stale.
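
For concreteness, a sketch of the ListOffsets partition response shape described above, using simplified stand-in types rather than the real kafka-protocol structs:

// Simplified stand-in for a ListOffsets partition response.
struct ListOffsetsPartition {
    partition_index: i32,
    // None = success; otherwise the error to surface to the consumer.
    error: Option<EpochError>,
    offset: i64,
    // Advertised so the consumer caches (offset, leader_epoch) together.
    leader_epoch: i32,
}

enum EpochError {
    FencedLeaderEpoch,
}

fn list_offsets_partition(
    partition: i32,
    requested_epoch: Option<i32>, // current_leader_epoch from the request, if sent
    current_epoch: i32,           // the binding's backfill counter
    resolved_offset: i64,
) -> ListOffsetsPartition {
    // A stale current_leader_epoch means the consumer's cached metadata predates
    // the backfill: fence it rather than hand back offsets it would misinterpret.
    if matches!(requested_epoch, Some(e) if e < current_epoch) {
        return ListOffsetsPartition {
            partition_index: partition,
            error: Some(EpochError::FencedLeaderEpoch),
            offset: -1,
            leader_epoch: current_epoch,
        };
    }
    ListOffsetsPartition {
        partition_index: partition,
        error: None,
        offset: resolved_offset,
        leader_epoch: current_epoch,
    }
}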

Fetching

Consumers include current_leader_epoch in fetch requests. We validate this against the current spec's backfill counter (a sketch of the comparison follows the three cases below):

If request.current_leader_epoch < collection.binding_backfill_counter

Return FENCED_LEADER_EPOCH along with the now-current leader epoch. The consumer then calls OffsetForLeaderEpoch with its old epoch, gets a new high-watermark of 0, which looks like a truncation (since it is), and triggers its default auto.offset.reset behavior (either earliest or latest, depending on whether it's configured to start from the beginning or only tail recent events).

TODO: Actually test with a consumer set to latest and see what it does.

If request.current_leader_epoch == collection.binding_backfill_counter

Continue to serve the fetch.

If request.current_leader_epoch > collection.binding_backfill_counter

Error with UNKNOWN_LEADER_EPOCH. This causes the consumer to do a metadata refresh.
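
A minimal sketch of this comparison, with simplified types (variant names mirror Kafka's FENCED_LEADER_EPOCH and UNKNOWN_LEADER_EPOCH errors; the real handler operates on the actual fetch request/response structs):

use std::cmp::Ordering;

// Outcome of validating a fetch request's epoch against the binding's backfill counter.
enum FetchEpochOutcome {
    // Epochs match: serve the fetch normally.
    Serve,
    // Consumer's epoch is older than the backfill counter: it must re-resolve
    // its position via OffsetForLeaderEpoch. We also advertise the current epoch.
    Fenced { current_epoch: i32 },
    // Consumer claims an epoch we haven't reached yet: it should refresh metadata.
    UnknownEpoch { current_epoch: i32 },
}

// Note: Kafka clients send current_leader_epoch = -1 when they have no epoch to
// assert; handling that sentinel is omitted here.
fn validate_fetch_epoch(request_epoch: i32, backfill_counter: i32) -> FetchEpochOutcome {
    match request_epoch.cmp(&backfill_counter) {
        Ordering::Less => FetchEpochOutcome::Fenced { current_epoch: backfill_counter },
        Ordering::Equal => FetchEpochOutcome::Serve,
        Ordering::Greater => FetchEpochOutcome::UnknownEpoch { current_epoch: backfill_counter },
    }
}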

OffsetForLeaderEpoch

After receiving FENCED_LEADER_EPOCH, consumers call OffsetForLeaderEpoch to discover where their previous epoch ended. They then start reading the new epoch from that offset. In our case, we want consumers to start over from the beginning, so if the requested epoch is less than the current binding backfill counter, we advertise 0 here.

TODO: Should we really fetch the journal's earliest available offset and advertise that instead of 0? I didn't test with journals that are missing early fragments..
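
A sketch of that decision for old versus current epochs; whether 0 or the journal's earliest available offset is the right value for old epochs is exactly the TODO above:

// What we advertise as the end offset of a requested epoch (stand-in types,
// not the real handler signature).
struct EpochEndOffset {
    leader_epoch: i32,
    end_offset: i64,
}

fn offset_for_leader_epoch(
    requested_epoch: i32,
    current_epoch: i32, // the binding's backfill counter
    high_watermark: i64,
) -> EpochEndOffset {
    if requested_epoch < current_epoch {
        // The old epoch "ended" at offset 0, so everything the consumer read in it
        // looks truncated and it falls back to its auto.offset.reset policy.
        EpochEndOffset { leader_epoch: current_epoch, end_offset: 0 }
    } else {
        // Current epoch: it ends at the present high watermark.
        EpochEndOffset { leader_epoch: requested_epoch, end_offset: high_watermark }
    }
}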

Committed offset segmentation

Normally, Kafka brokers store and return the correct committed offsets by leader epoch. But since we're co-opting this feature for our own use, we need to handle storing and fetching the correct committed offset for a given epoch (binding backfill counter) ourselves. We do this by appending the backfill counter to the group ID we use when proxying requests to upstream Kafka for JoinGroup, SyncGroup, OffsetCommit, OffsetFetch, etc.

to_upstream_topic_name(name, secret, token, Some(backfill_counter))

Produces epoch-qualified keys (sketched below):

  • Epoch 0: <encrypted>-e0
  • Epoch 1: <encrypted>-e1
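
The suffixing itself is straightforward. A sketch of the key shape, where encrypt_name is a hypothetical placeholder for the real name encryption inside to_upstream_topic_name:

// Hypothetical placeholder for the real encryption of the upstream name;
// only the nonce choice and epoch suffix are the point here.
fn encrypt_name(name: &str, _secret: &str, nonce: &str) -> String {
    format!("<encrypted:{}:{}>", name, nonce)
}

fn upstream_topic_name(
    name: &str,
    secret: &str,
    token: &str,
    task_name: &str,
    backfill_counter: Option<u32>,
) -> String {
    match backfill_counter {
        // Epoch-qualified keys: the nonce is the task name (isolating offsets per
        // task) and the backfill counter is appended, e.g. "<encrypted>-e1".
        Some(epoch) => format!("{}-e{}", encrypt_name(name, secret, task_name), epoch),
        // Legacy keys used the shared token as the nonce; these are still consulted
        // by the OffsetFetch fallback during migration.
        None => encrypt_name(name, secret, token),
    }
}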

Upgrade Path

Committed offset isolation

Previously, all topics for tasks sharing the same token would commit offsets to the same upstream Kafka topic name. This created the possibility for conflicts when multiple bindings across different tasks and tenants used the same topic name and token.

Since we're already introducing a group ID upgrade path here, we might as well also address this by swapping the nonce from the token to the task name. This closes #2083

Migration Flow

First fetch after upgrade (a sketch of the fallback logic follows these steps):

  1. Consumer connects, discovers epoch in metadata
  2. Calls OffsetFetch with task_name + epoch, gets empty response
  3. OffsetFetch falls back and tries the previous topic name, finds old offsets
  4. Consumer continues from previous position
  5. Consumer commits its offsets including the leader epoch
    • Once this commit succeeds, we clear out the old offsets
  6. Next OffsetFetch tries task_name + epoch and finds new offsets, no longer falls back.
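
A sketch of steps 2–6 as a fallback in the OffsetFetch path, with hypothetical helper names standing in for the real upstream proxy calls:

// Hypothetical facade over the upstream Kafka group-offset storage; the real
// implementation proxies OffsetFetch / OffsetCommit / OffsetDelete upstream.
trait OffsetStore {
    fn fetch(&self, upstream_key: &str) -> Option<i64>;
    fn delete(&mut self, upstream_key: &str);
}

// Prefer the epoch-qualified (task_name + epoch) key; fall back to the legacy key
// so an upgrading consumer resumes from its previous position. The bool signals
// whether the offset came from the legacy key.
fn fetch_committed_offset(
    store: &impl OffsetStore,
    epoch_key: &str,
    legacy_key: &str,
) -> Option<(i64, bool)> {
    if let Some(offset) = store.fetch(epoch_key) {
        return Some((offset, false)); // already migrated; no fallback needed
    }
    store.fetch(legacy_key).map(|offset| (offset, true))
}

// Once the first commit under the epoch-qualified key succeeds, clear the legacy
// offsets so the fallback stops firing.
fn on_epoch_commit_succeeded(store: &mut impl OffsetStore, legacy_key: &str) {
    store.delete(legacy_key);
}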

Closes #2376

@jshearer jshearer force-pushed the dekaf/collection_reset branch 4 times, most recently from c3ac061 to 9851e34 Compare November 13, 2025 00:53
@jshearer jshearer force-pushed the dekaf/collection_reset branch 12 times, most recently from 5b9ca90 to e9f3c44 Compare November 25, 2025 22:35
@jshearer jshearer force-pushed the dekaf/collection_reset branch from e9f3c44 to b6f79f4 Compare November 26, 2025 17:40
@jshearer jshearer changed the title dekaf: WIP collection reset support dekaf: Support collection reset and materialization binding backfill via Kafka leader epochs Nov 26, 2025
@jshearer jshearer requested a review from a team November 26, 2025 17:42
@jshearer jshearer marked this pull request as ready for review November 26, 2025 17:42
@jshearer
Contributor Author

jshearer commented Dec 2, 2025

Alright @jgraettinger, I've done a couple rounds of self-review here and I think it's time to get your feedback. I've tested this with kcat and Tinybird in all the scenarios I can think of: "old" consumer upgrading to epochs then backfilling, "new" consumer backfilling, real collection reset, starting from non-zero backfill counters, etc. Before deploying I also plan to test with SingleStore and Clickhouse, just to see how they handle epoch changes.

I acknowledge that this feels like adding even more complexity to some already pretty complicated logic. As we've talked about, my plan is to do a refactor of this codebase to split it up into much more manageable components, right after I write an e2e test suite so I can have some confidence that I'm not breaking behavior in subtle ways.

@jshearer jshearer requested a review from jgraettinger December 2, 2025 14:43
Member

@jgraettinger jgraettinger left a comment


LGTM

let response = collections
.into_iter()
.map(|(topic_name, offsets)| {
.map(|(topic_name, maybe_current_epoch, offsets)| {
Member


When is maybe_current_epoch None (vs, say, Some(0))?

Contributor Author

@jshearer jshearer Dec 4, 2025


None = binding not found, Some(0) = binding exists and has a backfill counter of 0. Now that you mention it, I don't love the return type of fetch_topic_offsets, I'm thinking it should be an enum rather than some combination of Options.
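
For illustration only, such an enum might look something like this (hypothetical shape and names, not what's in the PR):

enum TopicOffsets {
    // The binding exists: its current epoch (backfill counter) and per-partition offsets.
    Found { current_epoch: i32, offsets: Vec<(i32, i64)> },
    // No binding matches the requested topic name.
    BindingNotFound,
}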

current_epoch = collection.binding_backfill_counter,
"Consumer epoch is stale, skipping read start"
);
// Remove stale pending read if it exists. Error will be returned during poll phase
Member


nit: this action-at-a-distance is confusing / hard to follow. Is it possible to fail faster?

Contributor Author

@jshearer jshearer Dec 8, 2025


Yes it is, but not without a refactor that I don't want to do in this PR. Fundamentally, reading/fetching is an operation over futures: a fetch request for a new topic+partition results in a new Read, which progresses asynchronously and can be polled to see whether it has any data to return or not. In some circumstances, a read will short-circuit into an error state before even starting, or it will resolve into an error state after starting. All of this maps nicely onto the concept of a future, and is in fact the pattern that the implementation is currently using, even though we don't call it that.

I would like to implement a ReadManager that breaks this down into logic around starting new reads (all of the pre-validation that's currently done in fetch()), and logic around polling existing reads and dealing with their statuses. It looks something like this:

pub enum ReadError {
    Fenced { current_epoch: i32 },
    UnknownEpoch { current_epoch: i32 },
    EpochChangedDuringRead { old_epoch: i32, new_epoch: i32 },
    CollectionNotFound,
    PartitionNotFound,
    TaskRedirected,
    SessionIsPreviewOnly,
    Suspended,
}

pub struct ReadData {
    pub batch: Option<Bytes>,
    pub high_watermark: i64,
    pub last_stable_offset: i64,
    pub leader_epoch: i32,
}

pub enum ReadResult {
    Data(ReadData),
    Error(ReadError),
    Pending,
}

struct PendingRead {
    started_at: Instant,
    offset: i64,
    leader_epoch: i32,
    handle: AbortOnDropHandle<anyhow::Result<(Read, BatchResult)>>,
}

struct CompletedRead {
    read: Read,
    batch: BatchResult,
    leader_epoch: i32,
}

pub struct ReadManager {
    pending: HashMap<(TopicName, i32), PendingRead>,
    completed: HashMap<(TopicName, i32), CompletedRead>,
}

impl ReadManager {
    pub async fn start_reads(
        &mut self,
        requests: &[PartitionRequest]
    ) -> Result<()> { 
        ... 
    }

    pub async fn poll_reads(
        &mut self,
        requests: &[PartitionRequest],
        timeout: Duration
    ) -> Vec<ReadResult> { // Index-aligned with `requests`
         ... 
    }
}

But given the current complexity, I don't want to do that refactor before I have some tests that prove that the existing behavior doesn't regress with the refactor.


// Re-fetch collection to check if epoch changed during the read
let auth = self.auth.as_ref().unwrap();
let Some(collection) = Collection::new(&auth, &key.0).await? else {
Member


could we simplify by bubbling up a deleted journal as terminal, and then check epoch on the next session?

my sense is that once we've validated the current journals and reader are on the same epoch, we shouldn't be checking again unless something exceptional happens (e.g. a read journal is deleted out from under us, or we hit a maximum interval / JWT expiration we'll process for under the current configuration, etc).

@jshearer jshearer force-pushed the dekaf/collection_reset branch 4 times, most recently from a9ceedf to 830a78f Compare December 10, 2025 19:21
…ion reset

When a collection is reset or a materialization binding is backfilled, consumers
need to detect that their committed offsets are invalid. This maps the binding's
backfill counter to Kafka's leader epoch mechanism:

* Emit `leader_epoch` in Metadata and ListOffsets responses
* Validate consumer epoch in Fetch and ListOffsets, returning `FENCED_LEADER_EPOCH`
  for stale epochs and `UNKNOWN_LEADER_EPOCH` for future epochs
* Implement `OffsetForLeaderEpoch` API - returns offset 0 for old epochs (reset to
  beginning) and current high watermark for current epoch
* Append `-e{counter}` suffix to upstream topic names for offset isolation by epoch

Also isolate committed offsets by task name and clean up legacy offsets.
Previously, all topics sharing the same token would commit offsets to the same
upstream Kafka topic name, creating potential conflicts across tasks/tenants.

* Swap encryption nonce from token to task_name when epoch suffix is present
* Clean up old offsets after a successful epoch-qualified commit, i.e. only once the new commit succeeds
This shows up sometimes, which I believe is attributable to eventual consistency in the upstream Kafka brokers. It should be retried and not result in the session crashing.
@jshearer jshearer force-pushed the dekaf/collection_reset branch from 2f34760 to 5d38214 Compare December 16, 2025 14:41
There was a confusing behavior where the task manager would serve 0 journals for a deleted collection until the `SPEC_TTL` expired.


Development

Successfully merging this pull request may close these issues.

  • dekaf: Support collection reset and binding backfill via "leader epochs"
  • dekaf: Transform group IDs to prevent cross-task conflicts
