What did you do?
I investigated the owner warning below in the current codebase:

> the newPullerResolvedTs should not be smaller than c.pullerResolvedTs

The warning is emitted in `cdc/owner/changefeed.go` when `watermark.PullerResolvedTs < c.pullerResolvedTs`.
What did you expect to see?
Either:
- `PullerResolvedTs` is defined as a monotonic value end-to-end, so this warning should never fire in normal scheduling paths, or
- `PullerResolvedTs` is a snapshot of the current scheduler state, in which case the owner should not assume that it can only increase.
What did you see instead?
The owner keeps `c.pullerResolvedTs` as a "grow-only" value, but the scheduler computes `watermark.PullerResolvedTs` as the current minimum `puller-egress` resolved ts among all replication sets.
That scheduler-side value can legitimately become smaller in normal cases, for example:
- A new table/span is added and its stage checkpoints are initialized from a lower checkpoint ts.
- A table is rescheduled / recovered after capture failure, so the puller subscription is recreated and the new stage stats start from a lower value.
- A paused/resumed or recovered changefeed reuses the same owner-side `changefeed` instance while the cached `pullerResolvedTs` from the previous run is still kept in memory.
As a result, the warning can be emitted even though there is no real timestamp regression bug in the puller.
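The regression in the second and third cases is inherent to taking a minimum over a changing set. A minimal standalone Go sketch (hypothetical table names and timestamps, not tiflow code) showing how adding one table with a lower stage checkpoint pulls the recomputed minimum backward:

```go
package main

import "fmt"

// minResolvedTs mimics how the scheduler recomputes the watermark each tick:
// the minimum puller-egress resolved ts over the *current* table set.
func minResolvedTs(tables map[string]uint64) uint64 {
	min, first := uint64(0), true
	for _, ts := range tables {
		if first || ts < min {
			min, first = ts, false
		}
	}
	return min
}

func main() {
	tables := map[string]uint64{"t1": 100, "t2": 105}
	fmt.Println(minResolvedTs(tables)) // 100

	// A new table is added whose stage checkpoints are initialized from a
	// lower checkpoint ts: the recomputed minimum legitimately goes backward.
	tables["t3"] = 90
	fmt.Println(minResolvedTs(tables)) // 90
}
```

No individual table's progress regressed here; only the membership of the set changed.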
Root cause analysis
There is a semantic mismatch between owner and scheduler:
- In `cdc/owner/changefeed.go`, `c.pullerResolvedTs` is updated as a monotonic cached value:
  - increase: assign
  - decrease: warn only
- In `cdc/scheduler/internal/v3/replication/replication_manager.go`, `watermark.PullerResolvedTs` is recalculated every tick as `min(table.Stats.StageCheckpoints["puller-egress"].ResolvedTs)` over the current table set.

That minimum is not monotonic by definition.
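The owner-side rule can be condensed into a few lines. This is a simplified sketch of the update semantics described above, not the actual tiflow code; the grow-only cache keeps its stale value and only emits the warning when a fresh (and legitimately smaller) scheduler snapshot arrives:

```go
package main

import "fmt"

// ownerPullerResolvedTs mimics the owner-side update rule: the cached value
// only grows, and a smaller snapshot merely triggers the warning while the
// stale cached value is kept. Simplified sketch, not the tiflow implementation.
func ownerPullerResolvedTs(cached, snapshot uint64) (newCached uint64, warned bool) {
	if snapshot < cached {
		// "the newPullerResolvedTs should not be smaller than c.pullerResolvedTs"
		return cached, true
	}
	return snapshot, false
}

func main() {
	cached := uint64(0)
	// Third snapshot drops to 90 (e.g. a new table joined with a lower ts):
	// the warning fires even though nothing actually regressed.
	for _, snap := range []uint64{100, 105, 90} {
		var warned bool
		cached, warned = ownerPullerResolvedTs(cached, snap)
		fmt.Println(cached, warned)
	}
}
```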
Two implementation details make the problem easier to hit:
- `ReplicationSet.Checkpoint` is merged monotonically, but `ReplicationSet.Stats` is replaced as a whole when fresh stats are reported. So a recreated table can keep a non-regressing table checkpoint while still reporting a lower `puller-egress` stage checkpoint.
- `releaseResources` resets `resolvedTs`, but it does not reset `lastSyncedTs` or `pullerResolvedTs`, even though the `changefeed` struct is reused on restart/resume.
This means the warning is triggered by an invalid monotonicity assumption in owner, not necessarily by corrupted puller progress.
Why this matters
This is not just a noisy warning:
- It can mislead operators into thinking the puller resolved ts regressed unexpectedly.
- `QueryChangeFeedSyncedStatus` also exposes `cfReactor.pullerResolvedTs`, so a stale monotonic cache can diverge from the current scheduler snapshot used by the rest of the system.
Suggested direction
One of these semantics should be chosen explicitly:
- Treat owner-side `pullerResolvedTs` as the latest scheduler snapshot and allow it to go backward.
- Keep a separate monotonic field for synced-status style reporting, but do not compare it directly with the current scheduler minimum and do not warn on snapshot regression.
Additionally, `pullerResolvedTs` and `lastSyncedTs` should probably be reset when reusing a `changefeed` instance during resume/reinitialize.
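The reset direction could look roughly like the sketch below. The struct is a hypothetical condensation (field names taken from this report, everything else elided); it is meant to show the intent, not to be a patch against tiflow:

```go
package main

import "fmt"

// Hypothetical, condensed changefeed struct; only the fields discussed in
// this report are shown.
type changefeed struct {
	resolvedTs       uint64
	lastSyncedTs     uint64
	pullerResolvedTs uint64
}

// releaseResources already resets resolvedTs in the real code; the suggestion
// is to also clear lastSyncedTs and pullerResolvedTs, so that a resumed
// changefeed does not compare fresh scheduler snapshots against stale cached
// values from the previous run.
func (c *changefeed) releaseResources() {
	c.resolvedTs = 0
	c.lastSyncedTs = 0
	c.pullerResolvedTs = 0
}

func main() {
	c := &changefeed{resolvedTs: 100, lastSyncedTs: 95, pullerResolvedTs: 90}
	c.releaseResources()
	fmt.Println(c.resolvedTs, c.lastSyncedTs, c.pullerResolvedTs) // 0 0 0
}
```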
Version
Observed by code inspection on current master-equivalent logic in local checkout `cf888cb0f26881469dd25307dfa721eea91c0c03`.