owner: invalid monotonicity assumption on pullerResolvedTs emits false warnings #12598

@3AceShowHand

What did you do?

I investigated the owner warning below in the current codebase:

the newPullerResolvedTs should not be smaller than c.pullerResolvedTs

The warning is emitted in cdc/owner/changefeed.go when
watermark.PullerResolvedTs < c.pullerResolvedTs.

What did you expect to see?

Either:

  1. PullerResolvedTs is defined as a monotonic value end-to-end, so this warning should never fire in normal scheduling paths, or
  2. if PullerResolvedTs is a snapshot of the current scheduler state, then the owner should not assume that it can only increase.

What did you see instead?

The owner keeps c.pullerResolvedTs as a "grow-only" value, but the scheduler computes
watermark.PullerResolvedTs as the current minimum puller-egress resolved ts among all replication sets.

That scheduler-side value can legitimately become smaller in normal cases, for example:

  1. A new table/span is added and its stage checkpoints are initialized from a lower checkpoint ts.
  2. A table is rescheduled / recovered after capture failure, so the puller subscription is recreated and the new stage stats start from a lower value.
  3. A paused/resumed or recovered changefeed reuses the same owner-side changefeed instance while the cached pullerResolvedTs from the previous run is still kept in memory.

As a result, the warning can be emitted even though there is no real timestamp regression bug in the puller.
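
To make the first case concrete, here is a minimal sketch of a minimum taken over the current table set. The function name, the table IDs, and the timestamps are made up for illustration and are not taken from the tiflow code; the only point is that adding a member with a lower value lowers the minimum.

```go
package main

import (
	"fmt"
	"math"
)

// minResolvedTs returns the smallest puller-egress resolved ts across the
// tables that exist right now. It is a snapshot, not a running maximum.
// Returning math.MaxUint64 for an empty set is an arbitrary choice made
// for this sketch.
func minResolvedTs(tables map[int64]uint64) uint64 {
	lowest := uint64(math.MaxUint64)
	for _, ts := range tables {
		if ts < lowest {
			lowest = ts
		}
	}
	return lowest
}

func main() {
	tables := map[int64]uint64{1: 120, 2: 130}
	before := minResolvedTs(tables)

	// A new table is added and starts from a lower checkpoint ts.
	tables[3] = 100
	after := minResolvedTs(tables)

	fmt.Println(before, after) // 120 100
}
```

No table moved backward here, yet the aggregate did.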

Root cause analysis

There is a semantic mismatch between owner and scheduler:

  • In cdc/owner/changefeed.go, c.pullerResolvedTs is updated as a monotonic cached value:
    • increase: assign
    • decrease: warn only
  • In cdc/scheduler/internal/v3/replication/replication_manager.go, watermark.PullerResolvedTs is recalculated every tick as:
    • min(table.Stats.StageCheckpoints["puller-egress"].ResolvedTs) over the current table set

That minimum is not monotonic by definition.
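
The owner-side rule described above (assign on increase, warn on decrease) can be sketched as follows. The type and method names are hypothetical stand-ins, not the actual code in cdc/owner/changefeed.go, and the warning is modelled as a boolean return value instead of a log call.

```go
package main

import "fmt"

// ownerCache is a hypothetical stand-in for the owner-side cached value.
type ownerCache struct {
	pullerResolvedTs uint64
}

// update applies the grow-only rule: assign on increase, and on decrease
// report a warning while leaving the cached value unchanged.
func (c *ownerCache) update(snapshot uint64) (warned bool) {
	if snapshot > c.pullerResolvedTs {
		c.pullerResolvedTs = snapshot
		return false
	}
	return snapshot < c.pullerResolvedTs
}

func main() {
	c := &ownerCache{}
	// The third snapshot is lower, e.g. because a new table joined the set.
	for _, snapshot := range []uint64{100, 120, 90, 125} {
		warned := c.update(snapshot)
		fmt.Printf("snapshot=%d cached=%d warned=%v\n",
			snapshot, c.pullerResolvedTs, warned)
	}
}
```

When the snapshot drops to 90, the cache stays at 120 and the warning fires, even though the input is a legitimate scheduler minimum.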

Two implementation details make the problem easier to hit:

  1. ReplicationSet.Checkpoint is merged monotonically, but ReplicationSet.Stats is replaced as a whole when fresh stats are reported. So a recreated table can keep a non-regressing table checkpoint while still reporting a lower puller-egress stage checkpoint.
  2. releaseResources resets resolvedTs, but it does not reset lastSyncedTs or pullerResolvedTs, even though the changefeed struct is reused on restart/resume.
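
The first detail can be illustrated with a simplified model in which the table checkpoint is merged with a max while the per-stage stats are replaced wholesale. All type, field, and method names below are assumptions made for the sketch and do not mirror the real ReplicationSet definition.

```go
package main

import "fmt"

// report is what a capture sends for one table in this sketch.
type report struct {
	checkpointTs  uint64
	stageResolved map[string]uint64
}

// replicationSet is a simplified, hypothetical model of one table's state.
type replicationSet struct {
	checkpointTs  uint64            // merged monotonically
	stageResolved map[string]uint64 // replaced as a whole on each report
}

// apply merges the checkpoint with a max but overwrites the stage stats.
func (r *replicationSet) apply(rep report) {
	if rep.checkpointTs > r.checkpointTs {
		r.checkpointTs = rep.checkpointTs
	}
	r.stageResolved = rep.stageResolved
}

func main() {
	rs := &replicationSet{}
	rs.apply(report{
		checkpointTs:  110,
		stageResolved: map[string]uint64{"puller-egress": 150},
	})

	// The table is recreated on another capture and its puller restarts.
	rs.apply(report{
		checkpointTs:  100,
		stageResolved: map[string]uint64{"puller-egress": 112},
	})

	fmt.Println(rs.checkpointTs, rs.stageResolved["puller-egress"]) // 110 112
}
```

The checkpoint never regresses, but the stage value that feeds the scheduler minimum drops from 150 to 112.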

This means the warning is triggered by an invalid monotonicity assumption in the owner, not necessarily by corrupted puller progress.

Why this matters

This is not just a noisy warning:

  • it can mislead operators into thinking the puller resolved ts regressed unexpectedly
  • QueryChangeFeedSyncedStatus also exposes cfReactor.pullerResolvedTs, so a stale monotonic cache can diverge from the current scheduler snapshot used by the system

Suggested direction

One of these semantics should be chosen explicitly:

  1. Treat owner-side pullerResolvedTs as the latest scheduler snapshot and allow it to go backward.
  2. Keep a separate monotonic field for synced-status style reporting, but do not compare it directly with the current scheduler minimum and do not warn on snapshot regression.

Additionally, pullerResolvedTs and lastSyncedTs should probably be reset when reusing a changefeed instance during resume/reinitialize.
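
One possible shape for this, purely as a sketch: keep the latest snapshot and a separate high-water mark side by side, and clear both on reuse. The type, fields, and methods are hypothetical, and this combines both options only to show how they differ; it is not a proposed patch.

```go
package main

import "fmt"

// pullerProgress separates the two meanings discussed above.
type pullerProgress struct {
	snapshot  uint64 // latest scheduler value; allowed to go backward
	highWater uint64 // largest value seen in this run; never decreases
}

// observe records a new scheduler snapshot without warning on regression.
func (p *pullerProgress) observe(snapshot uint64) {
	p.snapshot = snapshot
	if snapshot > p.highWater {
		p.highWater = snapshot
	}
}

// reset clears both values when the changefeed instance is reused,
// for example on resume or reinitialize.
func (p *pullerProgress) reset() {
	*p = pullerProgress{}
}

func main() {
	p := &pullerProgress{}
	p.observe(120)
	p.observe(90)
	fmt.Println(p.snapshot, p.highWater) // 90 120

	p.reset()
	fmt.Println(p.snapshot, p.highWater) // 0 0
}
```

Which of the two fields a caller such as QueryChangeFeedSyncedStatus should read is the semantic decision this issue asks for.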

Version

Observed by code inspection on current master-equivalent logic in local checkout cf888cb0f26881469dd25307dfa721eea91c0c03.

Labels: type/enhancement