Skip to content

[#31220] CDC: Add cleanup_stale_cdc_streams yb-admin command#32025

Open
egladysh wants to merge 1 commit into
yugabyte:masterfrom
Shopify:gh-31220-cleanup-stale-cdc-streams
Open

[#31220] CDC: Add cleanup_stale_cdc_streams yb-admin command#32025
egladysh wants to merge 1 commit into
yugabyte:masterfrom
Shopify:gh-31220-cleanup-stale-cdc-streams

Conversation

@egladysh

@egladysh egladysh commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds a CleanupStaleCDCStreams RPC to the MasterReplication master service and a corresponding cleanup_stale_cdc_streams yb-admin command.

The command scans the cdc_state table and identifies stale entries — rows whose CDC stream no longer exists, or whose tablet no longer exists. It supports:

  • An optional YSQL database name to restrict cleanup to entries attributable to a single namespace. Entries whose stream and tablet metadata are both missing cannot be attributed to any namespace and are only removed in an unfiltered (all-namespaces) invocation.
  • A --dry_run flag to report stale entries without deleting them.

Example (dry run):

./bin/yb-admin \
    --master_addresses 127.0.0.1:7100 \
    --dry_run \
    cleanup_stale_cdc_streams yugabyte

Filtering stale CDC state entries for YSQL database: yugabyte

Found 2 stale cdc_state entries (dry run).
  tablet_id: 8f3b9b3a8b2c4e2b9c7d6a5f4e3d2c1b, stream_id: 3d9f1e2a4b6c4d8e9f0123456789abcd, table_id: 000033e8000030008000000000004000, table_name: orders, reason: stream not found
  tablet_id: missing_tablet_id, stream_id: 7a8b9c0d1e2f34567890abcdef123456, reason: tablet not found

Documentation for the new command is added under the yb-admin Change Data Capture commands in docs/content/stable/admin/yb-admin.md.

Implementation notes:

  • The cdc_state table is scanned once and materialized so both the tablet-collection and classification loops share one consistent snapshot.
  • Tablet table metadata is resolved under a single SharedLock before the classification loop to avoid repeated lock acquisitions.
  • xCluster streams that carry an empty namespace_id have their namespace resolved via their first table_id.

Upgrade/Rollback safety

The proto change is purely additive: new messages (CleanupStaleCDCStreamsRequestPB, CleanupStaleCDCStreamsResponsePB) and a new RPC method (CleanupStaleCDCStreams) are added to MasterReplication. No existing message fields are modified.

On a mixed-version cluster, an older master that does not have this RPC will return UNIMPLEMENTED when cleanup_stale_cdc_streams is invoked; all other CDC and xCluster operations are unaffected.

Rolling back removes the yb-admin command and the RPC handler. No on-disk state, catalog schema, or gflag defaults are changed.

Test plan

  • CDCServiceTest.TestCleanupStaleCDCStreamsWithoutCDCStateTable — verifies OBJECT_NOT_FOUND when the cdc_state table does not exist.
  • CDCServiceTest.TestCleanupStaleCDCStreamsDryRunAndDelete — dry-run returns stale entries without deleting; live run deletes exactly those entries and leaves the valid entry untouched.
  • CDCServiceTest.TestCleanupStaleCDCStreamsNamespaceFilter — with a namespace filter, only entries attributable to the selected namespace are deleted; cross-namespace and unattributable (both stream and tablet missing) entries are preserved.

@egladysh egladysh requested a review from asrinivasanyb June 3, 2026 23:23
@netlify

netlify Bot commented Jun 3, 2026

Copy link
Copy Markdown

Deploy Preview for infallible-bardeen-164bc9 ready!

Built without sensitive environment variables

Name Link
🔨 Latest commit fc64afe
🔍 Latest deploy log https://app.netlify.com/projects/infallible-bardeen-164bc9/deploys/6a20b71863992f000845e990
😎 Deploy Preview https://deploy-preview-32025--infallible-bardeen-164bc9.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new administrative command cleanup_stale_cdc_streams to yb-admin (along with the corresponding master RPC CleanupStaleCDCStreams) to identify and purge stale entries from the cdc_state table when their associated streams or tablets no longer exist. The changes include documentation, integration tests, and CLI/client support. The review feedback highlights a potential null pointer dereference of cdc_state_table_ in CatalogManager::CleanupStaleCDCStreams before calling GetTableRangeAsync, which could cause a master crash if the table is uninitialized; adding a defensive null check is recommended.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/yb/master/xrepl_catalog_manager.cc
ASSERT_STR_CONTAINS(resp.error().status().message(), "cdc_state table does not exist");
}

TEST_F(CDCServiceTest, TestCleanupStaleCDCStreamsDryRunAndDelete) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we follow a test pattern similar to CDCSDKYsqlTest::TestValidationAndSyncOfCDCStateEntriesAfterUserTableRemoval - this tests the yb-admin command directly

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a look.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asrinivasanyb To confirm, you're after the yb-admin CLI-as-subprocess pattern, not relocating this into the CDCSDKYsqlTest fixture, right?

}

std::vector<cdc::CDCStateTableKey> all_entry_keys;
for (const auto& entry_result : *all_entry_keys_result) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need 2 passes over the cdc_state table ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only one pass over cdc_state? We materialize CDCStateTableRange. The later loops are over the in-memory all_entry_keys vector: first to collect candidate tablet IDs for batched metadata lookup, and then to classify rows using that metadata.

}
}

std::unordered_map<TabletId, std::vector<scoped_refptr<TableInfo>>> tablet_tables_map;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this required only for the colocated table case ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used for both colocated and non-colocated tablets. For non-colocated tablets this normally contains the single user table for the tablet. For colocated tablets it can contain multiple tables, so the same map lets us report all resolved table metadata.

@egladysh egladysh requested a review from asrinivasanyb June 11, 2026 17:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants