[#31263] YSQL: Stop index backfill when the CREATE INDEX session is terminated #31378
egladysh wants to merge 16 commits into
Conversation
Code Review
This pull request introduces a mechanism to cancel background index backfill operations when the initiating PostgreSQL backend is terminated. It implements a DdlRequesterLivenessTask to monitor the transaction status and abort the backfill if the transaction is aborted. The changes span the client, master, and tserver layers to propagate transaction metadata and include extensive regression tests. Feedback was provided to increase the logging severity from a warning to an error when a backfill abort operation fails.
trigger jenkins
jasonyb
left a comment
Did not get a chance to look at the whole thing but have provided a lot of comments to keep you busy.
Had another busy day. This is still on my radar.
Merged the latest changes to PollTransactionStatusBase.
This old test run #31378 (comment) had 100% failing tests PgIndexBackfillCancellationEarlyKillTest - BackfillStopsAfterEarlyBackendKill/* and PgIndexBackfillCancellationTest - BackfillStopsAfterBackendKill/* on release and fastdebug builds.
That's not good. Did they fail on all platforms including arm mac14 clang21 (that is the setup I have been testing with)? I don't seem to have access to the test logs. Also, is it possible to trigger the tests again?
Also note I see
These are alma8-clang21-release and alma8-gcc12-fastdebug. All four full logs (fastdebug) are present here: https://gist.github.com/jasonyb/1b2eb85d6240aae92586bbb85e207173
This test is not run on mac, but I think it was run on arm-alma8-clang21-release and passed there. So 2 of 3 build types failed consistently. Trigger jenkins
Jenkins build has been triggered. Results will be posted once it completes.
JenkinsBot
@egladysh For https://github.com/yugabyte/yugabyte-db/pull/31378/changes#r3175721299, I notice you have split commits locally. Our current policy is to always squash-merge PRs, so splitting commits locally doesn't really do anything. Two options:
Given the two dependency pieces are low-review-contention, I believe the safe option is better. All we would have to do is wait for test results to pass and for the PR title/description to be good. Make sure to explain what (potential) problems are fixed by each of these pieces.
The comments were: "It's logically independent of the liveness monitoring feature. Consider splitting into"
As for the failing tests: I looked at the logs and it seems like it fails because of the asserts in fastdebug that you mentioned, when the test kills the backend. The abort fails, DdlRequesterLivenessTask doesn't know about it, backfill continues, the test fails. It could be another independent bug. I'll investigate it.
It seems to be a more serious issue with timing in
❌ Jenkins build for commit Exceptions:
Checking test failure count per build versus limit of 20 (0 on mac).
🔨 DB Build/Test Job Summary
JenkinsBot
It failed because of a clear infra-side issue: the build takes each of your commits and rebases them onto newest master. If you resolve merge conflicts in a later commit (such as f5f5f28), it will not help when rebasing an earlier commit that is the source of conflict (such as 028994f). The correct infra approach would be to either
Both cases avoid the issue of merge conflicts on intermediate commits of the PR's branch. Workaround while this infra issue is in place: squash your changes to a single local commit and force push that for the PR. cc: @hari90
I dug into the root cause of these test failures. Here's what's happening: the claim that this is independent from this PR is partially correct — the root cause is pre-existing, but this PR's tests are uniquely sensitive to it.

Root cause
When the backend is killed, the shutdown sequence involves two FATALs:
The DDL transaction is left in PENDING state on the coordinator. It stays PENDING until the heartbeat timeout expires.

Why this PR's tests fail

Why existing DDL verification tasks are unaffected

Fix for this PR
Neither FATAL-1 nor FATAL-2 causes these test failures. The backend is going to exit regardless. The tests fail because the abort RPC itself fails (messenger already shut down), leaving the transaction PENDING regardless of log severity. A fix is to reduce the heartbeat timeout in the tests. This is purely a test-side configuration; the production code is unaffected.

Separate issue (independent, does not affect test pass/fail): the pre-existing FATAL-2 during SIGTERM
I filed #31439 for the pre-existing bug where
My understanding was that #31439 is not a dependency for this: the tests should pass even though we see FATALs in the logs. Assuming I am correct about this, then I think it is better to do #31439 after this so that you have a concrete repro (namely, the logs of some of the tests here), but this is just a recommendation. If I am wrong and the tests fail without fixing #31439, yes, please open a separate PR for #31439.
@jasonyb I think there is a clear dependency for fastdebug or debug builds where the assert is enabled. The fastdebug builds keep the asserts enabled but the timing is close to release builds. In debug builds the timing is different and the RPC has time to finish before the assert checks, so it passes most of the time. The test is sensitive to proper DDL cleanup and exposes the bug with using the assert when it is enabled. Again, the assert in question that breaks the proper DDL cleanup is around
I do agree with #31472.
jasonyb
left a comment
Still working through the review. Have these comments in the meantime.
  return Status::OK();
}

void BackfillTable::StartRequesterLivenessMonitor() {
After some back-and-forth with AI, got this (which I haven't fully verified in the interest of time):
Review of StartRequesterLivenessMonitor and StopLivenessMonitor
StartRequesterLivenessMonitor
Issue 1: Race between CreateAndStartTask and storing liveness_task_
The task is created and started at line 1120, but not stored into liveness_task_ until line 1131. The task begins polling immediately. If the transaction is already aborted (or aborts very quickly), the task's FinishPollTransaction fires abort_() → BackfillTable::Abort() → MarkAllIndexesAsFailed() → CheckIfDone() → StopLivenessMonitor() before liveness_task_ is assigned. StopLivenessMonitor sees null and does nothing.
In this specific case the task happens to have self-completed via Complete() before calling abort_(), so there's no leak. But the correctness depends on an implementation detail of FinishPollTransaction's ordering (Complete() before abort_()) — if that ordering ever changes, this breaks silently.
The fix is straightforward: create the task without starting it, store it under the lock, then start it:
auto task = std::make_shared<DdlRequesterLivenessTask>(...);
{
  std::lock_guard l(mutex_);
  DCHECK(!liveness_task_);
  liveness_task_ = task;
}
task->Start();

This would require exposing a two-phase create+start API on DdlRequesterLivenessTask (the current CreateAndStartTask bundles both).
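A minimal sketch of what that two-phase shape could look like; the Create/Start split and the signatures shown here are illustrative assumptions, not the existing API:

```cpp
// Hypothetical split of today's CreateAndStartTask: Create() only constructs the task so the
// caller can store it (under its own lock) before any polling begins; Start() returns a
// Status so a scheduling failure can be surfaced instead of silently disabling the monitor.
static std::shared_ptr<DdlRequesterLivenessTask> Create(BackgroundDdlCallbacks callbacks);
Status Start();
```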
Issue 2: No error handling on CreateAndStartTask
CreateAndStartTask returns a shared_ptr, not a Result. If Start() fails internally (threadpool full, task immediately aborted by ValidateRunnable, etc.), the caller has no way to know the liveness monitor is non-functional. The entire feature becomes silently disabled with no log message indicating why.
At minimum, CreateAndStartTask should return Result<shared_ptr<DdlRequesterLivenessTask>> or the caller should verify the task's state after creation. Alternatively, log a warning if the task is in a terminal state immediately after start.
Issue 3: Callback captures shared_ptr<BackfillTable>
The lambdas capture self via shared_from_this(). This prevents BackfillTable destruction while the liveness task is alive. If StopLivenessMonitor is never called (e.g. a code path that sets done_ without going through the normal terminal paths), the BackfillTable leaks along with the task. Today's call sites appear to cover all terminal paths, but this is fragile — a future change that adds a new exit path could miss the StopLivenessMonitor call.
Consider weak_ptr for the callbacks, or a mechanism where the task self-terminates when done_() returns true (which ValidateRunnable already does — but only on the next scheduled step, not immediately).
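If the weak_ptr route is taken, the callback wiring could look roughly like this (a sketch, assuming the callback struct and member functions stay as shown in the diff above):

```cpp
// Capture a weak_ptr so the liveness task does not extend BackfillTable's lifetime; if the
// BackfillTable is already gone, report "done" / no-op abort so the task self-terminates.
std::weak_ptr<BackfillTable> weak_self = shared_from_this();
BackgroundDdlCallbacks callbacks{
    .done_ = [weak_self] {
      auto self = weak_self.lock();
      return !self || self->done();
    },
    .abort_ = [weak_self]() -> Status {
      auto self = weak_self.lock();
      return self ? self->Abort() : Status::OK();
    },
};
```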
StopLivenessMonitor
Good: Lock discipline avoids deadlock
Moving the task out via std::move under mutex_ then calling AbortAndReturnPrevState outside the lock is correct. The abort path calls PerformAbort() → Shutdown() → sync_.Wait(). If this were done under mutex_, and the task's in-flight callback tried to re-enter BackfillTable (which acquires mutex_), you'd deadlock.
Good: Idempotency via std::move
After std::move, liveness_task_ is null; subsequent calls are no-ops. This is essential because StopLivenessMonitor is called from multiple convergent terminal paths: Done() (success), MarkIndexesAsFailed() (failure), CheckIfDone() (via Abort()).
Concern: AbortAndReturnPrevState may block
AbortAndReturnPrevState can trigger PerformAbort() → Shutdown() → sync_.Wait(), which blocks until the in-flight GetTransactionStatus RPC completes. This is called from Done() and CheckIfDone(), which run on the callback threadpool. A slow or hung transaction status RPC could stall the backfill completion path. The transaction_rpc_timeout_ms flag bounds this, but it's worth being aware of.
Summary of actionable items
- Fix the create-then-store race — separate creation from starting, or at minimum document why the current ordering is safe and what invariants it depends on.
- Add error handling for task creation — at minimum log when the liveness monitor fails to start.
- Consider the blocking potential of StopLivenessMonitor — sync_.Wait() can stall the callback threadpool.
  is_backfilling_ = false;
}

// Store/retrieve the DDL transaction from the PG backend that initiated the backfill.
After some back-and-forth with AI, got this (which I haven't fully verified in the interest of time):
Review of SetPendingBackfillRequesterTransaction and TakePendingBackfillRequesterTransaction
Note: Not persisted across master failover
Already tracked in #31472 and commented at https://github.com/yugabyte/yugabyte-db/pull/31378/changes#r3198264572. Not repeating here.
Note: Take returning nullopt is expected in multiple cases
StartBackfillingData always calls Take when !requester_transaction && current_version (line 348). Nullopt is the normal result for:
- YCQL: Set is never called — YCQL doesn't go through CatalogManager::BackfillIndex (which is PGSQL-only). Nullopt is the expected baseline.
- YSQL without a requester transaction: older PG clients that don't send requester_transaction, or decode failure (line 6594).
- YSQL after master failover: in-memory state lost, already tracked in "[YSQL] Create BackfillJobPB earlier in the backfill lifecycle" #31472.
The only scenario where nullopt from Take would indicate a problem is a version mismatch — Set was called at V+1 but Take is called at some other version. This would mean an unexpected version bump occurred between the permission update and the backfill launch. This is unlikely today (YSQL does exactly one bump), but there's no way to distinguish this case from the legitimate nullopt cases at the Take call site.
Not flagging this as actionable, but noting that debugging a missing liveness monitor will require correlating logs from the Set call site (which currently has no log) with the Take call site. A VLOG at the Set call (line 523) recording the stored version would help.
Issue 1: Version encoding assumes exactly one version bump between Set and Take
Set stores at current_version + 1. Take is called with the table's version at the time backfill is launched. This works because YSQL does exactly one permission update (WRITE_AND_DELETE → DO_BACKFILL) which bumps the version by exactly 1.
If a future change introduces additional intermediate permission steps or version bumps between Set and Take, the versions would mismatch and Take would silently return nullopt, disabling liveness monitoring. The version matching is correct today but the coupling is implicit. A comment on SetPendingBackfillRequesterTransaction noting this single-bump assumption would help.
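A sketch of the suggested comment, combined with the VLOG mentioned above, at the Set call site; the variable names and the exact Set signature are assumptions:

```cpp
// YSQL performs exactly one permission bump (WRITE_AND_DELETE -> DO_BACKFILL) between
// storing the requester transaction and launching the backfill, hence current_version + 1.
// Any extra version bump introduced in between would make Take miss the stored transaction
// and silently disable liveness monitoring.
VLOG(1) << "Storing backfill requester transaction " << requester_txn->transaction_id
        << " for table " << table->ToString() << " at version " << (current_version + 1);
table->SetPendingBackfillRequesterTransaction(*requester_txn, current_version + 1);
```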
Issue 2: Transaction decode failure is only a WARNING
In CatalogManager::BackfillIndex (lines 6590-6596):
if (req->has_requester_transaction()) {
  auto result = TransactionMetadata::FromPB(req->requester_transaction());
  if (result.ok()) {
    requester_txn = std::move(*result);
  } else {
    LOG(WARNING) << "BackfillIndex: failed to decode requester transaction: " << result.status();
  }
}

If the PG backend sends a malformed transaction, the decode fails and is logged as a WARNING. The backfill proceeds without liveness monitoring. This is fine for robustness (don't block backfill for a monitoring feature), but the WARNING could be easy to miss. Consider LOG(DFATAL) in debug builds to catch protocol bugs early.
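For example, the else branch could become the following (a sketch; LOG(DFATAL) is fatal in debug builds and logs an error in release builds):

```cpp
} else {
  // Crashes debug/test builds so protocol bugs surface early, while release builds still
  // proceed with the backfill (just without liveness monitoring).
  LOG(DFATAL) << "BackfillIndex: failed to decode requester transaction: "
              << result.status();
}
```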
Issue 3: Method bodies in the header file
SetPendingBackfillRequesterTransaction and TakePendingBackfillRequesterTransaction are defined inline in catalog_entity_info.h. Most TableInfo methods with comparable complexity (SetIsBackfilling, SetCreateTableErrorStatus, etc.) are declared in the header but defined in catalog_entity_info.cc. ClearIsBackfilling is inline but is a trivial one-liner. These two methods have lock acquisition, conditional logic, and std::exchange — they should follow the prevailing pattern and move to the .cc file.
Summary of actionable items
- Comment the single-version-bump assumption — at the Set call site (line 523-524) or on the field declaration.
- Add a VLOG at the Set call site (line 523) recording the stored version, to aid debugging when the liveness monitor unexpectedly doesn't start.
- Consider LOG(DFATAL) for transaction decode failure — at line 6594-6596, a malformed requester_transaction from the PG client is only a WARNING. LOG(DFATAL) would catch protocol bugs in debug builds.
- Move Set/Take method bodies to catalog_entity_info.cc — they have non-trivial logic and don't match the header-inline pattern used by comparable TableInfo methods.
Co-authored-by: jasonyb <93959687+jasonyb@users.noreply.github.com>
jasonyb
left a comment
I'll look at the last two tests later. I'm familiar with the backfill flow but not the ddl verification task flow. Asked for assistance on that, but I will do it later if no help comes.
// Retrieve the requester transaction if it was stored during the permission-update phase.
// Pass current_version so TakePendingBackfillRequesterTransaction rejects stale
// transactions from earlier backfill attempts.
if (!requester_transaction && current_version) {
Sorry, I should have clarified that this suggestion was contingent on #31378 (comment) being true. I believe it is true, but it is not a blocker to me to have this dead code.
pid_ready.Wait();
{
  auto monitor_conn = VERIFY_RESULT(ConnectToDB(kDatabaseName));
  RETURN_NOT_OK(WaitFor(
I believe we already have a GetIndexStateFlags for this purpose
Doesn't GetIndexStateFlags use conn_ that we reset already? I guess we could modify GetIndexStateFlags and pass the connection as a parameter? I'd be fine with that.
Ok. Not a blocker, but you could choose to not conn_.reset() to be able to use GetIndexStateFlags here, right? conn_ shouldn't be blocking the CREATE INDEX, at least not in any meaningful way.
// T~4s First sleep inside UpdateIndexPermission completes (kAlterTableDelay)
// T~4s Permission updated; requester_transaction stored via
//      SetPendingBackfillRequesterTransaction
// T~8s Second sleep inside UpdateIndexPermission completes -> AlterTable RPCs proceed
You have two UpdateIndexPermission, and while that is true, the below SleepFor(kAlterTableDelay * 2 + 1s); is after having waited for indisready permissions, so I believe the first UpdateIndexPermission is already covered. This brings it down to SleepFor(kAlterTableDelay + 1s);.
Furthermore, the comment says "BEFORE any tserver BackfillIndex RPCs have been sent", but if you are waiting the full kAlterTableDelay, doesn't that mean it is sent (if not for the code overhead)? It is not clear to me at what step you are trying to kill the backend. What further complicates this is the long delay before the transaction is noticed as aborted.
I think WaitForIndexStateFlags will see indisready=true at T0. I asked AI to clarify the comments. It seems correct to me now. I am trying to kill the backend at T9 (added comments). Also, "RPCs sent" is not strictly correct, changed to "RPCs completed".
Addressed some of the comments, will take a look at the rest later.
Thanks for making this change @egladysh, it will help quite a bit! The overall structure of the code looks good. I have some specific concerns and a couple of questions. Main ones are
- master leader failover (can be punted to a later diff)
- BackfillTable::Abort/Done synchronization seems to be missing.
- pg client session txn identification needs to be based on txn ddl being enabled or not
Rest are relatively minor.
// The PG backend holds a DDL transaction open for the entire backfill duration
// (StartTransactionCommand at indexcmds.c:2334). Pass it to the master so it can detect when
// this backend is killed (-> txn aborted) and stop launching new backfill chunks.
auto meta = GetDdlTransactionMetadata(
We should use req.use_regular_transaction_block just like the other DDLs. The context here is that prior to txn DDL, DDLs used a separate "autonomous" transaction (the kDDL session txn), but after the transactional DDL feature they use the kPlain session txn. So I imagine this may not work correctly when transactional DDL is on unless we plumb this field through, similar to other DDLs like PgCreateTable.
We can either try testing with transactional DDL on or just plumb it through similar to other DDLs.
I see. I'll add it to PgBackfillIndexRequestPB.
// Schedule asynchronously so user_cb fires first.
//
// Complete() must still be called before callbacks_.abort_() to avoid a different deadlock:
// Abort() may call BackfillTable::StopLivenessMonitor() -> AbortAndReturnPrevState(), which
Is it possible to supply a flag to Abort indicating it is coming from the liveness check, so that the backfill is not going to try to stop the liveness check again? That would simplify the code.
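A minimal sketch of that flag idea; the signature change is hypothetical and the body simplifies the real flow (Abort() -> MarkAllIndexesAsFailed() -> CheckIfDone() -> StopLivenessMonitor() as described earlier):

```cpp
// Hypothetical: when the abort originates from the liveness task itself, skip stopping the
// task again; it is already completing on its own, which removes the ordering dependency on
// Complete() being called before callbacks_.abort_().
Status BackfillTable::Abort(bool from_liveness_monitor) {
  // ... existing abort work (mark indexes as failed, etc.) ...
  if (!from_liveness_monitor) {
    StopLivenessMonitor();
  }
  return Status::OK();
}
```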
auto self = shared_from_this();
BackgroundDdlCallbacks callbacks{
    .done_ = [self] { return self->done(); },
    .abort_ = [self] { return self->Abort(); },
It doesn't seem like BackfillTable::Abort / BackfillTable::Done are ready to be called in a multi-threaded context once we add this callback.
- txn poll can call Abort while backfilltablet is causing a transition to Done success path.
- txn poll can call Abort while backfilltablet is causing its own failure transition.
It seems like so far they were able to use std atomics to avoid real locking, but now it would be better to use a proper lock to keep it simple. We can have some explicit internal enum state like waiting, aborting, aborted, success and use that to decide what to do from the callbacks (we only want to affect the waiting state from the txn callback and not the others). Any other approaches are also ok, but the current path seems prone to problems.
@iSignal Ah... I assumed that those atomics and mutexes were there to make them thread-safe. I do see the gap now. We can fix it with an enum (kind of a state machine, as you suggested) or just by moving the done_ usage around, like:
Status BackfillTable::Abort() {
  bool expected = false;
  if (!done_.compare_exchange_strong(expected, true)) {
    return Status::OK();
  }
  ...
Which one would you prefer?
done_ would be simpler but does not handle the race between Abort and the tablet Done failure path, right? Both may try to mark indexes as failed. I guess an atomic int enum CAS with more than true/false can help distinguish the different states.
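A minimal sketch of that idea (the enum name, states, and method names are illustrative, not actual BackfillTable members):

```cpp
// Only the transition out of kWaiting is contended: the liveness callback may act only if
// nothing else (success or tablet-failure path) has already claimed the terminal transition.
enum class BackfillJobState { kWaiting, kAborting, kFailed, kSuccess };
std::atomic<BackfillJobState> state_{BackfillJobState::kWaiting};

bool TryTransition(BackfillJobState from, BackfillJobState to) {
  auto expected = from;
  return state_.compare_exchange_strong(expected, to);
}

// Liveness-monitor callback: becomes a no-op if the job already finished or failed.
Status AbortFromLivenessMonitor() {
  if (!TryTransition(BackfillJobState::kWaiting, BackfillJobState::kAborting)) {
    return Status::OK();
  }
  // ... mark indexes as failed, then transition kAborting -> kFailed ...
  return Status::OK();
}
```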
Hmm, I'm not sure. It seems like indexes_to_build() takes care of it with LockForWrite. A failed-to-failed transition is harmless, and after a failure indexes_to_build() will return {}, if my understanding is correct?
LeaderEpoch epoch_;
ash::WaitStateInfoPtr wait_state_;
std::optional<TransactionMetadata> requester_transaction_;
std::shared_ptr<DdlRequesterLivenessTask> liveness_task_ GUARDED_BY(mutex_);
Maybe this can be a std::weak_ptr? The task runs on its own, do we need to own it? Right now, both the task and BackfillTable are holding refs to each other, so we really need to be sure we release both correctly.
@iSignal I think that the reference cycle is already explicitly broken because every exit path (MarkIndexesAsFailed, CheckIfDone) calls StopLivenessMonitor(), which moves liveness_task_ out and clears BackfillTable's reference to the task. Another reference to the task is the TableInfo's task list, and the task will hold the last shared_ptr<BackfillTable> because of the captures in the callbacks; when the task finishes, those are released too. No leak. Also I think that weak_ptr might actually be wrong because StopLivenessMonitor needs to call AbortAndReturnPrevState on the task, but that depends on the life cycle of the tasks in TableInfo. I feel like shared_ptr is safer, but I could be wrong.
Yes it is safe now but it is a bit worrying that every future path would need to reason about and remember to call Stop... during Job termination paths to break the loop.
If we write it as below to get a shared ptr out of the weak ptr, it would allow the task to exit by itself as well. But open to other suggestions as well
void BackfillTable::StopLivenessMonitor() {
  std::shared_ptr<DdlRequesterLivenessTask> task;
  {
    std::lock_guard l(mutex_);
    task = liveness_task_.lock();
    liveness_task_.reset();
  }
  if (task) {
    task->AbortAndReturnPrevState(STATUS(Aborted, "BackfillTable is done"));
  }
}
My concern is about AbortAndReturnPrevState. I thought the call must be made? If that's not the case, I'd agree that weak_ptr would be a better choice.
context->GetClientDeadline(), IsTxnUsingTableLocks(false));
std::optional<TransactionMetadata> txn_metadata;
if (!meta.ok()) {
  VLOG(1) << "BackfillIndex: failed to get DDL transaction metadata: " << meta.status();
minor: maybe can be LOG(WARNING)
}

auto terminated = VERIFY_RESULT(
    conn_->FetchRow<bool>(Format("SELECT pg_terminate_backend($0)", create_index_pid)));
You claim conn_ is reset above but use it here. Does the test pass locally?
I don't think I ran the EarlyKill tests after I asked AI to refactor them to dedup the code. You are right, it's a bug.
Summary
Add DdlRequesterLivenessTask, a master-side task that polls the transaction status of the DDL transaction held open by the CREATE INDEX CONCURRENTLY backend. If the transaction is aborted (e.g. because the backend was killed via pg_terminate_backend), the task calls BackfillTable::Abort() to stop the in-progress backfill.

Test plan
- PgIndexBackfillCancellationTest.BackfillStopsAfterBackendKill — asserts that no new backfill RPCs are issued after the backend is killed.
- PgIndexBackfillCancellationWithoutFixTest.BackfillContinuesAfterBackendKill — asserts the old behavior (backfill continues) when the liveness monitor is disabled, serving as a regression baseline.
- PgIndexBackfillCancellationEarlyKillTest.BackfillStopsAfterEarlyBackendKill — same as above but the backend is killed before backfill starts.
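For reference, a minimal sketch of the kill step these tests revolve around (monitor_conn, create_index_pid, and the surrounding helpers are assumptions based on the snippets discussed in the review, not the final test code):

```cpp
// Connect on a separate session, wait until the index reaches the backfill phase,
// then terminate the CREATE INDEX backend and assert no further backfill RPCs arrive.
auto monitor_conn = VERIFY_RESULT(ConnectToDB(kDatabaseName));
auto terminated = VERIFY_RESULT(monitor_conn.FetchRow<bool>(
    Format("SELECT pg_terminate_backend($0)", create_index_pid)));
SCHECK(terminated, IllegalState, "pg_terminate_backend did not find the backend");
// After this point the DDL transaction aborts, DdlRequesterLivenessTask observes it,
// and BackfillTable::Abort() stops launching new backfill chunks.
```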