
[#31263] YSQL: Stop index backfill when the CREATE INDEX session is terminated#31378

Open
egladysh wants to merge 16 commits into yugabyte:master from Shopify:query_cancellation_for_index_backfills

Conversation

@egladysh
Collaborator

@egladysh egladysh commented Apr 30, 2026

Summary

Add DdlRequesterLivenessTask, a master-side task that polls the transaction
status of the DDL transaction held open by the CREATE INDEX CONCURRENTLY
backend. If the transaction is aborted (e.g. because the backend was killed
via pg_terminate_backend), the task calls BackfillTable::Abort() to stop
the in-progress backfill.
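The poll-and-abort decision described above can be sketched as a minimal, self-contained simulation. The enum and function names here are illustrative only, not the PR's actual API:

```cpp
#include <functional>
#include <vector>

// Possible DDL transaction states as observed by the poller.
enum class TxnStatus { kPending, kCommitted, kAborted };

// Simulated liveness monitor: checks the transaction status once per tick and
// invokes the abort callback the first time it observes kAborted. Returns the
// number of polls performed (stops early on a terminal status).
int RunLivenessMonitor(const std::vector<TxnStatus>& polls,
                       const std::function<void()>& abort_backfill) {
  int count = 0;
  for (TxnStatus s : polls) {
    ++count;
    if (s == TxnStatus::kAborted) {
      abort_backfill();  // corresponds to BackfillTable::Abort() in the PR
      return count;
    }
    if (s == TxnStatus::kCommitted) {
      return count;      // requester finished normally; nothing to do
    }
    // kPending: keep polling on the next tick
  }
  return count;
}
```

The real task polls asynchronously on the master's background threadpool; this sketch only captures the decision logic.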

Test plan

  • PgIndexBackfillCancellationTest.BackfillStopsAfterBackendKill — asserts
    that no new backfill RPCs are issued after the backend is killed.
  • PgIndexBackfillCancellationWithoutFixTest.BackfillContinuesAfterBackendKill
    — asserts the old behavior (backfill continues) when the liveness monitor is
    disabled, serving as a regression baseline.
  • PgIndexBackfillCancellationEarlyKillTest.BackfillStopsAfterEarlyBackendKill
    — same as above but the backend is killed before backfill starts.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a mechanism to cancel background index backfill operations when the initiating PostgreSQL backend is terminated. It implements a DdlRequesterLivenessTask to monitor the transaction status and abort the backfill if the transaction is aborted. The changes span the client, master, and tserver layers to propagate transaction metadata and include extensive regression tests. Feedback was provided to increase the logging severity from a warning to an error when a backfill abort operation fails.

Comment thread src/yb/master/ysql_ddl_verification_task.cc Outdated
@egladysh egladysh force-pushed the query_cancellation_for_index_backfills branch from 82438bc to 288225c Compare April 30, 2026 22:37
@jasonyb
Contributor

jasonyb commented May 1, 2026

trigger jenkins

@hari90
Contributor

hari90 commented May 1, 2026

Phorge diff synced: D52659
Commit: 288225c0

Jenkins build has been triggered. Results will be posted here once it completes.


JenkinsBot

Contributor

@jasonyb jasonyb left a comment


Did not get a chance to look at the whole thing but have provided a lot of comments to keep you busy.

Comment thread src/yb/yql/pgwrapper/pg_index_backfill-test.cc Outdated
Comment thread src/yb/yql/pgwrapper/pg_index_backfill-test.cc Outdated
Comment thread src/yb/yql/pgwrapper/pg_index_backfill-test.cc Outdated
Comment thread src/yb/yql/pgwrapper/pg_index_backfill-test.cc Outdated
Comment thread src/yb/yql/pgwrapper/pg_index_backfill-test.cc Outdated
Comment thread src/yb/master/ysql_ddl_verification_task.h Outdated
Comment thread src/yb/client/client_master_rpc.cc
Comment thread src/yb/master/ysql_backends_manager.cc
Comment thread src/yb/master/ysql_ddl_verification_task.cc
Comment thread src/yb/master/catalog_entity_info.h
@hari90
Contributor

hari90 commented May 2, 2026

Jenkins build for commit 288225c0: Fail

Exceptions:


🔨 DB Build/Test Job Summary

Build Total Passed Failed Failed After Retries
D52659-arm-alma8-clang21-release 9849 9470 12 6
D52659-alma8-clang21-release 9852 9470 12 8
D52659-ubuntu22.04-clang21-debug 2 2 0 0
D52659-arm-mac14-clang21-release 17 17 0 0
D52659-mac14-clang21-release 2 2 0 0
D52659-alma8-clang21-tsan 9665 8029 18 12
D52659-alma8-gcc12-fastdebug 9867 9426 18 8
D52659-alma9-clang21-asan 9759 9042 8 5

Full status



@netlify

netlify Bot commented May 4, 2026

Deploy Preview for infallible-bardeen-164bc9 ready!

Built without sensitive environment variables

Name Link
🔨 Latest commit c5b9ba9
🔍 Latest deploy log https://app.netlify.com/projects/infallible-bardeen-164bc9/deploys/69fee765b01d5500096ddd5b
😎 Deploy Preview https://deploy-preview-31378--infallible-bardeen-164bc9.netlify.app

@egladysh egladysh force-pushed the query_cancellation_for_index_backfills branch from f3ddcb0 to 6a4cb24 Compare May 4, 2026 17:13
@egladysh egladysh force-pushed the query_cancellation_for_index_backfills branch from a5a7305 to ebaafb3 Compare May 4, 2026 17:40
@egladysh egladysh requested a review from jasonyb May 4, 2026 17:44
@jasonyb
Contributor

jasonyb commented May 5, 2026

Had another busy day. This is still on my radar.

@egladysh
Collaborator Author

egladysh commented May 5, 2026

merged the latest changes to PollTransactionStatusBase

@jasonyb
Contributor

jasonyb commented May 5, 2026

This old test run #31378 (comment) had 100% failing tests PgIndexBackfillCancellationEarlyKillTest - BackfillStopsAfterEarlyBackendKill/* and PgIndexBackfillCancellationTest - BackfillStopsAfterBackendKill/* on release and fastdebug builds.

@egladysh
Collaborator Author

egladysh commented May 5, 2026

That's not good. Did they fail on all platforms, including arm mac14 clang21 (that's the setup I have been testing with)? I don't seem to have access to the test logs. Also, is it possible to trigger the tests again?

@jasonyb
Contributor

jasonyb commented May 5, 2026

BackfillStopsAfterEarlyBackendKill/0:

../../src/yb/yql/pgwrapper/pg_index_backfill-test.cc:3726: Failure
Expected: (rpcs_final) <= (2), actual: 4 vs 2
Expected ≤ 2 BackfillIndex RPCs (liveness task should abort after ≤ 2 chunks), got 4. Suggests requester_transaction was not forwarded from the placeholder BackfillTable to the real BackfillTable.

BackfillStopsAfterEarlyBackendKill/1:

../../src/yb/yql/pgwrapper/pg_index_backfill-test.cc:3726: Failure
Expected: (rpcs_final) <= (2), actual: 4 vs 2
Expected ≤ 2 BackfillIndex RPCs (liveness task should abort after ≤ 2 chunks), got 4. Suggests requester_transaction was not forwarded from the placeholder BackfillTable to the real BackfillTable.

BackfillStopsAfterBackendKill/0:

../../src/yb/yql/pgwrapper/pg_index_backfill-test.cc:3542: Failure
Expected equality of these values:
  rpcs_after_wait
    Which is: 4
  rpcs_after_kill
    Which is: 2
BackfillIndex RPCs continued after pg_terminate_backend: 2 unexpected additional RPC(s) issued

BackfillStopsAfterBackendKill/1:

../../src/yb/yql/pgwrapper/pg_index_backfill-test.cc:3542: Failure
Expected equality of these values:
  rpcs_after_wait
    Which is: 4
  rpcs_after_kill
    Which is: 2
BackfillIndex RPCs continued after pg_terminate_backend: 2 unexpected additional RPC(s) issued

Also note I see [ts-1] TRAP: FailedAssertion("!IsTransactionOrTransactionBlock()", File: "../../../../../../../src/postgres/src/backend/utils/activity/pgstat.c", Line: 586, PID: 490741) in all four of these fastdebug logs.

@jasonyb
Contributor

jasonyb commented May 5, 2026

These are alma8-clang21-release and alma8-gcc12-fastdebug. All four full logs (fastdebug) are present here: https://gist.github.com/jasonyb/1b2eb85d6240aae92586bbb85e207173

@jasonyb
Contributor

jasonyb commented May 5, 2026

That's not good. Did they fail on all platforms, including arm mac14 clang21 (that's the setup I have been testing with)? I don't seem to have access to the test logs. Also, is it possible to trigger the tests again?

This test is not run on mac, but I think it was run on arm-alma8-clang21-release and passed there. So 2 of 3 build types failed consistently.

Trigger jenkins

@hari90
Contributor

hari90 commented May 5, 2026

Jenkins build has been triggered. Results will be posted once it completes.



@jasonyb
Contributor

jasonyb commented May 5, 2026

@egladysh For https://github.com/yugabyte/yugabyte-db/pull/31378/changes#r3175721299, I notice you have split commits locally. Our current policy is to always squash-merge PRs, so splitting commits locally doesn't really do anything. Two options:

  • safe: create two PRs off master for each of 4b38b00 and f2675a3. Wait for those to land on master. After they land on master, create the final PR that does the main feature you want.
  • risky: I am not completely sure how stacking multiple PRs together works, but it may be possible to build the three PRs all depending on each other. It is not clear to me whether triggering builds off of these would work (the infra team has expressed verbally that it should, but I do not believe it has been tested yet).

Given the two dependency pieces are low-review-contention, I believe the safe option is better. All we would have to do is wait for test results to pass and for the PR title/description to be good. Make sure to explain what (potential) problems are fixed by each of these pieces.

@egladysh
Collaborator Author

egladysh commented May 5, 2026

The comments were: "It's logically independent of the liveness monitoring feature. Consider splitting into
its own commit so it can be cherry-picked independently and won't be lost if this
commit is reverted." But yeah, squash-merge makes a difference. I'll split them up.

As for the failing tests: I looked at the logs, and it seems they fail because of the asserts in fastdebug that you mentioned. When the test does pg_terminate_backend, it leads to

[ts-1] W0502 01:33:34.751808 207507 pg_txn_manager.cc:689] Failed to abort DDL transaction: Aborted (yb/rpc/reactor.cc:122): Shutdown connection (system error 108)
[ts-1] 2026-05-02 01:33:34.753 UTC [207507] FATAL:  Failed to abort DDL transaction: Shutdown connection

The abort fails, DdlRequesterLivenessTask doesn't know about it, backfill continues, the test fails. It could be another independent bug. I'll investigate it.

@egladysh
Collaborator Author

egladysh commented May 5, 2026

It seems to be a more serious issue with timing in YBCAbortTransaction when DDL transactions are not aborted cleanly and it's independent from this PR.

@hari90
Contributor

hari90 commented May 5, 2026

Jenkins build for commit f5f5f28d: Fail
CSI
Reason: CSI status: WARNING

Exceptions:

Checking test failure count per build versus limit of 20 (0 on mac).

Build Failures Status
PR31378-mac14-clang21-release #1 1 FAILURE
PR31378-alma8-clang21-release #1 1 Okay
PR31378-ubuntu22.04-clang21-debug #1 1 Okay
PR31378-alma8-clang21-tsan #1 1 Okay
PR31378-arm-mac14-clang21-release #1 1 FAILURE
PR31378-arm-alma8-clang21-release #1 1 Okay
PR31378-alma8-gcc12-fastdebug #1 1 Okay
PR31378-alma9-clang21-asan #1 1 Okay

🔨 DB Build/Test Job Summary

Build Total Passed Failed Failed After Retries
PR31378-mac14-clang21-release 1 0 1 1
PR31378-alma8-clang21-release 1 0 1 1
PR31378-ubuntu22.04-clang21-debug 1 0 1 1
PR31378-alma8-clang21-tsan 1 0 1 1
PR31378-arm-mac14-clang21-release 1 0 1 1
PR31378-arm-alma8-clang21-release 1 0 1 1
PR31378-alma8-gcc12-fastdebug 1 0 1 1
PR31378-alma9-clang21-asan 1 0 1 1


@jasonyb
Contributor

jasonyb commented May 5, 2026

#31378 (comment):

[2026-05-05T21:41:23.518Z] Rebasing (1/6)
[2026-05-05T21:41:23.518Z] Rebasing (2/6)
[2026-05-05T21:41:23.518Z] Rebasing (3/6)
[2026-05-05T21:41:23.518Z] Auto-merging src/yb/master/catalog_entity_info.h
[2026-05-05T21:41:23.518Z] Auto-merging src/yb/master/catalog_manager.cc
[2026-05-05T21:41:23.518Z] Auto-merging src/yb/master/ysql_ddl_verification_task.cc
[2026-05-05T21:41:23.518Z] CONFLICT (content): Merge conflict in src/yb/master/ysql_ddl_verification_task.cc
[2026-05-05T21:41:23.518Z] Auto-merging src/yb/master/ysql_ddl_verification_task.h
[2026-05-05T21:41:23.518Z] error: could not apply 028994f138... Stop index backfill when backend is terminated
[2026-05-05T21:41:23.518Z] hint: Resolve all conflicts manually, mark them as resolved with
[2026-05-05T21:41:23.518Z] hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
[2026-05-05T21:41:23.518Z] hint: You can instead skip this commit: run "git rebase --skip".
[2026-05-05T21:41:23.518Z] hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
[2026-05-05T21:41:23.518Z] Could not apply 028994f138... Stop index backfill when backend is terminated

It failed because of a clear infra-side issue: the build takes each of your commits and rebases them onto newest master. If you resolve merge conflicts in a later commit (such as f5f5f28), it will not help when rebasing an earlier commit that is the source of conflict (such as 028994f). The correct infra approach would be to either

  • squash the PR's commits into a single commit then rebase that onto latest master
  • merge the PR's branch to latest master

Both cases avoid the issue of merge conflicts on intermediate commits of the PR's branch.

Workaround while this infra issue is in place: squash your changes to a single local commit and force push that for the PR.

cc: @hari90

@jasonyb
Contributor

jasonyb commented May 6, 2026

It seems to be a more serious issue with timing in YBCAbortTransaction when DDL transactions are not aborted cleanly and it's independent from this PR.

I dug into the root cause of these test failures. Here's what's happening:

The claim that this is independent from this PR is partially correct — the root cause is pre-existing, but this PR's tests are uniquely sensitive to it.

Root Cause

When pg_terminate_backend sends SIGTERM, the die() signal handler calls YBCInterruptPgGate(), which signals the interrupter thread to shut down the RPC messenger (messenger_.Shutdown() in pggate.cc:592). This happens asynchronously, but the code path to the abort RPC is long enough that the messenger is effectively always shut down by the time it fires.

The shutdown sequence involves two FATALs:

  1. FATAL-1: ProcessInterrupts() raises ereport(FATAL, "terminating connection"). Unlike ERROR (which longjmps to the error handler), FATAL calls proc_exit(1) directly. During proc_exit, the ShutdownPostgres callback calls AbortOutOfAnyTransaction() → AbortTransaction() → YBCAbortTransaction() → YBCPgClearSeparateDdlTxnMode() → FinishTransaction(Abort). The abort RPC fails because the messenger is shut down.

  2. FATAL-2: The RPC failure triggers elog(FATAL, "Failed to abort DDL transaction: Shutdown connection") at pg_yb_utils.c:1552.

The DDL transaction is left in PENDING state on the coordinator. It stays PENDING until the heartbeat timeout expires: transaction_heartbeat_usec × transaction_max_missed_heartbeat_periods = 0.5s × 30 = 15 seconds in release builds, 45 seconds in TSAN.

Why this PR's tests fail

DdlRequesterLivenessTask polls transaction status and calls callbacks_.abort_() only when it sees ABORTED. Since the abort RPC never succeeded, the transaction stays PENDING for 15-45s, and the liveness task keeps reporting "still pending" during that window. The tests' wait period is shorter than this, so they see continued backfill RPCs and fail.

Why existing DDL verification tasks are unaffected

TableSchemaVerificationTask and NamespaceVerificationTask ignore the aborted parameter entirely (bool /*aborted*/) — they compare against the PG schema to determine the actual outcome. They just poll longer. No correctness issue.

Fix for this PR

Neither FATAL-1 nor FATAL-2 causes these test failures. The backend is going to exit regardless. The tests fail because the abort RPC itself fails (messenger already shut down), leaving the transaction PENDING regardless of log severity.

A fix is to reduce transaction_max_missed_heartbeat_periods in the test fixture to shorten the coordinator's expiration window. A value of 10 gives ~5s in release (10 × 0.5s) and ~15s in TSAN (10 × 1.5s) — fast enough for tests, with enough margin to avoid flakiness from normal heartbeat jitter. The test wait time and flags should then be adjusted to exceed this timeout.
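The expiration-window arithmetic above can be checked with a tiny helper. The flag names and values are taken from the discussion in this thread; this is a sketch, not YugabyteDB code:

```cpp
#include <cstdint>

// Coordinator expiration window for a PENDING transaction:
// heartbeat interval (usec) times the allowed number of missed periods.
int64_t ExpirationWindowUsec(int64_t heartbeat_usec, int32_t max_missed_periods) {
  return heartbeat_usec * static_cast<int64_t>(max_missed_periods);
}
```

With the defaults quoted above (0.5s × 30) the window is 15s; lowering transaction_max_missed_heartbeat_periods to 10 gives 5s in release and, with the ~1.5s TSAN heartbeat, 15s in TSAN.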

This is purely a test-side configuration; the production code in DdlRequesterLivenessTask does not need changes — waiting longer for the coordinator to expire the transaction is acceptable in production.

Separate issue (independent, does not affect test pass/fail): the pre-existing FATAL-2 during SIGTERM

I filed #31439 for the pre-existing bug where YBCAbortTransaction produces FATAL-2 (and on debug/fastdebug builds, a TRAP assertion at pgstat.c:586) when the backend is killed via pg_terminate_backend. FATAL-1 ("terminating connection") is the expected behavior of pg_terminate_backend and is not a bug. The fix for FATAL-2 is to detect the shutdown path (proc_exit_inprogress) and downgrade FATAL-2 to WARNING. The abort RPC is still attempted (so it can succeed in the rare case the messenger hasn't shut down yet), but a failure no longer triggers a nested FATAL. This is independent from this PR and does not affect whether these tests pass or fail.

Contributor

@jasonyb jasonyb left a comment


Spent a lot of time on forming/understanding issue #31439, so again did not get to review the whole thing.

Comment thread src/yb/master/backfill_index.cc Outdated
Comment thread src/yb/master/ysql_ddl_verification_task.cc Outdated
@egladysh
Collaborator Author

egladysh commented May 6, 2026

@jasonyb Yeah, it looks correct: #31439. Should I open a separate PR for that issue, since this one depends on it?

@jasonyb
Contributor

jasonyb commented May 6, 2026

@jasonyb Yeah, it looks correct: #31439. Should I open a separate PR for that issue, since this one depends on it?

My understanding was that #31439 is not a dependency for this: the tests should pass even though we see FATALs in the logs. Assuming I am correct about this, then I think it is better to do #31439 after this so that you have a concrete repro (namely, the logs of some of the tests here), but this is just a recommendation. If I am wrong and the tests fail without fixing #31439, yes, please open a separate PR for #31439.

@egladysh
Collaborator Author

egladysh commented May 6, 2026

@jasonyb I think there is a clear dependency for fastdebug or debug builds, where the assert is enabled. Fastdebug builds keep the asserts enabled, but their timing is close to release builds. In debug builds the timing is different, so the RPC has time to finish before the assert fires and the test passes most of the time. The test is sensitive to proper DDL cleanup, so it exposes the assertion bug whenever the assert is enabled. For reference, the assert in question that breaks proper DDL cleanup is:

TRAP: FailedAssertion("!IsTransactionOrTransactionBlock()", File: "../../../../../../../src/postgres/src/backend/utils/activity/pgstat.c", Line: 586, PID: 207507)

@egladysh
Collaborator Author

egladysh commented May 6, 2026

@jasonyb The fix suggested in #31439 is consistent with my understanding of the issue. I am thinking about creating another PR for that. Please let me know if you have other ideas.

@egladysh egladysh requested a review from jasonyb May 6, 2026 22:29
@jasonyb
Contributor

jasonyb commented May 6, 2026

@jasonyb The fix suggested in #31439 is consistent with my understanding of the issue. I am thinking about creating another PR for that. Please let me know if you have other ideas.

@egladysh, sounds good to me.

@egladysh
Collaborator Author

egladysh commented May 6, 2026

@jasonyb Opened #31470

@egladysh
Collaborator Author

egladysh commented May 7, 2026

I do agree with #31472.

Contributor

@jasonyb jasonyb left a comment


Still working through the review. Have these comments in the meantime.

Comment thread src/yb/tserver/pg_client_session.cc Outdated
Comment thread src/yb/client/client.h Outdated
Comment thread src/yb/client/client-internal.h Outdated
Comment thread src/yb/master/backfill_index.h Outdated
Comment thread src/yb/master/backfill_index.h Outdated
Comment thread src/yb/master/backfill_index.cc Outdated
Comment thread src/yb/master/backfill_index.cc
return Status::OK();
}

void BackfillTable::StartRequesterLivenessMonitor() {
Contributor


After some back-and-forth with AI, got this (which I haven't fully verified in the interest of time):

Review of StartRequesterLivenessMonitor and StopLivenessMonitor

StartRequesterLivenessMonitor

Issue 1: Race between CreateAndStartTask and storing liveness_task_

The task is created and started at line 1120, but not stored into liveness_task_ until line 1131. The task begins polling immediately. If the transaction is already aborted (or aborts very quickly), the task's FinishPollTransaction fires abort_() → BackfillTable::Abort() → MarkAllIndexesAsFailed() → CheckIfDone() → StopLivenessMonitor() before liveness_task_ is assigned. StopLivenessMonitor sees null and does nothing.

In this specific case the task happens to have self-completed via Complete() before calling abort_(), so there's no leak. But the correctness depends on an implementation detail of FinishPollTransaction's ordering (Complete() before abort_()) — if that ordering ever changes, this breaks silently.

The fix is straightforward: create the task without starting it, store it under the lock, then start it:

auto task = std::make_shared<DdlRequesterLivenessTask>(...);
{
  std::lock_guard l(mutex_);
  DCHECK(!liveness_task_);
  liveness_task_ = task;
}
task->Start();

This would require exposing a two-phase create+start API on DdlRequesterLivenessTask (the current CreateAndStartTask bundles both).

Issue 2: No error handling on CreateAndStartTask

CreateAndStartTask returns a shared_ptr, not a Result. If Start() fails internally (threadpool full, task immediately aborted by ValidateRunnable, etc.), the caller has no way to know the liveness monitor is non-functional. The entire feature becomes silently disabled with no log message indicating why.

At minimum, CreateAndStartTask should return Result<shared_ptr<DdlRequesterLivenessTask>> or the caller should verify the task's state after creation. Alternatively, log a warning if the task is in a terminal state immediately after start.

Issue 3: Callback captures shared_ptr<BackfillTable>

The lambdas capture self via shared_from_this(). This prevents BackfillTable destruction while the liveness task is alive. If StopLivenessMonitor is never called (e.g. a code path that sets done_ without going through the normal terminal paths), the BackfillTable leaks along with the task. Today's call sites appear to cover all terminal paths, but this is fragile — a future change that adds a new exit path could miss the StopLivenessMonitor call.

Consider weak_ptr for the callbacks, or a mechanism where the task self-terminates when done_() returns true (which ValidateRunnable already does — but only on the next scheduled step, not immediately).

StopLivenessMonitor

Good: Lock discipline avoids deadlock

Moving the task out via std::move under mutex_ then calling AbortAndReturnPrevState outside the lock is correct. The abort path calls PerformAbort() → Shutdown() → sync_.Wait(). If this were done under mutex_, and the task's in-flight callback tried to re-enter BackfillTable (which acquires mutex_), you'd deadlock.

Good: Idempotency via std::move

After std::move, liveness_task_ is null; subsequent calls are no-ops. This is essential because StopLivenessMonitor is called from multiple convergent terminal paths: Done() (success), MarkIndexesAsFailed() (failure), CheckIfDone() (via Abort()).
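The move-out-under-lock idempotency pattern can be sketched in isolation. Names here are hypothetical stand-ins for the PR's classes:

```cpp
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for DdlRequesterLivenessTask.
struct Task {
  bool aborted = false;
  void Abort() { aborted = true; }  // stand-in for AbortAndReturnPrevState
};

class Monitor {
 public:
  void Set(std::shared_ptr<Task> t) {
    std::lock_guard<std::mutex> l(mutex_);
    task_ = std::move(t);
  }

  // Idempotent stop: move the task out under the lock, abort it outside the
  // lock (the abort may block, so holding mutex_ there could deadlock with a
  // callback re-entering the owner). A second call sees nullptr and is a no-op.
  void Stop() {
    std::shared_ptr<Task> task;
    {
      std::lock_guard<std::mutex> l(mutex_);
      task = std::exchange(task_, nullptr);
    }
    if (task) task->Abort();
  }

 private:
  std::mutex mutex_;
  std::shared_ptr<Task> task_;
};
```

Calling Stop() from multiple convergent terminal paths is then safe: only the first call observes a non-null task.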

Concern: AbortAndReturnPrevState may block

AbortAndReturnPrevState can trigger PerformAbort() → Shutdown() → sync_.Wait(), which blocks until the in-flight GetTransactionStatus RPC completes. This is called from Done() and CheckIfDone(), which run on the callback threadpool. A slow or hung transaction status RPC could stall the backfill completion path. The transaction_rpc_timeout_ms flag bounds this, but it's worth being aware of.

Summary of actionable items

  1. Fix the create-then-store race — separate creation from starting, or at minimum document why the current ordering is safe and what invariants it depends on.
  2. Add error handling for task creation — at minimum log when the liveness monitor fails to start.
  3. Consider the blocking potential of StopLivenessMonitor: sync_.Wait() can stall the callback threadpool.

Comment thread src/yb/master/catalog_manager.cc
is_backfilling_ = false;
}

// Store/retrieve the DDL transaction from the PG backend that initiated the backfill.
Contributor


After some back-and-forth with AI, got this (which I haven't fully verified in the interest of time):

Review of SetPendingBackfillRequesterTransaction and TakePendingBackfillRequesterTransaction

Note: Not persisted across master failover

Already tracked in #31472 and commented at https://github.com/yugabyte/yugabyte-db/pull/31378/changes#r3198264572. Not repeating here.

Note: Take returning nullopt is expected in multiple cases

StartBackfillingData always calls Take when !requester_transaction && current_version (line 348). Nullopt is the normal result for:

  • YCQL: Set is never called — YCQL doesn't go through CatalogManager::BackfillIndex (which is PGSQL-only). Nullopt is the expected baseline.
  • YSQL without a requester transaction: older PG clients that don't send requester_transaction, or decode failure (line 6594).
  • YSQL after master failover: in-memory state lost, already tracked in [YSQL] Create BackfillJobPB earlier in the backfill lifecycle #31472.

The only scenario where nullopt from Take would indicate a problem is a version mismatch — Set was called at V+1 but Take is called at some other version. This would mean an unexpected version bump occurred between the permission update and the backfill launch. This is unlikely today (YSQL does exactly one bump), but there's no way to distinguish this case from the legitimate nullopt cases at the Take call site.

Not flagging this as actionable, but noting that debugging a missing liveness monitor will require correlating logs from the Set call site (which currently has no log) with the Take call site. A VLOG at the Set call (line 523) recording the stored version would help.

Issue 1: Version encoding assumes exactly one version bump between Set and Take

Set stores at current_version + 1. Take is called with the table's version at the time backfill is launched. This works because YSQL does exactly one permission update (WRITE_AND_DELETE → DO_BACKFILL) which bumps the version by exactly 1.

If a future change introduces additional intermediate permission steps or version bumps between Set and Take, the versions would mismatch and Take would silently return nullopt, disabling liveness monitoring. The version matching is correct today but the coupling is implicit. A comment on SetPendingBackfillRequesterTransaction noting this single-bump assumption would help.
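The single-bump version matching can be illustrated with a stripped-down Set/Take pair. This is a sketch with hypothetical names (a string stands in for TransactionMetadata; the real methods live on TableInfo):

```cpp
#include <cstdint>
#include <mutex>
#include <optional>
#include <string>
#include <utility>

// Set stores the requester transaction tagged with current_version + 1 (the
// single expected permission bump). Take returns it only if the caller's
// version matches; any mismatch silently yields nullopt.
class PendingRequesterTxn {
 public:
  void Set(std::string txn, uint32_t current_version) {
    std::lock_guard<std::mutex> l(mutex_);
    txn_ = std::move(txn);
    expected_version_ = current_version + 1;
  }

  std::optional<std::string> Take(uint32_t version) {
    std::lock_guard<std::mutex> l(mutex_);
    if (!txn_ || version != expected_version_) return std::nullopt;
    return std::exchange(txn_, std::nullopt);  // consume on successful match
  }

 private:
  std::mutex mutex_;
  std::optional<std::string> txn_;
  uint32_t expected_version_ = 0;
};
```

If more than one version bump ever occurs between Set and Take, the match fails and liveness monitoring is silently disabled, which is exactly the coupling the comment above asks to document.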

Issue 2: Transaction decode failure is only a WARNING

In CatalogManager::BackfillIndex (lines 6590-6596):

if (req->has_requester_transaction()) {
    auto result = TransactionMetadata::FromPB(req->requester_transaction());
    if (result.ok()) {
      requester_txn = std::move(*result);
    } else {
      LOG(WARNING) << "BackfillIndex: failed to decode requester transaction: " << result.status();
    }
}

If the PG backend sends a malformed transaction, the decode fails and is logged as a WARNING. The backfill proceeds without liveness monitoring. This is fine for robustness (don't block backfill for a monitoring feature), but the WARNING could be easy to miss. Consider LOG(DFATAL) in debug builds to catch protocol bugs early.

Issue 3: Method bodies in the header file

SetPendingBackfillRequesterTransaction and TakePendingBackfillRequesterTransaction are defined inline in catalog_entity_info.h. Most TableInfo methods with comparable complexity (SetIsBackfilling, SetCreateTableErrorStatus, etc.) are declared in the header but defined in catalog_entity_info.cc. ClearIsBackfilling is inline but is a trivial one-liner. These two methods have lock acquisition, conditional logic, and std::exchange — they should follow the prevailing pattern and move to the .cc file.

Summary of actionable items

  1. Comment the single-version-bump assumption — at the Set call site (line 523-524) or on the field declaration.
  2. Add a VLOG at the Set call site (line 523) recording the stored version, to aid debugging when the liveness monitor unexpectedly doesn't start.
  3. Consider LOG(DFATAL) for transaction decode failure — at line 6594-6596, a malformed requester_transaction from the PG client is only a WARNING. LOG(DFATAL) would catch protocol bugs in debug builds.
  4. Move Set/Take method bodies to catalog_entity_info.cc — they have non-trivial logic and don't match the header-inline pattern used by comparable TableInfo methods.

Contributor

@jasonyb jasonyb left a comment


I'll look at the last two tests later. I'm familiar with the backfill flow but not the ddl verification task flow. Asked for assistance on that, but I will do it later if no help comes.

// Retrieve the requester transaction if it was stored during the permission-update phase.
// Pass current_version so TakePendingBackfillRequesterTransaction rejects stale
// transactions from earlier backfill attempts.
if (!requester_transaction && current_version) {
Contributor


Sorry, I should have clarified that this suggestion was contingent on #31378 (comment) being true. I believe it is true, but it is not a blocker to me to have this dead code.

Comment thread src/yb/master/backfill_index.h Outdated
Comment thread src/yb/master/ysql_ddl_verification_task.cc
Comment thread src/yb/tserver/pg_client_session.cc Outdated
Comment thread src/yb/yql/pgwrapper/pg_index_backfill-test.cc Outdated
Comment thread src/yb/yql/pgwrapper/pg_index_backfill-test.cc Outdated
Comment thread src/yb/yql/pgwrapper/pg_index_backfill-test.cc Outdated
pid_ready.Wait();
{
auto monitor_conn = VERIFY_RESULT(ConnectToDB(kDatabaseName));
RETURN_NOT_OK(WaitFor(
Contributor


I believe we already have a GetIndexStateFlags for this purpose

Collaborator Author


Doesn't GetIndexStateFlags use conn_, which we already reset? I guess we could modify GetIndexStateFlags to take the connection as a parameter? I'd be fine with that.

Contributor

Ok. Not a blocker, but you could choose to not conn_.reset() to be able to use GetIndexStateFlags here, right? conn_ shouldn't be blocking the CREATE INDEX, at least not in any meaningful way.

Comment on lines +3574 to +3577
// T~4s First sleep inside UpdateIndexPermission completes (kAlterTableDelay)
// T~4s Permission updated; requester_transaction stored via
// SetPendingBackfillRequesterTransaction
// T~8s Second sleep inside UpdateIndexPermission completes -> AlterTable RPCs proceed
Contributor

You have two UpdateIndexPermission calls, and while that is true, the SleepFor(kAlterTableDelay * 2 + 1s); below comes after having waited for the indisready permission, so I believe the first UpdateIndexPermission is already covered. This brings it down to SleepFor(kAlterTableDelay + 1s);.

Furthermore, the comment says "BEFORE any tserver BackfillIndex RPCs have been sent", but if you are waiting out the full kAlterTableDelay, doesn't that mean they have been sent (if not for the code overhead)? It is not clear to me at which step you are trying to kill the backend. What further complicates this is the long delay before the transaction is noticed as aborted.

Collaborator Author

@egladysh egladysh May 12, 2026

I think WaitForIndexStateFlags will see indisready=true at T0. I asked AI to clarify the comments; they seem correct to me now. I am trying to kill the backend at T9 (added comments). Also, "RPCs sent" is not strictly correct, so I changed it to "RPCs completed".

Comment thread src/yb/yql/pgwrapper/pg_index_backfill-test.cc Outdated
@egladysh
Collaborator Author

egladysh commented May 9, 2026

Addressed some of the comments, will take a look at the rest later.

Contributor

@iSignal iSignal left a comment

Thanks for making this change @egladysh, it will help quite a bit! The overall structure of the code looks good. I have some specific concerns and a couple of questions. Main ones are

  1. master leader failover (can be punted to a later diff)
  2. BackfillTable::Abort/Done synchronization seems to be missing.
  3. pg client session txn identification needs to be based on whether txn DDL is enabled

Rest are relatively minor.

// The PG backend holds a DDL transaction open for the entire backfill duration
// (StartTransactionCommand at indexcmds.c:2334). Pass it to the master so it can detect when
// this backend is killed (-> txn aborted) and stop launching new backfill chunks.
auto meta = GetDdlTransactionMetadata(
Contributor

We should use req.use_regular_transaction_block just like the other DDLs. The context here is that prior to transactional DDL, DDLs used a separate "autonomous" transaction (the kDDL session txn), but after the transactional DDL feature they use the kPlain session txn. So I imagine this may not work correctly when transactional DDL is on unless we plumb this field through, similar to other DDLs like PgCreateTable.

We can either try testing with transactional DDL on, or just plumb it through similar to the other DDLs.

Collaborator Author

I see. I'll add it to PgBackfillIndexRequestPB.

Comment thread src/yb/master/backfill_index.cc
Comment thread src/yb/master/backfill_index.cc
// Schedule asynchronously so user_cb fires first.
//
// Complete() must still be called before callbacks_.abort_() to avoid a different deadlock:
// Abort() may call BackfillTable::StopLivenessMonitor() -> AbortAndReturnPrevState(), which
Contributor

Is it possible to supply a flag to Abort indicating it is coming from the liveness check, so that the backfill does not try to stop the liveness check again? That would simplify the code.

Collaborator Author

Good idea!

Comment thread src/yb/master/ysql_ddl_verification_task.cc
auto self = shared_from_this();
BackgroundDdlCallbacks callbacks{
.done_ = [self] { return self->done(); },
.abort_ = [self] { return self->Abort(); },
Contributor

It doesn't seem like BackfillTable::Abort / BackfillTable::Done are ready to be called in a multi-threaded context once we add this callback.

  1. txn poll can call Abort while backfilltablet is causing a transition to the Done success path.
  2. txn poll can call Abort while backfilltablet is causing its own failure transition.

It seems like so far they were able to use std atomics to avoid real locking, but now it would be better to use a proper lock to keep it simple. We can have some explicit internal enum state like waiting, aborting, aborted, success and use that to decide what to do from the callbacks (we only want to affect the waiting state from the txn callback and not the others). Any other approach is also ok, but the current path seems prone to problems.
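The suggestion above can be sketched as a single atomic enum driven by compare-and-swap. This is only an illustration of the idea, assuming a standalone class; the names here are placeholders, not the actual YugabyteDB code:

```cpp
#include <atomic>

// Hypothetical state machine replacing the done_ bool: concurrent callers
// can distinguish which transition won, and the txn-poll callback may only
// act on the kWaiting state.
enum class BackfillState { kWaiting, kAborting, kSucceeded, kFailed };

class BackfillStateMachine {
 public:
  // Liveness-check (txn poll) path: abort only if nothing else transitioned.
  bool TryAbortFromLivenessCheck() {
    BackfillState expected = BackfillState::kWaiting;
    // CAS fails if a tablet-driven success/failure transition already won.
    return state_.compare_exchange_strong(expected, BackfillState::kAborting);
  }

  // Tablet completion path: succeed only from kWaiting.
  bool TryMarkSucceeded() {
    BackfillState expected = BackfillState::kWaiting;
    return state_.compare_exchange_strong(expected, BackfillState::kSucceeded);
  }

  // Tablet failure path: fail only from kWaiting.
  bool TryMarkFailed() {
    BackfillState expected = BackfillState::kWaiting;
    return state_.compare_exchange_strong(expected, BackfillState::kFailed);
  }

  BackfillState state() const { return state_.load(); }

 private:
  std::atomic<BackfillState> state_{BackfillState::kWaiting};
};
```

With this shape, the liveness callback can only win the transition out of kWaiting; once a tablet-driven success or failure transition has happened, the CAS fails and the callback becomes a no-op, so only one path ever marks the indexes.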

Collaborator Author

@egladysh egladysh May 11, 2026

@iSignal Ah... I assumed that those atomics and mutexes were there to make them thread-safe. I do see the gap now. We can fix it with an enum (the kind of state machine you suggested) or by just moving the done_ usage around, like:

Status BackfillTable::Abort() {
  bool expected = false;
  if (!done_.compare_exchange_strong(expected, true)) {
    return Status::OK();
  }
  ...
}
Which one would you prefer?

Contributor

done_ would be simpler, but it does not handle the race between Abort and the tablet Done failure path, right? Both may try to mark indexes as failed. I guess an atomic int enum CAS with more than true/false can help distinguish the different states.

Collaborator Author

Hmm, I'm not sure. It seems like indexes_to_build() takes care of it with LockForWrite. A failed-to-failed transition is harmless, and after failed, indexes_to_build() will return {}, if my understanding is correct?

LeaderEpoch epoch_;
ash::WaitStateInfoPtr wait_state_;
std::optional<TransactionMetadata> requester_transaction_;
std::shared_ptr<DdlRequesterLivenessTask> liveness_task_ GUARDED_BY(mutex_);
Contributor

maybe this can be a std::weak_ptr? The task runs on its own, do we need to own it? Right now, both the task and backfilltable are holding refs to each other, so we really need to be sure we release both correctly.

Collaborator Author

@iSignal I think the reference cycle is already explicitly broken, because every exit path (MarkIndexesAsFailed, CheckIfDone) calls StopLivenessMonitor(), which moves liveness_task_ out and clears BackfillTable's reference to the task. Another reference to the task is in TableInfo's task list, and the task holds the last shared_ptr<BackfillTable> because of the captures in the callbacks; when the task finishes, those are released too. No leak. Also, I think that weak_ptr might actually be wrong, because StopLivenessMonitor needs to call AbortAndReturnPrevState on the task, but that depends on the life cycle of the tasks in TableInfo. I feel like shared_ptr is safer, but I could be wrong.

Contributor

Yes, it is safe now, but it is a bit worrying that every future path would need to reason about this and remember to call Stop... during job-termination paths to break the loop.

If we write it as below to get a shared_ptr out of the weak_ptr, it would allow the task to exit by itself as well. But I'm open to other suggestions too.

void BackfillTable::StopLivenessMonitor() {
  std::shared_ptr<DdlRequesterLivenessTask> task;
  {
    std::lock_guard l(mutex_);
    task = liveness_task_.lock();
    liveness_task_.reset();
  }
  if (task) {
    task->AbortAndReturnPrevState(STATUS(Aborted, "BackfillTable is done"));
  }
}
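As a self-contained illustration of the lifetime mechanics in the snippet above, here is the pattern stripped to its essentials. Owner and Task are placeholder names for this sketch, not the real BackfillTable/DdlRequesterLivenessTask types, and the real code would hold the lock shown as a lock_guard:

```cpp
#include <memory>
#include <mutex>

// Placeholder for the background task; only tracks whether it was aborted.
struct Task {
  bool aborted = false;
  void Abort() { aborted = true; }
};

// Placeholder for the owning object. It keeps only a weak_ptr to the task,
// so the owner<->task reference cycle cannot keep either object alive.
struct Owner {
  std::mutex mutex_;
  std::weak_ptr<Task> task_;

  // Mirrors StopLivenessMonitor: promote the weak_ptr under the lock, clear
  // it, then abort the task outside the lock only if it is still alive.
  bool StopMonitor() {
    std::shared_ptr<Task> task;
    {
      std::lock_guard<std::mutex> l(mutex_);
      task = task_.lock();
      task_.reset();
    }
    if (task) {
      task->Abort();
      return true;
    }
    return false;  // Task already exited on its own; nothing to do.
  }
};
```

Because the owner only holds a weak_ptr, the task can drop its own last reference and exit on its own; StopMonitor then simply finds the weak_ptr expired and becomes a no-op.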

Collaborator Author

My concern is about AbortAndReturnPrevState. I thought the call must be made? If that's not the case, I'd agree that weak_ptr would be a better choice.

context->GetClientDeadline(), IsTxnUsingTableLocks(false));
std::optional<TransactionMetadata> txn_metadata;
if (!meta.ok()) {
VLOG(1) << "BackfillIndex: failed to get DDL transaction metadata: " << meta.status();
Contributor

minor: maybe this can be LOG(WARNING)

@egladysh
Collaborator Author

@iSignal FYI, two separate tickets to resolve issues with passing transactions around:
#31472
#31471

This PR just passes them in memory in TableInfo, a limitation that is understood.

Contributor

@jasonyb jasonyb left a comment

Still waiting for a test overhaul regarding timing expectations. I believe I covered everything I need to for review (verification task is @iSignal).

Comment thread src/yb/master/ysql_ddl_verification_task.cc

}

auto terminated = VERIFY_RESULT(
conn_->FetchRow<bool>(Format("SELECT pg_terminate_backend($0)", create_index_pid)));
Contributor

You claim conn_ is reset above but use it here. Does the test pass locally?

Collaborator Author

I don't think I ran the EarlyKill tests after I asked AI to refactor them to dedup the code. You are right, it's a bug.

Comment thread src/yb/yql/pgwrapper/pg_index_backfill-test.cc