[#31263] YSQL: Stop index backfill when the CREATE INDEX session is terminated #31378
egladysh wants to merge 16 commits into
Conversation
Code Review
This pull request introduces a mechanism to cancel background index backfill operations when the initiating PostgreSQL backend is terminated. It implements a DdlRequesterLivenessTask to monitor the transaction status and abort the backfill if the transaction is aborted. The changes span the client, master, and tserver layers to propagate transaction metadata and include extensive regression tests. Feedback was provided to increase the logging severity from a warning to an error when a backfill abort operation fails.
trigger jenkins
jasonyb
left a comment
Did not get a chance to look at the whole thing but have provided a lot of comments to keep you busy.
Had another busy day. This is still on my radar.
Merged the latest changes to PollTransactionStatusBase.
This old test run #31378 (comment) had 100% failing tests PgIndexBackfillCancellationEarlyKillTest - BackfillStopsAfterEarlyBackendKill/* and PgIndexBackfillCancellationTest - BackfillStopsAfterBackendKill/* on release and fastdebug builds.
That's not good. Did they fail on all platforms including arm mac14 clang21 (that is the setup I have been testing with)? I don't seem to have access to the test logs. Also, is it possible to trigger the tests again?
Also note I see
These are alma8-clang21-release and alma8-gcc12-fastdebug. All four full logs (fastdebug) are present here: https://gist.github.com/jasonyb/1b2eb85d6240aae92586bbb85e207173
This test is not run on mac, but I think it was run on arm-alma8-clang21-release and passed there. So 2 of 3 build types failed consistently. Trigger jenkins
Jenkins build has been triggered. Results will be posted once it completes.
JenkinsBot
@egladysh For https://github.com/yugabyte/yugabyte-db/pull/31378/changes#r3175721299, I notice you have split commits locally. Our current policy is to always squash-merge PRs, so splitting commits locally doesn't really do anything. Two options:
Given the two dependency pieces are low-review-contention, I believe the safe option is better. All we would have to do is wait for test results to pass and for the PR title/description to be good. Make sure to explain what (potential) problems are fixed by each of these pieces.
The comments were: "It's logically independent of the liveness monitoring feature. Consider splitting into"
As for the failing tests: I looked at the logs and it seems like it fails because of the asserts in fastdebug that you mentioned, when the test kills the backend. The abort fails, DdlRequesterLivenessTask doesn't know about it, backfill continues, the test fails. It could be another independent bug. I'll investigate it.
It seems to be a more serious issue with timing in
❌ Jenkins build for commit Exceptions:
Checking test failure count per build versus limit of 20 (0 on mac).
🔨 DB Build/Test Job Summary
JenkinsBot
It failed because of a clear infra-side issue: the build takes each of your commits and rebases them onto newest master. If you resolve merge conflicts in a later commit (such as f5f5f28), it will not help when rebasing an earlier commit that is the source of conflict (such as 028994f). The correct infra approach would be to either
Both cases avoid the issue of merge conflicts on intermediate commits of the PR's branch. Workaround while this infra issue is in place: squash your changes to a single local commit and force push that for the PR. cc: @hari90
I dug into the root cause of these test failures. Here's what's happening: the claim that this is independent from this PR is partially correct — the root cause is pre-existing, but this PR's tests are uniquely sensitive to it.

Root cause
When the backend is killed, the shutdown sequence involves two FATALs:
The DDL transaction is left in PENDING state on the coordinator. It stays PENDING until the heartbeat timeout expires.

Why this PR's tests fail

Why existing DDL verification tasks are unaffected

Fix for this PR
Neither FATAL-1 nor FATAL-2 causes these test failures. The backend is going to exit regardless. The tests fail because the abort RPC itself fails (messenger already shut down), leaving the transaction PENDING regardless of log severity. A fix is to reduce the heartbeat timeout in the tests. This is purely a test-side configuration; the production code is unaffected.

Separate issue (independent, does not affect test pass/fail): the pre-existing FATAL-2 during SIGTERM
I filed #31439 for the pre-existing bug where
My understanding was that #31439 is not a dependency for this: the tests should pass even though we see FATALs in the logs. Assuming I am correct about this, then I think it is better to do #31439 after this so that you have a concrete repro (namely, the logs of some of the tests here), but this is just a recommendation. If I am wrong and the tests fail without fixing #31439, yes, please open a separate PR for #31439.
@jasonyb I think there is a clear dependency for fastdebug or debug builds where the assert is enabled. The fastdebug builds keep the asserts enabled but the timing is close to release builds. In debug builds the timing is different and the RPC has time to finish before the assert checks, so it passes most of the time. The test is sensitive to proper DDL cleanup and exposes the bug with using the assert when it is enabled. Again, the assert in question that breaks the proper DDL cleanup is around
I do agree with #31472.
jasonyb
left a comment
Still working through the review. Have these comments in the meantime.
  return Status::OK();
}

void BackfillTable::StartRequesterLivenessMonitor() {
After some back-and-forth with AI, got this (which I haven't fully verified in the interest of time):
Review of StartRequesterLivenessMonitor and StopLivenessMonitor
StartRequesterLivenessMonitor
Issue 1: Race between CreateAndStartTask and storing liveness_task_
The task is created and started at line 1120, but not stored into liveness_task_ until line 1131. The task begins polling immediately. If the transaction is already aborted (or aborts very quickly), the task's FinishPollTransaction fires abort_() → BackfillTable::Abort() → MarkAllIndexesAsFailed() → CheckIfDone() → StopLivenessMonitor() before liveness_task_ is assigned. StopLivenessMonitor sees null and does nothing.
In this specific case the task happens to have self-completed via Complete() before calling abort_(), so there's no leak. But the correctness depends on an implementation detail of FinishPollTransaction's ordering (Complete() before abort_()) — if that ordering ever changes, this breaks silently.
The fix is straightforward: create the task without starting it, store it under the lock, then start it:
auto task = std::make_shared<DdlRequesterLivenessTask>(...);
{
  std::lock_guard l(mutex_);
  DCHECK(!liveness_task_);
  liveness_task_ = task;
}
task->Start();

This would require exposing a two-phase create+start API on DdlRequesterLivenessTask (the current CreateAndStartTask bundles both).
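A minimal sketch of what that two-phase shape could look like; the Create/Start split and the signatures shown here are illustrative assumptions, not the existing API:

```cpp
// Hypothetical split of today's CreateAndStartTask: Create() only constructs the task so the
// caller can store it (under its own lock) before any polling begins; Start() returns a
// Status so a scheduling failure can be surfaced instead of silently disabling the monitor.
static std::shared_ptr<DdlRequesterLivenessTask> Create(BackgroundDdlCallbacks callbacks);
Status Start();
```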
Issue 2: No error handling on CreateAndStartTask
CreateAndStartTask returns a shared_ptr, not a Result. If Start() fails internally (threadpool full, task immediately aborted by ValidateRunnable, etc.), the caller has no way to know the liveness monitor is non-functional. The entire feature becomes silently disabled with no log message indicating why.
At minimum, CreateAndStartTask should return Result<shared_ptr<DdlRequesterLivenessTask>> or the caller should verify the task's state after creation. Alternatively, log a warning if the task is in a terminal state immediately after start.
Issue 3: Callback captures shared_ptr<BackfillTable>
The lambdas capture self via shared_from_this(). This prevents BackfillTable destruction while the liveness task is alive. If StopLivenessMonitor is never called (e.g. a code path that sets done_ without going through the normal terminal paths), the BackfillTable leaks along with the task. Today's call sites appear to cover all terminal paths, but this is fragile — a future change that adds a new exit path could miss the StopLivenessMonitor call.
Consider weak_ptr for the callbacks, or a mechanism where the task self-terminates when done_() returns true (which ValidateRunnable already does — but only on the next scheduled step, not immediately).
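If the weak_ptr route is taken, the callback wiring could look roughly like this (a sketch, assuming the callback struct and member functions stay as shown in the diff above):

```cpp
// Capture a weak_ptr so the liveness task does not extend BackfillTable's lifetime; if the
// BackfillTable is already gone, report "done" / no-op abort so the task self-terminates.
std::weak_ptr<BackfillTable> weak_self = shared_from_this();
BackgroundDdlCallbacks callbacks{
    .done_ = [weak_self] {
      auto self = weak_self.lock();
      return !self || self->done();
    },
    .abort_ = [weak_self]() -> Status {
      auto self = weak_self.lock();
      return self ? self->Abort() : Status::OK();
    },
};
```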
StopLivenessMonitor
Good: Lock discipline avoids deadlock
Moving the task out via std::move under mutex_ then calling AbortAndReturnPrevState outside the lock is correct. The abort path calls PerformAbort() → Shutdown() → sync_.Wait(). If this were done under mutex_, and the task's in-flight callback tried to re-enter BackfillTable (which acquires mutex_), you'd deadlock.
Good: Idempotency via std::move
After std::move, liveness_task_ is null; subsequent calls are no-ops. This is essential because StopLivenessMonitor is called from multiple convergent terminal paths: Done() (success), MarkIndexesAsFailed() (failure), CheckIfDone() (via Abort()).
Concern: AbortAndReturnPrevState may block
AbortAndReturnPrevState can trigger PerformAbort() → Shutdown() → sync_.Wait(), which blocks until the in-flight GetTransactionStatus RPC completes. This is called from Done() and CheckIfDone(), which run on the callback threadpool. A slow or hung transaction status RPC could stall the backfill completion path. The transaction_rpc_timeout_ms flag bounds this, but it's worth being aware of.
Summary of actionable items
- Fix the create-then-store race — separate creation from starting, or at minimum document why the current ordering is safe and what invariants it depends on.
- Add error handling for task creation — at minimum log when the liveness monitor fails to start.
- Consider the blocking potential of StopLivenessMonitor — sync_.Wait() can stall the callback threadpool.
  is_backfilling_ = false;
}

// Store/retrieve the DDL transaction from the PG backend that initiated the backfill.
After some back-and-forth with AI, got this (which I haven't fully verified in the interest of time):
Review of SetPendingBackfillRequesterTransaction and TakePendingBackfillRequesterTransaction
Note: Not persisted across master failover
Already tracked in #31472 and commented at https://github.com/yugabyte/yugabyte-db/pull/31378/changes#r3198264572. Not repeating here.
Note: Take returning nullopt is expected in multiple cases
StartBackfillingData always calls Take when !requester_transaction && current_version (line 348). Nullopt is the normal result for:
- YCQL: Set is never called — YCQL doesn't go through CatalogManager::BackfillIndex (which is PGSQL-only). Nullopt is the expected baseline.
- YSQL without a requester transaction: older PG clients that don't send requester_transaction, or decode failure (line 6594).
- YSQL after master failover: in-memory state lost, already tracked in "[YSQL] Create BackfillJobPB earlier in the backfill lifecycle" #31472.
The only scenario where nullopt from Take would indicate a problem is a version mismatch — Set was called at V+1 but Take is called at some other version. This would mean an unexpected version bump occurred between the permission update and the backfill launch. This is unlikely today (YSQL does exactly one bump), but there's no way to distinguish this case from the legitimate nullopt cases at the Take call site.
Not flagging this as actionable, but noting that debugging a missing liveness monitor will require correlating logs from the Set call site (which currently has no log) with the Take call site. A VLOG at the Set call (line 523) recording the stored version would help.
Issue 1: Version encoding assumes exactly one version bump between Set and Take
Set stores at current_version + 1. Take is called with the table's version at the time backfill is launched. This works because YSQL does exactly one permission update (WRITE_AND_DELETE → DO_BACKFILL) which bumps the version by exactly 1.
If a future change introduces additional intermediate permission steps or version bumps between Set and Take, the versions would mismatch and Take would silently return nullopt, disabling liveness monitoring. The version matching is correct today but the coupling is implicit. A comment on SetPendingBackfillRequesterTransaction noting this single-bump assumption would help.
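A sketch of the suggested comment, combined with the VLOG mentioned above, at the Set call site; the variable names and the exact Set signature are assumptions:

```cpp
// YSQL performs exactly one permission bump (WRITE_AND_DELETE -> DO_BACKFILL) between
// storing the requester transaction and launching the backfill, hence current_version + 1.
// Any extra version bump introduced in between would make Take miss the stored transaction
// and silently disable liveness monitoring.
VLOG(1) << "Storing backfill requester transaction " << requester_txn->transaction_id
        << " for table " << table->ToString() << " at version " << (current_version + 1);
table->SetPendingBackfillRequesterTransaction(*requester_txn, current_version + 1);
```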
Issue 2: Transaction decode failure is only a WARNING
In CatalogManager::BackfillIndex (lines 6590-6596):
if (req->has_requester_transaction()) {
  auto result = TransactionMetadata::FromPB(req->requester_transaction());
  if (result.ok()) {
    requester_txn = std::move(*result);
  } else {
    LOG(WARNING) << "BackfillIndex: failed to decode requester transaction: " << result.status();
  }
}

If the PG backend sends a malformed transaction, the decode fails and is logged as a WARNING. The backfill proceeds without liveness monitoring. This is fine for robustness (don't block backfill for a monitoring feature), but the WARNING could be easy to miss. Consider LOG(DFATAL) in debug builds to catch protocol bugs early.
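For example, the else branch could become the following (a sketch; LOG(DFATAL) is fatal in debug builds and logs an error in release builds):

```cpp
} else {
  // Crashes debug/test builds so protocol bugs surface early, while release builds still
  // proceed with the backfill (just without liveness monitoring).
  LOG(DFATAL) << "BackfillIndex: failed to decode requester transaction: "
              << result.status();
}
```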
Issue 3: Method bodies in the header file
SetPendingBackfillRequesterTransaction and TakePendingBackfillRequesterTransaction are defined inline in catalog_entity_info.h. Most TableInfo methods with comparable complexity (SetIsBackfilling, SetCreateTableErrorStatus, etc.) are declared in the header but defined in catalog_entity_info.cc. ClearIsBackfilling is inline but is a trivial one-liner. These two methods have lock acquisition, conditional logic, and std::exchange — they should follow the prevailing pattern and move to the .cc file.
Summary of actionable items
- Comment the single-version-bump assumption — at the Set call site (line 523-524) or on the field declaration.
- Add a VLOG at the Set call site (line 523) recording the stored version, to aid debugging when the liveness monitor unexpectedly doesn't start.
- Consider LOG(DFATAL) for transaction decode failure — at line 6594-6596, a malformed requester_transaction from the PG client is only a WARNING. LOG(DFATAL) would catch protocol bugs in debug builds.
- Move Set/Take method bodies to catalog_entity_info.cc — they have non-trivial logic and don't match the header-inline pattern used by comparable TableInfo methods.
Co-authored-by: jasonyb <93959687+jasonyb@users.noreply.github.com>
jasonyb
left a comment
I'll look at the last two tests later. I'm familiar with the backfill flow but not the ddl verification task flow. Asked for assistance on that, but I will do it later if no help comes.
// Retrieve the requester transaction if it was stored during the permission-update phase.
// Pass current_version so TakePendingBackfillRequesterTransaction rejects stale
// transactions from earlier backfill attempts.
if (!requester_transaction && current_version) {
Sorry, I should have clarified that this suggestion was contingent on #31378 (comment) being true. I believe it is true, but it is not a blocker to me to have this dead code.
pid_ready.Wait();
{
  auto monitor_conn = VERIFY_RESULT(ConnectToDB(kDatabaseName));
  RETURN_NOT_OK(WaitFor(
I believe we already have a GetIndexStateFlags for this purpose
Doesn't GetIndexStateFlags use conn_ that we reset already? I guess we could modify GetIndexStateFlags and pass the connection as a parameter? I'd be fine with that.
Ok. Not a blocker, but you could choose to not conn_.reset() to be able to use GetIndexStateFlags here, right? conn_ shouldn't be blocking the CREATE INDEX, at least not in any meaningful way.
// T~4s First sleep inside UpdateIndexPermission completes (kAlterTableDelay)
// T~4s Permission updated; requester_transaction stored via
//      SetPendingBackfillRequesterTransaction
// T~8s Second sleep inside UpdateIndexPermission completes -> AlterTable RPCs proceed
You have two UpdateIndexPermission, and while that is true, the below SleepFor(kAlterTableDelay * 2 + 1s); is after having waited for indisready permissions, so I believe the first UpdateIndexPermission is already covered. This brings it down to SleepFor(kAlterTableDelay + 1s);.
Furthermore, the comment says "BEFORE any tserver BackfillIndex RPCs have been sent", but if you are waiting the full kAlterTableDelay, doesn't that mean it is sent (if not for the code overhead)? It is not clear to me at what step you are trying to kill the backend. What further complicates this is the long delay before the transaction is noticed as aborted.
I think WaitForIndexStateFlags will see indisready=true at T0. I asked AI to clarify the comments. It seems correct to me now. I am trying to kill the backend at T9 (added comments). Also, "RPCs sent" is not strictly correct, changed to "RPCs completed".
Addressed some of the comments, will take a look at the rest later.
Thanks for making this change @egladysh, it will help quite a bit! The overall structure of the code looks good. I have some specific concerns and a couple of questions. Main ones are
- master leader failover (can be punted to a later diff)
- BackfillTable::Abort/Done synchronization seems to be missing.
- pg client session txn identification needs to be based on txn ddl being enabled or not
Rest are relatively minor.
// The PG backend holds a DDL transaction open for the entire backfill duration
// (StartTransactionCommand at indexcmds.c:2334). Pass it to the master so it can detect when
// this backend is killed (-> txn aborted) and stop launching new backfill chunks.
auto meta = GetDdlTransactionMetadata(
We should use req.use_regular_transaction_block just like the other DDLs. The context here is that prior to txn DDL, DDLs used a separate "autonomous" transaction (the kDDL session txn), but after the transactional DDL feature they use the kPlain session txn. So I imagine this may not work correctly when transactional DDL is on unless we plumb this field through, similar to other DDLs like PgCreateTable.
We can either try testing with transactional DDL on or just plumb it through similar to other DDLs.
I see. I'll add it to PgBackfillIndexRequestPB.
// Schedule asynchronously so user_cb fires first.
//
// Complete() must still be called before callbacks_.abort_() to avoid a different deadlock:
// Abort() may call BackfillTable::StopLivenessMonitor() -> AbortAndReturnPrevState(), which
Is it possible to supply a flag to Abort indicating it is coming from the liveness check, so that the backfill is not going to try to stop the liveness check again? That would simplify the code.
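A minimal sketch of that flag idea; the signature change is hypothetical and the body simplifies the real flow (Abort() -> MarkAllIndexesAsFailed() -> CheckIfDone() -> StopLivenessMonitor() as described earlier):

```cpp
// Hypothetical: when the abort originates from the liveness task itself, skip stopping the
// task again; it is already completing on its own, which removes the ordering dependency on
// Complete() being called before callbacks_.abort_().
Status BackfillTable::Abort(bool from_liveness_monitor) {
  // ... existing abort work (mark indexes as failed, etc.) ...
  if (!from_liveness_monitor) {
    StopLivenessMonitor();
  }
  return Status::OK();
}
```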
auto self = shared_from_this();
BackgroundDdlCallbacks callbacks{
    .done_ = [self] { return self->done(); },
    .abort_ = [self] { return self->Abort(); },
It doesn't seem like BackfillTable::Abort / BackfillTable::Done are ready to be called in a multi-threaded context once we add this callback.
- txn poll can call Abort while backfilltablet is causing a transition to Done success path.
- txn poll can call Abort while backfilltablet is causing its own failure transition.
It seems like so far they were able to use std atomics to avoid real locking, but now it would be better to use a proper lock to keep it simple. We can have some explicit internal enum state like waiting, aborting, aborted, success and use that to decide what to do from the callbacks (we only want to affect the waiting state from the txn callback and not the others). Any other approaches are also ok, but the current path seems prone to problems.
@iSignal Ah... I assumed that those atomics and mutexes were there to make them thread-safe. I do see the gap now. We can fix it with an enum (kind of a state machine, as you suggested) or just by moving the done_ usage around, like:
Status BackfillTable::Abort() {
  bool expected = false;
  if (!done_.compare_exchange_strong(expected, true)) {
    return Status::OK();
  }
  ...
Which one would you prefer?
done_ would be simpler but does not handle the race between Abort and the tablet Done failure path, right? Both may try to mark indexes as failed. I guess an atomic int enum CAS with more than true/false can help distinguish the different states.
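A minimal sketch of that idea (the enum name, states, and method names are illustrative, not actual BackfillTable members):

```cpp
// Only the transition out of kWaiting is contended: the liveness callback may act only if
// nothing else (success or tablet-failure path) has already claimed the terminal transition.
enum class BackfillJobState { kWaiting, kAborting, kFailed, kSuccess };
std::atomic<BackfillJobState> state_{BackfillJobState::kWaiting};

bool TryTransition(BackfillJobState from, BackfillJobState to) {
  auto expected = from;
  return state_.compare_exchange_strong(expected, to);
}

// Liveness-monitor callback: becomes a no-op if the job already finished or failed.
Status AbortFromLivenessMonitor() {
  if (!TryTransition(BackfillJobState::kWaiting, BackfillJobState::kAborting)) {
    return Status::OK();
  }
  // ... mark indexes as failed, then transition kAborting -> kFailed ...
  return Status::OK();
}
```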
Hmm, I'm not sure. It seems like indexes_to_build() takes care of it with LockForWrite. A failed-to-failed transition is harmless, and after a failure indexes_to_build() will return {}, if my understanding is correct?
LeaderEpoch epoch_;
ash::WaitStateInfoPtr wait_state_;
std::optional<TransactionMetadata> requester_transaction_;
std::shared_ptr<DdlRequesterLivenessTask> liveness_task_ GUARDED_BY(mutex_);
Maybe this can be a std::weak_ptr? The task runs on its own, do we need to own it? Right now, both the task and BackfillTable are holding refs to each other, so we really need to be sure we release both correctly.
@iSignal I think that the reference cycle is already explicitly broken because every exit path (MarkIndexesAsFailed, CheckIfDone) calls StopLivenessMonitor(), which moves liveness_task_ out and clears BackfillTable's reference to the task. Another reference to the task is the TableInfo's task list, and the task will hold the last shared_ptr<BackfillTable> because of the captures in the callbacks; when the task finishes, those are released too. No leak. Also I think that weak_ptr might actually be wrong because StopLivenessMonitor needs to call AbortAndReturnPrevState on the task, but that depends on the life cycle of the tasks in TableInfo. I feel like shared_ptr is safer, but I could be wrong.
Yes it is safe now but it is a bit worrying that every future path would need to reason about and remember to call Stop... during Job termination paths to break the loop.
If we write it as below to get a shared ptr out of the weak ptr, it would allow the task to exit by itself as well. But open to other suggestions as well
void BackfillTable::StopLivenessMonitor() {
  std::shared_ptr<DdlRequesterLivenessTask> task;
  {
    std::lock_guard l(mutex_);
    task = liveness_task_.lock();
    liveness_task_.reset();
  }
  if (task) {
    task->AbortAndReturnPrevState(STATUS(Aborted, "BackfillTable is done"));
  }
}
My concern is about AbortAndReturnPrevState. I thought the call must be made? If that's not the case, I'd agree that weak_ptr would be a better choice.
context->GetClientDeadline(), IsTxnUsingTableLocks(false));
std::optional<TransactionMetadata> txn_metadata;
if (!meta.ok()) {
  VLOG(1) << "BackfillIndex: failed to get DDL transaction metadata: " << meta.status();
minor: maybe can be LOG(WARNING)
}

auto terminated = VERIFY_RESULT(
    conn_->FetchRow<bool>(Format("SELECT pg_terminate_backend($0)", create_index_pid)));
You claim conn_ is reset above but use it here. Does the test pass locally?
I don't think I ran the EarlyKill tests after I asked AI to refactor them to dedup the code. You are right, it's a bug.
Summary
Add DdlRequesterLivenessTask, a master-side task that polls the transaction status of the DDL transaction held open by the CREATE INDEX CONCURRENTLY backend. If the transaction is aborted (e.g. because the backend was killed via pg_terminate_backend), the task calls BackfillTable::Abort() to stop the in-progress backfill.

Test plan
- PgIndexBackfillCancellationTest.BackfillStopsAfterBackendKill — asserts that no new backfill RPCs are issued after the backend is killed.
- PgIndexBackfillCancellationWithoutFixTest.BackfillContinuesAfterBackendKill — asserts the old behavior (backfill continues) when the liveness monitor is disabled, serving as a regression baseline.
- PgIndexBackfillCancellationEarlyKillTest.BackfillStopsAfterEarlyBackendKill — same as above but the backend is killed before backfill starts.
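For reference, a minimal sketch of the kill step these tests revolve around (monitor_conn, create_index_pid, and the surrounding helpers are assumptions based on the snippets discussed in the review, not the final test code):

```cpp
// Connect on a separate session, wait until the index reaches the backfill phase,
// then terminate the CREATE INDEX backend and assert no further backfill RPCs arrive.
auto monitor_conn = VERIFY_RESULT(ConnectToDB(kDatabaseName));
auto terminated = VERIFY_RESULT(monitor_conn.FetchRow<bool>(
    Format("SELECT pg_terminate_backend($0)", create_index_pid)));
SCHECK(terminated, IllegalState, "pg_terminate_backend did not find the backend");
// After this point the DDL transaction aborts, DdlRequesterLivenessTask observes it,
// and BackfillTable::Abort() stops launching new backfill chunks.
```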