[DocDB] Deadlock on backup during concurrent DDL workload #29448

@pilshchikov

Jira Link: DB-19240

Description

Steps:

  1. Create a 3-node RF=3 universe on 2.29.0.0-b190
  2. Create a database and load data into 4 tables for 30 minutes
  3. Stop the load
  4. Start a loop:
    4.1. Start a workload of DML and DDL operations (INSERT, DELETE, UPDATE, DROP COLUMN, ADD COLUMN, CHANGE TYPE). The operations are issued from one thread but execute on different nodes, which can cause multiple concurrent catalog rewrites at the same moment
    4.2. Start backup creation
    4.3. Stop the workload
    4.4. Restore to a different database
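
The loop in step 4 can be sketched roughly as follows. This is only an illustration of the ordering of the steps, not the actual test harness: `run_repro_loop` and the four callables passed to it are hypothetical names, and the "one extra writer thread per iteration" rule is an assumption, since the report only says that the number of writer threads grows with each loop.

```python
from typing import Callable


def run_repro_loop(
    iterations: int,
    start_workload: Callable[[int], object],
    create_backup: Callable[[], None],
    stop_workload: Callable[[object], None],
    restore_backup: Callable[[int], None],
    initial_writer_threads: int = 1,
) -> int:
    """Run the backup-under-DDL loop.

    Returns the 1-based iteration at which backup creation failed,
    or 0 if every iteration succeeded.
    """
    for iteration in range(1, iterations + 1):
        # Assumption: one more writer thread per iteration.
        writers = initial_writer_threads + (iteration - 1)
        handle = start_workload(writers)      # step 4.1
        try:
            create_backup()                   # step 4.2
        except RuntimeError:
            return iteration                  # backup task failed
        finally:
            stop_workload(handle)             # step 4.3, also on failure
        restore_backup(iteration)             # step 4.4, different database
    return 0
```

Passing the steps in as callables keeps the sketch independent of any particular backup tooling; a real harness would plug in its own workload driver and backup client.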

Each loop iteration increases the number of threads doing writes (ALTER DDLs still run “sequentially”, or at least try to).
The backup task fails on the second try:

Caused by: java.lang.RuntimeException: Task id 479eba1e-8f45-4820-af70-e8c10f961931_PGSQL_TABLE_TYPE_db_8942b77c-a572-4bd6-a707-61fb0bceff12 status: Task failed during YSQL Dump phase with status YSQL_DUMP_COMMAND_FAILED. Please check YB-Controller logs on node 172.151.26.116 for more details

And in the controller logs I see:

ysql_dump: error: query failed: ERROR:  deadlock detected (query layer retry isn't possible because this is not the first command in the transaction. Consider using READ COMMITTED isolation level.)
DETAIL:  Heartbeat: Transaction b5209972-4021-42e4-afd9-4775b658d6f8 aborted due to a deadlock: <1763692810079569>56207663-3bb0-4bbe-b302-977f52e82490-><1763692813568114>b5209972-4021-42e4-afd9-4775b658d6f8->: kDeadlock [serializable]
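
For triage it can help to pull the transaction IDs out of the DETAIL line above. The following is a minimal sketch that assumes the format seen in this one log line (`<number>uuid->` repeated); the names `parse_deadlock_detail` and `marker_as_utc` are hypothetical, and reading the number in angle brackets as microseconds since the Unix epoch is an assumption, not something the log states.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import List, NamedTuple, Optional

_UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
_ABORTED_RE = re.compile(rf"Transaction ({_UUID}) aborted due to a deadlock")
_EDGE_RE = re.compile(rf"<(\d+)>({_UUID})->")


class CycleEntry(NamedTuple):
    marker: int   # number printed in angle brackets before the txn id
    txn_id: str


class DeadlockReport(NamedTuple):
    aborted_txn: str
    cycle: List[CycleEntry]


def parse_deadlock_detail(line: str) -> Optional[DeadlockReport]:
    """Extract the aborted transaction and the listed cycle members."""
    aborted = _ABORTED_RE.search(line)
    if aborted is None:
        return None
    cycle = [CycleEntry(int(m.group(1)), m.group(2))
             for m in _EDGE_RE.finditer(line)]
    return DeadlockReport(aborted.group(1), cycle)


def marker_as_utc(marker: int) -> datetime:
    # ASSUMPTION: the marker is microseconds since the Unix epoch.
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=marker)
```

Applied to the line above, the cycle contains two transactions, and the aborted one (the dump's transaction) is the second entry.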

This is a new test. I ran it on 2025.2, 2024.2, and the latest master (this one), and only master fails with this issue.

Concurrent DDL. ysql_dump runs with transaction isolation SERIALIZABLE, READ ONLY, DEFERRABLE. At this isolation level, PG waits for a point at which the transaction can run without a serialization failure, so the dump can never fail. However, YB does not seem to give this mode the same meaning, so the SELECT query in the dump can fail.

2025-11-21 02:41:11.686 UTC [50295] STATEMENT:  SELECT t.tableoid, t.oid, i.indrelid, t.relname AS indexname, t.relpages, t.reltuples, t.relallvisible,
        pg_catalog.pg_get_indexdef(i.indexrelid) AS indexdef, i.indkey, i.indisclustered,
        c.contype, c.conname, c.condeferrable, c.condeferred, c.tableoid AS contableoid, c.oid AS conoid,
        pg_catalog.pg_get_constraintdef(c.oid, false) AS condef,
        CASE WHEN i.indexprs IS NOT NULL THEN (SELECT pg_catalog.array_agg(attname ORDER BY attnum) FROM pg_catalog.pg_attribute WHERE attrelid = i.indexrelid) ELSE NULL END AS indattnames,
        (SELECT spcname FROM pg_catalog.pg_tablespace s WHERE s.oid = t.reltablespace) AS tablespace,
        t.reloptions AS indreloptions, i.indisreplident, i.indoption, inh.inhparent AS parentidx,
        i.indnkeyatts AS indnkeyatts, i.indnatts AS indnatts,
        (SELECT pg_catalog.array_agg(attnum ORDER BY attnum) FROM pg_catalog.pg_attribute WHERE attrelid = i.indexrelid AND attstattarget >= 0) AS indstatcols,
        (SELECT pg_catalog.array_agg(attstattarget ORDER BY attnum) FROM pg_catalog.pg_attribute WHERE attrelid = i.indexrelid AND attstattarget >= 0) AS indstatvals,
        i.indnullsnotdistinct
        FROM unnest('{16410,16415,16432,16437,16454,16459,16476,16481}'::pg_catalog.oid[]) AS src(tbloid)
        JOIN pg_catalog.pg_index i ON (src.tbloid = i.indrelid)
        JOIN pg_catalog.pg_class t ON (t.oid = i.indexrelid)
        JOIN pg_catalog.pg_class t2 ON (t2.oid = i.indrelid)
        LEFT JOIN pg_catalog.pg_constraint c ON (i.indrelid = c.conrelid AND i.indexrelid = c.conindid AND c.contype IN ('p','u','x'))
        LEFT JOIN pg_catalog.pg_inherits inh ON (inh.inhrelid = indexrelid)
        WHERE (i.indisvalid OR t2.relkind = 'p') AND i.indisready
        ORDER BY i.indrelid, indexname
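
The error text says a query-layer retry is not possible because the failing SELECT is not the first command in its transaction, so the only retry unit left to a caller is the whole dump. The sketch below illustrates that idea as a client-side workaround, not as a fix for the underlying behavior. `dump_with_retry` and `run_dump` are hypothetical names; `run_dump` is assumed to launch the dump and return its exit code and stderr text, and matching on the string "deadlock detected" is based only on the message quoted in this report.

```python
import time
from typing import Callable, Tuple

# Substring taken from the ysql_dump error quoted in this report.
DEADLOCK_MARKER = "deadlock detected"


def is_deadlock_failure(stderr_text: str) -> bool:
    """True if the dump's stderr looks like the deadlock abort seen here."""
    return DEADLOCK_MARKER in stderr_text


def dump_with_retry(
    run_dump: Callable[[], Tuple[int, str]],
    max_attempts: int = 3,
    backoff_seconds: float = 1.0,
) -> int:
    """Re-run the whole dump when it aborts with a deadlock.

    Returns the attempt number that succeeded. Raises RuntimeError on a
    non-deadlock failure or when all attempts end in a deadlock.
    """
    for attempt in range(1, max_attempts + 1):
        exit_code, stderr_text = run_dump()
        if exit_code == 0:
            return attempt
        if not is_deadlock_failure(stderr_text):
            raise RuntimeError(f"dump failed: {stderr_text}")
        if attempt < max_attempts:
            # Linear backoff gives the concurrent DDL time to finish.
            time.sleep(backoff_seconds * attempt)
    raise RuntimeError(f"dump still deadlocking after {max_attempts} attempts")
```

A retry like this only hides the symptom from the backup task; the dump can still deadlock on every attempt while DDL keeps running.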

All links are in the first comment of the JIRA ticket.

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.

Labels

  • area/docdb: YugabyteDB core features
  • kind/bug: This issue is a bug
  • priority/high: High Priority
  • qa_automation: Bugs identified via itest-system, LST, Stress automation or causing automation failures
  • qa_stress: Bugs identified via Stress automation
