Skip to content

Fix sync worker hanging on Extension wait event after table resynchronization#508

Open
kmsarabu wants to merge 8 commits into2ndQuadrant:REL2_x_STABLEfrom
kmsarabu:REL2_x_STABLE
Open

Fix sync worker hanging on Extension wait event after table resynchronization#508
kmsarabu wants to merge 8 commits into2ndQuadrant:REL2_x_STABLEfrom
kmsarabu:REL2_x_STABLE

Conversation

@kmsarabu
Copy link

@kmsarabu kmsarabu commented Sep 18, 2025

Problem

After performing pglogical.alter_subscription_resynchronize_table operations, sync workers get stuck waiting on "Extension" wait events indefinitely. The table status shows as catchup(u) but the sync worker remains idle until a dummy transaction is performed on the source database.

Symptoms

  • Sync worker shows "Extension" wait event in pg_stat_activity
  • Table status remains in catchup state despite successful data synchronization
  • Worker only progresses when new WAL activity occurs on source database (e.g., dummy/heartbeat transactions)
  • No error messages, just indefinite waiting

Cause

The issue appears due to a race condition in the sync completion logic:

  1. When a sync worker completes its initial data copy and moves to SYNC_STATUS_SYNCWAIT ('w'), the apply worker in process_syncing_tables() sets the status to SYNC_STATUS_CATCHUP ('u') and updates the replay_stop_lsn().
  2. The sync worker then enters apply_work() to catch up to the target LSN. However, the transition from SYNC_STATUS_CATCHUP ('u') to SYNC_STATUS_SYNCDONE ('y') only happens in the apply worker's handle_commit() function when:
    • The sync worker reaches its replay_stop_lsn
    • AND a commit message is processed
  3. When the publisher database is idle: If the sync worker has already caught up to the target LSN but there are no new transactions (commits) coming from the source database, the status remains in 'u' and never transitions 'y'.

The problem occurs because the sync worker relies on receiving WAL messages to trigger completion checks, but if the source database has no activity, no messages are sent, leaving the worker stuck.

Changes Made

  1. Added completion check in apply_work(): sync workers check if their current LSN has reached or exceeded the target replay_stop_lsn

  2. Clean exit path: When completion is detected, worker updates table status to SYNC_STATUS_SYNCDONE and exits cleanly

@nmisch nmisch self-requested a review September 25, 2025 02:13
Copy link
Collaborator

@nmisch nmisch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the report and patch. Would you add a test case that shows the
hang?

Why 5 seconds as the check interval?

I wondered if this would fix hang #497 in sql/add_table.sql, but it appears
not to. (That's fine. If it had, though, I would have considered add_table
to be the test coverage for $SUBJECT.)

@kmsarabu
Copy link
Author

kmsarabu commented Sep 26, 2025

Thanks @nmisch

Why 5 seconds as the check interval?

  • The 5-second interval balances responsiveness with performance overhead.
  1. Issue reproduced in regression test sql/add_table.sql; this fix also resolves bug Regression test add_table failing in PG17 RHEL9 x86_64 machines #497.
  2. Modified regression script sql/add_table.sql to include sync_status info.

Regression test sql/add_table.sql without this fix:

not ok 11    - add_table                              109352 ms
# 1 of 29 tests failed.
# The differences that caused some tests to fail can be viewed in the file "/home/postgres/work/pglogical/regression_output/regression.diffs".

$ cat /home/postgres/work/pglogical/regression_output/regression.diffs
diff -U3 /home/postgres/work/pglogical/expected/add_table.out /home/postgres/work/pglogical/regression_output/results/add_table.out
--- /home/postgres/work/pglogical/expected/add_table.out        2025-09-26 08:22:51.343316055 +0000
+++ /home/postgres/work/pglogical/regression_output/results/add_table.out       2025-09-26 08:26:02.544659300 +0000
@@ -257,17 +257,13 @@
 BEGIN;
 SET statement_timeout = '60s';
 SELECT pglogical.wait_for_table_sync_complete('test_subscription', 'test_publicschema');
- wait_for_table_sync_complete
-------------------------------
-
-(1 row)
-
+ERROR:  canceling statement due to statement timeout
 COMMIT;
 SELECT sync_kind, sync_subid, sync_nspname, sync_relname, sync_status, sync_status IN ('y', 'r') FROM pglogical.local_sync_status ORDER BY 2,3,4;
  sync_kind | sync_subid |   sync_nspname    |    sync_relname    | sync_status | ?column?
 -----------+------------+-------------------+--------------------+-------------+----------
  d         | 3848008564 | public            | test_nosync        | r           | t
- d         | 3848008564 | public            | test_publicschema  | r           | t
+ d         | 3848008564 | public            | test_publicschema  | u           | f
  d         | 3848008564 | strange.schema-IS | test_strangeschema | r           | t
  f         | 3848008564 |                   |                    | r           | t
 (4 rows)
@@ -286,8 +282,8 @@
 -[ RECORD 1 ]---------------
 nspname  | public
 relname  | test_publicschema
-status   | replicating
-?column? | t
+status   | catchup
+?column? | f

 \x
 BEGIN;

Regression test sql/add_table.sql with this fix:

# using temp instance on port 61705 with PID 1132873
ok 1         - preseed                                    46 ms
ok 2         - infofuncs                                  43 ms
ok 3         - init_fail                                  62 ms
ok 4         - init                                      255 ms
ok 5         - preseed_check                              15 ms
ok 6         - basic                                    8074 ms
ok 7         - extended                                44429 ms
ok 8         - conflict_secondary_unique                  21 ms
ok 9         - toasted                                  3030 ms
ok 10        - replication_set                          1028 ms
ok 11        - add_table                               37830 ms
ok 12        - matview                                  4040 ms
ok 13        - bidirectional                            4256 ms
ok 14        - primary_key                             24270 ms
ok 15        - interfaces                               1224 ms
ok 16        - foreign_key                              2022 ms
ok 17        - functions                               11211 ms
ok 18        - copy                                     1021 ms
ok 19        - sequence                                   14 ms
ok 20        - triggers                                 5056 ms
ok 21        - parallel                                 3251 ms
ok 22        - row_filter                              14152 ms
ok 23        - row_filter_sampling                      2021 ms
ok 24        - att_list                                 8093 ms
ok 25        - column_filter                            3040 ms
ok 26        - apply_delay                              7253 ms
ok 27        - multiple_upstreams                       3266 ms
ok 28        - node_origin_cascade                      3273 ms
ok 29        - drop                                     1040 ms
1..29
# All 29 tests passed.

@kmsarabu kmsarabu requested a review from nmisch September 26, 2025 09:06
@nmisch
Copy link
Collaborator

nmisch commented Sep 27, 2025

At 9b36db7 (first version of pull request) add_table still timed out for me,
but at bb85291 and 382d9cb it passes. It's great to have a fix for that.

I see you've made some edits today, removing the 5s interval. I was okay
having the interval, and I may be okay having no delay interval. (My initial
reaction was that 5s was on the high side, but I haven't formed an opinion on
no-delay as an alternative.) How are you thinking about that decision now?

The patch looks good conceptually, so I'll just need to complete a detailed
review. Is it ready for that, or do you have more changes planned? If you
need to edit it again, please remove the trailing whitespace.
git diff --check REL2_x_STABLE can confirm successful removal. Another cosmetic
point: a comment still says "Periodic completion check", but the 5s period is
gone. If you don't otherwise need to edit the pull request again, I can
handle those two cosmetic points.

Your test changes have the test require sync_status='r' in places where it
used to accept 'y' or 'r'. I wondered if that would make tests flaky, so I
tried to break that by adding sleeps just before the code sets
sync_status='r'. That didn't stop the tests from getting 'r', so I'm not
going to worry about that further. If the tests prove flaky, we can fix that
later. Here's how I checked with sleeps:

diff --git a/pglogical_apply.c b/pglogical_apply.c
index 81a4de4..1588520 100644
--- a/pglogical_apply.c
+++ b/pglogical_apply.c
@@ -1829,2 +1829,3 @@ process_syncing_tables(XLogRecPtr end_lsn)
 			{
+				nmisch_sleep(10, "table SYNC_STATUS_READY normal");
 				sync->status = SYNC_STATUS_READY;
diff --git a/pglogical_sync.c b/pglogical_sync.c
index c7ee4d4..520cd70 100644
--- a/pglogical_sync.c
+++ b/pglogical_sync.c
@@ -42,2 +42,4 @@
 
+#include "libpq/pqsignal.h"
+
 #include "utils/memutils.h"
@@ -96,2 +98,16 @@ static int exec_cmd_win32(const char *cmd, char *cmdargv[]);
 
+void
+nmisch_sleep(int secs, const char *msg)
+{
+	elog(LOG, "nmisch sleeping before %s", msg);
+#ifdef PG_SETMASK
+	PG_SETMASK(&BlockSig);
+	pg_usleep(secs * 1000000);
+	PG_SETMASK(&UnBlockSig);
+#else
+	sigprocmask(SIG_SETMASK, &BlockSig, NULL);
+	pg_usleep(secs * 1000000);
+	sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+#endif
+}
 
@@ -991,2 +1007,4 @@ pglogical_sync_subscription(PGLogicalSubscription *sub)
 						{
+							nmisch_sleep(10,
+										 "table SYNC_STATUS_READY oldsync");
 							set_table_sync_status(sub->id, remoterel->nspname,
@@ -1000,2 +1018,4 @@ pglogical_sync_subscription(PGLogicalSubscription *sub)
 
+							nmisch_sleep(10, "SYNC_KIND_FULL SYNC_STATUS_READY create");
+
 							newsync.kind = SYNC_KIND_FULL;
@@ -1044,2 +1064,3 @@ pglogical_sync_subscription(PGLogicalSubscription *sub)
 		/* Nothing to do here yet. */
+		nmisch_sleep(10, "SYNC_KIND_FULL SYNC_STATUS_READY update");
 		status = SYNC_STATUS_READY;
diff --git a/pglogical_sync.h b/pglogical_sync.h
index 51e9cd9..aa7668e 100644
--- a/pglogical_sync.h
+++ b/pglogical_sync.h
@@ -52,2 +52,4 @@ typedef struct PGLogicalSyncStatus
 
+extern void nmisch_sleep(int secs, const char *msg);
+
 extern void pglogical_sync_worker_finish(void);

@kmsarabu
Copy link
Author

Thanks @nmisch for reviewing this PR.

  1. The interval check was originally added as a safety measure, but in practice it proved unnecessary. The sync worker only entered the stuck state after resynchronization had already completed, so an additional periodic check was redundant. Removing it simplifies the loop without changing behavior or safety.

  2. Thanks again and appreciate the review! The patch is ready for the detailed review. I’ve fixed the PR and removed/updated the trailing whitespace and comments.

@nmisch
Copy link
Collaborator

nmisch commented Nov 18, 2025

Thanks. This is still in my queue. My schedule is unusually packed right now,
so I likely won't be able to do a proper review for another few months,
unfortunately. I won't forget it.

@leson238
Copy link

leson238 commented Feb 2, 2026

Hello @nmisch, do you happen to have some time to review the changes here? Thank you so much!

@nmisch
Copy link
Collaborator

nmisch commented Feb 3, 2026

Nothing has changed since my 2025-11-18 comment. I remain on schedule to review it by 2026-02-18.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants