Fix sync worker hanging on Extension wait event after table resynchronization #508
kmsarabu wants to merge 8 commits into 2ndQuadrant:REL2_x_STABLE
Conversation
…S_CATCHUP ('u') state
nmisch
left a comment
Thanks for the report and patch. Would you add a test case that shows the
hang?
Why 5 seconds as the check interval?
I wondered if this would fix hang #497 in sql/add_table.sql, but it appears
not to. (That's fine. If it had, though, I would have considered add_table
to be the test coverage for $SUBJECT.)
|
Thanks @nmisch Why 5 seconds as the check interval?
Regression test
|
|
At 9b36db7 (first version of pull request) add_table still timed out for me,
I see you've made some edits today, removing the 5s interval. I was okay
The patch looks good conceptually, so I'll just need to complete a detailed
Your test changes have the test require sync_status='r' in places where it
|
Thanks @nmisch for reviewing this PR.
|
|
Thanks. This is still in my queue. My schedule is unusually packed right now,
|
Hello @nmisch, do you happen to have some time to review the changes here? Thank you so much!
|
Nothing has changed since my 2025-11-18 comment. I remain on schedule to review it by 2026-02-18.
Problem
After performing pglogical.alter_subscription_resynchronize_table operations, sync workers get stuck waiting on "Extension" wait events indefinitely. The table status shows as catchup (u), but the sync worker remains idle until a dummy transaction is performed on the source database.
Symptoms
Sync workers appear in pg_stat_activity waiting on the "Extension" wait event
Tables remain in catchup state despite successful data synchronization
Cause
The issue appears due to a race condition in the sync completion logic:
The sync worker enters apply_work() to catch up to the target LSN. However, the transition from SYNC_STATUS_CATCHUP ('u') to SYNC_STATUS_SYNCDONE ('y') only happens in the apply worker's handle_commit() function when:
The problem occurs because the sync worker relies on receiving WAL messages to trigger completion checks, but if the source database has no activity, no messages are sent, leaving the worker stuck.
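To make the stall concrete, here is a minimal standalone model of the cause described above. It is a sketch, not pglogical code: `sync_worker_model`, `model_handle_commit` and `model_run` are hypothetical names, `lsn_t` stands in for PostgreSQL's `XLogRecPtr`, and the commit-time condition (`end_lsn >= replay_stop_lsn`) is an assumption.

```c
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t lsn_t;

/* Status codes quoted in the description above. */
#define SYNC_STATUS_CATCHUP  'u'
#define SYNC_STATUS_SYNCDONE 'y'

/* Hypothetical model of a sync worker; not pglogical's real struct. */
typedef struct {
    lsn_t replay_stop_lsn;  /* target LSN the worker must reach */
    lsn_t current_lsn;      /* last position the worker has applied */
    char  status;           /* SYNC_STATUS_CATCHUP or SYNC_STATUS_SYNCDONE */
} sync_worker_model;

/*
 * Models the commit path: in this sketch it is the only place where the
 * status can advance (assumed condition: end_lsn >= replay_stop_lsn).
 */
static void model_handle_commit(sync_worker_model *w, lsn_t end_lsn)
{
    w->current_lsn = end_lsn;
    if (end_lsn >= w->replay_stop_lsn)
        w->status = SYNC_STATUS_SYNCDONE;
}

/*
 * Replays n commit end positions and returns the resulting status.
 * With n == 0 (an idle source database) the check never runs, so the
 * status cannot change even if current_lsn already equals the target.
 */
static char model_run(sync_worker_model *w, const lsn_t *commits, int n)
{
    for (int i = 0; i < n; i++)
        model_handle_commit(w, commits[i]);
    return w->status;
}
```

In this model, a worker whose `current_lsn` already equals `replay_stop_lsn` stays in `'u'` until at least one commit arrives, which matches the observation that a dummy transaction on the source unblocks the worker.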
Changes Made
Added completion check in apply_work(): sync workers check if their current LSN has reached or exceeded the target replay_stop_lsn
Clean exit path: When completion is detected, the worker updates the table status to SYNC_STATUS_SYNCDONE and exits cleanly
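The two changes can be illustrated with a small standalone sketch of the check. This is not the patch itself: `catchup_reached` and `check_sync_completion` are hypothetical helpers, `lsn_t` stands in for `XLogRecPtr`, and treating a zero target as "no target set" is an assumption modeled on PostgreSQL's `InvalidXLogRecPtr`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t lsn_t;

#define SYNC_STATUS_CATCHUP  'u'
#define SYNC_STATUS_SYNCDONE 'y'

/*
 * True once the worker's position has reached or passed the target.
 * A zero target is treated as "no target set" (assumption) and never
 * counts as reached.
 */
static bool catchup_reached(lsn_t current_lsn, lsn_t replay_stop_lsn)
{
    return replay_stop_lsn != 0 && current_lsn >= replay_stop_lsn;
}

/*
 * Hypothetical per-iteration check for the worker loop. Returns the
 * status the table should have after this iteration and sets
 * *should_exit when the worker can stop, independent of whether any
 * message arrived from the source.
 */
static char check_sync_completion(char status, lsn_t current_lsn,
                                  lsn_t replay_stop_lsn, bool *should_exit)
{
    *should_exit = false;
    if (status == SYNC_STATUS_CATCHUP &&
        catchup_reached(current_lsn, replay_stop_lsn))
    {
        *should_exit = true;
        return SYNC_STATUS_SYNCDONE;
    }
    return status;
}
```

Because the check runs on each loop iteration instead of only on commit, a worker that is already at the target LSN would move to `'y'` and exit even when the source database is idle.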