Fix segfault when recovering 2pc transaction #1078

Smyatkin-Maxim · 2025-04-29T07:42:00Z

When promoting a mirror segment due to failover we have seen a stacktrace like this:

"FATAL","58P01","requested WAL segment pg_xlog/00000023000071D50000001F has already been removed",,,,,,,0,,"xlogutils.c",580,"Stack trace:
1    0x557bdb9f09b6 postgres errstart + 0x236
2    0x557bdb5fc6cf postgres <symbol not found> + 0xdb5fc6cf
3    0x557bdb5fd021 postgres read_local_xlog_page + 0x191
4    0x557bdb5fb922 postgres <symbol not found> + 0xdb5fb922
5    0x557bdb5fba11 postgres XLogReadRecord + 0xa1
6    0x557bdb5e7767 postgres RecoverPreparedTransactions + 0xd7
7    0x557bdb5f608b postgres StartupXLOG + 0x2a3b
8    0x557bdb870a89 postgres StartupProcessMain + 0x139
9    0x557bdb62f489 postgres AuxiliaryProcessMain + 0x549
10   0x557bdb86d275 postgres <symbol not found> + 0xdb86d275
11   0x557bdb8704e3 postgres PostmasterMain + 0x1213
12   0x557bdb56a1f7 postgres main + 0x497
13   0x7fded4a61c87 libc.so.6 __libc_start_main + 0xe7

Note: this stack trace is from one of our production GP clusters and might differ slightly from what we would see in Cloudberry, but the failure is present here as well. The test case proves it.

PG13 and PG14 have a fix for this bug, but it doesn't have a test case, and it looks like we didn't cherry-pick that far. The discussion can be found here: https://www.postgresql.org/message-id/flat/743b9b45a2d4013bd90b6a5cba8d6faeb717ee34.camel%40cybertec.at

In short, StartupXLOG() renames the last WAL segment to .partial, but RecoverPreparedTransactions() later tries to read it by the old name.
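As a rough illustration of that ordering problem, here is a toy Python sketch. It is not the PostgreSQL or Cloudberry code: the helper names (`read_segment`, `rename_to_partial`, `startup`, `demo`), the segment name, and the file contents are all hypothetical and only model "rename, then read by the old name" versus "read, then rename".

```python
import os
import tempfile


def read_segment(wal_dir, segment_name):
    """Read a segment by its original name, as a late reader would."""
    with open(os.path.join(wal_dir, segment_name), "rb") as f:
        return f.read()


def rename_to_partial(wal_dir, segment_name):
    """Rename the segment to <name>.partial, leaving no file under the old name."""
    src = os.path.join(wal_dir, segment_name)
    os.rename(src, src + ".partial")


def startup(wal_dir, segment_name, read_before_rename):
    """Return the bytes read, or None if the old name no longer exists."""
    try:
        if read_before_rename:
            data = read_segment(wal_dir, segment_name)
            rename_to_partial(wal_dir, segment_name)
        else:
            # Problematic order: the file is gone by the time we read it.
            rename_to_partial(wal_dir, segment_name)
            data = read_segment(wal_dir, segment_name)
        return data
    except FileNotFoundError:
        return None


def demo(read_before_rename):
    """Create a throwaway directory with one fake segment and run startup()."""
    with tempfile.TemporaryDirectory() as wal_dir:
        name = "000000010000000000000001"  # hypothetical segment name
        with open(os.path.join(wal_dir, name), "wb") as f:
            f.write(b"prepared-xact-record")
        return startup(wal_dir, name, read_before_rename)


if __name__ == "__main__":
    print("rename then read:", demo(read_before_rename=False))
    print("read then rename:", demo(read_before_rename=True))
```

In this toy model, the read succeeds only when it happens before the rename. In the real server the failed read is reported as the "requested WAL segment ... has already been removed" error shown in the stack trace above.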

The fix is mostly borrowed from PG14 postgres/postgres@f663b00, with some Cloudberry-related exceptions. I also added a regression test that segfaults without this fix on any version of GP, PG<=12, or Cloudberry.


github-actions

Hi, @Smyatkin-Maxim welcome!🎊 Thanks for taking the effort to make our project better! 🙌 Keep making such awesome contributions!

Smyatkin-Maxim · 2025-04-29T07:44:53Z

Hi team,
As this PR somewhat reorders the xlog startup process, I need reviewers to carefully double-check that I didn't break anything here. In particular, I don't know how the Startup hook is used; perhaps it cannot be moved up this way.
The issue itself shows up quite often in our GP installations, and from what I understand it is not fixed in Cloudberry yet either.

Smyatkin-Maxim · 2025-04-29T09:39:15Z

I see that my test is failing. It didn't fail locally the last time I tried, so I will take a look.
I added ordering for test stability, but I still need to figure out why it crashes; it doesn't crash on my local build.

yjhjstz · 2025-04-29T13:45:39Z

Nice catch! @Smyatkin-Maxim it's a PAX-related flaky test, so I will retry CI.

Smyatkin-Maxim · 2025-04-30T09:06:54Z

On the first CI run there was a crash and core dump in my test case. I'm going to rerun the IC2 tests a couple of times; it feels like there is yet another rarely reproducible bug that my test case (or my fix?) uncovers.

UPD: after a couple of reruns of the isolation2 suite I still haven't reproduced the core dump I saw on the first run.

yjhjstz · 2025-04-30T18:47:29Z

@jiaqizho @gongxun0928

              1 | lock_tbl1a
              1 | lock_view2
              1 | lock_view3
+             1 | pg_pax_blocks_43356
              2 | lock_tbl1
              2 | lock_tbl1a
              2 | lock_view2

This is PAX-related.

jiaqizho · 2025-05-21T02:48:21Z

Please retry. I'm not sure which test case causes this lock.

gfphoenix78 · 2025-06-05T01:34:01Z

The test should be fixed in the latest main branch.

gfphoenix78 · 2025-06-06T01:45:38Z

Thanks for your PR, @Smyatkin-Maxim. Would you mind cherry-picking the commit that fixes this issue in PG14? That way, the context and commit message are kept.

github-actions bot reviewed Apr 29, 2025

my-ship-it requested a review from yjhjstz April 30, 2025 02:03

yjhjstz approved these changes May 6, 2025

Smyatkin-Maxim force-pushed the smyatkin/2pc_startup_segfault branch from 7f15ba5 to 6823eb8 Compare May 14, 2025 08:11

Smyatkin-Maxim force-pushed the smyatkin/2pc_startup_segfault branch from 6823eb8 to f3b103d Compare May 22, 2025 08:16

gfphoenix78 approved these changes Jun 6, 2025

Smyatkin-Maxim added 2 commits June 11, 2025 11:46

Add stable ordering to testcase

98f1b1d

my-ship-it force-pushed the smyatkin/2pc_startup_segfault branch from f3b103d to 98f1b1d Compare June 11, 2025 03:46

Merge branch 'main' into smyatkin/2pc_startup_segfault

db77181

yjhjstz merged commit 3bae20b into apache:main Jun 17, 2025
26 checks passed

yjhjstz mentioned this pull request Jun 18, 2025

Cherry pick hot standby commits from gpdb #1152

Open

12 tasks
