Skip to content

Fix segfault when recovering 2pc transaction #1078

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

Smyatkin-Maxim
Copy link
Contributor

When promoting a mirror segment due to failover we have seen a stacktrace like this:

"FATAL","58P01","requested WAL segment pg_xlog/00000023000071D50000001F has already been removed",,,,,,,0,,"xlogutils.c",580,"Stack
trace:
1    0x557bdb9f09b6 postgres errstart + 0x236
2    0x557bdb5fc6cf postgres <symbol not found> + 0xdb5fc6cf
3    0x557bdb5fd021 postgres read_local_xlog_page + 0x191
4    0x557bdb5fb922 postgres <symbol not found> + 0xdb5fb922
5    0x557bdb5fba11 postgres XLogReadRecord + 0xa1
6    0x557bdb5e7767 postgres RecoverPreparedTransactions + 0xd7
7    0x557bdb5f608b postgres StartupXLOG + 0x2a3b
8    0x557bdb870a89 postgres StartupProcessMain + 0x139
9    0x557bdb62f489 postgres AuxiliaryProcessMain + 0x549
10   0x557bdb86d275 postgres <symbol not found> + 0xdb86d275
11   0x557bdb8704e3 postgres PostmasterMain + 0x1213
12   0x557bdb56a1f7 postgres main + 0x497
13   0x7fded4a61c87 libc.so.6 __libc_start_main + 0xe7

Note: stacktrace is from one of our production GP clusters and might be slightly different from what we will see in cloudberry, but the failure is still present here as well. Testcase proves it.

PG13 and PG14 have a fix for this bug, but it's doesn't have any test case and looks like we didn't cherry-pick that far. The discussion can be found here: https://www.postgresql.org/message-id/flat/743b9b45a2d4013bd90b6a5cba8d6faeb717ee34.camel%40cybertec.at

In a few words, StartupXLOG() renames the last wal segment to .partial but tries to read it by the old name later in RecoverPreparedTransactions().

The fix is mostly borrowed from PG14 postgres/postgres@f663b00 with some cloudberry-related exceptions. Also added a regression test which segfaults without this fix for any version of GP, PG<=12 or Cloudberry.

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @Smyatkin-Maxim welcome!🎊 Thanks for taking the effort to make our project better! 🙌 Keep making such awesome contributions!

@Smyatkin-Maxim
Copy link
Contributor Author

Smyatkin-Maxim commented Apr 29, 2025

Hi team,
as this PR somewhat reorders the process of xlog startup, I need reviewers to carefully double-check I didn't mess something up here. Especially I have no knowledge how Startup hook is used, perhaps it cannot be moved up this way.
The issue itself shows up quite often in our GP installations and from what I understand it's not fixed yet in cloudberry as well.

@Smyatkin-Maxim
Copy link
Contributor Author

Smyatkin-Maxim commented Apr 29, 2025

I see that my test is failing. Didn't fail locally the last time I tried. Will take a look
Added ordering for test stability, but still need to figure out why it crashes - it doesn't on my local build.

@yjhjstz
Copy link
Member

yjhjstz commented Apr 29, 2025

Nice catch! @Smyatkin-Maxim it's pax related flake test. So I will retry ci.

@my-ship-it my-ship-it requested a review from yjhjstz April 30, 2025 02:03
@Smyatkin-Maxim
Copy link
Contributor Author

Smyatkin-Maxim commented Apr 30, 2025

Nice catch! @Smyatkin-Maxim it's pax related flake test. So I will retry ci.

on the first CI run there was a crash+coredump on my testcase. I'm gonna be rerunning IC2 tests a couple times, feels like there is yet another rarely reproducible bug, which my testcase (or my fix?) uncovers.

UPD: after a couple reruns of isolation2 suite I still didn't get the core dump I've seen on the first run.

@yjhjstz
Copy link
Member

yjhjstz commented Apr 30, 2025

@jiaqizho @gongxun0928

1 | lock_tbl1a
              1 | lock_view2
              1 | lock_view3
+             1 | pg_pax_blocks_4[33](https://github.com/apache/cloudberry/actions/runs/14729599874/job/41419204900?pr=1078#step:18:34)56
              2 | lock_tbl1
              2 | lock_tbl1a
              2 | lock_view2

this pax related.

@Smyatkin-Maxim Smyatkin-Maxim force-pushed the smyatkin/2pc_startup_segfault branch from 7f15ba5 to 6823eb8 Compare May 14, 2025 08:11
@jiaqizho
Copy link
Contributor

@jiaqizho @gongxun0928

1 | lock_tbl1a
              1 | lock_view2
              1 | lock_view3
+             1 | pg_pax_blocks_4[33](https://github.com/apache/cloudberry/actions/runs/14729599874/job/41419204900?pr=1078#step:18:34)56
              2 | lock_tbl1
              2 | lock_tbl1a
              2 | lock_view2

this pax related.

retry pls. not sure which case will cause this lock.

When promoting a mirror segment due to failover we have seen a stacktrace like this:

```
"FATAL","58P01","requested WAL segment pg_xlog/00000023000071D50000001F has already been removed",,,,,,,0,,"xlogutils.c",580,"Stack
trace:
1    0x557bdb9f09b6 postgres errstart + 0x236
2    0x557bdb5fc6cf postgres <symbol not found> + 0xdb5fc6cf
3    0x557bdb5fd021 postgres read_local_xlog_page + 0x191
4    0x557bdb5fb922 postgres <symbol not found> + 0xdb5fb922
5    0x557bdb5fba11 postgres XLogReadRecord + 0xa1
6    0x557bdb5e7767 postgres RecoverPreparedTransactions + 0xd7
7    0x557bdb5f608b postgres StartupXLOG + 0x2a3b
8    0x557bdb870a89 postgres StartupProcessMain + 0x139
9    0x557bdb62f489 postgres AuxiliaryProcessMain + 0x549
10   0x557bdb86d275 postgres <symbol not found> + 0xdb86d275
11   0x557bdb8704e3 postgres PostmasterMain + 0x1213
12   0x557bdb56a1f7 postgres main + 0x497
13   0x7fded4a61c87 libc.so.6 __libc_start_main + 0xe7
```
Note: stacktrace is from one of our production GP clusters and might be slightly different from what we
will see in cloudberry, but the failure is still present here as well.
Testcase proves it.

PG13 and PG14 have a fix for this bug, but it's doesn't have any test
case and looks like we didn't cherry-pick that far. The discussion can be found here:
https://www.postgresql.org/message-id/flat/743b9b45a2d4013bd90b6a5cba8d6faeb717ee34.camel%40cybertec.at

In a few words, StartupXLOG() renames the last wal segment to .partial but tries to read it by the old name later in
RecoverPreparedTransactions().

The fix is mostly borrowed from PG14 postgres/postgres@f663b00 with some cloudberry-related exceptions.
Also added a regression test which segfaults without this fix for any
version of GP, PG<=12 or Cloudberry.
@Smyatkin-Maxim Smyatkin-Maxim force-pushed the smyatkin/2pc_startup_segfault branch from 6823eb8 to f3b103d Compare May 22, 2025 08:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants