-
Notifications
You must be signed in to change notification settings - Fork 159
Fix segfault when recovering 2pc transaction #1078
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix segfault when recovering 2pc transaction #1078
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @Smyatkin-Maxim welcome!🎊 Thanks for taking the effort to make our project better! 🙌 Keep making such awesome contributions!
Hi team, |
I see that my test is failing. Didn't fail locally the last time I tried. Will take a look |
Nice catch! @Smyatkin-Maxim it's pax related flake test. So I will retry ci. |
on the first CI run there was a crash+coredump on my testcase. I'm gonna be rerunning IC2 tests a couple times, feels like there is yet another rarely reproducible bug, which my testcase (or my fix?) uncovers. UPD: after a couple reruns of isolation2 suite I still didn't get the core dump I've seen on the first run. |
1 | lock_tbl1a
1 | lock_view2
1 | lock_view3
+ 1 | pg_pax_blocks_4[33](https://github.com/apache/cloudberry/actions/runs/14729599874/job/41419204900?pr=1078#step:18:34)56
2 | lock_tbl1
2 | lock_tbl1a
2 | lock_view2 this pax related. |
7f15ba5
to
6823eb8
Compare
retry pls. not sure which case will cause this lock. |
When promoting a mirror segment due to failover we have seen a stacktrace like this: ``` "FATAL","58P01","requested WAL segment pg_xlog/00000023000071D50000001F has already been removed",,,,,,,0,,"xlogutils.c",580,"Stack trace: 1 0x557bdb9f09b6 postgres errstart + 0x236 2 0x557bdb5fc6cf postgres <symbol not found> + 0xdb5fc6cf 3 0x557bdb5fd021 postgres read_local_xlog_page + 0x191 4 0x557bdb5fb922 postgres <symbol not found> + 0xdb5fb922 5 0x557bdb5fba11 postgres XLogReadRecord + 0xa1 6 0x557bdb5e7767 postgres RecoverPreparedTransactions + 0xd7 7 0x557bdb5f608b postgres StartupXLOG + 0x2a3b 8 0x557bdb870a89 postgres StartupProcessMain + 0x139 9 0x557bdb62f489 postgres AuxiliaryProcessMain + 0x549 10 0x557bdb86d275 postgres <symbol not found> + 0xdb86d275 11 0x557bdb8704e3 postgres PostmasterMain + 0x1213 12 0x557bdb56a1f7 postgres main + 0x497 13 0x7fded4a61c87 libc.so.6 __libc_start_main + 0xe7 ``` Note: stacktrace is from one of our production GP clusters and might be slightly different from what we will see in cloudberry, but the failure is still present here as well. Testcase proves it. PG13 and PG14 have a fix for this bug, but it's doesn't have any test case and looks like we didn't cherry-pick that far. The discussion can be found here: https://www.postgresql.org/message-id/flat/743b9b45a2d4013bd90b6a5cba8d6faeb717ee34.camel%40cybertec.at In a few words, StartupXLOG() renames the last wal segment to .partial but tries to read it by the old name later in RecoverPreparedTransactions(). The fix is mostly borrowed from PG14 postgres/postgres@f663b00 with some cloudberry-related exceptions. Also added a regression test which segfaults without this fix for any version of GP, PG<=12 or Cloudberry.
6823eb8
to
f3b103d
Compare
When promoting a mirror segment due to failover we have seen a stacktrace like this:
Note: stacktrace is from one of our production GP clusters and might be slightly different from what we will see in cloudberry, but the failure is still present here as well. Testcase proves it.
PG13 and PG14 have a fix for this bug, but it's doesn't have any test case and looks like we didn't cherry-pick that far. The discussion can be found here: https://www.postgresql.org/message-id/flat/743b9b45a2d4013bd90b6a5cba8d6faeb717ee34.camel%40cybertec.at
In a few words, StartupXLOG() renames the last wal segment to .partial but tries to read it by the old name later in RecoverPreparedTransactions().
The fix is mostly borrowed from PG14 postgres/postgres@f663b00 with some cloudberry-related exceptions. Also added a regression test which segfaults without this fix for any version of GP, PG<=12 or Cloudberry.
Fixes #ISSUE_Number
What does this PR do?
Type of Change
Breaking Changes
Test Plan
make installcheck
make -C src/test installcheck-cbdb-parallel
Impact
Performance:
User-facing changes:
Dependencies:
Checklist
Additional Context
CI Skip Instructions