Distributed Table-Based Lock to support MySQL Database Cluster #9955
base: main
Clean up the old MAC address handling code to use JDK 11 library calls instead of hacks. Some basic string parsing had also been written by hand; it is replaced with Long.parseLong(str, 16) to convert a hex string to a long. Signed-off-by: Rohit Yadav <[email protected]>
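As a rough illustration of the hex-string-to-long conversion this commit refers to, here is a minimal sketch using the JDK's `Long.parseLong(String, int)`. The class and method names (`MacHexSketch`, `macToLong`, `longToMac`) are hypothetical and are not taken from the CloudStack code base; the separator handling is an assumption about typical MAC address formats.

```java
/** Hypothetical helper, not CloudStack code: converts MAC strings to and from long. */
final class MacHexSketch {

    /** Parses a MAC address such as "1e:00:a2:00:00:01" into a long. */
    static long macToLong(String mac) {
        // Assumption: separators are ':' or '-'. Strip them, then parse base 16.
        String hex = mac.replace(":", "").replace("-", "");
        return Long.parseLong(hex, 16);
    }

    /** Formats the low 48 bits of a long as colon-separated lowercase hex. */
    static String longToMac(long value) {
        String hex = String.format("%012x", value & 0xFFFFFFFFFFFFL);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 12; i += 2) {
            if (i > 0) {
                sb.append(':');
            }
            sb.append(hex, i, i + 2);
        }
        return sb.toString();
    }
}
```

A 48-bit MAC address always fits in a signed 64-bit long, so `Long.parseLong` does not overflow for well-formed input; malformed input raises `NumberFormatException`.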
This introduces a MySQL InnoDB table-based distributed lock that can be used by one or more management servers and their threads. It removes the use of the MySQL server-provided locking functions (GET_LOCK, RELEASE_LOCK), which are not replicated or currently supported by any MySQL clustering solution. This is the first main step toward having CloudStack work with a MySQL clustering solution such as InnoDB Cluster, Percona XtraDB Cluster, or MariaDB Galera Cluster. Other changes may be required, which can be identified in due course if this feature works at scale. Signed-off-by: Rohit Yadav <[email protected]>
@blueorangutan package
@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #9955 +/- ##
============================================
+ Coverage 16.15% 16.17% +0.01%
- Complexity 13274 13293 +19
============================================
Files 5666 5668 +2
Lines 498078 498228 +150
Branches 60267 60288 +21
============================================
+ Hits 80481 80602 +121
- Misses 408584 408609 +25
- Partials 9013 9017 +4
Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 11588
@shwstppr can you help look at this? 🙏
Signed-off-by: Abhishek Kumar <[email protected]>
Signed-off-by: Abhishek Kumar <[email protected]>
[SF] Trillian test result (tid-12581)
Requested more reviews on this. @blueorangutan package
@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✖️ debian ✔️ suse15. SL-JID 12684
@blueorangutan package
@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12687
@blueorangutan test matrix
@rohityadavcloud a [SL] Trillian-Jenkins matrix job (EL8 mgmt + EL8 KVM, Ubuntu22 mgmt + Ubuntu22 KVM, EL8 mgmt + VMware 7.0u3, EL9 mgmt + XCP-ng 8.2) has been kicked to run smoke tests
[SF] Trillian Build Failed (tid-12607)
[SF] Trillian test result (tid-12582)
[SF] Trillian test result (tid-12604)
[SF] Trillian test result (tid-12605)
[SF] Trillian test result (tid-12606)
Code-wise I don't see any problems. I don't have a setup to test this on. For my understanding: this would also work for MariaDB Galera clusters, right?
clgtm
At least two types of environments need testing.
The KVM envs work reasonably well, but the VMware and Xen envs have excessive errors. Probably environmental?
@blueorangutan test matrix
@DaanHoogland a [SL] Trillian-Jenkins matrix job (EL8 mgmt + EL8 KVM, Ubuntu22 mgmt + Ubuntu22 KVM, EL8 mgmt + VMware 7.0u3, EL9 mgmt + XCP-ng 8.2) has been kicked to run smoke tests
[SF] Trillian test result (tid-12646)
[SF] Trillian test result (tid-12648)
[SF] Trillian test result (tid-12647)
[SF] Trillian test result (tid-12649)
Update: we found that the initial DB setup has issues, but deploy-db plus an import from an SQL dump into a PXC cluster worked. However, DB upgrades may fail because MySQL routines aren't available or supported (to be investigated). Otherwise, I've been running an active-active 1:1 3-node Percona-ACS mgmt server setup for a few weeks now, and it seems to be working.
This introduces an InnoDB table-based locking mechanism to replace GET_LOCK/RELEASE_LOCK, whose use is documented as a limitation by many MySQL clustering solutions, for example https://docs.percona.com/percona-xtradb-cluster/8.0/limitation.html
The locking mechanism uses DB transactions to acquire and store locks in the database. In some cases a MySQL deadlock can happen (as demonstrated by the dummy bg-thread task d15c672); however, because global locks have a large enough timeout, the code retries acquiring the lock when an exception happens within the timeout period.
This needs to be reviewed and tested against such MySQL clustering solutions to determine whether an active-active or active-backup setup can work with CloudStack mgmt server(s) with these changes.
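The retry-within-timeout behaviour described above can be sketched roughly as follows. This is not the code from this PR: `TableLockSketch`, `tryAcquire`, `release`, and `acquireWithRetry` are hypothetical names, and the lock table is simulated with an in-memory map so the example runs without a database. In a real implementation the single attempt would presumably be an InnoDB transaction against a lock table with a unique key on the lock name, and the caught exception would be a deadlock or similar SQL error.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: an in-memory stand-in for a DB lock table, plus a retry loop. */
final class TableLockSketch {

    // One "row" per lock name, mapping to its owner. Mimics a unique key on the name.
    private final Map<String, String> rows = new ConcurrentHashMap<>();

    /** Single attempt: succeeds only if no row exists yet for this lock name. */
    boolean tryAcquire(String name, String owner) {
        return rows.putIfAbsent(name, owner) == null;
    }

    /** Removes the row only if it is held by the given owner. */
    boolean release(String name, String owner) {
        return rows.remove(name, owner);
    }

    /**
     * Repeats the attempt until it succeeds or the timeout elapses.
     * A runtime exception from an attempt (standing in for a DB deadlock)
     * is treated as a failed attempt and retried, not propagated.
     */
    static boolean acquireWithRetry(BooleanSupplier attempt, long timeoutMillis, long pauseMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            try {
                if (attempt.getAsBoolean()) {
                    return true;
                }
            } catch (RuntimeException e) {
                // Assumption: the failure is transient; retry while time remains.
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A caller would wrap the single attempt, for example `TableLockSketch.acquireWithRetry(() -> table.tryAcquire("vm-42", "ms-1"), 5000, 50)`, where the lock name, owner id, timeout, and pause values are illustrative only. The pause between attempts is a design choice to avoid hammering the database; whether the PR's code pauses between retries is not stated here.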
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
How this was tested
For testing purposes, I created an environment with 3x mgmt server hosts and 2x KVM hosts (using mbx). On each mgmt server host I set up a Percona XtraDB Cluster node in a 1:1 fashion, so that each mgmt server connects to the locally running MySQL instance while the 3x MySQL instances form a cluster. This enables a control-plane HA strategy in which both the mgmt server and the database achieve HA in a variety of ways (such as the 1:1 setup, or using an internal LB or proxy middleware such as ProxySQL on the mgmt server host).
Screenshots:

