Distributed Table-Based Lock to support MySQL Database Cluster #9955

Draft: wants to merge 13 commits into base: main

rohityadavcloud
Member

@rohityadavcloud rohityadavcloud commented Nov 20, 2024

This introduces an InnoDB table-based locking mechanism to replace GET_LOCK/RELEASE_LOCK, which are documented as a limitation of many MySQL clustering solutions, for example Percona XtraDB Cluster: https://docs.percona.com/percona-xtradb-cluster/8.0/limitation.html

The locking mechanism uses DB transactions to acquire and store locks in the database. A MySQL deadlock can happen in some cases (as demonstrated by the dummy background-thread task in d15c672). However, because global locks have a large enough timeout, the code retries acquiring the lock whenever an exception occurs within the timeout period.
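The retry behaviour described above can be sketched roughly as follows. This is only an illustration and not the PR's code: `LockRetrySketch`, `LockAttempt`, and `acquireWithRetry` are hypothetical names, and the single transactional attempt is abstracted behind an interface so the loop can be read on its own.

```java
import java.sql.SQLException;

final class LockRetrySketch {
    /** One transactional attempt to take the lock (hypothetical interface, not from the PR). */
    @FunctionalInterface
    interface LockAttempt {
        boolean tryOnce() throws SQLException;
    }

    /**
     * Keeps calling the attempt until it succeeds or the timeout elapses.
     * An SQLException (for example a MySQL deadlock, error 1213 / SQLSTATE 40001)
     * is treated as a failed attempt and retried, not as a fatal error.
     */
    static boolean acquireWithRetry(LockAttempt attempt, long timeoutMillis, long pauseMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            try {
                if (attempt.tryOnce()) {
                    return true;
                }
            } catch (SQLException e) {
                // Assumption: any SQL error inside the timeout window is retryable.
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timeout reached without getting the lock
            }
            try {
                Thread.sleep(pauseMillis); // brief pause before the next attempt
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A caller would pass its own attempt, for example `LockRetrySketch.acquireWithRetry(() -> tryInsertLockRow(), 30_000, 50)`, where `tryInsertLockRow` is a placeholder for whatever transaction actually takes the lock.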

This needs to be reviewed and tested against such MySQL clustering solutions, to establish whether active-active or active-backup setups can work with CloudStack mgmt server(s) with these changes.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

How this was tested

For testing purposes, I created an environment with 3x mgmt server hosts and 2x KVM hosts (using mbx). On each of the mgmt server hosts, I set up a Percona XtraDB Cluster node in a 1:1 fashion, such that each mgmt server connects to the locally running MySQL instance, while the 3x MySQL instances form a cluster. This enables a control-plane HA strategy, wherein both mgmt server and database achieve HA in a variety of ways (such as a 1:1 setup, or using an internal LB or proxy middleware on the mgmt server host such as ProxySQL).

Screenshots:
Screenshot 2025-03-06 at 3 26 36 PM
Screenshot 2025-03-06 at 3 26 49 PM

Clean up old MAC address handling code to use JDK11 lib instead of hacks.
Also really strange to see some basic string parsing code was written by
hand, replaced with Long.parseLong(str, 16) to convert hex string to
long.

Signed-off-by: Rohit Yadav <[email protected]>
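
As an illustration of the hex parsing mentioned in the commit message above, here is a minimal sketch of converting a MAC address string to a long and back using only the JDK. The class and method names (`MacHexSketch`, `macToLong`, `longToMac`) are made up for this example and are not taken from the PR.

```java
final class MacHexSketch {
    /** Parses a MAC such as "02:00:00:00:00:01" (':' or '-' separated) into a long. */
    static long macToLong(String mac) {
        String hex = mac.replace(":", "").replace("-", "");
        return Long.parseLong(hex, 16); // radix 16: treat the digits as hexadecimal
    }

    /** Formats the low 48 bits of the value as a lowercase, colon-separated MAC. */
    static String longToMac(long value) {
        String hex = String.format("%012x", value & 0xFFFFFFFFFFFFL);
        StringBuilder sb = new StringBuilder(17);
        for (int i = 0; i < 12; i += 2) {
            if (i > 0) {
                sb.append(':');
            }
            sb.append(hex, i, i + 2);
        }
        return sb.toString();
    }
}
```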
This introduces a MySQL InnoDB table-based distributed lock which can
be used by one or more management servers and their threads. This removes
usage of the MySQL server-provided locking functions (GET_LOCK,
RELEASE_LOCK), which are not replicated or supported currently by any
MySQL clustering solution. This would be the first main step in getting
CloudStack to work with a MySQL clustering solution such as InnoDB
Cluster, Percona XtraDB Cluster, or MariaDB Galera Cluster. There may be
other changes required, which can be found in due course if this feature
works at scale.

Signed-off-by: Rohit Yadav <[email protected]>
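
To make the idea of a table-based lock concrete, here is one common way such a lock can be built over plain JDBC. It is a sketch under stated assumptions and not the implementation in this PR: the `distributed_lock` table, its columns, and the `TableLockSketch` class are all hypothetical. The lock is taken by inserting a row keyed on the lock name inside a transaction; a duplicate-key error (SQLSTATE class 23) means another owner already holds it.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class TableLockSketch {
    // Hypothetical schema, NOT taken from the PR:
    //   CREATE TABLE distributed_lock (
    //     name  VARCHAR(255) NOT NULL PRIMARY KEY,
    //     owner VARCHAR(255) NOT NULL
    //   ) ENGINE=InnoDB;
    static final String INSERT_SQL = "INSERT INTO distributed_lock (name, owner) VALUES (?, ?)";
    static final String DELETE_SQL = "DELETE FROM distributed_lock WHERE name = ? AND owner = ?";

    /** Returns true if the owner now holds the lock, false if the row already exists. */
    static boolean tryAcquire(Connection conn, String name, String owner) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement stmt = conn.prepareStatement(INSERT_SQL)) {
            stmt.setString(1, name);
            stmt.setString(2, owner);
            stmt.executeUpdate();
            conn.commit();
            return true;
        } catch (SQLException e) {
            conn.rollback();
            String state = e.getSQLState();
            if (state != null && state.startsWith("23")) {
                return false; // integrity constraint violation: lock is held
            }
            throw e; // e.g. a deadlock; the caller may decide to retry
        }
    }

    /** Deletes the lock row if it belongs to the owner; returns true if a row was removed. */
    static boolean release(Connection conn, String name, String owner) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement stmt = conn.prepareStatement(DELETE_SQL)) {
            stmt.setString(1, name);
            stmt.setString(2, owner);
            int deleted = stmt.executeUpdate();
            conn.commit();
            return deleted == 1;
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```

The sketch deliberately leaves out concerns a real implementation has to handle, such as expiring locks whose owner has crashed and re-entrant acquisition by the same thread.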
@rohityadavcloud rohityadavcloud added this to the 4.21.0 milestone Nov 20, 2024
@rohityadavcloud rohityadavcloud removed the request for review from shwstppr November 20, 2024 14:26
@DaanHoogland DaanHoogland changed the base branch from main to 4.19 November 20, 2024 14:28
@rohityadavcloud rohityadavcloud marked this pull request as draft November 20, 2024 14:29
@rohityadavcloud rohityadavcloud changed the base branch from 4.19 to main November 20, 2024 14:33
@rohityadavcloud
Member Author

@blueorangutan package

@blueorangutan

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

codecov bot commented Nov 20, 2024

Codecov Report

Attention: Patch coverage is 20.27027% with 59 lines in your changes missing coverage. Please review.

Project coverage is 16.17%. Comparing base (eab37ec) to head (4b0d0fe).
Report is 19 commits behind head on main.

Files with missing lines Patch % Lines
...n/java/org/apache/cloudstack/ca/CAManagerImpl.java 0.00% 24 Missing ⚠️
...java/com/cloud/upgrade/DatabaseUpgradeChecker.java 0.00% 23 Missing ⚠️
...rk/db/src/main/java/com/cloud/utils/db/DbUtil.java 55.55% 9 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #9955      +/-   ##
============================================
+ Coverage     16.15%   16.17%   +0.01%     
- Complexity    13274    13293      +19     
============================================
  Files          5666     5668       +2     
  Lines        498078   498228     +150     
  Branches      60267    60288      +21     
============================================
+ Hits          80481    80602     +121     
- Misses       408584   408609      +25     
- Partials       9013     9017       +4     
Flag Coverage Δ
uitests 3.99% <ø> (-0.01%) ⬇️
unittests 17.03% <20.27%> (+0.02%) ⬆️

@blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 11588

@rohityadavcloud
Member Author

@shwstppr can you help look at this? 🙏

@rohityadavcloud rohityadavcloud marked this pull request as ready for review February 28, 2025 04:37
@blueorangutan

[SF] Trillian test result (tid-12581)
Environment: kvm-ubuntu22 (x2), Advanced Networking with Mgmt server u22
Total time taken: 73392 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9955-t12581-kvm-ubuntu22.zip
Smoke tests completed. 138 look OK, 3 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup Error 1521.45 test_network.py
test_oobm_multiple_mgmt_server_ownership Failure 31.90 test_outofbandmanagement.py
test_06_purge_expunged_vm_background_task Failure 408.64 test_purge_expunged_vms.py

@rohityadavcloud
Member Author

Requested more reviews on this.

@blueorangutan package

@blueorangutan

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✖️ debian ✔️ suse15. SL-JID 12684

@weizhouapache
Member

@blueorangutan package

@blueorangutan

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12687

@apache apache deleted a comment from blueorangutan Mar 7, 2025
@apache apache deleted a comment from blueorangutan Mar 7, 2025
@rohityadavcloud
Member Author

@blueorangutan test matrix

@blueorangutan

@rohityadavcloud a [SL] Trillian-Jenkins matrix job (EL8 mgmt + EL8 KVM, Ubuntu22 mgmt + Ubuntu22 KVM, EL8 mgmt + VMware 7.0u3, EL9 mgmt + XCP-ng 8.2 ) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian Build Failed (tid-12607)

@blueorangutan

[SF] Trillian test result (tid-12582)
Environment: vmware-70u3 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 147760 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9955-t12582-vmware-70u3.zip
Smoke tests completed. 127 look OK, 14 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_events_resource Error 359.29 test_events_resource.py
test_DeployVmAntiAffinityGroup_in_project Error 132.11 test_affinity_groups_projects.py
test_DeployVmAntiAffinityGroup Error 50.30 test_affinity_groups.py
test_03_deploy_and_scale_kubernetes_cluster Failure 45.33 test_kubernetes_clusters.py
test_08_upgrade_kubernetes_ha_cluster Failure 0.09 test_kubernetes_clusters.py
test_01_prepare_and_cancel_maintenance Error 0.11 test_ms_maintenance_and_safe_shutdown.py
test_04_deploy_vm_for_other_user_and_test_vm_operations Error 117.56 test_network_permissions.py
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup Error 1524.45 test_network.py
test_01_non_strict_host_anti_affinity Failure 173.18 test_nonstrict_affinity_group.py
test_02_non_strict_host_affinity Error 105.21 test_nonstrict_affinity_group.py
test_06_purge_expunged_vm_background_task Failure 405.30 test_purge_expunged_vms.py
test_01_restore_vm Error 3615.27 test_restore_vm.py
test_02_restore_vm_with_disk_offering Error 7.65 test_restore_vm.py
test_03_restore_vm_with_disk_offering_custom_size Error 3625.18 test_restore_vm.py
test_04_restore_vm_allocated_root Error 8.84 test_restore_vm.py
test_01_deploy_vm_on_specific_host Error 12.58 test_vm_deployment_planner.py
test_02_deploy_vm_on_specific_cluster Error 3604.33 test_vm_deployment_planner.py
test_03_deploy_vm_on_specific_pod Error 1.39 test_vm_deployment_planner.py
test_04_deploy_vm_on_host_override_pod_and_cluster Error 4.50 test_vm_deployment_planner.py
test_05_deploy_vm_on_cluster_override_pod Error 4.40 test_vm_deployment_planner.py
test_09_expunge_vm Failure 427.74 test_vm_life_cycle.py
ContextSuite context=TestMigrateVMStrictTags>:setup Error 0.00 test_vm_strict_host_tags.py
test_01_restore_vm_strict_tags_success Error 3610.88 test_vm_strict_host_tags.py
test_02_restore_vm_strict_tags_failure Error 8.98 test_vm_strict_host_tags.py
test_01_scale_vm_strict_tags_success Error 20.20 test_vm_strict_host_tags.py
test_02_scale_vm_strict_tags_failure Error 7207.23 test_vm_strict_host_tags.py
test_01_deploy_vm_on_specific_host_without_strict_tags Error 3607.85 test_vm_strict_host_tags.py
test_02_deploy_vm_on_any_host_without_strict_tags Error 3611.58 test_vm_strict_host_tags.py
test_03_deploy_vm_on_specific_host_with_strict_tags_success Error 3608.37 test_vm_strict_host_tags.py
test_04_deploy_vm_on_any_host_with_strict_tags_success Error 3611.32 test_vm_strict_host_tags.py
test_10_list_volumes Failure 378.48 test_volumes.py

@blueorangutan

[SF] Trillian test result (tid-12604)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 56992 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9955-t12604-kvm-ol8.zip
Smoke tests completed. 140 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup Error 1520.74 test_network.py

@blueorangutan

[SF] Trillian test result (tid-12605)
Environment: kvm-ubuntu22 (x2), Advanced Networking with Mgmt server u22
Total time taken: 61008 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9955-t12605-kvm-ubuntu22.zip
Smoke tests completed. 139 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup Error 1522.67 test_network.py
test_oobm_multiple_mgmt_server_ownership Failure 30.81 test_outofbandmanagement.py

@blueorangutan

[SF] Trillian test result (tid-12606)
Environment: vmware-70u3 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 139409 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9955-t12606-vmware-70u3.zip
Smoke tests completed. 134 look OK, 7 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_events_resource Error 373.67 test_events_resource.py
test_01_prepare_and_cancel_maintenance Error 0.10 test_ms_maintenance_and_safe_shutdown.py
test_04_deploy_vm_for_other_user_and_test_vm_operations Error 114.76 test_network_permissions.py
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup Error 1522.13 test_network.py
test_02_restore_vm_with_disk_offering Error 59.26 test_restore_vm.py
test_03_restore_vm_with_disk_offering_custom_size Error 62.72 test_restore_vm.py
test_01_deploy_vm_on_specific_host Error 23.17 test_vm_deployment_planner.py
test_02_deploy_vm_on_specific_cluster Error 3604.17 test_vm_deployment_planner.py
test_03_deploy_vm_on_specific_pod Error 2.48 test_vm_deployment_planner.py
test_04_deploy_vm_on_host_override_pod_and_cluster Error 14.72 test_vm_deployment_planner.py
test_05_deploy_vm_on_cluster_override_pod Error 3604.50 test_vm_deployment_planner.py
test_01_migrate_vm_strict_tags_success Error 3610.40 test_vm_strict_host_tags.py
test_02_migrate_vm_strict_tags_failure Error 3624.97 test_vm_strict_host_tags.py
ContextSuite context=TestMigrateVMStrictTags>:teardown Error 3629.19 test_vm_strict_host_tags.py
test_01_restore_vm_strict_tags_success Error 7210.62 test_vm_strict_host_tags.py
test_02_restore_vm_strict_tags_failure Error 3608.59 test_vm_strict_host_tags.py
ContextSuite context=TestRestoreVMStrictTags>:teardown Error 3611.18 test_vm_strict_host_tags.py
test_01_scale_vm_strict_tags_success Error 25.49 test_vm_strict_host_tags.py
test_02_scale_vm_strict_tags_failure Error 3608.51 test_vm_strict_host_tags.py
ContextSuite context=TestScaleVMStrictTags>:teardown Error 3611.52 test_vm_strict_host_tags.py
test_01_deploy_vm_on_specific_host_without_strict_tags Error 32.83 test_vm_strict_host_tags.py
test_02_deploy_vm_on_any_host_without_strict_tags Error 7221.18 test_vm_strict_host_tags.py
test_03_deploy_vm_on_specific_host_with_strict_tags_success Error 3613.93 test_vm_strict_host_tags.py
test_04_deploy_vm_on_any_host_with_strict_tags_success Error 45.06 test_vm_strict_host_tags.py
ContextSuite context=TestVMDeploymentPlannerStrictTags>:teardown Error 17.16 test_vm_strict_host_tags.py

@wido
Contributor

wido commented Mar 10, 2025

Codewise I don't see any problems. I don't have a setup to test this on.

For my understanding: This would also work for MariaDB Galera clusters, right?

Contributor

@DaanHoogland DaanHoogland left a comment

clgtm

@DaanHoogland
Contributor

two types of environments need testing at least

  • a traditional master-replica setup
  • some kind of clustered DB

the KVM envs work reasonably well, but the VMware and Xen envs have excessive errors. Probably environmental?
trying insanity;

@DaanHoogland
Contributor

@blueorangutan test matrix

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins matrix job (EL8 mgmt + EL8 KVM, Ubuntu22 mgmt + Ubuntu22 KVM, EL8 mgmt + VMware 7.0u3, EL9 mgmt + XCP-ng 8.2 ) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-12646)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 60832 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9955-t12646-kvm-ol8.zip
Smoke tests completed. 140 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup Error 1521.69 test_network.py

@blueorangutan

[SF] Trillian test result (tid-12648)
Environment: vmware-70u3 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 66744 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9955-t12648-vmware-70u3.zip
Smoke tests completed. 133 look OK, 8 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_events_resource Error 342.57 test_events_resource.py
test_01_events_resource Error 342.58 test_events_resource.py
test_04_list_domains_level_filter Failure 0.07 test_list_domains.py
test_05_list_domains_no_filter Failure 0.04 test_list_domains.py
test_05_list_volumes_isrecursive Failure 0.05 test_list_volumes.py
test_07_list_volumes_listall Failure 0.04 test_list_volumes.py
test_01_prepare_and_cancel_maintenance Error 0.08 test_ms_maintenance_and_safe_shutdown.py
test_04_deploy_vm_for_other_user_and_test_vm_operations Error 116.54 test_network_permissions.py
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup Error 1523.87 test_network.py
test_02_restore_vm_with_disk_offering Error 62.61 test_restore_vm.py
test_03_restore_vm_with_disk_offering_custom_size Error 63.39 test_restore_vm.py
test_02_restore_vm_strict_tags_failure Error 57.67 test_vm_strict_host_tags.py

@blueorangutan

[SF] Trillian test result (tid-12647)
Environment: kvm-ubuntu22 (x2), Advanced Networking with Mgmt server u22
Total time taken: 70365 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9955-t12647-kvm-ubuntu22.zip
Smoke tests completed. 137 look OK, 4 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
ContextSuite context=TestAccounts>:setup Error 0.00 test_accounts.py
ContextSuite context=TestAddVmToSubDomain>:setup Error 0.00 test_accounts.py
test_DeleteDomain Error 12.63 test_accounts.py
test_forceDeleteDomain Failure 12.48 test_accounts.py
ContextSuite context=TestRemoveUserFromAccount>:setup Error 14.09 test_accounts.py
ContextSuite context=TestTemplateHierarchy>:setup Error 15.81 test_accounts.py
ContextSuite context=TestDeployVmWithAffinityGroup>:setup Error 0.00 test_affinity_groups_projects.py
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup Error 1521.27 test_network.py
test_oobm_multiple_mgmt_server_ownership Failure 30.85 test_outofbandmanagement.py

@blueorangutan

[SF] Trillian test result (tid-12649)
Environment: xcpng82 (x2), Advanced Networking with Mgmt server ol9
Total time taken: 76423 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9955-t12649-xcpng82.zip
Smoke tests completed. 135 look OK, 6 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_condensed_drs_algorithm Failure 178.79 test_cluster_drs.py
test_02_balanced_drs_algorithm Failure 178.56 test_cluster_drs.py
test_01_prepare_and_cancel_maintenance Error 0.10 test_ms_maintenance_and_safe_shutdown.py
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup Error 9.52 test_network.py
test_01_non_strict_host_anti_affinity Error 233.48 test_nonstrict_affinity_group.py
test_02_non_strict_host_affinity Error 108.32 test_nonstrict_affinity_group.py
test_02_create_volume Error 5.32 test_resource_names.py
test_05_scale_vm_dont_allow_disk_offering_change Failure 70.91 test_scale_vm.py

@rohityadavcloud
Member Author

Update: we found that the initial DB setup has issues, but deploy-db plus an import from a SQL dump into a PXC cluster worked. However, DB upgrade may fail as MySQL routines aren't available or supported (to be investigated). Otherwise, I've been running an active-active 1:1 3-node Percona-ACS mgmt server setup for a few weeks now, and it seems to be working.

@rohityadavcloud rohityadavcloud marked this pull request as draft April 17, 2025 14:51
10 participants