Skip to content

FirstFitRouting Allocator only tries the first cluster it finds then fails before attempting others #10598

Open
@loth

Description

@loth

Our CS environment has two clusters, one windows and one linux. We have host tags that allocate where VM's are placed. We are facing an issue when setting host in maintenance that sometimes it will just fall to error state. Upon checking the logs I found that while it checked the first cluster (ID 19, Windows) however it never tried any hosts in the Linux cluster (ID 16).

Here is the log from the management server. As we can see, the allocator tried to find resources in cluster 19, failed to find any with the Linux tag, then failed out before trying the other cluster.

2025-03-20 20:04:13,017 DEBUG [c.c.v.VmWorkJobDispatcher] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388]) (logid:30d64f64) Run VM work job: com.cloud.vm.VmWorkMigrateAway for VM 8127, job origin: 3900296
2025-03-20 20:04:13,018 DEBUG [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Execute VM work job: com.cloud.vm.VmWorkMigrateAway{"srcHostId":982,"userId":1,"accountId":1,"vmId":8127,"handlerName":"VirtualMachineManagerImpl"}
2025-03-20 20:04:13,031 INFO  [c.c.v.ClusteredVirtualMachineManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Searching for hosts in the zone for vm migration
2025-03-20 20:04:13,034 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Trying to deploy VM [error decoding VM instance {"id":8127,"instanceName":"i-10989-8127-VM","type":"User","uuid":"ce0c1fb6-65ef-4942-97de-05b255fdcd7d"}] and details: Plan [{"_dcId":14,"_poolId":637,"_recreateDisks":false,"preferredHostIds":[],"migrationPlan":true,"hostPriorities":{}}]; avoid list [{"_clusterIds":[24],"_hostIds":[982]}] and planner: [].
2025-03-20 20:04:13,037 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Checking non dedicated resources to deploy VM [{"instanceName":"i-10989-8127-VM","type":"User","uuid":"ce0c1fb6-65ef-4942-97de-05b255fdcd7d"}].
2025-03-20 20:04:13,046 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Adding pods [[]], clusters [[]] and hosts [[]] to the avoid list in the deploy process of user VM ["ce0c1fb6-65ef-4942-97de-05b255fdcd7d"], because this VM is not explicitly dedicated to these components.
2025-03-20 20:04:13,048 DEBUG [o.a.c.u.r.ReflectionToStringBuilderUtils] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Object is null, not reflecting it.
2025-03-20 20:04:13,049 DEBUG [o.a.c.u.r.ReflectionToStringBuilderUtils] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Object is null, not reflecting it.
2025-03-20 20:04:13,049 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Trying to allocate a host and storage pools from datacenter [{"name":"CS12","uuid":"81e07cf1-e395-4f5e-aa7e-4e9d80aa3cfb"}], pod [null], cluster [null], to deploy VM [{"instanceName":"i-10989-8127-VM","type":"User","uuid":"ce0c1fb6-65ef-4942-97de-05b255fdcd7d"}] with requested CPU [24000] and requested RAM [(32.00 GB) 34359738368].
2025-03-20 20:04:13,050 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) ROOT volume [69d114a8-5673-4ad7-a387-2ac0d11cf3a1] is ready to deploy VM [ce0c1fb6-65ef-4942-97de-05b255fdcd7d].
2025-03-20 20:04:13,054 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Adding pods [] to the avoid set because these pods are in the Disabled state.
2025-03-20 20:04:13,057 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Adding clusters [] of pod [5] to the void set because these clusters are in the Disabled state.
2025-03-20 20:04:13,060 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Adding hosts [] of datacenter [81e07cf1-e395-4f5e-aa7e-4e9d80aa3cfb] to the avoid set, because these hosts are in the Disabled state.
2025-03-20 20:04:13,065 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) DeploymentPlan [DataCenterDeployment] has not specified host. Trying to find another destination to deploy VM [ce0c1fb6-65ef-4942-97de-05b255fdcd7d], avoiding pods [], clusters [24] and hosts [982].
2025-03-20 20:04:13,065 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Deploy avoids pods: [], clusters: [24], hosts: [982].
2025-03-20 20:04:13,065 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Deploy hosts with priorities {}, hosts have NORMAL priority by default
2025-03-20 20:04:13,066 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) This VM has last host_id specified, trying to choose the same host: 982
2025-03-20 20:04:13,069 WARN  [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) The last host [bf145201-bcab-4634-8750-36b064574bc7] of VM [ce0c1fb6-65ef-4942-97de-05b255fdcd7d] is in the avoid set. Skipping this and trying other available hosts.
2025-03-20 20:04:13,069 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Cannot choose the last host to deploy this VM VM instance {"id":8127,"instanceName":"i-10989-8127-VM","type":"User","uuid":"ce0c1fb6-65ef-4942-97de-05b255fdcd7d"}.
2025-03-20 20:04:13,072 DEBUG [c.c.d.FirstFitPlanner] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Searching all possible resources under this Zone: 14
2025-03-20 20:04:13,076 DEBUG [c.c.d.FirstFitPlanner] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Listing clusters in order of aggregate capacity, that have (at least one host with) enough CPU and RAM capacity under this Zone: 14
2025-03-20 20:04:13,084 DEBUG [c.c.d.FirstFitPlanner] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Removing from the clusterId list these clusters from avoid set: [24]
2025-03-20 20:04:13,102 DEBUG [c.c.d.FirstFitPlanner] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) The clusterId list for the given offering tag: [19, 16]
2025-03-20 20:04:13,106 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Checking resources in Cluster: 19 under Pod: 5
2025-03-20 20:04:13,111 INFO  [c.c.a.m.a.i.FirstFitRoutingAllocator] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975, FirstFitRoutingAllocator]) (logid:30d64f64)  Guest VM is requested with Custom[UEFI] Boot Type false
2025-03-20 20:04:13,111 DEBUG [c.c.a.m.a.i.FirstFitRoutingAllocator] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975, FirstFitRoutingAllocator]) (logid:30d64f64) Looking for hosts in zone [14], pod [5], cluster [19]
2025-03-20 20:04:13,111 DEBUG [c.c.a.m.a.i.FirstFitRoutingAllocator] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975, FirstFitRoutingAllocator]) (logid:30d64f64) Looking for hosts having tag specified on SvcOffering:generic
2025-03-20 20:04:13,118 DEBUG [c.c.a.m.a.i.FirstFitRoutingAllocator] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975, FirstFitRoutingAllocator]) (logid:30d64f64) Hosts with tag 'generic' are:[Host {"id":919,"name":"node87","type":"Routing","uuid":"e5668da2-862f-4cea-bda8-20707525ecda"}, Host {"id":1126,"name":"node51","type":"Routing","uuid":"be3eb6a2-29b3-4a1e-a125-af8f9135d3a6"}, Host {"id":1129,"name":"node52","type":"Routing","uuid":"a803eaea-82a1-4c10-a4d2-b0db6c8c15c6"}, Host {"id":1350,"name":"node84","type":"Routing","uuid":"1a006517-3a4b-4a98-85a2-f693fa4f4501"}, Host {"id":1368,"name":"node50","type":"Routing","uuid":"bc93e3f6-3751-46fd-8dee-c24f6e8fc922"}]
2025-03-20 20:04:13,118 DEBUG [c.c.a.m.a.i.FirstFitRoutingAllocator] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975, FirstFitRoutingAllocator]) (logid:30d64f64) Looking for hosts having tag specified on Template:linux
2025-03-20 20:04:13,123 DEBUG [c.c.a.m.a.i.FirstFitRoutingAllocator] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975, FirstFitRoutingAllocator]) (logid:30d64f64) Hosts with tag 'linux' are:[]
2025-03-20 20:04:13,124 DEBUG [c.c.a.m.a.i.FirstFitRoutingAllocator] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975, FirstFitRoutingAllocator]) (logid:30d64f64) Found 0 Hosts satisfying both tags, host ids are:[]
2025-03-20 20:04:13,126 ERROR [c.c.a.m.a.i.FirstFitRoutingAllocator] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975, FirstFitRoutingAllocator]) (logid:30d64f64) No suitable host found for vm [VM instance {"id":8127,"instanceName":"i-10989-8127-VM","type":"User","uuid":"ce0c1fb6-65ef-4942-97de-05b255fdcd7d"}] with tags [generic].
2025-03-20 20:04:13,127 ERROR [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Invocation exception, caused by: com.cloud.utils.exception.CloudRuntimeException: No suitable host found for vm [VM instance {"id":8127,"instanceName":"i-10989-8127-VM","type":"User","uuid":"ce0c1fb6-65ef-4942-97de-05b255fdcd7d"}].
2025-03-20 20:04:13,127 INFO  [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Rethrow exception com.cloud.utils.exception.CloudRuntimeException: No suitable host found for vm [VM instance {"id":8127,"instanceName":"i-10989-8127-VM","type":"User","uuid":"ce0c1fb6-65ef-4942-97de-05b255fdcd7d"}].
2025-03-20 20:04:13,127 DEBUG [c.c.v.VmWorkJobDispatcher] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388]) (logid:30d64f64) Done with run of VM work job: com.cloud.vm.VmWorkMigrateAway for VM 8127, job origin: 3900296
2025-03-20 20:04:13,127 ERROR [c.c.v.VmWorkJobDispatcher] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388]) (logid:30d64f64) Unable to complete AsyncJobVO: {id:3901388, userId: 1, accountId: 1, instanceType: null, instanceId: null, cmd: com.cloud.vm.VmWorkMigrateAway, cmdInfo: rO0ABXNyAB5jb20uY2xvdWQudm0uVm1Xb3JrTWlncmF0ZUF3YXmt4MX4jtcEmwIAAUoACXNyY0hvc3RJZHhyABNjb20uY2xvdWQudm0uVm1Xb3Jrn5m2VvAlZ2sCAARKAAlhY2NvdW50SWRKAAZ1c2VySWRKAAR2bUlkTAALaGFuZGxlck5hbWV0ABJMamF2YS9sYW5nL1N0cmluZzt4cAAAAAAAAAABAAAAAAAAAAEAAAAAAAAfv3QAGVZpcnR1YWxNYWNoaW5lTWFuYWdlckltcGwAAAAAAAAD1g, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 95539959339, completeMsid: null, lastUpdated: null, lastPolled: null, created: Thu Mar 20 20:04:12 UTC 2025, removed: null}, job origin: 3900296 com.cloud.utils.exception.CloudRuntimeException: No suitable host found for vm [VM instance {"id":8127,"instanceName":"i-10989-8127-VM","type":"User","uuid":"ce0c1fb6-65ef-4942-97de-05b255fdcd7d"}].
2025-03-20 20:04:13,128 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388]) (logid:30d64f64) Complete async job-3901388, jobStatus: FAILED, resultCode: 0

I also checked the UI upon doing a manual migration, all windows hosts are invalid destinations in the options to select a host (which I believe findHostsForMigration uses, so that should be working fine)

versions

ACS 4.20, KVM with NFS Primary/Secondary storage

The steps to reproduce the bug

Setup two clusters, with host tags to identify where VM's should go (windows/linux).
Try to set maintenance on one of the hosts.
Check for clusters in the log to see if it attempts others.

What to do about it?

Correct the code to try all clusters instead of just the first.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    Status

    on Hold

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions