Task manager assigns all burst jobs to a single controller node when capacity is tied

## Summary

When multiple control-plane instances have equal remaining capacity (the common case for container-group workloads where `AWX_CONTROL_NODE_TASK_IMPACT = 1`), `fit_task_to_most_remaining_capacity_instance` always selects the same node due to deterministic iteration order and strict `>` tie-breaking.

This causes burst workloads (many jobs submitted concurrently) to concentrate all job management overhead (event processing, callbacks, output streaming) on a single controller pod, while other controller pods remain idle.

## Steps to reproduce

1. Deploy AWX with 3 controller task replicas (e.g. `task_replicas: 3` in the CR)
2. Use a container group instance group for job execution (the default setup)
3. Submit 20+ jobs concurrently via the API:
   ```bash
   for i in $(seq 1 20); do
     curl -sk -u admin:password -X POST "$URL/api/v2/job_templates/7/launch/" \
       -H "Content-Type: application/json" &
   done
   wait
   ```
4. Check `controller_node` on the completed jobs:
   ```bash
   curl -sk -u admin:password "$URL/api/v2/jobs/?id__gte=<first_job_id>&order_by=id" | \
     python3 -c "import sys,json; [print(j['controller_node']) for j in json.load(sys.stdin)['results']]"
   ```
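For larger bursts, a per-node tally is easier to read than one hostname per line. The following is a minimal sketch that assumes the response carries the `results` list and `controller_node` field used in step 4. The helper `tally_controller_nodes` and the sample payload (including its hostnames) are hypothetical and not part of AWX.

```python
from collections import Counter


def tally_controller_nodes(payload: dict) -> Counter:
    """Count jobs per controller_node in a parsed jobs-list response."""
    return Counter(job["controller_node"] for job in payload["results"])


# Hypothetical sample shaped like the jobs-list response; hostnames are made up.
sample = {
    "results": [
        {"controller_node": "awx-task-0"},
        {"controller_node": "awx-task-0"},
        {"controller_node": "awx-task-0"},
    ]
}

tally = tally_controller_nodes(sample)
for hostname, count in tally.most_common():
    print(f"{hostname}: {count}")
```

The list endpoint is paginated, so the tally only covers the page it is given.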

## Expected result

Jobs should be distributed across available controller nodes when all have equal (or near-equal) remaining capacity.
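One way to get this behavior is to collect every instance that ties for the most remaining capacity and pick among them at random. This is only an illustrative sketch, not a proposed patch: the `Node` class, `pick_with_random_tie_break`, and the hostnames are hypothetical stand-ins and do not mirror the real AWX models.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """Hypothetical stand-in for a control-plane instance."""
    hostname: str
    remaining_capacity: int


def pick_with_random_tie_break(
    nodes: Sequence[Node], impact: int, rng: random.Random
) -> Optional[Node]:
    """Return a random node among those tied for most remaining capacity."""
    # Keep only nodes that can absorb the task's impact.
    viable = [n for n in nodes if n.remaining_capacity - impact >= 0]
    if not viable:
        return None
    best = max(n.remaining_capacity for n in viable)
    tied = [n for n in viable if n.remaining_capacity == best]
    return rng.choice(tied)


rng = random.Random(0)  # seeded only to make the demo repeatable
nodes = [Node(f"controller-{i}", 640) for i in range(3)]
picks = [pick_with_random_tie_break(nodes, 1, rng).hostname for _ in range(300)]
print({name: picks.count(name) for name in sorted(set(picks))})
```

Random choice is just one option. Rotating through tied instances or preferring the instance with the fewest running jobs would also spread the load.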

## Actual result

100% of jobs are assigned to a single controller node. Other controller pods manage zero jobs during the burst.

## Root cause

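In `awx/main/scheduler/task_manager_models.py`, the selection logic replaces the current best candidate only when the remaining capacity is strictly greater, so a tie never displaces the first viable instance: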

```python
if would_be_remaining >= 0 and (instance_most_capacity is None or would_be_remaining > most_remaining_capacity):
```

For container-group jobs, the control impact is only 1 unit (`AWX_CONTROL_NODE_TASK_IMPACT = 1`) out of a typical capacity of ~640. Two further factors keep the nodes tied:
- Task manager cycles run sequentially, each processing approximately 1 job due to advisory lock timing
- Capacity is reset between cycles (completed jobs free their impact)

As a result, all nodes appear equally viable on every cycle, and the first node in iteration order always wins the tie.
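The effect can be reproduced outside AWX with a small simulation. This is a simplified sketch, not the actual AWX code: `Node` and `pick_most_remaining` are hypothetical stand-ins that reuse only the comparison quoted above, and the loop assumes (per the description above) that every cycle sees all nodes back at equal capacity.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """Hypothetical stand-in for a control-plane instance."""
    hostname: str
    remaining_capacity: int


def pick_most_remaining(nodes: Sequence[Node], impact: int) -> Optional[Node]:
    """Mimic the strict '>' selection: ties never replace the current best."""
    instance_most_capacity = None
    most_remaining_capacity = -1
    for node in nodes:
        would_be_remaining = node.remaining_capacity - impact
        if would_be_remaining >= 0 and (
            instance_most_capacity is None
            or would_be_remaining > most_remaining_capacity
        ):
            instance_most_capacity = node
            most_remaining_capacity = would_be_remaining
    return instance_most_capacity


nodes = [Node(f"controller-{i}", 640) for i in range(3)]

# Assumption: one job per cycle, and capacity looks identical again on the
# next cycle, so remaining_capacity is deliberately left untouched here.
picks = [pick_most_remaining(nodes, impact=1).hostname for _ in range(20)]
print(set(picks))
```

Every pick lands on the first node in iteration order, matching the behavior described under "Actual result".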

## Impact

The controller node handles job lifecycle management: event processing, the callback receiver, and output streaming to the database and websocket consumers. Concentrating 40+ concurrent jobs on one pod creates resource pressure on that pod while the others remain idle, and can cause job failures when the pod runs under tight memory or CPU limits.

## Environment

- AWX 24.x / AAP 2.5+ (any version with multi-replica controller support)
- Container group execution (Kubernetes/EKS/OpenShift)
- Multiple controller task replicas
