
Task manager assigns all burst jobs to a single controller node when capacity is tied #16416

Description

@amasolov

Summary

When multiple control-plane instances have equal remaining capacity (the common case for container-group workloads where AWX_CONTROL_NODE_TASK_IMPACT = 1), fit_task_to_most_remaining_capacity_instance always selects the same node. Instances are iterated in a deterministic order and a candidate only replaces the current best on a strict > comparison, so the first instance in iteration order wins every tie.

This causes burst workloads (many jobs submitted concurrently) to concentrate all job management overhead (event processing, callbacks, output streaming) on a single controller pod, while other controller pods remain idle.

Steps to reproduce

  1. Deploy AWX with 3 controller task replicas (e.g. task_replicas: 3 in the CR)
  2. Use a container group instance group for job execution (the default setup)
  3. Submit 20+ jobs concurrently via the API:
    for i in $(seq 1 20); do
      curl -sk -u admin:password -X POST "$URL/api/v2/job_templates/7/launch/" \
        -H "Content-Type: application/json" &
    done
    wait
  4. Check controller_node on the completed jobs:
    curl -sk -u admin:password "$URL/api/v2/jobs/?id__gte=<first_job_id>&order_by=id" | \
      python3 -c "import sys,json; [print(j['controller_node']) for j in json.load(sys.stdin)['results']]"
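The one-liner in step 4 prints one hostname per job, which is hard to read for larger bursts. A minimal sketch of a per-node tally follows. It reads the same results and controller_node keys as the one-liner; the sample payload and hostnames are made up for illustration and are not real AWX output.

```python
from collections import Counter


def tally_controller_nodes(jobs_response):
    """Count jobs per controller_node in a parsed /api/v2/jobs/ response."""
    # 'results' and 'controller_node' are the keys the step 4 one-liner reads.
    return Counter(job["controller_node"] for job in jobs_response["results"])


# Made-up sample payload (hypothetical hostnames), shaped like the API response.
sample = {
    "results": [
        {"id": 101, "controller_node": "awx-task-a"},
        {"id": 102, "controller_node": "awx-task-a"},
        {"id": 103, "controller_node": "awx-task-a"},
    ]
}

for node, count in tally_controller_nodes(sample).most_common():
    print(f"{node}: {count}")
```

With real data, pipe the curl output from step 4 into json.load(sys.stdin) and pass the result to tally_controller_nodes. Note that the jobs endpoint is paginated, so a large burst may need more than one request.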

Expected result

Jobs should be distributed across available controller nodes when all have equal (or near-equal) remaining capacity.
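One way to get this behaviour is to collect every instance that ties for the most remaining capacity and pick among them at random. The sketch below is only an illustration of that idea under simplified assumptions: pick_instance_spread, the (hostname, remaining_capacity) tuples, and the hostnames are hypothetical and are not AWX code or a proposed patch.

```python
import random


def pick_instance_spread(instances, impact, rng=random):
    """Pick among all instances tied for the most remaining capacity.

    instances: iterable of (hostname, remaining_capacity) pairs (hypothetical shape).
    impact: capacity the task would consume on the chosen instance.
    rng: anything with a choice() method; defaults to the random module.
    Returns a hostname, or None if no instance can fit the task.
    """
    best = None
    tied = []
    for hostname, remaining in instances:
        would_be_remaining = remaining - impact
        if would_be_remaining < 0:
            continue  # task does not fit on this instance
        if best is None or would_be_remaining > best:
            best = would_be_remaining
            tied = [hostname]  # strictly better: start a new tie group
        elif would_be_remaining == best:
            tied.append(hostname)  # equal capacity: remember as a candidate
    return rng.choice(tied) if tied else None


# Three equally loaded controllers (hypothetical hostnames).
nodes = [("awx-task-a", 640), ("awx-task-b", 640), ("awx-task-c", 640)]
picks = [pick_instance_spread(nodes, impact=1) for _ in range(20)]
print(sorted(set(picks)))
```

An instance with strictly more remaining capacity still always wins; randomness only applies within a tie. Round-robin or ordering by least recently assigned would be alternatives with more even short-term spread.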

Actual result

100% of jobs are assigned to a single controller node. Other controller pods manage zero jobs during the burst.

Root cause

The selection logic in awx/main/scheduler/task_manager_models.py replaces the current best candidate only when a later instance has strictly more remaining capacity:

if would_be_remaining >= 0 and (instance_most_capacity is None or would_be_remaining > most_remaining_capacity):

For container-group jobs, the control impact is only 1 unit (AWX_CONTROL_NODE_TASK_IMPACT = 1) out of a typical capacity of ~640. Two further factors keep the nodes tied:

  • Sequential task manager cycles (each processing approximately 1 job due to advisory lock timing)
  • Capacity being reset between cycles (completed jobs free their impact)

As a result, all nodes appear equally viable on every cycle, and the first node in iteration order always wins the tie.
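The tie can be reproduced in isolation with a simplified model of the condition quoted above. This is a sketch, not the AWX implementation: pick_instance, the (hostname, remaining_capacity) tuples, and the hostnames are assumptions for illustration. Only the if condition, the impact of 1, and the capacity of ~640 come from this report.

```python
def pick_instance(instances, impact):
    """Simplified model of the strict '>' selection described in this issue.

    instances: list of (hostname, remaining_capacity) in iteration order.
    Returns the chosen hostname, or None if the task fits nowhere.
    """
    instance_most_capacity = None
    most_remaining_capacity = -1
    for hostname, remaining in instances:
        would_be_remaining = remaining - impact
        # Same shape as the quoted condition: a tie never replaces the current best.
        if would_be_remaining >= 0 and (
            instance_most_capacity is None
            or would_be_remaining > most_remaining_capacity
        ):
            instance_most_capacity = hostname
            most_remaining_capacity = would_be_remaining
    return instance_most_capacity


# Three controllers with equal capacity (hypothetical hostnames).
nodes = [("awx-task-a", 640), ("awx-task-b", 640), ("awx-task-c", 640)]

# Model 20 cycles that each see the same capacities, as described above.
chosen = [pick_instance(nodes, impact=1) for _ in range(20)]
print(set(chosen))  # {'awx-task-a'}
```

Every cycle returns the first hostname, because 639 > 639 is false for the second and third nodes. Reordering the list changes which node wins, but never spreads the load.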

Impact

The controller node handles job lifecycle management: event processing, callback receiver, output streaming to the database and websocket consumers. Concentrating 40+ concurrent jobs on one pod creates resource pressure on that pod while others remain idle, potentially causing job failures under memory/CPU constraints.

Environment

  • AWX 24.x / AAP 2.5+ (any version with multi-replica controller support)
  • Container group execution (Kubernetes/EKS/OpenShift)
  • Multiple controller task replicas
