
Fair Round Robin Implementation#120

Merged
HenryTraill merged 15 commits into master from fair-round-robin on Feb 20, 2026
Conversation

@tejas7777 (Contributor) commented Feb 19, 2026

Closes #119

Why we are doing this

Prevent noisy-neighbor delays by queueing work per branch and dispatching fairly across branches.

Design

  1. dispatch_branch_task() receives webhook work and enqueues it into a Redis LIST for that branch.
  2. Redis stores:
    • jobs:branch:{branch_id} (per-branch FIFO queue)
    • jobs:branches:active (set of branches with pending jobs)
    • jobs:dispatcher:cursor (last branch processed)
  3. A dedicated dispatcher worker runs continuously:
    • reads active branches
    • rotates from the cursor (round-robin)
    • dispatches at most one job per branch per cycle
  4. Backpressure check pauses dispatch when Celery broker queue depth is above threshold.
  5. Dispatched jobs run task_send_webhooks, which performs the outbound HTTP calls and writes webhook logs.
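The enqueue path and one dispatch cycle can be sketched roughly as follows. The Redis key names come from the description above; the function bodies, the `send` callback, and the bytes/str handling are assumptions for illustration, not the PR's actual code:

```python
import json

# Key names taken from the PR description; `r` is assumed to be a
# redis.Redis-compatible client. This is a sketch, not the real worker code.
ACTIVE_BRANCHES = "jobs:branches:active"
CURSOR_KEY = "jobs:dispatcher:cursor"


def branch_queue(branch_id):
    return f"jobs:branch:{branch_id}"


def dispatch_branch_task(r, branch_id, payload):
    """Steps 1-2: push one job onto the branch FIFO and mark the branch active."""
    r.rpush(branch_queue(branch_id), json.dumps(payload))
    r.sadd(ACTIVE_BRANCHES, branch_id)


def dispatch_cycle(r, send):
    """Step 3: one round-robin pass, dispatching at most one job per branch."""
    branches = sorted(m.decode() if isinstance(m, bytes) else m
                      for m in r.smembers(ACTIVE_BRANCHES))
    cursor = r.get(CURSOR_KEY)
    if isinstance(cursor, bytes):
        cursor = cursor.decode()
    # Resume just after the branch the cursor points at.
    start = branches.index(cursor) + 1 if cursor in branches else 0
    dispatched = 0
    for branch_id in branches[start:] + branches[:start]:
        job = r.lpop(branch_queue(branch_id))
        if job is None:
            r.srem(ACTIVE_BRANCHES, branch_id)  # queue drained, branch inactive
            continue
        send(json.loads(job))  # e.g. task_send_webhooks.apply_async(...)
        r.set(CURSOR_KEY, branch_id)
        dispatched += 1
    return dispatched
```

Because each cycle pops at most one job per branch, a branch with a deep backlog cannot starve the others.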

Mermaid Diagram

```mermaid
flowchart TD
    A["TC2 webhook payload"] --> B["dispatch_branch_task"]
    B --> C["Redis list: jobs:branch:BRANCH_ID"]
    B --> D["Redis set: jobs:branches:active"]

    E["Dispatcher worker loop (job_dispatcher_task)"] --> D
    E --> F["Read cursor: jobs:dispatcher:cursor"]
    E --> G{"Broker queue below threshold?"}
    G -- "No" --> H["Sleep cycle_delay"]
    H --> E
    G -- "Yes" --> I["dispatch_cycle (round robin)"]

    I --> C
    I --> J["task.apply_async"]
    J --> K["Celery broker queue"]
    K --> L["task_send_webhooks"]
    L --> M["Client webhook endpoints"]
    L --> N["WebhookLog rows"]
    I --> O["Ack queue head"]
    O --> P["Update cursor"]
    P --> E
```
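The backpressure gate in step 4 could look roughly like this. The threshold, `cycle_delay`, and the injected `queue_depth` helper are hypothetical; with a Redis broker, the depth check might wrap an LLEN on the broker's queue key:

```python
import time


def dispatcher_loop(queue_depth, run_cycle, threshold=1000, cycle_delay=0.5,
                    max_cycles=None):
    """Sketch of step 4: pause dispatch while the Celery broker queue is
    above the threshold, otherwise run one round-robin dispatch cycle.

    queue_depth and run_cycle are injected callables so the sketch stays
    self-contained; max_cycles exists only to make the loop finite in tests.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        if queue_depth() >= threshold:
            time.sleep(cycle_delay)  # backpressure: let workers drain first
        else:
            run_cycle()
        cycles += 1
```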

Expected Outcome

  • Fairer cross-branch processing.
  • Reduced impact from one branch generating heavy traffic.
  • Controlled queue growth via dispatcher-side backpressure.

@HenryTraill (Collaborator) commented:

Also, can you improve the current "workers queue too long" error on chronos to include the queue length (qlength)?
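A minimal sketch of the requested change (the helper name and limit are hypothetical; only the existing error text is from the repo):

```python
def check_queue_length(qlength, max_length=100):
    """Return the 'queue too long' error with the measured length included,
    or None when the queue is within bounds."""
    if qlength > max_length:
        return (f"Queue is too long ({qlength} jobs, limit {max_length}). "
                "Check workers and speeds.")
    return None
```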

@codecov-commenter commented Feb 19, 2026

Codecov Report

❌ Patch coverage is 98.04878% with 4 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| chronos/worker.py | 92.30% | 2 Missing and 2 partials ⚠️ |

@tejas7777 (Contributor, Author) commented:

Posting coverage here since the bot doesn't seem to be updating its message and the report page gives a 404:

[screenshot: coverage report]

@tejas7777 tejas7777 self-assigned this Feb 19, 2026
@tejas7777 tejas7777 added the enhancement New feature or request label Feb 19, 2026
@tejas7777 tejas7777 assigned HenryTraill and unassigned tejas7777 Feb 19, 2026
```python
# Worker tuning: fetch one task at a time for fair scheduling
worker_prefetch_multiplier=1,
# Task execution limits
task_soft_time_limit=300,
```
@tejas7777 (Contributor, Author) commented Feb 19, 2026:
We weren't setting task_soft_time_limit and task_time_limit earlier, so they defaulted to None. These settings affect the normal Celery workers; the dispatch worker itself is unaffected because we run it with --soft-time-limit=0 --time-limit=0.

@HenryTraill (Collaborator) left a comment:
Let's break down the events array and add each event to the new queues, to make the round robin equal.

@HenryTraill HenryTraill assigned tejas7777 and unassigned HenryTraill Feb 20, 2026
@tejas7777 (Contributor, Author) commented Feb 20, 2026

Codex (xHigh reasoning) review on backwards compatibility:

Backward Compatibility Sign-off (PR Summary)

Approved for rollout with backward compatibility, with one operational requirement: keep dispatcher running while round-robin queues are draining.

Why this is backward compatible

  1. Task contract is unchanged: task_send_webhooks(payload, url_extension) (chronos/worker.py:195).
  2. Old queued batched payloads are still supported because worker-side per-event split remains (chronos/worker.py:133).
  3. New split payloads are the same serialized JSON payload shape consumed by the same worker task (chronos/worker.py:320).
  4. Feature-flag fallback remains available: use_round_robin=False routes new ingress directly to Celery (chronos/views.py:82).
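Point 4's flag fallback might be sketched as follows (function names and injected callables are hypothetical; the real routing lives in chronos/views.py):

```python
def route_webhook(payload, branch_id, use_round_robin, enqueue_branch,
                  celery_send):
    """Sketch of the use_round_robin fallback: flag on routes ingress through
    the per-branch queues and the dispatcher; flag off uses the legacy path,
    sending the task straight to the Celery broker."""
    if use_round_robin:
        enqueue_branch(branch_id, payload)
    else:
        celery_send(payload)
```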

Mixed-version safety

  1. Old jobs + new workers: compatible.
  2. New jobs + old workers: compatible.
  3. Mixed old/new jobs during rolling deploy: compatible.

Operational guardrails

  1. If issues occur, set use_round_robin=False first (stop new per-branch queue writes).
  2. Keep dispatcher up until jobs:branches:active is empty.
  3. Rollback only after queue drain; otherwise queued per-branch jobs can be stranded.

Bottom line

No payload-format backward-compatibility break was identified for in-flight webhook tasks. Primary risk is operational sequencing (flag/dispatcher/rollback order), not task schema incompatibility.

@tejas7777 (Contributor, Author) commented:

Testing after changes:

  • Creating a new integration works

  • Webhooks are sent

[screenshots: integration creation and webhook delivery]

@tejas7777 tejas7777 assigned HenryTraill and unassigned tejas7777 Feb 20, 2026
@tejas7777 (Contributor, Author) commented:

@HenryTraill We need to make sure on Heroku that the dispatcher process never auto-scales past a single instance; with multiple dispatchers there is a risk of duplicate webhooks.
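One common guard for this constraint (not part of this PR, and the key name is hypothetical) is a short-lived Redis lease taken with SET NX EX; a second dispatcher instance that fails to acquire it simply skips its cycle:

```python
def try_acquire_dispatcher_lock(r, holder, ttl=30):
    """Attempt to take a short-lived lease; only one holder succeeds until
    the TTL expires. `r` is assumed to be a redis.Redis-compatible client.

    redis-py's set() returns True on success and None when NX blocks the
    write, so bool() normalizes the result.
    """
    return bool(r.set("jobs:dispatcher:lock", holder, nx=True, ex=ttl))
```

The dispatcher would re-acquire or extend the lease each cycle, so a crashed instance releases it automatically once the TTL lapses.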

@HenryTraill HenryTraill merged commit bc3c2ba into master Feb 20, 2026
2 checks passed

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

Queue is too long. Check workers and speeds.

3 participants