
Remove gossip and mingling from Celery workers #2251


Open · wants to merge 1 commit into master

Conversation

@aaronkanzer (Member) commented Mar 27, 2025

Stemming from behavior observed in @sandyhider's work on EMBER Archive (a clone of DANDI Archive): their CloudAMQP instance quickly hits capacity.

Similar behavior has been observed in LINC (Cc @kabilar).

In Sandy's correspondence with CloudAMQP support staff, they suggested disabling gossip and mingling.

Why can't we currently see gossip and mingling in our logs?

Currently, our Procfile has --log-level set to INFO -- gossip and mingling are only surfaced at the DEBUG level.

Do we need mingling in DANDI Archive?

No -- mingling is meant to synchronize state across Celery workers whose tasks are defined in differing code bases. Our Celery tasks are all defined in the same static codebase.

Do we need gossip in DANDI Archive?

No -- gossip enables Celery workers to broadcast and listen to events (used for things like remote control, celery events, and dashboards like Flower).

Our Celery setup is primarily producer/consumer within a single deployment, so there’s no need for this cross-worker coordination.

This PR includes the appropriate flags to disable Celery functionality that does not seem integral to our deployment, and should prevent bloat of our message broker.
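For reference, the worker options in question look roughly like this -- a sketch of the argv a Celery worker would receive. The `--without-gossip` and `--without-mingle` flags are real Celery CLI options; the app/module names are omitted here, since the actual change in this PR is a one-line Procfile edit.

```python
# Sketch of the worker flags this PR adds, expressed as a plain argv list
# (the real change is a Procfile edit; app/module names omitted).
worker_argv = [
    "worker",
    "--loglevel", "INFO",   # gossip/mingle chatter only surfaces at DEBUG
    "--without-gossip",     # don't broadcast/listen to peer worker events
    "--without-mingle",     # don't synchronize with peer workers at startup
]

# In Python this could be launched via a Celery app: app.worker_main(worker_argv)
print(" ".join(worker_argv))
```

Note that `--without-heartbeat` is a related flag sometimes suggested alongside these two, but it is not part of this change.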

Cc @yarikoptic @satra

@aaronkanzer aaronkanzer marked this pull request as draft March 27, 2025 19:56
@aaronkanzer (Member, Author)

Note: currently in draft mode -- just doing some robust testing with DEBUG logging to evaluate the frequency of gossip and mingling.

@aaronkanzer (Member, Author) commented Mar 27, 2025

@waxlamp @mvandenburgh @jjnesbitt @satra @yarikoptic -- I'm trying to replicate this locally with --log-level DEBUG in Celery (a bit tougher since the RabbitMQ dashboard isn't present locally), but I'm noticing some interesting behavior in the LINC RabbitMQ logs that gives me higher confidence that disabling mingle and gossip should help.

You can see below that, without any messages being sent to the ingest-zarr-achive, checksum, or s3-log-processing queues, about 5.4 messages per second are being consumed -- which scales to ~13M messages per month (~65% of the 20M monthly allocation under our current MQ instance quota).

I'd still like to be able to stress test how many messages are produced/consumed when some of our more intensive tasks are invoked.

[Screenshots: RabbitMQ dashboard, 2025-03-27 at 4:26 PM and 4:27 PM]

I'm curious whether local testing is even useful for finding the root cause -- a staging environment deployment might be best. It is certainly possible that both gossip/mingling and something in how our current infra sends tasks are ballooning message consumption.

@aaronkanzer aaronkanzer marked this pull request as ready for review April 14, 2025 17:44
@aaronkanzer (Member, Author)

Moving to ready for review per results observed by @sandyhider on the EMBER team: aplbrain@ab0188b

@kabilar (Member) commented Apr 16, 2025

Thanks Aaron.

@aaronkanzer @sandyhider Do we know why this issue is arising with the LINC and EMBER deployments but not the DANDI deployment?

@mvandenburgh How many messages are being consumed at baseline for the DANDI deployment?

@sandyhider commented Apr 16, 2025

On EMBER, our CloudAMQP service plan is called Elegant Ermine; it allows 20 million messages a month. Last month we hit that limit around day 26. The higher service plans don't have a fixed monthly limit, but rather a max throughput per minute. After having issues trying to change our service plan to a larger size, the Heroku support team suggested that if we made these changes to mingle and gossip we could stay at the current service plan level. We are just under 3 million messages now, and it took us almost a week this month to make the changes on production; the count was at roughly 2.3 million at that time.

You are not having the problem on DANDI because your CloudAMQP plan doesn't have a hard limit. The support person also told us that, by default, mingle and gossip do not actually do anything when they discover an issue -- so those messages are just wasted traffic in the current system.

@kabilar (Member) commented Apr 16, 2025

Thank you, @sandyhider.

@mvandenburgh mvandenburgh added the internal Changes only affect the internal API label Apr 21, 2025