
Investigate worker (Celery?) issues #2697

@lbarcziova


After last week's redeployment (6th January), we started hitting issues with job processing: tasks are not processed for some time, which causes delays:

  • we saw restarts of the worker pods caused by hitting CPU limits; we tried to mitigate this by increasing the CPU limits ("Increase cpu limit to handle spikes better", deployment#631). The limit was also increased for postgres ("Increase cpu limits for postgres", deployment#636), where metrics also showed usage going above the limit
  • sometimes the workers do not process tasks at all, even though no task is blocking them
  • the logs contain messages like:
    • "Substantial drift from celery@packit-worker-long-running-0 may mean clocks are out of sync. Current drift is 1799 seconds. [orig: 2025-01-14 14:51:59.656603 recv: 2025-01-14 14:22:00.484181]"
    • "consumer: Connection to broker lost. Trying to re-establish the connection...", followed by a restart
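
To sanity-check the drift message above, here is a minimal sketch that recomputes the gap between the `orig` and `recv` timestamps quoted in the log. The helper name `drift_seconds` is hypothetical, and truncating to whole seconds is an assumption that happens to match the logged value; it is not taken from Celery's code.

```python
from datetime import datetime

# Timestamp layout as it appears in the log message quoted above.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def drift_seconds(orig: str, recv: str) -> float:
    """Absolute difference between two log timestamps, in seconds.

    Hypothetical helper for illustration only.
    """
    delta = datetime.strptime(orig, FMT) - datetime.strptime(recv, FMT)
    return abs(delta.total_seconds())


# Values copied from the log line in this issue.
drift = drift_seconds("2025-01-14 14:51:59.656603", "2025-01-14 14:22:00.484181")
print(int(drift))  # whole seconds, comparable to "Current drift is ... seconds"
```

The two timestamps are just under 30 minutes apart, which agrees with the 1799 seconds reported in the message.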

Labels

  • area/general: Not tied to a specific area
  • complexity/single-task: Regular task; should be done within days
  • gain/high: Brings a lot of value to users
  • impact/high: Affects a lot of users
  • kind/bug: An unexpected problem or behavior


Status: done
