
Instability at large scales - workers get ReadTimeout when calling api-server #56571

@ron-gaist

Description


Apache Airflow version

3.1.0

If "Other Airflow 2/3 version" selected, which one?

No response

What happened?

Our large-scale setup includes:

  • ~1000 Celery executor workers
  • 15 API servers with 64 worker processes each (resources are sufficient; we checked utilization)

Also, possibly relevant:

  • 6 scheduler replicas
  • 2 DAG processors
  • a PgBouncer with a sufficiently large Airflow connection pool (it never reaches its maximum)
  • DAGs with up to 8k tasks running in parallel, plus a final task that depends on all of them;
    DAGs are usually smaller than that, averaging ~5k tasks

When all workers are active and working on task instances, they all get the following warning 4 times:

[warning] Starting call to 'airflow.sdk.api.client.Client.request', this is the %d time calling it. [airflow.sdk.api.client]

and, on the 5th attempt, they get this error:

[error] Task execute_workload[$celery_task_uuid] raise unexpected: ReadTimeout('timed out') [celery.app.trace]
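
For context, the repeated warning matches the message emitted by a tenacity-style retry wrapper around the HTTP call. A minimal sketch of that pattern, assuming the SDK client retries roughly like this (the actual airflow.sdk client internals may differ):

import logging

import httpx
from tenacity import before_log, retry, stop_after_attempt, wait_exponential

log = logging.getLogger("airflow.sdk.api.client")

@retry(
    stop=stop_after_attempt(5),               # give up after the 5th attempt
    wait=wait_exponential(),
    before=before_log(log, logging.WARNING),  # logs "Starting call to ..., this is the Nth time calling it."
    reraise=True,                             # the final failure surfaces as httpx.ReadTimeout
)
def request(url: str) -> httpx.Response:
    # each attempt is bounded by httpx's default 5-second timeout
    return httpx.get(url)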

We investigated this error a little and found that it comes from the httpx default timeout.
From the httpx docs (https://www.python-httpx.org/advanced/timeouts/):

HTTPX is careful to enforce timeouts everywhere by default.
The default behavior is to raise a TimeoutException after 5 seconds of network inactivity.
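
For illustration, a minimal sketch of that default and of setting a more generous timeout on an httpx client (generic client construction, not the SDK's actual wiring):

import httpx

# httpx.Client() defaults to a 5-second timeout on connect/read/write/pool,
# i.e. the same as passing timeout=httpx.Timeout(5.0).
default_client = httpx.Client()

# Under heavy api-server load, responses can take longer than 5 seconds to
# arrive; an explicit, longer read timeout avoids the ReadTimeout above.
patient_client = httpx.Client(timeout=httpx.Timeout(30.0, connect=5.0))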

What you think should happen instead?

Airflow should let users configure this timeout via airflow.cfg to accommodate high-load systems.
For example:

[api]
HTTPX_TIMEOUT = # 5 by default
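
A minimal sketch of how such an option could be consumed, assuming the hypothetical httpx_timeout key under [api] proposed above (conf.getfloat is Airflow's standard typed config accessor):

import httpx
from airflow.configuration import conf

# Hypothetical: the [api] httpx_timeout option does not exist in Airflow
# today; this only sketches how it could be wired into the SDK's client.
timeout_s = conf.getfloat("api", "httpx_timeout", fallback=5.0)
client = httpx.Client(timeout=httpx.Timeout(timeout_s))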

It would also help to add a section to the docs detailing best practices for keeping the api-server reliable under very high load.

How to reproduce

(1) Run Airflow in a Kubernetes cluster with:
    ~1k Celery workers
    ~15 api-server replicas (64 worker processes; resource limits: 25Gi RAM, 8 CPU cores)

(2) Use DAGs large enough that all 1k workers run tasks in parallel, each task taking more than 5 minutes (a minimal DAG sketch is shown after these steps).

(3) Observe the workers for ReadTimeout errors.
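
A minimal sketch of a DAG with this shape, using the Airflow 3 Task SDK decorators (task count and duration are illustrative; scale N up toward 8k to match the report):

import time

from airflow.sdk import dag, task

N = 100  # scale up toward 8000 to reproduce the reported load

@dag(schedule=None)
def readtimeout_repro():
    @task
    def hold_worker(i: int) -> int:
        time.sleep(6 * 60)  # keep a worker slot busy for more than 5 minutes
        return i

    @task
    def final(results: list[int]) -> None:
        print(f"{len(results)} upstream tasks completed")

    final(hold_worker.expand(i=list(range(N))))

readtimeout_repro()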

Operating System

Debian GNU/Linux 12 (bookworm)

Versions of Apache Airflow Providers

apache-airflow-providers-celery==3.12.2
apache-airflow-providers-common-compat==1.7.3
apache-airflow-providers-common-io==1.6.2
apache-airflow-providers-common-sql==1.27.5
apache-airflow-providers-standard==1.6.0
apache-airflow-providers-postgres==6.2.3

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else?

The problem occurs every time all workers are executing task instances (i.e., at peak load).
Logs:

[warning] Starting call to 'airflow.sdk.api.client.Client.request', this is the 1st time calling it. [airflow.sdk.api.client]
[warning] Starting call to 'airflow.sdk.api.client.Client.request', this is the 2nd time calling it. [airflow.sdk.api.client]
[warning] Starting call to 'airflow.sdk.api.client.Client.request', this is the 3rd time calling it. [airflow.sdk.api.client]
[warning] Starting call to 'airflow.sdk.api.client.Client.request', this is the 4th time calling it. [airflow.sdk.api.client]
[error] Task execute_workload[a7469ad-3481-4fd4-b8f236b37cf1] raise unexpected: ReadTimeout('timed out') [celery.app.trace]

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
