
Instability at large scales - workers get ReadTimeout when calling api-server #56571

@ron-gaist

Description


Apache Airflow version

3.1.0

If "Other Airflow 2/3 version" selected, which one?

No response

What happened?

Our large-scale setup includes:

  • ~1000 Celery executor workers
  • 15 API servers with 64 worker processes each (resources are sufficient; we checked utilization)

Also, possibly relevant:

  • 6 scheduler replicas
  • 2 DAG processors
  • a PgBouncer with a sufficiently large Airflow connection pool (it never reaches its maximum)
  • DAGs with up to 8k tasks running in parallel, plus a final task that depends on all of them;
    DAGs are usually smaller than that, averaging ~5k tasks

When all workers are active and working on task instances, they all get the following warning 4 times:

[warning] Starting call to 'airflow.sdk.api.client.Client.request', this is the %d time calling it. [airflow.sdk.api.client]

and, on the 5th attempt, they get this error:

[error] Task execute_workload[$celery_task_uuid] raise unexpected: ReadTimeout('timed out') [celery.app.trace]
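
For context, the repeated warning matches the message emitted by a tenacity-style retry wrapper around the HTTP call. A minimal sketch of that pattern, assuming the SDK client retries roughly like this (the actual airflow.sdk client internals may differ):

import logging

import httpx
from tenacity import before_log, retry, stop_after_attempt, wait_exponential

log = logging.getLogger("airflow.sdk.api.client")

@retry(
    stop=stop_after_attempt(5),               # give up after the 5th attempt
    wait=wait_exponential(),
    before=before_log(log, logging.WARNING),  # logs "Starting call to ..., this is the Nth time calling it."
    reraise=True,                             # the final failure surfaces as httpx.ReadTimeout
)
def request(url: str) -> httpx.Response:
    # each attempt is bounded by httpx's default 5-second timeout
    return httpx.get(url)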

We investigated this error a little and found that it comes from the httpx default timeout.
From the httpx docs (https://www.python-httpx.org/advanced/timeouts/):

HTTPX is careful to enforce timeouts everywhere by default.
The default behavior is to raise a TimeoutException after 5 seconds of network inactivity.
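
For illustration, a minimal sketch of that default and of setting a more generous timeout on an httpx client (generic client construction, not the SDK's actual wiring):

import httpx

# httpx.Client() defaults to a 5-second timeout on connect/read/write/pool,
# i.e. the same as passing timeout=httpx.Timeout(5.0).
default_client = httpx.Client()

# Under heavy api-server load, responses can take longer than 5 seconds to
# arrive; an explicit, longer read timeout avoids the ReadTimeout above.
patient_client = httpx.Client(timeout=httpx.Timeout(30.0, connect=5.0))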

What you think should happen instead?

Airflow should let users configure this timeout via airflow.cfg to accommodate high-load systems.
For example:

[api]
HTTPX_TIMEOUT = # 5 by default
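
A minimal sketch of how such an option could be consumed, assuming the hypothetical httpx_timeout key under [api] proposed above (conf.getfloat is Airflow's standard typed config accessor):

import httpx
from airflow.configuration import conf

# Hypothetical: the [api] httpx_timeout option does not exist in Airflow
# today; this only sketches how it could be wired into the SDK's client.
timeout_s = conf.getfloat("api", "httpx_timeout", fallback=5.0)
client = httpx.Client(timeout=httpx.Timeout(timeout_s))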

It would also help to add a section to the docs detailing best practices for keeping the api-server reliable under very high load.

How to reproduce

(1) Run Airflow in a Kubernetes cluster with:
    ~1k Celery workers
    ~15 api-server replicas (64 worker processes; resource limits: 25Gi RAM, 8 CPU cores)

(2) Use DAGs large enough that all 1k workers run tasks in parallel, each task taking more than 5 minutes (a minimal DAG sketch is shown after these steps).

(3) Observe the workers for ReadTimeout errors.
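
A minimal sketch of a DAG with this shape, using the Airflow 3 Task SDK decorators (task count and duration are illustrative; scale N up toward 8k to match the report):

import time

from airflow.sdk import dag, task

N = 100  # scale up toward 8000 to reproduce the reported load

@dag(schedule=None)
def readtimeout_repro():
    @task
    def hold_worker(i: int) -> int:
        time.sleep(6 * 60)  # keep a worker slot busy for more than 5 minutes
        return i

    @task
    def final(results: list[int]) -> None:
        print(f"{len(results)} upstream tasks completed")

    final(hold_worker.expand(i=list(range(N))))

readtimeout_repro()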

Operating System

Debian GNU/Linux 12 (bookworm)

Versions of Apache Airflow Providers

apache-airflow-providers-celery==3.12.2
apache-airflow-providers-common-compat==1.7.3
apache-airflow-providers-common-io==1.6.2
apache-airflow-providers-common-sql==1.27.5
apache-airflow-providers-standard==1.6.0
apache-airflow-providers-postgres==6.2.3

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else?

The problem occurs every time all workers are executing task instances (i.e., at peak load).
Logs:

[warning] Starting call to 'airflow.sdk.api.client.Client.request', this is the 1st time calling it. [airflow.sdk.api.client]
[warning] Starting call to 'airflow.sdk.api.client.Client.request', this is the 2nd time calling it. [airflow.sdk.api.client]
[warning] Starting call to 'airflow.sdk.api.client.Client.request', this is the 3rd time calling it. [airflow.sdk.api.client]
[warning] Starting call to 'airflow.sdk.api.client.Client.request', this is the 4th time calling it. [airflow.sdk.api.client]
[error] Task execute_workload[a7469ad-3481-4fd4-b8f236b37cf1] raise unexpected: ReadTimeout('timed out') [celery.app.trace]

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
