Labels
area:core, kind:bug, needs-triage
Description
Apache Airflow version
Other Airflow 2/3 version (please specify below)
If "Other Airflow 2/3 version" selected, which one?
2.7.1
What happened?
When I clear a task to re-run it, Airflow puts it into up_for_retry with a random id '611b2....'. I am using the Airflow Docker image. I had thought it was a Postgres connection overload issue, so I put PgBouncer in front of Postgres to manage the connections; Airflow then ran fine for 3-4 days and started giving the same retry error again.
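To sanity-check the connection-overload theory I count what Postgres is actually holding while the scheduler and workers are busy. This is only a minimal diagnostic sketch: it assumes the metadata DB is reachable from the host through the PgBouncer port published below (7432) with the airflow/airflow credentials, and it only reads the standard pg_stat_activity view.

import psycopg2  # same driver as the SQLAlchemY connection string below

# Assumption: PgBouncer is reachable on localhost:7432 (the "7432:6432"
# mapping in the compose file below); adjust host/port if you query
# Postgres directly instead.
conn = psycopg2.connect(
    host="localhost", port=7432, dbname="airflow",
    user="airflow", password="airflow",
)
with conn, conn.cursor() as cur:
    # Group the server-side sessions by state (active, idle, idle in transaction, ...)
    cur.execute(
        """
        SELECT state, count(*)
        FROM pg_stat_activity
        WHERE datname = 'airflow'
        GROUP BY state
        ORDER BY count(*) DESC
        """
    )
    for state, count in cur.fetchall():
        print(f"{state or 'unknown'}: {count}")
conn.close()

If the totals here ever approached the max_connections=500 configured on Postgres below, the overload theory would at least be plausible.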
# Feel free to modify this file to suit your needs.
x-airflow-common:
  &airflow-common
  image: airflow310v1_fastexcel:latest
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@pgbouncer/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@pgbouncer/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__API__AUTH_BACKENDS: 'airflow.api.auth.backend.basic_auth,airflow.api.auth.backend.session'
    AIRFLOW__SCHEDULER__ENABLE_HEALTH_CHECK: 'true'
    AIRFLOW__WEBSERVER__SHOW_TRIGGER_FORM_IF_NO_PARAMS: 'true'
    AIRFLOW__API__MAXIMUM_PAGE_LIMIT: 10000
    AIRFLOW__API__FALLBACK_PAGE_LIMIT: 10000
    # AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT: 60.0
    # --- Celery stability & throughput ---
    AIRFLOW__CELERY__WORKER_CONCURRENCY: 4  # "32" # threads per worker container (tune below with --scale)
    AIRFLOW__CELERY__TASK_SOFT_TIME_LIMIT: 600
    AIRFLOW__CELERY__TASK_TIME_LIMIT: 1200
    AIRFLOW__CELERY__WORKER_AUTOSCALE: '16,4'  # max,min workers
    AIRFLOW__CELERY__WORKER_PREFETCH_MULTIPLIER: 1  # prevents task hogging/starvation
    AIRFLOW__CELERY__WORKER_MAX_TASKS_PER_CHILD: 20  # 100
    AIRFLOW__CELERY__WORKER_DISABLE_RATE_LIMITS: 'true'
    AIRFLOW__CELERY__BROKER_TRANSPORT_OPTIONS: >-
      {"visibility_timeout": 43200, "socket_timeout": 60, "retry_on_timeout": true}
    # --- Core parallelism (you have 200 cores / 2TB RAM) ---
    AIRFLOW__CORE__TASK_RUNNER: StandardTaskRunner
    AIRFLOW__CORE__DEFAULT_TASK_RETRIES: 1
    AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG: 16
    AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG: 4
    AIRFLOW__CORE__PARALLELISM: 32
    AIRFLOW__CORE__DAG_CONCURRENCY: 16
    # --- Scheduler robustness ---
    AIRFLOW__SCHEDULER__CATCHUP_BY_DEFAULT: 'false'
    AIRFLOW__SCHEDULER__MAX_TIS_PER_QUERY: 32
    AIRFLOW__SCHEDULER__PROCESSOR_POLL_INTERVAL: 5
    AIRFLOW__SCHEDULER__PARSING_PROCESSES: 2
    AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC: 10
    # AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION: 'true'
    AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD: 600
    AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_SIZE: 15
    AIRFLOW__DATABASE__SQL_ALCHEMY_MAX_OVERFLOW: 20  # 15
    AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_RECYCLE: 1800  # 600
    AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_PRE_PING: 'true'
    AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_TIMEOUT: 30
    AIRFLOW__DATABASE__SQL_ALCHEMY_ECHO: 'false'
    # Logging
    AIRFLOW__LOGGING__LOGGING_LEVEL: "INFO"
    AIRFLOW__CORE__ENABLE_XCOM_PICKLING: 'true'  # "false" # Triggerer keep default, it's cheap and avoids deferral stalls
  user: "${AIRFLOW_UID:-50000}:0"
  shm_size: '1000g'
  tmpfs:
    - /dev/shm:size=1000g
  depends_on:
    &airflow-common-depends-on
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy
  deploy:
    &airflow-common-resources
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 4
            capabilities: [gpu]

services:
  postgres:
    image: postgres:13
    container_name: postgres
    command:
      - postgres
      - -c
      - max_connections=500
      - -c
      - shared_buffers=1GB
      - -c
      - effective_cache_size=2GB
      - -c
      - random_page_cost=1.1  # for SSD storage
      - -c
      - idle_in_transaction_session_timeout=300000
      - -c
      - statement_timeout=600000
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - ../databases/postgres/data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 10s
      retries: 5
      start_period: 5s
    restart: always
    networks:
      my_network:
        ipv4_address: 172.29.1.2

  pgbouncer:
    image: edoburu/pgbouncer:latest
    container_name: pgbouncer
    environment:
      - DATABASE_URL=postgresql://airflow:airflow@postgres:5432/airflow
      - POOL_MODE=transaction
      - MAX_CLIENT_CONN=1000
      - DEFAULT_POOL_SIZE=25
      - RESERVE_POOL_SIZE=5
      - MAX_DB_CONNECTIONS=200
      - MIN_POOL_SIZE=10
    ports:
      - "7432:6432"
    depends_on:
      - postgres
    restart: always
    networks:
      my_network:
        ipv4_address: 172.29.2.14

  redis:
    image: redis:latest
    container_name: redis
    command: redis-server --maxmemory 4gb --maxmemory-policy allkeys-lru
    expose:
      - 6379
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 30s
      retries: 50
      start_period: 30s
    restart: always
    networks:
      my_network:
        ipv4_address: 172.29.1.3

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    container_name: airflow-webserver
    ports:
      - "8099:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully
    deploy:
      <<: *airflow-common-resources
    networks:
      my_network:
        ipv4_address: 172.29.1.7
  # etc. (remaining services omitted here)
This is my complete docker-compose configuration.
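Because the tasks flip to up_for_retry right after I clear them, I also look at what was recorded for those task instances in the metadata DB. Again only a sketch: my_dag and my_task are placeholders for the real ids, the connection assumptions are the same as in the sketch above, and I am only guessing that the random id in the error refers to the Celery task id Airflow stores in external_executor_id.

import psycopg2

# Placeholders: replace my_dag / my_task with the dag_id and task_id of a
# task that went to up_for_retry after being cleared.
DAG_ID, TASK_ID = "my_dag", "my_task"

conn = psycopg2.connect(
    host="localhost", port=7432, dbname="airflow",
    user="airflow", password="airflow",
)
with conn, conn.cursor() as cur:
    # One row per run of the task: its state, try number, and the executor id
    cur.execute(
        """
        SELECT run_id, try_number, state, external_executor_id
        FROM task_instance
        WHERE dag_id = %s AND task_id = %s
        ORDER BY run_id
        """,
        (DAG_ID, TASK_ID),
    )
    for run_id, try_number, state, executor_id in cur.fetchall():
        print(run_id, try_number, state, executor_id)
conn.close()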
What you think should happen instead?
No response
How to reproduce
Given in the docker-compose configuration above.
Operating System
linux server 24.04.3
Versions of Apache Airflow Providers
No response
Deployment
Docker-Compose
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct