Airflow task goes to up for retry with a random ID '611b2....' after it runs for 5 min #56562

@unrealrt3001

Apache Airflow version

Other Airflow 2/3 version (please specify below)

If "Other Airflow 2/3 version" selected, which one?

2.7.1

What happened?

A task goes to up for retry with a random ID '611b2....' when I clear it so that it runs again. I am using the Airflow Docker image. I initially suspected that Postgres connections were being exhausted, so I put PgBouncer in front of the database to manage connections. Airflow then ran for 3-4 days before the same retry error started again.
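To make the connection-overload suspicion easier to reason about, here is a rough sketch of the connection budget implied by the pool settings in the compose file in this report. It assumes that every Airflow process with its own SQLAlchemy engine can open up to `pool_size + max_overflow` connections to PgBouncer. How many such processes actually run at once is not stated in the report, so the helper names and the derived process count are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolSettings:
    """Values copied from the compose file in this report."""

    sqlalchemy_pool_size: int = 15           # AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_SIZE
    sqlalchemy_max_overflow: int = 20        # AIRFLOW__DATABASE__SQL_ALCHEMY_MAX_OVERFLOW
    pgbouncer_max_client_conn: int = 1000    # MAX_CLIENT_CONN
    pgbouncer_max_db_connections: int = 200  # MAX_DB_CONNECTIONS
    postgres_max_connections: int = 500      # postgres -c max_connections


def max_connections_per_process(s: PoolSettings) -> int:
    """Upper bound on connections one SQLAlchemy pool can hold open."""
    return s.sqlalchemy_pool_size + s.sqlalchemy_max_overflow


def processes_before_client_limit(s: PoolSettings) -> int:
    """Hypothetical helper: how many fully saturated pools fit under MAX_CLIENT_CONN."""
    return s.pgbouncer_max_client_conn // max_connections_per_process(s)


def postgres_headroom(s: PoolSettings) -> int:
    """Server connections left in Postgres after PgBouncer takes its maximum."""
    return s.postgres_max_connections - s.pgbouncer_max_db_connections


if __name__ == "__main__":
    settings = PoolSettings()
    print("per process:", max_connections_per_process(settings))
    print("saturated processes under client limit:", processes_before_client_limit(settings))
    print("postgres headroom:", postgres_headroom(settings))
```

If the number of concurrently running Airflow processes (scheduler, webserver workers, Celery worker children, task runners) stays well below the second figure, client-side exhaustion at PgBouncer looks less likely, and the PgBouncer and Postgres logs would be the next place to check.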

# Feel free to modify this file to suit your needs.

x-airflow-common:
  &airflow-common
  image: airflow310v1_fastexcel:latest
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor

    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@pgbouncer/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@pgbouncer/airflow

    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__API__AUTH_BACKENDS: 'airflow.api.auth.backend.basic_auth,airflow.api.auth.backend.session'
    AIRFLOW__SCHEDULER__ENABLE_HEALTH_CHECK: 'true'
    AIRFLOW__WEBSERVER__SHOW_TRIGGER_FORM_IF_NO_PARAMS: 'true'
    AIRFLOW__API__MAXIMUM_PAGE_LIMIT: 10000
    AIRFLOW__API__FALLBACK_PAGE_LIMIT: 10000
    # AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT: 60.0

    # --- Celery stability & throughput ---
    AIRFLOW__CELERY__WORKER_CONCURRENCY: 4 #"32"  # threads per worker container (tune below with --scale)
    AIRFLOW__CELERY__TASK_SOFT_TIME_LIMIT: 600
    AIRFLOW__CELERY__TASK_TIME_LIMIT: 1200

    AIRFLOW__CELERY__WORKER_AUTOSCALE: '16,4'  # max,min workers
    AIRFLOW__CELERY__WORKER_PREFETCH_MULTIPLIER: 1  # prevents task hogging/starvation
    AIRFLOW__CELERY__WORKER_MAX_TASKS_PER_CHILD: 20 #100

    AIRFLOW__CELERY__WORKER_DISABLE_RATE_LIMITS: 'true'
    AIRFLOW__CELERY__BROKER_TRANSPORT_OPTIONS: >-
      {"visibility_timeout": 43200, "socket_timeout": 60, "retry_on_timeout": true}

    # --- Core parallelism (you have 200 cores / 2TB RAM) ---
    AIRFLOW__CORE__TASK_RUNNER: StandardTaskRunner
    AIRFLOW__CORE__DEFAULT_TASK_RETRIES: 1

    AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG: 16
    AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG: 4
    AIRFLOW__CORE__PARALLELISM: 32
    AIRFLOW__CORE__DAG_CONCURRENCY: 16

    # --- Scheduler robustness ---
    AIRFLOW__SCHEDULER__CATCHUP_BY_DEFAULT: 'false'
    AIRFLOW__SCHEDULER__MAX_TIS_PER_QUERY: 32
    AIRFLOW__SCHEDULER__PROCESSOR_POLL_INTERVAL: 5

    AIRFLOW__SCHEDULER__PARSING_PROCESSES: 2

    AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC: 10
    # AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION: 'true'
    AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD: 600

    AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_SIZE: 15
    AIRFLOW__DATABASE__SQL_ALCHEMY_MAX_OVERFLOW: 20 #15
    AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_RECYCLE: 1800 #600
    AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_PRE_PING: 'true'
    AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_TIMEOUT: 30
    AIRFLOW__DATABASE__SQL_ALCHEMY_ECHO: 'false'

    # Logging
    AIRFLOW__LOGGING__LOGGING_LEVEL: "INFO"
    AIRFLOW__CORE__ENABLE_XCOM_PICKLING: 'true' #"false" # Triggerer keep default, it’s cheap and avoids deferral stalls

  user: "${AIRFLOW_UID:-50000}:0"
  shm_size: '1000g'
  tmpfs:
    - /dev/shm:size=1000g
  depends_on:
    &airflow-common-depends-on
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy
  deploy:
    &airflow-common-resources
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 4
            capabilities: [gpu]

services:

  postgres:
    image: postgres:13
    container_name: postgres
    command:
      - postgres
      - -c
      - max_connections=500
      - -c
      - shared_buffers=1GB
      - -c
      - effective_cache_size=2GB
      - -c
      - random_page_cost=1.1             # For SSD storage
      - -c
      - idle_in_transaction_session_timeout=300000
      - -c
      - statement_timeout=600000
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - ../databases/postgres/data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 10s
      retries: 5
      start_period: 5s
    restart: always
    networks:
      my_network:
        ipv4_address: 172.29.1.2

  pgbouncer:
    image: edoburu/pgbouncer:latest
    container_name: pgbouncer
    environment:
      - DATABASE_URL=postgresql://airflow:airflow@postgres:5432/airflow
      - POOL_MODE=transaction
      - MAX_CLIENT_CONN=1000
      - DEFAULT_POOL_SIZE=25
      - RESERVE_POOL_SIZE=5
      - MAX_DB_CONNECTIONS=200
      - MIN_POOL_SIZE=10
    ports:
      - "7432:6432"
    depends_on:
      - postgres
    restart: always
    networks:
      my_network:
        ipv4_address: 172.29.2.14

  redis:
    image: redis:latest
    container_name: redis
    command: redis-server --maxmemory 4gb --maxmemory-policy allkeys-lru
    expose:
      - 6379
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 30s
      retries: 50
      start_period: 30s
    restart: always
    networks:
      my_network:
        ipv4_address: 172.29.1.3

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    container_name: airflow-webserver
    ports:
      - "8099:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully
    deploy:
      <<: *airflow-common-resources
    networks:
      my_network:
        ipv4_address: 172.29.1.7

The remaining services follow the same pattern. Here is my complete Docker Compose file:

docker-compose.yaml
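Since the title says the task fails after about 5 minutes, it may help to line up every timeout in the compose file against that figure. The sketch below does only that arithmetic. The values are copied from the configuration in this report. The Postgres settings are given without units and are therefore milliseconds, and the other settings are assumed to be in seconds. The helper name and the 30-second tolerance are illustrative choices. A match is only a lead worth checking in the logs, not a confirmed cause.

```python
# Observed behaviour from the issue title: retry after roughly 5 minutes.
OBSERVED_FAILURE_SECONDS = 5 * 60

# Timeouts from the compose file in this report, normalised to seconds.
TIMEOUTS_SECONDS = {
    # Postgres values without a unit suffix are milliseconds.
    "postgres idle_in_transaction_session_timeout": 300000 // 1000,
    "postgres statement_timeout": 600000 // 1000,
    # The remaining values are assumed to be expressed in seconds.
    "AIRFLOW__CELERY__TASK_SOFT_TIME_LIMIT": 600,
    "AIRFLOW__CELERY__TASK_TIME_LIMIT": 1200,
    "AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD": 600,
    "AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_RECYCLE": 1800,
    "celery broker visibility_timeout": 43200,
    "celery broker socket_timeout": 60,
}


def timeouts_near(observed: int, timeouts: dict, tolerance: int = 30) -> list:
    """Return the names of timeouts within `tolerance` seconds of `observed`."""
    return sorted(
        name for name, seconds in timeouts.items()
        if abs(seconds - observed) <= tolerance
    )


if __name__ == "__main__":
    for name in timeouts_near(OBSERVED_FAILURE_SECONDS, TIMEOUTS_SECONDS):
        print(f"{name} = {TIMEOUTS_SECONDS[name]}s")
```

With these values the only setting close to 5 minutes is the Postgres `idle_in_transaction_session_timeout`, so the Postgres and PgBouncer logs around the time of a retry would be a reasonable first place to look for terminated sessions.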

What you think should happen instead?

No response

How to reproduce

Deploy Airflow with the Docker Compose configuration given above, then clear a task so that it runs again.

Operating System

Linux server 24.04.3

Versions of Apache Airflow Providers

No response

Deployment

Docker-Compose

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Labels: area:core, kind:bug, needs-triage