Unexpected SIGTERM on Tasks (Airflow 2.10.5 on GKE with KubernetesExecutor and No Resource Constraints) #53894

@priya369

Description

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.10.5

What happened?

We are running Airflow on a GKE cluster with the KubernetesExecutor. Our deployment has 1200+ DAGs, with 100+ tasks running concurrently at peak. For the past month, we have been observing 20–30 tasks failing daily due to SIGTERM, even though the cluster has no resource pressure (CPU/memory/network usage stays well under 50%).
We initially suspected an issue with our Airflow version (2.9.2) and upgraded to Airflow 2.10.5, but the issue still persists.

dag_id=PCDM_Editorial_Gold_Top_Content_Append_Daily_run_id=scheduled__2025-07-06T07_15_00+00_00_task_id=GcsToSnowflakegold_top_content_editorial_attempt=1.log

What you think should happen instead?

Tasks should continue running unless they are explicitly cancelled, time out, or hit a failure condition. Airflow should not send a spontaneous SIGTERM without user action or resource pressure.

How to reproduce

  1. Airflow config:

```yaml
executor: "KubernetesExecutor"
allowPodLaunching: true
env:
  - name: "AIRFLOW_GPL_UNIDECODE"
    value: "yes"
  - name: "AIRFLOW__WEBSERVER__BASE_URL"
    value: "*****************************************"
  - name: "AIRFLOW__CORE__LOAD_EXAMPLES"
    value: "False"
  - name: "AIRFLOW__CORE__LAZY_LOAD_PLUGINS"
    value: "False"
  - name: "AIRFLOW__CORE__PLUGINS_FOLDER"
    value: "/opt/airflow/plugins"
  - name: "AIRFLOW__WEBSERVER__RELOAD_ON_PLUGIN_CHANGE"
    value: "True"
  - name: "AIRFLOW__CORE__AIRFLOW_HOME"
    value: "/opt/airflow"
  - name: "AIRFLOW__CORE__DAGS_FOLDER"
    value: "/opt/airflow/dags/repo/airflow-dags"
  - name: "AIRFLOW__CORE__PARALLELISM"
    value: "150"
  - name: "AIRFLOW__CORE__DEFAULT_POOL_TASK_SLOT_COUNT"
    value: "200"
  - name: "AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG"
    value: "60"
  - name: "AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG"
    value: "20"
  - name: "AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME"
    value: "64000"
  - name: "AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT"
    value: "500"
  - name: "AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT"
    value: "600"
  - name: "AIRFLOW__CORE__ENABLE_XCOM_PICKLING"
    value: "True"
  - name: "AIRFLOW__CORE__TEST_CONNECTION"
    value: "Enabled"
  - name: "AIRFLOW__CORE__MAX_TEMPLATED_FIELD_LENGTH"
    value: "1000000"
  - name: "AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_RECYCLE"
    value: "900"
  - name: "AIRFLOW__DATABASE__SQL_ALCHEMY_MAX_OVERFLOW"
    value: "200"
  - name: "AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_SIZE"
    value: "250"
  - name: "AIRFLOW__API__ENABLE_EXPERIMENTAL_API"
    value: "True"
  - name: "AIRFLOW__API__AUTH_BACKEND"
    value: "airflow.api.auth.backend.default"
  - name: "AIRFLOW__WEBSERVER__EXPOSE_CONFIG"
    value: "True"
  - name: "AIRFLOW__WEBSERVER__SHOW_TRIGGER_FORM_IF_NO_PARAMS"
    value: "True"
  - name: "AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT"
    value: "google-cloud-platform://?extra__google_cloud_platform__scope=https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/drive,https://www.googleapis.com/auth/bigquery"
  - name: "AIRFLOW__LOGGING__LOGGING_LEVEL"
    value: "DEBUG"
  - name: "AIRFLOW__LOGGING__REMOTE_LOGGING"
    value: "True"
  - name: "AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER"
    value: "gs://airflow-cluster/airflow/logs"
  - name: "AIRFLOW__KUBERNETES__DAGS_IN_IMAGE"
    value: "False"
  - name: "AIRFLOW__KUBERNETES__NAMESPACE"
    value: "airflow-data-eng-prod"
  - name: "AIRFLOW__KUBERNETES__DAGS_VOLUME_HOST"
    value: "/opt/airflow/dags/repo/airflow-dags"
  - name: "AIRFLOW__KUBERNETES__DAGS_VOLUME_CLAIM"
    value: "room-prod-pvc-pvc"
  - name: "AIRFLOW__KUBERNETES__DAGS_VOLUME_MOUNT_POINT"
    value: "/opt/airflow/dags/repo/airflow-dags"
  - name: "AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME"
    value: "airflow-worker"
  - name: "AIRFLOW__KUBERNETES_EXECUTOR__DELETE_WORKER_PODS"
    value: "True"
  - name: "AIRFLOW__KUBERNETES_EXECUTOR__DELETE_WORKER_PODS_ON_FAILURE"
    value: "True"
  - name: "AIRFLOW__KUBERNETES_EXECUTOR__WORKER_PODS_CREATION_BATCH_SIZE"
    value: "50"
  - name: "AIRFLOW__KUBERNETES_EXECUTOR__WORKER_PODS_QUEUED_CHECK_INTERVAL"
    value: "30"
  - name: "AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION"
    value: "False"
  - name: "AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD"
    value: "300"
  - name: "AIRFLOW__SCHEDULER__FILE_PARSING_SORT_MODE"
    value: "modified_time"
  - name: "AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL"
    value: "700"
  - name: "AIRFLOW__SCHEDULER__PARSING_PROCESSES"
    value: "5"
  - name: "AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC"
    value: "60"
  - name: "AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC"
    value: "120"
  - name: "AIRFLOW__SCHEDULER__ZOMBIE_DETECTION_INTERVAL"
    value: "100"
  - name: "AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL"
    value: "100"
  - name: "AIRFLOW__WEBSERVER__AUTHENTICATE"
    value: "True"
  - name: "AIRFLOW__WEBSERVER__AUTH_BACKEND"
    value: "airflow.contrib.auth.backends.github_enterprise_auth"
  - name: "AIRFLOW__GITHUB_ENTERPRISE__HOST"
    value: "github.com"
  - name: "AIRFLOW__GITHUB_ENTERPRISE__OAUTH_CALLBACK_ROUTE"
    value: "/oauth-authorized/github"
```
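One interaction in the config above that may be worth checking: with `AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC` raised to 60, a task only needs to miss a handful of heartbeats (e.g. under transient metadata-DB contention) before it crosses the scheduler's zombie threshold (`scheduler_zombie_task_threshold`, which defaults to 300 s in Airflow 2.x and does not appear to be overridden here), at which point the scheduler fails the task and the pod receives SIGTERM. A minimal model of that check, as a simplified sketch rather than Airflow's actual implementation:

```python
from datetime import datetime, timedelta

# Values mirror the deployment above; the 300 s threshold is the Airflow 2.x
# default for scheduler_zombie_task_threshold (assumption: not overridden).
JOB_HEARTBEAT_SEC = 60
ZOMBIE_THRESHOLD_SEC = 300

def is_probable_zombie(last_heartbeat: datetime, now: datetime) -> bool:
    """Simplified zombie check: a running task whose job heartbeat is
    older than the threshold gets failed by the scheduler (SIGTERM)."""
    return now - last_heartbeat > timedelta(seconds=ZOMBIE_THRESHOLD_SEC)

now = datetime(2025, 7, 6, 12, 0, 0)
# Missing ~5 heartbeats at 60 s intervals crosses the 300 s threshold:
print(is_probable_zombie(now - timedelta(seconds=301), now))  # True
print(is_probable_zombie(now - timedelta(seconds=240), now))  # False
```

At the default 5 s heartbeat, dozens of consecutive heartbeats must be missed before a task looks like a zombie; at 60 s, only five, so brief DB or network stalls become far more likely to trigger a kill.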

Operating System

linux

Versions of Apache Airflow Providers

2.10.5

Deployment

Official Apache Airflow Helm Chart

Deployment details

Airflow Version:
Originally 2.9.2, upgraded to 2.10.5 (issue persists in both versions)

Executor:
KubernetesExecutor

Orchestration Platform:
Google Kubernetes Engine (GKE) – Standard mode (not Autopilot)

Cluster Size & Scaling:
~20 nodes (n2-highmem-8)
Autoscaling enabled, but cluster consistently runs at <50% CPU/memory usage
Nodes use standard nodepools, not preemptible or spot instances

Airflow Deployment Mode:

  • Deployed via Helm with a custom image (Python 3.10 base)
  • Webserver and Scheduler run as separate pods

Task Workloads:

  • ~1200 DAGs, mostly scheduled daily/hourly
  • ~100–150 concurrent tasks at peak times
  • Heavy usage of:
      • DatabricksSubmitRunOperator
      • SnowflakeOperator
      • ExternalTaskSensor
      • ShortCircuitOperator

Anything else?

We have thoroughly verified that this is not caused by:

  • Resource exhaustion (CPU, memory, or disk)
  • Pod evictions or node preemptions
  • Task timeouts or DAG-level retries
  • Manual task terminations

We have also set DAG-level and task execution timeouts, and created separate pools for the Databricks and Snowflake tasks.
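For reference, the eviction/preemption check can be reproduced from cluster events and pod status (a sketch, assuming access to the `airflow-data-eng-prod` namespace from the config above):

```shell
# Recent evictions in the Airflow namespace (none observed in our case)
kubectl get events -n airflow-data-eng-prod \
  --field-selector reason=Evicted --sort-by=.lastTimestamp

# OOM kills show up as "OOMKilled" in the last terminated state of worker pods
kubectl get pods -n airflow-data-eng-prod \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
```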

We suspect the issue may stem from one of the following:

  • Internal Airflow lifecycle logic (e.g., zombie detection, pod orphan cleanup)
  • KubernetesExecutor edge cases where the scheduler or triggerer may initiate SIGTERM without resource-based justification
  • GCSFuse or sidecar instability affecting the task pod lifecycle (though the main containers are healthy)

We are open to instrumenting Airflow internals with logging or tracing if the core team can suggest areas to probe (e.g., executor heartbeat, cleanup routines, orphan detection).
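As a first instrumentation step, one low-effort probe is to scan the remote task logs for the SIGTERM marker and correlate hit timestamps with scheduler-side zombie-detection log lines. A rough sketch: the "Received SIGTERM" phrasing matches Airflow 2.x task-runner logs but should be verified against this deployment, and the `dag_id=`/`task_id=` fields are assumed from the log filename format shown above:

```python
import re
from collections import Counter

# Marker the task-side signal handler writes on SIGTERM
# (string assumed from Airflow 2.x task logs; verify for your version).
SIGTERM_RE = re.compile(r"Received SIGTERM")

def count_sigterm_tasks(log_lines):
    """Count SIGTERM hits per (dag_id, task_id), parsed from lines of the
    form 'dag_id=<d> task_id=<t> ... Received SIGTERM ...'."""
    hits = Counter()
    for line in log_lines:
        if SIGTERM_RE.search(line):
            dag = re.search(r"dag_id=(\S+)", line)
            task = re.search(r"task_id=(\S+)", line)
            if dag and task:
                hits[(dag.group(1), task.group(1))] += 1
    return hits

sample = [
    "dag_id=example task_id=load Received SIGTERM. Terminating subprocesses",
    "dag_id=example task_id=load task exited cleanly",
]
print(count_sigterm_tasks(sample))  # Counter({('example', 'load'): 1})
```

Ranking the output by count would show whether the kills cluster on particular DAGs, operators, or times of day, which should help distinguish zombie detection from pod-lifecycle causes.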

We would also be happy to help test a patch or proposed fix in our production-like test cluster if needed.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct
