Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.10.5
What happened?
We are running Airflow on a GKE cluster with the KubernetesExecutor. Our deployment has 1200+ DAGs, with 100+ tasks running concurrently at peak. For the past month, we have been observing 20–30 tasks failing daily due to SIGTERM, even though the cluster is under no resource pressure (CPU/memory/network usage stays well under 50%).
We initially suspected an issue with the Airflow version (2.9.2) and upgraded to Airflow 2.10.5; however, the issue persists.
dag_id=PCDM_Editorial_Gold_Top_Content_Append_Daily_run_id=scheduled__2025-07-06T07_15_00+00_00_task_id=GcsToSnowflakegold_top_content_editorial_attempt=1.log
What you think should happen instead?
Tasks should continue running unless they are explicitly cancelled, time out, or hit a failure condition. Airflow should not spontaneously SIGTERM a task without user action or resource pressure.
How to reproduce
Airflow config (Helm values):

```yaml
executor: "KubernetesExecutor"
allowPodLaunching: true
env:
  - name: "AIRFLOW_GPL_UNIDECODE"
    value: "yes"
  - name: "AIRFLOW__WEBSERVER__BASE_URL"
    value: "*****************************************"
  - name: "AIRFLOW__CORE__LOAD_EXAMPLES"
    value: "False"
  - name: "AIRFLOW__CORE__LAZY_LOAD_PLUGINS"
    value: "False"
  - name: "AIRFLOW__CORE__PLUGINS_FOLDER"
    value: "/opt/airflow/plugins"
  - name: "AIRFLOW__WEBSERVER__RELOAD_ON_PLUGIN_CHANGE"
    value: "True"
  - name: "AIRFLOW__CORE__AIRFLOW_HOME"
    value: "/opt/airflow"
  - name: "AIRFLOW__CORE__DAGS_FOLDER"
    value: "/opt/airflow/dags/repo/airflow-dags"
  - name: "AIRFLOW__CORE__PARALLELISM"
    value: "150"
  - name: "AIRFLOW__CORE__DEFAULT_POOL_TASK_SLOT_COUNT"
    value: "200"
  - name: "AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG"
    value: "60"
  - name: "AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG"
    value: "20"
  - name: "AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME"
    value: "64000"
  - name: "AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT"
    value: "500"
  - name: "AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT"
    value: "600"
  - name: "AIRFLOW__CORE__ENABLE_XCOM_PICKLING"
    value: "True"
  - name: "AIRFLOW__CORE__TEST_CONNECTION"
    value: "Enabled"
  - name: "AIRFLOW__CORE__MAX_TEMPLATED_FIELD_LENGTH"
    value: "1000000"
  - name: "AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_RECYCLE"
    value: "900"
  - name: "AIRFLOW__DATABASE__SQL_ALCHEMY_MAX_OVERFLOW"
    value: "200"
  - name: "AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_SIZE"
    value: "250"
  - name: "AIRFLOW__API__ENABLE_EXPERIMENTAL_API"
    value: "True"
  - name: "AIRFLOW__API__AUTH_BACKEND"
    value: "airflow.api.auth.backend.default"
  - name: "AIRFLOW__WEBSERVER__EXPOSE_CONFIG"
    value: "True"
  - name: "AIRFLOW__WEBSERVER__SHOW_TRIGGER_FORM_IF_NO_PARAMS"
    value: "True"
  - name: "AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT"
    value: "google-cloud-platform://?extra__google_cloud_platform__scope=https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/drive,https://www.googleapis.com/auth/bigquery"
  - name: "AIRFLOW__LOGGING__LOGGING_LEVEL"
    value: "DEBUG"
  - name: "AIRFLOW__LOGGING__REMOTE_LOGGING"
    value: "True"
  - name: "AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER"
    value: "gs://airflow-cluster/airflow/logs"
  - name: "AIRFLOW__KUBERNETES__DAGS_IN_IMAGE"
    value: "False"
  - name: "AIRFLOW__KUBERNETES__NAMESPACE"
    value: "airflow-data-eng-prod"
  - name: "AIRFLOW__KUBERNETES__DAGS_VOLUME_HOST"
    value: "/opt/airflow/dags/repo/airflow-dags"
  - name: "AIRFLOW__KUBERNETES__DAGS_VOLUME_CLAIM"
    value: "room-prod-pvc-pvc"
  - name: "AIRFLOW__KUBERNETES__DAGS_VOLUME_MOUNT_POINT"
    value: "/opt/airflow/dags/repo/airflow-dags"
  - name: "AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME"
    value: "airflow-worker"
  - name: "AIRFLOW__KUBERNETES_EXECUTOR__DELETE_WORKER_PODS"
    value: "True"
  - name: "AIRFLOW__KUBERNETES_EXECUTOR__DELETE_WORKER_PODS_ON_FAILURE"
    value: "True"
  - name: "AIRFLOW__KUBERNETES_EXECUTOR__WORKER_PODS_CREATION_BATCH_SIZE"
    value: "50"
  - name: "AIRFLOW__KUBERNETES_EXECUTOR__WORKER_PODS_QUEUED_CHECK_INTERVAL"
    value: "30"
  - name: "AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION"
    value: "False"
  - name: "AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD"
    value: "300"
  - name: "AIRFLOW__SCHEDULER__FILE_PARSING_SORT_MODE"
    value: "modified_time"
  - name: "AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL"
    value: "700"
  - name: "AIRFLOW__SCHEDULER__PARSING_PROCESSES"
    value: "5"
  - name: "AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC"
    value: "60"
  - name: "AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC"
    value: "120"
  - name: "AIRFLOW__SCHEDULER__ZOMBIE_DETECTION_INTERVAL"
    value: "100"
  - name: "AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL"
    value: "100"
  - name: "AIRFLOW__WEBSERVER__AUTHENTICATE"
    value: "True"
  - name: "AIRFLOW__WEBSERVER__AUTH_BACKEND"
    value: "airflow.contrib.auth.backends.github_enterprise_auth"
  - name: "AIRFLOW__GITHUB_ENTERPRISE__HOST"
    value: "github.com"
  - name: "AIRFLOW__GITHUB_ENTERPRISE__OAUTH_CALLBACK_ROUTE"
    value: "/oauth-authorized/github"
```
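One interaction in this config that may be worth checking: `AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC` is raised to 60, while `[scheduler] scheduler_zombie_task_threshold` is not overridden, so we assume it still sits at the Airflow 2.x default of 300 seconds (`ZOMBIE_DETECTION_INTERVAL=100` only controls how often the check runs, not the threshold). Under database contention, only a few delayed heartbeats would be enough for the scheduler to declare a running task a zombie and send it a SIGTERM. A back-of-the-envelope sketch:

```python
# Back-of-the-envelope check (heartbeat value from the config above; the
# zombie threshold is the assumed Airflow 2.x default, not set explicitly).
job_heartbeat_sec = 60            # AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC
zombie_task_threshold = 300       # [scheduler] scheduler_zombie_task_threshold (assumed default)

# Consecutive missed/late heartbeats before a running task is treated
# as a zombie and SIGTERMed by the scheduler.
missed_heartbeats_to_zombie = zombie_task_threshold // job_heartbeat_sec
print(missed_heartbeats_to_zombie)  # -> 5
```

If heartbeat writes stall (e.g. due to metadata DB pool exhaustion), this margin of roughly five heartbeats could plausibly be consumed without any resource pressure on the cluster itself.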
Operating System
linux
Versions of Apache Airflow Providers
2.10.5
Deployment
Official Apache Airflow Helm Chart
Deployment details
Airflow Version:
Originally 2.9.2, upgraded to 2.10.5 (issue persists in both versions)
Executor:
KubernetesExecutor
Orchestration Platform:
Google Kubernetes Engine (GKE) – Standard mode (not Autopilot)
Cluster Size & Scaling:
~20 nodes (n2-highmem-8)
Autoscaling enabled, but cluster consistently runs at <50% CPU/memory usage
Nodes use standard node pools, not preemptible or spot instances
Airflow Deployment Mode:
- Deployed via Helm with custom image (Python 3.10 base)
- Webserver and Scheduler run as separate pods
Task Workloads:
~1200 DAGs, mostly scheduled daily/hourly
~100–150 concurrent tasks at peak times
Heavy usage of:
- DatabricksSubmitRunOperator
- SnowflakeOperator
- ExternalTaskSensor
- ShortCircuitOperator
Anything else?
We have thoroughly verified that this is not caused by:
- Resource exhaustion (CPU, memory, or disk)
- Pod evictions or node preemptions
- Task timeouts or DAG-level retries
- Manual task terminations
- Additionally, we implemented DAG-level timeouts and task execution timeouts, and created separate pools for Databricks and Snowflake tasks
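For reference, the pool/timeout mitigation looks roughly like this in our task definitions (the pool name and durations below are illustrative, not our exact values):

```python
from datetime import timedelta

# Illustrative keyword arguments applied to Databricks and Snowflake tasks
# (pool names and durations are examples, not our exact production values).
databricks_task_args = {
    "pool": "databricks_pool",                # dedicated pool, created via UI/CLI
    "execution_timeout": timedelta(hours=2),  # hard per-try limit
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}
```

These are passed as keyword arguments to the individual operators (or via `default_args` on the DAG). The SIGTERMs we see occur well inside these limits.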
We suspect the issue may stem from one of the following:
- Internal Airflow lifecycle logic (e.g., zombie detection, pod orphan cleanup)
- KubernetesExecutor edge cases where the scheduler or triggerer may initiate SIGTERM without resource-based justification
- GCSFuse or sidecar instability impacting the task pod lifecycle (though the main containers are healthy)

We are open to instrumenting Airflow internals with logging or tracing if the core team can suggest areas to probe (e.g., executor heartbeat, cleanup routines, orphan detection).
We would also be happy to help test a patch or proposed fix in our production-like test cluster if needed.
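As a starting point on our side, this is a minimal sketch of the log triage we can run while instrumenting (the marker strings are the ones we see in our task/scheduler logs and may differ across Airflow versions):

```python
from pathlib import Path

# Marker strings we grep for when correlating kill events across task and
# scheduler logs; adjust these to the Airflow version in use.
MARKERS = ("Received SIGTERM", "Detected zombie job")

def find_kill_events(log_dir):
    """Return (file, line_no, line) tuples for every marker hit under log_dir."""
    hits = []
    for path in sorted(Path(log_dir).rglob("*.log")):
        for line_no, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if any(marker in line for marker in MARKERS):
                hits.append((str(path), line_no, line.strip()))
    return hits
```

Running this over logs synced down from the `gs://` remote log folder lets us line up SIGTERM timestamps against scheduler zombie-detection activity.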
Are you willing to submit PR?
Code of Conduct