Replies: 3 comments
- Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue please do so, no need to wait for approval.
- I think your best bet is to attempt to upgrade to Airflow 3. This issue might take forever to diagnose because there is absolutely no clue as to why it happens. The whole mechanism of heartbeating, supervising running tasks, etc. that could be causing it has been completely rewritten there, and I think the fastest way to get this problem solved is to simply switch to Airflow 3.
- We also use
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.10.5
What happened?
We are running Airflow on a GKE cluster with the KubernetesExecutor. Our deployment has 1200+ DAGs, with 100+ tasks running concurrently at peak. For the past month, we have been observing 20–30 tasks failing daily due to SIGTERM, even though the cluster has no resource pressure (CPU/memory/network usage stays well under 50%).
We initially suspected it was an issue with the Airflow version (2.9.2), so we upgraded to Airflow 2.10.5 — however, the issue still persists.
Example log file from an affected task run:
dag_id=PCDM_Editorial_Gold_Top_Content_Append_Daily_run_id=scheduled__2025-07-06T07_15_00+00_00_task_id=GcsToSnowflakegold_top_content_editorial_attempt=1.log
What you think should happen instead?
Tasks should continue running unless they are explicitly cancelled, time out, or encounter a failure condition. A spontaneous SIGTERM from the Airflow system (without user action or resource pressure) should not happen.
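One hedged first step when triaging unexplained SIGTERMs is to check whether the local task job's heartbeat is the trigger. The keys below are standard Airflow 2 settings; the values shown are illustrative starting points, not recommendations taken from this report:

```yaml
# Illustrative values; keys are standard Airflow 2 configuration options.
env:
  - name: AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC    # how often task jobs heartbeat
    value: "5"
  - name: AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME  # grace period after SIGTERM before SIGKILL
    value: "60"
```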
How to reproduce
```yaml
executor: "KubernetesExecutor"
allowPodLaunching: true
env:
value: "yes"
value: "*****************************************"
value: "False"
value: "False"
value: "/opt/airflow/plugins"
value: "True"
value: "/opt/airflow"
value: "/opt/airflow/dags/repo/airflow-dags"
value: "150"
value: "200"
value: "60"
value: "20"
value: "64000"
value: "500"
value: "600"
value: "True"
value: "Enabled"
value: "1000000"
value: "900"
value: "200"
value: "250"
value: "True"
value: "airflow.api.auth.backend.default"
value: "True"
value: "True"
value: "google-cloud-platform://?extra__google_cloud_platform__scope=https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/drive,https://www.googleapis.com/auth/bigquery"
value: "DEBUG"
value: "True"
value: "gs://airflow-cluster/airflow/logs"
value: "False"
value: "airflow-data-eng-prod"
value: "/opt/airflow/dags/repo/airflow-dags"
value: "room-prod-pvc-pvc"
value: "/opt/airflow/dags/repo/airflow-dags"
value: "airflow-worker"
value: "True"
value: "True"
value: "50"
value: "30"
value: "False"
value: "300"
value: "modified_time"
value: "700"
value: "5"
value: "60"
value: "120"
value: "100"
value: "100"
value: "True"
value: "airflow.contrib.auth.backends.github_enterprise_auth"
value: "github.com"
value: "/oauth-authorized/github"
```
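The env entries above lost their `name:` keys during extraction, so only the values remain. In the official Helm chart's values.yaml, each entry is a name/value pair; the variable name below is illustrative only, not recovered from the report:

```yaml
env:
  - name: AIRFLOW__LOGGING__LOGGING_LEVEL  # illustrative name; the original names were lost
    value: "DEBUG"
```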
Operating System
linux
Versions of Apache Airflow Providers
2.10.5
Deployment
Official Apache Airflow Helm Chart
Deployment details
Airflow Version:
Originally 2.9.2, upgraded to 2.10.5 (issue persists in both versions)
Executor:
KubernetesExecutor
Orchestration Platform:
Google Kubernetes Engine (GKE) – Standard mode (not Autopilot)
Cluster Size & Scaling:
~20 nodes (n2-highmem-8)
Autoscaling enabled, but cluster consistently runs at <50% CPU/memory usage
Nodes use standard nodepools, not preemptible or spot instances
Airflow Deployment Mode:
Task Workloads:
~1200 DAGs, mostly scheduled daily/hourly
~100–150 concurrent tasks at peak times
Heavy usage of:
- ExternalTaskSensor
Anything else?
We have thoroughly verified that this is not caused by:
We suspect the issue may stem from one of the following:
- Internal Airflow lifecycle logic (e.g., zombie detection, pod orphan cleanup)
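If zombie detection is the cause, raising its thresholds is one way to test the hypothesis. These are real Airflow 2 scheduler settings; the values shown are illustrative, not tuned recommendations:

```yaml
env:
  - name: AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD  # seconds without a heartbeat before a task is declared a zombie (default 300)
    value: "600"
  - name: AIRFLOW__SCHEDULER__ZOMBIE_DETECTION_INTERVAL        # how often the scheduler scans for zombies (default 10.0)
    value: "30.0"
```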
We would also be happy to help test a patch or proposed fix in our production-like test cluster if needed.
Are you willing to submit PR?
Code of Conduct