Unexpected SIGTERM on Tasks (Airflow 2.10.5 on GKE with KubernetesExecutor and No Resource Constraints) #53894

@priya369

Description

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.10.5

What happened?

We are running Airflow on a GKE cluster with the KubernetesExecutor. Our deployment has 1200+ DAGs, with 100+ tasks running concurrently at peak. For the past month, we have been observing 20–30 tasks failing daily due to SIGTERM, even though the cluster has no resource pressure (CPU/memory/network usage stays well under 50%).
We initially suspected an issue with our Airflow version (2.9.2) and upgraded to Airflow 2.10.5, but the issue still persists.

dag_id=PCDM_Editorial_Gold_Top_Content_Append_Daily_run_id=scheduled__2025-07-06T07_15_00+00_00_task_id=GcsToSnowflakegold_top_content_editorial_attempt=1.log

What you think should happen instead?

Tasks should continue running unless they are explicitly cancelled, time out, or hit a failure condition. Airflow should not send a spontaneous SIGTERM without user action or resource pressure.

How to reproduce

  1. Airflow config:

```yaml
executor: "KubernetesExecutor"
allowPodLaunching: true
env:
  - name: "AIRFLOW_GPL_UNIDECODE"
    value: "yes"
  - name: "AIRFLOW__WEBSERVER__BASE_URL"
    value: "*****************************************"
  - name: "AIRFLOW__CORE__LOAD_EXAMPLES"
    value: "False"
  - name: "AIRFLOW__CORE__LAZY_LOAD_PLUGINS"
    value: "False"
  - name: "AIRFLOW__CORE__PLUGINS_FOLDER"
    value: "/opt/airflow/plugins"
  - name: "AIRFLOW__WEBSERVER__RELOAD_ON_PLUGIN_CHANGE"
    value: "True"
  - name: "AIRFLOW__CORE__AIRFLOW_HOME"
    value: "/opt/airflow"
  - name: "AIRFLOW__CORE__DAGS_FOLDER"
    value: "/opt/airflow/dags/repo/airflow-dags"
  - name: "AIRFLOW__CORE__PARALLELISM"
    value: "150"
  - name: "AIRFLOW__CORE__DEFAULT_POOL_TASK_SLOT_COUNT"
    value: "200"
  - name: "AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG"
    value: "60"
  - name: "AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG"
    value: "20"
  - name: "AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME"
    value: "64000"
  - name: "AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT"
    value: "500"
  - name: "AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT"
    value: "600"
  - name: "AIRFLOW__CORE__ENABLE_XCOM_PICKLING"
    value: "True"
  - name: "AIRFLOW__CORE__TEST_CONNECTION"
    value: "Enabled"
  - name: "AIRFLOW__CORE__MAX_TEMPLATED_FIELD_LENGTH"
    value: "1000000"
  - name: "AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_RECYCLE"
    value: "900"
  - name: "AIRFLOW__DATABASE__SQL_ALCHEMY_MAX_OVERFLOW"
    value: "200"
  - name: "AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_SIZE"
    value: "250"
  - name: "AIRFLOW__API__ENABLE_EXPERIMENTAL_API"
    value: "True"
  - name: "AIRFLOW__API__AUTH_BACKEND"
    value: "airflow.api.auth.backend.default"
  - name: "AIRFLOW__WEBSERVER__EXPOSE_CONFIG"
    value: "True"
  - name: "AIRFLOW__WEBSERVER__SHOW_TRIGGER_FORM_IF_NO_PARAMS"
    value: "True"
  - name: "AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT"
    value: "google-cloud-platform://?extra__google_cloud_platform__scope=https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/drive,https://www.googleapis.com/auth/bigquery"
  - name: "AIRFLOW__LOGGING__LOGGING_LEVEL"
    value: "DEBUG"
  - name: "AIRFLOW__LOGGING__REMOTE_LOGGING"
    value: "True"
  - name: "AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER"
    value: "gs://airflow-cluster/airflow/logs"
  - name: "AIRFLOW__KUBERNETES__DAGS_IN_IMAGE"
    value: "False"
  - name: "AIRFLOW__KUBERNETES__NAMESPACE"
    value: "airflow-data-eng-prod"
  - name: "AIRFLOW__KUBERNETES__DAGS_VOLUME_HOST"
    value: "/opt/airflow/dags/repo/airflow-dags"
  - name: "AIRFLOW__KUBERNETES__DAGS_VOLUME_CLAIM"
    value: "room-prod-pvc-pvc"
  - name: "AIRFLOW__KUBERNETES__DAGS_VOLUME_MOUNT_POINT"
    value: "/opt/airflow/dags/repo/airflow-dags"
  - name: "AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME"
    value: "airflow-worker"
  - name: "AIRFLOW__KUBERNETES_EXECUTOR__DELETE_WORKER_PODS"
    value: "True"
  - name: "AIRFLOW__KUBERNETES_EXECUTOR__DELETE_WORKER_PODS_ON_FAILURE"
    value: "True"
  - name: "AIRFLOW__KUBERNETES_EXECUTOR__WORKER_PODS_CREATION_BATCH_SIZE"
    value: "50"
  - name: "AIRFLOW__KUBERNETES_EXECUTOR__WORKER_PODS_QUEUED_CHECK_INTERVAL"
    value: "30"
  - name: "AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION"
    value: "False"
  - name: "AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD"
    value: "300"
  - name: "AIRFLOW__SCHEDULER__FILE_PARSING_SORT_MODE"
    value: "modified_time"
  - name: "AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL"
    value: "700"
  - name: "AIRFLOW__SCHEDULER__PARSING_PROCESSES"
    value: "5"
  - name: "AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC"
    value: "60"
  - name: "AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC"
    value: "120"
  - name: "AIRFLOW__SCHEDULER__ZOMBIE_DETECTION_INTERVAL"
    value: "100"
  - name: "AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL"
    value: "100"
  - name: "AIRFLOW__WEBSERVER__AUTHENTICATE"
    value: "True"
  - name: "AIRFLOW__WEBSERVER__AUTH_BACKEND"
    value: "airflow.contrib.auth.backends.github_enterprise_auth"
  - name: "AIRFLOW__GITHUB_ENTERPRISE__HOST"
    value: "github.com"
  - name: "AIRFLOW__GITHUB_ENTERPRISE__OAUTH_CALLBACK_ROUTE"
    value: "/oauth-authorized/github"
```
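One interaction in the config above that may be worth checking: with `AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC` raised to 60, a task only needs to miss a handful of heartbeats (e.g. under transient metadata-DB contention) before it crosses the scheduler's zombie threshold (`scheduler_zombie_task_threshold`, which defaults to 300 s in Airflow 2.x and does not appear to be overridden here), at which point the scheduler fails the task and the pod receives SIGTERM. A minimal model of that check, as a simplified sketch rather than Airflow's actual implementation:

```python
from datetime import datetime, timedelta

# Values mirror the deployment above; the 300 s threshold is the Airflow 2.x
# default for scheduler_zombie_task_threshold (assumption: not overridden).
JOB_HEARTBEAT_SEC = 60
ZOMBIE_THRESHOLD_SEC = 300

def is_probable_zombie(last_heartbeat: datetime, now: datetime) -> bool:
    """Simplified zombie check: a running task whose job heartbeat is
    older than the threshold gets failed by the scheduler (SIGTERM)."""
    return now - last_heartbeat > timedelta(seconds=ZOMBIE_THRESHOLD_SEC)

now = datetime(2025, 7, 6, 12, 0, 0)
# Missing ~5 heartbeats at 60 s intervals crosses the 300 s threshold:
print(is_probable_zombie(now - timedelta(seconds=301), now))  # True
print(is_probable_zombie(now - timedelta(seconds=240), now))  # False
```

At the default 5 s heartbeat, dozens of consecutive heartbeats must be missed before a task looks like a zombie; at 60 s, only five, so brief DB or network stalls become far more likely to trigger a kill.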

Operating System

linux

Versions of Apache Airflow Providers

2.10.5

Deployment

Official Apache Airflow Helm Chart

Deployment details

Airflow Version:
Originally 2.9.2, upgraded to 2.10.5 (issue persists in both versions)

Executor:
KubernetesExecutor

Orchestration Platform:
Google Kubernetes Engine (GKE) – Standard mode (not Autopilot)

Cluster Size & Scaling:
~20 nodes (n2-highmem-8)
Autoscaling enabled, but cluster consistently runs at <50% CPU/memory usage
Nodes use standard nodepools, not preemptible or spot instances

Airflow Deployment Mode:

  • Deployed via Helm with a custom image (Python 3.10 base)
  • Webserver and Scheduler run as separate pods

Task Workloads:

  • ~1200 DAGs, mostly scheduled daily/hourly
  • ~100–150 concurrent tasks at peak times
  • Heavy usage of:
      • DatabricksSubmitRunOperator
      • SnowflakeOperator
      • ExternalTaskSensor
      • ShortCircuitOperator

Anything else?

We have thoroughly verified that this is not caused by:

  • Resource exhaustion (CPU, memory, or disk)
  • Pod evictions or node preemptions
  • Task timeouts or DAG-level retries
  • Manual task terminations

We have also set DAG-level and task execution timeouts, and created separate pools for the Databricks and Snowflake tasks.
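For reference, the eviction/preemption check can be reproduced from cluster events and pod status (a sketch, assuming access to the `airflow-data-eng-prod` namespace from the config above):

```shell
# Recent evictions in the Airflow namespace (none observed in our case)
kubectl get events -n airflow-data-eng-prod \
  --field-selector reason=Evicted --sort-by=.lastTimestamp

# OOM kills show up as "OOMKilled" in the last terminated state of worker pods
kubectl get pods -n airflow-data-eng-prod \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
```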

We suspect the issue may stem from one of the following:

  • Internal Airflow lifecycle logic (e.g., zombie detection, pod orphan cleanup)
  • KubernetesExecutor edge cases where the scheduler or triggerer may initiate SIGTERM without resource-based justification
  • GCSFuse or sidecar instability affecting the task pod lifecycle (though the main containers are healthy)

We are open to instrumenting Airflow internals with logging or tracing if the core team can suggest areas to probe (e.g., executor heartbeat, cleanup routines, orphan detection).
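As a first instrumentation step, one low-effort probe is to scan the remote task logs for the SIGTERM marker and correlate hit timestamps with scheduler-side zombie-detection log lines. A rough sketch: the "Received SIGTERM" phrasing matches Airflow 2.x task-runner logs but should be verified against this deployment, and the `dag_id=`/`task_id=` fields are assumed from the log filename format shown above:

```python
import re
from collections import Counter

# Marker the task-side signal handler writes on SIGTERM
# (string assumed from Airflow 2.x task logs; verify for your version).
SIGTERM_RE = re.compile(r"Received SIGTERM")

def count_sigterm_tasks(log_lines):
    """Count SIGTERM hits per (dag_id, task_id), parsed from lines of the
    form 'dag_id=<d> task_id=<t> ... Received SIGTERM ...'."""
    hits = Counter()
    for line in log_lines:
        if SIGTERM_RE.search(line):
            dag = re.search(r"dag_id=(\S+)", line)
            task = re.search(r"task_id=(\S+)", line)
            if dag and task:
                hits[(dag.group(1), task.group(1))] += 1
    return hits

sample = [
    "dag_id=example task_id=load Received SIGTERM. Terminating subprocesses",
    "dag_id=example task_id=load task exited cleanly",
]
print(count_sigterm_tasks(sample))  # Counter({('example', 'load'): 1})
```

Ranking the output by count would show whether the kills cluster on particular DAGs, operators, or times of day, which should help distinguish zombie detection from pod-lifecycle causes.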

We would also be happy to help test a patch or proposed fix in our production-like test cluster if needed.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct
