Replies: 3 comments
- Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue please do so, no need to wait for approval.
- I think your best bet is to attempt to upgrade to Airflow 3. This issue might take forever to diagnose because there is absolutely no clue as to why it happens. The whole mechanism of heartbeating, supervising running tasks, etc. that could be causing it has been completely rewritten there, and I think the fastest way to get this problem solved is to simply switch to Airflow 3.
- We also use
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.10.5
What happened?
We are running Airflow on a GKE cluster with the KubernetesExecutor. Our deployment has 1200+ DAGs, with 100+ tasks running concurrently at peak. For the past month, we have been observing 20–30 tasks failing daily due to SIGTERM, even though the cluster has no resource pressure (CPU/memory/network usage stays well under 50%).
We initially suspected it was an issue with the Airflow version (2.9.2), so we upgraded to Airflow 2.10.5 — however, the issue still persists.
Example log file from an affected task run:
dag_id=PCDM_Editorial_Gold_Top_Content_Append_Daily_run_id=scheduled__2025-07-06T07_15_00+00_00_task_id=GcsToSnowflakegold_top_content_editorial_attempt=1.log
What you think should happen instead?
Tasks should continue running unless they are explicitly cancelled, time out, or encounter a failure condition. A spontaneous SIGTERM from the Airflow system (without user action or resource pressure) should not happen.
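One hedged first step when triaging unexplained SIGTERMs is to check whether the local task job's heartbeat is the trigger. The keys below are standard Airflow 2 settings; the values shown are illustrative starting points, not recommendations taken from this report:

```yaml
# Illustrative values; keys are standard Airflow 2 configuration options.
env:
  - name: AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC    # how often task jobs heartbeat
    value: "5"
  - name: AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME  # grace period after SIGTERM before SIGKILL
    value: "60"
```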
How to reproduce
```yaml
executor: "KubernetesExecutor"
allowPodLaunching: true
env:
value: "yes"
value: "*****************************************"
value: "False"
value: "False"
value: "/opt/airflow/plugins"
value: "True"
value: "/opt/airflow"
value: "/opt/airflow/dags/repo/airflow-dags"
value: "150"
value: "200"
value: "60"
value: "20"
value: "64000"
value: "500"
value: "600"
value: "True"
value: "Enabled"
value: "1000000"
value: "900"
value: "200"
value: "250"
value: "True"
value: "airflow.api.auth.backend.default"
value: "True"
value: "True"
value: "google-cloud-platform://?extra__google_cloud_platform__scope=https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/drive,https://www.googleapis.com/auth/bigquery"
value: "DEBUG"
value: "True"
value: "gs://airflow-cluster/airflow/logs"
value: "False"
value: "airflow-data-eng-prod"
value: "/opt/airflow/dags/repo/airflow-dags"
value: "room-prod-pvc-pvc"
value: "/opt/airflow/dags/repo/airflow-dags"
value: "airflow-worker"
value: "True"
value: "True"
value: "50"
value: "30"
value: "False"
value: "300"
value: "modified_time"
value: "700"
value: "5"
value: "60"
value: "120"
value: "100"
value: "100"
value: "True"
value: "airflow.contrib.auth.backends.github_enterprise_auth"
value: "github.com"
value: "/oauth-authorized/github"
```
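The env entries above lost their `name:` keys during extraction, so only the values remain. In the official Helm chart's values.yaml, each entry is a name/value pair; the variable name below is illustrative only, not recovered from the report:

```yaml
env:
  - name: AIRFLOW__LOGGING__LOGGING_LEVEL  # illustrative name; the original names were lost
    value: "DEBUG"
```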
Operating System
linux
Versions of Apache Airflow Providers
2.10.5
Deployment
Official Apache Airflow Helm Chart
Deployment details
Airflow Version:
Originally 2.9.2, upgraded to 2.10.5 (issue persists in both versions)
Executor:
KubernetesExecutor
Orchestration Platform:
Google Kubernetes Engine (GKE) – Standard mode (not Autopilot)
Cluster Size & Scaling:
~20 nodes (n2-highmem-8)
Autoscaling enabled, but cluster consistently runs at <50% CPU/memory usage
Nodes use standard nodepools, not preemptible or spot instances
Airflow Deployment Mode:
Task Workloads:
~1200 DAGs, mostly scheduled daily/hourly
~100–150 concurrent tasks at peak times
Heavy usage of:
- ExternalTaskSensor
Anything else?
We have thoroughly verified that this is not caused by:
We suspect the issue may stem from one of the following:
- Internal Airflow lifecycle logic (e.g., zombie detection, pod orphan cleanup)
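If zombie detection is the cause, raising its thresholds is one way to test the hypothesis. These are real Airflow 2 scheduler settings; the values shown are illustrative, not tuned recommendations:

```yaml
env:
  - name: AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD  # seconds without a heartbeat before a task is declared a zombie (default 300)
    value: "600"
  - name: AIRFLOW__SCHEDULER__ZOMBIE_DETECTION_INTERVAL        # how often the scheduler scans for zombies (default 10.0)
    value: "30.0"
```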
We would also be happy to help test a patch or proposed fix in our production-like test cluster if needed.
Are you willing to submit PR?
Code of Conduct