Code of Conduct
- I agree to follow this project's Code of Conduct
Search before asking
- I have searched in the issues and found no similar issues.
Describe the bug
This occurs in Kubernetes cluster deployment mode, with Kyuubi running on a K8s cluster. When a large number of Spark batch jobs are submitted, Kyuubi attempts to spin up one Spark driver per batch job. Under heavy load, those drivers are never created or scheduled, yet Kyuubi records each batch job's state as "PENDING" via MetadataManager and repeatedly polls each job's status until the server runs out of memory. On the next restart of the Kyuubi pod, the repeated polling resumes, because the records are never updated: the drivers that would handle the batch jobs were never created in the first place. The batch-job records therefore persist in the Metadata Store with a "state" field of "PENDING" and an "engine_state" field of "UNKNOWN"; they can never be resolved, and the repeated polling causes every subsequent restart of Kyuubi to run out of memory as well.
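A mitigation sketch I am experimenting with, not a verified fix: throttle how many drivers Kyuubi tries to create at once and bound the per-batch status polling. This assumes the v2 batch implementation and its submitter queue (kyuubi.batch.impl.version, kyuubi.batch.submitter.*) are available and behave as documented in 1.10.x; I have not confirmed that they resolve the issue.
# throttle concurrent driver creation via the bounded v2 submitter queue (assumption: available and applicable here)
kyuubi.batch.impl.version=2
kyuubi.batch.submitter.enabled=true
kyuubi.batch.submitter.threads=8
# poll batch application state less aggressively to reduce heap pressure from per-job polling
kyuubi.batch.application.check.interval=PT30S
# give up sooner on driver pods that are never scheduled, instead of polling for the full five minutes
kyuubi.engine.kubernetes.submit.timeout=PT120S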
Affects Version(s)
v1.10.2
Kyuubi Server Log Output
2025-10-21 16:03:48.837 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=406b8bc1-457e-4115-ae55-50a0d39c061c to be created, elapsed time: 92106ms, return UNKNOWN status
2025-10-21 16:03:48.929 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=6d439666-0484-4902-9dab-39ad39f96b3e to be created, elapsed time: 92291ms, return UNKNOWN status
2025-10-21 16:03:48.929 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=fc5979a0-5599-4088-90ee-c5e995e0fca7 to be created, elapsed time: 92365ms, return UNKNOWN status
2025-10-21 16:03:48.930 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=6aacf45b-21b3-44a5-bdd1-1f05eaaec393 to be created, elapsed time: 92365ms, return UNKNOWN status
2025-10-21 16:03:48.930 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=2db1998a-e5e2-4a31-bc97-40f5f1b31345 to be created, elapsed time: 92258ms, return UNKNOWN status
2025-10-21 16:03:48.931 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=20a84034-3c6b-47b0-916b-199b0e0750da to be created, elapsed time: 92250ms, return UNKNOWN status
2025-10-21 16:03:48.932 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=3aa6e64e-a4af-443e-97c3-4296befd050a to be created, elapsed time: 92202ms, return UNKNOWN status
2025-10-21 16:03:48.937 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=d259e70a-ef06-41f3-8e2e-fe661552fae1 to be created, elapsed time: 92199ms, return UNKNOWN status
2025-10-21 16:03:48.938 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=f4e45523-f18a-4f69-b163-07b96666cee0 to be created, elapsed time: 92271ms, return UNKNOWN status
2025-10-21 16:03:49.135 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=171cacb5-4b47-4db4-a384-6f2925f927e6 to be created, elapsed time: 92505ms, return UNKNOWN status
2025-10-21 16:03:50.638 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=d98b10c6-e56a-429a-b86c-79f353ac18bb to be created, elapsed time: 94073ms, return UNKNOWN status
2025-10-21 16:03:50.640 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=a1e6f809-c1b7-41a7-b21f-007eaa2eaaf0 to be created, elapsed time: 94076ms, return UNKNOWN status
2025-10-21 16:03:50.729 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=0d71979b-3f34-485f-8088-9bada0308133 to be created, elapsed time: 94086ms, return UNKNOWN status
Kyuubi Engine Log Output
No engine log output... it seems the K8s pod for the Kyuubi server crashes before it even has a chance to write the engine logs specific to each batch-job submission for my user...
Kyuubi Server Configurations
################################################## kyuubi server settings #############################################
kyuubi.kubernetes.authenticate.driver.serviceAccountName=kyuubi-poc
kyuubi.kubernetes.trust.certificates=true
# defaults to POD; we ran into an edge case where the image was in ImagePullBackOff and the job stayed pending
kyuubi.kubernetes.application.state.source=CONTAINER
kyuubi.kubernetes.authenticate.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token
kyuubi.engine.kubernetes.submit.timeout=PT300S
# enable arrow configuration
kyuubi.operation.result.format=arrow
# kyuubi.operation.incremental.collect=true
################################################## kyuubi engine settings #############################################
kyuubi.engine.share.level=USER
# kyuubi.engine.share.level=SERVER
kyuubi.server.name=superset-poc-server
################################################## Very experimental stuff ############################################
# kyuubi.engine.deregister.exception.messages=Error getting policies,serviceName=spark,httpStatusCode:400
# kyuubi.engine.deregister.job.max.failures=1
# kyuubi.engine.deregister.exception.ttl=PT10M
################################################## kyuubi profile settings ############################################
kyuubi.session.conf.advisor=org.apache.kyuubi.session.FileSessionConfAdvisor
##################################kyuubi engine kill disable settings #################################################
# kyuubi.engine.ui.stop.enabled=false
################################################## kyuubi engine clean up settings ####################################
kyuubi.kubernetes.spark.cleanupTerminatedDriverPod.kind=COMPLETED
kyuubi.kubernetes.terminatedApplicationRetainPeriod=PT5M
################################################## User specific defaults #############################################
# ___srv-spark-dbt-np___.kyuubi.session.engine.idle.timeout=PT30S
# ___srv-spark-dbt-np___.kyuubi.session.idle.timeout=PT30S
___srv-spark-dbt-np___.kyuubi.session.engine.initialize.timeout=PT10M
kyuubi.session.idle.timeout=PT15M
kyuubi.batch.session.idle.timeout=PT15M
kyuubi.engine.user.isolated.spark.session.idle.timeout=PT15M
################################################## Trino Engine #######################################################
kyuubi.frontend.protocols=REST,THRIFT_BINARY,TRINO
kyuubi.frontend.trino.bind.host=0.0.0.0
kyuubi.frontend.trino.bind.port=10011
################################################## kyuubi session settings ###########################################
# kyuubi.session.conf.restrict.list=spark.sql.optimizer.excludedRules,spark.kubernetes.driver.node.selector.label,spark.kubernetes.executor.node.selector.label,spark.master,spark.submit.deployMode,spark.kubernetes.namespace,spark.kubernetes.authenticate.driver.serviceAccountName,spark.kubernetes.driver.podTemplateFile,spark.kubernetes.executor.podTemplateFile,spark.ui.killEnabled,spark.redaction.regex,spark.sql.redaction.string.regex
spark.kyuubi.conf.restricted.list=spark.sql.optimizer.excludedRules,spark.kubernetes.driver.node.selector.label,spark.kubernetes.executor.node.selector.label,spark.master,spark.submit.deployMode,spark.kubernetes.namespace,spark.kubernetes.authenticate.driver.serviceAccountName,spark.kubernetes.driver.podTemplateFile,spark.kubernetes.executor.podTemplateFile,spark.ui.killEnabled,spark.redaction.regex,spark.sql.redaction.string.regex
kyuubi.session.conf.ignore.list=spark.sql.optimizer.excludedRules,spark.kubernetes.driver.node.selector.label,spark.kubernetes.executor.node.selector.label,spark.master,spark.submit.deployMode,spark.kubernetes.namespace,spark.kubernetes.authenticate.driver.serviceAccountName,spark.kubernetes.driver.podTemplateFile,spark.kubernetes.executor.podTemplateFile,spark.ui.killEnabled,spark.redaction.regex,spark.sql.redaction.string.regex
kyuubi.batch.conf.ignore.list=spark.kubernetes.driver.node.selector.label,spark.kubernetes.executor.node.selector.label,spark.master,spark.submit.deployMode,spark.kubernetes.namespace,spark.kubernetes.authenticate.driver.serviceAccountName,spark.kubernetes.driver.podTemplateFile,spark.kubernetes.executor.podTemplateFile,spark.ui.killEnabled,spark.redaction.regex,spark.sql.redaction.string.regex
kyuubi.batchConf.spark.spark.sql.adaptive.enabled=true
kyuubi.batchConf.spark.spark.sql.adaptive.forceApply=false
kyuubi.batchConf.spark.spark.sql.adaptive.logLevel=info
kyuubi.batchConf.spark.spark.sql.adaptive.advisoryPartitionSizeInBytes=128m
kyuubi.batchConf.spark.spark.sql.adaptive.coalescePartitions.enabled=true
kyuubi.batchConf.spark.spark.sql.adaptive.coalescePartitions.minPartitionNum=1
kyuubi.batchConf.spark.spark.sql.adaptive.coalescePartitions.initialPartitionNum=1024
kyuubi.batchConf.spark.spark.sql.adaptive.fetchShuffleBlocksInBatch=true
kyuubi.batchConf.spark.spark.sql.adaptive.localShuffleReader.enabled=true
kyuubi.batchConf.spark.spark.sql.adaptive.skewJoin.enabled=true
kyuubi.batchConf.spark.spark.sql.adaptive.skewJoin.skewedPartitionFactor=5
kyuubi.batchConf.spark.spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=400m
kyuubi.batchConf.spark.spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin=0.2
# DRA (shuffle tracking) defaults for batch engines
kyuubi.batchConf.spark.spark.dynamicAllocation.enabled=true
kyuubi.batchConf.spark.spark.dynamicAllocation.shuffleTracking.enabled=true
kyuubi.batchConf.spark.spark.dynamicAllocation.initialExecutors=2
kyuubi.batchConf.spark.spark.dynamicAllocation.minExecutors=2
kyuubi.batchConf.spark.spark.dynamicAllocation.maxExecutors=64
kyuubi.batchConf.spark.spark.dynamicAllocation.executorAllocationRatio=0.5
kyuubi.batchConf.spark.spark.dynamicAllocation.executorIdleTimeout=60s
kyuubi.batchConf.spark.spark.dynamicAllocation.cachedExecutorIdleTimeout=30min
kyuubi.batchConf.spark.spark.cleaner.periodicGC.interval=5min
kyuubi.batchConf.spark.spark.sql.autoBroadcastJoinThreshold=-1
kyuubi.operation.getTables.ignoreTableProperties=true
# default resource configs
kyuubi.batchConf.spark.spark.executor.memory=20G
kyuubi.batchConf.spark.spark.executor.cores=6
kyuubi.batchConf.spark.spark.driver.memory=20G
kyuubi.batchConf.spark.spark.driver.cores=6
Kyuubi Engine Configurations
spark.submit.deployMode=cluster
spark.hadoop.hive.server2.transport.mode=binary
spark.hadoop.hive.execution.engine=spark
spark.hadoop.hive.input.format=io.delta.hive.HiveInputFormat
spark.hadoop.hive.tez.input.format=io.delta.hive.HiveInputFormat
spark.hadoop.fs.AbstractFileSystem.abfss.impl=org.apache.hadoop.fs.azurebfs.Abfss
spark.eventLog.compress=true
spark.eventLog.compression.codec=zstd
spark.hadoop.fs.azure.write.request.size=33554432
spark.executor.memory=16G
spark.executor.cores=8
spark.driver.memory=200G
spark.driver.cores=40
spark.driver.maxResultSize=20g
spark.scheduler.mode=FAIR
spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.adaptive.enabled=true
spark.decommission.enabled=true
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=16
spark.dynamicAllocation.maxExecutors=64
spark.dynamicAllocation.executorAllocationRatio=0.5
spark.kubernetes.executor.annotation.prometheus.io/port=7778
spark.kubernetes.executor.annotation.prometheus.io/scrape=true
spark.kubernetes.executor.annotation.prometheus.io/path=/metrics
spark.kubernetes.driver.annotation.prometheus.io/scrape=true
spark.kubernetes.driver.annotation.prometheus.io/port=7778
spark.kubernetes.driver.annotation.prometheus.io/path=/metrics
spark.kubernetes.executor.annotation.k8s.grafana.com/scrape=true
spark.kubernetes.executor.annotation.k8s.grafana.com/metrics.path=/metrics
spark.kubernetes.executor.annotation.k8s.grafana.com/metrics.portNumber=7778
spark.kubernetes.driver.annotation.k8s.grafana.com/scrape=true
spark.kubernetes.driver.annotation.k8s.grafana.com/metrics.path=/metrics
spark.kubernetes.driver.annotation.k8s.grafana.com/metrics.portNumber=7778
spark.excludeOnFailure.enabled=true
spark.metrics.conf=/opt/spark/conf/metrics.properties
spark.metrics.namespace=${spark.app.name}
spark.eventLog.enabled=true
spark.sql.extensions=org.apache.kyuubi.plugin.spark.authz.ranger.RangerSparkExtension,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
# Optimizations
spark.sql.redaction.string.regex=(?i)\bselect\b[\s\S]+?\bfrom\b[\s\S]+?(;|$)
# spark.redaction.regex=(?i)secret|password|passwd|token|key|credential|credentials|pwd
# spark.redaction.regex="(?i)secret|password|passwd|token|\.account\.key|credential|credentials|\.client\.secret\|_secret|pwd"
# test new redaction
spark.redaction.regex=(?i)secret|password|passwd|token|\.account\.key|credential|credentials|pwd|appMgrInfo
spark.sql.adaptive.enabled=true
spark.sql.adaptive.forceApply=false
spark.sql.adaptive.logLevel=info
spark.sql.adaptive.advisoryPartitionSizeInBytes=256m
spark.sql.adaptive.coalescePartitions.enabled=true
spark.sql.adaptive.coalescePartitions.minPartitionNum=1
spark.sql.adaptive.coalescePartitions.initialPartitionNum=1024
spark.sql.adaptive.fetchShuffleBlocksInBatch=true
spark.sql.adaptive.localShuffleReader.enabled=true
spark.sql.adaptive.skewJoin.enabled=true
spark.sql.adaptive.skewJoin.skewedPartitionFactor=5
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=400m
spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin=0.2
spark.sql.autoBroadcastJoinThreshold=-1
# Plugins (disable Gluten globally; enable only in Gluten profile)
spark.plugins=io.dataflint.spark.SparkDataflintPlugin
# TPCDS catalog configs
spark.sql.catalog.tpcds=org.apache.kyuubi.spark.connector.tpcds.TPCDSCatalog
# spark.sql.catalog.tpcds.excludeDatabases=sf30000
spark.sql.catalog.tpcds.useAnsiStringType=false
spark.sql.catalog.tpcds.useTableSchema_2_6=true
spark.sql.catalog.tpcds.read.maxPartitionBytes=128m
# Polaris
spark.sql.defaultCatalog=polaris
spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.polaris.warehouse=dv-polaris
spark.sql.catalog.polaris.token-refresh-enabled=true
spark.jars.ivy.log.level=DEBUG
spark.ui.killEnabled=false
Additional context
No response
Are you willing to submit PR?
- Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
- No. I cannot submit a PR at this time.