[Bug] Kyuubi OOM when polling for status of spark driver for Simultaneous Large Number of Batch Jobs #7226

@JoonPark1

Description


Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

This occurs in Kubernetes cluster deployment mode, with Kyuubi running on a k8s cluster. When a large number of Spark batch jobs are submitted simultaneously, Kyuubi attempts to spin up one Spark driver per batch job. Under heavy load, Kyuubi records the state of each batch job as "PENDING" via MetadataManager and repeatedly polls each job's status until it runs out of memory. On the next restart of the Kyuubi pod, the repeated polling resumes, because the records are never updated: the Spark drivers for those batch jobs were never created and scheduled in the first place. As a result, the batch-job records in Kyuubi's Metadata Store persist with a "state" field of "PENDING" and an "engine_state" field of "UNKNOWN". These records can never be resolved, and the repeated polling causes each subsequent Kyuubi restart to run out of memory as well.
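The accumulation described above can be sketched with a simplified model. This is NOT Kyuubi's actual code; the names `BatchRecord` and `poll_cycle` are illustrative only. The point is that records stuck in PENDING are re-polled forever, and any per-poll state retained without a cap grows until OOM, while a bounded variant stays flat:

```python
# Simplified model of the reported failure mode: PENDING batch records are
# polled every cycle, and per-record state retained across cycles grows
# without bound. Names here are hypothetical, not Kyuubi internals.
from dataclasses import dataclass, field

@dataclass
class BatchRecord:
    batch_id: str
    state: str = "PENDING"            # never leaves PENDING: driver pod was never created
    engine_state: str = "UNKNOWN"
    poll_history: list = field(default_factory=list)  # stand-in for retained per-poll objects

def poll_cycle(records, max_history=None):
    """One polling pass over all records; an unbounded history models the leak."""
    for rec in records:
        if rec.state == "PENDING":
            rec.poll_history.append(rec.engine_state)  # grows every cycle
            if max_history is not None:                # bounded variant: keep only the tail
                del rec.poll_history[:-max_history]

records = [BatchRecord(f"batch-{i}") for i in range(100)]

for _ in range(50):
    poll_cycle(records)                 # unbounded: 50 retained entries per record
leaked = sum(len(r.poll_history) for r in records)

for r in records:
    r.poll_history.clear()
for _ in range(50):
    poll_cycle(records, max_history=5)  # bounded: at most 5 entries per record
capped = sum(len(r.poll_history) for r in records)

print(leaked, capped)  # 5000 500
```

Any real fix would instead mark records terminal (e.g. once `kyuubi.engine.kubernetes.submit.timeout` elapses) so they drop out of the polling set entirely, rather than merely capping retained state.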

Affects Version(s)

v1.10.2

Kyuubi Server Log Output

2025-10-21 16:03:48.837 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=406b8bc1-457e-4115-ae55-50a0d39c061c to be created, elapsed time: 92106ms, return UNKNOWN status
2025-10-21 16:03:48.929 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=6d439666-0484-4902-9dab-39ad39f96b3e to be created, elapsed time: 92291ms, return UNKNOWN status
2025-10-21 16:03:48.929 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=fc5979a0-5599-4088-90ee-c5e995e0fca7 to be created, elapsed time: 92365ms, return UNKNOWN status
2025-10-21 16:03:48.930 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=6aacf45b-21b3-44a5-bdd1-1f05eaaec393 to be created, elapsed time: 92365ms, return UNKNOWN status
2025-10-21 16:03:48.930 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=2db1998a-e5e2-4a31-bc97-40f5f1b31345 to be created, elapsed time: 92258ms, return UNKNOWN status
2025-10-21 16:03:48.931 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=20a84034-3c6b-47b0-916b-199b0e0750da to be created, elapsed time: 92250ms, return UNKNOWN status
2025-10-21 16:03:48.932 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=3aa6e64e-a4af-443e-97c3-4296befd050a to be created, elapsed time: 92202ms, return UNKNOWN status
2025-10-21 16:03:48.937 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=d259e70a-ef06-41f3-8e2e-fe661552fae1 to be created, elapsed time: 92199ms, return UNKNOWN status
2025-10-21 16:03:48.938 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=f4e45523-f18a-4f69-b163-07b96666cee0 to be created, elapsed time: 92271ms, return UNKNOWN status
2025-10-21 16:03:49.135 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=171cacb5-4b47-4db4-a384-6f2925f927e6 to be created, elapsed time: 92505ms, return UNKNOWN status
2025-10-21 16:03:50.638 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=d98b10c6-e56a-429a-b86c-79f353ac18bb to be created, elapsed time: 94073ms, return UNKNOWN status
2025-10-21 16:03:50.640 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=a1e6f809-c1b7-41a7-b21f-007eaa2eaaf0 to be created, elapsed time: 94076ms, return UNKNOWN status
2025-10-21 16:03:50.729 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Waiting for driver pod with label: kyuubi-unique-tag=0d71979b-3f34-485f-8088-9bada0308133 to be created, elapsed time: 94086ms, return UNKNOWN status

Kyuubi Engine Log Output

No engine log output... it seems the Kyuubi server pod crashes before it even has a chance to write the engine logs for each batch-job submission for my user.
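Since the server log already contains the `kyuubi-unique-tag` labels it was waiting on, one way to confirm the driver pods were never created is to pull those tags out of the log and query Kubernetes for each one. A hedged sketch (the log excerpt below is abbreviated from the output above, and `<spark-namespace>` is a placeholder for the namespace the drivers are submitted to):

```shell
# Extract unique kyuubi-unique-tag values from a server log excerpt and
# print the kubectl queries to run against each one. Empty results would
# confirm the driver pods were never scheduled.
log='2025-10-21 16:03:48.837 WARN ... kyuubi-unique-tag=406b8bc1-457e-4115-ae55-50a0d39c061c to be created ...
2025-10-21 16:03:48.929 WARN ... kyuubi-unique-tag=6d439666-0484-4902-9dab-39ad39f96b3e to be created ...'

out=$(printf '%s\n' "$log" \
  | grep -o 'kyuubi-unique-tag=[0-9a-f-]*' \
  | sort -u \
  | sed 's/.*/kubectl get pods -n <spark-namespace> -l "&"/')
echo "$out"
```

Each emitted line is a `kubectl get pods` call filtered by the label selector Kyuubi itself uses in the warning messages.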

Kyuubi Server Configurations

################################################## kyuubi server settings #############################################
            kyuubi.kubernetes.authenticate.driver.serviceAccountName=kyuubi-poc
            kyuubi.kubernetes.trust.certificates=true
            # defaults to POD we ran into edge case where imagepullbackoff and job is pending
            kyuubi.kubernetes.application.state.source=CONTAINER
            kyuubi.kubernetes.authenticate.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token
            kyuubi.engine.kubernetes.submit.timeout=PT300S
            # enable arrow configuration
            kyuubi.operation.result.format=arrow
            # kyuubi.operation.incremental.collect=true
            ################################################## kyuubi engine settings #############################################
            kyuubi.engine.share.level=USER
            # kyuubi.engine.share.level=SERVER
            kyuubi.server.name=superset-poc-server
            ################################################## Very experimental stuff ############################################
            # kyuubi.engine.deregister.exception.messages=Error getting policies,serviceName=spark,httpStatusCode:400
            # kyuubi.engine.deregister.job.max.failures=1
            # kyuubi.engine.deregister.exception.ttl=PT10M
            ################################################## kyuubi profile settings ############################################
            kyuubi.session.conf.advisor=org.apache.kyuubi.session.FileSessionConfAdvisor
            ##################################kyuubi engine kill disable settings #################################################
            # kyuubi.engine.ui.stop.enabled=false
            ################################################## kyuubi engine clean up settings ####################################
            kyuubi.kubernetes.spark.cleanupTerminatedDriverPod.kind=COMPLETED
            kyuubi.kubernetes.terminatedApplicationRetainPeriod=PT5M
            ################################################## User specific defaults #############################################
            # ___srv-spark-dbt-np___.kyuubi.session.engine.idle.timeout=PT30S
            # ___srv-spark-dbt-np___.kyuubi.session.idle.timeout=PT30S
            ___srv-spark-dbt-np___.kyuubi.session.engine.initialize.timeout=PT10M
            kyuubi.session.idle.timeout=PT15M
            kyuubi.batch.session.idle.timeout=PT15M
            kyuubi.engine.user.isolated.spark.session.idle.timeout=PT15M
            ################################################## Trino Engine #######################################################
            kyuubi.frontend.protocols=REST,THRIFT_BINARY,TRINO
            kyuubi.frontend.trino.bind.host=0.0.0.0
            kyuubi.frontend.trino.bind.port=10011
            ################################################## kyuubi session settings ###########################################
            # kyuubi.session.conf.restrict.list=spark.sql.optimizer.excludedRules,spark.kubernetes.driver.node.selector.label,spark.kubernetes.executor.node.selector.label,spark.master,spark.submit.deployMode,spark.kubernetes.namespace,spark.kubernetes.authenticate.driver.serviceAccountName,spark.kubernetes.driver.podTemplateFile,spark.kubernetes.executor.podTemplateFile,spark.ui.killEnabled,spark.redaction.regex,spark.sql.redaction.string.regex
            spark.kyuubi.conf.restricted.list=spark.sql.optimizer.excludedRules,spark.kubernetes.driver.node.selector.label,spark.kubernetes.executor.node.selector.label,spark.master,spark.submit.deployMode,spark.kubernetes.namespace,spark.kubernetes.authenticate.driver.serviceAccountName,spark.kubernetes.driver.podTemplateFile,spark.kubernetes.executor.podTemplateFile,spark.ui.killEnabled,spark.redaction.regex,spark.sql.redaction.string.regex
            kyuubi.session.conf.ignore.list=spark.sql.optimizer.excludedRules,spark.kubernetes.driver.node.selector.label,spark.kubernetes.executor.node.selector.label,spark.master,spark.submit.deployMode,spark.kubernetes.namespace,spark.kubernetes.authenticate.driver.serviceAccountName,spark.kubernetes.driver.podTemplateFile,spark.kubernetes.executor.podTemplateFile,spark.ui.killEnabled,spark.redaction.regex,spark.sql.redaction.string.regex
            kyuubi.batch.conf.ignore.list=spark.kubernetes.driver.node.selector.label,spark.kubernetes.executor.node.selector.label,spark.master,spark.submit.deployMode,spark.kubernetes.namespace,spark.kubernetes.authenticate.driver.serviceAccountName,spark.kubernetes.driver.podTemplateFile,spark.kubernetes.executor.podTemplateFile,spark.ui.killEnabled,spark.redaction.regex,spark.sql.redaction.string.regex
            kyuubi.batchConf.spark.spark.sql.adaptive.enabled=true
            kyuubi.batchConf.spark.spark.sql.adaptive.forceApply=false
            kyuubi.batchConf.spark.spark.sql.adaptive.logLevel=info
            kyuubi.batchConf.spark.spark.sql.adaptive.advisoryPartitionSizeInBytes=128m
            kyuubi.batchConf.spark.spark.sql.adaptive.coalescePartitions.enabled=true
            kyuubi.batchConf.spark.spark.sql.adaptive.coalescePartitions.minPartitionNum=1
            kyuubi.batchConf.spark.spark.sql.adaptive.coalescePartitions.initialPartitionNum=1024
            kyuubi.batchConf.spark.spark.sql.adaptive.fetchShuffleBlocksInBatch=true
            kyuubi.batchConf.spark.spark.sql.adaptive.localShuffleReader.enabled=true
            kyuubi.batchConf.spark.spark.sql.adaptive.skewJoin.enabled=true
            kyuubi.batchConf.spark.spark.sql.adaptive.skewJoin.skewedPartitionFactor=5
            kyuubi.batchConf.spark.spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=400m
            kyuubi.batchConf.spark.spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin=0.2
            # DRA (shuffle tracking) defaults for batch engines
            kyuubi.batchConf.spark.spark.dynamicAllocation.enabled=true
            kyuubi.batchConf.spark.spark.dynamicAllocation.shuffleTracking.enabled=true
            kyuubi.batchConf.spark.spark.dynamicAllocation.initialExecutors=2
            kyuubi.batchConf.spark.spark.dynamicAllocation.minExecutors=2
            kyuubi.batchConf.spark.spark.dynamicAllocation.maxExecutors=64
            kyuubi.batchConf.spark.spark.dynamicAllocation.executorAllocationRatio=0.5
            kyuubi.batchConf.spark.spark.dynamicAllocation.executorIdleTimeout=60s
            kyuubi.batchConf.spark.spark.dynamicAllocation.cachedExecutorIdleTimeout=30min
            kyuubi.batchConf.spark.spark.cleaner.periodicGC.interval=5min
            kyuubi.batchConf.spark.spark.sql.autoBroadcastJoinThreshold=-1
            kyuubi.operation.getTables.ignoreTableProperties=true
            # default resource configs
            kyuubi.batchConf.spark.spark.executor.memory=20G
            kyuubi.batchConf.spark.spark.executor.cores=6
            kyuubi.batchConf.spark.spark.driver.memory=20G
            kyuubi.batchConf.spark.spark.driver.cores=6

Kyuubi Engine Configurations

            spark.submit.deployMode=cluster
            spark.hadoop.hive.server2.transport.mode=binary
            spark.hadoop.hive.execution.engine=spark
            spark.hadoop.hive.input.format=io.delta.hive.HiveInputFormat
            spark.hadoop.hive.tez.input.format=io.delta.hive.HiveInputFormat
            spark.hadoop.fs.AbstractFileSystem.abfss.impl=org.apache.hadoop.fs.azurebfs.Abfss
            spark.eventLog.compress=true
            spark.eventLog.compression.codec=zstd
            spark.hadoop.fs.azure.write.request.size=33554432
            spark.executor.memory=16G
            spark.executor.cores=8
            spark.driver.memory=200G
            spark.driver.cores=40
            spark.driver.maxResultSize=20g
            spark.scheduler.mode=FAIR
            spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
            spark.sql.adaptive.enabled=true
            spark.decommission.enabled=true
            spark.dynamicAllocation.enabled=true
            spark.dynamicAllocation.minExecutors=16
            spark.dynamicAllocation.maxExecutors=64
            spark.dynamicAllocation.executorAllocationRatio=0.5
            spark.kubernetes.executor.annotation.prometheus.io/port=7778
            spark.kubernetes.executor.annotation.prometheus.io/scrape=true
            spark.kubernetes.executor.annotation.prometheus.io/path=/metrics
            spark.kubernetes.driver.annotation.prometheus.io/scrape=true
            spark.kubernetes.driver.annotation.prometheus.io/port=7778
            spark.kubernetes.driver.annotation.prometheus.io/path=/metrics
            spark.kubernetes.executor.annotation.k8s.grafana.com/scrape=true
            spark.kubernetes.executor.annotation.k8s.grafana.com/metrics.path=/metrics
            spark.kubernetes.executor.annotation.k8s.grafana.com/metrics.portNumber=7778
            spark.kubernetes.driver.annotation.k8s.grafana.com/scrape=true
            spark.kubernetes.driver.annotation.k8s.grafana.com/metrics.path=/metrics
            spark.kubernetes.driver.annotation.k8s.grafana.com/metrics.portNumber=7778
            spark.excludeOnFailure.enabled=true
            spark.metrics.conf=/opt/spark/conf/metrics.properties
            spark.metrics.namespace=${spark.app.name}
            spark.eventLog.enabled=true
            spark.sql.extensions=org.apache.kyuubi.plugin.spark.authz.ranger.RangerSparkExtension,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
            # Optimizations
            spark.sql.redaction.string.regex=(?i)\bselect\b[\s\S]+?\bfrom\b[\s\S]+?(;|$)
            # spark.redaction.regex=(?i)secret|password|passwd|token|key|credential|credentials|pwd
            # spark.redaction.regex="(?i)secret|password|passwd|token|\.account\.key|credential|credentials|\.client\.secret\|_secret|pwd"
            # test new redaction
            spark.redaction.regex=(?i)secret|password|passwd|token|\.account\.key|credential|credentials|pwd|appMgrInfo
            spark.sql.adaptive.enabled=true
            spark.sql.adaptive.forceApply=false
            spark.sql.adaptive.logLevel=info
            spark.sql.adaptive.advisoryPartitionSizeInBytes=256m
            spark.sql.adaptive.coalescePartitions.enabled=true
            spark.sql.adaptive.coalescePartitions.minPartitionNum=1
            spark.sql.adaptive.coalescePartitions.initialPartitionNum=1024
            spark.sql.adaptive.fetchShuffleBlocksInBatch=true
            spark.sql.adaptive.localShuffleReader.enabled=true
            spark.sql.adaptive.skewJoin.enabled=true
            spark.sql.adaptive.skewJoin.skewedPartitionFactor=5
            spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=400m
            spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin=0.2
            spark.sql.autoBroadcastJoinThreshold=-1
            # Plugins (disable Gluten globally; enable only in Gluten profile)
            spark.plugins=io.dataflint.spark.SparkDataflintPlugin
            # TPCDS catalog configs
            spark.sql.catalog.tpcds=org.apache.kyuubi.spark.connector.tpcds.TPCDSCatalog
            # spark.sql.catalog.tpcds.excludeDatabases=sf30000
            spark.sql.catalog.tpcds.useAnsiStringType=false
            spark.sql.catalog.tpcds.useTableSchema_2_6=true
            spark.sql.catalog.tpcds.read.maxPartitionBytes=128m
            # Polaris
            spark.sql.defaultCatalog=polaris
            spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog
            spark.sql.catalog.polaris.warehouse=dv-polaris
            spark.sql.catalog.polaris.token-refresh-enabled=true
            spark.jars.ivy.log.level=DEBUG
            spark.ui.killEnabled=false

Additional context

No response

Are you willing to submit PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
  • No. I cannot submit a PR at this time.
