
SparkApplication Creates Duplicate PodGroups and Ignores Custom Volcano Queue #2526

@sagarprst

Description

Hi,

I’m encountering an issue with my SparkApplication custom resource when using Volcano as the batch scheduler: even though a custom queue is specified in the spec, the job still ends up in the default queue at execution time.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  imagePullSecrets:
    - my-pull-secrets
  type: Scala
  mode: cluster
  image: our.repo/spark:volcano-latest
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.3.jar
  sparkVersion: 3.5.3
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: 1024m
    serviceAccount: spark-operator-spark
  executor:
    cores: 1
    instances: 1
    memory: 1024m
  batchScheduler: "volcano"
  batchSchedulerOptions:
    queue: "myqueue"

After applying the manifest above, the job completes, but I see two PodGroups instead of one:

kubectl get podgroups -n default

NAME        STATUS      MINMEMBER   RUNNINGS   AGE
podgroup1   Completed   1                      2m36s
podgroup2   Inqueue     1                      2m25s
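
To make the ownership and queue assignment of each PodGroup visible, the listing can be widened with custom columns (this assumes Volcano's PodGroup schema, where the assigned queue is recorded under spec.queue):

kubectl get podgroups -n default \
  -o custom-columns='NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind,QUEUE:.spec.queue'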

Expected behavior

The Spark job should run in the specified queue (myqueue).

Only one PodGroup should be created, owned by the SparkApplication resource.

Actual behavior

The Spark job runs successfully.

However, two PodGroups are created:

One with the ownerReference.kind set to Pod, which ends up in an Inqueue state.

Another associated with the SparkApplication, which reaches a Completed status.

The PodGroup owned by the SparkApplication appears to function as expected, but the extra Pod-owned PodGroup causes confusion and could lead to scheduling conflicts.
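
One way to check which PodGroup the driver pod was actually bound to is to inspect its schedulerName and the scheduling.k8s.io/group-name annotation that Volcano uses to link pods to PodGroups (spark-pi-driver is the pod name generated in my case; adjust as needed):

kubectl get pod spark-pi-driver -n default \
  -o jsonpath='{.spec.schedulerName} {.metadata.annotations.scheduling\.k8s\.io/group-name}{"\n"}'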

Environment & Versions

  • Kubernetes Version: v1.28.7
  • Spark Operator Version: 2.1.0
  • Apache Spark Version: 3.5.3

Can you confirm whether this is expected behavior or a bug?

Is there a workaround or additional configuration needed to ensure the correct queue is used and the redundant PodGroup is avoided?
