Restart of Spark Operator may Result in Duplicated Submission #2788

@kohn

What happened?

  • ✋ I have searched the open/closed issues and my issue is not listed.

When I restart spark-operator, a job sometimes fails with the error "driver pod already exist". I checked the Kubernetes events and found the following sequence:

  • T0: SparkApplicationSubmitted event is recorded
  • T1: the old spark-operator exits
  • T2: the new spark-operator starts
  • T3: SparkApplicationAdded event is recorded
  • T4: SparkApplicationSubmissionFailed event is recorded
  • T5: SparkApplicationFailed event is recorded

My guess is that this bug happens when spark-operator exits abruptly after the spark-submit command has finished but before updateSparkApplicationStatus has run, so the SparkApplication status is still "new" (""). When the new spark-operator comes up, it sees the "new" status and submits the application again, which fails because the driver pod from the first submission already exists.
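
To make the suspected sequence concrete, here is a toy in-memory model of it. This is only a sketch, not the operator's code: `Cluster`, `reconcile`, `reconcile_with_guard`, the `<app>-driver` pod naming, and the state strings are all hypothetical names chosen for illustration, and the guard at the end is just one possible mitigation, not a confirmed fix.

```python
from dataclasses import dataclass, field


class AlreadyExists(Exception):
    """Raised when a pod with the same name is created twice."""


@dataclass
class Cluster:
    """Toy stand-in for cluster state that survives operator restarts."""
    pods: set = field(default_factory=set)
    app_state: dict = field(default_factory=dict)  # app name -> state; "" means new

    def create_driver_pod(self, app: str) -> None:
        name = f"{app}-driver"  # assumed naming scheme, for illustration only
        if name in self.pods:
            raise AlreadyExists(f"driver pod {name} already exists")
        self.pods.add(name)


def reconcile(cluster: Cluster, app: str, crash_before_status_update: bool = False) -> str:
    """One reconcile pass: submit the app if its persisted state is still new."""
    state = cluster.app_state.get(app, "")
    if state != "":
        return state  # already handled, nothing to submit
    try:
        cluster.create_driver_pod(app)  # stands in for spark-submit
    except AlreadyExists:
        cluster.app_state[app] = "SUBMISSION_FAILED"
        return cluster.app_state[app]
    if crash_before_status_update:
        return ""  # operator dies here, so the new state is never persisted
    cluster.app_state[app] = "SUBMITTED"
    return cluster.app_state[app]


def reconcile_with_guard(cluster: Cluster, app: str) -> str:
    """Hypothetical mitigation: adopt an existing driver pod instead of resubmitting."""
    if cluster.app_state.get(app, "") == "" and f"{app}-driver" in cluster.pods:
        cluster.app_state[app] = "SUBMITTED"
        return cluster.app_state[app]
    return reconcile(cluster, app)


if __name__ == "__main__":
    cluster = Cluster()
    reconcile(cluster, "job-1", crash_before_status_update=True)  # old operator, T0-T1
    print(reconcile(cluster, "job-1"))  # new operator, T2-T4: prints SUBMISSION_FAILED
```

In this model the second pass fails only because the pod was created but the status write was lost. With `reconcile_with_guard`, the same restart leaves the application in the submitted state instead.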

Reproduction Code

Continuously submit a large number of jobs, and restart spark-operator while the submissions are in progress.

Expected behavior

All jobs start successfully, including those being submitted while spark-operator restarts.

Actual behavior

Some jobs fail with the error "driver pod already exist".

Environment & Versions

  • Kubernetes Version: 1.33
  • Spark Operator Version: 2.3.0
  • Apache Spark Version:

Additional context

No response

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

Labels

kind/bug
