Stop cancelling already-succeeded executor pods on Spark job completion by sudiptob2 · Pull Request #99 · armadaproject/armada-spark

sudiptob2 · 2026-02-09T17:47:59Z

⚠️ THIS CODE IS GENERATED USING GEN-AI TOOLS

What

Skip cancelling executor jobs that already exited successfully during Spark application shutdown.

Why

When a Spark job completes, the shutdown sequence cancels all executor jobs in Armada -- including ones that already exited successfully. This makes successful pods appear as "Cancelled" in Armada's UI and logs, misrepresenting the actual job outcome.

Changes

- Track executors that reach terminal states (succeeded, failed, cancelled) in a concurrent set, and exclude them from the cancellation batch during shutdown
- Add a grace period before cancellation so the event watcher can capture Succeeded/Failed events from executors that exit after receiving StopExecutor
- Move event watcher shutdown to after cancellation so terminal events are captured during the grace period
- Add onExecutorSucceeded handler -- previously, succeeded executors were incorrectly reported as failed with "Unexpected success"
- Use safeRemoveExecutor to tolerate the RPC endpoint being gone during the shutdown grace period
- Fix client-mode E2E test race condition where armadactl watch --exit-if-inactive exits immediately because the job set doesn't exist yet; use submit future completion as the job-done signal instead
- Accept Succeeded pod phase in waitForPod so fast-completing client-mode pods are detected
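
The first item in the list above (track terminal executors in a concurrent set and exclude them from the cancellation batch) could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: `ExecutorTracker`, `register`, and `allExecutors` are hypothetical names introduced here, and only `terminalExecutors`, `markTerminal`, and `getActiveExecutorIds` are names that appear in the PR itself.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.mutable

// Hypothetical stand-in for the backend's executor bookkeeping.
final class ExecutorTracker {
  // Every executor ID ever submitted, in submission order (assumed structure).
  private val allExecutors = mutable.LinkedHashSet.empty[String]

  // Thread-safe set: the event watcher thread adds to it while the
  // shutdown path reads from it.
  private val terminalExecutors: java.util.Set[String] =
    ConcurrentHashMap.newKeySet[String]()

  def register(executorId: String): Unit = synchronized {
    allExecutors += executorId
  }

  // Called when an executor is seen as succeeded, failed, or cancelled.
  def markTerminal(executorId: String): Unit =
    terminalExecutors.add(executorId)

  // Only these IDs would go into the cancellation batch at shutdown.
  def getActiveExecutorIds: Seq[String] = synchronized {
    allExecutors.filterNot(id => terminalExecutors.contains(id)).toSeq
  }
}
```

With this shape, an executor that already reported success never reaches the cancel request, so its final state in Armada is left untouched.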

Tests

New unit tests verify that getActiveExecutorIds correctly excludes failed, succeeded, cancelled, and unschedulable executors, that terminal executors are cleaned from the pending set, and that terminal tracking is thread-safe.

How to verify

- Run mvn test to confirm unit tests pass
- Run E2E tests in both cluster and client deploy modes
- Check Armada UI after a successful Spark job -- executor pods should show "Succeeded" instead of "Cancelled"

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>

src/main/scala/org/apache/spark/scheduler/cluster/armada/ArmadaClusterManagerBackend.scala

.claude/settings.json

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>

GeorgeJahad · 2026-02-10T21:15:28Z

Since this is a draft PR, I haven't studied all the details, but it looks pretty good to me now.

Were you thinking of doing anything else before moving it out of draft status?

sudiptob2 · 2026-02-10T21:51:18Z

@GeorgeJahad, I am thinking of removing the third-party plugins for now, and either adding plugins later or coming up with our own. Otherwise, it looks good to me.

REMOVE
"comprehensive-review@claude-code-workflows": true,
"unit-testing@claude-code-workflows": true,
"error-debugging@claude-code-workflows": true

GeorgeJahad · 2026-02-11T04:29:09Z

src/main/scala/org/apache/spark/scheduler/cluster/armada/ArmadaClusterManagerBackend.scala

  private val pendingExecutors = new mutable.HashSet[String]()

+  /** Tracks executors that have reached a terminal state (succeeded, failed, cancelled) */
+  private val terminalExecutors: java.util.Set[String] =


NIT: I find this method name a bit misleading, "terminal" to me sounds like they are terminating but haven't yet terminated.

To me "terminatedExecutors" seems clearer. Feel free to ignore this comment if you disagree.

To me, terminalExecutors was intended to represent all final states (succeeded, failed, cancelled), not only forcefully terminated ones; terminated felt narrower. But I'm happy to rename it if terminatedExecutors sounds clearer.

no, if it is just me, then leave it.

GeorgeJahad · 2026-02-11T04:35:49Z

src/main/scala/org/apache/spark/scheduler/cluster/armada/ArmadaClusterManagerBackend.scala

    // Configure TLS
    val useTls         = conf.get(ARMADA_EVENT_WATCHER_USE_TLS)
    val channelBuilder = NettyChannelBuilder.forAddress(host, port)

    val channelBuilderWithTls = if (useTls) {
      logInfo("Using TLS for event watcher gRPC channel")
      channelBuilder.useTransportSecurity()
    } else {
      logInfo("Using plaintext for event watcher gRPC channel")
      channelBuilder.usePlaintext()
    }

    val channel = token match {
      case Some(t) =>
        val metadata = new Metadata()
        metadata.put(
          Metadata.Key.of("Authorization", Metadata.ASCII_STRING_MARSHALLER),
          "Bearer " + t
        )
        channelBuilderWithTls
          .intercept(MetadataUtils.newAttachHeadersInterceptor(metadata))
          .build()
      case None =>
        channelBuilderWithTls.build()
    }



I believe this code isn't used anymore and should have been removed when the armada client was updated. If possible, please remove it.

Does the client library now handle event watcher TLS? Because, as far as I remember, without TLS event watcher won't work in C3.

Tracked here: G-Research/spark#174

GeorgeJahad · 2026-02-11T04:44:50Z

src/main/scala/org/apache/spark/scheduler/cluster/armada/ArmadaClusterManagerBackend.scala

+  /** Mark an executor as having reached a terminal state and clean it from pending set.
+    */
+  private def markTerminal(executorId: String): Unit = {
+    terminalExecutors.add(executorId)


This leaves a window during which an executor can be in both terminalExecutors and pendingExecutors. Unless there is a good reason for that, I would prefer it not to be the case.
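
One way to close the window described in this comment is to guard both sets with a single lock, so the removal from the pending set and the addition to the terminal set are observed together. The sketch below is an assumption about how that could be done; the actual fix (commit 45ec63f, "atomize markTerminal to prevent dual-set membership") is not shown in this thread and may differ. `ExecutorStateSets`, `addPending`, and `snapshot` are hypothetical names.

```scala
import scala.collection.mutable

// Hypothetical holder for the two sets discussed in the review comment.
final class ExecutorStateSets {
  private val lock = new Object
  private val pendingExecutors  = mutable.HashSet.empty[String]
  private val terminalExecutors = mutable.HashSet.empty[String]

  // Refuse to re-add an executor that has already reached a terminal state.
  def addPending(executorId: String): Unit = lock.synchronized {
    if (!terminalExecutors.contains(executorId)) pendingExecutors += executorId
  }

  // Both mutations happen under the same lock, so no reader that also
  // takes the lock can see the ID in both sets at once.
  def markTerminal(executorId: String): Unit = lock.synchronized {
    pendingExecutors -= executorId
    terminalExecutors += executorId
  }

  // Consistent view of (pending, terminal) for readers.
  def snapshot: (Set[String], Set[String]) = lock.synchronized {
    (pendingExecutors.toSet, terminalExecutors.toSet)
  }
}
```

The trade-off is that every reader has to go through the same lock; a lock-free concurrent set on its own cannot make a two-set update atomic.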

GeorgeJahad · 2026-02-11T04:58:02Z

src/test/scala/org/apache/spark/scheduler/cluster/armada/ArmadaClusterManagerBackendSuite.scala

+    backend.getPendingExecutorCount shouldBe 1
+  }
+
+  test("thread safety of terminal executor tracking") {


Test is a bit confusing, how about a comment like:
"Use multiple threads to terminate half the jobs, then confirm the number of remaining active ones"
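
For readers unfamiliar with this kind of test, the idea in the suggested comment can be sketched on its own as below. This is not the test from the PR: `TerminalTrackingCheck` and `remainingActive` are hypothetical names, and the sketch only exercises a concurrent set directly rather than the real backend.

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}

object TerminalTrackingCheck {
  // Marks every even-numbered executor as terminal from a thread pool,
  // then returns how many executors are still considered active.
  def remainingActive(total: Int, threads: Int): Int = {
    val terminal: java.util.Set[String] = ConcurrentHashMap.newKeySet[String]()
    val ids  = (0 until total).map(_.toString)
    val pool = Executors.newFixedThreadPool(threads)
    try {
      ids.filter(_.toInt % 2 == 0).foreach { id =>
        pool.execute(new Runnable {
          def run(): Unit = terminal.add(id)
        })
      }
    } finally {
      pool.shutdown() // stop accepting work; queued tasks still run
    }
    require(
      pool.awaitTermination(10, TimeUnit.SECONDS),
      "workers did not finish in time"
    )
    ids.count(id => !terminal.contains(id))
  }
}
```

If the set were not thread-safe, concurrent `add` calls could be lost and the count of remaining active executors would come out higher than half.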

GeorgeJahad · 2026-02-11T05:05:00Z

.claude/hooks/verify-build.sh

+CHANGED_FILES=$(git diff --name-only HEAD 2>/dev/null; git diff --name-only --cached HEAD 2>/dev/null; git ls-files --others --exclude-standard 2>/dev/null)
+BUILD_FILES=$(echo "$CHANGED_FILES" | grep -E '\.(scala|java)$|pom\.xml' | head -1)
+if [ -z "$BUILD_FILES" ]; then
+  echo '{"systemMessage": "Skipped build verification (no code changes detected)"}'


I've been running my instance with this hook, and I don't think I've ever seen this message, even though I don't change my files very often.

Do you see it in your runs when you don't change files?

Yes, I do see it when I make doc-only changes.

I always see this instead:

Stop says: Build verification passed

Oddly, for me it seems to run a compile no matter what I ask:

❯ what is the current load average
● Bash(uptime)
  ⎿ 07:44:11 up 12:48, 2 users, load average: 0.23, 0.23, 0.09
● Load averages: 0.23, 0.23, 0.09 (1min, 5min, 15min). Pretty idle.
  ⎿ Stop says: Build verification passed

I did update it a bit in a later commit, after the initial version. Did you pull the latest one?

GeorgeJahad · 2026-02-11T05:09:08Z

CLAUDE.md

+- **ArmadaEventWatcher** — Long-lived daemon thread with `volatile running` flag; 5s join timeout on shutdown
+- **PodSpecConverter** — Bidirectional Fabric8 <-> Protobuf; hardcodes None/empty for version-incompatible fields (dnsConfig, ephemeralContainers, hostUsers, os, schedulingGates)
+- **Config** — 100+ entries via Spark's `ConfigBuilder` API; all prefixed `spark.armada.*`
+


ArmadaClientApplication class?

added ArmadaClientApplication

GeorgeJahad · 2026-02-11T05:18:15Z

CLAUDE.md

+## Testing Standards
+
+- **Framework:** ScalaTest 3.2.16 (`AnyFunSuite` style exclusively)
+- **Mocking:** Mockito 5.12 (`mock(classOf[...])`, `when(...).thenReturn(...)`)


there are a lot of hard coded version numbers of dependencies in this file that have been copied over from the pom file.

Wouldn't it be better to tell claude to read the pom for the versions of these dependencies?

Yeah, it now refers to pom.xml. If it drifts too much, we can always ask Claude to update the CLAUDE.md.

GeorgeJahad · 2026-02-11T05:25:13Z

I've given you a bunch of nits to clean up, but the code is basically ready. I'll approve soon.

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>

GeorgeJahad · 2026-02-11T17:57:56Z

This all looks good, but in standup we said we would make the unofficial plugins a recommendation. Once that is done, I'll approve.

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>

sudiptob2 · 2026-02-11T18:30:24Z

Thanks, @GeorgeJahad, for the thorough review.

As we continue using Claude, we will likely identify more optimal plugins for this project. For now, these three plugins look good to me, so I recommended them as optional.
I have moved them to a local settings template and added a section in the README about working with Claude Code:
df2265e

GeorgeJahad

lgtm! Very good work @sudiptob2!

thanks!

sudiptob2 added 2 commits February 9, 2026 12:38

initial claude setup

bba5820

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>

fix wrong status on job completion

9c455e1

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>

sudiptob2 force-pushed the fix/161/executor-without-cancelling branch from 38358b2 to 9c455e1 Compare February 9, 2026 21:36

GeorgeJahad reviewed Feb 10, 2026

src/main/scala/org/apache/spark/scheduler/cluster/armada/ArmadaClusterManagerBackend.scala (outdated)

GeorgeJahad reviewed Feb 10, 2026

.claude/settings.json (outdated)

sudiptob2 added 3 commits February 10, 2026 12:22

use markTerminal in doKillExecutors

e184b6f

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>

optimize claude setup

91b4397

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>

improve claude setup with docs and ignores

feea3b3

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>

sudiptob2 marked this pull request as ready for review February 10, 2026 21:51

GeorgeJahad reviewed Feb 11, 2026

sudiptob2 added 3 commits February 11, 2026 10:25

atomize markTerminal to prevent dual-set membership

45ec63f

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>

update CLAUDE.md with key class and pom refs

6732dc9

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>

use safeRemoveExecutor in doKillExecutors

82e01c2

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>

sudiptob2 mentioned this pull request Feb 11, 2026

Clenaup dead code related to even watcher after ArmadaClient update G-Research/spark#174

Open

move third-party plugins to opt-in local template

df2265e

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>

GeorgeJahad approved these changes Feb 11, 2026

sudiptob2 merged commit d50d68a into armadaproject:master Feb 12, 2026
12 checks passed

sudiptob2 deleted the fix/161/executor-without-cancelling branch February 12, 2026 14:41
