
Stop cancelling already-succeeded executor pods on Spark job completion#99

Merged
sudiptob2 merged 9 commits into armadaproject:master from sudiptob2:fix/161/executor-without-cancelling
Feb 12, 2026

Conversation

@sudiptob2
Collaborator

@sudiptob2 sudiptob2 commented Feb 9, 2026

⚠️ THIS CODE IS GENERATED USING GEN-AI TOOLS

Fixes G-Research/spark#161

What

Skip cancelling executor jobs that already exited successfully during Spark application shutdown.

Why

When a Spark job completes, the shutdown sequence cancels all executor jobs in Armada -- including ones that already exited successfully. This makes successful pods appear as "Cancelled" in Armada's UI and logs, misrepresenting the actual job outcome.

Changes

  • Track executors that reach terminal states (succeeded, failed, cancelled) in a concurrent set, and exclude them from the cancellation batch during shutdown
  • Add a grace period before cancellation so the event watcher can capture Succeeded/Failed events from executors that exit after receiving StopExecutor
  • Move event watcher shutdown to after cancellation so terminal events are captured during the grace period
  • Add onExecutorSucceeded handler -- previously, succeeded executors were incorrectly reported as failed with "Unexpected success"
  • Use safeRemoveExecutor to tolerate the RPC endpoint being gone during the shutdown grace period
  • Fix client-mode E2E test race condition where armadactl watch --exit-if-inactive exits immediately because the job set doesn't exist yet; use submit future completion as the job-done signal instead
  • Accept Succeeded pod phase in waitForPod so fast-completing client-mode pods are detected
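
The tracking and shutdown ordering described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `ExecutorTracker`, `knownExecutors`, `register`, `shutdown`, the `cancel` callback, and the `gracePeriodMs` parameter are hypothetical names. Only `terminalExecutors`, `markTerminal`, and `getActiveExecutorIds` echo identifiers mentioned in this PR, and their real signatures may differ.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.mutable

// Hypothetical, simplified stand-in for the scheduler backend's bookkeeping.
class ExecutorTracker {
  // All executor IDs submitted so far (guarded by `this`).
  private val knownExecutors = mutable.LinkedHashSet[String]()

  // Concurrent set of executors that reached a terminal state
  // (succeeded, failed, cancelled); may be written from a watcher thread.
  private val terminalExecutors: java.util.Set[String] =
    ConcurrentHashMap.newKeySet[String]()

  def register(executorId: String): Unit = synchronized {
    knownExecutors += executorId
  }

  def markTerminal(executorId: String): Unit = {
    terminalExecutors.add(executorId)
  }

  // Executors that still need cancelling: known minus terminal.
  def getActiveExecutorIds: Seq[String] = synchronized {
    knownExecutors.filterNot(id => terminalExecutors.contains(id)).toSeq
  }

  // Shutdown order: wait for a grace period so late terminal events
  // can still be recorded, then cancel only what is still active.
  def shutdown(gracePeriodMs: Long)(cancel: Seq[String] => Unit): Unit = {
    Thread.sleep(gracePeriodMs)
    val toCancel = getActiveExecutorIds
    if (toCancel.nonEmpty) cancel(toCancel)
    // In this sketch, the event watcher would be stopped only after this point.
  }
}
```

With three registered executors of which one has been marked terminal, `shutdown` passes only the other two to `cancel`, so the finished executor keeps its real outcome.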

Tests

New unit tests verify that getActiveExecutorIds correctly excludes failed, succeeded, cancelled, and unschedulable executors, that terminal executors are cleaned from the pending set, and that terminal tracking is thread-safe.

How to verify

  • Run mvn test to confirm unit tests pass
  • Run E2E tests in both cluster and client deploy modes
  • Check Armada UI after a successful Spark job -- executor pods should show "Succeeded" instead of "Cancelled"

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>
Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>
@sudiptob2 sudiptob2 force-pushed the fix/161/executor-without-cancelling branch from 38358b2 to 9c455e1 Compare February 9, 2026 21:36
Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>
Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>
Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>
@GeorgeJahad
Collaborator

Since this is a draft PR, I haven't studied all the details, but it looks pretty good to me now.

Were you thinking of doing anything else before moving it out of draft status?

@sudiptob2
Collaborator Author

@GeorgeJahad, I am thinking of removing the third-party plugins for now, and either adding plugins later or coming up with our own. Otherwise it looks good to me.

REMOVE
"comprehensive-review@claude-code-workflows": true,
"unit-testing@claude-code-workflows": true,
"error-debugging@claude-code-workflows": true

@sudiptob2 sudiptob2 marked this pull request as ready for review February 10, 2026 21:51
private val pendingExecutors = new mutable.HashSet[String]()

/** Tracks executors that have reached a terminal state (succeeded, failed, cancelled) */
private val terminalExecutors: java.util.Set[String] =
Collaborator

@GeorgeJahad GeorgeJahad Feb 11, 2026

NIT: I find this name a bit misleading; to me, "terminal" sounds like the executors are terminating but haven't yet terminated.

"terminatedExecutors" seems clearer to me. Feel free to ignore this comment if you disagree.

Collaborator Author

To me, terminalExecutors intentionally represents all final states (succeeded, failed, cancelled), not only forcefully terminated ones; terminated felt narrower. But I'm happy to rename it if terminatedExecutors sounds clearer.

Collaborator

no, if it is just me, then leave it.

Comment on lines 187 to 212
// Configure TLS
val useTls = conf.get(ARMADA_EVENT_WATCHER_USE_TLS)
val channelBuilder = NettyChannelBuilder.forAddress(host, port)

val channelBuilderWithTls = if (useTls) {
  logInfo("Using TLS for event watcher gRPC channel")
  channelBuilder.useTransportSecurity()
} else {
  logInfo("Using plaintext for event watcher gRPC channel")
  channelBuilder.usePlaintext()
}

val channel = token match {
  case Some(t) =>
    val metadata = new Metadata()
    metadata.put(
      Metadata.Key.of("Authorization", Metadata.ASCII_STRING_MARSHALLER),
      "Bearer " + t
    )
    channelBuilderWithTls
      .intercept(MetadataUtils.newAttachHeadersInterceptor(metadata))
      .build()
  case None =>
    channelBuilderWithTls.build()
}

Collaborator

I believe this code isn't used anymore and should have been removed when the armada client was updated. If possible, please remove it.

Collaborator Author

Does the client library now handle event watcher TLS? As far as I remember, the event watcher won't work in C3 without TLS.

Collaborator Author

Tracked here: G-Research/spark#174

/** Mark an executor as having reached a terminal state and clean it from pending set.
  */
private def markTerminal(executorId: String): Unit = {
  terminalExecutors.add(executorId)
Collaborator

This leaves a window during which an executor can be in both terminalExecutors and pendingExecutors. Unless there is a good reason for that, I would prefer to avoid it.
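
One way to keep an executor from ever being visible in both sets, sketched under assumptions: guard both sets with a single lock so that removal from the pending set and insertion into the terminal set happen as one step. `ExecutorStateSets`, `addPending`, and `snapshot` are hypothetical names, and the PR's actual change may look different.

```scala
import scala.collection.mutable

// Hypothetical helper: both sets share one lock, so no reader can
// observe an executor in pending and terminal at the same time.
class ExecutorStateSets {
  private val lock = new Object
  private val pendingExecutors = mutable.HashSet[String]()
  private val terminalExecutors = mutable.HashSet[String]()

  def addPending(executorId: String): Unit = lock.synchronized {
    // Never resurrect an executor that has already finished.
    if (!terminalExecutors.contains(executorId)) pendingExecutors += executorId
  }

  def markTerminal(executorId: String): Unit = lock.synchronized {
    pendingExecutors -= executorId
    terminalExecutors += executorId
  }

  // Consistent view of both sets, taken under the same lock.
  def snapshot: (Set[String], Set[String]) = lock.synchronized {
    (pendingExecutors.toSet, terminalExecutors.toSet)
  }
}
```

The trade-off is that a plain lock serialises all updates, which is usually acceptable for bookkeeping of this size but gives up the lock-free reads of a concurrent set.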

backend.getPendingExecutorCount shouldBe 1
}

test("thread safety of terminal executor tracking") {
Collaborator

The test is a bit confusing. How about a comment like:
"Use multiple threads to terminate half the jobs, then confirm the number of remaining active ones"
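
A sketch of what such a test could look like, following the suggested comment. It uses plain JVM threads and a bare concurrent set rather than the real backend and ScalaTest, so `ThreadSafetySketch`, `remainingActive`, and the executor counts are illustrative assumptions.

```scala
import java.util.concurrent.{ConcurrentHashMap, CountDownLatch, Executors, TimeUnit}

object ThreadSafetySketch {
  // Use multiple threads to terminate half the jobs, then return
  // the number of remaining active ones.
  def remainingActive(total: Int, threads: Int): Int = {
    val terminal: java.util.Set[String] = ConcurrentHashMap.newKeySet[String]()
    val pool = Executors.newFixedThreadPool(threads)
    val start = new CountDownLatch(1)

    // Even-numbered executors are marked terminal, split into chunks
    // that are handed to the worker threads.
    val toTerminate = (0 until total).filter(_ % 2 == 0).map(_.toString)
    toTerminate.grouped(math.max(1, toTerminate.size / threads)).foreach { chunk =>
      pool.submit(new Runnable {
        override def run(): Unit = {
          start.await() // hold workers back so they start together
          chunk.foreach(id => terminal.add(id))
        }
      })
    }

    start.countDown()
    pool.shutdown()
    require(pool.awaitTermination(10, TimeUnit.SECONDS), "workers did not finish")

    // Active executors are those never marked terminal.
    (0 until total).map(_.toString).count(id => !terminal.contains(id))
  }
}
```

Calling `ThreadSafetySketch.remainingActive(100, 4)` marks the 50 even-numbered IDs from several threads, so the expected number of active executors is the 50 odd-numbered ones.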

CHANGED_FILES=$(git diff --name-only HEAD 2>/dev/null; git diff --name-only --cached HEAD 2>/dev/null; git ls-files --others --exclude-standard 2>/dev/null)
BUILD_FILES=$(echo "$CHANGED_FILES" | grep -E '\.(scala|java)$|pom\.xml' | head -1)
if [ -z "$BUILD_FILES" ]; then
echo '{"systemMessage": "Skipped build verification (no code changes detected)"}'
Collaborator

I've been running my instance with this hook, and I don't think I've ever seen this message, even though I don't change my files very often.

Do you see it in your runs when you don't change files?

Collaborator Author

Yes, I do see it if I make doc-only changes.

Screenshot 2026-02-11 at 12 08 37 AM

Collaborator

I always see this instead:

Stop says: Build verification passed                                                    

Collaborator

Oddly, for me it seems to run a compile no matter what I ask:

                                                                                            
❯ what is the current load average                                                           
                                                                                             
● Bash(uptime)                                                                               
  ⎿   07:44:11 up 12:48,  2 users,  load average: 0.23, 0.23, 0.09                           
                                                                                             
● Load averages: 0.23, 0.23, 0.09 (1min, 5min, 15min). Pretty idle.                        
  ⎿  Stop says: Build verification passed             

Collaborator Author

I did update it a bit in a later commit, after the initial version. Did you pull the latest one?

- **ArmadaEventWatcher** — Long-lived daemon thread with `volatile running` flag; 5s join timeout on shutdown
- **PodSpecConverter** — Bidirectional Fabric8 <-> Protobuf; hardcodes None/empty for version-incompatible fields (dnsConfig, ephemeralContainers, hostUsers, os, schedulingGates)
- **Config** — 100+ entries via Spark's `ConfigBuilder` API; all prefixed `spark.armada.*`

Collaborator

ArmadaClientApplication class?

Collaborator Author

added ArmadaClientApplication

CLAUDE.md Outdated
## Testing Standards

- **Framework:** ScalaTest 3.2.16 (`AnyFunSuite` style exclusively)
- **Mocking:** Mockito 5.12 (`mock(classOf[...])`, `when(...).thenReturn(...)`)
Collaborator

@GeorgeJahad GeorgeJahad Feb 11, 2026

There are a lot of hard-coded dependency version numbers in this file that have been copied over from the pom file.

Wouldn't it be better to tell Claude to read the pom for the versions of these dependencies?

Collaborator Author

Yeah, CLAUDE.md now refers to pom.xml for the versions. If it drifts too much, we can always ask Claude to update CLAUDE.md.

@GeorgeJahad
Collaborator

I've given you a bunch of nits to clean up, but the code is basically ready. I'll approve soon.

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>
Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>
Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>
@GeorgeJahad
Collaborator

This all looks good, but in standup we said we would make the unofficial plugins a recommendation. Once that is done, I'll approve.

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>
@sudiptob2
Collaborator Author

Thanks, @GeorgeJahad, for the thorough review.

As we continue using Claude, we will likely identify more optimal plugins for this project. For now, these three plugins look good to me, so I recommended them as optional.
I have moved them to a local settings template and added a section in the README about working with Claude Code:
df2265e

Collaborator

@GeorgeJahad GeorgeJahad left a comment

lgtm! Very good work @sudiptob2!

thanks!

@sudiptob2 sudiptob2 merged commit d50d68a into armadaproject:master Feb 12, 2026
12 checks passed
@sudiptob2 sudiptob2 deleted the fix/161/executor-without-cancelling branch February 12, 2026 14:41

Development

Successfully merging this pull request may close these issues.

Spark Job Completion makes it look like successful pods were cancelled

2 participants