Skip to content

Commit d50d68a

Browse files
authored
Stop cancelling already-succeeded executor pods on Spark job completion (#99)
Fixes G-Research/spark#161 What Skip cancelling executor jobs that already exited successfully during Spark application shutdown. Why When a Spark job completes, the shutdown sequence cancels all executor jobs in Armada -- including ones that already exited successfully. This makes successful pods appear as "Cancelled" in Armada's UI and logs, misrepresenting the actual job outcome. Changes - Track executors that reach terminal states (succeeded, failed, cancelled) in a concurrent set, and exclude them from the cancellation batch during shutdown - Add a grace period before cancellation so the event watcher can capture Succeeded/Failed events from executors that exit after receiving StopExecutor - Move event watcher shutdown to after cancellation so terminal events are captured during the grace period - Add onExecutorSucceeded handler -- previously, succeeded executors were incorrectly reported as failed with "Unexpected success" - Use safeRemoveExecutor to tolerate the RPC endpoint being gone during the shutdown grace period - Fix client-mode E2E test race condition where armadactl watch --exit-if-inactive exits immediately because the job set doesn't exist yet; use submit future completion as the job-done signal instead - Accept Succeeded pod phase in waitForPod so fast-completing client-mode pods are detected Tests New unit tests verify that getActiveExecutorIds correctly excludes failed, succeeded, cancelled, and unschedulable executors, that terminal executors are cleaned from the pending set, and that terminal tracking is thread-safe. How to verify - Run mvn test to confirm unit tests pass - Run E2E tests in both cluster and client deploy modes - Check Armada UI after a successful Spark job -- executor pods should show "Succeeded" instead of "Cancelled" <img width="1506" height="348" alt="image" src="https://github.com/user-attachments/assets/0a364f39-4971-4fdb-9d73-45bbca0f613c" /> --------- Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>
1 parent d9c3665 commit d50d68a

File tree

17 files changed

+747
-67
lines changed

17 files changed

+747
-67
lines changed

.claude/commands/build.md

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
1+
Build the project.
2+
3+
If argument is "quick" or "fast", skip tests:
4+
```
5+
mvn clean package -DskipTests
6+
```
7+
8+
Otherwise run the full build with tests:
9+
```
10+
mvn clean package
11+
```
12+
13+
Rules:
14+
1. Before building, run `mvn spotless:check` first — if formatting fails, run `mvn spotless:apply` and report which files were fixed
15+
2. If the build fails, read the error output and provide a clear summary of what went wrong
16+
3. Do NOT automatically fix build errors — report them and let the user decide
17+
4. Show the final artifact path on success (target/*.jar)

.claude/commands/ci-local.md

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,26 @@
1+
Run the full CI pipeline locally to verify everything passes before pushing.
2+
3+
This mirrors what GitHub Actions runs. Execute these steps in order, stopping on first failure:
4+
5+
1. **Lint**: `mvn spotless:check`
6+
- If it fails, ask the user if they want to auto-fix with `mvn spotless:apply`
7+
8+
2. **Compile**: `mvn compile`
9+
- Report any compilation errors clearly
10+
11+
3. **Test**: `mvn test`
12+
- Summarize test results (total, passed, failed, skipped)
13+
14+
4. **Package**: `mvn package -DskipTests`
15+
- Confirm the JAR was built successfully
16+
17+
Report a final summary:
18+
```
19+
CI Local Results:
20+
Lint: PASS/FAIL
21+
Compile: PASS/FAIL
22+
Test: PASS/FAIL (X passed, Y failed)
23+
Package: PASS/FAIL
24+
```
25+
26+
Do NOT continue to the next step if any step fails — report the failure and stop.

.claude/commands/commit.md

Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,29 @@
1+
Create a git commit for the current staged/unstaged changes.
2+
3+
Rules:
4+
1. Run `git status` and `git diff` to understand what changed
5+
2. Stage relevant files (prefer specific files over `git add -A`)
6+
3. Write a SHORT commit message (max 50 chars for subject line, imperative mood)
7+
4. Use `--signoff` to sign off using the committer's git config (do NOT hardcode any name/email)
8+
5. Do NOT add Co-Authored-By or any other trailers
9+
6. If there are no changes, say so and stop
10+
11+
Commit format:
12+
```
13+
git commit --signoff -m "short imperative message"
14+
```
15+
16+
The `--signoff` flag automatically uses the name and email from `git config user.name` and `git config user.email`, so each collaborator's own identity is used.
17+
18+
Examples of good messages:
19+
- "add user and permission models"
20+
- "implement executor allocation logic"
21+
- "fix gang scheduling annotation prefix"
22+
- "add table-driven tests for config parsing"
23+
24+
Do NOT:
25+
- Use long descriptive messages
26+
- Add Co-Authored-By trailers
27+
- Use past tense ("added", "fixed")
28+
- Prefix with type tags ("feat:", "fix:") unless asked
29+
- Hardcode any author name or email

.claude/commands/lint.md

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
Check and fix code formatting using Spotless/Scalafmt.
2+
3+
Steps:
4+
1. Run `mvn spotless:check` to see if there are formatting violations
5+
2. If violations are found, run `mvn spotless:apply` to auto-fix them
6+
3. After applying, run `git diff --stat` to show what files were reformatted
7+
4. Summarize the changes (which files, what kind of formatting was fixed)
8+
9+
If no violations are found, say so and stop.
10+
11+
Do NOT commit the formatting changes — just apply and report.

.claude/commands/summary.md

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
1+
Generate a concise implementation summary for a PR description.
2+
3+
Steps:
4+
1. Run `git diff master...HEAD --stat` and `git log master..HEAD --oneline` to understand all changes on this branch
5+
2. Read the changed files to understand what was implemented and why
6+
3. Write a summary suitable for a GitHub PR description
7+
8+
Summary format rules:
9+
- Start with a one-line "What" statement explaining the change
10+
- Follow with a "Why" section (2-3 sentences max) explaining the motivation
11+
- List the key changes as plain bullet points (no nested bullets)
12+
- If there are new tests, mention what they cover in one line
13+
- End with a "How to verify" section with concrete steps if applicable
14+
- Keep the total summary under 30 lines
15+
- Use plain text with minimal markdown (no tables, no headers larger than ##, no code blocks unless showing a command)
16+
- Do not repeat file paths or class names unnecessarily
17+
- Focus on behavior changes, not implementation details
18+
- Write in present tense, active voice
19+
20+
Do NOT:
21+
- Use heavy markdown formatting (no bold, no tables, no badges)
22+
- List every single file changed
23+
- Include generic boilerplate like "This PR adds..."
24+
- Add emojis
25+
- Over-explain things that are obvious from the diff
26+
27+
Output the summary directly so the user can copy-paste it into the PR description.

.claude/hooks/spotless-apply.sh

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
1+
#!/bin/bash
2+
# Auto-format Scala files after Edit/Write using Spotless
3+
INPUT=$(cat)
4+
FILE_PATH=$(echo "$INPUT" | jq -r '.tool_input.file_path // empty')
5+
6+
# Only run for build-relevant files
7+
if [[ "$FILE_PATH" != *.scala && "$FILE_PATH" != *.java && "$FILE_PATH" != */pom.xml ]]; then
8+
exit 0
9+
fi
10+
11+
# Debounce: skip if Spotless ran within the last 30 seconds
12+
STAMP_FILE="/tmp/.claude-spotless-last-run"
13+
if [ -f "$STAMP_FILE" ]; then
14+
LAST_RUN=$(cat "$STAMP_FILE")
15+
NOW=$(date +%s)
16+
ELAPSED=$((NOW - LAST_RUN))
17+
if [ "$ELAPSED" -lt 30 ]; then
18+
exit 0
19+
fi
20+
fi
21+
22+
date +%s > "$STAMP_FILE"
23+
24+
RESULT=$(cd "$CLAUDE_PROJECT_DIR" && mvn spotless:apply -q 2>&1)
25+
EXIT_CODE=$?
26+
27+
if [ $EXIT_CODE -eq 0 ]; then
28+
echo "{\"systemMessage\": \"Spotless formatting applied successfully\"}"
29+
else
30+
echo "{\"systemMessage\": \"Spotless formatting failed: $RESULT\"}"
31+
fi

.claude/hooks/verify-build.sh

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,32 @@
1+
#!/bin/bash
2+
# Verify the project compiles before Claude stops
3+
INPUT=$(cat)
4+
STOP_HOOK_ACTIVE=$(echo "$INPUT" | jq -r '.stop_hook_active // false')
5+
6+
# Prevent infinite loops - skip if already triggered by a Stop hook
7+
if [ "$STOP_HOOK_ACTIVE" = "true" ]; then
8+
echo '{"systemMessage": "Skipped build verification (stop hook already active)"}'
9+
exit 0
10+
fi
11+
12+
cd "$CLAUDE_PROJECT_DIR"
13+
14+
# Skip build verification if no build-relevant files changed
15+
CHANGED_FILES=$(git diff --name-only HEAD 2>/dev/null; git diff --name-only --cached HEAD 2>/dev/null; git ls-files --others --exclude-standard 2>/dev/null)
16+
BUILD_FILES=$(echo "$CHANGED_FILES" | grep -E '\.(scala|java)$|pom\.xml' | head -1)
17+
if [ -z "$BUILD_FILES" ]; then
18+
echo '{"systemMessage": "Skipped build verification (no code changes detected)"}'
19+
exit 0
20+
fi
21+
22+
RESULT=$(mvn compile -q 2>&1)
23+
EXIT_CODE=$?
24+
25+
if [ $EXIT_CODE -ne 0 ]; then
26+
ESCAPED_RESULT=$(echo "$RESULT" | head -20 | jq -Rs .)
27+
echo "{\"systemMessage\": \"Build failed. Fix compilation errors before finishing: ${ESCAPED_RESULT}\"}" >&2
28+
exit 2
29+
fi
30+
31+
echo '{"systemMessage": "Build verification passed"}'
32+
exit 0

.claude/settings.json

Lines changed: 84 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,84 @@
1+
{
2+
"permissions": {
3+
"allow": [
4+
"Read",
5+
"Edit",
6+
"Write",
7+
"Glob",
8+
"Grep",
9+
"Bash(mvn *)",
10+
"Bash(mvn)",
11+
"Bash(./scripts/*)",
12+
"Bash(git status)",
13+
"Bash(git status *)",
14+
"Bash(git diff)",
15+
"Bash(git diff *)",
16+
"Bash(git log *)",
17+
"Bash(git log)",
18+
"Bash(git add *)",
19+
"Bash(git commit *)",
20+
"Bash(git checkout *)",
21+
"Bash(git branch *)",
22+
"Bash(git branch)",
23+
"Bash(git switch *)",
24+
"Bash(git merge *)",
25+
"Bash(git stash *)",
26+
"Bash(git stash)",
27+
"Bash(git tag *)",
28+
"Bash(git push *)",
29+
"Bash(git push)",
30+
"Bash(git pull *)",
31+
"Bash(git pull)",
32+
"Bash(git fetch *)",
33+
"Bash(git fetch)",
34+
"Bash(git remote *)",
35+
"Bash(git config *)",
36+
"Bash(gh *)",
37+
"Bash(ls *)",
38+
"Bash(ls)",
39+
"Bash(mkdir *)",
40+
"Bash(which *)",
41+
"Bash(java *)",
42+
"Bash(scala *)",
43+
"Bash(docker *)"
44+
],
45+
"deny": [
46+
"Bash(rm -rf *)",
47+
"Bash(git reset --hard *)",
48+
"Bash(git push --force *)",
49+
"Bash(git push -f *)",
50+
"Bash(git clean -f *)"
51+
]
52+
},
53+
"hooks": {
54+
"PostToolUse": [
55+
{
56+
"matcher": "Edit|Write",
57+
"hooks": [
58+
{
59+
"type": "command",
60+
"command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/spotless-apply.sh",
61+
"timeout": 120,
62+
"statusMessage": "Formatting with Spotless...",
63+
"async": true
64+
}
65+
]
66+
}
67+
],
68+
"Stop": [
69+
{
70+
"hooks": [
71+
{
72+
"type": "command",
73+
"command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/verify-build.sh",
74+
"timeout": 300,
75+
"statusMessage": "Verifying build compiles..."
76+
}
77+
]
78+
}
79+
]
80+
},
81+
"enabledPlugins": {
82+
"code-simplifier@claude-plugins-official": true
83+
}
84+
}
Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
1+
{
2+
"_comment": [
3+
"Optional third-party plugins for Claude Code.",
4+
"These are community plugins (not Anthropic-maintained) — review before enabling.",
5+
"Source: https://github.com/wshobson/agents",
6+
"",
7+
"Setup steps:",
8+
" 1. Add the marketplace: /plugin marketplace add wshobson/agents",
9+
" 2. Copy this file: cp .claude/settings.local.example.json .claude/settings.local.json",
10+
" 3. Restart Claude Code to load the plugins."
11+
],
12+
"enabledPlugins": {
13+
"comprehensive-review@claude-code-workflows": true,
14+
"unit-testing@claude-code-workflows": true,
15+
"error-debugging@claude-code-workflows": true
16+
}
17+
}

.claudeignore

Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,28 @@
1+
# Build artifacts
2+
target/
3+
build/
4+
out/
5+
bin/
6+
7+
# IDE/Editor caches
8+
.idea/
9+
.vscode/
10+
.metals/
11+
.bloop/
12+
*.iml
13+
14+
# Spark test dependencies (~586MB)
15+
.spark-*/
16+
17+
# Generated/temporary files
18+
dependency-reduced-pom.xml
19+
scripts/.tmp/
20+
extraJars/*.jar
21+
.history
22+
.bsp
23+
24+
# Jupyter workspace
25+
example/jupyter/workspace/
26+
27+
# macOS
28+
.DS_Store

0 commit comments

Comments
 (0)