38 changes: 25 additions & 13 deletions nomad/scripts/wait_for_jobs.sh
@@ -41,13 +41,14 @@ function get_status() {
   echo "$client_status"
 }
 
+# See https://developer.hashicorp.com/nomad/commands/job/status#running
 function is_running() {
   local status="$1"
   case "$status" in
-    complete|failed|lost|stopped|dead|unknown)
-      return 1 ;; # not running
+    running)
+      return 0 ;; # still running
     *)
-      return 0 ;; # still running (e.g., pending, running)
+      return 1 ;; # not running
   esac
 }
Comment on lines +44 to 53
⚠️ Potential issue | 🔴 Critical

The is_running() function should treat "pending" as an active state.

Nomad allocations can have a ClientStatus of "pending", which represents allocations waiting to be scheduled on a client node. The official Nomad documentation shows examples of allocations with DesiredStatus="run" and ClientStatus="pending" as valid intermediate states.

The current implementation only returns 0 for "running", treating "pending" allocations as not running. This causes:

  1. Loop exits when allocation enters "pending" state
  2. is_run_success("pending") returns false
  3. Allocation marked as failed

HashiCorp's own documentation recommends filtering for allocations with ClientStatus of "running" or "pending", which contradicts the current implementation that only checks "running".

Update the function to handle both states:

function is_running() {
    local status="$1"
    case "$status" in
      running|pending)
        return 0 ;;
      *)
        return 1 ;;
    esac
}
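As a quick sanity check, the suggested pattern can be exercised standalone (a minimal sketch, not part of the script; the status strings mirror Nomad's documented ClientStatus values):

```shell
#!/usr/bin/env bash
# Minimal standalone sketch of the suggested is_running() semantics.
is_running() {
    local status="$1"
    case "$status" in
      running|pending)
        return 0 ;;  # active: waiting for a client node or executing
      *)
        return 1 ;;  # terminal: complete, failed, lost, stopped, dead, unknown
    esac
}

for s in pending running complete failed lost unknown; do
    if is_running "$s"; then
        echo "$s: active"
    else
        echo "$s: terminal"
    fi
done
```

With this change the polling loop keeps waiting through the "pending" phase instead of treating it as a terminal state.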
🤖 Prompt for AI Agents
In nomad/scripts/wait_for_jobs.sh around lines 44-53, the is_running() helper
currently returns success only for "running", treating "pending" as not active
and causing loops to exit and allocations to be marked failed; update the case
pattern so that both "running" and "pending" return 0 (i.e., treat pending as an
active intermediate state), leaving other statuses to return 1; after this
change, ensure any callers (e.g., is_run_success) rely on is_running() semantics
or are adjusted if they separately check for "pending".


@@ -114,18 +115,16 @@ function wait_for_job() {
       continue
     fi
 
-    # Job has completed
+    # Job has completed at this point, either successfully or failed.
     if is_run_success "$status"; then
-      break
+      echo "Scenario $scenario_name ($alloc_id) completed successfully in $ELAPSED_SECS seconds."
+      return 0
+    else
+      echo "Scenario $scenario_name ($alloc_id) finished with status=$status after $ELAPSED_SECS seconds."
+      print_failed_tasks_and_logs "$alloc_id" "$nomad_status"
+      return 1
     fi
-
-    echo "Scenario $scenario_name ($alloc_id) failed (status=$status) after $ELAPSED_SECS seconds."
-    print_failed_tasks_and_logs "$alloc_id" "$nomad_status"
-    exit 1
   done
-  echo "Scenario $scenario_name ($alloc_id) completed successfully in $ELAPSED_SECS seconds."
-
-  return 0
 }


@@ -148,13 +147,26 @@ SCENARIO_NAME="$1"
 
 shift # Remove the first argument (scenario name)
 
+# Count total number of allocations and track failures
+TOTAL_ALLOCATIONS=$#
+FAILED_ALLOCATIONS=0
+
 # Process each allocation ID passed as arguments
 while [[ $# -gt 0 ]]; do
   # Get next allocation ID from arguments
   alloc_id="$1"
-  wait_for_job "$SCENARIO_NAME" "$alloc_id"
+
+  if ! wait_for_job "$SCENARIO_NAME" "$alloc_id"; then
+    FAILED_ALLOCATIONS=$((FAILED_ALLOCATIONS + 1))
+  fi
+
   shift # Remove the processed allocation ID
 done
+
+# If all allocations failed, exit with error
+if [[ $FAILED_ALLOCATIONS -eq $TOTAL_ALLOCATIONS ]]; then
+  echo "Error: All $TOTAL_ALLOCATIONS allocation(s) failed." >&2
+  exit 1
+fi
Comment on lines +166 to +171
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain: verification scripts inspected nomad/scripts/wait_for_jobs.sh, nomad/scripts/ci_allocs.sh, and .github/workflows/nomad.yaml (grep/sed probes for make_allocs_csv, generate_run_summary, peer_end_count, and callers of wait_for_jobs; command output elided).

Orchestration does not filter failed allocations; peer_end_count is set statically from config.

The risk scenario in your review is valid. In ci_allocs.sh, the generate_run_summary function sets both peer_count and peer_end_count to the behavior count from the job config (line 105: peer_end_count: ($bs | length)), regardless of allocation completion status.

The allocation IDs passed to make_allocs_csv come directly from Nomad's job run output without status filtering. If an allocation is created by Nomad but fails mid-execution, it remains in the alloc_ids list and gets written to the CSV. Since peer_end_count is derived statically from the configured behaviors array—not from actual work completion—the summarizer cannot detect if work was incomplete.

The script's assumption that the downstream summarizer will detect partial failures via peer_end_count is unfounded: that metric reflects only the configured behavior count, not the actual completion count. If allocations fail before completing their work, the summarizer has no signal to detect this.

🤖 Prompt for AI Agents
In nomad/scripts/wait_for_jobs.sh around lines 166-171, the script exits only on
all-allocations-failed but downstream summarization uses a statically set
peer_end_count from job config; to fix, change the pipeline that builds
alloc_ids and generates the run summary so peer_end_count is derived from actual
completed/healthy allocations rather than the configured behaviors count: query
Nomad for each allocation's status (e.g., running/complete/failed) after job
run, filter alloc_ids to include only allocations that reached a
successful/completed state (or add a status column to the CSV), and set
peer_end_count to the count of those successful allocations (or completed tasks)
before calling generate_run_summary/make_allocs_csv so partial failures are
reflected accurately.
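Along the lines the prompt suggests, deriving the completion count could look like this (a hypothetical sketch; the `count_completed` helper is invented for illustration, and the `nomad alloc status -json | jq -r '.ClientStatus'` query is an assumption to verify against the Nomad CLI version in use, not code from this PR):

```shell
#!/usr/bin/env bash
# Hypothetical helper: count how many ClientStatus values are "complete",
# so peer_end_count can reflect actual completions rather than the
# statically configured ($bs | length).
count_completed() {
    local n=0 status
    for status in "$@"; do
        if [[ "$status" == "complete" ]]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# In the real script the statuses would come from Nomad per allocation,
# e.g. (assumed query, check against your Nomad version):
#   status=$(nomad alloc status -json "$alloc_id" | jq -r '.ClientStatus')
peer_end_count=$(count_completed complete failed complete pending)
echo "peer_end_count=$peer_end_count"
```

Feeding a count like this into generate_run_summary would give the downstream summarizer a real signal for partial failures, instead of echoing the configured behavior count.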

exit 0