feat(nomad): get peer_end_count by summing peer_end_count from run summaries inside of allocations by their run_id #581

Merged
veeso merged 1 commit into main from feat/504-get-peer_end_count-after-running-nomad-scenarios
Mar 19, 2026
Conversation

@veeso
Member

@veeso veeso commented Mar 17, 2026

Summary

When running scenarios via Nomad, the peer_end_count in the generated run_summary.jsonl was always set to the configured peer_count, ignoring agents that may have dropped during the run. This PR fixes that by fetching each allocation's run_summary.jsonl via the Nomad API, reading the actual peer_end_count, and summing them per run_id.

Changes

  • Added fetch_peer_end_count function that reads peer_end_count from an allocation's run_summary.jsonl via nomad alloc fs, with a fallback to the configured peer_count if the file is unavailable.
  • Before generating the combined run summary, iterate over all allocations to build a peer_end_count_by_run map keyed by run_id.
  • Use the summed peer_end_count instead of hardcoding it to peer_count in the jq template.
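
The aggregation described in the list above can be sketched in isolation. This is a minimal illustration, not the code from ci_allocs.sh: fetch_peer_end_count is replaced by a hypothetical stub with canned values, the three-column CSV layout is an assumed simplification of the real allocs file, and sum_peer_end_counts is a name made up for this example.

```shell
#!/usr/bin/env bash
# Requires bash 4+ for associative arrays.
set -eo pipefail

# Hypothetical stub: returns canned counts instead of calling `nomad alloc fs`.
fetch_peer_end_count() {
    local alloc_id="$1"
    local fallback_peer_count="$2"
    case "$alloc_id" in
        alloc-1) echo 3 ;;                      # one agent dropped
        alloc-2) echo 2 ;;                      # two agents dropped
        *)       echo "$fallback_peer_count" ;; # summary unavailable, use fallback
    esac
}

# Reads "alloc_id,run_id,peer_count" lines (an assumed, simplified CSV layout)
# from stdin and prints "run_id total" pairs, sorted by run_id.
sum_peer_end_counts() {
    local -A peer_end_count_by_run=()
    local alloc_id run_id peer_count count previous
    while IFS=',' read -r alloc_id run_id peer_count; do
        count=$(fetch_peer_end_count "$alloc_id" "$peer_count")
        previous=${peer_end_count_by_run[$run_id]:-0}
        peer_end_count_by_run[$run_id]=$(( previous + count ))
    done
    for run_id in "${!peer_end_count_by_run[@]}"; do
        printf '%s %s\n' "$run_id" "${peer_end_count_by_run[$run_id]}"
    done | sort
}

printf '%s\n' 'alloc-1,run-a,4' 'alloc-2,run-a,4' 'alloc-3,run-b,4' | sum_peer_end_counts
# prints:
# run-a 5
# run-b 4
```

With two allocations in run-a reporting 3 and 2 surviving peers, the run-level value becomes 5 rather than the configured 8, which is the discrepancy this PR is meant to surface.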

closes #504

TODO:

  • All code changes are reflected in docs, including module-level docs
  • All new/edited/removed scenarios are reflected in summary visualiser tool (see checklist)
  • I ran the Nomad CI workflow successfully on my branch

Note that all commits in a PR must follow Conventional Commits before it can be merged, as these are used to generate the changelog

Summary by CodeRabbit

  • Chores
    • Improved CI allocation data collection to aggregate final peer counts per run. Exported run summaries now use the aggregated values, fall back to the configured peer count when allocation-level data is missing, and emit a warning in that case, making run-level metrics more accurate.

@coderabbitai

coderabbitai bot commented Mar 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8829acb7-466f-4cd2-9305-72b7f03e9fed

📥 Commits

Reviewing files that changed from the base of the PR and between 6de2fbe and f71d584.

📒 Files selected for processing (1)
  • nomad/scripts/ci_allocs.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • nomad/scripts/ci_allocs.sh

Walkthrough

The script nomad/scripts/ci_allocs.sh now fetches each allocation's alloc/run_summary.jsonl via nomad alloc fs, extracts per-allocation peer_end_count, aggregates these counts by run_id, and uses the aggregated value (falling back to configured peer_count) when generating run-summary JSONL.

Changes

Cohort / File(s) | Summary
Nomad CI allocation script (nomad/scripts/ci_allocs.sh) | Added fetch_peer_end_count(alloc_id, fallback_peer_count), which reads alloc/run_summary.jsonl via nomad alloc fs and extracts .peer_end_count from the last record with jq. Modified generate_run_summary to build peer_end_count_by_run and use the aggregated per-run value (with fallback to peer_count) when emitting JSON objects.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Suggested reviewers

  • cdunster
  • jost-s
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title accurately describes the main change: fetching and summing peer_end_count from allocation run summaries by run_id instead of using the configured peer_count.
Description check | ✅ Passed | The PR description covers the summary, specific changes made, and references the closed issue, though the TODO checklist items remain uncompleted.
Linked Issues check | ✅ Passed | The code changes implement all primary objectives from issue #504: added fetch_peer_end_count function to retrieve and sum peer_end_count from allocations by run_id, and modified generate_run_summary to use aggregated values instead of peer_count.
Out of Scope Changes check | ✅ Passed | All changes are scoped to the ci_allocs.sh script and directly address the objectives defined in issue #504 with no unrelated modifications detected.

@github-actions
Contributor

github-actions bot commented Mar 17, 2026

The following will be added to the changelog


[0.6.1] - 2026-03-19

Features

  • (nomad) Get peer_end_count by summing peer_end_count from run summaries inside of allocations by their run_id

@veeso veeso force-pushed the feat/504-get-peer_end_count-after-running-nomad-scenarios branch 4 times, most recently from d5e0380 to 6de2fbe Compare March 18, 2026 17:00
@veeso veeso marked this pull request as ready for review March 19, 2026 08:41
@claude

claude bot commented Mar 19, 2026

Claude finished @veeso's task in 50s.


No issues found.

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
nomad/scripts/ci_allocs.sh (2)

77-77: Nitpick: Move local declaration outside the loop.

Declaring local inside the loop re-declares on each iteration. Moving it before the loop is cleaner.

✨ Suggested improvement
     declare -A peer_end_count_by_run
+    local this_alloc_peer_end_count
     while IFS=',' read -r _job_name _scenario_name alloc_id _run_id _started_at _duration _peer_count _behaviours; do
-        local this_alloc_peer_end_count
         this_alloc_peer_end_count=$(fetch_peer_end_count "$alloc_id" "$_peer_count")
         peer_end_count_by_run["$_run_id"]="$(( ${peer_end_count_by_run["$_run_id"]:-0} + this_alloc_peer_end_count ))"
     done < "$allocs_csv_file"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nomad/scripts/ci_allocs.sh` at line 77, Move the variable declaration for
this_alloc_peer_end_count out of the loop to avoid re-declaring it on each
iteration; declare it once as local before the loop that uses it (reference the
variable name this_alloc_peer_end_count and the enclosing loop in
nomad/scripts/ci_allocs.sh) and keep assignments/updates inside the loop.
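
As a side note on why this is only a style nitpick: bash's local is function-scoped, not loop-scoped, so both placements behave the same when the variable is assigned right after the declaration. A small sketch with made-up function names, not taken from ci_allocs.sh:

```shell
#!/usr/bin/env bash
set -eo pipefail

# Declares the variable inside the loop body, as in the original code.
sum_with_local_in_loop() {
    local total=0 n
    for n in 1 2 3; do
        local current   # executed on every iteration
        current=$n
        total=$(( total + current ))
    done
    # `current` is still visible here because `local` scopes to the function.
    echo "$total $current"
}

# Declares the variable once before the loop, as the review suggests.
sum_with_local_before_loop() {
    local total=0 n current
    for n in 1 2 3; do
        current=$n
        total=$(( total + current ))
    done
    echo "$total $current"
}

sum_with_local_in_loop      # prints "6 3"
sum_with_local_before_loop  # prints "6 3"
```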

150-162: Pipeline error handling may behave unexpectedly with set -eo pipefail.

With set -eo pipefail at the top of the script, the || result="" at the end of line 154 should catch failures, but the behavior can be subtle. If nomad succeeds but jq fails (e.g., malformed JSON), pipefail would cause the pipeline to return non-zero, and || result="" would then execute—this is likely the intended behavior.

However, there's a subtle case: if the file exists but is empty, jq --slurp 'last' returns null, and // 0 yields 0. This silently reports zero peers completed instead of triggering the fallback. Consider whether this is acceptable or if you want to treat 0 as a warning condition too.
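
The pipefail behaviour described above can be checked without Nomad. In this sketch failing_producer is a hypothetical stand-in for a failing nomad alloc fs call and cat stands in for jq; only the shape of the `result=$(... | ...) || result=...` line is taken from the script.

```shell
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical producer that always fails, standing in for `nomad alloc fs`.
failing_producer() { return 1; }

# Mirrors the `result=$(producer | consumer) || result=...` shape.
# `local result` is declared separately so it does not mask the exit status.
capture_or_fallback() {
    local result
    result=$(failing_producer | cat) || result="fallback"
    printf '%s\n' "${result:-<empty>}"
}

# With pipefail the failed first stage makes the whole pipeline fail,
# so the `||` branch runs.
with_pipefail=$(set -o pipefail; capture_or_fallback)

# Without pipefail only the last stage's status counts, so the failure is
# hidden and the captured output is simply empty.
without_pipefail=$(set +o pipefail; capture_or_fallback)

echo "with pipefail:    $with_pipefail"     # prints "with pipefail:    fallback"
echo "without pipefail: $without_pipefail"  # prints "without pipefail: <empty>"
```

This is why the `|| result=""` guard depends on pipefail staying enabled at the top of the script.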

💡 Optional: Add explicit check for zero result
 function fetch_peer_end_count() {
     local alloc_id="$1"
     local fallback_peer_count="$2"
     local result
     result=$(nomad alloc fs "$alloc_id" alloc/run_summary.jsonl 2>/dev/null | jq --slurp 'last | .peer_end_count // 0') || result=""
-    if [[ -z "$result" || "$result" == "null" ]]; then
+    if [[ -z "$result" || "$result" == "null" || "$result" == "0" ]]; then
         echo "Warning: could not fetch peer_end_count for alloc $alloc_id, falling back to configured peer_count ($fallback_peer_count)" >&2
         echo "$fallback_peer_count"
         return
     fi
     echo "Fetched peer_end_count for alloc $alloc_id: $result" >&2
     echo "$result"
 }

Alternatively, if 0 is a valid value (e.g., all agents failed), the current behavior is correct.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nomad/scripts/ci_allocs.sh` around lines 150 - 162, In fetch_peer_end_count,
handle the empty-file vs legitimate-zero ambiguity by first capturing nomad
output into a raw variable (e.g., raw_output) and checking if raw_output is
empty before running jq; if raw_output is empty or nomad/jq exit non-zero, log
the warning and echo the fallback_peer_count, otherwise run jq on raw_output and
if jq returns null/empty treat as fallback too, finally echo the numeric result;
update references inside fetch_peer_end_count (alloc_id, fallback_peer_count,
result) accordingly so the pipeline failure and empty-file cases both fall back
instead of silently returning 0.
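
One way the approach in this prompt could look is sketched below. It is an assumption-laden illustration rather than the merged implementation: the nomad function is a hypothetical offline stub for the real CLI, the allocation IDs and JSON records are invented, and `// empty` is swapped in for the script's `// 0` so that a missing field produces no output instead of a zero. It needs jq on the PATH.

```shell
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical stub for the real `nomad` CLI so the sketch runs offline.
# It only understands `nomad alloc fs <alloc_id> <path>`.
nomad() {
    case "$3" in
        alloc-ok)    printf '%s\n' '{"run_id":"r1","peer_end_count":4}' \
                                   '{"run_id":"r1","peer_end_count":3}' ;;
        alloc-zero)  printf '%s\n' '{"run_id":"r1","peer_end_count":0}' ;;
        alloc-empty) : ;;          # file exists but is empty
        *)           return 1 ;;   # file missing or allocation unreachable
    esac
}

fetch_peer_end_count() {
    local alloc_id="$1"
    local fallback_peer_count="$2"
    local raw_output result=""
    # Capture the raw file first so an empty file can be told apart from a real 0.
    raw_output=$(nomad alloc fs "$alloc_id" alloc/run_summary.jsonl 2>/dev/null) || raw_output=""
    if [[ -n "$raw_output" ]]; then
        # `// empty` prints nothing when the field is missing or null;
        # a numeric 0 is kept because only null and false count as falsy in jq.
        result=$(jq --slurp 'last | .peer_end_count // empty' <<<"$raw_output" 2>/dev/null) || result=""
    fi
    if [[ -z "$result" ]]; then
        echo "Warning: could not fetch peer_end_count for alloc $alloc_id, falling back to configured peer_count ($fallback_peer_count)" >&2
        echo "$fallback_peer_count"
        return
    fi
    echo "$result"
}

fetch_peer_end_count alloc-ok 5       # prints 3 (last record wins)
fetch_peer_end_count alloc-zero 5     # prints 0 (legitimate zero is kept)
fetch_peer_end_count alloc-empty 5    # prints 5 (fallback, warning on stderr)
fetch_peer_end_count alloc-missing 5  # prints 5 (fallback, warning on stderr)
```

The design choice here is to treat "no data" and "zero peers finished" as different outcomes, which avoids the blanket `== "0"` check that would hide a run where every agent really did drop.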
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 53c51d9a-edf8-42e1-aa28-fa42ecbeb08e

📥 Commits

Reviewing files that changed from the base of the PR and between 46fe784 and 6de2fbe.

📒 Files selected for processing (1)
  • nomad/scripts/ci_allocs.sh

…n summaries inside of allocations by their run_id

closes #504
@veeso veeso force-pushed the feat/504-get-peer_end_count-after-running-nomad-scenarios branch from 6de2fbe to f71d584 Compare March 19, 2026 08:45
@cocogitto-bot

cocogitto-bot bot commented Mar 19, 2026

✔️ f71d584 - Conventional commits check succeeded.

@veeso veeso requested a review from a team March 19, 2026 08:47
Member

@ThetaSinner ThetaSinner left a comment

Looks good

@veeso veeso merged commit f45b7f6 into main Mar 19, 2026
30 of 58 checks passed
@veeso veeso deleted the feat/504-get-peer_end_count-after-running-nomad-scenarios branch March 19, 2026 16:54
Development

Successfully merging this pull request may close these issues.

Get peer_end_count after running nomad scenarios

2 participants