
Add steering-quality evaluation harness + CI metrics comment#466

Merged
tomquist merged 2 commits into develop from claude/steering-eval-harness on Jun 12, 2026

@tomquist tomquist commented Jun 12, 2026

Owner

Summary

Adds a deterministic steering-quality evaluation harness for the active-control loop, plus a CI job that reports its metrics on every PR. This is the measurement groundwork for the #458 active-control redesign — the controller changes themselves stack on top in a follow-up PR, so this PR establishes the baseline they're compared against.

Evaluation harness (src/astrameter/simulator/evaluation.py)

  • Wires the full closed loop — CT002 (active control) → BatterySimulator (the firmware-accurate Venus ramp / B2500 hysteresis controllers) → LoadModelPowermeterSimulator — under a mock clock, so hours of simulated household activity run in seconds (full suite ~22 s).
  • Scenarios (seeded, deterministic event schedules): single Venus with stepped appliance loads (kettle/oven/dishwasher), single Venus with a solar day-curve crossing into export plus cloud dips, two identical Venus, two Venus + one DC-only B2500 with PV input, and mixed poll cadences (V2 ≈3.1 s + V3 ≈0.45 s). Every multi-battery scenario runs in both balancer modes: plain fair-share and efficiency optimization (deprioritization/rotation/probes).
  • The controller reads the meter at a realistic ~1 s cadence (stale in between) while metrics see the true instantaneous grid.
  • Metrics per scenario: settling time (mean/p95) after each scripted step, opposite-direction overshoot, band crossings per hour, steady-state RMS, avoidable import/export Wh (grid exchange a battery with headroom could have covered), and battery power travel (churn).
  • CLI: --scenario, --seed, --set KEY=VALUE config overrides, --json output, and --compare base.json --input head.json to render a Markdown before/after table.
uv run python -m astrameter.simulator.evaluation
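
The reaction and oscillation metrics listed above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the names `Sample`, `settle_time`, `opposite_overshoot`, and `band_crossings`, the 50 W bands, the 10 s quiet window, and the 60 s overshoot window are all assumptions made for the example. The real definitions and thresholds live in src/astrameter/simulator/evaluation.py.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: float       # simulated seconds
    grid_w: float  # true instantaneous grid power (+import, -export)


def settle_time(samples, t_event, band_w=50.0, quiet_s=10.0):
    """Seconds after t_event until |grid| stays inside band_w for quiet_s.

    Returns None if the signal never settles (an "unsettled event").
    All thresholds here are assumed values for illustration.
    """
    entered = None
    for s in samples:
        if s.t < t_event:
            continue
        if abs(s.grid_w) <= band_w:
            if entered is None:
                entered = s.t
            if s.t - entered >= quiet_s:
                return entered - t_event
        else:
            entered = None  # left the band again: restart the quiet window
    return None


def opposite_overshoot(samples, t_event, step_sign, window_s=60.0):
    """Largest excursion in the direction opposite to the load step.

    step_sign is +1 for a load increase (grid jumps towards import),
    -1 for a load decrease.
    """
    worst = 0.0
    for s in samples:
        if t_event <= s.t <= t_event + window_s:
            worst = max(worst, -step_sign * s.grid_w)
    return worst


def band_crossings(samples, band_w=50.0):
    """Count import/export flips with hysteresis: only swings beyond
    +/-band_w change the state, so noise inside the band is ignored."""
    state = 0
    crossings = 0
    for s in samples:
        new = 1 if s.grid_w > band_w else -1 if s.grid_w < -band_w else state
        if state != 0 and new != state:
            crossings += 1
        state = new
    return crossings
```

Dividing the crossing count by the run length in hours would give a per-hour rate comparable to band_crossings_per_h in the CI table.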

CI integration (.github/workflows/ci.yml, job steering-eval)

Runs the suite on the PR base and head and upserts the comparison as a sticky PR comment (informational only — thresholds live in pytest). On this PR the comment shows head-only numbers since the base has no harness yet; once merged, follow-up PRs (starting with the #458 controller changes) get a real before/after table. Fork PRs skip the comment (read-only token) but still get the table in the job summary; JSONs are uploaded as artifacts.
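
As a rough illustration of the before/after table, including the head-only case described above, the sketch below renders one scenario's metrics as Markdown. The function name `render_compare` and the flat `{metric: value}` input shape are assumptions for this example; the harness's own renderer and JSON layout may differ.

```python
def render_compare(base, head):
    """Render a Markdown table of per-metric values with a delta column.

    base may be None (no baseline available, e.g. the base commit has no
    harness yet), in which case only the Head column is filled.
    """
    lines = ["| Metric | Base | Head | Δ |", "| --- | --- | --- | --- |"]
    for name in sorted(head):
        h = head[name]
        b = None if base is None else base.get(name)
        if b is None:
            lines.append(f"| {name} | | {h:g} | |")
        else:
            # Lower is better, so a negative delta is an improvement.
            lines.append(f"| {name} | {b:g} | {h:g} | {h - b:+g} |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_compare({"steady_rms_w": 20.0}, {"steady_rms_w": 16.5}))
```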

Tests

tests/test_steering_eval.py runs a tiny inline scenario end-to-end (determinism, metric shape, config-override plumbing, registry shape, Markdown rendering) so the harness itself stays covered by the normal pytest run.
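
The determinism property can be illustrated with a seeded event schedule. The sketch below is hypothetical: `oven_cycles`, its parameters, and the duration ranges are invented for the example and are not taken from the harness. It only shows why a local `random.Random(seed)` makes two runs with the same seed produce identical schedules.

```python
import random


def oven_cycles(seed, start_s=0.0, cycles=3, power_w=2000.0):
    """Return (time_s, delta_w) events as paired on/off thermostat cycles.

    Durations are drawn from a local RNG, so the schedule depends only on
    the seed and never on global random state.
    """
    rng = random.Random(seed)
    events = []
    t = start_s
    for _ in range(cycles):
        on_s = rng.uniform(60.0, 120.0)   # assumed heating duration range
        off_s = rng.uniform(90.0, 180.0)  # assumed idle duration range
        events.append((t, +power_w))         # heating element on
        events.append((t + on_s, -power_w))  # every "on" gets its "off"
        t += on_s + off_s
    return events


if __name__ == "__main__":
    assert oven_cycles(seed=1) == oven_cycles(seed=1)
    print(oven_cycles(seed=1))
```

Emitting each "on" together with its "off" also guarantees the load returns to zero at the end of the schedule.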

Part of #458.

https://claude.ai/code/session_01SeKwnaH74tzQoZEPhmhBu8


Generated by Claude Code

Summary by CodeRabbit

  • New Features

    • Added a steering-quality evaluation framework with scenarios and metrics (settling time, overshoot, oscillation, steady-state RMS, energy).
    • CI integration that runs evaluations, uploads results/artifacts, and posts an auto-updating PR comment with a comparison.
  • Tests

    • Added end-to-end smoke and regression tests ensuring determinism, metric production, and scenario coverage.
  • Documentation

    • Added guidance for running steering evaluations locally and using baseline comparisons.

Deterministic mock-time evaluation suite (python -m
astrameter.simulator.evaluation) that wires CT002 active control against
the firmware-accurate battery simulators and scripted household scenarios
(load spikes, solar curves, single/multi/mixed batteries, both fair-share
and efficiency-optimization modes). Reports reaction (settling time),
oscillation (overshoot, band crossings, steady RMS) and energy
(avoidable import/export) metrics, with --json/--compare for diffing runs.

CI gains a steering-eval job that runs the suite on PR base and head and
posts the before/after table as a sticky PR comment (informational only).

Groundwork for the active-control redesign in #458.

https://claude.ai/code/session_01SeKwnaH74tzQoZEPhmhBu8

coderabbitai Bot commented Jun 12, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 94c95b31-f8c7-46c2-8568-dfa6856d8c0d

📥 Commits

Reviewing files that changed from the base of the PR and between ef3ff02 and 62cd097.

📒 Files selected for processing (1)
  • src/astrameter/simulator/evaluation.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/astrameter/simulator/evaluation.py

Walkthrough

Adds a deterministic steering-quality evaluation harness with scenarios and metrics, CLI and rendering, end-to-end tests, CI job to run head/base comparisons and post a sticky PR comment, plus guidance in AGENTS.md.

Changes

Steering-Quality Evaluation System

| Layer | File(s) | Summary |
| --- | --- | --- |
| Evaluation Core: Data Models & Constants | src/astrameter/simulator/evaluation.py | Module docstring, evaluation thresholds (settle band, oscillation band, SOC thresholds), and dataclasses (BatterySpec, Event, Scenario, EvalWorld, _EvalClock) that model batteries, events, scenarios, and deterministic simulation time. |
| Evaluation Engine: Scenario Execution | src/astrameter/simulator/evaluation.py | run_scenario function deterministically seeds randomness, builds and sorts scenario events, instantiates LoadModel, CT002 balancer, battery simulators, and powermeter, runs an event-driven loop stepping components on polling cadence under mock clock, captures time-series samples via CT002 hook, and returns metric data. |
| Metric Computation: Analysis Functions | src/astrameter/simulator/evaluation.py | Post-run metric analysis: per-event settle time and overshoot against settle band with quiet-tail rule, oscillation counting via hysteresis crossings, steady-state RMS excluding transient windows, and energy accounting with import/export and "avoidable" conditions based on battery headroom. |
| Scenario Definitions & Presets | src/astrameter/simulator/evaluation.py | Battery presets and event generator functions (household load steps with jitter, solar curves, labeled cloud dips, DC solar for DC batteries); build_scenarios() builds registry with single/multi-battery cases and fair/eff CT002 modes with different polling cadences. |
| Reporting & CLI Interface | src/astrameter/simulator/evaluation.py | Text and markdown rendering (render_text, render_markdown_compare for base-vs-head comparison tables), CLI argument parser supporting --scenario, --seed, --set overrides, --list, optional --json output, optional --compare baseline, and asyncio runner. |
| Evaluation Tests | tests/test_steering_eval.py | End-to-end tests: tiny scenario fixture, validates metric production with non-negative values and correct event counts, confirms determinism with fixed seed, tests override plumbing (balance_deadband), validates scenario registry structure, renders markdown with expected columns, and verifies full scenario event timing. |
| CI Workflow Integration | .github/workflows/ci.yml | New steering-eval job runs after lint, checks out full history, runs evaluation for PR head and base (via git worktree), renders markdown comparison, uploads artifacts, appends summary to job summary, and creates/updates sticky PR comment with HTML marker guard for same-repo PRs. |
| User Documentation | AGENTS.md | Guidance section describing steering-quality evaluation workflow: run baseline simulator for unchanged code, then re-run with --compare on PR head, noting CI automation and artifact availability. |

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 18.60% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the primary changes: adding a steering-quality evaluation harness and CI metrics comment functionality, which are the main objectives across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


github-actions Bot commented Jun 12, 2026

Contributor

Steering evaluation (base vs head)

Lower is better for every metric. See src/astrameter/simulator/evaluation.py for definitions.

mixed_cadence/eff — settle 105.7s, overshoot 1949.5W, RMS 111.7W
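| Metric | Base | Head | Δ |
| --- | --- | --- | --- |
| settle_mean_s | | 105.7 | |
| settle_p95_s | | 241.7 | |
| unsettled_events | | 1 | |
| overshoot_mean_w | | 766.7 | |
| overshoot_max_w | | 1949.5 | |
| band_crossings_per_h | | 446.0 | |
| steady_rms_w | | 111.7 | |
| mean_abs_grid_w | | 100.0 | |
| avoidable_import_wh | | 75.2 | |
| avoidable_export_wh | | 25.1 | |
| battery_travel_w_per_h | | 180519.0 | |
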
mixed_cadence/fair — settle 70.9s, overshoot 1655.7W, RMS 16.1W
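| Metric | Base | Head | Δ |
| --- | --- | --- | --- |
| settle_mean_s | | 70.9 | |
| settle_p95_s | | 92.7 | |
| unsettled_events | | 0 | |
| overshoot_mean_w | | 883.2 | |
| overshoot_max_w | | 1655.7 | |
| band_crossings_per_h | | 162.0 | |
| steady_rms_w | | 16.1 | |
| mean_abs_grid_w | | 49.5 | |
| avoidable_import_wh | | 35.5 | |
| avoidable_export_wh | | 13.9 | |
| battery_travel_w_per_h | | 69350.0 | |
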
mixed_venus_b2500/eff — settle 247.2s, overshoot 316.9W, RMS 37.8W
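| Metric | Base | Head | Δ |
| --- | --- | --- | --- |
| settle_mean_s | | 247.2 | |
| settle_p95_s | | 600.0 | |
| unsettled_events | | 7 | |
| overshoot_mean_w | | 127.1 | |
| overshoot_max_w | | 316.9 | |
| band_crossings_per_h | | 648.0 | |
| steady_rms_w | | 37.8 | |
| mean_abs_grid_w | | 47.9 | |
| avoidable_import_wh | | 48.8 | |
| avoidable_export_wh | | 21.9 | |
| battery_travel_w_per_h | | 131474.0 | |
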
mixed_venus_b2500/fair — settle 151.2s, overshoot 2098.6W, RMS 25.0W
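| Metric | Base | Head | Δ |
| --- | --- | --- | --- |
| settle_mean_s | | 151.2 | |
| settle_p95_s | | 334.9 | |
| unsettled_events | | 4 | |
| overshoot_mean_w | | 752.6 | |
| overshoot_max_w | | 2098.6 | |
| band_crossings_per_h | | 392.0 | |
| steady_rms_w | | 25.0 | |
| mean_abs_grid_w | | 49.6 | |
| avoidable_import_wh | | 59.6 | |
| avoidable_export_wh | | 15.1 | |
| battery_travel_w_per_h | | 87338.0 | |
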
single_venus_solar — settle 11.8s, overshoot 1046.6W, RMS 17.1W
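| Metric | Base | Head | Δ |
| --- | --- | --- | --- |
| settle_mean_s | | 11.8 | |
| settle_p95_s | | 14.8 | |
| unsettled_events | | 0 | |
| overshoot_mean_w | | 526.6 | |
| overshoot_max_w | | 1046.6 | |
| band_crossings_per_h | | 5.33 | |
| steady_rms_w | | 17.1 | |
| mean_abs_grid_w | | 15.6 | |
| avoidable_import_wh | | 15.2 | |
| avoidable_export_wh | | 8.2 | |
| battery_travel_w_per_h | | 11539.0 | |
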
single_venus_steps — settle 14.5s, overshoot 2545.6W, RMS 15.4W
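| Metric | Base | Head | Δ |
| --- | --- | --- | --- |
| settle_mean_s | | 14.5 | |
| settle_p95_s | | 24.4 | |
| unsettled_events | | 0 | |
| overshoot_mean_w | | 982.7 | |
| overshoot_max_w | | 2545.6 | |
| band_crossings_per_h | | 18.0 | |
| steady_rms_w | | 15.4 | |
| mean_abs_grid_w | | 54.1 | |
| avoidable_import_wh | | 43.9 | |
| avoidable_export_wh | | 10.2 | |
| battery_travel_w_per_h | | 49842.0 | |
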
two_venus/eff — settle 34.9s, overshoot 2069.6W, RMS 17.0W
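| Metric | Base | Head | Δ |
| --- | --- | --- | --- |
| settle_mean_s | | 34.9 | |
| settle_p95_s | | 80.7 | |
| unsettled_events | | 0 | |
| overshoot_mean_w | | 765.5 | |
| overshoot_max_w | | 2069.6 | |
| band_crossings_per_h | | 169.0 | |
| steady_rms_w | | 17.0 | |
| mean_abs_grid_w | | 52.8 | |
| avoidable_import_wh | | 42.2 | |
| avoidable_export_wh | | 9.9 | |
| battery_travel_w_per_h | | 59166.0 | |
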
two_venus/fair — settle 17.9s, overshoot 2070.9W, RMS 16.4W
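| Metric | Base | Head | Δ |
| --- | --- | --- | --- |
| settle_mean_s | | 17.9 | |
| settle_p95_s | | 24.4 | |
| unsettled_events | | 0 | |
| overshoot_mean_w | | 880.8 | |
| overshoot_max_w | | 2070.9 | |
| band_crossings_per_h | | 44.0 | |
| steady_rms_w | | 16.4 | |
| mean_abs_grid_w | | 42.9 | |
| avoidable_import_wh | | 34.5 | |
| avoidable_export_wh | | 8.0 | |
| battery_travel_w_per_h | | 41630.0 | |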

@tomquist tomquist marked this pull request as ready for review June 12, 2026 20:25

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/test_steering_eval.py (1)

101-116: ⚠️ Potential issue | 🟠 Major

Run the required ruff/mypy/pytest checks (uv is missing)

  • For tests/test_steering_eval.py (lines 101-116), the guideline commands could not be executed here because uv: command not found; run uv run ruff format ., uv run ruff check ., uv run mypy src/, and uv run pytest in your dev environment/CI and ensure they pass.
  • The random.Random(1) usage in this deterministic test is appropriate; the secrets hint is not relevant for this non-security test code.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_steering_eval.py` around lines 101 - 116, The CI/dev checks failed
because the helper tool `uv` is not available; update your environment or CI
steps to run the linters/typechecks/tests directly (or install `uv`) and re-run:
for example run `ruff format .` and `ruff check .`, then `mypy src/`, and
`pytest` locally/CI to ensure tests pass; verify tests/test_steering_eval.py
(notably the test_full_scenario_definitions_build that constructs
random.Random(1)) still passes under these checks.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/astrameter/simulator/evaluation.py`:
- Around line 174-177: The current _free_udp_port() closes the socket before the
controller (CT002) binds, causing a race; change the binding workflow so the
controller binds directly to port 0 and then reads back its assigned port (or
have _free_udp_port() return an open socket and keep it open until the listener
is ready) instead of returning an integer and closing the socket. Update all
call sites that use _free_udp_port() (including the other occurrences noted
around the CT002 setup) to either accept the open socket or to use the
controller's own bind-to-0/read-back-port logic so ownership isn't released
before the listener is established. Ensure the code that starts CT002 uses the
bound socket/port immediately and only closes/releases it after the listener is
confirmed running.
- Around line 530-540: The cloud dip is being overwritten by per-minute baseline
writes from _solar_curve, so modify _cloud_dips (and callers like
single_venus_solar) to either 1) model cloud cover as a multiplier applied by
the baseline generator: add a representation of cloud intervals (start t0, end
t0+120, depth) and have _solar_curve consult active cloud intervals and multiply
its per-minute irradiance before emitting events, or 2) emit per-minute
reduced-solar events for the full dip window instead of only a one-shot
cloud_on/cloud_off; locate and change _cloud_dips, Event creation for
"cloud_on"/"cloud_off", and/or _set_solar so that the dip persists across the
minute ticks (use the same time grid as _solar_curve) and ensure cloud depth is
applied multiplicatively to the baseline value.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 04558cfd-4b71-40fb-a3f6-bad79efb6759

📥 Commits

Reviewing files that changed from the base of the PR and between 5d1a801 and ef3ff02.

📒 Files selected for processing (4)
  • .github/workflows/ci.yml
  • AGENTS.md
  • src/astrameter/simulator/evaluation.py
  • tests/test_steering_eval.py

Comment on lines +174 to +177
def _free_udp_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind(("127.0.0.1", 0))
        return int(s.getsockname()[1])


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid the free-port handoff here.

_free_udp_port() closes the socket before CT002 binds that port, so another local process or parallel test can claim it in the gap. That turns this harness into a flaky CI target and can misroute battery traffic away from the controller. Please bind the controller first (ideally on port 0 and then read back the bound port) or otherwise keep ownership of the socket until the listener is ready.

Also applies to: 200-208, 226-233
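
One way to avoid the handoff gap described in this comment is to bind to port 0 and keep the socket open while its port number is handed around. The sketch below is a generic illustration using only the standard `socket` module; `bind_ephemeral_udp` is a hypothetical helper, and whether CT002 can accept a pre-bound socket or report its own bound port is not shown in this PR.

```python
import socket


def bind_ephemeral_udp(host="127.0.0.1"):
    """Bind a UDP socket to an OS-chosen port and return it still open.

    The caller owns the socket, so no other process can claim the port
    between choosing it and listening on it.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((host, 0))  # port 0: let the OS pick a free port
    except OSError:
        sock.close()
        raise
    return sock, int(sock.getsockname()[1])


listener, port = bind_ephemeral_udp()
try:
    # Peers can now be told the port; the listener has held it throughout.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        client.sendto(b"ping", ("127.0.0.1", port))
    listener.settimeout(2.0)
    data, _addr = listener.recvfrom(64)
finally:
    listener.close()
```

The difference from `_free_udp_port()` is only in ownership: the port number is read from a socket that stays open, rather than from one that is closed before the real listener binds.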


Comment thread src/astrameter/simulator/evaluation.py Outdated
Three harness fixes that belong in the baseline rather than in the
controller-change PR stacked on top:

- The controller now reads the meter at a realistic ~1 s cadence
  (Scenario.meter_interval_s) while metrics see the true instantaneous
  grid; a zero-latency meter flattered the current controller.
- The oven thermostat schedule is emitted as paired on/off cycles; the
  previous loop could end stuck ON, stacking loads beyond the battery's
  ceiling for the rest of the run (the 281 W steady RMS in
  single_venus_steps was that saturation, not steering behavior).
- Solar is now curve x factor so the labeled cloud-dip transients
  compose with the per-minute day curve instead of being overwritten by
  its next tick mid-measurement-window.

https://claude.ai/code/session_01SeKwnaH74tzQoZEPhmhBu8
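
The "curve x factor" composition from the third fix can be sketched as below. This is an assumed model for illustration only: `SolarState`, `clear_sky_w`, the half-sine day curve, and the numeric values are hypothetical and not the harness's code. It shows why storing the cloud factor separately lets a dip survive the next per-minute curve tick.

```python
import math


class SolarState:
    """Solar output = clear-sky curve value x cloud factor (assumed model)."""

    def __init__(self):
        self.curve_w = 0.0
        self.cloud_factor = 1.0

    def set_curve(self, watts):
        # Per-minute baseline tick: updates the curve, leaves clouds alone.
        self.curve_w = watts

    def set_cloud(self, factor):
        # cloud_on / cloud_off events: update the factor, leave the curve alone.
        self.cloud_factor = factor

    @property
    def power_w(self):
        return self.curve_w * self.cloud_factor


def clear_sky_w(minute, peak_w=2000.0, sunrise=360, sunset=1200):
    """Half-sine day curve: zero outside [sunrise, sunset], peak at midday."""
    if not sunrise <= minute <= sunset:
        return 0.0
    return peak_w * math.sin(math.pi * (minute - sunrise) / (sunset - sunrise))


solar = SolarState()
solar.set_curve(clear_sky_w(780))  # midday tick
solar.set_cloud(0.4)               # cloud_on: output drops to 40 %
solar.set_curve(clear_sky_w(781))  # next per-minute tick keeps the dip
dipped = solar.power_w
solar.set_cloud(1.0)               # cloud_off: back to the clear-sky curve
```

With a single stored solar value, the tick at minute 781 would have overwritten the dip; with two independent inputs, the dip persists until the matching cloud_off event.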