Add steering-quality evaluation harness + CI metrics comment by tomquist · Pull Request #466 · tomquist/AstraMeter

tomquist · 2026-06-12T20:21:53Z

Summary

Adds a deterministic steering-quality evaluation harness for the active-control loop, plus a CI job that reports its metrics on every PR. This is the measurement groundwork for the #458 active-control redesign — the controller changes themselves stack on top in a follow-up PR, so this PR establishes the baseline they're compared against.

Evaluation harness (`src/astrameter/simulator/evaluation.py`)

Wires the full closed loop — CT002 (active control) → BatterySimulator (the firmware-accurate Venus ramp / B2500 hysteresis controllers) → LoadModel → PowermeterSimulator — under a mock clock, so hours of simulated household activity run in seconds (full suite ~22 s).
Scenarios (seeded, deterministic event schedules): single Venus with stepped appliance loads (kettle/oven/dishwasher), single Venus with a solar day-curve crossing into export plus cloud dips, two identical Venus, two Venus + one DC-only B2500 with PV input, and mixed poll cadences (V2 ≈3.1 s + V3 ≈0.45 s). Every multi-battery scenario runs in both balancer modes: plain fair-share and efficiency optimization (deprioritization/rotation/probes).
The controller reads the meter at a realistic ~1 s cadence (stale in between) while metrics see the true instantaneous grid.
Metrics per scenario: settling time (mean/p95) after each scripted step, opposite-direction overshoot, band crossings per hour, steady-state RMS, avoidable import/export Wh (grid exchange a battery with headroom could have covered), and battery power travel (churn).
CLI: --scenario, --seed, --set KEY=VALUE config overrides, --json output, and --compare base.json --input head.json to render a Markdown before/after table.

uv run python -m astrameter.simulator.evaluation
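To make the reaction and oscillation metrics above more concrete, here is a minimal sketch of settling time and steady-state RMS over a sampled grid trace. The band width, the quiet-tail length, and the function names are assumptions for illustration; the actual definitions live in `evaluation.py` and may differ.

```python
import math

# Hypothetical thresholds; the harness defines its own values.
SETTLE_BAND_W = 50.0
QUIET_TAIL_S = 10.0


def settle_time(samples, t_event, band=SETTLE_BAND_W, quiet=QUIET_TAIL_S):
    """Seconds after t_event until |grid| stays inside the band for `quiet` seconds.

    `samples` is a time-sorted list of (t_seconds, grid_watts).
    Returns None if the trace never settles.
    """
    entered = None  # time at which the trace last entered the band
    for t, w in samples:
        if t < t_event:
            continue
        if abs(w) <= band:
            if entered is None:
                entered = t
            if t - entered >= quiet:
                return entered - t_event
        else:
            entered = None  # left the band again: restart the quiet tail
    return None


def steady_rms(samples, transient_windows):
    """RMS of grid power, skipping samples inside any (start, end) transient window."""
    kept = [
        w for t, w in samples
        if not any(start <= t < end for start, end in transient_windows)
    ]
    if not kept:
        return 0.0
    return math.sqrt(sum(w * w for w in kept) / len(kept))


# Made-up trace: a 2 kW step at t=5 that the battery cancels by t=12.
trace = [(float(t), 0.0) for t in range(5)]
trace += [(float(t), 2000.0 - 300.0 * (t - 5)) for t in range(5, 12)]
trace += [(float(t), 20.0) for t in range(12, 40)]

print(settle_time(trace, t_event=5.0))
print(steady_rms(trace, [(5.0, 12.0)]))
```

Measuring settling from the moment the trace enters the band, rather than from the end of the quiet tail, keeps the tail length from inflating every reported settle time.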

CI integration (`.github/workflows/ci.yml`, job `steering-eval`)

Runs the suite on the PR base and head and upserts the comparison as a sticky PR comment (informational only — thresholds live in pytest). On this PR the comment shows head-only numbers since the base has no harness yet; once merged, follow-up PRs (starting with the #458 controller changes) get a real before/after table. Fork PRs skip the comment (read-only token) but still get the table in the job summary; JSONs are uploaded as artifacts.
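The "upsert as a sticky comment" step can be sketched as a small decision function: look for an existing comment that carries a hidden marker, then update it or create a new one. The marker string, the dict shape, and the function name are assumptions for illustration; the workflow's actual script is not shown in this PR description.

```python
MARKER = "<!-- steering-eval -->"  # hypothetical hidden marker


def plan_sticky_comment(existing_comments, table_markdown, marker=MARKER):
    """Decide whether to create or update the sticky comment.

    `existing_comments` is a list of dicts with "id" and "body" keys, in the
    shape a comment listing would typically return. Returns
    ("update", id, body) when a comment carrying the marker exists,
    otherwise ("create", None, body).
    """
    body = f"{marker}\n{table_markdown}"
    for comment in existing_comments:
        if marker in comment.get("body", ""):
            return ("update", comment["id"], body)
    return ("create", None, body)


comments = [
    {"id": 1, "body": "LGTM"},
    {"id": 2, "body": MARKER + "\nold table"},
]
print(plan_sticky_comment(comments, "| Metric | Base | Head |"))
print(plan_sticky_comment([], "| Metric | Base | Head |"))
```

Keeping the decision separate from the API call makes it easy to test without a token, which matters for fork PRs where the token is read-only.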

Tests

tests/test_steering_eval.py runs a tiny inline scenario end-to-end (determinism, metric shape, config-override plumbing, registry shape, Markdown rendering) so the harness itself stays covered by the normal pytest run.
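A determinism check of this kind usually rests on a locally seeded random generator. The following is a minimal sketch, not the test from this PR: the `Event` fields, the helper names, and the numbers are hypothetical, and only the idea of a seeded, paired on/off schedule is taken from the description.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    t_s: float      # time of the event in simulated seconds
    label: str
    delta_w: float  # change in household load


def appliance_cycles(rng, start_s, cycles, on_s, off_s, power_w, jitter_s=5.0):
    """Emit paired on/off events so the load always ends switched off."""
    events = []
    t = start_s
    for i in range(cycles):
        t_on = t + rng.uniform(0.0, jitter_s)
        t_off = t_on + on_s
        events.append(Event(t_on, f"oven_on_{i}", +power_w))
        events.append(Event(t_off, f"oven_off_{i}", -power_w))
        t = t_off + off_s
    return events


def build_schedule(seed):
    rng = random.Random(seed)  # local RNG: independent of global random state
    events = appliance_cycles(rng, 60.0, 3, 120.0, 180.0, 2000.0)
    return sorted(events, key=lambda e: e.t_s)


a, b = build_schedule(1), build_schedule(1)
print(a == b)                       # same seed, same schedule
print(sum(e.delta_w for e in a))    # paired cycles net out to zero load
```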

Part of #458.

https://claude.ai/code/session_01SeKwnaH74tzQoZEPhmhBu8

Generated by Claude Code

Summary by CodeRabbit

New Features
- Added a steering-quality evaluation framework with scenarios and metrics (settling time, overshoot, oscillation, steady-state RMS, energy).
- CI integration that runs evaluations, uploads results/artifacts, and posts an auto-updating PR comment with a comparison.
Tests
- Added end-to-end smoke and regression tests ensuring determinism, metric production, and scenario coverage.
Documentation
- Added guidance for running steering evaluations locally and using baseline comparisons.

Deterministic mock-time evaluation suite (python -m astrameter.simulator.evaluation) that wires CT002 active control against the firmware-accurate battery simulators and scripted household scenarios (load spikes, solar curves, single/multi/mixed batteries, both fair-share and efficiency-optimization modes). Reports reaction (settling time), oscillation (overshoot, band crossings, steady RMS) and energy (avoidable import/export) metrics, with --json/--compare for diffing runs. CI gains a steering-eval job that runs the suite on PR base and head and posts the before/after table as a sticky PR comment (informational only). Groundwork for the active-control redesign in #458. https://claude.ai/code/session_01SeKwnaH74tzQoZEPhmhBu8
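The `--compare` rendering can be pictured as a join of two metric dictionaries into a Markdown table. This is a hedged sketch only: `render_compare`, the placeholder for missing values, and the sample numbers are made up for illustration and are not the module's `render_markdown_compare`.

```python
MISSING = "-"  # placeholder for a value absent on one side


def _fmt(value):
    return MISSING if value is None else f"{value:g}"


def render_compare(base, head):
    """Render head-vs-base metrics as a Markdown table with a delta column."""
    # Keep the head's metric order, then append metrics only the base has.
    names = list(head) + [k for k in base if k not in head]
    rows = ["| Metric | Base | Head | Δ |", "| --- | --- | --- | --- |"]
    for name in names:
        b, h = base.get(name), head.get(name)
        delta = MISSING if b is None or h is None else f"{h - b:+g}"
        rows.append(f"| {name} | {_fmt(b)} | {_fmt(h)} | {delta} |")
    return "\n".join(rows)


# Made-up numbers, not results from this PR.
base = {"settle_mean_s": 20.0}
head = {"settle_mean_s": 15.0, "steady_rms_w": 12.5}
print(render_compare(base, head))
```

With no baseline available, every Base and Δ cell falls back to the placeholder, which matches the head-only situation described for this PR.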

coderabbitai · 2026-06-12T20:22:00Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 94c95b31-f8c7-46c2-8568-dfa6856d8c0d

📥 Commits

Reviewing files that changed from the base of the PR and between ef3ff02 and 62cd097.

📒 Files selected for processing (1)

src/astrameter/simulator/evaluation.py

🚧 Files skipped from review as they are similar to previous changes (1)

src/astrameter/simulator/evaluation.py

Walkthrough

Adds a deterministic steering-quality evaluation harness with scenarios and metrics, CLI and rendering, end-to-end tests, CI job to run head/base comparisons and post a sticky PR comment, plus guidance in AGENTS.md.

Changes

Steering-Quality Evaluation System

Layer / File(s)	Summary
Evaluation Core: Data Models & Constants `src/astrameter/simulator/evaluation.py`	Module docstring, evaluation thresholds (settle band, oscillation band, SOC thresholds), and dataclasses (`BatterySpec`, `Event`, `Scenario`, `EvalWorld`, `_EvalClock`) that model batteries, events, scenarios, and deterministic simulation time.
Evaluation Engine: Scenario Execution `src/astrameter/simulator/evaluation.py`	`run_scenario` function deterministically seeds randomness, builds and sorts scenario events, instantiates LoadModel, CT002 balancer, battery simulators, and powermeter, runs an event-driven loop stepping components on polling cadence under mock clock, captures time-series samples via CT002 hook, and returns metric data.
Metric Computation: Analysis Functions `src/astrameter/simulator/evaluation.py`	Post-run metric analysis: per-event settle time and overshoot against settle band with quiet-tail rule, oscillation counting via hysteresis crossings, steady-state RMS excluding transient windows, and energy accounting with import/export and "avoidable" conditions based on battery headroom.
Scenario Definitions & Presets `src/astrameter/simulator/evaluation.py`	Battery presets and event generator functions (household load steps with jitter, solar curves, labeled cloud dips, DC solar for DC batteries); `build_scenarios()` builds registry with single/multi-battery cases and `fair`/`eff` CT002 modes with different polling cadences.
Reporting & CLI Interface `src/astrameter/simulator/evaluation.py`	Text and markdown rendering (`render_text`, `render_markdown_compare` for base-vs-head comparison tables), CLI argument parser supporting `--scenario`, `--seed`, `--set` overrides, `--list`, optional `--json` output, optional `--compare` baseline, and asyncio runner.
Evaluation Tests `tests/test_steering_eval.py`	End-to-end tests: tiny scenario fixture, validates metric production with non-negative values and correct event counts, confirms determinism with fixed seed, tests override plumbing (`balance_deadband`), validates scenario registry structure, renders markdown with expected columns, and verifies full scenario event timing.
CI Workflow Integration `.github/workflows/ci.yml`	New `steering-eval` job runs after lint, checks out full history, runs evaluation for PR head and base (via git worktree), renders markdown comparison, uploads artifacts, appends summary to job summary, and creates/updates sticky PR comment with HTML marker guard for same-repo PRs.
User Documentation `AGENTS.md`	Guidance section describing steering-quality evaluation workflow: run baseline simulator for unchanged code, then re-run with `--compare` on PR head, noting CI automation and artifact availability.
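The "oscillation counting via hysteresis crossings" idea from the Metric Computation row can be sketched as follows. The band width and function names are assumptions; the harness's own oscillation band and counting rule may differ in detail.

```python
def band_crossings(grid_w, band=30.0):
    """Count swings of the grid trace from one side of the band to the other.

    The state only flips once the signal leaves the band on the opposite
    side, so small wiggle around zero does not count as oscillation.
    """
    state = 0  # +1 after exceeding +band, -1 after dropping below -band
    crossings = 0
    for w in grid_w:
        if w > band:
            new_state = 1
        elif w < -band:
            new_state = -1
        else:
            continue  # inside the band: keep the previous state
        if state != 0 and new_state != state:
            crossings += 1
        state = new_state
    return crossings


def crossings_per_hour(grid_w, sample_interval_s, band=30.0):
    """Normalize the crossing count by the trace duration in hours."""
    duration_h = len(grid_w) * sample_interval_s / 3600.0
    return band_crossings(grid_w, band) / duration_h if duration_h else 0.0


# Made-up trace in watts: two full swings, plus noise inside the band.
trace = [100, 10, -5, -80, -20, 15, 90, 5, -10, 8]
print(band_crossings(trace))
print(crossings_per_hour(trace, sample_interval_s=1.0))
```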

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 18.60% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the primary changes: adding a steering-quality evaluation harness and CI metrics comment functionality, which are the main objectives across all modified files.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


github-actions · 2026-06-12T20:22:57Z

Steering evaluation (base vs head)

Lower is better for every metric. See src/astrameter/simulator/evaluation.py for definitions.

mixed_cadence/eff — settle 105.7s, overshoot 1949.5W, RMS 111.7W

Metric	Base	Head	Δ
settle_mean_s	—	105.7	—
settle_p95_s	—	241.7	—
unsettled_events	—	1	—
overshoot_mean_w	—	766.7	—
overshoot_max_w	—	1949.5	—
band_crossings_per_h	—	446.0	—
steady_rms_w	—	111.7	—
mean_abs_grid_w	—	100.0	—
avoidable_import_wh	—	75.2	—
avoidable_export_wh	—	25.1	—
battery_travel_w_per_h	—	180519.0	—

mixed_cadence/fair — settle 70.9s, overshoot 1655.7W, RMS 16.1W

Metric	Base	Head	Δ
settle_mean_s	—	70.9	—
settle_p95_s	—	92.7	—
unsettled_events	—	0	—
overshoot_mean_w	—	883.2	—
overshoot_max_w	—	1655.7	—
band_crossings_per_h	—	162.0	—
steady_rms_w	—	16.1	—
mean_abs_grid_w	—	49.5	—
avoidable_import_wh	—	35.5	—
avoidable_export_wh	—	13.9	—
battery_travel_w_per_h	—	69350.0	—

mixed_venus_b2500/eff — settle 247.2s, overshoot 316.9W, RMS 37.8W

Metric	Base	Head	Δ
settle_mean_s	—	247.2	—
settle_p95_s	—	600.0	—
unsettled_events	—	7	—
overshoot_mean_w	—	127.1	—
overshoot_max_w	—	316.9	—
band_crossings_per_h	—	648.0	—
steady_rms_w	—	37.8	—
mean_abs_grid_w	—	47.9	—
avoidable_import_wh	—	48.8	—
avoidable_export_wh	—	21.9	—
battery_travel_w_per_h	—	131474.0	—

mixed_venus_b2500/fair — settle 151.2s, overshoot 2098.6W, RMS 25.0W

Metric	Base	Head	Δ
settle_mean_s	—	151.2	—
settle_p95_s	—	334.9	—
unsettled_events	—	4	—
overshoot_mean_w	—	752.6	—
overshoot_max_w	—	2098.6	—
band_crossings_per_h	—	392.0	—
steady_rms_w	—	25.0	—
mean_abs_grid_w	—	49.6	—
avoidable_import_wh	—	59.6	—
avoidable_export_wh	—	15.1	—
battery_travel_w_per_h	—	87338.0	—

single_venus_solar — settle 11.8s, overshoot 1046.6W, RMS 17.1W

Metric	Base	Head	Δ
settle_mean_s	—	11.8	—
settle_p95_s	—	14.8	—
unsettled_events	—	0	—
overshoot_mean_w	—	526.6	—
overshoot_max_w	—	1046.6	—
band_crossings_per_h	—	5.33	—
steady_rms_w	—	17.1	—
mean_abs_grid_w	—	15.6	—
avoidable_import_wh	—	15.2	—
avoidable_export_wh	—	8.2	—
battery_travel_w_per_h	—	11539.0	—

single_venus_steps — settle 14.5s, overshoot 2545.6W, RMS 15.4W

Metric	Base	Head	Δ
settle_mean_s	—	14.5	—
settle_p95_s	—	24.4	—
unsettled_events	—	0	—
overshoot_mean_w	—	982.7	—
overshoot_max_w	—	2545.6	—
band_crossings_per_h	—	18.0	—
steady_rms_w	—	15.4	—
mean_abs_grid_w	—	54.1	—
avoidable_import_wh	—	43.9	—
avoidable_export_wh	—	10.2	—
battery_travel_w_per_h	—	49842.0	—

two_venus/eff — settle 34.9s, overshoot 2069.6W, RMS 17.0W

Metric	Base	Head	Δ
settle_mean_s	—	34.9	—
settle_p95_s	—	80.7	—
unsettled_events	—	0	—
overshoot_mean_w	—	765.5	—
overshoot_max_w	—	2069.6	—
band_crossings_per_h	—	169.0	—
steady_rms_w	—	17.0	—
mean_abs_grid_w	—	52.8	—
avoidable_import_wh	—	42.2	—
avoidable_export_wh	—	9.9	—
battery_travel_w_per_h	—	59166.0	—

two_venus/fair — settle 17.9s, overshoot 2070.9W, RMS 16.4W

Metric	Base	Head	Δ
settle_mean_s	—	17.9	—
settle_p95_s	—	24.4	—
unsettled_events	—	0	—
overshoot_mean_w	—	880.8	—
overshoot_max_w	—	2070.9	—
band_crossings_per_h	—	44.0	—
steady_rms_w	—	16.4	—
mean_abs_grid_w	—	42.9	—
avoidable_import_wh	—	34.5	—
avoidable_export_wh	—	8.0	—
battery_travel_w_per_h	—	41630.0	—

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/test_steering_eval.py (1)
101-116: ⚠️ Potential issue | 🟠 Major

Run the required ruff/mypy/pytest checks (uv is missing)

For tests/test_steering_eval.py (lines 101-116), the guideline commands could not be executed here because uv: command not found; run uv run ruff format ., uv run ruff check ., uv run mypy src/, and uv run pytest in your dev environment/CI and ensure they pass.

The random.Random(1) usage in this deterministic test is appropriate; the secrets hint is not relevant for this non-security test code.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_steering_eval.py` around lines 101 - 116, The CI/dev checks failed
because the helper tool `uv` is not available; update your environment or CI
steps to run the linters/typechecks/tests directly (or install `uv`) and re-run:
for example run `ruff format .` and `ruff check .`, then `mypy src/`, and
`pytest` locally/CI to ensure tests pass; verify tests/test_steering_eval.py
(notably the test_full_scenario_definitions_build that constructs
random.Random(1)) still passes under these checks.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/astrameter/simulator/evaluation.py`:
- Around line 174-177: The current _free_udp_port() closes the socket before the
controller (CT002) binds, causing a race; change the binding workflow so the
controller binds directly to port 0 and then reads back its assigned port (or
have _free_udp_port() return an open socket and keep it open until the listener
is ready) instead of returning an integer and closing the socket. Update all
call sites that use _free_udp_port() (including the other occurrences noted
around the CT002 setup) to either accept the open socket or to use the
controller's own bind-to-0/read-back-port logic so ownership isn't released
before the listener is established. Ensure the code that starts CT002 uses the
bound socket/port immediately and only closes/releases it after the listener is
confirmed running.
- Around line 530-540: The cloud dip is being overwritten by per-minute baseline
writes from _solar_curve, so modify _cloud_dips (and callers like
single_venus_solar) to either 1) model cloud cover as a multiplier applied by
the baseline generator: add a representation of cloud intervals (start t0, end
t0+120, depth) and have _solar_curve consult active cloud intervals and multiply
its per-minute irradiance before emitting events, or 2) emit per-minute
reduced-solar events for the full dip window instead of only a one-shot
cloud_on/cloud_off; locate and change _cloud_dips, Event creation for
"cloud_on"/"cloud_off", and/or _set_solar so that the dip persists across the
minute ticks (use the same time grid as _solar_curve) and ensure cloud depth is
applied multiplicatively to the baseline value.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 04558cfd-4b71-40fb-a3f6-bad79efb6759

📥 Commits

Reviewing files that changed from the base of the PR and between 5d1a801 and ef3ff02.

📒 Files selected for processing (4)

.github/workflows/ci.yml
AGENTS.md
src/astrameter/simulator/evaluation.py
tests/test_steering_eval.py

coderabbitai · 2026-06-12T20:34:25Z

+def _free_udp_port() -> int:
+    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
+        s.bind(("127.0.0.1", 0))
+        return int(s.getsockname()[1])


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid the free-port handoff here.

_free_udp_port() closes the socket before CT002 binds that port, so another local process or parallel test can claim it in the gap. That turns this harness into a flaky CI target and can misroute battery traffic away from the controller. Please bind the controller first (ideally on port 0 and then read back the bound port) or otherwise keep ownership of the socket until the listener is ready.

Also applies to: 200-208, 226-233
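One way to avoid the handoff, sketched under the assumption that the listener can accept or own a pre-bound socket, is to bind to port 0 and keep the socket open while reading back the assigned port. `bind_udp_listener` is a hypothetical helper; how CT002 would take over the socket is not shown in this thread.

```python
import socket


def bind_udp_listener(host="127.0.0.1"):
    """Bind a UDP socket to an OS-assigned port and keep it open.

    Returns (sock, port). The caller owns the socket, so no other process
    can claim the port between discovery and use.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((host, 0))  # port 0: let the OS pick a free port
    except OSError:
        sock.close()
        raise
    return sock, int(sock.getsockname()[1])


listener, port = bind_udp_listener()
try:
    # A client can be pointed at `port` while the listener still holds it.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        client.sendto(b"ping", ("127.0.0.1", port))
    listener.settimeout(2.0)
    data, _addr = listener.recvfrom(64)
    print(port, data)
finally:
    listener.close()
```

Unlike `_free_udp_port()`, the socket is never closed between choosing the port and receiving on it, so there is no window in which a parallel test could bind the same port.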


Three harness fixes that belong in the baseline rather than in the controller-change PR stacked on top:
- The controller now reads the meter at a realistic ~1 s cadence (Scenario.meter_interval_s) while metrics see the true instantaneous grid; a zero-latency meter flattered the current controller.
- The oven thermostat schedule is emitted as paired on/off cycles; the previous loop could end stuck ON, stacking loads beyond the battery's ceiling for the rest of the run (the 281 W steady RMS in single_venus_steps was that saturation, not steering behavior).
- Solar is now curve x factor so the labeled cloud-dip transients compose with the per-minute day curve instead of being overwritten by its next tick mid-measurement-window.

https://claude.ai/code/session_01SeKwnaH74tzQoZEPhmhBu8
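The "curve x factor" composition can be illustrated with a small sketch in which the per-minute tick and the cloud events write to separate fields. The class, the half-sine day curve, and all numbers are assumptions for illustration, not the harness's `_solar_curve` or `_cloud_dips`.

```python
import math


class SolarSource:
    """Solar output as a baseline curve times a cloud factor (illustrative names)."""

    def __init__(self):
        self.curve_w = 0.0  # written by the per-minute day-curve tick
        self.factor = 1.0   # written by cloud_on / cloud_off events

    def set_curve(self, watts):
        self.curve_w = watts

    def set_cloud_factor(self, factor):
        self.factor = factor

    @property
    def power_w(self):
        return self.curve_w * self.factor


def day_curve(t_s, peak_w=3000.0, day_s=12 * 3600):
    """Half-sine day curve, zero outside [0, day_s]."""
    if not 0 <= t_s <= day_s:
        return 0.0
    return peak_w * math.sin(math.pi * t_s / day_s)


solar = SolarSource()
solar.set_curve(day_curve(6 * 3600))       # midday baseline
solar.set_cloud_factor(0.4)                # cloud_on: keep 40 % of the baseline
solar.set_curve(day_curve(6 * 3600 + 60))  # next minute tick does not undo the dip
print(round(solar.power_w, 1))
solar.set_cloud_factor(1.0)                # cloud_off restores the baseline
print(round(solar.power_w, 1))
```

Because the tick only ever touches `curve_w`, a dip stays in force for its whole window regardless of how many baseline updates arrive in between.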

tomquist marked this pull request as ready for review June 12, 2026 20:25

coderabbitai Bot reviewed Jun 12, 2026

tomquist merged commit d6327db into develop Jun 12, 2026
26 checks passed

tomquist mentioned this pull request Jun 12, 2026

Pace active-control deltas to the battery firmware's ramp (#458) #467

Merged

coderabbitai Bot mentioned this pull request Jun 12, 2026

Add grid-power traces and metric glossary to eval PR comments #468

Merged

tomquist mentioned this pull request Jun 12, 2026

Efficiency optimization hunts indefinitely with a mixed Venus + DC-only B2500 pool (eval: mixed_venus_b2500/eff) #469

Open

tomquist merged 2 commits into develop from claude/steering-eval-harness
