Add steering-quality evaluation harness + CI metrics comment#466
Conversation
Deterministic mock-time evaluation suite (python -m astrameter.simulator.evaluation) that wires CT002 active control against the firmware-accurate battery simulators and scripted household scenarios (load spikes, solar curves, single/multi/mixed batteries, both fair-share and efficiency-optimization modes). Reports reaction (settling time), oscillation (overshoot, band crossings, steady RMS) and energy (avoidable import/export) metrics, with --json/--compare for diffing runs. CI gains a steering-eval job that runs the suite on PR base and head and posts the before/after table as a sticky PR comment (informational only). Groundwork for the active-control redesign in #458. https://claude.ai/code/session_01SeKwnaH74tzQoZEPhmhBu8
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughAdds a deterministic steering-quality evaluation harness with scenarios and metrics, CLI and rendering, end-to-end tests, CI job to run head/base comparisons and post a sticky PR comment, plus guidance in AGENTS.md. ChangesSteering-Quality Evaluation System
🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Steering evaluation (base vs head)Lower is better for every metric. See mixed_cadence/eff — settle 105.7s, overshoot 1949.5W, RMS 111.7W
mixed_cadence/fair — settle 70.9s, overshoot 1655.7W, RMS 16.1W
mixed_venus_b2500/eff — settle 247.2s, overshoot 316.9W, RMS 37.8W
mixed_venus_b2500/fair — settle 151.2s, overshoot 2098.6W, RMS 25.0W
single_venus_solar — settle 11.8s, overshoot 1046.6W, RMS 17.1W
single_venus_steps — settle 14.5s, overshoot 2545.6W, RMS 15.4W
two_venus/eff — settle 34.9s, overshoot 2069.6W, RMS 17.0W
two_venus/fair — settle 17.9s, overshoot 2070.9W, RMS 16.4W
|
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_steering_eval.py (1)
101-116:⚠️ Potential issue | 🟠 MajorRun the required ruff/mypy/pytest checks (uv is missing)
- For tests/test_steering_eval.py (lines 101-116), the guideline commands could not be executed here because
uv: command not found; runuv run ruff format .,uv run ruff check .,uv run mypy src/, anduv run pytestin your dev environment/CI and ensure they pass.- The
random.Random(1)usage in this deterministic test is appropriate; thesecretshint is not relevant for this non-security test code.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_steering_eval.py` around lines 101 - 116, The CI/dev checks failed because the helper tool `uv` is not available; update your environment or CI steps to run the linters/typechecks/tests directly (or install `uv`) and re-run: for example run `ruff format .` and `ruff check .`, then `mypy src/`, and `pytest` locally/CI to ensure tests pass; verify tests/test_steering_eval.py (notably the test_full_scenario_definitions_build that constructs random.Random(1)) still passes under these checks.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/astrameter/simulator/evaluation.py`:
- Around line 174-177: The current _free_udp_port() closes the socket before the
controller (CT002) binds, causing a race; change the binding workflow so the
controller binds directly to port 0 and then reads back its assigned port (or
have _free_udp_port() return an open socket and keep it open until the listener
is ready) instead of returning an integer and closing the socket. Update all
call sites that use _free_udp_port() (including the other occurrences noted
around the CT002 setup) to either accept the open socket or to use the
controller's own bind-to-0/read-back-port logic so ownership isn't released
before the listener is established. Ensure the code that starts CT002 uses the
bound socket/port immediately and only closes/releases it after the listener is
confirmed running.
- Around line 530-540: The cloud dip is being overwritten by per-minute baseline
writes from _solar_curve, so modify _cloud_dips (and callers like
single_venus_solar) to either 1) model cloud cover as a multiplier applied by
the baseline generator: add a representation of cloud intervals (start t0, end
t0+120, depth) and have _solar_curve consult active cloud intervals and multiply
its per-minute irradiance before emitting events, or 2) emit per-minute
reduced-solar events for the full dip window instead of only a one-shot
cloud_on/cloud_off; locate and change _cloud_dips, Event creation for
"cloud_on"/"cloud_off", and/or _set_solar so that the dip persists across the
minute ticks (use the same time grid as _solar_curve) and ensure cloud depth is
applied multiplicatively to the baseline value.
---
Outside diff comments:
In `@tests/test_steering_eval.py`:
- Around line 101-116: The CI/dev checks failed because the helper tool `uv` is
not available; update your environment or CI steps to run the
linters/typechecks/tests directly (or install `uv`) and re-run: for example run
`ruff format .` and `ruff check .`, then `mypy src/`, and `pytest` locally/CI to
ensure tests pass; verify tests/test_steering_eval.py (notably the
test_full_scenario_definitions_build that constructs random.Random(1)) still
passes under these checks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 04558cfd-4b71-40fb-a3f6-bad79efb6759
📒 Files selected for processing (4)
.github/workflows/ci.ymlAGENTS.mdsrc/astrameter/simulator/evaluation.pytests/test_steering_eval.py
| def _free_udp_port() -> int: | ||
| with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s: | ||
| s.bind(("127.0.0.1", 0)) | ||
| return int(s.getsockname()[1]) |
There was a problem hiding this comment.
Avoid the free-port handoff here.
_free_udp_port() closes the socket before CT002 binds that port, so another local process or parallel test can claim it in the gap. That turns this harness into a flaky CI target and can misroute battery traffic away from the controller. Please bind the controller first (ideally on port 0 and then read back the bound port) or otherwise keep ownership of the socket until the listener is ready.
Also applies to: 200-208, 226-233
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/astrameter/simulator/evaluation.py` around lines 174 - 177, The current
_free_udp_port() closes the socket before the controller (CT002) binds, causing
a race; change the binding workflow so the controller binds directly to port 0
and then reads back its assigned port (or have _free_udp_port() return an open
socket and keep it open until the listener is ready) instead of returning an
integer and closing the socket. Update all call sites that use _free_udp_port()
(including the other occurrences noted around the CT002 setup) to either accept
the open socket or to use the controller's own bind-to-0/read-back-port logic so
ownership isn't released before the listener is established. Ensure the code
that starts CT002 uses the bound socket/port immediately and only
closes/releases it after the listener is confirmed running.
Three harness fixes that belong in the baseline rather than in the controller-change PR stacked on top: - The controller now reads the meter at a realistic ~1 s cadence (Scenario.meter_interval_s) while metrics see the true instantaneous grid; a zero-latency meter flattered the current controller. - The oven thermostat schedule is emitted as paired on/off cycles; the previous loop could end stuck ON, stacking loads beyond the battery's ceiling for the rest of the run (the 281 W steady RMS in single_venus_steps was that saturation, not steering behavior). - Solar is now curve x factor so the labeled cloud-dip transients compose with the per-minute day curve instead of being overwritten by its next tick mid-measurement-window. https://claude.ai/code/session_01SeKwnaH74tzQoZEPhmhBu8
Summary
Adds a deterministic steering-quality evaluation harness for the active-control loop, plus a CI job that reports its metrics on every PR. This is the measurement groundwork for the #458 active-control redesign — the controller changes themselves stack on top in a follow-up PR, so this PR establishes the baseline they're compared against.
Evaluation harness (
src/astrameter/simulator/evaluation.py)CT002(active control) →BatterySimulator(the firmware-accurate Venus ramp / B2500 hysteresis controllers) →LoadModel→PowermeterSimulator— under a mock clock, so hours of simulated household activity run in seconds (full suite ~22 s).--scenario,--seed,--set KEY=VALUEconfig overrides,--jsonoutput, and--compare base.json --input head.jsonto render a Markdown before/after table.CI integration (
.github/workflows/ci.yml, jobsteering-eval)Runs the suite on the PR base and head and upserts the comparison as a sticky PR comment (informational only — thresholds live in pytest). On this PR the comment shows head-only numbers since the base has no harness yet; once merged, follow-up PRs (starting with the #458 controller changes) get a real before/after table. Fork PRs skip the comment (read-only token) but still get the table in the job summary; JSONs are uploaded as artifacts.
Tests
tests/test_steering_eval.pyruns a tiny inline scenario end-to-end (determinism, metric shape, config-override plumbing, registry shape, Markdown rendering) so the harness itself stays covered by the normal pytest run.Part of #458.
https://claude.ai/code/session_01SeKwnaH74tzQoZEPhmhBu8
Generated by Claude Code
Summary by CodeRabbit
New Features
Tests
Documentation