
Commit d6327db

tomquist and claude authored
Add steering-quality evaluation harness + CI metrics comment (#466)
* Add steering-quality evaluation harness + CI metrics comment

  Deterministic mock-time evaluation suite (python -m astrameter.simulator.evaluation) that wires CT002 active control against the firmware-accurate battery simulators and scripted household scenarios (load spikes, solar curves, single/multi/mixed batteries, both fair-share and efficiency-optimization modes). Reports reaction (settling time), oscillation (overshoot, band crossings, steady RMS) and energy (avoidable import/export) metrics, with --json/--compare for diffing runs.

  CI gains a steering-eval job that runs the suite on PR base and head and posts the before/after table as a sticky PR comment (informational only). Groundwork for the active-control redesign in #458.

  https://claude.ai/code/session_01SeKwnaH74tzQoZEPhmhBu8

* eval: model meter latency, fix oven and cloud-dip scenario bugs

  Three harness fixes that belong in the baseline rather than in the controller-change PR stacked on top:

  - The controller now reads the meter at a realistic ~1 s cadence (Scenario.meter_interval_s) while metrics see the true instantaneous grid; a zero-latency meter flattered the current controller.
  - The oven thermostat schedule is emitted as paired on/off cycles; the previous loop could end stuck ON, stacking loads beyond the battery's ceiling for the rest of the run (the 281 W steady RMS in single_venus_steps was that saturation, not steering behavior).
  - Solar is now curve x factor so the labeled cloud-dip transients compose with the per-minute day curve instead of being overwritten by its next tick mid-measurement-window.

  https://claude.ai/code/session_01SeKwnaH74tzQoZEPhmhBu8

---------

Co-authored-by: Claude <noreply@anthropic.com>
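The three harness fixes in the second commit message can be illustrated with a minimal sketch. This is not the harness code: apart from the idea behind `Scenario.meter_interval_s`, every name below (`LoadEvent`, `thermostat_cycles`, `solar_w`, `SampledMeter`) is hypothetical, and the real implementation may differ in structure and detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadEvent:
    t_s: float      # simulation time in seconds
    delta_w: float  # change in household load in watts


def thermostat_cycles(start_s, end_s, on_s, off_s, power_w):
    """Emit ON/OFF events strictly in pairs so the load cannot end stuck ON."""
    events = []
    t = start_s
    # Start a cycle only if its matching OFF event also fits before end_s.
    while t + on_s <= end_s:
        events.append(LoadEvent(t, +power_w))
        events.append(LoadEvent(t + on_s, -power_w))
        t += on_s + off_s
    return events


def solar_w(curve_w, factor):
    """Compose the slow per-minute day curve with a transient cloud factor.

    Because the two are multiplied, a new curve tick changes curve_w but
    does not erase an active cloud dip held in factor.
    """
    return curve_w * factor


class SampledMeter:
    """Sample-and-hold meter: the controller sees a value up to interval_s old."""

    def __init__(self, interval_s=1.0):
        self.interval_s = interval_s
        self._last_t = None
        self._value = 0.0

    def read(self, t_s, true_grid_w):
        # Refresh only when a full interval has elapsed since the last sample.
        if self._last_t is None or t_s - self._last_t >= self.interval_s:
            self._last_t = t_s
            self._value = true_grid_w
        return self._value
```

In this sketch the metrics would be computed from `true_grid_w` directly, while the controller only ever receives the value returned by `SampledMeter.read`, which is the separation the commit message describes.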
1 parent 5d1a801 commit d6327db

4 files changed

Lines changed: 1090 additions & 0 deletions


.github/workflows/ci.yml

Lines changed: 103 additions & 0 deletions
@@ -208,6 +208,109 @@ jobs:
           name: coverage-${{ matrix.python-version }}
           path: coverage.xml
 
+  # Steering-quality evaluation (issue #458): runs the deterministic
+  # active-control simulation suite (src/astrameter/simulator/evaluation.py)
+  # on the PR base and head and posts the before/after metrics as a sticky
+  # PR comment. Purely informational — thresholds that must hold live in
+  # pytest regressions, not here.
+  steering-eval:
+    needs: [ lint ]
+    runs-on: ubuntu-latest
+    timeout-minutes: 15
+    permissions:
+      contents: read
+      pull-requests: write
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          enable-cache: true
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.13"
+
+      - name: Install dependencies
+        run: uv sync --frozen --extra dev
+
+      - name: Run evaluation (head)
+        run: uv run python -m astrameter.simulator.evaluation --json head.json
+
+      # The base run uses the base commit's own harness + controller. Until
+      # the harness exists on the base branch this produces an empty
+      # baseline and the comment shows head-only numbers.
+      - name: Run evaluation (base)
+        if: github.event_name == 'pull_request'
+        run: |
+          git worktree add /tmp/steering-base "${{ github.event.pull_request.base.sha }}"
+          if [ -f /tmp/steering-base/src/astrameter/simulator/evaluation.py ]; then
+            (cd /tmp/steering-base \
+              && uv sync --frozen --extra dev \
+              && uv run python -m astrameter.simulator.evaluation --json "$GITHUB_WORKSPACE/base.json")
+          else
+            echo "[]" > "$GITHUB_WORKSPACE/base.json"
+          fi
+
+      - name: Render comparison
+        if: github.event_name == 'pull_request'
+        run: |
+          uv run python -m astrameter.simulator.evaluation \
+            --input head.json --compare base.json > steering-eval-comment.md
+          cat steering-eval-comment.md >> "$GITHUB_STEP_SUMMARY"
+
+      - name: Upload metrics artifacts
+        uses: actions/upload-artifact@v4
+        with:
+          name: steering-eval
+          path: |
+            head.json
+            base.json
+            steering-eval-comment.md
+          if-no-files-found: ignore
+
+      # Forks get a read-only token; they still see the table in the job
+      # summary above.
+      - name: Post sticky PR comment
+        if: >-
+          github.event_name == 'pull_request' &&
+          github.event.pull_request.head.repo.full_name == github.repository
+        uses: actions/github-script@v7
+        with:
+          script: |
+            const fs = require('fs');
+            const marker = '<!-- steering-eval -->';
+            const body = marker + '\n' +
+              fs.readFileSync('steering-eval-comment.md', 'utf8');
+            const {data: comments} = await github.rest.issues.listComments({
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              issue_number: context.issue.number,
+              per_page: 100,
+            });
+            const existing = comments.find(
+              c => c.body && c.body.startsWith(marker));
+            if (existing) {
+              await github.rest.issues.updateComment({
+                owner: context.repo.owner,
+                repo: context.repo.repo,
+                comment_id: existing.id,
+                body,
+              });
+            } else {
+              await github.rest.issues.createComment({
+                owner: context.repo.owner,
+                repo: context.repo.repo,
+                issue_number: context.issue.number,
+                body,
+              });
+            }
+
   build:
     needs: [ validate ]
     permissions:
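The "Render comparison" step turns `head.json` and `base.json` into a before/after table, and an empty baseline (`[]`) yields head-only numbers. The sketch below shows one way such a renderer could work. The JSON schema is not visible in this diff, so the layout assumed here (a list of objects with a `scenario` key plus numeric metrics) and the function name `render_comparison` are hypothetical.

```python
def render_comparison(head, base):
    """Render a before/after Markdown table from two lists of result rows.

    Assumed (hypothetical) row shape: {"scenario": str, "<metric>": float, ...}.
    """
    base_by_name = {row["scenario"]: row for row in base}
    lines = [
        "| scenario | metric | base | head | delta |",
        "| --- | --- | --- | --- | --- |",
    ]
    for row in head:
        before = base_by_name.get(row["scenario"], {})
        for metric, after in row.items():
            if metric == "scenario":
                continue
            if metric in before:
                old = f"{before[metric]:.1f}"
                delta = f"{after - before[metric]:+.1f}"
            else:
                # No baseline for this scenario or metric: show head only.
                old, delta = "n/a", "n/a"
            lines.append(
                f"| {row['scenario']} | {metric} | {old} | {after:.1f} | {delta} |"
            )
    return "\n".join(lines)


# Illustrative values only, not real harness output.
head = [{"scenario": "example", "steady_rms_w": 10.0}]
print(render_comparison(head, base=[]))
print(render_comparison(head, base=[{"scenario": "example", "steady_rms_w": 12.5}]))
```

Keying the baseline by scenario name rather than by list position keeps the table stable when a PR adds or reorders scenarios, which matters for the first PR where the base branch has no harness at all.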

AGENTS.md

Lines changed: 11 additions & 0 deletions
@@ -23,6 +23,17 @@ CI runs the same steps (see `.github/workflows/ci.yml`).
 
 `esphome/components/ct002/` is a C++ mirror of the Python CT002 stack. Any change to shared behavior must land on **both** sides in the same change. See `CONTRIBUTING.md` for the file mapping and what has no C++ counterpart. Verify with `uv run pytest tests/components/ct002/`.
 
+## Steering-quality evaluation (run when touching balancer behavior)
+
+`uv run python -m astrameter.simulator.evaluation` simulates hours of
+realistic household activity against the firmware-accurate battery plant and
+reports reaction/oscillation/energy metrics per scenario. When changing
+`src/astrameter/ct002/balancer.py` (or anything else in the active-control
+loop), capture a baseline first (`--json base.json` on the unchanged code),
+re-run after the change, and compare with `--input head.json --compare
+base.json`. CI runs the same suite on PR base + head (job `steering-eval`)
+and posts the comparison as a sticky PR comment.
+
 ## Changelog
 
 For user-facing work on a branch, keep **one bullet under `## Next`** that summarizes the **overall** outcome of that branch. **Add** it when you first document the change; on **later iterations** on the same branch, **edit that same bullet** if the scope or wording shifts—do **not** append extra bullets for each follow-up. Skip `CHANGELOG.md` entirely when nothing users would notice changes (refactors, tests-only, etc.).
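The reaction and oscillation metrics named in the new AGENTS.md section (settling time, overshoot, band crossings, steady RMS) can be made concrete with a small sketch. The definitions below are assumptions chosen for illustration, with a grid target of 0 W and a hypothetical tolerance band `band_w`; the harness may define each metric differently, and the energy metrics are left out because their definition is not visible in this diff.

```python
import math


def steering_metrics(t_s, grid_w, band_w=50.0):
    """Compute illustrative metrics for a non-empty grid-power trace (target 0 W)."""
    # Settling time: earliest sample after which |grid| stays inside the band.
    settle_idx = None
    for i in range(len(grid_w) - 1, -1, -1):
        if abs(grid_w[i]) > band_w:
            break
        settle_idx = i
    settling_s = None if settle_idx is None else t_s[settle_idx] - t_s[0]

    # Overshoot: largest excursion with the opposite sign to the initial error.
    sign = 1.0 if grid_w[0] >= 0 else -1.0
    overshoot_w = max(0.0, max(-sign * g for g in grid_w))

    # Band crossings: transitions from inside the band to outside it.
    inside = [abs(g) <= band_w for g in grid_w]
    band_crossings = sum(1 for a, b in zip(inside, inside[1:]) if a and not b)

    # Steady RMS: RMS of the samples from the settling point onward.
    tail = grid_w[settle_idx:] if settle_idx is not None else []
    steady_rms_w = (
        math.sqrt(sum(g * g for g in tail) / len(tail)) if tail else None
    )

    return {
        "settling_s": settling_s,
        "overshoot_w": overshoot_w,
        "band_crossings": band_crossings,
        "steady_rms_w": steady_rms_w,
    }


# Illustrative trace: a load step that the controller corrects with one overshoot.
trace_t = [0, 1, 2, 3, 4, 5]
trace_w = [800.0, 300.0, -120.0, 40.0, -30.0, 30.0]
print(steering_metrics(trace_t, trace_w))
```

Under these definitions a saturated battery would show up as a large steady RMS even when the controller itself behaves well, which is why scenario bugs such as a load stuck ON need fixing in the baseline before comparing controllers.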
