Fix CT002 efficiency-rotation lockup when push-based powermeter goes silent by tomquist · Pull Request #320 · tomquist/AstraMeter

tomquist · 2026-04-11T16:47:20Z

Fixes a battery-rotation lockup reported against the HomeWizard P1 powermeter: at the moment an EFFICIENCY_ROTATION_INTERVAL probe handoff fires, the upstream WebSocket silently half-opens, the CT002 emulator keeps serving the last cached meter reading to every battery poll, and the balancer computes target = 0 because its meter source is pinned at "grid is balanced". Result: batteries stay frozen at their probe values and the grid drifts by the full magnitude of the load until the add-on is manually restarted — ~1.5 h in the reporter's case.

Root cause

HomeWizardPowermeter (src/astrameter/powermeter/homewizard.py) opened its WebSocket without a heartbeat. A half-open TCP would sit forever in async for msg in ws with no way for aiohttp to detect the stall.
There was no application-layer watchdog either, so a dongle that stops emitting measurement events while the TCP session remains alive would also wedge the loop indefinitely.
get_powermeter_watts() had no staleness check — it happily returned self.values regardless of how old the cached measurement was.
HomeAssistant powermeter had the same staleness problem (though it did pass heartbeat=30).
On the CT002 side, _call_before_send logged on every single failure and kept the stale cached consumer values in place, so even once the underlying problem was surfaced there was no graceful degradation.

Fix: powermeter hardening

HomeWizardPowermeter

Pass heartbeat=WS_HEARTBEAT_SECONDS (30 s) to ws_connect. aiohttp now sends ping frames every 30 s and force-closes the connection if no pong arrives within 60 s.
Run _measurement_watchdog() alongside the WebSocket loop. It waits on a _fresh_measurement_event with a WATCHDOG_TIMEOUT_SECONDS (45 s) timeout and force-closes the socket on timeout so the outer loop drops through to the reconnect branch. Catches the "TCP alive, app layer hung" case.
Track _last_measurement_time. get_powermeter_watts() raises ValueError("…stale…") when the value is older than max_measurement_age_seconds (default 30 s, pass 0 to disable).
Inject a clock (clock: Callable[[], float] | None = None) so unit tests can drive the age check deterministically.

HomeAssistant

Factor out WS_HEARTBEAT_SECONDS = 30.
Per-entity _entity_update_time mirror of _entity_values, timestamped in _update_entity_value.
_get_entity_value raises ValueError("…stale…") when age exceeds max_state_age_seconds (default 60 s, pass 0 to disable).
On reconnect the loop now clears every entity timestamp and clears _entities_ready. _check_entities_ready also requires a non-None update time so an entity whose timestamp was cleared cannot re-flip the event to set while its cached value survives. Previously wait_for_message() would return immediately after a disconnect even though _get_entity_value would raise stale.

Fix: CT002 graceful degradation

_call_before_send now tracks consecutive before_send failures:

Emits a single WARNING on the first failure after a healthy spell, then at most one every 30 s while the failure persists (rate-limited on self._clock() — defaults to wall time but tests can inject a fake clock).
Emits a single INFO recovery log when the next successful call arrives.
Still returns None on failure, so the CT002 response falls back to the previously cached consumer values rather than crashing. The user's own action is the real recovery; the log is the escalation.

Defense in depth: balancer smoother reseed

LoadBalancer gained an optional smoother: TargetSmoother | None kwarg. After a probe _commit_probe or _reject_probe, the injected smoother is reseed()-ed so its EMA state cannot drag any pre-probe zero-crossing artifact into the post-handoff balance. The next _compute_smooth_target call then seeds directly from the fresh meter reading instead of EMA-smoothing through the probe transition. This closes a narrower, purely emulator-side failure mode that exists independently of the stale-meter root cause.

TargetSmoother also gained:

reseed() — clears _value, _last_sample, _last_raw_total.
Dedup that now requires both sample_id and raw_total to match, so a caller that reuses an old sample_id with a new raw_total no longer masks the fresh value.
Debug-level logging on seed / update / dedup / reseed so the next time this fires in production the user can see exactly what the smoother is doing with log_level: debug.

Tests (437 passing, 17 skipped)

Reproduction

tests/test_e2e_probe_lockup.py::test_stale_meter_during_probe_causes_persistent_lockup — full-stack reproduction of the log2 scenario: freezes the harness's before_send at a snapshot right before a forced rotation and asserts the grid drifts ≥ 40 W. Documents the unrecoverable path when the powermeter has no staleness detection.
tests/test_e2e_probe_lockup.py::test_powermeter_stale_error_is_handled_gracefully — the fixed path: makes before_send raise ValueError and asserts 3–6 rate-limited warnings across 120 simulated seconds (one per 30 s, on the _FakeClock), batteries hold state within 20 W of their pre-outage values, and exactly one recovery log fires on the next successful call.

HomeWizard unit tests

test_measurement_watchdog_closes_ws_on_timeout — drives the real watchdog with WATCHDOG_TIMEOUT_SECONDS patched to 0.01 s and asserts ws.close() fires on the TimeoutError branch.
test_measurement_watchdog_re_arms_after_each_measurement — feeder task sets _fresh_measurement_event twice then stops; verifies the watchdog re-arms its timer every iteration and eventually times out.
test_ws_loop_passes_heartbeat_to_ws_connect — captures kwargs and asserts the heartbeat= argument is passed through.
test_get_watts_raises_when_measurement_is_stale, test_fresh_measurement_clears_staleness, test_get_watts_staleness_disabled_when_max_age_is_zero — staleness detection with an injected fake clock.

HomeAssistant unit tests

test_stale_state_raises_in_get_powermeter_watts, test_fresh_state_clears_staleness, test_max_state_age_zero_disables_check.
test_reconnect_clears_entity_update_times_and_ready_flag — walks the full disconnect/reconnect cycle through _ws_loop's reset block and asserts that _entities_ready clears, get_powermeter_watts raises, and both signals recover on the next subscribe_entities snapshot.

Smoother unit tests (tests/test_smoother.py)

EMA basics, deadband decay, sign-flip catchup, max_step.
Regression: identical sample_id with different raw_total must still advance the EMA.
Dedup within one tick is still coalesced when both match.
reseed() semantics — next update seeds directly, bypassing EMA.

Balancer integration tests (tests/test_balancer_probe_lockup.py)

test_probe_commit_reseeds_injected_smoother, test_probe_reject_reseeds_injected_smoother — verify the balancer actually calls smoother.reseed() on both paths.
test_active_battery_keeps_covering_demand_after_probe_handoff — replays the log2 sequence against LoadBalancer.compute_target and asserts the new active battery's target reflects real demand.

Caveats

The "stale meter" reproduction in test_stale_meter_during_probe_causes_persistent_lockup still passes after this PR — the in-harness before_send deliberately doesn't implement staleness detection, so the path where the emulator has no upstream signal is unchanged. That's intentional: it's a regression marker for the unrecoverable path, and the fixed path is exercised by test_powermeter_stale_error_is_handled_gracefully.
max_measurement_age_seconds / max_state_age_seconds are keyword-only constructor kwargs with sane defaults. I did not plumb them through config.ini / ha_addon/config.yaml — the defaults should be fine for essentially everyone and adding UI knobs felt like scope creep. Easy to add if wanted.
No changes to other push-based powermeters (shrdzm, mqtt, sma_energy_meter) — each has its own freshness model and was out of scope for the HomeWizard report. Worth a follow-up audit.

Review trail

Internal review flagged four real concerns, all fixed in a follow-up commit on this branch:

Rate-limited warning was using time.time() instead of self._clock, making the e2e test bound loose — now uses the injected clock so the assertion is a deterministic 3-6 warnings across 120 simulated seconds.
Added missing unit tests for _measurement_watchdog.
_entities_ready was not cleared on HA reconnect.
Docstring on test_stale_meter_during_probe_causes_persistent_lockup was misleading.

Summary by CodeRabbit

Release Notes

New Features
- Added configurable maximum state age checking for power meter readings to prevent stale data
- Added WebSocket heartbeat and automatic watchdog monitoring for improved connection stability
- Implemented rate-limited logging for connection failures with recovery notifications
Bug Fixes
- Fixed load balancer probe handoff lockup issue
Tests
- Added comprehensive regression tests for probe lockup and staleness detection scenarios

coderabbitai · 2026-04-11T16:47:45Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 95ddf8e3-21be-4deb-a37d-b3c39b0d89ad

📥 Commits

Reviewing files that changed from the base of the PR and between 73b909a and 35d8248.

📒 Files selected for processing (2)

tests/test_balancer_probe_lockup.py
tests/test_e2e_probe_lockup.py

Walkthrough

The changes introduce staleness detection and recovery mechanisms across the load balancing and power meter subsystems. The CT002 load balancer now supports optional smoother reseeding after probe state transitions, CT002 implements rate-limited logging for before_send failures, both power meter integrations track measurement freshness with configurable age thresholds, and TargetSmoother enhances deduplication and adds state-reset capabilities. New regression tests verify probe-lockup fixes and staleness-handling behavior.

Changes

Cohort / File(s)	Summary
Load Balancer & Smoothing `src/astrameter/ct002/balancer.py`, `src/astrameter/ct002/smoother.py`	`LoadBalancer` accepts optional `smoother` parameter and reseeds it after probe completion. `TargetSmoother` enhances dedup logic to check both `sample_id` and `raw_total`, adds `reseed()` method, introduces logging for state transitions, and refines EMA update path with sign-change catchup.
Error Handling & Logging `src/astrameter/ct002/ct002.py`	CT002 adds clock-injected, rate-limited logging for `before_send` failures using consecutive-failure counters and 30-second throttling windows, with recovery notification when failures clear.
HomeAssistant Power Meter `src/astrameter/powermeter/homeassistant.py`	Introduces per-entity update-timestamp tracking, configurable `max_state_age_seconds` staleness threshold, injectable clock, and reconnect-triggered state reset. Entity readiness now requires both value and fresh timestamp.
HomeAssistant Tests `src/astrameter/powermeter/homeassistant_test.py`	Adds four new async tests covering staleness detection, fresh-event recovery, staleness-disabled mode (`max_state_age_seconds=0`), and reconnect state reset.
HomeWizard Power Meter `src/astrameter/powermeter/homewizard.py`	Adds WebSocket heartbeat support, measurement staleness tracking with injectable clock, configurable `max_measurement_age_seconds`, and a new `_measurement_watchdog` task that force-closes stalled connections independent of lower-level liveness.
HomeWizard Tests `src/astrameter/powermeter/homewizard_test.py`	Adds six new tests covering measurement staleness detection, staleness-disabled mode, fresh-measurement recovery, WebSocket heartbeat validation, watchdog timeout enforcement, and watchdog re-arming across iterations.
Load Balancer Probe Regression Tests `tests/test_balancer_probe_lockup.py`	Introduces deterministic fake clock and regression tests for smoother reseeding on probe completion; includes unit-level verification of `_commit_probe` and `_reject_probe` reseed behavior, plus scenario test simulating probe handoff and asserting recovery from phase-B target lockup.
End-to-End Probe Regression Tests `tests/test_e2e_probe_lockup.py`	Adds in-memory CT002 + dual-battery + simulated powermeter harness with controllable clock and three meter behaviors (live, frozen, raising-stale). Four scenario tests verify probe-lockup fix, meter-freeze regression, rate-limited logging with recovery, and harness shutdown safety.
TargetSmoother Unit Tests `tests/test_smoother.py`	Comprehensive unit test suite validating seeding, EMA convergence, deadband decay, sign-flip catchup, `max_step` limiting, dedup regression, and `reseed()` state clearing.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~85 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 25.08% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately describes the main fix: addressing a lockup in CT002 efficiency-rotation when a push-based powermeter (HomeWizard P1) becomes silent, which is the primary issue resolved across multiple components (powermeter staleness detection, watchdog, smoother reseed, and rate-limited logging).

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch claude/debug-log-rotation-VboOh

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Reported from a user log where a CT002 efficiency-rotation probe completed successfully but the newly-active battery's target collapsed to zero and stayed there for ~1.5 h until manual restart, with the grid drifting ~97 W uncompensated the whole time. Two defensive fixes on the control-loop side: 1. TargetSmoother now reseeds (clears `_value` back to `None`) on every probe commit or reject. Injected into LoadBalancer at construction via a new `smoother=` kwarg. The next update after reseed seeds the smoother directly from `raw_total`, bypassing the EMA and skipping any zero-crossing lag that could otherwise drag pre-probe state into the post-handoff balance. 2. TargetSmoother.update() no longer fires its dedup shortcut when `sample_id` matches the previous call but `raw_total` has changed. The intent of the dedup is to coalesce multi-consumer polls within one meter tick; a stale sample_id with a fresh raw_total would otherwise silently mask real meter movement. Also adds debug-level logging to TargetSmoother.seed/update/reseed so future incidents are diagnosable from an info-or-higher log alone; today the balancer only logs at INFO and the rotation site at `balancer.py:886-888` emits nothing at any level. Tests: - tests/test_smoother.py — unit regressions for dedup vs fresh raw_total, reseed semantics, EMA basics, sign-flip catchup. - tests/test_balancer_probe_lockup.py — verifies the balancer actually calls smoother.reseed() on both commit and reject paths, and replays the log2 probe sequence end-to-end against the balancer directly. - tests/test_e2e_probe_lockup.py — full harness reproduction of the log2 topology (two consumers self-reporting phase B, load on phase A) through a forced efficiency rotation, asserting the post-handoff grid stays within the deadband. The exact stale-sensor failure mode from the production log cannot be reproduced in-process (it requires a frozen HA websocket feed), so the fixes are defensive rather than a root-cause repro. The user should still inspect the HA sensor behind `power_input_alias` for staleness — the new smoother debug logs make that visible when `log_level: debug` is set. https://claude.ai/code/session_01CdgD1253g9MGysj21HmBdS

Root cause of the rotation lockup the user reported: the HomeWizard P1 WebSocket connection silently half-opened (TCP keepalives could succeed or the dongle stopped streaming measurements at the application layer without closing the socket). The CT002 emulator kept serving the last-received values to every battery poll; the balancer computed target=0 because its meter source was pinned at "grid is balanced"; the batteries sat at whatever state they were in when the freeze started — for ~1.5 hours until manual restart. The previous commit added emulator-side defences (smoother reseed, dedup fix) that don't actually recover from this because the problem is upstream. This commit fixes it at the source. HomeWizard powermeter --------------------- * Pass `heartbeat=30` to `ws_connect`. aiohttp now pings every 30 s and force-closes the connection if no pong is received within 60 s, so a half-open TCP can no longer freeze `async for msg in ws`. * Track a per-connection measurement watchdog task that force-closes the WebSocket if no `measurement` event arrives within 45 s. Covers the case where the TCP keepalives succeed (dongle is pinging back) but the measurement stream has stalled at the application layer — which is what the published HomeWizard firmware does when the P1 meter goes quiet. * Track `_last_measurement_time`. `get_powermeter_watts()` raises `ValueError("...stale...")` when the value is older than the configurable `max_measurement_age_seconds` (default 30 s). * Inject a clock for deterministic staleness unit tests. HomeAssistant powermeter ------------------------ * Same heartbeat and staleness pattern, keyed on `max_state_age_seconds` (default 60 s) with a per-entity update timestamp. Existing `heartbeat=30` was already present; this commit adds the application-layer freshness check on top so a stuck template sensor (where HA keeps the websocket alive but stops pushing state events) also surfaces as an error instead of silently serving cached values. * On reconnect, clear all per-entity update times so `get_*` raises until fresh initial states arrive — prevents the post-reconnect window where cached-but-actually-stale values would be served before `subscribe_entities` resends its snapshot. CT002 emulator -------------- * Rate-limit `before_send` failure logs to the first failure plus once every 30 s. The CT002 UDP server processes every battery poll, so per-tick logging would produce hundreds of lines per minute during an outage and bury everything else. The recovery log line fires exactly once when the failure spell ends. End-to-end regression --------------------- `tests/test_e2e_probe_lockup.py` adds two new scenarios on top of the happy-path handoff test: 1. `test_stale_meter_during_probe_causes_persistent_lockup` — the harness freezes `before_send` at a snapshot right before a forced rotation and asserts the grid drifts by ≥40 W. This documents the pre-fix behaviour so any future change that accidentally breaks it will be visible. (The fix for *this* scenario is in the powermeter, not the emulator.) 2. `test_powermeter_stale_error_is_handled_gracefully` — the harness makes `before_send` raise `ValueError` (equivalent to the fixed HomeWizard / HomeAssistant detecting staleness) and asserts: - only 1-3 rate-limited warnings in 50 poll cycles - batteries hold their state (no bad commands) - a single recovery log fires when the powermeter returns. Unit tests: - `homewizard_test.py`: heartbeat kwarg passthrough, staleness ages with an injected clock, fresh measurement clears staleness, `max_measurement_age_seconds=0` disables the check. - `homeassistant_test.py`: same four cases against the HA powermeter. All 415 tests pass (up from 406); mypy clean; ruff clean. https://claude.ai/code/session_01CdgD1253g9MGysj21HmBdS

Acts on the independent review of commits 1388678 and 4093da3. Real concerns addressed: 1. CT002 rate-limit warning now uses self._clock (falls back to time.time when no clock is injected) instead of wall time directly. Under the e2e harness's _FakeClock, the rate-limit now fires on simulated-time boundaries rather than real time, so the regression bound in test_powermeter_stale_error_is_handled_gracefully is deterministic: 40 steps × 3 s = 120 simulated seconds, one warning every 30 s = 3-6 warnings exactly. A regression that removes the rate limit entirely would produce ~80 warnings and fail the upper bound cleanly. 2. New unit tests for HomeWizard._measurement_watchdog: - test_measurement_watchdog_closes_ws_on_timeout verifies the timeout path actually calls ws.close() (patching WATCHDOG_TIMEOUT_SECONDS to 0.01 s so the real asyncio.wait_for fires the TimeoutError branch). - test_measurement_watchdog_re_arms_after_each_measurement drives a feeder task that sets _fresh_measurement_event twice, then stops — verifies the timer restarts from zero on every iteration, and that the next iteration still times out. Exercises the real event machinery, no mocks on wait_for. 3. HomeAssistant._entities_ready is now cleared on reconnect alongside _entity_update_time. Previously wait_for_message() would return immediately after a disconnect even though _get_entity_value would raise stale — misleading contract. _check_entities_ready now also requires a non-None update timestamp so an entity whose timestamp has been cleared (but whose value cache survives) will not be treated as ready. New test test_reconnect_clears_entity_update_times_and_ready_flag walks the full disconnect/reconnect cycle and verifies both signals clear and then recover. 4. test_stale_meter_during_probe_causes_persistent_lockup docstring rewritten: old version said "if a future change adds staleness detection... the assertions below will need to be updated" which was confusing because the fix already shipped (at the powermeter layer). New docstring explains clearly that this test documents the path WITHOUT powermeter staleness detection — the in-harness before_send deliberately doesn't implement it — and points to test_powermeter_stale_error_is_handled_gracefully for the other half. Nice-to-have comments added: 5. balancer.py _commit_probe gets a timing-note comment: the smoother reseed fires after compute_target has already captured smoothed_target, so it affects the NEXT tick, not the current one. Intentional but was undocumented. 6. smoother.py float-equality comment: raw_total is always int in production (sum of parse_int) so exact equality is safe; future callers feeding computed floats should swap to math.isclose. All 418 tests pass (up from 415); mypy clean; ruff clean. https://claude.ai/code/session_01CdgD1253g9MGysj21HmBdS

coderabbitai

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

config.ini.example (1)

191-197: ⚠️ Potential issue | 🟡 Minor

Remove duplicate THROTTLE_INTERVAL example in [HOMEASSISTANT].

Line 192 and Line 196 present two different example values for the same key in one section, which is easy to misread during setup.

📝 Suggested cleanup

-## Per-powermeter throttling override (optional)
-#THROTTLE_INTERVAL = 1
-
 ## Per-powermeter throttling override (optional)
 ## HomeAssistant typically needs 2-3 seconds due to network latency
 `#THROTTLE_INTERVAL` = 2

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@config.ini.example` around lines 191 - 197, The [HOMEASSISTANT] example
contains duplicate THROTTLE_INTERVAL entries with different sample values which
is confusing; remove one of the commented example lines so only a single
`#THROTTLE_INTERVAL` example remains (choose the HomeAssistant-recommended value
of 2 seconds), leaving a single clear comment under the section header
referencing THROTTLE_INTERVAL to avoid ambiguity.

src/astrameter/powermeter/iobroker.py (1)

54-61: ⚠️ Potential issue | 🟠 Major

Calculate mode can silently return incorrect power when aliases are missing.

If one alias is absent, Line 54-Line 61 falls back to 0 and still returns a value instead of failing fast.

🔧 Proposed fix

         else:
             response = await self.get_json(
                 f"/getBulk/{self.power_input_alias},{self.power_output_alias}"
             )
             power_in = 0
             power_out = 0
+            found_in = False
+            found_out = False
             for item in response:
                 if item["id"] == self.power_input_alias:
                     power_in = int(item["val"])
+                    found_in = True
                 if item["id"] == self.power_output_alias:
                     power_out = int(item["val"])
+                    found_out = True
+            if not found_in or not found_out:
+                raise ValueError(
+                    f"Missing alias in response: input={self.power_input_alias!r}, "
+                    f"output={self.power_output_alias!r}"
+                )
             return [power_in - power_out]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/powermeter/iobroker.py` around lines 54 - 61, The loop that
computes power_in and power_out from response silently defaults to 0 when an
expected alias is missing (self.power_input_alias or self.power_output_alias),
producing incorrect results; update the code that iterates response to detect
whether each alias was actually found (e.g., track found_input/found_output
booleans while examining response for ids matching self.power_input_alias and
self.power_output_alias) and if either is missing raise a clear exception or
return an explicit error value (include the missing alias names in the error
message) instead of returning power_in - power_out.

🟠 Major comments (28)

docs/ct002-capture-analysis.md-181-183 (1)

181-183: ⚠️ Potential issue | 🟠 Major

Fix request-field indexing in modeling guidance.

This indexing conflicts with your earlier mapping (Line 38-Line 44). As written, it can mis-parse request payloads.

Suggested doc fix

 - Parse request tail strictly as:
-  - `phase = fields[4]`
-  - `power = int(fields[5])`
+  - `phase = fields[5]`
+  - `power = int(fields[6])`
+  - (or safer: parse from tail as `phase = fields[-2]`, `power = int(fields[-1])`)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/ct002-capture-analysis.md` around lines 181 - 183, The doc's
request-tail parsing is using the wrong indices (phase = fields[4], power =
int(fields[5)); update it to match the earlier request-field mapping by shifting
both indices one position so it reads phase = fields[5] and power =
int(fields[6]) and add a brief note that these indices align with the earlier
mapping of the request payload to avoid mis-parsing.

src/astrameter/powermeter/esphome.py-28-29 (1)

28-29: ⚠️ Potential issue | 🟠 Major

Add HTTP status validation before JSON parsing.

Without resp.raise_for_status(), HTTP errors (4xx, 5xx) pass directly to JSON parsing, producing misleading errors instead of clearly indicating the HTTP failure.

🛠️ Suggested fix

         url = f"http://{self.ip}:{self.port}{path}"
         async with self.session.get(url) as resp:
+            resp.raise_for_status()
             return await resp.json(content_type=None)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/powermeter/esphome.py` around lines 28 - 29, Before parsing
the response JSON, validate the HTTP status by calling resp.raise_for_status()
inside the async with self.session.get(url) as resp block so HTTP 4xx/5xx errors
surface as HTTP errors rather than causing misleading JSON parse failures;
update the method that performs the request (the async block using
self.session.get and resp.json(content_type=None)) to call
resp.raise_for_status() immediately after entering the context and before
awaiting resp.json.

src/astrameter/powermeter/script.py-11-17 (1)

11-17: ⚠️ Potential issue | 🟠 Major

Add an execution timeout and kill hung scripts.

proc.communicate() on line 16 can wait forever if the command hangs. This blocks polling indefinitely and differs from other powermeters in the codebase, which consistently enforce ~10-second I/O timeouts (see sml.py, mqtt.py, homewizard.py, and HTTP-based meters).

🛠️ Suggested fix

         proc = await asyncio.create_subprocess_shell(
             self.script,
             stdout=asyncio.subprocess.PIPE,
             stderr=asyncio.subprocess.PIPE,
         )
-        stdout, stderr = await proc.communicate()
+        try:
+            stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=10)
+        except asyncio.TimeoutError as exc:
+            proc.kill()
+            await proc.wait()
+            raise RuntimeError(f"Script timed out after 10s: {self.script}") from exc

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/powermeter/script.py` around lines 11 - 17, The subprocess
call using asyncio.create_subprocess_shell (and the subsequent proc.communicate)
can hang; wrap the communicate call in an asyncio.wait_for with a ~10 second
timeout, and on asyncio.TimeoutError kill the process (proc.kill()), await
proc.wait() to reap it, and treat this as a failed execution (populate/return
appropriate stdout/stderr or raise an error) so polling doesn't block; update
the logic around proc.communicate / proc.returncode in the class/method that
runs self.script to enforce this timeout and ensure the process is cleaned up.

src/astrameter/powermeter/esphome.py-33-33 (1)

33-33: ⚠️ Potential issue | 🟠 Major

Avoid truncating ESPHome power values.

Casting to int on line 33 violates the function's type contract (list[float]) and drops fractional watts. Use float to preserve meter precision and match the declared return type.
🛠️ Suggested fix
-        return [int(parsed_data["value"])]
+        return [float(parsed_data["value"])]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/powermeter/esphome.py` at line 33, The return currently casts
parsed_data["value"] to int (return [int(parsed_data["value"])]) which drops
fractional watts and violates the declared list[float] return type; change the
cast to float so the function (where parsed_data is used and the return
statement occurs) returns [float(parsed_data["value"])] to preserve precision
and match the type contract.

src/astrameter/powermeter/shrdzm.py-23-33 (1)

23-33: ⚠️ Potential issue | 🟠 Major

Avoid raw credential interpolation in query strings.

Line 32 directly injects user/password into the URL. Credentials containing &, =, ?, or % can produce malformed requests and unpredictable parsing.

🔧 Proposed fix

+from urllib.parse import urlencode
+
 class Shrdzm(Powermeter):
@@
-    async def get_json(self, path):
+    async def get_json(self, path: str):
         if not self.session:
             raise RuntimeError("Session not started; call start() first")
         url = f"http://{self.ip}{path}"
         async with self.session.get(url) as resp:
             return await resp.json(content_type=None)
@@
     async def get_powermeter_watts(self) -> list[float]:
-        response = await self.get_json(
-            f"/getLastData?user={self.user}&password={self.password}"
-        )
+        query = urlencode({"user": self.user, "password": self.password})
+        response = await self.get_json(f"/getLastData?{query}")
         return [int(response["1.7.0"]) - int(response["2.7.0"])]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/powermeter/shrdzm.py` around lines 23 - 33, The code in
get_powermeter_watts constructs the request with raw credential interpolation
(f"/getLastData?user={self.user}&password={self.password}") which can break if
user/password contain special characters; change to build a params dict (e.g.,
params={"user": self.user, "password": self.password}) and let the HTTP client
encode them, or explicitly URL-encode values (urllib.parse.quote_plus) before
adding to the path; update the call site in get_powermeter_watts to pass params
to get_json or modify get_json to accept params and call session.get(url,
params=params) so credentials are encoded safely.

src/astrameter/powermeter/vzlogger.py-30-31 (1)

30-31: ⚠️ Potential issue | 🟠 Major

Don’t truncate fractional watt values.

int(...) silently rounds toward zero, so a payload like 123.8 becomes 123 and underreports the meter. This should preserve the reading as float.

🛠️ Proposed fix

     async def get_powermeter_watts(self) -> list[float]:
-        return [int((await self.get_json())["data"][0]["tuples"][0][1])]
+        return [float((await self.get_json())["data"][0]["tuples"][0][1])]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/powermeter/vzlogger.py` around lines 30 - 31, The
get_powermeter_watts method is truncating fractional watt values by calling
int(...) on the reading; update it to preserve decimals by converting the
payload value to a float (e.g., use float(...) or avoid the int cast) when
building the returned list so the returned list[float] contains the exact
reading (handle string numeric values by passing them through float()). Ensure
you still await get_json() and extract ["data"][0]["tuples"][0][1] as before but
cast to float instead of int.

src/astrameter/powermeter/vzlogger.py-23-28 (1)

23-28: ⚠️ Potential issue | 🟠 Major

Add HTTP status check before parsing JSON response.

Without calling raise_for_status(), a 4xx or 5xx response falls through to resp.json(), which masks the HTTP failure as a JSON parsing error instead of surfacing the actual cause. This pattern is already established across similar powermeters in the codebase (json_http, tasmota, shelly, tq_em).

🛠️ Proposed fix

     async def get_json(self):
         if not self.session:
             raise RuntimeError("Session not started; call start() first")
         url = f"http://{self.ip}:{self.port}/{self.uuid}"
         async with self.session.get(url) as resp:
+            resp.raise_for_status()
             return await resp.json(content_type=None)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/powermeter/vzlogger.py` around lines 23 - 28, The get_json
method in vzlogger.py should check HTTP status before attempting to parse JSON:
in async def get_json(self) (which builds url and uses self.session.get), call
resp.raise_for_status() right after entering the response context (before await
resp.json(...)) so 4xx/5xx errors surface as HTTP errors rather than JSON parse
failures; update the method to invoke resp.raise_for_status() on the aiohttp
response object and then return resp.json(content_type=None).

ha_addon/run.sh-59-63 (1)

59-63: ⚠️ Potential issue | 🟠 Major

Don't emit a synthetic [CT002] block when no CT device is selected.

If device_types contains neither ct002 nor ct003, ct_section stays "CT002" and the fallback branch still writes a CT section with an empty CT_MAC. That makes non-CT configs look like CT installs and can send the loader down the wrong path.

Proposed fix

-    ct_section="CT002"
-    if [ "$has_ct003" -eq 1 ] && [ "$has_ct002" -eq 0 ]; then
-        ct_section="CT003"
-    fi
+    ct_section=""
+    if [ "$has_ct002" -eq 1 ]; then
+        ct_section="CT002"
+    elif [ "$has_ct003" -eq 1 ]; then
+        ct_section="CT003"
+    fi
@@
-        else
+        elif [ -n "$ct_section" ]; then
             echo "[$ct_section]"
             echo "CT_MAC=$ct_mac"
             [ -n "$min_efficient_power" ] && echo "MIN_EFFICIENT_POWER=$min_efficient_power"
             [ -n "$efficiency_rotation_interval" ] && echo "EFFICIENCY_ROTATION_INTERVAL=$efficiency_rotation_interval"
             echo ""
         fi

Also applies to: 86-103

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@ha_addon/run.sh` around lines 59 - 63, The script sets ct_section="CT002" by
default causing an empty CT block when neither CT device is selected; change the
logic so ct_section is set only if has_ct002 or has_ct003 is true (e.g., set
ct_section based on has_ct003/has_ct002 only, otherwise leave unset/empty) and
guard the code that emits the CT section (the block that writes CT_MAC / CT
section around lines where ct_section is used) to run only when has_ct002 or
has_ct003 is true; apply the same conditional fix to the second occurrence of
that CT-writing code later in the file (the other block around lines 86-103).

ha_addon/run.sh-35-39 (1)

35-39: ⚠️ Potential issue | 🟠 Major

Don't dump arbitrary custom config contents with this exact-key redactor.

print_redacted_config only masks keys literally named PASSWORD, TOKEN, etc. A custom config entry like MQTT_PASSWORD= or API_TOKEN= will still be logged verbatim at Line 147, so this can still leak credentials to addon logs.

Also applies to: 147-147
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ha_addon/run.sh` around lines 35 - 39, The current print_redacted_config
function only redacts exact key names (PASSWORD, TOKEN, etc.), which misses keys
like MQTT_PASSWORD or API_TOKEN and can leak secrets when the script logs
config; update the sed regex in print_redacted_config to match keys that contain
those secret substrings anywhere in the key name (e.g., match word parts like
(_|^)(PASSWORD|TOKEN|ACCESSTOKEN|ACCESSTOKEN|SECRET|MAILBOX)\b,
case-insensitive) and also handle optional export/whitespace and quoted values
so any assignment like FOO_PASSWORD="..." or export API_TOKEN=... is replaced
with REDACTED before printing; ensure the same improved redaction logic is used
wherever print_redacted_config is invoked to output configs.

ha_addon/run.sh-42-47 (1)

42-47: ⚠️ Potential issue | 🟠 Major

Keep custom_config constrained to /config.

"/config/$(bashio::config 'custom_config')" accepts ../ segments. A value like ../../etc/passwd will resolve outside /config, get copied, and then be printed, which turns this into an arbitrary file read + log disclosure path.

Proposed fix

-if bashio::config.has_value 'custom_config' && [ -f "/config/$(bashio::config 'custom_config')" ]; then
-    bashio::log.info "Using custom config file: $(bashio::config 'custom_config')"
+custom_config="$(bashio::config 'custom_config')"
+custom_config_path="$(realpath -m "/config/$custom_config")"
+if bashio::config.has_value 'custom_config' \
+    && [ "${custom_config_path#/config/}" != "$custom_config_path" ] \
+    && [ -f "$custom_config_path" ]; then
+    bashio::log.info "Using custom config file: ${custom_config_path#/config/}"
@@
-    cp "/config/$(bashio::config 'custom_config')" "$CONFIG"
+    cp "$custom_config_path" "$CONFIG"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@ha_addon/run.sh` around lines 42 - 47, The custom_config path is not
validated and can include ../ to escape /config; before using cp with
"/config/$(bashio::config 'custom_config')" (and before logging it) validate and
sanitize the value from bashio::config 'custom_config' — either restrict to a
filename (basename) or resolve the full path (readlink -f /config/<value>) and
ensure the resolved path starts with /config; if the check fails, log an error
and abort. Replace direct uses of the raw bashio::config 'custom_config' in the
bashio::log.info and cp call with the sanitized/validated path and keep $CONFIG
as the destination.

src/astrameter/powermeter/json_http.py-1-5 (1)

1-5: ⚠️ Potential issue | 🟠 Major

Handle request timeouts the same way as other transport failures.

With ClientTimeout(total=10), aiohttp raises asyncio.TimeoutError when the total timeout is exceeded. This exception currently escapes the error handler and propagates uncaught, unlike other HTTP failures which are normalized to ValueError. Add asyncio.TimeoutError to the exception handler to ensure consistent exception behavior.

Proposed fix

+import asyncio
 import json
 
 import aiohttp
 from aiohttp import BasicAuth, ClientTimeout
@@
-        except aiohttp.ClientError as e:
+        except (aiohttp.ClientError, asyncio.TimeoutError) as e:
             logger.error(f"HTTP request error: {e}")
             raise ValueError(f"HTTP request error: {e}") from e

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/powermeter/json_http.py` around lines 1 - 5, The request
timeout raised by aiohttp when using ClientTimeout(total=10)
(asyncio.TimeoutError) is not being normalized with other transport failures;
modify the request-sending code that uses ClientTimeout to also import asyncio
and include asyncio.TimeoutError alongside other transport exceptions so it is
caught and converted to the same ValueError path (the handler that currently
normalizes aiohttp/transport failures). Ensure you add "import asyncio" at top
and update the except clause to include asyncio.TimeoutError so timeout errors
are handled identically to other transport errors.

src/astrameter/ct002/smoother.py-59-97 (1)

59-97: ⚠️ Potential issue | 🟠 Major

Use identity comparison for sample_id to match documented contract.

Line 91 compares tuples with ==, but the class docstring explicitly states dedup is keyed on tuple identity. Each meter cycle creates a fresh sample_id tuple via tuple(values). With value equality, a fresh cycle with stable readings (same meter values) gets treated as a duplicate and blocks EMA convergence.
Proposed fix
-        if sample_id == self._last_sample and raw_total == self._last_raw_total:
+        if sample_id is self._last_sample and raw_total == self._last_raw_total:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/ct002/smoother.py` around lines 59 - 97, The dedup check in
TargetSmoother.update incorrectly compares sample_id by equality; change the
test so it uses identity for the sample key (i.e., compare sample_id is
self._last_sample) while keeping the raw_total equality check as-is on
self._last_raw_total; update the condition in the update method that currently
reads "if sample_id == self._last_sample and raw_total == self._last_raw_total"
to use identity for sample_id so fresh tuple instances with identical contents
do not suppress EMA updates.

src/astrameter/marstek_api.py-40-74 (1)

40-74: ⚠️ Potential issue | 🟠 Major

Don’t embed credentials in MarstekApiError messages.

full_url includes mailbox, token, and the hashed password, so any logged exception here leaks sensitive data into logs. Use the endpoint path or a redacted URL in error messages instead.

Suggested fix

 def _http_get_json(
     url: str, params: dict[str, Any], headers: dict[str, str] | None = None
 ):
     query = urllib.parse.urlencode(params)
     full_url = f"{url}?{query}"
+    safe_url = url
     req = urllib.request.Request(full_url, headers=headers or {}, method="GET")
@@
     except urllib.error.URLError as exc:
-        raise MarstekApiError(
-            f"Network error calling {full_url}: {exc.reason}"
-        ) from exc
+        raise MarstekApiError(
+            f"Network error calling {safe_url}: {exc.reason}"
+        ) from exc
     except Exception as exc:
-        raise MarstekApiError(f"Unexpected error calling {full_url}: {exc}") from exc
+        raise MarstekApiError(f"Unexpected error calling {safe_url}: {exc}") from exc
@@
     except Exception as exc:
         snippet = body[:200] if body else "<empty>"
-        raise MarstekApiError(f"Non-JSON response from {full_url}: {snippet}") from exc
+        raise MarstekApiError(f"Non-JSON response from {safe_url}: {snippet}") from exc
 
     if code is None or code < 200 or code >= 300:
-        raise MarstekApiError(f"HTTP {code} from {full_url}: {payload}")
+        raise MarstekApiError(f"HTTP {code} from {safe_url}: {payload}")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/marstek_api.py` around lines 40 - 74, The _http_get_json
function currently embeds full_url (which contains sensitive query params like
mailbox, token, and password) into MarstekApiError messages; modify this to
build a redacted_url (use urllib.parse.urlparse and urllib.parse.parse_qs on
full_url) that either only includes the path and host or keeps query keys but
masks values for sensitive keys ("mailbox","token","password" etc.), and use
redacted_url in all MarstekApiError messages and logs (in the URLError/Exception
handlers and the non-2xx/JSON error paths) instead of full_url so no credentials
are leaked.

release.sh-71-74 (1)

71-74: ⚠️ Potential issue | 🟠 Major

Also check origin for an existing tag before continuing.

Right now only local refs are checked. If origin already has $VERSION, the script will still update main, create the local tag, and then fail at git push origin "$VERSION", leaving the release half-complete.

Suggested fix

 if git show-ref --verify --quiet "refs/tags/$VERSION"; then
   print_error "Tag $VERSION already exists."
   exit 1
 fi
+
+if git ls-remote --exit-code --tags origin "refs/tags/$VERSION" >/dev/null 2>&1; then
+  print_error "Tag $VERSION already exists on origin."
+  exit 1
+fi

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@release.sh` around lines 71 - 74, The local-only tag check around VERSION
(the git show-ref block) must also verify the tag does not already exist on
origin to avoid creating a partial release before git push origin "$VERSION"
fails; update the logic so after (or before) the local check you fetch remote
tags (e.g., git fetch --tags) and then query origin for the tag (using git
ls-remote --tags or an equivalent remote check for "$VERSION") and if found
print the same error and exit, ensuring you reference VERSION and the git push
origin "$VERSION" step when adding the remote validation.

src/astrameter/shelly/shelly.py-185-208 (1)

185-208: ⚠️ Potential issue | 🟠 Major

Only mark a sender as “seen” after the request is validated.

_track_battery_seen() runs before UTF-8 decoding, JSON parsing, method checks, and client-filter resolution. Any stray UDP packet on this port can therefore inflate battery_count and later emit a _removed event for a battery that never made a valid request.

Suggested fix

     async def _handle_request(self, transport, data, addr):
-        poll_interval = self._track_battery_seen(addr)
-
         try:
             request_str = data.decode()
@@
             request = json.loads(request_str)
             logger.debug(f"Parsed request: {json.dumps(request, indent=2)}")
             if isinstance(request.get("params", {}).get("id"), int):
                 powermeter = None
                 for pm, client_filter in self._powermeters:
                     if client_filter.matches(addr[0]):
                         powermeter = pm
                         break
                 if powermeter is None:
                     logger.warning(f"No powermeter found for client {addr[0]}")
                     return
+                if request.get("method") not in ("EM.GetStatus", "EM1.GetStatus"):
+                    return
+
+                poll_interval = self._track_battery_seen(addr)
 
                 powers = await powermeter.get_powermeter_watts()
 
-                if request.get("method") == "EM.GetStatus":
+                if request.get("method") == "EM.GetStatus":
                     response = self._create_em_response(request["id"], powers)
                 elif request.get("method") == "EM1.GetStatus":
                     response = self._create_em1_response(request["id"], powers)
-                else:
-                    return

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/shelly/shelly.py` around lines 185 - 208, The call to
_track_battery_seen(addr) is happening too early in _handle_request, causing
invalid/non-UTF8 or non-JSON datagrams to be counted as seen; move the
_track_battery_seen call to after successful UTF-8 decoding, json.loads parsing,
validation of request structure (e.g., method/params checks) and successful
client-filter resolution (the powermeter lookup loop), so only validated
requests increment battery_count and can later trigger _removed events; update
references to _track_battery_seen, _handle_request, battery_count and the
powermeter/client_filter resolution logic accordingly.

src/astrameter/simulator/tui.py-331-349 (1)

331-349: ⚠️ Potential issue | 🟠 Major

Clamp SOC in in-process mode too.

The daemon/API path keeps SOC in [0, 1], but the in-process path can drive it negative or above 1.0 after repeated keypresses. That makes the two execution modes behave differently and can leave the simulator in an invalid state.

💡 Suggested fix

     def action_adjust_soc(self, delta: float) -> None:
         bat = self._get_selected_battery()
         if bat is None:
             return
         if self._runner:
             bat_obj = self._runner.batteries[self._selected_battery]
-            bat_obj.soc = bat_obj.soc + delta
+            bat_obj.soc = max(0.0, min(1.0, bat_obj.soc + delta))
         elif self._daemon_port:
             new_soc = max(0, min(1, bat.get("soc", 0.5) + delta))
             self._post(f"/batteries/{bat['mac']}/soc", {"soc": new_soc})

     def action_set_soc(self, value: float) -> None:
         bat = self._get_selected_battery()
         if bat is None:
             return
         if self._runner:
-            self._runner.batteries[self._selected_battery].soc = value
+            self._runner.batteries[self._selected_battery].soc = max(
+                0.0, min(1.0, value)
+            )
         elif self._daemon_port:
             self._post(f"/batteries/{bat['mac']}/soc", {"soc": value})

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/simulator/tui.py` around lines 331 - 349, The in-process
branch of action_adjust_soc and action_set_soc does not clamp battery.soc to
[0,1], so update both to bound values when modifying
self._runner.batteries[self._selected_battery].soc: in action_adjust_soc compute
new_soc = max(0, min(1, bat_obj.soc + delta)) and assign that, and in
action_set_soc assign self._runner.batteries[self._selected_battery].soc =
max(0, min(1, value)); keep using the same _runner and batteries symbols to
locate the code.

src/astrameter/powermeter/sma_energy_meter.py-234-240 (1)

234-240: ⚠️ Potential issue | 🟠 Major

Reset the message event after a successful wait.

_async_message_event stays set after the first packet, so subsequent wait_for_message() calls return immediately even when no new packet has arrived.

💡 Suggested fix

     async def wait_for_message(self, timeout=5):
         if self._async_message_event is None:
             raise RuntimeError("start() must be called before wait_for_message()")
         try:
             await asyncio.wait_for(self._async_message_event.wait(), timeout)
+            self._async_message_event.clear()
         except asyncio.TimeoutError:
             raise TimeoutError("Timeout waiting for SMA Energy Meter data") from None

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/powermeter/sma_energy_meter.py` around lines 234 - 240,
wait_for_message currently leaves self._async_message_event set after the first
packet so future calls return immediately; modify wait_for_message (the async
def wait_for_message method) to clear the event after a successful await (e.g.,
call self._async_message_event.clear() or reassign a fresh asyncio.Event()
immediately after the await completes and before returning) so subsequent calls
block until a new packet arrives, while preserving the existing RuntimeError
check and TimeoutError behavior.

src/astrameter/simulator/powermeter_sim.py-161-166 (1)

161-166: ⚠️ Potential issue | 🟠 Major

Require a real JSON boolean for /auto.

bool(body.get("enabled", False)) treats "false", "0", and any non-empty object as True, so invalid payloads can flip auto mode on.

💡 Suggested fix

     async def _handle_set_auto(self, request: web.Request) -> web.Response:
         body = await self._parse_json(request)
         if isinstance(body, web.Response):
             return body
-        self.load_model.auto_mode = bool(body.get("enabled", False))
+        enabled = body.get("enabled")
+        if not isinstance(enabled, bool):
+            return web.json_response({"error": "invalid 'enabled'"}, status=400)
+        self.load_model.auto_mode = enabled
         return web.json_response(self._build_status())

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/simulator/powermeter_sim.py` around lines 161 - 166, The
handler _handle_set_auto currently coerces any truthy value into True via
bool(body.get("enabled", False)), so update it to require a real JSON boolean:
extract enabled = body.get("enabled") and if not isinstance(enabled, bool)
return a 400 Bad Request response (use web.HTTPBadRequest or a JSON error
response) explaining "enabled must be a boolean"; only then set
self.load_model.auto_mode = enabled and return the status via
self._build_status(). Ensure you reference _handle_set_auto, _parse_json, and
self.load_model.auto_mode when making the change.

src/astrameter/powermeter/mqtt.py-150-164 (1)

150-164: ⚠️ Potential issue | 🟠 Major

Avoid clearing _message_event before re-checking completion.

There's a race here: the final message can arrive between the deadline calculation and self._message_event.clear(). In that case all values are already populated, but the signal gets erased and the method can still time out.

💡 Suggested fix

     async def wait_for_message(self, timeout=5):
         if all(v is not None for v in self.values):
             return
         loop = asyncio.get_running_loop()
         deadline = loop.time() + timeout
         while True:
+            if all(v is not None for v in self.values):
+                return
             remaining = deadline - loop.time()
             if remaining <= 0:
                 raise TimeoutError("Timeout waiting for MQTT message")
-            self._message_event.clear()
             try:
                 await asyncio.wait_for(self._message_event.wait(), timeout=remaining)
+                self._message_event.clear()
             except asyncio.TimeoutError:
                 raise TimeoutError("Timeout waiting for MQTT message") from None
-            if all(v is not None for v in self.values):
-                return

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/powermeter/mqtt.py` around lines 150 - 164, In
wait_for_message, avoid clearing self._message_event before re-checking
completion to prevent the race where the final MQTT message arrives and sets
values before clear erases the signal; modify the loop in wait_for_message to
recompute remaining, then immediately check if all(v is not None for v in
self.values) and return if so, only clear or await self._message_event.wait()
when values are still incomplete (alternatively use self._message_event.is_set()
to decide whether to clear), referencing the wait_for_message method and the
self._message_event and self.values symbols.

src/astrameter/simulator/powermeter_sim.py-101-116 (1)

101-116: ⚠️ Potential issue | 🟠 Major

Reject invalid watts payloads instead of letting them reach float().

float(watts) will raise on lists/objects and silently accepts booleans as 1.0/0.0. A malformed client request should stay a 400, not become a server error or implicit coercion.

💡 Suggested fix

     async def _handle_set_solar(self, request: web.Request) -> web.Response:
         body = await self._parse_json(request)
         if isinstance(body, web.Response):
             return body
         watts = body.get("watts")
         if watts is None:
             return web.json_response({"error": "missing 'watts'"}, status=400)
         if isinstance(watts, str):
             if watts == "max":
                 watts = self.load_model.solar_max
             elif watts == "off":
                 watts = 0.0
             else:
                 return web.json_response({"error": "invalid watts"}, status=400)
-        self.load_model.set_solar(float(watts))
+        else:
+            try:
+                if isinstance(watts, bool):
+                    raise TypeError
+                watts = float(watts)
+            except (TypeError, ValueError):
+                return web.json_response({"error": "invalid watts"}, status=400)
+        self.load_model.set_solar(watts)
         return web.json_response(self._build_status())

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/simulator/powermeter_sim.py` around lines 101 - 116, In
_handle_set_solar, reject non-numeric/non-allowed-string payloads before calling
float() to avoid implicit coercion or server errors: ensure watts is either an
int/float (not bool) or one of the strings "max"/"off", map
"max"→self.load_model.solar_max and "off"→0.0, and for numeric types call
self.load_model.set_solar(float(watts)); for any other type
(list/dict/bool/None/other string) return web.json_response({"error":"invalid
watts"}, status=400). Use isinstance checks (e.g. isinstance(watts, (int,float))
and not isinstance(watts, bool)) and reference the _handle_set_solar handler and
load_model.set_solar in your change.

src/astrameter/simulator/tui.py-149-172 (1)

149-172: ⚠️ Potential issue | 🟠 Major

Reconnect the SSE listener instead of exiting on the first drop.

In attach mode, any transient disconnect logs once and ends the worker, so the UI stops updating until the whole app is restarted. This should retry in a loop with a short backoff.

💡 Suggested fix

 async def _sse_listener(self) -> None:
     """Connect to daemon SSE endpoint and update status."""
     import aiohttp

     url = f"http://localhost:{self._daemon_port}/events"
-    try:
-        async with (
-            aiohttp.ClientSession() as session,
-            session.get(
-                url, timeout=aiohttp.ClientTimeout(total=None, sock_read=60)
-            ) as resp,
-        ):
-            buffer = ""
-            async for chunk_bytes in resp.content.iter_any():
-                buffer += chunk_bytes.decode("utf-8", errors="replace")
-                while "\n\n" in buffer:
-                    event, buffer = buffer.split("\n\n", 1)
-                    for line in event.split("\n"):
-                        if line.startswith("data: "):
-                            with contextlib.suppress(json.JSONDecodeError):
-                                self._status = json.loads(line[6:])
-    except Exception as exc:
-        logger.error("SSE connection lost: %s", exc)
+    while True:
+        try:
+            async with (
+                aiohttp.ClientSession() as session,
+                session.get(
+                    url, timeout=aiohttp.ClientTimeout(total=None, sock_read=60)
+                ) as resp,
+            ):
+                buffer = ""
+                async for chunk_bytes in resp.content.iter_any():
+                    buffer += chunk_bytes.decode("utf-8", errors="replace")
+                    while "\n\n" in buffer:
+                        event, buffer = buffer.split("\n\n", 1)
+                        for line in event.split("\n"):
+                            if line.startswith("data: "):
+                                with contextlib.suppress(json.JSONDecodeError):
+                                    self._status = json.loads(line[6:])
+        except asyncio.CancelledError:
+            raise
+        except Exception as exc:
+            logger.warning("SSE connection lost: %s", exc)
+            await asyncio.sleep(1)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/simulator/tui.py` around lines 149 - 172, The _sse_listener
currently exits on the first connection drop; change it to run reconnect
attempts in a loop inside async def _sse_listener so it continuously tries to
reconnect to f"http://localhost:{self._daemon_port}/events" with a short
exponential (or fixed) backoff (use asyncio.sleep) between attempts, recreating
the aiohttp.ClientSession and request each iteration, preserving the existing
logic that parses resp.content and sets self._status; ensure exceptions are
caught per-attempt (log via logger.error) but do not break the loop, and provide
a cancellation/stop condition if needed (e.g., check an instance flag or
asyncio.CancelledError).

src/astrameter/powermeter/throttling.py-90-92 (1)

90-92: ⚠️ Potential issue | 🟠 Major

Avoid leaving _pending_fetch with an unobserved exception.

When no followers are coalesced and the no-cache error path is hit, set_exception() is called and the leader immediately re-raises. The future is left with an unobserved exception, causing asyncio to emit a Future exception was never retrieved warning. This is a reliability and log-noise issue.

Suggested fix

             if not self._pending_fetch.done():
                 self._pending_fetch.set_exception(e)
+                self._pending_fetch.exception()
             raise

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/powermeter/throttling.py` around lines 90 - 92, When setting
an exception on the coalescing Future self._pending_fetch and then immediately
re-raising, the Future can remain with an unobserved exception; after calling
self._pending_fetch.set_exception(e) ensure the exception is observed (for
example by calling self._pending_fetch.exception() or accessing .result() inside
a try/except) before re-raising so asyncio will not emit "Future exception was
never retrieved"; update the code path that currently does `if not
self._pending_fetch.done(): self._pending_fetch.set_exception(e); raise` to also
retrieve the exception from self._pending_fetch (or otherwise mark it handled)
immediately after setting it.

src/astrameter/simulator/runner.py-173-190 (1)

173-190: ⚠️ Potential issue | 🟠 Major

Validate auto_interval before storing it.

parse_config() accepts any JSON value here, but _auto_loop() later does lo, hi = self.load_model.auto_interval. A 1-item/3-item list, non-numeric values, or an inverted range will crash at runtime in the background task instead of failing fast during config load. Please validate that this is exactly two positive numbers with lo <= hi.

💡 Possible fix

 def validate_config(cfg: SimulationConfig) -> None:
     """Raise ``ValueError`` on invalid configuration."""
@@
     for phase in cfg.solar_phases:
         if phase not in ("A", "B", "C"):
             raise ValueError(f"Invalid solar phase {phase!r}")
 
+    if len(cfg.auto_interval) != 2:
+        raise ValueError(
+            f"auto_interval must contain exactly two values, got {cfg.auto_interval!r}"
+        )
+    lo, hi = cfg.auto_interval
+    if lo <= 0 or hi <= 0:
+        raise ValueError(f"auto_interval values must be > 0, got {cfg.auto_interval!r}")
+    if lo > hi:
+        raise ValueError(f"auto_interval must satisfy lo <= hi, got {cfg.auto_interval!r}")
+
     if cfg.time_scale <= 0:
         raise ValueError(f"time_scale must be positive, got {cfg.time_scale}")

Also applies to: 194-223

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/simulator/runner.py` around lines 173 - 190, The config
accepts any JSON for auto_interval but _auto_loop() unpacks it as lo, hi;
validate in parse_config() (before creating SimulationConfig) that
auto_interval_raw is a sequence of exactly two numbers, both finite and >= 0 (or
>0 if required) and that lo <= hi; convert to a tuple only after validation and
raise a clear ValueError (or ConfigError) with context if validation fails so
load_model.auto_interval is always a valid (lo, hi) pair when assigned to
SimulationConfig.

src/astrameter/powermeter/homewizard.py-61-64 (1)

61-64: ⚠️ Potential issue | 🟠 Major

wait_for_message() stops waiting after the first sample.

_message_event is only ever set, never re-armed, so every call after the first measurement returns immediately—even after a reconnect or a stalled stream. That makes callers think fresh data arrived when they're still looking at an old sample. Track a monotonic message counter or recreate/clear the event around each wait.

Also applies to: 216-240
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/powermeter/homewizard.py` around lines 61 - 64, The
_message_event is only set once so wait_for_message() returns immediately after
the first sample; instead implement a monotonic message counter and use that to
detect new messages: add a self._message_seq integer (initialized 0), increment
it each time a new WS message is processed where the code currently calls
self._message_event.set(), and keep setting the asyncio.Event for wake-ups;
change wait_for_message() to capture the current seq and await until
self._message_seq != captured_seq (or await the event then re-check the seq),
then clear/reset the event or re-arm it for the next message; apply the same
pattern for _fresh_measurement_event (and the ws_loop watchdog logic referenced
around the 216-240 region) so reconnects/stalls don’t cause waiters to
spuriously return for stale data.

src/astrameter/powermeter/sml.py-43-61 (1)

43-61: ⚠️ Potential issue | 🟠 Major

Don't use 0W as the "no sample" sentinel.

When a frame has none of the configured OBIS values, from_sml_frame() falls back to cls(), and _current is also initialized the same way. That makes "never got a valid reading" indistinguishable from a real 0W measurement, so startup checks can pass on a bad OBIS config and concurrent callers can get fabricated zeroes during the first read. Please represent "no valid sample yet" explicitly and raise or keep the last good sample until a matching frame arrives.

Also applies to: 98-126

src/astrameter/config/config_loader.py-73-85 (1)

73-85: ⚠️ Potential issue | 🟠 Major

Don't treat a blank POWER_MULTIPLIER as zero.

parse_float_list() returns [0.0] when the option is present but empty. In read_all_powermeter_configs(), that turns POWER_MULTIPLIER = into a transform that zeros every reading instead of preserving the original value or failing fast.

💡 Suggested fix

-def parse_float_list(value: str, key_name: str, section: str) -> list[float]:
+def parse_float_list(
+    value: str,
+    key_name: str,
+    section: str,
+    *,
+    default: float,
+) -> list[float]:
     tokens = [t.strip() for t in value.split(",")]
     result = []
     for token in tokens:
         if not token:
             continue
@@
-    return result if result else [0.0]
+    return result if result else [default]

Then call it with default=0.0 for offsets and default=1.0 for multipliers.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/config/config_loader.py` around lines 73 - 85,
parse_float_list currently returns [0.0] for an explicitly present-but-empty
option which causes POWER_MULTIPLIER to zero readings; change parse_float_list
to accept a default parameter (e.g., default: float) and when the parsed result
is empty return [default] instead of [0.0], then update
read_all_powermeter_configs to call parse_float_list(..., default=0.0) for
offsets and parse_float_list(..., default=1.0) for multipliers so an empty
POWER_MULTIPLIER preserves values rather than zeroing them.

src/astrameter/mqtt_insights/service.py-146-149 (1)

146-149: ⚠️ Potential issue | 🟠 Major

Guard start() against duplicate service loops.

A second start() overwrites self._task without stopping the first one, so you can end up with multiple MQTT clients publishing/subscribing in parallel and only one of them reachable via stop().

💡 Suggested fix

 async def start(self) -> None:
+    if self._task is not None and not self._task.done():
+        return
     self._connected.clear()
     self._task = asyncio.create_task(self._run())

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/mqtt_insights/service.py` around lines 146 - 149, The start()
method can create a second concurrent service loop because it blindly overwrites
self._task; modify start() to guard against duplicates by checking self._task
and its state before creating a new task (e.g., if self._task is not None and
not self._task.done(): either return early or raise), or if you want restart
behavior cancel/await the existing task first (call self._task.cancel() and
await it or handle CancelledError) before assigning a new
asyncio.create_task(self._run()); ensure you still clear self._connected and
that stop() continues to cancel/await the single authoritative self._task so
only one _run() loop is active.

src/astrameter/ct002/balancer.py-879-883 (1)

879-883: ⚠️ Potential issue | 🟠 Major

Reset fade state when efficiency limiting turns off.

This branch clears _deprioritized, but it leaves existing fade_weights and post-probe fade markers intact. If the pool shrinks to one remaining consumer while that consumer was previously faded down, it will keep getting only a fraction of the target for a few cycles instead of taking over immediately.

💡 Suggested fix

         if cfg.min_efficient_power <= 0 or len(reports) < 2:
-            self._probe_state = None
+            self._clear_probe_state("efficiency disabled")
+            self._clear_post_probe_fade()
             self._deprioritized = set()
+            for cid in reports:
+                self._get_consumer(cid).fade_weight = 1.0
             self._invalidate_efficiency_cache()
             return {}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/ct002/balancer.py` around lines 879 - 883, When disabling
efficiency limiting (the branch that sets self._probe_state = None, clears
self._deprioritized and calls self._invalidate_efficiency_cache()), also reset
any per-consumer fade state so a remaining consumer isn't left faded;
specifically clear the in-memory fade-weight storage and any post-probe fade
markers (e.g. the dict/list/set that stores per-consumer fade_weight and the
post-probe fade marker structure) or reset them to defaults. Locate where fade
weights and post-probe flags are maintained in this class and add explicit
clears (e.g. self._fade_weights.clear() and self._post_probe_fade.clear() or
equivalent) in the same branch after _deprioritized is cleared.

🟡 Minor comments (4)

src/astrameter/shelly/shelly.py-54-63 (1)
54-63: ⚠️ Potential issue | 🟡 Minor

Preserve the sign when forcing tiny non-zero phase values.

For inputs like -0.05, this returns 0.001, which flips a small export value into an import value. The epsilon should keep the original sign.
Suggested fix
+import math
@@
     def _calculate_derived_values(self, power):
         decimal_point_enforcer = 0.001
         if abs(power) < 0.1:
-            return decimal_point_enforcer
+            return (
+                math.copysign(decimal_point_enforcer, power)
+                if power != 0
+                else decimal_point_enforcer
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/shelly/shelly.py` around lines 54 - 63, The small-value
handling in _calculate_derived_values currently returns a positive
decimal_point_enforcer for any abs(power) < 0.1, which flips negative tiny
exports to positive; change this to preserve the input sign by using the sign of
power (e.g., math.copysign(decimal_point_enforcer, power)) when returning for
abs(power) < 0.1, and similarly when adding the epsilon in the round(...) branch
replace (decimal_point_enforcer if power == round(power) or power == 0 else 0)
with a signed epsilon (use copysign or power/abs(power) except keep positive for
exact zero) so the forced decimal retains the original sign.
CHANGELOG.md-3-26 (1)
3-26: ⚠️ Potential issue | 🟡 Minor

Collapse ## Next back to a single summary bullet.

This section now expands into many bullets plus a nested subsection, which conflicts with the repo changelog format. Please fold this into one branch-level summary bullet and update that same bullet on follow-ups instead of appending more entries.

As per coding guidelines, CHANGELOG.md: "Keep one bullet under ## Next in CHANGELOG.md per branch that summarizes the overall outcome; edit the same bullet on later iterations rather than appending new bullets" and "Do not expand CHANGELOG.md with every internal or tooling-only follow-up; only update if the user-visible story changes".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CHANGELOG.md` around lines 3 - 26, The `## Next` section in CHANGELOG.md is
expanded into many bullets and subsections but must be a single branch-level
summary; collapse all current detailed bullets (CT002/CT003, MQTT Insights,
powermeter additions, Breaking items, etc.) back into one concise summary
sentence under `## Next` that describes the overall release outcome (e.g.,
"Rebrand to AstraMeter and major features/compatibility updates"); keep only
that single bullet and remove the expanded list and nested "### Breaking"
subsection, and ensure future follow-ups update that same `## Next` bullet
instead of appending new bullets per the repository changelog guidelines.
src/astrameter/powermeter/tasmota.py-43-59 (1)
43-59: ⚠️ Potential issue | 🟡 Minor

Validate direct-mode labels too.

JSON_POWER_CALCULATE=False still accepts an empty JSON_POWER_MQTT_LABEL, and that turns into a late KeyError at Line 90 because the normalized label list becomes [""]. This should fail fast in the constructor, just like the calculate-mode validation does.
💡 Suggested fix
         self.json_power_calculate = json_power_calculate
+        if not json_power_calculate and (
+            not self.json_power_mqtt_labels
+            or any(not label.strip() for label in self.json_power_mqtt_labels)
+        ):
+            raise ValueError(
+                "JSON_POWER_MQTT_LABEL entries cannot be empty when "
+                "JSON_POWER_CALCULATE is disabled"
+            )
         if json_power_calculate:
             if len(self.json_power_input_mqtt_labels) != len(
                 self.json_power_output_mqtt_labels
             ):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/powermeter/tasmota.py` around lines 43 - 59, The constructor
currently only validates MQTT label lists when json_power_calculate is True, but
when json_power_calculate is False empty JSON_POWER_MQTT_LABEL entries slip
through and later cause a KeyError; add a fast-fail check for direct-mode labels
by validating self.json_power_mqtt_labels (the normalized list produced from
JSON_POWER_MQTT_LABEL) when json_power_calculate is False and raise a ValueError
if any label is empty/whitespace (similar message style to the existing checks
that reference JSON_POWER_MQTT_LABEL entries cannot be empty).
src/astrameter/ct002/ct002.py-262-279 (1)
262-279: ⚠️ Potential issue | 🟡 Minor

Use the injected clock consistently for CT002 timing.

These paths still use time.time() even though the class already accepts clock and passes it into the balancer. With a fake or accelerated clock, dedupe expiry, consumer TTL cleanup, and poll_interval advance on wall time while saturation logic advances on self._clock(), which makes behavior inconsistent and tests nondeterministic. Switching these timestamps to self._clock() keeps the time model coherent.

Also applies to: 297-315, 570-575
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/astrameter/ct002/ct002.py` around lines 262 - 279, Replace direct calls
to time.time() with the injected clock (self._clock()) to keep timing
consistent: in _update_consumer_report (use now = self._clock() before computing
raw_interval, poll_interval EMA, setting consumer.timestamp), and make the same
substitution for the other occurrences referenced (the dedupe expiry logic,
consumer TTL cleanup and the block that advances poll_interval/saturation
logic). Ensure all timestamp reads/writes (consumer.timestamp, now,
raw_interval, TTL checks) use self._clock() so timing across
POLL_INTERVAL_EMA_ALPHA, parse_int-handled power updates, and dedupe/cleanup
logic are coherent.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

tests/test_e2e_probe_lockup.py (1)
41-56: Avoid the free-port race in the harness.

This helper closes the reservation sockets before CT002 and the powermeter bind, so another test process can claim either port in between and make the E2E suite flaky under parallel CI. Prefer letting each server bind port 0 and reading back the assigned port, or keep the reservation sockets open until startup completes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_e2e_probe_lockup.py` around lines 41 - 56, The helper
_find_free_ports currently closes the reservation sockets before the services
(CT002 and the powermeter) bind, creating a race; either change the test harness
to let each server bind to port 0 and read the assigned port, or modify
_find_free_ports to keep reservation sockets open until startup finishes (e.g.,
return both ports and the opened socket objects from _find_free_ports and defer
closing them until after CT002 and the powermeter have bound). Locate
_find_free_ports and implement the chosen approach so the reservation sockets
are not closed prior to the actual service binds.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_balancer_probe_lockup.py`:
- Around line 47-73: The test helper _make_balancer is constructing LoadBalancer
without the shared smoother, causing tests to skip the actual probe reseed path;
update _make_balancer to pass smoother=self._smoother (matching CT002) into the
LoadBalancer constructor so probe commit/reject reseeds the EMA state, and make
the same change to the other test constructors that build LoadBalancer (the
other occurrences that currently couple the smoother out-of-band) so all test
harnesses use the same smoother injection as production.

---

Nitpick comments:
In `@tests/test_e2e_probe_lockup.py`:
- Around line 41-56: The helper _find_free_ports currently closes the
reservation sockets before the services (CT002 and the powermeter) bind,
creating a race; either change the test harness to let each server bind to port
0 and read the assigned port, or modify _find_free_ports to keep reservation
sockets open until startup finishes (e.g., return both ports and the opened
socket objects from _find_free_ports and defer closing them until after CT002
and the powermeter have bound). Locate _find_free_ports and implement the chosen
approach so the reservation sockets are not closed prior to the actual service
binds.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 64d5b536-09a1-41fc-8346-754d5e5f18d2

📥 Commits

Reviewing files that changed from the base of the PR and between 536c95d and 81e6e2c.

📒 Files selected for processing (11)

CHANGELOG.md
src/astrameter/ct002/balancer.py
src/astrameter/ct002/ct002.py
src/astrameter/ct002/smoother.py
src/astrameter/powermeter/homeassistant.py
src/astrameter/powermeter/homeassistant_test.py
src/astrameter/powermeter/homewizard.py
src/astrameter/powermeter/homewizard_test.py
tests/test_balancer_probe_lockup.py
tests/test_e2e_probe_lockup.py
tests/test_smoother.py

🚧 Files skipped from review as they are similar to previous changes (5)

CHANGELOG.md
src/astrameter/ct002/balancer.py
src/astrameter/powermeter/homewizard_test.py
src/astrameter/powermeter/homewizard.py
src/astrameter/ct002/smoother.py

…tion Two review findings verified against current code and fixed. **Finding 1 (real concern): _make_balancer did not inject the smoother, silently skipping the production reseed path.** tests/test_balancer_probe_lockup.py::_make_balancer previously constructed LoadBalancer without the ``smoother=`` kwarg that CT002 passes in production (src/astrameter/ct002/ct002.py:186 in develop). ``test_active_battery_keeps_covering_demand_after_probe_handoff`` then created its own local ``TargetSmoother`` and called ``smoother.update()`` in the _tick helper — but the balancer had ``self._smoother = None``, so the ``if self._smoother is not None`` guard in ``_commit_probe`` / ``_reject_probe`` short-circuited and ``reseed()`` was never called. The test was silently running against the *unpatched* code path. With the fix (smoother wired into ``_make_balancer`` via a new kwarg; the one test that needs it passes its local smoother through), the tick-by-tick trace now shows the reseed firing at t=06 with ``smoothed=None`` and the very next tick re-seeding directly to raw=72 and emitting active_tgt=[0, 72.0, 0] — which is exactly the behaviour the smoother-reseed fix is supposed to produce and which the old test was silently skipping. The two tests in ``TestProbeReseedsSmoother`` already constructed LoadBalancer inline with ``smoother=smoother``, so they needed no change. ``tests/test_balancer.py::_make_balancer`` forwards ``**balancer_kwargs`` and is only used by lifecycle tests that don't touch the probe path, so no change needed there either. **Finding 2 (nitpick): _find_free_ports race window.** The old ``_find_free_ports`` helper bound a reservation socket to port 0, read the assigned port, and *closed the socket* before returning — leaving a TOCTOU window during which a parallel process (especially under pytest-xdist) could grab the port and cause a spurious bind failure when CT002 / the powermeter simulator later tried to bind to it. The fix renames the helper to ``_reserve_free_ports`` and returns ``(port, socket)`` pairs with the reservation sockets still open. ``_Harness`` stores them on ``self._ct_port_sock`` and ``self._http_port_sock`` and closes each one *immediately before* the corresponding service binds, inside ``start()``: self._http_port_sock.close() await self.powermeter.start() # aiohttp TCPSite binds here self._ct_port_sock.close() await self.ct002.start() # create_datagram_endpoint binds here The TOCTOU window is now a single function call rather than the full harness construction time. ``stop()`` also closes any leftover reservation sockets with ``contextlib.suppress(OSError)`` so a test that fails before ``start()`` does not leak fds. Note: ``tests/test_efficiency_e2e.py`` carries the same eager-close ``_find_free_ports`` helper, predates this PR, and is out of scope here — not touching it. Full CI green (437 passing, 17 skipped); ruff format + check clean; mypy clean. https://claude.ai/code/session_01CdgD1253g9MGysj21HmBdS

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_balancer_probe_lockup.py`:
- Around line 250-252: The current assertion is too weak because it passes if
"acd929a74b20" merely appears anywhere in lb._priority; instead assert that
acd929a74b20 is the sole active consumer by checking lb._priority contains
exactly that ID (e.g. lb._priority == {"acd929a74b20"} or equivalent
length+membership checks) and that the other consumer "24215edb1936" is in
lb._deprioritized (or not in lb._priority); update the test assertion(s) around
lb._priority and lb._deprioritized to enforce exclusivity rather than just
presence.
- Around line 246-248: The test is accidentally perturbing the EMA by adding
1e-6 * clock() to the reading (via smoother.update), because _FakeClock starts
at time.time() so that term is ~1700; change the call in
tests/test_balancer_probe_lockup.py so TargetSmoother.update receives the true
zero reading (e.g. 0.0) and only a unique sample_id (keep the (clock(),) tuple),
i.e. remove the added 1e-6 * clock() term so only sample_id varies and the
smoothed value is not contaminated.

In `@tests/test_e2e_probe_lockup.py`:
- Around line 182-191: The start() method can leak a started service if
powermeter.start() succeeds but ct002.start() raises; modify start() (in the
class with self._http_port_sock/self._ct_port_sock) to catch exceptions around
the sequential starts and unwind any services already started before re-raising:
after closing _http_port_sock and awaiting self.powermeter.start(), wrap the
subsequent close/await for self.ct002.start() in try/except, and on exception
call the appropriate shutdown/stop/close method on self.powermeter (e.g.,
self.powermeter.stop()/shutdown()) and await it, then re-raise the original
error; ensure sockets are closed only once and that cleanup is idempotent.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 31713dc2-db75-497b-a7cf-19d47600f522

📥 Commits

Reviewing files that changed from the base of the PR and between 81e6e2c and 73b909a.

📒 Files selected for processing (2)

tests/test_balancer_probe_lockup.py
tests/test_e2e_probe_lockup.py

Three review findings, all verified against the current code before fixing. **Finding 1 (real): assertion was tautologically true.** tests/test_balancer_probe_lockup.py:251 was assert "24215edb1936" in lb._deprioritized or "acd929a74b20" in lb._priority The second half is always true because ``acd929a74b20`` is in ``_priority`` regardless of which slot it's in — the ``_priority`` list holds both active *and* deprioritized consumers; only their position within the list distinguishes them. Replaced with strict exclusivity checks against the actual observed state: assert lb._priority == ["24215edb1936", "acd929a74b20"] assert lb._deprioritized == {"acd929a74b20"} The previous comment ("Confirm the balancer picked acd929a74b20 as the sole active") was also wrong: the balancer's init seeds ``_priority`` from ``sorted(current_pool)`` (`balancer.py:867`), so ``24215edb1936`` (alphabetically first) ends up at slot 0. The new comment explains this. **Finding 2 (real): smoother was being polluted with ~1700 W.** tests/test_balancer_probe_lockup.py:248 was smoother.update(0.0 + 1e-6 * clock(), (clock(),)) The intent was "tiny jitter to force a distinct sample_id", but ``_FakeClock.__init__`` seeds from ``time.time()`` which returns ~1.76e9, so ``1e-6 * clock()`` is ~1760, not near zero. The smoother was running its EMA against a 1760 W raw reading every warm-up tick. Instrumentation print confirmed ``smoother.value ≈ 1614`` after warm-up. Fix: pass only the unique sample_id tuple and keep the ``raw_total`` at true zero. A new sanity assertion verifies the smoother stays at 0.0 across the warm-up so a future regression can't silently re-introduce the pollution. **Finding 3 (real): start() leaked the powermeter on partial failure.** tests/test_e2e_probe_lockup.py::_Harness.start opened the powermeter first, then CT002. If ``ct002.start()`` raised after ``powermeter.start()`` already succeeded, the exception would propagate without tearing the powermeter back down — leaking a listening HTTP server. The sockets also weren't tracked, so ``stop()`` couldn't tell the difference between "never started", "half started", and "fully started". Fix: - Added ``_started_powermeter`` and ``_started_ct002`` lifecycle flags initialized to ``False`` in ``__init__``. - ``start()`` sets each flag after its corresponding ``await .start()`` returns successfully, and wraps the full sequence in ``try / except BaseException`` that tears the powermeter back down on partial failure before re-raising. - ``stop()`` consults the flags and only awaits ``.stop()`` on services that were actually started. Wrapped in ``contextlib.suppress(Exception)`` so repeated stops are idempotent and a stop that fires during exception unwind doesn't mask the original error. Added ``test_harness_start_unwinds_on_partial_failure`` which monkey-patches ``ct002.start`` to raise ``RuntimeError``, asserts the ``RuntimeError`` propagates out of ``h.start()``, and then verifies the powermeter's HTTP port is free (re-bindable) to prove the unwind actually stopped it. Would fail cleanly if a future refactor drops the unwind branch. Full CI green (438 passing, +1 from the new unwind test); ruff format + check clean; mypy clean. https://claude.ai/code/session_01CdgD1253g9MGysj21HmBdS

tomquist changed the base branch from main to develop April 11, 2026 16:57

claude added 3 commits April 11, 2026 16:59

tomquist force-pushed the claude/debug-log-rotation-VboOh branch from 536c95d to 81e6e2c Compare April 11, 2026 17:01

tomquist changed the title ~~Fix Modbus unit id parameter (#191)~~ Fix CT002 efficiency-rotation lockup when push-based powermeter goes silent Apr 11, 2026

coderabbitai Bot reviewed Apr 11, 2026

View reviewed changes

Comment thread tests/test_balancer_probe_lockup.py Outdated

coderabbitai Bot reviewed Apr 11, 2026

View reviewed changes

Comment thread tests/test_balancer_probe_lockup.py Outdated

Comment thread tests/test_balancer_probe_lockup.py Outdated

Comment thread tests/test_e2e_probe_lockup.py Outdated

claude added 2 commits April 11, 2026 18:06

Drop changelog entry for probe-handoff lockup fix

6f348d8

tomquist merged commit 66a781a into develop Apr 11, 2026
13 checks passed

tomquist deleted the claude/debug-log-rotation-VboOh branch April 11, 2026 20:01

This was referenced Apr 22, 2026

Fix balancer lockup when rejecting probes on empty batteries #339

Merged

Exclude DC-only Marstek batteries from charge distribution (#338) #342

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix CT002 efficiency-rotation lockup when push-based powermeter goes silent#320

Fix CT002 efficiency-rotation lockup when push-based powermeter goes silent#320
tomquist merged 6 commits into
developfrom
claude/debug-log-rotation-VboOh

tomquist commented Apr 11, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 11, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tomquist commented Apr 11, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Root cause

Fix: powermeter hardening

Fix: CT002 graceful degradation

Defense in depth: balancer smoother reseed

Tests (437 passing, 17 skipped)

Caveats

Review trail

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tomquist commented Apr 11, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 11, 2026 •

edited

Loading