fix: smart-strategy resilience for refresh-status 5xx (counters, cache, recovery)#295
fix: smart-strategy resilience for refresh-status 5xx (counters, cache, recovery)#295nledenyi wants to merge 1 commit into
Conversation
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 4 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Code Review
This pull request enhances the Toyota integration's status refresh strategy by allowing manual service calls to bypass auto-disable and user-disable flags, ensuring users can always trigger a refresh. It also improves error handling during the POST-then-GET sequence by suppressing specific network exceptions and falling back to a bare GET on failure. Feedback was provided to prevent redundant configuration updates when auto-disabling the refresh strategy and to expand exception handling for the fallback status update to ensure better resilience against connectivity issues.
| vin[-6:], | ||
| return_code, | ||
| ) | ||
| if should_auto_disable: |
There was a problem hiding this comment.
The call to async_update_entry should be guarded by a check to see if CONF_AUTO_DISABLED_STATUS_REFRESH is already True. Since on_post_layer1_failure returns True for every failure once the threshold is reached, a persistent failure (e.g., during a service call retry) could trigger redundant entry updates and reloads on every coordinator cycle.
| if should_auto_disable: | |
| if should_auto_disable and not entry.options.get(CONF_AUTO_DISABLED_STATUS_REFRESH, False): |
| # for cycles before auto-disable kicks in, and for any vehicle | ||
| # whose POST 500s but whose /status still serves stale-cache | ||
| # data we can read. | ||
| with contextlib.suppress(ToyotaApiError, httpx.ReadTimeout): |
There was a problem hiding this comment.
The exception suppression list for the fallback GET is narrower than the one used for the POST call at line 412. For better resilience, especially when the gateway is experiencing connectivity issues or timeouts, this block should also suppress httpx.ConnectTimeout, httpcore.ConnectTimeout, and asyncioexceptions.TimeoutError. This ensures that a failure in the fallback status fetch doesn't prevent the rest of the vehicle data (like statistics) from being updated in the same cycle.
| with contextlib.suppress(ToyotaApiError, httpx.ReadTimeout): | |
| with contextlib.suppress( | |
| ToyotaApiError, | |
| httpx.ConnectTimeout, | |
| httpcore.ConnectTimeout, | |
| asyncioexceptions.TimeoutError, | |
| httpx.ReadTimeout, | |
| ): |
…e, recovery) When the Toyota gateway returns persistent HTTP 500 on POST /refresh-status (some Lexus / Aygo / Yaris vehicles per ha_toyota#291 + ha_toyota#293), pytoyoda's controller exhausts its 4-attempt retry sequence and raises ToyotaApiError. The previous implementation called vehicle.refresh_status() without a try/except, which meant the exception propagated out of _refresh_one_vehicle with three consequences: 1. on_post_layer1_failure() never ran (it sits inside the `return_code != "000000"` branch, which a raised POST never reaches), so consecutive_post_rejections never advanced and the _AUTO_DISABLE_REJECTION_THRESHOLD soft/hard-disable mechanism never fired. status_refresh_state stayed `active` forever. 2. _refresh_one_vehicle's post-decision bookkeeping was skipped: trips manager refresh, movement detection, diag state persistence, and the caller's `last_good_per_vin[vin] = vehicle_data` line. Phase 1 had already fetched fresh telemetry/location/etc., but that fresh data was never promoted to the cache layer; entities served from the prior cycle's cached VehicleData. Reported as parking location frozen at home, lock state stale, etc. 3. Once auto-disable did fire (via the existing returnCode-rejection branch on cars where that path could trip), the only documented recovery was the user toggling enable_status_refresh OFF then ON. No way to retry without disabling the feature first. This commit: - Wraps the POST in contextlib.suppress for the same exception set the Layer 2 poll loop already catches. Collapses the exception path and the non-"000000" returnCode path into a single Layer 1 failure branch. - Adds a bare GET fallback (`vehicle.update(only=["status"])`) on Layer 1 failure so /status entities still refresh this cycle, even before auto-disable kicks in. - Lets explicit service calls bypass BOTH HARD_DISABLED_AUTO and HARD_DISABLED_USER. Matches the HA convention that polling toggles stop the cadence but explicit invocations still go through; users can disable the automatic strategy and drive POSTs from their own automations (geofence, garage door, time-of-day). - After a successful POST clears auto_disabled_status_refresh, the strategy goes back to ACTIVE on the next cycle without manual toggling. Users can now recover from auto-disable by simply pressing the refresh button instead of toggling options OFF/ON. - Three new tests in test_refresh_strategy.py covering the service-call bypass behaviour for both AUTO and USER disable + blocking when no service call is pending. - Updates const.py docstring for CONF_ENABLE_STATUS_REFRESH and services.yaml description for refresh_vehicle_status to reflect the cadence-vs-capability distinction. All 31 existing tests still pass; ruff clean. Closes ha_toyota#293.
5c0682a to
c51896e
Compare
|
Pushed
All 31 existing tests still pass; ruff clean. Deployed locally via the combined-with-recent-trips branch alongside, no regressions on a healthy-account regression test (both vehicles populating, no errors, recent-trips intact). |
Addresses #293
Motivation
Some vehicles' Toyota gateways return HTTP 500 on
POST /v1/global/remote/refresh-status(the wake POST introduced in #286). pytoyoda's controller correctly retries 5xx with exponential backoff and raisesToyotaApiErrorafter the four attempts exhaust._execute_post_then_get()calledvehicle.refresh_status()without wrapping it in a try/except, with three consequences that share a single root cause - an exception that should be a Layer 1 failure signal was instead an integration-level fault.Symptom A: auto-disable never fires.
consecutive_post_rejectionsis advanced byon_post_layer1_failure()at line 411, but that call is gated byreturn_code != "000000"- reachable only when the POST returns 200 with a non-000000application-level rejection. A 5xx exception path skips the entireif return_code != "000000":block, so neitherconsecutive_post_rejectionsnorconsecutive_failed_wakesever advances on these vehicles.<alias>_status_refresh_statestaysactiveforever, the user gets log noise on every coordinator cycle.Symptom B: entities go stale. When
_refresh_one_vehicleraises in Phase 2, the post-decision bookkeeping is skipped: trips manager refresh, movement detection, diagnostic state persistence, and - the load-bearing one - the caller'slast_good_per_vin[vin] = vehicle_dataline that promotes Phase 1's freshly-fetched non-status data (telemetry, location, etc.) to the cache layer. The integration's outerexcept (ToyotaApiError, ...)catches the exception and serves the prior cycle's cachedVehicleData, so entities likedevice_tracker.<alias>_parking_locationshow stale values until the user manually disables refresh in the options.Symptom C: rough recovery semantics. Once auto-disable did fire (via the existing returnCode-rejection branch on cars where that path could trip), the only documented recovery was the user toggling
enable_status_refreshOFF then ON. No way to retry without first disabling the feature.What this changes
Single squashed commit, four behavioural changes in the same code path:
1. Wrap the POST + collapse failure paths
The POST is wrapped in
contextlib.suppressfor the same exception set the Layer 2 poll loop already catches (ToyotaApiError,httpx.ConnectTimeout,httpcore.ConnectTimeout,asyncio.TimeoutError,httpx.ReadTimeout). On exception,post_responsestaysNoneand falls through to the existing Layer 1 failure branch (now treated asreturn_code != "000000"semantically). The exception path and the application-rejection path share one block of code and one auto-disable threshold check.2. Bare GET fallback on Layer 1 failure
After recording the Layer 1 failure (and possibly auto-disabling), the integration now fires a bare
vehicle.update(only=["status"])- the same call as theHARD_DISABLEDlegacy path at line 524. This means/statusentities still refresh in the cycles before auto-disable kicks in, and on any vehicle whose POST 500s but whose/statusstill serves stale-cache data we can read.3. Service-call bypass for both HARD_DISABLED forms
_hard_disable_decision()now acceptsuser_service_call_pendingand skips both the AUTO and USER hard-disable cases when a service call is pending. Reasoning:Refresh vehicle statusbutton or service. After a successful POST clears the auto-disable flag (point number 4 below), the strategy returns to ACTIVE on the next cycle. Recovery is one button-press instead of two-save toggle dance.refresh_vehicle_status. Today the service is silently a no-op whenenable_status_refresh: False, which is a footgun for automation authors.4. Auto-clear
auto_disabled_status_refreshon POST successA successful POST (
return_code == "000000") proves the gateway can process the endpoint, soauto_disabled_status_refreshis unconditionally cleared fromentry.optionsif it was set. Triggers self-recovery from auto-disable for service-call retries and for transient 5xx incidents that clear on their own.Doc updates
const.py:33docstring forCONF_ENABLE_STATUS_REFRESH: clarifies "stop the automatic cadence" semantics with explicit service calls still going through.services.yamldescription forrefresh_vehicle_status: documents that the service works regardless of either disable form, supporting fully-manual schedules.Verification
Unit tests: 31 strategy-level tests pass. Three new tests covering:
test_service_call_bypasses_hard_disabled_auto- service call overrides AUTO disable.test_service_call_bypasses_hard_disabled_user- service call overrides USER disable.test_user_disable_blocks_non_service_triggers- without a service call, USER disable still blocks the strategy as before.Live regression test on a healthy account (mine, RAV4 '19 + AYGO X '22 - neither 500s on
/refresh-status):state: loaded).<alias>_status_refresh_state = active, both<alias>_last_successful_fetchadvancing on the polling cadence.Failure-path validation (a 5xx-prone account observing the fix work end-to-end) requires a tester whose vehicle's gateway actually 500s on
/refresh-status.Backward compatibility
000000(success), returnCode!=000000(Layer 1 application rejection), Layer 2 timeout, all unchanged.enable_status_refresh: Falsesemantics shift slightly: previously this blocked all POSTs including service calls; now it stops the automatic cadence and leaves explicit service calls alone. Users who relied on the old "lock out everything" behaviour can simply not call the service - the toggle's job is the cadence, not the capability. Documented in the const.py docstring and services.yaml description.auto_disabled_status_refreshon POST success is new behaviour but strictly a recovery path - any user who would have lived with auto-disable until they manually toggled is also fine if it auto-clears earlier.Diff:
+141 / -22across__init__.py,refresh_strategy.py,const.py,services.yaml,tests/test_refresh_strategy.py.