fix: smart-strategy resilience for refresh-status 5xx (counters, cache, recovery) by nledenyi · Pull Request #295 · pytoyoda/ha_toyota

nledenyi · 2026-04-29T20:30:18Z

Addresses #293

Motivation

Some vehicles' Toyota gateways return HTTP 500 on POST /v1/global/remote/refresh-status (the wake POST introduced in #286). pytoyoda's controller correctly retries 5xx with exponential backoff and raises ToyotaApiError after the four attempts exhaust. _execute_post_then_get() called vehicle.refresh_status() without wrapping it in a try/except, with three consequences that share a single root cause - an exception that should be a Layer 1 failure signal was instead an integration-level fault.

Symptom A: auto-disable never fires. consecutive_post_rejections is advanced by on_post_layer1_failure() at line 411, but that call is gated by return_code != "000000" - reachable only when the POST returns 200 with a non-000000 application-level rejection. A 5xx exception path skips the entire if return_code != "000000": block, so neither consecutive_post_rejections nor consecutive_failed_wakes ever advances on these vehicles. <alias>_status_refresh_state stays active forever, and the user gets log noise on every coordinator cycle.

Symptom B: entities go stale. When _refresh_one_vehicle raises in Phase 2, the post-decision bookkeeping is skipped: trips manager refresh, movement detection, diagnostic state persistence, and - the load-bearing one - the caller's last_good_per_vin[vin] = vehicle_data line that promotes Phase 1's freshly-fetched non-status data (telemetry, location, etc.) to the cache layer. The integration's outer except (ToyotaApiError, ...) catches the exception and serves the prior cycle's cached VehicleData, so entities like device_tracker.<alias>_parking_location show stale values until the user manually disables refresh in the options.
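The cache-promotion failure in Symptom B can be illustrated with a minimal sketch. Only the `last_good_per_vin` name comes from the description above; `run_cycle`, the `phase2` callables and the stand-in `ToyotaApiError` class are assumptions made for illustration, not the integration's code.

```python
class ToyotaApiError(Exception):
    """Stand-in for pytoyoda's exception so the sketch runs on its own."""


def run_cycle(last_good_per_vin, vin, fresh_data, phase2):
    """Return the data that entities would be served for this cycle."""
    try:
        phase2()  # Phase 2: the wake POST and status poll
        # Promotion is only reached when Phase 2 returns normally.
        last_good_per_vin[vin] = fresh_data
    except ToyotaApiError:
        # Outer handler: keep serving whatever the cache already holds.
        pass
    return last_good_per_vin[vin]


def raising_phase2():
    # Pre-fix behaviour: the POST failure escapes Phase 2.
    raise ToyotaApiError("HTTP 500 on refresh-status")


def swallowing_phase2():
    # Post-fix behaviour: the POST failure is absorbed inside Phase 2.
    return None
```

With a raising Phase 2 the cache keeps the prior cycle's value; once the failure is absorbed, the fresh Phase 1 data is promoted.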

Symptom C: rough recovery semantics. Once auto-disable did fire (via the existing returnCode-rejection branch on cars where that path could trip), the only documented recovery was the user toggling enable_status_refresh OFF then ON. No way to retry without first disabling the feature.

What this changes

Single squashed commit, four behavioural changes in the same code path:

1. Wrap the POST + collapse failure paths

The POST is wrapped in contextlib.suppress for the same exception set the Layer 2 poll loop already catches (ToyotaApiError, httpx.ConnectTimeout, httpcore.ConnectTimeout, asyncio.TimeoutError, httpx.ReadTimeout). On exception, post_response stays None and falls through to the existing Layer 1 failure branch (now treated as return_code != "000000" semantically). The exception path and the application-rejection path share one block of code and one auto-disable threshold check.
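A minimal sketch of the wrap-and-collapse idea, under stated assumptions: `ToyotaApiError` is redefined locally as a stand-in, the suppressed tuple is shortened to what the standard library can express (the real set also lists the httpx and httpcore timeouts), and `wake_post` and `is_layer1_failure` are hypothetical helper names. The POST is also assumed to hand back the return code directly, which simplifies whatever response object the real call returns.

```python
import asyncio
import contextlib


class ToyotaApiError(Exception):
    """Stand-in for pytoyoda's exception so the sketch runs on its own."""


# Shortened stand-in for the real exception set.
POST_FAILURES = (ToyotaApiError, asyncio.TimeoutError)


async def wake_post(refresh_status):
    """Return the POST's return code, or None if the POST raised."""
    post_response = None
    with contextlib.suppress(*POST_FAILURES):
        post_response = await refresh_status()
    return post_response


def is_layer1_failure(return_code):
    # None (exception path) and any non-"000000" code (application
    # rejection) land in the same failure branch.
    return return_code != "000000"
```

Because `None != "000000"` is true, the exception path needs no branch of its own.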

2. Bare GET fallback on Layer 1 failure

After recording the Layer 1 failure (and possibly auto-disabling), the integration now fires a bare vehicle.update(only=["status"]) - the same call as the HARD_DISABLED legacy path at line 524. This means /status entities still refresh in the cycles before auto-disable kicks in, and on any vehicle whose POST returns 500 but whose /status endpoint still serves readable stale-cache data.
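The fallback ordering can be sketched as follows. `FakeVehicle` and `post_then_get` are hypothetical names, `ToyotaApiError` is a local stand-in, and `refresh_status()` is assumed to return the return code directly; the failure bookkeeping itself is omitted.

```python
import asyncio
import contextlib


class ToyotaApiError(Exception):
    """Stand-in for pytoyoda's exception so the sketch runs on its own."""


class FakeVehicle:
    """Hypothetical vehicle whose wake POST always fails with a 5xx."""

    def __init__(self):
        self.calls = []

    async def refresh_status(self):
        self.calls.append("POST refresh-status")
        raise ToyotaApiError("HTTP 500")

    async def update(self, only):
        self.calls.append(f"GET {only[0]}")


async def post_then_get(vehicle):
    """Return True if the wake POST succeeded, False if the fallback ran."""
    return_code = None
    with contextlib.suppress(ToyotaApiError):
        return_code = await vehicle.refresh_status()
    if return_code != "000000":
        # Layer 1 failure: counters would be recorded here, then a bare
        # GET is attempted so /status data can still refresh this cycle.
        with contextlib.suppress(ToyotaApiError):
            await vehicle.update(only=["status"])
        return False
    return True
```

The second `suppress` keeps a failing fallback from aborting the rest of the cycle.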

3. Service-call bypass for both HARD_DISABLED forms

_hard_disable_decision() now accepts user_service_call_pending and skips both the AUTO and USER hard-disable cases when a service call is pending. Reasoning:

HARD_DISABLED_AUTO bypass: the user is explicitly retrying via the Refresh vehicle status button or service. After a successful POST clears the auto-disable flag (point number 4 below), the strategy returns to ACTIVE on the next cycle. Recovery is one button-press instead of a two-save toggle dance.
HARD_DISABLED_USER bypass: matches HA convention everywhere else - polling toggles stop the cadence, explicit service calls still go through. Users who want a bespoke schedule (geofence arrival, garage-door close, time-of-day) can disable the strategy and drive POSTs from their own automations against refresh_vehicle_status. Today the service is silently a no-op when enable_status_refresh: False, which is a footgun for automation authors.
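A rough sketch of the bypass rule, not the actual `_hard_disable_decision()` body: the function signature, the `Decision` enum and the check order are assumptions made for illustration; only the `user_service_call_pending` parameter and the two disable names come from the description.

```python
from enum import Enum


class Decision(Enum):
    HARD_DISABLED_USER = "hard_disabled_user"
    HARD_DISABLED_AUTO = "hard_disabled_auto"


def hard_disable_decision(enabled, auto_disabled, user_service_call_pending=False):
    """Return a hard-disable decision, or None to let the strategy proceed."""
    if user_service_call_pending:
        # An explicit service call overrides both disable forms.
        return None
    if not enabled:
        return Decision.HARD_DISABLED_USER
    if auto_disabled:
        return Decision.HARD_DISABLED_AUTO
    return None
```

Checking the pending service call first keeps the automatic cadence blocked while leaving the capability reachable from automations.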

4. Auto-clear `auto_disabled_status_refresh` on POST success

A successful POST (return_code == "000000") proves the gateway can process the endpoint, so auto_disabled_status_refresh is unconditionally cleared from entry.options if it was set. Triggers self-recovery from auto-disable for service-call retries and for transient 5xx incidents that clear on their own.
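The auto-clear can be pictured as a pure function over the options mapping. This is a sketch under assumptions: `options_after_post` is a hypothetical helper, and whether the real code removes the key or sets it to False is not shown in this description; removal is assumed here.

```python
AUTO_DISABLED_KEY = "auto_disabled_status_refresh"


def options_after_post(options, return_code):
    """Return the options to persist after a POST with this return code."""
    if return_code == "000000" and options.get(AUTO_DISABLED_KEY, False):
        # Success proves the gateway can process the endpoint: drop the flag.
        return {k: v for k, v in options.items() if k != AUTO_DISABLED_KEY}
    # Nothing to clear: hand back the same mapping so no write is needed.
    return options
```

Returning the original mapping unchanged lets a caller skip the entry update when the flag was never set.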

Doc updates

const.py:33 docstring for CONF_ENABLE_STATUS_REFRESH: clarifies "stop the automatic cadence" semantics with explicit service calls still going through.
services.yaml description for refresh_vehicle_status: documents that the service works regardless of either disable form, supporting fully-manual schedules.

Verification

Unit tests: 31 strategy-level tests pass. Three new tests covering:

test_service_call_bypasses_hard_disabled_auto - service call overrides AUTO disable.
test_service_call_bypasses_hard_disabled_user - service call overrides USER disable.
test_user_disable_blocks_non_service_triggers - without a service call, USER disable still blocks the strategy as before.

Live regression test on a healthy account (mine, RAV4 '19 + AYGO X '22 - neither 500s on /refresh-status):

Integration setup completes cleanly (state: loaded).
64 entities, 41 populating, 23 unavailable - identical to pre-fix baseline.
Both <alias>_status_refresh_state = active, both <alias>_last_successful_fetch advancing on the polling cadence.
No Toyota errors or unexpected warnings in the log on the modified path.

Failure-path validation (a 5xx-prone account observing the fix work end-to-end) requires a tester whose vehicle's gateway actually 500s on /refresh-status.

Backward compatibility

No config-flow option added or removed.
No manifest changes.
Existing behaviour preserved on every code path that already worked: returnCode=000000 (success), returnCode!=000000 (Layer 1 application rejection), Layer 2 timeout, all unchanged.
enable_status_refresh: False semantics shift slightly: previously this blocked all POSTs including service calls; now it stops the automatic cadence and leaves explicit service calls alone. Users who relied on the old "lock out everything" behaviour can simply not call the service - the toggle's job is the cadence, not the capability. Documented in the const.py docstring and services.yaml description.
Auto-clear of auto_disabled_status_refresh on POST success is new behaviour but strictly a recovery path - any user who would have lived with auto-disable until they manually toggled is also fine if it auto-clears earlier.

Diff: +141 / -22 across __init__.py, refresh_strategy.py, const.py, services.yaml, tests/test_refresh_strategy.py.

codacy-production · 2026-04-29T20:31:21Z

Up to standards ✅

🟢 Issues: 0 new issues

🟢 Metrics: complexity 4


gemini-code-assist

Code Review

This pull request enhances the Toyota integration's status refresh strategy by allowing manual service calls to bypass auto-disable and user-disable flags, ensuring users can always trigger a refresh. It also improves error handling during the POST-then-GET sequence by suppressing specific network exceptions and falling back to a bare GET on failure. Feedback was provided to prevent redundant configuration updates when auto-disabling the refresh strategy and to expand exception handling for the fallback status update to ensure better resilience against connectivity issues.

gemini-code-assist · 2026-04-29T20:31:47Z

+                    vin[-6:],
+                    return_code,
+                )
            if should_auto_disable:


The call to async_update_entry should be guarded by a check to see if CONF_AUTO_DISABLED_STATUS_REFRESH is already True. Since on_post_layer1_failure returns True for every failure once the threshold is reached, a persistent failure (e.g., during a service call retry) could trigger redundant entry updates and reloads on every coordinator cycle.

Suggested change

-            if should_auto_disable:
+            if should_auto_disable and not entry.options.get(CONF_AUTO_DISABLED_STATUS_REFRESH, False):

gemini-code-assist · 2026-04-29T20:31:48Z

+            # for cycles before auto-disable kicks in, and for any vehicle
+            # whose POST 500s but whose /status still serves stale-cache
+            # data we can read.
+            with contextlib.suppress(ToyotaApiError, httpx.ReadTimeout):


The exception suppression list for the fallback GET is narrower than the one used for the POST call at line 412. For better resilience, especially when the gateway is experiencing connectivity issues or timeouts, this block should also suppress httpx.ConnectTimeout, httpcore.ConnectTimeout, and asyncioexceptions.TimeoutError. This ensures that a failure in the fallback status fetch doesn't prevent the rest of the vehicle data (like statistics) from being updated in the same cycle.

Suggested change

-            with contextlib.suppress(ToyotaApiError, httpx.ReadTimeout):
+            with contextlib.suppress(
+                ToyotaApiError,
+                httpx.ConnectTimeout,
+                httpcore.ConnectTimeout,
+                asyncioexceptions.TimeoutError,
+                httpx.ReadTimeout,
+            ):

…e, recovery)

When the Toyota gateway returns persistent HTTP 500 on POST /refresh-status (some Lexus / Aygo / Yaris vehicles per ha_toyota#291 + ha_toyota#293), pytoyoda's controller exhausts its 4-attempt retry sequence and raises ToyotaApiError. The previous implementation called vehicle.refresh_status() without a try/except, which meant the exception propagated out of _refresh_one_vehicle with three consequences:

1. on_post_layer1_failure() never ran (it sits inside the `return_code != "000000"` branch, which a raised POST never reaches), so consecutive_post_rejections never advanced and the _AUTO_DISABLE_REJECTION_THRESHOLD soft/hard-disable mechanism never fired. status_refresh_state stayed `active` forever.
2. _refresh_one_vehicle's post-decision bookkeeping was skipped: trips manager refresh, movement detection, diag state persistence, and the caller's `last_good_per_vin[vin] = vehicle_data` line. Phase 1 had already fetched fresh telemetry/location/etc., but that fresh data was never promoted to the cache layer; entities served from the prior cycle's cached VehicleData. Reported as parking location frozen at home, lock state stale, etc.
3. Once auto-disable did fire (via the existing returnCode-rejection branch on cars where that path could trip), the only documented recovery was the user toggling enable_status_refresh OFF then ON. No way to retry without disabling the feature first.

This commit:

- Wraps the POST in contextlib.suppress for the same exception set the Layer 2 poll loop already catches. Collapses the exception path and the non-"000000" returnCode path into a single Layer 1 failure branch.
- Adds a bare GET fallback (`vehicle.update(only=["status"])`) on Layer 1 failure so /status entities still refresh this cycle, even before auto-disable kicks in.
- Lets explicit service calls bypass BOTH HARD_DISABLED_AUTO and HARD_DISABLED_USER. Matches the HA convention that polling toggles stop the cadence but explicit invocations still go through; users can disable the automatic strategy and drive POSTs from their own automations (geofence, garage door, time-of-day).
- After a successful POST clears auto_disabled_status_refresh, the strategy goes back to ACTIVE on the next cycle without manual toggling. Users can now recover from auto-disable by simply pressing the refresh button instead of toggling options OFF/ON.
- Three new tests in test_refresh_strategy.py covering the service-call bypass behaviour for both AUTO and USER disable + blocking when no service call is pending.
- Updates const.py docstring for CONF_ENABLE_STATUS_REFRESH and services.yaml description for refresh_vehicle_status to reflect the cadence-vs-capability distinction.

All 31 existing tests still pass; ruff clean. Closes ha_toyota#293.

nledenyi · 2026-04-29T20:55:03Z

Pushed c51896e addressing both Gemini medium-priority suggestions:

Guard async_update_entry against redundant reloads - if should_auto_disable and not entry.options.get(CONF_AUTO_DISABLED_STATUS_REFRESH, False): Prevents the service-call-retry-that-still-500s path from re-persisting the already-set flag and triggering a redundant listener-driven reload every cycle.
Widened bare-GET fallback's exception suppression to match the POST's set: now suppresses ToyotaApiError, httpx.ConnectTimeout, httpcore.ConnectTimeout, asyncio.TimeoutError, httpx.ReadTimeout. Protects _refresh_one_vehicle's bookkeeping when the gateway is shaky enough that even the fallback times out.
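The effect of the guard can be shown with a small sketch. `persist_auto_disable` and the `update_entry` callback are hypothetical stand-ins for the real entry-update call, and the options are modelled as a plain dict.

```python
AUTO_DISABLED_KEY = "auto_disabled_status_refresh"


def persist_auto_disable(options, should_auto_disable, update_entry):
    """Persist the auto-disable flag once; return True if a write happened."""
    if should_auto_disable and not options.get(AUTO_DISABLED_KEY, False):
        update_entry({**options, AUTO_DISABLED_KEY: True})
        return True
    # Flag already set (or no disable requested): skip the write, which in
    # the integration would otherwise trigger another reload.
    return False
```

Across repeated failing cycles only the first call writes; later calls see the flag and return without touching the entry.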

All 31 existing tests still pass; ruff clean. Deployed locally via the combined-with-recent-trips branch; a regression test on a healthy account showed no regressions (both vehicles populating, no errors, recent-trips intact).

gemini-code-assist Bot reviewed Apr 29, 2026


nledenyi force-pushed the fix/wake-post-500-auto-disable branch from 5c0682a to c51896e Compare April 29, 2026 20:51

nledenyi wants to merge 1 commit into pytoyoda:main from nledenyi:fix/wake-post-500-auto-disable


