Skip to content

fix: smart-strategy resilience for refresh-status 5xx (counters, cache, recovery)#295

Draft
nledenyi wants to merge 1 commit into
pytoyoda:mainfrom
nledenyi:fix/wake-post-500-auto-disable
Draft

fix: smart-strategy resilience for refresh-status 5xx (counters, cache, recovery)#295
nledenyi wants to merge 1 commit into
pytoyoda:mainfrom
nledenyi:fix/wake-post-500-auto-disable

Conversation

@nledenyi
Copy link
Copy Markdown
Member

@nledenyi nledenyi commented Apr 29, 2026

Addresses #293

Motivation

Some vehicles' Toyota gateways return HTTP 500 on POST /v1/global/remote/refresh-status (the wake POST introduced in #286). pytoyoda's controller correctly retries 5xx with exponential backoff and raises ToyotaApiError after the four attempts exhaust. _execute_post_then_get() called vehicle.refresh_status() without wrapping it in a try/except, with three consequences that share a single root cause - an exception that should be a Layer 1 failure signal was instead an integration-level fault.

Symptom A: auto-disable never fires. consecutive_post_rejections is advanced by on_post_layer1_failure() at line 411, but that call is gated by return_code != "000000" - reachable only when the POST returns 200 with a non-000000 application-level rejection. A 5xx exception path skips the entire if return_code != "000000": block, so neither consecutive_post_rejections nor consecutive_failed_wakes ever advances on these vehicles. <alias>_status_refresh_state stays active forever, the user gets log noise on every coordinator cycle.

Symptom B: entities go stale. When _refresh_one_vehicle raises in Phase 2, the post-decision bookkeeping is skipped: trips manager refresh, movement detection, diagnostic state persistence, and - the load-bearing one - the caller's last_good_per_vin[vin] = vehicle_data line that promotes Phase 1's freshly-fetched non-status data (telemetry, location, etc.) to the cache layer. The integration's outer except (ToyotaApiError, ...) catches the exception and serves the prior cycle's cached VehicleData, so entities like device_tracker.<alias>_parking_location show stale values until the user manually disables refresh in the options.

Symptom C: rough recovery semantics. Once auto-disable did fire (via the existing returnCode-rejection branch on cars where that path could trip), the only documented recovery was the user toggling enable_status_refresh OFF then ON. No way to retry without first disabling the feature.

What this changes

Single squashed commit, four behavioural changes in the same code path:

1. Wrap the POST + collapse failure paths

The POST is wrapped in contextlib.suppress for the same exception set the Layer 2 poll loop already catches (ToyotaApiError, httpx.ConnectTimeout, httpcore.ConnectTimeout, asyncio.TimeoutError, httpx.ReadTimeout). On exception, post_response stays None and falls through to the existing Layer 1 failure branch (now treated as return_code != "000000" semantically). The exception path and the application-rejection path share one block of code and one auto-disable threshold check.

2. Bare GET fallback on Layer 1 failure

After recording the Layer 1 failure (and possibly auto-disabling), the integration now fires a bare vehicle.update(only=["status"]) - the same call as the HARD_DISABLED legacy path at line 524. This means /status entities still refresh in the cycles before auto-disable kicks in, and on any vehicle whose POST 500s but whose /status still serves stale-cache data we can read.

3. Service-call bypass for both HARD_DISABLED forms

_hard_disable_decision() now accepts user_service_call_pending and skips both the AUTO and USER hard-disable cases when a service call is pending. Reasoning:

  • HARD_DISABLED_AUTO bypass: the user is explicitly retrying via the Refresh vehicle status button or service. After a successful POST clears the auto-disable flag (point number 4 below), the strategy returns to ACTIVE on the next cycle. Recovery is one button-press instead of two-save toggle dance.
  • HARD_DISABLED_USER bypass: matches HA convention everywhere else - polling toggles stop the cadence, explicit service calls still go through. Users who want a bespoke schedule (geofence arrival, garage-door close, time-of-day) can disable the strategy and drive POSTs from their own automations against refresh_vehicle_status. Today the service is silently a no-op when enable_status_refresh: False, which is a footgun for automation authors.

4. Auto-clear auto_disabled_status_refresh on POST success

A successful POST (return_code == "000000") proves the gateway can process the endpoint, so auto_disabled_status_refresh is unconditionally cleared from entry.options if it was set. Triggers self-recovery from auto-disable for service-call retries and for transient 5xx incidents that clear on their own.

Doc updates

  • const.py:33 docstring for CONF_ENABLE_STATUS_REFRESH: clarifies "stop the automatic cadence" semantics with explicit service calls still going through.
  • services.yaml description for refresh_vehicle_status: documents that the service works regardless of either disable form, supporting fully-manual schedules.

Verification

Unit tests: 31 strategy-level tests pass. Three new tests covering:

  • test_service_call_bypasses_hard_disabled_auto - service call overrides AUTO disable.
  • test_service_call_bypasses_hard_disabled_user - service call overrides USER disable.
  • test_user_disable_blocks_non_service_triggers - without a service call, USER disable still blocks the strategy as before.

Live regression test on a healthy account (mine, RAV4 '19 + AYGO X '22 - neither 500s on /refresh-status):

  • Integration setup completes cleanly (state: loaded).
  • 64 entities, 41 populating, 23 unavailable - identical to pre-fix baseline.
  • Both <alias>_status_refresh_state = active, both <alias>_last_successful_fetch advancing on the polling cadence.
  • No Toyota errors or unexpected warnings in the log on the modified path.

Failure-path validation (a 5xx-prone account observing the fix work end-to-end) requires a tester whose vehicle's gateway actually 500s on /refresh-status.

Backward compatibility

  • No config-flow option added or removed.
  • No manifest changes.
  • Existing behaviour preserved on every code path that already worked: returnCode=000000 (success), returnCode!=000000 (Layer 1 application rejection), Layer 2 timeout, all unchanged.
  • enable_status_refresh: False semantics shift slightly: previously this blocked all POSTs including service calls; now it stops the automatic cadence and leaves explicit service calls alone. Users who relied on the old "lock out everything" behaviour can simply not call the service - the toggle's job is the cadence, not the capability. Documented in the const.py docstring and services.yaml description.
  • Auto-clear of auto_disabled_status_refresh on POST success is new behaviour but strictly a recovery path - any user who would have lived with auto-disable until they manually toggled is also fine if it auto-clears earlier.

Diff: +141 / -22 across __init__.py, refresh_strategy.py, const.py, services.yaml, tests/test_refresh_strategy.py.

@codacy-production
Copy link
Copy Markdown

codacy-production Bot commented Apr 29, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 4 complexity

Metric Results
Complexity 4

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the Toyota integration's status refresh strategy by allowing manual service calls to bypass auto-disable and user-disable flags, ensuring users can always trigger a refresh. It also improves error handling during the POST-then-GET sequence by suppressing specific network exceptions and falling back to a bare GET on failure. Feedback was provided to prevent redundant configuration updates when auto-disabling the refresh strategy and to expand exception handling for the fallback status update to ensure better resilience against connectivity issues.

Comment thread custom_components/toyota/__init__.py Outdated
vin[-6:],
return_code,
)
if should_auto_disable:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The call to async_update_entry should be guarded by a check to see if CONF_AUTO_DISABLED_STATUS_REFRESH is already True. Since on_post_layer1_failure returns True for every failure once the threshold is reached, a persistent failure (e.g., during a service call retry) could trigger redundant entry updates and reloads on every coordinator cycle.

Suggested change
if should_auto_disable:
if should_auto_disable and not entry.options.get(CONF_AUTO_DISABLED_STATUS_REFRESH, False):

Comment thread custom_components/toyota/__init__.py Outdated
# for cycles before auto-disable kicks in, and for any vehicle
# whose POST 500s but whose /status still serves stale-cache
# data we can read.
with contextlib.suppress(ToyotaApiError, httpx.ReadTimeout):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The exception suppression list for the fallback GET is narrower than the one used for the POST call at line 412. For better resilience, especially when the gateway is experiencing connectivity issues or timeouts, this block should also suppress httpx.ConnectTimeout, httpcore.ConnectTimeout, and asyncioexceptions.TimeoutError. This ensures that a failure in the fallback status fetch doesn't prevent the rest of the vehicle data (like statistics) from being updated in the same cycle.

Suggested change
with contextlib.suppress(ToyotaApiError, httpx.ReadTimeout):
with contextlib.suppress(
ToyotaApiError,
httpx.ConnectTimeout,
httpcore.ConnectTimeout,
asyncioexceptions.TimeoutError,
httpx.ReadTimeout,
):

…e, recovery)

When the Toyota gateway returns persistent HTTP 500 on
POST /refresh-status (some Lexus / Aygo / Yaris vehicles per
ha_toyota#291 + ha_toyota#293), pytoyoda's controller exhausts its
4-attempt retry sequence and raises ToyotaApiError. The previous
implementation called vehicle.refresh_status() without a try/except,
which meant the exception propagated out of _refresh_one_vehicle
with three consequences:

1. on_post_layer1_failure() never ran (it sits inside the
   `return_code != "000000"` branch, which a raised POST never
   reaches), so consecutive_post_rejections never advanced and the
   _AUTO_DISABLE_REJECTION_THRESHOLD soft/hard-disable mechanism
   never fired. status_refresh_state stayed `active` forever.

2. _refresh_one_vehicle's post-decision bookkeeping was skipped:
   trips manager refresh, movement detection, diag state persistence,
   and the caller's `last_good_per_vin[vin] = vehicle_data` line.
   Phase 1 had already fetched fresh telemetry/location/etc., but
   that fresh data was never promoted to the cache layer; entities
   served from the prior cycle's cached VehicleData. Reported as
   parking location frozen at home, lock state stale, etc.

3. Once auto-disable did fire (via the existing returnCode-rejection
   branch on cars where that path could trip), the only documented
   recovery was the user toggling enable_status_refresh OFF then ON.
   No way to retry without disabling the feature first.

This commit:

- Wraps the POST in contextlib.suppress for the same exception set
  the Layer 2 poll loop already catches. Collapses the exception
  path and the non-"000000" returnCode path into a single Layer 1
  failure branch.
- Adds a bare GET fallback (`vehicle.update(only=["status"])`) on
  Layer 1 failure so /status entities still refresh this cycle,
  even before auto-disable kicks in.
- Lets explicit service calls bypass BOTH HARD_DISABLED_AUTO and
  HARD_DISABLED_USER. Matches the HA convention that polling
  toggles stop the cadence but explicit invocations still go
  through; users can disable the automatic strategy and drive POSTs
  from their own automations (geofence, garage door, time-of-day).
- After a successful POST clears auto_disabled_status_refresh,
  the strategy goes back to ACTIVE on the next cycle without manual
  toggling. Users can now recover from auto-disable by simply
  pressing the refresh button instead of toggling options OFF/ON.
- Three new tests in test_refresh_strategy.py covering the
  service-call bypass behaviour for both AUTO and USER disable +
  blocking when no service call is pending.
- Updates const.py docstring for CONF_ENABLE_STATUS_REFRESH and
  services.yaml description for refresh_vehicle_status to reflect
  the cadence-vs-capability distinction.

All 31 existing tests still pass; ruff clean.

Closes ha_toyota#293.
@nledenyi nledenyi force-pushed the fix/wake-post-500-auto-disable branch from 5c0682a to c51896e Compare April 29, 2026 20:51
@nledenyi
Copy link
Copy Markdown
Member Author

Pushed c51896e addressing both Gemini medium-priority suggestions:

  1. Guard async_update_entry against redundant reloads - if should_auto_disable and not entry.options.get(CONF_AUTO_DISABLED_STATUS_REFRESH, False): Prevents the service-call-retry-that-still-500s path from re-persisting the already-set flag and triggering a redundant listener-driven reload every cycle.
  2. Widened bare-GET fallback's exception suppression to match the POST's set: now suppresses ToyotaApiError, httpx.ConnectTimeout, httpcore.ConnectTimeout, asyncio.TimeoutError, httpx.ReadTimeout. Protects _refresh_one_vehicle's bookkeeping when the gateway is shaky enough that even the fallback times out.

All 31 existing tests still pass; ruff clean. Deployed locally via the combined-with-recent-trips branch alongside, no regressions on a healthy-account regression test (both vehicles populating, no errors, recent-trips intact).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant