Entities become unavailable on transient API failures due to retry logic and UpdateFailed propagation #146

@keydoh

Hi, I've been having problems with the entities becoming unavailable, so I've used Claude to write this issue report.

Describe the bug

Entities managed by the integration intermittently become unavailable during normal operation. The root cause is three related issues in the retry and error handling logic that cause transient API failures to propagate all the way to Home Assistant's entity availability state.

Root cause analysis

Issue 1: _MAX_RETRY_TIME_SECONDS is too short to allow any retries on timeout

In api/solis_api.py, the default _MAX_RETRY_TIME_SECONDS = 30 equals the per-request timeout _TIMEOUT_SECONDS = 30. When a request times out, elapsed_time is already >= max_retry_time, so the check in _with_retry raises immediately with no retries:

if elapsed_time >= max_retry_time:
    raise err

For any operation using the default retry window, a single timeout means zero retry attempts.
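The effect can be reproduced with a small simulation. This is a sketch only: `call_with_retry` and `FakeClock` are hypothetical stand-ins reconstructed from the description above, not the integration's actual `_with_retry`, and every simulated request is assumed to hang for the full timeout.

```python
import asyncio

TIMEOUT_SECONDS = 30.0  # mirrors _TIMEOUT_SECONDS from the report


class FakeClock:
    """Deterministic clock so the example runs instantly (hypothetical helper)."""

    def __init__(self) -> None:
        self.now = 0.0

    def monotonic(self) -> float:
        return self.now

    async def sleep(self, seconds: float) -> None:
        self.now += seconds


async def call_with_retry(clock: FakeClock, max_retry_time: float) -> int:
    """Simplified stand-in for _with_retry; returns the number of attempts made."""
    start = clock.monotonic()
    attempts = 0
    delay = 1.0
    while True:
        attempts += 1
        await clock.sleep(TIMEOUT_SECONDS)  # the request hangs until it times out
        elapsed = clock.monotonic() - start
        if elapsed >= max_retry_time:
            return attempts  # the real code would `raise err` here
        await clock.sleep(delay)
        delay = min(delay * 2, max_retry_time - elapsed)


def count_attempts(max_retry_time: float) -> int:
    return asyncio.run(call_with_retry(FakeClock(), max_retry_time))
```

Under these assumptions, `count_attempts(30.0)` gives a single attempt (no retries), while `count_attempts(90.0)` gives three attempts.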

Issue 2: Retry delay is calculated from stale elapsed time

In _with_retry, the next delay is calculated from elapsed_time captured before the sleep, not after. This means the remaining-time budget used to cap the delay is already stale by the time the next attempt starts, potentially allowing the retry loop to overshoot the max_retry_time window unexpectedly:

await asyncio.sleep(delay)
delay = min(delay * 2, max_retry_time - elapsed_time)  # elapsed_time not refreshed after sleep
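A worked example with made-up numbers shows how the stale value loosens the cap. The figures below (a 90 second window, 80 seconds elapsed before an 8 second sleep) are assumptions chosen for illustration, and `next_delay` is a hypothetical helper that copies the expression from the snippet above.

```python
def next_delay(delay: float, max_retry_time: float, elapsed_time: float) -> float:
    """Same expression as in the snippet above, isolated for illustration."""
    return min(delay * 2, max_retry_time - elapsed_time)


max_retry_time = 90.0
delay = 8.0
elapsed_before_sleep = 80.0
elapsed_after_sleep = elapsed_before_sleep + delay  # 88 s once the sleep finishes

# Cap computed from the stale value: the budget looks like 10 s.
stale_delay = next_delay(delay, max_retry_time, elapsed_before_sleep)

# Cap computed from the refreshed value: only 2 s really remain.
fresh_delay = next_delay(delay, max_retry_time, elapsed_after_sleep)

# Sleeping for the stale delay ends at 98 s, past the 90 s window.
overshoot = elapsed_after_sleep + stale_delay - max_retry_time
```

With these numbers the stale calculation yields a 10 second delay where only 2 seconds are left, so the loop overshoots the window by 8 seconds before counting the time the next request itself takes.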

Issue 3: UpdateFailed raised on any API error marks all entities unavailable

In coordinator.py, any SolisCloudControlApiError (including transient ones) is unconditionally converted to UpdateFailed:

except SolisCloudControlApiError as error:
    raise UpdateFailed(error) from error

Home Assistant's DataUpdateCoordinator responds to UpdateFailed by marking all dependent entities as unavailable. A single brief API outage or rate-limit response therefore causes every entity to disappear from the UI until the next successful poll.

Expected behavior

Transient API failures should not cause entities to become unavailable. The coordinator should retain the last known good data and log a warning, only marking entities unavailable if no data has ever been successfully fetched.

Suggested fix

In coordinator.py, return last known data on failure if it exists:

except SolisCloudControlApiError as error:
    if self.data is not None:
        _LOGGER.warning("Solis API error, keeping last known data: %s", error)
        return self.data
    raise UpdateFailed(error) from error
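The fallback can be exercised outside Home Assistant with a minimal sketch. The two exception classes here are local stand-ins for the integration's `SolisCloudControlApiError` and Home Assistant's `UpdateFailed`, and `update_keeping_last_data` is a hypothetical free function that plays the role of the coordinator's update method, with `last_data` standing in for `self.data`.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable, Optional

_LOGGER = logging.getLogger(__name__)


class SolisCloudControlApiError(Exception):
    """Local stand-in for the integration's API error."""


class UpdateFailed(Exception):
    """Local stand-in for homeassistant.helpers.update_coordinator.UpdateFailed."""


async def update_keeping_last_data(
    fetch: Callable[[], Awaitable[Any]],
    last_data: Optional[Any],
) -> Any:
    """Return fresh data, or the last known data if the API call fails."""
    try:
        return await fetch()
    except SolisCloudControlApiError as error:
        if last_data is not None:
            # A previous poll succeeded, so keep serving its result.
            _LOGGER.warning("Solis API error, keeping last known data: %s", error)
            return last_data
        # Nothing has ever been fetched, so report the failure.
        raise UpdateFailed(error) from error
```

One trade-off to weigh: with this approach entities keep showing the last values for as long as the API keeps failing, so a prolonged outage would no longer be visible through entity availability.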

In api/solis_api.py, increase the default retry window:

_MAX_RETRY_TIME_SECONDS = 90  # was 30, must exceed _TIMEOUT_SECONDS to allow any retries

and, in _with_retry, refresh elapsed_time after the sleep:

await asyncio.sleep(delay)
elapsed_time = time.monotonic() - start_time  # refresh after sleep
delay = min(delay * 2, max_retry_time - elapsed_time)
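For comparison, here is one possible shape for a retry loop that measures the remaining budget immediately before each sleep, so the cap can never be stale. It is a sketch under assumptions, not the integration's code: `retry_until_deadline` and `TransientError` are hypothetical names, and the `clock` and `sleep` parameters exist only so the loop can be tested without real waiting.

```python
import asyncio
import time
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


async def retry_until_deadline(
    operation: Callable[[], Awaitable[T]],
    max_retry_time: float,
    initial_delay: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Retry `operation` with exponential backoff until the time budget is spent."""
    start_time = clock()
    delay = initial_delay
    while True:
        try:
            return await operation()
        except TransientError:
            # Measure the budget right before sleeping, never from an old reading.
            remaining = max_retry_time - (clock() - start_time)
            if remaining <= 0:
                raise
            await sleep(min(delay, remaining))
            delay *= 2
```

Because the sleep is capped at the remaining time, the backoff itself cannot push the loop past `max_retry_time`; only the final attempt's own duration can extend beyond it.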

Environment

  • Integration version: 2.17.6 (although I've had this problem since I started using this integration)
  • Home Assistant version: 2026.5.4
  • Installation method: HACS
  • Datalogger: Model S3-WIFI-ST version 0001320d
