Entities become unavailable on transient API failures due to retry logic and UpdateFailed propagation

Hi, I've been having problems with entities becoming unavailable, so I used Claude to help write this issue report.


**Describe the bug**

Entities managed by the integration intermittently become unavailable during normal operation. The root cause is three related issues in the retry and error handling logic that cause transient API failures to propagate all the way to Home Assistant's entity availability state.

**Root cause analysis**

**Issue 1: `_MAX_RETRY_TIME_SECONDS` is too short to allow any retries on timeout**

In `api/solis_api.py`, the default `_MAX_RETRY_TIME_SECONDS = 30` equals the per-request timeout `_TIMEOUT_SECONDS = 30`. When a request times out, `elapsed_time` is already `>= max_retry_time`, so the check in `_with_retry` raises immediately with no retries:

```python
if elapsed_time >= max_retry_time:
    raise err
```

For any operation using the default retry window, a single timeout means zero retry attempts.
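The arithmetic behind this can be checked in isolation. The snippet below is only a sketch of the condition quoted above, not the integration's code: the helper name `retry_allowed_after_timeout` is hypothetical, and it assumes a timed-out request uses up the whole per-request timeout before the check runs.

```python
def retry_allowed_after_timeout(timeout_s: float, max_retry_time_s: float) -> bool:
    """Return True if the retry window is still open after one full request timeout.

    Hypothetical helper: mirrors the `elapsed_time >= max_retry_time` check,
    assuming the timed-out request consumed the entire per-request timeout.
    """
    elapsed_time = timeout_s  # one timeout has already used this much of the budget
    return not (elapsed_time >= max_retry_time_s)


print(retry_allowed_after_timeout(30, 30))  # False: window equals timeout, no retry
print(retry_allowed_after_timeout(30, 90))  # True: window leaves room for retries
```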

**Issue 2: Retry delay is calculated from stale elapsed time**

In `_with_retry`, the next delay is calculated from `elapsed_time` captured *before* the sleep, not after. This means the remaining-time budget used to cap the delay is already stale by the time the next attempt starts, potentially allowing the retry loop to overshoot the `max_retry_time` window unexpectedly:

```python
await asyncio.sleep(delay)
delay = min(delay * 2, max_retry_time - elapsed_time)  # elapsed_time not refreshed after sleep
```
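To make the staleness concrete, here is a small worked example with made-up numbers (a 90 second window, 80 seconds already elapsed, an 8 second delay). The function `next_delay` is a hypothetical stand-in that only reproduces the `min(...)` expression quoted above.

```python
def next_delay(delay: float, max_retry_time: float, elapsed_time: float) -> float:
    """Reproduce the delay cap from the snippet above (hypothetical helper)."""
    return min(delay * 2, max_retry_time - elapsed_time)


max_retry_time = 90.0          # illustrative values only
elapsed_before_sleep = 80.0    # measured when the attempt failed
delay = 8.0                    # the sleep that is about to happen

elapsed_after_sleep = elapsed_before_sleep + delay  # 88.0 once the sleep finishes

stale_cap = next_delay(delay, max_retry_time, elapsed_before_sleep)
fresh_cap = next_delay(delay, max_retry_time, elapsed_after_sleep)

print(stale_cap)  # 10.0, although only 2 seconds of budget remain
print(fresh_cap)  # 2.0, the cap that matches the real remaining time
```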

**Issue 3: `UpdateFailed` raised on any API error marks all entities unavailable**

In `coordinator.py`, any `SolisCloudControlApiError` (including transient ones) is unconditionally converted to `UpdateFailed`:

```python
except SolisCloudControlApiError as error:
    raise UpdateFailed(error) from error
```

Home Assistant's `DataUpdateCoordinator` responds to `UpdateFailed` by marking all dependent entities as unavailable. A single brief API outage or rate-limit response therefore causes every entity to disappear from the UI until the next successful poll.

**Expected behavior**

Transient API failures should not cause entities to become unavailable. The coordinator should retain the last known good data and log a warning, only marking entities unavailable if no data has ever been successfully fetched.

**Suggested fix**

In `coordinator.py`, return last known data on failure if it exists:

```python
except SolisCloudControlApiError as error:
    if self.data is not None:
        _LOGGER.warning("Solis API error, keeping last known data: %s", error)
        return self.data
    raise UpdateFailed(error) from error
```
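The intended behaviour can be tried outside Home Assistant with a small stand-in. Everything in this sketch is an assumption made for illustration: `SketchCoordinator` is not the integration's coordinator, and the two exception classes are local placeholders for `SolisCloudControlApiError` and Home Assistant's `UpdateFailed`.

```python
import logging

_LOGGER = logging.getLogger(__name__)


class SolisCloudControlApiError(Exception):
    """Local placeholder for the integration's API error."""


class UpdateFailed(Exception):
    """Local placeholder for Home Assistant's UpdateFailed."""


class SketchCoordinator:
    """Hypothetical, synchronous stand-in for the coordinator's update path."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable that returns fresh data or raises
        self.data = None     # last known good data, None until the first success

    def update(self):
        try:
            self.data = self._fetch()
        except SolisCloudControlApiError as error:
            if self.data is None:
                # Nothing has ever been fetched, so the failure must surface.
                raise UpdateFailed(error) from error
            _LOGGER.warning("Solis API error, keeping last known data: %s", error)
        return self.data
```

With this stand-in, a failure after one successful fetch returns the previous data, while a failure on the very first fetch still raises `UpdateFailed`.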

In `api/solis_api.py`, increase the default retry window and refresh `elapsed_time` after the sleep:

```python
_MAX_RETRY_TIME_SECONDS = 90  # was 30, must exceed _TIMEOUT_SECONDS to allow any retries
```

```python
await asyncio.sleep(delay)
elapsed_time = time.monotonic() - start_time  # refresh after sleep
delay = min(delay * 2, max_retry_time - elapsed_time)
```
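Putting both changes together, a retry helper could look roughly like the sketch below. This is not the integration's `_with_retry`: the name `with_retry`, its parameters, and the initial delay are assumptions. It measures the remaining budget right before each sleep, so the cap is never based on a stale `elapsed_time`.

```python
import asyncio
import time
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def with_retry(
    operation: Callable[[], Awaitable[T]],
    *,
    max_retry_time: float = 90.0,   # assumed default, must exceed the request timeout
    initial_delay: float = 1.0,     # assumed starting backoff
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Retry `operation` with exponential backoff inside a fixed time window."""
    start_time = time.monotonic()
    delay = initial_delay
    while True:
        try:
            return await operation()
        except retry_on:
            # Measure the budget after the failed attempt, just before sleeping.
            remaining = max_retry_time - (time.monotonic() - start_time)
            if remaining <= 0:
                raise  # window exhausted, surface the last error
            await asyncio.sleep(min(delay, remaining))
            delay *= 2
```

Because the sleep is capped by the freshly measured remaining time, the loop cannot sleep past the end of the window; only the final attempt itself can run over it.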

**Environment**

- Integration version: 2.17.6 (although I've had this problem since I started using this integration)
- Home Assistant version: 2026.5.4
- Installation method: HACS
- Datalogger: Model S3-WIFI-ST version 0001320d
