Replace pymodbus, add block reads, improve startup #229

Open
sphings79 wants to merge 6 commits into ViperRNMC:main from sphings79:dev_tmodbus_blockread
@sphings79

Contributor

Overview

This PR is a substantial rework of the integration's Modbus communication layer
and coordinator polling logic, motivated by real-world stability problems
observed on a Marstek Venus D connected via TCP.
It covers four areas:

  1. pymodbus → tmodbus with batch block reads – replacing the original
    one-register-at-a-time pymodbus loop with tmodbus and grouped FC03 requests.
  2. Self-healing gap avoidance – automatically learning which register ranges
    the device does not support and avoiding them in future requests without
    removing any YAML-defined register from polling.
  3. Double-scaling fix – correcting a scaling bug in sensor.py and number.py
    that caused every non-unity-scale sensor to show values that were too small
    by exactly one scale factor.
  4. Startup reliability – correct ConfigEntryNotReady propagation so HA
    retries automatically when the device is unreachable at startup, and a new
    setup sequence that guarantees user-enabled non-default entities receive
    values immediately.

1. helpers/modbus_client.py – pymodbus → tmodbus with batch reads

Original behaviour

The original file used pymodbus.client.tcp.AsyncModbusTcpClient with a
MarstekModbusClient wrapper. Every register was read individually via
async_read_register() inside a _request_lock, with up to 3 retries each.
For a typical poll with 22 default sensors this meant 22+ sequential TCP
round-trips.

What changed

The file is rewritten to use tmodbus (create_async_tcp_client). The core
addition is a batch_read() function that groups the requested addresses into
the fewest possible FC03 requests using _build_blocks():

  • MAX_GAP = 15 – consecutive addresses with a gap of ≤ 15 registers are
    combined into a single block. Registers between the requested addresses that
    happen to fall inside the block are read for free.
  • MAX_BLOCK_SIZE = 64 – blocks are split if they would exceed 64 registers.
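
The grouping rules above can be sketched roughly as follows. This is an illustration only, not the PR's actual _build_blocks: the function name build_blocks, its signature, the (start, count) return shape, and the reading of "gap" as the number of unrequested registers between two requested addresses are all assumptions.

```python
MAX_GAP = 15         # largest run of unrequested registers a block may bridge
MAX_BLOCK_SIZE = 64  # maximum number of registers in one FC03 request


def build_blocks(addresses, bad_gaps=frozenset()):
    """Group register addresses into (start, count) read blocks.

    Hypothetical sketch: the real _build_blocks may use a different
    signature and a different definition of "gap".
    """
    blocks = []
    start = prev = None
    for addr in sorted(set(addresses)):
        if start is None:
            start = prev = addr
            continue
        gap = addr - prev - 1                        # registers in between
        too_big = addr - start + 1 > MAX_BLOCK_SIZE  # block would exceed limit
        if gap > MAX_GAP or too_big or (prev, addr) in bad_gaps:
            blocks.append((start, prev - start + 1))  # close current block
            start = addr
        prev = addr
    if start is not None:
        blocks.append((start, prev - start + 1))
    return blocks
```

With this sketch, the addresses 100, 101, 110 and 200 collapse into two requests: one block of 11 registers starting at 100, and a single-register read at 200.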

Measured improvement on the Venus D:

| Tick | Unique addresses | Requests before | Requests after |
|---|---|---|---|
| high | 5 | 5 (one per register) | 4 |
| medium | 7 | 7 | 5 |
| low | 6 | 6 | 3 |
| very_low | 111 | 111 | 14 |

High-tick wall time dropped from ~0.6–1.4 s to 0.45–0.55 s (~50 % faster).

Self-healing gap avoidance (bad_gaps / good_gaps)

When a block read fails (timeout or device error), the gaps between the
consecutive requested addresses inside that failed block are recorded in a
bad_gaps set. _build_blocks then avoids bridging those gaps in all future
calls, so unsupported register ranges are never queried again.

A good_gaps set records every gap that has been bridged successfully at
least once. Good gaps are permanently protected: a temporary TCP outage that
causes every block to fail simultaneously cannot misclassify a working gap as
bad, so the optimised block layout is preserved across transient outages.
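
A minimal sketch of that bookkeeping, assuming gaps are stored as (previous_address, next_address) pairs and that directly adjacent addresses are not treated as a gap. The helper name record_block_result is hypothetical and does not appear in the PR.

```python
def record_block_result(requested, ok, bad_gaps, good_gaps):
    """Update the gap sets after one block read.

    requested: sorted addresses that were requested inside the block.
    ok: True if the block read succeeded, False on timeout or device error.
    Both sets are modified in place.
    """
    # Pairs of neighbouring requested addresses with registers in between.
    gaps = {(a, b) for a, b in zip(requested, requested[1:]) if b - a > 1}
    if ok:
        good_gaps.update(gaps)             # bridged successfully at least once
    else:
        bad_gaps.update(gaps - good_gaps)  # proven-good gaps stay protected


bad, good = set(), set()
record_block_result([200, 205], True, bad, good)        # success: gap is good
record_block_result([100, 101, 110], False, bad, good)  # failure: gap is bad
record_block_result([200, 205], False, bad, good)       # outage: still good
```

After these three calls, bad contains only (101, 110) and good contains only (200, 205), which is the protection against transient outages described above.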

A MarstekModbusClient compatibility wrapper is retained so config_flow.py
and other callers that expect the original interface continue to work without
modification.


2. coordinator.py – group-based tick polling replaces per-sensor loop

Original behaviour

The original coordinator looped over self._all_definitions in
_async_update_data() and called self.client.async_read_register() for every
sensor individually, with per-sensor interval tracking using
_last_attempt_times and per-register exponential backoff using
_register_failures. Connection suspension logic (_consecutive_failures,
_connection_suspended) attempted to pause polling after repeated failures.

What changed

Group-based polling. Registers are sorted into four groups at load time by
their scan_interval YAML field (high, medium, low, very_low). Each
group is polled as a single batch_read call. A tick counter determines which
groups are due on each coordinator refresh.

Selective polling with dynamic opt-in. Registers with
enabled_by_default: false are excluded from the groups at startup. After all
entity platforms are set up, async_register_enabled_entities() reads the HA
entity registry and adds every register whose entity has been manually enabled
by the user to the very_low group (~3-minute interval). This replaces the
original approach of skipping disabled entities inside the poll loop.

Group-by-group initial poll on tick 1. All four groups are polled
sequentially on the very first coordinator tick so every entity has a value
immediately. Crucially each group uses its own separate batch_read call so
that failures in very_low (which may contain unsupported user-enabled
registers) cannot poison the good_gaps of the vetted default registers in
high / medium / low.

One-shot very_low catch-up flag. HA's select platform uses
update_before_add=True, which can trigger tick 1 before
async_register_enabled_entities() has run. A _needs_very_low_poll flag is
set when new entries are registered late and consumed on the next tick to fire
an additional one-shot very_low poll.
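
The tick logic can be sketched as below. The per-group tick multipliers are placeholders chosen for illustration (the PR text states only that very_low runs at roughly a 3-minute interval), and groups_due is a hypothetical helper name.

```python
# Assumed multipliers: poll a group every N coordinator ticks.
GROUP_EVERY_N_TICKS = {"high": 1, "medium": 3, "low": 6, "very_low": 18}


def groups_due(tick, needs_very_low_poll=False):
    """Return the group names to poll on this tick, in polling order."""
    if tick == 1:
        # Initial poll: every group, each with its own batch read.
        return list(GROUP_EVERY_N_TICKS)
    due = [name for name, every in GROUP_EVERY_N_TICKS.items()
           if tick % every == 0]
    if needs_very_low_poll and "very_low" not in due:
        due.append("very_low")  # one-shot catch-up for late registrations
    return due
```

The caller would clear the catch-up flag after the extra very_low poll so that it fires only once.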

Extended coordinator timeout (30 s → 120 s). The very_low group can
contain up to 82 user-enabled non-default registers. The per-register fallback
in batch_read retries each address individually with a 5 s timeout; 16
cell-voltage registers × 5 s = 80 s, which exceeds the old 30 s timeout and
caused spurious UpdateFailed errors that silenced even the fast default sensors.

bad_gaps / good_gaps wired into every batch_read call via
self._bad_gaps and self._good_gaps on the coordinator.


3. __init__.py – setup order and ConfigEntryNotReady propagation

Original behaviour

async_init() logged an error and returned False when the TCP connection
failed. The outer except Exception in async_setup_entry caught
ConfigEntryNotReady along with everything else, returning False. HA treats
False as a permanent failure and does not schedule a retry.

What changed

ConfigEntryNotReady is now re-raised so it propagates to HA's built-in
exponential-backoff retry mechanism (5 s → 10 s → 30 s → 60 s → …). A
disconnected LAN cable or a device rebooting no longer permanently disables the
integration.
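
The exception handling can be illustrated with the following self-contained sketch. It uses a local stand-in for homeassistant.exceptions.ConfigEntryNotReady so that it runs without Home Assistant installed. The connect and setup_entry functions are hypothetical simplifications, and the assumption that other exceptions still return False is inferred from the description of the original behaviour.

```python
import asyncio


class ConfigEntryNotReady(Exception):
    """Local stand-in for homeassistant.exceptions.ConfigEntryNotReady."""


async def connect(reachable):
    # Hypothetical connection step; raises when the device is offline.
    if not reachable:
        raise ConfigEntryNotReady("device unreachable")


async def setup_entry(reachable):
    """Simplified shape of async_setup_entry after the change."""
    try:
        await connect(reachable)
    except ConfigEntryNotReady:
        raise          # propagate so Home Assistant schedules a retry
    except Exception:  # assumed: other errors still fail setup
        return False
    return True
```

The key point is the ordering of the except clauses: the specific ConfigEntryNotReady handler must come before the broad except Exception, otherwise the broad handler swallows it again.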

Setup order. async_register_enabled_entities() is now called between
async_forward_entry_setups() and async_config_entry_first_refresh() so that,
wherever the platform setup timing allows, user-enabled non-default entities
are already in the polling groups when tick 1 fires.


4. sensor.py – fix double-scaling

Original behaviour

coordinator.data stores already-scaled values because
extract_typed_value() (called inside batch_read / coordinator) applies the
scale factor before storing. However, MarstekSensor.native_value applied
scale a second time, and MarstekCalculatedSensor._calculate set
dep_values[alias] = float(val) * scale on already-scaled data.

This caused every non-unity-scale sensor to show values that were too small by
exactly one scale factor:

| Sensor | Raw register | Correct | Was showing |
|---|---|---|---|
| AC voltage | 2390 | 239.0 V | 23.9 V |
| AC frequency | 500 | 50.0 Hz | 5.0 Hz |
| Battery capacity | 5120 | 5.12 kWh | 0.005 kWh |
| Internal temperature | 191 | 19.1 °C | 1.91 °C |
| Stored energy | | 0.61 kWh | 0.0 kWh |
| Battery cycles (calc) | | 11.5 | 115 |

What changed

  • MarstekSensor.native_value – removed scale * value; only precision
    rounding and states mapping remain. The coordinator is the single source of
    truth for scaled data.
  • MarstekCalculatedSensor._calculate – dep_values[alias] is now
    float(val) (no additional scale multiplication).
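
The effect of the bug can be reproduced with the AC voltage row from the table above. In this sketch the scale of 0.1 is inferred from that row (2390 → 239.0 V), and the three function names are illustrative stand-ins rather than the integration's actual code.

```python
def extract_scaled(raw, scale):
    """Stand-in for the coordinator step that scales a raw register once."""
    return raw * scale


def native_value_old(stored, scale, precision):
    """Old behaviour: scale applied a second time to already-scaled data."""
    return round(stored * scale, precision)


def native_value_new(stored, precision):
    """New behaviour: only precision rounding, no further scaling."""
    return round(stored, precision)


stored = extract_scaled(2390, 0.1)            # value kept in coordinator.data
old_reading = native_value_old(stored, 0.1, 1)
new_reading = native_value_new(stored, 1)
```

Here new_reading is 239.0 and old_reading is 23.9, matching the "Correct" and "Was showing" columns for AC voltage.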

5. number.py – fix double-scaling and optimistic update

Original behaviour

MarstekNumber.native_value returned raw_value * self._scale on
already-scaled coordinator data (same root cause as sensor.py).
async_set_native_value stored the raw register integer
(int(value / self._scale)) in coordinator.data for the optimistic update.

What changed

  • native_value now returns data.get(self._key) directly.
  • async_set_native_value stores the engineering-unit value (the float passed
    in by HA) in coordinator.data for the optimistic update, so the entity
    shows the correct number immediately after a write.

Files changed

| File | Change type |
|---|---|
| custom_components/marstek_modbus/helpers/modbus_client.py | Full rewrite: pymodbus → tmodbus, batch reads, bad_gaps/good_gaps |
| custom_components/marstek_modbus/coordinator.py | Full rewrite: group-based tick polling, batch reads, startup fixes |
| custom_components/marstek_modbus/__init__.py | Setup order + ConfigEntryNotReady re-raise |
| custom_components/marstek_modbus/sensor.py | Double-scaling fix in native_value and _calculate |
| custom_components/marstek_modbus/number.py | Double-scaling fix in native_value and optimistic update |

Testing

Hardware: Marstek Venus D, 2 battery packs
Connection: TCP
HA version: 2026.4.4
Coworker: Claude Code

  • All 22 default sensors show correct scaled values within ~5 s of HA restart.
  • User-enabled non-default entities (16 cell voltages, 4 MPPT channels, 6
    schedule slots, WiFi/BT/cloud status, MOS temperatures, …) show values within
    ~10 s of startup.
  • After the first very_low tick, the bad gaps (35002→35010,
    42000→42010, 42011→42020) are learned and logged at INFO level; no further
    ModbusConnectionError / reconnect events occur for those ranges.
  • High-tick poll completes in 0.45–0.55 s (4 requests), down from the original
    approach of 5+ sequential single-register reads.
  • Coordinator runs stably for 2+ hours (tick 277+ observed) without
    UpdateFailed or entity freezes.
  • Disconnecting the LAN cable during normal operation: coordinator recovers
    within one tick after reconnection; good gaps are preserved; no manual reload
    needed.
  • Starting HA with the device offline: integration logs
    ConfigEntryNotReady and retries automatically, becoming available as soon
    as the device responds.
  • TCP outage during a very_low poll: good gaps protect previously-working
    block layouts from being marked bad; block layout returns to optimised form
    on the next successful poll.

sphings79 added 5 commits May 4, 2026 21:18
Refactor setup entry process and improve error handling.

Signed-off-by: sphings79 <43515272+sphings79@users.noreply.github.com>
Refactor number entity setup and improve comments for clarity.

Signed-off-by: sphings79 <43515272+sphings79@users.noreply.github.com>
Refactor Marstek sensor classes to ensure scaling is handled correctly by the coordinator. Adjusted methods to prevent double scaling and improved handling of sensor attributes and calculations.

Signed-off-by: sphings79 <43515272+sphings79@users.noreply.github.com>
Refactor Modbus client to use tmodbus, adding async connection and read/write methods. Maintain backward compatibility with existing read helpers.

Signed-off-by: sphings79 <43515272+sphings79@users.noreply.github.com>
Signed-off-by: sphings79 <43515272+sphings79@users.noreply.github.com>
@Nerdiyde


Thanks for your work. :)
Did you test this on your system? Just asking because I got a requirements issue regarding tmodbus. I think you did not add tmodbus to the requirements list?

"requirements": ["tmodbus", "pymodbus>=3.9.2"],

@sphings79

Contributor Author

You are right, I forgot to update that.
I tried that locally :) but you only need
"requirements": ["tmodbus"],

I think I forgot that because I got a fw.bin and have been trying to decompile it for a week :)

Signed-off-by: sphings79 <43515272+sphings79@users.noreply.github.com>
@Nerdiyde


I did some more quick testing regarding this.
Unfortunately, on my setup (Venus 3.0 connected via LAN-->AP-->Wifi-->Router) the integration is not more stable than the current beta of the original implementation.
Also, some values were missing for me. It was not possible to set the charge, max-charge and max-discharge power.
Sorry for the bad news but I hope this helps. :)

@sphings79

Contributor Author

I did some more quick testing regarding this. Unfortunately, on my setup (Venus 3.0 connected via LAN-->AP-->Wifi-->Router) the integration is not more stable than the current beta of the original implementation. Also, some values were missing for me. It was not possible to set the charge, max-charge and max-discharge power. Sorry for the bad news but I hope this helps. :)

Have you double-checked that they were enabled?
