metadata.json corrupted by concurrent publish_data calls (race condition) — persists in v0.17.2 #825

@Coernel82

Description

5b918bf2_emhass_2026-04-23T15-10-30.195Z.log

Describe the bug

When two EMHASS optimization calls overlap in time, all internal post_data() calls (in
_publish_standard_forecasts, _publish_deferrable_loads, and _publish_battery_data)
read and write /share/entities/metadata.json concurrently. One request writes the file
while another is reading or writing it, leaving a corrupted JSON file behind and
crashing the subsequent request.

The symptom is a repeating cascade of errors in the EMHASS log:

ERROR in retrieve_hass: Corrupted metadata file found at /share/entities/metadata.json. Creating a new one.
ERROR in app: Exception on request POST /action/naive-mpc-optim
Traceback (most recent call last):
  File "/app/src/emhass/retrieve_hass.py", line 1343, in post_data
    metadata = orjson.loads(content)
orjson.JSONDecodeError: unexpected content after document: line 44 column 3 (char 1149)

During handling of the above exception, another exception occurred:
  File "/app/src/emhass/retrieve_hass.py", line 1351, in post_data
    os.rename(metadata_path, corrupt_path)
FileNotFoundError: [Errno 2] No such file or directory:
  '/share/entities/metadata.json' -> '/share/entities/metadata_corrupt.json'

The FileNotFoundError on os.rename() indicates that a parallel thread already deleted
the file in the same error-handling path — confirming this is a race condition between
concurrent requests, not a one-off file corruption.

Note: This analysis was performed with the help of an AI (Claude by Anthropic) via
log file analysis. The root cause description below is based on static code analysis
of the EMHASS source and the log patterns observed. It has not been confirmed by
running a debugger.

To Reproduce

  1. Use EMHASS as a Home Assistant add-on (v0.17.2, HA OS).
  2. Configure a PyScript that posts to /action/naive-mpc-optim every hour (:05).
  3. Add any logic that causes the PyScript to await task.sleep(30) after the first
    successful POST (e.g. waiting for sensor data to be written back before reading it).
  4. While the PyScript is sleeping inside the first call, the next hourly trigger fires
    and issues a second POST to EMHASS.
  5. EMHASS is now processing two naive_mpc_optim requests concurrently. Each call
    invokes publish_data(), which internally makes multiple sequential post_data()
    calls. These interleave with each other and corrupt metadata.json.

The cascade then self-perpetuates: each failed POST in the error handler may trigger
another retry POST, which overlaps again with the next scheduled call.

Root cause (AI-assisted analysis)

naive_mpc_optim in command_line.py calls publish_data(), which internally calls:

  • _publish_standard_forecasts() → post_data()
  • _publish_deferrable_loads() → post_data()
  • _publish_battery_data() → post_data()

Each post_data() in retrieve_hass.py reads, modifies, and re-writes
/share/entities/metadata.json without any file-level lock. If two naive_mpc_optim
requests are handled concurrently (ASGI/Quart allows this), their post_data() calls
interleave and corrupt the file.
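The "unexpected content after document" message is characteristic of a shorter JSON
document being laid over a longer one without truncation. A minimal, deterministic
simulation of that failure mode (using the stdlib json module instead of orjson, and a
throwaway temp directory rather than the real /share/entities path):

```python
import json
import os
import tempfile

workdir = tempfile.mkdtemp()
metadata_path = os.path.join(workdir, "metadata.json")

long_doc = json.dumps({"sensor.p_pv_forecast": {"unit": "W", "name": "PV forecast"}})
short_doc = json.dumps({"ok": 1})

# First writer produces a long, valid document.
with open(metadata_path, "w") as f:
    f.write(long_doc)

# A racing writer overwrites from offset 0 WITHOUT truncating, so the tail
# of the longer document survives after the shorter one.
with open(metadata_path, "r+") as f:
    f.write(short_doc)

try:
    with open(metadata_path) as f:
        json.load(f)
except json.JSONDecodeError as e:
    # stdlib json calls this "Extra data"; orjson reports
    # "unexpected content after document" for the same condition.
    print(f"corrupted: {e.msg}")
```

Whether EMHASS hits exactly this interleaving or a byte-level mix of two writes, the
reader's decode error is the same class of failure.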

PR #716 (merged in v0.17.0) added a try/except around orjson.loads() and attempts
os.rename() on the corrupt file. However, it does not prevent concurrent writes —
it only handles the symptom. When two threads hit the error handler simultaneously,
the first thread deletes the file and the second thread's os.rename() fails with
FileNotFoundError, crashing the entire request.
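The second half of the crash is easy to demonstrate in isolation: once one error
handler has renamed the corrupt file away, any other handler running the same recovery
code fails. A sketch with illustrative paths (not the real EMHASS code):

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
metadata_path = os.path.join(workdir, "metadata.json")
corrupt_path = os.path.join(workdir, "metadata_corrupt.json")

with open(metadata_path, "w") as f:
    f.write("{ not json")

# First error handler wins the race and moves the corrupt file aside.
os.rename(metadata_path, corrupt_path)

# Second error handler runs the identical recovery path a moment later.
try:
    os.rename(metadata_path, corrupt_path)
except FileNotFoundError:
    print("second handler crashes: file already moved")
```

This matches the observed traceback: the rename itself, not the JSON decode, is what
aborts the second request.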

Expected behavior

Concurrent post_data() calls should not corrupt metadata.json. Each write should be
atomic. Either:

  • A per-file asyncio lock (e.g. asyncio.Lock) should serialize all post_data() calls
    that touch the same file, or
  • Writes should use atomic replace semantics (write to a temp file, then os.replace()).
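A sketch combining both mitigations, assuming an async code path. The function name
save_metadata and the per-path lock registry are illustrative, not existing EMHASS
symbols, and stdlib json stands in for orjson:

```python
import asyncio
import json
import os
from collections import defaultdict

# One lock per metadata file path, shared by all post_data() calls.
_file_locks: "defaultdict[str, asyncio.Lock]" = defaultdict(asyncio.Lock)

async def save_metadata(metadata_path: str, metadata: dict) -> None:
    async with _file_locks[metadata_path]:
        # Write to a sibling temp file, then atomically swap it into place,
        # so a concurrent reader never observes a partially written file.
        tmp_path = metadata_path + ".tmp"
        with open(tmp_path, "wb") as f:
            f.write(json.dumps(metadata).encode())
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, metadata_path)  # atomic on POSIX and Windows
```

Reads of metadata.json should take the same lock, so that a full read-modify-write
cycle in post_data() cannot interleave with another request's cycle.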

Screenshots / Log excerpt

Log excerpt (2026-04-28, EMHASS v0.17.2, HA OS):

00:00:48 ERROR retrieve_hass: Corrupted metadata file found at /share/entities/metadata.json. Creating a new one.
00:00:48 ERROR retrieve_hass: Corrupted metadata file found at /share/entities/metadata.json. Creating a new one.
00:00:48 ERROR app: Exception on request POST /action/naive-mpc-optim
             orjson.JSONDecodeError: unexpected content after document: line 44 column 3 (char 1149)
             ...
             FileNotFoundError: '/share/entities/metadata.json' -> '/share/entities/metadata_corrupt.json'

01:00:40 ERROR retrieve_hass: Corrupted metadata file found (×5 in 73ms)
01:00:40 ERROR app: Exception on request POST /action/naive-mpc-optim
             orjson.JSONDecodeError: unexpected content after document: line 20 column 1 (char 486)
             FileNotFoundError: '/share/entities/metadata.json' -> '/share/entities/metadata_corrupt.json'

The error repeats every hour, and multiple "Corrupted metadata file" messages appear
within milliseconds of each other (73 ms apart at 01:00:40), which is consistent with
concurrent threads hitting the same error path simultaneously.

Home Assistant installation type

  • Home Assistant OS

Your hardware

  • OS: Home Assistant OS
  • Architecture: amd64

EMHASS installation type

  • Add-on

Additional context

  • EMHASS version: 0.17.2
  • The bug was present in every hour of the attached log (2026-04-28, 00:00 – 09:05).
  • After the EMHASS add-on is restarted (via HA watchdog automation), the next call
    succeeds — confirming the issue is in-memory/file state that is reset on restart.
  • A possible minimal fix in retrieve_hass.py would be to replace the write pattern
    with atomic temp-file semantics:
    import os
    tmp_path = metadata_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(orjson.dumps(metadata))
    os.replace(tmp_path, metadata_path)  # atomic on POSIX and Windows
    This would not fully prevent a read racing with a write, but it would ensure that
    no partial write is ever visible to a reader. A proper fix would additionally add an
    asyncio.Lock shared across all post_data() calls for the same metadata file path.
