metadata.json corrupted by concurrent publish_data calls (race condition) — persists in v0.17.2 #825

@Coernel82

Description

5b918bf2_emhass_2026-04-23T15-10-30.195Z.log

Describe the bug

When two EMHASS optimization calls overlap in time, all internal post_data() calls (in
_publish_standard_forecasts, _publish_deferrable_loads, and _publish_battery_data)
read and write /share/entities/metadata.json concurrently. One request writes the file
while another is reading or writing it, leaving a corrupted JSON file behind and
crashing the subsequent request.

The symptom is a repeating cascade of errors in the EMHASS log:

ERROR in retrieve_hass: Corrupted metadata file found at /share/entities/metadata.json. Creating a new one.
ERROR in app: Exception on request POST /action/naive-mpc-optim
Traceback (most recent call last):
  File "/app/src/emhass/retrieve_hass.py", line 1343, in post_data
    metadata = orjson.loads(content)
orjson.JSONDecodeError: unexpected content after document: line 44 column 3 (char 1149)

During handling of the above exception, another exception occurred:
  File "/app/src/emhass/retrieve_hass.py", line 1351, in post_data
    os.rename(metadata_path, corrupt_path)
FileNotFoundError: [Errno 2] No such file or directory:
  '/share/entities/metadata.json' -> '/share/entities/metadata_corrupt.json'

The FileNotFoundError on os.rename() indicates that a parallel thread already deleted
the file in the same error-handling path — confirming this is a race condition between
concurrent requests, not a one-off file corruption.

Note: This analysis was performed with the help of an AI (Claude by Anthropic) via
log file analysis. The root cause description below is based on static code analysis
of the EMHASS source and the log patterns observed. It has not been confirmed by
running a debugger.

To Reproduce

  1. Use EMHASS as a Home Assistant add-on (v0.17.2, HA OS).
  2. Configure a PyScript that posts to /action/naive-mpc-optim every hour (:05).
  3. Add any logic that causes the PyScript to await task.sleep(30) after the first
    successful POST (e.g. waiting for sensor data to be written back before reading it).
  4. While the PyScript is sleeping inside the first call, the next hourly trigger fires
    and issues a second POST to EMHASS.
  5. EMHASS is now processing two naive_mpc_optim requests concurrently. Each call
    invokes publish_data(), which internally makes multiple sequential post_data()
    calls. These interleave with each other and corrupt metadata.json.

The cascade then self-perpetuates: each failed POST in the error handler may trigger
another retry POST, which overlaps again with the next scheduled call.

Root cause (AI-assisted analysis)

naive_mpc_optim in command_line.py calls publish_data(), which internally calls:

  • _publish_standard_forecasts() → post_data()
  • _publish_deferrable_loads() → post_data()
  • _publish_battery_data() → post_data()

Each post_data() in retrieve_hass.py reads, modifies, and re-writes
/share/entities/metadata.json without any file-level lock. If two naive_mpc_optim
requests are handled concurrently (ASGI/Quart allows this), their post_data() calls
interleave and corrupt the file.
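The "unexpected content after document" message is characteristic of a shorter JSON
document being laid over a longer one without truncation. A minimal, deterministic
simulation of that failure mode (using the stdlib json module instead of orjson, and a
throwaway temp directory rather than the real /share/entities path):

```python
import json
import os
import tempfile

workdir = tempfile.mkdtemp()
metadata_path = os.path.join(workdir, "metadata.json")

long_doc = json.dumps({"sensor.p_pv_forecast": {"unit": "W", "name": "PV forecast"}})
short_doc = json.dumps({"ok": 1})

# First writer produces a long, valid document.
with open(metadata_path, "w") as f:
    f.write(long_doc)

# A racing writer overwrites from offset 0 WITHOUT truncating, so the tail
# of the longer document survives after the shorter one.
with open(metadata_path, "r+") as f:
    f.write(short_doc)

try:
    with open(metadata_path) as f:
        json.load(f)
except json.JSONDecodeError as e:
    # stdlib json calls this "Extra data"; orjson reports
    # "unexpected content after document" for the same condition.
    print(f"corrupted: {e.msg}")
```

Whether EMHASS hits exactly this interleaving or a byte-level mix of two writes, the
reader's decode error is the same class of failure.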

PR #716 (merged in v0.17.0) added a try/except around orjson.loads() and attempts
os.rename() on the corrupt file. However, it does not prevent concurrent writes —
it only handles the symptom. When two threads hit the error handler simultaneously,
the first thread deletes the file and the second thread's os.rename() fails with
FileNotFoundError, crashing the entire request.
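The second half of the crash is easy to demonstrate in isolation: once one error
handler has renamed the corrupt file away, any other handler running the same recovery
code fails. A sketch with illustrative paths (not the real EMHASS code):

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
metadata_path = os.path.join(workdir, "metadata.json")
corrupt_path = os.path.join(workdir, "metadata_corrupt.json")

with open(metadata_path, "w") as f:
    f.write("{ not json")

# First error handler wins the race and moves the corrupt file aside.
os.rename(metadata_path, corrupt_path)

# Second error handler runs the identical recovery path a moment later.
try:
    os.rename(metadata_path, corrupt_path)
except FileNotFoundError:
    print("second handler crashes: file already moved")
```

This matches the observed traceback: the rename itself, not the JSON decode, is what
aborts the second request.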

Expected behavior

Concurrent post_data() calls should not corrupt metadata.json. Each write should be
atomic. Either:

  • A per-file asyncio lock (e.g. asyncio.Lock) should serialize all post_data() calls
    that touch the same file, or
  • Writes should use atomic replace semantics (write to a temp file, then os.replace()).
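A sketch combining both mitigations, assuming an async code path. The function name
save_metadata and the per-path lock registry are illustrative, not existing EMHASS
symbols, and stdlib json stands in for orjson:

```python
import asyncio
import json
import os
from collections import defaultdict

# One lock per metadata file path, shared by all post_data() calls.
_file_locks: "defaultdict[str, asyncio.Lock]" = defaultdict(asyncio.Lock)

async def save_metadata(metadata_path: str, metadata: dict) -> None:
    async with _file_locks[metadata_path]:
        # Write to a sibling temp file, then atomically swap it into place,
        # so a concurrent reader never observes a partially written file.
        tmp_path = metadata_path + ".tmp"
        with open(tmp_path, "wb") as f:
            f.write(json.dumps(metadata).encode())
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, metadata_path)  # atomic on POSIX and Windows
```

Reads of metadata.json should take the same lock, so that a full read-modify-write
cycle in post_data() cannot interleave with another request's cycle.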

Screenshots / Log excerpt

Log excerpt (2026-04-28, EMHASS v0.17.2, HA OS):

00:00:48 ERROR retrieve_hass: Corrupted metadata file found at /share/entities/metadata.json. Creating a new one.
00:00:48 ERROR retrieve_hass: Corrupted metadata file found at /share/entities/metadata.json. Creating a new one.
00:00:48 ERROR app: Exception on request POST /action/naive-mpc-optim
             orjson.JSONDecodeError: unexpected content after document: line 44 column 3 (char 1149)
             ...
             FileNotFoundError: '/share/entities/metadata.json' -> '/share/entities/metadata_corrupt.json'

01:00:40 ERROR retrieve_hass: Corrupted metadata file found (×5 in 73ms)
01:00:40 ERROR app: Exception on request POST /action/naive-mpc-optim
             orjson.JSONDecodeError: unexpected content after document: line 20 column 1 (char 486)
             FileNotFoundError: '/share/entities/metadata.json' -> '/share/entities/metadata_corrupt.json'

The error repeats every hour, and multiple "Corrupted metadata file" messages appear
within milliseconds of each other (73 ms apart at 01:00:40), which is consistent with
concurrent threads hitting the same error path simultaneously.

Home Assistant installation type

  • Home Assistant OS

Your hardware

  • OS: Home Assistant OS
  • Architecture: amd64

EMHASS installation type

  • Add-on

Additional context

  • EMHASS version: 0.17.2
  • The bug was present in every hour of the attached log (2026-04-28, 00:00 – 09:05).
  • After the EMHASS add-on is restarted (via HA watchdog automation), the next call
    succeeds — confirming the issue is in-memory/file state that is reset on restart.
  • A possible minimal fix in retrieve_hass.py would be to replace the write pattern
    with atomic temp-file semantics:
    import os
    tmp_path = metadata_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(orjson.dumps(metadata))
    os.replace(tmp_path, metadata_path)  # atomic on POSIX and Windows
    This would not fully prevent a read racing with a write, but it would ensure that
    no partial write is ever visible to a reader. A proper fix would additionally add an
    asyncio.Lock shared across all post_data() calls for the same metadata file path.
