Commit e10e140

Log lock-release lifecycle in executor_step_status and distributed_lock (#5027)
## Summary

Adds two INFO-level log lines so the next occurrence of the distributed-lock self-race described in #5026 can be identified from logs alone.

- `StatusFile.write_status` (lib/marin/src/marin/execution/executor_step_status.py) logs `Releasing lock path=... worker=... reason=terminal_status:<STATUS>` before the conditional `release_lock()` branch on terminal statuses.
- `DistributedLease.release` (lib/rigging/src/rigging/distributed_lock.py) logs `Released lock path=... worker=...` at INFO immediately after `self._delete()` (the prior DEBUG line in the same spot was promoted to INFO rather than duplicated).

Together these disambiguate a self-release from an external delete or a stale-lease takeover; the existing `LeaseLostError` message at distributed_lock.py:152 cannot tell them apart today.

Implements diffs #1 and #2 from the issue's "Instrumentation gap" proposal. Diff #3 (the `refresh` error-message fix) is not included here and can follow up if/when useful.

Refs #5026

## Test plan

- `./infra/pre-commit.py --all-files --fix`: passes (ruff, black, pyrefly, license headers, etc.)
- `uv run pytest lib/rigging/tests -m "not slow"`: 66 passed
- `uv run pytest lib/iris/tests/test_distributed_lock.py -m "not slow"`: 16 passed (covers the modified `DistributedLease.release` path)

Reviewers: please confirm the log-line wording is what you want to see in prod, and that promoting the existing `[%s] Released lock %s` DEBUG line to the new INFO shape (rather than keeping both) is the right call.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com>
1 parent 154ec05 commit e10e140

2 files changed

Lines changed: 7 additions & 1 deletion

lib/marin/src/marin/execution/executor_step_status.py

Lines changed: 6 additions & 0 deletions
@@ -120,6 +120,12 @@ def write_status(self, status: str) -> None:
             f.write(status)
 
         if status != STATUS_RUNNING:
+            logger.info(
+                "Releasing lock path=%s worker=%s reason=terminal_status:%s",
+                self._lock_path,
+                self.worker_id,
+                status,
+            )
             self.release_lock()
         logger.debug("[%s] Wrote status %s to %s", self.worker_id, status, self.path)
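
To make the rendered shape of the new `Releasing lock` line concrete, here is a minimal sketch that formats the same message string the way the standard `logging` module does for %-style arguments. The helper name `render_release_line` and the example path, worker id, and status value are hypothetical placeholders, not taken from marin.

```python
import logging

# Message format copied from the diff above.
RELEASING_FMT = "Releasing lock path=%s worker=%s reason=terminal_status:%s"


def render_release_line(lock_path: str, worker_id: str, status: str) -> str:
    """Return the message text logging would emit for the given arguments.

    Hypothetical helper for illustration only: it builds a LogRecord by hand
    and lets logging apply the %-style substitution.
    """
    record = logging.LogRecord(
        name="sketch",
        level=logging.INFO,
        pathname="sketch.py",
        lineno=0,
        msg=RELEASING_FMT,
        args=(lock_path, worker_id, status),
        exc_info=None,
    )
    return record.getMessage()


if __name__ == "__main__":
    # All three values below are made-up examples.
    print(render_release_line("gs://bucket/step/.lock", "worker-0", "SUCCESS"))
```

Because the fields are emitted as `key=value` pairs, the line can be filtered by `path=` or `worker=` with a plain text search.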

lib/rigging/src/rigging/distributed_lock.py

Lines changed: 1 addition & 1 deletion
@@ -161,7 +161,7 @@ def release(self) -> None:
             _, lock_data = self._read_with_generation()
             if lock_data and lock_data.worker_id == self.worker_id:
                 self._delete()
-                logger.debug("[%s] Released lock %s", self.worker_id, self.lock_path)
+                logger.info("Released lock path=%s worker=%s", self.lock_path, self.worker_id)
         except FileNotFoundError:
             pass
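
As a rough illustration of how the two lines could be used to tell a self-release from an external delete or a stale-lease takeover, the sketch below scans plain-text log lines for either message with a matching path and worker. The function name `classify_lease_loss` and its return labels are hypothetical. It assumes paths and worker ids contain no whitespace, and that the caller passes only the lines logged before the `LeaseLostError` being investigated.

```python
import re
from typing import Iterable

# Patterns mirror the two INFO messages added in this commit.
RELEASING = re.compile(
    r"Releasing lock path=(?P<path>\S+) worker=(?P<worker>\S+) reason=(?P<reason>\S+)"
)
RELEASED = re.compile(r"Released lock path=(?P<path>\S+) worker=(?P<worker>\S+)")


def classify_lease_loss(log_lines: Iterable[str], lock_path: str, worker_id: str) -> str:
    """Return 'self_release' if this worker logged releasing lock_path.

    Otherwise return 'external_or_takeover'. The labels are illustrative;
    the logs alone cannot separate an external delete from a takeover.
    """
    for line in log_lines:
        for pattern in (RELEASING, RELEASED):
            match = pattern.search(line)
            if match and match["path"] == lock_path and match["worker"] == worker_id:
                return "self_release"
    return "external_or_takeover"


if __name__ == "__main__":
    # Made-up log excerpt for demonstration.
    lines = [
        "INFO Releasing lock path=/locks/step-a worker=w1 reason=terminal_status:SUCCESS",
        "INFO Released lock path=/locks/step-a worker=w1",
    ]
    print(classify_lease_loss(lines, "/locks/step-a", "w1"))
    print(classify_lease_loss(lines, "/locks/step-a", "w2"))
```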
