Conversation
When capture_manager is killed (SIGKILL, OOM, container restart) while captures are in-flight, the finally blocks in _capture() never run. UUIDs persist in the lacus:ongoing Redis sorted set because Valkey writes to disk and reloads on restart. These zombie UUIDs occupy concurrent_captures slots, blocking all new captures until clear_dead_captures() eventually ages them out (max_capture_time * 1.1 seconds later). Fix: flush lacus:ongoing before the event loop starts. At process init, self.captures is empty — every UUID in Redis is a zombie.
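The startup flush described above can be sketched as follows. The `lacus:ongoing` key name is real; the standalone function and its `redis_client` parameter are illustrative assumptions (the actual fix lives inside `capture_manager`):

```python
def clear_ongoing_on_startup(redis_client) -> int:
    """Flush every UUID from lacus:ongoing before the event loop starts.

    Safe at process init: self.captures is empty, so none of these UUIDs
    can belong to a live capture task; they are all zombies.
    """
    zombies = redis_client.zrange('lacus:ongoing', 0, -1)
    if zombies:
        redis_client.zrem('lacus:ongoing', *zombies)
    return len(zombies)
```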
When a Playwright subprocess hangs, task.cancel() may not propagate — the browser process ignores the CancelledError. After 5 failed cancel attempts, the code logged an error but left the UUID in lacus:ongoing, permanently blocking the slot. Fix: call clear_capture() after exhausting cancel retries to free the Redis entry even if the asyncio task is still stuck.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 83027cef9c
- Startup cleanup: iterate with `clear_capture()` per UUID instead of bulk-deleting `lacus:ongoing`. This stores a proper error result for each zombie and cleans up `capture_settings`, so clients polling a UUID across a restart get an explicit failure instead of `UNKNOWN`.
- Force-clear: also discard the task from `self.captures` so the in-memory slot is freed for the `max_new_captures` computation.
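The two suggestions, sketched together as standalone functions. The `clear_capture()` call and the `self.captures` set appear in the diff; the explicit `core`, `redis_client`, and `manager` parameters are assumptions made so the sketch is self-contained:

```python
def startup_cleanup(core, redis_client) -> None:
    # Per-UUID cleanup instead of a bulk ZREM: clear_capture() stores a
    # proper error result, so a client that polls its UUID across the
    # restart sees an explicit failure rather than UNKNOWN.
    for uuid in redis_client.zrange('lacus:ongoing', 0, -1):
        core.clear_capture(uuid, 'Cleared on startup: previous process died.')

def force_clear(manager, capture, expected_uuid: str) -> None:
    # After cancel retries are exhausted, free the Redis entry *and* the
    # in-memory slot used when computing max_new_captures.
    manager.lacus.core.clear_capture(expected_uuid, 'Force-cleared: task could not be canceled.')
    manager.captures.discard(capture)
```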
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 93bb5a3f59
```python
for uuid, _ in ongoing:
    self.lacus.core.clear_capture(uuid, 'Cleared on startup: previous process died.')
```
Clear startup zombies without clear_capture's age gate
Calling self.lacus.core.clear_capture() here does not actually clear most crash leftovers. lacuscore.LacusCore.clear_capture() returns early for any UUID that is still in lacus:ongoing and started less than max_capture_time * 1.1 ago, so after a normal crash/restart these orphaned captures are usually still considered “probably still going.” The manager logs that it cleared them, but /capture_status, /lacus_status, and /is_busy will keep reporting them as ongoing until they age past the timeout window instead of immediately getting the failure result this startup cleanup is meant to write.
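One hedged way around the age gate, assuming (as this comment describes) that `clear_capture()` only skips UUIDs that are still present in `lacus:ongoing`: remove the ongoing entry first, then store the error result. The exact lacuscore internals may differ:

```python
def clear_zombie_unconditionally(core, redis_client, uuid: str) -> None:
    # clear_capture() refuses to touch UUIDs still in lacus:ongoing that
    # started less than max_capture_time * 1.1 ago, so drop the ongoing
    # entry first; the age gate then no longer applies.
    redis_client.zrem('lacus:ongoing', uuid)
    core.clear_capture(uuid, 'Cleared on startup: previous process died.')
```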
```python
if not capture.done():
    self.logger.error(f'{expected_uuid} could not be canceled after 5 attempts, force-clearing from Redis.')
    self.lacus.core.clear_capture(expected_uuid, 'Force-cleared: task could not be canceled.')
    self.captures.discard(capture)
```
Decrement the running count when force-clearing a hung task
This path frees the in-memory slot, but it leaves the Redis running counter one too high. Each capture increments that counter when scheduled (set_running() at line 84), and the only per-task decrement is in clear_list_callback() (lines 70-72) when the task actually finishes. In the exact case handled here—a task that never responds to cancellation—that callback never fires, so a later scripts_controller.py stop|restart capture_manager can block forever waiting for zscore('running', 'capture_manager') to reach None even after the process has exited.
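A sketch of the missing decrement. The `running` key, the `capture_manager` member, and the `set_running()`/`clear_list_callback()` pairing come from this comment; the helper function and its parameters are assumptions:

```python
def force_clear_with_decrement(core, redis_client, capture, captures: set, uuid: str) -> None:
    core.clear_capture(uuid, 'Force-cleared: task could not be canceled.')
    captures.discard(capture)
    # The per-task decrement in clear_list_callback() never fires for a
    # task that ignores cancellation, so balance the set_running()
    # increment here; otherwise zscore('running', 'capture_manager')
    # never drains and scripts_controller.py stop blocks forever.
    redis_client.zincrby('running', -1, 'capture_manager')
```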
Thank you for the very detailed bug report and PR. I get your point, but simply removing zombie captures when starting `capture_manager` isn't always safe: you may decide to just start a new `capture_manager` against a Valkey instance where captures from another process are still ongoing, and they would be wiped. One solution for that would be to make sure to clear the cache when Valkey is launched: if Valkey was actually down, you're certain that no captures are ongoing somewhere else.
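That suggestion could look roughly like this: flush only when Valkey itself restarted, detected via the server `run_id` from `INFO server` (which changes on every server restart and is not persisted). The `lacus:valkey_run_id` marker key is a made-up name, and the sketch assumes a client with `decode_responses=True`:

```python
def flush_if_valkey_restarted(redis_client) -> bool:
    """Flush lacus:ongoing only when the Valkey server itself restarted.

    A changed run_id means the old server process is gone, so no other
    capture_manager can still own the reloaded ongoing entries.
    """
    current_run_id = redis_client.info('server')['run_id']
    previous = redis_client.get('lacus:valkey_run_id')  # hypothetical marker key
    if previous is not None and previous == current_run_id:
        return False  # same server instance: captures may be legitimate
    redis_client.delete('lacus:ongoing')
    redis_client.set('lacus:valkey_run_id', current_run_id)
    return True
```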
First of all, thank you very much for opening a pull request in this repository.
In order for us to review your changes prior to merging them, please make sure they follow the rules below (if relevant):
What kind of change does this PR introduce?
NOTE: if you used black or any other code formatting tool, we will only accept one change per Pull Request.
Do not attempt to "fix" everything in a single PR, as it makes the changes extremely difficult to review.
If the fix is related to the length of the lines, it will be rejected.
If you're using an AI to assist you in generating this PR, please review the changes extensively and make sure they are accurate and useful.
Once it is the case, please tick the box below.
What does it do?
Fixes zombie captures permanently blocking `concurrent_captures` slots after a container restart (or any unclean `capture_manager` shutdown).

The problem

When `capture_manager` is killed (SIGKILL, OOM, `docker restart`) while captures are in-flight, the `finally` blocks in `LacusCore._capture()` never execute. The UUIDs remain in the `lacus:ongoing` Redis sorted set because Valkey persists to disk and reloads on restart.

On the next startup, the new `capture_manager` process has an empty `self.captures` task set, but `lacus:ongoing` is full of stale UUIDs from the dead process. These zombie entries occupy `concurrent_captures` slots.

The existing `clear_dead_captures()` method does detect orphaned UUIDs (in Redis but not in the task set), but it has a safety window: it won't clear a capture if `start_time > time.time() - max_capture_time * 1.1`. During this window, zombie UUIDs block all slots and no new captures can start.

With `concurrent_captures=8` and `max_capture_time=90`, that's ~99 seconds of zero capture throughput after every restart.

Reproduction
The fix

Commit 840a018 - Startup cleanup (`_clear_ongoing_on_startup`)

When `capture_manager` starts, `self.captures` is empty: no capture task can possibly be running. Any UUID in `lacus:ongoing` is by definition a zombie. The fix flushes the sorted set before the event loop starts.

This is safe because:

- `capture_manager` is single-process: only one instance runs per container
- the `website` process (Gunicorn) only reads from `lacus:ongoing`, never writes to it
- writes to `lacus:ongoing` happen only from `LacusCore._capture()`, which runs inside `capture_manager`'s event loop

Commit 83027ce - Force-clear after failed cancellation
When a Playwright subprocess hangs and `task.cancel()` doesn't propagate (the browser process ignores the `CancelledError`), the existing code retries cancellation 5 times, then logs an error but leaves the UUID in Redis, permanently blocking the slot. The fix calls `clear_capture()` after exhausting all cancel attempts.

Testing
1. Reproduce the bug (before fix)
2. Apply the fix

```shell
# Copy patched capture_manager.py into the container:
docker cp capture_manager.py lacus:/app/lacus/bin/capture_manager.py
```

3. Inject zombies and verify cleanup
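Step 3 could be scripted roughly as below; the helper names and the `redis_client` parameter are made up, and a real run would point the client at the container's Valkey:

```python
import time

def inject_zombies(redis_client, count: int = 3) -> list[str]:
    # Simulate a crashed capture_manager: stale UUIDs in lacus:ongoing
    # scored with the current timestamp, i.e. inside the safety window.
    uuids = [f'zombie-{i}' for i in range(count)]
    redis_client.zadd('lacus:ongoing', {u: time.time() for u in uuids})
    return uuids

def cleanup_worked(redis_client) -> bool:
    # After restarting the patched capture_manager, the set must be empty.
    return redis_client.zcard('lacus:ongoing') == 0
```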