
Fix/clear zombie captures on startup#59

Open
cln-io wants to merge 3 commits into ail-project:main from cln-io:fix/clear-zombie-captures-on-startup

Conversation


@cln-io cln-io commented Mar 21, 2026

First of all, thank you very much for opening a pull request in this repository.

In order for us to review your changes prior to merging them, please make sure they follow the rules below (if relevant):

  • no reformatting changing the line length globally
  • make multiple PRs if you're changing multiple things

What kind of change does this PR introduce?

  • Bugfix
  • New Feature
  • Code style update (formatting, local variables)

NOTE: if you used black or any other code formatting tool, we will only accept one change per Pull Request.
Do not attempt to "fix" everything in a single PR, as it makes the changes extremely difficult to review.
If the fix only changes the length of the lines, it will be rejected.

  • Refactoring (no functional changes, no API changes)
  • CI-related changes
  • Documentation content changes
  • Tests
  • Other - please describe extensively what it does

If you're using an AI to assist you in generating this PR, please review the changes extensively and make sure they are accurate and useful.
Once that is the case, please tick the box below.

  • I used an AI to generate this PR and made sure I'm not wasting the time of the reviewer.

What does it do?

Fixes zombie captures permanently blocking concurrent_captures slots after a container restart (or any unclean capture_manager shutdown).

The problem

  Normal operation:
  ┌──────────────────┐         ┌───────────────────┐
  │ capture_manager  │         │  Valkey (Redis)    │
  │                  │         │                    │
  │  self.captures:  │ zadd    │  lacus:ongoing:    │
  │   {task_A,       │────────►│   uuid_A (t=100)   │
  │    task_B}       │         │   uuid_B (t=101)   │
  │                  │ zrem    │                    │
  │  finally: ───────│────────►│  (removed on done) │
  └──────────────────┘         └───────────────────┘
          ✓ in sync                  ✓ in sync

  After kill/restart:
  ┌──────────────────┐         ┌───────────────────┐
  │ capture_manager  │         │  Valkey (Redis)    │
  │  (new process)   │         │  (persisted data)  │
  │                  │         │                    │
  │  self.captures:  │         │  lacus:ongoing:    │
  │   {} (empty)     │  !=     │   uuid_A (t=100)   │  ◄── zombie
  │                  │         │   uuid_B (t=101)   │  ◄── zombie
  │                  │         │   ...              │
  │  finally: never  │         │   uuid_H (t=108)   │  ◄── zombie
  │  ran on old proc │         │                    │
  └──────────────────┘         └───────────────────┘
     0 real tasks              8 slots blocked
                               concurrent_captures = 8
                               → no new captures can start

When capture_manager is killed (SIGKILL, OOM, docker restart) while captures are in-flight, the finally blocks in LacusCore._capture() never execute. The UUIDs remain in the lacus:ongoing Redis sorted set because Valkey persists to disk and reloads on restart.

On the next startup, the new capture_manager process has an empty self.captures task set, but lacus:ongoing is full of stale UUIDs from the dead process. These zombie entries occupy concurrent_captures slots.

The existing clear_dead_captures() method does detect orphaned UUIDs (present in Redis but not in the task set), but it has a safety window: it won't clear a capture if start_time > time.time() - max_capture_time * 1.1. During this window, the zombie UUIDs block all slots and no new captures can start.

With concurrent_captures=8 and max_capture_time=90, that's ~99 seconds of zero capture throughput after every restart.
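For context, the safety window amounts to the following check (a sketch reconstructed from this PR's description, not the verbatim lacuscore source):

```python
# Sketch of the age gate described above. Entries younger than
# max_capture_time * 1.1 are treated as "probably still going" and kept,
# which is exactly why zombies survive ~99 seconds after a restart.
import time
from typing import Optional

def is_protected(start_time: float, max_capture_time: int = 90,
                 now: Optional[float] = None) -> bool:
    """True while a lacus:ongoing entry is inside the safety window."""
    if now is None:
        now = time.time()
    return start_time > now - max_capture_time * 1.1
```

With max_capture_time=90 the window is 99 seconds: a zombie written just before the restart stays protected, and blocks its slot, until it ages past that mark.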

Reproduction

# With captures actively running, restart the container:
docker restart lacus

# Immediately check Redis:
valkey-cli -s /app/lacus/cache/cache.sock ZCARD lacus:ongoing
# Returns: 8 (all zombies; no real tasks exist in the new process)

# All concurrent_captures slots are blocked until max_capture_time * 1.1 elapses.

The fix

Commit 840a018: Startup cleanup (_clear_ongoing_on_startup)

When capture_manager starts, self.captures is empty, so no capture task can possibly be running: any UUID in lacus:ongoing is by definition a zombie. The fix flushes the sorted set before the event loop starts.

This is safe because:

  • capture_manager is single-process -only one instance runs per container
  • The website process (Gunicorn) only reads from lacus:ongoing, never writes to it
  • The only writer is LacusCore._capture(), which runs inside capture_manager's event loop

Commit 83027ce: Force-clear after failed cancellation

When a Playwright subprocess hangs and task.cancel() doesn't propagate (the browser process ignores the CancelledError), the existing code retries cancellation 5 times, then logs an error but leaves the UUID in Redis, permanently blocking the slot. The fix calls clear_capture() after exhausting all cancel attempts.
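The retry-then-force-clear flow can be sketched as follows (the function name and log messages follow this PR's discussion; the actual loop in capture_manager.py is structured differently):

```python
import asyncio

async def cancel_or_force_clear(capture: asyncio.Task, uuid: str,
                                core, logger, attempts: int = 5) -> None:
    """Retry task.cancel(); if the task never dies, free the Redis slot anyway."""
    for _ in range(attempts):
        capture.cancel()
        await asyncio.sleep(0)  # the real code waits longer between retries
        if capture.done():
            return
    # The task absorbed every CancelledError (e.g. a hung Playwright
    # subprocess): clear the lacus:ongoing entry so the slot is not
    # blocked forever, even though the asyncio task is still stuck.
    logger.error(f'{uuid} could not be canceled after {attempts} attempts, '
                 'force-clearing from Redis.')
    core.clear_capture(uuid, 'Force-cleared: task could not be canceled.')
```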

Testing

1. Reproduce the bug (before fix)

# Confirm captures are actively running:
docker exec lacus valkey-cli -s /app/lacus/cache/cache.sock ZCARD lacus:ongoing
# → 8

# Restart the container (Valkey persists to disk):
docker restart lacus
sleep 15

# Check Redis: all slots blocked by zombies:
docker exec lacus valkey-cli -s /app/lacus/cache/cache.sock ZCARD lacus:ongoing
# → 8 (all zombies; the new process has zero real tasks)

# Verify these are stale UUIDs with old timestamps:
docker exec lacus valkey-cli -s /app/lacus/cache/cache.sock ZRANGE lacus:ongoing 0 -1 WITHSCORES
# → timestamps from before the restart

2. Apply the fix

# Copy patched capture_manager.py into the container:
docker cp capture_manager.py lacus:/app/lacus/bin/capture_manager.py

3. Inject zombies and verify cleanup

# Inject 8 fake zombie UUIDs into Redis:
for i in $(seq 1 8); do
  docker exec lacus valkey-cli -s /app/lacus/cache/cache.sock \
    ZADD lacus:ongoing 1774000000 "zombie-$(printf '%04d' $i)-0000-0000-000000000000"
done

# Confirm they're there:
docker exec lacus valkey-cli -s /app/lacus/cache/cache.sock ZCARD lacus:ongoing
# → 8+ (zombies + any real captures)

# Restart with patched code:
docker restart lacus
sleep 15

# Check the log -startup cleanup should fire:
docker logs lacus 2>&1 | grep "Startup cleanup"
# → "Startup cleanup: clearing 8 zombie capture(s) from lacus:ongoing"

# Confirm zero zombie UUIDs remain:
docker exec lacus valkey-cli -s /app/lacus/cache/cache.sock \
  ZRANGE lacus:ongoing 0 -1 | grep -c zombie
# → 0

# Verify fresh captures are running (new timestamps):
docker exec lacus valkey-cli -s /app/lacus/cache/cache.sock ZRANGE lacus:ongoing 0 -1 WITHSCORES
# → only UUIDs with timestamps from after the restart

cln-io added 2 commits March 21, 2026 11:11
When capture_manager is killed (SIGKILL, OOM, container restart)
while captures are in-flight, the finally blocks in _capture()
never run. UUIDs persist in the lacus:ongoing Redis sorted set
because Valkey writes to disk and reloads on restart.

These zombie UUIDs occupy concurrent_captures slots, blocking
all new captures until clear_dead_captures() eventually ages
them out (max_capture_time * 1.1 seconds later).

Fix: flush lacus:ongoing before the event loop starts. At process
init, self.captures is empty — every UUID in Redis is a zombie.

When a Playwright subprocess hangs, task.cancel() may not
propagate — the browser process ignores the CancelledError.
After 5 failed cancel attempts, the code logged an error but
left the UUID in lacus:ongoing, permanently blocking the slot.

Fix: call clear_capture() after exhausting cancel retries to
free the Redis entry even if the asyncio task is still stuck.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 83027cef9c


- Startup cleanup: iterate with clear_capture() per UUID instead
  of bulk-deleting lacus:ongoing. This stores a proper error result
  for each zombie and cleans up capture_settings, so clients polling
  a UUID across a restart get an explicit failure instead of UNKNOWN.

- Force-clear: also discard the task from self.captures so the
  in-memory slot is freed for max_new_captures computation.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 93bb5a3f59


Comment on lines +36 to +37
for uuid, _ in ongoing:
    self.lacus.core.clear_capture(uuid, 'Cleared on startup: previous process died.')


P1 Badge Clear startup zombies without clear_capture's age gate

Calling self.lacus.core.clear_capture() here does not actually clear most crash leftovers. lacuscore.LacusCore.clear_capture() returns early for any UUID that is still in lacus:ongoing and started less than max_capture_time * 1.1 ago, so after a normal crash/restart these orphaned captures are usually still considered “probably still going.” The manager logs that it cleared them, but /capture_status, /lacus_status, and /is_busy will keep reporting them as ongoing until they age past the timeout window instead of immediately getting the failure result this startup cleanup is meant to write.


Comment on lines +63 to +66
if not capture.done():
    self.logger.error(f'{expected_uuid} could not be canceled after 5 attempts, force-clearing from Redis.')
    self.lacus.core.clear_capture(expected_uuid, 'Force-cleared: task could not be canceled.')
    self.captures.discard(capture)


P2 Badge Decrement the running count when force-clearing a hung task

This path frees the in-memory slot, but it leaves the Redis running counter one too high. Each capture increments that counter when scheduled (set_running() at line 84), and the only per-task decrement is in clear_list_callback() (lines 70-72) when the task actually finishes. In the exact case handled here—a task that never responds to cancellation—that callback never fires, so a later scripts_controller.py stop|restart capture_manager can block forever waiting for zscore('running', 'capture_manager') to reach None even after the process has exited.


@Rafiot

Rafiot (Contributor) commented Mar 21, 2026

Thank you for the very detailed bug report and PR. I get your point, but simply removing zombie captures when starting capture_manager is not a good solution: on big instances, you're going to have cases where you requested a stop on capture_manager (say, you updated Playwright) and it keeps going until the captures that were started conclude (that can technically take an hour or so, depending on the timeouts, and if depth is not 0).

In that case, you may decide to just start a new capture_manager before the old one stops (i.e. you want to go for lunch, and also don't want to stop the captures for too long). If starting a new capture_manager removes the captures that are still processed, you have a problem.

One solution for that would be to make sure to clear the cache when valkey is launched: if valkey was actually down, you're certain that no captures are ongoing somewhere else.
