
Fix/clear zombie captures on startup#59

Open
cln-io wants to merge 3 commits into ail-project:main from cln-io:fix/clear-zombie-captures-on-startup

Conversation


@cln-io cln-io commented Mar 21, 2026

First of all, thank you very much for opening a pull request in this repository.

In order for us to review your changes prior to merging them, please make sure they follow the rules below (if relevant):

  • no reformatting changing the line length globally
  • make multiple PRs if you're changing multiple things

What kind of change does this PR introduce?

  • Bugfix
  • New Feature
  • Code style update (formatting, local variables)

NOTE: if you used black or any other code formatting tool, we will only accept one change per Pull Request.
Do not attempt to "fix" everything in a single PR, as it makes the changes extremely difficult to review.
If the fix only changes the length of the lines, it will be rejected.

  • Refactoring (no functional changes, no API changes)
  • CI-related changes
  • Documentation content changes
  • Tests
  • Other - please describe extensively what it does

If you're using an AI to assist you in generating this PR, please review the changes extensively and make sure they are accurate and useful.
Once that is the case, please tick the box below.

  • I used an AI to generate this PR and made sure I'm not wasting the time of the reviewer.

What does it do?

Fixes zombie captures permanently blocking concurrent_captures slots after a container restart (or any unclean capture_manager shutdown).

The problem

  Normal operation:
  ┌──────────────────┐         ┌───────────────────┐
  │ capture_manager  │         │  Valkey (Redis)    │
  │                  │         │                    │
  │  self.captures:  │ zadd    │  lacus:ongoing:    │
  │   {task_A,       │────────►│   uuid_A (t=100)   │
  │    task_B}       │         │   uuid_B (t=101)   │
  │                  │ zrem    │                    │
  │  finally: ───────│────────►│  (removed on done) │
  └──────────────────┘         └───────────────────┘
          ✓ in sync                  ✓ in sync

  After kill/restart:
  ┌──────────────────┐         ┌───────────────────┐
  │ capture_manager  │         │  Valkey (Redis)    │
  │  (new process)   │         │  (persisted data)  │
  │                  │         │                    │
  │  self.captures:  │         │  lacus:ongoing:    │
  │   {} (empty)     │  !=     │   uuid_A (t=100)   │  ◄── zombie
  │                  │         │   uuid_B (t=101)   │  ◄── zombie
  │                  │         │   ...              │
  │  finally: never  │         │   uuid_H (t=108)   │  ◄── zombie
  │  ran on old proc │         │                    │
  └──────────────────┘         └───────────────────┘
     0 real tasks              8 slots blocked
                               concurrent_captures = 8
                               → no new captures can start

When capture_manager is killed (SIGKILL, OOM, docker restart) while captures are in-flight, the finally blocks in LacusCore._capture() never execute. The UUIDs remain in the lacus:ongoing Redis sorted set because Valkey persists to disk and reloads on restart.

On the next startup, the new capture_manager process has an empty self.captures task set, but lacus:ongoing is full of stale UUIDs from the dead process. These zombie entries occupy concurrent_captures slots.

The existing clear_dead_captures() method does detect orphaned UUIDs (present in Redis but not in the task set), but it has a safety window: it won't clear a capture if start_time > time.time() - max_capture_time * 1.1. During this window, the zombie UUIDs block all slots and no new captures can start.

With concurrent_captures=8 and max_capture_time=90, that's ~99 seconds of zero capture throughput after every restart.
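For context, the safety window amounts to the following check (a sketch reconstructed from this PR's description, not the verbatim lacuscore source):

```python
# Sketch of the age gate described above. Entries younger than
# max_capture_time * 1.1 are treated as "probably still going" and kept,
# which is exactly why zombies survive ~99 seconds after a restart.
import time
from typing import Optional

def is_protected(start_time: float, max_capture_time: int = 90,
                 now: Optional[float] = None) -> bool:
    """True while a lacus:ongoing entry is inside the safety window."""
    if now is None:
        now = time.time()
    return start_time > now - max_capture_time * 1.1
```

With max_capture_time=90 the window is 99 seconds: a zombie written just before the restart stays protected, and blocks its slot, until it ages past that mark.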

Reproduction

# With captures actively running, restart the container:
docker restart lacus

# Immediately check Redis:
valkey-cli -s /app/lacus/cache/cache.sock ZCARD lacus:ongoing
# Returns: 8 (all zombies; no real tasks exist in the new process)

# All concurrent_captures slots are blocked until max_capture_time * 1.1 elapses.

The fix

Commit 840a018: Startup cleanup (_clear_ongoing_on_startup)

When capture_manager starts, self.captures is empty, so no capture task can possibly be running: any UUID in lacus:ongoing is by definition a zombie. The fix flushes the sorted set before the event loop starts.

This is safe because:

  • capture_manager is single-process -only one instance runs per container
  • The website process (Gunicorn) only reads from lacus:ongoing, never writes to it
  • The only writer is LacusCore._capture(), which runs inside capture_manager's event loop

Commit 83027ce: Force-clear after failed cancellation

When a Playwright subprocess hangs and task.cancel() doesn't propagate (the browser process ignores the CancelledError), the existing code retries cancellation 5 times, then logs an error but leaves the UUID in Redis, permanently blocking the slot. The fix calls clear_capture() after exhausting all cancel attempts.
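The retry-then-force-clear flow can be sketched as follows (the function name and log messages follow this PR's discussion; the actual loop in capture_manager.py is structured differently):

```python
import asyncio

async def cancel_or_force_clear(capture: asyncio.Task, uuid: str,
                                core, logger, attempts: int = 5) -> None:
    """Retry task.cancel(); if the task never dies, free the Redis slot anyway."""
    for _ in range(attempts):
        capture.cancel()
        await asyncio.sleep(0)  # the real code waits longer between retries
        if capture.done():
            return
    # The task absorbed every CancelledError (e.g. a hung Playwright
    # subprocess): clear the lacus:ongoing entry so the slot is not
    # blocked forever, even though the asyncio task is still stuck.
    logger.error(f'{uuid} could not be canceled after {attempts} attempts, '
                 'force-clearing from Redis.')
    core.clear_capture(uuid, 'Force-cleared: task could not be canceled.')
```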

Testing

1. Reproduce the bug (before fix)

# Confirm captures are actively running:
docker exec lacus valkey-cli -s /app/lacus/cache/cache.sock ZCARD lacus:ongoing
# → 8

# Restart the container (Valkey persists to disk):
docker restart lacus
sleep 15

# Check Redis: all slots blocked by zombies:
docker exec lacus valkey-cli -s /app/lacus/cache/cache.sock ZCARD lacus:ongoing
# → 8 (all zombies; the new process has zero real tasks)

# Verify these are stale UUIDs with old timestamps:
docker exec lacus valkey-cli -s /app/lacus/cache/cache.sock ZRANGE lacus:ongoing 0 -1 WITHSCORES
# → timestamps from before the restart

2. Apply the fix

# Copy patched capture_manager.py into the container:
docker cp capture_manager.py lacus:/app/lacus/bin/capture_manager.py

3. Inject zombies and verify cleanup

# Inject 8 fake zombie UUIDs into Redis:
for i in $(seq 1 8); do
  docker exec lacus valkey-cli -s /app/lacus/cache/cache.sock \
    ZADD lacus:ongoing 1774000000 "zombie-$(printf '%04d' $i)-0000-0000-000000000000"
done

# Confirm they're there:
docker exec lacus valkey-cli -s /app/lacus/cache/cache.sock ZCARD lacus:ongoing
# → 8+ (zombies + any real captures)

# Restart with patched code:
docker restart lacus
sleep 15

# Check the log -startup cleanup should fire:
docker logs lacus 2>&1 | grep "Startup cleanup"
# → "Startup cleanup: clearing 8 zombie capture(s) from lacus:ongoing"

# Confirm zero zombie UUIDs remain:
docker exec lacus valkey-cli -s /app/lacus/cache/cache.sock \
  ZRANGE lacus:ongoing 0 -1 | grep -c zombie
# → 0

# Verify fresh captures are running (new timestamps):
docker exec lacus valkey-cli -s /app/lacus/cache/cache.sock ZRANGE lacus:ongoing 0 -1 WITHSCORES
# → only UUIDs with timestamps from after the restart

cln-io added 2 commits March 21, 2026 11:11
When capture_manager is killed (SIGKILL, OOM, container restart)
while captures are in-flight, the finally blocks in _capture()
never run. UUIDs persist in the lacus:ongoing Redis sorted set
because Valkey writes to disk and reloads on restart.

These zombie UUIDs occupy concurrent_captures slots, blocking
all new captures until clear_dead_captures() eventually ages
them out (max_capture_time * 1.1 seconds later).

Fix: flush lacus:ongoing before the event loop starts. At process
init, self.captures is empty — every UUID in Redis is a zombie.

When a Playwright subprocess hangs, task.cancel() may not
propagate — the browser process ignores the CancelledError.
After 5 failed cancel attempts, the code logged an error but
left the UUID in lacus:ongoing, permanently blocking the slot.

Fix: call clear_capture() after exhausting cancel retries to
free the Redis entry even if the asyncio task is still stuck.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 83027cef9c


- Startup cleanup: iterate with clear_capture() per UUID instead
  of bulk-deleting lacus:ongoing. This stores a proper error result
  for each zombie and cleans up capture_settings, so clients polling
  a UUID across a restart get an explicit failure instead of UNKNOWN.

- Force-clear: also discard the task from self.captures so the
  in-memory slot is freed for max_new_captures computation.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 93bb5a3f59


Comment on lines +36 to +37
for uuid, _ in ongoing:
    self.lacus.core.clear_capture(uuid, 'Cleared on startup: previous process died.')


P1 Badge Clear startup zombies without clear_capture's age gate

Calling self.lacus.core.clear_capture() here does not actually clear most crash leftovers. lacuscore.LacusCore.clear_capture() returns early for any UUID that is still in lacus:ongoing and started less than max_capture_time * 1.1 ago, so after a normal crash/restart these orphaned captures are usually still considered “probably still going.” The manager logs that it cleared them, but /capture_status, /lacus_status, and /is_busy will keep reporting them as ongoing until they age past the timeout window instead of immediately getting the failure result this startup cleanup is meant to write.


Comment on lines +63 to +66
if not capture.done():
    self.logger.error(f'{expected_uuid} could not be canceled after 5 attempts, force-clearing from Redis.')
    self.lacus.core.clear_capture(expected_uuid, 'Force-cleared: task could not be canceled.')
    self.captures.discard(capture)


P2 Badge Decrement the running count when force-clearing a hung task

This path frees the in-memory slot, but it leaves the Redis running counter one too high. Each capture increments that counter when scheduled (set_running() at line 84), and the only per-task decrement is in clear_list_callback() (lines 70-72) when the task actually finishes. In the exact case handled here—a task that never responds to cancellation—that callback never fires, so a later scripts_controller.py stop|restart capture_manager can block forever waiting for zscore('running', 'capture_manager') to reach None even after the process has exited.


@Rafiot

Rafiot (Contributor) commented Mar 21, 2026

Thank you for the very detailed bug report and PR. I get your point, but simply removing zombie captures when starting capture_manager is not a good solution: on big instances, you're going to have cases where you requested a stop on capture_manager (say, you updated Playwright) and it keeps going until the captures that were started conclude (that can technically take an hour or so, depending on the timeouts, and if depth is not 0).

In that case, you may decide to just start a new capture_manager before the old one stops (i.e. you want to go for lunch, and also don't want to stop the captures for too long). If starting a new capture_manager removes the captures that are still processed, you have a problem.

One solution for that would be to make sure to clear the cache when valkey is launched: if valkey was actually down, you're certain that no captures are ongoing somewhere else.
