fix: disable PR_SET_PDEATHSIG (kernel binds it to parent thread under Flask threaded=True) #145

Merged: NotPunchnox merged 1 commit into NotPunchnox:main from iiiokojiadbi:fix/pr-set-pdeathsig-flask-threaded on Apr 16, 2026
@iiiokojiadbi

Contributor

Problem

After PR #144 was merged, the worker subprocess dies after every request to /api/embed, so the next request hangs for close to a minute (about 58s in the reproducer below) and returns HTTP 500. This is the observable cause of issue #117 ("sequential embedding requests return invalid/zeroed vectors on RK3588").

Reproducer (5 sequential embed requests on a clean image):

req 1: 200 1.78s   ← cold load
req 2: 500 57.90s  ← hangs, then fails
req 3: 200 1.70s   ← new worker
req 4: 500 58.27s  ← hangs, then fails
req 5: 200 1.72s
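
A reproducer of this shape could be scripted roughly as follows. This is only a sketch: the helper names (post_embed, run_sequential), the port, the model name, and the {"model", "input"} request body are assumptions for illustration, not taken from the rkllama code base.

```python
import json
import time
import urllib.error
import urllib.request


def post_embed(url, model, text, timeout=120):
    """POST one embed request and return the HTTP status code.

    The JSON body shape is an assumption (Ollama-style embed request).
    """
    body = json.dumps({"model": model, "input": text}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; report their code instead.
        return err.code


def run_sequential(send, count=5):
    """Call send() count times, one after another, timing each call."""
    results = []
    for i in range(1, count + 1):
        start = time.monotonic()
        status = send()
        elapsed = time.monotonic() - start
        results.append((i, status, elapsed))
        print(f"req {i}: {status} {elapsed:.2f}s")
    return results


# Example call (URL and model name are placeholders):
# run_sequential(lambda: post_embed("http://localhost:8080/api/embed",
#                                   "my-embed-model", "hello"))
```

Passing the request as a callable keeps the timing loop independent of the HTTP details, so the same loop can be pointed at a different endpoint or payload.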

The logs show the familiar cascade:

POST /api/embed HTTP/1.1" 200 -
Received signal 15, stopping all workers...
(30s later) Worker for model 'X' died unexpectedly (exitcode=0); cleaning up stale entry.
POST /api/embed HTTP/1.1" 500 -

Root cause

_set_parent_death_signal() (added in #144) calls prctl(PR_SET_PDEATHSIG, SIGTERM) in each forked worker so the worker dies if its parent dies. Good intent.

Problem: on Linux, PR_SET_PDEATHSIG is bound to the thread that forked the child, not the whole parent process. Quoting man 2 prctl:

Warning: the "parent" in this case is considered to be the thread that created this process. In other words, the signal will be sent when that thread terminates (via, for example, pthread_exit(3)), rather than after all threads in the parent process terminate.

rkllama_server runs Flask with threaded=True, so every HTTP request is handled on its own short-lived thread. Worker.create_worker_process() calls Process.start() from that request thread, so the kernel binds the death signal to the request thread, not to the main process.

As soon as the HTTP response returns and the request thread exits, the kernel delivers SIGTERM to the worker. The worker's inherited _handle_shutdown_signal runs, calls stop_all() + sys.exit(0), and the worker dies right after serving a single request. The main process then observes the "unexpected" worker death 30s later (after stop_worker's join timeout), and the next request has to start a new worker.
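
The thread binding can be observed in isolation with a small Linux-only experiment. The sketch below is not rkllama code: the function name is made up, it uses os.fork() directly instead of multiprocessing, and it calls prctl through ctypes. A helper thread stands in for the Flask request thread, and the child stands in for the worker.

```python
import ctypes
import os
import signal
import threading
import time

PR_SET_PDEATHSIG = 1  # constant from <linux/prctl.h>


def fork_from_short_lived_thread():
    """Fork a child from a helper thread, let that thread exit, and return
    the child's wait status while the parent process itself stays alive."""
    libc = ctypes.CDLL(None, use_errno=True)
    ready_r, ready_w = os.pipe()
    child = {}

    def request_thread():
        pid = os.fork()
        if pid == 0:
            # Child: ask the kernel for SIGTERM when the "parent" goes away.
            libc.prctl(PR_SET_PDEATHSIG, int(signal.SIGTERM), 0, 0, 0)
            os.write(ready_w, b"x")  # signal that prctl is in place
            time.sleep(10)           # stand-in for a worker serving requests
            os._exit(0)              # only reached if no signal arrives
        child["pid"] = pid
        os.read(ready_r, 1)          # wait until the child has called prctl
        # Returning ends this thread, like a finished HTTP request.

    t = threading.Thread(target=request_thread)
    t.start()
    t.join()

    # The parent process is still running here; only the helper thread is gone.
    _, status = os.waitpid(child["pid"], 0)
    os.close(ready_r)
    os.close(ready_w)
    return status
```

On Linux, os.WIFSIGNALED(status) is expected to be true and os.WTERMSIG(status) to equal signal.SIGTERM for the returned status, even though the parent process never exited. If the signal were bound to the process rather than the thread, the child would instead sleep for the full 10 seconds and exit with code 0.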

Diagnosed via strace -p 1 -e signal:

<worker_pid>    --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---

Confirmed by an A/B test on a clean image built from this branch: reverting this patch reproduces the bug on every even-numbered embed; reapplying the patch → 10/10 sequential embeds return 200 in 115–165ms each.

Fix

Turn _set_parent_death_signal() into a documented no-op. Orphan-worker protection continues to work via _kill_orphaned_workers() at startup, which is a more reliable mechanism for the Flask threaded model — it scans ppid == 1 processes with rkllama_server in their cmdline on boot. In a Docker deployment this is also redundant: PID 1 dying kills the whole container namespace.
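
For readers unfamiliar with this kind of startup scan, a minimal sketch of the idea follows. It is an illustration under assumptions, not the actual _kill_orphaned_workers() implementation: the function name, its parameters, and the decision to only return PIDs (rather than kill them) are hypothetical.

```python
import os


def find_orphaned_workers(marker="rkllama_server", proc_root="/proc", self_pid=None):
    """Return PIDs whose parent is PID 1 and whose cmdline contains marker."""
    self_pid = os.getpid() if self_pid is None else self_pid
    orphans = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        pid = int(entry)
        if pid == self_pid:
            continue  # never report the scanning process itself
        try:
            with open(os.path.join(proc_root, entry, "stat"), "rb") as f:
                stat = f.read().decode("utf-8", "replace")
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode("utf-8", "replace")
        except OSError:
            continue  # process exited while we were scanning
        # stat looks like "pid (comm) state ppid ..."; comm may contain spaces
        # or parentheses, so split on the last ')' before reading fields.
        fields = stat.rsplit(")", 1)[-1].split()
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        if int(fields[1]) == 1 and marker in cmdline:
            orphans.append(pid)
    return sorted(orphans)
```

A caller would typically send SIGTERM (and later SIGKILL) to each returned PID before starting new workers. Taking proc_root as a parameter is a design choice for testability: the scan can be exercised against a fake directory tree.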

Trade-off: on a native (non-Docker) install, if the main process crashes ungracefully (SIGKILL, segfault, OOM), worker subprocesses become orphans with NPU memory still allocated, and that memory stays busy until the next rkllama start (when _kill_orphaned_workers() cleans them up). That's an acceptable gap given the alternative is "workers die after every HTTP request".

Validation

Clean image built from this branch, full test matrix (on Orange Pi 5 Plus, RK3588):

| Test | Result |
| --- | --- |
| Sequential embed ×10 (primary bug) | 1.77s cold, 115–180ms hot, all 200 |
| Batch embed (5 inputs in one request) | 5 vectors in 693ms |
| Chat non-streaming (Qwen3-0.6B) | 9 tokens, 28.5s, coherent response |
| Chat streaming | 100 chunks in 7.5s |
| Embed → chat → embed (mixed workflow) | no stale state, all 200 |
| Logs: "died unexpectedly" | 0 |
| Logs: "Received signal" | 0 |

Known separate issues not addressed here (exist on baseline too, likely covered by upcoming #139):

  • Concurrent embed requests on a single model (global lock → OSError: handle is closed).
  • /api/chat on an embed model (separate pipe lifecycle problem).

Related


Co-Authored-By: Claude <noreply@anthropic.com>

…process)

PR_SET_PDEATHSIG is bound to the *thread* that forked the child, not to the
parent process (man 2 prctl: "the 'parent' in this case is considered to be
the thread that created this process").

rkllama_server runs Flask with threaded=True, so Process.start() for a worker
is executed from a short-lived request-handler thread. As soon as the request
finishes and its thread exits, the kernel delivers SIGTERM to the worker, the
inherited shutdown handler cascades into stop_all() / sys.exit(0), and the
worker dies after serving a single request. The next /api/embed hits the
dying worker, waits the 30s stop_worker timeout, and returns 500.

Turn _set_parent_death_signal() into a documented no-op. Orphan-worker
protection continues to work via _kill_orphaned_workers() at startup.

Fixes NotPunchnox#117.

Co-Authored-By: Claude <noreply@anthropic.com>
@jaylfc

jaylfc commented Apr 16, 2026

Contributor

Great catch, and apologies for the oversight in #144. The intent was to clean up orphaned workers on parent exit, but I missed the crucial detail in man 2 prctl that PR_SET_PDEATHSIG binds to the thread rather than the process — which under Flask's threaded=True model means the death signal fires the moment the request thread exits, not when the server actually goes down.

Your diagnosis is thorough and the fix is the right call. The orphan-cleanup gap you've documented (SIGKILL leaving NPU memory allocated until next start) is an acceptable trade-off, and _kill_orphaned_workers() at startup covers the realistic failure case.

I've tested the same alternating 200/500 pattern on RK3588 — this fix resolves it cleanly. Thanks for taking the time to properly root-cause and validate it.

@NotPunchnox NotPunchnox merged commit 1836cf4 into NotPunchnox:main Apr 16, 2026
1 of 2 checks passed
Development

Successfully merging this pull request may close these issues.

Sequential embedding requests return invalid/zeroed vectors on RK3588 (NPU)

3 participants