Skip to content

feat: cross-platform force-kill primitive for stuck PHP threads#2365

Open
nicolas-grekas wants to merge 1 commit intophp:mainfrom
nicolas-grekas:force-kill-primitives
Open

feat: cross-platform force-kill primitive for stuck PHP threads#2365
nicolas-grekas wants to merge 1 commit intophp:mainfrom
nicolas-grekas:force-kill-primitives

Conversation

@nicolas-grekas
Copy link
Copy Markdown
Contributor

@nicolas-grekas nicolas-grekas commented Apr 22, 2026

First step of the split suggested in #2287: land the force-kill
infrastructure as a standalone, reviewable primitive independent of
background workers.

Design

Each PHP thread, at boot from its own TSRM context, hands a
force_kill_slot (pointers to its EG(vm_interrupt) and EG(timed_out)
atomic bools, plus pthread_t / Windows HANDLE) back to Go via
go_frankenphp_store_force_kill_slot. The slot lives on phpThread
and is protected by a per-thread RWMutex so the zero-and-release path
at thread exit cannot race an in-flight kill. From any goroutine, Go
passes the slot back to frankenphp_force_kill_thread, which stores
true into both atomic bools (waking the VM at the next opcode
boundary, routing through zend_timeout -> "Maximum execution time
exceeded") and delivers a platform-specific wake-up:

  • Linux/FreeBSD: pthread_kill(SIGRTMIN+3) with a no-op handler
    installed once via pthread_once, SA_ONSTACK, no SA_RESTART.
    Signal delivery returns any in-flight blocking syscall with EINTR.
  • Windows: CancelSynchronousIo + QueueUserAPC covers alertable
    I/O and SleepEx. Non-alertable Sleep (including PHP's usleep)
    stays uninterruptible.
  • macOS: atomic-bool path only; threads stuck in blocking syscalls
    wait for the syscall to complete naturally.

Reserved signal: SIGRTMIN+3. A PHP script that calls
pcntl_signal(SIGRTMIN+3, ...) clobbers this. Embedders whose own Go
code uses SIGRTMIN+3 must patch it here. glibc NPTL reserves
SIGRTMIN..SIGRTMIN+2, so the offset cannot go lower.

Drain integration

drainWorkerThreads waits drainGracePeriod (30s) for each thread to
reach Yielding, then arms force-kill on stragglers and keeps
waiting
until they yield. phpThread.shutdown does the same. There
is no abandon path: if a thread is stuck in a syscall force-kill cannot
interrupt (macOS, Windows non-alertable Sleep), the drain blocks until
the syscall returns naturally — matching pre-patch behaviour exactly,
just typically much faster because force-kill cuts a sleep(60) down
to milliseconds. Operators that want a harder bound rely on their
orchestrator (systemd, k8s, supervisord) to SIGKILL the process.

go_frankenphp_on_thread_shutdown runs on both the healthy path and
the unhealthy-during-Shutdown path so state.Done is set even when
force-kill bails the thread. Without it, phpThread.shutdown's
WaitFor(state.Done) would never unblock.

Testing

TestRestartWorkersForceKillsStuckThread drives the full path via a
marker file so RestartWorkers only arms once the worker is proven
parked in sleep(), then asserts bounded elapsed time and that the
post-sleep echo never runs.

Comment thread frankenphp.c Outdated
Copy link
Copy Markdown
Member

@withinboredom withinboredom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good. FYI: bugs like php/php-src#21267 mean that sometimes JIT just will never hit these vm breakpoints. So, be prepared for bug reports that aren't related to this change, but are due to JIT.

Comment thread worker_test.go
Comment thread worker.go Outdated
Comment thread frankenphp.c Outdated
Comment thread frankenphp.c
Comment thread frankenphp.c
@dunglas
Copy link
Copy Markdown
Member

dunglas commented Apr 24, 2026

Good work! Shouldn't this code be directly on php-src (TSRM)? It could be useful in other contexts than FrankenPHP.

@nicolas-grekas
Copy link
Copy Markdown
Contributor Author

@dunglas dunno in the future, but at the moment, this allows providing the capablity to all versions of PHP, without having to wait for an hypothetical merge in 8.6+.

Comment thread phpthread.go Outdated
@henderkes
Copy link
Copy Markdown
Contributor

@dunglas dunno in the future, but at the moment, this allows providing the capablity to all versions of PHP, without having to wait for an hypothetical merge in 8.6+.

The two don't exclude each other, it would actually help getting this upstreamed because it will open the doors to more calls being switched to alertable. Once that lands in master, we can backport it in FrankenPHP for 8.4+.

@nicolas-grekas
Copy link
Copy Markdown
Contributor Author

nicolas-grekas commented Apr 24, 2026

PR should be ready! It took me a while to get it correct. PR description is updated. Lots of comments in the patch; let me know if that's too much.

@nicolas-grekas
Copy link
Copy Markdown
Contributor Author

PR ready twice 😅
I removed the "abandon" trial, that's a dead end. This is now really just a best effort and we wait for threads to stop on their own after signaling them (like before).

Comment thread frankenphp.c Outdated
Comment thread frankenphp.c Outdated
@AlliBalliBaba
Copy link
Copy Markdown
Contributor

AlliBalliBaba commented Apr 30, 2026

Looks largely good to me, have you also tried that this unblocks actual syscalls like system('sleep 45'); If this works, we'll probably also use the mechanism for a custom request timeout.

It's probably good to have this on restarts, but on general shutdown people might want to have a longer grace period. Eg you might want to have some worker cleanup happening on shutdown.

@nicolas-grekas
Copy link
Copy Markdown
Contributor Author

COol :)

system('sleep 45');

I didn't try but this should work; it's all syscall-based.

For the longer grace period on shutdown vs restart:

The worker side can't tell the two modes apart though, can it? frankenphp_handle_request() returns false on both restart and shutdown, so any worker cleanup logic runs identically in both cases. I bumped the grace period from 5s to 30s so cleanup has more headroom in both paths. Let me know if that's good enough for you.

🚀

Copy link
Copy Markdown
Member

@alexandre-daubois alexandre-daubois left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces a cross-platform “force-kill” primitive for PHP threads that are stuck in blocking operations, and wires it into worker draining/restarts so graceful shutdown/restart doesn’t hang on long sleep()/I/O.

Changes:

  • Add a per-PHP-thread force_kill_slot (captured from the PHP thread’s TSRM context) plus C helpers to trigger VM interrupts and attempt syscall wake-ups (POSIX signal / Windows APC+CancelSynchronousIo).
  • Integrate a drain grace period that arms force-kill for straggler worker threads during DrainWorkers, RestartWorkers, and per-thread shutdown.
  • Add an integration test and worker script that reproduces a stuck sleep() and asserts RestartWorkers() completes within a bounded time.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
worker.go Adds drainGracePeriod and force-kill arming during worker draining/restart.
worker_test.go Adds a cross-platform drain/restart test that exercises force-kill against sleep().
testdata/worker-sleep.php New worker script used by the force-kill drain test (touch marker + sleep(60)).
phpthread.go Stores per-thread force-kill slots and arms force-kill during thread shutdown after a grace period.
phpmainthread.go Makes drainPHPThreads() idempotent to avoid double-closing the main thread done channel.
frankenphp.h Declares force_kill_slot and force-kill C API, plus kill-signal feature macros.
frankenphp.c Implements force-kill primitive, installs signal handler once, publishes/clears slots around TSRM teardown.
export_test.go Adds a test-only setter to override drainGracePeriod for faster tests.
Comments suppressed due to low confidence (1)

frankenphp.c:1279

  • PR description states go_frankenphp_on_thread_shutdown runs on both the healthy path and the unhealthy-during-shutdown path to ensure Go waiters unblock. In this code, go_frankenphp_on_thread_shutdown() is still only called when thread_is_healthy, and the unhealthy path always spawns a replacement thread instead. Please either update the PR description or adjust the unhealthy path to signal Go appropriately during shutdown (and avoid relying on successfully spawning a replacement thread to unblock shutdown).
  /* Thread is healthy, signal to Go that the thread has shut down */
  if (thread_is_healthy) {
    go_frankenphp_on_thread_shutdown(thread_index);
    return NULL;
  }

  frankenphp_log_message("Restarting unhealthy thread", LOG_WARNING);

  if (!frankenphp_new_php_thread(thread_index)) {
    /* probably unreachable */
    frankenphp_log_message("Failed to restart an unhealthy thread", LOG_ERR);
  }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread worker_test.go
Comment on lines +64 to +68
cwd, _ := os.Getwd()
testDataDir := cwd + "/testdata/"

require.NoError(t, frankenphp.Init(
frankenphp.WithWorkers("sleep-worker", testDataDir+"worker-sleep.php", 1),
Comment thread worker.go
Comment on lines +168 to +173
// drainGracePeriod: time to wait for threads to yield before arming force-kill.
var drainGracePeriod = 30 * time.Second

// EXPERIMENTAL: DrainWorkers initiates a graceful drain of all worker scripts.
// Blocks until every drained thread yields. Force-kill is armed after a
// grace period to wake threads parked in blocking syscalls (sleep, I/O).
Comment thread frankenphp.h
Comment on lines +54 to +59
/* Platform capabilities for the force-kill primitive; declared in the
* header so Go (via CGo) gets the correct struct layout too. */
#if !defined(PHP_WIN32) && defined(SIGRTMIN)
#define FRANKENPHP_HAS_KILL_SIGNAL 1
#define FRANKENPHP_KILL_SIGNAL (SIGRTMIN + 3)
#endif
Comment thread frankenphp.c
Comment on lines +122 to +132
static void install_kill_signal_handler(void) {
/* No SA_RESTART so syscalls return EINTR rather than being restarted.
* SA_ONSTACK guards against an accidental process-level delivery to a
* Go-managed thread, where Go requires the alternate signal stack. */
struct sigaction sa;
memset(&sa, 0, sizeof(sa));
sa.sa_handler = frankenphp_kill_signal_handler;
sigemptyset(&sa.sa_mask);
sa.sa_flags = SA_ONSTACK;
sigaction(FRANKENPHP_KILL_SIGNAL, &sa, NULL);
}
Comment thread export_test.go
// SetDrainGracePeriodForTest overrides drainGracePeriod for tests that need
// to exercise the force-kill path without paying the production wait.
// Returns the previous value so tests can restore it.
func SetDrainGracePeriodForTest(d time.Duration) time.Duration {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lighter/cleaner option would be to just move TestRestartWorkersForceKillsStuckThread in a dedicated test file in the frankenphp package, changing this variable directly.

Comment thread frankenphp.c
@@ -1162,12 +1268,9 @@
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: actually, we should revert this and add blank lines before other returns in a WSL-like style: https://github.com/bombsimon/wsl

Comment thread worker.go
// Blocks until every drained thread yields. Force-kill is armed after a
// grace period to wake threads parked in blocking syscalls (sleep, I/O).
func DrainWorkers() {
_ = drainWorkerThreads()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To keep, to prevent some lint errors

Introduces a self-contained primitive that wakes a PHP thread parked in
a blocking call (sleep, synchronous I/O, etc.) so the graceful drain
used by RestartWorkers / DrainWorkers / Shutdown completes promptly
instead of waiting for the syscall to return naturally.

Design: each PHP thread, at boot from its own TSRM context, hands a
force_kill_slot (pointers to its EG(vm_interrupt) and EG(timed_out)
atomic bools, plus pthread_t / Windows HANDLE) back to Go via
go_frankenphp_store_force_kill_slot. The slot lives on phpThread and is
protected by a per-thread RWMutex so the zero-and-release path at
thread exit cannot race an in-flight kill. From any goroutine, Go
passes the slot back to frankenphp_force_kill_thread, which stores
true into both bools (waking the VM at the next opcode boundary,
routing through zend_timeout -> "Maximum execution time exceeded") and
delivers a platform-specific wake-up:

- Linux/FreeBSD: pthread_kill(SIGRTMIN+3) with a no-op handler installed
  via pthread_once, SA_ONSTACK, no SA_RESTART. Signal delivery causes
  the in-flight blocking syscall to return EINTR.
- Windows: CancelSynchronousIo + QueueUserAPC covers alertable I/O and
  SleepEx. Non-alertable Sleep (including PHP's usleep) stays
  uninterruptible.
- macOS: atomic-bool-only path. Threads stuck in blocking syscalls wait
  for the syscall to complete naturally.

Reserved signal: SIGRTMIN+3. PHP's pcntl_signal(SIGRTMIN+3, ...)
clobbers it; embedders whose own Go code uses that signal must patch
the constant. glibc NPTL reserves SIGRTMIN..SIGRTMIN+2.

Drain integration: drainWorkerThreads waits drainGracePeriod (5s) for
each thread to reach Yielding, then arms force-kill on stragglers and
keeps waiting until they yield. phpThread.shutdown does the same.
There is no abandon path: if a thread is stuck in a syscall force-kill
cannot interrupt (macOS, Windows non-alertable Sleep) the drain blocks
until the syscall returns naturally - matching pre-patch behaviour
exactly, just typically much faster because force-kill cuts a 60s
sleep down to milliseconds. Operators that want a harder bound rely on
their orchestrator (systemd, k8s, supervisord) to SIGKILL the process.

worker_test.go + testdata/worker-sleep.php exercise the full path:
the test marks a file before sleep(60), polls until the worker is
proven parked, then asserts RestartWorkers completes within the grace
period and that the post-sleep echo never runs (which would mean the
VM interrupt was never observed).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants