Allow configuring the cluster worker ping timeout#23

Open
odahcam wants to merge 1 commit intoamphp:2.xfrom
odahcam:feature/configurable-ping-timeout

@odahcam odahcam commented May 7, 2026

Problem

ContextClusterWorker (the watcher's per-worker bookkeeping object) hard-codes a 10-second ping timeout:

```php
// src/Internal/ContextClusterWorker.php
final class ContextClusterWorker extends AbstractLogger implements ClusterWorker
{
    private const PING_TIMEOUT = 10;
    // ...
    public function run(): void
    {
        // The liveness check fires every PING_TIMEOUT / 2 seconds.
        $watcher = EventLoop::repeat(self::PING_TIMEOUT / 2, weakClosure(function (): void {
            // No activity within the timeout window: treat the worker as dead.
            if ($this->lastActivity < \time() - self::PING_TIMEOUT) {
                $this->close();
                return;
            }
            // ...
        }));
    }
}
```

This works well when worker handlers are non-blocking. In real-world PHP applications, however, parts of the request lifecycle are synchronous and blocking:

  • PDO drivers (`pdo_mysql`, `pdo_pgsql`, …) — they block the event loop until the database responds.
  • `file_get_contents()` against remote URLs.
  • Synchronous Redis clients (e.g. Predis without the async loop adapter).
  • CPU-bound work (image processing, PDF rendering).

While such code is blocking, the worker process cannot service the ping, so its `lastActivity` is not refreshed. Within ~10s the watcher declares the worker dead and closes its context. From the operator's point of view, the symptom is `Worker N died unexpectedly: The context stopped responding` even though the worker was making progress on a single (slow but legitimate) request.
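
The failure mode can be reproduced without amphp at all. The following is a minimal sketch, not the library's code: `isWorkerStale()` is a hypothetical helper that mirrors the comparison in `run()` above, and the timestamps are made up for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper mirroring the watcher's comparison: a worker is
 * considered stale once its last activity is older than the timeout.
 */
function isWorkerStale(int $lastActivity, int $now, int $pingTimeout): bool
{
    return $lastActivity < $now - $pingTimeout;
}

// Illustrative timeline: the worker answered its last ping at t=100,
// then blocked in a 20-second synchronous query.
$lastActivity = 100;
$now = 120;

var_dump(isWorkerStale($lastActivity, $now, 10)); // bool(true): closed mid-request
var_dump(isWorkerStale($lastActivity, $now, 30)); // bool(false): allowed to finish
```

With the 10-second value the worker is closed although it would have answered the next ping as soon as the query returned. A timeout longer than the blocked window avoids that.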

This forces applications that have legitimate >10s blocking work to either (a) avoid `amphp/cluster` entirely, (b) split that work into async/queued jobs (a substantial refactor), or (c) vendor-patch `PING_TIMEOUT`. Option (c) is what real users end up doing.

Proposed change

Make the ping timeout a configurable parameter on `ClusterWatcher`, threaded through to the internal `ContextClusterWorker`. Default value stays `10` (no behaviour change for existing users).

Public API

```php
$watcher = new ClusterWatcher(
    script: __DIR__ . '/server.php',
    logger: $logger,
    workerPingTimeout: 45, // accommodate legitimate long-blocking work
);
```

Internal change

The hard-coded `private const PING_TIMEOUT` becomes `public const DEFAULT_PING_TIMEOUT` so it remains the single source of truth and is referenceable from `ClusterWatcher`'s constructor signature. `ContextClusterWorker` accepts an optional `int $pingTimeout` constructor parameter and uses it in `EventLoop::repeat()` and the activity comparison.
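
The plumbing could look roughly like the sketch below. It is not the actual diff: `WorkerSketch` and `WatcherSketch` are hypothetical stand-ins for `ContextClusterWorker` and `ClusterWatcher`, and `getCheckInterval()` and `createWorker()` exist only to make the example self-contained. What it shows is that a public constant can serve as the default in another class's constructor signature.

```php
<?php
declare(strict_types=1);

// Stand-in for the internal worker class; only the timeout plumbing is shown.
final class WorkerSketch
{
    // Public so the watcher's constructor signature can reference it.
    public const DEFAULT_PING_TIMEOUT = 10;

    public function __construct(
        private readonly int $pingTimeout = self::DEFAULT_PING_TIMEOUT,
    ) {
    }

    public function getPingTimeout(): int
    {
        return $this->pingTimeout;
    }

    /** Interval (in seconds) that would be handed to EventLoop::repeat(). */
    public function getCheckInterval(): float
    {
        return $this->pingTimeout / 2;
    }
}

// Stand-in for the watcher: its default comes from the worker's constant,
// so the value 10 is written down in exactly one place.
final class WatcherSketch
{
    public function __construct(
        private readonly int $workerPingTimeout = WorkerSketch::DEFAULT_PING_TIMEOUT,
    ) {
    }

    public function createWorker(): WorkerSketch
    {
        return new WorkerSketch($this->workerPingTimeout);
    }
}

$default = (new WatcherSketch())->createWorker();
$custom = (new WatcherSketch(workerPingTimeout: 45))->createWorker();

var_dump($default->getPingTimeout());  // int(10)
var_dump($custom->getCheckInterval()); // float(22.5)
```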

Validation

`workerPingTimeout < 1` throws `\ValueError` from `ClusterWatcher`'s constructor.
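
A minimal sketch of that guard, assuming it runs at the top of the constructor. The function name `assertValidPingTimeout()` and the exception message are illustrative only and are not taken from the diff.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical guard: rejects timeouts below one second and returns
 * the value unchanged otherwise.
 */
function assertValidPingTimeout(int $workerPingTimeout): int
{
    if ($workerPingTimeout < 1) {
        // Message text is an assumption, not the wording used in the PR.
        throw new \ValueError(\sprintf(
            'The worker ping timeout must be at least 1 second, got %d',
            $workerPingTimeout,
        ));
    }

    return $workerPingTimeout;
}

try {
    assertValidPingTimeout(0);
} catch (\ValueError $e) {
    echo 'Rejected: ', $e->getMessage(), PHP_EOL;
}

echo 'Accepted: ', assertValidPingTimeout(45), PHP_EOL;
```

Failing in the constructor surfaces a misconfiguration at startup, rather than later as workers that are closed almost immediately.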

Backwards compatibility

  • Default value is `10`, matching the previous hard-coded constant.
  • New parameter is optional; existing call sites continue to work without change.
  • The renamed constant (`PING_TIMEOUT` → `DEFAULT_PING_TIMEOUT`) was `private`, so no public API depended on its name.

Test plan

  • Existing tests continue to pass (defaults are unchanged).
  • Manual: run a worker that `sleep(20)`s with `workerPingTimeout=10` (current behaviour: dies after ~10s) and `workerPingTimeout=30` (does not die).

I'm happy to add a unit test for the constructor validation if you'd like — wanted to keep the diff minimal for first review.

Open questions for the maintainer

  1. Should the parameter be on `ClusterWatcher`'s constructor (proposed), or on a builder/factory? The constructor is consistent with how `IpcHub` etc. are passed today.
  2. Naming: `workerPingTimeout` vs. `pingTimeout` vs. `pingTimeoutSeconds`. Open to bikeshedding.
  3. Should the value be `int` (seconds) or `float` (sub-second granularity)? `EventLoop::repeat()` accepts a float; the rest of the public API uses `int` for time values.
  4. CLI: `vendor/bin/cluster` could grow a `--worker-ping-timeout=` flag. Happy to do that in a follow-up, or in this PR if you'd rather take it here.

Why we are filing this

We run a cluster of ReactPHP HTTP workers that share long-lived state through Doctrine ORM (synchronous PDO). Specific reporting endpoints have a legitimate execution time that exceeds 10s; today the watcher treats them as dead workers and recycles them mid-response. Bumping `PING_TIMEOUT` is the smallest correct change. We're happy to iterate on the design if any of the choices above don't fit.

The worker liveness ping timeout was a hard-coded 10s `private const` in
`ContextClusterWorker`. Applications that legitimately do synchronous
blocking work longer than 10s (e.g. PDO drivers, sync Redis clients,
remote `file_get_contents`) cannot service pings during the blocked
window, so the watcher terminates them mid-request even though the
worker is making progress.

This change exposes the timeout as an optional `$workerPingTimeout`
constructor parameter on `ClusterWatcher`, threaded through to
`ContextClusterWorker`. The default value is `10`, preserving existing
behaviour for all call sites that don't pass the new parameter.

The constant is renamed from the private `PING_TIMEOUT` to public
`DEFAULT_PING_TIMEOUT` so it remains the single source of truth for the
default and is referenceable from `ClusterWatcher`'s constructor
signature.

Validation: `$workerPingTimeout < 1` throws `\ValueError`.

Sample usage:

    $watcher = new ClusterWatcher(
        script: __DIR__ . '/server.php',
        logger: $logger,
        workerPingTimeout: 45,
    );