
iris: async heartbeat dispatch, concurrency 128 #4842

Merged
ravwojdyla merged 4 commits into main from iris/warn-slow-provider-sync
Apr 16, 2026

Conversation

@ravwojdyla-agent
Contributor

@ravwojdyla-agent ravwojdyla-agent commented Apr 16, 2026

  • swap WorkerProvider's 32-slot ThreadPoolExecutor for asyncio.gather with an asyncio.Semaphore(128) cap, driven by asyncio.run per round [1]
  • RpcWorkerStubFactory now caches async WorkerServiceClient instances; the pyqwest HTTP client is Rust-owned, so connection pools survive across per-round event loops
  • bump WorkerProvider.parallelism 32 → 128
  • drop DEFAULT_WORKER_RPC_TIMEOUT 30s → 10s, _SLOW_HEARTBEAT_RPC_LOG_THRESHOLD_MS 10s → 5s
  • add slow_log around self._provider.sync(batches) in _sync_all_execution_units with a 5s threshold [2]
  • WorkerProvider.{get_process_status,profile_task,exec_in_container} keep sync-callable signatures by wrapping each async call in its own asyncio.run
    • observed worst case on a 1339-worker cluster with 308 failing workers was 313s [3]; new bound is ceil(N_failing / 128) × timeout ≈ 24s

Footnotes

  1. asyncio.run creates and tears down an event loop per call (~1ms); the semaphore lives as a local inside _sync_all, so the cap is per-round rather than cross-round

  2. matches heartbeat_interval default (5s); exceeding the interval means the round can no longer keep pace with the schedule

  3. previous behavior: 308 × 30s timeout / 32 threads ≈ 289s round; healthy workers' inter-heartbeat gap stretched to match
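The bounds in the footnotes can be checked with a few lines of arithmetic. Note that the quoted ≈24s corresponds to the fractional wave count 308/128 ≈ 2.4; counting whole waves with ceil would give 30s:

```python
import math

n_failing = 308
old_bound = n_failing * 30 / 32   # 30s timeout, 32 threads
new_bound = n_failing / 128 * 10  # 10s timeout, 128-way fan-out

print(round(old_bound))                 # ~289s, in line with the observed 313s
print(round(new_bound))                 # ~24s
print(math.ceil(n_failing / 128) * 10)  # 30s if whole waves are counted
```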

ravwojdyla and others added 2 commits April 16, 2026 13:19
Wraps the WorkerProvider.sync() call in slow_log so heartbeat rounds that
exceed a 1-second budget — e.g. rounds where many failing workers saturate
the thread pool with 30s timeouts — surface a WARNING. A healthy round with
fast RPCs completes well under this budget; exceeding it indicates either
timeouts stacking or a misconfigured pool.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
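A slow_log helper of the kind this commit describes can be sketched with contextlib and time.monotonic; the real iris implementation may differ in signature and message format:

```python
import contextlib
import logging
import time


@contextlib.contextmanager
def slow_log(logger: logging.Logger, what: str, threshold_ms: int):
    """Emit a WARNING if the wrapped block runs longer than threshold_ms."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms > threshold_ms:
            logger.warning(
                "%s took %.0fms (threshold %dms)", what, elapsed_ms, threshold_ms
            )
```

Usage matches the diff below in spirit: `with slow_log(logger, "provider sync", threshold_ms=5_000): ...`.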
Replace the 32-slot ThreadPoolExecutor dispatch path with an asyncio
event loop on a dedicated thread, fanning out per-worker heartbeat RPCs
via asyncio.gather with a Semaphore(128) cap. The default WorkerProvider
parallelism is raised from 32 to 128.

This bounds the worst-case sync round from N_failing * timeout / 32 to
ceil(N_failing / 128) * timeout. On a 1339-worker cluster with 308
failing workers (observed 313s previously), the new worst case is
~24s at a 10s timeout.

Also lower DEFAULT_WORKER_RPC_TIMEOUT from 30s to 10s and
_SLOW_HEARTBEAT_RPC_LOG_THRESHOLD_MS from 10s to 5s to match.

RpcWorkerStubFactory now caches async WorkerServiceClient instances
instead of WorkerServiceClientSync so that the single pyqwest HTTP
client per address persists across rounds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
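The stub-caching behavior this commit describes can be sketched as a lock-guarded dict keyed by address. The class and method names follow the PR text, but the body is illustrative; `client_factory` stands in for constructing the real connectrpc/pyqwest-backed WorkerServiceClient:

```python
import threading


class RpcWorkerStubFactory:
    """Cache one async client per worker address.

    Because the pyqwest HTTP client's connection pool is owned by the
    Rust tokio runtime rather than any Python event loop, a cached stub
    keeps its connections alive across the per-round asyncio.run loops.
    """

    def __init__(self, client_factory):
        self._client_factory = client_factory
        self._lock = threading.Lock()
        self._stubs = {}

    def get(self, address: str):
        with self._lock:
            stub = self._stubs.get(address)
            if stub is None:
                stub = self._client_factory(address)
                self._stubs[address] = stub
            return stub
```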
@ravwojdyla-agent ravwojdyla-agent added the agent-generated Created by automation/agent label Apr 16, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e2fdda8a8


Comment on lines 123 to 125
def close(self) -> None:
    with self._lock:
        stubs = list(self._stubs.values())
        self._stubs.clear()

P1: Close cached worker clients before dropping stub map

RpcWorkerStubFactory.close() now only clears the _stubs dict, so every cached WorkerServiceClient is discarded without calling close(). In a long-running controller, worker churn/failover creates many distinct addresses, and those async clients can retain open HTTP connections/file descriptors until process exit, causing resource leaks and eventually destabilizing heartbeat/control-plane RPCs. The previous implementation explicitly closed each stub, so this is a regression introduced in this commit.


Contributor


The bot comment is not relevant. Evidence:

  ConnectClient.close at connectrpc/_client_async.py:178-181:

  async def close(self) -> None:
      """Close the HTTP client. After closing, the client cannot be used to make requests."""
      if not self._closed:
          self._closed = True

  It only flips a boolean flag — does not close the underlying pyqwest HTTPClient. The pyqwest client is Rust-owned; its connections/sockets are released via Rust's Drop when the Python reference is GC'd, which happens the moment we clear the dict.

  In the sync version, ConnectClientSync.close() did close the httpx sync client. But we've switched to the async variant, whose close() is a no-op.

  So calling it would (a) require spinning up an event loop from sync close(), and (b) do nothing.

ravwojdyla and others added 2 commits April 16, 2026 13:23
A 1s threshold fires on every healthy round for large clusters: with
1339 workers and 128 concurrency, ~11 RPC waves at 200ms each already
take ~2s. 5s matches the heartbeat interval — exceeding it means the
round can't keep pace with the schedule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
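The threshold arithmetic in this commit message, spelled out (1339 workers at 128-way concurrency, with an assumed ~200ms per healthy heartbeat RPC):

```python
import math

workers = 1339
concurrency = 128
rpc_ms = 200  # assumed healthy heartbeat RPC latency

waves = math.ceil(workers / concurrency)
healthy_round_s = waves * rpc_ms / 1000
print(waves, healthy_round_s)  # 11 waves, ~2.2s: already over a 1s threshold
```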
Replace the persistent asyncio event-loop thread with asyncio.run() per
sync/exec/profile call. Simpler: no custom thread lifecycle, no
dataclass fields to manage, no close() plumbing. Semaphore now lives
as a local inside _sync_all — semantically correct (per-round cap,
not cross-round).

pyqwest's HTTP connection pool is owned by the Rust tokio runtime, not
the Python loop, so cached stubs keep their connections across loop
creation/teardown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
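The sync-callable wrappers this commit describes (asyncio.run per exec/profile/status call) can be sketched as below. Method names follow the PR description; the stub plumbing is illustrative:

```python
import asyncio


class WorkerProvider:
    def __init__(self, stub_factory):
        self._stubs = stub_factory

    async def _get_process_status_async(self, worker):
        # Illustrative: the real call goes through the cached async stub.
        return await self._stubs.get(worker).get_process_status()

    def get_process_status(self, worker):
        # Keep the public signature synchronous by giving each call its
        # own short-lived event loop; the Rust-owned connection pool in
        # the cached stub survives loop teardown.
        return asyncio.run(self._get_process_status_async(worker))
```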

# Sync with the execution backend (ThreadPoolExecutor inside provider).
with slow_log(logger, "provider sync (RPC dispatch)", threshold_ms=5_000):
    results = self._provider.sync(batches)
Contributor


5s is somewhat hand-wavy

@ravwojdyla ravwojdyla requested a review from rjpower April 16, 2026 20:32
@ravwojdyla
Contributor

@rjpower I would like to have this tested, but this feels like a larger testing-harness adventure. It could be an interesting project to build a good multithreaded testing env to ensure bugs like these do not resurface 🤔

Collaborator

@rjpower rjpower left a comment


Do you know why we removed the close()? Maybe worth kicking out onto a helper thread if it needs to be async?

@ravwojdyla ravwojdyla merged commit a26c774 into main Apr 16, 2026
53 checks passed
@ravwojdyla ravwojdyla deleted the iris/warn-slow-provider-sync branch April 16, 2026 21:37