Misc. Fixes #514

Draft
magniloquency wants to merge 12 commits into finos:main from magniloquency:misc-fixes
magniloquency commented Jan 20, 2026

Fix Client Hang on clear() and Improve Shutdown Stability

Summary

This PR addresses multiple issues related to system stability during disconnection and shutdown scenarios. It fixes a client-side hang when clearing futures during a disconnect, improves GIL handling in the Object Storage Server C++ extension to prevent potential crashes or deadlocks, and enhances exception handling in Worker and Processor components for a cleaner shutdown.

Changes

Client Side

  • src/scaler/client/agent/future_manager.py:
    • Imported TimeoutError from concurrent.futures.
    • Updated cancel_all_futures to call future.cancel(timeout=5.0).
    • Added a fallback to force local cancellation (future.set_canceled()) if the 5-second timeout is reached. This prevents the client from hanging indefinitely if the scheduler is unreachable.
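
The timeout-then-fallback pattern described above can be sketched roughly as follows. This is an illustration only, not the code in future_manager.py: `RemoteFuture` is a hypothetical stand-in for the client's future type, its `cancel(timeout=...)` is assumed to raise `TimeoutError` when the scheduler does not confirm in time, and the `scheduler_reachable` flag exists only to simulate that case.

```python
from concurrent.futures import TimeoutError


class RemoteFuture:
    """Hypothetical stand-in for a client future whose cancel() waits on the scheduler."""

    def __init__(self, scheduler_reachable: bool) -> None:
        self._scheduler_reachable = scheduler_reachable
        self._canceled = False

    def cancel(self, timeout: float) -> bool:
        # A real implementation would block up to `timeout` seconds for a confirmation.
        if not self._scheduler_reachable:
            raise TimeoutError("no cancel confirmation from scheduler")
        self._canceled = True
        return True

    def set_canceled(self) -> None:
        # Local-only cancellation: mark the future canceled without the scheduler.
        self._canceled = True

    def cancelled(self) -> bool:
        return self._canceled


def cancel_all_futures(futures, timeout: float = 5.0) -> None:
    for future in futures:
        try:
            future.cancel(timeout=timeout)
        except TimeoutError:
            # The scheduler did not answer in time; force local cancellation
            # so the caller is never blocked indefinitely.
            future.set_canceled()
```

With this shape, every future ends up cancelled whether or not the scheduler responds; the worst case is one timeout per unreachable future.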

Object Storage Server (C++ Extension)

  • src/cpp/scaler/object_storage/pymod_object_storage_server.cpp:
    • GIL Release in wait_until_ready: Wrapped the blocking C++ call waitUntilReady with Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS to release the GIL, allowing other Python threads to execute while waiting.
    • Safe String Conversion: Ensured Python string arguments in run are converted to std::string before releasing the GIL. This prevents potential access to invalid memory if the Python objects are modified or collected while the C++ code is running without the GIL.

Worker & Processor

  • src/scaler/worker/agent/processor/processor.py:
    • Simplified __interrupt handler to raise SystemExit(0) instead of manually destroying connectors, allowing for a more standard process termination flow.
  • src/scaler/worker/worker.py:
    • Added handling for asyncio.CancelledError while sending the DisconnectRequest, so that a send legitimately cancelled during shutdown no longer logs a spurious exception.
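
A minimal sketch of the cancellation handling described for worker.py, under stated assumptions: `send_disconnect_request` and its `send` callable are hypothetical names, and the real worker sends a DisconnectRequest message through its own connector rather than a plain string.

```python
import asyncio
import logging


async def send_disconnect_request(send, logger: logging.Logger) -> bool:
    """Return True if the request was sent, False if shutdown cancelled the send."""
    try:
        await send("DisconnectRequest")
        return True
    except asyncio.CancelledError:
        # Expected when the worker's tasks are cancelled during shutdown,
        # so it is noted at debug level instead of being logged as an error.
        logger.debug("DisconnectRequest send cancelled during shutdown")
        return False
```

The sketch swallows the cancellation because this is the last step before exit; code that still has work to do afterwards would normally re-raise `CancelledError` so the cancellation keeps propagating.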

Verification

  • Verified that tests/cluster/test_cluster_disconnect.py now completes successfully.
  • Confirmed that the C++ extension compiles and runs with the new GIL handling.

Refactored ObjectStorageServerProcess to use the spawn multiprocessing context and delayed the initialization of the ObjectStorageServer C++ object until the child process's run() method. This ensures that C++ threads are not inherited from the parent process, avoiding deadlocks and resource corruption.
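
The deferred-initialization idea can be sketched as below. The names are assumptions for illustration: `FakeStorageServer` stands in for the C++ ObjectStorageServer object, and `StorageServerProcess` is not the actual ObjectStorageServerProcess class.

```python
import multiprocessing


class FakeStorageServer:
    """Hypothetical stand-in for the C++ server object, which may start native threads."""

    def run(self, host: str, port: int) -> None:
        pass  # a real server would block here serving requests


class StorageServerProcess(multiprocessing.get_context("spawn").Process):
    def __init__(self, host: str, port: int) -> None:
        super().__init__()
        self._host = host
        self._port = port
        # Deliberately not created here: the parent must not own the server or its threads.
        self._server = None

    def run(self) -> None:
        # run() executes in the child after start(), so the server object and
        # any native threads it creates exist only in the child process.
        self._server = FakeStorageServer()
        self._server.run(self._host, self._port)
```

Because the spawn start method re-imports the main module in the child, a script that calls `start()` on such a process should do so under an `if __name__ == "__main__":` guard.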

Additionally, implemented a readiness check that polls the server's TCP port from the parent process, replacing the previous internal C++ pipe-based mechanism, which was prone to hangs when used with multiprocessing.
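
A readiness check of this kind might look like the following sketch. The function name `wait_for_tcp_port` and its default timeout and interval are assumptions, not values taken from the PR.

```python
import socket
import time


def wait_for_tcp_port(host: str, port: int, timeout: float = 10.0, interval: float = 0.1) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the server is listening; close immediately.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: the server is not ready yet, retry after a pause.
            time.sleep(interval)
    return False
```

The parent would call this after starting the child process and treat a `False` result as a startup failure. Since the check relies only on the listening socket, it needs no shared state or signaling channel between the two processes.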

Removed debug traceback prints in SchedulerClusterCombo.
Switched from internal C++ pipe-based signaling to external TCP port polling for the readiness check. This avoids deadlocks caused by Global Interpreter Lock (GIL) contention when background threads attempt to signal readiness while the main server loop is running.

Ensured that ObjectStorageServerProcess uses the spawn context and initializes the C++ server object entirely within the child process to maintain resource isolation and avoid invalid thread inheritance.

magniloquency requested a review from gxuu on January 21, 2026, 05:10