Draft
Conversation
Refactored ObjectStorageServerProcess to use the spawn multiprocessing context and delayed the initialization of the ObjectStorageServer C++ object until the child process's run() method. This ensures that C++ threads are not inherited from the parent process, avoiding deadlocks and resource corruption. Additionally, implemented a robust readiness check by polling the server's TCP port from the parent process, replacing the previous internal C++ pipe-based mechanism which was prone to hangs when used with multiprocessing. Removed debug traceback prints in SchedulerClusterCombo.
Switched from internal C++ pipe-based signaling to external TCP port polling for the readiness check. This avoids deadlocks caused by Global Interpreter Lock (GIL) contention when background threads attempt to signal readiness while the main server loop is running. Ensured that ObjectStorageServerProcess uses the spawn context and initializes the C++ server object entirely within the child process to maintain resource isolation and avoid invalid thread inheritance.
e03d880 to
1c30f60
Compare
8cb11dd to
032a851
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fix Client Hang on
clear()and Improve Shutdown StabilitySummary
This PR addresses multiple issues related to system stability during disconnection and shutdown scenarios. It fixes a client-side hang when clearing futures during a disconnect, improves GIL handling in the Object Storage Server C++ extension to prevent potential crashes or deadlocks, and enhances exception handling in Worker and Processor components for a cleaner shutdown.
Changes
Client Side
src/scaler/client/agent/future_manager.py:TimeoutErrorfromconcurrent.futures.cancel_all_futuresto callfuture.cancel(timeout=5.0).future.set_canceled()) if the 5-second timeout is reached. This prevents the client from hanging indefinitely if the scheduler is unreachable.Object Storage Server (C++ Extension)
src/cpp/scaler/object_storage/pymod_object_storage_server.cpp:wait_until_ready: Wrapped the blocking C++ callwaitUntilReadywithPy_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADSto release the GIL, allowing other Python threads to execute while waiting.runare converted tostd::stringbefore releasing the GIL. This prevents potential access to invalid memory if the Python objects are modified or collected while the C++ code is running without the GIL.Worker & Processor
src/scaler/worker/agent/processor/processor.py:__interrupthandler to raiseSystemExit(0)instead of manually destroying connectors, allowing for a more standard process termination flow.src/scaler/worker/worker.py:asyncio.CancelledErrorduringDisconnectRequestsending to prevent unnecessary exception logging when the operation is validly cancelled during shutdown.Verification
tests/cluster/test_cluster_disconnect.pynow completes successfully.