-
Notifications
You must be signed in to change notification settings - Fork 72
Description
Most of this investigation was done by GenAI, but has been reviewed:
Describe the bug
The refactored server bootstrap shutdown path in channel_bootstrap.c defers the shutdown_callback through an async socket cleanup callback. The refactor removed an else guard that previously ensured shutdown_callback was only called for channels where incoming_callback had been called successfully (i.e. channels that completed setup). As a result, shutdown_callback can now fire for channels that failed during setup and may not have the expected handler slots.
This causes a fatal assert crash in downstream consumers (specifically aws-c-event-stream's s_on_accept_channel_shutdown) on Windows, where the IOCP-based async socket cleanup creates a timing window that reliably triggers the bug during rapid server lifecycle cycling.
Filed as a follow-up to awslabs/aws-c-event-stream#140.
This is a regression. The old code in s_on_server_channel_on_shutdown used an if/else that prevented shutdown_callback from firing on the error path. The refactored code removed the else, and the deferred callback now calls shutdown_callback unconditionally when incoming_called is true — but incoming_called is set to true by the error path itself.
The bug was introduced somewhere in the 54 commits between aws-c-io fcb38c80 (bundled in aws-crt-java 0.33.5) and bfb0819d (bundled in aws-crt-java 0.43.6). The last known working version is fcb38c80.
Regression Issue
- Select this option if this issue appears to be a regression.
Expected Behavior
shutdown_callback should only be called for channels that were successfully set up and had incoming_callback called with a non-error channel. This was the behavior of the old code:
// OLD s_on_server_channel_on_shutdown — correct behavior
if (!channel_data->incoming_called) {
s_server_incoming_callback(channel_data, error_code, NULL); // error path
} else {
args->shutdown_callback(server_bootstrap, error_code, channel, ...); // only for successful setup
}
aws_channel_destroy(channel);The if/else made the two paths mutually exclusive: either the channel failed setup (call incoming_callback with error) or it succeeded and is now shutting down (call shutdown_callback). Never both.
Current Behavior
The refactored code calls s_server_incoming_callback on the error path (which sets incoming_called = true), then unconditionally defers to a socket cleanup callback that checks incoming_called and calls shutdown_callback:
// NEW s_on_server_channel_on_shutdown
if (!channel_data->incoming_called) {
s_server_incoming_callback(channel_data, error_code, NULL);
// s_server_incoming_callback sets incoming_called = true
}
// No else — always sets up deferred callback
SETUP_SOCKET_SHUTDOWN_CALLBACKS(..., socket_shutdown_server_channel_shutdown_fn, ...)
aws_socket_clean_up(socket);// socket_shutdown_server_channel_shutdown_fn (fires asynchronously after IOCP socket close)
if (channel_data->incoming_called) { // always true now, even for error path
connection_args->shutdown_callback(
server_bootstrap, error_code, shutdown_args->channel, server_shutdown_user_data);
}
aws_channel_destroy(shutdown_args->channel);This causes shutdown_callback to fire for channels that failed setup. The channel passed to shutdown_callback may not have the handler slots that the callback expects, leading to crashes in the callback implementation.
Observed crash in aws-c-event-stream (event_stream_rpc_server.c:354):
Fatal error condition occurred in event_stream_rpc_server.c:354:
current_slot && "It should be logically impossible to have a channel in this callback
that doesn't have a slot in it"
Exiting Application
Process exit code: 0xC0000409 (STATUS_STACK_BUFFER_OVERRUN — Windows fast-fail for abort()).
Reproduction Steps
This reproduces consistently on Windows via the aws-greengrass-nucleus integration test suite, which rapidly creates and destroys event-stream RPC servers on local sockets:
git clone https://github.com/aws-greengrass/aws-greengrass-nucleus.git
cd aws-greengrass-nucleus
git checkout 2b4fb24 # "fix: update SDK and CRT version (#1768)" — the commit that upgraded to aws-crt 0.43.6
# On Windows with JDK 8 (Corretto):
mvn -ntp -U verify -Djava.io.tmpdir=C:\TempThe crash occurs during DeploymentConfigMergingTest, which cycles through multiple Kernel instances (each with its own IPC RPC server) within a single JVM process.
Failing CI runs (Windows job fails, Linux job passes):
- https://github.com/aws-greengrass/aws-greengrass-nucleus/actions/runs/22873221124
- https://github.com/aws-greengrass/aws-greengrass-nucleus/actions/runs/22782086012
Passing CI runs with the previous aws-c-io version (before the CRT upgrade):
- https://github.com/aws-greengrass/aws-greengrass-nucleus/actions/runs/22082378155
- https://github.com/aws-greengrass/aws-greengrass-nucleus/actions/runs/20288329264
Possible Solution
Restore the mutual exclusivity between the error path and the shutdown_callback path. The deferred callback should track whether incoming_called was already true before s_on_server_channel_on_shutdown ran, rather than checking the flag after the error path has set it.
One approach — add a flag to distinguish the two cases:
// s_on_server_channel_on_shutdown
bool was_incoming_called = channel_data->incoming_called;
if (!channel_data->incoming_called) {
error_code = (error_code) ? error_code : AWS_ERROR_UNKNOWN;
s_server_incoming_callback(channel_data, error_code, NULL);
}
// Pass was_incoming_called to the deferred callback args
// ...// socket_shutdown_server_channel_shutdown_fn
if (shutdown_args->was_incoming_called) { // only true for channels that completed setup
connection_args->shutdown_callback(...);
}
aws_channel_destroy(shutdown_args->channel);Alternatively, avoid calling s_server_incoming_callback on the error path inside s_on_server_channel_on_shutdown and instead handle the error notification through the deferred callback as well, preserving the original incoming_called state.
Additional Information/Context
- The crash is 100% reproducible on
windows-latestGitHub Actions runners after the CRT upgrade - Linux never crashes — the same test suite passes cleanly (likely because socket cleanup is synchronous on Linux, so the timing window doesn't exist)
- The downstream consumer (
aws-c-event-stream) also has a pre-existing missing-returnbug ins_on_accept_channel_setupand lacks a defensive NULL check ins_rpc_connection_from_channel, both of which are tracked in Fatal assert ins_rpc_connection_from_channel()during channel shutdown on Windows aws-c-event-stream#140 - The
aws-c-ioversion bundled inaws-crt-java0.33.5 (fcb38c80) does not exhibit this crash - The
aws-c-ioversion bundled inaws-crt-java0.43.6 (bfb0819d) reliably crashes
aws-c-io version used
bfb0819d3906502483611ce832a5ec6b897c8421 (bundled in aws-crt-java 0.43.6). Last known working: fcb38c804364dd627c335da752a99a125a88f6e9 (bundled in aws-crt-java 0.33.5).
Compiler and version used
MSVC (as built by the aws-crt-java CI pipeline — native DLL path: C:\Program Files (x86)\Jenkins\workspace\aws-crt-java-build-dll-win64\)
Operating System and version
Windows Server 2022 (GitHub Actions windows-latest runner), with Corretto JDK 8.482.08-1