Skip to content

Server bootstrap deferred shutdown callback fires shutdown_callback for channels that failed setup #793

@aws-kevinrickard

Description

@aws-kevinrickard

Most of this investigation was done by GenAI, but has been reviewed:

Describe the bug

The refactored server bootstrap shutdown path in channel_bootstrap.c defers the shutdown_callback through an async socket cleanup callback. The refactor removed an else guard that previously ensured shutdown_callback was only called for channels where incoming_callback had been called successfully (i.e. channels that completed setup). As a result, shutdown_callback can now fire for channels that failed during setup and may not have the expected handler slots.

This causes a fatal assert crash in downstream consumers (specifically aws-c-event-stream's s_on_accept_channel_shutdown) on Windows, where the IOCP-based async socket cleanup creates a timing window that reliably triggers the bug during rapid server lifecycle cycling.

Filed as a follow-up to awslabs/aws-c-event-stream#140.

This is a regression. The old code in s_on_server_channel_on_shutdown used an if/else that prevented shutdown_callback from firing on the error path. The refactored code removed the else, and the deferred callback now calls shutdown_callback unconditionally when incoming_called is true — but incoming_called is set to true by the error path itself.

The bug was introduced somewhere in the 54 commits between aws-c-io fcb38c80 (bundled in aws-crt-java 0.33.5) and bfb0819d (bundled in aws-crt-java 0.43.6). The last known working version is fcb38c80.

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

shutdown_callback should only be called for channels that were successfully set up and had incoming_callback called with a non-error channel. This was the behavior of the old code:

// OLD s_on_server_channel_on_shutdown — correct behavior
if (!channel_data->incoming_called) {
    s_server_incoming_callback(channel_data, error_code, NULL);  // error path
} else {
    args->shutdown_callback(server_bootstrap, error_code, channel, ...);  // only for successful setup
}
aws_channel_destroy(channel);

The if/else made the two paths mutually exclusive: either the channel failed setup (call incoming_callback with error) or it succeeded and is now shutting down (call shutdown_callback). Never both.

Current Behavior

The refactored code calls s_server_incoming_callback on the error path (which sets incoming_called = true), then unconditionally defers to a socket cleanup callback that checks incoming_called and calls shutdown_callback:

// NEW s_on_server_channel_on_shutdown
if (!channel_data->incoming_called) {
    s_server_incoming_callback(channel_data, error_code, NULL);
    // s_server_incoming_callback sets incoming_called = true
}
// No else — always sets up deferred callback
SETUP_SOCKET_SHUTDOWN_CALLBACKS(..., socket_shutdown_server_channel_shutdown_fn, ...)
aws_socket_clean_up(socket);
// socket_shutdown_server_channel_shutdown_fn (fires asynchronously after IOCP socket close)
if (channel_data->incoming_called) {  // always true now, even for error path
    connection_args->shutdown_callback(
        server_bootstrap, error_code, shutdown_args->channel, server_shutdown_user_data);
}
aws_channel_destroy(shutdown_args->channel);

This causes shutdown_callback to fire for channels that failed setup. The channel passed to shutdown_callback may not have the handler slots that the callback expects, leading to crashes in the callback implementation.

Observed crash in aws-c-event-stream (event_stream_rpc_server.c:354):

Fatal error condition occurred in event_stream_rpc_server.c:354:
current_slot && "It should be logically impossible to have a channel in this callback
that doesn't have a slot in it"
Exiting Application

Process exit code: 0xC0000409 (STATUS_STACK_BUFFER_OVERRUN — Windows fast-fail for abort()).

Reproduction Steps

This reproduces consistently on Windows via the aws-greengrass-nucleus integration test suite, which rapidly creates and destroys event-stream RPC servers on local sockets:

git clone https://github.com/aws-greengrass/aws-greengrass-nucleus.git
cd aws-greengrass-nucleus
git checkout 2b4fb24  # "fix: update SDK and CRT version (#1768)" — the commit that upgraded to aws-crt 0.43.6
# On Windows with JDK 8 (Corretto):
mvn -ntp -U verify -Djava.io.tmpdir=C:\Temp

The crash occurs during DeploymentConfigMergingTest, which cycles through multiple Kernel instances (each with its own IPC RPC server) within a single JVM process.

Failing CI runs (Windows job fails, Linux job passes):

Passing CI runs with the previous aws-c-io version (before the CRT upgrade):

Possible Solution

Restore the mutual exclusivity between the error path and the shutdown_callback path. The deferred callback should track whether incoming_called was already true before s_on_server_channel_on_shutdown ran, rather than checking the flag after the error path has set it.

One approach — add a flag to distinguish the two cases:

// s_on_server_channel_on_shutdown
bool was_incoming_called = channel_data->incoming_called;

if (!channel_data->incoming_called) {
    error_code = (error_code) ? error_code : AWS_ERROR_UNKNOWN;
    s_server_incoming_callback(channel_data, error_code, NULL);
}

// Pass was_incoming_called to the deferred callback args
// ...
// socket_shutdown_server_channel_shutdown_fn
if (shutdown_args->was_incoming_called) {  // only true for channels that completed setup
    connection_args->shutdown_callback(...);
}
aws_channel_destroy(shutdown_args->channel);

Alternatively, avoid calling s_server_incoming_callback on the error path inside s_on_server_channel_on_shutdown and instead handle the error notification through the deferred callback as well, preserving the original incoming_called state.

Additional Information/Context

  • The crash is 100% reproducible on windows-latest GitHub Actions runners after the CRT upgrade
  • Linux never crashes — the same test suite passes cleanly (likely because socket cleanup is synchronous on Linux, so the timing window doesn't exist)
  • The downstream consumer (aws-c-event-stream) also has a pre-existing missing-return bug in s_on_accept_channel_setup and lacks a defensive NULL check in s_rpc_connection_from_channel, both of which are tracked in Fatal assert in s_rpc_connection_from_channel() during channel shutdown on Windows aws-c-event-stream#140
  • The aws-c-io version bundled in aws-crt-java 0.33.5 (fcb38c80) does not exhibit this crash
  • The aws-c-io version bundled in aws-crt-java 0.43.6 (bfb0819d) reliably crashes

aws-c-io version used

bfb0819d3906502483611ce832a5ec6b897c8421 (bundled in aws-crt-java 0.43.6). Last known working: fcb38c804364dd627c335da752a99a125a88f6e9 (bundled in aws-crt-java 0.33.5).

Compiler and version used

MSVC (as built by the aws-crt-java CI pipeline — native DLL path: C:\Program Files (x86)\Jenkins\workspace\aws-crt-java-build-dll-win64\)

Operating System and version

Windows Server 2022 (GitHub Actions windows-latest runner), with Corretto JDK 8.482.08-1

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugThis issue is a bug.needs-triageThis issue or PR still needs to be triaged.potential-regressionMarking this issue as a potential regression to be checked by team member

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions