Error during host startup can cause a deadlock in the restart flow, leaving the host unhealthy until a manual restart #10766

Open
@cjaliaga

Description

When a host startup fails (e.g., due to storage connectivity issues), the script host service sets the host to Error and initiates a new host startup. This new startup acquires _hostStartSemaphore and calls BuildHost(). During that process, WorkerFunctionMetadataProvider.GetFunctionMetadataAsync() detects no active worker channels and calls RestartHostAsync(). However, RestartHostAsync() attempts to cancel the same in-progress startup and then waits on _hostStartSemaphore. Because the startup path has no ThrowIfCancellationRequested check, the cancellation has no effect and _hostStartSemaphore is never released. The restart remains blocked, leaving the host in Error state until it is manually restarted.

Repro steps

  1. First Host Start
    The host begins to initialize (loads metadata, starts worker channels, etc.).

  2. Failure Connecting to Storage
    A transient error (e.g., DNS or storage connection issue) leads to an aborted startup and moves the host state to Error.

  3. Host Startup Canceled
    Because of the error, the system transitions the existing startup to a canceled state (shutting down worker channels).

  4. New Host Scheduled
    The system schedules a new host startup after a short delay.

  5. Second Host Startup
    This new host startup acquires the _hostStartSemaphore and begins another BuildHost() process.

  6. Metadata Provider Finds No Channels
    Inside WorkerFunctionMetadataProvider.GetFunctionMetadataAsync(), the code detects that no channels exist (they were previously shut down).

    if (channels?.Any() != true)
    {
        if (_scriptHostManager.State is ScriptHostState.Default
            || _scriptHostManager.State is ScriptHostState.Starting
            || _scriptHostManager.State is ScriptHostState.Initialized)
        {
            // We don't need to restart if the host hasn't even been created yet.
            _logger.LogDebug("Host is starting up, initializing language worker channel");
            await _channelManager.InitializeChannelAsync(workerConfigs, _workerRuntime);
        }
        else
        {
            // During the restart flow, GetFunctionMetadataAsync gets invoked
            // again through a new script host initialization flow.
            _logger.LogDebug("Host is running without any initialized channels, restarting the JobHost.");
            await _scriptHostManager.RestartHostAsync();
        }
    }

  7. RestartHostAsync() Called
    Since no channels are active, RestartHostAsync() is invoked to re-initialize the host.

  8. Cancellation Attempt
    RestartHostAsync() attempts to cancel the active startup operation by calling Cancel() on its CancellationTokenSource. However, because this cancellation call happens within the same call stack/async flow as the current BuildHost(), there is no ThrowIfCancellationRequested check or natural yield point to abort the build operation.

    foreach (var startupOperation in ScriptHostStartupOperation.ActiveOperations)
    {
        _logger.CancelingStartupOperationForRestart(startupOperation.Id);
        try
        {
            startupOperation.CancellationTokenSource.Cancel();
        }
        catch (ObjectDisposedException)
        {
            // This can be disposed at any time.
        }
    }
    try
    {
        await _hostStartSemaphore.WaitAsync();

  9. Semaphore Deadlock
    The second startup still holds _hostStartSemaphore and never releases it because the cancellation is ineffective, so the new restart attempt blocks indefinitely when it tries to acquire that semaphore again.

At this point, the host remains in an Error state until it is manually restarted, since the restart logic is effectively deadlocked.
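Steps 8 and 9 depend on cancellation in .NET being cooperative: Cancel() only sets a flag on the token, and code that never observes the token keeps running and keeps whatever it holds. The following is a minimal standalone sketch of that behavior, not host code. It assumes C# top-level statements on .NET 5 or later, and the names startSemaphore, startupCts and release are hypothetical stand-ins. The 200 ms timeout exists only so the sketch terminates; the WaitAsync() call quoted in step 8 has no timeout.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-ins; none of these names come from the host codebase.
var startSemaphore = new SemaphoreSlim(1, 1);
using var startupCts = new CancellationTokenSource();

// The "startup" takes the semaphore, then does work that never observes the token.
await startSemaphore.WaitAsync();
var release = new TaskCompletionSource(); // completes only when we allow the work to finish
Task startup = Task.Run(async () =>
{
    try
    {
        await release.Task; // no ThrowIfCancellationRequested, token never checked
    }
    finally
    {
        startSemaphore.Release();
    }
});

// The "restart" cancels the startup, then waits for the semaphore.
startupCts.Cancel();
bool cancelRequested = startupCts.IsCancellationRequested;
bool acquiredAfterCancel = await startSemaphore.WaitAsync(TimeSpan.FromMilliseconds(200));

Console.WriteLine($"Cancel requested: {cancelRequested}, semaphore acquired: {acquiredAfterCancel}");
// prints "Cancel requested: True, semaphore acquired: False"

// Let the stand-in finish so the process exits cleanly.
release.SetResult();
await startup;
```

The cancel request is recorded, but the semaphore stays held because nothing in the stand-in startup reacts to it.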

Example Call Stack

Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.RestartHostAsync
Microsoft.Azure.WebJobs.Script.WorkerFunctionMetadataProvider.GetFunctionMetadataAsync
Microsoft.Azure.WebJobs.Script.WebHost.FunctionMetadataProvider.GetFunctionMetadataAsync
Microsoft.Azure.WebJobs.Script.FunctionMetadataManager.LoadFunctionMetadata
Microsoft.Azure.WebJobs.Script.DependencyInjection.ScriptStartupTypeLocator.GetExtensionsStartupTypesAsync
Microsoft.Azure.WebJobs.Script.WebHost.DefaultScriptHostBuilder.BuildHost
Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.BuildHost
Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.StartHostAsync
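The call stack shows RestartHostAsync running underneath StartHostAsync, i.e. inside the flow that already holds the semaphore. SemaphoreSlim is not reentrant, so a nested wait in the same async flow cannot succeed until the outer holder releases, and the outer holder is itself awaiting the nested call. A rough sketch of that shape follows; the method names ending in Sketch are hypothetical stand-ins, not the host's methods, and the timeout is added only so the example returns instead of hanging. It assumes C# top-level statements.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

var hostStartSemaphore = new SemaphoreSlim(1, 1);

// Stand-in for RestartHostAsync: waits for the semaphore its caller already holds.
// The timeout is an addition for the sketch; without it this await never completes.
async Task<bool> RestartHostSketchAsync()
{
    bool acquired = await hostStartSemaphore.WaitAsync(TimeSpan.FromMilliseconds(200));
    if (acquired)
    {
        hostStartSemaphore.Release();
    }
    return acquired;
}

// Stand-in for the metadata lookup that finds no channels and requests a restart.
Task<bool> GetFunctionMetadataSketchAsync() => RestartHostSketchAsync();

// Stand-in for StartHostAsync: holds the semaphore while the host is being built.
async Task<bool> StartHostSketchAsync()
{
    await hostStartSemaphore.WaitAsync();
    try
    {
        return await GetFunctionMetadataSketchAsync();
    }
    finally
    {
        hostStartSemaphore.Release();
    }
}

bool restartAcquired = await StartHostSketchAsync();
Console.WriteLine($"Nested restart acquired the semaphore: {restartAcquired}");
// prints "Nested restart acquired the semaphore: False"
```

With the timeout removed, the nested wait and the outer await would block each other indefinitely, which matches the state described in this issue.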
