Description
When a host startup fails (e.g., due to storage connectivity issues), the host is set to Error and a new host startup is initiated. This new startup acquires _hostStartSemaphore and calls BuildHost(). During that process, WorkerFunctionMetadataProvider.GetFunctionMetadataAsync() detects no active worker channels and calls RestartHostAsync(). However, RestartHostAsync() attempts to cancel the same in-progress startup, and because there is no ThrowIfCancellationRequested check, the startup is not aborted and _hostStartSemaphore is never released. The restart remains blocked, leaving the host in Error state until it is manually restarted.
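The cancellation half of the problem can be illustrated in isolation. The following is a minimal sketch, not the host's actual code: CancellationSketch and RunBuildAsync are hypothetical names, and the "build" is reduced to a single yield. It only shows that calling Cancel() sets a flag on the token, and that a flow which never observes the token runs to completion anyway.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration only; none of these names exist in the host codebase.
public static class CancellationSketch
{
    public static async Task<string> RunBuildAsync(bool observeToken)
    {
        using var cts = new CancellationTokenSource();
        try
        {
            // A restart request arrives from inside the same async flow.
            cts.Cancel();

            // Yielding does not abort anything; cancellation is cooperative.
            await Task.Yield();

            if (observeToken)
            {
                // Throws OperationCanceledException because the token is canceled.
                cts.Token.ThrowIfCancellationRequested();
            }

            // Reached only when the token is never checked.
            return "completed";
        }
        catch (OperationCanceledException)
        {
            return "canceled";
        }
    }
}
```

Under these assumptions, RunBuildAsync(false) finishes as "completed" even though cancellation was requested, while RunBuildAsync(true) ends as "canceled".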
Repro steps
- First Host Start
  The host begins to initialize (loads metadata, starts worker channels, etc.).
- Failure Connecting to Storage
  A transient error (e.g., DNS or storage connection issue) leads to an aborted startup and moves the host state to Error.
- Host Startup Canceled
  Because of the error, the system transitions the existing startup to a canceled state (shutting down worker channels).
- New Host Scheduled
  The system schedules a new host startup after a short delay.
- Second Host Startup
  This new host startup acquires the _hostStartSemaphore and begins another BuildHost() process.
- Metadata Provider Finds No Channels
  Inside WorkerFunctionMetadataProvider.GetFunctionMetadataAsync(), the code detects that no channels exist (they were previously shut down).
- RestartHostAsync() Called
  Since no channels are active, RestartHostAsync() is invoked to re-initialize the host.
- Cancellation Attempt
  RestartHostAsync() attempts to cancel the active startup operation by calling Cancel() on its CancellationTokenSource. However, because this cancellation call happens within the same call stack/async flow as the current BuildHost(), there is no ThrowIfCancellationRequested check or natural yield point to abort the build operation.
  azure-functions-host/src/WebJobs.Script.WebHost/WebJobsScriptHostService.cs
  Lines 562 to 577 in dae16f9
- Semaphore Deadlock
  With the second startup still holding the _hostStartSemaphore, and never releasing it due to the ineffective cancellation, the new restart attempt blocks indefinitely when trying to reacquire that semaphore.
At this point, the host remains in an Error state until it is manually restarted, since the restart logic is effectively deadlocked.
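The semaphore half of the sequence above can be sketched as follows. This is a simplified model under stated assumptions, not the code in WebJobsScriptHostService: SemaphoreDeadlockSketch, StartAsync, and RestartAsync are hypothetical stand-ins, and the restart uses a 200 ms bounded wait purely so the example terminates. An unbounded wait at the same point would never return, because SemaphoreSlim is not re-entrant and the only holder is the flow that is waiting.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical model of the reported sequence; names are illustrative only.
public static class SemaphoreDeadlockSketch
{
    private static readonly SemaphoreSlim HostStartSemaphore = new SemaphoreSlim(1, 1);

    // Stand-in for a restart: cancel the in-progress startup, then try to
    // take the startup semaphore for the next startup.
    private static async Task<bool> RestartAsync(CancellationTokenSource startupCts)
    {
        startupCts.Cancel();

        // Bounded wait (an assumption of this sketch) so the example ends.
        // Returns false when the semaphore could not be acquired in time.
        return await HostStartSemaphore.WaitAsync(TimeSpan.FromMilliseconds(200));
    }

    // Stand-in for a startup that triggers a restart from inside its own flow.
    public static async Task<bool> StartAsync()
    {
        using var startupCts = new CancellationTokenSource();

        await HostStartSemaphore.WaitAsync(); // the startup now holds the semaphore
        try
        {
            // The restart runs while the semaphore is still held by this flow.
            bool acquired = await RestartAsync(startupCts);
            if (acquired)
            {
                HostStartSemaphore.Release();
            }
            return acquired;
        }
        finally
        {
            HostStartSemaphore.Release(); // released only after the restart gave up
        }
    }
}
```

In this model StartAsync() returns false: the restart cannot acquire the semaphore while the startup that invoked it is still holding it. Whether a bounded wait, a cancellation check, or releasing the semaphore before restarting is the appropriate fix for the host is a separate design question that this sketch does not settle.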
Example Call Stack
Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.RestartHostAsync
Microsoft.Azure.WebJobs.Script.WorkerFunctionMetadataProvider.GetFunctionMetadataAsync
Microsoft.Azure.WebJobs.Script.WebHost.FunctionMetadataProvider.GetFunctionMetadataAsync
Microsoft.Azure.WebJobs.Script.FunctionMetadataManager.LoadFunctionMetadata
Microsoft.Azure.WebJobs.Script.DependencyInjection.ScriptStartupTypeLocator.GetExtensionsStartupTypesAsync
Microsoft.Azure.WebJobs.Script.WebHost.DefaultScriptHostBuilder.BuildHost
Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.BuildHost
Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.StartHostAsync