Update health check to ensure blob containers created at right time #9159

Merged
sebastienros merged 17 commits into main from igveliko/fix_9139
May 16, 2025

Conversation

RussKie
Contributor

@RussKie RussKie commented May 8, 2025

Resolves #9139
Resolves #9145

The underlying issue is attributed to a lack of health checks for child resources (as those have no lifetime of their own). In a nutshell, whenever an Azurite emulator is starting up, the readiness of the emulator is indicated by the readiness of the "blobs" resource (represented by BlobServiceClient). Previously, child blob containers were created on ResourceReadyEvent, but this created an opportunity for a race condition: a client could attempt to connect after the resource reported healthy but before child resources were created. The flaky test highlighted this problem.

To fix the issue we're now creating an individual health check for each blob container resource. To make it simple, a blob container is being created within the health check itself.
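
A minimal sketch of what "creating the container within the health check" could look like, assuming a custom IHealthCheck that wraps a BlobContainerClient. The class name and the client factory are hypothetical and are not taken from this PR; only IHealthCheck, HealthCheckResult, and BlobContainerClient.CreateIfNotExistsAsync are existing APIs.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical type for illustration; not the class added by this PR.
public sealed class CreatingBlobContainerHealthCheck : IHealthCheck
{
    // Assumption: the caller supplies a factory, because the connection string
    // is only known once the emulator endpoint has been allocated.
    private readonly Func<BlobContainerClient> _clientFactory;

    public CreatingBlobContainerHealthCheck(Func<BlobContainerClient> clientFactory)
        => _clientFactory = clientFactory;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // Idempotent: creates the container on the first successful probe
            // and is a no-op on later probes.
            await _clientFactory()
                .CreateIfNotExistsAsync(cancellationToken: cancellationToken)
                .ConfigureAwait(false);

            return HealthCheckResult.Healthy();
        }
        catch (Exception ex)
        {
            // The emulator is not reachable yet, or creation failed.
            return new HealthCheckResult(context.Registration.FailureStatus, exception: ex);
        }
    }
}
```

The trade-off is that the check mutates state: the container only exists because something polled the health endpoint.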

@RussKie RussKie self-assigned this May 8, 2025
@github-actions github-actions bot added the area-integrations Issues pertaining to Aspire Integrations packages label May 8, 2025
@davidfowl
Member

Hmm, we had this debate when @sebastienros did the database creation and I thought we decided to use the ResourceReady event (which runs after health checks).

@eerhardt
Member

eerhardt commented May 8, 2025

Hmm, we had this debate when @sebastienros did the database creation and I thought we decided to use the ResourceReady event (which runs after health checks).

For databases, do we:

builder.Eventing.Subscribe<ResourceReadyEvent>(sqlServer, async (@event, ct) =>
{
    if (connectionString is null)
    {
        throw new DistributedApplicationException($"ResourceReadyEvent was published for the '{sqlServer.Name}' resource but the connection string was null.");
    }

    using var sqlConnection = new SqlConnection(connectionString);
    await sqlConnection.OpenAsync(ct).ConfigureAwait(false);

    if (sqlConnection.State != System.Data.ConnectionState.Open)
    {
        throw new InvalidOperationException($"Could not open connection to '{sqlServer.Name}'");
    }

    foreach (var sqlDatabase in sqlServer.DatabaseResources)
    {
        await CreateDatabaseAsync(sqlConnection, sqlDatabase, @event.Services, ct).ConfigureAwait(false);
    }
});

builder.Eventing.Subscribe<ResourceReadyEvent>(postgresServer, async (@event, ct) =>
{
    if (connectionString is null)
    {
        throw new DistributedApplicationException($"ResourceReadyEvent was published for the '{postgresServer.Name}' resource but the connection string was null.");
    }

    // Non-database scoped connection string
    using var npgsqlConnection = new NpgsqlConnection(connectionString + ";Database=postgres;");
    await npgsqlConnection.OpenAsync(ct).ConfigureAwait(false);

    if (npgsqlConnection.State != System.Data.ConnectionState.Open)
    {
        throw new InvalidOperationException($"Could not open connection to '{postgresServer.Name}'");
    }

    foreach (var name in postgresServer.Databases.Keys)
    {
        if (builder.Resources.FirstOrDefault(n => string.Equals(n.Name, name, StringComparisons.ResourceName)) is PostgresDatabaseResource postgreDatabase)
        {
            await CreateDatabaseAsync(npgsqlConnection, postgreDatabase, @event.Services, ct).ConfigureAwait(false);
        }
    }
});

In general, I dislike using health checks to mutate state. They should just be used to check health, not make something "healthy".

@RussKie
Contributor Author

RussKie commented May 8, 2025

I am not super fond of this result, but I couldn't find a way to add a healthcheck for individual blob containers.
I originally added blob container creation within ResourceReadyEvent; however, it appears this makes the test flaky - there's a race condition, and the client may attempt to access blob containers before they are created.
If I add a check alongside the blob storage, then ResourceReadyEvent never gets fired (since no containers exist yet). I couldn't add a healthcheck within the event - the service collection at this point is already locked.

Any suggestions?

@eerhardt
Member

eerhardt commented May 9, 2025

but I couldn't find a way to add a healthcheck for individual blob containers.

I don't think we do healthchecks for child resources anywhere else. For example, CosmosDB seems to only have it for the whole service.

builder.ApplicationBuilder.Eventing.Subscribe<ResourceReadyEvent>(builder.Resource, async (@event, ct) =>
{
    if (cosmosClient is null)
    {
        throw new InvalidOperationException("CosmosClient is not initialized.");
    }

    await cosmosClient.ReadAccountAsync().WaitAsync(ct).ConfigureAwait(false);

    foreach (var database in builder.Resource.Databases)
    {
        var db = (await cosmosClient.CreateDatabaseIfNotExistsAsync(database.DatabaseName, cancellationToken: ct).ConfigureAwait(false)).Database;

        foreach (var container in database.Containers)
        {
            await db.CreateContainerIfNotExistsAsync(container.ContainerName, container.PartitionKeyPath, cancellationToken: ct).ConfigureAwait(false);
        }
    }
});

Because the ResourceReadyEvent blocks the resource's "healthy" state until all ResourceReadyEvent listeners complete, the parent resource won't be marked "healthy" until creating the child resources is complete. And the child resources won't be "healthy" until the parent resource is "healthy".

I think we should be able to follow the existing patterns in Sql, Postgres, and in Azure CosmosDB. What doesn't work about the existing pattern?
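
For illustration, the same pattern applied to blob containers might look roughly like the sketch below. It assumes the caller already has the blobs resource, a way to obtain the BlobServiceClient, and the list of container names; the helper class, method, and parameter names are hypothetical, while Eventing.Subscribe<ResourceReadyEvent>, GetBlobContainerClient, and CreateIfNotExistsAsync are existing APIs.

```csharp
using System;
using System.Collections.Generic;
using Aspire.Hosting;
using Aspire.Hosting.ApplicationModel;
using Azure.Storage.Blobs;

// Hypothetical helper for illustration; not code from this PR.
public static class BlobContainerCreationSketch
{
    public static void CreateContainersWhenReady(
        IDistributedApplicationBuilder builder,
        IResource blobsResource,
        Func<BlobServiceClient?> getClient,              // assumption: set once the connection string is available
        IReadOnlyCollection<string> containerNames)      // assumption: names of the child container resources
    {
        // ResourceReadyEvent is published after the resource's health checks pass.
        builder.Eventing.Subscribe<ResourceReadyEvent>(blobsResource, async (@event, ct) =>
        {
            var client = getClient()
                ?? throw new InvalidOperationException(
                    $"BlobServiceClient was not created for the '{blobsResource.Name}' resource.");

            foreach (var name in containerNames)
            {
                // Idempotent creation, mirroring CreateDatabaseIfNotExistsAsync in the CosmosDB handler.
                await client.GetBlobContainerClient(name)
                    .CreateIfNotExistsAsync(cancellationToken: ct)
                    .ConfigureAwait(false);
            }
        });
    }
}
```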

Any other suggestions here @sebastienros or @mitchdenny ?

@sebastienros
Member

I don't think we do healthchecks for child resources anywhere else

SqlServer/Postgres databases have one. It works by using the database's own connection string, which has the Database= property in it, so successfully opening a connection with the connection string retrieved from ConnectionStringAvailableEvent is sufficient.
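
A rough blob-container analogue of that idea, sketched under assumptions: a read-only health check that probes the child container through its own client and never creates it. The type and extension method names are hypothetical; IHealthCheck, HealthCheckRegistration, AddHealthChecks, and BlobContainerClient.ExistsAsync are existing APIs.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical type for illustration; the name is not taken from the PR.
public sealed class BlobContainerExistsHealthCheck : IHealthCheck
{
    private readonly Func<BlobContainerClient> _clientFactory;

    public BlobContainerExistsHealthCheck(Func<BlobContainerClient> clientFactory)
        => _clientFactory = clientFactory;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // Read-only probe: ExistsAsync never creates the container.
            var exists = await _clientFactory().ExistsAsync(cancellationToken).ConfigureAwait(false);

            return exists.Value
                ? HealthCheckResult.Healthy()
                : new HealthCheckResult(context.Registration.FailureStatus, "Blob container does not exist yet.");
        }
        catch (Exception ex)
        {
            return new HealthCheckResult(context.Registration.FailureStatus, exception: ex);
        }
    }
}

public static class BlobContainerHealthCheckRegistration
{
    // Hypothetical extension method. The check name is what a resource would
    // later reference as its health check key.
    public static IHealthChecksBuilder AddBlobContainerCheck(
        this IServiceCollection services, string name, Func<BlobContainerClient> clientFactory)
        => services.AddHealthChecks().Add(new HealthCheckRegistration(
            name,
            _ => new BlobContainerExistsHealthCheck(clientFactory),
            failureStatus: null,
            tags: null));
}
```

With a check like this, the container resource stays unhealthy until something else (for example, a ResourceReadyEvent handler on the parent) has created it, so the health check itself does not mutate state.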

@RussKie RussKie force-pushed the igveliko/fix_9139 branch from 538fc7d to dae66f7 Compare May 13, 2025 01:00
@RussKie RussKie force-pushed the igveliko/fix_9139 branch from dae66f7 to ec3a2df Compare May 13, 2025 01:02
@RussKie
Contributor Author

RussKie commented May 13, 2025

Thanks @sebastienros for help and guidance. How does this look now?

@RussKie RussKie requested a review from radical as a code owner May 13, 2025 05:54
@RussKie RussKie removed the request for review from radical May 13, 2025 05:54
@sebastienros sebastienros requested a review from eerhardt May 15, 2025 15:58
Member

@eerhardt eerhardt left a comment

Looks good.

{
throw new DistributedApplicationException($"BlobServiceClient was not created for the '{builder.Resource.Name}' resource.");
}
// This event is triggered when the health check is healthy.
Contributor Author

Suggested change
- // This event is triggered when the health check is healthy.
+ // This event is triggered when the emulator has started, and BlobServiceClient is marked as healthy.

Member

@sebastienros sebastienros May 15, 2025

I disagree with the change; that's not what I wanted to convey. I am saying that this event happens after the health check, and only if it's healthy. Yes, it implies the emulator has started (with more information than just "started"), but "BlobServiceClient is healthy" doesn't mean much. The storage itself is healthy; the client is not "marked" anything.

Contributor Author

Fair enough, but "health check is healthy" isn't providing much information either. There are multiple health checks now; it would be good to clarify which health check is triggering this event.

Comment on lines +301 to +307
var healthCheckKey = $"{resource.Name}_check";

BlobServiceClient? blobServiceClient = null;
builder.ApplicationBuilder.Services.AddHealthChecks().AddAzureBlobStorage(sp =>
{
    return blobServiceClient ??= CreateBlobServiceClient(connectionString ?? throw new InvalidOperationException("Connection string is not initialized."));
}, name: healthCheckKey);
Contributor Author

Why do we need to duplicate the HC here? Or why do we need to keep the HC on lines 160-167?

Member

Here it's on the Blobs resource itself. Line 160 is on the Emulator resource. Doing it on the storage is not sufficient, as WaitForHealthyAsync doesn't bubble up to the parent resources.

If it were just for the existing tests we could probably not have this specific one. But it's more consistent to keep it if we do it for containers.
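
To make the "doesn't bubble up" point concrete, here is a small sketch of a test helper that waits on each resource name explicitly instead of relying on a single wait covering related resources. It assumes the wait in question is ResourceNotificationService.WaitForResourceHealthyAsync; the helper class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Aspire.Hosting;
using Aspire.Hosting.ApplicationModel;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical test helper for illustration; not code from this PR.
public static class HealthWaitSketch
{
    public static async Task WaitForAllHealthyAsync(
        DistributedApplication app,
        IEnumerable<string> resourceNames,   // e.g. the storage, blobs, and container resource names
        CancellationToken ct)
    {
        var notifications = app.Services.GetRequiredService<ResourceNotificationService>();

        foreach (var name in resourceNames)
        {
            // Each wait only observes the named resource's own health checks.
            await notifications.WaitForResourceHealthyAsync(name, ct).ConfigureAwait(false);
        }
    }
}
```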

@sebastienros sebastienros merged commit 7baf34b into main May 16, 2025
254 checks passed
@sebastienros sebastienros deleted the igveliko/fix_9139 branch May 16, 2025 18:21
@@ -135,9 +135,10 @@ public async Task VerifyAzureStorageEmulatorResource()

[Fact]
[RequiresDocker]
[QuarantinedTest("https://github.com/dotnet/aspire/issues/9139")]
Member

Are we sure that this can be dropped? For quarantined tests we want to take it out after it has been green for a certain number of runs (~100 right now).

Member

Will add it back then.

What else? Reopening the issues? Is tracking automatic or is there a process to follow to unquarantine like for aspnet?

Member

Yes, re-open the issue, and it will be tracked automatically. I will take care of taking it out of quarantine for now. It will get semi-automated in the medium term.

Member

Thank you!

@github-actions github-actions bot locked and limited conversation to collaborators Jun 16, 2025
Labels
area-integrations Issues pertaining to Aspire Integrations packages
Projects
None yet
5 participants