Storage - Fix Flaky Stress Tests #48359
Conversation
- Read functions used Sinks.EmitFailureHandler.FAIL_FAST, which threw an error when the stream had reached the end and we wanted to read from the stream again, so we removed it from both reads.
- Refactored the code so that the exit criteria is at the beginning.
- Refactored emitContentInfo to avoid duplication (DRY).
- Changed emitValue to tryEmitValue and removed Sinks.EmitFailureHandler.FAIL_FAST so that multiple closes do not cause an error to be thrown.
- opentelemetry-runtime-telemetry-java8 from 2.24.0-alpha -> 2.15.0-alpha
- opentelemetry-logback-appender-1.0 from 2.24.0-alpha -> 2.15.0-alpha
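The emitValue-to-tryEmitValue change can be sketched with a standalone Reactor example (reactor-core assumed on the classpath; the value 42L is just a stand-in for a computed CRC):

```java
import reactor.core.publisher.Sinks;

public class TryEmitDemo {
    public static void main(String[] args) {
        Sinks.One<Long> crcSink = Sinks.one();

        // First terminal emission (e.g. the first close) succeeds.
        Sinks.EmitResult first = crcSink.tryEmitValue(42L);

        // A second close would make emitValue(..., FAIL_FAST) throw an exception;
        // tryEmitValue reports the failure as a result code instead, so a
        // double-close is harmless.
        Sinks.EmitResult second = crcSink.tryEmitValue(42L);

        System.out.println(first + " " + second);
    }
}
```

This is why the stress streams stop failing on repeated terminal events: the second emission is simply ignored rather than propagated as an error.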
Pull request overview
This PR aims to reduce flakiness in Storage stress tests by making cleanup more resilient, making CRC telemetry streams tolerate re-subscription/double-close behaviors, and aligning dependencies with the chosen OpenTelemetry runtime metrics version.
Changes:
- Replace unconditional deletes with deleteIfExists() across multiple stress scenarios to avoid cleanup failures when resources are already gone.
- Add retry/timeout-based global cleanup logic in scenario base classes and add retry logic to async runs.
- Adjust CRC stream emission behavior to avoid failures on repeated terminal events; downgrade OpenTelemetry instrumentation dependencies to 2.15.0-alpha.
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 20 comments.
| File | Description |
|---|---|
| sdk/storage/azure-storage-stress/src/main/java/com/azure/storage/stress/TelemetryHelper.java | Adjust JVM runtime metrics registration and make timeout/cancellation detection null-safe. |
| sdk/storage/azure-storage-stress/src/main/java/com/azure/storage/stress/CrcOutputStream.java | Switch sink emission to tryEmitValue to tolerate double-close. |
| sdk/storage/azure-storage-stress/src/main/java/com/azure/storage/stress/CrcInputStream.java | Refactor EOF emission, add resubscription state reset, and switch to tryEmitValue. |
| sdk/storage/azure-storage-stress/pom.xml | Downgrade OTel runtime telemetry + logback appender to 2.15.0-alpha. |
| sdk/storage/azure-storage-file-share-stress/src/main/java/com/azure/storage/file/share/stress/UploadFromFile.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-file-share-stress/src/main/java/com/azure/storage/file/share/stress/ShareScenarioBase.java | Add retrying global cleanup + async retry behavior and new logging. |
| sdk/storage/azure-storage-file-share-stress/pom.xml | Downgrade OTel runtime telemetry + logback appender to 2.15.0-alpha. |
| sdk/storage/azure-storage-file-datalake-stress/src/main/java/com/azure/storage/file/datalake/stress/UploadFromFile.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-file-datalake-stress/src/main/java/com/azure/storage/file/datalake/stress/Upload.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-file-datalake-stress/src/main/java/com/azure/storage/file/datalake/stress/DataLakeScenarioBase.java | Add retrying global cleanup + async retry behavior and new logging. |
| sdk/storage/azure-storage-file-datalake-stress/pom.xml | Downgrade OTel runtime telemetry + logback appender to 2.15.0-alpha. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/UploadPages.java | Use deleteIfExists() and swallow delete errors during cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/Upload.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/StageBlock.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/PageBlobScenarioBase.java | Add retrying global cleanup + async retry behavior and new logging. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/PageBlobOutputStream.java | Use deleteIfExists() and swallow delete errors during cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/CommitBlockList.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/BlockBlobUpload.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/BlockBlobOutputStream.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/BlobScenarioBase.java | Add retrying global cleanup + async retry behavior and structured logging. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/AppendBlock.java | Use deleteIfExists() and swallow delete errors during cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/AppendBlobOutputStream.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-blob-stress/pom.xml | Downgrade OTel runtime telemetry + logback appender to 2.15.0-alpha. |
| sdk/parents/azure-client-sdk-parent/pom.xml | Downgrade io.clientcore:linting-extensions used by checkstyle plugin from beta.2 to beta.1. |
...talake-stress/src/main/java/com/azure/storage/file/datalake/stress/DataLakeScenarioBase.java
private Mono<Void> cleanupContainerWithRetry() {
    return asyncNoFaultContainerClient.deleteIfExists()
        .then() // Convert Mono<Boolean> to Mono<Void>
        .timeout(Duration.ofSeconds(30))
        .retry(3)
        .onErrorResume(error -> {
            // If container deletion fails, try to delete all blobs first then retry container deletion
            return deleteAllBlobsInContainer()
                .then(asyncNoFaultContainerClient.deleteIfExists())
                .then() // Convert Mono<Boolean> to Mono<Void>
                .timeout(Duration.ofSeconds(30))
                .onErrorResume(finalError -> {
                    // Log the error but don't fail the test
                    LOGGER.atWarning()
                        .addKeyValue("error", finalError.getMessage())
                        .log("Final container cleanup failed after retries");
                    return Mono.empty();
                });
        });
}

/**
 * Delete all blobs in the container to help with cleanup.
 */
private Mono<Void> deleteAllBlobsInContainer() {
    return asyncNoFaultContainerClient.listBlobs()
        .flatMap(blobItem ->
            asyncNoFaultContainerClient.getBlobAsyncClient(blobItem.getName()).delete())
        .then()
        .timeout(Duration.ofSeconds(60))
        .onErrorResume(error -> {
            // Log but continue - some blobs might have been deleted
            LOGGER.atWarning()
                .addKeyValue("error", error.getMessage())
                .log("Blob cleanup partially failed");
            return Mono.empty();
        });
}
cleanupContainerWithRetry/deleteAllBlobsInContainer uses flatMap without bounding concurrency. If a run creates many blobs, cleanup may issue a burst of delete requests and hit throttling/timeouts, making cleanup less reliable. Consider limiting concurrency (flatMap(..., concurrency)) or using concatMap, potentially with a small backoff between retries.
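The bounded-concurrency suggestion is a one-argument change on flatMap. A minimal sketch (deleteBlob and the blob names are invented stand-ins for the real per-blob delete call):

```java
import java.time.Duration;

import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class BoundedCleanup {
    // Hypothetical stand-in for asyncNoFaultContainerClient.getBlobAsyncClient(name).delete().
    static Mono<Void> deleteBlob(String name) {
        return Mono.<Void>fromRunnable(() -> System.out.println("deleted " + name));
    }

    public static void main(String[] args) {
        Flux.just("blob-a", "blob-b", "blob-c", "blob-d")
            // The second argument caps the number of in-flight deletes
            // (the default is 256), so cleanup cannot burst requests at the service.
            .flatMap(BoundedCleanup::deleteBlob, 2)
            .then()
            .block(Duration.ofSeconds(5));
        System.out.println("cleanup done");
    }
}
```

Using concatMap instead would drop concurrency to 1 and preserve order, at the cost of slower cleanup.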
sdk/storage/azure-storage-stress/src/main/java/com/azure/storage/stress/CrcOutputStream.java
private Mono<Void> deleteAllFilesInShare() {
    return asyncNoFaultShareClient.getDirectoryClient("").listFilesAndDirectories()
        .flatMap(fileRef ->
            asyncNoFaultShareClient.getFileClient(fileRef.getName()).delete())
        .then()
cleanupShareWithRetry/deleteAllFilesInShare currently treats every ShareFileItem as a file (getFileClient(...).delete()) and doesn't handle directories. Since listFilesAndDirectories() returns both files and directories, this will fail for directories and likely prevent share deletion, defeating the retry cleanup intent. Please branch on fileRef.isDirectory() and delete directories (ideally recursively, deleting children before the directory) as well as files.
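The children-before-directory ordering this comment asks for can be modeled without the Azure clients at all. A toy sketch (the tree contents and Entry type are invented; concatMap keeps deletes sequential and ordered):

```java
import java.util.List;
import java.util.Map;

import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class RecursiveCleanup {
    record Entry(String name, boolean isDirectory) {}

    // Toy listing: directory path -> children (stand-in for listFilesAndDirectories()).
    static final Map<String, List<Entry>> TREE = Map.of(
        "", List.of(new Entry("dir1", true), new Entry("file1", false)),
        "dir1", List.of(new Entry("dir1/file2", false)));

    static Mono<Void> deleteEntry(Entry e) {
        // For a directory, delete every child first; only then delete the entry itself.
        Mono<Void> children = e.isDirectory()
            ? Flux.fromIterable(TREE.getOrDefault(e.name(), List.of()))
                .concatMap(RecursiveCleanup::deleteEntry)
                .then()
            : Mono.empty();
        return children.then(Mono.<Void>fromRunnable(() -> System.out.println("deleted " + e.name())));
    }

    public static void main(String[] args) {
        Flux.fromIterable(TREE.get(""))
            .concatMap(RecursiveCleanup::deleteEntry)
            .blockLast();
        System.out.println("share empty");
    }
}
```

In the real fix the branch would call the share's directory client for directories and the file client for files; File Share directories must be empty before they can be deleted, which is why the recursion deletes children first.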
return asyncNoFaultShareClient.getDirectoryClient("").listFilesAndDirectories()
    .flatMap(fileRef ->
        asyncNoFaultShareClient.getFileClient(fileRef.getName()).delete())
    .then()
    .timeout(java.time.Duration.ofSeconds(60))
    .onErrorResume(error -> {
deleteAllFilesInShare uses flatMap without any concurrency limit, which can trigger a large number of parallel delete requests during cleanup and increase throttling/failure risk. Consider bounding concurrency (flatMap with a concurrency parameter) or using concatMap for more predictable cleanup behavior.
maybe we could also explore doing this synchronously to prevent throttling?
I think you're right. There aren't that many resources that need to be cleaned up, but piling them all at once could cause issues.
private Mono<Void> cleanupContainerWithRetry() {
    return asyncNoFaultContainerClient.deleteIfExists()
        .then() // Convert Mono<Boolean> to Mono<Void>
        .timeout(Duration.ofSeconds(30))
        .retry(3)
        .onErrorResume(error -> {
            // If container deletion fails, try to delete all blobs first then retry container deletion
            return deleteAllBlobsInContainer()
                .then(asyncNoFaultContainerClient.deleteIfExists())
                .then() // Convert Mono<Boolean> to Mono<Void>
                .timeout(Duration.ofSeconds(30))
                .onErrorResume(finalError -> {
                    // Log the error but don't fail the test
                    LOGGER.atWarning()
                        .addKeyValue("error", finalError.getMessage())
                        .log("Final container cleanup failed after retries");
                    return Mono.empty();
                });
        });
}

/**
 * Delete all blobs in the container to help with cleanup.
 */
private Mono<Void> deleteAllBlobsInContainer() {
    return asyncNoFaultContainerClient.listBlobs()
        .flatMap(blobItem ->
            asyncNoFaultContainerClient.getBlobAsyncClient(blobItem.getName()).delete())
        .then()
        .timeout(Duration.ofSeconds(60))
        .onErrorResume(error -> {
            // Log but continue - some blobs might have been deleted
            LOGGER.atWarning()
                .addKeyValue("error", error.getMessage())
                .log("Blob cleanup partially failed");
            return Mono.empty();
        });
}
cleanupContainerWithRetry/deleteAllBlobsInContainer uses flatMap without a concurrency limit, which can create a large burst of delete calls and increase throttling/timeouts during cleanup. Consider bounding concurrency or switching to concatMap for more reliable cleanup.
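The "small backoff between retries" idea from the earlier comment also applies here: Reactor's retryWhen with Retry.backoff spaces attempts out instead of retrying immediately. A minimal sketch with a simulated transient failure (the failure counter and messages are invented):

```java
import java.time.Duration;

import reactor.core.publisher.Mono;
import reactor.util.retry.Retry;

public class RetryWithBackoff {
    public static void main(String[] args) {
        int[] attempts = {0};
        String result = Mono.fromCallable(() -> {
                if (++attempts[0] < 3) {
                    // Simulated transient failure, e.g. service throttling.
                    throw new IllegalStateException("throttled");
                }
                return "deleted";
            })
            // Unlike the bare .retry(3), backoff waits between attempts
            // (10 ms, then exponentially longer), giving a throttled service
            // time to recover before the next try.
            .retryWhen(Retry.backoff(3, Duration.ofMillis(10)))
            .block();
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```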
sdk/storage/azure-storage-stress/src/main/java/com/azure/storage/stress/CrcInputStream.java
...re-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/PageBlobOutputStream.java
sdk/storage/azure-storage-stress/src/main/java/com/azure/storage/stress/CrcOutputStream.java
return asyncNoFaultShareClient.getDirectoryClient("").listFilesAndDirectories()
    .flatMap(fileRef ->
        asyncNoFaultShareClient.getFileClient(fileRef.getName()).delete())
    .then()
    .timeout(java.time.Duration.ofSeconds(60))
    .onErrorResume(error -> {
maybe we could also explore doing this synchronously to prevent throttling?
private Mono<Void> deleteAllFilesInFileSystem() {
    return asyncNoFaultFileSystemClient.listPaths()
        .flatMap(pathItem ->
            asyncNoFaultFileSystemClient.getFileAsyncClient(pathItem.getName()).delete())
private Mono<Void> cleanupContainerWithRetry() {
    return asyncNoFaultContainerClient.deleteIfExists()
        .then() // Convert Mono<Boolean> to Mono<Void>
        .timeout(Duration.ofSeconds(30))
        .retry(3)
        .onErrorResume(error -> {
            // If container deletion fails, try to delete all blobs first then retry container deletion
            return deleteAllBlobsInContainer()
                .then(asyncNoFaultContainerClient.deleteIfExists())
                .then() // Convert Mono<Boolean> to Mono<Void>
                .timeout(Duration.ofSeconds(30))
                .onErrorResume(finalError -> {
                    // Log the error but don't fail the test
                    LOGGER.atWarning()
                        .addKeyValue("error", finalError.getMessage())
                        .log("Final container cleanup failed after retries");
                    return Mono.empty();
                });
        });
}

/**
 * Delete all blobs in the container to help with cleanup.
 */
private Mono<Void> deleteAllBlobsInContainer() {
    return asyncNoFaultContainerClient.listBlobs()
        .flatMap(blobItem ->
            asyncNoFaultContainerClient.getBlobAsyncClient(blobItem.getName()).delete())
        .then()
        .timeout(Duration.ofSeconds(60))
        .onErrorResume(error -> {
            // Log but continue - some blobs might have been deleted
            LOGGER.atWarning()
                .addKeyValue("error", error.getMessage())
                .log("Blob cleanup partially failed");
            return Mono.empty();
        });
}
This is a fix for the issues we've been having with the stress tests.