Move http-stress to core-v2 #45135

Draft
wants to merge 1 commit into
base: main
204 changes: 37 additions & 167 deletions sdk/clientcore/http-stress/README.md
@@ -1,6 +1,6 @@
# Stress tests for Azure <Name> client library for Java
# Stress tests for Azure Core v2 HTTP client library for Java

This package contains a template project for stress tests and recommendations on how to create them for your library.
This package contains the stress test project for the Azure Core v2 HTTP stack. It demonstrates how to create and run stress tests for the Azure SDK for Java core HTTP pipeline and client infrastructure.

## Getting started

@@ -18,54 +18,32 @@ Check out [Azure SDK Stress Test Wiki][azure_sdk_stress_test] for general inform

### Deploy Stress Test

cd into `azure-sdk-for-java` root folder and run command to deploy the package to cluster:
Change directory to the Azure SDK for Java root and deploy the package to your cluster:

```shell
./eng/common/scripts/stress-testing/deploy-stress-tests.ps1 -MatrixSelection all -SearchDirectory ./sdk/<your service directory>
./eng/common/scripts/stress-testing/deploy-stress-tests.ps1 -MatrixSelection all -SearchDirectory ./sdk/core-v2
```

### Check Status

Only the most frequently used commands are listed below. See [Deploying A Stress Test][deploy_stress_test] for more details.
See [Deploying A Stress Test][deploy_stress_test] for more details.

List deployed packages:

```shell
helm list -n <stress test namespace>
```

The namespace usually matches your username.

Get stress test pods and status:

```shell
kubectl get pods -n <stress test namespace>
```

To get readable metadata for pods and/or containers, use:

```shell
kubectl describe pod -n <stress test namespace> <stress test pod name> -c <container-name>
```

Get stress test pod logs:

```shell
kubectl logs -n <stress test namespace> <stress test pod name>
# Note that we may define multiple containers (for example, `fault-injector` and `main`)
kubectl logs -n <stress test namespace> <stress test pod name> -c <container name>
```

If a stress test pod is in the `Error` status, check logs from its containers:

```shell
kubectl logs -n <stress test namespace> <stress test pod name>
```

You may also get logs for specific containers:

```shell
kubectl logs -n <stress test namespace> <stress test pod name> -c <container-name>
```

Stop and remove deployed package:
@@ -80,28 +58,21 @@ Execute commands in the container:

```shell
kubectl exec --stdin --tty -n <stress test namespace> <pod name> -c <container name> -- /bin/bash
```

### Share data from within the container

Stress containers run with the `$DEBUG_SHARE` environment variable set to the location of the shared folder. You can put anything you want to share there and access it later; see https://aka.ms/azsdk/stress/fileshare.

## Key concepts

### Project Structure

See [Layout][stress_test_layout] section for details.

Below is the current structure of project:
```
.
├── src/ # Test code
├── templates/ # A directory of helm templates that will generate Kubernetes manifest files.
├── workbooks/ # A directory of Azure Monitor workbooks for analyzing stress test results.
├── Chart.yaml # A YAML file containing information about the helm chart and its dependencies
├── scenarios-matrix.yaml # A YAML file containing configuration and custom values for stress test(s)
├── Dockerfile # A Dockerfile for building the stress test image
├── stress-test-resources.bicep # An Azure Bicep for deploying stress test azure resources
├── templates/ # Helm templates for Kubernetes manifests
├── workbooks/ # Azure Monitor workbooks for analyzing stress test results
├── Chart.yaml # Helm chart metadata
├── scenarios-matrix.yaml # Configuration for stress test scenarios
├── Dockerfile # Dockerfile for building the stress test image
├── stress-test-resources.bicep # Azure Bicep for deploying stress test resources
├── pom.xml
└── README.md
```
@@ -110,139 +81,43 @@ Below is the current structure of project:

Start with [Azure SDK stress Wiki](https://aka.ms/azsdk/stress) to learn about stress tests.

1. Copy `src/main/java/com/azure/sdk/clientcore/http-stress` folder to your service folder.
2. Update the code
- Update `pom.xml` to change artifact name and add dependencies on your service.
- Implement your first stress test instead of `HttpGet` and make sure to update `StressTestOptions` to include important parameters for your tests.

Now you can run stress tests locally. Remaining steps are required to run tests on a stress cluster.

3. Update `dockerfiles` to build your service artifacts and any dependencies of current version.
4. Describe Azure resources necessary for your tests in `stress-test-resources.bicep`
5. Update `Chart.yaml`:
- change chart `name` to include your service name. Please keep `java-` prefix.
- change `annotations.stressTest` to `true` to enable auto-discovery
6. Update `templates/job.yaml`
- remove the `server` container as you probably don't need it
- replace occurrences of `java-template` to match the name in `Chart.yaml`
- update test parameters for the `test` container; feel free to rename the container as you see fit
7. Define scenarios and parameters in `scenarios-matrix.yaml`

Now you're ready to run tests with `./eng/common/scripts/stress-testing/deploy-stress-tests.ps1 -SearchDirectory ./sdk/<your service directory>`.
See [Deploying A Stress Test][deploy_stress_test] for more details.

Let's see how we can check test results.

### Checking test results
1. Copy `src/main/java/com/azure/core/http/stress` to your service folder.
2. Update the code:
- Update `pom.xml` to change artifact name and add dependencies on your service.
- Implement your first stress test instead of `HttpGet` and update `StressOptions` for your test parameters.

#### Stress Test Dashboard
Now you can run stress tests locally. Remaining steps are required to run tests on a stress cluster.

General-purpose stress test dashboard is available at https://aka.ms/azsdk/stress/dashboard. It shows:
- Pod status events
- CPU and memory utilization of the stress test pods
- Container logs and events
3. Update `dockerfiles` to build your service artifacts and dependencies.
4. Describe Azure resources in `stress-test-resources.bicep`.
5. Update `Chart.yaml` and `templates/job.yaml` for your service.
6. Define scenarios and parameters in `scenarios-matrix.yaml`.

Stress test dashboard does not know about local stress test runs.
Now you're ready to run tests with `./eng/common/scripts/stress-testing/deploy-stress-tests.ps1 -SearchDirectory ./sdk/core-v2`.

#### Application Insights

Stress test template comes with OpenTelemetry and rich monitoring experience including:
- resource utilization metrics (CPU, memory, GC, threads, etc.)
- live metrics, performance overview, etc
- distributed tracing and dependency calls (HTTP, Azure SDK calls)
- exceptions and logs
- profiling in production

The telemetry is sent to Application Insights where it's useful to:
- monitor and compare throughput and latency across runs
- investigate issues and find bottlenecks

You may choose to use [ApplicationInsights Java agent](https://learn.microsoft.com/azure/azure-monitor/app/opentelemetry-enable?tabs=java#install-the-client-library) if
your test throughput (and amount of telemetry it generates) is relatively low.
Since agent does a lot of things, it might create some noise during performance analysis and micro-optimizations.

Execute the perf test with Application Insights enabled:
`$env:APPLICATIONINSIGHTS_CONNECTION_STRING="value"; java -jar "/path to/your file.jar" <options-for-the-test>`

> Note: If you're running tests locally, provide the `APPLICATIONINSIGHTS_CONNECTION_STRING` environment variable.
> You don't need to set the `javaagent` explicitly to send telemetry to Application Insights.

### Logging

We use [logback.xml][logback_xml] to configure the logging. By default, the stress test run on cluster will output
`WARN` level log which you may adjust based on your needs.
You may also control the verbosity of logs that go to Application Insights - see [OpenTelemetry logback appender][opentelemetry-logback] for more details.

Since logs are hard to query and are extremely verbose (in case of high-scale stress tests), we're relying on metrics and workbooks for test result analysis.

See also [Logging in Azure SDK][logging-azure-sdk].

### Metrics

While some Azure SDKs provide custom metrics, we're going to collect generic test metrics and build queries/workbooks on top of them,
so it's important to reuse the same metric across different tests whenever possible.

We need just one generic metric for basic analysis - the one that measures duration of one test execution (with additional dimensions).
It's implemented in `io.clientcore.http.stress.util.TelemetryHelper` and has the following semantic:
- name: `test.run.duration` - it is used in the stress workbook, so make sure to use the same name when applicable
- unit: seconds
- customDimensions:
- `error.type` - The low-cardinality type of error describing what happened (e.g., exception class name).
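
As a rough illustration of the semantics above, the following JDK-only sketch times a single operation in seconds and derives a low-cardinality `error.type` from the exception class name. `TimedRun` and `measure` are hypothetical names invented for this example; the actual metric is recorded by `TelemetryHelper` through OpenTelemetry, which this sketch does not attempt to reproduce.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of the project): times one test operation. */
final class TimedRun {
    final double durationSeconds;
    final String errorType; // null when the operation succeeded

    private TimedRun(double durationSeconds, String errorType) {
        this.durationSeconds = durationSeconds;
        this.errorType = errorType;
    }

    /** Runs the operation once and returns its duration and error type. */
    static TimedRun measure(Callable<?> operation) {
        long start = System.nanoTime();
        String errorType = null;
        try {
            operation.call();
        } catch (Exception e) {
            // Use the class name, not the message: messages are high-cardinality.
            errorType = e.getClass().getName();
        }
        double seconds = (System.nanoTime() - start) / 1_000_000_000.0;
        return new TimedRun(seconds, errorType);
    }

    public static void main(String[] args) {
        TimedRun ok = measure(() -> "done");
        TimedRun failed = measure(() -> {
            throw new IllegalStateException("simulated failure");
        });
        System.out.println("success: error.type=" + ok.errorType);
        System.out.println("failure: error.type=" + failed.errorType);
    }
}
```

In a real test, the two values would be reported as one `test.run.duration` measurement with `error.type` attached as a dimension.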

The metric should measure exactly one test operation, so we'll be able to derive the key performance indicators from it such as:
- throughput (rate of operations per period of time)
- duration of one operation
- error rate (how frequently errors of different types occur)
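
To make the derivation concrete, here is a small sketch of how these three indicators could be computed from a list of per-operation samples. `RunSummary` and `Sample` are hypothetical names used only for illustration; in the stress infrastructure the equivalent aggregation is done by workbook queries over the exported metric, not by test code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical aggregation over duration samples (illustration only). */
final class RunSummary {
    /** One measured operation: duration in seconds and error type (null on success). */
    static final class Sample {
        final double durationSeconds;
        final String errorType;

        Sample(double durationSeconds, String errorType) {
            this.durationSeconds = durationSeconds;
            this.errorType = errorType;
        }
    }

    /** Operations per second over the observation window. */
    static double throughput(List<Sample> samples, double windowSeconds) {
        return samples.size() / windowSeconds;
    }

    /** Mean duration of one operation, in seconds. */
    static double averageDuration(List<Sample> samples) {
        return samples.stream().mapToDouble(s -> s.durationSeconds).average().orElse(0.0);
    }

    /** Fraction of all operations that failed, keyed by error type. */
    static Map<String, Double> errorRates(List<Sample> samples) {
        Map<String, Double> rates = new TreeMap<>();
        for (Sample s : samples) {
            if (s.errorType != null) {
                rates.merge(s.errorType, 1.0 / samples.size(), Double::sum);
            }
        }
        return rates;
    }

    public static void main(String[] args) {
        List<Sample> samples = Arrays.asList(
            new Sample(0.1, null),
            new Sample(0.3, null),
            new Sample(0.2, "java.io.IOException"),
            new Sample(0.4, null));
        System.out.println("throughput (ops/s): " + throughput(samples, 2.0));
        System.out.println("average duration (s): " + averageDuration(samples));
        System.out.println("error rates: " + errorRates(samples));
    }
}
```

Because every indicator comes from the same samples, a single metric with an error dimension is enough for the basic analysis.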

Each metric collected with OpenTelemetry (and exported to Application Insights) also has the following dimensions:
- `cloud_RoleName` - in case of stress tests, it matches value of `otel.service.name` property configured in `Chart.yaml` to `{{ .Release.Name }}-{{ .Stress.BaseName }}`.
- `cloud_RoleInstance` - in case of k8s it matches pod name and is auto-detected.

When running multiple test containers, make sure to assign different role instances to them, for example use `{{ .Stress.BaseName }}-consumer` and `{{ .Stress.BaseName }}-producer`.
This would allow you to distinguish telemetry coming from different containers.

You would need to adjust the workbook to accommodate those changes.

In addition to `test.run.duration`, we're also collecting:
- [JVM metrics](https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/runtime-telemetry/runtime-telemetry-java8/library/README.md) measured by OpenTelemetry:
- CPU and memory usage
- GC stats
- Thread count
- Class stats
- See [JVM metrics semantic conventions for the details](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/runtime/jvm-metrics.md)
### Checking test results

You can also enable [reactor schedulers metrics](https://github.com/reactor/reactor-core/blob/962aeb77a09088fa2a7bac6d814c2b35220b1d35/docs/modules/ROOT/pages/metrics.adoc) collection by installing `micrometer-core` and
[OpenTelemetry micrometer bridge](https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/micrometer/micrometer-1.5/library).
See [Stress Test Dashboard](https://aka.ms/azsdk/stress/dashboard) and [Application Insights](https://learn.microsoft.com/azure/azure-monitor/app/opentelemetry-enable?tabs=java#install-the-client-library) for monitoring and telemetry.

### Stress test workbook
### Logging and Metrics

[Stress test workbook](https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/faa080af-c1d8-40ad-9cce-e1a450ca5b57/resourceGroups/rg-stress-cluster-pg/providers/Microsoft.Insights/components/stress-pg-ai-s7b6dif73rup6/workbooks)
shows a summary of a test run.
- Logging is configured via `logback.xml`.
- Metrics are collected using OpenTelemetry and exported to Application Insights.
- The main metric is `test.run.duration` (seconds), implemented in `com.azure.core.http.stress.util.TelemetryHelper`.

First, select a time range and run from the list, then check the report:
- `Test summary` contains key test parameters and key counters (total number of operations, errors, etc.)
- Test operation success rate, latency, and error rate
- CPU and memory utilization, number of threads, and time spent in GC
- Warnings, errors, and exceptions in logs. Note that logs and traces are sampled at a 1% rate, so you won't see every error there
### Example: Running a Stress Test

Since you're changing the chart name, update the workbook to use `java-your-service-name` instead of `java-template`.
To create a new workbook for your service, follow the
[Azure Monitor workbook documentation](https://learn.microsoft.com/azure/azure-monitor/visualize/workbooks-create-workbook).
Then you can import the JSON file from the `workbooks` folder.
```java readme-sample-runStressTest
public class RunStressTest {
    public static void main(String[] args) {
        com.azure.core.http.stress.App.main(args);
    }
}
```

## Writing useful tests

Stress tests are intended to detect reliability and resiliency issues:
- bugs in retry policy
- graceful degradation under high load and transient failures
- memory and connection leaks, thread pool starvation, etc

To explore fault injection options, check out [Chaos mesh](https://github.com/Azure/azure-sdk-tools/blob/main/tools/stress-cluster/chaos/README.md#chaos-manifest) and [Http Fault injector](https://github.com/Azure/azure-sdk-tools/tree/main/tools/http-fault-injector).

> Note: [Azure Chaos Studio](https://azure.microsoft.com/products/chaos-studio) is not currently supported by the stress test infra.

Even without fault injection, by applying maximum load to the service, we can detect memory leaks, extensive allocations,
thread pool issues, or other performance issues in the code. So make sure to configure resource limits and apply the maximum load you can get under them.
Stress tests are intended to detect reliability and resiliency issues, such as retry policy bugs, resource leaks, and performance bottlenecks. For fault injection, see [Chaos mesh](https://github.com/Azure/azure-sdk-tools/blob/main/tools/stress-cluster/chaos/README.md#chaos-manifest).
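
One way to apply sustained load without fault injection is a fixed pool of workers that call the operation in a tight loop until a deadline. The sketch below assumes nothing about the project's own scenario runner; `LoadLoop` and its parameters are hypothetical and use only the JDK.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical load generator (illustration only): fixed concurrency, fixed duration. */
final class LoadLoop {
    /** Runs the operation on {@code concurrency} threads and returns the number of attempts. */
    static long run(Runnable operation, int concurrency, long durationMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        AtomicLong attempts = new AtomicLong();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(durationMillis);

        for (int i = 0; i < concurrency; i++) {
            pool.execute(() -> {
                // Overflow-safe comparison against the nanoTime-based deadline.
                while (System.nanoTime() - deadline < 0) {
                    try {
                        operation.run();
                    } catch (RuntimeException e) {
                        // A real test would record the error type here instead of ignoring it.
                    }
                    attempts.incrementAndGet();
                }
            });
        }

        pool.shutdown();
        try {
            // Allow a grace period for in-flight operations to finish.
            if (!pool.awaitTermination(durationMillis + 5_000, TimeUnit.MILLISECONDS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return attempts.get();
    }

    public static void main(String[] args) {
        long attempts = run(() -> Math.sqrt(42.0), 4, 500);
        System.out.println("attempts: " + attempts);
    }
}
```

The concurrency should be tuned against the container's resource limits so that the test saturates them rather than the load generator itself.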

<!-- links -->
[azure_sdk_stress_test]: https://aka.ms/azsdk/stress
@@ -253,9 +128,4 @@ thread pool issues, or other performance issues in the code. So make sure to con
[helm]: https://helm.sh/docs/intro/install/
[azure_cli]: https://learn.microsoft.com/cli/azure/install-azure-cli
[powershell]: https://learn.microsoft.com/powershell/scripting/install/installing-powershell?view=powershell-7
[enable_application_insights]: https://learn.microsoft.com/en-us/azure/azure-monitor/app/opentelemetry-enable?tabs=java#enable-azure-monitor-application-insights
[logback_xml]: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/servicebus/azure-messaging-servicebus-stress/src/main/resources/logback.xml
[deploy_stress_test]: https://github.com/Azure/azure-sdk-tools/blob/main/tools/stress-cluster/chaos/README.md#deploying-a-stress-test
[stress_test_layout]: https://github.com/Azure/azure-sdk-tools/blob/main/tools/stress-cluster/chaos/README.md#layout
[opentelemetry-logback]: https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/logback/logback-appender-1.0/library
[logging-azure-sdk]: https://github.com/Azure/azure-sdk-for-java/wiki/Logging-in-Azure-SDK
71 changes: 38 additions & 33 deletions sdk/clientcore/http-stress/pom.xml
@@ -1,20 +1,20 @@
<!-- Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License. -->
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

<modelVersion>4.0.0</modelVersion>

<parent>
<groupId>io.clientcore</groupId>
<artifactId>clientcore-parent</artifactId>
<version>1.0.0-beta.3</version> <!-- {x-version-update;io.clientcore:clientcore-parent;current} -->
<relativePath>../../parents/clientcore-parent</relativePath>
<groupId>com.azure.v2</groupId>
<artifactId>azure-client-sdk-parent</artifactId>
<version>2.0.0-beta.1</version> <!-- {x-version-update;com.azure.v2:azure-client-sdk-parent;current} -->
<relativePath>../../parents/azure-client-sdk-parent-v2/pom.xml</relativePath>
</parent>

<groupId>io.clientcore</groupId>
<groupId>com.azure.v2</groupId>
<artifactId>http-stress</artifactId>
<version>1.0.0-beta.1</version> <!-- {x-version-update;io.clientcore:http-stress;current} -->
<version>1.0.0-beta.1</version> <!-- {x-version-update;com.azure.v2:http-stress;current} -->
<packaging>jar</packaging>

<properties>
@@ -34,6 +34,11 @@
<artifactId>core</artifactId>
<version>1.0.0-beta.9</version> <!-- {x-version-update;io.clientcore:core;current} -->
</dependency>
<dependency>
<groupId>com.azure</groupId>
<artifactId>azure-core</artifactId>
<version>2.0.0</version> <!-- {x-version-update;com.azure.v2:azure-core;current} -->
</dependency>
<dependency>
<groupId>io.clientcore</groupId>
<artifactId>http-okhttp3</artifactId>
@@ -97,32 +102,32 @@
<version>3.6.0</version> <!-- {x-version-update;org.apache.maven.plugins:maven-shade-plugin;external_dependency} -->
<!-- we need shade plugin to merge MANIFEST-INF properly into the uber jar-->
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>io.clientcore.http.stress.App</mainClass>
</transformer>
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
</transformers>
<finalName>${project.artifactId}-${project.version}-jar-with-dependencies</finalName>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/maven/**</exclude>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
</execution>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>com.azure.core.http.stress.App</mainClass>
</transformer>
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
</transformers>
<finalName>${project.artifactId}-${project.version}-jar-with-dependencies</finalName>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/maven/**</exclude>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
</plugins>