chore(tests): add network perf tests for Retina #772

Open
wants to merge 9 commits into
base: main

Conversation

ritwikranjan
Contributor

Description

This change adds network performance tests for Retina. The test creates a new AKS cluster, runs the tests defined in ritwikranjan/perf-tests to get a baseline, then installs Retina in basic mode and runs the tests again. The two result sets are then compared, and the resulting regressions are published. Sample results are attached below.
network-regression-results-20240922162603.json
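A minimal sketch of the comparison step described above, not the code in this PR. It assumes each run is reduced to a flat map from metric name to value; the `Metrics` type, the `PercentChange` function, and the numbers in `main` are all made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Metrics maps a metric name (for example "Total Throughput") to its value.
// The flat map layout is an assumption made for this sketch.
type Metrics map[string]float64

// PercentChange returns, for every metric present in both runs, the relative
// change of the Retina run against the baseline, in percent.
// Metrics with a zero baseline are skipped to avoid dividing by zero.
func PercentChange(baseline, withRetina Metrics) map[string]float64 {
	out := make(map[string]float64)
	for name, base := range baseline {
		cur, ok := withRetina[name]
		if !ok || base == 0 {
			continue
		}
		out[name] = (cur - base) / base * 100
	}
	return out
}

func main() {
	// Illustrative values only; these are not measurements from this PR.
	baseline := Metrics{"Total Throughput": 1000, "Mean RTT": 200}
	withRetina := Metrics{"Total Throughput": 950, "Mean RTT": 210}

	changes := PercentChange(baseline, withRetina)

	// Sort the names so the report order is stable between runs.
	names := make([]string, 0, len(changes))
	for name := range changes {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s: %+.1f%%\n", name, changes[name])
	}
}
```

Whether a positive change counts as a regression depends on the metric: lower throughput is worse, while lower RTT is better, so a real report would need a per-metric direction.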

Related Issue

If this pull request is related to any issue, please mention it here. Additionally, make sure that the issue is assigned to you before submitting this pull request.

Checklist

  • I have read the contributing documentation.
  • I signed and signed-off the commits (git commit -S -s ...). See this documentation on signing commits.
  • I have correctly attributed the author(s) of the code.
  • I have tested the changes locally.
  • I have followed the project's style guidelines.
  • I have updated the documentation, if necessary.
  • I have added tests, if applicable.

Screenshots (if applicable) or Testing Completed

Please add any relevant screenshots or GIFs to showcase the changes made.

Additional Notes

Add any additional notes or context about the pull request here.


Please refer to the CONTRIBUTING.md file for more information on how to contribute to this project.

- Added new performance tests for iperf throughput (TCP and UDP)
- Metrics include CPU Utilization Host, CPU Utilization Remote, Max RTT, Mean RTT, Min RTT, Retransmits, and Total Throughput

This commit introduces new performance tests to measure iperf throughput under various conditions for the Retina project.

Signed-off-by: Ritwik Ranjan <[email protected]>
@ritwikranjan ritwikranjan requested a review from a team as a code owner September 23, 2024 13:28
@ritwikranjan ritwikranjan changed the title [WIP] chore/ Network perf test for Retina [WIP] chore/tests: add network perf tests for Retina Sep 23, 2024
@ritwikranjan ritwikranjan changed the title [WIP] chore/tests: add network perf tests for Retina chore/tests: add network perf tests for Retina Sep 27, 2024
@ritwikranjan ritwikranjan changed the title chore/tests: add network perf tests for Retina chore(tests): add network perf tests for Retina Sep 27, 2024
Member

@SRodi SRodi left a comment

I ran the test in uksouth and got this:

                                --------------------------------------------------------------------------------
                                RESPONSE 400: 400 Bad Request
                                ERROR CODE: ErrCode_InsufficientVCPUQuota
                                --------------------------------------------------------------------------------
                                {
                                  "code": "ErrCode_InsufficientVCPUQuota",
                                  "details": null,
                                  "message": "Insufficient regional vcpu quota left for location uksouth. left regional vcpu quota 20, requested quota 36",
                                  "subcode": ""
                                }
                                --------------------------------------------------------------------------------
                Test:           TestPerfRetina

I also ran the test in westus2, where quota was not an issue, but I got the following:

2024/09/27 17:48:52 DaemonSet is not ready: kube-system/retina-agent. 0 out of 3 expected pods are ready
2024/09/27 17:48:54 DaemonSet is not ready: kube-system/retina-agent. 0 out of 3 expected pods are ready
2024/09/27 17:48:56 DaemonSet is not ready: kube-system/retina-agent. 0 out of 3 expected pods are ready
2024/09/27 17:48:58 DaemonSet is not ready: kube-system/retina-agent. 0 out of 3 expected pods are ready
2024/09/27 17:49:00 DaemonSet is not ready: kube-system/retina-agent. 0 out of 3 expected pods are ready
2024/09/27 17:49:02 Error received when checking status of resource retina-svc. Error: 'client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline', Resource details: 'Resource: "/v1, Resource=services", GroupVersionKind: "/v1, Kind=Service"
Name: "retina-svc", Namespace: "kube-system"'
2024/09/27 17:49:02 Retryable error? true
2024/09/27 17:49:02 Retrying as current number of retries 0 less than max number of retries 30
    runner.go:27: 
                Error Trace:    /home/srodi/src/retina/test/e2e/framework/types/runner.go:27
                                                        /home/srodi/src/retina/test/e2e/retina_perf_test.go:65
                Error:          Received unexpected error:
                                did not expect error from step InstallHelmChart but got error: failed to install chart: context deadline exceeded
                Test:           TestPerfRetina
DeleteResourceGroup setting stored value for parameter [SubscriptionID] set as [......-.....-....-....-.........]
DeleteResourceGroup setting stored value for parameter [ResourceGroupName] set as [srodi-e2e-netobs-1727452628]
DeleteResourceGroup setting stored value for parameter [Location] set as [westus2]
#################### DeleteResourceGroup ################################################################
2024/09/27 17:49:02 deleting resource group "srodi-e2e-netobs-1727452628"...
2024/09/27 17:49:05 resource group "srodi-e2e-netobs-1727452628" deleted successfully
--- FAIL: TestPerfRetina (3269.87s)

FYI @ritwikranjan

@ritwikranjan ritwikranjan self-assigned this Oct 1, 2024
@ritwikranjan ritwikranjan added the type/enhancement New feature or request label Oct 1, 2024
Comment on lines +63 to +65
// Gather benchmark results then install retina and run the performance tests
runner := types.NewRunner(t, jobs.RunPerfTest(kubeConfigFilePath, chartPath))
runner.Run()
Member

What are we actually testing here? What assertions of behavior are being made?

Contributor Author

This test only generates regression data; it does not assert anything.

Member

If that's the case, we should just create a separate binary artifact that runs these perf tests. The testing framework is only getting in the way here, since it wasn't designed for this purpose.

Signed-off-by: Ritwik Ranjan <[email protected]>
@SRodi
Member

SRodi commented Oct 2, 2024

@ritwikranjan I just got another failure due to insufficient quota, this time in centralus. I would suggest making sure the test can run in any of the regions in the locations slice ([]string{"eastus2", "centralus", "southcentralus", "uksouth", "centralindia", "westus2"}).

    runner.go:27: 
                Error Trace:    /home/srodi/src/retina/test/e2e/framework/types/runner.go:27
                                                        /home/srodi/src/retina/test/e2e/retina_perf_test.go:52
                Error:          Received unexpected error:
                                did not expect error from step CreateNPMCluster but got error: failed to finish the create cluster request: PUT https://management.azure.com/subscriptions/....-....-....-....-.........../resourceGroups/srodi-e2e-netobs-1727879517/providers/Microsoft.ContainerService/managedClusters/srodi-e2e-netobs-1727879517
                                --------------------------------------------------------------------------------
                                RESPONSE 400: 400 Bad Request
                                ERROR CODE: ErrCode_InsufficientVCPUQuota
                                --------------------------------------------------------------------------------
                                {
                                  "code": "ErrCode_InsufficientVCPUQuota",
                                  "details": null,
                                  "message": "Insufficient vcpu quota requested 32, remaining 0 for family standardDSv2Family for region centralus.",
                                  "subcode": ""
                                }
                                --------------------------------------------------------------------------------
                Test:           TestE2EPerfRetina
--- FAIL: TestE2EPerfRetina (26.22s)
FAIL
FAIL    command-line-arguments  26.239s
FAIL

Member

@SRodi SRodi left a comment

@ritwikranjan I am getting the following error when running the test on the most recent commit:

    runner.go:27: 
                Error Trace:    /home/srodi/src/retina/test/e2e/framework/types/runner.go:27
                                                        /home/srodi/src/retina/test/e2e/retina_perf_test.go:63
                Error:          Received unexpected error:
                                did not expect error from step GetNetworkPerformanceMeasures but got error: failed to get network performance measures: failed to execute tests: error getting CSV data from orchestrator pod: error reading logs from pod netperf-orch-59dsc: the server rejected our request for an unknown reason (get pods netperf-orch-59dsc)
                Test:           TestE2EPerfRetina

@ritwikranjan
Contributor Author

Will help with identifying issue #655

Labels
type/enhancement New feature or request
5 participants