Add E2E tests #314
base: main
Conversation
Copilot reviewed 5749 out of 5752 changed files in this pull request and generated 2 comments.
Files not reviewed (3)
- Makefile: Language not supported
- hack/e2e-test.sh: Language not supported
- tests/go.mod: Language not supported
Comments suppressed due to low confidence (1)
tests/e2e/infra/holodeck.yaml:10
- [nitpick] The 'privateKey' field uses an absolute file path, which may not be portable or secure in different environments. Consider parameterizing this value or using an environment-specific configuration.
privateKey: /Users/eduardoa/.ssh/cnt-ci.pem
tests/e2e/cleanup_test.go
Outdated
```go
}

// list all resourceclaimtemplates.resource.k8s.io on the testnamespace
resourceClaimTemplates, err := clientSet.ResourceV1beta1().ResourceClaimTemplates(testNamespace.Name).List(ctx, metav1.ListOptions{})
```
Copilot AI (Apr 8, 2025)
The context variable 'ctx' is used here but is not defined within the cleanup function scope. Consider passing a context parameter or defining 'ctx' appropriately.
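For illustration, one way to address the Copilot remark would be to pass the context in explicitly rather than relying on an outer-scope variable. A minimal sketch, assuming a client-go clientset; the helper name and signature are hypothetical, not the PR's actual code:

```go
package e2e

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupResourceClaimTemplates deletes all ResourceClaimTemplates in the
// given namespace. The context is an explicit parameter, so the function does
// not depend on a ctx defined elsewhere.
func cleanupResourceClaimTemplates(ctx context.Context, clientSet kubernetes.Interface, namespace string) error {
	templates, err := clientSet.ResourceV1beta1().ResourceClaimTemplates(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, rct := range templates.Items {
		if err := clientSet.ResourceV1beta1().ResourceClaimTemplates(namespace).Delete(ctx, rct.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```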
@ArangoGutierrez before we review: could we maybe split this into two commits, one for the code changes and one for the vendoring changes? The hope is that we can then review the code change separately, because otherwise this is really slow.
Yeah, I understand what you are saying; GitHub simply crashes, makes total sense. Going to split this into smaller chunks.
To start with, two commits (one for the code changes and one adding the vendored dependencies) should be sufficient.
Force-pushed from 05e962a to 2485163
Copilot reviewed 5748 out of 5750 changed files in this pull request and generated 1 comment.
Files not reviewed (2)
- hack/e2e-test.sh: Language not supported
- tests/go.mod: Language not supported
Done. I have refactored the cleanup logic; the PR is now ready for review.
Thanks @ArangoGutierrez.
I've left some comments / questions.
One additional question: should this PR actually add the CI test integration?
hack/e2e-test.sh
Outdated
```
@@ -0,0 +1,34 @@
#!/bin/bash

# Copyright 2025 The Kubernetes Authors.
```
Should we also add our own header, or at least list ourselves as authors?
done
tests/e2e/cleanup_test.go
Outdated
```go
for _, crd := range crds {
	err := extClient.ApiextensionsV1().CustomResourceDefinitions().Delete(ctx, crd, metav1.DeleteOptions{})
	Expect(err).NotTo(HaveOccurred())
	// Optionally, wait for deletion confirmation:
```
What is optional about waiting here? I don't see any conditional.
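For reference, actually waiting for the deletion to complete could look roughly like this; a sketch using Gomega's Eventually, with names taken from the snippet above rather than the PR's final code (apierrors is k8s.io/apimachinery/pkg/api/errors):

```go
// Sketch only: poll until the CRD is really gone from the API server.
Eventually(func() bool {
	_, err := extClient.ApiextensionsV1().CustomResourceDefinitions().Get(ctx, crd, metav1.GetOptions{})
	return apierrors.IsNotFound(err)
}, 2*time.Minute, 5*time.Second).Should(BeTrue())
```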
tests/e2e/cleanup_test.go
Outdated
```go
	}
}

// cleanupResourceClaims forcibly removes finalizers and deletes all ResourceClaims.
```
Why are we removing the finalizers in the tests? Is this in case there are issues with finalizers that we create? Does forcefully removing them mean that there are classes of errors that are not caught by these tests?
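For context, force-removing a finalizer in a test usually amounts to a patch along these lines; a generic sketch, not the code under review (types is k8s.io/apimachinery/pkg/types):

```go
// Sketch only: clear metadata.finalizers so the API server can complete the
// deletion. Doing this in tests can hide bugs in the driver's own finalizer
// handling, which is the concern raised above.
patch := []byte(`{"metadata":{"finalizers":[]}}`)
_, err := clientSet.ResourceV1beta1().ResourceClaims(namespace).Patch(
	ctx, claim.Name, types.MergePatchType, patch, metav1.PatchOptions{})
Expect(err).NotTo(HaveOccurred())
```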
I was being over-aggressive; I have removed all the finalizer-removal logic.
tests/e2e/data/test-pod1.yaml
Outdated
```yaml
numNodes: 1
channel:
  resourceClaimTemplate:
    name: imex-channel-0
```
let's remove the imex- prefixes.
done
tests/e2e/data/test-pod1.yaml
Outdated
```yaml
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
```
Is this going to cause test outages due to rate-limiting?
It's a long shot, but yes, it is a possibility. Should we use an image from the samples repo?
Righty. This is a to-be-anticipated CI instability. Hopefully we will run into it, because that means we test a lot.

> should we use an image from the samples repo?

When we run this in CI: in the long run, we want to use a reliable container image registry here that is under our control or paid for by us, so that availability is not a concern and rate limiting and quotas won't get in the way.
"Docker Hub free tier" is not that, yes. But it gets people far, initially. Free-tier ECR in AWS can be an option, with one free pull per second (that, plus retrying, works).
I still don't really know how good nvcr is for "free tier" non-NVIDIA users. Related: #315 (comment)
```go
CreateNamespace: true,
Wait:            true,
Timeout:         5 * time.Minute,
CleanupOnFail:   true,
```
Does CleanupOnFail affect the logs that are available?
This only applies to the deployment phase. Once the pods are running, we will collect all logs if the test fails, and logs explaining why the Helm deployment failed will still be shown, so they are available.
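For reference, the kind of post-failure log collection described here could look roughly like the following; an illustrative sketch, not the PR's actual helper:

```go
// Sketch only: dump container logs for a pod in the test namespace so that
// failures remain debuggable even after cleanup.
req := clientSet.CoreV1().Pods(namespace).GetLogs(podName, &corev1.PodLogOptions{})
logs, err := req.DoRaw(ctx)
if err == nil {
	GinkgoWriter.Printf("logs for %s/%s:\n%s\n", namespace, podName, string(logs))
}
```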
tests/e2e/k8s-dra-gpu-driver_test.go
Outdated
```go
}

if HelmValuesFile != "" {
	chartSpec.ValuesYaml = HelmValues
```
Unless ValuesYaml is a *string, we should be able to set this inline above, since chartSpec.ValuesYaml will be "" or HelmValues regardless.
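In other words, the suggestion is roughly the following (assuming ValuesYaml is a plain string field, not a *string):

```go
// Before: conditional assignment.
if HelmValuesFile != "" {
	chartSpec.ValuesYaml = HelmValues
}

// Suggested: assign unconditionally; an empty HelmValues just leaves the
// field at its zero value ("").
chartSpec.ValuesYaml = HelmValues
```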
| By("Checking the nvidia-dra-driver-gpu-kubelet-plugin daemonset") | ||
| daemonset, err := clientSet.AppsV1().DaemonSets(testNamespace.Name).Get(ctx, "nvidia-dra-driver-gpu-kubelet-plugin", metav1.GetOptions{}) | ||
| Expect(err).NotTo(HaveOccurred()) | ||
| Expect(daemonset).NotTo(BeNil()) |
Are there other properties that we should check for the deployment and daemonset?
I have added

```go
By("Waiting for nvidia-dra-driver-gpu-kubelet-plugin daemonset to be fully scheduled")
Eventually(func() bool {
	dsList, err := clientSet.AppsV1().DaemonSets(testNamespace.Name).List(ctx, metav1.ListOptions{})
	Expect(err).NotTo(HaveOccurred())
	for _, ds := range dsList.Items {
		if ds.Status.CurrentNumberScheduled != ds.Status.DesiredNumberScheduled {
			return false
		}
	}
	return true
}, 5*time.Minute, 10*time.Second).Should(BeTrue())
```

to wait for the DS to be running on all the expected nodes (checking ds.Status.CurrentNumberScheduled != ds.Status.DesiredNumberScheduled).
tests/e2e/k8s-dra-gpu-driver_test.go
Outdated
```go
}, 5*time.Minute, 10*time.Second).Should(BeTrue())

By("Creating the test pod")
pod := createPod(ns.Name, "test-pod1")
```
We already have test-pod1.yaml. Is there a reason that we're duplicating things in code?
I had that file there for reference; I will remove it now.
For what it's worth, I think loading the file from disk is easier to read / reason about. Is there a reason to prefer setting up the go structures directly?
I know it makes it a bit harder to read, but it provides more control. I want to deploy each object independently and properly wait for the object to be in the desired state.
Can't we still deploy the separate objects after we have loaded them from a YAML file and applied any required customisations? Having YAML files as the base has the following advantages:
- They allow us to more easily debug failures, since we can use them as examples and deploy them manually.
- They are easier to update as our APIs change / evolve, since this is typically how we would demonstrate these in customer-facing documentation.
- They allow for early feedback on the UX around writing such specifications and provide testable examples of content that we would expose to end-users.
In the case of the device plugin, the job that we deploy was failing continuously, and the fact that I didn't have a YAML file to reference was a major hindrance in debugging.
Since this is "just" for tests, I'm not going to continue blocking on this though.
Addressed with commit: 5405d11
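For illustration, loading a manifest from a YAML file and still creating the object programmatically (so each object can be deployed and waited on individually) could look roughly like this. A sketch with assumed names and paths, not the code from commit 5405d11:

```go
package e2e

import (
	"context"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"sigs.k8s.io/yaml"
)

// createPodFromYAML reads a pod manifest from disk, applies test-specific
// overrides (here: the namespace), and submits it to the API server.
func createPodFromYAML(ctx context.Context, clientSet kubernetes.Interface, namespace, path string) (*corev1.Pod, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var pod corev1.Pod
	if err := yaml.Unmarshal(data, &pod); err != nil {
		return nil, err
	}
	pod.Namespace = namespace
	return clientSet.CoreV1().Pods(namespace).Create(ctx, &pod, metav1.CreateOptions{})
}
```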
tests/e2e/e2e_test.go
Outdated
```go
Expect(err).NotTo(HaveOccurred())

// Create a namespace for the test
testNamespace, err = CreateTestingNS("nvidia-dra-driver-gpu", clientSet, nil)
```
That is the namespace that we recommend to install the DRA driver in, right?
The normal user workflow has workloads running in a different namespace, right?
> That is the namespace that we recommend to install the DRA driver in, right?

No, this will be a namespace that will look like nvidia-dra-driver-gpu-<random>, just for the sake of the test run.

> The normal user workflow has workloads running in a different namespace, right?

Yes, a recommended name for the namespace could be nvidia-dra-driver-gpu, but that is really up to the user.
> No, this will be a namespace that will look like nvidia-dra-driver-gpu-<random>, just for the sake of the test run.

Ah, nice! I didn't look at the implementation. So, the arg is just a prefix. Maybe we can use the word "test" in that prefix?
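For reference, a prefix-plus-random-suffix test namespace can be produced with GenerateName; a sketch of the idea, which is an assumption about what CreateTestingNS does rather than its actual implementation:

```go
// Sketch only: the API server appends a random suffix when GenerateName is
// set, yielding names like "nvidia-dra-driver-gpu-test-x7k2p".
ns := &corev1.Namespace{
	ObjectMeta: metav1.ObjectMeta{
		GenerateName: "nvidia-dra-driver-gpu-test-",
	},
}
testNamespace, err := clientSet.CoreV1().Namespaces().Create(ctx, ns, metav1.CreateOptions{})
Expect(err).NotTo(HaveOccurred())
```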
```go
})

When("deploying the K8S DRA Driver GPU via Helm", func() {
	It("should be successful", func(ctx context.Context) {
```
In the spirit of maximizing for conclusiveness... what would a corresponding helm install command line look like?
What is this helm chart really installed from? I saw an environment variable HELM_CHART. Is this meant to carry a path to a directory? Or to a tarball? If that is really the source of the Helm chart to be deployed, maybe we can rename the variable to make that clearer, and also add a code comment here and there clarifying which Helm chart this really tests :).
Overall, I can see that in the future we might have a bit of discussion on a programmatic Helm client vs. a CLI command executed. The CLI-based approach is what we document and what users are using. I also like to go fancy in terms of testing and making a codebase maintainable. But in terms of deriving confidence: my baseline is to have at least one test in per-PR CI that captures the basic user workflow as closely as possible (basically: tested documentation).
This could translate to approximately:

```sh
helm install nvidia-dra-driver-gpu $HELM_CHART \
  --version="$E2E_IMAGE_TAG" \
  --create-namespace \
  --namespace nvidia-dra-driver-gpu \
  --set nvidiaDriverRoot=/ \
  --set nvidiaCtkPath=/usr/bin/nvidia-ctk \
  --set gpuResourcesEnabledOverride=true
```

The idea is to deploy the Helm chart from the deployments/helm folder during CI, but the HELM_CHART env var leaves the option, during a manual run, to point to a different location, which could be a tar.gz file.
100% agree with you; the idea of this initial E2E is to test #249 in a programmatic way.
Great that you're working on this. I haven't looked deeply but already left a few remarks. Feel free to do with those what you think is right. Thanks for adding a nice PR description already; it answered many questions. Questions left: what is your idea to translate this into signal? How do we want to integrate this into CI? What will a typical per-PR workflow look like? Of course, the wish is to have conclusive and fast per-PR CI: I make a patch, open a PR, and then the set of CI checks on the PR tells me what I broke. If we will not have that, what will we have instead?
Yes, the idea is to run during PRs, but I will have a follow-up PR after this one for that part. This PR will focus on the E2E test and its logic, since it is also intended to be run manually. A follow-up PR will focus on setting up the CI bits so we run this test on every change.
Force-pushed from 2485163 to 139057f
Force-pushed from 6ff7f86 to 192b741
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Force-pushed from 3d41e92 to f6e2eff
Force-pushed from f6e2eff to a9f5596
Pull Request Overview
This PR adds an end-to-end testing suite for the NVIDIA GPU driver with improvements to test scripts, infrastructure, and dependency management.
- New E2E tests and configurations are introduced in several Go test files under the tests/e2e directory.
- Test utilities and cleanup functions have been added or updated, and the CI workflows have been configured for E2E testing.
Reviewed Changes
Copilot reviewed 5750 out of 5753 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/e2e/utils.go | New utility functions for resource templates, compute domain creation, pod creation, and log validation |
| tests/e2e/k8s-dra-gpu-driver_test.go | New E2E test suite with Helm deployment checks and pod log validation |
| tests/e2e/infra/holodeck.yaml | New YAML configuration for E2E testing infrastructure on AWS |
| tests/e2e/e2e_test.go | Test suite setup with dependency deployment, client configuration, and namespace management |
| tests/e2e/cleanup_test.go | Cleanup functions for test resources including pods, CRDs, and Helm deployments |
| .github/workflows/image.yaml | Updated GitHub Actions workflow for building the image |
| .github/workflows/e2e.yaml | New workflow configuration for running E2E tests |
| .github/workflows/ci.yaml | CI workflow updated to trigger E2E tests |
Files not reviewed (3)
- Makefile: Language not supported
- hack/e2e-test.sh: Language not supported
- tests/go.mod: Language not supported
Comments suppressed due to low confidence (1)
tests/e2e/cleanup_test.go:98
- [nitpick] The naming of 'cleanUpComputeDomains' is inconsistent with other cleanup functions that use lower-case 'cleanup'. Consider renaming it to 'cleanupComputeDomains' for consistency.
func cleanUpComputeDomains(namespace string) error {
Makefile
Outdated
```make
	cat $(COVERAGE_FILE) | grep -v "_mock.go" > $(COVERAGE_FILE).no-mocks
	go tool cover -func=$(COVERAGE_FILE).no-mocks

ginkgo:
```
Let's move this to tests/e2e/Makefile instead.
OK, will do.
done
Makefile
Outdated
```make
	mkdir -p $(CURDIR)/bin
	GOBIN=$(CURDIR)/bin go install github.com/onsi/ginkgo/v2/ginkgo@latest

test-e2e: ginkgo
```
Let's rather have a tests/e2e/Makefile for these targets. I'm happy to also have an "alias" that invokes that target at this level, but the core logic should be in the tests folder.
OK, will do.
done
Makefile
Outdated
```make
	GOBIN=$(CURDIR)/bin go install github.com/onsi/ginkgo/v2/ginkgo@latest

test-e2e: ginkgo
	./bin/ginkgo $(GINKGO_ARGS) -v --json-report $(LOG_ARTIFACTS_DIR)/ginkgo.json ./tests/e2e/...
```
What is the benefit of using ginkgo as the test runner? Is it only to produce better logs?
There is no runtime difference between using go test or ginkgo to run the tests; yes, I am using the ginkgo binary to get better logs and the report file. So I am not set on this one; if we think it is not worth it, I am OK with that.
Force-pushed from a9f5596 to be9467b
Pull Request Overview
This pull request adds new end-to-end (E2E) tests and configurations to validate the NVIDIA GPU driver operations on Kubernetes clusters. Key changes include the addition of multiple E2E test scripts and utilities, updates to dependency management via the Makefile and Go modules, and enhancements to test resource cleanup and infrastructure configuration.
Reviewed Changes
Copilot reviewed 5750 out of 5752 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/e2e/utils.go | Introduces helper functions for resource claim templates, compute domains, and pod creation. |
| tests/e2e/k8s-dra-gpu-driver_test.go | Implements the main E2E test suite for deploying and validating the GPU driver via Helm. |
| tests/e2e/infra/holodeck.yaml | Provides YAML configuration for the test infrastructure environment. |
| tests/e2e/e2e_test.go | Sets up the test suite environment and integrates dependency management and Helm settings. |
| tests/e2e/cleanup_test.go | Adds routines for cleaning up test artifacts including CRDs, pods, and resource claims. |
| .github/workflows/image.yaml | Configures image build and Docker Buildx setup for CI workflows. |
| .github/workflows/e2e.yaml | Defines the GitHub Actions workflow for running the E2E tests. |
| .github/workflows/ci.yaml | Orchestrates CI including calling the E2E workflow with proper inputs. |
Files not reviewed (2)
- tests/e2e/Makefile: Language not supported
- tests/go.mod: Language not supported
Force-pushed from be9467b to d59c93c
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
mergable commit
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Force-pushed from d59c93c to edc936c
Pull Request Overview
This pull request adds a suite of new end-to-end (E2E) tests and the accompanying test infrastructure to verify deployment and behavior of the K8S DRA GPU Driver on Kubernetes clusters with GPU nodes. Key changes include the addition of several test scripts, updates to dependency management (Makefile and go.mod), and enhancements in test resource creation and cleanup utilities.
Reviewed Changes
Copilot reviewed 5750 out of 5752 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/e2e/utils.go | Utility functions for setting up test resources such as GPU resource claim templates, compute domains, and pods |
| tests/e2e/k8s-dra-gpu-driver_test.go | New test suite for installing and validating the NVIDIA GPU driver deployment via Helm and verifying the environment |
| tests/e2e/infra/holodeck.yaml | Definition for the end-to-end test infrastructure configuration |
| tests/e2e/e2e_test.go | Entry-point test file that configures and initiates the overall E2E test suite |
| tests/e2e/cleanup_test.go | Added functions for cleanup of the deployed resources including pods, CRDs, and node labels |
| .github/workflows/image.yaml, .github/workflows/e2e.yaml, .github/workflows/ci.yaml | Updated workflow definitions for building images and running the end-to-end tests |
Files not reviewed (2)
- tests/e2e/Makefile: Language not supported
- tests/go.mod: Language not supported
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Force-pushed from ff2950d to 6c0da9a
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Force-pushed from 6c0da9a to 15bbe68
This E2E test suite is intended to be run against a Kubernetes cluster with a GPU node, where the NVIDIA driver and container toolkit have been deployed on the host. An environment-variable option is available to run on clusters where the GPU Operator is running.

This pull request includes several changes to add new end-to-end (E2E) tests, update dependencies, and improve the test infrastructure. The most important changes include the addition of new E2E test scripts and configurations, updates to the Makefile for managing dependencies, and the introduction of new test utilities.

New E2E Tests and Configurations:
- hack/e2e-test.sh: Added a new script for running E2E tests, which includes the setup for the ginkgo testing framework and execution of tests with JSON reporting.
- tests/e2e/k8s-dra-gpu-driver_test.go: Added a new test suite for the K8S DRA GPU Driver, including tests for deploying the driver via Helm and running a pod that consumes a compute domain.
- tests/e2e/cleanup_test.go: Added cleanup functions to remove deployed resources and custom resource definitions (CRDs) after tests.

Dependency Management:
- Makefile: Updated to include targets for downloading and installing the ginkgo testing framework locally.
- tests/go.mod: Added a new Go module file to manage dependencies for the E2E tests, including necessary libraries such as ginkgo, gomega, and Kubernetes client libraries.

Test Utilities:
- tests/e2e/utils_test.go: Added utility functions for creating resource claim templates, compute domains, and pods to simplify the setup of test resources.

These changes enhance the testing capabilities and infrastructure, ensuring more robust and automated testing for the project.