Conversation

@ArangoGutierrez
Collaborator

@ArangoGutierrez ArangoGutierrez commented Apr 8, 2025

This E2E test suite is intended to be run against a Kubernetes cluster with a GPU node where the NVIDIA driver and container toolkit have been deployed on the host. An environment-variable option is available for running against clusters where the GPU Operator is deployed.

This pull request includes several changes to add new end-to-end (E2E) tests, update dependencies, and improve the test infrastructure. The most important changes include the addition of new E2E test scripts and configurations, updates to the Makefile for managing dependencies, and the introduction of new test utilities.

New E2E Tests and Configurations:

  • hack/e2e-test.sh: Added a new script for running E2E tests, which includes the setup for the ginkgo testing framework and execution of tests with JSON reporting.
  • tests/e2e/k8s-dra-gpu-driver_test.go: Added a new test suite for the K8S DRA GPU Driver, including tests for deploying the driver via Helm and running a pod that consumes a compute domain.
  • tests/e2e/cleanup_test.go: Added cleanup functions to remove deployed resources and custom resource definitions (CRDs) after tests.

Dependency Management:

  • Makefile: Updated to include targets for downloading and installing the ginkgo testing framework locally.
  • tests/go.mod: Added a new Go module file to manage dependencies for the E2E tests, including necessary libraries such as ginkgo, gomega, and Kubernetes client libraries.

Test Utilities:

  • tests/e2e/utils_test.go: Added utility functions for creating resource claim templates, compute domains, and pods to simplify the setup of test resources.
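
As an illustration of the kind of helper added in tests/e2e/utils_test.go, here is a minimal sketch of a resource claim template builder; the function name, signature, and the gpu.nvidia.com device class are assumptions for illustration, not the code in this PR:

// Sketch only: create a ResourceClaimTemplate that requests a single GPU device.
// Assumed imports: context; metav1 "k8s.io/apimachinery/pkg/apis/meta/v1";
// resourcev1beta1 "k8s.io/api/resource/v1beta1"; "k8s.io/client-go/kubernetes".
func createResourceClaimTemplate(ctx context.Context, clientSet kubernetes.Interface, namespace, name string) (*resourcev1beta1.ResourceClaimTemplate, error) {
	rct := &resourcev1beta1.ResourceClaimTemplate{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: resourcev1beta1.ResourceClaimTemplateSpec{
			Spec: resourcev1beta1.ResourceClaimSpec{
				Devices: resourcev1beta1.DeviceClaim{
					Requests: []resourcev1beta1.DeviceRequest{{
						Name:            "gpu",
						DeviceClassName: "gpu.nvidia.com", // assumed device class name
					}},
				},
			},
		},
	}
	return clientSet.ResourceV1beta1().ResourceClaimTemplates(namespace).Create(ctx, rct, metav1.CreateOptions{})
}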

These changes enhance the testing capabilities and infrastructure, ensuring more robust and automated testing for the project.

@ArangoGutierrez ArangoGutierrez requested a review from Copilot April 8, 2025 16:52
@ArangoGutierrez ArangoGutierrez self-assigned this Apr 8, 2025
Copilot AI left a comment

Copilot reviewed 5749 out of 5752 changed files in this pull request and generated 2 comments.

Files not reviewed (3)
  • Makefile: Language not supported
  • hack/e2e-test.sh: Language not supported
  • tests/go.mod: Language not supported
Comments suppressed due to low confidence (1)

tests/e2e/infra/holodeck.yaml:10

  • [nitpick] The 'privateKey' field uses an absolute file path, which may not be portable or secure in different environments. Consider parameterizing this value or using an environment-specific configuration.
privateKey: /Users/eduardoa/.ssh/cnt-ci.pem

}

// list all resourceclaimtemplates.resource.k8s.io on the testnamespace
resourceClaimTemplates, err := clientSet.ResourceV1beta1().ResourceClaimTemplates(testNamespace.Name).List(ctx, metav1.ListOptions{})
Copilot AI Apr 8, 2025

The context variable 'ctx' is used here but is not defined within the cleanup function scope. Consider passing a context parameter or defining 'ctx' appropriately.
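
A minimal sketch of the suggested fix, assuming the suite-level clientSet and a hypothetical cleanup helper name:

// Take the context as a parameter instead of relying on a ctx that is not in scope.
func cleanupResourceClaimTemplates(ctx context.Context, namespace string) error {
	rcts, err := clientSet.ResourceV1beta1().ResourceClaimTemplates(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, rct := range rcts.Items {
		if err := clientSet.ResourceV1beta1().ResourceClaimTemplates(namespace).Delete(ctx, rct.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}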

@elezar
Member

elezar commented Apr 8, 2025

@ArangoGutierrez before we review. Could we maybe split into two commits -- one for the code changes and one for the vendoring changes. The hope is that we can then review the code change separately because otherwise this is really slow.

@ArangoGutierrez
Collaborator Author

@ArangoGutierrez before we review. Could we maybe split into two commits -- one for the code changes and one for the vendoring changes. The hope is that we can then review the code change separately because otherwise this is really slow.

Yeah, I understand what you are saying; GitHub simply crashes, so it makes total sense. Going to split this into smaller chunks.

@elezar
Member

elezar commented Apr 9, 2025

@ArangoGutierrez before we review. Could we maybe split into two commits -- one for the code changes and one for the vendoring changes. The hope is that we can then review the code change separately because otherwise this is really slow.

Yeah, I understand what you are saying; GitHub simply crashes, so it makes total sense. Going to split this into smaller chunks.

To start with, two commits (one for the code changes and one adding the vendored dependencies) should be sufficient.

Copilot AI left a comment

Copilot reviewed 5748 out of 5750 changed files in this pull request and generated 1 comment.

Files not reviewed (2)
  • hack/e2e-test.sh: Language not supported
  • tests/go.mod: Language not supported

@ArangoGutierrez
Collaborator Author

@ArangoGutierrez before we review. Could we maybe split into two commits -- one for the code changes and one for the vendoring changes. The hope is that we can then review the code change separately because otherwise this is really slow.

Yeah, I understand what you are saying; GitHub simply crashes, so it makes total sense. Going to split this into smaller chunks.

To start with, two commits (one for the code changes and one adding the vendored dependencies) should be sufficient.

Done. I have refactored the cleanup logic; the PR is now ready for review.

@ArangoGutierrez ArangoGutierrez marked this pull request as ready for review April 9, 2025 11:45
Member

@elezar elezar left a comment

Thanks @ArangoGutierrez.

I've left some comments / questions.

One additional question: should this PR actually add the CI test integration?

hack/e2e-test.sh Outdated
@@ -0,0 +1,34 @@
#!/bin/bash

# Copyright 2025 The Kubernetes Authors.
Member

Should we also add our own header, or at least add us as authors?

Collaborator Author

done

for _, crd := range crds {
err := extClient.ApiextensionsV1().CustomResourceDefinitions().Delete(ctx, crd, metav1.DeleteOptions{})
Expect(err).NotTo(HaveOccurred())
// Optionally, wait for deletion confirmation:
Member

What is optional about waiting here? I don't see any conditional.
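
One way to make the wait explicit rather than a comment would be a gomega Eventually poll; this is only a sketch, assuming apierrors is k8s.io/apimachinery/pkg/api/errors and extClient is the apiextensions clientset used above:

// Block until the CRD is actually gone (or the timeout is hit).
Eventually(func() bool {
	_, err := extClient.ApiextensionsV1().CustomResourceDefinitions().Get(ctx, crd, metav1.GetOptions{})
	return apierrors.IsNotFound(err)
}, 2*time.Minute, 5*time.Second).Should(BeTrue(), "CRD %s was not deleted", crd)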

}
}

// cleanupResourceClaims forcibly removes finalizers and deletes all ResourceClaims.
Member

Why are we removing the finalizers in the tests? Is this in case there are issues with finalizers that we create? Does forcefully removing them mean that there are classes of errors that are not caught by these tests?

Collaborator Author

I was being over-aggressive; I have removed all of the finalizer-removal logic.

numNodes: 1
channel:
resourceClaimTemplate:
name: imex-channel-0
Member

let's remove the imex- prefixes.

Collaborator Author

done

spec:
containers:
- name: ctr
image: ubuntu:22.04
Member

Is this going to cause test outages due to rate-limiting?

Collaborator Author

It's a long shot, but yes, it is a possibility. Should we use an image from the samples repo?

Collaborator

Righty. This is a to-be-anticipated CI instability. Hopefully we will run into it. Because that means we test a lot.

should we use an image from the samples repo?

When we run this in CI: in the long run, we want to use a reliable container image registry here that is under our control or paid by us so that availability is not a concern and rate limiting and quotas won't get in the way.

"Docker Hub free tier" is not that, yes. But it gets people far, initially. Free tier ECR-in-AWS can be an option, with one free pull per second (that, plus retrying: works).

I still don't know really how good nvcr is for "free tier" non-NVIDIA users. Related: #315 (comment)

CreateNamespace: true,
Wait: true,
Timeout: 5 * time.Minute,
CleanupOnFail: true,
Member

Does CleanupOnFail affect the logs that are available?

Collaborator Author

This is just for the deployment phase. Once the pods are running, we will collect all logs if the test fails. Logs explaining why the Helm deployment failed will still be shown, so they remain available.

}

if HelmValuesFile != "" {
chartSpec.ValuesYaml = HelmValues
Member

Unless ValuesYaml is a *string, we should be able to set this inline above, since chartSpec.ValuesYaml will be "" or HelmValues regardless.
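
For illustration, assuming the chart is deployed with a client whose ChartSpec.ValuesYaml is a plain string (for example the mittwald go-helm-client), the conditional could be folded into the struct literal; the release name and some variable names here are placeholders:

chartSpec := helmclient.ChartSpec{
	ReleaseName:     "nvidia-dra-driver-gpu", // placeholder values for illustration
	ChartName:       helmChart,
	Namespace:       testNamespace.Name,
	CreateNamespace: true,
	Wait:            true,
	Timeout:         5 * time.Minute,
	CleanupOnFail:   true,
	// ValuesYaml is a plain string, so an empty value is a no-op and the
	// HelmValuesFile check above becomes unnecessary.
	ValuesYaml: HelmValues,
}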

By("Checking the nvidia-dra-driver-gpu-kubelet-plugin daemonset")
daemonset, err := clientSet.AppsV1().DaemonSets(testNamespace.Name).Get(ctx, "nvidia-dra-driver-gpu-kubelet-plugin", metav1.GetOptions{})
Expect(err).NotTo(HaveOccurred())
Expect(daemonset).NotTo(BeNil())
Member

Are there other properties that we should check for the deployment and daemonset?

Collaborator Author

I have added

			By("Waiting for nvidia-dra-driver-gpu-kubelet-plugin daemonset to be fully scheduled")
			Eventually(func() bool {
				dsList, err := clientSet.AppsV1().DaemonSets(testNamespace.Name).List(ctx, metav1.ListOptions{})
				Expect(err).NotTo(HaveOccurred())
				for _, ds := range dsList.Items {
					if ds.Status.CurrentNumberScheduled != ds.Status.DesiredNumberScheduled {
						return false
					}
				}
				return true
			}, 5*time.Minute, 10*time.Second).Should(BeTrue())

to wait for the DaemonSet to be running on all the expected nodes, i.e. until ds.Status.CurrentNumberScheduled == ds.Status.DesiredNumberScheduled.

}, 5*time.Minute, 10*time.Second).Should(BeTrue())

By("Creating the test pod")
pod := createPod(ns.Name, "test-pod1")
Member

We already have test-pod1.yaml. Is there a reason that we're duplicating things in code?

Collaborator Author

I had that file there for reference, I will remove it now.

Member

For what it's worth, I think loading the file from disk is easier to read / reason about. Is there a reason to prefer setting up the go structures directly?

Collaborator Author

I know it makes it a bit harder to read, but it provides more control. I want to deploy each object independently and properly wait for the object to be in the desired state.
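
For example, a sketch of that deploy-then-wait pattern for a single pod, reusing the suite-level clientSet and the createPod helper (names as used elsewhere in this PR):

// Create one object, then block until it reaches the desired state before
// the test proceeds to the next object.
pod := createPod(testNamespace.Name, "test-pod1")
pod, err := clientSet.CoreV1().Pods(testNamespace.Name).Create(ctx, pod, metav1.CreateOptions{})
Expect(err).NotTo(HaveOccurred())

Eventually(func() corev1.PodPhase {
	p, err := clientSet.CoreV1().Pods(testNamespace.Name).Get(ctx, pod.Name, metav1.GetOptions{})
	Expect(err).NotTo(HaveOccurred())
	return p.Status.Phase
}, 5*time.Minute, 10*time.Second).Should(Equal(corev1.PodRunning))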

Member

Can't we still deploy the separate objects after we have loaded them from a YAML file and applied any required customisations? Having YAML files as the base has the following advantages:

  1. They allow us to more easily debug failures since we can use these as examples and deploy them manually.
  2. They are easier to update as our APIs change / evolve, since this is typically how we would demonstrate these in customer-facing documentation.
  3. They allow for early feedback on the UX around writing such specifications and provide testable examples of content that we would expose to end-users.

In the case of the device plugin, the job that we deploy was failing continuously, and the fact that I didn't have a YAML file to reference was a major hindrance in debugging.

Since this is "just" for tests, I'm not going to continue blocking on this though.
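
A minimal sketch of that YAML-as-base flow, assuming sigs.k8s.io/yaml and a hypothetical testdata/test-pod1.yaml path:

// Load the manifest that a user could also apply manually, customize it for
// this run, then create it programmatically.
data, err := os.ReadFile("testdata/test-pod1.yaml") // hypothetical path
Expect(err).NotTo(HaveOccurred())

var pod corev1.Pod
Expect(yaml.Unmarshal(data, &pod)).To(Succeed()) // sigs.k8s.io/yaml

pod.Namespace = testNamespace.Name
_, err = clientSet.CoreV1().Pods(testNamespace.Name).Create(ctx, &pod, metav1.CreateOptions{})
Expect(err).NotTo(HaveOccurred())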

Collaborator Author

@ArangoGutierrez ArangoGutierrez Apr 25, 2025

Addressed with commit: 5405d11

Expect(err).NotTo(HaveOccurred())

// Create a namespace for the test
testNamespace, err = CreateTestingNS("nvidia-dra-driver-gpu", clientSet, nil)
Collaborator

That is the namespace that we recommend to install the DRA driver in, right?

The normal user workflow has workloads running in a different namespace, right?

Collaborator Author

@ArangoGutierrez ArangoGutierrez Apr 11, 2025

That is the namespace that we recommend to install the DRA driver in, right?

No, this will be a namespace that will look like nvidia-dra-driver-gpu-<random>, just for the sake of the test run.

The normal user workflow has workloads running in a different namespace, right?

Yes, a recommended name for the namespace could be nvidia-dra-driver-gpu but that is really up to the user.
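
Roughly, the helper behaves like the following sketch (an assumption based on this thread, not the PR's exact implementation): the argument is only a prefix, and a random suffix is appended per run.

// Sketch: the base name is used as a prefix; the API server appends a random
// suffix via GenerateName, yielding e.g. nvidia-dra-driver-gpu-<random>.
func CreateTestingNS(baseName string, c kubernetes.Interface, labels map[string]string) (*corev1.Namespace, error) {
	ns := &corev1.Namespace{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: baseName + "-",
			Labels:       labels,
		},
	}
	return c.CoreV1().Namespaces().Create(context.TODO(), ns, metav1.CreateOptions{})
}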

Collaborator

@jgehrcke jgehrcke Apr 11, 2025

No, this will be a namespace that will look like nvidia-dra-driver-gpu-<random> just for the sake of the test run.

Ah, nice! I didn't look at the implementation. So, the arg is just a prefix. Maybe we can use the word "test" in that prefix?

})

When("deploying the K8S DRA Driver GPU via Helm", func() {
It("should be successful", func(ctx context.Context) {
Collaborator

@jgehrcke jgehrcke Apr 11, 2025

In the spirit of maximizing for conclusiveness... what would a corresponding helm install command line look like?

What is this helm chart really installed from? I saw an environment variable HELM_CHART. Is this meant to carry a path to a directory? Or to a tarball? If that is really the source of the Helm chart to be deployed maybe we can rename the variable to make that clearer, and also add a code comment here and there clarifying which Helm chart this really tests :).

Overall, I can see that in the future we might have a bit of discussion on programmatic Helm client vs a CLI command executed. The CLI-based approach is what we document and what users are using. I also like to go fancy in terms of testing and making a codebase maintainable. But in terms of deriving confidence: my baseline is to have at least one test in a per-PR CI that captures the basic user workflow as close as possible. (basically: tested documentation).

Collaborator Author

This would translate approximately to:

helm install nvidia-dra-driver-gpu $HELM_CHART \
    --version="$E2E_IMAGE_TAG" \
    --create-namespace \
    --namespace nvidia-dra-driver-gpu \
    --set nvidiaDriverRoot=/ \
    --set nvidiaCtkPath=/usr/bin/nvidia-ctk \
    --set gpuResourcesEnabledOverride=true

The idea is to deploy the Helm chart from the deployments/helm folder during CI, but the HELM_CHART env var leaves the option, during a manual run, to point to a different location, which could be a tar.gz file.

100% agree with you; the idea of this initial E2E is to test #249 in a programmatic way.

@jgehrcke
Collaborator

jgehrcke commented Apr 11, 2025

Great that you're working on this. I haven't looked deeply but already left a few remarks. Feel free to do with those what you think is right.

Thanks for adding a nice PR description already, answered many questions.

Questions left: what is your idea to translate this into signal? How do we want to integrate this into CI? What will a typical per-PR workflow look like?

Of course, the wish is to have conclusive and fast per-PR CI. I make a patch, open a PR, and then the set of CI checks on the PR tells me what I broke. If we will not have that: what will we have instead?

@ArangoGutierrez
Collaborator Author

Great that you're working on this. I haven't looked deeply but already left a few remarks. Feel free to do with those what you think is right.

Thanks for adding a nice PR description already, answered many questions.

Questions left: what is your idea to translate this into signal? How do we want to integrate this into CI? What will a typical per-PR workflow look like?

Of course, the wish is to have conclusive and fast per-PR CI. I make a patch, open a PR, and then the set of CI checks on the PR tells me what I broke. If we will not have that: what will we have instead?

Yes, the idea is to run this during PRs, but I will have a follow-up PR after this one for that part. This PR will focus on the E2E suite and its logic, since it is also intended to be run manually. A follow-up PR will focus on setting up the CI bits so we run this test on every change.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@ArangoGutierrez ArangoGutierrez force-pushed the e2e_tests branch 3 times, most recently from 3d41e92 to f6e2eff Compare April 16, 2025 10:59
@ArangoGutierrez
Collaborator Author

@elezar @jgehrcke The PR now also adds the GitHub Actions logic.

Copilot AI left a comment

Pull Request Overview

This PR adds an end-to-end testing suite for the NVIDIA GPU driver with improvements to test scripts, infrastructure, and dependency management.

  • New E2E tests and configurations are introduced in several Go test files under the tests/e2e directory.
  • Test utilities and cleanup functions have been added or updated, and the CI workflows have been configured for E2E testing.

Reviewed Changes

Copilot reviewed 5750 out of 5753 changed files in this pull request and generated no comments.

Summary per file:
  • tests/e2e/utils.go: New utility functions for resource templates, compute domain creation, pod creation, and log validation
  • tests/e2e/k8s-dra-gpu-driver_test.go: New E2E test suite with Helm deployment checks and pod log validation
  • tests/e2e/infra/holodeck.yaml: New YAML configuration for E2E testing infrastructure on AWS
  • tests/e2e/e2e_test.go: Test suite setup with dependency deployment, client configuration, and namespace management
  • tests/e2e/cleanup_test.go: Cleanup functions for test resources including pods, CRDs, and Helm deployments
  • .github/workflows/image.yaml: Updated GitHub Actions workflow for building the image
  • .github/workflows/e2e.yaml: New workflow configuration for running E2E tests
  • .github/workflows/ci.yaml: CI workflow updated to trigger E2E tests
Files not reviewed (3)
  • Makefile: Language not supported
  • hack/e2e-test.sh: Language not supported
  • tests/go.mod: Language not supported
Comments suppressed due to low confidence (1)

tests/e2e/cleanup_test.go:98

  • [nitpick] The naming of 'cleanUpComputeDomains' is inconsistent with other cleanup functions that use lower-case 'cleanup'. Consider renaming it to 'cleanupComputeDomains' for consistency.
func cleanUpComputeDomains(namespace string) error {

Makefile Outdated
cat $(COVERAGE_FILE) | grep -v "_mock.go" > $(COVERAGE_FILE).no-mocks
go tool cover -func=$(COVERAGE_FILE).no-mocks

ginkgo:
Member

Let's move this to tests/e2e/Makefile instead.

Collaborator Author

k will do

Collaborator Author

done

Makefile Outdated
mkdir -p $(CURDIR)/bin
GOBIN=$(CURDIR)/bin go install github.com/onsi/ginkgo/v2/ginkgo@latest

test-e2e: ginkgo
Member

Let's rather have a tests/e2e/Makefile for these targets. I'm happy to also have an "alias" that invokes that target at this level, but the core logic should be in the tests folder.

Collaborator Author

k will do

Collaborator Author

done

Makefile Outdated
GOBIN=$(CURDIR)/bin go install github.com/onsi/ginkgo/v2/ginkgo@latest

test-e2e: ginkgo
./bin/ginkgo $(GINKGO_ARGS) -v --json-report $(LOG_ARTIFACTS_DIR)/ginkgo.json ./tests/e2e/...
Member

What is the benefit of using ginkgo as the test runner? Is it only to produce better logs?

Collaborator Author

There is no runtime difference between using go test or ginkgo to run the tests; yes, I am using the ginkgo binary to get better logs and the report file. So I am not set on this one; if we think it is not worth it, I am OK with dropping it.

Copilot AI left a comment

Pull Request Overview

This pull request adds new end-to-end (E2E) tests and configurations to validate the NVIDIA GPU driver operations on Kubernetes clusters. Key changes include the addition of multiple E2E test scripts and utilities, updates to dependency management via the Makefile and Go modules, and enhancements to test resource cleanup and infrastructure configuration.

Reviewed Changes

Copilot reviewed 5750 out of 5752 changed files in this pull request and generated 2 comments.

Summary per file:
  • tests/e2e/utils.go: Introduces helper functions for resource claim templates, compute domains, and pod creation.
  • tests/e2e/k8s-dra-gpu-driver_test.go: Implements the main E2E test suite for deploying and validating the GPU driver via Helm.
  • tests/e2e/infra/holodeck.yaml: Provides YAML configuration for the test infrastructure environment.
  • tests/e2e/e2e_test.go: Sets up the test suite environment and integrates dependency management and Helm settings.
  • tests/e2e/cleanup_test.go: Adds routines for cleaning up test artifacts including CRDs, pods, and resource claims.
  • .github/workflows/image.yaml: Configures image build and Docker Buildx setup for CI workflows.
  • .github/workflows/e2e.yaml: Defines the GitHub Actions workflow for running the E2E tests.
  • .github/workflows/ci.yaml: Orchestrates CI including calling the E2E workflow with proper inputs.
Files not reviewed (2)
  • tests/e2e/Makefile: Language not supported
  • tests/go.mod: Language not supported

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>

mergable commit

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Copilot AI left a comment

Pull Request Overview

This pull request adds a suite of new end-to-end (E2E) tests and the accompanying test infrastructure to verify deployment and behavior of the K8S DRA GPU Driver on Kubernetes clusters with GPU nodes. Key changes include the addition of several test scripts, updates to dependency management (Makefile and go.mod), and enhancements in test resource creation and cleanup utilities.

Reviewed Changes

Copilot reviewed 5750 out of 5752 changed files in this pull request and generated no comments.

Summary per file:
  • tests/e2e/utils.go: Utility functions for setting up test resources such as GPU resource claim templates, compute domains, and pods
  • tests/e2e/k8s-dra-gpu-driver_test.go: New test suite for installing and validating the NVIDIA GPU driver deployment via Helm and verifying the environment
  • tests/e2e/infra/holodeck.yaml: Definition for the end-to-end test infrastructure configuration
  • tests/e2e/e2e_test.go: Entry-point test file that configures and initiates the overall E2E test suite
  • tests/e2e/cleanup_test.go: Added functions for cleanup of the deployed resources including pods, CRDs, and node labels
  • .github/workflows/image.yaml, .github/workflows/e2e.yaml, .github/workflows/ci.yaml: Updated workflow definitions for building images and running the end-to-end tests
Files not reviewed (2)
  • tests/e2e/Makefile: Language not supported
  • tests/go.mod: Language not supported

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@ArangoGutierrez ArangoGutierrez force-pushed the e2e_tests branch 2 times, most recently from ff2950d to 6c0da9a Compare April 25, 2025 15:55
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@ArangoGutierrez ArangoGutierrez requested a review from klueska May 7, 2025 08:56
@ArangoGutierrez ArangoGutierrez added the ci/testing issue/PR related to CI and/or testing label May 26, 2025
@klueska klueska added this to the v25.12.0 milestone Aug 13, 2025
@klueska klueska modified the milestones: v25.12.0, unscheduled Sep 18, 2025

Labels

ci/testing issue/PR related to CI and/or testing

Projects

Status: Backlog

4 participants