-
Notifications
You must be signed in to change notification settings - Fork 429
[no-relnote] Add E2E for containerd #1313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds comprehensive E2E testing for containerd runtime configuration alongside existing Docker testing infrastructure. The changes introduce a nested container testing framework that allows running tests inside containers to validate NVIDIA Container Toolkit behavior in containerized environments.
- Adds new E2E tests for containerd drop-in configuration functionality
- Introduces nvidia-cdi-refresh systemd unit testing
- Implements nested container runner infrastructure for isolated testing
Reviewed Changes
Copilot reviewed 9 out of 32 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/go.mod | Adds new dependencies for UUID generation and test utilities |
| tests/e2e/runner.go | Implements nested container runner with Docker installation and CTK setup |
| tests/e2e/nvidia-ctk_containerd_test.go | New comprehensive containerd E2E test suite |
| tests/e2e/nvidia-ctk_docker_test.go | Refactors to use shared runner infrastructure and fixes macOS compatibility |
| tests/e2e/nvidia-cdi-refresh_test.go | New systemd unit tests for CDI refresh functionality |
| tests/e2e/nvidia-container-cli_test.go | Refactors to use nested container runner |
| tests/e2e/installer.go | Adds containerd installation template and additional flags support |
| tests/e2e/e2e_test.go | Centralizes test runner initialization in BeforeSuite |
| tests/e2e/Makefile | Documents new test categories |
Pull Request Test Coverage Report for Build 18005738357Details
💛 - Coveralls |
5c85056 to
3dc480c
Compare
|
I'll mark this PR as ready for review once #1235 is merged |
3dc480c to
563a192
Compare
Rebased |
029af03 to
1899001
Compare
51ad031 to
c65e468
Compare
|
Rebased |
| AfterAll(func(ctx context.Context) { | ||
| // Cleanup: remove the container and the temporary script on the host. | ||
| // Use || true to ensure cleanup doesn't fail the test | ||
| runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName)) //nolint:errcheck |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of the nolint let's just drop the return values.
| runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName)) //nolint:errcheck | |
| _, _, _ = runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName)) |
Does it mak sense to at least WARN if the cleanup fails? The || true doesn't ensure that the test doesn't fail, the fact that we don't check the return value does that.
| # Remove any imports line from the config (reset to original state) | ||
| if [ -f /etc/containerd/config.toml ]; then | ||
| grep -v "^imports = " /etc/containerd/config.toml > /tmp/config.toml.tmp && mv /tmp/config.toml.tmp /etc/containerd/config.toml || true | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just make a copy of the original config and restore that after / before each test?
| # Restart containerd to pick up the clean config | ||
| systemctl restart containerd | ||
| sleep 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to check containerd health?
| output, _, err := nestedContainerRunner.Run(`cat /etc/containerd/conf.d/99-nvidia.toml`) | ||
| Expect(err).ToNot(HaveOccurred()) | ||
| Expect(output).To(ContainSubstring(`nvidia`)) | ||
| Expect(output).To(ContainSubstring(`nvidia-cdi`)) | ||
| Expect(output).To(ContainSubstring(`nvidia-legacy`)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned in person, we are nolonger triggering the configuration of containerd with the current installation mechanism.
| output, _, err = nestedContainerRunner.Run(`containerd config dump`) | ||
| Expect(err).ToNot(HaveOccurred()) | ||
| // Verify imports section is in the merged config | ||
| Expect(output).To(ContainSubstring(`imports = ['/etc/containerd/conf.d/*.toml']`)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually think that config dump prints that ACTUAL paths of all files processed.
| ContainSubstring(`default_runtime_name = "nvidia"`), | ||
| ContainSubstring(`default_runtime_name = 'nvidia'`), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should definitely be VERSION specific checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having tests that pass in multiple states are prone to be flaky. Please pick the format for the version of containerd that we have installed by default and use that. This makes the differences between SPECIFIC containerd versions more obvious when reading the tests.
| ContainSubstring(`default_runtime_name = "nvidia"`), | ||
| ContainSubstring(`default_runtime_name = 'nvidia'`), | ||
| )) | ||
| Expect(output).To(ContainSubstring(`enable_cdi = true`)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we toggle this behaviour? It is disabled by default.
| ContainSubstring(`[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]`), | ||
| ContainSubstring(`[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia]`), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once again, thsi should be version-specific.
| }) | ||
|
|
||
| When("containerd already has a custom default runtime configured", func() { | ||
| It("should preserve the existing default runtime when --set-as-default=false is specified", func(ctx context.Context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--set-as-default=false is not specified. It is the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that we don't strictly speaking need a custom runtime to test the behaviour of NOT overriding the default.
| `) | ||
| Expect(err).ToNot(HaveOccurred()) | ||
|
|
||
| // Configure containerd with drop-in config (explicitly not setting as default) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are we configuring with the drop-in config in this case?
| // Verify imports | ||
| if env.configVersion == 2 { | ||
| // containerd 1.7 shows actual resolved imports | ||
| err = validateImports(config, []string{"/etc/containerd/conf.d/99-nvidia.toml"}, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a sideeffect of the config dump command and does not reflect the actual content of the config file. One question I do have is why we're using validateImports here and not for the non v2 case?
| Expect(err).ToNot(HaveOccurred(), "Failed to get plugin config") | ||
|
|
||
| // Verify CDI is enabled | ||
| cdiEnabled, err := getCDIEnabled(pluginConfig, env.configVersion) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the version relevant here?
| containerdSection := pluginConfig.Get("containerd") | ||
| if containerdSection == nil { | ||
| return "", fmt.Errorf("containerd section not found") | ||
| } | ||
|
|
||
| containerdTree, ok := containerdSection.(*toml.Tree) | ||
| if !ok { | ||
| return "", fmt.Errorf("containerd section is not a TOML tree") | ||
| } | ||
|
|
||
| defaultRuntime := containerdTree.Get("default_runtime_name") | ||
| if defaultRuntime == nil { | ||
| return "", nil // No default runtime set | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we use GetPath? or Get("containerd.default_runtime_name")?
| } | ||
|
|
||
| // validateRuntimeConfig validates a specific runtime configuration | ||
| func validateRuntimeConfig(runtime interface{}, expectedType string, expectedOptions map[string]interface{}) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under which conditions do we ever call this with anything other than a toml.Tree?
| } | ||
|
|
||
| // Check runtime type | ||
| runtimeType, ok := runtimeMap["runtime_type"].(string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like something that we can check quite simply at the call site -- especially since we seem to mostly call this with runtimeType == "".
| // For empty expectedType, we don't check the runtime type | ||
| // For nvidia runtime, we accept various runtime types | ||
| if expectedType == "nvidia" { | ||
| validTypes := []string{"io.containerd.runc.v2", "io.containerd.nvidia.v1", "io.containerd.nvidia.v2"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When is this ever io.containerd.nvidia.v1? Where does this list come from?
| if expectedType != "" { | ||
| // For empty expectedType, we don't check the runtime type | ||
| // For nvidia runtime, we accept various runtime types | ||
| if expectedType == "nvidia" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We never seem to call this with expectedType == "nvidia" what is the purpose of this branch?
| return fmt.Errorf("options is not a map or toml.Tree, got %T", v) | ||
| } | ||
|
|
||
| for key, expectedValue := range expectedOptions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use a gomega ContainsElements or ConsistsOf instead?
34346af to
899fec9
Compare
| if !env.hasDefaultImports { | ||
| // For containerd 1.7.x, ensure a config file exists | ||
| // nvidia-ctk should add the imports directive automatically | ||
| _, _, err = nestedContainerRunner.Run(` | ||
| if [ ! -f /etc/containerd/config.toml ]; then | ||
| mkdir -p /etc/containerd | ||
| containerd config default > /etc/containerd/config.toml | ||
| fi`) | ||
| Expect(err).ToNot(HaveOccurred(), "Failed to create default containerd config") | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please explain why this is even required? The nvidia-ctk will create this file if it does not exist. It really should not be needed to create this before the tests.
| // Ensure containerd is running | ||
| _, _, err = nestedContainerRunner.Run(` | ||
| # Start containerd if not running | ||
| if ! systemctl is-active --quiet containerd; then | ||
| systemctl start containerd | ||
| sleep 2 | ||
| fi | ||
| `) | ||
| Expect(err).ToNot(HaveOccurred(), "Failed to ensure containerd is running") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this IN addition to restartContainerdAndWait? Is there a reason that we can't just call that here if required?
| if [ -f /tmp/containerd-config.toml.backup ]; then | ||
| cp /tmp/containerd-config.toml.backup /etc/containerd/config.toml | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we REMOVE /etc/containerd/config.toml if /tmp/containerd-config.toml.backup does not exist?
899fec9 to
f82172f
Compare
| When("configuring containerd", func() { | ||
| It("should add NVIDIA runtime using drop-in config without modifying the main config", func(ctx context.Context) { | ||
| // Configure containerd using nvidia-ctk | ||
| _, _, err := nestedContainerRunner.Run(`nvidia-ctk runtime configure --runtime=containerd --config=/etc/containerd/config.toml --drop-in-config=/etc/containerd/conf.d/99-nvidia.toml --set-as-default --cdi.enabled`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2498628 to
96059e8
Compare
|
|
||
| var additionalContainerArguments []string | ||
|
|
||
| if requiresGPUs { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha! nice, thanks for this Evan.
96059e8 to
3a45cdf
Compare
Signed-off-by: Evan Lezar <[email protected]>
This change makes the logic w.r.t mounting the host toolkit into a nested runner more explicit and adds a flag to indicate whether the runner needs GPU access. Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
| }) | ||
|
|
||
| When("configuring containerd", func() { | ||
| It("should add NVIDIA runtime using drop-in config and validate merged configuration", func(ctx context.Context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the "and validate merged configuration" is what we're doing in the test, not what we expect.
3a45cdf to
241d330
Compare
| _, _, err := nestedContainerRunner.Run(`nvidia-ctk runtime configure --runtime=containerd --set-as-default --cdi.enabled`) | ||
| Expect(err).ToNot(HaveOccurred(), "Failed to configure containerd") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is what determines what comes in the "It" statement.
| Expect(mainConfigContents).To(ContainSubstring("imports")) | ||
| Expect(mainConfigContents).To(ContainSubstring("/etc/containerd/conf.d")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to check this, we should parse the toml.
| imports := config.Get("imports") | ||
| Expect(imports).ToNot(BeNil()) | ||
| importsList, ok := imports.([]interface{}) | ||
| Expect(ok).To(BeTrue()) | ||
| Expect(len(importsList)).To(BeNumerically(">=", 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking for the presence of imports is not sufficient. We need to ensure that we ALSO check that the desired folder (or file) is imported. For example, see the matcher here: https://github.com/ArangoGutierrez/nvidia-container-toolkit/pull/1/files#diff-056e23fcacb20f063e793331c10a8fedec7eedb12c88a3c7e857ad995ace49daR234-R249
(I'm not saying that we need to use a matcher, but that we should check the value properly).
No description provided.