Parallel download testsuit refactor#1235
Conversation
|
Skipping CI for Draft Pull Request. |
|
@yaozile123: The label(s) DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request refactors the GKE CSI Integration tests to dynamically generate test cases from the upstream GCSFuse Highlights
Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a significant refactoring of the GKE CSI Integration tests for GCSFuse, moving from hardcoded test suites to dynamically generating them from the upstream test_config.yaml. This is a great step towards reducing maintenance overhead and ensuring test synchronization. My review focuses on the robustness of the new parsing and test generation logic. I've identified a critical issue in the flag parsing logic that could lead to test failures, and a medium-severity issue related to string manipulation of mount options. Addressing these will make the new test framework more reliable.
test/e2e/testsuites/gcsfuse_integration_file_cache_parallel_downloads.go
Show resolved
Hide resolved
a111baf to
8138dfb
Compare
|
/gemini summary |
Summary of ChangesThis pull request significantly refactors the GKE CSI Integration test framework by introducing dynamic test generation. It addresses the historical issue of duplicated test suites between GCSFuse and the GKE CSI driver teams, which led to synchronization delays. By fetching and parsing the canonical Highlights
Changelog
Activity
|
|
/gemini review |
There was a problem hiding this comment.
Code Review
This PR is a significant and well-thought-out refactoring of the e2e test suite infrastructure. By dynamically generating tests from the upstream test_config.yaml, it effectively eliminates test duplication and ensures synchronization with the canonical GCSFuse tests. The changes, including moving the GCSFuse version fetching, introducing YAML data structures, and implementing a backward compatibility fallback, are robust and set a strong foundation for future test migrations. I've included a few suggestions to improve efficiency, correctness, and thread-safety in the tests.
8138dfb to
379a523
Compare
|
@yaozile123 I see you have tested by running: Can you please also test other cases so we know this won't break any of our testgrids? You don't need to run entire testsuite , just the relevant tests
|
|
|
||
| func runCommand(action string, cmd *exec.Cmd) error { | ||
| cmd.Stdout = os.Stdout | ||
| cmd.Stdout = ginkgo.GinkgoWriter |
There was a problem hiding this comment.
I see PR description says
"Prevented Ginkgo's output interceptor from getting stuck during parallel execution by updating cmd.Stdout and cmd.Stderr to ginkgo.GinkgoWriter."
Just curious, what do you mean it was getting stuck? which change caused that? I don't remember us having that issue on the testgrids
There was a problem hiding this comment.
Here is the message I got from ginkgo when we were using os.Stdout:
When running in parallel, Ginkgo captures stdout and stderr output
and attaches it to the running spec. It looks like that process is getting
stuck for this suite.
This usually happens if you, or a library you are using, spin up an external
process and set cmd.Stdout = os.Stdout and/or cmd.Stderr = os.Stderr. This
causes the external process to keep Ginkgo's output interceptor pipe open and
causes output interception to hang.
Ginkgo has detected this and shortcircuited the capture process. The specs
will continue running after this message however output from the external
process that caused this issue will not be captured.
You have several options to fix this. In preferred order they are:
1. Pass GinkgoWriter instead of os.Stdout or os.Stderr to your process.
2. Ensure your process exits before the current spec completes. If your
process is long-lived and must cross spec boundaries, this option won't
work for you.
3. Pause Ginkgo's output interceptor before starting your process and then
resume it after. Use PauseOutputInterception() and ResumeOutputInterception()
to do this.
4. Set --output-interceptor-mode=none when running your Ginkgo suite. This will
turn off all output interception but allow specs to run in parallel without this
issue. You may miss important output if you do this including output from Go's
race detector.
There was a problem hiding this comment.
I see thanks for explaining. Do the test logs still look the same as the current test output on the testgrids? I just want to make sure this doesn't make it harder to debug failing tests in any way
There was a problem hiding this comment.
Yes, the logs look the same
test/e2e/utils/utils.go
Outdated
| // It removed the leading '-' from each flag and appends the flag to MountOptions. | ||
| func ParseConfigFlags(flagStr string) ParsedConfig { | ||
| parsed := ParsedConfig{ | ||
| FileCacheCapacity: "50Mi", |
There was a problem hiding this comment.
Why are we hardcoding FileCacheCapacity?
There was a problem hiding this comment.
I added that as a default fallback value just in case the file-cache-max-size-mb flag is missing.
There was a problem hiding this comment.
I don't think we should be hardcoding anything as that will change what the test does. Setting file-cache-max-size-mb will enable the file cache, which will change what the test does. https://docs.cloud.google.com/storage/docs/cloud-storage-fuse/cli-options#file-cache-max-size-mb
When is file-cache-max-size-mb be missing when it should be set?
There was a problem hiding this comment.
I saw the test_config.yaml didn't set file-cache-max-size-mb in some of the read_cache test_cases
test/e2e/utils/utils.go
Outdated
| continue | ||
| } | ||
|
|
||
| // TODO: Remove this block after boolean flags formatting bug is fixed. |
There was a problem hiding this comment.
@uriel-guzman Please address this once your PR is merged: #1231
| return v, fmt.Sprintf(gcsfuseReleaseBranchFormat, v.Major(), v.Minor(), v.Patch()) | ||
| } | ||
|
|
||
| type TestPackage struct { |
There was a problem hiding this comment.
In https://raw.githubusercontent.com/GoogleCloudPlatform/gcsfuse/v3.7.1/tools/integration_tests/test_config.yaml, I see rapid_appends test package defines a secondary_flags property under its configs list for tests that use dual mounts. Are we going to do that in a follow up PR? If so, how will you handle it in TestPackage? Will it be another field like SecondaryFlags?
Do we expect GCSFuse team to add new Flags like secondary_flags in the future? If so, how can we make sure that we don't need to do manual changes each time? Or how can we make the manual changes each time as easy as possible?
There was a problem hiding this comment.
I've added the secondary_flags field to the struct. Currently the yaml parser only processes fields defined in the struct and ignores others without failing, but behavioral changes (like secondary mounts) will still require manual updates as GCSFuse evolves.
There was a problem hiding this comment.
So if there any any changes in the GCS Fuse repo config struct will it fail on our end as we haven't made equivalent changes in our struct?
There was a problem hiding this comment.
No it won't failed, the yaml parser will neglect any field that are not defined in our struct.
There was a problem hiding this comment.
What is the plan to add tests with mounted_directory_secondary? I don't see this being used anywhere? Is this for the append tests?
There was a problem hiding this comment.
The mounted_directory_secondary will be used at test package rapid append.It will be included during the flat gcsfuse_integration testsuit refactoring.
6b674aa to
a52eb74
Compare
|
|
||
| // Filter out duplicate logging:severity:info from testdriver set up | ||
| mo := l.volumeResource.VolSource.CSI.VolumeAttributes["mountOptions"] | ||
| mo = strings.ReplaceAll(mo, "logging:severity:info", fmt.Sprintf("logging:severity:%v", config.LogSeverity)) |
There was a problem hiding this comment.
Can we update testdriver setup to not add these duplicates?
There was a problem hiding this comment.
The default logging:severity is info and the gcsfuse team added their own logging:severity to the test config YAML. We only gonna see duplicates at gcsfuse integration testsuits.
| } | ||
| // The YAML parser treats test package as a list because of the '-' syntax. | ||
| // But there is only one configuration item under each package, so we take the first element. | ||
| pkg := pkgList[0] |
There was a problem hiding this comment.
Can we check with GCS Fuse team if they are willing to fix this?
There was a problem hiding this comment.
Let me check with them. A map-based structure for test pakcage would be cleaner, but on the gcsfuse side they currently expect lists. Transitioning would require a significant refactor of the upstream integration test framework.
5fe37f2 to
94e9e2e
Compare
|
|
||
| BINDIR ?= $(shell pwd)/bin | ||
|
|
||
| ifeq ($(ENABLE_ZB), true) |
There was a problem hiding this comment.
I see we set the variable while launching the internal testgrid, do we need this here?
There was a problem hiding this comment.
We don't need this for internal testgrid but just for manual testing.
|
|
||
| // disallowedFlagsMapping maps the disallowed flags to their config file representation. | ||
| // See: https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/blob/585b8addb42335e0742be7059fd6570c78b62bc6/pkg/sidecar_mounter/sidecar_mounter_config.go#L83 | ||
| // Note: If you add more disallowed flags in sidecar_mounter_config.go or use new ones in gcsfuse tests, |
There was a problem hiding this comment.
Please add this same note to here
as well.There was a problem hiding this comment.
I would also prefer we use the same ma in both cases so we don't accidentially update one and not the other
There was a problem hiding this comment.
I have updated the note and will switch to use the same map in the following pr for file cache testsuit refactoring.
test/e2e/utils/utils.go
Outdated
| // They must be handled specifically: either parsed manually to configure the test pod, | ||
| // or translated into the GCSFuse config file representation (x:y). |
There was a problem hiding this comment.
What does "parsed manually to configure the test pod" mean?
And can you please explain why "translated into the GCSFuse config file representation (x:y)" is a workaround. I think is because the sidecar disallowed flag map only allows the CLI format flag, not the config file flag, so we try to work around it in the testsuite by using the config file format instead. Is that correct? If so, then how does this help for the flags that are already in config file format, like cache-dir.
From my other comment, #1235 (comment), I'm wondering what happens when GCSFuse adds a new test that uses one of these flags. How can we prevent that in the first place, or make sure it doesn't require a manual update? If we can't, then I want to understand what cases will require a manual update, so we can document it somewhere as mentioned here
There was a problem hiding this comment.
Yes that's correct. For cache-dir the sidecar-mounter will automatically populate the cache-dir in its config file map when file cache is enabled.
There was a problem hiding this comment.
If GCSFuse adds a new test that uses a disallowed flag not yet handled in our specific cases, the flag will be swallowed in the default block of ParseConfigFlags. We've added a klog to this block to surface these instances. Let me write a short doc about this.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: amacaskill, yaozile123 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
94e9e2e to
cf9f88a
Compare
|
New changes are detected. LGTM label has been removed. |
60f7b41 to
1db06e8
Compare
|
We noticed that some read cache test cases do not set Currently, |
1db06e8 to
0d43013
Compare
0d43013 to
667d926
Compare
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR introduces the foundational infrastructure to dynamically generate GKE CSI Integration tests directly from the upstream GCSFuse repository's canonical source of truth (
test_config.yaml).Historically, GCSFuse and GKE CSI driver teams maintained duplicated test suites, leading to synchronization delays and engineering toil. This migration resolves that by fetching and parsing the native
test_config.yamlto dynamically build the Ginkgo test tree, ensuring both teams are always testing the exact same configurations.Specifically, this PR:
SyncBeforeTestSuits. However, this was executed after Ginkgo had already generated the test cases. Since we now need the GCSFuse version first in order to fetchtest_config.yamland dynamically generate the test tree, the version fetching logic was moved tohandler.goto run before Ginkgo starts. The fetched version is then exported as an environment variable for test suites to use.TestPackage,TestConfig, andTestBucketTypeto correctly model the upstream YAML.LoadTestConfig()andParseTestConfig()to fetch the configuration securely during the test initialization hook (prior to Ginkgo tree construction).ParseConfigFlags,ExtractOnlyDirFromMountOptions) to safely extract and map raw GCSFuse arguments into CSI driver mount options and capabilities.gcsfuseIntegrationFileCacheTestNew) to begin generating dynamicginkgo.It()blocks based on the parsed config.test_config.yamlbased test execution. Currently thetest_config.yamlis only support in GCSFuse v3.7+.ginkgo.GinkgoWriter.Which issue(s) this PR fixes:
Fixes b/483387111
Fixes b/483387489
Fixes b/483387565
Fixed b/483388635
Special notes for your reviewer:
This is the first major step in migrating our E2E framework away from duplicated
ginkgo.It()blocks. The primary focus here is establishing the robust parsing and translation layer fortest_config.yaml. Subsequent PRs will focus on fully migrating specific test suites (like File Cache test suit and regular GCSFuse test suit) to utilize this new dynamic framework. Although this PR is large, it provides the complete implementation required to establish the new dynamic infrastructure and test it end-to-end.Tested on standard cluster with GKE version 1.35.0-gke.3047002, 18 e2-standard-4 non-managed driver
Tested on standard cluster with GKE version 1.35.0-gke.3047002, 10 e2-standard-4 managed driver
Tested on standard cluster with GKE version 1.34.4-gke.1047000, GCSFuse Version 3.5.6-gke.0
10 e2-standard-4 managed driver
Tested on AP cluster with GKE version 1.35.0-gke.3047002
Does this PR introduce a user-facing change?: