[WIP] Add test/integration #3372

Open
wants to merge 13 commits into
base: main
Conversation

ntnn
Member

@ntnn ntnn commented Apr 14, 2025

Summary

Adding a framework for integration tests.

Startup takes a long time - I am contemplating adding support for a shared embedded etcd (or an external etcd like in kube) if that significantly improves startup.

What Type of PR Is This?

/kind feature

Related Issue(s)

Release Notes

NONE

@kcp-ci-bot kcp-ci-bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has signed the DCO. kind/feature Categorizes issue or PR as related to a new feature. labels Apr 14, 2025
@kcp-ci-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign clubanderson for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kcp-ci-bot kcp-ci-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 14, 2025
@kcp-ci-bot
Contributor

Hi @ntnn. Thanks for your PR.

I'm waiting for a kcp-dev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kcp-ci-bot kcp-ci-bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 14, 2025
Comment on lines +46 to 50
func NewControllers() *Controllers {
	kcmDefaults, err := kcmoptions.NewKubeControllerManagerOptions()
	if err != nil {
		panic(err)
	}
Member Author

I'd prefer to bubble up the error, but that would require adding error handling to every NewOptions that calls it:

Controllers: *NewControllers(),

Server: *serveroptions.NewOptions(rootDir),

kcpOptions := options.NewOptions(rootDir)

Which would be a good change - but would also touch a lot of the codebase.

Member Author

I'd rather do that in a separate PR tbh.

Contributor

Panic here is not that bad. It's at the point of startup config validation, so it's quick feedback. Not something happening at runtime 🤷
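
For illustration, bubbling the error up as discussed in this thread could look roughly like the sketch below. All types are hypothetical stand-ins, not the kcp or Kubernetes ones: `loadDefaults` plays the role of `kcmoptions.NewKubeControllerManagerOptions`, and the `fail` parameter exists only so the example can exercise the error path.

```go
package main

import (
	"errors"
	"fmt"
)

// kcmDefaults is a hypothetical stand-in for the controller manager defaults.
type kcmDefaults struct{ concurrentSyncs int }

// loadDefaults is a hypothetical stand-in for a defaults constructor that can fail.
func loadDefaults(fail bool) (*kcmDefaults, error) {
	if fail {
		return nil, errors.New("loading defaults failed")
	}
	return &kcmDefaults{concurrentSyncs: 5}, nil
}

// Controllers holds its own defaults; nothing is shared at package level.
type Controllers struct{ defaults *kcmDefaults }

// NewControllers returns the error to its caller instead of panicking.
func NewControllers(fail bool) (*Controllers, error) {
	d, err := loadDefaults(fail)
	if err != nil {
		return nil, fmt.Errorf("creating controller options: %w", err)
	}
	return &Controllers{defaults: d}, nil
}

// ServerOptions embeds Controllers by value, like the call sites quoted above.
type ServerOptions struct{ Controllers Controllers }

// NewServerOptions shows the cost of the change: every constructor in the
// chain needs an error return as well.
func NewServerOptions(fail bool) (*ServerOptions, error) {
	c, err := NewControllers(fail)
	if err != nil {
		return nil, fmt.Errorf("creating server options: %w", err)
	}
	return &ServerOptions{Controllers: *c}, nil
}

func main() {
	if _, err := NewServerOptions(true); err != nil {
		fmt.Println("startup aborted:", err)
	}
}
```

The sketch makes the trade-off in the thread visible: the error path is cleaner, but each constructor between the failing call and `main` changes its signature.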

@ntnn ntnn marked this pull request as ready for review April 14, 2025 08:32
@kcp-ci-bot kcp-ci-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 14, 2025
@ntnn
Member Author

ntnn commented Apr 14, 2025

Tests fail when run with race detection. I think kube also doesn't race-test its integration tests.

@embik
Member

embik commented Apr 14, 2025

/ok-to-test

@kcp-ci-bot kcp-ci-bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 14, 2025
@ntnn ntnn force-pushed the test-integration branch 2 times, most recently from 8abda95 to f7cb942 Compare April 14, 2025 11:33
@ntnn
Member Author

ntnn commented Apr 14, 2025

/test pull-kcp-test-unit

@ntnn
Member Author

ntnn commented Apr 14, 2025

/test pull-kcp-test-e2e-shared

@embik
Member

embik commented Apr 15, 2025

> Startup takes a long time - I am contemplating adding support for a shared embedded etcd (or an external etcd like in kube) if that significantly improves startup.

This is kind of a known problem with starting kcp the first time, so we should probably tackle this in kcp itself, not the integration test framework.

@ntnn
Member Author

ntnn commented Apr 15, 2025

Yeah, I just measured the startup times - etcd takes less than a second, kcp phase1 takes roughly 22s. And using a shared kcp for the integration tests would make less sense imho - if the kcp instance could be shared without tests affecting each other, they could likely also be e2e tests.

@ntnn
Member Author

ntnn commented Apr 15, 2025

Running with race detection works. Root cause was that this method could be called from multiple goroutines if no DialContext is set:

// DefaultTransportWrapper wraps the provided roundtripper with default settings.
func DefaultTransportWrapper(rt http.RoundTripper) http.RoundTripper {
	tr, err := transportFor(rt)
	if err != nil {
		klog.FromContext(context.Background()).Error(err, "Cannot set timeout settings on roundtripper")
		return rt
	}
	tr.DialContext = DefaultDialContext()
	return rt
}

This originates from the server config here:

kcp/pkg/server/config.go

Lines 208 to 210 in a2014d6

// break connections on the tcp layer. Setting the client timeout would
// also apply to watches, which we don't want.
c.GenericConfig.LoopbackClientConfig.Wrap(network.DefaultTransportWrapper)

So either the wrapper function is wrapped with a mutex to prevent concurrent writes or the wrapper function acts atomically.
I don't think the GenericAPIServer can be persuaded to call this initially.

I don't particularly like either option. Another option could be to just remove that config - TCP timeout should be 10m anyhow iirc. Scratch that, I read 25 * time.Minute - but it's 25 * time.Second:

func dialerWithDefaultOptions() DialContext {
	nd := &net.Dialer{
		// TCP_USER_TIMEOUT does affect the behaviour of connect() which is controlled by this field so we set it to the same value
		Timeout: 25 * time.Second,
	}
	return wrapDialContext(nd.DialContext)
}

@ntnn
Member Author

ntnn commented Apr 15, 2025

Technically it might be possible to set .Transport on the GenericAPIConfig with a transport that already has the TCP timeout set.

Maybe there was a good reason not to use .Transport and instead to define a wrapper:
9574eef#diff-d84eac55294d8f540a660bfcb4043f7c4e1f76c8bccbb91c129e8e4992dbafe7R216

However I can't discern a reason from reading the documentation - it even reads more like setting the transport would be the right approach to setting the TCP timeout:

https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/rest/config.go#L104-L107

	// Transport may be used for custom HTTP behavior. This attribute may not
	// be specified with the TLS client certificate options. Use WrapTransport
	// to provide additional per-server middleware behavior.
	Transport http.RoundTripper

After attempting to set this attribute instead of wrapping, I'd guess wrapping was used because setting a transport means building it the way the library expects, rather than letting the library provide a working transport and just adjusting the timeouts.

@ntnn
Member Author

ntnn commented Apr 15, 2025

/test pull-kcp-test-e2e-multiple-runs

@ntnn
Member Author

ntnn commented Apr 15, 2025

Found out why .Transport isn't used. When it is set, this if statement takes effect:

https://github.com/kubernetes/kubernetes/blob/30469e180361d7da07b0fee6d47c776fa2cf3e86/staging/src/k8s.io/client-go/transport/transport.go#L34-L40

// New returns an http.RoundTripper that will provide the authentication
// or transport level security defined by the provided Config.
func New(config *Config) (http.RoundTripper, error) {
	// Set transport level security
	if config.Transport != nil && (config.HasCA() || config.HasCertAuth() || config.HasCertCallback() || config.TLS.Insecure) {
		return nil, fmt.Errorf("using a custom transport with TLS certificate options or the insecure flag is not allowed")
	}

So the TLS configuration would have to be removed, which then causes connections to fail due to the missing TLS configuration:

E0415 14:28:52.537341   14738 storage_rbac.go:191] "Unhandled Error" err="unable to initialize clusterroles: Get \"https://127.0.0.1:57059/clusters/system:admin/apis/rbac.authorization.k8s.io/v1/clusterroles\": tls: failed to verify certificate: x509: certificate signed by unknown authority" logger="UnhandledError"
F0415 14:28:52.538522   14738 hooks.go:210] PostStartHook "kcp-bootstrap-policy" failed: unable to initialize roles: timed out waiting for the condition
FAIL    github.com/kcp-dev/kcp/test/integration/framework       39.978s

a07f610#diff-d84eac55294d8f540a660bfcb4043f7c4e1f76c8bccbb91c129e8e4992dbafe7R210

Technically a bug, but one that was introduced more than 7 years ago. For a fix on the kcp side we'll have to go with one of the workarounds for now.

@ntnn ntnn force-pushed the test-integration branch from 8160045 to 77e5f92 Compare April 23, 2025 07:49
@sttts
Member

sttts commented Apr 23, 2025

Very nice!

About startup, I think we need somebody to debug our startup process. I think there must be some polling that is slow. At some point, we split the kcp core informers into one part that does not need identities (system CRDs) and those resources based on APIExports (which need identities). And there is logic to bootstrap in a multi-shard env. Something in this area must be far too slow. Maybe some polling with too long an interval.

@ntnn ntnn changed the title Add test/integration [WIP] Add test/integration Apr 26, 2025
@kcp-ci-bot kcp-ci-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 26, 2025
@ntnn
Member Author

ntnn commented Apr 26, 2025

Marking as WIP as I want to change how the server is started

ntnn added 3 commits May 8, 2025 23:49
When multiple kcp instances are started in the same process they share
the same kcmDefaults, which is initialized in the `init` function.

Signed-off-by: Nelo-T. Wallus <[email protected]>
Signed-off-by: Nelo-T. Wallus <[email protected]>
@ntnn ntnn force-pushed the test-integration branch from 77e5f92 to f622196 Compare May 8, 2025 23:07
ntnn added 9 commits May 9, 2025 01:09
Not sure why this breaks only when using gotestsum; without it, e2e and
integration are filtered out correctly with extended grep. With it, the
extended grep fails(?!) and lists e2e and integration.

Anyhow - using separate simple expressions works for both.

Signed-off-by: Nelo-T. Wallus <[email protected]>
K8s v1.32.3 introduced new races.

Signed-off-by: Nelo-T. Wallus <[email protected]>
Signed-off-by: Nelo-T. Wallus <[email protected]>
@ntnn ntnn force-pushed the test-integration branch from f622196 to 10933f8 Compare May 8, 2025 23:09
Signed-off-by: Nelo-T. Wallus <[email protected]>
@kcp-ci-bot
Contributor

@ntnn: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kcp-lint fa505d8 link true /test pull-kcp-lint
pull-kcp-test-integration fa505d8 link true /test pull-kcp-test-integration
pull-kcp-test-e2e-multiple-runs fa505d8 link true /test pull-kcp-test-e2e-multiple-runs

Full PR test history


Labels
dco-signoff: yes Indicates the PR's author has signed the DCO. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

5 participants