WIP kubeapiserver auditloganalyzer: spot handler panics in audit log #29127


Open
vrutkovs wants to merge 1 commit into main from auditloganalyzer-handler-panics

Conversation

vrutkovs (Member)

Don't let any useragent cause too many panics in apiserver

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 24, 2024
@openshift-ci openshift-ci bot requested review from deads2k and sjenning September 24, 2024 14:58
			Output: "details in audit log",
		},
	},
)
Contributor

Add a success here so that we flake instead of fail to start. We can still search for flakes.
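A minimal, self-contained sketch of that convention (JUnitTestCase and FailureOutput here are stand-ins modeled on the diff context, not the exact origin junitapi types): emitting a passing case under the same name as the failing one makes the harness record a flake instead of a hard failure.

```go
package main

import "fmt"

// Stand-ins for the junitapi types assumed from the surrounding diff;
// the real types live elsewhere in the origin repo.
type FailureOutput struct{ Output string }

type JUnitTestCase struct {
	Name          string
	FailureOutput *FailureOutput // nil means the case passed
}

func main() {
	testName := "apiserver handler panics for a useragent"
	var ret []*JUnitTestCase

	// The failing entry carries the pointer at the details.
	ret = append(ret, &JUnitTestCase{
		Name:          testName,
		FailureOutput: &FailureOutput{Output: "details in audit log"},
	})

	// A passing entry with the identical name makes the harness count the
	// result as a flake instead of a hard failure, so the job keeps
	// running while the flakes stay searchable.
	ret = append(ret, &JUnitTestCase{Name: testName})

	for _, tc := range ret {
		fmt.Printf("%s passed=%v\n", tc.Name, tc.FailureOutput == nil)
	}
}
```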

for _, id := range panics.UnsortedList() {
	ids = append(ids, string(id))
}
failures = append(failures, fmt.Sprintf("failed auditIDs: %s", strings.Join(ids, ",")))
Contributor

we need times for these to track them more effectively.

Member Author

We store the audit-id, so more info, like the starting time, can be extracted later. Do we want times to do more complicated alerting, like "no more than n panics in x minutes"?

Contributor

> Do we want times to do more complicated alerting, like "no more than n panics in x minutes"?

I don't think these are panics. They may look like panics, but that's because the timeout handler prints stacks.

> We store the audit-id, so more info, like the starting time, can be extracted later.

You'd store the entire auditEvent to pull the times out. It's just a lot more useful to see it on the spyglass page and compare it to "what's happening then" than to download an audit log and parse backwards.
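A hedged sketch of that suggestion (auditEvent here is a minimal stand-in, not the real Kubernetes audit API type): keeping the whole event lets the failure message carry the request time, so it can be read off the spyglass page directly.

```go
package main

import (
	"fmt"
	"time"
)

// Minimal stand-in for an audit event; the real audit event carries
// RequestReceivedTimestamp and StageTimestamp among many other fields.
type auditEvent struct {
	AuditID                  string
	RequestReceivedTimestamp time.Time
}

// failureLines renders one line per stored event, so the failure message
// shows when each panic-looking stack happened without anyone having to
// download and grep the audit log.
func failureLines(events []auditEvent) []string {
	var failures []string
	for _, ev := range events {
		failures = append(failures, fmt.Sprintf("auditID %s at %s",
			ev.AuditID, ev.RequestReceivedTimestamp.Format(time.RFC3339)))
	}
	return failures
}

func main() {
	events := []auditEvent{
		{AuditID: "0123abcd", RequestReceivedTimestamp: time.Now()},
	}
	for _, line := range failureLines(events) {
		fmt.Println(line)
	}
}
```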

@vrutkovs vrutkovs force-pushed the auditloganalyzer-handler-panics branch from 5115e08 to 341a9c5 on September 25, 2024 07:46
@openshift-trt-bot

Job Failure Risk Analysis for sha: 341a9c5

Job Name: pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade
Failure Risk: IncompleteTests. Tests for this run (20) are below the historical average (816): not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems.

Job Name: pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade
Failure Risk: IncompleteTests. Tests for this run (190) are below the historical average (837): not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems.

Job Name: pull-ci-openshift-origin-master-e2e-gcp-ovn
Failure Risk: IncompleteTests. Tests for this run (103) are below the historical average (2050): not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems.

@vrutkovs vrutkovs force-pushed the auditloganalyzer-handler-panics branch 3 times, most recently from 56291b5 to 96386f9 on September 25, 2024 15:16
@openshift-trt-bot

Job Failure Risk Analysis for sha: 96386f9

Job Name: pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade
Failure Risk: Medium

[sig-scheduling][Early] The openshift-console console pods [apigroup:console.openshift.io] should be scheduled on different nodes [Suite:openshift/conformance/parallel]
This test has passed 93.41% of 167 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.

[bz-kube-apiserver][invariant] alert/KubeAPIErrorBudgetBurn should not be at or above info
This test has passed 89.82% of 167 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.

Open Bugs
alert/KubeAPIErrorBudgetBurn should not be at or above info

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 7, 2024
Don't let any useragent cause too many panics in apiserver
@vrutkovs vrutkovs force-pushed the auditloganalyzer-handler-panics branch from 96386f9 to 74b2b95 on January 3, 2025 08:03
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 3, 2025
Contributor

openshift-ci bot commented Jan 3, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vrutkovs
Once this PR has been reviewed and has the lgtm label, please assign sosiouxme for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -367,6 +370,37 @@ func (w *auditLogAnalyzer) EvaluateTestsFromConstructedIntervals(ctx context.Con

	ret = append(ret, w.violationChecker.CreateJunits()...)

	for userAgent, userAgentPanics := range w.apiserverPanicChecker.panicEventsPerUserAgent.panicEvents {
		testName := fmt.Sprintf("user %s must not produce too many apiserver handler panics", userAgent)
Contributor

This is generating random test names, infinitely increasing the number of tests in Sippy even as a presubmit: 150 tests so far. We'll clean those up, but this needs to be a megatest; we can't have a test per uncontrolled user input.

Contributor

Example name: user ip-10-0-21-189/ovnkube@038aea1608e0 (linux/amd64) kubernetes/v0.31.1 must not produce too many apiserver handler panics

Member Author

Right, this may not have an authenticated user.
Should we use "too many apiserver handler panics produced" as the test name and move the useragent into the error message?
Perhaps, instead, we could add the username if available and use "unknown" otherwise? The usernames may be infinite too, though.

Contributor

@dgoodwin dgoodwin Jan 24, 2025

It does seem dangerous even if it's just authed users; presumably tests could be making some with random names? Typically we would write a whitelist for known users you're particularly interested in, broken out into their own tests, plus a single catch-all fallback test, or just the single fallback test.
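A minimal sketch of that whitelist-plus-fallback shape (the map contents and the testNameFor helper are hypothetical, not from the PR): known user agents keep their own stable test names, and everything else folds into one catch-all test so uncontrolled input can't mint unbounded test names in Sippy.

```go
package main

import "fmt"

// Hypothetical whitelist of user agents we particularly care about; the
// entries here are made up for illustration.
var knownUserAgents = map[string]bool{
	"kubelet":                 true,
	"kube-controller-manager": true,
}

// testNameFor returns a stable test name: whitelisted agents get their
// own dedicated test, everything else shares one catch-all name.
func testNameFor(userAgent string) string {
	if knownUserAgents[userAgent] {
		return fmt.Sprintf("user %s must not produce too many apiserver handler panics", userAgent)
	}
	// The offending agent goes into the failure message, not the name.
	return "users must not produce too many apiserver handler panics"
}

func main() {
	fmt.Println(testNameFor("kubelet"))
	fmt.Println(testNameFor("ip-10-0-21-189/ovnkube@038aea1608e0"))
}
```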

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 12, 2025
@openshift-merge-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

openshift-ci bot commented May 2, 2025

@vrutkovs: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-ipsec-serial | 96386f9 | link | false | /test e2e-aws-ovn-ipsec-serial |
| ci/prow/e2e-gcp-ovn-builds | 96386f9 | link | true | /test e2e-gcp-ovn-builds |
| ci/prow/e2e-aws-ovn-kube-apiserver-rollout | 74b2b95 | link | false | /test e2e-aws-ovn-kube-apiserver-rollout |
| ci/prow/e2e-aws-ovn-single-node-upgrade | 74b2b95 | link | false | /test e2e-aws-ovn-single-node-upgrade |
| ci/prow/e2e-aws-ovn-single-node-serial | 74b2b95 | link | false | /test e2e-aws-ovn-single-node-serial |
| ci/prow/lint | 74b2b95 | link | true | /test lint |
| ci/prow/e2e-gcp-ovn | 74b2b95 | link | true | /test e2e-gcp-ovn |
| ci/prow/e2e-vsphere-ovn | 74b2b95 | link | true | /test e2e-vsphere-ovn |
| ci/prow/unit | 74b2b95 | link | true | /test unit |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 74b2b95 | link | true | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/verify | 74b2b95 | link | true | /test verify |
| ci/prow/e2e-gcp-ovn-upgrade | 74b2b95 | link | true | /test e2e-gcp-ovn-upgrade |
| ci/prow/e2e-aws-ovn-microshift | 74b2b95 | link | true | /test e2e-aws-ovn-microshift |
| ci/prow/images | 74b2b95 | link | true | /test images |
| ci/prow/verify-deps | 74b2b95 | link | true | /test verify-deps |
| ci/prow/e2e-vsphere-ovn-upi | 74b2b95 | link | true | /test e2e-vsphere-ovn-upi |
| ci/prow/e2e-aws-ovn-edge-zones | 74b2b95 | link | true | /test e2e-aws-ovn-edge-zones |
| ci/prow/okd-scos-images | 74b2b95 | link | true | /test okd-scos-images |
| ci/prow/e2e-aws-ovn-microshift-serial | 74b2b95 | link | true | /test e2e-aws-ovn-microshift-serial |
| ci/prow/e2e-aws-ovn-serial-2of2 | 74b2b95 | link | true | /test e2e-aws-ovn-serial-2of2 |
| ci/prow/e2e-aws-ovn-serial-1of2 | 74b2b95 | link | true | /test e2e-aws-ovn-serial-1of2 |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
