WIP kubeapiserver auditloganalyzer: spot handler panics in audit log #29127


Open
vrutkovs wants to merge 1 commit into main from auditloganalyzer-handler-panics

Conversation

vrutkovs (Member)

Don't let any useragent cause too many panics in apiserver

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 24, 2024
@openshift-ci openshift-ci bot requested review from deads2k and sjenning September 24, 2024 14:58
			Output: "details in audit log",
		},
	},
)
Contributor

Add a success here so that we flake instead of fail to start. We can still search for flakes.
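A minimal, self-contained sketch of that convention (JUnitTestCase and FailureOutput here are stand-ins modeled on the diff context, not the exact origin junitapi types): emitting a passing case under the same name as the failing one makes the harness record a flake instead of a hard failure.

```go
package main

import "fmt"

// Stand-ins for the junitapi types assumed from the surrounding diff;
// the real types live elsewhere in the origin repo.
type FailureOutput struct{ Output string }

type JUnitTestCase struct {
	Name          string
	FailureOutput *FailureOutput // nil means the case passed
}

func main() {
	testName := "apiserver handler panics for a useragent"
	var ret []*JUnitTestCase

	// The failing entry carries the pointer at the details.
	ret = append(ret, &JUnitTestCase{
		Name:          testName,
		FailureOutput: &FailureOutput{Output: "details in audit log"},
	})

	// A passing entry with the identical name makes the harness count the
	// result as a flake instead of a hard failure, so the job keeps
	// running while the flakes stay searchable.
	ret = append(ret, &JUnitTestCase{Name: testName})

	for _, tc := range ret {
		fmt.Printf("%s passed=%v\n", tc.Name, tc.FailureOutput == nil)
	}
}
```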

for _, id := range panics.UnsortedList() {
	ids = append(ids, string(id))
}
failures = append(failures, fmt.Sprintf("failed auditIDs: %s", strings.Join(ids, ",")))
Contributor

we need times for these to track them more effectively.

Member Author

We store the audit-id, so more info, like the starting time, can be extracted later. Do we want times to do more complicated alerting, like "no more than n panics in x minutes"?

Contributor

> Do we want times to do more complicated alerting, like "no more than n panics in x minutes"?

I don't think these are panics. They may look like panics, but that's because the timeout handler prints stacks.

> We store the audit-id, so more info, like the starting time, can be extracted later.

You'd store the entire auditEvent to pull the times out. It's just a lot more useful to see it on the spyglass page and compare it to "what's happening then" than to download an audit log and parse backwards.
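A hedged sketch of that suggestion (auditEvent here is a minimal stand-in, not the real Kubernetes audit API type): keeping the whole event lets the failure message carry the request time, so it can be read off the spyglass page directly.

```go
package main

import (
	"fmt"
	"time"
)

// Minimal stand-in for an audit event; the real audit event carries
// RequestReceivedTimestamp and StageTimestamp among many other fields.
type auditEvent struct {
	AuditID                  string
	RequestReceivedTimestamp time.Time
}

// failureLines renders one line per stored event, so the failure message
// shows when each panic-looking stack happened without anyone having to
// download and grep the audit log.
func failureLines(events []auditEvent) []string {
	var failures []string
	for _, ev := range events {
		failures = append(failures, fmt.Sprintf("auditID %s at %s",
			ev.AuditID, ev.RequestReceivedTimestamp.Format(time.RFC3339)))
	}
	return failures
}

func main() {
	events := []auditEvent{
		{AuditID: "0123abcd", RequestReceivedTimestamp: time.Now()},
	}
	for _, line := range failureLines(events) {
		fmt.Println(line)
	}
}
```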

@vrutkovs vrutkovs force-pushed the auditloganalyzer-handler-panics branch from 5115e08 to 341a9c5 on September 25, 2024 07:46
@openshift-trt-bot

Job Failure Risk Analysis for sha: 341a9c5

Job Name: pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade
Failure Risk: IncompleteTests. Tests for this run (20) are below the historical average (816): not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems.

Job Name: pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade
Failure Risk: IncompleteTests. Tests for this run (190) are below the historical average (837): not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems.

Job Name: pull-ci-openshift-origin-master-e2e-gcp-ovn
Failure Risk: IncompleteTests. Tests for this run (103) are below the historical average (2050): not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems.

@vrutkovs vrutkovs force-pushed the auditloganalyzer-handler-panics branch 3 times, most recently from 56291b5 to 96386f9 on September 25, 2024 15:16
@openshift-trt-bot

Job Failure Risk Analysis for sha: 96386f9

Job Name: pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade
Failure Risk: Medium

[sig-scheduling][Early] The openshift-console console pods [apigroup:console.openshift.io] should be scheduled on different nodes [Suite:openshift/conformance/parallel]
This test has passed 93.41% of 167 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.

[bz-kube-apiserver][invariant] alert/KubeAPIErrorBudgetBurn should not be at or above info
This test has passed 89.82% of 167 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.

Open Bugs
alert/KubeAPIErrorBudgetBurn should not be at or above info

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 7, 2024
Don't let any useragent cause too many panics in apiserver
@vrutkovs vrutkovs force-pushed the auditloganalyzer-handler-panics branch from 96386f9 to 74b2b95 on January 3, 2025 08:03
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 3, 2025
Contributor

openshift-ci bot commented Jan 3, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vrutkovs
Once this PR has been reviewed and has the lgtm label, please assign sosiouxme for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -367,6 +370,37 @@ func (w *auditLogAnalyzer) EvaluateTestsFromConstructedIntervals(ctx context.Con

	ret = append(ret, w.violationChecker.CreateJunits()...)

	for userAgent, userAgentPanics := range w.apiserverPanicChecker.panicEventsPerUserAgent.panicEvents {
		testName := fmt.Sprintf("user %s must not produce too many apiserver handler panics", userAgent)
Contributor

This is generating random test names, infinitely increasing the number of tests in Sippy even as a presubmit: 150 tests so far. We'll clean those up, but this needs to be a megatest; we can't have a test per uncontrolled user input.

Contributor

Example name: user ip-10-0-21-189/ovnkube@038aea1608e0 (linux/amd64) kubernetes/v0.31.1 must not produce too many apiserver handler panics

Member Author

Right, this may not have an authenticated user.
Should we use "too many apiserver handler panics produced" as the test name and move the useragent into the error message?
Perhaps, instead, we could add the username if available and use "unknown" otherwise? The usernames may be infinite too, though.

Contributor

@dgoodwin dgoodwin Jan 24, 2025

It does seem dangerous even if it's just authed users; presumably tests could be making some with random names? Typically we would write a whitelist for known users you're particularly interested in, broken out into their own tests, plus a single catch-all fallback test, or just the single fallback test.
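A minimal sketch of that whitelist-plus-fallback shape (the map contents and the testNameFor helper are hypothetical, not from the PR): known user agents keep their own stable test names, and everything else folds into one catch-all test so uncontrolled input can't mint unbounded test names in Sippy.

```go
package main

import "fmt"

// Hypothetical whitelist of user agents we particularly care about; the
// entries here are made up for illustration.
var knownUserAgents = map[string]bool{
	"kubelet":                 true,
	"kube-controller-manager": true,
}

// testNameFor returns a stable test name: whitelisted agents get their
// own dedicated test, everything else shares one catch-all name.
func testNameFor(userAgent string) string {
	if knownUserAgents[userAgent] {
		return fmt.Sprintf("user %s must not produce too many apiserver handler panics", userAgent)
	}
	// The offending agent goes into the failure message, not the name.
	return "users must not produce too many apiserver handler panics"
}

func main() {
	fmt.Println(testNameFor("kubelet"))
	fmt.Println(testNameFor("ip-10-0-21-189/ovnkube@038aea1608e0"))
}
```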

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 12, 2025
@openshift-merge-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

openshift-ci bot commented May 2, 2025

@vrutkovs: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-ipsec-serial | 96386f9 | link | false | /test e2e-aws-ovn-ipsec-serial |
| ci/prow/e2e-gcp-ovn-builds | 96386f9 | link | true | /test e2e-gcp-ovn-builds |
| ci/prow/e2e-aws-ovn-kube-apiserver-rollout | 74b2b95 | link | false | /test e2e-aws-ovn-kube-apiserver-rollout |
| ci/prow/e2e-aws-ovn-single-node-upgrade | 74b2b95 | link | false | /test e2e-aws-ovn-single-node-upgrade |
| ci/prow/e2e-aws-ovn-single-node-serial | 74b2b95 | link | false | /test e2e-aws-ovn-single-node-serial |
| ci/prow/lint | 74b2b95 | link | true | /test lint |
| ci/prow/e2e-gcp-ovn | 74b2b95 | link | true | /test e2e-gcp-ovn |
| ci/prow/e2e-vsphere-ovn | 74b2b95 | link | true | /test e2e-vsphere-ovn |
| ci/prow/unit | 74b2b95 | link | true | /test unit |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 74b2b95 | link | true | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/verify | 74b2b95 | link | true | /test verify |
| ci/prow/e2e-gcp-ovn-upgrade | 74b2b95 | link | true | /test e2e-gcp-ovn-upgrade |
| ci/prow/e2e-aws-ovn-microshift | 74b2b95 | link | true | /test e2e-aws-ovn-microshift |
| ci/prow/images | 74b2b95 | link | true | /test images |
| ci/prow/verify-deps | 74b2b95 | link | true | /test verify-deps |
| ci/prow/e2e-vsphere-ovn-upi | 74b2b95 | link | true | /test e2e-vsphere-ovn-upi |
| ci/prow/e2e-aws-ovn-edge-zones | 74b2b95 | link | true | /test e2e-aws-ovn-edge-zones |
| ci/prow/okd-scos-images | 74b2b95 | link | true | /test okd-scos-images |
| ci/prow/e2e-aws-ovn-microshift-serial | 74b2b95 | link | true | /test e2e-aws-ovn-microshift-serial |
| ci/prow/e2e-aws-ovn-serial-2of2 | 74b2b95 | link | true | /test e2e-aws-ovn-serial-2of2 |
| ci/prow/e2e-aws-ovn-serial-1of2 | 74b2b95 | link | true | /test e2e-aws-ovn-serial-1of2 |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
