daemon: Dump systemctl status rpm-ostreed on load failure #2642

cgwalters · 2021-06-25T21:09:18Z

daemon: Rename os variable to hostos

Prep for a future patch; the variable name `os` conflicts with the standard Go `os` package.

daemon: Dump systemctl status rpm-ostreed on load failure

This is yet another patch that should help debug situations like
https://bugzilla.redhat.com/show_bug.cgi?id=1958812
in the future.

I also have coreos/rpm-ostree#2932
cooking, but it's going to take a while to cycle down.


sinnykumari · 2021-06-28T15:26:53Z

pkg/daemon/daemon.go

```diff
 		osImageURL, osVersion, err = nodeUpdaterClient.GetBootedOSImageURL()
 		if err != nil {
+			// If this fails for some reason, let's dump the unit status
+			// into our logs to aid future debugging.
+			cmd := exec.Command("systemctl", "status", "rpm-ostreed")
```

Will printing the status capture enough error logs, or should we print the rpm-ostreed journal log from the current boot here?

cgwalters · 2021-06-28

Status will capture the last few lines of logs at the point in time this happens, which I think will be enough.

openshift/must-gather#244 will help for the general case.
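
A minimal, self-contained sketch of the pattern discussed in this thread follows. It is an assumption-laden illustration, not the code merged in this PR: `getBootedOSImageURL`, `unitStatus`, and `loadBootedImage` are hypothetical names, and the failing call is simulated. The only external command used is `systemctl status rpm-ostreed`, as in the hunk above.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// getBootedOSImageURL is a hypothetical stand-in for the real client call;
// here it always fails so that the error path is exercised.
func getBootedOSImageURL() (string, error) {
	return "", errors.New("simulated rpm-ostree failure")
}

// unitStatus returns the combined stdout/stderr of `systemctl status <unit>`.
// systemctl exits non-zero for inactive or failed units, so the output is
// returned even when err is non-nil.
func unitStatus(unit string) (string, error) {
	out, err := exec.Command("systemctl", "status", unit).CombinedOutput()
	return string(out), err
}

// loadBootedImage dumps the unit status into the log when the lookup fails.
// The status function is injected so the flow can be tried without systemd.
func loadBootedImage(status func(string) (string, error)) (string, error) {
	url, err := getBootedOSImageURL()
	if err != nil {
		out, statusErr := status("rpm-ostreed")
		if statusErr != nil {
			fmt.Printf("systemctl status returned: %v\n", statusErr)
		}
		fmt.Printf("rpm-ostreed status:\n%s\n", out)
		return "", fmt.Errorf("reading booted OS image: %w", err)
	}
	return url, nil
}

func main() {
	if _, err := loadBootedImage(unitStatus); err != nil {
		fmt.Println("error:", err)
	}
}
```

Because the status output is captured at the moment of failure, it contains the last few journal lines for the unit at that point in time, which is the property the reply above relies on.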

sinnykumari

/lgtm

openshift-ci · 2021-06-28T15:56:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, sinnykumari

Needs approval from an approver in each of these files:

~~OWNERS~~ [sinnykumari]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kikisdeliveryservice · 2021-06-28T17:59:51Z

/test e2e-agnostic-upgrade

openshift-bot · 2021-06-28T18:07:15Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-06-28T18:19:17Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci · 2021-06-28T20:05:29Z

@cgwalters: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-disruptive | `1de3aa9` | link | `/test e2e-aws-disruptive` |
| ci/prow/okd-e2e-aws | `1de3aa9` | link | `/test okd-e2e-aws` |
| ci/prow/e2e-metal-ipi | `1de3aa9` | link | `/test e2e-metal-ipi` |
| ci/prow/e2e-vsphere-upgrade | `1de3aa9` | link | `/test e2e-vsphere-upgrade` |


This is a hackier alternative to coreos#2932 that we can ship immediately because it doesn't block on SELinux policy.

For historical reasons, the daemon ends up doing a lot of initialization before even claiming the DBus name. For example, it calls `ostree_sysroot_load()`. We also end up scanning the RPM database and parsing all the GPG keys in `/etc/pki/rpm-gpg`, etc.

This causes two related problems:

- By doing all this work before claiming the bus name, we race against the (pretty low) DBus service timeout of 25s.
- If something hard fails at initialization, the client can't easily see the error because it only appears in the systemd journal. The client will just see a service timeout.

By explicitly using `systemctl start rpm-ostreed.service`, systemd does all of the error checking for us without involving `dbus-broker` as a middleman. Further, by using `systemctl status rpm-ostreed.service` on failure, we reuse systemd's nice rendering of the status of the unit instead of reinventing our own.

This PR effectively replicates openshift/machine-config-operator#2642 in our code.

cgwalters · 2021-06-29T21:35:28Z

Just to xref: coreos/rpm-ostree#2945 is driving this down into rpm-ostree, so eventually we'll want to remove this code. Shipping this PR involved literally nothing more than me submitting it and having it merge, whereas changing rpm-ostree in RHEL will take much longer and involve a lot more exciting paperwork.


cgwalters added 2 commits June 25, 2021 17:05

daemon: Rename os variable to hostos (98ab043)

daemon: Dump systemctl status rpm-ostreed on load failure (1de3aa9)

openshift-ci bot requested review from kikisdeliveryservice and sinnykumari June 25, 2021 21:09

sinnykumari reviewed Jun 28, 2021

sinnykumari approved these changes Jun 28, 2021

openshift-ci bot assigned sinnykumari Jun 28, 2021

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 28, 2021

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 28, 2021

openshift-merge-robot merged commit 2bbe717 into openshift:master Jun 28, 2021

cgwalters mentioned this pull request Jun 29, 2021

client: Explicitly systemctl start rpm-ostreed if root, dump status coreos/rpm-ostree#2945

Merged
